alvawalt committed on
Commit cff0942 · verified · 1 Parent(s): f8c5350

Update README.md

  1. README.md +129 -30
README.md CHANGED
@@ -1,44 +1,143 @@
  ---
- language:
- - en
-
  license: mit
  tags:
- - biology
- - genomics
- - transcriptomics
  - cancer
- - bulk-rnaseq
- - foundation-model
- - masked-reconstruction
  - performer
  - gcn
  - pytorch
-
- library_name: pytorch
  pipeline_tag: feature-extraction

- model_name: CancerTranscriptome-Mini-48M
- model_type: transformer

- datasets:
- - ARCHS4

- papers:
- - https://doi.org/10.1101/2025.06.11.659222

- authors:
- - name: Walter Alvarado
-   affiliation: NASA Ames Research Center
-   github: https://github.com/alwalt

- model_size:
-   total_params: 48336162

- description: >
-   CancerTranscriptome-Mini-48M is a small, proof-of-concept BulkFormer-inspired model
-   trained on cancer-only bulk RNA-seq (ARCHS4, TCGA, GEO). It integrates ESM2 gene
-   identity embeddings, Rotary Expression Embeddings (REE), GCN message passing, local
-   bin-based Performer attention, and global Performer attention. This model is designed
-   as a research prototype showing that BulkFormer-like architectures can be trained and
-   used end-to-end on a single consumer GPU.
 
  ---
  license: mit
  tags:
+ - rna-seq
+ - bulk-rna
  - cancer
+ - transcriptomics
+ - graph-neural-network
+ - transformer
  - performer
  - gcn
+ - foundation-model
  - pytorch
+ model_size: 48M
  pipeline_tag: feature-extraction
+ library_name: pytorch
+ ---
+
+ # 🧬 CancerTranscriptome-Mini-48M
+ *A compact, cancer-focused BulkFormer-style encoder for bulk RNA-seq*
+
+ **CancerTranscriptome-Mini-48M** is a lightweight derivative of **BulkFormer**, designed to learn cancer-specific transcriptomic structure from large-scale bulk RNA-seq.
+ It combines **GCN-based gene graph propagation**, **Rotary Expression Embeddings (REE)**, **local bin-wise Performer attention**, and **global Performer attention** into a single unified encoder.
+
+ This model is a proof-of-concept designed for research, experimentation, and rapid iteration on BulkFormer-style architectures applied to cancer transcriptomes.
+
+ ---
+
+ ## 🔬 Origin & References
+
+ ### **Primary Reference (BulkFormer)**
+ Boming Kang, Rui Fan, Meizheng Yi, Chunmei Cui, Qinghua Cui.
+ **“A large-scale foundation model for bulk transcriptomes.”**
+ bioRxiv (2025).
+ doi: https://doi.org/10.1101/2025.06.11.659222
+
+ ### **This Model (CancerTranscriptome-Mini-48M)**
+ A compact re-implementation based on the BulkFormer architecture, adapted for cancer-only bulk RNA-seq and simplified for accessibility and compute efficiency.
+ Source Code: https://github.com/alwalt/BioFM
+
+ ---
+
+ # 📊 Data Source
+
+ All training samples originate from the **ARCHS4 Human RNA-seq v2.5** public repository:
+
+ **ARCHS4 Reference:**
+ Lachmann A., Torre D., Keenan A.B., Jagodnik K.M., et al.
+ **“Massive mining of publicly available RNA-seq data from human and mouse.”**
+ *Nature Communications* 9, 1366 (2018).
+ Dataset: https://maayanlab.cloud/archs4/
+
+ ### **Filtering Procedure**
+ - Loaded all human bulk RNA-seq metadata from ARCHS4 v2.5 HDF5
+ - Selected samples matching:
+   `cancer | tumor | carcinoma | leukemia | lymphoma | melanoma | glioma`
+ - Removed samples lacking clear disease annotations
+ - Used ARCHS4 log-TPM matrices (gene × sample)
+ - Final dataset: ~76k cancer samples, 19,357 genes
+
+ No private, clinical, controlled-access, or proprietary data were used.
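
The keyword filter in the Filtering Procedure list can be sketched roughly as follows. This is a minimal illustration, not the author's pipeline: the case-insensitive substring matching, the `is_cancer_sample` helper, and the example records are all assumptions, and the README does not say which ARCHS4 metadata fields were searched.

```python
import re

# Keyword list copied from the Filtering Procedure; matching it
# case-insensitively as a substring is an assumption of this sketch.
CANCER_PATTERN = re.compile(
    r"cancer|tumor|carcinoma|leukemia|lymphoma|melanoma|glioma",
    re.IGNORECASE,
)


def is_cancer_sample(annotation):
    """Return True when a free-text annotation mentions one of the keywords."""
    if not annotation:
        # Missing or empty text is treated as "no clear disease annotation".
        return False
    return CANCER_PATTERN.search(annotation) is not None


# Hypothetical metadata records; real ARCHS4 fields and values will differ.
samples = {
    "sample_a": "Breast carcinoma, primary tumor",
    "sample_b": "healthy liver tissue",
    "sample_c": "",
    "sample_d": "Acute myeloid LEUKEMIA cell line",
}

kept = [sample_id for sample_id, text in samples.items() if is_cancer_sample(text)]
print(kept)  # ['sample_a', 'sample_d']
```

Plain substring matching of this kind would miss annotations that use terms outside the keyword list, such as "tumour" or "sarcoma".
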
+
+ ---
+
+ # 🧠 Model Architecture (Summary)
+
+ CancerTranscriptome-Mini-48M includes:
+
+ ### **1. Gene Identity Embeddings**
+ - Precomputed **ESM2 embeddings** for each protein-coding gene
+ - Projected into model dimension (320)
+
+ ### **2. Rotary Expression Embeddings (REE)**
+ - Deterministic sinusoidal continuous-value embedding
+ - Masked positions zeroed (mask token = -10)
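
To make the REE bullets concrete, here is a small sketch of a deterministic sinusoidal embedding of a single expression value, with the mask token mapped to an all-zero vector. Only the mask value of -10 and the sinusoidal, deterministic nature come from the README; the frequency schedule, the `base` of 100, the sin/cos concatenation, and the function name are assumptions chosen for illustration and may not match the model's code.

```python
import math

MASK_TOKEN = -10.0  # mask value stated in the README


def expression_embedding(value, dim=8, base=100.0):
    """Embed one scalar expression value as a fixed sinusoidal vector."""
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    if value == MASK_TOKEN:
        # Masked positions are zeroed rather than embedded.
        return [0.0] * dim
    # Assumed frequency schedule: base ** (-2i / dim) for i = 0 .. dim/2 - 1.
    inv_freq = [base ** (-(2 * i) / dim) for i in range(dim // 2)]
    angles = [value * f for f in inv_freq]
    # First half holds sines, second half holds cosines.
    return [math.sin(a) for a in angles] + [math.cos(a) for a in angles]


print(expression_embedding(0.0, dim=4))         # [0.0, 0.0, 1.0, 1.0]
print(expression_embedding(MASK_TOKEN, dim=4))  # [0.0, 0.0, 0.0, 0.0]
```

Because nothing here is learned, the same expression value always maps to the same vector; the model itself would use `dim=320` rather than the small sizes shown.
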
+
+ ### **3. Graph Neural Network Layer**
+ - **GCNConv** (Kipf & Welling) applied on a curated gene-gene graph
+ - Injects biological prior knowledge
+
+ ### **4. Expression Binning**
+ - Learnable importance scores sort genes
+ - Genes divided into 10 bins
+ - Each bin receives its own **local Performer** attention
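
The binning step can be pictured as sorting gene indices by score and cutting the sorted order into contiguous groups. The sketch below assumes a descending sort and near-equal bin sizes, with the remainder spread over the first bins; the README does not state how the model orders genes or handles a gene count that is not divisible by 10, and `split_into_bins` is a hypothetical helper.

```python
def split_into_bins(scores, n_bins=10):
    """Sort gene indices by score (descending) and cut them into contiguous bins."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    base, extra = divmod(len(order), n_bins)
    bins, start = [], 0
    for b in range(n_bins):
        # Assumption: the first `extra` bins each take one additional gene.
        size = base + (1 if b < extra else 0)
        bins.append(order[start:start + size])
        start += size
    return bins


# Toy example with four genes and two bins.
print(split_into_bins([0.1, 0.9, 0.5, 0.3], n_bins=2))  # [[1, 2], [3, 0]]

# With 19,357 genes and 10 bins, every gene lands in exactly one bin.
sizes = [len(b) for b in split_into_bins(list(range(19357)))]
print(sum(sizes))  # 19357
```

Under these assumptions each bin holds roughly 1,936 genes, so each local attention block would operate on about a tenth of the full sequence.
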
+
+ ### **5. Global Performer Attention**
+ - 2 stacked Performer layers across all genes
+
+ ### **6. Prediction Head**
+ - MLP → scalar value per gene
+ - Used for masked-expression reconstruction
+
+ Total parameters: **48,336,162 (~48M)**
+
+ ---
+
+ # 🎯 Intended Use
+
+ This model produces **context-aware gene embeddings** for downstream cancer transcriptomic tasks:
+
+ - Tumor subtype prediction
+ - Drug response modeling
+ - Immune infiltration scoring
+ - Survival / risk modeling
+ - Gene expression imputation
+ - Dimensionality reduction
+ - Transfer learning to TCGA, CCLE, DepMap, GEO tumor datasets
+
+ ---
+
+ # 🚀 How to Use
+
+ Download & run:

+ ```python
+ import torch
+ from model import BulkFormer  # from this repo
+ import safetensors.torch as st

+ # Load model + weights
+ model = BulkFormer(
+     dim=320,
+     graph=torch.load("edge_index.pt"),  # provide your graph
+     gene_emb=torch.load("esm2_gene_emb.pt"),
+     gene_length=19357,
+     bin_head=8,
+     full_head=4,
+     bins=10,
+     gb_repeat=1,
+     p_repeat=2
+ )

+ state = st.load_file("model.safetensors")
+ model.load_state_dict(state)
+ model.eval()

+ # Example input: one sample of 19,357 genes (random placeholder for a log-TPM vector)
+ x = torch.randn(1, 19357)

+ with torch.no_grad():
+     out = model(x)

+ print(out.shape)  # [1, 19357]