alvawalt/CancerTranscriptome-Mini-48M

@@ -1,44 +1,143 @@
 ---
-language:
-  - en
 license: mit
 tags:
-  - biology
-  - genomics
-  - transcriptomics
   - cancer
-  - bulk-rnaseq
-  - foundation-model
-  - masked-reconstruction
   - performer
   - gcn
   - pytorch
-library_name: pytorch
 pipeline_tag: feature-extraction
-model_name: CancerTranscriptome-Mini-48M
-model_type: transformer
-datasets:
-  - ARCHS4
-papers:
-  - https://doi.org/10.1101/2025.06.11.659222
-authors:
-  - name: Walter Alvarado
-    affiliation: NASA Ames Research Center
-    github: https://github.com/alwalt
-model_size:
-  total_params: 48336162
-description: >
-  CancerTranscriptome-Mini-48M is a small, proof-of-concept BulkFormer-inspired model
-  trained on cancer-only bulk RNA-seq (ARCHS4, TCGA, GEO). It integrates ESM2 gene
-  identity embeddings, Rotary Expression Embeddings (REE), GCN message passing, local
-  bin-based Performer attention, and global Performer attention. This model is designed
-  as a research prototype showing that BulkFormer-like architectures can be trained and
-  used end-to-end on a single consumer GPU.

 ---
 license: mit
 tags:
+  - rna-seq
+  - bulk-rna
   - cancer
+  - transcriptomics
+  - graph-neural-network
+  - transformer
   - performer
   - gcn
+  - foundation-model
   - pytorch
+model_size: 48M
 pipeline_tag: feature-extraction
+library_name: pytorch
+---
+# 🧬 CancerTranscriptome-Mini-48M
+*A compact, cancer-focused BulkFormer-style encoder for bulk RNA-seq*
+**CancerTranscriptome-Mini-48M** is a lightweight derivative of **BulkFormer**, designed to learn cancer-specific transcriptomic structure from large-scale bulk RNA-seq.
+It combines **GCN-based gene graph propagation**, **Rotary Expression Embeddings (REE)**, **local bin-wise Performer attention**, and **global Performer attention** in a single encoder.
+The model is a proof of concept intended for research, experimentation, and rapid iteration on BulkFormer-style architectures applied to cancer transcriptomes.
+---
+## 🔬 Origin & References
+### **Primary Reference (BulkFormer)**
+Boming Kang, Rui Fan, Meizheng Yi, Chunmei Cui, Qinghua Cui.
+**“A large-scale foundation model for bulk transcriptomes.”**
+bioRxiv (2025).
+doi: https://doi.org/10.1101/2025.06.11.659222
+### **This Model (CancerTranscriptome-Mini-48M)**
+A compact re-implementation based on the BulkFormer architecture, adapted for cancer-only bulk RNA-seq and simplified for accessibility and compute efficiency.
+Source Code: https://github.com/alwalt/BioFM
+---
+## 📊 Data Source
+All training samples originate from the **ARCHS4 Human RNA-seq v2.5** public repository:
+**ARCHS4 Reference:**
+Lachmann A., Torre D., Keenan A.B., Jagodnik K.M., et al.
+**“Massive mining of publicly available RNA-seq data from human and mouse.”**
+*Nature Communications* 9, 1366 (2018).
+Dataset: https://maayanlab.cloud/archs4/
+### **Filtering Procedure**
+- Loaded all human bulk RNA-seq metadata from ARCHS4 v2.5 HDF5
+- Selected samples matching:
+  `cancer | tumor | carcinoma | leukemia | lymphoma | melanoma | glioma`
+- Removed samples lacking clear disease annotations
+- Used ARCHS4 log-TPM matrices (gene × sample)
+- Final dataset: ~76k cancer samples, 19,357 genes
+No private, clinical, controlled-access, or proprietary data were used.
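The keyword selection in the filtering procedure can be sketched as below. This is a minimal illustration, not the author's pipeline: the sample records, the metadata field names (`title`, `source_name`, `characteristics`), and the case-insensitive matching are all assumptions made for the example.

```python
import re

# Keywords taken from the filtering procedure above.
CANCER_PATTERN = re.compile(
    r"cancer|tumor|carcinoma|leukemia|lymphoma|melanoma|glioma",
    flags=re.IGNORECASE,  # assumption: matching ignores case
)

def is_cancer_sample(sample, fields=("title", "source_name", "characteristics")):
    """Return True if any (assumed) metadata field mentions a cancer keyword."""
    text = " ".join(str(sample.get(f, "")) for f in fields)
    return bool(CANCER_PATTERN.search(text))

# Hypothetical metadata records, for illustration only.
samples = [
    {"id": "S1", "title": "Breast carcinoma biopsy", "source_name": "tumor tissue"},
    {"id": "S2", "title": "Healthy liver", "source_name": "normal tissue"},
    {"id": "S3", "title": "AML patient", "characteristics": "disease: acute myeloid leukemia"},
]

cancer_ids = [s["id"] for s in samples if is_cancer_sample(s)]
print(cancer_ids)
```

A plain substring match like this also keeps samples such as "non-tumor control", which is one reason a separate step that removes samples without clear disease annotations is useful.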
+---
+## 🧠 Model Architecture (Summary)
+CancerTranscriptome-Mini-48M includes:
+### **1. Gene Identity Embeddings**
+- Precomputed **ESM2 embeddings** for each protein-coding gene
+- Projected into model dimension (320)
+### **2. Rotary Expression Embeddings (REE)**
+- Deterministic sinusoidal embedding of continuous expression values
+- Masked positions are flagged with the mask token `-10`, and their embeddings are set to zero
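One way to picture the REE step is the sketch below. It is written in NumPy for brevity and is not the repository's code: the function name, the frequency base of 100, and the sine/cosine concatenation layout are assumptions. Only the sinusoidal mapping, the zeroing of masked positions, and the mask token of -10 come from the description above.

```python
import numpy as np

def rotary_expression_embedding(x, dim=320, base=100.0, mask_token=-10.0):
    """Map expression values of shape (batch, genes) to (batch, genes, dim)."""
    x = np.asarray(x, dtype=np.float64)
    masked = x == mask_token                                # positions holding the mask token
    inv_freq = 1.0 / base ** (np.arange(0, dim, 2) / dim)   # (dim/2,) assumed frequency schedule
    angles = x[..., None] * inv_freq                        # (batch, genes, dim/2)
    emb = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    emb[masked] = 0.0                                       # masked positions carry no signal
    return emb

# Toy input: one sample, three genes, the last one masked.
x = np.array([[0.0, 2.5, -10.0]])
emb = rotary_expression_embedding(x, dim=8)
print(emb.shape)
```

Because the mapping has no learned parameters, the same expression value always yields the same vector.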
+### **3. Graph Neural Network Layer**
+- **GCNConv** (Kipf & Welling) applied on a curated gene-gene graph
+- Injects biological prior knowledge
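The propagation rule behind a Kipf & Welling GCN layer, `D^-1/2 (A + I) D^-1/2 H W`, can be sketched on a toy graph as follows. This is a dense NumPy illustration under stated assumptions: the three-gene graph, feature sizes, and weights are made up, and a real 19,357-gene graph would need a sparse implementation such as `GCNConv`.

```python
import numpy as np

def gcn_layer(h, edge_index, weight):
    """One GCN propagation step: D^-1/2 (A + I) D^-1/2 H W."""
    n = h.shape[0]
    adj = np.zeros((n, n))
    src, dst = edge_index          # edge_index has shape (2, num_edges)
    adj[src, dst] = 1.0
    adj[dst, src] = 1.0            # treat the graph as undirected
    adj += np.eye(n)               # add self-loops
    deg_inv_sqrt = 1.0 / np.sqrt(adj.sum(axis=1))
    norm_adj = deg_inv_sqrt[:, None] * adj * deg_inv_sqrt[None, :]
    return norm_adj @ h @ weight

# Hypothetical toy graph: gene 0 - gene 1 - gene 2.
edge_index = np.array([[0, 1], [1, 2]])
rng = np.random.default_rng(0)
h = rng.normal(size=(3, 4))        # 3 genes, 4 features each
weight = rng.normal(size=(4, 2))   # projects 4 features to 2
out = gcn_layer(h, edge_index, weight)
print(out.shape)
```

Each gene's output is a degree-normalized average of its own features and those of its graph neighbours, which is how the graph prior enters the representation.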
+### **4. Expression Binning**
+- Genes are sorted by learnable importance scores
+- The sorted genes are divided into 10 bins
+- Each bin receives its own **local Performer** attention
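The sort, split, and per-bin processing can be sketched as below. This is an illustration in NumPy, not the repository's implementation: `apply_per_bin` and `center` are hypothetical names, a single score per gene is assumed, and mean-centering stands in for the local Performer attention that each bin would actually receive.

```python
import numpy as np

def apply_per_bin(h, scores, n_bins, bin_fn):
    """Sort genes by score, run bin_fn on each contiguous bin, restore original order."""
    order = np.argsort(-scores)                      # highest score first
    out = np.empty_like(h)
    for bin_id, idx in enumerate(np.array_split(order, n_bins)):
        out[idx] = bin_fn(bin_id, h[idx])            # write results back to original positions
    return out

def center(bin_id, z):
    """Placeholder for a per-bin attention module: subtract the bin mean."""
    return z - z.mean(axis=0, keepdims=True)

rng = np.random.default_rng(0)
h = rng.normal(size=(20, 4))       # 20 toy genes, 4 features each
scores = rng.normal(size=20)       # stand-in for learned importance scores
out = apply_per_bin(h, scores, n_bins=10, bin_fn=center)
print(out.shape)
```

`np.array_split` tolerates gene counts that do not divide evenly, which matters here because 19,357 genes cannot be split into 10 equal bins.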
+### **5. Global Performer Attention**
+- 2 stacked Performer layers across all genes
+### **6. Prediction Head**
+- MLP → scalar value per gene
+- Used for masked-expression reconstruction
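A masked-reconstruction objective of this kind can be sketched as follows. The 15% mask rate, the mean-squared-error loss, and the function names are assumptions made for illustration; the source states only that the head predicts one scalar per gene for masked-expression reconstruction and that the mask token is -10.

```python
import numpy as np

MASK_TOKEN = -10.0  # sentinel value from the REE section

def mask_expression(x, mask_rate=0.15, rng=None):
    """Replace a random subset of expression values with the mask token."""
    rng = np.random.default_rng() if rng is None else rng
    mask = rng.random(x.shape) < mask_rate
    x_masked = np.where(mask, MASK_TOKEN, x)
    return x_masked, mask

def masked_mse(pred, target, mask):
    """Mean squared error computed on masked positions only."""
    diff = (pred - target)[mask]
    return float(np.mean(diff ** 2))

rng = np.random.default_rng(0)
x = np.abs(rng.normal(size=(2, 1000)))    # toy stand-in for log-TPM values
x_masked, mask = mask_expression(x, rng=rng)

pred = np.zeros_like(x)                   # stand-in for the model's per-gene predictions
loss = masked_mse(pred, x, mask)
print(mask.mean(), loss)
```

Scoring only the masked positions makes the model infer hidden genes from the visible ones rather than copy its input.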
+Total parameters: **48,336,162 (~48M)**
+---
+## 🎯 Intended Use
+This model produces **context-aware gene embeddings** for downstream cancer transcriptomic tasks:
+- Tumor subtype prediction
+- Drug response modeling
+- Immune infiltration scoring
+- Survival / risk modeling
+- Gene expression imputation
+- Dimensionality reduction
+- Transfer learning to TCGA, CCLE, DepMap, GEO tumor datasets
+---
+## 🚀 How to Use
+Download & run:
+```python
+import torch
+import safetensors.torch as st
+
+from model import BulkFormer  # model definition from this repo
+
+# Build the model (hyperparameters must match the released checkpoint)
+model = BulkFormer(
+    dim=320,
+    graph=torch.load("edge_index.pt"),        # gene-gene graph; provide your own
+    gene_emb=torch.load("esm2_gene_emb.pt"),  # precomputed ESM2 gene embeddings
+    gene_length=19357,
+    bin_head=8,
+    full_head=4,
+    bins=10,
+    gb_repeat=1,
+    p_repeat=2,
+)
+
+# Load the pretrained weights and switch to inference mode
+state = st.load_file("model.safetensors")
+model.load_state_dict(state)
+model.eval()
+
+# Example input: one sample as a 19,357-gene log-TPM vector (random placeholder here)
+x = torch.randn(1, 19357)
+with torch.no_grad():
+    out = model(x)
+print(out.shape)  # [1, 19357]