CellRep
A model for generating embeddings from cellular microscopy images.
Quickstart - Embedding Pipeline
The easiest way to use this model for generating embeddings is via the image-feature-extraction pipeline:
from PIL import Image
from transformers import pipeline
cellrep_pipeline = pipeline(
    task="image-feature-extraction",
    model="novonordisk-red/cellrep-base",
    revision="3.0.0",
    trust_remote_code=True,
)
images = [
    Image.open(PATH_TO_MY_PNG_IMAGE_A),
    Image.open(PATH_TO_MY_PNG_IMAGE_B),
]
embeddings = cellrep_pipeline(images)
How to use this Model in Full
To work at a lower level than the pipeline, start by loading the model:
from transformers import Dinov2WithRegistersModel
model = Dinov2WithRegistersModel.from_pretrained(
    "novonordisk-red/cellrep-base",
    revision="3.0.0",
)
Then load a PNG image and pre-process it:
- Convert the image to a PyTorch tensor.
- Resize the height and width to a multiple of 14.
- Normalize the pixel values using the ImageNet mean and standard deviation.
- Add a leading batch dimension to the final tensor.
from PIL import Image
import torchvision.transforms.functional as f
IMAGENET_DEFAULT_MEAN = (0.485, 0.456, 0.406)
IMAGENET_DEFAULT_STD = (0.229, 0.224, 0.225)
image = Image.open(PATH_TO_MY_PNG_IMAGE)
image_tensor = f.to_tensor(image)
image_resized = f.resize(image_tensor, [518, 518])
image_tensor_norm = f.normalize(
    image_resized,
    mean=IMAGENET_DEFAULT_MEAN,
    std=IMAGENET_DEFAULT_STD,
)
image_input = image_tensor_norm.unsqueeze(0)
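The 518 x 518 target used above is 37 x 14. For images with other dimensions, a valid target size could be computed along the following lines. This is a minimal sketch: the helper names `nearest_patch_multiple` and `target_size` are hypothetical, and this card does not state how well the model performs at resolutions other than 518 x 518.

```python
PATCH_SIZE = 14  # taken from the "multiple of 14" requirement above


def nearest_patch_multiple(size: int, patch_size: int = PATCH_SIZE) -> int:
    """Round a side length to the nearest positive multiple of patch_size."""
    if size <= 0:
        raise ValueError("size must be positive")
    # Never return 0: the smallest valid side length is one patch.
    return max(patch_size, round(size / patch_size) * patch_size)


def target_size(height: int, width: int, patch_size: int = PATCH_SIZE) -> list[int]:
    """Return [height, width], each snapped to a multiple of patch_size."""
    return [
        nearest_patch_multiple(height, patch_size),
        nearest_patch_multiple(width, patch_size),
    ]
```

The result can be passed as the size argument of `f.resize`, for example `f.resize(image_tensor, target_size(520, 696))`.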
Then generate the embedding:
image_embedding = model(image_input).pooler_output
The pooler_output attribute holds the class-token embedding; only this embedding should be used for downstream tasks.
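As an illustration of one possible downstream use, class-token embeddings from several images can be compared with cosine similarity. The sketch below uses NumPy and random stand-in vectors so that it runs without the model; the embedding width of 768 and the function name `cosine_similarity_matrix` are assumptions for illustration, not details taken from this card.

```python
import numpy as np


def cosine_similarity_matrix(embeddings: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between the rows of an (n, d) array."""
    embeddings = np.asarray(embeddings, dtype=np.float64)
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    # Guard against division by zero for all-zero rows.
    unit = embeddings / np.clip(norms, 1e-12, None)
    return unit @ unit.T


# Random stand-ins for class-token embeddings (assumed width: 768).
# Real embeddings would come from model(image_input).pooler_output.
rng = np.random.default_rng(0)
fake_embeddings = rng.normal(size=(4, 768))
similarity = cosine_similarity_matrix(fake_embeddings)
```

Entry `similarity[i, j]` is the cosine similarity between images `i` and `j`; the diagonal is 1 because each embedding is compared with itself.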
How was the Model Trained?
The training code for this model can be found at nn-research-early-development/cellrep.
Data
Our training data comprises two large-scale cell painting datasets from the Broad Institute, CDRP-BBBC047-Bray and LINCS-Pilot, both of which can be downloaded from the Broad Institute's Cell Painting Gallery. Together, these constitute ~1.2 million five-channel microscopy images of cancer cells, namely U2OS and A549 cells, respectively. The cell painting assay used in these datasets captures distinct cellular components through the following channels:
- RNA/nucleoli and cytoplasmic RNA (SYTO 14)
- ER/endoplasmic reticulum (concanavalin A)
- AGP/actin, Golgi and plasma membrane (phalloidin and WGA)
- Mito/mitochondria (MitoTracker Deep Red)
- DNA/nucleus (Hoechst 33342)
These datasets contain images of cells treated with diverse chemical compounds, providing a rich set of morphological phenotypes for model training. We applied the same full normalization and PNG-conversion pipeline to the training and testing datasets for all models, to ensure consistent processing across all experiments.
Training Run
The training run that yielded these model weights was logged to Weights & Biases at:
https://nn-red.wandb.io/cellular-foundation-model/cellrep-benchmark-runs/runs/rv393vct
The precise checkpoint used was:
cellular-foundation-model/cellrepv2-testing/cellrepv2-53715-teacher-624999:v0
Evaluation
Our primary benchmark uses CDRP-bio-BBBC036-Bray, a held-out subset of 124,416 images from CDRP-BBBC047-Bray containing known bioactive compounds. Each compound in this dataset has been annotated with its mechanism of action (MoA). Because multiple compounds can share the same MoA, predicting MoA tests whether a model learns biologically meaningful features within the same assay rather than memorizing compound-specific artifacts or batch effects. To ensure statistical reliability, we restrict our evaluation to the 23 most frequent MoA classes in CDRP-bio-BBBC036-Bray.
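Restricting a labelled dataset to its most frequent classes can be sketched as follows. This is an illustration only, not the evaluation code: the function name and the toy labels are hypothetical, and it assumes class frequency is counted per sample, which this card does not specify.

```python
from collections import Counter


def keep_most_frequent_classes(labels, k):
    """Return the k most frequent labels and the indices of samples carrying them."""
    counts = Counter(labels)
    kept = {label for label, _ in counts.most_common(k)}
    indices = [i for i, label in enumerate(labels) if label in kept]
    return kept, indices


# Hypothetical MoA labels, one per sample.
moa_labels = ["A", "B", "A", "C", "B", "A", "D"]
kept, indices = keep_most_frequent_classes(moa_labels, k=2)
```

For the benchmark described above, `k` would be 23 and `indices` would select the samples that enter the evaluation.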
Results
class        precision    recall  f1-score   support
0                 0.16      0.13      0.14       323
1                 0.12      0.18      0.14       132
2                 0.24      0.18      0.21       312
3                 0.25      0.14      0.18       478
4                 0.81      0.83      0.82       101
5                 0.11      0.19      0.14       148
6                 0.26      0.40      0.32       175
7                 0.15      0.25      0.19       102
8                 0.26      0.34      0.29        91
9                 0.15      0.11      0.12       351
10                0.13      0.22      0.16        94
11                0.18      0.18      0.18       206
12                0.38      0.22      0.28       436
13                0.17      0.26      0.20       132
14                0.13      0.24      0.17       122
15                0.38      0.45      0.41       221
16                0.14      0.14      0.14       248
17                0.14      0.11      0.12       271
18                0.17      0.21      0.18       203
19                0.13      0.13      0.13       271
20                0.26      0.16      0.20       344
21                0.15      0.21      0.17       186
22                0.12      0.16      0.14        56
accuracy                              0.21      5003
macro avg         0.22      0.24      0.22      5003
weighted avg      0.22      0.21      0.20      5003
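To make the summary rows easier to interpret, the sketch below recomputes the macro and weighted F1 averages from the per-class values in the report, along with two simple reference points: uniform chance and always predicting the largest class. The inputs are the rounded values printed above, so the results are approximate; the variable names are illustrative.

```python
# (f1-score, support) per class, copied from the report above.
per_class = [
    (0.14, 323), (0.14, 132), (0.21, 312), (0.18, 478), (0.82, 101),
    (0.14, 148), (0.32, 175), (0.19, 102), (0.29, 91), (0.12, 351),
    (0.16, 94), (0.18, 206), (0.28, 436), (0.20, 132), (0.17, 122),
    (0.41, 221), (0.14, 248), (0.12, 271), (0.18, 203), (0.13, 271),
    (0.20, 344), (0.17, 186), (0.14, 56),
]

total_support = sum(support for _, support in per_class)

# Macro average: every class counts equally.
macro_f1 = sum(f1 for f1, _ in per_class) / len(per_class)

# Weighted average: each class counts in proportion to its support.
weighted_f1 = sum(f1 * support for f1, support in per_class) / total_support

# Reference points for the reported accuracy.
uniform_chance = 1 / len(per_class)
majority_baseline = max(support for _, support in per_class) / total_support

print(f"macro F1 = {macro_f1:.2f}, weighted F1 = {weighted_f1:.2f}")
print(f"uniform chance = {uniform_chance:.3f}, majority baseline = {majority_baseline:.3f}")
```

The recomputed averages round to 0.22 and 0.20, matching the report. The reported accuracy of 0.21 compares with roughly 0.04 for uniform guessing over 23 classes and roughly 0.10 for always predicting the largest class.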