Feature Extraction Pipeline
In Plain English: Before training, we need to convert raw microscopy images into compact numerical representations (features). This pipeline runs each image through a pre-trained vision AI (DINOv3) that “summarizes” each channel image into a vector of 1,024 numbers. These features are saved to disk so training can iterate quickly without re-processing images. See the Glossary for term definitions.
End-to-end pipeline for extracting DINOv3 CLS token features from CPJUMP1 Cell Painting images.
Overview

```
TIFF images (16-bit, 5 channels per site)
        |
        v
image_loader: load + resize to 384x384 (save tensors)
        |
        v
image_loader: replicate each channel to pseudo-RGB (5 x 3 x 384 x 384)
        |
        v
feature_extractor: frozen DINOv3 ViT-L backbone
        |
        v
Per-site CLS tokens: (5, 1024) tensor saved as .pt file
        |
        v
dataset: group sites by well, pair with text from metadata
        |
        v
MorphoCLIPSample ready for contrastive training
```

Pipeline Stages
1. Image Loading (morphoclip.data.image_loader)
CPJUMP1 images follow the naming convention:
```
r{row}c{col}f{field}p{plane}-ch{channel}sk1fk1fl1.tiff
```

Channels (Cell Painting assay):
| Channel | Stain | Target |
|---|---|---|
| ch1 | MitoTracker / Alexa 647 | Mitochondria |
| ch2 | Phalloidin / Alexa 568 | Actin |
| ch3 | WGA / Alexa 488 long | Golgi / Plasma Membrane |
| ch4 | Concanavalin A / Alexa 488 | Endoplasmic Reticulum |
| ch5 | Hoechst 33342 | DNA / Nucleus |
| ch6-ch8 | Brightfield z-planes | Not used |
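The naming convention above can be parsed with a small regular expression. This is an illustrative sketch, not the project's actual parser (which lives in morphoclip.data.image_loader); `parse_site_filename` and `SITE_PATTERN` are hypothetical names:

```python
import re

# Pattern mirroring r{row}c{col}f{field}p{plane}-ch{channel}sk1fk1fl1.tiff
SITE_PATTERN = re.compile(
    r"r(?P<row>\d+)c(?P<col>\d+)f(?P<field>\d+)p(?P<plane>\d+)"
    r"-ch(?P<channel>\d)sk1fk1fl1\.tiff$"
)

def parse_site_filename(name: str) -> dict:
    """Extract row/col/field/plane/channel from a CPJUMP1 TIFF filename."""
    m = SITE_PATTERN.match(name)
    if m is None:
        raise ValueError(f"not a CPJUMP1 image filename: {name}")
    return {k: int(v) for k, v in m.groupdict().items()}

print(parse_site_filename("r01c02f03p01-ch5sk1fk1fl1.tiff"))
# -> {'row': 1, 'col': 2, 'field': 3, 'plane': 1, 'channel': 5}
```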
Key functions:
- `discover_sites(image_dir)` - scan a plate directory, return `{ImageKey: {channel: Path}}` for all complete sites
- `load_site_as_tensor(channel_paths, resize=384)` - load 5 channels as a `(5, 384, 384)` float32 tensor
- `prepare_channels_for_dino(site_tensor)` - replicate each grayscale channel to pseudo-RGB: `(5, H, W)` -> `(5, 3, H, W)`
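The channel-replication step is simple enough to sketch directly. This is a minimal illustration of the `(5, H, W)` -> `(5, 3, H, W)` transform; the project's own `prepare_channels_for_dino` may differ in detail:

```python
import torch

def prepare_channels_for_dino(site_tensor: torch.Tensor) -> torch.Tensor:
    """Replicate each grayscale channel to pseudo-RGB: (5, H, W) -> (5, 3, H, W)."""
    assert site_tensor.ndim == 3  # (channels, H, W)
    # Insert a color axis, then repeat it 3 times so each stain image
    # looks like an RGB image to the DINOv3 backbone.
    return site_tensor.unsqueeze(1).repeat(1, 3, 1, 1)

site = torch.rand(5, 384, 384)
print(prepare_channels_for_dino(site).shape)  # torch.Size([5, 3, 384, 384])
```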
2. Feature Extraction (morphoclip.data.feature_extractor)
Runs each pseudo-RGB channel image through a frozen DINOv3 backbone and extracts the CLS token.
Default model: facebook/dinov3-vitl16-pretrain-lvd1689m (ViT-L/16, 300M params, 1024-dim CLS token)
Per-plate extraction (extract_plate_features):

- Discover all sites in `{plate}/Images/`
- Load 5 channels per site, resize to 384x384
- Optionally save resized tensors to `data/tensors/{barcode}/`
- Replicate each channel to pseudo-RGB, normalize with ImageNet stats
- Batch through frozen DINOv3 -> CLS tokens
- Save `(5, 1024)` tensor per site to `data/features/{barcode}/r{row}c{col}f{field}.pt`
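The normalize-and-extract core of these steps can be sketched as follows. `extract_cls_tokens` is a hypothetical name, the toy backbone below merely stands in for the gated DINOv3 model (any callable returning a 1024-dim CLS token per image), and the normalization constants are the standard ImageNet statistics:

```python
import torch

# Standard ImageNet normalization stats, as used by DINO-family preprocessors.
IMAGENET_MEAN = torch.tensor([0.485, 0.456, 0.406]).view(1, 3, 1, 1)
IMAGENET_STD = torch.tensor([0.229, 0.224, 0.225]).view(1, 3, 1, 1)

@torch.no_grad()
def extract_cls_tokens(pseudo_rgb: torch.Tensor, backbone) -> torch.Tensor:
    """Normalize a (5, 3, H, W) site batch and run it through a frozen backbone.

    `backbone` is any callable mapping (B, 3, H, W) -> (B, 1024);
    in the real pipeline it wraps facebook/dinov3-vitl16-pretrain-lvd1689m.
    """
    x = (pseudo_rgb - IMAGENET_MEAN) / IMAGENET_STD
    return backbone(x)  # (5, 1024): one CLS token per channel image

# Toy stand-in backbone: global-average-pool, then project to 1024 dims.
proj = torch.nn.Linear(3, 1024)
toy_backbone = lambda x: proj(x.mean(dim=(2, 3)))

cls = extract_cls_tokens(torch.rand(5, 3, 384, 384), toy_backbone)
print(cls.shape)  # torch.Size([5, 1024])
```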
Prerequisites:

The DINOv3 model is gated on HuggingFace and requires authentication. You need a HuggingFace access token:

- Create a token at huggingface.co/settings/tokens with the “read” repository access scope.
- Request access by filling out the gating form on the model page.
- Add the token to your `.env` file in the project root:

```
HF_TOKEN=hf_your_token_here
```

The extraction script loads this automatically via python-dotenv.
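For illustration, here is a minimal stdlib stand-in for what python-dotenv's `load_dotenv()` does with that file (the real script uses python-dotenv itself; `load_env_file` is a hypothetical helper):

```python
import os

def load_env_file(path: str = ".env") -> None:
    """Read KEY=VALUE lines into os.environ without overwriting existing
    variables -- a minimal sketch of python-dotenv's load_dotenv()."""
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            os.environ.setdefault(key.strip(), value.strip())

# After loading, the token is available as os.environ["HF_TOKEN"] and can be
# passed to Hugging Face downloads that support a `token` argument.
```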
CLI:

```
pdm run extract-features                      # all plates
pdm run extract-features --plate BR00116991   # single plate
pdm run extract-features --verify-only        # check completeness
pdm run extract-features --no-tensors         # skip tensor saving (saves disk space, slightly faster)
```

3. Metadata (morphoclip.data.metadata)
Maps (plate_barcode, well) -> PerturbationInfo -> text description.
Data hierarchy:

```
data/metadata/
  platemaps/2020_11_04_CPJUMP1/
    barcode_platemap.csv                      # barcode -> platemap name (51 plates)
    platemap/
      JUMP-Target-1_compound_platemap.txt     # well -> broad_sample (384 wells)
      JUMP-Target-1_crispr_platemap.txt
      JUMP-Target-1_orf_platemap.txt
  external_metadata/
    JUMP-Target-1_compound_metadata_targets.tsv  # broad_sample -> annotations
    JUMP-Target-1_crispr_metadata.tsv
    JUMP-Target-1_orf_metadata.tsv
```

Perturbation types: COMPOUND, CRISPR, ORF, NEGCON (DMSO), POSCON
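The two-hop lookup (barcode -> platemap name -> well -> broad_sample) can be sketched with the csv module. The column names below (`Assay_Plate_Barcode`, `Plate_Map_Name`, `well_position`, `broad_sample`) are assumptions drawn from common JUMP metadata conventions, not the project's verified schema:

```python
import csv

def well_to_broad_sample(barcode: str, well: str,
                         barcode_platemap_csv: str,
                         platemap_dir: str) -> str:
    """Resolve (plate barcode, well) -> broad_sample via the two-hop lookup."""
    # Hop 1: plate barcode -> platemap name.
    with open(barcode_platemap_csv) as fh:
        barcode_to_map = {r["Assay_Plate_Barcode"]: r["Plate_Map_Name"]
                          for r in csv.DictReader(fh)}
    platemap = barcode_to_map[barcode]
    # Hop 2: well position -> broad_sample within the tab-separated platemap.
    with open(f"{platemap_dir}/{platemap}.txt") as fh:
        for row in csv.DictReader(fh, delimiter="\t"):
            if row["well_position"] == well:
                return row["broad_sample"]
    raise KeyError(f"{well} not found in {platemap}")
```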
Text granularity levels (for multi-scale ablation):
- `name_only`: “Chemical perturbation: gabapentin-enacarbil.”
- `name_target`: “Chemical perturbation: gabapentin-enacarbil. Target: CACNB4.”
- `full`: “Chemical perturbation: gabapentin-enacarbil. Target: CACNB4. Function: … SMILES: …”
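Composing these levels might look like the sketch below. The `PerturbationInfo` field names here are illustrative, not the project's exact dataclass:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class PerturbationInfo:
    # Illustrative fields; the real dataclass may carry more annotations.
    name: str
    target: Optional[str] = None
    function: Optional[str] = None
    smiles: Optional[str] = None

def build_text(info: PerturbationInfo, level: str = "full") -> str:
    """Compose a description at one of the three granularity levels."""
    parts = [f"Chemical perturbation: {info.name}."]
    if level in ("name_target", "full") and info.target:
        parts.append(f"Target: {info.target}.")
    if level == "full":
        if info.function:
            parts.append(f"Function: {info.function}.")
        if info.smiles:
            parts.append(f"SMILES: {info.smiles}")
    return " ".join(parts)

info = PerturbationInfo(name="gabapentin-enacarbil", target="CACNB4")
print(build_text(info, "name_target"))
# Chemical perturbation: gabapentin-enacarbil. Target: CACNB4.
```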
4. Dataset (morphoclip.data.dataset)
MorphoCLIPDataset is a PyTorch Dataset that loads pre-extracted features and pairs them with text descriptions. Each sample represents one well (all sites aggregated).
Sample structure:
- `features`: `(num_sites, 5, 1024)` - stacked site features
- `text`: text description from metadata
- `plate`, `well`: identifiers
- `pert_info`: full `PerturbationInfo` dataclass
collate_fn pads variable site counts and returns a boolean site_mask.
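The padding-and-mask behavior can be sketched as follows; this is an illustration of what is described above, and the project's actual collate_fn may differ in detail:

```python
import torch

def collate_fn(batch):
    """Pad variable site counts to the batch maximum and return a boolean
    site_mask (True = real site, False = padding)."""
    max_sites = max(item["features"].shape[0] for item in batch)
    feats, mask = [], []
    for item in batch:
        f = item["features"]                      # (num_sites, 5, 1024)
        pad = max_sites - f.shape[0]
        feats.append(torch.cat([f, f.new_zeros(pad, *f.shape[1:])]))
        mask.append(torch.tensor([True] * f.shape[0] + [False] * pad))
    return {
        "features": torch.stack(feats),           # (B, max_sites, 5, 1024)
        "site_mask": torch.stack(mask),           # (B, max_sites)
        "text": [item["text"] for item in batch],
    }
```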
5. Splitting (morphoclip.data.splits)
Three strategies for train/val/test splits:
- `cpjump1_official_representation`: official CPJUMP1 representation split
- `cpjump1_official_gene_compound`: official CPJUMP1 gene-compound target split
- `cellclip_cpjump_style`: CellCLIP-style deterministic 75/25 split within benchmark slices
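One common way to realize a deterministic 75/25 split is to hash each sample identifier, so assignments are stable across runs and machines. This is a generic sketch of the idea, not necessarily CellCLIP's or this project's exact rule:

```python
import hashlib

def deterministic_split(sample_id: str, train_fraction: float = 0.75) -> str:
    """Assign a sample to 'train' or 'test' by hashing its identifier."""
    digest = hashlib.md5(sample_id.encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF  # roughly uniform in [0, 1]
    return "train" if bucket < train_fraction else "test"

# e.g. key each well by "{barcode}:{well}" so the split is per-well.
splits = [deterministic_split(f"BR00116991:{w}") for w in ("A01", "A02", "B05")]
```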
Configuration

All paths and hyperparameters are configured in configs/dataset.yml:

```yaml
cpjump:
  extraction:
    model: "facebook/dinov3-vitl16-pretrain-lvd1689m"
    device: "cuda"
    batch_size: 48
  dataset:
    text_level: "full"
    exclude_controls: true
    split_strategy: "cpjump1_official_representation"
    val_fraction: 0.1
    test_fraction: 0.1
    seed: 56
```

Directory Layout (after extraction)
```
data/
  metadata/                    # downloaded via pdm run fetch-dataset --metadata
  raw/{batch}/{plate}/Images/  # raw TIFF images (optional, can delete after extraction)
  features/{barcode}/          # extracted CLS tokens (5, 1024) per site
  tensors/{barcode}/           # resized image tensors (5, 384, 384) per site
```