Literature Review
MorphoCLIP: Text-Supervised Contrastive Learning for Perturbation Matching in Cell Painting Images
1. Problem Statement
A fundamental challenge in drug discovery is determining a compound’s mechanism of action (MoA) — identifying which protein or biological pathway a drug targets. The Cell Painting assay enables high-throughput morphological profiling of cells, but current analytical methods (CellProfiler-based handcrafted features + cosine similarity) detect only 5–25% of expected compound–gene matches (Chandrasekaran et al., 2024). This project investigates whether Vision-Language Models can learn richer representations that substantially improve this match rate.
2. Foundational Dataset
Chandrasekaran et al. (2024) — CPJUMP1
- Paper: Three million images and morphological profiles of cells treated with matched chemical and genetic perturbations
- Venue: Nature Methods, Vol 21, pp. 1114–1121
- Link: https://doi.org/10.1038/s41592-024-02241-6
Summary: Created the CPJUMP1 benchmark dataset — a carefully curated resource where each perturbed gene’s protein product is a known target of at least two chemical compounds. This enables systematic evaluation of whether chemical and genetic perturbations targeting the same protein produce similar cell morphologies.
Dataset:
- ~3 million Cell Painting images, ~75 million cells
- 3 perturbation types: compounds (303), CRISPR knockouts (160 genes), ORF overexpression (176 genes)
- 2 cell lines (U2OS, A549), multiple timepoints
- 5 fluorescent channels: DNA, ER, RNA, AGP (actin/Golgi/membrane), Mito
- 384-well plates with DMSO negative controls
Baseline Method: CellProfiler extracts ~1,000 handcrafted morphological features (shape, texture, intensity) per cell. Well-level profiles are created by averaging across cells. Perturbation matching is evaluated using cosine similarity and mean Average Precision (mAP).
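The baseline pipeline above reduces to three steps: average per-cell features into well-level profiles, rank wells by cosine similarity, and score matching with average precision. A minimal sketch of those steps (function names are illustrative, not CellProfiler's API):

```python
import numpy as np

def well_profiles(cell_features, well_ids):
    """Average per-cell feature vectors into one profile per well."""
    ids = np.asarray(well_ids)
    wells = sorted(set(well_ids))
    return wells, np.stack([cell_features[ids == w].mean(axis=0) for w in wells])

def cosine_similarity(a, b):
    """Pairwise cosine similarity between rows of a and rows of b."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

def average_precision(ranked_relevance):
    """AP over a ranked list of 0/1 relevance labels (the basis of mAP)."""
    hits, precisions = 0, []
    for rank, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            precisions.append(hits / rank)
    return float(np.mean(precisions)) if precisions else 0.0
```

In the CPJUMP1 evaluation, a query perturbation's ranked list is relevant wherever a retrieved profile targets the same protein, and mAP averages these AP values across queries.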
Key Results:
- Only ~5–25% of compounds correctly match their sister compounds targeting the same protein
- Only ~7–17% of CRISPR guides correctly match sister guides targeting the same gene
- Cross-modal matching (compound ↔ genetic perturbation) is even harder
- Possible causes: suboptimal annotations, off-target effects, or limitations in how morphology is measured
Relevance to MorphoCLIP: This dataset provides the ground truth benchmark and evaluation framework. Our model’s performance will be directly compared against these CellProfiler baselines.
3. Contrastive Learning for Cell Painting
3.1 CLOOME — Sánchez-Fernández et al. (2023)
- Paper: CLOOME: Contrastive learning unlocks bioimaging databases for queries with chemical structures
- Venue: Nature Communications
Approach: First CLIP-style contrastive learning framework for Cell Painting. Trains two encoders jointly — a ResNet for microscopy images and an MLP for Morgan molecular fingerprints — to embed matched image–molecule pairs close together in a shared latent space.
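The training objective pulls matched image–molecule pairs onto the diagonal of a similarity matrix. CLOOME actually adapts the CLOOB/InfoLOOB objective; the plain symmetric CLIP-style loss sketched below conveys the idea, assuming the two encoders have already produced batch embeddings:

```python
import numpy as np

def clip_contrastive_loss(img_emb, mol_emb, temperature=0.07):
    """Symmetric CLIP-style contrastive loss: matched image/molecule
    pairs sit on the diagonal of the logits matrix and are pulled together."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    mol = mol_emb / np.linalg.norm(mol_emb, axis=1, keepdims=True)
    logits = img @ mol.T / temperature

    def cross_entropy_on_diagonal(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -float(np.mean(np.diag(log_probs)))

    # average of the image-to-molecule and molecule-to-image directions
    return 0.5 * (cross_entropy_on_diagonal(logits)
                  + cross_entropy_on_diagonal(logits.T))
```

With well-aligned pairs the diagonal dominates and the loss approaches zero; mismatched pairs drive it up, which is what powers retrieval at test time.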
Key Results:
- Top-1 retrieval accuracy more than 70× the random baseline when retrieving from a database of ~2,000 candidate images
- Learned embeddings transfer well to downstream tasks: activity prediction, MoA identification, image classification via linear probing
Limitations:
- Only handles chemical perturbations (not genetic — no CRISPR or ORF)
- Uses ResNet without cross-channel reasoning — treats 5 Cell Painting channels like RGB, which loses channel-specific biological information
- Limited to Morgan fingerprints for molecular representation
Relevance to MorphoCLIP: CLOOME established that contrastive learning works for Cell Painting data. MorphoCLIP builds on this by unifying perturbation types via text and using channel-aware image encoding.
3.2 MolPhenix — Fradkin et al. (2024)
- Paper: How Molecules Impact Cells: Unlocking Contrastive PhenoMolecular Retrieval
- Venue: NeurIPS 2024
Approach: Builds on CLOOME with three design guidelines:
- Use a pre-trained phenomics model (Phenom-1) as the image encoder rather than training from scratch — dramatically accelerates training and improves performance
- Embedding averaging across replicates of the same perturbation — reduces noise
- Implicit and explicit concentration encoding — treats different drug concentrations as distinct classes
Also introduced the S2L (Soft-to-Label) loss, a contrastive loss that shifts from multi-class classification to a soft multi-label problem, improving retrieval of active molecules.
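Of these guidelines, replicate averaging is the simplest to illustrate: embeddings from wells carrying the same perturbation are averaged and re-normalized before entering the contrastive loss, so per-well noise cancels out. A sketch of the averaging step only (the paper's exact aggregation, concentration handling, and S2L loss are not reproduced here):

```python
import numpy as np

def average_replicate_embeddings(embeddings, perturbation_ids):
    """Average embeddings of replicate wells of the same perturbation,
    then re-normalize to the unit sphere (noise-reduction sketch)."""
    ids = np.asarray(perturbation_ids)
    perts = sorted(set(perturbation_ids))
    mean = np.stack([embeddings[ids == p].mean(axis=0) for p in perts])
    return perts, mean / np.linalg.norm(mean, axis=1, keepdims=True)
```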
Key Results:
- 8.78× improvement over CLOOME on active molecule retrieval
- 77.33% top-1% retrieval on active molecules (zero-shot)
- Pre-trained Phenom-1 embeddings are critical — training from scratch is far worse
Limitations:
- Chemical perturbations only (like CLOOME)
- Relies on proprietary Phenom-1 model (not publicly available)
- Assumes access to activity labels for evaluation
Relevance to MorphoCLIP: Validates the importance of pre-trained image encoders and thoughtful loss design. Our approach should leverage pre-trained embeddings rather than training vision models from scratch.
3.3 CellCLIP — Lu et al. (2025)
- Paper: CellCLIP: Learning Perturbation Effects in Cell Painting via Text-Guided Contrastive Learning
- Venue: NeurIPS 2025 (Poster)
- Link: https://arxiv.org/abs/2506.06290
- Code: https://github.com/suinleelab/CellCLIP
Approach: A CLIP-style cross-modal contrastive learning framework that uses natural language descriptions as the universal perturbation representation — unifying compounds, CRISPR knockouts, and ORFs in a single text modality.
Key architectural innovations:
- Text-based perturbation encoding: Prompt template: “A Cell Painting image of [cell_type] cells treated with [drug_name], SMILES: [SMILES_string]”. For CRISPR: “…with CRISPR knockout of gene [GENE]”. Uses pre-trained language model (e.g., PubMedBERT) as text encoder. SMILES string is the most critical component — removing it causes the largest performance drop.
- CrossChannelFormer: Novel transformer architecture that processes each of the 5 Cell Painting channels separately, then uses cross-attention to reason about inter-channel relationships. Captures channel-specific biology (unlike CLOOME’s naive channel stacking).
- Attention-based pooling (Multiple Instance Learning): Multiple cells per perturbation are pooled into one embedding using learned attention weights. Identifies and emphasizes cells with the strongest perturbation signal. 6.7× training speedup vs. instance-level processing.
- CWCL (Continuously Weighted Contrastive Loss): Replaces binary positive/negative labels with continuous similarity-based weights. Preserves morphological similarity structure in the cross-modal space.
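The attention-pooling innovation can be sketched as standard MIL attention: each cell embedding receives a learned score, a softmax across cells turns scores into weights, and the weighted sum becomes the single perturbation-level embedding. The scoring network and weight shapes below are illustrative assumptions; CellCLIP's actual pooling head may differ:

```python
import numpy as np

def attention_pool(cell_embeddings, w, v):
    """MIL attention pooling: score each cell, softmax across cells,
    return the attention-weighted sum as one embedding per perturbation."""
    scores = np.tanh(cell_embeddings @ w) @ v      # one score per cell
    scores = scores - scores.max()                 # numerical stability
    alpha = np.exp(scores) / np.exp(scores).sum()  # attention weights
    return alpha @ cell_embeddings, alpha
```

Inspecting `alpha` is what makes the mechanism interpretable: cells with the highest weights are the ones the model treats as carrying the strongest perturbation signal.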
Key Results:
- Outperforms CLOOME and MolPhenix on cross-modal retrieval
- Strong zero-shot gene–gene relationship recovery on RxRx3-core benchmark
- Works across ALL perturbation types (compounds + CRISPR + ORF)
- Attention pooling identifies biologically meaningful cells
Limitations:
- Large model size (1.48B parameters)
- Limited to datasets collected in prior works — no additional wet lab validation
- High cost of generating Cell Painting data limits open benchmarking
Relevance to MorphoCLIP: CellCLIP is our closest prior work and primary comparison. MorphoCLIP extends this direction by exploring richer text prompts with biological knowledge, alternative VLM backbones, and combined training strategies.
3.4 CWA-MSN (2025)
- Paper: Efficient Cell Painting Image Representation Learning via Cross-Well Aligned Masked Siamese Network
- Link: https://arxiv.org/abs/2509.19896
Approach: Self-supervised masked siamese network that aligns embeddings of cells subjected to the same perturbation across different wells/batches/plates. This cross-well alignment naturally handles batch effects without requiring explicit proxy labels.
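The cross-well alignment idea can be written as a simple regularizer: embeddings of the same perturbation drawn from different wells are pulled together, so variation the encoder cannot attribute to the perturbation (batch effects) is suppressed. A sketch of the alignment term alone, with the masked siamese machinery omitted:

```python
import numpy as np

def cross_well_alignment_loss(embeddings, perturbation_ids, well_ids):
    """Mean cosine distance between embedding pairs that share a
    perturbation but come from different wells (alignment term only)."""
    emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    distances = []
    n = len(emb)
    for i in range(n):
        for j in range(i + 1, n):
            if (perturbation_ids[i] == perturbation_ids[j]
                    and well_ids[i] != well_ids[j]):
                distances.append(1.0 - float(emb[i] @ emb[j]))
    return float(np.mean(distances)) if distances else 0.0
```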
Key Results:
- Outperforms CellCLIP (+9%) and OpenPhenom (+29%) on gene–gene benchmarks
- Only 22M parameters (vs. CellCLIP’s 1.48B) and 0.2M training images
- Superior parameter and data efficiency
Relevance to MorphoCLIP: Cross-well alignment is a powerful technique for handling batch effects. We should consider incorporating this as a regularization strategy alongside our contrastive learning objective.
4. Self-Supervised and Foundation Models
4.1 OpenPhenom / Phenom Series — Kraus et al. (2024), Kenyon-Dean et al. (2024)
- OpenPhenom: Publicly available ViT-S/16, self-supervised on >10M Cell Painting images from RxRx3 + cpg0016
- Phenom-1: 307M parameters, trained on 93M images (proprietary)
- Phenom-2: 1.86B parameters, trained on 16M images (proprietary)
These serve as strong unimodal baselines — image-only models without cross-modal alignment.
4.2 SSL for Cell Painting — Moshkov et al. (2025)
- Venue: Scientific Reports
Trained DINO, MAE, and SimCLR on a JUMP Cell Painting subset. DINO (ViT-S/16) surpassed CellProfiler in drug target and gene family classification with remarkable zero-shot generalizability to unseen genetic perturbation datasets, without fine-tuning.
5. Summary: Landscape of Methods
| Method | Year | Type | Perturbations | Image Encoder | Perturbation Encoder | Params | Key Metric |
|---|---|---|---|---|---|---|---|
| CellProfiler | — | Handcrafted | All | Feature extraction | N/A | N/A | mAP: 5–25% |
| CLOOME | 2023 | Contrastive | Compounds only | ResNet | Morgan FP MLP | ~25M | 70× random |
| MolPhenix | 2024 | Contrastive | Compounds only | Phenom-1 (frozen) | GNN (MolGPS) | ~36M | 8.78× CLOOME |
| OpenPhenom | 2024 | Self-supervised | Unimodal | ViT-S/16 | N/A | 25M | CORUM: 0.300 |
| CellCLIP | 2025 | Contrastive | All (via text) | CrossChannelFormer | LLM text encoder | 1,477M | CORUM: 0.354 |
| CWA-MSN | 2025 | Self-supervised | Unimodal | ViT-S/16 (masked) | N/A | 22M | CORUM: 0.386 |
| MorphoCLIP | 2025 | Contrastive | All (via text) | VLM-adapted | Biomedical LLM | TBD | Target: >0.386 |
6. Key Gaps and Opportunities
- Richer text prompts: CellCLIP uses simple templates. Enriching with pathway information, protein function, DrugBank descriptions could improve the text encoder’s biological understanding.
- Parameter efficiency: CellCLIP is 1.48B params while CWA-MSN achieves better results with 22M. There is room for a parameter-efficient VLM approach.
- Batch effect handling: Most contrastive methods don’t explicitly address batch effects. Combining cross-well alignment (CWA-MSN) with cross-modal learning (CellCLIP) is unexplored.
- Evaluation on CPJUMP1 specifically: Most recent methods evaluate on RxRx3-core. Directly improving the compound–gene match rate on CPJUMP1 (the original 5–25% problem) remains underexplored.
- VLM reasoning: No prior work uses the generative capabilities of VLMs — the ability to describe what morphological changes a perturbation caused in natural language. This could enable interpretability and hypothesis generation.
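To make the first gap concrete, a prompt builder could extend CellCLIP's template with pathway and protein-function annotations. The sketch below is entirely hypothetical: the field names, template wording, and example values are illustrative assumptions, not an existing API:

```python
def build_enriched_prompt(cell_type, perturbation, smiles=None,
                          pathway=None, protein_function=None):
    """Hypothetical enrichment of a CellCLIP-style prompt with
    biological annotations (e.g. drawn from pathway databases)."""
    parts = [f"A Cell Painting image of {cell_type} cells "
             f"treated with {perturbation}"]
    if smiles:
        parts.append(f"SMILES: {smiles}")
    if pathway:
        parts.append(f"target pathway: {pathway}")
    if protein_function:
        parts.append(f"target function: {protein_function}")
    return ", ".join(parts) + "."
```

Whether such enrichment helps is an empirical question for MorphoCLIP: longer prompts give the text encoder more biology to latch onto, but also more opportunity to overfit to annotation artifacts.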
References
- Chandrasekaran, S.N., et al. (2024). Three million images and morphological profiles of cells treated with matched chemical and genetic perturbations. Nature Methods, 21, 1114–1121.
- Lu, M., Weinberger, E., Kim, C., & Lee, S.-I. (2025). CellCLIP: Learning Perturbation Effects in Cell Painting via Text-Guided Contrastive Learning. NeurIPS 2025.
- Sánchez-Fernández, A., et al. (2023). CLOOME: Contrastive learning unlocks bioimaging databases for queries with chemical structures. Nature Communications.
- Fradkin, P., et al. (2024). How Molecules Impact Cells: Unlocking Contrastive PhenoMolecular Retrieval. NeurIPS 2024.
- CWA-MSN (2025). Efficient Cell Painting Image Representation Learning via Cross-Well Aligned Masked Siamese Network. arXiv:2509.19896.
- Kraus, O., et al. (2024). OpenPhenom. arXiv.
- Moshkov, N., et al. (2025). Self-supervision advances morphological profiling by unlocking powerful image representations. Scientific Reports.
- Seal, S., et al. (2024). Cell Painting: A Decade of Discovery and Innovation in Cellular Imaging. Nature Methods.