Text Encoder

The MorphoCLIP text encoder maps perturbation metadata to 512-d embeddings using BioClinical ModernBERT (150M parameters, frozen during training).

Architecture

Base model: BioClinical ModernBERT (frozen) — encodes text into 768-d embeddings
Pooling: CLS token pooling (default) or mean pooling (configurable via pooling parameter)
Projection head (trainable): Linear (768 → 512) → LayerNorm → GELU → Dropout → Linear (512 → 512) → L2 normalize
Max sequence length: 256 tokens
Output: 512-d L2-normalized embeddings in the shared contrastive space
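
The pooling step and projection head listed above can be sketched in PyTorch roughly as follows. This is a minimal illustration, not the project's actual code: the names pool and ProjectionHead, the dropout rate of 0.1, and the random tensor standing in for ModernBERT hidden states are all assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def pool(hidden: torch.Tensor, mask: torch.Tensor, pooling: str = "cls") -> torch.Tensor:
    """Reduce (batch, seq, 768) token states to (batch, 768)."""
    if pooling == "cls":
        return hidden[:, 0]  # first token of each sequence
    if pooling == "mean":
        m = mask.unsqueeze(-1).to(hidden.dtype)  # (batch, seq, 1)
        # Average over non-padding tokens only.
        return (hidden * m).sum(dim=1) / m.sum(dim=1).clamp(min=1.0)
    raise ValueError(f"unknown pooling: {pooling!r}")


class ProjectionHead(nn.Module):
    """Linear -> LayerNorm -> GELU -> Dropout -> Linear, then L2 normalize."""

    def __init__(self, in_dim: int = 768, out_dim: int = 512, dropout: float = 0.1):
        super().__init__()  # dropout=0.1 is an assumed value
        self.net = nn.Sequential(
            nn.Linear(in_dim, out_dim),
            nn.LayerNorm(out_dim),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(out_dim, out_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.normalize(self.net(x), p=2, dim=-1)


# Random stand-in for frozen encoder output: 4 texts, 256 tokens, 768-d.
hidden = torch.randn(4, 256, 768)
mask = torch.ones(4, 256, dtype=torch.long)

head = ProjectionHead().eval()  # eval() disables dropout
with torch.no_grad():
    emb = head(pool(hidden, mask, pooling="mean"))
```

Because the output is L2-normalized, a dot product between two embeddings equals their cosine similarity, which is the usual choice for a contrastive objective.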

Raw 768-d BERT features are cached separately from the projected 512-d output, so the projection head can be retrained without re-running BERT (~10x speedup).
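
The caching idea can be illustrated with a small disk cache keyed by a hash of the input text. This is only a sketch under stated assumptions: the FeatureCache class, the JSON file layout, and the fake_encoder stand-in are hypothetical and do not reflect how MorphoCLIP actually stores its features.

```python
import hashlib
import json
import tempfile
from pathlib import Path


class FeatureCache:
    """Disk cache for raw encoder features, keyed by a hash of the input text."""

    def __init__(self, cache_dir, encode_fn):
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(parents=True, exist_ok=True)
        self.encode_fn = encode_fn  # the expensive frozen-encoder call

    def _path(self, text):
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        return self.cache_dir / f"{digest}.json"

    def get(self, text):
        path = self._path(text)
        if path.exists():
            # Cache hit: skip the encoder entirely.
            return json.loads(path.read_text())
        features = self.encode_fn(text)
        path.write_text(json.dumps(features))
        return features


calls = []


def fake_encoder(text):
    """Placeholder for the frozen encoder; returns a dummy 768-d vector."""
    calls.append(text)
    return [float(len(text))] * 768


cache = FeatureCache(tempfile.mkdtemp(), fake_encoder)
first = cache.get("example perturbation text")   # runs the encoder
second = cache.get("example perturbation text")  # served from disk
```

Only the cheap projection head then needs to run over the cached 768-d vectors when it is retrained.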

Usage modes

The text encoder can run in two modes:

1. Quick test (no external data)

Uses built-in sample metadata (5 perturbations).


pdm run text-encoder

2. CPJUMP1 external metadata

Loads perturbation metadata from the TSV files in data/metadata/external_metadata/. Fetch the metadata first if needed:


pdm run fetch-dataset --metadata   # downloads platemaps + external_metadata
pdm run text-encoder --metadata-dir data/metadata/external_metadata
pdm run text-encoder --metadata-dir data/metadata/external_metadata --limit 20  # first 20 only

Metadata sources

External TSVs (recommended): JUMP-Target-1_compound_metadata_targets.tsv, JUMP-Target-1_crispr_metadata.tsv, and JUMP-Target-1_orf_metadata.tsv in data/metadata/external_metadata/. Download them with pdm run fetch-dataset --metadata.
Profile CSVs: if you have *_normalized_feature_select_negcon_batch.csv.gz files with Metadata_PlateType, Metadata_Name, and related columns, use load_metadata_from_csv() and metadata_from_cpjump1_row().
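
For orientation, reading one metadata record per unique perturbation from a gzipped profile CSV might look like the sketch below. It uses only the standard library and is not the project's load_metadata_from_csv(); the function read_unique_perturbations, the deduplication key, and the placeholder rows in the demo file are assumptions made for illustration.

```python
import csv
import gzip
import tempfile
from pathlib import Path


def read_unique_perturbations(csv_gz_path, limit=None):
    """Return one metadata dict per unique (PlateType, Name) pair."""
    seen = set()
    records = []
    with gzip.open(csv_gz_path, "rt", newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            key = (row["Metadata_PlateType"], row["Metadata_Name"])
            if key in seen:
                continue  # replicate well of a perturbation already collected
            seen.add(key)
            # Keep only the Metadata_* columns; drop numeric features.
            records.append({k: v for k, v in row.items() if k.startswith("Metadata_")})
            if limit is not None and len(records) >= limit:
                break
    return records


# Tiny demo file with placeholder values (not real CPJUMP1 data).
demo_path = Path(tempfile.mkdtemp()) / "demo_profiles.csv.gz"
with gzip.open(demo_path, "wt", newline="", encoding="utf-8") as handle:
    writer = csv.writer(handle)
    writer.writerow(["Metadata_PlateType", "Metadata_Name", "feature_0"])
    writer.writerow(["placeholder_type", "placeholder_a", "0.1"])
    writer.writerow(["placeholder_type", "placeholder_a", "0.2"])
    writer.writerow(["placeholder_type", "placeholder_b", "0.3"])

records = read_unique_perturbations(demo_path)
```

Profile files hold one row per well, so deduplicating before encoding avoids embedding the same perturbation text many times.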

Options

--cell-line U2OS — filter by cell line
--limit N — process first N perturbations only
--cache-dir PATH — HuggingFace model cache directory
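
As a rough picture of how these flags fit together, the following argparse sketch mirrors the documented options. The real command's implementation is not shown in this document, so build_parser, the defaults, and the help strings are assumptions.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the documented text-encoder flags."""
    parser = argparse.ArgumentParser(prog="text-encoder")
    parser.add_argument("--metadata-dir", default=None,
                        help="directory of external metadata TSVs")
    parser.add_argument("--cell-line", default=None,
                        help="filter by cell line, e.g. U2OS")
    parser.add_argument("--limit", type=int, default=None,
                        help="process first N perturbations only")
    parser.add_argument("--cache-dir", default=None,
                        help="HuggingFace model cache directory")
    return parser


# Parse an explicit argument list instead of sys.argv for demonstration.
args = build_parser().parse_args([
    "--metadata-dir", "data/metadata/external_metadata",
    "--cell-line", "U2OS",
    "--limit", "20",
])
```

In this sketch, omitting --metadata-dir would correspond to the quick-test mode that uses the built-in sample metadata.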