Data Fetching
We use the CPJUMP1 dataset from the JUMP Cell Painting Consortium: a curated benchmark of over 3 million Cell Painting images covering 303 compounds and 160 genes across two cell lines (U2OS, A549) and multiple timepoints.
Images are fetched from the Cell Painting Gallery S3 bucket using a unified downloader (scripts/fetch_dataset.py) that supports both AWS CLI and rclone. Dataset configuration lives in configs/dataset.yml.
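The exact schema of configs/dataset.yml is project-specific, but the keys referenced in this document (cpjump.fetch.backend, cpjump.compression.default, cpjump.compression.experiment) suggest a layout along these lines. This is an illustrative sketch only; everything other than those named keys and the documented default compression values is a placeholder:

```yaml
# Hypothetical sketch of configs/dataset.yml. Only the key paths and the
# default compression values are taken from this document; the rest is
# illustrative.
cpjump:
  fetch:
    backend: awscli        # or: rclone (overridable via --backend)
  compression:
    default:               # non-experiment settings documented below
      codec: jpeg
      quality: 50
      normalization: percentile
    experiment: {}         # separate experiment/QC settings live here
```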
Commands
# Check plate sizes before downloading (dry-run estimate)
pdm run check-plates
# Fetch dataset (backend from config: cpjump.fetch.backend)
pdm run fetch-dataset
# Override backend explicitly
pdm run fetch-dataset --backend awscli
pdm run fetch-dataset --backend rclone
# Existing-plate behavior
pdm run fetch-dataset --on-existing-plate ask # prompt y/n per existing plate
pdm run fetch-dataset --on-existing-plate skip # reuse local raw files
pdm run fetch-dataset --on-existing-plate redownload # force sync even if present
# Existing-compressed behavior
pdm run fetch-dataset --on-existing-compressed ask # prompt per plate
pdm run fetch-dataset --on-existing-compressed skip # do not recompress plate
pdm run fetch-dataset --on-existing-compressed recompress # overwrite compressed output
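The `--on-existing-plate` and `--on-existing-compressed` flags both reduce to a small per-plate decision. A minimal Python sketch of that logic, assuming a helper like the one below (the function name and prompt wording are hypothetical; the real logic lives in scripts/fetch_dataset.py):

```python
def resolve_existing(mode: str, ask_user=input) -> bool:
    """Return True if the step (download or compression) should run again.

    mode is one of the --on-existing-* values:
      - "ask":  prompt y/n per plate
      - "skip": reuse what is already on disk
      - "redownload" / "recompress": force the step even if output exists
    """
    if mode == "skip":
        return False
    if mode in ("redownload", "recompress"):
        return True
    if mode == "ask":
        answer = ask_user("Output exists for this plate. Redo? [y/N] ")
        return answer.strip().lower() == "y"
    raise ValueError(f"unknown mode: {mode!r}")
```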
# Metadata only
pdm run fetch-dataset --metadata
# Dry-run transfer commands (recommended before full run)
pdm run fetch-dataset --dry-run --backend rclone
Space-Restricted Download Mode
If local disk is limited, run download + compression + cleanup in one pass:
pdm run fetch-dataset \
--compress-after-download \
--delete-original-after-compress \
--compression-workers 8
This performs, per plate:
- Download original TIFFs to data/raw/<batch>/<plate>/Images/
- Compress to data/raw_compressed/<batch>/<plate>/Images/
- Delete original TIFFs (if enabled)
Default non-experiment compression settings are in cpjump.compression.default:
codec: jpeg
quality: 50
normalization: percentile
Compression experiment/QC settings are separate and live under cpjump.compression.experiment.
Data Strategy
Raw TIFFs (~58-118 GiB per plate) are streamed, processed through the frozen DINOv3 backbone to extract per-channel CLS features, then deleted. All 51 plates compress from 3.31 TiB to ~153 GB of cached features (+ ~459 GB resized tensors for LoRA experiments). Feature extraction takes ~7 minutes per plate on an RTX 5080. See Feature Extraction Pipeline for full details.
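The storage and throughput figures above are easy to sanity-check using only the numbers stated in this section (51 plates, 3.31 TiB raw, ~153 GB of cached features, ~7 minutes per plate):

```python
# Back-of-envelope check of the figures quoted above.
raw_gb = 3.31 * 1024**4 / 1e9       # 3.31 TiB expressed in decimal GB
features_gb = 153
plates = 51
minutes_per_plate = 7

compression_ratio = raw_gb / features_gb        # roughly 24x reduction
total_hours = plates * minutes_per_plate / 60   # roughly 6 h on one GPU

print(f"{compression_ratio:.1f}x smaller, {total_hours:.2f} h extraction")
```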