Label Generation and Train/Test Split
This document explains the two-step data flow:
- scripts/label_generator.py builds labels.csv from image filenames.
- scripts/train_test_split.py splits labels.csv into train/test label files.
Step 1: Label generation (scripts/label_generator.py)
Inputs
- --batch-folder: batch directory containing: <batch-folder>/<plate-folder>/Images/*.tiff
- --profiles-root: profile root containing: data/profiles/<batch>/<plate>/<plate>_normalized_feature_select_negcon_batch.csv.gz
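The two input layouts above can be sketched with pathlib. This is only an illustration: the function names profile_path and list_images are hypothetical, and it assumes --profiles-root points at the data/profiles level and that the plate name used in the profile path is passed in by the caller.

```python
from pathlib import Path


def profile_path(profiles_root: str, batch: str, plate: str) -> Path:
    """Build <profiles-root>/<batch>/<plate>/<plate>_normalized_..._batch.csv.gz."""
    filename = f"{plate}_normalized_feature_select_negcon_batch.csv.gz"
    return Path(profiles_root) / batch / plate / filename


def list_images(batch_folder: str) -> list:
    """List every <batch-folder>/<plate-folder>/Images/*.tiff file, sorted by path."""
    return sorted(Path(batch_folder).glob("*/Images/*.tiff"))
```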
Plate and filename parsing
- platecode is extracted from the plate folder name by splitting on _ and taking the first token.
- Image files are parsed with the pattern rXXcXXfXXpXX-chXsk1fk1fl1.tiff or rXXcXXfXXpXX-chXXsk1fk1fl1.tiff
- Extracted fields:
  - row (for example r01)
  - col (for example c12)
  - site (for example f03)
  - channel (normalized as ch01, ch02, …)
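A minimal sketch of the parsing rules above. The regular expression is an assumption reconstructed from the documented pattern, not the script's actual expression, and the names FILENAME_RE, parse_platecode, and parse_image_name are hypothetical.

```python
import re

# Assumed pattern for names such as r01c12f03p01-ch1sk1fk1fl1.tiff.
# The channel may be one or two digits (chX or chXX).
FILENAME_RE = re.compile(
    r"^r(?P<row>\d{2})c(?P<col>\d{2})f(?P<site>\d{2})p\d{2}"
    r"-ch(?P<channel>\d{1,2})sk1fk1fl1\.tiff$"
)


def parse_platecode(plate_folder: str) -> str:
    """Return the first underscore-separated token of the plate folder name."""
    return plate_folder.split("_")[0]


def parse_image_name(filename: str):
    """Return the row/col/site/channel fields, or None if the name does not match."""
    match = FILENAME_RE.match(filename)
    if match is None:
        return None
    return {
        "row": f"r{match['row']}",
        "col": f"c{match['col']}",
        "site": f"f{match['site']}",
        # Zero-pad so that ch1 and ch01 both normalize to ch01.
        "channel": f"ch{int(match['channel']):02d}",
    }
```

Returning None for non-matching names lets the caller decide whether to skip or report unexpected files.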
Well construction
Well is derived from row/col:
- row number -> ASCII letter (1 -> A, 2 -> B, …)
- column -> 2-digit number
- example: r01, c01 -> A01
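The row/column mapping can be sketched as follows. The function name well_from_row_col is hypothetical, and the sketch assumes inputs shaped like r01 and c01 with row numbers that map onto the letters A to Z.

```python
def well_from_row_col(row: str, col: str) -> str:
    """Convert parsed fields such as ('r01', 'c01') into a well id such as 'A01'."""
    row_number = int(row[1:])  # strip the leading 'r'
    col_number = int(col[1:])  # strip the leading 'c'
    letter = chr(ord("A") + row_number - 1)  # 1 -> A, 2 -> B, ...
    return f"{letter}{col_number:02d}"
```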
Channel completeness grouping
Rows are grouped by:
platecode, row, col, site, Well
For each group:
- channel_count = nunique(channel)
- complete group: channel_count >= 5
- incomplete group: channel_count < 5
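The completeness rule can be illustrated without any dependencies. This is a sketch, not the script's code (the nunique wording suggests the script itself uses pandas), and the names GROUP_KEYS, MIN_CHANNELS, and split_by_channel_completeness are hypothetical.

```python
from collections import defaultdict

GROUP_KEYS = ("platecode", "row", "col", "site", "Well")
MIN_CHANNELS = 5  # threshold for a complete group, as stated above


def split_by_channel_completeness(rows):
    """Group image rows and separate complete groups from incomplete ones.

    rows is an iterable of dicts holding the GROUP_KEYS columns plus 'channel'.
    """
    channels = defaultdict(set)
    for row in rows:
        key = tuple(row[name] for name in GROUP_KEYS)
        channels[key].add(row["channel"])  # a set counts distinct channels only

    complete, incomplete = [], []
    for key, seen in channels.items():
        record = dict(zip(GROUP_KEYS, key), channel_count=len(seen))
        if len(seen) >= MIN_CHANNELS:
            complete.append(record)
        else:
            incomplete.append(record)
    return complete, incomplete
```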
Metadata join
For each plate, the matching profile CSV is loaded and only metadata columns are kept:
- columns starting with Meta_ or Metadata_
Grouped image rows are left-joined with profile metadata on:
- image: Well
- profile: Metadata_Well
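The column filter and the left join can be sketched in plain Python. The function names are hypothetical, and the sketch assumes at most one profile row per well. Unmatched image groups are kept without metadata keys, whereas a DataFrame-based join would fill those columns with missing values.

```python
def keep_metadata_columns(profile_row: dict) -> dict:
    """Keep only the columns whose names start with Meta_ or Metadata_."""
    return {
        name: value
        for name, value in profile_row.items()
        if name.startswith(("Meta_", "Metadata_"))
    }


def left_join_metadata(groups: list, profile_rows: list) -> list:
    """Left-join image groups (key: Well) with profile rows (key: Metadata_Well)."""
    by_well = {}
    for profile_row in profile_rows:
        meta = keep_metadata_columns(profile_row)
        # Assumption: one profile row per well; the first one wins otherwise.
        by_well.setdefault(meta.get("Metadata_Well"), meta)

    # Every image group survives the join, matched or not.
    return [{**group, **by_well.get(group["Well"], {})} for group in groups]
```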
Label generator outputs
- labels.csv: complete groups plus joined metadata
- incomplete_channel_wells.csv: incomplete groups plus channel_count and joined metadata
Step 2: Train/test split (scripts/train_test_split.py)
Input expectations
The split script reads labels.csv and derives:
- plate identifier from platecode or Metadata_Plate or batch
- well identifier from Well or Metadata_Well or (row, col)
Normalized fields used for split
For each row:
- batch = plate number
- Well = normalized well id
- UNIQUE_SAMPLE_KEY = {batch}-{Well}
- SAMPLE_KEY = same as UNIQUE_SAMPLE_KEY
- treatment = Well
- prompt = empty string ("") placeholder
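The field derivation might look like the sketch below. The helper names are hypothetical, and two details are assumptions: the first non-empty candidate column is used as-is (the script may normalize plate numbers and well ids further), and the (row, col) fallback expects values shaped like r01 and c01.

```python
def first_present(row: dict, names):
    """Return the first non-empty value among the candidate column names."""
    for name in names:
        value = row.get(name)
        if value not in (None, ""):
            return str(value)
    return None


def normalize_row(row: dict) -> dict:
    """Add the fields used by the split to a copy of one labels.csv row."""
    batch = first_present(row, ("platecode", "Metadata_Plate", "batch"))
    well = first_present(row, ("Well", "Metadata_Well"))
    if well is None:
        # Fall back to (row, col): r02, c03 -> B03.
        letter = chr(ord("A") + int(row["row"][1:]) - 1)
        well = f"{letter}{int(row['col'][1:]):02d}"
    key = f"{batch}-{well}"
    return {
        **row,
        "batch": batch,
        "Well": well,
        "UNIQUE_SAMPLE_KEY": key,
        "SAMPLE_KEY": key,
        "treatment": well,
        "prompt": "",  # placeholder
    }
```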
Split unit and grouping
Split happens at unique well key level:
- Deduplicate by UNIQUE_SAMPLE_KEY.
- Group (default: batch).
- Sort unique treatments (wells) and take first floor(train_ratio * N) for train.
- Apply selected keys back to full rows.
This preserves all rows for the same well (for example all sites of a well) in the same split.
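The procedure above can be sketched as follows. This is a dependency-free illustration under two assumptions: N is the number of unique treatments within a group, and every key not selected for train goes to test. The function name split_rows is hypothetical.

```python
import math
from collections import defaultdict


def split_rows(rows: list, train_ratio: float = 0.75, group_columns=("batch",)):
    """Split rows into (train, test) at the UNIQUE_SAMPLE_KEY level."""
    # 1. Deduplicate by UNIQUE_SAMPLE_KEY (keep the first row per key).
    unique = {}
    for row in rows:
        unique.setdefault(row["UNIQUE_SAMPLE_KEY"], row)

    # 2. Group the deduplicated rows.
    groups = defaultdict(list)
    for row in unique.values():
        groups[tuple(row[name] for name in group_columns)].append(row)

    # 3. Within each group, sort the unique treatments and take the first
    #    floor(train_ratio * N) of them for train.
    train_keys = set()
    for members in groups.values():
        treatments = sorted({row["treatment"] for row in members})
        n_train = math.floor(train_ratio * len(treatments))
        selected = set(treatments[:n_train])
        train_keys.update(
            row["UNIQUE_SAMPLE_KEY"] for row in members if row["treatment"] in selected
        )

    # 4. Apply the selected keys back to the full rows.
    train = [row for row in rows if row["UNIQUE_SAMPLE_KEY"] in train_keys]
    test = [row for row in rows if row["UNIQUE_SAMPLE_KEY"] not in train_keys]
    return train, test
```

Because the selection is made on keys and then applied to all rows, every site of a given well lands on the same side of the split.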
Train/test split outputs
- jumpcp_training_label2.csv
- jumpcp_testing_label2.csv
Both files are written to --output-dir (default: output/benchmark/input).
CLI example
./.venv/bin/python scripts/label_generator.py \
  --batch-folder data/raw/2020_11_04_CPJUMP1 \
  --profiles-root data/profiles \
  --output-dir output

./.venv/bin/python scripts/train_test_split.py \
  --label-file output/labels.csv \
  --output-dir output/benchmark/input \
  --train-ratio 0.75 \
  --group-columns batch
Validation checks
The split script validates:
- required columns for splitting exist (group-columns, treatment, UNIQUE_SAMPLE_KEY)
- train and test key sets are disjoint
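Both checks can be expressed in a few lines. The sketch below collects problems into a list rather than raising, which is a design choice for illustration only. The function name find_split_problems is hypothetical, and how the script actually reports failures is not documented here.

```python
def find_split_problems(columns, train_keys, test_keys, group_columns=("batch",)):
    """Return a list of human-readable problems; an empty list means the split is valid."""
    problems = []

    # Check 1: the columns required for splitting exist.
    required = set(group_columns) | {"treatment", "UNIQUE_SAMPLE_KEY"}
    missing = sorted(required - set(columns))
    if missing:
        problems.append(f"missing required columns: {missing}")

    # Check 2: no key appears in both train and test.
    overlap = sorted(set(train_keys) & set(test_keys))
    if overlap:
        problems.append(f"keys present in both train and test: {overlap}")

    return problems
```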