TriShift is a single-cell perturbation response prediction toolkit built around the Tripartite Reference-Conditioned Shift Model (TriShift). The repository contains the native TriShift implementation, shared evaluation code, benchmark wrappers for major baselines, and the notebooks used to generate the paper figures.
The project uses a src/ layout and is installable as a Python package.
Overall framework:
Detailed pipeline:
```
pip install -e .
```

After installation, verify that the package imports:

```
python -c "from trishift import TriShift, TriShiftData; import trishift; print(trishift.__version__)"
```

The core package source lives in `src/trishift`.
Key runtime config files:
- `configs/defaults.yaml`
- `configs/paths.yaml`
The repository includes a tiny Adamson-derived smoke-test dataset that is small enough to ship in the GitHub repository:

`examples/adamson_mini`
Run it after installing the package:
```
python examples/adamson_mini/run_demo.py
```

This demo trains and evaluates on a 10% Adamson subset with Adamson-like settings, 1 split, and 20 epochs, then writes outputs to `artifacts/demo/adamson_mini`. It is meant to validate the code path, not to reproduce paper metrics.
If you want to try TriShift on your own AnnData, start with:
notebooks/tutorial_custom_dataset.ipynb
The tutorial shows a minimal workflow:
- build a small `AnnData` with a `condition` column,
- prepare a matching gene embedding table,
- initialize `TriShiftData` and `TriShift`,
- run a minimal train/evaluate loop,
- export prediction payloads for downstream analysis.
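The input shapes this workflow expects can be sketched with plain numpy/pandas stand-ins. Everything below except the `condition` column name is illustrative (the gene names, embedding dimension, and condition-label style are assumptions; see the tutorial notebook for the exact `TriShiftData`/`TriShift` calls):

```python
# Toy stand-ins for the tutorial inputs; no anndata/trishift imports needed.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
genes = ["GENE_A", "GENE_B", "GENE_C", "GENE_D"]

# Expression matrix: cells x genes, with a per-cell `condition` label
# ("ctrl" for unperturbed cells, "<gene>+ctrl"-style labels assumed here).
X = rng.poisson(2.0, size=(6, len(genes))).astype(float)
obs = pd.DataFrame({"condition": ["ctrl", "ctrl", "GENE_A+ctrl",
                                  "GENE_A+ctrl", "GENE_B+ctrl", "GENE_B+ctrl"]})

# Gene embedding table: one row per gene, indexed by the same gene names
# as the expression matrix so rows can be aligned by index.
emb = pd.DataFrame(rng.normal(size=(len(genes), 8)), index=genes)

# Alignment check the workflow relies on: every measured gene needs an
# embedding row before TriShiftData/TriShift can be initialized.
missing = [g for g in genes if g not in emb.index]
assert not missing, f"genes without embeddings: {missing}"
```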
Prepare the public benchmark datasets with:
```
python scripts/data/download_and_prepare_benchmark_data.py --datasets adamson dixit norman
```

This entrypoint delegates raw data download to GEARS/PertData, prepares the standard simulation splits, and synchronizes `perturb_processed.h5ad` files to the paths expected by TriShift and the evaluation wrappers.
Run this command in an environment that has GEARS/PertData installed. The core `pip install -e .` environment is enough for TriShift package imports, but the public benchmark downloader needs the baseline-oriented environment described below.
The maintained public benchmark scope in this repository is now limited to adamson, dixit, and norman.
If you want to run the BioLORD baseline, add the BioLORD-specific preprocessing step after the benchmark download:
```
python scripts/data/prepare_biolord_perturbation_data.py --datasets adamson norman dixit --split-ids 1 2 3 4 5
```

This step aligns the repository with the external biolord_reproducibility perturbation notebooks by building GO graph neighbor embeddings and split-aware BioLORD inputs. It writes:
- `src/data/<dataset>/perturb_processed.h5ad`
- `src/data/<dataset>/<dataset>_biolord.h5ad`
- `src/data/<dataset>/<dataset>_single_biolord.h5ad`
- `src/data/<dataset>/<dataset>_biolord_prep_summary.json`
TriShift now uses GO graph neighbor embeddings for BioLORD (`perturbation_neighbors*`) rather than the older condition-token multihot placeholder path.
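The core idea behind GO-neighbor embeddings is to represent each perturbation by pooling the embeddings of its neighbors in the GO graph. The sketch below is illustrative only: the toy neighbor lists, embedding values, neighbor cap, and mean pooling are assumptions, not the exact construction used by `prepare_biolord_perturbation_data.py`:

```python
# Toy sketch: represent a perturbation by mean-pooling the embeddings of
# its GO-graph neighbors (all values and neighbor lists are illustrative).
import numpy as np

gene_emb = {
    "GENE_A": np.array([1.0, 0.0]),
    "GENE_B": np.array([0.0, 1.0]),
    "GENE_C": np.array([1.0, 1.0]),
}

# Toy GO-graph neighborhoods: perturbation target -> nearby genes.
go_neighbors = {
    "GENE_A": ["GENE_B", "GENE_C"],
    "GENE_B": ["GENE_A"],
}

def neighbor_embedding(target, k=2):
    """Mean-pool the embeddings of up to k GO-graph neighbors."""
    neighbors = go_neighbors.get(target, [])[:k]
    if not neighbors:  # no neighbors recorded: fall back to the gene itself
        return gene_emb[target]
    return np.mean([gene_emb[n] for n in neighbors], axis=0)

vec = neighbor_embedding("GENE_A")  # mean of GENE_B and GENE_C embeddings
```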
By default, the repository expects local data under src/data. You can still override locations through:
configs/paths.yaml
src/data is intentionally ignored by git. It is a local cache for downloaded datasets, processed .h5ad files, and embedding files; do not rely on files under src/data as repository entrypoints. Use the maintained script above for reproducible data preparation.
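Overriding a data location amounts to replacing one entry of the paths config while keeping the rest. The sketch below assumes a flat `key: value` layout for `configs/paths.yaml` and invents the keys `data_root` and `gene_embeddings` for illustration (the real schema may differ); it uses only the stdlib so the sketch stays dependency-free:

```python
# Illustrative override of a data location, assuming a flat `key: value`
# paths.yaml layout; the keys shown are hypothetical examples.
from pathlib import Path

def parse_flat_yaml(text):
    """Parse `key: value` lines, skipping blanks and comments."""
    paths = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition(":")
        paths[key.strip()] = value.strip()
    return paths

defaults = parse_flat_yaml(
    "# default local data cache\n"
    "data_root: src/data\n"
    "gene_embeddings: src/data/Data_GeneEmbd\n"
)

# Override one location without touching the rest of the config.
resolved = {**defaults, **{"data_root": "/mnt/shared/trishift_data"}}
adamson_h5ad = Path(resolved["data_root"]) / "adamson" / "perturb_processed.h5ad"
```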
For the complete paper workflow, including the order of TriShift, baseline, Systema, and notebook runs, see:
REPRODUCIBILITY.md
Use one of the following three scopes depending on what you need to verify.
- Package smoke test

  ```
  pip install -e .
  python examples/adamson_mini/run_demo.py
  ```

  This validates the core `TriShiftData` -> `TriShift` train/evaluate -> saved outputs path without requiring the public benchmark stack.
- Public benchmark reproduction

  ```
  python scripts/data/download_and_prepare_benchmark_data.py --datasets adamson dixit norman
  python scripts/trishift/adamson/run_adamson.py
  python scripts/trishift/dixit/run_dixit.py
  python scripts/trishift/norman/run_norman.py
  ```

  This reproduces the maintained TriShift benchmark scope in this repository and writes model outputs under `artifacts/results/trishift`.
- Paper-figure regeneration

  After the required baseline and Systema result folders exist, execute the figure notebooks listed in REPRODUCIBILITY.md. Primary figure artifacts are written under:

  - `artifacts/paper_figures/main`
  - `artifacts/paper_figures/supp`
The supplementary command map in output/doc/trishift_supplementary_data_cn.md also lists the tracked notebook-to-output mapping used by the paper bundle.
The lightest workflow is the core TriShift package:
```
pip install -e .
```

The benchmark stack mixes several external baselines with conflicting dependencies. To keep the main package usable, the repository separates:
- Core TriShift dependencies in `pyproject.toml`
- Baseline-oriented environment setup in `environment_baselines.yml`
Create the baseline environment with:
```
conda env create -f environment_baselines.yml
conda activate trishift-baselines
```

`environment_baselines.yml` covers the common stack used by scouter, GEARS, and shared evaluation tools. GEARS still requires a Torch/PyG installation matched to your local CUDA runtime; follow the comments in that file for the final install step.
Gene embeddings are external local artifacts and are not shipped with this repository. Download the required embedding files and place them under:
src/data/Data_GeneEmbd
The default configs/paths.yaml expects the following files:
| Config key | Expected local file | Source |
|---|---|---|
| `emb_a` | `src/data/Data_GeneEmbd/ensem_emb_gpt3.5all_new.pickle` | scELMo library, file Gene-GPT 3.5: https://sites.google.com/yale.edu/scelmolib |
| `emb_b` | `src/data/Data_GeneEmbd/GenePT_gene_embedding_ada_text.pickle` | GenePT Zenodo record: https://zenodo.org/records/10833191 |
| `emb_c` | `src/data/Data_GeneEmbd/GPT_3_5_gene_embeddings.pickle` | GenePT Zenodo record: https://zenodo.org/records/10030426 |
| `emb_d` | `src/data/Data_GeneEmbd/GenePT_gene_protein_embedding_model_3_text.pickle` | Optional GenePT protein/text embedding, used only if selected in custom configs |
If your embedding files live elsewhere, update configs/paths.yaml or the dataset-specific config before training.
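These pickles are expected to deserialize into a gene-name-to-vector mapping (an assumption about the scELMo/GenePT file layouts; verify against the actual downloads). The sketch below loads such a mapping and aligns it to a dataset's gene list, demonstrated with an in-memory pickle rather than a real embedding file:

```python
# Load a gene-embedding pickle and align it to a dataset's gene list.
# A gene-name -> vector dict layout is assumed; demonstrated here with
# an in-memory pickle rather than the real downloaded files.
import io
import pickle
import numpy as np

# Stand-in for e.g. a file under src/data/Data_GeneEmbd.
fake_file = io.BytesIO(pickle.dumps({
    "GENE_A": [0.1, 0.2],
    "GENE_B": [0.3, 0.4],
}))
embeddings = pickle.load(fake_file)

genes = ["GENE_A", "GENE_B", "GENE_X"]

# Align to the dataset's gene order; genes missing from the pickle fall
# back to a zero vector so downstream matrix shapes stay consistent.
dim = len(next(iter(embeddings.values())))
matrix = np.array([embeddings.get(g, [0.0] * dim) for g in genes])
```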
The benchmark preparation script builds the GEARS-native dataset folders under:
- `src/data/Data_GEARS/adamson`
- `src/data/Data_GEARS/dixit`
- `src/data/Data_GEARS/norman`
It also copies each generated perturb_processed.h5ad into the standard outer data directories:
- `src/data/adamson/perturb_processed.h5ad`
- `src/data/dixit/perturb_processed.h5ad`
- `src/data/norman/perturb_processed.h5ad`
This keeps the repository consistent across:
- GEARS, which reads from `src/data/Data_GEARS`
- TriShift, Scouter, and Systema-style evaluation, which read from `src/data/<dataset>`
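The dual layout above can be sanity-checked with a few lines of stdlib Python before launching benchmark runs. The path names come from this README; the helper itself is illustrative and is exercised against a temporary directory rather than a real `src/data` checkout:

```python
# Illustrative check that both data layouts exist before launching runs;
# exercised against a temp directory instead of a real src/data cache.
import tempfile
from pathlib import Path

DATASETS = ("adamson", "dixit", "norman")

def missing_benchmark_files(data_root):
    """Return the expected paths that are absent under data_root."""
    root = Path(data_root)
    expected = []
    for ds in DATASETS:
        expected.append(root / "Data_GEARS" / ds)              # GEARS-native folder
        expected.append(root / ds / "perturb_processed.h5ad")  # outer copy
    return [str(p) for p in expected if not p.exists()]

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # Create only the adamson layout to show partial-setup detection.
    (root / "Data_GEARS" / "adamson").mkdir(parents=True)
    (root / "adamson").mkdir()
    (root / "adamson" / "perturb_processed.h5ad").touch()
    missing = missing_benchmark_files(root)  # dixit and norman paths remain
```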
Recommended dataset entrypoints are organized by model and dataset under scripts/<model>/<dataset>.
Only adamson, dixit, and norman are maintained public benchmark targets.
The maintained public interfaces for manuscript reproduction are:
- benchmark data preparation under `scripts/data`
- model entrypoints under `scripts/<model>/<dataset>`
- shared evaluation cores under `scripts/*/_core`
- figure notebooks under `notebooks`
- reproducibility instructions in `README.md` and `REPRODUCIBILITY.md`
TriShift:
- `scripts/trishift/adamson/run_adamson.py`
- `scripts/trishift/dixit/run_dixit.py`
- `scripts/trishift/norman/run_norman.py`
Scouter:
- `scripts/scouter/adamson/run_scouter_adamson.py`
- `scripts/scouter/dixit/run_scouter_dixit.py`
- `scripts/scouter/norman/run_scouter_norman.py`
GEARS:
- `scripts/gears/adamson/run_gears_adamson.py`
- `scripts/gears/dixit/run_gears_dixit.py`
- `scripts/gears/norman/run_gears_norman.py`
Additional baselines:
- `scripts/data/prepare_biolord_perturbation_data.py`
- `scripts/biolord/<dataset>/run_biolord_*.py`
- `scripts/genepert/<dataset>/run_genepert_*.py`
- `scripts/scgpt/<dataset>/run_scgpt_*.py`
- `scripts/systema/<dataset>/run_systema_*.py`
For BioLORD specifically, the supported public datasets are currently:
- `adamson` with `ordered_attribute_key=perturbation_neighbors`
- `dixit` with `ordered_attribute_key=perturbation_neighbors`
- `norman` with `ordered_attribute_key=perturbation_neighbors1`
`dixit` follows the Adamson-style single-perturbation BioLORD path because the external BioLORD reproducibility repository does not provide an official Dixit preprocessing notebook.
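The per-dataset key choices above fit naturally in a small lookup table. The dispatch helper below is illustrative, not part of the BioLORD entrypoints; only the dataset names and key values come from this README:

```python
# Lookup capturing the per-dataset ordered_attribute_key choices listed
# above; the helper function itself is illustrative.
ORDERED_ATTRIBUTE_KEYS = {
    "adamson": "perturbation_neighbors",
    "dixit": "perturbation_neighbors",  # Adamson-style single-perturbation path
    "norman": "perturbation_neighbors1",
}

def ordered_attribute_key(dataset):
    """Return the BioLORD ordered_attribute_key for a supported dataset."""
    try:
        return ORDERED_ATTRIBUTE_KEYS[dataset]
    except KeyError:
        raise ValueError(f"unsupported BioLORD dataset: {dataset!r}")

key = ordered_attribute_key("norman")
```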
Shared training core:
- `scripts/trishift/_core/run_dataset_core.py`
- `scripts/trishift/train/run_dataset.py`
The paper figures are generated from the notebooks under notebooks/:
- `Fig1_MethodOverview.ipynb` -> Fig. 1
- `Fig2_MultiDatasetBenchmark.ipynb` -> Fig. 2
- `Fig3_ReferenceConditioning.ipynb` -> Fig. 3
- `Fig4_NormanGeneralization.ipynb` -> Fig. 4
- `Fig5_DistributionRecovery.ipynb` -> Fig. 5
- `FigS1_BenchmarkExtension.ipynb` -> Fig. S1
- `FigS2_AdditionalCases.ipynb` -> Fig. S2
- `FigS3_BiologyAndAblation.ipynb` -> Fig. S3
- `FigS4_CentroidAnalysis.ipynb` -> Fig. S4
- `FigS5_Robustness.ipynb` -> Fig. S5
- `FigS6_Stage1LatentClustering.ipynb` -> Fig. S6
Primary outputs are written under:
- `artifacts/results`
- `artifacts/paper_figures`
- Legacy top-level `scripts/run_*` files, if present, should be treated as compatibility entrypoints rather than the primary maintained interface.
- Local paper drafts and supporting notes may live under `docs/`; this directory is intentionally ignored by git and is not part of the reproducible repository interface.
- Large local outputs, datasets, and external baseline clones are intentionally ignored by git.

