Staged multi-agent pipeline with GPT-5.2, Claude Sonnet 4.5, and Gemini 2.0 Flash collaborating at decision checkpoints.
Key Innovation: Multi-agent consensus at analysis checkpoints reduces errors by 40%+ compared to single-agent and no-agent baselines.
git clone <repo_url>
cd geneexpert
npm install
cp .env.example .env
nano .env   # Add OpenAI, Anthropic, Google API keys

Bioinformatics dependencies (install via conda):
# Core tools
conda install -c bioconda -c conda-forge \
fastqc subread samtools sra-tools poppler
# R packages — bulk RNA-seq
Rscript -e 'if(!require("BiocManager",quietly=TRUE)) install.packages("BiocManager",repos="https://cloud.r-project.org")'
Rscript -e 'BiocManager::install(c("Rsubread","edgeR","Rsamtools"))'
Rscript -e 'install.packages(c("ggplot2","openxlsx"),repos="https://cloud.r-project.org")'
# R packages — scRNA-seq
Rscript -e 'BiocManager::install("Seurat")'
# Optional: seqkit — only needed to generate E. coli contamination datasets
conda install -c bioconda seqkit

| Tool | Stage | Purpose |
|---|---|---|
| fastqc | Bulk 1 | Read quality control |
| subread | Bulk 2 | Genome alignment (subread-align) |
| samtools | Bulk 2 | BAM file processing |
| Rsubread | Bulk 3 | Gene counting (featureCounts) |
| edgeR | Bulk 4 | Differential expression |
| sra-tools | Download | fasterq-dump from NCBI SRA |
| Seurat | scRNA 1-5 | Single-cell analysis |
| poppler | Bulk 4 | PDF plot to image conversion |
| seqkit | Optional | E. coli contamination simulation (7_GSE114845_CONTAM70) |
Bulk RNA-seq (bin/geneexpert.js):
Your dataset folder should contain fastq.gz files, e.g. N61311_untreated_R1_001.fastq.gz.
See ground_truth_supplementary/DATASETS.md for how to download each bulk RNA-seq dataset with the provided download scripts and how to rename the files. The scRNA-seq datasets can be downloaded directly from the provided links.
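The `--control-keyword`/`--treatment-keyword` flags used below assign samples to groups by substring match on their FASTQ filenames. A minimal sketch of that kind of grouping (the function name and exact matching rule are illustrative, not the pipeline's actual code):

```python
def assign_group(filename: str, control_kw: str, treatment_kw: str) -> str:
    """Assign a sample to a group by substring match on its FASTQ filename.

    Illustrative only -- the real pipeline's matching rules may differ.
    """
    name = filename.lower()
    if control_kw.lower() in name:
        return "control"
    if treatment_kw.lower() in name:
        return "treatment"
    return "unassigned"

# Example: --control-keyword "cont" --treatment-keyword "ips"
samples = ["cont_rep1_R1_001.fastq.gz", "ips_rep1_R1_001.fastq.gz"]
groups = {s: assign_group(s, "cont", "ips") for s in samples}
```

This is why consistent renaming of the downloaded files matters: a sample whose name matches neither keyword cannot be placed in a group.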
# --- Parallel (default): 3 agents vote independently ---
node bin/geneexpert.js analyze data/your_dataset \
--staged --organism mouse \
--control-keyword "cont" --treatment-keyword "ips" \
--output results/your_output
# --- Sequential Chain: GPT-5.2 -> Gemini -> Claude, each sees prior responses ---
node bin/geneexpert.js analyze data/your_dataset \
--staged --sequential-chain \
--organism mouse \
--control-keyword "cont" --treatment-keyword "ips" \
--output results/your_output_sequential
# --- Single-LLM: GPT-5.2 called 3x with different roles (also: claude, gemini) ---
node bin/geneexpert.js analyze data/your_dataset \
--staged --single-agent gpt5.2 \
--organism mouse \
--control-keyword "cont" --treatment-keyword "ips" \
--output results/your_output_single_gpt
# --- No-Agent: template-based decisions only, no LLM calls ---
node bin/geneexpert.js analyze data/your_dataset \
--staged --force-automation \
--organism mouse \
--control-keyword "cont" --treatment-keyword "ips" \
--output results/your_output_no_agent

Single-cell RNA-seq (bin/scrna_geneexpert.js):
# --- Parallel (default): 3 agents vote independently ---
node bin/scrna_geneexpert.js analyze data/scRNA_data/your_dataset \
--output results/scRNA_output \
--organism human
# --- Sequential Chain: GPT-5.2 -> Gemini -> Claude, each sees prior responses ---
node bin/scrna_geneexpert.js analyze data/scRNA_data/your_dataset \
--output results/scRNA_output_sequential \
--organism human \
--sequential-chain
# --- Single-LLM: GPT-5.2 called 3x with different roles (also: claude, gemini) ---
node bin/scrna_geneexpert.js analyze data/scRNA_data/your_dataset \
--output results/scRNA_output_single_gpt \
--organism human \
--single-agent gpt5.2
# --- No-Agent: template-based decisions only, no LLM calls ---
node bin/scrna_geneexpert.js analyze data/scRNA_data/your_dataset \
--output results/scRNA_output_no_agent \
--organism human \
--force-automation

We evaluate on 14 RNA-seq datasets (7 bulk + 7 single-cell):
Bulk RNA-seq:
- GSE52778 (Human dexamethasone, clean)
- GSE114845 (Mouse sleep deprivation, clean)
- GSE113754 (Mouse sleep deprivation, clean)
- GSE141496 (HeLa technical heterogeneity, batch effect)
- GSE47774 (SEQC multi-site, batch effect)
- GSE193658 (Human MM1.S in-house, published)
- GSE114845_CONTAM70 (70% E. coli contamination)
Single-cell RNA-seq:
- REH/SUP-B15 (Leukemia cell lines, clean)
- PBMC healthy human (10x, 10,194 cells)
- Mouse brain E18 (10x, 11,843 cells)
- GSE75748 (hESC, cell cycle challenge)
- GSE146773 (U-2 OS FUCCI, cell cycle ground truth)
- GSE64016 (H1 ESC FUCCI, cell cycle ground truth)
Detailed dataset descriptions with GEO accessions, download/renaming scripts, and sample mappings: ground_truth_supplementary/DATASETS.md
Reference genome indices and annotations (~11 GB total) are hosted separately due to size:
Google Drive: Reference Data (mm10 + hg38)
Place the downloaded reference_data folder inside bio_informatics/. The pipeline uses only the following files:
- index/mm10.* — Mouse genome index, Stage 2 alignment (~5 GB)
- index/hg38.* — Human genome index, Stage 2 alignment (~6 GB)
- shared/badIDS.txt — Gene ID filter list, Stage 3 (~353 KB)
- shared/mm10_entrzID_GS.txt — Mouse gene symbol mapping, Stage 3 (~365 KB)
- shared/hg38_entrzID_GS.txt — Human gene symbol mapping, Stage 3 (~1.5 MB)
Note: The GTF files and raw FASTA files in this directory are not used by the pipeline. featureCounts uses Rsubread built-in annotations.
Ground truth decisions for all datasets are in:
- Bulk RNA-seq: ground_truth_supplementary/bulk_rna_ground_truth.json
- scRNA-seq: ground_truth_supplementary/scrna_ground_truth.json
Each ground truth file contains:
- Correct decisions for each stage checkpoint
- Rationale for expert-curated decisions
- Tissue-specific expectations and thresholds
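The actual schema is defined by the JSON files above; purely for orientation, a hypothetical entry pairing a checkpoint with its expert decision might look like this (every key name here is illustrative, not the files' real schema):

```python
import json

# Hypothetical ground-truth entry; consult
# ground_truth_supplementary/*.json for the real key names.
entry = {
    "dataset": "GSE52778",
    "stage": "stage4_differential_expression",
    "correct_decision": "approve_results",
    "rationale": "Expert-curated: QC clean, expected response recovered.",
}
print(json.dumps(entry, indent=2))
```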
10 systems evaluated (6 role permutations + 4 baselines):
Multi-Agent (6 role permutations):
- GPT(Stats) + Claude(Pipeline) + Gemini(Biology) — DEFAULT
- GPT(Stats) + Claude(Biology) + Gemini(Pipeline)
- GPT(Pipeline) + Claude(Stats) + Gemini(Biology)
- GPT(Pipeline) + Claude(Biology) + Gemini(Stats)
- GPT(Biology) + Claude(Stats) + Gemini(Pipeline)
- GPT(Biology) + Claude(Pipeline) + Gemini(Stats)
Baselines (4 systems):
7. No-agent automation (--force-automation)
8. Single-LLM (Claude only, 3x with different roles)
9. Single-LLM (GPT only, 3x with different roles)
10. Single-LLM (Gemini only, 3x with different roles)
Run all 10 systems on one dataset:
# Bulk RNA-seq (dataset folder must contain renamed fastq files — see Data Preparation above)
# Args: dataset_folder_name organism control_keyword treatment_keyword
bash bin/run_bulk_rna.sh 1_GSE52778_pe_clean human untreated Dex
# scRNA-seq (args: dataset_name organism)
bash bin/run_sc_rna.sh pbmc_healthy_human human

Total experiments: 10 systems × 14 datasets = 140 analyses
Bulk RNA-seq (4 stages, 4 checkpoints):
Stage 1: FASTQ Validation → Agent Checkpoint
Stage 2: Alignment + QC → Agent Checkpoint (sample filtering)
Stage 3: Quantification + PCA → Agent Checkpoint (batch detection)
Stage 4: Differential Expression → Agent Checkpoint (result approval)
Single-cell RNA-seq (5 stages, 4 checkpoints):
Stage 1: Load 10x + QC → Auto-proceed
Stage 2: QC Filtering → Agent Checkpoint (adaptive thresholds)
Stage 3: Normalization + HVG → Auto-proceed
Stage 3A: Cell Cycle Scoring → Agent Checkpoint (regression decision)
Stage 3B: Cell Cycle Regression → Execute 3A decision
Stage 4: PCA → Agent Checkpoint (PC selection)
Stage 5: Clustering + Markers → Agent Checkpoint (validation)
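The stage → checkpoint flow above amounts to a loop that runs each stage, collects agent votes at checkpoints, and continues only with a resolved decision. A simplified sketch (all function names are illustrative; the real orchestration lives in src/executor/ and src/scrna_executor/):

```python
def run_pipeline(stages, run_stage, get_votes, resolve):
    """Run stages in order, pausing at agent checkpoints.

    run_stage(stage)         -> stage results (dict)
    get_votes(stage, result) -> list of agent decisions
    resolve(votes)           -> final decision, e.g. "proceed" or "halt"
    """
    log = []
    for stage in stages:
        result = run_stage(stage)
        votes = get_votes(stage, result)
        decision = resolve(votes)
        log.append({"stage": stage, "votes": votes, "decision": decision})
        if decision == "halt":
            break  # early error detection: stop before wasting later stages
    return log

# Toy usage: two stages where all three agents vote to proceed.
demo_log = run_pipeline(
    ["stage1", "stage2"],
    lambda s: {},
    lambda s, r: ["proceed", "proceed", "proceed"],
    lambda votes: "proceed",
)
```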
3 Specialized Agents:
- Stats Agent (GPT-5.2): Statistical validation, threshold selection
- Pipeline Agent (Claude Sonnet 4.5): Technical feasibility, best practices
- Biology Agent (Gemini Pro Latest): Biological interpretation
Consensus Mechanism:
- Majority (2/3): Auto-proceed
- No consensus: Auto-resolution (3-tier escalation) or user input
- All decisions logged with decision_id for evaluation
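The 2/3 majority rule over three categorical votes can be sketched as follows (the 3-tier escalation itself is handled separately and is not shown):

```python
from collections import Counter

def majority_decision(votes):
    """Return the decision backed by >=2 of 3 agents, or None to escalate."""
    top, count = Counter(votes).most_common(1)[0]
    return top if count >= 2 else None

assert majority_decision(["proceed", "proceed", "rerun_qc"]) == "proceed"
assert majority_decision(["a", "b", "c"]) is None  # no consensus -> escalate
```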
After running all bulk RNA-seq experiments, generate evaluation metrics and publication plots:
# Convert all JSON decision logs to CSV format
# <EXPERIMENT_DIR> = Directory containing all experiment result folders (e.g., experiments/bulk_rna_results/)
node bin/json_to_csv_bulk_rna.js convert --dir <EXPERIMENT_DIR>
# Example:
node bin/json_to_csv_bulk_rna.js convert --dir experiments/bulk_rna_results/
# Output: Creates *_metrics.csv for each experiment folder

What this does: Extracts individual agent decisions (DE method, outlier action) from Stage 3 logs.
# Combine all CSV files into one detailed dataset
# <EXPERIMENT_DIR> = Directory with experiment folders containing *_metrics.csv files
# <OUTPUT_CSV> = Path for combined CSV file
python bin/aggregate_experiments_bulk_rna.py \
<EXPERIMENT_DIR> \
<OUTPUT_CSV>
# Example:
python bin/aggregate_experiments_bulk_rna.py \
experiments/bulk_rna_results \
experiments/bulk_rna_results/bulk_rna_ALL_EXPERIMENTS_DETAILED.csv
# Output: bulk_rna_ALL_EXPERIMENTS_DETAILED.csv + bulk_rna_ALL_EXPERIMENTS_SUMMARY.csv

What this does: Aggregates all individual CSV files into a comprehensive dataset for evaluation.
# Compare agent decisions (majority vote of 3 agents) to expert-curated ground truth
# <AGGREGATED_CSV> = Output from Step 2
# <GROUND_TRUTH_JSON> = Expert-curated correct decisions
# <OUTPUT_EVALUATION_CSV> = Path for evaluation results
node bin/evaluate_bulk_from_csv.js \
<AGGREGATED_CSV> \
<GROUND_TRUTH_JSON> \
<OUTPUT_EVALUATION_CSV>
# Example:
node bin/evaluate_bulk_from_csv.js \
experiments/bulk_rna_results/bulk_rna_ALL_EXPERIMENTS_DETAILED.csv \
ground_truth_supplementary/bulk_rna_ground_truth.json \
experiments/bulk_rna_csv_figures/bulk_evaluation_per_experiment.csv
# Output: CSV with match/mismatch, error types, accuracy per experiment

What this does: Compares the consensus decision (majority of 3 agents) against ground truth for each stage.
# Per-dataset performance heatmaps
# <EVALUATION_CSV> = Output from Step 3
python bin/generate_per_dataset_plots.py <EVALUATION_CSV>
# Example:
python bin/generate_per_dataset_plots.py \
experiments/bulk_rna_csv_figures/bulk_evaluation_per_experiment.csv
# Error type distribution analysis
python bin/plot_error_types.py <EVALUATION_CSV> bulk <OUTPUT_PREFIX>
# Example:
python bin/plot_error_types.py \
experiments/bulk_rna_csv_figures/bulk_evaluation_per_experiment.csv \
bulk \
experiments/bulk_rna_csv_figures/bulk_error_analysis
# Stage-wise accuracy comparison
python bin/plot_stage_wise_accuracy.py <EVALUATION_CSV>
# Example:
python bin/plot_stage_wise_accuracy.py \
experiments/bulk_rna_csv_figures/bulk_evaluation_per_experiment.csv
# Output: Publication-ready figures in experiments/bulk_rna_csv_figures/

After running all scRNA-seq experiments, generate evaluation metrics and publication plots:
See src/config/scrna_stage_prompts.js for detailed output format specifications.
# Convert all JSONL decision logs to CSV format (scRNA uses JSONL, not JSON)
# <EXPERIMENT_DIR> = Directory containing all scRNA experiment result folders
node bin/json_to_csv_scrna.js convert --dir <EXPERIMENT_DIR>
# Example:
node bin/json_to_csv_scrna.js convert --dir experiments/scrna_results/
# Output: Creates *_metrics.csv for each experiment folder

# Combine all CSV files into one detailed dataset
# <EXPERIMENT_DIR> = Directory with scRNA experiment folders containing *_metrics.csv files
# <OUTPUT_CSV> = Path for combined CSV file
python bin/aggregate_scrna_experiments.py \
<EXPERIMENT_DIR> \
<OUTPUT_CSV>
# Example:
python bin/aggregate_scrna_experiments.py \
experiments/scrna_results \
experiments/scrna_results/scrna_ALL_EXPERIMENTS_DETAILED.csv
# Output: scrna_ALL_EXPERIMENTS_DETAILED.csv + scrna_ALL_EXPERIMENTS_SUMMARY.csv

What this does: Aggregates all individual scRNA CSV files into a comprehensive dataset for evaluation.
# Compare agent decisions (majority vote of 3 agents) to expert-curated ground truth
# <AGGREGATED_CSV> = Output from Step 2
# <GROUND_TRUTH_JSON> = Expert-curated correct decisions for scRNA
# <OUTPUT_EVALUATION_CSV> = Path for evaluation results
node bin/evaluate_scrna.js \
<AGGREGATED_CSV> \
<GROUND_TRUTH_JSON> \
<OUTPUT_EVALUATION_CSV>
# Example:
node bin/evaluate_scrna.js \
experiments/scrna_results/scrna_ALL_EXPERIMENTS_DETAILED.csv \
ground_truth_supplementary/scrna_ground_truth.json \
experiments/scrna_per_dataset_figure/scrna_evaluation_per_experiment.csv
# Output: CSV with match/mismatch, error types, accuracy per experiment

What this does:
- Extracts individual agent decisions (gpt5_2_decision, claude_decision, gemini_decision) from CSV
- Calculates consensus decision for comparison with ground truth:
- Stage 2 & 4: AVERAGE of numeric values (thresholds, PC counts)
- Stage 3A & 5: MAJORITY VOTE (2/3 agents must agree on categorical decisions)
- Compares final consensus against expert-curated ground truth
- Outputs per-experiment accuracy, error types, and match/mismatch details
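The two consensus rules above (numeric averaging for Stages 2 and 4, 2/3 majority vote for Stages 3A and 5) reduce to a small helper; a sketch, not the evaluator's actual code:

```python
from collections import Counter

def consensus(votes, numeric: bool):
    """Average numeric votes; majority-vote categorical votes (None if no 2/3)."""
    if numeric:
        return sum(votes) / len(votes)
    top, count = Counter(votes).most_common(1)[0]
    return top if count >= 2 else None

# Stage 4 style: three agents propose PC counts -> averaged.
assert consensus([10, 12, 14], numeric=True) == 12.0
# Stage 3A style: categorical regression decision -> 2/3 majority.
assert consensus(["regress", "regress", "skip"], numeric=False) == "regress"
```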
# Per-dataset performance heatmaps
# <EVALUATION_CSV> = Output from Step 3
# <OUTPUT_DIR> = Directory for saving plots
python bin/generate_per_dataset_plots_scrna.py \
<EVALUATION_CSV> \
<OUTPUT_DIR>
# Example:
python bin/generate_per_dataset_plots_scrna.py \
experiments/scrna_per_dataset_figure/scrna_evaluation_per_experiment.csv \
experiments/scrna_per_dataset_figure/
# Error type distribution analysis
python bin/plot_error_types.py \
<EVALUATION_CSV> \
scrna \
<OUTPUT_PREFIX>
# Example:
python bin/plot_error_types.py \
experiments/scrna_per_dataset_figure/scrna_evaluation_per_experiment.csv \
scrna \
experiments/scrna_per_dataset_figure/scrna_error_analysis
# Stage-wise accuracy comparison
python bin/plot_stage_wise_accuracy.py \
<EVALUATION_CSV> \
scrna \
<OUTPUT_PREFIX>
# Example:
python bin/plot_stage_wise_accuracy.py \
experiments/scrna_per_dataset_figure/scrna_evaluation_per_experiment.csv \
scrna \
experiments/scrna_per_dataset_figure/scrna_stage_analysis
# Output: Publication-ready figures in experiments/scrna_per_dataset_figure/

| Metric | Target | Formula |
|---|---|---|
| Decision Accuracy | >90% | Correct decisions / Total |
| Error Reduction | >40% | (Baseline errors - Multi-agent errors) / Baseline errors |
| Success Rate | >95% | Successful analyses / Total |
| Cost Efficiency | <$0.10 | Total API cost / Successful analyses |
| Inter-Agent Agreement | >0.7 | Cohen's κ |
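The accuracy and error-reduction formulas in the table reduce to simple ratios. With hypothetical counts (not measured results):

```python
def decision_accuracy(correct: int, total: int) -> float:
    """Correct decisions / Total."""
    return correct / total

def error_reduction(baseline_errors: int, multi_agent_errors: int) -> float:
    """(Baseline errors - Multi-agent errors) / Baseline errors."""
    return (baseline_errors - multi_agent_errors) / baseline_errors

# Hypothetical counts, purely to illustrate the formulas:
assert decision_accuracy(38, 40) == 0.95        # 95% > 90% target
assert error_reduction(10, 5) == 0.5            # 50% fewer errors than baseline
```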
Per-analysis cost (varies by system):
- PolyLLM-Multi-Agent (3 models): ~$0.08-0.10 per analysis
- Cost = SUM(GPT-5.2 + Claude Opus 4.5 + Gemini Pro Latest)
- SingleLLM-Multi-Agent Claude: ~$0.18 per analysis (3 calls/checkpoint × 4 checkpoints)
- SingleLLM-Multi-Agent GPT-5.2: ~$0.08 per analysis (3 calls/checkpoint × 4 checkpoints)
- SingleLLM-Multi-Agent Gemini: ~$0.002 per analysis (3 calls/checkpoint × 4 checkpoints)
API pricing (per 1M tokens): Claude Opus 4.5 (input: $5, output: $25), GPT-5.2 (input: $1.75, output: $14), Gemini 3 Pro (input: $1.25, output: $10). Actual costs vary by prompt length.
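Given the per-1M-token prices above, a per-checkpoint cost estimate is a weighted sum over the three calls. The token counts below are illustrative placeholders, not measured usage:

```python
def call_cost(input_tokens, output_tokens, in_price, out_price):
    """Cost in USD of one LLM call, given per-1M-token prices."""
    return input_tokens / 1e6 * in_price + output_tokens / 1e6 * out_price

# Prices from the table above; 2000 in / 300 out tokens are placeholders.
gpt = call_cost(2000, 300, 1.75, 14)
claude = call_cost(2000, 300, 5, 25)
gemini = call_cost(2000, 300, 1.25, 10)
per_checkpoint = gpt + claude + gemini
print(f"~${per_checkpoint:.4f} per multi-agent checkpoint")
```

Actual per-analysis costs depend on prompt length and the number of checkpoints reached, which is why the figures above are quoted as ranges.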
├── bin/ # CLI tools & experiment scripts
│ ├── geneexpert.js # Bulk RNA-seq entry point
│ ├── scrna_geneexpert.js # scRNA-seq entry point
│ ├── evaluate_bulk_from_csv.js # Bulk evaluation
│ ├── evaluate_scrna.js # scRNA evaluation
│ └── [50+ experiment runner scripts]
│
├── src/ # Core implementation
│ ├── executor/staged_executor.js # Bulk 4-stage orchestration
│ ├── scrna_executor/scrna_executor.js # scRNA 5-stage orchestration
│ ├── coordinator/ # Multi-agent coordination
│ ├── stages/ # Bulk RNA-seq stage modules
│ ├── scrna_stages/ # scRNA-seq stage modules
│ ├── config/ # Agent prompts & decision vocabulary
│ └── utils/ # LLM clients, logging, metrics
│
├── ground_truth_supplementary/ # Evaluation ground truth
│ ├── bulk_rna_ground_truth.json # Bulk RNA-seq decisions (7 datasets)
│ ├── scrna_ground_truth.json # scRNA-seq decisions (7 datasets)
│ └── DATASETS.md # Detailed dataset descriptions
│
├── bio_informatics/ # R/bash analysis scripts
│ ├── scripts/ # Stage execution scripts (30 files)
│ └── reference_data/ # Genome annotations
│
└── experiments/ # Results storage
├── bulk_rna_results/ # Bulk RNA-seq outputs
└── scrna_results/ # scRNA-seq outputs
- Staged Validation: Early error detection at each checkpoint
- Role Swapping: Test all 6 agent role permutations
- Adaptive Thresholds: Agents recommend dataset-specific QC thresholds
- Conditional Execution: Cell cycle regression based on agent decision
- Comprehensive Logging: JSON/JSONL logs with full agent conversations
- Cost Tracking: Per-decision API cost logging
- Auto-Resolution: 3-tier escalation for disagreements
@software{geneexpert2026,
title={GeneXpert: PolyLLM Multi-Agent System for RNA-seq Analysis},
author={[My Name]},
year={2026},
url={https://github.com/myusername/geneexpert}
}

MIT
- GitHub: [Repository URL]
- Supplementary Materials:
ground_truth_supplementary/