Isopedia is a scalable tool for analyzing hundreds to thousands of long-read transcriptome datasets using a read-level indexing approach.
It provides two key capabilities:
- Population-level transcript quantification and frequency profiling.
- Isoform diversity exploration, including fusion and splice-junction analysis.
preprint: https://www.biorxiv.org/content/10.64898/2026.03.23.713667v1.full
- Quick Start
- How It Works
- Download Pre-built Index
- Build Your Own Index
- How to Use Isopedia
- Command Parameters
- How to Install Isopedia
Isopedia has two binaries:
- isopedia: main workflows (query + index building)
- isopedia-tools: helper utilities
# download the latest release from GitHub
curl -L https://github.com/zhengxinchang/isopedia/releases/download/v1.6.6/isopedia-v1.6.6-linux-x86_64.tar.gz | tar -xzvf -
./isopedia-v1.6.6-linux-x86_64/isopedia
# clone the repo and enter toy example directory
git clone https://github.com/zhengxinchang/isopedia && cd isopedia/toy_ex/
# query transcripts
isopedia isoform -i index/ -g gencode.v47.basic.chr22.gtf -o out.profile.tsv.gz # 3364 records returned
# query one splice junction
isopedia splice -i index/ -s 22:17744013,22:17750104 -o out.splice.tsv.gz # 13 records returned
# visualize splice query output
isopedia-splice-viz.py -i out.splice.tsv.gz -g gencode.v47.basic.chr22.gtf -o isopedia-splice-view
# query one fusion event (two breakpoints)
isopedia fusion -i index/ -p chr1:181130,chr1:201853853 -o ./out.fusion.tsv.gz # 0 records returned, as HG002 is a healthy individual
# query multiple fusion events
isopedia fusion -i index/ -P fusion_query.tsv -o ./out.fusion.tsv.gz # 0 records returned, as HG002 is a healthy individual
Overview of Isopedia's query framework and supported output types. All query modes run on a standard laptop.
Typical workflow:
- isopedia profile: extract isoform signals per sample.
- isopedia merge: aggregate all profiles into one merged archive.
- isopedia index: build tree index for fast query.
- Query with isopedia isoform, isopedia fusion, or isopedia splice.
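The profile, merge, and index steps can be planned from a script. The following is a minimal sketch, not part of Isopedia: the function name `plan_index_build`, the output file naming, and the two-column manifest header are assumptions for illustration. It only assembles the command lines shown elsewhere in this README and does not execute anything.

```python
from pathlib import Path


def plan_index_build(bams, workdir="."):
    """Return (manifest_text, commands) for profile -> merge -> index.

    `bams` maps a sample name to its BAM/CRAM path. Nothing is executed;
    each command is an argv list suitable for subprocess.run().
    """
    workdir = Path(workdir)
    # Assumption: the manifest starts with a header row, as in the README example.
    manifest_lines = ["name\tpath"]
    commands = []
    for name, bam in bams.items():
        profile = workdir / f"{name}.isoform.gz"
        # Step 1: one profile per sample.
        commands.append(["isopedia", "profile", "-i", str(bam), "-o", str(profile)])
        manifest_lines.append(f"{name}\t{profile}")
    manifest = workdir / "manifest.tsv"
    index_dir = workdir / "index"
    # Step 2: merge all profiles listed in the manifest.
    commands.append(["isopedia", "merge", "-i", str(manifest), "-o", str(index_dir)])
    # Step 3: build the tree index on the merged archive.
    commands.append(["isopedia", "index", "-i", str(index_dir), "-m", str(manifest)])
    return "\n".join(manifest_lines) + "\n", commands
```

Write the returned manifest text to manifest.tsv in the working directory before running the merge command, then pass each argv list to `subprocess.run(cmd, check=True)`.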
Isopedia provides pre-built indexes from public long-read RNA-seq datasets.
| Name | Organism | Reference | Link | Sample Size | Index Size (Compressed) | Minimum Memory for Query | Description |
|---|---|---|---|---|---|---|---|
| isopedia_index_hs_v1.0 | Homo sapiens | GRCh38 | ftp://hgsc-sftp1.hgsc.bcm.tmc.edu/rt38520/isopedia_index_hs_v1.0.tar.xz | 1,007 | 305G (107G) | 16 GB | 107 ENCODE samples + 900 SRA samples |
Checksum file:
ftp://hgsc-sftp1.hgsc.bcm.tmc.edu//rt38520/isopedia_index_hs_v1.0.tar.xz.md5sum
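Because the archive is large, it is worth verifying the download against the checksum file before unpacking. Here is a minimal sketch using only the Python standard library. It assumes the .md5sum file follows the usual `md5sum` layout (hex digest first, then the file name); the helper names are illustrative.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_md5sum(archive_path, md5sum_path):
    """Return True if the archive's digest equals the first token of the checksum file."""
    with open(md5sum_path) as handle:
        # Assumption: the first whitespace-separated token is the expected digest.
        expected = handle.read().split()[0].lower()
    return md5_of_file(archive_path) == expected
```

For example, `matches_md5sum("isopedia_index_hs_v1.0.tar.xz", "isopedia_index_hs_v1.0.tar.xz.md5sum")` should return True for an intact download.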
Unpack:
tar -xvf isopedia_index_hs_v1.0.tar.xz
Prerequisites:
- Long-read alignments (.bam/.cram)
- A manifest TSV with at least two columns: sample name and profile path.
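The two-column manifest requirement can be checked before starting a long merge. The sketch below is a hypothetical pre-flight helper, not an Isopedia feature. It assumes a header row followed by tab-separated rows, and the duplicate-name check is an extra sanity check that Isopedia itself may not enforce.

```python
import csv


def read_manifest(handle):
    """Parse a manifest TSV and return a list of (name, path) tuples."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader, None)  # assumption: first row is a header
    if header is None or len(header) < 2:
        raise ValueError("manifest needs at least two tab-separated columns")
    samples, seen = [], set()
    for line_no, row in enumerate(reader, start=2):
        if not row or all(not cell.strip() for cell in row):
            continue  # skip blank lines
        if len(row) < 2:
            raise ValueError(f"line {line_no}: expected sample name and profile path")
        name, path = row[0].strip(), row[1].strip()
        if name in seen:
            raise ValueError(f"line {line_no}: duplicate sample name {name!r}")
        seen.add(name)
        samples.append((name, path))
    return samples
```

Call it as `read_manifest(open("manifest.tsv", newline=""))`. Extra columns such as a platform label are ignored.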
Example manifest:
name path platform
HG002_pb_chr22 /path/to/hg002_pb_chr22.isoform.gz PacBio
HG002_ont_chr22 /path/to/hg002_ont_chr22.isoform.gz ONT
Workflow:
# 1) profile each sample
isopedia profile -i ./chr22.pb.grch38.bam -o ./hg002_pb_chr22.isoform.gz
isopedia profile -i ./chr22.ont.grch38.bam -o ./hg002_ont_chr22.isoform.gz
# 2) merge profiles
ulimit -n 65535 # increase the maximum number of open file descriptors, in case of merging many samples.
isopedia merge -i manifest.tsv -o index/
# 3) build tree index
isopedia index -i index/ -m manifest.tsv
# 4) test query
isopedia isoform -i index/ -g gencode.v47.basic.chr22.gtf -o out.isoform.tsv.gz
Purpose:
- Annotate transcripts from an input GTF against the index.
Example:
# input GTF must be sorted
gffread -T -o- origin.gtf | sort -k1,1 -k4,4n | gffread - -o query.sorted.gtf
isopedia isoform -i index/ -g query.sorted.gtf -o isoform.out.tsv.gz

| Column | Description |
|---|---|
| chrom | Chromosome |
| start | Transcript start |
| end | Transcript end |
| length | Transcript length |
| exon_count | Exon count |
| trans_id | Transcript ID |
| gene_id | Gene ID |
| ranking_score | Ranking score across cohort |
| detected(total:fsm:em) | Detection status for total/FSM/EM. The detected status column follows the format "Overall detected:FSM detected:EM detected". FSM detected indicates whether the sample has evidence from a full-splice match, and EM detected indicates whether the sample has evidence from EM estimation. Overall detected is set to "yes" if either FSM detected or EM detected is "yes". Entries with "no:no:no" can be excluded to quickly filter for transcripts with detection support. |
| min_read | Minimum read threshold used |
| n_pos_samples(total:fsm:em/sample_size) | Positive sample counts and sample size |
| attributes | Original GTF attributes |
| FORMAT | Per-sample field format. Current FORMAT string: CPM:COUNT:FSM_CPM:FSM_COUNT:EM_CPM:EM_COUNT:INFO. Per-sample values include abundance/count components for total, FSM, and EM estimates. |
| sample_* | Per-sample values |
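The FORMAT string and the detected column can be unpacked with a few lines of Python. This is a sketch under stated assumptions: per-sample values are taken to be colon-separated in the same order as FORMAT, the example value is made up, and fields are kept as strings because the encoding of missing values is not documented here.

```python
FORMAT = "CPM:COUNT:FSM_CPM:FSM_COUNT:EM_CPM:EM_COUNT:INFO"


def parse_sample_field(value, fmt=FORMAT):
    """Map one per-sample value onto the keys named in the FORMAT string."""
    keys = fmt.split(":")
    # Limit the split so any ':' inside the trailing INFO part is preserved.
    parts = value.split(":", len(keys) - 1)
    if len(parts) != len(keys):
        raise ValueError(f"expected {len(keys)} fields, got {len(parts)}")
    return dict(zip(keys, parts))


def has_detection_support(detected):
    """Return False only for rows whose detected column is 'no:no:no'."""
    return detected != "no:no:no"


# Hypothetical per-sample value, for illustration only.
fields = parse_sample_field("12.5:3:8.3:2:4.2:1:NA")
print(fields["COUNT"], fields["FSM_COUNT"], fields["EM_COUNT"])
```

Cast the numeric fields with `float()` or `int()` once you have confirmed how empty values appear in your own output.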
Please note that CPM values are calculated by normalizing across all input transcript queries, as Isopedia expects the input GTF to represent a complete transcriptome. If you query only a subset of transcripts, the CPM values will not be meaningful.
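To see why a partial GTF distorts CPM, consider the textbook counts-per-million definition (count divided by the total count, times one million). The sketch below assumes that definition and uses made-up counts. Isopedia's internal calculation may differ in detail, but the dependence on the denominator is the same.

```python
def cpm(counts):
    """Counts per million over whatever set of transcripts is supplied."""
    total = sum(counts.values())
    if total == 0:
        return {name: 0.0 for name in counts}
    return {name: count / total * 1_000_000 for name, count in counts.items()}


# Hypothetical read counts for three transcripts.
full = {"tx1": 50, "tx2": 30, "tx3": 20}
subset = {"tx1": 50, "tx2": 30}  # same reads, but tx3 left out of the query

print(cpm(full)["tx1"])    # 500000.0
print(cpm(subset)["tx1"])  # 625000.0
```

The count for tx1 is unchanged, yet its CPM rises when tx3 is omitted, because the normalizing total shrinks.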
fusion supports three modes:
- Breakpoint query (--pos or --pos-bed)
- Gene-region discovery (--gene-gtf)
- Region-pair detailed read output (--region)
Examples:
# single breakpoint pair
isopedia fusion -i index/ -p chr1:1000,chr2:2000 -o fusion.single.tsv.gz
# batch breakpoint pairs from bed-like file
isopedia fusion -i index/ -P fusion_breakpoints.bed -o fusion.batch.tsv.gz
# discover candidate fusions from gene GTF
isopedia fusion -i index/ -G gene.gtf -o fusion.discovery.tsv.gz
# detailed read-level output for two regions
isopedia fusion -i index/ -r chr8:92017266-92017466,chr21:34859374-34859574 -o fusion.region.tsv.gz
Breakpoint-query output columns:
chr1,pos1,chr2,pos2,id,min_read,positive/sample_size,left_isoforms,right_isoforms, then per-sample counts.
Gene-discovery output columns:
geneA_name,geneB_name,total_evidences,total_samples,is_two_strand,AtoB_primary_start,AtoB_primary_end,AtoB_supp_start,AtoB_supp_end,BtoA_primary_start,BtoA_primary_end,BtoA_supp_start,BtoA_supp_end, then per-sample counts.
Region-pair detailed output columns:
chr1,start1,end1,chr2,start2,end2,main_exon_count1,supp_segment_count2,query_part,main_isoforms,supp_aln_regions,sample_name.
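When fusion queries are generated from a script, the -p and -r arguments can be assembled from breakpoint coordinates. The helpers below are hypothetical, not part of Isopedia. The 100 bp half-width simply mirrors the region example in this README, and clamping the start at 1 assumes 1-based coordinates.

```python
def format_breakpoint_pair(chrom1, pos1, chrom2, pos2):
    """Build the value passed to -p, e.g. 'chr1:1000,chr2:2000'."""
    return f"{chrom1}:{pos1},{chrom2}:{pos2}"


def format_region_pair(chrom1, pos1, chrom2, pos2, half_width=100):
    """Build the value passed to -r as two windows centred on the breakpoints."""
    def window(chrom, pos):
        start = max(1, pos - half_width)  # assumption: 1-based coordinates
        return f"{chrom}:{start}-{pos + half_width}"
    return f"{window(chrom1, pos1)},{window(chrom2, pos2)}"


def fusion_command(index_dir, output, *, pos=None, region=None):
    """Return the argv list for one fusion query (nothing is executed)."""
    if (pos is None) == (region is None):
        raise ValueError("give exactly one of pos or region")
    selector = ["-p", pos] if pos is not None else ["-r", region]
    return ["isopedia", "fusion", "-i", index_dir, *selector, "-o", output]


region = format_region_pair("chr8", 92017366, "chr21", 34859474)
print(" ".join(fusion_command("index/", "fusion.region.tsv.gz", region=region)))
```

With these inputs the printed command matches the region-pair example in this section.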
Purpose:
- Query isoforms overlapping a splice junction and optionally visualize them.
Examples:
# single splice query
isopedia splice -i index/ -s chr22:41100500,chr22:41101500 -o splice.out.tsv.gz
# batch splice query
isopedia splice -i index/ -S splice_queries.bed -o splice.batch.tsv.gz
# visualize
python script/isopedia-splice-viz.py \
-i splice.out.tsv.gz \
-g gencode.v47.basic.annotation.gtf \
-t script/isopedia-splice-viz-temp.html \
-o isopedia-splice-view
splice output columns:
id,chr1,pos1,chr2,pos2,total_evidence,cpm,matched_sj_idx,dist_to_matched_sj,n_exons,start_pos_left,start_pos_right,end_pos_left,end_pos_right,splice_junctions, then per-sample values.
splice per-sample FORMAT:
COUNT:CPM:START|END|STRAND
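A per-sample value in this format can be split as follows. This sketch reads the FORMAT literally, as two colon-separated fields followed by one pipe-separated START|END|STRAND group. That reading is an assumption, the example value is made up, and fields are left as strings.

```python
def parse_splice_sample(value):
    """Split a COUNT:CPM:START|END|STRAND value into a dict of strings."""
    try:
        count, cpm, span = value.split(":", 2)
        start, end, strand = span.split("|")
    except ValueError:
        raise ValueError(f"unexpected splice sample value: {value!r}") from None
    return {"COUNT": count, "CPM": cpm, "START": start, "END": end, "STRAND": strand}


# Hypothetical value, for illustration only.
print(parse_splice_sample("3:1.5:41100500|41101500|+"))
```

If your output packs several START|END|STRAND groups into one sample cell, this helper will raise and needs to be adapted.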
Parameter lists below are based on the current CLI in source.
Core options:
- -i, --idxdir <IDXDIR>
- -g, --gtf <GTF>
- -o, --output <OUTPUT>
- -f, --flank <FLANK> (default: 10)
- -m, --min-read <MIN_READ> (default: 1)
- --info
- -n, --num-threads <NUM_THREADS> (default: 4)
- EM options: --em-max-iter, --em-conv-min-diff, --em-chunk-size, --em-effective-len-coef, --em-damping-factor, --min-em-abundance
- TSS/TES options: --no-check-tss-tes, --tss-degrad-bp, --tes-degrad-bp, --terminal-tolerance-bp
- Cache options: -c, --cached-nodes; --cached-chunk-num; --cached-chunk-size-mb
- --output-tmp-shard-counts
- --verbose
Core options:
- -i, --idxdir <IDXDIR>
- Query selectors: -p, --pos; -P, --pos-bed; -r, --region; -G, --gene-gtf
- -o, --output <OUTPUT>
- -f, --flank <FLANK> (default: 10)
- -m, --min-read <MIN_READ> (default: 1)
- Cache options: -c, --cached_nodes; --cached-chunk-number; --cached-chunk-size-mb
- --verbose
Core options:
- -i, --idxdir <IDXDIR>
- Query selectors: -s, --splice or -S, --splice-bed
- -o, --output <OUTPUT>
- -f, --flank <FLANK> (default: 10)
- -m, --min-read <MIN_READ> (default: 1)
- Cache options: -c, --cached_nodes; --cached-chunk-number; --cached-chunk-size-mb
- --verbose
Core options:
- Input selectors: -i, --bam or -g, --gtf
- -r, --reference <REFERENCE> for CRAM
- -o, --output <OUTPUT>
- --mapq <MAPQ> (default: 5)
- --use-secondary, --rname, --tid, --gid, --verbose
Core options:
- -i, --input <INPUT>
- -o, --outdir <OUTDIR>
- -c, --chunk-size <CHUNK_SIZE> (default: 1000000)
Core options:
- -i, --idxdir <IDXDIR>
- -m, --manifest <MANIFEST>
- -t, --threads <THREADS> (default: 4)
Core options:
- -l, --list
- -n, --name <NAME>
- -u, --url <URL>
- -m, --manifest <MANIFEST>
- -o, --outdir <OUTDIR> (default: .)
We offer two ways to install isopedia:
Download the pre-built binary from the GitHub Releases page:
curl -L https://github.com/zhengxinchang/isopedia/releases/download/v1.6.6/isopedia-v1.6.6-linux-x86_64.tar.gz | tar -xzvf -
./isopedia-v1.6.6-linux-x86_64/isopedia
To build from source, Rust and Cargo are required.
git clone https://github.com/zhengxinchang/isopedia.git
cd isopedia
cargo build --release
Isopedia 1.4.0 benchmark on 1,007 long-read transcriptome datasets from SRA and ENCODE.
- Hardware: AMD Ryzen 9 7940HX, 64 GB RAM.
| Step | Peak Mem (GB) | Time (H:MM:SS) |
|---|---|---|
| isopedia merge | 54.77 | 28:26:08 |
| isopedia index | 45.85 | 5:48:55 |
| isopedia isoform (280K GENCODE v49 basic) | 9.19 | 4:52:18 |
Q: Can Isopedia be used in a "transcript discovery" mode — for example, to retrieve all transcripts indexed for a given gene or within a genomic region, without providing a GTF file in advance?
A: Isopedia is designed as a genotyper rather than a caller. It always requires a GTF file as input, meaning users need to specify the transcript structures they want to query. Gene-level or coordinate-range discovery queries (e.g., "retrieve all transcripts for gene N" or "all transcripts between coordinates A–B") fall under the scope of a transcript caller, which is outside Isopedia's current functionality.



