Gene Expression : Quantification of Information
Molecules and their Applications
Prof. Gajendra P.S. Raghava
Head, Center for Computational Biology
Web Site: [Link]
These slides were created using various resources, so
no claim of authorship is made on any slide
Biomedical Applications
Concept Level
★Proteome annotation ★Drug discovery ★Vaccine Design ★Biomarkers
Molecules or Objects
Proteins & Peptides
• Structure prediction
• Subcellular localization
• Therapeutic Application
• Ligand binding
Gene Expression
• Disease biomarkers
• Drug biomarkers
• mRNA expression
• Copy number variation
Chemoinformatics
• Drug design
• Chemical descriptor
• QSAR models
• Personalized inhibitors
Image annotation
• Image Classification
• Medical images
• Disease classification
• Disease diagnostics
Molecular Biology Overview
Cell -> Nucleus -> Chromosome -> Gene (DNA) -> Gene (mRNA), single strand -> Protein
History of genome sequencing
1977 bacteriophage øX174 (5386bp, 11 genes)
1981 mitochondrial genome (16,568bp)
1986 chloroplast genome (120,000 bp)
1995 Haemophilus influenzae (1.8Mb)
1996 Saccharomyces whole genome (12.1Mb)
1997 E. coli (4.6Mb; 4200 proteins)
1998 Caenorhabditis elegans (97 Mb; 19,000 genes)
2000 Arabidopsis thaliana (115Mb, 30,000 genes)
2001 mouse (1 year!)
2001 Homo sapiens (2 projects)
2005 Pan, rice
2006 Populus
Analysing the flow of genetic information
Structural genomics
DNA (Genome), in the nucleus
• Genome mapping
• Genome sequencing
• Genome annotations
Functional genomics
pre-mRNA (nucleus) and mRNA (cytoplasm): mRNA (Transcriptome)
• DNA arrays and chips
• RNA sequencing
• (semi) qRT-PCR
• Northern blot + hybrid.
• Transcriptional fusions
Proteins (Proteome)
• 2D electrophoresis
• Gel-free methods: Mass spectrometry, Protein sequencing
• Translational fusions
• Immunodetection
• Enzyme activities
Metabolites (Metabolome)
• Chromatography
• Mass spectrometry
• NMR
Metabolomics: Glycomics (Sugars), Lipidomics (Lipids)
World of OMICs
Organ, Tissue -> Cell -> Nucleus -> Chromosome (23 pairs) -> Chromatin
• Epigenomics: chromatin marks (M, Ac)
• Genomics: DNA (4 chemicals: A, T, G, C), 3×10^9
• Transcriptomics: mRNA (copies), non-coding RNA, miRNA
• Proteomics: Protein (20 chemicals: A, C, D ..)
• Glycomics: sugars attached to proteins
The evolution of transcriptomics
Hybridization-based
• P. Brown et al.: gene expression profiling using spotted cDNA microarray: expression levels of known genes
• Affymetrix: whole genome expression profiling using tiling array: identifying and profiling novel genes and splicing variants
2008, many groups
• mRNA-seq: direct sequencing of mRNAs using next generation sequencing techniques (NGS)
History
➢ 1980s: antibody-based assay (protein chip?)
➢ ~1991: high-density DNA-synthetic chemistry (Affymetrix/oligo
chips)
➢ ~1995: microspotting (Stanford Univ/cDNA chips)
➢ replacing porous surface with solid surface
➢ replacing radioactive label with fluorescent label
➢ improvement on sensitivity
Stanford/cDNA chip
Flow diagram of cDNA chip
microarray technology,
where we detect relative
expression of each gene
cDNA Microarray Technology
Major Steps
1. Spot cloned cDNAs onto a glass microscope slide
2. Label 2 RNA samples with 2 different colors of fluorescent dye
3. Mix two labeled RNAs and hybridize to the chip
4. Make two scans - one for each color
5. Calculate ratios of amounts of each RNA that bind to each spot
Gene Expression Data
On p genes for n slides: p is O(10,000), n is O(10-100), but growing.
Rows are genes, columns are slides:
Gene  slide 1  slide 2  slide 3  slide 4  slide 5  …
1        0.46     0.30     0.80     1.51     0.90  ...
2       -0.10     0.49     0.24     0.06     0.46  ...
3        0.15     0.74     0.04     0.10     0.20  ...
4       -0.45    -1.03    -0.79    -0.56    -0.32  ...
5       -0.06     1.06     1.35     1.09    -1.09  ...
Gene expression level of gene 5 in slide 4
= Log2( Red intensity / Green intensity)
These values are conventionally displayed on a red (>0) yellow (0) green (<0)
scale.
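As a minimal sketch of the calculation above, the following Python snippet computes Log2(Red intensity / Green intensity) for a few spots and reports the display colour. The intensity values, gene names and the function name log_ratio are invented for illustration; real pipelines usually subtract background and normalize before taking the ratio.

```python
import math

def log_ratio(red, green):
    """Return log2(red / green) for one spot; both intensities must be > 0."""
    if red <= 0 or green <= 0:
        raise ValueError("intensities must be positive")
    return math.log2(red / green)

# Hypothetical background-corrected (red, green) intensities for three spots.
spots = {
    "gene_a": (2000.0, 1000.0),
    "gene_b": (500.0, 500.0),
    "gene_c": (250.0, 1000.0),
}

for name, (red, green) in spots.items():
    m = log_ratio(red, green)
    # Conventional display: red for > 0, yellow for 0, green for < 0.
    colour = "red" if m > 0 else "green" if m < 0 else "yellow"
    print(f"{name}: log2 ratio = {m:.2f} ({colour})")
```

Working on the log scale makes a two-fold increase (+1) and a two-fold decrease (-1) symmetric around zero.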
Affymetrix Expression Arrays
Flow chart of Affymetrix
Comparison of two technologies
Stanford/cDNA chip and Affymetrix/oligo chip
Aspect | cDNA Microarray | Affymetrix GeneChip
Technology | Hybridization-based, uses cDNA probes (longer sequences). | Oligonucleotide-based, uses short 25-mer probes.
Probe Type | Double-stranded cDNA (500–5,000 bp). | Short oligonucleotides (~25 base pairs).
Target Labeling | Fluorescent dyes (Cy3/Cy5) for co-hybridization. | Single fluorescent label (usually biotin-streptavidin).
Data Acquisition | Relative expression comparison of two samples. | Absolute or relative quantification per sample.
Sensitivity | Lower sensitivity, especially for low-abundance transcripts. | Higher sensitivity due to specific oligonucleotides.
Quantification | Ratios of fluorescence intensities between samples. | Signal intensity corresponds to expression level.
Cost | Relatively lower. | Relatively higher.
Examples of Use | Comparison of gene expression between two conditions. | Genome-wide expression profiling, genotyping.
Analysis of Microarray Data
Analysis of images
Preprocessing of gene expression data
Normalization of data
Subtraction of Background Noise
Global/local Normalization
Housekeeping genes (or same gene)
Expression as a log ratio (test/reference)
Differential gene expression
Repeats and calculate significance (t-test)
Significance of fold change assessed with a statistical method
Clustering
Supervised/Unsupervised (Hierarchical, K-means, SOM)
Prediction or Supervised Machine Learning (SVM)
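The normalization and t-test steps listed above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a full analysis: the replicate values are invented, median centering stands in for "global normalization", and Welch's t-statistic (unequal variances) is one common choice of t-test. Only the standard library is used, so no p-value is computed here.

```python
import math
import statistics

def median_center(values):
    """Global normalization: subtract the slide median from every log-ratio."""
    med = statistics.median(values)
    return [v - med for v in values]

def welch_t(group_a, group_b):
    """Welch's t-statistic for two groups of replicate log2 expression values."""
    mean_a, mean_b = statistics.mean(group_a), statistics.mean(group_b)
    var_a, var_b = statistics.variance(group_a), statistics.variance(group_b)
    # Standard error of the difference between the two group means.
    se = math.sqrt(var_a / len(group_a) + var_b / len(group_b))
    return (mean_a - mean_b) / se

# Hypothetical log2 expression of one gene across replicate slides.
test = [2.1, 1.9, 2.3, 2.0]
reference = [0.9, 1.1, 1.0, 1.2]

log_fold_change = statistics.mean(test) - statistics.mean(reference)
t_stat = welch_t(test, reference)
print(f"log2 fold change = {log_fold_change:.3f}, t = {t_stat:.2f}")
```

With thousands of genes tested at once, the resulting p-values would normally be corrected for multiple testing before a list of differentially expressed genes is reported.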
Videos on Microarray
[Link] (animation)
[Link]
[Link]
What is RNA-Seq?
RNA-Seq is the process of sequencing the transcriptome which includes
protein coding and non-coding transcripts.
Applications:
Gene (exon, isoform) expression estimation
Differential gene (exon, isoform) expression analysis
Transcriptome assembly - Map exon, intron boundaries, splice junctions
Discovery of novel transcribed regions
Analyses of alternative splicing
Overview of RNA-Seq
Transcriptome profiling using NGS
Sequencing using RNA-Seq technology
(Major Steps)
1. RNA Isolation: Extract total RNA from cells or tissues.
2. mRNA Enrichment: Enrich mRNA using poly(A) selection.
3. cDNA Synthesis: Convert RNA to cDNA using reverse transcription.
4. Fragmentation: Break cDNA into smaller fragments for library preparation.
5. Adapter Ligation: Attach sequencing adapters to both ends of the cDNA fragments.
6. Library Amplification: Amplify the library using PCR to generate enough copies.
7. Quality Control: Assess the library's size and purity using tools like Bioanalyzer.
8. Sequencing: Sequence the cDNA library using a sequencer (e.g., Illumina).
9. Data Preprocessing: Trim adapters and remove low-quality reads.
10. Read Mapping/Assembly: Map reads to a reference genome (or assemble them de novo) to reconstruct the transcriptome sequence.
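FASTQ files
Line 1: Sequence identifier (starts with "@")
Line 2: Raw sequence
Line 3: Separator line starting with "+" (may repeat the identifier; carries no new information)
Line 4: Quality values for the sequence (one character per base)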
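To make the four-line FASTQ record concrete, here is a minimal Python sketch that reads records and decodes the quality string. It assumes Phred+33 encoding (the usual one for current Illumina data, but not stated in the slides) and unwrapped records of exactly four lines; the function name read_fastq and the example read are invented for illustration.

```python
import io

def read_fastq(handle):
    """Yield (identifier, sequence, qualities) from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break  # end of input
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError("malformed FASTQ record")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ")
        # Assumption: Phred+33 encoding, so quality score = ASCII code - 33.
        phred = [ord(ch) - 33 for ch in quality]
        yield header[1:], sequence, phred

# Hypothetical single read held in memory instead of a file on disk.
example = io.StringIO("@read1\nACGT\n+\nIIII\n")
records = list(read_fastq(example))
print(records)
```

A Phred score Q corresponds to an error probability of 10^(-Q/10), so the character "I" (Q = 40) means roughly one expected error in 10,000 base calls.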
Analysis Pipeline
Short reads to differential expression
1. Raw Sequence Data: FASTQ files (QC by FastQC/R)
2. Reads Mapping: unspliced mapping (BWA, Bowtie) or spliced mapping (TopHat, MapSplice)
3. Mapped Reads: SAM/BAM files (QC by RNA-SeQC)
4. Expression Quantification: summarize read counts, or FPKM/RPKM (Cufflinks)
5. DE testing: DESeq, edgeR, etc. (from read counts) or Cuffdiff (from Cufflinks output)
6. List of DE genes
7. Functional Interpretation: function enrichment, infer networks, integrate with other data
8. Biological Insights & hypothesis
Quantification of gene expression using RNA-seq
1. Mapping: alignment to genome (Hisat2, STAR)
2. Count reads per transcript: read counts tables
3. Normalization:
• RPKM: Reads Per Kilobase of transcript per Million mapped reads
• FPKM: Fragments Per Kilobase of transcript per Million mapped reads
• TPM: Transcripts Per Million (also written Transcripts Per Kilobase Million)
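The normalizations named above differ mainly in the order of the two scaling steps. The following Python sketch shows RPKM and TPM on a tiny invented count table; the counts, transcript lengths and function names are hypothetical, and the column sum of the table is used as a stand-in for "million mapped reads". FPKM follows the RPKM formula with fragments (read pairs) counted instead of reads.

```python
def rpkm(counts, lengths_bp):
    """Scale by library size (per million reads), then by transcript length (per kb)."""
    per_million = sum(counts) / 1e6
    return [c / per_million / (length / 1e3) for c, length in zip(counts, lengths_bp)]

def tpm(counts, lengths_bp):
    """Scale by transcript length first, then rescale so the values sum to one million."""
    rates = [c / (length / 1e3) for c, length in zip(counts, lengths_bp)]
    scale = sum(rates) / 1e6
    return [r / scale for r in rates]

# Hypothetical read counts and transcript lengths (bp) for three transcripts.
counts = [10, 20, 30]
lengths = [1000, 2000, 1000]

rpkm_values = rpkm(counts, lengths)
tpm_values = tpm(counts, lengths)
for i, (r, t) in enumerate(zip(rpkm_values, tpm_values), start=1):
    print(f"transcript {i}: RPKM = {r:.1f}, TPM = {t:.1f}")
```

Because TPM values always sum to one million within a sample, they are easier to compare across samples than RPKM/FPKM, whose totals vary from sample to sample.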
Patient -> Technologies -> Data Analysis -> Integration and interpretation
• Genomics (WGS, WES): point mutation, small indels, copy number variation, structural variation
• Transcriptomics (RNA-Seq): differential expression, gene fusion, alternative splicing, RNA editing
• Epigenomics (Bisulfite-Seq, ChIP-Seq): methylation, histone modification, transcription factor binding
• Integration and interpretation: functional effect of mutation, network and pathway analysis, integrative analysis
• Outcome: further understanding of cancer and clinical applications
Shyr D, Liu Q. Biol Proced Online. (2013) 15, 4
Concept of Single Cell
The basic unit of life
Why single cell gene expression?
Improvements in scRNA-seq methods
Cells profiled per study have scaled from ~10 to ~100 000
[Link]
From Svensson et al., 2018.
Benefits of single cell sequencing
Opens the door to several biological and clinical questions
✓ Understanding heterogeneous samples:
✓ E.g. analyse cellular heterogeneity during immune or
stem cell development
✓ Identification and analysis of rare cell types
✓ E.g. circulating tumor cells from liquid biopsy
✓ Understanding cellular transitions and switches in
cell state
✓ Dissecting complex infections and revealing drug resistance
genotypes
Gene Expression Omnibus (GEO)
➢A public repository for gene expression data
➢It is a primary database: data are submitted by the scientific community.
➢This repository maintains functional genomics data
➢MIAME (Minimum Information About a Microarray Experiment)-compliant data submissions.
➢Array- and sequence-based data are accepted.
➢Online resource for retrieval of gene expression data
➢Convenient for deposition of data, as required by funding agencies/journals.
The GDC Data Portal: An Overview
The Genomic Data Commons (GDC) Data Portal provides users with web-based access to data from cancer
genomics studies.
Key GDC Data Portal features
•Open, granular access to information about all datasets available in the GDC.
•Advanced search and visualization-assisted filtering of data files.
•Data visualization tools to support the analysis and exploration of data (including on a gene and mutation level from Open-Access MAF files).
•Cart for collecting data files of interest.
•Authentication using eRA Commons credentials and authorization using dbGaP for access to controlled data files.
•Secure data download directly from the cart or using the GDC Data Transfer Tool.
•For more information about available datasets, see the GDC Website.
Accessing the GDC Data Portal
The GDC Data Portal is accessible using a web browser such as Chrome, Firefox, or Microsoft Edge at the following URL: [Link]
The front page displays a summary of all available datasets:
[Link]
GDC data portal
Data Category: SNV, transcriptome profiling, CNV, sequencing reads,
biospecimens, clinical, DNA methylation, somatic mutation, combined nucleotide
variation
Data Type: raw simple somatic mutation, annotated somatic mutation, aligned
reads, gene expression quantification, and so on.
Clinical data: Collection of data related to patient diagnosis, demographics,
exposures, laboratory tests, and family relationships.
Data Retrieval: Data is searchable in the API, Data Portal, or Legacy Archive.
[Link]
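For retrieval through the API, a query is typically expressed as a JSON filter over project and data type. The sketch below only builds such a query URL and does not contact any server. The endpoint URL, the field names (cases.project.project_id, data_type) and the helper names are assumptions based on the public GDC API and are not given in these slides; check them against the GDC documentation before use.

```python
import json
from urllib.parse import urlencode

# Assumption: public GDC API "files" endpoint (not stated in the slides).
GDC_FILES_ENDPOINT = "https://api.gdc.cancer.gov/files"

def build_filters(project_id, data_type):
    """Return a nested filter: project == project_id AND data type == data_type."""
    return {
        "op": "and",
        "content": [
            {"op": "in", "content": {"field": "cases.project.project_id", "value": [project_id]}},
            {"op": "in", "content": {"field": "data_type", "value": [data_type]}},
        ],
    }

def build_query_url(project_id, data_type, size=10):
    """Encode the filter and a few requested fields as URL query parameters."""
    params = {
        "filters": json.dumps(build_filters(project_id, data_type)),
        "fields": "file_id,file_name,data_category,data_type",
        "format": "JSON",
        "size": str(size),
    }
    return f"{GDC_FILES_ENDPOINT}?{urlencode(params)}"

# Example: gene expression quantification files for the TCGA-BRCA project.
url = build_query_url("TCGA-BRCA", "Gene Expression Quantification")
print(url)
```

Controlled-access files additionally require an authentication token, which this sketch deliberately leaves out.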
The Cancer Genome Atlas (TCGA)
• Launched in 2006 as a pilot, expanded in 2009, ended in 2017
• NIH-funded program to perform a comprehensive and integrated analysis of key
genomic/molecular features of many cancers
• A ‘marker paper’ in each project to provide fundamental insights
• Make the data publicly available to the research community
• Serves as a model for the power of teamwork in science.
U.K., France, Netherlands, Canada, U.S.
• Uveal melanoma chosen as one of 10 rare cancers included
TCGA history
Initiated in 2005, to catalogue genetic mutations responsible for cancer
TCGA is supervised by the National Cancer Institute's Center for Cancer Genomics and the National Human Genome
Research Institute.
A three-year pilot project, begun in 2006, focused on characterization of three types of
human cancers: glioblastoma, lung, and ovarian cancer.
In 2009, it expanded into phase II, which planned to complete the genomic
characterization and sequence analysis of 20–25 different tumor types by 2014
Contains gene expression, copy number variation, SNP genotyping, DNA methylation data, etc.
There are 3554 authorized requesters associated with TCGA study (currently)
Project Cases Seq Exp SNV CNV Meth Clinical Clinical Supplement
TCGA-BRCA 1,098 1,098 1,097 1,044 1,098 1,095 1,098 1,098
TCGA-GBM 617 406 166 396 599 423 617 617
TCGA-OV 608 575 492 443 597 602 608 608
TCGA-LUAD 585 582 519 569 518 579 585 585
TCGA-UCEC 560 559 559 542 558 559 560 560
TCGA-KIRC 537 535 534 339 534 533 537 537
TCGA-HNSC 528 528 528 510 526 528 528 528
TCGA-LGG 516 516 516 513 515 516 516 516
TCGA-THCA 507 507 507 496 505 507 507 507
TCGA-LUSC 504 504 504 497 504 503 504 504
TCGA-PRAD 500 498 498 498 498 498 500 500
TCGA-SKCM 470 470 469 470 470 470 470 470
TCGA-COAD 461 460 459 433 460 458 461 461
TCGA-STAD 443 443 439 441 443 443 443 443
TCGA-BLCA 412 412 412 412 412 412 412 412
TCGA-LIHC 377 377 376 375 376 377 377 377
TCGA-CESC 307 307 307 305 302 307 307 307
TCGA-KIRP 291 291 291 288 290 291 291 291
TCGA-SARC 261 261 261 255 261 261 261 261
TCGA-LAML 200 195 188 149 200 140 200 200
TCGA-ESCA 185 185 184 184 185 185 185 185
TCGA-PAAD 185 185 178 183 185 184 185 185
TCGA-PCPG 179 179 179 179 179 179 179 179
TCGA-READ 172 171 167 158 167 165 172 172
TCGA-TGCT 150 150 150 150 150 150 150 150
TCGA-THYM 124 124 124 123 124 124 124 124
TCGA-KICH 113 66 66 66 66 66 113 113
TCGA-ACC 92 92 80 92 92 80 92 92
TCGA-MESO 87 87 87 83 87 87 87 87
TCGA-UVM 80 80 80 80 80 80 80 80
TCGA-DLBC 58 48 48 37 50 48 58 58
TCGA-UCS 57 57 57 57 57 57 57 57
TCGA-CHOL 51 51 36 51 36 36 51 51
Total 11,315 10,999 10,558 10,418 11,124 10,943 11,315 11,315
Types of data
• Core dataset:
➢ Pathology report
➢ Histology images
➢ Clinical data
➢ Whole exome-seq
➢ SNP 6.0 array
➢ mRNAseq
➢ miRNAseq
➢ Methylation array
• Future datasets:
❖ 50x Whole-genome sequencing
❖ Bisulfite sequencing
❖ Protein Array
Single Cell Expression Atlas
Discover and interpret gene
expression analysis results
at single cell level
[Link]/gxa/sc/
Http://[Link]/raghava/cancerdr/
Overall Architecture of CancerlivER
Cancer Biomarkers
◆ A biomarker is a biological molecule found in blood, other body fluids, or tissues that is a sign of a
normal or abnormal process, or of a condition or disease (National Cancer Institute (NCI))
Biomarkers
Based on Disease State
• Diagnostics
• Prognostics
• Predictive
Based on Biomolecules
• DNA Biomarker
• RNA Biomarker
• Protein Biomarker
• Glyco Biomarker
Thank You
[Link]
Questions & Answers Session