Week 9

The document discusses gene expression quantification and its applications in biomedical fields such as drug discovery, vaccine design, and biomarker identification. It covers various technologies for analyzing gene expression, including microarray and RNA sequencing, and highlights the importance of data repositories like the Gene Expression Omnibus and the Genomic Data Commons. Additionally, it outlines the history and evolution of genomic sequencing and transcriptomics, emphasizing the significance of projects like The Cancer Genome Atlas in cancer research.

Uploaded by

smishra8094

Gene Expression : Quantification of Information

Molecules and their Applications

Prof. Gajendra P.S. Raghava


Head, Center for Computational Biology

Web Site: [Link]

These slides were created using various resources, so no claim of authorship is made on any slide.
Biomedical Applications
Concept level: ★ Proteome annotation ★ Drug discovery ★ Vaccine design ★ Biomarkers

Molecules or objects:
• Proteins & Peptides: structure prediction, subcellular localization, therapeutic application, ligand binding
• Gene Expression: disease biomarkers, drug biomarkers, mRNA expression, copy number variation
• Chemoinformatics: drug design, chemical descriptor, QSAR models, personalized inhibitors
• Image annotation: image classification, medical images, disease classification, disease diagnostics
Molecular Biology Overview
[Figure: cell nucleus → chromosome → gene (DNA) → gene (mRNA, single strand) → protein]
History of genome sequencing
• 1977 bacteriophage φX174 (5,386 bp, 11 genes)
• 1981 mitochondrial genome (16,568 bp)
• 1986 chloroplast genome (120,000 bp)
• 1995 Haemophilus influenzae (1.8 Mb)
• 1996 Saccharomyces whole genome (12.1 Mb)
• 1997 E. coli (4.6 Mb; 4,200 proteins)
• 1998 Caenorhabditis elegans (97 Mb; 19,000 genes)
• 2000 Arabidopsis thaliana (115 Mb; 30,000 genes)
• 2001 mouse (1 year!)
• 2001 Homo sapiens (2 projects)
• 2005 Pan, rice
• 2006 Populus
Analysing the flow of genetic information

Structural genomics
• Genome mapping
• Genome sequencing
• Genome annotations

Functional genomics
• DNA (Genome, in the nucleus) → pre-mRNA → mRNA (Transcriptome, in the cytoplasm)
  • DNA arrays and chips
  • RNA sequencing
  • (semi) qRT-PCR
  • Northern blot + hybridization
  • Transcriptional fusions
• Proteins (Proteome)
  • 2D electrophoresis
  • Gel-free methods: mass spectrometry, protein sequencing
  • Translational fusions
  • Immunodetection
  • Enzyme activities
• Metabolites (Metabolome)
  • Chromatography
  • Mass spectrometry
  • NMR

Metabolomics includes Glycomics (sugars) and Lipidomics (lipids).
World of OMICs
[Figure: organ/tissue → cell nucleus → chromosomes (23 pairs) → chromatin]
• Genomics: DNA (4 chemicals: A, T, G, C), about 3×10^9 bases
• Epigenomics: chromatin marks (shown as M and Ac in the figure)
• Transcriptomics: mRNA (copies), non-coding RNA, miRNA
• Proteomics: protein (20 chemicals: A, C, D ..)
• Glycomics: sugars attached to proteins
The evolution of transcriptomics
• Hybridization-based: P. Brown et al., gene expression profiling using spotted cDNA microarray: expression levels of known genes
• Hybridization-based: Affymetrix, whole-genome expression profiling using tiling array: identifying and profiling novel genes and splicing variants
• 2008, many groups, mRNA-seq: direct sequencing of mRNAs using next generation sequencing techniques (NGS)
History
➢ 1980s: antibody-based assay (protein chip?)

➢ ~1991: high-density DNA-synthetic chemistry (Affymetrix/oligo chips)

➢ ~1995: microspotting (Stanford Univ/cDNA chips)

➢ Replacing porous surfaces with solid surfaces, and radioactive labels with fluorescent labels, improved sensitivity
Stanford/cDNA chip
Flow diagram of cDNA chip microarray technology, where we detect the relative expression of each gene
cDNA Microarray Technology

Major Steps
1. Spot cloned cDNAs onto a glass microscope slide

2. Label 2 RNA samples with 2 different colors of fluorescent dye

3. Mix two labeled RNAs and hybridize to the chip

4. Make two scans - one for each color

5. Calculate ratios of amounts of each RNA that bind to each spot


Gene Expression Data

For p genes on n slides: p is O(10,000) and n is O(10-100), but growing.

Rows are genes, columns are slides:

Gene | slide 1 | slide 2 | slide 3 | slide 4 | slide 5 | …
1 | 0.46 | 0.30 | 0.80 | 1.51 | 0.90 | ...
2 | -0.10 | 0.49 | 0.24 | 0.06 | 0.46 | ...
3 | 0.15 | 0.74 | 0.04 | 0.10 | 0.20 | ...
4 | -0.45 | -1.03 | -0.79 | -0.56 | -0.32 | ...
5 | -0.06 | 1.06 | 1.35 | 1.09 | -1.09 | ...

Gene expression level of gene 5 in slide 4 (1.09 above) = log2(Red intensity / Green intensity)

These values are conventionally displayed on a red (>0) yellow (0) green (<0)
scale.
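
The log ratio and colour convention above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the spot intensities and the names `log2_ratio` and `colour` are hypothetical, and the intensities are assumed to be already background-corrected and positive.

```python
import math

def log2_ratio(red, green):
    """Return log2(red / green) for one spot; both intensities must be positive."""
    if red <= 0 or green <= 0:
        raise ValueError("intensities must be positive (subtract background first)")
    return math.log2(red / green)

def colour(value, tol=1e-9):
    """Map a log ratio onto the conventional red (>0), yellow (0), green (<0) scale."""
    if value > tol:
        return "red"
    if value < -tol:
        return "green"
    return "yellow"

# Hypothetical background-corrected (red, green) intensities, not taken from the slides.
spots = {"gene_a": (800.0, 200.0), "gene_b": (150.0, 150.0), "gene_c": (100.0, 400.0)}
ratios = {gene: log2_ratio(r, g) for gene, (r, g) in spots.items()}

for gene, value in ratios.items():
    print(gene, round(value, 2), colour(value))
```

Taking the logarithm makes the scale symmetric: a four-fold increase gives +2 and a four-fold decrease gives -2.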
Affymetrix Expression Arrays

Flow chart of Affymetrix


Comparison of two technologies
Stanford/cDNA chip and Affymetrix/oligo chip

Aspect | cDNA Microarray | Affymetrix GeneChip
Technology | Hybridization-based, uses cDNA probes (longer sequences). | Oligonucleotide-based, uses short 25-mer probes.
Probe Type | Double-stranded cDNA (500–5,000 bp). | Short oligonucleotides (~25 base pairs).
Target Labeling | Fluorescent dyes (Cy3/Cy5) for co-hybridization. | Single fluorescent label (usually biotin-streptavidin).
Data Acquisition | Relative expression comparison of two samples. | Absolute or relative quantification per sample.
Sensitivity | Lower sensitivity, especially for low-abundance transcripts. | Higher sensitivity due to specific oligonucleotides.
Quantification | Ratios of fluorescence intensities between samples. | Signal intensity corresponds to expression level.
Cost | Relatively lower. | Relatively higher.
Examples of Use | Comparison of gene expression between two conditions. | Genome-wide expression profiling, genotyping.
Analysis of Microarray Data
• Analysis of images
• Preprocessing of gene expression data
  • Normalization of data
    • Subtraction of background noise
    • Global/local normalization
    • Housekeeping genes (or same gene)
    • Expression as a ratio (test/reference) on a log scale
• Differential gene expression
  • Use replicates and calculate significance (t-test)
  • Assess the significance of fold change with a statistical method
• Clustering
  • Supervised/unsupervised (hierarchical, K-means, SOM)
• Prediction or supervised machine learning (SVM)
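
Two of the steps above, global normalization and a t-test on replicates, can be sketched as follows. This is an illustrative sketch, not the method used in the slides: it assumes log2 ratios as input, uses median centring as one simple form of global normalization, and computes Welch's t statistic with the standard library only. All data values and function names are hypothetical.

```python
import math
import statistics

def median_center(log_ratios):
    """Global normalization: subtract the slide median so a typical gene sits at 0."""
    m = statistics.median(log_ratios)
    return [x - m for x in log_ratios]

def welch_t(a, b):
    """Welch's t statistic for two groups of replicate log-expression values."""
    var_a, var_b = statistics.variance(a), statistics.variance(b)
    se = math.sqrt(var_a / len(a) + var_b / len(b))
    return (statistics.mean(a) - statistics.mean(b)) / se

# Hypothetical log2 ratios for five genes on one slide.
slide = [0.5, 0.7, 0.6, 1.9, -0.4]
normalized = median_center(slide)

# Hypothetical replicate log2 expression values for one gene in two conditions.
test_group = [2.1, 2.4, 2.2, 2.3]
ref_group = [1.0, 1.2, 0.9, 1.1]
log_fold_change = statistics.mean(test_group) - statistics.mean(ref_group)
t_stat = welch_t(test_group, ref_group)

print("normalized slide:", [round(x, 2) for x in normalized])
print("log2 fold change:", round(log_fold_change, 2), "t statistic:", round(t_stat, 2))
```

Turning the t statistic into a p-value needs the t distribution (for example `scipy.stats.ttest_ind` with `equal_var=False`), and with thousands of genes a multiple-testing correction would normally follow.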
Videos on Microarray
• [Link] (animation)
• [Link]
• [Link]
What is RNA-Seq?
RNA-Seq is the process of sequencing the transcriptome, which includes both protein-coding and non-coding transcripts.

Applications:
• Gene (exon, isoform) expression estimation
• Differential gene (exon, isoform) expression analysis
• Transcriptome assembly: mapping exon and intron boundaries and splice junctions
• Discovery of novel transcribed regions
• Analysis of alternative splicing
Overview of RNA-Seq
Transcriptome profiling using NGS
Sequencing using RNA-Seq technology
(Major Steps)
1. RNA Isolation: Extract total RNA from cells or tissues.
2. mRNA Enrichment: Enrich mRNA using poly(A) selection.
3. cDNA Synthesis: Convert RNA to cDNA using reverse transcription.
4. Fragmentation: Break cDNA into smaller fragments for library preparation.
5. Adapter Ligation: Attach sequencing adapters to both ends of the cDNA fragments.
6. Library Amplification: Amplify the library using PCR to generate enough copies.
7. Quality Control: Assess the library's size and purity using tools like Bioanalyzer.
8. Sequencing: Sequence the cDNA library using a sequencer (e.g., Illumina).
9. Data Preprocessing: Trim adapters and remove low-quality reads.
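10. Read Mapping/Assembly: Map or assemble the reads to reconstruct the transcriptome sequence.

FASTQ files
Each read is stored as four lines:
Line 1: Sequence identifier (starts with "@")
Line 2: Raw sequence
Line 3: Separator line starting with "+" (carries no additional information)
Line 4: Quality values for the sequence, one character per base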
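
The four-line record layout can be read with a short parser. This is a sketch under stated assumptions: the function name `parse_fastq` and the example read are hypothetical, and quality characters are assumed to use the Phred+33 encoding (common for current Illumina data, but older files may differ).

```python
def parse_fastq(lines):
    """Yield (identifier, sequence, phred_scores) from an iterable of FASTQ lines."""
    lines = [ln.rstrip("\n") for ln in lines if ln.strip()]
    if len(lines) % 4 != 0:
        raise ValueError("FASTQ input must contain a multiple of four lines")
    for i in range(0, len(lines), 4):
        header, seq, sep, qual = lines[i:i + 4]
        if not header.startswith("@") or not sep.startswith("+"):
            raise ValueError(f"malformed record number {i // 4 + 1}")
        if len(seq) != len(qual):
            raise ValueError("sequence and quality strings differ in length")
        # Assumption: Phred+33 encoding, so quality = ASCII code - 33.
        scores = [ord(c) - 33 for c in qual]
        yield header[1:], seq, scores

# Hypothetical single-read example, not taken from the slides.
example = ["@read1\n", "ACGT\n", "+\n", "II5!\n"]
records = list(parse_fastq(example))

for name, seq, scores in records:
    mean_q = sum(scores) / len(scores)
    print(name, seq, scores, "mean quality:", mean_q)
```

A preprocessing step such as "remove low-quality reads" could then filter on the mean or minimum of `scores`; dedicated tools are normally used for this in practice.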
Analysis Pipeline
Short reads to differential expression
1. Raw sequence data (FASTQ files); QC by FastQC/R
2. Reads mapping: unspliced mapping (BWA, Bowtie) or spliced mapping (TopHat, MapSplice)
3. Mapped reads (SAM/BAM files); QC by RNA-SeQC
4. Expression quantification: summarize read counts, or compute FPKM/RPKM (Cufflinks)
5. DE testing: DESeq, edgeR, etc. on read counts, or Cuffdiff on FPKM/RPKM
6. List of DE genes
7. Functional interpretation: function enrichment, infer networks, integrate with other data
8. Biological insights & hypothesis


Quantification of gene expression using RNA-seq
1. Mapping: alignment to genome (HISAT2, STAR)
2. Count reads per transcript to obtain read count tables
3. Normalization:
   • RPKM: Reads Per Kilobase of transcript per Million mapped reads
   • FPKM: Fragments Per Kilobase of transcript per Million mapped reads
   • TPM: Transcripts Per Million (also written Transcripts Per Kilobase Million)
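
The normalizations above can be made concrete with a small calculation. This is a sketch with hypothetical counts and transcript lengths. It uses the usual definitions: RPKM = count × 10^9 / (total mapped reads × transcript length in bp), and TPM rescales the per-kilobase rates so that they sum to one million. FPKM follows the same formula as RPKM with fragments counted instead of reads.

```python
def rpkm(counts, lengths_bp):
    """Reads per kilobase of transcript per million mapped reads."""
    total = sum(counts.values())
    return {g: counts[g] * 1e9 / (total * lengths_bp[g]) for g in counts}

def tpm(counts, lengths_bp):
    """Transcripts per million: length-normalize first, then scale to 1e6."""
    rate = {g: counts[g] / (lengths_bp[g] / 1000) for g in counts}
    scale = sum(rate.values()) / 1e6
    return {g: r / scale for g, r in rate.items()}

# Hypothetical read counts and transcript lengths (bp), not from the slides.
counts = {"geneA": 100, "geneB": 300, "geneC": 600}
lengths = {"geneA": 1000, "geneB": 3000, "geneC": 2000}

rpkm_values = rpkm(counts, lengths)
tpm_values = tpm(counts, lengths)

for gene in counts:
    print(gene, round(rpkm_values[gene]), round(tpm_values[gene]))
```

In this toy example geneB has three times the reads of geneA but is three times longer, so both get the same RPKM and TPM. TPM values always sum to one million within a sample, which makes proportions easier to compare across samples than RPKM.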
Patient → Technologies → Data Analysis → Integration and interpretation

• Genomics (WGS, WES): point mutation, small indels, copy number variation, structural variation
• Transcriptomics (RNA-Seq): differential expression, gene fusion, alternative splicing, RNA editing
• Epigenomics (Bisulfite-Seq, ChIP-Seq): methylation, histone modification, transcription factor binding
• Integration and interpretation: functional effect of mutation, network and pathway analysis, integrative analysis
• Outcome: further understanding of cancer and clinical applications

Shyr D, Liu Q. Biol Proced Online. (2013) 15, 4


Concept of Single Cell

The basic unit of life


Why single cell gene expression?
Improvements in scRNA-seq methods
[Figure: scale of scRNA-seq experiments, axis from ~10 to ~100 000. From Svensson et al., 2018.]
[Link]
Benefits of single cell sequencing
Opens the door to several biological and clinical questions

✓ Understanding heterogeneous samples
  ✓ E.g. analyse cellular heterogeneity during immune or stem cell development
✓ Identification and analysis of rare cell types
  ✓ E.g. circulating tumor cells from liquid biopsy
✓ Understanding cellular transitions and switches in cell state
✓ Dissecting complex infections and revealing drug resistance genotypes
Gene Expression Omnibus (GEO)
➢ A public repository for gene expression data
➢ It is a primary database: data are submitted directly by the scientific community.
➢ The repository maintains functional genomics data.
➢ Accepts MIAME (Minimum Information About a Microarray Experiment)-compliant data submissions.
➢ Array- and sequence-based data are accepted.
➢ Online resource for retrieval of gene expression data
➢ Convenient for deposition of data, as required by funding agencies/journals.
The GDC Data Portal: An Overview
The Genomic Data Commons (GDC) Data Portal provides users with web-based access to data from cancer
genomics studies.
Key GDC Data Portal features
•Open, granular access to information about all datasets available in the GDC.
•Advanced search and visualization-assisted filtering of data files.
•Data visualization tools to support the analysis and exploration of data (including on a gene and mutation level from Open-Access MAF files).
•Cart for collecting data files of interest.
•Authentication using eRA Commons credentials and authorization using dbGaP for access to controlled data files.
•Secure data download directly from the cart or using the GDC Data Transfer Tool.
•For more information about available datasets, see the GDC Website.

Accessing the GDC Data Portal


The GDC Data Portal is accessible using a web browser such as Chrome, Firefox, or Microsoft Edge at the following URL: [Link]
The front page displays a summary of all available datasets:

[Link]
GDC data portal

• Data Category: SNV, transcriptome profiling, CNV, sequencing reads, biospecimens, clinical, DNA methylation, somatic mutation, combined nucleotide variation

• Data Type: raw simple somatic mutation, annotated somatic mutation, aligned reads, gene expression quantification, and so on.

• Clinical data: Collection of data related to patient diagnosis, demographics, exposures, laboratory tests, and family relationships.

• Data Retrieval: Data is searchable in the API, Data Portal, or Legacy Archive.
[Link]
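
Searching through the API can be sketched by building a query for gene expression quantification files from one project. Everything specific here is an assumption rather than something stated in the slides: the endpoint URL `https://api.gdc.cancer.gov/files`, the field names `cases.project.project_id` and `files.data_type`, and the helper name `build_files_query` reflect my understanding of the public GDC API and should be checked against the current GDC documentation. The sketch only constructs the request URL and does not contact the server.

```python
import json
from urllib.parse import urlencode, urlparse, parse_qs

# Assumed public endpoint; verify against the GDC API documentation.
GDC_FILES_ENDPOINT = "https://api.gdc.cancer.gov/files"

def build_files_query(project_id, data_type, size=10):
    """Build a GET URL that filters files by project and data type."""
    filters = {
        "op": "and",
        "content": [
            # Assumed field names; the API may expect different ones.
            {"op": "in", "content": {"field": "cases.project.project_id",
                                     "value": [project_id]}},
            {"op": "in", "content": {"field": "files.data_type",
                                     "value": [data_type]}},
        ],
    }
    params = {
        "filters": json.dumps(filters),
        "fields": "file_id,file_name,data_type",
        "format": "JSON",
        "size": str(size),
    }
    return f"{GDC_FILES_ENDPOINT}?{urlencode(params)}"

url = build_files_query("TCGA-BRCA", "Gene Expression Quantification", size=5)
print(url)

# Round-trip check: decode the query string back into the filter structure.
decoded = json.loads(parse_qs(urlparse(url).query)["filters"][0])
print(decoded["content"][0]["content"]["value"])
```

The resulting URL could be fetched with any HTTP client. Controlled-access files additionally require an authentication token, as noted in the portal features above.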
The Cancer Genome Atlas (TCGA)

• Launched in 2006 as a pilot, expanded in 2009, ended in 2017

• NIH-funded program to perform a comprehensive and integrated analysis of key genomic/molecular features of many cancers

• A ‘marker paper’ in each project to provide fundamental insights

• Make the data publicly available to the research community

• Serves as a model for the power of teamwork in science.


U.K., France, Netherlands, Canada, U.S.

• Uveal melanoma chosen as one of 10 rare cancers included


TCGA history
• Initiated in 2005 to catalogue the genetic mutations responsible for cancer

• TCGA is supervised by the Center for Cancer Genomics and the National Human Genome Research Institute.

• A three-year pilot project, begun in 2006, focused on the characterization of three types of human cancer: glioblastoma, lung, and ovarian cancer.

• In 2009, it expanded into phase II, which planned to complete the genomic characterization and sequence analysis of 20–25 different tumor types by 2014.

• Contains gene expression, copy number variation, SNP genotyping, DNA methylation data, etc.

• There are currently 3,554 authorized requesters associated with the TCGA study.
Project Cases Seq Exp SNV CNV Meth Clinical Clinical Supplement
TCGA-BRCA 1,098 1,098 1,097 1,044 1,098 1,095 1,098 1,098
TCGA-GBM 617 406 166 396 599 423 617 617
TCGA-OV 608 575 492 443 597 602 608 608
TCGA-LUAD 585 582 519 569 518 579 585 585
TCGA-UCEC 560 559 559 542 558 559 560 560
TCGA-KIRC 537 535 534 339 534 533 537 537
TCGA-HNSC 528 528 528 510 526 528 528 528
TCGA-LGG 516 516 516 513 515 516 516 516
TCGA-THCA 507 507 507 496 505 507 507 507
TCGA-LUSC 504 504 504 497 504 503 504 504
TCGA-PRAD 500 498 498 498 498 498 500 500
TCGA-SKCM 470 470 469 470 470 470 470 470
TCGA-COAD 461 460 459 433 460 458 461 461
TCGA-STAD 443 443 439 441 443 443 443 443
TCGA-BLCA 412 412 412 412 412 412 412 412
TCGA-LIHC 377 377 376 375 376 377 377 377
TCGA-CESC 307 307 307 305 302 307 307 307
TCGA-KIRP 291 291 291 288 290 291 291 291
TCGA-SARC 261 261 261 255 261 261 261 261
TCGA-LAML 200 195 188 149 200 140 200 200
TCGA-ESCA 185 185 184 184 185 185 185 185
TCGA-PAAD 185 185 178 183 185 184 185 185
TCGA-PCPG 179 179 179 179 179 179 179 179
TCGA-READ 172 171 167 158 167 165 172 172
TCGA-TGCT 150 150 150 150 150 150 150 150
TCGA-THYM 124 124 124 123 124 124 124 124
TCGA-KICH 113 66 66 66 66 66 113 113
TCGA-ACC 92 92 80 92 92 80 92 92
TCGA-MESO 87 87 87 83 87 87 87 87
TCGA-UVM 80 80 80 80 80 80 80 80
TCGA-DLBC 58 48 48 37 50 48 58 58
TCGA-UCS 57 57 57 57 57 57 57 57
TCGA-CHOL 51 51 36 51 36 36 51 51
Total 11,315 10,999 10,558 10,418 11,124 10,943 11,315 11,315
Types of data

• Core dataset:
➢ Pathology report
➢ Histology images
➢ Clinical data
➢ Whole exome-seq
➢ SNP 6.0 array
➢ mRNAseq
➢ miRNAseq
➢ Methylation array

• Future datasets:
❖ 50x whole-genome sequencing
❖ Bisulfite sequencing
❖ Protein array
Single Cell Expression Atlas

Discover and interpret gene expression analysis results at single cell level

[Link]/gxa/sc/
Http://[Link]/raghava/cancerdr/
Overall Architecture of CancerlivER
Cancer Biomarkers
◆ A biomarker is a biological molecule found in blood, other body fluids, or tissues that is a sign of a normal or abnormal process, or of a condition or disease (National Cancer Institute (NCI))

Biomarkers
• Based on disease state: diagnostic, prognostic, predictive
• Based on biomolecules: DNA biomarker, RNA biomarker, protein biomarker, glyco biomarker
Thank You
[Link]

Questions & Answers Session
