Introduction to Bioinformatics Course

Uploaded by

kkarenke87

Introduction to Bioinformatics
Online Course: IBT
Genomics
Sequencing technologies and NGS Overview

Introduction to Bioinformatics Online Course: IBT

Genomics | Fatma Guerfali

Learning Objectives

Session 1:
Sequencing technologies and NGS Overview

- Part 1: Introduction to DNA Sequencing
- Part 2: DNA Sequencing in the NGS era
- Part 3: Overview of NGS Technologies
- Part 4: DNA-Seq Protocol: Overview
- Part 5: DNA-Seq Analysis Pipeline and File Formats

Learning Outcomes

Session 1:
Sequencing technologies and NGS Overview

- Understand the basics of NGS technologies
- Understand the different NGS file formats
- Navigate through database repositories to retrieve NGS datasets

Part 1

Introduction to DNA Sequencing

What is DNA Sequencing?

DNA sequencing is the process of reading the nucleotides present in DNA: determining the precise order of nucleotides within a DNA molecule.

Today, DNA-Seq generally refers to any NGS method or technology that is used to determine the order of the four bases (A, T, C, G) in a strand of DNA.

What is DNA Sequencing?

There are 2 main types of DNA sequencing technologies in use today: Sanger sequencing and Next-Generation Sequencing (NGS).

Each of these technologies has its uses in today's genetic analysis environment.

Main DNA sequencing technologies

1953: DNA structure as a double helix (Watson & Crick)
1972-1976: Sequence of the 1st complete gene and the complete genome of bacteriophage MS2 (viral RNA, 3,569 nt) (Min Jou et al., 1972; Fiers et al., 1976)
1977: Maxam & Gilbert + Sanger sequencing
1977: The first full DNA genome to be sequenced: bacteriophage φX174 (Sanger et al., Feb 1977)
1987: ABI markets the ABI370, the 1st fully automated sequencing machine
1995: First published use of WGS sequencing
1995: 1st complete genome of a free-living organism, the bacterium H. influenzae (circular chromosome, >1.8 Mb) ([Link])
1996: First NGS technology: pyrosequencing
2001: Draft sequence of the human genome (shotgun sequencing methods), public & private (Lander et al., Feb 2001, Nature; Venter et al., Feb 2001, Science)

The Human Genome Project:
the objectives

The Human Genome Project (HGP) = a 13-year (1990 to April 14, 2003) international effort to sequence the 3 billion "letters" of human DNA.

Led by the U.S. DoE and the NIH. Generating the initial draft sequence cost about $300 million worldwide; the project as a whole is usually estimated at about $3 billion.

International Human Genome Sequencing Consortium (IHGSC) = group of publicly funded researchers.

At any given time, ≈ 200 labs in the United States supported these efforts, and > 18 different countries from across the globe contributed to the HGP.

Chial, 2008
http://[Link]/sequencingcosts/
[Link]

The Human Genome Project:
the objectives

2 groups competing to sequence the genome:
- Public
- Private (Celera Genomics)

Opposing philosophies:
- HGP, Bermuda Agreement (1996)
  → all information from the project would be made freely available to all within 24 h.
- Private
  → access restricted to paying customers!

In February 2001, drafts of the human genome sequence were published simultaneously by the public and private groups in separate articles (Lander et al. (IHGSC), Feb 2001, Nature; Venter et al., Feb 2001, Science).

Chial, 2008
http://[Link]/sequencingcosts/
http://[Link]/
[Link]

The Human Genome Project:
the method (WGS)

Hierarchical genome shotgun sequencing

- Shotgun phase
  → Genome fragmented into large segments
  → Cloning into vectors
  → Sequencing of the clones
  → Assembly of the shotgun sequences
  → Relied on the physical map of the human genome established earlier.

- Finishing phase: filling in gaps and resolving DNA sequences in ambiguous areas

Whole-genome shotgun sequencing (Celera Genomics)
  → Genome sheared randomly into small fragments (appropriately sized for sequencing)
  → Reassembly.

The IHGSC and Celera used the same general chain-termination method for the DNA sequencing (Hood & Galas, 2003).

Sequencing Quality

Sequencing quality depends on the average number of times each base in the genome is 'read' during the sequencing process.

For the Human Genome Project (HGP):
- 'draft sequence' (covering ~90% of the genome at ~99.9% accuracy)
- 'finished sequence' (covering >95% of the genome at ~99.99% accuracy).
Producing truly high-quality 'finished' sequence by this definition is very expensive and labor-intensive.

Several releases of the human genome sequence

http://[Link]/sequencingcosts/

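The "average number of times each base is read" can be estimated before a run. The sketch below is not from the course material: it assumes reads of equal length placed uniformly at random on the genome (the classical Lander-Waterman idealisation), and the function names and example numbers are purely illustrative.

```python
import math


def average_coverage(num_reads: int, read_length: int, genome_size: int) -> float:
    """Average number of times each base is read: c = N * L / G."""
    return num_reads * read_length / genome_size


def fraction_uncovered(coverage: float) -> float:
    """Expected fraction of bases read zero times, assuming reads land
    uniformly at random (Poisson approximation): exp(-c)."""
    return math.exp(-coverage)


# Illustrative numbers only: 30 million reads of 500 bp on a 3 Gb genome.
c = average_coverage(num_reads=30_000_000, read_length=500, genome_size=3_000_000_000)
print(f"average coverage: {c:.1f}X")
print(f"expected uncovered fraction: {fraction_uncovered(c):.4f}")
```

Real genomes contain repeats and biased regions, so actual gaps are usually larger than this idealised estimate suggests.
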
Finished Genome vs Draft Genome

Variable degrees of completion of published genomes

- Draft sequencing
  - High-throughput or shotgun phase (whole-genome or clone-based approach)
  - Assembly using specific algorithms (whole-genome or single-clone assembly)
  → Lower accuracy than finished sequence; some segments are missing or in the wrong order or orientation.

- Finishing
  - Accurate base identification + quality check + few if any gaps.
  - Contiguous segments of sequence are ordered and linked to one another.
  - No ambiguities or discrepancies about segment order and orientation.

- Complete genome
  A genome represented by a single contiguous sequence with no ambiguities.

→ The sequences available are finished to a certain high quality.
(Mardis et al., 2002)
http://[Link]ti[Link]/

The Human Genome Project:
the heritage

The HGP required that all human genome sequence information be freely and publicly available. The existing DNA sequences are stored in databases available to anyone wishing to exploit and analyze them.

Dedicated databases house various data for model organisms, such as sequences of known and hypothetical genes and proteins (GenBank, NCBI). Other databases (Ensembl, http://[Link]) present additional data and annotation as well as powerful tools for visualizing and searching them.

Community efforts exist for non-model organisms such as eukaryotic pathogens: EuPathDB (http://[Link]/eupathdb/).

Computer programs have been developed to analyze and interpret the data.

The Human Genome Project:
the heritage

The human genome contains only about 20,000 protein-coding genes, similar in number and with largely orthologous functions to those of nematodes, which have only about 1,000 somatic cells.

The extent of non-protein-coding DNA increases with increasing complexity, reaching > 98% in humans.

→ ENCODE
→ GENCODE

(Mattick, 2011)

The Human Genome Project:
the heritage

ENCODE (https://[Link])
The Encyclopedia of DNA Elements (ENCODE) Project aims to provide "a list of functional elements in the human genome, including elements that act at the protein and RNA levels, and regulatory elements that control cells and circumstances in which a gene is active." "The generation of such a catalogue is crucial for understanding genome function."

GENCODE (http://[Link])
The human genome has been the focus of intensive manual annotation: the GENCODE Consortium aims to identify all gene features in the human and mouse genomes using a combination of computational analysis, manual annotation, and experimental validation.
(Djebali et al., 2012)
(Harrow et al., 2012)
(Birney et al., 2007)

The Human Genome Project:
the heritage

Genetic differences in individual bases (SNPs) of a genome are by far the most common type of genetic variation.

Goal: develop a Haplotype Map of the Human Genome = identification and cataloging of most of the millions of SNPs estimated to occur commonly in the human genome.

The map describes what these variants are, their location, and their distribution among people within populations and among populations in different parts of the world. → Designed to provide information to link genetic variants to the risk for specific diseases.

1000 Genomes Project: the catalogue has become more complete and reliable as many novel variants have been discovered.

http://[Link]

The Human Genome Project:
the heritage

1000 Genomes ([Link]/)

► Identify most genetic variants with frequencies of at least 1%.
► Freely accessible resource of human genetic variation.
► Final data set = data for 2,504 individuals from 26 populations (low-coverage sequencing and exome sequence data for all).
► International Genome Sample Resource (IGSR) for ongoing usability of data generated by the 1000 Genomes Project.

UK10K ([Link]/)

► Identification of rare genetic variants through the study of the DNA of 4,000 individuals and its comparison to the protein-coding areas of 6,000 people with documented diseases.
► Link between genetic variants and rare diseases.

http://[Link]

The Human Genome Project:
the heritage

Development of novel technologies to help increase the depth of sequencing: Next-Generation Sequencing (NGS) technologies.

Since their development, NGS technologies have gained increasing attention, with considerable potential applications in both diagnostic and public health microbiology.

They revolutionized the sequencing process: from Sanger to high-throughput (HT) sequencing.

(Salipante et al., 2013)

Part 2

DNA Sequencing in the NGS era

DNA Sequencing method:
from Sanger to NGS

Sanger sequencing is the method developed by Frederick Sanger in 1977.

This method involves copying single-stranded DNA in the presence of chemically altered bases called dideoxynucleotides (ddNTPs).

When a ddNTP is incorporated at the 3' end of the growing chain, it terminates the chain selectively at A, C, G, or T. The terminated chains are then resolved by capillary electrophoresis.

[Link]

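The chain-termination principle can be illustrated with a toy simulation. This is a deliberately idealised sketch, not a model of real chemistry: it assumes one terminated product for every possible length and perfect size separation, and the function names are hypothetical.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def terminated_fragments(template: str) -> list[tuple[int, str]]:
    """Return (length, terminal base) for every chain-terminated product.

    The new strand is the reverse complement of the template; a ddNTP can
    stop synthesis after any base, so there is one product per length.
    """
    new_strand = "".join(COMPLEMENT[base] for base in reversed(template))
    return [(length, new_strand[length - 1]) for length in range(1, len(new_strand) + 1)]


def read_sequence(fragments: list[tuple[int, str]]) -> str:
    """Mimic electrophoresis: order products by size, then read the label
    of the terminal base of each product, shortest first."""
    return "".join(base for _, base in sorted(fragments))


fragments = terminated_fragments("ATGC")
print(read_sequence(fragments))  # sequence of the synthesized strand, 5' to 3'
```

Sorting by length stands in for capillary electrophoresis, and the base label stands in for the fluorescent dye carried by each ddNTP.
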
DNA Sequencing method:
from Sanger to NGS

Applied Biosystems (Life Technologies) manufactured the automated capillary sequencers used by both Celera Genomics and the Human Genome Project.

While capillary sequencing was the first approach to successfully sequence a full human genome, it was still too expensive and too slow for commercial purposes.

Because of this, Sanger sequencing has largely been displaced by technologies such as pyrosequencing or SMRT sequencing…

DNA-Seq

DNA-Seq is now an effective sequencing strategy: the advent of rapid DNA sequencing methods has greatly accelerated biological and medical research and discovery (de novo sequencing…).

DNA-Seq may be used to determine the sequence of individual genes, larger genetic regions, full chromosomes, or entire genomes.

'DNA-Seq' and other related 'seq' technologies make it possible to address genome complexity: genomic DNA-Seq, Methyl-Seq, ChIP-Seq, exome sequencing…

NGS

Next-generation sequencing (NGS), or high-throughput (HT) sequencing = catch-all term describing the different modern sequencing technologies used by different platforms:

- Illumina (Solexa) sequencing
- Roche 454 sequencing
- Ion Torrent: Proton / PGM sequencing
- …

DNA is sequenced faster and more cheaply than with Sanger sequencing = a revolution for genomics and molecular biology.

NGS:
Sequencing Technologies and Platforms

Technology      Company                         Support                  Chemistry

Massively Parallel Sequencing
Solexa          Illumina                        Bridge PCR on flowcell   Seq-By-Synthesis
454             Roche Applied Science           emPCR on beads           Pyrosequencing
SOLiD           AB / Life Technologies          emPCR on beads           Seq-By-Ligation
Ion Torrent     Life Technologies               emPCR on beads           Proton detection

Single Molecule Sequencing
PacBio SMRT     Pacific Biosciences             Pol performance          Real-time-Seq
Nanopore        Oxford Nanopore Tech/McNally    Translocation            NA

NGS:
Principles of DNA Sequencing: SBS

Tracking the addition of labeled nucleotides as the DNA chain is copied:
- The DNA template is immobilized.
- Solutions of A, C, G and T nucleotides are sequentially added and removed.
- Light is generated when a nucleotide complements the first unpaired base.
- The chemiluminescent signal is detected to determine the sequence.

NGS:
In brief: emPCR (454/SOLiD/Ion Torrent) vs Bridge PCR (Illumina)

Bridge PCR

- The adaptor-flanked shotgun library is PCR-amplified on a flow cell.
- Both primers coat the surface of a solid substrate.
- Amplification products from any given member of the library remain locally fixed near the point of origin = cluster.
- The PCR produces clonal clusters, each containing copies of a single DNA molecule.

(Shendure & Ji, 2008)

NGS:
Principles of DNA Sequencing: Pyrosequencing

- Incorporation of dNTPs by DNApol releases pyrophosphate (PPi).
- ATP sulfurylase converts PPi to ATP in the presence of APS.
- ATP = substrate for the luciferase-mediated conversion of luciferin to oxyluciferin.
- This conversion generates light in amounts proportional to the amount of ATP; the light is detected by a camera.
- Unincorporated nucleotides and ATP are degraded by apyrase, and the reaction can restart with another nucleotide.

APS: adenosine 5'-phosphosulfate
PPi: pyrophosphate
DNApol: DNA polymerase

(https://[Link])

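Because the light signal is proportional to the number of nucleotides incorporated in one flow, a pyrosequencing run can be pictured as a "flowgram". The sketch below is a noise-free toy model: the flow order "TACG" and the function names are assumptions made for illustration, and real instruments must also cope with noisy signals for long homopolymers.

```python
def flowgram(sequence: str, flow_order: str = "TACG") -> list[tuple[str, int]]:
    """Simulate ideal signals: one (nucleotide, signal) pair per flow, where
    the signal equals the number of bases incorporated in that flow."""
    if any(base not in flow_order for base in sequence):
        raise ValueError("sequence contains a base missing from the flow order")
    signals = []
    position = 0
    flow = 0
    while position < len(sequence):
        nucleotide = flow_order[flow % len(flow_order)]
        incorporated = 0
        # A homopolymer run is incorporated entirely within a single flow.
        while position < len(sequence) and sequence[position] == nucleotide:
            incorporated += 1
            position += 1
        signals.append((nucleotide, incorporated))
        flow += 1
    return signals


def decode(signals: list[tuple[str, int]]) -> str:
    """Rebuild the sequence from an ideal flowgram."""
    return "".join(nucleotide * count for nucleotide, count in signals)


signals = flowgram("TTAG")
print(signals)
print(decode(signals))
```

A flow with signal 0 simply means that the dispensed nucleotide did not match the next template position.
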
NGS:
In brief: emPCR (454/SOLiD/Ion Torrent) vs Bridge PCR (Illumina)

emPCR

- The adaptor-flanked shotgun library is PCR-amplified in the context of a water-in-oil emulsion.
- One PCR primer is attached by its 5' end to micron-scale beads.
- 1 bead-containing compartment = 0 or 1 template DNA molecule.
- PCR amplicons are captured on the surface of the bead.
- 1 clonally amplified bead = PCR products corresponding to amplification of a single molecule from the library.

(Shendure & Ji, 2008)

Part 3

Overview of NGS Technologies

Solexa

- The input sample must be cleaved into short sections.
- Fragments are ligated to adaptors and annealed to the slide (flow cell) using the adaptors.
- Fragments are separated into single strands to be sequenced.
- Nucleotides are modified so that each emits a different coloured light when excited by a laser. They also carry a terminator, so that only one base is added at a time.
- PCR; the process is repeated in cycles and the images are analyzed.

(modified from Gilchrist, 2010)

Solexa
Intra-Platform Comparison

http://[Link]

454 Technology

As in Illumina, the DNA is fragmented.

Adaptors are added, and one end is annealed to beads. 1 DNA fragment = 1 bead.

Fragments are amplified by PCR using adaptor-specific primers.

The sequence can then be determined computationally.

Longer reads than Illumina, of variable length.

(http://[Link])

454 Technology
Intra-Platform Comparison

GS Junior System
- Brings the power of 454 Sequencing Systems directly to the laboratory benchtop.
- Read lengths: GS Junior (up to 400 bp); GS Junior+ (700-800 bp)

GS FLX+ System
- Features a combination of long reads, accuracy and high throughput, making the system well suited for larger genomic projects.
- Read lengths: GS FLX Titanium XL+ (up to 1,000 bp); GS FLX Titanium XLR70 (up to 600 bp)

([Link]/)

SOLiD

Sequencing is performed with a ligase, rather than a polymerase.

Each sequencing cycle introduces a partially degenerate population of fluorescently labeled octamers. The population is structured such that the label correlates with the identity of the central 2 bp in the octamer.

After ligation and imaging in four channels, the labeled portion of the octamer (that is, 'zzz') is cleaved, leaving a free end for another cycle of ligation.

In the SOLiD sequencing method, unlike other NGS technologies (in which base detection is performed through a polymerase reaction), sequencing is achieved by Sequencing-By-Ligation.

Adapted from (Shendure & Ji, 2008) and (http://[Link]/978-1-4614-7725-9)

Ion Torrent

As in other kinds of NGS, the input DNA is fragmented.

Adaptors are added and one molecule is placed onto a bead.

Amplification on the bead by emulsion PCR. Each bead is placed into 1 well of a slide.

The pH is detected in each of the wells, as each H+ ion released will decrease the pH. The changes in pH allow us to determine whether a given base was added to the sequence read, and how many times.

The dNTPs are washed away, and the process is repeated in cycles.

Adapted from (http://[Link]) and (http://[Link]/978-1-4614-7725-9)

PacBio
Single Molecule Real Time (SMRT) DNA Sequencing technology

a → A DNA polymerase molecule is tethered to the bottom of a nanowell → the ZMW design ensures that only one nucleotide-linked dye can be directly excited at a time.

b → Each incorporated phospholinked nucleotide will reside on the enzyme's active site for a few milliseconds, which is enough time for a fluorescent signal to be recorded.

NB: In other systems, the fluorescent label is attached to the base of the nucleotides. In SMRT technology, the fluorescent label is attached to the phosphate chain → the released labeled pentaphosphates will diffuse quickly.

(Osherovich et al., 2010)

Nanopore

Single Molecule Real Time (SMRT) DNA Sequencing technology: How it works

Schematic representation of a nanopore sequencing system.

- Upper protein → ssDNA.
- 2nd protein
  - forms a nanopore in a membrane.
  - contains an adaptor molecule that reduces the speed at which DNA passes through the pore.
- Each base obstructs the flow to a different degree.
- PromethION…

(http://[Link]/978-1-4614-7725-9)
(picture from MIT's Technology Review)

Inter-Platform Comparison

(https://fl[Link])

The four main advantages of NGS over classical Sanger sequencing

Speed
NGS is quicker than Sanger sequencing in two ways.
- The chemical reaction may be combined with signal detection, whereas in Sanger sequencing these are two separate processes.
- Only 1 read can be taken at a time in Sanger sequencing, whereas NGS is massively parallel.

Cost
The draft human genome sequence cost about $300M.
Sequencing a human genome with Illumina approaches the expected $1,000 target.

Sample size
NGS needs a significantly smaller starting amount of DNA/RNA.

Accuracy
Each base is read more times than with Sanger sequencing → greater coverage, higher accuracy and sequence reliability (although individual reads are less accurate with NGS).

DNA Sequencing costs
(Data from the NHGRI Genome Sequencing Program (GSP))

Accurately determining the cost of sequencing a given genome (e.g., a human genome) is not simple.

http://[Link]/sequencingcosts/

Comparison of human genome sequencing methods
HGP vs. ~2016

http://[Link]/sequencingcosts/

Part 4

DNA-Seq Protocol: Overview

DNA-Seq
Protocol for Library Construction

STEP 01  Genomic DNA Purification
STEP 02  Genomic DNA Fragmentation
STEP 03  End repair and A-tailing
STEP 04  Adapter Ligation
STEP 05  Size Selection & PCR
STEP 06  Sequencing

DNA-Seq
Kits for DNAseq Library prep

Illumina-compatible DNA-Seq Library Prep Kits

NEXTflex™ Rapid DNA-Seq Kit - DNA-Seq library prep kit, 1 ng - 1 µg input DNA
NEXTflex mtDNA-Seq Kit - mtDNA libraries
NEXTflex™ DNA Sequencing Kits - DNA-Seq library prep kit, 1 µg of input DNA
NEXTflex™ PCR-Free DNA Sequencing Kit - Amplification-free DNA-Seq library prep kit for sequencing 0.5 µg - 3 µg of input DNA
NEXTflex™ PCR-Free Barcodes - Up to 48 barcodes for use with the NEXTflex™ PCR-Free DNA-Seq Kits and other DNA-Seq protocols

KAPA HyperPlus Kits - input DNA from 1 ng - 1 µg
KAPA Hyper Prep Kits - 250 ng FFPE DNA or less + fewer cycles of amplification with KAPA HiFi DNA Polymerase (duplication rates + coverage)

(http://[Link]tifi[Link]/Next-Gen-Sequencing/Illumina-DNA-Library-Prep-Kits/)
(https://[Link]/product-applications/products/next-generation-sequencing-2/dna-library-preparation/)

DNA-Seq
Critical steps in DNAseq Library prep

STEP 01: Genomic DNA purification

Starting material: QC
- Quality control
  ► Gel visualization, Bioanalyzer (Agilent, Bio-Rad)
- Quantity control
  ► NanoDrop, Qubit…

Experimental design
► SR (single read) or PE (paired-end)
► Multiplexing or not
► De novo or not
► …

DNA-Seq
Critical steps in DNAseq Library prep

STEP 01: Genomic DNA purification

Experimental design
► SR (single-end reads) or PE (paired-end reads)

[Figure: SR vs PE reads and insert size]

https://[Link]/p/162806/

DNA-Seq
Critical steps in DNAseq Library prep

STEP 01: Genomic DNA purification

Experimental design
► Multiplexing or not?

- Multiplexing = attaching a specific barcode sequence to each library, to identify later the sample from which each read originates.
- Libraries are pooled and sequenced in parallel.
- Reads from each library are differentiated by their barcode (de-multiplexing).
- Each set is aligned to the reference genome.

[Link]

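De-multiplexing can be sketched in a few lines. This is a simplified illustration with hypothetical names: it assumes an inline barcode of fixed length at the start of each read and exact matching, whereas real pipelines often read the barcode in a separate index read and tolerate mismatches.

```python
from collections import defaultdict


def demultiplex(reads: list[str], barcodes: dict[str, str]) -> dict[str, list[str]]:
    """Assign each read to a sample using the barcode at its 5' end.

    `barcodes` maps barcode sequence -> sample name; all barcodes are
    assumed to have the same length. Reads with an unknown barcode go to
    the 'undetermined' bin.
    """
    barcode_length = len(next(iter(barcodes)))
    bins = defaultdict(list)
    for read in reads:
        tag, insert = read[:barcode_length], read[barcode_length:]
        bins[barcodes.get(tag, "undetermined")].append(insert)
    return dict(bins)


# Made-up barcodes and reads, for illustration only.
samples = {"ACGT": "sample_1", "TGCA": "sample_2"}
pooled = ["ACGTAAAA", "TGCACCCC", "GGGGTTTT"]
print(demultiplex(pooled, samples))
```

Each resulting bin can then be aligned to the reference genome independently.
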
DNA-Seq
Critical steps in DNAseq Library prep

STEP 02: Genomic DNA fragmentation

Fragmentation
- Can be included in the kit
  ► Optimization of fragmentation parameters
- Several methods
  ► Enzymatic, nebulization, acoustic shearing…

Starting material: input
- Low-quality DNA
  ► Caution in size selection
- High-quality DNA
  ► Size selection

DNA-Seq
Critical steps in DNAseq Library prep

STEP 03: End repair and A-tailing

Repair ends
- Converts overhangs into blunt ends + phosphorylates the 5' ends
- Reagents: dNTPs, T4 DNA polymerase, Klenow; kinase/ATP (T4 PNK)
- Simple enzymatic reaction

[Figure: blunt ending by exonuclease; 5'-end phosphorylation]

DNA-Seq
Critical steps in DNAseq Library prep

STEP 03: End repair and A-tailing

A-tailing (Adenylation)
- Adds an 'A' base to the 3' end of the blunt, phosphorylated DNA fragments
- Prevents
  ► Formation of adapter dimers
  ► Concatemers
- Reagents: 1 mM dATP, Klenow exo- (3' to 5' exo minus)

[Figure: A-overhang]

DNA-Seq
Critical steps in DNAseq Library prep

STEP 04: Adapter ligation

Adapter Ligation
- Adapters are provided or custom-designed
- Adapter concentration affects ligation, as well as adapter and adapter-dimer carryover
- Robust ligation efficiency for adapter:insert molar ratios between 10:1 and >200:1
- Adapter ratio >200:1 for low-input applications
- Adapter quality
- Post-ligation cleanup

https://[Link]/

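An adapter:insert molar ratio has to be computed from masses and fragment lengths. The sketch below relies on a common approximation that is not part of the slides: an average of about 660 g/mol per base pair of double-stranded DNA. Function names and example values are illustrative only, and kit protocols should take precedence.

```python
AVERAGE_BP_MASS = 660.0  # g/mol per base pair of dsDNA (approximation)


def dsdna_pmol(mass_ng: float, length_bp: float) -> float:
    """Convert a mass of double-stranded DNA (ng) into picomoles of molecules."""
    return mass_ng * 1000.0 / (length_bp * AVERAGE_BP_MASS)


def adapter_pmol_needed(insert_ng: float, insert_bp: float, molar_ratio: float) -> float:
    """Picomoles of adapter required for a given adapter:insert molar ratio."""
    return dsdna_pmol(insert_ng, insert_bp) * molar_ratio


# Illustrative values: 100 ng of 350 bp fragments at a 10:1 ratio.
insert = dsdna_pmol(100, 350)
print(f"insert: {insert:.3f} pmol")
print(f"adapter needed at 10:1: {adapter_pmol_needed(100, 350, 10):.2f} pmol")
```

The same mass of shorter fragments contains more molecules, which is why the ratio must be recalculated whenever the fragment size distribution changes.
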
DNA-Seq
Critical steps in DNAseq Library prep

STEP 05: Size selection and PCR

Size selection: Read length considerations
- Size-select 300-400 bp or 350-500 bp, post-ligation
- Ensures maximum coverage of most inserts
- Problem of non-uniform genome coverage
- Problem of material loss

→ Strategy to focus read lengths during sample and library preparation

https://[Link]/

DNA-Seq
Critical steps in DNAseq Library prep

STEP 05: Size selection and PCR

Size selection: Read length considerations
- Double solid-phase reversible immobilization (SPRI) selection methods make it possible to reshape the input fragment distribution into well-defined ranges.
- SPRI + reverse SPRI

[Figure: fragments on beads vs fragments in supernatant]

https://[Link]/

DNA-Seq
Critical steps in DNAseq Library prep

STEP 05: Size selection and PCR

Library Amplification (PCR)
► Amplifies the amount of DNA in the library
► Selectively enriches DNA fragments with adapter molecules on both ends
► Post-amplification cleanup

QC
► Quality, quantity and size check

https://[Link]/

DNA-Seq
Critical steps in DNAseq Library prep

STEP 06: Sequencing

DNA Sequencing
► Input: the constructed library
  - Whole-genome
  - Whole-exome
  - Target region
► Cluster amplification + sequencing + base calling
► QC (run report)
► Output: sequenced "reads" (FASTQ files)

https://[Link]/

Part 5

DNA-Seq Analysis Pipeline and File Formats

DNA-Seq Analysis Pipeline
and associated files

KEY STEPS OF THE ANALYSIS PIPELINE      KEY FILES

STEP 01  Raw Sequencing Reads           FASTQ
STEP 02  Reads Alignment                SAM/BAM (mapped/unmapped; indexing; marking duplicates)
STEP 03  Genomic Coverage               BAI/BED
STEP 04  SV/CNV/Variant Calling         VCF
STEP 05  Biological Interpretation      CSV/XLS/TXT

DNA-Seq Analysis Pipeline
and associated files

STEP 01: Raw sequencing reads (FASTQ; sequencing run/QC)

Sequencing output
► FASTQ (text) format.
► Potentially SRA (binary), but rather used for public data online.

FASTQ file
► Improvement of the Sanger breakthrough (associating each nucleotide with a quality score)
► Hundreds of millions of lines
► Blocks of 4 lines, each block starting with a header line marked '@'
► Example

http://[Link]

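The 4-line block structure makes FASTQ easy to parse. The following minimal sketch assumes the common Phred+33 quality encoding and records that are not wrapped over several lines; the record shown is invented for illustration.

```python
import io


def read_fastq(handle):
    """Yield (name, sequence, quality scores) from a FASTQ handle.

    Assumes exactly 4 lines per record and Phred+33 encoded qualities.
    """
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break  # end of file
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored here
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@"):
            raise ValueError(f"unexpected header line: {header!r}")
        scores = [ord(char) - 33 for char in quality]
        yield header[1:], sequence, scores


# Invented record: name, bases, separator, one quality character per base.
example = io.StringIO("@read1\nACGT\n+\nII5!\n")
for name, sequence, scores in read_fastq(example):
    print(name, sequence, scores)
```

A Phred score Q corresponds to an estimated base-calling error probability of 10^(-Q/10), so Q20 means about 1 error in 100 and Q40 about 1 in 10,000.
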
DNA-Seq Analysis Pipeline
and associated files

STEP 01: Raw sequencing reads (FASTQ; sequencing run/QC)

FastQC

http://[Link]
https://[Link]

DNA-Seq Analysis Pipeline
and associated files

STEP 01: Raw sequencing reads (FASTQ; sequencing run/QC)

FastQC + Trimming + bad reads removal

http://[Link]
https://[Link]

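Trimming and bad-read removal can be pictured with a very simple rule. This sketch is not the algorithm of any particular tool: it assumes qualities already decoded to Phred scores, trims low-quality bases from the 3' end only, and uses hypothetical thresholds. Dedicated trimmers typically use sliding windows and also remove adapter sequences.

```python
def trim_3prime(sequence: str, scores: list[int], min_quality: int = 20):
    """Remove bases from the 3' end while their quality is below the threshold."""
    end = len(scores)
    while end > 0 and scores[end - 1] < min_quality:
        end -= 1
    return sequence[:end], scores[:end]


def filter_reads(reads, min_quality: int = 20, min_length: int = 3):
    """Trim every (name, sequence, scores) read and drop those that become too short."""
    kept = []
    for name, sequence, scores in reads:
        sequence, scores = trim_3prime(sequence, scores, min_quality)
        if len(sequence) >= min_length:
            kept.append((name, sequence, scores))
    return kept


# Invented reads: r1 loses its two low-quality 3' bases, r2 is discarded.
reads = [
    ("r1", "ACGTAC", [30, 30, 30, 30, 10, 5]),
    ("r2", "ACG", [30, 5, 5]),
]
print(filter_reads(reads))
```
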
DNA-Seq Analysis Pipeline
and associated files

STEP 02: Reads mapping/alignment (SAM/BAM; BWA/BOWTIE…/SAMTOOLS)

Input files
► .fastq
► Reference genome (.fasta, .fa, .fai)
► GFF/GTF, GFF3

Reads alignment
► BWA / BOWTIE (Burrows-Wheeler transform)
► De novo: NEWBLER (454)…

Output files
► .sam / .bam

DNA-Seq Analysis Pipeline
and associated files

STEP 02: Reads mapping/alignment (SAM/BAM; BWA/BOWTIE…/SAMTOOLS)

GFF3
► 1 line for 1 feature
► Tab-separated columns
► 9 columns + optional additional information

Columns: SEQ-ID, SOURCE, TYPE, START, END, SCORE, STRAND, PHASE, ATTRIBUTES

http://[Link]/

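The 9-column layout can be read with plain string handling. The sketch below uses an invented feature line and hypothetical names; it keeps score, strand and phase as text and does not handle multi-valued attributes or the directives (lines starting with "##") that a full GFF3 parser must support.

```python
from typing import NamedTuple
from urllib.parse import unquote


class Feature(NamedTuple):
    seqid: str
    source: str
    feature_type: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive
    score: str
    strand: str
    phase: str
    attributes: dict


def parse_gff3_line(line: str) -> Feature:
    """Split one GFF3 feature line into its 9 tab-separated columns."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) != 9:
        raise ValueError(f"expected 9 columns, found {len(columns)}")
    attributes = {}
    if columns[8] != ".":
        for pair in columns[8].split(";"):
            if pair:
                key, _, value = pair.partition("=")
                attributes[unquote(key)] = unquote(value)
    return Feature(columns[0], columns[1], columns[2], int(columns[3]),
                   int(columns[4]), columns[5], columns[6], columns[7], attributes)


line = "chr1\texample\tgene\t1000\t2000\t.\t+\t.\tID=gene0001;Name=demo"
feature = parse_gff3_line(line)
print(feature.feature_type, feature.end - feature.start + 1, feature.attributes)
```

Because GFF3 coordinates are 1-based and inclusive, the feature length is end - start + 1.
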
DNA-Seq Analysis Pipeline
and associated files

STEP 02: Reads alignment (SAM/BAM; BWA/BOWTIE…/SAMTOOLS)

SAM / BAM
► Header lines + alignment section
► Tab-separated columns
► 11 mandatory columns
► Samtools (view, sort, index…)
► Removal of duplicated reads, which affect variant calling

Columns: QNAME, FLAG, RNAME, POS, MAPQ, CIGAR, RNEXT, PNEXT, TLEN, SEQ, QUAL

http://[Link]/[Link]
http://[Link]/wiki/SAM

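An alignment line can be split into its 11 mandatory columns, and the FLAG column decoded bit by bit. The sketch below uses an invented alignment line and decodes only four of the flag bits defined by the SAM specification; in practice samtools or a dedicated library would be used rather than hand-written parsing.

```python
SAM_FIELDS = ("QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL")

# Subset of the bitwise FLAG values, chosen for illustration.
FLAG_BITS = {0x1: "paired", 0x4: "unmapped", 0x10: "reverse strand", 0x400: "duplicate"}


def parse_sam_line(line: str) -> dict:
    """Parse one alignment line (not a '@' header line) into a dictionary."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) < 11:
        raise ValueError("a SAM alignment line needs at least 11 columns")
    record = dict(zip(SAM_FIELDS, columns[:11]))
    for key in ("FLAG", "POS", "MAPQ", "PNEXT", "TLEN"):
        record[key] = int(record[key])
    record["TAGS"] = columns[11:]  # optional fields such as NM:i:0
    return record


def decode_flag(flag: int) -> list[str]:
    """List the properties encoded in the FLAG integer (subset only)."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]


line = "r1\t1040\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII\tNM:i:0"
record = parse_sam_line(line)
print(record["RNAME"], record["POS"], decode_flag(record["FLAG"]))
```

The 0x400 bit is the one set by duplicate-marking tools, which is how duplicated reads can later be excluded from variant calling.
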
DNA-Seq Analysis Pipeline
and associated files

STEP 03: Genomic coverage (BAI/BED; SAMTOOLS/BEDTOOLS)

Coverage
► IGV = Integrative Genomics Viewer (https://[Link]ti[Link]/igv/)
► Other useful file formats: BED = Browser Extensible Data (http://[Link]/FAQ/FAQformat)
► Coverage is associated with 3 different concepts:
  - Fold coverage (number + X)
  - Breadth of coverage
  - Depth of coverage

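The three coverage concepts can be made concrete on a toy region. This sketch uses the usual definitions (depth = number of reads overlapping a position; breadth = fraction of positions covered at least once; fold coverage = mean depth, written as "30X" for example) and assumes alignments given as 0-based, half-open intervals. The names and numbers are illustrative only.

```python
def depth_per_position(alignments: list[tuple[int, int]], region_length: int) -> list[int]:
    """Depth of coverage: how many alignments overlap each position.

    Alignments are (start, end) intervals, 0-based and half-open.
    """
    depth = [0] * region_length
    for start, end in alignments:
        for position in range(max(start, 0), min(end, region_length)):
            depth[position] += 1
    return depth


def coverage_summary(depth: list[int]) -> dict:
    """Mean depth (fold coverage) and breadth (fraction covered at least once)."""
    covered = sum(1 for d in depth if d > 0)
    return {"mean_depth": sum(depth) / len(depth), "breadth": covered / len(depth)}


# Two invented reads on a 10 bp region.
depth = depth_per_position([(0, 4), (2, 6)], region_length=10)
print(depth)
print(coverage_summary(depth))
```

On real data these numbers are computed from BAM files with tools such as samtools or bedtools rather than with per-base Python loops.
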
DNA-Seq Analysis Pipeline
and associated files

STEP 04: SV/CNV/variant calling (VCF)

SV/CNV/Variant Calling
► Structural variations (SV)
  Deletions, duplications, copy-number variations, insertions, inversions, translocations…
► Copy number variations (CNV)
  Deletions or duplications of genes or relatively large regions of the genome that affect chromosomes
► Variant calling (SNPs and small InDels)
  - SNPs: affect only 1 nucleotide
  - InDels: affect 1 or several nucleotides

[Figure: read-pair signatures: properly mapped pair; split mapping; different distance or orientation for the pair]

DNA-Seq Analysis Pipeline
and associated files

STEP 04: SV/CNV/variant calling (VCF)

SV/CNV/Variant Calling

[Figure: variant classes compared with the reference, by size: single nucleotide; 2 bp to 1 kb; 1 kb to sub-microscopic; sub-chromosomal to microscopic]

(definitions adapted from Scherer et al., 2007)

DNA-Seq Analysis Pipeline
and associated files

STEP 04: SV/CNV/variant calling (VCF)

SV/CNV/Variant Calling
► VCF (Variant Call Format)
  - Text file format storing SNP and InDel information
  - http://[Link]/node/101
  - Obtaining variants listed in this format is a multistep procedure involving different tools, but the format itself is standardized
  - Headers (meta-information) + data lines
  - 8 required fields, tab-delimited

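The 8 required fields (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO) can be read with basic string handling. The sketch below uses invented data lines and a simplified classification based only on allele lengths; symbolic alleles such as `<DEL>` and multi-nucleotide substitutions are deliberately lumped into "other".

```python
VCF_FIELDS = ("CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO")


def parse_vcf_line(line: str) -> dict:
    """Parse the 8 required, tab-delimited fields of one VCF data line."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) < 8:
        raise ValueError("a VCF data line needs at least 8 fields")
    record = dict(zip(VCF_FIELDS, columns[:8]))
    record["POS"] = int(record["POS"])
    record["ALT"] = record["ALT"].split(",")  # several alternate alleles are allowed
    return record


def variant_type(ref: str, alt: str) -> str:
    """Simplified classification from allele lengths only."""
    if len(ref) == 1 and len(alt) == 1:
        return "SNP"
    if len(ref) != len(alt):
        return "InDel"
    return "other"


def summarize(lines) -> dict:
    """Count variant types, skipping header lines (those starting with '#')."""
    counts = {}
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        record = parse_vcf_line(line)
        for alt in record["ALT"]:
            kind = variant_type(record["REF"], alt)
            counts[kind] = counts.get(kind, 0) + 1
    return counts


example = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\t.\tA\tG\t50\tPASS\t.",
    "chr1\t200\t.\tAT\tA\t50\tPASS\t.",
    "chr2\t300\t.\tC\tCGG,T\t50\tPASS\t.",
]
print(summarize(example))
```
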
STEP 05: Biological interpretation (output files: CSV/XLS/TXT)
From variant annotation to data mining

► web-based
► available packages

Aim

► Functional impact of variants (synonymous or not…)
► Gene Ontology annotation (BP, MF, CC)
► Pathway / network information
► Predictions of pathogenicity / severity

NB: DAVID (Database for Annotation, Visualization and
Integrated Discovery) to switch between databases
https://[Link]/

DNA-Seq Analysis
Take-home messages

Biological question
► needs to be clearly defined first, so that the design of the experiment, the library
construction and the analysis pipeline can be prepared accordingly

Platform
► Each one has its own specificities that need to be understood before choosing one
► Different technologies, short reads (Illumina…) vs long reads (PacBio…)
► Rapidly evolving, several limitations (PCR bias for GC-rich regions…)
► Combination of different platforms possible (de novo…)

Input / Output files
► Companion index files needed (.fa & .fai / .bam & .bai / .vcf & .[Link]…)
► text-based (FASTA, FASTQ, SAM, GTF/GFF, BED, VCF, WIG) or binary (BAM, BCF, SFF)
► 1-based (GFF/GTF, SAM, WIG) or 0-based (BED; BAM also stores positions 0-based internally, although samtools displays them 1-based)
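The difference between 0-based and 1-based coordinates matters as soon as files are combined. As a minimal sketch (the two intervals are invented), a BED interval, whose start is 0-based and whose end is excluded, can be rewritten as a 1-based, end-inclusive interval by adding 1 to the start only:

```shell
# Hypothetical BED intervals: chrom, start (0-based), end (excluded)
bed=$(printf 'chr1\t0\t100\nchr1\t999\t2000\n')

# 1-based, end-inclusive (GFF-style) coordinates: start + 1, end unchanged.
one_based=$(printf '%s\n' "$bed" | awk -F'\t' -v OFS='\t' '{ print $1, $2 + 1, $3 }')
printf '%s\n' "$one_based"
```

Both forms describe the same bases: the first interval covers the first 100 bases of chr1 in either notation.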

DNA-Seq Analysis
The command-line environment

Understand how these files are generated in practice (no
demo)

Get an idea of how to deal with these files easily
once they are generated and given to you by your
sequencing platform.

Work a bit on one specific file (a VCF file), using the
command-line interface (assignment).


Reminder of the command-line syntax and some basic
commands to manipulate files
→ the same syntax is used for NGS analysis, but with other,
specific commands (algorithms, tools…)

Examples of command lines to generate or retrieve data
from NGS data files using these specific commands

Basic Linux command lines can be useful to parse files: how
to interrogate a VCF file using basic Linux command lines?


The NGS datasets: reminder
► Large output files
► Output files contain various kinds of information that you need to parse
- fastq: quality associated with each read…
- sam/bam: quality of the mapping…
- vcf: variants, annotation of the effects that these variants can have…

The command-line environment
► UNIX operating system: able to deal with multi-task & multi-user needs
► i.e. it can even handle multiple files at a time (useful with multiple samples)
► Brings flexibility to handle large files
► Makes it easy to parse the content of a (big) file
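As a small illustration of parsing such files, the sketch below counts the reads of a FASTQ file and extracts the quality strings, relying on the fact that each FASTQ record spans 4 lines. The two reads are invented.

```shell
# Hypothetical 2-read FASTQ file: each record is 4 lines
# (@name, sequence, "+", per-base quality string).
fq=$(mktemp)
printf '@read1\nACGT\n+\nIIII\n@read2\nGGCA\n+\nII#I\n' > "$fq"

# Number of reads = number of lines / 4
n_lines=$(wc -l < "$fq")
n_reads=$((n_lines / 4))
echo "reads: $n_reads"

# Keep only the quality line (4th line) of every record.
quals=$(awk 'NR % 4 == 0' "$fq")
printf '%s\n' "$quals"
```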


The basic commands you have seen are useful for NGS

→ Remember that many kinds of files are generated
through the NGS analysis pipeline

► So at this stage you should know the file system basics
that allow you to work with many files and organize
them


What you know about the command-line environment
should allow you to:

► Create directories and move through the file system
► Easily run queries to parse large files
- search for a particular pattern
- select specific columns of information
→ Easily interrogate the large amount of information in
the output files
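One possible way to keep the many files of a pipeline organized is one directory per file type. The layout and directory names below are only an assumption for illustration, created in a temporary location.

```shell
# Hypothetical project layout, one directory per file type.
project=$(mktemp -d)
mkdir -p "$project/fastq" "$project/bam" "$project/vcf"

cd "$project/vcf"   # move into one of the directories
pwd                 # print the current location
ls "$project"       # list the directories just created
n_dirs=$(ls "$project" | wc -l)
```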
Viewing & Manipulating

cat
→ view the content of a short file
$ cat file1

more
→ view the content of a long file, page by page
$ more file1

less
→ view the content of a long file, scrolling in both directions
$ less file1

head
→ view the first lines of a (long) file
$ head file1
NB: By default (without options), displays the first 10 lines

tail
→ view the last lines of a (long) file
$ tail file1
NB: By default (without options), displays the last 10 lines
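A quick way to check the default behaviour of head and tail, and to change it with the -n option, is to try them on a small numbered file (the 25-line file below is generated just for the demonstration):

```shell
# Hypothetical 25-line file containing the numbers 1 to 25.
f=$(mktemp)
seq 1 25 > "$f"

first_default=$(head "$f" | wc -l)   # 10 lines by default
first_three=$(head -n 3 "$f")        # -n sets the number of lines shown
last_two=$(tail -n 2 "$f")
printf '%s\n' "$first_three"
printf '%s\n' "$last_two"
```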

cut

→ Extract specific fields from a file

$ cut -d' ' -f1,2 file1

-d' ' : field separator (here, a space)
-f1,2 : field specifier (here, fields 1 and 2)
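For tab-delimited files such as VCF, the -d option can be left out because tab is the default separator of cut. A minimal sketch on two invented data lines:

```shell
# Hypothetical tab-delimited lines in VCF column order (first 5 columns only).
f=$(mktemp)
printf 'chr1\t1000\t.\tA\tG\nchr2\t2000\t.\tCT\tC\n' > "$f"

# Tab is the default delimiter of cut, so -d is not needed here.
chrom_pos=$(cut -f1,2 "$f")   # columns 1 and 2: CHROM and POS
ref_alt=$(cut -f4,5 "$f")     # columns 4 and 5: REF and ALT
printf '%s\n' "$chrom_pos"
printf '%s\n' "$ref_alt"
```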

grep

→ search for the occurrence of a specific pattern in a file

(the pattern can be a regular expression, using wildcards…)

Careful: grep displays the whole line containing the specific pattern XXX
$ grep XXX file1

Can also be used to display all lines that DO NOT contain a specific pattern
$ grep -v XXX file1
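Because grep matches the pattern anywhere in the line, a loose pattern can return more than expected. The sketch below, on an invented four-line file, shows the effect and one way to anchor the pattern:

```shell
f=$(mktemp)
printf '#CHROM\tPOS\nchr1\t100\nchr10\t200\nchr2\t300\n' > "$f"

# Keep data lines only: -v inverts the match, ^# means "line starts with #".
data=$(grep -v '^#' "$f")

# A plain "chr1" also matches "chr10"; anchoring on the following tab avoids that.
tab=$(printf '\t')
loose=$(grep -c 'chr1' "$f")
strict=$(grep -c "^chr1${tab}" "$f")
echo "loose: $loose, strict: $strict"
```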

wc

→ Prints different kinds of counts for a file (lines, words, bytes)

$ wc -l file1

-l : prints the line count

Redirecting characters
|

The “|” character combines several commands, by sending
the result of one command to another

$ grep XXX file1 | wc -l

Prints the number of matching lines instead of displaying them on the screen
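Pipes can be chained as many times as needed. As a sketch on a small invented VCF-like file, the first pipeline counts the variants and the second one extracts the last position reported on chr1:

```shell
f=$(mktemp)
printf '##fileformat=VCFv4.2\n#CHROM\tPOS\nchr1\t100\nchr1\t200\nchr2\t300\n' > "$f"

# grep removes the header lines, wc -l counts what is left.
n_variants=$(grep -v '^#' "$f" | wc -l)

# Longer chain: keep chr1 lines, extract the POS column, keep the last value.
last_pos=$(grep -v '^#' "$f" | grep '^chr1' | cut -f2 | tail -n 1)
echo "variants: $n_variants, last chr1 position: $last_pos"
```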

>

The “>” character redirects the result of a command to a new
file

$ grep XXX file1 > file2

Writes the matching lines to file2 instead of displaying them on the screen
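Note that ">" overwrites the target file if it already exists. The sketch below (file names and content are invented) also uses ">>", which is not shown on the slide and appends to a file instead:

```shell
dir=$(mktemp -d)
printf 'chr1\t100\nchr2\t200\nchr1\t300\n' > "$dir/file1"

# ">" creates file2, or overwrites it if it already exists.
grep chr1 "$dir/file1" > "$dir/file2"
# ">>" appends to the end of an existing file instead of overwriting it.
grep chr2 "$dir/file1" >> "$dir/file2"

n_lines=$(wc -l < "$dir/file2")
cat "$dir/file2"
```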


Command-line syntax

$ command -options arguments

- command: no capital letter
- command, options and arguments are separated by spaces
- arguments: e.g. path_to_Directory/file

Useful commands:
$ man NameOfTheCommand
$ pwd

FASTQ: FastQC/…

SAM/BAM: BWA/Bowtie…/SAMtools

VCF: VCFtools/Jannovar


FASTQ: FastQC/…

To run FastQC (provided it is installed):

- specify the files you want to process on the command line
- FastQC will generate an HTML report for each file (with embedded graphs)

fastqc seqfile1 seqfile2 .. seqfileN

fastqc --help
fastqc [-o output dir] [--(no)extract]

-o --outdir
Create all output files in the specified output directory (dir must already exist).
--extract
Uncompress the zipped output file in the same dir after it has been created.
--noextract
Do not uncompress the output file after creating it.


Trimming

→ Trimming low-quality bases
Low-quality base calls from the sequencer can cause an otherwise mappable
sequence not to align → different open-source tools can trim off 3' bases → they produce a
FASTQ file of the trimmed reads to use as input to the alignment program.

e.g.: FASTX-Toolkit

gunzip -c Sample_R1.[Link] | fastx_trimmer -l 50 -Q 33 > [Link]

-l 50 : trim down to 50 bases (last base kept is 50)
-Q 33 : option that specifies how base qualities on the 4th line of each fastq entry are encoded
[Link]
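To show what fixed-length trimming does to a record, here is a minimal awk sketch. It is not a replacement for fastx_trimmer, only an illustration on one invented read, and the length of 6 is arbitrary (the slide uses 50).

```shell
fq=$(mktemp)
printf '@read1\nACGTACGTAC\n+\nIIIIIIII##\n' > "$fq"
keep=6   # hypothetical number of bases to keep

# Lines 2 (sequence) and 4 (qualities) of each record are cut to the same length;
# lines 1 (name) and 3 ("+") pass through unchanged.
trimmed=$(awk -v n="$keep" 'NR % 2 == 0 { print substr($0, 1, n); next } { print }' "$fq")
printf '%s\n' "$trimmed"
```

Sequence and quality lines must always be trimmed together, otherwise the record is no longer valid FASTQ.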

→ Trimming adapters
3' adapter contamination can cause the insert sequence not to align (adapter
sequence ≠ bases at the 3' end of the reference genome sequence).
Unlike general fixed-length trimming, adapter trimming removes different numbers of
3' bases depending on where the adapter sequence is found.

e.g.: cutadapt
cutadapt -a AAAT [Link] -o test_new.fastq
cutadapt -m 22 -O 10 -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC
-m 22 = discard any sequence that is shorter than 22 bases after trimming
-O 10 = do not trim 3' adapter sequences unless at least the first 10 bases of the adapter
are seen at the 3' end of the read
[Link]
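The idea that the number of removed bases depends on where the adapter is found can be sketched with awk. This is a deliberately simplified illustration, not how cutadapt works internally: it only handles a complete, exact adapter match, and the reads and the short adapter are invented.

```shell
fq=$(mktemp)
# read1 carries the adapter after 8 insert bases; read2 has no adapter.
printf '@read1\nACGTACGTAGATCGGA\n+\nIIIIIIIIIIIIIIII\n@read2\nTTTTGGGGCCCC\n+\nIIIIIIIIIIII\n' > "$fq"
adapter=AGATCGGA   # hypothetical short adapter

trimmed=$(awk -v a="$adapter" '
    NR % 4 == 1 { name = $0 }
    NR % 4 == 2 { seq = $0; pos = index(seq, a)
                  keep = (pos > 0) ? pos - 1 : length(seq) }
    NR % 4 == 0 { print name; print substr(seq, 1, keep)
                  print "+";  print substr($0, 1, keep) }
' "$fq")
printf '%s\n' "$trimmed"
```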

SAM/BAM: BWA/Bowtie…/SAMtools

Mapped and unmapped reads are imported and stored in SAM/BAM format

samtools: a suite of useful commands to visualize or get information from .sam/.bam files
# SAM to BAM conversion (-b requests BAM output)
samtools view -b [Link] > [Link]

# sorting and indexing alignments

samtools sort [Link] -o [Link]
samtools index [Link] [Link]

# all reads mapping to a certain portion of chr1, or all of chr1 written to another BAM
samtools index [Link]
samtools view [Link] chr1:200000-500000
samtools view -b [Link] chr1 > test_chr1.bam
[Link]

samtools
The flagstat command provides simple statistics on a BAM file

# simple statistics on a BAM file
samtools flagstat [Link]
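Statistics of this kind are derived from the FLAG column of each alignment. As a rough sketch of the principle (not of what flagstat prints), the lines below count mapped and unmapped reads in an invented, truncated SAM file, using the fact that bit 0x4 of FLAG marks an unmapped read:

```shell
sam=$(mktemp)
# Hypothetical alignment lines, first 4 SAM columns only: QNAME FLAG RNAME POS
printf 'r1\t0\tchr1\t100\nr2\t16\tchr1\t150\nr3\t4\t*\t0\n' > "$sam"

# Bit 0x4 of FLAG is set when the read is unmapped.
unmapped=$(awk -F'\t' '!/^@/ && int($2 / 4) % 2 == 1' "$sam" | wc -l)
total=$(grep -vc '^@' "$sam")
mapped=$((total - unmapped))
echo "total: $total, mapped: $mapped, unmapped: $((total - mapped))"
```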


VCF: VCFtools/SnpEff

VCF files contain information about variants

VCF can be used as input and output file for many tools

Variant calling can be done using many available tools and methods (GATK, samtools…)
and the output used by many others (VCFtools, VCFminer, SnpEff…)

When a mapped read shows a mismatch from the reference genome

→ is the mismatch due to a real SNP?

e.g. How does samtools detect SNPs?
Samtools computes statistics that incorporate different types of information, such as:
- the number of different reads that share a mismatch from the reference
- the sequence quality data
- the expected sequencing error rates
http://[Link]/[Link]

VCF: VCFtools/Jannovar

[Link] & bcftools: 2 steps are required using these commands:

[Link]
- collects summary information from the input BAMs
- computes the likelihood of the data given each possible genotype
- stores the likelihoods in the BCF format (see below). It does not call variants at this
stage.
2. bcftools
- applies the prior and does the actual calling
- can also concatenate BCF files, index BCFs for fast random access and convert BCF to VCF.

http://[Link]/[Link]

Suppose we have:
- a reference sequence in [Link], indexed by samtools faidx
- position-sorted alignment files [Link] and [Link].

→ you can call SNPs and short INDELs using:
#[Link] a BCF file (binary data format: information about sequence variants (SNPs…))
samtools mpileup -uD -f [Link] [Link] [Link] |
bcftools view -bvcg - > fi[Link]

samtools mpileup options:
-u output into an uncompressed BCF file
-D keep read depth for each sample
-f next argument is the reference genome file

bcftools view options:
-b output to BCF format
-v only output potential variant sites (i.e., exclude monomorphic ones)
-c do SNP calling
-g call genotypes for each sample in addition to just calling SNPs

NB: these options are those of older samtools/bcftools releases (0.1.x); in current releases
the two steps are performed by bcftools mpileup and bcftools call.

http://[Link]/[Link]

#[Link] BCF into VCF (flat text file rather than binary = easier to view)
bcftools view fi[Link] | vcfuti[Link] varFilter -D100 > file.fi[Link]

-D100 filters out SNPs that have a read depth higher than 100

http://[Link]/[Link]
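The effect of a maximum-depth filter can be reproduced on a text VCF with awk. This is only a sketch of the idea, assuming the depth is stored as DP=… in the INFO column; it is not the varFilter implementation, and the three variants are invented.

```shell
vcf=$(mktemp)
{
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
  printf 'chr1\t1000\t.\tA\tG\t50\t.\tDP=20\n'
  printf 'chr1\t2000\t.\tC\tT\t60\t.\tDP=250;AF1=0.5\n'
  printf 'chr2\t3000\t.\tG\tA\t40\t.\tDP=100\n'
} > "$vcf"
max_dp=100   # same threshold as -D100 on the slide

filtered=$(awk -F'\t' -v max="$max_dp" '
    /^#/ { print; next }                 # keep header lines
    {
        dp = 0
        n = split($8, info, ";")         # INFO is a ";"-separated list of KEY=VALUE
        for (i = 1; i <= n; i++)
            if (info[i] ~ /^DP=/) dp = substr(info[i], 4) + 0
        if (dp <= max) print             # drop variants deeper than the threshold
    }
' "$vcf")
n_kept=$(printf '%s\n' "$filtered" | grep -vc '^#')
echo "variants kept: $n_kept"
```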

Many ways to annotate the VCF file (VCFtools, Jannovar…)

e.g. Jannovar
Jannovar identifies all transcripts affected by a given variant, and provides HGVS-compliant
annotations for different types of variants.

# Download the RefSeq transcript database for the release hg19/GRCh37.
$ java -jar [Link] download -d hg19/refseq
# Annotate the [Link] file
$ java -jar [Link] annotate-vcf -d hg19_refseq.ser \
    -i [Link] -o [Link]

http://[Link]/jannovar/

► all these commands can be run as part of an analysis pipeline

► all files generated can be parsed using specific tools (samtools…)

► text-based files generated can be parsed using basic Linux commands

► At this stage you should be able to easily run queries to parse
large files
- search for a particular pattern
- select specific columns of information
→ Assignment!
