Introduction to Bioinformatics Online Course: IBT Genomics
Sequencing technologies and NGS Overview

Session 1: Sequencing technologies and NGS Overview
Part 1: Introduction to DNA Sequencing
Part 2: DNA Sequencing in the NGS era
Part 3: Overview of NGS Technologies
Part 4: DNA-Seq Protocol: Overview
Part 5: DNA-Seq Analysis Pipeline and File Formats

Session 1: Sequencing technologies and NGS Overview
- Understand the basics of NGS technologies
- Understand the different NGS file formats
- Navigate through database repositories to retrieve NGS datasets

1996: First NGS technology: Pyrosequencing

Introduction to Bioinformatics Online Course: IBT Genomics | Fatma Guerfali

The Human Genome Project: the objectives
$300 million project, led by the U.S. DoE and the NIH.
At any given time, ≈ 200 labs in the United States supported these efforts, and more than 18 countries from across the globe contributed to the HGP.
Sequencing quality depends on the average number of times each base in the genome is 'read' during the sequencing process.
For the Human Genome Project (HGP):
- 'draft sequence' (covering ~90% of the genome at ~99.9% accuracy)
- 'finished sequence' (covering >95% of the genome at ~99.99% accuracy).
Producing truly high-quality 'finished' sequence by this definition is very expensive and labor-intensive.
Several releases of the human genome sequences
http://[Link]/sequencingcosts/

Finished Genome vs Draft Genome
Finishing
- Accuracy in base identification + quality check + few if any gaps.
- Contiguous segments of sequence are ordered and linked to one another
- No ambiguities or discrepancies about segment order and orientation
Complete Genome
A genome represented by a single contiguous sequence with no ambiguities
→ The sequences available are finished to a certain high quality.
(Mardis et al., 2002)
http://[Link])[Link]/

The Human Genome Project: the heritage
The HGP required that all human genome sequence information be freely and publicly available.
The existing DNA sequences have been stored in databases available to anyone willing to exploit and analyze them.
Dedicated databases house various data for model organisms, such as sequences of known and hypothetical genes and proteins (GenBank, NCBI).
Other databases (Ensembl http://[Link]) present additional data and annotation as well as powerful tools for visualizing and searching it.
Community efforts for non-model organisms like Eukaryotic Pathogens: EuPathDB (http://[Link]/eupathdb/).
Computer programs have been developed to analyze and interpret the data.
(Mattick, 2011)

The Human Genome Project: the heritage
ENCODE (https://[Link])
The Encyclopedia of DNA Elements (ENCODE) Project aims to provide “a list of functional elements in the human genome, including elements that act at the protein and RNA levels, and regulatory elements that control cells and circumstances in which a gene is active.”
“The generation of such a catalogue is crucial for understanding genome function.”
GENCODE (http://[Link])
The human genome has been the focus of intensive manual annotation: the GENCODE Consortium aims to identify all gene features in the human and mouse genomes using a combination of computational analysis, manual annotation, and experimental validation.
(Djebali et al., 2012) (Harrow et al., 2012) (Birney et al., 2007)

The Human Genome Project: the heritage
Genetic differences in individual bases (SNPs) of a genome are by far the most common type of genetic variation.
Goal: develop a Haplotype Map of the Human Genome = identification and cataloging of most of the millions of SNPs estimated to occur commonly in the human genome.
Described: what the variants are, their location, and their distribution among people within populations and among populations in different parts of the world.
→ designed to provide information to link genetic variants to the risk for specific diseases
1000 Genomes Project: the catalogue has become more complete and reliable as many novel variants have been discovered.
http://[Link]

The Human Genome Project: the heritage
[Link]

DNA Sequencing method: from Sanger to NGS

NGS: In brief: emPCR (454/SOLiD/Ion Torrent) vs Bridge PCR (Illumina)
emPCR
- The adaptor-flanked shotgun library is PCR amplified in the context of a water-in-oil emulsion.
- One PCR primer is 5'-attached to micron-scale beads.
- 1 bead-containing compartment = 0 or 1 template DNA molecule.
- PCR amplicons are captured on the surface of the bead.
- 1 clonally amplified bead = PCR products corresponding to amplification of a single molecule from the library.
Bridge PCR
- The adaptor-flanked shotgun library is PCR amplified on a flow cell
- Both primers coat the surface of a solid substrate
- Amplification products from any given member of the library remain locally fixed near the point of origin = cluster
- The PCR produces clonal clusters, each containing copies of a single DNA molecule.
(https://[Link])
(Shendure & Ji, 2008)

Part 3: Overview of NGS Technologies
http://[Link]

454 Technology
As in Illumina, the DNA is fragmented.
Adaptors are added and the fragments are annealed to beads: 1 DNA fragment = 1 bead.
Fragments are amplified by PCR using adaptor-specific primers.
The sequence can then be determined computationally.
Reads are longer than with Illumina, and of variable length.
(http://[Link])

454 Technology

Intra-Platform Comparison

Ion Torrent (pH-based detection)
As in other kinds of NGS, the input DNA is fragmented.
Adaptors are added and one molecule is placed onto a bead.
The molecule is amplified on the bead by emulsion PCR.
Each bead is placed into 1 well of a slide.
The wells are flooded with one dNTP species at a time.
The pH is detected in each of the wells, as each H+ ion released will decrease the pH.
The changes in pH allow us to determine whether that base, and how many thereof, was added to the sequence read.
The dNTPs are washed away, and the process is repeated in cycles.

SMRT technology
a → A DNA polymerase molecule is tethered to the bottom of a nanowell
→ The ZMW design ensures that only one nucleotide-linked dye can be directly excited at a time.
b → Each incorporated phospholinked nucleotide will reside on the enzyme’s active site for a few milliseconds, which is enough time for a fluorescent signal to be recorded.
NB: In other systems, the fluorescent label is attached to the base in nucleotides. In SMRT technology, the fluorescent label is attached to the phosphate chain
→ The released labeled pentaphosphates will diffuse quickly.
(Osherovich et al., 2010)

Nanopore
(http://[Link]/978-1-4614-7725-9)
(picture from MIT’s Technology Review)

Inter-Platform Comparison
(https://fl[Link])

The four main advantages of NGS over classical Sanger sequencing
Speed
NGS is quicker than Sanger sequencing in two ways.
- The chemical reaction may be combined with the signal detection, whereas in Sanger sequencing these are two separate processes.
- Only 1 read can be taken at a time in Sanger sequencing, whereas NGS is massively parallel.
Cost
The human genome sequence cost $300M. Sequencing a human genome with Illumina now approaches the expected $1,000.
Sample size
NGS needs a significantly smaller starting amount of DNA/RNA.
Accuracy
Each base is read more times than with Sanger sequencing → greater coverage, higher accuracy and sequence reliability (although individual reads are less accurate for NGS).
http://[Link]/sequencingcosts/

Comparison of human genome sequencing methods
HGP vs. ~2016
http://[Link]/sequencingcosts/

Part 4: DNA-Seq Protocol: Overview
STEP 01 Genomic DNA Purification
STEP 02 Genomic DNA Fragmentation
STEP 03 End repair and A-tailing
STEP 04 Adapter Ligation
STEP 05 Size Selection & PCR
STEP 06 Sequencing

DNA-Seq: Kits for DNAseq library prep
KAPA HyperPlus Kits - input DNA from 1 ng – 1 µg
KAPA Hyper Prep Kits - 250 ng FFPE DNA or less + fewer cycles of amplification with KAPA HiFi DNA Polymerase (duplication rates + coverage)
(http://[Link])fi[Link]/Next-Gen-Sequencing/Illumina-DNA-Library-Prep-Kits/)
(https://[Link]/product-applications/products/next-generation-sequencing-2/dna-library-preparation/)

DNA-Seq: Critical steps in DNAseq library prep
STEP 01 GENOMIC DNA PURIFICATION
Starting material: QC
Quality control
► gel visualization, Bioanalyzer (Agilent, Bio-Rad)
Quantity control
► Nanodrop, Qubit…
Experimental design
► SR (single read) or PE (paired-end)
► Multiplexing or not
► de novo or not
► …

DNA-Seq: Critical steps in DNAseq library prep
STEP 01 GENOMIC DNA PURIFICATION
Experimental design
► SR (single-end reads) or PE (paired-end reads)
[Figure: SR vs PE reads, showing the insert size]
https://[Link]/p/162806/

DNA-Seq: Critical steps in DNAseq library prep
STEP 01 GENOMIC DNA PURIFICATION
Experimental design
► Multiplexing or not?
Multiplexing = attach a specific barcode sequence to each library, to identify later the sample from which each read originates.
Libraries are pooled and sequenced in parallel.
Reads from each library are differentiated by their barcode (de-multiplexing).
Each set is aligned to the reference genome.
[Link]

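The de-multiplexing idea can be sketched with a few lines of shell. This is only an illustration under stated assumptions: the barcode is assumed to be the first 4 bases of each read (in many real runs the index is read separately), matching is exact, and the file names, read names and barcodes (ACGT, TGCA) are made up.

```shell
# Toy pooled FASTQ: the first 4 bases of each read are a hypothetical sample barcode.
tmpdir=$(mktemp -d)
cat > "$tmpdir/pooled.fastq" <<'EOF'
@read1
ACGTTTGACCA
+
IIIIIIIIIII
@read2
TGCAGGATCCA
+
IIIIIIIIIII
@read3
ACGTCCCGATA
+
IIIIIIIIIII
EOF

# demultiplex BARCODE FILE: print the records whose sequence starts with BARCODE,
# with the barcode bases and their quality values removed.
demultiplex() {
  awk -v bc="$1" '
    NR % 4 == 1 { header = $0 }
    NR % 4 == 2 { seq = $0 }
    NR % 4 == 0 && index(seq, bc) == 1 {
      n = length(bc)
      print header
      print substr(seq, n + 1)
      print "+"
      print substr($0, n + 1)
    }' "$2"
}

demultiplex ACGT "$tmpdir/pooled.fastq" > "$tmpdir/sampleA.fastq"
demultiplex TGCA "$tmpdir/pooled.fastq" > "$tmpdir/sampleB.fastq"
```

Each resulting file would then be aligned to the reference genome separately.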
DNA-Seq: Critical steps in DNAseq library prep
STEP 02 GENOMIC DNA FRAGMENTATION
Fragmentation
Can be included in the kit
► Optimization of fragmentation parameters
Several methods
► Enzymatic, nebulization, acoustic shearing…
STEP 03 END REPAIR AND A-TAILING
Repair ends
Converts overhangs to blunt ends + phosphorylates the 5'-end
Reagents: dNTP, T4 DNA pol, Klenow – Kinase/ATP (T4 PNK)
Simple enzymatic reaction
[Figure: end repair and 5'-end phosphorylation]

DNA-Seq: Critical steps in DNAseq library prep
STEP 03 END REPAIR AND A-TAILING
A-tailing (Adenylation)
Adds an 'A' base to the 3' end of the blunt phosphorylated DNA fragments
Prevents
► Formation of adapter dimers
► Concatemers
Reagents: 1 mM dATP, Klenow exo- (3' to 5' exo minus)
[Figure: A-overhang]
STEP 04 ADAPTER LIGATION
Adapter ligation
Provided or custom-designed
Adapter concentration affects ligation efficiency, as well as adapter and adapter-dimer carryover
Robust ligation efficiency for adapter:insert molar ratios between 10:1 and >200:1
Adapter ratio >200:1 for low-input applications.
Adapter quality
Post-ligation cleanup
https://[Link]/
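
To make the adapter:insert molar ratio concrete, here is a small worked calculation. It is a sketch only: it relies on the common approximation of about 660 g/mol per base pair of double-stranded DNA, and the input amount (100 ng), mean fragment length (300 bp) and function names are made up for illustration.

```shell
# pmol of double-stranded DNA from its mass (ng) and mean length (bp),
# assuming ~660 g/mol per base pair.
dna_pmol() {   # usage: dna_pmol MASS_NG MEAN_LENGTH_BP
  awk -v ng="$1" -v bp="$2" 'BEGIN { printf "%.2f\n", ng * 1000 / (bp * 660) }'
}

# pmol of adapter needed to reach a given adapter:insert molar ratio.
adapter_pmol() {   # usage: adapter_pmol MASS_NG MEAN_LENGTH_BP RATIO
  awk -v ng="$1" -v bp="$2" -v ratio="$3" \
    'BEGIN { printf "%.2f\n", ratio * ng * 1000 / (bp * 660) }'
}

dna_pmol 100 300          # prints 0.51 (pmol of insert)
adapter_pmol 100 300 10   # prints 5.05 (pmol of adapter for a 10:1 ratio)
```

Lower inputs contain fewer insert molecules, which is why the same adapter amount corresponds to a much higher ratio in low-input applications.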

DNA-Seq: Critical steps in DNAseq library prep
STEP 05 SIZE SELECTION AND PCR
Size selection: read length considerations
Size select 300–400 bp or 350–500 bp, post-ligation
Ensures maximum coverage of most inserts
Problem of non-uniform genome coverage
Problem of material loss
https://[Link]/

DNA-Seq: Critical steps in DNAseq library prep
STEP 05 SIZE SELECTION AND PCR
Size selection: read length considerations
Double solid-phase reversible immobilization (SPRI) selection methods reshape the input fragment distribution into well-defined ranges.
SPRI + Reverse-SPRI
[Figure: fragments on beads vs fragments in supernatant]
https://[Link]/

DNA-Seq: Critical steps in DNAseq library prep
STEP 05 SIZE SELECTION AND PCR
Library amplification (PCR)
► Amplifies the amount of DNA in the library
► Selectively enriches DNA fragments with adapter molecules on both ends
► Post-amplification cleanup
QC
► Quality, quantity and size check
https://[Link]/

DNA-Seq: Critical steps in DNAseq library prep
STEP 06 SEQUENCING
DNA sequencing
► input: constructed library
- Whole-genome
- Whole-exome
- Target region
► Cluster amplification + sequencing + base calling
► QC (run report)
► output: sequenced "reads" (FASTQ files)
https://[Link]/

Part 5: DNA-Seq Analysis Pipeline and File Formats
STEP 01 Raw sequencing reads (sequencing run/QC): FASTQ
STEP 02 Reads alignment (mapped/unmapped, indexing, marking duplicates): SAM/BAM
STEP 03 Genomic coverage: BAI/BED
STEP 04 SV/CNV/variant calling: VCF
STEP 05 Biological interpretation: CSV/XLS/TXT

STEP 01 RAW SEQUENCING READS
Sequencing output
► FASTQ (text) format.
► Potentially SRA (binary), but rather used for public FASTQ data online.
FASTQ file
► Improvement on the Sanger breakthrough (associating each nucleotide with a quality score)
► Hundreds of millions of lines
► Blocks of 4 lines (each block starts with @)
► Example
http://[Link]
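
As an illustration of the 4-line block structure, the sketch below writes a toy FASTQ file and counts its reads with basic shell tools. The read names, sequences, quality strings and function names are made up.

```shell
# Toy FASTQ file with two reads (made-up identifiers and sequences).
tmpdir=$(mktemp -d)
cat > "$tmpdir/toy.fastq" <<'EOF'
@read1
GATTTGGGGTTCAAAGCAGT
+
!''*((((***+))%%%++)
@read2
ACGTACGTAC
+
IIIIIHHHGG
EOF

# Each record spans 4 lines: @identifier, sequence, '+' separator, quality string.
count_reads() {
  awk 'END { print NR / 4 }' "$1"
}

# Print only the sequence lines (2nd line of every block of 4).
sequences() {
  awk 'NR % 4 == 2' "$1"
}

count_reads "$tmpdir/toy.fastq"
sequences "$tmpdir/toy.fastq"
```

Counting lines and dividing by 4 is safer than counting lines that start with @, because @ is also a valid quality character and can begin a quality line.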

DNA-Seq Analysis Pipeline and associated files
STEP 01 RAW SEQUENCING READS (FASTQ; sequencing run/QC)
FastQC
http://[Link]
https://[Link]

DNA-Seq Analysis Pipeline and associated files
STEP 01 RAW SEQUENCING READS (FASTQ; sequencing run/QC)
FastQC + trimming + bad reads removal
http://[Link]
https://[Link]

DNA-Seq Analysis Pipeline and associated files
STEP 02 READS MAPPING/ALIGNMENT (SAM/BAM; BWA/BOWTIE…/SAMTOOLS)
Input files
► .fastq
► Reference genome (.fasta, .fa, .fai)
► GFF/GTF, GFF3
Reads alignment
► BWA / BOWTIE (Burrows-Wheeler transform)
► de novo: NEWBLER (454)…
Output files
► .sam / .bam

GFF3
► 1 line for 1 feature
► tab-separated columns
► 9 columns + optional additional information
SEQ-ID, SOURCE, TYPE, START, END, SCORE, STRAND, PHASE, ATTRIBUTES
http://[Link]/
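
The 9-column layout lends itself to simple parsing with awk. The following is a minimal sketch on a made-up two-feature GFF3 file; the coordinates, the "example" source and the function name are hypothetical.

```shell
# Toy GFF3 content (made-up coordinates); fields are separated by tabs.
tmpdir=$(mktemp -d)
{
  printf '##gff-version 3\n'
  printf 'chr1\texample\tgene\t1000\t1999\t.\t+\t.\tID=gene1\n'
  printf 'chr1\texample\texon\t1000\t1200\t.\t+\t.\tParent=gene1\n'
} > "$tmpdir/toy.gff3"

# Print seqid, start, end, strand and length of every feature of a given type.
# GFF3 coordinates are 1-based and inclusive, so length = end - start + 1.
features_of_type() {
  awk -F '\t' -v type="$1" '
    /^#/ { next }   # skip header and comment lines
    $3 == type { print $1, $4, $5, $7, $5 - $4 + 1 }' "$2"
}

features_of_type gene "$tmpdir/toy.gff3"   # prints: chr1 1000 1999 + 1000
```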

DNA-Seq Analysis Pipeline and associated files
STEP 02 READS ALIGNMENT (SAM/BAM; BWA/BOWTIE…/SAMTOOLS)
SAM / BAM
► Header lines + alignments section
► tab-separated columns
► 11 columns
QNAME, FLAG, RNAME, POS, MAPQ, CIGAR, RNEXT, PNEXT, TLEN, SEQ, QUAL
► Samtools (view, sort, index…)
► Removal of duplicated reads, which affect variant calling
http://[Link]/[Link]
http://[Link]/wiki/SAM
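
Because SAM is tab-separated text, the mapped/unmapped status can be read directly from the FLAG column. The sketch below uses a made-up SAM file with one mapped and one unmapped read; it tests bit 0x4 of FLAG (read unmapped) with integer arithmetic so that it runs in any awk. Function names are hypothetical.

```shell
# Toy SAM file (made-up reads): one header line, one mapped and one unmapped read.
tmpdir=$(mktemp -d)
{
  printf '@SQ\tSN:chr1\tLN:1000\n'
  printf 'read1\t0\tchr1\t100\t60\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII\n'
  printf 'read2\t4\t*\t0\t0\t*\t*\t0\t0\tTTTTGGGGCC\tIIIIIIIIII\n'
} > "$tmpdir/toy.sam"

# Names of unmapped reads: bit 0x4 of FLAG (column 2) is set.
unmapped_reads() {
  awk -F '\t' '!/^@/ && int($2 / 4) % 2 == 1 { print $1 }' "$1"
}

# Number of mapped reads per reference sequence (RNAME, column 3).
mapped_per_reference() {
  awk -F '\t' '
    !/^@/ && int($2 / 4) % 2 == 0 { n[$3]++ }
    END { for (r in n) print r, n[r] }' "$1"
}

unmapped_reads "$tmpdir/toy.sam"         # prints: read2
mapped_per_reference "$tmpdir/toy.sam"   # prints: chr1 1
```

On real BAM files the same questions are usually answered with samtools itself; this only shows what the columns contain.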

DNA-Seq Analysis Pipeline and associated files
STEP 03 GENOMIC COVERAGE (BAI/BED; SAMTOOLS/BEDTOOLS)
Coverage
► IGV = Integrative Genomics Viewer (https://[Link])[Link]/igv/)
► other useful file formats: BED = Browser Extensible Data (http://[Link]/FAQ/FAQformat)
► Coverage is associated with 3 different concepts:
- Fold coverage (a number followed by X, e.g. 30X)
- Breadth of coverage (fraction of the genome covered by at least one read)
- Depth of coverage (number of reads covering a given position)
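
The three notions can be illustrated with a little arithmetic. This is a sketch under assumptions: expected fold coverage is taken as (number of reads x read length) / genome size, and the per-position depth table (chromosome, position, depth) and the 10-base "genome" are made up; function names are hypothetical.

```shell
# Expected fold coverage = (number of reads x read length) / genome size.
fold_coverage() {   # usage: fold_coverage N_READS READ_LENGTH GENOME_SIZE
  awk -v n="$1" -v len="$2" -v g="$3" 'BEGIN { printf "%.1fX\n", n * len / g }'
}

# Mean depth and breadth from a 3-column table: chromosome, position, depth.
# Positions absent from the table count as depth 0, so we divide by the genome size.
depth_summary() {   # usage: depth_summary DEPTH_TABLE GENOME_SIZE
  awk -F '\t' -v g="$2" '
    { sum += $3; if ($3 > 0) covered++ }
    END { printf "mean_depth=%.2f breadth=%.2f\n", sum / g, covered / g }' "$1"
}

tmpdir=$(mktemp -d)
printf 'chr1\t1\t2\nchr1\t2\t4\nchr1\t3\t0\nchr1\t4\t3\nchr1\t5\t1\n' > "$tmpdir/depth.tsv"

fold_coverage 600000000 150 3000000000   # prints: 30.0X
depth_summary "$tmpdir/depth.tsv" 10      # prints: mean_depth=1.00 breadth=0.40
```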

STEP 04 SV/CNV/VARIANT CALLING (VCF)
SV/CNV/Variant calling
► Structural variations (SV)
Deletions, duplications, copy-number variations, insertions, inversions, translocations…
► Copy number variations (CNV)
Deletions or duplications of genes or relatively large regions of the genome that affect chromosomes
► Variant calling (SNPs and small InDels)
- SNPs: affect only 1 nucleotide
- InDels: affect 1 or several nucleotides
[Figure: read-pair signatures: properly mapped pair; split mapping; different distance or orientation for the pair]

STEP 04 SV/CNV/VARIANT CALLING (VCF)
[Figure: variants shown against the reference, including single nucleotide changes]
(definitions adapted from Scherer et al., 2007)

DNA-Seq Analysis Pipeline and associated files
STEP 04 SV/CNV/VARIANT CALLING (VCF)
► VCF (Variant Call Format)
- Text file format storing SNP and InDel information
- http://[Link]/node/101
- Obtaining variants listed in this format is a multistep procedure involving different tools, but the format itself is standardized
- Headers (meta-information) + data lines
- 8 required fields, tab-delimited (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO)
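
As an illustration of the data lines, the sketch below separates SNPs from InDels in a made-up VCF by comparing the lengths of REF and ALT. It assumes a single ALT allele per line (no comma-separated alternatives); all records and the function name are hypothetical.

```shell
# Toy VCF (made-up variants) with the 8 required columns.
tmpdir=$(mktemp -d)
{
  printf '##fileformat=VCFv4.2\n'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
  printf 'chr1\t12345\t.\tA\tG\t50\tPASS\tDP=20\n'
  printf 'chr1\t22222\t.\tAT\tA\t45\tPASS\tDP=18\n'
  printf 'chr2\t500\t.\tC\tCGG\t60\tPASS\tDP=31\n'
} > "$tmpdir/toy.vcf"

# A SNP replaces 1 base by 1 base; anything else is reported here as an InDel.
classify_variants() {
  awk -F '\t' '
    /^#/ { next }   # skip meta-information and the column header
    {
      kind = (length($4) == 1 && length($5) == 1) ? "SNP" : "InDel"
      print $1, $2, $4, $5, kind
    }' "$1"
}

classify_variants "$tmpdir/toy.vcf"
```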

STEP 05 BIOLOGICAL INTERPRETATION (CSV/XLS/TXT)
From variant annotation to data mining
► web-based
► available packages
Aim
► Functional impact of variants (synonymous or not…)
► Gene Ontology annotation (BP, MF, CC)
► Pathway / network information
► Predictions of pathogenicity / severity
NB: DAVID (Database for Annotation, Visualization and Integrated Discovery) to switch between databases
https://[Link]/

DNA-Seq Analysis: Take-home messages
Biological question
► needs to be clearly defined first, so that the design of the experiment, the library construction and the analysis pipeline can be prepared accordingly
Platform
► Each one has its own specificities that need to be understood before choosing one
► Different technologies, short reads (Illumina…) vs long reads (PacBio…)
► Rapidly evolving, several limitations (PCR bias for GC-rich regions…)
► Combination of different platforms possible (de novo…)
Input / Output files
► Companion index files needed (.fa & .fai / .bam & .bai / .vcf & .[Link]…)
► text-based (FASTA, FASTQ, SAM, GTF/GFF, BED, VCF, WIG) or binary (BAM, BCF, SFF)
► 1-based (GFF/GTF, SAM, WIG) or 0-based (BED; BAM stores positions 0-based internally, although samtools displays them 1-based)
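
The 1-based vs 0-based distinction matters when converting between formats. A minimal sketch, assuming a GFF-style interval (1-based, end included) is turned into a BED-style interval (0-based, end excluded): only the start is shifted by one. The input line and the function name are made up.

```shell
# Convert GFF-style lines read from standard input into 4-column BED lines:
# chromosome, start (0-based), end (excluded), feature type.
gff_to_bed() {
  awk -F '\t' 'BEGIN { OFS = "\t" } !/^#/ { print $1, $4 - 1, $5, $3 }'
}

# A gene spanning bases 1000..1999 (1-based, inclusive) becomes 999..1999 in BED.
printf 'chr1\texample\tgene\t1000\t1999\t.\t+\t.\tID=gene1\n' | gff_to_bed
```

The feature length is the same in both conventions: 1999 - 1000 + 1 in GFF and 1999 - 999 in BED, i.e. 1000 bases.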

DNA-Seq Analysis: The command-line environment
Viewing & Manipulating
more: view the content of a long file, step by step
$ more file1
less: view the content of a long file, by portions
$ less file1
head: view the first lines of a (long) file
$ head file1
NB: By default (without options), displays the first 10 lines
tail: view the last lines of a (long) file
$ tail file1
NB: By default (without options), displays the last 10 lines
grep: search for a pattern in a file
Careful: grep displays the whole line containing the specific pattern XXX
$ grep XXX file1
Can also be used to display all lines that DO NOT contain a specific pattern
$ grep -v XXX file1
[Figure: command annotated with its field specifier and its field separator (space)]
Useful commands:
$ man NameOfTheCommand
$ pwd

DNA-Seq Analysis: The command-line environment
FASTQ: FASTQC/…
SAM/BAM: BWA/BOWTIE…/SAMtools
VCF: VCFtools/jannovar

fastqc --help
fastqc [-o output dir] [--(no)extract]
--extract: uncompress the zipped output file in the same directory after it has been created.
--noextract: do not uncompress the output file after creating it.
-o, --outdir: create all output files in the specified output directory (the directory must already exist).

Trimming → Trimming low-quality bases
Low-quality base calls from the sequencer can cause an otherwise mappable sequence not to align
→ different open-source tools can trim off 3' bases
→ produce a FASTQ file of the trimmed reads to use as input to the alignment program.
e.g.: FASTX-Toolkit
gunzip -c Sample_R1.[Link] | fastx_trimmer -l 50 -Q 33 > [Link]
-l 50: trim down to 50 bases (last base is 50)
-Q 33: option that specifies how base qualities on the 4th line of each fastq entry are encoded
[Link]
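
To see what the quality encoding means, the sketch below decodes single quality characters, assuming the Phred+33 convention (score = ASCII code - 33) and the usual relation between a Phred score Q and the error probability, P = 10^(-Q/10). The function names are made up.

```shell
# Phred score of one quality character, assuming Phred+33 encoding.
phred33() {
  ascii=$(printf '%d' "'$1")   # numeric ASCII code of the character
  echo $(( ascii - 33 ))
}

# Probability that a base call with Phred score Q is wrong: 10^(-Q/10).
error_probability() {
  awk -v q="$1" 'BEGIN { printf "%g\n", 10 ^ (-q / 10) }'
}

phred33 'I'            # prints 40
phred33 '5'            # prints 20
phred33 '!'            # prints 0
error_probability 20   # prints 0.01
error_probability 40   # prints 0.0001
```

A trimmer reading the same file with the wrong offset would misjudge every quality value, which is why the encoding has to be stated explicitly.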

DNA-Seq Analysis: The command-line environment
FASTQ: FASTQC/…
Trimming → Trimming adapters
A 3' adapter contamination can cause the insert sequence not to align (the adapter sequence does not match the reference genome bases that follow the 3' end of the insert).
Unlike general fixed-length trimming, adapter trimming removes different numbers of 3' bases depending on where the adapter sequence is found.
e.g.: cutadapt
cutadapt -a AAAT [Link] -o test_new.fastq
cutadapt -m 22 -O 10 -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC
-m 22 = discard any sequence that is shorter than 22 bases after trimming
-O 10 = do not trim 3' adapter sequences unless at least the first 10 bases of the adapter are seen at the 3' end of the read
[Link]
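
The idea behind adapter trimming and the minimum-overlap option can be sketched in awk. This is a deliberately simplified illustration and not how cutadapt is implemented: it only handles exact matches (no sequencing errors), works on bare sequences rather than FASTQ records, and uses a shortened, made-up adapter and reads.

```shell
# trim_adapter ADAPTER MIN_OVERLAP: read one sequence per line on standard input.
# - If the full adapter occurs in the read, cut the read just before it.
# - Otherwise, if the read ends with the first k bases of the adapter
#   (k >= MIN_OVERLAP), remove those k bases.
trim_adapter() {
  awk -v ad="$1" -v minov="$2" '
    {
      seq = $0
      p = index(seq, ad)
      if (p > 0) { print substr(seq, 1, p - 1); next }
      for (k = length(ad) - 1; k >= minov; k--) {
        if (k <= length(seq) && substr(seq, length(seq) - k + 1) == substr(ad, 1, k)) {
          seq = substr(seq, 1, length(seq) - k)
          break
        }
      }
      print seq
    }'
}

# Full adapter inside the read, partial adapter (5 bases) at the 3' end,
# and a 2-base match that is shorter than the minimum overlap of 3.
printf '%s\n' ACGTACGTAGATCGGAAGTT ACGTACGTTTAGATC ACGTACGTAG | trim_adapter AGATCGGAAG 3
```

The third read is left untouched: its last two bases (AG) match the start of the adapter, but a 2-base match is too likely to occur by chance, which is the reason for requiring a minimum overlap.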

DNA-Seq Analysis: The command-line environment
SAM/BAM: BWA/BOWTIE…/SAMtools
Mapped and unmapped reads are imported and stored in SAM/BAM format
samtools: a suite of useful commands to visualize or get information from .sam/.bam files
#from SAM to BAM conversion (-b: output BAM, -S: input is SAM)
samtools view -bS [Link] > [Link]
#all reads mapping on a certain portion of chr1 or all the chr1 in another bam
samtools index [Link]
samtools view [Link] chr1:200000-500000
samtools view -b [Link] chr1 > test_chr1.bam
[Link]

DNA-Seq Analysis: The command-line environment
SAM/BAM: BWA/BOWTIE…/SAMtools
samtools
The flagstat command provides simple statistics on a BAM file
#simple statistics on a BAM file
samtools flagstat [Link]
Suppose we have:
- a reference sequence in [Link], indexed by samtools faidx
- position-sorted alignment files [Link] and [Link].
→ you can call SNPs and short INDELs using:
#[Link] a BCF file (binary data format: information about sequence variants (SNPs…))
samtools mpileup -uD -f [Link] [Link] [Link] | bcftools view -bvcg - > fi[Link]
samtools mpileup options:
-u output into an uncompressed BCF file
-D keep read depth for each sample
-f next argument is the reference genome file
bcftools view options:
-b output to BCF format
-v only output potential variant sites (i.e., exclude monomorphic ones)
-c do SNP calling
-g call genotypes for each sample in addition to just calling SNPs
NB: this is the samtools/bcftools 0.1.x syntax; in current releases, calling is done with bcftools mpileup and bcftools call.
http://[Link]/[Link]

DNA-Seq Analysis: The command-line environment
VCF: VCFtools/jannovar
Suppose we have:
- a reference sequence in [Link], indexed by samtools faidx
- position-sorted alignment files [Link] and [Link].
→ you can call SNPs and short INDELs using:
#[Link] BCF into VCF (flat text file rather than a binary = easier to view)
bcftools view fi[Link] | vcfuti[Link] varFilter -D100 > file.fi[Link]
-D100 filters out SNPs that had read depth higher than 100
http://[Link]/[Link]

DNA-Seq Analysis: The command-line environment
VCF: VCFtools/jannovar
http://[Link]/jannovar/

DNA-Seq Analysis: The command-line environment
► all these commands can be run as part of an analysis pipeline
► all files generated can be parsed using specific tools (samtools…)
► text-based files generated can be parsed using basic Linux commands
► At this stage you should be able to easily run queries to parse large files
- able to search for a particular pattern
- able to select specific information columns
→ Assignment!
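
As a warm-up for the assignment, the two skills listed above can be practised on a small made-up tab-separated file. The file name, its columns (chromosome, position, reference base, alternative base, filter status) and the function name are hypothetical.

```shell
# Toy tab-separated table: chromosome, position, ref, alt, filter status.
tmpdir=$(mktemp -d)
{
  printf 'chr1\t12345\tA\tG\tPASS\n'
  printf 'chr2\t500\tC\tT\tLowQual\n'
  printf 'chr1\t67890\tG\tA\tPASS\n'
} > "$tmpdir/variants.tsv"

# Search for a pattern: lines containing PASS, then lines NOT containing it.
grep PASS "$tmpdir/variants.tsv"
grep -v PASS "$tmpdir/variants.tsv"

# Select columns: -f is the field specifier; tab is the default field separator of cut.
cut -f 1,2 "$tmpdir/variants.tsv"

# Combine both: positions of the PASS variants located on chr1.
chr1_pass_positions() {
  grep -w chr1 "$1" | grep PASS | cut -f 2
}
chr1_pass_positions "$tmpdir/variants.tsv"
```

For files with another separator, cut takes it through the -d option (for example -d ' ' for spaces).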