
Introduction to Bioinformatics Course


Introduction to Bioinformatics Online Course: IBT
Genomics | Fatma Guerfali
Sequencing technologies and NGS Overview

Learning Objectives

Session 1: Sequencing technologies and NGS Overview

• Part 1: Introduction to DNA Sequencing
• Part 2: DNA Sequencing in the NGS era
• Part 3: Overview of NGS Technologies
• Part 4: DNA-Seq Protocol: Overview
• Part 5: DNA-Seq Analysis Pipeline and File Formats

Learning Outcomes

Session 1: Sequencing technologies and NGS Overview

• Understand the basics of NGS technologies
• Understand the different NGS file formats
• Navigate through database repositories to retrieve NGS datasets

Part 1

Introduction to DNA Sequencing

What is DNA Sequencing?

DNA sequencing is the process of reading the nucleotides present in DNA: determining the precise order of nucleotides within a DNA molecule.

Today, DNA-Seq generally refers to any NGS method or technology used to determine the order of the four bases (A, T, C, G) in a strand of DNA.

Two main types of DNA sequencing technology are in use today: Sanger sequencing and Next-Generation Sequencing (NGS).

Each of these technologies has its place in today's genetic analysis environment.

Main DNA sequencing technologies

1953       (Watson & Crick) DNA structure as a double helix
1972-1976  (Min Jou et al., 1972; Fiers et al., 1976) Sequence of the 1st complete gene and the complete genome of bacteriophage MS2 (viral RNA, 3,569 nt)
1977       (Sanger et al., Feb 1977) The first full DNA genome to be sequenced: bacteriophage φX174
1977       Maxam & Gilbert + Sanger sequencing
1987       ABI markets the ABI370, the 1st fully automated sequencing machine
1995       First published use of WGS sequencing
1995       ([Link]) 1st complete genome of a free-living organism, the bacterium H. influenzae (circular chromosome, >1.8 Mb)
1996       First NGS technology: pyrosequencing
2001       Draft sequence of the human genome (shotgun sequencing methods), public & private (Lander et al., Feb 2001, Nature; Venter et al., Feb 2001, Science)

The Human Genome Project: the objectives

The Human Genome Project (HGP) was a 13-year (1990 to April 14, 2003) international effort to sequence the 3 billion "letters" of human DNA.

The project was led by the U.S. DoE and the NIH. Its overall cost is usually quoted at about $3 billion; the figure of roughly $300 million refers to the worldwide cost of generating the initial draft sequence.

International Human Genome Sequencing Consortium (IHGSC) = group of publicly funded researchers.

At any given time, ≈ 200 labs in the United States supported these efforts, and more than 18 countries from across the globe contributed to the HGP.

Sources: Chial, 2008; http://[Link]/sequencingcosts/; [Link]

The Human Genome Project: the objectives

Two groups competed to sequence the human genome:
- Public (IHGSC)
- Private (Celera Genomics)

Opposing philosophies:
- Public: HGP Bermuda Agreement (1996)
  → all information from the project would be made freely available to all within 24 h.
- Private
  → access restricted to paying customers.

In February 2001, drafts of the human genome sequence were published simultaneously by the public and private groups in separate articles (Lander et al. (IHGSC), Feb 2001, Nature; Venter et al., Feb 2001, Science).

Sources: Chial, 2008; http://[Link]/sequencingcosts/; http://[Link]/; [Link]

The Human Genome Project: the method (WGS)

Hierarchical genome shotgun sequencing (IHGSC)
- Shotgun phase
  → genome fragmented into larger segments
  → cloning into vectors
  → sequencing of the clones
  → assembly of the shotgun sequences
  → relied on the physical map of the human genome established earlier.
- Finishing phase: filling in gaps and resolving DNA sequences in ambiguous areas.

Whole-genome shotgun sequencing (Celera Genomics)
  → genome sheared randomly into small fragments (appropriately sized for sequencing)
  → reassembly.

The IHGSC and Celera used the same general chain-termination (Sanger) method for the DNA sequencing itself (Hood & Galas, 2003).

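To make the "shear randomly, then reassemble" idea concrete, here is a minimal sketch of greedy overlap assembly on a toy sequence. It is an illustration only: the function names, the minimum overlap and the example reads are assumptions made for this sketch, and neither the IHGSC nor Celera assemblers worked this simply (real assemblers must handle sequencing errors, repeats and both strands).

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len: int = 3):
    """Repeatedly merge the two sequences with the longest overlap.

    Toy assumption: error-free reads, all from the same strand.
    """
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_len:
                        best_len, best_i, best_j = k, i, j
        if best_len == 0:
            return contigs  # nothing left to merge
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for n, c in enumerate(contigs) if n not in (best_i, best_j)]
        contigs.append(merged)


# Hypothetical shotgun reads covering a 16-base toy "genome".
reads = ["ATGGCGT", "GCGTACC", "TACCTGA", "CTGAAAC"]
contigs = greedy_assemble(reads)
print(contigs)
```

In the hierarchical approach the same kind of assembly is done clone by clone, and the physical map tells you where each assembled clone belongs.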
Sequencing Quality

Sequencing quality depends on the average number of times each base in the genome is 'read' during the sequencing process (the coverage, or depth).

For the Human Genome Project (HGP):
- 'draft sequence': covering ~90% of the genome at ~99.9% accuracy
- 'finished sequence': covering >95% of the genome at ~99.99% accuracy

Producing truly high-quality 'finished' sequence by this definition is very expensive and labor-intensive.

Several releases of the human genome sequence have been made.

http://[Link]/sequencingcosts/

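The two quantities on this slide can be turned into quick arithmetic. The sketch below assumes the usual definition of mean coverage (number of reads × read length / genome size) and treats accuracy as a uniform per-base error rate; the read count and read length are hypothetical values chosen for illustration.

```python
def mean_coverage(n_reads: int, read_length: int, genome_size: int) -> float:
    """Average number of times each base is read: N * L / G."""
    return n_reads * read_length / genome_size


def expected_errors(n_bases: int, accuracy: float) -> float:
    """Expected number of wrong bases, assuming a uniform per-base accuracy."""
    return n_bases * (1.0 - accuracy)


# Hypothetical run: 600 million reads of 150 bases on a 3-billion-base genome.
cov = mean_coverage(n_reads=600_000_000, read_length=150, genome_size=3_000_000_000)

# Errors expected per megabase at the draft and finished accuracy levels.
draft_errors = expected_errors(1_000_000, 0.999)
finished_errors = expected_errors(1_000_000, 0.9999)

print(cov, round(draft_errors), round(finished_errors))
```

Going from ~99.9% to ~99.99% accuracy means roughly ten times fewer errors per megabase, which is part of why finishing is so costly.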
Finished Genome vs Draft Genome

Published genomes show variable degrees of completion.

• Draft sequencing
- High-throughput or shotgun phase (whole-genome or clone-based approach)
- Assembly using specific algorithms (whole-genome or single-clone assembly)
→ lower accuracy than finished sequence; some segments are missing or in the wrong order or orientation.

• Finishing
- Accurate base identification, quality checks, and few if any gaps.
- Contiguous segments of sequence are ordered and linked to one another.
- No ambiguities or discrepancies about segment order and orientation.

• Complete genome
A genome represented by a single contiguous sequence with no ambiguities.

→ The sequences available are finished to a certain high quality.
(Mardis et al., 2002)
http://[Link]ti[Link]/

The Human Genome Project: the heritage

The HGP required that all human genome sequence information be freely and publicly available. The existing DNA sequences have been stored in databases available to anyone willing to exploit and analyze them.

Dedicated databases house various data for model organisms, such as sequences of known and hypothetical genes and proteins (GenBank, NCBI). Other databases (Ensembl, http://[Link]) present additional data and annotation as well as powerful tools for visualizing and searching them.

Community efforts exist for non-model organisms such as eukaryotic pathogens: EuPathDB (http://[Link]/eupathdb/).

Computer programs have been developed to analyze and interpret the data.

The Human Genome Project: the heritage

The human genome contains only about 20,000 protein-coding genes, similar in number to, and with largely orthologous functions to, those of nematodes, which have only about 1,000 somatic cells.

The extent of non-protein-coding DNA increases with increasing complexity, reaching > 98% in humans.

→ ENCODE
→ GENCODE

(Mattick, 2011)

The Human Genome Project: the heritage

ENCODE (https://[Link])
The Encyclopedia of DNA Elements (ENCODE) Project aims to provide "a list of functional elements in the human genome, including elements that act at the protein and RNA levels, and regulatory elements that control cells and circumstances in which a gene is active."
"The generation of such a catalogue is crucial for understanding genome function."

GENCODE (http://[Link])
The human genome has been the focus of intensive manual annotation.
The GENCODE Consortium aims to identify all gene features in the human and mouse genomes using a combination of computational analysis, manual annotation, and experimental validation.

(Djebali et al., 2012)
(Harrow et al., 2012)
(Birney et al., 2007)

The Human Genome Project: the heritage

Genetic differences at individual bases (SNPs) are by far the most common type of genetic variation in a genome.

Goal: develop a Haplotype Map of the human genome = identification and cataloguing of most of the millions of SNPs estimated to occur commonly in the human genome.

The map describes what these variants are, where they are located, and how they are distributed among people within populations and among populations in different parts of the world. → It is designed to provide information for linking genetic variants to the risk of specific diseases.

1000 Genomes Project: the catalogue has become more complete and reliable as many novel variants have been discovered.

http://[Link]

The Human Genome Project: the heritage

1000 Genomes ([Link]/)
► Identify most genetic variants with frequencies of at least 1%.
► Freely accessible resource of human genetic variation.
► Final data set = data for 2,504 individuals from 26 populations (low-coverage sequencing and exome sequence data for all).
► International Genome Sample Resource (IGSR): ensures the ongoing usability of the data generated by the 1000 Genomes Project.

UK10K ([Link]/)
► Identification of rare genetic variants by studying the DNA of 4,000 individuals and comparing it with the protein-coding regions of 6,000 people with documented diseases.
► Link between genetic variants and rare diseases.

http://[Link]

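As a rough illustration of what a "frequency of at least 1%" means in a panel of 2,504 diploid individuals (5,008 chromosomes), the sketch below computes an alternate-allele frequency from genotypes. The genotype encoding and the two example variants are assumptions made for this sketch, not data from the 1000 Genomes Project.

```python
def alt_allele_frequency(genotypes) -> float:
    """Frequency of the alternate allele.

    `genotypes` holds one (allele, allele) pair per diploid individual,
    with 0 = reference allele and 1 = alternate allele (assumed encoding).
    """
    n_alleles = 2 * len(genotypes)
    return sum(a + b for a, b in genotypes) / n_alleles


def is_common(genotypes, threshold: float = 0.01) -> bool:
    """True if the variant reaches the frequency threshold (1% by default)."""
    return alt_allele_frequency(genotypes) >= threshold


# Two hypothetical variants in a panel of 2,504 individuals.
rare = [(0, 1)] * 26 + [(0, 0)] * (2504 - 26)      # 26 heterozygous carriers
common = [(0, 1)] * 60 + [(0, 0)] * (2504 - 60)    # 60 heterozygous carriers

print(alt_allele_frequency(rare), alt_allele_frequency(common))
```

With 5,008 chromosomes, a variant needs about 50 copies to reach 1%, which is why studies of rarer variants such as UK10K need different designs.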
The Human Genome Project: the heritage

Development of novel technologies to help increase the depth of sequencing: Next-Generation Sequencing (NGS) technologies.

Since their development, NGS technologies have gained increasing attention, with considerable potential applications in both diagnostic and public health microbiology.

They revolutionized the sequencing process: from Sanger to high-throughput (HT) sequencing.

(Salipante et al., 2013)

Part 2

DNA Sequencing in the NGS era

DNA Sequencing method: from Sanger to NGS

Sanger sequencing is the method developed by Frederick Sanger in 1977. It involves copying single-stranded DNA in the presence of chemically altered bases called dideoxynucleotides (ddNTPs).

When a ddNTP is incorporated at the 3' end of the growing chain, it terminates the chain selectively at A, C, G, or T. The terminated chains are then resolved by size using capillary electrophoresis.

[Link]

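The logic of chain termination can be sketched in a few lines: every terminated chain has a length and a known last base, and sorting the chains by length (what electrophoresis does) spells out the new strand. This is a deliberately simplified model. It assumes that, with many template copies, a chain terminates at every possible position, and it writes the template in the order in which the polymerase reads it; all names are illustrative.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def terminated_chains(template: str):
    """One (chain length, terminal ddNTP base) pair per template position.

    Simplifying assumption: termination is random in a real reaction, but
    with enough template copies every position ends up represented once.
    """
    return [(pos, COMPLEMENT[base]) for pos, base in enumerate(template, start=1)]


def read_ladder(chains) -> str:
    """Sort chains by length, as electrophoresis does, and read the end bases."""
    return "".join(base for _, base in sorted(chains))


template = "TACGGATC"                        # hypothetical template, in copying order
chains = terminated_chains(template)[::-1]   # arbitrary order before separation
new_strand = read_ladder(chains)
print(new_strand)
```

The sequence read from the ladder is the newly synthesized strand, that is, the complement of the template.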
DNA Sequencing method: from Sanger to NGS

Applied Biosystems (Life Technologies) manufactured the automated capillary sequencers used by both Celera Genomics and the Human Genome Project.

Capillary sequencing was the first approach to successfully sequence a full human genome, but it was still too expensive and too slow for commercial purposes.

Because of this, Sanger technology has been displaced for large-scale sequencing by technologies such as pyrosequencing or SMRT sequencing…

DNA-Seq

DNA-Seq is now an effective sequencing strategy, following the advent of rapid DNA sequencing methods that have greatly accelerated biological and medical research and discovery: de novo…

DNA-Seq may be used to determine the sequence of individual genes, larger genetic regions, full chromosomes, or entire genomes.

'DNA-Seq' and other related 'seq' technologies make it possible to cover genome complexity: genomic DNA-Seq, Methyl-Seq, ChIP-Seq, exome sequencing…

NGS

Next-generation sequencing (NGS), or high-throughput (HT) sequencing = catch-all term describing the different modern sequencing technologies used by different platforms:

- Illumina (Solexa) sequencing
- Roche 454 sequencing
- Ion Torrent: Proton / PGM sequencing
- …

These technologies sequence DNA faster and more cheaply than Sanger sequencing = a revolution for genomics and molecular biology.

NGS: Sequencing Technologies and Platforms

Technology     Company                         Support                  Chemistry

Massively parallel sequencing
Solexa         Illumina                        Bridge PCR on flowcell   Seq-By-Synthesis
454            Roche Applied Science           emPCR on beads           Pyrosequencing
SOLiD          AB / Life Technologies          emPCR on beads           Seq-By-Ligation
Ion Torrent    Life Technologies               emPCR on beads           Proton detection

Single-molecule sequencing
PacBio SMRT    Pacific Biosciences             Pol performance          Real-time-Seq
Nanopore       Oxford Nanopore Tech/McNally    Translocation            NA

NGS: Principles of DNA Sequencing: SBS

Sequencing by synthesis (SBS): tracking the addition of labeled nucleotides as the DNA chain is copied.
- The DNA template is immobilized.
- Solutions of A, C, G and T nucleotides are sequentially added and removed.
- Light is generated when a nucleotide complements the first unpaired base.
- The chemiluminescent signal is detected to determine the sequence.

NGS: In brief: emPCR (454/SOLiD/Ion Torrent) vs Bridge PCR (Illumina)

Bridge PCR
- The adaptor-flanked shotgun library is PCR-amplified on a flow cell.
- Both primers coat the surface of a solid substrate.
- Amplification products from any given member of the library remain locally fixed near the point of origin = cluster.
- The PCR produces clonal clusters, each containing copies of a single DNA molecule.

(Shendure & Ji, 2008)

NGS: Principles of DNA Sequencing: Pyrosequencing

- Incorporation of a dNTP by DNA polymerase (DNApol) releases pyrophosphate (PPi).
- ATP sulfurylase converts PPi to ATP in the presence of adenosine 5'-phosphosulfate (APS).
- ATP is the substrate for the luciferase-mediated conversion of luciferin to oxyluciferin.
- This conversion generates light in amounts proportional to the amount of ATP; the light is detected by a camera.
- Unincorporated nucleotides and ATP are degraded by apyrase, and the reaction can restart with another nucleotide.

(https://[Link])

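Because light is proportional to the number of nucleotides incorporated, each nucleotide flow gives a signal of 0, 1, 2… depending on the length of the homopolymer it extends. The sketch below simulates such a "flowgram" under idealized conditions (no noise, complete incorporation). The flow order "TACG" and all function names are assumptions for illustration, not a description of any instrument's software.

```python
from itertools import cycle


def flowgram(new_strand: str, flow_order: str = "TACG"):
    """Return (nucleotide flowed, signal) pairs for an idealized run.

    The signal is the number of bases incorporated during that flow,
    which is what the light intensity is proportional to.
    """
    unknown = set(new_strand) - set(flow_order)
    if unknown:
        raise ValueError(f"bases not in flow order: {sorted(unknown)}")
    flows = []
    pos = 0
    for nucleotide in cycle(flow_order):
        if pos >= len(new_strand):
            break
        run = 0
        while pos < len(new_strand) and new_strand[pos] == nucleotide:
            run += 1
            pos += 1
        flows.append((nucleotide, run))
    return flows


def decode(flows) -> str:
    """Rebuild the synthesized strand from the flow signals."""
    return "".join(nucleotide * signal for nucleotide, signal in flows)


flows = flowgram("TTAGGC")   # hypothetical strand being synthesized
print(flows)
print(decode(flows))
```

The same flow logic applies when the detected signal is a pH change rather than light; in both cases long homopolymers are the hard part, because the signal must be resolved into an exact integer.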
NGS: In brief: emPCR (454/SOLiD/Ion Torrent) vs Bridge PCR (Illumina)

emPCR
- The adaptor-flanked shotgun library is PCR-amplified in the context of a water-in-oil emulsion.
- One PCR primer is attached by its 5' end to micron-scale beads.
- Each bead-containing compartment holds 0 or 1 template DNA molecule.
- PCR amplicons are captured on the surface of the bead.
- 1 clonally amplified bead = PCR products corresponding to amplification of a single molecule from the library.

(Shendure & Ji, 2008)

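The "0 or 1 template per compartment" condition is reached by dilution. A common way to reason about it, used here as an assumption rather than something stated in the slides, is to model the number of templates per compartment as Poisson-distributed with mean λ. The sketch shows how a low λ keeps mixed (non-clonal) compartments rare at the price of many empty ones; the value λ = 0.1 is hypothetical.

```python
from math import exp, factorial


def poisson_pmf(k: int, lam: float) -> float:
    """Probability of exactly k templates when the mean per compartment is lam."""
    return lam ** k * exp(-lam) / factorial(k)


def compartment_fractions(lam: float) -> dict:
    """Fractions of empty, clonal (1 template) and mixed (2 or more) compartments."""
    empty = poisson_pmf(0, lam)
    clonal = poisson_pmf(1, lam)
    return {"empty": empty, "clonal": clonal, "mixed": 1.0 - empty - clonal}


fractions = compartment_fractions(0.1)   # hypothetical mean of 0.1 template per compartment
print(fractions)
```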
Part 3

Overview of NGS Technologies

Solexa

(Figure: flow cell)

- The input sample must be cleaved into short sections.
- Fragments are ligated to adaptors and annealed to the slide using the adaptors.
- Fragments are separated into single strands to be sequenced.
- Nucleotides are modified so that each emits a different coloured light when excited by a laser. They also carry a terminator, so that only one base is added at a time.
- Fragments are amplified by PCR; the process is repeated in cycles and the images are analyzed.

(modified from Gilchrist, 2010)

Solexa: Intra-Platform Comparison

http://[Link]

454 Technology

- As with Illumina, the DNA is fragmented.
- Adaptors are added and one end is annealed to beads: 1 DNA fragment = 1 bead.
- Fragments are amplified by PCR using adaptor-specific primers.
- The sequence can then be determined computationally.
- Reads are longer than Illumina reads and of variable length.

(http://[Link])

454 Technology: Intra-Platform Comparison

                GS Junior System                       GS FLX+ System
Description     Brings the power of 454 Sequencing     Combines long reads, accuracy and high
                Systems directly to the laboratory     throughput, making the system well suited
                benchtop                               for larger genomic projects
Read lengths    GS Junior (up to 400 bp)               GS FLX Titanium XL+ (up to 1,000 bp)
                GS Junior+ (700-800 bp)                GS FLX Titanium XLR70 (up to 600 bp)

([Link]/)

SOLiD

Sequencing is performed with a ligase, rather than a polymerase.

Each sequencing cycle introduces a partially degenerate population of fluorescently labeled octamers. The population is structured such that the label correlates with the identity of the central 2 bp in the octamer.

After ligation and imaging in four channels, the labeled portion of the octamer (that is, 'zzz') is cleaved, leaving a free end for another cycle of ligation.

Unlike other NGS technologies, in which base detection is performed through a polymerase reaction, SOLiD achieves sequencing by ligation (Sequencing-By-Ligation).

Adapted from (Shendure & Ji, 2008) and (http://[Link]/978-1-4614-7725-9)

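Since each label reports on a pair of adjacent bases rather than on a single base, SOLiD data are naturally expressed as a series of "colors". The sketch below uses the standard two-base color code (four colors, each shared by four dinucleotides), which is not spelled out in the slides and is added here as background. The function names are illustrative.

```python
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
CODE_BASE = "ACGT"


def to_color_space(seq: str):
    """Encode each pair of adjacent bases as one of four colors (0-3).

    With the 2-bit base codes above, the color is the XOR of the two codes:
    AA, CC, GG, TT -> 0; AC, CA, GT, TG -> 1; AG, GA, CT, TC -> 2; AT, TA, CG, GC -> 3.
    """
    return [BASE_CODE[a] ^ BASE_CODE[b] for a, b in zip(seq, seq[1:])]


def from_color_space(first_base: str, colors) -> str:
    """Decode colors back to bases; the first base must be known."""
    bases = [first_base]
    for color in colors:
        bases.append(CODE_BASE[BASE_CODE[bases[-1]] ^ color])
    return "".join(bases)


colors = to_color_space("ATGGC")   # hypothetical 5-base sequence
print(colors)
print(from_color_space("A", colors))
```

Decoding needs a known first base; in practice this comes from the adaptor. A single wrong color changes every downstream base when decoded naively, which is why color-space reads are usually aligned in color space.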
Ion Torrent

- As in other kinds of NGS, the input DNA is fragmented.
- Adaptors are added and one molecule is placed onto a bead.
- The molecule is amplified on the bead by emulsion PCR. Each bead is placed into one well of a slide.
- The pH is measured in each well after a single type of dNTP is supplied: each H+ ion released during incorporation decreases the pH. The change in pH shows whether that base was added to the sequence read, and how many times.
- The dNTPs are washed away, and the process is repeated in cycles.

Adapted from (http://[Link]) and (http://[Link]/978-1-4614-7725-9)

PacBio

Single Molecule Real Time (SMRT) DNA sequencing technology

a → A DNA polymerase molecule is tethered to the bottom of a nanowell (zero-mode waveguide, ZMW). The ZMW design ensures that only one nucleotide-linked dye can be directly excited at a time.

b → Each incorporated phospholinked nucleotide resides in the enzyme's active site for a few milliseconds, which is long enough for a fluorescent signal to be recorded.

NB: In other systems, the fluorescent label is attached to the base of the nucleotide. In SMRT technology, the fluorescent label is attached to the phosphate chain → the released labeled pentaphosphates diffuse away quickly.

(Osherovich et al., 2010)

Nanopore

Single-molecule nanopore DNA sequencing technology: how it works

Schematic representation of a nanopore sequencing system.

• Upper protein → unzips the DNA into single strands (ssDNA).
• Second protein
- forms a nanopore in a membrane.
- holds an adaptor molecule that reduces the speed at which DNA passes through the pore.
• Each base obstructs the flow of ions through the pore to a different degree.
• PromethION…

(http://[Link]/978-1-4614-7725-9)
(picture from MIT's Technology Review)

Inter-Platform Comparison

(https://fl[Link])

The four main advantages of NGS over classical Sanger sequencing

Speed
NGS is quicker than Sanger sequencing in two ways:
- The chemical reaction may be combined with signal detection, whereas in Sanger sequencing these are two separate processes.
- Sanger sequencing takes only one read at a time, whereas NGS is massively parallel.

Cost
Generating the first draft of the human genome sequence cost about $300M.
Sequencing a human genome with Illumina now approaches the long-anticipated $1,000 target.

Sample size
NGS needs a significantly smaller starting amount of DNA/RNA.

Accuracy
Each position is read many more times than with Sanger sequencing → greater coverage, higher consensus accuracy and sequence reliability (although individual NGS reads are less accurate).

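The accuracy argument (less accurate individual reads, but a more reliable consensus) can be checked with a small calculation. The sketch below gives the probability that a simple majority vote over the reads covering one base is wrong. It assumes independent errors and, pessimistically, that all erroneous reads agree on the same wrong base; the 1% per-read error rate is a hypothetical figure.

```python
from math import comb


def majority_error(per_read_error: float, depth: int) -> float:
    """Probability that more than half of `depth` reads are wrong at one base.

    Assumptions: errors are independent between reads, and every wrong read
    shows the same wrong base (a worst case for majority voting).
    """
    p = per_read_error
    return sum(
        comb(depth, k) * p ** k * (1 - p) ** (depth - k)
        for k in range(depth // 2 + 1, depth + 1)
    )


for depth in (1, 3, 11, 31):
    print(depth, majority_error(0.01, depth))
```

With a 1% per-read error rate, three reads already bring the consensus error to about 0.03%, and it keeps falling as depth grows. Systematic errors, which are not independent, do not average out this way.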
DNA Sequencing costs
(Data from the NHGRI Genome Sequencing Program (GSP))

Accurately determining the cost of sequencing a given genome (e.g., a human genome) is not simple.

http://[Link]/sequencingcosts/

Comparison of human genome sequencing methods: HGP vs. ~2016

http://[Link]/sequencingcosts/

Part 4

DNA-Seq Protocol: Overview

DNA-Seq: Protocol for Library Construction

STEP 01  Genomic DNA purification
STEP 02  Genomic DNA fragmentation
STEP 03  End repair and A-tailing
STEP 04  Adapter ligation
STEP 05  Size selection & PCR
STEP 06  Sequencing

DNA-Seq: Kits for DNA-Seq library prep

Illumina-compatible DNA-Seq library prep kits

- NEXTflex™ Rapid DNA-Seq Kit: DNA-Seq library prep kit, 1 ng - 1 µg input DNA
- NEXTflex mtDNA-Seq Kit: mtDNA libraries
- NEXTflex™ DNA Sequencing Kits: DNA-Seq library prep kit, 1 µg of input DNA
- NEXTflex™ PCR-Free DNA Sequencing Kit: amplification-free DNA-Seq library prep kit for sequencing 0.5 µg - 3 µg of input DNA
- NEXTflex™ PCR-Free Barcodes: up to 48 barcodes for use with the NEXTflex™ PCR-Free DNA-Seq Kits and other DNA-Seq protocols

- KAPA HyperPlus Kits: input DNA from 1 ng - 1 µg
- KAPA Hyper Prep Kits: 250 ng FFPE DNA or less, with fewer cycles of amplification using KAPA HiFi DNA Polymerase (duplication rates + coverage)

(http://[Link]tifi[Link]/Next-Gen-Sequencing/Illumina-DNA-Library-Prep-Kits/)
(https://[Link]/product-applications/products/next-generation-sequencing-2/dna-library-preparation/)

DNA-Seq: Critical steps in DNA-Seq library prep

STEP 01  Genomic DNA purification

Starting material: QC
• Quality control
  ► gel visualization, Bioanalyzer (Agilent, Bio-Rad)
• Quantity control
  ► Nanodrop, Qubit…

Experimental design
► SR (single read) or PE (paired-end)
► Multiplexing or not
► De novo or not
► …

DNA-Seq: Critical steps in DNA-Seq library prep

STEP 01  Genomic DNA purification

Experimental design
► SR (single-end reads) or PE (paired-end reads)

(Figure: SR vs PE reads; insert size)

https://[Link]/p/162806/

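The difference between single-end and paired-end reads can be sketched on a toy fragment. A single-end run reads only one end of the insert; a paired-end run also reads the other end, from the opposite strand. The fragment, the read length and the function names below are hypothetical, and real reads contain errors that this sketch ignores.

```python
def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def single_end_read(fragment: str, read_length: int) -> str:
    """SR: only the first `read_length` bases of the fragment are read."""
    return fragment[:read_length]


def paired_end_reads(fragment: str, read_length: int):
    """PE: read 1 from one end, read 2 from the other end on the opposite strand."""
    read1 = fragment[:read_length]
    read2 = reverse_complement(fragment)[:read_length]
    return read1, read2


fragment = "ATGCCGTAAGCTTGA"                     # hypothetical 15-base insert
read1, read2 = paired_end_reads(fragment, 5)
inner_distance = len(fragment) - 2 * 5           # unsequenced bases between the mates
print(read1, read2, inner_distance)
```

Because the insert size is approximately known, the two mates constrain each other's position on the genome, which helps with repeats and structural variants.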
DNA-Seq: Critical steps in DNA-Seq library prep

STEP 01  Genomic DNA purification

Experimental design
► Multiplexing or not?

• Multiplexing = attaching a specific barcode sequence to each library, so that the sample from which each read originates can be identified later.
• Libraries are pooled and sequenced in parallel.
• Reads from each library are differentiated by their barcode (de-multiplexing).
• Each set of reads is aligned to the reference genome.

[Link]

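De-multiplexing amounts to sorting reads into per-sample bins by barcode. The sketch below assumes, for simplicity, that the barcode is the first few bases of each read and that only exact matches count; on many platforms the barcode is instead read separately as an index read, and tools usually tolerate a mismatch. Barcodes, sample names and reads are all hypothetical.

```python
def demultiplex(reads, barcodes):
    """Assign each read to a sample by its leading barcode.

    `barcodes` maps barcode sequence -> sample name; all barcodes are
    assumed to have the same length. Unmatched reads go to "undetermined".
    """
    barcode_len = len(next(iter(barcodes)))
    bins = {sample: [] for sample in barcodes.values()}
    bins["undetermined"] = []
    for read in reads:
        sample = barcodes.get(read[:barcode_len])
        if sample is None:
            bins["undetermined"].append(read)         # keep the read untouched
        else:
            bins[sample].append(read[barcode_len:])   # strip the barcode
    return bins


barcodes = {"ACGT": "sample_A", "TGCA": "sample_B"}
reads = ["ACGTTTGACC", "TGCAGGATCC", "ACGTCCCAAA", "NNNNACGTAC"]
bins = demultiplex(reads, barcodes)
print(bins)
```

Each bin can then be aligned to the reference genome on its own, as in the last bullet of the slide.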
DNA-Seq: Critical steps in DNA-Seq library prep

STEP 02  Genomic DNA fragmentation

Fragmentation
• Can be included in the kit
  ► Optimization of fragmentation parameters
• Several methods
  ► Enzymatic, nebulization, acoustic shearing…

Starting material: input
• Low-quality DNA
  ► Caution in size selection
• High-quality DNA
  ► Size selection

DNA-Seq: Critical steps in DNA-Seq library prep

STEP 03  End repair and A-tailing

Repair ends
• Converts overhangs into blunt ends and phosphorylates the 5' ends
• Reagents: dNTPs, T4 DNA polymerase, Klenow; kinase/ATP (T4 PNK)
• Simple enzymatic reaction

(Figure: blunt ending by exonuclease; 5'-end phosphorylation)

DNA-Seq: Critical steps in DNA-Seq library prep

STEP 03  End repair and A-tailing

A-tailing (adenylation)
• Adds an 'A' base to the 3' end of the blunt, phosphorylated DNA fragments
• Prevents
  ► formation of adapter dimers
  ► concatemers
• Reagents: 1 mM dATP, Klenow exo- (3' to 5' exo minus)

(Figure: A-overhang)

DNA-Seq: Critical steps in DNA-Seq library prep

STEP 04  Adapter ligation

Adapter ligation
• Adapters are provided with the kit or custom-designed
• Adapter concentration affects ligation efficiency as well as adapter and adapter-dimer carryover
• Ligation efficiency is robust for adapter:insert molar ratios between 10:1 and >200:1
• Adapter ratio >200:1 for low-input applications
• Adapter quality
• Post-ligation cleanup

https://[Link]/

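Adapter:insert ratios are molar ratios, so the insert mass must first be converted to moles. The sketch below uses the common approximation of about 660 g/mol per base pair of double-stranded DNA; that constant, the function names and the example amounts are assumptions for illustration and do not come from any kit manual.

```python
BP_MASS = 660.0  # g/mol per base pair of dsDNA (approximation)


def dsdna_pmol(mass_ng: float, length_bp: float) -> float:
    """Convert a mass of dsDNA fragments (ng) to picomoles of molecules."""
    return mass_ng * 1000.0 / (length_bp * BP_MASS)


def adapter_pmol_needed(insert_ng: float, insert_bp: float, ratio: float) -> float:
    """Picomoles of adapter for a given adapter:insert molar ratio."""
    return ratio * dsdna_pmol(insert_ng, insert_bp)


# Hypothetical library: 330 ng of fragments averaging 500 bp.
insert_pmol = dsdna_pmol(330, 500)
adapter_pmol = adapter_pmol_needed(330, 500, ratio=10)
print(insert_pmol, adapter_pmol)
```

For the same mass, shorter fragments mean more molecules, so the amount of adapter needed for a given ratio rises as the average insert size falls.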
DNA-Seq: Critical steps in DNA-Seq library prep

STEP 05  Size selection and PCR

Size selection: read length considerations
• Size-select 300-400 bp or 350-500 bp fragments, post-ligation
• Ensures maximum coverage of most inserts
• Problem of non-uniform genome coverage
• Problem of material loss

→ Strategy to focus read lengths during sample and library preparation

https://[Link]/

DNA-Seq: Critical steps in DNAseq library prep
STEP 05: SIZE SELECTION AND PCR

Size selection: read length considerations
• Double solid-phase reversible immobilization (SPRI) selection methods reshape the input fragment distribution into well-defined ranges.
• SPRI + reverse-SPRI

[Figure: fragments on beads vs. fragments in supernatant]

https://[Link]/

DNA-Seq: Critical steps in DNAseq library prep
STEP 05: SIZE SELECTION AND PCR

Library Amplification (PCR)
► Amplifies the amount of DNA in the library
► Selectively enriches DNA fragments with adapter molecules on both ends
► Post-amplification cleanup

QC
► Quality, quantity and size check

https://[Link]/

DNA-Seq: Critical steps in DNAseq library prep
STEP 06: SEQUENCING

DNA Sequencing
► Input: constructed library
    - Whole-genome
    - Whole-exome
    - Target region
► Cluster amplification + sequencing + base calling
► QC (run report)
► Output: sequenced "reads" (FASTQ files)

https://[Link]/

Part 5

DNA-Seq Analysis Pipeline and File Formats

DNA-Seq Analysis Pipeline and associated files

KEY STEPS OF THE ANALYSIS PIPELINE      KEY FILES
STEP 01  Raw Sequencing Reads           FASTQ
STEP 02  Reads Alignment                SAM/BAM (mapped/unmapped; indexing, marking duplicates)
STEP 03  Genomic Coverage               BAI/BED
STEP 04  SV/CNV/Variant Calling         VCF
STEP 05  Biological Interpretation      CSV/XLS/TXT

DNA-Seq Analysis Pipeline and associated files
STEP 01: RAW SEQUENCING READS (FASTQ; SEQUENCING RUN/QC)

Sequencing output
► FASTQ (text) format
► Potentially SRA (binary), but this is rather used for public data online

FASTQ file
► Improvement on the Sanger breakthrough (associating each nucleotide with a quality score)
► Hundreds of millions of lines/rows
► Blocks of 4 lines (the first line of each block starts with @)
► Example

http://[Link]

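A minimal sketch of the 4-line block structure, assuming a made-up two-read file (the read names, sequences and quality strings below are invented for illustration). It counts the reads from the line count and checks that each quality string is as long as its sequence.

```shell
# Build a tiny two-read FASTQ file (invented reads, for illustration only).
fq=$(mktemp)
printf '%s\n' \
  '@read1' 'ACGTACGTAC' '+' 'IIIIIIIIII' \
  '@read2' 'TTGCA' '+' 'II#5?' > "$fq"

# Each record is exactly 4 lines: header (@), sequence, separator (+), qualities.
n_lines=$(wc -l < "$fq")
n_reads=$(( n_lines / 4 ))
echo "reads: $n_reads"

# Print each read name with its sequence length (lines 1 and 2 of each block).
read_lengths=$(awk 'NR % 4 == 1 { name = substr($0, 2) }
                    NR % 4 == 2 { print name "\t" length($0) }' "$fq")
echo "$read_lengths"

# Sanity check: count records whose quality string length differs from the sequence length.
bad_records=$(awk 'NR % 4 == 2 { n = length($0) }
                   NR % 4 == 0 && length($0) != n { c++ }
                   END { print c + 0 }' "$fq")
echo "records with mismatched lengths: $bad_records"
rm -f "$fq"
```

Dividing the line count by 4 is more reliable than counting lines that start with @, because a quality string can itself begin with @.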
DNA-Seq Analysis Pipeline and associated files
STEP 01: RAW SEQUENCING READS (FASTQ; SEQUENCING RUN/QC)

FastQC

http://[Link]
https://[Link]

DNA-Seq Analysis Pipeline and associated files
STEP 01: RAW SEQUENCING READS (FASTQ; SEQUENCING RUN/QC)

FastQC + Trimming + bad reads removal

http://[Link]
https://[Link]

DNA-Seq Analysis Pipeline and associated files
STEP 02: READS MAPPING/ALIGNMENT (SAM/BAM; BWA/BOWTIE…/SAMTOOLS)

Input files
► .fastq
► Reference genome (.fasta, .fa, .fai)
► GFF/GTF, GFF3

Reads alignment
► BWA / BOWTIE (Burrows-Wheeler transform)
► de novo: NEWBLER (454)…

Output files
► .sam / .bam

DNA-Seq Analysis Pipeline and associated files
STEP 02: READS MAPPING/ALIGNMENT (SAM/BAM; BWA/BOWTIE…/SAMTOOLS)

GFF3
► 1 line per feature
► tab-separated columns
► 9 columns + optional additional information

Columns: SEQ-ID, SOURCE, TYPE, START, END, SCORE, STRAND, PHASE, ATTRIBUTES

http://[Link]/

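The 9-column layout can be explored with awk. This is a sketch under stated assumptions: the three feature lines below are invented, and "example" is a placeholder for the source column.

```shell
# Three invented GFF3 feature lines (9 tab-separated columns each).
gff=$(mktemp)
printf 'chr1\texample\tgene\t1000\t5000\t.\t+\t.\tID=gene1;Name=GENE1\n' > "$gff"
printf 'chr1\texample\texon\t1000\t1200\t.\t+\t.\tID=exon1;Parent=gene1\n' >> "$gff"
printf 'chr2\texample\tgene\t300\t900\t.\t-\t.\tID=gene2;Name=GENE2\n' >> "$gff"

# Check that every non-comment line really has 9 columns.
bad_lines=$(awk -F'\t' '!/^#/ && NF != 9 { c++ } END { print c + 0 }' "$gff")
echo "lines without 9 columns: $bad_lines"

# Column 3 is the feature type: keep genes, print seq-id, start, end, strand.
genes=$(awk -F'\t' -v OFS='\t' '$3 == "gene" { print $1, $4, $5, $7 }' "$gff")
echo "$genes"

n_genes=$(printf '%s\n' "$genes" | wc -l)
echo "genes: $n_genes"
rm -f "$gff"
```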
DNA-Seq Analysis Pipeline and associated files
STEP 02: READS ALIGNMENT (SAM/BAM; BWA/BOWTIE…/SAMTOOLS)

SAM / BAM
► Header lines + alignment section
► tab-separated columns
► 11 columns
► Samtools (view, sort, index…)
► Removal of duplicated reads, which affect variant calling

Columns: QNAME, FLAG, RNAME, POS, MAPQ, CIGAR, RNEXT, PNEXT, TLEN, SEQ, QUAL

http://[Link]/[Link]
http://[Link]/wiki/SAM

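As a rough illustration of the header/alignment split and of the FLAG column, the sketch below builds an invented three-read SAM file and counts mapped and unmapped reads with basic commands. It relies on bit 4 of FLAG meaning "read unmapped"; the read names and sequences are made up.

```shell
# Invented SAM file: 2 header lines (@) followed by 3 alignment lines of 11 columns.
sam=$(mktemp)
{
  printf '@HD\tVN:1.6\tSO:coordinate\n'
  printf '@SQ\tSN:chr1\tLN:1000\n'
  printf 'r1\t0\tchr1\t100\t60\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII\n'
  printf 'r2\t16\tchr1\t200\t30\t10M\t*\t0\t0\tTTGCATTGCA\tIIIIIIIIII\n'
  printf 'r3\t4\t*\t0\t0\t*\t*\t0\t0\tGGGGGCCCCC\tIIIIIIIIII\n'
} > "$sam"

# Header lines start with @; everything else is an alignment record.
n_header=$(grep -c '^@' "$sam")

# FLAG is column 2. Bit 4 set means the read is unmapped; integer
# arithmetic is used because portable awk has no bitwise operators.
n_unmapped=$(grep -v '^@' "$sam" |
  awk -F'\t' 'int($2 / 4) % 2 == 1 { c++ } END { print c + 0 }')
n_mapped=$(grep -v '^@' "$sam" |
  awk -F'\t' 'int($2 / 4) % 2 == 0 { c++ } END { print c + 0 }')

echo "header lines: $n_header, mapped: $n_mapped, unmapped: $n_unmapped"
rm -f "$sam"
```

On a real BAM file the same kind of summary is what samtools flagstat reports; the text approach above only works on the uncompressed SAM form.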
DNA-Seq Analysis Pipeline and associated files
STEP 03: GENOMIC COVERAGE (BAI/BED; SAMTOOLS/BEDTOOLS)

Coverage
► IGV = Integrative Genomics Viewer
(https://[Link])[Link]/igv/)

► Other useful file formats
BED = Browser Extensible Data
(http://[Link]/FAQ/FAQformat)

► Coverage is associated with 3 different concepts:
    - Fold coverage (a number followed by X)
    - Breadth of coverage
    - Depth of coverage

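The three notions can be made concrete on a toy example. The sketch below uses the usual definitions: expected fold coverage is number of reads × read length / genome size, depth is the number of reads covering a position, and breadth is the fraction of positions covered at least once. The input is an invented 3-column table (chromosome, position, depth), the layout that samtools depth prints; the numbers are made up.

```shell
# Invented per-base depth table for a 10 bp "genome": chrom, position, depth.
depth=$(mktemp)
pos=1
for d in 0 0 3 5 5 4 2 0 1 0; do
  printf 'chr1\t%d\t%d\n' "$pos" "$d" >> "$depth"
  pos=$((pos + 1))
done

# Expected fold coverage: 4 reads of 5 bases on a 10-base genome = 2X.
expected_fold=$(LC_ALL=C awk -v n=4 -v l=5 -v g=10 'BEGIN { printf "%.2f", n * l / g }')

# Mean depth of coverage: average of column 3 over all positions.
mean_depth=$(LC_ALL=C awk -F'\t' '{ sum += $3 } END { printf "%.2f", sum / NR }' "$depth")

# Breadth of coverage: fraction of positions with depth >= 1.
breadth=$(LC_ALL=C awk -F'\t' '$3 >= 1 { cov++ } END { printf "%.2f", cov / NR }' "$depth")

echo "expected fold: ${expected_fold}X, mean depth: ${mean_depth}X, breadth: $breadth"
rm -f "$depth"
```

Here the mean depth matches the expected 2X, yet only 60% of the positions are covered: fold coverage alone says nothing about how evenly the reads are spread.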
DNA-Seq Analysis Pipeline and associated files
STEP 04: SV/CNV/VARIANT CALLING (VCF)

SV/CNV/Variant Calling
► Structural variations (SV)
Deletions, duplications, copy-number variations, insertions, inversions, translocations…
► Copy number variations (CNV)
Deletions or duplications of genes or relatively large regions of the genome that affect chromosomes
► Variant calling (SNPs and small InDels)
    - SNPs: affect only 1 nucleotide
    - InDels: affect 1 or several nucleotides

[Figure: properly mapped pair vs. split mapping, or different distance or orientation for the pair]

DNA-Seq Analysis Pipeline and associated files
STEP 04: SV/CNV/VARIANT CALLING (VCF)

SV/CNV/Variant Calling
[Figure: variant size classes compared with the reference: sub-chromosomal to microscopic; 1 kb to sub-microscopic; 2 bp to 1 kb; single nucleotide (definitions adapted from Scherer et al., 2007)]

DNA-Seq Analysis Pipeline and associated files
STEP 04: SV/CNV/VARIANT CALLING (VCF)

SV/CNV/Variant Calling
► VCF (Variant Call Format)
    - Text file format storing SNP and InDel information
    - http://[Link]/node/101
    - Obtaining variants listed in this format is a multistep procedure involving different, but standardized, tools
    - Header (meta-information) lines + data lines
    - 8 required fields, tab-delimited

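A small sketch of this structure, assuming an invented four-variant file (positions, alleles and DP values are made up). It separates header lines from data lines and uses the REF and ALT columns to tell SNPs from InDels; multi-allelic records are ignored for simplicity.

```shell
# Invented VCF: meta-information (##), column header (#CHROM) and 4 data lines
# with the 8 required fields CHROM POS ID REF ALT QUAL FILTER INFO.
vcf=$(mktemp)
{
  printf '##fileformat=VCFv4.2\n'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
  printf 'chr1\t100\t.\tA\tG\t50\tPASS\tDP=30\n'
  printf 'chr1\t250\t.\tCT\tC\t45\tPASS\tDP=120\n'
  printf 'chr2\t75\t.\tG\tGA\t20\tPASS\tDP=12\n'
  printf 'chr2\t400\t.\tT\tC\t99\tPASS\tDP=45\n'
} > "$vcf"

# Header lines start with #; data lines do not.
n_header=$(grep -c '^#' "$vcf")
n_variants=$(grep -vc '^#' "$vcf")

# SNP: REF (column 4) and ALT (column 5) are both one base long.
n_snps=$(grep -v '^#' "$vcf" |
  awk -F'\t' 'length($4) == 1 && length($5) == 1 { c++ } END { print c + 0 }')
# InDel: REF and ALT differ in length.
n_indels=$(grep -v '^#' "$vcf" |
  awk -F'\t' 'length($4) != length($5) { c++ } END { print c + 0 }')

echo "header: $n_header, variants: $n_variants, SNPs: $n_snps, InDels: $n_indels"
rm -f "$vcf"
```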
DNA-Seq Analysis Pipeline and associated files
STEP 05: BIOLOGICAL INTERPRETATION (CSV/XLS/TXT)

From variant annotation to data mining
► web-based
► available packages

Aim
► Functional impact of variants (synonymous or not…)
► Gene Ontology annotation (BP, MF, CC)
► Pathway / network information
► Predictions of pathogenicity / severity

NB: DAVID (Database for Annotation, Visualization and Integrated Discovery) to switch between databases
https://[Link]/

DNA-Seq Analysis: Take-home messages

Biological question
► Needs to be clearly defined first, so that the design of the experiment, the library construction and the analysis pipeline can be prepared accordingly

Platform
► Each one has its own specificities, which need to be understood before choosing one
► Different technologies: short reads (Illumina…) vs long reads (PacBio…)
► Rapidly evolving, several limitations (PCR bias for GC-rich regions…)
► Combination of different platforms possible (de novo…)

Input / Output files
► Companion index files needed (.fa & .fai / .bam & .bai / .vcf & .[Link]…)
► Text-based (FASTA, FASTQ, SAM, GTF/GFF, BED, VCF, WIG) or binary (BAM, BCF, SFF)
► 1-based (GFF/GTF, SAM/BAM, WIG) or 0-based (BED)

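The 0-based versus 1-based distinction matters as soon as coordinates move between formats. A minimal sketch, assuming two invented BED intervals: BED starts are 0-based and the end is exclusive, so a 1-based, inclusive interval (as in GFF) is obtained by adding 1 to the start and leaving the end unchanged.

```shell
# Two invented BED intervals: chrom, start (0-based), end (exclusive), name.
bed=$(mktemp)
printf 'chr1\t0\t100\tfeatureA\n' > "$bed"
printf 'chr1\t999\t1500\tfeatureB\n' >> "$bed"

# Convert to 1-based inclusive coordinates and append the feature length,
# which in BED is simply end - start.
one_based=$(awk -F'\t' -v OFS='\t' '{ print $1, $2 + 1, $3, $4, $3 - $2 }' "$bed")
echo "$one_based"
rm -f "$bed"
```

The first interval becomes 1 to 100 (100 bases): the first 100 bases of chr1 are written 0-100 in BED and 1-100 in a 1-based format.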
DNA-Seq Analysis: The command-line environment

Understand how these files are generated in practice (no demo)

Give you an idea about how to deal with these files easily once they are generated and given to you by your sequencing platform.

Make you work a bit on one specific file (a VCF file), using the command-line interface (assignment).

DNA-Seq Analysis: The command-line environment

Reminder of the command-line syntax and some basic commands to manipulate files
→ the same syntax is used for NGS analysis… but… using other specific commands (algorithms, tools…)

Examples of command lines to generate or retrieve data from NGS data files using these specific commands

Basic Linux command lines can be useful to parse files: how to interrogate a VCF file using basic Linux command lines?

DNA-Seq Analysis: The command-line environment

The NGS datasets: reminder
► NGS outputs large files
► Output files contain various kinds of information that you need to parse
    - fastq: quality associated with each read…
    - sam/bam: quality of the mapping…
    - vcf: variants, annotation of the effects that these variants can have…

The command-line environment
► UNIX operating system: able to deal with multi-task & multi-user needs
► i.e. can even handle multiple files at a time (useful if multiple samples)
► Brings flexibility to handle large files
► Makes it easy to parse the content of a (big) file

DNA-Seq Analysis: The command-line environment

The basic commands you have seen are useful for NGS

→ Remember that many kinds of files are generated through the NGS analysis pipeline

► So at this stage you should know the file system basics that allow you to work with many files and organize them

DNA-Seq Analysis: The command-line environment

What you know about the command-line environment should allow you to:

► Create directories and move through the file system
► At this stage you should be able to easily run queries to parse large files
    - search for a particular pattern
    - select specific information columns
→ Easily interrogate the large amount of information in the output files

DNA-Seq Analysis: The command-line environment
Viewing & Manipulating

cat
• view the content of a short file
$ cat file1

more
• view the content of a long file, step by step
$ more file1

less
• view the content of a long file, by portions
$ less file1

DNA-Seq Analysis: The command-line environment
Viewing & Manipulating

head
• view the first lines of a (long) file
$ head file1
NB: By default (without options), displays the first 10 lines

tail
• view the last lines of a (long) file
$ tail file1
NB: By default (without options), displays the last 10 lines

DNA-Seq Analysis: The command-line environment
Viewing & Manipulating

cut
• Extract specific fields from a file
$ cut -d' ' -f1,2 file1
    -d' '    field separator (here a space)
    -f1,2    field specifier (here fields 1 and 2)

DNA-Seq Analysis: The command-line environment
Viewing & Manipulating

grep
• search for the occurrence of a specific pattern in a file (regular expression using the wildcards…)

Careful: grep displays the whole line containing the specific pattern XXX
$ grep XXX file1

Can also be used to display all lines that DO NOT contain a specific pattern
$ grep -v XXX file1

DNA-Seq Analysis: The command-line environment
Viewing & Manipulating

wc
• Prints different kinds of counts for a file
$ wc -l file1
Prints line counts

DNA-Seq Analysis: The command-line environment
Redirecting characters

|
The "|" character combines several commands by sending the result of one command to another
$ grep XXX file1 | wc -l
Prints line counts instead of displaying the result on the screen

DNA-Seq Analysis: The command-line environment
Redirecting characters

>
The ">" character redirects the result of a command to a new file
$ grep XXX file1 > file2
Writes the matching lines to file2 instead of displaying the result on the screen

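The viewing commands, the pipe and the redirection can be tried together on a throwaway file. A minimal sketch, assuming an invented tab-separated file1 created in a temporary directory:

```shell
# Create a small tab-separated file1 in a temporary working directory.
workdir=$(mktemp -d)
printf 'chr1\t100\tA\tG\n' > "$workdir/file1"
printf 'chr2\t200\tC\tT\n' >> "$workdir/file1"
printf 'chr1\t300\tG\tA\n' >> "$workdir/file1"

# Pipe: count the lines that mention chr1 instead of printing them.
n_chr1=$(grep 'chr1' "$workdir/file1" | wc -l)
echo "chr1 lines: $n_chr1"

# Redirect: save the lines that do NOT mention chr1 to file2
# (file2 is overwritten if it already exists).
grep -v 'chr1' "$workdir/file1" > "$workdir/file2"
file2_content=$(cat "$workdir/file2")

# Pipe + cut: columns 1 and 2 of the chr1 lines (tab is cut's default separator).
chr1_positions=$(grep 'chr1' "$workdir/file1" | cut -f1,2)
echo "$chr1_positions"
rm -rf "$workdir"
```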
DNA-Seq Analysis: The command-line environment

Command-line syntax

$ command -options arguments
    - command: no capital letter
    - command, options and arguments are separated by a space
    - arguments: path_to_Directory/file

Useful commands:
$ man NameOfTheCommand
$ pwd

DNA-Seq Analysis: The command-line environment

FASTQ       FASTQC/…
SAM/BAM     BWA/BOWTIE…/SAMtools
VCF         VCFtools/jannovar

DNA-Seq Analysis: The command-line environment
FASTQ: FASTQC/…

To run FastQC (provided it is installed):
- specify the files you want to process on the command line
- FastQC will generate an HTML report for each file (embedded graphs)

fastqc seqfile1 seqfile2 .. seqfileN

fastqc --help
fastqc [-o output dir] [--(no)extract]

-o --outdir    Create all output files in the specified output directory (dir must already exist).
--extract      Uncompress the zipped output file in the same dir after being created.
--noextract    Do not uncompress the output file after creating it.

DNA-Seq Analysis: The command-line environment
FASTQ: FASTQC/…

Trimming

→ Trimming low-quality bases
Low-quality base calls from the sequencer can cause an otherwise mappable sequence not to align → different open source tools can trim off 3' bases → produce a FASTQ file of the trimmed reads to use as input to the alignment program.

e.g.: FASTX-Toolkit

gunzip -c Sample_R1.[Link] | fastx_trimmer -l 50 -Q 33 > [Link]

-l 50 = trim down to 50 bases (last base is 50)
-Q 33 = option that specifies how base qualities on the 4th line of each fastq entry are encoded
[Link]

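To see what fixed-length trimming does to a record, here is a rough awk sketch. It is not how fastx_trimmer is implemented, only the same idea applied to an invented two-read file: keep the first N characters of both the sequence line and the quality line, so the two stay the same length.

```shell
# Invented two-read FASTQ file; read1 has lower qualities (#) at its 3' end.
fq=$(mktemp)
printf '%s\n' '@read1' 'ACGTACGTAC' '+' 'IIIIIHHH##' \
              '@read2' 'TTGC' '+' 'IIII' > "$fq"

# Keep at most max_len bases. In each 4-line block the even lines are the
# sequence (line 2) and the qualities (line 4); the odd lines are left as is.
max_len=6
trimmed=$(awk -v n="$max_len" '
  NR % 2 == 0 { print substr($0, 1, n); next }
  { print }' "$fq")
echo "$trimmed"
rm -f "$fq"
```

Reads shorter than the limit, such as read2 here, pass through unchanged.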
DNA-Seq Analysis: The command-line environment
FASTQ: FASTQC/…

Trimming

→ Trimming adapters
A 3' adapter contamination can cause the insert sequence not to align (adapter sequence ≠ bases at the 3' end of the reference genome sequence).
Unlike general fixed-length trimming, adapter trimming removes different numbers of 3' bases depending on where the adapter sequence is found.

e.g.: cutadapt
cutadapt -a AAAT [Link] -o test_new.fastq
cutadapt -m 22 -O 10 -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC
-m 22 = discard any sequence that is shorter than 22 bases after trimming
-O 10 = do not trim 3' adapter sequences unless at least the first 10 bases of the adapter are seen at the 3' end of the read
[Link]

DNA-Seq Analysis: The command-line environment
SAM/BAM: BWA/BOWTIE…/SAMtools

Mapped and unmapped reads are imported and stored in SAM/BAM format

samtools: a suite of useful commands to visualize or get information from .sam/.bam files
# SAM to BAM conversion (-b writes BAM output)
samtools view -b [Link] > [Link]

# sorting and indexing alignments
samtools sort [Link] -o [Link]
samtools index [Link] [Link]

# all reads mapping to a certain portion of chr1, or all of chr1 written to another BAM
samtools index [Link]
samtools view [Link] chr1:200000-500000
samtools view -b [Link] chr1 > test_chr1.bam
[Link]

DNA-Seq Analysis: The command-line environment
SAM/BAM: BWA/BOWTIE…/SAMtools

Mapped and unmapped reads are imported and stored in SAM/BAM format

samtools
The flagstat command provides simple statistics on a BAM file

# summary statistics for a BAM file
samtools flagstat [Link]

DNA-Seq Analysis: The command-line environment
VCF: VCFtools/SNPEff

VCF files contain information about variants
VCF can be used as input and output file for many tools

Variant calling can be done using many available tools and methods (GATK, samtools…) and the output used by many others (VCFtools, VCFminer, snpeff…)

When a mapped read shows a mismatch from the reference genome
→ is the mismatch due to a real SNP?

e.g. How does samtools detect SNPs?
Samtools computes statistics to incorporate different types of information such as:
- number of different reads that share a mismatch from the reference
- the sequence quality data
- the expected sequencing error rates
http://[Link]/[Link]

DNA-Seq Analysis: The command-line environment
VCF: VCFtools/jannovar

VCF files contain information about variants
VCF can be used as input and output file for many tools

[Link] & bcftools: 2 steps are required using these commands:

[Link]
- collects summary information in the input BAMs
- computes the likelihood of the data given each possible genotype
- stores the likelihoods in the BCF format (see below). It does not call variants at this stage.
2. Bcftools
- applies the prior and does the actual calling
- can also concatenate BCF files, index BCFs for fast random access and convert BCF to VCF.

http://[Link]/[Link]

DNA-Seq Analysis: The command-line environment
VCF: VCFtools/jannovar

Suppose we have:
- a reference sequence in [Link], indexed by samtools faidx
- position-sorted alignment files [Link] and [Link]

→ you can call SNPs and short INDELs using:
#[Link] a BCF file (binary data format: information about sequence variants (SNPs…))
samtools mpileup -uD -f [Link] [Link] [Link] | bcftools view -bvcg - > fi[Link]

samtools mpileup options:
-u  output into an uncompressed BCF file
-D  keep read depth for each sample
-f  next argument is the reference genome file
bcftools view options:
-b  output to BCF format
-v  only output potential variant sites (i.e., exclude monomorphic ones)
-c  do SNP calling
-g  call genotypes for each sample in addition to just calling SNPs

http://[Link]/[Link]

DNA-Seq Analysis: The command-line environment
VCF: VCFtools/jannovar

Suppose we have:
- a reference sequence in [Link], indexed by samtools faidx
- position-sorted alignment files [Link] and [Link]

→ you can call SNPs and short INDELs using:

#[Link] BCF into VCF (flat text file rather than a binary = easier to view)
bcftools view fi[Link] | vcfu)[Link] varFilter -D100 > file.fi[Link]

-D100  filters out SNPs that had read depth higher than 100

http://[Link]/[Link]

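The effect of a maximum-depth cut can be reproduced on a text VCF with awk. This sketch only mimics the depth threshold (it applies none of the other filters the real tool may use) and assumes the read depth is stored as a DP= entry in the INFO column of an invented file.

```shell
# Invented VCF with read depth stored as DP=... in the INFO column (column 8).
vcf=$(mktemp)
{
  printf '##fileformat=VCFv4.2\n'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
  printf 'chr1\t100\t.\tA\tG\t50\tPASS\tDP=30\n'
  printf 'chr1\t250\t.\tC\tT\t45\tPASS\tDP=120\n'
  printf 'chr2\t75\t.\tG\tA\t20\tPASS\tDP=12;AF=0.5\n'
  printf 'chr2\t400\t.\tT\tC\t99\tPASS\tAF=0.5;DP=45\n'
} > "$vcf"

# Keep header lines; drop data lines whose DP is above max_dp.
# Lines without a DP entry are kept.
kept=$(awk -F'\t' -v max_dp=100 '
  /^#/ { print; next }
  {
    dp = ""
    n = split($8, info, ";")
    for (i = 1; i <= n; i++)
      if (info[i] ~ /^DP=/) dp = substr(info[i], 4)
    if (dp == "" || dp + 0 <= max_dp) print
  }' "$vcf")

n_kept=$(printf '%s\n' "$kept" | grep -vc '^#')
echo "variants kept: $n_kept"
rm -f "$vcf"
```

Unusually high depth often points to reads piling up in repeated regions, which is why such sites are filtered out.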
DNA-Seq Analysis: The command-line environment
VCF: VCFtools/jannovar

Many ways to annotate the VCF file (vcftools, jannovar…)

e.g. jannovar
Jannovar identifies all transcripts affected by a given variant, and provides HGVS-compliant annotations for different types of variants.

# Download the RefSeq transcript database for the release hg19/GRCh37.
$ java -jar [Link] download -d hg19/refseq
# Annotate the [Link] file
$ java -jar [Link] annotate-vcf \
    -d hg19_refseq.ser -i [Link] -o [Link]

http://[Link]/jannovar/

DNA-Seq Analysis: The command-line environment

► All these commands can be run as part of an analysis pipeline

► All files generated can be parsed using specific tools (samtools…)

► Text-based files generated can be parsed using basic Linux commands

► At this stage you should be able to easily run queries to parse large files
    - search for a particular pattern
    - select specific information columns
→ Assignment!

