The document discusses next-generation sequencing (NGS) technologies and their methodologies, highlighting advancements from first to third-generation sequencing. It details various sequencing platforms, their mechanisms, and the evolution of assembly algorithms for reconstructing genomes from short reads. Additionally, it addresses challenges in genome assembly and provides an overview of algorithmic approaches to improve accuracy and efficiency in the assembly process.
Next Generation Sequencing and Sequence Assembly

Methodologies and Algorithms


Ali Masoudi-Nejad, Zahra Narimani, Nazanin Hosseinkhan

Ali Masoudi-Nejad
Laboratory of Systems Biology
and Bioinformatics (LBB)
Institute of Biochemistry and Biophysics
University of Tehran
Tehran
Iran

Nazanin Hosseinkhan
Laboratory of Systems Biology
and Bioinformatics (LBB)
Institute of Biochemistry and Biophysics
University of Tehran
Tehran
Iran

Zahra Narimani
Laboratory of Systems Biology
and Bioinformatics (LBB)
Institute of Biochemistry and Biophysics
University of Tehran
Tehran
Iran

ISSN 2193-4746
ISSN 2193-4754 (electronic)
ISBN 978-1-4614-7725-9
ISBN 978-1-4614-7726-6 (eBook)
DOI 10.1007/978-1-4614-7726-6
Springer New York Heidelberg Dordrecht London

Library of Congress Control Number: 2013938267

© The Author(s) 2013


This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of
the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations,
recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or
information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar
methodology now known or hereafter developed. Exempted from this legal reservation are brief
excerpts in connection with reviews or scholarly analysis or material supplied specifically for the
purpose of being entered and executed on a computer system, for exclusive use by the purchaser of the
work. Duplication of this publication or parts thereof is permitted only under the provisions of
the Copyright Law of the Publisher’s location, in its current version, and permission for use must
always be obtained from Springer. Permissions for use may be obtained through RightsLink at the
Copyright Clearance Center. Violations are liable to prosecution under the respective Copyright Law.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this
publication does not imply, even in the absence of a specific statement, that such names are exempt
from the relevant protective laws and regulations and therefore free for general use.
While the advice and information in this book are believed to be true and accurate at the date of
publication, neither the authors nor the editors nor the publisher can accept any legal responsibility for
any errors or omissions that may be made. The publisher makes no warranty, express or implied, with
respect to the material contained herein.

Printed on acid-free paper

Springer is part of Springer Science+Business Media (www.springer.com)


Dedicated to our loving family
Preface

DNA sequencing is a fast-moving science with technologies and platforms being
updated at breathtaking speed. The hallmark of next generation sequencing (NGS)
has been a massive increase in throughput and a decrease in price compared with
previous technologies. The first next-generation DNA sequencing machine was
introduced to the market by 454 Life Sciences (Basel, Switzerland) in 2005. The
technology is based on a large-scale parallel pyrosequencing system, which relies
on fixing nebulized and adapter-ligated DNA fragments to small DNA-capture
beads in a water-in-oil emulsion. Illumina’s (CA, USA) Genome Analyzer
was released in 2007 and marked a true revolution for genome sequencing in
which short reads became significant to genomic applications. The technology is
based on reversible dye terminators. DNA molecules are first attached to primers
on a slide and amplified so that local clonal colonies are formed. Life Technologies’
(CA, USA) SOLiD™ technology employs sequencing by ligation. In this
technology, a pool of all possible oligonucleotides of a fixed length is labeled
according to the sequenced position. Oligonucleotides are annealed and ligated;
the preferential ligation by DNA ligase for matching sequences results in a signal
that is informative of the nucleotide at that position.
So-called ‘third-generation’ technologies directly sequence individual DNA
molecules rather than relying on any amplification prior to sequencing. The
recently released PacBio system can produce 35–45 Mb of data per cell with an
average read length of 1,500 bp. The Ion Torrent Personal Genome Machine
(PGM) is another third-generation platform that uses standard sequencing chemistry,
but with a novel, semiconductor-based detection system. This technology
already claims read lengths of approximately 200 bp with high accuracy, and the
latest PGM 318 chip can produce 1.0 Gb of data in a 2-h run. When the implications
of NGS technology became apparent, several assemblers were designed to
deal with the new problem, i.e., assembling short NGS reads in order to
reconstruct the longer original sequences. Assembly can be performed either
against an available reference genome (mapping) or without a reference
genome (de novo assembly). De novo assembly algorithms, discussed in
more detail in this book, can be classified into three main categories: greedy
algorithms, Overlap-Layout-Consensus (OLC) methods, and De Bruijn graph
approaches. The Euler assembler was the first to employ de Bruijn graphs for
whole genome shotgun (WGS) assembly, and proved capable of assembling
bacterial genomes. Velvet and ALLPATHS improved assembly in terms of speed,
contig and scaffold length, and avoidance of misassembly. ABySS followed these
innovations in de Bruijn methods, but also introduced a distributed representation
of the graph, allowing Message Passing Interface (MPI) parallelization. The
CABOG and variant MSR-CA pipelines are updates of the Celera overlap-based
assembler designed for a combination of read types, which showed some success
with short-read data for genomes in the 100 Mb range. The String Graph
Assembler (SGA) is the first to make assembly of mammalian-sized genomes
practical using the string graph approach. The current tradeoff
between accuracy and continuity suggests avenues for future improvements in
assembly. There is room for further improvements at the scaffolding stage, where, as
has happened at the assembly stage, we witness a move from naïve and greedy
algorithms to more subtle graph-based techniques.
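To make the de Bruijn graph idea mentioned above concrete, here is a minimal Python sketch. It is not the method of Euler, Velvet, ABySS, or any other assembler named here. It assumes error-free toy reads and no repeats, and the function names and read set are hypothetical, chosen only for illustration.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Build a de Bruijn graph: nodes are (k-1)-mers, each k-mer adds one edge."""
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])
    return graph


def walk_unbranched_path(graph, start):
    """Extend a contig from `start` while every node has one distinct successor."""
    contig = start
    node = start
    visited = {start}
    while True:
        successors = set(graph.get(node, []))
        if len(successors) != 1:
            break  # dead end or branch: stop extending
        node = successors.pop()
        if node in visited:
            break  # cycle (repeat): stop rather than loop forever
        visited.add(node)
        contig += node[-1]
    return contig


# Hypothetical, error-free toy reads sampled from one short sequence.
reads = ["ATGGCGT", "GGCGTGC", "GTGCAAT"]
graph = de_bruijn_graph(reads, k=4)
contig = walk_unbranched_path(graph, "ATG")
print(contig)
```

Real assemblers must also handle sequencing errors, repeats, and both strands, which this sketch deliberately ignores.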
In this book, we briefly introduce the history of first, second, and third generation
sequencing technologies and describe the drawbacks of the older techniques,
which are no longer suitable because of their cost and because they could not be
fully automated. In Sect. 2, the major NGS methods, namely
Roche/454 FLX, Illumina/Solexa Genome Analyzer, and Applied Biosystems
SOLiD System, among others, are described in detail. The latest and
most prominent NGS technologies, nanopore DNA sequencing and Pacific
Biosciences single molecule real time (SMRT) DNA sequencing, which do not need an
amplification step, are then described. The last subsections of this section are devoted to
sequencing costs, output file formats, a comparison of
methods and their drawbacks, and finally applications of NGS technologies. The
last two sections, i.e., Sects. 3 and 4, provide an overview of the algorithmic
view of the assembly problem. Our main focus in these two sections is on de
novo assembly algorithms for NGS reads. In Sect. 3, we define the
assembly problem in general terms and describe the challenges involved in the assembly process,
including errors propagated from the sequencing process as well as computational
challenges. Appropriate use of paired-end read data, which helps to overcome the
challenges posed by short read lengths, and preprocessing, which helps to
eliminate some issues caused by inaccurate data, are the next topics discussed
in this section. Even with all these techniques, there will still be
errors in the assembly, and assembly algorithms need to be validated in
a standard way; these are the final topics discussed in Sect. 3.
Finally, in Sect. 4, a precise view of the assembly algorithm is given: how the
problem can be mapped to a graph and how different kinds of graphs are treated in
finding the solution, which is the final assembled genome. For each of the
assembly approaches, several example algorithms are then described in detail and,
finally, a comparison of these methods is provided.
Contents

1 Next-Generation Sequencing Methodologies
1.1 Introduction
1.1.1 A Brief History of the Discovery of DNA Structure and Function
1.2 Advent of Sequencing Technologies
1.2.1 First-Generation DNA Sequencers
1.3 Some Drawbacks of the Sanger Technique
1.3.1 Short Size Fragments
1.3.2 Needs for Amplification and Fragment Assembly Steps
1.3.3 Problems with Parallelization
1.3.4 Cost
1.3.5 Need for Complete Automation
References

2 Emergence of Next-Generation Sequencing
2.1 454 Pyrosequencing
2.2 Illumina (Solexa) Genome Analyzer
2.3 Applied Biosystems SOLiD Sequencing
2.4 Ion Semiconductor (Ion Torrent Sequencing)
2.5 Polonator Technology
2.6 Heliscope (Single Molecule Sequencing)
2.7 Latest Developments in Next-Generation Sequencing Methods
2.7.1 Nanopore Sequencing
2.7.2 Single Molecule Real Time DNA Sequencing
2.8 Comparison of Available Next-Generation Sequencing Techniques
2.9 DNA Sequencing Costs
2.10 Sequencing Status
2.11 Shortcoming of NGS Techniques: Short-Reads and Reads Accuracy Issues
2.12 NGS File Formats
2.13 NGS Applications
2.14 Summary
References

3 The Assembly of Sequencing Data
3.1 What is De Novo Genome Sequence Assembly?
3.2 Challenges of Genome Assembly
3.3 Use of Paired-End Reads in the Assembly
3.4 Data Preprocessing Methods and Sequence Read Correction Methods
3.5 Assembly Errors
3.6 Evaluation of Assembly Methods
3.7 Summary
References

4 De Novo Assembly Algorithms
4.1 Mapping Assembly to a Graph Problem
4.1.1 The Overlap Graph Approach
4.1.2 De Bruijn Graph Approach
4.2 Classification of De Novo Assembly Algorithms
4.2.1 Greedy Algorithms
4.2.2 Overlap Layout Consensus (OLC) Algorithms
4.2.3 De Bruijn Graph-Based Algorithms
4.3 Comparison of Algorithms
4.4 Summary
References

Index

Chapter 1
Next-Generation Sequencing
Methodologies

1.1 Introduction

1.1.1 A Brief History of the Discovery of DNA Structure and Function

Although many people believe that the American biologist James Watson and
English physicist Francis Crick were the first to discover DNA in the 1950s, DNA
was actually discovered by the Swiss chemist Friedrich Miescher in the late 1860s
during his attempts to isolate the protein components of leukocytes. When he
isolated a substance that, unlike proteins, was resistant to proteolysis and had
chemical properties different from those of proteins, including a much higher phosphorus
content, he realized that he had discovered a new substance [1]. He called this new
substance “nuclein.”
Miescher’s finding was not considered particularly important until the twentieth
century, when the chemical nature of nuclein was studied by the Russian biochemist
Phoebus Levene. He was the first to discover: (1) the order of the three major
components of a single nucleotide (phosphate-sugar-base) (Fig. 1.1); (2) the
carbohydrate component of RNA (ribose) and DNA (deoxyribose); and (3) the
way RNA and DNA molecules are put together. In 1919 Levene proposed that
nucleic acids were composed of a series of nucleotides and that each nucleotide
was in turn composed of one of four nitrogen-containing bases, a sugar
molecule, and a phosphate group.
Erwin Chargaff, an Austrian biochemist, continued this work and uncovered
additional details about the structure
of DNA. He reached two major conclusions [3]: first, the nucleotide
composition of DNA varies among species; second, the
amount of the base adenine (A) is usually similar to the amount of thymine (T),
and the amount of guanine (G) is usually similar to the amount of cytosine (C). The latter is
known as Chargaff’s rule (Fig. 1.2).
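Chargaff's rule can be checked with a short Python sketch. It assumes an idealized, perfectly base-paired duplex built from a made-up strand; the strand and the function name are hypothetical, not data from the book.

```python
from collections import Counter

# Watson-Crick pairing: A pairs with T, G pairs with C.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def duplex_base_counts(strand):
    """Count bases over both strands of an idealized, perfectly paired duplex."""
    partner = strand.translate(COMPLEMENT)
    return Counter(strand + partner)


counts = duplex_base_counts("ATGGCGTACCTGA")  # hypothetical example strand
purines = counts["A"] + counts["G"]
pyrimidines = counts["C"] + counts["T"]
print(counts["A"] == counts["T"], counts["G"] == counts["C"], purines == pyrimidines)
```

In this model every A on one strand is matched by a T on the other, so the equalities hold exactly. Measured base compositions of real DNA are only approximately equal, which is why the rule is stated as "usually similar."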

A. Masoudi-Nejad et al., Next Generation Sequencing and Sequence Assembly, SpringerBriefs in Systems Biology, DOI: 10.1007/978-1-4614-7726-6_1, © The Author(s) 2013

Fig. 1.1 Three components of each nucleotide: the nitrogenous base that can basically belong to
two categories (single ring: pyrimidines, or two-linked rings: purines), a pentose sugar (ribose in
RNA and deoxyribose in DNA), and a phosphate group [2]

Fig. 1.2 Chargaff’s rule: the total amount of purines is equal to the total amount of
pyrimidines [2]

Chargaff’s finding that A = T and C = G, along with some vital crystallography
results obtained by the English researchers Rosalind Franklin and Maurice
Wilkins, established a strong basis for the discovery of a three-dimensional,
double-helical model for the structure of DNA proposed by Watson and Crick
(Fig. 1.3).
Each chain of a double-helix DNA molecule is made up of nucleotides joined by
phosphodiester links. The two strands of a DNA molecule run in opposite
directions (they are antiparallel). The two different ends of a single strand are called 3′ and 5′, and the
direction of DNA synthesis is 5′→3′; this means that the free 3′ hydroxyl (OH)
group of the growing strand of DNA attacks the phosphate of the next nucleotide to
be added (Fig. 1.4). Pyrophosphate is released and the new nucleotide forms a
phosphodiester bond with the growing strand of DNA. The 3′ hydroxyl group of the newly added nucleotide is
then free to attack the next nucleotide to be added. This reaction is catalyzed by DNA
polymerases.
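The 5′→3′ direction of synthesis on an antiparallel template can be sketched in Python as follows. This is a string-level model only, with hypothetical names; it ignores primers, enzymes, and all chemistry.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def synthesize(template_5_to_3):
    """Return the new strand, written 5'->3', copied from a template written 5'->3'."""
    new_strand = ""
    # The polymerase reads the template from its 3' end toward its 5' end...
    for base in reversed(template_5_to_3):
        # ...and appends each complementary nucleotide to the free 3' end.
        new_strand += COMPLEMENT[base]
    return new_strand


template = "ATGC"  # hypothetical template, written 5'->3'
print(synthesize(template))
```

Because the strands are antiparallel, the product is the reverse complement of the template, and copying the product again recovers the original template sequence.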

Fig. 1.3 Double-helical structure of DNA. The chains of sugar-phosphate groups are linked
together by complementary bases [2]

Fig. 1.4 DNA synthesis direction. The 5′ end of the new nucleotide is linked to
the 3′-OH of the last nucleotide of the growing chain by DNA polymerase
action. During this reaction, a pyrophosphate group is released
[http://www.prism.gatech.edu/~gh19/b1510/dnarep.htm]

1.2 Advent of Sequencing Technologies

Knowing about the order (sequence) of nucleotides in DNA, the molecule in which
the genetic information of all organisms is stored, has revolutionized biology and
resulted in our better understanding of life’s secrets (BBSRC Review of Next-
Generation Sequencing—final version).
The first two DNA sequencing techniques, which are known as first-generation
DNA sequencers, were developed independently by Frederick Sanger (1977, University
of Cambridge) and by Allan Maxam and Walter Gilbert (1976–1977, Harvard
University). Sanger’s method, which earned him a Nobel Prize in
Chemistry in 1980, became popular, and in fact was the sole method for DNA
sequencing for three decades, because it was technically less complex and
used smaller amounts of toxic chemicals than the Maxam–Gilbert method,
which was based on the chemical modification of DNA and subsequent cleavage at
specific bases. In the Sanger sequencing method, which is also known as the “chain
termination” or “dideoxy” method, modified nucleotides (fluorescently
labeled dideoxynucleotides) are used in the reaction in addition to normal nucleotides;
this method was gradually improved and became automated (the first
automatic sequencing machine, AB370, was introduced in 1987 by Applied
Biosystems), and has therefore been the method of choice for large-scale
sequencing projects, e.g., whole-genome sequencing of various species, for about
30 years [4].

1.2.1 First-Generation DNA Sequencers

1.2.1.1 Sanger Sequencing Technology

Classical Sanger sequencing is a sequencing-by-synthesis
method: the sequencing reaction is performed in the presence of the single-stranded
DNA template, DNA primers, DNA polymerase, four normal DNA
nucleotides, and four fluorescently labeled modified nucleotides (ddATP, ddCTP,
ddGTP and ddTTP).
The DNA template is initially divided into four separate sequencing reactions
containing primers, polymerase and normal nucleotides. Each reaction also contains
a small amount of one of the four modified nucleotides (which lack the 3′-OH
group required for extension). The modified nucleotide is randomly incorporated into the
growing strands and terminates DNA elongation, which results in DNA fragments of
various lengths. The obtained DNA fragments are then separated by size through
high-resolution polyacrylamide gel electrophoresis (capillary electrophoresis), with
each of the four reactions run in one of four individual lanes (lanes A, C, G and T).
DNA bands that correspond to DNA fragments of differing lengths are then
visualized, using UV light or X-ray autoradiography, and the order of nucleotides
can be determined according to the relative positions of the DNA bands among the four
lanes (Fig. 1.5).
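The read-out logic of the four-lane gel can be sketched in Python as follows. The sketch is idealized: it assumes termination occurs at every possible position, ignores the primer length, and uses hypothetical function names. The example sequence is the one shown in Fig. 1.5.

```python
def chain_termination_fragments(new_strand):
    """For each ddNTP lane, list the lengths of the fragments ending in that base."""
    lanes = {base: [] for base in "ACGT"}
    for length, base in enumerate(new_strand, start=1):
        lanes[base].append(length)  # a fragment of this length ends in `base`
    return lanes


def read_gel(lanes):
    """Read bands from the smallest fragment to the largest across all four lanes."""
    bands = sorted(
        (length, base) for base, lengths in lanes.items() for length in lengths
    )
    return "".join(base for _, base in bands)


synthesized = "TCGAAGACGTATC"  # sequence shown in Fig. 1.5
lanes = chain_termination_fragments(synthesized)
print(lanes["A"])       # fragment lengths seen in the ddATP lane
print(read_gel(lanes))  # sequence recovered from band order
```

Each fragment length appears in exactly one lane, so sorting all bands by size and noting the lane of each band recovers the synthesized strand.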

1.2.1.2 Maxam-Gilbert Chemical Degradation DNA Sequencing Technique

The Maxam-Gilbert technique relies on the cleaving of nucleotides by chemicals
and is most efficient with small nucleotide polymers (Fig. 1.6). Chemical treatment
generates breaks at a small proportion of one or two of the four nucleotide
bases in each of four reactions (G, A + G, C, C + T). Due to the advancements in
chain termination methodology, the Maxam-Gilbert method has become
obsolete. It is less practical to perform, and it is also
considered unsafe because of the extensive use of toxic chemicals.
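Since two of the four reactions (A + G and C + T) cleave at two bases each, a base is called by comparing lanes. The Python sketch below shows that logic under simplifying assumptions: band positions are reduced to the index of the cleaved base, cleavage is assumed to occur at every eligible position, and the function names and test sequence are hypothetical.

```python
def cleavage_lanes(sequence):
    """Positions cleaved in each of the four reactions (G, A+G, C, C+T)."""
    lanes = {"G": set(), "A+G": set(), "C": set(), "C+T": set()}
    for position, base in enumerate(sequence, start=1):
        if base in "AG":
            lanes["A+G"].add(position)
        if base == "G":
            lanes["G"].add(position)
        if base in "CT":
            lanes["C+T"].add(position)
        if base == "C":
            lanes["C"].add(position)
    return lanes


def call_bases(lanes):
    """Infer each base from the combination of lanes that show a band."""
    calls = []
    for position in sorted(lanes["A+G"] | lanes["C+T"]):
        if position in lanes["G"]:
            calls.append("G")  # band in both G and A+G
        elif position in lanes["A+G"]:
            calls.append("A")  # band in A+G only
        elif position in lanes["C"]:
            calls.append("C")  # band in both C and C+T
        else:
            calls.append("T")  # band in C+T only
    return "".join(calls)


example = "GATTCAGC"  # hypothetical sequence
print(call_bases(cleavage_lanes(example)))
```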

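[Figure 1.5. Panel (a): four reactions containing ddGTP, ddATP, ddCTP and ddTTP. Panel (b): gel lanes G, A, C and T, with bands ordered from largest to smallest fragment; example sequence TCGAAGACGTATC.]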

Fig. 1.5 Sanger sequencing procedure. a Four distinct reactions take place in the presence
of all materials required for DNA synthesis. In each separate reaction, a distinct type of
fluorescently labeled dideoxynucleotide is also added; after the DNA synthesis cycles are complete,
this results in DNA strands that each terminate in the specific dideoxynucleotide present in
that reaction. b After the reactions are complete, the contents of the four separate reactions are
electrophoresed on a high-resolution polyacrylamide gel (www.Wikipedia.org)

As a result of using less toxic chemicals and lower amounts of radioactivity
than the Maxam and Gilbert method, and because of its comparative ease, the
Sanger method was soon automated and was the method used in the first generation
of DNA sequencers.

1.3 Some Drawbacks of the Sanger Technique

1.3.1 Short Size Fragments

The Sanger method can only be performed for DNA fragments with a fairly short
length, i.e., 100–1,000 base pairs. This is due to the limited power of
discrimination between fragment sizes during capillary electrophoresis, which
restricts the size of the DNA that can be reliably sequenced to ~1,000 base pairs
(for larger DNA fragments, longer gels are required). Larger sequences, for
example an entire chromosome, must first be fragmented into smaller pieces and
amplified to obtain a large number of copies of each individual fragment. After
the sequencing reactions are performed, these fragments must be reassembled to produce
the original sequence.
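The fragment-then-reassemble idea can be sketched in Python as follows. The sketch swaps in a simple greedy overlap merge, which is one possible strategy and not necessarily the procedure used in practice. It assumes error-free fragments, a repeat-free toy sequence, and a single strand; all names and the sequence are hypothetical.

```python
def fragment(sequence, size, step):
    """Cut overlapping windows of `size` bases; the last window reaches the end."""
    starts = list(range(0, len(sequence) - size + 1, step))
    if starts[-1] != len(sequence) - size:
        starts.append(len(sequence) - size)
    return [sequence[s:s + size] for s in starts]


def overlap(a, b, min_len):
    """Length of the longest suffix of `a` that equals a prefix of `b`, or 0."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(fragments, min_len=3):
    """Repeatedly merge the pair of fragments with the longest overlap."""
    frags = list(dict.fromkeys(fragments))  # drop exact duplicates, keep order
    while len(frags) > 1:
        best = (0, None, None)
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        if n == 0:
            break  # no remaining overlaps: leave separate contigs
        merged = frags[i] + frags[j][n:]
        frags = [f for k, f in enumerate(frags) if k not in (i, j)] + [merged]
    return frags


original = "ACGTTGCAAGGCTCAT"  # hypothetical toy sequence without repeats
pieces = fragment(original, size=8, step=4)
print(greedy_assemble(pieces))
```

With repeats or sequencing errors the greedy choice can merge the wrong pair, which is one reason more elaborate graph-based methods exist.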

Fig. 1.6 Maxam-Gilbert chemical degradation sequencing technique. a Double-stranded DNA is
labeled at 5′ ends. b Single-stranded DNA fragment is produced. c DNA fragments are distributed
in four parallel test tubes. Each test tube is subjected to a specific base degrading chemical. The
content of each tube will be electrophoresed in the next step for fragment size separation

1.3.2 Needs for Amplification and Fragment Assembly Steps

The fragmentation and amplification procedure mentioned above can be conducted by
two distinct approaches: map-based sequencing (also known as BAC-to-BAC or
hierarchical sequencing) and shotgun sequencing.
The map-based method is accomplished by using a large number of bacterial
artificial chromosomes (BAC) (>20,000), each of which contains a large DNA
fragment (approximately 100 kb), which collectively provide an overlapping
