Trends in Plant Science

Feature Review

Tools and Strategies for Long-Read Sequencing and De Novo Assembly of Plant Genomes
Hyungtaek Jung,1,* Christopher Winefield,2 Aureliano Bombarely,3,4 Peter Prentis,5
and Peter Waterhouse1,6,*

The commercial release of third-generation sequencing technologies (TGSTs), giving long and ultra-long sequencing reads, has stimulated the development of new tools for assembling highly contiguous genome sequences with unprecedented accuracy across complex repeat regions. We survey here a wide range of emerging sequencing platforms and analytical tools for de novo assembly, provide background information for each of their steps, and discuss the spectrum of available options. Our decision tree recommends workflows for the generation of a high-quality genome assembly when used in combination with the specific needs and resources of a project.

Highlights

Tumbling sequencing costs, improvements in bioinformatic pipelines, and increased access to high-performance computing capabilities have resulted in a perfect storm where nonspecialist genomics research groups are able to access, deploy, and generate de novo genome sequences in nonmodel plant systems.
However, generating a high-quality assembly for many plant species still presents significant challenges owing to genome size, complexity, and experimental and computational design.

Selecting the most appropriate sequencing and software platforms for a new genome project can be confusing and daunting because of the wide spectrum of available options and the performance quality of specific tools in different contexts.

1 Centre for Tropical Crops and Biocommodities, Queensland University of Technology, Brisbane, QLD 4001, Australia
2 Department of Wine, Food, and Molecular Biosciences, Lincoln University, 7647 Christchurch, New Zealand
3 Department of Bioscience, University of Milan, Milan 20133, Italy
4 School of Plants and Environmental Sciences, Virginia Tech, Blacksburg, VA 24061, USA
5 School of Earth, Environmental, and Biological Sciences, Queensland University of Technology, Brisbane, QLD, 4001, Australia
6 School of Biological Sciences, University of Sydney, Sydney, NSW 2006, Australia

*Correspondence: [email protected] (H. Jung) and [email protected] (P. Waterhouse).

Trends in Plant Science, August 2019, Vol. 24, No. 8 https://doi.org/10.1016/j.tplants.2019.05.003
© 2019 Elsevier Ltd. All rights reserved.

Challenges and Progress with Plant Genomics

A genome assembly is simply the sequence produced after all of the chromosomes of a target species have been fragmented (a large number of short/long DNA sequences), sequenced, and computationally put back together again to create a representation of the original intact chromosome sequences. De novo genome assembly assumes no prior knowledge of the source DNA sequence length, layout, or composition. The usual aim of a genome assembly is to build a highly accurate contiguous (i.e., an uninterrupted stretch of overlapping DNA) consensus sequence representing a haploid-phase version of the genome (one for each parental haplotype) of the target species. The costs of acquiring sufficient sequence data for such an assembly have now dropped to a level that most laboratories can afford. This has led to the recent explosion of plant species being sequenced. Four questions must be considered when embarking on a new genome assembly project: (i) how big is the genome?; (ii) is it a diploid, polyploid, and/or highly heterozygous hybrid species?; (iii) how much repetitive sequence is likely to be present in the genome?; and (iv) what is the best experimental and computational design to be employed?

Most large plant genomes have high levels of repeated and duplicated sequences owing to whole-genome, chromosomal, subchromosomal, or tandem duplications (e.g., transposable element activity) [1,2]. With genome assemblies based on short-read (75–700 bp) data, the repeats and duplications are often not well resolved, leading to the bioinformatic formation of chimeric sequences (see Glossary) and fragmented contigs. Third-generation sequencing platforms (Pacific Biosciences, PacBio, and Oxford Nanopore Technologies, ONT), which generate individual read-lengths from 8 kb to 40 kb (maximum >150 kb for PacBio and >2 Mb for ONT) [3], give much better resolution and contiguity. Nevertheless, some regions of a genome, such as the telomeric and centromeric regions of chromosomes, are often poorly resolved because they can contain megabases of repeated sequences. Current bioinformatic software does not cope well with these difficult regions, especially in the complex and polyploid genomes of many crop species. Indeed, for this reason many reportedly 'complete' plant genome sequence assemblies have many gaps, collapsed regions, and unassigned sequences. New bioinformatic and sequencing strategies are continually being developed to overcome these problems, but none has yet been universally successful. We therefore compare the recently developed tools and give particular emphasis to their performances on a wide range of plant genomes.


Figure 1. Comparison of Different Genomic Technologies in Reconstructing Target DNA Segments. A total of eight assembly strategies are simplified
and displayed from five major genomic technologies, namely short-read sequencing, long-read sequencing, synthetic long-read (SLR), linked long-read (LLR),
and optical mapping. (A) Long-read sequencing and assembly (PacBio and ONT). The middle blue line indicates the longest seed read used for mapping smaller reads.
(B) Short-read sequencing and assembly including single-end (SE), paired-end (PE), and mate-paired (MP) reads (Illumina). The red small filled boxes indicate adaptors.
The broken lines in the bottom green patterns with zigzags represent gaps in assembled contigs/scaffolds. (C) SLR and/or LLR sequencing and assembly
(10X Genomics Chromium, 10xGC). (D) SLR and/or LLR sequencing and assembly (CHiCAGO and Hi-C, an extension of chromosome conformation capture, 3C). The
red lines/curves indicate the LLRs that are reconstituted into chromatin via proximity ligation (Hi-C). (E) Optical mapping and assembly (BioNano). The vertical black
lines indicate the enzymatic cutting and/or aligning sites. Note that BioNano is not a sequencing technology but an optical mapping technology. (F) Hybrid assembly
from raw reads of A, B, and C. (G) Hybrid longer-read scaffolding and assembly from assembled contigs/scaffolds of A, B, and C (as an input), and raw reads of D.
(H) Hybrid longer-read scaffolding and assembly from assembled contigs/scaffolds of A, B, and C (as an input), and raw mapping data of E. (I) Hybrid longer-read
scaffolding and assembly from assembled contigs/scaffolds of A, B, and C (as an input), raw reads of D, and mapping data of E. The bottom green patterns (in A–E)
with zigzags represent assembled contigs/scaffolds for each approach. The rightmost green patterns with dots and Ns represent the final assembled scaffolds. The
approaches described in A–E can be performed not only for de novo assembly independently but also for hybrid assembly/scaffolding approaches if merged together.
Further assembly strategies are given in Figure 2.


Old and New Sequencing Technologies for Plant Genomes

The reference genome sequence of Arabidopsis thaliana has been invaluable to the plant science community, but it took an international effort over nearly a decade to produce the first draft and at a cost of ~100 million USD [4]. Since this initial release, generated using Sanger sequencing technology (considered to be first-generation sequencing technology [5–11]), there have been 10 major updates and the publication of a further 1135 Arabidopsis genomes [5]. The success of this and other model plant genome sequencing projects has been a major catalyst and inspiration for research, including the recently announced 10 000 Plant Genome Sequencing Project (10KP) [12] which will focus on nonmodel plants [13,14]. The rapid adoption of whole-genome sequencing has been facilitated by the development of second- and third-generation sequencing technologies (SGST and TGST, respectively) which have dramatically reduced sequencing costs and simplified genome assembly. Without doubt, these major new initiatives with new sequencing technology will improve our understanding of plant genomic diversity, while also acting as an important community resource for plant scientists to perform a wide range of analyses. However, to make it possible to undertake genome assembly for nonmodel plant species, the challenges will still include (i) assembling large complex genomes derived from complex whole-genome duplications, (ii) choosing the most appropriate sequencing platforms, and (iii) developing high-throughput assembly and annotation pipelines that require minimal human input.

SGSTs (including Illumina, 454, SOLiD, and Ion Torrent) are high-throughput, fast, low-cost, and highly accurate, producing reads of short length (75–700 bp). However, their limited ability to resolve complex regions with repetitive or heterozygous sequences has led to incomplete or heavily fragmented genome assemblies. This is due, in particular, to difficulties in mapping this type of data to unique positions in reference genomes and in resolving repetitive regions such as long structural variants (SVs). Even after assembly, scaffolds will often contain many regions of unknown sequence (Figure 1 and Table 1). The TGST platforms from PacBio and ONT give long single-molecule reads (averaging >12 kb, with some ONT sequences reaching over 2 Mb [3]) with complete contiguity, facilitating assembly. However, both long-read technologies suffer from high costs per base and high error rates (Figure 1 and Table 1). Although earlier sequencing technologies and their associated assembly and mapping algorithms/software have been extensively reviewed [6–11], there are currently few comparisons or reviews of TGSTs [47–49].

The simplest TGST-based whole-genome assembly approach is undertaken in three steps. The first, and most important, step for these methods is the extraction of high molecular weight DNA that is free of contaminants. There are many metrics to determine the quality of DNA; the most important of these are summarized below in the section on DNA extraction methods and quality measurement.

The second step requires the preparation of platform-specific libraries using kits provided by the manufacturers. Attention should be paid to the desired insert lengths in the prepared libraries because they affect the read lengths and throughput (total number of bases sequenced per run). With both platforms it is possible to obtain average read-lengths of >20 kb. However, increasing read-length often comes at the expense of throughput. We would generally recommend a blend of sequencing runs delivering smaller read-lengths with optimized throughputs followed by runs specifically aimed at long read-lengths (>50 kb) to assist scaffolding shorter reads into larger contiguous sequences.

The third step is assembly of called and quality-filtered data using overlapping sequences to generate contiguous chromosome-length sequences. When completed genomes of closely related species are available, a reference-guided/assisted genome assembly may also be an attractive option because of the lower requirement for coverage data and computational memory. Some caution should be exercised, however, because the resulting assemblies may contain biases toward errors and chromosomal rearrangements in the existing reference genome [50–53]. Further practical strategies and applications for reference-guided/assisted genome assembly are discussed elsewhere [50–53]. Although prokaryotic genomes have been successfully assembled with the sole use of TGST [54], this approach has been only moderately successful for plant species, mainly for those with small and less-complex diploid genomes (<300 Mb) [55,56]. For larger plant genomes, de novo assembly using this approach has generally delivered less than desirable results. This is due in large part to errors in the sequencing data deriving from inaccurate base calling. These errors present significant challenges to the current sequence assembly software in generating gap-closing alignments, particularly across repeat-containing regions. Some of these issues can be resolved with increased coverage. However, there appears to be an upper limit to useful read-depth because of the systemic nature of the errors in both ONT and PacBio data. This combination leaves substantial fractions of large plant genome assemblies inaccessible and, like assemblies produced by SGSTs, limits the ability to mine for important biological insights [57–59].

The regions of large plant genomes that are most challenging to accurately determine are long tracts of repeat sequence that may span >1 Mb. Even the longest read-lengths reported by either PacBio or ONT technologies will often fail to span such regions. To assemble these tracts of sequence, the development of additional assembly strategies and sequencing technologies is required. As an interim solution, the development of an advanced 'hybrid' approach, for example, incorporating 10X Genomics Chromium (10xGC) data or medium-size single-molecule DNA fragment selection and tagging before sequencing with short-read sequencing, could be a viable option to increase the continuity and accuracy of long reads (see Hybrid Assembly Approaches, below). Although this 'hybrid' approach increases the accuracy of long reads by mapping Illumina short reads onto them to generate a consensus sequence, and has resulted in assembled scaffolds with high accuracy, incomplete and/or unfinished assemblies still occur (e.g., gaps and fragments). Thus, additional techniques such as optical mapping (BioNano) and chromatin association (Hi-C: an extension of chromosome conformation capture, 3C) are usually required to facilitate contig joining [11,59–62] and the completion of a genome assembly. These subchromosome scaffolding assembly (SCSA) techniques often reduce the scaffold number and increase scaffold size by a factor of three to ten to give chromosome-level assemblies (Table 2).

Glossary

Bacterial artificial chromosome (BAC): an engineered DNA molecule (vector) that is used to clone a target DNA sequence in bacterial cells.

de Bruijn graph (DBG): an efficient way to represent a sequence in terms of its K-mer components that is widely used for short-read assemblies.

Chimeric sequence: a form of sequence consisting of two or more biological sequences and/or unrelated DNA fragments that have been artificially joined together.

Contig: a continuous stretch of assembled sequence without gaps.

Contiguity: a series of contiguous sequences (contigs) that are in contact or in proximity from a set of overlapping DNA segments that together represent a consensus region of DNA.

Fourth-generation sequencing technology (FGST): a new single-cell sequencing technique that preserves the spatial coordinates of RNA and DNA sequences with potentially subcellular resolution, thus enabling mapping of sequencing reads back to the original histological context.

Error correction: the process of removing and correcting the underlying errors generated by high-throughput sequencing platforms and/or by true genetic variation and technical artefacts to increase read and sequence quality.

Gap filling: the process of reconstructing the missing and/or unknown sequences (gaps) between consecutive contigs by mapping actual sequence reads and/or introducing uncharacterized nucleotide (N) stretches of unknown or estimated lengths.

Linked long-read (LLR): a type of data that utilizes molecular barcodes to tag short reads together that come from the same long DNA fragment in 10X Genomics Chromium (10xGC).

Methylation: an epigenetic mechanism that occurs via addition of a methyl (CH3) group to a DNA molecule, thereby often modifying the function and expression of the genes without changing the sequence.

Overlap–layout–consensus (OLC): a graph assembly algorithm for long reads relying on three consecutive steps: (i) overlap (build the overlap graph to find potentially overlapping reads), (ii) layout (merge reads into contigs and simplify the graph), and (iii) consensus (derive the DNA sequence and correct read errors).

Polishing: improving the consensus accuracy of an assembly and/or obtaining higher sequence identity using short and/or long reads.

Scaffolds: created by joining contigs together using additional information (introducing arbitrary N letters) about the relative position and orientation of the contigs in the genome.

Second-generation sequencing technology (SGST): sequencing techniques and platforms generating short reads (<1 kb) using wash-and-scan approaches (Roche, Illumina, and Ion-Torrent).

Structural variants (SVs): large DNA alterations (generally >1 kb), often comprising inversions, balanced translocations, and copy-number variants.

Synthetic long-read (SLR): an advanced highly parallel library preparation technique to pool barcoded subsets of the genome (~20 kb) for empowering assembly and resolving highly repetitive complexes in short Illumina reads (e.g., TruSeq).

Third-generation sequencing technology (TGST): sequencing techniques and platforms that generate long reads (>10 kb) and ultra-long reads (>1 Mb) (PacBio, ONT, and BioNano).
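As an illustration of the de Bruijn graph (DBG) entry in the Glossary, the following minimal Python sketch builds such a graph from a handful of error-free toy reads. The function name `de_bruijn_graph` and the example reads are hypothetical and chosen only for illustration; production assemblers additionally handle sequencing errors, reverse complements, and graph simplification, none of which is attempted here.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Build a toy de Bruijn graph.

    Nodes are (k-1)-mers; each distinct k-mer in the reads contributes one
    edge from its prefix (k-1)-mer to its suffix (k-1)-mer.
    """
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    # Sort successors so the output is deterministic.
    return {node: sorted(successors) for node, successors in graph.items()}


# Two overlapping toy reads (hypothetical data).
reads = ["ACGTAC", "GTACGG"]
graph = de_bruijn_graph(reads, k=4)
for node, successors in sorted(graph.items()):
    print(node, "->", successors)
```

In this toy case the node `ACG` has two successors, which is the kind of branch that repeats introduce and that an assembler has to resolve.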

DNA Extraction Methods and Quality Measurement


Given the potential breadth of plant species that are likely to be targeted for genomic studies, each with their own peculiarities, we are only able to provide general suggestions on extraction methods, based on our own experience. Although recent publications provide valuable guidance [80–83], users should look to develop or adapt DNA extraction methodologies along the lines we provide, paying particular attention to the quality metrics outlined below.

Aside from the obvious requirements to generate DNA preparations that are free of contaminants such as proteins, carbohydrates, and polyphenolics, users should also seek to select methods that produce high molecular weight DNA. Avoidance of column-based DNA extraction methods is recommended given the propensity of these methods to shear DNA, often to fragment sizes <8 kb. Although we have had some success with commercial magnetic bead-based DNA purification methods for plants, these methods still shear DNA. However, with care, DNA prepared in this way can have an average size of >30 kb. In general, the most successful methods tend to be those based on cetyltrimethyl ammonium bromide (CTAB) extraction buffers combined with spooling of DNA. These approaches produce excellent quality DNA of high molecular weight, but often require larger input of tissue than the magnetic bead-based kits. Whichever approach is adopted, there will be a requirement for refinement of the method to achieve several important quality metrics that we have found to be important for both PacBio- and/or ONT-based sequencing platforms.

Table 1. Summary of Selected Long-Read Sequencers for De Novo Assemblies of Large Eukaryotic Genomes^a,b
Pros and Cons | 10X Genomics Chromium^c (HiSeq 4000) | Pacific Biosciences (SEQUEL/Cell) | Oxford Nanopore (MinION) | BioNano (Saphyr/Chip) | Dovetail^c (HiSeq 4000)
Compatible platforms | Illumina | RS II | GridION^d and PromethION^d | Irys | Illumina
Minimum input | ~3 ng | ~20 μg | ~1 μg | ~200 ng | ~5 μg
Long-read | Synthetic | True | True | True | Synthetic
Average/maximum read length | ~300 bp (PE)/~150 kb (LLR) | ~12 kb/~150 kb | ~12 kb/~2 Mb | ~350 kb/~1 Mb | ~150 kb/~1 Mb (SLR)
Throughput | ~1500 Gb | 0.7 Gb–20 Gb (SEQUEL) | 50 Gb–15 Tb (PromethION) | ~640 Gb | ~1500 Gb
Reads | ~5 billion (B) | 0.07 million (M)–2 M | 1.5–5 M | ~2 M (image file) | ~5 B
Runtime | ~3 days | 6–10 h | 2 h to 6 days | ~1 day | ~3 days
Quality scores | >30 | >10 | >10 | NA (only nonsequence-based method) | >30
Error profile | <1% (GC/AT-biased and substitutions) | 5–10% (indels) | 5–15% (indels and substitutions) | Sizing error, false sites, and missing sites | <1% (GC/AT-biased and substitutions)
Output format | Fasta, Fastq | Bam, Fasta, Fastq, Hdf5 (RS II) | Fast5 | BNX, C/S/XMAP, SVMerge, TIFF | Fasta, Fastq
General assembly software | Supernova | CANU, Falcon/Falcon-Unzip, Flye, HGAP, Minimap/Miniasm | CANU, Minimap/Miniasm, TULIP | RefAligner | 3D-DNA, HiRise, LACHESIS, Meraculous, SALSA
Instrument cost^e | $$$$$ | $$$$$$$$$$ | $$$ | $$$$$$$$$$ | $$$$ (different library preparations but can be used in HiSeq 2500 or above)
Cost per Gb^e | $$ | $$$$$$$$$$ | $$$$$$$$ | $$$$$$ | $$$$
General applications | Limited testing only for human and diploid assembly | Widely tested from prokaryotic to eukaryotic organisms; can analyze DNA methylation | Mainly tested for prokaryotic but starting to expand to eukaryotic organisms; can analyze DNA methylation | Widely tested from prokaryotic to eukaryotic organisms; mainly applied for scaffolding improvement and chromosome-scale assemblies | Widely tested from prokaryotic to eukaryotic organisms; mainly applied for scaffolding improvement and chromosome-scale assemblies
Other pros Moderate cost Numerous Low-to-moderate cost No risk of PCR artefacts Simple assay process
instrument and dedicated software instrument and runs
runs tools

Low cost per Mb Well-established Moderate cost per Mb with Real-time data monitor No separate instrument
with high accuracy platform easy sample preparation for quality metrics needed
(SMRT Link) (BioNano Access/IrisView)

Minimal input Real-time analysis for rapid Can create the most
requirement and efficient workflows contiguous and accurate
(MinKNOW) assemblies possible

Can repetitively sequence a Can provide physical
given genome mapping
Other cons Vulnerable to Expensive Lower base-calling Expensive instrument Vulnerable to Illumina biases
Illumina biases instrument and runs accuracy and runs and limitations
and limitations

Not true long-read Higher cost per Mb Limited testing and Moderate cost per Mb Not true long-read
with high random performance for higher with sizing errors
error rates eukaryotic and polyploid
genomes
Limited data High input Limited compatible High input requirement
available requirement software

Limited test for Limited data available
non-human and
polyploid assembly
Mainly commercial-based
service (Hi-C/HiRise)

^a This table was generated after visiting the official website of each platform and the most recent review articles [3,6,7,9]. The same acronyms (i.e., programs) are used in all Tables. For more library preparation and sequencing guides refer to the products and/or services page of the vendor.
^b Abbreviations: LLR, linked long read; NA, not available; PE, paired-end; SLR, synthetic long read.
^c 10X Genomics Chromium and Dovetail: focused on HiSeq 4000 platform. Although both Dovetail and Phase Genomics provide Hi-C data, we have focused on Dovetail Genomics only.
^d GridION and PromethION are still in experimental phase (not fully accessible for commercial service).
^e For instrument cost and cost per Gb the relative cost is indicated by the number of $ symbols.

Generally, purified DNA should be measured/quantified using both spectrophotometric and fluorescence-based methods (such as Qubit). Optical density (OD260:OD280) ratios of 1.8–2.0 indicate that samples are generally free of protein contamination, and OD260:OD230 ratios of >2.0 generally indicate the sample is free of phenolics and carbohydrates. Quantification of genomic DNA using only spectrophotometric methods is not recommended, and quantification is best performed using fluorimetric methods such as Qubit™. Achievement of a 1:1 ratio of the concentrations of DNA determined by spectrophotometry and fluorimetry, respectively, has proved to be a very good indicator of whether DNA will be sequenced efficiently.
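The purity thresholds above can be collected into a simple check. The following Python sketch is illustrative only: the function name `dna_qc`, its parameter names, and in particular the `ratio_tolerance` of 0.2 around the 1:1 concentration ratio are assumptions made for this example and are not values given in the text.

```python
def dna_qc(od260, od280, od230, spectro_ng_ul, fluoro_ng_ul, ratio_tolerance=0.2):
    """Flag a DNA preparation against the purity metrics described in the text.

    ratio_tolerance is an assumed allowance around the ideal 1:1 ratio of
    spectrophotometric to fluorimetric concentration.
    """
    a260_280 = od260 / od280
    a260_230 = od260 / od230
    conc_ratio = spectro_ng_ul / fluoro_ng_ul
    return {
        # OD260:OD280 of 1.8-2.0 suggests little protein contamination.
        "protein_free": 1.8 <= a260_280 <= 2.0,
        # OD260:OD230 above 2.0 suggests few phenolics and carbohydrates.
        "phenolic_carbohydrate_free": a260_230 > 2.0,
        # Spectrophotometric and fluorimetric concentrations should agree.
        "concentrations_agree": abs(conc_ratio - 1.0) <= ratio_tolerance,
    }


# Hypothetical readings for a clean sample and a protein-contaminated one.
clean = dna_qc(od260=1.00, od280=0.53, od230=0.45,
               spectro_ng_ul=105.0, fluoro_ng_ul=100.0)
contaminated = dna_qc(od260=1.00, od280=0.70, od230=0.45,
                      spectro_ng_ul=100.0, fluoro_ng_ul=100.0)
print(clean)
print(contaminated)
```

A sample failing any flag would be a candidate for further clean-up or re-extraction before library preparation.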

To determine the integrity of the DNA sample, it is strongly recommended that a sample of DNA is separated to determine the degree of degradation and the spread of molecular weight of the isolated DNA. Standard agarose gel electrophoresis is not generally recommended owing to the poor resolution of DNA above 10 kb. Contour-clamped homogeneous electric field (CHEF) or pulsed-field electrophoresis is suitable, but we would recommend the use of instruments such as the TapeStation or Fragment Analyzer (Agilent Technologies) in conjunction with their high molecular weight analysis kits. Analysis of isolated DNA in this manner will assist in decisions about shearing the DNA to obtain an optimal size range for sequencing, and can also help to identify contaminants that may affect sequencing performance because common contaminants will often influence the mobility of DNA.

Table 2. Summary of Recently Published Plant De Novo Genome Assemblies Using Long-Read Sequences^a,b
Scientific name | GS (Gb) | Final output: AGS (Gb) | Final output: TSN | Final output: N50 (Mb) | Input details and depth (×): SG, 454, IM, PB, 10xGC, BAC, FSM; BND (×); Dovetail (×): CHiCAGO, Hi-C | Refs
Aegilops tauschii spp. strangulata | 4.3/DP | 4.22 | 109 861 | 31.73 | 191 35 53 90 | [63]
Amaranthus hypochondriacus | 0.47/DP | 0.41 | 908 | 24.4 | 229 31 10 363 41 | [64]
Arabis alpina | 0.37/DP | 0.34 | 817 | 4 | 86 722 85 | [56]
Chenopodium quinoa | 1.4/AT | 1.39 | 3486 | 3.6 | 66 54 72 52 | [65]
Conringia planisiliqua | 0.23/DP | 0.18 | 67 | 7.4 | 54 410 | [56]
Durio zibethinus | 0.74/DP | 0.71 | 677 | 22.7 | 202 153 380 4371 | [66]
Euclidium syriacum | 0.26/DP | 0.23 | 80 | 18.7 | 47 446 | [56]
Hevea brasiliensis | 1.4/DP | 1.26 | 47 154 | 0.1 | 5 57 40 44 | [67]
Hordeum vulgare L. | 5.3/DP | 4.79 | 4235 | 1.9 | 24 200 200 60 96 | [59]
Lactuca sativa | 2.5/DP | 2.38 | 11 474 | 1.8 | 73 72 | [68]
Malus domestica Borkh | 0.65/DP | 0.65 | 1081 | 5.6 | 200 35 600 | [69]
Manihot esculenta | 0.77/DP | 0.58 | 2019 | 28.1 | 1 29 968 1082 125 | [70]
Musa acuminata | 0.53/HP | 0.45 | 1532 | 3 | 21 91 0.2 3 60 | [71]
Nicotiana attenuata | 2.5/DP | 2.37 | 37 194 | 0.5 | 5 30 10 50 | [72]
Nicotiana tabacum | 4.5/AT | 3.69 | 2217 | 2.2 | 18 69 8 110 | [73]
Oropetium thomaeum | 0.25/DP | 0.25 | 46 | 7.8 | 72 200 | [55]
Oryza sativa Indica | 0.4/DP | 0.41 | 225 | 2.5 | 100 118 16 250 | [74]
Rosa chinensis | 0.56/DP | 0.52 | 55 | 69.6 | 147 80 112 | [75]
Saccharum spontaneum L. | 3.36/AT | 3.13 | 76 028 | 0.1 | 90 78 80 90 | [76]
Tartary buckwheat | 0.54/DP | 0.45 | 114 | 7.5 | 175 31 25 220 195 | [77]
Triticum aestivum L. | 16/HxP | 14.5 | 138 665 | 7.0 | 73 217 570 47 8 | [78]
Triticum turgidum ssp. dicoccoides | 12/AT | 10.49 | 151 912 | 6.96 | 176 180 | [57]
Triticum urartu | 4.94/DP | 4.86 | 31 559 | 3.67 | 21 19 11 297 83 | [79]
Zea mays | 2.3/DP | 2.07 | 625 | 9.6 | 1000 65 1 1 60 | [58]

^a This table represents a selection of recent plant and crop genome work focusing on whole-genome assemblies using BioNano and/or Dovetail data (at least one technology used). In addition, the table does not include any individual chromosome assemblies, green alga genomes, pure TGST/SGST/hybrid genome assemblies without BioNano/Dovetail data, single-cell sequencing, or transcriptomes. If there was no estimated input depth from the original report, this was estimated from the raw data. For the most recent global statistics, we highly recommend visiting the associated GenBank BioProject.
^b Abbreviations: 454, Roche 454; 10xGC, 10X Genomics Chromium; AGS, assembled genome size; AT, allotetraploid; BAC, bacterial artificial chromosome (including BAC-end sequence); BND, BioNano Depth; DP, diploid; FSM, fosmid; GS, genome size; HP, haploid; HxP, hexaploid; IM, Illumina [combined paired-end (PE) and mate-pair (MP) reads]; PB, PacBio; SG, Sanger; TSN, total scaffold number.

Workflow Design

The genome size, levels of ploidy and heterozygosity, and the quality of DNA that can be extracted will affect the complexity, overall quality, and cost of the genome assembly of the target species. Flow cytometry (an accurate way to determine genome size) and K-mer frequency distribution (a simple approach to infer genome size, repeat content, and heterozygosity using Illumina reads) are two widely used methods to estimate the size of a genome [84], and the generation of sufficient sequence data/read coverage is a crucial starting point in a genome assembly project. If cost is not an obstacle, securing >100× coverage of long-read data can be the basis for a good genome assembly through self-correcting algorithms [e.g., in Canu, hierarchical genome assembly process (HGAP), and Falcon] that align the reads against one another without relying on any additional sequencing data. However, the cost of obtaining such high read-coverage of long-read data may not be the only problem. There are some inherent errors in the technologies. For example, both ONT and PacBio platforms struggle with homopolymeric sequences.
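The K-mer frequency approach to genome size mentioned above rests on a simple relationship: the total number of K-mers counted in the reads divided by the depth at the main peak of the K-mer histogram approximates the genome size. The following Python sketch shows the idea on toy data. The function name `estimate_genome_size`, the `min_depth` cut-off, and the example sequence are hypothetical; the sketch assumes error-free reads from a homozygous genome and ignores heterozygosity peaks and reverse complements, which real analyses with dedicated K-mer counters must handle.

```python
from collections import Counter


def estimate_genome_size(reads, k, min_depth=2):
    """Estimate genome size as total k-mers divided by the main peak depth.

    min_depth is an assumed cut-off: in real data, k-mers seen only once or
    twice are mostly sequencing errors and are excluded before peak finding.
    """
    counts = Counter(
        read[i:i + k] for read in reads for i in range(len(read) - k + 1)
    )
    # Histogram: number of distinct k-mers observed at each depth.
    histogram = Counter(counts.values())
    usable = {depth: n for depth, n in histogram.items() if depth >= min_depth}
    # The main peak is the depth shared by the largest number of k-mers.
    peak_depth = max(usable, key=usable.get)
    total_kmers = sum(depth * n for depth, n in usable.items())
    return total_kmers / peak_depth


# Toy data: a 20-base "genome" sequenced end to end five times.
genome = "ATGCGTACCTGAAGTCCAGT"
reads = [genome] * 5
size = estimate_genome_size(reads, k=5)
print(size)  # number of 5-mer positions in the toy genome: 20 - 5 + 1 = 16
```

The estimate is expressed in K-mer positions, which for genomes much longer than K is effectively the genome length in bases.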

A hybrid approach, using a mixture of both short and long reads, can be less expensive than
using long reads alone, but in general the quality of the assembly is lower [33,84–88], and several
factors (e.g., genome size and the frequency of repetitive sequences) can affect its sequence
contiguity. A minimum of 60× (180 Gb) long-read sequence coverage should be sufficient for
a highly inbred/homozygous, small- to medium-sized (b3 Gb) diploid genome. For larger diploid
genomes and/or genomes that are highly heterozygous, we recommend a minimum 60× of
SGST, 30× of TGST, 50× of 10xGC, and 60× of SCSA per each haploid subgenome. Polyploid
and highly repetitive genomes may require an extra 50–100% more sequence data than their
diploid counterparts (Figure 2). In plants, further filtering of unwanted organelle fragments
(e.g., chloroplast and mitochondrial sequences) may be necessary to increase the quality of a
nuclear genome assembly. Input data usually contain 5–20% unwanted organelle DNA
reads [89]. Thus, at the upper end of this range (20%), the apparent 60× (180 Gb) coverage of a 3 Gb plant
genome may actually contain only 48× relevant coverage (144 Gb). Once high-quality sequence data (and where required
SCSA) have been obtained, there remains the considerable computational task of assembling
them into the best reference sequence, and this requires significant computational resources.
We highlight below the typical resources available in many core facilities. However, with increasing
democratization of whole-genome sequencing, more and more groups will require access to
such high-performance computational resources, and the increasing availability of cloud-based
solutions may offer an attractive option to many researchers (discussed below).
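The organelle correction described above is simple arithmetic, and a short sketch makes the calculation explicit. The function name `effective_coverage` and its parameters are hypothetical; the only inputs taken from the text are the 180 Gb yield, the 3 Gb genome size, and the 5–20% organelle fraction.

```python
def effective_coverage(total_bases_gb, genome_size_gb, organelle_fraction):
    """Return (nuclear bases in Gb, nuclear coverage) after discarding organelle reads.

    organelle_fraction is the share of sequenced bases that derive from
    chloroplast or mitochondrial DNA (0.05-0.20 in the range cited in the text).
    """
    if not 0.0 <= organelle_fraction < 1.0:
        raise ValueError("organelle_fraction must be in [0, 1)")
    nuclear_bases_gb = total_bases_gb * (1.0 - organelle_fraction)
    return nuclear_bases_gb, nuclear_bases_gb / genome_size_gb


# Worked example from the text: 180 Gb of reads, 3 Gb genome, 20% organelle reads.
nuclear_gb, coverage = effective_coverage(180, 3, 0.20)
print(f"{nuclear_gb:.0f} Gb of nuclear reads, about {coverage:.0f}x coverage")
```

In practice the organelle fraction would be measured, for example by mapping reads to organelle references, rather than assumed.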

Each sequencing platform has different inputs (DNA, labor, and preparation), computational
requirements, and costs (Table 1 and supplemental information online), with each assembly
using multiple software packages and pipelines (Table 2). This article aims to provide a concise
review of current and emerging TGSTs (10xGC, PacBio, ONT, BioNano, and Dovetail Genomics)
and their application to de novo plant genome assembly. We highlight 25 analytical tools (chosen
from a library of 105 readily available, open-source tools), suggest practical strategy combina-
tions, and provide a decision tree to help researchers to select the most appropriate approach
to achieve a high-quality reference genome assembly for their species of choice.

Computational Requirements
Genome assembly uses large amounts of sequence data and requires computational resources
that are usually only available at high-performance computing (HPC) facilities. Given the vast
range of potential plant genome assemblies that are likely to be undertaken in the near future,
we can only give a general guide to the computational resources needed for such projects. However,
the guide is scalable, based on genome size and ploidy, and our recommendations will likely
apply to larger and more complex genomes, but at a slower rate. As mentioned below, innovations
in the use of graphics processing units (GPUs) and other accelerator platforms will greatly
improve analysis times. As a general guide, successfully assembling a moderately sized diploid
plant genome of ~1 Gb using software pipelines such as Canu or Falcon will require a minimum
computing resource of 96 physical CPU cores, 1 TB of high-performance RAM, 3 TB of local stor-
age, and 10 TB of shared storage. Polyploid, large (per 1 Gb genome size increase) and highly
repetitive genomes may require an additional 50% more computing resource than their 1 Gb
diploid counterparts. Although increasing the size of the computing resources would generally
be expected to reduce assembly times, this must be balanced against the costs of purchasing
larger resources. To date, most plant genome assemblies and genomic analyses have been
carried out by large and well-resourced teams with access to very large in-house systems. De-
creasing sequencing costs have resulted in burgeoning numbers of users with projects requiring
HPC resources [90]. As the number of users, the diversity of users, and the volumes of data grow,
demands on these systems will also increase. Outside supercomputer-type facilities, continual
growth in capacity (core numbers, RAM, and storage space) will be necessary to maintain the
ability of existing in-house systems to deliver the required performance. Constant efforts are
therefore required both to maintain and improve infrastructure and to drive more efficient use of
these resources [91].

Other than large in-house HPC resources, two other options are available. The first and some-
what daunting option is to consider purchasing and/or constructing a moderately sized cluster.
The design, construction, and maintenance of a cluster, of the size described above, is a complex
and potentially expensive undertaking that requires significant IT support at all stages. Significant
ongoing costs should also be expected for maintenance and future expansion in capacity, with
particular attention being paid to long-term storage of pre- and post-processed data. A series
of white papers covering possible options is available from insideHPC
(https://insidehpc.com/2015/03/the-insidehpc-guide-to-hpc-in-life-sciences/).

The second option is to use cloud-based resources. Cloud approaches offer many advantages
over fixed architecture, including customized and flexible environments that allow users to exper-
iment and alter the computing environment without significant administrative overheads [92,93].
Although cloud-based solutions are generating significant interest among all HPC providers, sev-
eral issues need to be understood before adopting this solution. Cluster architecture is crucial to
achieve optimal performance, particularly where multiple nodes (typically single servers consisting
of 24 cores, 256 GB RAM, and 500 GB of local disk space) are employed. Most large institutional
systems have been built specifically to meet the demands of data-intensive computing, such as
genome assembly and related analyses, and typically have an order of magnitude better perfor-
mance than typical cloud-based options [92]. A considerable amount of work is often necessary
to develop a workable cloud-based solution from scratch, in particular the development of
software ‘stacks’ containing the requisite software pipelines [92]. In addition, slow internet
connectivity can be a major impediment to efficient data transfer to the cloud and back. For
users who find the development of a custom solution daunting, several commercial HPC service
providers are emerging that offer solutions to meet the needs of genomics applications, but they
may lack the flexibility of either a custom cloud solution or a large-scale institute-sponsored
solution.

Figure 2. Simplified Workflow and Decision Tree for the Selection of Suitable Next-Generation Sequencing
(NGS) Platforms, Reads, and De Novo Assemblies. The selection of an NGS platform requires a set of sequential
decisions. First, decide on the desired read length and quality from the sequencing technologies. Next, specify which assemblers,
mappers, and polishers will be used for each dataset (differently colored boxes represent each stage and its related tools).
Finally, determine the assembly quality and the strategy to be used (with further refinement if necessary). Note that several
hybrid assemblers include gap-filling/scaffolding capabilities, and some can only be applied to reads from TGST or SGST
assemblies. In addition, a few SGST assemblers have been improved since their inception to deal with both short
and long reads simultaneously as per the hybrid assembly approach. It is highly recommended to visit the official website
of each tool to verify the latest version/mode before use in case of possible recent changes and improvements. More
applicable/alternative tools for each stage are given in Tables 3 and 4, and in the supplemental information online.
Decision-making path to follow: black unbroken line, recommended workflow, and its tool for each approach; boxes of
different colors represent each different stage. Scaffolding and merging: recommended but optional approaches for all
short, long, and hybrid reads to increase assembly continuity. Confirm and refine: the recommended three options after
assembly assessment. See Figure 1 and Table 1 for more sequencing technologies and reads (A–H). The very bottom
dark-blue box represents the final de novo assembly outcome. Abbreviations: CLMA, chromosome-level mapping
assembly using linkage group/map data; SCSA, subchromosome scaffolding assembly using (F–H); SGST, second-
generation sequencing technology; TGST, third-generation sequencing technology.

To increase the speed of processing massive amounts of sequence data there is a push toward
parallelization of computing resources and software. The increased utilization of GPUs and
field-programmable gate-arrays (FPGAs) offers much greater computing capacity and flexibility than
CPU-based clusters [90–96]. Although utilization of such architectures currently requires
considerable computing expertise, software solutions that exploit such massive parallelization are
being developed [90,94–96].

In summary, access to HPC resources is crucial for genome assembly projects. Users at
genome-focused institutions probably have access to in-house high-capacity systems with
the appropriate software ‘stacks’. However, these resources may come under considerable
pressure as the genomics research sector continues to grow and ask increasingly computationally
intensive questions. Cloud-based computing is a possible solution, both for satisfying
such increased demands and for empowering genome researchers at institutions
lacking HPC resources. Cloud computing provides flexibility, competitive pricing, and continually
updated hardware and software. However, setting up suitable cloud-based software currently
requires IT specialists, either from the user’s research institution or contracted from the many
private providers with fit-for-purpose software and computation environments.

Assembly Approaches Using Only SGST Data


Over the past decade the de Bruijn Graph (DBG) algorithm has been the method of choice
for assembling plant and animal genomes from SGST short-read data [8,47,97]. Short-read
sequence assembly approaches have been reviewed extensively (assembly approaches [8,47,
98–100], error-correction tools [101,102], and mapping software [103,104]). Although the
short-read format is low-cost and has low error rates (<1%), it presents many technical and computational
challenges for genome assembly [73,105,106]. However, the recently developed
10xGC system provides a workable solution. This emulsion-based method utilizes the limiting
dilution principle and identifies the short reads generated from the same molecule, thus linking
their sequences (linked long-read, LLR) and allowing more accurate assembly [107–109]
(Figures 1 and 2). Two recent diploid genomes, Glycine latifolia and Capsicum
annuum, were assembled using a combination of SGST reads with 10xGC data, leading to a
better than threefold improvement in scaffold N50 and a cost >20-fold lower than using SGST
alone. However, this approach often still leaves many gaps and misassembled or unassembled
regions in the final assemblies, particularly in repetitive regions and/or when assembling genomes
from polyploid species [107,109]. Although this approach has greatly improved short-read
assemblies in both large and complex genomes, TGST coupled to optical mapping and Hi-C
techniques holds more promise for complete and contiguous assemblies, especially for polyploid
species [53,68,110,111], as discussed in more detail below.
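To make the de Bruijn graph (DBG) idea concrete, the following is a toy sketch, assuming error-free reads and a repeat-free sequence. The function names (`build_de_bruijn_graph`, `walk_linear_path`) are hypothetical; production short-read assemblers add error pruning, bubble removal, and repeat resolution that are not shown here.

```python
from collections import defaultdict


def build_de_bruijn_graph(reads, k):
    """Build a DBG: nodes are (k-1)-mers, each distinct k-mer adds one edge."""
    graph = defaultdict(list)
    seen = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if kmer in seen:
                continue  # each distinct k-mer contributes a single edge
            seen.add(kmer)
            graph[kmer[:-1]].append(kmer[1:])
    return graph


def walk_linear_path(graph, start):
    """Follow edges from start while each node has exactly one successor.

    Simplification: only out-degree is checked; a full implementation would
    also check in-degree before extending a contig.
    """
    path = start
    node = start
    visited = {start}
    while len(graph.get(node, [])) == 1:
        nxt = graph[node][0]
        if nxt in visited:
            break  # stop on cycles, which typically arise from repeats
        path += nxt[-1]
        visited.add(nxt)
        node = nxt
    return path


reads = ["ATGGCGT", "GGCGTGC", "GTGCAAT"]
graph = build_de_bruijn_graph(reads, k=4)
print(walk_linear_path(graph, "ATG"))
```

With real data, repeats longer than k create branching nodes at which such a walk must stop, which is one reason why short-read assemblies of repetitive plant genomes remain fragmented.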

Assembly Approaches Using Only TGST Data


PacBio and ONT are becoming increasingly cost-effective for generating high-quality de novo
plant genome assemblies (Tables 1, 2, and supplemental information online). The average
read-length capability, which can easily exceed 30 kb [3,54–58,112,113], makes these data
invaluable for large and complex plant genomes [29,58,63,88,111,113–116] (Table 2). Indeed,
continual improvements in sequencing chemistry, throughput, and simplification of assembly
algorithms make this approach the best choice for assembling large complex/polyploid
genomes.


PacBio and ONT long-read sequencing methods use real-time observation of DNA sequencing.
PacBio utilizes single-molecule real-time (SMRT) sequencing, a sequencing-by-synthesis technology that
harnesses single-molecule DNA replication using zero-mode waveguides (ZMWs) and
phospholinked nucleotides [54,117]. ONT identifies DNA bases by observing the electrical
currents generated as a single strand of DNA passes through a nanopore [113,115,118]. Both
approaches have high random and systematic error rates (5–10% for PacBio and 5–15% for
ONT), and thus require substantial depth of coverage to accurately determine consensus base
calls [at least 30× for each haploid genome content; e.g., a 500 Mb tetraploid (4n) genome requires
120× coverage (30× × 4 = 120×; 120× × 500 Mb = 60 Gb)] [29,54,56,58,69,85–88,117–120]
(Table 2). To work with this long-read data, overlap–layout–consensus (OLC) assemblers
are best suited for de novo assembly [6,24,56,121,122].
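The coverage rule in this paragraph can be written as a small helper. This is a sketch under the assumptions stated in the text (30× per haploid genome copy, with the yield computed against the quoted genome size); the function name `required_yield_gb` and its parameter names are hypothetical.

```python
def required_yield_gb(genome_size_mb, ploidy, depth_per_haplotype=30):
    """Return (total depth, sequence yield in Gb) for a long-read project.

    Follows the rule of thumb in the text: depth_per_haplotype for each
    haploid genome copy, multiplied by the quoted genome size.
    """
    total_depth = depth_per_haplotype * ploidy
    yield_gb = total_depth * genome_size_mb / 1000  # Mb -> Gb
    return total_depth, yield_gb


# Worked example from the text: a 500 Mb tetraploid genome.
depth, gb = required_yield_gb(500, ploidy=4)
print(f"{depth}x coverage, {gb:.0f} Gb of long reads")
```

Any allowance for organelle reads or for reads lost during filtering would need to be added on top of this figure.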

De novo genome assembly using TGST generally consists of four stages: (1) raw read
self-mapping; (2) error correction; (3) assembly of corrected reads; and (4) contig
consensus polishing. Stage 3, which is considered to be the key OLC assembly stage, comprises
three internal steps: (i) find overlaps (suffix tree based or dynamic programming) and
build an overlap graph, (ii) resolve the graph (layout), and (iii) call the sequence consensus.
Stage 3 may also involve read-mapping again, but, because the error rate is much reduced at
this step, it is easier and faster than stage 1 [24]. The most commonly employed de novo assem-
blers for long reads and their associated programs are summarized together with their functional
features in Tables 3 and 4.
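The overlap and layout steps of stage 3 can be sketched with a toy greedy assembler. This is an illustration only, assuming error-corrected reads from a single strand: exact suffix-prefix matching stands in for overlap detection, greedy merging stands in for resolving an overlap graph, and the consensus step is trivial because the reads agree exactly. The function names are hypothetical and do not correspond to any of the assemblers in Table 3.

```python
def overlap_length(a, b, min_overlap):
    """Length of the longest suffix of a that equals a prefix of b (0 if < min_overlap)."""
    max_len = min(len(a), len(b))
    for length in range(max_len, min_overlap - 1, -1):
        if a.endswith(b[:length]):
            return length
    return 0


def greedy_olc(reads, min_overlap=3):
    """Repeatedly merge the pair of sequences with the longest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_len, best_i, best_j = 0, None, None
        # Overlap step: all-versus-all suffix-prefix comparison.
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    olen = overlap_length(a, b, min_overlap)
                    if olen > best_len:
                        best_len, best_i, best_j = olen, i, j
        if best_len == 0:
            return contigs  # no remaining overlaps: these are the contigs
        # Layout step (greedy): join the best pair and continue.
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for idx, c in enumerate(contigs)
                   if idx not in (best_i, best_j)] + [merged]


print(greedy_olc(["ATGGCGT", "GGCGTGC", "GTGCAAT"]))
```

The all-versus-all comparison is quadratic in the number of reads, which is why practical long-read assemblers rely on fast approximate overlappers rather than exhaustive matching.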

Drawing from recent reports of successful plant genome assemblies [55,56], we suggest the
pipeline: PacBio and/or ONT read sequencing ► read-quality assessment, evaluation, and filtering
(including removing organelle DNA reads [89]) ► assembly (single and/or multiple assemblers) ► a
single consensus sequence ► error correction and polishing ► assessment ► subchromosome
scaffolding assembly ► chromosome-level mapping assembly ► annotation (Figure 2). Relative
to the previous short-read-derived reference genomes, recent PacBio-based Triticum urartu
(wheat A subgenome) [79], Zea mays (maize B73) [58], and Saccharum spontaneum L. [76]
assemblies have increased contig lengths by 101-, 52-, and sixfold, respectively. They also have
notable improvements in the assembly of intergenic spaces and centromeres. Recent ONT-
based assemblies of Solanum pennellii [29] and Arabidopsis thaliana (KBS-Mac-74 accession)
[88] achieved contig N50 lengths of 2.5 Mb and 14.8 Mb, respectively, which would be impossible using strategies
based only on Illumina reads.
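Because contig N50 is the contiguity metric used throughout these comparisons, a short sketch of its calculation may be helpful. The function name `n50` is chosen for illustration; the definition used is the standard one, namely the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the total assembly length.

```python
def n50(lengths):
    """Return the N50 of a list of contig (or scaffold) lengths."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    total = sum(lengths)
    running = 0
    # Accumulate contigs from longest to shortest until half the assembly is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


print(n50([2, 2, 2, 3, 3, 4, 8, 8]))
```

N50 rewards contiguity but says nothing about correctness, so a misjoined assembly can show a higher N50 than an accurate one.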

Several recent publications have reported excellent genome-assembly qualities from PacBio
or ONT read-assemblies that have been polished/corrected with Illumina short-read data [29,
88,115,116,122]. Given the substantial depth of coverage of TGST reads, they alone may be
sufficient for consensus calling (self-polishing), but incorporation of Illumina paired-end (PE)
and/or mate pair (MP) reads for extra rounds of polishing generally gives better consensus
base accuracy [28,29,88,102,116,123,124].

Hybrid Assembly Approaches


Combining data from both TGST and SGST, in what has been termed a 'hybrid assembly', can
compensate for the downsides of both approaches (i.e., high error rates in long-reads, and the
propensity of short reads to generate fragmented assemblies). Using SGST data to correct errors
in TGST reads has been very successful in producing contiguous and accurate de novo assem-
blies for both animal [125–127] and plant species [67,85–87,128–130].

Based on the results of the hybrid-based assemblies of apple [69], durian [66], Prince-of-Wales
feather [64], quinoa [65], rice [74], and Tausch’s goatgrass [63] genomes, our suggested strategy


Table 3. Summary of De Novo Genome Assemblers for Long-Read Sequences (see footnotes a–d)

Program | Input format | Error correction | Description | Refs
ABruijn/Flye Fasta Yes A very fast OLC-based de novo assembler using de Bruijn graphs (DBGs) for long-read data [15,16]
Flye (successor of ABruijn) performs an extra repeat classification and analysis step to
improve the structural accuracy of the resulting sequence including a polisher module
The ABruijn algorithm comprises a series of steps: K-mer counting/selection for error
correction ► overlapping based on the ABruijn graph ► preassembly ► generating of a
rough consensus from repeating graphs longer than minimum overlap ► draft contigs
from the unbranching paths in the graph ► polishing to increase the final contig quality
Accessible from a stand-alone command line
CANU Fasta Yes A fork of the CA, designed for long-read data based on OLC [17]
Fastq A hierarchical assembly pipeline that has four steps: detect overlaps in high-noise
sequences using MHAP ► generate corrected sequence consensus ► trim corrected
sequences ► assemble trimmed corrected sequences
Accessible from a stand-alone command line while taking full advantage of any
LSF/PBS/PBSpro/Torque/Slurm/SGE grid options
FALCON Fasta Yes A set of tools for fast aligning of long reads for consensus and assembly based on OLC [18]
Specifically designed for PacBio reads to efficiently assemble haploid and diploid genomes
(diploid-aware assembler)
'HGAP' comprises a series of steps: raw subreads overlapping for error correction ►
preassembly and error correction ► overlapping detection of the error corrected reads
► overlap filtering ► graph construction from overlaps ► contig construction from graph
Accessible either from SMRT link or a stand-alone command line
FALCON-Unzip Fasta Yes Specifically designed modules working with FALCON for fully phased diploid assembly [18]
(representing haplotype specific contigs as 'haplotigs' as assembly output)
Accessible only from a stand-alone command line with limited cluster computational
environments
Finisher-SC Fasta Yes A repeat-aware module working with the HGAP pipeline and MUMMER alignment to [19]
produce higher-quality assemblies that can be consistently obtained after
post-processing for long-read data
A series of steps: error correction (HGAP) ► preassembly (CA) ► improved assembly
(FinisherSC) ► merging of contigs (produces longer and higher-quality contigs than
existing tools while maintaining high concordance)
Accessible from a stand-alone command line
HGAP Fasta Yes Specifically designed for PacBio data to allow the complete and accurate shotgun [20]
assembly of a wide range of genome sizes and complexity based on OLC
A succession of steps: preassembly ► de novo assembly ► consensus polishing
HGAP4 uses FALCON for de novo assembly
Accessible either from SMRT Link or a stand-alone command line
Work-friendly with SGE grid option
HINGE Fasta No A de novo assembler tool based on the OLC with repeat-resolution capabilities of DBG [21]
assemblers using an idea called 'hinging' for long-read data
A series of steps: pairwise overlaps (DAligner) ► read filter (remove chimeric reads and
place hinges) ► repeat annotation ► overlap (use hinging to construct graph) ►
hinge-aided greedy assembly ► alignment and consensus
Accessible from a stand-alone command line
MARVEL Fasta Yes A largely self-contained assembler consisting of a set of tools that facilitate the [22]
overlapping, patching, correction, and assembly of noisy long-read data
A series of steps: overlap ► patch reads (in lieu of correction) ► overlap (align and repeat
masking) ► scrubbing ► assembly graph construction and touring ► optional read
correction ► Fasta file creation
Accessible from a stand-alone command line
MECAT Fasta Yes An ultrafast mapping, error-correction, and de novo assembly tool using CA for long-read [23]
Fastq data
A set of four modules: pairwise mapping (mecat2pw) ► reference mapping
(mecat2ref) ► error correction based on pairwise overlaps (mecat2cns) ► Canu pipeline
(mecat2canu)
Accessible from a stand-alone command line
Minimap/miniasm GFA No A very fast OLC-based de novo assembler for long-read data [24]
A series of steps: crude read selection ► fine read selection ► generation of a string graph
► merging of unambiguous overlaps to produce unitig sequences
A fast de novo assembler but high consensus sequence error rate similar to raw input reads
Prone to collapse repeats or segmental duplications longer than input reads (difficult to fix
without error correction)
Accessible from a stand-alone command line
PoreSeq Fasta Yes An assembly tool for de novo sequencing, consensus, and variant calling on Nanopore data [25]
A series of steps: de novo error correction without reference using overlap alignment
► reference error correction ► scoring known sequence variants on a given dataset
► straightforward subdivision of processing for cluster/parallel tasks
Accessible from a stand-alone command line
SMART-denovo Fasta No A de novo assembler using all-versus-all raw read alignment without error correction for long reads https://github.com/ruanjue/smartdenovo
A useful tool to generate accurate consensus sequences using dependent consensus
polish tools
A series of steps: read overlapping ► rescue missing overlaps ► identification of
low-quality regions and chimeras ► produce better unitig consensus
Accessible from a stand-alone command line
Spectrassembler Fasta Yes A de novo assembler using all-versus-all raw read mapping for long reads [26]
A useful tool to generate a high-quality consensus through a coverage-based consensus generation process
A series of steps: layout computation (compute alignments with minimap) ► consensus
generation ► overlap-based similarity and repeats handling ► produces a better contig
consensus
Accessible from a stand-alone command line
SPRAI Fastq Yes Specifically designed for de novo assembly of PacBio reads http://sprai-doc.readthedocs.io/en/latest/index.html
A succession of steps: prepare data ► prepare Sprai ► correct errors and assemble
► find contigs
Accessible from a stand-alone command line
Work-friendly with SGE grid option
Trio Binning Fasta No Specifically designed modules working with CANU 1.7 (TrioCANU) for fully phased diploid [27]
assembly (similar to FALCON-Unzip and Supernova)
Requires moderate coverage of short (30× Illumina) and long reads (80× PacBio, 40× per
haplotype) to count and subtract K-mers for both parental genomes
Accessible only from a stand-alone command line
TULIP Fasta No A prototype tool for de novo assembly of Nanopore reads [28]
SAM A succession of steps using two Perl scripts (tulipseed and tulipbulb): input data
► alignment ► configuration ► TULIP seed layout ► TULIP bundling
Accessible from a stand-alone command line
WTDBG Fasta No A fuzzy Bruijn graph (FBG) de novo assembler using all-versus-all raw read alignment for [29]
Fastq long reads
A novel sequence alignment (K-mer–BIN–mapping, KBM) algorithm and a new assembly
graph (FBG) for efficient assembly of large genomes
A series of steps: read-mapping using KBM ► FBG assembly ► produces a better contig
and unitig consensus (SMARTdenovo is a progenitor of WTDBG)
Accessible from a stand-alone command line

a
This table does not include any single-cell sequencing, transcriptome, organelle genome assemblers (mitochondrial, chloroplast, and plasmid), bacterial/metagenome
assemblers (microbial and smaller genomes <10 Mb), basecalling/variant calling, SV, and methylation detection. In addition, this table does not consider any of the
following measurements, namely user time, system time, CPU time, real time (wall clock time), or maximum memory usage for each assembly tool and dataset because
these can differ depending on sequencing coverage and the dataset. Deprecated programs have been removed from the list (Nanocorrect and pacBioToCA).
b
Abbreviations: CA, Celera Assembler; FALCON, fast alignment and consensus for assembly; GFA, graphical fragment assembly format; HGAP, hierarchical genome
assembly process; MECAT, mapping, error correction, and de novo assembly tool; MHAP, MinHash alignment process; OLC, overlap–layout–consensus; PAF,
pairwise read mapping format; POA, partial order alignment; SGE, Sun Grid Engine; SPRAI, single-pass read accuracy improver; and TULIP, the uncorrected
long-read integration process.
c
Long-read data: PacBio and Oxford Nanopore Technology (ONT).
d
Polishing tools: Quiver (ideal for RS II); Arrow (ideal for SEQUEL); Pilon (ideal for Illumina data); and Nanopolish (ideal for Nanopore data) using BLASR, BWA-MEM, and
pbalign.


is: ONT/PacBio and any SGST (10xGC read sequencing is recommended) ► read-quality
assessment, evaluation, and filtering ► assembly (single and/or multiple assemblers) ► a
single consensus sequence ► error correction and polishing ► subchromosome scaffolding
► chromosome-level mapping assembly ► annotation (Figure 2). Incorporating ONT reads
from the PromethION platform appears to generate high-quality, phased de novo assemblies for
diploid genomes of similar quality to those incorporating PacBio reads [28,29,88,116,125,131,
132], but at a lower cost. These hybrid approaches have achieved much greater sequence con-
tiguity than Illumina-only assemblies: a three- (Pinus taeda) [87] and sevenfold (Malus domestica
Borkh.) [128] improvement of contig N50 size for PacBio merged with Illumina data, and 100- to
450-fold higher contig N50s (Brassica rapa Z1, Brassica oleracea HDEM, and Musa schizocarpa)
for ONT merged with Illumina data [133].

Hybrid assemblies also greatly benefit from incorporating 10xGC (>100 kb) and/or Hi-C (>1 Mb)
data. This information facilitates the ordering and linking of the scaffolds to produce whole-
chromosome pseudomolecules [57,61,62,66,111] (Table 1). The 10xGC data can also provide
long-range information on a genome-wide scale, including variant calling, phasing, and extensive
characterization of genomic structure, giving researchers access to low-complexity and repetitive
regions that were previously missed by short-read sequencing [108]. Recent studies have
highlighted the efficacy and cost-effectiveness of 10xGC linked-reads in diploid plant genome
de novo assembly by resolving long and highly similar repetitive regions [107,109]; the utility of
this technology for complex and/or polyploid plant genomes is still being investigated [53].
Using single-molecule sequencing in combination with linked-reads enables a genome sequence
assembly with both high sequence and scaffold contiguity, a feat not currently achievable with
either technology alone.

Error Correction and Polishing of the Consensus Sequences


To increase the accuracy and assembly of a consensus sequence, sequencing errors within
TGST sequences and within assembled sequences need to be corrected. This process is termed
‘polishing’. A list of programs for error correction and polishing can be found in Tables 3 and 4,
and in Tables S1–S3 in the supplemental information online. Recent work has evaluated
and benchmarked multiple aligners that are used to enhance the accuracy of read-mapping
[102,104,121]. Although all of the tested aligners performed well with sequence read-lengths
>100 bp, some tools still showed a lack of specificity, particularly in aligning tandem repeats.
According to Chu and colleagues [121], Minimap was the most computationally efficient (in both
time and memory) and the most sensitive method on ONT datasets. However, Minimap was not
as sensitive or as specific as GraphMap, DALIGNER, or MHAP on the PacBio datasets.
GraphMap and DALIGNER were the most specific and sensitive methods on PacBio datasets,
and DALIGNER scaled better computationally. Aligner choice depends largely on factors such
as genome features, and a suitable aligner can enhance the overall accuracy of error correction
and consensus polishing [102,104,121].
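The principle behind consensus polishing can be illustrated with a majority vote over reads that have already been aligned to a draft sequence. This is a deliberately simplified sketch: it assumes that the alignment has been computed elsewhere and is supplied as equal-length strings with `-` marking gaps, and the function name `majority_consensus` is hypothetical. The polishing tools listed in the tables use statistical models of platform-specific errors rather than a simple vote.

```python
from collections import Counter


def majority_consensus(aligned_reads, gap="-"):
    """Call a consensus from pre-aligned, equal-length reads by majority vote.

    Columns in which the gap symbol wins are dropped from the consensus,
    which removes insertions supported by only a minority of reads.
    """
    if not aligned_reads:
        raise ValueError("aligned_reads must not be empty")
    width = len(aligned_reads[0])
    if any(len(read) != width for read in aligned_reads):
        raise ValueError("all aligned reads must have the same length")
    consensus = []
    for column in zip(*aligned_reads):
        base, _count = Counter(column).most_common(1)[0]
        if base != gap:
            consensus.append(base)
    return "".join(consensus)


pileup = [
    "ACGTTAGC",
    "ACGATAGC",  # substitution error at position 4
    "ACGTTAGC",
    "ACG-TAGC",  # deletion error at position 4
    "ACGTTCGC",  # substitution error at position 6
]
print(majority_consensus(pileup))
```

Because the errors in individual reads are largely independent, sufficient depth allows the correct base to dominate each column, which is the rationale for the coverage recommendations given earlier.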

Gap Filling and Scaffolding Assembly Approach


Post-processing approaches such as gap filling and scaffolding can be applied to preassem-
bled contigs to increase N50 length and decrease the total number of contigs/scaffolds (light-
blue boxes in Figure 2). It is important to note the difference between contigs and scaffolds.
Scaffolding is often employed for SGST assemblies to order and join short contigs in fragmented
genome assemblies. Recent high-quality genome assemblies in eukaryotes have highlighted
three principal deficiencies of scaffolding [18,58,69,117]. First, in forming scaffolds it is easy to
join contigs ‘across’ GC-rich and repetitive sequence regions, thereby missing important struc-
tural features in these regions of the assembly. Second, the amount of sequence in any given
gap that a scaffold spans often has a poor relationship to the true gap size. This lack of


Table 4. Summary of Adapter Removal, Mapping, Error Correction, and Polishing Tools for Long-Read Sequences (see footnotes a–c)
Functionality | Program | Input format | Description | Refs
Adapter BBMap/ Fasta A tool for finding and removing internal PacBio adapter http://jgi.doe.gov/data-and-tools/bbtools/
removal BBTools Fastq sequences
Accessible either from SMRT Link or a stand-alone
command line
Need to convert bax.h5 files to fasta/fastq using
bash5tools.py
Consensus- bax.h5 Specifically designed for PacBio reads (SMRTbell) https://github.com/PacificBiosciences/SMRT-Analysis/wiki/ConsensusTools-v2.3.0-Documentation
Tools Accessible either from SMRT Link or a stand-alone
command line
Cutadapt Fasta A tool for finding and removing adapter sequences from [30]
Fastq high-throughput sequencing reads including PacBio and
Nanopore
Accessible from a stand-alone command line
Porechop Fasta A tool for finding and removing adapters from Nanopore https://github.com/rrwick/Porechop
Fastq reads
Accessible from a stand-alone command line
Mapping/ BBMap/ Fasta A splice-aware global aligner for high-throughput http://jgi.doe.gov/data-and-tools/bbtools/
alignment BBTools Fastq sequencing reads including PacBio and Nanopore
Accessible either from SMRT Link or a stand-alone
command line
Need to convert bax.h5 files to fasta/fastq using
bash5tools.py
BLASR bas.h5 Not a splice-aware aligner but can be used to align [31]
Fasta transcript sequences to the genome
Fastq Good performance for long reads
Accessible either from SMRT Link or a stand-alone
command line
BWA-MEM Fasta An alignment tool to support long-read data and https://github.com/lh3/bwa
Fastq chimeric alignment for high-throughput sequencing
reads including PacBio and Nanopore
Need to construct the FM-index first for the reference
genome using the index command
Three key alignment algorithms: aln/samse/sampe for
BWA-backtrack, bwasw for BWA-SW, and mem for the
BWA-MEM algorithm
Accessible from a stand-alone command line with limited
performance for queries longer than 10 Mb
COSINE Fasta An alignment tool utilizing a new method (K-mer size) for [32]
long-read data
Accessible from a stand-alone command line
DALIGNER Fasta An alignment tool (embedded as the Dazzler 'Overlap' https://github.com/thegenemyers/DALIGNER
module) to find all pairwise local alignments for long-read
data
Accessible from a stand-alone command line
GMAP/ Fasta An alignment tool for short (spliced transcripts) and http://research-pub.gene.com/gmap/
GSNAP Fataq long-reads (b1 Mb) data
Accessible from a stand-alone command line
GraphMap Fasta A mapper targeted at aligning long, error-prone reads [33]
Fastq including Illumina, PacBio, and Nanopore
Accessible from a stand-alone command line
HISEA Fasta An efficient all-versus-all read aligner for PacBio [34]
Fastq Can be integrated into the CANU assembly pipeline
Accessible from a stand-alone command line
LAST Fasta An alignment tool for long-read data using adaptive [35]
seeds that copes more efficiently with repeat-rich

Trends in Plant Science, August 2019, Vol. 24, No. 8 715
reference sequences
Accessible either from a stand-alone command line or
a web service with effective performance for query
sequences ranging from 100 bp to 100 Mb
Need to convert fastq to fasta format
marginAlign Fasta A package tool for sequence alignment and SNVs calling [36]
Fastq of Nanopore reads
Accessible from a stand-alone command line
Mash Fasta A tool for fast distance estimation alignment using [37]
Fastq MinHash for high-throughput sequencing reads
including PacBio and Nanopore
Accessible from a stand-alone command line
MHAP Fasta An alignment tool (locality sensitive hashing) to detect [38]
.dat overlaps and utilities for long-read data
Accessible from a stand-alone command line
minialign/ Fasta An alignment tool for long-reads built on three key https://github.com/ocxtal/minialign
minimap Fastq algorithms: minimizer-based index of the minimap
overlapper, array-based seed chaining, and
SIMD-parallel SWG extension
Accessible from a stand-alone command line
NanoOK Fasta A tool for extraction, alignment, and analysis of [39]
Fastq Nanopore reads
Accessible either from a stand-alone command line or
Mac OS platforms
NGMLR Fasta A specifically designed tool to quickly and correctly align [40]
Fastq long reads for spanning (complex) structural variations https://github.com/philres/ngmlr
(SVs) using an SV-aware K-mer search based on a
Smith–Waterman alignment algorithm
Accessible from a stand-alone command line
pbalign .h5 A specifically designed tool for PacBio reads https://github.com/PacificBiosciences/pbalign
Fasta Accessible either from SMRT Link or a stand-alone
command line
Need to convert bax.h5 files to bam files using bax2bam
STAR Fasta An alignment tool for short- (spliced transcripts) and https://github.com/alexdobin/STAR
Fastq long-read (<50 kb) data
Accessible from a stand-alone command line
Error Falcon_sense Fasta A tool for error correction using the consensus-calling [18]
correction Fastq algorithm in FALCON to preserve the information from
Bam heterozygous SNPs for PacBio reads
Accessible either from a stand-alone command line or
SMRT Link
Frame-Pro Fasta A profile homology search tool using HMM and DAG for [41]
m5 PacBio reads
(BLASR) Accessible from a stand-alone command line
Hmm
(Pfam)
LoRMA Fasta An iterative alignment-free correction method for [42]
long-read data
Accessible from a stand-alone command line
LRCstats Fasta A novel pipeline using SimLORD or PBSim simulator to [43]
SAM measure the accuracy of sequencing errors for long reads
Accessible from a stand-alone command line
pbdagcon Fasta A tool for sequence alignment and consensus using https://github.com/PacificBiosciences/pbdagcon
DAGCon for PacBio reads
Accessible only from a stand-alone command line

Sparc Fasta A sparsity-based consensus algorithm for error [44]
correction of high-throughput sequencing reads
including PacBio and Nanopore
Accessible from a stand-alone command line
Consensus Arrow .bam An HMM model for sequence consensus and variants for https://github.com/PacificBiosciences/GenomicConsensus
polish .xml PacBio (RSII and SEQUEL) reads
Accessible either from a stand-alone command line
(GenomicConsensus) or SMRT Link
Nanopolish Fasta A package tool for consensus sequence, methylation, [45]
Fastq and SNP calling of Nanopore reads using HMM-based
consensus calling
Accessible from a stand-alone command line
Quiver .cmp.h5 A more sophisticated tool to find the maximum https://github.com/PacificBiosciences/GenomicConsensus
.fofn quasi-likelihood template sequence for PacBio (RSII) reads
.xml Accessible either from a stand-alone command line
(GenomicConsensus) or SMRT Link
Racon Fasta A consensus module for de novo assembly of long-read [46]
Fastq data based on a POA graph approach
MHAP A series of steps: layout ► aligning reads and segmentation
PAF (optional error correction) ► POA graph (SIMD-accelerated)
► segment splicing ► consensus sequence
Works effectively with miniasm to enable consensus
genomes with similar or better quality than state-of-the-art
methods while being an order of magnitude faster
Accessible from a stand-alone command line

a This table does not include any single-cell sequencing, transcriptome, organelle genome assemblers (mitochondrial, chloroplast, and plasmid), bacterial/metagenome
assemblers (microbial and smaller genomes <10 Mb), basecalling/variant calling, SV, or methylation detection. In addition, this table does not consider measurements
such as user time, system time, CPU time, real time (wall clock time), or maximum memory usage for each assembly tool and dataset because these vary greatly depending
on sequencing coverage and the dataset. In the case of cross-contamination in raw data (e.g., plastids, viruses, and bacteria), removing the contaminated reads with
sequence removers (Cutadapt and Porechop) before employing genome assemblers could be helpful to improve assembly speed and accuracy.
b Abbreviations: DAGCon, directed acyclic graph consensus; HMM, hidden Markov model; HISEA, hierarchical seed aligner; MHAP, MinHash alignment process; NGMLR,
convex gap-cost alignments for long reads; RACON, rapid consensus; SIMD, single instruction multiple data; SNP, single-nucleotide polymorphism; SNV, single-nucleotide
variation; SWG, Smith–Waterman–Gotoh.
c Long-read data: PacBio and Oxford Nanopore Technology (ONT).

concordance can affect our ability to understand the true physical distance between functional
elements in genomes. Third, the sequence flanking the newly scaffolded sequence can be of
low quality, which can result in misassembly owing to the deficiencies of SGST (GC-bias or
read-length limitations).

Despite the potential deficiencies of scaffolding, closing gaps in draft genomes is still an important
post-processing step in genome assembly. However, if the aim of closing gaps in draft genomes is to
introduce actual nucleotide sequence (rather than 'filling' with Ns), extra 10xGC, PacBio, and
ONT reads can effectively aid gap filling before the polishing and scaffolding
stages. The reason for this is that SCSA, mainly from BioNano and Hi-C data, acts to improve
assembly quality by correcting misassemblies and/or ordering scaffolds based on the given
input (e.g., an assembly file from a previous step).
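To make the distinction between Ns and real sequence concrete, the following minimal Python sketch counts the runs of N (gaps) that remain in a set of scaffolds, which is one simple way to check whether a gap-filling step actually replaced unknown bases. The function names (`find_gaps`, `gap_summary`) and the toy scaffolds are hypothetical and are not taken from any tool discussed here.

```python
import re
from typing import Dict, List, Tuple


def find_gaps(scaffold: str, min_len: int = 1) -> List[Tuple[int, int]]:
    """Return half-open (start, end) coordinates of every run of N/n."""
    return [
        (m.start(), m.end())
        for m in re.finditer(r"[Nn]+", scaffold)
        if m.end() - m.start() >= min_len
    ]


def gap_summary(scaffolds: Dict[str, str]) -> Dict[str, int]:
    """Count gaps, unknown bases, and total bases across all scaffolds."""
    gaps = n_bases = total = 0
    for seq in scaffolds.values():
        runs = find_gaps(seq)
        gaps += len(runs)
        n_bases += sum(end - start for start, end in runs)
        total += len(seq)
    return {"gaps": gaps, "n_bases": n_bases, "total_bases": total}


# Toy draft assembly (hypothetical sequences, for illustration only).
draft = {
    "scaffold_1": "ACGT" + "N" * 5 + "ACGT",
    "scaffold_2": "GGG" + "NN" + "CCC" + "NNN" + "TT",
}
summary = gap_summary(draft)
print(summary)
```

Running the same summary before and after gap filling gives a quick indication of how many unknown bases were replaced by sequence.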

Another post-processing approach to improve genome contiguity could be to merge assemblies
from multiple assemblers (Table S3 in the supplemental material online). A recent investigation
conducted by Alhakami and colleagues [134] evaluated contiguity, correctness, coverage, and
the duplication ratio of the merged assemblies compared with the individual assemblies as
input. For the scaffolding and meta-assembling approaches, a potential strategy to consider is:
TGST, SGST, and hybrid read sequencing ► read-quality assessment, evaluation, and filtering
► assembly from multiple assemblers (multiple parameters) ► scaffolding and/or merging ► a
single consensus assembled sequence ► error correction and polishing ► assessment
and decision ► subchromosome scaffolding assembly ► chromosome-level mapping
assembly ► annotation (Figure 2).

An earlier review [135] gives excellent guidance for whole-genome sequencing projects using
tools and technologies developed before 2015; we have focused on tools released since then
(Table S2 in the supplementary material online).

Subchromosome Scaffolding Assembly


All assemblies derived only from sequence reads will contain misassemblies (inversions and
translocations) that are largely caused by the inability of both sequencing and assembly pipelines
to cope with long tracts of repeat sequences. These issues are further compounded by high levels
of heterozygosity, as well as by polyploidization, that are common in many plant species. Two meth-
odologies, BioNano and Hi-C, can improve the assembly quality by validating the integrity of the initial
assembly, correcting misorientations, and ordering the scaffolds. These methods generally improve
the scaffold N50 length by at least fivefold (Figure 2 red boxes, and Table 2). However, it is important
to secure the most contiguous, complete, and minimally fragmented genome assembly to feed into
the SCSA approach. If the initial assembly falls short in terms of the quality metrics discussed above,
further improvement by incorporating more 10xGC, PacBio, or ONT data is highly recommended.

BioNano is a non-sequence-based scaffolding method (next-generation mapping, NGM) that uses
endonucleases to nick long DNA molecules at the enzyme recognition site, upon which fluorescent
nucleotides are incorporated and the long strands are repaired. This results in long (>150 kb)
fragments of DNA with fluorescent labels at each endonuclease nick site in that molecule. Separation
and detection of the labeled fragments allows mapping of the sequence-specific labeled sites within
long contiguous DNA molecules, resolving misalignments within the source DNA.
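The labeling step can be mimicked in silico on an assembled sequence, which is how a sequence assembly is typically compared with an optical map. The sketch below is a simplified illustration only: the motif GCTCTTC (the recognition site of the nicking enzyme Nt.BspQI) is an assumption chosen for the example and is not specified in the text, and the function names are hypothetical.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp[base] for base in reversed(seq.upper()))


def label_positions(seq: str, motif: str = "GCTCTTC") -> list:
    """Return sorted 0-based start positions of the motif on either strand.

    The default motif is an assumption (Nt.BspQI site), used for illustration.
    """
    seq = seq.upper()
    sites = set()
    for pattern in {motif, reverse_complement(motif)}:
        start = seq.find(pattern)
        while start != -1:
            sites.add(start)
            start = seq.find(pattern, start + 1)
    return sorted(sites)


def label_spacings(positions: list) -> list:
    """Distances between consecutive labels, the pattern an optical map records."""
    return [b - a for a, b in zip(positions, positions[1:])]


# Toy contig: forward motif, reverse-strand motif, forward motif.
contig = "AA" + "GCTCTTC" + "AAAAA" + "GAAGAGC" + "TTTT" + "GCTCTTC" + "AA"
positions = label_positions(contig)
print(positions, label_spacings(positions))
```

Comparing the predicted label spacings with the spacings measured on long molecules is what allows misjoined or misoriented contigs to be flagged; real tools additionally model sizing error and missing or extra labels, which this sketch ignores.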

Hi-C, a chromatin-association/interaction analysis method, involves formaldehyde-mediated
crosslinking of cellular contents, followed by isolation and digestion of DNA, labeling DNA ends
with biotin, followed by proximity ligation of these ends, recovery of DNA, library synthesis, and
Illumina-based paired-end sequencing. Each pair of reads represents a single chromatin contact
[136]. Subsequent computational analysis of the data allows reconstruction of chromatin interactions
that reveal wider sequence structures.
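Because each read pair represents one contact, the core of the computational analysis is a count of contacts between genomic bins. The following minimal sketch assumes that read pairs have already been mapped and are available as ((contig, position), (contig, position)) tuples; this input format, the bin size, and the function names are hypothetical choices for illustration, not the interface of any Hi-C scaffolder.

```python
from collections import Counter


def contact_counts(pairs, bin_size=100_000):
    """Count Hi-C contacts between fixed-size bins.

    Each key is an order-independent pair of (contig, bin_index) tuples.
    """
    counts = Counter()
    for (ctg_a, pos_a), (ctg_b, pos_b) in pairs:
        bin_a = (ctg_a, pos_a // bin_size)
        bin_b = (ctg_b, pos_b // bin_size)
        counts[tuple(sorted((bin_a, bin_b)))] += 1
    return counts


def inter_contig_links(counts):
    """Sum contacts between different contigs, the signal used to order scaffolds."""
    links = Counter()
    for (bin_a, bin_b), n in counts.items():
        if bin_a[0] != bin_b[0]:
            links[(bin_a[0], bin_b[0])] += n
    return links


# Toy mapped read pairs (hypothetical coordinates).
pairs = [
    (("ctg1", 5_000), ("ctg1", 250_000)),
    (("ctg1", 990_000), ("ctg2", 10_000)),
    (("ctg2", 20_000), ("ctg1", 950_000)),
    (("ctg1", 100), ("ctg3", 400_000)),
]
counts = contact_counts(pairs)
links = inter_contig_links(counts)
print(links.most_common())
```

In this toy example ctg1 and ctg2 share the most contacts, so a scaffolder would treat them as the most likely neighbors; production tools also normalize for contig length and restriction-site density.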

Both methods have been used successfully, and a comparison of the two methods is presented
in the supplemental information online. Some noticeable differences between these approaches
have been reported, and in general Hi-C data have been found to resolve longer segments of
chromosomes compared with BioNano, allowing near-chromosome-level assembly quickly,
cheaply, and accurately [59–62,75,126]. Other studies have also pointed out that Hi-C
approaches combined with PE, MP, or long-read sequencing could be effective at increasing
the resolution of the spatial arrangement of chromosomes through the detection and quantifica-
tion of pairwise chromatin interactions across the genome [59,64,66,75,137]. In particular, if
genetic maps are available, the creation of long-distance chromatin interaction maps by Hi-C
data should be considered for the final assembly step to generate pseudochromosomes from
the more detailed 3D genomic structures [61,62].

Assessment of Assembly Quality


Estimating assembly quality requires several statistical and biological validations. These include
overall assembly size (determining the match to the estimated genome size), measures of
assembly contiguity (N50, NG50, NA50, or NGA50; number of contigs; contig length; and contig
mean length), assembly likelihood scores (calculated by aligning reads against each candidate
assembly), and completeness of the genome assembly (BUSCO scores and/or RNA-Seq map-
ping) [138]. Agreement with data on quantitative trait loci (QTL), fluorescent in situ hybridization
(FISH) experiments using bacterial artificial chromosome (BAC) clones, and contiguity of
the genome assembly with a chromosome-level genetic map are strong indicators of quality.
If an initial assembly attempt is not satisfactory, three specific areas (contiguity, accuracy, and
completeness) [138] should be considered to determine the best path forward to improve the
quality of the de novo assembly (confirm/refine in green boxes in Figure 2). To address high contig
numbers with low average size it is generally best to acquire and incorporate more TGST or
10xGC (see Hybrid Assembly Approaches) reads. Attempting to increase assembly quality
through additional SCSA data is unlikely to be helpful because the data are usually ineffective in
assisting hybrid assemblers to span gaps between existing contigs. Addition of more and longer
TGST reads is often more productive in bridging existing contigs by increasing average contig
size; subsequent addition of further SCSA data will then improve read accuracy and the overall
contiguity of assemblies. However, if the resulting assembly still has >1000 contigs (i.e., is still
highly fragmented), increasing the amount of SCSA data alone is unlikely to result in a dramatic
improvement.
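As an illustration of the contiguity measures named above, the sketch below computes N50 and NG50 from a list of contig lengths. N50 is the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the assembly; NG50 applies the same rule but uses the estimated genome size instead of the assembly size. The function name `nx` and the toy lengths are hypothetical; NA50 and NGA50 additionally require alignment to a reference and are not covered here.

```python
def nx(lengths, fraction=0.5, reference_size=None):
    """Return the Nx (or NGx, if reference_size is given) of a set of contigs.

    Returns 0 when the contigs together cover less than the requested
    fraction of reference_size.
    """
    total = reference_size if reference_size is not None else sum(lengths)
    threshold = total * fraction
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= threshold:
            return length
    return 0


# Toy contig lengths in kb (assembly size 300 kb).
contigs = [80, 70, 50, 40, 30, 20, 10]
n50 = nx(contigs)                         # threshold: 150 kb
ng50 = nx(contigs, reference_size=400)    # threshold: 200 kb
print(n50, ng50)
```

Because NG50 is tied to the estimated genome size rather than to the assembly itself, it is the more comparable of the two when candidate assemblies differ in total length.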

Discussion and Recommendations


De novo genome assembly is a rapidly evolving area of research. The speed of innovation is
being driven both through technological innovations in sequence data generation and through
community efforts to improve computational approaches to assemble data quickly and cost-
effectively. However, the overall quality of a genome assembly is affected by all components
of the pipeline, including the quality and integrity of the input DNA, genome size, genome
organization, and computational design. Paajanen and colleagues [139] have benchmarked
assemblies for completeness and accuracy, as well as input DNA, computational require-
ments, and sequencing costs (Box 1). They focused their benchmarking on a single diploid
species in which the use of hybrid scaffolding (Illumina and/or PacBio + CHiCAGO and/or
BioNano) was examined. Our review extends their work by addressing other approaches
(Illumina and/or 10xGC + PacBio and/or ONT) and the use of SCSA approaches such as
Hi-C. In doing so, we offer an updated guide for plant genome sequencing projects focusing
on the value and use of TGST sequencing platforms, analytical tools, and assembly strategies,
and provide a decision tree that may assist researchers contemplating genome assemblies for
nonmodel plants including potential polyploid species.

Box 1. An Example of a Plant Genome Sequencing Project

A good example of a recent assembly using the approaches outlined in the main text has been the assembly of Solanum
verrucosum (a diploid wild potato from Mexico, ~722 Mb). This was derived using SGST (Illumina) and TGST (PacBio). In
the assembly, various assemblers were compared that utilized both short and long reads, singly or in combination. In
general, genome assembly from short reads was inferior to that from long reads. Long-read assemblies produced by
Falcon were better than Canu and HGAP3 [139]. There have been only two reports of short-read assemblies after scaffolding
with a mate-paired library using Soapdenovo2 (DISCOVAR-MP was better than ABySS-MP) that were comparable
to the results of long-read assemblers [139]. In addition, use of linked long-read (LLR) technology from 10xGC and its
associated assembler, Supernova, gave an outcome similar to that of Falcon with TGST reads. Incorporation of further
SCSA data (105× depth of CHiCAGO Library, Dovetail) or an optical map (350× depth of BioNano) with multiple assembly
approaches (Discovar, Falcon, and Supernova), substantially improved the selected assembly contiguity by ~fivefold.

Assembly-quality evaluation criteria for the assemblies produced using the above approaches, such as K-mer and BUSCO
content, indicated that there was little difference among the final assemblies: Discovar (0.15% and 99.97%), Falcon (0.66%
and 99.87%), and Supernova (1.3% and 99.40%). Gap filling by PBJelly, using a small amount of PacBio data (8× depth),
slightly increased the assembled genome size and N50 values by reducing unknown sequence (Ns).

From this work, it appears that (i) the 10xGC technology gives high-quality and accurate assembly, similarly to PacBio, but
at a significantly lower cost in diploid plant genomes; and (ii) incorporation of additional longer-range reads and/or optical
mapping (by BioNano) assists in resolving repetitive regions and greatly increases scaffolding contiguity.

Regardless of the strategy chosen by researchers, the ultimate goal of genome sequencing
projects is to produce the single best assembly in a cost- and time-effective manner, and if
possible to a chromosome-level assembly on which scaffolds are anchored [58,65,69]. Each
approach and tool that we have examined has limitations based on compromises inherent in
the different algorithms and the assumptions used. Therefore, we recommend that several
tools/approaches are used at each stage (assembly, correction, polishing, and scaffolding)
and that their outputs are compared to determine the best combinations of tools for the data
at hand. In addition, it is imperative to optimize each of the individual program parameters within
the pipeline for the given dataset to produce a best-quality assembly. It is common practice to
generate multiple genome assemblies from different assemblers, parameters, and algorithms
(e.g., merging phase), and then try to predict the best assembly and/or to improve the contiguity
and quality of each assembly into a single superior genome sequence [134,140–142]. Even
with the recent advances in sequencing technology and computational analysis, this remains
the best approach.

Outstanding Questions

Is there a de novo assembler that can combine 10xGC and TGST reads from the raw data step to reach
phase-separation? Will it be available for polyploid species?

Is there any alternative algorithm beyond OLC for long-read and/or hybrid-read assembly to reduce
computational time and minimize storage space?

What could be the next effective hybrid algorithm for correcting read errors (i.e., increase accuracy)
as well as for expanding contiguity?

What future techniques can produce accurate (>99%) and long reads (>1 Mb) that can resolve segmental
duplications and heterochromatic regions in plants?

Will it be helpful to reach chromosome-level resolution without genetic markers/maps? If yes, can any
practical approach be considered to produce a fast and accurate reference genome?

Will it be possible to sequence and assemble individual chromosomes after sorting chromosomes? Could
this work for a wide range of plant chromosomes?

Concluding Remarks and Future Perspectives

Despite the fact that there is no perfect plant genome assembly, many high-quality plant genome
assemblies have been achieved over the past 5 years thanks to the availability of high-throughput
TGST and SGST data. In particular, a combination of PacBio/ONT long reads with 10xGC, Hi-C,
or BioNano data has dramatically improved diploid de novo assembly and SV detection within
organisms [29,58,66,75,79,116]. Impressive strides have been made in the production of
genome assemblies for large and complex plant genomes using these combined TGST and
SCSA approaches [57,65,66,143,144]. Long-read sequencing methods are facilitating the spanning
of previously problematic and impenetrable repetitive regions of genome sequence, and in
doing so have provided unprecedented opportunities to resolve these regions of a genome
and improve both the assembly and annotation of plant genomes [76,78,79]. Long reads can
also provide contiguous RNA transcript data, offering new solutions for finding new genes and
precisely identifying variant splice isoforms of genes [145,146]. However, when presented with
larger/polyploid genomes with a high repeat content, it is not always easy for researchers
to choose the best approach for genome assembly. This often results in trade-offs between
sequencing cost, assembly approaches, and accuracy due to differences among sequencing
platforms and analytical tools.

Although the immediate future is arguably focused on improving TGST long-read approaches
and developing fourth-generation sequencing technologies (FGSTs), these are currently
still expensive options (cost per base) compared with SGST approaches. Thus, for the medium
term, although read accuracy and costs are improving for TGST approaches, research into hybrid
methods generating a combination of data types maximizing the positive characteristics of each
(e.g., cost, quality, and read-length) may be effective in achieving more complete and accurate
genomes. It is likely that adoption of a hybrid approach (10xGC + ONT/PacBio + Hi-C) will
often be optimal in terms of cost and accuracy when matched with an appropriate genome
assembly pipeline. Unfortunately, application of FGST in plant genomes has not yet been well
reported, and it will be interesting to watch its development in the coming years.

Sequence acquisition methodologies are improving rapidly, but bioinformatic approaches to deal
with polyploid or aneuploid genome assemblies are virtually absent. This is a clear area for
improvement, and will be greatly facilitated by the ongoing development of hybrid approaches to
produce phase-separated chromosomal data in polyploid systems. Some progress is already
being made with pipelines [76] such as Falcon-Phase [147], Trio binning [27], and highly efficient
repeat assembly [148].

The obvious goal is to develop methods that produce and join sequences into accurate, contig-
uous, and entire-chromosome sequences, and also at low cost. Continued advances in both
sequencing and bioinformatic technology hold promise that this is not very far away. Researchers
will be able to spend less time in assembling genomes and focus more on exploring the biology of
genomes to gain a deeper understanding of genomic diversity, evolution, epigenomics, and gene
function. No doubt, this will accelerate the process of plant breeding and the production of
improved varieties in a wide range of crops [10,78,149]. We hope that the decision tree we
have developed, alongside our summary of analytical tools, and leading-edge technologies, will
aid and encourage researchers to expand the already impressive spectrum of high-quality plant
genome resources (see Outstanding Questions).

Acknowledgments
The authors are grateful to their colleagues, collaborators, and field/technical specialists of each company for their valuable
comments, especially to Matthew Hodgett, Michal Lorenc, Ana Pavasovic, and two anonymous reviewers. This project is
supported by an Australian Research Council (ARC) Laureate Fellowship (LF160100155) to P.W.

Supplemental Information
Supplemental information associated with this article can be found online at https://doi.org/10.1016/j.tplants.2019.05.003.

References
1. Pellicer, J. et al. (2018) Genome size diversity and its impact on the evolution of land plants. Genes 9, 88
2. Wang, P. et al. (2018) Factors influencing gene family size variation among related species in a plant family, Solanaceae. Genome Biol. Evol. 10, 2596–2613
3. Payne, A. et al. (2018) BulkVis: a graphical viewer for Oxford Nanopore bulk FAST5 files. Bioinformatics Published online November 20, 2018. https://doi.org/10.1093/bioinformatics/bty841
4. Arabidopsis Genome Initiative (2000) Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature 408, 796–815
5. The 1001 Genome Consortium (2016) 1,135 genomes reveal the global pattern of polymorphism in Arabidopsis thaliana. Cell 166, 481–491
6. Escalona, M. et al. (2016) A comparison of tools for the simulation of genomic next-generation sequencing data. Nat. Rev. Genet. 17, 459–469
7. Goodwin, S. et al. (2016) Coming of age: ten years of next-generation sequencing technologies. Nat. Rev. Genet. 17, 333–351
8. Chen, Q. et al. (2017) Recent advances in sequence assembly: principles and applications. Brief. Funct. Genomics 16, 361–378
9. Mardis, E.R. (2017) DNA sequencing technologies: 2006–2016. Nat. Protoc. 12, 213–218
10. Yuan, Y. et al. (2017) Improvement of genomics technologies: application to crop genomics. Trends Biotechnol. 35, 547–558
11. Sedlazeck, F.J. et al. (2018) Piercing the dark matter: bioinformatics of long-range sequencing and mapping. Nat. Rev. Genet. 19, 329–346
12. Cheng, S. et al. (2018) 10KP: a phylodiverse genome sequencing plan. GigaScience 7, 1–9
13. Chen, F. et al. (2018) The sequenced angiosperm genomes and genome databases. Front. Plant Sci. 9, 418
14. Liu, H. et al. (2019) Molecular digitization of a botanical garden: high-depth whole genome sequencing of 689 vascular plant species from the Ruili Botanical Garden. GigaScience Published online April 1, 2019. https://doi.org/10.1093/gigascience/giz007
15. Lin, Y. et al. (2016) Assembly of long error-prone reads using de Bruijn graphs. Proc. Natl. Acad. Sci. 113, E8396–E8405
16. Kolmogorov, M. et al. (2019) Assembly of long error-prone reads using repeat graphs. Nat. Biotechnol. 37, 540–546
17. Koren, S. et al. (2017) Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation. Genome Res. 27, 722–736
18. Chin, C.-S. et al. (2016) Phased diploid genome assembly with single-molecule real-time sequencing. Nat. Methods 13, 1050–1054
19. Lam, K.-K. et al. (2015) FinisherSC: a repeat-aware tool for upgrading de novo assembly using long reads. Bioinformatics 31, 3207–3209
20. Chin, C.S. et al. (2013) Nonhybrids, finished microbial genome assemblies from long-read SMRT sequencing data. Nat. Methods 10, 563–569
21. Kamath, G.M. et al. (2017) HINGE: long-read assembly achieves optimal repeat resolution. Genome Res. 27, 747–756
22. Grohme, M.A. et al. (2018) The genome of Schmidtea mediterranea and the evolution of core cellular mechanisms. Nature 554, 56–61
23. Xiao, C.L. et al. (2017) MECAT: an ultra-fast mapping, error correction and de novo assembly tool for single-molecule sequencing reads. Nat. Methods 14, 1072–1074
24. Li, H. (2016) Minimap and miniasm: fast mapping and de novo assembly for noisy long sequences. Bioinformatics 32, 2103–2110
25. Szalay, T. and Golovchenko, J.A. (2015) De novo sequencing and variant calling with nanopores using PoreSeq. Nat. Biotechnol. 33, 1087–1091
26. Recanati, A. et al. (2017) A spectral algorithm for fast de novo layout of uncorrected long nanopore reads. Bioinformatics 33, 3188–3194
27. Koren, S. et al. (2018) De novo assembly of haplotype-resolved genomes with trio binning. Nat. Biotechnol. 36, 1174–1182
28. Jansen, H.J. et al. (2017) Rapid de novo assembly of the European eel genome from nanopore sequencing reads. Sci. Rep. 7, 7213


29. Schmidt, M.H.W. et al. (2017) De novo assembly of new Solanum pennellii accession using nanopore sequencing. Plant Cell 29, 2336–2348
30. Martin, M. (2011) Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet. J. 17, 10–12
31. Chaisson, M.J. and Tesler, G. (2012) Mapping single molecule sequencing reads using basic local alignment with successive refinement (BLASR): application and theory. BMC Bioinformatics 13, 238
32. Afshar, P.T. and Wong, W.H. (2017) COSINE: non-seeding method for mapping long noisy sequences. Nucleic Acids Res. 45, e132
33. Sović, I. et al. (2016) Evaluation of hybrid and non-hybrid methods for de novo assembly of nanopore reads. Bioinformatics 32, 2582–2589
34. Khiste, N. and Ilie, L. (2017) HISEA: HIerarchical SEed Aligner for PacBio data. BMC Bioinformatics 18, 564
35. Kielbasa, S.M. et al. (2011) Adaptive seeds tame genomic sequence comparison. Genome Res. 21, 487–493
36. Jain, M. et al. (2015) Improved data analysis for the MinION nanopore sequencer. Nat. Methods 12, 351–356
37. Ondov, B.D. et al. (2016) Mash: fast genome and metagenome distance estimation using MinHash. Genome Biol. 17, 132
38. Berlin, K. et al. (2015) Assembling large genomes with single-molecule sequencing and locality-sensitive hashing. Nat. Biotechnol. 33, 623–630
39. Leggett, R.M. et al. (2016) NanoOK: multi-reference alignment analysis of nanopore sequencing data, quality and error profiles. Bioinformatics 32, 142–144
40. Sedlazeck, F.J. et al. (2018) Accurate detection of complex structural variations using single molecule sequencing. Nat. Methods 15, 461–468
41. Du, N. and Sun, Y. (2016) Improved homology search sensitivity of PacBio data by correcting frameshifts. Bioinformatics 32, i529–i537
42. Salmela, L. et al. (2017) Accurate self-correction of errors in long reads using de Bruijn graphs. Bioinformatics 33, 799–806
43. La, S. et al. (2017) LRCstats, a tool for evaluating long reads correction methods. Bioinformatics 33, 3652–3654
44. Ye, C. and Ma, Z. (2016) Sparc: a sparsity-based consensus algorithm for long erroneous sequencing reads. PeerJ 4, e2016
45. Loman, N.J. et al. (2015) A complete bacterial genome assembled de novo using only nanopore sequencing data. Nat. Methods 12, 733–735
46. Vaser, R. et al. (2017) Fast and accurate de novo genome assembly from long uncorrected reads. Genome Res. 27, 737–746
47. Sohn, J.I. and Nam, J.W. (2018) The present and future of de novo whole-genome assembly. Brief. Bioinform. 19, 23–40
48. van Dijk, E.L. et al. (2018) The third revolution in sequencing technology. Trends Genet. 34, 666–681
49. Wee, Y.K. et al. (2019) The bioinformatics tools for the genome assembly and analysis based on third-generation sequencing. Brief. Funct. Genomics 18, 1–12
50. Lischer, H.E.L. and Shimizu, K.K. (2017) Reference-guided de novo assembly approach improves genome reconstruction for related species. BMC Bioinformatics 18, 474
51. Garg, S. et al. (2018) A graph-based approach to diploid genome assembly. Bioinformatics 34, i105–i114
52. Kolmogorov, M. et al. (2018) Chromosome assembly of large and complex genomes using multiple references. Genome Res. 28, 1720–1732
53. Kyriakidou, M. et al. (2018) Current strategies of polyploidy plant genome sequence assembly. Front. Plant Sci. 9, 1660
57. Avni, R. et al. (2017) Wild emmer genome architecture and diversity elucidate wheat evolution and domestication. Science 357, 93–97
58. Jiao, Y. et al. (2017) Improved maize reference genome with single-molecule technologies. Nature 546, 524–527
59. Mascher, M. et al. (2017) A chromosome conformation capture ordered sequence of the barley genome. Nature 544, 427–433
60. Moll, K.M. et al. (2017) Strategies for optimizing BioNano and Dovetail explored through a second reference quality assembly for the legume model, Medicago truncatula. BMC Genomics 18, 578
61. Lin, D. et al. (2018) Digestion-ligation-only Hi-C is an efficient and cost-effective method for chromosome conformation capture. Nat. Genet. 50, 754–763
62. Wang, M. et al. (2018) Evolutionary dynamics of 3D genome architecture following polyploidization in cotton. Nat. Plants 4, 90–97
63. Luo, M.C. et al. (2017) Genome sequence of the progenitor of the wheat D genome Aegilops tauschii. Nature 551, 498–502
64. Lightfoot, D.J. et al. (2017) Single-molecule sequencing and Hi-C-based proximity-guided assembly of amaranth (Amaranthus hypochondriacus) chromosomes provide insights into genome evolution. BMC Biol. 15, 74
65. Jarvis, D.E. et al. (2017) The genome of Chenopodium quinoa. Nature 542, 307–312
66. Teh, B.T. et al. (2017) The draft genome of tropical fruit durian (Durio zibethinus). Nat. Genet. 49, 1633–1641
67. Pootakham, W. et al. (2017) De novo hybrid assembly of the rubber tree genome reveals evidence of paleotetraploidy in Hevea species. Sci. Rep. 7, 41457
68. Reyes-Chin-Wo, S. et al. (2017) Genome assembly with in vitro proximity ligation data and whole-genome triplication in lettuce. Nat. Commun. 8, 14953
69. Daccord, N. et al. (2017) High-quality de novo assembly of the apple genome and methylome dynamics of early fruit development. Nat. Genet. 49, 1099–1106
70. Bredeson, J.V. et al. (2016) Sequencing wild and cultivated cassava and related species reveals extensive interspecific hybridization and genetic diversity. Nat. Biotechnol. 34, 562–570
71. Martin, G. et al. (2016) Improvement of the banana 'Musa acuminata' reference sequence using NGS data and semi-automated bioinformatics methods. BMC Genomics 17, 243
72. Xu, S. et al. (2017) Wild tobacco genomes reveal the evolution of nicotine biosynthesis. Proc. Natl. Acad. Sci. 114, 6133–6138
73. Edwards, K.D. et al. (2017) A reference genome for Nicotiana tabacum enables map-based cloning of homeologous loci implicated in nitrogen utilization efficiency. BMC Genomics 18, 448
74. Du, H. et al. (2017) Sequencing and de novo assembly of a near complete indica rice genome. Nat. Commun. 8, 15324
75. Raymond, O. et al. (2018) The Rosa genome provides new insights into the domestication of modern roses. Nat. Genet. 50, 772–777
76. Zhang, J. et al. (2018) Allele-defined genome of the autopolyploid sugarcane Saccharum spontaneum L. Nat. Genet. 50, 1565–1573
77. Zhang, L. et al. (2017) The Tartary Buckwheat genome provides insights into rutin biosynthesis and abiotic stress tolerance. Mol. Plant 10, 1224–1237
78. International Wheat Genome Sequencing Consortium (2018) Shifting the limits in wheat research and breeding using a fully annotated reference genome. Science 361, eaar7191
54. Rhoads, A. and Au, K.F. (2015) PacBio sequencing and 79. Ling, H.Q. et al. (2018) Genome sequence of the progenitor of
its applications. Genomics Proteomics Bioinformatics 13, wheat A subgenome Triticum Urartu. Nature 557, 424–428
278–289 80. Mayjonade, B. et al. (2016) Extraction of high-molecular-weight
55. VanBuren, R. et al. (2015) Single-molecule sequencing of the genomic DNA for long-read sequencing of single molecules.
desiccation-tolerant grass Oropetium thomaeum. Nature 527, BioTechniques 61, 203–205
508–511 81. Denis, E. et al. (2018) Extracting high molecular genomic
56. Jiao, W.B. et al. (2017) Improving and correcting the contiguity DNA from Saccharomyces cerevisiae. Protocol Exchange
of long-read genome assemblies of three plant species using https://doi.org/10.1038/protex.2018.076
optical mapping and chromosome conformation capture 82. Workman, R. (2018) High molecular weight DNA extraction
data. Genome Res. 27, 778–786 from recalcitrant plant species for third generation sequencing.

722 Trends in Plant Science, August 2019, Vol. 24, No. 8



Protoc. Exch. Published online April 27, 2018. https://www.nature.com/protocolexchange/protocols/6785
83. Schalamun, M. et al. (2019) Harnessing the MinION: an example of how to establish long-read sequencing in a laboratory using challenging plant tissue from Eucalyptus pauciflora. Mol. Ecol. Resour. 19, 77–89
84. Li, F.W. and Harkess, A. (2018) A guide to sequence your favourite plant genomes. Appl. Plant Sci. 6, e1030
85. Zimin, A.V. et al. (2017) Hybrid assembly of the large and highly repetitive genome of Aegilops tauschii, a progenitor of bread wheat, with the MaSuRCA mega-reads algorithm. Genome Res. 27, 787–792
86. Zimin, A.V. et al. (2017) The first near-complete assembly of the hexaploid bread wheat genome, Triticum aestivum. GigaScience 6, 1–7
87. Zimin, A.V. et al. (2017) An improved assembly of the loblolly pine mega-genome using long-read single-molecule sequencing. GigaScience 6, 1–4
88. Michael, T.P. et al. (2018) High contiguity Arabidopsis thaliana genome assembly with a single nanopore flow cell. Nat. Commun. 9, 541
89. Soorni, A. et al. (2017) Organelle_PBA, a pipeline for assembling chloroplast and mitochondrial genomes from PacBio DNA sequencing data. BMC Genomics 18, 49
90. Liu, W. et al. (2017) Computing platforms for big biological data analytics: perspectives and challenges. Comput. Struct. Biotechnol. J. 15, 403–411
91. Dahlö, M. et al. (2018) Tracking the NGS revolution: managing life science research on shared high-performance computing clusters. GigaScience 7, 1–11
92. Yelick, K. et al. (2011) The Magellan Report on Cloud Computing for Science. Office of Advanced Scientific Computing Research
93. Langmead, B. and Nellore, A. (2018) Cloud computing for genomic data analysis and collaboration. Nat. Rev. Genet. 19, 208–219
94. Ocaña, K. and De Oliveira, D. (2015) Parallel computing in genomic research: advances and applications. Adv. Appl. Bioinforma. Chem. 8, 23–35
95. Kawalia, A. et al. (2015) Leveraging the power of high performance computing for next generation sequencing data analysis: tricks and twists from a high throughput exome workflow. PLoS One 10, 1–16
96. Kulkarni, P. and Frommolt, P. (2017) Challenges in the setup of large-scale next-generation sequencing analysis workflows. Comput. Struct. Biotechnol. J. 15, 471–477
97. Compeau, P.E. et al. (2011) How to apply de Bruijn graphs to genome assembly. Nat. Biotechnol. 29, 987–991
98. Kajitani, R. et al. (2014) Efficient de novo assembly of highly heterozygous genomes from whole-genome shotgun short reads. Genome Res. 24, 1384–1395
99. Liu, B. et al. (2016) BASE: a practical de novo assembler for large genomes using long NGS reads. BMC Genomics 17, 499
100. Pryszcz, L.P. and Gabaldón, T. (2016) Redundans: an assembly pipeline for highly heterozygous genomes. Nucleic Acids Res. 44, e113
101. Utturkar, S.M. et al. (2014) Evaluation and validation of de novo and hybrid assembly techniques to derive high-quality genome sequences. Bioinformatics 30, 2709–2716
102. Heydari, M. et al. (2017) Evaluation of the impact of Illumina error correction tools on de novo genome assembly. BMC Bioinformatics 18, 374
103. Smith, H.E. and Yun, S. (2017) Evaluating alignment and variant-calling software for mutation identification in C. elegans by whole-genome sequencing. PLoS One 12, e0174446
104. Thankaswamy-Kosalai, S. et al. (2017) Evaluation and assessment of read-mapping by multiple next-generation sequencing aligners based on genome-wide characteristics. Genomics 109, 186–191
105. Schatz, M.C. et al. (2012) Current challenges in de novo plant genome sequencing and assembly. Genome Biol. 13, 243
106. Michael, T.P. and Jackson, S. (2013) The first 50 plant genomes. Plant Genome 6, 1–7
107. Hulse-Kemp, A.M. et al. (2018) Reference quality assembly of the 3.5 Gb genome of Capsicum annuum from a single linked-read library. Hortic. Res. 5, 4
108. Jackman, S.D. et al. (2018) Tigmint: correcting assembly errors using linked reads from large molecules. BMC Bioinformatics 19, 393
109. Liu, Q. et al. (2018) Assembly and annotation of a draft genome sequence for Glycine latifolia, a perennial wild relative of soybean. Plant J. 95, 71–85
110. Ott, A. et al. (2018) Linked read technology for assembling large complex and polyploid genomes. BMC Genomics 19, 651
111. Marks, P. et al. (2019) Resolving the full spectrum of human genome variation using linked-reads. Genome Res. 29, 635–645
112. Ashton, P.M. et al. (2015) MinION nanopore sequencing identifies the position and structure of a bacterial antibiotic resistance island. Nat. Biotechnol. 33, 296–300
113. Jain, M. et al. (2017) MinION analysis and reference consortium: phase 2 data release and analysis of R9.0 chemistry. F1000Res 6, 760
114. Debladis, E. et al. (2017) Detection of active transposable elements in Arabidopsis thaliana using Oxford Nanopore sequencing technology. BMC Genomics 18, 537
115. Leggett, R.M. and Clark, M.D. (2017) A world of opportunities with nanopore sequencing. J. Exp. Bot. 68, 5419–5429
116. Jain, M. et al. (2018) Nanopore sequencing and assembly of a human genome with ultra-long reads. Nat. Biotechnol. 36, 338–345
117. Gordon, D. et al. (2016) Long-read sequence assembly of the gorilla genome. Science 352, aae0344
118. Magi, A. et al. (2018) Nanopore sequencing data analysis: state of the art, applications and challenges. Brief. Bioinform. 19, 1256–1272
119. Rang, F.J. et al. (2018) From squiggle to basepair: computational approaches for improving nanopore sequencing read accuracy. Genome Biol. 19, 90
120. Volden, R. et al. (2018) Improving nanopore read accuracy with the R2C2 method enables the sequencing of highly multiplexed full-length single-cell cDNA. Proc. Natl. Acad. Sci. 115, 9726–9731
121. Chu, J. et al. (2017) Innovations and challenges in detecting long read overlaps: an evaluation of the state-of-the-art. Bioinformatics 33, 1261–1270
122. Kchouk, M. and Elloumi, M. (2016) Hybrid error correction approach and de novo assembly for MinION sequencing long reads. In IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pp. 122–125, IEEE
123. Carvalho, A.B. et al. (2016) Improved assembly of noisy long reads by k-mer validation. Genome Res. 26, 1710–1720
124. Cao, M.D. et al. (2017) Scaffolding and completing genome assemblies in real-time with nanopore sequencing. Nat. Commun. 8, 14515
125. Mostovoy, Y. et al. (2016) A hybrid approach for de novo human genome sequence assembly and phasing. Nat. Methods 13, 587–590
126. Bickhart, D.M. et al. (2017) Single-molecule sequencing and chromatin conformation capture enable de novo reference assembly of the domestic goat genome. Nat. Genet. 49, 643–650
127. Weissensteiner, M.H. et al. (2017) Combination of short-read, long-read and optical mapping assemblies reveals large-scale tandem repeat arrays with population genetic implications. Genome Res. 27, 697–708
128. Li, X. et al. (2016) Improved hybrid de novo genome assembly of domesticated apple (Malus x domestica). GigaScience 5, 35
129. Clavijo, B.J. et al. (2017) An improved assembly and annotation of the allohexaploid wheat genome identifies complete families of agronomic genes and provides genomic evidence for chromosomal translocations. Genome Res. 27, 885–896
130. Miller, J.R. et al. (2017) Hybrid assembly with long and short reads improves discovery of gene family expansions. BMC Genomics 18, 541
131. Goodwin, S. et al. (2015) Oxford nanopore sequencing, hybrid error correction, and de novo assembly of a eukaryotic genome. Genome Res. 25, 1750–1756


132. Madoui, M.-A. et al. (2015) Genome assembly using Nanopore-guided long and error-free DNA reads. BMC Genomics 16, 327
133. Belser, C. et al. (2018) Chromosome-scale assemblies of plant genomes using nanopore long reads and optical maps. Nat. Plants 4, 879–887
134. Alhakami, H. et al. (2017) A comparative evaluation of genome assembly reconciliation tools. Genome Biol. 18, 93
135. Hunt, M. et al. (2014) A comprehensive evaluation of assembly scaffolding tools. Genome Biol. 15, R42
136. Belaghzal, H. et al. (2017) Hi-C 2.0: an optimized Hi-C procedure for high-resolution genome-wide mapping of chromosome conformation. Methods 123, 56–65
137. Ghurye, J. et al. (2017) Scaffolding of long read assemblies using long range contact information. BMC Genomics 18, 527
138. Conte, M.A. et al. (2017) A high quality assembly of the Nile tilapia (Oreochromis niloticus) genome reveals the structure of two sex determination regions. BMC Genomics 18, 341
139. Paajanen, P. et al. (2019) A critical comparison of technologies for a plant genome sequencing project. GigaScience 8, 1–12
140. Wences, A.H. and Schatz, M.C. (2015) Metassembler: merging and optimizing de novo genome assemblies. Genome Biol. 16, 207
141. Chakraborty, M. et al. (2016) Contiguous and accurate de novo assembly of metazoan genomes with modest long read coverage. Nucleic Acids Res. 44, e147
142. Lam, K.-K. et al. (2016) BIGMAC: breaking inaccurate genomes and merging assembled contigs for long read metagenomic assembly. BMC Bioinformatics 17, 435
143. Thind, A.K. et al. (2017) Rapid cloning of genes in hexaploid wheat using cultivar-specific long-range chromosome assembly. Nat. Biotechnol. 35, 793–796
144. Thind, A.K. et al. (2018) Chromosome-scale comparative sequence analysis unravels molecular mechanisms of genome dynamics between two wheat cultivars. Genome Biol. 19, 104
145. Chen, X. et al. (2018) Transcriptome-referenced association study of clove shape traits in garlic. DNA Res. 25, 587–596
146. Wang, B. et al. (2018) A comparative transcriptional landscape of maize and sorghum obtained by single-molecule sequencing. Genome Res. 28, 921–932
147. Kronenberg, Z.N. et al. (2018) Extended haplotype phasing of de novo genome assemblies with FALCON-Phase. bioRxiv. Published online April 19, 2019. https://doi.org/10.1101/327064
148. Du, H. and Liang, C. (2018) Assembly of chromosome-scale contigs by efficiently resolving repetitive sequences with long reads. bioRxiv. Published online June 13, 2018. http://doi.org/10.1101/345983
149. Schreiber, M. et al. (2018) Genomic approaches for studying crop evolution. Genome Biol. 19, 140
