Appl Plant Sci - 2019 - Mangelson - The Genome of Chenopodium Pallidicaule An Emerging Andean Super Grain
Manuscript received 9 July 2019; revision accepted 24 September 2019.
1 Department of Plant and Wildlife Sciences, Brigham Young University, 5144 LSB, Provo, Utah 84602, USA
2 Institute of Natural Product Research, Universidad Mayor de San Andrés, La Paz, Bolivia
3 Departamento de Fitotecnia, Facultad de Agronomía, Universidad Nacional Agraria de La Molina, La Molina, Peru
4 Author for correspondence: [email protected]
Citation: Mangelson, H., D. E. Jarvis, P. Mollinedo, O. M. Rollano-Penaloza, V. D. Palma-Encinas, L. R. Gomez-Pando, E. N. Jellen, and P. J. Maughan. 2019. The genome of Chenopodium pallidicaule: An emerging Andean super grain. Applications in Plant Sciences 7(11): e11300.
doi:10.1002/aps3.11300
PREMISE: Cañahua is a semi-domesticated crop grown in high-altitude regions of the Andes. It is an A-genome diploid (2n = 2x = 18) relative of the allotetraploid (AABB) Chenopodium quinoa and shares many of its nutritional benefits. Cañahua seed contains a complete protein, a low glycemic index, and offers a wide variety of nutritionally important vitamins and minerals.
METHODS: The reference assembly was developed using a combination of short- and long-read sequencing techniques, including multiple rounds of Hi-C–based proximity-guided assembly.
RESULTS: The final assembly of the ~363-Mbp genome consists of 4633 scaffolds, with 96.6% of the assembly contained in nine scaffolds representing the nine haploid chromosomes of the species. Repetitive element analysis classified 52.3% of the assembly as repetitive, with the most common repeat identified as long terminal repeat retrotransposons. MAKER annotation of the final assembly yielded 22,832 putative gene models.
DISCUSSION: When compared with quinoa, strong patterns of synteny support the hypothesis that cañahua is a close A-genome diploid relative, and thus potentially a simplified model diploid species for genetic analysis and improvement of quinoa. Resequencing and phylogenetic analysis of a diversity panel of cañahua accessions suggests that coordinated efforts are needed to enhance genetic diversity conservation within ex situ germplasm collections.
Chenopodium pallidicaule Aellen (also known as cañahua) is a species of the goosefoot family (Amaranthaceae) related to the increasingly popular seed crop, quinoa (C. quinoa Willd). Gade (1970) noted that cañahua is a partially domesticated crop that provides food security to many subsistence farmers across the Andean Altiplano, the high plateau situated at 3500–4200 m above sea level between the western and eastern Andean Cordilleras of west-central South America. Cultivation of cañahua dates back more than 7000 years when it was a staple crop in ancient Incan and pre-Incan societies. It has several common names in native languages, including cañahua, cañigua, cañihua, cañawa, and kañiwa (Gade, 1970). After the Spanish Conquest, cultivation was likely discouraged in colonial society due to its association with indigenous cultures (Ruas et al., 1999). Although it never completely regained its former status, subsistence farmers across the Andean region continue to grow cañahua due to its tolerance to abiotic stress (i.e., frost, drought, and salinity) in addition to its high nutritional quality. Despite the increasing popularity of its close relative quinoa, cañahua remains practically unknown and underutilized as a food resource (Rastrelli et al., 1996) outside of the Andes.
Cañahua has a unique nutritional profile that is ideal for human consumption in areas where protein is limited. Its seed contains 15–18% protein, with a complete set of essential amino acids, including 5–6% lysine, which is typically limiting in monocotyledonous grain crops (Penarrieta et al., 2008). In addition to high-quality protein, cañahua offers a wide variety of other health-promoting compounds, including antioxidants, phenols, and flavonoids (Repo-Carrasco-Valencia et al., 2010). Cañahua seeds contain vanillic acid, a phenolic compound which acts as a flavor enhancer and lends a pleasant taste to cañahua, particularly when ground and toasted as a flour called cañihuaco (Penarrieta et al., 2008). With a poverty rate of nearly 50% in the rural highlands of the Altiplano, cañahua represents an incredibly important resource in the prevention of poverty-induced malnutrition and in improving food security throughout the region (Repo-Carrasco et al., 2003).
Applications in Plant Sciences 2019 7(11): e11300; http://www.wileyonlinelibrary.com/journal/AppsPlantSci © 2019 Mangelson et al. Applications in Plant Sciences is published by Wiley Periodicals, Inc. on behalf of the Botanical Society of America. This is an open access article under the terms of the Creative Commons Attribution License, which permits use, distribution and reproduction in any medium, provided the original work is properly cited.
Gade (1970) noted nearly half a century ago that the continued presence of cañahua in the Altiplano will depend on its genetic transformation into a more efficient crop. Agronomic issues that have prevented more extensive cultivation of cañahua include non-uniform seed ripening and small seed size that make harvesting and processing of the seed difficult (Mujica, 1994). Despite its unique agronomic and nutritional qualities, very few of the genetic resources needed to accelerate the improvement of cañahua have been investigated. Ruas et al. (1999) published a phylogenetic study of 19 Chenopodium L. species based on RAPD markers, including two cañahua accessions that were found to be nearly identical. Vargas et al. (2011) developed the first microsatellite markers for cañahua, including 34 polymorphic markers, exhibiting a total of 154 different alleles. A phylogeny of 43 cañahua accessions showed clear distinctions between wild and cultivated lines, including a distinct subclade of erect morphotypes. Kolano et al. (2011) cytologically characterized the genome size and rDNA loci of cañahua. Their findings predicted a 2C value for the cañahua genome of 0.886 ± 0.034 pg (~433 Mbp per haploid genome) with a single copy of both 35S (subterminal) and 5S (interstitial) rDNA loci.
As a part of the genome analysis of quinoa, Jarvis et al. (2017) reported a draft assembly of the cañahua genome (accession PI 478407). Quinoa is an allotetraploid (2n = 4x = 36), presumably resulting from a relatively recent (3.3–6.3 mya) polyploidization event between North American and Eurasian diploids representing the A and B subgenomes of modern quinoa, respectively (Štorchová et al., 2015). Although cañahua is not believed to be the direct A-genome donor of quinoa, it is a related A-genome diploid. The draft genome reported by Jarvis et al. (2017) was based solely on Illumina short reads and was thus highly fragmented, consisting of 3015 scaffolds and spanning a total length of 337 Mbp, with an N50 of 356 kbp. Here we report the use of PacBio long reads and Hi-C–based proximity-guided assembly to develop a reference-quality, chromosome-scale assembly of cañahua. The genome was fully annotated using a deeply sequenced transcriptome developed from six combinations of tissue types and abiotic stresses. Additionally, genetic diversity within the species was characterized with a panel of 30 cultivated and wild cañahua varieties.
METHODS
Plant material
The cañahua accession PI 478407 was used to develop the reference assembly. It was originally collected in 1981 at the Instituto Boliviano de Tecnologia, Patacamaya, Bolivia, and is freely available from the United States Department of Agriculture (USDA; Ames, Iowa, USA; https://npgsweb.ars-grin.gov/). The diversity panel consisted of 30 accessions from three germplasm collections: specifically, eight cañahua varieties from the USDA collection (https://npgsweb.ars-grin.gov/), one landrace and two wild accessions from the Universidad Nacional Agraria La Molina (UNALM; Lima, Peru), and 21 accessions from Universidad Mayor de San Andrés (UMSA; La Paz, Bolivia). A complete list of all plant materials used is provided in Table 1.
Whole genome assembly
In vivo Hi-C and proximity-guided assembly techniques were used to improve the previously published short-read draft assembly reported by Jarvis et al. (2017), referred to hereafter as the ALLPATHS-LG short-read assembly (ASRA). Fresh leaf tissue from a single, dark-treated (72 h) 3-week-old plant that was derived directly from selfing of the original cañahua ‘PI 478407’ plant used by Jarvis et al. (2017) was sent to Phase Genomics (Seattle, Washington, USA) for in vivo Hi-C–based proximity-guided ligation and 80-bp paired-end sequencing followed by alignment to the ASRA assembly using BWA version 0.7 (Li and Durbin, 2010). Only reads that aligned uniquely to the scaffolds were retained. Proximo, a proximity-guided assembly method based on the ligating adjacent chromatin enables scaffolding in situ (LACHESIS) assembler (Burton et al., 2013), was used to cluster, order, and orient scaffolds from the ASRA assembly, producing the first proximity-guided assembly (PGA1). Following the development of PGA1, long reads were used for gap-filling. High-molecular-weight DNA was extracted from leaf tissue of a single, 72-h dark-treated cañahua (PI 478407) plant using the QIAGEN Genomic-tip 500/G Kit (Hilden, Germany). Single-molecule, real-time sequencing using the PacBio Sequel platform (Menlo Park, California, USA) was performed at the Brigham Young University DNA Sequencing Center (Provo, Utah, USA). The PBJelly2 pipeline from PBSuite version 15.8.24 (English et al., 2012) was used to align the long reads to PGA1 in order to gap-fill the assembly. Arrow version 0.22.0 (Chin et al., 2013) and Pilon version 1.22 (Walker et al., 2014) were used for genome polishing with the previously described PacBio long reads and Illumina paired-end reads, respectively. This gap-filled and polished assembly is henceforth referred to as PGA1.5. To correct for possible errors introduced by low PacBio read coverage and relaxed PBJelly2 parameters, a contig-breaking tool, Polar Star (https://github.com/phasegenomics/polar_star), was employed. Polar Star aligns long reads to an assembly, then calculates the read depth at each base. Read depth is smoothed in a 100-bp sliding window, then regions of high, low, and normal read depth are merged. These classifications are made based on the read depth distribution. Low-read-depth outliers are identified, and the assembly is broken at each such location. Following Polar Star, PGA1.5 underwent a second de novo, proximity-guided assembly. Assembly errors (inversions and rearrangements) were identified and adjusted manually using Juicebox version 1.9.8 (Durand et al., 2016). The result was a chromosome-scale, polished assembly referred to as PGA2 (Appendix S1).
Transcriptome assembly, gene annotation, and repeat modeling
RNA-Seq data was generated on the Illumina Hi-Seq platform from cañahua (PI 478407) leaf, root, inflorescence, and apical meristem tissues grown in both non-stressed and salt-stressed conditions, as detailed by Jarvis et al. (2017). The reads were trimmed using Trimmomatic version 0.32 (Bolger et al., 2014) to remove Illumina adapters and trailing bases with a quality score below 20, then aligned to the PGA2 reference using HiSat2 version 2.0.4 (Kim et al., 2015; Pertea et al., 2016) with default parameters except the max intron length was set to 50,000 bp. After alignment, the resulting sequence alignment map (SAM) file was sorted and indexed using SAMtools version 1.6 (Li et al., 2009) and assembled into putative transcripts using StringTie version 1.3.4 (Pertea et al., 2016). Whole-genome annotation of the PGA2 assembly was performed by MAKER version 2.31.10 (Holt and Yandell, 2011) using the cañahua transcriptome as expressed sequence tag (EST) evidence, the uniprot_sprot database (downloaded 25 September 2018) and quinoa protein sequences (Jarvis et al., 2017) as protein homology evidence, and the consensi.fa.classified output from RepeatModeler
TABLE 1. Passport and sequence archive information for plant materials used. Raw sequencing data for each accession are deposited in the Sequence Read Archive (SRA) at the National Center for Biotechnology Information (NCBI).
Name  Collection (a)  Accession ID  Collection location  Altitude (m a.s.l.)  Sequencing technology  SRA ID (b)
WGS reference information
PI 478407 USDA PI 478407 −17.2333, −67.9166 3800 PacBio SRR9661228
PI 478407 USDA PI 478407 −17.2333, −67.9166 3800 Hi-C (Illumina) SRR9661229
PI 478407 USDA PI 478407 −17.2333, −67.9166 3800 WGS (Illumina) SRR4425239 (c)
PI 478407 USDA PI 478407 −17.2333, −67.9166 3800 RNA-Seq SRR4425240–SRR4425243 (c)
Diversity panel information
P1 UNALM BYU 1780 −15.6967, −70.20510 3830 WGS (Illumina) SRR9620980
P2 UNALM BYU 1781 −15.7268, −70.23560 3838 WGS (Illumina) SRR9640749
P4 UNALM BYU 1785 −15.7693, −70.27050 3860 WGS (Illumina) SRR9640748
U7 USDA PI 510525 −16.3628, −69.2765 NA WGS (Illumina) SRR9640742
U8 USDA PI 510526 −16.2833, −69.2833 NA WGS (Illumina) SRR9640741
U9 USDA PI 510527 −16.0000, −69.7833 3810 WGS (Illumina) SRR9640740
U12 USDA PI 510530 −16.4500, −70.2333 NA WGS (Illumina) SRR9640747
U13 USDA PI 665279 −17.2333, −67.9166 3700 WGS (Illumina) SRR9640746
U14 USDA PI 665280 −17.2333, −67.9166 3700 WGS (Illumina) SRR9640745
U15 USDA PI 665281 −17.2333, −67.9166 3700 WGS (Illumina) SRR9640744
U16 USDA PI 665282 −17.2333, −67.9166 3700 WGS (Illumina) SRR9640743
B17 UMSA Bol-1.1 −15.7472, −68.8091 3845 WGS (Illumina) SRR9640755
B18 UMSA Bol-3.1 −16.5344, −68.0622 3445 WGS (Illumina) SRR9640754
B20 UMSA Bol-19.1 −17.8241, −67.7702 3721 WGS (Illumina) SRR9640757
B21 UMSA Bol-20.123 −17.7850, −68.1447 4025 WGS (Illumina) SRR9640756
B22 UMSA Bol-21.123 −17.6483, −67.2072 3777 WGS (Illumina) SRR9640751
B23 UMSA Bol-22.123 −18.2166, −67.0333 3707 WGS (Illumina) SRR9640750
B24 UMSA Bol-23.123 −16.5344, −68.0622 3445 WGS (Illumina) SRR9640753
B25 UMSA Bol-24.123 −16.6740, −68.3183 3900 WGS (Illumina) SRR9640752
B26 UMSA Bol-25.123 −16.5344, −68.0622 3445 WGS (Illumina) SRR9640759
B27 UMSA Bol-26.123 −16.5344, −68.0622 3445 WGS (Illumina) SRR9640758
B28 UMSA Bol-28.123 −16.6740, −68.3183 3900 WGS (Illumina) SRR9640732
B29 UMSA Bol-29.123 −16.5344, −68.0622 3445 WGS (Illumina) SRR9640733
B30 UMSA Bol-30.123 −17.2500, −67.9166 3800 WGS (Illumina) SRR9640734
B31 UMSA Bol-4.3 −16.6740, −68.3183 3900 WGS (Illumina) SRR9640735
B32 UMSA Bol-6.2 −16.6740, −68.3183 3900 WGS (Illumina) SRR9640736
B33 UMSA Bol-7.1 −16.6740, −68.3183 3900 WGS (Illumina) SRR9640737
B34 UMSA Bol-8.1 −16.6740, −68.3183 3900 WGS (Illumina) SRR9640738
B35 UMSA Bol-13.3 −16.6740, −68.3183 3900 WGS (Illumina) SRR9640739
B36 UMSA Bol-27.123 −16.6740, −68.3183 3900 WGS (Illumina) SRR9640731
Note: m a.s.l. = meters above sea level; NA = not available.
(a) Germplasm collection center. USDA = United States Department of Agriculture, Ames, Iowa, USA; UNALM = Universidad Nacional Agraria La Molina, Lima, Peru; UMSA = Universidad Mayor de San Andrés, La Paz, Bolivia; BYU = Brigham Young University, Provo, Utah, USA.
(b) Sequence Read Archive (SRA) identifier.
(c) Deposited in BioProject ID PRJNA326220. All other sequences are deposited in BioProject ID PRJNA552289.
for soft repeat masking. Gene prediction models included an Augustus model for cañahua produced by Benchmarking Universal Single-Copy Orthologs (BUSCO) version 3.0.2 (Simao et al., 2015) and the Arabidopsis thaliana SNAP HMM file (Korf, 2004) for gene prediction. BUSCO version 3.0.2 assessed the completeness of the assembly and annotation using the embryophyta odb10 data set. RepeatModeler version 1.0.11 and RepeatMasker version 4.0.7 (Smit et al., 2013–2015) were used to identify and classify repetitive elements in the final (PGA2) assembly relative to Repbase-derived RepeatMasker libraries version 20181026 (Bao et al., 2015).
Chloroplast genome assembly and annotation
A reference-guided assembly of the cañahua chloroplast genome was constructed by the Assembly by Reduced Complexity (ARC) assembler version 1.1.4 (Hunter et al., 2015) using a subset of six million whole-genome, paired-end Illumina reads with the quinoa chloroplast genome (Maughan et al., 2019) as a target. The ARC algorithm uses Bowtie2 (Langmead and Salzberg, 2012) with relaxed parameters to map reads against targets, extract mapped reads from each target, and assemble mapped reads using the SPAdes assembler (Bankevich et al., 2012). The targets are then replaced with newly assembled contigs, and the process is iterated for a predetermined number of cycles or until no additional reads can be incorporated. The ARC pipeline extended the assembled cañahua chloroplast contigs through four (numcycles = 4) successive rounds of mapping and re-assembly. Because chloroplast read depth should be significantly higher than nuclear genome read depth, only assembled contigs with read depth >50× coverage were selected for further assembly. Pacific Biosciences long reads (>15 kbp; n = 246,847) were
used to fill gaps between contigs using PBJelly2, a subprogram from PBSuite version 15.8.24 (English et al., 2012). A circularized contig representing the complete plastid genome was constructed using the circularize tool from Geneious (version 11.1.5; https://www.geneious.com/), then the assembly was polished with the same six million paired-end Illumina reads as used in the initial assembly. Annotation of the cañahua chloroplast genome was performed using GeSeq version 1.65 (Tillich et al., 2017) with the quinoa chloroplast annotation (Maughan et al., 2019) and the Max Planck Institute of Molecular Plant Physiology (MPI-MP) chloroplast database as references. ARAGORN version 1.2.3 and HMMER profile searches were enabled, the latter using the embryophyta chloroplast (CDS + rRNA) database. Comparison to the quinoa plastid genome was performed by the nucmer tool from MUMmer version 4.0beta (Marcais et al., 2018) followed by MUMmerplot with all default parameters.
Resequencing and single-nucleotide polymorphism discovery
DNA was extracted from single plants for each of 30 cañahua accessions using a cetyltrimethylammonium bromide (CTAB) extraction method as described by Doyle and Doyle (1987). Samples were sent to Novogene (San Diego, California, USA) for whole-genome Illumina HiSeq (150-bp paired-end) sequencing from 500-bp insert libraries. Trimmomatic version 0.32 (Bolger et al., 2014) was used to remove Illumina adapters and trailing bases with a quality score below 20 or average per-base quality of 20 over a four-nucleotide sliding window. Reads from each accession were aligned to PGA2 using BWA-MEM version 0.7.17 (Li, 2013) to produce SAM files that were converted to binary alignment map (BAM) format, sorted, and indexed using SAMtools version 1.9 (Li et al., 2009). The BAM files were used as input for InterSnp, a subprogram of the BamBam version 1.4 pipeline (Page et al., 2014), for single-nucleotide polymorphism (SNP) genotyping. SNPhylo version 20160204 (Lee et al., 2014) used the HapMap output files produced by InterSnp to filter and remove SNPs with >10% missing data and minor allele frequency <5%. SNPhylo also filters SNP data sets using linkage disequilibrium (LD) estimates (SNPs with LD < 40% are removed) prior to building bootstrapped (n = 1000) phylogenies based on MUSCLE (Edgar, 2004) sequence alignments. The resulting tree was visualized in FigTree version 1.4.3 (http://tree.bio.ed.ac.uk/software/figtree). Population structure was evaluated using Structure version 2.3.4 (Pritchard et al., 2000) with a range of K = 1 through K = 5 from a 1000 SNP subset of the InterSnp output. ArcMap version 10.3.1 (ArcGIS Desktop, release 10; Esri, Redlands, California, USA) mapping software was used to map the geographic locations of the source materials. The clustering partitions produced by Structure were used to construct a pie chart representing the allelic composition of each mapped individual.
Genome comparison
Comparisons of coding sequences for quinoa (C. quinoa; CoGe id53523), beet (Beta vulgaris L.; CoGe id37197; Funk et al., 2018), and amaranth (Amaranthus hypochondriacus L.; CoGe id34733; Lightfoot et al., 2017) were made using the CoGe SynMap tool (https://genomevolution.org/coge/) applying the Last algorithm with the recommended DAGChainer option (relative gene order) and Merge syntenic blocks option (quota align merge). The syntenic depth was set to quota align merge, at a ratio of coverage depth of 1 : 1 (beet) or 1 : 2 (quinoa, amaranth). The DAGchainer (Haas et al., 2004) output files were used as input for the MCScanX (Wang et al., 2012) and VGSC toolkit (Xu et al., 2016) for data visualization.
RESULTS
Whole genome assembly
The previous draft assembly of PI 478407 reported in Jarvis et al. (2017) was based solely on Illumina short reads assembled using the ALLPATHS-LG assembler (Gnerre et al., 2011). Although this was an excellent draft assembly, the lack of long-jump libraries (i.e., fosmid) or bacterial artificial chromosome (BAC)-end sequencing resulted in a highly fragmented assembly. The ASRA assembly consisted of 8982 contigs in 3013 scaffolds with a contig and scaffold N50 of 84 kbp and 357 kbp, respectively, spanning a total length of 337 Mbp (Table 2). To improve the ASRA, 179 million Hi-C–based paired-end reads were generated and used to scaffold the ASRA using the Proximo pipeline (Phase Genomics). Seventy-nine percent (2392) of the ASRA scaffolds were clustered into nine pseudomolecules, corresponding to the nine haploid chromosomes of cañahua (2n = 2x = 18; Appendix S1), producing a substantially improved proximity-guided assembly (PGA1). The number of scaffolds clustered to specific chromosomes ranged from 203 to 317, and the length of the assembled chromosomes ranged from 31.3 to 40.4 Mbp. The PGA1 scaffolds contained 95.3% of the total sequence length (99.7% excluding N gaps) with an N50 and L50 of 35.6 Mbp and 5, respectively (Table 2). Ns occupied 12.3 Mbp (4%) of the assembly, with an average of 1047 gaps (20 or more contiguous Ns) per scaffold. The unincorporated scaffolds (621) were small, representing <5% of the total sequence length of the ASRA, with a mean scaffold size of 25.8 kbp, making them much more difficult to incorporate accurately into chromosomes.
PGA1 was further improved by applying a combination of gap-filling and genome-polishing techniques. To close gaps, 10.21 Gbp (1,101,202 reads) of PacBio long reads were generated with a mean read length of 9.3 kbp, providing 23.6× coverage of the cañahua genome. PacBio long reads were aligned to PGA1 using PBJelly2 (English et al., 2012), closing 75% of existing N gaps. Due to potential errors introduced into gaps because of the inherent high error rate of PacBio reads, the assembly quality was improved using
TABLE 2. Assembly statistics for the ASRA, PGA1, PGA1.5, and PGA2 assemblies.
Assembly statistic ASRA PGA1 PGA1.5 PGA2
Assembly size (Mbp) 337 337 363 363
No. of scaffolds 3015 623 591 4633
Scaffold N50 size (Mbp) 0.357 35.6 37.8 38.1
Scaffold L50 count 243 5 5 5
Longest scaffold (Mbp) 2.9 40.4 43.2 45.5
No. of contigs 8984 8984 2580 8210
Contig N50 size (Mbp) 0.083 0.083 0.516 0.236
Contig L50 count 1096 1096 168 401
% missing bases 2.5 2.6 0.2 0.1
Assembly size (Mbp) in top 9 scaffolds 20 321 344 350
Assembly % in top 9 scaffolds 5.8 95.4 94.8 96.5
Note: ASRA = ALLPATHS-LG Short-Read Assembly; PGA1 = Proximity-Guided Assembly 1; PGA1.5 = Proximity-Guided Assembly 1.5; PGA2 = Proximity-Guided Assembly 2.
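Table 2 summarizes each assembly by its N50, L50, and the share of sequence held in the nine largest scaffolds. As a reminder of how such statistics are conventionally defined, the following is a minimal Python sketch. The function names `n50_l50` and `fraction_in_top` and the toy scaffold lengths are illustrative assumptions; they are not code or data from the study, and the authors' own tooling for these statistics is not specified here.

```python
def n50_l50(lengths):
    """Return (N50, L50) for a collection of sequence lengths.

    N50 is the length of the shortest sequence in the smallest set of
    longest sequences that together cover at least half of the total
    assembly length; L50 is the number of sequences in that set.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half_total:
            return length, count
    raise ValueError("lengths must be non-empty")


def fraction_in_top(lengths, k):
    """Fraction of the total length contained in the k longest sequences."""
    ordered = sorted(lengths, reverse=True)
    return sum(ordered[:k]) / sum(ordered)


# Hypothetical scaffold lengths in Mbp (not taken from Table 2).
toy_scaffolds = [40, 38, 35, 30, 5, 2]
n50, l50 = n50_l50(toy_scaffolds)
top4 = fraction_in_top(toy_scaffolds, 4)
print(f"N50 = {n50} Mbp, L50 = {l50}, top-4 fraction = {top4:.3f}")
```

Applied to a full list of scaffold lengths, the same two functions would yield the kind of "Scaffold N50", "Scaffold L50", and "Assembly % in top 9 scaffolds" entries reported in Table 2.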
two genome-polishing tools: Arrow (Chin et al., 2013), which produces consensus-quality assemblies from PacBio sequences, followed by Pilon (Walker et al., 2014), which performs a similar function but takes advantage of the significantly lower error rate of Illumina reads to improve the consensus assembly. These polishing steps made changes at 593,821 positions, representing <0.165% of PGA1. The resulting assembly, PGA1.5, had a total size of 363 Mbp, an approximately 7.7% increase from the ASRA. The scaffold N50 of PGA1.5 increased slightly to 37.8 Mbp, while the number of gaps decreased dramatically from 8013 to 2007, which is also reflected in a 10-fold decrease in the number of Ns in the assembly (4% to 0.2%; Table 2).
A second round of proximity-guided assembly using PGA1.5 improved the chromosome-scale assembly. Polar Star, which aggressively breaks contigs at low PacBio depth locations based on deviation from mean depth, introduced 5241 breaks that were then tested for rescaffolding using Hi-C–based proximity-guided assembly. This acts as a check on the error-prone PacBio reads and low coverage depth used in the gap-filling process. The result is a dramatically improved proximity-guided assembly, evident by the consistent pattern of Hi-C crosslink density along chromosomes and the resolution of erroneous inversions and rearrangements. The final assembly (PGA2) spans 362.5 Mbp, has a scaffold N50 and L50 of 38.1 Mbp and 5, respectively, with <0.1% of the assembled sequence found in 3586 gaps. Eighty-four percent of the estimated genome size is represented; the remaining 16% is likely composed of repetitive sequence that has collapsed in regions such as centromeres and telomeres due to the use of short reads for the initial assembly. The nine chromosomes contain 96.7% of the total sequence length (99.9% excluding N gaps), ranging in size from 33.5 Mbp to 45.4 Mbp (Appendix S2).
Repeat identification and genome annotation
RepeatModeler and RepeatMasker were employed to identify and classify repetitive elements in the cañahua genome. Fifty-three percent (191 Mbp) of the cañahua genome was classified as repetitive, with an additional 1.9% (7 Mbp) classified as low complexity (satellites, simple repeats, and small RNAs). A total of 129 Mbp (35.5%) was identified as retrotransposons or DNA elements, with an additional 61 Mbp (16.8%) classified as unknown elements. The most common elements identified were long terminal repeat (LTR) retrotransposons, specifically copia-like and gypsy-like elements, which spanned 67 Mbp (27.1%) of the genome (Appendix S3). The large fraction of unknown elements is not surprising given that the only published studies of repetitive elements in the Chenopodium genus have been limited to rDNA sequences (Maughan et al., 2006; Kolano et al., 2011) and two repetitive sequences, 18-24J and 12-13P, that were only recently characterized cytogenetically (Orzechowska et al., 2018). BLASTn was used to locate the 5S rDNA sequence and the two Chenopodium repetitive elements within the final assembly. Consistent with the findings of Kolano et al. (2011), the 5S rDNA sequence was found in a single genomic location in the centromeric region of chromosome Cp8 (Fig. 1). Orzechowska et al. (2018) previously reported that the 18-24J repeat was almost exclusively found in the Chenopodium B-genome, whereas 12-13P was located at pericentric positions in both the A- and B-genomes. BLASTn searches of the cañahua genome confirmed these observations, with 18-24J identified in only 0.012% of the cañahua genome while the 12-13P repetitive element, occupying 124.6 kbp (0.027%), was localized to putative centromeric regions on all nine chromosomes (Fig. 1).
A transcriptome assembly of cañahua was developed by sequencing RNA-Seq libraries from six unique tissue and abiotic stress combinations. The resulting RNA-Seq libraries generated 66.3 Gbp of data from 663,493,956 paired-end reads with an average of 11.05 Gbp per library. Ninety-eight percent (649,273,284) of the paired RNA-Seq reads aligned to the final PGA2 assembly, with 97.9% of those read pairs aligning concordantly, suggestive of a high-quality genome assembly. A StringTie (Pertea et al., 2016) reconstruction of the cañahua transcriptome identified 255,893 features, including 214,170 exons in 41,723 primary and alternative transcripts with a mean transcript length of 2.19 kbp and an average of 28,246 features per chromosome.
The MAKER pipeline was used to annotate PGA2 using as evidence the cañahua transcriptome (described above) and, as alternative evidence, the transcripts and protein sequences for quinoa (Jarvis et al., 2017), as well as the complete uniprot_sprot database. A total of 22,832 gene models were annotated, which is slightly more than half of the 44,776 gene models annotated in the allotetraploid quinoa (Fig. 1). The average transcript length was 4.6 kbp, with the longest protein sequence spanning 4769 amino acids (annotation ID: CP013000), which is predicted to encode the large sacsin-like gene found in many eukaryotes, including other Amaranthaceae species such as quinoa (XP_021735414), beet (XP_010688704), and spinach (XP_021846357). The mean annotation edit distance (AED), which is a quality measure combining values for sensitivity, specificity, and accuracy to give evidence of a high-quality annotation, was 0.23. AED values <0.3 are indicative of high-quality annotations (Holt and Yandell, 2011).
Completeness of the gene space was assessed using the BUSCO platform, which quantifies functional gene content using a large core set of highly conserved orthologous genes (COGs). Of the 1375 plant-specific COGs in the embryophyta database, 1341 (97.5%) were identified in the cañahua genome as complete, with another nine (0.7%) COGs classified as fragmented (complete: 97.5% [single: 95.9%, duplicated: 1.6%], fragmented: 0.7%, missing: 1.8%). Relative to the MAKER de novo annotated proteins and transcripts, BUSCO identified 1260 (91.6%) and 1303 (94.8%) complete COGs, respectively. The discrepancies between the whole genome, protein, and transcript BUSCO findings may be attributed to the difference in gene annotation method between BUSCO and MAKER. Whereas BUSCO uses BLAST to identify known genes, MAKER uses an approach that requires sufficient evidence from a combination of protein, EST, and ab initio gene prediction inputs. The annotation could potentially be improved by further training of the input gene prediction model (Augustus, SNAP).
Chloroplast genome reconstruction
The cañahua chloroplast assembly spans 151,799 bp in a single circular molecule. Annotation of the chloroplast genome revealed a quadripartite structure, including two copies of an inverted repeat (IR) region separating large and small single-copy regions. One hundred thirty-two genes were identified, including 88 protein-coding genes, 36 tRNA genes, and eight rRNA genes (Fig. 2). Twenty-one genes were located in each IR, including a pseudogene previously characterized in other Amaranthaceae species as rpl23 (Park et al., 2018; Maughan et al., 2019). With a length of 151,799 bp, the cañahua plastid genome is of a similar size to quinoa, which has been reported for multiple quinoa accessions ranging in size from 152,079–152,282 bp, with an average length of 152,134 bp (Hong
[Figure 1 image: circular genome plot of chromosomes Cp1–Cp9 with position tick marks every 5 Mbp. Center label: Chenopodium pallidicaule nuclear genome; 363 Mbp; 22,832 gene models.]
FIGURE 1. Genome annotation overview. An overview of gene and repetitive element annotations in the Chenopodium pallidicaule genome. Track
1: chromosome names and sizes; Track 2: frequency of pericentromeric 12-13P repetitive elements (purple); Track 3: frequency of 18-24J repetitive
element (blue) and the 5S rRNA locus (red); Track 4: frequency of canonical telomeric repeat; Track 5: gene density.
et al., 2017; Maughan et al., 2019). Due to the lack of recombination in chloroplast genomes and the relatively recent allotetraploidization event leading to quinoa (3.3–6.3 mya; Jarvis et al., 2017), the high degree of similarity between the cañahua and quinoa chloroplasts supports the hypothesis that the maternal parent in the polyploidization event that led to modern quinoa was an A-genome species (Maughan et al., 2019).

Diversity panel resequencing

A diversity panel consisting of 30 varieties of cañahua, including 28 landrace varieties and two wild accessions, was sequenced to an average depth of 10.9× coverage (4.7 Gbp) per accession. After read alignment to the PGA2 final assembly, a total of 358,461 SNPs were identified in the diversity panel, which were then filtered to 16,194 SNPs, based on minor allele frequency, missing data, and linkage disequilibrium, with an average of 1799 SNPs per chromosome. Analysis of the consensus, 1000-bootstrap phylogeny of the cañahua diversity panel suggests several major points of interest (Fig. 3A). First, the USDA collection of the species is limited to only two of three major groups, with the majority (seven out of eight accessions) on a single group, suggesting limited diversity within the USDA collection and highlighting the need for international collection efforts to preserve diversity within the species. Second, the Mantel test suggests that there is no correlation between collection site and genotype (Z = 11,296.22, r = −0.12326, and P = 0.837). This is likely due to a lack of good collection site data for many of the accessions. Indeed, eight of the accessions have as their passport data the latitude and longitude coordinates of the research facilities where they are curated instead of the coordinates of the original collection site (Fig. 3B, Table 1). Another potentially complicating issue is the well-known cultural practice of seed trading among indigenous Andean societies that was an important part of agriculture in the pre-Columbian Altiplano region for thousands of years (Vargas et al., 2011). Lastly, the two wild accessions (P1, P4) are found by themselves on a distinct clade within the phylogeny. A structure analysis (Pritchard et al., 2000) suggests that they are distinct from the landrace and cultivated accessions, showing little admixture with cultivated accessions, even though they were collected in close proximity to a cultivated type (P2, Fig. 3C). This finding agrees with those of Vargas et al. (2011) and is further evidence that the wild accessions may be useful sources of novel genetic diversity for improving cañahua.

Genome comparison

Syntenic relationships between cañahua and other Amaranthaceae species were explored using DAGChainer (Haas et al., 2004), which identifies colinear sets of homologous gene pairs (syntenic blocks) between genomes. The first species of the family with a
[Figure 2 graphic: circular map of the Chenopodium pallidicaule chloroplast genome (151,799 bp) with the LSC, SSC, IRA, and IRB regions marked; legend categories recoverable from the extraction include photosystem I, photosystem II, ATP synthase, NADH dehydrogenase, and RNA polymerase.]
FIGURE 2. Chloroplast annotation overview. The outside track shows genes transcribed in a clockwise direction, the second track shows genes transcribed in a counterclockwise direction, and the inside track shows G/C content levels. Annotation reveals a quadripartite structure, including two copies of the inverted repeat (bolded line) dividing large and small single-copy regions.
reference-quality chromosome-scale assembly was beet (B. vulgaris; 2n = 2x = 18; Dohm et al., 2014). A genomic comparison between cañahua and beet identified 162 syntenic blocks with 11,659 colinear gene pairs (average of 72 genes/block) spanning 271 Mbp. As expected, given the relatively close ancestry of the species, the size (in base pairs) of the syntenic blocks between species was highly correlated (R² = 0.80). The large blocks of syntenic genes are suggestive of homologous relationships between the chromosomes of the two species (Fig. 4A). Indeed, homologous chromosome pairs can easily be identified for all cañahua and beet chromosomes: Cp1 = Bv1 (93% shared syntenic block sequence), Cp2 = Bv2 (100%), Cp3 = Bv3 (99%), Cp4 = Bv4 (100%), Cp5 = Bv5 (100%), Cp6 = Bv6 (100%), Cp7 = Bv7 (100%), Cp8 = Bv8 (100%), Cp9 = Bv9 (96%). To maintain the family naming convention, we have assigned the cañahua chromosomes the same numbers as their beet homologs (i.e., Cp1 = Bv1, etc.).

Beet and cañahua are diploids that share a base chromosome number of x = 9, whereas the base number in Amaranthus is x = 8. Lightfoot et al. (2017) identified evidence of chromosome loss (the homoeolog of Ah5) and chromosome fusion (Ah1) in the amaranth
FIGURE 3. Diversity panel. (A) The unrooted tree was developed using 16,194 single-nucleotide polymorphisms (SNPs) filtered to remove SNPs with >10% missing data, minor allele frequency <5%, and linkage disequilibrium <40%. Colors represent the collection source (purple = United States Department of Agriculture, green = Universidad Nacional Agraria La Molina, blue = Universidad Mayor de San Andrés, La Paz), and bolded lines indicate wild accessions. (B) Geographic location (see Table 1 for passport information) combined with population structure information developed by Structure with K = 4. There is no significant correlation between collection site and genetic distance (P = 0.837). The wild Chenopodium pallidicaule accessions are identified with arrows. (C) Population structure and admixture in the diversity panel.
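The Mantel test behind the reported P = 0.837 correlates two distance matrices (here, geographic and genetic) and assesses significance by permuting the samples of one matrix. The following is a minimal pure-Python sketch of that idea, not the software used in the study; the function names, the one-sided permutation scheme, and the toy matrices are all assumptions made for illustration.

```python
import math
import random


def upper_triangle(matrix):
    """Flatten the strict upper triangle of a square distance matrix."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def mantel(geo, gen, permutations=999, seed=0):
    """Return (r, p) for a one-sided permutation Mantel test.

    geo and gen are symmetric distance matrices over the same samples.
    Rows and columns of gen are permuted together; p is the share of
    arrangements (including the observed one) with r at least as large.
    """
    rng = random.Random(seed)
    n = len(geo)
    geo_flat = upper_triangle(geo)
    r_obs = pearson(geo_flat, upper_triangle(gen))
    hits = 1  # the observed arrangement counts as one permutation
    order = list(range(n))
    for _ in range(permutations):
        rng.shuffle(order)
        permuted = [[gen[order[i]][order[j]] for j in range(n)] for i in range(n)]
        if pearson(geo_flat, upper_triangle(permuted)) >= r_obs:
            hits += 1
    return r_obs, hits / (permutations + 1)


if __name__ == "__main__":
    # Toy 4-sample matrices (hypothetical values, not data from the study).
    geo = [[0, 1, 2, 3], [1, 0, 1, 2], [2, 1, 0, 1], [3, 2, 1, 0]]
    gen = [[0, 2, 4, 6], [2, 0, 2, 4], [4, 2, 0, 2], [6, 4, 2, 0]]
    r, p = mantel(geo, gen)
    print(f"r = {r:.3f}, p = {p:.3f}")
```

In the toy example the genetic matrix is an exact multiple of the geographic one, so the observed correlation is 1; with real passport coordinates that do not reflect the true collection sites, a weak or absent correlation such as the one reported would be expected.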
genome that likely led to the observed base chromosome number reduction in the amaranths. Our comparison of cañahua to the amaranth genome identified 285 syntenic blocks with 13,200 colinear gene pairs. Although there was an increase in syntenic blocks identified in the cañahua–amaranth comparison, the number of genes per block dropped (46 genes/block) and was accompanied by a lower syntenic block size correlation (R² = 0.25). The decrease in block size and correlation is reflective of the more distant evolutionary relationship between these two species. Our analysis confirms the chromosome loss and fusion events in the amaranth genome. Indeed, the entirety of Cp9 aligns twice (end-to-end) with Ah1, and one homolog of Cp1 is largely missing (Fig. 4B). The synteny observed among the cañahua and amaranth chromosomes suggests several homoeologous relationships within the amaranth genome. For example, cañahua chromosome Cp6 is clearly homologous across the entirety of amaranth chromosomes Ah2 and Ah3. Indeed, of the 54 Mbp of amaranth sequence that is syntenic with Cp6, 48% (26 Mbp) is syntenic to Ah2 and 52% (28 Mbp) is syntenic to Ah3, clearly suggesting that Ah2 and Ah3 are homoeologs. Using the syntenic data from the cañahua–amaranth comparison and a simple majority rule (>75% syntenic sequence), we identify the following orthologous relationships: Cp1 = Ah5 (homoeolog loss), Cp2 = Ah10/Ah16, Cp4 = Ah6/Ah4, Cp6 = Ah3/Ah2, Cp7 = Ah8/Ah15, Cp8 = Ah14/Ah9, and Cp9 = Ah1/Ah1 (homoeolog fusion). Only one of the amaranth homoeologs of Cp3 is clearly identifiable in the data, specifically Ah13, with Ah7 likely the homoeolog, but obscured by a large translocation between Ah7 and Ah4. Similarly, the orthologous relationship with Cp5 likely involves Ah11 and Ah12 but is complicated by translocations with Ah2 (Fig. 4B, Appendix S4).

Comparison of cañahua to the quinoa genome identified 418 syntenic blocks with 23,410 colinear gene pairs (Appendix S5). When analyzed on a subgenome basis, cañahua had considerably more syntenic gene pairs, in more gene-dense syntenic blocks, with subgenome A (13,073 gene pairs, 71.1 genes/block) relative to subgenome B (10,337 gene pairs, 46.2 genes/block; Appendix S5). The sizes of the syntenic blocks were also more highly conserved with the A subgenome relative to the B subgenome (R² = 0.82 and 0.35, respectively) as well as the total number of syntenic genes (Table 3), confirming that cañahua is representative of the A-genome species in the genus. Although both the A and B subgenomes have maintained similar chromosomal structure, the A-subgenome homoeologs in quinoa can be clearly identified via visual inspection of the syntenic sequence dot plots and are supported by the amount of syntenic bases shared
FIGURE 4. Genomic comparison of cañahua with beet, amaranth, and quinoa. Synteny dot plot (left) and dual synteny plots (right) show syntenic regions between cañahua and beet (A), amaranth (B), and quinoa (C) coding sequences. The dual synteny plot of the quinoa genome is divided into A- and B-subgenomes with cañahua in the center. Increasing color intensity is associated with increasing homology in the dot plots. The arrows identify the chromosomal fusion (red) and loss (blue) in amaranth.
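The genes-per-block averages quoted for the beet and amaranth comparisons are ratios of colinear gene pairs to syntenic blocks, and the Cp6 example is a ratio of syntenic sequence. The sketch below recomputes them from the counts given in the text; the data layout and function names are illustrative assumptions and are not part of the DAGChainer workflow used in the study.

```python
# Counts quoted in the text for each comparison against cañahua.
COMPARISONS = {
    "beet": (162, 11_659),      # (syntenic blocks, colinear gene pairs)
    "amaranth": (285, 13_200),
}

# Amaranth sequence syntenic with Cp6, in Mbp, as quoted in the text.
CP6_SYNTENIC_MBP = {"Ah2": 26, "Ah3": 28}


def genes_per_block(blocks, gene_pairs):
    """Average number of colinear gene pairs per syntenic block."""
    return gene_pairs / blocks


def syntenic_shares(mbp_by_chromosome):
    """Percentage of the total syntenic sequence found on each chromosome."""
    total = sum(mbp_by_chromosome.values())
    return {chrom: 100.0 * mbp / total for chrom, mbp in mbp_by_chromosome.items()}


if __name__ == "__main__":
    for species, (blocks, pairs) in COMPARISONS.items():
        print(f"{species}: {genes_per_block(blocks, pairs):.1f} genes/block")
    for chrom, share in syntenic_shares(CP6_SYNTENIC_MBP).items():
        print(f"Cp6 vs {chrom}: {share:.0f}% of syntenic sequence")
```

Rounded to whole numbers, these ratios give about 72 and 46 genes per block and a 48%/52% split of the Cp6 syntenic sequence between Ah2 and Ah3, in line with the values quoted in the text.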
(Fig. 4C, Appendix S4). All quinoa A chromosomes share a higher number of syntenic genes with cañahua than their B homoeologs, except for Cq4A and Cq4B where 1376 and 1416 syntenic genes were identified, respectively. However, the number of genes per syntenic block shared with Cp4 is higher in Cq4A (76.4) than for Cq4B (70.8), as is the total amount of syntenic bases shared with Cp4 (28.3 Mbp), which is suggestive that the assignment of Cq4A to the A-subgenome is likely correct. This is even more significant considering that the B-subgenome of quinoa is larger than the A-subgenome (531 Mbp in the A-subgenome and 670 Mbp in the B-subgenome; Jarvis et al., 2017).

We calculated the rate of synonymous substitutions per synonymous site (Ks) in duplicate gene pairs between cañahua and the A- and B-subgenome chromosomes found in quinoa (Jarvis et al., 2017). Clear peaks are present at Ks = 0.025 and 0.05 for the A- and B-subgenome comparisons, respectively, reflecting a notable difference in the estimated time since the A and B subgenomes of quinoa last shared a common ancestor with cañahua (Appendix S6). Indeed, depending on whether an A. thaliana–based synonymous mutation rate (Koch et al., 2000) or a core eukaryotic–based rate is used (Lynch and Conery, 2000) in the calculation, the A-subgenome of quinoa last shared a common ancestor with cañahua approximately 0.8–1.5 mya, whereas the B-subgenome and cañahua have been diverged for nearly twice as long (1.7–3.1 mya). Ks values suggest that the last common ancestor between cañahua and beet was approximately 16–29.6 mya, whereas the last common ancestor between cañahua and amaranth was more distant at 21.3–39.5 mya (Appendix S6, Table 3).

DISCUSSION

The value of incorporating Hi-C data and long reads into the assembly is clear when comparing ASRA and PGA2 assemblies. The Hi-C data increased contiguity of PGA2 significantly by reducing the assembly from 3015 scaffolds to nine chromosome-scale scaffolds, while the long-read sequence dramatically reduced the number of gaps (by 75%) in the assembly as well as increasing the total assembly size. One notable disadvantage to developing a genome assembly based on short reads is the difficulty of properly assembling repetitive elements (Richards, 2018). When the read length is shorter than the repeat element, the reads collapse into single contigs, resulting in genome assemblies that can be significantly smaller than the actual size of the genome such as seen here. For example,
TABLE 3. Comparison of gene synteny, synonymous substitution rate, and divergence since the last common ancestor relative to cañahua.

Metric                              Amaranth      Beet       Quinoa A-subgenome   Quinoa B-subgenome
Total no. of genes (a)              45,947        45,334     43,663               44,638
Unique syntenic genes (b)           23,878        23,075     26,230               25,327
% of syntenic genes                 52.0          50.9       60.1                 56.7
Syntenic genes/block                46.3          71.9       71.1                 46.1
Average syntenic block size (Mbp)   1.3           3.1        4.7                  4.9
Ks peak value                       0.64          0.48       0.025                0.05
Last common ancestor (mya)          21.33–39.51   16–29.63   0.830–1.54           1.67–3.09

Note: Ks = synonymous substitutions per synonymous site.
(a) Total number of annotated genes in cañahua and the comparison species.
(b) Total number of unique syntenic genes in cañahua and the comparison species.
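The divergence ranges in Table 3 are consistent with the standard relation T = Ks / (2r), where r is the synonymous substitution rate per site per year. The sketch below reproduces them under the assumption that the two rates are 1.5 × 10^-8 (the A. thaliana–based rate attributed to Koch et al., 2000) and 8.1 × 10^-9 (the core eukaryotic rate attributed to Lynch and Conery, 2000). Neither value is stated in this section; they are chosen because they recover the tabulated ranges, and the function and constant names are illustrative only.

```python
# Assumed synonymous substitution rates (substitutions per site per year).
# These values are not given in the text; they are assumptions that
# reproduce the ranges reported in Table 3.
RATE_A_THALIANA = 1.5e-8      # assumed rate attributed to Koch et al. (2000)
RATE_CORE_EUKARYOTE = 8.1e-9  # assumed rate attributed to Lynch and Conery (2000)

# Ks peak values copied from Table 3.
KS_PEAKS = {
    "Amaranth": 0.64,
    "Beet": 0.48,
    "Quinoa A-subgenome": 0.025,
    "Quinoa B-subgenome": 0.05,
}


def divergence_mya(ks, rate):
    """Return divergence time in millions of years: T = Ks / (2 * rate)."""
    return ks / (2.0 * rate) / 1e6


def divergence_range(ks):
    """Return (younger, older) estimates from the faster and slower rate."""
    return (divergence_mya(ks, RATE_A_THALIANA),
            divergence_mya(ks, RATE_CORE_EUKARYOTE))


if __name__ == "__main__":
    for species, ks in KS_PEAKS.items():
        low, high = divergence_range(ks)
        print(f"{species}: {low:.2f}-{high:.2f} mya")
```

Because T scales linearly with Ks, the doubling of the peak value from 0.025 to 0.05 directly yields the roughly twofold longer divergence of the B-subgenome from cañahua.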
the telomeric repeat in PGA2 was largely collapsed into a single contig that was not scaffolded to any of the chromosomes. Although there are traces of telomere sequence on several of the nine scaffolds (Fig. 1), the integrity of this element was largely lost. Nonetheless, the assembly method reported here is cost-effective, requiring only inexpensive Illumina short-read technology for the initial assembly and Hi-C scaffolding, while the more expensive long reads (PacBio) necessary for gap-filling are only needed at low coverage.

The high level of synteny between cañahua chromosomes and the A-subgenome chromosomes of quinoa, as well as the high chloroplast similarity and Ks values, provides strong evidence supporting a New World A-genome diploid as the maternal cytoplasm donor of the A-subgenome in the allopolyploidization of quinoa. However, given the closer proximity of the Eurasian landmass (B-subgenome origin) to North America than to South America, a North American A-genome diploid donor is more logical than a South American origin donor, such as cañahua. Thus it is unlikely that cañahua is the direct ancestor of the A-subgenome in quinoa, suggesting that future genomic analyses of the more than 45 putative A-genome diploid Chenopodium species should provide important insight into the polyploidization events that underlie the evolution and domestication of the New World AABB Chenopodium species complex that includes free-living C. berlandieri Moq. subsp. berlandieri, C. quinoa var. melanospermum Hunz., C. quinoa subsp. milleanum Aellen, and C. hircinum Schrad., along with their domesticated forms C. quinoa and C. berlandieri subsp. nuttalliae (Saff.) H. Dan. Wilson & Heiser (Wilson, 1990). Indeed, recently reported read-mapping percentages reveal that C. watsonii A. Nelson and C. sonorense Benet-Pierce & M. G. Simpson, both wild diploids collected in the southwestern United States, align more closely to the quinoa A-subgenome than does cañahua, with C. watsonii exhibiting the highest mapping percentage (Jellen et al., 2019). Whole-genome sequencing of an additional 24 putative A-genome diploid Chenopodium species, originating from across North, Central, and South America, is currently underway in our laboratory.

Careful evaluation of chromosomes within the Amaranthaceae family can shed light on how these genomes evolved over time and what role structural changes have played in biological function. For example, homologs of Cp5 are highly conserved in both the A and B subgenomes of quinoa (Cq5A and Cq5B), but there is clear structural variation in comparison to the homolog in beet, Bv5 (Fig. 4A). One of the amaranth homologs of Cp5 is collinear (Ah2), whereas the second homolog is split between two chromosomes (Ah11 and Ah12) but also reflects a similar order. This may be evidence that a terminal inversion occurred in the evolution of beet after the divergence from a common ancestor. Homologs of Cp9 also show an evolutionarily interesting pattern. Whereas Cp9 is conserved in the A-subgenome of quinoa (Cq4A), demonstrated both by a syntenic dot plot (Fig. 4A) and a high number of syntenic genes (1323; Appendix S5), the B-subgenome homolog has a much different structure and fewer than half the number of syntenic genes (536). Meanwhile, beet and amaranth both have unique rearrangements of this homolog (Bv9 and Ah1, respectively), suggesting that the order of genes along this molecule may not hold significant biological importance.

In conclusion, the reference-quality, chromosome-scale assembly of cañahua presented here dramatically improves the existing resources for this regionally important subsistence crop. The reference genome provides a critical genomic tool needed to draw attention to cañahua, which should lead to renewed interest in improved varietal development using modern plant breeding techniques, including marker-assisted selection and genomic selection (Jannink et al., 2010; Brachi et al., 2011). The genome annotation reported here will undoubtedly facilitate gene discovery efforts within the species, allowing researchers to move quickly from genetic linkage/linkage disequilibrium experiments to possible candidate gene targets. Indeed, once target genomic regions are identified, enhanced marker-assisted breeding methods can be more effectively employed. Given that the only other domesticated Chenopodium species are complex allotetraploids (C. quinoa and C. berlandieri subsp. nuttalliae), we anticipate that cañahua will serve as a simplified genetic model for the family (Bertioli et al., 2016; Du et al., 2018). Lastly, the resequencing information presented here for a large diversity panel of cañahua accessions collected from across the Altiplano should provide the preliminary data needed for germplasm conservation and core-collection development.

ACKNOWLEDGMENTS

The authors would like to thank PROINPA Foundation for the generous donation to the Universidad Mayor de San Andrés, La Paz, Bolivia, of 10 cañahua lines belonging to the “Programa de mejoramiento genético de cañahua.”

DATA AVAILABILITY

The raw sequences are deposited in the National Center for Biotechnology Information (NCBI) Sequence Read Archive database under the BioProject ID PRJNA552289 with the following
accession numbers: SRR9661228 (PacBio), SRR9661229 (Hi-C), and SRR9640731–SRR9640759 and SRR9620980 (resequencing panel). Bulk data downloads, including annotations and BLAST analysis, and JBrowse viewing of the final proximity-guided assembly are available at CoGe (https://genomevolution.org/coge/; Genome id53872).

SUPPORTING INFORMATION

Additional Supporting Information may be found online in the supporting information tab for this article.

APPENDIX S1. Outline of the genome assembly process.

APPENDIX S2. Length and contig number for each chromosome-scale scaffold in PGA2.

APPENDIX S3. Repetitive element classification for final assembly (PGA2) as reported by RepeatMasker.

APPENDIX S4. Orthologous genes were identified between cañahua and beet (A), amaranth (B), and quinoa (C) to detect orthologous chromosome relationships.

APPENDIX S5. Comparison of gene synteny between cañahua and the two subgenomes of quinoa.

APPENDIX S6. Rate of synonymous substitutions per synonymous site (Ks). Ks values within duplicated gene pairs between cañahua with amaranth (red), beet (yellow-brown), tetraploid quinoa (green), the A-subgenome of quinoa (blue), and the B-subgenome of quinoa (purple).

LITERATURE CITED

Bankevich, A., S. Nurk, D. Antipov, A. A. Gurevich, M. Dvorkin, A. S. Kulikov, V. M. Lesin, et al. 2012. SPAdes: A new genome assembly algorithm and its applications to single-cell sequencing. Journal of Computational Biology 19: 455–477.
Bao, W. D., K. K. Kojima, and O. Kohany. 2015. Repbase Update, a database of repetitive elements in eukaryotic genomes. Mobile DNA 6: 11.
Bertioli, D. J., S. B. Cannon, L. Froenicke, G. D. Huang, A. D. Farmer, E. K. S. Cannon, X. Liu, et al. 2016. The genome sequences of Arachis duranensis and Arachis ipaensis, the diploid ancestors of cultivated peanut. Nature Genetics 48: 438.
Bolger, A. M., M. Lohse, and B. Usadel. 2014. Trimmomatic: A flexible trimmer for Illumina sequence data. Bioinformatics 30: 2114–2120.
Brachi, B., G. P. Morris, and J. O. Borevitz. 2011. Genome-wide association studies in plants: The missing heritability is in the field. Genome Biology 12: 232.
Burton, J. N., A. Adey, R. P. Patwardhan, R. L. Qiu, J. O. Kitzman, and J. Shendure. 2013. Chromosome-scale scaffolding of de novo genome assemblies based on chromatin interactions. Nature Biotechnology 31: 1119–1125.
Chin, C. S., D. H. Alexander, P. Marks, A. A. Klammer, J. Drake, C. Heiner, A. Clum, et al. 2013. Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data. Nature Methods 10: 563–569.
Dohm, J. C., A. E. Minoche, D. Holtgrawe, S. Capella-Gutierrez, F. Zakrzewski, H. Tafer, O. Rupp, et al. 2014. The genome of the recently domesticated crop plant sugar beet (Beta vulgaris). Nature 505: 546–549.
Doyle, J. J., and J. L. Doyle. 1987. A rapid DNA isolation procedure for small quantities of fresh leaf tissue. Phytochemical Bulletin 19: 11–15.
Du, X. M., G. Huang, S. P. He, Z. E. Yang, G. F. Sun, X. F. Ma, N. Li, et al. 2018. Resequencing of 243 diploid cotton accessions based on an updated A genome identifies the genetic basis of key agronomic traits. Nature Genetics 50: 796–802.
Durand, N. C., J. T. Robinson, M. S. Shamim, I. Machol, J. P. Mesirov, E. S. Lander, and E. L. Aiden. 2016. Juicebox provides a visualization system for Hi-C contact maps with unlimited zoom. Cell Systems 3: 99–101.
Edgar, R. C. 2004. MUSCLE: Multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Research 32: 1792–1797.
English, A. C., S. Richards, Y. Han, M. Wang, V. Vee, J. X. Qu, X. Qin, et al. 2012. Mind the gap: Upgrading genomes with Pacific Biosciences RS long-read sequencing technology. PLoS ONE 7: e47768.
Funk, A., P. Galewski, and J. M. McGrath. 2018. Nucleotide-binding resistance gene signatures in sugar beet, insights from a new reference genome. Plant Journal 95: 659–671.
Gade, D. W. 1970. Ethnobotany of cañihua (Chenopodium pallidicaule), rustic seed crop of the Altiplano. Economic Botany 24: 55–61.
Gnerre, S., I. MacCallum, D. Przybylski, F. J. Ribeiro, J. N. Burton, B. J. Walker, T. Sharpe, et al. 2011. High-quality draft assemblies of mammalian genomes from massively parallel sequence data. Proceedings of the National Academy of Sciences USA 108: 1513–1518.
Haas, B. J., A. L. Delcher, J. R. Wortman, and S. L. Salzberg. 2004. DAGchainer: A tool for mining segmental genome duplications and synteny. Bioinformatics 20: 3643–3646.
Holt, C., and M. Yandell. 2011. MAKER2: An annotation pipeline and genome-database management tool for second-generation genome projects. BMC Bioinformatics 12: 491.
Hong, S. Y., K. S. Cheon, K. O. Yoo, H. O. Lee, K. S. Cho, J. T. Suh, S. J. Kim, et al. 2017. Complete chloroplast genome sequences and comparative analysis of Chenopodium quinoa and C. album. Frontiers in Plant Science 8: https://doi.org/10.3389/fpls.2017.01696.
Hunter, S. S., R. T. Lyon, B. A. J. Sarver, K. Hardwick, L. J. Forney, and M. L. Settles. 2015. Assembly by Reduced Complexity (ARC): A hybrid approach for targeted assembly of homologous sequences. bioRxiv 014662 [Preprint]. 31 January 2015 [cited 5 July 2018]. Available from: https://doi.org/10.1101/014662.
Jannink, J. L., A. J. Lorenz, and H. Iwata. 2010. Genomic selection in plant breeding: From theory to practice. Briefings in Functional Genomics 9: 166–177.
Jarvis, D. E., Y. S. Ho, D. J. Lightfoot, S. M. Schmockel, B. Li, T. J. A. Borm, H. Ohyanagi, et al. 2017. The genome of Chenopodium quinoa. Nature 542: 307–312.
Jellen, E. N., D. E. Jarvis, S. P. Hunt, H. H. Mangelsen, and P. J. Maughan. 2019. New seed collections of North American pitseed goosefoot (Chenopodium berlandieri) and efforts to identify its diploid ancestors through whole-genome sequencing. Ciencia e Investigacion Agraria 46: 187–196.
Kim, D., B. Langmead, and S. L. Salzberg. 2015. HISAT: A fast spliced aligner with low memory requirements. Nature Methods 12: 357–360.
Koch, M. A., B. Haubold, and T. Mitchell-Olds. 2000. Comparative evolutionary analysis of chalcone synthase and alcohol dehydrogenase loci in Arabidopsis, Arabis, and related genera (Brassicaceae). Molecular Biology and Evolution 17: 1483–1498.
Kolano, B., B. W. Gardunia, M. Michalska, A. Bonifacio, D. Fairbanks, P. J. Maughan, C. E. Coleman, et al. 2011. Chromosomal localization of two novel repetitive sequences isolated from the Chenopodium quinoa Willd. genome. Genome 54: 710–717.
Korf, I. 2004. Gene finding in novel genomes. BMC Bioinformatics 5: 59.
Langmead, B., and S. L. Salzberg. 2012. Fast gapped-read alignment with Bowtie 2. Nature Methods 9: 357–359.
Lee, T. H., H. Guo, X. Y. Wang, C. Kim, and A. H. Paterson. 2014. SNPhylo: A pipeline to construct a phylogenetic tree from huge SNP data. BMC Genomics 15: 162.
Li, H. 2013. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv preprint arXiv:1303.3997 [Preprint]. 16 March 2013 [cited 14 September 2018]. Available from: https://arxiv.org/abs/1303.3997v2.
Li, H., and R. Durbin. 2010. Fast and accurate long-read alignment with Burrows-Wheeler transform. Bioinformatics 26: 589–595.
Li, H., B. Handsaker, A. Wysoker, T. Fennell, J. Ruan, N. Homer, G. Marth, et al. 2009. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25: 2078–2079.
Lightfoot, D. J., D. E. Jarvis, T. Ramaraj, R. Lee, E. N. Jellen, and P. J. Maughan. 2017. Single-molecule sequencing and Hi-C-based proximity-guided assembly of amaranth (Amaranthus hypochondriacus) chromosomes provide insights into genome evolution. BMC Biology 15: 74.
Lynch, M., and J. S. Conery. 2000. The evolutionary fate and consequences of duplicate genes. Science 290: 1151–1155.
Marcais, G., A. L. Delcher, A. M. Phillippy, R. Coston, S. L. Salzberg, and A. Zimin. 2018. MUMmer4: A fast and versatile genome alignment system. PLoS Computational Biology 14: e1005944.
Maughan, P. J., B. A. Kolano, J. Maluszynska, N. D. Coles, A. Bonifacio, J. Rojas, C. E. Coleman, et al. 2006. Molecular and cytological characterization of ribosomal RNA genes in Chenopodium quinoa and Chenopodium berlandieri. Genome 49: 825–839.
Maughan, P. J., L. Chaney, D. J. Lightfoot, B. J. Cox, M. Tester, E. N. Jellen, and D. E. Jarvis. 2019. Mitochondrial and chloroplast genomes provide insights into the evolutionary origins of quinoa (Chenopodium quinoa Willd.). Scientific Reports 9: 185.
Mujica, A. 1994. Andean grains and legumes. In J. E. H. B. a. J. León [ed.], Neglected crops: 1492 from a different perspective. FAO, Rome, Italy.
Orzechowska, M., M. Majka, H. Weiss-Schneeweiss, A. Kovarik, N. Borowska-Zuchowska, and B. Kolano. 2018. Organization and evolution of two repetitive sequences, 18-24J and 12-13P, in the genome of Chenopodium (Amaranthaceae). Genome 61: 643–652.
Page, J. T., Z. S. Liechty, M. D. Huynh, and J. A. Udall. 2014. BamBam: Genome sequence analysis tools for biologists. BMC Research Notes 7: 829.
Park, J.-S., I.-S. Choi, D.-H. Lee, and B.-H. Choi. 2018. The complete plastid genome of Suaeda malacosperma (Amaranthaceae/Chenopodiaceae), a vulnerable halophyte in coastal regions of Korea and Japan. Mitochondrial DNA Part B 3: 382–383. https://doi.org/10.1080/23802359.2018.1437822.
Penarrieta, J. M., J. A. Alvarado, B. Akesson, and B. Bergenstahl. 2008. Total antioxidant capacity and content of flavonoids and other phenolic compounds in canihua (Chenopodium pallidicaule): An Andean pseudocereal. Molecular Nutrition & Food Research 52: 708–717.
Pertea, M., D. Kim, G. M. Pertea, J. T. Leek, and S. L. Salzberg. 2016. Transcript-level expression analysis of RNA-seq experiments with HISAT, StringTie and Ballgown. Nature Protocols 11: 1650–1667.
Pritchard, J. K., M. Stephens, and P. Donnelly. 2000. Inference of population structure using multilocus genotype data. Genetics 155: 945–959.
Rastrelli, L., F. DeSimone, O. Schettino, and A. Dini. 1996. Constituents of Chenopodium pallidicaule (Canihua) seeds: Isolation and characterization of new triterpene saponins. Journal of Agricultural and Food Chemistry 44: 3528–3533.
Repo-Carrasco, R., C. Espinoza, and S. E. Jacobsen. 2003. Nutritional value and use of the Andean crops quinoa (Chenopodium quinoa) and kaniwa (Chenopodium pallidicaule). Food Reviews International 19: 179–189.
Repo-Carrasco-Valencia, R., J. K. Hellstrom, J. M. Pihlava, and P. H. Mattila. 2010. Flavonoids and other phenolic compounds in Andean indigenous grains: Quinoa (Chenopodium quinoa), kañiwa (Chenopodium pallidicaule) and kiwicha (Amaranthus caudatus). Food Chemistry 120: 128–133.
Richards, S. 2018. Full disclosure: Genome assembly is still hard. PLoS Biology 16: e2005894.
Ruas, P. M., A. Bonifacio, C. F. Ruas, D. J. Fairbanks, and W. R. Andersen. 1999. Genetic relationship among 19 accessions of six species of Chenopodium L., by Random Amplified Polymorphic DNA fragments (RAPD). Euphytica 105: 25–32.
Simao, F. A., R. M. Waterhouse, P. Ioannidis, E. V. Kriventseva, and E. M. Zdobnov. 2015. BUSCO: Assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics 31: 3210–3212.
Smit, A. F. A., R. Hubley, and P. Green. 2013–2015. RepeatMasker Open-4.0. Website http://www.repeatmasker.org [accessed 10 August 2018].
Štorchová, H., J. Drabesova, D. Chab, J. Kolar, and E. N. Jellen. 2015. The introns in FLOWERING LOCUS T-LIKE (FTL) genes are useful markers for tracking paternity in tetraploid Chenopodium quinoa Willd. Genetic Resources and Crop Evolution 62: 913–925.
Tillich, M., P. Lehwark, T. Pellizzer, E. S. Ulbricht-Jones, A. Fischer, R. Bock, and S. Greiner. 2017. GeSeq: Versatile and accurate annotation of organelle genomes. Nucleic Acids Research 45: W6–W11.
Vargas, A., D. B. Elzinga, J. A. Rojas-Beltran, A. Bonifacio, B. Geary, M. R. Stevens, E. N. Jellen, and P. J. Maughan. 2011. Development and use of microsatellite markers for genetic diversity analysis of cañahua (Chenopodium pallidicaule Aellen). Genetic Resources and Crop Evolution 58: 727–739.
Walker, B. J., T. Abeel, T. Shea, M. Priest, A. Abouelliel, S. Sakthikumar, C. A. Cuomo, et al. 2014. Pilon: An integrated tool for comprehensive microbial variant detection and genome assembly improvement. PLoS ONE 9: e112963.
Wang, Y. P., H. B. Tang, J. D. DeBarry, X. Tan, J. P. Li, X. Y. Wang, T. H. Lee, et al. 2012. MCScanX: A toolkit for detection and evolutionary analysis of gene synteny and collinearity. Nucleic Acids Research 40: e49.
Wilson, H. D. 1990. Quinua and relatives (Chenopodium sect. Chenopodium subsect. Cellulata). Economic Botany 44: 92–110. https://doi.org/10.1007/BF02860478.
Xu, Y. Q., C. W. Bi, G. X. Wu, S. Y. Wei, X. G. Dai, T. M. Yin, and N. Ye. 2016. VGSC: A web-based vector graph toolkit of genome synteny and collinearity. BioMed Research International 2016: 7823429.