Learning to Untangle Genome Assembly with Graph Convolutional Networks

Lovro Vrček¹,², Xavier Bresson³, Thomas Laurent⁴, Martin Schmitz¹,³, and Mile Šikić¹,²
¹ Genome Institute of Singapore, A*STAR
² Faculty of Electrical Engineering and Computing, University of Zagreb
³ National University of Singapore
⁴ Loyola Marymount University
{vrcekl, miles}@gis.a-star.edu.sg
arXiv:2206.00668v1 [q-bio.GN] 1 Jun 2022

Abstract
The quest to determine the complete sequence of human DNA from telomere to
telomere started three decades ago and was finally completed in 2021. This
accomplishment resulted from the tremendous effort of numerous experts who engineered
various tools and performed laborious manual inspection to achieve the first gapless
genome sequence. However, such a method can hardly serve as a general approach
to assembling different genomes, especially when assembly speed is critical
given the large amount of data. In this work, we explore a different approach to
the central part of the genome assembly task, which consists of untangling a large
assembly graph from which a genomic sequence needs to be reconstructed. Our
main motivation is to reduce human-engineered heuristics and use deep learning to
develop more generalizable reconstruction techniques. Specifically, we introduce a
new learning framework to train a graph convolutional network to resolve assembly
graphs by finding a correct path through them. The training is supervised with
a dataset generated from the resolved CHM13 human sequence, and the model is tested
on assembly graphs built from real human PacBio HiFi reads. Experimental results
show that a model trained on simulated graphs generated solely from a single
chromosome is able to resolve all other chromosomes. Moreover, the
model outperforms hand-crafted heuristics from a state-of-the-art de novo
assembler on the same graphs. Chromosomes reconstructed with graph networks are
more accurate at the nucleotide level and have fewer contigs, a higher reconstructed
genome fraction, and higher NG50 and NGA50 assessment metrics. Both the code and
the dataset are made publicly available (footnote 1).

1 Introduction
De novo genome assembly is a problem in bioinformatics that focuses on reconstructing the original
genomic sequence of a species or an individual from a biological sample of shorter overlapping
fragments, called reads, without any prior knowledge about the original sequence. A variety of
tools, so-called de novo genome assemblers, tackle this problem, but complete reconstruction
of large genomes in a fast and accurate manner remains unsolved. One of the main issues is the
fragmentation of the assembled genomes: instead of each chromosome being reconstructed
as a single contiguous sequence, it is broken into many shorter fragments, called contigs. The
fragmentation happens when de novo assembly tools fail to resolve a complex genomic region, which
is, for lack of better methods, cut out from the reconstruction.
1 https://github.com/lvrcek/GNNome-assembly

Preprint. Under review.


Researchers have been on a quest to assemble the complete human genome since 1990 [1]. A
recent accomplishment, however, shows that a complete, unfragmented reconstruction of the human
genome is possible. This was enabled by the latest generation of sequencing technologies, which
produce longer and more accurate reads than previous generations, but also by
the tremendous effort of numerous researchers who used different de novo assembly tools to manually
inspect the remaining large repetitive regions and avoid fragmentation of the assemblies. To achieve this
goal, they first used HiFi reads developed by PacBio to reconstruct most of the genome, after which
complex regions were resolved with ultra-long Nanopore reads.
Motivated by their effort, in this work we propose an automated method for reducing the fragmentation
in de novo genome assembly that relies neither on tedious human inspection nor on inaccurate heuristics,
but rather on graph neural networks. For this, we use HiFi reads only and focus on the central phase
of de novo genome assembly, in which a large, complex assembly graph has to be untangled into
contigs. We propose a new learning framework that follows the Overlap-Layout-Consensus
assembly paradigm. The scientific contributions of this work are:
• Assembly quality: Starting from the same assembly graph, we reconstruct the chromosomes in
fewer, longer fragments than heuristics commonly used in de novo genome assembly.
• Research direction: To the best of our knowledge, this is the first ever work that utilizes deep
learning to resolve the Layout phase of de novo genome assembly. Thus, we open a new avenue to
bring improvements over the current methods.
• Dataset and framework: We manually curate and annotate a dataset of human chromosomes used
in the human genome reconstruction, which allows for training on real data and easier evaluation
of models. We make the dataset publicly available, together with the code of the entire framework
we implemented, from simulating the synthetic data to inferring on real data.

2 Background
Contemporary de novo assembly methods are primarily based on third-generation long-read
technologies. The most popular long-read sequencing technologies are produced by Pacific Biosciences
(PacBio) and Oxford Nanopore Technologies (ONT). Sequenced reads vary in length and
accuracy, so not all tools work equally well with both; there is no one-size-fits-all assembly tool.
Although ONT devices can produce much longer reads (up to 1 million base pairs), we focus only on
PacBio HiFi reads due to the combination of their high accuracy (above 99.5%) and length
(15,000-25,000 base pairs).

OLC The Overlap-Layout-Consensus (OLC) paradigm is a popular approach to de novo assembly
with long reads, used also in the recent telomere-to-telomere reconstruction of the human genome. In
the Overlap phase, the reads are overlapped onto each other in an all-versus-all manner to find relations
between them. Using the obtained information, an assembly graph is built: a directed graph in which
nodes represent reads and edges represent the suffix-prefix overlaps between the reads. In the Layout
phase, the assembly graph is then simplified in order to find a path through it. This path corresponds
to the reconstructed sequence of the original genome, also called the assembly sequence. Finally, in
the Consensus phase, the assembly sequence is cleaned of per-base errors by mapping all the reads
onto the assembly.
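
To make the notion of an assembly graph concrete, the following is a minimal sketch that builds a directed overlap graph from a handful of toy reads. It assumes exact suffix-prefix matches and a hypothetical minimum overlap length `min_len`; real assemblers such as Raven rely on approximate alignment heuristics instead, so this illustrates the data structure only, not the actual Overlap phase.

```python
def suffix_prefix_overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that exactly equals a prefix of `b`.

    Returns 0 if no overlap of at least `min_len` characters exists.
    Exact matching is a simplifying assumption of this sketch.
    """
    for k in range(min(len(a), len(b)) - 1, min_len - 1, -1):
        if a[-k:] == b[:k]:
            return k
    return 0


def build_overlap_graph(reads, min_len=3):
    """All-versus-all comparison: nodes are read indices, edges are overlaps.

    Returns a dict mapping a directed edge (i, j) to its overlap length.
    """
    edges = {}
    for i, a in enumerate(reads):
        for j, b in enumerate(reads):
            if i != j:
                k = suffix_prefix_overlap(a, b, min_len)
                if k > 0:
                    edges[(i, j)] = k
    return edges


# Three toy reads sampled left to right from the sequence ACGTACGGTT.
toy_reads = ["ACGTAC", "GTACGG", "ACGGTT"]
toy_graph = build_overlap_graph(toy_reads)
```

In this toy case the graph is a simple chain from read 0 to read 2, so the Layout phase would be trivial; the difficulty discussed in the rest of the paper comes from graphs where repeats and spurious overlaps create many competing edges.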

Layout Theoretically, this phase is formulated as finding a Hamiltonian path over the assembly
graph, that is, a path that visits every node in the graph exactly once. However, due to errors in
basecalling sequenced fragments, sequencing artifacts, long repetitive genomic regions, and heuristic
overlap methods, the constructed graph either does not have a Hamiltonian path or contains spurious
nodes or edges which should be skipped. Therefore, instead of finding a path through the graph directly,
modern assemblers rely on heuristics to simplify the entire graph into a chain. This is done by iteratively
removing nodes and edges deemed unnecessary, such as transitive edges, dead-ends, and bubbles
[2, 3]. Frequently, however, some parts of assembly graphs, which usually correspond to highly
complex genome regions, cannot be simplified by the current heuristics. In situations where a unique
solution does not exist, contemporary assemblers cut out parts of these complex regions, which leads
to fragmented genome reconstructions. To this day, the problem of fragmentation continues to plague
all the existing de novo assemblers. The only remaining solution is laborious manual assembly
inspection and curation.

Related work So far, deep learning has been applied to different problems relevant to genome
assembly, such as Consensus [4], basecalling for ONT reads [5], and error-correction of HiFi reads [6],
in each case producing state-of-the-art results. However, to the best of our knowledge, no work has
been done on solving the Layout phase with deep learning. The closest one is [7], a proof-of-concept
work where the authors focus only on simulating the deterministic simplification steps in Layout, instead
of untangling the graphs directly. Due to the type of the problem, it is natural to tackle the Layout phase
with Graph Neural Networks (GNNs) [8], which have recently found wide application in a variety
of biological problems, ranging from drug design [9] and protein interactions [10] to predicting
anticancer foods [11].
The Hamiltonian cycle problem can be reduced to the Traveling Salesman Problem (TSP) by completing
the graph and adding a unit cost to newly added edges while keeping zero cost for the existing ones.
However, we cannot simply plug in existing state-of-the-art models for the TSP, such as the
graph transformer [12], for several reasons. First, assembly graphs are usually far larger than
those used in TSP research, often reaching hundreds of thousands of nodes and millions of edges.
Second, for assembly graphs it is not possible to use coordinates of nodes as node features, as in
the 2D TSP graphs that most of the research focuses on [13]. Finally, in the de novo assembly setting,
spurious nodes (reads) and edges (overlaps), created by imperfect sequencing procedures and
overlap algorithms, should be avoided. Consequently, we needed to conceive a GNN model carefully
tailored for finding a path in assembly graphs. Nevertheless, we draw inspiration from the work
done on applying learning-based methods to the TSP and follow a similar framework [14].
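
The reduction mentioned above can be illustrated with a small sketch: the graph is completed, existing edges get cost 0, added edges get cost 1, and a Hamiltonian cycle exists exactly when the optimal tour has cost 0. The brute-force tour search and the function names are illustrative assumptions and are only feasible for tiny graphs; they are not part of the proposed framework.

```python
from itertools import permutations


def tsp_costs(n, edges):
    """Cost matrix of the completed directed graph.

    Existing edges cost 0; edges added by the completion cost 1.
    """
    edge_set = set(edges)
    return [[0 if (i, j) in edge_set else 1 for j in range(n)] for i in range(n)]


def min_tour_cost(cost):
    """Cost of the cheapest tour visiting every node once (brute force)."""
    n = len(cost)
    best = None
    for perm in permutations(range(1, n)):
        tour = (0,) + perm
        c = sum(cost[tour[i]][tour[(i + 1) % n]] for i in range(n))
        best = c if best is None else min(best, c)
    return best


# A directed 3-cycle has a Hamiltonian cycle, so the optimal tour costs 0.
cycle_cost = min_tour_cost(tsp_costs(3, [(0, 1), (1, 2), (2, 0)]))
# A directed chain has no Hamiltonian cycle: at least one added edge is needed.
chain_cost = min_tour_cost(tsp_costs(3, [(0, 1), (1, 2)]))
```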

3 Problem setup

We propose the learning framework depicted in Figure 1 to tackle the untangling of de novo assembly
graphs in an end-to-end manner. The approach requires a previously completed genome, for which
we use the recently reconstructed CHM13 human genome [15]. It starts with simulating reads
from the CHM13 chromosomes, from which assembly graphs are built using a de novo genome
assembler called Raven [3]. The next step is training a model to predict which edges lead to the optimal
assembly, running a search algorithm over those predictions, and translating the obtained paths into
contigs. Once the model is trained, we evaluate it on the real PacBio HiFi reads used in the reconstruction
of the CHM13 genome, and compare our approach to the heuristics Raven uses in the Layout phase.
Since both methods take the same assembly graph as input, this eliminates the effects different
graph-construction techniques could have on the final assembly and enables us to compare only the
techniques for resolving Layout.
Raven is one of the state-of-the-art assemblers for long reads. Although there are assemblers tailored
for PacBio HiFi data, such as hifiasm [16] and HiCanu [17], we chose Raven because, unlike many
other assemblers, it does not specialize in a single type of reads, but performs well on both PacBio
and ONT data. Even though in this work we focus on PacBio HiFi reads, we argue that the same
approach can be applied to ONT data, which we plan to demonstrate in future research. Moreover,
we argue that the same approach can be used not only on different types of long reads, but also with any
other OLC-based assembler.

3.1 Dataset

3.1.1 Simulating synthetic reads


We start with the CHM13 reference genome [15], which we first split into 23 chromosomes and then
sample using a tool called seqrequester [18]. Seqrequester samples the reference in such a manner
that the length distribution of simulated reads faithfully resembles the distribution of real HiFi reads. It
also saves the positional information of each read, which facilitates distinguishing real from false
overlaps in the graph. This process enables producing a large amount of data for supervised training,
as well as constructing accurate labels for the edges. Although we start with only 23 chromosomes,
resampling yields a different set of reads each time, which results in graphs that are
sufficiently different from one another to be considered new elements. Although seqrequester produces
error-free reads, given the high accuracy of HiFi data and the error-correction steps in the preprocessing,
we expect no significant difference between the distributions of graphs created from real and
simulated reads. The results support this assumption.

Figure 1: Machine learning framework for end-to-end task of untangling de novo assembly graphs
and genome sequence reconstruction.

3.1.2 Preprocessing real reads


We first error-correct the reads with hifiasm [16], thus reducing the number of mismatches, insertions, and
deletions in the reads. Then, we annotate the reads with their positional information by mapping
them onto the reference with minimap2 [19] and manually inspecting complex regions. Although
annotations are not mandatory for evaluating the quality of the assembly, they allow us to measure
how well the model trained on synthetic graphs predicts edge labels on the real-world
graphs. This, additionally, gives a clue as to whether generalizing from synthetic to real-world data is
possible. Annotating the real reads also makes it possible to use them for training, which could improve
performance. We expect this dataset to be important for the future development and testing of similar
learnable approaches to untangling assembly graphs, and thus make it available together with the
code and present it in more detail in the Supplementary materials, Section D.

3.1.3 Constructing graphs and labeling edges


Once the reads are prepared, the assembly graphs are constructed using Raven’s Overlap phase [3].
We output the generated graph prior to any simplification steps in order to avoid errors that can
occur during these steps. Due to the heuristic nature of the alignment algorithms used for overlapping the
reads, and due to complex genomic regions, the graph contains edges which, if traversed, lead to a
suboptimal genome reconstruction. In order to train the model to avoid such edges, we assign them a
label of 0, while all the edges which lead to the optimal reconstruction are labeled 1. We obtain these
labels by running a DFS-like algorithm which finds the optimal path while taking into account the
positional information saved during read preparation. Because transitive edges are present in the
assembly graph, a large majority of edges are labeled positively, which makes the dataset highly
imbalanced towards the positive label. This leaves us with a fully prepared dataset on
which a model can be trained for the task of binary edge classification. All the tools used in the
described framework are publicly available and free to use: links to the CHM13 dataset [15] and the
code of Raven [3], hifiasm [16], and minimap2 [19] are available in their respective papers, while
seqrequester is available at https://github.com/marbl/seqrequester.

3.2 Model

We train a graph neural network called GatedGCN [20] that computes a d-dimensional
representation of nodes and edges in the assembly graph, from which an MLP classifier outputs the
probability that a given edge leads to the optimal reconstruction. We compare these predicted
probabilities to the correct edge labels obtained during the graph construction to calculate the binary
cross-entropy loss and minimize it with stochastic gradient descent. During inference, we run a greedy
search over the probabilities to find paths in the graphs, which are subsequently converted to contigs
representing the reconstructed genome. The motivation for using GatedGCN comes from previous
work on the path-searching TSP [14, 21] and from a GNN benchmark where it
outperformed other models on several tasks [22].

3.2.1 Input Features


The input edge features $z_{ij} \in \mathbb{R}^{d_e}$ are composed of the length and quality of an overlap between two
reads represented by two nodes of the assembly graph ($d_e = 2$). They are normalized with standard
z-scoring (i.e. zero mean and unit standard deviation) over all edges.
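
The normalization of edge features can be sketched as follows. The feature values are made up for illustration, and the use of the population standard deviation (rather than the sample one) is an assumption of this sketch, since the text does not specify which is used.

```python
from statistics import mean, pstdev


def zscore_columns(rows):
    """Z-score each feature column over all edges.

    `rows` holds one [overlap_length, overlap_quality] pair per edge.
    A constant column is left centered but unscaled to avoid dividing by zero.
    """
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sds = [pstdev(c) or 1.0 for c in cols]
    return [[(x - m) / s for x, m, s in zip(row, mus, sds)] for row in rows]


# Hypothetical overlap lengths (bp) and similarity scores for three edges.
edge_features = [[1000, 0.90], [2000, 0.95], [3000, 1.00]]
normalized = zscore_columns(edge_features)
```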
We do not have any available genomic node features. In the absence of features, nodes are anonymous,
and GNNs are known to either perform poorly or completely fail to classify isomorphic graphs or
detect basic patterns like cycles or cliques [23, 24, 25]. The main reason for this failure is the lack
of canonical positional information in arbitrary graphs. As such, recent works have proposed to
augment node features with graph positional encodings [26, 27, 28, 29]. In the case of the
directed assembly graph, we propose to use the in-degree, out-degree and k-step diffused PageRank
vector [30] as node positional encoding. Note that these node positions are invariant to indexing
permutation, which is essential for generalization. The k-step PageRank vector is generated with the
following iterative scheme
$$p^{k+1} = \alpha\,(D^{-1}A)^{T} p^{k} + (1-\alpha)\,\frac{1_n}{n} \in \mathbb{R}^{n}, \qquad p^{k=0} = \frac{1_n}{n} \in \mathbb{R}^{n}, \tag{1}$$
where $A \in \mathbb{R}^{n \times n}$ is the adjacency matrix of the directed assembly graph, $n$ is the number of nodes,
$D \in \mathbb{R}^{n \times n}$ is the diagonal matrix of the out-degree vector $A 1_n \in \mathbb{R}^{n}$, $1_n \in \mathbb{R}^{n}$ is the $n$-dimensional vector
of ones, and $\alpha = 0.95$ is the random walker constant. To summarize, the input node features are
$$x_i = d_i^{\mathrm{in}} \,\|\, d_i^{\mathrm{out}} \,\|\, p_i^{1} \,\|\, \cdots \,\|\, p_i^{K} \in \mathbb{R}^{d_v},$$
where $\cdot \,\|\, \cdot$ is the concatenation operator and $d_v = 2 + K$, with 2
for the degrees and $K$ for the dimensionality of the PageRank vector.
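
The iterative scheme in Equation (1) and the resulting node features can be sketched in plain Python as below. The function names, the edge-list representation, and the handling of nodes without outgoing edges (they simply pass on no probability mass) are assumptions of this sketch; the graph is assumed to contain no duplicate edges.

```python
def pagerank_steps(n, edges, K, alpha=0.95):
    """Return [p^1, ..., p^K] following p^{k+1} = alpha (D^-1 A)^T p^k + (1 - alpha) 1_n / n."""
    out_deg = [0] * n
    for u, _ in edges:
        out_deg[u] += 1
    p = [1.0 / n] * n  # p^0 is the uniform vector
    steps = []
    for _ in range(K):
        nxt = [(1.0 - alpha) / n] * n
        for u, v in edges:
            # Node u spreads its mass evenly over its outgoing edges.
            nxt[v] += alpha * p[u] / out_deg[u]
        p = nxt
        steps.append(p)
    return steps


def node_features(n, edges, K, alpha=0.95):
    """Concatenate in-degree, out-degree and K PageRank values per node (d_v = 2 + K)."""
    in_deg = [0] * n
    out_deg = [0] * n
    for u, v in edges:
        out_deg[u] += 1
        in_deg[v] += 1
    steps = pagerank_steps(n, edges, K, alpha)
    return [[in_deg[i], out_deg[i]] + [p[i] for p in steps] for i in range(n)]


# On a directed 3-cycle every node is equivalent, so all PageRank values stay at 1/3.
cycle_features = node_features(3, [(0, 1), (1, 2), (2, 0)], K=3)
```

Because the features depend only on degrees and diffusion, relabeling the nodes permutes the rows but does not change their values, which is the permutation invariance noted above.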

3.2.2 Input layer


An initial layer of the network transforms node features $x_i \in \mathbb{R}^{d_v}$ and edge features $z_{ij} \in \mathbb{R}^{d_e}$ into
the d-dimensional node and edge representations.
This is done using two fully connected layers:
$$h_i^{0} = W_{1,2}\,\mathrm{ReLU}(W_{1,1}\, x_i + b_{1,1}) + b_{1,2} \in \mathbb{R}^{d}, \tag{2}$$
$$e_{ij}^{0} = W_{2,2}\,\mathrm{ReLU}(W_{2,1}\, z_{ij} + b_{2,1}) + b_{2,2} \in \mathbb{R}^{d}, \tag{3}$$
where $h_i^{0}$ is the initial node representation of the node $i$, $e_{ij}^{0}$ is the initial representation of the edge
$i \to j$, while all the $W$ and $b$ are learnable parameters.

3.2.3 GatedGCN
The main part of the network consists of multiple GatedGCN layers [20]. In addition to the original
GatedGCN, we include an edge feature representation $e_{ij} \in \mathbb{R}^{d}$ and use a dense attention map
$\eta_{ij} \in \mathbb{R}^{d}$ for the edge gates, as proposed in [31, 14]. The original implementations, like most
off-the-shelf GNN layers, are meant to be used on undirected graphs where messages are passed in
all directions. However, in the case of directed graphs this means that half of the messages are lost. For
assembly graphs, which have inherent directional information (from the beginning to the end of
the genome), it is crucial to address this lack of expressivity.
Let $i$ and $p \to q$ be respectively the node and the directed edge (we will use $pq$ to denote $p \to q$ for
simplicity) whose representations we want to update, and let their representations be $h_i^{l}$ and $e_{pq}^{l}$ at
layer $l$. Also, let all the predecessors of node $i$ be denoted with $j$ and all its successors with $k$. Then,
the node and edge representations at layer $l + 1$ will be computed as:
$$h_i^{l+1} = h_i^{l} + \mathrm{ReLU}\Big(\mathrm{BN}\Big(A_1^{l} h_i^{l} + \sum_{j \to i} \eta_{ji}^{f,l+1} \odot A_2^{l} h_j^{l} + \sum_{i \to k} \eta_{ik}^{b,l+1} \odot A_3^{l} h_k^{l}\Big)\Big) \in \mathbb{R}^{d}, \tag{4}$$
$$e_{pq}^{l+1} = e_{pq}^{l} + \mathrm{ReLU}\Big(\mathrm{BN}\Big(B_1^{l} e_{pq}^{l} + B_2^{l} h_p^{l} + B_3^{l} h_q^{l}\Big)\Big) \in \mathbb{R}^{d}, \tag{5}$$

where all $A, B \in \mathbb{R}^{d \times d}$ are learnable parameters, ReLU stands for rectified linear unit, BN for batch
normalization, and $\odot$ for the Hadamard product. The edge gates are defined as:
$$\eta_{ji}^{f,l} = \frac{\sigma(e_{ji}^{l})}{\sum_{j' \to i} \sigma(e_{j'i}^{l}) + \epsilon} \in [0, 1]^{d}, \qquad \eta_{ik}^{b,l} = \frac{\sigma(e_{ik}^{l})}{\sum_{i \to k'} \sigma(e_{ik'}^{l}) + \epsilon} \in [0, 1]^{d}, \tag{6}$$

where $\sigma$ is the sigmoid function, and $\epsilon$ is a small value in order to avoid division by zero. Observe
that we distinguish between the messages $\eta_{ji}^{f,l}$ passed along the edges, and $\eta_{ik}^{b,l}$ passed in the reverse
direction of the edges.
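
To illustrate how the forward and backward edge gates of Equation (6) differ, the following sketch computes both for scalar edge representations (d = 1). The dictionary-based graph representation and the function names are assumptions of this sketch; in the actual model the gates are d-dimensional vectors computed per layer.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def edge_gates(edges, e, eps=1e-6):
    """Forward and backward gates for scalar edge representations.

    edges: list of directed edges (src, dst)
    e:     dict mapping each edge to its scalar representation
    The forward gate of edge (j, i) is normalized over all edges entering i;
    the backward gate of edge (i, k) is normalized over all edges leaving i.
    """
    in_sum, out_sum = {}, {}
    for (u, v) in edges:
        s = sigmoid(e[(u, v)])
        in_sum[v] = in_sum.get(v, 0.0) + s
        out_sum[u] = out_sum.get(u, 0.0) + s
    fwd = {(u, v): sigmoid(e[(u, v)]) / (in_sum[v] + eps) for (u, v) in edges}
    bwd = {(u, v): sigmoid(e[(u, v)]) / (out_sum[u] + eps) for (u, v) in edges}
    return fwd, bwd


# Two edges enter node 2, while nodes 0 and 1 each have a single outgoing edge.
toy_edges = [(0, 2), (1, 2)]
fwd_gates, bwd_gates = edge_gates(toy_edges, {(0, 2): 0.0, (1, 2): 0.0})
```

In this toy graph the forward gates split the incoming messages of node 2 roughly in half, while each backward gate is close to 1 because nodes 0 and 1 have only one successor.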

3.2.4 MLP classifier


We use a multi-layer perceptron (MLP) on the node and edge representations produced by the $L$
GatedGCN layers to classify the edges. For each directed edge $i \to k$, the model computes a probability
$p_{ik}$ that the edge leads to the optimal assembly; the training target is 1 if the edge is in the solution and 0
otherwise. The probability is obtained by passing the concatenation of the representations of nodes
$i$ and $k$ and of the directed edge $i \to k$ through the MLP:
$$p_{ik} = \sigma\big(\mathrm{MLP}\big(h_i^{L} \,\|\, h_k^{L} \,\|\, e_{ik}^{L}\big)\big) \in [0, 1], \tag{7}$$
where $\sigma$ is the sigmoid function and $L$ denotes the last GatedGCN layer.

3.3 Sequence Decoding

We decode the paths on the graph to assemble the reads with a greedy search algorithm. If
all the edge predictions were correct and the graph topology were noiseless (i.e. a graph
composed of a single connected component with neither dead-ends nor cycles), extracting paths
from the graph would be trivial: start at any positively predicted edge and greedily choose
a sequence of edges with the highest probability in both the forward and backward graph directions.
However, neither of these conditions is met in practice. Hence, instead of performing a single greedy
search, we first sample B starting edges with Bernoulli sampling, unroll B greedy searches, and obtain
the set of paths {w_1, ..., w_B}. Then, we compute the lengths of the sequences corresponding to the
extracted paths, and choose the path with the longest sequence length. Subsequently, the selected path
is translated into a contig of our reconstructed genome by concatenating the overlapping reads in the
path. The nodes in the chosen path are masked out from the graph to avoid visiting them twice. The
decoding process continues iteratively until the length of the extracted path falls below a certain threshold.
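
The decoding loop described above can be sketched as follows. This is a simplified illustration under several assumptions that are not taken from the paper: starting edges are drawn by accepting each edge with its predicted probability (one possible reading of Bernoulli sampling), sequence length is computed as the sum of read lengths minus overlap lengths, and all names (`decode`, `greedy_walk`, `min_len`, and so on) are hypothetical.

```python
import random


def sequence_length(path, read_len, overlap_len):
    """Length of the sequence spelled by a path: reads minus shared overlaps."""
    total = sum(read_len[n] for n in path)
    return total - sum(overlap_len[(a, b)] for a, b in zip(path, path[1:]))


def greedy_walk(start, succ, pred, prob, visited):
    """Extend a starting edge greedily, first forward and then backward."""
    u, v = start
    path = [u, v]
    seen = set(visited) | {u, v}
    cur = v
    while True:  # forward: follow the most probable outgoing edge
        cands = [w for w in succ.get(cur, []) if w not in seen]
        if not cands:
            break
        nxt = max(cands, key=lambda w: prob[(cur, w)])
        path.append(nxt)
        seen.add(nxt)
        cur = nxt
    cur = u
    while True:  # backward: follow the most probable incoming edge
        cands = [w for w in pred.get(cur, []) if w not in seen]
        if not cands:
            break
        prv = max(cands, key=lambda w: prob[(w, cur)])
        path.insert(0, prv)
        seen.add(prv)
        cur = prv
    return path


def sample_start_edges(edges, prob, B, rng, max_rounds=100):
    """Accept each edge with its predicted probability until B edges are drawn."""
    starts = []
    for _ in range(max_rounds):
        for e in edges:
            if rng.random() < prob[e]:
                starts.append(e)
                if len(starts) == B:
                    return starts
    return starts


def decode(edges, prob, read_len, overlap_len, B=50, min_len=0, seed=0):
    """Iteratively extract the longest of B greedy paths and mask its nodes."""
    rng = random.Random(seed)
    succ, pred = {}, {}
    for u, v in edges:
        succ.setdefault(u, []).append(v)
        pred.setdefault(v, []).append(u)
    visited, contigs = set(), []
    while True:
        free = [(u, v) for u, v in edges if u not in visited and v not in visited]
        starts = sample_start_edges(free, prob, B, rng)
        if not starts:
            break
        paths = [greedy_walk(s, succ, pred, prob, visited) for s in starts]
        best = max(paths, key=lambda p: sequence_length(p, read_len, overlap_len))
        if sequence_length(best, read_len, overlap_len) < min_len:
            break
        contigs.append(best)
        visited.update(best)
    return contigs


# Toy graph: a chain 0 -> 1 -> 2 -> 3 plus one spurious edge 1 -> 4.
toy_edges = [(0, 1), (1, 2), (2, 3), (1, 4)]
toy_prob = {(0, 1): 1.0, (1, 2): 1.0, (2, 3): 1.0, (1, 4): 0.0}
toy_read_len = {i: 10 for i in range(5)}
toy_overlap = {e: 5 for e in toy_edges}
toy_contigs = decode(toy_edges, toy_prob, toy_read_len, toy_overlap, B=5, min_len=15)
```

In the toy example, every sampled start edge grows into the same chain, the spurious edge is never followed because its predicted probability is zero, and decoding stops once no unvisited edge remains.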

4 Numerical Experiments
4.1 Model Training

We first simulate 18 read datasets from chromosome 19, from which assembly graphs are constructed.
Out of those 18, we use 15 as the training set and the remaining 3 as the validation set. There is
nothing particularly special about chromosome 19. We chose it because it is one of the smaller
chromosomes, while not being acrocentric. The acrocentric chromosomes are 13, 14, 15, 21, and 22, and
they are known to have more repetitive regions and to be more difficult to assemble. We initially chose
this setup to simplify training, but it soon produced strong results. As a sanity check, we additionally
trained the model on five graphs each from chromosomes 9, 19, and 22, but obtained only slightly better
results, which were not statistically significant. Therefore, we report the results obtained with training
and validating only on chromosome 19. The results achieved from training on a combination of
chromosomes 9, 19, and 22 are available in the Supplementary materials, Section C.
Since each set of reads was created anew by resampling the reference, they result in significantly
different graphs, even if the reads are sampled from the same chromosome. This is confirmed by the
fact that our model was able to generalize to different chromosomes with ease.
Due to their size, it was not possible to train the network on the entire graphs, so we partition the
graphs using the METIS clustering algorithm [32]. In each epoch, we randomly choose the number of
clusters from an interval between 400 and 600. The obtained clusters are then grouped into mini-batches of
size 50. We use the Adam optimizer [33] with an initial learning rate of $10^{-3}$, and minimize
the binary cross-entropy loss over each mini-batch. We decay the learning rate by a factor of 0.95
with a patience of 2 epochs. Due to the dataset being highly imbalanced towards the positive class, we
scale the learning rate for positive examples with an appropriate weight computed from the graphs in
the training set. The entire training was done on a single Nvidia A100 GPU and took 53 minutes.

Hyperparameters All the results reported in this paper are obtained with the network with hidden
dimension d = 256, number of GatedGCN layers L = 16, and 3 layers in the MLP classifier, resulting
in approximately 6.5 million parameters. They were obtained with a grid-search hyperparameter
optimization over the mentioned hyperparameters. For the number of greedy path candidates, we
choose B = 50 because that is the lowest number of candidate paths that consistently produces
optimal paths. We make the best performing pretrained model available together with the code.

4.2 Inference

At inference, we do not use METIS to cluster the graphs, in order to avoid cutting edges from
the graph. Instead, we feed the whole graph to the model, which in some cases requires us to run
the model on a CPU due to GPU memory limitations. However, the model is small enough that
a single forward pass of the network takes around 1 minute and 15 seconds for the largest graphs,
and below 1 minute for the smaller ones, on an AMD EPYC 7702 processor. This provides us
with edge probabilities which are used to guide the greedy search algorithm as described in Section
3.3, resulting in contigs which together form an assembled genome. Hence, we can evaluate our
reconstruction using the same metrics that are commonly used in de novo genome assembly. For this
we use a tool called Quast [34], and report several different measures:

• Number of contigs: Gives an insight into how fragmented our reconstruction is (lower is better).
• Longest contig: The length of the longest contig (higher is better).
• Genome fraction: Fraction of the genome which is reconstructed in our assembly (higher is
better).
• NG50: Length of the shortest contig in the smallest set of longest contigs that together cover at
least 50% of the reference genome (higher is better).
• NGA50: Calculated the same way as NG50, but on top of alignments between contigs and the
reference (higher is better).
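
As a worked example of the NG50 metric, the sketch below sorts contigs from longest to shortest and returns the length of the contig at which the running total first reaches half of the reference length. The function name and the convention of returning 0 when the contigs cover less than half of the reference are assumptions of this sketch, not a description of how Quast reports this case.

```python
def ng50(contig_lengths, reference_length):
    """Length of the contig at which the sorted contigs first cover 50% of the reference."""
    total = 0
    for length in sorted(contig_lengths, reverse=True):
        total += length
        if total >= 0.5 * reference_length:
            return length
    return 0  # the assembly covers less than half of the reference


# Reference of 120 Mbp: 40 alone is below 60, 40 + 30 = 70 reaches it, so NG50 = 30.
example_ng50 = ng50([40, 30, 20, 10], 120)
```

NGA50 follows the same computation but uses the lengths of aligned blocks between contigs and the reference instead of raw contig lengths.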

5 Results
First we evaluate whether the model, trained only on graphs generated from simulated chromosome 19
reads, can generalize to other chromosomes constructed also from simulated reads. As an additional
test, and to demonstrate the complexity of the problem, we run two naïve greedy searches, from
the same B starting edges as for decoding described in Subsection 3.3—one following the longest
overlap and the other following the highest overlap similarity. Starting from the same edges removes
the sampling randomness, and thus it is easy to see that the naïve approaches reconstruct significantly
shorter contigs. Moreover, we show that the model can generalize to other chromosomes with
surprising ease, maintaining high performance on the entire dataset. When compared to Raven’s
Layout heuristics, GatedGCN outputs significantly fewer contigs for every chromosome, while
having a comparable reconstructed genome fraction. Moreover, GatedGCN creates contigs with either
comparable or higher NG50 and NGA50 than Raven’s heuristics, in some cases improving them
by more than 100%. Full results of this experiment are available in the Supplementary materials,
Section B, where results of the naïve approaches are presented in Table 3 and results of GatedGCN
and Raven’s heuristics are in Table 4. In the same Section, we compare the number of mismatches
and insertions/deletions (indels) per 100,000 base pairs of GatedGCN and Raven, and demonstrate
that GatedGCN achieves fewer errors than Raven even in this comparison.

Figure 2: Our method’s reconstruction of chromosomes 6 (left) and 10 (right). The plot shows
the matching between the reference CHM13 chromosome on the x-axis and our reconstruction
on the y-axis. Ideally, only a single dark green line would be visible on the diagonal. Horizontal
dashed lines indicate fragmentation. Yellow and brown lines indicate low-quality mapping onto the
reference, 0-25% and 25-50%, respectively, while light green and dark green lines represent 50-75%
and 75-100% matching. Generated with DGenies [35].

After verifying that the network is able to generalize to synthetic graphs of other chromosomes, as
well as outperform Raven’s Layout heuristics, we evaluate whether the same prediction quality holds
for assembly graphs generated from real human HiFi data. We assemble all the chromosomes, and
compare our method with Raven’s heuristics. These results are reported in Table 1.
Once again, our method outperforms Raven’s heuristics on most of the reported metrics. It consistently
produces fewer contigs, while the reconstructed genome fraction is in most cases comparable or
higher: Raven’s heuristics reconstruct a larger fraction of the chromosome in only two cases, and the
gain is less than 0.5%. Longest contig, NG50, and NGA50 are also mostly comparable or higher, with just
a few cases where Raven’s heuristics outperform by a narrow margin. While NG50 is improved
significantly for most of the chromosomes, the most notable improvements in terms of NGA50 are seen
in chromosomes 6 and 10, for which it is more than 100% higher than with Raven’s methods.
Our reconstructions of these two chromosomes can be seen in Figure 2. When per-base errors are
compared, the presented method also mostly outperforms Raven’s heuristics. Full results can be seen in
Table 2.

6 Discussion
The experimental results demonstrate the practical value of the proposed learning-based
framework, which opens new avenues for further research. Next steps will include applying
the developed framework to genomes of different species and to different types of data, e.g. Oxford
Nanopore Technologies (ONT) reads. Since the assembly process with ONT reads is the same as
with PacBio reads, we expect that similar improvements in the contiguity of the assemblies are
possible. Furthermore, the generalization from one to all the other human chromosomes implies
a similar structure of chromosomes, and we argue that generalizing to some other species (mainly
mammals) is not far-fetched. One of the next challenges is haplotype-resolved assembly, which enables
correct phasing of information from parental genomes. This, however, will require additional changes
in the Overlap phase of genome assembly.

Table 1: Evaluation of our method (GatedGCN) and Raven’s Layout heuristics on all the real
chromosomes. The metrics used are number of contigs (Num ctg), longest contig (Longest), genome
fraction (GF), NG50, and NGA50. The asterisk denotes chromosome 19, which we used to simulate
reads for the training dataset.

     |             GatedGCN             |               Raven
chr  | Num  Longest  GF     NG50   NGA50  | Num  Longest  GF     NG50   NGA50
     | ctg  (Mbp)    (%)    (Mbp)  (Mbp)  | ctg  (Mbp)    (%)    (Mbp)  (Mbp)
1    | 26   115.6    98.1   73.0   46.3   | 241  86.9     97.6   44.4   44.4
2    | 20   73.1     99.6   35.1   35.1   | 56   73.1     98.9   28.1   28.1
3    | 6    127.0    99.6   127.0  56.0   | 45   90.5     99.5   56.0   56.0
4    | 8    139.0    99.0   139.0  34.8   | 78   67.8     99.0   34.9   34.9
5    | 8    123.6    99.1   123.6  103.5  | 47   103.5    99.0   103.5  103.5
6    | 7    101.0    98.9   101.0  52.8   | 20   110.3    98.7   110.3  25.9
7    | 17   58.6     98.1   42.6   25.7   | 69   29.3     98.0   25.1   17.5
8    | 12   68.8     98.6   33.9   28.5   | 33   31.6     98.4   28.5   28.5
9    | 17   67.1     95.0   31.9   16.1   | 139  38.9     90.2   19.7   15.8
10   | 13   47.7     99.3   36.7   36.7   | 43   36.7     99.2   17.2   17.2
11   | 7    65.4     99.9   35.3   23.2   | 31   35.3     99.7   32.6   23.2
12   | 11   57.2     99.9   31.0   31.0   | 33   57.2     99.8   31.0   31.0
13   | 13   73.0     96.1   73.0   30.1   | 116  47.5     95.9   25.5   25.5
14   | 9    82.6     97.8   82.6   82.6   | 32   82.6     97.2   82.6   82.6
15   | 19   47.1     93.6   13.4   10.0   | 157  29.0     93.5   9.0    8.5
16   | 28   16.0     91.6   8.7    5.9    | 164  16.4     90.8   5.9    5.7
17   | 11   29.9     96.4   15.7   10.2   | 47   12.9     96.1   9.0    9.0
18   | 8    44.9     97.6   44.9   17.4   | 45   43.5     97.9   43.5   17.4
*19  | 20   14.0     98.4   5.1    3.6    | 44   9.5      98.5   3.6    3.6
20   | 9    32.7     98.6   26.7   17.8   | 40   31.8     98.6   25.2   17.3
21   | 4    32.8     94.6   32.8   32.8   | 21   32.8     94.1   32.8   32.8
22   | 11   9.0      94.7   6.7    4.0    | 66   9.0      93.8   3.9    3.9
X    | 18   50.6     98.6   27.1   13.2   | 64   40.1     98.3   11.7   11.7

We are aware of other assemblers tailored for HiFi reads such as hifiasm [16], HiCanu [17],
rust-mdbg [36] and LJA [37]. However, the scope of this paper was not to present a whole new de novo
assembler, but to focus only on the Layout phase and to show that learning-based methods can bring
an improvement over the conventional combination of algorithms and heuristics used in the field.
A fair comparison requires the same starting assembly graphs. Assemblers produce
their own graphs, which might vary significantly in type, complexity and fragmentation. The Overlap
step is a critical prerequisite for successful reconstruction, but it is out of the scope of this work.
Given the importance of the Overlap step, the presented framework is modular enough that other
researchers can incorporate it into other OLC-based assemblers.

7 Conclusion

In this work, we introduce a new framework for untangling large graphs constructed in the de novo
assembly process with long reads. Existing de novo assemblers use a combination of algorithms
and hand-crafted heuristics that try to simplify graphs by removing known structures found
in them, and that cut the graphs into fragments when the heuristics cannot arrive at a unique solution. We
propose a different approach, one based on graph neural networks, where we predict favorable and
unfavorable edges for genome reconstruction, and then run a greedy search algorithm over these
predictions. By starting from the graph produced by one of the contemporary assemblers, Raven, we
show that our method can improve both the contiguity of the assemblies and their nucleotide error rates. We
trained our method only on graphs generated from synthetic reads of a single chromosome, and
are able to effectively untangle graphs generated from real human PacBio HiFi reads. We deem that

Table 2: Base-error metrics for our method (GatedGCN) and Raven’s Layout heuristics on graphs of
all the chromosomes, constructed from real reads. Asterisk indicates chromosome used in training.
GatedGCN (columns 2-3), Raven (columns 4-5); values per 100,000 base pairs
chr  Mismatch  Indel  Mismatch  Indel
1 2.54 0.91 5.30 1.21
2 1.50 0.64 2.23 0.85
3 3.47 0.69 2.46 0.73
4 1.32 0.65 3.63 0.75
5 2.65 0.54 4.20 0.74
6 0.84 0.50 1.09 0.56
7 2.89 0.99 2.29 1.16
8 2.53 0.72 1.79 0.73
9 5.22 1.94 8.98 2.59
10 3.60 0.93 2.85 1.12
11 0.65 0.74 1.59 1.04
12 0.37 0.53 1.68 0.67
13 2.16 0.63 8.95 1.50
14 1.57 1.18 1.80 1.14
15 6.02 1.56 10.55 2.43
16 8.82 1.98 12.99 2.52
17 6.01 1.33 7.19 1.45
18 4.60 0.69 7.75 0.92
*19 9.45 1.84 8.48 2.09
20 4.69 0.93 9.05 1.65
21 4.46 1.52 10.00 1.82
22 24.42 2.32 27.45 4.40
X 2.42 0.90 3.42 1.18

the presented framework demonstrates the usefulness of graph neural networks in real-life problems such as
genome reconstruction.

Acknowledgments and Disclosure of Funding


We thank Robert Vaser for helping us understand the Overlap phase in Raven, as well as for
modifying it to work better with PacBio HiFi reads. We also thank Filip Bosnić and Sara
Bakić for comments on the paper.
Lovro Vrček has been supported by "Young Researchers" Career Development Program DOK-
2018-01-3373, ARAP scholarship awarded by A*STAR, and core funding of Genome Institute of
Singapore, A*STAR. Xavier Bresson has been supported by NRF Fellowship NRFF2017-10 and
NUS-R-252-000-B97-133. Martin Schmitz has been supported by SINGA scholarship awarded
by A*STAR. Mile Šikić has been supported in part by the European Union through the European
Regional Development Fund under the grant KK.01.1.1.01.0009 (DATACROSS), by the Croatian
Science Foundation under the project Single genome and metagenome assembly (IP-2018-01-5886),
and by the core funding of Genome Institute of Singapore, A*STAR.

References
[1] Eric S Lander, Lauren M Linton, Bruce Birren, Chad Nusbaum, Michael C Zody, Jennifer
Baldwin, Keri Devon, Ken Dewar, Michael Doyle, William FitzHugh, et al. Initial sequencing
and analysis of the human genome. Nature, 409(6822):860–921, 2001.
[2] Heng Li. Minimap and miniasm: fast mapping and de novo assembly for noisy long sequences.
Bioinformatics, 32(14):2103–2110, 2016.
[3] Robert Vaser and Mile Šikić. Time-and memory-efficient genome assembly with raven. Nature
Computational Science, 1(5):332–336, 2021.
[4] Oxford Nanopore Technologies Ltd. Medaka, 2018.

[5] Ryan R Wick, Louise M Judd, and Kathryn E Holt. Performance of neural network basecalling
tools for oxford nanopore sequencing. Genome biology, 20(1):1–10, 2019.
[6] Gunjan Baid, Daniel E Cook, Kishwar Shafin, Taedong Yun, Felipe Llinares-Lopez, Quentin
Berthet, Aaron M Wenger, William J Rowell, Maria Nattestad, Howard Yang, et al. Deepconsensus:
Gap-aware sequence transformers for sequence correction. bioRxiv, 2021.
[7] Lovro Vrček, Petar Veličković, and Mile Šikić. A step towards neural genome assembly. arXiv
preprint arXiv:2011.05013, 2020.
[8] Franco Scarselli, Marco Gori, Ah Chung Tsoi, Markus Hagenbuchner, and Gabriele Monfardini.
The graph neural network model. IEEE transactions on neural networks, 20(1):61–80, 2008.
[9] Jonathan M Stokes, Kevin Yang, Kyle Swanson, Wengong Jin, Andres Cubillos-Ruiz, Nina M
Donghia, Craig R MacNair, Shawn French, Lindsey A Carfrae, Zohar Bloom-Ackermann, et al.
A deep learning approach to antibiotic discovery. Cell, 180(4):688–702, 2020.
[10] Pablo Gainza, Freyr Sverrisson, Frederico Monti, Emanuele Rodola, D Boscaini, MM Bronstein,
and BE Correia. Deciphering interaction fingerprints from protein molecular surfaces using
geometric deep learning. Nature Methods, 17(2):184–192, 2020.
[11] Guadalupe Gonzalez, Shunwang Gong, Ivan Laponogov, Michael Bronstein, and Kirill Veselkov.
Predicting anticancer hyperfoods with graph convolutional networks. Human Genomics,
15(1):1–12, 2021.
[12] Xavier Bresson and Thomas Laurent. The transformer network for the traveling salesman
problem. arXiv preprint arXiv:2103.03012, 2021.
[13] Christos H Papadimitriou. The euclidean travelling salesman problem is np-complete.
Theoretical computer science, 4(3):237–244, 1977.
[14] Chaitanya K Joshi, Thomas Laurent, and Xavier Bresson. An efficient graph convolutional
network technique for the travelling salesman problem. arXiv preprint arXiv:1906.01227, 2019.
[15] Sergey Nurk, Sergey Koren, Arang Rhie, Mikko Rautiainen, Andrey V Bzikadze, Alla
Mikheenko, Mitchell R Vollger, Nicolas Altemose, Lev Uralsky, Ariel Gershman, et al. The
complete sequence of a human genome. Science, 376(6588):44–53, 2022.
[16] Haoyu Cheng, Gregory T Concepcion, Xiaowen Feng, Haowen Zhang, and Heng Li. Haplotype-
resolved de novo assembly using phased assembly graphs with hifiasm. Nature methods,
18(2):170–175, 2021.
[17] Sergey Nurk, Brian P Walenz, Arang Rhie, Mitchell R Vollger, Glennis A Logsdon, Robert
Grothe, Karen H Miga, Evan E Eichler, Adam M Phillippy, and Sergey Koren. Hicanu: accurate
assembly of segmental duplications, satellites, and allelic variants from high-fidelity long reads.
Genome research, 30(9):1291–1305, 2020.
[18] Maryland Bioinformatics Labs. seqrequester.
[19] Heng Li. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics, 34(18):3094–
3100, 2018.
[20] Xavier Bresson and Thomas Laurent. Residual gated graph convnets. arXiv preprint
arXiv:1711.07553, 2017.
[21] Chaitanya K Joshi, Thomas Laurent, and Xavier Bresson. On learning paradigms for the
travelling salesman problem. arXiv preprint arXiv:1910.07210, 2019.
[22] Vijay Prakash Dwivedi, Chaitanya K Joshi, Thomas Laurent, Yoshua Bengio, and Xavier
Bresson. Benchmarking graph neural networks. arXiv preprint arXiv:2003.00982, 2020.
[23] Ryan Murphy, Balasubramaniam Srinivasan, Vinayak Rao, and Bruno Ribeiro. Relational
pooling for graph representations. In International Conference on Machine Learning, pages
4663–4673, 2019.
[24] Andreas Loukas. What graph neural networks cannot learn: depth vs width. In International
Conference on Learning Representations, 2020.
[25] Zhengdao Chen, Lei Chen, Soledad Villar, and Joan Bruna. Can graph neural networks count
substructures? Advances in neural information processing systems, 33:10383–10395, 2020.
[26] Vijay Prakash Dwivedi and Xavier Bresson. A generalization of transformer networks to graphs.
AAAI Workshop on Deep Learning on Graphs: Methods and Applications, 2021.

[27] Chengxuan Ying, Tianle Cai, Shengjie Luo, Shuxin Zheng, Guolin Ke, Di He, Yanming Shen,
and Tie-Yan Liu. Do transformers really perform badly for graph representation? Advances in
Neural Information Processing Systems, 34, 2021.
[28] Derek Lim, Joshua David Robinson, Lingxiao Zhao, Tess Smidt, Suvrit Sra, Haggai Maron, and
Stefanie Jegelka. Sign and basis invariant networks for spectral graph representation learning.
In ICLR 2022 Workshop on Geometrical and Topological Representation Learning, 2022.
[29] Vijay Prakash Dwivedi, Anh Tuan Luu, Thomas Laurent, Yoshua Bengio, and Xavier Bresson.
Graph neural networks with learnable structural and positional representations. In International
Conference on Learning Representations, 2022.
[30] Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. The pagerank citation
ranking: Bringing order to the web. Technical Report 1999-66, Stanford InfoLab, November
1999. Previous number = SIDL-WP-1999-0120.
[31] Xavier Bresson and Thomas Laurent. A two-step graph convolutional decoder for molecule
generation. arXiv preprint arXiv:1906.03412, 2019.
[32] George Karypis and Vipin Kumar. A fast and high quality multilevel scheme for partitioning
irregular graphs. SIAM Journal on scientific Computing, 20(1):359–392, 1998.
[33] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint
arXiv:1412.6980, 2014.
[34] Alexey Gurevich, Vladislav Saveliev, Nikolay Vyahhi, and Glenn Tesler. Quast: quality
assessment tool for genome assemblies. Bioinformatics, 29(8):1072–1075, 2013.
[35] Floréal Cabanettes and Christophe Klopp. D-genies: dot plot large genomes in an interactive,
efficient and simple way. PeerJ, 6:e4958, 2018.
[36] Barış Ekim, Bonnie Berger, and Rayan Chikhi. Minimizer-space de bruijn graphs: Whole-
genome assembly of long reads in minutes on a personal computer. Cell systems, 12(10):958–
968, 2021.
[37] Anton Bankevich, Andrey V Bzikadze, Mikhail Kolmogorov, Dmitry Antipov, and Pavel A
Pevzner. Multiplex de bruijn graphs enable genome assembly from long, high-fidelity reads.
Nature biotechnology, pages 1–7, 2022.

A Code
Code can be found at https://github.com/lvrcek/GNNome-assembly and instructions for
running it are in the README.md of the repository.
The main files in the repository are the following:

• example.py - Runs the entire framework on a small example, which includes setting up the
directory structure, downloading real data, simulating the reads, constructing the graphs, training
the model, and finding contigs in the assembly graph. Training is done on three chromosome 19
graphs, validation on one chromosome 19 graph, and inference on one chromosome 21 graph. Note
that this example, although small compared to the other experiments, still takes time because of the
amount of data that has to be downloaded and generated.
• config.py - File whose three dictionaries specify which chromosomes’ graphs, and how many of
each, are used for training, validation, and testing.
• pipeline.py - Runs the entire framework, similarly to example.py, but generates the data specified
in config.py and trains, validates, and tests on it.
• reproduce.py - Quickly reproduces the results reported in the paper when run with the --mode
argument set to either synth for synthetic data or real for real data. Just like example.py
and pipeline.py, it sets up the working directory if that has not been done before.
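
As an illustration of the kind of specification config.py holds, the sketch below mirrors the setup of example.py (three chromosome 19 graphs for training, one for validation, and one chromosome 21 graph for inference). The dictionary names, the "chrN" key format, and the helper function are assumptions made for this sketch; the actual names are defined in the repository.

```python
# Hypothetical layout: the names train_dict, valid_dict, test_dict and the
# "chrN" key format are assumptions for illustration, not the repository's API.
train_dict = {"chr19": 3}   # three chromosome 19 graphs for training
valid_dict = {"chr19": 1}   # one chromosome 19 graph for validation
test_dict = {"chr21": 1}    # one chromosome 21 graph for inference


def total_graphs(spec):
    """Return the total number of graphs requested in one specification."""
    for chrom, count in spec.items():
        if not chrom.startswith("chr") or count < 0:
            raise ValueError(f"invalid entry: {chrom!r}: {count!r}")
    return sum(spec.values())


print(total_graphs(train_dict), total_graphs(valid_dict), total_graphs(test_dict))
```

Keeping one dictionary per split makes it easy to check at a glance that the training and test chromosomes do not overlap unintentionally.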

The easiest way to get started is to follow the installation instructions and then run the example.py
script. This will set up the directories for storing references and data. The default data directory
is data/ and the default reference directory is data/references/. When the code is run for the
first time, the CHM13 reference is downloaded into data/references/CHM13/, unless it is
already located there. The real HiFi data, both the genomic sequences and the processed DGL graphs,
is also downloaded to data/real/. The real dataset is 43 GB compressed and 180 GB
uncompressed. Furthermore, running the script populates the data directory with the
directories simulated/ and experiments/. All the simulated data for a chromosome
is stored in its own directory inside data/simulated/. For example, simulated data
for chromosome 15 is stored inside data/simulated/chr15/.
These chromosome-specific directories follow a particular structure, in order to be compatible with
DGLDataset class and to have all data related to one graph in one place:

• raw/ - Directory where the genomic sequences are stored.
• raven_output/ - Directory where Raven stores assembly graphs after the Overlap phase and the
assembly sequence after the Layout phase.
• processed/ - Directory where the DGL graphs are stored.
• info/ - Directory with auxiliary information about the graphs, used in training and inference.
• graphia/ - Directory where a file in a format suitable for visualization in Graphia (see footnote 2) is stored.
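
The layout above can be sketched with the standard library as follows. The subdirectory names are taken from the list; the helper function and its name are hypothetical and are not part of the repository, which creates these directories itself.

```python
from pathlib import Path

# Subdirectory names are taken from the list above; the helper itself is a
# hypothetical illustration and is not part of the repository.
SUBDIRS = ("raw", "raven_output", "processed", "info", "graphia")


def make_chromosome_dirs(data_dir, chromosome):
    """Create data_dir/simulated/<chromosome>/ with its five subdirectories."""
    root = Path(data_dir) / "simulated" / chromosome
    for name in SUBDIRS:
        # parents=True also creates data_dir/simulated/ on first use
        (root / name).mkdir(parents=True, exist_ok=True)
    return root
```

For example, make_chromosome_dirs("data", "chr15") would create data/simulated/chr15/raw/ and its four sibling directories, and it is safe to call repeatedly.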

For purposes of training and inference, the simulated and real data used for training, validation, and
testing are copied into data/experiments/train_<out>/, data/experiments/valid_<out>/,
and data/experiments/test_<out>/, respectively, where <out> is specified at runtime
and denotes the name of the run. Inference is performed only on the graphs in
data/experiments/test_<out>/, where two additional subdirectories are created:

• inference/ - Directory where paths found during the decoding are stored.
• assembly/ - Directory where the contigs, obtained by translating paths into sequences, are stored.

2 Graphia is a tool for visualizing graphs. Read more about it here: https://graphia.app/

B Model trained only on chromosome 19

The first model is trained on 15 graphs created from simulated chromosome 19 reads and is tested
on graphs of all the chromosomes constructed from simulated (synthetic) data. Before comparing
the model’s performance against Raven’s heuristics, we perform a sanity check by comparing it
against two baseline methods: greedy search over overlap lengths and greedy search over overlap
similarities. To make the comparison between decoding with model-predicted probabilities and the
baseline approaches fair, and to avoid effects of the random sampling of edges, we first sample B edges as
starting positions, from each of which we run three greedy searches, each following a different criterion.
The first baseline method chooses the neighbor with the longest overlap, while the second baseline method
chooses the neighbor with the highest similarity. The results of these approaches can be seen in Table 3.
The results of decoding by following the edge probabilities predicted by the model are shown
in Table 4, together with the results produced by Raven. Note that in all three scenarios based on
greedy search (following overlap lengths, similarities, and predicted probabilities) the number of
contigs is the same, which is due to the sampling method described above. Both GatedGCN and Raven
clearly outperform the naïve baseline approaches, which indicates the complexity of
the problem. GatedGCN also outperforms Raven’s heuristics: on every chromosome in Table 4 it
produces fewer contigs (for example, 1 versus 32 for chromosome 11) and an NG50 at least as high as
Raven’s. For some baseline runs in Table 3, NG50 and NGA50 could not be computed because less
than 50% of the reference was reconstructed.
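
The comparison described above, one greedy search per criterion from the same starting edge, can be sketched as follows. This is a simplified illustration and not the decoding code of the framework: the graph representation, the score names, and the function name are assumptions, and the walk only extends forward from the starting edge.

```python
def greedy_walk(graph, start_edge, key):
    """Follow the best-scoring unvisited successor until no move is left.

    graph: dict mapping node -> list of (successor, scores) pairs, where
           scores is a dict with the entries "length", "similarity", "prob".
    start_edge: (source, target) pair from which the walk begins.
    key: name of the score to maximise at every step.
    """
    source, target = start_edge
    path = [source, target]
    visited = {source, target}
    current = target
    while True:
        candidates = [(s, sc) for s, sc in graph.get(current, []) if s not in visited]
        if not candidates:
            return path
        current, _ = max(candidates, key=lambda item: item[1][key])
        path.append(current)
        visited.add(current)


# Toy graph (made-up scores): at node "a" the longest overlap leads to "b",
# while the highest similarity and highest probability both lead to "c".
toy = {
    "s": [("a", {"length": 8, "similarity": 0.95, "prob": 0.9})],
    "a": [("b", {"length": 9, "similarity": 0.90, "prob": 0.2}),
          ("c", {"length": 5, "similarity": 0.99, "prob": 0.9})],
    "b": [("d", {"length": 7, "similarity": 0.97, "prob": 0.8})],
    "c": [("d", {"length": 6, "similarity": 0.98, "prob": 0.8})],
}

for criterion in ("length", "similarity", "prob"):
    print(criterion, greedy_walk(toy, ("s", "a"), criterion))
```

Because all three searches share the starting edge, any difference between the resulting paths is due to the scoring criterion alone, which is the point of the sampling scheme.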

Table 3: Contiguity metrics for the two baseline methods—greedy search choosing largest overlap
length and greedy search choosing largest overlap similarity. Both baselines were run from the same
sampled edges as greedy over model-provided probabilities, hence the same number of contigs.
Overlap length (columns 2-6), Overlap similarity (columns 7-11)
chr  Num ctg  Longest (Mbp)  GF (%)  NG50 (Mbp)  NGA50 (Mbp)  Num ctg  Longest (Mbp)  GF (%)  NG50 (Mbp)  NGA50 (Mbp)
1 23 56.5 62.9 30.3 11.7 23 56.5 73.6 42.0 19.8
2 12 111.7 82.5 64.4 64.4 12 111.8 84.4 64.4 64.4
3 4 101.9 91.7 101.9 101.9 4 102.3 97.1 102.3 101.9
4 6 115.1 60.1 115.1 34.9 6 115.0 60.3 115.0 34.8
5 6 33.5 24.0 - - 6 69.8 43.6 - -
6 5 26.2 34.9 - - 5 32.0 38.5 - -
7 16 48.3 40.1 - - 16 50.3 67.1 15.6 14.3
8 6 59.4 47.3 - - 6 99.5 73.7 99.5 28.5
9 13 41.2 71.9 16.1 15.9 13 40.7 81.8 24.9 15.9
10 7 37.2 40.5 - - 7 83.1 95.8 83.1 83.1
11 1 49.4 36.6 - - 1 52.9 39.2 - -
12 4 57.5 96.4 34.9 34.8 4 59.1 97.4 34.9 34.6
13 8 96.1 88.3 96.1 57.2 8 96.1 92.9 96.1 57.2
14 9 9.8 15.0 - - 9 86.8 95.1 86.8 86.8
15 14 47.3 70.3 13.4 13.4 14 41.0 75.2 13.4 13.4
16 21 18.7 52.7 0.6 0.1 21 18.5 64.0 10.0 2.0
17 8 19.2 56.2 7.4 2.2 8 28.3 57.4 17.8 12.2
18 3 59.1 94.3 59.1 26.1 3 59.2 94.5 59.2 26.5
19 2 31.7 51.5 31.7 0.9 2 31.6 51.5 31.6 0.9
20 4 36.9 94.5 36.9 33.9 4 37.1 94.7 37.1 33.4
21 1 4.3 9.5 - - 1 8.9 19.2 - -
22 4 26.8 62.4 26.8 26.8 4 26.9 72.0 26.9 26.9
X 9 17.5 32.3 - - 9 46.9 91.1 39.4 39.2
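
The dashes in Table 3 follow from how NG50 is defined: it is the length of the contig at which the cumulative length of the contigs, sorted from longest to shortest, first reaches half of the reference length. A minimal sketch, assuming this standard definition as used by tools such as QUAST [34] (the function name is ours):

```python
def ng50(contig_lengths, reference_length):
    """Return NG50, or None if the contigs cover less than half the reference."""
    half = reference_length / 2
    covered = 0
    for length in sorted(contig_lengths, reverse=True):
        covered += length
        if covered >= half:
            return length
    return None  # undefined: total contig length is below 50% of the reference


print(ng50([50, 30, 20], 100))
print(ng50([10, 5], 100))
```

NGA50 is computed in the same way, but over aligned blocks obtained by breaking contigs at misassemblies instead of over whole contigs, which is why it can be much lower than NG50 for the same assembly.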

We also compare the per-base errors, that is, the number of mismatches and indels (insertions and deletions)
per 100,000 base pairs, of the assemblies produced by GatedGCN and Raven. The
error rates vary greatly between chromosomes, but a side-by-side comparison shows that
GatedGCN outperforms Raven for most of them: its mismatch rate is lower on 17 of the 23 chromosomes. This can be seen in Table 5.
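
The normalization behind these rates is simple enough to state as code. The sketch below assumes that error counts and the number of aligned bases are already available (for example, from an assembly evaluation tool); the function name is ours.

```python
def errors_per_100kbp(error_count, aligned_bases):
    """Scale a raw error count to errors per 100,000 aligned base pairs."""
    if aligned_bases <= 0:
        raise ValueError("aligned_bases must be positive")
    return error_count * 100_000 / aligned_bases


# 50 mismatches over 1,000,000 aligned bases
print(errors_per_100kbp(50, 1_000_000))
```

Normalizing by aligned length is what makes the rates comparable between chromosomes of very different sizes.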

Table 4: Contiguity metrics for our method (GatedGCN) and Raven’s Layout heuristics on graphs
of all the chromosomes, constructed from simulated reads. Asterisk indicates chromosome used in
training, although the graph itself was different.
GatedGCN (columns 2-6), Raven (columns 7-11)
chr  Num ctg  Longest (Mbp)  GF (%)  NG50 (Mbp)  NGA50 (Mbp)  Num ctg  Longest (Mbp)  GF (%)  NG50 (Mbp)  NGA50 (Mbp)
1 23 77.9 97.8 56.5 30.5 233 56.4 97.6 30.4 30.4
2 12 111.8 99.2 87.8 87.0 57 111.0 99.1 86.7 86.7
3 4 102.8 99.2 102.8 101.9 36 101.8 99.4 101.8 101.8
4 6 188.7 99.0 188.7 58.4 60 67.5 99.1 58.4 58.4
5 6 169.6 99.1 169.6 101.2 51 101.1 99.1 101.1 101.1
6 5 133.0 98.8 133.0 100.9 18 110.4 98.6 110.4 100.3
7 16 69.2 98.2 54.4 25.7 57 40.9 98.0 26.7 24.1
8 6 99.5 98.6 99.5 32.0 27 59.3 98.4 31.9 31.9
9 13 41.2 91.4 24.9 15.9 133 41.2 89.9 16.1 15.9
10 7 83.2 99.3 83.2 83.1 42 46.4 99.1 37.2 37.2
11 1 134.9 99.8 134.9 32.9 32 40.9 99.7 31.3 31.3
12 4 59.1 99.7 37.7 34.8 21 59.1 99.7 35.0 34.8
13 8 96.6 96.2 96.6 57.2 98 62.0 95.7 62.0 57.2
14 9 86.8 97.8 86.8 86.8 30 86.0 97.0 86.0 86.0
15 14 41.0 93.8 13.4 13.4 134 29.1 92.5 9.0 9.0
16 21 34.5 91.6 21.3 15.8 136 15.9 89.6 9.5 9.5
17 8 24.3 96.7 19.8 13.2 41 18.3 96.2 12.9 12.9
18 3 59.9 97.4 59.9 26.5 34 59.4 97.5 59.4 26.0
*19 2 60.7 98.6 60.7 9.5 22 16.0 98.1 9.5 9.5
20 4 37.1 98.5 37.1 34.3 36 33.9 98.4 33.9 33.9
21 1 42.4 94.1 42.4 33.8 25 33.8 94.2 33.8 33.8
22 4 26.9 95.1 26.9 26.9 54 26.9 94.0 26.9 26.9
X 9 51.4 98.6 39.4 39.2 53 47.8 98.2 15.8 14.2

Table 5: Base-error metrics for our method (GatedGCN) and Raven’s Layout heuristics on graphs
of all the chromosomes, constructed from simulated reads. Asterisk indicates chromosome used in
training, although the graph itself was different.
GatedGCN (columns 2-3), Raven (columns 4-5); values per 100,000 base pairs
chr  Mismatch  Indel  Mismatch  Indel
1 4.41 0.46 6.21 0.52
2 0.72 0.10 1.54 0.26
3 2.83 0.09 2.16 0.09
4 0.84 0.04 3.40 0.22
5 5.04 0.17 4.09 0.33
6 1.04 0.05 0.61 0.06
7 1.02 0.11 3.09 0.40
8 1.25 0.16 2.49 0.18
9 5.16 0.88 11.76 2.04
10 7.17 0.39 1.86 0.20
11 0.42 0.06 0.74 0.07
12 0.57 0.01 1.05 0.05
13 1.24 0.08 7.31 0.54
14 2.07 0.24 1.64 0.17
15 5.08 0.89 9.91 1.50
16 7.78 0.92 14.04 1.59
17 6.15 0.42 6.21 0.52
18 4.34 0.17 4.89 0.19
*19 6.91 0.40 3.89 0.12
20 7.54 0.26 10.45 0.73
21 2.71 0.19 4.93 0.54
22 24.59 1.01 29.14 1.43
X 2.77 0.21 3.15 0.22

C Model trained on chromosomes 9, 19, and 22

In Section 4.1 of the paper, we mention that training only on chromosome 19 is not the only
experiment we performed: we also trained on a combination of graphs from chromosomes 9, 19,
and 22. Training on 15 graphs drawn from several chromosomes would be expected to
improve performance over training on 15 graphs of a single chromosome. This is the case here
as well, which also suggests that there is nothing special about chromosome 19. The model trained on the
same number of graphs from different chromosomes does perform slightly better, but the improvements
are not statistically significant. Therefore, in the paper we report only the results obtained by training
on a single chromosome.
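
One simple way to check such a claim is a paired comparison over chromosomes. The sketch below applies an exact two-sided sign test to the NGA50 values on real data, taken from Table 1 (model trained on chromosome 19) and Table 8 (model trained on chromosomes 9, 19, and 22). The choice of test and of metric is our assumption for illustration; it is not necessarily the analysis behind the statement above.

```python
from math import comb


def sign_test(first, second):
    """Exact two-sided sign test for paired samples; ties are discarded.

    Returns (wins, losses, p_value), where a win means second > first.
    """
    wins = sum(b > a for a, b in zip(first, second))
    losses = sum(b < a for a, b in zip(first, second))
    n = wins + losses
    if n == 0:
        return wins, losses, 1.0
    tail = sum(comb(n, i) for i in range(min(wins, losses) + 1)) / 2 ** n
    return wins, losses, min(1.0, 2 * tail)


# NGA50 (Mbp) on real data for chromosomes 1-22 and X, in table order.
single_chr = [46.3, 35.1, 56.0, 34.8, 103.5, 52.8, 25.7, 28.5, 16.1, 36.7, 23.2, 31.0,
              30.1, 82.6, 10.0, 5.9, 10.2, 17.4, 3.6, 17.8, 32.8, 4.0, 13.2]   # Table 1
three_chr = [46.3, 28.2, 56.0, 34.8, 103.5, 52.8, 25.7, 28.5, 16.1, 31.9, 32.8, 31.0,
             30.1, 82.6, 10.0, 8.7, 10.2, 17.4, 3.6, 26.1, 32.8, 4.1, 13.2]    # Table 8

print(sign_test(single_chr, three_chr))
```

On these values the three-chromosome model has the higher NGA50 on 4 chromosomes, the lower on 2, and ties on 17, which gives p = 0.6875, far from any conventional significance threshold.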
We test the model trained on chromosomes 9, 19, and 22 on both synthetic and real data, and compare
both the contiguity and per-base errors against Raven. For synthetic data, contiguity is reported in
Table 6 and per-base errors in Table 7; for real data, contiguity is reported in Table 8 and per-base
errors in Table 9.

Table 6: Contiguity metrics for our method (GatedGCN), trained on three chromosomes, and Raven’s
Layout heuristics on graphs of all the chromosomes, constructed from simulated reads. Asterisk
indicates chromosomes used in training, although the graphs were different.
GatedGCN (columns 2-6), Raven (columns 7-11)
chr  Num ctg  Longest (Mbp)  GF (%)  NG50 (Mbp)  NGA50 (Mbp)  Num ctg  Longest (Mbp)  GF (%)  NG50 (Mbp)  NGA50 (Mbp)
1 27 135.0 97.8 135.0 30.5 233 56.4 97.6 30.4 30.4
2 13 111.8 99.2 87.8 87.0 57 111.0 99.1 86.7 86.7
3 4 102.7 99.1 102.7 101.9 36 101.8 99.4 101.8 101.8
4 6 139.0 98.9 139.0 58.3 60 67.5 99.1 58.4 58.4
5 8 121.5 99.3 121.5 101.2 51 101.1 99.1 101.1 101.1
6 5 111.0 98.7 111.0 100.9 18 110.4 98.6 110.4 100.3
7 15 67.2 98.2 54.4 41.5 57 40.9 98.0 26.7 24.1
8 6 99.6 98.8 99.6 31.4 27 59.3 98.4 31.9 31.9
*9 17 41.2 92.4 17.6 15.9 133 41.2 89.9 16.1 15.9
10 7 83.2 99.3 83.2 83.1 42 46.4 99.1 37.2 37.2
11 1 134.9 99.8 134.9 35.3 32 40.9 99.7 31.3 31.3
12 5 59.1 99.8 37.6 34.8 21 59.1 99.7 35.0 34.8
13 7 96.2 95.9 96.2 57.2 98 62.0 95.7 62.0 57.2
14 8 86.8 97.8 86.8 86.8 30 86.0 97.0 86.0 86.0
15 20 41.0 94.3 13.4 13.4 134 29.1 92.5 9.0 9.0
16 19 34.5 92.0 21.3 15.8 136 15.9 89.6 9.5 9.5
17 7 38.7 96.6 20.5 18.4 41 18.3 96.2 12.9 12.9
18 4 59.9 97.3 59.9 26.5 34 59.4 97.5 59.4 26.0
*19 3 55.7 98.5 55.7 9.5 22 16.0 98.1 9.5 9.5
20 3 37.1 98.2 37.1 34.3 36 33.9 98.4 33.9 33.9
21 4 38.9 94.6 38.9 30.4 25 33.8 94.2 33.8 33.8
*22 6 26.9 94.5 26.9 26.9 54 26.9 94.0 26.9 26.9
X 6 51.4 98.5 47.1 40.0 53 47.8 98.2 15.8 14.2

Table 7: Base-error metrics for our method (GatedGCN), trained on three chromosomes, and Raven’s
Layout heuristics on graphs of all the chromosomes, constructed from simulated reads. Asterisk
indicates chromosomes used in training, although the graphs were different.
GatedGCN (columns 2-3), Raven (columns 4-5); values per 100,000 base pairs
chr  Mismatch  Indel  Mismatch  Indel
1 3.00 0.22 6.21 0.52
2 0.96 0.11 1.54 0.26
3 2.89 0.08 2.16 0.09
4 1.05 0.06 3.40 0.22
5 3.76 0.16 4.09 0.33
6 0.86 0.04 0.61 0.06
7 1.22 0.11 3.09 0.40
8 1.43 0.18 2.49 0.18
*9 6.19 1.10 11.76 2.04
10 4.11 0.27 1.86 0.20
11 0.32 0.04 0.74 0.07
12 0.82 0.02 1.05 0.05
13 2.75 0.13 7.31 0.54
14 0.91 0.15 1.64 0.17
15 4.68 0.60 9.91 1.50
16 7.38 0.82 14.04 1.59
17 6.61 0.60 6.21 0.52
18 4.38 0.09 4.89 0.19
*19 8.40 0.53 3.89 0.12
20 6.25 0.20 10.45 0.73
21 3.20 0.19 4.93 0.54
*22 21.92 0.92 29.14 1.43
X 3.09 0.18 3.15 0.22

Table 8: Contiguity metrics for our method (GatedGCN), trained on three chromosomes, and Raven’s
Layout heuristics on graphs of all the chromosomes, constructed from real reads. Asterisk indicates
chromosomes used in training.
GatedGCN (columns 2-6), Raven (columns 7-11)
chr  Num ctg  Longest (Mbp)  GF (%)  NG50 (Mbp)  NGA50 (Mbp)  Num ctg  Longest (Mbp)  GF (%)  NG50 (Mbp)  NGA50 (Mbp)
1 33 103.7 98.4 63.6 46.3 241 86.9 97.6 44.4 44.4
2 19 73.1 99.5 35.1 28.2 56 73.1 98.9 28.1 28.1
3 8 90.5 99.5 56.0 56.0 45 90.5 99.5 56.0 56.0
4 12 115.2 99.0 115.2 34.8 78 67.8 99.0 34.9 34.9
5 8 123.6 99.1 123.6 103.5 47 103.5 99.0 103.5 103.5
6 7 101.0 98.8 101.0 52.8 20 110.3 98.7 110.3 25.9
7 19 58.7 98.4 42.6 25.7 69 29.3 98.0 25.1 17.5
8 14 68.7 98.4 34.9 28.5 33 31.6 98.4 28.5 28.5
*9 19 66.8 90.9 31.9 16.1 139 38.9 90.2 19.7 15.8
10 14 40.1 99.2 36.7 31.9 43 36.7 99.2 17.2 17.2
11 7 66.2 99.8 35.3 32.8 31 35.3 99.7 32.6 23.2
12 12 57.2 99.9 31.0 31.0 33 57.2 99.8 31.0 31.0
13 15 73.0 96.1 73.0 30.1 116 47.5 95.9 25.5 25.5
14 9 82.6 97.5 82.6 82.6 32 82.6 97.2 82.6 82.6
15 24 47.0 93.9 13.4 10.0 157 29.0 93.5 9.0 8.5
16 29 16.0 90.9 8.7 8.7 164 16.4 90.8 5.9 5.7
17 13 22.7 96.4 10.8 10.2 47 12.9 96.1 9.0 9.0
18 6 45.0 97.4 45.0 17.4 45 43.5 97.9 43.5 17.4
*19 19 12.4 98.5 5.1 3.6 44 9.5 98.5 3.6 3.6
20 10 32.7 98.7 26.6 26.1 40 31.8 98.6 25.2 17.3
21 4 32.8 94.5 32.8 32.8 21 32.8 94.1 32.8 32.8
*22 11 11.0 95.3 8.3 4.1 66 9.0 93.8 3.9 3.9
X 17 50.6 98.5 27.1 13.2 64 40.1 98.3 11.7 11.7

Table 9: Base-error metrics for our method (GatedGCN), trained on three chromosomes, and Raven’s
Layout heuristics on graphs of all the chromosomes, constructed from real reads. Asterisk indicates
chromosomes used in training.
GatedGCN (columns 2-3), Raven (columns 4-5); values per 100,000 base pairs
chr  Mismatch  Indel  Mismatch  Indel
1 1.95 0.79 5.30 1.21
2 0.81 0.58 2.23 0.85
3 2.90 0.65 2.46 0.73
4 1.21 0.72 3.63 0.75
5 2.85 0.54 4.20 0.74
6 0.65 0.45 1.09 0.56
7 2.36 0.84 2.29 1.16
8 1.87 0.67 1.79 0.73
*9 4.17 1.58 8.98 2.59
10 7.06 1.08 2.85 1.12
11 0.78 0.78 1.59 1.04
12 0.68 0.56 1.68 0.67
13 2.63 0.58 8.95 1.50
14 1.27 0.97 1.80 1.14
15 7.80 1.99 10.55 2.43
16 7.55 1.94 12.99 2.52
17 6.62 1.57 7.19 1.45
18 6.10 0.68 7.75 0.92
*19 9.66 2.03 8.48 2.09
20 3.34 1.05 9.05 1.65
21 3.83 1.95 10.00 1.82
*22 26.99 2.98 27.45 4.40
X 2.16 0.84 3.42 1.18

D Dataset
The human reference genome, CHM13, consists of 23 chromosomes, which together are about 3.05 billion
base pairs long (the sum of the per-chromosome lengths in Table 10). The PacBio HiFi dataset from which
CHM13 was reconstructed, and which we use to evaluate our model, consists of 5.6 million reads. The
number of base pairs and reads for each
chromosome can be seen in Table 10, together with the number of nodes and edges in each graph
corresponding to a particular chromosome. During the graph construction, some of the reads are
found to be contained inside other, longer reads, which is why the graphs have fewer nodes than the
chromosomes have reads.
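
The effect of removing contained reads can be illustrated on read positions. The sketch below assumes each read is given as a (start, end) interval on the chromosome and drops every read that lies entirely inside another one. This is only an illustration: an assembler such as Raven detects containment from the overlaps between reads, not from known positions, and the function name is ours.

```python
def drop_contained(reads):
    """Return the reads that are not contained in another read.

    reads: list of (start, end) positions on the chromosome. Of several
    identical intervals, one copy is kept.
    """
    # Sort by start; for equal starts put the longer read first.
    ordered = sorted(reads, key=lambda r: (r[0], -r[1]))
    kept = []
    furthest_end = float("-inf")
    for start, end in ordered:
        if end <= furthest_end:
            continue  # lies inside a read that starts at or before it
        kept.append((start, end))
        furthest_end = end
    return kept


print(drop_contained([(0, 10), (2, 5), (8, 15), (9, 12)]))
```

In this toy input, (2, 5) is contained in (0, 10) and (9, 12) in (8, 15), so only two of the four reads would become nodes.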

Table 10: Statistics for the dataset based on real HiFi reads, aligned with minimap2, with the assembly
graphs created with Raven.
chr  Base pairs  Reads  Nodes  Edges
1 248,387,328 462,582 184,050 1,407,158
2 242,696,752 444,450 180,764 1,381,376
3 201,105,948 366,547 149,828 1,138,668
4 193,574,945 352,056 145,702 1,115,808
5 182,045,439 332,985 135,068 1,028,484
6 172,126,628 313,731 127,698 966,560
7 160,567,428 291,366 120,280 890,388
8 146,259,331 265,288 108,948 828,772
9 150,617,247 290,786 106,874 860,710
10 134,758,134 244,927 101,484 765,180
11 135,127,769 246,436 100,598 757,642
12 133,324,548 241,403 99,542 750,364
13 113,566,686 199,405 84,500 653,730
14 101,161,492 182,551 73,436 541,214
15 99,753,195 183,176 70,598 535,842
16 96,330,374 182,280 65,834 519,358
17 84,276,897 150,066 60,498 439,416
18 80,542,538 147,509 59,868 459,356
19 61,707,364 105,052 45,114 315,348
20 66,210,255 120,635 48,816 366,614
21 45,090,682 79,245 32,096 239,166
22 51,324,926 89,624 35,612 252,666
X 154,259,566 272,496 112,922 834,702

In the rest of this section, we provide statistics of the real PacBio HiFi reads, which we error-corrected
and annotated by finding the location of each read on its chromosome. These reads are
downloaded automatically when the repository is set up, as explained in Section A of these
supplementary materials. Figure 3 shows the histogram of read lengths and Figure 4 the histogram of
overlap lengths, each over the whole real dataset, as these plots do
not vary significantly between chromosomes. We also show the coverage of each chromosome
in the real dataset. Uneven coverage can indicate regions that are more difficult to assemble, such as
repetitive regions inside centromeres and telomeres, and gives insight into where fragmentation
is likely to happen. The coverage histograms are shown in Figures 5, 6, and 7.
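
A coverage profile of this kind can be computed directly from the annotated read positions. The sketch below assumes half-open (start, end) intervals with 0 <= start < end <= chromosome_length and returns the mean read depth per bin; the function name and the binning are our choices for illustration, not the code used to produce the figures.

```python
def binned_coverage(reads, chromosome_length, bin_size):
    """Mean read depth in consecutive bins of bin_size bases."""
    n_bins = -(-chromosome_length // bin_size)  # ceiling division
    covered = [0] * n_bins  # number of read bases falling into each bin
    for start, end in reads:
        for b in range(start // bin_size, (end - 1) // bin_size + 1):
            lo = max(start, b * bin_size)
            hi = min(end, (b + 1) * bin_size)
            covered[b] += hi - lo
    depths = []
    for b in range(n_bins):
        # The last bin may be shorter than bin_size.
        width = min((b + 1) * bin_size, chromosome_length) - b * bin_size
        depths.append(covered[b] / width)
    return depths


print(binned_coverage([(0, 10), (5, 15)], chromosome_length=20, bin_size=10))
```

Bins whose depth is far above or below the chromosome-wide average are candidates for the repetitive or poorly covered regions mentioned above.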

Figure 3: Histogram of read length of real HiFi reads from the dataset (the synthetic reads have
exactly the same length distribution).

Figure 4: Histogram of the overlap length of the Raven assembly graph based on the real HiFi reads
from the dataset.

Figure 5: Coverage histograms for real HiFi reads of chromosomes 1 to 8.

Figure 6: Coverage histograms for real HiFi reads of chromosomes 9 to 16.

Figure 7: Coverage histograms for real HiFi reads of chromosomes 17 to X.
