Briefings in Bioinformatics, 2022, 23(2), 1–8

https://doi.org/10.1093/bib/bbac049
Problem Solving Protocol

CoverageMaster: comprehensive CNV detection and visualization from NGS short reads for genetic medicine applications
Melivoia Rapti, Yassine Zouaghi, Jenny Meylan, Emmanuelle Ranza, Stylianos E. Antonarakis and Federico A. Santoni
Corresponding author: Federico A. Santoni, Centre Hospitalier Universitaire Vaudois (CHUV), Lausanne, Switzerland; Medigenome, Swiss Institute of Genomic
Medicine, Geneva, Switzerland; University of Lausanne, Lausanne, Switzerland. E-mail: [email protected]

Abstract
CoverageMaster (CoM) is a copy number variation (CNV) calling algorithm based on depth-of-coverage maps designed to detect
CNVs of any size in exome [whole exome sequencing (WES)] and genome [whole genome sequencing (WGS)] data. The core of the
algorithm is the compression of sequencing coverage data in a multiscale Wavelet space and the analysis through an iterative Hidden
Markov Model. CoM processes WES and WGS data at nucleotide scale resolution and accurately detects and visualizes full size range
CNVs, including single or partial exon deletions and duplications. The results obtained with this approach support the possibility
for coverage-based CNV callers to replace probe-based methods such as array comparative genomic hybridization and multiplex
ligation-dependent probe amplification in the near future.

Keywords: medical genetics, copy number variants, signal processing

Introduction

Copy number variation (CNV) is the most frequent structural alteration in the human genome. Aberrant numbers of copies of specific genes, exons or, in general, genomic regions are known to be implicated in pathogenic conditions such as Mendelian diseases and cancer [1–4]. Hence, identification of these deletion and amplification events is a primary purpose in medical genetics research. In clinical diagnostics, the identification of rare, potentially causative CNVs in a patient with a suspected genetic disorder is a long-sought objective. However, the discovery of such variants, which can vary in size and copy number, is a challenging task. Currently, the most commonly used high-throughput methodologies to detect clinically relevant CNVs rely on microarray-based technologies. Array comparative genomic hybridization (array CGH) offers an efficient method to detect CNVs and micro-CNVs (5 Kbp < size < 10 Mbp) in the whole genome, but its resolution does not cover the lower size spectrum. Multiplex ligation-dependent probe amplification (MLPA) is the current gold standard to detect exon-sized CNVs, but this technology can cover only a few exons per assay (low throughput) and its application is limited to a small number of genes [5].

In recent years, the development of next-generation sequencing (NGS) technologies of short reads has provided a standardized way for accurate coding variant analyses through whole genome sequencing (WGS) and whole exome sequencing (WES). Remarkably, this technology provides the coverage per nucleotide of clinically relevant regions of the genome. Although WGS allows for a more comprehensive overview of the entire genome with uniform coverage [6], the related sequencing costs and the computational infrastructures needed to process the raw data are still limiting its broad application in clinical practice [7]. On the other hand, WES is computationally less demanding and has reached such a high
Melivoia Rapti is a PhD student at the University of Lausanne and at the Endocrinology, Diabetes and Metabolism Service, CHUV Hospital, Lausanne, Switzerland. She studies the application of computational pipelines to bioinformatics applications.
Yassine Zouaghi is a bioinformatician and PhD student at the University of Lausanne and at the Endocrinology, Diabetes and Metabolism Service, CHUV Hospital, Lausanne, Switzerland. His thesis focuses on finding new genes causing Congenital Hypogonadotropic Hypogonadism through WGS.
Jenny Meylan is a research technician at the Endocrinology, Diabetes and Metabolism Service, CHUV Hospital, Lausanne, Switzerland. She is interested in novel NGS technologies, data preparation and analysis.
Emmanuelle Ranza is a clinical geneticist and CMO at Medigenome, Swiss Institute of Genomic Medicine, Geneva, Switzerland. Her research interests focus on rare genetic diseases and the improvement of diagnostic and preventive genetic services to the population.
Stylianos E. Antonarakis is professor emeritus at the University of Geneva, Switzerland, and the CEO of Medigenome, Swiss Institute of Genomic Medicine. His current research work is to identify novel genes for autosomal recessive disorders by studying consanguineous families.
Federico A. Santoni is a group leader at the University of Lausanne and at the Endocrinology, Diabetes and Metabolism Service, CHUV Hospital, Lausanne, Switzerland. His research focuses on the development of computational methods for genomics and transcriptomics in the context of rare diseases, cancer and single cell multi-modal data processing.
Received: September 23, 2021. Revised: January 28, 2022. Accepted: January 31, 2022
© The Author(s) 2022. Published by Oxford University Press.
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (https://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial
re-use, please contact [email protected]

sensitivity and specificity in variant calling to eventually become a clinical standard. Currently, WES is widely used for diagnostic purposes in many medical genetics laboratories throughout the world.

A wide range of detection algorithms have been developed to call CNVs from WGS and WES data. It is customary to define as CNVs duplications and deletions with a size >1–5 Kbp, whereas the smaller ones are called INDELs. In the exomic space, however, the duplication/deletion of one exon (down to 100 bp or less) can result in a much bigger duplication/deletion in the genomic space, because the large majority of breakpoints occur in the intronic or intergenic part of the genome. Therefore, while split-reads- and gapped-reads-based algorithms [8] might be quite sensitive and precise when the breakpoints are covered (i.e. sequenced), in practice they are quite inefficient at detecting exonic structural variants (SVs) if the SV is bigger than the size of the read (∼100–200 bp in standard WES and WGS experiments) [9]. For this reason, while waiting for long-read NGS to take over in clinical applications, read-depth-based methods [10–12] are so far considered more effective for accurate copy number detection in WES data [13]. NGS short reads are mapped to a reference sequence and the depth-of-coverage (DoC) in a genomic region is calculated by counting the number of reads that align to this region. DoC is then assumed to be proportional to the copy number of that region. In principle, DoC is sufficient for the detection of all clinically relevant CNVs, irrespective of size, copy number and breakpoint location, promoting WGS and WES as a robust and more inclusive alternative to complementary laboratory approaches such as array CGH or MLPA.

Nevertheless, WES has technical issues that result in the generation of noisy data. First, the lack of continuity of the target regions and, second, the biases due to hybridization and sequencing processes complicate the procedure to standardize read-depth-based CNV detection [14]. As a result, current WES-based detection methods suffer from limited resolution and high numbers of false positive and false negative calls [9].

Here, we introduce CoverageMaster (CoM), a CNV calling algorithm based on DoC maps from aligned short sequence reads from WES or WGS. CNVs are inferred with Hidden Markov Models (HMMs) at multiscale nucleotide-like levels in the Wavelet reduced space, in contrast to existing methods that utilize fixed-length windows or exon averages. This approach is designed to optimize the search for CNVs of different sizes in WES and WGS data. Of note, since it works at nucleotide resolution, CoM provides the graphical representation of the predicted CNV in all genes of interest and, optionally, a wig formatted file compatible with UCSC Genome Browser for detailed visualization of the normalized coverage on the target genes or regions in the genomic space. We propose CoM as a potential first-line diagnostic tool in research and clinical applications.

Materials and methods

Material

The analyses reported in this study were performed on DNAs processed by WES at the Health 2030 Genome Center (https://www.health2030genome.ch/) or Medigenome (www.medigenome.ch) using Twist Human Core Exome Kit (TWIST Biosciences, San Francisco, CA, USA); NA12878 DNA was obtained from the Coriell Institute (https://coriell.org/); sequencing was performed on Illumina HiSeq4000 or Novaseq platforms. Array CGH and MLPA were performed in GeneSupport using Agilent SurePrint G3 Human 4x180K (analyzed with Agilent CytoGenomics (V 5.1.1.15) and double checked by visual inspection) and SALSA MLPA Probemix P021 SMA (MRC Holland), respectively.

Preprocessing and transformation of exome data

CoM uses DoC maps from aligned short sequence reads to estimate CNV events. To acquire the sequence reads, the mapping is done with the standard pipeline for whole-exome or WGS data based on GATK [15], and the coverage at each nucleotide of the region of interest (ROI) is calculated and stored in tab separated COV files (format: chr nucleotide_position coverage) using samtools (samtools depth) [16]. Coverage files of a test/target plus one or more controls plus one reference coverage serve as input for the algorithm. The assumption is that control coverages are DoC maps of copy number neutral cases (diploid) or carriers of frequent CNVs in the ROI. The reference set consists of a batch of coverage files from samples processed with the same technology (i.e. hybridization kit, reagents for library prep and sequencer) used to generate case and controls. First, the coverage per nucleotide per sample is normalized by the respective total number of reads. Then, the mean and standard deviation of the normalized coverage values are computed over all the samples for each nucleotide.

WAVELET transform

In a genomic region of $N$ nucleotides, the coverage of test case and control can be represented as the discrete signals $s(n)$ and $c(n)$, respectively, where $n$ is the nucleotide number corresponding to the genomic or exonic position in the exon space [the space where covered regions (i.e. exons) are 'ligated' together]. In the ideal case, the coverage ratio $r = s/c$ is a non-periodic square waveform with up and down steps in correspondence of increased or decreased copy number, respectively. In order to diminish the noise induced by fast variations of the signal and, at the same time, to reduce the computational burden, the coverage ratio is compressed in the nucleotide-like space using the Discrete Wavelet Transform (DWT) equipped with the Haar basis. At scale $l$, the approximation and detail coefficients are $r_l, d_l, d_{l-1}, \ldots, d_0 = \mathrm{DWT}_l(r)$. The $M =$

$N \cdot 2^{-l}$ approximation coefficients $r_l$ are normalized to the median of the original signal and used for CNV analysis.

Multiscale CNV detection

The probability $b_j(o_m)$ that each nucleotide-like position $m$ of the sequence of approximation coefficients $r_l = o_1 o_2 \ldots o_k o_{k+1} \ldots o_M$ is in a normal (i.e. diploid), duplicated or deleted state $s \in S \equiv \{1, \tfrac{3}{2}, \tfrac{1}{2}\}$ is defined at any scale $l$ as a random variable with Gaussian distribution of mean $s$ and standard deviation $\sigma(R_l)$, where $R_l(m)$ is the sequence of approximation coefficients of the reference coverage in the $m$-coordinates of the $l$-scaled nucleotide-like space.

At scale $l$, the indicator function (trigger) $T = \left[\operatorname{argmax}_s\, b_s(r_l) \neq 1\right]$ identifies the locations of non-diploid nucleotide-like positions and masks the rest of the signal. If no location is identified, the algorithm discards this region and processes the next one.

Once the putative CNVs are identified, the Viterbi algorithm is used to identify the most likely copy number state sequence $Q = q_1 q_2 \ldots q_k q_{k+1} \ldots q_M$ of the compressed genomic region, based on the corresponding sequence of observations $r_l = o_1 o_2 \ldots o_k o_{k+1} \ldots o_M$. Masked observations $o_k$ have a fixed diploid state $q_k = 1$.

More formally, if $v_m(j)$ represents the Viterbi probability that the underlying HMM is in copy number state $j$ after seeing the first $m$ observations and passing through the most probable state sequence $q_1 q_2 \ldots q_{m-1}$, it can be shown that $v_m(j) = \max_{i \in S} v_{m-1}(i)\,\alpha_{ij}\,b_j(o_m)$, where $v_{m-1}(i)$ is the Viterbi path probability at the previous nucleotide, $\alpha_{ij}$ is the transition probability (here set to $\alpha_{ij} = 5 \times 10^{-6}$, which is the probability of finding a duplication or a deletion in the human genome, calculated as the mean of the inclusive and stringent number of CNVs per nucleotide from [17]) and $b_j(o_m)$ is the observation probability given the state $j$ as defined above.

If no putative CNV is detected at this stage, the algorithm performs a multiscale analysis by repeating the HMM phase with the masked signal transformed at scale $l - 1$. Again, in the absence of CNVs, the algorithm keeps decrementing $l$ down to, if necessary, $l = 0$ (no compression). This is computationally feasible because only the relevant unmasked regions are actually inspected. Otherwise, any putative CNVs are saved and the algorithm proceeds to the next region.

Iteration over controls

When more than one control coverage is provided, any putative CNVs and the related masks are stored in a temporary buffer. Following the assumption that a rare causative CNV cannot be present in any control sample, CNVs are iteratively challenged with the Multiscale CNV Detection algorithm against each control.

Generation of simulated data

Heterozygous deletions and duplications in randomly picked exonic regions have been inserted in sample BAM files using the library Pysam from Python. Briefly, a script selects a random exonic position (inter-exonic or across two or multiple exons) and, around that location, removes or duplicates half of the overlapping reads in the sample BAM files, respectively. Coverage (COV) files are then produced with samtools depth following the usual protocol and processed with CoM with standard parameters.

Results

CoM utilizes the representation of the coverage signal ratio (case over control) in the reduced Wavelet approximation space to perform a multiscale analysis of aberrant coverage profiles, potentially underlying causative CNVs, at nucleotide resolution (Figure 1, see Methods). This approach is meant to explore a broad spectrum of CNV sizes and in particular deletions or duplications of <5 kb. At this scale, the experimental noise is caused on one hand by the particular technology used for sequencing and, for WES, DNA selection by hybridization. On the other hand, batch specific coverage distortions may occur. Intuitively, the smaller the CNV the higher the chance that the call is a false positive. To overcome this problem, CoM exploits the fact that, like all other genomic variations, clinically relevant CNVs are rare (MAF < 0.01%). Thus, it is reasonable to assume that such CNVs cannot be present in two or more independent unrelated individuals of the same batch. Following this basic principle, CoM utilizes a reference with the average coverage and standard deviation of 15–20 samples processed with the same technology (hybridization kit, reagents and sequencer). The reference provides the standard deviation per nucleotide from the expected coverage, where coverage spikes are produced by reproducible experimental noise and/or recurrent CNVs. Matching CNVs in the test sample are then considered as frequent or false positives and finally discarded. Moreover, CoM compares the sample case pairwise with independent samples, used as controls, from the same batch. Spikes present in the test signal coverage and in one control sample are averaged out in the coverage ratio and consequently discarded.

In order to prove its efficiency, we tested CoM in various contexts of NGS data analysis. All samples processed here for WES were hybridized with Twist Core Exome + RefSeq Spike and sequenced with Illumina HiSeq4000 or Novaseq.

Most of the published algorithms use samples from 1000 Genomes to evaluate their performance (e.g. [19]). Since the large majority of CNVs in these samples are quite frequent and of no clinical relevance, this approach is not appropriate for CoM. To clarify this point, we sequenced and analyzed the exome of sample NA12878, generally considered the gold standard for this analysis [20]. Whole Genome CNV calls validated by several technologies are made available by the 1000G consortium in
https://www.internationalgenome.org/

Figure 1. CoverageMaster workflow. CoM is based on depth-of-coverage maps from aligned short sequence reads from WES or WGS. The normalized values of the depth-of-coverage for each nucleotide position are calculated (Step 1). The ratio of the test to control coverage signal is compressed at a specified initial scale l (= 25 by default) in the nucleotide-like space using the DWT (Step 2). For the compressed signal, an indicator detects the potential non-diploid nucleotide-like positions (Step 3). An HMM is used to segment the compressed signal into regions of similar copy number and assign CNV states (Step 4). If no putative CNVs are identified, the process is repeated at scale l − 1 via 'zooming' (Step 5).
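
Steps 2 to 5 of the workflow can be illustrated with a minimal Python sketch. It is not the CoverageMaster implementation: the function names, the single scalar sigma (standing in for the per-position reference standard deviation), the uniform starting distribution and the choice to recompute the trigger mask at every scale are all assumptions made for illustration. The Haar approximation is written as pairwise block means, which equal the orthonormal Haar approximation coefficients up to a constant factor.

```python
import math

STATES = (1.0, 1.5, 0.5)  # expected coverage ratio: diploid, duplicated, deleted
P_SWITCH = 5e-6           # transition probability quoted in the Methods


def haar_approx(signal, level):
    """Haar approximation at `level`, expressed as block means.

    A trailing unpaired sample is dropped at each level (simplification).
    """
    out = list(signal)
    for _ in range(level):
        out = [(out[i] + out[i + 1]) / 2.0 for i in range(0, len(out) - 1, 2)]
    return out


def log_gauss(x, mean, sigma):
    """Log-density of the Gaussian observation probability b_j(o_m)."""
    return -0.5 * ((x - mean) / sigma) ** 2 - math.log(sigma * math.sqrt(2 * math.pi))


def trigger(obs, sigma):
    """True where the most likely single-position state is not diploid."""
    return [max(STATES, key=lambda s: log_gauss(o, s, sigma)) != 1.0 for o in obs]


def viterbi(obs, sigma, unmasked):
    """Most likely state path in log space; masked positions are forced diploid."""
    n = len(STATES)
    stay, switch = math.log(1.0 - (n - 1) * P_SWITCH), math.log(P_SWITCH)

    def emit(m, j):
        if not unmasked[m]:
            return 0.0 if j == 0 else -math.inf
        return log_gauss(obs[m], STATES[j], sigma)

    score = [emit(0, j) for j in range(n)]  # uniform start (assumption)
    back = []
    for m in range(1, len(obs)):
        step = [[score[i] + (stay if i == j else switch) for i in range(n)]
                for j in range(n)]
        ptr = [max(range(n), key=step[j].__getitem__) for j in range(n)]
        score = [step[j][ptr[j]] + emit(m, j) for j in range(n)]
        back.append(ptr)
    j = max(range(n), key=score.__getitem__)
    path = [j]
    for ptr in reversed(back):  # backtrack through the stored pointers
        j = ptr[j]
        path.append(j)
    return [STATES[j] for j in reversed(path)]


def detect(ratio, sigma, start_level):
    """Return (level, state path) for the first scale with a call, else None."""
    if not any(trigger(haar_approx(ratio, start_level), sigma)):
        return None  # nothing looks non-diploid: the region is discarded
    for level in range(start_level, -1, -1):  # "zoom" to finer scales
        obs = haar_approx(ratio, level)
        path = viterbi(obs, sigma, trigger(obs, sigma))
        if any(state != 1.0 for state in path):
            return level, path
    return None


ratio = [1.0] * 8 + [0.5] * 8 + [1.0] * 16  # toy heterozygous deletion
result = detect(ratio, sigma=0.08, start_level=3)
```

In this toy example the deletion occupies a single coefficient at scale 3, which is not enough evidence to pay for two state transitions, so the sketch zooms to scale 2, where two consecutive coefficients support the deleted state.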

phase-3-structural-variant-dataset/. This sample has 45 exonic CNVs of which 34 are frequent (MAF > 5%). As expected, CoM achieved a recall of 9/45 on the full set but 9/11 on the rare CNV set (the two missed CNVs overlapped the covered exonic region by only a few bases). ExomeDepth (ED) [12] identified 11/45 CNVs on the full set and 6/11 on the rare set. CODEX2 and CONTRA both detected 7/45 and 5/11, whereas PatternCNV scored 2/45 on the full set and 0/11 on the rare set (Supplementary Figure 1).

To design a more clinically oriented test to investigate the performance of CoM on CNVs of different size, we created a dataset of simulated WES data starting from real BAM files obtained from 10 individuals where array CGH did not previously provide any clinically

Figure 2. CNV detection in simulated data and array CGH comparison on clinical samples. (A) (top) Number and fraction of true calls detected by CoM (orange), ED (blue), CODEX2 (gray), CONTRA (green) and PatternCNV (yellow) in 10 samples where CNVs of various sizes were randomly introduced in exonic regions. (bottom) Number and fraction of true calls of detected CNVs by the above-mentioned tools stratified by size. (B) Cumulative plots of the number of calls (y-axis) detected by CoM (blue) and ED (green) and the number of CNVs found by array CGH (red) in 12 samples: (top) all calls are considered; (bottom) only the rare CNVs (MAF < 1%) are included.
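
The sensitivity and precision summarized in this figure can be computed from interval overlaps. The following Python sketch assumes that simulated CNVs and calls are given as (chromosome, start, end) tuples with half-open coordinates and that any overlap of at least one base counts as a match; the function names and the toy intervals are hypothetical. In line with the conservative hypothesis used in the text, every call that does not overlap a simulated CNV is counted as a false positive.

```python
def overlaps(a, b):
    """True if two (chrom, start, end) half-open intervals share at least one base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def sensitivity_precision(truth, calls):
    """Sensitivity over simulated CNVs and precision over calls.

    A call with no overlapping simulated CNV is treated as a false positive.
    """
    found = sum(any(overlaps(t, c) for c in calls) for t in truth)
    true_calls = sum(any(overlaps(c, t) for t in truth) for c in calls)
    sensitivity = found / len(truth) if truth else 0.0
    precision = true_calls / len(calls) if calls else 0.0
    return sensitivity, precision


# Toy data: three simulated CNVs and four calls (hypothetical coordinates).
truth = [("chr1", 100, 300), ("chr1", 1000, 1500), ("chr2", 50, 250)]
calls = [("chr1", 150, 320), ("chr2", 10, 60), ("chr2", 900, 1100), ("chr3", 5, 9)]
sens, prec = sensitivity_precision(truth, calls)
```

Here two of the three simulated CNVs are recovered and two of the four calls match a simulated CNV.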

significant call. We preferred this approach to the generation of synthetic reads as performed in other studies, which had to simulate the sequencing error model, the probability of a single nucleotide variant, the GC content effect, etc. [18]. Indeed, the use of real samples automatically provides all the required features. Around 2000 heterozygous duplications and deletions of 200, 500, 1000 and 5000 base pairs were randomly introduced in the exonic regions of these samples (see Methods) and analyzed by CoM and other CNV callers such as ExomeDepth (ED) [12], CONTRA [21], PatternCNV [22] and CODEX2 [19]. The results show that CoM has the best performance, with an average sensitivity of 88.5% as compared with 77% obtained by the second best performer ED (Figure 2A) and an average precision of 30% for CoM versus 16% obtained by ED with 25 control samples (under the conservative hypothesis of considering all CNV calls not overlapping with the simulated test as False Positives). It is worth noting that, in contrast to ED, CoM precision drastically increases with the number of control samples (Supplementary Figure 2). The explanation of this difference in performance between CoM and the other tools becomes evident by stratifying the CNV calls by size. It is indeed the multiscaling approach that enables CoM to keep a constant high sensitivity above 80% for all CNV sizes, in contrast to the other tools, where the performance rapidly decreases with size reduction (Figure 2A).

To demonstrate further the performance of CoM in standard clinical analyses, we analyzed 12 clinical samples and compared CoM CNV calls to standard array CGH calls (see Methods). In order to provide a point of reference, we also included the results obtained by ED given its reasonably good performance in the simulation test. Figure 2B reports the cumulative true positive values for CNVs detected by CoM and ED. CoM calls coincide with almost all array CGH calls for each sample, with the exception of some frequent benign variants discarded by CoM as they are present in most controls. In fact, when searching for CNVs with MAF < 1%, CoM identifies all CNVs detected by array CGH, in contrast to ED, which detects 80% of them (Figure 2B). This result suggests that CoM may replace array CGH in clinical diagnostic settings.

CoM has been mainly conceived as a diagnostic support tool for clinical genetics analysis. To provide a perspective of the broad capabilities of the algorithm, we report four examples (three WES and one WGS) of solved clinical cases.

Patient 1 is a 38-year-old male with Kallmann syndrome [OMIM 308700] born from a consanguineous couple of the first degree. WGS analysis with CoM revealed a homozygous deletion of 135 Kbp including the two first exons of ANOS1 that completely explains the phenotype [23]. Interestingly, WGS data can also be analyzed as WES by CoM by calculating the appropriate

Figure 3. Examples of clinically relevant CNVs identified by CoM. (A) In green, the collapsed exon structure of the gene of interest, up or down blocks representing one exon. Coverage profiles in exon space of test sample, control and reference (color code in the legend) are represented in the second plot. For patient 1 (see the text), the homozygous deletion of 135 Kbp covering the last two exons of ANOS1 is clearly visible in the WGS analysis but less evident in the WES analysis (called by CoM but not detected by ED). Below, the respective coverage as reported by CoM in the genomic space for WGS and WES data. (B) For patient 2, the partial heterozygous deletion of 115 Kbp in SCN1A, detected from WES, is clearly visible in the exonic and genomic spaces. (C) Homozygous deletion of exon 7 in SMN1 in patient 3, detected in WES data, is clearly visible in the exonic and genomic spaces. It is worth noting that, in the genomic space, the coverage profile seems to show two other exons with a drop in coverage. The control, dashed line in the plot above, shows the same profile, indicating a fluctuation of the coverage in this region, likely independent from the number of copies, or a common deletion.
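
The relation between the genomic space and the exon space used in these plots can be sketched in a few lines of Python. The sketch assumes exons are given as sorted, non-overlapping, half-open (start, end) intervals on one chromosome; the function names and coordinates are hypothetical and are not taken from CoverageMaster.

```python
from bisect import bisect_right


def build_offsets(exons):
    """Exon-space offset at the start of each exon, plus the total exonic length."""
    offsets, total = [], 0
    for start, end in exons:
        offsets.append(total)
        total += end - start
    return offsets, total


def to_exon_space(pos, exons, offsets):
    """Map a genomic position to its exon-space index (None if not in an exon)."""
    starts = [start for start, _ in exons]
    i = bisect_right(starts, pos) - 1
    if i < 0 or pos >= exons[i][1]:
        return None  # intronic or intergenic position
    return offsets[i] + pos - exons[i][0]


def to_genomic(n, exons, offsets):
    """Map an exon-space index back to its genomic position."""
    i = bisect_right(offsets, n) - 1
    pos = exons[i][0] + n - offsets[i]
    if n < 0 or pos >= exons[i][1]:
        raise IndexError("index outside the exon space")
    return pos


def exonic_length(start, end, exons):
    """Number of exonic bases covered by the genomic interval [start, end)."""
    return sum(max(0, min(end, e) - max(start, s)) for s, e in exons)


# Toy gene with three exons 'ligated' together in exon space.
exons = [(100, 110), (500, 520), (900, 905)]
offsets, total = build_offsets(exons)
```

With these toy exons, a genomic deletion spanning positions 105 to 903 is 798 bp long but covers only 28 exonic bases, which illustrates why a large genomic deletion can appear as a short event in the exon space.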

exon coverage. From this perspective, the causative CNV appears as a full two-exon deletion of <100 bp in the exonic space, detected by CoM but not by ED. Therefore, CoM can be used to perform an efficient clinical analysis of WGS data in a two-step approach: first, through a high-resolution (100–200 bp) WES profile and second, through a broad investigation of the genomic regions presenting with positive SNV calls from the first step.

Patient 2 is an 8-year-old female child diagnosed with drug-resistant epilepsy with febrile seizures. WES single nucleotide variant analysis did not provide any candidate on a panel of 478 genes related to epilepsy (Epilepsy MDG-1204.01, https://www.medigenome.ch/en/gene-panels/).

CoM reported a heterozygous deletion of ∼120 Kbp partially overlapping the last 10 exons of SCN1A (Figure 3b). The sodium channel 1A is associated with generalized epilepsy with febrile seizures, Type 2 [OMIM 604403]. Deletions in this gene are known to cause seizure disorders, ranging from early-onset isolated febrile seizures to generalized epilepsy [24].

Patient 3 is a 3-year-old female child with a suspicion of spinal muscular atrophy. WES analysis and array CGH were negative but CoM identified a homozygous deletion of exon 7 (112 bp; Figure 3c). This deletion, confirmed by MLPA but not detected by ED, is the most frequent CNV related to SMN1-induced muscular atrophy [25] [SMA OMIM 253400]; this deletion was eventually considered by the clinicians as the pathogenic cause of the phenotype of the patient.

Discussion

CoM is an NGS coverage-based CNV calling algorithm designed to work at nucleotide resolution with WES and WGS data. The capacity to analyze a given coverage signal at different scales, combined with the current availability of numerous controls in standard clinical batches, enables the detection of multi-sized clinically relevant deletions or duplications and in particular the detection of the so far elusive small CNVs of <5 Kbp. The algorithm has been designed to reduce the analysis burden by using all available control datasets to eliminate frequent CNVs and stochastic coverage variations. We have shown the effectiveness of CoM in comparison to ExomeDepth and other broadly used in silico CNV callers. Performance-wise, CoM is not the fastest algorithm available but is in line with the state of the art (Supplementary Table 1). With 10 control samples, CoM takes 6 h to analyze a full gene panel of 20 400 genes and around 1 h to process a WES panel of 4758 clinically relevant genes from OMIM and the Clinical Genomic Database ([26], https://research.nhgri.nih.gov/CGD/) on a 16-core machine with 32 GB of RAM. The analysis time, however, can be appreciably reduced by iteratively increasing the number of controls and, consequently, reducing the number of False Positives (Supplementary Figure 2). CoM, in common with all other read-depth-based algorithms, is sensitive to coverage variations induced by different hybridization kits and sequencing processes. Indeed, mismatches between samples, reference and controls can lead to a substantial increase in the number of False Positives. Nowadays, this problem is less pressing given that even small labs process hundreds of WES before updating the production lines. One caveat concerns the ethnicity of patients and controls and the interpretation of CoM results. A CNV can be frequent in a specific region or population and rare elsewhere. Therefore, as for single nucleotide variants, the ethnicity of the patient must be taken into account to reach an appropriate diagnosis [27]. Future developments on CNV detection will deal with WGS as the standard technology for genetic clinical applications [28] and long reads as the leading approach for SV detection [29]. We show that CoM can already be used to analyze WGS data (Figure 3) and, in principle, there is no limitation to employing it on long-read data. A possible improvement might involve the integration of the CoM CNV search and zooming process with split-read detectors to provide precise breakpoint detection for large CNVs. This is crucial information needed to understand the impact of the CNV on patient phenotype, especially in cancer [30]. Of note, we are planning to apply CoM to tumor samples in the near future. Concerning WES data, CoM proved superior to the current state-of-the-art algorithms in the detection of rare and small CNVs in simulated and clinical data, and it can be a valid and inexpensive alternative to MLPA and array CGH in clinical settings.

Key Points

• CoverageMaster (CoM) is designed to identify CNVs of any size at nucleotide resolution through multiscale analysis.
• Simulated and clinical data show that CoM significantly increased CNV call sensitivity with respect to the state of the art, especially in the lower size spectrum (50–1000 bp).
• CoM can analyze whole exome or whole genome sequencing data.
• The analysis at nucleotide resolution enables the visualization of the identified CNVs in the exonic and genomic space (Genome Browser) to further support the clinical interpretation of the calls.

Supplementary Data

Supplementary data are available online at https://academic.oup.com/bib.

Acknowledgments

We thank Professor Nelly Pitteloud, Lucia Bartoloni and Alexia Boizot from the Endocrine Diabetes and Metabolism service of the CHUV for helping with the sample preparation and analyses, Marco Belfiore from Genesupport for providing the array CGH data and Xavier Blanc from Medigenome for constructive discussions and suggestions.

Funding

This study was supported by the EU Framework Programme for Research and Innovation Action (RIA), Horizon 2020, No. 847941 (miniNO) and partially by the Swiss National Science Foundation (310030_185292) and Novartis Foundation (18AO52) to F.A.S.

Data availability

CoverageMaster is available at https://github.com/fredsanto/coverageMaster.

Contributions

M.R. contributed to developing the algorithm, performed the tools comparison and wrote the manuscript. Y.Z. performed WGS analyses. J.M. performed sample preparation for WGS sequencing and quality assessment. E.R. and S.E.A. analyzed CGH data and provided the clinical analysis of the samples reported in the study. F.A.S. designed and supervised the study, wrote the algorithm and wrote the manuscript. All authors contributed to the manuscript.

References

1. Shlien A, Malkin D. Copy number variations and cancer. Genome Med 2009;1:62.
2. Truty R, Paul J, Kennemer M, et al. Prevalence and properties of intragenic copy-number variation in Mendelian disease genes. Genet Med 2019;21:114–23.
3. Zack TI, Schumacher SE, Carter SL, et al. Pan-cancer patterns of somatic copy number alteration. Nat Genet 2013;45:1134–40.
4. Cancer Genome Atlas Research N, Weinstein JN, Collisson EA, et al. The cancer genome atlas pan-cancer analysis project. Nat Genet 2013;45:1113–20.
5. Stuppia L, Antonucci I, Palka G, et al. Use of the MLPA assay in the molecular diagnosis of gene copy number alterations in human genetic diseases. Int J Mol Sci 2012;13:3245–76.
6. Rieber N, Zapatka M, Lasitschka B, et al. Coverage bias and sensitivity of variant calling for four whole-genome sequencing technologies. PLoS One 2013;8:e66621.
7. Marshall CR, Bick D, Belmont JW, et al. The medical genome initiative: moving whole-genome sequencing for rare disease diagnosis to the clinic. Genome Med 2020;12:48.
8. Shigemizu D, Miya F, Akiyama S, et al. IMSindel: an accurate intermediate-size indel detection tool incorporating de novo assembly and gapped global-local alignment with split read analysis. Sci Rep 2018;8:5608.
9. do Nascimento F, Guimaraes KS. Copy number variations detection: unravelling the problem in tangible aspects. IEEE/ACM Trans Comput Biol Bioinform 2017;14:1237–50.
10. Sathirapongsasuti JF, Lee H, Horst BA, et al. Exome sequencing-based copy-number variation and loss of heterozygosity detection: ExomeCNV. Bioinformatics 2011;27:2648–54.
11. Krumm N, Sudmant PH, Ko A, et al. Copy number variation detection and genotyping from exome sequence data. Genome Res 2012;22:1525–32.
12. Plagnol V, Curtis J, Epstein M, et al. A robust model for read count data in exome sequencing experiments and implications for copy number variant calling. Bioinformatics 2012;28:2747–54.
13. Tan R, Wang Y, Kleinstein SE, et al. An evaluation of copy number variation detection tools from whole-exome sequencing data. Hum Mutat 2014;35:899–907.
14. Goodwin S, McPherson JD, McCombie WR. Coming of age: ten years of next-generation sequencing technologies. Nat Rev Genet 2016;17:333–51.
15. Van der Auwera GA, Carneiro MO, Hartl C, et al. From FastQ data to high confidence variant calls: the genome analysis toolkit best practices pipeline. Curr Protoc Bioinformatics 2013;43:11.10.1–33.
16. Li H, Durbin R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 2009;25:1754–60.
17. Zarrei M, MacDonald JR, Merico D, et al. A copy number variation map of the human genome. Nat Rev Genet 2015;16:172–83.
18. Xing Y, Dabney AR, Li X, et al. SECNVs: a simulator of copy number variants and whole-exome sequences from reference genomes. Front Genet 2020;11:82.
19. Jiang Y, Wang R, Urrutia E, et al. CODEX2: full-spectrum copy number variation detection by high-throughput DNA sequencing. Genome Biol 2018;19:202.
20. Gordeeva V, Sharova E, Babalyan K, et al. Benchmarking germline CNV calling tools from exome sequencing data. Sci Rep 2021;11:14416.
21. Li J, Lupat R, Amarasinghe KC, et al. CONTRA: copy number analysis for targeted resequencing. Bioinformatics 2012;28:1307–13.
22. Wang C, Evans JM, Bhagwate AV, et al. PatternCNV: a versatile tool for detecting copy number changes from exome sequencing data. Bioinformatics 2014;30:2678–80.
23. Franco B, Guioli S, Pragliola A, et al. A gene deleted in Kallmann's syndrome shares homology with neural cell adhesion and axonal path-finding molecules. Nature 1991;353:529–36.
24. Parihar R, Ganesh S. The SCN1A gene variants and epileptic encephalopathies. J Hum Genet 2013;58:573–80.
25. Ogino S, Wilson RB. Genetic testing and risk assessment for spinal muscular atrophy (SMA). Hum Genet 2002;111:477–500.
26. Solomon BD, Nguyen AD, Bear KA, et al. Clinical genomic database. Proc Natl Acad Sci U S A 2013;110:9851–5.
27. White SJ, Vissers LE, Geurts van Kessel A, et al. Variation of CNV distribution in five different ethnic populations. Cytogenet Genome Res 2007;118:19–30.
28. Stranneheim H, Lagerstedt-Robinson K, Magnusson M, et al. Integration of whole genome sequencing into a healthcare setting: high diagnostic rates across multiple clinical entities in 3219 rare disease patients. Genome Med 2021;13:40.
29. De Coster W, Weissensteiner MH, Sedlazeck FJ. Towards population-scale long-read sequencing. Nat Rev Genet 2021;22:572–87.
30. van Belzen IAEM, Schönhuth A, Kemmeren P, et al. Structural variant detection in cancer genomes: computational challenges and perspectives for precision oncology. NPJ Precis Oncol 2021;5:15.
