CNV software: CoverageMaster
https://doi.org/10.1093/bib/bbac049
Problem Solving Protocol
Abstract
CoverageMaster (CoM) is a copy number variation (CNV) calling algorithm based on depth-of-coverage maps designed to detect
CNVs of any size in exome [whole exome sequencing (WES)] and genome [whole genome sequencing (WGS)] data. The core of the
algorithm is the compression of sequencing coverage data in a multiscale Wavelet space and the analysis through an iterative Hidden
Markov Model. CoM processes WES and WGS data at nucleotide scale resolution and accurately detects and visualizes full size range
CNVs, including single or partial exon deletions and duplications. The results obtained with this approach support the possibility
for coverage-based CNV callers to replace probe-based methods such as array comparative genomic hybridization and multiplex
ligation-dependent probe amplification in the near future.
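
The multiscale wavelet compression named in the abstract can be pictured with a small sketch. It is not the
CoM implementation: it assumes a Haar-style averaging wavelet (the text does not state which wavelet family
is used), the helper names `approximation` and `compress` are hypothetical, and the coverage ratio is a toy
signal. At scale l, a signal of N samples is reduced to about N·2^-l approximation coefficients, which are
then normalized to the median of the original signal.

```python
from statistics import median


def approximation(signal, level):
    """Haar-style approximation: average adjacent pairs `level` times.

    Plain averaging (instead of the orthonormal 1/sqrt(2) scaling) is an
    assumption made here so that coefficients stay on the coverage-ratio scale.
    """
    coeffs = list(signal)
    for _ in range(level):
        if len(coeffs) % 2:  # pad odd-length signals with their last value
            coeffs.append(coeffs[-1])
        coeffs = [(a + b) / 2 for a, b in zip(coeffs[0::2], coeffs[1::2])]
    return coeffs


def compress(signal, level):
    """Approximation coefficients normalized to the median of the original signal."""
    med = median(signal)
    return [c / med for c in approximation(signal, level)]


# Toy case/control coverage ratio: 16 positions with a heterozygous
# deletion (ratio 0.5) covering positions 8 to 11.
ratio = [1.0] * 8 + [0.5] * 4 + [1.0] * 4

# "Zooming": the same region inspected at decreasing compression levels.
for level in (2, 1, 0):
    print(level, compress(ratio, level))
```

Lower levels keep more coefficients per region, which is why decrementing the scale lets smaller events stand
out at the cost of inspecting more points.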
$N \cdot 2^{-l}$ approximation coefficients $r_l$ are normalized to the median of the original signal and used
for CNV analysis.

Multiscale CNV detection
The probability $b_j(o_m)$ of each nucleotide-like position $m$ of the sequence of approximation coefficients
$r_l = o_1 o_2 \ldots o_k o_{k+1} \ldots o_M$ to be in a normal (i.e. diploid), duplicated or deleted state
$s \in S \equiv \{1, \tfrac{3}{2}, \tfrac{1}{2}\}$ is defined at any scale $l$ as a random variable with
Gaussian distribution of mean $s$ and standard deviation $\sigma(R_l)$, where $R_l(m)$ is the sequence of
approximation coefficients of the reference coverage in the $m$-coordinates of the $l$-scaled nucleotide-like
space.

At scale $l$, the indicator function (trigger) $T = \arg\max_{s}(b_s(r_l)) \neq 1$ identifies the locations of
non-diploid nucleotide-like positions and masks the rest of the signal. If no location is identified, the
algorithm discards this region and processes the next one.

Once the putative CNVs are identified, the Viterbi algorithm is used to identify the most likely copy number
state sequence $Q = q_1 q_2 \ldots q_k q_{k+1} \ldots q_M$ of the compressed genomic region, based on the
corresponding sequence of observations $r_l = o_1 o_2 \ldots o_k o_{k+1} \ldots o_M$. Masked observations
$o_k$ have a fixed diploid state $q_k = 1$.

More formally, if $v_m(j)$ represents the Viterbi probability that the underlying HMM is in copy number state
$j$ after seeing the first $m$ observations and passing through the most probable state sequence
$q_1 q_2 \ldots q_{m-1}$, it can be shown that $v_m(j) = \max_{i \in S} v_{m-1}(i)\, \alpha_{ij}\, b_j(o_m)$,
where $v_{m-1}(i)$ is the Viterbi path probability at the previous nucleotide, $\alpha_{ij}$ is the transition
probability (here set to $\alpha_{ij} = 5 \times 10^{-6}$, which is the probability of finding a duplication
or a deletion in the human genome, calculated as the mean of the inclusive and stringent number of CNVs per
nucleotide from [17]) and $b_j(o_m)$ is the observation probability given the state $j$ as defined above.

If no putative CNV is detected at this stage, the algorithm performs a multiscale analysis by repeating the
HMM phase with the masked signal transformed at scale $l - 1$. Again, in the absence of CNVs, the algorithm
keeps decrementing $l$ down to, if necessary, $l = 0$ (no compression). This is computationally feasible
because only the relevant unmasked regions are actually inspected. Otherwise, any putative CNVs are saved and
the algorithm proceeds to the next region.

Iteration over controls
If more than one control coverage is provided, putative CNVs and their masks are stored in a temporary buffer.
Following the assumption that a rare causative CNV cannot be present in any control sample, CNVs are
iteratively challenged with the Multiscale CNV Detection algorithm against each control.

Generation of simulated data
Heterozygous deletions and duplications in randomly picked exonic regions have been inserted in sample BAM
files using the Python library Pysam. Briefly, a script selects a random exonic position (inter-exonic or
across two or multiple exons) and, around that location, removes or duplicates half of the overlapping reads
in the sample BAM files, respectively. Coverage (COV) files are then produced with samtools depth following
the usual protocol and processed with CoM with standard parameters.

Results
CoM utilizes the representation of the coverage signal ratio (case over control) in the reduced Wavelet
approximation space to perform a multiscale analysis of aberrant coverage profiles, potentially underlying
causative CNVs, at nucleotide resolution (Figure 1, see Methods). This approach is meant to explore a broad
spectrum of CNV sizes and in particular deletions or duplications of <5 kb. At this scale, the experimental
noise is caused on one hand by the particular technology used for sequencing and, for WES, DNA selection by
hybridization. On the other hand, batch-specific coverage distortions may occur. Intuitively, the smaller the
CNV, the higher the chance that the call is a false positive. To overcome this problem, CoM exploits the fact
that, like all other genomic variations, clinically relevant CNVs are rare (MAF < 0.01%). Thus, it is
reasonable to assume that such CNVs cannot be present in two or more independent unrelated individuals of the
same batch. Following this basic principle, CoM utilizes a reference with the average coverage and standard
deviation of 15–20 samples processed with the same technology (hybridization kit, reagents and sequencer).
The reference provides the standard deviation per nucleotide from the expected coverage, where coverage
spikes are produced by reproducible experimental noise and/or recurrent CNVs. Matching CNVs in the test
sample are then considered as frequent or false positives and discarded. Moreover, CoM compares the case
sample pairwise with independent samples from the same batch, used as controls. Spikes present in both the
test coverage signal and a control sample are averaged out in the coverage ratio and consequently discarded.

4 | Rapti et al.
Figure 1. CoverageMaster workflow. CoM is based on depth-of-coverage maps from aligned short sequence reads
from WES or WGS. The normalized values of the depth-of-coverage for each nucleotide position are calculated
(Step 1). The ratio of the test to control coverage signal is compressed at a specified initial scale $l$
(= 25 by default) in the nucleotide-like space using the DWT (Step 2). For the compressed signal, an indicator
detects the potential non-diploid nucleotide-like positions (Step 3). An HMM is used to segment the compressed
signal into regions of similar copy number and assign CNV states (Step 4). If no putative CNVs are identified,
the process is repeated at scale $l - 1$ via 'zooming' (Step 5).

To demonstrate its efficiency, we tested CoM in various contexts of NGS data analysis. All samples processed
here for WES were hybridized with Twist Core Exome + RefSeq Spike and sequenced with Illumina HiSeq 4000 or
NovaSeq.

Most of the published algorithms use samples from 1000 Genomes to evaluate their performance (e.g. [19]).
Because the large majority of CNVs in these samples are quite frequent and of no clinical relevance, this
approach is not appropriate for CoM. To clarify this point, we sequenced and analyzed the exome of sample
NA12878, generally considered the gold standard for this analysis [20]. Whole-genome CNV calls validated by
several technologies are made available by the 1000G consortium at
https://www.internationalgenome.org/phase-3-structural-variant-dataset/. This sample has 45
exonic CNVs, of which 34 are frequent (MAF > 5%). As expected, CoM achieved a recall of 9/45 on the full set
but 9/11 on the rare CNV set (the two missed CNVs overlapped the covered exonic region by only a few bases).
ED identified 11/45 CNVs on the full set and 6/11 on the rare set. CODEX2 and CONTRA both detected 7/45 and
5/11, whereas PatternCNV scored 2/45 on the full set and 0/11 on the rare set (Supplementary Figure 1).

To design a more clinically oriented test to investigate the performance of CoM on CNVs of different sizes,
we created a dataset of simulated WES data starting from real BAM files obtained from 10 individuals where
array CGH did not previously provide any clinically
significant call. We preferred this approach to the generation of synthetic reads as performed in other
studies, which had to simulate the sequencing error model, the probability of a single nucleotide variant, the
GC content effect, etc. [18]. Indeed, the use of real samples automatically provides all the required
features. Around 2000 heterozygous duplications and deletions of 200, 500, 1000 and 5000 base pairs were
randomly introduced in the exonic regions of these samples (see Methods) and analyzed by CoM and other CNV
callers such as ExomeDepth [12], CONTRA [21], PatternCNV [22] and CODEX2 [19]. The results show that CoM has
the best performance, with an average sensitivity of 88.5% as compared with 77% obtained by the second best
performer ED (Figure 2c) and an average precision of 30% for CoM versus 16% obtained by ED with 25 control
samples (under the conservative hypothesis of considering all CNV calls not overlapping with the simulated
test as false positives). It is worth noting that, in contrast to ED, CoM precision drastically increases with
the number of control samples (Supplementary Figure 2). The explanation of this difference in performance
between CoM and the other tools becomes evident by stratifying the CNV calls by size. It is indeed the
multiscaling approach that enables CoM to keep a constant high sensitivity above 80% for all CNV sizes, in
contrast to the other tools, where the performance rapidly decreases with size reduction (Figure 2A).

Coverage Master | 5
Figure 2. CNV detection in simulated data and array CGH comparison on clinical samples. (A) (top) Number and
fraction of true calls detected by CoM (orange), ED (blue), CODEX2 (gray), CONTRA (green) and PatternCNV
(yellow) in 10 samples where CNVs of various sizes were randomly introduced in exonic regions. (bottom) Number
and fraction of true calls of detected CNVs by the above-mentioned tools stratified by size. (B) Cumulative
plots of number of calls (y-axis) detected by CoM (blue) and ED (green) and the number of CNVs found by array
CGH (red) in 12 samples: (top) all calls are considered; (bottom) only the rare CNVs (MAF < 1%) are included.

To further demonstrate the performance of CoM in standard clinical analyses, we analyzed 12 clinical samples
and compared CoM CNV calls to standard array CGH calls (see Methods). To provide a point of reference, we also
included the results obtained by ED given its reasonably good performance in the simulation test. In
Figure 2a, the cumulative true positive values for CNVs detected by CoM and ED are reported. CoM calls
coincide with almost all array CGH calls for each sample, with the exception of some frequent benign variants
discarded by CoM as they are present in most controls. In fact, when searching for CNVs with MAF < 1%, CoM
identifies all CNVs detected by array CGH, in contrast to ED, which detects 80% of them (Figure 2b). This
result demonstrates that CoM may replace array CGH in clinical diagnostic settings.

CoM has been mainly conceived as a diagnostic support tool for clinical genetics analysis. To provide a
perspective of the broad capabilities of the algorithm, we report four examples (three WES and one WGS) of
solved clinical cases.

Patient 1 is a 38-year-old male with Kallmann syndrome [OMIM 308700] born to a first-degree consanguineous
couple. WGS analysis with CoM revealed a homozygous deletion of 135 kbp including the first two exons of ANOS1
that completely explains the phenotype [23]. Interestingly, WGS data can also be analyzed as WES by CoM by
calculating the appropriate
exon coverage. From this perspective, the causative CNV appears as a full two-exon deletion of <100 bp in the
exonic space, detected by CoM but not by ED. Therefore, CoM can be used to perform an efficient clinical
analysis of WGS data in a two-step approach: first, through a high-resolution (100–200 bp) WES profile and
second, through a broad investigation of the genomic regions presenting with positive SNV calls from the
first step.

Figure 3. Examples of clinically relevant CNVs identified by CoM. (A) In green, the collapsed exon structure of
the gene of interest, with each up or down block representing one exon. Coverage profiles in exon space of
test sample, control and reference (color code in the legend) are represented in the second plot. For
patient 1 (see the text), the homozygous deletion of 135 kbp covering the last two exons of ANOS1 is clearly
visible in the WGS analysis but less evident in the WES analysis (called by CoM but not detected by ED). Below,
the respective coverage as reported by CoM in the genomic space for WGS and WES data. (B) For patient 2, the
partial heterozygous deletion of 115 kbp in SCN1A, detected from WES, is clearly visible in the exonic and
genomic spaces. (C) Homozygous deletion of exon 7 in SMN1 in patient 3, detected in WES data, is clearly
visible in the exonic and genomic spaces. It is worth noting that, in the genomic space, the coverage profile
seems to show two other exons with a drop in coverage. The control, dashed line in the plot above, shows the
same profile, indicating a fluctuation of the coverage in this region, likely independent from the number of
copies, or a common deletion.

Patient 2 is an 8-year-old female child diagnosed with drug-resistant epilepsy with febrile seizures. WES
single nucleotide variant analysis did not provide any candidate on a panel of 478 genes related to epilepsy
(Epilepsy MDG-1204.01, https://www.medigenome.ch/en/gene-panels/).
CoM reported a heterozygous deletion of ∼120 kbp partially overlapping the last 10 exons of SCN1A (Figure 3b).
The sodium channel 1A is associated with generalized epilepsy with febrile seizures, Type 2 [OMIM 604403].
Deletions in this gene are known to cause seizure disorders, ranging from early-onset isolated febrile
seizures to generalized epilepsy [24].

Patient 3 is a 3-year-old female child with suspected spinal muscular atrophy. WES analysis and array CGH were
negative, but CoM identified a homozygous deletion of exon 7 (112 bp, Figure 3c). This deletion, confirmed by
MLPA but not detected by ED, is the most frequent CNV related to SMN1-induced muscular atrophy [25] [SMA OMIM
253400]; it was eventually considered by the clinicians as the pathogenic cause of the patient's phenotype.

... and long reads as leading approach for SV detection [29]. We show that CoM can already be used to analyze
WGS data (Figure 3) and, in principle, there is no limitation to employing it on long-read data. A possible
improvement might involve the integration of the CoM CNV search and zooming process with split-read detectors
to provide precise breakpoint detection for large CNVs. This is crucial information for understanding the
impact of a CNV on the patient phenotype, especially in cancer [30]. Of note, we are planning to apply CoM to
tumor samples in the near future. Concerning WES data, CoM proved superior to the current state-of-the-art
algorithms in the detection of rare and small CNVs in simulated and clinical data, and it can be a valid and
inexpensive alternative to MLPA and array CGH in clinical settings.
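
The trigger and Viterbi recursion described under Multiscale CNV detection can be sketched as follows. This is
a minimal illustration in log space, not the CoM source code. The function names, the toy observations and
the per-position `sigma` list are hypothetical. It also assumes that the quoted transition probability of
5 × 10^-6 applies to each change of state (the probability of staying is the remainder), and that the trigger
flags positions whose most likely state, taken alone, is not the diploid one.

```python
import math

STATES = (1.0, 1.5, 0.5)  # diploid, duplicated, deleted (expected coverage ratio)
P_CHANGE = 5e-6           # assumed probability of moving to a different state


def log_emission(obs, state, sigma):
    """Log of a Gaussian density with mean `state` and standard deviation `sigma`."""
    z = (obs - state) / sigma
    return -0.5 * z * z - math.log(sigma * math.sqrt(2 * math.pi))


def trigger(obs, sigma):
    """True at positions whose best single-position state is not diploid."""
    return [max(STATES, key=lambda s: log_emission(o, s, sig)) != 1.0
            for o, sig in zip(obs, sigma)]


def viterbi(obs, sigma, mask):
    """Most likely state path; positions with mask False are pinned to diploid."""
    def emit(m, state):
        if not mask[m]:
            return 0.0 if state == 1.0 else -math.inf
        return log_emission(obs[m], state, sigma[m])

    def trans(i, j):
        return math.log(1 - 2 * P_CHANGE) if i == j else math.log(P_CHANGE)

    score = {s: emit(0, s) for s in STATES}
    back = []
    for m in range(1, len(obs)):
        new_score, pointers = {}, {}
        for j in STATES:
            best = max(STATES, key=lambda i: score[i] + trans(i, j))
            pointers[j] = best
            new_score[j] = score[best] + trans(best, j) + emit(m, j)
        score = new_score
        back.append(pointers)

    # Backtrack from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]


# Toy compressed ratio with a heterozygous deletion at positions 2 to 4.
obs = [1.02, 0.98, 0.52, 0.49, 0.51, 1.01, 1.0, 0.97]
sigma = [0.05] * len(obs)
mask = trigger(obs, sigma)
print(viterbi(obs, sigma, mask))
```

Working in log space avoids numerical underflow from repeated products with a transition probability of this
size; pinning masked positions to the diploid state mirrors the fixed $q_k = 1$ rule for masked observations.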
Contributions
M.R. contributed in developing the algorithm, performed the tools comparison and wrote the manuscript. Y.Z.
performed WGS analyses. J.M. performed samples preparation for WGS sequencing and quality assessment. E.R. and
S.E.A. analyzed CGH data and provided the clinical analysis of the samples reported in the study. F.A.S.
designed and supervised the study, wrote the algorithm and wrote the manuscript. All authors contributed to
the manuscript.

References
1. Shlien A, Malkin D. Copy number variations and cancer. Genome Med 2009;1:62.
2. Truty R, Paul J, Kennemer M, et al. Prevalence and properties of intragenic copy-number variation in Mendelian disease genes. Genet Med 2019;21:114–23.
3. Zack TI, Schumacher SE, Carter SL, et al. Pan-cancer patterns of somatic copy number alteration. Nat Genet 2013;45:1134–40.
4. Cancer Genome Atlas Research N, Weinstein JN, Collisson EA, et al. The cancer genome atlas pan-cancer analysis project. Nat Genet 2013;45:1113–20.
5. Stuppia L, Antonucci I, Palka G, et al. Use of the MLPA assay in the molecular diagnosis of gene copy number alterations in human genetic diseases. Int J Mol Sci 2012;13:3245–76.
6. Rieber N, Zapatka M, Lasitschka B, et al. Coverage bias and sensitivity of variant calling for four whole-genome sequencing technologies. PLoS One 2013;8:e66621.
7. Marshall CR, Bick D, Belmont JW, et al. The medical genome initiative: moving whole-genome sequencing for rare disease diagnosis to the clinic. Genome Med 2020;12:48.
8. Shigemizu D, Miya F, Akiyama S, et al. IMSindel: an accurate intermediate-size indel detection tool incorporating de novo assembly and gapped global-local alignment with split read analysis. Sci Rep 2018;8:5608.
9. do Nascimento F, Guimaraes KS. Copy number variations detection: unravelling the problem in tangible aspects. IEEE/ACM Trans Comput Biol Bioinform 2017;14:1237–50.
10. Sathirapongsasuti JF, Lee H, Horst BA, et al. Exome sequencing-based copy-number variation and loss of heterozygosity detection: ExomeCNV. Bioinformatics 2011;27:2648–54.
11. Krumm N, Sudmant PH, Ko A, et al. Copy number variation detection and genotyping from exome sequence data. Genome Res 2012;22:1525–32.
12. Plagnol V, Curtis J, Epstein M, et al. A robust model for read count data in exome sequencing experiments and implications for copy number variant calling. Bioinformatics 2012;28:2747–54.
13. Tan R, Wang Y, Kleinstein SE, et al. An evaluation of copy number variation detection tools from whole-exome sequencing data. Hum Mutat 2014;35:899–907.
14. Goodwin S, McPherson JD, McCombie WR. Coming of age: ten years of next-generation sequencing technologies. Nat Rev Genet 2016;17:333–51.
15. Van der Auwera GA, Carneiro MO, Hartl C, et al. From FastQ data to high confidence variant calls: the genome analysis toolkit best practices pipeline. Curr Protoc Bioinformatics 2013;43:11.10.1–33.
16. Li H, Durbin R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 2009;25:1754–60.
17. Zarrei M, MacDonald JR, Merico D, et al. A copy number variation map of the human genome. Nat Rev Genet 2015;16:172–83.
18. Xing Y, Dabney AR, Li X, et al. SECNVs: a simulator of copy number variants and whole-exome sequences from reference genomes. Front Genet 2020;11:82.
19. Jiang Y, Wang R, Urrutia E, et al. CODEX2: full-spectrum copy number variation detection by high-throughput DNA sequencing. Genome Biol 2018;19:202.
20. Gordeeva V, Sharova E, Babalyan K, et al. Benchmarking germline CNV calling tools from exome sequencing data. Sci Rep 2021;11:14416.
21. Li J, Lupat R, Amarasinghe KC, et al. CONTRA: copy number analysis for targeted resequencing. Bioinformatics 2012;28:1307–13.
22. Wang C, Evans JM, Bhagwate AV, et al. PatternCNV: a versatile tool for detecting copy number changes from exome sequencing data. Bioinformatics 2014;30:2678–80.
23. Franco B, Guioli S, Pragliola A, et al. A gene deleted in Kallmann's syndrome shares homology with neural cell adhesion and axonal path-finding molecules. Nature 1991;353:529–36.
24. Parihar R, Ganesh S. The SCN1A gene variants and epileptic encephalopathies. J Hum Genet 2013;58:573–80.
25. Ogino S, Wilson RB. Genetic testing and risk assessment for spinal muscular atrophy (SMA). Hum Genet 2002;111:477–500.
26. Solomon BD, Nguyen AD, Bear KA, et al. Clinical genomic database. Proc Natl Acad Sci U S A 2013;110:9851–5.
27. White SJ, Vissers LE, Geurts van Kessel A, et al. Variation of CNV distribution in five different ethnic populations. Cytogenet Genome Res 2007;118:19–30.
28. Stranneheim H, Lagerstedt-Robinson K, Magnusson M, et al. Integration of whole genome sequencing into a healthcare setting: high diagnostic rates across multiple clinical entities in 3219 rare disease patients. Genome Med 2021;13:40.
29. De Coster W, Weissensteiner MH, Sedlazeck FJ. Towards population-scale long-read sequencing. Nat Rev Genet 2021;22:572–87.
30. van Belzen IAEM, Schönhuth A, Kemmeren P, et al. Structural variant detection in cancer genomes: computational challenges and perspectives for precision oncology. NPJ Precis Oncol 2021;5:15.