1038/nature08516
ARTICLES
Origins and functional impact of copy
number variation in the human genome
Donald F. Conrad1*, Dalila Pinto2*, Richard Redon1,3, Lars Feuk2,4, Omer Gokcumen5, Yujun Zhang1, Jan Aerts1,
T. Daniel Andrews1, Chris Barnes1, Peter Campbell1, Tomas Fitzgerald1, Min Hu1, Chun Hwa Ihm5,
Kati Kristiansson1, Daniel G. MacArthur1, Jeffrey R. MacDonald2, Ifejinelo Onyiah1, Andy Wing Chun Pang2,
Sam Robson1, Kathy Stirrups1, Armand Valsesia1, Klaudia Walter1, John Wei2, Wellcome Trust Case Control
Consortium†, Chris Tyler-Smith1, Nigel P. Carter1, Charles Lee5, Stephen W. Scherer2,6 & Matthew E. Hurles1
Structural variations of DNA greater than 1 kilobase in size account for most bases that vary among human genomes, but are
still relatively under-ascertained. Here we use tiling oligonucleotide microarrays, comprising 42 million probes, to generate a
comprehensive map of 11,700 copy number variations (CNVs) greater than 443 base pairs, of which most (8,599) have been
validated independently. For 4,978 of these CNVs, we generated reference genotypes from 450 individuals of European,
African or East Asian ancestry. The predominant mutational mechanisms differ among CNV size classes. Retrotransposition
has duplicated and inserted some coding and non-coding DNA segments randomly around the genome. Furthermore, by
correlation with known trait-associated single nucleotide polymorphisms (SNPs), we identified 30 loci with CNVs that are
candidates for influencing disease susceptibility. Despite this, having assessed the completeness of our map and the
patterns of linkage disequilibrium between CNVs and SNPs, we conclude that, for complex traits, the heritability void left by
genome-wide association studies will not be accounted for by common CNVs.
Genomes vary from one another in multifarious ways, and the totality
of this genetic variation underpins the heritability of human traits.
Over the past two years, the human reference sequence1 has been
followed by other genome sequences from individual humans
(reviewed in ref. 2) with fruitful comparisons. These studies show
the landscape of genetic variation, and allow estimation of the relative
contributions of sequence (base substitutions) and structural variation (indels (that is, insertions or deletions), CNVs and inversions).
For simplicity, in this study we use the term CNV to describe collectively all quantitative variation in the genome, including tandem
arrays of repeats as well as deletions and duplications.
Despite this growing genomic clarity, these classes of variation are
not equivalently recognized in human genetic studies. To appreciate
the functional impact and selective history of a variant, its correlation
with nearby variants must be characterized3 allowing imputation into
previously assayed genomes4, and experimental reagents and protocols are needed for the variant to be assayed in a cost-effective manner
in different samples.
Genome re-sequencing studies have shown that most bases that vary
among genomes reside in CNVs of at least 1 kilobase (kb)5,6.
Population-based surveys have identified thousands of CNVs, most
of which, due to limited resolution, are larger than 5 kb7–9. Their functional impact has been demonstrated across the full range of biology10,
from cellular phenotypes, such as gene expression11, to all classes of
human disease with an underlying genetic basis: sporadic, Mendelian,
complex and infectious (reviewed in ref. 12). This class of variation is,
nonetheless, poorly integrated into human genetic studies at all levels.
Not only are CNVs, especially smaller ones, underrepresented in
1The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA, UK. 2The Centre for Applied Genomics and Program in Genetics and Genomic
Biology, The Hospital for Sick Children, MaRS Centre-East Tower, 101 College Street, Room 14-701, Toronto, Ontario M5G 1L7, Canada. 3Inserm UMR915, L'institut du thorax, Nantes
44035, France. 4Uppsala: Department of Genetics and Pathology, Rudbeck Laboratory Uppsala University, Uppsala 751 85, Sweden. 5Department of Pathology, Brigham and Women's
Hospital and Harvard Medical School, Boston, Massachusetts 02115, USA. 6Department of Molecular Genetics, University of Toronto, Toronto M5S 1A8, Canada.
*These authors contributed equally to this work.
†Lists of participants and affiliations appear in Supplementary Information.
2009 Macmillan Publishers Limited. All rights reserved
NATURE
data was shared at an early stage with the WTCCC. The array used the
Agilent CGH platform and comprised 105,000 long oligonucleotide
probes. Its targets include 10,819 out of 11,700 (92%) of the candidate
CNV loci, and 375 other loci from published CNV surveys, including
292 new sequence insertions (Supplementary Methods)5,18. To perform
large-scale validation of candidate CNVs, we ran each of the 41 DNA
samples used in the discovery phase of this study on the CNV-typing
array against a pooled reference sample to minimize reference-specific
artefacts. By comparing the correlation between the discovery data and
the CNV-typing data across the same samples at each locus, we could
distinguish probable false-positives and true CNVs (Supplementary
Methods). Using this approach we estimated the false discovery rate
to be 15%, in good agreement with the estimate obtained from the
much smaller set of independent validation experiments using qPCR.
We then assayed 450 HapMap samples (180 CEU, 180 YRI, 45 JPT
(individuals in Tokyo, Japan) and 45 CHB (individuals in Beijing,
China)) across our CNV-typing array. We used a Bayesian algorithm
to genotype CNVs (more precisely: to assign individuals to diploid
copy number classes), and then manually curated the selection of the
optimal normalization and cluster locations for every locus (Supplementary Methods). We applied quality-control filters to identify
5,238 non-redundant CNVs (4,978 from the CNVs discovered here)
that could be genotyped with high confidence in at least one HapMap
population (3,320 were polymorphic in CEU, 3,985 in YRI and 1,957
in JPT+CHB), and these genotypes exhibited high concordance
across replicate experiments (Supplementary Table 2 and Supplementary Methods).
We also analysed data on 242 HapMap samples on an Illumina
Infinium genotyping platform (Human660W), developed in conjunction with the WTCCC 2 experiments, which incorporates probes
in 8,914 of our CNVs (biased towards those with high frequency in
CEU), using recently published CNV genotyping software21. We
observed that 2,513 CNVs could be genotyped, 2,175 (87%) of which
were also genotyped on the Agilent CGH microarrays. This high
concordance suggests that the genomic properties of the CNV rather
than the performance characteristics of the technology platform
determine whether a CNV can be reliably typed. Given the extensive
overlap, and the smaller number of HapMap samples run on the
Illumina array, subsequent analyses of genotyped CNVs focus solely
on data from the array-CGH CNV-typing.
We developed a new statistical method (Supplementary Methods) to
estimate the absolute copy number of each genotyped CNV, allowing us
[Figure: study design. Reference sample (NA10851); CNV map (log2 ratio plotted against chromosomal coordinates); CNV genotyping array (2 × 105K arrays) targeting 10,819 CNV loci; CNV genotyping of 180 CEU, 180 YRI and 90 JPT+CHB samples; CNV genotypes for 5,238 loci.]
[Figure: proportion of CNV loci (0% to 100%) in each functional category (full gene, exon or promoter, stop codon, within intron, intergenic), shown for candidate, validated and validated genotyped loci; for common, intermediate and rare CNVs, deletions and duplications; and for common CNVs in YRI, CEU and ASN.]
[Table column headings recovered from the layout: Intron, Whole-gene, Exon, Stop codon; each reported as total (avg/sample).]
1,236 (269)
1,036 (198)
909 (177)
203 (20)
47 (5)
893 (204)
494 (67)
222 (26)
244 (36)
62 (5)
238 (42)
278 (20)
147 (9)
93 (9)
49 (2)
183 (38)
134 (18)
80 (11)
45 (7)
21 (1)
270 (70)
163 (28)
74 (10)
90 (16)
15 (2)
Overlap analysis was performed to identify CNV loci that were completely confined to introns and intergenic regions, as well as those that overlapped gene regions. The latter group was further
subdivided in succession into complete CNV-gene overlaps, partial CNV-gene overlaps that included stop codons, and partial CNV-gene overlaps that included the promoter region. The remainder
of CNV loci overlapping other (internal) exons was considered as a separate group. Counts are given for the total number of CNV loci (that is, among all samples) as well as for the number of CNVs
detected per sample on average (avg/sample). For the validated CNVs the avg/sample is actually an average per CGH comparison of two diploid genomes.
TSS, transcription start site.
* In total, 247 (12%) genes in the Online Mendelian Inheritance in Man (OMIM) database overlapped with validated CNV loci, averaging 45 (2.2%) OMIM genes per sample affected by 48 (4.4%)
CNVs.
[Figure: panel b, proportion of events (0.00 to 0.20) for duplications and deletions across sequence features (dbSNP_129_indels_reduced, motif0 to motif19, simple_repeats_under_100bp, simple_repeats_over_100bp, degen_myers, lifted_over_quadruplexes, cpg, slipped_dna, Z_dna, cruciforms_2, segdup_pairs, uniq_segdups, triplexes_2, human_pseudogenes), with some features marked "+" or "*"; motif density across left flank, CNV and right flank; panel c, density of hotspot motif for VNTR and non-VNTR CNVs.]
using the signatures described earlier, but other mechanisms may also
generate dispersed duplications. Interestingly, a subset of these retrotransposition events does not comprise retroposed repeat elements or
known RNA transcripts, some but not all of which seem likely to result
from L1 transduction35.
Population genetics of CNV
Although rates of CNV mutation have been well characterized at
a small number of loci using experimental techniques, a reliable
estimate of the genome-wide mutation rate has yet to be obtained.
With a set of CNVs ascertained in a consistent manner we used the
Watterson estimator of the population-scaled mutation rate, θW, to
estimate the average per-generation rate of CNV formation, μ. The
ascertainment-corrected number of segregating sites (>500 bp) leads
to an estimate of μ = 3 × 10⁻² mutations per haploid genome, per
generation; however, at the base-pair level, this rate is
expected to vary by several orders of magnitude among sites
(Supplementary Methods). This estimate does not account for
purifying selection, and so it probably represents a lower bound on
the true rate.
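The estimator named above can be sketched as follows. This is a minimal illustration of the textbook Watterson estimator, θW = S / a_n with a_n = 1 + 1/2 + ... + 1/(n − 1), and of the standard diploid relation θ = 4·Ne·μ. It omits the ascertainment correction described in the Supplementary Methods, and the function names and any input values are hypothetical, not taken from the study.

```python
def harmonic_number(n_chromosomes: int) -> float:
    """Return a_n = sum_{i=1}^{n-1} 1/i for a sample of n chromosomes."""
    if n_chromosomes < 2:
        raise ValueError("need at least two sampled chromosomes")
    return sum(1.0 / i for i in range(1, n_chromosomes))


def watterson_theta(segregating_sites: float, n_chromosomes: int) -> float:
    """Watterson estimator: theta_W = S / a_n."""
    return segregating_sites / harmonic_number(n_chromosomes)


def mutation_rate_from_theta(theta: float, effective_size: float) -> float:
    """Convert theta to a per-generation rate, assuming theta = 4 * N_e * mu.

    The effective population size is an assumed input here; the study's own
    demographic assumptions are described in its Supplementary Methods.
    """
    if effective_size <= 0:
        raise ValueError("effective population size must be positive")
    return theta / (4.0 * effective_size)
```

Because S here would be a genome-wide count of segregating CNVs, the resulting μ is a genome-wide rate of the kind reported in the text rather than a per-site rate.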
A key parameter for linkage-disequilibrium-based studies of
human variation is the proportion of CNVs that can be tagged well
by nearby SNPs. Such taggability depends on CNV allele frequency
and local SNP density, but not on CNV size (Supplementary
Methods). Overall, the taggability of biallelic CNVs genotyped with
high confidence seems to be largely similar to that of frequency-matched SNPs, except that rare CNVs are more poorly tagged; in
CEU, 77% of CNVs >5% MAF are captured with r2 = 0.8, whereas
only 23% of CNVs <5% MAF are similarly tagged. These results are
similar to others in a smaller data set8. Interestingly, deletions are
much better tagged by nearby SNPs than are duplications (average
difference in maximum r2 is 0.25; P < 10⁻¹⁶), while controlling for
allele frequency and local SNP density; this may be a result of the
chromosomal dispersion of some duplications and an increased frequency of reversions and repeat mutations at some duplications36.
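The notion of taggability used in this paragraph can be illustrated with a short sketch: for one CNV, compute the squared correlation (r2) with each nearby SNP and take the maximum. The sketch assumes unphased genotype dosages coded 0, 1 or 2, which only approximates the haplotype-based r2 obtained from phased data; the function names, the SNP identifiers and the 0.8 cut-off default are illustrative assumptions.

```python
from statistics import mean


def r_squared(x, y):
    """Squared Pearson correlation between two equal-length dosage vectors."""
    if len(x) != len(y):
        raise ValueError("genotype vectors must have the same length")
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a monomorphic marker carries no tagging information
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return (sxy * sxy) / (sxx * syy)


def best_tag(cnv_dosage, snp_dosages):
    """Return (snp_id, r2) for the SNP with the highest r2 to the CNV."""
    best_id, best_r2 = None, 0.0
    for snp_id, dosage in snp_dosages.items():
        value = r_squared(cnv_dosage, dosage)
        if value > best_r2:
            best_id, best_r2 = snp_id, value
    return best_id, best_r2


def is_tagged(cnv_dosage, snp_dosages, threshold=0.8):
    """True if at least one nearby SNP reaches the r2 threshold."""
    return best_tag(cnv_dosage, snp_dosages)[1] >= threshold
```

Taking the maximum over nearby SNPs matches the "maximum r2" quantity compared between deletions and duplications in the text.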
To estimate the strength of purifying selection acting on CNVs in
different functional categories, we fitted a population genetic model
of demography and selection37 to the site frequency spectrum of
deletions and duplications in the CEU population, corrected for
incomplete ascertainment (Supplementary Methods). We observed
the strongest purifying selection acting on exonic CNVs, then intronic CNVs, then intergenic CNVs (Fig. 5a). Stronger purifying selection at intronic CNVs than intergenic CNVs has also been observed
in Drosophila38, and intronic deletions can be pathogenic if they interfere with proper splicing39. Differences in the ascertainment and in
the precision of estimates of key population genetic parameters
between CNV and published base substitution data sets render direct
comparison of average fitness coefficients between CNVs and substitutions potentially misleading.
One signal of recent positive selection is an unusually long haplotype around the selected marker, but it is difficult to fine-map the
selected variant within such long haplotypes on the basis of population genetic data alone. Large CNVs, by virtue of their potential
functional impact, may make a useful first screen for deconstructing
such signals. Accordingly, we have surveyed our CNVs for signs of
recent positive selection using population differentiation9 and two
previously described approaches40,41 relying on haplotype structure
(integrated haplotype score: iHS, and cross-population extended
haplotype homozygosity: XP-EHH). Several of the CNVs exhibited
iHS in the top 1% of the genomic distribution: 7 in CEU, 1 in
CHB+JPT and 18 in YRI, all of which seem to represent population-specific signals. The most impressive signal is around CNVR8151.1
in YRI: a standardized iHS of 3.39, in the top 700 out of 2.26 million
markers (top 0.03% of the genome). This deletion lies between the
APOL2 and APOL4 genes involved in pathogen immunity and previously reported to have been under positive selection in primates42. The
top XP-EHH signal is CNVR3685.1, a deletion at >80% frequency in
CEU and CHB+JPT but almost absent from YRI, 500 bp 3′ to another
immune-related gene, IKBKB (Fig. 5b).
[Figure 5: panel a, proportion of segregating sites by CNV category (legend: CNV exonic = 17, CNV intron = 8, CNV inter = 5); further panels show XP-EHH and r2 around IKBKB against Chr8 position (Mb, NCBI36), and against Chr5 position (Mb, NCBI36).]
Recent positive selection can also drive increased population differentiation. The VST statistic9 for population differentiation (Fig. 4) is
distinct from haplotype-based measures of recent positive selection as
it allows assessment of all loci, not just those with biallelic genotype
calls (for example, unclusterable events and multiallelic CNVs). The
CNV with the highest value of VST between CEU and YRI is an intronic
deletion of the PDLIM3 gene, which encodes an abundant protein in
skeletal and cardiac muscle. We noted that also among the top five
most highly differentiated loci was an intronic VNTR of the gene
encoding ACTN2, the sarcomeric protein binding-partner of
PDLIM3. Four other pathways with two genes under recent selection
have been identified in SNP-based selection scans40,43 (EDAR and
EDA2R, SLC24A5 and SLC45A2, NRG and ERBB4, and LARGE and
DMD). The possibility that these two highly differentiated CNVs in
genes encoding interacting proteins contribute to population44 or
individual differences in cardiac or skeletal muscle phenotypes
warrants further investigation. Mutations in ACTN3, the close paralogue of ACTN2, alter muscle function in humans and mice45, and a
recent study has highlighted an enrichment of genes involved in
muscle development among signals of recent positive selection46.
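For readers unfamiliar with VST, a minimal sketch follows. It assumes the definition given in the cited source for the statistic (ref. 9) as we understand it: VST = (VT − VS) / VT, where VT is the variance of log2 ratios across all individuals from both populations and VS is the within-population variance averaged with weights proportional to population size. Using the population (rather than sample) variance is our own choice, made so that the value stays between 0 and 1; the function name is hypothetical.

```python
from statistics import pvariance


def vst(pop_a, pop_b):
    """VST between two populations from per-individual log2 ratios at one locus."""
    if not pop_a or not pop_b:
        raise ValueError("both populations need at least one individual")
    total = pvariance(list(pop_a) + list(pop_b))  # V_T: variance over everyone
    if total == 0:
        return 0.0  # no variation at all, so no differentiation
    n_a, n_b = len(pop_a), len(pop_b)
    # V_S: size-weighted average of the within-population variances
    within = (n_a * pvariance(pop_a) + n_b * pvariance(pop_b)) / (n_a + n_b)
    return (total - within) / total
```

Because the statistic works directly on intensity data, it can be computed for loci that lack discrete genotype calls, which is the property the text relies on.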
We tested for biases of certain mutation processes or functional
locations for CNVs with high VST values. We noted that VNTR are
significantly enriched in both tails of the VST distribution (Supplementary Fig. 1.11), whereas CNVs formed by NAHR seem to be
SNP | CNV | Location* | r2† | Population‡ | Data§ | Reported gene | Trait | PMID
rs10492972 | CNVR65.1 | chr1:10405137-10406094 | 0.92 | CEU | Phased | KIF1B | Multiple sclerosis | 18997785
rs11809207 | CNVR118.1 | chr1:26332157-26337219 | 0.61 | CEU | Phased | CATSPER4 | Height | 19343178
rs2815752 | CNVR217.1 | chr1:72538870-72584557 | 0.96 | CEU | Phased | NEGR1 | Body mass index | 19079261
rs7553864 | CNVR240.1 | chr1:87385827-87386846 | 0.76 | CEU | Intensities | AK002179 | Smoking behaviour | 19247474
rs4085613 | CNVR358.1 | chr1:150822234-150856715 | 0.97 | CEU | Phased | LCE3D, LCE3A | Psoriasis | 19169255
rs11265260 | CNVR381.1 | chr1:157915386-157916253 | 0.62 | CHB+JPT | Phased | CRP | C-reactive protein | 18439552
rs12029454 | CNVR384.1 | chr1:160497369-160497846 | 0.57 | CHB+JPT | Phased | NOS1AP | QT interval | 19305408
rs6725887 | CNVR1111.1 | chr2:203607766-203612122 | 1.00 | CEU | Phased | WDR12 | Myocardial infarction (early onset) | 19198609
rs9311171 | CNVR1355.1 | chr3:37953474-37961880 | 1.00 | CHB+JPT | Phased | CTDSPL | Prostate cancer | 17903305
rs3772255 | CNVR1591.1 | chr3:157574746-157576258 | 0.90 | CEU | Phased | KCNAB1 | Ageing traits | 17903295
rs9291683 | CNVR1819.6 | chr4:9783252-9843664 | 0.51 | YRI | Intensities | NR | Bone mineral density | 17903296
rs9291683 | CNVR1819.1 | chr4:9820419-9843664 | 0.51 | YRI | Intensities | NR | Bone mineral density | 17903296
rs401681 | CNVR2293.1 | chr5:1386043-1386897 | 0.68 | YRI | Intensities | CLPTM1L | Lung cancer | 18978787
rs11747270 | CNVR2646.1 | chr5:150157836-150161778 | 1.00 | CEU | Phased | IRGM | Crohn's disease | 18587394
rs11747270 | CNVR2647_full | chr5:150183562-150203623 | 1.00 | CEU | Phased | IRGM | Crohn's disease | 18587394
rs4704970 | CNVR2659.1 | chr5:155409234-155427600 | 0.95 | CEU | Phased | SGCD | Multiple sclerosis (age of onset) | 19010793
rs12191877 | CNVR2841.6 | chr6:31384505-31397416 | 0.79 | CEU | Phased | HLA-C | Psoriasis | 19169254
rs10484554 | CNVR2841.6 | chr6:31384505-31397416 | 0.79 | CEU | Phased | HLA-C | AIDS progression | 19115949
rs3129934 | CNVR2845.21 | chr6:32519885-32887814 | 0.87 | CEU | Phased | HLA-DRB1 | Multiple sclerosis | 18941528
rs9277535 | CNVR2846.3 | chr6:33156338-33162718 | 0.62 | CEU | Intensities | HLA-DPB1 | Hepatitis B | 19349983
rs9277535 | CNVR2846.5 | chr6:33159682-33163323 | 0.67 | CEU | Intensities | HLA-DPB1 | Hepatitis B | 19349983
rs210138 | CNVR2850.1 | chr6:33691917-33693857 | 0.55 | CEU | Phased | BAK1 | Testicular germ cell tumour | 19483681
rs2301436 | CNVR3164.1 | chr6:167408121-167409138 | 0.71 | CEU | Intensities | CCR6 | Crohn's disease | 18587394
rs2705293 | CNVR4074.1 | chr8:138980822-138981379 | 0.51 | YRI | Intensities | AK127771 | Neuroticism | 18762592
rs1602565 | CNVR5123.2 | chr11:29095953-29096982 | 0.64 | CEU | Intensities | Intergenic | Schizophrenia | 18677311
rs1602565 | CNVR5123.1 | chr11:29096114-29096643 | 0.61 | CEU | Intensities | Intergenic | Schizophrenia | 18677311
rs7395662 | CNVR5165.1 | chr11:48557432-48560877 | 1.00 | CEU | Phased | MADD, FOLH1 | HDL cholesterol | 19060911
rs9300212 | CNVR5492.1 | chr12:33606396-33608182 | 0.84 | CEU | Phased | Intergenic | Cognitive test performance | 17903297
rs1495377 | CNVR5583.1 | chr12:69818942-69819932 | 0.72 | CEU | Phased | NR | Type 2 diabetes | 17554300
rs3118914 | CNVR5871.1 | chr13:49967347-49973131 | 0.69 | CEU | Phased | DLEU7 | Height | 19343178
rs763014 | CNVR6576.1 | chr16:601068-603588 | 0.68 | CEU | Intensities | RAB40C | Height | 18391950
rs8049607 | CNVR6636.1 | chr16:11591538-11592052 | 0.88 | CHB+JPT | Phased | LITAF | QT interval | 19305409
rs7188697 | CNVR6746.1 | chr16:57231107-57233858 | 0.61 | YRI | Phased | NDRG4 | QT interval | 19305409
rs1805007 | CNVR6887.1 | chr16:88423599-88425903 | 0.87 | CEU | Phased | MC1R | Skin sensitivity to sun | 18488028
List of CNV correlations with trait-associated SNPs with r2 > 0.5 (see main text for details). When a locus-trait association has been reported several times, only the results for the most recently
published trait-associated SNPs are shown in this table. Some trait-associated SNPs are strongly correlated with more than one CNV in the same recombination hotspot interval. NR, no gene
reported in original study; PMID, PubMed accession of the paper reporting the trait-associated SNP.
* Location of the CNV.
† Squared correlation coefficient.
‡ Population in which correlation observed; some SNP-CNV correlations are observed in several populations.
§ CNV data that correlates with the hit-SNP. Phased, phased SNP+CNV haplotypes; intensities, CNV intensity data and SNP genotypes. If present in phased and intensity data, only phased data
reported.
CNVs (MAF > 5%) greater than 1 kb in length, and have been able to
genotype approximately 40% of these (Supplementary Methods). The
remaining CNVs will probably be best captured by genome sequencing experiments.
The CNVs most difficult to genotype directly were duplications
and multiallelic loci (including VNTR). They are also the categories
of CNVs least likely to be tagged well by SNPs, and therefore most
likely to be overlooked by linkage-disequilibrium-based association
testing. The observation that VNTR are enriched among loci exhibiting high population differentiation provides evidence for the functional importance of this CNV class, which highlights the need for
development of genome-wide assays for incorporating this often
recalcitrant class of variants into human genetic studies.
We found that the mutational mechanisms generating CNVs vary
with the size of the genomic alteration. NAHR has
more of a role in larger CNV formation, whereas VNTR and dispersed
duplications (whose role in CNV formation was previously underappreciated) are more commonly observed with smaller CNVs.
Although some sequence motifs (for example, some non-B-DNA
structures) were more mutagenic than others, the sequence context
was not strongly predictive of the location of CNVs, unlike the link
between segmental duplications and larger CNVs mediated by NAHR.
We observed that non-B-DNA forming sequences that are
enriched in promoter regions are also enriched in CNV breakpoints,
suggesting that the same properties that enable regulation of transcription may also be mildly mutagenic for the formation of CNVs,
and as a consequence, CNVs may influence the evolution of gene
regulation. We also discovered that there are substantive differences
in both the mutation mechanisms and the selection pressures of
deletions and duplications.
Despite the fact that we identified several new CNVs that are potential causal variants on trait-associated haplotypes, collectively these
CNVs could explain less than 5% of previously reported GWAS hits.
Nonetheless, these observations emphasize the need to consider all
classes of variation (SNPs and all structural variants, common and rare)
when fine-mapping causal variants within association intervals.
Sequence insertions relative to the reference sequence represent a particular challenge for both fine-mapping and association studies, because
their presence on an associated haplotype might be easily overlooked.
Our results provide some guidance as to how resources might best
be targeted to identify genetic variation underlying the missing
heritability for complex traits that remains unexplained by recent
GWAS. Although common CNVs seem highly unlikely to account
for much of this missing heritability, the striking strength of purifying
selection acting on exonic and intronic deletions suggests that CNVs
might contribute appreciably to rare variants involved in common and
rare diseases, and that study designs that focus on ascertaining rare
sequence and structural variants will maximise power to detect new
causal variation.
METHODS SUMMARY
Samples. HapMap and Polymorphism Discovery Resource DNA samples were
obtained from the Coriell Cell Repository. The reference DNA in genotyping
experiments on the Agilent 105K array was a pool of 10 genomic cell-line DNAs
from the European Collection of Cell Cultures.
CNV discovery experiments. Probes on the 20-array set were designed with a
relaxed threshold for multiple matches to the reference genome to maximise
coverage and allow screening of moderately repetitive sequences. The array data
were generated at NimbleGen's Icelandic service facility. Experiments were
repeated and quality-control filters were applied to improve the data consistency.
Data were normalized to minimize variation between experiments; putative
CNVs were detected as chromosomal segments with unusually high or low log2
ratios of fluorescent intensity between the test and reference genomes using the
genome alteration detection analysis (GADA) algorithm48. Further filtering
reduced false positives.
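The idea of calling a CNV as a run of probes with unusually high or low log2 ratios can be illustrated with a deliberately simple sketch. This is not the GADA algorithm used in the study: it is a fixed-threshold run detector, and the threshold, the minimum probe count and the function name are all assumptions chosen for illustration.

```python
def call_segments(log2_ratios, threshold=0.3, min_probes=3):
    """Return (start_index, end_index, kind) for runs of consecutive probes
    whose log2 ratio is >= threshold ("gain") or <= -threshold ("loss").

    Simplified stand-in for a segmentation algorithm; threshold must be > 0.
    """
    if threshold <= 0:
        raise ValueError("threshold must be positive")
    segments = []
    start, state = 0, 0  # state: +1 inside a gain run, -1 inside a loss run, 0 neutral
    # A trailing neutral value acts as a sentinel that closes any final run.
    for i, value in enumerate(list(log2_ratios) + [0.0]):
        sign = 1 if value >= threshold else -1 if value <= -threshold else 0
        if sign != state:
            if state != 0 and i - start >= min_probes:
                segments.append((start, i - 1, "gain" if state > 0 else "loss"))
            start, state = i, sign
    return segments
```

Requiring a minimum number of supporting probes plays a role loosely comparable to the further filtering mentioned above, trading sensitivity for fewer false positives.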
Validation experiments. qPCR experiments were performed by Applied
Biosystems. Further validation was conducted by Sequenom and the co-authors
of this paper.
CNV genotyping experiments. The Agilent 105K CNV genotyping array was
designed by the WTCCC in collaboration with the other co-authors of this paper.
After pilot experiments, each locus was targeted with at least 10 probes. Agilent
array data were generated by Oxford Gene Technologies at their UK service facility
as part of the pipeline developed for the large WTCCC association experiment
(pipeline to be described elsewhere). We assessed the quality of the experiments
on the 450 HapMap samples and repeated 90 poorer quality experiments to
improve data consistency. The Illumina 660W array data were generated by
Illumina Inc.
Statistical and population analysis. We devised statistical methods for CNV
genotyping, absolute copy number estimation, breakpoint enrichment testing,
and estimation of discovery power. We phased CNVs and SNPs into haplotypes
using BEAGLE 3.0.3 (ref. 49), and used NestedMICA31 for breakpoint motif
discovery.
Received 14 August; accepted 21 September 2009.
Published online 7 October 2009.
1.
2.
3.
4.
5.
6.
7.
8.
9.
10.
11.
12.
13.
14.
15.
16.
17.
18.
19.
20.
21.
22.
23.
24.
25.
26. Jeffreys, A. J. et al. Human minisatellites, repeat DNA instability and meiotic recombination. Electrophoresis 20, 1665–1675 (1999).
27. Bacolla, A. & Wells, R. D. Non-B DNA conformations, genomic rearrangements, and human disease. J. Biol. Chem. 279, 47411–47414 (2004).
28. Myers, S. et al. A common sequence motif associated with recombination hot spots and genome instability in humans. Nature Genet. 40, 1124–1129 (2008).
29. Jeffreys, A. J. et al. Meiotic recombination hot spots and human DNA diversity. Phil. Trans. R. Soc. Lond. B 359, 141–152 (2004).
30. Huppert, J. L. & Balasubramanian, S. G-quadruplexes in promoters throughout the human genome. Nucleic Acids Res. 35, 406–413 (2007).
31. Down, T. A. & Hubbard, T. J. NestedMICA: sensitive inference of over-represented motifs in nucleic acid sequence. Nucleic Acids Res. 33, 1445–1453 (2005).
32. Sen, S. K. et al. Human genomic deletions mediated by recombination between Alu elements. Am. J. Hum. Genet. 79, 41–53 (2006).
33. Tian, D. et al. Single-nucleotide mutation rate increases close to insertions/deletions in eukaryotes. Nature 455, 105–108 (2008).
34. Campbell, P. J. et al. Identification of somatically acquired rearrangements in cancer using genome-wide massively parallel paired-end sequencing. Nature Genet. 40, 722–729 (2008).
35. Pickeral, O. K., Makalowski, W., Boguski, M. S. & Boeke, J. D. Frequent human genomic DNA transduction driven by LINE-1 retrotransposition. Genome Res. 10, 411–415 (2000).
36. Gondo, Y. et al. High-frequency genetic reversion mediated by a DNA duplication: the mouse pink-eyed unstable mutation. Proc. Natl Acad. Sci. USA 90, 297–301 (1993).
37. Boyko, A. R. et al. Assessing the evolutionary impact of amino acid mutations in the human genome. PLoS Genet. 4, e1000083 (2008).
38. Emerson, J. J., Cardoso-Moreira, M., Borevitz, J. O. & Long, M. Natural selection shapes genome-wide patterns of copy-number polymorphism in Drosophila melanogaster. Science 320, 1629–1631 (2008).
39. Wang, L. L. et al. Intron-size constraint as a mutational mechanism in Rothmund-Thomson syndrome. Am. J. Hum. Genet. 71, 165–167 (2002).
40. Sabeti, P. C. et al. Genome-wide detection and characterization of positive selection in human populations. Nature 449, 913–918 (2007).
41. Voight, B. F., Kudaravalli, S., Wen, X. & Pritchard, J. K. A map of recent positive selection in the human genome. PLoS Biol. 4, e72 (2006).
42. Smith, E. E. & Malik, H. S. The apolipoprotein L family of programmed cell death and immunity genes rapidly evolved in primates at discrete sites of host-pathogen interactions. Genome Res. 19, 850–858 (2009).
43. Pickrell, J. K. et al. Signals of recent positive selection in a worldwide sample of human populations. Genome Res. 19, 826–837 (2009).
44. Silva, A. M. et al. Ethnicity-related skeletal muscle differences across the lifespan. Am. J. Hum. Biol. doi:10.1002/ajhb.20956 (16 June 2009).
45. MacArthur, D. G. et al. Loss of ACTN3 gene function alters mouse muscle metabolism and shows evidence of positive selection in humans. Nature Genet. 39, 1261–1265 (2007).
46. Nielsen, R. et al. Darwinian and demographic forces affecting human protein coding genes. Genome Res. 19, 838–849 (2009).
47. Hindorff, L. A. et al. Potential etiologic and functional implications of genome-wide association loci for human diseases and traits. Proc. Natl Acad. Sci. USA 106, 9362–9367 (2009).
48. Pique-Regi, R. et al. Sparse representation and Bayesian detection of genome copy number alterations from microarray data. Bioinformatics 24, 309–318 (2008).
49. Browning, S. R. & Browning, B. L. Rapid and accurate haplotype phasing and missing-data inference for whole-genome association studies by use of localized haplotype clustering. Am. J. Hum. Genet. 81, 1084–1097 (2007).
50. Iafrate, A. J. et al. Detection of large-scale variation in the human genome. Nature Genet. 36, 949–951 (2004).
Author Information The CNV discovery and CNV genotyping data are available at
ArrayExpress (http://www.ebi.ac.uk/microarray-as/ae/) under accession
numbers E-MTAB-40 and E-MTAB-142, respectively. Normalized CNV discovery
data are available at http://www.sanger.ac.uk/humgen/cnv/42mio. CNVs are
displayed at the Database of Genomic Variants (http://projects.tcag.ca/variation).
CNV locations and genotypes are reported in Supplementary Tables 1
and 2. Reprints and permissions information is available at www.nature.com/
reprints. Correspondence and requests for materials should be addressed to
M.E.H. ([email protected]) or S.W.S. ([email protected]).