A Supplement To: Supported by
Cancer Research
Roche Diagnostics
Roche Applied Science
© 2006 Roche Diagnostics. All rights reserved. Indianapolis, Indiana
ANCIENT DNA
New Methods Yield Mammoth Samples
A. Gibbons
Metagenomics to Paleogenomics: Large-Scale Sequencing of Mammoth DNA
H. N. Poinar, et al.
No Longer De-Identified
A. L. McGuire and R. A. Gibbs
When the human genome sequencing project was initiated in 1990, the
time required to complete the project was predicted to be at least 15 years.
However, determination on the part of researchers coupled with technological
advances that allowed more sequence data to be squeezed out of each sequencing run
led to a reduction of two years in that target, a remarkable achievement considering
the ambitious timeline. The fundamental principle on which this technology was based
has essentially not changed since its development by Fred Sanger in 1977. Although a
number of significant improvements have been made, the methodology still has a number
of inherent technical and physical limitations, such as the limited throughput of samples
and the requirement for cloning and amplification of the DNA prior to sequencing. These
limitations are believed to preclude further significant improvements in
speed, cost, and throughput.
This circumstance has prompted the development of a number of new technologies
through which researchers have looked at the problem afresh and proposed interesting
solutions, or presented novel twists on the original Sanger technique. Not all have
achieved implementation yet, but the parties involved are plainly focused on making their
technologies available to users as soon as possible. The different approaches include single
molecule sequencing (VisiGen, LI-COR), sequencing by ligation (Agencourt), and so-
called sequencing-by-synthesis (Roche/454 Life Sciences, Microchip Biotechnologies,
NimbleGen). An interesting new methodology utilizing nanopore sequencing, which
benefits from not requiring amplification or cloning of target DNA, is being researched
by a number of laboratories. Although still in the initial stages of development, it perhaps
gives a glimpse of where sequencing may be going in the future. The field as a whole has
been invigorated not only by increased industry spending, but also by the offer of prize
monies from different groups (such as the Genome X Prize offered by Craig Venter), plus
the availability of some sizable NIH grants.
No new technology is without its limitations and drawbacks. In particular, limited
sequence reads may reduce the throughput of certain systems; these weaknesses are being
aggressively addressed, with good success. Additionally, important ethical and privacy
concerns that frequently come hand in hand with such technological advances need to be
urgently addressed. These issues aside, it is abundantly clear that the $1,000 genome is
tantalizingly close to being realized. Its benefits to “omics” research will be many, from
the ability to perform SNP comparisons over entire genomes to the sequencing of the full
complement of siRNAs in individual transcriptomes, or the capacity for genomewide
comparisons of genetic aberrations in tumor samples. Appealingly, it could also make the
Holy Grail of personalized medicine a reality sooner rather than later.
Sean Sanders, Ph.D.
Commercial Editor
Science Office of Publishing and Member Services
In the last decade, DNA sequencing has been a cornerstone for nearly every aspect of
biological research and drug discovery. Past advances in Sanger sequencing, coupled
with the shotgun method of sample preparation, enabled the completion of the whole
genome, accelerating discoveries in other fields such as gene expression and genotyping.
The capillary electrophoresis–based Sanger method is currently the most popular DNA
sequencing technology and was the foundation of the human genome project. In addition
to sequencing complete genomes, the Sanger technique has been used for multiple
applications; for example, fragment sequencing for clone verification or the detection
of variations in the genome for use in population studies or disease-focused research.
As this method has been steadily improved over the past ten years, costs have fallen by
approximately 90% and throughput has increased tenfold. Nevertheless, this standard
method has reached its physical limits.
For the first time in nearly a decade, a new sequencing technology is commercially
available. Based upon pyrosequencing chemistries, the Genome Sequencer 20 System
has energized the DNA sequencing market, providing more data faster and creating new
applications to address existing scientific questions. The Genome Sequencer 20 System,
developed by 454 Life Sciences™ Corporation and distributed by Roche Diagnostics,
uses a sequencing approach called “Sequencing by Synthesis.” The technique is based on
the clonal amplification of single molecules on beads (microparticles), which are isolated
in an emulsion, and the subsequent massively parallel sequencing of the amplified DNA
on the beads placed in wells of a PicoTiterPlate™ device. The Genome Sequencer 20
System achieves a sequencing output of at least 20 Mb (200,000 fragments) per 5.5-hour
run, with an average read length of 100 base pairs. The system’s detection method is
based on the conversion of pyrophosphate—released during the polymerase-catalyzed
attachment of a nucleotide to the complementary strand—into light via an enzyme
cascade.
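The per-run figures quoted above (200,000 fragments, an average read length of 100 base pairs, a 5.5-hour run) can be checked with simple arithmetic. The sketch below is only a back-of-the-envelope calculation; the 5 Mb genome size and 20-fold coverage target in the usage example are illustrative assumptions, not values from the text.

```python
import math

# Figures quoted in the text for one Genome Sequencer 20 run.
READS_PER_RUN = 200_000       # fragments sequenced per run
AVG_READ_LENGTH_BP = 100      # average read length, in base pairs
RUN_HOURS = 5.5               # duration of one run, in hours


def run_output_bp(reads=READS_PER_RUN, read_length_bp=AVG_READ_LENGTH_BP):
    """Total number of bases produced by a single run."""
    return reads * read_length_bp


def megabases_per_hour(output_bp, hours=RUN_HOURS):
    """Average sequencing rate in megabases per hour."""
    return output_bp / 1e6 / hours


def runs_for_coverage(genome_size_bp, coverage, output_bp):
    """Runs needed to reach an average coverage depth (rounded up)."""
    return math.ceil(genome_size_bp * coverage / output_bp)


if __name__ == "__main__":
    output = run_output_bp()
    print(output)  # 20000000, i.e. the 20 Mb per run quoted in the text
    print(round(megabases_per_hour(output), 2))
    # Hypothetical example: a 5 Mb microbial genome at 20-fold coverage.
    print(runs_for_coverage(5_000_000, 20, output))  # 5
```

The product of fragment count and read length reproduces the 20 Mb figure, which suggests the quoted numbers describe the same run rather than independent best cases.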
As the throughput and sensitivity represent a breakthrough relative to the capabilities
of the Sanger method, the new Genome Sequencer 20 System is very much suited not
only for the sequencing of whole genomes of microorganisms, but also for resequencing
of amplicons and small RNA fragments, supporting transcriptome analysis. The ability
to sequence whole genomes in a rapid manner, combined with the ability to identify
and quantify low-abundance DNA subpopulations within a sample, has enabled new
applications. Some recent developments include studies on drug resistance and the
pathogenicity of bacteria; identification of human DNA variations, including low-
abundance mutations in cancer studies; establishing the onset of drug resistance in HIV
or HCV; and differentiating the modes of action of antibiotics.
New-generation sequencing technologies are gradually opening up new applications,
particularly in the medical field and in the discovery and development of drugs.
If sequencing is to become a standard component of individualized diagnosis and
preventive medicine—the next step to bridge basic research tools to patient care—the
costs per sequencing run must be reduced still further. The international planning target
is US$1,000 for a complete human genome. A large number of research institutes
and companies are now working on technologies designed to lower costs and increase
throughput. Roche Diagnostics and 454 Life Sciences Corporation continue working
together to further develop the Genome Sequencer System to enable researchers to open
more doors to new applications and solve previously unaddressable problems.
Timothy Harkins, Ph.D.
Marketing Manager
Roche Diagnostics
Roche Applied Science
Ancient DNA has always held the promise of a visit to a long-vanished world of extinct animals, plants, and even humans. But although researchers have sequenced short bits of ancient DNA from organisms including potatoes, cave bears, and even Neandertals, most samples have been too damaged or contaminated for meaningful results.

Now in a paper published online by Science this week*, an international team reports using new technology to sequence a staggering 13 million basepairs of both nuclear and mitochondrial DNA from a 27,000-year-old Siberian mammoth. Also this week, a Nature paper reports using a souped-up version of more conventional methods to sequence a mammoth’s entire mitochondrial genome.

Besides helping reveal the origins of mammoths, the new nuclear data serve as a dramatic demonstration of the power of the new technique to reliably sequence large amounts of ancient DNA, other researchers say. “The ‘next generation’ sequencer that was used [in the Science paper] will revolutionize the field of ancient DNA,” predicts evolutionary biologist Blair Hedges of Pennsylvania State University in University Park. Ancient DNA pioneer Svante Pääbo of the Max Planck Institute for Evolutionary Anthropology in Leipzig, Germany, who co-led the independent mitochondrial study, calls the nuclear DNA work “really great—the way forward in ancient DNA is to go for the nuclear genome with technologies like this.”

To get mammoth samples for the new method, molecular evolutionary geneticist Hendrik Poinar of McMaster University in Hamilton, Canada, took bone cores from woolly mammoths found in permafrost and stored in a frigid Siberian ice cave. When Poinar returned the samples to his lab, he was surprised by the amount of DNA that emerged, particularly from one mammoth jawbone. This specimen had been recovered from the shore of Lake Taimyr, where very cold winters and short, cool, and dry summers turned out to be ideal conditions for preserving DNA.

Poinar sent the DNA-rich sample to genomicist Stephan C. Schuster at Pennsylvania State University, University Park, who is working with a new genome sequencer developed by a team at Stanford University and 454 Life Sciences Corp. of Branford, Connecticut (Nature, 15 September, p. 376). This rapid, large-scale sequencing technology sidesteps the need to insert DNA into bacteria before amplifying and sequencing it. Instead, scientists break DNA into small fragments, each attached to a tiny bead and encapsulated by a lipid bubble where the DNA is multiplied into many copies for sequencing. Because each fragment is isolated before copying, the method avoids bias from copying large amounts of contaminant DNA from bacteria or humans.

The researchers were stunned by how well the method worked on ancient DNA, which is notoriously difficult to extract and sequence: “I would have been happy if we got 10,000 bases of mammoth DNA,” said Poinar. Instead, they got 28 million basepairs, 13 million from the mammoth itself. Their preliminary analysis shows that the mammoth was a female who shared 98.55% of her DNA with modern African elephants. But mammoths were apparently closest kin to Asian elephants, as shown by Pääbo’s mitochondrial study, which retrieved about 17,000 basepairs.

Poinar’s team also found sequences from bacteria, fungi, viruses, soil microorganisms, and plants, which the researchers say will help reconstruct the mammoth’s ancient world. The technique was so productive that the authors predict it will be used soon to sequence entire genomes of extinct animals.
–ANN GIBBONS
*www.sciencemag.org/cgi/content/abstract/1123360 With reporting by Michael Balter.
1 McMaster Ancient DNA Center, 2 Department of Anthropology, 3 Department of Pathology and Molecular Medicine, McMaster University, 1280 Main Street West, Hamilton, ON L8S 4L9 Canada. 4 Pennsylvania State University, Center for Comparative Genomics and Bioinformatics, 310 Wartik Building, University Park, PA 16802, USA. 5 Henry Wellcome Ancient Biomolecules Centre, Department of Zoology, Oxford University, South Parks Road, Oxford, OX1 3PS, UK. 6 Division of Vertebrate Zoology/Mammalogy, American Museum of Natural History, 79th Street and Central Park West, New York, NY 10024, USA. 7 #2 Avenue de la Pelouse, F-94160 St. Mandé, France. 8 Zoological Institute, Russian Academy of Sciences, Universitetskaya nab. 1, Saint Petersburg 199034, Russia. 9 Center for Bioinformatics (ZBIT), Institute for Computer Science, Tübingen University, 72076 Tübingen, Germany. 10 Garching Computing Center (RZG), Boltzmannstrasse 2, D-85748 Garching, Germany.
*To whom correspondence should be addressed. E-mail: [email protected] (H.N.P.); scs@bx.psu.edu (S.C.S.)

Complete genome sequences of extinct species will answer long-standing questions in molecular evolution and allow us to tackle the molecular basis of speciation, temporal stages of gene evolution, and intermediates of selection during domestication. To date, fossil remains have yielded little genetic insight into evolutionary processes because of poor preservation of their DNA and our limited ability to retrieve nuclear DNA (nDNA). Most DNA extracted from fossil remains is truncated into fragments of very short length [<300 base pairs (bp)] from hydrolysis of the DNA backbone, cross-linking due to condensation (1, 2), and oxidation of pyrimidines (3), which prevents extension by Taq DNA polymerase during polymerase chain reaction (PCR). In addition, DNA extracts are a mixture of bacterial, fungal, and often human contaminants, complicating the isolation of endogenous DNA. In the past, these problems could only be indirectly overcome by concentrating on the small number of genes present on the maternally inherited mitochondrial genome, which is present in high copy number in animal cells. This approach severely limits access to the storehouses of genetic information potentially available in fossils of now-extinct species. In a few rare cases, investigators have managed to isolate and characterize nuclear DNA from fossil remains preserved in arid cave deposits (4–6) or, more commonly, permafrost-dominated environments (7, 8) and ice (9), where the average burial temperature can be as low as –10°C (10). Under these conditions, preservation is enhanced by reduced reaction rates: In permafrost
[Figure: pie chart of read classification. Environmental sequences, 14.15%; Bacteria, 5.76%; Other Eukaryota, 4.15%; Alignable to human, 1.40%; Alignable to dog, 1.25%; Archaea, 0.24%; Virus, 0.09%.]
(13). As we have not detected any convincing mappings of a read to the human Y chromosome, despite random distribution across the genome, we conclude that our mammoth was a female.

A total of 137,527, or 45.4% of all reads, aligned to the African elephant genome (Fig. 1, Table 1) (13), currently available at 2.2-fold coverage, with an estimated number of base pairs in the genome of 2.3 × 10^9 bp (www.broad.mit.edu/mammals/#chart). A twofold coverage approximates only 80% of the total genome, so a conservative estimate is that half of our reads would align to a completed elephant sequence. Among all reads, 44,442 (14.7%) aligned to only one position in the elephant genome, and 21,952 (7.3%) exhibited a perfect (100%) match, up to a read length of 132 bp. To test whether the observed hits were more likely to be derived from endogenous mammoth DNA, as opposed to potential contaminants such as human DNA, we repeated the BLASTZ analyses as above, this time comparing our sequence reads to the currently available versions of the human and dog genomes. Only 4237 reads (1.4%) aligned to human and 3775 (1.2%) to dog (at our threshold of approximately 90% identity). Between 1% and 5% of any two distantly related mammalian genomes should align at 90% identity or greater, because roughly 0.5% of these genomes consist of protein-coding segments conserved at that level (21), and noncoding DNA contributes a somewhat larger fraction (22). Thus, the fraction of our reads that show at least 90% identity
was 1:658, which agrees with what one would expect given a 1:1000 copy-number ratio for nDNA versus mtDNA.

Despite the presence in our sample of an exceptionally high percentage (54.5%, including reads predicted to align to elephant) of mammoth DNA, relative to environmental contaminants, 45.5% of the total DNA derives from endogenous bacteria and nonelephantid environmental contaminants. In addition to ubiquitous contaminants resulting from handling or conditions of storage, these exogenous species are likely to represent taxa present at or immediately after the time of the mammoth’s death, thereby contributing to the decomposition of the remains. To acquire a glimpse of the biodiversity of these communities, we have devised software (GenomeTaxonomyBrowser) (27, 28) that allows for the taxonomic identification of various species on the basis of sequence comparison and current phylogenetic classification at the National Center for Biotechnology Information (NCBI) taxonomy browser as of November 2005 (www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html). We compared 302,692 reads (100%) against the nonredundant (nr) and the environmental database (env_nr). Using an adjustable factor for bitscore (13), we classified the reads simultaneously to the individual kingdom, phylum, class, order, family, and genus down to the species level wherever possible (fig. S2 and table S3), excluding all hits matching Gnathostomata (jawed vertebrates).

The remaining 12,563 hits within the Eukaryota (4.15%) were only surpassed by the number of bacterial hits, 17,425 (5.76%), and hits against the environmental database, 42,816 (14.15%). The kingdom Archaea was hit infrequently with only 736 hits (0.24%). Within this group, the Euryarchaeota dominated the Crenarchaeota by a ratio of 16:1. In the bacterial superkingdom, the most prevalent groups were found to be Proteobacteria, 5282 (1.75%); Firmicutes (gram-positives), 940 (0.31%), mostly Bacilli and Clostridia; Actinobacteria, 2740 (0.91%); Bacteroidetes, 497 (0.16%); and the group of the Chlorobi bacteria, 248 (0.08%). Other identified microorganisms included the fungal taxa Ashbya, Aspergillus, and Neurospora/Magnaporthe with 440 hits
early postglacial time and ultimately shed new light on the cause and consequences of late Quaternary extinctions.

2 December 2005; accepted 15 December 2005
Published online 20 December 2005; 10.1126/science.1123360
Include this information when citing this paper.
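The read counts reported in the results above can be cross-checked against the percentages quoted alongside them. The sketch below simply divides each count by the 302,692 total reads; the dictionary labels are shorthand chosen for this example, and the categories come from different analyses (genome alignment versus database classification), so they are not summed.

```python
# Total number of sequence reads, as stated in the text.
TOTAL_READS = 302_692

# Read counts taken from the text; labels are shorthand for this sketch.
READ_COUNTS = {
    "aligned to African elephant": 137_527,
    "unique position in elephant": 44_442,
    "perfect match to elephant": 21_952,
    "environmental database (env_nr)": 42_816,
    "Bacteria": 17_425,
    "other Eukaryota": 12_563,
    "aligned to human": 4_237,
    "aligned to dog": 3_775,
    "Archaea": 736,
}


def percent_of_reads(count, total=TOTAL_READS):
    """Share of all reads, as a percentage rounded to two decimals."""
    return round(100.0 * count / total, 2)


if __name__ == "__main__":
    for label, count in READ_COUNTS.items():
        print(f"{label:35s} {count:8d} {percent_of_reads(count):6.2f}%")
```

Run as a script, this prints one line per category, which makes it easy to confirm that the quoted percentages all use the same denominator.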
This News Focus article was published in the 7 March 2006 issue of Science.

MARCO ISLAND, FLORIDA—Computers aren’t the only things getting better and cheaper every time you turn around. Genome-sequencing prices are in free fall, too. The initial draft of the first human genome sequence, finished just 5 years ago, cost an estimated $300 million. (The final draft and all the technology that made it possible came in near $3 billion.) Last month, genome scientists completed a draft of the genome sequence of the second nonhuman primate—the rhesus macaque—for $22 million. And by the end of the year, at least one company expects to turn out a full mammalian genome sequence for about $100,000, a 3000-fold cost reduction in just 6 years.

It’s not likely to stop there. Researchers are closing in on a new generation of technology that they hope will slash the cost of a genome sequence to $1000. “Advances in this field are happening fast,” says Kevin McKernan, co–chief scientist at Agencourt Bioscience in Beverly, Massachusetts. “And they are coming more quickly than I think anyone was anticipating.” Jeffrey Schloss, who heads the sequencing-technologies grant program at the National Human Genome Research Institute (NHGRI) in Bethesda, Maryland, agrees. “People are roundly encouraged and nervous,” Schloss says—encouraged because their own technologies are working, and nervous because their competitors’ are too.

A host of these novel sequencing technologies were on display last month at a meeting here.* Although no one at the meeting claimed to have cracked the $1000 genome sequence yet, researchers are getting more confident that it’s a real possibility. “From what I’ve listened to the last few days, there is no physical principle that says we shouldn’t be able to do a $1000 genome,” says Harvard University sequencing pioneer George Church.

*Advances in Genome Biology and Technology Conference, Marco Island, Florida, 8-11 February 2006.

Even today, the declining cost of genome sequencing is triggering a flowering of basic research, looking at broad-ranging topics such as how the activation of genes is regulated and understanding genetic links to cancer. And as prices continue to drop, sequencing will revolutionize both the way biologists hunt for disease genes and the way medical professionals diagnose and treat diseases. In fact, some researchers say cheap sequencing technology could finally usher in personalized medicine in a major way.

“The promise of cheap sequencing is in the understanding of disease and biology, such as cancer, where the genome changes over time,” says Dennis Gilbert, chief scientist of Applied Biosystems, the leading gene-sequencing-technology company based in Foster City, California. “It will enable different kinds of science to be done.” Of course, as with other forms of high technology, that promise brings new risks as well. Researchers expect cheap sequencing to raise concerns about the proliferation of bioterrorism agents as well as patient privacy.

The race is on
The first group to produce a technology capable of sequencing a human genome sequence for $1000 will get instant gratification, as well as potential future profits: In September 2003, the J. Craig Venter Science Foundation promised $500,000 for the achievement. That challenge has since been picked up by the Santa Monica, California–based X Prize Foundation, which is expected to up the ante to between $5 million and $20 million. But the competition really began in earnest in 2004, when the National Institutes of Health launched a $70 million grant program to support researchers working to sequence a complete mammal-sized genome initially for $100,000 and ultimately for $1000. That program has had an “amazing” effect on the field, encouraging researchers to pursue a wide variety of new ideas, says Church. That boost in turn has led to a miniexplosion of start-up companies, each pursuing its own angle on the technology (see table, p. 15).

All are racing to improve or replace a technology first developed by Fred Sanger of the U.K. Medical Research Council in the mid-1970s that is the basis of today’s sequencing machines. The technique involves making multiple copies of the DNA to be sequenced, chopping it up into small pieces, and using those pieces as templates to synthesize short strands of DNA that will be exact complements of stretches of the original sequence. The synthesis essentially mimics the cell’s processes for copying DNA.

The technology relies on the use of modified versions of the four bases that make up DNA, each of which is tagged with a different fluorescent marker. A short DNA snippet called a primer initiates the synthesis at a specific point on the template DNA, and the altered bases—which are vastly outnumbered by normal bases in the mix of reagents used to perform the synthesis—stop the process when one of them is tacked onto the end of the growing DNA strand. The result is a soup of newly synthesized DNA fragments, each of which started at the same point but ends at a different base along the chain.

Today’s sequencers separate these fragments by passing the soup through tiny capillaries containing a gel; the shorter the fragment, the faster it moves through the gel. The process, known as capillary electrophoresis, is so effective that each fragment that emerges from the capillary is just one base longer than the one that preceded it. As each fragment emerges, it is hit by a laser, which causes the altered base at the fragment’s tip to fluoresce. A computer records the identity of these bases and the sequence in which they appear. Eventually, the process generates billions of stretches of sequence that are fed into pattern-recognition software running on a supercomputer, which picks out overlaps and stitches the pieces together into a complete genome sequence.

A long list of refinements in capillary electrophoresis systems, coupled with increased automation and software improvements, has driven down the costs of sequencing 13-fold since these machines were introduced in the 1990s. Most of the new technologies aim to miniaturize, multiplex, and automate the
Gregory Timp of the University of Illinois, Urbana-Champaign, reported that his team has generated electrical readings of DNA moving through nanopores. Unfortunately, the DNA wriggled back and forth so much that the researchers had trouble teasing out the sequence of bases in the chain. But Timp says he and his colleagues are finishing a second-generation device that uses electric fields to keep the movement of the DNA under control. If it works, the technology promises to read long stretches of DNA without the need for expensive optical detectors.

“We have to worry now”
No matter which technology or technologies make it to market, the scientific consequences of lower sequencing costs are bound to be enormous. “I think it’s going to have a profound impact on biology,” says Yale University molecular biologist Michael Snyder.

Some early progress is already on display. At the Florida meeting, for example, 454’s Egholm reported that he and his colleagues used their technology to identify as many as four genetic variants of HIV in single blood samples, in contrast to today’s technology, which identifies just the dominant strain. The technique, Egholm says, could eventually help doctors see the rise of drug-resistant HIV strains in patients at the earliest stages. In another study, they quickly analyzed the sequence of non–small cell lung cancer cells and identified the specific mutations that give rise to drug resistance.

In similar studies, Thomas Albert and colleagues at NimbleGen Systems, a biotechnology firm in Madison, Wisconsin, used their version of sequencing-by-synthesis technology to identify the mutations in Helicobacter pylori—the microbe responsible for ulcers—that cause resistance to a drug known as metronidazole, as well as the mutations in the tuberculosis-causing bacterium that trigger resistance to a new TB drug. The power of such studies is “unbelievable,” Snyder says, because they hold out the hope of enabling doctors to tailor medicines to battle diseases most effectively. Some personalized-treatment strategies are already in use: Herceptin, for example, is targeted to patients with a specific genetic form of breast cancer. But cheap sequencing should make them far more widespread, Church says. Basic researchers are looking at the early benefits of cheap sequencing as well. At the meeting, for example, Snyder talked about his team’s use of gene chips to map the sites where transcription factors—proteins that control when genes are turned on—bind to the genome. The technology is effective,
No Longer De-Identified
Amy L. McGuire1* and Richard A. Gibbs2

1 Center for Medical Ethics and Health Policy, Baylor College of Medicine, 2 Human Genome Sequencing Center, Baylor College of Medicine, One Baylor Plaza, Suite 310D, Houston, TX 77030, USA. *Author for correspondence. E-mail: [email protected]

As DNA sequencing becomes more affordable and less time-consuming, scientists are adding DNA banking and analysis to research protocols, resulting in new disease-specific DNA databases. A major ethical and policy question will be whether and how much information about a particular individual’s DNA sequence ought to be publicly accessible.

Without privacy protection, public trust will be compromised, and the scientific and medical potential of the technology will not be realized. However, scientific utility grows with increased access to sequenced DNA. At present, ethical concerns about the privacy of subjects whose sequenced DNA is publicly released have largely been addressed by ensuring that the data are “de-identified” and that confidentiality is maintained (1, 2). There is a large literature on the various data-management models and computer algorithms that can be used to provide access to genetic data while purportedly protecting privacy (3–6). We believe that minimizing risks to subjects through new developments in data and database structures is crucial and should continue to be explored, but that additional safeguards are required.

Scientists have been aware for years of the possibility that coded or “anonymized” sequenced DNA may be more readily linked to an individual as genetic databases proliferate (1, 3, 7, 8). In 2004, Lin and colleagues demonstrated that an individual can be uniquely identified with access to just 75 single-nucleotide polymorphisms
Option 2 Single gene loci Typically <20 SNPs Intermediate Ability to study individual genes
D
NA methods are now widely used for 2, 9). Such methods could potentially
many forensic purposes, including be applied to searches of the convicted
routine investigation of serious offender/arrestee DNA databases. When
crimes and for identification of persons crime scene samples do not match anyone
killed in mass disasters or wars (1–4). in a search of forensic databases, the
DNA databases of convicted offenders are application of indirect methods could
maintained by every U.S. state and nearly identify individuals in the database who
every industrialized country, allowing are close relatives of the potential suspects.
comparison of crime scene DNA profiles This raises compelling policy questions
to one another and to known offenders about the balance between collective
(5). The policy in the United Kingdom security and individual privacy (10).
stipulates that almost any collision with To date, searching DNA databases
law enforcement results in the collection to identify close relatives of potential
of DNA (6). Following the U.K. lead, the suspects has been used in only a small
United States has shifted steadily toward number of cases, if sometimes to dramatic
inclusion of all felons, and federal and effect. For example, the brutal 1988
six U.S. state laws now include some murder of 16-year-old Lynette White,
provision for those arrested or indicted. At in Cardiff, Wales, was finally solved in
present, there are over 3 million samples in 2003. A search of the U.K. National DNA
the U.S. offender/arrestee state and federal Database for individuals with a specific
DNA databases (7). Statutes governing the single rare allele found in the crime scene
use of such samples and protection against evidence identified a 14-year-old boy with
misuse vary from state to state (8). a similar overall DNA profile. This led
Although direct comparisons of DNA police to his paternal uncle, Jeffrey Gafoor
profiles of known individuals and unknown (11). Investigation of the 1984 murder
biological evidence are most common, of Deborah Sykes revealed a close, but
indirect genetic kinship analyses, using not perfect, match to a man in the North
the DNA of biological relatives, are often Carolina DNA offender database, which led
necessary for humanitarian mass disaster investigators to his brother, Willard Brown
and missing person identifications (1, (12). Both Gafoor and Brown matched the
DNA from the respective crime scenes,
1 confessed, and were convicted.
Department of Pathology, Brigham and Women’s
Hospital and Harvard Medical School, Boston, MA Although all individuals have some
02115, USA. 2DNAVIEW, Oakland, CA 94611, and genetic similarity, close relatives have very
School of Public Health, University of California, similar DNA profiles because of shared
Berkeley, CA 94720, USA. 3John F. Kennedy School
of Government, Harvard University, Cambridge, MA
ancestry. We demonstrate the potential
02138, USA. value of kinship analysis for identifying
Authors are alphabetized to reflect equal promising leads in forensic investigations
contributions. Comments and ideas expressed on a much wider scale than has been used
herein are their own.
*Author for correspondence. to date.
E-mail: [email protected] Let us assume that a sample from a
crime scene has been obtained that is not an exact match to the profile of
anyone in current DNA databases. Using Monte Carlo simulations (13, 14), we
investigated the chances of successfully identifying a biological relative of
someone whose profile is in the DNA database as a possible source of crime
scene evidence (15). Each Monte Carlo trial simulates a database of known
offenders, a sample found at a crime scene, and a search. The search compares
the crime sample with each catalogued offender in turn by computing likelihood
ratios (LRs) that assess the likelihood of parent-child or of sibling
relationships (1, 16). We used published data on allele frequencies of the 13
short tandem repeat (STR) loci on which U.S. offender databases are based and
basic genetic principles (17–19). A high LR is characteristic of related
individuals and is an unusual but possible coincidence for unrelated
individuals. The analysis of each simulation therefore assumes that
investigators would follow these leads in priority order, starting with those
in the offender database with the highest LR for being closely related to the
owner of the crime scene DNA sample.
Our simulations demonstrate that kinship analysis would be valuable now for
detecting potential suspects who are the parents, children, or siblings of
those whose profiles are in forensic databases. For example, assume that the
unknown sample is from the biological child of one of the 50,000 offenders in a
typical-sized state database. Of the 50,000 LRs comparing the “unknown” sample
to each registered offender in the database, the child corresponds to the
largest LR about half the time, and has a 99% chance of appearing among the 100
largest LRs (see chart). An analysis of potential sibling relationships
produced a similar curve (13).
[Chart: vertical axis with ticks from 0 to 0.8; horizontal axis "Number of
leads investigated (k)" with ticks 0, 10, 100, 1000.]
These results could be refined by additional data, for example, large numbers
of single nucleotide polymorphisms (SNPs). Better and immediately practical, a
seven-locus Y-STR haplotype analysis on the crime scene and the list of
database leads would eliminate 99% of those not related by male lineage (20).
Data mining (vital records, genealogical and geographical data) for the
existence of suitable suspects related to the leads can also help to refine the
list.
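The search procedure described above (simulate a database, simulate a crime
scene sample from a child of one offender, compute a parent-child LR against
every profile, and rank the leads) can be sketched in a few lines. This is an
illustration only, not the authors' simulation code: the allele frequencies are
synthetic rather than the published STR data, the database is far smaller than
50,000 profiles, mutation and population substructure are ignored, and all
function names (make_loci, parent_child_lr, hit_rate, and so on) are
hypothetical.

```python
import random

def make_loci(rng, n_loci=13, n_alleles=8):
    """Synthetic allele frequencies per locus (assumption: NOT published STR data)."""
    loci = []
    for _ in range(n_loci):
        weights = [rng.random() + 0.1 for _ in range(n_alleles)]
        total = sum(weights)
        loci.append([w / total for w in weights])
    return loci

def random_profile(loci, rng):
    """Draw two alleles per locus independently from the population frequencies."""
    return [tuple(rng.choices(range(len(f)), weights=f, k=2)) for f in loci]

def child_of(parent, loci, rng):
    """One allele per locus from the parent, the other from the population."""
    return [(rng.choice(g), rng.choices(range(len(f)), weights=f, k=1)[0])
            for g, f in zip(parent, loci)]

def parent_child_lr(sample, candidate, loci):
    """Product over loci of P(sample | candidate is parent) / P(sample | unrelated)."""
    lr = 1.0
    for (a, b), cand, f in zip(sample, candidate, loci):
        pa = cand.count(a) / 2  # chance the candidate transmits allele a
        pb = cand.count(b) / 2  # chance the candidate transmits allele b
        if a == b:
            related = pa * f[a]
            unrelated = f[a] ** 2
        else:
            related = pa * f[b] + pb * f[a]
            unrelated = 2 * f[a] * f[b]
        lr *= related / unrelated
    return lr

def rank_of_true_parent(db_size, loci, rng):
    """One Monte Carlo trial: rank 1 means the true parent has the largest LR."""
    database = [random_profile(loci, rng) for _ in range(db_size)]
    parent_idx = rng.randrange(db_size)
    sample = child_of(database[parent_idx], loci, rng)
    lrs = [parent_child_lr(sample, p, loci) for p in database]
    return 1 + sum(1 for lr in lrs if lr > lrs[parent_idx])

def hit_rate(k, trials=40, db_size=500, seed=1):
    """Fraction of trials in which the true parent is among the k largest LRs."""
    rng = random.Random(seed)
    loci = make_loci(rng)
    ranks = [rank_of_true_parent(db_size, loci, rng) for _ in range(trials)]
    return sum(r <= k for r in ranks) / trials

if __name__ == "__main__":
    for k in (1, 10, 100):
        print(k, hit_rate(k))
```

Because the frequencies and database size here are invented, the printed
fractions should not be expected to reproduce the figures quoted in the text;
the sketch only shows how following leads in descending LR order turns into a
success rate as a function of k.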
Despite the greater information content of genomic DNA, ancient DNA studies have
largely been limited to the amplification of mitochondrial sequences. Here we
describe metagenomic libraries constructed with unamplified DNA extracted from
skeletal remains of two 40,000-year-old extinct cave bears. Analysis of
~1 megabase of sequence from each library showed that despite significant
microbial contamination, 5.8 and 1.1% of clones contained cave bear inserts,
yielding 26,861 base pairs of cave bear genome sequence. Comparison of cave bear
and modern bear sequences revealed the evolutionary relationship of these
lineages. The metagenomic approach used here establishes the feasibility of
ancient DNA genome sequencing programs.
Genomic DNA sequences from extinct species can help reveal the process of
molecular evolution that produced modern genomes. However, the recovery of
ancient DNA is technologically challenging, because the molecules are degraded
and mixed with microbial contaminants, and individual nucleotides are often
chemically damaged (1, 2). In addition, ancient remains are invariably
contaminated with modern DNA, which amplifies efficiently compared with ancient
DNA, and therefore inhibits the detection of ancient genomic sequences (1, 2).
These factors have limited most previous studies of ancient DNA sequences to
polymerase chain reaction (PCR) amplification of mitochondrial DNA (3–8). In
exceptional cases, small amounts of single-copy nuclear DNA have been recovered
from ancient remains less than 20,000 years old obtained from permafrost or
desert environments, which are well suited to preserving ancient DNA (9–12).
However, the remains of most ancient animals, including hominids, have not been
found in such environments.
To circumvent these challenges, we developed an amplification-independent
direct cloning approach to constructing metagenomic libraries from ancient DNA
(Fig. 1). Ancient remains are obtained from natural environments in which they
have resided for thousands of years, and their extracted DNA is a mixture of
genome fragments from the ancient organism and sequences derived from other
organisms in the environment. A metagenomic approach, in which all genome
sequences in an environment are anonymously cloned into a single library, may
therefore be a powerful alternative to the targeted PCR approaches that have
been used to recover ancient DNA molecules. We chose to explore this strategy
with the extinct cave bear instead of an extinct hominid, to unambiguously
assess the issue of modern human contamination (1, 2). In addition, because of
the close evolutionary relationship of bears and dogs, cave bear sequences in
these libraries can be identified and classified by comparing them to the
available annotated dog genome. The phylogenetic relationship of cave bears and
modern bear species has also been

1United States Department of Energy Joint Genome Institute, Walnut Creek, CA
94598, USA. 2Genomics Division, Lawrence Berkeley National Laboratory,
Berkeley, CA 94720, USA. 3Max Planck Institute for Evolutionary Anthropology,
Leipzig, D-04103, Germany. 4Institute of Paleontology, University of Vienna,
Vienna, A-1010 Austria. 5Biosciences Directorate, Lawrence Livermore National
Laboratory, Livermore, CA 94550, USA.
*To whom correspondence should be addressed. E-mail: [email protected]
average insert coverage calculated by two-tailed t test). This result suggests
that it may be possible to discriminate between inserts derived from short
ancient DNA molecules and inserts containing modern undamaged DNA in ancient
DNA libraries. This may have relevance to the application of these methods to
ancient hominids, in which the ability to distinguish ancient hominid DNA from
modern contamination will be essential.
The remaining inserts with BLAST hits to sequences from known taxa were derived
from other eukaryotic sources, such as plants or fungi, or from prokaryotic
sources (bacteria and archaea), which provided the majority of known sequences
in each library. The end-repair reaction performed on each ancient DNA extract
is likely biased toward less-damaged ancient DNA fragments and modern DNA,
which could contribute to the abundance of prokaryotic sequences relative to
cave bear sequences in these libraries. The representation we observe in the
libraries thus reflects the proportion of clonable sequences from each source,
not the true abundances of such sequences in the original extracts. However,
the results from both libraries demonstrate that substantial quantities of
genuine cave bear genomic DNA are efficiently end-repaired and cloned, despite
this possible bias. A considerable fraction of inserts in each library (17.3%
in library CB1 and 11.2% in library CB2; Fig. 2) had hits only to
uncharacterized environmental sequences. The majority of these clones had BLAST
hits to GenBank sequences
derived from a single soil sample (17), consistent with the contamination of
each cave bear bone with soil bacteria from the recovery site. As in other
metagenomic sequencing studies, most inserts in each library had no similarity
to any sequences in the public databases.
To annotate cave bear genomic sequences, we aligned each cave bear sequence to
the dog genome assembly using the BLAST-like alignment tool (BLAT) (18). 6.1%
of 6775 cave bear nucleotides from library CB1 and 4.1% of 20,086 cave bear
nucleotides from library CB2 aligned to predicted dog RefSeq exons, in a total
of 21 genes distributed throughout the dog genome (Fig. 2, C and D, and table
S2). 4.1% and 6.2% of cave bear nucleotides, respectively, from library CB1 and
library CB2 aligned to constrained nonexonic positions in the dog genome with
phastCons conservation scores ≥ 0.8 (conserved noncoding, Fig. 2, C and D)
(14). The majority of cave bear sequence in each library, however, aligned to
dog repeats or regions of the dog genome with no annotated sequence features.
These latter sequences are likely fragments of neutrally evolving,
nonrepetitive sequence from the cave bear genome. Constrained sequences are
slightly overrepresented in our set: Only 1.7% of bases in the dog genome
assembly were annotated as RefSeq exons, and ~10% of the cave bear sequences we
obtained appear to be constrained overall, whereas 5 to 8% of positions in
sequenced mammalian genomes are estimated to be under constraint (19). This
discrepancy may be caused by our use of BLAST sequence similarity to identify
cave bear sequences, an approach that is biased in favor of more-constrained
sequences. Nevertheless, coding sequences, conserved noncoding sequences, and
repeats appear in both cave bear genomic libraries at frequencies roughly
proportional to what has been observed in modern mammalian genomes.
To determine whether the cave bear sequences we obtained contain sufficient
information to reconstruct the phylogeny of cave bears and modern bears, we
generated and aligned 3201 bp of orthologous sequences from cave bears and
modern black, polar, and brown bears and estimated their phylogeny by maximum
likelihood (Fig. 3A) (20). This phylogeny is topologically equivalent to
phylogenies previously obtained using cave bear and modern bear mitochondrial
DNA (13). This result further indicates that our libraries contain genuine cave
bear sequences and demonstrates that we can obtain sufficient ancient sequences
from
Materials
Equipment:
Genome Sequencer 20 Instrument (Software 1.0.53)
Reagents:
Sample Preparation: GS Paired End Adaptor Kit; GS emPCR Kit II (Amplicon A, Paired End)
Sequencing: GS 20 Sequencing Kit; GS PicoTiterPlate Kit

Methods
For complete details on the Paired End DNA library preparation procedure and
sequencing, please refer to the GS Guide to Paired End Sequencing, the GS emPCR
Kit User's Manual, and the Genome Sequencer 20 Operator’s Manual.
Figure 3:
A) The Paired End library is amplified onto beads using the emPCR process.
B) The bead-bound library is deposited onto the PicoTiterPlate.
C) The Paired End reads are sequenced on the Genome Sequencer 20 Instrument.
Figure 4: De novo assembly results for E. coli K12 aligned against a reference
genome. The reference genome is represented by the top black line. The standard
whole genome shotgun sequence and assembly is represented by the pink bars.
Repeat regions of the genome are represented by the green bars at the bottom of
the figure. Spaces between the pink bars are typically a result of repeat
regions that cannot be uniquely assigned to a region in the genome. With the
addition of one sequencing run of Paired End reads (represented by the purple
bars) using the Genome Sequencer 20 Instrument, the genome sequence becomes
much more complete.
References
[1] Margulies, M (2005) Nature 437:376-380
For more information, visit www.genome-sequencing.com
NOTICE TO PURCHASER
RESTRICTION ON USE: Purchaser is only authorized to use the Genome Sequencer 20
Instrument with PicoTiterPlate devices supplied by 454 Life Sciences
Corporation and in conformity with the procedures contained in the Operator’s
Manual.
Trademarks
PICOTITERPLATE is a trademark of 454 Life Sciences Corporation, Branford, CT,
USA.
Other brands or product names are trademarks of their respective holders.
Roche Diagnostics
Roche Applied Science
© 2006 Roche Diagnostics. All rights reserved. 04926447001 Indianapolis, Indiana