Sequencing The Human Genome Lab Assignment (Main)
Revised and Updated
Edvo-Kit #339
Sequencing the
Human Genome
Experiment Objective:
In this experiment, students will read DNA sequences obtained from automated
DNA sequencing techniques. The data will be analyzed using publicly available
databases to identify genes and gene products. The impact of Genomics
will be discussed in the context of today's society.
339.150623
Table of Contents
Experiment Components
Experiment Requirements
Background Information
Experiment Procedures
Study & Discussion Questions
Instructor's Guidelines
Answers to Exercises
Answers to Study Questions
Experiment Components
This experiment contains a total of twelve sections of automated DNA sequence printouts. Students can use any
Human Genome sequence database to perform the activities in this lab. For purposes of simplification we
have chosen to illustrate the database offered by the NCBI.
Requirements
EDVOTEK and The Biotechnology Education Company are registered trademarks of EDVOTEK, Inc.
Duplication of any part of this document is permitted for non-profit educational purposes only. Copyright © 1989-
2015 EDVOTEK, Inc., all rights reserved. 339.150623
Background Information
The haploid human genome comprises approximately three billion base pairs of DNA that are organized into 23
chromosomes. The order of these nucleotides creates genes, which are discrete units of genetic information that
contain the instructions to build and maintain an organism. DNA sequencing is the process of determining the
precise order of these nucleotides. In 2001, the first draft sequences of the human genome were published, one by
an international coalition known as the Human Genome Project and one by a company called Celera. Although
these initial studies took multiple years and required a lot of people, advances in sequencing technology have
made acquiring full genome data simple and fast. In 2015 a full sequence could be acquired in a day for around
one thousand dollars. Given that much of this sequence data is freely available online, the new challenge in
genetics involves finding creative and efficient ways to analyze and manage the vast amounts of data being
generated. This has resulted in the growing field of bioinformatics – a discipline that blends computer science,
biology, and information technology.
Information from the human genome can help us better understand our physiology and the biological basis of
inherited diseases. For example, DNA sequencing of individual patients underpins personalized medicine and is
changing the role of genetics in medicine. Personalized medicine uses an individual's genetic profile to guide
decisions regarding the prevention, diagnosis, and treatment of disease. Table 1 highlights five promising areas of
personalized medicine. Although these tests and services provide amazing therapeutic possibilities, they also
generate patient information that is unprecedented in its detail and permanence. This raises new legal and
logistical questions about protecting doctor-patient confidentiality. For example, do parents have the right to
order genetic tests for their minor children, or can insurance companies increase rates or deny service due to a
potential genetic issue?
The Genetic Information Nondiscrimination Act of 2008 addresses some of these concerns by prohibiting the use
of genetic information in employment and health insurance decisions. In-depth discussion is needed to
balance improvements to human health with the ethical consequences.
Besides a role in health care, human genome sequencing has, and will continue to have, a large impact on our
understanding of human history, evolution, and general biology. In the field of phylogeography, scientists examine
current and ancient geographic patterns of molecular genetic variation to learn more about humans' global
expansion and adaptation. Studies at the genomic scale have shown extensive interbreeding between separated
populations, far more than was previously estimated based on individual loci studies. Another promising area is
genomic comparison between humans and other organisms, which helps connect the biology of model organisms to
human physiology. Comparative genomic studies are also helping to decipher the roles of protein coding genes,
noncoding RNAs, and regulatory sequences in evolution. Similarly, by studying the structure and activity of the
human genome itself we can ask questions about the function of DNA at the levels of genes, RNA transcripts, and
protein products. However, for genomic data to provide insight into these questions, scientists must address
the computational challenges around data analysis, integration, and visualization.
Table 1: Promising areas of personalized medicine
Molecular characterization of rare disease: Provide a definitive diagnosis and suggest new treatment options
Pharmacogenomics: Optimize drug therapy to ensure maximum efficacy with minimal adverse effects
Population screening for disease risk: Increase individuals' health knowledge and encourage proactive health changes
Preconception and prenatal screening: Inform parents about the risk of disease in offspring
One of the largest and most influential databases is known as GenBank. This free, open-access database
contains over a trillion nucleotide bases of publicly available sequence data. Each entry in GenBank contains a
sequence and an accession number as well as supporting bibliographic and biological annotations such as
author references and taxonomic data. The NCBI (National Center for Biotechnology Information) oversees and
maintains the database as a whole, but each entry is submitted directly by individual laboratories. Direct
submission has allowed the database to keep pace with the rapid growth in sequence data production.
However, it also means that heterogeneity in entry quality exists, especially in the certainty of each
nucleotide’s identity and in the extent of attached annotation. These can vary depending on the goals of the
study, the physical properties of the DNA region(s), and the chosen sequencing method. To address this, GenBank
classifies the sequence information based on the sequencing strategy used to obtain the data.
[Figure: Automated DNA sequencing. Panels show the DNA template strand (5’ to 3’); the labeled strands, from shortest to longest, each ending in a dideoxynucleotide (ddA, ddC, ddG, or ddT); the direction of movement of the strands past the laser and detector; and the printer output, which runs from the last nucleotide of the shortest labeled strand to the last nucleotide of the longest labeled strand.]
Associated with this database are several useful bioinformatics tools including the Basic Local Alignment
Search Tool, or BLAST. The BLAST tool finds regions of local similarity between a user’s DNA sequence and
sequences in the GenBank database. Such similarity suggests homology, the existence of shared ancestry
between the genes. In BLAST terminology the user’s input sequence is known as the query sequence,
sequences in the database are known as target sequences, and sequences with similarities to the input
sequence are hits. The user can draw inferences about the putative molecular function of the query sequence by
looking at the hits. BLAST is also used to identify unknown species, locate known protein domains, and find
potential chromosome locations.
BLAST takes a heuristic approach to the problem of searching through such a mammoth database of target
sequences. This means that it takes shortcuts in order to find sequence matches in a reasonable time frame.
These shortcuts are based on the assumption that biologically similar sequences will contain short stretches of
very high scoring matches. BLAST attempts to find these high scoring segment pairs by removing low complexity
regions, dividing the sequence into much shorter seeds, and then scanning the database for matches. Once it
has generated a list of high scoring matches BLAST extends the seeds to see if they are contained in longer
high scoring alignments. By searching the GenBank database this way BLAST can return results very quickly
although it sacrifices some accuracy and precision.
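The seed-and-extend idea described above can be sketched in a few lines of Python. This is a toy illustration, not BLAST's actual implementation: the function names, the word size `k`, the match/mismatch scores, and the `x_drop` cutoff are all assumptions chosen for readability, and the sketch leaves out low complexity filtering, gapped extension, and statistics.

```python
def find_seeds(query, target, k):
    """Return (query_pos, target_pos) pairs where a k-letter word matches exactly."""
    index = {}
    for i in range(len(query) - k + 1):
        index.setdefault(query[i:i + k], []).append(i)
    hits = []
    for j in range(len(target) - k + 1):
        for i in index.get(target[j:j + k], []):
            hits.append((i, j))
    return hits


def extend_seed(query, target, qi, ti, k, match=1, mismatch=-2, x_drop=5):
    """Extend one seed left and right without gaps.

    Extension in each direction stops once the running score falls more than
    x_drop below the best score seen so far. Returns (score, query_start,
    query_end), with query_end exclusive.
    """
    def walk(q, t, step):
        score = best = 0
        length = best_len = 0
        while 0 <= q < len(query) and 0 <= t < len(target):
            score += match if query[q] == target[t] else mismatch
            length += 1
            if score > best:
                best, best_len = score, length
            elif best - score > x_drop:
                break
            q += step
            t += step
        return best, best_len

    right_score, right_len = walk(qi + k, ti + k, 1)
    left_score, left_len = walk(qi - 1, ti - 1, -1)
    total = k * match + left_score + right_score
    return total, qi - left_len, qi + k + right_len


# Hypothetical toy sequences, only to show the calls.
query = "ACGTACGTAC"
target = "TTACGTACGTACTT"
seeds = find_seeds(query, target, 4)
print(extend_seed(query, target, 0, 2, 4))
```

Because only exact word matches are used as starting points, an alignment with no exact `k`-letter match is never found, which is the accuracy the heuristic gives up in exchange for speed.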
BLAST is popular not only because of its speed but also because it computes the statistical significance of the
solutions. In addition to the accession number, description, and genome link, BLAST provides a score, bit score,
and E value. The score, S, is a raw measure of the quality of alignment between the query and the hit. User
chosen variables that incorporate molecular and biochemical concepts heavily influence this value. The bit score is
the raw score normalized for the scoring system used, which makes scores from different searches comparable. The
E value is the number of alignments with a score of at least S that would be expected by chance in a database of
that size, so smaller E values indicate more significant matches. Scores, bit scores, and E values are a good first
indicator of similarity between sequences; however, the alignment itself should also be examined to ensure accuracy.
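The relationship between the raw score, the bit score, and the E value can be sketched with the standard Karlin-Altschul formulas, E = K·m·n·e^(−λS) and S' = (λS − ln K) / ln 2, where m is the query length and n is the total database length. The parameters λ and K depend on the scoring system; the numeric values below are illustrative placeholders and are not taken from any BLAST report.

```python
import math


def bit_score(raw_score, lam, k):
    """Normalize a raw alignment score S into a bit score S'."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_value(raw_score, query_len, db_len, lam, k):
    """Expected number of chance alignments scoring at least raw_score."""
    return k * query_len * db_len * math.exp(-lam * raw_score)


# Placeholder parameters (assumptions for illustration only).
LAMBDA, K = 1.3, 0.6
s_bits = bit_score(40, LAMBDA, K)
expect = e_value(40, query_len=40, db_len=1_000_000, lam=LAMBDA, k=K)
print(s_bits, expect)
```

The two quantities are linked by E = m·n·2^(−S'), so a higher bit score or a smaller database both lead to a smaller E value.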
This exercise introduces students to genomics and bioinformatics. In order to gain experience in database searching,
students will use the free service offered by the National Center for Biotechnology Information (NCBI). At present,
GenBank comprises several databases including the GenBank and EMBL nucleotide sequences, the non-redundant
GenBank CDS (protein sequences) translations, and the EST (expressed sequence tags) database. For this
experiment, we recommend using a database offered by the NCBI. These exercises will involve using BLASTN,
whereby a nucleotide sequence will be compared to other sequences in the nucleotide database. For each of
the three sequences, students should identify a potential human disease and discuss related bioethical issues.
Experiment Procedures
4. On the new BLAST Home screen select “nucleotide blast” which is the first option under the Basic BLAST list.
6. Enter the nucleotide sequence into the large box in the “Enter Query Sequence” section; be careful to type
the following sequence exactly: ggcaactgcccaaagtgtgatccagcctgtctcaacagaa
7. Under “Choose Search Set” make sure that “Others (nr etc)” is selected and that “Nucleotide collection (nr/nt)”
is highlighted in the dropdown menu. The remaining entries should be left blank.
10. Once the “BLAST” query box has been clicked you will be assigned an ID#. Record this number so you
can check your results at a later time.
b. Graphic Summary section that shows the alignment of database matches to the query sequence. The color
of the boxes corresponds to the score of the alignment with red representing the highest alignment scores.
c. Description section that shows all the sequences in the database with significant sequence homology to
our sequence. By default the results are sorted according to the E-value but you can click on the column
header to sort the results by different categories. Notice that there can be several different entries with
identical high scores.
d. Alignment section that shows alignment blocks for each BLAST hit. Each alignment block begins with a
summary that includes the Max score and expected value, sequence identity, the number of gaps in the
alignment, and the orientation of the query sequence relative to the subject sequence.
12. Select a sequence to focus on in depth. You can do this by clicking on a colored bar in the graphic section,
clicking on the sequence name in the description section, or scrolling down to the alignment section. Then click
on the sequence ID. This brings up additional information about the subject sequence, including the gene
name, the genus and species of origin, and articles written about the gene. After performing this search,
the top hit should be Bos taurus epidermal growth factor receptor (EGFR), mRNA. Sequence ID:
ref|XM_002696890.3|. If the top hit does not match, try re-entering the sequence. Be sure to double check the
search parameters before BLASTN searching.
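If the top hit does not match, a typing error in the query is a likely cause. The following is a minimal sketch of a check that could be run on a typed query before pasting it into the query box. The helper name `clean_query` is hypothetical and is not part of BLAST or the NCBI website; it only strips whitespace and digits and rejects characters other than A, C, G, T, and N.

```python
VALID_BASES = set("ACGTN")


def clean_query(raw):
    """Strip whitespace and digits, upper-case, and reject unexpected characters."""
    cleaned = "".join(
        ch for ch in raw if not ch.isspace() and not ch.isdigit()
    ).upper()
    bad = sorted(set(cleaned) - VALID_BASES)
    if bad:
        raise ValueError(f"unexpected characters in query: {bad}")
    return cleaned


# The practice sequence from step 6 of the procedure.
kit_query = "ggcaactgcccaaagtgtgatccagcctgtctcaacagaa"
cleaned = clean_query(kit_query)
print(cleaned, len(cleaned))
```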
Natalie Perales
11 July 2023
Molecular Diagnostics
EXERCISE 1
Now that you have familiarity with the entry and submission process of BLAST, read the DNA sequence
analysis from the automated gel run sequence printout (any lane) and find the gene that this sequence
fingerprint identifies.
To do this:
(1) Identify the nucleotide sequence (100-200 nucleotides) from the DNA Sequence readout.
(2) Type at least 70 bases in the query box of the BLAST program at the NCBI website. The bases can
be from any region of the sequence, but they should be contiguous.
(3) Examine the BLASTN search report, identify a likely gene, and examine the gene ID for detailed
information.
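Step (2) asks for at least 70 contiguous bases, and ambiguous base calls in a read are marked with N. One way to choose a region is to take the longest stretch without any N, as in the sketch below. The function names and the example read are hypothetical; only the 70-base minimum comes from the step above.

```python
import re


def longest_clear_run(read):
    """Return the longest stretch of the read made up only of A, C, G, and T."""
    runs = re.findall(r"[ACGT]+", read.upper())
    return max(runs, key=len, default="")


def pick_query(read, min_len=70):
    """Return the longest unambiguous stretch, or raise if it is too short."""
    run = longest_clear_run(read)
    if len(run) < min_len:
        raise ValueError(
            f"longest unambiguous stretch is {len(run)} bases; need {min_len}"
        )
    return run


# Hypothetical short read, with a lower minimum so the example runs.
print(pick_query("NNACGTNACGTACGNN", min_len=5))
```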
Once the gene has been identified, answer the following questions:
(a) What is the name of this gene?
Sequence 2 – Homo sapiens RAC family small GTPase 1 (RAC1), transcript variant Rac1b,
mRNA
(b) Compared to the GenBank entry, what strand have you read?
NM_018890.4
(c) Can you find a paper that has been written about this gene? Write down the name of
one of the contributing authors.
Yes; Mingyu Chen
(d) Identify a disease caused by mutations in this gene. What would be the implications
if a doctor tested for this disease? What if an employer or an insurance company
tested for this disease?
Lung Adenocarcinoma Brain Metastasis. If a doctor tested for this disease,
the test would show significantly increased levels of RAC1 in
Lung Adenocarcinoma (LUAD) tissue, based on the HPA database. The doctor
would then decide to increase patient monitoring. If an employer/insurance
company tested for the disease, they might charge a large
amount of money for treatment or decide not to treat the patient.
• The automated sequence differentiates the bases as follows: A is green, C is blue, G is black, and T is red.
• DNA is double stranded and contains a top (5’➔3’) and bottom (3’➔5’) strand (sometimes this
corresponds to the coding and noncoding strands). A DNA sequence is always entered in the 5’➔3’
direction.
• Sometimes it is difficult to read a nucleotide peak. This is particularly true at the beginning and end
of a sequence read where peaks may overlap. Such ambiguous places are often labeled with an N rather
than one of the four nucleotides. It is often best to skip sections with lots of Ns.
• Because researchers call the same gene different names, several possibilities may exist for each
sequence.
• By clicking on the GenBank accession number you can access additional information such as the protein/
amino acid sequence, descriptions of the sequence/gene, and the contributing scientists' names.
• Some high-scoring sequence matches may be predicted genes that do not have any research papers
associated with them. Students may have to check several matches before they will find a
sequence with an associated reference.
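Because a sequence is always entered 5’➔3’, a read taken from the bottom strand corresponds to the reverse complement of the top strand. A minimal sketch of that conversion follows; the function name is an assumption, and N is left as N.

```python
# Map each base to its complement; ambiguous N stays N.
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the opposite strand of seq, written 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


print(reverse_complement("ATGC"))  # prints GCAT
```

Applying the function twice returns the original sequence, which is a quick way to check a hand-written complement.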
EXERCISE 2
Exchange your automated data sequence printout with another group and submit the sequence to
BLAST analysis. Write down the gene. Select a sequence associated with a published paper and record the
title and first author of the paper.
Gene Name: Sequence 3 - Trachypithecus francoisi CDC42 effector protein 3 (CDC42EP3), transcript
variant X3, mRNA
Study Questions
1. What is bioinformatics? How have advances in sequencing technology affected this field?
Bioinformatics is a discipline that blends computer science, biology, and information technology; it
is the science of collecting and analyzing complex biological data such as genetic codes. Advances
in sequencing technology have made it possible to acquire full genome data quickly and efficiently.
2. Name two sequencing methods and describe the trade-off between the production rate and the
length of the sequences produced.
Chain Termination Sequencing (Sanger Sequencing)
o Produces 500 – 800 bp
Sequencing by Synthesis
o Produces 50 – 150 bp
Compared to Sanger Sequencing, Sequencing by Synthesis has a much higher throughput
and a lower cost because many different DNA strands can be examined in a single run.
However, Sanger Sequencing produces longer sequence reads than Sequencing by
Synthesis.
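The trade-off can be made concrete with a rough calculation of how many reads are needed to cover a genome once. This is a simplified sketch: it assumes the approximately three billion base pair haploid genome from the Background Information and the read lengths listed above, and it ignores overlap between reads, errors, and uneven coverage. The function name and the `coverage` parameter are assumptions.

```python
import math


def reads_needed(genome_bp, read_len_bp, coverage=1.0):
    """Minimum number of reads of a given length to reach the target coverage."""
    return math.ceil(genome_bp * coverage / read_len_bp)


GENOME_BP = 3_000_000_000  # approximate haploid human genome size

print(reads_needed(GENOME_BP, 500))  # Sanger-length reads: 6000000
print(reads_needed(GENOME_BP, 150))  # short reads: 20000000
```

Shorter reads require several times more reads for the same coverage, which is why the short-read approach depends on examining many strands in a single run.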
3. What assumption does BLAST make? What are the advantages and disadvantages of making this
assumption?
BLAST assumes that biologically similar sequences will contain short stretches of very
high scoring matches.
Advantage
o Fast and it computes the statistical significance of the solutions
Disadvantage
o It reduces some accuracy and precision
DISCUSSION QUESTIONS
These questions are complex with no correct or incorrect answers. They are provided to stimulate discussions
about the application of genomics in the 21st Century.
3. What role should the government play in establishing guidelines for human genetic tests?
The government's role should go no further than putting proper precautions in place to
prevent improper use of genetic testing, protect individuals' right to privacy, and
ensure that everything is done ethically.