Tics and Homology Modeling
Introduction
This tutorial allows you to explore opsins -- the proteins that catch light for our eyes --
and the genes that code for opsins. But the real subject of this exercise is bioinformatics
-- the use of computers to search for, explore, and use information about genes, nucleic
acids, and proteins. While learning about the human opsins, you will use some of today's
most powerful bioinformatics tools, and you will even build a model of a protein whose
detailed structure is unknown (a process called homology modeling). You can follow up this
tutorial with a study of opsins from other organisms, or by exploring any class of
biomolecules that interest you.
Please realize that this tutorial merely scratches the surface of what you need to know in
order to use bioinformatics wisely in your research. If you want to learn more, including
vital guidance in judging the quality of your results, I recommend you turn next to
Bioinformatics for Dummies, by Claverie and Notredame, Wiley Publishing, Inc., 2007.
I assume that you are conversant with biochemistry and molecular biology. If you see
unfamiliar terms pertaining to the genes, mRNAs, and proteins used as examples here,
break out your biochemistry text, head for the index, and review, review, review.
For more information about each database or tool, go to its home page and read, read,
read. These tools come with plenty of help.
History
This web page was originally composed of somewhat sketchy procedures that I devised
by playing* with bioinformatics tools on the web. For five years or so, my biochemistry
students carried out the tutorial, and their suggestions led to many improvements, as have
emails from users around the world.
*My play with bioinformatics tools started with the book Bioinformatics for Dummies,
by Claverie and Notredame, Wiley Publishing, Inc., 2003. Not considering myself a
dummy in most of my areas of interest, I had never looked very hard at Wiley's
"Dummies" books. I'm so glad I looked at this one. The authors are on the frontiers of the
field, and they have produced a serious, high quality book. If you want to work through
lots of clear tutorials in all areas of bioinformatics, buy it. It was the best $30 I had spent
on a book in quite a few years. Just click the title above to learn more about the latest
edition, 2007 (only $20 now), which guided my October 2008 revisions of this tutorial.
Many thanks to Professors Claverie and Notredame for this friendly and powerful
resource.
Bioinformatics Tutorial
Cast of Characters
You will encounter these databases and software tools one by one as you follow this
tutorial. Use this page for reference if you can't remember the meaning of an acronym or
program name.
I. The Databases
• GenBank, operated by NCBI (National Center for Biotechnology Information)
Contains all publicly available sequences of DNA, with annotations, which are
constantly being extended and updated. Annotations include identification of a
gene, its gene product(s) (if known), and extensive links to all kinds of
information about the gene in other databases.
GenBank contains the same DNA sequence content as EMBL (European Molecular
Biology Laboratory) and DDBJ (DNA Data Bank of Japan).
• OMIM (Online Mendelian Inheritance in Man; woman, too)
An encyclopedia of human genes and genetic disorders, linked to gene entries
in GenBank and to scientific literature in PubMed. Gives complete and up-to-the-
minute information about many human genes.
• PDB (Protein Data Bank)
Contains all publicly available experimentally determined (by x-ray
crystallography and NMR) structural models of proteins and nucleic acids.
Does not contain homology models or other types of theoretical models.
• PubMed
Described in Wikipedia as "a free search engine for accessing the MEDLINE
database of citations and abstracts of biomedical research articles. The core
subject is medicine, and PubMed covers fields related to medicine, such as
nursing and other allied health disciplines. It also provides very full coverage of
the related biomedical sciences, such as biochemistry and cell biology. It is
offered by the United States National Library of Medicine at the National
Institutes of Health as part of the Entrez information retrieval system."
• UniProt Knowledgebase (Swiss-Prot and TrEMBL), operated by SIB (Swiss
Institute of Bioinformatics) and EBI (European Bioinformatics Institute).
Contains most of the publicly available sequences of proteins (not DNA or
RNA). Sequences in Swiss-Prot are annotated manually, and provide or link you
to just about all published information about the sequence. Sequences in TrEMBL
are collected and annotated automatically from sequence databases, and will make
their way to Swiss-Prot, but only after they are manually annotated to meet Swiss-
Prot standards.
Start Finding Genes
Preview
In this section, you will use the NCBI Map Viewer and the keyword opsin to get a list of
opsin or opsin-related genes in the human genome.
Human Opsins
The subject of this tutorial is human opsins, which are found in the cells of your retina.
Opsins catch light and begin the sequence of signals that result in vision. We will proceed
by asking questions about opsins and opsin genes, and then using bioinformatics to
answer them.
When I provide a web address, I'll also make it a link -- just click it to go to the site in a
new browser window. Then make it a bookmark so you can find it again. This tutorial
will still be open in the window behind the new one.
WARNING: Bioinformatics tools evolve rapidly, faster than I can make changes to this
tutorial. So if a page does not look exactly like I say it should, or if its title is different,
look around and try to do what the tutorial says. You should find the same links, but
names may be slightly different, or many new links may have been added (bioinformatics
pages never get simpler). If the differences are so great that you can't proceed, send me
email (see contact link at top of Tutorial Contents), and I'll adapt the instructions to the
changes as soon as I learn about them.
Find Homo sapiens (human), and click on the magnifier tool beside the lowest-
numbered Build (a build is an assembly of the genome, which is done repeatedly). We
will use the older build because sometimes not all searching and viewing tools are
connected to the newest build, which is in progress. The magnifier tool takes you to the
Search Page for the organism, which shows a chromosome diagram, and provides input
boxes (at top of page) for searches.
Click Find.
You see the diagram again, with red marks at your "hits", the locations of genes whose
entries contain "opsin" as a whole or partial word. Below the diagram is a list of the
indicated genes.
If the list is very long, simplify it using the Quick Filter box on the right at the top of the
list; check the box marked Gene, and then click Filter. If you are already seeing the
filtered list, the Quick Filter box will not be present.
In the list of genes related to the search term opsin, there are the rhodopsin gene (RHO),
and three cone pigments, short-, medium-, and long-wavelength sensitive opsins (for
blue, green, and red light detection). Four hits look like visual pigments, which should
not surprise you. To the left of each entry is the chromosome number, allowing you to tell
which red mark corresponds to each entry. Note that several hits are on the X
chromosome, one of the sex-determining chromosomes.
NOTE: In the human genome lists, you will often see duplicates marked reference or
Celera, referring to the results from two major efforts to sequence the human genome. At
first, these two efforts were separate, but eventually they came together. When you have a
choice, choose "reference," so you will be following the same path I followed in setting
up the tutorial.
You can get more details on multiple hits on the same chromosome with the all matches
link for that chromosome. Click all matches next to X. Be patient: the next page may
load slowly--it's packed with information.
You see a very complicated display (don't sweat -- we're going to use only a part of this).
On the left is a diagram of the X chromosome, with red marks at the positions of the
gene(s) you've followed to this page -- in our case, the two opsins, medium- and long-
wave, which are located near the bottom tip of the X chromosome. To the right are
various representations of the X chromosome, with listings of annotated areas. The two
opsin genes are highlighted in pink. If you pass your cursor over this page without
clicking, you will find that some symbols provide brief information, mostly about regions
that are not yet characterized well enough to have a full entry.
As you can see, there is a tremendous amount of information on this page, with links to
much more. If you want full information about the meanings of abbreviations and
symbols on this page, as well as the kinds of information linked to the page, you can use
Map Viewer Help at the top of the page. You will find abundant information about the
Map Viewer, explanations of all symbols and links, and even tutorials about how to ask
and answer all kinds of questions about the genome. The Map Viewer is like the Google
Earth of the genome, and as with Google Earth, the amount of information is sometimes
daunting.
For now, note the information provided for the opsin gene OPN1LW (called the gene
symbol). You see that this is the long-wavelength-sensitive (red) opsin, and that it's a
gene involved in color blindness (a sex-linked trait -- no surprise, because we find the
gene on the X chromosome).
All About a Gene
Preview
In this section, you will explore a few links to extensive information about specific genes.
You have entered the OPN1LW opsin 1 page of Entrez Gene, which is a sort of
highway interchange with routing to all sorts of information about this gene. Scan down
the page. Some of the information is very plain and understandable, while some is very
cryptic. One of the most accessible links is to OMIM (Online Mendelian Inheritance in
Man), a catalog of human genes and genetic disorders. Despite the name, the database
includes genes of women, too.
Look down the page and find the Phenotypes section, and notice the links marked MIM.
These are links to OMIM entries. Click one of them.
Each OMIM entry tells you about this gene and types of colorblindness, genetic
disorders associated with mutations in this gene. Read as much as your interest dictates.
Follow links to other information. For more information about OMIM itself, click the
OMIM logo at the top of the page. Through OMIM, a wealth of information is available
for countless genes in the human genome, and all information is backed up by references
to the latest research articles.
Once you've satisfied your appetite, return to the Entrez Gene page (use the Back button
of your browser or your browser's history list).
Next to the Display button, pull down the menu and select PubMed (calculated) Links.
You have entered PubMed, a free database of scientific literature, and you see the results of a
complete search for articles directly associated with this gene locus. By clicking on the
authors of each article, you can see abstracts of the article. If you are on a university
campus where there is online access to specific journals, you might also see links to full
articles. PubMed is your entry point to a wide variety of scientific literature in the life
sciences. On the left side of any PubMed page, you will find links to a description of the
database, help, and tutorials on searching.
Finding Sequences
Preview
In this section, you will learn how to obtain nucleic acid or protein sequence information,
in a format called FASTA, that is easy to use as input into bioinformatics tools.
• the mRNA Sequence (sequence of nucleotide bases in the messenger RNA), here
listed as NM_020061.3 (M for mRNA);
• the protein sequence (sequence of this gene's protein product, the red opsin), here
listed as NP_064445.1 (P for protein);
• the source sequences (entire sequences of all of the overlapping genome
fragments in which this sequence was found, from GenBank).
Note that the two links to mRNA sequence and protein sequence are given as
NM_020061.3→NP_064445.1, the arrow implying that the sequence of the NM entry is
translated (by protein synthesis) to give the sequence of the NP entry.
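The accession prefixes described above can be decoded mechanically. Here is a minimal sketch, assuming only the two prefixes mentioned in this section; the function name describe_refseq and the lookup table are made up for illustration and are not part of any NCBI tool.

```python
# Hypothetical helper: interpret the two accession prefixes named in the text.
PREFIX_MEANING = {
    "NM_": "mRNA",
    "NP_": "protein",
}

def describe_refseq(accession):
    """Return (molecule type, accession without version, version number)."""
    base, _, version = accession.partition(".")
    kind = PREFIX_MEANING.get(base[:3], "unknown")
    return kind, base, int(version) if version else None

print(describe_refseq("NM_020061.3"))
print(describe_refseq("NP_064445.1"))
```

The number after the dot is the version of the entry, so you may see a different version number on the live page than the one quoted here.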
This is a typical GenBank nucleotide file, and a lot of it is hard to read, but a few things
are clear. First note, under references, citations to the publication of this sequence in the
scientific literature. To see an abstract of the article in which this gene was described,
click the PubMed link (a number) below the first reference and read it.
Scroll to the bottom of this long page. The last thing, labeled ORIGIN, is the sequence of
this messenger RNA. You are seeing the actual list of As, Ts, Gs, and Cs that make up the
message for synthesis of this opsin. But wait! You know that RNA contains no T. In most
nucleotide databases, U from RNA is represented as T, to make for easy comparison of
DNA and RNA sequences. This sequence information is not in the form that is most
useful for searching in databases, say, searching for related genes. Let's display this entry
in a form more useful for searching.
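If you ever need the true RNA alphabet, the T-for-U convention is easy to undo. The following is a small sketch; the function names and the example fragment are invented for illustration.

```python
def rna_from_database_form(seq):
    """Replace T with U to recover the RNA alphabet."""
    return seq.upper().replace("T", "U")

def database_form_from_rna(seq):
    """Replace U with T, as nucleotide databases usually store RNA."""
    return seq.upper().replace("U", "T")

example = "ATGCTTGA"  # made-up fragment, not taken from the opsin entry
print(rna_from_database_form(example))
```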
At the top of the page, beside the Display button, pull down the menu that says GenBank
(the default display format for each entry), and select FASTA (note that several other
display options are available). Now you see one descriptive or "comment" line that
begins with ">", followed by the nucleotide sequence. This little bit of text is just what
you need to search nucleotide databases for similar sequences.
Keep it for future use, as follows. Click and drag on the web page to select everything
from the ">" through the last nucleotides (CCAA). Be careful not to select anything else.
From your browser's Edit menu, select Copy to make a copy of this information on your
clipboard, for pasting elsewhere. Now start a simple word processor (use TextEdit on
Mac, Notepad on Windows—to avoid inadvertent changes in crucial formatting of
sequence files), make a new document, and paste. The FASTA comment and sequence
should appear. If necessary, select all of the text and change the font to Courier or
Monaco -- these "typewriter" fonts make it easy to align letters into columns, because all
letters are the same width. Save this file, choosing text or plain text as the file type. Call it
mrnared.txt (for mRNA sequence of red opsin). Save it to a convenient location for this
and other files you'll be making for later searches.
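To check that a saved file really is clean FASTA (one ">" comment line, then sequence lines), you could read it back with a few lines of code. This is only a sketch: parse_fasta is a hypothetical helper, and the sample record is made up so that the example runs without any file on disk.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (comment, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is None:
            raise ValueError("FASTA text must begin with a '>' comment line")
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

# Made-up record for illustration; for a real file you would pass
# open("mrnared.txt").read() instead.
sample = ">demo_sequence toy example\nACGTAC\nGGTTAA\n"
print(parse_fasta(sample))
```

If the parser raises an error or returns an unexpected comment line, you probably selected extra text when copying from the web page.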
Click your browser's Back button until you return to the Entrez Gene page for this gene.
Things look a lot like before, but this is a protein entry (the classical view is that gene
products are proteins, but many are not), containing the amino-acid sequence in one-letter
abbreviations. Just as with the mRNA entry, turn this into a FASTA display, and copy it
into a new word-processor document. Save it in text format as protred.txt (for protein
sequence of red opsin). Return to Entrez Gene.
Now take a look at the chromosome region that contains the red opsin gene. Scroll back
to near the top of the Entrez Gene page for OPN1LW, to the section called Genomic
context. The diagram shows you that the red opsin gene lies on the X chromosome,
within a segment of base pairs (bp) stretching from position 152,929,151 to position
153,114,725 (a distance of 185,574 bp). [Don't worry if these numbers are not exactly the
ones you see; these resources are constantly being updated.] The location of OPN1LW,
shown as a red arrow, is about 3/4 of the way down this segment.
Now look at the diagram in the preceding section, Genomic regions, transcripts, and
products. This diagram gives a closer look at the OPN1LW segment, representing only
positions 153,062,939 to 153,077,701 (14,762 bp). The lower line shows coding regions
as red blocks, noncoding regions as red lines. Here is the surprise: You knew, but you
might have forgotten, that eukaryotic genes are often interrupted by non-coding regions
called intervening sequences or introns. The coding regions are called exons. From this
diagram, you can see that the OPN1LW gene consists of 6 exons and 5 introns, and that
the introns are far larger than the exons. Of the 14,762 bp in the "gene", only 1095 bp
code for protein, which means that less than 8% of the base pairs contain the code. When
this gene is expressed in cells in the human retina, an RNA copy of the entire gene is
synthesized. Then the intron regions are cut out, and the exon regions joined together to
produce the mature mRNA (a process called splicing), which will be translated by
ribosomes as they make the red opsin protein. In this case, 92% of the initial RNA
transcript is tossed out, leaving the pure protein code. Seems wasteful, but our
understanding of how all this works, while impressive, is still pretty fragmentary.
Tomorrow will tell us what eludes us today, but not what eludes us
tomorrow.
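The arithmetic behind these figures is easy to check. The sketch below uses only the positions and the 1095-bp coding length quoted in this section; the variable names are my own, and the numbers on the live page may have changed since.

```python
# Positions quoted in the text (Genomic context and the closer gene view).
context_start, context_end = 152_929_151, 153_114_725
gene_start, gene_end = 153_062_939, 153_077_701
coding_bp = 1095

context_len = context_end - context_start
gene_len = gene_end - gene_start
coding_fraction = coding_bp / gene_len

print(f"context span: {context_len:,} bp")
print(f"gene span:    {gene_len:,} bp")
print(f"coding:       {coding_fraction:.1%}")
print(f"removed:      {1 - coding_fraction:.1%}")
```

The coding fraction comes out a little above 7%, consistent with "less than 8%" coding and roughly 92% of the transcript removed.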
At the ends of the lower line in the diagram, there are links to NM_020061.3 and
NP_064445.1, the entries for the mRNA and protein sequences for this gene. You visited
these pages in the two sections above. Click CCDS 14742.1 at the far right of the
diagram to go to the Consensus Coding Sequence page for this gene. It shows nicely how
the OPN1LW gene transcript is divided into exons. Under Chromosomal Locations for
CCDS 14742.1 is a table listing start and end base-pair positions for each exon. Below
that is the full nucleotide sequence of the mature mRNA, with alternating blue and black
sections indicating exon boundaries. Farther below is the amino-acid sequence, again
divided into exons by alternating blue and black, with red indicating amino-acid residues
whose codons are partly in one exon and partly in the following exon. This makes it
dramatically clear how the mRNA is pieced together from the exons.
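The red residues on the CCDS page mark codons that straddle an exon junction. Here is a sketch of how such codons can be located from exon lengths alone. It assumes the lengths cover coding sequence only, starting at the first base of the start codon; the toy lengths are hypothetical and are not the real OPN1LW exons.

```python
from itertools import accumulate

def split_codons(exon_lengths):
    """Return 1-based codon numbers whose three bases span an exon junction."""
    # Number of coding bases that precede each junction (last total dropped).
    boundaries = list(accumulate(exon_lengths))[:-1]
    # A junction splits a codon when the bases before it are not a multiple of 3.
    return [b // 3 + 1 for b in boundaries if b % 3 != 0]

toy_exons = [10, 8, 12]  # hypothetical lengths for illustration
print(split_codons(toy_exons))
```

With these toy lengths, the first junction falls after base 10, inside codon 4 (bases 10 to 12), while the second falls after base 18, exactly between two codons.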
You still have not seen any of the actual sequences of the introns. Return to the Entrez
Gene page for OPN1LW. Under Genomic regions, transcripts, and products, click Go
to reference sequence details. This takes you down the page to NCBI Reference
Sequences. You were here before, to retrieve the mRNA and protein sequences. This
time, click the sequence of four entry numbers (all one link) beside Source Sequence(s).
This takes you to the Entrez Nucleotide page that contains information about all four of
the genome fragments from the Human Genome Project that contain all or part of the red
opsin gene, along with information about how each clone was produced. This entry thus
shows the gene in the larger context of the cloned fragments in which the gene was
found. These sequences allow you to explore flanking regions around the gene, which
might be useful in designing PCR primers for making useful quantities of this region.
From this page, you could also find neighboring sequences if you wanted to look farther
afield. As before, you can display this entry in FASTA format. You will get a series of
entries, each a different clone that was used to construct this region of the genome.
First BLAST Search
Preview
In this section, you will use a FASTA sequence as an input (query) to BLAST, a program
that searches a genomic database for similar sequences (hits). You will also learn how to
judge whether a hit arises by chance or by common ancestry.
This is the NCBI's BLAST search tool. BLAST is a widely used program for finding
sequences similar to a "query" sequence that you're interested in. Pick these options from
the various menus:
• Database: Build Protein for PREVIOUS build (look at bottom of the Database
menu). This means that you will search the protein sequences in the previous
build of the database. (Sometimes not all tools needed later are available in the
latest build, which is currently under construction.)
• Program: BLASTP (Use the version of BLAST that compares protein sequences,
unlike BLASTN, which compares nucleotide sequences.)
• Other Parameters: Make no changes.
Next, copy the FASTA data from your file protred.txt to your clipboard, and paste it into
the BLAST search box, above which it says, "Enter an accession..." Check to be sure that
the first character in the box is the ">" at the beginning of the FASTA data. Then click
Begin Search.
The next page is for formatting your search results. Accept all default settings, and just
click the View Report button. When your results are ready, the results of BLAST page
appears. Look down the page to the Graphic Summary, a box containing lots of colored
lines. Each line represents a hit from your blast search. If you pass your mouse cursor
over a red line, the narrow box just above the box gives a brief description of the hit.
You'll find that the first hit is your red opsin. That's encouraging, because the best match
should be to the query sequence itself, and you got this sequence from that gene entry.
The second hit is the green opsin -- remember that the PubMed entry reported that the red
and green pigments are the most similar. The third and fourth hits are the blue opsin and
the rod-cell pigment rhodopsin. Other hits have lower numbers of matching residues, and
are color coded according to a score of matches. If you click on any of the colored lines,
you'll skip down to more information about that hit, and you can see how much similarity
each one has to the red opsin, your original query sequence. As you go down the list, each
succeeding sequence has less in common with red opsin. Each sequence is shown in
comparison with red opsin in what is called a pairwise sequence alignment. Later, you'll
make multiple sequence alignments from which you can discern relationships among
genes.
See what you can figure out about what the scores mean. Identities are residues that are
identical in the hit and the query (red opsin), when the two are optimally aligned.
Positives are residues that are very similar to each other (see residue number 1 in the blue
opsin—it's threonine in red opsin, and the very similar serine in the blue). Gaps are
sometimes introduced into a hit to improve its alignment with the query. The more
identities and positives, and the fewer gaps, the higher the score. Note that blue opsin and
rhodopsin are only about 45% identical to the red opsin. Other proteins, which are
apparently not visual pigments, have even lower scores.
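To see how identities and gaps are counted, here is a small sketch that tallies them for one aligned pair. The function name and the two fragments are made up for illustration. Positives are left out because counting them requires a substitution matrix.

```python
def alignment_stats(query, hit):
    """Count identities and gap positions in two aligned, equal-length strings."""
    if len(query) != len(hit):
        raise ValueError("aligned sequences must have the same length")
    identities = sum(q == h and q != "-" for q, h in zip(query, hit))
    gaps = sum(q == "-" or h == "-" for q, h in zip(query, hit))
    return {
        "length": len(query),
        "identities": identities,
        "gaps": gaps,
        "percent_identity": 100.0 * identities / len(query),
    }

# Made-up aligned fragments; '-' marks a gap.
print(alignment_stats("TQWSL-RLAG", "SQWSLQRL-G"))
```

As on the BLAST results page, the percentage is taken over the full alignment length, gap positions included.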
The displays contain two prominent measures of the significance of the hit, 1) the
BLAST Score [labeled Score (bits)], and 2) the Expectation Value (labeled Expect or
E).
The BLAST Score indicates the quality of the best alignment between the query
sequence and the found sequence (hit). The higher the score, the better the alignment.
Scores are reduced by mismatches and gaps in the best alignment. Calculation of the
score is complex, involving a substitution matrix, which is a table that assigns a score to
each pair of residues aligned. The most widely used matrix for protein alignment is
known as BLOSUM62.
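The idea of scoring an alignment with a substitution matrix can be sketched in a few lines. Everything here is a simplification: the handful of scores is a toy table chosen for illustration (do not treat it as an authoritative excerpt of BLOSUM62), and the flat per-position gap cost is a hypothetical stand-in for the gap penalties BLAST really uses. The result is a raw score; the bit score shown by BLAST is a normalized form of it.

```python
# Toy substitution scores for illustration only.
TOY_MATRIX = {
    ("T", "T"): 5, ("S", "S"): 4, ("T", "S"): 1,
    ("W", "W"): 11, ("L", "L"): 4, ("W", "L"): -2,
}
GAP_PENALTY = -6  # hypothetical flat cost per gap position

def pair_score(a, b):
    """Look up a residue pair in either order."""
    if (a, b) in TOY_MATRIX:
        return TOY_MATRIX[(a, b)]
    return TOY_MATRIX[(b, a)]

def raw_score(query, hit):
    """Sum substitution scores and gap costs over an aligned pair."""
    total = 0
    for q, h in zip(query, hit):
        total += GAP_PENALTY if "-" in (q, h) else pair_score(q, h)
    return total

print(raw_score("TWL", "SWL"))  # a conservative T/S substitution
print(raw_score("TWL", "T-L"))  # the same query aligned with one gap
```

Notice that the similar pair T/S still earns a small positive score, while the gap costs more than it would have gained from any match in this toy table.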
The expectation value E of a hit tells whether the hit is likely to result from chance
likeness between hit and query, or from common ancestry of hit and query. (If E is
smaller than 10^-100, it is sometimes given as 0.0.) The expectation value is the number of
hits you would expect to occur purely by chance if you searched for your sequence in a
random genome the size of the human genome. E = 25 means that you could expect to
find 25 matches in a genome of this size, purely by chance. So a hit with E = 25 is
probably a chance match, and does not imply that the hit sequence shares common
ancestry with your search sequence. Expectation values of around 0.1 may or may not be
biologically significant (other tests would be needed to decide). But very small values of
E mean that the hit is biologically significant; that is, the correspondence between your
search sequence and this hit must arise from common ancestry of the sequences, because
the odds are simply too low that the match could arise by chance. For example, E =
10^-18 for a hit in the human genome means that you would expect only one chance match
in one billion billion different genomes the same size as the human genome.
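The link between bit score and E can be made concrete with the standard relation E = m * n * 2^(-S'), where S' is the bit score, m the query length, and n the database length. The sketch below is a simplification: BLAST itself uses corrected "effective" lengths, and the sizes in the usage example are hypothetical.

```python
import math

def expect_value(bit_score, query_len, db_len):
    """Expected number of chance hits: E = m * n * 2**(-S')."""
    return query_len * db_len * 2.0 ** (-bit_score)

def p_at_least_one(e_value):
    """Probability of one or more chance hits, assuming Poisson statistics."""
    return -math.expm1(-e_value)

# Hypothetical sizes: a 350-residue query, 10 million residues searched.
for bits in (30, 60, 100):
    e = expect_value(bits, 350, 10_000_000)
    print(bits, e, p_at_least_one(e))
```

Each additional bit of score halves E, which is why good hits reach such tiny expectation values so quickly.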
One place to find out more about BLAST searches and statistics is The BLAST Sequence
Analysis Tool in the NCBI Handbook.
Now you will see where all these hits are found on human chromosomes.
Where (in the human genome) are all the genes for these
other proteins?
Just above the Graphic Summary, click Human Genome View.
You have come full circle. You are back at the human chromosome diagram, and you see
all the hits of your search, in the colors that signify their BLAST scores as they were
shown in the Graphic Summary. Notice that there are about 100 proteins that have 40%
or more positives in alignment with red opsin. The opsins are members of the much
larger family of G protein-coupled receptors, key players in signal transduction.
Family Relations
Preview
In this section, you will learn how to gather a group of related sequences in FASTA
format, and then use them as inputs to the program ClustalW. The result is a multiple-
sequence alignment (MSA), from which you can deduce much about how the sequences
resemble and differ from each other. Then you will use the MSA as input to tree-printing
programs, in order to produce a phylogenetic tree—a visual summary of relationships
among the genes.
You see the home page of ExPASy, the Expert Protein Analysis System. As stated in the
Cast of Characters, ExPASy is a complete protein tool box. With ExPASy, you can do
almost any imaginable analysis or comparison of protein sequences and structures. In my
humble opinion, Swiss sequence database tools are among the easiest ones to use.
Read the introduction to these databases. They are high quality protein (not nucleic acid)
sequence databases with abundant annotation, minimal redundancy, and many
connections to other databases.
Click New UniProt Website. The new (2008) home page of UniProt contains links to
information about the resource. Click to learn more about the site, and then return to this
page. Bookmark this page (UniProt Welcome) as a good starting point for future use of
UniProt, Swiss-Prot, or TrEMBL.
At the top of the page is a deceptively simple but powerful search tool. A menu lets you
choose among data sets to search. Take a look at the list on the menu, but return it to
Protein Knowledgebase (UniProtKB).
In the Query box, type opsin. Click Search. The search produces over 4000 entries, all
of which are protein entries that are opsins or include the word or fragment -opsin-.
Obviously, you need to be more specific.
Limit the search to human opsins, as follows. Click Fields, beside the Query box. The
Search area expands to include a logical operator menu (with default operator AND), a
Field menu, and a Term box. Under Field, pick Organism. In the Term box, start typing
human. As you type, the search tool helpfully shows you all allowed search terms that fit
what you have typed so far. As soon as human [9606] appears, click it to enter it in the
Term box, and click Add and Search.
Notice that the Query box now says "opsin AND organism: human [9606]". This shows
that you have limited your search to opsin-related entries that are also (AND) human
proteins. Notice also that the Fields link is available again, so that you could add
additional terms to your search, with logical operators AND, OR, and NOT to specify
how to use the additional terms. But the search is already specific enough to make our
task easy: there are only 25 results for this search.
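The effect of adding an AND term can be pictured with a toy filter over a few in-memory records. This sketch does not contact UniProt; the records and the search function are made up, and only the human taxonomy number 9606 comes from the search page described above.

```python
# Made-up records standing in for database entries.
entries = [
    {"name": "rhodopsin", "organism_id": 9606},
    {"name": "rhodopsin", "organism_id": 10090},  # some other organism
    {"name": "hemoglobin subunit beta", "organism_id": 9606},
    {"name": "peropsin", "organism_id": 9606},
]

def search(records, fragment, organism_id=None):
    """Keep records whose name contains fragment AND, optionally, match organism."""
    hits = [r for r in records if fragment.lower() in r["name"].lower()]
    if organism_id is not None:
        hits = [r for r in hits if r["organism_id"] == organism_id]
    return hits

print(len(search(entries, "opsin")))
print([r["name"] for r in search(entries, "opsin", organism_id=9606)])
```

Each AND term can only shrink the list, which is how a search returning thousands of entries narrows to a couple of dozen.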
Before looking at the results, look at the other Fields you can search. UniProt entries are
files that are divided into sections, called fields, each containing specific kinds of
information. You can limit searches to terms that reside in specific fields, or can simply
search for your query in entire entries.
Now look over the results. On 2008/09/19, this search gave 25 hits, including the rod
pigment rhodopsin (OPSD), along with the three cone pigments (OPSB, OPSG, OPSR).
There is also a "visual pigment-like receptor peropsin", OPSX, which still, more than ten
years after its discovery in the genome, is of unknown function. In the rest of this tutorial,
you will include this mysterious protein in your inquiries into the visual pigments of the
human retina.
Digression
Now you will digress briefly from the question of how these proteins are related
evolutionarily, and find out more about peropsin. In the process, you will glimpse the
wealth of information in, and linked to, a typical UniProt entry.
By the way, an accession number such as O14718 can be used as an input to almost any
ExPASy tool for analysis of the corresponding sequence.
You see the UniProtKB View of entry O14718 [note: that first character is capital letter
O, not zero (0)]. Peruse this entry and try to find out just what this rhodopsin-like protein
is thought to do. Under General annotation (Comments), you'll learn that it is found in
the retina (the RPE or retinal pigment epithelium), and that it may detect light, or perhaps
monitors levels of retinoids, the general class of compounds that are the actual light
absorbers in opsins. Also under Similarity in the same section, you see, as mentioned
earlier, that this protein is a member of the large family of G protein-coupled receptors
(GPCRs). If you click G-protein coupled receptor 1 family, you conduct a search for
members of this family—the result is about 10,000 hits in UniProt. Limit this search to
humans (about 1200 hits). Back on the O14718 page, click Opsin subfamily to find a list
of all purported members of this subfamily in UniProt (about 220). Limit the search to
humans (fewer than 20).
Under References find the journal citation, "Peropsin, a novel visual pigment-like protein
located in the apical microvilli of the retinal pigment epithelium.". Click the PubMed
link with that reference to see an abstract of the paper. On the abstract page, click one of
the Free Full Text Article links to obtain the full paper from either the journal (PNAS) or
from PubMed Central, which distributes many articles. Like many journals, PNAS puts
full articles online just 6 to 12 months after publication.
Return to O14718, and look around more on the entry page. You will find Cross-
references to this protein or its gene in other databases, predicted structural features of
the protein, and the sequence, which you can lift in FASTA format if you want to
search for more of its relatives. Note also links to a number of ExPASy tools listed for
further analysis of this sequence.
Try one of them: under Cross-references, find PROSITE, and click Graphical view.
You now have a form that allows you to search for signatures of function or functional
sites in peropsin. Leave all settings as they are, and click scan next to the graphical image
(green) of the protein. Here is another form, with the accession number O14718 already
entered. Again, leave all other settings as they are (but notice that there are many ways to
modify this search), and click START THE SCAN.
PROSITE finds three identifiable things about this sequence. One "hit by profile"
identifies peropsin as a G-protein coupled receptor. Two "hits by pattern" are shown. One
is a short sequence that also identifies peropsin as a GPCR, while the second hit identifies
a binding site for retinal. So PROSITE indicates that, like its visual opsin relatives,
peropsin also binds specifically to retinal, the visual pigment that we make from vitamin
A. Note also that, by similarity to other related proteins, PROSITE predicts the presence
of a disulfide bond, between residues 98 and 175.
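A "hit by pattern" is essentially a regular-expression match. The sketch below translates the common elements of PROSITE pattern syntax (x for any residue, square brackets for allowed residues, curly braces for excluded residues, parentheses for repeat counts) into a Python regular expression. The pattern and the sequence in the example are hypothetical, not the real opsin retinal-binding signature, and the translator ignores rarer syntax such as anchors inside brackets.

```python
import re

def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    pattern = pattern.rstrip(".")
    parts = []
    for element in pattern.split("-"):
        element = element.replace("{", "[^").replace("}", "]")  # exclusions
        element = element.replace("(", "{").replace(")", "}")   # repeat counts
        element = element.replace("x", ".")                     # any residue
        element = element.replace("<", "^").replace(">", "$")   # sequence ends
        parts.append(element)
    return "".join(parts)

toy_pattern = "[ST]-x(2)-{P}-K"  # hypothetical pattern for illustration
regex = prosite_to_regex(toy_pattern)
match = re.search(regex, "AAGSWLAKQ")  # made-up sequence
print(regex, match.span() if match else None)
```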
(Later, you will find out more about the three-dimensional structure of peropsin by
building a model of it. You will use a related protein of known structure as a template for
making this model. This process is called homology modeling.)
End of Digression
Next you will answer the main question of this section: how are the visual pigments (and
peropsin) related to each other? Apparently, they diverged from a common ancestral
opsin, but you can get a much clearer picture of which of these opsins came first, and
which are the most closely related. To answer this question, you will align all their
sequences (called a multiple sequence alignment) and then produce a little family tree.
UniProt provides easy access to ClustalW, which does multiple-sequence alignments in a
snap, as well as the information needed to print a phylogenetic tree from the alignment
information.
Return to the UniProt search results, with its 25 hits for entries from the human genome
that include the description "opsins". Your next task is to compare the sequences of
peropsin and four visual pigments. Start by clicking to put check marks in the left-hand
column of the results table, beside the first four entries (rhodopsin and the blue-, red-, and
green-sensitive opsins) and also in the row for peropsin, O14718. As you put in the first
check mark, a green band appears at the bottom of the window, providing a tool bar with
options for handling multiple sequences. After you have checked the entries as instructed,
click the Align button in the green tool bar. This is a request to use ClustalW to make a
multiple-sequence alignment using the selected entries.
The Clustalw results page appears. At the top, in the Sequences box, are FASTA-format
listings of all the sequences compared. Take a moment to edit this listing to make
subsequent alignments and trees easier to interpret. In the FASTA sequences listed in the
Sequences box, make the following changes:
After editing, click Align to redo the alignment with new headings.
To save this alignment in a form needed for the next section, click the orange TEXT
button to the right of Clustalw Results. Copy the text file that is displayed, paste it into a
new text file, and name it OpsinMSAEdited.txt. Now back up to the Clustalw Results
page.
Below the table that names each opsin with your new headings is the multiple sequence
alignment. In blocks of 60 residues, Clustalw has aligned five sequences. Below each
column of five residues, symbols indicate how closely the residues match across the five
proteins. "*" means all 5 aligned proteins have the same amino-acid residue in this
position (fully conserved residues, within this group); ":" means that all residues in this
position are very similar in size, charge, and polarity (replacements are very
conservative); "." means that they are sort of similar (somewhat conservative
replacements); and no symbol means that the residues in that position vary greatly in
properties (nonconserved residues). (What does each symbol suggest about the
importance of that residue to the function of this protein family?)
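To make the meaning of these symbols concrete, here is a minimal sketch that assigns a symbol to each column of a toy alignment. The "strong" and "weak" residue groups are the ones I recall from the ClustalW documentation; treat them as an assumption and check them against the program's own help before relying on them. The three toy sequences are made up.

```python
# Residue groups as recalled from the ClustalW documentation (an assumption).
STRONG_GROUPS = ["STA", "NEQK", "NHQK", "NDEQ", "QHRK", "MILV", "MILF", "HY", "FYW"]
WEAK_GROUPS = ["CSA", "ATV", "SAG", "STNK", "STPA", "SGND",
               "SNDEQK", "NDEQHK", "NEQHRK", "FVLIM", "HFY"]

def column_symbol(column):
    """Return '*', ':', '.', or ' ' for one column of aligned residues."""
    residues = set(column)
    if "-" in residues:                  # columns with gaps get no symbol
        return " "
    if len(residues) == 1:               # every sequence has the same residue
        return "*"
    if any(residues <= set(group) for group in STRONG_GROUPS):
        return ":"                       # very conservative replacements
    if any(residues <= set(group) for group in WEAK_GROUPS):
        return "."                       # somewhat conservative replacements
    return " "                           # nonconserved

def conservation_line(aligned_sequences):
    """Build the symbol line that sits below one block of the alignment."""
    return "".join(column_symbol("".join(col)) for col in zip(*aligned_sequences))

toy_alignment = ["MKSCA", "MRSSV", "MKTSL"]   # made-up sequences
for sequence in toy_alignment:
    print(sequence)
print(conservation_line(toy_alignment))
```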
At the bottom of the results page are several tool bars. Play with the first two to see what
they do. You will find that they modify the display of the multiple-sequence alignment to
highlight residue types or signatures of protein function. Using these tools, you can get a
general picture of similarities and differences among the proteins. But the comparison can
be made much more explicit by using it to make a phylogenetic tree for this group of
proteins. The last tool bar should provide a ClustalW tree, but as of 2008/09/20, clicking
show on this toolbar opened a blank space in which the tree never appeared (if it's fixed
now, send me email to let me know).
As you can see at the bottom, this page provides the information needed to print a tree,
and a tool at Indiana University can use that information. Unfortunately, this tree is
not a true phylogenetic tree (I still don't know about the one that is not displayed yet); it
is a simple tree that shows the order in which ClustalW carried out pairwise alignments as
it built the multiple-sequence alignment. It will show the pairs that are most closely
related to each other, but you must use a more powerful tree-generating program to obtain
a more rigorous tree.
NOTE: This type of ClustalW working tree file always has a .dnd suffix. For really good
phylogenetic trees, do not use .dnd files.
Anyway, we can use this tree just to learn how to print trees once you have a good one
from any source (next section). This procedure will work if you have tree data in Newick
format, which is true for the tree file provided on this page. Get the file you need to make
a tree by going to the top of the page and clicking the orange TREE button. Your browser
displays a very small text file, littered with parentheses. Copy and save this file as
ClustalwTreeData.txt. This is tree data in Newick format, a widely used format for tree-
printing programs. You will use the data in this file to print your first tree.
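All those parentheses have a simple meaning: each pair encloses a group of branches, names are leaves, and the number after a colon is a branch length. Here is a minimal sketch of a Newick reader, assuming plain unquoted names and no comments. The tree string is made up for illustration (it is not the opsin tree), and parse_newick and leaf_names are hypothetical helper names.

```python
def parse_newick(text):
    """Parse a simple Newick string into nested dicts with name, length, children."""
    text = "".join(text.split())          # tree files often wrap across lines
    if not text.endswith(";"):
        raise ValueError("a Newick tree must end with ';'")
    pos = 0

    def parse_node():
        nonlocal pos
        children = []
        if text[pos] == "(":              # an internal node: read its children
            pos += 1
            children.append(parse_node())
            while text[pos] == ",":
                pos += 1
                children.append(parse_node())
            if text[pos] != ")":
                raise ValueError(f"expected ')' at position {pos}")
            pos += 1
        start = pos                       # optional node name
        while text[pos] not in ",():;":
            pos += 1
        name = text[start:pos]
        length = None
        if text[pos] == ":":              # optional branch length
            pos += 1
            start = pos
            while text[pos] not in ",();":
                pos += 1
            length = float(text[start:pos])
        return {"name": name, "length": length, "children": children}

    root = parse_node()
    if text[pos] != ";":
        raise ValueError(f"unexpected character at position {pos}")
    return root

def leaf_names(node):
    """List the tips of the tree in the order they appear in the file."""
    if not node["children"]:
        return [node["name"]]
    names = []
    for child in node["children"]:
        names.extend(leaf_names(child))
    return names

toy_tree = "((A:0.1,B:0.2):0.05,C:0.3,\n(D:0.4,E:0.5):0.02);"   # made-up tree
print(leaf_names(parse_newick(toy_tree)))
```

A root with three children, as in this toy tree, is the usual way to write an unrooted tree in Newick format.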
Paste the contents of your ClustalwTreeData.txt into the Tree data box near the top of
the form. Type a title into the Title box, something like "Opsin Family Tree". To get a
tree that looks like mine (below), pick Phenogram from the Tree styles at the top. Then
under Extra Options, select Format: GIF image; width and height: 400 pixels; Font:
Helvetica; Style: plain; Size: 12. Leave all other settings as you found them, and click
Submit.
Your tree should appear in your browser. Save it as OpsinTree.gif. Be sure to remove ".cgi"
from the default name, so that your file will be recognizable as a normal GIF file. You
can paste these files into documents for reports and publications. Play around with other
options at Phylodendron, and see how they affect the tree image.
Like this tree, most trees produced by bioinformatics tools are unrooted trees; that is, the
tree shows distances, based on sequence differences, between the tips, but it does not
attempt to show the tips and branches in order of their appearance in time. Sequence-
comparison programs cannot figure out the order or direction of evolution. They can only
assess the magnitude of sequence differences. If you know which sequence is the
progenitor of all the others (we don't, in this case), you can root the tree with that
sequence. The result will be that the first branch will separate that sequence from the
others. The tree above happens to be rooted with peropsin, so it shows the first branch as
the divergence of peropsin from the progenitor of all the other opsins. More advanced
tree-building programs allow you to choose the root sequence for a tree, but remember
that sequence information alone will not tell you the root.
Beware!
The conclusions of the previous paragraph are based on examining this printed tree. We
will see later that this tree is very similar to a tree made by a more rigorous method. This
simply means that this particular tree is an easy one to determine. Most trees are not so
easy, and more rigorous methods will give results that are substantially different from
ClustalW's little working .dnd file.
Remember also that the truth of any conclusions drawn from a tree depends on the
accuracy of the multiple sequence alignment and on the alignment scores. In this tutorial,
you are using default settings on many hidden parameters in the processes of comparing
and aligning sequences. If you want to draw conclusions about phylogenetic relationships
that will hold up to scientific scrutiny, you need to learn much more about the inner
workings of alignment tools like Clustalw.
In the next section, you will make this tree two more times, using more rigorous tools for
calculating phylogenetic distances.
Bioinformatics Tutorial
Improved Relations
Preview
In this section, you will learn how to use a few tools from the phylogeny-analysis
program Phylip to make a phylogenetic tree by a more rigorous method, called neighbor
joining.
This is one home of the program Phylip, one of the most rigorous tools for constructing
phylogenetic trees from aligned sequences.
You are about to run protdist, a program that computes the "distance", or the quantitative
amount of difference, of protein sequences from each other. These so-called distance
matrices will be used by Phylip to construct your tree. The input to protdist is the
multiple-sequence alignment you made using Clustalw (file: OpsinMSAEdited.txt).
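To get a feeling for what a distance matrix is, here is a minimal sketch that computes the simplest possible distance: the fraction of aligned positions at which two sequences differ. This is NOT what protdist does; protdist corrects the raw differences with a model of amino-acid substitution, so its numbers will be different. The sequences and names below are made up.

```python
def p_distance(seq_a, seq_b):
    """Fraction of ungapped aligned positions at which two sequences differ."""
    compared = differences = 0
    for a, b in zip(seq_a, seq_b):
        if a == "-" or b == "-":          # skip columns where either has a gap
            continue
        compared += 1
        if a != b:
            differences += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return differences / compared

def distance_matrix(alignment):
    """Pairwise distances for a dict of name -> aligned sequence."""
    names = list(alignment)
    return {a: {b: p_distance(alignment[a], alignment[b]) for b in names}
            for a in names}

toy_alignment = {                         # made-up aligned sequences
    "seqA": "MKTAYIAK",
    "seqB": "MKTAYLAK",
    "seqC": "MRT-YLSK",
}
matrix = distance_matrix(toy_alignment)
for name, row in matrix.items():
    print(name, " ".join(f"{value:.3f}" for value in row.values()))
```

As in the protdist output, each row starts with a sequence name, and the diagonal is zero.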
Enter your email into the top box.
In the alignment file box, paste your edited multiple sequence alignment from ClustalW
(OpsinMSAEdited.txt).
On the results page, look in the outfile window to see the 100 matrices containing
numbers that represent the relative number of differences among the five sequences. Each
matrix has the sequence names in the first column, and you should imagine that these
sequence names are also the headings for the remaining columns. The number at the
intersection of the row Blue and the column with the imaginary heading Peropsin gives
the relative magnitude of the sequence differences between the blue opsin and peropsin.
The matrices have zeros on the diagonal because each pseudosequence is identical to
itself. Click the Save button to save the entire file of 100 matrices. The file is
automatically downloaded with the name protdist.outfile.txt. Transfer the file to a
convenient place.
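Why 100 matrices, and why "pseudosequences"? The usual answer is bootstrapping: the columns of the alignment are resampled at random, with replacement, to make many artificial alignments of the same length, and a distance matrix is computed from each one. Here is a minimal sketch of the resampling step only, assuming that the service follows this standard procedure. The function names and the toy alignment are mine, not Phylip's.

```python
import random

def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement, keeping columns intact."""
    length = len(next(iter(alignment.values())))
    columns = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(sequence[i] for i in columns)
            for name, sequence in alignment.items()}

def bootstrap_replicates(alignment, n_replicates=100, seed=0):
    """Make n_replicates pseudo-alignments from one real alignment."""
    rng = random.Random(seed)             # fixed seed so the sketch is repeatable
    return [bootstrap_alignment(alignment, rng) for _ in range(n_replicates)]

toy_alignment = {"seqA": "MKTAY", "seqB": "MRTAF"}   # made-up sequences
replicates = bootstrap_replicates(toy_alignment)
print(len(replicates), replicates[0])
```

Every column in a replicate is a column from the real alignment; only how often each column appears changes from one replicate to the next.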
Clicking the Back button of your browser from a results page takes you back to the
Phylogeny page. Under Distance Matrix Method programs, Phylip, click neighbor.
Read the lists carefully: don't pick "weighbor".
Into the Distance matrix File window, paste the contents of the file protdist.outfile.txt.
Under Bootstrap options make these settings:
This entry area gives you the option of designating an outgroup for the root of your tree.
An outgroup is the sequence you think is most distant from the others, possibly the
common ancestor of all. We don't know that in this case, so leave the default of 1.
On the results page, the Newick file you need to make the tree is neighbor.outtree.
Copy and save it as PhylipTreeData.txt.
By scrolling down in the consense.outfile window, you can see the consensus tree,
printed in a simple text format. This tree is listed as "unrooted", meaning that we do not
know the ancestor of all these sequences. We learn from this tree which sequences are
most alike and which are most different. We also learn how often the connections of this
tree were made the same way in the 100 trees made from those 100 difference matrices.
The numbers on the branches indicate the number of times that partition of the species
into the two sets separated by that branch occurred among the 100 trees. For example, the
separation of Red and Green from the other three, indicating that Red and Green are more
similar to each other than to the other three, occurred in all 100 trees. The separation of
Blue and Peropsin from the other three occurred in only 53 of the 100 trees. In the other
47 trees, Rhodopsin and Peropsin were separated from the other three. (Can you extract
this information from this file?) In the tree branching shown, the majority rules, and the
results of 47 of the trees are discarded.
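The counting behind those branch numbers can be sketched in a few lines. Each branch of an unrooted tree splits the sequences into two sets, and the support for a branch is the number of trees that contain the same split. The sketch below is not the consense program; it simply rebuilds the counts quoted above from a made-up list of 53 trees of one shape and 47 of the other, with each tree written by hand as its list of splits.

```python
from collections import Counter

TAXA = frozenset({"Red", "Green", "Blue", "Rhodopsin", "Peropsin"})

def canonical_split(side, taxa=TAXA):
    """Name a split by its smaller side, so both sides map to the same key."""
    side = frozenset(side)
    other = frozenset(taxa) - side
    return min(side, other, key=lambda s: (len(s), sorted(s)))

def split_support(trees, taxa=TAXA):
    """Count how many trees contain each split."""
    counts = Counter()
    for splits in trees:
        counts.update({canonical_split(split, taxa) for split in splits})
    return counts

# Hand-written stand-ins for the 100 bootstrap trees described in the text.
trees = (
    53 * [[{"Red", "Green"}, {"Blue", "Peropsin"}]]
    + 47 * [[{"Red", "Green"}, {"Rhodopsin", "Peropsin"}]]
)
support = split_support(trees)
for split, count in support.most_common():
    print(sorted(split), count)
```

A consensus tree keeps the splits that occur in the majority of trees, which is why the 47 trees with the other arrangement do not show up in the final drawing.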
Note: Your results may be slightly different from mine. Because of the random choices
made in constructing the tree, the percentages in the paragraph above may vary. I have
gotten as high as 82% consensus on the separation of Blue and Peropsin from the other
three.
Here is my tree:
Interpreting a tree is not as simple as interpreting the types of trees that you see in
textbooks. The Phylip tree appears to say that the divergence of Blue from Rhodopsin
came before the divergence of Rhodopsin from Peropsin. But remember that this tree is
unrooted; we did not specify which protein we think is the progenitor of the others. The
tree-printing program automatically puts a little root on the tree, but that line is not
necessarily the beginning or "bottom" of the tree. We can start from any branch and read
the tree as if that were the first branching event on the tree. What the tree does tell us is
which sequences are the most similar. Clearly, Red and Green are the most similar pair,
and Blue is more similar to rhodopsin than it is to peropsin.
For making a multiple-sequence alignment with Tcoffee, you need raw FASTA files. To
get them,
• Return to UniProt, and repeat your search for human opsins.
• Select the four visual opsins plus peropsin, and click Retrieve at the bottom of the
page.
• Click Open under FASTA on the UniProt Jobs page.
• Select the text that appears. You might want to save it into a text file, but you can
just paste it into Tcoffee directly. The file you have here is simply the five opsin
sequences, one after another, in FASTA format, which is just what Tcoffee needs.
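If you save the FASTA text and later want to check what is in it, a few lines of Python will do. This is a minimal sketch of a multi-sequence FASTA reader; the two records shown are made up (real UniProt headers and sequences are much longer), and parse_fasta is a hypothetical helper name.

```python
def parse_fasta(text):
    """Read FASTA text into a dict of identifier -> sequence."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):          # comment line: starts a new record
            fields = line[1:].split()
            if not fields:
                raise ValueError("empty FASTA header")
            name = fields[0]              # keep only the first word as identifier
            if name in records:
                raise ValueError(f"duplicate identifier: {name}")
            records[name] = []
        elif name is None:
            raise ValueError("sequence data before the first '>' header")
        else:
            records[name].append(line)    # sequence lines may wrap
    return {n: "".join(chunks) for n, chunks in records.items()}

toy_fasta = """>toy1 made-up example record
MKTAYIAK
QRLG
>toy2 another made-up record
MRTAFLSK
"""
sequences = parse_fasta(toy_fasta)
for identifier, sequence in sequences.items():
    print(identifier, len(sequence))
```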
Paste your FASTA data into the space provided. Enter your email address. Click Submit.
That's all there is to it. After what is usually a short delay, a results page appears. It
provides links to your multiple-sequence alignment in several formats. (You might find it
interesting to compare the alignment from Tcoffee to the one you got from ClustalW. This
is easiest to do with the Tcoffee file clustalw_aln.) The file you want for producing a tree
is labeled phylip, which provides the alignment in Phylip format, which is needed for
PhyML. Click phylip to see this file, select all the text displayed, and copy it. Paste it
into a text file (5Opsins4PhyML.txt).
PhyML uses maximum-likelihood methods, which are based on very powerful (but
obscure) statistics, to calculate the tree under which the observed alignment is most
probable, given a model of how sequences change over time. Maximum-likelihood
methods are among the most highly respected means of making decisions when you must
navigate a minefield of probability-based choices to arrive at either a single best
decision, or a small group of similar good ones (X-ray crystallographers use it to decide
which data to use, and which to exclude, when trying to build a model of a protein from
diffraction data). As the availability of such methods has grown, so has the number of
people for whom they are completely black boxes. When you use a black-box method,
you must be careful to compare the results with everything else you know about the
subject. A surprising result might be a genuine discovery, or it might be just wrong. It is a
result to test further, not to accept blindly.
Sequences: File; then click Choose File, and choose the phylip file you saved from the
Tcoffee output.
Number of bootstrap data sets: 100 (do not click Print bootstrap info.)
Bioinformatics Tutorial
Seeking Structure
Preview
In this section, you will learn how to use a FASTA sequence as a search input (query) to
the Protein Data Bank, the repository of almost all protein models that have been
deduced by X-ray crystallography or NMR. Your search will tell you whether anyone has
produced an experimental model of your query protein, or whether models are available
for any protein of similar sequence. You will also visualize the model using an online
graphics tool. Finally, you will learn how to turn a long list of hits into an interactive
Custom Report that makes details of each hit easy to find.
In fact, the PDB does not contain molecular structures at all. It is better to say that it
contains models of macromolecules. These models are interpretations of data from one of
the two main methods of macromolecular structure determination: X-ray
crystallography and NMR spectroscopy. When researchers make a model, or as they
commonly say, "determine the structure" of a macromolecule, they deposit a file
containing the three-dimensional coordinates of all the atoms in the model. This
coordinate file, along with an online molecular graphics tool (like the PDB's Jmol
Viewer) or a computer graphics program like DeepView, is all that you need to see
and study the model on your computer. Next you will retrieve a model from the PDB and
view it with an online graphics tool. You will also visit the home of a topnotch computer
graphics program that you can download FREE and use on your home computer.
The PDB home page contains a simple search box at the top. You can search for models
using simple keywords or PDB ID codes. A PDB code has four characters, like 1CYO.
How would you ever know a model by its code? When a new structure is published, the
authors usually give the PDB code in the last reference of the bibliography. With that
code, you can go straight to the model you want to see. But more often, your question,
like ours, is more general. For such cases, PDB also provides forms for more
sophisticated searches. For now, let's just see if any opsin models are available. Type
"opsin" into the search box, make sure the PDB ID or keyword is selected, and click
Site Search.
On 2008/09/22, this search returned only one model, which is quite puzzling, because a
search for "rhodopsin" returns 48 models. So it appears that the quicky (quirky?) search
tool at the PDB still needs some work. But this shortcoming is a gift for now. You have
bagged an experimental model of an opsin; the PDB contains only models derived
experimentally—either by x-ray crystallography or NMR spectroscopy. Now take a look
at this one.
Click the PDB file code 3CAP above the tiny image of the model.
You have come to the Structure Summary page for this model, which is its home page
at the PDB. This page is connected to just about everything you could possibly do with
this model. At the PDB, your first goal is always to get to the Structure Summary page
for the model you are seeking.
NOTE: Structure Summary does not exactly jump out at you on this page. It's the tab
selected over the main part of the entry, and it is a sub-tab of the Structure tab above the
left column. Those tabs should be more prominent—they are what distinguishes each of
the important pages in the PDB. If you want to know where you are in the PDB, look at
the two sets of tabs at the top of the page. The set on the left are main tabs, and the set on
the right are sub tabs of the main tabs. Main tabs take you to PDB's major sets of tools,
and sub tabs subdivide them. Sub tabs under the Structure tab open LOTS of additional
information about the currently chosen model.
In the left column of all PDB pages, you find a set of nested menus (they might vary on
different PDB pages). Click Display Molecule to open the PDB display options. If you
already own or use one of the listed viewers, like the free program DeepView, you are in
business. Click your viewer to download the model and view it in a familiar environment.
But first behave as if you are new to all this (perhaps you are), and use a handy viewer
that works in your browser.
Click Jmol Viewer. Assuming that your computer has up-to-date Java software, your
browser will load the viewer, and it will load the file 3CAP. You should see models of
two rhodopsin molecules—with backbones shown as ribbon-like cartoons, one green, one
blue—and several ball-and-stick models of smaller molecules. Is rhodopsin a dimer? No,
but the crystals of rhodopsin from which this model was derived contained two
rhodopsin molecules per asymmetric unit (the smallest portion from which the entire
unit cell of the crystal can be constructed). PDB files usually show the full contents of the
asymmetric unit. If more than one molecule is present, they are referred to as chains in
the model.
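You can see the chains for yourself in the coordinate file, because every atom line carries a one-letter chain identifier. Here is a minimal sketch that reads ATOM lines using the fixed column positions of the classic PDB text format (atom name in columns 13-16, residue name in 18-20, chain in 22, residue number in 23-26, x, y, z in 31-54). The two atom lines are made up with a hypothetical helper, demo_atom_line; they are not taken from 3CAP.

```python
def parse_atom_line(line):
    """Pull the main fields out of one fixed-width ATOM line."""
    return {
        "atom": line[12:16].strip(),
        "residue": line[17:20].strip(),
        "chain": line[21],
        "residue_number": int(line[22:26]),
        "xyz": (float(line[30:38]), float(line[38:46]), float(line[46:54])),
    }

def chains_in(pdb_text):
    """List chain identifiers in order of first appearance among ATOM lines."""
    chains = []
    for line in pdb_text.splitlines():
        if line.startswith("ATOM"):       # HETATM lines (small molecules) are skipped
            chain = parse_atom_line(line)["chain"]
            if chain not in chains:
                chains.append(chain)
    return chains

def demo_atom_line(serial, atom, residue, chain, number, x, y, z):
    """Hypothetical helper: build a made-up ATOM line with correct column widths."""
    return (f"ATOM  {serial:>5} {atom:<4} {residue:>3} {chain}{number:>4}    "
            f"{x:8.3f}{y:8.3f}{z:8.3f}")

toy_pdb = "\n".join([
    demo_atom_line(1, " CA ", "MET", "A", 1, 1.0, 2.0, 3.0),   # made-up coordinates
    demo_atom_line(2, " CA ", "MET", "B", 1, 4.0, 5.0, 6.0),   # made-up coordinates
])
print(chains_in(toy_pdb))
print(parse_atom_line(toy_pdb.splitlines()[0]))
```

To try it on a real model, download the file in PDB format from the Structure Summary page and pass its text to chains_in.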
NOTE ON VIEWERS: The viewer embedded in the viewing frame of this page is the
widely used Jmol, which you will find in use as a molecular viewer at many web sites. If
you take time to get to know this viewer fairly well, you will get more out of the many
sites that use it. Like most of the other viewers listed at PDB, Jmol is quite limited in its
capacity for analysis of protein structure.
Here are some other things you can do to get to know models in a Jmol frame (to get
back to the original rendition, reload the page):
• Click/drag (left button if you have more than one) on the image to rotate the
structure. You should be able to tell that it has a lot of alpha helix.
• Hold down option (for Macintosh; alt for Windows) and click/drag to zoom in
(drag towards you) or out (drag away) or to rotate the model in the plane of the
screen (drag left or right).
• Hold down ctrl and click (or right-click) the image: up pops a set of menus, and if you
browse around on them, you'll see that there is much more to Jmol. Try just a
couple of things to get some general ideas, as follows.
• Using the pop-up menus, Select:Protein:All. EXPLANATION: This means to
slide to Select on the main pop-up menu, then on its submenu, slide to Protein,
then on its submenu, slide to All. On my Macintosh computer, if I right-click a
menu or submenu item, its submenu locks on display, and it's easier to navigate.
Nothing appears to happen. You have selected part of the model (the protein part,
but not the small molecules). Subsequent commands will change only this aspect
of the display.
• Color:Structure:Cartoon:By Scheme:Secondary Structure
The cartoons become red (well, bright pink) for alpha helix, and yellow for beta
sheet. You probably had not noticed the beta sheet in the models before. Look one
of the chains over carefully to get a feeling for its structure. How many helices are
present? How many strands of beta sheet? Are the strands parallel or antiparallel?
• Do you know how to view stereo pairs? (If not, click HERE to learn how.) Then
Style:Stereographic:(choose your favorite mode, cross-eyed or wall eyed
viewing). NOTE: As of 2008/09/22, you get the opposite of what you pick. Despite
my attempts to inform the programmers about this, cross-eyed viewing gives you
wall-eyed, and vice versa. But anyhow, now you can see the model as a solid
object with convincing depth. If you are ever going to do anything serious with
protein structure, you'll need to find a way to view them in 3D.
• Work in stereo or not, as you prefer. Clear the display: Select:None; then
Select:Display Selected Only. The display goes blank; nothing is selected and
you are displaying only the selection (very logical!).
• Select:Protein:All (means select both backbone and sidechains). Then
Style:Scheme:CPK Spacefilling. The protein portion is now shown as a
spacefilling model. In this rendition, you get a good idea of the overall shape of
the protein. Unfortunately, the Jmol menu does not allow you to color the two
chains separately or get rid of one of them.
• Style:Scheme:Wireframe. Now you see all of the protein parts of this model in
wireframe. This is not as impressive as some other schemes, but is actually the
most useful when you start exploring models in detail, because the wires do not
hide each other like ball and sticks or spacefilling models.
To learn more about Jmol, consult the help links at PDB below the display. You can also
find extensive help for all viewers listed there. But if you plan serious protein structure
work, especially judging model quality and comparing models by superimposing them,
get to know DeepView.
Next, you will try to find other models in the PDB that are homologous to the human
opsins. You will ask the PDB, in effect, to "list all models whose sequences can be
aligned with that of human red opsin, in order of sequence similarity." In PDB
terminology, the red opsin sequence is the query, and similar models found (hits) are
called subjects.
First, open your query file protred.txt (FASTA sequence of red human opsin), and copy
the sequence portion only to the clipboard; omit all of the comment line that begins with
>.
At the top right of any PDB page, click Search. From the list of search types, click
Sequence. On the resulting page, click the button next to use Sequence, and paste your
red opsin sequence into the box just below. Note that the search tool is your new friend
Blast, and that an E cut-off value of 10 is given as a default. From what you learned
earlier, you know that this is not a very restrictive search criterion, so your search should
pick up anything remotely similar in sequence to the red human opsin. Click the search
button. The search tool is now looking for PDB models whose sequences are similar to
the human red opsin sequence. Hits in UniProt are just other proteins, most of whose
structures are not known. Hits in the PDB are models, so hits tell you that there are
experimental models for one or more proteins that are similar in sequence to your query.
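As a reminder of why an E cut-off of 10 is so permissive, here is a minimal sketch of the standard relation between a Blast bit score and its E value: E = m * n * 2^(-S), where S is the bit score, m is the length of the query, and n is the total length of the database searched. The lengths below are made-up round numbers, not the real sizes of red opsin or the PDB, and real Blast uses slightly shortened "effective" lengths.

```python
def expect_value(bit_score, query_length, database_length):
    """Expected number of chance hits scoring at least bit_score."""
    return query_length * database_length * 2.0 ** (-bit_score)

QUERY_LENGTH = 350            # made-up query length, in residues
DATABASE_LENGTH = 10_000_000  # made-up database size, in residues

for bits in (20, 30, 50, 100, 250):
    e_value = expect_value(bits, QUERY_LENGTH, DATABASE_LENGTH)
    print(f"bit score {bits:>3}: E = {e_value:.3g}")
```

Each additional bit of score halves the E value, so a strong alignment between true relatives drives E toward zero very quickly, while a cut-off of 10 admits hits that you would expect to see by chance alone.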
0.0000000000000000000000000000000000000000000000000000000000000000000000
00062,
which means, to any sane biologist, that these two molecules descended from a common
ancestor. There is no chance that, in the history of the universe, two proteins could arrive
at sequences this similar by chance. This also means that the structure of the bovine
rhodopsin is a sure bet to be very similar to that of the human red opsin, whose structure
is unknown (if it were known, this search would have found it).
Now look down the list of the models you found. Most are models of the same substance:
bovine rhodopsin (lumirhodopsin, bathorhodopsin, and some others are altered forms that
represent rhodopsin in different stages of the visual cycle, but notice that all of these
come from Bos taurus, from which the good old barnyard cow got the name Bossy). A few
hits are the recently published beta-2-adrenergic receptor, the first G protein coupled
receptor model besides rhodopsin. Perhaps by the time you take this tutorial, there will be
more.
Use the results page to answer these questions about the comparison between human red
opsin and the bovine rhodopsin in PDB 1F88:
1. How many corresponding residues, and what percent of the residues, do the two
proteins have in common (exact matches)?
2. How many and what percent of corresponding residues are similar in chemical
properties?
3. How many gaps did the alignment program introduce, and how many residues in
each gap, to get best alignment between human red opsin and 1F88?
4. Find the longest string of exact matches between the two proteins. How many
matches does it contain, and what are the beginning and ending residue numbers?
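The results page reports most of these numbers for you, but it helps to see how they are counted. Here is a minimal sketch that works on any pairwise alignment written as two equal-length strings with "-" for gaps. It follows the Blast convention of dividing identities by the full alignment length, and it does not count "similar" residues (question 2), which requires a substitution matrix. The two aligned strings are made up, and positions are alignment columns, not residue numbers.

```python
from itertools import groupby

def gap_runs(aligned):
    """Lengths of each run of consecutive gap characters in one sequence."""
    return [len(list(group))
            for is_gap, group in groupby(aligned, key=lambda c: c == "-")
            if is_gap]

def longest_identical_run(query, subject):
    """Length and 0-based start column of the longest run of exact matches."""
    best_length, best_start, run = 0, None, 0
    for i, (q, s) in enumerate(zip(query, subject)):
        if q == s and q != "-":
            run += 1
            if run > best_length:
                best_length, best_start = run, i - run + 1
        else:
            run = 0
    return best_length, best_start

def alignment_stats(query, subject):
    """Identities, percent identity, gap runs, and longest exact-match run."""
    if len(query) != len(subject):
        raise ValueError("aligned sequences must have the same length")
    identities = sum(1 for q, s in zip(query, subject) if q == s and q != "-")
    return {
        "identities": identities,
        "percent_identity": 100.0 * identities / len(query),
        "gaps": gap_runs(query) + gap_runs(subject),
        "longest_run": longest_identical_run(query, subject),
    }

toy_query = "MKT-AYIAKQR"     # made-up aligned sequences
toy_subject = "MKTLAYLAK-R"
stats = alignment_stats(toy_query, toy_subject)
print(stats)
```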
Results pages are difficult to deal with if you want to look around on a long (anything
more than 10) list of subjects (hits). To make a display that is easier to navigate, in the
left column, click Tabulate, and then Custom Report. You can use this Custom Tabular
Report form to generate a list of your subjects that includes any features of interest. For
now, you will generate a very simple list, but you will quickly see its power.
On the form, click to put checkmarks in these boxes: Descriptor (under Structure
Summary), and Source (under Biological Details). Then click Create Report at the
bottom of the form.
The custom report appears, with three columns, PDB ID code, model descriptor, and
biological source of the protein. The form contains many clickable items. Clicking an ID
code takes you to the Structure Summary page for that model. Clicking a column heading
sorts the list on that heading. Try this by clicking Source above the third column. Then
look down the Source column. This makes it easy to find the non-Bos taurus entries,
which include that adrenergic receptor. Anything else?
Now you know how to search the PDB for models whose sequences are similar to a
target or query sequence. Structural biologists use such searches when they have a new
protein sequence and want to know its structure. If the structure is known, this search
would find it, so if you are interested in the structure of a particular gene product,
search PDB with its sequence to see if the structure is already known. If not, any hits
with high sequence similarity can tell you the overall fold of the protein. You also got a
glimpse of the Custom Report tool, which can make it easy for you to organize and
peruse a large number of hits from any search.
NEXT
Bioinformatics Tutorial
Seeking Structure
Preview
In this section, you will learn how to use a FASTA sequence as a search input (query) to
the Protein Data Bank, the repository of almost all protein models that have been
deduced by X-ray crystallography or NMR. Your search will tell you whether anyone has
produced an experimental model of your query protein, or whether models are available
for any protein of similar sequence. You will also visualize the model using an online
graphics tool. Finally, you will learn how turn a long list of hits into an interactive
Custom Report that makes details of each hit easy to find.
In fact, the PDB does not contain molecular structures at all. Is is better to say that it
contains models of macromolecules. These models are interpretations of data from one of
the two main methods of macromolecular structure determination: X-ray
crystallography and NMR spectroscopy. When researchers make a model, or as they
commonly say, "determine the structure" of a macromolecule, they deposit a file
containing the three-dimensional coordinates of all the atoms in the model. This
coordinate file—along with an online molecular graphics tool (like the PDB's Jmol
Viewer) or a computer graphics program like DeepView—are all that you need to see
and study the model on your computer. Next you will retrieve a model from the PDB and
view it with an online graphics tool. You will also visit the home of a topnotch computer
graphics program that you can download FREE and use on your home computer.
The PDB home page contains a simple search box at the top. You can search for models
using simple keywords or PDB ID codes. An PDB code has four characters, like 1CYO.
How would you ever know a model by its code? When a new structure is published, the
authors usually give the PDB code in the last reference of the bibiography. With that
code, you can go straight to the model you want to see. But more often, your question,
like ours, is more general. For such cases, PDB also provides forms for more
sophisticated searches. For now, let's just see if any opsin models are availalble. Type
"opsin" into the search box, make sure the PDB ID or keyword is selected, and click
Site Search.
On 2008/09/22, this search returned only one model, which is quite puzzling, because a
search for "rhodopsin" returns 48 models. So it appears that the quicky (quirky?) search
tool at the PDB still needs some work. But this shortcoming is a gift for now. You have
bagged an experimental model of an opsin; the PDB contains only models derived
experimentally—either by x-ray crystallography or NMR spectroscopy. Now take a look
at this one.
Click the PDB file code 3CAP above the tiny image of the model.
You have come to the Structure Summary page for this model, which is its home page
at the PDB. This page is connected to just about everything you could possible do with
this model. At the PDB, your first goal is always to get to the Structure Summary page
for the model you are seeking.
NOTE: Structure Summary does not exactly jump out at you on this page. It's the tab
selected over the main part of the entry, and it is a sub-tab of the Structure tab above the
left column. Those tabs should be more prominent—they are what distinguishes each of
the important pages in the PDB. If you want to know where you are in the PDB, look at
the two sets of tabs at the top of the page. The set on the left are main tabs, and the set on
the right are sub tabs of the main tabs. Main tabs take you to PDB's major sets of tools,
and sub tabs subdivide them. Sub tabs under the Structure tab open LOTS of additional
information about the currently chosen model.
In the left column of all PDB pages, you find a set of nested menus (they might vary on
different PDB pages). Click Display Molecule to open the PDB display options. If you
already own or use one of the listed viewers, like the free program DeepView, you are in
business. Click your viewer to download the model and view it in a familiar environment.
But first behave as if you are new to all this (perhaps you are), and use a handy viewer
that works in your browser.
Click Jmol Viewer. Assuming that your computer has up-to-date Java software, your
browser will load the viewer, and it will load the file 3CAP. You should see models of
two rhodopsin molecules—with backbones shown as ribbon-like cartoons, one green, one
blue—and several ball-and-stick models of smaller molecules. Is rhodopsin a dimer? No,
but the crystals of rhodopsin from which this model was derived contained two
rhodopsin molecules per asymmetric unit (the smallest portion from which the entire
unit cell of the crystal can be constructed). PDB files usually show the full contents of the
asymmetric unit. If more than one molecule is present, they are referred to as chains in
the model.
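If you download the PDB file itself, you can check the chain count without a viewer. Here is a minimal Python sketch, assuming the classic fixed-column PDB text format, in which ATOM records carry the chain identifier in column 22. The function name and the tiny sample are made up for illustration; the sample is not real 3CAP data.

```python
def chains_in_pdb(pdb_text):
    """Return chain IDs in order of first appearance in ATOM records."""
    chains = []
    for line in pdb_text.splitlines():
        # Column 22 (string index 21) holds the chain ID in the
        # fixed-column PDB format. HETATM records (waters, ligands)
        # are skipped so that only polymer chains are counted.
        if line.startswith("ATOM") and len(line) > 21:
            chain_id = line[21]
            if chain_id not in chains:
                chains.append(chain_id)
    return chains


# Made-up fragment laid out like a PDB file (hypothetical coordinates).
sample = "\n".join([
    "ATOM      1  N   MET A   1      10.000  10.000  10.000  1.00 20.00           N",
    "ATOM      2  CA  MET A   1      11.000  10.000  10.000  1.00 20.00           C",
    "ATOM      3  N   MET B   1      30.000  30.000  30.000  1.00 20.00           N",
    "HETATM    4  O   HOH A 401      15.000  15.000  15.000  1.00 30.00           O",
])

print(chains_in_pdb(sample))
```

To try it on a real entry, read the downloaded file into a string and pass it to chains_in_pdb. For a model with two molecules in the asymmetric unit, you would expect two chain IDs.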
NOTE ON VIEWERS: The viewer embedded in the viewing frame of this page is the
widely used Jmol, which you will find in use as a molecular viewer at many web sites. If
you take time to get to know this viewer fairly well, you will get more out of the many
sites that use it. Like most of the other viewers listed at PDB, Jmol is quite limited in its
capacity for analysis of protein structure.
Here are some other things you can do to get to know models in a Jmol frame (to get
back to the original rendition, reload the page):
• Click/drag (left button if you have more than one) on the image to rotate the
structure. You should be able to tell that it has a lot of alpha helix.
• Hold down option (for Macintosh; alt for Windows) and click/drag to zoom in
(drag towards you) or out (drag away) or to rotate the model in the plane of the
screen (drag left or right).
• Hold down ctrl and click (or right-click) the image: up pops a set of menus, and if you
browse around on them, you'll see that there is much more to Jmol. Try just a
couple of things to get some general ideas, as follows.
• Using the pop-up menus, Select:Protein:All. EXPLANATION: This means to
slide to Select on the main pop-up menu, then on its submenu, slide to Protein,
then on its submenu, slide to All. On my Macintosh computer, if I right-click a
menu or submenu item, its submenu locks on display, and it's easier to navigate.
Nothing appears to happen. You have selected part of the model (the protein part,
but not the small molecules). Subsequent commands will change only this aspect
of the display.
• Color:Structure:Cartoon:By Scheme:Secondary Structure
The cartoons become red (well, bright pink) for alpha helix, and yellow for beta
sheet. You probably had not noticed the beta sheet in the models before. Look one
of the chains over carefully to get a feeling for its structure. How many helices are
present? How many strands of beta sheet? Are the strands parallel or antiparallel?
• Do you know how to view stereo pairs? (If not, click HERE to learn how.) Then
Style:Stereographic:(choose your favorite mode, cross-eyed or wall-eyed
viewing). NOTE: As of 2008/09/22, you get the opposite of what you pick. Despite
my attempts to inform the programmers about this, cross-eyed viewing gives you
wall-eyed, and vice versa. But anyhow, now you can see the model as a solid
object with convincing depth. If you are ever going to do anything serious with
protein structure, you'll need to find a way to view models in 3D.
• Work in stereo or not, as you prefer. Clear the display: Select:None; then
Select:Display Selected Only. The display goes blank; nothing is selected and
you are displaying only the selection (very logical!).
• Select:Protein:All (means select both backbone and sidechains). Then
Style:Scheme:CPK Spacefilling. The protein portion is now shown as a
spacefilling model. In this rendition, you get a good idea of the overall shape of
the protein. Unfortunately, the Jmol menu does not allow you to color the two
chains separately or get rid of one of them.
• Style:Scheme:Wireframe. Now you see all of the protein parts of this model in
wireframe. This is not as impressive as some other schemes, but is actually the
most useful when you start exploring models in detail, because the wires do not
hide each other as ball-and-stick or spacefilling models do.
To learn more about Jmol, consult the help links at PDB below the display. You can also
find extensive help for all viewers listed there. But if you plan serious protein structure
work, especially judging model quality and comparing models by superimposing them,
get to know DeepView.
Next, you will try to find other models in the PDB that are homologous to the human
opsins. You will ask the PDB, in effect, to "list all models whose sequences can be
aligned with that of human red opsin, in order of sequence similarity." In PDB
terminology, the red opsin sequence is the query, and similar models found (hits) are
called subjects.
First, open your query file protred.txt (FASTA sequence of red human opsin), and copy
the sequence portion only to the clipboard; omit all of the comment line that begins with
>.
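If you would rather not trim the file by hand, the same step can be done in a few lines of Python. This is only a sketch: the function name is my own, and the example record is a made-up placeholder, not the contents of protred.txt.

```python
def fasta_sequence_only(fasta_text):
    """Drop the '>' comment line(s) and join the sequence lines."""
    lines = fasta_text.strip().splitlines()
    return "".join(
        line.strip() for line in lines if not line.startswith(">")
    )


# Hypothetical single-record FASTA text (placeholder residues).
example = ">placeholder_id made-up description\nMKTAYIAK\nQRQISFVK\n"

print(fasta_sequence_only(example))  # MKTAYIAKQRQISFVK
```

The result is one unbroken string of residue letters, which is what you paste into the search box.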
At the top right of any PDB page, click Search. From the list of search types, click
Sequence. On the resulting page, click the button next to use Sequence, and paste your
red opsin sequence into the box just below. Note that the search tool is your new friend
Blast, and that an E cut-off value of 10 is given as a default. From what you learned
earlier, you know that this is not a very restrictive search criterion, so your search should
pick up anything remotely similar in sequence to the red human opsin. Click the search
button. The search tool is now looking for PDB models whose sequences are similar to
the human red opsin sequence. Hits in UniProt are just other proteins, most of whose
structures are not known. Hits in the PDB are models, so hits tell you that there are
experimental models for one or more proteins that are similar in sequence to your query.
On 2008/09/22, I got 26 subjects, or 26 PDB models whose sequences are homologous to
the search sequence. Each is listed with an E-value, which is the number of hits this good
that you would expect to find by chance in a database of this size. When the E-value is
very small, it is essentially the probability that the
sequence similarity between query and subject is a coincidence. The first result or subject
is PDB model 1F88, a model of bovine rhodopsin. The E-value is 6.2 x 10^-74. In other
words, while the probability that a coin flip and your call will agree just by chance is 0.5,
the probability that the similarity between human red opsin and bovine rhodopsin is just a
chance occurrence is
0.0000000000000000000000000000000000000000000000000000000000000000000000
00062,
which means, to any sane biologist, that these two molecules descended from a common
ancestor. There is no chance that, in the history of the universe, two proteins could arrive
at sequences this similar by chance. This also means that the structure of the bovine
rhodopsin is a sure bet to be very similar to that of the human red opsin, whose structure
is unknown (if it were known, this search would have found it).
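A note for the curious: strictly speaking, a BLAST E-value is an expected number of chance hits, not a probability. Under the usual assumption that chance hits follow a Poisson distribution, the probability of at least one chance hit is P = 1 - e^(-E). The sketch below (the function name is my own) shows that for a tiny E-value the two numbers are practically identical, while for the default cut-off of 10 they are not.

```python
import math


def p_value_from_e_value(e_value):
    """Probability of at least one chance hit, assuming Poisson statistics."""
    # -expm1(-E) equals 1 - exp(-E) but stays accurate when E is tiny,
    # where the naive formula would round to exactly 0.0.
    return -math.expm1(-e_value)


print(p_value_from_e_value(6.2e-74))  # essentially equal to E itself
print(p_value_from_e_value(10))       # very close to 1
```

So an E-value of 10 means that a chance hit is almost certain, which is why that cut-off is not restrictive.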
Now look down the list of the models you found. Most are models of the same substance:
bovine rhodopsin (lumirhodopsin, bathorhodopsin, and some others are altered forms that
represent rhodopsin in different stages of the visual cycle). Notice that all of these
come from Bos taurus, from which the good old barnyard cow got the name Bossy. A few
hits are the recently published beta-2-adrenergic receptor, the first G protein-coupled
receptor model besides rhodopsin. Perhaps by the time you take this tutorial, there will be
more.
Use the results page to answer these questions about the comparison between human red
opsin and the bovine rhodopsin in PDB 1F88:
1. How many corresponding residues, and what percent of the residues, do the two
proteins have in common (exact matches)?
2. How many and what percent of corresponding residues are similar in chemical
properties?
3. How many gaps did the alignment program introduce, and how many residues in
each gap, to get best alignment between human red opsin and 1F88?
4. Find the longest string of exact matches between the two proteins. How many
matches does it contain, and what are the beginning and ending residue numbers?
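Once you have answered these questions by eye, you can check your counting with a short script. The sketch below assumes you have copied the two aligned rows (query and subject, with "-" marking gaps) into equal-length strings. The function name and the example rows are made up. It does not score chemically similar residues (question 2), because that requires a substitution matrix. It reports positions as alignment columns, so you still need the start numbers from the results page to convert them to residue numbers.

```python
import re


def alignment_stats(query_row, subject_row):
    """Count identities, gaps, and the longest run of exact matches."""
    if not query_row or len(query_row) != len(subject_row):
        raise ValueError("aligned rows must be non-empty and equal in length")
    identities = 0
    best_len, best_start, run_len = 0, 0, 0
    for col, (q, s) in enumerate(zip(query_row, subject_row), start=1):
        if q == s and q != "-":
            identities += 1
            run_len += 1
            if run_len > best_len:
                best_len = run_len
                best_start = col - run_len + 1
        else:
            run_len = 0  # a mismatch or gap ends the current run
    return {
        "identities": identities,
        # Percent is taken over the alignment length, gaps included.
        "percent_identity": 100.0 * identities / len(query_row),
        # One list entry per gap; each entry is that gap's length.
        "query_gaps": [len(m.group()) for m in re.finditer(r"-+", query_row)],
        "subject_gaps": [len(m.group()) for m in re.finditer(r"-+", subject_row)],
        "longest_run": best_len,
        "longest_run_columns": (best_start, best_start + best_len - 1),
    }


# Made-up aligned rows, for illustration only.
stats = alignment_stats("MKT-AYIAKQR", "MKTWAYLAK--")
print(stats)
```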
Results pages are difficult to deal with if you want to look around on a long (anything
more than 10) list of subjects (hits). To make a display that is easier to navigate, in the
left column, click Tabulate, and then Custom Report. You can use this Custom Tabular
Report form to generate a list of your subjects that includes any features of interest. For
now, you will generate a very simple list, but you will quickly see its power.
On the form, click to put checkmarks in these boxes: Descriptor (under Structure
Summary), and Source (under Biological Details). Then click Create Report at the
bottom of the form.
The custom report appears, with three columns: PDB ID code, model descriptor, and
biological source of the protein. The form contains many clickable items. Clicking an ID
code takes you to the Structure Summary page for that model. Clicking a column heading
sorts the list on that heading. Try this by clicking Source above the third column. Then
look down the Source column. This makes it easy to find the non-Bos taurus entries,
which include that adrenergic receptor. Anything else?
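If you save a report like this, sorting and filtering it yourself takes only a few lines. The sketch below uses a hand-typed table. The column names, the descriptors, and the "XXXX" entry are placeholders of my own, not the report's actual contents.

```python
# Illustrative rows only; "XXXX" is a placeholder, not a real PDB code.
rows = [
    {"pdb_id": "1F88", "descriptor": "rhodopsin", "source": "Bos taurus"},
    {"pdb_id": "XXXX", "descriptor": "placeholder adrenergic receptor",
     "source": "Homo sapiens"},
    {"pdb_id": "3CAP", "descriptor": "opsin", "source": "Bos taurus"},
]


def sort_by(table, column):
    """Mimic clicking a column heading: sort rows on that column."""
    return sorted(table, key=lambda row: row[column])


def not_from(table, source):
    """Keep only rows whose Source differs from the given organism."""
    return [row for row in table if row["source"] != source]


for row in sort_by(rows, "source"):
    print(row["pdb_id"], row["source"])

print([row["pdb_id"] for row in not_from(rows, "Bos taurus")])
```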
Now you know how to search the PDB for models whose sequences are similar to a
target or query sequence. Structural biologists use such searches when they have a new
protein sequence and want to know its structure. If the structure is known, this search
would find it, so if you are interested in the structure of a particular gene product,
search PDB with its sequence to see if the structure is already known. If not, any hits
with high sequence similarity can tell you the overall fold of the protein. You also got a
glimpse of the Custom Report tool, which can make it easy for you to organize and
peruse a large number of hits from any search.
Bioinformatics Tutorial
Summary
You have used these categories of tools in this tutorial:
1. Databases like GenBank, UniProt, and PDB store sequence and structural data
in the form of entries (each with a unique code) that correspond to a single gene
or its protein product. The databases provide extensive information about each
entry, ranging from brief pop-up information, to links that submit the entry to
various search and analysis tools (below), to encyclopedias of information about
the entry, or to the results of automated searches in PubMed for publications
related to the entry. Databases also provide sequences in formats (like FASTA) that
serve as search queries in the same or other databases.
2. Search Tools can be integral parts of databases, or stand-alone programs. Integral
search tools allow you to search with keywords, with FASTA sequences, or
with entry numbers from other databases. Stand-alone search tools like BLAST
allow you to find sequences (hits) similar to sequences of interest to you (queries).
3. Analysis Tools (example: PROSITE) use single sequences to determine
properties or identify functions of genes and their products. Sequence comparison
tools like ClustalW and T-Coffee perform multiple sequence alignments and
produce phylogenetic trees, showing vividly how genes are related to each other.
Consensus tree-building tools like Phylip and PhyML build trees based on many
iterations of random sampling and alignment of the sequences being compared,
thus reducing the possibility of bias from a single sequence alignment.
Phylodendron lets you print trees to your liking, using tree data in Newick format
from any tree-building tool.
4. Modelling Tools like Swiss-Model provide, or assist you in building, homology
models of proteins of unknown structure. The modeling program DeepView (also
known as Swiss-PdbViewer) helps you to build homology models, as well as to
study and judge the quality of all types of models (homology, X-ray, NMR).
DeepView and SWISS-MODEL are integrated, so you can move back and forth
between them at any point in a modeling project.
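The "random sampling" mentioned for consensus tree-building in item 3 is usually done by resampling the columns of a multiple sequence alignment with replacement (bootstrapping). Here is a minimal sketch of that one step, assuming this is the kind of sampling meant. The function name and the three short sequences are made up, and building a tree from each replicate is left to the tree-building tool.

```python
import random


def bootstrap_alignment(aligned_seqs, rng):
    """Return one replicate built by sampling columns with replacement."""
    n_cols = len(next(iter(aligned_seqs.values())))
    # Choose n_cols column indices at random; repeats are allowed.
    picks = [rng.randrange(n_cols) for _ in range(n_cols)]
    # Apply the same picks to every sequence so columns stay intact.
    return {
        name: "".join(seq[i] for i in picks)
        for name, seq in aligned_seqs.items()
    }


# Made-up aligned sequences of equal length.
alignment = {"seqA": "MKTAYIAK", "seqB": "MKTWYLAK", "seqC": "MRTAYLSK"}
replicate = bootstrap_alignment(alignment, random.Random(0))

for name, seq in replicate.items():
    print(name, seq)
```

A tree is then built from each of many such replicates, and the consensus tree records how often each grouping turns up.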
All of the tools you have used here are much more complex and powerful, and
require more judgement to use properly, than you might think from your use of
them so far. You have only scratched their surfaces. For example, programs like BLAST
and ClustalW have many settings that allow the user to control many aspects of the
analysis. When you click a link to ClustalW and get a multiple-sequence alignment with
no fuss, you have used default settings that might not be the best for your task. For
serious scientific work, you need to visit sites that provide full implementations of search,
alignment, and analysis tools, giving you full control of the task, but also requiring
deeper understanding of the kind of analysis you are doing. This kind of knowledge is
crucial to judging the quality of your results (an aspect in which this tutorial is very
weak).
To learn more about specific tools, go directly to any network service, such as ExPASy or
NCBI, that provides the tool you want to use. First, you will find links to extensive user
manuals that tell how the analysis tools work. You might also find lists of frequently
asked questions (FAQs) about the tool. Finally, you will find a direct link to a form for
running the tool, in which you can make all settings, put in a query, and run the tool. Only
trouble is, as a beginner, you often do not know what settings to put in.
In my opinion, the best services for beginners are those that provide settings in pull-down
menus that show you all of the allowed settings. As an example, go to EMBL-EBI,
another great online service, and click Sequence Similarity and Analysis. In the left-
hand column, under Sequence Analysis, click ClustalW2. The resulting form shows all
of the ClustalW settings in the form of pull-down menus, so you don't have to know the
possible settings and type them in—all allowed settings are displayed in the menus, so
you can't go wrong. The settings shown when you arrive (called the defaults) are
probably the same settings applied to your analysis when you clicked the quick link from
your table of opsin entries at UniProt to get your ClustalW multiple-sequence analysis. In
fact, if you go back to that page, you will see that the box at the top contains all FASTA
files in sequence. If you want to see how other settings affect the analysis, you can
paste this set of files, as one block of text, into the EMBL-EBI ClustalW form, play with
settings, and get multiple-sequence analyses to your heart's delight. This is a great way to
learn more about a tool that you want to use wisely. EMBL-EBI provides most of the
common bioinformatics tools in this beginner-friendly kind of environment.
For a more rigorous and systematic, yet readable and clear, survey of the full range of
bioinformatics, get the latest edition of Bioinformatics for Dummies, by Claverie and
Notredame, Wiley Publishing, Inc. It will help you learn to use the tools wisely, and to
judge the reliability of your results. I recently bought the 2007 edition, and I have learned
a lot of cool new stuff. The new edition helped immensely in updating this tutorial. It's
the best thing I know of to take you further.
Test Your New Skills
Here is a problem you should be able to solve using what you learned in this tutorial.
Humans cannot synthesize vitamin C (ascorbate), and so we must obtain it from our diet.
Many mammals, including mice, can make ascorbate. In the time since our line diverged
from that of rodents, we have lost one enzyme, gulonolactone oxidase, the final enzyme
in the pathway of ascorbate synthesis (if interested, read more).
This means that humans have an evolutionary ancestor that possessed a functional
gulonolactone oxidase gene. It stands to reason that humans should possess a
nonfunctional remnant of that gene (called a pseudogene).
Can you find a remnant of the gulonolactone oxidase gene in the human genome?
Happy hunting!