Bioinformatics (Methods Express), 1st Edition, edited by Paul Dear
The METHODS EXPRESS series
Bioinformatics
Biosensors
Cell Imaging
DNA Microarrays
Expression Systems
Genomics
Immunohistochemistry
PCR
Protein Arrays
Proteomics
Whole Genome Amplification

edited by Paul H. Dear
MRC Laboratory of Molecular Biology, Cambridge, UK

Scion
© Scion Publishing Ltd, 2007
All rights reserved. No part of this book may be reproduced or transmitted, in any form or by any means, without permission.
A CIP catalogue record for this book is available from the British Library.

Contents
1.2 The hierarchical form of protein architecture
1.3 Domains
2. Methods and approaches
2.1 Accessing macromolecular structures on the web
2.2 Classification of protein structures
2.3 Structural genomics
2.4 Approaches to protein structure prediction
2.5 Specialized methods for particular types of structure
3. References

Chapter 9. Gene ontology
Vineet Sangar
1. Introduction
1.1 Gene ontology
1.2 Structure of the GO database
1.3 The three GO ontologies
1.4 GO terms
1.5 Evidence codes
2. Methods and approaches
2.1 GO browsers
2.2 GO annotation tools
2.3 Gene expression tools
2.4 Integration of GO with other classification systems
3. Additional web resources
4. References

Chapter 10. Prediction of protein function
Rodrigo Lopez
1. Introduction
2. Methods and approaches
2.1 Required tools
2.2 Prediction and determination of physicochemical properties of proteins
2.3 Determination of secondary structure from sequence
2.4 Determination of functional domains using pattern-matching methods
2.5 Advanced methods combining several protein function prediction algorithms
2.6 Protein function prediction by transfer of annotation
2.7 Multiple sequence alignments and secondary databases
2.8 An overview of InterPro and CDD
2.9 Recent advances in protein function prediction
2.10 Concluding remarks
3. Additional web resources
4. References

Chapter 11. Multiple sequence alignment
Burkhard Morgenstern
1. Introduction
2. Methods and approaches
2.1 The alignment problem in computational biology
2.2 Pairwise sequence alignment
2.3 Multiple sequence alignment
2.4 Benchmarking and evaluation of multiple-alignment software
2.5 Visualization and comparison of multiple alignments
2.6 Multiple alignment of large genomic sequences
2.7 Software tools for multiple alignment
3. Additional web resources
4. References

Chapter 12. Inferring phylogenetic relationships from sequence data
Peter G. Foster
1. Introduction
2. Methods and approaches
2.1 Alignments
2.2 File formats
2.3 Software
2.4 Tree-building methods
2.5 Choosing a model
2.6 A Bayesian approach to phylogenetics
3. Troubleshooting
4. References

Appendix
Additional useful bioinformatics resources

Index
CONTRIBUTORS

GuhaThakurta, Debraj. Rosetta Inpharmatics LLC, Merck Co., Research Genetics Department, 401 Terry Avenue North, Seattle, WA 98109, USA. E-mail: [email protected]

Hall, Neil. The Institute for Genomic Research, 9712 Medical Center Drive, Rockville, MD 20850, USA. Current address: University of Liverpool, School of Biological Sciences, Biosciences Building, Crown St, Liverpool, L69 7ZB, UK. E-mail: [email protected]

Stormo, Gary D. Washington University School of Medicine, Department of Genetics, Campus Box 8510, Room 5410, 4444 Forest Park Parkway, St. Louis, MO 63108, USA. E-mail: [email protected]
Figure 6. Screenshot showing part of the 'Detailed view' panel of the 'Contig View' page of the Ensembl genome browser (see page 36).
The data that was uploaded in Protocol 7 is shown (dark bars just below the chromosome length scale) in the track called 'NavigatingGenomesTrack'. In this shot, the user has clicked on one of the uploaded features (P61205), and the small pop-up window displays information about this feature.
COLOR SECTION

Chapter 7. Expressed sequence tags

Chapter 8. Protein structure, classification, and prediction

Figure 13. Output of the PHOBIUS website, showing an example of the prediction of transmembrane regions and signal sequences (see page 191).

Figure 15. Output of the COILS website, showing an example of the prediction of coiled-coil regions (see page 193).
Chapter 10. Prediction of protein function

Figure 7. Helical representation of the sequence of the human fos oncogene using PEPNET from the EMBOSS suite of programs (see page 223).
Note the leucine zipper between positions 165 and 193. (Only the first 231 amino acids of the protein are shown here.)
Figure 10. Expected output of PPSEARCH when used to search for patterns in 'USERSEQ1_fasta.txt' (see page 226).

Figure 16. Part of the results of INTERPROSCAN for the protein RXRA_HUMAN (see page 237).
Chapter 1. Database resources for wet-bench scientists

The publicly available web resources described in this chapter can be divided into two types: primary and secondary databases. Whilst both types of database serve a useful purpose, one has to understand the distinction before making assertions based on a database query.

1.1.1 Primary databases
Primary databases include all of the repositories of the primary output from experimental work, such as GenBank (2) (http://www.ncbi.nlm.nih.gov/Genbank/index.html 1.2), which contains nucleotide sequences, and ArrayExpress (3) (http://www.ebi.ac.uk/arrayexpress/ 1.3) at the European Bioinformatics Institute (EBI), which contains microarray expression data. These repositories generally house information that is submitted by the scientist who generated it, and little is done to process, curate, or provide quality control over what is entered. Therefore, they are usually very comprehensive, but one must always treat the data with caution.

1.1.2 Secondary databases
Secondary databases are less all-inclusive than the primary databases, but instead concentrate on data quality and include additional information and cross-referencing. Secondary databases usually draw on (and may be linked to) a number of primary databases, collecting together information centered on a particular topic. For example, Pfam (3) (http://www.sanger.ac.uk/Software/Pfam/ 1.4) is a secondary database that curates protein domains and allows users to search proteins for known domains, whereas the Mouse Genome Informatics (MGI) (http://www.informatics.jax.org/ 1.5) collects and curates genomic, genetic, and functional data associated with the laboratory mouse. There is a clear distinction between these two examples of secondary databases: Pfam covers one theme (protein domains) and does so for all organisms, whereas MGI (4) covers many types of data but for only one species.

1.2 Database resources at NCBI
At the time of writing, NCBI has over 20 databases, which can be searched either individually or en masse. Entrez (5) is a system that provides a common interface to all of the major NCBI databases, including PubMed (6), nucleotide and protein sequences, protein structures, complete genomes, taxonomy, and many others. It provides a consistent user interface and format, and allows queries to be made across multiple NCBI databases at once. The starting point for Entrez is http://www.ncbi.nlm.nih.gov/gquery/gquery.fcgi 1.6, and an overview, showing all Entrez-accessible databases and the connections between them, is available at http://www.ncbi.nlm.nih.gov/Database/datamodel/ 1.7.

Each of these databases (which are often called nodes) will contain entries with unique identifiers (UIDs). These identifiers are stable over time, whilst the data associated with them can change. For example, a gene will always have the same UID, but over time we may discover a new function for it or new splice sites, so its annotation and sequence could change. These entries can be linked within and between nodes, which allows, for example, publication entries to be linked to protein entries. Each node has a specific entry format, although many features will recur among the different nodes. For example, there is a fully annotated GenBank record at http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html 1.8. Of particular interest are the three identifiers you will find in this single record: the locus name, the accession number, and the GI. The locus name (in this example SCU49845) is unique to each entry and in many cases it may be identical or similar to the accession number (in this case, U49845). The accession number will always remain the same even if the entry is changed. There is also a version number, which indicates how many times the entry has been updated: U49845.1 indicates that this is the first version of this entry. The GI is the 'GenInfo Identifier': if a sequence or protein translation changes in any way, a new GI number will be assigned, so GI sequence identifiers run parallel to the version numbers.

1.2.1 Nucleotide databases at NCBI
The main nucleotide database at NCBI (and, like the rest, accessible through Entrez) is GenBank (2), which contains all of the publicly available DNA sequences. However, there are subdivisions of GenBank that can be searched independently of the complete database. Depending on what you are looking for, it may be better to search a subdivision rather than the whole dataset. For example, dbEST (7, 8) contains only that subset of sequence data and other information that relates to 'single-pass' cDNA sequences or expressed sequence tags (ESTs); similarly, dbGSS contains only single-pass genome survey sequences.

Further nucleotide databases (again, Entrez-accessible) exist outside GenBank. For example, dbSNP (9) (http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=snp 1.9) is a database of single-nucleotide polymorphisms, small insertions, and deletions. There is also a sequencing trace archive (http://www.ncbi.nlm.nih.gov/Traces/trace.cgi 1.10), which contains sequences that have been submitted along with all of their underlying experimental data, so that you can view the original trace file that generated them. This data could be particularly useful if you wish to check the validity of a frameshift or insertion in a gene of interest.

1.2.2 Protein databases at NCBI
Protein databases can contain different levels of information, from primary sequence to secondary and tertiary structures. As well as databases that contain entire peptide sequences, there are a number of resources dedicated to collecting and curating protein domains and motifs. The NCBI protein database is a concatenation of a number of subdatabases: it includes sequences from Swiss-Prot (10, 11), the Protein Information Resource (PIR) (12), the Protein Research Foundation (PRF), and the Protein Data Bank (PDB) (13), along with protein sequences translated from nucleotide sequences of RefSeq (10, 11) and GenBank. Therefore, when you search using the Entrez server, you will be doing a comprehensive search of all publicly available sequences. As with the nucleotide databases, the quality of these protein entries varies with their source. For example, Swiss-Prot has highly curated annotations and should be nonredundant, whereas the translations from GenBank will contain sequence errors and misannotations in a number of cases.

Additional protein-related information is available in NCBI's Structure database (http://www.ncbi.nlm.nih.gov/Structure/ 1.11) and NCBI's Conserved Domains Database (CDD) (http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml 1.12). CDD contains domains from Pfam (14), Simple Modular Architecture Research Tool (SMART) (15), and Clusters of Orthologous Groups (COG) (16), as well as other
domains curated at NCBI. The major utility of the domain database is for identifying domains in a protein sequence, which will allow the user to infer a function (also see Chapter 8).

1.2.3 Other databases at NCBI
As well as the major nucleotide and protein databases, NCBI houses a number of other related databases (nodes) that are linked to the sequence databases, as well as being themselves browsable and searchable through Entrez. One of the most commonly used databases is PubMed, which is a repository of biomedical journal articles. As well as being searchable using text queries, all of the articles in PubMed are linked to other NCBI entries related to them, such as nucleotide sequences. Similarly, there are databases of chemical structures (PubChem), microarray experiments (Gene Expression Omnibus, GEO) (17), taxonomy (Entrez Taxonomy), genes (Entrez Gene), maps (Map Viewer), and inherited diseases (Online Mendelian Inheritance in Man, OMIM) (18, 19), among others. Whilst some of these datasets may not seem obviously useful to your particular area of research, much of their functionality is derived from the fact that related records in each database are all linked, so the user is able to traverse between datasets by following the links provided. For example, search the PubChem database for 'ethanol' and it will return not only a structure and description of the compound but also links to protein databases of enzymes that bind ethanol, as well as toxicology reports in the National Library of Medicine and relevant publications in PubMed.

2. Methods and approaches
Here we discuss, in a little more detail, some of the tools available at NCBI through Entrez, tailored for searching their nucleotide, protein, and other databases. We then provide a set of protocols illustrating how to answer specific, typical bioinformatics questions.

It is not possible to cover more than a tiny fraction of the resources, tools, and methods of query that are available through Entrez. However, we suggest that you start with the examples in the protocols and then take these as starting points to explore on your own.

2.1 Searching databases at NCBI

2.1.1 Text searches
The simplest and broadest search of NCBI databases is offered via the Entrez entry page: http://www.ncbi.nlm.nih.gov/gquery/gquery.fcgi 1.6. Here, you can perform a simple text-based search across all databases or you can choose a specific NCBI database to search, such as PubMed or Nucleotide. If you are searching across all databases, the simplest search you can perform is a text search of all fields. Protocol 1 gives a simple example.

Protocol 1
A simple text search for Plasmodium across NCBI databases using Entrez

1. Start at the NCBI home page: http://www.ncbi.nlm.nih.gov/ 1.1 and select Entrez home on the right to go to the Entrez cross-database search page: http://www.ncbi.nlm.nih.gov/gquery/gquery.fcgi 1.6.
2. In the text box towards the top of the screen ('Search across all databases'), type 'Plasmodium'. Click the adjacent Go button, or press 'Enter' on your keyboard.
3. The page will be updated. Next to each of the database names and icons in the lower part of the screen will be a number (or, in a few cases, 'none'). This indicates the number of entries in each of these databases that contain, anywhere within them, the word 'Plasmodium'.
4. Many of these records will not relate to Plasmodium itself. For example, some will describe proteins from other species that are noted as interacting with, or being similar to, Plasmodium proteins.
5. We therefore want to limit the search to entries in which the organism is Plasmodium. To do this, repeat the query, this time using the text 'Plasmodium [ORGANISM]' (see note a).
6. The page will be updated, this time giving the number of entries (in each of the databases) in which Plasmodium is in the 'Organism' field.
7. Clicking on any of the results will take you to the respective results page, listing all of the Plasmodium entries found. For example, clicking on Nucleotide or the adjacent icon will take you to the start of a list of over 200000 Plasmodium nucleotide sequence entries.

Note
a. A list of fields that can be searched is given here: http://www.ncbi.nlm.nih.gov/entrez/query/static/help/Summary_Matrices.html#Search_Fields_and_Qualifiers 1.13. Many fields can be abbreviated; for example, 'ORGN' can be used in place of 'ORGANISM'.

Search terms can also be combined; for example, searching for 'malaria AND mosquito' will find all entries that contain (anywhere within the entry) both 'malaria' and 'mosquito'. Similarly, 'Plasmodium NOT Plasmodium [ORGANISM]' will find all entries that refer to Plasmodium, but that do not originate from the organism Plasmodium itself. More sophisticated searches can be made by querying each database individually, rather than globally. The advantage of this is that it will allow you to define your search using fields specific to that database. It is also possible to view a 'history' of previous searches and to combine these together to refine the search further. Protocol 2 gives a simple example of searching a single database, using 'limits' and 'history' to build up a progressively more refined query.
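The field-limited and Boolean queries described above can also be written down programmatically. The following Python sketch is only an illustration and is not part of the protocols: the helper names (`field`, `combine`, `esearch_url`) are hypothetical, and the use of NCBI's E-utilities `esearch.fcgi` address is an assumption of this example, since the chapter itself describes only the Entrez web pages. The code builds query strings and a URL; it does not contact NCBI.

```python
from urllib.parse import urlencode

# Assumption: NCBI's E-utilities search endpoint (not described in this chapter).
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def field(term, qualifier=None):
    """Return a search term, optionally limited to a field such as ORGANISM."""
    return f"{term}[{qualifier}]" if qualifier else term


def combine(left, operator, right):
    """Join two terms with an upper-case Boolean operator (AND, OR, NOT)."""
    operator = operator.upper()
    if operator not in {"AND", "OR", "NOT"}:
        raise ValueError(f"unsupported operator: {operator}")
    return f"{left} {operator} {right}"


def esearch_url(db, term):
    """Build (but do not fetch) a search URL for one database."""
    return f"{ESEARCH}?{urlencode({'db': db, 'term': term})}"


# Entries whose organism is Plasmodium (Protocol 1, step 5).
organism_only = field("Plasmodium", "ORGANISM")
# Entries that mention Plasmodium but come from another organism.
mentions_only = combine("Plasmodium", "not", organism_only)
# Entries containing both words anywhere in the record.
both_words = combine("malaria", "AND", "mosquito")

print(organism_only)
print(mentions_only)
print(both_words)
print(esearch_url("nucleotide", organism_only))
```

Keeping the qualifier and the operator in separate helpers mirrors the way the web form is used: first limit a term to a field, then combine terms.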
Protocol 2
A search in PubMed using limits and History

1. Navigate to the Entrez entry page (http://www.ncbi.nlm.nih.gov/gquery/gquery.fcgi 1.6) and click on the PubMed link towards the top of the page.
2. In the text field at the top of the page, enter 'malaria' and click the adjacent Go button.
3. The result is a list of over 40000 (at the time of writing) entries that contain the word 'malaria' in any part of the entry.
4. We will repeat this search, restricting it to articles with 'malaria' in their title. Click on the Limits tab (just below the text entry box). Scroll to the bottom of the new page to find the pull-down menu Default tag. Select Title from the pull-down menu. Click the adjacent Go button (or scroll back up the page and click the one at the top).
5. The result is a list of about 20000 articles, all with 'malaria' in their title.
6. We will now look at our previous searches and combine them to refine the search. Click on the History tab.
7. Into the text entry field, type 'mosquito'. Also, click on the ticked box on the Limits tab to 'untick' it. This removes our previous limits settings.
8. Below, you will see a list of your most recent searches. The top-most one in the list will be:
#xx Search malaria Field: Title
where 'xx' is a number. Click on the number.
9. A pop-up menu of options will appear, asking how you want to combine your previous search (for articles with 'malaria' in the title) with the current search (for 'mosquito', not limited to the title). Click on AND.
10. The text box should now show:
(mosquito) AND (#xx)
meaning that we are about to search for records that contain 'mosquito' in any part of the entry and that also contain 'malaria' in the title (from our previous search). Click Go.
11. The result is around 3500 entries (at the time of writing), each with 'malaria' in the title and with 'mosquito' somewhere in the entry.

NCBI provides a web page giving further details of how to search their databases at http://www.ncbi.nlm.nih.gov/entrez/query/static/help/helpdoc.html#Searching 1.14. The following protocols give some examples of other ways to search the NCBI databases. The examples are in no way exhaustive, but they will introduce you to a range of search types that can form the basis of your own explorations.

Protocol 3
Determining the set of web resources available for a genome

1. Start at NCBI's home page (http://www.ncbi.nlm.nih.gov/ 1.1) and click on the Genomic Biology link in the left blue bar. This link takes you to the genomic biology page: http://www.ncbi.nlm.nih.gov/Genomes/ 1.15.
2. Under Genome resources on the right, select Eukaryotic to go to an alphabetic list of genome projects, listed by species: http://www.ncbi.nlm.nih.gov/genomes/leuks.cgi
3. Scroll down to find Plasmodium falciparum 3D7 (at the time of writing, only one genome project is listed for this strain). On the same line, you will see a number of links, including a taxonomic identifier ('36329') and a link to the sequencing consortium's home page.
4. You will also see, at the right of the line, a series of colored abbreviations for different NCBI databases (PM, R, G, etc.). Clicking on any one of these will bring up data on Plasmodium falciparum 3D7 from the appropriate database. For example, clicking on G will show you all of the entries in the Genes database for this organism. If necessary, use your browser's 'back' button to return to http://www.ncbi.nlm.nih.gov/genomes/leuks.cgi 1.16.
5. Clicking on the organism name at the left will take you to the Genome Project database and display entries for P. falciparum, offering further links to data and resources for this organism.

NCBI Map Viewer provides one way to access positional genome information and to integrate it into searches. Protocol 4 takes you through a typical use of Map Viewer.

Protocol 4
Finding sequence-tagged site (STS) markers on chromosome 3 of P. falciparum using the NCBI Map Viewer

1. Navigate to the Map Viewer home page (http://www.ncbi.nlm.nih.gov/mapview/ 1.17) by the Map Viewer link from the NCBI home page (http://www.ncbi.nlm.nih.gov/ 1.1).
2. On the Map Viewer home page, select Plasmodium falciparum (http://www.ncbi.nlm.nih.gov/mapview/map_search.cgi?taxid=36329 1.18) from the pull-down Search menu (leave the text field empty) and click Go. You will be taken to the Map Viewer page for P. falciparum, including an ideogram of the karyotype.
3. Enter 'STS' and '3' in the text fields Search for and on chromosome(s), respectively, and click Find.
4. The results of your query are presented as hits on chromosome ideograms and in a tabular format. View the results in the Map Viewer graphical display by clicking on the 3 underneath the chromosome in the ideogram (to show all STSs) or by clicking on the blue links in the table below. Click on the first Map Element in the table (at the time of writing, this was Pf2541).
5. The resulting page will show STSs in a part of the chromosome, with the chosen STS (Pf2541) indicated. Clicking on its name will call up further information on that STS, including the polymerase chain reaction primer sequences.
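The taxonomic identifier noted in Protocol 3 ('36329') is the same number that appears as the `taxid` parameter in the Map Viewer address in Protocol 4. The short Python sketch below illustrates that link by building the address from a taxonomic identifier and reading the identifier back out of it. The function names are hypothetical, the base address is simply the one printed in Protocol 4 (the service may have changed since the time of writing), and nothing is fetched from the network.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Base address as printed in Protocol 4; treated here as plain text, not fetched.
MAP_SEARCH = "http://www.ncbi.nlm.nih.gov/mapview/map_search.cgi"


def map_search_url(taxid):
    """Build a Map Viewer search address for one taxonomic identifier."""
    return f"{MAP_SEARCH}?{urlencode({'taxid': taxid})}"


def taxid_from_url(url):
    """Return the taxid parameter of an address as an int, or None if absent."""
    values = parse_qs(urlsplit(url).query).get("taxid")
    return int(values[0]) if values else None


url = map_search_url(36329)  # Plasmodium falciparum 3D7, from Protocol 3
print(url)
print(taxid_from_url(url))
```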
ProtocolS Protocol'
Searching for a-tubulin genes in P. falciparum Simple BLAST searches at NCBI
This search could be started from the Entrez Home page (searching all NCBI databases and then 1. Navigate to the BLAST home page (http://www.ncbi.nlm.nih.gov/BLAST/1.20) from the BLAST link
selecting those hits from the Gene database). Alternatively, as here, we can navigate to the Entrez included in the query bar found at the top of most NCBI pages.
Gene page to search only that database.
2. The BLAST home page provides links to the suite of BLAST tools for comparisons between
1. Navigate from the NCBI home page (http://www.ncbi.nlm.nih.gov/1.1) to the Entrez home nucleotide or protein sequences. Searches may be conducted against highly divergent
page (http://www.ncbLnlm.nih.gov/gquery!gquery.fcgi 1.6) using the link on the right of the (discontinuous mega blast), the trace archive, the COD, gene expression data in
screen. single-nucleotide polymorph isms, immunoglobulins, etc. In this case, we will search for
2. Click on the Gene link or adjacent icon (left side of screen) to go to the Entrez Gene page proteins related to a Plasmodium a-tubulin, so select protein blast (under the 'Basic BLAST'
(http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene 1.19). heading).
3. Into the text box at the top of the screen, type: 3. On the 'Protein BLAST' page, leave all settings at their default values (note that we will be
searching against the 'm' or non-redundant database).
Plasmodium falciparum[ORGANISM] alpha tubulin
4. Open a new browser window and find the P. falciparum a-tubulin gene PFI0180w (see
(see Protocol 1 for an explanation of using '[ORGANISM)' to limit a search to entries originating
Protocol 5). When you have found the Entrez Gene page for this gene, scroll down to find
from a species). Click Go.
the heading 'NCBI Reference Sequence (RefSeq)' and click on the link to the gene product
4. The query should return 11 genes (correct at the time of writing) with a-tubulin in the (XP_001351911.1). This should bring up the corresponding Entrez Protein page and, scrolling
annotation. Click on the link for PFI0180w to get a detailed summary of its annotation. down, you will find the complete amino acid sequence for this protein. Copy this sequence
5. In this case, the PubMed reference for the gene (under Links on the right of the screen) (along with the numbers and spaces) and paste it into the Search box on the BLAST page.
links back to the genome project paper (Hall et an. rather than to an original paper about 5. Click on BLAST. You will be taken to a page saying that your request has been successfully
a-tubulin. From this, we might assume that this gene's annotation has been predicted by submitted. The page will be automatically updated when your results are ready (BLAST searches
homology to other tubulin genes, rather than having been verified experimentally. can take some time to complete).
6. Use your browser's 'back' button to return to the Entrez Gene page for PFI0180w. Scrolling down the page to the section headed 'General gene information' shows that all of the Gene Ontology terms relating to this protein (see Chapter 9 for an introduction to Gene Ontology) were assigned on the basis of evidence code 'IEA' ('Inferred from Electronic Annotation'), confirming our assumption that the annotation was not verified experimentally.
7. Additional examples of gene searches are given on the Entrez Gene home page.

2.1.2 Sequence similarity searches and alignment of transcripts to genomic sequences
A common method of querying sequence databases is by similarity searching. The best-known tool for similarity searching is BLAST (Basic Local Alignment Search Tool) (17), which allows you to search your query sequence against a database of your choosing. Chapter 3 gives detailed information about BLAST and related tools; here we introduce the use of such tools in the context of the NCBI databases.
Similarity searches of NCBI's nucleotide and protein sequence databases can be restricted to sequences from one or more species either by specifying the organisms in the Options section of the BLAST page or by submitting searches against databases on the organism-specific BLAST pages (http://www.ncbi.nlm.nih.gov/BLAST/ 1.20). Multiple query sequences can also be submitted in the same search using the organism-specific BLAST pages.

6. When the results are ready, you will see a diagram representing the best matches. The colored bars indicate the score of the match and the portion of your query sequence that it matches. In this case, there will be many full-length red bars indicating many close matches to the complete α-tubulin sequence.
7. Below this are listed the hits in order of BLAST score (best first). If you click on the accession number of the hit, you will go to the GenBank entry for that protein. If you click on the BLAST score, you will go to the alignment (remember that there may be more than one alignment per hit).
8. Now try repeating the search, but looking only for matches in Arabidopsis thaliana. To do this, navigate to the 'Protein BLAST' page (step 3 above), but this time, type Arabidopsis thaliana into the 'Organism' text field (under 'Choose Search Set') before continuing as before. (Note: as you type the organism name, you will be prompted with a list of likely organisms; simply click on Arabidopsis thaliana to save typing the complete name.)
9. The result this time is a smaller number of matches, some of them shorter or of lower score, to Arabidopsis sequences.

BLAST is not always the best tool for sequence alignment and NCBI provides other tools that may be more appropriate for your needs. Alignment of an mRNA or cDNA sequence to a genomic sequence can be computed using NCBI's SPIDEY (17) (http://www.ncbi.nlm.nih.gov/IEB/Research/Ostell/Spidey/ 1.21) or SPLIGN (http://www.ncbi.nlm.nih.gov/sutils/splign/splign.cgi 1.22) alignment tools. Chapters 4 and 7 cover the alignment of transcripts with genomic sequence in more detail. Protocol 7 gives a simple example of using SPLIGN to compare a cDNA sequence with genomic sequence.
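The two operations used in the BLAST steps above, listing hits best-first by score and restricting the search to one organism, can be sketched in a few lines of Python. This is only an illustration of the idea, not how BLAST itself works: the `Hit` record, the `rank_hits` function, and every accession and score below are hypothetical, invented for the example.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """One database match (hypothetical record layout, not a BLAST format)."""
    accession: str
    organism: str
    bit_score: float


def rank_hits(hits, organism=None):
    """Return hits sorted best-first by score, optionally for one organism only."""
    selected = [h for h in hits if organism is None or h.organism == organism]
    return sorted(selected, key=lambda h: h.bit_score, reverse=True)


# Made-up hits for illustration; these are not real accessions or scores.
EXAMPLE_HITS = [
    Hit("HYPO_0001", "Homo sapiens", 910.0),
    Hit("HYPO_0002", "Arabidopsis thaliana", 780.5),
    Hit("HYPO_0003", "Arabidopsis thaliana", 802.0),
    Hit("HYPO_0004", "Mus musculus", 905.5),
]

if __name__ == "__main__":
    # Restricting to one organism gives fewer hits, still listed best-first.
    for hit in rank_hits(EXAMPLE_HITS, organism="Arabidopsis thaliana"):
        print(hit.accession, hit.bit_score)
```

As in step 9, the organism-restricted list is a subset of the full list, so it is shorter and its top score may be lower than that of the unrestricted search.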
10 CHAPTER 1: DATABASE RESOURCES FOR WET-BENCH SCIENTISTS TROUBLESHOOTING !II 11
may disappear at any moment. So, whilst you may want to use the more recent version of the data, remember that it may not be there tomorrow.

Listed here is a selection of other web sites that are likely to be useful.

Major primary sequence generators
• The Wellcome Trust Sanger Institute: http://www.sanger.ac.uk/ 1.33
• JGI Genomes: Eukaryota, Archaea, Bacteria: http://genome.jgi-psf.org/tre home.html 1.34
• Human Genome Sequencing Center, Baylor College of Medicine: http://www.hgsc.bcm.tmc.edu/ 1.35
• The Broad Institute: http://www.broad.mit.edu 1.36
• The Institute for Genomic Research: http://www.tigr.org 1.37. Now renamed the J. Craig Venter Institute (JCVI; http://www.jcvi.org/)
• Washington University Genome Sequencing Center: http://genome.wustl.edu/ 1.38
• Genoscope: http://www.genoscope.cns.fr/ 1.39

Bioinformatics institutes
• The European Bioinformatics Institute: http://www.ebi.ac.uk 1.40
• National Center for Biotechnology Information: http://www.ncbi.nlm.nih.gov/ 1.1
• Center for Information Biology and DNA Data Bank of Japan: http://www.cib.nig.ac.jp 1.41

Genome annotation databases
• KEGG (Kyoto Encyclopedia of Genes and Genomes): http://www.genome.jp/kegg/ 1.42
• GeneDB (Sanger Institute Pathogen Sequencing Unit annotation database): http://www.genedb.org/ 1.43
• Ensembl Genomes: http://www.ensembl.org/index.html 1.44
• CMR (Comprehensive Microbial Resource) Annotated Microbial Genomes: http://pathema.tigr.org/tigr-scripts/CMR/CmrHomePage.cgi 1.45
• BRC Central (central web site of NIAID Bioinformatics Resource Centers, which houses databases of biodefense-related organisms): http://www.brc-central.org 1.46
• Genome properties (a database of curated and calculated properties of microbial genomes): http://cmr.tigr.org/tigr-scripts/CMR/shared/GenomePropertiesHomePage.cgi 1.47

Protein families, domains, and structures
• Pfam (curated protein domains): http://www.sanger.ac.uk/Software/Pfam/ 1.4
• TIGRFAM (curated protein domains of microbes): TIGRFAMs/index.shtml 1.48
• SMART: http://smart.embl-heidelberg.de/ 1.49
• InterPro (protein families, domains, and functional sites, in which identifiable features found in known proteins can be applied to unknown protein sequences): http://www.ebi.ac.uk/interpro/ 1.50
• Protein data bank (provides a variety of tools and resources for studying the structures of biological macromolecules): http://www.rcsb.org/pdb 1.51

Miscellaneous
• OBO (Open Biomedical Ontologies: an umbrella web address for well-structured controlled vocabularies for shared use across different biological and medical domains): http://obo.sourceforge.net/ 1.52
• Amigo (a web interface for browsing gene ontologies, which will allow you to search for genes with specific functions or cellular locations, or that are involved in specific processes): http://www.godatabase.org/ 1.53

References
* 1. Wheeler DL, Barrett T, Benson DA, et al. (2006) Nucleic Acids Res. 34, D173-D180. - This publication gives an overview of the NCBI databases and their associated tools. This reference is the most up to date at the time of writing, but NCBI publishes an overview of changes and updates in the Nucleic Acids Research database issue, which is published every year.
2. Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J & Wheeler DL (2006) Nucleic Acids Res. 34, D16-D20.
3. Parkinson H, Sarkans U, Shojatalab M, et al. (2005) Nucleic Acids Res. 33, D553-D555.
4. Eppig JT, Bult CJ, Kadin JA, et al. (2005) Nucleic Acids Res. 33, 5.
* 5. Geer RC & Sayers EW (2003) Brief. Bioinform. 4, 5. - A tutorial paper that covers some of the ground of this chapter but in more detail. It gives a useful overview of the concepts behind the Entrez tool using example tasks.
6. McEntyre J & Lipman D (2001) CMAJ, 164, 1317-1319.
7. Boguski MS, Lowe TM & Tolstoshev CM (1993) Nat. Genet. 4, 332-333.
8. Banfi S, Guffanti A & Borsani G (1998) Trends Genet. 14, 80-81.
9. Sherry ST, Ward MH, Kholodov M, et al. (2001) Nucleic Acids Res. 29, 308-311.
10. Boeckmann B, Bairoch A, Apweiler R, et al. (2003) Nucleic Acids Res. 31, 365-370.
11. Boeckmann B, Blatter MC, Famiglietti L, et al. (2005) C. R. Biol. 328, 882-899.
12. Barker WC, Garavelli JS, McGarvey PR, et al. (1999) Nucleic Acids Res. 27, 39-43.
13. Sussman JL, Lin D, Jiang J, et al. (1998) Acta Crystallogr. D Biol. Crystallogr. 54, 1078-1084.
* 14. Bateman A, Coin L, Durbin R, et al. (2004) Nucleic Acids Res. 32, D138-D141. - Pfam is possibly the most useful web resource available for gene function analysis. It is useful to understand how it is built and annotated before diving in and using it. This publication should be updated each year in the database issue of Nucleic Acids Research.
15. Letunic I, Goodstadt L, Dickens NJ, et al. (2002) Nucleic Acids Res. 30, 242-244.
* 16. Tatusov RL, Fedorova ND, Jackson JD, et al. (2003) BMC Bioinform. 4, 41. - An in-depth description of how orthologous groups have been calculated for the COG database; it also describes the eukaryotic clusters (or KOGs). This database is an excellent tool for studying the phylogenetic coverage of genes.
17. Barrett T, Suzek TO, Troup DB, et al. (2005) Nucleic Acids Res. 33, D562-D566.
18. Hamosh A, Scott AF, Amberger JS, Bocchini CA & McKusick VA (2005) Nucleic Acids Res. 33, D514-D517.
19. Cantor MN & Lussier YA (2004) Medinfo, 11, 753-757.
CHAPTER 2
Navigating sequenced genomes
Melody S. Clark and Thomas Schlitt
World sequencing capacity and technologies were fuelled by the race to complete
the Human Genome Project and resulted in a massive investment in infrastructure,
machinery, techniques, and personnel. The effective completion of this project has
not seen a reduction in the amount of sequencing data generated. The techniques
learnt, especially the use of shotgun sequencing, can efficiently sequence large
vertebrate genomes, and the realization that low-density coverage of a genome
could prove almost as useful to scientists as a completed genome, allied with a
dramatic reduction in sequencing costs, brought about a liberation in the genomic
science field. Therefore, the 'excess' capacity was not closed down but maintained
so that, in theory, the DNA of any organism could be sequenced. A press release from
the National Institutes of Health (NIH) in August 2005 announced that the public
collections of sequence data (GenBank, EMBL, and DDBJ) had reached 100 gigabases
from over 165 000 different organisms, and the sequence databases continue to
grow at a tremendous rate. This continued production of sequencing data, much
of which is for comparative analyses, has gradually led to a standardization of data
presentation and genome viewers, some of which will be described here.
So what constitutes a 'sequenced' genome? This is either in the form of:
This chapter will encompass all of these types of genomes and will explain how to access the data, appreciate the limitations of the associated annotation, identify sources of additional useful information, and download and manipulate the data. The chapter provides a number of worked examples and also lists several different genome viewers to use for both vertebrates and microbes. In general, the different genome browsers present identical data sets in slightly different ways, and some are easier to use than others. This chapter has focused on the use of Ensembl (for vertebrates) (2) and the related site Integr8 (mainly dealing with microbes) for several reasons:
• They combine the considerable resources of two major bioinformatics centers: The European Bioinformatics Institute and the Sanger Institute.
• Both are publicly funded with what appears to be a sustained commitment to future work and are not subject to the vagaries of commercial interest.
• They provide data as quickly as possible from the sequencing pipelines, allowing access to very early crude data releases.
• Numerous updates are provided.
• If you grasp Ensembl, then you do not need to relearn another browser when accessing Integr8.
Having said that, they may not suit everyone, and the best way to get to grips with genome viewers is to pick a gene, enter it into the different browsers, and decide which one you like the best with regard to data presentation and ease of use. Nevertheless, you would be well advised to explore Ensembl and Integr8 first, as understanding how they are used will help in understanding other browsers. You are also encouraged to explore references (3)-(6), which provide useful information on a number of browsers and related resources.

2.1 Finding genome resources for an organism
If an organism has been (or is being) sequenced, then, generally speaking, every scientist who works in that community is aware of it. However, this does not necessarily mean that they will have privileged access to the data prior to the obligatory Science or Nature genome paper or know where the data is being stored. Also, the data may not be available immediately in one of the 'standard' public genome browsers. There may also be a requirement to try to identify similar genomes for comparative analyses. For this, you need to access a genome project-monitoring web site, which keeps an up-to-date log of what is being sequenced, with contact details (7). The most comprehensive site for this is the Genomes On Line Database (GOLD), which is maintained by Dino Liolios, Nektarios Tavernarakis, Phil Hugenholtz, and Nikos Kyrpides (8-10). Protocol 1 is an example of a simple query to GOLD.

Protocol 1
The Genomes On Line Database (GOLD)
1. As an example, we will try to find out if there are any genome projects for the whiptail wallaby, Macropus parryi.
2. Go to http://www.genomesonline.org 2.1 and click on the GOLD Tables button.
3. You have a choice of the following buttons:
• Published complete genomes
• Archaeal ongoing genomes
• Bacterial ongoing genomes
• Eukaryotic ongoing genomes
• Metagenomes
4. Click on Published complete genomes to call up a table of all published genome sequences. Several of the column headings at the top of the table are clickable links; clicking on one of these will sort the data according to that criterion.
5. Click Organism to sort the table by genus and species, and scroll down to look for M. parryi. There is no such entry (at the time of writing).
6. Perhaps there is an ongoing genome project for this species. We could go back and look through the list of 'Eukaryotic ongoing genomes', but instead we will use the 'search' function. Go back to the front page of GOLD and click on the Search GOLD link.
7. On the resulting page, you can specify what type of information you want to retrieve (using check boxes at the top of the screen; leave this at its default with the All fields box checked) and your search criteria (menus and fields in the lower part of the screen).
8. Under the Type menu on the left, select species and, in the adjacent text box, type 'parryi'. Press 'Enter' to start the search (or click the Submit search button at the bottom of the screen).
9. The query returns no results (at least, at the time of writing). Alas, there does not appear to be any whiptail wallaby genome project. However, are any other members of the genus being sequenced?
10. Use the 'back' button to return to the search form and this time search for the genus 'Macropus' (setting the type menu to genus).
11. This gives two results, both for Macropus eugenii genome projects. (For other species, different types of project such as expressed sequence tag (EST) projects may be listed.)
12. Clicking on the Taxonomy link for either of these two entries will take you to the appropriate entry in the National Center for Biotechnology Information (NCBI) Taxonomy Browser, where you will see that M. eugenii is the tammar wallaby. Additional information about, and links to, available genome and other data for this species are given.
13. Return to the GOLD page displaying the two M. eugenii genome projects. Links to relevant funding bodies, institutions, databases, and other resources are given for each of the genome projects. The identifiers listed under GOLDSTAMP (column on the extreme left) are unique to each genome project and link to a summary page or 'Gold Card' for the project.
14. The GOLD search page offers many options for searching, not only by organism type or by genome properties, but also by funding body, country, researcher, and many other factors.
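The search strategy of Protocol 1 (look for the exact species first and, if nothing is found, widen the search to the genus) can be sketched as follows. This is a minimal illustration only: the record fields (`project_id`, `genus`, `species`, `type`), the identifiers, and the function names are assumptions made for the example and do not reflect GOLD's actual data model or any GOLD programming interface.

```python
# Hypothetical project records; field names and IDs are invented for this
# sketch and are not GOLD's own schema or GOLDSTAMP identifiers.
PROJECTS = [
    {"project_id": "EX0001", "genus": "Macropus", "species": "eugenii", "type": "genome"},
    {"project_id": "EX0002", "genus": "Macropus", "species": "eugenii", "type": "genome"},
    {"project_id": "EX0003", "genus": "Canis", "species": "familiaris", "type": "genome"},
]


def search_projects(projects, field, term):
    """Return the records whose `field` equals `term`, ignoring case."""
    wanted = term.strip().lower()
    return [p for p in projects if p[field].lower() == wanted]


def find_projects(projects, genus, species):
    """Search by species within the genus; fall back to the whole genus if empty."""
    in_genus = search_projects(projects, "genus", genus)
    exact = search_projects(in_genus, "species", species)
    if exact:
        return "species", exact
    return "genus", in_genus


if __name__ == "__main__":
    level, hits = find_projects(PROJECTS, "Macropus", "parryi")
    print(level, [p["project_id"] for p in hits])
```

Returning the level at which the match was made keeps it clear to the caller that the results are for related species rather than for the organism originally requested.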
18 CHAPTER 2: NAVIGATING SEQUENCED GENOMES METHODS AND APPROACHES 19
GOLD will direct you to specific genome project information and also to an NCBI flat file (if it exists). A 'flat file' is a data file that contains records with no structured relationship. In the case of sequence files, these usually list minimal data on the source of the data, the full sequence, and (not always) annotation on coding regions. There are no links to additional sources of information and it is necessary to extract the data manually to be able to analyze it further. If, for example, you wanted to examine a particular gene from a sequenced microbe, then using the flat file you would have to scroll manually through the entire sequence to find it, if the annotation was in place, or else perform a BLAST search on the whole sequence, work out where the bit you wanted was, and then extract that particular piece of sequence. In a nutshell, this is why genome browsers are so useful.
There are distinct advantages in accessing the information via generic genome viewers. They present the data in a standard format and often link sequence data between different genomes, allowing easier comparisons and data handling. They are also not restricted in the organisms they list (some institutes only list those organisms being sequenced 'in house'), as long as the data is in the public domain.
It should be mentioned that GOLD does not necessarily direct you to a generic genome browser for the organism in question, even if the information is available there. For example, both the dog (Canis familiaris) and cow (Bos taurus) genomes are available in Ensembl, but a direct link to the relevant Ensembl pages is not available under the GOLD 'Organism' listing. In general, you will need to discover which of the generic genome browsers gives access to the data for the species you are interested in.
In the following sections, we will focus on the use of the Ensembl and Integr8 browsers; other popular resources will be covered later.

available for an increasing number of species (over two dozen at the time of writing). It is continually being updated with an increasing array of features. Most importantly, it is not restricted to the genome sequence produced by any one institute and it also allows easy comparison of data from different genomes. Protocol 2 describes a typical use of Ensembl: browsing for information related to a given gene and its genomic environment. Gene symbols and descriptions can also be used instead of gene names.

Protocol 2
A short tour of Ensembl starting from the parathyroid hormone-related protein gene
1. Go to Ensembl via http://www.ensembl.org 2.2 and click on Homo sapiens under the Popular genomes heading. In the text box next to the Search e! Human box at very top right of the screen, enter 'parathyroid hormone-related protein'. Set the adjacent pull-down menu to Anything. Press 'Enter' or click on the Go button.
2. There are only three results (note a), starting with two Ensembl protein coding genes and, below them, an Ensembl gene family. Click on the link (Ensembl protein coding gene: ENSG00000087494 ...) for the gene.
3. This will call up an Ensembl Human Gene View page, starting with the Ensembl Gene Report for this gene (see Fig. 1). A range of information about the gene is displayed, along with links to further annotation and resources. Note that if you hover the cursor over the links on the left of the screen, brief explanations of the links will appear.

Figure 1. Ensembl gene report for parathyroid hormone-related protein.

4. To download directly the sequence data and some associated information for your own analyses, you may use Export information about region, Export sequence as FASTA or Export EMBL file. In each case, you will be offered various options such as which file format to use, which features of the annotation to include in the output, and how much flanking sequence to include. Each of these options will export a flat file, but with only minimal annotation. Much more information is displayed graphically using the other browser features. If you try any of these export options, use the 'back' button on your browser to return to the Gene View page.
5. On the right-hand part of the screen, the Genomic Location box gives both the location of the gene in the genome (the chromosome and the start and end positions of the gene sequence) and the sequence contig in which the start of the gene lies. Access the genomic location of the gene by clicking on the link 28,002,284-28,016,183 (the exact coordinates may change in later versions of the genome assembly).
6. This takes you to an Ensembl Contig View page, showing an ideogram of chromosome 12 and, in the 'Overview' area below, an expanded view of the region surrounding the gene at 28 Mb in band p11.22. The parathyroid hormone-related protein gene (also known as parathyroid hormone-like hormone or PTHLH) is shown in the middle (the short brown bar above the name shows its extent on the chromosome) and neighboring genes are displayed to the left and right. A red box delineates the PTHLH gene alone; this region is shown in greater detail in the 'Detailed view' area below. Scroll down to bring the entire 'Detailed view' area on screen (note b).
7. The 'Detailed view' area displays the data that were used to infer the Ensembl gene transcript for PTHLH. The data are displayed in 'tracks': each track is named towards the left of the box (for example 'Genscan'). The default tracks include matches with ESTs, Unigene, and mRNA sequences, and the transcripts predicted by the Genscan automated gene prediction software. Also shown are details of the sequence contigs spanning this region of the assembly and DNA markers in the region.
8. Clicking on any item will bring up a small menu of relevant features. Clicking on the track names (for example, Unigene) will bring up a help menu; clicking on Track information ... in this menu will then bring up an explanation of that track.
9. In addition to the default tracks, many other tracks can be displayed for the region of interest. Use the Features pull-down menu (at the top of the 'Detailed view' area) to add further tracks to (or remove tracks from) the view.
10. To get an overview of the syntenic regions in other species, go to View syntenic regions on the left side of the screen. You will be offered a choice of organisms (six at the time of writing, starting with Bos taurus), and clicking on any one of these will bring up a 'classical' map showing the overall synteny of human chromosome 12 with the chromosomal regions in the other organism (see Fig. 2). After viewing this, use the 'back' button on your browser to return to the Contig View page.
11. To view a more detailed version of the syntenic region in other organisms, go to View alignment with on the left side of the screen. A pop-up menu will offer you a choice of species (or groups of species) to align with the human sequence. At the time of writing, the first item in the menu is 5 eutherian mammals; select this item.
12. This will take you to an Ensembl AlignSlice View page, which will show the human chromosome 12 ideogram and an overview of the human chromosomal region as before, but the 'Detailed view' area below these will now display the human PTHLH gene and the corresponding regions in each of the other species. Only limited information (sequence contig and EMBL transcript tracks) is shown for each species by default, but the pull-down Features menu at the top of the 'Detailed view' area can be used to add tracks displaying any other desired features for all of the species shown.
13. To view neighboring genes in the various species, use the 'Zoom' feature towards the top of the 'Detailed view' box. Click on the larger end of the wedge (or click on the - icon) to display a larger region around the human PTHLH gene and its aligned regions in the other species.
14. At any point, you can access information from the other species by clicking on the feature of interest to bring up a menu of available information. For example, clicking on the Pthlh gene in the Mus musculus part of the window and then on the Gene:ENSMUSG00000048776 item in the menu that pops up will take you to the Ensembl Gene Report for this mouse gene: this is the mouse equivalent of the Gene Report for the human gene, which we saw in step 3 of this protocol. Use your web browser's 'back' button to return to the page displaying the alignments with human.
15. To view a more detailed comparison between the human region and just one other species, go to View alongside (upper left-hand part of the screen) and select any of the single species that appear in the pop-up menu. The new window (an Ensembl Human MultiContig View) shows a side-by-side comparison between the genomic regions in human and the second species at three levels of detail: chromosomal ideograms at the top; a 'navigational overview' of the PTHLH region below this; and at the bottom a 'Detailed view'. As before, you can select which features to display in the 'Detailed view' and can also zoom in or out.

Notes
a. The information and figures given are correct for the NCBI36 assembly current at the time of writing.
b. Each of the views (the chromosomal ideogram, the overview, and the detailed view) can be closed or opened by clicking on the corresponding '-' or '+' at the top-left corner of the view. A fourth and more detailed 'Basepair view' is closed by default but can be opened at the bottom of the screen by clicking on its '+' icon.

Figure 2. Ensembl viewer showing the synteny viewer.
The PTHLH gene in Homo sapiens is on chromosome 12 and shares synteny with dog chromosome 27. The human genes in the region are listed with a reference to the orthologous region in dog, and provide direct access to both the candidate gene and the neighboring genes in both human and dog.
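Once a sequence has been exported as a FASTA flat file (step 4 of Protocol 2), pulling out a region by its coordinates, with optional flanking sequence, has to be done by hand or with a short script. The sketch below shows one way this might be done in Python; it assumes 1-based, inclusive coordinates relative to the start of the exported sequence, and the record name and sequence are toy data invented for the example rather than real Ensembl output.

```python
def read_fasta(text):
    """Parse FASTA-formatted text into a dict of {record_id: sequence}."""
    records = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The record ID is the first word of the header line.
            header = line[1:].split()
            current = header[0] if header else ""
            records[current] = []
        elif current is not None:
            records[current].append(line)
    return {name: "".join(parts) for name, parts in records.items()}


def extract_region(sequence, start, end, flank=0):
    """Return bases start..end (1-based, inclusive) plus up to `flank` bases each side."""
    if start < 1 or end < start:
        raise ValueError("need 1 <= start <= end")
    lo = max(start - 1 - flank, 0)          # clip at the start of the sequence
    hi = min(end + flank, len(sequence))    # clip at the end of the sequence
    return sequence[lo:hi]


# Toy record for illustration only (not a real exported sequence).
EXAMPLE_FASTA = """>toy_contig hypothetical record
ACGTACGTAC
GGGTTTAAAC
"""

if __name__ == "__main__":
    seq = read_fasta(EXAMPLE_FASTA)["toy_contig"]
    print(extract_region(seq, 11, 13))           # the region itself
    print(extract_region(seq, 11, 13, flank=2))  # with 2 bases of flank
```

Note that genome browsers report chromosome coordinates, so for a real export the coordinates would first need converting to positions within the downloaded slice.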
Note
aCorrect at time of writing. Details are, of course, likely to change.
gquery.fcgi 2.5). Entrez Genomes integrates the scientific literature, DNA and protein sequence databases, 3D protein structure and protein domain data, population study datasets, expression data, assemblies of complete genomes, and taxonomic data. There is a comprehensive map viewer, a genome browser for eukaryotic genomes, Plant Genomes Central, microbial and viral genome databases, and gMap, a comparative analysis of microbial genomes. The archaea, bacteria, and eukaryota can all be viewed by either chromosome, plasmid, or organelles. Information leads directly to the relevant NCBI files. NCBI's genome resources are covered in more detail in Chapter 1, and there are extensive links to Ensembl.

2.4.6 The Institute for Genomic Research (TIGR) Comprehensive Microbial Resource (CMR)

defined by sequence and/or located in the NCBI Map Viewer. This comprehensively links NCBI resources for each gene in a single-page, easy-to-view format.
Table 1. Databases providing BioMart access to genome data for various species

Database    URL                                           Species
Ensembl     http://www.ensembl.org/Multi/martview         Various vertebrates and Apis mellifera, Caenorhabditis elegans, Drosophila melanogaster, and Anopheles gambiae
Gramene     http://www.gramene.org/Multi/martview         Oryza sativa, Zea mays, and Arabidopsis thaliana
Wormbase    http://www.wormbase.org/biomart/martview      Caenorhabditis elegans
euGenes     http://insects.eugenes.org/BioMart/martview   Various Drosophila species
                                                          Various species, archaea, bacteria, and eukaryota
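A BioMart query of the kind offered by the databases in Table 1 comes down to three choices: a dataset, a set of filters (for example chromosome, gene type, status, and a Gene Ontology term), and the attributes to export. The sketch below imitates that logic on a small tab-separated table in Python. It is not the BioMart interface or its file format: the column names, gene identifiers, and descriptions are all hypothetical and exist only to show how filters and attributes combine.

```python
import csv
import io

# Made-up rows for illustration; the columns are assumptions, not BioMart's.
EXAMPLE_TSV = (
    "gene_id\tchromosome\tbiotype\tstatus\tgo_terms\tdescription\n"
    "GENE_A\t9\tprotein_coding\tKNOWN\tGO:0003700;GO:0005634\thypothetical factor A\n"
    "GENE_B\t9\tprotein_coding\tNOVEL\tGO:0003700\thypothetical factor B\n"
    "GENE_C\t12\tprotein_coding\tKNOWN\tGO:0003700\thypothetical factor C\n"
    "GENE_D\t9\tpseudogene\tKNOWN\t\thypothetical pseudogene D\n"
)


def query(tsv_text, filters, go_term, attributes):
    """Keep rows matching every filter and carrying go_term; return chosen columns."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    results = []
    for row in reader:
        # Every filter must match exactly (filters are combined with AND).
        if any(row[field] != wanted for field, wanted in filters.items()):
            continue
        # GO terms are stored as a semicolon-separated list in this toy table.
        if go_term not in row["go_terms"].split(";"):
            continue
        results.append({name: row[name] for name in attributes})
    return results


if __name__ == "__main__":
    hits = query(
        EXAMPLE_TSV,
        filters={"chromosome": "9", "biotype": "protein_coding", "status": "KNOWN"},
        go_term="GO:0003700",
        attributes=["gene_id", "description"],
    )
    for hit in hits:
        print(hit["gene_id"], hit["description"], sep="\t")
```

Because the filters are combined with AND, each added filter can only shrink the result set, which is why it helps to check the count of matching genes before exporting.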
Protocol 4

Specifying the dataset
3. To specify the dataset you want to search within, on the right-hand part of the screen, there should be menus entitled Database and Dataset (if not, click on the »Dataset link at upper left of the screen and they should appear). Using these two menus, select ENSEMBL 42 GENE (SANGER) and Homo sapiens genes (NCBI36) as the database and dataset, respectively (notes a and b).

Specifying the search criteria
4. Next, click on the »Filters link on the left of the screen. On the right will appear several boxes (REGION:, GENE:, etc.), each with a small '+' symbol next to it. These are the filters that will be used to define the type of data (which genes, in this case) you want to download.
5. In general, clicking on '+' next to any of these filters will open a menu within which you can set parameters; check boxes in each menu are used to determine which parameters to take into account.
6. Click on the '+' next to REGION: to expand its menu. Below it will appear various options by which you can specify the region (chromosome, base-pair coordinates, band, etc.). From the pull-down menu next to Chromosome, select 9, and ensure that the check box to the left is 'ticked' (see Fig. 4).
7. Scroll down to the Gene filter, click on the adjacent '+' to expand its menu, and scroll down again to Gene type.
8. Click on Protein_coding within the Gene type menu; the adjacent check box should change to a 'tick' automatically (if not, click to tick it).
9. Still in the Gene type menu, ensure that the Status menu is set to Known (to retrieve only genes of known function) and tick the check box next to it.
10. In the same way, go into the Gene Ontology menu (see Chapter 9 for details of Gene Ontology (GO), which is a standardized vocabulary for describing genes). Tick the check box next to Molecular function and enter 'GO:0003700' into the text field next to it. (GO:0003700 is the GO term for 'transcription factor activity'; if you do not know the relevant GO term, you can use the adjacent Browse button to jump to the QuickGO browser, which allows you to browse GO and identify the correct term.)
11. Leave the check box next to Evidence code (Molecular function) unticked. This means that we will be searching for proteins that have been assigned the GO term 0003700 under 'molecular function', using any type of available evidence.
12. Other menus (Expression, Protein, SNP) would allow you to apply further criteria in selecting your data, but we will leave these for now.

Specifying the features to retrieve
13. Click on the »Attributes (features) link on the left. On the right, you will be offered the now-familiar style of menu. Expand the Gene submenu (by clicking on the adjacent '+' button)
30 CHAPTER 2: NAViGATING SEQUENCED GENOMES METHODS AND APPROACHES 31
and, under Ensembl Attributes, ensure that only the Ensembl Gene 10 and Description check ARTEMIS (see Fia. 5, aiso available in the color section) is written in Java
boxes are ticked. java.sun.com/2.21) and is therefore available for many different computer systems
(Linux, UNIX, Macintosh, and Windows). It is freely available, but you need an
Previewing and retrieving the results installation of Java on your computer (Java is also freely available). For details, see
14. Click on the »Dataset link (the database and dataset will again be displayed on the the ARTEMIS web site (http://www.sanger.ac.uk/Software!Artemis/2.20).
then on the Count button towards the top of the screen. This will display the number We recommend that you consult the ARTEMIS manual (available at the same site
matching your criteria and the total number of genes in the dataset, next to the »Dataset
as the software) to learn about the many available features, as this is beyond the
link towards the top of the page (at the time of writing, this was 44/31 148 genes).
scope of this chapter. For example, ARTEMIS allows you to define your own features,
15. Now click on the Results button. On the right, you will see a table displaying the first few edit and annotate a sequence, and save the results in various standard formats
genes retrieved by your query. Above this table, pull-down menus allow you to specify the file
(e.g. GenbankJ.
format for the output; ensure that the Export all results to menu is set to File and the rows
as menu is set to TSV (tab-separate value output) and click the adjacent Go button.
Protocol 6 provides a simple example of using ARTEMIS to view an annotated
sequence file downloaded from Ensembl.
16. A text file will be downloaded, con+~;n;n~ the results of your search. The results of this
particular example are given in the folder for this chapter on the book's web-site
as 'BioMarttxt 2 . 1S '.
17. If you specify HTML (in the rows as menu), the result will be an HTML file of your results,
which can be opened in your browser and which will contain active links for the relevant
data. An example of such a file is given in the Protocol_ 4 folder for this chapter on the book's
Notes
a These were current at the time of writing; there may be more recent versions by the time you read this.
b It is also possible to specify a second dataset to search in combination with the first; this is set using the lower of the two »Dataset links on the left of the screen.
In this example, we started with a 'blank' BioMart query. However, while viewing
information in Ensembl (for example, while following Protocol 2 in this chapter),
you will also notice links such as Export Gene info in region, with the BioMart
logo of colored dots. Clicking on these links effectively completes the first steps of
a BioMart query for the gene, region, etc. in question, leaving you only to specify
which aspects of the data you want to download and the format for export.
If you have particular queries that BioMart does not cater for, you can query the
Ensembl database directly by connecting to its MySQL server. You will need to know
how to use SQL and to understand the database schema. We cannot go into more
detail here, but more information can be found in the Ensembl documentation.
Figure 5. Screenshot of the ARTEMIS sequence viewer and annotation tool (see page xvii for color version).
The main window is divided into several sections. Below the main menu is information about the current selection and the sequences being viewed. Below this (and filling most of the top half of the screen), the 'overview' section shows stop codons in all six reading frames (short vertical black lines) and features on both strands (colored boxes, mainly in blue and yellow); the vertical scroll bar on the right controls the scale (zoom) of this window and the horizontal one scans along the sequence. The 'base view' (just below the overview) shows the sequence of both strands and, above and below these, the translation in all six frames; again, the scroll bars control the zoom and position of this window. The bottom third of the screen shows a list of annotated features. Many other aspects of the sequence can be displayed or hidden (see text).

2.7 Browsing genomes 'off line' using stand-alone software
There are alternatives to using a remote web site for genome browsing. ARTEMIS (http://www.sanger.ac.uk/Software/Artemis/ 2.20) is a stand-alone program that allows you to browse and annotate genomes (16). It can read different formats, including FASTA files, EMBL, and Genbank format, as well as GFF format. These formats contain the sequence data, as well as the annotations of this sequence. The sequence and its features are displayed graphically, in broadly the same way as some of the online genome browsers described above.
32 CHAPTER 2: NAVIGATING SEQUENCED GENOMES METHODS AND APPROACHES 33
Another option for browsing genomes locally is to install Ensembl on your own computer. This requires some experience with installing software and a knowledge of Perl, MySQL, and Java. You can download the Ensembl data as well as their program code free of charge. This means you can run it on a copy of the public data, run it with your own data, or you can install the whole annotation pipeline. There is some documentation on the Ensembl webpage on how to install the database locally, but installing and running the whole annotation pipeline will require considerable expertise and hardware.

Protocol 6
Using ARTEMIS to display the human genome sequence surrounding the gene Alien
This protocol assumes that you have installed ARTEMIS on your local computer. ARTEMIS is available from http://www.sanger.ac.uk/Software/Artemis/ 2.20, along with instructions for installing and using it.
1. You first need to download an EMBL file containing the relevant sequence and its annotation. Either download a copy of this file (Alien.embl 2.22) from the Protocol_6 folder for this chapter at the web site that accompanies this book and proceed to step 7, or recover the data from Ensembl by following steps 2-6.
2. Go to Ensembl via http://www.ensembl.org 2.2 and click on Homo sapiens under the Mammalian genomes heading. Type 'alien' in the Search e! Human field at the top right and press Go.
3. The results page will show the Ensembl protein family and, below it, the Ensembl gene (correct at time of writing). As you can see, this gene is an Alien homolog; Alien was first discovered in Drosophila. Click on the link Ensembl gene: ENSG00000166200 to go to the Ensembl Gene Report.
4. To export the data, click on Export gene data on the left of the page. This will take you to the 'Ensembl Human Export View'.
5. Under context, enter '5000' in each of the two boxes (Bp upstream and Bp downstream) to export 5000 bp of sequence either side of the Alien gene, choose EMBL as output format, and then press Continue.
6. The page will show a series of check boxes for the various features that can be included in the output. Select the following features: Repeat features, Prediction features (genscan), Gene Information, and Vega Gene Information. Set the output format to Text and click Continue. Save the text file as 'Alien.embl'.
7. Start ARTEMIS (e.g. by double clicking the Artemis.jar icon), and open the file 'Alien.embl'. A window should appear, similar to that shown in Fig. 5.
8. The window is divided into three main sections. The upper section gives a coarse view of the sequence and is annotated on each strand; it also displays the positions of stop codons in each of the three reading frames on each strand. Scroll bars allow you to scroll along the sequence (horizontal scroll bar) or to zoom in or out (vertical scroll bar). The middle section displays similar information, but is initially set to a higher resolution (again, controllable by its vertical scroll bar) to show the nucleotide sequence and the amino acids encoded in each reading frame. You can use these two sections to look at the same region at different levels of detail, e.g. to see an overview of the exon/intron structure of a gene and the sequence of a particular exon/intron boundary at the same time.
The bottom section lists the annotated features; double-clicking on a feature will bring it to the center of the upper two windows. Many other tools - detailed in the ARTEMIS manual - are available for viewing and editing the sequence and its annotation.

2.8 Linking your own data to a genome browser
The UCSC Genome Browser and Ensembl both allow you to overlay data from other sources, so that you can have additional information displayed in the browser that is not actually part of UCSC's or Ensembl's data, but is provided by other sources (17,18). This means that you can also feed in your data and have it displayed in the Ensembl or UCSC browser alongside their own data.
There are different ways of achieving this: by uploading your data to the UCSC or Ensembl site; by linking UCSC/Ensembl views to data on your web site; or by setting up a distributed annotation system (DAS) server on your computer. In most cases, the information is stored in text files and can be in various formats (see http://genome.ucsc.edu/goldenPath/help/customTrack.html 2.23). Unfortunately, the formats differ slightly in the features they support. Check the file formats carefully: some formats require spaces to separate the columns, whilst others require tabs. Please refer to the format descriptions to find out what the best file format for your data might be. Below is shown the example file that we will use in Protocol 8, which illustrates some general (although not universal!) points about these formats.

#example file
browser position chr19:6901101-7100000
browser hide all
track name=NavigatingGenomesTrack1 description='This is an example of how to ~
link your data into Ensembl/UCSC' visibility=2 color=255,0, useScore=2 ~
url=http://www.ebi.uniprot.org/uniprot-srv/uniProtView.do?proteinAc=$$
chr19 Navigator test_element 6903521 6903982 1000 P61203
chr19 Navigator test_element 7000000 7001000 400 P61205
chr19 Navigator test_element 7010000 7030000 400 P61205
chr19 Navigator test_element 7050000 7060000 400 P61205
chr19 Navigator test_element 7010000 7090000 200 P61204
browser dense Test2
track name=NavigatingGenomesTrack2 description='This is an example of how to ~
show even more data' visibility=1 color=255,0,255 useScore=1 ~
url=http://www.ebi.uniprot.org/uniprot-srv/uniProtView.do?proteinAc=$$ ~
color=0,255,0 visibility=1
chr19 Navigator1 test_element1 7010000 7090000 400 P61204
In this example, the first line is a comment (indicated by '#'). The next three lines contain instructions for the browser: the line starting with 'browser' tells the browser which part of the genome to display; the line starting with 'track' states the name of the information track. The 'track' line lets you define some additional information on how to display the information and it also allows you to provide a URL to link to if the user selects the particular element (just as most of the features you have already viewed in Ensembl had links to further information). In this example, we provide links from this track to UniProt entries, by stating:
url=http://www.ebi.uniprot.org/uniprot-srv/uniProtView.do?proteinAc=$$
(we will discuss the '$$' in a moment). Obviously, you can change this to point to any web page.
The next five lines contain general feature format (GFF) fields, describing five features that will be displayed in this track. Each line consists of the sequence name (a chromosome or a contig - in this case, 'chr19'), the source (for example, the program that generated this feature - in this case 'Navigator'), the name of this type of feature (such as 'CDS', 'start_codon', or 'exon' - in this case, 'test_element'), the start and end positions of the feature, a score of between 0 and 1000, which determines the level of gray in which this feature is displayed, the strand ('+', '-', or '.' for features to be shown on the plus-strand, minus-strand, or both), the frame (a number between 0 and 2 that represents the reading frame of the first base, or a '.' if the feature is not an exon) and finally the 'group': all lines with the same group (for example, 'P61205') are linked together into a single item, which can be used, for example, if you want to display the linked exons of a single gene. When creating the link for each feature, the genome browser will substitute the group for the '$$' in the generic link, so that in this example each of the five features will link to the respective UniProt entry.
It is possible to display several tracks; for example, you could use one track for each alternative splice variant. The final lines of the example file (starting with 'browser dense Test2') define a second custom track, this time containing only a single feature.
The Ensembl browser behaves slightly differently from the UCSC browser. At the time of writing, it seems that the behavior of the UCSC browser conforms better to the examples given on the instruction pages of the browsers and provides more meaningful error messages. Therefore, we will use a slightly simpler example in Ensembl. In Protocol 7, we will use a simplified file format to display basic information in the Ensembl browser. In Protocol 8, we will use the more extensive file (shown above) to display slightly more complex data in the UCSC browser.

Protocol 7
Linking your own data to Ensembl
1. Download a copy of the file Linked_dataENS.txt 2.24 from the Protocol_7 folder for this chapter on the book's web site. Open the file in a text editor or word processor and examine its contents. If you experiment by editing this file, be sure to save it as 'text only' after editing and take care with end-of-line and tab characters.
2. Go to http://www.ensembl.org/Homo_sapiens/index.html 2.25.
3. In the Search box at the top center of the screen, type '19: 6900000-7100000' (to view the region from 6.9 to 7.1 Mb on chromosome 19) and then click the Go button or press 'Enter' on your keyboard.
4. This brings up the Contig View page for chromosome 19, displaying the specified region. Scroll down if necessary to see the 'Detailed view' panel.
5. Click on the DAS Sources menu (at the top of the 'Detailed view' panel) and then click on Manage sources at the bottom of the list that appears.
6. A new window (called 'Dasconfview') will appear. On the left, under Manage Sources, click on the link Upload your data.
7. The refreshed page should be headed 'DAS Wizard Step 1 of 3: Data location', and there are fields at the top for your e-mail address and a password of your choice. You need to complete these fields so that Ensembl can contact you at a future date if necessary and so that nobody else can modify your uploaded data.
8. Use the Choose file button (or Browse button as it will appear in some browsers) to find and upload the file Linked_dataEnsEMBL.txt 2.24. When this has been done, the filename should appear next to the Choose file/Browse button.
9. Click the Next button just below this.
10. If everything goes well, you will be taken to a new page headed 'DAS Wizard Step 2 of 3: Data appearance', in which case go on to step 12.
11. If this does not happen, look for an error message towards the top of the screen (just below where it says 'Please upload your data location'), along the lines of:
ERROR: could not upload data due to 'ERROR: Invalid format. Line 1'
In this case, check that the Linked_dataEnsEMBL.txt 2.24 file has not been corrupted - check especially (using a text editor or Word - but remember to save as a text-only file) that the fields in the file are divided by tab characters (not spaces). Try uploading again and, if it still fails, copy and paste the contents of the file into the Paste your data window instead of using the Choose file/Browse option in step 8.
12. You should now be on the page headed 'DAS Wizard Step 2 of 3: Data appearance'. Towards the top of the page are several check boxes (next to Enable on); these check boxes determine which views your data will be visible in. The contigview box should be checked (if not, click to tick it).
13. Click on the Next button and you will be taken to the next page, headed 'DAS Wizard Step 3 of 3: Display configuration'. This page gives you numerous options for the appearance of the data you have just uploaded.
14. Under Name and Track label at the top of the screen, enter any descriptive name you like (for this example, use 'NavigatingGenomes' and 'NavigatingGenomesTrack', respectively). You can leave the other fields blank and the other options at their default settings. Click on the Finish button just below these options.
15. You will be taken back to the first step (as in step 6); close this window and return to the Contig View page (as you left it in step 5). Refresh this page using your browser's 'reload' button.
16. Click on the 'DAS Sources' menu to open it (if it is already open, close it and then reopen it). You should see 'NavigatingGenomesTrack' towards the bottom of the list. The check box next to it should be ticked (if not, click to tick it).
17. Now click the Refresh button toward the top of the 'Detailed view' panel. The screen will refresh and you will see your added data displayed alongside the other features. Clicking on any one of the features will bring up information from the file you uploaded (see Fig. 6).
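
The feature lines described in section 2.8 (sequence name, source, feature type, start, end, score, strand, frame, and group) and the '$$' substitution in the track URL can be illustrated with a short Python sketch. This is only an illustration under stated assumptions, not the code used by either browser: the helper names, the strand and frame values, and the example.org link template are all hypothetical, and the sketch assumes nine tab-separated fields per line.

```python
from collections import defaultdict
from typing import NamedTuple


class Feature(NamedTuple):
    """One feature line, with the nine fields listed in section 2.8."""
    seqname: str
    source: str
    feature: str
    start: int
    end: int
    score: int
    strand: str
    frame: str
    group: str


def parse_feature_line(line: str) -> Feature:
    """Split one tab-separated feature line (assumed layout: nine fields)."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != 9:
        # A line separated by spaces instead of tabs ends up here.
        raise ValueError(f"expected 9 tab-separated fields, got {len(fields)}")
    seqname, source, feature, start, end, score, strand, frame, group = fields
    return Feature(seqname, source, feature, int(start), int(end),
                   int(score), strand, frame, group)


def link_for(feature: Feature, url_template: str) -> str:
    """Replace '$$' in the track URL with the feature's group."""
    return url_template.replace("$$", feature.group)


def by_group(features):
    """Collect features that share a group, as the browser links them."""
    grouped = defaultdict(list)
    for feat in features:
        grouped[feat.group].append(feat)
    return dict(grouped)


# Hypothetical input: coordinates and groups follow the example file,
# but the strand ('+') and frame ('.') columns are assumed values.
EXAMPLE_LINES = [
    "chr19\tNavigator\ttest_element\t6903521\t6903982\t1000\t+\t.\tP61203",
    "chr19\tNavigator\ttest_element\t7000000\t7001000\t400\t+\t.\tP61205",
    "chr19\tNavigator\ttest_element\t7010000\t7030000\t400\t+\t.\tP61205",
]
URL_TEMPLATE = "http://example.org/entry?acc=$$"  # placeholder, not a real service

features = [parse_feature_line(line) for line in EXAMPLE_LINES]
for feat in features:
    print(feat.seqname, feat.start, feat.end, link_for(feat, URL_TEMPLATE))
```

Rejecting any line that does not split into nine tab-separated fields mirrors the advice in Protocol 7, step 11: fields separated by spaces rather than tabs are a common cause of upload errors.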
Figure 6. Screenshot showing part of the 'Detailed view' panel of the Contig View page of the Ensembl genome browser (see page xviii for color version).
The data that was uploaded in Protocol 7 is shown (dark bars just below the chromosome scale) in the track called 'NavigatingGenomesTrack'. In this shot, the user has clicked on one of the uploaded features (P61205), and the small pop-up window displays information about this feature.

By using the other options available through Ensembl, it is possible to upload and present more complex data (including data held on your own or another web site), to provide links (clickable from Ensembl) from your features to other databases such as UniProt, and so on. However, this is beyond the scope of the present chapter. Instead, in the next protocol, we will upload the slightly more complex file to the UCSC browser.

Protocol 8
Linking your own data to the UCSC Genome browser
1. Download a copy of the file Linked_dataUCSC.txt 2.26 from the Protocol_8 folder for this chapter on the book's web site. Open the file in a text editor or word processor and examine its contents. If you experiment by editing this file, be sure to save it as 'text only' after editing and take care with end-of-line and tab characters.
2. Go to http://genome.cse.ucsc.edu/ 2.27 and choose Genomes from the menu bar at the top.
3. Click the add custom tracks button towards the top of the screen.
4. Ensure that the current human genome sequence assembly is selected (menus towards the top of the screen).
5. Click on the button called Browse or Choose file (the button name will depend on which web browser you are using) just above the first large text field (called 'Paste URLs or data:') and find the file that you saved in step 1. Alternatively, you can copy the contents of the file and paste them into the window or put a URL that points to your file into the window. Press Submit.
6. You will be taken to a new page headed 'Manage Custom Tracks'. At the top is a table listing the tracks you have added so far. There were two tracks defined in the example file, 'NavigatingGenomesTrack1' and 'NavigatingGenomesTrack2'. (Links and options from this page allow you to delete, view, or edit some tracks, but we will not do this now.)
7. To the right of the table, click the button go to genome browser.
8. The UCSC browser should display the appropriate segment of chromosome 19, showing the custom tracks that you have just uploaded (see Fig. 7). Try clicking on the newly added features or their names.
9. All other tracks in the genome browser will be disabled, but can be activated by selecting from the numerous pull-down menus underneath the chromosome graphic; click the refresh button (underneath the graphic window - not the 'reload' button on your own web browser) after selecting which tracks to activate.

Figure 7. Screenshot showing part of the UCSC genome browser.
The data that was uploaded in Protocol 8 is shown in the graphic window, just beneath the chromosome distance scale. Additional tracks (STS markers, RefSeq genes, and spliced ESTs) have also been activated, using the pull-down menus further down the screen.
Like the Ensembl browser, the UCSC browser offers many opportunities to incorporate your own data, manipulate and display it, and integrate it with other features both within the browser and beyond. Many of these options are beyond the scope of this chapter, but the reader is encouraged to explore and to refer to the online help files.

Acknowledgements
The authors would like to thank all people involved in the many projects presented here, especially the people writing and maintaining the excellent online documentation. T.S. was a British Antarctic Survey/European Bioinformatics Institute/St Edmund's College Research Fellow 2003-2006. This paper was produced by M.S.C. and T.S. within the BIOREACH/BIOFLAME core programs.

CHAPTER 3
Sequence similarity searches
Jaap Heringa and Walter Pirovano

1.1 Comparative sequence analysis
Ideally, the alignment matches the nucleotide or amino acid sequences from either sequence according to their evolutionary descent from a common ancestor, with conserved residues at matched positions and inserted/deleted fragments intervening at proper sequence positions. Often, however, evolution has led to widely diverged sequences where the ancestral ties have become blurred beyond recognition, leading to biologically incorrect alignment.
Another confounding issue is the fact that an increasing number of cases are identified with nonorthologous displacement, where enzymes carrying out an identical function in different organisms belong to entirely different protein families and thus are not expected to show any sequence similarity. For example, the ornithine decarboxylase spe1 in Saccharomyces cerevisiae has a completely different domain structure from - and is not related to - the Escherichia coli ornithine decarboxylase isozymes speC and speF (1). Nor are sequence alignment techniques able to trace evolutionary cases of horizontal gene transfer or functional displacement of one gene by another within a genome.

1.3 Similarity versus homology
The term 'homologous sequence' is often used when in fact a sequence should only be described as 'similar' to a given reference sequence (2). Whereas sequence similarity is a quantification of an empirical relationship of sequences expressed using a gradual scale, 'homology' denotes an inference of a common ancestor between the sequences. Sequence similarity is normally used to assess the likelihood of homology, but homology itself is a qualitative state: a pair of sequences is either homologous or not. As protein tertiary structures are more conserved during evolution than their coding sequences, homologous sequences are assumed to share the same protein fold. Although it is possible in theory that two proteins evolve different structures and functions from a common ancestor, this situation cannot be traced and so such proteins are seen as unrelated. However, numerous cases exist of homologous protein families where subfamilies with the same fold have evolved distinct molecular functions. The term homology is often used in practice when two sequences have the same structure or function, although in the case of two sequences sharing a common function this ignores the possibility that the sequences are analogs resulting from convergent evolution, now often referred to as nonorthologous displacement.
Unfortunately, it is not straightforward to infer homology from similarity, as enormous differences exist between sequence similarities within homologous families. Many protein families of common descent comprise members that share pairwise sequence similarities that are only slightly higher than those observed between unrelated proteins. This region of uncertainty has been characterized to lie in the range of 15-25% sequence identity (3) (see below) and is commonly referred to as the 'twilight zone'. There are even some known examples of homologous proteins with sequence similarities below the randomly expected level given their amino acid composition (4). As a consequence, it is impossible to prove using sequence similarity that two sequences are not homologous.
The similarity score for two sequences can be calculated from their alignment (see below), such that it depends on the actual scoring matrix and gap penalties used. It has also been calculated as a fraction of a maximal score possible for two sequences using a normalized scoring matrix and by normalizing the raw alignment score by the length of the shorter sequence (5).

1.4 Techniques for pairwise alignment
1.4.1 The dynamic programming algorithm
Protein sequences mutate to varying degrees of divergence through evolution. In order to identify homologous proteins and reveal important similarities, a range of sequence alignment methods are commonly used (for a recent overview, see 6). These methods rely mainly on approximated evolutionary models that aim to reflect as accurately as possible the evolutionary paths that connect two or more protein sequences.
Many methods for the calculation of sequence alignments have been developed, of which implementations of the dynamic programming algorithm (7, 8) are considered the standard in yielding the most biologically relevant alignments. (For three or more sequences, these methods apply the progressive strategy (9), where sequences are hierarchically aligned in pairs according to a pre-generated tree, based on their sequence similarity; see Chapter 11 for a discussion of multiple sequence alignment).
The dynamic programming algorithm (7) requires a scoring matrix, which is an evolutionary model expressed in the form of a symmetrical 4x4 exchange matrix for nucleotide sequences or a 20x20 matrix for amino acids: each matrix cell approximates the evolutionary propensity for the mutation of one nucleotide or amino acid type into another, including self-conservation. For this purpose, it is common to use pre-determined substitution scores (e.g. the scores from the BLOSUM (10) and PAM (11) series and more recently the JTT (12), Gonnet (13), VT (14), and VTML (15) series) that have been derived using a specific set of 'true' alignments. However, these 'standard' substitution scores reflect a standardized evolutionary model and introduce inconsistencies when applied to nonstandard cases (16). Although this does not impact too severely on alignments between closely related sequences, sequences in the so-called 'twilight zone' (<30% sequence identity) are extremely difficult to align (3), partly for this reason. This is because the evolutionary scenario relating them becomes virtually undetectable against the 'noise' introduced by the extent of mutational change that has occurred (17).
The dynamic programming algorithm also relies on the specification of gap penalties, which model the relative probabilities for the occurrence of insertion/deletion events during evolution. In most available methods, a penalty score is applied for creating (opening) a gap, and a further penalty score is added for each extension of the gap (affine gap penalties), so that the chance for an insertion/deletion depends linearly upon the length of the associated fragment. Given an exchange matrix and gap penalty values (which together are commonly called the scoring scheme), the dynamic programming algorithm is guaranteed to produce the highest scoring alignment of any pair of sequences, the optimal alignment.

an inserted domain B in aligning a two-domain sequence AC (where A and C represent domains) with a three-domain sequence ABC, are often too costly so that such sequences become misaligned.
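
To make the role of the scoring scheme in section 1.4.1 concrete, the following Python sketch computes the optimal global alignment score of two sequences by dynamic programming with affine gap penalties. It is a minimal illustration under stated assumptions rather than any published implementation: a simple match/mismatch score stands in for a real exchange matrix such as BLOSUM or PAM, a gap of length k is assumed to cost gap_open + k * gap_extend, and the function name and default values are hypothetical.

```python
NEG_INF = float("-inf")


def affine_global_score(a, b, match=1, mismatch=-1, gap_open=-4, gap_extend=-1):
    """Optimal global alignment score of a and b with affine gap penalties.

    Assumed scoring scheme: match/mismatch scores replace an exchange
    matrix, and a gap of length k costs gap_open + k * gap_extend.
    """
    n, m = len(a), len(b)
    # M[i][j]: best score with a[i-1] aligned to b[j-1]
    # X[i][j]: best score with a[i-1] aligned to a gap (gap in b)
    # Y[i][j]: best score with b[j-1] aligned to a gap (gap in a)
    M = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    X = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    Y = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0
    for i in range(1, n + 1):
        X[i][0] = gap_open + i * gap_extend
    for j in range(1, m + 1):
        Y[0][j] = gap_open + j * gap_extend

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            # Extend the alignment with a matched pair of residues.
            M[i][j] = s + max(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1])
            # Open a new gap (open + extend) or extend an existing one.
            X[i][j] = max(M[i - 1][j] + gap_open + gap_extend,
                          Y[i - 1][j] + gap_open + gap_extend,
                          X[i - 1][j] + gap_extend)
            Y[i][j] = max(M[i][j - 1] + gap_open + gap_extend,
                          X[i][j - 1] + gap_open + gap_extend,
                          Y[i][j - 1] + gap_extend)
    return max(M[n][m], X[n][m], Y[n][m])


if __name__ == "__main__":
    # Toy stand-in for the AC versus ABC domain case: one long gap
    # is needed to skip the inserted 'B' segment.
    print(affine_global_score("AAAACCCC", "AAAABBBCCCC"))
```

Keeping three tables, one per alignment state, is what lets gap opening and gap extension be charged separately; with a single table only a fixed per-position gap cost could be modeled. The sketch returns the score only; recovering the alignment itself would require an additional traceback, which is omitted here.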
of the complexities of protein evolution and distant relationships observed in nature, any statistical scheme will inevitably lead to situations where a sequence is assessed as unrelated whilst it is in fact homologous (false negative), or the inverse, where a sequence is deemed homologous whilst it is in fact biologically unrelated (false positive). A frequent cause of false positives - and hence of erroneous transfer of annotation - is based on similarity found over relatively short sequence regions, or similarity based on different domains in multi-domain structures (20).

1.8 Protein domains
Many protein families have diverged from common ancestors by evolving different combinations and associations of domains (21-23). Domains are characterized as semi-independent three-dimensional units in proteins, often with a particular function, observed to be genetically mobile and frequently moving within and between biological systems through mechanisms of gene or exon shuffling. An understanding of the domain organization of a protein sequence is crucial for structural and functional genomics initiatives and the reader is referred to Chapter 8 for a discussion of protein architecture and domains.
The correct partitioning of a protein into its putative domains is especially important in the comparative analysis of entire genome sequences. Consideration of domain architecture will shed light on the evolution, structure, and function of a protein family. For example, the 'Rosetta Stone' genome analysis method (24) exploits the fact that a multi-domain protein in one organism may be present as separate (and hence, presumably, interacting) proteins in another organism. It is clear that such analysis requires accurate sequence comparison tools at the level of the domain rather than of the whole protein.
Domain annotation of a protein sequence in the absence of structural information has proved to be a difficult problem. For example, the method of Wheelan et al. (25) is based on the fact that domains have a distinct size distribution, averaging at 100 residues. Accurate predictions are limited to two-domain proteins with less than 300 residues. George and Heringa (26) improved the delineation of protein domain boundaries to 52% using a consistency-based protocol over sets of protein ab initio three-dimensional model structures generated using distance geometry. Currently, most annotated domain databases are based on inferring domains by sequence similarity searches (27-32). A number of these search techniques will be discussed in the next section.

A typical application to infer knowledge for a given query sequence is to compare it with all sequences in an annotated sequence database. Unfortunately, the dynamic programming algorithm (see above) is too slow for repeated searches over large databases and may take many hours for a single query sequence on a standard workstation. However, for any biologist who has a new protein sequence of unknown functionality, comparison with all known and annotated sequences is paramount.
Therefore, fast routines have been devised that enable database searches on even small computers with only a small loss of sensitivity compared with searches using full dynamic programming. With the recent advent of parallel multi-processor computers at central sites, researchers can routinely perform multiple sequence searches over complete sequence databases. However, for large-scale application of the dynamic programming technique, the computational requirements are still prohibitive. For example, consider the task of searching the Swiss-Prot database against a query sequence of 400 amino acids. As release 50.0 of UniProtKB/Swiss-Prot contains 222289 sequence entries, comprising 81585146 amino acids, finding local alignments via dynamic programming over this database would entail on the order of 10^10 matrix operations (400 x 81585146 is roughly 3.3 x 10^10). Given the fact that many servers routinely handle thousands of such queries a day (over 50000 per day in the case of the NCBI server), it is clear that the application of dynamic programming would lead to unfeasible waiting times.
Although some special hardware has been designed to accelerate the dynamic programming algorithm, the solution has depended largely on the development of several heuristic algorithms that represent shortcuts to speed up the basic alignment procedure. These include the currently most widely used heuristic method for scouring sequence databases for homologies, PSI-BLAST (33), an extension of the BLAST technology (34), and FASTA (35), which is another commonly used heuristic method for fast sequence comparison. At the same time, advances in computer hardware have made it possible to use some more computationally intense approaches such as the hidden Markov modeling-based tools SAM-T99 (36), SAM-T2K (37), and HMMER2 (38).

2.1 Should one compare protein or nucleotide sequences?
As long as we are considering sequences between encoded proteins, the actual pairwise comparison between two sequences can take place at the nucleotide or peptide level. However, the most effective way to compare sequences is at the protein level (39), which requires that nucleotide sequences first be translated in all six reading frames, followed by comparison with each of these conceptual protein sequences.
Although mutation, insertion, and deletion events take place at the DNA level, there are several reasons why comparing protein sequences can reveal more distant relationships:
1. Many mutations within DNA are synonymous, which means that they do not lead to a change in the corresponding amino acids. As a result of the fact that most evolutionary selection pressure is exerted on protein sequences, synonymous mutations can lead to an overestimation of the sequence divergence if compared at the DNA level.
2. Evolutionary relationships can be expressed more finely using a 20x20
amino acid exchange table than by using exchange values among four nucleotides, leading to a significant increase in statistical subtlety for protein sequences. Amino acid substitution matrices incorporate subtle differences in physicochemical properties among the 20 residue types, rendering protein sequences more informative than nucleotide sequences.
3. DNA sequences contain noncoding regions, which should be avoided in homology searches. Note that the latter is still an issue when using DNA translated into protein sequences through a codon table. However, a complication arises when using translated DNA sequences to search at the protein level because frame shifts can occur, leading to stretches of incorrect amino acids in the wrongly translated product and possible elongation of sequences due to missed stop codons. On the other hand, frame shifts typically result in stretches of highly unlikely and distant amino acids, which can be used as a signal to trace their occurrence.

2.2 Curated and annotated sequence databases

The success of sequence similarity searches depends crucially on the quality and coverage of the sequence database used. Although the amount of raw sequence data is increasing rapidly, and although modern sequencing techniques achieve a very high accuracy, the utility of this data depends crucially upon its annotation. Incorrect annotation of database sequences can distort similarity searches (for example, when the location or structure of predicted genes in the database sequence is incorrect), or can lead to false inferences when genuine similarities are found but the database sequence has been annotated with an incorrect function.

As inferring and experimentally validating the annotations represents a bottleneck, there is a rapidly widening gap between sequence and annotation data. This is reflected by the fact that many sequences have 'unknown' as their functional annotation, whilst an increasing number of sequences, especially those originating from bacterial genomes, have annotations such as 'conserved hypothetical'. Conserved hypothetical open reading frames have homologs, usually in other organisms (which at least gives reassurance that the open reading frame truly is a gene - see Chapter 4), but none of these homologs have known functions.

Although many new protein structures are now being determined using X-ray crystallography, nuclear magnetic resonance spectroscopy, and cryoelectron microscopy, without direct experimental evidence there is considerable difficulty in assigning functions to proteins from their structures. This can even be the case for homologs of well-characterized proteins because of the recruitment of similar proteins for divergent functions. Computational prediction methods can aid to some extent, but for reliable annotation, manual curation is often essential.

Widely used annotated databanks for homology searches include the annotated EMBL, GenBank, and DDBJ for nucleotide sequences, whilst for protein sequences the Swiss-Prot, Protein Information Resource (PIR), TrEMBL, GenPept, NR-NCBI, and NR-ExPasy databases are popular. Also the genome survey sequence (GSS), expressed sequence tag (EST), sequence-tagged site (STS) and high-throughput genomic sequence nucleotide databases can be scoured to find homologies, gain insight into expression data, or locate a gene on the genome map. The NR-NCBI database is compiled by the National Center for Biotechnology Information (NCBI) as a nonredundant (NR) protein sequence database for BLAST searches. It contains a total of about one million nonidentical sequences from GenBank CDS (coding sequence) translations, Protein Data Bank (PDB), Swiss-Prot, PIR, and Protein Research Foundation (PRF).

2.3 Heuristic sequence similarity searching methods

Both the FASTA and BLAST suites of programs feature a quick step for initial filtering of the database sequences, followed by a second slower step to scrutinize the sequences and compile the final alignments between the query and each of the database sequences. If the initial filtering step is too strict, there is a biological risk: homologous sequences will be discarded before the more detailed analysis and are lost (false negatives). If the initial filtering step is too permissive, however, there is a computational penalty because too many unrelated sequences are passed through to the slower subsequent step. In both the FASTA method and a recent implementation of the BLAST algorithm, the slow step incorporates the dynamic programming algorithm to compile a local alignment.

2.3.1 FASTA

In the early years of sequence database searching, the heuristic method FASTA (35) was the most widely used technique. The FASTA program compares a given query sequence with a library of sequences and calculates for each pair the highest-scoring local alignment. The speed of the algorithm is obtained by delaying application of the dynamic programming technique to the moment where the most similar segments are already identified by faster and less-sensitive techniques. To accomplish this, the FASTA routine operates in four steps of which the first two represent a quick filter to eliminate sequences that have no fragments scoring beyond a specified threshold value. Sequence fragments that score beyond a given threshold value after the first two steps are combined and realigned in the last two steps.

The four basic steps of FASTA are as follows:

1. The first step searches for identical 'words' (short segments of sequence) of a user-specified length ('ktup') occurring in the query sequence and the target sequence(s). For each target sequence, the ten regions with the highest density of ungapped common words are determined. The technique is based on that of Wilbur and Lipman (40, 41) and, for not-too-distant sequences (>35% residue identity), little sensitivity is lost whilst speed is greatly increased. The search is performed by 'hashing techniques', where a look-up table is constructed for all words in the query sequence and is then used to compare all encountered
identical words in the database sequence(s). Generally, for proteins, a word length of two amino acids is sufficient (ktup=2), whilst for nucleotide sequences ktup=6 is the default word length. Searching with higher ktup values increases the speed but also the risk that similar regions are missed.
2. In the second step, these ten regions are rescored using the Dayhoff PAM-250 residue exchange matrix (42).
3. In the third step, a threshold value is applied to filter the ten regions: sequences with none of the ten regions scoring beyond the threshold are effectively discarded at this point; regions scoring higher than the threshold value and being sufficiently near to each other in the sequence are joined, now allowing gaps. The highest-scoring region of these new fragments is retained.
4. The fourth and final step performs a full dynamic programming alignment over the region yielded in the preceding step, which is widened by 32 residues on either side (43).

In early FASTA versions, the best-scoring regions resulting from steps 2 and 3 above were reported as init1 and initn in the FASTA output, respectively, whilst the final alignment score (step 4) was written under opt. Modern implementations of FASTA, however, only report an E value for each of the database sequence fragments aligned with the query as a measure of their statistical significance as putative homologs (see section 2.4).

In Protocol 1, we will give an example of the use of FASTA, focusing on a subunit of a large enzymatic complex called cytochrome c oxidase. This complex is found both in bacteria and mitochondria where it catalyzes electron transfer through the last part of the respiratory chain. The starting point will be the mouse (Mus musculus) subunit IV of the mouse cytochrome c oxidase complex.

Protocol 1
A typical search using FASTA

1. First, retrieve the query sequence that we will be using for this example. Go to the NCBI web site at http://www.ncbi.nlm.nih.gov (3.1). From the pull-down menu headed Search at the top left, select Protein and type 'NP_034071' in the adjacent text box. Click Go or press 'Enter' on your keyboard; we now see a link to the corresponding entry.
2. Click on the link (NP_034071) at the top of the entry and then, from the Display menu (left, towards the top), select FASTA to display the protein sequence in the FASTA format. Copy the entire entry (from the header line starting '>gi|6753498' to the end of the protein sequence, ending '...DKNEWKK').
3. Paste the text into a new document in Word or another text processor and save the file as 'NP_034071.fa' on your computer. It is important to save it as 'text only'.
4. There are several other ways to obtain the same sequence from NCBI or from other databases - for example, see Chapters 1 and 2 for general information on retrieving sequences from databases. Also, a copy of the file NP_034071.fa can be downloaded from the Protocol 1 folder for this chapter on the book's web site.
5. For running the actual FASTA routine, go to the home page of the method at http://www.ebi.ac.uk/fasta33/ (3.2). Note that the FASTA method should not be confused with the FASTA sequence format mentioned in the preceding step.
6. This page offers many variants of FASTA, but we will use FASTA3 - make sure that this is selected (highlighted) in the Program menu.
7. There are numerous options on the page, but most of them should be left at their default values. You can (for example) choose to receive the results by e-mail rather than interactively. You can also specify which protein sequence databases to search against (menu at right); you can select multiple databases by Shift-clicking or by other key/mouse combinations (depending on your computer and browser) but, for this example, leave the Databases menu set to UniProt.
8. The other parameters define the details of the search. The values for the gap penalty (both to open a gap and to extend it by one residue), the ktup value (the size of the 'words' that are used in the early stages of the search - see above), and the substitution matrix that is used to evaluate the similarity between amino acids can all be altered, but for now, leave them at their default values.
9. The expectation upper and lower values (E values) can also be altered. The E value is a measure of the statistical significance of a hit (see below) - higher values correspond to lower significance. The default upper value of 10 ensures that we will find even fairly distantly related proteins. The default lower value is effectively zero - so that we will also find extremely closely related (or identical) proteins. Leave these settings at their defaults.
10. It is also possible to restrict the search to a part of the query sequence (using Sequence range), or to compare the query only against database proteins that have a certain range of sizes (Database range), but we will leave these at their default settings, so that we search all of our query sequence against all proteins in the database.
11. Under Scores & Alignments, set both Scores and Align to 100 (the default values are 50) - this ensures that we will retrieve up to 100 matches.
12. Further help can be obtained by clicking on any of the colored menu titles.
13. Either copy and paste the complete contents of NP_034071.fa into the large text window lower down the screen, or use the Choose file (or Browse) button to choose this file. (Note that many other sequence formats can also be used.)
14. Click Run Fasta3 and wait for your job to be processed (this should take only a minute or so).
15. When your results page appears (see Note a), it should look similar to Fig. 1. (Of course, the UniProt database that was searched is updated frequently. It is therefore quite likely that some of the 'hits' will change by the time you read this. At the time of writing, this search produced 81 hits.)
16. The upper part of the screen (Submission parameters) simply summarizes the settings you have used and some aspects of your query sequence.
17. The lower part of the screen is a table listing the hits, starting with the most significant (lowest E value - see section 2.4). (Clicking on any of the other headings in the table will cause the results to be resorted by that parameter; if you try this, click on E() afterwards, to again sort the list by E value.)
18. For each hit, there is a link to UniProt (under DB:ID). Also reported are the length of the protein, the percentage of residues in the alignment that were either identical or similar to those in the query sequence, and the length of the overlap between the two sequences in the alignment.
Figure 1. FASTA output.
Below the summary table (which gives details of the search parameters), 81 hits are listed in order of increasing E value (decreasing significance).

19. Note that the top sequences have very low E values and can therefore be trusted to be homologous to the query. (The top match, in fact, is identical to the query sequence.) However, at the bottom of the list we also find some sequences that apparently are not related to the query (e.g. UNIPROT:Q2CHW9_9RHOB showing an E value of 9). Although there might well be homologous sequences having unfavorable E values (e.g. UNIPROT:Q2TWP1_ASPOR showing an E value of 8.1), users should generally be cautious of E values above 0.001, as at this score level, false positives can arise.
20. Click on Show alignments (underneath the Submission Parameters). The display will now show alignment details and the alignment itself for each of the hits. Each alignment shows the aligned parts of the two proteins plus (if the alignment covers only part of the sequences) [...]. A ':' indicates that the amino acids in the query sequence and the hit are identical, a '.' that they are similar (for example, lysine and arginine).
21. Click on Summary table to return to the previous page.
22. Click on MView. The hits are now displayed in color-coded form: residues identical to the query sequence in the respective alignment are colored, whilst nonidentical residues are gray. (Note that the display is in 'blocks' - the first 80 residues for each aligned protein, then the next 80, etc.) Also shown are consensus sequences at different levels of stringency, although this is only meaningful when all of the proteins are well aligned.
23. Click on Return to result to return to the results table.
24. There are many options for viewing and downloading the results, or for selecting only some of the hits (using the check boxes under Alignment on the left of the table) to examine further. In particular, the VisualFasta option gives a quick graphical indication of the strength and extent of each of the alignments.

Note
(a) Note that results are stored for about 24 h. Thus, if you copy the web address of the results page, you can return to it at a later point.

Protocol 2 suggests a variant on the previous search, this time examining only more distantly related proteins.

Protocol 2
A FASTA search for distant homologs

1. Repeat the FASTA search from Protocol 1, but this time set the Expectation upper value and Expectation lower value to 20 and 0.001, respectively. This will find only those sequences with a low similarity (down to an E value of 20, which is of very low significance) and will exclude the most similar sequences (with E values below 0.001). At the time of writing, this search produced 24 hits.
2. If you look at these hits using the Mview option, you will see that the identity with the query sequence is generally very sparse, and that the identical residues are widely scattered and tend to be at different places in the different hits. This suggests that many of these hits are spurious.
3. There will certainly be some true homologs among these less-significant hits, but it is difficult to spot these purely from the sequence identities shown in Mview. Examining the alignments gives a little more information; a true but distant homolog might be expected to have some similarity across most of the protein length (rather than good similarity on only one or a few small areas). However, more confident identification of true distant homologs is not possible using this simple search strategy.

2.3.2 BLAST

Since its inception in 1990, the BLAST (basic local alignment search tool) program has quickly gained a dominant position, and the original publication of the technique (34) is the most cited paper in molecular biology to date. BLAST is a speed-optimized technique that maintains significant sensitivity through the combination of a fast and a subsequent slow algorithmic step.

The BLAST suite includes a number of variants to allow all possible combinations of comparisons between nucleotide or protein sequences. In particular, nucleotide sequences can be translated in all six possible reading frames for comparison either
with protein sequences or with other, similarly translated nucleotide sequences (see Table 1).

Table 1. BLAST variants
Note the distinction between BLASTN (which compares query and database nucleotide sequences directly with each other) and TBLASTX (which first translates the query and database sequences in all possible frames and then compares the resulting protein sequences).

Program   Query sequence   Database     Notes
BLASTN    Nucleotide       Nucleotide   Direct comparison between nucleotide sequences
BLASTP    Protein          Protein      Direct comparison between protein sequences
BLASTX    Nucleotide       Protein      All six translations of the query sequence are compared with the protein database
TBLASTN   Protein          Nucleotide   The query protein is compared against all six translations of each sequence in the nucleotide database
TBLASTX   Nucleotide       Nucleotide   All six translations of the query sequence are compared against all six translations of each sequence in the nucleotide database

The basic idea behind the initial quick filtering step in BLAST is the generation of all consecutive tripeptides in a given protein query sequence, or 11-nucleotide words when searching with a DNA sequence. For each of the words, a table is constructed of words deemed to be 'similar', where the number of similar tripeptides corresponds to only a fraction of the 20^3 possible tripeptides, or 4^11 possible nucleotide 11mers. The BLAST program uses the tables of similar words to quickly scan a database of protein or nucleotide sequences for ungapped regions showing high similarity; each time, a database word is accepted whenever it occurs in the table for the query word considered.

In this respect, BLAST differs significantly from FASTA, in that it can consider similar 'words' in the early stages of the search whereas FASTA considers identical words. Similar regions found by BLAST between the query and database sequences scoring beyond a given threshold, T, are referred to as high-scoring sequence pairs (HSPs) and are retained for further processing. To score these regions, BLAST employs the BLOSUM62 amino acid exchange matrix (10), such that the existence of HSPs scoring higher than T signifies pairwise similarity beyond random probability, which is taken as a signal that the database sequence considered is related. The computational strategy involved behind the quick initial step in the BLAST method is based on deterministic finite automata, which allows very quick searching of the similar-words table associated with each query word.

The original BLAST method features a slow algorithmic step that tries to refine the database hits by extending each HSP in either direction in an attempt to generate a longer alignment with a higher score than the nonextended region. During extension, the alignment score is temporarily allowed to drop but not more than a pre-set drop threshold of S, which is set to 20 for protein sequences and 22 for DNA sequences, before the score picks up again to arrive at a higher value. The maximum scoring pairs (MSPs) resulting from the word extensions are presented as the final result in the original BLAST version.

As the original BLAST program spends more than 90% of its time on extending words, a key improvement of the BLAST method (33) has been to extend words only when there are two hits on the same diagonal within a given distance, A, of each other (in other words, where there is a chance of the extension running into further regions of similarity). As this would lead to far fewer words being extended, sensitivity is maintained by lowering the value of T for finding the initial HSPs. With a lower value of T, far more single hits are produced, but only a minority has an associated second hit nearby on the same diagonal. In the more recent version of BLAST, word extension is done using dynamic programming, leading to gapped alignments. The updated technique, referred to as gapped BLAST (33), is therefore more similar in spirit to the earlier FASTA program (35) than to the earlier BLAST method (34). It is also slightly faster than the earlier BLAST method, as extension by dynamic programming is only triggered when the aforementioned two-hit extension has a sufficiently large score. If this is the case, the highest scoring segment of length 11 along the region covered by the two-hit extension is taken as the seed. Dynamic programming is then initiated in the forward and backward directions from the central pair in the 11-long HSP. Gapped extension proceeds as long as the score remains above a given threshold, whilst the score is temporarily allowed to drop below the threshold as long as it takes off again and rises above the threshold value. The ends of the alignment are finally pruned to yield the best local alignment given the 11-residue seed, and this alignment is reported to the user.

A difference between FASTA and BLAST is that FASTA uses the BLOSUM50 substitution matrix when calculating similarities between protein sequences, whilst BLAST uses BLOSUM62 (although this can be changed in both programs). BLOSUM62 is a 'harder' matrix (i.e. overall it tends to report a lower similarity between any two nonidentical amino acids than BLOSUM50), which is amenable to less-divergent sequence comparisons. Another difference is that the BLAST server can return a maximum of 20000 hit sequence descriptions and alignments, whilst the FASTA server (Protocol 1) is limited to a maximum of 100.

Protocol 3 takes the user through a basic BLAST search, again using subunit IV of the mouse cytochrome c oxidase complex from M. musculus as an example.
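
Before turning to the protocol, the two-hit seeding rule described for gapped BLAST can be illustrated with a small sketch. It is a simplification under stated assumptions and not the BLAST implementation: only identical words are matched (BLAST also accepts 'similar' words scoring above T), and the function names, the word length `w`, and the default `window` (standing in for the distance A) are illustrative choices.

```python
from collections import defaultdict


def word_index(query, w):
    """Look-up table mapping each word of length w to its start positions in the query."""
    index = defaultdict(list)
    for i in range(len(query) - w + 1):
        index[query[i:i + w]].append(i)
    return index


def two_hit_seeds(query, subject, w=3, window=40):
    """Find pairs of non-overlapping word hits on the same diagonal.

    Returns (diagonal, first_subject_pos, second_subject_pos) tuples for hits
    that lie at most `window` positions apart; only such pairs would go on to
    the extension step. Assumption: exact word matches only.
    """
    index = word_index(query, w)
    last_hit = {}  # diagonal -> subject position of the most recent accepted hit
    seeds = []
    for j in range(len(subject) - w + 1):
        for i in index.get(subject[j:j + w], ()):
            diagonal = j - i
            previous = last_hit.get(diagonal)
            if previous is not None:
                distance = j - previous
                if distance < w:
                    continue  # overlaps the previous hit, so it is ignored
                if distance <= window:
                    seeds.append((diagonal, previous, j))
            last_hit[diagonal] = j
    return seeds


query = "ACDEFGHIKLMNPQ"
print(two_hit_seeds(query, query))  # -> [(0, 0, 3), (0, 3, 6), (0, 6, 9)]
```

A single isolated word hit produces no seed in this sketch, which is the point of the two-hit rule: far fewer words are passed on for extension.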
1. Open the BLAST homepage at http://www.ncbi.nlm.nih.gov/blast/ (3.3).
3. Paste the cytochrome c oxidase sequence from M. musculus in FASTA format (see Protocol 1) into the Enter Query Sequence box.
4. Below we can see that the choice of available databases is different from that available from the FASTA search page (Protocol 1), although most are major protein databases that will have essentially the same content. Leave Database at its default of nr - this encompasses all nonredundant translations from GenBank and several other major databases (click on the Database link, or indeed on any of the highlighted headings, for a 'Help' page giving more details).
5. Leave all other settings at their defaults for now and click BLAST to start the search using standard BLASTP (protein-protein BLAST).
6. Typically within a few seconds the results of the BLASTP run will be displayed (see Fig. 2).
7. At the top of the results page (after the references), a graphic similar to FASTA's Mview (see Protocol 1) represents the extent and significance of hits against the query sequence. Moving the mouse over any of the colored bars will cause details of that alignment to appear in the small text window above the graphic.
8. Below the graphic, the hits are listed in order of increasing E value (decreasing significance). The hit sequences found for the cytochrome c oxidase query are in line with those retrieved by the FASTA method (Protocol 1), but the different nomenclatures do not make this obvious.
9. The default threshold for the E value (see section 2.4) is 10, but can be adjusted by the user. As with the FASTA method, sequences with E values beyond 0.001 should be treated with caution as they may well be unrelated (e.g. sequence gi|118401825|ref|XP_001033232.1|, which has an E value of 2.6).
10. Links on the left of each of the named hits link to the Entrez protein database entry for the protein. U and G symbols to the right of some hits link to the entries in UniGene or Entrez Gene. From all of these linked databases, there are numerous links to other resources for that [...]
11. Below the list of hits, each alignment is shown in detail. (You can jump directly to one of the alignments by clicking on the link under Score in the list of hits, or by clicking on the colored bar in the graphic window.) Note that redundant entries are listed here (although only one of them is given in the graphic window and in the list of hits).
12. On the query page (step 3), there are numerous options to alter the search parameters or limit the search. In particular, under Algorithm parameters (below), the E value threshold (Expect threshold), word size (equivalent to ktup in FASTA), and the gap penalties (Gap costs) can be adjusted, as can more-complex parameters.
13. For example, the Organism box (under Choose Search set) can be used to limit the search to specific organisms or groups of organisms. (Try repeating the search with this option set to Custom ... combined with Sus scrofa; the search should return only a few hits, all from pig.) More complex limits can be imposed by typing an Entrez query into the adjacent text box; see Chapter 1 for more information on Entrez search terms.

Figure 2. BLAST output.
Only sections of the full output page are shown. Beneath the references at the top of the page, a graphic shows the distribution and strength of hits against the query sequence, as a series of colored bars. Below this, a table lists the hits in order of increasing E value (decreasing significance), with links to other databases. Below this, each alignment is shown in detail (duplicate entries, such as gi|109129420 and gi|109129422, are shown as a single alignment).
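
The advice in Protocols 1 and 3 to treat hits with E values above 0.001 with caution can be applied to a saved hit list with a few lines of code. The following is a minimal sketch, assuming the hits have already been reduced to (identifier, E value) pairs; the function name and the entry labeled hypothetical are illustrative, while the other identifiers and E values are those quoted in the protocols.

```python
def split_hits_by_evalue(hits, cutoff=0.001):
    """Sort hits by increasing E value and split them at the cutoff.

    hits   : iterable of (identifier, e_value) pairs
    cutoff : E value above which a hit is treated with caution
    Returns (trusted, doubtful), each sorted by increasing E value.
    """
    ranked = sorted(hits, key=lambda hit: hit[1])
    trusted = [hit for hit in ranked if hit[1] <= cutoff]
    doubtful = [hit for hit in ranked if hit[1] > cutoff]
    return trusted, doubtful


hits = [
    ("UNIPROT:Q2CHW9_9RHOB", 9.0),
    ("hypothetical_close_homolog", 3e-72),  # made-up entry for illustration
    ("UNIPROT:Q2TWP1_ASPOR", 8.1),
    ("gi|118401825|ref|XP_001033232.1|", 2.6),
]
trusted, doubtful = split_hits_by_evalue(hits)
print([name for name, _ in trusted])   # hits at or below the cutoff
print([name for name, _ in doubtful])  # hits that need closer inspection
```

A doubtful hit is not necessarily a false positive; as Protocol 2 notes, true distant homologs can hide among the less-significant hits, so the split only indicates which alignments deserve closer inspection.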
2.3.3 BLAT

A recent adaptation of the BLAST routine is the BLAT (BLAST-like alignment tool; 44) program. BLAT performs rapid mRNA/DNA and cross-species protein alignments. It is more accurate and about 500 times faster than other tools such as BLAST when used for mRNA/DNA alignments, and 50 times faster for protein alignments at sensitivity settings typically used when comparing vertebrate sequences. When BLAT is applied to DNA sequences, it builds an index of the entire genome in memory. The index consists of all nonoverlapping 11mers except for those heavily involved in repeats. As the total index amounts to less than a gigabyte, it can be kept in RAM for quick access, allowing BLAT to perform very quick searches on a standard Linux box. The index is used to delineate areas that are likely to be homologous, which are then loaded into memory for further detailed alignment. Protein BLAT works in a similar manner, except with 4mers rather than 11mers. The protein index takes a little more than 2 gigabytes, which is also feasible on modern workstations.

The standard implementation of BLAT quickly finds sequences of ≥95% similarity and length 40 bases or more when applied to DNA. It may miss more divergent regions of longer length or more similar ones of shorter length. None the less, BLAT is guaranteed to detect sequence matches down to 33 bases and sometimes detects identical regions as short as 20 bases. When applied to search protein sequences, BLAT finds sequences of ≥80% sequence identity and of length 20 amino acids or more. In practice, DNA BLAT works well on primates and protein BLAT on land vertebrates.

2.4 Statistical significance of search results - E values

The BLAST method is based on an exhaustive statistical analysis of ungapped alignments (45) and provides a rigorous statistical framework, based on the extreme value theorem, to estimate the statistical significance of putative homologs. The E (or expectation) value indicates the expected number of sequences with an alignment score equal to or greater than that of the alignment considered, taking into account factors such as the size of database being searched and the composition of the query sequence. For example, if an alignment has an E value of 1e-9 (10^-9), this means that a match with that score (or better) would only be expected to occur by chance (i.e. in the absence of true homology) in the database with a probability of 1 in a billion and is thus highly significant. Conversely, if a hit has an E value of 3.0, this means that one might expect about three equally similar sequences to be found in the database by chance alone - clearly, therefore, such a hit is not necessarily a homologous sequence.

The original BLAST program could detect only local alignments without gaps and therefore might miss some significant similarities. The more recent version of the BLAST algorithm, gapped BLAST, is able to insert gaps in the alignments, leading to increased sensitivity (33). The original statistical framework for ungapped alignments is used to assess the significance of the gapped alignments as well, although no mathematical proof for this is yet available (46). However, computer simulations have indicated that the theory probably applies to gapped alignments as well, so that its application to general pairwise alignment is not likely to introduce error. Therefore, E values are roughly comparable across the various search tools. A number of state-of-the-art homology search techniques adopt the Karlin-Altschul statistical framework and routinely calculate P or E values for each query-database pairwise sequence comparison (see next section).

2.4.1 Statistics of local alignments without gaps

An important contribution for fast sequence database searching has been the realization (45, 47, 48) that local similarity scores of ungapped alignments follow the extreme value distribution (EVD) (49). This distribution is unimodal but not symmetrical like the normal distribution, because the right-hand tail at high-scoring values falls off more gradually than the lower tail, reflecting the fact that the best local alignment is associated with a score that is the maximum out of a great number of independent alignments (see Fig. 3).

Figure 3. Extreme value distribution.
Shown is the probability density function (plotted against score) for the extreme value distribution (EVD) resulting from parameter values $\mu = 0$ and $\lambda = 1$.

Following the EVD, the probability (P) of a score S being larger than a given value x can be calculated as:

$$P(S \geq x) = 1 - \exp\left(-e^{-\lambda (x - \mu)}\right)$$

where $\mu = (\ln Kmn)/\lambda$ and K is a constant that can be estimated from the background amino acid distribution and scoring matrix (for a collection of values for $\lambda$ and K
58 CHAPTER 3: SEQUENCE SIMILARITY SEARCHES METHODS AND APPROACHES II 59
over a set of widely used scoring matrices, see 50). Using the equation for $\mu$, the probability for $S$ becomes:

$$P(S \geq x) = 1 - \exp\left(-Kmn\,e^{-\lambda x}\right)$$

In practice, the probability $P(S \geq x)$ is estimated using the approximation:

$$1 - \exp\left(-e^{-x}\right) \approx e^{-x}$$

which is valid for large values of $x$. This leads to a simplification of the equation for $P(S \geq x)$:

$$P(S \geq x) \approx e^{-\lambda(x-\mu)} = Kmn\,e^{-\lambda x}$$

The lower the probability for a given threshold value $x$, the more significant the score $S$.

In spite of the usefulness of the above statistical estimates in recognizing sequence similarity, it should be noted that they do not judge the distribution of similarity along the sequences, which is a crucial aspect in assessing homology. For example, a statistically significant alignment score can correspond to a single domain in a multi-domain protein sequence or to a single motif within a domain, thereby still conferring an incomplete biological picture.

2.4.2 Statistics of local alignments with gaps

Although similarities between sequences can be detected reasonably well using methods that do not allow insertions/deletions in aligned sequences, it is clear that insertion/deletion events play a major role in divergent sequences. This means that accommodating gaps within alignments of distantly related sequences is important for obtaining an accurate measure of similarity. Unfortunately, a rigorous statistical framework as obtained for gapless local alignments has not been conceived for local alignments with gaps. However, although it has not been proven analytically that the distribution of $S$ for gapped alignments can be approximated with the EVD, there is accumulated evidence that this is the case: for example, for various scoring matrices, gapped alignment similarities have been observed to grow logarithmically with the sequence lengths (51), as the EVD predicts. Other empirical studies have shown it to be likely that the distribution of local gapped similarities follows the EVD (52, 53), although an appropriate downward correction for the effective sequence length has been recommended (50). The distribution of empirical similarity values can be obtained from unrelated biological sequences (54). Fitting of the EVD parameters $\lambda$ and $K$ (see above) can be performed using a linear regression technique (54), although the technique is not robust against outliers, which can have a marked influence. Maximum likelihood estimation (55, 56) has been shown to be superior for EVD parameter fitting and, for example, is the method used to parameterize the gapped BLAST method (33). However, when low gap penalties are used to generate the alignments, the similarity scores can lose their local character and assume more global behavior, such that the EVD-based probability estimates are no longer valid (51).

2.4.3 Statistics of database searches

In order to be useful in sequence database searches, the above framework for comparing a pair of random sequences should be adapted to multiple pairwise comparisons. Here, it becomes important to establish the probability for a given query sequence to have a significant similarity with at least one of the database sequences. The P value is the probability of seeing at least one unrelated score $S$ greater than or equal to a given score $x$ in a database search over $n$ sequences. This probability has been demonstrated to follow a Poisson distribution (53):

$$P(x, n) = 1 - e^{-n \cdot P(S \geq x)}$$

where $n$ is the number of sequences in the database. In addition to the P value, some database search methods employ the E value of the Poisson distribution, which is defined as the expected number of nonhomologous sequences with a score greater than or equal to a score $x$ in a database of $n$ sequences:

$$E(x, n) = n \cdot P(S \geq x)$$

For example, if the E value of a matched database sequence segment is 0.01, then the expected number of random hits with score $S \geq x$ is 0.01, which means that such a hit is expected by chance only once in 100 independent searches over the database. However, if the E value of a hit is 5, then five fortuitous hits with $S \geq x$ are expected within a single database search, which renders the hit not significant. Database searching is commonly performed using an E value threshold of between 0.1 and 0.001. Low E value thresholds decrease the number of false positives in a database search, but increase the number of false negatives such that the sensitivity (see below) of the search is lowered.

In addition to P or E values, a number of sequence similarity searching routines provide an additional normalized alignment score based on the raw alignment score, $S$. This score, called the bit score, is defined as:

$$B = (\lambda S - \ln K)/\ln 2$$

where $S$ is the raw alignment score and $\lambda$ and $K$ are the aforementioned statistical parameters of the scoring system (50). The bit score, $B$, is a linear transformation of the raw score and has a standard set of units; the higher the score, the more significant the alignment. As bit scores are normalized with respect to the scoring system, they can be used to compare alignment scores from different searches based on different scoring schemes, which is not warranted using raw alignment scores.

2.5 Fast Smith-Waterman local alignment searches

Collins and Coulson (57) devised a parallel computer protocol to perform database searches based on an implementation of the full Smith-Waterman (8) local alignment technique. They implemented their MPSRCH protocol (58) on massively parallel computers with single-instruction multiple data-type
processors. Following Collins and Coulson (57), a number of implementations that enable fast Smith-Waterman-based local searches have emerged. One of the central computer sites where such programs are running as web servers is the European Bioinformatics Institute outstation of the European Molecular Biology Laboratory. Available are MPSRCH (http://www.ebi.ac.uk/MPsrch/index.html 3.4) and a fast implementation of the true Smith-Waterman algorithm (SCANPS) (59), both allowing users to perform database queries via the Internet. The output of a query is a list of top-scoring local alignments (one per protein) where statistical significance measures are also given based on the mean value and standard deviation of the distribution of scores over the entire database (57). The speed of the techniques allows several PAM exchange weight matrices (based on different evolutionary distances) to be used in searching the databanks with the same query sequence.

2.6 Profile searching

A natural extension of sequence database searching is provided by methods that use information over an entire sequence alignment of a certain protein family to find additional related family members. The earliest conceptually clear technique of this kind of sequence searching was called profile analysis (60), which combines a full representation of a sequence alignment with a sensitive searching algorithm. The procedure takes as input a multiple alignment of $n$ sequences. First, a profile is constructed from the alignment, i.e. an alignment-specific scoring table, which comprises the likelihood of each residue type occurring in each position of the multiple alignment. A typical profile has $L \times (20 + 2)$ elements, where $L$ is the total length of the alignment; 20 columns are reserved for the amino acid types, whilst the last two columns are often reserved for affine gap penalties (see above). Gribskov et al. (60) used a single extra column in the profile to describe the local weight for both the gap opening and the gap extension penalty. For gapless alignment positions, the weight is the maximal value, whereas for positions with insertions/deletions, the weighting factor is lowered according to the maximum length of the gap crossing a given alignment position. The advantage of positional gap weights is that multiple alignment regions with gaps (loop regions) will be assigned lowered gap penalties and hence will be more likely than core regions to attract gaps in a target sequence during profile searching, consistent with structural considerations. However, the implementation by Gribskov et al. does not take the frequency of gaps at each alignment position into account for the estimation of gap opening and/or extension penalties. Many alternative profile implementations therefore reserve the two last columns of the profile for positional gap opening ($P_{open}$) and gap extension ($P_{extend}$) penalties, which can be individually determined using protocols that take the above considerations into account.

In the approach of Gribskov et al., a profile calculated as described above is aligned with the databank sequence by means of the Smith-Waterman dynamic programming procedure (8). For each database sequence, the alignment score corresponding to the best local alignment quantifies the degree of similarity of this sequence with the probe profile. The scores are then corrected for sequence length, represented in the form of Z scores, and ranked to create the final list of databank search hits. Top-scoring sequences with scores above some threshold level are then likely to be related to the multiply aligned sequences used to build the profile. In addition to aligning a single sequence to a profile, it is also possible to align two profiles. In this case, two matched profile positions receive a score by summing the products of the corresponding propensities from the two profiles over the 20 residue types.

A number of improvements have been effected since the early Gribskov et al. (60) approach. A subclass of profiles for ungapped alignments is referred to as position-specific scoring matrices, a term developed by the BLAST team (33). An approach adopted in many profile-based methods is to implement a more probabilistic and informational scheme based on log likelihoods and normalization using expected residue compositions, which has been shown to lead to more sensitive comparisons than the classic Gribskov et al. approach. A common problem in this approach is the occurrence of zero values at some positions in the matrix. This not only leads to divide-by-zero problems in the analysis, but also fails to recognize the potential for diversity at some sites, which might be seen if a large enough set of sequences was available. A common way to deal with the under-representation of nucleotides or amino acids at alignment positions is the application of pseudo-counts (61-64). Pseudo-counts effectively extrapolate the number of amino acids at each alignment position by adding extra artificial residue counts to the profile, based for example on a known residue composition observed in the database. This generally enhances the predictive power of the profile.

In 1994, Baldi et al. (65) and Krogh et al. (66) pioneered the use of hidden Markov models (HMMs) to represent an aligned block of sequences. A distinct advantage of HMMs over traditional profiles is that an HMM incorporates a richer probabilistic description of insertion and deletion probabilities. Whereas in profiles there is just a single gap penalty for each position, such that the introduction of a gap in either the profile or the sequence aligned against the profile leads to the same penalty, in an HMM these two events can be modeled with different probabilities. An extensive library of HMMs for protein domains is deposited in the Pfam database (30). Profile searching using HMMs is currently one of the most sensitive search techniques.

Bucher et al. (67) unified the profile, motif, and HMM approaches through extension of the profile definition with regular expression-like patterns, weight matrices, and HMMs. They proved that their generalized profiles are equivalent to certain types of HMM. The generalized profiles have been used to extend the PROSITE protein motif database (67, 68), which in its basic form is a library of regular expressions. The profile syntax enables the emulation of most common motif search techniques, such as direct searching for PROSITE patterns, searching for patterns without gaps (69), searching using the profile definition of Gribskov et al. (60), flexible pattern searches (70), searching using HMMs (66), and domain and fragment searches using the HMMER method (71).
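The pseudo-count correction described earlier in this section can be illustrated with a short sketch. It is a minimal example under stated assumptions, not the scheme of any of the cited methods: the alignment is ungapped, the background is uniform over the 20 amino acids unless one is supplied, `pseudo_weight` is a hypothetical tuning parameter, and the profile is slid along the target without gaps rather than aligned by Smith-Waterman.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_log_odds_profile(alignment, background=None, pseudo_weight=1.0):
    """Return one dict per alignment column mapping residue -> log2-odds score.

    Pseudo-counts proportional to the background frequencies are added to the
    observed counts, so residues never seen in a column still get a finite score.
    """
    if background is None:
        # Assumption: uniform background; a database composition could be used instead.
        background = {a: 1.0 / len(AMINO_ACIDS) for a in AMINO_ACIDS}
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")

    profile = []
    for col in range(length):
        counts = Counter(seq[col] for seq in alignment)
        total = sum(counts[a] for a in AMINO_ACIDS)  # ignores non-standard symbols
        column = {}
        for a in AMINO_ACIDS:
            freq = (counts[a] + pseudo_weight * background[a]) / (total + pseudo_weight)
            column[a] = math.log2(freq / background[a])
        profile.append(column)
    return profile


def score_window(profile, segment):
    """Sum the column scores for an ungapped segment of the same width."""
    return sum(column[res] for column, res in zip(profile, segment))


def best_ungapped_hit(profile, sequence):
    """Return (start, score) of the best-scoring ungapped window in sequence."""
    width = len(profile)
    starts = range(len(sequence) - width + 1)
    best = max(starts, key=lambda i: score_window(profile, sequence[i:i + width]))
    return best, score_window(profile, sequence[best:best + width])
```

With `pseudo_weight` set to zero, an unobserved residue would produce a frequency of zero and the logarithm would fail, which is the zero-value problem the text describes.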
Figure 4. PSI-BLAST output (see page xix for color version).
Part of the results after the second PSI-BLAST iteration are shown. The output format is essentially the same as for BLASTP (see Fig. 2), but new family members found in this iteration are marked 'NEW'. Family members found in the previous round are indicated by green dots. Sequences below the horizontal dividing line have too high an E value; only those above the line will be used in compiling the PSSM (or 'profile') of the protein family for the next iteration of the search.

2.6.2 Profile-profile alignment

The previous sections described how a query multiple sequence alignment, based on a given query sequence and a number of putative homologs, can confer an enhanced signal for recognizing distantly evolved members of a given family. Over the last few years, further improvements to the alignment of distant sequences have been achieved using several approaches. As a first improvement, the evolutionary model describing the relationship of a set of sequences can be readjusted to fit the sequence set rather than using a pre-set generic model incorporated in a single-residue exchange matrix such as the PAM or BLOSUM series. Recently, Yu et al. (16) showed that the use of organism-specific or alignment-set-specific background frequencies for contextual readjustment of the standard amino acid exchange weights provides a more sensitive and biologically accurate way of aligning sequences. Alternatively, structural or homologous sequence information can be incorporated into the alignment process to help identify the distant relationships between sequences. The benefits of using related sequence information have been shown in numerous profile-profile alignment methods that apply different profile-scoring schemes (73-89). Many of these scoring schemes have been assessed in recent comparison studies and have shown little significant difference in their respective performances (90, 91). However, most of the profile-profile alignment approaches to date have been used mainly for sequence database searching (local pairwise alignment), where a popular application has been to use profile-profile comparisons for aligning a profile derived from a query multiple alignment with a number of profiles describing a collection of different protein families.

A direct application of the profile-profile alignment technique is implemented in PRALINE-PSI (92), a multiple alignment technique that relies on constructing a profile for each of the query sequences using the PSI-BLAST method. Pre-alignment profiles (pre-profiles) are generated using each sequence in a set as a PSI-BLAST (33, 46) query. The resulting PSI-BLAST local alignments are filtered for redundancy and converted to PRALINE pre-profiles, which replace the single sequence input that would otherwise be used for the alignment (see 93-95 for further details). The increased sensitivity of the PRALINE-PSI method in detecting similarities becomes most evident in aligning distant sequence pairs (or sequence-profile and profile-profile pairs in multiple sequence alignment).

3.1 Iterative homology searching problems

Iterative sequence search methods such as PSI-BLAST can be a powerful way of finding distant homologies, but often fail when querying a multi-domain protein or a protein with regions of compositional bias. For example, common conserved protein domains such as the tyrosine kinase domain can obscure weak but relevant matches to other domain types (96), whilst sequences containing low-complexity regions, such as coiled coils and transmembrane regions, can cause an explosion of the search rather than convergence due to the absence of any strong sequence signals. Conversely, some searches may lead to premature convergence; this occurs when the PSSM is too strict, only allowing matches to very similar proteins, i.e. sequences with the same domain organization as the query are detected but no homologs with different domain combinations.

An additional problem with iterative searches is 'matrix migration' (also referred to as 'profile wander'), which occurs when the search strategy is too permissive so that information from false-positive sequences is included in the profile, resulting in the possible loss of truly homologous sequences found in earlier rounds. A further loss of information can be incurred with PSI-BLAST, as PSI-BLAST PSSMs are trimmed to use only the highest-scoring region in a search, ignoring less-conserved regions. The alternative database search method QUEST (97) alleviates these problems by using an independent multiple-alignment program to generate a true multiple sequence alignment between iterations, and not a 'master-slave' alignment, thereby improving the quality of the PSSM. The QUEST method also removes any sequences that are deemed too divergent to be reliable family members in order not to 'pollute' the PSSM, which leads to increased search capabilities.
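The behavior of an iterative search can be made concrete with a toy control loop. This is a sketch under stated assumptions, not the PSI-BLAST or QUEST algorithm: `search` is a hypothetical callable that takes the current set of family members and returns (sequence id, E value) pairs, and `inclusion_e`, `max_iter`, and `max_members` are illustrative parameters. The loop stops when the member set no longer changes, flags a runaway growth in members as an explosion, and records earlier members that later drop out, which is one symptom of profile wander.

```python
def iterate_search(query_id, search, inclusion_e=0.005, max_iter=10, max_members=1000):
    """Run a toy iterative search and report how it ended.

    search: hypothetical callable, frozenset of member ids -> iterable of
            (sequence_id, e_value) pairs for the profile built from those members.
    """
    members = {query_id}
    lost = set()  # members included in an earlier round but dropped later
    for round_no in range(1, max_iter + 1):
        hits = search(frozenset(members))
        included = {seq_id for seq_id, e_value in hits if e_value <= inclusion_e}
        included.add(query_id)  # the query itself is always kept in the profile
        lost |= members - included
        if len(included) > max_members:
            status = "explosion"
        elif included == members:
            status = "converged"
        else:
            members = included
            continue
        return {"status": status, "rounds": round_no,
                "members": included, "lost": lost}
    return {"status": "max_iter", "rounds": max_iter,
            "members": members, "lost": lost}
```

A loop of this kind cannot tell premature convergence from genuine convergence; it only reports that the member set stopped changing, so the result still needs biological inspection.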
3.2 Post-processing of homology searches

A few methods exist to predict domain boundaries through post-processing BLAST searches. The BALLAST method can be used to visualize conservation profiles for a query sequence based on sequence searching (98, 99), although the method does not delineate domain boundaries. Another technique is the PASS (prediction of autonomous folding units based on sequence similarities) method, which uses a simple and noniterative method of domain delineation based on the stacking of sequences from a gapped BLAST search onto the query sequence (100). Regions along a query sequence often have a varying number of matching sequences from the BLAST data, leading to abrupt increases and decreases in sequence numbers along the query. The PASS method is based on a single BLAST run and does not use iteration to include information from distant homologs. Furthermore, the current release of the PRODOM domain database (29) is created using the MKDOM2 method (101), which performs PSI-BLAST searches starting with the smallest sequence in the database as a query, which is supposed to represent a single domain. All domain sequences identified are removed from the database, after which the process is iterated with the remaining subsequences and terminated when the database becomes empty. The MKDOM2 method is an iterative protocol but does not address the aforementioned problems connected to PSSM-based iterative searches.

The DOMAINATION method (102) assigns domain boundaries by applying PSI-BLAST in a repetitive fashion. The distribution of the aligned positions of N and C termini from PSI-BLAST local sequence alignments is used to identify potential domain boundaries. DOMAINATION incorporates an iterative strategy for chopping and joining domains and domain segments based on the loss and gain of domains. This allows the recognition of both continuous and discontinuous domains. For each domain inferred from the corresponding PSI-BLAST local alignments, profiles are created by filtering out redundant sequences and then performing multiple sequence alignment. Each profile filtered in this way is then used in further iterative database searches using PSI-BLAST. All profiles are required to contain the original query sequence at each iteration of PSI-BLAST to avoid profile wander, but parameters are set to ensure that the profiles are divergent enough to capture distant sequence fragments. The whole process of iterative PSI-BLAST searches is repeated until domain assignment ends and no further homologs are found. DOMAINATION can successfully assign domain boundaries within a given query sequence, whilst the added information gleaned from the putative domains delineated during the parallel and iterated searches leads to a search performance enhanced by 15 percentage points over a wide range of E values compared with stand-alone PSI-BLAST searches.

[...] true hits found relative to the total number of sequences in the database that are homologous to the query. The sensitivity reflects the extent to which the method is able to identify distantly related sequences. In many studies, this measure is also referred to as coverage. The specificity (or selectivity) is defined as TN/(FP + TN), where FP is the number of false positives, which denotes the fraction of entries correctly excluded as hits and hence measures the avoidance of unrelated hits. Yet another widely used measure is the positive predictive value, defined as TP/(TP + FP), which measures the proportion of true homologs within all sequences designated by the search tool as related. In practical database searches, there is a trade-off between sensitivity and specificity: the more the P or E values are relaxed to allow more distantly related sequences to be found, the more likely it becomes that chance hits infiltrate the search. Moreover, even if a statistically highly significant similarity is encountered, problems remain. For example, if high similarity is found over only a portion of the sequences, the sequences may each contain multiple domains and share a single homologous domain only (see above), so that only an aspect of the overall function might be inferred. In iterative homology searches, protein sequences containing more than one structural domain can be problematic in that they cause the search to terminate prematurely or lead to an 'explosion' of common domains (102). For example, the occurrence in the query sequence of a common and conserved protein domain such as the tyrosine kinase domain, which is then hit many times in the database, can obscure weaker but also relevant matches to other domain types (102), particularly when the E value is set to include only strong hits. Conversely, when multi-domain sequences with the same sequential order of domains as in the query sequence are found initially during an iterative search, homologs with different domain combinations might well be missed due to early convergence of the search. To reduce the chance of including spurious hits, some database search engines, such as PSI-BLAST (33), scan query sequences for the presence of so-called low-complexity regions. These are then excluded from the alignment to limit the inclusion of false-positive hits due to database sequence matches with these regions. However, the occurrence of database sequences with low-complexity regions can still cause an explosion of false positives in iterative homology searches (102). Despite recent improvements in search techniques, complications such as the above illustrate that automatic biological evaluation of homology searches in genomic pipelines remains elusive for biologically intricate relationships.
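The performance measures defined above can be computed directly when the true homologs are known, as in a benchmark. The sketch below is illustrative only: the `Hit` record, the function names, and the example numbers are assumptions rather than part of any search tool, and sensitivity is taken as TP/(TP + FN), which matches the verbal definition given in the text.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    e_value: float
    is_homolog: bool  # ground truth, available only in a benchmark setting


def confusion_counts(hits, e_cutoff, n_homologs, n_database):
    """Count TP, FP, TN, FN for hits accepted at or below an E value cutoff.

    n_homologs: number of database sequences truly homologous to the query.
    n_database: total number of sequences in the database.
    """
    accepted = [h for h in hits if h.e_value <= e_cutoff]
    tp = sum(1 for h in accepted if h.is_homolog)
    fp = len(accepted) - tp
    fn = n_homologs - tp
    tn = (n_database - n_homologs) - fp
    return tp, fp, tn, fn


def sensitivity(tp, fn):
    """Fraction of true homologs that were found (also called coverage)."""
    return tp / (tp + fn) if (tp + fn) else float("nan")


def specificity(tn, fp):
    """Fraction of unrelated sequences correctly excluded: TN / (FP + TN)."""
    return tn / (fp + tn) if (fp + tn) else float("nan")


def positive_predictive_value(tp, fp):
    """Fraction of reported hits that are true homologs: TP / (TP + FP)."""
    return tp / (tp + fp) if (tp + fp) else float("nan")
```

Evaluating the same hit list at a strict and at a relaxed cutoff shows the trade-off described in the text: relaxing the cutoff can only raise or maintain sensitivity, while specificity can only fall or stay the same.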