Bioinformatics Methods Express, 1st Edition

Paul Dear

The METHODS EXPRESS series

Faculty of Biological Sciences, University of Leeds, Leeds LS2 9JT, UK

Bioinformatics
Biosensors
Cell Imaging
DNA Microarrays
Expression Systems
Genomics
Immunohistochemistry
PCR
Protein Arrays
Proteomics
Whole Genome Amplification

edited by Paul H. Dear
MRC Laboratory of Molecular Biology, Cambridge, UK

Scion
© Scion Publishing Ltd, 2007

First published 2007

All rights reserved. No part of this book may be reproduced or transmitted, in any form or by any means, without permission.

A CIP catalogue record for this book is available from the British Library.

ISBN: 978 1 904842 16 3 (paperback)
ISBN: 978 1 904842 23 1 (hardback)

Scion Publishing Limited
Bloxham Mill, Barford Road, Bloxham, Oxfordshire OX15 4FF
www.scionpublishing.com

Important Note from the Publisher
The information contained within this book was obtained by Scion Publishing Limited from sources believed by us to be reliable. However, while every effort has been made to ensure its accuracy, no responsibility for loss or injury whatsoever occasioned to any person acting or refraining from action as a result of information contained herein can be accepted by the authors or publishers.

Dedication
To Felicity

Typeset by Phoenix Photosetting, Chatham, Kent, UK
Printed by Ajanta Offset and Packagings Ltd, Delhi, India

Cover image by Paul H Dear, representing fruiting bodies of the amoeba Dictyostelium discoideum.

Contents

Contributors
Preface
Acknowledgements
Before you begin
Abbreviations
Color section

Chapter 1. Database resources for wet-bench scientists
Neil Hall and Lynn M. Schriml
1. Introduction
1.1 Types of databases
1.2 Database resources at NCBI
2. Methods and approaches
2.1 Searching databases at NCBI
2.2 Downloading NCBI datasets
3. Troubleshooting
4. Additional web resources
5. References

Chapter 2. Navigating sequenced genomes
Melody S. Clark and Thomas Schlitt
1. Introduction
2. Methods and approaches
2.1 Finding genome resources for an organism
2.2 Browsing vertebrate genomes with Ensembl
2.3 Integr8 - an Ensembl lookalike for microbes
2.4 Other web-based genome browsers
2.5 Specialized sites
2.6 Downloading data with BioMart
2.7 Browsing genomes 'off line' using stand-alone software
2.8 Linking your own data to a genome browser
3. References
Chapter 3. Sequence similarity searches
Jaap Heringa and Walter Pirovano
1. Introduction
1.1 Comparative sequence analysis
1.2 Sequence alignment as a reflection of similarity
1.3 Similarity versus homology
1.4 Techniques for pairwise alignment
1.5 Alignment scores as a measure of similarity
1.6 Sequence identity as a measure of similarity
1.7 Statistics of alignment similarity scores
1.8 Protein domains
2. Methods and approaches
2.1 Should one compare protein or nucleotide sequences?
2.2 Curated and annotated sequence databases
2.3 Heuristic sequence similarity searching methods
2.4 Statistical significance of search results - E values
2.5 Fast Smith-Waterman local alignment searches
2.6 Profile searching
3. Troubleshooting
3.1 Iterative homology searching problems
3.2 Post-processing of homology searches
3.3 Evaluating sequence database searches
4. References

Chapter 4. Gene prediction
Marie-Adele Rajandream
1. Introduction
1.1 Ab initio methods
1.2 Comparative methods
2. Methods and approaches
2.1 Predicting eukaryotic genes
2.2 Predicting prokaryotic genes
3. Troubleshooting
4. Additional web resources
5. References

Chapter 5. Prediction of noncoding transcripts
Alex Bateman and Sam Griffiths-Jones
1. Introduction
2. Methods and approaches
2.1 Ab initio versus family-specific searches
2.2 Web servers for the detection of single, specific RNA classes
2.3 Web servers for the prediction of multiple RNA classes
3. Troubleshooting
3.1 RNA-derived repeats and pseudogenes
3.2 Computational complexity
4. References

Chapter 6. Finding regulatory elements in DNA sequence
Debraj GuhaThakurta and Gary D. Stormo
1. Introduction
1.1 Background
1.2 An overview of progress in the computational identification of DNA sequence motifs
1.3 Modeling and representation of DNA motifs
2. Methods and approaches
2.1 Searching DNA for known motifs
2.2 Discovery of DNA motifs from input DNA sequences
2.3 Comparative genomics and phylogenetic footprinting in the search for DNA regulatory elements
2.4 Composite DNA motifs and cis-regulatory modules
3. Additional web resources
4. References

Chapter 7. Expressed sequence tags
Arthur Gruber
1. Introduction
1.1 EST library construction and sequencing
1.2 Representation: normalized and subtracted libraries
2. Methods and approaches
2.1 Overview
2.2 EST databases
2.3 Automated EST pre-processing pipelines
2.4 Transcript reconstruction
2.5 Redundancy estimation
2.6 Electronic gene expression profiles
2.7 Mapping ESTs to the genome
3. Troubleshooting
3.1 Clone chimerism
3.2 SNPs
3.3 Repeat masking
3.4 Contamination
4. Additional web resources
5. References

Chapter 8. Protein structure, classification, and prediction
Arthur M. Lesk
1. Introduction
1.1 The chemical structure of proteins
1.2 The hierarchical form of protein architecture
1.3 Domains
2. Methods and approaches
2.1 Accessing macromolecular structures on the web
2.2 Classification of protein structures
2.3 Structural genomics
2.4 Approaches to protein structure prediction
2.5 Specialized methods for particular types of structure
3. References

Chapter 9. Gene ontology
Vineet Sangar
1. Introduction
1.1 Gene ontology
1.2 Structure of the GO database
1.3 The three GO ontologies
1.4 GO terms
1.5 Evidence codes
2. Methods and approaches
2.1 GO browsers
2.2 GO annotation tools
2.3 Gene expression tools
2.4 Integration of GO with other classification systems
3. Additional web resources
4. References

Chapter 10. Prediction of protein function
Rodrigo Lopez
1. Introduction
2. Methods and approaches
2.1 Required tools
2.2 Prediction and determination of physicochemical properties of proteins
2.3 Determination of secondary structure from sequence
2.4 Determination of functional domains using pattern-matching methods
2.5 Advanced methods combining several protein function prediction algorithms
2.6 Protein function prediction by transfer of annotation
2.7 Multiple sequence alignments and secondary databases
2.8 An overview of InterPro and CDD
2.9 Recent advances in protein function prediction
2.10 Concluding remarks
3. Additional web resources
4. References

Chapter 11. Multiple sequence alignment
Burkhard Morgenstern
1. Introduction
2. Methods and approaches
2.1 The alignment problem in computational biology
2.2 Pairwise sequence alignment
2.3 Multiple sequence alignment
2.4 Benchmarking and evaluation of multiple-alignment software
2.5 Visualization and comparison of multiple alignments
2.6 Multiple alignment of large genomic sequences
2.7 Software tools for multiple alignment
3. Additional web resources
4. References

Chapter 12. Inferring phylogenetic relationships from sequence data
Peter G. Foster
1. Introduction
2. Methods and approaches
2.1 Alignments
2.2 File formats
2.3 Software
2.4 Tree-building methods
2.5 Choosing a model
2.6 A Bayesian approach to phylogenetics
3. Troubleshooting
4. References

Appendix
Additional useful bioinformatics resources

Index
Contributors

Bateman, Alex
Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA, UK. E-mail: [email protected]

Clark, Melody S.
British Antarctic Survey, Natural Environment Research Council, High Cross, Madingley Road, Cambridge, CB3 0ET, UK. E-mail: [email protected]

Foster, Peter G.
Department of Zoology, Natural History Museum, London, UK. E-mail: [email protected]

Griffiths-Jones, Sam
Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA, UK. Current address: Faculty of Life Sciences, University of Manchester, Michael Smith Building, Oxford Road, Manchester, M13 9PT, UK. E-mail: [email protected]

Gruber, Arthur
Department of Parasitology, Institute of Biomedical Sciences, University of São Paulo, Av. Prof. Lineu Prestes 1374, São Paulo SP, Brazil, 05508-000. E-mail: [email protected]

GuhaThakurta, Debraj
Rosetta Inpharmatics LLC, Merck & Co., Research Genetics Department, 401 Terry Avenue North, Seattle, WA 98109, USA. E-mail: [email protected]

Hall, Neil
The Institute for Genomic Research, 9712 Medical Center Drive, Rockville, MD 20850, USA. Current address: University of Liverpool, School of Biological Sciences, Biosciences Building, Crown St, Liverpool, L69 7ZB, UK. E-mail: [email protected]

Heringa, Jaap
Centre for Integrative Bioinformatics, Vrije Universiteit, De Boelelaan 1081a, 1081 HV Amsterdam, The Netherlands. E-mail: [email protected]

Lesk, Arthur M.
Department of Biochemistry and Molecular Biology, The Pennsylvania State University, University Park, PA 16802, USA. E-mail: [email protected]

Lopez, Rodrigo
EMBL Outstation - Hinxton, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, UK. E-mail: [email protected]

Morgenstern, Burkhard
Universität Göttingen, Institut für Mikrobiologie und Genetik, Abteilung für Bioinformatik, Goldschmidtstr. 1, D-37077 Göttingen, Germany. E-mail: [email protected]

Pirovano, Walter
Centre for Integrative Bioinformatics, Vrije Universiteit, De Boelelaan 1081a, 1081 HV Amsterdam, The Netherlands. E-mail: [email protected]

Rajandream, Marie-Adele
Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA, UK. E-mail: [email protected]

Sangar, Vineet
Department of Biochemistry and Molecular Biology, The Pennsylvania State University, University Park, PA 16802, USA. E-mail: [email protected]

Schlitt, Thomas
British Antarctic Survey, Natural Environment Research Council, High Cross, Madingley Road, Cambridge, CB3 0ET, UK. Current address: Department of Medical and Molecular Genetics, King's College London School of Medicine, 8th Floor Guy's Tower, London, SE1 9RT, UK. E-mail: [email protected]

Schriml, Lynn M.
The Institute for Genomic Research, 9712 Medical Center Drive, Rockville, MD 20850, USA. E-mail: [email protected]

Stormo, Gary D.
Washington University School of Medicine, Department of Genetics, Campus Box 8510, Room 5410, 4444 Forest Park Parkway, St. Louis, MO 63108, USA. E-mail: [email protected]
Preface

In 1984, our total knowledge of DNA sequence amounted to 2825441 bases - enough to be printed in a modest book. In fact, such a book was printed - as a two-volume paperback (Nucleotide Sequences 1984: IRL Press) - and a copy of it still sits in my lab's library. At a pinch, you could do bioinformatics by finding the right page and then marking up the restriction sites or stop codons with a pencil.

Twenty-odd years later, there are about 170 billion bases (50 human genomes' worth) of sequence data in GenBank alone, most of it heavily annotated with experimental data or the results of computational analysis. Finding the right page has become correspondingly harder.

To the honest wet-bench scientist who just wants to know whether robe has a homolog in Drosophila, or whether there are any transcription factors on human chromosome 14q23, there is now a bewildering glut of bioinformatic resources. Data may be duplicated and scattered; a protein might have a dozen different names in different databases; a sequence won't upload because it's in EMBL format instead of FASTA format. This apparent impenetrability is a great shame, because the data resources and computational tools available to biologists are by far the most extensive and sophisticated available to scientists in any discipline.

This book is not a comprehensive guide to all facets of bioinformatics. Instead, it aims to guide you - the honest wet-bench scientist - through a selection of the more accessible and user-friendly resources, showing how to answer the sort of questions that you are most likely to ask. In each chapter, an overview of the subject is followed by a series of worked examples with step-by-step instructions. After following these marked paths, you'll find that the resources out there are immensely powerful and, with a little experience, not so bewildering after all.

Paul H. Dear
May 2007

Acknowledgements

My thanks go to the authors for their excellent chapters. Thanks also to David Hames for inviting me to edit this volume; to all at Scion for the astonishing elasticity of their deadlines; and to Jane Hoyle for moderating the quirkier aspects of my grammar and spelling. I'm grateful to many colleagues at LMB (especially Alan Bankier, Sarah Teichmann, and Paul Hart) for help with a range of technical matters and to the staff at many bioinformatics help desks, particularly Giulietta Spudich at the EBI. Special thanks go to my wife, Denise, for her inexhaustible patience and to my daughter, Felicity, whose tree house would otherwise have been finished by now.

Before you begin

Computer hardware and software

Wherever possible, the protocols in this book use web-based resources that will work with most current web browsers on Macs or PCs. In some cases, your browser may need plug-ins (additional modules that add specific functions to the browser) such as Java; most web sites will tell you whether you need these and where they can be downloaded from (usually at no cost).

In general, web pages should look and behave similarly on all platforms. However, there are a few differences. For instance, some 'standard' button-names are set by the browser and not by the web site - a button that is called 'Upload' when viewed in one browser may be called 'Choose file' in another.

A few of the protocols require software (usually free) to be installed and run on your own machine; in these cases, instructions are given in the relevant chapter. It is also helpful to have some software on your own computer for viewing or manipulating files. In particular, a basic text editor (the simpler the better) is very useful. Microsoft Word or other word processors can be used at a pinch, but take care to save files as 'Text only' (see below, under 'File formats, editing, and saving').

Typefaces used in this book

Throughout this book, underlining is used to indicate URLs (for instance, http://www.ncbi.nlm.nih.gov/) and also for the names of example files, which can be downloaded from the book's web site (for example, ABCC9fasta.txt). Bold typeface is used for the names of buttons, menus, links, menu items, and other 'active' features on web pages or in software. A monospaced font is used for inputs or commands that you should enter verbatim and to show the output of programs. The ↵ symbol indicates that a single line of text has been 'wrapped around' to fit on one page. The names of programs (though not the names of websites which give access to them) are indicated in SMALL CAPITALS.

Online resources to accompany this book

A web site accompanies this book at www.scionpublishing.com/bioinformatics. Before you start reading a chapter you should download, and unpack, the zip file for the chapter you are working on. These zip files contain example files (datasets, copies of the expected results, etc.) which you will need to work through the corresponding protocols.

The web site also contains a compendium of links and data files, to save you having to type in each of the addresses referred to in the book. Throughout the text, you will find superscripts next to each URL, such as: http://www.ncbi.nlm.nih.gov/ 1.1 - simply click on the relevant link (in this case '1.1') to jump directly to the URL. The example files are similarly linked, such as: ABCC9fasta.txt 10.5 - to download these, right click on the link and select 'Save Target As...'.
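For readers who prefer to script the unpacking step, the following is a minimal sketch rather than anything supplied with the book. It assumes a chapter archive has already been downloaded; the function names `unpack_chapter_zip` and `looks_like_fasta` are hypothetical and introduced only for illustration. It extracts the archive with Python's standard `zipfile` module and makes a rough check that an example file is FASTA-like plain text, that is, that its first non-blank line begins with `>`.

```python
import zipfile
from pathlib import Path


def unpack_chapter_zip(zip_path, dest_dir):
    """Extract every member of the archive; return paths of extracted files."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
        names = archive.namelist()
    # Directory entries in a zip end with "/", so skip them.
    return [dest / name for name in names if not name.endswith("/")]


def looks_like_fasta(path):
    """Rough check only: the first non-blank line must start with '>'."""
    with open(path, "r", encoding="ascii", errors="replace") as handle:
        for line in handle:
            if line.strip():
                return line.startswith(">")
    return False  # an empty file is not FASTA
```

A call such as `unpack_chapter_zip("chapter10.zip", "chapter10")` (the archive name here is an assumption, not a file name taken from the book's web site) returns the extracted paths, and `looks_like_fasta` can then be applied to a file such as ABCC9fasta.txt before uploading it to a web server.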
File formats, editing, and saving

Text documents that need to be uploaded to the web sites used in these protocols (for example, files containing sequences for analysis) will need to be in specific formats appropriate to the site. These are almost always 'text only' formats so, if you view or edit a file using a word processor such as Microsoft Word, it is essential to save the file as a 'text only' document: the web server will not be able to cope with a Word '.doc' file. However, copying and pasting from a word processor document into a text-box on a web site is generally OK.

Another frustrating problem that you may encounter is that different text editors use different hidden symbols to mark the end of a line. This can cause problems when uploading a file (even a 'simple' text-only file) to some web sites. If this happens, try using a different text editor (the simpler the better) to open and save the document before uploading it. Again, if the web site allows you to paste (instead of uploading) the file contents, this problem tends not to arise.

Some software, including e-mail programs, can introduce other changes into your text without your noticing. Watch out for the distinction between tabs and spaces, and between 'line wraps' (where a continuous line of text is displayed on two lines to fit into a window) and true line breaks (where an 'end of line' symbol breaks the line).

Finally, be aware that a few web servers that require an e-mail address will balk if your address contains hyphens (e.g., '[email protected]') or some other characters.

The ever-changing web

Web sites and database contents are constantly changing - that is their point. I have tried to concentrate on sites that are likely to be stable, and to indicate cases where database updates may give results that differ from those in the protocols. If you find that things are not working exactly as described in this book, bear in mind the following points:

• A database search performed today might give slightly different results from those described here (for example, more hits if the database has expanded; or a different location for a gene if a genome sequence has been revised).
• Web sites may be updated, usually with additional features or cosmetic changes. If the button or link described in the text is not where you expect it to be, a little hunting will generally find it or its new equivalent.
• If a given web site is unavailable, try going back to the root address. For example, if http://www.hugelab.ac.im/bobslab/bioinformatics/tools/phylogeny/annhialigner no longer works, try going back to http://www.hugelab.ac.im/bobslab/ or even http://www.hugelab.ac.im and then follow links to the new location of the site you are after. If all else fails, a Google search for the name of the database or software will often lead you to its new home.

Explore!

Almost all of the web sites and resources described in this book offer many more functions, options, settings, and links than those described here. Spending a little time playing and trying 'nondefault' settings will yield unexpected (and sometimes even useful) results.

Abbreviations

AIC       Akaike information criterion
ASRV      among-site rate variation
CDD       Conserved Domains Database
CDR       complementarity-determining region
CDS       coding sequence
COGs      Clusters of Orthologous Groups of proteins
CPU       central processing unit
DAG       directed acyclic graph
DAS       distributed annotation system
E value   expectation value
EBI       European Bioinformatics Institute
EC        Enzyme Commission
EM        expectation-maximization
EMBnet    European Molecular Biology network
EMBOSS    European Molecular Biology Open Software Suite
EST       expressed sequence tag
EVD       extreme value distribution
GCG       Genetics Computer Group
GEO       Gene Expression Omnibus
GFF       generic file format
GHMM      generalized hidden Markov model
GI        GenInfo Identifier
GNN       Genome News Network
GO        Gene ontology
GOLD      Genomes On Line Database
GSS       genome survey sequence
GTR       general time reversible
HIV       human immunodeficiency virus
HMM       hidden Markov model
HSE       heat-shock element
HSP       high-scoring sequence pair
HTH       helix-turn-helix
IUPAC     International Union of Pure and Applied Chemistry
JGI       Joint Genome Institute
MAP       maximum a priori
MCMC      Markov chain Monte Carlo
MGI       Mouse Genome Informatics
ML        maximum likelihood
MP        maximum parsimony
MSA       multiple sequence alignment
MSD       Macromolecular Structure Database
MSP       maximum scoring pair
NCBI      National Center for Biotechnology Information
NIH       National Institutes of Health
NMR       nuclear magnetic resonance
NTF2      nuclear transport factor 2
OMIM      Online Mendelian Inheritance in Man
ORESTES   ORF ESTs
ORF       open reading frame
PDB       Protein Data Bank
PIR       Protein Information Resource
PRF       Protein Research Foundation
PSSM      position-specific scoring matrix
PTHLH     parathyroid hormone-like hormone
PWM       position weight matrix
RCSB      Research Collaboratory for Structural Bioinformatics
RT        reverse transcriptase
RT-PCR    reverse transcriptase polymerase chain reaction
SAGE      serial analysis of gene expression
SCPD      Promoter Database of Saccharomyces cerevisiae
SGD       Saccharomyces Genome Database
SNP       single-nucleotide polymorphism
SOAP      simple object access protocol
SRP       signal recognition particle
SSP       secondary structure profile
SSR       simple sequence repeat
STS       sequence-tagged site
TF        transcription factor
TFBS      transcription factor-binding site
TGI       TIGR Gene Indices
TIGR CMR  The Institute for Genomic Research Comprehensive Microbial Resource
UCSC      University of California Santa Cruz
UID       unique identifier
UTR       untranslated region
wwPDB     Worldwide Protein Data Bank
XBP-1     X-box-binding protein 1

Color section

Chapter 2. Navigating sequenced genomes

Figure 5. Screenshot of the ARTEMIS sequence viewer and annotation tool (see page 31).
The main window is divided into several sections. Below the main menu is information about the current selection and the sequences being viewed. Below this (and filling most of the top half of the screen), the 'overview' section shows stop codons in all six reading frames (short vertical black lines) and features on both strands (colored boxes, mainly in blue and yellow); the vertical scroll bar on the right controls the scale (zoom) of this window and the horizontal one scans along the sequence. The 'base view' (just below the overview) shows the sequence of both strands and, above and below these, the translation in all six frames; again, the scroll bars control the zoom and position of this window. The bottom third of the screen shows a list of annotated features. Many other aspects of the sequence can be displayed or hidden (see text).

Figure 6. Screenshot showing part of the 'Detailed view' panel of the 'Contig View' page of the Ensembl genome browser (see page 36).
The data that was uploaded in Protocol 7 is shown (dark bars just below the chromosome length scale) in the track called 'NavigatingGenomesTrack'. In this shot, the user has clicked on one of the uploaded features (P61205), and the small pop-up window displays information about this feature.

Chapter 3. Sequence similarity searches

Figure 4. PSI-BLAST output (see page 64).
Part of the results after the second PSI-BLAST iteration are shown. The output format is essentially the same as for BLASTP (see Fig. 2), but new family members found in this iteration are indicated by 'NEW'. Family members found in the previous round are indicated by green dots. Sequences shown below the horizontal dividing line have too high an E value; only those above the line will be used in compiling the PSSM (or 'profile') of the protein family for the next iteration of the search.

Chapter 4. Gene prediction

Figure 3. Mouse gene predictions viewed in ARTEMIS (see page 89).
The track immediately below the distance scale in the top section shows reversed alignment to EST BB280527 (extreme left), produced by EST2GENOME (see Protocol 11). The track below this shows the EST2GENOME alignment to BY742253, showing alignment of four parts of this EST to parts of the genomic sequence. The track below that shows three BLASTN hits (labeled BLASTCDS) to M. musculus ESTs. Beneath this, a three-exon gene predicted by GENEMARK.HMM is shown; a second GENEMARK prediction - a small single-exon gene - appears at the extreme right on the forward strand (shown above the distance scale). The lowest track in this section (just above the top-most horizontal scroll bar) shows the gene predicted by SGP2. The second section of the screen (between the top two horizontal scroll bars) shows the same region, but with the three reading frames represented as individual lines above (forward strand) and below (reverse strand) the distance scale; features are shown in the relevant reading frame; vertical lines are stop codons. The third section shows a close-up view of part of the sequence (nt 2960-3040) including nucleotide and amino acid sequences. The bottom section lists the annotated features.

Figure 4. M. leprae genome viewed in ARTEMIS (see page 97).
The top section shows the codon usage plots for the forward (upper) and reverse (lower) strands in each frame. In the section below this, the tracks immediately above and below the distance scale show the CDSs of the published annotation (forward and reverse strands). The next tracks (working outwards from the distance scale) show the BLASTX matches (labeled 'CRUNCH X') to the M. tuberculosis protein set. The next tracks show the GENEMARK.HMM predictions (numbered) and then the trimmed ORFs of >150 amino acids (labeled CDS). The lower sections of the screen show the same information in greater detail, with the annotations listed in the bottom section.

Chapter 7. Expressed sequence tags

Figure 6. Visualizing cluster assemblies using CLVIEW (see page 160).
CLVIEW presents cluster assemblies in a zoomed view (a), displaying a directory tree on the left of the window and the aligned DNA sequences on the right; or in an overview of the assembled reads (b), displaying the consensus sequence with a yellow background, followed by a pile of aligned reads marked with a blue background. Base discrepancies are labeled in red and may represent potential SNPs.

Chapter 8. Protein structure, classification, and prediction

[Screenshot: RCSB Protein Data Bank, 'An Information Portal to Biological Macromolecular Structures'. Entry 1IDP: crystal structure of scytalone dehydratase F162A mutant in the unligated state.]
© RCSB Protein Data Bank

Figure 5. A page from the RCSB site, showing the Structure Explorer summary page for entry 1idp (M. grisea scytalone dehydratase) (see page 175).
Bibliographical information is shown, as well as some data about the structure and its determination, links to other databases, and a picture.

PolyPhobius prediction

Prediction of Q2PR35|Q2PR35_FUGRU

ID   Q2PR35|Q2PR35_FUGRU
FT   TOPO_DOM      1     25       NON CYTOPLASMIC.
FT   TRANSMEM     26     50
FT   TOPO_DOM     51     59       CYTOPLASMIC.
FT   TRANSMEM     60     80
FT   TOPO_DOM     81     97       NON CYTOPLASMIC.
FT   TRANSMEM     98    120
FT   TOPO_DOM    121    140       CYTOPLASMIC.
FT   TRANSMEM    141    162
FT   TOPO_DOM    163    196       NON CYTOPLASMIC.
FT   TRANSMEM    197    222
FT   TOPO_DOM    223    238       CYTOPLASMIC.
FT   TRANSMEM    239    260
FT   TOPO_DOM    261    272       NON CYTOPLASMIC.
FT   TRANSMEM    273    292
FT   TOPO_DOM    293    312       CYTOPLASMIC.
//

[Plot: PolyPhobius posterior probabilities for Q2PR35|Q2PR35_FUGRU. Legend: transmembrane, cytoplasmic, non cytoplasmic, signal peptide.]

The prediction is based on an alignment. The probability data used in the plot is found here, and the gnuplot script is here.

Figure 13. Output of the PHOBIUS website, showing an example of the prediction of transmembrane regions and signal sequences (see page 191).

Coils output for FOS_CHICK

Proto-oncogene protein c-fos
[ISREC-Server] Date: Mon Nov 27 23:53:01 Europe/Zurich 2006

coils -def -in=../wwwtmp/.COILS.29000.7081.seq -out=../wwwtmp/.COILS.29000.7081.out -mat=2

# COILS version 2.1
# using MTIDK matrix
# no weights
# Input file is ../wwwtmp/.COILS.29000.7081.seq
#>FOS_CHICK Proto-oncogene protein c-fos, 419 bases, 4DDEA701 checksum.

[Plot: Coils output for FOS_CHICK Proto-oncogene protein c-fos.]

You can get the prediction graphics shown above in one of the following formats:
• GIF-format
• Postscript-format
• numerical format (window 14 21 28)

Back to ISREC home page

Figure 15. Output of the COILS website, showing an example of the prediction of coiled-coil regions (see page 193).
Chapter 10. Prediction of protein function

[Plot: TMHMM posterior probabilities for UniProt_Swiss-Prot_O60706_ABCC9_HUMAN, with curves for
transmembrane, inside, and outside.]

Figure 4. Graphical output from TMHMM (see page 219).
The lower part shows the probability that each part of the sequence lies inside or outside the cell or in a
transmembrane helix, whilst the upper part predicts the organization of the protein.

[Diagram: PEPNET of FOS_HUMAN from 1 to 231]

Figure 7. Helical representation of the sequence of the human fos oncogene using PEPNET from the
EMBOSS suite of programs (see page 223).
Note the leucine zipper between positions 165 and 193. (Only the first 231 amino acids of the protein
are shown here.)
PPSearch Output

ppsearch (c) 1994 EMBL Data Library
based on MacPattern (c) 1990-1994 R. Fuchs

PROSITE pattern search started: Thu Jun 15 12:56:22 2006

Sequence file: /ebi/extserv/old-work/ppsearch-20060615-12562181829685.input

Sequence /ebi/extserv/old-work/ppsearch-20060615-12562181829685.input (260 residues):

Matching pattern BZIP_BASIC:
74: RRKLKNRVA
Total matches: 1

Total no of hits in this sequence: [...]

[...](s) searched in 1 sequence(s), 260 residues.
[...] hits in all sequences: [...]

Figure 10. Expected output of PPSEARCH when used to search for patterns in 'USERSEQ1_fasta.txt' (see
page 226).

Figure 14. Jpred output shown in JALVIEW (see page 232).
The upper part of the screen shows the alignment of the query protein with others of [...]
(shading indicates conservation) and contains a screen image of the results obtained [...]
sequence above using JALVIEW. Below this, the lines beginning 'Lupas' indicate predicted coiled-coil regions
(at three different window sizes). The 'JNETSOL' lines indicate which residues are likely to be accessible to
solvent ('B' indicates a buried residue: one that is less than 25%, less than 5%, or 0% exposed to solvent
in the three tracks). The remainder of the tracks all relate to secondary structure predictions: 'JNETPSSM',
'JNETFREQ', 'JNETHMM', and 'JNETALIGN' show predictions made by various methods (red tubes are
α-helices; green arrows are β-sheets); asterisks in the 'JNETJURY' track show where these predictions
disagreed and had to be resolved, whilst the 'jnetpred' track shows the consensus secondary-structure
prediction. Finally, the histogram (and the corresponding numerical values beneath it) show the [...]
of the prediction shown in 'jnetpred': for example, the large helix on the left includes one region [...]
its right end where the prediction is less certain, approximately at residues 113-116.
[Multiple sequence alignment]

Figure 15. A multiple alignment using DBCLUSTAL (see page 234).

Figure 16. Part of the results of INTERPROSCAN for the protein RXRA_HUMAN (see page 237).

CHAPTER 1
Database resources for wet-bench scientists
Neil and Lynn Schriml

With the increasing amount of data being generated by genomic-scale studies, it
has become much more important for biologists to store data in a structured way
that makes it easily accessible and allows the integration of different data types
from different sources. Hence, there are now hundreds of specialized databases
available to biological researchers that cover a vast array of different data types,
from transcription factor binding sites to metabolic pathways and from protein
domains to scientific journals. For these data resources to be properly exploited,
one has to understand how the data is generated and structured; otherwise there
is a danger that important data may be missed or, worse still, incorrect data blindly
trusted.
Because of the number and diversity of data resources available, we cannot
describe them all in a single chapter, so here we will describe the publicly available
databases at the National Center for Biotechnology Information (NCBI) (1)
(http://www.ncbi.nlm.nih.gov/ 1.1) and provide examples of how you can query different
information using their online tools. This chapter complements Chapter 2, which
explores resources for navigating sequenced genomes with the emphasis on a
second major group of tools - Ensembl. It should be noted also that there are
many other tools available at different web sites and we will mention some of
these later in this chapter.

1.1 Types of databases

The publicly available web resources described in this chapter can be divided into
two types: primary and secondary databases. Whilst both types of database serve
a useful purpose, one has to understand the distinction before making assertions
based on a database query.

Bioinformatics: Methods Express (Paul H. Dear, ed.)
© Scion Publishing Limited, 2007

1.1.1 Primary databases
Primary databases include all of the repositories of the primary output from
experimental work, such as GenBank (2) (http://www.ncbi.nlm.nih.gov/Genbank/index.html 1.2),
which contains nucleotide sequences, and ArrayExpress (3) (http://www.ebi.ac.uk/arrayexpress/ 1.3)
at the European Bioinformatics Institute (EBI), which contains microarray expression data.
These repositories generally house information that is submitted by the scientist who
generated it, and little is done to process, curate, or provide quality control over what is
entered. Therefore, they are usually very comprehensive, but one must always treat the data
with caution.

1.1.2 Secondary databases
Secondary databases are less all-inclusive than the primary databases, but instead
concentrate on data quality and include additional information and cross-referencing.
Secondary databases usually draw on (and may be linked to) a number of primary databases,
collecting together information centered on a particular topic. For example, Pfam (3)
(http://www.sanger.ac.uk/Software/Pfam/ 1.4) is a secondary database that curates protein
domains and allows users to search proteins for known domains, whereas the Mouse Genome
Informatics (MGI) (http://www.informatics.jax.org/ 1.5) collects and curates genomic, genetic,
and functional data associated with the laboratory mouse. There is a clear distinction between
these two examples of secondary databases: Pfam covers one theme (protein domains) and does
so for all organisms, whereas MGI (4) covers many types of data but for only one species.

1.2 Database resources at NCBI
At the time of writing, NCBI has over 20 databases, which can be searched either
individually or en masse. Entrez (5) is a system that provides a common interface
to all of the major NCBI databases, including PubMed (6), nucleotide and protein
sequences, protein structures, complete genomes, taxonomy, and many others. It
provides a consistent user interface and format, and allows queries to be made
across multiple NCBI databases at once. The starting point for Entrez is
http://www.ncbi.nlm.nih.gov/gquery/gquery.fcgi 1.6, and an overview, showing all
Entrez-accessible databases and the connections between them, is available at
http://www.ncbi.nlm.nih.gov/Database/datamodel/ 1.7.
Each of these databases (which are often called nodes) will contain entries
with unique identifiers (UIDs). These identifiers are stable over time, whilst the
data associated with them can change. For example, a gene will always have the
same UID, but over time we may discover a new function for it or new splice sites,
so its annotation and sequence could change. These entries can be linked within
and between nodes, which allows, for example, publication entries to be linked
to protein entries. Each node has a specific entry format, although many features
will recur among the different nodes. For example, there is a fully annotated
GenBank record at http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html 1.8. Of
particular interest are the three identifiers you will find in this single record: the
locus name, the accession number, and the GI. The locus name (in this example
SCU49845) is unique to each entry and in many cases it may be identical or similar
to the accession number (in this case, U49845). The accession number will always
remain the same even if the entry is changed. There is also a version number,
which indicates how many times the entry has been updated: U49845.1 indicates that
this is the first version of this entry. The GI is the 'GenInfo Identifier': if a sequence
or protein translation changes in any way, a new GI number will be assigned, so GI
sequence identifiers run parallel to the version numbers.

1.2.1 Nucleotide databases at NCBI
The main nucleotide database at NCBI (and, like the rest, accessible through
Entrez) is GenBank (2), which contains all of the publicly available DNA sequences.
However, there are subdivisions of GenBank that can be searched independently
of the complete database. Depending on what you are looking for, it may be better
to search a subdivision rather than the whole dataset. For example, dbEST (7, 8)
contains only that subset of sequence data and other information that relates to
'single-pass' cDNA sequences or expressed sequence tags (ESTs); similarly, dbGSS
contains only single-pass genome survey sequences.
Further nucleotide databases (again, Entrez-accessible) exist outside GenBank.
For example, dbSNP (9) (http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=snp 1.9)
is a database of single-nucleotide polymorphisms, small insertions, and deletions.
There is also a sequencing trace archive (http://www.ncbi.nlm.nih.gov/Traces/trace.cgi 1.10),
which contains sequences that have been submitted along with all of
their underlying experimental data, so that you can view the original trace file
that generated them. This data could be particularly useful if you wish to check
the validity of a frameshift or insertion in a gene of interest.

1.2.2 Protein databases at NCBI
Protein databases can contain different levels of information, from primary sequence
to secondary and tertiary structures. As well as databases that contain entire
peptide sequences, there are a number of resources dedicated to collecting and
curating protein domains and motifs. The NCBI protein database is a concatenation
of a number of subdatabases: it includes sequences from Swiss-Prot (10, 11), the
Protein Information Resource (PIR) (12), the Protein Research Foundation (PRF),
and the Protein Data Bank (PDB) (13), along with protein sequences translated from
nucleotide sequences of RefSeq (10, 11) and GenBank. Therefore, when you search
using the Entrez server, you will be doing a comprehensive search of all publicly
available sequences. As with the nucleotide databases, the quality of these protein
entries varies depending on their source. For example, Swiss-Prot has highly
curated annotations and should be nonredundant, whereas the translations from
GenBank will contain sequence errors and misannotations in a number of cases.
Additional protein-related information is available in NCBI's Structure database
(http://www.ncbi.nlm.nih.gov/Structure/ 1.11) and NCBI's Conserved Domains
Database (CDD) (http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml 1.12). CDD
contains domains from Pfam (14), Simple Modular Architecture Research Tool
(SMART) (15), and Clusters of Orthologous Groups (COG) (16), as well as other
domains curated at NCBI. The major utility of the domain database is for identifying
domains in a protein sequence, which will allow the user to infer a function (also
see Chapter 8).

1.2.3 Other databases at NCBI
As well as the major nucleotide and protein databases, NCBI houses a number
of other related databases (nodes) that are linked to the sequence databases, as
well as being themselves browsable and searchable through Entrez. One of the
most commonly used databases is PubMed, which is a repository of biomedical
journal articles. As well as being searchable using text queries, all of the articles
in PubMed are linked to other NCBI entries related to them, such as nucleotide
sequences. Similarly, there are databases of chemical structures (PubChem),
microarray experiments (Gene Expression Omnibus, GEO) (17), taxonomy (Entrez
Taxonomy), genes (Entrez Gene), maps (Map Viewer), and inherited diseases
(Online Mendelian Inheritance in Man, OMIM) (18, 19), among others. Whilst
some of these datasets may not seem obviously useful to your particular area of
research, much of their functionality is derived from the fact that related records
in each database are all linked, so the user is able to traverse between datasets
by following the links provided. For example, search the PubChem database for
'ethanol' and it will return not only a structure and description of the compound
but also links to protein databases of enzymes that bind ethanol, as well as
toxicology reports in the National Library of Medicine and relevant publications
in PubMed.

Here we discuss, in a little more detail, some of the tools available at NCBI through
Entrez, tailored for searching their nucleotide, protein, and other databases.
We then provide a set of protocols illustrating how to answer specific,
typical bioinformatics questions.
It is not possible to cover more than a tiny fraction of the resources, tools, and
methods of query that are available through Entrez. However, we suggest that you
start with the examples in the protocols and then take these as starting points to
explore on your own.

2.1 Searching databases at NCBI

2.1.1 Text searches
The simplest and broadest search of NCBI databases is offered via the Entrez entry
page: http://www.ncbi.nlm.nih.gov/gquery/gquery.fcgi 1.6. Here, you can perform
a simple text-based search across all databases or you can choose a specific NCBI
database to search, such as PubMed or Nucleotide. If you are searching across
all databases, the simplest search you can perform is a text search of all fields.
Protocol 1 gives a simple example.

Protocol 1
A simple text search for Plasmodium across NCBI databases using Entrez
1. Start at the NCBI home page: http://www.ncbi.nlm.nih.gov/ 1.1 and select Entrez home on the
   right to go to the Entrez cross-database search page:
   http://www.ncbi.nlm.nih.gov/gquery/gquery.fcgi 1.6.
2. In the text box towards the top of the screen ('Search across all databases'), type 'Plasmodium'.
   Click the adjacent Go button, or press 'Enter' on your keyboard.
3. The page will be updated. Next to each of the database names and icons in the lower part of
   the screen will be a number (or, in a few cases, 'none'). This indicates the number of entries in
   each of these databases that contain, anywhere within them, the word 'Plasmodium'.
4. Many of these records will not relate to Plasmodium itself. For example, some will describe
   proteins from other species that are noted as interacting with, or being similar to, Plasmodium
   proteins.
5. We therefore want to limit the search to entries in which the organism is Plasmodium. To do
   this, repeat the query, this time using the text 'Plasmodium [ORGANISM]' (see Note a).
6. The page will be updated, this time giving the number of entries (in each of the databases) in
   which Plasmodium is in the 'Organism' field.
7. Clicking on any of the results will take you to the respective results page, listing all of the
   Plasmodium entries found. For example, clicking on Nucleotide or the adjacent icon will take
   you to the start of a list of over 200000 Plasmodium nucleotide sequence entries.

Note
a. A list of fields that can be searched is given here:
   http://www.ncbi.nlm.nih.gov/entrez/query/static/help/Summary_Matrices.html#Search_Fields_and_Qualifiers 1.13.
   Many fields can be abbreviated; for example, 'ORGN' can be used in place of 'ORGANISM'.

Search terms can also be combined; for example, searching for 'malaria AND
mosquito' will find all entries that contain (anywhere within the entry) both
'malaria' and 'mosquito'. Similarly, 'Plasmodium NOT Plasmodium [ORGANISM]' will
find all entries that refer to Plasmodium, but that do not originate from the
organism Plasmodium itself. More sophisticated searches can be made by querying
each database individually, rather than globally. The advantage of this is that it
will allow you to define your search using fields specific to that database. It is also
possible to view a 'history' of previous searches and to combine these together
to refine the search further. Protocol 2 gives a simple example of searching a
single database, using 'limits' and 'history' to build up a progressively more refined
query.
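
The field-restricted and Boolean queries used in the text searches of this section can also be assembled programmatically before being typed or pasted into an Entrez search box. The following is a minimal Python sketch that only builds query strings and contacts no server; the helper names `tag` and `combine` are hypothetical (they are not part of any NCBI tool), and the only field name used is 'ORGANISM', taken from the text.

```python
def tag(term: str, field: str) -> str:
    """Restrict a search term to one Entrez field, e.g. Plasmodium[ORGANISM]."""
    return f"{term}[{field}]"


def combine(operator: str, left: str, right: str) -> str:
    """Join two query fragments with an upper-case Boolean operator."""
    if operator not in {"AND", "OR", "NOT"}:
        raise ValueError(f"unsupported operator: {operator}")
    return f"({left}) {operator} ({right})"


# Entries whose organism is Plasmodium.
organism_only = tag("Plasmodium", "ORGANISM")

# Entries containing both words anywhere in the record.
both_words = combine("AND", "malaria", "mosquito")

# Entries that mention Plasmodium but do not originate from it.
mentions_only = combine("NOT", "Plasmodium", organism_only)

print(organism_only)
print(both_words)
print(mentions_only)
```

The parentheses are added defensively so that longer fragments keep their grouping when they are combined again; for single words they are harmless.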
Protocol 2
A search in PubMed using Limits and History
1. Navigate to the Entrez entry page (http://www.ncbi.nlm.nih.gov/gquery/gquery.fcgi 1.6) and click
   on the PubMed link towards the top of the page.
2. In the text field at the top of the page, enter 'malaria' and click the adjacent Go button.
3. The result is a list of over 40000 (at the time of writing) entries that contain the word
   'malaria' in any part of the entry.
4. We will repeat this search, restricting it to articles with 'malaria' in their title. Click on the
   Limits tab (just below the text entry box). Scroll to the bottom of the new page to find the
   pull-down menu Default tag. Select Title from the pull-down menu. Click the adjacent Go
   button (or scroll back up the page and click the one at the top).
5. The result is a list of about 20000 articles, all with 'malaria' in their title.
6. We will now look at our previous searches and combine them to refine the search. Click on
   the History tab.
7. Into the text entry field, type 'mosquito'. Also, click on the ticked box on the Limits tab to
   'untick' it. This removes our previous limits settings.
8. Below, you will see a list of your most recent searches. The top-most one in the list will be:
   #xx Search malaria Field: Title
   where 'xx' is a number. Click on the number.
9. A pop-up menu of options will appear, asking how you want to combine your previous search
   (for articles with 'malaria' in the title) with the current search (for 'mosquito', not limited to
   the title). Click on AND.
10. The text box should now show:
   (mosquito) AND (#xx)
   meaning that we are about to search for records that contain 'mosquito' in any part of the
   entry and that also contain 'malaria' in the title (from our previous search). Click Go.
11. The result is around 3500 entries (at the time of writing), each with 'malaria' in the title and
   with 'mosquito' somewhere in the entry.

NCBI provides a web page giving further details of how to search their
databases at http://www.ncbi.nlm.nih.gov/entrez/query/static/help/helpdoc.html#Searching 1.14.
The following protocols give some examples of other ways to search
the NCBI databases. The examples are in no way exhaustive, but they will introduce
you to a range of search types that can form the basis of your own explorations.

Protocol 3
Determining the set of web resources available for a genome
1. Start at NCBI's home page (http://www.ncbi.nlm.nih.gov/ 1.1) and click on the Genomic Biology
   link in the left blue bar. This link takes you to the genomic biology page:
   http://www.ncbi.nlm.nih.gov/Genomes/ 1.15.
2. Under Genome resources on the right, select Eukaryotic to go to an alphabetic list of genome
   projects, listed by species: http://www.ncbi.nlm.nih.gov/genomes/leuks.cgi 1.16.
3. Scroll down to find Plasmodium falciparum 3D7 (at the time of writing, only one genome
   project is listed for this strain). On the same line, you will see a number of links, including a
   taxonomic identifier ('36329') and a link to the sequencing consortium's home page.
4. You will also see, at the right of the line, a series of colored abbreviations for different NCBI
   databases (PM, R, G, etc.). Clicking on any one of these will bring up data on Plasmodium
   falciparum 3D7 from the appropriate database. For example, clicking on G will show you all
   of the entries in the Genes database for this organism. If necessary, use your browser's 'back'
   button to return to http://www.ncbi.nlm.nih.gov/genomes/leuks.cgi 1.16.
5. Clicking on the organism name at the left will take you to the Genome Project database
   and display entries for P. falciparum, offering further links to data and resources for this
   organism.

NCBI Map Viewer provides one way to access positional genome information and to
integrate it into searches. Protocol 4 takes you through a typical use of Map Viewer.

Protocol 4
Finding sequence-tagged site (STS) markers on chromosome 3 of P. falciparum using the NCBI Map Viewer
1. Navigate to the Map Viewer home page (http://www.ncbi.nlm.nih.gov/mapview/ 1.17) using the
   Map Viewer link on the NCBI home page (http://www.ncbi.nlm.nih.gov/ 1.1).
2. On the Map Viewer home page, select Plasmodium falciparum
   (http://www.ncbi.nlm.nih.gov/mapview/map_search.cgi?taxid=36329 1.18) from the pull-down
   Search menu (leave the text field empty) and click Go. You will be taken to the Map Viewer
   page for P. falciparum, including an ideogram of the karyotype.
3. Enter 'STS' and '3' in the text fields Search for and on chromosome(s), respectively, and click
   Find.
4. The results of your query are presented as hits on chromosome ideograms and in a tabular
   format. View the results in the Map Viewer graphical display by clicking on the 3 underneath
   the chromosome in the ideogram (to show all STSs) or by clicking on the blue links in the table
   below. Click on the first Map Element in the table (at the time of writing, this was Pf2541).
5. The resulting page will show STSs in a part of the chromosome, with the chosen STS (Pf2541)
   indicated. Clicking on its name will call up further information on that STS, including the
   polymerase chain reaction primer sequences.
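
The History mechanism used in Protocol 2, in which an earlier search is referred to by its number and combined with a new term, can be modelled in a few lines. This is an illustrative Python sketch only: the `SearchHistory` class is hypothetical, it runs no real search, and it merely mimics how a '#xx' reference stands for a stored query. The notation 'malaria[Title]' is used here as shorthand for the search 'malaria' limited to the Title field.

```python
import re


class SearchHistory:
    """Toy model of a numbered search history (hypothetical, not an NCBI API)."""

    def __init__(self) -> None:
        self._queries: list[str] = []

    def add(self, query: str) -> int:
        """Store a query and return its 1-based history number."""
        self._queries.append(query)
        return len(self._queries)

    def expand(self, query: str) -> str:
        """Replace every '#n' reference with the stored query it points to."""
        def lookup(match: re.Match) -> str:
            return self._queries[int(match.group(1)) - 1]

        return re.sub(r"#(\d+)", lookup, query)


history = SearchHistory()
n = history.add("malaria[Title]")       # the earlier, title-limited search
combined = f"(mosquito) AND (#{n})"     # what the text box shows in step 10
expanded = history.expand(combined)     # the search that is effectively run
print(combined)
print(expanded)
```

Keeping the reference symbolic until the search is run is what lets a history entry be reused in several different combinations.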
8 CHAPTER 1: DATABASE RESOURCES FOR WET-BENCH SCIENTISTS METHODS AND APPROACHES 9

. - - - - - - - - - - - - -_ _ _ _ _ _ _ _ _ _ _ _ _ ""'_ _ ~"~"'_"N "

ProtocolS Protocol'
Searching for a-tubulin genes in P. falciparum Simple BLAST searches at NCBI
This search could be started from the Entrez Home page (searching all NCBI databases and then 1. Navigate to the BLAST home page (http://www.ncbi.nlm.nih.gov/BLAST/1.20) from the BLAST link
selecting those hits from the Gene database). Alternatively, as here, we can navigate to the Entrez included in the query bar found at the top of most NCBI pages.
Gene page to search only that database.
2. The BLAST home page provides links to the suite of BLAST tools for comparisons between
1. Navigate from the NCBI home page (http://www.ncbi.nlm.nih.gov/1.1) to the Entrez home nucleotide or protein sequences. Searches may be conducted against highly divergent
page (http://www.ncbLnlm.nih.gov/gquery!gquery.fcgi 1.6) using the link on the right of the (discontinuous mega blast), the trace archive, the COD, gene expression data in
screen. single-nucleotide polymorph isms, immunoglobulins, etc. In this case, we will search for
2. Click on the Gene link or adjacent icon (left side of screen) to go to the Entrez Gene page proteins related to a Plasmodium a-tubulin, so select protein blast (under the 'Basic BLAST'
(http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene 1.19). heading).

3. Into the text box at the top of the screen, type: 3. On the 'Protein BLAST' page, leave all settings at their default values (note that we will be
searching against the 'm' or non-redundant database).
Plasmodium falciparum[ORGANISM] alpha tubulin
4. Open a new browser window and find the P. falciparum a-tubulin gene PFI0180w (see
(see Protocol 1 for an explanation of using '[ORGANISM)' to limit a search to entries originating
Protocol 5). When you have found the Entrez Gene page for this gene, scroll down to find
from a species). Click Go.
the heading 'NCBI Reference Sequence (RefSeq)' and click on the link to the gene product
4. The query should return 11 genes (correct at the time of writing) with a-tubulin in the (XP_001351911.1). This should bring up the corresponding Entrez Protein page and, scrolling
annotation. Click on the link for PFI0180w to get a detailed summary of its annotation. down, you will find the complete amino acid sequence for this protein. Copy this sequence
5. In this case, the PubMed reference for the gene (under Links on the right of the screen) (along with the numbers and spaces) and paste it into the Search box on the BLAST page.
links back to the genome project paper (Hall et an. rather than to an original paper about 5. Click on BLAST. You will be taken to a page saying that your request has been successfully
a-tubulin. From this, we might assume that this gene's annotation has been predicted by submitted. The page will be automatically updated when your results are ready (BLAST searches
homology to other tubulin genes, rather than having been verified experimentally. can take some time to complete).
6. Use your browser's 'back' button to return to the Entrez Gene page for PFI0180w. Scrolling
   down the page to the section headed 'General gene information' shows that all of the Gene
   Ontology terms relating to this protein (see Chapter 9 for an introduction to Gene Ontology)
   were assigned on the basis of evidence-code 'IEA' ('Inferred from Electronic Annotation'),
   confirming our assumption that the annotation was not verified experimentally.
7. Additional examples of gene searches are given on the Entrez Gene home page.

2.1.2 Sequence similarity searches and alignment of transcripts to genomic sequences
A common method of querying sequence databases is by similarity searching. The
best-known tool for similarity searching is BLAST (Basic Local Alignment Search
Tool) (17), which allows you to search your query sequence against a database of
your choosing. Chapter 3 gives detailed information about BLAST and related tools;
here we introduce the use of such tools in the context of the NCBI databases.
Similarity searches of NCBI's nucleotide and protein sequence databases can
be restricted to sequences from one or more species either by specifying the
organisms in the Options section of the BLAST page or by submitting searches against
databases on the organism-specific BLAST pages (http://www.ncbi.nlm.nih.gov/BLAST/ 1.20).
Multiple query sequences can also be submitted in the same search
using the organism-specific BLAST pages.

6. When the results are ready, you will see a diagram representing the best matches. The colored
   bars indicate the score of the match and the portion of your query sequence that it matches.
   In this case, there will be many full-length red bars indicating many close matches to the
   complete α-tubulin sequence.
7. Below this are listed the hits in order of BLAST score (best first). If you click on the accession
   number of the hit, you will go to the GenBank entry for that protein. If you click on the BLAST
   score, you will go to the alignment (remember that there may be more than one alignment
   per hit).
8. Now try repeating the search, but looking only for matches in Arabidopsis thaliana. To do
   this, navigate to the 'Protein BLAST' page (step 3 above), but this time, type Arabidopsis
   thaliana into the 'Organism' text field (under 'Choose Search Set'), before continuing as
   before. (Note: as you type the organism name, you will be prompted with a list of likely
   organisms - simply click on Arabidopsis thaliana to save typing the complete name.)
9. The result this time is a smaller number of matches, some of them shorter or of lower score,
   to Arabidopsis sequences.

BLAST is not always the best tool for sequence alignment and NCBI provides other tools
that may be more appropriate for your needs. Alignment of an mRNA or cDNA sequence
to a genomic sequence can be computed using NCBI's SPIDEY (17)
(http://www.ncbi.nlm.nih.gov/IEB/Research/Ostell/Spidey/ 1.21) or SPLIGN
(http://www.ncbi.nlm.nih.gov/sutils/splign/splign.cgi 1.22) alignment tools. Chapters 4 and 7
cover the alignment of transcripts with genomic sequence in more detail. Protocol 7 gives a
simple example of using SPLIGN to compare a cDNA sequence with genomic sequence.
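The organism restriction used in the BLAST search above can also be written as an Entrez-style query term, which is handy if searches are scripted rather than typed into the web form. The following is a minimal sketch only: the parameter names (CMD, PROGRAM, DATABASE, QUERY, ENTREZ_QUERY) are assumed to follow NCBI's BLAST URL interface and should be checked against the current NCBI documentation, the peptide is a made-up placeholder, and nothing is sent over the network.

```python
from urllib.parse import urlencode


def organism_filter(organisms):
    """Build an Entrez-style restriction such as '"Arabidopsis thaliana"[Organism]'."""
    cleaned = [name.strip() for name in organisms if name.strip()]
    if not cleaned:
        raise ValueError("at least one organism name is required")
    # Several organisms are combined with OR, so a hit from any of them is kept.
    return " OR ".join(f'"{name}"[Organism]' for name in cleaned)


def blast_submission_params(query, organisms, program="blastp", database="nr"):
    """Assemble the form fields for a search request (assumed parameter names)."""
    return {
        "CMD": "Put",            # assumption: 'Put' submits a new search
        "PROGRAM": program,
        "DATABASE": database,
        "QUERY": query,
        "ENTREZ_QUERY": organism_filter(organisms),
    }


if __name__ == "__main__":
    # Hypothetical placeholder peptide, not a real database sequence.
    params = blast_submission_params("MKTAYIAKQR", ["Arabidopsis thaliana"])
    print(urlencode(params))
```

Keeping the filter in its own function makes it easy to reuse the same restriction for text searches as well as similarity searches.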
10 CHAPTER 1: DATABASE RESOURCES FOR WET-BENCH SCIENTISTS TROUBLESHOOTING !II 11

Protocol 7
Aligning a cDNA sequence to genomic sequence

1. From the NCBI home page, select Tools. From the tools page (http://www.ncbi.nlm.nih.gov/Tools/ 1.23)
   click on the Splign link (scroll down to find it) to go to the SPLIGN page
   (http://www.ncbi.nlm.nih.gov/sutils/splign/splign.cgi 1.22).
2. SPLIGN can be downloaded to run locally or you can submit a cDNA and genomic sequence to
   SPLIGN at NCBI, which we will do here. Click on the click here link
   (http://www.ncbi.nlm.nih.gov/sutils/splign/splign.cgi?textpage=online&level=form 1.24)
   to submit an online job.
3. You will see text boxes to accommodate the cDNA sequence and the genomic sequence with
   which to align it. In the cDNA box, you can either paste a sequence or specify a sequence by
   its accession number. In this case, we will specify a chicken cDNA sequence: type
   the accession number 'AJ744697' into the cDNA box.
4. In the 'Genomic' box, you can again specify the sequence either by pasting it in or by giving
   an accession number. However, you can also select from a list of whole genome sequences
   using the pull-down menu underneath. Use this to select Gallus gallus (chicken).
5. Click on the Align button and wait until the results are ready.
6. The cDNA aligns to two places in the same genomic sequence contig (listed under 'Subject' at
   the top of the results page). Each alignment is a 'Model'.
7. For each model, the alignment of the cDNA to the genomic sequence is shown. This alignment
   will return six segments (six putative exons). The yellow boxes at the top represent the cDNA
   divided into the aligned segments; the genomic sequence is shown below. The vertical blue
   lines in the cDNA represent indels and the red lines mismatches in the alignments. To view the
   alignment for each segment, click on the graphical display or on the segment number.

Results of whole genome-to-genome pre-computed sequence comparisons are
available in NCBI's Homologene resource, with highly similar sequences being
represented as distinct Homologene groups. Text queries of the Homologene
database yield results pages of matching Homologene groups containing highly
related sequences from multiple organisms.
Pre-computed orthologs can be searched and browsed using Clusters of
Orthologous Groups of proteins (COGs): http://www.ncbi.nlm.nih.gov/COG/new/ 1.25.
These are genes that are predicted to be functional equivalents because
they are derived by vertical descent from a single ancestral gene in
the last common ancestor of the compared species. You can also compare your
gene to known COGs using the kognitor tool:
http://www.ncbi.nlm.nih.gov/COG/grace/kognitor.html 1.26.
Results of pre-computed protein comparisons can be viewed at NCBI's Blink
(BLAST Link) resource (http://www.ncbi.nlm.nih.gov/sutils/static/blinkhelp.html).
Blink results are provided by links on other Entrez database pages (e.g. Entrez
Protein, Entrez Gene). Blink provides a tabular display of pre-computed highly
related proteins for all organisms in the Entrez Protein database including hits to
the CDD (http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml 1.28).

2.2 Downloading NCBI data sets
For most scientists, web sites such as NCBI will provide all of the functionality they
will ever need for their research. However, for people who want to run large numbers
of searches, web sites become impractical and it is sometimes necessary to download
entire datasets and software tools for searching them, so that the analysis can be
done on a local computer.
NCBI provides these at its FTP site: ftp://ftp.ncbi.nih.gov/ 1.29. In the blast
directory, you will find all of the protein and nucleotide databases available
through the NCBI web site, as well as executable files that you can install on
Windows, Mac OS X, or Linux operating systems. In the genomes directory, you
can download individual genomes. For people who want to create their own
scripts that incorporate NCBI datasets or NCBI software tools, there is a set of
file standards and software tools called the NCBI toolbox that will allow you to
process files, run searches, and format output on your own machine:
http://www.ncbi.nlm.nih.gov/IEB/ToolBox/MainPage/index.html 1.30.
One important caveat with installing your own datasets is that you will need
to update them constantly as they go out of date, whereas if you are running
searches over the internet this is not a problem. You should also test the output of
any executable that you install to make sure that it is running properly and that
your databases are indexed correctly.

• I can't find my genome at NCBI. Where else can I search?
  This problem is becoming less widespread but it is not unusual. If a genome
  is published, then the rule is that it must be submitted to NCBI. However, if
  the genome project is still ongoing or it is unpublished, then it may not have been
  submitted to GenBank but it may still be available. Many genome centers
  will submit ongoing projects to the trace archive at NCBI
  (http://www.ncbi.nlm.nih.gov/Traces/trace.cgi 1.31), so check there first. If that does not work,
  many ongoing projects have links from the genome project page:
  http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=genomeprj 1.32. If you still cannot
  find it, then you will have to search yourself; the major academic genome
  centers are listed below as additional web resources.
• How do I reconcile different versions of a genome at the various sites?
  Unfortunately, you may find more than one version of a genome depending
  on where you look. This will happen if a genome project is ongoing but puts
  an intermediate version in GenBank in order to make the sequence widely
  available. The genome center may then continue to update its own version
  whilst leaving the old version in GenBank. GenBank will give a submission date
  in the top line of the entry. One must keep in mind that the GenBank version
  will be the 'official' version of the genome; this means that you will be able
  to give an accession number for this record in publications so that others can
  see the data. The genome center version is a transient file on a web site that
  may disappear at any moment. So, whilst you may want to use the more recent
  version of the data, remember that it may not be there tomorrow.

Listed here is a selection of other web sites that are likely to be useful.

Major primary sequence generators
• The Wellcome Trust Sanger Institute: http://www.sanger.ac.uk/ 1.33
• JGI Genomes: Eukaryota, Archaea, Bacteria: http://genome.jgi-psf.org/tre_home.html 1.34
• Human Genome Sequencing Center, Baylor College of Medicine: http://www.hgsc.bcm.tmc.edu/ 1.35
• The Broad Institute: http://www.broad.mit.edu 1.36
• The Institute for Genomic Research: http://www.tigr.org 1.37. Now renamed the
  J. Craig Venter Institute (JCVI; http://www.jcvi.org)
• Washington University Genome Sequencing Center: http://genome.wustl.edu/ 1.38
• Genoscope: http://www.genoscope.cns.fr/ 1.39

Bioinformatics institutes
• The European Bioinformatics Institute: http://www.ebi.ac.uk 1.40
• National Center for Biotechnology Information: http://www.ncbi.nlm.nih.gov/ 1.1
• Center for Information Biology and DNA Data Bank of Japan: http://www.cib.nig.ac.jp 1.41

Genome annotation databases
• KEGG (Kyoto Encyclopedia of Genes and Genomes): http://www.genome.jp/kegg/ 1.42
• GeneDB (Sanger Institute Pathogen Sequencing Unit annotation database):
  http://www.genedb.org/ 1.43
• Ensembl Genomes: http://www.ensembl.org/index.html 1.44
• CMR (Comprehensive Microbial Resource) Annotated Microbial Genomes:
  http://pathema.tigr.org/tigr-scripts/CMR/CmrHomePage.cgi 1.45
• BRC Central (central web site of NIAID Bioinformatics Resource Centers,
  which houses databases of biodefense-related organisms): http://www.brc-central.org 1.46
• Genome properties (a database of curated and calculated properties
  of microbial genomes): http://cmr.tigr.org/tigr-scripts/CMR/shared/GenomePropertiesHomePage.cgi 1.47

Protein families, domains, and structures
• Pfam (curated protein domains): http://www.sanger.ac.uk/Software/Pfam/ 1.4
• TIGRFAM (curated protein domains of microbes): TIGRFAMs/index.shtml 1.48
• SMART: http://smart.embl-heidelberg.de/ 1.49
• InterPro (protein families, domains, and functional sites, in which identifiable
  features found in known proteins can be applied to unknown protein
  sequences): http://www.ebi.ac.uk/interpro/ 1.50
• Protein data bank (provides a variety of tools and resources for studying the
  structures of biological macromolecules): http://www.rcsb.org/pdb 1.51

Miscellaneous
• OBO (Open Biomedical Ontologies: an umbrella web address for well-structured
  controlled vocabularies for shared use across different biological and medical
  domains): http://obo.sourceforge.net/ 1.52
• Amigo (a web interface for browsing gene ontologies, which will allow you
  to search for genes with specific functions or cellular locations, or that are
  involved in specific processes): http://www.godatabase.org/ 1.53

References
* 1. Wheeler DL, Barrett T, Benson DA, et al. (2006) Nucleic Acids Res. 34, D173-D180.
     - This publication gives an overview of the NCBI databases and their associated tools. This
     reference is the most up to date at the time of writing, but NCBI publishes an overview of
     changes and updates in the Nucleic Acids Research database issue, which is published
     every year.
  2. Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J & Wheeler DL (2006) Nucleic Acids
     Res. 34, D16-D20.
  3. Parkinson H, Sarkans U, Shojatalab M, et al. (2005) Nucleic Acids Res. 33, D553-D555.
  4. Eppig JT, Bult CJ, Kadin JA, et al. (2005) Nucleic Acids Res. 33, 5.
* 5. Geer RC & Sayers EW (2003) Brief. Bioinform. 4, 5. - A tutorial paper that covers some
     of the ground of this chapter but in more detail. It gives a useful overview of the concepts
     behind the Entrez tool using example tasks.
  6. McEntyre J & Lipman D (2001) CMAJ, 164, 1317-1319.
  7. Boguski MS, Lowe TM & Tolstoshev CM (1993) Nat. Genet. 4, 332-333.
  8. Banfi S, Guffanti A & Borsani G (1998) Trends Genet. 14, 80-81.
  9. Sherry ST, Ward MH, Kholodov M, et al. (2001) Nucleic Acids Res. 29, 308-311.
  10. Boeckmann B, Bairoch A, Apweiler R, et al. (2003) Nucleic Acids Res. 31, 365-370.
  11. Boeckmann B, Blatter MC, Famiglietti L, et al. (2005) C. R. Biol. 328, 882-899.
  12. Barker WC, Garavelli JS, McGarvey PB, et al. (1999) Nucleic Acids Res. 27, 39-43.
  13. Sussman JL, Lin D, Jiang J, et al. (1998) Acta Crystallogr. D Biol. Crystallogr. 54, 1078-1084.
* 14. Bateman A, Coin L, Durbin R, et al. (2004) Nucleic Acids Res. 32, D138-D141.
     - Pfam is possibly the most useful web resource available for gene function analysis.
     It is useful to understand how it is built and annotated before diving in and using it.
     This publication should be updated each year in the database issue of Nucleic Acids
     Research.
  15. Letunic I, Goodstadt L, Dickens NJ, et al. (2002) Nucleic Acids Res. 30, 242-244.
* 16. Tatusov RL, Fedorova ND, Jackson JD, et al. (2003) BMC Bioinform. 4, 41. - An in-depth
     description of how orthologous groups have been calculated for the COG database; it also
     describes the eukaryotic clusters (or KOGs). This database is an excellent tool for studying
     the phylogenetic coverage of genes.
  17. Barrett T, Suzek TO, Troup DB, et al. (2005) Nucleic Acids Res. 33, D562-D566.
  18. Hamosh A, Scott AF, Amberger JS, Bocchini CA & McKusick VA (2005) Nucleic Acids Res.
     33, D514-D517.
  19. Cantor MN & Lussier YA (2004) Medinfo, 11, 753-757.
CHAPTER 2
Navigating sequenced genomes
Melody S. Clark and Thomas Schlitt

World sequencing capacity and technologies were fuelled by the race to complete
the Human Genome Project and resulted in a massive investment in infrastructure,
machinery, techniques, and personnel. The effective completion of this project has
not seen a reduction in the amount of sequencing data generated. The techniques
learnt, especially shotgun sequencing, which can efficiently sequence large
vertebrate genomes, combined with the realization that low-density coverage of a
genome could prove almost as useful to scientists as a completed genome and with a
dramatic reduction in sequencing costs, brought about a liberation in the genomic
science field. Therefore, the 'excess' capacity was not closed down but maintained
so that, in theory, the DNA of any organism could be sequenced. A press release from
the National Institutes of Health (NIH) in August 2005 announced that the public
collections of sequence data (GenBank, EMBL, and DDBJ) had reached 100 gigabases
from over 165 000 different organisms, and the sequence databases continue to
grow at a tremendous rate. This continued production of sequencing data, much
of which is for comparative analyses, has gradually led to a standardization of data
presentation and genome viewers, some of which will be described here.
So what constitutes a 'sequenced' genome? It can take one of three forms:

• A completely sequenced genome (to a genome-center standard of 99.9%
  accuracy, with no gaps).
• A draft genome with perhaps only three- to tenfold coverage, such that the
genome is present in numerous contigs. The quality of the sequence data is
not shown and may be variable, particularly at the ends of the contigs. Repeat
sequences may be masked unless a single clone encompasses the whole repeat
region with accurate sequence readthrough. Misassemblies can occur, so care
should be taken when interpreting draft sequences (1).
• Ongoing genome sequencing data. This is very much 'work in progress', with
sequence information made publicly available in ongoing draft form and
with frequent updates. Again, repeat sequence data may be either masked or
removed from the draft sequence.
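To give a feel for what 'three- to tenfold coverage' means for the number of gaps, the sketch below uses the idealized Lander-Waterman (Poisson) model. This is an assumption introduced here for illustration, not a method described in this chapter: it treats reads as landing uniformly at random and ignores repeats and cloning bias, so real draft assemblies usually do worse. The function names and the read numbers in the usage section are hypothetical.

```python
import math


def fold_coverage(num_reads, read_length, genome_size):
    """Average number of times each base is sequenced: c = N * L / G."""
    return num_reads * read_length / genome_size


def expected_uncovered_fraction(coverage):
    """Under a Poisson model, the chance that a given base is hit by no read is e^(-c)."""
    return math.exp(-coverage)


if __name__ == "__main__":
    # Hypothetical project: 600,000 reads of 500 bases over a 100 Mb genome.
    c = fold_coverage(600_000, 500, 100_000_000)
    print(f"coverage: {c:.1f}-fold")
    for depth in (3, 10):
        missing = expected_uncovered_fraction(depth)
        print(f"{depth}-fold: about {missing:.4%} of bases expected to be unsequenced")
```

Even under these optimistic assumptions, low coverage leaves a noticeable fraction of the genome unsequenced, which is one reason why a draft genome is present as numerous contigs.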

Bioinformatics: Methods Express (Paul H. Dear, ed.)
Scion Publishing Limited, 2007
16 CHAPTER 2: NAVIGATING SEQUENCED GENOMES METHODS AND APPROACHES 17

This chapter will encompass all of these types of genomes and will explain how to
access the data, appreciate the limitations of the associated annotation, identify
sources of additional useful information, and download and manipulate the data.
The chapter provides a number of worked examples and also lists several different
genome viewers to use for both vertebrates and microbes. In general, the different
genome browsers present identical data sets in slightly different ways, and some
are easier to use than others. This chapter focuses on the use of Ensembl (for
vertebrates) (2) and the related site Integr8 (mainly dealing with microbes) for
several reasons:

• They combine the considerable resources of two major bioinformatics centers:
  the European Bioinformatics Institute and the Sanger Institute.
• Both are publicly funded with what appears to be a sustained commitment to
  future work and are not subject to the vagaries of commercial interest.
• They provide data as quickly as possible from the sequencing pipelines, allowing
  access to very early crude data releases.
• Numerous updates are provided.
• If you grasp Ensembl, then you do not need to learn another browser when
  accessing Integr8.

Having said that, they may not suit everyone, and the best way to get to grips
with genome viewers is to pick a gene, enter it into the different browsers, and
decide which one you like the best with regard to data presentation and ease of
use. Nevertheless, you would be well advised to explore Ensembl and Integr8 first,
as understanding how they are used will help in understanding other browsers.
You are also encouraged to explore references (3)-(6), which provide useful
information on a number of browsers and related resources.

2.1 Finding genome resources for an organism
If an organism has been (or is being) sequenced, then, generally speaking, every
scientist who works in that community is aware of it. However, this does not
necessarily mean that they will have privileged access to the data prior to the
obligatory Science or Nature genome paper or know where the data is being
stored. Also, the data may not be available immediately in one of the 'standard'
public genome browsers. There may also be a requirement to try to identify similar
genomes for comparative analyses. For this, you need to access a genome
project-monitoring web site, which keeps an up-to-date log of what is being sequenced,
with contact details (7). The most comprehensive site for this is the Genomes
OnLine Database (GOLD), which is maintained by Dino Liolios, Nektarios Tavernarkis,
Phil Hugenholtz, and Nikos Kyrpides (8-10). Protocol 1 is an example of a simple
query to GOLD.

Protocol 1
The Genomes OnLine Database (GOLD)

1. As an example, we will try to find out if there are any genome projects for the whiptail
   wallaby, Macropus parryi.
2. Go to http://www.genomesonline.org 2.1 and click on the GOLD Tables button.
3. You have a choice of the following buttons:
   • Published complete genomes
   • Archaeal ongoing genomes
   • Bacterial ongoing genomes
   • Eukaryotic ongoing genomes
   • Metagenomes
4. Click on Published complete genomes to call up a table of all published genome sequences.
   Several of the column headings at the top of the table are clickable links; clicking on one of
   these will sort the data according to that criterion.
5. Click Organism to sort the table by genus and species, and scroll down to look for M. parryi.
   There is no such entry (at the time of writing).
6. Perhaps there is an ongoing genome project for this species. We could go back and look
   through the list of 'Eukaryotic ongoing genomes', but instead we will use the 'search' function.
   Go back to the front page of GOLD and click on the Search GOLD link.
7. On the resulting page, you can specify what type of information you want to retrieve (using
   check boxes at the top of the screen; leave this at its default with the All fields box checked)
   and your search criteria (menus and fields in the lower part of the screen).
8. Under the Type menu on the left, select species and, in the adjacent text box, type 'parryi'. Press
   'Enter' to start the search (or click the Submit search button at the bottom of the screen).
9. The query returns no results (at least, at the time of writing). Alas, there does not appear to
   be any whiptail wallaby genome project. However, are any other members of the genus being
   sequenced?
10. Use the 'back' button to return to the search form and this time search for the genus
    'Macropus' (setting the type menu to genus).
11. This gives two results - both for Macropus eugenii genome projects. (For other species,
    different types of project such as expressed sequence tag (EST) projects may be listed.)
12. Clicking on the Taxonomy link for either of these two entries will take you to the appropriate
    entry in the National Center for Biotechnology Information (NCBI) Taxonomy Browser, where
    you will see that M. eugenii is the tammar wallaby. Additional information about - and links
    to - available genome and other data for this species are given.
13. Return to the GOLD page displaying the two M. eugenii genome projects. Links to relevant
    funding bodies, institutions, databases, and other resources are given for each of the genome
    projects. The identifiers listed under GOLDSTAMP (column on the extreme left) are unique to
    each genome project and link to a summary page or 'Gold Card' for the project.
14. The GOLD search page offers many options for searching, not only by organism type or by
    genome properties, but also by funding body, country, researcher, and many other factors.
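The search strategy in the GOLD protocol (look for the exact species first, then widen the search to the genus) can be sketched in a few lines for anyone working from a downloaded project table instead of the web form. Everything below is an assumption made for illustration: the record layout, the function names, and the example rows are hypothetical placeholders and are not taken from GOLD.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GenomeProject:
    genus: str
    species: str
    status: str  # hypothetical field, e.g. "ongoing" or "complete"


# Hypothetical placeholder rows; a real table would be exported from the database.
EXAMPLE_PROJECTS = [
    GenomeProject("Macropus", "eugenii", "ongoing"),
    GenomeProject("Macropus", "eugenii", "ongoing"),
    GenomeProject("Canis", "familiaris", "complete"),
    GenomeProject("Bos", "taurus", "ongoing"),
]


def search_projects(projects, field, value):
    """Return projects whose chosen field matches, sorted by genus and species."""
    wanted = value.strip().lower()
    hits = [p for p in projects if getattr(p, field).lower() == wanted]
    return sorted(hits, key=lambda p: (p.genus, p.species))


def find_with_fallback(projects, genus, species):
    """Try an exact species match first; if nothing is found, widen to the genus."""
    exact = [p for p in search_projects(projects, "species", species)
             if p.genus.lower() == genus.strip().lower()]
    if exact:
        return "species", exact
    return "genus", search_projects(projects, "genus", genus)


if __name__ == "__main__":
    level, hits = find_with_fallback(EXAMPLE_PROJECTS, "Macropus", "parryi")
    print(f"matched at {level} level: {len(hits)} project(s)")
```

Returning the level at which the match was made keeps it obvious whether the results are for the organism of interest or only for a relative.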
GOLD will direct you to specific genome project information and also to an
NCBI flat file (if it exists). A 'flat file' is a data file that contains records with no
structured relationship. In the case of sequence files, these usually list minimal
data on the source of the data, the full sequence, and (not always) annotation
on coding regions. There are no links to additional sources of information and it
is necessary to extract the data manually to be able to analyze it further. If, for
example, you wanted to examine a particular gene from a sequenced microbe,
then using the flat file you would have to scroll manually through the entire
sequence to find it, if the annotation was in place, or else perform a BLAST search
on the whole sequence, work out where the bit you wanted was, and then extract
that particular piece of sequence. In a nutshell, this is why genome browsers are
so useful.
There are distinct advantages in accessing the information via generic genome
viewers. They present the data in a standard format and often link sequence data
between different genomes, allowing easier comparisons and data handling.
They are also not restricted in the organisms they list (some institutes only list
those organisms being sequenced 'in house'), as long as the data is in the public
domain.
It should be mentioned that GOLD does not necessarily direct you to a generic
genome browser for the organism in question, even if the information is available
there. For example, both the dog (Canis familiaris) and cow (Bos taurus) genomes
are available in Ensembl, but a direct link to the relevant Ensembl pages is not
available under the GOLD 'Organism' listing. In general, you will need to discover
which of the generic genome browsers gives access to the data for the species you
are interested in.
In the following sections, we will focus on the use of the Ensembl and Integr8
browsers; other popular resources will be covered later.

2.2 Browsing vertebrate genomes with Ensembl
Ensembl was developed as a browser for the Human Genome Project, but is now
available for an increasing number of species (over two dozen at the time of
writing). It is continually being updated with an increasing array of features.
Most importantly, it is not restricted to the genome sequence produced by any
one institute and it also allows easy comparison of data from different genomes.
Protocol 2 describes a typical use of Ensembl - browsing for information related
to a given gene and its genomic environment. Gene symbols and descriptions can
also be used instead of gene names.

Protocol 2
A short tour of Ensembl starting from the parathyroid hormone-related protein gene

1. Go to Ensembl via http://www.ensembl.org 2.2 and click on Homo sapiens under the Popular
   genomes heading. In the text box next to the Search e! Human box at very top right of the
   screen, enter 'parathyroid hormone-related protein'. Set the adjacent pull-down menu
   to Anything. Press 'Enter' or click on the Go button.
2. There are only three results (see note a), starting with two Ensembl protein coding genes and,
   below them, an Ensembl gene family. Click on the link (Ensembl protein coding gene:
   ENSG00000087494 ...) for the gene.
3. This will call up an Ensembl Human Gene View page, starting with the Ensembl Gene Report
   for this gene (see Fig. 1). A range of information about the gene is displayed, along with links
   to further annotation and resources. Note that if you hover the cursor over the links on the
   left of the screen, brief explanations of the links will appear.

Figure 1. Ensembl gene report for parathyroid hormone-related protein.

4. To download directly the sequence data and some associated information for your own
   analyses, you may use Export information about region, Export sequence as FASTA or Export
   EMBL file. In each case, you will be offered various options such as which file format to use,
   which features of the annotation to include in the output, and how much flanking sequence
   to include. Each of these options will export a flat file, but with only minimal annotation.
   Much more information is displayed graphically using the other browser features. If you try
   any of these options,
   use the 'back' button on your browser to return to the Gene View page.
5. On the right-hand part of the screen, the Genomic Location box gives both the location
   of the gene in the genome (the chromosome and the start and end positions of the gene
   sequence) and the sequence contig in which the start of the gene lies. Access the genomic
   location of the gene by clicking on the link 28,002,284-28,016,183 (the exact coordinates
   may change in later versions of the genome assembly).
6. This takes you to an Ensembl Contig View page, showing an ideogram of chromosome 12 and,
   in the 'Overview' area below, an expanded view of the region surrounding the gene at 28 Mb
   in band p11.22. The parathyroid hormone-related protein gene (also known as parathyroid
   hormone-like hormone or PTHLH) is shown in the middle (the short brown bar above the name
   shows its extent on the chromosome) and neighboring genes are displayed to the left and
   right. A red box delineates the PTHLH gene alone; this region is shown in greater detail in the
   'Detailed view' area below. Scroll down to bring the entire 'Detailed view' area on screen
   (see note b).
7. The 'Detailed view' area displays the data that were used to infer the Ensembl gene transcript
   for PTHLH. The data are displayed in 'tracks': each track is named towards the left of the box
   (for example 'Genscan'). The default tracks include matches with ESTs, Unigene, and mRNA
   sequences, and the transcripts predicted by the Genscan automated gene prediction software.
   Also shown are details of the sequence contigs spanning this region in the
   assembly and DNA markers in the region.
8. Clicking on any item will bring up a small menu of relevant features. Clicking on the track
   names (for example, Unigene) will bring up a help menu; clicking on Track information ... in
   this menu will then bring up an explanation of that track.
9. In addition to the default tracks, many other tracks can be displayed for the region of interest.
   Use the Features pull-down menu (at the top of the 'Detailed view' area) to add further tracks
   to (or remove tracks from) the view.
10. To get an overview of the syntenic regions in other species, go to View syntenic regions on
    the left side of the screen. You will be offered a choice of organisms (six at the time of writing,
    starting with Bos taurus), and clicking on any one of these will bring up a 'classical' map
    showing the overall synteny of human chromosome 12 with the chromosomal regions in the
    other organism (see Fig. 2). After viewing this, use the 'back' button on your browser to return
    to the Contig View page.

Figure 2. Ensembl viewer showing the synteny viewer.
The PTHLH gene in Homo sapiens is on chromosome 12 and shares synteny with dog
chromosome 27. The human genes in the region are listed with a reference to the orthologous
region in dog, and provide direct access to both the candidate gene and the neighboring genes in
both human and dog.

11. To view a more detailed version of the syntenic region in other organisms, go to View alignment
    with on the left side of the screen. A pop-up menu will offer you a choice of species (or groups
    of species) to align with the human sequence. At the time of writing, the first item in the
    menu is 5 eutherian mammals - select this item.
12. This will take you to an Ensembl AlignSlice View page, which will show the human chromosome
    12 ideogram and an overview of the human chromosomal region as before, but the 'Detailed
    view' area below these will now display the human PTHLH gene and the corresponding regions
    in each of the other species. Only limited information (sequence contig and EMBL transcript
    tracks) is shown for each species by default, but the pull-down Features menu at the top of
    the 'Detailed view' area can be used to add tracks displaying any other desired features for all
    of the species shown.
13. To view neighboring genes in the various species, use the 'Zoom' feature towards the top of
    the 'Detailed view' box. Click on the larger end of the wedge (or click on the - icon) to display
    a larger region around the human PTHLH gene and its aligned regions in the other species.
14. At any point, you can access information from the other species by clicking on the feature of
    interest to bring up a menu of available information. For example, clicking on the Pthlh gene
    in the Mus musculus part of the window and then on the Gene:ENSMUSG00000048776
    item in the menu that pops up will take you to the Ensembl Gene Report for this mouse gene:
    this is the mouse equivalent of the Gene Report for the human gene, which we saw in step 3
    of this protocol. Use your web browser's 'back' button to return to the page displaying the
    alignments with human.
15. To view a more detailed comparison between the human region and just one other species,
    go to View alongside (upper left-hand part of the screen) and select any of the single species
    that appear in the pop-up menu. The new window (an Ensembl Human MultiContig View)
    shows a side-by-side comparison between the genomic regions in human and the second
    species at three levels of detail: chromosomal ideograms at the top; a 'navigational overview'
    of the PTHLH region below this; and at the bottom a 'Detailed view'. As before, you can select
    which features to display in the 'Detailed view' and can also zoom in or out.

Notes
a. The information and figures given are correct for the NCBI36 assembly current at the time of
   writing.
b. Each of the views - the chromosomal ideogram, the overview, and the detailed view - can be
   closed or opened by clicking on the corresponding '-' or '+' at the top-left corner of the view. A
   fourth and more detailed 'Basepair view' is closed by default but can be opened at the bottom of
   the screen by clicking on its '+' icon.
interest to bring up a menu of available information. For example, clicking on the Pthlh gene
22 II CHAPTER 2: NAVIGATING SEQUENCED GENOMES METHODS AND APPROACHES 23

The Ensembl genome browser is very user-friendly with numerous drop-down
menus to access further information in a graphical form. If you want to perform
your own analyses on the data, there are direct links (such as Export EMBL file) in
the upper-left part of the screen, which will allow you to export information on
the region being examined; or the Export data link under 'Use Ensembl to...' lets
you specify the region and the type of data to export. Data can also be downloaded
from Ensembl using BioMart, which is covered later in this chapter.
Other particularly useful features are listed under 'Other EMBL websites'
towards the bottom of the Ensembl front page:

• Vega (Vertebrate Genome Annotation) (11). This is a central repository for
high-quality, manually annotated data (Ensembl is a fully automated pipeline
based on gene prediction programs). It is currently available for four species
(Homo sapiens, Mus musculus, Danio rerio, and Canis familiaris), so if you
are interested in these organisms, it will be worth accessing Vega, at least
alongside Ensembl.
• Ensembl Pre! This provides access to recent data that has yet to be entered into
Ensembl (very useful for draft genomes or on-going sequencing projects).
• Archive! This is particularly useful for tracing back information in previous
releases, if you have data that you accessed previously. Archive versions
extending back over a 2 year period are available.

Finally, if you want to explore other genome browsers, there are direct links to
both the NCBI and University of California Santa Cruz (UCSC) genome browsers
from Ensembl. For example, the Ensembl Contig View, AlignSlice View, and Human
MultiContig View pages (seen in Protocol 2) all have links in the upper-left part
of the screen that will take you to the equivalent information in either of these
two alternative browsers.

2.3 Integr8 - an Ensembl lookalike for microbes

Integr8 is a portal maintained by the European Bioinformatics Institute (EBI) for
access to information related to completed genomes and their proteomes (12).
Integr8 does include some vertebrate data, but its strength is in the application of
the Ensembl browser to microbial genomes. It currently does not have an up-to-date
listing of all of the organisms that are in Ensembl.

Protocol 3

Using Integr8 to search for dihydroorotase genes in Acinetobacter

1. Go to http://www.ebi.ac.uk/ 2.3. Click on the Databases tab towards the top left of the screen,
then select Database browsing and then Integr8 from the menus that pop up.
2. You first need to specify the species. This can be done using the Browse Species menu (left
side of screen) and selecting Acinetobacter sp. from the resulting list, or by specifying
'Acinetobacter sp.' in the Search for species box and clicking the adjacent Go! button.
3. To find all genes with 'dihydroorotase' in their name, put 'dihydroorotase' in the Search for
gene/protein box. The pull-down menu to the right will be set by default to Acinetobacter
sp. (as we have just selected this species) - leave it set to this default and click the right-most
of the Go! buttons.
4. This produces 2 matches (see Note a), one of which is 'putative dihydroorotase'. There are
then two ways of looking at the data for this gene: using 'Integr80r' or 'Genome Reviews'.
5. To examine the gene using 'Integr80r', click on the i8 button to the left of 'putative
dihydroorotase'. This displays the information on this gene in a very straightforward way via a
series of tabs (Gene, Results, Context, and History).
6. To identify orthologs and paralogs in other species, click on the Protein tab and then on the
Orthologues and Paralogues buttons (next to the 'Homology:' heading underneath the Gene
tab). This will bring up a table of genes in various species, each of which can be examined
in more detail by clicking on its name (on the left of the table).
7. To the right of the table of orthologs or paralogs is a column headed 'Select', with a dot for
each of the listed genes. Click on several of these dots (they will turn into 'ticks' when clicked)
to select specific genes from the list.
8. Click on the Compare button at the top of the column. A table will appear showing each
of the selected orthologs (or paralogs) and the genes lying adjacent to it in the respective
genome. Clicking on any of the gene names will take you to a new 'Integr80r' page for that
gene.
9. Use the 'back' button on your browser to return to the list of Acinetobacter genes in step 4.
10. To examine the Acinetobacter putative dihydroorotase gene using 'Genome Reviews', click on
the GR button to the left of the gene name. You will be taken to an Ensembl-style browser
displaying information on this gene and the region around it (see Fig. 3). See Protocol 2 for
an introduction to the Ensembl-style browser. Note that not all organisms in Integr8 have a
Genome Reviews browser.

Note
a. Correct at time of writing. Details are, of course, likely to change.

Figure 3. Genome Reviews browser for Acinetobacter sp. in the region around the putative
dihydroorotase gene.

Integr8 also has some other very useful buttons on the side menu, which provide
taxonomy information on your organism, relevant literature, and genome statistics
(on amino acid composition, protein length distribution, and triplet usage).

2.4 Other web-based genome browsers

Although we have focused here on Ensembl and Integr8, there are many other
web-based genome browsers. All offer access to a different range of data sets, so
the best advice is to explore the available sites to find one that suits you and that
offers access to the data you need. Below are listed several of the more popular
sites, along with brief descriptions.

2.4.1 Genome News Network (GNN)
This online magazine (http://www.genomenewsnetwork.org 2.4) covers important
developments in genomics research around the world and, under resources,
has 'A Quick Guide to Sequenced Genomes'. This provides a brief description of
sequenced organisms and a link via 'Abstract' to the relevant Entrez PubMed entry.
It also references links to any related science articles by GNN. The site appears to
have stopped posting information in 2004; however, it is still a useful source of
background information presented in a very accessible format.

2.4.2 Entrez Genomes
This is the NCBI graphical genome viewer (http://www.ncbi.nlm.nih.gov/gquery/gquery.fcgi 2.5).
Entrez Genomes integrates the scientific literature, DNA and
protein sequence databases, 3D protein structure and protein domain data,
population study datasets, expression data, assemblies of complete genomes,
and taxonomic data. There is a comprehensive map viewer, a genome browser
for eukaryotic genomes, Plant Genomes Central, microbial and viral genome
databases, and gMap, a comparative analysis of microbial genomes. The archaea,
bacteria, and eukaryota can all be viewed by either chromosome, plasmid, or
organelles. Information leads directly to the relevant NCBI files. NCBI's genome
resources are covered in more detail in Chapter 1, and there are extensive links
to Ensembl.

2.4.3 UCSC Genome Browser
This browser (http://genome.cse.ucsc.edu/index.html 2.6) provides access to a
comprehensive range of organisms. It has the advantage of linking cDNAs to
microarray expression data. Again, there are extensive links to Ensembl.

2.4.4 Joint Genome Institute (JGI)
The JGI (http://www.jgi.doe.gov/ 2.7) combines four national laboratories: Lawrence
Berkeley, Lawrence Livermore, Los Alamos, and Oak Ridge; and the Stanford Human
Genome Center. They provide information on the genomes they have sequenced,
but have a limited number of eukaryotic genomes; for example, there is data for
Homo sapiens chromosomes 5, 16, and 19 only.
The data is annotated using Vista plots, gene model predictions and BLAST search
results. Vista plots (13) are generated from sequence data multiple alignments. If
annotation files are present with the segments of DNA chosen, then the Vista file
will also show the locations of untranslated regions and exons. The plots show
conservation of sequence and similarity over the whole length of the sequence
and so are very useful for a quick overview of how similar sequences are between
organisms with regard to exons, but also can identify conserved noncoding
sequence that may have functional significance.
The JGI's microbial genome browser (IMG) is much more comprehensive and is
described in the following section.

2.4.5 Integrated Microbial Genomes (IMG)
This browser (http://img.jgi.doe.gov/cgi-bin/pub/main.cgi 2.8) (14) provides a
framework for the comparative analysis of microbial genomes, many of which
have been sequenced by the JGI. Searches include: Find genes, Find functions,
and Find organisms. There is the possibility of browsing across genomes. It is very
easy to add sequences from several different genomes to a 'Gene Cart' for further
analysis such as DNA and protein alignments, to examine 5' or 3' neighborhoods
of the gene of interest, and to export sequences. This is an excellent and very
easy-to-use tool.
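The idea behind the conservation plots mentioned in Section 2.4.4 can be illustrated with a
few lines of Python. The following is only a minimal sketch of a sliding-window percent-identity
calculation on two already-aligned sequences; it is not the Vista algorithm itself, and the
function name, window size, and step size are choices made here purely for illustration.

```python
def window_identity(aligned_a, aligned_b, window=100, step=1):
    """Return (start, percent identity) for each window of a pairwise alignment.

    Both inputs are rows of the same alignment (equal length, '-' for gaps).
    A column counts as a match only if both rows carry the same non-gap symbol.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    a, b = aligned_a.upper(), aligned_b.upper()
    results = []
    for start in range(0, len(a) - window + 1, step):
        pairs = zip(a[start:start + window], b[start:start + window])
        matches = sum(1 for x, y in pairs if x == y and x != "-")
        results.append((start, 100.0 * matches / window))
    return results


if __name__ == "__main__":
    # Two short, made-up aligned sequences; real alignments would be far longer.
    for start, identity in window_identity("ACGTACGT", "ACGTTCGT", window=4, step=4):
        print(start, identity)
```

Plotting the second value against the first gives a simple conservation profile; windows that
score highly outside annotated exons are candidates for conserved noncoding sequence.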

2.4.6 The Institute for Genomic Research (TIGR) Comprehensive Microbial
Resource (CMR)
This TIGR-based browser (http://cmr.tigr.org/tigr-scripts/CMR/CmrHomePage.cgi 2.9)
presents data from publicly available complete microbial genomes, many
of which have been sequenced by TIGR. It is also possible to BLAST search against
TIGR's unfinished data via http://www.tigr.org/db.shtml 2.10. Access is also provided
to a number of parasite and fungal sequence projects.

2.4.7 Gendb
Gendb (http://www.cebitec.uni-bielefeld.de/groups/brf/software/gendb_info/index.html 2.11)
is an annotation system for prokaryotic genomes provided by the
Bielefeld University Centre for Biotechnology (CeBiTec). A user log-in is required
and there is a limited number of genomes available. However, this site does provide
a ready-made annotation system (available via collaboration) for independently
sequenced genomes.

2.5 Specialized sites

There are also a great many specialized sites that can provide information on a
gene-by-gene rather than genomic basis, and these sites complement the genome
browsers. Many of these sites will be discussed in other chapters of this book, but
some of particular relevance are listed below.

2.5.1 Gene Cards
This site (http://www.genecards.org/ 2.12) is provided free to non-profit academic
institutions and is probably one of the most comprehensive sites with regard to
the provision of information. Data include gene name aliases and descriptions,
genomic location with protein and transcript data, microarray expression profiles,
functional annotation including Gene Ontology (see Chapter 9), orthologs in other
species, single-nucleotide polymorphism (SNP) analysis, and research publications.
Links are provided to numerous other databases.

2.5.2 Online Mendelian Inheritance in Man (OMIM)
OMIM (http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=OMIM 2.13) is one of the
earliest genome resources and is still going strong. This database is a catalog of
human genes and genetic disorders authored and edited by Dr Victor A. McKusick
and his colleagues at Johns Hopkins and elsewhere, and developed for the web
by NCBI. The database contains textual information and references plus links to
additional related resources at NCBI and elsewhere.

2.5.3 Entrez Gene
Entrez Gene (http://www.ncbi.nih.gov/entrez/query.fcgi?db=gene 2.14) has superseded
LocusLink and is a searchable database of genes, from RefSeq genomes, and
defined by sequence and/or located in the NCBI Map Viewer. This comprehensively
links NCBI resources for each gene in a single-page, easy-to-view format.

2.5.4 GeneTests
GeneTests (http://www.genetests.org/ 2.15) is a medical genetics information
resource and contains GeneReviews. This is an online publication containing a
collection of expert-authored, peer-reviewed descriptions of heritable diseases. It
is biased towards clinicians rather than laboratory scientists.

2.6 Downloading data with BioMart

Having seen how to view genomic data, you may wish to download data for
your own analysis. Genome browsers will often offer a variety of 'export' options
(some of these were mentioned briefly in Protocol 2), but these will vary widely
depending on the browser in question and on the type of data you are exporting.
Here, we will deal with BioMart, a more general tool for downloading genome
data.
BioMart is a generic data management system. It can be thought of as a
'shopping tool', allowing you to select and download data for your own analysis.
BioMart is not a database per se - it is a tool that can be accessed from a variety
of databases, providing a standardized way of accessing the respective database's
contents. For example, it is accessible from Ensembl (15).
The details that will be displayed in BioMart will of course depend on the
database it is accessing, but the basic 'shopping process' consists of three steps
that remain the same. First, you choose the dataset (for example, the species and
the annotation version) that you want to query. Secondly, BioMart then offers you
a range of filters (tailored to the type of data being accessed) to select the data
you are after. For example, when using BioMart to access data through Ensembl,
you can query for particular regions of the human genome, by chromosome,
chromosome bands, base-pair coordinates, or marker location; or you can
query for known genes using identifiers such as those from Uniprot, RefSeq, or
EntrezGene IDs. You can also query for proteins that belong to particular protein
families according to PROFILE IDs, PFAM IDs, InterPro IDs, PRINTS IDs, or PROSITE
IDs that have or do not have transmembrane domains or signaling domains. You
can even combine the data from two different genomes. Thirdly, after setting the
appropriate filters, you can choose the type of information you want to retrieve,
such as sequences (including or excluding introns etc.) or features, and the format
in which you would like the information downloaded (HTML, CSV, Text, etc.).
Some of the databases that implement BioMart are listed in Table 1. The
BioMart web site (http://www.biomart.org/ 2.16) also provides a current list of the
databases that implement BioMart. Protocol 4 gives an example of using BioMart
to find and download a specific group of genes from Ensembl. Working through
this example should enable the reader to perform other types of BioMart queries
through Ensembl and, with a little extrapolation, to use BioMart through the
other databases given in Table 1.
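The three-step 'shopping process' described above (dataset, filters, attributes and output
format) can also be written down as a structured query rather than clicked through in a
browser. The sketch below builds such a query as an XML document using only the Python
standard library. It is an illustration under stated assumptions, not a definitive recipe:
the helper name build_biomart_query is invented here, and the dataset, filter, and attribute
names used in the example (hsapiens_gene_ensembl, chromosome_name, biotype, go_parent_term,
ensembl_gene_id, description) are illustrative and must be checked against the lists offered
by the particular BioMart server and release you are using.

```python
import xml.etree.ElementTree as ET


def build_biomart_query(dataset, filters, attributes, formatter="TSV"):
    """Assemble a BioMart-style XML query string.

    dataset    -- step 1: the dataset to search (name is server-specific)
    filters    -- step 2: mapping of filter name -> value restricting the results
    attributes -- step 3: list of attribute names (columns) to retrieve
    formatter  -- output format requested for the results
    """
    query = ET.Element("Query", {
        "virtualSchemaName": "default",
        "formatter": formatter,
        "header": "1",
        "uniqueRows": "1",
    })
    ds = ET.SubElement(query, "Dataset", {"name": dataset, "interface": "default"})
    for name, value in filters.items():
        ET.SubElement(ds, "Filter", {"name": name, "value": str(value)})
    for name in attributes:
        ET.SubElement(ds, "Attribute", {"name": name})
    return ET.tostring(query, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical names, chosen to mirror the example in Protocol 4.
    xml_query = build_biomart_query(
        dataset="hsapiens_gene_ensembl",
        filters={
            "chromosome_name": "9",
            "biotype": "protein_coding",
            "go_parent_term": "GO:0003700",
        },
        attributes=["ensembl_gene_id", "description"],
    )
    print(xml_query)
```

BioMart servers have generally accepted a document of this kind through a 'martservice'
URL, but the exact address and the permitted names differ between servers and releases, so
consult the documentation of the server in question before relying on it.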

Table 1. Databases providing BioMart access to genome data for various species
Database    URL    Species
Ensembl    http://www.ensembl.org/Multi/martview    Various vertebrates and Apis mellifera, Caenorhabditis elegans, Drosophila melanogaster, and Anopheles gambiae
Gramene    http://www.gramene.org/Multi/martview    Oryza sativa, Zea mays, and Arabidopsis thaliana
Wormbase    http://www.wormbase.org/biomart/martview    Caenorhabditis elegans
euGenes    http://insects.eugenes.org/BioMart/martview    Various Drosophila species
[name not legible]    [URL not legible]    Various species, archaea, bacteria, and eukaryota
[name not legible]    [URL not legible]    Data collected by the International HapMap Project

Protocol 4

Using BioMart to download transcription factors located on
chromosome 9 through Ensembl

1. Go to [URL not legible] (you can also access this page using the [...] Data [...]
upper-left of most Ensembl pages).
2. On the resulting Martview page, you will be able to build the query. This is done in three
stages: to specify the dataset you want to search, the criteria you want to apply, and
the features of the resulting genes you want to see.

Specifying the dataset
3. To specify the dataset you want to search within, on the right-hand part of the screen, there
should be menus entitled Database and Dataset (if not, click on the »Dataset link at upper
left of the screen and they should appear). Using these two menus, select ENSEMBL 42 GENE
(SANGER) and Homo sapiens genes (NCBI36) as the database and dataset, respectively
(see Notes a and b).

Specifying the search criteria
4. Next, click on the »Filters link on the left of the screen. On the right will appear several
boxes (REGION:, GENE:, etc.), each with a small '+' symbol next to it. These are the filters that
will be used to define the type of data (which genes, in this case) you want to download.
5. In general, clicking on '+' next to any of these filters will open a menu within which you can
set parameters; check boxes in each menu are used to determine which parameters to take
into account.
6. Click on the '+' next to REGION: to expand its menu. Below it will appear various options by
which you can specify the region (chromosome, base-pair coordinates, band, etc.). From the
pull-down menu next to Chromosome, select 9, and ensure that the check box to the left is
'ticked' (see Fig. 4).
7. Scroll down to the Gene filter, click on the adjacent '+' to expand its menu, and scroll down
again to Gene type.

Figure 4. Using BioMart.
In this screenshot, we have chosen to search the dataset NCBI36 (under Dataset on the upper
left) and have restricted the search to chromosome 9 (set using the menu and check box on the
right, and shown under Filters on the left). We have not yet specified which Attributes to retrieve
- the ones shown on the left (Ensembl Gene ID and Ensembl Transcript ID) are the default
values.

8. Click on Protein_coding within the Gene type menu; the adjacent check box should change
to a 'tick' automatically (if not, click to tick it).
9. Still in the Gene type menu, ensure that the Status menu is set to Known (to retrieve only
genes of known function) and tick the check box next to it.
10. In the same way, go into the Gene Ontology menu (see Chapter 9 for details of Gene Ontology
(GO), which is a standardized vocabulary for describing genes). Tick the check box next to
Molecular function and enter 'GO:0003700' into the text field next to it. (GO:0003700 is the
GO term for 'transcription factor activity'; if you do not know the relevant GO term, you can
use the adjacent Browse button to jump to the QuickGO browser, which allows you to browse
GO and identify the correct term).
11. Leave the check box next to Evidence code (Molecular function) unticked. This means that we
will be searching for proteins that have been assigned the GO term 0003700 under 'molecular
function', using any type of available evidence.
12. Other menus (Expression, Protein, SNP) would allow you to apply further criteria in selecting
your data, but we will leave these for now.

Specifying the features to retrieve
13. Click on the »Attributes (features) link on the left. On the right, you will be offered the
now-familiar style of menu. Expand the Gene submenu (by clicking on the adjacent '+' button)

and, under Ensembl Attributes, ensure that only the Ensembl Gene ID and Description check
boxes are ticked.

Previewing and retrieving the results
14. Click on the »Dataset link (the database and dataset will again be displayed on the right) and
then on the Count button towards the top of the screen. This will display the number of genes
matching your criteria and the total number of genes in the dataset, next to the »Dataset
link towards the top of the page (at the time of writing, this was 44/31 148 genes).
15. Now click on the Results button. On the right, you will see a table displaying the first few
genes retrieved by your query. Above this table, pull-down menus allow you to specify the file
format for the output; ensure that the Export all results to menu is set to File and the rows
as menu is set to TSV (tab-separated value output) and click the adjacent Go button.
16. A text file will be downloaded, containing the results of your search. The results of this
particular example are given in the folder for this chapter on the book's web site
as 'BioMart.txt' 2.18.
17. If you specify HTML (in the rows as menu), the result will be an HTML file of your results,
which can be opened in your browser and which will contain active links for the relevant
data. An example of such a file is given in the Protocol_4 folder for this chapter on the book's
web site.

Notes
a. These were current at the time of writing; there may be more-recent versions by the time you
read this.
b. It is also possible to specify a second dataset to search in combination with the first; this is set
using the lower of the two »Dataset links on the left of the screen.

In this example, we started with a 'blank' BioMart query. However, while viewing
information in Ensembl (for example, while following Protocol 2 in this chapter),
you will also notice links such as Export Gene info in region, with the BioMart
logo of colored dots. Clicking on these links effectively completes the first steps of
a BioMart query for the gene, region, etc. in question, leaving you only to specify
which aspects of the data you want to download and the format for export.
If you have particular queries BioMart does not cater for, you can query the
Ensembl database directly by connecting to their MySQL server. You will need to know
how to use SQL and to understand the database schema. We cannot go into more
detail here, but more information can be found in the Ensembl documentation.

2.7 Browsing genomes 'off line' using stand-alone software

There are alternatives to using a remote web site for genome browsing. ARTEMIS
(http://www.sanger.ac.uk/Software/Artemis/ 2.20) is a stand-alone program that
allows you to browse and annotate genomes (16). It can read different formats,
including FASTA files, EMBL, and Genbank format, as well as GFF format. These
formats contain the sequence data, as well as the annotations of this sequence.
The sequence and its features are displayed graphically, in broadly the same way
as some of the online genome browsers described above.
ARTEMIS (see Fig. 5, also available in the color section) is written in Java
(http://java.sun.com/ 2.21) and is therefore available for many different computer systems
(Linux, UNIX, Macintosh, and Windows). It is freely available, but you need an
installation of Java on your computer (Java is also freely available). For details, see
the ARTEMIS web site (http://www.sanger.ac.uk/Software/Artemis/ 2.20).
We recommend that you consult the ARTEMIS manual (available at the same site
as the software) to learn about the many available features, as this is beyond the
scope of this chapter. For example, ARTEMIS allows you to define your own features,
edit and annotate a sequence, and save the results in various standard formats
(e.g. Genbank).
Protocol 6 provides a simple example of using ARTEMIS to view an annotated
sequence file downloaded from Ensembl.

Figure 5. Screenshot of the ARTEMIS sequence viewer and annotation tool (see page xvii for color
version).
The main window is divided into several sections. Below the main menu is information about the current
selection and the sequences being viewed. Below this (and filling most of the top half of the screen), the
'overview' section shows stop codons in all six reading frames (short vertical black lines) and features
on both strands (colored boxes, mainly in blue and yellow); the vertical scroll bar on the right controls
the scale (zoom) of this window and the horizontal one scans along the sequence. The 'base view' (just
below the overview) shows the sequence of both strands and, above and below these, the translation in
all six frames; again, the scroll bars control the zoom and position of this window. The bottom third of
the screen shows a list of annotated features. Many other aspects of the sequence can be displayed or
hidden (see text).

Protocol 6

Using ARTEMIS to display the human genome sequence
surrounding the gene Alien

This protocol assumes that you have installed ARTEMIS on your local computer. ARTEMIS is available
from http://www.sanger.ac.uk/Software/Artemis/ 2.20, along with instructions for installing
and using it.
1. You first need to download an EMBL file containing the relevant sequence and its annotation.
Either download a copy of this file (Alien.embl 2.22) from the Protocol_6 folder for this chapter
at the web site that accompanies this book and proceed to step 7, or recover the data from
Ensembl by following steps 2-6.
2. Go to Ensembl via http://www.ensembl.org 2.2 and click on Homo sapiens under the Mammalian
genomes heading. Type 'alien' in the Search e! Human field at the top right and press Go.
3. The results page will show the Ensembl protein family and, below it, the Ensembl gene (correct
at time of writing). As you can see, this gene is an Alien homolog; Alien was first discovered
in Drosophila. Click on the link Ensembl gene: ENSG00000166200 to go to the Ensembl
Gene Report.
4. To export the data, click on Export gene data on the left of the page. This will take you to the
'Ensembl Human Export View'.
5. Under context, enter '5000' in each of the two boxes (Bp upstream and Bp downstream) to
export 5000 bp of sequence either side of the Alien gene, and choose EMBL as output format
and then press Continue.
6. The page will show a series of check boxes for the various features that can be included in the
output. Select the following features: Repeat features, Prediction features (genscan), Gene
Information, and Vega Gene Information. Set the output format to Text and click Continue.
Save the text file as 'Alien.embl'.
7. Start ARTEMIS (e.g. by double clicking the Artemis.jar icon), and open the file 'Alien.embl'.
A window should appear, similar to that shown in Fig. 5.
8. The window is divided into three main sections. The upper section gives a coarse view of the
sequence and is annotated on each strand; it also displays the positions of stop codons in
each of the three reading frames on each strand. Scroll bars allow you to scroll along the
sequence (horizontal scroll bar) or to zoom in or out (vertical scroll bar). The middle section
displays similar information, but is initially set to a higher resolution (again, controllable by
its vertical scroll bar) to show the nucleotide sequence and the amino acids encoded in each
reading frame. You can use these two sections to look at the same region at different levels
of detail, e.g. to see an overview of the exon/intron structure of a gene and the sequence of
a particular exon/intron boundary at the same time.
The bottom section lists the annotated features; double-clicking on a feature will bring
it to the center of the upper two windows. Many other tools - detailed in the ARTEMIS manual
- are available for viewing and editing the sequence and its annotation.

Another option for browsing genomes locally is to install Ensembl on your own
computer. This requires some experience with installing software and a knowledge
of Perl, MySQL, and Java. You can download the Ensembl data as well as their
program code free of charge. This means you can run it on a copy of the public
data, run it with your own data, or you can install the whole annotation pipeline.
There is some documentation on the Ensembl webpage on how to install the
database locally, but installing and running the whole annotation pipeline will
require considerable expertise and hardware.

2.8 Linking your own data to a genome browser

The UCSC Genome Browser and Ensembl both allow you to overlay data from other
sources, so that you can have additional information displayed in the browser that
is not actually part of UCSC's or Ensembl's data, but is provided by other sources
(17,18). This means that you can also feed in your data and have it displayed in
the Ensembl or UCSC browser alongside their own data.
There are different ways of achieving this: by uploading your data to the UCSC
or Ensembl site; by linking UCSC/Ensembl views to data on your web site; or by
setting up a distributed annotation system (DAS) server on your computer. In most
cases, the information is stored in text files and can be in various formats (see
http://genome.ucsc.edu/goldenPath/help/customTrack.html 2.23). Unfortunately,
the formats differ slightly in the features they support. Check the file formats
carefully: some formats require spaces to separate the columns, whilst others
require tabs. Please refer to the format descriptions to find out what the best file
format for your data might be. Below is shown the example file that we will use in
Protocol 8, which illustrates some general (although not universal!) points about
these formats.

#example file
browser position chr19:6901101-7100000
browser hide all
track name=NavigatingGenomesTrack1 description='This is an example of how to link your data into Ensembl/UCSC' visibility=2 color=255,0, useScore=2 url=http://www.ebi.uniprot.org/uniprot-srv/uniProtView.do?proteinAc=$$
chr19 Navigator test_element 6903521 6903982 1000 P61203
chr19 Navigator test_element 7000000 7001000 400 P61205
chr19 Navigator test_element 7010000 7030000 400 P61205
chr19 Navigator test_element 7050000 7060000 400 P61205
chr19 Navigator test_element 7010000 7090000 200 P61204
browser dense Test2
track name=NavigatingGenomesTrack2 description='This is an example of how to show even more data' visibility=1 color=255,0,255 useScore=1 url=http://www.ebi.uniprot.org/uniprot-srv/uniProtView.do?proteinAc=$$ color=0,255,0 visibility=1
chr19 Navigator1 test_element1 7010000 7090000 400 P61204
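Before uploading a file of this kind, it can be useful to check that it parses as you expect.
The sketch below reads the three kinds of line that appear in the example (comments,
'browser' instructions, 'track' definitions followed by feature lines) using only the Python
standard library. It is a minimal illustration written for this layout, not the parser used
by Ensembl or UCSC: the function name is invented here, fields are assumed to be separated
by whitespace, and any columns after the score are simply kept as a list rather than
interpreted.

```python
import shlex


def parse_custom_track(text):
    """Split a custom-track file into browser instructions and tracks.

    Returns (browser_lines, tracks). Each track is a dict holding the
    key=value attributes of its 'track' line and a list of its features.
    """
    browser_lines, tracks = [], []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue                                   # blank line or comment
        if line.startswith("browser "):
            browser_lines.append(line.split()[1:])
        elif line.startswith("track "):
            # shlex keeps quoted descriptions containing spaces in one piece
            tokens = shlex.split(line)[1:]
            attrs = dict(tok.split("=", 1) for tok in tokens if "=" in tok)
            tracks.append({"attributes": attrs, "features": []})
        else:
            fields = line.split()
            if len(fields) < 6 or not tracks:
                raise ValueError("unexpected line: " + line)
            seqname, source, feature, start, end, score = fields[:6]
            tracks[-1]["features"].append({
                "seqname": seqname, "source": source, "feature": feature,
                "start": int(start), "end": int(end), "score": float(score),
                "extra": fields[6:],                   # remaining columns, uninterpreted
            })
    return browser_lines, tracks


if __name__ == "__main__":
    # A small made-up file in the same style as the example above.
    demo = """#example file
browser position chr19:6901101-7100000
track name=Demo description='An example track' visibility=2
chr19 Navigator test_element 6903521 6903982 1000 P61203
"""
    browser, parsed = parse_custom_track(demo)
    print(browser)
    print(parsed[0]["attributes"], len(parsed[0]["features"]))
```

A quick check of this sort will catch, for example, a feature line that has lost its sequence
name or a start coordinate that is not a number.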

In this example, the first line is a comment (indicated by '#'). The next three
lines contain instructions for the browser: the line starting with 'browser' tells
the browser which part of the genome to display; the line starting with 'track'

states the name of the information track. The 'track' line lets you define some
additional information on how to display the information and it also allows you to
provide a URL to link to if the user selects the particular element (just as most of
the features you have already viewed in Ensembl had links to further information).
In this example, we provide links from this track to UniProt entries, by stating:
url=http://www.ebi.uniprot.org/uniprot-srv/uniProtView.do?proteinAc=$$
(we will discuss the '$$' in a moment). Obviously, you can change this to
point to any web page.
The next five lines contain General Feature Format (GFF) fields, describing five
features that will be displayed in this track. Each line consists of the sequence
name (a chromosome or a contig - in this case, 'chr19'), the source (for example,
the program that generated this feature - in this case 'Navigator'), the name of
this type of feature (such as 'CDS', 'start_codon', or 'exon' - in this case,
'test_element'), the start and end positions of the feature, a score of between 0 and
1000, which determines the level of gray in which this feature is displayed, the
strand ('+', '-', or '.' for features to be shown on the plus-strand, minus-strand, or
both), the frame (a number between 0 and 2 that represents the reading frame of
the first base, or a '.' if the feature is not an exon) and finally the 'group': all lines

3. In the Search box at the top centre of the screen, type '19: 6900000-7100000' (to view the
region from 6.9 to 7.1 Mb on chromosome 19) and then click the Go button or press 'Enter'
on your keyboard.
4. This brings up the Contig View page for chromosome 19, displaying the specified region. Scroll
down if necessary to see the 'Detailed view' panel.
5. Click on the DAS Sources menu (at the top of the 'Detailed view' panel) and then click on
Manage sources at the bottom of the list that appears.
6. A new window (called 'DASConfView') will appear. On the left, under Manage Sources, click
on the link Upload your data.
7. The refreshed page should be headed 'DAS Wizard Step 1 of 3: Data location', and there are
fields at the top for your e-mail address and a password of your choice. You need to complete
these fields so that Ensembl can contact you at a future date if necessary and so that nobody
else can modify your uploaded data.
8. Use the Choose file button (or Browse button as it will appear in some browsers) to find and
upload the file Linked dataEnsEMBLtxt 2.24. When this has been done, the filename should
appear next to the Choose file/Browse button.
9. Click the Next button just below this.
10. If everything goes well, you will be taken to a new page headed 'DAS Wizard Step 2 of 3: Data
appearance', in which case go on to step 12.
with the same group (for example, 'P6120S') are linked together into a single item, 11. If this does not happen, look for an error message towards the top of the screen Uust below
which can be used, for example, if you want to display the linked exons of a single where it says 'Please upload your data location'), along the lines of:
gene. When creating the link for each feature, the genome browser will substitute ERROR: could not upload data due to 'ERROR: Invalid format. Line l'
the group for the '$$' in the generic link, so that in this eX(Jmple each of the five
In this case, check that the Linked dataEnsEMBLtxt 224 file has not been corrupted - check
features will link to the respective UniProt entry.
especiaily (using a text editor or Word - but remember to save as a text-only file) that the
It is possible to display several tracks; for example, you could use one track for
fields in the file are divided by tab characters (not spaces). Try uploading again and, if it still
each alternative splice variant. The final lines of the eX(Jmple file (starting with fails, copy and paste the contents of the file into the Paste your data window instead of using
'browser dense Test2') define a second custom track, this time containing only the Choose file/Browse option in step 8.
a single feature. 12. You should now be on the page headed 'AS Wizard Step 2 of 3: Data appearance'. Towards
The Ensembl browser behaves slightly differently from the UCSC browser. At the top of the page are several check boxes (next to Enable on; these check boxes determine
the time of writing, it seems that the behavior of the UCSC browser conforms which views your data will be visible in. The contigview box should be checked (if not, click
better to the examples given on the instruction pages of the browsers (Jnd provides to tick itl.
more meaningful error messages. Therefore, we will use a slightly simpler example 13. Click on the Next button and you will be taken to the next page, headed 'DAS Wizard Step
in Ensembl. In Protocol 7, we will use a simplified file format to display basic 3 of 3: Display configuration'. This page gives you numerous options for the appearance the
information in the Ensembl browser. In Protocol 8, we will use the more extensive data you have just uploaded.
file (shown above) to display slightly more complex data in the UCSC browser. 14. Under Name and Track label at the top of the screen, enter any descriptive name you like (for
this example, use 'NavigatingGenomes' and 'NavigatingGenomesTrack', respectively). You
can leave the other fields blank and the other options at their default settings. Click on the
Finish button just below these options.
Protocol 7
15. You will be taken back to the first step (as in step 6); close this window and return to the Contig
linking your own data to Ensembl View page (as you left it in step 5). Refresh this page using your browser's 'reload' button.
16. Click on the 'DAS Sources' menu to open it (if it is already open, close it and then reopen it).
1. Download a copy of the file Linked dataENS.txt 2.24 from the ProtocoL7 folder for this You should see 'NavigatingGenomesTrack' towards the bottom of the list. The check box next
chapter on the book's web site. Open the file in a text editor or word processor and examine to it should be ticked (if not, click to tick it).
its contents. If you experiment by editing this file, be sure to save it as 'text only' after editing a
and take care with end-of-line and 17. Now click the Refresh button toward the top of the 'Detailed view' panel. The screen will
refresh and you will see your added data displayed alongside the other features. Clicking on
2. Go to http://www.ensembl.org/Homo sapiens/index.html2.25. anyone of the features will bring up information from the file you uploaded (see Fig. 6).
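Step 11 of Protocol 7 traces the 'Invalid format' error to fields separated by spaces rather than tab characters. The following Python sketch shows one way to run that check before uploading. It is an illustration only and not part of Ensembl: the function name `find_untabbed_lines`, the `min_fields` threshold, and the decision to skip comment, 'browser' and 'track' lines are all assumptions made here.

```python
def find_untabbed_lines(text, min_fields=3):
    """Return (line_number, line) pairs for data lines that have fewer
    than min_fields tab-separated fields (min_fields is an assumption)."""
    problems = []
    for number, line in enumerate(text.splitlines(), start=1):
        stripped = line.strip()
        # Skip blank lines, comments, and browser/track instruction lines.
        if not stripped or stripped.startswith(("#", "browser", "track")):
            continue
        # A line whose fields are separated by spaces yields a single field.
        if len(line.split("\t")) < min_fields:
            problems.append((number, line))
    return problems


# Hypothetical data lines: the first uses tabs, the second uses spaces.
good = "chr19\tNavigator\ttest_element\t7050000\t7060000\n"
bad = "chr19 Navigator test_element 7050000 7060000\n"
print(find_untabbed_lines(good + bad))
```

Any line reported by the function is a candidate for the corruption described in step 11 and can be repaired in a text editor by replacing the spaces between fields with single tab characters.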
Figure 6. Screenshot showing part of the 'Detailed view' panel of the Contig View page of the Ensembl genome browser (see page xviii for color version).
The data that was uploaded in Protocol 7 is shown (dark bars just below the chromosome scale) in the track called 'NavigatingGenomesTrack'. In this shot, the user has clicked on one of the uploaded features (P61205), and the small pop-up window displays information about this feature.

Note
(a) Even this can cause problems: some word processors can change the 'hidden characters' that mark [...] later. If in doubt, do not edit [...]

By using the other options available through Ensembl, it is possible to upload and present more complex data (including data held on your own or another web site), to provide links (clickable from Ensembl) from your features to other databases such as UniProt, and so on. However, this is beyond the scope of the present chapter. Instead, in the next protocol, we will upload the slightly more complex example file that we discussed earlier into the UCSC browser.

Protocol 8
Linking your own data to the UCSC Genome browser

1. Download a copy of the file Linked_dataUCSC.txt 2.26 from the Protocol_8 folder for this chapter on the book's web site. Open the file in a text editor or word processor and examine its contents. If you experiment by editing this file, be sure to save it as 'text only' after editing and take care with end-of-line and tab characters.
2. Go to http://genome.cse.ucsc.edu/ 2.27 and choose Genomes from the menu bar at the top.
3. Click the add custom tracks button towards the top of the screen.
4. Ensure that the current human genome sequence assembly is selected (menus towards the top of the screen).
5. Click on the button called Browse or Choose file (the button name will depend on which web browser you are using) just above the first large text field (called 'Paste URLs or data:') and find the file that you saved in step 1. Alternatively, you can copy the contents of the file and paste them into the window or put a URL that points to your file into the window. Press Submit.
6. You will be taken to a new page headed 'Manage Custom Tracks'. At the top is a table of the tracks you have added so far. There were two tracks defined in the example file, [...] and 'NavigatingGenomesTrack2'. (Links and options from this page allow you to delete, view, or edit some tracks, but we will not do this now.)
7. To the right of the table, click the button go to genome browser.
8. The UCSC browser should display the appropriate segment of chromosome 19, showing the custom tracks that you have just uploaded (see Fig. 7). Try clicking on the newly added features or their names.
9. All other tracks in the genome browser will be disabled, but can be activated by selecting from the numerous pull-down menus underneath the chromosome graphic; click the refresh button (underneath the graphic window - not the 'reload' button on your own web browser) after selecting which tracks to activate.

Figure 7. Screenshot showing part of the UCSC genome browser.
The data that was uploaded in Protocol 8 is shown in the graphic window, just beneath the chromosome distance scale. Additional tracks (STS markers, RefSeq genes, and spliced ESTs) have also been activated, using the pull-down menus further down the screen.
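The feature lines of the custom track file used in Protocol 8, and the way the browser builds a link by substituting the 'group' for '$$', can be sketched in Python as follows. This is a minimal illustration, assuming the nine tab-separated fields in the order given in the chapter's description of the file. The names `Feature`, `parse_feature` and `feature_link` are hypothetical, the strand and frame values in the sample line are invented for the example, and the sketch makes no claim about how the UCSC browser itself is implemented.

```python
from collections import namedtuple

# Field order as described in the chapter: sequence name, source, feature
# type, start, end, score, strand, frame, group.
Feature = namedtuple(
    "Feature", "seqname source feature start end score strand frame group"
)


def parse_feature(line):
    """Split one tab-separated feature line into its nine fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-separated fields, got {len(fields)}")
    seqname, source, feature, start, end, score, strand, frame, group = fields
    # Positions and score are treated as integers (score assumed 0-1000).
    return Feature(seqname, source, feature, int(start), int(end),
                   int(score), strand, frame, group)


def feature_link(url_template, feature):
    """Build the per-feature link by replacing '$$' with the group."""
    return url_template.replace("$$", feature.group)


URL = "http://www.ebi.uniprot.org/uniprot-srv/uniProtView.do?proteinAc=$$"
# Sample line: strand '+' and frame '.' are illustrative values only.
line = "chr19\tNavigator\ttest_element\t7050000\t7060000\t400\t+\t.\tP61205"
feat = parse_feature(line)
print(feat.start, feat.end, feature_link(URL, feat))
```

Raising an error when a line does not split into nine fields mirrors the most common cause of upload failures mentioned in the protocols, namely spaces in place of tabs.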
Like the Ensembl browser, the UCSC browser offers many opportunities to incorporate your own data, manipulate and display it, and integrate it with other features both within the browser and beyond. Many of these options are beyond the scope of this chapter, but the reader is encouraged to explore and to refer to the online help files.

Acknowledgements
The authors would like to thank all people involved in the many projects presented here, especially the people writing and maintaining the excellent online documentation. T.S. was a British Antarctic Survey/European Bioinformatics Institute/St Edmund's College Research Fellow 2003-2006. This paper was produced by M.S.C. and T.S. within the BIOREACH/BIOFLAME core programs.

References
1. Salzberg SL & Yorke JA (2005) Bioinformatics, 21, 4320-4321.
2. Hubbard T, Andrews D, Caccamo M, et al. (2005) Nucleic Acids Res. 33, D447-D453.
* 3. EBI 2can Support Portal http://www.ebi.ac.uk/2can/ 2.28. - Home page for the EMBL EBI's 2can bioinformatics support portal. Tutorials are offered on a wide range of bioinformatic-related topics from basic biology to detailed protocols.
* 4. Ensembl help pages http://www.ensembl.org 2.2• - The help pages for Ensembl are very readable and helpful, and contain a wealth of information not only on Ensembl but also on related bioinformatic topics.
* 5. Galperin MY (2007) Nucleic Acids Res. 35, D3-D4. - Nucleic Acids Research [...] a regularly updated database issue, summarizing major bioinformatic databases. This is a very good source for information on available bioinformatic data resources. Free access is available via http://nar.oxfordjournals.org/.
* 6. Fox JA, McMillan S & Ouelette SF (2006) Nucleic Acids Res. 34, W3-W5. - Nucleic Acids Research publishes a regular web server issue, summarizing and providing links to a great many bioinformatic web servers. Free access is available via http://nar.oxfordjournals.org/.
7. Mullan L (2004) Brief. Bioinform. 5, 365-369.
8. Bernal A, Ear U & Kyrpides N (2001) Nucleic Acids Res. 29, 126-127.
9. Kyrpides NC (1999) Bioinformatics, 15, 773-774.
10. Liolios K, Tavernarakis N, Hugenholtz P & Kyrpides NC (2006) Nucleic Acids Res. 34, D332-D334.
11. Ashurst JL, Chen CK, Gilbert JG, et al. (2005) Nucleic Acids Res. 33, D459-D465.
12. Kersey P, Bower L, Morris L, et al. (2005) Nucleic Acids Res. 33, D297-D302.
13. Frazer KA, Pachter L, Poliakov A, Rubin EM & Dubchak I (2004) Nucleic Acids Res. 32, W273-W279.
14. Markowitz VM, Korzeniewski F, Palaniappan K, et al. (2006) Nucleic Acids Res. 34, D344-D348.
15. Kasprzyk A, Keefe D, Smedley D, et al. (2004) Genome Res. 14, 160-169.
16. Berriman M & Rutherford K (2003) Brief. Bioinform. 4, 124-132.
17. Birney E, Andrews D, Caccamo M, et al. (2006) Nucleic Acids Res. 34, D556-D561.
18. Hinrichs AS, Karolchik D, Baertsch R, et al. (2006) Nucleic Acids Res. 34, D590-D598.

CHAPTER 3
Sequence similarity searches
Jaap Heringa and Walter Pirovano

1.1 Comparative sequence analysis

Comparative sequence analysis is a common first step in the analysis of sequence-structure-function relationships in protein and nucleotide sequences. To obtain knowledge about the role of a certain unknown protein, comparing the protein's sequence with the many sequences in annotated protein databases often leads to useful suggestions regarding the protein's three-dimensional structure or molecular function. As the prediction of a protein's structure and function on first principles is still a major unsolved problem in molecular biology (see Chapters 8 and 10), the method of indirect inference by comparative sequence techniques has become essential for structural and functional genomics initiatives. Over the last decade, this approach has led to the novel annotation of more sequences than any other individual technology.

Analysis of sequence similarity also underpins many other areas of bioinformatics, including the identification of coding regions in genomic sequence by interspecies comparison (see Chapter 4) and the analysis of evolutionary relationships (see Chapter 12). In short, analysis of the similarities between protein or DNA sequences is a cornerstone of bioinformatics.

Given the fact that sequence databases are growing exponentially, many current research projects are aimed at improving the sensitivity of sequence similarity searching techniques, whilst trying to ensure that the speed of the algorithms is sufficient to scour all of the available sequence data.

1.2 Sequence alignment as a reflection of similarity

Although many properties of nucleotide or protein sequences can be used to derive a similarity score (e.g. nucleotide or amino acid composition, isoelectric point, or molecular weight), the vast majority of sequence similarity calculations rely on an alignment between two sequences from which a similarity score is inferred.

Bioinformatics: Methods Express (Paul H. Dear, ed.)


© Scion Publishing Limited, 2007
Ideally, the alignment matches the nucleotide or amino acid sequences from either sequence according to their evolutionary descent from a common ancestor, with conserved residues at matched positions and inserted/deleted fragments intervening at proper sequence positions. Often, however, evolution has led to widely diverged sequences where the ancestral ties have become blurred beyond recognition, leading to biologically incorrect alignment.

Another confounding issue is the fact that an increasing number of cases are identified with nonorthologous displacement, where enzymes carrying out an identical function in different organisms belong to entirely different protein families and thus are not expected to show any sequence similarity. For example, the ornithine decarboxylase spe1 in Saccharomyces cerevisiae has a completely different domain structure from - and is not related to - the Escherichia coli ornithine decarboxylase isozymes speC and speF (1). Nor are sequence alignment techniques able to trace evolutionary cases of horizontal gene transfer or functional displacement of one gene by another within a genome.

1.3 Similarity versus homology

The term 'homologous sequence' is often used when in fact a sequence should only be described as 'similar' to a given reference sequence (2). Whereas sequence similarity is a quantification of an empirical relationship of sequences expressed using a gradual scale, 'homology' denotes an inference of a common ancestor between the sequences. Sequence similarity is normally used to assess the likelihood of homology, but homology itself is a qualitative state: a pair of sequences is either homologous or not. As protein tertiary structures are more conserved during evolution than their coding sequences, homologous sequences are assumed to share the same protein fold. Although it is possible in theory that two proteins evolve different structures and functions from a common ancestor, this situation cannot be traced and so such proteins are seen as unrelated. However, numerous cases exist of homologous protein families where subfamilies with the same fold have evolved distinct molecular functions. The term homology is often used in practice when two sequences have the same structure or function, although in the case of two sequences sharing a common function this ignores the possibility that the sequences are analogs resulting from convergent evolution, now often referred to as nonorthologous displacement.

Unfortunately, it is not straightforward to infer homology from similarity, as enormous differences exist between sequence similarities within homologous families. Many protein families of common descent comprise members that share pairwise sequence similarities that are only slightly higher than those observed between unrelated proteins. This region of uncertainty has been characterized to lie in the range of 15-25% sequence identity (3) (see below) and is commonly referred to as the 'twilight zone'. There are even some known examples of homologous proteins with sequence similarities below the randomly expected level given their amino acid composition (4). As a consequence, it is impossible to prove using sequence similarity that two sequences are not homologous.

The similarity score for two sequences can be calculated from their alignment (see below), such that it depends on the actual scoring matrix and gap penalties used. It has also been calculated as a fraction of a maximal score possible for two sequences using a normalized scoring matrix and by normalizing the raw alignment score by the length of the shorter sequence (5).

1.4 Techniques for pairwise alignment

1.4.1 The dynamic programming algorithm

Protein sequences mutate to varying degrees of divergence through evolution. In order to identify homologous proteins and reveal important similarities, a range of sequence alignment methods are commonly used (for a recent overview, see 6). These methods rely mainly on approximated evolutionary models that aim to reflect as accurately as possible the evolutionary paths that connect two or more protein sequences.

Many methods for the calculation of sequence alignments have been developed, of which implementations of the dynamic programming algorithm (7, 8) are considered the standard in yielding the most biologically relevant alignments. (For three or more sequences, these methods apply the progressive strategy (9), where sequences are hierarchically aligned in pairs according to a pre-generated tree, based on their sequence similarity; see Chapter 11 for a discussion of multiple sequence alignment).

The dynamic programming algorithm (7) requires a scoring matrix, which is an evolutionary model expressed in the form of a symmetrical 4x4 exchange matrix for nucleotide sequences or a 20x20 matrix for amino acids: each matrix cell approximates the evolutionary propensity for the mutation of one nucleotide or amino acid type into another, including self-conservation. For this purpose, it is common to use pre-determined substitution scores (e.g. the scores from the BLOSUM (10) and PAM (11) series and more recently the JTT (12), GONNET (13), VT (14), and VTML (15) series) that have been derived using a specific set of 'true' alignments. However, these 'standard' substitution scores reflect a standardized evolutionary model and introduce inconsistencies when applied to nonstandard cases (16). Although this does not impact too severely on alignments between closely related sequences, sequences in the so-called 'twilight zone' (<30% sequence identity) are extremely difficult to align (3), partly for this reason. This is because the evolutionary scenario relating them becomes virtually undetectable against the 'noise' introduced by the extent of mutational change that has occurred (17).

The dynamic programming algorithm also relies on the specification of gap penalties, which model the relative probabilities for the occurrence of insertion/deletion events during evolution. In most available methods, a penalty score is applied for creating (opening) a gap, and a further penalty score is added for each extension of the gap (affine gap penalties), so that the chance for an insertion/deletion depends linearly upon the length of the associated fragment. Given an exchange matrix and gap penalty values (which together are commonly called the
scoring scheme), the dynamic programming algorithm is guaranteed to produce the highest scoring alignment of any pair of sequences, the optimal alignment.

1.4.2 Global versus local alignments

Two types of alignment are generally distinguished: global and local alignment. Global alignment (7) denotes an alignment over the full length of both sequences, which is an appropriate strategy to follow when two sequences are similar or have roughly the same length. However, some sequences may show similarity limited to a motif or a domain only, whilst the remaining sequence stretches may be essentially unrelated. In such cases, global alignment may well misalign the related fragments, as these become overshadowed by the unrelated sequence portions that the global method attempts to align, possibly leading to a score that would not allow the recognition of any similarity. If not much knowledge about the relationship of two sequences is available, it is usually better to align selected fragments of either sequence. This can be done using the local alignment technique (8). The first method for local alignment, often referred to as the Smith-Waterman algorithm (8), is in fact a minor modification of the dynamic programming algorithm for global alignment. The algorithm selects the best scoring subsequence from each sequence and provides their alignment, thereby disregarding the remaining sequence fragments. Later elaborations of the algorithm include methods to generate a number of suboptimal local alignments in addition to the optimal pairwise alignment (18).

1.5 Alignment scores as a measure of similarity

In order to optimize alignments and determine the degree of similarity they reflect, it is obviously necessary to have a measure by which to score an alignment. As the dynamic programming algorithm essentially models the alignment of two sequences as a Markov process, where the amino acid (or nucleotide) matches are considered independent, the product of the probabilities for each match within an alignment should be taken. As many of the scoring matrices contain exchange propensities converted to logarithmic values (log odds), the score for any alignment can be calculated by summing the log-odds values corresponding to matched residues minus appropriate gap penalties:

\[ S_{a,b} = \sum_{l} s(a_i, b_j) - \sum_{k} N_k \cdot gp(k) \]

where the first summation is over the exchange values associated with l matched residues and the second over each group of gaps of length k, with \(N_k\) the number of gaps of length k and gp(k) the associated gap penalty. In case affine gap penalties are used (see above), \(gp(k) = p_i + k \cdot p_e\), where \(p_i\) and \(p_e\) are the penalties for gap initialization and extension, respectively. In other words, the alignment score is composed of a term reflecting the similarities between the aligned residues at each point in the alignment, minus a term reflecting the number and sizes of gaps that were needed to make the alignment. A consequence of the widely used affine gap penalty scheme is that the long gaps required, for example to span an inserted domain B in aligning a two-domain sequence AC (where A and C represent domains) with a three-domain sequence ABC, are often too costly so that such sequences become misaligned.

1.6 Sequence identity as a measure of similarity

In addition to similarity scores calculated as above, a measure of the sequence identity between the two sequences in an alignment is valuable because it is simple and also because it gives a good rule-of-thumb indication of the likely structural and functional relationship between the aligned sequences.

Sequence identity is normally expressed as the percentage of identical residues found in a given alignment, normalized using either the length of the alignment or the length of the shorter sequence. This measurement does not depend on an exchange matrix or on gap penalties and therefore is not directly biased by assumptions about the underlying evolutionary model. However, the alignment itself will almost always have been constructed in the first place using a dynamic programming algorithm, which depends on an exchange matrix and gap penalty values, so sequence identity cannot be regarded as independent from sequence similarity.

Using sequence identity as a measure, Sander and Schneider (19) estimated that if two protein sequences are longer than 80 residues, they could relatively safely be assumed to be homologous whenever their sequence identity is 25% or more. Another commonly used notion is that if two sequences share more than 50% sequence identity, their enzymatic function will be the same (20). Contrary to this notion, however, it has been estimated that 70% of pair fragments above 50% sequence identity might not have a completely identical function (20). An example is Bacillus subtilis exodeoxyribonuclease and rat DNA lyase, which share 57% identity over 122 alignment positions, yet fulfill different functions (DNA degradation and repair, respectively). Despite its popularity and use in empirical rules as above, the use of sequence identity percentages is not optimal for homology searches (5). As a result, no major sequence comparison methods employ sequence identity scores in deriving statistical significance estimates.

1.7 Statistics of alignment similarity scores

Sequence alignment methods can always produce an alignment with an associated similarity score, even in the absence of any biological relationship. Although similarity scores of unrelated sequences are essentially random, they can behave like 'real' scores and, for example, like the latter are correlated with the length of the sequences compared.

For this reason, and particularly in the context of database searching, it is important to know what scores can be expected by chance and how scores that deviate from random expectation should be assessed. If, armed with this knowledge, a similarity between two sequences is deemed to be statistically significant, this provides confidence in inferring a biological relationship. Because
of the complexities of protein evolution and distant relationships observed in nature, any statistical scheme will inevitably lead to situations where a sequence is assessed as unrelated whilst it is in fact homologous (false negative), or the inverse, where a sequence is deemed homologous whilst it is in fact biologically unrelated (false positive). A frequent cause of false positives - and hence of erroneous transfer of annotation - is based on similarity found over relatively short sequence regions, or similarity based on different domains in multi-domain structures (20).

1.8 Protein domains

Many protein families have diverged from common ancestors by evolving different combinations and associations of domains (21-23). Domains are characterized as semi-independent three-dimensional units in proteins, often with a particular function, observed to be genetically mobile and frequently moving within and between biological systems through mechanisms of gene or exon shuffling. An understanding of the domain organization of a protein sequence is crucial for structural and functional genomics initiatives and the reader is referred to Chapter 8 for a discussion of protein architecture and domains.

The correct partitioning of a protein into its putative domains is especially important in the comparative analysis of entire genome sequences. Consideration of domain architecture will shed light on the evolution, structure, and function of a protein family. For example, the 'Rosetta Stone' genome analysis method (24) exploits the fact that a multi-domain protein in one organism may be present as separate (and hence, presumably, interacting) proteins in another organism. It is clear that such analysis requires accurate sequence comparison tools at the level of the domain rather than of the whole protein.

Domain annotation of a protein sequence in the absence of structural information has proved to be a difficult problem. For example, the method of Wheelan et al. (25) is based on the fact that domains have a distinct size distribution, averaging at 100 residues. Accurate predictions are limited to two-domain proteins with less than 300 residues. George and Heringa (26) improved the delineation of protein domain boundaries to 52% using a consistency-based protocol over sets of protein ab initio three-dimensional model structures generated using distance geometry. Currently, most annotated domain databases are based on inferring domains by sequence similarity searches (27-32). A number of these search techniques will be discussed in the next section.

A typical application to infer knowledge for a given query sequence is to compare it with all sequences in an annotated sequence database. Unfortunately, the dynamic programming algorithm (see above) is too slow for repeated searches over large databases and may take many hours for a single query sequence on a standard workstation. However, for any biologist who has a new protein sequence of unknown functionality, comparison with all known and annotated sequences is paramount.

Therefore, fast routines have been devised that enable database searches on even small computers with only a small loss of sensitivity compared with searches using full dynamic programming. With the recent advent of parallel multi-processor computers at central sites, researchers can routinely perform multiple sequence searches over complete sequence databases. However, for large-scale application of the dynamic programming technique, the computational requirements are still prohibitive. For example, consider the task of searching the Swiss-Prot database against a query sequence of 400 amino acids. As release 50.0 of UniProtKB/Swiss-Prot contains 222289 sequence entries, comprising 81585146 amino acids, finding local alignments via dynamic programming over this database would entail about 10^10 matrix operations. Given the fact that many servers routinely handle thousands of such queries a day (over 50000 per day in the case of the NCBI server), it is clear that the application of dynamic programming would lead to unfeasible waiting times.

Although some special hardware has been designed to accelerate the dynamic programming algorithm, the solution has depended largely on the development of several heuristic algorithms that represent shortcuts to speed up the basic alignment procedure. These include the currently most widely used heuristic method for scouring sequence databases for homologies, PSI-BLAST (33), an extension of the BLAST technology (34), and FASTA (35), which is another commonly used heuristic method for fast sequence comparison. At the same time, advances in computer hardware have made it possible to use some more computationally intense approaches such as the hidden Markov modeling-based tools SAM-T99 (36), SAM-T2K (37), and HMMER2 (38).

2.1 Should one compare protein or nucleotide sequences?

As long as we are considering sequences between encoded proteins, the actual pairwise comparison between two sequences can take place at the nucleotide or peptide level. However, the most effective way to compare sequences is at the protein level (39), which requires that nucleotide sequences must first be translated in all six reading frames followed by comparison with each of these conceptual protein sequences.

Although mutation, insertion, and deletion events take place at the DNA level, there are several reasons why comparing protein sequences can reveal more distant relationships:

1. Many mutations within DNA are synonymous, which means that they do not lead to a change in the corresponding amino acids. As a result of the fact that most evolutionary selection pressure is exerted on protein sequences, synonymous mutations can lead to an overestimation of the sequence divergence if compared at the DNA level.
2. Evolutionary relationships can be expressed more finely using a 20x20
amino acid exchange table than by using exchange values among four
nucleotides, leading to a significant increase in statistical subtlety for protein
sequences. Amino acid substitution matrices incorporate subtle differences
in physicochemical properties among the 20 residue types, rendering protein
sequences more informative than nucleotide sequences.
3. DNA sequences contain noncoding regions, which should be avoided in
homology searches. Note that the latter is still an issue when using DNA
translated into protein sequences through a codon table. However, a
complication arises when using translated DNA sequences to search at the
protein level because frame shifts can occur, leading to stretches of incorrect
amino acids in the wrongly translated product and possible elongation of
sequences due to missed stop codons. On the other hand, frame shifts typically
result in stretches of highly unlikely and distant amino acids, which can be
used as a signal to trace their occurrence.

2.2 Curated and annotated sequence databases

The success of sequence similarity searches depends crucially on the quality and
coverage of the sequence database used. Although the amount of raw sequence
data is increasing rapidly, and although modern sequencing techniques achieve a
very high accuracy, the utility of this data depends crucially upon its annotation.
Incorrect annotation of database sequences can distort similarity searches (for
example, when the location or structure of predicted genes in the database
sequence is incorrect), or can lead to false inferences when genuine similarities
are found but the database sequence has been annotated with an incorrect
function.
As inferring and experimentally validating the annotations represents a
bottleneck, there is a rapidly widening gap between sequence and annotation
data. This is reflected by the fact that many sequences have 'unknown' as their
functional annotation, whilst an increasing number of sequences, especially
those originating from bacterial genomes, have annotations such as 'conserved
hypothetical'. Conserved hypothetical open reading frames have homologs,
usually in other organisms (which at least gives reassurance that the open reading
frame truly is a gene - see Chapter 4), but none of these homologs have known
functions.
Although many new protein structures are now being determined using
X-ray crystallography, nuclear magnetic resonance spectroscopy, and cryoelectron
microscopy, without direct experimental evidence there is considerable difficulty
in assigning functions to proteins from their structures. This can even be the case
for homologs of well-characterized proteins because of the recruitment of similar
proteins for divergent functions. Computational prediction methods can aid to
some extent, but for reliable annotation, manual curation is often essential.
Widely used annotated databanks for homology searches include the annotated
EMBL, GenBank, and DDBJ for nucleotide sequences, whilst for protein sequences
the Swiss-Prot, Protein Information Resource (PIR), TrEMBL, GenPept, NR-NCBI,
and NR-ExPasy databases are popular. Also the genome survey sequence (GSS),
expressed sequence tag (EST), sequence-tagged site (STS), and high-throughput
genomic sequence nucleotide databases can be scoured to find homologies, gain
insight into expression data, or locate a gene on the genome map. The NR-NCBI
database is compiled by the National Center for Biotechnology Information (NCBI)
as a nonredundant (NR) protein sequence database for BLAST searches. It contains
a total of about one million nonidentical sequences from GenBank CDS (coding
sequence) translations, Protein Data Bank (PDB), Swiss-Prot, PIR, and Protein
Research Foundation (PRF).

2.3 Heuristic sequence similarity searching methods

Both the FASTA and BLAST suites of programs feature a quick step for initial filtering
of the database sequences, followed by a second slower step to scrutinize the
sequences and compile the final alignments between the query and each of the
database sequences. If the initial filtering step is too strict, there is a biological risk:
homologous sequences will be discarded before the more detailed analysis and are
lost (false negatives). If the initial filtering step is too permissive, however, there
is a computational penalty because too many unrelated sequences are passed
through to the slower subsequent step. In both the FASTA method and a recent
implementation of the BLAST algorithm, the slow step incorporates the dynamic
programming algorithm to compile a local alignment.

2.3.1 FASTA

In the early years of sequence database searching, the heuristic method FASTA (35)
was the most widely used technique. The FASTA program compares a given query
sequence with a library of sequences and calculates for each pair the
highest-scoring local alignment. The speed of the algorithm is obtained by delaying
application of the dynamic programming technique to the moment where the
most similar segments are already identified by faster and less-sensitive techniques.
To accomplish this, the FASTA routine operates in four steps of which the first two
represent a quick filter to eliminate sequences that have no fragments scoring
beyond a specified threshold value. Sequence fragments that score beyond a
given threshold value after the first two steps are combined and realigned in the
last two steps.
The four basic steps of FASTA are as follows:
1. The first step searches for identical 'words' (short segments of sequence) of a
user-specified length ('ktup') occurring in the query sequence and the target
sequence(s). For each target sequence, the ten regions with the highest density
of ungapped common words are determined. The technique is based on that of
Wilbur and Lipman (40, 41) and, for not-too-distant sequences (>35% residue
identity), little sensitivity is lost whilst speed is greatly increased. The search is
performed by 'hashing techniques', where a look-up table is constructed for
all words in the query sequence and is then used to compare all encountered
identical words in the database sequence(s). Generally, for proteins, a word
length of two amino acids is sufficient (ktup=2), whilst for nucleotide sequences
ktup=6 is the default word length. Searching with higher ktup values increases
the speed but also the risk that similar regions are missed.
2. In the second step, these ten regions are rescored using the Dayhoff PAM-250
residue exchange matrix (42).
3. In the third step, a threshold value is applied to filter the ten regions: sequences
with none of the ten regions scoring beyond the threshold are effectively
discarded at this point; regions scoring higher than the threshold value and
being sufficiently near to each other in the sequence are joined, now allowing
gaps. The highest-scoring region of these new fragments is retained.
4. The fourth and final step performs a full dynamic programming alignment
over the region yielded in the preceding step, which is widened by 32 residues
on either side (43).

In early FASTA versions, the best-scoring regions resulting from steps 2 and 3 above
were reported as init1 and initn in the FASTA output, respectively, whilst the final
alignment score (step 4) was written under opt. Modern implementations of FASTA,
however, only report an E value for each of the database sequence fragments
aligned with the query as a measure of their statistical significance as putative
homologs (see section 2.4).
In Protocol 1, we will give an example of the use of FASTA, focusing on a subunit
of a large enzymatic complex called cytochrome c oxidase. This complex is found
both in bacteria and mitochondria where it catalyzes electron transfer through
the last part of the respiratory chain. The starting point will be subunit IV of the
mouse (Mus musculus) cytochrome c oxidase complex.

Protocol 1
A typical search using FASTA
1. First, retrieve the query sequence that we will be using for this example. Go to the NCBI web
site at http://www.ncbi.nlm.nih.gov [3.1]. From the pull-down menu headed Search at the top
left, select Protein and type 'NP_034071' in the adjacent text box. Click Go or press 'Enter' on
your keyboard; we now see a link to the corresponding entry.
2. Click on the link (NP_034071) at the top of the entry and then, from the Display menu (left,
towards the top), select FASTA to display the protein sequence in the FASTA format. Copy the
entire entry (from the header line starting '>gi|6753498' to the end of the protein sequence,
ending '...DKNEWKK').
3. Paste the text into a new document in Word or another text processor and save the file as
'NP_034071.fa' on your computer. It is important to save it as 'text only'.
4. There are several other ways to obtain the same sequence from NCBI or from other databases
- for example, see Chapters 1 and 2 for general information on retrieving sequences from
databases. Also, a copy of the file NP_034071.fa can be downloaded from the Protocol 1
folder for this chapter on the book's web site.
5. For running the actual FASTA routine, go to the home page of the method at
http://www.ebi.ac.uk/fasta33/ [3.2]. Note that the FASTA method should not be confused
with the FASTA sequence format mentioned in the preceding step.
6. This page offers many variants of FASTA, but we will use FASTA3 - make sure that this is selected
(highlighted) in the Program menu.
7. There are numerous options on the page, but most of them should be left at their default
values. You can (for example) choose to receive the results by e-mail rather than interactively.
You can also specify which protein sequence databases to search against (menu at
right); you can select multiple databases by Shift-clicking or by other key/mouse combinations
(depending on your computer and browser) but, for this example, leave the Databases menu
set to UniProt.
8. The other parameters define the details of the search. The values for the gap penalty (both to
open a gap and to extend it by one residue), the ktup value (the size of the 'words' that are
used in the early stages of the search - see above), and the substitution matrix that is used
to evaluate the similarity between amino acids can all be altered, but for now, leave them at
their default values.
9. The expectation upper and lower values (E values) can also be altered. The E value is a measure of
the statistical significance of a hit (see below) - higher values correspond to lower significance.
The default upper value of 10 ensures that we will find even fairly distantly related proteins.
The default lower value is effectively zero - so that we will also find extremely closely related
(or identical) proteins. Leave these settings at their defaults.
10. It is also possible to restrict the search to a part of the query sequence (using Sequence
range), or to compare the query only against database proteins that have a certain range of
sizes (Database range), but we will leave these at their default settings, so that we search all
of our query sequence against all proteins in the database.
11. Under Scores & Alignments, set both Scores and Align to 100 (the default values are 50)
- this ensures that we will retrieve up to 100 matches.
12. Further help can be obtained by clicking on any of the colored menu titles.
13. Either copy and paste the complete contents of NP_034071.fa into the large text window
lower down the screen, or use the Choose file (or Browse) button to choose this file. (Note
that many other sequence formats can also be used.)
14. Click Run Fasta3 and wait for your job to be processed (this should take only a minute or
so).
15. When your results page appears [a], it should look similar to Fig. 1. (Of course, the UniProt
database that was searched is updated frequently. It is therefore quite likely that some of the
'hits' will change by the time you read this. At the time of writing, this search produced 81
hits.)
16. The upper part of the screen (Submission parameters) simply summarizes the settings you
have used and some aspects of your query sequence.
17. The lower part of the screen is a table listing the hits, starting with the most significant
(lowest E value - see section 2.4). (Clicking on any of the other headings in the table will cause
the results to be resorted by that parameter; if you try this, click on E() afterwards, to again
sort the list by E value.)
18. For each hit, there is a link to UniProt (under DB:ID). Also reported are the length of the
protein, the percentage of residues in the alignment that were either identical or similar to
those in the query sequence, and the length of the overlap between the two sequences in the
alignment.

Figure 1. FASTA output.
Below the summary table (which gives details of the search parameters), 81 hits are listed in
order of increasing E value (decreasing significance).

19. Note that the top sequences have very low E values and can therefore be trusted to be
homologous to the query. (The top match, in fact, is identical to the query sequence.) However,
at the bottom of the list we also find some sequences that apparently are not related to the
query (e.g. UNIPROT:Q2CHW9_9RHOB showing an E value of 9). Although there might well be
homologous sequences having unfavorable E values (e.g. UNIPROT:Q2TWP1_ASPOR, E value
8.1), users should generally be cautious of E values above 0.001, as at this
score level, false positives can arise.
20. Click on Show alignments (underneath the Submission Parameters). The display will now
show alignment details and the alignment itself for each of the hits. Each alignment shows
the aligned parts of the two proteins plus (if the alignment covers only part of the sequences)
[...]. A ':' indicates that the amino acids in the query sequence and the hit are identical,
a '.' that they are similar (for example, lysine and arginine).
21. Click on Summary table to return to the previous page.
22. Click on MView. The hits are now displayed in color-coded form: residues identical to the
query sequence in the respective alignment are colored, whilst nonidentical residues are gray.
(Note that the display is in 'blocks' - the first 80 residues for each aligned protein, then the
next 80, etc.). Also shown are consensus sequences at different levels of stringency, although
this is only meaningful when all of the proteins are well aligned.
23. Click on Return to result to return to the results table.
24. There are many options for viewing and downloading the results, or for selecting only some of
the hits (using the check boxes under Alignment on the left of the table) to examine further.
In particular, the VisualFasta option gives a quick graphical indication of the strength and
extent of each of the alignments.

Note
[a] Note that results are stored for about 24 h. Thus, if you copy the web address of the results page,
you can return to it at a later point.

Protocol 2 suggests a variant on the previous search, this time examining only
more distantly related proteins.

Protocol 2
A FASTA search for distant homologs
1. Repeat the FASTA search from Protocol 1, but this time set the Expectation upper value and
Expectation lower value to 20 and 0.001, respectively. This will find only those sequences
with a low similarity (down to an E value of 20, which is of very low significance) and will
exclude the most similar sequences (with E values below 0.001). At the time of writing, this
search produced 24 hits.
2. If you look at these hits using the MView option, you will see that the identity with the query
sequence is generally very sparse, and that the identical residues are widely scattered and
tend to be at different places in the different hits. This suggests that many of these hits are
spurious.
3. There will certainly be some true homologs among these less-significant hits, but it is difficult
to spot these purely from the sequence identities in MView. Examining the alignments gives a
little more information; a true but distant homolog might be expected to have some similarity
across most of the protein length (rather than good similarity on only one or a few small
areas). However, more confident identification of true distant homologs is not possible using
this simple search strategy.

2.3.2 BLAST

Since its inception in 1990, the BLAST (basic local alignment search tool) program has
quickly gained a dominant position, and the original publication of the technique
(34) is the most cited paper in molecular biology to date. BLAST is a speed-optimized
technique that maintains significant sensitivity through the combination of a fast
and subsequent slow algorithmic step.
The BLAST suite includes a number of variants to allow all possible combinations
of comparisons between nucleotide or protein sequences. In particular, nucleotide
sequences can be translated in all six possible reading frames for comparison either
with protein sequences or with other, similarly translated nucleotide sequences
(see Table 1).

Table 1. BLAST variants
Note the distinction between BLASTN (which compares query and database nucleotide sequences directly
with each other) and TBLASTX (which first translates the query and database sequences in all possible
frames and then compares the resulting protein sequences).

Program   Query sequence   Database     Notes
BLASTN    Nucleotide       Nucleotide   Direct comparison between nucleotide sequences
BLASTP    Protein          Protein      Direct comparison between protein sequences
BLASTX    Nucleotide       Protein      All six translations of the query sequence are compared
                                        with the protein database
TBLASTN   Protein          Nucleotide   The query protein is compared against all six translations
                                        of each sequence in the nucleotide database
TBLASTX   Nucleotide       Nucleotide   All six translations of the query sequence are compared
                                        against all six translations of each sequence in the
                                        nucleotide database

The basic idea behind the initial quick filtering step in BLAST is the generation
of all consecutive tripeptides in a given protein query sequence, or 11-nucleotide
words when searching with a DNA sequence. For each of the words, a table is
constructed of words deemed to be 'similar', where the number of similar
tripeptides corresponds to only a fraction of the 20^3 possible tripeptides, or 4^11
possible nucleotide 11-mers. The BLAST program uses the tables of similar words to
quickly scan a database of protein or nucleotide sequences for ungapped regions
showing high similarity; each time, a database word is accepted whenever it
occurs in the table for the query word considered.
In this respect, BLAST differs significantly from FASTA, in that it can consider
similar 'words' in the early stages of the search whereas FASTA considers
identical words. Similar regions found by BLAST between the query and database
sequences scoring beyond a given threshold, T, are referred to as high-scoring
segment pairs (HSPs) and are retained for further processing. To score these
regions, BLAST employs the BLOSUM62 amino acid exchange matrix (10), such
that the existence of HSPs scoring higher than T signifies pairwise similarity
beyond random probability, which is taken as a signal that the database
sequence considered is related. The computational strategy behind the
quick initial step in the BLAST method is based on deterministic finite automata,
which allows very quick searching of the similar-words table associated with
each query word.
The original BLAST method features a slow algorithmic step that tries to refine
the database hits by extending each HSP in either direction in an attempt to
generate a longer alignment with a higher score than the nonextended region.
During extension, the alignment score is temporarily allowed to drop, but not by more
than a pre-set drop threshold (set to 20 for protein sequences and 22
for DNA sequences), before the score picks up again to arrive at a higher value. The
maximum scoring pairs (MSPs) resulting from the word extensions are presented
as the final result in the original BLAST version.
As the original BLAST program spends more than 90% of its time on extending
words, a key improvement of the BLAST method (33) has been to extend words
only when there are two hits on the same diagonal within a given distance, A,
of each other (in other words, where there is a chance of the extension running
into further regions of similarity). As this would lead to far fewer words being
extended, sensitivity is maintained by lowering the value of T for finding the
initial HSPs. With a lower value of T, far more single hits are produced, but only
a minority has an associated second hit nearby on the same diagonal. In the
more recent version of BLAST, word extension is done using dynamic programming,
leading to gapped alignments. The updated technique, referred to as gapped BLAST
(33), is therefore more similar in spirit to the earlier FASTA program (35) than to the
earlier BLAST method (34). It is also slightly faster than the earlier BLAST method, as
extension by dynamic programming is only triggered when the aforementioned
two-hit extension has a sufficiently large score. If this is the case, the highest
scoring segment of length 11 along the region covered by the two-hit extension
is taken as the seed. Dynamic programming is then initiated in the forward and
backward directions from the central pair in the 11-long HSP. Gapped extension
proceeds as long as the score remains above a given threshold, whilst the score
is temporarily allowed to drop below the threshold as long as it takes off again
and rises above the threshold value. The ends of the alignment are finally pruned
to yield the best local alignment given the 11-residue seed, and this alignment is
reported to the user.
A difference between FASTA and BLAST is that FASTA uses the BLOSUM50 substitution
matrix when calculating similarities between protein sequences, whilst BLAST uses
BLOSUM62 (although this can be changed in both programs). BLOSUM62 is a
'harder' matrix (i.e. overall it tends to report a lower similarity between any two
nonidentical amino acids than BLOSUM50), which is amenable to less-divergent
sequence comparisons. Another difference is that the BLAST server can return a
maximum of 20000 hit sequence descriptions and alignments, whilst the FASTA
server (Protocol 1) is limited to a maximum of 100.
Protocol 3 takes the user through a basic BLAST search, again using subunit IV of
the mouse cytochrome c oxidase complex from M. musculus as an example.
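
The word-table filtering and the two-hit rule described in this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the BLAST implementation: `toy_score` is a made-up stand-in for a substitution matrix such as BLOSUM62, the function names are hypothetical, and the word threshold of 11 and the distance of 40 are arbitrary example values that merely play the roles of T and A. The table of similar words is built by brute-force enumeration of all 20^3 tripeptides rather than with a finite automaton.

```python
from collections import defaultdict
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Hypothetical groups of "similar" residues, used only by the toy scoring below.
SIMILAR_GROUPS = ("ILMV", "FWY", "KRH", "DE", "ST", "NQ")


def toy_score(a, b):
    """Made-up residue score (assumption): a real search would use BLOSUM62."""
    if a == b:
        return 5
    return 1 if any(a in g and b in g for g in SIMILAR_GROUPS) else -2


def neighborhood(word, threshold, score=toy_score):
    """Return every word of the same length scoring >= threshold against `word`."""
    return {
        "".join(candidate)
        for candidate in product(AMINO_ACIDS, repeat=len(word))
        if sum(score(x, y) for x, y in zip(word, candidate)) >= threshold
    }


def word_hits(query, subject, w=3, threshold=11, score=toy_score):
    """Yield (query_pos, subject_pos) whenever a subject word occurs in the
    table of similar words built for some query word."""
    table = defaultdict(list)  # similar word -> query positions it came from
    for i in range(len(query) - w + 1):
        for similar in neighborhood(query[i:i + w], threshold, score):
            table[similar].append(i)
    for j in range(len(subject) - w + 1):
        for i in table.get(subject[j:j + w], ()):
            yield i, j


def two_hit_seeds(hits, w=3, max_distance=40):
    """Return (diagonal, first_pos, second_pos) for pairs of non-overlapping
    hits on the same diagonal that lie within `max_distance` of each other."""
    by_diagonal = defaultdict(list)
    for i, j in hits:
        by_diagonal[j - i].append(i)
    seeds = []
    for diagonal, positions in by_diagonal.items():
        last = None
        for pos in sorted(positions):
            if last is not None and pos - last < w:
                continue  # overlaps the stored hit, so it is ignored
            if last is not None and pos - last <= max_distance:
                seeds.append((diagonal, last, pos))
            last = pos
    return seeds


if __name__ == "__main__":
    hits = list(word_hits("ACDKLMNPQRST", "ACDRLMNPQRST"))
    print(two_hit_seeds(hits))
```

In this sketch only the seeds returned by `two_hit_seeds` would be passed on to the slower extension step, which is how the two-hit rule reduces the number of extensions even though a lower word threshold produces more single hits.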

Protocol 3
A typical search using BLAST
1. Open the BLAST homepage at http://www.ncbi.nlm.nih.gov/blast/ [3.3].
2. Click on protein blast.
3. Paste the cytochrome c oxidase sequence from M. musculus in FASTA format (see Protocol 1)
into the Enter Query Sequence box.
4. Below we can see that the choice of available databases is different from that available
from the FASTA search page (Protocol 1), although most are major protein databases that will
have essentially the same content. Leave Database at its default of nr - this encompasses
all nonredundant translations from GenBank and several other major databases (click on the
Database link, or indeed on any of the highlighted headings, for a 'Help' page giving more
details).
5. Leave all other settings at their defaults for now and click BLAST to start the search using
standard BLASTP (protein-protein BLAST).
6. Typically within a few seconds the results of the BLASTP run will be displayed (see Fig. 2).
7. At the top of the results page (after the references), a graphic similar to FASTA's MView (see
Protocol 1) represents the extent and significance of hits against the query sequence. Moving
the mouse over any of the colored bars will cause details of that alignment to appear in the
small text window above the graphic.
8. Below the graphic, the hits are listed in order of increasing E value (decreasing significance).
The hit sequences found for the cytochrome c oxidase query are in line with those retrieved by
the FASTA method (Protocol 1), but the different nomenclatures do not make this obvious.
9. The default threshold for the E value (see section 2.4) is 10, but can be adjusted by the user. As
with the FASTA method, sequences with E values beyond 0.001 should be treated with caution
as they may well be unrelated (e.g. sequence gi|118401825|ref|XP_001033232.1|, which
has an E value of 2.6).
10. Links on the left of each of the named hits link to the Entrez protein database entry for the
protein. U and G symbols to the right of some hits link to the entries in UniGene or Entrez
Gene. From all of these linked databases, there are numerous links to other resources for that
protein.
11. Below the list of hits, each alignment is shown in detail. (You can jump directly to one of the
alignments by clicking on the link under Score in the list of hits, or by clicking on the colored
bar in the graphic window.) Note that redundant entries are listed here (although only one of
them is given in the graphic window and in the list of hits).
12. On the query page (step 3), there are numerous options to alter the search parameters or limit
the search. In particular, under Algorithm parameters (below), the E value threshold (Expect
threshold), word size (equivalent to ktup in FASTA), and the gap penalties (Gap costs) can be
adjusted, as can more-complex parameters.
13. For example, the Organism box (under Choose Search set) can be used to limit the search to
specific organisms or groups of organisms. (Try repeating the search with this option set to
Custom ... combined with Sus scrofa; the search should return only a few hits, all from pig.)
More complex limits can be imposed by typing an Entrez query into the adjacent text box; see
Chapter 1 for more information on Entrez search terms.

Figure 2. BLAST output.
Only sections of the full output page are shown. Beneath the references at the top of the page,
a graphic shows the distribution and strength of hits against the query sequence, as a series
of colored bars. Below this, a table lists the hits in order of increasing E value (decreasing
significance), with links to other databases. Below this, each alignment is shown in detail
(duplicate entries, such as gi|109129420 and gi|109129422, are shown as a single alignment).
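
Steps 8 and 9 of Protocol 3 amount to sorting the hits by E value and treating those above roughly 0.001 with caution. The sketch below shows one way to do the same thing away from the web page. It assumes that the hits have been saved as a tab-separated table whose 12 columns follow the default layout of the NCBI BLAST+ `-outfmt 6` option; the function names and the two sample rows are made-up placeholders, not real search results.

```python
import csv
import io

# Column order assumed to match the default 12-column BLAST+ "-outfmt 6" table.
COLUMNS = ("qseqid sseqid pident length mismatch gapopen "
           "qstart qend sstart send evalue bitscore").split()


def read_hits(handle):
    """Parse tab-separated hit lines into dictionaries with numeric scores."""
    hits = []
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        hit = dict(zip(COLUMNS, row))
        hit["evalue"] = float(hit["evalue"])
        hit["bitscore"] = float(hit["bitscore"])
        hits.append(hit)
    return hits


def split_by_evalue(hits, cutoff=0.001):
    """Sort hits by increasing E value and split them at `cutoff`."""
    ordered = sorted(hits, key=lambda h: h["evalue"])
    trusted = [h for h in ordered if h["evalue"] <= cutoff]
    doubtful = [h for h in ordered if h["evalue"] > cutoff]
    return trusted, doubtful


if __name__ == "__main__":
    # Two invented rows, used only to show the expected input layout.
    sample = (
        "query\thit_B\t25.0\t60\t45\t2\t10\t69\t5\t64\t2.6\t30.4\n"
        "query\thit_A\t100.0\t169\t0\t0\t1\t169\t1\t169\t9e-90\t330\n"
    )
    trusted, doubtful = split_by_evalue(read_hits(io.StringIO(sample)))
    print([h["sseqid"] for h in trusted], [h["sseqid"] for h in doubtful])
```

The cutoff of 0.001 is the value suggested in the protocols; hits in the second list are not necessarily unrelated, but they deserve a closer look at the alignment before any conclusion is drawn.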

simulations have indicated that the theory probably applies to gapped alignments
as well, so that its application to general pairwise alignment is not likely to
introduce error. Therefore, E values are roughly comparable across the various
search tools. A number of state-of-the-art homology search techniques adopt
the Karlin-Altschul statistical framework and routinely calculate P or E values for
each query-database pairwise sequence comparison (see next section).

2.4.1 Statistics of local alignments without gaps

An important contribution for fast sequence database searching has been the
realization (45, 47, 48) that local similarity scores of ungapped alignments follow
the extreme value distribution (EVD) (49). This distribution is unimodal but not
symmetrical like the normal distribution, because the right-hand tail at
high-scoring values falls off more gradually than the lower tail, reflecting the fact that
the best local alignment is associated with a score that is the maximum out of a
great number of independent alignments (see Fig. 3).
Following the EVD, the probability (P) of a score S being larger than a given
value x can be calculated as:

$P(S \geq x) = 1 - \exp\left(-e^{-\lambda (x - \mu)}\right)$

where $\mu = (\ln Kmn)/\lambda$ and $K$ is a constant that can be estimated from the background
amino acid distribution and scoring matrix (for a collection of values for $\lambda$ and $K$

2.3.3 BLAT

A recent adaptation of the BLAST routine is the BLAT (BLAST-like alignment tool; 44)
program. BLAT performs rapid mRNA/DNA and cross-species protein alignments. It
is more accurate and about 500 times faster than other tools such as BLAST when
used for mRNA/DNA alignments, and 50 times faster for protein alignments at
sensitivity settings typically used when comparing vertebrate sequences. When
BLAT is applied to DNA sequences, it builds an index of the entire genome in
memory. The index consists of all nonoverlapping 11-mers except for those heavily
involved in repeats. As the total index amounts to less than a gigabyte, it can
be kept in RAM for quick access, allowing BLAT to perform very quick searches on
a standard Linux box. The index is used to delineate areas that are likely to be
homologous, which are then loaded into memory for further detailed alignment.
Protein BLAT works in a similar manner, except with 4-mers rather than 11-mers.
The protein index takes a little more than 2 gigabytes, which is also feasible on
modern workstations.
The standard implementation of BLAT quickly finds sequences of ≥95% similarity
and length 40 bases or more when applied to DNA. It may miss more divergent
regions of longer length or more similar ones of shorter length. None the less,
BLAT is guaranteed to detect sequence matches down to 33 bases and sometimes
detects identical regions as short as 20 bases. When applied to search protein
sequences, BLAT finds sequences of ≥80% sequence identity and of length 20 amino
acids or more. In practice, DNA BLAT works well on primates and protein BLAT on land
vertebrates.
o.W
2.4 Statistical significance of search results - E values
I
The BLAST method is based on an exhaustive statistical analysis of ungapped
f
aiignments (45) and provides a rigorous statistical framework, based on the extreme 30
r
I
0.
value theorem, to estimate the statistical significance of putative homologs.
The E(or expectation) value indicates the expected number of sequences with
an alignment score equal to or greater than that of the alignment considered,
taking into account factors such as the size of database being searched and the
composition of the query sequence. For example. if an alignment has an E value
of 1e-9 (10- 9), this means that a match with that score (or better) would only be
expected to occur by chance (i.e. in the absence of true homology) in the database
f020f I
with a probability of 1 in a billion and is thus highly significant. Conversely, if a hit
0.10
I I
has an Evalue of3.0, this means that one might expect about three equally similar
sequences to be found in the database by chance alone - clearly, therefore, such
a hit is not necessarily a homologous sequence.
The original BLAST program could detect only local alignments without gaps
and therefore might miss some significant similarities. The more recent version of r==--
the BLAST algorithm, gapped BLAST, is able to insert gaps in the alignments, leading O.OO~~.O <
0.0 Score 50
. 10.0
to increased sensitivity (33). The original statistical framework for ungapped Figure 3. Extreme value distribution.
alignments is used to assess the significance of the gapped alignments as well, Shown is the probability density function for the extreme value distribution (EDV)
although no mathematical proof for this is yet available (46). However, computer resulting from parameter values J1 = 0 and A, = 1.
over a set of widely used scoring matrices, see 50). Using the equation for μ, the probability for S becomes:

$$P(S \geq x) = 1 - \exp\left(-Kmn\,e^{-\lambda x}\right)$$

In practice, the probability P(S ≥ x) is estimated using the approximation:

$$1 - \exp\left(-e^{-x}\right) \approx e^{-x}$$

which is valid for large values of x. This leads to a simplification of the equation for P(S ≥ x):

$$P(S \geq x) \approx e^{-\lambda (x - \mu)} = Kmn\,e^{-\lambda x}$$

The lower the probability for a given threshold value x, the more significant the score S.

In spite of the usefulness of the above statistical estimates in recognizing sequence similarity, it should be noted that they do not judge the distribution of similarity along the sequences, which is a crucial aspect in assessing homology. For example, a statistically significant alignment score can correspond to a single domain in a multi-domain protein sequence or to a single motif within a domain, thereby still conferring an incomplete biological picture.

2.4.2 Statistics of local alignments with gaps

Although similarities between sequences can be detected reasonably well using methods that do not allow insertions/deletions in aligned sequences, it is clear that insertion/deletion events play a major role in divergent sequences. This means that accommodating gaps within alignments of distantly related sequences is important for obtaining an accurate measure of similarity. Unfortunately, a rigorous statistical framework as obtained for gapless local alignments has not been conceived for local alignments with gaps. However, although it has not been proven analytically that the distribution of S for gapped alignments can be approximated with the EVD, there is accumulated evidence that this is the case: for example, for various scoring matrices, gapped alignment similarities have been observed to grow logarithmically with the sequence lengths (51), as the EVD predicts. Other empirical studies have shown it to be likely that the distribution of local gapped similarities follows the EVD (52, 53), although an appropriate downward correction for the effective sequence length has been recommended (50). The distribution of empirical similarity values can be obtained from unrelated biological sequences (54). Fitting of the EVD parameters λ and K (see above) can be performed using a linear regression technique (54), although the technique is not robust against outliers, which can have a marked influence. Maximum likelihood estimation (55, 56) has been shown to be superior for EVD parameter fitting and, for example, is the method used to parameterize the gapped BLAST method (33). However, when low gap penalties are used to generate the alignments, the similarity scores can lose their local character and assume more global behavior, such that the EVD-based probability estimates are no longer valid (51).

2.4.3 Statistics of database searches

In order to be useful in sequence database searches, the above framework for comparing a pair of random sequences should be adapted to multiple pairwise comparisons. Here, it becomes important to establish the probability for a given query sequence to have a significant similarity with at least one of the database sequences. The P value is the probability of seeing at least one unrelated score S greater than or equal to a given score x in a database search over n sequences. This probability has been demonstrated to follow a Poisson distribution (53):

$$P(x, n) = 1 - e^{-n \cdot P(S \geq x)}$$

where n is the number of sequences in the database. In addition to the P value, some database search methods employ the E value of the Poisson distribution, which is defined as the expected number of nonhomologous sequences with a score greater than or equal to a score x in a database of n sequences:

$$E(x, n) = n \cdot P(S \geq x)$$

For example, if the E value of a matched database sequence segment is 0.01, then the expected number of random hits with score S ≥ x is 0.01, which means that a hit with such a score is expected by chance only once in 100 independent searches over the database. However, if the E value of a hit is 5, then five fortuitous hits with S ≥ x are expected within a single database search, which renders the hit not significant. Database searching is commonly performed using an E value threshold of between 0.1 and 0.001. Low E value thresholds decrease the number of false positives in a database search, but increase the number of false negatives such that the sensitivity (see below) of the search is lowered.

In addition to P or E values, a number of sequence similarity searching routines provide an additional normalized alignment score based on the raw alignment score, S. This score, called the bit score, is defined as:

$$B = \frac{\lambda S - \ln K}{\ln 2}$$

where S is the raw alignment score and λ and K are the aforementioned statistical parameters of the scoring system (50). The bit score, B, is a linear transformation of the raw score and has a standard set of units: the higher the score, the more significant the alignment. As bit scores are normalized with respect to the scoring system, they can be used to compare alignment scores from different searches based on different scoring schemes, which is not warranted using raw alignment scores.

2.5 Fast Smith-Waterman local alignment searches

Collins and Coulson (57) devised a parallel computer protocol to perform database searches based on an implementation of the full Smith-Waterman (8) local alignment technique. They implemented their MPSRCH protocol (58) on massively parallel computers with single-instruction multiple data-type
processors. Following Collins and Coulson (57), a number of implementations that enable fast Smith-Waterman-based local searches have emerged. One of the central computer sites where such programs are running as web servers is the European Bioinformatics Institute outstation of the European Molecular Biology Laboratory. Available are MPSRCH (http://www.ebi.ac.uk/MPsrch/index.html) and a fast implementation of the true Smith-Waterman algorithm (SCANPS) (59), both allowing users to perform database queries via the Internet. The output of a query is a list of top-scoring local alignments (one per protein) where statistical significance measures are also given based on the mean value and standard deviation of the distribution of scores over the entire database (57). The speed of the techniques allows several PAM exchange weight matrices (based on different evolutionary distances) to be used in searching the databanks with the same query sequence.

2.6 Profile searching

A natural extension of sequence database searching is provided by methods that use information over an entire sequence alignment of a certain protein family to find additional related family members. The earliest conceptually clear technique of this kind of sequence searching was called profile analysis (60), which combines a full representation of a sequence alignment with a sensitive searching algorithm. The procedure takes as input a multiple alignment of n sequences. First, a profile is constructed from the alignment, i.e. an alignment-specific scoring table, which comprises the likelihood of each residue type occurring in each position of the multiple alignment. A typical profile has L × (20 + 2) elements, where L is the total length of the alignment and 20 rows are reserved for the number of amino acid types, whilst the last two rows are often reserved for affine gap penalties (see above). Gribskov et al. (60) used a single extra column in the profile to describe the local weight for both the gap opening and the gap extension penalty. For gapless alignment positions, the weight is the maximal value, whereas for positions with insertions/deletions, the weighting factor is lowered according to the maximum length of the gap crossing a given alignment position. The advantage of positional gap weights is that multiple alignment regions with gaps (loop regions) will be assigned lowered gap penalties and hence will be more likely than core regions to attract gaps in a target sequence during profile searching, consistent with structural considerations. However, the implementation by Gribskov et al. does not take the frequency of gaps at each alignment position into account for the estimation of gap opening and/or extension penalties. Many alternative profile implementations therefore reserve the two last columns of the profile for positional gap opening (P_open) and gap extension (P_extend) penalties, which can be individually determined using protocols that take the above considerations into account.

In the approach of Gribskov et al., a profile calculated as described above is aligned with the databank sequence by means of the Smith-Waterman dynamic programming procedure (8). For each database sequence, the alignment score corresponding to the best local alignment quantifies the degree of similarity of this sequence with the probe profile. The scores are then corrected for sequence length, represented in the form of Z scores, and ranked to create the final list of databank search hits. Top-scoring sequences with scores above some threshold level are then likely to be related to the multiply aligned sequences used to build the profile. In addition to aligning a single sequence to a profile, it is also possible to align two profiles. In this case, two matched profile positions receive a score by summing the products of the corresponding propensities from the two profiles over the 20 residue types.

A number of improvements have been effected since the early Gribskov et al. (60) approach. A subclass of profiles for ungapped alignments is referred to as position-specific scoring matrices, a term developed by the BLAST team (33). An approach adopted in many profile-based methods is to implement a more probabilistic and informational scheme based on log likelihoods and normalization using expected residue compositions, which has been shown to lead to more sensitive comparisons than the classic Gribskov et al. approach. A common problem in this approach is the occurrence of zero values at some positions in the matrix. This not only leads to divide-by-zero problems in the analysis, but also fails to recognize the potential for diversity at some sites, which might be seen if a large enough set of sequences was available. A common way to deal with the under-representation of nucleotides or amino acids at alignment positions is the application of pseudo-counts (61-64). Pseudo-counts effectively extrapolate the number of amino acids at each alignment position by adding extra artificial residue counts to the profile, based for example on a known residue composition observed in the database. This generally enhances the predictive power of the profile.

In 1994, Baldi et al. (65) and Krogh et al. (66) pioneered the use of hidden Markov models (HMMs) to represent an aligned block of sequences. A distinct advantage of HMMs over traditional profiles is that an HMM incorporates a richer probabilistic description of insertion and deletion probabilities. Whereas in profiles there is just a single gap penalty for each position, such that the introduction of a gap in either the profile or the sequence aligned against the profile leads to the same penalty, in an HMM these two events can be modeled with different probabilities. An extensive library of HMMs for protein domains is deposited in the Pfam database (30). Profile searching using HMMs is currently one of the most sensitive search techniques.

Bucher et al. (67) unified the profile, motif, and HMM approaches through extension of the profile definition with regular expression-like patterns, weight matrices, and HMMs. They proved that their generalized profiles are equivalent to certain types of HMM. The generalized profiles have been used to extend the PROSITE protein motif database (67, 68), which in its basic form is a library of regular expressions. The profile syntax enables the emulation of most common motif search techniques, such as direct searching for PROSITE patterns, searching for patterns without gaps (69), searching using the profile definition of Gribskov et al. (60), flexible pattern searches (70), searching using HMMs (66), and domain and fragment searches using the HMMER method (71).
2.6.1 PSI-BLAST - iterative searching

The most widely used algorithm of the BLAST suite is position-specific iterated BLAST (PSI-BLAST) (33), which exploits increased sensitivity offered by multiple alignments and derived profiles in an iterative fashion.

The program initially operates on a single query sequence by performing a conventional gapped BLAST search (see above). Low-complexity regions of the query sequence (such as coiled-coil or transmembrane regions (71), which are widespread) are filtered and masked out, as they are not likely to convey a specific signal for recognizing a particular protein family. The PSI-BLAST program searches the database with the filtered query sequence, selects the significant hits, and then aligns these with each other.

From the aligned 'first-generation' hits, PSI-BLAST then creates a profile that reflects regions conserved across the aligned proteins. This profile is known as a position-specific scoring matrix (PSSM): in effect, it is a way of giving extra weight to those parts of the protein that are most characteristic of all members of the family. The PSSM is then used to rescan the database in a second round aimed at finding new sequences. These new sequences will be ones that had too little similarity to the original query sequence to be found in the first round, but that do have significant similarity when the PSSM is used to concentrate on the conserved motifs.

This process can be repeated and, at each iteration, the 'profile' (PSSM) of the group of proteins is refined, making the next round more sensitive to more distantly related family members. This continues until the user decides to stop or until the search has converged (i.e. further iterations produce no additional hits). The main web server for PSI-BLAST is located at http://www.ncbi.nlm.nih.gov/blast/, but a large number of mirror sites exist.

The web server enables the user to specify at each iteration round which sequences should be included in the profile, whilst by default all sequences are included that score beyond a user-set E value. However, the user needs to ask for each subsequent iteration. An alternative to the PSI-BLAST web server is a stand-alone version of the program, downloadable from the aforementioned URL, which allows the user to specify beforehand the desired number of iterations.

An important limitation of the original PSI-BLAST program is that the statistics do not take compositional biases into account. This was addressed in an update of the PSI-BLAST technique (72). Biased amino acid compositions could confuse the original algorithm and lead to a build-up of errors in subsequent iterations. This is significant for cross-genome comparisons, as many large overall compositional differences exist among individual genomes. Another matter of debate is the way in which the PSSM is generated. It is clear that erroneous alignments are likely to drive the engine into inclusion of false positives.

In Protocol 4, we will use PSI-BLAST in an iterative search, again starting from the mouse cytochrome c oxidase (subunit IV).

Protocol 4
A search for divergent family members using PSI-BLAST

1. Open the BLAST home page at http://www.ncbi.nlm.nih.gov/blast/.
2. Click on protein blast and then select PSI-BLAST (below under Program selection).
3. Paste the cytochrome c oxidase sequence from M. musculus in FASTA format (see Protocol 1) into the Enter Query Sequence box. Leave all other settings at their default values, click BLAST, and then retrieve your results as in Protocol 3.
4. The output of PSI-BLAST is quite similar to that of normal BLASTP except that the E value cut-off point is now set to 0.00. A dividing line in the hit list clearly shows which sequences fall below this new E value cut-off point. This cut-off value is chosen to ensure that only reasonably significant hits to the query protein are used in building the profile for the next iteration.
5. All of the hits above the cut-off line will be marked 'NEW', as they were found in this iteration of the search.
6. Now click on the button Run PSI-Blast iteration 2 below the list of 'NEW' hits. The program will run again, now using a query profile (which holds information for all selected sequences above the cut-off value in the preceding iteration) instead of the original query sequence.
7. The result (see Fig. 4, also available in the color section) shows a number of newly identified putative homologs (indicated by 'NEW' to the left of the hit; those hits from the previous round are marked with a green dot).
8. Note that the E values for some of the earlier sequences (those found in the first round) are much lower (more significant) than before. This is because in the refined search these sequences were found to conform more closely to the 'profile' of the family (the PSSM), even if they had only modest similarity to the original query sequence. For instance, a 'hypothetical protein' hit had an E value of 0.8 in the original search (and was below the cut-off line), but now has an E value of 6e-9 in this second iteration: it matches the family profile well, even though it matched the original query protein only loosely.
9. In some cases, proteins found in the first search will be rejected (falling below the cut-off line) in later iterations. This normally indicates that the protein matched the query protein overall to a reasonable degree, but did not conform well to the specific profile of the family of related proteins.
10. You can continue this process, running further iterations. The advantage is that more distant homologs can be found, but there is also a higher chance that more unrelated sequences will end up above the E cut-off value. Typically, PSI-BLAST runs are iterated three to ten times. In this example (and searching the database as at the time of writing), no new sequences were found after the seventh iteration.
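
The iterate-until-convergence logic that Protocol 4 walks through by hand can be summarized in a short Python sketch. This is only an outline of the control flow, not PSI-BLAST itself: `iterative_search`, `toy_search`, `toy_profile`, the sequence names, and the default inclusion threshold of 0.005 are hypothetical, and the database search and profile construction are replaced by stand-in functions with made-up E values (the 0.8 to 6e-9 change mirrors the example in step 8).

```python
def iterative_search(query, search, build_profile,
                     inclusion_evalue=0.005, max_iterations=10):
    """Repeat a profile search until no new hit passes the E value cut-off.

    search(model) must return a dict mapping hit identifiers to E values;
    build_profile(query, hit_ids) must return the model for the next round.
    Both are supplied by the caller; the cut-off default is an assumption.
    """
    included = set()
    for iteration in range(1, max_iterations + 1):
        # Round 1 uses a model built from the query alone (no hits yet).
        model = build_profile(query, included)
        hits = search(model)
        passing = {hit for hit, evalue in hits.items()
                   if evalue <= inclusion_evalue}
        if not passing - included:
            # Converged: this round produced no additional hits.
            return passing, iteration
        # Hits that fall below the line in a later round are dropped (step 9).
        included = passing
    return included, max_iterations


# Stand-in search results, keyed by the set of sequences in the profile.
ROUNDS = {
    frozenset(): {"seqA": 1e-40, "seqB": 2e-4, "seqC": 0.8},
    frozenset({"seqA", "seqB"}): {"seqA": 1e-45, "seqB": 1e-20, "seqC": 6e-9},
    frozenset({"seqA", "seqB", "seqC"}): {"seqA": 1e-45, "seqB": 1e-22,
                                           "seqC": 1e-12},
}


def toy_search(model):
    """Look up made-up E values for the current model."""
    return ROUNDS[model]


def toy_profile(query, hit_ids):
    """Represent the 'profile' simply as the set of included hits."""
    return frozenset(hit_ids)


members, rounds = iterative_search("query", toy_search, toy_profile)
```

Here `seqC` is missed in the first round, enters once the profile built from `seqA` and `seqB` is used, and the third round adds nothing new, so the loop stops. The `max_iterations` bound guards against searches that keep drifting instead of converging.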
Figure 4. PSI-BLAST output (see page xix for color version).
Part of the results after the second PSI-BLAST iteration are shown. The output format is essentially the same as for BLASTP (see Fig. 2), but new family members found in this iteration are marked 'NEW'. Family members found in the previous round are indicated by green dots. Sequences below the horizontal dividing line have too high an E value; only those above the line will be used in compiling the PSSM (or 'profile') of the protein family for the next iteration of the search.

2.6.2 Profile-profile alignment

The previous sections described how a query multiple sequence alignment, based on a given query sequence and a number of putative homologs, can confer an enhanced signal for recognizing distantly evolved members of a given family. Over the last few years, further improvements to the alignment of distant sequences have been achieved using several approaches. As a first improvement, the evolutionary model describing the relationship of a set of sequences can be readjusted to fit the sequence set rather than using a pre-set generic model incorporated in a single-residue exchange matrix such as the PAM or BLOSUM series. Recently, Yu et al. (16) showed that the use of organism-specific or alignment-set-specific background frequencies for contextual readjustment of the standard amino acid exchange weights provides a more sensitive and biologically accurate way of aligning sequences. Alternatively, structural or homologous sequence information can be incorporated into the alignment process to help identify the distant relationships between sequences. The benefits of using related sequence information have been shown in numerous profile-profile alignment methods that apply different profile-scoring schemes (73-89). Many of these scoring schemes have been assessed in recent comparison studies and have shown little significant difference in their respective performances (90, 91). However, most of the profile-profile alignment approaches to date have been used mainly for sequence database searching (local pairwise alignment), where a popular application has been to use profile-profile comparisons for aligning a profile derived from a query multiple alignment with a number of profiles describing a collection of different protein families.

A direct application of the profile-profile alignment technique is implemented in PRALINE-PSI (92), a multiple alignment technique that relies on constructing a profile for each of the query sequences using the PSI-BLAST method. Pre-alignment profiles (pre-profiles) are generated using each sequence in a set as a PSI-BLAST (33, 46) query. The resulting PSI-BLAST local alignments are filtered for redundancy and converted to PRALINE pre-profiles, which replace the single sequence input that would otherwise be used for the alignment (see 93-95 for further details). The increased sensitivity of the PRALINE-PSI method in detecting similarities becomes most evident in aligning distant sequence pairs (or sequence-profile and profile-profile pairs in multiple sequence alignment).

3.1 Iterative homology searching problems

Iterative sequence search methods such as PSI-BLAST can be a powerful way of finding distant homologies, but often fail when querying a multi-domain protein or a protein with regions of compositional bias. For example, common conserved protein domains such as the tyrosine kinase domain can obscure weak but relevant matches to other domain types (96), whilst sequences containing low-complexity regions, such as coiled coils and transmembrane regions, can cause an explosion of the search rather than convergence due to the absence of any strong sequence signals. Conversely, some searches may lead to premature convergence; this occurs when the PSSM is too strict, only allowing matches to very similar proteins, i.e. sequences with the same domain organization as the query are detected but no homologs with different domain combinations.

An additional problem with iterative searches is 'matrix migration' (also referred to as 'profile wander'), which occurs when the search strategy is too permissive so that information from false-positive sequences is included in the profile, resulting in the possible loss of truly homologous sequences found in earlier rounds. A further loss of information can be incurred with PSI-BLAST, as PSI-BLAST PSSMs are trimmed to use only the highest-scoring region in a search, ignoring less-conserved regions. The alternative database search method QUEST (97) alleviates these problems by using an independent multiple-alignment program to generate a true multiple sequence alignment between iterations, and not a 'master-slave' alignment, thereby improving the quality of the PSSM. The QUEST method also removes any sequences that are deemed to be too divergent to be reliable family members in order not to 'pollute' the PSSM, which leads to increased search capabilities.
66 CHAPTER 3: SEQUENCE SIMILARITY SEARCHES REFERENCES 67

3.2 Post-processing of homology searches true hits found relative to the total number of sequences in the database that
are homologous to the query. The sensitivity reflects the extent to which the
A few methods exist to predict domain boundaries through post-processing BLAST method is able to identify distantly related sequences. In many studies, this
searches. The BALLAST method can be used to visualize conservation profiles for a measure is also referred to as coverage. The specificity (or selectivity) is defined
query sequence based on sequence searching (98, 99), although the method does as TN/(FP + TN), where FP is the number of false positives, which denotes the
not delineate domain boundaries. Another technique is the PASS (prediction of fraction of entries correctly excluded as hits and hence measures the avoidance
autonomous folding units based on sequence similarities) method, which uses of unrelated hits. Yet another widely used measure is the positive predictive value,
a simple and non iterative method of domain delineation based on the stacking defined as TP/(TP + FP), which measures the proportion of true homologs within
of sequences from a gapped BLAST search onto the query sequence (100). Regions all sequences designated by the search tool as related. In practical database
along a query sequence often have a varying number of matching sequences searches, there is a trade-off between sensitivity and specificity: the more the
from the BLAST data leading to abrupt increases and decreases in sequence numbers P or E values are relaxed to allow more distantly related sequences to be found,
along the query. The PASS method is based on a single BLAST run and does not use the more likely it becomes that chance hits infiltrate the search. Moreover, even
iteration to include information from distant homologs. Furthermore, the current if a statistically highly significant similarity is encountered, problems remain. For
release of the PRODOM domain database (29) is created using the \<1KD0\<12 method example, if high similarity is found over only a portion of the sequences, the
(101). which performs PSI-BLAST searches starting with the smallest sequence in the sequences may each contain multiple domains and share a single homologous
database as a query, which is supposed to represent a single domain. All domain domain only (see above), so that only an aspect of the overall function might
sequences identified are removed from the database, after which the process is be inferred. in iterative homology searches. protein sequences containing more
iterated with the remaining subsequences and terminated when the database than one structural domain can be problematic in that they cause the search
becomes empty. The \<1K00\<12 method is an iterative protocol but does not address to terminate prematurely or lead to an 'explosion' of common domains (102).
the aforementioned problems connected to PSSM-based iterative searches. For example, the occurrence in the query sequence of a common and conserved
The DOMAINATION method (102) assigns domain boundaries by applying PSI-RLAST in protein domain such as the tyrosine kinase domain, which is then hit many
a repetitive fashion. The distribution of the aligned positions of Nand C termini times in the database, can obscure weaker but also relevant matches to other
from PSI-BLAST local sequence alignments is used to identify potential domain domain types {102l, particularly when the E value is set to include only strong
boundaries. DOMAINATION incorporates an iterative strategy for chopping and joining domains and domain segments based on the loss and gain of domains. This allows the recognition of both continuous and discontinuous domains. For each domain inferred from the corresponding PSI-BLAST local alignments, profiles are created by filtering redundant sequences and subsequent multiple sequence alignments. Each profile filtered in this way is then used in further iterative database searches using PSI-BLAST. All profiles are required to contain the original query sequence at each iteration of PSI-BLAST to avoid profile wander, but parameters are set to ensure that the profiles are divergent enough to capture distant sequence fragments. The whole process of iterative PSI-BLAST searches is repeated until domain assignment ends and no new homologs are found. DOMAINATION can successfully assign domain boundaries within a given query sequence, whilst the added information gleaned from the putative domains delineated during the parallel and iterated searches leads to a search performance enhanced by 15 percentage points over a wide range of E-values compared with stand-alone PSI-BLAST searches.

3.3 Evaluating sequence database searches

A few useful measures are commonly used to gauge the accuracy of sequence database search methods over an annotated nonredundant database. The sensitivity of a search is defined as TP/(TP + FN), where TP is the number of true positives and FN the number of false negatives, which reflects the fraction of [...]

[...] hits. Conversely, when multi-domain sequences with the same sequential order of domains as in the query sequence are found initially during an iterative search, homologs with different domain combinations might well be missed due to early convergence of the search. To reduce the chance of including spurious hits, some database search engines, such as PSI-BLAST (33), scan query sequences for the presence of so-called low-complexity regions. These are then excluded from the alignment to limit the inclusion of false-positive hits due to database sequence matches with these regions. However, the occurrence of database sequences with low-complexity regions can still cause an explosion of false positives in iterative homology searches (102). Despite recent improvements in search techniques, complications such as the above illustrate that automatic biological evaluation of homology searches in genomic pipelines remains elusive for biologically intricate relationships.

1. Teichmann SA, Rison SC, Thornton JM, Riley M, Gough J & Chothia C (2001) J. Mol. Biol. 311, 693-708.
2. May AC (2001) Nature, 413, 453.
3. Doolittle RF (1981) Science, 214, 149-159.
4. Pascarella S & Argos P (1992) Protein Eng. 5, 121-137.
* 5. Abagyan RA & Batalov S (1997) J. Mol. Biol. 273, 355-368. - An important and comprehensive study of the relationship between sequence similarity and homology.
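As an illustration of the sensitivity measure defined in Section 3.3, TP/(TP + FN), the following minimal Python sketch counts true positives and false negatives by comparing a list of reported hits against a list of annotated homologs. The function names (`count_tp_fn`, `sensitivity`) and the sequence identifiers are hypothetical and chosen only for this example; they are not taken from the text or from any particular search tool.

```python
from typing import Iterable, Tuple


def count_tp_fn(reported_hits: Iterable[str],
                true_homologs: Iterable[str]) -> Tuple[int, int]:
    """Count true positives and false negatives for one database search.

    reported_hits: identifiers returned by the search (hypothetical input).
    true_homologs: identifiers annotated as homologs of the query.
    """
    hits = set(reported_hits)
    homologs = set(true_homologs)
    true_positives = len(hits & homologs)    # homologs that were found
    false_negatives = len(homologs - hits)   # homologs that were missed
    return true_positives, false_negatives


def sensitivity(true_positives: int, false_negatives: int) -> float:
    """Return TP / (TP + FN), as defined in the text."""
    total = true_positives + false_negatives
    if total == 0:
        raise ValueError("sensitivity is undefined when TP + FN is zero")
    return true_positives / total


if __name__ == "__main__":
    # Made-up identifiers: two of the four annotated homologs are recovered.
    hits = ["seqA", "seqB", "seqX"]
    homologs = ["seqA", "seqB", "seqC", "seqD"]
    tp, fn = count_tp_fn(hits, homologs)
    print(sensitivity(tp, fn))  # 2 / (2 + 2) = 0.5
```

Note that "seqX", a reported hit that is not an annotated homolog, is a false positive and does not enter the sensitivity calculation.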
to see the grief with which you are lamenting?
I leave my flock over there, waiting for me,
for as long as the bright sun is not yet hiding itself
I may well stay with you and pass the time.
Tell me[1224] your woe, shepherd, for a woe that is told
is borne at less cost than one kept silent,
and sadness in the end takes its leave.
I would tell my woe, but in the telling
it grows upon me, and more so in remembering
how vainly, alas, I am weeping over it.
I see my life prolonged in spite of me,
there is no consoling my sad heart,
and I see an unwonted ill drawing near me.
She from whom I hoped for remedy came to take it from me;
yet I never truly hoped for it, for had I hoped,
she might with reason have declined to give it me.
My passion went on soliciting it
by means not importunate, but lawful,
while cruel love went on thwarting it.
My sad thoughts, most anxious,
turning from one side to another,
shunning in everything whatever was unlawful,
begged of Diana that, if some remedy
could be given for so great an ill, without causing any to you,
it be given, so that a wretch might sustain himself.
What then would you have done, say, if instead of giving it to you
she took it from you? Alas, in thinking on it
I would rather silence my woe and not recount it to you.
But afterwards, Sireno, imagining it,
I invoke a most beautiful shepherdess,
and so, at my own cost, I get through it in the end.

SIRENO

Syluano mine, a most rare affection,
a truth that blinds as soon as it is seen,
a most excellent sense and discretion,
with a sweet speech that, once heard,
moves the hard rocks and softens them:
what would a lover feel on losing her?
I look upon my little ewes, and on seeing them I think
how many times I saw her grazing them
and gathering them in with her own.
And how many times did I come upon her leading them
to the river at the noonday rest, where, sitting down,
she would count them there with great care?
Then, if she was alone, unbinding her headdress,
you would have seen the bright sun most envious
of her hair, and she there combing it.
And then, O Syluano, my dearest friend,
how many times, on suddenly meeting me,
did that most beautiful face take fire.
And with what grace she would ask me
why I had tarried, and even scold me,
and, if that vexed me, caress me.
And how many days did I find her waiting for me
at this clear spring, while I sought her
through that thick grove, consuming myself.
How, on finding her, we forgot
every care of ewes and lambs,
she speaking with me, and I gazing at her.
At other times, Syluano, we tuned
the pipe and rebec on which we played,
and then sang my verses there.
Afterwards we made ready arrow and bow,
and at other times the net, and with her following me
we never returned to our village without game.
Thus Fortune went on beguiling me,
keeping me for a greater ill,
which will have no end save in my dying.

SYLUANO

Sireno, cruel love, which never tired
of wounding me, does not keep me from remembering
so much ill, and I die in remembering it.
I looked on Diana, and at once I saw
my pleasure and contentment cut short in merely seeing her,
and, in spite of me, I saw my life prolonged.
O how many times did I find her in losing her,
and how many times lose her in finding her,
while I kept silent, suffered, and died in serving her[1225]?
I lost my life when, meeting her,
I gazed on those eyes which, most wrathful,
she turned against me as soon as I spoke to her.
But when she loosed and combed
her most beautiful hair, not perceiving me,
my ills turned most sweet.
And cruel Diana, on recognizing me,
turned like a wild beast that, bristling,
charges the lion, and undid me.
For a time hope, thus mocking me,
sustained my heart and beguiled it;
but afterwards that same heart, undeceiving me,
mocked at hoping, and went on losing it.

Not long after the shepherds had ended their sad song, they saw a shepherdess come out from the grove that stood beside the river, playing on a pipe and singing with as much grace and sweetness as sadness, which hid a great part of her beauty (and that was not small); and Sireno, as one who had not grazed his flock in that valley for a long time, asked who she was[1226]. Syluano answered him: "This is a fair shepherdess who for a few days now has been grazing her flock in these meadows, much aggrieved by love, and, as she says, with good reason, though others would have it that she has long been trifling with disillusion." "Is it perchance in her own hands to undeceive herself?" said Sireno. "Yes," answered Syluano, "for I cannot believe there is a woman alive who loves so much that the force of love keeps her from understanding whether she is loved or not."

"I am of the contrary opinion." "Of the contrary?" said Syluano. "Then you will not go boasting of it, for it costs you dear to have trusted Diana's words; yet I do not blame you, for just as there is no one her beauty does not conquer, so there will be no one her words do not deceive." "How can you know that, since she never deceived you in word or in deed?" "It is true," said Syluano, "that I was always undeceived by her; but I would dare swear, from what has happened since, that she never undeceived me except to deceive you. But let us leave this and listen to this shepherdess, who is a great friend of Diana's and, from what they tell me of her grace and discretion, well deserves to be heard." At this moment the fair shepherdess was drawing near the spring, singing this sonnet.

Sonnet.

I have seen greater contentment in my eyes,
and I have seen my soul more glad;
unhappy she who wearies where one day
the sight of her gave delight.
But since this Fortune in a moment
cuts joy off at the root,
the same distance that lies between an "is" and a "used to be"
lies between a great pleasure and a great torment.
Go and contend with times, with changes,
contend with wayward motions:
you will see how free your heart is left.
I shall put my trust in hopes
when I have chance subdued
and a nail driven into the axle of the wheel.

After the shepherdess had finished singing, she came straight to the spring where the shepherds were, and while she was coming Syluano said, half laughing: "Do but heed those words, and accept as witness the burning sigh with which she ended her song." "Do not doubt it," answered Sireno, "for no sooner should I love her than, though it grieve me, I should believe everything she chose to tell me." While they were thus engaged Seluagia arrived, and when she recognized the shepherds she greeted them very courteously, saying: "What are you doing, O unloved shepherds, in this green and delightful meadow?" "You do well, fair Seluagia, to ask what we are doing," said Syluano. "We do so little compared with what we ought to do that we shall never be able to accomplish anything that love makes us desire." "Do not wonder at that," said Seluagia, "for there are things which, before they are accomplished, make an end of whoever desires them." Syluano answered: "At least, if a man places his repose in a woman's hands, his life will end before anything is concluded with her by which he might hope to receive it." "Unhappy these women," said Seluagia, "who are so ill-treated by your words." "Unhappier these men," answered Syluano, "who are so much worse treated by your deeds. Can there be anything baser or of less worth than that, for the most trifling thing in the world, you women should forget the one you have loved most? Let a man absent himself for a day from her he loves, and on his return he will have to court her afresh." "Two things I note in what you say," said Seluagia, "that truly amaze me. The first is that I find on your tongue the reverse of what I had always understood of your character; for when I heard tell of your loves, I imagined that you were a Phoenix in them, and that none of all who have loved well until today could reach the extreme you have attained in loving a shepherdess I know: reasons quite sufficient not to speak ill of women, were malice not greater than love. The second is that you speak of a thing you do not understand; for one who never had experience of being forgotten to speak of forgetting is to be put down to folly rather than anything else.
If Diana never had you in mind, how can you complain of her forgetting you?" "To both things I mean to answer you," said Syluano, "if it does not weary you to hear me. May it please God that I never see myself more content than I am now if anyone, whatever example he may bring me, can overstate the power which that ungrateful and disloyal shepherdess (whom you know, and whom I wish I did not) holds over my soul; but the greater the love I bear her, the more it grieves me that there should be anything in her that can be reproved. For here is Sireno, who was more favoured by Diana than all men in the world have been by their ladies, and she has forgotten him in the way we all know. As for your saying that I cannot speak of an ill of which I have no experience, a fine thing it would be if a physician could not treat an illness he had not had himself. And on one other point, Seluagia, I wish to satisfy you: do not think I wish women ill, for there is nothing in life I more desire to serve; but in payment for loving well I am treated ill, and hence it is that I speak ill of her whose glory it is to cause it me." Sireno, who had been silent for some time, said to Seluagia: "Shepherdess, if you would hear me, you would not lay blame on my rival (or, to speak more properly, my dear friend Syluano). Tell me, why are you women so changeable that in an instant you cast a shepherd down from the height of his fortune to the depth of his misery? But do you know to what I attribute it? To your having no true knowledge of what you have in hand: you deal in love, yet are not capable of understanding it; see, then, how you should know how to come to terms with it." "I tell you, Sireno," said Seluagia, "that the cause for which we shepherdesses forget is none other than the very one for which we are forgotten by you. They are things that love makes and unmakes, things that times and places stir up or[1227] put to silence; but not through any defect in the understanding of women, of whom there have been countless in the world who could have taught men how to live, and would even have taught them how to love, were love a thing that could be taught. Yet for all this, I believe there is no lower estate in life than that of women: for if they speak kindly to you, you think they are dying of love; if they do not speak to you, you believe they do it out of haughtiness and caprice; if their reserve does not suit your purpose, you take it for hypocrisy; they show no ease of manner that does not seem to you excessive; if they are silent, you say they are foolish; if they speak, that they are tiresome and that no one can bear them; if they love you beyond everything in the world, you believe they do it from wickedness; if they forget you and withdraw from occasions of being defamed, you say it is because they are inconstant and unsteady of purpose. So that whether a woman seems good or bad to you depends on nothing more than her managing never to depart from what your inclination demands." "Fair Seluagia," said Sireno, "if all women had that understanding and quickness of wit, I well believe they would never give us occasion to complain of their neglect. But so that we may know what reason you have to feel wronged by love, and so may God give you the comfort that so grave an ill requires, tell us the story of your loves and all that has befallen you in them until now (for of ours you know more than we could tell you), that we may see whether what you have undergone in love gives you licence to speak of it so freely. For truly your words suggest that you are more experienced in it than any other woman has ever been." Seluagia answered him: "If I am not the most experienced, Sireno, I must be the worst treated that anyone ever thought to be, and the one who can with most reason complain of its wayward effects, which is quite sufficient to entitle me to speak of it. And so that you may understand, from what I have been through, what I feel about this devilish passion, commit your misfortunes for a little while to the hands of silence, and I will tell you the greatest you have ever heard.
In the valiant and impregnable kingdom of the Lusitanians there are two mighty rivers which, weary of watering the greater part of our Spain, enter the Ocean sea not very far from one another; between them lie many very ancient settlements, because the fertility of the land is so great that there is none in the world to equal it. Life in this province is so remote and removed from things that might disquiet the mind that, except when Venus chooses to show her power by the hands of her blind son, no one concerns himself with more than sustaining a quiet life, with a sufficient moderation in the things needed to live it. The wits of the men are fitted for passing life with ample contentment, and the beauty of the women for taking it from whoever lives most secure. There are many houses among the shady woods and delightful valleys; their lands are supplied with dew from sovereign heaven and tilled by the industry of those who dwell there, and gracious summer takes care to offer them the fruit of their labour and to succour them in the needs of human life. I lived in a village beside the mighty Duero (one of the two rivers I have mentioned), where stands the most sumptuous temple of the goddess Minerva, which at certain times of the year is visited by all or most of the shepherdesses and shepherds who live in that province. They begin a day before the celebrated festival to solemnize it, the shepherdesses and nymphs with very sweet songs and hymns, and the shepherds with contests of running, leaping, wrestling, and throwing the bar, setting as a prize for the victor, some a garland of green ivy, some a sweet pipe or flute, or a crook of knotty ash, and other things that shepherds prize. When the day of the festival drew near, I and some other shepherdesses, friends of mine, laying aside our humble working clothes and putting on the best we had, set out the day before the festival, resolved to keep vigil that night in the temple, as we used to do in other years. Being then, as I say, in the company of these friends of mine, we saw a company of fair shepherdesses enter by the door, accompanied by some shepherds, who, leaving them inside and having made their due prayer, went out to the fair valley; for the rule of that province was that no shepherd might enter the temple except to pay his devotion and must then go out again at once, until the following day, when all might enter to take part in the ceremonies and sacrifices then performed. The reason for this was that the shepherdesses and nymphs should remain alone, with no occasion to attend to anything but celebrating the festival and making merry with one another, as they had been wont to do for many years, while the shepherds did so outside the temple in a green meadow nearby, by the light of nocturnal Diana.
When the shepherdesses I mention had entered the sumptuous temple, after saying their prayers and presenting their offerings before the altar, they sat down beside us. And my fortune willed that one of them should sit beside me, so that I should be unfortunate for as many days as the memory of her should last me[1228]. The shepherdesses came disguised, their faces covered with white veils fastened to their little hats of fine straw, most subtly worked, with many trimmings of the same so well made and interwoven that gold would not have surpassed them. As I was looking at the one who had sat down beside me, I saw that she did not take her eyes from mine, and when I looked at her she lowered hers, feigning that she wished to see me without my noticing. I desired extremely to know who she was, so that, if she spoke to me, I should not fall into some error for not knowing her. And still, whenever I was off my guard, the shepherdess did not take her eyes off me, so much so that a thousand times I was on the point of speaking to her[1229], enamoured of the beautiful eyes which were all she had uncovered. While I was thus as attentive as could be, she drew out the most beautiful and delicate hand I have ever seen since, and taking mine, gazed at it for a while. I, who was more in love with her than could be told, said to her: "Fair and gracious shepherdess, it is not that hand alone that is now ready to serve you, but also the heart and mind of her whose hand it is." Ysmenia (for so she was called who was the cause of all the disquiet of my thoughts), having already devised the trick you shall presently hear, answered me very softly, so that no one should hear: "Gracious shepherdess, I am so wholly yours that as such I ventured to do what I did; I beg you not to be scandalized, for on seeing your fair face I had no more power over myself." Then, well pleased, I drew closer to her and said, half laughing: "How can it be, shepherdess, that you, being so beautiful, should fall in love with another who falls so far short of being so, and a woman like yourself besides?" "Ah, shepherdess," she answered, "this is the love that least often comes to an end, and the one the fates most readily allow to endure, without the turns of fortune or the changes of time restraining it." I then replied: "If the nature of my condition had taught me how to answer such discreet words, the desire I have to serve you would not hinder me; but believe me, fair shepherdess, that death itself will not suffice to take from me the purpose of being yours." After this the embraces were so many, and the loving words we spoke to one another (and on my part so sincere), that we paid no heed to the shepherdesses' songs, nor watched the nymphs' dances or the other festivities held in the temple[1230]. Meanwhile I was pressing Ysmenia to tell me her name and take off her veil, which she evaded with great dissimulation, changing the subject with the utmost cunning. But when midnight had passed, and I had the greatest desire in the world to see her face and know her name and where she was from, I began to complain of her and to say that it was not possible that the love she bore me was as great as her words declared, since, though I had told her my name, she concealed hers from me; and how could I live, loving her as I did, if I did not know whom I loved, or where I was to have news of my love? And I said other things so earnestly that my tears helped to move the heart of the wily Ysmenia, so that she rose and, taking me by the hand, drew me aside to a place where there was no one to disturb us, and began to say these words to me (feigning that they came from her soul).
"Fair shepherdess, born to disquiet a spirit that until now has lived as free as could be, who could refuse to tell you what you ask, having made you mistress of his liberty? Wretched man that I am, the change of dress has you deceived, though the deception now turns to my own harm. The veil you wish me to remove, see, here I remove it; as for telling you my name, it matters little to you, since, even against my will, you will see me more often than you will be able to bear." Saying this and removing the veil, my eyes beheld a face which, though somewhat manly in aspect, was of a beauty so great that it astonished me. And Ysmenia, continuing her speech, said: "And so that you may know, shepherdess, the harm your beauty has done me, and that the words that passed between us as if in jest are in earnest, know that I am a man and not a woman, as you thought before. These shepherdesses you see here, who are all my kinswomen, dressed me in this manner for a jest, since otherwise I could not have remained in the temple, because of the rule that is kept in this matter." When I had understood what Ysmenia had told me, and saw, as I say, that her face had not the softness, nor her eyes the repose, that we maidens for the most part have, I believed what she told me to be true, and was so[1231] beside myself that I did not know what to answer. Still I contemplated that extreme beauty and weighed the words she spoke to me with such dissimulation (for no one ever knew how to make the feigned seem true as did that wily and cruel shepherdess). I found myself in that hour so captive to her love, and so happy to understand that she was captive to mine, that I could not express it; and although until that moment I had had no experience of such a passion (reason enough for not knowing how to speak of it), still
