Computer Manipulation of DNA and Protein Sequences
The ability to determine DNA sequences is now commonplace in many molecular biology laboratories. As the amount of DNA sequence data available to researchers has increased, the use of computers to manipulate, compare, and analyze this data has grown accordingly.

Growth rates of DNA databases have entered an exponential phase as determination of DNA sequence becomes more automated and less expensive (see Fig. 19.2.1). The cost of a computer system appropriate for an individual laboratory has also decreased, allowing essentially all researchers Internet access to electronic databases and computerized sequence analysis (see UNIT 19.1).

This unit outlines a variety of methods by which DNA sequences can be manipulated by computers. To begin analysis of a DNA sequence, the information must be in a form that computers can understand. Generally this requires that the DNA sequence information be contained in a file (a file on the computer is simply a collection of information in computer-readable form). Procedures for entering sequence data into the computer and assembling raw sequence data into a contiguous sequence are described first. This is followed by a description of methods of analyzing and manipulating sequences (e.g., verifying sequences, constructing restriction maps, designing oligonucleotides, identifying protein-coding regions, and predicting secondary structures). This unit also provides information on the large amount of software available for sequence analysis.

In addition to the information concerning computer manipulations of DNA sequences in this unit, related units can be found in Chapter 19; these units explain how to use the worldwide Internet computer network to access information from or submit information to the major sequence databases and other electronic resources, and how to conduct DNA and protein homology searches. The chapter also includes general information about electronic information handling.

Many of the electronic resources described in this unit and in Chapter 19 are available at no cost on the Internet. The Appendix to this unit lists some of the commercial software, shareware (whereby the author asks for a donation from satisfied users), and free software (public domain software as well as copyrighted software that is distributed without charge by the author) related to DNA sequence manipulation. Additional software related to carrying out DNA and protein homology searches is described in Chapter 19.

The discussion of computer-mediated DNA sequence analysis in this unit is intended to be a generalized overview; specific instructions for using particular sequence analysis software packages are beyond the scope of this unit. However, recommendations of software packages for particular tasks have been included wherever possible. Our goal is to serve as a starting point for researchers interested in utilizing the tremendous sequencing resources available to the computer-knowledgeable molecular biology laboratory.

NOTE: The Appendix to this unit, located at the end of the unit but before the references, lists the addresses and phone numbers of all software and hardware vendors discussed below. Additional information about relevant journals and databases is also provided in the Appendix.

SEQUENCE DATA ENTRY
For small sequencing projects, DNA sequence data can easily be entered and manipulated using a word processor or text editor on a microcomputer such as an Apple Macintosh or IBM-compatible. As the size of a project increases, however, a specialized editor for data entry becomes more and more desirable. All software packages for DNA sequence analysis provide some form of sequence editor, a program that acts like a word processor for sequence data. Such an editor may be no more than a small window into which sequence data is entered or it may contain specialized features that aid in proofreading and facilitate input.

Currently there is no standard format for DNA or peptide sequence files. In the early days of sequence analysis, most software programs had their own unique formats for storing sequence information. Although these format differences still exist today (see Fig. 7.7.1), most available software can import several different sequence formats. Thus, a sequence file can be converted into the default format for a particular program by importing (reading) the original sequence file into the program and then
DNA Sequencing
Contributed by J. Michael Cherry 7.7.1
Current Protocols in Molecular Biology (1995) 7.7.1-7.7.23
Copyright © 2000 by John Wiley & Sons, Inc. Supplement 30
A) EMBL:
ID hummycc.primate
DE hummycc.primate
SQ 100 BP
AGCTTGTTTGGCCGTTTTAGGGTTTGTTGGAATTTTTTTTTCGTCTATGT
ACTTGTGAATTATTTCACGTTTGCCATTACCGGTTCTCCATAGGGTGATG
//
B) GenBank:
LOCUS HUMMYCC 10996 bp ds-DNA PRI 25-JUL-1994
DEFINITION Human (Lawn) c-myc proto-oncogene, complete coding ...
ACCESSION J00120 K01908 M23541 V00501 X00364
KEYWORDS Alu repeat; c-myc proto-oncogene; myc oncogene; ...
SOURCE Human DNA (genomic library of Lawn et al.), clones ...
ORGANISM Homo sapiens ...
REFERENCE 1 (bases 3507 to 7559)
AUTHORS Colby, W.W., Chen, E.Y., Smith, D.H., and Levinson, A.D.
TITLE Identification and nucleotide sequence of a human ...
JOURNAL Nature 301 (5902), 722-725 (1983)
MEDLINE 83141777
COMMENT The myc gene is the cellular homologue of the ...
FEATURES Location/Qualifiers
BASE COUNT 2747 a 2723 c 2733 g 2793 t
ORIGIN 198 bp upstream of Sau96A site, on chrmosome 8 (q24)
1 agcttgtttg gccgttttag ggtttgttgg aatttttttt tcgtc ...
//
C) GCG:
Human (Lawn) c-myc proto-oncogene, complete coding sequence and flanks.
hummycc.primate Length: 100 Sun, Mar 17, 1991 9:48 PM Check: 6864 ..
1 AGCTTGTTTG GCCGTTTTAG GGTTTGTTGG AATTTTTTTT TCGTCTATGT
51 ACTTGTGAAT TATTTCACGT TTGCCATTAC CGGTTCTCCA TAGGGTGATG
D) Intelligenetics:
; hummycc.primate, 100 bases.
;Human (Lawn) c-myc proto-oncogene, complete coding sequence and flanks.
;
hummycc.primate
AGCTTGTTTGGCCGTTTTAGGGTTTGTTGGAATTTTTTTTTCGTCTATGT
ACTTGTGAATTATTTCACGTTTGCCATTACCGGTTCTCCATAGGGTGATG1
E) NBRF:
>DL;hummycc.primate
hummycc.primate, 100 bases.
1 AGCTTGTTTG GCCGTTTTAG GGTTTGTTGG AATTTTTTTT TCGTCTATGT
51 ACTTGTGAAT TATTTCACGT TTGCCATTAC CGGTTCTCCA TAGGGTGATG*
Figure 7.7.1 Commonly used sequence file formats having specific defined elements and defining codes. (A) EMBL comment lines begin with two-letter codes: ID, short sequence name; DE, description; and SQ, sequence length. DNA or protein sequence follows; sequence end is denoted by two slashes (//) on a separate line. (B) GenBank comments precede the sequence and are separated from it by the code “ORIGIN.” Sequence end is denoted by two slashes on a separate line. The actual text of this entry has been abbreviated; see Fig. 19.2.3 for a more complete example of a GenBank file. (C) GCG comments precede the sequence and are separated from it by two dots (..). (D) Intelligenetics comment lines begin with semicolons (;). A single description line follows, and then the sequence begins on a separate line. Sequence end is denoted by a numeral one (1). (E) NBRF (also called PIR format) first line starts with four required characters: a greater-than sign (>); either “D” for DNA or “P” for protein; either “L” for linear or a “C” for circular; and a semicolon. The short sequence name follows on the same line. The next line is a description line. Sequence starts on a new line and its end is denoted by an asterisk (*). (F) DNA Strider Text is similar to the Intelligenetics format, but lacks description line. (G) FASTA (also called Pearson format) first line begins with a greater-than sign (>), followed by sequence name and a short description. Sequence data starts on a separate line. Note: Some formats (including GenBank, GCG, and NBRF) allow numbers to be included within the sequence for ease of reading (the numbers are ignored during sequence analysis).
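The format rules summarized in the caption to Figure 7.7.1 can be illustrated with a short script. The following is a minimal Python sketch, not the import routine of any package named in this unit: the function name extract_sequence, the format labels, and the 20-base demonstration records are all assumptions made for illustration, and only the separator and terminator rules stated in the caption are applied.

```python
import re

def _letters_only(chunk):
    """Drop position numbers, spaces, and terminator characters (1, *)."""
    return re.sub(r"[^A-Za-z]", "", chunk).upper()

def _index_of(lines, predicate):
    """Index of the first line satisfying predicate."""
    return next(i for i, line in enumerate(lines) if predicate(line))

def extract_sequence(text, fmt):
    """Return the bare sequence from one record (hypothetical helper)."""
    lines = text.strip().splitlines()
    if fmt in ("embl", "genbank"):
        # Sequence lies between the SQ / ORIGIN line and the // line.
        marker = "SQ" if fmt == "embl" else "ORIGIN"
        start = _index_of(lines, lambda l: l.startswith(marker))
        end = _index_of(lines, lambda l: l.strip() == "//")
        body = lines[start + 1:end]
    elif fmt == "gcg":
        # Everything after the line containing two dots is sequence.
        start = _index_of(lines, lambda l: ".." in l)
        body = lines[start + 1:]
    elif fmt == "intelligenetics":
        # Skip ';' comment lines, then the single name line.
        first = _index_of(lines, lambda l: not l.startswith(";"))
        body = lines[first + 1:]
    elif fmt == "nbrf":
        # Line 1 is the '>DL;name' header, line 2 the description.
        body = lines[2:]
    elif fmt == "fasta":
        # Line 1 is the '>name description' header.
        body = lines[1:]
    else:
        raise ValueError("unknown format: " + fmt)
    return _letters_only("".join(body))

# Made-up demonstration records, modeled on the layouts in Figure 7.7.1.
EXAMPLES = {
    "embl": "ID demo\nDE demo\nSQ 20 BP\nAGCTTGTTTGGCCGTTTTAG\n//",
    "genbank": "LOCUS DEMO 20 bp\nORIGIN\n 1 agcttgtttg gccgttttag\n//",
    "gcg": "Demo comment.\ndemo Length: 20 Check: 0 ..\n1 AGCTTGTTTG GCCGTTTTAG",
    "intelligenetics": "; demo, 20 bases.\n;\ndemo\nAGCTTGTTTGGCCGTTTTAG1",
    "nbrf": ">DL;demo\ndemo, 20 bases.\n1 AGCTTGTTTG GCCGTTTTAG*",
    "fasta": ">demo short description\nAGCTTGTTTGGCCGTTTTAG",
}
```

Calling extract_sequence(EXAMPLES["nbrf"], "nbrf") returns the same 20-base string as the other five records, which is the sense in which importing a file converts it to a common internal form. Real records may contain features this sketch does not handle.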
saving the file in the desired format. However, it is generally simplest to use the sequence editor or entry program provided with the software to be used for subsequent analysis.

Manual Entry Using Word Processors and Sequence Editors
Word processors. A DNA or peptide sequence can be entered into a computer using a word processor or text editor program, simply by creating a new document and then typing in the sequence data. This is sufficient to put sequence data in computer-readable format and is useful if no specialized sequence entry program is available or if the sequence file is to be transferred to a different computer for analysis.

It is imperative that DNA or peptide sequence documents be saved in “text” or ASCII format, because sequence analysis programs cannot translate the default files used by word processor programs; however, most packages provide an easy one-step method of reformatting text files containing DNA or peptide sequences into a format appropriate to the computer at hand. Saving documents as text may require a slightly different Save command or option, as specified in the word processor manual.

Sequence files often contain more information than just the DNA or peptide sequence data. Comments and reference information about the sequence and the history and dates of changes made to the sequence are commonly included. Such comments must be distinguishable by the program from the sequence data in order for the sequence analysis software to function properly. However, it is important to determine what sequence and comment-file formats are allowed by the particular analysis software being used. The most commonly used file formats are shown in Figure 7.7.1. Note that some (e.g., the Intelligenetics and DNA Strider text formats) use specific characters to identify the comments before the sequence file; other formats simply require a string of characters that is used to denote the end of the comments and the beginning of the sequence: the GenBank flatfile format, for example, uses a line beginning with “ORIGIN” and GCG uses “..” (two periods). The NBRF software package (also known as PIR) allows only a single line of comments. Some formats (e.g., NBRF and Intelligenetics) require a specific terminator character for the sequence entry.

It is a good idea to test the file format before investing a large amount of time typing in data with the word processor. This can be done by creating a small DNA sequence, then experimenting with the comment and sequence formats as well as the word processor’s Save option. The resulting file can then be checked in the analysis program.

Sequence editors. All analysis programs include some form of sequence editor. This is the most effective tool for entering DNA or peptide sequence data because it produces a sequence file in the correct format with the comments separate from the sequence as they should be. There are several free or relatively inexpensive programs available for Macintosh and IBM-compatible computers that simplify the task of entering sequence data. One example for the Macintosh is DNA Strider, which is built around a very easy-to-use sequence editor. It also makes restriction maps, finds open reading frames, and translates DNA sequences into amino acid sequences.

Recently, several programs have become available for microcomputers that “speak” the sequence as it is entered. This may seem like a frivolous feature; however, it is quite useful for entering data by hand from an autoradiogram. The audio feature permits verification of the sequence, thus reducing the need to continually look back and forth between the film and the computer screen. Some of these talking sequence editors use a digitized human voice, but others use synthetically generated speech. The digitized voice, which is generally easier to understand, is more common in programs for Macintosh computers (e.g., SeqSpeak, designed for Macintosh equipped with HyperCard version 2.0 and distributed free of charge; see Table 7.7.2).

Semiautomated Entry Using Digitizing Hardware and Software
Several sequence analysis software packages can accommodate the connection of a relatively inexpensive digitizing pad to the computer; a digitizing pad permits DNA sequences to be read directly from an autoradiogram. Digitizing pads increase both the speed and the accuracy of input. Digitizing devices designed for DNA sequence entry consist of a light box for viewing the autoradiogram and a stylus that resembles a ball-point pen with an attached wire. When the stylus is pressed at the location of a lane or band on the autoradiogram, the digitizing pad detects a signal and converts it to an x-y coordinate value that is recorded by the computer. Most digitizing pad data entry software requires the basic shape (boundaries) of the autoradiogram lanes to be
entered initially. Subsequently, when the stylus is pressed on an individual band in a lane, the computer utilizes this boundary information to determine which lane was selected, then displays the appropriate nucleotide letter on the computer screen.

The result of the digitized entry process is a file containing the newly determined DNA sequence. In some cases the file must be reformatted before it can be used with an analysis program; this can be determined by comparing the format defined by the entry software with the list of acceptable formats in the sequence analysis software.

DNA sequence entry using a digitizing program can be very quick and accurate when the gel is of good quality. However, if the lanes are irregular due to curving or excessive smiling, the program may incorrectly identify which lane contains a selected band. Commercial gel-reading programs are designed to handle some types of common gel problems. For example, programs avoid smiling errors by requiring the order of the bands to be determined by the user so that the program determines only which lane was selected. The SEQED program of the GCG package for UNIX and VMS and the DNAStar package for IBM-compatibles and Macintosh can both be used with a digitizing pad for semiautomatic DNA sequence determination.

Automated Entry Using Gel Readers and Automated Sequencers
Automated gel readers. Several automated autoradiogram readers are now available that include a scanner and a computer. Automated readers use a high-resolution gray-scale scanner or a digital video camera to digitize an autoradiogram of a sequencing gel produced by conventional methods. The digitized image of the film is then analyzed by the computer using image-analysis techniques. The software is designed to identify first the locations of lanes and then bands within the lanes. This process can be confounded by smudges and random spots on the original film, as well as by gel smiling and curved lanes. Indeed, a major challenge to designers of automatic gel-reading programs is the elimination of such noise in scanned autoradiograms. Automated gel-reader systems (sold by BRL and Milligen, among others) are generally priced for use by large laboratories and core facilities.

When using automated image analysis, it may be advantageous to employ the multiplex method of DNA sequencing (introduction to Chapter 7 and UNIT 15.2; Church and Kieffer-Higgins, 1988), because it is then possible to incorporate a known sequence of DNA into each gel; the resulting sequence can then be analyzed to determine the particular characteristics of the gel. Having this sort of an internal standard greatly simplifies the image analysis required to read more than one unknown sequence from the same gel using the automatic scanner.

Automated DNA sequencers. Automated DNA sequencing machines (discussed in UNIT 7.0) determine the sequence of a DNA fragment and then place the DNA sequence into a file. The most common automated sequencers use special fluorescence-tagged oligonucleotide primers in a dideoxy primer-extension reaction. Reaction products are separated by electrophoresis through a polyacrylamide gel and the fluorescent tag is detected after excitation by a laser at the bottom of the gel. Because each specific nucleotide reaction uses a different fluorescent tag, all four reactions can be loaded in the same lane. The output from the sequencer is a trace of the amount of fluorescence observed as the tagged reaction products pass off the gel. An attached computer analyzes the trace and determines the DNA sequence it represents. Thus, entry of sequence data is completely automated; the user merely has to define or identify the particular sequencing run.

Automated sequencers are quick and currently provide about the same level of accuracy as manual sequencing. Automated and manual sequencing employ similar sequencing technologies (e.g., primer extension and polyacrylamide gel electrophoresis). Although automated sequencing machines are expensive, for large sequencing projects, the automated sequencers may be more economical than manual sequencing because of the reduction in labor costs.

Editing automated sequence data. Because automated autoradiogram scanners are not absolutely accurate, the software packages that accompany them typically permit the user to edit the automatically entered sequence. Generally, the editing process is greatly simplified for projects where overlapping randomly chosen segments or overlapping nested sets of deletions are sequenced because each nucleotide is determined multiple times. A sequence-assembly program can then be used to align the sequences based on overlapping regions. For DNA sequences that have been determined multiple times, inconsistencies between different versions are readily identified by the software, allowing the user to go back to the original data (digitized image or autoradiogram) to assess why the inconsistency occurred.
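The idea of flagging inconsistencies among multiple determinations of the same region can be sketched in a few lines of Python. This is an illustrative sketch only and does not reproduce the method of any sequencer or assembly package; the function name consensus_with_conflicts, the use of "-" for positions a read does not cover, and the choice of "N" for ties are assumptions. The reads are assumed to have been aligned already.

```python
from collections import Counter

def consensus_with_conflicts(reads, gap="-"):
    """Majority consensus of pre-aligned, equal-length reads.

    Returns (consensus, conflicts), where conflicts is a list of
    (1-based position, {base: count}) for every column in which the
    reads disagree, so the original data can be rechecked there.
    """
    length = len(reads[0])
    if any(len(read) != length for read in reads):
        raise ValueError("reads must be aligned to the same length")

    consensus = []
    conflicts = []
    for pos in range(length):
        # Ignore reads that do not cover this column.
        calls = [read[pos].upper() for read in reads if read[pos] != gap]
        if not calls:
            consensus.append("N")  # no read covers this column
            continue
        ranked = Counter(calls).most_common()
        if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
            consensus.append("N")  # tie: no majority call possible
        else:
            consensus.append(ranked[0][0])
        if len(ranked) > 1:
            conflicts.append((pos + 1, dict(ranked)))
    return "".join(consensus), conflicts

# Three hypothetical overlapping reads; the second disagrees at column 4.
DEMO_READS = ["AAGGTACC--", "AAGCTACCTC", "--GGTACCTC"]
```

For DEMO_READS the function returns the consensus AAGGTACCTC and a single conflict at position 4 (two reads call G, one calls C), which is the position that would be rechecked against the gel or trace.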
Automatic sequencing machines (such as those produced by Pharmacia Biotech, Li-Cor, and Applied Biosystems) provide a similar editing feature whereby the user can view the fluorescent trace on a computer screen and override the computer’s choice of base if it appears to be incorrect. Some vendors of automated sequencing machines use a statistical means of correcting for discrepancies between regions that have been sequenced multiple times. That is, if a particular region is sequenced several times and the number of differences is low, the machine will automatically adopt the majority consensus sequence.

SEQUENCE DATA VERIFICATION
As with any experimental result, it is important to verify DNA sequence data. This is typically accomplished by sequencing a given region more than once and by sequencing both strands. The computer can be used to compare sequences, find overlapping regions, highlight differences, and provide a graphic overview of the overlapping regions.

Comparison of Multiple Entries
For small projects, the amount and quality of data can be optimized. To identify errors introduced during manual or automated data entry, each gel can be read more than once and the independent readings compared. The comparison can be carried out automatically using an alignment program (see Homology Searching) and any differences can then be investigated. Similarly, several of the digitizing gel readers provide a confirmation option in which each segment of a sequence can be checked automatically, with the machine simply re-entering the band locations. These programs, such as SEQED (included in the GCG package; see Table 7.7.1), alert the user as differences occur. In this way the entry process can be quite fast without sacrificing accuracy.

For large sequencing projects that involve sequencing a number of random clones from a library covering the region of interest, it is usually possible to achieve a sufficient level of redundancy that a majority consensus sequence can be determined.

Comparison to Known Restriction Maps
If a restriction map for the DNA region of interest already exists, a restriction map generated from the DNA sequence (as described below) can serve as a check on the accuracy of the sequence. This is particularly useful for large regions that have been subcloned for sequencing, where it may not be completely clear how the shorter pieces of DNA that were actually sequenced fit together within the larger region.

Test Translations to Detect Shifts in Reading Frame
One of the most widespread features of sequencing programs is the capacity to translate a DNA sequence into amino acids, thereby generating a putative protein sequence from an open reading frame (ORF); many restriction mapping programs also include a translation feature. When the DNA sequence in question is a prokaryotic coding region or the sequence from a cDNA clone, or when it is believed to be related to a gene from another organism, translating the sequence in this fashion provides a simple check for deletions and insertions. The process will find any stop codons within the sequence; if a stop codon is found within an area that is actually known or thought to be a coding region, the sequence is automatically suspect and should be rechecked. (It is important to remember that in some DNA, such as mitochondrial DNA, the stop codons may be used to specify an amino acid. In yeast mitochondrial DNA, for example, the codon TGA specifies tryptophan. Some software is able to take into account variant genetic codes; for instance, DNA Strider allows the user to select from a list of variant codes, and the GCG package allows the user to reset individual codons.) Although this technique will detect only a subset of possible errors, it can often quickly identify common problems such as simple typing mistakes or miscounting of the number of the same nucleotides in a run of identical nucleotides.

Most DNA sequence packages allow the user to specify a range and reading frame to be used in translating a DNA sequence. DNA Strider (Table 7.7.1) has a useful feature that hunts for ORFs in a DNA sequence and highlights the putative coding sequence, thus doing all the necessary work.

Detecting Overlap with Other Sequenced Fragments
With increasing worldwide interest in genome sequencing projects, sequence assembly packages now provide very effective automatic DNA sequence assembly, connecting shorter pieces of DNA to build the longest continuous sequence possible. Once the sequences to be assembled are identified, they are
compared, overlaps identified, and contiguous sequences (contigs) constructed. Typically, parameters are available to allow adjustment of the alignment process. If the default parameter settings do not result in sequence assembly, the minimum amount of overlap required to identify a match (i.e., the number of base pairs involved) can be reduced or the stringency of the required overlap decreased (i.e., the number of allowable mismatches increased). A good commercial assembly program is LaserGene, available from DNAStar for both Macintosh and IBM-compatible computers (Table 7.7.1). Once a sequence contig has been assembled, the LaserGene program creates a graphic overview of the sequencing project, highlighting regions that need further verification. Another commercial program is Sequencher (Gene Codes). This program provides expanded features for the assembly, processing, and editing of DNA sequences determined with the ABI sequencer; however, it is only available for the Macintosh computer.

If a sophisticated assembly program is not available, contigs can be constructed “by hand” using a comparison program and a multiple sequence editor. First, the comparison program is used to analyze two suspected overlapping sequences; then, if an overlap is found, a consensus sequence is generated that can then be compared with other sequences. In this manner, an overall consensus sequence (contig) can be progressively assembled. A potential pitfall of this process that should be mentioned is that although some DNA sequence analysis programs will consider both the DNA sequence that has actually been entered and its complement sequence (the sequence from the complementary strand of DNA, sometimes called the reverse complement sequence), others analyze only the sequence entered. With these latter programs, unless the DNA sequence of interest contains an inverted repeat (palindromic region), only one of the strands of DNA is likely to be recognizable as being similar to a previously known sequence. It is therefore important to determine which way a particular program operates and, when using a program that does not automatically consider the complement sequence, to conduct a separate search of that sequence.

When conducting sequence comparisons to identify overlaps, it is important to take into account the fact that different programs take slightly different approaches to this process. Some comparison programs that use the Needleman and Wunsch algorithm (e.g., the GCG GAP program; Needleman and Wunsch, 1970) are designed to find the maximum number of matches between two sequences over the entire length of the two sequences, with the minimum number of gaps. This type of program is best suited for aligning two sequences along their entire lengths. If the two sequences differ greatly in size, however, the result may not satisfactorily represent the similarity between them. For identifying regions of similarity between two pieces of DNA that are largely different in sequence or of significantly different lengths, programs employing algorithms such as those described by Smith and Waterman (1981) or Wilbur and Lipman (1983), e.g., the GCG BESTFIT program, are more suitable. These algorithms do not try to align the two sequences being compared in their entirety, but instead search for short matches within the sequences.

Editing a Contig and Verifying the Sequence
Generally, software packages that provide sequence assembly programs include multiple sequence editors that display the individual sequences of the aligned contig together one on top of the other, one sequence per line (see Fig. 7.7.2), and can generate the consensus sequence automatically. A multiple sequence editor differs from the sequence editors mentioned earlier in that several sequences can be manipulated at once. However, the most significant feature of a multiple sequence editor is its ability to produce a consensus sequence for an aligned set of sequences. If the consensus sequence is not satisfactory, the alignment can instantly be changed and a new consensus sequence generated.

Another use of a multiple sequence editor is for manually comparing a group of overlapping sequences by aligning common regions vertically on the screen (Fig. 7.7.2), which makes it easy to identify differences in the aligned sequences. It is generally useful to go back to the original gel to determine whether these differences result from misreading, sequence compression, or gel defect. Multiple sequence editors will often have the ability to display the reverse complement of a particular sequence, irrespective of which strand was sequenced.

Some very useful programs are available free of charge; these programs provide automatic multiple sequence alignment functions as well as other types of analysis on a set of sequences. Examples of these are the MACAW software for Microsoft Windows or Macintosh,
TACCTCAGCCAGCATGGCAGCCTCTTTCCCACCCACCTTGGGACTCAGTTCTGCCCCAGATGAAATTCAGCACCC
AAGGTACCTCAGCCAGCATGGCAGCCTCTTTCCCACCCACCTTGGGACTCAGTTC
AAGGTACCTCAGCCAGCATGGCAGCCTCTTTCCCACCCACCTTGGGACTCAGTTC
AAGGTACCTCAGCCAGCATGGCAGCCTCTTTCCCACCCACCTTGGGACTCAGTTCTGCCCCAGATGAAATTCAGCACCC
....|.........|.........|.........|.........|.........|.........|.........|....
100 110 120 130 140 150 160 170
7
6
5A +-------------------->
4A +-------------------->
3A +---------------*---->
2A +-------------------->
C +------------------------------------------------>
|......|......|......|......|......|......|......|......|......|......|
0 50 100 150 200 250 300 350 400 450 500
Figure 7.7.2 Multiple sequence editor. The GCG program GELASSEMBLE displays the aligned
sequences on the top of the screen and a schematic of the sequenced fragments on the bottom.
Arrows indicate the direction of sequencing; the asterisk in the lower part of the display indicates
the position of the cursor in the sequence alignment as the user edits the sequence.
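The overlap parameters discussed in this section (minimum overlap length and number of allowable mismatches) and the reverse-complement pitfall can be made concrete with a small Python sketch. It is a deliberately simple end-overlap search, not the algorithm used by GELASSEMBLE or any other package named here; the function names and default parameter values are assumptions chosen for illustration.

```python
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")

def reverse_complement(seq):
    """Sequence of the complementary strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]

def suffix_prefix_overlap(left, right, min_overlap=10, max_mismatches=0):
    """Length of the longest end of `left` that matches the start of
    `right` with at most `max_mismatches` differences (0 if none)."""
    longest = min(len(left), len(right))
    for size in range(longest, min_overlap - 1, -1):
        mismatches = sum(a != b for a, b in zip(left[-size:], right[:size]))
        if mismatches <= max_mismatches:
            return size
    return 0

def join_reads(left, right, min_overlap=10, max_mismatches=0):
    """Merge `right` onto the end of `left`, trying `right` both as
    entered and as its reverse complement. Returns None if neither
    orientation overlaps. Where mismatches are tolerated, the bases
    of `left` are kept in the overlapping region."""
    for candidate in (right, reverse_complement(right)):
        size = suffix_prefix_overlap(left, candidate,
                                     min_overlap, max_mismatches)
        if size:
            return left + candidate[size:]
    return None

# Two hypothetical reads taken from the region shown in Figure 7.7.2;
# the second was read from the opposite strand.
READ_A = "AAGGTACCTCAGCCAGCATGGCAGCC"
READ_B = reverse_complement("CATGGCAGCCTCTTTCCCACC")
```

With these inputs, join_reads(READ_A, READ_B, min_overlap=8) finds no overlap for READ_B as entered but a 10-base overlap for its reverse complement, and returns the 37-base contig. A program that compared only the entered strand would have missed the join, which is the pitfall described in the text.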
developed by Greg Schuler of the NCBI (Schuler et al., 1991), and GDE for X-Windows, developed by Steven Smith of the Harvard Genome Laboratory (Table 7.7.2).

RESTRICTION MAPPING
Once a sequence contig is generated, a map of restriction endonuclease cleavage sites can be a useful aid in further analysis of the region. As stated above, construction of a restriction map based on newly determined sequence information allows rapid visual comparison of the sequence with a known restriction map. Restriction maps also identify sites for subcloning or other molecular genetic manipulations and can provide useful summaries of newly determined DNA sequences.

Mapping All Known Commercial Restriction Enzyme Recognition Sites
Programs are available that identify on a strand of DNA the sites of all known restriction enzymes, according to either the first nucleotide of the recognition site or the location of cleavage. Most of these programs produce one list of the sizes of fragments that would be created by cutting with each restriction enzyme and a separate list of the recognition sites identified. They can also display the cleavage sites of specified sets of restriction enzymes of interest, such as those that generate 3′- or 5′-strand overhangs or blunt ends. All restriction mapping programs contain a file or files in which restriction site data are stored; this information can be updated as new restriction enzymes and their cleavage sites are identified. Restriction mapping programs are one of the most common types of molecular biology software; virtually all commercial software provides excellent restriction mapping features.

An up-to-date list of restriction enzymes is maintained by Richard Roberts (New England Biolabs). It is accessible in the form of a text file called rebase and contains all known type II restriction enzyme recognition sites, sites of cutting, a complete cross-reference of isoschizomers, reference citations, and commercial sources for the restriction enzymes with addresses and phone numbers. The latest version of rebase can be obtained free from a variety of electronic mail (e-mail) and network servers (see UNIT 19.1).

Mapping to Predict Band Sizes
Some software programs can model double digests, predicting the band sizes that would result from cleavage of an entered sequence with a given pair of restriction enzymes. Such double-digest band-size analyses simplify the reading of restriction fragment patterns on gels and can also be useful for planning subsequent cloning strategies. A few programs can also predict the results of partial digests. Software packages that provide restriction fragment pattern analysis features include Textco’s Gene Construction Kit (for Macintosh only) and Pro-RFLP from DNA ProScan (for Macintosh and IBM-compatible computers).

Graphical Restriction Mapping
Several very elegant programs are available that produce graphical restriction-site maps that are useful for searching for possible sites to be used in further recombinant DNA manipulations. Graphical restriction maps are often represented as collections of horizontal lines,
7.7.7
Current Protocols in Molecular Biology Supplement 30
[Figure 7.7.3 graphic: horizontal scale from 0 to 1000, printed above and below the map. One row
per enzyme, with the number printed beside each enzyme name: HhaI 11; HincII 1; HindIII 1;
HinfI 2; HinP1I 11; HpaII 9; HphI 3; MaeI 1; MaeII 2; MaeIII 3; MboII 3; MmeI 2; MnlI 7;
MseI 2; NaeI 3; NarI 3; NciI 2; NheI 1; NlaIII 7.]
Figure 7.7.3 One type of graphical restriction map. This figure was produced by the free PlotZ
program; GCG MAPPLOT produces similar output.
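The single- and double-digest band-size predictions described under Mapping to Predict Band Sizes can be illustrated with a short Python sketch. It is not taken from any of the packages named in this unit. The function names and the demonstration sequence are hypothetical, the molecule is assumed to be linear, and only two enzymes are tabulated: EcoRI (G^AATTC) and BamHI (G^GATCC), each recorded with its top-strand cut offset.

```python
# Minimal sketch: predict fragment sizes for single and double digests
# of a LINEAR sequence. Names and the demo sequence are illustrative only.

# Recognition sequence and top-strand cut offset (bases from site start).
ENZYMES = {
    "EcoRI": ("GAATTC", 1),  # G^AATTC
    "BamHI": ("GGATCC", 1),  # G^GATCC
}

def cut_positions(seq, enzyme):
    """Return the 0-based positions at which the enzyme cuts the top strand."""
    site, offset = ENZYMES[enzyme]
    seq = seq.upper()
    positions = []
    start = seq.find(site)
    while start != -1:
        positions.append(start + offset)
        start = seq.find(site, start + 1)  # allow overlapping sites
    return positions

def fragment_sizes(seq, enzymes):
    """Return fragment lengths (largest first) after cutting with all enzymes."""
    cuts = sorted({p for name in enzymes for p in cut_positions(seq, name)})
    bounds = [0] + cuts + [len(seq)]
    return sorted((b - a for a, b in zip(bounds, bounds[1:])), reverse=True)

# Hypothetical 47-base test sequence with one EcoRI and one BamHI site.
demo = "A" * 10 + "GAATTC" + "A" * 20 + "GGATCC" + "A" * 5

print(fragment_sizes(demo, ["EcoRI"]))           # single digest
print(fragment_sizes(demo, ["BamHI"]))           # single digest
print(fragment_sizes(demo, ["EcoRI", "BamHI"]))  # double digest
```

Comparing the three lists shows which band in each single digest is split in the double digest, which is the reasoning used when reading such patterns from a gel. A circular plasmid would need the first and last fragments joined, which this sketch does not attempt.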
one for each restriction enzyme, with the recognition sites represented by short vertical marks
at the appropriate locations (Fig. 7.7.3). A limited number of programs produce pictures of
sequences that look like standard circular plasmid maps, including coding regions and other
features of interest within the sequence, as well as the vector DNA and polylinker. These
graphical maps can be saved and then manipulated in a graphics editor or drawing program,
greatly simplifying the task of preparing figures for presentation; an excellent example of
such a program is the Textco Gene Construction Kit. This program reads a variety of
sequence-file formats and generates a picture of the DNA sequence in standard restriction map
form. The pictures can be edited to include arrows and information boxes that are of
publishable quality. In contrast to most graphical mapping programs, the Gene Construction Kit
and other commercially available programs such as MacVector and LaserGene use DNA sequence data
directly to generate the pictorial maps. Many shareware mapping programs, such as Jingdong
Liu’s MacPlasMap, do not work with the sequence data directly; rather, recognition site
locations must be input by the user. One shareware program that does have the capacity to
create graphical maps directly from sequence data, and provides many analysis features as well,
is DNA Strider, developed by Christian Marck (Table 7.7.2).
Recent versions of many mapping programs can actually simulate the electrophoretic-gel banding
patterns that will be observed following particular restriction digests. The programs allow the
user to specify the type of gel medium being used, making it possible to determine visually
whether the banding pattern observed on a gel could be produced by a given sequence. Some
programs will even simulate partial digests. The Gene Construction Kit provides the gel
simulation feature as does the ACEDB genome database software (see Genetic Sequence Databases);
the latter is limited in that it can simulate the gel banding patterns only for sequences
already contained within the database being used.

PREDICTION OF NUCLEIC ACID STRUCTURE
Once the DNA sequence of a region has been identified, a number of analyses can be performed to
identify interesting features such as repeats, areas of atypical base composition, and RNA
secondary structure. These in turn can help to define functional regions within the sequence of
interest.

Base Repeats
Direct and inverted repeats are often part of transcriptional or translational control regions.
Most sequence analysis software can identify repeats and provide an optional graphic display of
their location. The standard “dot matrix” plot is a simple and effective method of identifying
repeated regions. The two-dimensional dot matrix represents one sequence on the x axis and
the other on the y axis. Such programs generate a “dot” at each x-y intersection of all GG, AA,
TT, and CC pairs; thus, repeats are revealed as diagonal lines. Some dot matrix programs show
inverted repeats (which may indicate potential stem-loop structures in the corresponding DNA or
RNA) as lines with negative slopes relative to direct repeats. To identify repeats within a
single sequence, the same sequence is used for both the x and y axes.
Analysis packages that do not produce a graphical presentation of the repeated regions usually
include a program that lists the repeats and their locations. This involves searching a
sequence against itself to find direct repeats or against its complement to detect inverted
repeats. The repeats must then be plotted by hand or the distances between them calculated to
determine if a periodicity is present. The graphical presentation is preferred because it
readily shows patterns that would take considerable time and effort to reveal by the comparison
method.

GC and AT Content
The GC/AT content of a sequence may provide some insight into structural features such as Z DNA
and bent DNA (see APPENDIX 1). GC/AT content can also serve as a reasonable indicator of a
coding region in many invertebrate and plant species. A few sequence-analysis packages contain
specialized programs, such as GCG STATPLOT, that show the GC and AT content of a sequence in
graphical form, allowing AT- or GC-rich regions to be readily identified. However, many
packages provide little more than a tally of nucleotide composition, i.e., the program only
lists the number of A, G, C, and T residues in a sequence. A simple method to determine the
local GC/AT content of a large sequence is to divide the sequence into several small (e.g.,
100-base) segments, then determine the GC/AT content of these small segments individually.

RNA Secondary Structure
Although folding programs are available that predict RNA secondary structure, this type of
analysis is still an art. RNA-folding programs help identify possible stable stems in an RNA
molecule, but a trial-and-error process is required to determine the biological significance of
these results for a given RNA molecule. Even with this limitation, secondary structure
predictions can be useful for identifying mRNA control regions as well as possible stable
folded regions of an RNA molecule.
The greatest problem with predicting secondary structure is modeling the interactions present
in a tertiary structure and then relating those back to the primary sequence for use in a
folding program. Indeed, current RNA-folding programs do not take into account possible
tertiary structures of a nucleic acid molecule. These programs determine the energetics of a
limited number of two-dimensional folded structures. The most stable structure predicted by the
program in a two-dimensional world may be far from the most stable structure in three
dimensions where loops can interact with loops, helical regions can stack, and various
non-Watson-Crick base-pairing structures can occur (see illustrations in APPENDIX 1).
Currently, the most sophisticated RNA-folding program is MFOLD (for Multifold, an extension of
an earlier program known as RNAFold or FOLD in the GCG package), designed by Michael Zuker of
the National Research Council of Canada (Zuker, 1989a,b; Table 7.7.2). In addition to standard
analysis of base-pairing energetics, MFOLD takes into account base-pair-stacking energies
(Freier et al., 1986; Turner et al., 1987) and single-base-stacking (dangle) enthalpies (Turner
et al., 1988). Another major feature of MFOLD is that in addition to predicting a single
structure with the lowest predicted free energy, it also depicts many suboptimal structures.
The number of suboptimal structures displayed can be varied within a percentage or an absolute
value of the best structure’s energy. VMS, UNIX, DOS, and Macintosh versions of this program
are available from many of the software archives (see Table 7.7.2). Although the output of
MFOLD is text-based (Fig. 7.7.4A), several programs are available that generate graphic
representations of the predicted structure (e.g., LoopViewer, developed by Don Gilbert; see
Fig. 7.7.4B).

OLIGONUCLEOTIDE DESIGN STRATEGY
Increased use of polymerase chain reaction (PCR) methods has stimulated the development of many
programs to aid in the design or selection of oligonucleotides used as primers for PCR. Four
such programs that are freely available via the Internet (see Table 7.7.2) are: PRIMER by Mark
Daly and Steve Lincoln of the Whitehead Institute (UNIX, VMS, DOS, and Macintosh),
Oligonucleotide Selection Program (OSP) by Phil Green and LaDeana Hiller of Washington
University in St. Louis (UNIX, VMS, DOS, and Macintosh), PGEN by Yoshi (DOS only), and Amplify
by Bill Engels
Current Protocols in Molecular Biology Supplement 34
[Figure 7.7.4 artwork not reproduced: panel A, text-based output showing the paired and
unpaired bases of the folded RNA, with positions labeled 10 through 90; panel B, graphic
drawing of the same structure, labeled every 10 bases from 10 through 90.]
Figure 7.7.4 (A) Text-based output from Zuker’s RNA-folding program, available from GCG under
the name of FOLD. This type of representation is difficult to visualize, but acceptable when only a
quick view of the possible folded structures is desired. (B) Graphic representation of the structure
shown in part A, produced by the GCG Squiggles program. The free LoopViewer program (for
Macintosh) produces similar representations.
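The dot matrix comparison described under Base Repeats can be sketched in a few lines of Python. This is a minimal illustration, not the method of any package named in this unit. The function names, the `min_len` and `skip_main` parameters, and the test sequence are hypothetical. Identical bases on the two axes are treated as dots, and only diagonal runs of at least `min_len` consecutive dots are reported. Inverted repeats, which may indicate potential stems, can be sought by comparing a sequence with its reverse complement.

```python
# Minimal sketch of a dot matrix repeat search. Names are illustrative only.

def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    pairs = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(pairs[base] for base in reversed(seq.upper()))

def diagonal_runs(seq_x, seq_y, min_len=4, skip_main=False):
    """Return (x_start, y_start, length) for each diagonal run of matching
    bases that is at least min_len long. Set skip_main=True when a sequence
    is compared with itself, so the trivial main diagonal is ignored."""
    runs = []
    nx, ny = len(seq_x), len(seq_y)
    for d in range(-(ny - 1), nx):      # one pass per diagonal, x - y = d
        if skip_main and d == 0:
            continue
        x, y = max(d, 0), max(-d, 0)
        run = 0
        while x < nx and y < ny:
            if seq_x[x] == seq_y[y]:    # a "dot" at this intersection
                run += 1
            else:
                if run >= min_len:
                    runs.append((x - run, y - run, run))
                run = 0
            x += 1
            y += 1
        if run >= min_len:              # run that reaches the matrix edge
            runs.append((x - run, y - run, run))
    return runs

# Hypothetical sequence containing a 7-base direct repeat 9 bases apart.
seq = "GATTACA" + "CC" + "GATTACA"
print(diagonal_runs(seq, seq, min_len=5, skip_main=True))

# Inverted repeats: compare the sequence with its reverse complement.
print(diagonal_runs(seq, reverse_complement(seq), min_len=5))
```

In a self-comparison each direct repeat appears twice, once on either side of the main diagonal, and the distance of the run from that diagonal gives the spacing between the two copies.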
of the University of Wisconsin (Macintosh only). Generally these programs help in the design of
PCR primers by searching for bits of known repeated-sequence elements and then optimizing the
Tm by analyzing the length and GC content of a putative primer. Commercial software is also
available and primer selection procedures are rapidly being included in most general sequence
analysis packages.

Sequencing and PCR Primers
Designing oligonucleotides for use as either sequencing or PCR primers requires selection of an
appropriate sequence that specifically recognizes the target, and then testing the sequence to
eliminate the possibility that the oligonucleotide will have a stable secondary structure.
Inverted repeats in the sequence can be identified using a repeat-identification or RNA-folding
program such as those described above (see Prediction of Nucleic Acid Structure). If a possible
stem structure is observed, the sequence of the primer can be shifted a few nucleotides in
either direction to minimize the predicted secondary structure. The sequence of the
oligonucleotide should also be compared with the sequences of both strands of the appropriate
vector and insert DNA. Obviously, a sequencing primer should only have a single match to the
target DNA. It is also advisable to exclude primers that have only a single mismatch with an
undesired target DNA sequence.
For PCR primers used to amplify genomic DNA, the primer sequence should be compared to the
sequences in the GenBank database to determine if any significant matches occur. If the
oligonucleotide sequence is present in any known DNA sequence or, more importantly, in any
known repetitive elements, the primer sequence should be changed.

Degenerate Probes for Detecting Related Genes
Once a conserved protein sequence has been identified, a degenerate oligonucleotide can be
designed for use as a hybridization probe to screen a library to identify additional members of
the protein family (see UNIT 6.4). To design this oligonucleotide, the conserved protein
sequence must be back-translated into a degenerate DNA sequence. Most software packages provide
this feature; their output is a DNA sequence that is produced using the IUPAC
degenerate-nucleotide codes. A degenerate oligonucleotide is then synthesized to correspond to
the back-translated protein sequence. Most DNA synthesizers will create an oligonucleotide with
more than one nucleotide at any position within the sequence except for the 3′ nucleotide.
For efficiency of synthesis and hybridization, the following guidelines, designed to yield
oligonucleotides with the lowest possible level of degeneracy, should be followed. First, it is
not necessary to incorporate both G and A to match a consensus sequence position containing C
and T, because G pairs with both C and T. Second, inosine (I) pairs with G, C, and A. Third, a
pyrimidine-pyrimidine mismatch does not disrupt base pairing, but a purine-purine mismatch is
destabilizing.
As a general rule, create the minimum sequence that hybridizes to the consensus sequence. For
each species (unique sequence) of oligonucleotide in the synthesis, the concentration of oligo
used in hybridizations must be increased to achieve an equivalent C0t value.

IDENTIFICATION OF PROTEIN-CODING REGIONS
Identification of potential protein-coding regions, especially in genomic DNA sequences from
higher eukaryotes, is still not a completely automated process. It is helpful to simply
translate the region in all six reading frames (three on each strand) and then identify all
possible exon regions as uninterrupted open reading frames (ORFs). Although identifying the AUG
initiation codon is a simple task, determining the location of introns is not as
straightforward (unless the sequence is from yeast, where the rules for splicing are simple and
seemingly absolute). Although several research groups are working on techniques to identify
exon-intron boundaries, the process requires thoughtful consideration of several types of
analyses. One useful technique available with some software packages (e.g., GCG TestCode) uses
a purely statistical method to determine the nonrandomness of the triplet code characteristic
of an ORF (Fickett, 1982). This method works best with large windows (∼200 nucleotides), so it
may be of limited use for identifying small ORFs. The basic principle is that introns are
evolving without any restraint and are thus more random in sequence than exons, which are
subject to stabilizing selection.
Prediction of ORFs is a very active research area. The rules used by the cellular machinery to
define splice sites for higher eukaryotic sequences are still elusive. Four projects are trying
to develop truly automatic gene-identification methods. Although there is some disagreement
about the current exact reliability of these
methods, they all claim to locate about 90% of known (i.e., previously well-characterized)
exons, and they are continuing to be improved.
The first, Gene Recognition and Analysis Internet Link (GRAIL), is available via an electronic
mail (e-mail) server and as a UNIX application; it is being developed by a group led by Edward
Uberbacher of the Oak Ridge National Laboratory. GRAIL utilizes an artificial intelligence
technique called a neural network that learns by example. The GRAIL software was not programmed
to recognize a specific set of characteristics of human coding regions. Rather, it is given a
set of well-characterized human sequences for which the locations of exons and introns have
been experimentally identified. The neural network is programmed to search for particular types
of simple features and then to correlate these features with the input set’s exon-intron
boundaries.
In contrast to GRAIL, the second program, GeneID, developed by Steen Knudsen and Kathleen Klose
of Temple Smith’s group at Boston University, utilizes many features of coding regions,
including exon/intron consensus sequences and codon preferences. Because its rules are
generally based on human sequences, GeneID’s usefulness with nonmammalian sequences may be
limited. GeneID is available from BMERC at Boston University via an e-mail server (see Appendix
and UNIT 19.1).
The third gene-finding resource, BCM Gene Finder, is provided by the Baylor College of Medicine
via e-mail and the World Wide Web (see Appendix and UNIT 19.1). A set of analysis programs is
available to aid in the identification of genes in human DNA sequences. The analysis method
involves predicting all possible internal exons based on the combination of characteristics
describing a potential splice site. Then the set of potential exons is analyzed to determine
the optimal combination and a model for the putative gene is constructed.
The final service was announced in 1992 and uses a neural network program to identify coding
regions. The NetGene service is available through an e-mail server and is provided by the
Department of Physical Chemistry at the Technical University of Denmark. This server appears to
be changing the slowest, but because no one has yet produced the definitive gene-finding
software, it is suggested that all of these servers should be tried to determine which resource
is the most useful.

HOMOLOGY SEARCHING
Searching for homology between a newly obtained sequence and a sequence already listed in one
of the DNA or protein databases can be very informative. Such a search can reveal similarity to
a known sequence, which may suggest the function of the new protein, or indicate that no
similar sequence has yet been deposited in the database. Because of the size of the databases
and the speed with which they are expanding (see Fig. 19.2.1), the task of searching the
database is not always easy to accomplish using an isolated laboratory microcomputer. Searching
sequence databases for similarities is one of the few sequence-analysis tasks that is still
best performed on a larger computer system. To search a single sequence against the entire
GenBank, European Molecular Biology Laboratory (EMBL), or DNA Data Bank of Japan (DDBJ)
databases requires about an hour on a smaller VAX or microcomputer. For this reason, many
laboratories obtain an account on a large computer system or modern workstation that provides
access to the large genetic sequence databases (see UNIT 19.1). Several sources, including the
European Bioinformatics Institute (EBI), the National Center for Biotechnology Information
(NCBI), and the University of Houston, provide free databases and homology searches over the
Internet to anyone with access to e-mail (see UNIT 19.1). Protocols for carrying out these
homology searches are described in UNIT 19.3.

Comparison of Two Sequences
Many programs are able to align two DNA or protein sequences. Such programs are often used to
format an alignment for publication or simply to identify regions of similarity between two
input sequences. In addition, these programs introduce gaps into a sequence to optimize the
alignment. Because alignment programs assign numerical penalties to gaps and mismatches, the
alignment can be influenced by varying gap and mismatch parameters.
Protein sequences are aligned using a scoring matrix developed by Margaret Dayhoff (Schwartz
and Dayhoff, 1978) known as PAM250, which represents the evolutionary change that takes place
in a protein sequence over time. A PAM (rearranged acronym for Accepted Point Mutations) is a
measure of the number of individual amino acid changes occurring per 100 amino acid residues as
a result of evolution. A PAM of 250, therefore, represents 250 mutations occurring within 100
residues. For two proteins to be separated by 250 PAM, some amino acids in the sequence must
have mutated multiple times. However, as calculated by Dayhoff, two such protein sequences
would still have ∼20% of their amino acids in common, assuming both a completely random
distribution of mutations and lack of selection for conserved functional sequences. The
so-called PAM250 log odds matrix, the log of the probability that a given amino acid could
mutate in the evolutionary time equal to 250 PAM, was created by analyzing many families of
protein sequences that were available in the late 1970s. The PAM250 matrix is used to determine
the score of an aligned pair of sequences by summing the matrix values corresponding to each
aligned pair of amino acids. These matrices are still commonly used today; however, the PAM
matrix table has been recalculated (Gonnet et al., 1992). The Gonnet tables are the first
recalculation of the PAM tables since their initial formulation. Subsequently a new amino acid
substitution matrix was derived directly from sequence or three-dimensional structural
alignments of distantly related proteins. This matrix, called BLOSUM62, is reported to perform
better than the previous matrix at detecting distant relationships using either BLAST or FASTA
(Henikoff and Henikoff, 1993). The BLAST servers at the NCBI use the BLOSUM62 matrix by default
for protein sequence comparisons.
An example of the way information contained within a substitution matrix can be expressed is
given here. The standard PAM250 matrix states that a methionine opposite a methionine in an
aligned pair of sequences has a score of 6, a methionine opposite a valine has a score of 2,
and a methionine opposite a cysteine has a score of −5. These scores reflect the fact that in
the data used to generate the matrix, methionine was observed to change frequently to valine
(thus, this is considered to be a conservative substitution), whereas methionine very rarely
changed to cysteine. Thus, some amino acid changes have positive effects on the alignment
score, and others have negative effects. The larger the number, the better the match, and thus
a negative value between a pair of amino acids in the matrix means that that combination
actually takes away from the alignment.

Comparison of Multiple Sequences
Programs for simultaneously aligning multiple sequences require more computer power and are
less common than the alignment programs discussed above. Three standard multiple-alignment
algorithms have been developed for UNIX and VMS-based systems: Des Higgins’ Clustal (Higgins
and Sharp, 1988, 1989; see Table 7.7.2), the Feng and Doolittle algorithm implemented in the
GCG PILEUP program (Feng and Doolittle, 1987; see Table 7.7.1), and PIRAlign from NBRF, which
is based on a variation of the Needleman-Wunsch algorithm (Needleman and Wunsch, 1970; see
Table 7.7.1). A powerful multiple-alignment tool for Microsoft Windows and Macintosh is the
MACAW program created by Greg Schuler (Schuler et al., 1991). These programs identify common
regions (or segments) in all the sequences that have been input, and then use the common
regions as a starting point for building an overall alignment. They generally work best if the
extent of the sequences being aligned is limited to the regions that are conserved among them.

Database Searches
Most researchers currently carry out both DNA and protein database searches over the Internet
using the program Basic Local Alignment Search Tool (BLAST), developed by NCBI (Altschul et
al., 1990). Detailed protocols for carrying out BLAST searches are given in UNIT 19.3. Here, we
simply present a brief overview of the most important features of the BLAST family of programs.
The BLAST family of programs allows rapid similarity searching of nucleic acid or protein
databases. The basic BLAST algorithm is used in several different programs that are each
specific for a particular database and a particular type of input sequence. BLASTN is used to
search a nucleic acid sequence against a nucleic acid database, BLASTP is used to search an
amino acid sequence against a protein database, and TBLASTN is used to search an amino acid
sequence against a nucleic acid database. In the latter program, the database is translated in
all six reading frames prior to the search. The converse analysis is performed by BLASTX, which
takes a nucleic acid input sequence and translates it in all six reading frames before
searching against a protein-sequence database. Finally, if BLASTP does not identify significant
sequence similarities, a more extensive program, BLAST3, is available that also compares an
amino acid input sequence against a protein database. Like BLASTP, BLAST3 identifies regions of
similarity between the input sequence and sequences present in the database, but the initial
search is at lower stringency. This search
produces a collection of pairwise matching sequences that are then compared to each other,
resulting in three-way matches where the component two-way matches were not significant. BLAST3
can be useful in identifying divergent members of a common gene family.
If the sequence of interest contains a protein-coding region, it is more informative to search
the predicted protein sequence against one of the protein databases than to search the DNA
sequence against a nucleic acid database. Because protein sequences evolve at a slower rate
than DNA sequences, a distant homology between protein sequences may be missed at the DNA
level. If no coding region has been defined, BLASTX can be used to translate the DNA sequences
in all six reading frames before searching against a protein database. Because the
protein-sequence databases only contain identified proteins, it is also important to check a
newly defined amino acid sequence, or translated DNA sequence, against the current GenBank,
EMBL, or DDBJ DNA sequence databases using TBLASTN. Such a search may identify a significant
similarity to a DNA sequence that was not previously known to encode a protein.
An important feature of BLAST is a statistical significance score of the reported match. This
statistical significance score is determined using an implementation of Karlin’s significance
formula (Karlin and Altschul, 1990), which calculates the Poisson probability that the observed
sequence similarity will occur by chance based on the size and composition of the sequence
database as well as on the size and quality of the match.
Another frequently used program for searching both protein and DNA sequence databases is FASTA
(Pearson and Lipman, 1988), an updated version of the FASTN and FASTP programs. FASTA, which
has been included in many commercial analysis packages, uses an initial fast search through the
database to identify sequences with a high degree of identity to the test sequence. This fast
search is performed by limiting the search to short regions of identity between the test
sequence and the database. The word size (or k-tuple) parameter used in the program can be
varied; it is the size of the initial match (sequence segment) that is used. The size of the
k-tuple parameter indirectly affects the speed and sensitivity of the initial search through
the database. Generally, the default value is the most efficient value to use; however, if
distant homologies are desired, the k-tuple value can be decreased (decreasing the k-tuple
value causes the search to be more sensitive and take much more computer time).
FASTA builds a list or dictionary of all possible sequences of the size specified by the
k-tuple value. The test sequence and all sequences in the database are then processed to find
the locations of all segments in the sequence of a length equal to the k-tuple value that are
present in the dictionary. For example, if the k-tuple was set to four, the sequence AGTCCTG
would only have four entries (AGTC, GTCC, TCCT, and CCTG) in the dictionary of 256 (4^4)
different four- (k-tuple value) letter words. The dictionaries of the two sequences can be more
quickly compared than the sequences themselves, allowing for efficient identification of
regions that contain small similarities. Once a list of the highest-scoring sequences is
produced using the initial fast search, a second comparison is performed on just the
top-scoring sequences. This secondary alignment uses the algorithm of Needleman and Wunsch
(1970) to produce an alignment with gaps and is the output at the conclusion of the analysis.
If no good homologies are observed after running FASTA, it is sometimes helpful to repeat the
analysis using a smaller k-tuple value or an alternate scoring matrix.
A convenient method of performing a FASTA search from a local computer is to use a free e-mail
server (UNIT 19.1). A computerized service provided by several institutions automatically
accepts requests for FASTA searches via e-mail, searches the sequence contained within the mail
message against a variety of databases, and then returns the results via e-mail. Mail servers
for FASTA are available from the University of Houston and the European Bioinformatics
Institute. A sample mail message is shown in Figure 7.7.5. The FASTA program is currently
available for most computers in common use: UNIX, VMS, IBM-compatible, and Macintosh versions
of the program are available via the Internet or from Bill Pearson at the University of
Virginia. The DOS and Macintosh versions of FASTA can search the CD-ROM version of GenBank or
EMBL databases in a few hours.
Both FASTA and TFASTA (which, like TBLASTN, checks an amino acid sequence against DNA sequence
databases) produce a score representing the quality of the match. This score is not the same as
the significance score determined by one of the BLAST programs. FASTA and TFASTA scores are for
each matching pair
TITLE A test search of the EMBL Other Mammalian DNA sequences
LIB EMAM
WORD 4
LIST 100
ALIGN 20
SEQ
tgcttggctgaggagccataggacgagagcttcctggtgaagtgtgtttcttgaaatcat
caccaccatggacagcaaa
END
Figure 7.7.5 Text of a message sent to the EBI FASTA mail server (e-mail address: [email protected]).
This message requests that the sequence be searched against the Other Mammalian section of
the EMBL database. The answer will include the top 100 matching sequences and alignments of
the top 20 matching sequences.
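The k-tuple dictionary idea behind the initial FASTA search can be illustrated with a short Python sketch. It reproduces only the word-lookup step, using the AGTCCTG example from the text; it is not Pearson and Lipman's implementation, and it does no scoring or gapped alignment. The function names and the second (subject) sequence are hypothetical.

```python
# Minimal sketch of k-tuple word indexing. Names are illustrative only.
from itertools import product

def ktuple_index(seq, k):
    """Map every k-letter word in seq to the list of positions where it starts."""
    index = {}
    for i in range(len(seq) - k + 1):
        index.setdefault(seq[i:i + k], []).append(i)
    return index

def shared_words(query, subject, k):
    """Return sorted (query_pos, subject_pos) pairs for words present in both."""
    q_index = ktuple_index(query, k)
    s_index = ktuple_index(subject, k)
    return sorted(
        (qi, si)
        for word in q_index.keys() & s_index.keys()
        for qi in q_index[word]
        for si in s_index[word]
    )

# The example from the text: with k = 4, AGTCCTG contains four words ...
print(ktuple_index("AGTCCTG", 4))

# ... out of the 4^4 = 256 possible four-letter DNA words.
print(len(list(product("ACGT", repeat=4))))

# Hypothetical subject sequence sharing two overlapping words with the query.
print(shared_words("AGTCCTG", "TTGTCCTA", 4))
```

Word hits that share the same offset between query and subject positions lie on one diagonal, which is why comparing dictionaries locates candidate regions of similarity faster than comparing the sequences base by base. A smaller k yields more words and more chance hits, consistent with the greater sensitivity and longer run time noted in the text.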
of sequences generated using the PAM250 ma- 19.1) using either a command-line interface or
trix described earlier (see Comparison of Two graphical interface (NetBLAST). UNIX and
Sequences). Thus, the score is a numerical VMS versions are available via network file
value representing the quality of the observed similarity between two sequences. A high-scoring similarity is generally a significant match; however, FASTA does not give a significance value like that provided by BLAST programs. With FASTA output, it is up to the user to recognize whether a particular observed match is significant or not, and this determination is not always an easy task. Moreover, with either FASTA or BLAST, determining whether a match is biologically significant is up to the investigator. If the BLAST mathematical significance value indicates that a match is not the result of chance, then the biological significance is very likely to be high. However, if the mathematical significance indicates a large probability that the match is just a random event, this does not necessarily mean that the biological significance is also low. Unfortunately, there is no systematic way to determine biological significance of an apparent homology in cases where mathematical similarity is low. In making this determination, the known or presumed function of the protein, the consensus match to known active sites or sequence motifs, and the number of distinct sequences that fit the proposed homologous group must be taken into account.
Because BLAST and FASTA use different algorithms, it is advisable to perform searches on a given sequence using both of these programs. If a significant match is not identified using one of the programs, use the other.
The BLAST programs are available from e-mail servers and as an Internet service (UNIT [...] servers (see Table 7.7.2). A public domain program called MAILFASTA is now available; this program will take a DNA or protein sequence, reformat it, and e-mail it to any or all of the following e-mail servers: BLAST, BLITZ, BLOCKS, FASTA, GeneID, GRAIL, PredictProtein, and Pythia.

GENETIC SEQUENCE DATABASES AND OTHER ELECTRONIC RESOURCES AVAILABLE TO MOLECULAR BIOLOGISTS
Different gene sequence databases can be searched via the Internet or are available to anyone free of charge either via network file transfer protocol (FTP; see UNIT 19.1) or for the price of the distribution media. Some of the larger sequence analysis packages (e.g., GCG) provide tools that reformat the database data files as they are received from the database distributors. Reformatting is generally performed by a computer systems manager. Many users currently access the protein and DNA databases on the Internet. Detailed descriptions of the databases and how to access them via the Internet can be found in UNIT 19.2. UNIT 19.1 is a general guide to the Internet and includes a list of electronic resources available to molecular biologists. Detailed protocols to carry out homology searches using the BLAST family of programs are provided in UNIT 19.3.
Acknowledgement: We wish to thank Rose Marie Woodsmall and her colleagues at the National Center for Biotechnology Information for their assistance in updating this unit.
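The difference between a raw similarity score and a significance value, discussed above for FASTA and BLAST, can be illustrated with the statistics of Karlin and Altschul (1990). In that framework the number of matches scoring at least S that are expected by chance is E = K * m * n * exp(-lambda * S), where m is the query length and n is the database length. The Python sketch below is only an illustration under stated assumptions: the function names are hypothetical, the values of lambda and K are placeholders for an assumed scoring scheme, and the statistics actually reported by a given BLAST server may be computed differently.

```python
import math


def expected_chance_hits(score, query_len, db_len, lam, k):
    """Expected number of chance matches scoring at least `score`.

    Uses the Karlin-Altschul form E = K * m * n * exp(-lambda * S).
    `lam` and `k` depend on the scoring scheme and residue
    frequencies; the caller must supply them (assumed values here).
    """
    return k * query_len * db_len * math.exp(-lam * score)


def chance_probability(e_value):
    """Probability of seeing at least one chance match: 1 - exp(-E)."""
    # expm1 keeps precision when e_value is very small.
    return -math.expm1(-e_value)


if __name__ == "__main__":
    # Placeholder parameters chosen for illustration only.
    lam, k = 0.32, 0.13
    query_len, db_len = 250, 10_000_000
    for score in (40, 60, 80, 100):
        e = expected_chance_hits(score, query_len, db_len, lam, k)
        p = chance_probability(e)
        print(f"score={score:4d}  E={e:.3g}  P={p:.3g}")
```

A small E (and hence a small probability) indicates only that the match is unlikely to be a chance event. As noted above, a large value does not by itself rule out biological relevance, so the known function of the protein and matches to known motifs still have to be weighed by the investigator.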
LITERATURE CITED
Altschul, S.F., Gish, W., Miller, W., Myers, E.W., and Lipman, D.J. 1990. Basic local alignment search tool. J. Mol. Biol. 215:403-410.
Church, G.M. and Kieffer-Higgins, S. 1988. Multiplex DNA sequencing. Science 240:185-188.
Feng, D.F. and Doolittle, R.F. 1987. Progressive sequence alignment as a prerequisite to correct phylogenetic trees. J. Mol. Evol. 25:351-360.
Fickett, J. 1982. Recognition of protein coding regions in DNA sequences. Nucl. Acids Res. 10:5303-5318.
Freier, S.M., Kierzek, R., Jaeger, J.A., Sugimoto, N., Caruthers, M.H., Neilson, T., and Turner, D.H. 1986. Improved free-energy parameters for predictions of RNA duplex stability. Proc. Natl. Acad. Sci. U.S.A. 83:9373-9377.
Gonnet, G.H., Cohen, M.A., and Benner, S.A. 1992. Exhaustive matching of the entire protein sequence database. Science 256:1443-1445.
Henikoff, S. and Henikoff, J.G. 1993. Performance evaluation of amino acid substitution matrices. Proteins 17:49-61.
Higgins, D.G. and Sharp, P.M. 1988. Clustal: A package for performing multiple sequence alignment on a microcomputer. Gene 73:237-244.
Higgins, D.G. and Sharp, P.M. 1989. Fast and sensitive multiple sequence alignments on a microcomputer. Comp. App. Biosci. 5:151-153.
Karlin, S. and Altschul, S.F. 1990. Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes. Proc. Natl. Acad. Sci. U.S.A. 87:2264-2268.
Needleman, S.B. and Wunsch, C.D. 1970. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol. 48:443-453.
Pearson, W.R. and Lipman, D.J. 1988. Improved tools for biological sequence comparison. Proc. Natl. Acad. Sci. U.S.A. 85:2444-2448.
Schuler, G.D., Altschul, S.F., and Lipman, D.J. 1991. A workbench for multiple alignment construction and analysis. Proteins Struct. Funct. Genet. 9:180-190.
Schwartz, R.M. and Dayhoff, M.O. (eds.) 1978. Matrices for Detecting Distant Relationships: Atlas of Protein Sequence and Structure. National Biomedical Research Foundation, Washington, D.C.
Smith, T.F. and Waterman, M.S. 1981. Identification of common molecular subsequences. J. Mol. Biol. 147:195-197.
Turner, D.H., Sugimoto, N., Jaeger, J.A., Longfellow, C.E., Freier, S.M., and Kierzek, R. 1987. Improved parameters for prediction of RNA structure. Cold Spring Harbor Symp. Quant. Biol. 52:123-133.
Turner, D.H., Sugimoto, N., and Freier, S.M., and Kierzek, R. 1988. RNA structure prediction. Annu. Rev. Biophys. Chem. 17:167-192.
Wilbur, W.J. and Lipman, D.J. 1983. Rapid similarity searches of nucleic acid and protein data banks. Proc. Natl. Acad. Sci. U.S.A. 80:726-730.
Zuker, M. 1989a. On finding all suboptimal foldings of an RNA molecule. Science 244:48-52.
Zuker, M. 1989b. The use of dynamic programming algorithms in RNA secondary structure prediction. In Mathematical Methods for DNA Sequences (M.S. Waterman, ed.) p. 159-184. CRC Press, Boca Raton, Fla.

Contributed by J. Michael Cherry
Stanford University
Palo Alto, California
[email protected]
APPENDIX
This section lists products and other resources designed for DNA and protein sequence
analysis—e.g., databases, software, and journals—and how to obtain them. Some but not
all of these products have been described or cross-referenced earlier in this unit.
Table 7.7.1 Commercially Available Sequence-Analysis Software

Program: Ball & Stick
Description: Molecular graphics display, printing, and manipulation for the Macintosh; a helpful demo is available via ftp.bio.indiana.edu.
Vendor: Cherwell Scientific Publishing, 27 Park End Street, Oxford, OX1 1HU, UK; (44) 865 774 800; FAX (44) 865 794 664

Program: ChemDraw, Chem3D
Description: Desktop publishing for chemical structures in two and three dimensions; very useful for creating journal figures or instructional illustrations of molecular structures.
Vendor: Cambridge Scientific Computing, 875 Massachusetts Ave., Suite 61, Cambridge, MA 02139; (617) 491-6862; FAX (617) 491-8208

Program: DNA Strider
Description: A simple and very useful Macintosh sequence analysis program; includes restriction mapping and circular plasmid maps generated from DNA sequence.
Vendor: Christian Marck, Service de Biochimie, Bat 142, Centre d’Etudes Nucléaires de Saclay, 91191 Gif-sur-Yvette Cedex, France

Program: EUGENE & SAM
Description: Extensive nucleic and amino acid analysis package for Sun Microsystems SPARCstations; includes FASTA for sequence searches and very quick keyword searching on DNA and protein database.
Vendor: Lark Sequencing Technologies, 9545 Katy Freeway, Suite 200, Houston, TX 77024; (713) 464-7488; (800) 288-3720; FAX (713) 464-7492

Program: GCG
Description: Package including nucleic and amino acid sequence analysis, sequencing project management, database searching, RNA folding, protein secondary structure prediction, and sequence motif generation and searching for VMS and UNIX multi-user systems.
Vendor: Genetics Computer Group, University Research Park, 575 Science Drive, Suite B, Madison, WI 53711; (608) 231-5200; FAX (608) 231-5202; E-mail: [email protected]

Program: Gene Construction Kit, DNA Inspector
Description: The ultimate plasmid database, design, and presentation tool for the Macintosh; DNA Inspector is a basic analysis program for the Macintosh; demo disks available.
Vendor: Textco, 27 Gilson Road, West Lebanon, NH 03784; (603) 643-1471

Program: GENEPRO
Description: Complete sequence-analysis software for DOS, including nucleic and amino acid analysis; GenBank and EMBL DNA databases or PIR and SWISS-PROT protein databases can be searched on floppy disks; demo disk available.
Vendor: Riverside Scientific Enterprises, 15705 Point Monroe Drive N.E., Bainbridge Island, WA 98110; (206) 842-9498; FAX (206) 842-9534

Program: HIBIO DNASIS, PROSIS
Description: Complete DNA and protein analysis packages for DOS; includes restriction analysis, secondary structure prediction, sequencing project management, digitizer and speech synthesizer support, and database searches from CD-ROM drive.
Vendor: Hitachi Software Engineering America, Computer Division, 1111 Bayhill Drive, Suite 395, San Bruno, CA 94066; (800) 624-6176; in CA (800) 225-9925; FAX (415) 615-7699

Program: Intelligenetics Suite, PC Gene, GeneWorks
Description: Multifunction sequence analysis package: Intelligenetics Suite is for VMS and Sun Microsystems, PC Gene is for DOS, and GeneWorks is for Macintosh.
Vendor: Intelligenetics, 700 East El Camino Real, Mountain View, CA 94040; (415) 962-7300; FAX (415) 962-7302
Table 7.7.2 Free Sequence-Analysis Software Programs
Web Uniform Resource Locator. See UNIT 19.1 for descriptions of these sources. Contact person is listed in parentheses.