Basic Local Alignment Search Tool
1. Introduction
The discovery of sequence homology to a known
protein or family of proteins often provides the first
clues about the function of a newly sequenced gene.
As the DNA and amino acid sequence databases
continue to grow in size they become increasingly
useful in the analysis of newly sequenced genes and
proteins because of the greater chance of finding
such homologies. There are a number of software
tools for searching sequence databases but all use
some measure of similarity between sequences to
distinguish biologically significant relationships
from chance similarities. Perhaps the best studied
measures are those used in conjunction with variations of the dynamic programming algorithm
(Needleman & Wunsch, 1970; Sellers, 1974; Sankoff
& Kruskal, 1983; Waterman, 1984). These methods
assign scores to insertions, deletions and replacements, and compute an alignment of two sequences
that corresponds to the least costly set of such
mutations. Such an alignment may be thought of as
minimizing the evolutionary distance or maximizing
the similarity between the two sequences compared.
In either case, the cost of this alignment is a
measure of similarity; the algorithm guarantees it is
0022-2836/90/190403-08 $03.00/0
S. F. Altschul et al.
2. Methods
(a) The maximal segment pair measure
Sequence similarity measures generally can be classified
as either global or local. Global similarity algorithms
optimize the overall alignment of two sequences, which
may include large stretches of low similarity (Needleman
& Wunsch, 1970). Local similarity algorithms seek only
relatively conserved subsequences, and a single comparison may yield several distinct subsequence alignments;
unconserved regions do not contribute to the measure of
similarity (Smith & Waterman, 1981; Goad & Kanehisa,
1982; Sellers, 1984). Local similarity measures are
generally preferred for database searches, where cDNAs
may be compared with partially sequenced genes, and
where distantly related proteins may share only isolated
regions of similarity, e.g. in the vicinity of an active site.
Many similarity measures, including the one we
employ, begin with a matrix of similarity scores for all
possible pairs of residues. Identities and conservative
replacements have positive scores, while unlikely replacements have negative scores. For amino acid sequence
comparisons we generally use the PAM-120 matrix (a
variation of that of Dayhoff et al., 1978), while for DNA
sequence comparisons we score identities +5, and
mismatches -4; other scores are of course possible. A
sequence segment is a contiguous stretch of residues of
any length, and the similarity score for two aligned
segments of the same length is the sum of the similarity
values for each pair of aligned residues.
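The scoring rule just described can be illustrated with a minimal Python sketch. It assumes the DNA scores quoted in the text (+5 for an identity, -4 for a mismatch); the function names `pair_score` and `segment_score` are hypothetical and are not taken from the BLAST program.

```python
def pair_score(a: str, b: str, match: int = 5, mismatch: int = -4) -> int:
    """Similarity score for one pair of aligned residues (DNA scores from the text)."""
    return match if a == b else mismatch


def segment_score(seg1: str, seg2: str) -> int:
    """Sum of the pair scores for two aligned segments of the same length."""
    if len(seg1) != len(seg2):
        raise ValueError("aligned segments must have the same length")
    return sum(pair_score(a, b) for a, b in zip(seg1, seg2))


if __name__ == "__main__":
    # Three identities and one mismatch: 3 * 5 - 4
    print(segment_score("ACGT", "ACGA"))
```

For amino acid sequences the same sum would presumably be taken over entries of a substitution matrix such as PAM-120 rather than over two fixed values.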
Given these rules, we define a maximal segment pair
(MSP) to be the highest scoring pair of identical length
segments chosen from 2 sequences. The boundaries of an
MSP are chosen to maximize its score, so an MSP may be
of any length. The MSP score, which BLAST heuristically
attempts to calculate, provides a measure of local similarity for any pair of sequences. A molecular biologist,
however, may be interested in all conserved regions
shared by 2 proteins, not only in their highest scoring
pair. We therefore define a segment pair to be locally
maximal if its score cannot be improved either by
extending or by shortening both segments (Sellers, 1984).
BLAST can seek all locally maximal segment pairs with
scores above some cutoff.
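As an illustration of the MSP definition, the following Python sketch computes the exact MSP score of two short sequences by scanning every diagonal and keeping the best-scoring contiguous run. This is a plain exhaustive calculation in time proportional to the product of the sequence lengths, not the BLAST heuristic; the function name `msp` and the +5/-4 DNA scores used as defaults are assumptions made for the example.

```python
def msp(seq1: str, seq2: str, match: int = 5, mismatch: int = -4):
    """Return (score, start1, start2, length) of a maximal segment pair.

    Every diagonal (fixed offset between positions in the two sequences) is
    scanned once, and the best-scoring contiguous run on it is tracked with a
    running sum that restarts whenever it drops to zero or below.
    """
    best = (0, 0, 0, 0)
    m, n = len(seq1), len(seq2)
    for d in range(-(m - 1), n):          # d = position in seq2 minus position in seq1
        i = max(0, -d)
        j = i + d
        run, run_start = 0, i
        while i < m and j < n:
            s = match if seq1[i] == seq2[j] else mismatch
            if run <= 0:                  # a non-positive prefix never helps: restart here
                run, run_start = s, i
            else:
                run += s
            if run > best[0]:
                best = (run, run_start, run_start + d, i - run_start + 1)
            i += 1
            j += 1
    return best


if __name__ == "__main__":
    print(msp("GGACGTAC", "TTACGTACAA"))
```

In this toy case the two sequences share the six-residue segment ACGTAC starting at offset 2 in each, so the reported segment pair has length 6. Collecting every run whose score exceeds a cutoff, instead of only the single best one, would give a rough analogue of the locally maximal segment pairs mentioned above.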
Like many other similarity measures, the MSP score for
2 sequences may be computed in time proportional to the
product of their lengths using a simple dynamic programming algorithm. An important advantage of the MSP
measure is that recent mathematical results allow the
statistical significance of MSP scores to be estimated
under an appropriate random sequence model (Karlin &
Altschul, 1990; Karlin et al., 1990). Furthermore, for any
† Abbreviations used: BLAST, basic local alignment
search tool; MSP, maximal segment pair; bp,
base-pair(s).
from the query word list for the full search. Matches to
the sublibrary, however, are reported in the final output.
These 2 filters allow alignments to regions with biased
composition, or to regions containing repetitive elements
to be reported, as long as adjacent regions not containing
such features share significant similarity to the query
sequence.
The BLAST strategy admits numerous variations. We
implemented a version of BLAST that uses dynamic
programming to extend hits so as to allow gaps in the
resulting alignments. Needless to say, this greatly slows
the extension process. While the sensitivity of amino acid
searches was improved in some cases, the selectivity was
reduced as well. Given the trade-off of speed and selectivity for sensitivity, it is questionable whether the gap
version of BLAST constitutes an improvement. We also
implemented the alternative of making a table of all
occurrences of the w-mers in the database, then scanning
the query sequence and processing hits. The disk space
requirements are considerable, approximately 2 computer
words for every residue in the database. More damaging
was that for query sequences of typical length, the need
for random access into the database (as opposed to
sequential access) made the approach slower, on the
computer systems we used, than scanning the entire
database.
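The alternative described above, a table of all occurrences of the w-mers in the database that is then probed with words from the query, might be sketched as follows. This is a toy in-memory version that assumes exact word matches (as for DNA) and ignores the disk layout discussed in the text; `build_wmer_index` and `find_hits` are hypothetical names.

```python
from collections import defaultdict


def build_wmer_index(database, w):
    """Map every w-mer to the list of (sequence index, position) where it occurs."""
    index = defaultdict(list)
    for seq_id, seq in enumerate(database):
        for pos in range(len(seq) - w + 1):
            index[seq[pos:pos + w]].append((seq_id, pos))
    return index


def find_hits(query, index, w):
    """Scan the query and look each of its w-mers up in the index.

    Returns (query position, sequence index, database position) triples.
    Each lookup may touch an unrelated part of the database, which is the
    random-access pattern the text identifies as the drawback.
    """
    hits = []
    for qpos in range(len(query) - w + 1):
        for seq_id, dpos in index.get(query[qpos:qpos + w], []):
            hits.append((qpos, seq_id, dpos))
    return hits


if __name__ == "__main__":
    db = ["ACGTACGT", "TTTT"]
    idx = build_wmer_index(db, 4)
    print(find_hits("ACGT", idx, 4))
```

The index holds one entry per database position, which gives a feel for the storage cost noted above, although the exact figure of 2 computer words per residue depends on the original implementation.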
3. Results
To evaluate the utility of our method, we describe
theoretical results about the statistical significance
of MSP scores, study the accuracy of the algorithm
for random sequences at approximating MSP scores,
compare the performance of the approximation to
the full calculation on a set of related protein
sequences and, finally, demonstrate its performance
comparing long DNA sequences.
(a) Performance of BLAST with random sequences
Theoretical results on the distribution of MSP
scores from the comparison of random sequences
have recently become available (Karlin & Altschul,
1990; Karlin et al., 1990). In brief, given a set of
probabilities for the occurrence of individual
residues, and a set of scores for aligning pairs of
residues, the theory provides two parameters λ and
K for evaluating the statistical significance of MSP
scores. When two random sequences of lengths m
and n are compared, the probability of finding a
segment pair with a score greater than or equal to
S is:
    $1 - e^{-y}$,    (1)
where $y = Kmn\,e^{-\lambda S}$. More generally, the probability of finding c or more distinct segment pairs,
all with a score of at least S, is given by the formula:
    $1 - e^{-y} \sum_{i=0}^{c-1} \frac{y^{i}}{i!}$    (2)
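Equations (1) and (2) are straightforward to evaluate numerically. The Python sketch below is only an illustration: the values of K and λ passed in the example call are hypothetical placeholders, since the actual parameters depend on the residue frequencies and the scoring matrix, and the function names are not from the original program.

```python
import math


def msp_y(K: float, lam: float, m: int, n: int, S: float) -> float:
    """y = K * m * n * exp(-lambda * S), as defined after equation (1)."""
    return K * m * n * math.exp(-lam * S)


def p_at_least(c: int, y: float) -> float:
    """Equation (2): probability of c or more distinct segment pairs scoring >= S.

    With c = 1 the sum has the single term 1, which recovers equation (1).
    """
    tail = sum(y ** i / math.factorial(i) for i in range(c))
    return 1.0 - math.exp(-y) * tail


if __name__ == "__main__":
    # K and lam below are made-up illustrative values, not those of any real matrix.
    y = msp_y(K=0.1, lam=0.3, m=250, n=250, S=40)
    print(p_at_least(1, y), p_at_least(2, y))
```

Since y is the Poisson mean implied by equation (2), the probability falls as c grows for fixed S, and for small y equation (1) is approximately equal to y itself.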
threshold parameters
On what basis do we choose the particular setting
of the parameters w and T for executing BLAST on
real data? We begin by considering the word
length w.
The time required to execute BLAST is the sum
of the times required (1) to compile a list of words
that can score at least T when compared with words
from the query; (2) to scan the database for hits (i.e.
matches to words on this list); and (3) to extend all
hits to seek segment pairs with scores exceeding the
cutoff. The time for the last of these tasks is proportional to the number of hits, which clearly depends
on the parameters w and T. Given a random protein
model and a set of substitution scores, it is simple to
calculate the probability that two random words of
length w will have a score of at least T, i.e. the
probability of a hit arising from an arbitrary pair of
words in the query and the database. Using the
random model and scores of the previous section, we
have calculated these probabilities for a variety of
parameter choices and recorded them in Table 1.
For a given level of sensitivity (chance of missing an
MSP), one can ask what choice of w minimizes the
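The hit probability discussed in the preceding paragraph, the chance that two random words of length w score at least T, can be computed exactly by convolving the score distribution of a single residue pair w times. The Python sketch below assumes independent residues and, because the random protein model and PAM-120 values are not reproduced in this text, is exercised on a toy DNA model (uniform base frequencies, +5/-4 scores). The function names are hypothetical.

```python
from collections import defaultdict
from itertools import product


def pair_score_distribution(freqs, score):
    """Distribution of the score of one aligned pair of independent random residues."""
    dist = defaultdict(float)
    for (a, pa), (b, pb) in product(freqs.items(), repeat=2):
        dist[score(a, b)] += pa * pb
    return dict(dist)


def hit_probability(freqs, score, w, T):
    """Probability that two random words of length w have a total score of at least T."""
    single = pair_score_distribution(freqs, score)
    total = {0: 1.0}                      # score distribution of the empty word pair
    for _ in range(w):                    # add one aligned residue pair at a time
        nxt = defaultdict(float)
        for s, p in total.items():
            for t, q in single.items():
                nxt[s + t] += p * q
        total = dict(nxt)
    return sum(p for s, p in total.items() if s >= T)


if __name__ == "__main__":
    dna = {base: 0.25 for base in "ACGT"}          # toy model, assumed for the example
    dna_score = lambda a, b: 5 if a == b else -4   # DNA scores quoted in the Methods
    print(hit_probability(dna, dna_score, w=2, T=10))
```

Passing amino acid frequencies and a substitution matrix lookup in place of the toy DNA model should yield probabilities of the kind tabulated in Table 1, under the assumption that the same independence model was used there.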
Table 1
The probability of a hit at various settings of the parameters w and T, and the
proportion of random MSPs missed by BLAST

       Probability    Linear regression
       of a hit       -ln(q) = aS + b        % of MSPs missed by BLAST when S equals
  T    x 10^5           a         b          45   50   55   60   65   70   75

 11     253           0.1236   -1.005         1    1    0    0    0    0    0
 12     147           0.0875   -0.746         4    3    2    1    1    0    0
 13      83           0.0625   -0.570        11    8    6    4    3    2    2
 14      48           0.0463   -0.461        20   16   12   10    8    6    5
 15      26           0.0328   -0.353        33   28   23   20   17   14   12
 16      14           0.0232   -0.263        46   41   36   32   29   26   23
 17       7           0.0158   -0.191        59   55   51   47   43   40   37
 18       4           0.0109   -0.137        70   67   63   60   57   54   51

 13     127           0.1192   -1.278         2    1    1    0    0    0    0
 14      78           0.0904   -1.012         5    3    2    1    1    0    0
 15      47           0.0686   -0.802        10    7    5    4    3    2    1
 16      28           0.0519   -0.634        18   14   11    8    6    5    4
 17      16           0.0390   -0.498        28   23   19   16   13   11    9
 18       9           0.0290   -0.387        40   35   30   26   22   19   17
 19       5           0.0215   -0.298        51   46   41   37   33   30   27
 20       3           0.0159   -0.234        62   57   53   49   45   41   38

 15      64           0.1137   -1.525         3    2    1    1    0    0    0
 16      40           0.0882   -1.207         6    4    3    2    1    1    0
 17      25           0.0679   -0.939        12    9    6    4    3    2    2
 18      15           0.0529   -0.754        20   15   12    9    7    5    4
 19       9           0.0413   -0.608        29   23   19   15   13   10    8
 20       5           0.0327   -0.506        38   32   28   23   20   17   14
 21       3           0.0257   -0.420        48   42   37   32   29   25   22
 22       2           0.0200   -0.343        57   52   47   42   38   35   31

Further values extracted with the table, label not preserved: 0.3, 0.06, 0.01, 0.002
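The percentage columns of Table 1 appear to follow from the regression coefficients in the same row: with -ln(q) = aS + b, the fraction of MSPs missed is q = exp(-(aS + b)). The Python sketch below, with hypothetical function names, recomputes a row from its a and b values under that reading.

```python
import math


def fraction_missed(a: float, b: float, S: float) -> float:
    """Invert the regression -ln(q) = a*S + b to obtain q."""
    return math.exp(-(a * S + b))


def percent_missed_row(a: float, b: float, scores=(45, 50, 55, 60, 65, 70, 75)):
    """Percentage of MSPs missed at each score S, rounded to the nearest integer."""
    return [round(100 * fraction_missed(a, b, S)) for S in scores]


if __name__ == "__main__":
    # Coefficients copied from the T = 13 row of the first block of Table 1.
    print(percent_missed_row(0.0625, -0.570))
```

Because the published coefficients are themselves rounded, a recomputed cell can differ from the tabulated integer by one unit in some rows.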
[Figure: plot with horizontal axis "Words (x 10^-4)"; axis ticks omitted.]
Table
10
20
39
25
17
12
25
17
12
17
12
9
S:        44     55     70     90
p-value:  1.0    0.8    0.01   10^-5
12
9
Table 3

                                        Number of MSPs
                             Cutoff     in superfamily     MSPs found by BLAST with T set to
         Superfamily         score      with score
Query    searched            S          at least S          22    20    19    18    17    16    15

MYMQW    Globin              47         285                115   169   178   222   238   255   281
KVMSTI   Immunoglobulin      47         158                153   155   155   156   156   157   158
OKBOG    Protein kinase      52          60                  9    42    47    59    60    60    60
ITHU     Serpin              50          12                 12    12    12    12    12    12    12
KYBOA    Serine protease     49          59                 59    59    59    59    59    59    59
CCHU     Cytochrome c        46          98                 81    91    91    96    98    98    98
FECF     Ferredoxin          44          24                 22    23    23    24    24    24    24
MYMQW, woolly monkey myoglobin; KVMSTI, mouse Ig κ chain precursor V region; OKBOG, bovine cGMP-dependent protein
kinase; ITHU, human α-1-antitrypsin precursor; KYBOA, bovine chymotrypsinogen A; CCHU, human cytochrome c; FECF,
Chlorobium sp. ferredoxin.
Table 4

  w     Time     Words      Hits       Matches
  8     15.9     44,587     118,941    130
  9      6.8     44,586      39,218    123
 10      4.3     44,585      15,321    114
 11      3.5     44,584       7345     106
 12      3.2     44,583       4197      98
4. Conclusion
References
Coulson, A. F. W., Collins, J. F. & Lyall, A. (1987).
Comput. J. 30, 420-424.
Edited by S. Brenner