Figure 1. The sensitivity of the two-hit and one-hit heuristics as a function of
HSP score. Using the BLOSUM62 amino acid substitution matrix (18) and the
target frequencies implied by equation 3 and the background amino acid
frequencies P_i of Robinson and Robinson (21), 100 000 model HSPs were
generated for each of the nominal scores 37-92, corresponding to normalized
scores 19.9-45.1 bits. It was determined by inspection whether each HSP failed
to contain two non-overlapping length-3 word pairs with nominal score at least
11 and within a distance 40 of one another, or a single length-3 word pair with
nominal score at least 13. The corresponding probabilities of missing an HSP
using the two-hit heuristic with T = 11 and the one-hit heuristic with T = 13
are plotted as a function of normalized HSP score. The two-hit method is more
sensitive for HSPs with score at least 33 bits.
efficiently. Specifically, we choose a window length A, and
invoke an extension only when two non-overlapping hits are
found within distance A of one another on the same diagonal. Any
hit that overlaps the most recent one is ignored. Efficient
execution requires an array to record, for each diagonal, the first
coordinate of the most recent hit found. Since database sequences
are scanned sequentially, this coordinate always increases for
successive hits. The idea of seeking multiple hits on the same
diagonal was first used in the context of biological database
searches by Wilbur and Lipman (1.
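The per-diagonal bookkeeping described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the BLAST source code: the function name `two_hit_triggers`, the representation of hits as (query position, subject position) pairs, the dictionary used in place of the array, the default word length of 3, and the reading of "within distance A" as an inclusive bound are all choices made for this example.

```python
def two_hit_triggers(hits, window, word_len=3):
    """Return the hits that would trigger an extension under the two-hit rule.

    hits: iterable of (query_pos, subject_pos) word hits for one database
          sequence, in the order found while scanning it left to right.
    window: the window length A.
    word_len: word length, used to decide whether two hits overlap (assumed 3).
    """
    last_hit = {}   # diagonal -> subject coordinate of the most recent hit
    triggers = []
    for query_pos, subject_pos in hits:
        diagonal = subject_pos - query_pos
        previous = last_hit.get(diagonal)
        if previous is not None:
            distance = subject_pos - previous
            if distance < word_len:
                continue            # overlaps the most recent hit: ignored
            if distance <= window:
                triggers.append((query_pos, subject_pos))
        last_hit[diagonal] = subject_pos   # replace the old coordinate
    return triggers


# Hypothetical hits: four on diagonal 10, one on diagonal 100.
hits = [(0, 10), (1, 11), (5, 15), (50, 60), (0, 100)]
print(two_hit_triggers(hits, window=40))
```

In this toy input only the hit at (5, 15) triggers an extension: the hit at (1, 11) overlaps its predecessor and is ignored, the hit at (50, 60) lies more than 40 positions beyond the previous hit on its diagonal, and the hit on diagonal 100 has no partner.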
Because we require two hits rather than one to invoke an
extension, the threshold parameter T must be lowered to retain
comparable sensitivity. The effect is that many more single hits
are found, but only a small fraction have an associated second hit
on the same diagonal that triggers an extension. The great
majority of hits may be dismissed after the minor calculation of
looking up, for the appropriate diagonal, the coordinate of the
most recent hit, checking whether it is within distance A of the
current hit's coordinate, and finally replacing the old with the new
coordinate. Empirically, the computation saved by requiring
fewer extensions more than [...] 38 bits. While this
would appear sufficient for most purposes, the one-hit default T
parameter has typically been set as low as 11, yielding an
execution time nearly three times that for T = 13. Why pay this
price for what appears at best marginal gains in sensitivity? The
3394 Nucleic Acids Research, 1997, Vol. 25, No. 17
The times required by various steps of the BLAST algorithm
vary substantially from one query and one database to another.
Table 1 shows typical relative times spent by the original and the
gapped BLAST programs on various algorithmic stages. The
'original BLAST' program is represented, here and below, by a
variant form of blastp version 1.4.9, modified so that it uses the
same edge-effect correction (22) and background amino acid
frequencies as the 'gapped BLAST'. The times represent the
average for three different queries, with the time for the original
BLAST program normalized in each instance to 100 units.
More concretely, to search SWISS-PROT (26), release 34
(59 576 sequences; 21 219 450 residues), with the length-567
influenza A virus hemagglutinin precursor (27) as query, the
original BLAST program requires 45.8 s, and the gapped BLAST
program 15.8 s. This timing experiment, and others referred to
below, was run on one 200 MHz R10000 CPU of a
lightly loaded SGI Power Challenge XL computer with 2.5
Gbytes of RAM. This machine runs the operating system IRIX,
version 6.2, which is an implementation of UNIX. We used the
standard SGI C compiler, with the -O flag for optimization, to
compile all versions of the programs. The times reported are the
user times given by the time command, and are for the better of
two identical runs.
A closely related type of gapped extension routine to that used
here was developed by G. Myers during the evaluation of the
original BLAST algorithm. It was not included in the publicly
distributed code primarily because the then current strategy of
extending every hit decreased the algorithm's speed unduly for the
relatively small gain in sensitivity realized (1).
As discussed above, the statistical significance of gapped
alignments may be evaluated using the two statistical parameters
λ_g and K_g. The current version of the FASTA program (2) estimates
these parameters on each run, by analyzing the distribution of
alignment scores produced by all the sequences in the database.
BLAST gains speed by producing alignments for only the few
database sequences likely to be related to the query, and therefore
does not have the option of estimating λ_g and K_g on the fly.
Instead, it uses estimates of these parameters produced beforehand
by random simulation (3). A drawback of this approach is
that the program may not accept an arbitrary scoring system, for
which no simulation has been performed, and still produce
accurate estimates of statistical significance. The original BLAST
programs, in contrast, because they dealt only with ungapped
local alignments, could derive λ_u and K_u from theory for any
scoring matrix (8,9).
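To make the role of the two statistical parameters concrete, the sketch below applies the standard Karlin-Altschul relations: a nominal score S is converted to a normalized (bit) score S' = (λS - ln K)/ln 2, and the expected number of chance alignments with score at least S is E = K m n exp(-λS), where m and n are the query and database lengths. The function names are hypothetical, edge-effect corrections are ignored, and the numerical values λ = 0.3176 and K = 0.134 (commonly quoted for ungapped BLOSUM62 alignments) are assumptions that do not come from this section.

```python
import math

# Assumed illustrative values for ungapped BLOSUM62 alignments; they are not
# given in this section.
LAMBDA_U = 0.3176
K_U = 0.134


def bit_score(nominal_score, lam, k):
    """Normalized score in bits: S' = (lam * S - ln k) / ln 2."""
    return (lam * nominal_score - math.log(k)) / math.log(2)


def expected_alignments(nominal_score, query_len, db_len, lam, k):
    """E-value without edge-effect correction: E = k * m * n * exp(-lam * S)."""
    return k * query_len * db_len * math.exp(-lam * nominal_score)


for s in (37, 92):
    print(s, round(bit_score(s, LAMBDA_U, K_U), 1))
```

With these assumed parameters, nominal scores of 37 and 92 correspond to roughly 19.9 and 45.1 bits. The two functions are consistent with each other, since E can equivalently be written as m n 2^(-S').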
ITERATED APPLICATION OF BLAST TO
POSITION-SPECIFIC SCORE MATRICES
Database searches using position-specific score matrices, also
called profiles or motifs, often are much better able to detect weak
relationships than are database searches that use a simple
sequence as query (28-38). Employing these methods, however,
frequently has involved the use of several different programs and
a fair degree of expertise. Accordingly, to render the power of
motif searches more readily available, we have written a
procedure to construct a position-specific score matrix automatically
from the output of a BLAST run, and modified BLAST to
operate using such a matrix in the place of a simple query. The
resulting PSI-BLAST program often is substantially more
sensitive than the corresponding BLAST program, but for each
iteration takes little more than the same time to run. In related
work, Henikoff and Henikoff (39) have described how, short of
modifying BLAST so that it may operate on a position-specific
score matrix, a single artificial sequence that approximates such
a matrix may be used as a query with the original BLAST
programs.
The construction of a position-specific score matrix is a
multi-stage process, and at each stage a choice must be made
among a number of alternative routes. We have been guided by
the goals of automatic operation, speed of execution, and general
simplicity. The issues discussed below are: (i) general architecture
of the score matrix; (ii) construction of the multiple
alignment from which the matrix is derived; (iii) weights for
sequences within the multiple alignment, and evaluation of the
effective number of independent observations it constitutes;
(iv) estimation of target frequencies, and the construction of
matrix scores; (v) applying BLAST to a position-specific matrix,
and the statistical evaluation of search results. We do not claim
our current implementation is optimal, and it is likely that over
time some of its details will change.
Score matrix architecture
The alignment of a simple sequence with a pattern embodied by
a position-specific score matrix is almost completely analogous
to the alignment of two simple sequences. The only real
difference is that the score for aligning a letter with a pattern
position is given by the matrix itself, rather than with reference to
a substitution matrix. For proteins, a query of length L and a
substitution matrix of dimension 20 x 20 are replaced by a
position-specific matrix of dimension L x 20. Position-specific
gap costs may be defined as well (34,40). As with pairwise
sequence comparison, one may choose among finding the best
global alignment of the matrix and the simple sequence (23),
finding the best alignment of the complete matrix with a segment
of the sequence (41), and finding the best local alignment of the
matrix and sequence (2.
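The replacement of a length-L query and a 20 x 20 substitution matrix by an L x 20 position-specific matrix can be illustrated with a small sketch. Everything named here is hypothetical: the toy substitution scores (+5 for identity, -1 otherwise) stand in for a real matrix such as BLOSUM62, the column order of the amino acids is arbitrary, and `pssm_from_query` simply copies substitution-matrix rows, which corresponds to the starting point before any position-specific information has been incorporated.

```python
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"          # 20 letters; column order is arbitrary
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def toy_substitution(a, b):
    """Hypothetical stand-in for a real 20 x 20 substitution matrix."""
    return 5 if a == b else -1


def pssm_from_query(query):
    """Build an L x 20 matrix whose row i holds the scores for query[i]."""
    return [[toy_substitution(q, aa) for aa in AMINO_ACIDS] for q in query]


def ungapped_score(pssm, sequence, matrix_start, seq_start, length):
    """Score an ungapped segment: each letter is looked up in the matrix row
    for the pattern position it is aligned with."""
    return sum(
        pssm[matrix_start + k][INDEX[sequence[seq_start + k]]]
        for k in range(length)
    )


pssm = pssm_from_query("MKV")
print(len(pssm), len(pssm[0]))                 # matrix dimensions L and 20
print(ungapped_score(pssm, "AMKVA", 0, 1, 3))  # "MKV" aligned against itself
```

Once the matrix exists, the alignment code never consults the query or the substitution matrix again, so any row can later be replaced by scores estimated for that specific position.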
Position-specific protein score matrices draw their power from
two sources. The first is improved estimation of the probabilities
with which amino acids occur at various pattern positions, leading
to a more sensitive scoring system. The second is relatively
precise definition of the boundaries of important motifs. By
demanding the complete alignment of one or more motifs, rather
than seeking an arbitrary local alignment, the size of the search
space may be greatly reduced, thereby lowering the level of
random noise. Unfortunately, there are many obstacles to
automating well the delineation of a set of motifs from the output
of a database search. The query sequence may contain a variety
of different domains, and share different subsets of them with
different proteins in the database. Furthermore, defining the
proper extent of even a single motif may be challenging (42).
Accordingly, we have chosen to forgo the potential advantages
of restricting the length of our derived matrices, and then
demanding that they be completely aligned with segments of
database sequences (41). Instead, each matrix we construct has
length precisely equal to that of the original query sequence.
When searching the database with such a matrix, we seek local
alignments, in full analogy to those sought by BLAST when used
for straightforward sequence-sequence comparison. Finally, we
do not attempt to derive position-specific gap scores for use with
our position-specific substitution scores. Instead, in each iteration
Figure 6. The distribution of optimal local alignment scores from the
comparison of a position-specific score matrix with 10 000 random protein
sequences. The score matrix was constructed by PSI-BLAST from the [illegible]
local alignments with E-value [illegible] found in a search of SWISS-PROT using
as query the length-567 influenza A virus hemagglutinin precursor (27)
(SWISS-PROT accession [illegible]). The random sequences, each of length 567,
were generated using the amino acid frequencies of Robinson and Robinson (21).
Optimal local alignment scores were calculated using the position-specific
matrix in conjunction with 10 + k gap costs. The extreme value distribution that
best fits the data ([illegible]) is plotted. A χ² goodness-of-fit test with
[illegible] degrees of freedom has value 41.8, corresponding to a P-value of 0.20.
[Horizontal axis: optimal local alignment score.]
lowest E-value found, as well as the number of shuffled sequences
yielding E-values <1 and 10. For comparison, we performed the
identical shuffled-database test on the gapped and original
versions of BLAST. To reduce the probability that high-scoring
alignments were missed due to the heuristic nature of the
algorithms, we performed these tests with T = 9 rather than the
default value of 11. The results are given in Table 2. For the 11
queries, the median of the low PSI-BLAST E-values was 0.87,
which corresponds to a median P-value of 0.58 (8,9). The mean
numbers of shuffled database sequences with E-values