
F 56665

Uploaded by

Maria Tsirka
Figure 1. The sensitivity of the two-hit and one-hit heuristics as a function of HSP score. Using the BLOSUM-62 amino acid substitution matrix (18), the target frequencies q_ij implied by equation 3, and the background amino acid frequencies P_i of Robinson and Robinson (2…), 100 000 model HSPs were generated for each of the nominal scores 37–92, corresponding to normalized scores 19.9–45.1 bits. It was determined by inspection whether each HSP failed to contain two non-overlapping length-3 word pairs with nominal score at least 11 and within a distance of 40 of one another, or failed to contain a single length-3 word pair with nominal score at least 13. The corresponding probabilities of missing an HSP using the two-hit heuristic with T = 11 and the one-hit heuristic with T = 13 are plotted as a function of normalized HSP score. The two-hit method is more sensitive for HSPs with score at least 33 bits.

[…] efficiently. Specifically, we choose a window length A, and invoke an extension only when two non-overlapping hits are found within distance A of one another on the same diagonal. Any hit that overlaps the most recent one is ignored. Efficient execution requires an array to record, for each diagonal, the first coordinate of the most recent hit found. Since database sequences are scanned sequentially, this coordinate always increases for successive hits. The idea of seeking multiple hits on the same diagonal was first used in the context of biological database searches by Wilbur and Lipman (1…).

Because we require two hits rather than one to invoke an extension, the threshold parameter T must be lowered to retain comparable sensitivity. The effect is that many more single hits are found, but only a small fraction have an associated second hit on the same diagonal that triggers an extension. The great majority of hits may be dismissed after the minor calculation of looking up, for the appropriate diagonal, the coordinate of the most recent hit, checking whether it is within distance A of the current hit's coordinate, and finally replacing the old with the new coordinate. Empirically, the computation saved by requiring fewer extensions more than […] 38 bits.
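The per-diagonal bookkeeping described above can be sketched in a few lines of Python. This is a minimal illustration, not the BLAST source: the function name `two_hit_triggers`, its parameters, the 0-based coordinates, and the choice to treat "within distance A" as `distance <= window` are all assumptions made for the example. It also omits the extension step itself and any handling of hits that fall inside an already extended region.

```python
from typing import Iterable, List, Tuple


def two_hit_triggers(
    hits: Iterable[Tuple[int, int]],
    query_len: int,
    db_len: int,
    word_len: int = 3,   # length of a word pair (assumed default)
    window: int = 40,    # window length A (assumed default)
) -> List[Tuple[int, int]]:
    """Return the hits that would trigger an extension under the two-hit rule.

    `hits` holds (db_pos, query_pos) start coordinates for one database
    sequence, in the order in which the database sequence is scanned, so
    db_pos never decreases along a given diagonal.
    """
    # One slot per diagonal: the first (database) coordinate of the most
    # recent hit seen on that diagonal, or -1 if none has been seen yet.
    last_hit = [-1] * (query_len + db_len - 1)
    triggers = []
    for db_pos, query_pos in hits:
        # Diagonal = db_pos - query_pos, shifted so the index is non-negative.
        diag = db_pos - query_pos + (query_len - 1)
        prev = last_hit[diag]
        if prev >= 0:
            distance = db_pos - prev
            if distance < word_len:
                # Overlaps the most recent hit on this diagonal: ignore it.
                continue
            if distance <= window:
                # Second non-overlapping hit inside the window.
                triggers.append((db_pos, query_pos))
        # Replace the old coordinate with the new one.
        last_hit[diag] = db_pos
    return triggers
```

For example, with hits at (0, 0), (1, 1), (10, 10) and (60, 60) on the same diagonal, the hit at (1, 1) is ignored because it overlaps the first one, (10, 10) triggers an extension, and (60, 60) does not because it lies more than 40 positions past the previous hit. Each hit costs one array lookup, one comparison and one store, which is the "minor calculation" the text refers to.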
While this would appear sufficient for most purposes, the one-hit default T parameter has typically been set as low as 11, yielding an execution time nearly three times that for T = 13. Why pay this price for what appears at best marginal gains in sensitivity? The […]

Nucleic Acids Research, 1997, Vol. 25, No. 17

The times required by various steps of the BLAST algorithm vary substantially from one query and one database to another. Table 1 shows typical relative times spent by the original and the gapped BLAST programs on various algorithmic stages. The 'original BLAST' program is represented, here and below, by a variant form of blastp version 1.4.9, modified so that it uses the same edge-effect correction (22) and background amino acid frequencies as the 'gapped BLAST'. The times represent the average for three different queries, with the time for the original BLAST program normalized in each instance to 100 units.

More concretely, to search SWISS-PROT (26), release 34 (59 576 sequences; 21 219 450 residues), with the length-567 influenza A virus hemagglutinin precursor (27) as query, the original BLAST program requires 45.8 s, and the gapped BLAST program 15.8 s. This timing experiment, and others referred to below, was run on one 200 MHz R10000 CPU of a lightly loaded SGI Power Challenge XL computer with 2.5 Gbytes of RAM. This machine runs the operating system IRIX, version 6.2, which is an implementation of UNIX. We used the standard SGI C compiler, with the -O flag for optimization, to compile all versions of the programs. The times reported are the user times given by the time command, and are for the better of two identical runs.

A closely related type of gapped extension routine to that used here was developed by G. Myers during the evaluation of the original BLAST algorithm.
It was not included in the publicly distributed code primarily because the then current strategy of extending every hit decreased the algorithm's speed unduly for the relatively small gain in sensitivity realized (1).

As discussed above, the statistical significance of gapped alignments may be evaluated using the two statistical parameters λ_g and K_g. The current version of the Fasta program (2) estimates these parameters on each run, by analyzing the distribution of alignment scores produced by all the sequences in the database. BLAST gains speed by producing alignments for only the few database sequences likely to be related to the query, and therefore does not have the option of estimating λ_g and K_g on the fly. Instead, it uses estimates of these parameters produced beforehand by random simulation (3). A drawback of this approach is that the program may not accept an arbitrary scoring system, for which no simulation has been performed, and still produce accurate estimates of statistical significance. The original BLAST programs, in contrast, because they dealt only with ungapped local alignments, could derive λ_u and K_u from theory for any scoring matrix (8,9).

ITERATED APPLICATION OF BLAST TO POSITION-SPECIFIC SCORE MATRICES

Database searches using position-specific score matrices, also called profiles or motifs, often are much better able to detect weak relationships than are database searches that use a simple sequence as query (28-38). Employing these methods, however, frequently has involved the use of several different programs and a fair degree of expertise. Accordingly, to render the power of motif searches more readily available, we have written a procedure to construct a position-specific score matrix automatically from the output of a BLAST run, and modified BLAST to operate using such a matrix in the place of a simple query.
The resulting PSI-BLAST program often is substantially more sensitive than the corresponding BLAST program, but for each iteration takes little more than the same time to run. In related work, Henikoff and Henikoff (39) have described how, short of modifying BLAST so that it may operate on a position-specific score matrix, a single artificial sequence that approximates such a matrix may be used as a query with the original BLAST programs.

The construction of a position-specific score matrix is a multi-stage process, and at each stage a choice must be made among a number of alternative routes. We have been guided by the goals of automatic operation, speed of execution, and general simplicity. The issues discussed below are: (i) general architecture of the score matrix; (ii) construction of the multiple alignment from which the matrix is derived; (iii) weights for sequences within the multiple alignment, and evaluation of the effective number of independent observations it constitutes; (iv) estimation of target frequencies, and the construction of matrix scores; (v) applying BLAST to a position-specific matrix, and the statistical evaluation of search results. We do not claim our current implementation is optimal, and it is likely that over time some of its details will change.

Score matrix architecture

The alignment of a simple sequence with a pattern embodied by a position-specific score matrix is almost completely analogous to the alignment of two simple sequences. The only real difference is that the score for aligning a letter with a pattern position is given by the matrix itself, rather than with reference to a substitution matrix. For proteins, a query of length L and a substitution matrix of dimension 20 x 20 are replaced by a position-specific matrix of dimension L x 20. Position-specific gap costs may be defined as well (34,40).
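The replacement of a length-L query plus a 20 x 20 substitution matrix by a single L x 20 matrix can be illustrated with a small sketch. Everything named here is hypothetical: the toy substitution matrix (+5 for identity, -1 otherwise) stands in for a real matrix such as BLOSUM-62, and the helper names are invented for the example. The sketch scores ungapped alignments only.

```python
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

# Toy 20 x 20 substitution matrix (an assumption, not BLOSUM-62):
# +5 on the diagonal, -1 everywhere else.
TOY_MATRIX = [[5 if i == j else -1 for j in range(20)] for i in range(20)]


def score_with_substitution_matrix(query, subject, sub):
    """Ungapped score of two equal-length sequences via a 20 x 20 matrix."""
    return sum(sub[AA_INDEX[q]][AA_INDEX[s]] for q, s in zip(query, subject))


def pssm_from_query(query, sub):
    """Build an L x 20 matrix whose row i is the substitution row of query[i].

    This is the trivial position-specific matrix: it reproduces ordinary
    sequence-sequence scores exactly.
    """
    return [list(sub[AA_INDEX[q]]) for q in query]


def score_with_pssm(pssm, subject, offset=0):
    """Ungapped score of `subject` aligned to matrix rows starting at `offset`.

    The score for a letter at a pattern position is read from the matrix
    itself; no substitution matrix is consulted.
    """
    return sum(pssm[offset + i][AA_INDEX[s]] for i, s in enumerate(subject))
```

With the query "ACDW" and the subject "ACDY", both routes give the same score under the toy matrix. Raising a single entry, for instance the score for Y at the last position, changes the score at that position only, which is the extra freedom a position-specific matrix offers over a fixed substitution matrix.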
As with pairwise sequence comparison, one may choose among finding the best global alignment of the matrix and the simple sequence (23), finding the best alignment of the complete matrix with a segment of the sequence (41), and finding the best local alignment of the matrix and sequence (2…).

Position-specific protein score matrices draw their power from two sources. The first is improved estimation of the probabilities with which amino acids occur at various pattern positions, leading to a more sensitive scoring system. The second is relatively precise definition of the boundaries of important motifs. By demanding the complete alignment of one or more motifs, rather than seeking an arbitrary local alignment, the size of the search space may be greatly reduced, thereby lowering the level of random noise.

Unfortunately, there are many obstacles to automating well the delineation of a set of motifs from the output of a database search. The query sequence may contain a variety of different domains, and share different subsets of them with different proteins in the database. Furthermore, defining the proper extent of even a single motif may be challenging (42). Accordingly, we have chosen to forgo the potential advantages of restricting the length of our derived matrices, and then demanding that they be completely aligned with segments of database sequences (41). Instead, each matrix we construct has length precisely equal to that of the original query sequence. When searching the database with such a matrix, we seek local alignments, in full analogy to those sought by BLAST when used for straightforward sequence-sequence comparison. Finally, we do not attempt to derive position-specific gap scores for use with our position-specific substitution scores. Instead, in each iteration […]

[Figure: histogram; horizontal axis: optimal local alignment score]

Figure 6.
The distribution of optimal local alignment scores from the comparison of a position-specific score matrix with 10 000 random protein sequences. The score matrix was constructed by PSI-BLAST from the 12[…] local alignments with E-value […] found in a search of SWISS-PROT using as query the length-567 influenza A virus hemagglutinin precursor (27) (SWISS-PROT accession no. P03435). The random sequences, each of length 567, were generated using the amino acid frequencies of Robinson and Robinson (2…). Optimal local alignment scores were calculated using the position-specific matrix in conjunction with 10 + k gap costs. The extreme value distribution that best fits the data (3,1…) is plotted. A χ² goodness-of-fit test with 3[…] degrees of freedom has value 41.8, corresponding to a P-value of 0.20.

[…] lowest E-value found, as well as the number of shuffled sequences yielding E-values ≤1 and ≤10. For comparison, we performed the identical shuffled-database test on the gapped and original versions of BLAST. To reduce the probability that high-scoring alignments were missed due to the heuristic nature of the algorithms, we performed these tests with T = 9 rather than the default value of 11. The results are given in Table 2. For the 11 queries, the median of the low PSI-BLAST E-values was 0.87, which corresponds to a median P-value of 0.58 (8,9). The mean numbers of shuffled database sequences with E-values […]
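The correspondence between the median E-value of 0.87 and the median P-value of 0.58 can be checked with the standard relation between the two quantities, under the usual assumption that the number of chance alignments scoring at least as high as the one observed is Poisson distributed with mean E. The worked step below is a check added for illustration; the excerpt states only the two end values.

```latex
P \;=\; 1 - e^{-E}
\qquad\Longrightarrow\qquad
P \;=\; 1 - e^{-0.87} \;\approx\; 1 - 0.419 \;=\; 0.581 \;\approx\; 0.58
```

For small E the two quantities nearly coincide (P ≈ E), while for E near 1, as here, the P-value is noticeably smaller than the E-value.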
