Blast & Fasta
Searching
HelixInfoSystems
Predominant application
for database searches
Database searching
One approach: use the Smith-Waterman algorithm
Problem?
Databases are huge (GenBank ~30 million sequences, Swiss-Prot >> 100,000 sequences)
S-W is slow, O(N n^2), where n is the sequence length and N is the number of sequences in the database
Solution?
Use faster heuristic approaches
FastA: Fast Alignment
BLAST: Basic Local Alignment Search Tool
What is a heuristic?
A heuristic is a method for solving a problem that does not guarantee a provably correct solution, but which usually produces a good one.
It solves a simpler problem whose solution contains or intersects with the solution of the more complex problem.
Heuristics are intended to gain computational performance or conceptual simplicity, potentially at the cost of accuracy or precision.
FASTA
Developed ~1985 by Lipman and Pearson (now many variants/updates/improvements)
Goal: Perform fast, approximate local alignments to find sequences in the database that are related to the query sequence
Steps in FastA
1. Choose a value for the ktup parameter: FASTA will look for exact matches of this length between the query and target sequences.
Typically ktup=6 for DNA (range 4-6) and ktup=1 (range 1-2) for protein (why the difference?)
2. Find hot spots (locations of matching ktup-length substrings) in a dot plot
Hashing technique
The computational trick that makes FASTA fast is how it locates the hot spots.
It uses a hashing technique: map a string of 1, 2, or more characters to an integer, e.g.
AAA -> 0
AAC -> 1
...
TTT -> 63 (oversimplified)
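The mapping above can be sketched as reading the k-tuple as a base-4 number. This is a minimal illustration, not FASTA's actual code; the function name `ktup_hash` and the digit assignment A=0, C=1, G=2, T=3 are assumptions chosen so that the slide's values come out.

```python
# Assumed digit assignment (not stated on the slide): A=0, C=1, G=2, T=3.
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def ktup_hash(word):
    """Map a DNA k-tuple to an integer in [0, 4**k) by reading it as a base-4 number."""
    code = 0
    for base in word:
        code = code * 4 + BASE_CODE[base]
    return code


print(ktup_hash("AAA"), ktup_hash("AAC"), ktup_hash("TTT"))
```

Because every k-tuple gets a distinct integer, the code can be used directly as an index into an array of size 4^k.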
Hashing technique
Can preprocess the database and create a table that stores the locations (offsets) of each possible k-tuple:
20^k for amino acids (400 if k=2),
4^k for DNA (4096 if k=6).
Then use the hash code computed from the query sequence k-tuples to look up these entries quickly.
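A rough sketch of the lookup-table idea follows. It is an illustration under assumptions, not FASTA's implementation: the names `build_ktup_table` and `find_hot_spots` are hypothetical, and a Python dictionary keyed by the k-tuple string stands in for an explicit integer hash code.

```python
from collections import defaultdict


def build_ktup_table(seq, k):
    """Preprocess a sequence: map each k-tuple to the list of offsets where it occurs."""
    table = defaultdict(list)
    for i in range(len(seq) - k + 1):
        table[seq[i:i + k]].append(i)
    return table


def find_hot_spots(query, target_table, k):
    """Return (query_offset, target_offset) pairs for every exact k-tuple match."""
    hot_spots = []
    for i in range(len(query) - k + 1):
        for j in target_table.get(query[i:i + k], []):
            hot_spots.append((i, j))
    return hot_spots


table = build_ktup_table("ACGTACGT", 4)
print(find_hot_spots("TTACGT", table, 4))
```

Each query k-tuple costs one table lookup, so the hot spots are found without comparing the query against every position of the target.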
Example
[Figure not recovered: worked example of the k-tuple lookup table.]
Steps in FastA (contd.)
3. Find the 10 best diagonal runs (sequences of nearby hot spots on the same diagonal).
FASTA gives each hot spot a positive score, and each space between consecutive hot spots a negative score that becomes more negative as the distance grows.
Each diagonal run is composed of matches (the hot spots themselves) and mismatches (interspot regions), but does not contain indels because all of its hot spots are on the same diagonal.
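The diagonal-run step can be sketched as follows. This is a simplified illustration, not FASTA's exact scoring: the function name, the fixed `hit_score`, and the linear `gap_penalty` per position between hot spots are all assumptions. A hot spot at query offset i and target offset j lies on diagonal j - i.

```python
from collections import defaultdict


def best_diagonal_runs(hot_spots, k, hit_score=5, gap_penalty=1, top_n=10):
    """Score the best run of hot spots on each diagonal; return the top_n (score, diagonal) pairs."""
    by_diagonal = defaultdict(list)
    for i, j in hot_spots:
        by_diagonal[j - i].append(i)

    runs = []
    for diagonal, offsets in by_diagonal.items():
        offsets.sort()
        best = running = 0
        prev_end = None
        for i in offsets:
            # Distance between the end of the previous hot spot and this one.
            gap = 0 if prev_end is None else max(0, i - prev_end)
            # Either extend the current run (paying for the gap) or start a new run here.
            running = max(hit_score, running - gap_penalty * gap + hit_score)
            best = max(best, running)
            prev_end = i + k
        runs.append((best, diagonal))

    runs.sort(reverse=True)
    return runs[:top_n]


print(best_diagonal_runs([(0, 2), (3, 5), (20, 22), (1, 9)], k=2))
```

In the usage line, the two nearby hot spots on diagonal 2 combine into one run, while the distant third hot spot on that diagonal is too far away to improve it.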
Steps in FastA (contd.)
4. Evaluate each diagonal run using an appropriate scoring matrix (PAM-n, BLOSUM-n, etc.) and find the best-scoring run; its score is init1.
Runs with low scores are discarded (filtration).
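One way to picture the rescoring step is to walk along a single diagonal and keep the best-scoring ungapped segment. The sketch below is an assumption-laden illustration: `toy_score` (match +5, mismatch -4) is a hypothetical stand-in for a real PAM or BLOSUM matrix, and `best_ungapped_score` is not a FASTA function.

```python
def toy_score(a, b):
    """Hypothetical stand-in for a substitution matrix lookup."""
    return 5 if a == b else -4


def best_ungapped_score(query, target, diagonal, score=toy_score):
    """Best-scoring ungapped segment along one diagonal (target offset - query offset == diagonal)."""
    best = running = 0
    for i in range(max(0, -diagonal), len(query)):
        j = i + diagonal
        if j >= len(target):
            break
        # Reset to zero when the running score goes negative (local alignment).
        running = max(0, running + score(query[i], target[j]))
        best = max(best, running)
    return best


print(best_ungapped_score("ACGTT", "GGACGAT", diagonal=2))
```

Under this sketch, init1 would be the maximum of `best_ungapped_score` over the retained diagonals.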
Steps in FastA (contd.)
5. Try to combine good diagonal runs from close diagonals, now allowing indels.
"Good" means runs whose score exceeds a chosen threshold. The score of the best combination of joined runs is initn.
Steps in FastA (contd.)
6. If the initn score reaches a threshold value, compute the opt score using a banded Smith-Waterman alignment (don't waste time on this otherwise).
7. Rank database sequences according to their opt scores; use the full Smith-Waterman method (no band) to align the query sequence against each of the highest-ranking sequences from the database.
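For reference, the full (unbanded) Smith-Waterman score used in the final step can be sketched in a few lines. This is a minimal version under stated assumptions: a simple match/mismatch score and a linear gap penalty replace the substitution matrix and gap model a real search would use, and only the score is returned, not the alignment.

```python
def smith_waterman_score(a, b, match=5, mismatch=-4, gap=-6):
    """Best local alignment score of a and b (linear gap penalty, score only)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(
                0,                 # start a new local alignment
                prev[j - 1] + s,   # align a[i-1] with b[j-1]
                prev[j] + gap,     # gap in b
                curr[j - 1] + gap, # gap in a
            )
            best = max(best, curr[j])
        prev = curr
    return best


print(smith_waterman_score("ACGTACGT", "ACGTTACGT"))
```

The nested loops make the cost proportional to the product of the two sequence lengths, which is why FASTA reserves this step for the few highest-ranking sequences.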
Steps in FastA (contd.)
8. Perform a statistical analysis of the probability that the given level of matching would be obtained by chance if the sequences were unrelated.
FASTA Results
When initn = init1 = opt:
100% homology over the matched stretch.
When initn > init1:
More than one matching region in the database sequence, separated by poorly matching regions.
When opt > initn:
The match is greatly improved by adding gaps in one or both of the sequences.
BLAST
Improves search speed of FASTA
Retains sensitivity of searches
BLAST Algorithm
Step 1
Step 1 - Example
Size w words in the query sequence.
[Figure: table of words (e.g. GNW, GAW) with per-residue scores (6,5,11; 6,1,11; 0,5,11; 0,5,11; 6,5,2) and total word scores 22, 18, 16, 16, 13, 10, 9.]
[Figure: a word hit ("Hit!") is extended in both directions.]
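The hit-extension idea in the figure can be sketched as an ungapped extension in both directions that stops once the score falls too far below the best value seen. This is a simplified illustration, not BLAST's code: `extend_hit`, `toy_score`, and the `x_drop` default are hypothetical names and values.

```python
def toy_score(a, b):
    """Hypothetical stand-in for a substitution matrix lookup."""
    return 5 if a == b else -4


def extend_hit(query, target, q_pos, t_pos, w, score=toy_score, x_drop=10):
    """Extend a length-w word hit left and right without gaps; return the best total score."""
    seed = sum(score(query[q_pos + d], target[t_pos + d]) for d in range(w))

    def extend(step, q, t):
        best = running = 0
        while 0 <= q < len(query) and 0 <= t < len(target):
            running += score(query[q], target[t])
            if running > best:
                best = running
            elif best - running > x_drop:
                break  # score dropped too far below the best seen: stop extending
            q += step
            t += step
        return best

    right = extend(1, q_pos + w, t_pos + w)
    left = extend(-1, q_pos - 1, t_pos - 1)
    return seed + left + right


print(extend_hit("AACGTA", "TACGTA", q_pos=1, t_pos=1, w=4))
```

In the usage line, the seed ACGT scores 20, the extension to the right gains one more match, and the extension to the left gains nothing.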
Step 3
Interpreting BLAST Results
Bit Score
E values and p values
E values
The Expect value (E) is the number of alignments with scores greater than or equal to score S that are expected to occur by chance in a database search.
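E-Value
E = K m n e^{-\lambda S}
where m is the query length, n is the database length, S is the alignment score, and K and \lambda are statistical parameters of the scoring system.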
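The E-value formula can be evaluated directly, assuming the usual Karlin-Altschul form with a scale parameter lambda in the exponent. In the sketch below the function names are hypothetical and the values of `K` and `lam` are illustrative placeholders, not parameters taken from the slides. The bit score S' = (lambda S - ln K) / ln 2 is included because it gives the equivalent form E = m n 2^(-S').

```python
import math


def e_value(raw_score, m, n, K, lam):
    """Expected number of chance alignments scoring at least raw_score."""
    return K * m * n * math.exp(-lam * raw_score)


def bit_score(raw_score, K, lam):
    """Normalized score in bits, independent of the scoring system's scale."""
    return (lam * raw_score - math.log(K)) / math.log(2)


# Illustrative placeholder parameters (assumed, not from the source).
K, lam = 0.13, 0.32
m, n = 250, 5_000_000

for s in (40, 60, 80):
    print(s, round(bit_score(s, K, lam), 1), e_value(s, m, n, K, lam))
```

The loop shows the exponential dependence on score: each fixed increase in raw score divides the E-value by a constant factor.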
Interpreting scores
% identity is not the best indicator of homology
Statistical theory in next talk
E-value < 0.001 is typically used to infer homology
Hits with E-values > 0.001 may still be homologous; to check, use:
Analysis of conservation of functional motifs
More sensitive techniques
BLAST family of programs
blastp - amino acid query sequence against a protein sequence database
blastn - nucleotide query sequence against a nucleotide sequence database
blastx - nucleotide query sequence translated in all reading frames against a protein database
tblastn - protein query sequence against a nucleotide sequence database dynamically translated in all reading frames
tblastx - six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database
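"Translated in all reading frames" means three frames on the forward strand and three on the reverse complement. A minimal sketch of that six-frame translation, assuming the standard genetic code and using hypothetical helper names (`translate`, `six_frame_translations`), with `X` for codons containing unrecognized characters:

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate complete codons from the start of dna; '*' marks a stop codon."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3))


def six_frame_translations(dna):
    """Return {frame: protein} for frames +1, +2, +3 and -1, -2, -3."""
    dna = dna.upper()
    reverse = dna.translate(COMPLEMENT)[::-1]
    frames = {}
    for offset in range(3):
        frames[offset + 1] = translate(dna[offset:])
        frames[-(offset + 1)] = translate(reverse[offset:])
    return frames


print(six_frame_translations("ATGGCC"))
```

Programs such as blastx and tblastn then compare these conceptual translations at the protein level.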
Gapped BLAST
PSI-BLAST
Position-Specific Iterated BLAST
The search can be improved if the important parts of the query are known.
The important parts of the query quite often correspond to conserved regions, regions with fewer mutations, or regions that define structure and function within a family of proteins.
Position-Specific Iterated BLAST
[Figure: position-specific profile for the sequence B D C A A C D F G H N N H D C C, with rows A, B, C, D, F, G, H; cell values not recoverable.]
PSI-BLAST
More sequences are found that can then be added onto the multiple alignment
Caution should be used with PSI-BLAST:
a greedy algorithm is used
the most recently added sequences will influence the next round of sequences
PHI-BLAST
One example of a regular expression (PROSITE-style pattern):
[LIVMF]-G-E-x-[GAS]-[LIVM]-x(5,11)-R-[STAQ]-A-x-[LIVMA]-x-[STACV]
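To see what such a pattern matches, it can be converted into an ordinary regular expression. The sketch below handles only the PROSITE-style elements used here (a residue, a bracketed set, `x` for any residue, an excluded set in braces, and a repeat count in parentheses); the function name `prosite_to_regex` and the test sequence are hypothetical, and the pattern string assumes hyphens between all elements.

```python
import re

PATTERN = "[LIVMF]-G-E-x-[GAS]-[LIVM]-x(5,11)-R-[STAQ]-A-x-[LIVMA]-x-[STACV]"

# One pattern element: a core symbol optionally followed by (n) or (n,m).
ELEMENT = re.compile(r"(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern):
    """Convert a hyphen-separated PROSITE-style pattern to a Python regex string."""
    parts = []
    for elem in pattern.split("-"):
        m = ELEMENT.fullmatch(elem)
        if m is None:
            raise ValueError(f"unsupported pattern element: {elem!r}")
        core, low, high = m.groups()
        if core == "x":
            core = "."                       # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"   # any residue except these
        if low is not None:
            core += "{" + low + ("," + high if high else "") + "}"
        parts.append(core)
    return "".join(parts)


regex = prosite_to_regex(PATTERN)
hit = re.search(regex, "MKLGEAGLAAAAARSAALASKK")  # made-up test sequence
print(regex)
print(hit.start(), hit.group())
```

PHI-BLAST restricts its search to database sequences that contain a match to the pattern, so a conversion like this only illustrates the filtering step, not the alignment that follows.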
Comparisons of BLAST and FASTA
BLAST: Calculate probabilities (sometimes fails entirely if some assumptions are invalid)
FASTA: Calculate significance from the given dataset (problems if dataset is small)