Blast & Fasta

Database searches are used to discover or verify the identity of genes, find members of gene families, and classify groups of genes. The Smith-Waterman algorithm provides an accurate approach but is slow for large databases. Heuristic methods like FASTA and BLAST use word matches and extensions to provide faster searches while maintaining sensitivity. BLAST has become the most widely used search tool due to its speed and sensitivity in finding local alignments.

Uploaded by

Meow

Database Searching

Why do database searches?

It often seems strange that similarity searching is so central to bioinformatics.

HelixInfoSystems

Predominant applications for database searches

To discover or verify the identity of a newly sequenced gene

To find other members of a multigene family

To classify groups of genes

Database searching
One approach: use the Smith-Waterman algorithm to find local alignments, comparing the query sequence to each sequence in the database.

Problem?
Databases are huge (GenBank ~30 million sequences, Swiss-Prot >> 100,000 sequences)
S-W is slow, $O(N n^2)$, where n is the sequence length and N is the number of sequences in the database
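To make the cost concrete, here is a minimal sketch of an exhaustive Smith-Waterman search. It assumes a simple match/mismatch/linear-gap scoring scheme (the values `match=2`, `mismatch=-1`, `gap=-2` and the function names are illustrative choices, not taken from the slides). Each comparison fills an n x n table, and the search repeats that for all N database sequences.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between strings a and b."""
    prev = [0] * (len(b) + 1)          # previous row of the DP table
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def search(query, database):
    """Score the query against every database entry, best score first."""
    scored = [(smith_waterman_score(query, seq), name)
              for name, seq in database.items()]
    return sorted(scored, reverse=True)
```

Calling `search("ACGT", {"s1": "TTACGTTT", "s2": "CCCC"})` ranks `s1` first, because only `s1` contains the full query as a local match.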

Solution?
Use faster heuristic approaches
FASTA: Fast Alignment
BLAST: Basic Local Alignment Search Tool

What are heuristics?
A heuristic solves a problem without regard to whether the solution can be proven correct, but usually produces a good solution.
It solves a simpler problem that contains or intersects with the solution of the more complex problem.
Heuristics are intended to gain computational performance or conceptual simplicity, potentially at the cost of accuracy or precision.

FASTA
Developed ~1985 by Lipman and Pearson (now with many variants, updates, and improvements)
Goal: perform fast, approximate local alignments to find sequences in the database that are related to the query sequence

Steps in FASTA
1. Choose a value for the ktup parameter: FASTA will look for exact matches of this length between the query and target sequences.
   Typically ktup=6 for DNA (range 4-6) and ktup=1 for protein (range 1-2). (Why the difference?)
2. Find hot spots (locations of matching ktup-length substrings) in a dot plot.

[Figures: dot plots showing hot spots with ktup=1, ktup=2, and ktup=3]

Hashing technique
The computational trick that makes FASTA fast is how it locates the hot spots.
It uses a hashing technique: map a string of 1, 2, or more characters to an integer, e.g.
AAA → 0
AAC → 1
...
TTT → 63 (oversimplified)
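One way to get exactly this mapping is to read a DNA k-tuple as a base-4 number. The sketch below assumes the digit order A=0, C=1, G=2, T=3, which reproduces the AAA, AAC, and TTT values shown above; the function name is illustrative.

```python
# Assumed digit assignment: A=0, C=1, G=2, T=3.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def kmer_to_int(kmer):
    """Interpret a DNA k-tuple as a base-4 number."""
    value = 0
    for base in kmer.upper():
        value = value * 4 + CODE[base]   # shift one base-4 digit, add the new one
    return value
```

With this encoding a 3-tuple always falls in the range 0 to 63, so it can index directly into an array of 4^3 slots.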

Hashing technique
Can preprocess the database and create a table that stores the locations (offsets) of each possible k-tuple:
20^k entries for amino acids (400 if k=2),
4^k entries for DNA (4096 if k=6).
Then use the hash code computed from each query-sequence k-tuple to look up these entries quickly.

Example
E.g., in the previous example with ktup=2 and the top sequence taken from the database: gctggaaggcat
Can now scan the query sequence by sliding a window along it, looking up each ktup substring in the hash table to retrieve the location(s) in the database sequence.
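The lookup described here can be sketched with an ordinary Python dictionary standing in for the hash table. The database sequence gctggaaggcat and ktup=2 come from the slide; the query string `ggaagg` and the function names are hypothetical, chosen only for illustration.

```python
from collections import defaultdict


def build_index(seq, ktup):
    """Map every k-tuple in seq to the list of offsets where it starts."""
    index = defaultdict(list)
    for pos in range(len(seq) - ktup + 1):
        index[seq[pos:pos + ktup]].append(pos)
    return index


def find_hot_spots(query, index, ktup):
    """Slide a window along the query; return (query_pos, db_pos) pairs."""
    hot_spots = []
    for q in range(len(query) - ktup + 1):
        for d in index.get(query[q:q + ktup], []):
            hot_spots.append((q, d))
    return hot_spots


database_seq = "gctggaaggcat"      # database sequence from the slide
index = build_index(database_seq, 2)
query = "ggaagg"                   # hypothetical query, not from the slide
hot_spots = find_hot_spots(query, index, 2)
```

Each pair is one dot in the dot plot. Pairs with the same difference `db_pos - query_pos` lie on the same diagonal.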

Steps in FASTA (contd.)
3. Find the 10 best diagonal runs (sequences of nearby hot spots on the same diagonal).
FASTA gives each hot spot a positive score, and each space between consecutive hot spots a negative score that decreases (becomes more negative) with distance.
Each diagonal run is composed of matches (the hot spots themselves) and mismatches (interspot regions) but does not contain indels, because its hot spots all lie on the same diagonal.
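A simplified sketch of this step follows. It groups hot spots by diagonal and finds the best-scoring run on each one. The constants `spot_score=5` and `gap_cost=1` and the function names are assumptions for illustration; FASTA's actual scoring values are not given in the slides.

```python
from collections import defaultdict


def score_diagonals(hot_spots, ktup, spot_score=5, gap_cost=1):
    """Return {diagonal: best run score} for (query_pos, db_pos) hot spots."""
    by_diag = defaultdict(list)
    for q, d in hot_spots:
        by_diag[d - q].append(q)       # same d - q means same diagonal

    scores = {}
    for diag, starts in by_diag.items():
        best = running = 0
        prev_end = None
        for start in sorted(starts):
            if prev_end is None:
                running = spot_score
            else:
                gap = max(0, start - prev_end)   # residues between hot spots
                # Either extend the current run (paying for the gap)
                # or start a fresh run at this hot spot.
                running = max(spot_score,
                              running - gap_cost * gap + spot_score)
            best = max(best, running)
            prev_end = start + ktup
        scores[diag] = best
    return scores


def best_diagonals(scores, n=10):
    """Keep the n highest-scoring diagonals."""
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:n]
```

Overlapping hot spots are each counted in full here, which overstates the score of dense diagonals; a closer model would score the covered residues once.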

Steps in FASTA (contd.)
4. Evaluate each diagonal run using an appropriate scoring matrix (PAM-n, BLOSUM-n, etc.) and find the best-scoring run; its score is init1.
Runs with low scores are discarded (filtration).

Steps in FASTA (contd.)
5. Try to combine good diagonal runs from close diagonals, now allowing indels.
"Good" means those having a score exceeding a chosen threshold.

Finding the best path
Find a maximum-weight path in this graph; it corresponds to a single local alignment between the two sequences compared.
The score of this path (initn) is the sum of the scores of the aligned individual regions minus a gap penalty for each gap inserted between regions.

Steps in FASTA (contd.)
6. If the initn score reaches a threshold value, get the opt score using a banded Smith-Waterman alignment (don't waste time on this otherwise).
7. Rank database sequences according to their opt scores; use the full Smith-Waterman method (no band) to align the query sequence against each of the highest-ranking sequences from the database.

Steps in FASTA (contd.)
8. Perform a statistical analysis of the probability that the given level of matching would be obtained by chance if the sequences were unrelated.

FASTA Results
When init1 = initn = opt:
100% homology over the matched stretch.
When initn > init1:
More than one matching region in the database sequence, with poorly matching separating regions.
When opt > initn:
The matching regions are greatly improved by adding gaps in one or both of the sequences.

Basic Local Alignment Search Tool
Most widely used and referenced computational biology/bioinformatics resource

BLAST
Improves on the search speed of FASTA
Retains the sensitivity of searches

BLAST Algorithm
Step 1 - Example
Split the query sequence into words of size W using a moving window.
Example: for a human RBP query FSGTWYA with W=3
The moving window of words:
FSG SGT GTW TWY WYA
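The moving window is a one-line slice in Python. This small sketch (the function name is illustrative) reproduces the word list for the RBP fragment above.

```python
def query_words(query, w):
    """Return every overlapping word of length w, in order of position."""
    return [query[i:i + w] for i in range(len(query) - w + 1)]


print(query_words("FSGTWYA", 3))
```

A query of length L yields L - W + 1 words, so the 7-residue fragment gives 5 words.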

Step 1: compile a list of words scoring at least T with the query word GTW

Threshold (T) = 11

Word | Scores per position | Total
Word hits > T
GTW | 6,5,11 | 22
GSW | 6,1,11 | 18
ATW | 0,5,11 | 16
NTW | 0,5,11 | 16
GTY | 6,5,2 | 13
Word hits < T
GNW | | 10
GAW | | 9
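The thresholding can be sketched as follows. The pair scores are only the per-position values that can be read off the slide for the query word GTW (for example G/G = 6, T/T = 5, W/W = 11); a real search would use a complete substitution matrix. The function names and the candidate list are illustrative.

```python
# Pair scores read off the slide's per-position columns (a partial table,
# not a full substitution matrix).
PAIR_SCORES = {
    ("G", "G"): 6, ("T", "T"): 5, ("W", "W"): 11,
    ("G", "A"): 0, ("G", "N"): 0, ("W", "Y"): 2,
}


def pair_score(a, b):
    """Look up a symmetric pair score; raises KeyError if it is not listed."""
    if (a, b) in PAIR_SCORES:
        return PAIR_SCORES[(a, b)]
    return PAIR_SCORES[(b, a)]


def word_score(query_word, candidate):
    """Sum the pair scores position by position."""
    return sum(pair_score(q, c) for q, c in zip(query_word, candidate))


def words_at_least(query_word, candidates, threshold):
    """Keep candidates scoring at least the threshold, best first."""
    scored = [(word_score(query_word, c), c) for c in candidates]
    return sorted((s for s in scored if s[0] >= threshold), reverse=True)
```

For example, `words_at_least("GTW", ["GTW", "ATW", "NTW", "GTY"], 11)` keeps all four words, while raising the threshold to 14 drops GTY (score 13).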

Step 2: scan the database for entries that contain any word from the compiled hit list.
Look for exact matches of words from the word list to the database sequences.

Step 3: extend. When you find a hit, extend the hit in either direction.
Keep track of the score (use a scoring matrix).
Stop when the score drops below some cutoff.

KENFDKARFSGTWYAMAKKDPEG 50 RBP (query)
MKGLDIQKVAGTWYSLAMAASD. 44 lactoglobulin (hit)
extend <- Hit! -> extend

Step 3
For each exact word match, the alignment is extended in both directions to find high-scoring segments.
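A minimal sketch of ungapped extension is shown below, using the RBP and lactoglobulin fragments from the slide. It assumes a plain identity score (`match=2`, `mismatch=-1`) instead of a substitution matrix, and stops once the running score falls more than `x_drop` below the best seen so far. All parameter values and function names are illustrative.

```python
def _extend(query, subject, i, j, direction, score, x_drop):
    """Walk in one direction; return (best score gain, residues kept)."""
    best = run = 0
    best_len = length = 0
    while 0 <= i < len(query) and 0 <= j < len(subject):
        run += score(query[i], subject[j])
        length += 1
        if run > best:
            best, best_len = run, length
        elif best - run > x_drop:
            break                        # score dropped too far: stop
        i += direction
        j += direction
    return best, best_len


def extend_hit(query, subject, q_start, s_start, w,
               match=2, mismatch=-1, x_drop=3):
    """Extend a word hit of length w; return (q_begin, s_begin, length, score)."""
    def score(a, b):
        return match if a == b else mismatch

    seed = sum(score(query[q_start + k], subject[s_start + k])
               for k in range(w))
    right, right_len = _extend(query, subject, q_start + w, s_start + w,
                               +1, score, x_drop)
    left, left_len = _extend(query, subject, q_start - 1, s_start - 1,
                             -1, score, x_drop)
    return (q_start - left_len, s_start - left_len,
            w + left_len + right_len, seed + left + right)


rbp = "KENFDKARFSGTWYAMAKKDPEG"
lactoglobulin = "MKGLDIQKVAGTWYSLAMAASD"
hit = extend_hit(rbp, lactoglobulin, rbp.find("GTW"),
                 lactoglobulin.find("GTW"), 3)
```

With identity scoring the GTW hit grows by one residue to GTWY. A substitution matrix that rewards conservative replacements would typically extend further.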

BLAST Word Size
For nucleotides, a minimum word size of 9 is needed to detect similarity (default is 11).
For proteins, the word size is 3 and the match need not be exact, so word size is less of an issue.

Interpreting BLAST Results
Bit score
E values and p values

From raw scores to bit scores
There are two kinds of scores:
Raw scores S (calculated from a substitution matrix)
Bit scores S' (normalized scores)

$S' = \text{bit score} = \dfrac{\lambda S - \ln K}{\ln 2}$

E values
The expect value (E) is the number of alignments with scores greater than or equal to score S that are expected to occur by chance in a database search.

E-Value
$E = K m n\, e^{-\lambda S}$
E is the number of hits you would expect from your search with scores greater than or equal to S, where:
K is a constant
m is the size of the query
n is the size of the database being searched
λ scales for the specific scoring matrix used
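The two formulas can be checked numerically with a short sketch. Here λ (written `lam`) is the scale parameter that belongs to the scoring matrix and K is the accompanying constant. The values `lam=0.318` and `k=0.134` in the example call are placeholder values chosen for illustration, as are the query and database sizes and the function names.

```python
import math


def bit_score(raw_score, lam, k):
    """Normalized score: S' = (lam * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_value(raw_score, m, n, lam, k):
    """Expected number of chance hits: E = K * m * n * exp(-lam * S)."""
    return k * m * n * math.exp(-lam * raw_score)


def e_value_from_bits(bits, m, n):
    """Equivalent form in terms of the bit score: E = m * n * 2**(-S')."""
    return m * n * 2.0 ** (-bits)


# Illustrative placeholder parameters, not taken from the slides.
bits = bit_score(80, lam=0.318, k=0.134)
expect = e_value(80, m=250, n=50_000_000, lam=0.318, k=0.134)
```

Substituting the bit score into E gives E = m n 2^(-S'), which is why bit scores from searches with different scoring matrices can be compared directly.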

Interpreting scores
% identity is not the best indicator of homology
Statistical theory in next talk
E-value < 0.001 is typically used to infer homology
Sequences with E-values > 0.001 may still be homologous; to check, use:
  Analysis of conservation of functional motifs
  More sensitive techniques

BLAST family of programs
blastp - amino acid query sequence against a protein sequence database
blastn - nucleotide query sequence against a nucleotide sequence database
blastx - nucleotide query sequence translated in all reading frames against a protein database
tblastn - protein query sequence against a nucleotide sequence database dynamically translated in all reading frames
tblastx - six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database
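"Translated in all reading frames" means three frames on each strand. The sketch below shows one way to produce the six translations, assuming the standard genetic code and an unambiguous A/C/G/T input; the function names and frame labels (`+1` to `-3`) are illustrative.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an A/C/G/T string."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons; unknown codons become 'X', stops '*'."""
    dna = dna.upper()
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frame_translations(dna):
    """Return {frame label: protein} for frames +1..+3 and -1..-3."""
    frames = {}
    for strand, seq in (("+", dna.upper()), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames
```

blastx applies this to the query, tblastn to each database entry, and tblastx to both, which is why tblastx is the most expensive of the five programs.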

Gapped BLAST
The Gapped BLAST algorithm allows gaps to be introduced into the alignments.
That means that similar regions are not broken into several segments.
This method reflects biological relationships much better.

PSI-BLAST
Position Specific Iterated BLAST
The search can be improved if the important parts of the query are known.
The important parts of the query quite often correspond to conserved regions, regions with fewer mutations, or regions that define structure and functionality within a family of proteins.

Position-Specific Iterated BLAST
Instead of a 20 x 20 matrix, use an m x 20 matrix, where m is the length of the query.
First, run normal BLAST.
Using the results, construct the position-specific score matrix.
Subsequent iterations search the database with the position-specific score matrix in place of the standard substitution matrix.
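A toy construction of an m x 20 matrix is sketched below. It assumes a gap-free alignment of equal-length sequences, a uniform background frequency, and a simple additive pseudocount. PSI-BLAST's actual weighting and pseudocount scheme is more elaborate, so treat this only as an illustration of the idea; the function names are hypothetical.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(aligned, pseudocount=1.0):
    """Return a list of m rows; each row maps residue -> log-odds score."""
    length = len(aligned[0])
    background = 1.0 / len(AMINO_ACIDS)      # assumed uniform background
    pssm = []
    for pos in range(length):
        column = [seq[pos] for seq in aligned]
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        row = {}
        for aa in AMINO_ACIDS:
            freq = (column.count(aa) + pseudocount) / total
            row[aa] = math.log2(freq / background)
        pssm.append(row)
    return pssm


def score_segment(pssm, segment):
    """Score a segment of length m against the position-specific matrix."""
    return sum(pssm[i][aa] for i, aa in enumerate(segment))
```

Unlike a 20 x 20 matrix, the score for a residue here depends on where in the query it is aligned, so conserved columns contribute strongly and variable columns contribute little.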

Position Specific Matrix - Example
[Figure: position-specific matrix for the sequence B D C A A C D F G H N N H D C C; columns are positions 1-16, rows are residue types (A, B, C, D, F, G, H)]

PSI-BLAST
More sequences are found that can then be added to the multiple alignment.
Caution should be used with PSI-BLAST:
a greedy algorithm is used
the most recently added sequences will influence the next round of sequences

PHI-BLAST
Pattern Hit Initiated BLAST
Functions in the same manner as PSI-BLAST, except that the query sequence is first searched for a regular expression.
The search for similar sequences is focused on regions containing the pattern.

PHI-BLAST
One example of a regular expression:
[LIVMF]-G-E-x-[GAS]-[LIVM]-x(5,11)-R-[STAQ]-A-x-[LIVMA]-x-[STACV]
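A pattern in this notation can be turned into an ordinary regular expression with a few substitutions. The sketch below assumes the pattern is written with a hyphen between every element and handles only the elements used here (`x`, bracketed sets, `{...}` exclusions, and `(n)` or `(n,m)` repeat counts). The function name and the test sequence are hypothetical.

```python
import re


def prosite_to_regex(pattern):
    """Convert a hyphen-separated pattern into Python regex syntax."""
    parts = []
    for element in pattern.split("-"):
        # Split an element such as 'x(5,11)' into its core and repeat count.
        m = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = m.group(1), m.group(2), m.group(3)
        if core == "x":
            core = "."                           # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"       # any residue except these
        if low is not None:
            core += "{" + low + ("," + high if high else "") + "}"
        parts.append(core)
    return "".join(parts)


PATTERN = ("[LIVMF]-G-E-x-[GAS]-[LIVM]-x(5,11)-R-[STAQ]-A-x-"
           "[LIVMA]-x-[STACV]")
regex = re.compile(prosite_to_regex(PATTERN))

# Hypothetical sequence constructed to contain one occurrence.
match = regex.search("MKLGEAGLAAAAARSAALASKK")
```

The match span tells the search where the pattern occurs in the query, and it is these regions that seed the comparison against the database.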

Comparison of BLAST and FASTA

BLAST | FASTA
Can produce more than one HSP per database entry | Produces only one best alignment
Better for proteins than nucleotides | Better for nucleotides than proteins
Faster than FASTA | Slower than BLAST
Less sensitive when using default settings | More sensitive; misses fewer homologs
Less separation between true and random hits | More separation between true homologs and random hits
Calculates probabilities (sometimes fails entirely if some assumptions are invalid) | Calculates significance from the given dataset (problems if the dataset is small)
