Database Similarity Searching

The document discusses database similarity searching and the basic local alignment search tool (BLAST) algorithm. Database similarity searching involves comparing a query sequence to all sequences in a database using pairwise alignment. BLAST is a heuristic algorithm that uses word matches to quickly identify related sequences and extend alignments to generate high-scoring local alignments between the query and database sequences.

Uploaded by

Nickson Onchoka
Copyright
© © All Rights Reserved

22-Mar-12

DATABASE SIMILARITY SEARCHING

INTRODUCTION
• A main application of pairwise alignment is retrieving biological sequences in databases based on similarity.

• This process involves submission of a query sequence and performing a pairwise comparison of the query sequence with all individual sequences in a database.

• Thus, database similarity searching is pairwise alignment on a large scale.

• This type of searching is one of the most effective ways to assign putative functions to newly determined sequences.

• Special search methods are needed to speed up the computational process of sequence comparison.

UNIQUE REQUIREMENTS OF DATABASE SEARCHING
• There are unique requirements for implementing algorithms for sequence database searching.

• The first criterion is sensitivity, which refers to the ability to find as many correct hits as possible.
– It is measured by the extent of inclusion of correctly identified sequence members of the same family.
– These correct hits are considered “true positives” in the database searching exercise.

• The second criterion is selectivity, also called specificity, which refers to the ability to exclude incorrect hits.
– These incorrect hits are unrelated sequences mistakenly identified in database searching and are considered “false positives.”

• The third criterion is speed, which is the time it takes to get results from database searches.
– Depending on the size of the database, speed sometimes can be a primary concern.

• Ideally, one wants to have the greatest sensitivity, selectivity, and speed in database searches.
– However, satisfying all three requirements is difficult in reality. In practice, an increase in sensitivity is associated with a decrease in selectivity.
– A very inclusive search tends to include many false positives.
– Similarly, an improvement in speed often comes at the cost of lowered sensitivity and selectivity.
– A compromise between the three criteria often has to be made.

• In database searching, as well as in many other areas of bioinformatics, there are two fundamental types of algorithms.
– One is the exhaustive type, which uses a rigorous algorithm to find the best or exact solution for a particular problem by examining all mathematical combinations.
– Dynamic programming is an example of the exhaustive method and is computationally very intensive.
– The other is the heuristic type, which is a computational strategy to find an empirical or near-optimal solution by using rules of thumb.
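The sensitivity and selectivity criteria above can be made concrete with a small sketch. The slides define both only qualitatively; the ratios used here (true positives over all family members, true negatives over all unrelated sequences) are the standard textbook definitions and are an assumption, as are the function name and the toy sequence identifiers.

```python
def sensitivity_and_selectivity(reported, true_family, database):
    """Return (sensitivity, selectivity) for one database search.

    reported    -- identifiers returned by the search
    true_family -- identifiers that really belong to the query's family
    database    -- every identifier in the database
    """
    reported, true_family, database = set(reported), set(true_family), set(database)
    true_pos = len(reported & true_family)      # correct hits
    false_pos = len(reported - true_family)     # unrelated sequences reported
    false_neg = len(true_family - reported)     # family members that were missed
    true_neg = len(database - reported - true_family)  # unrelated and not reported
    sensitivity = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    selectivity = true_neg / (true_neg + false_pos) if true_neg + false_pos else 0.0
    return sensitivity, selectivity


# Hypothetical toy database of ten sequences, four of which are true family members.
database = [f"s{i}" for i in range(1, 11)]
family = ["s1", "s2", "s3", "s4"]
hits = ["s1", "s2", "s3", "s9"]
print(sensitivity_and_selectivity(hits, family, database))
```

Reporting more sequences can only raise the first number and can only lower the second, which is the trade-off the slide describes.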

HEURISTIC DATABASE SEARCHING
• Currently, there are two major heuristic algorithms for performing database searches: BLAST and FASTA.
– These methods are not guaranteed to find the optimal alignment or true homologs, but are 50–100 times faster than dynamic programming.
– The increased computational speed comes at a moderate expense of sensitivity and specificity of the search.

• Both BLAST and FASTA use a heuristic word method for fast pairwise sequence alignment.
– It works by finding short stretches of identical or nearly identical letters in two sequences.
– These short strings of characters are called words, which are similar to the windows used in the dot matrix method.
– The basic assumption is that two related sequences must have at least one word in common. By first identifying word matches, a longer alignment can be obtained by extending similarity regions from the words.
– Once regions of high sequence similarity are found, adjacent high-scoring regions can be joined into a full alignment.

BASIC LOCAL ALIGNMENT SEARCH TOOL (BLAST)
• The BLAST program was developed by Stephen Altschul and colleagues at NCBI in 1990.
– It has since become one of the most popular programs for sequence analysis.

• BLAST uses heuristics to align a query sequence with all sequences in a database.

• The objective is to find high-scoring ungapped segments (high-scoring strings) among related sequences.

• The existence of such segments above a given threshold indicates pairwise similarity beyond random chance, which helps to discriminate related sequences from unrelated sequences in a database.
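The word method described above can be sketched in a few lines: index the words of one sequence, look for the same words in the other, then extend each word hit in both directions without gaps. This is only an illustration of the idea. It uses exact word matches, a simple +1/−1 match/mismatch score, and a small drop-off value, all of which are assumptions chosen for readability; they are not the parameters of BLAST or FASTA.

```python
from collections import defaultdict


def word_positions(seq, k):
    """Map every k-letter word in seq to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index


def seed_hits(query, subject, k):
    """Return (query_pos, subject_pos) pairs where the two share an identical word."""
    q_index = word_positions(query, k)
    hits = []
    for j in range(len(subject) - k + 1):
        for i in q_index.get(subject[j:j + k], []):
            hits.append((i, j))
    return hits


def extend_hit(query, subject, qpos, spos, k, match=1, mismatch=-1, drop=3):
    """Extend one word hit in both directions without gaps.

    Returns (best_score, query_start, query_end) with a half-open query range.
    Extension in a direction stops once the running score has fallen `drop`
    below the best score seen so far.
    """
    score = best = k * match
    best_start, best_end = qpos, qpos + k
    # Extend to the right of the word.
    i, j = qpos + k, spos + k
    while i < len(query) and j < len(subject) and best - score < drop:
        score += match if query[i] == subject[j] else mismatch
        i += 1
        j += 1
        if score > best:
            best, best_end = score, i
    # Extend to the left of the word, starting again from the best score.
    score = best
    i, j = qpos - 1, spos - 1
    while i >= 0 and j >= 0 and best - score < drop:
        score += match if query[i] == subject[j] else mismatch
        if score > best:
            best, best_start = score, i
        i -= 1
        j -= 1
    return best, best_start, best_end


# Hypothetical sequences: the query is embedded in the subject at position 2.
query, subject = "GATTACA", "CCGATTACACC"
hits = seed_hits(query, subject, k=3)
print(hits[0], extend_hit(query, subject, *hits[0], k=3))
```

Every hit in this toy example lies on the same diagonal, so each one extends to the same full-length segment.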


Illustration of the BLAST procedure using a hypothetical query sequence matching with a hypothetical database sequence. The example of the word match is highlighted in the box.

BLAST ALIGNMENT STEPS
• The first step is to create a list of words from the query sequence.
– Each word is typically three residues for protein sequences and eleven residues for DNA sequences.
– The list includes every possible word extracted from the query sequence.
– This step is also called seeding.

• The second step is to search a sequence database for the occurrence of these words.
– This step identifies database sequences containing the matching words.

• The third step is to score the matching of the words using a given substitution matrix.
– A word is considered a match if its score is above a threshold.

• The fourth step involves pairwise alignment by extending from the words in both directions while counting the alignment score using the same substitution matrix.
– The extension continues until the score of the alignment drops below a threshold due to mismatches (the drop threshold is twenty-two for proteins and twenty for DNA).
– The resulting contiguous aligned segment pair without gaps is called a high-scoring segment pair (HSP).
– In the original version of BLAST, the highest-scoring HSPs are presented as the final report. They are also called maximum scoring pairs.

• A recent improvement in the implementation of BLAST is the ability to provide gapped alignment.
– In gapped BLAST, the highest-scoring segment is chosen to be extended in both directions using dynamic programming, where gaps may be introduced.
– The extension continues if the alignment score is above a certain threshold; otherwise it is terminated.

A BIT SCORE
• A bit score is another prominent statistical indicator used in addition to the E-value in a BLAST output.

• The bit score measures sequence similarity independent of query sequence length and database size and is normalized based on the raw pairwise alignment score.

• The bit score (S') is determined by the following formula:
S' = (λ × S − ln K) / ln 2
– where λ is the Gumbel distribution constant, S is the raw alignment score, and K is a constant associated with the scoring matrix used.

• Clearly, the bit score (S') is linearly related to the raw alignment score (S).
– Thus, the higher the bit score, the more significant the match is.
– The bit score provides a constant statistical indicator for searching different databases of different sizes or for searching the same database at different times as the database enlarges.
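The bit score formula is a one-line computation. In the sketch below the function name and the values passed for λ and K are illustrative assumptions; real values depend on the scoring matrix and gap penalties in use and are not given in these slides.

```python
import math


def bit_score(raw_score, lam, k):
    """Bit score S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


# Illustrative constants only (assumed, not taken from the slides).
LAMBDA, K = 0.267, 0.041
for raw in (50, 100, 200):
    print(raw, round(bit_score(raw, LAMBDA, K), 1))
```

Because the formula is linear in S, doubling the raw score adds a fixed amount (λ × S / ln 2) to the bit score; the database size and query length never enter the calculation.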

BLAST VARIANTS
• BLAST is a family of programs that includes BLASTN, BLASTP, BLASTX, TBLASTN, and TBLASTX.
– BLASTN uses nucleotide sequences as queries to search against a nucleotide sequence database.
– BLASTP uses protein sequences as queries to search against a protein sequence database.
– BLASTX uses nucleotide sequences as queries and translates them in all six reading frames to produce translated protein sequences, which are used to query a protein sequence database.
– TBLASTN uses protein sequences as queries to search against a nucleotide sequence database whose sequences are translated in all six reading frames.
– TBLASTX uses nucleotide sequences, which are translated in all six frames, to search against a nucleotide sequence database that has all the sequences translated in six frames.

STATISTICAL SIGNIFICANCE
• The BLAST output provides a list of pairwise sequence matches ranked by statistical significance.
– The significance scores help to distinguish evolutionarily related sequences from unrelated ones. Generally, only hits above a certain threshold are displayed.

• In BLAST searches, the statistical indicator is known as the E-value (expectation value).
– It indicates the likelihood that the resulting alignments from a database search are caused by random chance; strictly, it is the number of such chance alignments expected in a search of that size.
– The E-value is related to the P-value used to assess the significance of a single pairwise alignment.

• BLAST compares a query sequence against all database sequences, and so the E-value is determined by the following formula:
E = m × n × P
– where m is the total number of residues in a database, n is the number of residues in the query sequence, and P is the probability that an HSP alignment is a result of random chance.
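As a quick check of the formula E = m × n × P, the following minimal sketch multiplies the three quantities for a 100-residue query, a database of 10¹² residues, and a P-value of 10⁻²⁰. The function and parameter names are made up for illustration.

```python
def e_value(db_residues, query_residues, p_value):
    """E = m * n * P, with m = database size, n = query length, P = HSP P-value."""
    return db_residues * query_residues * p_value


e = e_value(db_residues=10**12, query_residues=100, p_value=1e-20)
print(f"{e:.0e}")  # prints 1e-06, the same notation BLAST uses
```

The same P-value gives a larger E-value against a larger database, which is why E-values from searches of different databases are not directly comparable.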


EXAMPLES
• Aligning a query sequence of 100 residues to a database containing a total of 10¹² residues results in a P-value for the ungapped HSP region in one of the database matches of 1 × 10⁻²⁰.

• The E-value, which is the product of the three values, is 100 × 10¹² × 10⁻²⁰, which equals 10⁻⁶. It is expressed as 1e−6 in BLAST output.
– This indicates that the probability of this database sequence match occurring due to random chance is 10⁻⁶.

• The E-value provides information about the likelihood that a given sequence match is purely by chance.
– The lower the E-value, the less likely the database match is a result of random chance and therefore the more significant the match is.
– Empirical interpretation of the E-value is as follows. If E < 1e−50 (or 1 × 10⁻⁵⁰), there should be an extremely high confidence that the database match is a result of homologous relationships.
– If E is between 1e−50 and 0.01, the match can be considered a result of homology.
– If E is between 0.01 and 10, the match is considered not significant, but may hint at a tentative remote homology relationship.

BLAST Output Format
• The BLAST output includes a graphical overview box, a matching list, and a text description of the alignment.

• The graphical overview box contains colored horizontal bars that allow quick identification of the number of database hits and the degrees of similarity of the hits.

• The color coding of the horizontal bars corresponds to the ranking of similarities of the sequence hits (red: most related; green and blue: moderately related; black: unrelated).

• The length of the bars represents the spans of sequence alignments relative to the query sequence.
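The empirical E-value ranges listed under EXAMPLES can be written as a small lookup, which makes the boundaries explicit. This is a sketch only: the function name and the returned labels are assumptions, the boundary values 0.01 and 10 are assigned to the "not significant" range by choice, and E-values above 10 are reported as outside the listed ranges because the slides do not classify them.

```python
def interpret_e_value(e):
    """Map a BLAST E-value to the empirical categories given in the slides."""
    if e < 1e-50:
        return "extremely high confidence of homology"
    if e < 0.01:
        return "homology"
    if e <= 10:
        return "not significant; possible remote homology"
    return "outside the ranges listed"


for e in (1e-60, 1e-6, 0.5, 50):
    print(e, "->", interpret_e_value(e))
```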

FASTA
• FASTA (FAST ALL, www.ebi.ac.uk/fasta33/) was the first database similarity search tool developed, preceding the development of BLAST.

• FASTA uses a “hashing” strategy to find matches for a short stretch of identical residues with a length of k.

• These strings of residues are known as ktuples or ktups, which are equivalent to words in BLAST, but are normally shorter than the words.

• Typically, a ktup is composed of two residues for protein sequences and six residues for DNA sequences.

STEPS
• The first step in FASTA alignment is to identify ktups between two sequences by using the hashing strategy.

• This strategy works by constructing a lookup table that shows the position of each ktup for the two sequences under consideration.

• The positional difference for each word between the two sequences is obtained by subtracting the position in the first sequence from that in the second sequence and is expressed as the offset.

• The ktups that have the same offset values are then linked to reveal a contiguous identical sequence region that corresponds to a stretch of diagonal in a two-dimensional matrix.

• The second step is to narrow down the high similarity regions between the two sequences.
– Normally, many diagonals between the two sequences can be identified in the hashing step.
– The top ten regions with the highest density of diagonals are identified as high similarity regions. The diagonals in these regions are scored using a substitution matrix.
– Neighboring high-scoring segments along the same diagonal are selected and joined to form a single alignment.
– This step allows introducing gaps between the diagonals while applying gap penalties.
– The score of the gapped alignment is calculated again.

• In step 3, the gapped alignment is refined further using the Smith–Waterman algorithm to produce a final alignment.

• The last step is to perform a statistical evaluation of the final alignment as in BLAST, which produces the E-value.
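The hashing step (step 1) can be sketched as follows: build a lookup table of ktup positions for the first sequence, scan the second sequence, and group matching ktups by offset, where the offset is the position in the second sequence minus the position in the first. The function name and toy sequences are assumptions, and the sketch stops before scoring and joining diagonals.

```python
from collections import defaultdict


def ktup_diagonals(seq1, seq2, k):
    """Group identical ktup matches by offset (position in seq2 minus position in seq1).

    Returns {offset: [(pos_in_seq1, pos_in_seq2), ...]}; ktups sharing an
    offset lie on the same diagonal of the comparison matrix.
    """
    # Lookup table: ktup -> start positions in the first sequence.
    table = defaultdict(list)
    for i in range(len(seq1) - k + 1):
        table[seq1[i:i + k]].append(i)

    diagonals = defaultdict(list)
    for j in range(len(seq2) - k + 1):
        for i in table.get(seq2[j:j + k], []):
            diagonals[j - i].append((i, j))
    return dict(diagonals)


# Hypothetical sequences; seq1 occurs inside seq2 starting at position 2.
diagonals = ktup_diagonals("GATTACA", "CCGATTACACC", k=2)
densest = max(diagonals, key=lambda offset: len(diagonals[offset]))
print(densest, diagonals[densest])
```

Here the diagonal with offset 2 collects six consecutive ktups, while a single chance match of "AC" falls on offset 4; picking the densest diagonals is what step 2 builds on.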


Steps of the FASTA alignment procedure. In step 1 (left), all possible ungapped alignments are found between two sequences with the hashing method. In step 2 (middle), the alignments are scored according to a particular scoring matrix. Only the ten best alignments are selected. In step 3 (right), the alignments in the same diagonal are selected and joined to form a single gapped alignment, which is optimized using the dynamic programming approach.
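For reference, the dynamic programming refinement named in step 3 (Smith–Waterman local alignment) can be sketched in its simplest form. This version returns only the best local alignment score and uses a linear gap penalty with made-up match, mismatch, and gap values; FASTA itself applies a substitution matrix and its own gap penalties, so treat the parameters as assumptions.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between a and b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)  # previous row of the scoring matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A cell never drops below zero, which is what makes the alignment local.
            curr[j] = max(0, diagonal, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


print(smith_waterman_score("GATTACA", "GATACA"))  # six matches and one gap: 6*2 - 2 = 10
```

Every cell of the matrix is filled, so the cost grows with the product of the two sequence lengths. That is why FASTA reserves this step for the few regions that survive the earlier filtering.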

FASTA SUBPROGRAMS
• Similar to BLAST, FASTA has a number of subprograms.

• The web-based FASTA program offered by the European Bioinformatics Institute (www.ebi.ac.uk/) allows the use of either DNA or protein sequences as the query to search against a protein database or nucleotide database.

• FASTX translates a DNA sequence and uses the translated protein sequence to query a protein database.

• TFASTX compares a protein query sequence to a translated DNA database.

STATISTICAL SIGNIFICANCE
• FASTA also uses E-values and bit scores.
– Estimation of the two parameters in FASTA is essentially the same as in BLAST.

• However, the FASTA output provides one more statistical parameter, the Z-score.
– This describes the number of standard deviations from the mean score for the database search.

• Because most of the alignments with the query sequence are with unrelated sequences, the higher the Z-score for a reported match, the further it lies from the mean of the score distribution and hence the more significant the match.
– For a Z-score > 15, the match can be considered extremely significant, with certainty of a homologous relationship.
– If Z is in the range of 5 to 15, the sequence pair can be described as highly probable homologs.
– If Z < 5, the relationship is described as less certain.
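The Z-score idea can be illustrated with plain standard-deviation arithmetic. The sketch below is a simplification under stated assumptions: it uses the population standard deviation of a list of database scores and ignores any length correction a real FASTA run may apply. The function names, labels, and toy scores are hypothetical.

```python
import statistics


def z_score(score, database_scores):
    """Number of standard deviations by which `score` exceeds the mean database score."""
    mean = statistics.mean(database_scores)
    spread = statistics.pstdev(database_scores)
    if spread == 0:
        raise ValueError("scores have zero spread; the Z-score is undefined")
    return (score - mean) / spread


def interpret_z(z):
    """Apply the Z-score ranges given in the slides."""
    if z > 15:
        return "extremely significant"
    if z >= 5:
        return "highly probable homologs"
    return "less certain"


# Hypothetical scores of the query against mostly unrelated database sequences.
background = [20, 22, 18, 20, 20]
z = z_score(40, background)
print(round(z, 2), interpret_z(z))
```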

COMPARISON OF FASTA AND BLAST
• BLAST and FASTA have been shown to perform almost equally well in regular database searching.

• The major difference is in the seeding step; BLAST uses a substitution matrix to find matching words, whereas FASTA identifies identical matching words using the hashing procedure.

• By default, FASTA scans smaller window sizes. Thus, it gives more sensitive results than BLAST, with a better coverage rate for homologs. However, it is usually slower than BLAST.

• The use of low-complexity masking in the BLAST procedure means that it may have higher specificity than FASTA because potential false positives are reduced. BLAST sometimes gives multiple best-scoring alignments from the same sequence; FASTA returns only one final alignment.
