Database Similarity Searching
INTRODUCTION
• A main application of pairwise alignment is retrieving biological
sequences in databases based on similarity.
UNIQUE REQUIREMENTS OF DATABASE SEARCHING
• There are unique requirements for implementing algorithms for sequence database searching.
• The first criterion is sensitivity, which refers to the ability to find as many correct hits as possible.
– It is measured by the extent of inclusion of correctly identified sequence members of the same family.
– These correct hits are considered “true positives” in the database searching exercise.
• The second criterion is selectivity, also called specificity, which refers to the ability to exclude incorrect hits.
– These incorrect hits are unrelated sequences mistakenly identified in database searching and are considered “false positives.”
• The third criterion is speed, which is the time it takes to get results from database searches.
– Depending on the size of the database, speed sometimes can be a primary concern.

• Ideally, one wants to have the greatest sensitivity, selectivity, and speed in database searches.
– However, satisfying all three requirements is difficult in reality. In practice, an increase in sensitivity is associated with a decrease in selectivity.
– A very inclusive search tends to include many false positives.
– Similarly, an improvement in speed often comes at the cost of lowered sensitivity and selectivity.
– A compromise between the three criteria often has to be made.

• In database searching, as well as in many other areas in bioinformatics, there are two fundamental types of algorithms.
– One is the exhaustive type, which uses a rigorous algorithm to find the best or exact solution for a particular problem by examining all mathematical combinations.
– Dynamic programming is an example of the exhaustive method and is computationally very intensive.
– Another is the heuristic type, which is a computational strategy to find an empirical or near optimal solution by using rules of thumb.
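
The sensitivity and selectivity criteria can be made concrete with a small calculation. The sketch below is only an illustration: it assumes the usual definitions (sensitivity = TP / (TP + FN), selectivity = TN / (TN + FP)), and the function name, sequence IDs, and database size are hypothetical.

```python
def search_metrics(hits, true_members, database_size):
    """Return (sensitivity, selectivity) for one database search.

    hits          -- set of sequence IDs reported by the search
    true_members  -- set of IDs that really belong to the query's family
    database_size -- total number of sequences searched
    """
    true_pos = len(hits & true_members)    # correct hits
    false_pos = len(hits - true_members)   # unrelated sequences reported
    false_neg = len(true_members - hits)   # family members that were missed
    true_neg = database_size - true_pos - false_pos - false_neg

    sensitivity = true_pos / (true_pos + false_neg)
    selectivity = true_neg / (true_neg + false_pos)
    return sensitivity, selectivity


# Hypothetical family of four sequences in a database of 100.
family = {"s1", "s2", "s3", "s4"}
strict = search_metrics({"s1", "s2"}, family, database_size=100)
inclusive = search_metrics(
    {"s1", "s2", "s3", "s4", "x1", "x2", "x3"}, family, database_size=100
)
print("strict search:   ", strict)
print("inclusive search:", inclusive)
```

In this toy case the inclusive search recovers every family member but also reports unrelated sequences, so its sensitivity rises while its selectivity falls.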
• Both BLAST and FASTA use a heuristic word method for fast pairwise sequence alignment.
– It works by finding short stretches of identical or nearly identical letters in two sequences.
– These short strings of characters are called words, which are similar to the windows used in the dot matrix method.
– The basic assumption is that two related sequences must have at least one word in common. By first identifying word matches, a longer alignment can be obtained by extending similarity regions from the words.
– Once regions of high sequence similarity are found, adjacent high-scoring regions can be joined into a full alignment.

• BLAST uses heuristics to align a query sequence with all sequences in a database.
• The objective is to find high-scoring ungapped segments among related sequences.
• The existence of such segments above a given threshold indicates pairwise similarity beyond random chance, which helps to discriminate related sequences from unrelated sequences in a database.
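
The word method can be sketched in a few lines of code. This is a simplified illustration, not the BLAST or FASTA implementation: it seeds only on identical words, scores with an assumed +1/-1 match/mismatch scheme instead of a substitution matrix, and uses a hypothetical `drop` parameter to stop the ungapped extension. All function names and sequences are made up for the example.

```python
def find_word_hits(query, subject, word_size=3):
    """Yield (query_pos, subject_pos) for every identical word shared by both sequences."""
    positions = {}
    for i in range(len(query) - word_size + 1):
        positions.setdefault(query[i:i + word_size], []).append(i)
    for j in range(len(subject) - word_size + 1):
        for i in positions.get(subject[j:j + word_size], []):
            yield i, j


def extend_ungapped(query, subject, q_pos, s_pos, word_size=3,
                    match=1, mismatch=-1, drop=2):
    """Extend a word hit in both directions without gaps.

    Returns (score, query_start, subject_start, length) of the best segment.
    """
    def extend(i, j, step):
        running = best = 0
        length = best_len = 0
        while 0 <= i < len(query) and 0 <= j < len(subject):
            running += match if query[i] == subject[j] else mismatch
            length += 1
            if running > best:
                best, best_len = running, length
            elif best - running > drop:
                break  # score fell too far below the best seen so far
            i += step
            j += step
        return best, best_len

    right_score, right_len = extend(q_pos + word_size, s_pos + word_size, 1)
    left_score, left_len = extend(q_pos - 1, s_pos - 1, -1)
    score = word_size * match + left_score + right_score
    return score, q_pos - left_len, s_pos - left_len, left_len + word_size + right_len


query = "GATTACAGG"
subject = "CCATTACATT"
best = max(extend_ungapped(query, subject, i, j)
           for i, j in find_word_hits(query, subject))
score, q_start, s_start, length = best
print(score, query[q_start:q_start + length], subject[s_start:s_start + length])
```

Every word hit on the same diagonal extends to the same ungapped segment here, which is why taking the maximum over all hits is enough for this toy example.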
22-Mar-12
• The fourth step involves pairwise alignment by extending from the words in both directions while counting the alignment score using the same substitution matrix.
– The extension continues until the score of the alignment drops below a threshold due to mismatches (the drop threshold is twenty-two for proteins and twenty for DNA).
– The resulting contiguous aligned segment pair without gaps is called a high-scoring segment pair (HSP).
– In the original version of BLAST, the highest scored HSPs are presented as the final report. They are also called maximum scoring pairs.
– A recent improvement in the implementation of BLAST is the ability to provide gapped alignment.
– In gapped BLAST, the highest scored segment is chosen to be extended in both directions using dynamic programming where gaps may be introduced.
– The extension continues if the alignment score is above a certain threshold; otherwise it is terminated.

BIT SCORE
• A bit score is another prominent statistical indicator used in addition to the E value in a BLAST output.
• The bit score measures sequence similarity independent of query sequence length and database size and is normalized based on the raw pairwise alignment score.
• The bit score (S') is determined by the following formula:
S' = (λ × S − ln K) / ln 2
– where λ is the Gumbel distribution constant, S is the raw alignment score, and K is a constant associated with the scoring matrix used.
• Clearly, the bit score (S') is linearly related to the raw alignment score (S).
– Thus, the higher the bit score, the more significant the match is.
– The bit score provides a constant statistical indicator for searching different databases of different sizes or for searching the same database at different times as the database enlarges.
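
As a quick numerical check of the bit score formula, the sketch below converts a few raw scores to bit scores. The values of λ and K are assumptions chosen only for illustration; real values depend on the scoring matrix and gap penalties used, and are reported by BLAST itself.

```python
import math


def bit_score(raw_score, lam, k):
    """Convert a raw alignment score S to a bit score S' = (lam * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


# Assumed, illustrative statistical parameters (not taken from the text).
LAMBDA = 0.3176
K = 0.134

for raw in (50, 100, 150):
    print(f"raw score {raw:3d} -> bit score {bit_score(raw, LAMBDA, K):.1f}")
```

Because the formula is linear in S, equal steps in raw score give equal steps in bit score, with slope λ / ln 2.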
FASTA
• FASTA (FAST ALL, www.ebi.ac.uk/fasta33/) was the first database similarity search tool developed, preceding the development of BLAST.

STEPS
• The first step in FASTA alignment is to identify ktups between two sequences by using the hashing strategy.
• This strategy works by constructing a lookup table that shows the position of each ktup for the two sequences under consideration.
• The positional difference for each word between the two sequences is obtained by subtracting the position of the first sequence from that of the second sequence and is expressed as the offset.
• The ktups that have the same offset values are then linked to reveal a contiguous identical sequence region that corresponds to a stretch of diagonal in a two-dimensional matrix.

• The second step is to narrow down the high similarity regions between the two sequences.
– Normally, many diagonals between the two sequences can be identified in the hashing step.
– The top ten regions with the highest density of diagonals are identified as high similarity regions. The diagonals in these regions are scored using a substitution matrix.
– Neighboring high-scoring segments along the same diagonal are selected and joined to form a single alignment.
– This step allows introducing gaps between the diagonals while applying gap penalties.
– The score of the gapped alignment is calculated again.
• In step 3, the gapped alignment is refined further using the Smith–Waterman algorithm to produce a final alignment.
• The last step is to perform a statistical evaluation of the final alignment as in BLAST, which produces the E-value.
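
The hashing strategy of step 1 can be illustrated with a short sketch. It is a minimal example under stated assumptions, not FASTA's actual code: the function names, the ktup size of 2, and the two toy sequences are all hypothetical. The offset is computed as described above, as the position in the second sequence minus the position in the first.

```python
from collections import defaultdict


def ktup_lookup(seq, ktup=2):
    """Build the lookup table: each ktup maps to the positions where it occurs."""
    table = defaultdict(list)
    for pos in range(len(seq) - ktup + 1):
        table[seq[pos:pos + ktup]].append(pos)
    return table


def offsets_by_diagonal(seq1, seq2, ktup=2):
    """Group shared ktups by offset (position in seq2 minus position in seq1).

    Returns {offset: sorted positions in seq1}; ktups sharing an offset
    lie on the same diagonal of the two-dimensional comparison matrix.
    """
    table1 = ktup_lookup(seq1, ktup)
    diagonals = defaultdict(list)
    for pos2 in range(len(seq2) - ktup + 1):
        for pos1 in table1.get(seq2[pos2:pos2 + ktup], []):
            diagonals[pos2 - pos1].append(pos1)
    return {offset: sorted(positions) for offset, positions in diagonals.items()}


diagonals = offsets_by_diagonal("ACGTAC", "TTACGT")
densest = max(diagonals, key=lambda offset: len(diagonals[offset]))
print(diagonals)
print("densest diagonal has offset", densest)
```

Here the three consecutive ktups at offset 2 correspond to the shared region ACGT; ranking diagonals by how many ktups they hold mirrors the idea of keeping the densest regions in step 2.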
Steps of the FASTA alignment procedure. In step 1 (left), all possible ungapped
alignments are found between two sequences with the hashing method. In step 2
(middle), the alignments are scored according to a particular scoring matrix. Only
the ten best alignments are selected. In step 3 (right), the alignments in the same
diagonal are selected and joined to form a single gapped alignment, which is
optimized using the dynamic programming approach.
• The web-based FASTA program offered by the European Bioinformatics Institute (www.ebi.ac.uk/) allows the use of either DNA or protein sequences as the query to search against a protein database or nucleotide database.
• FASTX translates a DNA sequence and uses the translated protein sequence to query a protein database.
• TFASTX compares a protein query sequence to a translated DNA database.

• In addition to the E-value, the FASTA output provides one more statistical parameter, the Z-score.
– This describes the number of standard deviations from the mean score for the database search.
• Because most of the alignments with the query sequence are with unrelated sequences, the higher the Z-score for a reported match, the further it lies from the mean of the score distribution and, hence, the more significant the match.
– For a Z-score > 15, the match can be considered extremely significant, with certainty of a homologous relationship.
– If Z is in the range of 5 to 15, the sequence pair can be described as highly probable homologs.
– If Z < 5, their relationship is described as less certain.
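
The Z-score thresholds can be applied with a few lines of code. The sketch below is a simplification: it assumes the Z-score is simply (score − mean) / standard deviation over a list of background scores, and the function names and score values are hypothetical. How FASTA itself estimates the mean and standard deviation is not covered here.

```python
import statistics


def z_score(score, background_scores):
    """Number of standard deviations a score lies above the background mean."""
    mean = statistics.mean(background_scores)
    sd = statistics.stdev(background_scores)
    return (score - mean) / sd


def interpret(z):
    """Map a Z-score to the categories listed above."""
    if z > 15:
        return "extremely significant"
    if z >= 5:
        return "highly probable homologs"
    return "less certain"


# Hypothetical scores of alignments with unrelated database sequences.
background = [20, 22, 24, 26, 28]

for hit_score in (74, 50, 30):
    z = z_score(hit_score, background)
    print(f"score {hit_score}: Z = {z:.1f} ({interpret(z)})")
```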