Algorithm Design and Scoring Matrices PDF
Algorithm
• A sequence of instructions that leads to an end state (the output)
Example
Search the entire genome for all plausible genes
Genes start with 'ATG'
• Given a genome, search for all occurrences of ATG
and report their positions in the genome.
• Read in genome sequence
• Store in program/data structure
• Search for start codon
• If A, check if next is T
• If T, check if next is G
• Output position
• Rough outline?
• Pseudo code?
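The steps above can be sketched in Python. This is a minimal sketch, not a gene finder: the function name find_start_codons and the example genome are made up for illustration, and positions are reported 0-based.

```python
def find_start_codons(genome):
    """Return the 0-based position of every 'ATG' in the genome string."""
    genome = genome.upper()          # store the sequence in a plain string
    positions = []
    for i in range(len(genome) - 2):
        # If A, check if next is T; if T, check if next is G.
        if genome[i] == "A" and genome[i + 1] == "T" and genome[i + 2] == "G":
            positions.append(i)      # output position
    return positions


# Hypothetical toy genome, for illustration only.
print(find_start_codons("CCATGGATGTTATG"))  # [2, 6, 11]
```

Overlapping occurrences cannot happen for ATG, so a simple left-to-right scan is enough here.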
Pseudo code
• What is a rough outline?
• An outline of an algorithm
• Cannot be compiled or executed
• Pseudo code
• There are no real formatting or syntax rules
• Purpose?
• Enables the programmer to concentrate on the algorithm itself rather than on language syntax
• Programming language independent
Algorithm design technique
• Intuitive algorithm
• "Given a genome sequence, make a copy of it."
• Abstract algorithm
• Flow chart or pseudo code
String Copy(s, n)
  for i ← 1 to n
    t_i ← s_i
  return t
Implemented algorithm
Programming language (C, Perl, R, Java)
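As an illustration of the last stage, the Copy(s, n) pseudo code could be implemented as follows. Python is used here only as an example language (the slide lists C, Perl, R, Java); the name copy_sequence is hypothetical.

```python
def copy_sequence(s, n):
    """Copy the first n characters of s, one position at a time."""
    t = [""] * n
    for i in range(n):   # the pseudo code counts from 1, Python counts from 0
        t[i] = s[i]
    return "".join(t)


print(copy_sequence("ATGCCGTA", 8))  # ATGCCGTA
```

In practice a Python string can be reused directly; the explicit loop is kept only to mirror the pseudo code step by step.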
Dynamic programming
Used to derive pairwise alignments
1. Needleman-Wunsch for global alignments
2. Smith-Waterman for local alignments
• Advantage?
• Disadvantage?
Dynamic programming
Design
1. Break a problem into sub-problems
2. Construct an optimal solution for each sub-problem
3. Derive the overall optimal solution by combining
solutions for each sub-problem, without
recomputing already computed solutions
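The three design steps can be seen in a small global alignment in the style of Needleman-Wunsch. This is a sketch under assumed scoring values (match +1, mismatch -1, linear gap penalty -2), which are not taken from the slides. Each table cell F[i][j] is a sub-problem (best score for two prefixes), it is computed once, and the final answer is read from the last cell.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment of a and b; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    # F[i][j] = best score for aligning a[:i] with b[:j] (one sub-problem each)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap
    for j in range(1, m + 1):
        F[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            # Combine already computed sub-solutions; nothing is recomputed.
            F[i][j] = max(F[i - 1][j - 1] + s,   # align a[i-1] with b[j-1]
                          F[i - 1][j] + gap,     # gap in b
                          F[i][j - 1] + gap)     # gap in a
    # Traceback from the last cell to recover one optimal alignment.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + s:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return F[n][m], "".join(reversed(top)), "".join(reversed(bottom))


print(needleman_wunsch("ACGT", "ACT"))  # (1, 'ACGT', 'AC-T')
```

A local alignment in the style of Smith-Waterman uses the same table but additionally allows a score of 0 in every cell and starts the traceback from the highest-scoring cell.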
Gapped and ungapped alignments
Extended gap penalty
Basic Local Alignment Search Tool
(BLAST)
• Generate a list of words
• Break down the query sequence into words, i.e.
subsequences of length w (here w=4)
• For each word, create similar words, using a
substitution matrix
• Search the database for exact matches to these words, called seeds
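The word list and seed search can be sketched as follows. This is a toy illustration, not the BLAST implementation: it works on DNA, replaces the substitution matrix with an assumed match/mismatch score (+1/-1), and the names query_words, neighbourhood, find_seeds and the threshold value are hypothetical.

```python
from itertools import product

ALPHABET = "ACGT"


def word_score(x, y, match=1, mismatch=-1):
    """Toy stand-in for a substitution matrix: +1 per match, -1 per mismatch."""
    return sum(match if a == b else mismatch for a, b in zip(x, y))


def query_words(query, w=4):
    """All overlapping words of length w, with their position in the query."""
    return [(i, query[i:i + w]) for i in range(len(query) - w + 1)]


def neighbourhood(word, threshold):
    """All words of the same length that score at least threshold against word."""
    candidates = ("".join(c) for c in product(ALPHABET, repeat=len(word)))
    return [c for c in candidates if word_score(word, c) >= threshold]


def find_seeds(query, db_seq, w=4, threshold=2):
    """Exact matches of any similar word in db_seq: (query pos, db pos, word)."""
    seeds = []
    for q_pos, word in query_words(query, w):
        for similar in neighbourhood(word, threshold):
            start = db_seq.find(similar)
            while start != -1:
                seeds.append((q_pos, start, similar))
                start = db_seq.find(similar, start + 1)
    return seeds


print(query_words("ACGTA"))                           # [(0, 'ACGT'), (1, 'CGTA')]
print(find_seeds("ACGT", "TTACGTTT", threshold=4))    # [(0, 2, 'ACGT')]
```

With w=4 and this toy scoring, a threshold of 2 admits the word itself plus every word with a single mismatch (13 words in total); a threshold of 4 admits only the exact word.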
Extend match – MSP
• Extend each hit in both directions to obtain a locally maximal segment pair (MSP).
• Terminate when the score for a segment pair falls below a certain
threshold
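A sketch of ungapped extension around one seed is given below. The stopping rule is an assumption: extension in each direction stops once the running score has dropped more than drop_off below the best score seen so far, which is one common way to read the threshold above. The scoring values, drop_off and the name extend_seed are hypothetical.

```python
def extend_seed(query, subject, q_start, s_start, w=4,
                match=1, mismatch=-1, drop_off=2):
    """Extend a seed of length w in both directions without gaps.

    Returns (score, query segment, subject segment) of the best segment pair.
    """
    def pair(i, j):
        return match if query[i] == subject[j] else mismatch

    seed_score = sum(pair(q_start + k, s_start + k) for k in range(w))

    def extend(q, s, step):
        """Walk in one direction; return (best gain, length at best gain)."""
        best = running = 0
        best_len = length = 0
        while 0 <= q < len(query) and 0 <= s < len(subject):
            running += pair(q, s)
            length += 1
            if running > best:
                best, best_len = running, length
            elif best - running > drop_off:
                break                      # score fell too far: terminate
            q += step
            s += step
        return best, best_len

    right_gain, right_len = extend(q_start + w, s_start + w, +1)
    left_gain, left_len = extend(q_start - 1, s_start - 1, -1)
    score = seed_score + left_gain + right_gain
    q_segment = query[q_start - left_len:q_start + w + right_len]
    s_segment = subject[s_start - left_len:s_start + w + right_len]
    return score, q_segment, s_segment


# Seed "ACGT" found at position 2 in both toy sequences.
print(extend_seed("CCACGTAAGG", "TTACGTAATT", 2, 2))  # (6, 'ACGTAA', 'ACGTAA')
```

Only the part of the extension that reached the best score is kept, so trailing mismatches are trimmed from the reported segment pair.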
Score
• A number used to assess the biological
relevance of a finding.
• In the context of sequence alignments, a score
is a numerical value that describes the overall
quality of an alignment.
• Higher numbers correspond to higher
similarity.
• The score scale depends on the scoring system
used (substitution matrix, gap penalty).
Bit score
• The bit score gives an indication of how good
the alignment is; the higher the score, the
better the alignment.
• In general terms, this score is calculated from a
formula that takes into account the alignment
of similar or identical residues, as well as any
gaps introduced to align the sequences.
Bit-score:
• A log-scaled version of a score.
• Max score = highest alignment score (bit-score) between the
query sequence and the database sequence segment.
• In the context of sequence alignments (BLAST), the bit-score
S' is a normalized score expressed in bits that lets you estimate
the magnitude of the search space you would have to look
through before you would expect to find a score as good as or
better than this one by chance.
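The relation between raw score, bit-score and search space can be sketched with the standard Karlin-Altschul formulas: S' = (λS − ln K) / ln 2 and E = m · n · 2^(−S'), where m and n are the query and database lengths. The λ and K values used in the example call are illustrative placeholders; real values depend on the scoring system.

```python
import math


def bit_score(raw_score, lam, K):
    """Normalize a raw alignment score S to bits: (lam * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(K)) / math.log(2)


def expected_hits(bits, query_len, db_len):
    """Expected number of chance hits with at least this bit-score."""
    return query_len * db_len * 2 ** (-bits)


# Illustrative parameters only (assumed, not taken from the slides).
bits = bit_score(raw_score=100, lam=0.267, K=0.041)
print(round(bits, 1), expected_hits(bits, query_len=300, db_len=1_000_000))
```

Each additional bit halves the expected number of chance hits, which is why bit-scores from searches with different scoring systems can be compared directly.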
• Guideline for choosing a PAM matrix
• PAM40 for highly similar sequences
• PAM70 for moderately similar sequences
• PAM250 for highly divergent sequences
Similarity scoring
• BLOcks SUbstitution Matrix
• BLOSUM matrix
• Derived from blocks, i.e. ungapped local alignments,
with different levels of identity
• E.g., BLOSUM62 is derived from blocks in which sequences
sharing at least 62% identity are clustered and counted as one
• Henikoff & Henikoff (1992) used about 2000 blocks,
representing more than 500 protein groups
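How scores are derived from a block can be sketched as a log-odds calculation: count residue pairs in each column, compare the observed pair frequency with the frequency expected by chance, and take 2·log2 of the ratio (half-bit units). This is a simplified sketch: the clustering by identity (the "62" in BLOSUM62) is omitted, the toy block is made up, and the name log_odds_scores is hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def log_odds_scores(block):
    """block: list of equal-length, ungapped aligned sequences.

    Returns {(a, b): score} for every residue pair observed in a column.
    """
    pair_counts = Counter()
    for column in zip(*block):
        for a, b in combinations(column, 2):
            pair_counts[tuple(sorted((a, b)))] += 1
    total_pairs = sum(pair_counts.values())
    # Observed frequency q_ab of each unordered residue pair.
    q = {pair: count / total_pairs for pair, count in pair_counts.items()}
    # Frequency p_a of each residue among the pairs.
    p = Counter()
    for (a, b), freq in q.items():
        if a == b:
            p[a] += freq
        else:
            p[a] += freq / 2
            p[b] += freq / 2
    scores = {}
    for (a, b), freq in q.items():
        expected = p[a] * p[b] if a == b else 2 * p[a] * p[b]
        scores[(a, b)] = round(2 * math.log2(freq / expected))
    return scores


# Hypothetical toy block of four aligned protein segments.
for pair, score in sorted(log_odds_scores(["ASKL", "ASRL", "GSKL", "ATKL"]).items()):
    print(pair, score)
```

Pairs that never occur in the block get no entry here; a real matrix is built from enough blocks that every pair is observed.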
PAM vs. BLOSUM
• Roughly equivalent PAM and BLOSUM
matrices
• PAM100 <=> BLOSUM90
• PAM120 <=> BLOSUM80
• PAM160 <=> BLOSUM60
• PAM200 <=> BLOSUM52
• PAM250 <=> BLOSUM45