Algorithm Design and Scoring Matrices PDF

Uploaded by

Jahan Rana

Bioinformatics

Muhammad Muddassir Ali


IBBT
Topics of this lecture
• Course aims and learning goals
• Concept of a bioinformatics algorithm
• Different algorithmic methods
• Home task
Learning objectives
The student should be able to:
• Understand the concept of a bioinformatics algorithm
• Understand the basic algorithms behind sequence alignment


What is an algorithm?

A procedure, i.e. a sequence or set of instructions, for solving a well-formulated problem.
Example
[Diagram: initial state (input) → sequence of instructions (algorithm) → end state (output)]
Example
Search the entire genome for all plausible genes
Genes start with 'ATG'
• Given a genome, search for all occurrences of ATG and report their positions in the genome.
• Read in the genome sequence
• Store it in a program data structure
• Search for the start codon
• If A, check if the next base is T
• If T, check if the next base is G
• Output the position
• Rough outline?
• Pseudo code?
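The start-codon search outlined above can be sketched in Python. This is a minimal illustration rather than the lecture's own code: the function name `find_start_codons` and the use of 0-based positions are assumptions made here.

```python
def find_start_codons(genome: str) -> list[int]:
    """Return the 0-based positions of every 'ATG' in the genome."""
    genome = genome.upper()  # tolerate lower-case input
    positions = []
    # Stop two bases before the end so that i + 2 is always a valid index.
    for i in range(len(genome) - 2):
        # If A, check whether the next base is T, then whether the next is G.
        if genome[i] == "A" and genome[i + 1] == "T" and genome[i + 2] == "G":
            positions.append(i)
    return positions


if __name__ == "__main__":
    print(find_start_codons("ATGCCATGA"))  # prints [0, 5]
```

Overlapping occurrences cannot arise for ATG, so a simple left-to-right scan reports every start codon exactly once.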
Pseudo code
• What is a rough outline?
• An outline of an algorithm
• Cannot be compiled or executed
• Pseudo code
• There are no real formatting or syntax rules
• Purpose?
• Enables the programmer to concentrate on the implementation of the algorithm
• Programming-language independent
Algorithm design technique
• Intuitive algorithm
• "Given a genome sequence, make a copy of it."
• Abstract algorithm
• Flow chart or pseudo code
StringCopy(s, n)
    for i ← 1 to n
        t_i ← s_i
    return t
Implemented algorithm
Programming language (C, Perl, R, Java)
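As one possible implemented version of the StringCopy pseudo code, here is a short Python sketch. Python is not among the languages listed on the slide; it is used here only for illustration, and the explicit loop mirrors the pseudo code rather than idiomatic Python (where `t = s` or slicing would normally be enough).

```python
def string_copy(s: str) -> str:
    """Copy s character by character, following the StringCopy pseudo code."""
    t = []
    for i in range(len(s)):   # pseudo code: for i <- 1 to n (0-based here)
        t.append(s[i])        # pseudo code: t_i <- s_i
    return "".join(t)         # pseudo code: return t


if __name__ == "__main__":
    print(string_copy("ATGCCATGA"))  # prints ATGCCATGA
```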
Dynamic programming
Used to derive pairwise alignments
1. Needleman-Wunsch for global alignments
2. Smith-Waterman for local alignments

• Advantage?
• Disadvantage?
Dynamic programming

Design
1. Break a problem into sub-problems
2. Construct an optimal solution for each sub-problem
3. Derive the overall optimal solution by combining the solutions for each sub-problem, without recomputing already computed solutions
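The three design steps can be illustrated with a minimal Needleman-Wunsch sketch that computes only the optimal global alignment score. The scoring values (match = 1, mismatch = -1, linear gap = -1) and the function name are assumptions chosen for illustration; they do not come from the lecture.

```python
def needleman_wunsch_score(a: str, b: str,
                           match: int = 1, mismatch: int = -1,
                           gap: int = -1) -> int:
    """Optimal global alignment score of a and b (linear gap penalty)."""
    rows, cols = len(a) + 1, len(b) + 1
    # score[i][j] = best score for aligning a[:i] with b[:j] (a sub-problem).
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap          # a[:i] aligned against gaps only
    for j in range(1, cols):
        score[0][j] = j * gap          # b[:j] aligned against gaps only
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            # Combine already computed sub-solutions; nothing is recomputed.
            score[i][j] = max(score[i - 1][j - 1] + pair,  # align a[i-1], b[j-1]
                              score[i - 1][j] + gap,       # gap in b
                              score[i][j - 1] + gap)       # gap in a
    return score[-1][-1]


if __name__ == "__main__":
    print(needleman_wunsch_score("AC", "AGC"))  # prints 1
```

Each table cell is filled once from its three neighbours, so the running time grows with the product of the two sequence lengths.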
Gapped and ungapped alignments
Extended gap penalty
Basic Local Alignment Search Tool
(BLAST)
• Generate a list of words
• Break down the query sequence into words, i.e. subsequences of length w (here w = 4)
• For each word, create similar words using a substitution matrix
• Search the database (DB) for exact matches, called seeds
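The word-list and seed-search steps can be sketched as follows. This simplified illustration looks up only the exact query words; it omits the "similar words" step that needs a substitution matrix. The function names `make_words` and `find_seeds` are hypothetical.

```python
def make_words(query: str, w: int = 4) -> list[tuple[int, str]]:
    """Break the query into all overlapping words of length w."""
    return [(i, query[i:i + w]) for i in range(len(query) - w + 1)]


def find_seeds(query: str, subject: str, w: int = 4) -> list[tuple[int, int]]:
    """Return (query_pos, subject_pos) pairs where a query word matches exactly."""
    # Index every word of the database sequence by its start positions.
    index: dict[str, list[int]] = {}
    for j in range(len(subject) - w + 1):
        index.setdefault(subject[j:j + w], []).append(j)
    seeds = []
    for i, word in make_words(query, w):
        for j in index.get(word, []):
            seeds.append((i, j))
    return seeds


if __name__ == "__main__":
    print(make_words("ACGTA"))             # prints [(0, 'ACGT'), (1, 'CGTA')]
    print(find_seeds("ACGTA", "TTACGTT"))  # prints [(0, 2)]
```

Indexing the database words in a dictionary means each query word is looked up directly instead of rescanning the database.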
Basic Local Alignment Search Tool (BLAST)
• Extend match: maximal segment pair (MSP)
• Extend the hit in both directions to obtain a locally maximal segment pair (MSP).
• Terminate when the score for a segment pair falls below a certain threshold
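A rough sketch of ungapped seed extension is given below. The stopping rule used here (stop once the running score has dropped by `drop` below the best score seen so far), the match/mismatch values and all function names are assumptions made for illustration. The slide only says that extension ends when the score falls below a threshold, and real BLAST scores residues with a substitution matrix.

```python
def _extend(q, s, qi, sj, step, match, mismatch, drop):
    """Walk from (qi, sj) in direction step; return (best_gain, best_length)."""
    gain = best_gain = 0
    length = best_len = 0
    while 0 <= qi < len(q) and 0 <= sj < len(s):
        gain += match if q[qi] == s[sj] else mismatch
        length += 1
        if gain > best_gain:
            best_gain, best_len = gain, length
        elif best_gain - gain >= drop:   # assumed termination rule
            break
        qi += step
        sj += step
    return best_gain, best_len


def extend_seed(q, s, qi, sj, w, match=1, mismatch=-1, drop=2):
    """Extend an exact seed of length w in both directions.

    Returns (score, query_start, query_end), with query_end exclusive.
    """
    right_gain, right_len = _extend(q, s, qi + w, sj + w, 1, match, mismatch, drop)
    left_gain, left_len = _extend(q, s, qi - 1, sj - 1, -1, match, mismatch, drop)
    score = w * match + left_gain + right_gain
    return score, qi - left_len, qi + w + right_len


if __name__ == "__main__":
    # Seed "ACGT" at position 2 in both sequences.
    print(extend_seed("TTACGTAA", "GGACGTAC", 2, 2, 4))  # prints (5, 2, 7)
```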
Score
• A number used to assess the biological
relevance of a finding.
• In the context of sequence alignments, a score
is a numerical value that describes the overall
quality of an alignment.
• Higher numbers correspond to higher
similarity.
• The score scale depends on the scoring system
used (substitution matrix, gap penalty).
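To show how a score depends on the scoring system, the sketch below sums substitution-matrix entries and gap penalties over an existing alignment. The three-letter `TOY_MATRIX` and the gap penalty of -4 are invented values for illustration only; they are not taken from PAM, BLOSUM or the lecture.

```python
# Hypothetical toy substitution scores (symmetric; each pair stored once).
TOY_MATRIX = {
    ("A", "A"): 4, ("G", "G"): 6, ("W", "W"): 11,
    ("A", "G"): 0, ("A", "W"): -3, ("G", "W"): -2,
}


def alignment_score(row1: str, row2: str, matrix: dict, gap: int = -4) -> int:
    """Score two aligned rows of equal length; '-' marks a gap position."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for x, y in zip(row1, row2):
        if x == "-" or y == "-":
            total += gap                      # linear gap penalty per position
        elif (x, y) in matrix:
            total += matrix[(x, y)]
        else:
            total += matrix[(y, x)]           # symmetric lookup
    return total


if __name__ == "__main__":
    print(alignment_score("AGW-", "AGWA", TOY_MATRIX))  # prints 17
```

Changing the matrix or the gap penalty changes the number, which is why raw scores from different scoring systems are not directly comparable.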
Bit score
• The bit score gives an indication of how good
the alignment is; the higher the score, the
better the alignment.
• In general terms, this score is calculated from a
formula that takes into account the alignment
of similar or identical residues, as well as any
gaps introduced to align the sequences.
Bit-score:
• A log-scaled version of a score.
• Max score = highest alignment score (bit-score) between the query sequence and the database sequence segment.
• In the context of sequence alignments (BLAST), the bit-score S' is a normalized score expressed in bits that lets you estimate the magnitude of the search space you would have to look through before you would expect to find a score as good as or better than this one by chance.
• S' = (λS − ln K) / ln 2, where S is the raw score. Parameters λ and K depend on the substitution matrix and on the gap penalties (Altschul).
• The bit-score is thus a rescaled version of the raw alignment score that is independent of the size of the search space.
• Total score = sum of alignment scores of all segments from the
same database sequence that match the query sequence
(calculated over all segments). This score is different from the
max score if several parts of the database sequence match
different parts of the query sequence
• Query coverage = percent of the query length that is included
in the aligned segments. This coverage is calculated over all
segments (cf. total score).
• E-value = number of alignments expected by chance with a
particular score or better. The expect value is the default
sorting metric and normally gives the same sorting order as
Max score.
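The rescaling from raw score to bit-score can be sketched with the standard Karlin-Altschul relation S' = (λS − ln K) / ln 2. The λ and K values in the usage example are illustrative assumptions; the correct values depend on the substitution matrix and gap penalties and are reported by BLAST for each search.

```python
import math


def bit_score(raw_score: float, lam: float, k: float) -> float:
    """Convert a raw alignment score S into a bit-score S'."""
    return (lam * raw_score - math.log(k)) / math.log(2)


if __name__ == "__main__":
    # Illustrative parameters only (assumed, not taken from the lecture).
    print(round(bit_score(100, lam=0.267, k=0.041), 1))
```

With λ = ln 2 and K = 1 the bit-score equals the raw score, which shows that the formula only changes the units of the score.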
BLAST – E-value

• Strictly speaking, the E-value is not a chance or a probability value, but rather the estimate of how many times (this means "counts") you would expect a result (e.g. a score in a sequence comparison) at least as extreme as the one observed to occur by chance.
• A value close to zero means that you would expect practically no unrelated sequence to score as high against your query sequence. E-values cannot be negative.
• Measures the reliability of a match
• Given a match with score S, E is the expected number of matches with score S or higher

• Lower E-value = more reliable match
• E-values are often very small, such as E = 0.00001 (E = 10^-5, i.e. 10 raised to the power -5)
• 2e-6 = 2 × 10^-6

• E-val(S) = P-val(S) * N, where N is the size of the search space (N = n*m, where n is the length of the query sequence and m is the length of the database).
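A small sketch of the E-value calculation follows. It uses the standard Karlin-Altschul form E = n · m · 2^(−S'), where S' is the bit-score; this form is not written on the slides, and the term 2^(−S') plays the role of the per-position P-value in the relation above. Edge-effect corrections to n and m, which real BLAST applies, are ignored here.

```python
def e_value(bit_score: float, query_len: int, db_len: int) -> float:
    """Expected number of chance alignments scoring at least bit_score."""
    search_space = query_len * db_len        # N = n * m
    return search_space * 2.0 ** (-bit_score)


if __name__ == "__main__":
    print(e_value(10, 100, 1024))        # prints 100.0
    print(f"{e_value(40, 100, 1024):.1e}")  # scientific notation, as BLAST reports it
```

Every additional bit of score halves the E-value, while doubling the database size doubles it.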
Similarity scoring
Similarity scoring matrices for proteins
• Point Accepted Mutation (PAM)
• PAM1
• Observed substitution rates when 1% of the amino acids have changed, i.e. 1 mutation per 100 aa
• Dayhoff (1978) used 71 protein families with 1572 mutations
• Matrices with higher PAM values are derived from PAM1
• PAM250 = 250 mutations per 100 aa
• Due to back mutations and silent mutations, sequences at PAM250 are still ~20% identical
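Higher PAM matrices are obtained by multiplying the PAM1 mutation probability matrix by itself. The sketch below demonstrates this with a hypothetical two-letter alphabet in which each letter changes with probability 1% per step; the real PAM1 matrix has 20 × 20 entries estimated from Dayhoff's data, which are not reproduced here.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, power):
    """Raise a square matrix to a non-negative integer power."""
    size = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(power):
        result = mat_mul(result, m)
    return result


# Hypothetical toy "PAM1": 99% chance a letter is unchanged after one step.
TOY_PAM1 = [[0.99, 0.01],
            [0.01, 0.99]]
TOY_PAM250 = mat_pow(TOY_PAM1, 250)

if __name__ == "__main__":
    # Probability that a position shows the same letter after 250 steps.
    print(round(TOY_PAM250[0][0], 3))
```

Even after 250 steps roughly half of the positions in this toy model still show the original letter, because later changes can revert earlier ones. The same effect explains why real sequences at PAM250 retain some identity.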
Point Accepted Mutation (PAM)

• Guideline
• PAM40 for highly similar sequences
• PAM70 for moderately similar sequences
• PAM250 for highly divergent sequences
Similarity scoring
• BLOcks of Amino Acid SUbstitution Matrix
• BLOSUM matrix
• Derived from blocks, i.e. ungapped local alignments,
with different levels of identity
• E.g., BLOSUM62 is derived from blocks in which sequences that are 62% or more identical are clustered and counted as one
• Henikoff & Henikoff (1992) used 2000 blocks,
representing 500 protein groups
PAM vs. BLOSUM
• Roughly equivalent PAM and BLOSUM
matrices
• PAM100 <=> BLOSUM90
• PAM120 <=> BLOSUM80
• PAM160 <=> BLOSUM60
• PAM200 <=> BLOSUM52
• PAM250 <=> BLOSUM45