Bioinformatics Session8
P: Protein
N: Nucleotide
X: a DNA query is dynamically translated into
six protein sequences (one per reading frame)
T: “translating,” where a DNA database is
dynamically translated into six proteins.
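The six-frame translation behind the X and T letters can be sketched in a few lines. This is a minimal illustration, not BLAST's own code: it assumes the standard genetic code, an uppercase or lowercase DNA string, and hypothetical helper names (`reverse_complement`, `translate`, `six_frame_translations`). Codons containing anything other than A, C, G, T are shown as X.

```python
from itertools import product

# Standard genetic code with codons ordered T, C, A, G at each position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def reverse_complement(dna):
    """Return the reverse complement of an uppercase DNA string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons from position 0; '*' marks a stop codon."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translations(dna):
    """Translate three frames on the given strand and three on its reverse complement."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


frames = six_frame_translations("ATGGCCTAA")
for name, protein in frames.items():
    print(name, protein)
```

Frame +1 of the toy sequence reads ATG GCC TAA, so it translates to M, A and a stop codon.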
Understanding BLAST with human protein
RBP4 (NP_006735.2) as an example:
• RBP4: Retinol Binding Protein 4
• Accession number: NP_006735.2
• Open NCBI-Blast web page:
https://blast.ncbi.nlm.nih.gov/Blast.cgi
Expect value, Max score and Total score
• Expect value (E): the number of different alignments with scores equivalent
to or better than S that are expected to occur by chance in a database search.
• This provides an estimate of the number of false positive results from a
BLAST search.
• Max score: highest alignment score (bit-score) between the query
sequence and the database sequence segment.
• Total score: sum of alignment scores of all segments that match the query
sequence.
• This score is different from the max score if several parts of the
database sequence match different parts of the query sequence.
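The difference between the two scores can be shown with a small calculation. This is only a sketch: the function name `max_and_total_score` and the bit scores below are made-up illustration values, not results from an actual RBP4 search.

```python
def max_and_total_score(hsp_bit_scores):
    """Return (max score, total score) for the aligned segments of one database sequence."""
    return max(hsp_bit_scores), sum(hsp_bit_scores)


# One database sequence matching the query in a single segment:
print(max_and_total_score([200.0]))

# One database sequence matching two different parts of the query:
print(max_and_total_score([120.0, 45.5]))
```

With a single matching segment the two scores are equal. With two segments the total score exceeds the max score.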
Raw scores and Bit scores
• Raw scores (S) are calculated from the substitution matrix and gap
penalty parameters that are chosen. Raw scores are unitless.
• The bit score (S′) is calculated from the raw score by normalizing
with the statistical variables that define a given scoring system.
It is expressed in bits (information content).
Bit scores can tell you the E value if you know the size of the
search space, m × n (query length × database length)
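The relationship between raw score, bit score and E value can be sketched with the Karlin-Altschul formulas S′ = (λS − ln K) / ln 2 and E = m × n × 2^(−S′). The function names are hypothetical, and the λ, K, m and n values in the example are illustrative assumptions only. The statistics that apply to a given search are printed in its BLAST report.

```python
import math


def bit_score(raw_score, lam, k):
    """Normalize a raw score S to a bit score S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def expect_value(bits, m, n):
    """E = m * n * 2^(-S'), where m * n is the size of the search space."""
    return m * n * 2.0 ** (-bits)


# Illustrative (assumed) parameters, not taken from a real search.
lam, k = 0.267, 0.041
m, n = 200, 1e8          # query length and total database length

s_prime = bit_score(100, lam, k)
print(f"bit score = {s_prime:.1f}, E = {expect_value(s_prime, m, n):.2e}")
```

Because the search space enters only as the factor m × n, each additional bit of score halves the E value for a fixed search space.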
BLASTP Algorithm Parts: List, Scan, Extend
Default W = 3?
Amino acids: 20
20^3 = 8,000 possible words
20^4 = 160,000 words
20^5 = 3,200,000 words
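The word counts above follow directly from the alphabet size. A short sketch, using hypothetical function names and an arbitrary example query length, reproduces them and also counts the overlapping words in a query of length L, which is L − W + 1.

```python
def possible_words(word_size, alphabet_size=20):
    """Number of distinct words of the given length over the amino acid alphabet."""
    return alphabet_size ** word_size


def words_in_query(query_length, word_size=3):
    """Number of overlapping words of length W in a query of length L: L - W + 1."""
    return max(query_length - word_size + 1, 0)


for w in (3, 4, 5):
    print(f"W = {w}: {possible_words(w):,} possible words")

# A query of 100 residues (arbitrary example length) contains 98 words of length 3.
print(words_in_query(100, 3))
```

A larger W gives a more selective but less sensitive seed, since the number of possible words grows by a factor of 20 with each extra residue.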
Phase 1: Seed & List
BLOSUM62 matrix
• The BLASTP algorithm can be described in three phases:
• For protein searches, BLAST compiles a preliminary list of pairwise
alignments called word pairs.
• The algorithm scans a database for word pairs that meet some threshold
score T. When this occurs, such hits are extended using ungapped and
gapped alignments. BLAST extends the word pairs to find those that surpass
a cutoff score S, at which point those hits will be reported to the user. Scores
are calculated from scoring matrices (such as BLOSUM62) along with gap
penalties.
• A trace‐back procedure is performed in which the locations of insertions,
deletions and mismatches are assigned.
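The list, scan and extend phases can be illustrated with a toy sketch. It is heavily simplified and is not the BLASTP implementation: it assumes exact word matches instead of neighborhood words above a threshold T, a made-up match/mismatch scoring in place of BLOSUM62, ungapped extension only, and no trace-back. All names and parameter values (`seed_and_extend`, `cutoff`, `x_drop`) are hypothetical.

```python
from collections import defaultdict

MATCH, MISMATCH = 2, -1  # toy scoring scheme, NOT BLOSUM62


def pair_score(a, b):
    return MATCH if a == b else MISMATCH


def build_word_index(query, w=3):
    """List phase: map every length-w word of the query to its start positions."""
    index = defaultdict(list)
    for i in range(len(query) - w + 1):
        index[query[i:i + w]].append(i)
    return index


def _extend(query, subject, q, s, step, x_drop):
    """Walk in one direction without gaps; return (best score gain, residues kept)."""
    best = run = 0
    kept = length = 0
    while 0 <= q < len(query) and 0 <= s < len(subject) and best - run <= x_drop:
        run += pair_score(query[q], subject[s])
        length += 1
        if run > best:
            best, kept = run, length
        q += step
        s += step
    return best, kept


def seed_and_extend(query, subject, w=3, cutoff=8, x_drop=3):
    """Return (query_start, subject_start, length, score) for segments scoring >= cutoff."""
    index = build_word_index(query, w)
    hits = set()
    for si in range(len(subject) - w + 1):              # scan phase
        for qi in index.get(subject[si:si + w], ()):    # exact word hit = seed
            right, r_len = _extend(query, subject, qi + w, si + w, +1, x_drop)
            left, l_len = _extend(query, subject, qi - 1, si - 1, -1, x_drop)
            score = MATCH * w + left + right            # extend phase
            if score >= cutoff:
                hits.add((qi - l_len, si - l_len, w + l_len + r_len, score))
    return sorted(hits, key=lambda h: -h[3])


print(seed_and_extend("ACDEFGHIK", "WWCDEFGHWW"))
```

In the example, four overlapping seeds all extend to the same six-residue segment (CDEFGH), so it is reported once. A real search would score residue pairs with BLOSUM62, apply gap penalties during gapped extension, and report alignments with their bit scores and E values.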
Main page of BLASTP search
Select by Author
Select by Algorithm
Search parameters
(3) This setting specifies the statistical significance threshold for reporting
matches against database sequences.
The default value (10) means that 10 such matches are expected to be
found merely by chance, according to the stochastic model of Karlin and
Altschul (1990).
If the statistical significance ascribed to a match is greater than the EXPECT
threshold, the match will not be reported. Lower EXPECT thresholds are
more stringent, leading to fewer chance matches being reported.
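The effect of the EXPECT threshold on what gets reported can be sketched as a simple filter. The hit records, identifiers and E values below are invented for illustration, and `report_matches` is a hypothetical helper, not part of any BLAST API.

```python
def report_matches(hits, expect_threshold=10.0):
    """Keep hits whose E value does not exceed the threshold, best (lowest E) first."""
    kept = [hit for hit in hits if hit["evalue"] <= expect_threshold]
    return sorted(kept, key=lambda hit: hit["evalue"])


# Made-up hits for illustration only.
hits = [
    {"id": "hit_A", "evalue": 2e-50},
    {"id": "hit_B", "evalue": 0.003},
    {"id": "hit_C", "evalue": 4.2},
    {"id": "hit_D", "evalue": 15.0},
]

print([h["id"] for h in report_matches(hits)])                         # default threshold of 10
print([h["id"] for h in report_matches(hits, expect_threshold=0.01)])  # more stringent
```

Lowering the threshold from 10 to 0.01 drops hit_C, whose E value suggests it could easily have arisen by chance.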
BLAST search has a wide variety of uses:
• Determining what orthologs and paralogs are known for a particular protein
or nucleic acid sequence.
• Determining what proteins or genes are present in a particular organism.
• Determining the identity of a DNA or protein sequence.
• Discovering new genes.
• Determining what variants have been described for a particular gene or
protein.
• Investigating expressed sequence tags (ESTs) that may exhibit alternative
splicing.
• Exploring amino acid residues that are important in the function and/or
structure of a protein.
Basic Local Alignment Search Tool (BLAST)
• BLAST is the main NCBI tool for comparing a protein or DNA sequence to
sequences in databases (Altschul et al., 1990, 1997).
• BLAST search can reveal what related sequences are present in the same
organism and other organisms.
• BLAST is a family of programs: BLASTP, BLASTN, BLASTX, TBLASTX,
TBLASTN, PSI-BLAST.
• The translating programs convert a DNA sequence into 6 potential proteins,
so protein sequences can be compared to dynamically translated DNA
databases, or vice versa.
• The programs produce high‐scoring segment pairs (HSPs) that represent
local alignments between your query and database sequences.