Bioinformatics Session8

The document describes the BLAST algorithm process for protein sequence searches. It discusses three main phases: 1) BLAST compiles a list of pairwise alignments called word pairs from the query sequence. 2) The algorithm scans the database for word pair matches above a threshold score T and extends these hits. 3) A trace-back procedure assigns locations of insertions, deletions, and mismatches to generate alignments. Expect value provides the likelihood of alignments occurring by chance based on scores from substitution matrices and sequence lengths. Parameters like word size, scoring matrices, and expect value thresholds allow users to customize BLAST searches.

Uploaded by

Rohan Ray
Bioinformatics (BIO213)

Session 8

Sessions 9 and 10 were about running BLAST


Four components for performing a BLAST search

1. Selecting a sequence of interest and pasting, typing, or
uploading it into the BLAST input box.
2. Selecting a BLAST program (BLASTP, BLASTN, BLASTX,
TBLASTX, or TBLASTN).
3. Selecting a database to search. A common choice is the
nonredundant (nr) database, but there are many other
databases.
4. Selecting optional parameters, both for the search and for the
format of the output.
Overview of the five main BLAST algorithms.

P: Protein
N: Nucleotide
X: a DNA query is dynamically translated into
six protein sequences (one per reading frame)
T: “translating,” where a DNA database is
dynamically translated into six protein sequences.
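
The six-frame translation behind the X and T programs can be sketched in a few lines. This is a minimal illustration, assuming the standard genetic code; the function names (`six_frame_translate`, `translate`, `reverse_complement`) are made up for this example and are not part of BLAST.

```python
from itertools import product

# Standard genetic code, with codons ordered T, C, A, G at each position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna, offset):
    """Translate one reading frame; '*' is a stop, 'X' an unrecognized codon."""
    codons = (dna[i:i + 3] for i in range(offset, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translate(dna):
    """Frames +1, +2, +3 on the given strand, then on its reverse complement."""
    dna = dna.upper()
    strands = (dna, reverse_complement(dna))
    return [translate(strand, offset) for strand in strands for offset in range(3)]


print(six_frame_translate("ATGGCC"))
```

BLASTX applies this idea to the query, TBLASTN to the database, and TBLASTX to both.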
Understanding BLAST with human protein
RBP4 (NP_006735.2) as an example:
• RBP4: Retinol Binding Protein 4
• Accession number: NP_006735.2
• Open NCBI-Blast web page:
https://blast.ncbi.nlm.nih.gov/Blast.cgi
Expect value, Max score and Total score
• Expect value: the number of different alignments with scores equal to or
better than S that are expected to occur by chance in a database search.
• This provides an estimate of the number of false positive results from a
BLAST search.
• Max score: highest alignment score (bit-score) between the query
sequence and the database sequence segment.
• Total score: sum of alignment scores of all segments that match the query
sequence.
• This score is different from the max score if several parts of the
database sequence match different parts of the query sequence.
Raw scores and Bit scores
• Raw scores (S) are calculated from the substitution matrix and gap
penalty parameters that are chosen → unitless

• The bit score (S′) is calculated from the raw score by normalizing
with the statistical variables that define a given scoring system →
information content (bits)

• Bit scores from different alignments, even using different scoring
matrices in separate BLAST searches, can therefore be compared.
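
The normalization from raw score to bit score is usually written S′ = (λS − ln K) / ln 2, where λ and K are the Karlin–Altschul parameters of the scoring system. A minimal sketch follows; the λ and K values are assumptions chosen for illustration (roughly those reported for gapped BLOSUM62 searches), not values taken from these slides.

```python
import math


def bit_score(raw_score, lam, k):
    """Convert a raw score S to a bit score: S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


# Assumed, illustrative Karlin-Altschul parameters.
LAMBDA, K = 0.267, 0.041

for raw in (50, 100, 200):
    print(raw, round(bit_score(raw, LAMBDA, K), 1))
```

Because λ and K are folded into S′, two bit scores can be compared even when they came from different matrices.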
Substitution matrix specific probabilities and
Expect Value (E-value) scores are assigned for each aligned pair of
residues and the overall alignment.

Comparing a query sequence to a set of random


sequences (uniform‐length) generates scores that fit an
extreme‐value distribution.

The expected number of HSPs having a S


by chance alone is determined from this
extreme value distribution

It is used to compute expect value


• E: Expect value, the number of different alignments with scores
(S) that are expected to occur by chance in a database search.

• This provides an estimate of the number of false positive results


from a BLAST search.

• m & n: lengths of sequences being compared


• K and λ are Karlin–Altschul statistics parameters
Important properties of Expect-value (E)
• The value of E decreases exponentially with increasing S.
• The score S reflects the similarity of each pairwise comparison.
• Higher S values correspond to better alignments and lower E values.
• As E approaches zero, the probability that the alignment occurred by chance
approaches zero.
• The expected score for aligning a random pair of amino acids must be negative.
• Otherwise, very long alignments of two sequences could accumulate large
positive scores and appear to be significantly related when they are not.
• The size of the database that is searched – as well as the size of the query –
influences the likelihood that particular alignments will occur by chance.
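
These properties follow from the Karlin–Altschul relation E = K m n e^(−λS). The sketch below evaluates it for a few raw scores; the parameter values and the sequence lengths are assumptions for illustration only.

```python
import math


def expect_value(raw_score, m, n, lam, k):
    """E = K * m * n * exp(-lambda * S) for a raw score S."""
    return k * m * n * math.exp(-lam * raw_score)


# Assumed, illustrative values (not taken from a real search).
LAMBDA, K = 0.267, 0.041
QUERY_LEN = 200             # m
DATABASE_LEN = 50_000_000   # n

for raw in (60, 80, 100):
    print(raw, expect_value(raw, QUERY_LEN, DATABASE_LEN, LAMBDA, K))
```

Each increase of 20 in S multiplies E by e^(−20λ), while doubling the database length doubles E for the same score.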
Properties of E-value

E-value and bit score are related:

E = m n 2^(−S′)

Bit scores can tell you the E value if you know the size of the
search space, m × n
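
With the bit score S′, the relation simplifies to E = m n 2^(−S′), so it can also be inverted to ask how many bits a hit needs for a given E value. A small sketch, with hypothetical lengths for the query and database:

```python
import math


def evalue_from_bits(bits, m, n):
    """E = m * n * 2 ** (-S'), where m * n is the search space."""
    return m * n * 2.0 ** (-bits)


def bits_needed(target_evalue, m, n):
    """Bit score at which a hit reaches the target E value in search space m * n."""
    return math.log2(m * n / target_evalue)


# Hypothetical search space: a 200-residue query against 50 million residues.
m, n = 200, 50_000_000
print(evalue_from_bits(40, m, n))
print(bits_needed(1e-5, m, n))
```

The same bit score therefore gives a larger E value in a larger database.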
BLASTP Algorithm Parts: List, Scan, Extend

• BLASTP works in three phases:
1. Seed & List
2. Scan & Extend
3. Traceback
Phase 1: Seed & List
• Setup: compile a list of words (W=3) above threshold T
• Query sequence: human beta globin NP_000509.1
This sequence is read; low complexity or other filtering is applied; a
“lookup” table is built.

Default W = 3?

Amino acids: 20
20^3 = 8000 possible words
20^4 = 160000 words
20^5 = 3200000 words
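
The word counts above, and the way a query is cut into overlapping words, can be checked with a short sketch. The peptide below is an arbitrary short example and `query_words` is an illustrative name, not a BLAST function.

```python
def query_words(sequence, w=3):
    """All overlapping words of length w, in order of position in the query."""
    return [sequence[i:i + w] for i in range(len(sequence) - w + 1)]


# Number of possible words over the 20 amino acids for several word sizes.
for w in (3, 4, 5):
    print(w, 20 ** w)

# A query of length L yields L - w + 1 words.
print(query_words("MVHLTPEEK"))
```

A larger W makes the lookup table grow quickly, which is one reason a small default such as 3 is practical for proteins.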
Phase 1: Seed & List

BLOSUM62 matrix

A threshold value T is established for the score of aligned words.
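
How the threshold T shapes the word list can be illustrated as follows. This is only a sketch: `toy_score` is a made-up stand-in for a real substitution matrix (it is not BLOSUM62), and the similarity groups and score values are assumptions chosen to keep the example small.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Toy similarity groups standing in for a real matrix (assumption, NOT BLOSUM62).
GROUPS = [set("ILV"), set("KR"), set("DE")]


def toy_score(a, b):
    """+4 for identity, +1 within a similarity group, -2 otherwise."""
    if a == b:
        return 4
    if any(a in group and b in group for group in GROUPS):
        return 1
    return -2


def neighborhood(word, threshold, score=toy_score):
    """All words of len(word) whose aligned score against `word` is >= threshold."""
    hits = []
    for candidate in product(AMINO_ACIDS, repeat=len(word)):
        total = sum(score(q, c) for q, c in zip(word, candidate))
        if total >= threshold:
            hits.append("".join(candidate))
    return hits


print(neighborhood("LKE", 9))       # high T: few neighbors
print(len(neighborhood("LKE", 6)))  # lower T: many more neighbors
```

Passing a function built on a real matrix as `score` would give the actual neighborhood; the enumeration over all 8000 three-letter words stays the same.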
Phase 2: Scanning and extensions

(Dictionary: optimal way of storing data)

The database hits are extended in both directions to obtain
high‐scoring segment pairs (HSPs).
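
The scan and extension steps can be sketched with a dictionary lookup followed by ungapped extension. This is a simplified illustration under stated assumptions: seeds are exact word matches (real BLASTP also uses neighborhood words), scoring is a toy match/mismatch scheme, and the `x_drop` cutoff and all function names are hypothetical.

```python
from collections import defaultdict


def build_lookup(query, w):
    """Dictionary mapping each query word to its start positions."""
    table = defaultdict(list)
    for i in range(len(query) - w + 1):
        table[query[i:i + w]].append(i)
    return table


def scan(query, subject, w):
    """Return (query_pos, subject_pos) pairs where a word matches exactly."""
    table = build_lookup(query, w)
    return [(i, j)
            for j in range(len(subject) - w + 1)
            for i in table.get(subject[j:j + w], [])]


def extend(query, subject, i, j, w, score, x_drop):
    """Ungapped extension of a seed; stop when the score falls x_drop below best."""
    best = current = sum(score(query[i + k], subject[j + k]) for k in range(w))
    right, k = w, w
    while i + k < len(query) and j + k < len(subject):   # extend to the right
        current += score(query[i + k], subject[j + k])
        if current > best:
            best, right = current, k + 1
        elif best - current > x_drop:
            break
        k += 1
    current, left, k = best, 0, 1
    while i - k >= 0 and j - k >= 0:                      # extend to the left
        current += score(query[i - k], subject[j - k])
        if current > best:
            best, left = current, k
        elif best - current > x_drop:
            break
        k += 1
    return best, i - left, i + right   # score, query start, query end (exclusive)


def toy_score(a, b):
    return 2 if a == b else -1


query, subject = "GGMKVLTT", "CCMKVLCC"
hits = scan(query, subject, 3)
print(hits)
print(extend(query, subject, *hits[0], 3, toy_score, x_drop=2))
```

The returned segment is the ungapped HSP around the seed; in BLASTP, promising HSPs then go on to gapped extension.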
Phase 3: Traceback
• Calculate locations of insertions, deletions, and matches (for
alignments saved in Phase 2)
• Apply composition-based statistics (for BLASTP, TBLASTN)
• Generate gapped alignment
Phase 1 of BLASTn differs from BLASTp
• For nucleotide BLASTN searches, exact matches are required
rather than words above a threshold.
• The default word size is 11 (and can be adjusted to values of 7
or 15).
• Lowering the word length effectively achieves the same aim as
lowering the threshold score.
• Specifying a smaller word size induces a slower, more sensitive
search.
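
The trade-off can be seen by counting exact word matches (seeds) at different word sizes. The sequences below are made up so that they share one identical 12-nucleotide stretch; `count_seed_hits` is an illustrative helper, not a BLAST function.

```python
def count_seed_hits(query, subject, w):
    """Count (query position, subject position) pairs with an exact w-mer match."""
    positions = {}
    for i in range(len(query) - w + 1):
        positions.setdefault(query[i:i + w], []).append(i)
    return sum(len(positions.get(subject[j:j + w], ()))
               for j in range(len(subject) - w + 1))


# Made-up sequences sharing a single identical 12-nucleotide region.
shared = "ACGTTGCAAGCT"
query = "TTTTT" + shared + "GGGGG"
subject = "CCCCC" + shared + "AAAAA"

for w in (7, 11, 15):
    print(w, count_seed_hits(query, subject, w))
```

With W = 15 the 12-nucleotide match produces no seed at all, so it would be missed; with W = 7 it produces several seeds, each of which costs extension time.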
The effect of varying the threshold on the
number of database hits and extensions
Increasing the threshold T limits the search
space significantly
Let’s run some BLAST!
• Protein_Essential_cat.fasta
• orf_trans.fasta
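
One possible way to run such a search from a script is to assemble a BLAST+ `blastp` command. This is a sketch under assumptions: it presumes the NCBI BLAST+ command-line tools are installed, and the choice of which file is the query and which is the subject is a guess, since the slides do not say. `build_blastp_command` is a hypothetical helper.

```python
import shlex


def build_blastp_command(query, subject, evalue=10.0, word_size=3,
                         matrix="BLOSUM62"):
    """Assemble a blastp command line (BLAST+); nothing is executed here."""
    return [
        "blastp",
        "-query", query,
        "-subject", subject,          # pairwise mode, no formatted database needed
        "-evalue", str(evalue),       # expect threshold
        "-word_size", str(word_size),
        "-matrix", matrix,
        "-outfmt", "6",               # tabular output
    ]


# Assumed roles for the two files: query first, subject second.
command = build_blastp_command("orf_trans.fasta", "Protein_Essential_cat.fasta")
print(shlex.join(command))
# To actually run it (requires BLAST+ on the PATH):
# import subprocess
# result = subprocess.run(command, check=True, capture_output=True, text=True)
```

For a search against a prebuilt database, `-subject` would be replaced by `-db` with the database name.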
Substitution matrix specific probabilities and
Expect Value (E-value) scores are assigned for each aligned pair of
residues and the overall alignment.

Comparing a query sequence to a set of random


sequences (uniform‐length) generates scores that fit an
extreme‐value distribution.

The expected number of HSPs having a S


by chance alone is determined from this
extreme value distribution

It is used to compute expect value


• E: Expect value, the number of different alignments with scores
(S) that are expected to occur by chance in a database search.

• This provides an estimate of the number of false positive results


from a BLAST search.

• m & n: lengths of sequences being compared


• K and λ are Karlin–Altschul statistics parameters
Important properties of Expect-value (E)
• The value of E decreases exponentially with increasing S.
• The score S reflects the similarity of each pairwise comparison.
• Higher S values correspond to better alignments and lower E values.
• As E approaches zero, the probability that the alignment occurred by chance
approaches zero.
• The expected score for aligning a random pair of amino acids must be negative.
• Otherwise, very long alignments of two sequences could accumulate large
positive scores and appear to be significantly related when they are not.
• The size of the database that is searched – as well as the size of the query –
influences the likelihood that particular alignments will occur by chance.
Properties of E-value

E-value and bits score are related

Bit scores can tell you the E value if you know the size of the
search space, m × n
• The BLASTP algorithm can be described in three phases:
• For protein searches, BLAST compiles a preliminary list of pairwise
alignments called word pairs.
• The algorithm scans a database for word pairs that meet some threshold
score T. When this occurs, such hits are extended using ungapped and
gapped alignments. BLAST extends the word pairs to find those that surpass
a cutoff score S, at which point those hits will be reported to the user. Scores
are calculated from scoring matrices (such as BLOSUM62) along with gap
penalties.
• A trace‐back procedure is performed in which the locations of insertions,
deletions and mismatches are assigned.
Main page of BLASTP search

Input as an accession No., GI identifier, or FASTA‐format

Database (nr: most common), Restrict or exclude
organism/Taxonomic group

Select by Author

Select by Algorithm

Search parameters
(3) This setting specifies the statistical significance threshold for reporting
matches against database sequences.
The default value (10) means that 10 such matches are expected to be
found merely by chance, according to the stochastic model of Karlin and
Altschul (1990).
If the statistical significance ascribed to a match is greater than the
EXPECT threshold, the match will not be reported. Lower EXPECT thresholds
are more stringent, leading to fewer chance matches being reported.
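
The effect of the EXPECT threshold can be sketched as a filter over tabular BLAST output (the 12-column `-outfmt 6` layout of BLAST+, where column 11 is the E value and column 12 the bit score). The rows below are made-up data for illustration, and `filter_hits` is a hypothetical helper.

```python
def filter_hits(tabular_lines, expect_threshold=10.0):
    """Keep (subject id, E value, bit score) for hits at or below the threshold."""
    kept = []
    for line in tabular_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank and comment lines
        fields = line.rstrip("\n").split("\t")
        evalue = float(fields[10])
        if evalue <= expect_threshold:
            kept.append((fields[1], evalue, float(fields[11])))
    return kept


# Made-up rows: qseqid sseqid pident length mismatch gapopen
#               qstart qend sstart send evalue bitscore
rows = [
    "query1\tsubjA\t98.5\t200\t3\t0\t1\t200\t1\t200\t1e-120\t410",
    "query1\tsubjB\t35.0\t80\t50\t2\t10\t89\t5\t84\t0.002\t38.5",
    "query1\tsubjC\t28.0\t40\t29\t0\t100\t139\t7\t46\t8.3\t24.1",
]

print(filter_hits(rows))                         # default threshold of 10
print(filter_hits(rows, expect_threshold=0.01))  # more stringent
```

Lowering the threshold drops the weakest hit, which is the one most likely to be a chance match.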

(4) Word size: BLAST is a heuristic that works by finding word-matches
between the query and database sequences.
This process is like finding "hot-spots" that BLAST can then use
to initiate extensions that might eventually lead to full-blown
alignments.
When a query is used to search a database, the BLAST
algorithm first divides the query into a series of smaller
sequences (words) of a particular length (word size).
For BLASTP, a larger word size yields a more accurate search
EXAMPLE: Eyeless Gene Homeobox
Compare the gene eyeless of Drosophila melanogaster with the human gene aniridia. Both are master
regulatory genes producing proteins that control a large cascade of other genes. Certain segments of the
Drosophila eyeless and human aniridia genes are almost identical. The most important of these segments encodes
the PAX (paired-box) domain, a sequence of 128 amino acids whose function is to bind specific sequences of
DNA. Another common segment is the HOX (homeobox) domain, which is thought to be part of more than 0.2% of
the total number of vertebrate genes.

BLAST search has a wide variety of uses:
• Determining what orthologs and paralogs are known for a particular protein
or nucleic acid sequence.
• Determining what proteins or genes are present in a particular organism.
• Determining the identity of a DNA or protein sequence.
• Discovering new genes.
• Determining what variants have been described for a particular gene or
protein.
• Investigating expressed sequence tags (ESTs) that may exhibit alternative
splicing.
• Exploring amino acid residues that are important in the function and/or
structure of a protein.
Basic Local Alignment Search Tool (BLAST)
• BLAST is the main NCBI tool for comparing a protein or DNA sequence to
sequences in databases (Altschul et al., 1990, 1997).
• BLAST search can reveal what related sequences are present in the same
organism and other organisms.
• BLAST is a family of programs: BLASTP, BLASTN, BLASTX, TBLASTX,
TBLASTN, PSI-BLAST.
• BLAST can translate a DNA sequence into 6 potential proteins and compare
protein sequences to dynamically translated DNA databases, or vice versa.
• The programs produce high‐scoring segment pairs (HSPs) that represent
local alignments between your query and database sequences.
