Bioinformatics Session8

The document describes the BLAST algorithm process for protein sequence searches. It discusses three main phases: 1) BLAST compiles a list of pairwise alignments called word pairs from the query sequence. 2) The algorithm scans the database for word pair matches above a threshold score T and extends these hits. 3) A trace-back procedure assigns locations of insertions, deletions, and mismatches to generate alignments. Expect value provides the likelihood of alignments occurring by chance based on scores from substitution matrices and sequence lengths. Parameters like word size, scoring matrices, and expect value thresholds allow users to customize BLAST searches.

Uploaded by

Rohan Ray

Bioinformatics (BIO213)

Session 8

Sessions 9 and 10 were about running BLAST

Four components for performing a BLAST search

1. Selecting a sequence of interest and pasting, typing, or
uploading it into the BLAST input box.
2. Selecting a BLAST program (BLASTP, BLASTN, BLASTX,
TBLASTX, or TBLASTN).
3. Selecting a database to search. A common choice is the
nonredundant (nr) database, but there are many other
databases.
4. Selecting optional parameters, both for the search and for the
format of the output.
Overview of the five main BLAST algorithms.

P: Protein
N: Nucleotide
X: a DNA query is dynamically translated into
six protein sequences
T: “translating,” where a DNA database is
dynamically translated into six protein sequences.
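The six-frame translation behind the X and T prefixes can be sketched as follows. This is a minimal illustration, assuming the standard genetic code and an uppercase A/C/G/T input; the function names are hypothetical and this is not BLAST's own implementation.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons only; unknown codons become 'X'."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translation(dna):
    """Return the three forward and three reverse reading frames."""
    dna = dna.upper()
    rc = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames


if __name__ == "__main__":
    for name, protein in six_frame_translation("ATGGCCTAA").items():
        print(name, protein)
```

BLASTX applies this kind of translation to the query, TBLASTN to the database, and TBLASTX to both.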
Understanding BLAST with human protein
RBP4 (NP_006735.2) as an example:
• RBP4: Retinol Binding Protein 4
• Accession number: NP_006735.2
• Open NCBI-Blast web page:
https://blast.ncbi.nlm.nih.gov/Blast.cgi
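The four components of a search (query, program, database, optional parameters) can also be written down as a request to the same Blast.cgi address. The sketch below only builds the URL and sends nothing. It assumes the parameter names CMD, PROGRAM, DATABASE, QUERY, EXPECT and WORD_SIZE from NCBI's BLAST URL API, which should be checked against the current NCBI documentation; the helper name is hypothetical.

```python
from urllib.parse import urlencode

BLAST_URL = "https://blast.ncbi.nlm.nih.gov/Blast.cgi"


def build_blast_submission(query, program="blastp", database="nr",
                           expect=10, word_size=3):
    """Assemble (but do not send) a search submission URL."""
    params = {
        "CMD": "Put",            # submit a new search
        "PROGRAM": program,      # component 2: the BLAST program
        "DATABASE": database,    # component 3: the database
        "QUERY": query,          # component 1: sequence or accession
        "EXPECT": expect,        # component 4: optional parameters
        "WORD_SIZE": word_size,
    }
    return BLAST_URL + "?" + urlencode(params)


if __name__ == "__main__":
    print(build_blast_submission("NP_006735.2"))
```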
Expect value, Max score and Total score
• Expect value: the number of different alignments with scores equal to or
better than S that are expected to occur by chance in a database search.
• This provides an estimate of the number of false positive results from a
BLAST search.
• Max score: highest alignment score (bit-score) between the query
sequence and the database sequence segment.
• Total score: sum of alignment scores of all segments that match the query
sequence.
• This score is different from the max score if several parts of the
database sequence match different parts of the query sequence.
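The difference between max score and total score can be illustrated with a short sketch. The bit scores below are hypothetical values for one database sequence that matches the query in three separate segments.

```python
def max_and_total_score(hsp_bit_scores):
    """Return (max score, total score) for one database sequence."""
    return max(hsp_bit_scores), sum(hsp_bit_scores)


if __name__ == "__main__":
    # Hypothetical bit scores of three segments from the same subject.
    segments = [120.5, 48.0, 31.5]
    best, total = max_and_total_score(segments)
    print("Max score:", best)
    print("Total score:", total)
```

With a single matching segment the two values coincide.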
Raw scores and Bit scores
• Raw scores (S) are calculated from the substitution matrix and gap
penalty parameters that are chosen → unitless

• The bit score (S′) is calculated from the raw score by normalizing
with the statistical variables that define a given scoring system →
information content (bits)

• Bit scores from different alignments, even using different scoring
matrices in separate BLAST searches, can therefore be compared.
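The normalization is commonly written as S′ = (λS − ln K) / ln 2, where λ and K are the Karlin–Altschul parameters of the scoring system. A minimal sketch follows; the λ and K values in the demo are illustrative assumptions, not values taken from these slides.

```python
import math


def bit_score(raw_score, lam, K):
    """Convert a raw score S to a bit score S' = (lam*S - ln K) / ln 2."""
    return (lam * raw_score - math.log(K)) / math.log(2)


if __name__ == "__main__":
    # Illustrative parameters for two different scoring systems (assumed).
    system_a = {"lam": 0.267, "K": 0.041}
    system_b = {"lam": 0.318, "K": 0.134}
    print(round(bit_score(100, **system_a), 1))
    print(round(bit_score(100, **system_b), 1))
```

The same raw score maps to different bit scores under different scoring systems, which is why the bit score, not the raw score, is the quantity to compare across searches.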
Substitution matrix specific probabilities and
Expect Value (E-value) scores are assigned for each aligned pair of
residues and the overall alignment.

Comparing a query sequence to a set of random
sequences (uniform‐length) generates scores that fit an
extreme‐value distribution.

The expected number of HSPs having a score of at least S
by chance alone is determined from this
extreme value distribution.

It is used to compute the expect value:

$E = K m n \, e^{-\lambda S}$

• E: Expect value, the number of different alignments with scores equal to
or better than S that are expected to occur by chance in a database search.

• This provides an estimate of the number of false positive results
from a BLAST search.

• m & n: lengths of sequences being compared

• K and λ are Karlin–Altschul statistics parameters
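In terms of these symbols the Karlin–Altschul relation is E = K·m·n·e^(−λS). The sketch below evaluates it directly. The numeric values of K, λ, m and n in the demo are assumptions chosen only to show the trend, and BLAST itself additionally applies length corrections that are omitted here.

```python
import math


def expect_value(raw_score, m, n, lam, K):
    """Expected number of chance alignments scoring at least raw_score."""
    return K * m * n * math.exp(-lam * raw_score)


if __name__ == "__main__":
    m, n = 200, 10_000_000       # assumed query and database lengths
    lam, K = 0.267, 0.041        # assumed scoring-system parameters
    for s in (40, 60, 80, 100):
        print(s, f"{expect_value(s, m, n, lam, K):.3g}")
```

Each fixed increase in S multiplies E by the same factor below one, and doubling the database length n doubles E.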
Important properties of Expect-value (E)
• The value of E decreases exponentially with increasing S.
• The score S reflects the similarity of each pairwise comparison.
• Higher S values correspond to better alignments and lower E values.
• As E approaches zero, the probability that the alignment occurred by chance
approaches zero.
• The expected score for aligning a random pair of amino acids must be negative.
• Otherwise, very long alignments of two sequences could accumulate large
positive scores and appear to be significantly related when they are not.
• The size of the database that is searched – as well as the size of the query –
influences the likelihood that particular alignments will occur by chance.
Properties of E-value

The E-value and the bit score are related:

$E = m n \, 2^{-S'}$

The bit score S′ gives the E value once the size of the
search space, m × n, is known.
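With the bit score S′, the relation is E = m·n·2^(−S′), so no scoring-system parameters are needed. A small sketch, with hypothetical function names and an assumed search-space size in the demo:

```python
import math


def expect_from_bits(bits, m, n):
    """E value implied by a bit score in a search space of size m * n."""
    return m * n * 2.0 ** (-bits)


def bits_for_expect(e_value, m, n):
    """Bit score needed to reach a given E value in the same search space."""
    return math.log2(m * n / e_value)


if __name__ == "__main__":
    m, n = 200, 10_000_000   # assumed query and database lengths
    print(f"{expect_from_bits(50, m, n):.3g}")
    print(round(bits_for_expect(1e-5, m, n), 1))
```

Every additional bit halves E, and a larger database raises the bit score required for the same E.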
BLASTP Algorithm Parts: List, Scan, Extend

• BLASTP works in three phases:

1. Seed & List
2. Scan & Extend
3. Traceback
Phase 1: Seed & List
• Setup: compile a list of words (W=3) above threshold T
• Query sequence: human beta globin NP_000509.1
This sequence is read; low complexity or other filtering is applied; a
“lookup” table is built.

Default W = 3?

Amino acids: 20
20^3 = 8,000 possible words
20^4 = 160,000 words
20^5 = 3,200,000 words
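The word counts above, and the splitting of a query into overlapping words, can be reproduced in a few lines. The peptide in the demo is just a short illustrative string, and the function name is hypothetical.

```python
def query_words(sequence, w=3):
    """Return every overlapping word of length w in the sequence."""
    return [sequence[i:i + w] for i in range(len(sequence) - w + 1)]


def possible_words(w, alphabet_size=20):
    """Number of distinct words of length w over the alphabet."""
    return alphabet_size ** w


if __name__ == "__main__":
    for w in (3, 4, 5):
        print(w, possible_words(w))
    print(query_words("MVHLTPEEK"))
```

A query of length L yields L − W + 1 words, while the lookup table has to cover 20^W possible words, which is one reason small word sizes are practical for proteins.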
Phase 1: Seed & List

BLOSUM62 matrix

A threshold value T is established for the
score of aligned words.
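Compiling the list of words that score at or above T against a query word can be sketched as below. To stay short, the sketch replaces BLOSUM62 with a toy scoring function (+5 for identity, −1 otherwise); that substitution and all names are assumptions for illustration only.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def toy_score(a, b):
    """Stand-in for a BLOSUM62 lookup (assumption, not the real matrix)."""
    return 5 if a == b else -1


def neighborhood(word, threshold, score=toy_score):
    """All words whose aligned score against `word` is at least threshold."""
    hits = []
    for candidate in product(AMINO_ACIDS, repeat=len(word)):
        total = sum(score(x, y) for x, y in zip(word, candidate))
        if total >= threshold:
            hits.append(("".join(candidate), total))
    return hits


if __name__ == "__main__":
    for t in (9, 10, 15):
        print("T =", t, "->", len(neighborhood("LKH", t)), "words")
```

Raising T shrinks the list of words kept for scanning, which is the trade-off between speed and sensitivity.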
Phase 2: Scanning and extensions

(Dictionary: optimal way of storing data)

The database hits are extended in both directions to
obtain high‐scoring segment pairs (HSPs).
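A simplified scan-and-extend step might look like the following. It is a sketch under several assumptions: only exact word matches are looked up (no neighborhood words), scoring is a toy +5/−4 scheme rather than BLOSUM62, and extension stops when the running score falls more than `x_drop` below the best score seen. All names are hypothetical.

```python
from collections import defaultdict


def toy_score(a, b):
    """Toy substitution score (assumption, not BLOSUM62)."""
    return 5 if a == b else -4


def build_lookup(query, w=3):
    """Dictionary mapping each query word to its start positions."""
    table = defaultdict(list)
    for i in range(len(query) - w + 1):
        table[query[i:i + w]].append(i)
    return table


def _extend(query, subject, q, s, step, x_drop, score):
    """Ungapped extension in one direction; returns (gain, residues kept)."""
    best = running = 0
    best_len = length = 0
    while 0 <= q < len(query) and 0 <= s < len(subject):
        running += score(query[q], subject[s])
        length += 1
        if running > best:
            best, best_len = running, length
        elif best - running > x_drop:
            break
        q += step
        s += step
    return best, best_len


def scan_and_extend(query, subject, w=3, x_drop=10, score=toy_score):
    """Return HSPs as (score, query_start, subject_start, length)."""
    table = build_lookup(query, w)
    hsps = set()
    for s_pos in range(len(subject) - w + 1):
        for q_pos in table.get(subject[s_pos:s_pos + w], []):
            seed = sum(score(query[q_pos + k], subject[s_pos + k])
                       for k in range(w))
            right, r_len = _extend(query, subject, q_pos + w, s_pos + w,
                                   1, x_drop, score)
            left, l_len = _extend(query, subject, q_pos - 1, s_pos - 1,
                                  -1, x_drop, score)
            hsps.add((seed + left + right, q_pos - l_len,
                      s_pos - l_len, w + l_len + r_len))
    return sorted(hsps, reverse=True)


if __name__ == "__main__":
    print(scan_and_extend("AAMVHLTKK", "CCMVHLTGG"))
```

Several seeds on the same diagonal extend to the same segment, so the results are collected in a set to report each HSP once.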
Phase 3: Traceback
• Calculate locations of insertions, deletions, and matches (for
alignments saved in Phase 2)
• Apply composition-based statistics (for BLASTP, TBLASTN)
• Generate gapped alignment
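How a traceback assigns matches, mismatches and gaps can be shown with a small local-alignment sketch. This uses plain Smith–Waterman dynamic programming with a linear gap penalty and toy scores, which is a stand-in chosen for brevity and not BLAST's own gapped-extension procedure; composition-based statistics are not modelled.

```python
def local_align(a, b, match=5, mismatch=-4, gap=-6):
    """Return (score, aligned_a, aligned_b) for the best local alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the score matrix.
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0, H[i - 1][j - 1] + sub,
                          H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until the score drops to zero.
    i, j = best_pos
    top, bottom = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + sub:      # match or mismatch
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:        # gap in b
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:                                     # gap in a
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


if __name__ == "__main__":
    score, row_a, row_b = local_align("ACDEFGHIK", "ACDEGHIK")
    print(score)
    print(row_a)
    print(row_b)
```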
Phase 1 of BLASTn differs from BLASTp
• For nucleotide BLASTN searches, exact matches are required
rather than words above a threshold.
• The default word size is 11 (and can be adjusted to values of 7
or 15).
• Lowering the word length effectively achieves the same aim as
lowering the threshold score.
• Specifying a smaller word size induces a slower, more accurate
search.
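The effect of word size on exact-match seeding can be seen with a toy pair of sequences that share only a 9-nucleotide stretch. The sequences and the function name are made up for illustration.

```python
def exact_seeds(query, subject, w=11):
    """Return (query_pos, subject_pos) pairs where a word of length w matches exactly."""
    index = {}
    for i in range(len(query) - w + 1):
        index.setdefault(query[i:i + w], []).append(i)
    return [(q, s)
            for s in range(len(subject) - w + 1)
            for q in index.get(subject[s:s + w], [])]


if __name__ == "__main__":
    shared = "GATTACAGG"                    # 9 nt common to both sequences
    query = "TTTTT" + shared + "TTTTT"
    subject = "CCCCC" + shared + "CCCCC"
    for w in (11, 7):
        print("W =", w, "->", exact_seeds(query, subject, w))
```

With W = 11 the shared stretch is too short to seed anything, while W = 7 finds it, at the cost of more seeds to extend in a real database.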
The effect of varying the threshold on the
number of database hits and extensions
Increasing the threshold T
limits the search
space significantly
Let’s run some BLAST!
• Protein_Essential_cat.fasta
• orf_trans.fasta
Substitution matrix specific probabilities and
Expect Value (E-value) scores are assigned for each aligned pair of
residues and the overall alignment.

Comparing a query sequence to a set of random

sequences (uniform‐length) generates scores that fit an
extreme‐value distribution.

The expected number of HSPs having a S

by chance alone is determined from this
extreme value distribution

It is used to compute expect value

• E: Expect value, the number of different alignments with scores
(S) that are expected to occur by chance in a database search.

• This provides an estimate of the number of false positive results

from a BLAST search.

• m & n: lengths of sequences being compared

E-value and bits score are related

Bit scores can tell you the E value if you know the size of the
search space, m × n
• The BLASTP algorithm can be described in three phases:
• For protein searches, BLAST compiles a preliminary list of pairwise
alignments called word pairs.
• The algorithm scans a database for word pairs that meet some threshold
score T. When this occurs, such hits are extended using ungapped and
gapped alignments. BLAST extends the word pairs to find those that surpass
a cutoff score S, at which point those hits will be reported to the user. Scores
are calculated from scoring matrices (such as BLOSUM62) along with gap
penalties.
• A trace‐back procedure is performed in which the locations of insertions,
deletions and mismatches are assigned.
Main page of BLASTP search

• Input as an accession No., GI identifier, or FASTA‐format
• Database (nr: most common); restrict or exclude organism/taxonomic group
• Select by Author
• Select by Algorithm
• Search parameters
(3) This setting specifies the statistical significance threshold for reporting
matches against database sequences.
The default value (10) means that 10 such matches are expected to be
found merely by chance (Karlin and Altschul (1990)).
If the statistical significance ascribed to a match is greater than the EXPECT
threshold, the match will not be reported. Lower EXPECT thresholds are
more stringent, leading to fewer chance matches being reported.

(4) Word size: BLAST is a heuristic that works by finding word-matches
between the query and database sequences.
This process is like finding "hot-spots" that BLAST can then use
to initiate extensions that might eventually lead to full-blown
alignments.
When a query is used to search a database, the BLAST
algorithm first divides the query into a series of smaller
sequences (words) of a particular length (word size).
For BLASTP, a larger word size yields a more accurate search
EXAMPLE: Eyeless Gene Homeobox
Compare the gene eyeless of Drosophila melanogaster with the human gene aniridia. They are master
regulatory genes producing proteins that control a large cascade of other genes. Certain segments of the
eyeless gene of Drosophila and of human aniridia are almost identical. The most important of these segments encodes
the PAX (paired-box) domain, a sequence of 128 amino acids whose function is to bind specific sequences of
DNA. Another common segment is the HOX (homeobox) domain, which is thought to be part of more than 0.2% of
the total number of vertebrate genes.

BLAST search has a wide variety of uses:
• Determining what orthologs and paralogs are known for a particular protein
or nucleic acid sequence.
• Determining what proteins or genes are present in a particular organism.
• Determining the identity of a DNA or protein sequence.
• Discovering new genes.
• Determining what variants have been described for a particular gene or
protein.
• Investigating expressed sequence tags (ESTs) that may exhibit alternative
splicing.
• Exploring amino acid residues that are important in the function and/or
structure of a protein.
Basic Local Alignment Search Tool (BLAST)
• BLAST is the main NCBI tool for comparing a protein or DNA sequence to
sequences in databases (Altschul et al., 1990, 1997).
• BLAST search can reveal what related sequences are present in the same
organism and other organisms.
• BLAST is a family of programs: BLASTP, BLASTN, BLASTX, TBLASTX,
TBLASTN, PSI-BLAST.
• A DNA sequence can be translated into 6 potential proteins, so protein
sequences can be compared to dynamically translated DNA databases or vice versa.
• The programs produce high‐scoring segment pairs (HSPs) that represent
local alignments between your query and database sequences.
