Bs982 l08 Basic Blast

The document provides an introduction to the Basic Local Alignment Search Tool (BLAST) by outlining why BLAST is used, how BLAST works, and how to run BLAST searches. BLAST is a tool that helps find similar sequences to a query sequence in a database. It works by breaking the query into pieces and searching for matches in the database, then extending matches. The document guides users on choosing a query sequence, BLAST program, database, and other parameters to run their own BLAST search.

Uploaded by

Narges Miri

BS982

Welcome to BLAST
Basic Local Alignment Search Tool
Ben Skinner

School of Life Sciences • University of Essex


Outline

• Why BLAST?

• How BLAST works

• Running BLAST for yourself


Why BLAST?

• Falling cost of sequencing
• Growing size of databases - 1979 Los Alamos Sequence Database (became GenBank)

https://www.ncbi.nlm.nih.gov/genbank/statistics/
Why BLAST?

• Imagine – you have sequenced something from an environmental sample
• Has anyone seen this before? Is it new?

Extract DNA

Your Sequence

DNA:
ATGTGATCCGACTATGACA….

PROTEIN:
MRPILMKGHERPLTFLRYNRDG….

https://www.universe-review.ca/I11-50-DNAsequencing1.jpg
Why BLAST?

• Or you may have a protein/DNA sequence from a database: NCBI/EMBL/SwissProt/UniProt
• What else is it similar to?

Your Sequence

DNA:
ATGTGATCCGACTATGACA….

PROTEIN:
MRPILMKGHERPLTFLRYNRDG….
Why BLAST?

BLAST is a tool that helps us find sequences similar to a query sequence in a database

BLAST can search for matches to DNA or protein queries
Why BLAST?

• Usually the first thing you do when you obtain a new, unknown sequence
• Provides biological context and helps establish hypotheses to be tested (by experimentation or further bioinformatic analysis)
• Part of our lexicon: “To BLAST a sequence” as in “To Google a question”
• Emphasizes speed over sensitivity
• Quick and simple to use, but very often used sub-optimally
BLAST can be used for different purposes

• Looking for species. If you are sequencing DNA from an unknown species, BLAST can identify the correct species or a homologous species.
• Looking for domains. If you BLAST a protein sequence (or a translated nucleotide sequence), you can identify known domains.
• Looking at phylogeny. You can use the BLAST web pages to generate a phylogenetic tree of the BLAST result.
• Mapping DNA to a known chromosome. If you are sequencing a gene from a known species but have no idea of its chromosome location, BLAST can map it.
• Annotations. BLAST can also be used to map annotations from one organism to another or to look for common genes in two related species.
Important concepts

• Similarity: Degree of likeness between two sequences, usually expressed as a percentage of similar (or identical) residues over a given length of the alignment. Can usually be easily calculated.
• Homology: A statement about the common evolutionary ancestry of two sequences. It is a hypothesis, not a measurement.
• A high degree of similarity implies a high probability of homology
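As a rough illustration of how similarity "can usually be easily calculated", the sketch below computes percent identity from two sequences that have already been aligned. The function name and the convention of using "-" for gaps are assumptions made for this example, and real tools differ in how they count gap columns.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percentage of identical residues over the aligned columns.

    Assumes both strings come from the same alignment (equal length)
    and that '-' marks a gap. Columns that are gaps in both sequences
    are ignored; a gap opposite a residue counts as a non-match.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    columns = [(a, b) for a, b in zip(aligned_a, aligned_b)
               if not (a == "-" and b == "-")]
    if not columns:
        return 0.0
    matches = sum(1 for a, b in columns if a == b)
    return 100.0 * matches / len(columns)


if __name__ == "__main__":
    # Two short, hand-aligned fragments (illustrative only)
    print(round(percent_identity("MAITKDDILEA", "MAISKDD-LEA"), 1))
```

A high percent identity on its own does not prove homology; it only makes the hypothesis more plausible.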
Homologues: genes with a common ancestor

Remember the Hox genes from the Eukaryotic genomes lecture?
Conservation of sequence implies conservation of function

Key to finding important regions – functional protein domains

https://ars.els-cdn.com/content/image/1-s2.0-S2001037015000070-gr3.jpg
Searching for similarity and homology
Outline

• Why BLAST?

• How BLAST works

• Running BLAST for yourself


Searching for similarity

• An important goal of genomics is to determine whether a particular sequence is “like” another sequence
• Compare new sequences with sequences already stored in a database
• Two alignment types: global and local
Global alignment

• Compares one whole sequence with another entire sequence (end-to-end alignment)
• Suitable for aligning closely related sequences, for example comparing two genes with the same function in human and mouse

See lecture on sequence alignment for details

Needleman-Wunsch algorithm
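
To make the end-to-end idea concrete, here is a minimal sketch of the Needleman-Wunsch scoring recurrence. It returns only the optimal global alignment score, not the alignment itself, and the match, mismatch and gap values are arbitrary choices for illustration rather than values from the lecture.

```python
def needleman_wunsch_score(a: str, b: str,
                           match: int = 1,
                           mismatch: int = -1,
                           gap: int = -2) -> int:
    """Optimal global alignment score with a linear gap penalty."""
    # Row 0: aligning the empty prefix of `a` against prefixes of `b`
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # column 0: prefix of `a` against nothing
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # gap in b
            left = curr[j - 1] + gap  # gap in a
            curr.append(max(diag, up, left))
        prev = curr
    # Global alignment: the score is read from the final cell,
    # so both sequences are consumed from end to end.
    return prev[-1]


if __name__ == "__main__":
    print(needleman_wunsch_score("ACGT", "AGT"))
```

Because every residue of both sequences must be placed, unrelated flanking regions pull the score down, which is why global alignment suits sequences that are similar along their whole length.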
Local alignment

• Aligns a subsection of one sequence with a subsection of another sequence
• Reveals regions that are highly similar, but does not necessarily compare the two sequences across their entire length
• Finds conserved patterns in DNA sequences or conserved domains in two proteins
• May uncover regions of homology that are related by descent between otherwise diverse sequences

Smith-Waterman algorithm
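
For comparison with global alignment, the following sketch shows the Smith-Waterman recurrence in its simplest form. It is score-only, uses a linear gap penalty, and its scoring values are illustrative assumptions. The two differences from a global alignment are that cell scores are never allowed to go below zero and that the result is the best cell anywhere in the table.

```python
def smith_waterman_score(a: str, b: str,
                         match: int = 2,
                         mismatch: int = -1,
                         gap: int = -2) -> int:
    """Best local alignment score between any substrings of a and b."""
    best = 0
    prev = [0] * (len(b) + 1)  # first row is all zeros: alignments may start anywhere
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # The 0 lets a poor alignment be abandoned and restarted
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


if __name__ == "__main__":
    # Only the shared "ACG" region contributes to the score
    print(smith_waterman_score("TTTACGTTTT", "GGGACGGGG"))
```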
How does BLAST work? Basic overview

The original BLAST program (Altschul et al 1990 J Mol Biol 215:403)

• The query sequence is broken into words of length W
• Each word, and similar “neighbourhood” words, are scored against the query word using a substitution matrix
• Words scoring below the neighbourhood score threshold T are discarded
• The database sequences are scanned for matches to the remaining words
• Each match is extended in both directions until the score drops a set amount below the best score found so far
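
The word-and-extend idea can be sketched in a few lines. This is a deliberately simplified toy, not the BLAST implementation: it seeds only on exact word matches (no neighbourhood words or substitution matrix), extends without gaps, and uses made-up match, mismatch and drop-off values. All function names are hypothetical.

```python
def find_seeds(query, subject, w):
    """Return (query_pos, subject_pos) pairs where a length-w word matches exactly."""
    index = {}
    for i in range(len(query) - w + 1):
        index.setdefault(query[i:i + w], []).append(i)
    seeds = []
    for j in range(len(subject) - w + 1):
        for i in index.get(subject[j:j + w], []):
            seeds.append((i, j))
    return seeds


def _extend(query, subject, qi, sj, step, score, match, mismatch, x_drop):
    """Walk in one direction; return (best score, number of residues kept)."""
    best, run, best_len, walked = score, score, 0, 0
    while 0 <= qi < len(query) and 0 <= sj < len(subject):
        run += match if query[qi] == subject[sj] else mismatch
        walked += 1
        if run > best:
            best, best_len = run, walked
        elif best - run > x_drop:
            break  # score fell too far below the best seen so far
        qi += step
        sj += step
    return best, best_len


def extend_seed(query, subject, qi, sj, w, match=1, mismatch=-2, x_drop=3):
    """Ungapped extension of one seed; returns (score, query_start, query_end)."""
    score = w * match  # the seed itself is an exact match
    score, right = _extend(query, subject, qi + w, sj + w, +1,
                           score, match, mismatch, x_drop)
    score, left = _extend(query, subject, qi - 1, sj - 1, -1,
                          score, match, mismatch, x_drop)
    return score, qi - left, qi + w + right


def best_hit(query, subject, w=4):
    """Highest-scoring extended seed, or None when no word matches."""
    hits = [extend_seed(query, subject, i, j, w)
            for i, j in find_seeds(query, subject, w)]
    return max(hits, default=None)


if __name__ == "__main__":
    query, subject = "TTACGTACGGA", "CCACGTACGCC"
    score, start, end = best_hit(query, subject)
    print(score, query[start:end])
```

Indexing the query words first means the subject is scanned only once, which hints at why this approach trades some sensitivity for speed: a similar region with no exact word match is never seeded at all.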
Using BLAST: the different tools and databases

Query tool to retrieve homologous genes from a database

Figure labels: Sequence, BLAST, Database (Target), Putative Function

Kerfeld and Scott, PLoS Biology 2011 9(2): e1001014.


Using BLAST: https://blast.ncbi.nlm.nih.gov/Blast.cgi
Outline

• Why BLAST?

• How BLAST works (basics)

• Running BLAST
Basic search strategy

• (1) Choose your sequence (query)
• (2) Choose the BLAST program
• (3) Choose the database to search
• (4) Choose optional parameters
• Then click “BLAST”


Getting a query sequence

• We start by finding our query sequence in FASTA format
• e.g. from GenBank, Ensembl, or a paper

Amino acid sequence of a protein, in FASTA format:


>ribosomal protein L7/L12 [Thiomicrospira crunogena XCL-2]
MAITKDDILEAVANMSVMEVVELVEAMEEKFGVSAAAVAVAGPAGDAGAAGEEQTEFDVVLTGAGDNKVAAIKAVRGATG
LGLKEAKSAVESAPFTLKEGVSKEEAETLANELKEAGIEVEVK

Nucleotide sequence of a gene, in FASTA format:


>gi|118139508:333094-333465 Thiomicrospira crunogena XCL-2
ATGGCAATTACAAAAGACGATATTTTAGAAGCAGTTGCTAACATGTCAGTAATGGAAGTTGTTGAACTTGTTGAAGCAAT
GGAAGAGAAGTTTGGTGTTTCTGCAGCAGCAGTTGCGGTTGCAGGTCCTGCAGGTGATGCTGGCGCTGCTGGTGAAGAAC
AAACAGAGTTTGACGTTGTCTTGACTGGTGCTGGTGACAACAAAGTTGCAGCAATCAAAGCCGTTCGTGGCGCAACTGGT
CTTGGGCTTAAAGAAGCGAAAAGTGCAGTTGAAAGTGCACCATTTACGCTTAAAGAGGGTGTTTCTAAAGAAGAAGCAGA
AACTCTTGCAAATGAGCTTAAAGAAGCAGGTATTGAAGTCGAAGTTAAATAA
What is the FASTA format?

• 1985: FASTP program developed for fast alignment of protein sequences, FASTN for fast alignment of nucleotide sequences
• 1988: FASTA program for fast alignment of all sequence types - nucleotide or protein
https://fasta.bioch.virginia.edu/fasta_www2/fasta_list2.shtml
• Had a format for input query sequences
• FASTA the program is rarely used now – replaced by BLAST – but the FASTA format lives on

“Description line” (not read as sequence data)
• Begins with >
• Ends with a new line

Sequence data (amino acid in this case)

>ribosomal protein L7/L12
MAITKDDILEAVANMSVMEVVELVEA
MEEKFGVSAAAVAVAGPAGDAGAA
GEEQTEFDVVLTGAGDNKVAAIKAVR
GATGLGLKEAKSAVESAPFTLKEG
VSKEEAETLANELKEAGIEVEVK
What is the FASTA format?

• FASTA format files are one of the most common and universally used input files in bioinformatics and sequence analysis
• Most bioinformatics software expects this format
• For more advanced users there is freely available code to input/output this format in bioinformatic pipelines, e.g. Biopython and many R packages.

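
To show how little is needed to read the format, here is a minimal FASTA parser written with the Python standard library only. It is a sketch for illustration (the function name is made up, and it does no validation of the residues); for real pipelines an established library such as Biopython is the safer choice.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a list of (description, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new description line closes the previous record
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            # Sequence data may be wrapped over several lines
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


if __name__ == "__main__":
    example = """>ribosomal protein L7/L12 [Thiomicrospira crunogena XCL-2]
MAITKDDILEAVANMSVMEVVELVEAMEEKFGVSAAAVAVAGPAGDAGAAGEEQTEFDVVLTGAGDNKVAAIKAVRGATG
LGLKEAKSAVESAPFTLKEGVSKEEAETLANELKEAGIEVEVK
"""
    for description, sequence in parse_fasta(example):
        print(description)
        print(len(sequence), "residues")
```

The description line is kept separate from the sequence, and the wrapped sequence lines are joined back into one string.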
Choose the BLAST program to use

• What type of alignment do you want?

Basic

• blastn – nucleotide query vs nucleotide database
• blastp – protein query vs protein database

More complex

• blastx – translated nucleotide query vs protein database
• tblastn – protein query vs translated nucleotide database
• tblastx – translated nucleotide query vs translated nucleotide database

https://blast.ncbi.nlm.nih.gov/Blast.cgi
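
The choice of program follows directly from the type of query and the type of database, which the small lookup below tries to capture. The function name and its arguments are hypothetical and exist only to restate the list above as code.

```python
def choose_blast_program(query_type, database_type, compare_as_protein=False):
    """Suggest a BLAST program from the query and database types.

    query_type and database_type are "nucleotide" or "protein".
    compare_as_protein only matters for nucleotide vs nucleotide searches:
    it selects tblastx, where both sides are translated before comparison.
    """
    pair = (query_type, database_type)
    if pair == ("nucleotide", "nucleotide"):
        return "tblastx" if compare_as_protein else "blastn"
    programs = {
        ("protein", "protein"): "blastp",
        ("nucleotide", "protein"): "blastx",   # query is translated
        ("protein", "nucleotide"): "tblastn",  # database is translated
    }
    if pair not in programs:
        raise ValueError(f"unknown sequence types: {pair}")
    return programs[pair]


if __name__ == "__main__":
    print(choose_blast_program("nucleotide", "protein"))
```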
NCBI BLAST interface (blastp: proteins)

Paste FASTA format sequence here
Other input options

• When using NCBI’s BLAST, you can also use a GenBank accession number instead of the FASTA sequence
• e.g. the accession for that ribosomal sequence is WP_011369709.1
Choose the database to search

• nr/nt = non-redundant protein (nr) and nucleotide (nt) databases (the most general databases)
• Human G+T = genomic and transcript database for humans
• dbEST = database of expressed sequence tags
• gss = genomic survey sequences

Protein databases

Nucleotide databases
BLASTn page
• Default = human
• Next = mouse
• Then = others
• Default others = nr/nt (includes many of those below, hence nr = “non-redundant”)
• Many more specific databases that allow you to focus the search
BLASTn page
• Search can be focussed on a
particular organism
• Model sequences and those
from uncultured organisms can
be excluded
• Entrez allows key words to be
used to restrict the search, e.g.
“olfactory receptor”
Choose optional parameters
• To understand optimal parameters you need to look at the output first!

Graphic display
Descriptions display
Alignment display

Kerfeld and Scott, PLoS Biology 2011 9(2): e1001014.


BLAST results page: potential homologues identified

• S′ (bit score): a measure of sequence similarity
• E (expect value): an overall statistical measure
Kerfeld and Scott, PLoS Biology 2011 9(2): e1001014.
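
The two numbers are linked. Under the standard BLAST statistics, the expect value is approximately E = m × n × 2^(−S′), where S′ is the bit score, m is the query length and n is the total database length. The sketch below applies that relationship directly; it uses raw lengths, whereas real BLAST output is computed from adjusted ("effective") lengths, so values will not match the web results exactly. The function name is an assumption for this example.

```python
def expect_value(bit_score: float, query_length: int, database_length: int) -> float:
    """Approximate E-value from a bit score: E = m * n * 2**(-S').

    query_length (m) and database_length (n) are raw residue counts here;
    BLAST itself uses effective lengths, so treat the result as a rough guide.
    """
    return query_length * database_length * 2.0 ** (-bit_score)


if __name__ == "__main__":
    # The same bit score is less surprising in a larger database
    for db_size in (1_000_000, 1_000_000_000):
        print(db_size, expect_value(40.0, 120, db_size))
```

This is why the same alignment can have a different E-value when the search is repeated against a different database: the bit score stays the same, but the search space changes.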
How to do BLAST wrong: believing the E-value tells the whole story

Does it cover the whole length of both the query and subject sequences?

Discovery of a Distant Homolog or Garbage?

Kerfeld and Scott, PLoS Biology 2011


Is your result biologically meaningful?
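
One simple check that goes beyond the E-value is how much of each sequence the alignment actually spans. The sketch below computes coverage from alignment coordinates, assuming 1-based inclusive start and end positions as shown in BLAST alignments. The function names and the 80% cut-off are arbitrary illustrations, not recommended values.

```python
def coverage(start: int, end: int, length: int) -> float:
    """Percent of a sequence spanned by an alignment (1-based, inclusive)."""
    if length <= 0:
        raise ValueError("sequence length must be positive")
    return 100.0 * (end - start + 1) / length


def spans_both(q_start, q_end, q_len, s_start, s_end, s_len, min_cov=80.0):
    """True when the alignment covers most of BOTH query and subject.

    min_cov is a hypothetical threshold chosen for this example.
    """
    return (coverage(q_start, q_end, q_len) >= min_cov
            and coverage(s_start, s_end, s_len) >= min_cov)


if __name__ == "__main__":
    # A hit that spans the query but only a small part of a longer subject
    print(coverage(1, 120, 123), coverage(300, 419, 900))
    print(spans_both(1, 120, 123, 300, 419, 900))
```

A hit that covers the whole query but only a small part of a much longer subject may reflect a single shared domain rather than two proteins that are related along their full length.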

Important considerations
• Understand the output
• Adjust the input
• Treat the analysis like an experiment (not like a Google search)
Optional parameters:
Click here to see the list of algorithm
parameters that can be changed
Summary

• BLAST is not one tool – it is a suite of tools
• Can quickly search for nucleotide or protein sequences – or more complex queries
• It is not a black box – you should understand how BLAST works
• Tuning the input parameters may be needed to find sequences of interest

Next time: more complex BLAST & choosing those optional parameters to improve your search
