BLAST: An Introductory Tool For Students To Bioinformatics Applications
All content following this page was uploaded by Gareth Syngai on 25 October 2014.
Gareth Gordon Syngai1*, Pranjan Barman2, Rupjyoti Bharali2 & Sudip Dey3
1 Department of Biochemistry, Lady Keane College, Shillong – 793001
2 Department of Biotechnology, Gauhati University, Guwahati – 781014
3 Sophisticated Analytical Instrument Facility, North Eastern Hill University, Shillong – 793022
[email protected], [email protected], [email protected], [email protected]
*Corresponding author: Gareth Gordon Syngai; email:[email protected]
Abstract
BLAST, a sequence similarity search program, is an excellent starting point for teaching bioinformatics
to students, and it has the potential to enhance a student’s grasp of biomedical, biochemical, and biogeochemical
concepts. This article discusses the underlying concepts of the BLAST algorithm and the scores and statistics of the
alignments, with illustrations using NCBI BLAST. The article also emphasizes the need for students to be
familiar with the basic concepts and programs of bioinformatics, which has become a necessity in the biological
sciences because of recent advances in high-throughput techniques for data generation and analysis.
Keywords
BLAST, algorithm, introductory tool, bioinformatics teaching, bioinformatics applications
Introduction
The Basic Local Alignment Search Tool (BLAST) is one of the most commonly used tools for comparing
sequence information and retrieving sequences from databases and is thus an excellent starting point for
teaching bioinformatics (Kerfeld & Scott, 2011). BLAST has been utilized in nearly every branch of biology,
far beyond the scope of molecular genetics, molecular biology and protein biochemistry, and this tool has
made great contributions to many scientific fields since its development (Altschul et al., 1997; Altschul,
1991). Currently, the work of most biologists, bioinformaticians, evolutionists and medical scientists
cannot progress without the use of BLAST (Dong-Wook et al., 2012).

The major reasons for the ever-growing popularity of BLAST are the flexibility of the search algorithm,
reliable statistical reports, continual software development and the speed attained by the heuristic search
methods (Neumann et al., 2013).

In addition, by using BLAST, students can be introduced to concepts of molecular evolution (e.g., gene
duplication and divergence; orthologs versus paralogs) which are often quite abstract in nature. This is
possible because of the abundance of sequence data present in public databases, which raises the far more
attractive possibility of using searches tailored to a particular course, or, better yet, allowing the
students to choose their own examples.

Another benefit of teaching students how the BLAST algorithm works is that it provides an opportunity to
illustrate how mathematics functions as a language of biology. Of higher significance is the fact that
understanding the steps in the calculation of an E-value provides an opportunity to show the relationship
between how the algorithm works and the fundamental principles of biochemistry and evolution (Kerfeld &
Scott, 2011).

This paper presents a concise and conceptual approach, with simplified interpretations of the BLAST
algorithm, to help students understand the underlying basics of the BLAST program. This in turn has the
potential to enhance a student’s grasp of biomedical, biochemical, and biogeochemical concepts, thus helping
widen the scope for multidisciplinary integration.
BLAST: The Tool
BLAST is a sequence similarity search program that can be used via a web interface or as a stand-alone tool
to compare a user’s query to a database of sequences (Altschul et al., 1997; Altschul et al., 1990). There
are several types of BLAST to compare all combinations of nucleotide or protein queries with nucleotide or
protein databases (McGinnis & Madden, 2004). BLAST performs comparisons between pairs of sequences,
searching for regions of local similarity (Pertsemlidis & Fondon III, 2001).

The rationale for local similarity searching is that functional sites (e.g., catalytic sites of enzymes) are
localized to relatively short regions, which are conserved irrespective of deletions or mutations in
intervening parts of the sequence. Thus, a search for local similarity may produce more biologically
meaningful and sensitive results than a search attempting to optimize alignment over the entire sequence
lengths (Attwood et al., 2007).

Sequence similarity searching, typically with BLAST, is the most widely used and most reliable strategy for
characterizing newly determined sequences. Sequence similarity searches can identify “homologous” proteins
or genes by detecting excess similarity between the newly determined sequence (the query sequence) and any
similar sequence in the database, which in turn reflects common ancestry.

Homology implies that sequences may be related by divergence from a common ancestor or share common
functional aspects. Homologous genes found in different species that evolved from the same gene in a common
ancestor are called orthologs, whereas homologous genes in the same organism (arising by duplication of a
single gene in the evolutionary past) are called paralogs. Homologous genes (both orthologs and paralogs)
often have the same or related functions (Pierce, 2002).

Sequence homology searches are a key computational tool of molecular biology. They are important because
their products, the high-scoring alignments, are used in a range of areas, from estimating evolutionary
histories, to predicting functions of genes and proteins, to identifying possible drug targets (Pearson,
2013; Bayat, 2002; Bailey & Gribskov, 1998).

The BLAST algorithm was described by Altschul et al. in 1990. It became popular largely because
implementations of it have been very efficient and it has been optimized to work with parallel UNIX
architectures from an early stage (Attwood et al., 2007). The BLAST algorithm is a heuristic program, which
means that it relies on some smart shortcuts to perform the search faster (Madden, 2002). However, in this
trade-off for increased speed, the accuracy of the algorithm is slightly decreased (Zhimin & Zhongwen,
2013).

The Algorithm
The algorithm itself is straightforward, the important concept being that of the segment pair. Given two
sequences, a segment pair is defined as a pair of sub-sequences of the same length that form an ungapped
alignment. BLAST calculates all segment pairs between the query and the database sequences that score above
a threshold. The algorithm searches for fixed-length hits, which are then extended until certain threshold
parameters are achieved. The resulting high-scoring pairs (HSPs) form the basis of the ungapped alignments
that characterize BLAST output.

Subsequently, a modification of the algorithm was introduced for generating gapped alignments (Altschul et
al., 1997). The new algorithm seeks only one, rather than all, ungapped alignments that make up a
significant match, and hence speeds the initial database search. Dynamic programming is used to extend a
central pair of aligned residues in both directions to yield the final gapped alignment. Having dropped the
requirement to find all ungapped alignments independently, the new algorithm is three times faster than its
predecessor (Attwood et al., 2007).

The Steps
There are three major steps in the BLAST algorithm, the details of which are described below:

Step 1: BLAST filters the low-complexity regions (e.g., CA repeats) and removes them from the query sequence
(Pertsemlidis & Fondon III, 2001). The reason is that low-complexity regions and interspersed repeats
typically match many sequences; these matches are normally not of biological interest, may lead to spurious
results, and confound the statistics used by BLAST.

BLAST offers two query masking modes to avoid such matches. One is known as “hard-masking” and replaces the
masked portion of the query by X’s or N’s for all phases of the search. On the other hand, “soft-masking”
makes the masked portion of the query unavailable for finding the initial word hits, but the masked portion
is available for the gap-free and gapped extension once an initial word hit has been found (Camacho et al.,
2009). Filtering is only applied to the query sequence (or its translation products), not to the database
sequences. Default filtering is by the Nucleotide Dust Masker program (Morgulis et al., 2006) for nucleotide
queries and the SEG program (Wootton & Federhen, 1996) for protein queries. The BLAST formatter can now
represent these regions by lower-case letters, making them distinct from the (upper-case) non-filtered
regions. In addition, the user may select from three colors (black, gray, red) to vary the emphasis on these
regions. This new display option
CATCCTCGCCCGTTTCCACGCCGTCGTCCTCCTCATCATCGGCGAGAGCTGATTGCGTGGTGGTCAGAGG
CGAACCAGCGGTCTTCGTGGAGCTGGGACCCAGATCAAGGCTGCTCAACAGATTGCCTGCCGACTGGGAA
GACGTTAGGGTGTCCTTGTGATAGGAGCTGTGCCGATTGCCCAGCTTAGTGGATAGTGTTAGGTCGCCGT
TGCTCGTTGGGCGTAGACTGCCCACCACCTGACCACCGGGCAGGGTGGCGCTTCTCTTGTGGCGACCCTT
CGACTTGGGAAAGGCAGCCAGGATGTTGAGCCACCACTGGGATTCCTCTGAACTGGTGCCCTTCACAAAG
GTCACGCGCTCGGGAGCGGTTATGGCGATGGAGTTGGGGTGACCTGTCACCTCCACGGCGCTGGTAACCT
CCAGCACTTTGGTCATATCAACGCACGCCTGCGGTATGGTTTCGGGCTATAGAAAATATATGTAAATTAA
AGAGTAAACAAGTTGTATTTTAAGATTTTAATTAGGAGAATTAATTAATCGGTAATCAAATGAACTCGGC
CTATCGCGTAATAATATACATTTTTTAATTTAATGACTAATAAATAATATAAAATCTAATTAATAGTTCA
GTAAGTTAGTAAAAGTAAATCAATCTGGTGGTAATTTAAGAAGCCACTTTAATTCTTCCACTTCATAAAT
Fig 2. Nucleotide sequence of D. melanogaster alcohol dehydrogenase structural gene and flanks (composite
sequence) [GenBank: Z00030.1] in FASTA format.
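As a rough illustration of the two query masking modes described in Step 1, the following Python sketch marks a low-complexity stretch in a toy query and then applies hard-masking and soft-masking to it. The detector used here (a sliding window containing at most two distinct letters) is a deliberately crude stand-in chosen for this example; it is not the DUST or SEG algorithm. The function names, the window size and the word size are all assumptions made for illustration.

```python
def find_low_complexity(seq, window=12, max_distinct=2):
    """Flag every position lying in a window with few distinct letters.

    Toy heuristic for illustration only; NOT the DUST or SEG algorithm.
    """
    flags = [False] * len(seq)
    for start in range(len(seq) - window + 1):
        if len(set(seq[start:start + window].upper())) <= max_distinct:
            for i in range(start, start + window):
                flags[i] = True
    return flags


def hard_mask(seq, flags, mask_char="N"):
    """Replace flagged letters, so they are lost for every search phase."""
    return "".join(mask_char if f else c for c, f in zip(seq, flags))


def soft_mask(seq, flags):
    """Lower-case flagged letters; the original residues are kept."""
    return "".join(c.lower() if f else c.upper() for c, f in zip(seq, flags))


def seed_words(seq, flags, word_size=4):
    """Yield (offset, word) for words that avoid all flagged positions.

    Under soft-masking only this seeding stage skips the flagged region;
    a later extension step could still read the original letters.
    """
    for start in range(len(seq) - word_size + 1):
        if not any(flags[start:start + word_size]):
            yield start, seq[start:start + word_size].upper()


# Toy query: two unique flanks around a CA repeat (hypothetical data).
query = "GATTACAGTG" + "CA" * 8 + "TTGACCGTAG"
flags = find_low_complexity(query)
print(hard_mask(query, flags))
print(soft_mask(query, flags))
print(list(seed_words(query, flags)))
```

In this sketch the hard-masked query loses the repeat entirely, whereas the soft-masked query keeps it in lower case, so that only the seeding function ignores it.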
Fig 3. BLAST window with the query sequence pasted in it and the selected databases.
Fig 5. Description section in the BLAST report showing one-line summaries of sequences producing significant alignments.
Fig 6. Alignment section from a BLAST report showing pair-wise sequence alignment between a query sequence
and a database sequence.
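The segment-pair idea from the section "The Algorithm" (fixed-length hits that are extended without gaps into high-scoring pairs) can be sketched in a few lines of Python. This is a simplified teaching sketch, not the NCBI implementation: it assumes exact word matches as seeds, illustrative match/mismatch scores of +1/-3, and a simple drop-off rule that stops extension once the running score falls more than `x_drop` below the best score seen. All function names, parameters and sequences are hypothetical.

```python
def word_hits(query, subject, word_size):
    """Yield (q_pos, s_pos) for every exact word shared by both sequences."""
    index = {}
    for i in range(len(query) - word_size + 1):
        index.setdefault(query[i:i + word_size], []).append(i)
    for j in range(len(subject) - word_size + 1):
        for i in index.get(subject[j:j + word_size], ()):
            yield i, j


def extend_ungapped(query, subject, q_pos, s_pos, word_size,
                    match=1, mismatch=-3, x_drop=5):
    """Extend one word hit in both directions without gaps.

    Returns (score, q_start, s_start, length) of the best segment pair.
    """
    best = run = word_size * match
    best_right = 0
    k = 0
    # Rightward extension from the end of the word.
    while q_pos + word_size + k < len(query) and s_pos + word_size + k < len(subject):
        same = query[q_pos + word_size + k] == subject[s_pos + word_size + k]
        run += match if same else mismatch
        if run > best:
            best, best_right = run, k + 1
        elif best - run > x_drop:
            break
        k += 1
    # Leftward extension, starting from the best right-extended score.
    run = best
    best_left = 0
    k = 1
    while q_pos - k >= 0 and s_pos - k >= 0:
        same = query[q_pos - k] == subject[s_pos - k]
        run += match if same else mismatch
        if run > best:
            best, best_left = run, k
        elif best - run > x_drop:
            break
        k += 1
    length = best_left + word_size + best_right
    return best, q_pos - best_left, s_pos - best_left, length


def high_scoring_pairs(query, subject, word_size=4, min_score=8):
    """Collect distinct segment pairs scoring at least min_score."""
    hsps = set()
    for q_pos, s_pos in word_hits(query, subject, word_size):
        hsp = extend_ungapped(query, subject, q_pos, s_pos, word_size)
        if hsp[0] >= min_score:
            hsps.add(hsp)
    return sorted(hsps, reverse=True)


# Toy sequences sharing the 10-letter segment TGACCTGAAG (hypothetical data).
query = "GGTGACCTGAAGTT"
subject = "AAATGACCTGAAGCCC"
for score, q_start, s_start, length in high_scoring_pairs(query, subject):
    print(score, query[q_start:q_start + length], s_start)
```

Several overlapping word hits on the same diagonal extend to the same segment pair here, which is why the results are collected in a set before being reported.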