CE6068 Lecture 3
• Quick Review
• Introduction to Sequence Alignment
• Methods for Sequence Alignment
Quick Review
DNA Nucleotides: Nitrogenous Bases
• There are 4 different nitrogenous bases: adenine (A), cytosine (C), guanine (G), and thymine (T).
Applications of DNA Arrays
• Sequencing by hybridization
‐ A promising alternative to sequencing by gel electrophoresis
‐ It may be able to reconstruct longer DNA sequences in less time
• Expression profile of a cell
‐ DNA arrays allow us to monitor the activities within a cell
‐ Each spot contains the complement of a particular gene
‐ Due to hybridization, we can measure the concentration of different mRNAs within a cell
• SNP detection
‐ Probes carrying different alleles are used to detect single-nucleotide variations.
Microarray
• Microarray technology has become one of the indispensable tools that many
biologists use to monitor genome-wide expression levels of genes (actually the
mRNAs) in a given organism.
‐ Expression level is estimated by measuring the amount of mRNA for a particular gene.
• A microarray is typically a glass slide onto which DNA molecules are fixed in an
orderly manner at specific locations called spots (or features).
• DNA microarrays, also known as DNA chips or biochips, are a technology used to
measure the expression levels of thousands of genes simultaneously or to genotype
multiple regions of a genome.
The Principle of Microarray
• Microarrays can be used to measure gene expression in many ways, but one of the most
popular applications is to compare the expression of a set of genes from a cell maintained
in a particular condition (condition A) with that of the same set of genes from a reference
cell maintained under normal conditions (condition B).
20240313 Exercise #13
• What is the purpose of DNA microarrays?
(A) To sequence DNA (B) To visualize DNA
(C) To replicate DNA (D) To monitor gene expression
DNA microarrays, also known as DNA chips or biochips, are a technology used to
measure the expression levels of thousands of genes simultaneously or to genotype
multiple regions of a genome.
20240313 Exercise #16
• What is the role of the ribosome in protein synthesis?
(A) Synthesizing DNA (B) Folding proteins into their three-dimensional structures
(C) Translating mRNA into protein (D) Transcribing DNA into RNA
‣ Protein folding is influenced by the sequence of amino acids in the polypeptide chain,
which dictates the chemical and physical interactions that determine the final shape of
the protein.
‣ While the information for folding is inherent in the polypeptide sequence synthesized
by the ribosome, the ribosome itself does not directly fold the proteins.
‣ Protein folding can occur spontaneously as the polypeptide chain is being synthesized
and emerges from the ribosome.
20240313 Exercise #17
• What is the main difference between RNA and DNA nucleotides?
(A) RNA nucleotides include the sugar ribose, while DNA includes deoxyribose
(B) DNA nucleotides include uracil, while RNA includes thymine
(C) RNA nucleotides are only composed of adenine and guanine
(D) DNA nucleotides can form double-stranded structures, while RNA cannot
Option (D) is not entirely accurate: RNA can also form double-stranded structures. While DNA is known for
its stable double-helical structure, RNA can form double-stranded regions through base pairing within
a single molecule (forming hairpin structures) or between two RNA molecules. However, these RNA
double-stranded regions are typically shorter and more transient than the long, stable double helix
of DNA.
Chromosomes
• A chromosome is a long DNA molecule coiled around proteins called histones, forming a
structure that organizes and condenses DNA to fit within the cell's nucleus.
• Chromosomes ensure that DNA is accurately replicated and distributed during cell division.
Humans, for example, have 23 pairs of chromosomes in each cell, for a total of 46.
Ref. https://my.clevelandclinic.org/health/body/23064-dna-genes--chromosomes
Chromosomes (figure)
[Figure: chromosome structure, labeled with the centromere, nucleosome, and chromatid]
Ref. https://www.britannica.com/science/tumor-necrosis-factor
Ref. https://my.clevelandclinic.org/health/body/23064-dna-genes--chromosomes
Genome
• The genome of an organism is the complete set of genetic material, including all of its
genes and noncoding sequences, contained within the chromosomes.
• The genome encompasses not only the coding regions that specify proteins and
functional RNAs but also regulatory sequences, introns, and intergenic regions.
• It represents the entire blueprint for the organism's development, physiology, behavior,
and evolution.
Gene
Ref. https://microbenotes.com/genes-and-loci-a-complete-guide/
• A gene is a specific segment or sequence of DNA located within the genome that
codes for a particular product, typically a protein or a functional RNA molecule, such
as ribosomal RNA (rRNA) or transfer RNA (tRNA).
• Genes are the basic physical and functional units of heredity and are the instructions
that dictate how an organism is built and how it operates.
• Each gene has a specific location (locus) on a chromosome, one of the structures that
organize DNA within the nucleus.
Genome and Gene (1/2)
• The genome is like a library that contains numerous books (genes), each with specific
instructions for building various components of the organism or for carrying out specific
functions. In this analogy, if the genome is the entire library, genes are individual books
within that library.
• The function and expression of genes are determined by their sequences within the context of
the entire genome. Genes do not work in isolation; their activity can be regulated by other
genes or by non-coding regions of the genome. The genome's structure, including how genes
are arranged and interact with each other and with non-coding DNA, influences the
organism's development, traits, and behavior.
Genome and Gene (2/2)
• While all cells in an organism (except for gametes and some immune cells) contain the same
genome, different genes are expressed in different cell types and at different times. This
selective gene expression allows for the diverse range of cell types and functions within an
organism, all originating from the same genomic blueprint.
• Genomes evolve over time through mutations, gene duplications, and other genomic
rearrangements. Genes within the genome can be gained, lost, or modified, leading to
evolutionary changes in the organism. The study of genomes (genomics) thus provides
insights into the evolutionary history and relationships among species.
Genes and Chromosomes
• Every gene is located on a chromosome.
• Chromosomes serve as the structural foundation for organizing and segregating DNA
in a way that ensures accurate gene expression and DNA replication.
• The specific location of a gene on a chromosome is referred to as its locus.
• Each gene's locus is fixed, meaning that genes are found in the same location on
chromosomes across the individuals of a species.
Ref. https://quizlet.com/398226811/year-9-genes-and-chromosomes-diagram/
DNA Sequences and Chromosomal Location
• DNA sequences, both coding sequences (genes) and non-coding sequences, are located in
specific regions of chromosomes.
• The entire DNA sequence of a chromosome includes, in addition to the coding regions (genes)
that are transcribed into RNA, vast stretches of non-coding DNA that play various roles, such
as regulating gene expression, structuring the chromosome, and protecting the chromosome
ends (telomeres).
Ref. https://www.toolsbiotech.com/news_detail.php?id=157
Functional Implications
• The location of a gene on a chromosome can have functional implications.
‐ Gene Regulation: The expression of genes can be influenced by nearby regulatory sequences and
the chromatin state (how tightly DNA is packed) of their chromosomal region.
‐ Genetic Linkage: Genes that are located close to each other on the same chromosome tend to be
inherited together. This principle of genetic linkage is used in genetic mapping to study the
inheritance patterns of traits.
‐ Chromosomal Rearrangements: Sometimes, chromosomal rearrangements (such as translocations,
deletions, or duplications) can affect gene function by moving a gene to a new chromosomal
environment, altering gene expression, or disrupting gene structure.
Introduction to Sequence Alignment
What is Sequence Alignment (1/2)
• Sequence alignment is a method used to arrange DNA, RNA, or protein sequences to
identify regions of similarity that may indicate functional, structural, or evolutionary
relationships.
• It involves the comparison of sequences to find a series of individual characters or
patterns that are in the same order in the sequences but not necessarily contiguous.
• This comparison often requires the introduction of gaps in one or more of the
sequences to optimize the alignment, with the objective of maximizing the number of
matches and minimizing the number of mismatches and gaps.
What is Sequence Alignment (2/2)
• The core questions sequence alignment aims to answer are:
‐ How can two or more biological sequences be optimally aligned to reveal regions
of similarity and difference, and what do these regions tell us about the biological
function and evolutionary history of these sequences?
‐ How can we identify regions of similarity between two or more biological
sequences, and what does this similarity imply about their structural, functional, or
evolutionary relationships?
Definition
• An alignment of two sequences is formed by inserting spaces in arbitrary locations
along the sequences so that they end up with the same length and there are no two
spaces at the same position of the two augmented sequences.
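The definition above can be checked mechanically. The following is a minimal sketch, assuming gaps are written as "-"; the function name `is_valid_alignment` is hypothetical and not part of the lecture.

```python
GAP = "-"  # assumed gap symbol


def is_valid_alignment(row_a, row_b, seq_a, seq_b):
    """Return True if (row_a, row_b) is an alignment of seq_a and seq_b."""
    # Both augmented sequences must end up with the same length.
    same_length = len(row_a) == len(row_b)
    # No column may contain a space (gap) in both rows.
    no_double_gap = all(not (x == GAP and y == GAP) for x, y in zip(row_a, row_b))
    # Removing the gaps must give back the original sequences.
    restores = row_a.replace(GAP, "") == seq_a and row_b.replace(GAP, "") == seq_b
    return same_length and no_double_gap and restores


print(is_valid_alignment("AC-TG", "ACGTG", "ACTG", "ACGTG"))
print(is_valid_alignment("AC-TG", "AC-TG", "ACTG", "ACTG"))
```

The second call fails the check because both rows have a gap in the same column.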
The Core Question of Sequence Alignment
• Structural Similarity:
If two sequences can be aligned closely, does this imply a structural similarity, suggesting that
they fold in similar ways or have similar molecular shapes?
• Functional Similarity:
Does sequence similarity imply that the sequences perform similar functions in the cell, such
as enzymatic activity, DNA binding, or signaling?
• Evolutionary Relatedness:
Do similarities in sequences suggest a common evolutionary ancestor, and can differences
help us infer evolutionary distances between species?
Why Sequence Alignment
• Gene and Protein Function Prediction: By aligning an unknown sequence with known
sequences, we can infer the function of genes or proteins based on similarity to those with
known functions.
• Understanding Evolutionary Relationships: Sequence alignments can reveal how species
are related and how certain genes or proteins have evolved over time. This helps in
constructing phylogenetic trees and understanding evolutionary mechanisms.
• Identifying Disease-Causing Mutations: By comparing diseased and normal sequences,
alignments can highlight mutations responsible for diseases, aiding in diagnostics and
treatment strategies.
• Drug Discovery and Design: Sequence alignment can identify targets for drug binding or
modification, helping in the design of drugs with better efficacy and lower toxicity.
• Conserved Regions and Regulatory Elements Identification: In genomics, alignments help
identify conserved DNA sequences that may play crucial roles in gene regulation,
development, and cellular processes.
Interpretation of Sequence Alignment
• The interpretation of sequence alignment involves analyzing the arrangement of two
or more biological sequences—DNA, RNA, or proteins—to identify regions of
similarity and difference.
• These alignments are pivotal for drawing conclusions about the functional, structural,
and evolutionary relationships among the sequences in question.
Similarity and Homology
• Similarity refers to the degree to which nucleotide or amino acid residues match
between sequences in the alignment. High similarity often suggests a related function
or origin.
• Homology, a term derived from evolutionary biology, indicates that sequences share a
common ancestry. In bioinformatics, sequence similarity is used as a basis to infer
homology, though it's important to note that high similarity does not automatically
imply homology without additional evolutionary evidence.
Matches, Mismatches, and Gaps
• Matches occur when identical residues are aligned, suggesting regions of
conservation that are often crucial for maintaining the structural integrity or functional
activity of the molecule.
• Mismatches represent divergent residues and can indicate points of evolutionary
change, mutation, or functional diversification.
• Gaps are introduced into alignments to maximize similarity, representing insertions or
deletions (indels) that have occurred since the last common ancestor. The placement
and length of gaps can provide insights into evolutionary events and functional
differences.
Ref. https://www.researchgate.net/figure/Example-of-match-mismatch-and-gap_fig1_347177164
Scoring Matrices and Alignment Scores
• The alignment score is a quantitative measure of how well the sequences align, based
on a scoring matrix for matches and mismatches together with penalties for gaps. Higher
scores generally indicate a better alignment and potentially more significant biological
relationships.
• Scoring matrices, such as PAM (Point Accepted Mutation) or BLOSUM (BLOcks SUbstitution
Matrix) for proteins, are used to calculate these scores. These matrices are derived from
empirical data on the frequency of substitutions between amino acids in related proteins,
helping to distinguish between more likely (conservative) and less likely (radical)
substitutions.
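To make the idea of an alignment score concrete, here is a minimal sketch that scores an already gapped pair of sequences. It assumes a simple match/mismatch/gap scheme rather than a PAM or BLOSUM matrix; the function name and the default values (+2, -1, -1) are illustrative assumptions.

```python
def score_alignment(row_a, row_b, match=2, mismatch=-1, gap=-1):
    """Sum the column scores of two gapped sequences of equal length."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned sequences must have the same length")
    total = 0
    for x, y in zip(row_a, row_b):
        if x == "-" or y == "-":
            total += gap        # a gap in either row
        elif x == y:
            total += match      # identical residues
        else:
            total += mismatch   # different residues
    return total


print(score_alignment("ACGT", "ACGG"))  # three matches and one mismatch
```

For protein alignments, the match and mismatch values would be replaced by a lookup in a substitution matrix.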
Conservation and Variability
• Conserved regions within alignments are stretches of high similarity across all
sequences being compared. These often correlate with important functional or
structural elements, such as active sites in enzymes or binding domains in proteins.
• Variable regions show more divergence and may indicate areas where evolutionary
changes have provided adaptive advantages or where different functions have evolved.
Functional and Evolutionary Insights
• Sequence alignment can reveal the functional relationships between sequences by
highlighting conserved motifs or domains that are critical for biological activity.
• From an evolutionary perspective, alignments can be used to construct phylogenetic
trees, illustrating the evolutionary distances and relationships between sequences.
These trees help trace the lineage and divergence of genes or proteins across different
species.
Key Terminologies (1/8)
• Query Sequence
‐ The sequence that is being searched against a database or compared to other sequences.
‐ Example: A DNA sequence obtained from a new species being compared to a database of
known sequences to find matches.
• Target Sequence
‐ The sequence(s) against which the query sequence is compared.
‐ Example: Known sequences in a database that are compared with a query sequence to
identify similarities.
Key Terminologies (2/8)
• Matches
‐ In an alignment, matches occur when identical residues (nucleotides or amino acids) are
aligned in both sequences.
‐ Example: In the alignment of two sequences, ACGT and ACGG, the first three nucleotides
(ACG) are matches.
• Mismatches
‐ Aligned residues that are different between sequences, indicating variation.
‐ Example: In aligning ACGT and ACGG, the last nucleotide is a mismatch (T in the first
sequence and G in the second).
Key Terminologies (3/8)
• Gaps
‐ Spaces introduced into sequences during alignment to optimize similarity. They
represent insertions or deletions.
‐ Example: Aligning ACTG and ACG may result in ACTG aligned with AC-G (a gap in the
second sequence) to indicate a deletion or insertion event.
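Matches, mismatches, and gaps can be tallied column by column. The sketch below assumes "-" as the gap symbol; `classify_columns` is a hypothetical helper name used only for illustration.

```python
def classify_columns(row_a, row_b, gap="-"):
    """Count matches, mismatches, and gap columns in a pairwise alignment."""
    counts = {"match": 0, "mismatch": 0, "gap": 0}
    for x, y in zip(row_a, row_b):
        if x == gap or y == gap:
            counts["gap"] += 1       # insertion or deletion (indel)
        elif x == y:
            counts["match"] += 1     # identical residues
        else:
            counts["mismatch"] += 1  # substitution
    return counts


print(classify_columns("ACGT", "ACGG"))
print(classify_columns("ACTG", "AC-G"))
```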
Key Terminologies (4/8)
• Scoring Matrices
‐ Mathematical matrices used to score alignments, rewarding matches and penalizing
mismatches and gaps.
‐ Example: BLOSUM62 is a scoring matrix often used for protein alignments, where
each cell value represents the score for aligning two amino acids.
Key Terminologies (5/8)
• Consensus Sequence
‐ A sequence derived from an alignment that represents the most common residue
found at each position. It reflects the conserved regions across the aligned
sequences.
‐ Example: If three sequences align as ACGT, ACGG, and ACCG, the consensus
might be ACGG, indicating the most common residues at each position.
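A consensus can be computed by taking the most common residue in each column. This is a minimal sketch that assumes the sequences are already aligned to the same length; the function name is hypothetical, and ties are resolved by whichever residue appears first in the column.

```python
from collections import Counter


def consensus(aligned_seqs):
    """Return the most common residue at each position of aligned sequences."""
    columns = zip(*aligned_seqs)  # one tuple of residues per position
    return "".join(Counter(col).most_common(1)[0][0] for col in columns)


print(consensus(["ACGT", "ACGG", "ACCG"]))  # ACGG
```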
Key Terminologies (6/8)
• Global Alignment
‐ An alignment strategy that aligns entire sequences from beginning to end, optimizing for the best
possible match across the whole length.
‐ Example: The Needleman-Wunsch algorithm is used for global alignment, suitable for sequences of
similar length.
• Local Alignment
‐ An alignment strategy that finds the best matching subsequence(s) between sequences, allowing for
the alignment of shorter regions that are highly similar within longer sequences.
‐ Example: The Smith-Waterman algorithm is used for local alignment, useful for identifying
functional domains within genes or proteins.
Ref. https://microbenotes.com/local-global-multiple-sequence-alignment/
Local Alignment
• Goal: To find the highest scoring alignment for any subsequence within the given
sequences. This means finding regions of high similarity without concern for the
alignment of the sequences outside these regions. It's used when the aim is to identify
regions of conservation or functional significance within larger, perhaps only partially
related, sequences.
• Unlike global alignment, local alignment searches for the best matching segment
between sequences, making it ideal for uncovering functional motifs or domains
within otherwise dissimilar sequences.
Semi-global Alignment
• Goal: To align one sequence completely while aligning a significant portion of the
other sequence. This type is used when aligning sequences to reference genomes or
when one sequence is a complete gene and the other is a partial sequence or contains
extra regions (e.g., introns in genomic vs. cDNA).
• Semi-global alignment combines aspects of both global and local alignments. It's
similar to global alignment in trying to align entire sequences but allows for
overhangs like in local alignments, accommodating sequences of different lengths
without penalizing the overhanging ends.
Methods for Sequence Alignment
• Dot matrices
• Dynamic programming
‐ Global alignment
‐ Local alignment
• BLAST heuristic approach
Dot Matrix Method
• The dot matrix method is a graphical approach used to visualize sequence similarity
between two biological sequences (DNA, RNA, or proteins).
• It's one of the simplest forms of sequence comparison and does not directly involve a
scoring system like dynamic programming algorithms.
Ref. https://microbenotes.com/local-global-multiple-sequence-alignment/
Basic Terminologies in Dot Matrices (1/2)
• Window Size: Refers to the length of the sequence segment considered for matching at any
point. A larger window size smooths out the plot, making it easier to identify longer regions
of similarity but potentially obscuring smaller details.
• Threshold: The minimum level of similarity that must be met within the window to place a
dot in the matrix. Adjusting the threshold can help highlight more significant matches or
reduce noise from random matches.
• Diagonal Line: In a dot matrix, a continuous diagonal line indicates a region of consecutive
matches between the sequences. Such lines represent similarity or identical sequences.
Basic Terminologies in Dot Matrices (2/2)
• Gaps and Mismatches: Absences of dots along a diagonal line suggest gaps (insertions or deletions)
or mismatches (substitutions) in the alignment of the two sequences.
• Repeats and Inversions:
‐ Repeats are indicated by parallel diagonal lines, showing that a sequence segment appears more than once
in either or both sequences.
‐ Inversions or reverse complements (particularly relevant in DNA sequences) appear as diagonal lines
running perpendicular (in the opposite direction) to the main diagonal. These indicate that a sequence in
one strand is the reverse complement of a sequence in the other strand.
• Noise: Random dots scattered throughout the matrix that do not form part of a significant pattern.
Noise results from chance matches, which are especially common in nucleotide sequences because
the alphabet has only four letters.
Applying Dot Matrices
• Dot matrices are particularly useful for quickly identifying sequence features without
the need for complex computation. By visually scanning the plot, researchers can
identify:
‐ Regions of Alignment: Indicated by straight, diagonal lines.
‐ Structural Features: Such as repeats or palindromic sequences, highlighted by parallel lines
or lines perpendicular to the main diagonal.
‐ Evolutionary Relationships: By comparing sequences from different species or genes,
researchers can infer evolutionary conservation or divergence.
Dot Matrix Method – How It Works
• A two-dimensional matrix is created, with one sequence placed along the x-axis and
the other along the y-axis.
• A dot is placed at the intersection (i, j) in the matrix if the nucleotide or amino acid at
position i in one sequence matches the nucleotide or amino acid at position j in the
other sequence.
• The resulting pattern of dots can reveal similarity regions, repetitive sequences,
inversions, and other structural motifs.
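The construction described above fits in a few lines. This is a minimal sketch with no window or threshold: a dot is placed wherever two residues are identical. The function names `dot_matrix` and `render` are hypothetical.

```python
def dot_matrix(seq_x, seq_y):
    """Rows follow seq_y, columns follow seq_x; True marks a dot."""
    return [[x == y for x in seq_x] for y in seq_y]


def render(matrix, seq_x, seq_y):
    """Draw the matrix as text, using '*' for a dot and '.' for no dot."""
    lines = ["  " + " ".join(seq_x)]
    for residue, row in zip(seq_y, matrix):
        lines.append(residue + " " + " ".join("*" if dot else "." for dot in row))
    return "\n".join(lines)


print(render(dot_matrix("AGCATGC", "ACAATCC"), "AGCATGC", "ACAATCC"))
```

Runs of "*" along a diagonal in the printed plot correspond to stretches of consecutive matches.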
Adjusting Parameters
https://barryus.shinyapps.io/dotplot/
• The utility of a dot matrix can be significantly influenced by the choice of parameters
like window size and threshold.
• A smaller window size and lower threshold can highlight short regions of high
similarity but may introduce more noise into the plot.
• Conversely, larger window sizes and higher thresholds can help identify longer
regions of similarity but may overlook shorter, potentially significant, motifs.
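The effect of window size and threshold can be explored with a small sketch. Here a dot is recorded at (i, j) when at least `threshold` positions match within windows of length `window` starting at i and j. The function name and default values are assumptions for illustration, not the settings of any particular tool.

```python
def windowed_dot_matrix(seq_x, seq_y, window=3, threshold=2):
    """Return the set of (i, j) window start positions that receive a dot."""
    dots = set()
    for i in range(len(seq_x) - window + 1):
        for j in range(len(seq_y) - window + 1):
            # Count identical positions inside the two windows.
            matches = sum(a == b for a, b in zip(seq_x[i:i + window], seq_y[j:j + window]))
            if matches >= threshold:
                dots.add((i, j))
    return dots


seq = "ACGTACGT"
print(len(windowed_dot_matrix(seq, seq, window=1, threshold=1)))  # every single match
print(len(windowed_dot_matrix(seq, seq, window=4, threshold=4)))  # only exact 4-mers
```

With the stricter setting, the scattered single-residue dots disappear and only the main diagonal and the repeat remain.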
Dynamic Programming
• Dynamic programming algorithms for sequence alignment can be seen as extensions of the
basic concept introduced by dot matrices.
• Dynamic programming algorithms take the concept further by not only identifying regions of
similarity but also quantitatively evaluating the best possible alignment between sequences.
They do this by assigning scores to matches, mismatches, and gaps, and by using these scores
to systematically construct an optimal alignment path through a matrix.
• This approach is exemplified by the Needleman-Wunsch and Smith-Waterman algorithms,
which apply dynamic programming to find the optimal global and local alignments,
respectively.
Dynamic Programming in Sequence Alignment
Needleman-Wunsch Algorithm (1/2)
• Specific Question:
How can we align two sequences in their entirety, from start to finish, to maximize
their alignment score based on matches, mismatches, and gaps?
• The Needleman-Wunsch algorithm uses dynamic programming to find the optimal
global alignment between two sequences.
• It systematically compares every character of one sequence with every character of
the other, considering the possibility of matches, mismatches, and gaps.
Ref. https://upload.wikimedia.org/wikipedia/commons/3/3f/Needleman-Wunsch_pairwise_sequence_alignment.png
Example (1/4)
• Sequences:
‣ Sequence A (horizontal, columns): AGCATGC
‣ Sequence B (vertical, rows): ACAATCC
• Scoring system:
‣ +2 for a match
‣ -1 for a mismatch
‣ -1 for a gap
Example (2/4)
• Step 1: Initialization
‣ First, we create a matrix with Sequence A on the top and Sequence B on the side.
‣ The matrix dimensions will be (length of Sequence B + 1) rows by (length of Sequence A + 1)
columns, here 8 x 8.
‣ The first row and first column hold cumulative gap penalties: 0, -1, -2, ..., -7.
• Resulting alignment:
Sequence A: AGCA-TGC
Sequence B: A-CAATCC
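The whole procedure (initialization, matrix filling, traceback) can be sketched as follows, using the scoring system from this example (+2 match, -1 mismatch, -1 gap). This is a minimal illustration under those assumptions, not a reference implementation; the function name and the tie-breaking order in the traceback (diagonal, then up, then left) are choices made here.

```python
def needleman_wunsch(seq_a, seq_b, match=2, mismatch=-1, gap=-1):
    """Global alignment; seq_a runs along the columns, seq_b along the rows."""
    rows, cols = len(seq_b) + 1, len(seq_a) + 1
    score = [[0] * cols for _ in range(rows)]
    # Initialization: the first row and column hold cumulative gap penalties.
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        score[i][0] = i * gap
    # Matrix filling: each cell is the best of diagonal, up, and left moves.
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if seq_b[i - 1] == seq_a[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + pair,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Traceback from the bottom-right corner back to the top-left corner.
    row_a, row_b = [], []
    i, j = len(seq_b), len(seq_a)
    while i > 0 or j > 0:
        diagonal_ok = i > 0 and j > 0
        pair = match if diagonal_ok and seq_b[i - 1] == seq_a[j - 1] else mismatch
        if diagonal_ok and score[i][j] == score[i - 1][j - 1] + pair:
            row_a.append(seq_a[j - 1]); row_b.append(seq_b[i - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            row_a.append("-"); row_b.append(seq_b[i - 1])  # gap in Sequence A
            i -= 1
        else:
            row_a.append(seq_a[j - 1]); row_b.append("-")  # gap in Sequence B
            j -= 1
    return score[-1][-1], "".join(reversed(row_a)), "".join(reversed(row_b))


best, aligned_a, aligned_b = needleman_wunsch("AGCATGC", "ACAATCC")
print(best)
print(aligned_a)
print(aligned_b)
```

Several alignments can share the optimal score, so the traceback may return a different but equally scoring alignment from the one shown on the slide, depending on how ties are broken.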
Needleman-Wunsch Algorithm – Complexity
• Time Complexity:
The time complexity of the Needleman-Wunsch algorithm is O(mn), where m and n
are the lengths of the two sequences. This is because each cell in the m x n matrix
needs to be filled based on the scores of its three neighboring cells.
• Space Complexity:
Similarly, the space complexity is also O(mn) due to the need to store the entire
matrix for the traceback process. This can become a limitation for very long
sequences.
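When only the optimal score is needed and no traceback is performed, two rows of the matrix are enough. The sketch below illustrates this under the same simple scoring scheme (+2, -1, -1, assumed values); the function name is hypothetical. Recovering the alignment itself in linear space requires additional techniques, such as Hirschberg's algorithm, which are not shown here.

```python
def nw_score_two_rows(seq_a, seq_b, match=2, mismatch=-1, gap=-1):
    """Optimal global alignment score, keeping only two matrix rows in memory."""
    previous = [j * gap for j in range(len(seq_a) + 1)]  # row 0 of the matrix
    for i, b in enumerate(seq_b, start=1):
        current = [i * gap]  # first column of row i
        for j, a in enumerate(seq_a, start=1):
            pair = match if a == b else mismatch
            current.append(max(previous[j - 1] + pair,   # diagonal
                               previous[j] + gap,        # up
                               current[j - 1] + gap))    # left
        previous = current
    return previous[-1]


print(nw_score_two_rows("AGCATGC", "ACAATCC"))  # 7
```

The running time is still O(mn); only the memory needed for the score drops to one row at a time.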
Smith-Waterman Algorithm
• The Smith-Waterman algorithm identifies the most similar fragment(s) or substrings
between two sequences, which is crucial for uncovering functional motifs, domains
within proteins or genes, and other significant local similarities.
• Unlike the Needleman-Wunsch algorithm, which aligns sequences in their entirety
(global alignment), the Smith-Waterman algorithm focuses on the highest-scoring
local alignments, providing a more nuanced view when the sequences have diverged
significantly or when only parts of the sequences are of interest.
Smith-Waterman Algorithm Steps
• Initialization: A scoring matrix is created, similar to the Needleman-Wunsch approach, with dimensions (m+1) x (n+1),
where m and n are the lengths of the two sequences. The first row and column are initialized to 0, which is a key
difference from the global alignment approach and facilitates the focus on local alignment.
• Matrix Filling: Each cell [i,j] in the matrix is calculated based on:
‐ The score from the cell directly above plus the gap penalty.
‐ The score from the cell directly to the left plus the gap penalty.
‐ The score from the diagonal cell plus a match score (if the residues match) or a mismatch penalty (if they do not).
‐ Zero, which ensures that negative scores are not propagated, emphasizing the algorithm's focus on local rather than global alignment.
This choice allows alignments to start and end anywhere within the matrix, optimizing for the highest-scoring local alignment.
‐ The optimal score at each cell is the maximum of these four options.
• Identifying the Highest-Scoring Subsequence:
‐ Unlike the global alignment process that always backtracks from the bottom-right corner, the Smith-Waterman algorithm starts the
traceback from the cell with the highest score in the entire matrix. This cell marks the end of the best-scoring local alignment.
‐ The traceback continues until a cell with a score of 0 is reached, which marks the start of this local alignment.
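The three steps above can be sketched as follows. The scoring values (+2 match, -1 mismatch, -1 gap) and the function name are assumptions for illustration; when several cells share the highest score, this sketch keeps the first one it encounters.

```python
def smith_waterman(seq_a, seq_b, match=2, mismatch=-1, gap=-1):
    """Local alignment; returns (best score, aligned part of a, aligned part of b)."""
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    # Initialization: the first row and column stay at 0.
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    # Matrix filling: the maximum of four options, including zero.
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            score[i][j] = max(0,
                              score[i - 1][j - 1] + pair,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)
    # Traceback: start at the highest-scoring cell and stop at a score of 0.
    row_a, row_b = [], []
    i, j = best_pos
    while i > 0 and j > 0 and score[i][j] > 0:
        pair = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + pair:
            row_a.append(seq_a[i - 1]); row_b.append(seq_b[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            row_a.append(seq_a[i - 1]); row_b.append("-")
            i -= 1
        else:
            row_a.append("-"); row_b.append(seq_b[j - 1])
            j -= 1
    return best, "".join(reversed(row_a)), "".join(reversed(row_b))


print(smith_waterman("GGACGTGG", "ACGT"))
```

In this usage example the shorter sequence occurs exactly inside the longer one, so the local alignment covers only that shared segment and ignores the flanking residues.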
Smith-Waterman Algorithm – Complexity
• Time Complexity:
The time complexity of the Smith-Waterman algorithm is O(mn), as it involves filling
a matrix of size m x n where m and n are the lengths of the two sequences.
• Space Complexity:
The space complexity is also O(mn) due to the need to store the entire matrix for the
traceback process. For long sequences, this can be computationally expensive.
Basic Local Alignment Search Tool
• BLAST (Basic Local Alignment Search Tool) is one of the most widely used
bioinformatics tools for comparing an input biological sequence (the query sequence)
against a database of sequences.
• BLAST finds regions of local similarity between sequences, efficiently identifying
significant matches even among millions of sequences.
• It balances sensitivity and speed, making it an indispensable tool for researchers
looking to identify homologous sequences, infer function, and explore evolutionary
relationships.
Key Features and Components of BLAST (1/2)
• Query and Database: BLAST allows a user to input a query sequence (DNA, RNA, or
protein) and compares it against a database of sequences. The databases can be specific to
certain types of sequences, such as nucleotides (nt) or proteins (nr), and can include genomic
data from various organisms.
• Algorithm: At its core, BLAST uses a heuristic search algorithm that quickly identifies high-
scoring sequence alignments, prioritizing speed without significantly compromising the
accuracy of the results. Unlike exhaustive search algorithms like Smith-Waterman, BLAST
searches for short matches between the query sequence and database sequences (words) and
uses these matches as seeds to initiate local alignments.
Key Features and Components of BLAST (2/2)
Key Terms and Concepts (3/4)
• Gap Penalties
Scores subtracted for introducing gaps (insertions or deletions) in the alignment. Gap
penalties discourage the algorithm from introducing too many gaps, ensuring that the
alignments reflect biologically plausible relationships.
• Substitution Matrix
A table used in protein alignments that scores amino acid substitutions. Common
matrices include BLOSUM62 and PAM250, which reflect the biological likelihood of
one amino acid replacing another over evolutionary time.
Key Terms and Concepts (4/4)
• Filter Low Complexity Regions
A feature in BLAST that masks regions of the sequence that are simple or repetitive
and likely to produce spurious or misleading alignments. Filtering these regions helps
focus the search on more biologically meaningful similarities.
• Hit
A sequence in the database that shows significant similarity to the query sequence.
Hits are ranked by BLAST based on their scores, E-values, and other metrics to help
users identify the most relevant matches.
How BLAST Works
• Identifying Seeds: BLAST begins by identifying short segments of the query sequence that
match sequences in the database with a minimum threshold of similarity. These short matches
are called "seeds."
• Query Extension: For each seed, BLAST extends the alignment in both directions to find the
best possible local alignment that includes the seed, stopping when the alignment score begins
to decrease (indicating that the optimal alignment boundary has been reached).
• Ranking and Reporting: The resulting alignments are then scored, ranked, and reported to
the user. The alignments are typically filtered and sorted based on their scores, E-values, and
bit scores to present the most biologically relevant results first.
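The seed-and-extend idea can be illustrated with a toy sketch. It is heavily simplified and is not the BLAST implementation: seeds are exact word matches only, extension is ungapped, and no E-values are computed. All function names, the word size, the scores, and the `x_drop` stopping rule (stop once the running score falls `x_drop` below the best score seen) are assumptions made for illustration.

```python
def build_word_index(query, word_size=3):
    """Lookup table: each word in the query -> list of its start positions."""
    index = {}
    for pos in range(len(query) - word_size + 1):
        index.setdefault(query[pos:pos + word_size], []).append(pos)
    return index


def find_seeds(query, subject, word_size=3):
    """Scan the subject for words that also occur in the query (the seeds)."""
    index = build_word_index(query, word_size)
    seeds = []
    for s_pos in range(len(subject) - word_size + 1):
        for q_pos in index.get(subject[s_pos:s_pos + word_size], []):
            seeds.append((q_pos, s_pos))
    return seeds


def _best_extension(pairs, match, mismatch, x_drop):
    """Walk along residue pairs; return (best score gain, extension length)."""
    best_gain = best_len = gain = 0
    for length, (q, s) in enumerate(pairs, start=1):
        gain += match if q == s else mismatch
        if gain > best_gain:
            best_gain, best_len = gain, length
        elif best_gain - gain >= x_drop:
            break  # the score has dropped too far; stop extending
    return best_gain, best_len


def extend_seed(query, subject, q_pos, s_pos, word_size=3,
                match=1, mismatch=-1, x_drop=2):
    """Ungapped extension of one seed in both directions."""
    right = zip(query[q_pos + word_size:], subject[s_pos + word_size:])
    left = zip(reversed(query[:q_pos]), reversed(subject[:s_pos]))
    right_gain, right_len = _best_extension(right, match, mismatch, x_drop)
    left_gain, left_len = _best_extension(left, match, mismatch, x_drop)
    score = word_size * match + left_gain + right_gain
    q_hit = query[q_pos - left_len:q_pos + word_size + right_len]
    s_hit = subject[s_pos - left_len:s_pos + word_size + right_len]
    return score, q_hit, s_hit


query, subject = "GGACGTCC", "TTACGTAA"
for q_pos, s_pos in find_seeds(query, subject):
    print((q_pos, s_pos), extend_seed(query, subject, q_pos, s_pos))
```

Because only positions near a seed are examined, most of the query-by-database matrix is never filled, which is where the speed advantage over exhaustive dynamic programming comes from.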
The Core Steps of the BLAST Algorithm
• Building a Lookup Table for the Query Sequence
• Scanning the Database for Matches to the Query Words
• Extending the Seeds
• Evaluating the Significance of the Alignments
• Reporting the Results
Interpretation of BLAST Results
• The output typically includes a list of hits, each with associated metrics like E-value
and bit score, which indicate the strength and significance of the alignment.
• Additionally, the results provide detailed alignments, showing exactly how the query
sequence matches up with each hit, including any gaps or mismatches.
Q&A
Thank you!