
CE6068 Lecture 3

The document discusses DNA, genes, genomes, chromosomes and their roles. It provides definitions and examples to explain key concepts in molecular biology and genomics. The document is intended as a lesson to teach students about fundamental biological concepts.

Uploaded by

林采玟

CE6068

Bioinformatics and Computational Molecular


Sequence Alignment Fundamentals
Chia-Ru Chung
Department of Computer Science and Information Engineering
National Central University
2024/3/20
Outline

• Quick Review
• Introduction to Sequence Alignment
• Methods for Sequence Alignment

Quick Review
DNA Nucleotides: Nitrogenous Bases (含氮鹼基)
• There are 4 different nitrogenous bases: adenine (A), cytosine (C), guanine (G), and thymine (T).

Ref. Figure 1.6 in Algorithms in Bioinformatics: A Practical Introduction by Wing-Kin Sung

• C and T have 1 ring, and are called pyrimidines (嘧啶)


• A and G have 2 rings, and are called purines (嘌呤)

Applications of DNA Arrays
• Sequencing by hybridization (雜交)
‐ A promising alternative to sequencing by gel electrophoresis (膠體電泳)
‐ It may be able to reconstruct longer DNA sequences in shorter time
• Expression profile of a cell
‐ DNA arrays allow us to monitor the activities within a cell
‐ Each spot contains the complement of a particular gene
‐ Due to hybridization, we can measure the concentration of different mRNAs within a cell
• SNP detection
‐ Probes carrying different alleles are used to detect single-nucleotide variations.
Microarray
• Microarray technology has become one of the indispensable tools that many
biologists use to monitor genome-wide expression levels of genes (actually the
mRNAs) in a given organism.
‐ Expression level is estimated by measuring the amount of mRNA for a particular gene.
• A microarray is typically a glass slide onto which DNA molecules are fixed in an
orderly manner at specific locations called spots (or features).
• DNA microarrays, also known as DNA chips or biochips, are a technology used to
measure the expression levels of thousands of genes simultaneously or to genotype
multiple regions of a genome.
The Principle of Microarray
• Microarrays can be used to measure
gene expression in many ways, but one
of the most popular applications is to
compare expression of a set of genes
from a cell maintained in a particular
condition (condition A) to the same set
of genes from a reference cell
maintained under normal conditions
(condition B).
20240313 Exercise #13
• What is the purpose of DNA microarrays?
(A) To sequence DNA (B) To visualize DNA
(C) To replicate DNA (D) To monitor gene expression

Answer: (D). DNA microarrays, also known as DNA chips or biochips, are a technology used to
measure the expression levels of thousands of genes simultaneously or to genotype
multiple regions of a genome.
20240313 Exercise #16
• What is the role of the ribosome in protein synthesis?
(A) Synthesizing DNA (B) Folding proteins into their three-dimensional structures
(C) Translating mRNA into protein (D) Transcribing DNA into RNA

‣ Answer: (C). The ribosome translates mRNA into protein; the points that follow explain why (B) is not correct.
‣ Protein folding is influenced by the sequence of amino acids in the polypeptide chain,
which dictates the chemical and physical interactions that determine the final shape of
the protein.
‣ While the information for folding is inherent in the polypeptide sequence synthesized
by the ribosome, the ribosome itself does not directly fold the proteins.
‣ Protein folding can occur spontaneously as the polypeptide chain is being synthesized
and emerges from the ribosome.
20240313 Exercise #17
• What is the main difference between RNA and DNA nucleotides?
(A) RNA nucleotides include the sugar ribose, while DNA includes deoxyribose
(B) DNA nucleotides include uracil, while RNA includes thymine
(C) RNA nucleotides are only composed of adenine and guanine
(D) DNA nucleotides can form double-stranded structures, while RNA cannot
Answer: (A). Regarding (D), it is not entirely accurate to say that RNA cannot form double-stranded structures. While DNA is known for
its stable double-helical structure, RNA can also form double-stranded regions through base pairing within
a single molecule (forming hairpin structures) or between two RNA molecules. However, these RNA
double-stranded regions are typically shorter and more transient compared to the long, stable double helix
of DNA.
Chromosomes (染色體)
• A chromosome is a long DNA molecule coiled around
proteins called histones (組織蛋白), forming a structure that
organizes and condenses DNA to fit within the cell's
nucleus.
• Chromosomes ensure DNA is accurately replicated and
distributed in the process of cell division. Humans, for
example, have 23 pairs of chromosomes in each cell, for
a total of 46.
Ref. https://my.clevelandclinic.org/health/body/23064-dna-genes--chromosomes
Figure labels: centromere (中節), histones (組織蛋白), nucleosome (核小體), chromatid (染色單體).
Ref. https://www.britannica.com/science/tumor-necrosis-factor
Ref. https://my.clevelandclinic.org/health/body/23064-dna-genes--chromosomes
Genome
• The genome of an organism is the complete set of genetic material, including all of its
genes and noncoding sequences, contained within the chromosomes.
• The genome encompasses not only the coding regions that specify proteins and
functional RNAs but also regulatory sequences, introns, and intergenic regions.
• It represents the entire blueprint for the organism's development, physiology, behavior,
and evolution.

Gene
Ref. https://microbenotes.com/genes-and-loci-a-complete-guide/

• A gene is a specific segment or sequence of DNA located within the genome that
codes for a particular product, typically a protein or a functional RNA molecule, such
as ribosomal RNA (rRNA) or transfer RNA (tRNA).
• Genes are the basic physical and functional units of heredity and are the instructions
that dictate how an organism is built and how it operates.
• Each gene has a specific location (locus, 基因座) on a chromosome, one of the structures that
organize DNA within the nucleus.

Genome and Gene (1/2)
• The genome is like a library that contains numerous books (genes), each with specific
instructions for building various components of the organism or for carrying out specific
functions. In this analogy, if the genome is the entire library, genes are individual books
within that library.
• The function and expression of genes are determined by their sequences within the context of
the entire genome. Genes do not work in isolation; their activity can be regulated by other
genes or by non-coding regions of the genome. The genome's structure, including how genes
are arranged and interact with each other and with non-coding DNA, influences the
organism's development, traits, and behavior.
Genome and Gene (2/2)
• While all cells in an organism (except for gametes [配子] and some immune cells) contain the same
genome, different genes are expressed in different cell types and at different times. This
selective gene expression allows for the diverse range of cell types and functions within an
organism, all originating from the same genomic blueprint.
• Genomes evolve over time through mutations, gene duplications, and other genomic
rearrangements. Genes within the genome can be gained, lost, or modified, leading to
evolutionary changes in the organism. The study of genomes (genomics) thus provides
insights into the evolutionary history and relationships among species.

Genes and Chromosomes
• Every gene is located on a chromosome.
• Chromosomes serve as the structural foundation for organizing and segregating DNA
in a way that ensures accurate gene expression and DNA replication.
• The specific location of a gene on a chromosome is referred to as its locus.
• Each gene's locus is fixed, meaning that genes are found in the same location on
chromosomes across the individuals of a species.

Ref. https://quizlet.com/398226811/year-9-genes-and-chromosomes-diagram/
DNA Sequences and Chromosomal Location
• DNA sequences, which include both coding sequences (genes) and non-coding
sequences, are located in specific regions of chromosomes.
• The entire DNA sequence of a chromosome includes vast stretches of non-coding
DNA that play various roles, such as regulation of gene expression, structuring of the
chromosome, and protection of the chromosome ends (telomeres, 端粒), in addition to the
coding regions (genes) that are transcribed into RNA.

Ref. https://www.toolsbiotech.com/news_detail.php?id=157
Functional Implications
• The location of a gene on a chromosome can have functional implications.
‐ Gene Regulation: The expression of genes can be influenced by nearby regulatory sequences and
the chromatin state (how tightly DNA is packed) of their chromosomal region.
‐ Genetic Linkage: Genes that are located close to each other on the same chromosome tend to be
inherited together. This principle of genetic linkage is used in genetic mapping to study the
inheritance patterns of traits.
‐ Chromosomal Rearrangements: Sometimes, chromosomal rearrangements (such as translocations,
deletions, or duplications) can affect gene function by moving a gene to a new chromosomal
environment, altering gene expression, or disrupting gene structure.
Introduction to Sequence Alignment
What is Sequence Alignment (1/2)
• Sequence alignment is a method used to arrange DNA, RNA, or protein sequences to
identify regions of similarity that may indicate functional, structural, or evolutionary
relationships.
• It involves the comparison of sequences to find a series of individual characters or
patterns that are in the same order in the sequences but not necessarily contiguous.
• This comparison often requires the introduction of gaps in one or more of the
sequences to optimize the alignment, with the objective of maximizing the number of
matches and minimizing the number of mismatches and gaps.
What is Sequence Alignment (2/2)
• The core questions sequence alignment aims to answer are:
‐ How can two or more biological sequences be optimally aligned to reveal regions
of similarity and difference, and what do these regions tell us about the biological
function and evolutionary history of these sequences?
‐ How can we identify regions of similarity between two or more biological
sequences, and what does this similarity imply about their structural, functional, or
evolutionary relationships?
Definition
• An alignment of two sequences is formed by inserting spaces in arbitrary locations
along the sequences so that they end up with the same length and there are no two
spaces at the same position of the two augmented sequences.
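
The definition above can be checked mechanically. The following is a minimal sketch, assuming gaps are written as "-"; the function name `is_valid_alignment` is hypothetical and chosen only for illustration.

```python
def is_valid_alignment(row_a: str, row_b: str, seq_a: str, seq_b: str,
                       gap: str = "-") -> bool:
    """Check the conditions in the definition of a pairwise alignment."""
    # 1. Both augmented sequences must end up with the same length.
    if len(row_a) != len(row_b):
        return False
    # 2. No position may hold a space (gap) in both augmented sequences.
    if any(x == gap and y == gap for x, y in zip(row_a, row_b)):
        return False
    # 3. Removing the inserted spaces must give back the original sequences.
    return row_a.replace(gap, "") == seq_a and row_b.replace(gap, "") == seq_b


print(is_valid_alignment("AGCA-TGC", "A-CAATCC", "AGCATGC", "ACAATCC"))
print(is_valid_alignment("A-C", "A-G", "AC", "AG"))
```

The second call fails the check because both rows carry a gap in the same column.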

The Core Question of Sequence Alignment
• Structural Similarity:
If two sequences can be aligned closely, does this imply a structural similarity, suggesting that
they fold in similar ways or have similar molecular shapes?
• Functional Similarity:
Does sequence similarity imply that the sequences perform similar functions in the cell, such
as enzymatic activity, DNA binding, or signaling?
• Evolutionary Relatedness:
Do similarities in sequences suggest a common evolutionary ancestor, and can differences
help us infer evolutionary distances between species?
Why Sequence Alignment
• Gene and Protein Function Prediction: By aligning an unknown sequence with known
sequences, we can infer the function of genes or proteins based on similarity to those with
known functions.
• Understanding Evolutionary Relationships: Sequence alignments can reveal how species
are related and how certain genes or proteins have evolved over time. This helps in
constructing phylogenetic trees and understanding evolutionary mechanisms.
• Identifying Disease-Causing Mutations: By comparing diseased and normal sequences,
alignments can highlight mutations responsible for diseases, aiding in diagnostics and
treatment strategies.
• Drug Discovery and Design: Sequence alignment can identify targets for drug binding or
modification, helping in the design of drugs with better efficacy and lower toxicity.
• Conserved Regions and Regulatory Elements Identification: In genomics, alignments help
identify conserved DNA sequences that may play crucial roles in gene regulation,
development, and cellular processes.
Interpretation of Sequence Alignment
• The interpretation of sequence alignment involves analyzing the arrangement of two
or more biological sequences—DNA, RNA, or proteins—to identify regions of
similarity and difference.
• These alignments are pivotal for drawing conclusions about the functional, structural,
and evolutionary relationships among the sequences in question.

Similarity and Homology
• Similarity refers to the degree to which nucleotide or amino acid residues match
between sequences in the alignment. High similarity often suggests a related function
or origin.
• Homology (同源性), a term derived from evolutionary biology, indicates that sequences share a
common ancestry. In bioinformatics, sequence similarity is used as a basis to infer
homology, though it's important to note that high similarity does not automatically
imply homology without additional evolutionary evidence.

Matches, Mismatches, and Gaps
• Matches occur when identical residues are aligned, suggesting regions of
conservation that are often crucial for maintaining the structural integrity or functional
activity of the molecule.
• Mismatches represent divergent residues and can indicate points of evolutionary
change, mutation, or functional diversification.
• Gaps are introduced into alignments to maximize similarity, representing insertions or
deletions (indels) that have occurred since the last common ancestor. The placement
and length of gaps can provide insights into evolutionary events and functional
differences.
Ref. https://www.researchgate.net/figure/Example-of-match-mismatch-and-gap_fig1_347177164
Scoring Matrices and Alignment Scores
• The alignment score is a quantitative measure of how well the sequences align, based
on a scoring matrix that assigns values to matches, mismatches, and gaps. Higher scores
generally indicate a better alignment and potentially more significant biological
relationships.
• Scoring matrices, such as PAM (Point Accepted Mutation) or BLOSUM (BLOcks SUbstitution Matrix)
for proteins, are used to calculate these
scores. These matrices are derived from empirical data on the frequency of
substitutions between amino acids in related proteins, helping to distinguish between
more likely (conservative) and less likely (radical) substitutions.
Conservation and Variability
• Conserved regions within alignments are stretches of high similarity across all
sequences being compared. These often correlate with important functional or
structural elements, such as active sites in enzymes or binding domains in proteins.
• Variable regions show more divergence and may indicate areas where evolutionary
changes have provided adaptive advantages or where different functions have evolved.

Functional and Evolutionary Insights
• Sequence alignment can reveal the functional relationships between sequences by
highlighting conserved motifs or domains that are critical for biological activity.
• From an evolutionary perspective, alignments can be used to construct phylogenetic
trees, illustrating the evolutionary distances and relationships between sequences.
These trees help trace the lineage and divergence of genes or proteins across different
species.

Key Terminologies (1/8)
• Query Sequence
‐ The sequence that is being searched against a database or compared to other sequences.
‐ Example: A DNA sequence obtained from a new species being compared to a database of
known sequences to find matches.
• Target Sequence
‐ The sequence(s) against which the query sequence is compared.
‐ Example: Known sequences in a database that are compared with a query sequence to
identify similarities.
Key Terminologies (2/8)
• Matches
‐ In an alignment, matches occur when identical residues (nucleotides or amino acids) are
aligned in both sequences.
‐ Example: In the alignment of two sequences, ACGT and ACGG, the first three nucleotides
(ACG) are matches.
• Mismatches
‐ Aligned residues that are different between sequences, indicating variation.
‐ Example: In aligning ACGT and ACGG, the last nucleotide is a mismatch (T in the first
sequence and G in the second).
Key Terminologies (3/8)

• Gaps
‐ Spaces introduced into sequences during alignment to optimize similarity. They
represent insertions or deletions.
‐ Example: Aligning ACTG and ACG may result in ACTG over AC-G (with a gap in the second
sequence) to indicate a deletion or insertion event.
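
As a small illustration of matches, mismatches, and gaps, the sketch below counts each kind of column in an already-aligned pair. It assumes gaps are written as "-"; the helper name `column_counts` is hypothetical.

```python
def column_counts(row_a: str, row_b: str, gap: str = "-") -> dict:
    """Count match, mismatch, and gap columns of an aligned sequence pair."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    counts = {"match": 0, "mismatch": 0, "gap": 0}
    for x, y in zip(row_a, row_b):
        if x == gap or y == gap:
            counts["gap"] += 1        # insertion or deletion (indel)
        elif x == y:
            counts["match"] += 1      # identical residues aligned
        else:
            counts["mismatch"] += 1   # different residues aligned
    return counts


print(column_counts("ACGT", "ACGG"))  # three matches, one mismatch
print(column_counts("ACTG", "AC-G"))  # three matches, one gap
```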

Key Terminologies (4/8)

• Scoring Matrices
‐ Mathematical matrices used to score alignments, rewarding matches and penalizing
mismatches and gaps.
‐ Example: BLOSUM62 is a scoring matrix often used for protein alignments, where
each cell value represents the score for aligning two amino acids.

Key Terminologies (5/8)

• Consensus Sequence (共有序列)
‐ A sequence derived from an alignment that represents the most common residue
found at each position. It reflects the conserved regions across the aligned
sequences.
‐ Example: If three sequences align as ACGT, ACGG, and ACCG, the consensus
might be ACGG, indicating the most common residues at each position.
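
The consensus example can be reproduced with a simple majority vote per column. This is a minimal sketch, assuming the sequences are already aligned to equal length; the function name `consensus` is hypothetical, and ties are resolved by whichever residue appears first in the column, which is an arbitrary choice.

```python
from collections import Counter


def consensus(aligned: list[str]) -> str:
    """Return the most common residue at each position of aligned sequences."""
    if len({len(s) for s in aligned}) != 1:
        raise ValueError("need at least one sequence, all of equal length")
    result = []
    for column in zip(*aligned):
        # most_common(1) yields the most frequent residue in this column.
        residue, _count = Counter(column).most_common(1)[0]
        result.append(residue)
    return "".join(result)


print(consensus(["ACGT", "ACGG", "ACCG"]))  # ACGG
```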

Key Terminologies (6/8)
• Global Alignment
‐ An alignment strategy that aligns entire sequences from beginning to end, optimizing for the best
possible match across the whole length.
‐ Example: The Needleman-Wunsch algorithm is used for global alignment, suitable for sequences of
similar length.
• Local Alignment
‐ An alignment strategy that finds the best matching subsequence(s) between sequences, allowing for
the alignment of shorter regions that are highly similar within longer sequences.
‐ Example: The Smith-Waterman algorithm is used for local alignment, useful for identifying
functional domains within genes or proteins.
Ref. https://microbenotes.com/local-global-multiple-sequence-alignment/
Key Terminologies (7/8)
• Semi-global Alignment
‐ An alignment strategy that aligns the entirety of one sequence with a segment of another,
useful for comparing sequences of different lengths.
‐ Example: Aligning a full gene sequence with a partial sequence found in a genomic
database.
• Multiple Sequence Alignment
‐ The alignment of three or more sequences, aiming to find a global alignment that
maximizes the match across all sequences.
‐ Example: Using ClustalW to align sequences from different species to identify conserved
evolutionary motifs.
Key Terminologies (8/8)
• Homology
‐ A term indicating that sequences share a common ancestry. Homologous sequences can
have similar structure, function, or genetics due to being derived from a common ancestor.
‐ Example: Cytochrome c (細胞色素 c) in humans and yeast (酵母菌); despite significant evolutionary distance, the
protein retains a conserved sequence and performs similar functions, consistent with descent from a common ancestor.
• Phylogenetic Tree (親緣關係樹)
‐ A diagram that represents evolutionary relationships among various biological species or
entities based upon similarities and differences in their genetic or physical characteristics.
‐ Example: A tree showing the evolutionary relationships between various species of birds
based on DNA sequence alignment.
Global Alignment
• Goal: To align the entire length of both sequences from start to end, maximizing the
number of matches and minimizing mismatches and gaps across the whole sequence.
It's particularly useful for comparing sequences that are suspected to be highly similar
across their entire length.
• Global alignment is distinct in its attempt to align every part of both sequences,
making it ideal for sequences of similar length and overall similarity but less suitable
for sequences with highly variable regions or significant length differences.

Local Alignment
• Goal: To find the highest scoring alignment for any subsequence within the given
sequences. This means finding regions of high similarity without concern for the
alignment of the sequences outside these regions. It's used when the aim is to identify
regions of conservation or functional significance within larger, perhaps only partially
related, sequences.
• Unlike global alignment, local alignment searches for the best matching segment
between sequences, making it ideal for uncovering functional motifs or domains
within otherwise dissimilar sequences.
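
The idea of scoring only the best matching segment can be sketched with the Smith-Waterman recurrence, which differs from global alignment by never letting a cell drop below zero and by taking the best cell anywhere in the matrix. The scoring values (+2, -1, -1) and the function name `local_alignment_score` are assumptions for illustration, and only the score is computed, not the aligned segment itself.

```python
def local_alignment_score(a: str, b: str, match: int = 2,
                          mismatch: int = -1, gap: int = -1) -> int:
    """Score of the best local alignment (Smith-Waterman style, score only)."""
    prev = [0] * (len(b) + 1)          # row 0: a local alignment may start anywhere
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)      # column 0 is also 0
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            # The extra 0 lets the alignment restart instead of going negative.
            curr[j] = max(0, prev[j - 1] + sub, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])  # the best segment may end anywhere
        prev = curr
    return best


# Only the shared segment "ACG" contributes: 3 matches x 2 = 6.
print(local_alignment_score("TTTACGTTT", "GGGACGGGG"))
```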
Semi-global Alignment
• Goal: To align one sequence completely while aligning a significant portion of the
other sequence. This type is used when aligning sequences to reference genomes or
when one sequence is a complete gene and the other is a partial sequence or contains
extra regions (e.g., introns in genomic vs. cDNA).
• Semi-global alignment combines aspects of both global and local alignments. It's
similar to global alignment in trying to align entire sequences but allows for
overhangs like in local alignments, accommodating sequences of different lengths
without penalizing the overhanging ends.
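
One common way to realize "overhangs without penalty" is to leave the first row at zero and to take the best value in the last row, so that unaligned ends of the longer sequence cost nothing. The sketch below follows that variant; the names `semiglobal_score`, `query`, and `reference` and the scoring values are assumptions, and other semi-global variants free different ends.

```python
def semiglobal_score(query: str, reference: str, match: int = 2,
                     mismatch: int = -1, gap: int = -1) -> int:
    """Align all of `query` against any segment of `reference` (score only)."""
    # First row stays 0: skipping a prefix of the reference is free.
    prev = [0] * (len(reference) + 1)
    for i in range(1, len(query) + 1):
        # First column is penalized: the query must be aligned completely.
        curr = [i * gap] + [0] * len(reference)
        for j in range(1, len(reference) + 1):
            sub = match if query[i - 1] == reference[j - 1] else mismatch
            curr[j] = max(prev[j - 1] + sub, prev[j] + gap, curr[j - 1] + gap)
        prev = curr
    # Best value in the last row: skipping a suffix of the reference is free.
    return max(prev)


print(semiglobal_score("ACG", "TTTACGTTT"))  # 6, the overhanging T's cost nothing
```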
Methods for Sequence Alignment
Methods for Sequence Alignment
• Dot matrices
• Dynamic programming
‐ Global alignment
‐ Local alignment
• BLAST heuristic approach

Dot Matrix Method
• The dot matrix method is a graphical approach
used to visualize sequence similarity between
two biological sequences (DNA, RNA, or
proteins).
• It's one of the simplest forms of sequence
comparison and does not directly involve a
scoring system like dynamic programming
algorithms.
Ref. https://microbenotes.com/local-global-multiple-sequence-alignment/
Basic Terminologies in Dot Matrices (1/2)
• Window Size: Refers to the length of the sequence segment considered for matching at any
point. A larger window size smooths out the plot, making it easier to identify longer regions
of similarity but potentially obscuring smaller details.
• Threshold: The minimum level of similarity that must be met within the window to place a
dot in the matrix. Adjusting the threshold can help highlight more significant matches or
reduce noise from random matches.
• Diagonal Line: In a dot matrix, a continuous diagonal line indicates a region of consecutive
matches between the sequences. Such lines represent similarity or identical sequences.

Basic Terminologies in Dot Matrices (2/2)
• Gaps and Mismatches: Absences of dots along a diagonal line suggest gaps (insertions or deletions)
or mismatches (substitutions) in the alignment of the two sequences.
• Repeats and Inversions:
‐ Repeats are indicated by parallel diagonal lines, showing that a sequence segment appears more than once
in either or both sequences.
‐ Inversions or reverse complements (particularly relevant in DNA sequences) appear as diagonal lines
running perpendicular (in the opposite direction) to the main diagonal. These indicate that a sequence in
one strand is the reverse complement of a sequence in the other strand.
• Noise: Random dots scattered throughout the matrix that do not form part of a significant pattern.
Noise results from chance matches between unrelated positions, which are especially frequent when the
alphabet is small (as with the four DNA bases).
Applying Dot Matrices
• Dot matrices are particularly useful for quickly identifying sequence features without
the need for complex computation. By visually scanning the plot, researchers can
identify:
‐ Regions of Alignment: Indicated by straight, diagonal lines.
‐ Structural Features: Such as repeats or palindromic sequences, highlighted by parallel lines
or lines perpendicular to the main diagonal.
‐ Evolutionary Relationships: By comparing sequences from different species or genes,
researchers can infer evolutionary conservation or divergence.
Dot Matrix Method – How It Works
• A two-dimensional matrix is created, with one sequence placed along the x-axis and
the other along the y-axis.
• A dot is placed at the intersection (i, j) in the matrix if the nucleotide or amino acid at
position i in one sequence matches the nucleotide or amino acid at position j in the
other sequence.
• The resulting pattern of dots can reveal similarity regions, repetitive sequences,
inversions, and other structural motifs.
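
The construction described above fits in a few lines. This is a minimal sketch that draws the plot as text, with "*" for a dot and "." for an empty cell; the function name `dot_matrix` and the choice of sequences are only for illustration.

```python
def dot_matrix(seq_x: str, seq_y: str) -> list[str]:
    """Row j, column i holds '*' when seq_x[i] matches seq_y[j]."""
    return ["".join("*" if x == y else "." for x in seq_x) for y in seq_y]


seq_x, seq_y = "AGCATGC", "ACAATCC"   # x-axis and y-axis sequences
print("  " + seq_x)
for residue, row in zip(seq_y, dot_matrix(seq_x, seq_y)):
    print(residue + " " + row)
```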

Adjusting Parameters
https://barryus.shinyapps.io/dotplot/
• The utility of a dot matrix can be significantly influenced by the choice of parameters
like window size and threshold.
• A smaller window size and lower threshold can highlight short regions of high
similarity but may introduce more noise into the plot.
• Conversely, larger window sizes and higher thresholds can help identify longer
regions of similarity but may overlook shorter, potentially significant, motifs.
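
The effect of these two parameters can be explored with a windowed variant of the dot plot. In this sketch, a dot is placed when at least `threshold` of the `window` residue pairs starting at a position match; this is one common filtering scheme, and the function name `windowed_dot_matrix` is hypothetical.

```python
def windowed_dot_matrix(seq_x: str, seq_y: str,
                        window: int = 3, threshold: int = 2) -> list[str]:
    """Dot at (i, j) when >= threshold of the window's aligned pairs match."""
    rows = []
    for j in range(len(seq_y) - window + 1):
        row = []
        for i in range(len(seq_x) - window + 1):
            # Compare the two windows position by position.
            matches = sum(seq_x[i + k] == seq_y[j + k] for k in range(window))
            row.append("*" if matches >= threshold else ".")
        rows.append("".join(row))
    return rows


for row in windowed_dot_matrix("AGCATGC", "ACAATCC", window=3, threshold=2):
    print(row)
```

With `window=1` and `threshold=1` this reduces to the plain dot matrix; raising either value removes isolated dots.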

Dynamic Programming
• Dynamic programming algorithms for sequence alignment can indeed be seen as
sophisticated extensions or evolutions of the basic concept introduced by dot matrices.
• Dynamic Programming Algorithms take the concept further by not only identifying regions of
similarity but also quantitatively evaluating the best possible alignment between sequences.
They do this by assigning scores to matches, mismatches, and gaps, and by using these scores
to systematically construct an optimal alignment path through a matrix.
• This approach is exemplified by the Needleman-Wunsch and Smith-Waterman algorithms,
which apply dynamic programming to find the optimal global and local alignments,
respectively.
Dynamic Programming in Sequence Alignment

• The essence of applying dynamic programming to sequence alignment lies in
constructing a matrix where the alignment is built progressively.
• Each cell in the matrix represents a subproblem, specifically the score of the best
alignment between the prefixes of the two sequences that end at that cell's row and column.
• By filling out this matrix based on previously computed values, dynamic
programming avoids redundant calculations, making the alignment process efficient
even for long sequences.

Needleman-Wunsch Algorithm (1/2)
• Specific Question:
How can we align two sequences in their entirety, from start to finish, to maximize
their alignment score based on matches, mismatches, and gaps?
• The Needleman-Wunsch algorithm uses dynamic programming to find the optimal
global alignment between two sequences.
• It systematically compares every character of one sequence with every character of
the other, considering the possibility of matches, mismatches, and gaps.

Ref. https://upload.wikimedia.org/wikipedia/commons/3/3f/Needleman-Wunsch_pairwise_sequence_alignment.png

Needleman-Wunsch Algorithm (2/2)


• The process involves creating a matrix where the first sequence is along the rows and the
second along the columns. Each cell in the matrix represents the score of the best alignment
between the prefixes of the two sequences ending at those positions. The score for each cell is the
maximum of:
‐ The score above plus the gap penalty,
‐ The score to the left plus the gap penalty,
‐ The diagonal score plus the match or mismatch score.
• The optimal alignment is then traced back from the bottom-right corner of the matrix to the
top-left, following the path that led to the maximal score.
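
Written as a recurrence, the three cases above take the following form. The symbols are assumptions introduced here for illustration: F(i, j) is the score of the cell in row i and column j, a_i and b_j are the residues of the first and second sequence, s(a_i, b_j) is the match or mismatch score, and g is the gap penalty (a negative number).

```latex
\begin{aligned}
F(0,0) &= 0, \qquad F(i,0) = i \cdot g, \qquad F(0,j) = j \cdot g, \\
F(i,j) &= \max
\begin{cases}
F(i-1,\,j-1) + s(a_i, b_j) & \text{(diagonal: match or mismatch)} \\
F(i-1,\,j) + g & \text{(above: gap in the second sequence)} \\
F(i,\,j-1) + g & \text{(left: gap in the first sequence)}
\end{cases}
\end{aligned}
```

The optimal global alignment score is F(m, n), where m and n are the lengths of the two sequences.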
Scoring Scheme
• Match Score: When the residues at positions [i] and [j] of the two sequences match, a
positive score is added. This score encourages alignments of identical or similar residues. For
example, a match might be scored as +1.
• Mismatch Penalty: If the residues do not match, a negative score is added. This penalty
discourages the alignment of different residues. For example, a mismatch might have a score
of -1.
• Gap Penalty: Introducing a gap incurs a negative score, penalizing the alignment for
discontinuity. In a linear gap penalty model, every gap position receives the same fixed value (e.g., -2),
so the total penalty grows in proportion to the gap's length.
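
To make the scheme concrete, the sketch below sums the score of an already-aligned pair column by column. The default values (+1, -1, -2) are the example values from this slide; the function name `alignment_score` and the use of "-" for gaps are assumptions.

```python
def alignment_score(row_a: str, row_b: str, match: int = 1,
                    mismatch: int = -1, gap: int = -2) -> int:
    """Sum match, mismatch, and linear gap scores over all alignment columns."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    score = 0
    for x, y in zip(row_a, row_b):
        if x == "-" or y == "-":
            score += gap        # each gap position costs the same amount
        elif x == y:
            score += match
        else:
            score += mismatch
    return score


print(alignment_score("ACGT", "ACGG"))  # 3 matches + 1 mismatch = 2
print(alignment_score("ACTG", "AC-G"))  # 3 matches + 1 gap = 1
```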
Needleman-Wunsch Algorithm Steps
• Initialization: Create a 2D matrix with dimensions (m+1) x (n+1), where m and n are the lengths of the two sequences.
Cell [0, 0] is set to 0, and the first row and column hold the cumulative penalty for introducing gaps up to that point:
cell [i, 0] is i times the gap penalty and cell [0, j] is j times the gap penalty.
• Matrix Filling: Fill in each cell [i, j] of the matrix based on the scores of:
‐ The cell directly above [i-1, j], plus the gap penalty (for a gap in the second sequence).
‐ The cell to the left [i, j-1], plus the gap penalty (for a gap in the first sequence).
‐ The diagonal cell [i-1, j-1], plus the match score if the residues are identical or the mismatch penalty if they are not.
‐ The score for cell [i, j] is the maximum of these three values.
• Traceback:
‐ Starting from the bottom-right cell [m,n], trace back to the top-left cell [0,0], choosing at each step the path that led to the current
cell's score.
‐ The path of the traceback determines the alignment, with diagonal moves indicating matches or mismatches, and horizontal or vertical
moves indicating gaps.
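
The three steps can be put together as follows. This is a minimal sketch rather than a reference implementation: the function name `needleman_wunsch` is hypothetical, the default scores (+1, -1, -2) are illustrative values, and when several paths give the same score the traceback prefers the diagonal move, which is an arbitrary tie-breaking choice.

```python
def needleman_wunsch(seq1: str, seq2: str, match: int = 1,
                     mismatch: int = -1, gap: int = -2):
    """Return (score, aligned seq1, aligned seq2) for a global alignment."""
    m, n = len(seq1), len(seq2)
    # Initialization: first row and column hold cumulative gap penalties.
    score = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        score[i][0] = i * gap
    for j in range(1, n + 1):
        score[0][j] = j * gap
    # Matrix filling: each cell is the maximum of its three predecessors.
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = match if seq1[i - 1] == seq2[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Traceback from [m, n] to [0, 0], re-deriving which move produced each cell.
    row1, row2 = [], []
    i, j = m, n
    while i > 0 or j > 0:
        diagonal_ok = False
        if i > 0 and j > 0:
            sub = match if seq1[i - 1] == seq2[j - 1] else mismatch
            diagonal_ok = score[i][j] == score[i - 1][j - 1] + sub
        if diagonal_ok:                                      # match or mismatch
            row1.append(seq1[i - 1]); row2.append(seq2[j - 1])
            i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:  # gap in seq2
            row1.append(seq1[i - 1]); row2.append("-")
            i -= 1
        else:                                                # gap in seq1
            row1.append("-"); row2.append(seq2[j - 1])
            j -= 1
    return score[m][n], "".join(reversed(row1)), "".join(reversed(row2))


print(needleman_wunsch("ACGT", "ACGG"))
```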

Example (1/4)
• Sequences:
‣ Sequence A (horizontal, columns): AGCATGC
‣ Sequence B (vertical, rows): ACAATCC
• Scoring system:
‣ +2 for a match
‣ -1 for a mismatch
‣ -1 for a gap
Example (2/4)
• Step 1: Initialization
‣ First, we create a matrix with Sequence A on the top and Sequence B on the side.
‣ The matrix dimensions will be (length of Sequence B + 1) x (length of Sequence A + 1).
‣ We then initialize the first row and column based on the gap penalty.

      -   A   G   C   A   T   G   C
  -   0  -1  -2  -3  -4  -5  -6  -7
  A  -1
  C  -2
  A  -3
  A  -4
  T  -5
  C  -6
  C  -7
Example (3/4)
• Step 2: Filling the Matrix
We fill in each cell using the scoring rules, choosing the highest score from:
‣ Diagonal cell + match/mismatch score
‣ Left cell + gap penalty
‣ Upper cell + gap penalty

For cell [1,1] (comparing A vs A):
Max(diagonal + match, left + gap, up + gap) = Max(0+2, -1-1, -1-1) = 2

For cell [1,2] (comparing G vs A):
Max(diagonal + mismatch, left + gap, up + gap) = Max(-1-1, 2-1, -2-1) = 1

Repeating this for every cell, row by row, gives the completed matrix:

       -    A    G    C    A    T    G    C
  -    0   -1   -2   -3   -4   -5   -6   -7
  A   -1    2    1    0   -1   -2   -3   -4
  C   -2    1    1    3    2    1    0   -1
  A   -3    0    0    2    5    4    3    2
  A   -4   -1   -1    1    4    4    3    2
  T   -5   -2   -2    0    3    6    5    4
  C   -6   -3   -3    0    2    5    5    7
  C   -7   -4   -4   -1    1    4    4    7
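The filling rule can be sketched in a few lines of Python. This is a minimal illustration, assuming the example's scoring (+2 match, -1 mismatch, -1 per gap position); the function name `nw_fill` and its parameter names are hypothetical and not part of the lecture.

```python
def nw_fill(seq_a, seq_b, match=2, mismatch=-1, gap=-1):
    """Return the Needleman-Wunsch score matrix as a list of rows.

    seq_a runs along the columns and seq_b along the rows, as in the example.
    """
    rows, cols = len(seq_b) + 1, len(seq_a) + 1
    F = [[0] * cols for _ in range(rows)]
    # First column and first row: cumulative gap penalties.
    for i in range(1, rows):
        F[i][0] = i * gap
    for j in range(1, cols):
        F[0][j] = j * gap
    # Every other cell: best of diagonal, left, and upper neighbours.
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if seq_b[i - 1] == seq_a[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s,   # diagonal: match or mismatch
                          F[i][j - 1] + gap,     # left: gap in seq_b
                          F[i - 1][j] + gap)     # up: gap in seq_a
    return F


matrix = nw_fill("AGCATGC", "ACAATCC")
for row in matrix:
    print(" ".join(f"{value:3d}" for value in row))
```

Printing the rows lets each cell be checked against the matrix in the example.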
Example (4/4)
• Step 3: Traceback
Starting from the bottom-right corner, we
trace back to the origin (top-left), choosing
paths based on the scoring decisions made
during matrix filling. The path indicates the
alignment.

Sequence A: AGCA-TGC
Sequence B: A-CAATCC
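
The traceback can be sketched as follows. This is an illustrative version, not the lecture's own code: the function name `needleman_wunsch` is hypothetical, the scoring is the example's (+2, -1, -1), and the tie-breaking order (up, then diagonal, then left) is an assumption. Several alignments can share the optimal score, and a different tie order may return a different one of them.

```python
def needleman_wunsch(seq_a, seq_b, match=2, mismatch=-1, gap=-1):
    """Global alignment; returns (score, aligned_a, aligned_b)."""
    def sub(x, y):
        return match if x == y else mismatch

    rows, cols = len(seq_b) + 1, len(seq_a) + 1
    F = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        F[i][0] = i * gap
    for j in range(1, cols):
        F[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            F[i][j] = max(F[i - 1][j - 1] + sub(seq_b[i - 1], seq_a[j - 1]),
                          F[i - 1][j] + gap,
                          F[i][j - 1] + gap)

    # Traceback from the bottom-right cell to the top-left cell.
    out_a, out_b = [], []
    i, j = rows - 1, cols - 1
    while i > 0 or j > 0:
        if i > 0 and F[i][j] == F[i - 1][j] + gap:
            # Vertical move: residue of seq_b aligned to a gap in seq_a.
            out_a.append("-")
            out_b.append(seq_b[i - 1])
            i -= 1
        elif (i > 0 and j > 0
              and F[i][j] == F[i - 1][j - 1] + sub(seq_b[i - 1], seq_a[j - 1])):
            # Diagonal move: match or mismatch.
            out_a.append(seq_a[j - 1])
            out_b.append(seq_b[i - 1])
            i -= 1
            j -= 1
        else:
            # Horizontal move: residue of seq_a aligned to a gap in seq_b.
            out_a.append(seq_a[j - 1])
            out_b.append("-")
            j -= 1
    return F[-1][-1], "".join(reversed(out_a)), "".join(reversed(out_b))


score, aligned_a, aligned_b = needleman_wunsch("AGCATGC", "ACAATCC")
print(score)
print(aligned_a)
print(aligned_b)
```

The alignment has 5 matches, 1 mismatch, and 2 gap positions, so its score is 5(2) - 1 - 2 = 7, the value in the bottom-right cell.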
Needleman-Wunsch Algorithm – Complexity
• Time Complexity:
The time complexity of the Needleman-Wunsch algorithm is O(mn), where m and n
are the lengths of the two sequences. This is because each cell in the (m+1) x (n+1) matrix
is filled in constant time from the scores of its three neighboring cells.
• Space Complexity:
Similarly, the space complexity is also O(mn) due to the need to store the entire
matrix for the traceback process. This can become a limitation for very long
sequences.
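
When only the optimal score is needed and not the alignment itself, the full matrix does not have to be kept. The sketch below is an assumption-labelled illustration that goes slightly beyond the slide: it keeps just two rows, so memory grows with the length of one sequence, at the cost of losing the traceback. The name `nw_score_two_rows` is hypothetical, and the scoring defaults are the ones from the worked example.

```python
def nw_score_two_rows(seq_a, seq_b, match=2, mismatch=-1, gap=-1):
    """Optimal global alignment score using two rows of the matrix."""
    # Row 0 of the matrix: cumulative gap penalties.
    prev = [j * gap for j in range(len(seq_a) + 1)]
    for i in range(1, len(seq_b) + 1):
        curr = [i * gap] + [0] * len(seq_a)
        for j in range(1, len(seq_a) + 1):
            s = match if seq_b[i - 1] == seq_a[j - 1] else mismatch
            curr[j] = max(prev[j - 1] + s,   # diagonal
                          prev[j] + gap,     # up
                          curr[j - 1] + gap) # left
        prev = curr  # the finished row becomes the "row above"
    return prev[-1]


print(nw_score_two_rows("AGCATGC", "ACAATCC"))
```

The running time is still O(mn); only the memory use changes.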
Smith-Waterman Algorithm
• The Smith-Waterman algorithm identifies the most similar fragment(s) or substrings
between two sequences, which is crucial for uncovering functional motifs, domains
within proteins or genes, and other significant local similarities.
• Unlike the Needleman-Wunsch algorithm, which aligns sequences in their entirety
(global alignment), the Smith-Waterman algorithm focuses on the highest-scoring
local alignments, providing a more nuanced view when the sequences have diverged
significantly or when only parts of the sequences are of interest.

Smith-Waterman Algorithm Steps
• Initialization: A scoring matrix is created, similar to the Needleman-Wunsch approach, with dimensions (m+1) x (n+1),
where m and n are the lengths of the two sequences. The first row and column are initialized to 0, which is a key
difference from the global alignment approach and facilitates the focus on local alignment.
• Matrix Filling: Each cell [i,j] in the matrix is calculated based on:
‐ The score from the cell directly above plus the gap penalty.
‐ The score from the cell directly to the left plus the gap penalty.
‐ The score from the diagonal cell plus a match score (if the residues match) or a mismatch penalty (if they do not).
‐ Zero, which ensures that negative scores are not propagated, emphasizing the algorithm's focus on local rather than global alignment.
This choice allows alignments to start and end anywhere within the matrix, optimizing for the highest-scoring local alignment.
‐ The optimal score at each cell is the maximum of these four options.
• Identifying the Highest-Scoring Subsequence:
‐ Unlike the global alignment process that always backtracks from the bottom-right corner, the Smith-Waterman algorithm starts the
traceback from the cell with the highest score in the entire matrix. This cell marks where the best-scoring local alignment ends.
‐ The traceback continues until a cell with a score of 0 is reached, which marks where this local alignment begins.
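
The three steps above can be sketched as follows. This is a minimal illustration under stated assumptions: the scoring (+2 match, -1 mismatch, -1 per gap position) is borrowed from the Needleman-Wunsch example, the function name `smith_waterman` is hypothetical, and when several cells share the highest score only the first one found is traced back.

```python
def smith_waterman(seq_a, seq_b, match=2, mismatch=-1, gap=-1):
    """Local alignment; returns (score, aligned_a, aligned_b)."""
    rows, cols = len(seq_b) + 1, len(seq_a) + 1
    H = [[0] * cols for _ in range(rows)]  # first row and column stay 0
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if seq_b[i - 1] == seq_a[j - 1] else mismatch
            H[i][j] = max(0,                    # start a new local alignment
                          H[i - 1][j - 1] + s,  # diagonal
                          H[i - 1][j] + gap,    # up
                          H[i][j - 1] + gap)    # left
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Traceback from the highest-scoring cell until a 0 is reached.
    out_a, out_b = [], []
    i, j = best_pos
    while H[i][j] > 0:
        s = match if seq_b[i - 1] == seq_a[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            out_a.append(seq_a[j - 1])
            out_b.append(seq_b[i - 1])
            i -= 1
            j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append("-")
            out_b.append(seq_b[i - 1])
            i -= 1
        else:
            out_a.append(seq_a[j - 1])
            out_b.append("-")
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


# The shared fragment ACG is found even though the flanks do not align.
print(smith_waterman("TTACGTT", "GGACGGG"))
```

Because the borders are 0 and no cell can be negative, the loop needs no bounds check: a positive cell is never on the first row or column.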
Smith-Waterman Algorithm – Complexity
• Time Complexity:
The time complexity of the Smith-Waterman algorithm is O(mn), as it involves filling
a matrix of size m x n where m and n are the lengths of the two sequences.
• Space Complexity:
The space complexity is also O(mn) due to the need to store the entire matrix for the
traceback process. For long sequences, this can be computationally expensive.

Basic Local Alignment Search Tool
• BLAST (Basic Local Alignment Search Tool) is one of the most widely used
bioinformatics tools for comparing an input biological sequence (the query sequence)
against a database of sequences.
• BLAST finds regions of local similarity between sequences, efficiently identifying
significant matches even among millions of sequences.
• It balances sensitivity and speed, making it an indispensable tool for researchers
looking to identify homologous sequences, infer function, and explore evolutionary
relationships.
Key Features and Components of BLAST (1/2)

• Query and Database: BLAST allows a user to input a query sequence (DNA, RNA, or
protein) and compares it against a database of sequences. The databases can be specific to
certain types of sequences, such as nucleotides (nt) or proteins (nr), and can include genomic
data from various organisms.
• Algorithm: At its core, BLAST uses a heuristic search algorithm that quickly identifies high-
scoring sequence alignments, prioritizing speed without significantly compromising the
accuracy of the results. Unlike exhaustive search algorithms like Smith-Waterman, BLAST
searches for short matches (words) between the query sequence and database sequences and
uses these matches as seeds to initiate local alignments.
Key Features and Components of BLAST (2/2)

• Scoring System: BLAST employs a scoring system similar to dynamic programming
methods, assigning scores to matches and mismatches and penalizing gaps. It uses this
scoring system to evaluate the significance of the alignments found.
• E-value: One of the key outputs of a BLAST search is the Expectation value (E-value),
which quantifies the number of matches one can expect to find by chance when searching a
database of a particular size. Lower E-values indicate more significant alignments.
• Bit Score: Another important metric provided by BLAST, the bit score, normalizes the
alignment score against the statistical properties of the scoring system used. This allows for
the comparison of scores across different searches and databases.
Key Terms and Concepts (1/4)
• E-value (Expectation Value)
A statistical measure that describes the number of hits one can expect to find by chance when
searching a database of a particular size. The E-value helps assess the significance of an
alignment; lower E-values indicate more statistically significant matches.
• Bit Score
A normalized score that indicates the quality of the alignment independent of the database
size and the query sequence length. Higher bit scores represent more favorable alignments.
• Identity
The percentage of identical matches between the query sequence and the database sequence in
the alignment. High identity often suggests a close relationship between the sequences.
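
Percent identity can be computed directly from a pair of aligned strings. The sketch below is a minimal illustration; the function name `percent_identity` is hypothetical, and dividing by the full alignment length (gap columns included) is an assumption about the convention, since other tools divide by other lengths.

```python
def percent_identity(aligned_query, aligned_subject):
    """Percentage of alignment columns holding the same residue in both rows."""
    if len(aligned_query) != len(aligned_subject):
        raise ValueError("aligned sequences must have the same length")
    columns = len(aligned_query)
    if columns == 0:
        return 0.0
    # A column counts as identical only if both residues match and neither is a gap.
    identical = sum(1 for q, s in zip(aligned_query, aligned_subject)
                    if q == s and q != "-")
    return 100.0 * identical / columns


# The global alignment from the Needleman-Wunsch example: 5 identical columns out of 8.
print(percent_identity("AGCA-TGC", "A-CAATCC"))  # 62.5
```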
Key Terms and Concepts (2/4)
• Coverage
Refers to the extent to which the query sequence and the database sequence align with each
other. High coverage means a larger portion of the sequences is included in the alignment,
which can be crucial for determining functional or evolutionary relationships.
• Seed
A short initial match between the query and database sequences that BLAST uses to initiate
the search for longer, more significant alignments. Seeds are extended in both directions to
find the best possible local alignment.

Key Terms and Concepts (3/4)
• Gap Penalties
Scores subtracted for introducing gaps (insertions or deletions) in the alignment. Gap
penalties discourage the algorithm from introducing too many gaps, ensuring that the
alignments reflect biologically plausible relationships.
• Substitution Matrix
A table used in protein alignments that scores amino acid substitutions. Common
matrices include BLOSUM62 and PAM250, which reflect the biological likelihood of
one amino acid replacing another over evolutionary time.
Key Terms and Concepts (4/4)
• Filter Low Complexity Regions
A feature in BLAST that masks regions of the sequence that are simple or repetitive
and likely to produce spurious or misleading alignments. Filtering these regions helps
focus the search on more biologically meaningful similarities.
• Hit
A sequence in the database that shows significant similarity to the query sequence.
Hits are ranked by BLAST based on their scores, E-values, and other metrics to help
users identify the most relevant matches.
How BLAST Works
• Identifying Seeds: BLAST begins by identifying short segments of the query sequence that
match sequences in the database with a minimum threshold of similarity. These short matches
are called "seeds."
• Query Extension: For each seed, BLAST extends the alignment in both directions to find the
best possible local alignment that includes the seed, stopping when the alignment score falls
a set amount below the best score seen so far (indicating that further extension is unlikely to
improve the alignment).
• Ranking and Reporting: The resulting alignments are then scored, ranked, and reported to
the user. The alignments are typically filtered and sorted based on their scores, E-values, and
bit scores to present the most biologically relevant results first.
The Core Steps of the BLAST Algorithm
• Building a Lookup Table for the Query Sequence
• Scanning the Database for Matches to the Query Words
• Extending the Seeds
• Evaluating the Significance of the Alignments
• Reporting the Results
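
The first three steps can be sketched as a toy seed-and-extend search. This is a deliberately simplified illustration, not BLAST's actual implementation: it uses exact word matches only, extends without gaps, and skips the significance step. All function names, the word size `w`, the scores, and the `x_drop` cutoff are assumptions chosen for the example.

```python
from collections import defaultdict


def build_lookup(query, w):
    """Map every word of length w in the query to its start positions."""
    table = defaultdict(list)
    for i in range(len(query) - w + 1):
        table[query[i:i + w]].append(i)
    return table


def extend_seed(query, subject, qi, sj, w, match=2, mismatch=-1, x_drop=3):
    """Ungapped extension of an exact seed; returns (score, q_start, s_start, length)."""
    best = score = w * match
    right = left = 0
    k = 0
    # Extend to the right until the score drops more than x_drop below the best.
    while qi + w + k < len(query) and sj + w + k < len(subject):
        score += match if query[qi + w + k] == subject[sj + w + k] else mismatch
        k += 1
        if score > best:
            best, right = score, k
        elif best - score > x_drop:
            break
    # Extend to the left from the best right-hand boundary.
    score, k = best, 0
    while qi - 1 - k >= 0 and sj - 1 - k >= 0:
        score += match if query[qi - 1 - k] == subject[sj - 1 - k] else mismatch
        k += 1
        if score > best:
            best, left = score, k
        elif best - score > x_drop:
            break
    return best, qi - left, sj - left, left + w + right


def toy_search(query, subject, w=4):
    """Scan the subject for query words, extend each seed, and rank the results."""
    table = build_lookup(query, w)
    hits = set()
    for j in range(len(subject) - w + 1):
        for i in table.get(subject[j:j + w], ()):
            hits.add(extend_seed(query, subject, i, j, w))
    return sorted(hits, reverse=True)  # highest score first


print(toy_search("GGGACGTACCC", "TTTACGTATTT"))
```

Both seeds in this example (ACGT and CGTA) extend to the same segment, ACGTA, so the result set holds a single entry; collapsing duplicates like this is one reason the results are collected in a set.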

Interpretation of BLAST Results
• The output typically includes a list of hits, each with associated metrics like E-value
and bit score, which indicate the strength and significance of the alignment.
• Additionally, the results provide detailed alignments, showing exactly how the query
sequence matches up with each hit, including any gaps or mismatches.

Q&A
Thank you!
