2. Sequence alignment
Sequence Alignments
5/3/2025
1
Sequence alignment: Overview
seq1: CATTTATTTTC
seq2: AATTTGTA
[Figure: seq1 aligned with seq2, with columns labeled Match, Mismatch and Indel]
• Match vs. mismatch.
• A gap (added to increase the number of matches) represents an insertion or deletion (indel).
2
Sequence alignment: Purpose
seq1: CATTTATTTTC
seq2: AATTTGTA
3
Sequence alignment: Purpose
4
Sequence alignment: Purpose
Assembled sequence:
The more that you read, the more things you will know.
5
Sequence homology, similarity and identity
• For DNA, sequence identity and similarity are usually the same. For proteins they differ, because similarity also counts substitutions between residues with similar physicochemical properties.
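The difference between identity and similarity can be illustrated with a small sketch. The residue groups in `SIMILAR_GROUPS` are an illustrative assumption (real tools derive similarity from a scoring matrix), and both percentages are taken over all alignment columns, which is only one of several conventions in use.

```python
# Illustrative grouping of amino acids with similar properties (an assumption,
# not a standard; real tools use a substitution matrix instead).
SIMILAR_GROUPS = [set("ILVM"), set("KRH"), set("DE"), set("ST"), set("FYW"), set("NQ")]


def is_similar(a, b):
    """True if two residues are identical or fall in the same assumed group."""
    return a == b or any(a in group and b in group for group in SIMILAR_GROUPS)


def percent_identity(row1, row2):
    """Percentage of alignment columns holding the same residue in both rows."""
    columns = list(zip(row1, row2))
    identical = sum(1 for a, b in columns if a == b and a != "-")
    return 100.0 * identical / len(columns)


def percent_similarity(row1, row2):
    """Percentage of alignment columns holding identical or similar residues."""
    columns = list(zip(row1, row2))
    similar = sum(1 for a, b in columns if "-" not in (a, b) and is_similar(a, b))
    return 100.0 * similar / len(columns)


# Hypothetical three-residue protein alignment: K/R and D/E are similar, L/L identical.
print(percent_identity("KLD", "RLE"), percent_similarity("KLD", "RLE"))
```

For a DNA alignment there are no "similar" nucleotides in this sense, so the two functions return the same number.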
6
Sequence evolution
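• Major changes:
• Substitution
• Insertion
• Deletion
• Example:
Ancestral sequence: GACTGGA
Substitution: G -> C, giving CACTGGA
Speciation event
Lineage 1: Deletion: C, giving CATGGA
Lineage 1: Substitution: G -> T, giving CATGTA
Lineage 1: Insertion: T, giving CATGTTA
Lineage 2: unchanged, CACTGGA
Resulting sequences: CATGTTA and CACTGGA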
7
Sequence alignment: which alignment is the best?
C A T G T T A C A - T G T T A C A T - G T T A
| | | | | | | | | | | |
C A C T G G A C A C T G G - A C A C T G G - A
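One way to decide which of the three candidate alignments above is best is to give each a numerical score. The sketch below assumes a simple scheme of +1 per match, -1 per mismatch and -2 per gap column; these values and the function name `score_alignment` are illustrative choices, not values given in the slides.

```python
def score_alignment(row1, row2, match=1, mismatch=-1, gap=-2):
    """Score a finished alignment column by column (assumed scoring scheme)."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for a, b in zip(row1, row2):
        if a == "-" or b == "-":
            total += gap          # indel column
        elif a == b:
            total += match        # identical residues
        else:
            total += mismatch     # substitution
    return total


# The three candidate alignments of CATGTTA and CACTGGA.
candidates = [
    ("CATGTTA", "CACTGGA"),
    ("CA-TGTTA", "CACTGG-A"),
    ("CAT-GTTA", "CACTGG-A"),
]
for row1, row2 in candidates:
    print(row1, row2, score_alignment(row1, row2))
```

Under these assumed values the second alignment scores highest (0, against -1 and -2). A different gap penalty can change the ranking, which is why the scoring scheme matters.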
8
Pairwise alignment: Global vs. local
10
Sequence alignment: dynamic programming method
[Figure: dynamic programming matrix with Sequence 1 (length n) and Sequence 2 (length m) along its two axes]
C A - T G T T A
C A C T G G - A
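The dynamic programming idea can be sketched as follows. This is a minimal global-alignment implementation in the style usually attributed to Needleman and Wunsch (a name the slides do not mention), with an assumed linear gap penalty and assumed scores of +1, -1 and -2. It fills an (n+1) x (m+1) matrix and then traces back one optimal path.

```python
def needleman_wunsch(seq1, seq2, match=1, mismatch=-1, gap=-2):
    """Global alignment by dynamic programming (assumed scoring values)."""
    n, m = len(seq1), len(seq2)
    # score[i][j] = best score for aligning seq1[:i] with seq2[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if seq1[i - 1] == seq2[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + s,   # align two residues
                              score[i - 1][j] + gap,     # gap in seq2
                              score[i][j - 1] + gap)     # gap in seq1
    # Trace back from the bottom-right corner to recover one optimal alignment.
    row1, row2 = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if seq1[i - 1] == seq2[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + s:
                row1.append(seq1[i - 1])
                row2.append(seq2[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            row1.append(seq1[i - 1])
            row2.append("-")
            i -= 1
        else:
            row1.append("-")
            row2.append(seq2[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(row1)), "".join(reversed(row2))


best, row1, row2 = needleman_wunsch("CATGTTA", "CACTGGA")
print(best)
print(row1)
print(row2)
```

Several alignments may share the optimal score; the traceback here returns only one of them, chosen by the order in which the three moves are tested.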
11
Sequence alignment: dynamic programming method
12
Scoring matrix
• A substitution matrix is a set of values that quantify the likelihood of one residue being substituted by another in an alignment.
• Scoring matrices for nucleotide sequences are relatively simple: a positive value (high score) is given for a match and a negative value (low score) for a mismatch.
• Scoring matrices for amino acids are more complicated because the scores reflect the physicochemical properties of the residues, as well as the likelihood of certain residues being substituted for one another in truly homologous sequences.
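A simple nucleotide scoring matrix of the kind described above might look like the sketch below. The values +5 and -4 and the helper names are illustrative assumptions; amino acid matrices have the same lookup structure but with 20 x 20 empirically derived values.

```python
def make_nucleotide_matrix(match=5, mismatch=-4, alphabet="ACGT"):
    """Build a match/mismatch scoring matrix as a dict keyed by residue pairs."""
    return {(a, b): (match if a == b else mismatch)
            for a in alphabet for b in alphabet}


def score_ungapped(seq1, seq2, matrix):
    """Sum the matrix scores of two equal-length sequences, position by position."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have the same length")
    return sum(matrix[(a, b)] for a, b in zip(seq1, seq2))


matrix = make_nucleotide_matrix()
print(matrix[("A", "A")], matrix[("A", "G")])
print(score_ungapped("CATGTTA", "CACTGGA", matrix))
```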
13
Scoring matrix
14
Local alignment: Smith and Waterman algorithm
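As a rough sketch of local alignment, the code below follows the usual Smith-Waterman recurrence: cell values are never allowed to drop below zero, and the traceback starts at the highest-scoring cell and stops at the first zero. The scores (+2, -1, -2), the linear gap penalty and the example sequences are assumptions made for illustration.

```python
def smith_waterman(seq1, seq2, match=2, mismatch=-1, gap=-2):
    """Local alignment by dynamic programming (assumed scoring values)."""
    n, m = len(seq1), len(seq2)
    H = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if seq1[i - 1] == seq2[j - 1] else mismatch
            # The extra 0 lets a new local alignment start anywhere.
            H[i][j] = max(0,
                          H[i - 1][j - 1] + s,
                          H[i - 1][j] + gap,
                          H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Trace back from the best cell until a zero is reached.
    row1, row2 = [], []
    i, j = best_pos
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if seq1[i - 1] == seq2[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            row1.append(seq1[i - 1])
            row2.append(seq2[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            row1.append(seq1[i - 1])
            row2.append("-")
            i -= 1
        else:
            row1.append("-")
            row2.append(seq2[j - 1])
            j -= 1
    return best, "".join(reversed(row1)), "".join(reversed(row2))


# Hypothetical sequences that share only the short region ACG.
print(smith_waterman("TTACGTTT", "GGACGGG"))
```

Unlike a global alignment, the result covers only the best-matching region and ignores the unrelated flanking residues.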
15
Sequence alignment: dot plots
• Seq1: GATTCTATCTAACTA
• Seq2: GTTCTATTCTAAC
G A T T C T A T - C T A A C T A
|   | | | | | |   | | | | |
G - T T C T A T T C T A A C - -
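A dot plot for the two sequences above can be sketched in a few lines: one sequence runs down the rows, the other across the columns, and a mark is placed wherever the residues are identical. The function names and the plain-text rendering are illustrative assumptions; real dot plot tools usually add a window and threshold to suppress noise.

```python
def dot_plot(seq1, seq2):
    """Boolean matrix: entry [i][j] is True where seq1[i] equals seq2[j]."""
    return [[a == b for b in seq2] for a in seq1]


def render(seq1, seq2):
    """Plain-text rendering with '*' for a match and '.' otherwise."""
    lines = ["  " + " ".join(seq2)]
    for a, row in zip(seq1, dot_plot(seq1, seq2)):
        lines.append(a + " " + " ".join("*" if hit else "." for hit in row))
    return "\n".join(lines)


print(render("GATTCTATCTAACTA", "GTTCTATTCTAAC"))
```

Similar regions show up as diagonal runs of marks, and a shift from one diagonal to a neighbouring one corresponds to an indel in the alignment.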
16
Database similarity searching: pairwise alignment on large scale
17
Database similarity searching: pairwise alignment on large scale
• Requirements:
• Sensitivity: the ability to find as many correct hits as possible
• Selectivity (specificity): to find as few unrelated hits as possible
• Speed: the time it takes to get results
• Approaches:
• Exhaustive type: dynamic programming (Smith-Waterman algorithm).
• Heuristic type: takes shortcuts by reducing the search space.
18
Basic Local Alignment Search Tool (BLAST)
19
BLAST steps
1. Break the query sequence into words (e.g., 3 amino acids or 11 nucleotides).
2. Slide along the query so that every window of 3 residues becomes a word, and look each word up in the word database.
3. Assume one of the words finds matches in the database.
4. Calculate the sums of the match scores based on a scoring matrix.
5. Find the database sequence corresponding to the best word match and extend the alignment in both directions.
6. Determine the high-scoring segment that is above the threshold (e.g., 22).
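The word-and-extend idea behind the steps above can be sketched with a toy example. This is not BLAST's actual implementation: it only finds exact word matches and extends them while residues are identical, whereas BLAST scores extensions with a scoring matrix and a threshold. The function names and the example sequences are hypothetical.

```python
def split_into_words(query, k=3):
    """Step 1: every overlapping window of k residues in the query."""
    return [query[i:i + k] for i in range(len(query) - k + 1)]


def find_seeds(query, subject, k=3):
    """Steps 2-3: positions (query, subject) where a query word occurs exactly."""
    index = {}
    for pos in range(len(subject) - k + 1):
        index.setdefault(subject[pos:pos + k], []).append(pos)
    seeds = []
    for qpos, word in enumerate(split_into_words(query, k)):
        for spos in index.get(word, []):
            seeds.append((qpos, spos))
    return seeds


def extend_seed(query, subject, qpos, spos, k=3):
    """Step 5, simplified: extend in both directions while residues are identical."""
    qstart, sstart = qpos, spos
    while qstart > 0 and sstart > 0 and query[qstart - 1] == subject[sstart - 1]:
        qstart, sstart = qstart - 1, sstart - 1
    qend, send = qpos + k, spos + k
    while qend < len(query) and send < len(subject) and query[qend] == subject[send]:
        qend, send = qend + 1, send + 1
    return qstart, sstart, qend - qstart  # start in query, start in subject, length


query, subject = "GATTACA", "CCGATTACAGG"
seeds = find_seeds(query, subject)
print(seeds)
print(extend_seed(query, subject, *seeds[0]))
```

Restricting the expensive extension step to positions where a short word already matches is what reduces the search space compared with exhaustive dynamic programming.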
20
BLAST results
21
Statistical significance of BLAST search results
E-value = m × n × P
m: total number of residues in a database
n: number of residues in the query sequence
P: probability that an alignment is a result of random chance
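As a quick arithmetic illustration of the formula above, the sketch below plugs in made-up numbers: a database of one million residues, a 100-residue query and a chance probability of 1e-10. All three values are hypothetical.

```python
def e_value(m, n, p):
    """E-value = m x n x P, using the definitions given above."""
    return m * n * p


# Hypothetical numbers, for illustration only.
database_residues = 1_000_000   # m
query_residues = 100            # n
chance_probability = 1e-10      # P
print(e_value(database_residues, query_residues, chance_probability))
```

With these numbers the E-value is about 0.01. Because the E-value scales with m, the same alignment searched against a database ten times larger would have an E-value ten times higher.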
22
BLAST results
23
BLAST results
24
Problems
1. Obtain the human HBA and HBB protein sequences. Perform a pairwise alignment on the NCBI and EBI websites.
2. You have isolated a novel bacterial strain from a soil sample and submitted the PCR product of its 16S rRNA gene for Sanger sequencing. Now that you have the 16S rRNA gene sequence, use BLASTN on NCBI to determine the identity of your isolate.
25