2. Sequence alignment

Chapter 2 discusses sequence alignments, which are essential for inferring relationships between biological sequences, predicting functions, and assembling larger sequence units. It covers concepts such as homology, similarity, identity, and various alignment methods including global and local alignments, as well as algorithms like Needleman-Wunsch and Smith-Waterman. The chapter also highlights the importance of database searching and tools like BLAST for large-scale sequence analysis.

Uploaded by

phamngochuyen425
Copyright
© All Rights Reserved

Chapter 2.

Sequence Alignments

5/3/2025

Sequence alignment: Overview

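• Sequence alignment provides inference for the relatedness of two sequences under study.

seq1: CATTTATTTTC
seq2: AATTTGTA
[Figure labels: match, mismatch, indel]

• Match vs. mismatch: an aligned position holds either identical residues (match) or different residues (mismatch).
• A gap, added to increase the number of matches, represents an insertion or deletion (indel).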

Sequence alignment: Purpose

• Predict the function of a sequence by inference from a well-characterized sequence.

seq1: CATTTATTTTC
seq2: AATTTGTA

• Infer the evolutionary relationship between sequences: if two sequences share significant similarity, they are likely to have derived from a common evolutionary origin.

• Predict structural and functional motifs, such as active sites and receptor sites.

Sequence alignment: Purpose

• Assembly of sequence reads into larger units such as contigs or genomes

Seq 1  The more that
Seq 2           that you read,
Seq 3                you read, the more things
Seq 4                                   things you will
Seq 5                                              will know.

Assembled sequence:
The more that you read, the more things you will know.
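The read-merging idea on this slide can be sketched in a few lines of Python. This is only a toy illustration: it assumes the reads are already in the correct order and that overlaps are exact word matches, and the function name `merge_reads` is hypothetical. Real assemblers work on nucleotides, tolerate sequencing errors, and must also work out the order of the reads.

```python
def merge_reads(reads):
    """Merge ordered reads by collapsing the longest exact word overlap
    between the end of the growing sequence and the start of the next read."""
    words = reads[0].split()
    for read in reads[1:]:
        nxt = read.split()
        overlap = 0
        # Try the longest possible overlap first, then shrink.
        for k in range(min(len(words), len(nxt)), 0, -1):
            if words[-k:] == nxt[:k]:
                overlap = k
                break
        words.extend(nxt[overlap:])  # append only the non-overlapping part
    return " ".join(words)


reads = [
    "The more that",
    "that you read,",
    "you read, the more things",
    "things you will",
    "will know.",
]
assembled = merge_reads(reads)
print(assembled)
```

Each overlapping word is kept only once, which is why the shared words between consecutive reads do not appear twice in the merged sentence.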

Sequence homology, similarity and identity

• Two sequences are homologous when they share a common ancestor. Homology is a qualitative, not a quantitative, term: sequences either are homologous or are not.

• Sequence similarity is the percentage of aligned residues that are similar in physicochemical properties such as size, charge, and hydrophobicity. Similarity is a quantitative term.

• Sequence identity is the percentage of aligned residues that match exactly.

• For DNA, identity and similarity are the same; for proteins they differ, because non-identical residues can still be similar.
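As a rough sketch of how the two percentages could be computed from an existing alignment, the Python below counts identical and similar columns. Several things here are assumptions, not part of the slides: the residue grouping `SIMILAR_GROUPS` is a hypothetical simplification (real tools judge similarity with a substitution matrix), and the denominator is taken to be the full alignment length including gap columns, which is only one of several conventions in use.

```python
# Hypothetical grouping of amino acids by physicochemical property.
# This is an illustrative assumption, not a standard classification.
SIMILAR_GROUPS = [
    set("AVLIMC"),  # hydrophobic
    set("FWY"),     # aromatic
    set("STNQ"),    # polar, uncharged
    set("KRH"),     # basic
    set("DE"),      # acidic
    set("GP"),      # small / special
]


def percent_identity_similarity(aligned_a, aligned_b, groups=()):
    """Return (% identity, % similarity) over the alignment length.
    Gap columns ('-') count toward the length but never as matches."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    identical = similar = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == "-" or y == "-":
            continue
        if x == y:
            identical += 1
            similar += 1
        elif any(x in g and y in g for g in groups):
            similar += 1
    length = len(aligned_a)
    return 100 * identical / length, 100 * similar / length


# DNA: no similarity groups, so identity and similarity coincide.
dna = percent_identity_similarity("CA-TGTTA", "CACTGG-A")
# Protein: K/R and D/E are similar but not identical.
protein = percent_identity_similarity("KLDV", "RLEV", SIMILAR_GROUPS)
print(dna, protein)
```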

Sequence evolution

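• Major changes:
  • Substitution
  • Insertion
  • Deletion

Example (figure): starting from the ancestral sequence GACTGGA, the following changes and a speciation event produce two present-day sequences.
  Substitution: G -> C   CACTGGA
  Deletion: C            CATGGA
  Substitution: G -> T   CATGTA
  Insertion: T           CATGTTA

Resulting sequences: CATGTTA and CACTGGA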
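The three kinds of change can be replayed on the example sequence with simple string operations. The slide names only the base involved in each change, so the positions used below are assumptions chosen to reproduce the intermediate sequences shown; the helper names are hypothetical.

```python
def substitute(seq, pos, base):
    """Replace the residue at 0-based position pos."""
    return seq[:pos] + base + seq[pos + 1:]


def delete(seq, pos):
    """Remove the residue at 0-based position pos."""
    return seq[:pos] + seq[pos + 1:]


def insert(seq, pos, base):
    """Insert base before 0-based position pos."""
    return seq[:pos] + base + seq[pos:]


ancestor = "GACTGGA"
step1 = substitute(ancestor, 0, "C")  # G -> C      : CACTGGA
step2 = delete(step1, 2)              # delete C    : CATGGA
step3 = substitute(step2, 4, "T")     # G -> T      : CATGTA
step4 = insert(step3, 4, "T")         # insert T    : CATGTTA
print(step1, step2, step3, step4)
```

An aligner sees only the two final sequences, so the alignment has to infer where these events happened.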

Sequence alignment: which alignment is the best?

Alignment 1:
C A T G T T A
| |         |
C A C T G G A

Alignment 2:
C A - T G T T A
| |   | |     |
C A C T G G - A

Alignment 3:
C A T - G T T A
| |     |     |
C A C T G G - A
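One way to compare candidate alignments is to give each a numerical score. The sketch below assumes the simple scheme used for the Needleman and Wunsch example in this chapter (match +1, mismatch -1, gap -2); the function name `score_alignment` is hypothetical.

```python
def score_alignment(top, bottom, match=1, mismatch=-1, gap=-2):
    """Sum column scores of a pairwise alignment written with '-' for gaps."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have equal length")
    score = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            score += gap
        elif a == b:
            score += match
        else:
            score += mismatch
    return score


candidates = [
    ("CATGTTA", "CACTGGA"),    # no gaps
    ("CA-TGTTA", "CACTGG-A"),  # gaps after CA and before the final A
    ("CAT-GTTA", "CACTGG-A"),  # gaps after CAT and before the final A
]
scores = [score_alignment(t, b) for t, b in candidates]
scores_gap3 = [score_alignment(t, b, gap=-3) for t, b in candidates]
print(scores)       # [-1, 0, -2]
print(scores_gap3)  # [-1, -2, -4]
```

Under this scheme the second alignment scores highest. With a harsher gap penalty of -3 the ungapped alignment wins instead, so which alignment counts as best depends on the scoring parameters.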

Pairwise alignment: Global vs. local

• Global alignment assumes that the two sequences are generally similar over their entire length and aligns them from end to end.
• Global alignment is suited to closely related sequences.
• Local alignment does not assume that the aligned sequences are of similar length; it finds the local regions that share the highest level of similarity.
• Local alignment is used to search for conserved regions within sequences.

Sequence alignment: dynamic programming method

• Global alignment: Needleman and Wunsch algorithm

Match: +1, mismatch: -1, gap: -2

• Step 1: set up a matrix
• Step 2: fill in (score) the matrix
• Step 3: trace back and identify the alignment
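The three steps can be sketched in Python as follows. This is a minimal illustration with a linear gap penalty and the scores given above, not a reference implementation; the function name and the tie-breaking order in the traceback (diagonal first, then up, then left) are assumptions.

```python
def needleman_wunsch(seq1, seq2, match=1, mismatch=-1, gap=-2):
    n, m = len(seq1), len(seq2)
    # Step 1: set up an (n+1) x (m+1) matrix; the first row and column
    # hold cumulative gap penalties.
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap
    for j in range(1, m + 1):
        F[0][j] = j * gap
    # Step 2: fill each cell with the best of diagonal, up, and left moves.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if seq1[i - 1] == seq2[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s,
                          F[i - 1][j] + gap,
                          F[i][j - 1] + gap)
    # Step 3: trace back from the bottom-right cell to the top-left cell.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        s = match if i > 0 and j > 0 and seq1[i - 1] == seq2[j - 1] else mismatch
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + s:
            top.append(seq1[i - 1]); bottom.append(seq2[j - 1])
            i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            top.append(seq1[i - 1]); bottom.append("-")
            i -= 1
        else:
            top.append("-"); bottom.append(seq2[j - 1])
            j -= 1
    return F[n][m], "".join(reversed(top)), "".join(reversed(bottom))


score, top, bottom = needleman_wunsch("CATGTTA", "CACTGGA")
print(score)
print(top)
print(bottom)
```

For CATGTTA and CACTGGA the optimal global score under these parameters is 0. Several alignments can share the optimal score, so the one returned depends on the tie-breaking order and need not be identical to the one drawn on a slide.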

Sequence alignment: dynamic programming method

[Figure: dynamic programming matrix for sequence 1 (length n) and sequence 2 (length m)]

C A - T G T T A
C A C T G G - A

Sequence alignment: dynamic programming method

• Match: +1, mismatch: -1, gap: -3

Scoring matrix
• A substitution matrix is a set of values that quantify the likelihood of one residue being substituted by another in an alignment.

• A substitution matrix is derived from statistical analysis of residue substitution data from sets of reliable alignments of highly related sequences.

• Scoring matrices for nucleotide sequences are relatively simple: a positive value (high score) is given for a match and a negative value (low score) for a mismatch.

• Scoring matrices for amino acids are more complicated, because the scores reflect the physicochemical properties of the amino acid residues as well as the likelihood of certain residues being substituted among true homologous sequences.
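To illustrate how scores might be derived from trusted alignments, the sketch below computes log-odds scores: the observed frequency of each aligned residue pair divided by the frequency expected by chance, on a log2 scale. The slides do not give a formula, so this particular formulation (similar in spirit to how BLOSUM-type matrices are built) is an assumption, and the two toy alignments are made up purely for the example.

```python
import math
from collections import Counter


def log_odds_matrix(aligned_pairs):
    """aligned_pairs: list of (seq_a, seq_b), ungapped and of equal length.
    Returns {(x, y): log2(observed / expected)} for each observed pair."""
    pair_counts = Counter()
    residue_counts = Counter()
    for a, b in aligned_pairs:
        for x, y in zip(a, b):
            pair_counts[tuple(sorted((x, y)))] += 1  # unordered pair
            residue_counts[x] += 1
            residue_counts[y] += 1
    total_pairs = sum(pair_counts.values())
    total_residues = sum(residue_counts.values())
    matrix = {}
    for (x, y), count in pair_counts.items():
        observed = count / total_pairs
        px = residue_counts[x] / total_residues
        py = residue_counts[y] / total_residues
        # An unordered mismatch pair (x, y) can arise as x/y or y/x.
        expected = px * py if x == y else 2 * px * py
        matrix[(x, y)] = math.log2(observed / expected)
    return matrix


# Made-up toy data, for illustration only.
toy_alignments = [("AAAA", "AAAA"), ("GGGG", "GGGA")]
matrix = log_odds_matrix(toy_alignments)
for pair, value in sorted(matrix.items()):
    print(pair, round(value, 2))
```

Pairs seen more often than chance predicts (here the matches) get positive scores, and pairs seen less often (the A/G mismatch) get negative scores.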

Local alignment: Smith and Waterman algorithm

• Negative scores are replaced by 0.
• Traceback starts from the cell with the highest score in the scoring matrix and stops at a cell with score 0.
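A minimal Python sketch of these two rules is shown below. It assumes a linear gap penalty and the same simple scores used for the global alignment example (match +1, mismatch -1, gap -2); the function name and the toy sequences are hypothetical.

```python
def smith_waterman(seq1, seq2, match=1, mismatch=-1, gap=-2):
    n, m = len(seq1), len(seq2)
    H = [[0] * (m + 1) for _ in range(n + 1)]  # first row/column stay 0
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if seq1[i - 1] == seq2[j - 1] else mismatch
            # Negative scores are replaced by 0.
            H[i][j] = max(0,
                          H[i - 1][j - 1] + s,
                          H[i - 1][j] + gap,
                          H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Traceback starts at the highest-scoring cell and stops at a 0 cell.
    top, bottom = [], []
    i, j = best_pos
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if seq1[i - 1] == seq2[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            top.append(seq1[i - 1]); bottom.append(seq2[j - 1])
            i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            top.append(seq1[i - 1]); bottom.append("-")
            i -= 1
        else:
            top.append("-"); bottom.append(seq2[j - 1])
            j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


result = smith_waterman("TTTACGTTT", "GGGACGGGG")
print(result)  # (3, 'ACG', 'ACG')
```

The two toy sequences are unrelated apart from the shared ACG, so the local alignment reports only that region, whereas a global alignment would be forced to align the flanking residues as well.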

Sequence alignment: dot plots
• Seq1: GATTCTATCTAACTA
• Seq2: GTTCTATTCTAAC

G A T T C T A T - C T A A C T A
|   | | | | | |   | | | | |
G - T T C T A T T C T A A C - -

• Put a dot wherever a match is found.
• Connect the dots along the diagonals.
• Drawback: high noise.
• Solution: a sliding window with a threshold.
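The dot matrix and the sliding-window filter can be sketched as follows. The function name `dot_plot` and the window and threshold values are assumptions chosen for illustration: a dot is placed at (i, j) only if at least `threshold` of the `window` residues starting at those positions match.

```python
def dot_plot(seq1, seq2, window=1, threshold=1):
    """Return the set of (i, j) positions that receive a dot.
    With window=1 and threshold=1 every single-residue match is a dot."""
    dots = set()
    for i in range(len(seq1) - window + 1):
        for j in range(len(seq2) - window + 1):
            matches = sum(seq1[i + k] == seq2[j + k] for k in range(window))
            if matches >= threshold:
                dots.add((i, j))
    return dots


seq1 = "GATTCTATCTAACTA"
seq2 = "GTTCTATTCTAAC"
raw = dot_plot(seq1, seq2)                             # noisy
filtered = dot_plot(seq1, seq2, window=3, threshold=3)  # 3 of 3 must match

# Print the filtered plot as text: rows follow seq1, columns follow seq2.
for i in range(len(seq1)):
    print("".join("*" if (i, j) in filtered else "." for j in range(len(seq2))))
print(len(raw), "dots before filtering,", len(filtered), "after")
```

Requiring a run of matches removes isolated chance matches, so the diagonals that correspond to real regions of similarity stand out more clearly.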

Database similarity searching: pairwise alignment on large scale

• Database searching: a means of assigning putative functions to newly determined sequences.

• How: by pairwise alignment on a large scale, comparing a query sequence (the input sequence) against thousands of sequences in the database.

Database similarity searching: pairwise alignment on large scale

• Requirements:
  • Sensitivity: the ability to find as many correct hits as possible
  • Selectivity (specificity): the ability to find as few unrelated hits as possible
  • Speed: the time it takes to get results
• Approaches:
  • Exhaustive type: dynamic programming (Smith and Waterman algorithm)
  • Heuristic type: takes shortcuts by reducing the search space
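If the true relationships in a test database are known, the first two requirements can be put into numbers. The sketch below uses the standard statistical definitions of sensitivity and specificity, which is an assumption about how the terms would be measured; the counts are hypothetical.

```python
import math


def sensitivity(true_pos, false_neg):
    """Fraction of truly related sequences that the search reports."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg, false_pos):
    """Fraction of unrelated sequences that the search correctly leaves out."""
    return true_neg / (true_neg + false_pos)


# Hypothetical benchmark: 100 true homologs and 1000 unrelated sequences.
sens = sensitivity(true_pos=90, false_neg=10)
spec = specificity(true_neg=950, false_pos=50)
print(sens, spec)
```

Heuristic methods typically trade a little sensitivity for a large gain in speed compared with exhaustive dynamic programming.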

Basic Local Alignment Search Tool (BLAST)

• Developed by Stephen Altschul of NCBI and colleagues in 1990.
• Became one of the most popular programs for sequence analysis.
• Uses a heuristic approach to align a query sequence with all sequences in the database.
• Objective: find high-scoring ungapped segments among related sequences.

BLAST steps
1. Break the query sequence into words (e.g., 3 amino acids or 11 nucleotides).
2. Search the word database with each word from the query.
3. Suppose one of the words finds matches in the database.
4. Calculate the sum of the match scores for each word match using a scoring matrix.
5. Find the database sequence corresponding to the best word match and extend the alignment in both directions.
6. Identify the high-scoring segments whose scores are above a threshold (e.g., 22).
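The seed-and-extend idea behind these steps can be sketched in Python. This is a heavily simplified illustration and not the actual BLAST implementation: it uses exact word matches only (BLAST also accepts similar words that score above a cutoff in a substitution matrix), a plain +1/-1 score, and a simple drop-off rule to stop extension. All names, the toy sequences, and the parameter values are assumptions.

```python
from collections import defaultdict


def build_word_index(subject, k):
    """Map every k-letter word in the subject to its start positions."""
    index = defaultdict(list)
    for i in range(len(subject) - k + 1):
        index[subject[i:i + k]].append(i)
    return index


def extend_one_way(query, subject, qi, si, step, match, mismatch, x_drop):
    """Ungapped extension in one direction; returns (best gain, its length).
    Stops when the score falls more than x_drop below the best seen."""
    score = best = best_len = length = 0
    while 0 <= qi < len(query) and 0 <= si < len(subject):
        score += match if query[qi] == subject[si] else mismatch
        length += 1
        if score > best:
            best, best_len = score, length
        elif best - score > x_drop:
            break
        qi += step
        si += step
    return best, best_len


def seed_and_extend(query, subject, k=3, threshold=5,
                    match=1, mismatch=-1, x_drop=2):
    """Return hits as (score, query_start, subject_start, length), best first."""
    index = build_word_index(subject, k)
    hits = set()
    for qi in range(len(query) - k + 1):              # step 1: query words
        for si in index.get(query[qi:qi + k], []):    # steps 2-3: word matches
            right, r_len = extend_one_way(query, subject, qi + k, si + k,
                                          1, match, mismatch, x_drop)
            left, l_len = extend_one_way(query, subject, qi - 1, si - 1,
                                         -1, match, mismatch, x_drop)
            score = k * match + left + right          # steps 4-5: score, extend
            if score >= threshold:                    # step 6: keep high scorers
                hits.add((score, qi - l_len, si - l_len, k + l_len + r_len))
    return sorted(hits, reverse=True)


hits = seed_and_extend("ACGTACGT", "TTTACGTACGTTT")
print(hits[0])  # (8, 0, 3, 8)
```

Only positions that share an exact word with the query are ever examined, which is how the heuristic reduces the search space compared with filling a full dynamic programming matrix for every database sequence.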

[Figure: BLAST results]

Statistical significance of BLAST search results

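• E-value (expectation value) is the number of alignments with a score at least as high as the one observed that are expected to occur by random chance in a database search. The lower the E-value, the less likely the match is due to chance.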

E-value = m × n × P
  m: total number of residues in the database
  n: number of residues in the query sequence
  P: probability that an alignment is a result of random chance

E.g., E-value = 10^12 × 100 × 10^-20 = 10^-6
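The arithmetic in the example can be checked with a short Python sketch of the simplified formula above. The function name `e_value` is hypothetical, and actual BLAST statistics are computed from the alignment score using additional parameters that are not covered here.

```python
import math


def e_value(m, n, p):
    """Simplified E-value: database size x query length x chance probability."""
    return m * n * p


e = e_value(m=1e12, n=100, p=1e-20)
print(f"{e:.1e}")

# Under this formula, the same P in a database ten times larger gives an
# E-value ten times higher.
e_bigger_db = e_value(m=1e13, n=100, p=1e-20)
```

Because the E-value scales with database size, the same alignment becomes less significant when it is found in a larger database.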

[Figures: BLAST results]

Problems

1. Obtain the human HBA and HBB protein sequences. Perform a pairwise alignment on both the NCBI and the EBI websites.
2. You have isolated a novel bacterial strain from a soil sample and submitted the PCR product of its 16S rRNA gene for Sanger sequencing. Now that you have the 16S rRNA gene sequence, use Blastn on NCBI to identify your isolate.
