Lecture - 02 - Comparative Sequence Analysis

Comparative Sequence Analysis

Department of Life Sciences, SBASSE, LUMS


Genome to Gene

Heredity Unit

Latest on Genome Sequencing
• Human Genome Project (1990 – 2003)

Now!

Our Genome and the Need for Comparative Genomics
• Number of bases: 3.2 billion bases

• Number of chromosomes: 23 pairs

• Percentage of genes: Only about 1-2% of the genome codes for proteins

• Protein-coding Gene Number: 20,000 - 25,000

• Average gene size: ~3,000 bases, with huge variation


• Largest known human gene consists of 2.4 million bases (dystrophin)

• Repetition: Almost 45-50% of the DNA is repetitive

• Similarity between individuals: Almost all (99.9%) nucleotide bases are exactly the same
in all people

Proteome to Protein
Genes: 30,000

Alternative Splicing: 2 - 3 per gene


3 x 30,000 = 90,000 proteins

Post translational modifications


10 x 90,000 = 900,000 proteins

Peng and Gygi, JMS 2001


Asa Wheelock
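
The estimate above is simple multiplication, sketched below. The multipliers are the slide's round numbers (30,000 genes, up to 3 splice forms per gene, about 10 modified forms per protein), so the result is an order-of-magnitude figure and not a measured count. The variable names are illustrative.

```python
# Back-of-envelope proteome estimate using the slide's round numbers.
genes = 30_000               # gene count used on the slide
isoforms_per_gene = 3        # upper end of the "2 - 3 per gene" splicing range
ptm_forms_per_isoform = 10   # assumed average number of modified forms

spliced_proteins = genes * isoforms_per_gene
total_protein_forms = spliced_proteins * ptm_forms_per_isoform

print(spliced_proteins)      # 90000
print(total_protein_forms)   # 900000
```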

Need for Comparative Proteomics
• Number of reported proteins: 150 million and counting

Benefits of Comparative Genomics
• Comparison of whole genome sequences provides a highly detailed
view of how organisms are related to each other at the genetic level

• Comparative genomics also provides a powerful tool for studying
evolutionary changes among organisms

• Helps to identify genes that are conserved or common among species,
as well as genes that give each organism its unique characteristics

Fly vs. Humans
Comparison of the fruit fly genome with the human genome:

• about 75% of genes are conserved

• the two organisms appear to share a core set of genes

• two-thirds of human genes known to be involved in cancer have
counterparts in the fruit fly

Evolutionary Relationship

COV2

http://bacterialphylogeny.info/overview.html

What have we done and what’s next?
DONE: Gene and Protein Sequences
• GenBank (DNA Sequences)
• UniProt (Protein Sequences)
• GeneMark (Gene Prediction)

NEXT: Sequence & Structure Analysis
• BLAST (nucleotide, protein)
• PDB (Protein Structures)
• I-TASSER (Protein Structure Prediction)

From Sequences to Comparisons
• Problem: If we sequence a new gene or protein, can we compare it
with the existing information in GenBank or UniProt?

• Idea: Compare NOVEL sequences with KNOWN (previously
characterized) genes or proteins.

• Benefit: STRUCTURAL, FUNCTIONAL and EVOLUTIONARY
information can be inferred from WELL DESIGNED comparisons.

• The most common tool used is called BLAST.


BLAST?
• Basic Local Alignment Search Tool

• A method for rapid searching of sequence databases, for both
nucleotides and proteins.

• The BLAST algorithm detects local as well as global matches
(alignments) and regions of similarity embedded in otherwise unrelated
proteins.

• Uses statistical theory to determine if a match might have occurred by
chance.
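
One common way to express "might have occurred by chance" is the Karlin-Altschul expected number of chance hits, E = K * m * n * exp(-lambda * S), where S is the alignment score, m the query length and n the database length. The sketch below is only an illustration. The default values of lambda and K are assumptions (values often quoted for ungapped BLOSUM62 scoring), and the function name is hypothetical.

```python
import math

def expected_chance_hits(score, query_len, db_len, lam=0.318, k=0.134):
    """Karlin-Altschul estimate E = K * m * n * exp(-lambda * S).

    lam and k depend on the scoring system; the defaults here are
    assumed values for illustration only.
    """
    return k * query_len * db_len * math.exp(-lam * score)

# Higher raw scores are exponentially less likely to arise by chance.
for s in (20, 40, 60):
    print(s, expected_chance_hits(s, query_len=250, db_len=1_000_000))
```

A small E means few hits of that score are expected from random sequences of the same size, so the match is unlikely to be a chance event.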
https://blast.ncbi.nlm.nih.gov/Blast.cgi

BLAST - Workflow
1. BLAST applies “Dynamic Programming” only to “promising” database sequences, which are
selected by a fast word-matching step.

2. The selection relies on an index of the database sequences (described here as a suffix
tree), which makes it very fast to search for perfectly matching sub-strings. A suffix tree
finds the longest matching sub-string between two strings in time proportional to their
combined length.

3. BLAST creates a list of all “words” (short subsequences) that score at least a certain
“threshold” when compared with the query sequence. Words are short runs of consecutive
residues: 3 amino acids for proteins, or typically 11 nucleotides for blastn (16-256 for
megablast).

4. A hash lookup table is built from all such words, including the “neighboring” words of
each word in the query sequence (rather than just random words).

5. When a BLAST search is run, candidate sequences from the database are picked based on
perfect matches to these short sub-sequences of the query sequence.
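
The word-lookup idea in steps 3 to 5 can be sketched as follows. This is a simplified illustration, not the NCBI implementation: it indexes only the exact words of the query (no neighboring words, no threshold), and the function names and the toy database are made up for the example.

```python
from collections import defaultdict

def build_word_table(query, k=3):
    """Map every length-k word of the query to the positions where it starts."""
    table = defaultdict(list)
    for i in range(len(query) - k + 1):
        table[query[i:i + k]].append(i)
    return table

def find_seeds(query, database, k=3):
    """Scan each database sequence and report exact word hits as
    (sequence name, query position, database position)."""
    table = build_word_table(query, k)
    seeds = []
    for name, seq in database.items():
        for j in range(len(seq) - k + 1):
            for i in table.get(seq[j:j + k], ()):
                seeds.append((name, i, j))
    return seeds

# Hypothetical toy database; names and sequences are invented for illustration.
toy_db = {"seq1": "METPQGIAV", "seq2": "AAAAAAA", "seq3": "GELVPQG"}
seeds = find_seeds("PQGELV", toy_db)
print(seeds)
```

Only sequences that produce at least one seed (here seq1 and seq3) would go on to the slower alignment stage.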
BLOSUM62 Match/Mismatch Matrix

• Here the word is PQG, and its neighboring words are all three-letter
words that score at least T = 13 against it, as calculated by the given
scoring system (e.g., BLOSUM62).
• T is a user-provided threshold!
• PSG (score 13) is a neighboring word, PQA (score 12) is not.
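
A minimal sketch of how the neighboring words of PQG could be enumerated. It assumes the three BLOSUM62 rows below were transcribed correctly (check them against an authoritative copy of the matrix). The function names are illustrative and only the rows for P, Q and G are included, so the code works for this one word.

```python
from itertools import product

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"

# Three rows of BLOSUM62: scores of P, Q and G against AMINO_ACIDS, in order.
# Transcribed by hand for this sketch; verify before reuse.
ROWS = {
    "P": [-1, -2, -2, -1, -3, -1, -1, -2, -2, -3, -3, -1, -2, -4, 7, -1, -1, -4, -3, -2],
    "Q": [-1, 1, 0, 0, -3, 5, 2, -2, 0, -3, -2, 1, 0, -3, -1, 0, -1, -2, -1, -2],
    "G": [0, -2, 0, -1, -3, -2, -2, 6, -2, -4, -4, -2, -3, -3, -2, 0, -2, -2, -3, -3],
}

def word_score(word, candidate):
    """Sum of per-position substitution scores between word and candidate."""
    return sum(ROWS[w][AMINO_ACIDS.index(c)] for w, c in zip(word, candidate))

def neighborhood(word, threshold):
    """All same-length words scoring at least `threshold` against `word`."""
    hits = {}
    for cand in map("".join, product(AMINO_ACIDS, repeat=len(word))):
        s = word_score(word, cand)
        if s >= threshold:
            hits[cand] = s
    return hits

neighbors = neighborhood("PQG", threshold=13)
for w, s in sorted(neighbors.items(), key=lambda kv: -kv[1]):
    print(w, s)
```

Raising T shrinks the neighborhood (faster, less sensitive); lowering it grows the list (slower, more sensitive).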

Example BLAST search method
Query sequence: PQGELV

• Make a list of all possible k-mer words (length 3 for proteins)

• Assign scores from BLOSUM62 and keep the words with score >= 13

PQG (score 18)
QGE (score 16)
GEL (score 15)
ELV (score 13)

• In total we get: PQG, QGE, GEL & ELV
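
The word list and scores for PQGELV can be reproduced with a few lines. The sketch assumes the standard BLOSUM62 self-match (diagonal) values for the six residues involved; the names SELF_SCORE, query_words and self_score are illustrative.

```python
# BLOSUM62 self-match (diagonal) scores for the residues in the example query.
# Only these six entries are needed here; the full matrix is 20 x 20.
SELF_SCORE = {"P": 7, "Q": 5, "G": 6, "E": 5, "L": 4, "V": 4}

def query_words(query, k=3):
    """All overlapping words of length k, in order of appearance."""
    return [query[i:i + k] for i in range(len(query) - k + 1)]

def self_score(word):
    """Score of a word aligned against itself."""
    return sum(SELF_SCORE[aa] for aa in word)

THRESHOLD = 13
words = query_words("PQGELV")
scores = {w: self_score(w) for w in words}
kept = [w for w in words if scores[w] >= THRESHOLD]

print(scores)
print(kept)
```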


Example BLAST search method
• Make k-mers (word size 3) of all sequences in the database
• Store them in a suffix tree (a fast tree structure for finding identical matches)

• Find all database sequences that have at least 2 matches among our 3 words
• PQG, GEL & PEG

• Find a database hit and extend the alignment into a High-scoring Segment Pair (HSP):

Query:    M E T P Q G I A V
Database: - - - P Q G E L V
Score:          8 5 5 2 0 8

• HSP: PQGI (score 8+5+5+2 = 20)

• If 2 HSPs in the query sequence are < 40 positions apart, run a full
alignment on the query and hit sequences
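
The "find a hit and extend" step can be sketched as an ungapped extension that stops once the running score falls too far below the best score seen. This is an illustration under stated assumptions, not the BLAST source. The match/mismatch scores (+5/-4) and the drop-off cutoff are hypothetical, so the numbers will not reproduce the scores on the slide, and all function names are invented.

```python
def pair_score(a, b, match=5, mismatch=-4):
    """Hypothetical scoring; a real search would look up a matrix such as BLOSUM62."""
    return match if a == b else mismatch

def extend_one_way(query, subject, q, s, step, x_drop):
    """Walk in one direction from (q, s); return (best score gain, residues kept)."""
    best = running = 0
    best_len = length = 0
    while 0 <= q < len(query) and 0 <= s < len(subject):
        running += pair_score(query[q], subject[s])
        length += 1
        if running > best:
            best, best_len = running, length
        elif best - running > x_drop:
            break  # score has dropped too far below the best; stop extending
        q += step
        s += step
    return best, best_len

def extend_seed(query, subject, q_pos, s_pos, k=3, x_drop=10):
    """Extend a length-k seed in both directions without gaps.

    Returns (HSP score, HSP segment of the query)."""
    seed = sum(pair_score(query[q_pos + i], subject[s_pos + i]) for i in range(k))
    right, r_len = extend_one_way(query, subject, q_pos + k, s_pos + k, +1, x_drop)
    left, l_len = extend_one_way(query, subject, q_pos - 1, s_pos - 1, -1, x_drop)
    return seed + left + right, query[q_pos - l_len:q_pos + k + r_len]

print(extend_seed("METPQGIAV", "PQGELV", 3, 0))    # seed PQG, nothing gained by extending
print(extend_seed("ATPQGELV", "STPQGELW", 2, 2))   # seed PQG grows to TPQGEL
```

Keeping only the best-scoring prefix in each direction is what trims trailing mismatches from the reported HSP.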
Advantages of BLAST
• The BLAST algorithm was written balancing speed and
increased sensitivity for finding distant sequence relationships.
• Speed is achieved by:
1. Pre-indexing the database before the search
2. Parallel processing
3. Hash table that contains neighborhood words rather than just random words.

• BLAST emphasizes regions of local alignment to detect
relationships among sequences having isolated regions of
similarity between them.

BLAST for Nucleotides and Proteins
• Nucleotides
• blastn
• Compares a nucleotide query sequence against a nucleotide sequence
database.

• Proteins
• blastp
• Compares an amino acid query sequence against a protein sequence
database.

Comparing an unknown nucleotide sequence with possible “protein” sequences!!
• blastx
> but what about the 6 possible reading frames?

• Compares a nucleotide query sequence translated in all reading
frames against a protein sequence database.

• This option may be used to find potential translation products of
an unknown nucleotide sequence.
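
The six reading frames come from three offsets on the given strand plus three on its reverse complement. A minimal sketch of the translation step, assuming the standard genetic code; the frame labels (+1 to -3) and function names are illustrative.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def reverse_complement(dna):
    """Complement each base, then reverse the strand."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def translate(dna):
    """Translate complete codons; unknown codons (e.g. containing N) become X."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))

def six_frame_translations(dna):
    """Return the translation of each of the six reading frames."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames

frames = six_frame_translations("ATGGCCTAA")
for label, protein in frames.items():
    print(label, protein)
```

Each of the six translated strings is then compared against the protein database as if it were a separate protein query.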

How about the reverse of blastx?
• tblastn

• Compares a protein query sequence against a nucleotide
sequence database dynamically translated in all reading
frames.

Comparing all translated reading frames of a nucleotide sequence with
all translated reading frames of a nucleotide DB
• tblastx

• Compares the six-frame translations of a nucleotide query
sequence against the six-frame translations of a nucleotide
sequence database.
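
The five programs differ only in the type of query, the type of database, and whether nucleotide sequences are translated before comparison. The lookup below summarizes that; the table layout and the choose_program helper are illustrative and not part of BLAST itself.

```python
# (query type, database type, compare as translated protein) -> program name
BLAST_PROGRAMS = {
    ("nucleotide", "nucleotide", False): "blastn",
    ("protein", "protein", False): "blastp",
    ("nucleotide", "protein", True): "blastx",
    ("protein", "nucleotide", True): "tblastn",
    ("nucleotide", "nucleotide", True): "tblastx",
}

def choose_program(query_type, db_type, translated=None):
    """Pick a BLAST program name.

    `translated` defaults to True whenever query and database types differ,
    because a nucleotide sequence must be translated to be compared with protein.
    """
    if translated is None:
        translated = query_type != db_type
    try:
        return BLAST_PROGRAMS[(query_type, db_type, translated)]
    except KeyError:
        raise ValueError(
            f"no BLAST program for {query_type} vs {db_type}, translated={translated}"
        ) from None

print(choose_program("nucleotide", "protein"))                       # blastx
print(choose_program("nucleotide", "nucleotide", translated=True))   # tblastx
```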

Getting started with BLAST
http://www.ncbi.nlm.nih.gov/
http://www.ncbi.nlm.nih.gov/BLAST/
and
http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/information3.html

So what if we find the Alien Gene in GenBank?
• Homologs
• Features (including DNA and protein sequences) in species being compared that are similar
because they are ancestrally related

• Homologs can be either Orthologs or Paralogs

• Orthologs
• Homologous genes (or any DNA sequences) that separated because of a speciation event
• Derived from the same gene in the last common ancestor

• Paralogs
• Homologous genes that separated because of gene duplication events within the same species
