Slides 3

Uploaded by

Phlip Ong
Copyright
© All Rights Reserved
So Many Choices, So Little Money:

systematic assignment of proteins to functional classes

With the human genome finished, it is unclear where to focus the efforts.

Figure labels: we have been here; we are heading here; where next?


The three layers of genome annotation: where, what
and how?
If gene hunting is easy (which is not universally
accepted) then assignment of gene/protein function
is not!
A number of observations may make the job easier.

• Proteins contain a variety of different functional
domains
• The evolution of genes can add different domains
and result in the generation of novel functional units
• Proteins exist in families which are conserved
amongst species
Protein Domains

• A domain is an independent structural unit
which can be found alone or in conjunction
with other domains or repeats.

• Module = mobile domain.

• Different domains have distinct functions.

• Many eukaryotic proteins have multiple
domains.
Protein Domains

Figure: PX domain with ligand; SH3 domain with ligand
• Databases of protein sequences are growing
rapidly because of genome sequencing
projects.
• Many new proteins belong to protein families
with known functions (significant sequence
similarity).
• Only a small fraction of known proteins have
functions determined by experiment.
• Databases providing computational sequence
analysis allow us to classify new proteins to
known families, and thus (potentially)
determine their function.
There are multiple tools to allow such
analysis
SMART (Simple Modular Architecture
Research Tool)
• There are over 600 domain families.
• Provides information about:
– function.
– subcellular localization.
– phyletic distribution.
– tertiary structure.
• Based on HMMs (Hidden Markov Models).
Domain Architecture
Protein: PA-3427CG
Species: Drosophila melanogaster

Protein: ENSMUSP00000023109
Species: Mus musculus

Protein: ENSANGP00000009529
Species: Anopheles gambiae
PROSITE - database of protein families
and domains
• Database of biologically significant sites and patterns.
Contains 1,609 profiles.
• Pattern – conserved sequence of a few amino acids.
• Identifies to which known family of proteins (if any) the
new sequence belongs.
• Used to determine the function of uncharacterized proteins
translated from genomic or cDNA sequences.
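
As a rough illustration of how a short conserved pattern can be matched against a new sequence, the sketch below translates a PROSITE-style pattern string into a regular expression and scans a sequence with it. It is a minimal sketch under stated assumptions, not the PROSITE scanning software: the function names are hypothetical, the C2H2 zinc-finger-style pattern and the test sequence are used only for illustration, and only the common syntax elements (x, [..], {..}, repeat counts, < and > anchors) are handled.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression.

    Hypothetical helper: supports x, [ABC], {ABC}, (n), (n,m), '<' and '>'.
    """
    pieces = []
    for element in pattern.strip().rstrip(".").split("-"):
        prefix = suffix = ""
        if element.startswith("<"):      # pattern anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):        # pattern anchored at the C-terminus
            suffix, element = "$", element[:-1]
        # Split the residue specification from an optional repeat count.
        match = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = match.groups()
        if core == "x":                  # x means any amino acid
            core = "."
        elif core.startswith("{"):       # {P} means any residue except P
            core = "[^" + core[1:-1] + "]"
        # [LIVM] is already a valid regular-expression character class.
        if low and high:
            core += "{%s,%s}" % (low, high)
        elif low:
            core += "{%s}" % low
        pieces.append(prefix + core + suffix)
    return "".join(pieces)


def scan(sequence, pattern):
    """Return (1-based start position, matched text) for each pattern hit."""
    compiled = re.compile(prosite_to_regex(pattern))
    return [(m.start() + 1, m.group()) for m in compiled.finditer(sequence)]


# Illustrative pattern and made-up sequence (assumptions, not database entries).
ZINC_FINGER_LIKE = "C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H"
print(prosite_to_regex(ZINC_FINGER_LIKE))
print(scan("AACPECGKSFSRSDELTRHIRIHTG", ZINC_FINGER_LIKE))
```

A hit only says that the sequence contains the motif; whether the protein really belongs to the family still depends on the rest of the evidence.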
The evolution of genes can add different domains and
result in the generation of novel functional units.
Direct comparisons of homologous sequences
between different species can aid in the understanding
of protein classes.

• One of the most powerful approaches to predict the exact function of
a protein is to find its characterised ortholog from a different species

• An “Ortholog” is a homologous sequence from a different species
that arose from a common ancestral gene; it may or may not have
a similar function

• A “Paralog” is a homologous sequence that arose by gene duplication
and diverged within a single species
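
One common heuristic for proposing orthologs between two species is the reciprocal best hit: protein a in species A and protein b in species B are paired when each is the other's top-scoring match. The slides do not name this method, so the sketch below is an assumption about how the idea could be put into practice. The function names, protein identifiers and similarity scores are all hypothetical; in practice the scores would come from a sequence similarity search.

```python
def best_hits(scores):
    """Map each query to its highest-scoring subject.

    scores: dict of (query, subject) -> similarity score (hypothetical values).
    """
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return sorted (a, b) pairs in which a and b are each other's best hit."""
    forward = best_hits(a_vs_b)
    backward = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if backward.get(b) == a)


# Made-up scores: hA/mA and hB/mB are mutual best hits. mA2 behaves like a
# paralog of mA: its best human hit is hA, but hA prefers mA.
human_vs_mouse = {("hA", "mA"): 950, ("hA", "mA2"): 610,
                  ("hB", "mB"): 700, ("hC", "mA2"): 300}
mouse_vs_human = {("mA", "hA"): 940, ("mA2", "hA"): 600,
                  ("mB", "hB"): 710, ("mA2", "hC"): 280}

print(reciprocal_best_hits(human_vs_mouse, mouse_vs_human))
```

Reciprocal best hits give candidate 1:1 orthologs only. Gene duplication or loss in either lineage can make the pairing wrong or incomplete.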
For example, most rodent genes have a
human counterpart

1:1 orthologues: ~80%
Other homologues: ~20%
Non-homologues: <1%
The number of proteins varies between
species
The transcription factor families shown are the largest of
their category out of the 1,502 human protein families
In order to extract the maximum amount of information
from the rapidly accumulating genome sequences, all
conserved genes need to be classified according to their
homologous relationships.

• Comparison of proteins encoded in complete genomes allowed the
delineation of clusters of orthologous groups (COGs).
• Each COG consists of individual orthologous proteins or orthologous
sets of paralogs from at least three lineages. Orthologs typically have the
same function, allowing transfer of functional information from one
member to an entire COG.
• This relation automatically yields a number of functional predictions for
poorly characterized genomes.
• The COGs comprise a framework for functional and evolutionary
genome analysis.
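
The grouping and annotation-transfer idea can be sketched in a few lines. This is a deliberate simplification and not the published COG procedure: it merges pairwise ortholog links into clusters with a union-find structure, keeps clusters that span at least three lineages, and transfers a function only when the characterised members agree. All protein identifiers, lineages and annotations are hypothetical.

```python
from collections import defaultdict


def cluster_orthologs(pairs, lineage_of, min_lineages=3):
    """Merge ortholog pairs into clusters and keep those spanning
    at least min_lineages distinct lineages."""
    parent = {}

    def find(item):
        parent.setdefault(item, item)
        while parent[item] != item:
            parent[item] = parent[parent[item]]   # path halving
            item = parent[item]
        return item

    for a, b in pairs:
        parent[find(a)] = find(b)

    groups = defaultdict(set)
    for protein in list(parent):
        groups[find(protein)].add(protein)

    return sorted(
        sorted(members)
        for members in groups.values()
        if len({lineage_of[p] for p in members}) >= min_lineages
    )


def transfer_annotation(cluster, known_function):
    """Return the single function shared by annotated members, else None."""
    labels = {known_function[p] for p in cluster if p in known_function}
    return labels.pop() if len(labels) == 1 else None


# Hypothetical input: the p1 proteins span three lineages, the p2 proteins two.
pairs = [("ecoli_p1", "bsub_p1"), ("bsub_p1", "yeast_p1"), ("ecoli_p2", "bsub_p2")]
lineage_of = {"ecoli_p1": "E. coli", "ecoli_p2": "E. coli",
              "bsub_p1": "B. subtilis", "bsub_p2": "B. subtilis",
              "yeast_p1": "yeast"}
known_function = {"ecoli_p1": "DNA repair"}

for cluster in cluster_orthologs(pairs, lineage_of):
    print(cluster, "->", transfer_annotation(cluster, known_function))
```

Returning None when annotated members disagree is a conservative choice: a transferred function is a prediction, not an observation.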
A functional and phylogenetic
breakdown of the COGs.
Each column shows a COG; a
double streak indicates that two or
more paralogs from the given
species belong to the particular
COG.
Better (and improving) organisation of
data from multiple sources allows a more
complete understanding of genomic
information.

This will better allow for functional analysis

The Gene Ontology (GO) project is a collaborative effort to
address the need for consistent descriptions of gene
products in different databases.

The GO project has developed three structured controlled
vocabularies (ontologies) that describe gene products in
terms of their associated biological processes, cellular
components and molecular functions in a species-
independent manner.

The use of GO terms by collaborating databases facilitates
uniform queries across them.

As an example, you can use GO to find all the gene products
in the mouse genome that are involved in signal
transduction, or you can zoom in on all the receptor tyrosine
kinases.
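
The broad-versus-narrow query described above works because the vocabulary is hierarchical: a gene product annotated with a specific term is also implicitly annotated with every more general term above it. The sketch below shows that idea on a toy hierarchy. The term labels are simplified placeholders rather than real GO identifiers, the gene names are hypothetical, and the functions are not part of any GO software.

```python
def descendants(term, parents_of):
    """Return the term itself plus every more specific term beneath it."""
    children_of = {}
    for child, parents in parents_of.items():
        for parent in parents:
            children_of.setdefault(parent, set()).add(child)
    seen, stack = {term}, [term]
    while stack:
        for child in children_of.get(stack.pop(), ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def genes_for(term, parents_of, annotations):
    """Genes annotated with the term or with any of its descendants."""
    wanted = descendants(term, parents_of)
    return sorted(gene for gene, terms in annotations.items()
                  if wanted & set(terms))


# Toy hierarchy (child -> parents); labels are placeholders, not GO IDs.
parents_of = {
    "receptor tyrosine kinase signalling": ["signal transduction"],
    "G protein-coupled receptor signalling": ["signal transduction"],
    "signal transduction": ["biological process"],
    "DNA repair": ["biological process"],
}
# Hypothetical annotations of three gene products.
annotations = {
    "geneA": ["receptor tyrosine kinase signalling"],
    "geneB": ["G protein-coupled receptor signalling"],
    "geneC": ["DNA repair"],
}

print(genes_for("signal transduction", parents_of, annotations))
print(genes_for("receptor tyrosine kinase signalling", parents_of, annotations))
```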
There are many practical examples of the
use of such analysis
The malaria genome — and beyond
Nature 419, 512 - 519 (2002)
Species of malaria parasite that infect rodents have long been used
as models for malaria disease research. This study reported the
whole-genome shotgun sequence of one species, Plasmodium yoelii
yoelii, and comparative studies with the genome of the human malaria
parasite Plasmodium falciparum clone 3D7. A synteny map of 2,212 P.
y. yoelii contiguous DNA sequences (contigs) aligned to 14 P.
falciparum chromosomes reveals marked conservation of gene
synteny within the body of each chromosome.
Of about 5,300 P. falciparum genes, more than 3,300 P. y. yoelii
orthologues of predominantly metabolic function were identified.

This was the first genome sequence of a model eukaryotic parasite,
and it provides insight into the use of such systems in the modelling
of Plasmodium biology and disease.
A proteomic view of the Plasmodium falciparum life
cycle, Nature 419, 520–526 (2002)
Sporozoite, merozoite, trophozoite and gametocyte preparations were
lysed, digested and analysed independently by tandem mass
spectrometry (MS/MS).
Data sets from blood stages were searched against a database
containing both P. falciparum protein sequences and 24,006 ORFs
from the human, mouse and rat RefSeq NCBI databases.

Functional profiles of expressed proteins.

Functional classification comparison between P.
falciparum and P. y. yoelii proteins.
Protein Structure Prediction and Structural Genomics

Understanding the biological role of proteins will require
knowledge of their structure and function.

Although experimental structure determination methods are
providing high-resolution structure information about a subset
of the proteins, computational structure prediction methods
will provide valuable information for the large fraction of
sequences whose structures will not be determined
experimentally.

This can be done in a high-throughput manner

http://protein.gsc.riken.go.jp/

Aim
The aim of the PRG research is to experimentally obtain three-
dimensional (3D) protein structures and their molecular
functions on an equivalent scale to the genome sequencing
projects. In our project, the research focus will be shifted back to
biologists, to elucidate the cellular functions, or to chemists, who
will promote drug discovery programs based on information
regarding the active-site geometries for drug design.
Protein Structure Prediction and Structural Genomics

The first class of protein structure prediction methods,
including threading and comparative modelling, relies on
detectable similarity spanning most of the modelled
sequence and at least one known structure.

The second class of methods, de novo or ab initio
methods, predicts the structure from sequence alone,
without relying on similarity at the fold level between the
modelled sequence and any of the known structures.
Modelling protein structures as a functional genomics tool

The first step in modelling of a protein sequence is to attempt to find related
known protein structures in the Protein Data Bank for as many domains in the
modelled sequence as possible (fold recognition or fold assignment).
The folds of domains in the target sequence can be assigned by pairwise and
multiple sequence similarity searches as well as by threading methods that
rely explicitly on the known structures of the candidate template proteins.
We used a structure prediction service to analyse the
PLUNC family

3D-PSSM Web Server V 2.6.0
A fast, web-based method for protein fold recognition using 1D and 3D
sequence profiles coupled with secondary structure and solvation
potential information.
Fold library last updated: Wed Oct 9 06:00:00 2002 (7733 structures)

http://www.sbg.bio.ic.ac.uk/~3dpssm/
PLUNC proteins are predicted to be structurally similar to
BPI and LBP
All PLUNCs on the BPI x-ray structure

This analysis suggests that PLUNCs retain the hydrophobic pockets seen in
BPI and may therefore be able to interact with bacterial lipid-containing
products, for example lipopolysaccharide (LPS). They may be either pro- or
anti-inflammatory.
Applications of comparative modeling.

The potential uses of a comparative
model depend on its accuracy. This in
turn depends significantly on the
sequence identity between the
modeled sequence and the known
structure on which the model was
based. Sample models and
corresponding experimental
structures are shown on the right.
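
Sequence identity between the modeled sequence and its template can be read directly from their alignment. The sketch below is one simple way to compute it, assuming the two sequences are already aligned and gaps are written as "-". Several conventions exist for the denominator; this one counts only columns where both sequences have a residue, which is an assumption rather than a rule taken from the slides. The function name and the example sequences are hypothetical.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity over alignment columns with no gap in either sequence."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    columns = [(a, b) for a, b in zip(aligned_a, aligned_b)
               if a != gap and b != gap]
    if not columns:
        return 0.0
    matches = sum(a == b for a, b in columns)
    return 100.0 * matches / len(columns)


# Made-up aligned fragments of a target sequence and a template sequence.
target = "MKT-LLV"
template = "MRTALLI"
print(round(percent_identity(target, template), 1))
```

The lower this value, the more cautiously the resulting model should be used.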
Multiple types of interacting technologies can be
used to practically assign potential function
Systematic assignment of gene function is still an
evolving art form.

All of the computational techniques will still only give an
indication of the function of a gene.

At the end of the day it is still an absolute requirement to directly
show that an individual protein exhibits the expected/predicted
function.
