Sanchez Bioinformatics 1999

ModBase is a database of comparative protein structure models generated automatically from protein sequences. It contains models for segments of more than 17,000 proteins from 12 organisms. Each model is evaluated by estimating the probability that it overlaps sufficiently with the actual structure, which identifies models reliable enough for predicting protein structure and function.

BIOINFORMATICS APPLICATIONS NOTE
Vol. 15 no. 12 1999, Pages 1060–1061

ModBase: A database of comparative protein structure models
Roberto Sánchez and Andrej Šali
Laboratories of Molecular Biophysics, The Pels Family Center for Biochemistry and
Structural Biology, The Rockefeller University, 1230 York Ave, New York, NY 10021,
USA

Received on March 16, 1999; revised on June 9, 1999; accepted on June 24, 1999

© Oxford University Press 1999

Abstract

Summary: ModBase is a database of evaluated and annotated comparative protein structure models. The database also includes fold assignments and alignments on which the models were based.
Availability: ModBase is accessible on the Web at http://guitar.rockefeller.edu/modbase. Models for yeast proteins are also accessible through links from the Sacch3D database at http://genome-www.stanford.edu/Sacch3D.
Contact: [email protected]; http://guitar.rockefeller.edu/

The native three-dimensional (3D) structure of a protein is valuable in testing, understanding, and modifying protein function. While 3D structures of only a tiny fraction of known protein sequences (Benson et al., 1999) have been defined experimentally (Abola et al., 1987), comparative modeling can frequently provide a useful 3D model of a protein (Johnson et al., 1994; Sánchez and Šali, 1997b). Despite its usefulness, comparative modeling is still not a common sequence analysis tool for the biologist, partly due to the lack of easy access to reliable and evaluated models. The SWISS-MODEL (Guex et al., 1999) database of comparative models attempts to resolve this problem, as does the ModBase database described in this paper.

ModBase is a database of annotated comparative protein structure models. The models consist of coordinates for all non-hydrogen atoms in the modeled part of a protein. Models are generated entirely automatically in a four-step procedure (Sánchez and Šali, 1998, 1999): (i) fold assignment, (ii) sequence–structure alignment, (iii) model building, and (iv) model evaluation. This procedure can be applied to thousands of protein sequences, including complete genomes and large protein sequence databases. In the fold assignment step, each sequence from a genome is compared with a non-redundant set of proteins of known 3D structure (Abola et al., 1987). This is achieved by an iterative sequence similarity search with the program PSI-BLAST (Altschul et al., 1997). In the second step, the matching parts of a given protein sequence and a related known protein structure are aligned by the ALIGN2D command of MODELLER (Sánchez and Šali, in preparation). This procedure places gaps in a structurally reasonable context. In the third step, all the pairwise sequence–structure alignments are used individually to build 3D models for the matched parts of the protein sequences by the program MODELLER (Šali and Blundell, 1993; Sánchez and Šali, 1997a). The fourth step, evaluation of models, is discussed below.

To assess the value of 3D protein models, it is essential to estimate their overall accuracy (Lüthy et al., 1992; Sippl, 1993; Sánchez and Šali, 1997b). In the fold assignment step of the pipeline, a relatively permissive cutoff is used for selecting known protein structures for model building. This results in fewer missed hits, but it also increases the number of false fold assignments and the number of mistakes in alignments. Fold assignment errors begin to appear when relatively dissimilar template–target sequences are matched (i.e. <30% sequence identity). In addition, even if the fold is assigned correctly, errors in the alignment may still result in a bad model. Alignment errors can be significant when the sequence identity drops below 35%. A reliable model is obtained only if both the correct fold assignment and an approximately correct alignment are made. The overall accuracy of a model is measured by the overlap between the model and the actual structure. The overlap is defined as the fraction of residues whose Cα atoms are within 3.5 Å of each other in the globally superposed pair of structures. Models that overlap with the correct structures in more than 30% of their residues are defined here as ‘good’ models. Such models are likely to have a correct fold, which is frequently sufficient for coarse prediction of protein function (Orengo et al., 1994). A method for calculating the probability that a given model is good, pG, was developed (Sánchez and Šali, 1998) and is used to evaluate all the models in ModBase. The database currently contains models for segments of more than 17,000 proteins in Saccharomyces cerevisiae, Mycoplasma genitalium, Caenorhabditis elegans, Escherichia coli, Methanobacterium thermoautotrophicum, Synechocystis sp., Pyrococcus horikoshii, Methanococcus jannaschii, Haemophilus influenzae, Aquifex aeolicus, Mycoplasma pneumoniae and Sulfolobus solfataricus (Table 1).

Table 1. Contents of ModBase

Organism                              Proteins with models (a)   Models (b)   % of organism proteins with models   % of organism residues modeled
Saccharomyces cerevisiae              2587                       4484         42                                   20
Mycoplasma genitalium                 216                        280          45                                   29
Caenorhabditis elegans                7900                       13523        39                                   22
Escherichia coli                      1625                       2560         38                                   27
Methanobacterium thermoautotrophicum  663                        1125         21                                   19
Synechocystis sp.                     1000                       1670         38                                   25
Pyrococcus horikoshii                 611                        946          30                                   24
Methanococcus jannaschii              630                        987          36                                   28
Haemophilus influenzae                670                        1217         40                                   30
Aquifex aeolicus                      665                        1063         44                                   31
Mycoplasma pneumoniae                 244                        297          18                                   16
Sulfolobus solfataricus               301                        579          30                                   26

(a) The number of proteins that have at least one segment modeled reliably. Whether or not a model is reliable is predicted as described briefly in the text, and in more detail in Sánchez and Šali (1998).
(b) The number of models calculated for the genome. This number is larger than the number of proteins modeled because many proteins have independently calculated models for the same domain in the protein, as well as independently calculated models for different domains in the same protein.

The database is searchable by protein names, keywords, template structure, organism, model reliability, model size, target–template sequence identity, and alignment significance. It is also possible to search for sequence similarities to the model sequences using BLAST (Altschul et al., 1997). Searching produces a table of models satisfying all search criteria. The table lists the modeled regions of the target proteins, the templates used to construct the models, target–template similarities, and model reliabilities. For each model, it also includes links to a more detailed description of the model, a summary of all models for a given protein, and the PDB database (Abola et al., 1987) for a detailed description of the template structure used in modeling. The model description page contains a schematic representation of the target–template alignment and links to the template fold entries in the CATH database (Orengo et al., 1999). In addition, it links to the model coordinates in the PDB format, the target–template alignment used to derive the model, and a display of the model in the 3D visualization program RasMol (Sayle and Milner-White, 1995).

In the future, ModBase will grow to reflect (i) the growth of the sequence databases, (ii) the growth of the database of known protein structures, and (iii) improvements in the software for calculating the models. It is expected that the SWISS-PROT+TrEMBL protein sequence databases (Bairoch and Apweiler, 1999) and various EST databases will be processed by the end of 1999.

Acknowledgments

We are grateful to Dr. Steve A. Chervitz for making links from SGD to ModBase and Paul de Bakker for help in implementing the WWW interface. RS is a Howard Hughes Medical Institute predoctoral fellow. AŠ is a Sinsheimer Scholar and an Alfred P. Sloan Research Fellow. The project has also been aided by grants from NIH (GM 54762) and NSF (BIR-9601845).

References

Abola,E.E., Bernstein,F.C., Bryant,S.H., Koetzle,T. and Weng,J. (1987) Protein data bank. In Allen,F.H., Bergerhoff,G. and Sievers,R. (eds), Crystallographic Databases - Information, Content, Software Systems, Scientific Applications. Data Commission of the International Union of Crystallography, Bonn/Cambridge/Chester, pp. 107–132.
Altschul,S.F., Madden,T.L., Schaffer,A.A., Zhang,J.Z., Miller,W. and Lipman,D.J. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25, 3389–3402.
Bairoch,A. and Apweiler,R. (1999) The SWISS-PROT protein sequence data bank and its supplement TrEMBL in 1999. Nucleic Acids Res., 27, 49–54.
Benson,D.A., Boguski,M.S., Lipman,D.J., Ostell,J., Ouellette,B.F.F., Rapp,B.A. and Wheeler,D.L. (1999) GenBank. Nucleic Acids Res., 27, 12–17.
Guex,N., Diemand,A. and Peitsch,M.C. (1999) Protein modelling for all. Trends Biochem. Sci., 24, 364–367.
Johnson,M.S., Srinivasan,N., Sowdhamini,R. and Blundell,T.L. (1994) Knowledge-based protein modelling. CRC Crit. Rev. Biochem. Mol. Biol., 29, 1–68.
Lüthy,R., Bowie,J.U. and Eisenberg,D. (1992) Assessment of protein models with three-dimensional profiles. Nature, 356, 83–85.
Orengo,C.A., Jones,D.T. and Thornton,J.M. (1994) Protein superfamilies and domain super-folds. Nature, 372, 631–634.
Orengo,C.A., Pearl,F.M.G., Bray,J.E., Todd,A.E., Martin,A.C., Conte,L.L. and Thornton,J.M. (1999) The CATH database provides insights into protein structure/function relationship. Nucleic Acids Res., 27, 275–279.
Šali,A. and Blundell,T.L. (1993) Comparative protein modelling by satisfaction of spatial restraints. J. Mol. Biol., 234, 779–815.
Sánchez,R. and Šali,A. (1997a) Evaluation of comparative protein structure modeling by MODELLER-3. Proteins, Suppl. 1, 50–58.
Sánchez,R. and Šali,A. (1997b) Advances in comparative protein-structure modeling. Curr. Opin. Struct. Biol., 7, 206–214.
Sánchez,R. and Šali,A. (1998) Large-scale protein structure modeling of the Saccharomyces cerevisiae genome. Proc. Natl Acad. Sci. USA, 95, 13597–13602.
Sánchez,R. and Šali,A. (1999) Comparative protein structure modeling in genomics. J. Comp. Phys., 151, 388–401.
Sayle,R. and Milner-White,E.J. (1995) RasMol: Biomolecular graphics for all. Trends Biochem. Sci., 20, 374.
Sippl,M.J. (1993) Recognition of errors in three-dimensional structures of proteins. Proteins, 17, 355–362.
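
As an illustration of the overlap measure used for model evaluation (the fraction of residues whose Cα atoms lie within 3.5 Å of each other after global superposition, with more than 30% overlap defining a ‘good’ model), the following Python sketch computes it on toy data. It is not the authors' implementation. The function names `overlap` and `is_good`, the toy coordinates, the reading of "within" as "less than or equal to", and the assumption that the two coordinate lists are already superposed and paired residue by residue are all choices made here for illustration only.

```python
import math

CUTOFF_ANGSTROM = 3.5  # C-alpha distance cutoff quoted in the text
GOOD_OVERLAP = 0.30    # overlap above which a model is called 'good' in the text


def overlap(model_ca, actual_ca, cutoff=CUTOFF_ANGSTROM):
    """Fraction of paired residues whose C-alpha atoms lie within `cutoff`.

    Both arguments are equal-length lists of (x, y, z) tuples for equivalent
    residues. Assumption: the structures are already globally superposed.
    """
    if len(model_ca) != len(actual_ca):
        raise ValueError("coordinate lists must pair up residue by residue")
    if not model_ca:
        raise ValueError("at least one residue pair is required")
    # Assumption: 'within 3.5 A' is read as distance <= cutoff.
    close = sum(
        1 for m, a in zip(model_ca, actual_ca) if math.dist(m, a) <= cutoff
    )
    return close / len(model_ca)


def is_good(overlap_fraction, threshold=GOOD_OVERLAP):
    """True if the overlap is strictly greater than the 'good' threshold."""
    return overlap_fraction > threshold


# Toy coordinates, made up for illustration (not taken from ModBase).
model = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
actual = [(0.5, 0.0, 0.0), (3.8, 1.0, 0.0), (7.6, 6.0, 0.0), (11.4, 0.0, 9.0)]

frac = overlap(model, actual)
print(f"overlap = {frac:.2f}, good = {is_good(frac)}")
```

In the toy data, two of the four residue pairs are closer than the cutoff, so the overlap is 0.5 and the model counts as ‘good’. The sketch leaves out the superposition step and does not attempt to reproduce pG, which estimates the probability that a model is good when the actual structure is unknown.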
