Protein Databases

PROTEIN DATABASES AND BIOINFORMATICS
1. Protein databases have become a crucial part of modern biology. Huge
amounts of data for protein structures, functions, and particularly
sequences are being generated.
2. Searching databases is often the first step in the study of a new protein.
Comparison between proteins or between protein families provides
information about the relationship between proteins within a genome or
across different species, and hence offers much more information than
can be obtained by studying only an isolated protein.
3. In addition, secondary databases derived from experimental databases
are also widely available. These databases reorganize and annotate the
data or provide predictions.
4. The use of multiple databases often helps researchers understand the
structure and function of a protein. Although some protein databases are
widely known, they are far from being fully utilized in the protein science
community.
1. Protein databases owe much of their reach to the Internet. Unlike
traditional media, such as the CD-ROM, the Internet allows databases to
be maintained easily and updated frequently at minimal cost. Even
researchers with limited resources can afford to set up their own
databases and disseminate their data quickly.
2. Users worldwide can easily access the most up-to-date version through a
user-friendly interface. Most protein databases have interactive search
engines so that users can specify their needs and obtain the related
information interactively.
3. Many protein databases also allow submitters to deposit data, and
database servers can check the format of the data and provide
immediate feedback.
PROTEIN SEQUENCE DATABASES
• Thanks to the Human Genome Project and other sequencing efforts, new sequences have been generated at a prodigious
rate. These sequences provide a rich information source and are the core of the revolutionary movement toward
“large-scale biology.” Protein sequences can be computationally derived and annotated from these genomic sequences.
Various databases contain protein sequences with different focuses.
• Among all protein sequence databases, UniProt (UniProt Consortium, 2011) is the most widely used. It provides more
annotations than any other sequence database while keeping redundancy to a minimum through manual curation and
integration with other databases. UniProt has three components: (1) the UniProt Knowledgebase (UniProtKB), comprising
Swiss-Prot (manually annotated and reviewed) and TrEMBL (automatically annotated) (Bairoch and Apweiler, 1999);
(2) UniRef (sequence clusters for fast sequence similarity searches); and (3) UniParc (a sequence archive that keeps
track of sequences and their identifiers).
• In addition to Swiss-Prot and TrEMBL, UniProtKB includes information from the Protein Sequence Database (PSD) of the
Protein Information Resource (PIR; Barker et al., 1999), which builds a complete and non-redundant database from a
number of protein and nucleic acid sequence databases together with bibliographic and annotated information.
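The Swiss-Prot/TrEMBL distinction is visible in the identifier of each UniProtKB sequence record. The sketch below assumes the common UniProtKB FASTA header layout, in which the identifier starts with `sp` (Swiss-Prot, reviewed) or `tr` (TrEMBL, unreviewed), followed by the accession and entry name separated by `|`. The function name and the second header are hypothetical and serve only as illustration.

```python
from typing import NamedTuple


class UniProtHeader(NamedTuple):
    section: str      # "Swiss-Prot" or "TrEMBL"
    accession: str
    entry_name: str


# Two-letter prefixes as they appear in UniProtKB FASTA headers.
_SECTIONS = {"sp": "Swiss-Prot", "tr": "TrEMBL"}


def parse_uniprot_header(line: str) -> UniProtHeader:
    """Parse a header such as '>sp|P69905|HBA_HUMAN Hemoglobin subunit alpha'."""
    if not line.startswith(">"):
        raise ValueError("not a FASTA header line")
    # The identifier is everything up to the first whitespace.
    identifier = line[1:].split(maxsplit=1)[0]
    prefix, accession, entry_name = identifier.split("|")
    if prefix not in _SECTIONS:
        raise ValueError(f"unknown UniProtKB prefix: {prefix!r}")
    return UniProtHeader(_SECTIONS[prefix], accession, entry_name)


reviewed = parse_uniprot_header(">sp|P69905|HBA_HUMAN Hemoglobin subunit alpha")
# Made-up TrEMBL-style header, for illustration only.
unreviewed = parse_uniprot_header(">tr|X00000|X00000_EXAMPLE Hypothetical protein")
print(reviewed.section, reviewed.accession)
print(unreviewed.section, unreviewed.accession)
```

Checking the section first is a quick way to judge how much manual review stands behind an annotation.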
PROTEIN SEQUENCE DATABASES CONTD….
• The National Center for Biotechnology Information (NCBI; http://www.ncbi.nlm.nih.gov) also provides rich
information and a number of useful tools for protein sequences. For example, the nr (non-redundant) protein
database is used for BLAST searches (Altschul et al., 1997). It includes entries from GenBank translations
(Benson et al., 1999), UniProt, PIR, the Protein Research Foundation (PRF) in Japan, and the Protein Data Bank
(PDB). Only entries with absolutely identical sequences are merged.
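The merging rule for nr can be illustrated with a few lines of code. This is only a sketch of the idea (group records by exact sequence), not NCBI's actual pipeline; the record identifiers and sequences are invented.

```python
def merge_identical(records):
    """Group (source_id, sequence) pairs so each distinct sequence appears once.

    Only exact matches are merged; a single differing residue keeps
    two entries separate.
    """
    merged = {}
    for source_id, sequence in records:
        key = sequence.upper()
        merged.setdefault(key, []).append(source_id)
    return merged


# Hypothetical records from three source databases.
records = [
    ("genbank:example_1", "MKTAYIAKQR"),
    ("uniprot:example_2", "MKTAYIAKQR"),  # identical to example_1: merged
    ("pdb:example_3", "MKTAYIAKQK"),      # last residue differs: kept separate
]

nr_like = merge_identical(records)
for sequence, sources in nr_like.items():
    print(sequence, sources)
```

Because near-identical sequences stay separate, a database built this way is non-redundant only in the strict sense of exact duplicates.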
• Most of the sequence databases have a sequence search tool and cross-references to entries of other protein and
gene databases. Many sequence databases, such as UniProt, also provide text searching using, for instance,
protein names or keywords. To study a new protein, the author recommends first performing a sequence search
using BLAST in nr if the protein sequence is available. The search often gives entry names in the protein
databases included in nr. Even when the protein is not found in nr, a homologous protein is likely to be
hit, which can often lead to some useful information, such as the function of the query protein.
• If the sequence of the query protein is unavailable, a text search in UniProt usually identifies the protein.
When a protein is present in UniProt, its entry is probably the richest single source of information about it.
However, some additional information may be found by checking other sequence databases.
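The recommended starting strategy can be written down as a small decision function. The sketch below is illustrative only: `plan_search` and `uniprot_text_search_url` are hypothetical helper names, and the endpoint and parameter names (`query`, `format`, `size`) are assumptions about the UniProt web interface that should be checked against the current UniProt documentation. The code only builds a URL and makes no network request.

```python
from urllib.parse import urlencode

# Assumed base URL for UniProtKB text searches; verify before relying on it.
UNIPROT_SEARCH = "https://rest.uniprot.org/uniprotkb/search"


def plan_search(sequence=None, name=None):
    """Pick the first search step following the strategy described above."""
    if sequence:
        return ("BLAST", "nr")        # sequence known: similarity search in nr
    if name:
        return ("text", "UniProt")    # only a name or keyword: text search
    raise ValueError("need either a sequence or a protein name")


def uniprot_text_search_url(name, size=5):
    """Build (but do not send) a UniProt text-search URL."""
    params = {"query": name, "format": "tsv", "size": size}
    return f"{UNIPROT_SEARCH}?{urlencode(params)}"


print(plan_search(sequence="MKTAYIAKQR"))
print(plan_search(name="hemoglobin alpha"))
print(uniprot_text_search_url("hemoglobin alpha"))
```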
PROTEIN SEQUENCE DATABASES CONTD….
• For example, the Kyoto Encyclopedia of Genes and Genomes (KEGG; Ogata et al., 1999)
annotates some gene entries with information about metabolic and regulatory pathways.
• One can also study proteins based on gene models (predicted protein sequences)
from many species-specific genome resources, such as Mouse Genome Database
(MGD, http://www.informatics.jax.org), FlyBase (a resource for Drosophila genes,
http://flybase.org), WormBase (a resource for C. elegans,
http://www.wormbase.org), Saccharomyces Genome Database (SGD,
http://www.yeastgenome.org), Arabidopsis Information Resource (TAIR,
http://www.arabidopsis.org), and Soybean Knowledge Base (SoyKB,
http://soykb.org/).
• Although sequences predicted by computational gene-finding tools in these
resources may contain errors, they cover a large number of proteins and are
often reliable enough to provide useful information.
• When the protein of interest is from a species that is not covered by any of these
databases, it is likely that some information can be retrieved from its homolog in
a model organism in one of the databases.
PROTEIN STRUCTURAL DATABASES
• Searching structure databases is becoming more and more popular in molecular
biology. The three-dimensional structures of proteins not only determine their
biological functions, but also hold a key to rational drug design.
• Traditionally, protein structures were solved in a low-throughput mode.
However, advances in technologies such as synchrotron radiation sources
and high-resolution nuclear magnetic resonance (NMR) have substantially
accelerated the rate of protein structure determination. The only international
repository for the processing and distribution of protein structures is the PDB
(Bernstein et al., 1977). The structures in the PDB were determined
experimentally by X-ray crystallography, NMR, electron microscopy, and other methods.
• Theoretical models have been removed from PDB, effective July 2, 2002, based
on the new PDB policy. The PDB also contains some structures of chemical
ligands and nucleotides.
PROTEIN STRUCTURAL DATABASES CONTD….
The PDB provides related information about the protein, such as secondary structure assignment and geometry. Each PDB
entry also links to a wide range of annotations from secondary databases, including:
(1) summary and display databases such as Structural Biology Knowledgebase (SBKB, http://sbkb.org), PISA (Protein
Interfaces, Surfaces and Assemblies; Krissinel and Henrick, 2007), Molecular Modelling Database (MMDB;
Marchler-Bauer et al., 1999) in Entrez, PDBsum (Laskowski et al., 1997), Jena Library of Biological Macromolecules
(JenaLib, http://www.fli-leibniz.de/IMAGE.html), PDBWiki (a community-annotated knowledge base of biological
molecular structures, http://pdbwiki.org), and Proteopedia (a collaborative 3D encyclopedia of proteins and other
molecules; Prilusky et al., 2011);
(2) domain annotation from SCOP (Murzin et al., 1995), CATH (Orengo et al., 1997), and Pfam (Finn et al., 2010);
(3) structure comparison to other proteins using various methods;
(4) the MEDLINE bibliography;
(5) protein movements recorded in the Database of Macromolecular Movements (MolMovDB; Gerstein and Krebs, 1998); and
(6) geometry analyses of the protein, such as CSU (Contacts of Structural Units; Sobolev et al., 1999) and CASTp
(identification of protein pockets and cavities; Liang et al., 1998).
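The geometry information mentioned above ultimately derives from the atomic coordinates stored in each PDB entry. As a minimal sketch, the code below reads two `ATOM` records in the legacy fixed-column PDB text format (atom name in columns 13-16, residue name in 18-20, chain in 22, residue number in 23-26, and x, y, z in 31-38, 39-46, 47-54) and computes the distance between them. The two records are invented for illustration, and real entries need more careful handling (alternate locations, multiple models, the newer mmCIF format).

```python
import math


def parse_atom(line):
    """Extract a few fields from one fixed-column ATOM record."""
    return {
        "name": line[12:16].strip(),
        "res_name": line[17:20].strip(),
        "chain": line[21],
        "res_seq": int(line[22:26]),
        "xyz": (float(line[30:38]), float(line[38:46]), float(line[46:54])),
    }


def distance(atom_a, atom_b):
    """Euclidean distance between two parsed atoms, in angstroms."""
    return math.dist(atom_a["xyz"], atom_b["xyz"])


# Two made-up alpha-carbon records placed 3.8 angstroms apart along x.
SAMPLE = [
    "ATOM      1  CA  ALA A   1       0.000   0.000   0.000",
    "ATOM      2  CA  GLY A   2       3.800   0.000   0.000",
]

ca1, ca2 = (parse_atom(line) for line in SAMPLE)
print(ca1["res_name"], ca2["res_name"], round(distance(ca1, ca2), 3))
```

Slicing by column rather than splitting on whitespace matters because adjacent fields in real files can run together without a separating space.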
OTHER PROTEIN STRUCTURAL DATABASES
• Other structure-related databases can also provide useful information. For example, pdbLight
(http://mufold.org/pdblight.php) integrates protein sequence and structure data from multiple sources for
protein structure prediction and analysis, together with predicted SCOP classification for the weekly
updated PDB structures.
• BioMagResBank (BMRB; University of Wisconsin, 1999) is a repository for NMR spectroscopy data on
proteins, peptides, and nucleic acids. In particular, it provides partial NMR data (e.g., chemical shifts)
before the full structure is solved.
• Protein Model Portal (PMP; Arnold et al., 2009) provides predicted structural models and their quality
assessments for a large number of proteins.
PROTEIN FAMILY DATABASES
• Proteins can be classified according to their sequence, evolutionary, structural, or functional
relationships. A protein in the context of its family is much more informative than the single
protein itself. For example, residues conserved across the family often indicate special functional
roles. When two proteins are classified in the same functional family, they may share similar
structures, even when their sequences do not have significant similarity.
• There is no unique way to classify proteins into families. Boundaries between different families
may be subjective. The choice of classification system depends in part on the problem; in general,
the author suggests looking into classification systems from different databases and comparing
them.
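The point about conserved residues can be made concrete with a toy alignment. The sketch below assumes the family members are already aligned to equal length, with `-` marking gaps, and reports the columns in which every sequence carries the same residue. The sequences are invented; real analyses usually score conservation more softly than this all-or-nothing test.

```python
def conserved_columns(alignment):
    """Return 0-based column indices where all aligned sequences agree."""
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("aligned sequences must have equal length")
    conserved = []
    for i in range(length):
        column = {seq[i] for seq in alignment}
        # A column counts as conserved only if it holds one residue and no gap.
        if len(column) == 1 and "-" not in column:
            conserved.append(i)
    return conserved


# Hypothetical three-member family, already aligned.
family = [
    "MKHCAG-L",
    "MRHCSGEL",
    "MQHCTGDL",
]

print(conserved_columns(family))
```

In this toy family the fully conserved positions would be the first candidates to examine for a functional role.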
• Three types of classification methods are widely adopted, based on similarity
of sequence, structure, or function.
• Sequence-based methods are applicable to any proteins whose sequences are known, while
structure-based methods are limited to proteins of known structure, and function-based
methods depend on the functions of the proteins having been annotated.
• Sequence- and structure-based classifications can be automated and are scalable to
high-throughput data, whereas function-based classification is typically carried out manually.
• Structure- and function-based methods are more reliable, while sequence-based methods
may yield false positives when sequence similarity is weak (i.e., two proteins are
classified into one family by chance rather than by any biological significance). In addition,
since protein structure and function are better conserved than sequence, two proteins having
similar structures or similar functions may not be identified through sequence-based
methods.
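The limits of sequence-based classification can be illustrated with a simple percent-identity check. This is a deliberately naive sketch: it assumes the two sequences are already aligned, the 30% cutoff is a hypothetical value chosen for illustration rather than a recommended threshold, and real methods rely on statistical scores rather than raw identity.

```python
def percent_identity(a, b):
    """Percent of identical residues over aligned, non-gap positions."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    matches = sum(x == y for x, y in pairs)
    return 100.0 * matches / len(pairs)


# Hypothetical cutoff, for illustration only.
HYPOTHETICAL_CUTOFF = 30.0


def same_family_by_sequence(a, b, cutoff=HYPOTHETICAL_CUTOFF):
    """Naive sequence-only family assignment."""
    return percent_identity(a, b) >= cutoff


close_pair = ("MKTAYIAKQR", "MKTAYLAKQK")   # 8 of 10 positions identical
distant_pair = ("MKTA", "GLSV")             # no identical positions

print(percent_identity(*close_pair), same_family_by_sequence(*close_pair))
print(percent_identity(*distant_pair), same_family_by_sequence(*distant_pair))
```

A pair like the second one would be rejected by any sequence cutoff even if the two proteins shared a fold or a function, which is the false-negative case described above. Lowering the cutoff to catch such pairs raises the chance of grouping unrelated proteins.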
OTHER DATABASES
Protein Modification Databases
Protein Localization Databases
Protein Binding Databases
Protein Energetics Databases
