Protein Databases
AND BIOINFORMATICS
1. Protein databases have become a crucial part of modern biology. Huge
amounts of data for protein structures, functions, and particularly
sequences are being generated.
2. Searching databases is often the first step in the study of a new protein.
Comparison between proteins or between protein families provides
information about the relationship between proteins within a genome or
across different species, and hence offers much more information than
can be obtained by studying only an isolated protein.
3. Users worldwide can easily access the most up-to-date version of a database
through a user-friendly interface. Most protein databases have interactive
search engines so that users can specify their needs and obtain the related
information interactively.
PROTEIN SEQUENCE DATABASES
• Thanks to the Human Genome Project and other sequencing efforts, new sequences have been generated at a prodigious
rate. These sequences provide a rich information source and are the core of the revolutionary movement toward
“large-scale biology.” Protein sequences can be computationally annotated from these genomic sequences. Various
databases contain protein sequences with different focuses.
• Among all protein sequence databases, UniProt (UniProt Consortium, 2011) is the most widely used. It provides more
annotations than any other sequence database, with a minimal level of redundancy, through human input or integration
with other databases. UniProt has three components: (1) the Protein Knowledgebase (UniProtKB), comprising Swiss-Prot
(manually annotated and reviewed) and TrEMBL (automatically annotated) (Bairoch and Apweiler, 1999); (2) UniRef
(sequence clusters for fast sequence similarity searches); and (3) UniParc (a sequence archive for keeping track of
sequences and their identifiers).
• In addition to Swiss-Prot and TrEMBL, UniProtKB includes information from the Protein Sequence Database (PSD) of the
Protein Information Resource (PIR; Barker et al., 1999), which builds a complete and non-redundant database from a
number of protein and nucleic acid sequence databases together with bibliographic and annotated information.
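The Swiss-Prot/TrEMBL distinction is visible in the FASTA files distributed by UniProtKB: header lines begin with ">sp|" for Swiss-Prot entries and ">tr|" for TrEMBL entries, followed by the accession and entry name. The following is a minimal sketch of reading that header, not an official parser; the function and class names are illustrative assumptions.

```python
import re
from typing import NamedTuple, Optional

# UniProtKB FASTA headers start with ">db|accession|entry_name", where db is
# "sp" for Swiss-Prot (reviewed) and "tr" for TrEMBL (unreviewed).
_HEADER = re.compile(r"^>(sp|tr)\|([^|]+)\|(\S+)\s*(.*)$")


class UniProtHeader(NamedTuple):  # illustrative container, not a UniProt type
    reviewed: bool       # True for Swiss-Prot, False for TrEMBL
    accession: str
    entry_name: str
    description: str


def parse_uniprot_header(line: str) -> Optional[UniProtHeader]:
    """Parse one UniProtKB FASTA header line; return None if it does not match."""
    match = _HEADER.match(line.strip())
    if match is None:
        return None
    db, accession, entry_name, rest = match.groups()
    # The free-text protein name ends where the "OS=" (organism) field begins.
    description = rest.split(" OS=", 1)[0]
    return UniProtHeader(db == "sp", accession, entry_name, description)
```

For a header such as ">sp|P69905|HBA_HUMAN Hemoglobin subunit alpha OS=Homo sapiens ...", the parsed record is marked as reviewed, which is a quick way to separate manually curated entries from automatically annotated ones.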
PROTEIN SEQUENCE DATABASES (CONTD.)
• The National Center for Biotechnology Information (NCBI; http://www.ncbi.nlm.nih.gov) also provides rich
information and a number of useful tools for protein sequences. For example, the nr protein database is used for
BLAST searches (Altschul et al., 1997). It includes entries from GenBank (Benson et al., 1999) translations,
UniProt, PIR, the Protein Research Foundation (PRF) in Japan, and the Protein Data Bank (PDB). Only entries with
absolutely identical sequences are merged.
• Most of the sequence databases have a sequence search tool and cross-references to entries of other protein and
gene databases. Many sequence databases, such as UniProt, also provide text searching using, for instance,
protein names or key words. To study a new protein, the author recommends first performing a sequence search
using BLAST in nr if the protein sequence is available. The search often gives entry names in the protein
databases included in nr. Even when the protein is not found in nr, it is likely that a homologous protein will be
hit, which can often lead to some useful information, such as the function of the query protein.
• If the sequence of the query protein is unavailable, a text search in UniProt usually identifies the protein.
When the protein is present in UniProt, UniProt is probably the best single source of information about it.
However, some additional information may be found by checking other sequence databases.
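The search strategy described above (sequence search in nr when a sequence is available, otherwise a text search in UniProt) can be sketched as a small helper that only builds the request URL and does not contact any server. The endpoints and parameter names below follow the NCBI BLAST URL API and the UniProt REST search interface as commonly documented; treat them as assumptions and check the current documentation before use. The function name plan_search is hypothetical.

```python
from urllib.parse import urlencode

# Assumed public endpoints; verify against current NCBI and UniProt documentation.
BLAST_URL = "https://blast.ncbi.nlm.nih.gov/Blast.cgi"
UNIPROT_SEARCH_URL = "https://rest.uniprot.org/uniprotkb/search"


def plan_search(sequence=None, protein_name=None):
    """Return (strategy, url): BLAST against nr if a sequence is given,
    otherwise a UniProt text search by protein name."""
    if sequence:
        # Remove whitespace and line breaks so the query is one residue string.
        query = "".join(sequence.split())
        params = {"CMD": "Put", "PROGRAM": "blastp",
                  "DATABASE": "nr", "QUERY": query}
        return "blast_nr", BLAST_URL + "?" + urlencode(params)
    if protein_name:
        params = {"query": protein_name, "format": "tsv", "size": 10}
        return "uniprot_text", UNIPROT_SEARCH_URL + "?" + urlencode(params)
    raise ValueError("need either a sequence or a protein name")
```

Keeping URL construction separate from submission makes the decision logic easy to check; an actual BLAST run additionally requires polling for results, which this sketch leaves out.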
PROTEIN SEQUENCE DATABASES (CONTD.)
• One can also study proteins based on gene models (predicted protein sequences)
from many species-specific genome resources, such as Mouse Genome Database
(MGD, http://www.informatics.jax.org), FlyBase (a resource for Drosophila genes,
http://flybase.org), WormBase (a resource for C. elegans,
http://www.wormbase.org), Saccharomyces Genome Database (SGD,
http://www.yeastgenome.org), Arabidopsis Information Resource (TAIR,
http://www.arabidopsis.org), and Soybean Knowledge Base (SoyKB,
http://soykb.org/).
• When the protein of interest is from a species that is not covered by any of these
databases, it is likely that some information can be retrieved from its homolog of
a model organism in one of the databases.
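The lookup described above can be sketched as a small table keyed by species, with a fallback to a model-organism homolog when the species itself is not covered. The resource names and URLs are those listed above; the scientific-name keys and the function name find_resource are illustrative assumptions.

```python
# Resource abbreviations and URLs come from the list above; the scientific
# names used as keys are added here for illustration.
GENOME_RESOURCES = {
    "Mus musculus": ("MGD", "http://www.informatics.jax.org"),
    "Drosophila melanogaster": ("FlyBase", "http://flybase.org"),
    "Caenorhabditis elegans": ("WormBase", "http://www.wormbase.org"),
    "Saccharomyces cerevisiae": ("SGD", "http://www.yeastgenome.org"),
    "Arabidopsis thaliana": ("TAIR", "http://www.arabidopsis.org"),
    "Glycine max": ("SoyKB", "http://soykb.org/"),
}


def find_resource(species, homolog_species=None):
    """Return (route, name, url) for a species, or None if nothing applies.

    route is "direct" when the species has its own resource and
    "via_homolog" when only the model organism holding a homolog is covered.
    """
    if species in GENOME_RESOURCES:
        return ("direct",) + GENOME_RESOURCES[species]
    if homolog_species in GENOME_RESOURCES:
        return ("via_homolog",) + GENOME_RESOURCES[homolog_species]
    return None
```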
PROTEIN STRUCTURAL DATABASES
• Theoretical models were removed from the PDB effective July 2, 2002, under
the new PDB policy. The PDB also contains some structures of chemical
ligands and nucleotides.
PROTEIN STRUCTURAL DATABASES (CONTD.)
The PDB provides related information about the protein, such as secondary structure assignment and geometry. Each PDB
entry also links to a wide range of annotations from secondary databases, including:
(1) summary and display databases such as Structural Biology Knowledgebase (SBKB, http://sbkb.org), PISA (Protein
Interfaces, Surfaces and Assemblies; Krissinel and Henrick, 2007), Molecular Modelling Database (MMDB;
Marchler-Bauer et al., 1999) in Entrez, PDBsum (Laskowski et al., 1997), Jena Library of Biological Macromolecules
(JenaLib, http://www.fli-leibniz.de/IMAGE.html), PDBWiki (a community-annotated knowledge base of biological
molecular structures, http://pdbwiki.org), and Proteopedia (a collaborative 3D encyclopedia of proteins and other
molecules; Prilusky et al., 2011);
(2) domain annotation from SCOP (Murzin et al., 1995), CATH (Orengo et al., 1997), and Pfam (Finn et al., 2010);
(5) protein movements recorded in the Database of Macromolecular Movements (MolMovDB; Gerstein and Krebs, 1998); and
(6) geometry analyses of the protein, such as CSU (Contacts of Structural Units; Sobolev et al., 1999) and CASTp
(identification of protein pockets and cavities; Liang et al., 1998).
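As a small illustration of the geometric information carried by a PDB entry, the sketch below reads ATOM records in the legacy fixed-column PDB format and computes the distance between the C-alpha atoms of two residues. It is a minimal example under simplifying assumptions (alternate locations and insertion codes are ignored, and HETATM records are skipped); the function names are illustrative, and production work would normally rely on an established structure-parsing library.

```python
import math


def parse_atoms(lines):
    """Yield (chain, residue_number, atom_name, (x, y, z)) from ATOM records.

    Uses the fixed column layout of the legacy PDB format.
    """
    for line in lines:
        if not line.startswith("ATOM  "):
            continue  # skip HETATM, TER, REMARK, and other record types
        atom_name = line[12:16].strip()   # columns 13-16
        chain = line[21]                  # column 22
        res_num = int(line[22:26])        # columns 23-26
        xyz = (float(line[30:38]),        # x, columns 31-38
               float(line[38:46]),        # y, columns 39-46
               float(line[46:54]))        # z, columns 47-54
        yield chain, res_num, atom_name, xyz


def ca_distance(lines, chain, res_a, res_b):
    """Distance in angstroms between the C-alpha atoms of two residues.

    Raises KeyError if either residue has no CA atom in the given chain.
    """
    ca = {res: xyz for ch, res, name, xyz in parse_atoms(lines)
          if ch == chain and name == "CA" and res in (res_a, res_b)}
    return math.dist(ca[res_a], ca[res_b])
```

Calling ca_distance on the lines of an entry, with a chain identifier and two residue numbers, returns a single distance; applying it to all residue pairs gives the contact map that many geometry analyses start from.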
OTHER PROTEIN STRUCTURAL DATABASES
• Other structure-related databases can also provide useful information. For example, pdbLight
(http://mufold.org/pdblight.php) integrates protein sequence and structure data from multiple sources for
protein structure prediction and analysis, together with predicted SCOP classification for the weekly
updated PDB structures.
• BioMagResBank (BMRB; University of Wisconsin, 1999) is a repository for NMR spectroscopy data on
proteins, peptides, and nucleic acids. Particularly, it provides partial NMR data (e.g., chemical shifts)
before the full structure is solved.
• Protein Model Portal (PMP; Arnold et al., 2009) provides predicted structural models and their quality
assessments for a large number of proteins.
PROTEIN FAMILY DATABASES
• There is no unique way to classify proteins into families. Boundaries between different families
may be subjective. The choice of classification system depends in part on the problem; in general,
the author suggests looking into classification systems from different databases and comparing
them.
• Three types of classification methods are widely adopted, based on similarity
of sequence, structure, or function.
• Sequence-based methods are applicable to any proteins whose sequences are known, while
structure-based methods are limited to the proteins of known structures, and function-based
methods depend on the functions of proteins being annotated.
• Sequence- and structure-based classifications can be automated and are scalable to
high-throughput data, whereas function-based classification is typically carried out manually.
• Structure- and function-based methods are more reliable, while sequence-based methods
may produce false positives when sequence similarity is weak (i.e., two proteins are
classified into one family because of chance similarity rather than any biological
relationship). In addition, since protein structure and function are better conserved than
sequence, two proteins having similar structures or similar functions may go undetected by
sequence-based methods.
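To make the sequence-based approach concrete, the sketch below computes percent identity between two pre-aligned sequences and assigns them to the same family when the identity reaches a cutoff. This is a deliberately simplified illustration, not the method of any particular database: percent identity is defined here over columns where neither sequence has a gap, and the 30% default cutoff is an illustrative assumption. Pairs that fall near the cutoff are exactly where chance similarity can produce false positives, and where truly related proteins can be missed.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over alignment columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    identical = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == "-" or b == "-":
            continue  # gap column: not counted in this definition
        compared += 1
        if a == b:
            identical += 1
    return 100.0 * identical / compared if compared else 0.0


def same_family(aligned_a: str, aligned_b: str, threshold: float = 30.0) -> bool:
    """Assign two aligned sequences to one family if identity >= threshold.

    The threshold is an illustrative assumption, not a universal rule.
    """
    return percent_identity(aligned_a, aligned_b) >= threshold
```

Because the decision depends on a single number, comparing the outcome with structure-based or function-based classifications is a useful check whenever the identity is low.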
OTHER DATABASES