
Research Article 1099

A systematic comparison of protein structure classifications: SCOP, CATH and FSSP
Caroline Hadley and David T Jones*

Background: Several methods of structural classification have been developed to introduce some order to the large amount of data present in the Protein Data Bank. Such methods facilitate structural comparisons and provide a greater understanding of structure and function. The most widely used and comprehensive databases are SCOP, CATH and FSSP, which represent three unique methods of classifying protein structures: purely manual, a combination of manual and automated, and purely automated, respectively. In order to develop reliable template libraries and benchmarks for protein-fold recognition, a systematic comparison of these databases has been carried out to determine their overall agreement in classifying protein structures.

Results: Approximately two-thirds of the protein chains in each database are common to all three databases. Despite employing different methods, and basing their systems on different rules of protein structure and taxonomy, SCOP, CATH and FSSP agree on the majority of their classifications. Discrepancies and inconsistencies are accounted for by a small number of explanations. Other interesting features have been identified, and various differences between manual and automatic classification methods are presented.

Conclusions: Using these databases requires an understanding of the rules upon which they are based; each method offers certain advantages depending on the biological requirements and knowledge of the user. The degree of discrepancy between the systems also has an impact on reliability of prediction methods that employ these schemes as benchmarks. To generate accurate fold templates for threading, we extract information from a consensus database, encompassing agreements between SCOP, CATH and FSSP.

Address: Protein Structure Group, Department of Biological Sciences, University of Warwick, Coventry CV4 7AL, UK.

*Corresponding author.
E-mail: [email protected]

Key words: evolution, protein folds, protein-structure classification, taxonomy

Received: 28 January 1999
Revisions requested: 12 March 1999
Revisions received: 1 April 1999
Accepted: 25 May 1999

Published: 27 August 1999

Structure September 1999, 7:1099–1112
http://biomednet.com/elecref/0969212600701099

0969-2126/99/$ – see front matter
© 1999 Elsevier Science Ltd. All rights reserved.

Introduction
Since the creation of the Protein Data Bank (PDB [1]) over twenty years ago, more than 8000 protein structures have been deposited. With experimental techniques becoming more advanced and less time consuming for solving protein structures, the rate of growth in structural information is expected to rise even more rapidly. Although a great deal of information may be revealed by analysis of a single protein structure, it has long been understood that a more global, comprehensive view of proteins comes from a comparison of multiple structures, and investigations into their folding similarities and evolutionary relationships. A logical beginning to the comparison of protein structures is a system of classifying these structures in order to easily identify and group similar folds and families. An advantage of classifying proteins in this way is the prospect of introducing some sense of order to the growing volume of structural data available.

One complication that immediately arises in structure classification is the fact that protein structures are often composed of discrete globular domains. The commonly accepted definition of a domain is a compact, local, semi-independent folding unit built from secondary structure elements [2]. Because domains may function individually within a protein, with distinct functional and structural roles, proteins are usually separated into discrete domains before classification. The identification and delineation of domains within protein structures is a difficult and often subjective process. Although domains can often easily be distinguished by manual inspection, the automation of this process is not simple. Many algorithms exist for domain assignment, each relying on a different set of defined rules governing domain structure and packing, such as compactness [3,4], surface area [5], residue–residue contact maps [6] and hydrophobicity [7]. The difficulties in this first step of protein structure classification are an indication of the difficulty of this process in general.

Once defined, domains can then be classified at the levels of class, fold, and superfamily (and further subdivided into families). ‘Class’ is generally determined from the overall composition of secondary structure elements within a domain [8]. A ‘fold’ is determined from the number,
1100 Structure 1999, Vol 7 No 9

arrangement, and connectivity (or topology) of these elements. A ‘superfamily’ consists of domains with similar folds and usually similar functions, suggesting common ancestry, often in the absence of detectable sequence similarity. Sequence similarity is usually taken into account in forming ‘families’, namely, groups of very closely related domains. Examples of families might be a group of the same proteins from different species, or perhaps different isozymes from the same species.

Several systems have arisen to address the need for structural classification, in particular SCOP (Structural Classification of Proteins [9]), CATH (Class Architecture Topology Homology [10]) and FSSP (Families of Structurally Similar Proteins [11]). One of the advantages to having these systems is that they represent three unique methods of classifying structural data: FSSP is based on a purely automated process, SCOP is almost completely manually derived, and CATH employs an intermediate process, using automated procedures along with human intervention.

SCOP organizes proteins in a hierarchy, from class down to fold, superfamily, and family [9]. A total of ten classes are defined (only the first four of which are considered here): all alpha, all beta, alpha and beta (α/β), alpha plus beta (α+β), multidomain, membrane and cell-surface proteins and peptides, small proteins, peptides, designed proteins, and non-protein structures. Although the SCOP protein classification is essentially a manual process using visual inspection and comparison of structures, some automation is used for the most routine tasks such as clustering protein chains on the basis of sequence similarity. Proteins are usually (but not always) separated into domains, and most of these domains are classified into one of the first five classes noted above. Structural similarities of proteins at the fold level often represent favourable packing arrangements and chain topologies, although some distant evolutionary links may exist. Common ancestry (i.e. homology) is more clearly defined upon classification into superfamilies, where proteins with similar structure and/or functional features are believed to share a common evolutionary origin. Proteins with similar sequences, or very similar structures and functions that imply a solid evolutionary link, are grouped together as families. Thus, members of the same family or superfamily within SCOP share common ancestry.

CATH is also a hierarchical system, which differs from SCOP in that it incorporates some automation in classifying protein structures. After extracting highly resolved structures from the Protein Data Bank (PDB), comparisons are made to group highly similar proteins on the basis of sequence similarity. A representative structure is taken from each sequence family, and is divided into domains using a consensus approach incorporating three automatic domain-assignment techniques [12]. Class (C-level) is defined first, to prevent unnecessary structure comparisons between different classes at a later stage. This step is primarily automated, although difficult cases may be dealt with manually. Domains are assigned to one of four classes (mainly α, mainly β, alpha beta (αβ), or few secondary structures) on the basis of composition, secondary-structure contacts and the proportion of parallel and antiparallel sheets [13]. Within each class, structure comparisons are made to produce fold groups (T-level) and then homologous superfamilies (H-level) [14,15]. The final stage is the manual assignment of architecture (A-level) using visual inspection and reference to literature. This stage is particularly important when considering novel folds. Together, these levels (C-A-T-H) produce an index or number for each domain; domains sharing C-A-T numbers have the same fold, whereas a shared H-level indicates a common evolutionary origin.

FSSP is known as both Families of Structurally Similar Proteins [16] and Fold classification based on Structure–Structure alignment of Proteins. Like SCOP and CATH, FSSP attempts to relate protein structures with respect to evolutionary relationships, although unlike CATH and SCOP, it is fully automated and does not assign proteins into classes, fold families or superfamilies. Instead, pairwise structural comparisons are made between proteins of a representative set (where no two proteins or domains have greater than 25% sequence similarity) and members of a sequence-homologue set (homologues with greater than 25% sequence identity) using the Dali program [17]. For each member of the representative set, a file is created containing all pairwise structural matches above a Z-score of 2.0 (pairs with values below this number are described as structurally dissimilar). Other information is presented in the file, along with the alignment information generated by Dali. Ultimately, a fold tree is constructed using hierarchical clustering methods; an indexing system is also incorporated by dividing the pairwise structural comparisons at Z-scores of 2, 3, 4, 5, 10 and 15 (creating a six-character index). These cut-offs are not an accurate distinction between protein folds or superfamilies.

Each of these three classification schemes has dedicated users, who tend to use one method consistently rather than try others with which they are unfamiliar. Recent papers demonstrate the extent to which structural classification databases are used in the bioinformatics field, from the analysis of protein structure [18] to the extraction of homologous structures [19,20], the testing of prediction methods [21–23] and the calculation of numbers of folds and families [24,25]. Other databases have applications in structural biology, such as VAST (Vector Alignment Search Tool, an algorithm that produces neighbourhoods of similar folds by performing structure–structure comparisons of all domains in the PDB
Research Article Comparison of structure classifications Hadley and Jones 1101

[26,27]) and HOMSTRAD (HOMologous STRuctural Alignment Database, which provides aligned three-dimensional structures of homologous proteins [28]). Databases of protein structural domain definitions, such as the DIAL-derived domain database (DDBASE [29]) and the Database of Protein Domain Definitions (3Dee [30]) can also be useful in structural investigations. However, SCOP, CATH and FSSP have the advantage of being the largest, most comprehensive and most frequently used classification databases available.

When a structural classification database is required for a specific purpose, such as (in our case) the production of fold recognition templates, it is not only important to choose the right database for the right reasons, but also to take into consideration the reliability of the classification scheme. The construction of fold templates aimed at the identification of distant superfamily members depends on highly accurate groupings of homologous sequences, such as would be expected in homologous families in SCOP and CATH, and FSSP pairwise matches with high Z-scores. However, with no indication as to the accuracy of this grouping in any of these databases, we found it necessary to investigate the three classification methods for content, reliability and accuracy. The widespread use of these databases in the field of bioinformatics certainly warrants an investigation into their reliability and structural agreement.

This analysis of SCOP, CATH and FSSP reveals a large percentage of agreement between the three databases, and highlights a number of important issues regarding these specific resources, and protein-structure classification in general. Because of the subjective nature of classification, using a combination of databases may be best; the benefit of a consensus structural-classification database for the production of reliable threading templates is currently being assessed.

Results and discussion
Shared-codes set
The comparison of SCOP, CATH and FSSP data produced 6875 common chains. When domains were subsequently taken into account, the SCOP set contained 8498 domains (74% of 11,515 domains in SCOP [March 1998, v 1.37]), and the CATH set contained 9874 domains (74% of 13,338 domains in CATH [April 1998, v 1.4]). The 6875 shared chains represented approximately 78% of the 8805 chains taken from FSSP. The resulting set of shared PDB codes is referred to as pset3.

Database comparison
Comparing FSSP pairwise matches to both SCOP and CATH
The comparison of FSSP pairwise matches against SCOP and CATH is shown in Figures 1a–f. These pie charts show that even at a relatively low Z-score of 4.0, the three databases have a high percentage of agreement, especially at the fold level (78% of the FSSP matches at this Z-score are found in both SCOP and CATH). As Z-score increases from 4.0 to 6.0 and then to 8.0, agreement at both the fold and homology levels steadily increases. Beyond Z-score 8.0, this trend continues. Table 1 further subdivides the FSSP pairwise matches by sequence identity along with Z-score: this more clearly shows that agreement between databases increases both with Z-score and with sequence identity. For the most part, a combination of Z-score and sequence identity can be used to determine the likelihood of a structural pairwise match being found in all three databases (indicating a fairly undisputed match). As might be expected, a higher Z-score is needed at low sequence identities, and vice versa. At low sequence identities (0–19.9%), agreement between databases is rarely high. On the other hand, with a high enough sequence identity, Z-score is not necessarily very important for structural agreement between databases. Even at a Z-score below 4.0, sequence identity of 30% or greater results in complete agreement between the three databases.

It is not possible to state a Z-score within FSSP for which all three databases will completely agree on fold and/or homology for pairs of structures. This is certainly not what the authors intended; however, the vast majority of users may not have sufficient structural or biological knowledge to ascertain the importance of the data presented in an FSSP file. Determining Z-score cut-offs at which certain generalizations about the data could be made would facilitate the use of this resource.

A high percentage of agreement exists between the three databases at the fold level. Above 25% sequence identity, a threshold commonly used to distinguish between homologous or randomly related proteins, the agreement between the three databases is almost always 100% (the agreement between FSSP and SCOP is 100%). There are some exceptions, which represent 0.3% of the total number of FSSP pairs in the region included in Table 1 (which itself only represents 31% of the total FSSP matches). Between 25% and 29% sequence identity, five of the eight mismatches present are due to domain-assignment discrepancies, usually where only one portion of the protein is included in one of the databases, but another portion is classified in the other. The two mismatches between 30–34% are also due to this problem. Between 45–49%, five mismatches occur, all of which contain a phaseolin seed storage 7S protein domain paired with a canavalin 7S vicilin protein of the same family (according to SCOP classification, which agrees). The canavalin protein is less than half the size of the phaseolin, and is considered as one domain in CATH; the phaseolin protein is separated into four domains in CATH, and only two in SCOP.

At 100% sequence identity, only four mismatches contribute to the loss in agreement between the three databases. Two

Figure 1

Panel  Z-score  Level     FSSP only    FSSP + SCOP  FSSP + CATH  ALL
(a)    ≥ 4.0    fold      6% (686)     4% (469)     12% (1370)   78% (8790)
(b)    ≥ 4.0    homology  26% (2981)   4% (478)     3% (315)     67% (7541)
(c)    ≥ 6.0    fold      1% (69)      3% (229)     4% (350)     92% (7971)
(d)    ≥ 6.0    homology  11% (964)    3% (260)     1% (104)     85% (7291)
(e)    ≥ 8.0    fold      0% (3)       2% (119)     1% (80)      97% (7343)
(f)    ≥ 8.0    homology  5% (382)     2% (139)     0% (31)      93% (6993)

Pie charts reflecting the agreement between pairwise matches in FSSP, CATH and SCOP. FSSP pairwise matches are compared to both CATH and SCOP: they are found in FSSP only (i.e. in neither SCOP nor CATH), in FSSP and SCOP (missed in CATH), in FSSP and CATH (missed in SCOP), or in all three databases. (a,b) FSSP pairwise matches (Z-score ≥ 4.0) compared to CATH and SCOP matches at the fold and homology level, respectively. Numbers in parentheses indicate the number of pairwise matches in question. At this Z-score, agreement between the three databases is already high at both the fold and homology level. (c,d) Pairwise matches (Z-score ≥ 6.0) compared to CATH and SCOP as before. Agreement between the databases has increased by at least 15% at both the fold and homology levels. The difference between FSSP + SCOP and FSSP + CATH agreement has also reduced. (e,f) Pairwise matches with Z-score ≥ 8.0. Already, agreement between the databases is as high as 97% at the fold level. Pairwise matches found in FSSP only are limited to three (see text for description), and the numbers of FSSP pairwise matches found in either SCOP or CATH (but not both) are very low.
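The four-way split shown in Figure 1 can be illustrated with a short sketch. It assumes each database has already been reduced to a set of unordered domain pairs at a chosen level (fold or homology); the function names and the toy identifiers are hypothetical and are not taken from the original analysis.

```python
from collections import Counter


def pair_key(a, b):
    """Order-independent key for a pair of domain identifiers."""
    return frozenset((a, b))


def tally_agreement(fssp_pairs, scop_pairs, cath_pairs):
    """Sort every FSSP pairwise match into one of the four Figure 1 categories."""
    counts = Counter()
    for pair in fssp_pairs:
        in_scop = pair in scop_pairs
        in_cath = pair in cath_pairs
        if in_scop and in_cath:
            counts["ALL"] += 1
        elif in_scop:
            counts["FSSP + SCOP"] += 1
        elif in_cath:
            counts["FSSP + CATH"] += 1
        else:
            counts["FSSP only"] += 1
    return counts


def as_percentages(counts):
    """Convert category counts to whole-number percentages of the total."""
    total = sum(counts.values())
    return {name: round(100 * n / total) for name, n in counts.items()}


# Toy data (hypothetical identifiers), one pair per category.
fssp = {pair_key("a", "b"), pair_key("a", "c"), pair_key("b", "c"), pair_key("c", "d")}
scop = {pair_key("b", "a"), pair_key("c", "a")}
cath = {pair_key("a", "b"), pair_key("c", "b")}
toy = tally_agreement(fssp, scop, cath)

# Counts reported for panel (a): Z-score >= 4.0, fold level.
panel_a = Counter({"ALL": 8790, "FSSP + CATH": 1370, "FSSP only": 686, "FSSP + SCOP": 469})
panel_a_percent = as_percentages(panel_a)
```

Applied to the panel (a) counts, `as_percentages` reproduces the 78%, 12%, 6% and 4% slices of the pie chart.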

of the mismatches pair two chains of restriction endonuclease BamH1 (1bhm, chains A and B) with another BamH1 structure (1bam0). CATH has the 1bam0 domain in a different three-layer (αβα) sandwich fold from the 1bhm domains; the former is classified in the collagenase (catalytic domain) fold, with the latter in the restriction endonuclease domain 2 fold. Both folds have the three-layer (αβα) sandwich architecture. Because of a different arrangement of helices, and the addition of small β strands in the 1bam0 domain, CATH considers the geometry sufficiently different to assign a non-endonuclease fold, despite the function of the protein.

Another mismatch in the 100% sequence identity region concerns two thermolysin structures; because one is a fragment (1trlA), it is classed as a thermolysin fragment fold in CATH, whereas the other structure (1hyt) is classed as a neutral protease fold (which encompasses thermolysin). As they share obvious common ancestry, SCOP understandably considers both domains to have the same fold.

Among the mismatches above the 25% identity level, there is only one case where both SCOP and CATH disagree on the pairwise match: at 25% identity (now 26% in the current version of the FSSP database), the C-terminal domain of ribosomal protein L7/12 is aligned to the Taq DNA polymerase with a root mean square deviation (rmsd) of 3.9 Å. Both databases consider the ribosomal protein as one domain, but classify the polymerase as six (in CATH) or three (SCOP) domains. There is a small region within the large polymerase structure that resembles the small β-sheet structure of the ribosomal protein;

Table 1

Percentage agreement of FSSP pairwise matches with SCOP and/or CATH.

Percentage identity of   Z-score of FSSP pairwise matches
FSSP pairwise matches    2–3.9   4–5.9   6–7.9   8–9.9   10–11.9   12–13.9   14–19.9
0–9                        5.6    30.7    56.7    77.1      84.3      88.0      70.0
10–14                      6.4    24.3    52.5    70.1      80.4      85.0      92.3
15–19                     11.0    35.4    61.1    83.6      89.1      88.9      97.9
20–24                     35.0    73.1    70.8    89.5      94.1      88.0      91.7
25–29                     66.7    80.0   100     100       100        80.0     100

Percentage of FSSP pairwise matches (separated by Z-score and percentage identity) found in both SCOP and CATH; percentage agreement improves with increases in both Z-score and percentage identity. See text for additional discussion.
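As a hedged illustration of how Table 1 might be used to judge a single FSSP match, the sketch below looks up the tabulated agreement for a given Z-score and sequence identity. The values are copied from Table 1; treating each printed range as a half-open bin (for example, 4–5.9 as 4 ≤ Z < 6) is an assumption, and the function names are hypothetical.

```python
from bisect import bisect_right

# Lower edges of each bin plus one closing upper edge (assumed half-open bins).
Z_EDGES = [2, 4, 6, 8, 10, 12, 14, 20]
IDENTITY_EDGES = [0, 10, 15, 20, 25, 30]

# Rows follow IDENTITY_EDGES, columns follow Z_EDGES (values from Table 1).
AGREEMENT = [
    [5.6, 30.7, 56.7, 77.1, 84.3, 88.0, 70.0],
    [6.4, 24.3, 52.5, 70.1, 80.4, 85.0, 92.3],
    [11.0, 35.4, 61.1, 83.6, 89.1, 88.9, 97.9],
    [35.0, 73.1, 70.8, 89.5, 94.1, 88.0, 91.7],
    [66.7, 80.0, 100.0, 100.0, 100.0, 80.0, 100.0],
]


def _bin_index(edges, value):
    """Return the index of the bin containing value, or None if out of range."""
    i = bisect_right(edges, value) - 1
    if i < 0 or i >= len(edges) - 1:
        return None
    return i


def table1_agreement(z_score, identity):
    """Percentage of FSSP matches in this bin found in both SCOP and CATH."""
    row = _bin_index(IDENTITY_EDGES, identity)
    col = _bin_index(Z_EDGES, z_score)
    if row is None or col is None:
        return None  # outside the region covered by Table 1
    return AGREEMENT[row][col]
```

For example, `table1_agreement(5.0, 22)` returns 73.1, the entry for Z-scores of 4–5.9 at 20–24% identity; a Z-score below 2 or an identity of 30% or more falls outside the table and returns `None`.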

the structural-alignment program in FSSP would pick up this small similarity, while there may not be an evolutionary relationship between the two domains. This is one example of a match between two protein domains that is probably not indicative of common ancestry, but is simply a chance match between similar regions.

Comparing CATH pairwise matches to both SCOP and FSSP
At the fold level, the percentage of agreement with both SCOP and FSSP is small, at only 36% (see Figure 2). Alone, SCOP contains 51% of the CATH matches, and FSSP contains 60%. As compared to the SCOP comparison (see below), a much larger percentage of CATH matches is missed in SCOP or FSSP, but a higher percentage of matches is found in FSSP. The majority of the 49% of CATH matches missed in SCOP arise not from classification mistakes, but from problems such as domain assignment and fold overlap. Much of the 40% of CATH matches missed in FSSP may be due to changes in FSSP data (discussed below).

The most commonly occurring CATH pairwise matches missed by SCOP occur in the three-layer (αβα) sandwich Rossmann fold and the immunoglobulin fold. Most mismatches stem from the ‘fold-overlap’ problem, where a fold within CATH encompasses more than one fold within SCOP, and vice versa. When a domain is classified within CATH as being a three-layer (αβα) sandwich Rossmann fold, there are several SCOP folds to which it could conceivably belong. The same occurs with the immunoglobulin fold within CATH: several SCOP folds, such as the immunoglobulin-like β sandwich (SCOP code: 2.1), the prealbumin-like fold (2.3), and the cupredoxin fold (2.5) may contain these domains. Thus a domain classified as one SCOP fold will not be paired with a domain in another; although the structures are deemed by CATH to be geometrically similar, SCOP separates them to reflect an evolutionary or topological distinction.

Figure 2

Panel  Level     CATH only      CATH + SCOP    CATH + FSSP    ALL
(a)    fold      24% (251,340)  16% (165,459)  25% (260,796)  36% (377,019)
(b)    homology  2% (8524)      24% (89,289)   8% (31,716)    66% (246,955)

Comparing CATH pairwise matches to SCOP and FSSP. (a) At the fold level, only 36% of the pairwise matches found in CATH are found in both SCOP and FSSP. Note that the number of CATH matches found in SCOP and FSSP is the same as the number of SCOP matches found in CATH and FSSP: the differing percentages reflect the total number of pairwise matches, which is much higher in CATH than in SCOP. A large percentage of these matches is found only in CATH. (b) A smaller number of pairwise matches are found at the homology level, so the overall agreement between the databases is higher, and the number of pairwise matches confined solely to CATH is lower. SCOP (and FSSP) still includes additional CATH matches that the other database does not.
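The fold-level percentages quoted for the CATH and SCOP comparisons follow from the pairwise-match counts printed in Figures 2a and 3a. The sketch below recomputes them as a check; the counts are taken from the figures, while the dictionary layout and the `share` helper are only illustrative.

```python
# Fold-level pairwise-match counts from Figure 2a (CATH) and Figure 3a (SCOP).
CATH_FOLD = {
    "CATH only": 251_340,
    "CATH + SCOP": 165_459,
    "CATH + FSSP": 260_796,
    "ALL": 377_019,
}
SCOP_FOLD = {
    "SCOP only": 14_092,
    "SCOP + CATH": 165_459,
    "SCOP + FSSP": 38_704,
    "ALL": 377_019,
}


def share(counts, *categories):
    """Percentage of all matches that fall in the named categories."""
    total = sum(counts.values())
    return 100 * sum(counts[name] for name in categories) / total


cath_in_all = share(CATH_FOLD, "ALL")                   # agreed by all three
cath_in_scop = share(CATH_FOLD, "ALL", "CATH + SCOP")   # CATH matches present in SCOP
cath_in_fssp = share(CATH_FOLD, "ALL", "CATH + FSSP")   # CATH matches present in FSSP
scop_in_all = share(SCOP_FOLD, "ALL")
scop_in_cath = share(SCOP_FOLD, "ALL", "SCOP + CATH")   # SCOP matches present in CATH
```

The shared count (377,019) is identical in both figures, so the difference between roughly 36% for CATH and roughly 63% for SCOP comes entirely from the denominators: about 1.05 million CATH pairs against about 0.6 million SCOP pairs.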

Figure 3

Panel  Level     SCOP only     SCOP + CATH    SCOP + FSSP  ALL
(a)    fold      2% (14,092)   28% (165,459)  7% (38,704)  63% (377,019)
(b)    homology  12% (51,358)  21% (89,289)   9% (39,710)  58% (246,955)

Comparing SCOP pairwise matches to CATH and FSSP. (a) At the fold level, almost two-thirds of the SCOP pairwise matches are also found in both FSSP and CATH. CATH agrees with a further 28% of the SCOP matches, whereas FSSP includes only an extra 7%. Only a small percentage of the pairwise matches is unique to SCOP. (b) Fewer shared matches are found at the homology level in comparison to the fold level. Because of the difficulties inherent in assigning homology, there is a higher percentage of SCOP matches at this level that is not found in the other two databases.

The problems inherent with domain assignment also affect this comparison. Obviously, any protein separated into a different number of domains within SCOP and CATH will probably be classified into completely different folds as well. There are cases of proteins not completely classified by one of the databases: a group of MHC (major histocompatibility complex) class II proteins has only the N-terminal region included in SCOP, but both the N- and C-terminal regions are found in CATH. The C-terminal region is an immunoglobulin fold; each of these MHC proteins will thus be paired with every other immunoglobulin fold domain within CATH, but will be missed in SCOP. This is one example that illustrates the impact one discrepancy may have on the rest of the database; these four protein domains affect over 1000 pairwise matches within CATH.

Comparing SCOP pairwise matches to both CATH and FSSP
Approximately two-thirds of the SCOP matches at the fold level are found in both CATH and FSSP (Figure 3). Because SCOP has a much smaller number of pairwise matches than CATH, the percentage of matches agreed by all three databases is higher (64%) than in the CATH comparison. CATH alone agrees with over 90% of the SCOP matches; FSSP agrees with over 70%. Only 2% of the SCOP pairwise matches are absent in both CATH and FSSP. It is likely that the same problems discussed for the CATH comparison account for many of the mismatches and disagreements between the databases in this case. Of course, many of the comparisons (such as SCOP matches found in CATH, and CATH matches found in SCOP) are simply duplicated data.

At the homology level, fewer SCOP pairwise matches are found in both CATH and FSSP. Taken individually, both CATH and FSSP match fewer pairs than at the fold level (79% and 67% respectively). Understandably, the percentage of matches not found in either database has increased. These values reflect the difficulties in assigning homology between pairs of similar structures.

Differences and discrepancies between SCOP and CATH
Domain assignment
As mentioned previously, the separation of proteins into domains is a difficult and often subjective process. Table 2 highlights the difference between the methods of SCOP and CATH for distinguishing domains. Of the 6875 protein chains in pset3, 1194 (∼17%) are assigned different numbers of domains in SCOP and CATH. In general, CATH assigns more domains than SCOP. This is due to the fact that CATH employs a purely structural definition for domains (essentially based on compactness), whereas SCOP takes into account whether or not a domain is observed as recurring in another superfamily, or observed as a separate single-domain fold. Occasionally, protein chains are classified as multidomain in SCOP until a more thorough classification on individual domains can be completed; for this reason, a chain could seem to consist of only one domain (i.e. as yet undivided) in SCOP while having several domains in CATH.

Problems with domains account for several groups of discrepancies between SCOP and CATH. An obvious domain problem is the exclusion of one part of a protein. In the case of the MHC class II chains (1iea(A–D)), only the N-terminal domain is included in SCOP. CATH includes both the N- and C-terminal domains, so any protein matching the C-terminal domain of 1iea(A–D) in CATH will not have an equivalent match in SCOP. The definition of domain obviously leaves some room for interpretation, and, in some cases, dividing a protein along a possible structural-domain boundary may in fact divide one active-site region into two or more nonfunctional segments. Such is the case with papain (1ppo). SCOP treats the protein as one domain, leaving the catalytic cysteine, histidine and asparagine together to form

the active site. CATH, however, splits the protein into two domains, separating the cysteine from the asparagine and histidine, and rendering each domain effectively functionless (Figure 4a). CATH does the same for the trypsin-like serine peptidases and the aspartic acid pepsin peptidases. The decision is in many respects a philosophical one: whereas those interested in the biochemical aspects of protein structure may see the structure as a complete functional unit, others with interests in the dynamics of protein folding may argue that the functional unit can be separated into smaller, commonly occurring structural domains. Interestingly, the opposite occurs with mannose-binding protein A (1afb), where CATH categorizes each trimer subunit as a single domain, while SCOP separates the triple coiled-coil helix from the mainly β region of each monomer, and classifies the domains individually (Figure 4b).

Our exclusion of certain classes within this study has also generated some discrepancies between CATH and SCOP, as the larger number of SCOP classes makes it difficult to compare some proteins. For example, the C-terminal domain of the regulatory chain of aspartate carbamoyltransferase (1acmB/D:101–153) is classed as a small protein in SCOP (denoting structures usually dominated by metal ligand, heme, and/or disulphide bridges) (Figure 5a). However, CATH describes the same domain as being in the mainly β class. By removing the small protein class within SCOP, we are forced to ignore any pairwise matches containing this domain. Similarly, the haematopoietic cell kinase (hck) structure has one region classed as multidomain within SCOP (1ad5A/B:249–531), but approximately the same region is divided in CATH and presented as two domains, one in the αβ and one in the mainly α class (Figure 5b). When members of the multidomain class in SCOP are disregarded, corresponding pairwise matches are lost.

Class assignment
One category of discrepancy between SCOP and CATH arises from differences in class assignment, and the majority of disagreement arises from the presence of the two classes encompassing α/β domains in SCOP. However, domains within each class are allocated consistently in SCOP, and there are no cases of pairwise matches produced within both FSSP and CATH that SCOP has missed

Table 2

Comparison of domain assignment methods: SCOP and CATH.

SCOP (number     CATH (number of domains)
of domains)         1      2      3      4      5      6      7
1                4475    817     51     31
2                  80   1007     61    104     14      3
3                          7    159     18      3
4                                 3     38      2
5                                               1
6
7                                                             1

The number of domains into which each chain is separated in SCOP and CATH is compared. For the most part, the two classification schemes agree on the number of domains per chain (5681 of 6875 chains is ∼82% agreement). However, in the case of chains split into two domains in CATH, almost half are considered as only one domain within SCOP. Examples and possible reasons for this are discussed in the text.
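The summary figures quoted for Table 2 can be recomputed from its cells, as in the sketch below. The diagonal positions are fixed by the 5681 agreeing chains stated in the caption; the exact columns of the off-diagonal cells are an assumption (values are taken to occupy adjacent columns around the diagonal), although the counts of chains with more domains in CATH or in SCOP do not depend on that choice.

```python
# (SCOP domain count, CATH domain count) -> number of chains.
# Off-diagonal column positions are assumed to be contiguous.
TABLE2 = {
    (1, 1): 4475, (1, 2): 817, (1, 3): 51, (1, 4): 31,
    (2, 1): 80, (2, 2): 1007, (2, 3): 61, (2, 4): 104, (2, 5): 14, (2, 6): 3,
    (3, 2): 7, (3, 3): 159, (3, 4): 18, (3, 5): 3,
    (4, 3): 3, (4, 4): 38, (4, 5): 2,
    (5, 5): 1,
    (7, 7): 1,
}

total = sum(TABLE2.values())
same_count = sum(n for (scop, cath), n in TABLE2.items() if scop == cath)
more_in_cath = sum(n for (scop, cath), n in TABLE2.items() if cath > scop)
more_in_scop = sum(n for (scop, cath), n in TABLE2.items() if cath < scop)

# Of the chains that CATH splits into two domains, how many does SCOP keep whole?
two_in_cath = sum(n for (scop, cath), n in TABLE2.items() if cath == 2)
one_in_scop_share = 100 * TABLE2[(1, 2)] / two_in_cath
```

Under these assumptions, the 1194 disagreeing chains divide into 1104 where CATH assigns more domains and 90 where SCOP does, consistent with the observation that CATH generally assigns more domains than SCOP.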

Figure 4

Examples of domain-assignment disagreements between CATH and SCOP. (a) Structure of papain (1ppo) with catalytic histidine, asparagine and cysteine shown as ball-and-stick residues. SCOP classifies the structure as one domain (SCOP code: 4.3.1), whereas CATH splits the structure into two, as shown by blue (CATH code: 1.10.190.10) and yellow (3.10.160.10) colouring. The cartoon figures were prepared using MOLSCRIPT [36]. (b) Structure of mannose-binding protein (1afb). CATH treats each monomer in the trimer as one domain (coloured red, blue and yellow), whereas SCOP separates the coiled-coil extension (uncoloured) from the rest of the structure, and classifies both domains individually.
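The CATH codes quoted in the figure captions can be compared level by level, following the rule given in the Introduction that domains sharing C-A-T numbers have the same fold and that a shared H-level indicates common evolutionary origin. The sketch below is a minimal illustration; the function names are hypothetical, and the code `3.30.200.99` is invented purely to show a same-fold, different-superfamily case.

```python
from typing import NamedTuple, Optional


class CathCode(NamedTuple):
    c: int  # class
    a: int  # architecture
    t: int  # topology (fold group)
    h: int  # homologous superfamily


LEVELS = ("class", "architecture", "topology", "homologous superfamily")


def parse_cath(code: str) -> CathCode:
    """Parse a dotted C.A.T.H code such as '1.10.190.10'."""
    parts = code.split(".")
    if len(parts) != 4:
        raise ValueError(f"expected four dot-separated numbers, got {code!r}")
    return CathCode(*(int(p) for p in parts))


def same_fold(x: str, y: str) -> bool:
    """Two domains share a fold when their C, A and T numbers all match."""
    return parse_cath(x)[:3] == parse_cath(y)[:3]


def same_superfamily(x: str, y: str) -> bool:
    """A shared H-level (with matching C-A-T) indicates common ancestry."""
    return parse_cath(x) == parse_cath(y)


def deepest_shared_level(x: str, y: str) -> Optional[str]:
    """Name of the last level at which both codes still agree, if any."""
    shared = None
    for name, u, v in zip(LEVELS, parse_cath(x), parse_cath(y)):
        if u != v:
            break
        shared = name
    return shared


# The two CATH domains of papain (1ppo) differ already at the class level.
papain = deepest_shared_level("1.10.190.10", "3.10.160.10")

# Hypothetical second code: same fold as 3.30.200.20, different superfamily.
fold_only = same_fold("3.30.200.20", "3.30.200.99")
```

Here `papain` is `None`, since the two halves of the structure fall into different classes (mainly α and αβ), which is why a single SCOP domain has no one-to-one counterpart in CATH.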


Figure 5

Problems associated with the definition of
additional classes within SCOP. (a) The
C-terminal domain of the regulatory chain of
aspartate carbamoyltransferase
(1acmB:101–153; coloured blue) is classed as
a ‘small protein’ in SCOP, whereas CATH
classifies it as a single-sheet mainly β structure
of the monellin (subunit A) fold (2.20.30.60).
(b) The haematopoietic cell kinase structure has
one region classed as ‘multidomain’ within
SCOP (1ad5A/B:249–531). Approximately the
same region is presented as two domains in
CATH: one domain is an αβ two-layer sandwich
G4-amylase fold (1ad5A/B:259–344;
3.30.200.20; coloured blue), and the other is a
mainly α non-bundle casein kinase I delta
(subunit A, domain 2) fold (1ad5A/B:345–519;
1.10.510.10; coloured yellow).


because of a class mix-up (that is, defining one domain as class α+β, and one as class α/β). As with domain identification, class assignment can be dependent on subjective rules. An example is the haemagglutinin domain (1hgg, chains A, C and E), a domain which CATH considers to be in the αβ class because of the presence of two small helices amongst several β strands (Figure 6a). The SCOP authors ignore these small helical elements, as they are not consistently present across all available haemagglutinin structures and play no significant role in the function of the protein. The domain is thus classed as all β. The two methods are clearly relying on a different set of definition rules for classifying their entries: one may take small percentages of secondary structural elements into account, whereas the other disregards them. The situation is reversed with the case of the lysozyme superfamily: SCOP classifies these proteins as α+β, whereas CATH disregards the presence of small β strands, and opts instead to classify them as mainly α (Figure 6b). In this case, however, the evolutionary importance of these strands is a crucial factor in SCOP’s determination of the overall structural class. Thus, for both SCOP and CATH the rules of classification are dependent on the protein family in question, and are not consistent throughout the classification database. Reassuringly, there are no cases where one scheme defines a mainly α domain that the other considers mainly β, or vice versa.
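The class-level comparison described above can be illustrated with a minimal sketch. It assumes that both SCOP α-β classes (α/β and α+β) correspond to the single CATH αβ class; the mapping table, the helper name `compare_class` and the three labels it returns are hypothetical and are not taken from the authors' analysis. The class numbers follow those given in the Materials and methods (SCOP classes 1–4, CATH classes 1–3).

```python
# SCOP classes 1-4 and CATH classes 1-3, as listed in the Materials and methods.
SCOP_CLASSES = {1: "mainly alpha", 2: "mainly beta", 3: "alpha/beta", 4: "alpha+beta"}
CATH_CLASSES = {1: "mainly alpha", 2: "mainly beta", 3: "alpha-beta"}

# Assumption for this sketch: both SCOP alpha-beta classes map onto CATH class 3.
SCOP_TO_CATH = {1: 1, 2: 2, 3: 3, 4: 3}


def compare_class(scop_class: int, cath_class: int) -> str:
    """Label a SCOP/CATH class pair (hypothetical helper, not from the paper)."""
    if scop_class not in SCOP_CLASSES or cath_class not in CATH_CLASSES:
        raise ValueError("class outside the ranges considered in this comparison")
    expected = SCOP_TO_CATH[scop_class]
    if expected == cath_class:
        return "agree"
    if {expected, cath_class} == {1, 2}:
        # One scheme says mainly alpha, the other mainly beta.
        return "alpha-versus-beta conflict"
    # One scheme counts minor secondary-structure elements, the other ignores them.
    return "mixed-class disagreement"


# Class numbers below are read off the two examples discussed in the text.
examples = {
    "haemagglutinin (1hgg)": (2, 3),  # SCOP all beta, CATH alpha-beta
    "lysozyme (1lys)": (4, 1),        # SCOP alpha+beta, CATH mainly alpha
}
for name, (scop, cath) in examples.items():
    print(name, "->", compare_class(scop, cath))
```

Under this mapping, both worked examples fall into the mixed-class category, and the α-versus-β category corresponds to the case that the text reports as never occurring.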

Figure 6

Examples of class assignment disagreements
between CATH and SCOP. (a) SCOP
ignores the small helical elements in the
haemagglutinin structure (1hgg, chains A, C
and E) and classifies the domain as mainly β,
whereas CATH takes the helices into account
and considers the structure αβ. (b) CATH
disregards the presence of small β strands in
the lysozyme superfamily (e.g. 1lys) and
considers the protein mainly α, whereas
SCOP takes into account the functional and
evolutionary importance of these strands, and
calls the lysozymes α+β.

Research Article Comparison of structure classifications Hadley and Jones 1107

Fold assignment
SCOP classifies pset3 into 286 separate folds, whereas CATH uses 447 folds to classify the same set. Of the total of 429 folds in SCOP, 323 exist in the major structural classes 1–4 (∼75% of total). CATH defines a total of 590 folds, 527 of which are present in classes 1–3 (∼89% of total). The definition of fold is thus somewhat arbitrary and left up to the creators of each classification method. As such, similar folds may have different names, or one method may encompass a subset of proteins under one general fold, whereas another method separates the set into more specific, less-populated folds. Surprisingly, most of the highly populated fold families in CATH are classified into more than one fold family in SCOP (the ‘fold-overlap’ issue mentioned previously). A good example of this is the Rossmann fold family (α/β class, three-layer αβα sandwich architecture: CATH no. 3.40.50). Proteins within this fold family in CATH are classified in several different fold families in SCOP, such as: the β subunit (capsid) of the lumazine synthase/riboflavin synthase complex (fold family 3.9; 1rvv, 30 chains), Flavodoxin-like (3.13; 1ofv0), NAD(P)-binding Rossmann fold domains (3.19; 1fmcA), N-carbamoylsarcosine amidohydrolase (3.22; 1nbaA), P-loop-containing nucleotide triphosphate hydrolases (3.25; 1ukz0), CheB methylesterase domain (C-terminal residues 152–349) (3.27; 1chd0), Subtilases (3.28; 1selA), Phosphotyrosine protein phosphatases I-like (3.31; 1phr0), anticodon-binding domain of Class II aminoacyl-tRNA synthetase (aaRS) (3.37; 1adyA), IIA domain of mannose transporter, IIA-Man (3.40; 1pdo0), phosphoglycerate mutase-like (3.43; 3pgm0), phosphoribosyltransferases (PRTases) (3.44; 1sto0), integrin A (or I) domain (3.45; 1lfaA), glycinamide ribonucleotide transformylase (3.46; 1cddA), S-adenosyl-L-methionine-dependent methyltransferases (3.47; 1vid0), and α/β-hydrolases (3.50; 1tib0). All these folds are described as three-layer αβα folds, with mostly parallel β sheets consisting of between four and eight strands. The SCOP authors have either made a topological distinction between the folds, or are more conservative in their fold assignment, choosing to keep folds separate until sufficient evidence warrants their unification. CATH on the other hand focuses on the geometric aspects of structural similarity, and thus encompasses all these SCOP folds into one large fold family, as they share the common Rossmann fold motif of a parallel β sheet flanked on both sides by α helices. Although the Rossmann fold typically describes structures with six-stranded β sheets, CATH has used this definition in a broader sense. Nevertheless, for domains in the CATH Rossmann fold family, the most commonly found SCOP classification for these folds is also the Rossmann fold. All the domains in this fold level in SCOP are found in the Rossmann fold level in CATH, and so a very high degree of consistency is apparent between the two schemes, even though SCOP defines a number of small subfamilies for these folds.

Rather more surprisingly, the opposite case also occurs, where domains from a fold within SCOP are classified into more than one fold in CATH. The TIM-barrel fold in SCOP (3.1) is one example. Corresponding CATH folds include transaldolase B, chain A fold (3.20.25; 1ucwA); urease, subunit C, domain 2 fold (3.20.50; 2kauC); and chitobiase, domain 3 fold (3.20.60; 1qba0). The three examples share the same architecture, but vary in the number of strands and helices comprising the barrel. CATH has separated each fold for geometric reasons, due to low structural similarity scores, whereas SCOP considers all these barrels to be sufficiently similar to group together. The frequent occurrence of this kind of fold definition discrepancy between SCOP and CATH is of course due to the effect of independent fold definition, and reflects the difference between using geometry and evolution to classify structure.

Favourable packing arrangements and protein architectures limit the number of possible protein folds, and for this reason large numbers of protein structures might be expected to fall into relatively small numbers of protein-fold families. Several estimations have been made in response to the question of how many folds exist in nature. Chothia originally estimated a conservative value for the total number of protein families of no more than 1000; thus the total number of folds would be even less [31]. This was followed by estimates varying from around 1000 folds [32] to over 6000 [33,34]. The issue is unresolved, with other estimates varying between around 650 [25] and less than 5200 for human proteins alone [24]. Clearly the issue is a controversial one. Given the arbitrary definition of what constitutes a protein fold, the only point that seems to be in agreement is that a finite number of naturally occurring folds exists (at the very least there can be no more folds than protein sequences). In the unlikely event that a standardization of fold definition and fold nomenclature can be agreed, then perhaps more agreement might be possible between the different estimates for fold numbers.

Homology assignment
Homology discrepancies can be seen clearly in cases where one database classifies two domains within one structure as homologous, but another database does not. Interestingly, in these cases a database may not only miss the homology between two domains, but may consider them to have different folds or architectures. Over 150 domains disagree in this way within SCOP and CATH. The most commonly occurring problems arise in the two largest fold groups (which are also superfolds) in these databases: the Rossmann fold and the immunoglobulin-like fold.

Several cases exist where CATH recognizes a homologous relationship between two domains within one structure that SCOP classifies as different folds. Elongation factor

Tu (EF-Tu) structures are split into three domains, two of which CATH considers homologous (2.40.80.10; 1eft0, domains 2 and 3); presumably the belief here is that these domains are the results of a distant gene-duplication event. These same domains are classed in separate folds in SCOP: the reductase/isomerase/elongation factor common domain fold (2.29) and the EF-Tu C-terminal domain fold (2.30). Both are closed Greek-key barrels, with six β strands, but SCOP has chosen to separate these folds while CATH combines them into a single fold family. A similar situation occurs in the 1cgt domain 4 family of the immunoglobulin-like fold (mainly β, sandwich: 2.60.40.110). CATH considers domains 3 and 4 of these proteins to be homologous. SCOP considers the former an immunoglobulin-like β sandwich (2.1) and the latter a prealbumin-like fold (2.3). Both folds are described within SCOP as being Greek keys with seven strands in two sheets, with additional strands in some members. The reason for the separation of these obviously similar folds is unclear, although it is presumably due to a higher apparent degree of calculated similarity within one subgroup.

The opposite case also occurs, where SCOP declares a homologous relationship with which CATH disagrees. The three domains in the A and B chains of the phosphoglucomutase (first three domains) family, superfamily and fold within SCOP have a mixed β sheet of four strands (e.g. rabbit phosphotransferase, 3pmgA/B). CATH classifies the three domains completely differently: all have the same three-layer (αβα) sandwich architecture, but each has a different fold. The first has the α-D-glucose-1,6-biphosphate, subunit A, domain 1 fold (3.40.460), the second has a Rossmann fold (nitrogenase molybdenum-iron protein, subunit A, domain 3) (3.40.50), and the third has the α-D-glucose-1,6-biphosphate subunit A domain 3 fold (3.40.120). The domain locations within each database are roughly the same. Oddly, although the description of the SCOP fold is a mixed β sheet of four strands, two of the domains have a different number of strands.

It is also common to find the databases agreeing on the fold of multiple domains, but disagreeing on homology. One example is the GMP synthetase subunit A domain 3 fold (αβ two-layer sandwich, 3.30.300) in CATH, which is subdivided into three homology levels. Most proteins with a domain from one homology level also contain domains from the other two homology levels in the GMP synthetase fold (1mxa domain 1 family, 3.30.300.20; 1mxa domain 2 family, 3.30.300.40). In SCOP, the three domains in these proteins are classified within the same family: the S-adenosylmethionine synthetase homologous family (fold of the same name, α+β class; 4.75.1). With proteins having one or more homologous domains, there are no cases of SCOP missing a homologous relationship that CATH identifies.

There are of course many more examples of missed or uncertain superfamily definitions in both SCOP and CATH, as the assignment of homology is often more subjective than the assignment of fold. Evolutionary relationships are often disputed or unclear, and different groups may make individual decisions as to domain relationships.

CATH architecture level
CATH defines another level of classification that SCOP does not consider, namely architecture. There are a small number of multidomain proteins that have an internal homology recognized by SCOP but are classed into different architectures (and thus folds) by CATH. These examples belong to the same SCOP homologous family: the actin-like ATPase domain (α/β, ribonuclease H-like motif fold; 3.41.1). The G chains of all five members of the glycerol kinase family have two domains in both SCOP and CATH, but SCOP classifies them as duplicated domains of three layers, whereas CATH classifies one as three-layer and the other as complex. A visual inspection of the domains using RASMOL [35] shows that the complex domain (1glaG:4–253) differs sufficiently from the standard three-layer (αβα) sandwich (see Figure 7a–c). CATH has the advantage of segregating such examples by using the architecture level of classification, whereas SCOP must incorporate these domains into fold groups that may include members with substantially different global structure.

Database updates
One possible drawback to using SCOP may become apparent when attempting to use the codes within this paper to access data from the current version of the database. The CATH authors anticipated the likely addition of data to each level of their hierarchy, and numbered each architecture, topology and homology level as a multiple of ten. As such, new entries to each level can either be added to the end of the database, or slotted in the middle of the current version by numbering between existing entries (e.g. the super-roll architecture is indexed as 3.15 to fit between the roll (3.10) and the barrel (3.20) in the αβ class). This ensures that current entries need not be changed. In contrast, the SCOP authors have apparently chosen to renumber entries upon the addition of new data to their database. For example, the flavodoxin-like fold, index number 3.13 in version 1.37 of the database, is now 3.14 in the current version of the database (v1.39); the α/β-hydrolase fold has gone from 3.50 to 3.56. This makes consistent use of the data more difficult, especially when considering the number of other resources that link to or cross-reference SCOP data.

Notable features of FSSP
Homology at low Z-score
Of the 21,637 pairwise matches within FSSP, 10,322 (almost 48%) have a corresponding Z-score below 4.0 (i.e. between 2.0 [the FSSP cut-off] and 3.9). A low Z-score

Figure 7

Escherichia coli glycerol kinase (1glaG), separated into two domains by both SCOP and CATH. SCOP considers the two domains to be homologous, classing them as members of the actin-like ATPase domain superfamily of the ribonuclease H-like motif fold. CATH assigns the domains different architectures. (a) Chain G in full. The interface between the two domains runs horizontally, across the middle of this diagram. (b) Chain G, domain 1: 4–253. This domain is classified as ‘complex’ within the αβ class in CATH. (c) Chain G, domain 2: 254–499. The domain is assigned the ‘three-layer (αβα) sandwich’ architecture within the αβ class in CATH.

(i.e. less than 4.0) does not necessarily rule out a structural similarity or even homology between two structures. A small percentage of FSSP pairwise matches with Z-scores less than 4.0 is found in both SCOP and CATH fold families or even within superfamilies. Understandably, a much smaller number of agreements is seen at the homology level than at the fold level: only 166 FSSP pairwise matches exist at the SCOP and CATH homology level, as compared to 673 at the fold level. At the fold level, the immunoglobulin-like β sandwich is the predominant fold, involving almost one-third of the matches. The TIM-barrel fold, Rossmann fold (three-layer αβα sandwich) and arc repressor mutant fold (DNA-binding three-helical bundle in SCOP) also recur frequently. Although the matches at this level are only a small percentage of the possible matches within each of these folds, this is still an indication that some folds are more easily matched than others. These folds would seem to present a particular challenge to the Dali comparison method.

A Z-score below 4.0 might be considered insignificant for assigning an evolutionary relationship between two protein structures, hence the advantage of taking sequence identity into account. However, most of these FSSP pairwise matches (10,233 of 10,322) have a sequence identity lower than 20%, a value commonly considered a threshold for assigning obvious homologous relationships. Thus, the majority of the pairwise comparisons within FSSP are presumably not indicative of definite evolutionary relationships. However, the inclusion of this information in FSSP is useful in that Z-score and sequence identity may be used to automatically identify very remote relationships between protein structures, and the relationships between structures in a neighbourhood (rather than a hierarchy) can be closely examined. Users are free to interpret this data as they wish, without any preformed decisions being made on the significance of the information.

Obviously, a clear picture cannot be derived from considering the Z-scores alone. The sequence identity between two structures presents additional information for assessing similarity of proteins, particularly regarding the possibility that the structures are evolutionarily related. Within the subset of FSSP pairwise matches with a Z-score below 4.0, sequence identity varies from below 10% to over 80%. As sequence identity relates only to the sequence region being aligned, this value may not necessarily reflect the global similarity between two proteins. For example, in the case of a pairwise match between two calmodulin chains with 88% identity, the alignment length is only 57 residues over two sequences of 148 residues (Xenopus laevis calmodulin [1dmo0] and Paramecium tetraurelia calmodulin [1osa0]). Although these two calcium-binding domains are obviously homologous, they superimpose with an rmsd of 11.3 Å, producing a Z-score of 3.6. This is presumably due to the calcium-induced conformational change in one of the chains. Without taking the degree of sequence identity into account, these rmsd and Z-score values are not sufficient to indicate a homologous

relationship. Few of the corresponding sequence identities are as high as the value in this example, but other values indicate that homologous structures may not necessarily superimpose well enough to produce a high Z-score.

High Z-score without homology
A high Z-score (i.e. greater than 6.0) does not necessarily indicate a structural similarity between two proteins that agrees with SCOP and CATH. At the fold level, there are 69 pairwise matches within FSSP above Z-score 6.0 that are not recognized in SCOP or CATH. The Z-scores range from 6.0 to 15.6, with corresponding sequence identities ranging from 4% to 24%. Grouping these examples gives an indication of the reasons that might have led SCOP and/or CATH to classify these structures as having no fold similarity or homology. A large proportion of these pairwise matches involves α and β domains (classed as α/β or α+β in SCOP) classified as different folds in SCOP, and different architectures in CATH. The folds vary in SCOP, but within CATH the architectures are largely limited to the three-layer (αβα) sandwich (3.40), the three-layer (ββα) sandwich (3.50) and the complex architecture (3.90). The complex architecture contains αβ proteins too elaborate to fit into any other CATH architecture, with some examples containing combinations of helices and strands that resemble portions of a typical three-layer sandwich. Small similarities like these explain the low sequence identity and high Z-score between some structural pairs; small subdomains or regions may superimpose well, despite the domains having no overall similarity or obvious evolutionary relationship.

Not surprisingly, at higher Z-score cut-offs, the proportion of FSSP pairwise matches absent at both the fold and homology levels in SCOP and CATH decreases steadily. The number of homology mismatches is always larger than the number of fold mismatches, and the relationship between the two values decreases by almost 50% with each unit increase in Z-score. At a Z-score of 9.0, only one mismatch remains at the fold level, where FSSP has paired a G-protein transducin (1tbgA) and a methanol dehydrogenase (4aahA) with a sequence identity of 8%, rmsd of 4.0 Å and a Z-score of 15.6. SCOP and CATH both classify the methanol dehydrogenase structure as an eight-bladed propeller fold (CATH indicates this at the architecture level), but disagree on the transducin: CATH classifies it as a six-bladed propeller architecture, whereas SCOP considers it a seven-bladed fold. A closer look at the structure reveals that the transducin does have seven blades, each consisting of four short β strands. This discrepancy does not affect the pairwise match in question: the two domains are propellers with a different number of blades; FSSP has apparently superimposed portions of them with a reasonable rmsd and Z-score, although no actual relationship is evident. The six-bladed propeller architecture in CATH has only one topology level, the neuraminidase fold (2.120.10). This fold encompasses two homologous groups, one of whose members has structures consisting of six blades (1mwe; 2.120.10.10) and the other whose members contain seven blades (1gotB; 2.120.10.20). As a result, any pairwise matches between these two homologous families (i.e. when matching at the fold level) will be inaccurate. In addition, CATH misses the relationships between this group and other seven-bladed propeller structures, such as domains within the methylamine dehydrogenase chain H (2.130.10) and galactose oxidase domain 2 (2.130.20) folds. After investigation of these errors within CATH, it transpires that they are the results of a single typographical error during the initial manual definition of the architecture (CA Orengo, personal communication). This example demonstrates the impact that one small human error can make on a hierarchical database of protein structures.

It is well known that low sequence identity between two proteins does not necessarily indicate the absence of an evolutionary relationship. There are examples of pairwise matches with low sequence identity in FSSP that are classified in the same fold or homologous family in SCOP and CATH. Sequence identity should therefore not be taken alone as a criterion of homology between two structures: the same fold can often be shared by a variety of different proteins that share virtually no sequence similarity. Sometimes a fold may be the ideal structure for several proteins with a range of function and ancestry. Some of the most interesting cases are pairs of protein domains with low sequence identity, high rmsd (indicating a suboptimal superposition of structures) and low Z-score (indicating the match may not be significant). At the homology level, 977 pairwise matches below 20% identity are shared between the three databases. (At the fold level this value jumps to 2716 pairwise matches.) Although this subset of domains represents a mere fraction of the 14,807 FSSP pairwise matches below 20% identity, they illustrate the dangers in depending on sequence identity to provide an accurate picture of structural relationships. A high sequence similarity between two proteins (assuming it covers a reasonable length of the sequences in question) usually indicates a homologous (and therefore structural) relationship, whereas a low sequence identity cannot be used to rule out such a relationship.

Database updates
Under certain circumstances, one of the key advantages of FSSP, namely the automatic update procedure, may cause difficulties. Unlike CATH and SCOP, FSSP is updated continuously, with data derived from the Dali alignment program generating new and revised FSSP files automatically. Because data is taken directly from the PDB, which generally releases new structural information according to a weekly schedule, FSSP is constantly changing. Each version of CATH or SCOP is guaranteed to be relatively unchanged until the next major update, and

archive copies of previous releases are available from the maintainers. Occasionally, new PDB codes are chosen to supersede original codes; any pairwise matches with obsolete codes would obviously be missed. Because the organization of FSSP depends on a representative set of domains and a set of sequence homologues, the addition of new structures might trigger a reorganization of the domain groups, with the possibility of a new representative being chosen. This would in turn affect the structural alignments, which would then influence the sequence identity, rmsd, and Z-score of each pairwise match. Thus, both the constituent protein domains, and the information generated by their alignments, are changing constantly in FSSP.

In a comparison of FSSP files available in November 1997 and July 1998, approximately 12% of the FSSP pairwise matches used in this analysis were absent. In the remaining pairwise matches, less than 0.05% of Z-scores, percentage identities and rmsd scores had changed, but changes included Z-score values that unfortunately crossed the arbitrary threshold of 4.0 commonly used to assign structural relationships. Measures were taken to minimize the impact these problems would have on the data presented here, but it is likely that some of the pairwise matches missed in FSSP are due in part to the continual updating of data.

Biological implications
The reliability and accuracy of structure classification methods are important to structural and non-structural biologists alike. Protein structure data is used in various aspects of biology such as benchmarking, protein modelling, evolutionary studies and drug design. This systematic comparison of SCOP, CATH and FSSP represents the first attempt at estimating the degree of consistency between these databases, and facilitates a comparison between fully and partially automated, and primarily manual, classification methods.

To a large extent, the three databases agree on classifications; certainly no one method is distinctly superior. Most of the differences and discrepancies that exist result from the unique guidelines by which structures are classified within each database. Biologists should note that there are no fixed principles of protein structure classification, and each method relies on independently devised rules. Understanding these rules is crucial to making the most of each resource, as is the database structure (i.e. as a hierarchy or structural neighbourhood) and the way separate families are treated (i.e. whether small secondary structure elements are included or disregarded when assigning class, etc.). SCOP is a valuable resource for detailed evolutionary information, but its purely manual derivation influences update frequency and means some families or folds within the database may not be as exhaustively detailed as others. CATH provides useful geometric information, and the addition of ‘architecture’ can reveal broad features of protein-fold shape, but partial automation means examples near fixed thresholds may be assigned inaccurately. FSSP is continually updated and presents data for the user’s own assessment; however, without sufficient knowledge, a user may not assess this data appropriately.

By presenting such a large amount of structural data with detailed geometric and evolutionary information, these databases are a valuable resource for benchmarking of methods, and structural studies. At present, using these databases in conjunction with human judgement and biological knowledge should be sufficient for providing accurate and reliable structural information to all biologists. Whether a consensus database, devised by extracting undisputed protein classifications from SCOP, CATH and FSSP, would improve the development of accurate threading templates is currently being assessed.

Materials and methods
Generating a set of shared structures
In order to compare the three databases, a standard list of common structural identifiers was first generated. Unlike FSSP, both SCOP and CATH append the four-character PDB code with chain and domain identifiers (e.g. 1pdbC1 where ‘C’ identifies the chain, and ‘1’ indicates the first domain). If no chains or domains exist, a ‘0’ is used. In SCOP (March 1998; v1.37), only classes 1–4 were considered (mainly α, mainly β, α/β, α+β); in CATH (April 1998, v1.4), only classes 1–3 (mainly α, mainly β, α/β) were included. In comparing the codes found in each database, only chains (e.g. 1pdbA, 1pdbB and 1pdbC) were considered, as domains are not consistently allocated across different classification schemes. A ‘0’ was added to all FSSP four-character codes, as FSSP contains some structures of only one chain. An additional ‘0’ was added to all five-character codes, as FSSP does not separate any protein chains into domains. Once a list of shared five-character codes (i.e. protein chains) was created from the three databases, the corresponding domains for each protein structure chain were then included for SCOP and CATH. Each full protein structure code in SCOP and CATH corresponds to a classification index number used to define its class, fold, superfamily and family.

Structure comparisons within each database
An all-against-all comparison of the classification numbers of the six-character codes (i.e. protein domains) within the master list (pset3) produced a set of pairwise matches within both SCOP and CATH. FSSP files consist of pairwise matches with Z-scores above 2.0 between protein structures in the representative set and the sequence-homologue set, and these matches were extracted along with the corresponding Z-score and percentage sequence identity. Additionally, for CATH and SCOP, matches were determined at both the homology and fold levels. In CATH, homology (i.e. homologous superfamily) is found at the fourth place in the numbering system, with fold, or topology, at the third position, but in SCOP, fold holds the second place in the index, with homology (or superfamily) in the third place. No more than one match per pair of codes (with chain identifiers) was recorded.

Structure comparisons between databases
Upon generating a set of pairwise matches for each database, comparisons were made between the three databases. Each pairwise match was noted as being present in both, neither, or one of the remaining two databases. As FSSP is structured differently from both SCOP and CATH, an additional category of ‘incompatible’ was created

to include SCOP or CATH matches comprising two codes found in 24. Zhang, C.T. (1997). Relations of the numbers of protein sequences,
either the representative FSSP set or the sequence homology set, but families and folds. Protein Eng. 10, 757-761.
not both. A large number of these incompatible pairwise matches could 25. Wang, Z.X. (1998). A re-estimation for the total numbers of protein
be converted into compatible matches by substituting each PDB code folds and superfamilies. Protein Eng. 11, 621-626.
26. Madej, T., Gibrat, J-F. & Bryant, S.H. (1995). Threading a database of
in the pair with its representative code within FSSP.

Accessing the databases
The URL for the SCOP database is http://scop.mrc-lmb.cam.ac.uk, CATH is http://www.biochem.ucl.ac.uk/bsm/cath/, and FSSP is http://www2.embl-ebi.ac.uk/dali/fssp. An interactive website with data from this analysis can be found at http://globin.bio.warwick.ac.uk/~hadley/db.

Acknowledgements
This work was supported by Zeneca Pharmaceuticals and the Royal Society. We thank Dr David Timms and the authors of SCOP, CATH and FSSP for valuable comments during the course of this work.

Because Structure with Folding & Design operates a ‘Continuous Publication System’ for Research Papers, this paper has been published on the internet before being printed (accessed from http://biomednet.com/cbiology/str). For further information, see the explanation on the contents page.
