
STRALCP: structure alignment-based clustering of proteins

2007, Nucleic Acids Research

Protein structural annotation and classification is an important and challenging problem in bioinformatics. Research towards analysis of sequence-structure correspondences is critical for better understanding of a protein's structure, function, and its interaction with other molecules. Clustering of protein domains based on their structural similarities provides valuable information for protein classification schemes. In this article, we attempt to determine whether structure information alone is sufficient to adequately classify protein structures. We present an algorithm that identifies regions of structural similarity within a given set of protein structures, and uses those regions for clustering. In our approach, called STRALCP (STRucture ALignment-based Clustering of Proteins), we generate detailed information about global and local similarities between pairs of protein structures, identify fragments (spans) that are structurally conserved among proteins, and use these spans to group the structures accordingly. We also provide a web server at http://as2ts.llnl.gov/AS2TS/STRALCP/ for selecting protein structures, calculating structurally conserved regions and performing automated clustering.

Published online 26 November 2007. Nucleic Acids Research, 2007, Vol. 35, No. 22, e150. doi:10.1093/nar/gkm1049

Adam Zemla1,*, Brian Geisbrecht2, Jason Smith1, Marisa Lam1, Bonnie Kirkpatrick1, Mark Wagner1, Tom Slezak1 and Carol Ecale Zhou1

1 Computing Applications and Research, Lawrence Livermore National Laboratory, Livermore, CA 94550 and 2 Division of Cell Biology and Biophysics, University of Missouri-Kansas City, Kansas City, MO 64110, USA

Received June 11, 2007; Revised October 14, 2007; Accepted November 6, 2007

*To whom correspondence should be addressed. Tel: +1 925 4235571; Fax: +1 925 4236437; Email: [email protected]

© 2007 The Author(s). This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

Downloaded from http://nar.oxfordjournals.org/ by guest on October 9, 2016

INTRODUCTION

Most protein annotation and classification approaches depend heavily on the degree of observed amino acid sequence similarity to other related proteins. But even when sequence similarity between two proteins is low, structure similarity can be high. Thus, one of the most important improvements in protein classification would be protein homology/analogy identification at very low levels of sequence similarity (1). As Redfern et al. (2) explain ‘despite the advances in sequence comparison methods, remote homologs in the “Midnight Zone” of sequence similarity (<15% identity) described by Rost, can still only be identified through protein structure comparison’. Redfern et al. also point out that ‘structure-based classifications are becoming increasingly important resources for recognizing these distant relatives and providing datasets for more far-reaching analyses of protein family evolution’. In our research and development we follow these observations, and in order to detect the benefits and limitations of using purely structure-based approaches, we currently concentrate on structure similarity analyses only.

The Protein Data Bank (PDB) (3) already contains more than 45 000 experimentally solved protein structures, and grows at a rate of more than 500 PDB entries per month. Among current entries, approximately 40% are multi-domain proteins (2) and, thus, there have been several attempts to classify individual domains of PDB protein structures into defined clusters (e.g.
classes, folds, superfamilies, families) based on structure similarity as measured by various criteria. The most commonly used protein classification databases are SCOP (4) and CATH (5). The Structural Classification of Proteins (SCOP) database, a manual classification of PDB structures, is recognized by many as the gold standard of protein classification. In SCOP, proteins are classified to reflect both structural and evolutionary relatedness. Clustering is based mainly on visual inspection of similarities between conformations of secondary structure elements and on sequence similarities. However, SCOP classification lags behind the insertion of new structures into the PDB, and manual classification cannot scale to meet the demands of this rapidly growing dataset.

There are several algorithms already proposed to facilitate automated protein structure classification. For example, clustering can be done by selecting a single metric (e.g. Z-score (6) used in FSSP (7) Dali Fold Classification) or by combining different criteria to score the level of similarity, some examples of which are secondary structure content and orientation combined with calculated sequence similarities, and manual inspection, as used in the semiautomatic CATH database. Depending on the algorithm, classification results may differ significantly if different criteria are used to assess the level of similarity between compared structures or if the clustering criteria are focused on different structural features. The same set of proteins could be grouped differently by automatic sequence or structure comparison tools based on minor modifications to cutoffs or classification parameters.

The Dali Fold Classification is based on exhaustive, all-against-all 3D structure comparisons of proteins from the PDB, and is constructed by average linkage clustering of the structural similarity score derived from calculated alignments of distance matrices. The tree (dendrogram) is cut at Dali Z-score levels 2, 4, 8, 16, 32 and 64, where the first level (Z > 2) can be used as an operational definition of folds. A similar approach is used in CE (8) classification. After performing all-against-all comparisons of protein chains from the PDB, resulting CE Z-score values of 4.5 and above are used to discriminate at the family level, values between 4.0 and 4.5 at the superfamily and/or fold levels, and values between 3.5 and 4.0 are presumed to indicate possible biologically interesting similarities.

The authors of the STRuster (9) method explore the calculation of root mean square deviations (RMSD) and use their algorithm to cluster alternative structural models from the PDB (i.e. models that correspond to different structure determination experiments). In addition to the traditional RMSD measure, the STRuster method uses two filters to define the final scoring metric called dissimilarity measure M (9). These two filters are introduced in order to identify both large and small (but significant) backbone conformational changes by reducing the influence of large local distances (only distances below 14.0 Å are considered) and also to restrict the analysis to significant structural differences (the distances above 1.0 Å). An approach for structural comparisons, fundamentally different from those using RMSD, was proposed by Rogen and Fain (10). They introduced the SGM (Scaled Gauss Metric), which is a metric derived from knot theoretical ideas to cluster proteins according to their structural topologies. They applied their method to predicting membership of proteins in CATH and achieved 95% accuracy at all levels of the classification hierarchy.

In order to achieve a high level of agreement with other clustering schemes, some algorithms that use a multicriterion approach (weighted combination of different scoring schemes) are initially trained on labeled data from an existing structural hierarchy (SCOP or CATH) and use cross-validation (or similar methods) to select the best parameters for their classifiers. For example, ProtClass (11) uses a nearest-neighbor-based classification scheme and several structural features to classify proteins at the fold level of the SCOP hierarchy. Their features include secondary structure elements predicted by the Stride program (12), the sequence length, and the percentage of observed helices. SCOPmap (13) is an approach that achieves roughly 95% accuracy when classifying proteins into the superfamily level of the SCOP hierarchy. This approach combines many existing protein sequence and structure comparison tools, including PSI-BLAST (14), MAMMOTH (15) and Dali (6). The classification results depend on the accuracy of the individual tools, so the authors use a variety of cutoffs and parameters optimized by training schemes to apply these tools in a specific order. In Ref. (16), the authors introduce a new structural representation of proteins to predict the family membership of proteins in the SCOP hierarchy. They define a graph theoretic representation of protein structures with nodes being residues and edges connecting residues when the distance separating them falls below a specified cutoff. Using these graphs as features, they train their Support Vector Machine (SVM) classifier with proteins from several SCOP families.

The ultimate goal of the work presented here was to define criteria and to develop an algorithm that for a given set of protein structures would automatically identify structurally conserved regions and use them to create clustering results similar to those that would be obtained by manual inspection (e.g. SCOP curators). In our novel approach, called STRALCP (STRucture ALignment-based Clustering of Proteins), for a given set of protein structures, we generate and combine detailed structural information about automatically detected global and local similarities between protein pairs, identify similar regions that are conserved within the set of proteins, report these regions, and use them to cluster the proteins according to their similarities in such identified structural frames. We use the Local-Global Alignment (LGA) algorithm (17) to perform all necessary structure comparison calculations.

METHODS

Our algorithm starts from structure alignment calculations performed by LGA (with a default value of distance cutoff DIST = 5 Å) to determine de novo (no sequence information is used) residue-residue correspondences between compared proteins. We use the LGA_S measure as a scoring function to evaluate the overall level of structure similarity and to allow an initial grouping and structural clustering of proteins. In our STRALCP approach, an optimal number of clusters is determined by grouping models according to their overall similarity (LGA_S) combined with the information about local similarities in detected structurally conserved frames (we call them ‘spans’).

LGA_S structure similarity scoring function (overall structure similarity)

To perform a particular clustering for a set of protein structures, a suitable scoring function or, in general, a scoring algorithm that takes into account a number of characteristics of the compared proteins must be defined. Depending on the goal of the clustering, this can be done by selecting one measure or by combining different criteria to score the level of similarity. The LGA_S scoring function has two components, LCS (longest continuous segments) and GDT (global distance test), defined for the detection of regions of local and global structure similarities between analyzed structures. In comparing two protein structures, LGA superimposes a ‘model’ structure
onto a ‘target’ structure (where the model is designated ‘M’ and the target is ‘T’). The LCS procedure localizes and superimposes the longest segments of residues that can fit under a selected set of RMSD cutoffs. The GDT algorithm is designed to complement evaluations made with LCS by searching for the largest (not necessarily continuous) set of ‘equivalent’ residues that deviate by no more than specified distance cutoffs.

Let:

- m: the number of residues in M,
- t: the number of residues in T,
- R(r,M,T) = (100/t) × L(r,M,T): the percentage of the target's (T's) residues that are involved in the maximal (longest) continuous segment that fits within an RMSD of r Å, where L(r,M,T) is the length of that longest continuous segment of M:T residue pairs,
- X(M,T): the set of all M:T superpositions calculated by the LGA algorithm,
- G(s,d,M,T): the number of M:T residue pairs for which the distance between Cα (alpha carbon) atoms is not greater than d Å after the superposition s ∈ X(M,T) is applied,
- D(d,M,T) = (100/t) × max{G(s,d,M,T): s ∈ X(M,T)}: the maximal detected percentage of the Cα atoms in the T structure that are within a distance threshold of d Å from the M structure over the calculated superpositions s ∈ X(M,T).

The LGA_S(M,T) structure similarity scoring function is defined as a function of two structures M and T, calculated as a combination of the R(r,M,T) results from LCS(M,T) calculations and the D(d,M,T) results from GDT(M,T):

$$\mathrm{LGA\_S}(M,T) = (1 - w)\cdot S(\mathrm{LCS}(M,T)) + w\cdot S(\mathrm{GDT}(M,T))$$

where

$$S(\mathrm{LCS}(M,T)) = \frac{2}{n(n+1)}\sum_{j=1}^{n}(n-j+1)\cdot R(r_j,M,T), \qquad n = 3,\quad r_j = 1.0, 2.0, 5.0,$$

$$S(\mathrm{GDT}(M,T)) = \frac{2}{k(k+1)}\sum_{j=1}^{k}(k-j+1)\cdot D(d_j,M,T), \qquad k = 20,\quad d_j = 0.5, 1.0, \ldots, 10.0,$$

and w = 0.75 is a parameter (0 ≤ w ≤ 1) representing a weighting factor between the S(LCS) and S(GDT) results. S(LCS) is a weighted sum of R(r,M,T) values calculated for n different RMSD cutoffs r (e.g. n = 3; r = 1.0, 2.0, 5.0), and S(GDT) is a weighted sum of D(d,M,T) values calculated using k different distance cutoffs d (e.g. k = 20; d = 0.5, 1.0, ..., 10.0). In the formulae for S(LCS) and S(GDT), the weighting schemes give higher weight to those R and D results that were calculated for smaller RMSD and distance cutoffs, respectively. The range of the LGA_S values is 0–100, and hierarchical clustering experiments performed on various folds from the SCOP database showed that LGA_S alone can serve as a good discriminator for the initial protein structure clustering (see the results shown in Figure 5a).

STRALCP clustering approach (similarity in the set of structurally conserved local regions)

The essence of the STRALCP algorithm is the ability to compare hundreds of protein structures in a single reference frame, identify similar fragments that are conserved within a set of analyzed proteins, and use this information to calculate the number of required clusters. Each calculated cluster is assigned a structural fingerprint that can be defined by a representative structure and a set of spans that are shared among structures grouped together. Comparison of a new structure with a structural fingerprint determines whether the structure should be included in the particular cluster or whether it should be a member of another cluster. Our STRALCP algorithm, which automatically clusters proteins and identifies representative structures, can be described as the following list of steps:

(i) LGA is used to perform all-against-all comparisons in which, for a given set of structures, each structure is used as a frame of reference for comparisons with others.
(ii) Each frame of reference is assigned a set of sequential fragments, which are defined by splitting the corresponding amino acid sequence into consecutive n-residue-long sub-sequences (n = 10 is used as a default parameter; e.g. a 120-residue-long protein comprises 12 fragments).
(iii) After performing all-against-all structure comparisons (step i) the following information is assigned to each frame of reference:
  (a) LGA_S values between the frame of reference and all other structures,
  (b) the number of residue pairs that are superimposed locally within RMSD cutoff 0.5 Å (using a 3-residue-long window). Continuous structural segments formed by such residue pairs that are at least five residues long are marked as candidate spans,
  (c) the number of non-empty fragments (non-empty fragments are sequential fragments defined in step ii that overlap by at least one residue with at least one detected span in compared structures).
(iv) For each frame of reference, all structures having at least 80% (default parameter) of the non-empty fragments in common are identified. A list consisting of the maximum number of such structures is created and assigned to each frame of reference.
(v) An optimal number of clusters is determined based on the following criteria:
  (a) the minimum number of clusters that yields a complete set of proteins in the combined lists from step (iv),
  (b) LGA_S between any pair of proteins from the cluster is at least 60% (default parameter), minimum value from (iii.a).
(vi) Within each cluster, a representative structure is selected, which in comparison with other members of the cluster has the highest values determined in steps (iii.a), (iii.b) and (iii.c).

RESULTS

A proper protein classification is critical for better understanding and prediction of a protein's structure, function and interaction with other proteins. It is known that sequence similarities nearly always correspond to structure similarities, enabling structure and function prediction for uncharacterized proteins.
Structural similarity, however, does not necessarily correspond to sequence similarity (Figure 2). Through structural comparison and classification, we identified a family of crystal structures that sequence-based methods like PSI-BLAST (14) failed to detect (18). Using structure-based methods (e.g. Dali, LGA) it was found that three EAP domains from Staphylococcus aureus (18), which could not be properly classified by sequence-based methods, shared a previously unrecognized similarity to another class of bacterial toxins. Here we present our structure classification approach, STRALCP, applied to these domains. For each of the EAP domains [Eap2 (PDB entry: 1yn3), EapH1 (1yn4), EapH2 (1yn5)], we have performed structural PDB searches using our LGA server (19). As a result, 134 domains from the SCOP superfamily d.15.6 (Superantigen toxins, C-terminal domain) were identified as most similar to EAP structures (only 20 structures are shown in Figures 1, 3 and 4: 3 EAP domains and 17 domains from SCOP). Figure 1 shows that all 20 proteins are very similar in detected structurally conserved frames formed by four strands and one helix (Figure 2). The superposition of 1yn4_A and d1m4va2 (1m4v_A in Figure 2) corresponds to the fourth bar in Figure 1 and shows that these two structures differ in several loop regions only (structural deviations above 2 Å are colored in yellow-red). Note that the level of sequence identity between these two proteins is only 14% (Seq_ID), whereas the level of structure similarity is 75% (LGA_S). In general, all EAP domains have a high level of overall structure similarity (LGA_S over 60%) to most of the other analyzed structures, whereas the level of sequence identity is very low (below 20%).
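To make LGA_S values such as the 75% quoted above easier to interpret, the following minimal Python sketch evaluates the LGA_S combination defined in the Methods section from already computed R(r,M,T) and D(d,M,T) percentages. It is an illustration under stated assumptions, not the LGA program: the function names `weighted_score` and `lga_s` and the input numbers are hypothetical, and the superposition search that produces the R and D values is not reproduced here.

```python
from typing import Sequence

# Cutoffs as given in the Methods: n = 3 RMSD cutoffs, k = 20 distance cutoffs.
RMSD_CUTOFFS = (1.0, 2.0, 5.0)
DIST_CUTOFFS = tuple(0.5 * j for j in range(1, 21))  # 0.5, 1.0, ..., 10.0


def weighted_score(values: Sequence[float]) -> float:
    """Return 2/(n(n+1)) * sum_{j=1..n} (n-j+1) * values[j-1].

    Values for the smallest cutoffs come first and get the largest weights;
    the weights sum to 1, so percentages in give a percentage out.
    """
    n = len(values)
    total = sum((n - j + 1) * v for j, v in enumerate(values, start=1))
    return 2.0 * total / (n * (n + 1))


def lga_s(r_values: Sequence[float], d_values: Sequence[float], w: float = 0.75) -> float:
    """Combine S(LCS) and S(GDT) as (1 - w) * S(LCS) + w * S(GDT).

    r_values: R(r_j, M, T) percentages, one per entry of RMSD_CUTOFFS.
    d_values: D(d_j, M, T) percentages, one per entry of DIST_CUTOFFS.
    """
    if len(r_values) != len(RMSD_CUTOFFS) or len(d_values) != len(DIST_CUTOFFS):
        raise ValueError("expected 3 R values and 20 D values")
    if not 0.0 <= w <= 1.0:
        raise ValueError("w must lie in [0, 1]")
    return (1.0 - w) * weighted_score(r_values) + w * weighted_score(d_values)


# Hypothetical inputs, chosen only to make the arithmetic easy to follow:
# S(LCS) = (3*30 + 2*60 + 1*90) / 6 = 50 and S(GDT) = 80.
example = lga_s([30.0, 60.0, 90.0], [80.0] * 20)
print(example)  # 72.5
```

With w = 0.75 the GDT component dominates, so in this made-up case a modest LCS result (50) lowers the combined score only from 80 to 72.5.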
In Figure 1, we show the results from the structure comparisons of the set of selected 20 proteins when structure 1yn4_A was chosen as a frame of reference, and in Figure 3 the structure SEH (PDB: 1f77; SCOP domain d1f77a2) (20) was selected as a frame of reference. From the comparison of these two plots we can conclude that d1f77a2 may serve as a better representation (average structure) for the analyzed set of 20 proteins (at least for the top 13 of them) than the structure 1yn4_A. The obtained results suggest that a given set of 20 structures can be structurally divided into at least two clusters. Our STRALCP system creates such a clustering automatically (Figure 4). By this clustering the EAP structures 1yn3-5 are grouped together (Cluster2) with four other protein structures: SET1 (PDB: 1v1p) (21), SET3 (PDB: 1m4v) (22), and TSST1 (PDB: 1aw7, 2tss) (23,24). Additional tests showed that if we had introduced more strict structure similarity requirements [e.g. LGA_S cutoff 80% (see step (v.b) in the STRALCP algorithm)], then Cluster2 would have been split into two additional clusters (data not shown) where all three EAP domains (1yn3-5) were separated from the SET1, SET3 and TSST1 structures.

Note on steps (v.a) and (v.b) of the STRALCP algorithm: In step (v.a) a minimum number of clusters is defined based on local similarities in non-empty fragments along the protein sequence using initially selected representative frames of reference. Step (v.b) allows reassignment of less similar structures from one cluster to another. It also allows sub-division of clusters in order to satisfy the requirement that within each cluster any pair of proteins has at least 60% overall structure similarity. This way less similar structures are not grouped together even if they satisfy the requirement regarding a common set of non-empty fragments (step iv).

Figure 1 (colored bar plot not reproduced; right-hand columns shown).

PDB       Seq_ID   LGA_S
1yn4_A    100.00   100.00
1yn5_A     47.47    96.41
1yn3_A     36.46    92.15
d1m4va2    13.98    75.24
d1v1pa2    16.13    66.70
d1ewca2    14.44    64.95
d1et9a2    19.57    63.97
d1aw7a2    17.58    63.44
d1f77a2    14.61    63.38
d1bxta2    11.96    61.90
d1ck1a2    12.09    61.78
d1goza2    12.09    61.49
d1sebd2    13.64    61.18
d1uupa2    16.48    61.17
d2tssa2    17.78    60.51
d1fnua2    15.56    59.97
d1hqrd2    15.73    59.77
d1ty0a2    13.79    58.79
d1lo5d2    12.22    57.60
d1esfa2    14.77    57.56

Figure 1. Structure similarities between EAP domains from S. aureus (PDB: 1yn3, 1yn4 and 1yn5) and 17 protein domains from the SCOP superfamily comprising superantigen toxins. All proteins were compared to the structure of EapH1 (1yn4_A), which serves as a frame of reference. Colored bars represent Cα-Cα distance deviation between 1yn4_A [99 residues; from the left (N-terminal) to the right (C-terminal)] superimposed with 20 structures from PDB (first bar represents a 1yn4_A-1yn4_A self-comparison). Colors represent distances between aligned residues and range from green (below 2 Å) to red (above 6 Å). The columns at the right contain information about the level of sequence identity (Seq_ID) and structure similarity (LGA_S).

Figure 2. A 3D plot of structural superposition between 1yn4_A and 1m4v_A (SCOP domain: d1m4va2) that corresponds to the fourth colored bar in Figure 1. The level of sequence identity between proteins Seq_ID: 14%, and the level of structure similarity LGA_S: 75%.

Figure 3 (colored bar plot not reproduced; right-hand columns shown).

PDB       Seq_ID   LGA_S
d1f77a2   100.00   100.00
d1ewca2    99.12    99.86
d1esfa2    40.74    92.06
d1lo5d2    39.81    91.93
d1hqrd2    30.91    90.73
d1et9a2    34.26    90.51
d1bxta2    32.73    88.99
d1goza2    33.03    88.88
d1fnua2    36.94    88.08
d1uupa2    39.09    87.87
d1ty0a2    40.00    87.59
d1ck1a2    29.09    87.30
d1aw7a2    20.20    80.56
d1sebd2    32.32    79.15
d1v1pa2    24.49    78.39
d2tssa2    20.41    74.43
d1m4va2    19.59    74.17
1yn5_A     18.56    60.91
1yn3_A     17.78    57.20
1yn4_A     13.33    57.17

Figure 3. The results from the analysis of structure similarities between EAP domains from S. aureus and proteins from the SCOP superfamily of superantigen toxins (same domains as in Figure 1). SCOP domain d1f77a2 serves as a frame of reference for this comparison. The coloring scheme is the same as in Figures 1 and 2.

Figure 4 (rows paired with domain names in the order in which both were listed).

Cluster    Name
                     .#..#.#####...##.#######################...########...#####...###############...####.
Cluster:1  d1f77a2   .G..V.VDGIQ...RT.KKNVTLQELDIKIRKILSDKYKI...KGLIEFDM...YSFDI...YEIDKIYEDNKTLKS...DVNL.
Cluster:1  d1ck1a2   .L..V.ENKRN...QT.KKSVTAQELDIKARNFLINKKNL...TGYIKFIE...FWYDM...SKYLMIYKDNKMVDS...EVHL.
Cluster:1  d1goza2   .T..V.EDGKN...QT.KKKVTAQELDYLTRHYLVKNKKL...TGYIKFIE...FWYDM...SKYLMMYNDNKMVDS...EVYL.
Cluster:1  d1bxta2   .T..V.EDNEN...TT.KKQVTVQELDCKTRKILVSRKNL...TGYIKFIE...FWYDM...SKYLMLYNDNKTVSS...EVHL.
Cluster:1  d1uupa2   .V..V.IDGIQ...ET.KKMVTAQELDYKVRKYLTDNKQL...TGYIKFIP...FWFDF...SKYLMIYKDNETLDS...EVYL.
Cluster:1  d1fnua2   .V..V.IDGIQ...ET.KKMVTAQELDYKVRKYTIDNKQL...TGYIKFIP...FWFDF...SKYLMIYKDNETLDN...EVYL.
Cluster:1  d1esfa2   .P..L.LDGKQ...KT.KKNVTVQELDLQARRYLQEKYNL...RGLIVFHT...VNYDL...NTLLRIYRDNKTINS...DIYL.
Cluster:1  d1ewca2   .G..V.VDGIQ...RT.KKNVTLQELDIKIRKILSDKYKI...KGLIEFDM...YSFDI...YEIDKIYEDNKTLKS...DVNL.
Cluster:1  d1ty0a2   .V..L.IDGVQ...KI.KPIFTIQEFDFKIRQYLMQTYKI...KGQLEIAI...ESFNL...SDIFKKYKDNKTINM...DIYL.
Cluster:1  d1et9a2   .P..V.DKSKQ...TV.KPKVTAQEVDIKVRKLLIKKYDI...KGTVTLDL...IVFDL...NSMLKIYSNNERIDS...DVSI.
Cluster:1  d1lo5d2   .P..L.LDGKQ...KT.KKNVTVQELDLQARRYLQEKYNL...RGLIVFHT...VNYDL...NTLLRIYRDNKTINS...AIYL.
Cluster:1  d1sebd2   .T..V.EDGKN...QT.KKKVTAQELDYLTRHYLVKNKKL...TGYIKFIE...FWYDM...SKYLMMYNDNKMVDS...EVYL.
Cluster:1  d1hqrd2   .L..L.ISGES...IL.KDIVTFQEIDFKIRKYLMDNYKI...SGRIEIGT...EQIDL...SDIFAKYKDNRIINM...DIYL.

Cluster    Name
                     ...####.......#####....#############.........#################..........###########..
Cluster:2  d1m4va2   ...VIKK.......YIYKE....KELDFKLRQYLIQ.........KIKVIMKDGGYYTFELN..........DGRNIEKMEAN..
Cluster:2  d2tssa2   ...KVKV.......KFDKK....STLDFEIRHQLTQ.........YWKITMNDGSTYQSDLS..........NIDEIKTIEAE..
Cluster:2  d1aw7a2   ...KVKV.......KFDKK....STLDFEIRHALTQ.........YWKITMNDGSTYQSDLS..........NIDEIKTIEAE..
Cluster:2  d1v1pa2   ...FVNK.......LIQKE....KELDFKIRQQLVN.........KIIINLKDENKVEIDLG..........NSKDIRGISVT..
Cluster:2  1yn3_A    ...TITV.......TFNKN....KDLEGKVKSVLES.........KYTVNFKNGTKKVIDLK..........NSSDIKSININ..
Cluster:2  1yn4_A    ...TISV.......VFPEN....QEIDSKVKNELAS.........TYTLTLNDGNKKVVNLK..........DPSTIKQIQIV..
Cluster:2  1yn5_A    ...TIAV.......NLPKD....LDLGNKVKALLYD.........VYTITWKDGSKKEVDLK..........DSNSIKQIDIN..

Figure 4. STRALCP clustering applied to the same set of 20 structures as in Figures 1 and 3. STRALCP calculations were performed using default parameters (LGA_S = 60%, DIST = 5 Å). Each row begins with the cluster number, followed by the domain name and the set of amino acids that are extracted from detected structurally conserved spans. Dots indicate regions that structurally deviate in at least one pairwise comparison between members of the cluster. Note: dots do not indicate the actual number of residue pairs between detected spans. They are introduced for formatting purposes only.

Performance

As described in the Methods section, the STRALCP clustering approach consists of two steps: (i) calculating all-against-all structural alignments using the LGA program, and (ii) extracting structurally conserved regions from the calculated alignments and using them to group proteins accordingly. The first step is the most CPU-expensive. For example, on a single Linux workstation equipped with an AMD-64 5000 dual core processor, the calculations of all-against-all pairwise structural alignments for the 20 structures discussed above (3 EAP domains and 17 domains from SCOP) lasted about 10 min, while the clustering (step ii) was completed in less than 10 s.

In order to estimate the accuracy of the STRALCP clustering approach, we have performed comparisons with the SCOP (ver. 1.71) classification. We tested STRALCP calculations on domains from 25 different SCOP folds: a.5, a.7, a.8, a.24, a.29, a.137, b.2, b.42, b.43, b.68, b.80, b.85, c.8, c.51, c.56, d.52, d.68, d.79, d.110, d.129, f.1, f.4, f.23, g.41, h.4. We have selected these folds as a representative sample from all 7 SCOP classes with an additional requirement that each fold consists of at least four superfamilies. In the SCOP database ver. 1.71, there are only 63 such folds that satisfy this requirement. In total, our benchmark set consisted of Nd = 4620 domains from Nf = 343 families, and Ns = 243 superfamilies. Complete sets of results from this experiment can be found at the server website: http://as2ts.llnl.gov/AS2TS/STRALCP/.

In Figure 5, we present the results from STRALCP calculations applied to 24 domains from SCOP fold a.8 (immunoglobulin/albumin-binding domain-like). In Figure 5a, we use SCOP fold a.8 as an example to show some of the details from the hierarchical dependencies among structures calculated using LGA_S as a single measure for clustering. In Figure 5b, we show how the structures from fold a.8 can be automatically separated using our STRALCP (multi-criteria-based clustering) approach. This example shows that by using STRALCP we can clearly separate proteins into appropriate clusters that correspond with a high agreement to the defined SCOP families (see Figure 5b, right column; SCOP family codes).

In order to assess the accuracy of our clustering approach, we estimated the differences between SCOP (ver. 1.71) and STRALCP clustering (for example on the level of SCOP families) by introducing the following measure. Let:

- Nc: the number of created clusters,
- Cf(i): the number of different families clustered together within the i-th cluster.

The score indicating the misclustering effect MC (when domains from different SCOP families are grouped together) can be calculated using the formula:

$$MC = \left(1.0 - \frac{1}{Nc}\sum_{i=1}^{Nc}\frac{1}{Cf(i)}\right)\times 100.0$$

The range of this measure is 0.0 ≤ MC < 100.0, where 0.0 indicates no misclustering (i.e. agreement with SCOP families separation).
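As a hedged illustration of the MC measure just defined, the short Python sketch below computes it from two label mappings. The function name `misclustering_score`, the dictionary-based input format and the toy domain and family labels are assumptions made for this example only; they are not taken from the STRALCP server or from the benchmark data.

```python
from collections import defaultdict
from typing import Dict, Hashable


def misclustering_score(cluster_of: Dict[str, Hashable],
                        family_of: Dict[str, Hashable]) -> float:
    """MC = (1.0 - (1/Nc) * sum_i 1/Cf(i)) * 100.0.

    cluster_of maps each domain to its cluster in the scheme being assessed;
    family_of maps each domain to its family in the reference scheme.
    Cf(i) is the number of distinct reference families inside cluster i.
    """
    families_in_cluster = defaultdict(set)
    for domain, cluster in cluster_of.items():
        families_in_cluster[cluster].add(family_of[domain])
    nc = len(families_in_cluster)
    if nc == 0:
        raise ValueError("no clusters given")
    mean_inverse = sum(1.0 / len(fams) for fams in families_in_cluster.values()) / nc
    return (1.0 - mean_inverse) * 100.0


# Toy labels (hypothetical): cluster 1 mixes two families, clusters 2 and 3 are pure.
clusters = {"dom1": 1, "dom2": 1, "dom3": 2, "dom4": 3}
families = {"dom1": "famA", "dom2": "famB", "dom3": "famC", "dom4": "famC"}

mc = misclustering_score(clusters, families)
print(round(mc, 2))  # 16.67
```

In the toy data, family famC is split across clusters 2 and 3, yet neither cluster contributes to MC: the measure only counts clusters that merge different reference families, not families that are divided among several clusters.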
The MC measure allows the comparison of different clustering schemes by their agreement in separating proteins from different clusters. The goal of this measure is not to calculate how many domains are clustered differently, but rather how many of the created clusters are compromised (proteins that are separated in another clustering scheme being merged). The results from the evaluation of the differences between SCOP and STRALCP clusters at the level of SCOP families showed that on average the level of misclustering (MC) is 3%. This suggests that the proposed strictly structure-based clustering method can be considered robust in that it detects relationships at the family level in good agreement with the manually maintained SCOP database.

Cluster   Domain    SCOP family
3         d1nu9c1   a.8.6.1
3         d1nu9f1   a.8.6.1
3         d1nu7d1   a.8.6.1
3         d1nu7h1   a.8.6.1
4         d1nu9f2   a.8.6.1
4         d1nu7h2   a.8.6.1
4         d1nu9c2   a.8.6.1
4         d1nu7d2   a.8.6.1
1         d1oksa_   a.8.5.1
1         d1r4ga_   a.8.5.1
1         d1bdc__   a.8.1.1
1         d2spza_   a.8.1.1
1         d1h0ta_   a.8.1.1
1         d1deeh_   a.8.1.1
2         d1ud0c_   a.8.4.1
2         d1ud0d_   a.8.4.1
2         d1ud0a_   a.8.4.1
2         d1ud0b_   a.8.4.1
6         d1htya1   a.8.3.1
6         d1r34a1   a.8.3.1
5         d1dkyb1   a.8.4.1
5         d1dkya1   a.8.4.1
5         d1dkza1   a.8.4.1
5         d1dkxa1   a.8.4.1

Figure 5. (a) Dendrogram showing the results of an LGA_S-based (single measure) clustering of 24 SCOP domains from fold a.8. Each code (entry_family) represents one protein from the SCOP classification: entry and family number. We used the R package (version 2.1.1; http://www.r-project.org/) for the hierarchical clustering and visualization of calculated LGA_S results from all-against-all structure comparisons. (b) Clustering created using the STRALCP algorithm with default cutoff LGA_S = 60%.

DISCUSSION

As discussed in Ref. (13), a strategy combining information from both sequence and structure comparisons would be expected to perform better than either method alone. However, the analysis of the clustering approach applied to the benchmark set of 25 SCOP folds leads to the encouraging conclusion that the STRALCP algorithm, even if it is based purely on structure comparisons, exhibits a low (on average 3%) misclustering effect: domains from different SCOP families were clustered separately.

It is important to keep in mind that a purely structure-based approach to clustering may result in two proteins that are identical in sequence being clustered separately if the two structures differ in conformation; we observed that the STRALCP algorithm is able to detect the structural differences between domains from the same SCOP family and cluster them separately. It is for this reason that our clustering approach may produce more clusters than the number of SCOP families. For example, the family a.8.6.1 (Figure 5b) was separated by STRALCP into two clusters: cluster3 (Staphylocoagulase first domain) and cluster4 (Staphylocoagulase second domain), and the family a.8.4.1 was divided into two clusters: cluster5 (DnaK domain from Escherichia coli) and cluster2 (DnaK domain from Rat). The STRALCP algorithm will also group proteins in different clusters if they differ significantly in length or if multi-domain structures are in different conformations (e.g. 'open' and 'closed' versions of the same protein). We can also observe additional subclustering of protein families when the criteria for structure comparison are sufficiently stringent (e.g. a higher LGA_S cutoff is introduced). We consider this ability beneficial to the STRALCP approach. It provides valuable information about the regions that are structurally in the same conformation, which could be useful in various studies and classification schemes.
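The effect of a more stringent cutoff can be illustrated with a small sketch. It groups structures into connected components of a graph whose edges join pairs scoring at or above the cutoff (single-linkage grouping via union-find). This is a deliberate simplification and not the multi-criteria STRALCP procedure; the function name, the structure labels and the LGA_S-like scores are hypothetical.

```python
def cluster_by_cutoff(names, similarity, cutoff):
    """Group names into connected components of pairs with score >= cutoff.

    names:      list of structure labels
    similarity: dict mapping (name_a, name_b) to a percentage score
    cutoff:     minimum score for two structures to be linked
    """
    parent = {name: name for name in names}

    def find(x):
        # Follow parent links to the representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Link every pair whose similarity reaches the cutoff.
    for (a, b), score in similarity.items():
        if score >= cutoff:
            parent[find(a)] = find(b)

    # Collect members by representative and return a deterministic ordering.
    groups = {}
    for name in names:
        groups.setdefault(find(name), []).append(name)
    return sorted(sorted(group) for group in groups.values())


# Hypothetical pairwise scores for four structures.
toy_names = ["A", "B", "C", "D"]
toy_scores = {
    ("A", "B"): 85.0, ("A", "C"): 65.0, ("B", "C"): 62.0,
    ("A", "D"): 35.0, ("B", "D"): 30.0, ("C", "D"): 40.0,
}

print(cluster_by_cutoff(toy_names, toy_scores, 60.0))  # [['A', 'B', 'C'], ['D']]
print(cluster_by_cutoff(toy_names, toy_scores, 80.0))  # [['A', 'B'], ['C'], ['D']]
```

With these toy scores, a cutoff of 60 yields two clusters and a cutoff of 80 yields three, mirroring the additional subclustering described above.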
The separation of similar or identical proteins that are in different structural conformations could be reduced by introducing a sequence similarity analysis into the STRALCP algorithm. However, in this study, in order to detect the limits of purely structure-based approaches, we do not include sequence information in the scoring and clustering algorithm. The sequence-based analysis may be considered as an option in future development efforts.

ACKNOWLEDGEMENTS

This work was performed under the auspices of the US Department of Energy by the University of California, Lawrence Livermore National Laboratory under Contract No. W-7405-Eng-48. B.K. was supported by a DOE CSGF fellowship under grant number DE-FG02-97ER25308. The design and development of the described system was supported by LLNL LDRD grant 04-ERD-068 to A.Z. Funding to pay the Open Access publication charges for this article was provided by the US Department of Energy.

Conflict of interest statement. None declared.

REFERENCES

1. Dietmann,S. and Holm,L. (2001) Identification of homology in protein structure classification. Nat. Struct. Biol., 8, 953.
2. Redfern,O., Grant,A., Maibaum,M. and Orengo,C. (2005) Survey of current protein family databases and their application in comparative, structural and functional genomics. J. Chromatogr. B, 815, 97–107.
3. Berman,H.M., Westbrook,J., Feng,Z., Gilliland,G., Bhat,T.N., Weissig,H., Shindyalov,I.N. and Bourne,P.E. (2000) The protein data bank. Nucleic Acids Res., 28, 235–242.
4. Murzin,A.G., Brenner,S.E., Hubbard,T. and Chothia,C. (1995) SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol., 247, 536–540.
5. Orengo,C.A., Michie,A.D., Jones,S., Jones,D.T., Swindells,M.B. and Thornton,J.M. (1997) CATH – a hierarchic classification of protein domain structures. Structure, 5, 1093–1108.
6. Holm,L. and Sander,C. (1993) Protein structure comparison by alignment of distance matrices. J. Mol. Biol., 233, 123–138.
7. Holm,L. and Sander,C. (1996) Mapping the protein universe. Science, 273, 595–603.
8. Shindyalov,I.N. and Bourne,P.E. (1998) Protein structure alignment by incremental combinatorial extension (CE) of the optimal path. Protein Eng., 11, 739–747.
9. Domingues,F.S., Rahnenfuhrer,J. and Lengauer,T. (2004) Automated clustering of ensembles of alternative models in protein structure databases. Protein Eng., 17, 537–543.
10. Rogen,P. and Fain,B. (2003) Automatic classification of protein structure by using Gauss integrals. Proc. Natl Acad. Sci. USA, 100, 119–124.
11. Aung,Z. and Tan,K.L. (2005) Automatic 3D protein structure classification without structural alignment. J. Comp. Biol., 12, 1221–1241.
12. Frishman,D. and Argos,P. (1995) Knowledge-based secondary structure assignment. Proteins Struct. Funct. Genet., 23, 566–579.
13. Cheek,S., Qi,Y., Krishna,S., Kinch,L. and Grishin,N. (2004) SCOPmap: automated assignment of protein structures to evolutionary superfamilies. BMC Bioinformatics, 5, 197.
14. Altschul,S.F., Madden,T.L., Schaffer,A.A., Zhang,J., Zhang,Z., Miller,W. and Lipman,D.J. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25, 3389–3402.
15. Ortiz,A.R., Strauss,C.E. and Olmea,O. (2002) MAMMOTH (matching molecular models obtained from theory): an automated method for model comparison. Protein Sci., 11, 2606–2621.
16. Huan,J., Wang,W., Washington,A., Prins,J., Shah,R. and Tropsha,A. (2004) Accurate classification of protein structural families using coherent subgraph analysis. Pacific Symp. Biocomput., 9, 411–422.
17. Zemla,A. (2003) LGA – a method for finding 3D similarities in protein structures. Nucleic Acids Res., 31, 3370–3374.
18. Geisbrecht,B.V., Hamaoka,B.Y., Perman,B., Zemla,A. and Leahy,D.J. (2005) The crystal structures of EAP domains from Staphylococcus aureus reveal an unexpected homology to bacterial superantigens. J. Biol. Chem., 280, 17243–17250.
19. Zemla,A., Ecale Zhou,C., Slezak,T., Kuczmarski,T., Rama,D., Torres,C., Sawicka,D. and Barsky,D. (2005) AS2TS system for protein structure modeling and analysis. Nucleic Acids Res., 33, W111–W115.
20. Hakansson,M., Petersson,K., Nilsson,H., Forsberg,G., Bjork,P., Antonsson,P. and Svensson,L.A. (2000) The crystal structure of staphylococcal enterotoxin H: implications for binding properties to MHC class II and TcR molecules. J. Mol. Biol., 302, 527–537.
21. Al-Shangiti,A., Naylor,C., Nair,S., Briggs,D., Henderson,B. and Chain,B. (2004) Structural relationships and cellular tropism of staphylococcal superantigen-like proteins. Infect. Immun., 72, 4261–4270.
22. Arcus,V.L., Langley,R., Proft,T., Fraser,J.D. and Baker,E.N. (2002) The three-dimensional structure of a superantigen-like protein, SET3, from a pathogenicity island of the Staphylococcus aureus genome. J. Biol. Chem., 277, 32274–32281.
23. Earhart,C.A., Mitchell,D.T., Murray,D.L., Pinheiro,D.M., Matsumura,M., Schlievert,P.M. and Ohlendorf,D.H. (1998) Structures of five mutants of toxic shock syndrome toxin-1 with reduced biological activity. Biochemistry, 37, 7194–7202.
24. Prasad,G.S., Radhakrishnan,R., Mitchell,D.T., Earhart,C.A., Dinges,M.M., Cook,W.J., Schlievert,P.M. and Ohlendorf,D.H. (1997) Refined structures of three crystal forms of toxic shock syndrome toxin-1 and of a tetramutant with reduced activity. Protein Sci., 6, 1220–1227.