SAM-T08, HMM-based protein structure prediction

2009, Nucleic Acids Research

The SAM-T08 web server is a protein structure prediction server that provides several useful intermediate results in addition to the final predicted 3D structure: three multiple sequence alignments of putative homologs using different iterated search procedures, prediction of local structure features including various backbone and burial properties, calibrated E-values for the significance of template searches of PDB, and residue-residue contact predictions. The server has been validated as part of the CASP8 assessment of structure prediction as having good performance across all classes of predictions. The SAM-T08 server is available at http://compbio.soe.ucsc.edu/SAM_T08/T08-query.html

Nucleic Acids Research, 2009, Vol. 37, Web Server issue, W492–W497
doi:10.1093/nar/gkp403
Published online 29 May 2009

Kevin Karplus*
Department of Biomolecular Engineering, Baskin School of Engineering, University of California, Santa Cruz, CA 95064, USA

Received January 30, 2009; Revised April 20, 2009; Accepted May 2, 2009

*To whom correspondence should be addressed. Tel: +1 831 459 4250; Email: [email protected]

© 2009 The Author(s). This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

Downloaded from https://academic.oup.com/nar/article/37/suppl_2/W492/1145388 by guest on 10 June 2022

STRUCTURE PREDICTION SERVER

The SAM-T08 web server is a protein structure prediction server, the latest in a series of servers that started in 1999 with SAM-T99 (1–5). The input to the server is an amino acid sequence in FASTA format (limited to 700 residues), and the primary output is a 3D model in PDB format. In addition to providing 3D models, the SAM-T08 web site provides a large number of intermediate results, which are often interesting in their own right: multiple sequence alignments (MSAs) of putative homologs, prediction of local structure features, lists of potential templates of known structure, alignments to templates and residue–residue contact predictions.

The example sequence used in this article, and provided by the server if the user does not supply one, is T0437, one of the CASP8 prediction targets. An ensemble of NMR structures for T0437 is now available in PDB file 2k3i (6). The figures in this article are taken from our CASP8 prediction made on 6 June 2008, before the NMR structures were released. Full details of the prediction can be found at http://www.soe.ucsc.edu/~karplus/casp8/T0437/decoys/SAM_T08/

MSAs AND SEQUENCE LOGOS

Before starting to make MSAs and hidden Markov models (HMMs), the web site first does a quick blastp search of a non-redundant version of the PDB dataset (downloaded weekly from Dunbrack’s PISCES server) (7,8). This search is not used in subsequent steps, but can be useful for determining whether there are any very close templates and whether those templates are subsequently used in the model building.

The process proper starts by doing three different iterated searches to find and align putative homologs from NR, NCBI’s non-redundant database of protein sequences (9). The first search, T06 from CASP7 in 2006, is the most sensitive, but can become contaminated with unrelated sequences 0.5% of the time. The next search, T04 from CASP6 in 2004, is slightly less sensitive, but has about the same probability of contamination. The T04 and T06 searches use similar iterations and usually produce similar results, but occasionally come up with different alignments or different sets of homologs, due to differences in parameter settings. The T2K search, from CASP4 in 2000, is the least sensitive, and so includes mainly closely related homologs. The less sensitive search is often useful for choosing templates when there are many homologous proteins of known structure.

The MSAs are provided in machine-readable format [A2M (10)], and in a somewhat more human-readable HTML format. (We use NCBI Entrez Utilities to retrieve taxonomy information about the sequences when making the HTML files. Because the XML files we retrieve are truncated by Entrez Utilities when they get too long, crashing the standard perl XML parser we are using to read them, our HTML files are sometimes not created when too many sequences are found. This is the most obvious known bug in the server.)

Because there are often over 20 000 sequences in the multiple alignment, trying to view the alignments in traditional ways is often not very illuminating. To alleviate this problem, the server provides sequence logos for the alignments (Figure 1), where the height of each bar indicates how conserved the residues are and the letters in the bar give the probability distribution for the amino acids at that position. The pattern of conserved residues is often of use for making conjectures about function and binding sites, even when no confident tertiary structure prediction can be made. All three searches are provided separately, so that the sequence logos can be examined for contamination and results checked for consistency. The searches are combined later in the process.

Figure 1. Sequence logos from MSAs for CASP8 target T0437. In both sequence logos, the residues of the target sequence are consistent with the conserved residues in the MSA, indicating that the iterated searches were not contaminated by unrelated proteins. (a) Sequence logo for SAM_T06 MSA, based on 119 sequences. The SAM_T04 alignment is nearly identical. The sequence logo shows which residues are most highly conserved in this family of proteins. The groups of conserved residues are typical of motifs that are preserved through evolution. (b) Sequence logo for SAM_T2K MSA, based on 99 sequences. Note that R96 is conserved in the narrower set of homologs found by T2K, but is not conserved in the T06 alignment.

LOCAL STRUCTURE PREDICTION

After the iterated searches, the MSAs are used as inputs to neural networks that predict various local structure properties: 12 backbone structure alphabets and three burial structure alphabets. The 12 backbone alphabets are str4, str2, alpha, bys, pb, n_notor, n_notor2, n_sep, o_notor, o_notor2, o_sep and dssp_ehl2. Many of these alphabets have been described previously (11–13), but some are new and are so far described in detail only on the Frequently Asked Questions (FAQ) page for the web site.

The most familiar is the dssp_ehl2 alphabet, a reduction of the DSSP alphabet (14) to just three letters (E for beta strands and bridges, H for helices and L for everything else). The str2 alphabet, which has been our most valuable backbone alphabet, is an extension of DSSP to distinguish between different types of beta strands (Figure 2). The str4 alphabet is an attempt to use different ways of classifying loop residues and strand residues, but turned out to be somewhat less useful than str2.
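The reduction from DSSP codes to a three-letter E/H/L alphabet can be illustrated with a short sketch. The mapping below is an assumption made for illustration: the text says only that E covers strands and bridges and H covers helices, so grouping DSSP codes G and I under H is a guess, and the function name is hypothetical rather than part of the SAM tools.

```python
# Illustrative reduction of a DSSP string to a three-letter E/H/L alphabet.
# ASSUMPTION: DSSP codes E and B map to E (strands and bridges); H, G and I
# map to H (helices); every other code, including blanks, maps to L.
# The exact dssp_ehl2 mapping used by the server may differ.
DSSP_TO_EHL = {"E": "E", "B": "E", "H": "H", "G": "H", "I": "H"}


def reduce_dssp_to_ehl(dssp: str) -> str:
    """Return the E/H/L reduction of a per-residue DSSP string."""
    return "".join(DSSP_TO_EHL.get(code, "L") for code in dssp)


if __name__ == "__main__":
    print(reduce_dssp_to_ehl("HHHGTT EEB-S"))
```

Keeping the mapping in a single dictionary makes the assumed grouping easy to change if the FAQ page specifies a different one.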
The alpha alphabet classifies residues according to their Cα-Cα-Cα-Cα torsion angles, the bys alphabet is a classification of residues by φ and ψ angles by Bystroff (15), and the pb alphabet is de Brevern’s protein blocks (16).

Figure 2. Sequence logo for the prediction of the str2 backbone alphabet based on the T06 MSA. The strong predictions here are generally accurate, but the NMR models did not include the predicted hairpin turn around D24; the region before C26 had no structure in the NMR models. The predictions are often this good for simple globular proteins, but can be thrown off by metal binding sites and disulfide bridges, which are common in very small proteins.

The notor and sep alphabets classify residues according to the hydrogen bond at the N or O atom.
The notor and notor2 alphabets classify the H-bonds according to the (C_{i−1}, N_i, O_j, N_{j+1}) torsion angle for donor N_i and acceptor O_j, with special cases for alpha helices (i = j + 4) and for 3-10 helices and turns (i = j + 3). The notor2 alphabets have a few more special cases (for i = j + 5 and for common multiple hydrogen bond patterns). The sep alphabets classify the H-bonds according to the separation i − j.

The three burial alphabets predict the number of Cβ atoms within 14 Å (seven classes), the number of Cβ atoms within 8 Å that are at least nine residues apart along the chain (14 classes), and a somewhat more complicated count of nearby residues [near-backbone-11 (13), 11 classes]. The near-backbone-11 measure has been the most useful of these burial predictions (Figure 3). The burial alphabets are organized so that ‘A’ is the least buried class, with increasing burial as the letters go through the alphabet.

For each MSA and each structure alphabet, several outputs are provided: a table of the probability vector over the alphabet for each position in the sequence; a sequence logo summarizing the probability vectors, showing the prediction and strength of prediction at each position of the sequence (Figures 2 and 3); and a summary sequence giving the most probable letter at each position. For users wanting a quick approximate view of the local structure prediction, a consensus prediction of the three-state alphabet (E = strand, H = helix and L = loop) is provided. To aid in viewing the local structure predictions in the final tertiary prediction, rasmol scripts for coloring the model according to the predicted local structure are provided.

TEMPLATE SELECTION AND ALIGNMENT

Our templates come mainly from our template library, a large representative subset of PDB chains, for which we have precomputed a set of HMMs. We have separate template libraries for the different iterated search methods.
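The per-position outputs listed for each structure alphabet (probability vector, logo and summary sequence) are closely related, as the following minimal sketch shows. It assumes that each position is given as a dictionary from alphabet letter to probability, and it uses one common logo convention for bar height (log2 of the alphabet size minus the Shannon entropy). The server's own measure of prediction strength is not specified here, so both the data layout and the height formula are assumptions.

```python
from __future__ import annotations

import math


def summarize_position(probs: dict[str, float]) -> tuple[str, float]:
    """Return (most probable letter, information content in bits).

    ASSUMPTION: strength is measured as log2(alphabet size) minus the
    Shannon entropy, a common sequence-logo convention.
    """
    best = max(probs, key=probs.get)
    entropy = -sum(p * math.log2(p) for p in probs.values() if p > 0)
    return best, math.log2(len(probs)) - entropy


def summary_sequence(table: list[dict[str, float]]) -> str:
    """Most probable letter at each position of a probability table."""
    return "".join(summarize_position(probs)[0] for probs in table)


if __name__ == "__main__":
    # Hypothetical three-state predictions for two positions.
    table = [
        {"E": 0.8, "H": 0.1, "L": 0.1},
        {"E": 0.1, "H": 0.2, "L": 0.7},
    ]
    print(summary_sequence(table))
    for probs in table:
        print(summarize_position(probs))
```

A uniform vector gives a height of zero bits and a fully confident prediction gives log2 of the alphabet size, which is why tall bars in such a logo mark strong predictions.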
As of 28 January 2009, the template libraries contained 19 621, 17 732 and 15 967 chains for T2K, T04 and T06, respectively, while a non-redundant PDB set contained 36 643 chains.

Figure 3. Sequence logo for the prediction of near-backbone-11 burial based on the T06 MSA. The residues with A, B or C are predicted to be highly exposed, while those with H, I or J are predicted to be highly buried. The burial predictions are excellent for this target.

After the local structure predictions are done, the SAM (Sequence Alignment and Modeling) tool suite (17) is used to build HMMs from the MSAs and predicted local structures. The HMMs are used to search PDB for potential templates for structure prediction. HMMs in the template library are used to score the target sequence, and all the resulting scores are merged into a best-scores-all.html file that summarizes the best hits, sorted by E-value. The table also includes links to the PDB (18) and Proteopedia (19) web sites for each template, as well as links to the Structural Classification of Proteins [SCOP (20)] web site, when available.

The E-values are moderately well calibrated (off by no more than a factor of 10 in cross-validation tests, unpublished data), so E-values <0.01 indicate that a good structural template is available for at least part of the target protein, and E-values >1 indicate that the method will be using mainly ab initio and fragment methods to generate the structure and that the tertiary structure is thus much less reliable. It is important for users to check the E-values, as the method always produces a full-length model, even when no good template is available. For T0437, the template 2jz5A has an E-value of 8.6e-12, indicating a very confident similarity. Even the initial blastp over PDB finds this template, though the E-value is only 0.1, so the confidence is not as high.

For longer multi-domain proteins, the server may have a good template-based model for one domain, and poor, ab initio models for the others. In those cases, it is often wise to split the target into separate domains and predict them separately. The server does not do this automatically.
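The E-value guidance given above for the best template hit can be written as a small triage function. This is a sketch rather than part of the server: the function name and the labels are hypothetical, and the text does not say how to treat E-values between 0.01 and 1, so that band is simply reported as intermediate.

```python
def interpret_evalue(evalue: float) -> str:
    """Classify the best template E-value using the thresholds in the text.

    ASSUMPTION: the labels are illustrative, and the 0.01 to 1 band, which
    the text does not characterize, is reported as 'intermediate'.
    """
    if evalue < 0.0:
        raise ValueError("E-values cannot be negative")
    if evalue < 0.01:
        return "template"      # good template for at least part of the target
    if evalue > 1.0:
        return "ab_initio"     # mainly ab initio and fragment methods
    return "intermediate"


if __name__ == "__main__":
    # E-values quoted in the text for T0437: HMM search, then blastp.
    for e in (8.6e-12, 0.1):
        print(e, interpret_evalue(e))
```

Because a full-length model is always produced, a check of this kind is a reasonable first step before trusting the tertiary structure.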
For each of the top templates, the server provides several alignments between the target and the template, which are used in subsequent tertiary prediction, but which could also be useful for transferring information (such as binding site residues) from the templates to the target. Of the various alignments, the t06-local-str2+near-backbone-11-0.8+0.6+0.8-adpstyle5 alignments are generally the most reliable. These are constructed by local alignment to a three-track HMM that has an amino acid profile, str2 predictions and near-backbone-11 predictions as the three tracks, with track weights 0.8, 0.6 and 0.8, respectively, and using posterior decoding alignment (SAM parameter adpstyle=5). Although there are often better alignments in the pool, they do not come consistently from the same method.

Crude models are generated from the top alignments and superimposed in a PDB file. The undertakeralign.pdb.gz file can be viewed with any molecular modeling software to see what parts of the protein are coming from the templates and whether the templates agree on the structure of that portion of the protein. After the major alignments have been made, short gapless alignments (fragment lists) are made to provide reasonable local structures for building the final model.

CONTACT PREDICTION

We have two distinct ways of predicting what residues may be in contact: ab initio contact prediction using neural networks and information about correlated mutations in the MSAs (21), and distance constraints extracted from the best alignments, for use in constraining the tertiary structure prediction (22).

The neural network predictions are most useful when there are no templates found with E-value <1. The server presents three different neural network predictions. The 647_47 prediction is the network validated at CASP7 (21). The 730_47 prediction does not use any paired column statistics, but just local structure prediction at the individual residues.
The 648_17.730_47 prediction is a two-stage one that filters the 730_47 predictions using paired column statistics. In our testing, the two-stage method works best when the T06 MSA has enough diversity of sequences for correlated mutations to be detected. For ORFans and target sequences for which only very similar sequences are aligned in the T06 MSA, the 730_47 predictions are somewhat better (unpublished data).

The constraints extracted from alignments are most useful when templates are found with low E-values, as the constraints are used in model generation for selecting templates and to keep the models from drifting too far from the templates. The constraints have also been used for model quality assessment in evaluating models from other servers (13,22), but that application is not provided by the web server.

MODEL GENERATION

Finally, the undertaker program is run to generate an all-atom model using the templates, the local structure predictions, the distance constraints and the contact predictions. The prediction for our example is compared with an NMR structure in Figure 4. For compatibility with the CASP experiment, five models are produced: Model 1 is the polished model output from undertaker; Model 2 is the initial model after undertaker examines the templates but before attempting to remove clashes or gaps; and Models 3–5 are incomplete models based on simple side-chain replacement on the top three templates. A few other models are available in the ‘decoys’ subdirectory, including intermediate models in the optimization process, and models whose side chains have been repacked by Rosetta (23) or that have been energy-minimized by Gromacs (24), though neither of these programs is used for the primary output.
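One of the inputs to model generation is the set of distance constraints taken from target-template alignments. The sketch below illustrates the general idea only. It is not undertaker's procedure (see ref. 22 for the method actually used), and the function name, the 8 Å cutoff, the minimum sequence separation and the data layout are all assumptions chosen for the example.

```python
import math
from itertools import combinations


def constraints_from_alignment(aligned_pairs, template_ca,
                               max_dist=8.0, min_sep=2):
    """Derive target-residue distance constraints from one alignment.

    aligned_pairs: list of (target_index, template_index) tuples.
    template_ca:   dict mapping template_index to an (x, y, z) C-alpha
                   coordinate in Angstroms.
    ASSUMPTION: max_dist and min_sep are illustrative values, not the
    server's settings.
    Returns a list of (target_i, target_j, distance) tuples.
    """
    constraints = []
    for (ti, pi), (tj, pj) in combinations(aligned_pairs, 2):
        if abs(ti - tj) < min_sep:
            continue  # skip pairs that are adjacent along the chain
        dist = math.dist(template_ca[pi], template_ca[pj])
        if dist <= max_dist:
            constraints.append((ti, tj, dist))
    return constraints


if __name__ == "__main__":
    # Hypothetical four-residue template with residues spaced along x.
    template_ca = {0: (0.0, 0.0, 0.0), 1: (3.8, 0.0, 0.0),
                   2: (7.6, 0.0, 0.0), 3: (30.0, 0.0, 0.0)}
    aligned = [(10, 0), (11, 1), (12, 2), (13, 3)]
    print(constraints_from_alignment(aligned, template_ca))
```

Constraints of this form tie pairs of target residues to distances seen in a template, which is one way a model can be kept from drifting far from the templates it was built on.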
All the results, including all intermediate files, are kept available on the web site for at least 1 week, and can be downloaded as a gzipped tarball from a link at the bottom of the page.

VALIDATION AT CASP8

The SAM-T08 server was validated as part of the CASP8 protein structure prediction experiment in summer 2008. CASP (Critical Assessment of Structure Prediction) is a community-wide experiment held every 2 years. Predictors are given the sequences of proteins whose structures are being solved, but whose structures have not been publicly released, and are required to register their predictions within 3 days for servers, or 3 weeks for human-assisted methods. The predicted models are compared with the experimental models to determine which methods are really working. Detailed results of the testing can be found on the CASP8 web page http://predictioncenter.org/casp8/results.cgi as well as on several unofficial evaluation sites (list available at http://www.reading.ac.uk/bioinf/CASP8).

Figure 4. Predicted model in blue and NMR structure in red superimposed. Only residues C26-L98 are shown, since the NMR models had no structure before C26. Residues V63-Q73 (shown in green) are misaligned in the prediction, resulting in the gap before V74, instead of the proper hairpin. Residues S56–G60 (shown in orange) were not evaluated at CASP, because the ensemble of NMR models had quite different structures for those five residues. The picture was generated by the rasmol molecular viewing software (25). The SAM-T08-human prediction was substantially better than the SAM-T08-server prediction, but it was based on Zhang-Server_TS5, the second best server model (Zhang-Server_TS2 was the best server model in our evaluation).
Several different metrics have been used to evaluate the quality of predictions, and rankings of servers depend heavily on which metrics are used, what set of targets are compared, and whether whole-chain comparisons or domain-based comparisons are made. Although the SAM-T08 server was not the best server on the commonly used metrics that measure just the positions of the Cα atoms (GDT_TS and TM-score, for example), it did quite well (ranking 2–21 out of 70 servers overall, depending on the evaluation used). In Zhang’s ranking of the servers by TM-score of domains on the hard targets (http://zhang.bioinformatics.ku.edu/casp8/13D.html), the SAM-T08-server ranks third, after Zhang-Server and BAKER-ROBETTA, while on the easy targets (where differences are smaller and many servers produce almost identical models), SAM-T08-server ranks 21st. If hydrogen bond scoring is included, SAM-T08-server moves to second place overall, and fourth on the easy targets. Using a contact-based measure, Nick Grishin ranked SAM-T08-server fifth or sixth on all targets (http://prodata.swmed.edu/CASP8/evaluation/DomainsAll.First.html).

In the official evaluations, the SAM-T08 models were seen to have unusually good stereochemistry for homology models, even though the Cα traces were not the best (based on assessors’ presentations at the CASP8 conference, not published yet). On the common backbone accuracy measures, the SAM-T08 server ranked 9th through 14th among servers (http://predictioncenter.org/casp8/groups_analysis.cgi), except on the ‘high-accuracy’ server targets, where it was in the middle of the pack (31st out of 70). The SAM-T08 server generally ranked less well on the very easy targets (where most of the methods produced almost indistinguishable results) and better on the harder targets.
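For readers unfamiliar with the Cα-based metrics mentioned above, the following sketch shows the arithmetic behind GDT_TS: the average, over cutoffs of 1, 2, 4 and 8 Å, of the percentage of Cα atoms within that cutoff of their experimental positions. It is a deliberate simplification. The real measure searches for the best superposition at each cutoff, whereas this sketch assumes the model and experimental coordinates are already superimposed and paired residue by residue; the function name is hypothetical.

```python
import math


def gdt_ts_fixed_superposition(model_ca, native_ca,
                               cutoffs=(1.0, 2.0, 4.0, 8.0)):
    """Approximate GDT_TS (0 to 100) for pre-superimposed C-alpha traces.

    ASSUMPTION: coordinates are already superimposed and paired one-to-one;
    the true GDT_TS maximizes each term over superpositions, so this value
    is a lower bound on it.
    """
    if len(model_ca) != len(native_ca) or not native_ca:
        raise ValueError("traces must be non-empty and of equal length")
    dists = [math.dist(m, n) for m, n in zip(model_ca, native_ca)]
    fractions = [sum(d <= c for d in dists) / len(dists) for c in cutoffs]
    return 100.0 * sum(fractions) / len(cutoffs)


if __name__ == "__main__":
    # Hypothetical four-residue example with deviations of
    # 0.5, 1.5, 3.0 and 10.0 Angstroms.
    native = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
    model = [(0.0, 0.5, 0.0), (1.0, 1.5, 0.0), (2.0, 3.0, 0.0), (3.0, 10.0, 0.0)]
    print(gdt_ts_fixed_superposition(model, native))
```

Because only Cα positions enter the calculation, a score of this kind says nothing about hydrogen bonds or stereochemistry, which is why rankings shift when those are included.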
Performance relative to other servers seemed to peak for those targets that had templates available, but for which finding and aligning the template was difficult, as we have focused our efforts most on fold recognition and alignment. The SAM-T08 server uses the same protocol for all targets, whether they have highly similar templates available or not, but the method is tuned for the difficult targets, rather than the easy ones.

With a few notable exceptions (such as target T0442 domain 2), the SAM-T08 server did substantially better than the older SAM-T06 server in all evaluations. The older SAM-T02 server does not produce models, just alignments, and had substantially poorer performance than either of the more recent servers. The selection of templates and alignments by the HMMs has not improved substantially: the models built directly from the top alignment (SAM-T08-server_TS3, SAM-T06-server_TS2 and SAM-T02-server_AL1) are of variable quality and do not show consistent improvement. The selection and optimization of models by undertaker, however, is showing substantial improvement from SAM-T02 to SAM-T06 to SAM-T08.

ACKNOWLEDGEMENTS

Over the years, dozens of people have contributed to the tools of the SAM-T08 servers. Some of the more notable contributors (in alphabetical order) include John Archie, Bret Barnes, Christian Barrett, Sugato Basu, Jonathan Casper, Melissa Cline, Mark Diekhans, Chris Dragon, Birong Hu, Richard Hughey, Rachel Karchin, Sol Katzman, Firas Khatib, Anders Krogh, Martin Madera, Yael Mandel-Gutfreund, Martin Paluszewski, George Shackelford, Kimmen Sjölander, Don Speck, Grant Thiltgen and Spencer Tu. Comments on drafts of the paper by John Archie, Richard Hughey, Thomas Juettemann, Josue Samayoa and Chirag Sharma are particularly appreciated.

FUNDING

National Institutes of Health (Grant 1 R01 GM06857001).

Conflict of interest statement. None declared.

REFERENCES

1. Karplus,K., Barrett,C., Cline,M., Diekhans,M., Grate,L. and Hughey,R. (1999) Predicting protein structure using only sequence information. Proteins, (Suppl. 3), 121–125.
2. Fischer,D., Barrett,C., Bryson,K., Elofsson,A., Godzik,A., Jones,D., Karplus,K., Kelley,L.A., MacCallum,R.M., Pawlowski,K. et al. (1999) CAFASP-1: critical assessment of fully automated structure prediction methods. Proteins, (Suppl. 3), 209–217.
3. Karplus,K., Karchin,R., Barrett,C., Tu,S., Cline,M., Diekhans,M., Grate,L., Casper,J. and Hughey,R. (2001) What is the value added by human intervention in protein structure prediction? Proteins, 45, 86–91.
4. Karplus,K., Karchin,R., Draper,J., Casper,J., Mandel-Gutfreund,Y., Diekhans,M. and Hughey,R. (2003) Combining local-structure, fold-recognition, and new-fold methods for protein structure prediction. Proteins, 53, 491–496.
5. Karplus,K., Katzman,S., Shackelford,G., Koeva,M., Draper,J., Barnes,B., Soriano,M. and Hughey,R. (2005) What’s new in protein-structure prediction for CASP6. Proteins, 61, 135–142.
6. Mills,J.L., Singarapu,K.K., Eletski,A., Sukumaran,D.K., Wang,D., Jiang,M., Ciccosanti,C., Xiao,R., Liu,J., Baran,M.C. et al. (2008) Solution NMR structure of protein yiiS from Shigella flexneri. Northeast Structural Genomics Consortium target SfR90, PDB file doi:10.2210/pdb2k3i/pdb
7. Altschul,S., Gish,W., Miller,W., Myers,E. and Lipman,D. (1990) Basic local alignment search tool. J. Mol. Biol., 215, 403–410.
8. Wang,G. and Dunbrack,R.L. Jr. (2003) PISCES: a protein sequence culling server. Bioinformatics, 19, 1589–1591.
9. Non-redundant Protein Database. (2008) Distributed by anonymous FTP from ftp.ncbi.nih.gov/blast/db/
10. Hughey,R. and Krogh,A. (1996) Hidden Markov models for sequence analysis: extension and analysis of the basic method. Comput. Appl. Biosci., 12, 95–107. Information on obtaining SAM is available at http://www.soe.ucsc.edu/research/compbio/sam.html
11. Karchin,R., Cline,M., Mandel-Gutfreund,Y. and Karplus,K. (2003) Hidden Markov models that use predicted local structure for fold recognition: alphabets of backbone geometry. Proteins, 51, 504–514.
12. Karchin,R., Cline,M. and Karplus,K. (2004) Evaluation of local structure alphabets based on residue burial. Proteins, 55, 508–518.
13. Archie,J. and Karplus,K. (2009) Applying undertaker cost functions to model quality assessment. Proteins, 75, 550–555.
14. Kabsch,W. and Sander,C. (1983) Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers, 22, 2577–2637.
15. Bystroff,C., Thorsson,V. and Baker,D. (2000) HMMSTR: a hidden Markov model for local sequence-structure correlations in proteins. J. Mol. Biol., 301, 173–190.
16. de Brevern,A.G., Etchebest,C. and Hazout,S. (2000) Bayesian probabilistic approach for predicting backbone structures in terms of protein blocks. Proteins, 41, 271–287.
17. Hughey,R., Karplus,K. and Krogh,A. (1999) SAM: sequence alignment and modeling software system, version 3. Technical Report UCSC-CRL-99-11, University of California, Santa Cruz, Computer Engineering, UC Santa Cruz, CA 95064, October 1999. http://www.soe.ucsc.edu/research/compbio/sam.html
18. Bernstein,F.C., Koetzle,T.F., Williams,G.J., Meyer,E.E., Brice,M.D., Rodgers,J.R., Kennard,O., Shimanouchi,T. and Tasumi,M. (1977) The Protein Data Bank: a computer-based archival file for macromolecular structures. J. Mol. Biol., 112, 535–542.
19. Hodis,E., Prilusky,J., Martz,E., Silman,I., Moult,J. and Sussman,J.L. (2008) Proteopedia - a scientific ‘wiki’ bridging the rift between three-dimensional structure and function of biomacromolecules. Genome Biol., 9, R121.
20. Murzin,A.G. (1996) Structural classification of proteins: new superfamilies. Curr. Opin. Struct. Biol., 6, 386–394.
21. Shackelford,G. and Karplus,K. (2007) Contact prediction using mutual information and neural nets. Proteins, 69, 159–164.
22. Paluszewski,M. and Karplus,K. (2009) Model quality assessment using distance constraints from alignments. Proteins, 75, 540–549.
23. Rohl,C.A., Strauss,C.E.M., Misura,K. and Baker,D. (2004) Protein structure prediction using Rosetta. Methods Enzymol., 383, 66–93.
24. van der Spoel,D., Lindahl,E., Hess,B., Groenhof,G., Mark,A.E. and Berendsen,H.J.C. (2005) GROMACS: fast, flexible and free. J. Comput. Chem., 26, 1701–1718.
25. Sayle,R. and Milner-White,E.J. (1995) RasMol: biomolecular graphics for all. Trends Biochem. Sci., 20, 374–376.