W492–W497 Nucleic Acids Research, 2009, Vol. 37, Web Server issue
doi:10.1093/nar/gkp403
Published online 29 May 2009
SAM-T08, HMM-based protein structure prediction
Kevin Karplus*
Department of Biomolecular Engineering, Baskin School of Engineering, University of California, Santa Cruz,
CA 95064, USA
Received January 30, 2009; Revised April 20, 2009; Accepted May 2, 2009
ABSTRACT
The SAM-T08 web server is a protein structure prediction server that provides several useful intermediate results in addition to the final predicted 3D
structure: three multiple sequence alignments of
putative homologs using different iterated search
procedures, prediction of local structure features
including various backbone and burial properties,
calibrated E-values for the significance of template
searches of PDB, and residue–residue contact predictions. The server has been validated as part of
the CASP8 assessment of structure prediction as
having good performance across all classes of predictions. The SAM-T08 server is available at http://compbio.soe.ucsc.edu/SAM_T08/T08-query.html
MSAs AND SEQUENCE LOGOS
Before starting to make MSAs and hidden Markov
models (HMMs), the web site first does a quick blastp
search of a non-redundant version of the PDB dataset
(downloaded weekly from Dunbrack’s PISCES server)
(7,8). This search is not used in subsequent steps, but
can be useful for determining whether there are any very
close templates and whether those templates are subsequently used in the model building.
The process proper starts by doing three different iterated searches to find and align putative homologs from
NR, NCBI’s non-redundant database of protein
sequences (9). The first search, T06 from CASP7 in
2006, is the most sensitive, but can become contaminated
with unrelated sequences 0.5% of the time. The next
search, T04 from CASP6 in 2004, is slightly less sensitive,
but has about the same probability of contamination. The
T04 and T06 searches use similar iterations and usually
produce similar results, but occasionally come up with
different alignments or different sets of homologs, due to
differences in parameter settings. The T2K search, from
CASP4 in 2000, is the least sensitive, and so includes
mainly closely related homologs. The less sensitive
search is often useful for help in choosing templates
when there are many homologous proteins of known
structure.
The MSAs are provided in machine-readable format
[A2M (10)], and in a somewhat more human-readable
HTML format. (We use NCBI Entrez Utilities to retrieve
taxonomy information about the sequences when making
the HTML files. Entrez Utilities truncates the XML files we retrieve
when they get too long, which
crashes the standard Perl XML parser we use to
read them, so our HTML files are sometimes not created
when too many sequences are found. This is the most
obvious known bug in the server.) Because there are
often over 20 000 sequences in the multiple alignment,
trying to view the alignments in traditional ways is often
not very illuminating. To alleviate this problem, the server
provides sequence logos for the alignments (Figure 1),
where the height of each bar indicates how conserved
STRUCTURE PREDICTION SERVER
The SAM-T08 web server is a protein structure prediction
server, the latest in a series of servers that started in 1999
with SAM-T99 (1–5). The input to the server is an amino
acid sequence in FASTA format (limited to 700 residues), and the primary output is a 3D model in PDB
format. In addition to providing 3D models, the SAM-T08 web site provides a large number of intermediate
results, which are often interesting in their own right: multiple sequence alignments (MSAs) of putative homologs,
prediction of local structure features, lists of potential
templates of known structure, alignments to templates
and residue–residue contact predictions.
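A query can be checked against the stated input constraints before submission. The following is a minimal sketch, assuming a single-record FASTA file; the function names `read_single_fasta` and `check_length` are hypothetical and are not part of the server.

```python
MAX_RESIDUES = 700  # length limit stated for the server input


def read_single_fasta(text):
    """Return (header, sequence) from FASTA text holding exactly one record."""
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("FASTA input must start with a '>' header line")
    header = lines[0][1:].strip()
    body = lines[1:]
    if any(ln.startswith(">") for ln in body):
        raise ValueError("expected exactly one sequence record")
    # Sequence lines are concatenated; case is normalized to upper.
    return header, "".join(body).upper()


def check_length(sequence, limit=MAX_RESIDUES):
    """True if the sequence is non-empty and within the residue limit."""
    return 0 < len(sequence) <= limit
```

For example, `read_single_fasta(">T0437\nMKDV\nVDKC\n")` returns the header `T0437` and the joined sequence `MKDVVDKC`, which `check_length` accepts.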
The example sequence used in this article, and provided
by the server if the user does not supply one, is T0437, one
of the CASP8 prediction targets. An ensemble of NMR
structures for T0437 is now available in PDB file 2k3i (6).
The figures in this article are taken from our CASP8 prediction made on 6 June 2008, before the NMR structures
were released. Full details of the prediction can be
found at http://www.soe.ucsc.edu/~karplus/casp8/T0437/decoys/SAM_T08/
*To whom correspondence should be addressed. Tel: +1831 459 4250; Email:
[email protected]
© 2009 The Author(s)
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
Downloaded from https://academic.oup.com/nar/article/37/suppl_2/W492/1145388 by guest on 10 June 2022
[Figure 1: image not reproduced. Panels: (a) T0437.t06 w0.5 and (b) T0437.t2k w0.5, each a sequence logo plotted against residue position in the target sequence.]
Figure 1. Sequence logos from MSAs for CASP8 target T0437. In both sequence logos, the residues of the target sequence are consistent with the
conserved residues in the MSA, indicating that the iterated searches were not contaminated by unrelated proteins. (a) Sequence logo for SAM_T06
MSA, based on 119 sequences. The SAM_T04 alignment is nearly identical. The sequence logo shows which residues are most highly conserved in
this family of proteins. The groups of conserved residues are typical of motifs that are preserved through evolution. (b) Sequence logo for SAM_T2K
MSA, based on 99 sequences. Note that R96 is conserved in the narrower set of homologs found by T2K, but is not conserved in the T06 alignment.
the residues are and the letters in the bar give the probability distribution for the amino acids at that position. The
pattern of conserved residues is often of use for making
conjectures about function and binding sites, even when
no confident tertiary structure prediction can be made.
All three searches are provided separately, so that the
sequence logos can be examined for contamination and
results checked for consistency. The searches are combined later in the process.
LOCAL STRUCTURE PREDICTION
After the iterated searches, the MSAs are used as inputs to
neural networks that predict various local structure properties: 12 backbone structure alphabets and three burial
structure alphabets. The 12 backbone alphabets are str4,
str2, alpha, bys, pb, n_notor, n_notor2, n_sep, o_notor,
o_notor2, o_sep and dssp_ehl2. Many of these alphabets
have been described previously (11–13), but some are new
and are so far described in detail only on the Frequently
Asked Questions (FAQ) page for the web site. The most
familiar is the dssp_ehl2 alphabet, which has just three
letters (E for beta strands and bridges, H for helices and
L for everything else), which is a reduction of the DSSP
alphabet (14). The str2 alphabet, which has been our most
valuable backbone alphabet, is an extension of DSSP to
distinguish between different types of beta strands
(Figure 2). The str4 alphabet is an attempt to use different
ways of classifying loop residues and strand residues, but
turned out to be somewhat less useful than str2. The alpha
alphabet classifies residues according to their Cα–Cα–Cα–Cα
torsion angles, the bys alphabet is a classification of residues by φ and ψ angles by Bystroff (15), and the pb alphabet is de Brevern’s protein blocks (16).
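The reduction from DSSP to a three-letter alphabet can be illustrated as follows. The text specifies only that E covers beta strands and bridges, H covers helices and L covers everything else. Treating the DSSP codes G and I as helices, as done here, is an assumption and may not match the exact dssp_ehl2 definition.

```python
# Assumed grouping of DSSP codes; G and I counted as helix is an assumption.
DSSP_TO_EHL = {
    "E": "E",  # extended strand
    "B": "E",  # isolated beta bridge
    "H": "H",  # alpha helix
    "G": "H",  # 3-10 helix (assumed to count as helix)
    "I": "H",  # pi helix (assumed to count as helix)
}


def reduce_dssp(dssp_string):
    """Map each DSSP code to E, H or L; anything unlisted becomes L."""
    return "".join(DSSP_TO_EHL.get(ch, "L") for ch in dssp_string.upper())
```

Turns, bends, blanks and any unrecognized characters all fall through to L.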
[Figure 2: image not reproduced. Sequence logo titled T0437.t06 str2, plotted against residue position in the target sequence.]
Figure 2. Sequence logo for the prediction of the str2 backbone alphabet based on the T06 MSA. The strong predictions here are generally accurate,
but the NMR models did not include the predicted hairpin turn around D24—the region before C26 had no structure in the NMR models. The
predictions are often this good for simple globular proteins, but can be thrown off by metal binding sites and disulfide bridges, which are common in
very small proteins.
The notor and sep alphabets classify residues according
to the hydrogen bond at the N or O atom. The notor and
notor2 alphabets classify the hydrogen bonds according to the
(C_{i-1}, N_i, O_j, N_{j+1}) torsion angle for donor N_i and
acceptor O_j, with special cases for alpha helices
(i = j + 4) and 3–10 helices and turns (i = j + 3). The
notor2 alphabets have a few more special cases for
i = j + 5 and common multiple hydrogen bond patterns.
The sep alphabets classify the hydrogen bonds according to the
separation i − j.
The three burial alphabets predict the number of Cβ
atoms within 14 Å (seven classes), the number of Cβ
atoms within 8 Å at least nine residues apart along the
chain (14 classes), and a somewhat more complicated
count of nearby residues [near-backbone-11 (13),
11 classes]. The near-backbone-11 measure has been the
most useful of these burial predictions (Figure 3). The
burial alphabets are organized so that ‘A’ is the least
buried class with increasing burial as the letters go through
the alphabet.
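The first two burial measures are simple neighbor counts, which can be sketched as below for a known structure. This is an illustration under stated assumptions: `coords` is a hypothetical list of one Cβ position per residue, and the binning of counts into the seven or 14 classes is not shown because the class boundaries are not given here.

```python
import math


def burial_counts(coords, radius, min_separation=1):
    """Count, for each residue, the other residues whose atom lies within
    `radius` and which are at least `min_separation` apart along the chain."""
    counts = []
    for i, a in enumerate(coords):
        n = 0
        for j, b in enumerate(coords):
            if abs(i - j) >= min_separation and math.dist(a, b) <= radius:
                n += 1
        counts.append(n)
    return counts
```

Under these assumptions, `burial_counts(coords, 14.0)` corresponds to the first measure and `burial_counts(coords, 8.0, min_separation=9)` to the second. The quadratic loop is adequate for chains of a few hundred residues.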
For each MSA and each structure alphabet, several outputs are provided: a table of the probability vector over
the alphabet for each position in the sequence; a sequence
logo summarizing the probability vectors, showing the
prediction and strength of prediction at each position of
the sequence (Figures 2 and 3); and a summary sequence
giving the most probable letter at each position. For users
wanting a quick approximate view of the local structure
prediction, a consensus prediction of the three-state alphabet (E = strand, H = helix and L = loop) is provided. To
aid in viewing the local structure predictions in the final
tertiary prediction, rasmol scripts for coloring the model
according to the predicted local structure are provided.
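The summary sequence follows directly from the probability table: at each position, take the letter with the highest probability. A minimal sketch, assuming the table has been read into a list of per-position probability lists ordered like `alphabet` (the in-memory layout and the function name are assumptions, not the server's file format):

```python
def summary_sequence(prob_table, alphabet):
    """Return the most probable letter at each position as a string."""
    letters = []
    for probs in prob_table:
        if len(probs) != len(alphabet):
            raise ValueError("each row needs one probability per letter")
        best = max(range(len(alphabet)), key=probs.__getitem__)
        letters.append(alphabet[best])
    return "".join(letters)
```

Ties are resolved in favor of the letter that comes first in `alphabet`, which is an arbitrary choice made for this sketch.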
TEMPLATE SELECTION AND ALIGNMENT
Our templates come mainly from our template library, a
large representative subset of PDB chains, for which we
have precomputed a set of HMMs. We have separate template libraries for the different iterated search methods. As
of 28 January 2009, the template libraries contained
19 621, 17 732 and 15 967 chains for T2K, T04 and T06,
respectively, while a non-redundant PDB set contained
36 643 chains.
After the local structure predictions are done, the SAM
(Sequence Alignment and Modeling) tool suite (17) is used
to build HMMs from the MSAs and predicted local structures. The HMMs are used to search PDB for potential
templates for structure prediction. HMMs in the template
library are used to score the target sequence, and all the
resulting scores are merged into a best-scores-all.html
file that summarizes the best hits, sorted by E-values.
The table also includes links to the PDB (18) and
Proteopedia (19) web sites for each template, as well as
links to the Structural Classification of Proteins [SCOP
(20)] website, when available.
The E-values are moderately well calibrated (off by no
more than a factor of 10 in cross-validation tests, unpublished data), so that E-values <0.01 indicate that a good
structural template is available for at least part of the
target protein and E-values >1 indicate that the method
will be using mainly ab initio and fragment methods to
generate the structure and that the tertiary structure is
thus much less reliable.
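These thresholds can be applied mechanically when screening many results. The sketch below encodes only the two cutoffs given above; the label for the range between 0.01 and 1, and the label strings themselves, are assumptions made for illustration.

```python
def interpret_e_value(e_value):
    """Classify a template-search E-value using the cutoffs 0.01 and 1."""
    if e_value < 0:
        raise ValueError("E-values cannot be negative")
    if e_value < 0.01:
        return "template"   # good template for at least part of the target
    if e_value > 1:
        return "ab_initio"  # mainly ab initio and fragment methods
    return "uncertain"      # in-between range, not characterized in the text
```

For the example target, the top hit's E-value of 8.6e-12 falls in the `template` class, while the initial blastp E-value of 0.1 falls in the in-between range.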
It is important for users to check the E-values, as
the method always produces a full-length model, even
when no good template is available. For T0437, the
template 2jz5A has an E-value of 8.6e-12, indicating a very
[Figure 3: image not reproduced. Sequence logo titled T0437.t06 near-backbone-11, plotted against residue position in the target sequence.]
Figure 3. Sequence logo for the prediction of near-backbone-11 burial based on the T06 MSA. The residues with A, B or C are predicted to be
highly exposed, while those with H, I or J are predicted to be highly buried. The burial predictions are excellent for this target.
confident similarity. Even the initial blastp over PDB finds
this template, though the E-value is only 0.1, so the confidence is not as high.
For longer multi-domain proteins, the server may have
a good template-based model for one domain, and poor,
ab initio models for the others. In those cases, it is often
wise to split the target into separate domains and predict
them separately. The server does not do this
automatically.
For each of the top templates, the server provides several alignments between the target and the template, which
are used in subsequent tertiary prediction, but which could
also be useful for transferring information (such as binding site residues) from the templates to the target. Of the
various alignments, the t06-local-str2+near-backbone-11-0.8+0.6+0.8-adpstyle5 alignments are generally the most
reliable. These are constructed by local alignment to a
three-track HMM that has an amino acid profile, str2
predictions, and near-backbone-11 predictions as the
three tracks, with track weights 0.8, 0.6 and 0.8, respectively, and using posterior decoding alignment (SAM
parameter adpstyle=5). Although there are often better
alignments in the pool, they do not come consistently from
the same method.
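One common way to combine tracks in a multi-track HMM is to weight each track's log emission probability and add the results. The sketch below illustrates that idea with the three weights quoted above. It is an assumption about how track weights enter the score, not a description of SAM's implementation, and the dictionary keys are hypothetical names.

```python
import math

# Track weights quoted in the text; the key names are hypothetical.
TRACK_WEIGHTS = {"amino_acid": 0.8, "str2": 0.6, "near_backbone_11": 0.8}


def weighted_log_emission(track_probs, weights=TRACK_WEIGHTS):
    """Sum of weight * ln(probability) over the tracks at one aligned position."""
    total = 0.0
    for track, weight in weights.items():
        p = track_probs[track]
        if p <= 0:
            raise ValueError("probabilities must be positive")
        total += weight * math.log(p)
    return total
```

Under this scheme, a weight below 1 damps a track's influence, so the str2 track (0.6) contributes less per position than the amino acid or burial tracks (0.8 each).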
Crude models are generated from the top alignments,
and superimposed in a PDB file. The undertaker-align.pdb.gz file can be viewed with any molecular modeling software to see what parts of the protein are coming
from the templates and whether the templates agree on the
structure of that portion of the protein.
After the major alignments have been made, short
gapless alignments (fragment lists) are made to provide reasonable local structures for building the final
model.
CONTACT PREDICTION
We have two distinct ways of predicting what residues
may be in contact: ab initio contact prediction using
neural networks and information about correlated mutations in the MSAs (21), and distance constraints extracted
from the best alignments, for use in constraining the tertiary structure prediction (22).
The neural network predictions are most useful when
there are no templates found with E-value <1. The server
presents three different neural network predictions. The
647_47 prediction is the network validated at CASP7
(21). The 730_47 prediction does not use any paired
column statistics, but just local structure prediction at
the individual residues. The 648_17.730_47 prediction is
a two-stage one that filters the 730_47 predictions using
paired column statistics. In our testing, the two-stage
method works best when the T06 MSA has enough diversity of sequences for correlated mutations to be detected.
For ORFans and target sequences for which only very
similar sequences are aligned in the T06 MSA, the 730_47
predictions are somewhat better (unpublished data).
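Correlated mutations between two alignment columns are often quantified with mutual information. The following is a minimal sketch of that statistic, without the sequence weighting, small-sample corrections or neural network stage that a real predictor would need; it is not the server's paired column statistic.

```python
import math
from collections import Counter


def mutual_information(col_a, col_b, gap="-"):
    """Mutual information in bits between two alignment columns.
    Rows with a gap in either column are ignored."""
    pairs = [(a, b) for a, b in zip(col_a, col_b) if a != gap and b != gap]
    n = len(pairs)
    if n == 0:
        return 0.0
    count_a = Counter(a for a, _ in pairs)
    count_b = Counter(b for _, b in pairs)
    count_ab = Counter(pairs)
    mi = 0.0
    for (a, b), c in count_ab.items():
        p_ab = c / n
        mi += p_ab * math.log2(p_ab / ((count_a[a] / n) * (count_b[b] / n)))
    return mi
```

This also shows why diversity matters: if every sequence in the MSA has the same residue in a column, that column's mutual information with any other column is zero, so no correlated mutation can be detected.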
The constraints extracted from alignments are most
useful when templates are found with low E-values, as
the constraints are used in model generation for selecting
templates and to keep the models from drifting too far
from the templates. The constraints have also been used
for model quality assessment in evaluating models from
other servers (22,13), but that application is not provided
by the web server.
MODEL GENERATION
Finally, the undertaker program is run to generate an all-atom model using the templates, the local structure
predictions, the distance constraints and the contact predictions. The prediction for our example is compared with
an NMR structure in Figure 4.
For compatibility with the CASP experiment, five
models are produced: Model 1 is the polished model
output from undertaker, Model 2 is the initial model
after undertaker examines the templates but before
attempting to remove clashes or gaps and Models 3–5
are incomplete models based on simple side chain replacement on the top three templates. A few other models are
available in the ‘decoys’ subdirectory, including intermediate models in the optimization process, and models
which have had the sidechains repacked by Rosetta (23),
or which have had energy minimization by Gromacs (24),
though neither of these programs is used for the primary
output.
All the results, including all intermediate files, are kept
available on the web site for at least 1 week, and can be
downloaded as a gzipped tarball from a link at the bottom
of the page.
VALIDATION AT CASP8
The SAM-T08 server was validated as part of the CASP8
protein structure prediction experiment in summer 2008.
CASP (Critical Assessment of Structure Prediction) is a
community-wide experiment held every 2 years. Predictors
are given the sequences of proteins whose structures are
being solved, but whose structures have not been publicly
released, and are required to register their predictions
within 3 days for servers, or 3 weeks for human-assisted
methods. The predicted models are compared with the
Figure 4. Predicted model in blue and NMR structure in red, superimposed. Only residues C26–L98 are shown, since the NMR models
had no structure before C26. Residues V63-Q73 (shown in green) are
misaligned in the prediction, resulting in the gap before V74, instead of
the proper hairpin. Residues S56–G60 (shown in orange) were not
evaluated at CASP, because the ensemble of NMR models had quite
different structures for those five residues. The picture was generated by
the rasmol molecular viewing software (25). The SAM-T08-human prediction was substantially better than the SAM-T08-server prediction,
but it was based on Zhang-Server_TS5, the second best server model
(Zhang-Server_TS2 was the best server model in our evaluation).
experimental models to determine which methods are
really working.
Detailed results of the testing can be found on the
CASP8 web page http://predictioncenter.org/casp8/results.cgi as well as on several unofficial evaluation sites
(list available at http://www.reading.ac.uk/bioinf/CASP8).
Several different metrics have been used to evaluate the
quality of predictions, and rankings of servers depend
heavily on which metrics are used, what set of targets
are compared, and whether whole-chain comparisons or
domain-based comparisons are made.
Although the SAM-T08 server was not the best server
on the commonly used metrics that measure just the positions of the Cα atoms (GDT_TS and TM-score, for
example), it did quite well (ranking 2–21 out of 70 servers
overall, depending on the evaluation used).
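For readers unfamiliar with GDT_TS, the score averages the percentage of Cα atoms that lie within 1, 2, 4 and 8 Å of their experimental positions. The sketch below computes this from per-residue distances under a single fixed superposition. That is a simplification: the real measure searches over many superpositions for each cutoff, so this version can only underestimate it.

```python
def gdt_ts(distances, cutoffs=(1.0, 2.0, 4.0, 8.0)):
    """Simplified GDT_TS (0 to 100) from Calpha deviations in angstroms."""
    if not distances:
        raise ValueError("need at least one distance")
    n = len(distances)
    # Fraction of residues within each cutoff, then the mean over cutoffs.
    fractions = [sum(d <= c for d in distances) / n for c in cutoffs]
    return 100.0 * sum(fractions) / len(cutoffs)
```

With deviations of 0.5, 1.5, 3.0, 6.0 and 10.0 Å, the fractions are 0.2, 0.4, 0.6 and 0.8, giving a score of 50.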
In Zhang’s ranking of the servers by TM-score of
domains on the hard targets (http://zhang.bioinformatics.ku.edu/casp8/13D.html), the SAM-T08-server ranks
third, after Zhang-Server and BAKER-ROBETTA,
while on the easy targets (where differences are smaller
and many servers produce almost identical models),
SAM-T08-server ranks 21st. If hydrogen bond scoring is
included, SAM-T08-server moves to second place overall,
and fourth on the easy targets.
Using a contact-based measure, Nick Grishin
ranked SAM-T08 server fifth or sixth on all targets
(http://prodata.swmed.edu/CASP8/evaluation/DomainsAll.First.html).
In the official evaluations, the SAM-T08 models were
seen to have unusually good stereochemistry for homology models, even though the Cα traces were not the best
(based on assessors’ presentations at the CASP8 conference,
not published yet). On the common backbone accuracy
measures, the SAM-T08 server ranked 9th through 14th
among servers (http://predictioncenter.org/casp8/groups_analysis.cgi), except on the ‘high-accuracy’ server targets,
where it was in the middle of the pack (31st out of 70).
The SAM-T08 server generally ranked less well on the
very easy targets (where most of the methods produced
almost indistinguishable results) and better on the harder
targets. Performance relative to other servers seemed to
peak for those targets that had templates available, but
for which finding and aligning the template was difficult,
as we have focused our efforts most on fold recognition
and alignment.
The SAM-T08 server uses the same protocol for all
targets, whether they have highly similar templates available or not, but the method is tuned for the difficult targets, rather than the easy ones.
With a few notable exceptions (such as target T0442
domain 2), the SAM-T08 server did substantially better
than the older SAM-T06 server in all evaluations. The
older SAM-T02 server does not produce models, just
alignments, and had substantially poorer performance
than either of the more recent servers. The selection of
templates and alignments by the HMMs has not improved
substantially: the models built directly from the top
alignment (SAM-T08-server_TS3, SAM-T06-server_TS2
and SAM-T02-server_AL1) are of variable quality and
do not show consistent improvement. The selection and
optimization of models by undertaker, however, shows substantial improvement from SAM-T02 to SAM-T06
to SAM-T08.
ACKNOWLEDGEMENTS
Over the years, dozens of people have contributed to the
tools of the SAM-T08 servers. Some of the more notable
contributors (in alphabetical order) include John Archie,
Bret Barnes, Christian Barrett, Sugato Basu, Jonathan
Casper, Melissa Cline, Mark Diekhans, Chris Dragon,
Birong Hu, Richard Hughey, Rachel Karchin, Sol
Katzman, Firas Khatib, Anders Krogh, Martin Madera,
Yael Mandel-Gutfreund, Martin Paluszewski, George
Shackelford, Kimmen Sjölander, Don Speck, Grant
Thiltgen and Spencer Tu. Comments on drafts of the
paper by John Archie, Richard Hughey, Thomas
Juettemann, Josue Samayoa, and Chirag Sharma are particularly appreciated.
FUNDING
National Institutes of Health (Grant 1 R01 GM068570-01).
Conflict of interest statement. None declared.
REFERENCES
1. Karplus,K., Barrett,C., Cline,M., Diekhans,M., Grate,L. and
Hughey,R. (1999) Predicting protein structure using only sequence
information. Proteins, (Suppl. 3), 121–125.
2. Fischer,D., Barrett,C., Bryson,K., Elofsson,A., Godzik,A.,
Jones,D., Karplus,K., Kelley,L.A., MacCallum,R.M., Pawlowski,K.
et al. (1999) CAFASP-1: critical assessment of fully automated
structure prediction methods. Proteins, (Suppl. 3), 209–217.
3. Karplus,K., Karchin,R., Barrett,C., Tu,S., Cline,M., Diekhans,M.,
Grate,L., Casper,J. and Hughey,R. (2001) What is the value added
by human intervention in protein structure prediction? Proteins, 45,
86–91.
4. Karplus,K., Karchin,R., Draper,J., Casper,J.,
Mandel-Gutfreund,Y., Diekhans,M. and Hughey,R. (2003)
Combining local-structure, fold-recognition, and new-fold methods
for protein structure prediction. Proteins, 53, 491–496.
5. Karplus,K., Katzman,S., Shackelford,G., Koeva,M., Draper,J.,
Barnes,B., Soriano,M. and Hughey,R. (2005) What’s new in protein-structure prediction for CASP6. Proteins, 61, 135–142.
6. Mills,J.L., Singarapu,K.K., Eletski,A., Sukumaran,D.K., Wang,D.,
Jiang,M., Ciccosanti,C., Xiao,R., Liu,J., Baran,M.C. et al. (2008)
Solution NMR structure of protein yiiS from Shigella flexneri.
Northeast Structural Genomics Consortium target SfR90, PDB file
doi:10.2210/pdb2k3i/pdb
7. Altschul,S., Gish,W., Miller,W., Myers,E. and Lipman,D. (1990)
Basic local alignment search tool. J. Mol. Biol., 215, 403–410.
8. Wang,G. and Dunbrack,R.L. Jr. (2003) PISCES: a protein sequence
culling server. Bioinformatics, 19, 1589–1591.
9. Non-redundant Protein Database. (2008) Distributed by anonymous FTP from ftp.ncbi.nih.gov/blast/db/
10. Hughey,R. and Krogh,A. (1996) Hidden Markov models for
sequence analysis: extension and analysis of the basic method.
Comput. Appl. Biosci., 12, 95–107. Information on obtaining SAM
is available at http://www.soe.ucsc.edu/research/compbio/sam.html
11. Karchin,R., Cline,M., Mandel-Gutfreund,Y. and Karplus,K. (2003)
Hidden Markov models that use predicted local structure for fold
recognition: alphabets of backbone geometry. Proteins, 51, 504–514.
12. Karchin,R., Cline,M. and Karplus,K. (2004) Evaluation of local
structure alphabets based on residue burial. Proteins, 55, 508–518.
13. Archie,J. and Karplus,K. (2009) Applying undertaker cost functions
to model quality assessment. Proteins, 75, 550–555.
14. Kabsch,W. and Sander,C. (1983) Dictionary of protein secondary
structure: pattern recognition of hydrogen-bonded and geometrical
features. Biopolymers, 22, 2577–2637.
15. Bystroff,C., Thorsson,V. and Baker,D. (2000) HMMSTR: a hidden
Markov model for local sequence-structure correlations in proteins.
J. Mol. Biol., 301, 173–190.
16. de Brevern,A.G., Etchebest,C. and Hazout,S. (2000) Bayesian
probabilistic approach for predicting backbone structures in terms
of protein blocks. Proteins, 41, 271–287.
17. Hughey,R., Karplus,K. and Krogh,A. (1999) SAM: sequence
alignment and modeling software system, version 3. Technical
Report UCSC-CRL-99-11, University of California, Santa Cruz,
Computer Engineering, UC Santa Cruz, CA 95064, October 1999.
http://www.soe.ucsc.edu/research/compbio/sam.html
18. Bernstein,F.C., Koetzle,T.F., Williams,G.J., Meyer,E.E.,
Brice,M.D., Rodgers,J.R., Kennard,O., Shimanouchi,T. and
Tasumi,M. (1977) The Protein Data Bank: a computer-based
archival file for macromolecular structures. J. Mol. Biol., 112,
535–542.
19. Hodis,E., Prilusky,J., Martz,E., Silman,I., Moult,J. and
Sussman,J.L (2008) Proteopedia—a scientific ‘wiki’ bridging the rift
between three-dimensional structure and function of biomacromolecules. Genome Biol., 9, R121.
20. Murzin,A.G. (1996) Structural classification of proteins: new
superfamilies. Curr. Opin. Struct. Biol., 6, 386–394.
21. Shackelford,G. and Karplus,K. (2007) Contact prediction using
mutual information and neural nets. Proteins, 69, 159–164.
22. Paluszewski,M. and Karplus,K. (2009) Model quality assessment
using distance constraints from alignments. Proteins, 75, 540–549.
23. Rohl,C.A., Strauss,C.E.M., Misura,K. and Baker,D. (2004) Protein
structure prediction using Rosetta. Methods Enzymol., 383, 66–93.
24. van der Spoel,D., Lindahl,E., Hess,B., Groenhof,G., Mark,A.E.
and Berendsen,H.J.C. (2005) GROMACS: fast, flexible and free.
J. Comput. Chem., 26, 1701–1718.
25. Sayle,R. and Milner-White,E.J. (1995) RasMol: biomolecular
graphics for all. Trends Biochem. Sci., 20, 374–376.