Analyzing Protein Circular Dichroism Spectra For Accurate Secondary Structures
ABSTRACT We have developed an algorithm to analyze the circular dichroism of proteins for secondary structure. Its hallmark is tremendous flexibility in creating the basis set, and it also combines the ideas of many previous workers. We also present a new basis set containing the CD spectra of 22 proteins with secondary structures from high quality X-ray diffraction data. High flexibility is obtained by doing the analysis with a variable selection basis set of only eight proteins. Many variable selection basis sets fail to give a good analysis, but good analyses can be selected without any a priori knowledge by using the following criteria: (1) the sum of secondary structures should be close to 1.0, (2) no fraction of secondary structure should be less than −0.03, (3) the reconstructed CD spectrum should fit the original CD spectrum with only a small error, and (4) the fraction of α-helix should be similar to that obtained using all the proteins in the basis set. This algorithm gives a root mean square error for the predicted secondary structure for the proteins in the basis set of 3.3% for α-helix, 2.6% for 3₁₀-helix, 4.2% for β-strand, 4.2% for β-turn, 2.7% for poly(L-proline) II type 3₁-helix, and 5.1% for other structures when compared with the X-ray structure. Proteins 1999;35:307–312. © 1999 Wiley-Liss, Inc.

Key words: circular dichroism; secondary structure; proteins

Grant sponsor: National Institutes of Health; Grant number: GM 21479
*Correspondence to: W. Curtis Johnson, Department of Biochemistry and Biophysics, Oregon State University, Corvallis, OR 97331–7305. E-mail: [email protected]
Received 22 September 1998; Accepted 5 January 1999

INTRODUCTION
Over the years, many methods have been offered to analyze the circular dichroism (CD) of proteins in the amide region for their secondary structure.¹⁻²¹ As more ideas were presented, the accuracy of these analyses increased. A number of reviews have discussed this work,²²⁻²⁷ and Greenfield²⁷ has recently reviewed the methods that workers are presently using to estimate the secondary structure of proteins from their CD data. The most successful methods use the CD spectra of proteins whose structure is known from X-ray diffraction as the basis for analyzing the CD of a protein with unknown structure. That is because there are more features than pure secondary structures that affect the CD of proteins in the amide region. For instance, the CD due to aromatic and sulfur-containing side chains, the length of α-helices, and twists in β-sheets all contribute to the CD in the amide region. Proteins contain all the features that affect their CD. If we use the CD of proteins with known structures as a basis, then all these features will be in the analysis, even if we do not recognize them directly.

In this paper we present an algorithm to estimate the secondary structure of proteins from CD data that reaches a new level of accuracy. Indeed, the accuracy is about the same as the variation in secondary structure found in X-ray diffraction data.³³ Workers cannot expect the accuracy in analyzing the CD spectra of proteins to be any better than the variation in the X-ray structures used for the proteins in the basis set. The method combines many of the ideas presented over the years in a new algorithm that gives a root mean square error of 4% or better for the secondary structures α-helix (H), 3₁₀-helix (G), β-strand (E), turn (T), and poly(L-proline) II type 3₁-helix (P).

A Fortran program called CDsstr that implements this algorithm is available over the internet, and is free of charge to anyone. Simply ftp to alpha.als.orst.edu. Login as anonymous, and please use your email address as the password. Change the directory by typing: cd /pub/wcjohnson/cdsstr. Notice that these are standard slashes since we are using a unix system. This directory contains the Fortran source code, test data and results, and the compiled binary version for a PC. To ensure that the binary version remains executable, type: bin. You can retrieve all files by typing: mget *.*

THE METHOD AND ITS RATIONALE
When workers use the same basis set of protein CD spectra together with the same known secondary structures, they are all stuck in the same vector space. Then the analysis of a protein with unknown structure should not be very dependent on the method of investigating this vector space. For the three-dimensional vector space in which we live, a vector will be the same whether it is described in a Cartesian coordinate system, a cylindrical coordinate system, or a spherical coordinate system. Different methods of analysis of CD spectra such as least squares fitting, singular value decomposition (SVD), convex constraint analysis, and neural networks simply apply different coordinate systems in the vector space of protein CD spectra. All of these methods should give about the same answer. We choose to use SVD in our algorithm.
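The claim that least squares fitting and SVD explore the same vector space can be checked numerically. The sketch below is only an illustration under stated assumptions: the matrix formulation (basis spectra as columns of `C`, known structure fractions as columns of `F`, and the estimate computed as `F` times the fitting coefficients) is one common way to write such an analysis and is not taken from this paper, the function name `svd_analysis` is hypothetical, and all data are synthetic random numbers rather than real CD spectra. The optional `rank` argument keeps only the largest singular values, as the method section goes on to discuss.

```python
import numpy as np

def svd_analysis(C, F, c, rank=None):
    """Estimate structure fractions for spectrum c (hypothetical helper).

    C: (n_wavelengths, n_proteins) basis CD spectra, one column per protein.
    F: (n_structures, n_proteins) known structure fractions, one column per protein.
    rank: number of singular values to keep; None keeps all of them.
    """
    U, s, Vt = np.linalg.svd(C, full_matrices=False)
    k = len(s) if rank is None else rank
    # Coefficients that express c as a combination of the basis spectra,
    # using only the k largest singular values.
    a = Vt[:k].T @ ((U[:, :k].T @ c) / s[:k])
    return F @ a

# Synthetic stand-in data (assumption): 15 wavelengths, 8 proteins, 6 structures.
rng = np.random.default_rng(0)
n_wavelengths, n_proteins, n_structures = 15, 8, 6
C = rng.normal(size=(n_wavelengths, n_proteins))
F = rng.dirichlet(np.ones(n_structures), size=n_proteins).T  # columns sum to 1
c = C @ rng.dirichlet(np.ones(n_proteins)) + 0.01 * rng.normal(size=n_wavelengths)

# Route 1: ordinary least squares fit of c by the basis spectra.
a_lsq, *_ = np.linalg.lstsq(C, c, rcond=None)
f_lsq = F @ a_lsq

# Route 2: the same fit written with the full SVD.
f_svd = svd_analysis(C, F, c)

# Route 3: SVD keeping only the five largest singular values.
f_svd5 = svd_analysis(C, F, c, rank=5)

print(np.allclose(f_lsq, f_svd))
```

With all singular values kept, the two routes should agree to rounding error, because they solve the same fitting problem in different coordinates. The answer only changes when the space itself changes, for example by truncating singular values or by altering the basis set.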
Mathematically, the number of features that can be determined from the CD spectrum of a protein is equal to the number of protein CD spectra in the basis set. In practice, accuracy is a problem, and the number of features that can be determined is limited by the information content of the data. Thus a central problem in the analysis of the CD spectrum of a protein for its secondary structure is that there is not enough data in the CD spectrum to accurately solve for all the features that determine the CD spectrum. SVD has been used to show that because of the experimental error in the CD spectrum of a protein measured to 178 nm, it has an information content of five.¹¹,²⁸ This means that the CD spectrum is the equivalent of only five independent equations and therefore can accurately solve for only five unknowns. The problem is underdetermined, in the sense that there are more than five features that determine the CD of proteins.

However, the problem is not as bad as we might imagine. SVD can also be used to evaluate the information content of the secondary structures. The information content of the secondary structures has been shown to be four,¹¹,²⁸ so one equation remains to help determine the other features.

The problem is also overdetermined, in the sense that there are more proteins in the basis set than the information content of the CD data. Then many different combinations of the CD spectra for the basis proteins will fit the CD to be analyzed within its experimental error. Small changes in the data can cause large changes in the analysis, and the results may well be inaccurate. This problem can be overcome with SVD by using only the five most important singular values and setting the rest equal to zero. The matrix algebra of using SVD has been described in a number of publications, and will not be repeated here.¹²⁻¹⁴,¹⁹

Truncating the CD spectra of proteins at 190 nm reduces the information content to three or four, and truncating the CD spectra at 200 nm reduces the information content to two. The equation that the sum of structures must be 1.0 adds another equation to the information content. However, it has been shown that when this equation is used as a constraint, it makes the analysis less accurate.²⁹ Requiring that the fractions of structure be positive will further degrade an analysis, predicting an inaccurate amount of α-helix, which without the constraint is usually fairly good.²⁹

A basis set consisting of the CD for 22 proteins digitized at 4 nm intervals from 234 to 178 nm was used in this work. Many of these CD spectra have been published previously,¹¹⁻¹³,³⁰ and all were available as the basis set with the earlier program, VARSLC.²⁷ Of course they are now available over the internet with CDsstr. Corresponding X-ray diffraction data with a resolution of at least 2.0 Å that have been refined are required for a protein to be included in the basis set. This criterion eliminated some of the 33 proteins contained in the earlier basis set. The accompanying publication describes the method we used to analyze the X-ray diffraction data for secondary structure, and the results are given in Table I of that publication. Note that hydrogen-bonded and non-hydrogen-bonded β-turns have been combined under the symbol T for analysis of the protein CD spectra. The CD spectrum for each of the 22 proteins in the basis set can be analyzed for secondary structure using the other 21 proteins in the basis set. The results of this analysis (HJ), which is essentially our original algorithm,¹¹ are compared with the X-ray secondary structures in Table I. We see that the analysis for α-helix is quite good, as has been noted previously.¹¹,²⁰ However, the analysis for other structures is variable, and in particular the sum of fractions of secondary structure is often very different from 1.0.

The fact that the sum of structures is not 1.0 for every protein using the HJ analysis is independent of the X-ray secondary structures assigned to the proteins in the basis set. Indeed, if instead of analyzing for component secondary structures, we simply analyze for the sum of secondary structure by assigning 1.0 for the sum to each protein in the basis set, we would still end up with the HJ sum of secondary structures given in Table I.

The sum of structures problem is undoubtedly related to the fact that there are only five independent equations in a CD spectrum of a protein measured to 178 nm, causing the analysis for secondary structure to be underdetermined. Tukey developed "variable selection" to get around the underdetermined problem.³¹,³² This powerful idea is a standard procedure in the statistical analysis of data. With variable selection, proteins are removed from the basis set to achieve an accurate analysis. Changing the coordinate system will not change the analysis, but with variable selection we change the vector space, which in turn will change the analysis. This kind of flexibility is the only way, outside of a constraint, to get the sum of structures to be 1.0 and eliminate negative values for some structures. In previous work, flexible methods like variable selection,¹³ local linearity,¹⁴ ridge regression,¹⁰ and cluster analysis²⁰ have been used to change the basis set. Of course variable selection is not without its own problems. How do we know which proteins to remove, and how do we know when the analysis is satisfactory?

Of course, we do not know a priori which proteins to remove from the basis set. In previous work¹³ we assumed that the final basis set should be as large as possible, and this assumption has been called into question.¹⁴,²¹ We then removed proteins so that the sum of structures became close to 1.0 and negative fractions of structure were eliminated. In this work we find that it is best to use a small basis set, and eight proteins in the basis set gives the best results for the 22 proteins in the basis set where we already know the answer. Note that mathematically at least six proteins are required to be in the basis set to solve for the six features we are considering explicitly. There are 319,770 combinations of selecting eight proteins from a basis set of 22, and there will be more as the number of proteins in the basis set is increased. Rather than generating all these combinations, we follow Dalmas and Bannister,²¹ and randomly choose the eight proteins for variable selection. We keep only those combinations where the sum of structures is between 0.952 and 1.05, where no fraction of secondary structure is less than −0.03, and where the
reconstructed CD spectrum fits the original CD spectrum with an average root mean square error of less than 0.25 Δε units. Typical successful combinations for concanavalin A analyzed with the 21 other proteins are given in Table II. We see that some analyses are quite good (the first five in the table), while others are not very good at all (the remaining nine in the table). How can we choose the good analyses without already knowing the answer?

TABLE II
Combination
number       Protein numbersᵃ                      H      G      E      T      P      O
868          1, 3, 7, 10, 12, 15, 16, 19        0.06   0.02   0.37   0.16   0.09   0.35
22           2, 7, 8, 12, 13, 16, 17, 21        0.09   0.01   0.35   0.09   0.07   0.40
31           1, 5, 9, 10, 11, 12, 14, 20        0.07   0.02   0.40   0.07   0.09   0.33
513          1, 6, 7, 9, 11, 12, 14, 22         0.07   0.02   0.38   0.13   0.07   0.31
831          1, 3, 7, 9, 12, 14, 16, 22         0.09   0.03   0.39   0.12   0.07   0.35
397          3, 5, 6, 10, 14, 17, 20, 21        0.36   0.07   0.03   0.14   0.15   0.26
595          5, 7, 9, 11, 15, 16, 17, 22        0.27   0.05   0.15   0.17   0.03   0.33
10           2, 3, 7, 10, 14, 15, 21, 22        0.22  −0.02   0.18   0.20   0.01   0.36
614          10, 11, 12, 16, 17, 20, 21, 22    −0.02  −0.03   0.50   0.01   0.06   0.45
119          3, 8, 14, 15, 16, 17, 18, 22       0.16   0.08   0.48  −0.01   0.00   0.32
860          6, 8, 13, 14, 16, 17, 18, 22       0.07   0.02   0.43  −0.01   0.01   0.44
100          1, 2, 5, 11, 13, 15, 17, 21        0.15   0.03   0.26   0.10   0.09   0.35
259          2, 5, 8, 10, 13, 14, 17, 21        0.17   0.00   0.23   0.09   0.09   0.38
602          2, 7, 9, 11, 14, 15, 17, 19        0.18   0.00   0.24   0.15   0.07   0.36
†H, α-helix; G, 3₁₀-helix; E, β-strand; T, turns; P, poly(L-proline) II type 3₁-helix; O, other amides not in the previous categories.
ᵃRefer to Table I for matching protein numbers to proteins.

We do know that if we analyze using the complete basis set (HJ), we get about the right amount of α-helix (Table I). Graphing the HJ prediction for α-helix versus the X-ray α-helix shows high correlation and accuracy. It is better than estimating from Δε at 222 nm, or using only the first SVD basis vector. We can use the amount of α-helix predicted by the complete basis set to select from the variable selection analyses using eight proteins in the basis set without any a priori knowledge. In the end we use slightly more complicated criteria in our algorithm. The HJ method tends to overestimate α-helix for proteins with a low content, so if the predicted amount of α-helix is less than 0.15, we average this fraction with the minimum α-helix in the successful combinations, and then select combinations that are within 3% of this value. If the predicted α-helix is between 0.15 and 0.25, we select successful combinations that are within 3% of this value. If the predicted α-helix is between 0.25 and 0.65, we average this with the maximum α-helix in the successful combinations, and select successful combinations within 3% of this value. Finally, if the predicted α-helix is greater than 0.65, we select successful combinations with the largest amount of α-helix, since the successful combinations tend to underestimate α-helix for all-α proteins. For concanavalin A these criteria select the first five and the eleventh successful combinations in Table II.

Sreerama and Woody¹⁹ improved SVD and variable selection by putting the protein with unknown structure that was being analyzed into the basis set and iterating until the analysis was self-consistent. We use this self-consistency in our algorithm.

RESULTS AND DISCUSSION
Table I shows the results of analyzing each of the 22 proteins in the basis set with the other 21 proteins by using our new algorithm. The predictions of secondary structure compare well with the X-ray diffraction numbers. The root mean square errors in the secondary structures for the 22 proteins in the basis set are: 3.3% for H, 2.6% for G, 4.2% for E, 4.2% for T, 2.7% for P, and 5.1% for O. Greenfield²⁷ has recently compared various algorithms for analyzing CD for secondary structure. The best method, program SELCON from Sreerama and Woody,¹⁹,²⁰ gave a root mean square error of 8% for H+G, 7% for E, and 5% for T. When our new basis set is run on SELCON, the root mean square error is 6.2% for H, 2.7% for G, 5.2% for E, 3.6% for T, 2.5% for P, and 5.1% for O. Clearly, the accuracy we have achieved in this work is due both to the algorithm and to using a basis set with secondary structures from high quality X-ray data. The new criterion in the algorithm of basing α-helix estimates on the HJ predictions allows great flexibility in choosing the basis set, improving accuracy. Our correlations between predicted and X-ray structures are 0.99 for H, 0.62 for G, 0.94 for E, 0.38 for T, 0.76 for P, and 0.87 for O. Our error based on the center of the dynamic range for each structure is 9.4% for H, 43.3% for G, 23.3% for E, 40.0% for T, 35.7% for P, and 17.3% for O.

This algorithm demonstrates that our intuition is not always correct. For instance, we believed that the variable selection basis set should contain the maximum number of proteins, and stated this as one criterion in earlier work.¹³ However, in this research we found that decreasing the number of proteins in the variable selection basis set improved the analysis, in the sense that there were some combinations that gave results close to the X-ray structure. All α-helix proteins analyzed best with six, seven, or
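The selection rules described in the method section can be sketched in a few lines of Python. This is a hedged illustration, not the CDsstr Fortran code: the function names are hypothetical, "within 3%" is read as an absolute difference of 0.03 in fraction units, the treatment of the interval boundaries (0.15, 0.25, 0.65) is a guess, and the rule for predictions above 0.65 is approximated by taking the combinations closest to the largest α-helix value. The α-helix values are the H column of Table II; the HJ prediction of 0.145 is a made-up input chosen for illustration, since Table I is not reproduced here, and the minimum is taken only over the listed typical combinations.

```python
from math import comb

# Number of ways to choose 8 proteins from 22, as quoted in the text.
N_COMBINATIONS = comb(22, 8)

def passes_criteria(fractions, rms_error, sum_range=(0.952, 1.05),
                    min_fraction=-0.03, max_rms=0.25):
    """Keep a combination only if it meets the three acceptance criteria."""
    total = sum(fractions)
    return (sum_range[0] <= total <= sum_range[1]
            and min(fractions) >= min_fraction
            and rms_error < max_rms)

def helix_target(hj_helix, helices):
    """Target alpha-helix fraction derived from the full-basis (HJ) prediction."""
    if hj_helix < 0.15:
        # HJ tends to overestimate low helix content: average with the minimum.
        return (hj_helix + min(helices)) / 2
    if hj_helix <= 0.25:
        return hj_helix
    if hj_helix <= 0.65:
        # Average with the maximum helix among the successful combinations.
        return (hj_helix + max(helices)) / 2
    # Assumption: for highly helical proteins, aim at the largest helix value.
    return max(helices)

def select_by_helix(hj_helix, helices, tolerance=0.03):
    """Indices of successful combinations whose helix is near the target."""
    target = helix_target(hj_helix, helices)
    return [i for i, h in enumerate(helices) if abs(h - target) <= tolerance]

# H column of Table II, in table order.
TABLE_II_HELIX = [0.06, 0.09, 0.07, 0.07, 0.09, 0.36, 0.27,
                  0.22, -0.02, 0.16, 0.07, 0.15, 0.17, 0.18]
HJ_HELIX_HYPOTHETICAL = 0.145  # illustrative value, not taken from Table I

selected = select_by_helix(HJ_HELIX_HYPOTHETICAL, TABLE_II_HELIX)
print(N_COMBINATIONS, selected)
```

With this hypothetical HJ value the sketch picks rows 1 to 5 and row 11 of Table II (zero-based indices 0 to 4 and 10), which matches the selection reported in the text. The agreement depends on the assumed input and should not be read as a reconstruction of the actual calculation.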
24. Woody RW. Circular dichroism of peptides. In: Hruby VJ, editor, The Peptides, Vol. 7. New York: Academic Press; 1985. p 15–114.
25. Yang JT, Wu C-SC, Martinez HM. Calculation of protein conformation from circular dichroism. Meth Enzymol 1986;130:208–269.
26. Johnson WC, Jr. Secondary structure of proteins through circular dichroism spectroscopy. Annu Rev Biophys Chem 1988;17:145–166.
27. Greenfield NJ. Methods to estimate the conformation of proteins and polypeptides from circular dichroism data. Anal Biochem 1996;235:1–10.
28. Johnson WC, Jr. Analysis of circular dichroism spectra. Meth Enzymol 1992;210:426–447.
29. Manavalan P, Johnson WC, Jr. Protein secondary structure from circular dichroism spectra. Proc Int Symp Biomol Struct Interactions, Suppl J Biosci 1985;8:141–149.
30. Toumadje A, Alcorn SW, Johnson WC, Jr. Extending CD spectra of proteins to 168 nm improves the analysis of secondary structures. Anal Biochem 1992;200:321–331.
31. Mosteller F, Tukey JW. Data analysis and regression. Reading, MA: Addison-Wesley; 1977. 588 p.
32. Weisberg S. Applied linear regression. New York: John Wiley & Sons; 1980. 323 p.
33. King SM, Johnson WC. Assigning secondary structure from protein coordinate data. Proteins 1999;35:313–320.