Introducing Protein Folding Using Simple Models

Introducing Protein Folding Using Simple Models
D. Thirumalai and D. K. Klimov

Department of Chemistry and Biochemistry and Institute for Physical Science and Technology
arXiv:cond-mat/0101048v1 [cond-mat.soft] 4 Jan 2001
University of Maryland, College Park, MD 20742
Key Words: Biomolecular Folding, Symmetry, Designability, Evolution
I. INTRODUCTION
Most reactions in cells are carried out by enzymes [1]. In many instances the rates of
enzyme-catalyzed reactions are enhanced by million fold. A significantly large fraction of
all known enzymes are proteins which are made from twenty naturally occurring amino
acids. The amino acids are linked by peptide bonds to form polypeptide chains. The
primary sequence of a protein specifies the linear order in which the amino acids are linked.
To carry out the catalytic activity the linear sequence has to fold to a well defined three
dimensional structure. In cells only a relatively small fraction of proteins require assistance
from chaperones (helper proteins) [2]. Even in the complicated cellular environment most
proteins fold spontaneously upon synthesis. The determination of the three dimensional
folded structure from the one dimensional primary sequence is the most popular protein
folding problem.
For a number of years the protein folding problem remained only of academic interest.
The synthesis of proteins in cells was described by Crick [1]. Schematically this process can be
represented as DNA → RNA → P roteins. This proposal and Anfinsen’s [3] demonstration
that a denatured protein would fold to the native conformation under suitable conditions
was sufficient to understand the role of protein folding in cells. However, in the last couple
of decades many diseases have been directly linked to protein folding (especially misfolding)
[4]. Thus, there is an urgency to understand the mechanisms in the formation of folded
structures. Biotechnology industry is also interested in the problem because of the hope that
by understanding the way polypeptide chains fold one can design molecules (using natural
or synthetic constituents) of use in medicine. Finally, the full potential of the human genome
project involves understanding what the genes encode. For all these profoundly important
reasons the protein folding problem has taken center stage in molecular biology.
Because this problem is complex several venues of attack have been devised in the last
fifteen years. A combination of experimental developments (protein engineering, advances in
X-ray and NMR, various time resolved spectroscopies, single molecule manipulation meth-
ods) and theoretical approaches (use of statistical mechanics, computational strategies, use
of simple models) [5–7] has lead to a greater understanding of how polypeptide chains reach
the native conformation.
From our perspective there are four major problems that comprise the protein folding
enterprise. They are:
1
a) Prediction of the three dimensional fold of a protein given only the amino acid se-
quence. This is referred to as the structure prediction problem. In Fig. (1) we show the three
dimensional structure of myoglobin. The structure prediction problem involves determining
this fold using the primary sequence which is given in the lower part of Fig. (1).
b) Given that a sequence folds to a known native structure what are the mechanisms
in the transition from the unfolded conformation to the folded state? This is the kinetics
problem whose solution requires elucidation of the pathways and transition states in the
folding process.
c) How to design sequences that adopt a specified fold [8]? This is the inverse protein
folding problem that is vital to the biotechnology industry.
d) There are some proteins that do not spontaneously reach the native conformation. In
the cells these proteins fold with the assistance of helper molecules referred to as chaperonins.
The chaperonin-mediated folding problem involves understanding the interactions between
proteins.
It is not the aim of this chapter to introduce the reader to all the areas listed above. Our
goal is modest. We describe some of the theoretical developments which arose from studies
of caricatures of proteins. Such models were designed to understand certain general features
about protein structures and how these are kinetically reached. To keep the bibliography
compact we mostly cite review articles. The interested reader can find the original papers
in these cited works. We hope that this short introduction will entice the reader to delve
into the ever surprising world of biological macromolecules.
II. RANDOM HETEROPOLYMER AS A CARICATURE OF PROTEINS
In homopolymers all the constituents (monomers) are identical, and hence the interac-
tions between the monomers and between the monomers and the solvent have the same
functional form. To describe the shapes of a homopolymer (in the limit of large molecular
weight) it is sufficient to model the chain as a sequence of connected beads. Such a model
can be used to describe the shapes that a chain can adopt in various solvent conditions. A
measure of shape is the dimension of the chain as a function of the degree of polymerization,
N. If N is large then the precise chemical details do not affect the way the size scales with
N [9]. In such a description a homopolymer is characterized in terms of a single parameter
that essentially characterizes the effective interaction between the beads which is obtained
by integrating over the solvent coordinates.
Proteins are clearly not homopolymers because many energy scales are required to char-
acterize the polypeptide chain. Besides the excluded volume interactions and hydrogen
bonds the potential between the side chain depend on the nature of the residues [1]. There-
fore, as a caricature of proteins the heteropolymer model is a better approximation. A
convenient limit is the random heteropolymer for which approximate analytic treatments
are possible [10]. In a random heteropolymer the interactions between the beads are assumed
to be randomly distributed. Some of the interactions are attractive (which is responsible
for conferring globularity to the chain) while others are repulsive and these residues are
better accommodated in an extended conformation. In proteins water is a good solvent for
polar residues while it is poor solvent for hydrophobic residues. (In a good solvent contacts
between the monomer and solvent are favored whereas in a poor solvent the monomers are
2
attracted to each other). Because only 55% of the residues in proteins are hydrophobic it
is clear that in a typical protein energetic frustration plays a role. In addition because of
chain connectivity there is also topological frustration. This arises because residues that
are proximal tend to form structures on short length scales. Assembly of such short short
length scale structures would typically be incompatible with the global fold giving rise to
topological frustration. Even if energetic frustrations are eliminated a polypeptide chain (in
fact biomolecules) are topologically frustrated [7].
In the field of spin glasses and structural glasses such frustration effects are well known
[11]. Thus, it was natural to suggest that random heteropolymers could serve as a simple
representation of polypeptide chains. In Appendix A we sketch computational details for one
model of random heteropolymers. Bryngelson and Wolynes proposed, using phenomenologi-
cal arguments, that the Random Energy Model (REM) would be an appropriate description
of some aspects of proteins [12,5]. The rationale for this is the following: Consider the
exponentially large number of conformations. Because of the presence of several conflicting
energies in a polypeptide chain it is natural to assume that these energies are randomly
distributed. If there are no correlations between these energies and if the distribution is
Gaussian one gets the REM. Of course, in the REM model the chain connectivity is ignored
and there is no manifest for a spatial dependence of the chain coordinates. We show in
Appendix B that in the compact phase the random heteropolymer is equivalent to REM.
The random heteropolymer models of proteins are interesting from a statistical mechan-
ical perspective. However, they do not explain the key characteristics of proteins such as
reversible and cooperative folding to a unique native conformation. Moreover, the theories
for heteropolymers suggest that typically the energy landscapes for these systems are ex-
tremely rugged consisting of many minima that are separated by barriers of varying heights
[10]. This would mean that kinetically it would be impossible for chain with a typical real-
ization of interactions to reach the ground state in finite time scales. Thus the dynamics of
such random heteropolymer models typically exhibits glassy behavior. Natural proteins do
not exhibit any hall mark of glassy dynamics at most temperatures of interest. It follows
that a certain refinement of the random heteropolymers is required to capture protein-like
properties. One of the important theoretical advances is the observation that very simple
minimal models [13] can be constructed that capture many (not all) of the salient features
observed in proteins [13]. The simplest manifestation of such models are the lattice repre-
sentation of polypeptide chains. In the next section we introduce the models and describe
a few results that have been obtained by numerically exploring their behavior.
III. LATTICE MODELS OF PROTEINS
The computational protocol for describing protein folding mechanisms is straightforward

in principle. The dynamics are well described by the classical equations of motion. Simula-
tions of a monomeric protein involves equilibrating the polypeptide chain in a box of water
molecules at the desired temperature and density. If an appropriately long trajectory is gen-
erated then the dynamics of the protein can be directly monitored. There are two crucial
limitations that prevent a straightforward application of this approach to study the folding
of proteins. First, the interaction potentials or the force fields for such a complex system
are not precisely known. Molecular dynamics simulations in the standard packages use po-
3
tentials that rely on the transferability hypothesis i.e., interactions designed in one context
can be used in aqueous medium and for larger systems. The need to compute potentials
that can be used reliably in the simulations of protein dynamics remains acute.
The second problem is related to the limitations in generating really long time trajectories
that can sample all the relevant conformational space of proteins. To observe reversible
folding of even a moderate sized protein requires simulations that span the millisecond
time scale. More importantly making comparisons with experiments involves generating
many (greater than perhaps 100) folding trajectories so that a reliable ensemble average
is obtained. Thus, we need to make progress on both (force fields and enhanced sampling
techniques in long time simulations) fronts before a straightforward all atom simulations
become routine.
In light of the above mentioned difficulties various simplified models of proteins have
been suggested [13]. The main rationale for using such drastic simplifications is that a
detailed study of such models can enable us to decipher certain general principles that
govern the folding of proteins [5–7]. For these class of models detailed computations without
sacrificing accuracy is possible. Such an approach has yielded considerable insights into the
mechanisms, time scales, and pathways in the folding of polypeptide chains. In this section
we will outline some of the results that have been obtained (largely from our group) with
the aid of simple lattice models of proteins.
In the simple version of the lattice representation of proteins the polypeptide chain is
modeled as a sequence of connected beads. The beads are confined to the sites of a suitable
lattice. Host of the studies have used cubic lattice. To satisfy the excluded volume condition
only one bead is allowed to occupy a lattice site. If all the beads are identical we have a
homopolymer model whose characteristics on lattices have been extensively studied. To
introduce protein-like character the interactions between beads (ones separated by at least
three bonds) that are near neighbors on a lattice are assumed to depend on the nature of
the beads. The energy of a conformation, specified by {ri , i = 1, 2, .....N}, is
X
E({ri }) = ∆(| ri − rj | −a)Bij (1)
i<j
where N is the number beads in the chain, a is the lattice spacing, and Bij is the value
of the contact interaction between beads i and j. We will consider different forms of Bij .
Since this model can be viewed as a coarse grained representation of the α-carbons of the
polypeptide chain the value of a is typically taken to be about 3.8Å.
Lattice models have been used for a long time in polymer physics [14]. They were
instrumental in computing many properties (scaling of the size of the polymer with N,
distribution of end-to-end distance etc.) of real homopolymer chains. In the context of
proteins lattice models were first introduced by Go and coworkers [15]. The currently popular
Go model considers only interactions between residues (beads on the lattice) that occur in
the native (ground) state. Thus, in this ”strong specificity limit” only native contacts are
taken into account. It follows that in this version of the Go model the chain is forced to
adopt the lowest energy conformation at low temperatures. Go also considered a variant of
this model in which certain non-native contacts are allowed. Although these models were
insightful Go and coworkers did not use them to obtain plausible general principles of protein
folding. This was partly due to the fact that in their studies they typically used long chains,
and hence exact enumeration was not possible.
4
Simple lattice models, with the express purpose of obtaining minimal representations
of polypeptide chains, were first suggested by Chan and Dill [16]. In order to account for
the major interactions in proteins these authors argued that the twenty naturally occurring
amino acids can be roughly divided into categories, namely, hydrophobic (H) and polar (P).
Chan and Dill have suggested that this simple HP model can capture many salient features
of proteins. They also suggested that many of the conceptual puzzles (Levinthal paradox in
particular) could be addressed by systematically studying short chains. This simple exactly
enumerable HP model and their variants have been used to understand cooperativity, folding
kinetics, and designability of protein structures [13]. Thus, it is instructive to describe the
calculations that have been done using lattice models. A study of such models indeed
provides a god introduction to the computational aspects of protein folding.
Emergence of Structures: The sequence space of proteins is extremely dense. The number
of possible protein sequences is 20N . It is clear that even by the fastest combinatorial
procedure only a very small fraction of such sequences could have been synthesized. Of
course, not all of these sequences will encode protein structures which for functional purposes
are constrained to have certain characteristics. A natural question that arises is how do
viable protein structures emerge from the vast sea of sequence space? The two physical
features of folded proteins are : (1) In general native proteins are compact but not maximally
so. (2) The dense interior of proteins are largely made up of hydrophobic residues and the
hydrophilic residues are better accommodated on the surface. These characteristics give the
folded proteins a lower free energy in comparison to all other conformations.
Lattice models are particularly suited for answering the question posed above. We will
show that the two physical restrictions are sufficient to rationalize the emergence of very
limited (believed to be only on the order of a thousand or so) protein-like structures. To
provide a plausible answer to this question using lattice models we need to specify the form
of the interaction matrix elements Bij . For purposes of illustration we consider the random
bond (RB) model in which the elements Bij are distributed as
1 (Bij − B0 )2
P (Bij ) = √ exp[− ]. (2)
2πσ 2σ 2
Here σ(= 1) is the variance in Bij and the hydrophobicity parameter B0 is the mean value.
We chose B0 = 0 (half the beads are hydrophobic) and B0 = −0.1. The latter is motivated
by the observation that in natural proteins roughly 55% of the residues are hydrophobic
[17].
Protein-like structures are not only compact but also have low energy. With this in mind
we have calculated the number of compact structures (CSs) as CSs with low energy for a
given N. The number of CSs in its most general form may be written as
N N d−1
CN (CS) ≃ Z Z1 d
N γc −1 (3)
where ln Z is the conformational free energy (in units of kB T ), Z1 is the surface fugacity,
d is the spatial dimension, and γc is measures possible logarithmic corrections to the free
energy. It is clear that natural proteins are relatively unique and hence their number on an
average has to grow at rates that are much smaller than that given in Eq. (21).To explore
this we have calculated by exact enumeration the number of compact structures, CN (CS)
and the number of minimum energy compact structures CN (MES) as a function of N.
5
We performed exhaustive enumeration of all self-avoiding conformations, to explore the
conformational space of the polypeptide chain of a given length. In order to reduce the
sixfold symmetry on the cubic lattice we fixed the direction of the first monomeric bond in
all conformations. The remaining conformations are related by eightfold symmetry on the
cubic lattice (excluding the cases when conformations are completely confine to a plane or
straight line). To decrease further the number of conformations to be analyzed the Martin
algorithm [18] was modified to reject all conformations related by symmetry.
We define MES as those conformations, whose energies lie within the energy interval
∆ above above the lowest energy E0 . Several values for ∆ were used to ensure that no
qualitative changes in the results are observed. We set ∆ to be constant and equal to 1.2
(or 0.6) (definition (i)). We have also tested another definition for ∆, according to which
∆ = 1.3|E0 − tB0 |/N, where t is the number of nearest neighbor contacts in the ground
state (definition (ii)). It is worth noting that in the latter case ∆ increases with N. Both
definitions yield equivalent results. Using these definitions for ∆, we computed C(MES) as
a function of the number of residues N.
The computational technique involves exhaustive enumeration of all self-avoiding confor-
mations for N ≤ 15 on cubic lattice. In doing so we calculated the energies of all conforma-
tions according to Eq. (19), and then determined the number of MES. Each quantity, such as
the number of MES, C(MES), the lowest energy E0 , the number of nearest-neighbor contacts
t in the lowest energy structures, is averaged over 30 sequences. Therefore, when referring
to these quantities, we will imply their average values. To test the reliability of the compu-
tational results an additional sample of 30 random sequences was generated. Note that in
the case of C(MES) we computed the quenched averages, i.e., C(MES) = exp[ln[c(MES)]],
where c is the number of MES for one sequence.
The number of MES C(MES) is plotted as a function of the number of residues N in Fig.
(2) for B0 = −0.1 and ∆ = 0.6. A pair of squares at given N represents C(MES) computed
for two independent runs of 30 sequences each. For comparison, the number of self-avoiding
walks C(SAW) and the number of CS C(CS) are also plotted in this figure (diamonds and
triangles, respectively). The most striking and important result of this graph is the following:
As expected on general theoretical grounds, C(SAW) and C(CS) grow exponentially with N,
whereas the number of MES C(MES) exhibits drastically different scaling behavior. There is
no variation in C(MES) and its value remains practically constant within the entire interval
of N starting with N = 7. We find (see Fig. (2)) that C(MES) ≈ 101 . This result further
validates our earlier finding for two dimensional model. These results suggest that C(MES)
grows (in all likelihood) only as lnN with N. Thus the restriction of compactness and low
energy of the native states may force an upper bound on the number of distinct protein
folds.
3D HP model: The calculations described above suggest that upon imposing minimal restric-
tions on the structures (compactness and low energies) the structure space becomes sparse.
As suggested before this must imply that each basin of attraction (corresponding to a given
MES) in the structure space must contain numerous sequences. The way these sequences
are distributed among the very slowly growing number (with respect to N) of MES, i.e.,
the density of sequences in structure space, is an important question. This was beautifully
addressed in the paper by Li et. al [19]. They considered a three dimensional (N = 27)
cubic lattice. By using HP model and restricting themselves to only maximally compact
6
structures as putative native basins of attractions (NBA) they showed certain basins have
much larger number of sequences. In particular, they discovered that one of the NBAs
serves as a ground state for 3794 (total number is 227 ) sequences and hence was considered
most designable (Fig. (3)). The precise density of sequences among the NBAs is clearly
a function of the interaction scheme. These calculations and the arguments presented in
the previous subsection using the random bond model point out that since the number of
NBA for the entire sequence space is small it is likely that proteins could have evolved ran-
domly. Naturally occurring folds must correspond to one of the basins of attraction in the
structure space so that many sequences have these folds as the native conformations, i.e.,
these are highly designable structures in the language of LWC [19]. These ideas have been
further substantiated by Lindgard and Bohr [20], who showed that among maximally com-
pact structures there are only very few folds that have protein-like characteristics. These
authors also estimated using geometrical characteristics and stability arguments that the
number of distinct folds is on the order of a thousand. All of these studies confirm that the
density of the structure space is sparse. Thus, each fold can be designed by many sequences.
From the purely structural point of view nature does have several options in the sense that
many sequences can be ”candidate proteins”. However, there is also evolutionary pressure
to fold rapidly (i.e., a kinetic component to folding). This requirement further restricts the
possible sequences that can be considered as proteins, because they must satisfy the dual
criterion of reaching a definite fold on a biologically relevant time scale. These observations
are schematically sketched in Fig. (4).
Symmetry and designability: In the study by Li. et. al. [19] it was noted that highly
designable structures appear to be symmetric. Independently, in a thought provoking ar-
ticle Wolynes [21] has made a series of compelling arguments as to why nature might use
symmetry (at least in an inexact manner) to generate symmetrical tertiary folds of proteins.
Many enzymes are oligomers. Wolynes makes a number of observations about the symmetry
aspects of protein structures: (a) The terameric hemoglobin (α-helical protein) molecule has
an approximate two-fold symmetry. (b) A striking example of approximate symmetry in β
proteins is found in the the structure of a monomeric γ crystallin in which the shapes adopted
by residues 1-88 and 89-174 are nearly the same. However, the two individual sequences do
not bear much similarity. This, of course, is consistent with the notion that the structure
space is so sparse that many sequences are forced to adopt similar shapes. The interesting
conclusion from examining the γ crystallin structure is that the underlying symmetries in
the shape are only inexact. (c) The obvious example of very nearly symmetrical structures
are in helical proteins with the the four helix bundle being one the most prominent exam-
ples (see Fig. (5)). (d) Various proteins with mixed topology (like TIM barrels and jelly
rolls) appear to have the kind of inexact but apparent symmetrical arrangement discussed
by Wolynes [22]. (e) He also conjectured that it is likely that the underlying approximate
symmetry is reflected in the free energy landscape being funnel-like. This would facilitate
rapid folding which for many proteins may be a result of evolutionary pressure. The precise
connections between the symmetries and the folding mechanisms and functional competence
of biological molecules have not been worked out. Nevertheless it appears that employing
such ideas might be useful in de novo design of proteins.
We note that equally striking are the kind of symmetrical arrangements found in RNA
molecules [23]. The crystal structure of the P4-P6 domain of Tetrahymena self-splicing RNA
7
clearly is highly symmetric with helices packed in a nearly regular arrangement. Since in an
evolutionary sense the RNA world might have preceded the protein world it is interesting
to speculate that the emergence of inexact design may have been a biological necessity. The
observation of inexact symmetries in a protein structure might be a consequence of the fact
that they are present in the ”parent” molecules. In fact this evolutionary conservation may
have been imprinted when evolution from RNA world to the current scheme for protein
synthesis took place. The most compelling reason for observing near regular patterns in
biomolecular structures is because synthesis of symmetrical folds might be energetically
economical.
Exploring protein folding mechanism using lattice model: It is well known that proteins
reach the biologically active native states in a relatively short time which is on the order of
a second for most single domain proteins [1]. Based on folding and refolding experiments on
ribonuclease A Anfinsen concluded that under appropriate conditions natural sequences of
proteins spontaneously fold to their native conformation [3]. This implies that protein folding
is self-assembly process i.e., the information needed for specifying the topology of the native
state is contained in the primary sequence itself. This thermodynamic hypothesis does not,
however, address the question of how the native state is accessed in a short time scale. This
issue was raised by Levinthal who wondered how a polypeptide chain of reasonable length
can navigate the astronomically large conformational space so rapidly. Levinthal posited
that certain preferred pathways must guide the chain to the native state. The Levinthal
paradox, simplistic as it is, has served as an intellectual impetus to understand the ease
with which a polypeptide chain reaches the native conformation [5,6]. We use lattice models
to describe the foldability of biological sequences of proteins. A sequence is foldable if it
reaches the native state in a reasonable time and remains stable over some range of external
conditions (pH, temperature).
Characteristic Temperatures: The basic features of folding can be understood in terms of

two fundamental equilibrium temperatures that determine the ”phases” of the system [7].
At sufficiently high temperatures (kT greater than all the attractive interactions) the shape
of the polypeptide chain can be described as a random coil and hence its behavior is the same
as a self-avoiding walk. As the temperature is lowered one expects a transition at T = Tθ
to a compact phase. This transition is very much in the spirit of the collapse transition
familiar in the theory of homopolymers [9]. The number of compact conformations at Tθ
is still exponentially large. Because the polypeptide chains have additional energy scales
that discriminate between the various compact conformations we expect a transition to the
ground (native) state at a lower temperature TF . Generally the transition at Tθ is second
order, while the transition at TF is first order like. Such transitions here are for very small
systems and the notion of ”phases” should be used with care. These expectations, based
on fairly general arguments, have been confirmed in various lattice simulations of protein-
like heteropolymers. For the lattice models the collapse temperature Tθ is determined from
the peak of the specific heat and the folding transition temperature is obtained from the
fluctuations in the overlap function given by
8
∆χ =< χ2 > − < χ >2 (4)
where
1 X
χ = 1− δ(r~ij − ~rijN ) (5)
N2 − 3N + 2 i<j+2
with rijN referring to the native state. In Fig. (7a) (for the structure displayed in Fig. (6))
we plot the temperature dependence of Cv , which has a peak at Tθ = 0.83. This figure also
shows the variation of d < Rg > /dT with temperature. The peak of this curve (0.86)
almost coincides with that of the specific heat indicating that this transition is associated
with compaction of the chain. Hence, the maximum in Cv legitimately indicates the collapse
temperature. X-ray scattering experiments have been used to obtain Tθ for a few proteins.
In Fig. (7b) we show the temperature dependence of ∆χ from which the folding temperature
TF is determined to be 0.79.
Folding Rates: The key question we want to answer is what are the intrinsic sequence
dependent factors that not only determine the folding rates but also the stability of the
native state? It turns out that many of the global aspects of folding kinetics of proteins can
be understood in terms of the equilibrium transition temperatures. In particular, we will
show that the key factor that governs the foldability of sequences is the single parameter
Tθ − TF
σT = (6)
Tθ
which indicates how far TF is from Tθ . To establish a direct correlation between the folding
time τF and σT we generated a number of sequences for N = 27. The folding time was taken
to be equal to the mean first passage time. The first passage for a given initial trajectory was
calculated by determining the total number of Monte Carlo steps (MCS) needed to reach the
native conformation for the first time. By averaging over an ensemble of initial trajectories
(typically this number varies between 400 - 800 in our examples) the mean first passage time
is obtained. The precise moves that are utilized in the simulations are described elsewhere
[17]. The dependence of τF on σT (for the random bond model and for the other interaction
schemes) is given in Fig. (8). This figure shows a remarkable correlation between the folding
time and σT . A small change in σT results in a dramatic effect (a few orders of magnitude)
on the folding times. It is clear that both Tθ and TF are dependent on the sequence. As a
result mutations that preserve the native state can alter the folding rates due to the change
in the σT values.
Using lattice models we have also established that folding rates correlate well with Z =
(EN −EM S /δ, where EN is the native state energy, EM S is the average energy of the ensemble
of misfolded structures, and δ is the dispersion in the contact energies. The relationship
between σ and Z also suggests that, in general, the correlation between τF and σ should
be superior. More importantly, experimental measurements of Z are difficult. On the other
hand, both Tθ amd TF can be measured in scattering, CD, or fluorescence experiments.
Other measures, such as energy gap (however, it is defined), do not correlate with τF .
In the previous section we showed that because the structure space is very sparse there
have to be many sequences that map onto the countable number of basins in the structure
9
space. The kinetics here shows that not all the sequences, even for highly designable struc-
tures, are kinetically competent. Consequently, the biological requirements of stability and
speed of folding severely restrict the number of evolved sequences for a given fold. This very
important result is schematically shown in Fig. (4).
It is important to point out that the simulations reported in Fig. (8) were done at
sequence dependent temperatures using the condition < χ(Ts ) >= 0.21. At these temper-
atures, all of which are below their respective folding transition temperatures, the native
conformation has the highest occupation probability. In lattice models the native state is a
single conformation (a microstate) which is, of course, physically unrealistic. In real systems
there is a volume associated with the native basin of attraction (NBA) and there are many
conformations that map onto the NBA. The probability of being in the NBA at the various
simulation temperatures is in excess of 0.5 so that under the conditions of our simulations
the stability criterion is automatically satisfied. The results in Fig. (8), therefore, shows
that the dual requirement of stability and the kinetic accessibility of NBA is most easily
satisfied by those sequences that have small values of σT . Thus rapid folding occurs when
σT ≈ 0 i.e., near a multicritical-like point. In this case there are no detectable intermediate
”phases”.The sequence, whose native state is shown in Fig. (6) has σT = 0.05. We found
that this sequence folds rapidly.
Topological Frustration and Kinetic Partitioning Mechanism: Lattice models can

also be used to obtain the outlines of the mechanisms for folding of proteins. The qualitative
aspects of the folding kinetics of biomolecules can be understood in terms of the concept
of topological frustration. The primary sequence of proteins has about 55% hydrophobic
residues. The linear density of hydrophobic residues along the polypeptide chain is roughly
constant implying that the hydrophobic residues are spread throughout the chain. As a
result on any length scale l there is a propensity for the hydrophobic residues to form tertiary
contacts under folding conditions. The resulting structures which are formed by contacts
between residues that are in proximity would be in conflict with the global fold corresponding
to the native state. The incompatibility of structures on local scales with the near unique
native state on the global scale leads to topological frustration. It is important to realize
that topological frustration is inherent to all foldable sequences, and is a direct consequence
of the polymeric nature of proteins as well as competing interactions (hydrophobic residues
which prefer the formation of compact structures and hydrophilic residues which are better
accommodated by extended conformations). A consequence of topological frustration is that
the underlying energy landscape is rugged consisting of many minima that are separated by
barriers of varying heights.
It is important to understand the nature of the low-lying minima in the rugged energy
landscape. On the length scale l there are many ways of forming structures that are in
conflict with the global fold. It is expected that most of these structures have high free
energies and are unstable to thermal fluctuations. We expect a certain number of these
structures to have low free energies and be relatively deep minima. The overlap between
these structures and the native fold could be considerable and hence these structures could
be viewed as being native-like. These competing basins of attraction (CBA) in which the
polypeptide chain adopts native-like structures can act as kinetic traps that will slow down
the folding process.
10
The basic consequences of topological frustration for mechanisms of folding can be un-
derstood in terms of the kinetic partitioning mechanism (KPM) [7]. Imagine an ensemble of
denatured molecules in search of the native conformation. This is the experimental situation
that arises when the concentration of denatured molecules is decreased. It is clear that a
fraction of molecules Φ would reach the NBA rapidly without being trapped in the low lying
energy minima. The remaining fraction would be trapped in the minima and only on longer
time scales fluctuations enables the chain to reach the NBA. The value of the partition factor
Φ depends on the sequence and is explicitly determined by the σT value. Thus because of
topological frustration the initial pool of denatured molecules partitions into fast folders and
slow folders that reach the native state by indirect off-pathway processes.
From the description of the kinetic partitioning mechanism (KPM) given above it follows
that generically the time dependence of the fraction of molecules that have not folded at
time t, Pu (t), is given by
t X t
Pu (t) = Φ exp(− )+ ak exp(− ) (7)
τN CN C k
τk
where τN CN C is the time constant for reaching the native state by the fast process, τk
is the time for escape from the CBA labeled k, and ak is the ”volume” associated with
the k th CBA. From this consideration we expect that for a given sequence trajectories can
be grouped into those that reach the native conformation rapidly (Φ being their fraction)
and those that remain in one of the CBA for discernible length of time. In Fig. (9a) we
show an example of trajectory that reaches the native state directly from the random coil
conformation. In contrast in Fig. (9b) we show an example of a trajectory for the same
sequence at the same simulation temperature. This figure shows that on a very short time
scale the chain gets trapped in conformations other than the NBA and only on long time
scale it reaches the native state. This figure illustrates the basic principle of KPM. If we
perform an average over an ensemble of such trajectories the kinetic result given in Eq. (7)
ensues.
Classifying Folding Mechanisms in terms of σT : The various folding mechanisms

expected in foldable sequences may be classified in terms of the σT . We have already shown
that sequences that fold extremely rapidly have very small values of σT . Based on our study
of several model proteins as well as analysis of real proteins we classify the folding kinetics
of proteins as follows [7]:
Fast Folders: For these sequences the value of σT is less than a certain small value σl . For
such sequences the folding occurs directly from the ensemble of unfolded states to the NBA.
The free energy surface is dominated by the NBA (or a funnel) and the volume associated
with NBA is very large. The partition factor Φ is near unity so that these sequences reach
the native state by two-state kinetics. The amplitudes ak in Eq. (7) are nearly zero. There
are no intermediates in the pathways from the denatured state to the native state. Fast
folders reach the native state by a nucleation-collapse mechanism which means that once
a certain number of contacts (folding nuclei) are formed then the native state is reached
very rapidly [24,25]. The time scale for reaching the native state for fast folders (which are
11
normally associated with those sequences for which topological frustration is minimal) is
found to be
ηa
τN CN C = f (σT )N ω (8)
γ
where η is the solvent viscosity, a is the typical size of a residue, γ is the average surface
tension between the residue and water, f (σT ) is typically an exponential function of σT , and
the exponent ω is between 3.8 and 4.2. In general, only small proteins (N less than about
100) are fast folders.
Moderate Folders: Sequences for which σl ≤ σT ≤ σh can be classified as moderate fold-

ers. Unlike fast folding sequences the Φ values are fractional which means that a substantial
fraction of molecules is essentially trapped in one of the CBAs for some length of time.
For these sequences there are detectable intermediates and for all but very small proteins
the rate determining step is the activated transition from one of the CBAs to the native
state. The average time scale for transition from these misfolded structures to the native
conformation is given by
√
τF ≈ τ0 exp( N ) (9)
at T ≈ TF . This shows that typical barriers for moderate folders are quite small. As a result
the folding times even for long proteins (N ≈ 200) are only on the order of a second. It is
these small barriers that enable typical proteins to fold in biologically relevant time scale
without encountering the Levinthal paradox.
Slow Folders and Chaperones: For sequences with σT ≥ σh folding is extremely slow and
these sequences may not reach the native state in a biologically relevant time scales. The
volume corresponding to NBA is very small in this case and as a result Φ is nearly zero. The
free energy surface is dominated by CBAs. Under these circumstances spontaneous folding
does not become viable. In cells such proteins
√ are rescued by chaperones. Typically this
happens when N is so large that τ0 exp( N) exceeds reasonable folding time scales. Thus
in cells we expect that only those proteins which are large or whose biological functioning
state has to be oligomers require chaperones.
Minimum number of residues for obtaining foldable protein structures: Natural proteins
are made up of twenty amino acid residues. An important question, from the perspective of
protein design, is how many distinct types of residues are required for protein-like behavior?
Such a selection cannot be made arbitrarily because in natural proteins one should have
polar, hydrophobic, and charged residues. In addition, for optimal packing of the core
hydrophobic residues with different van der Waals radii may be required. To explore the
potential simplification of the number of residues Wang and Wang [26] have carried out
a highly significant study using lattice models and standard statistical potentials for the
contact interaction elements Bij (Eq. (19)). They discovered that a grouping of amino acid
residues into five categories mimics the folding behavior found using the standard twenty
residues. To demonstrate this they used a cubic lattice with N = 27, and mostly focussed
on the maximally compact structures as ground states. Thus, structures such as ones given
12
in Fig. (6) are not explicitly considered. Nevertheless, the demonstration that a suitable
set of five amino acid residue types is sufficient is an important result which should have
implication for protein design problem - the generation of primary sequences that can fold
to a chosen target folded structure.
In their original article they mostly focussed on various thermodynamic properties (na-
ture and degeneracy of the ground states). They have also carried out kinetic simulations
to assess if the kinetic properties are altered by using a reduced number of residues. To
test this idea Wang and Wang used the foldability index σ (which correlates well with fold
rates) as a discriminator of sequence properties. The precise question addressed by Wang
and Wang is the following: What is the minimum number of residues that are required to
obtain foldable (characterized by having relatively small values of σ) sequences? We find
that fast folding sequences have σ less than about a quarter. They carried out two sets of
computations. In one set they initially optimized the stability gap [5] of various sequences
using the twenty residues. They substituted the residues in these optimized sequences by the
representative residue for each group. Four subgroups were considered with each containing
five and a variant, three and two amino acids. The foldability index for the standard sample
and their substitutes is shown in Fig. (10) as solid circle. In another set of computations
they examined the foldability index (open diamonds in Fig. (10)) for sequences that were
optimized using the reduced sets of amino acids. Both these curves show that as long as
the number of amino acid types exceed five one can generate sequences with relatively small
values of σ. Fig. (10) also shows that smaller values of σ can be obtained if optimization
is carried out with reduced sets of amino acids. Such sequences are foldable i.e, the dual
requirements of stability over a wide temperature range and the kinetic accessibility of their
native states are simultaneously satisfied.
IV. CONCLUSIONS
The examples of modeling discussed in Sections (II) and (III) are meant to illustrate
the ideas behind theoretical and computational approaches to protein folding. It should be
borne in mind that we have discussed only a very limited aspect of the rich field of protein
folding. The computations described in Section (III) can be carried out easily on a desktop
computer. Such an exercise is, perhaps, the best of way of appreciating the simple approach
to get at the principles that govern the folding of proteins.
In this chapter we have not discussed experimental advances that are offering extraordi-
nary insights into the way the denatured molecules reach the native state. Two remarkable
experimental approaches hold the promise that in short order we will be able to watch the
folding process from submicrosecond time scale till the native state is reached. A brief
summary of these follow.
1) Eaton and coworkers showed in 1993 that optical triggers of folding can offer a window
into the folding process from microsecond time scale [27]. Since this pioneering work many
laboratories have probed the plausible structure formation that occur on short time scale.
Fast folding experimental techniques have been used to obtain detailed kinetics for the
building blocks of proteins, namely, β-hairpin, α-helices, and loops. Very recent experiments
have given compelling evidence that there are populated native-like intermediates even in
proteins that were thought to follow two-state kinetics.
13
2) Perhaps the most exciting development in the last few years is the ability to nanoma-
nipulate single biomolecules using atomic force microscopy and optical tweezer techniques
[28]. So far such experiments have been used to provide a microscopic basis of elasticity in
muscle proteins. If these stretching experiments can be combined with fluorescent resonance
energy transfer experiments then it is possible to follow the folding of individual molecules
as it passes through the transition state to the native conformation. It has been suggested
on theoretical grounds that such two-dimensional single molecule experiments can measure
directly the distribution of folding rates (and the barrier distribution) in much the same way
mean first passage times are computed in minimal protein models (see Section (III)).
The challenges posed by these high precision experiments demand more refined models
and further developments in computational techniques. For the theoretically inclined it will
no longer be sufficient to describe kinetics only in terms of energy landscapes. The wealth of
data that are being generated by experiments, such as the ones mentioned above, requires
quantitative understanding of the various factors that govern the pathways, mechanisms,
and the transition states in the folding process. These challenging issues will make the area
of biomolecular folding an engaging one for many years to come.
ACKNOWLEDGMENTS
We are grateful to John D. Weeks for useful comments and to Chao Tang for supplying
Fig. (3). We are indebted to Dr. J. Wang and Prof. W. Wang for kindly providing us with
Figure (10) prior to publication.
APPENDIX A:
There are several versions of the random heteropolymer models. To keep the discussions
technically simple we will consider one case - the so called random hydrophilic-hydrophobic
chain whose phases were studied by Garel, Orland and Leibler (GLO) [10]. The GLO
model consists of a polymer chain with N monomers. The GLO model can be viewed as a
generalization of the popular Edwards model which was introduced to understand swelling
of real homopolymer chains in good solvents [9]. In the GLO model the chain is made up
of hydrophobic (hydrophilic) residues that tend to collapse (swell) the chain when dispersed
in a solvent. The solvent mediated interactions at each site is assumed to be random. The
random interactions depend only on a given site i and the strength depends on the degree
of hydrophilicity λi . Besides the term accounting for chain connectivity there are two and
higher body interactions that determine the shape of the chain. In the GLO model the two
body interaction is given by
vij = v0 + β(λi + λj )δ[ri − rj ] (A1)
where v0 is repulsive short range interaction, λi is a quenched random variable which is

distributed as
1 (λi − λ0 )2
P (λi ) = √ exp[− ]. (A2)
2πσ 2 2σ 2
14
If the mean λ0 is positive then the majority of the residues are hydrophilic. Description of
the collapsed phase of the chain requires introducing three and and four-body interaction
terms. Thus, the total Hamiltonian is
1X 1 X 1 X
βH = vij + ω3 δ(ri − rj )δ(ri − rk ) + ω4 δ(ri − rj )δ(rj − rk )δ(rk − rl ).
2 i6=j 6 i6=j6=k 24 i6=j6=k6=l
(A3)
Since the charge variables λi are quenched the thermodynamics of the system requires av-
eraging the free energy using the distribution P (λi ) i.e.,
Z Y
F = −kB T P (λi ) ln Z(λi )d{λi }. (A4)
The average of ln Z(λi ) is most conveniently done using the replicas through the relation
lim Zn − 1
ln Z =n → 0 . (A5)
n
Using Eqs. (2-4) the required average can be carried out. This leads to a complicated
expression for Z n where the bar indicates the average over the quenched random variables
λi . In terms of the order parameters
Z
qab (r, r ′) = dsδ(ra (s) − r)δ(rb (s) − r ′ ) (A6)
and
Z
ρa (r) = dsδ(ra (s) − r) (A7)
(a and b are replica indices) the expression for Z n becomes

Z
Zn = Dqab (r, r ′ )D qbab (r, r ′)Dρa (r)Dφa (r) exp[Hef f ] (A8)
where
Hef f = G(qab , qbab , ρa , φa ) + ln ζ(qbab , φa ) (A9)
with
Z ′
X ρ2 ω ω4
G= dr (iρa φa − (v0 + 2βλ0 ) a − 3 ρ3a − ρ4a )
Z Z 2 6 24
X
+ dr dr ′ (iqab (r, r ′ )qbab (r, r ′) + β 2 λ2 qab (r, r ′ )ρa (r)ρb (r ′ ))
a<b
and
Z Y
ζ(qbab , φa ) = Dra (s) exp(−HT {ra (s)}). (A10)
a
In Eq. (10) HT {ra (s)} is
15
Z X Z X Z X
d N dr N N
HT {ra (s)} = exp[− 2 ds ( )2 − i ds φa (ra (s)) − i ds qbab (ra , rb )].
2a 0 a ds 0 a 0 a<b
(A11)
and
′
ω3 = ω3 − 3β 2 λ2 . (A12)
The path integrals in Eq. (10) may be evaluated using the spectrum of the effective n-body
Hamiltonian
d X 2 X X
Hn = − 2
∇a + iφ(ra ) + iqbab (ra , rb ) (A13)
2a a a a<b
in the limit of n → 0. If N is very large then we can use ground state dominance to evaluate
the spectrum of Hn . This gives
ς(qbab , φa ) ≃ exp[−N min {< Ψ | Hn | Ψ > −E0 (< Ψ | Ψ > −1}] (A14)
{Ψ(r)}
where E0 is the ground state energy of Hn . GLO evaluated the integral over qab (Eq. (6))
by a saddle point approximation which leads to
iqbab (r, r ′) = −β 2 λ2 ρa (r)ρb (r ′ ). (A15)
From the above equation it follows that in the mean-field limit replica symmetry is not
broken. This makes the GLO model conceptually simpler to interpret than the random
bond heteropolymer model discussed in the Appendix.
The total wavefunction Ψ{r1 , r2 , .....rn } (Eq. (12)) is written as a product of single
particle functions (Hartree approximation). The various integrals are evaluated in the saddle
point approximation. A simple Gaussian form for the trial one particle wavefunction
1 d r2
φ(r) = ( ) 4 exp(− ) (A16)
2πR2 2R2
is chosen with R being the single variational parameter. Upon performing the Gaussian
integrals the free energy per monomer f becomes
a2 1 (v0 + 2βλ0 ) N
βf = 2
+ √ d +Ω (A17)
8R (2 π) 2 Rd
where
1 ω3 1 d −d/2 −d N 2 1 d/2 ω4 N 3
Ω=( √ − ( ) (3 − 2 ))( ) + ( ) ( ) . (A18)
(2π 3)d 6 2π Rd (32π 3 ) 24 Rd
At low temperatures the shape of the chain is determined by the sign of first term in Eq.
(18). If the sign is negative then the positive four body term is required for a stable theory.
The phase of the random hydrophobic-hydrophilic model is complicated and depends on
the value of λ0 [10]. We only describe the hydrophilic case when λ0 is positive. In this case
16
there is a first order transition to a collapsed state (R ∼ N −1/d ) induced by the negative
three body term. GLO pointed out that this transition is neither the usual θ-point nor
is it a freezing temperature because there is no replica symmetry breaking. In fact, this
collapse transition resembles that seen in proteins where it is suspected that it is first order
transition. The microscopic origin of the first order transition upon collapse of polypeptide
chains is not fully understood. Recent arguments, suggest that it could arise because burial
of hydrophobic residues and accommodation of the hydrophilic ones at the surface of proteins
in water requires some work and perhaps this assembly happens in a discontinuous manner.
APPENDIX B:
In Section (II) we considered a variational-type theory to treat the thermodynamics of

the random hydrophobic-hydrophilic heteropolymer. Here we describe a limiting behavior of
the random bond model [29]. In this appendix we show that the random bond model in the
the compact phase is identical to the Random Energy Model (REM). Historically, REM was
proposed as caricature for proteins on phenomenological grounds [12]. The heteropolymer
with random bond interactions was treated using a variational theory which suggested that
when the disorder increases beyond a limiting value the chain undergoes a thermodynamic
glass transition. The nature of this transition is closely related to Potts glasses.
The random-bond heteropolymer is described by a Hamiltonian similar to Eq. (1) except
the short range two-body term vij is taken to be random with a Gaussian distribution. In
this case a three-body term with a positive value of ω3 is needed to describe the collapsed
phase.The Hamiltonian is
X X
H= (v0 + vij )δ(ri − rj ) + δ(ri − rj )δ(rj − rk ) (B1)
i<j i6=j6=k
The distribution of the random couplings is given by

1 v2
P (vij ) = √ exp(− ij2 ). (B2)
2πv 2 2v
In the collapse phase the monomer density ρ = RN3 is constant (for large N). Thus, the
only conformation dependent term in Eq. (A1) comes from the random two-body term.
Because this term is linear combination of Gaussian variables we expect that its distribution
is also Gaussian and hence, can be specified by the two moments. Let us calculate the
(1) (2)
correlation E1 E2 between the energies E1 and E2 of two conformations {r1 } and {ri } of
the chain in the collapsed state. The mean square of E1 is
v 2 X (1) (1) v2
E12 = δ(ri − rj ) = Nρ (B3)
2 i,j 2
which is independent of the collapsed conformation. Similarly, we have
v 2 X (1) (1) (2) (2)
E1 E2 = δ(ri − rj )δ(ri − rj )
2 i,j
v2 X 2
= q12 (r, r ′ )
2 r,r′
17
where q12 (r, r ′ ) is the overlap between the two conformations. Because
X
q12 (r, r ′ ) = N (B4)
r,r ′
ρ2
and since the monomer density is constant we q12 (r, r ′ ) = N.
. This implies
v2 2
E1 E2 = ρ. (B5)
2
Thus the joint probability is
E12 + E22
lim ∼ exp(− ) (B6)
N →∞ Nρv 2
which is equivalent the behavior in uncorrelated REM. Thus, it is not a surprise that in
large dimensions (which are captured by variational type treatments) the random bond
heteropolymer model yields exactly the same result as the REM.
18
REFERENCES
[1] L. Stryer. Biochemistry. W. H. Freeman, 1988.
[2] G. H. Lorimer. A quantitative assessment of the role of the chaperonin proteins in
protein folding in vivo. FASEB J, 10:5–9, 1996.
[3] C. B. Anfinsen. Principles that govern the folding of protein chains. Science, 181:223–
230, 1973.
[4] P. T. Lansbury. Evolution of amyloids: What normal protein folding can tell us about
fibrillogenesis and disease. Proc. Natl. Acad. Sci. (USA), 96:3342–3344, 1999.
[5] J. N. Onuchic, Z. A. Luthey-Schulten, and P. G. Wolynes. Theory of protein folding:
An energy landscape perspective. Ann. Rev. Phys. Chem., 48:545–600, 1997.
[6] K. A. Dill and H. S. Chan. From Levinthal to pathways to funnels. Natur. Struct. Biol.,
4:10–19, 1997.
[7] D. Thirumalai and D. K. Klimov. Deciphering the timescales and mechanisms of protein
folding using minimal off-lattice models. Curr. Opin. struct. Biol., 9:197–207, 1999.
[8] E. Shakhnovich. Protein design: a perspective from simple tractable models. Folding &
Design, 3:R45–R58, 1998.
[9] P. G. deGennes. Scaling Concepts in Polymer Physics. Cornell University Press, 1985.
[10] T. Garel, H. Orland, and D. Thirumalai. Analytical theories of protein folding. In R. El-
ber, editor, New Developments in Theoretical Studies of protein folding, pages 197–268.
Singapore, World Scientific, 1996.
[11] M. Virasaro M. Mezard and G. Parisi. Spin glass theory and beyond. World Scientific,
1987.
[12] J. D. Bryngelson and P. G. Wolynes. Spin glasses and the statistical mechanics of protein
folding. Proc. Natl. Acad. Sci. (USA), 84:7524–7528, 1987.
[13] K. A. Dill, S. Bromberg, K. Yue, K. M. Fiebig, D. P. Yee, P. D. Thomas, and H. S.
Chan. Principles of protein folding - a perspective from simple exact models. Protein
Science, 1995:561–602, 1995.
[14] W. J. C. Orr. Statistical treatment of polymer solutions at infinte dilution. Tran. Faraday
Soc., 43:12–27, 1947.
[15] H. Taketomi, Y. Ueda, and N. Go. Studies on protein folding, unfolding, and fluctuations
by computer simulation. Int. J. Pept. Protein Res., 7:445–459, 1975.
[16] H. S. Chan and K. A. Dill. Intrachain loops in polymers:effects of excluded volume. J.
Chem. Phys., 90:493–509, 1989.
[17] D. K. Klimov and D. Thirumalai. Factors governing the foldability of proteins. Proteins:
Struct. Funct. Genet., 26:411–441, 1996.
[18] D. Thirumalai and D. K. Klimov. Emergence of stable and fast folding protein structure.
In S. Kim, K. J. Lee, and W. Sung, editors, Stochastic dynamics and Pattern Forma-
tion in Biological and Complex Systems, pages 95–111. Melville, New York, American
Institue of Physics, 2000.
[19] H. Li, N. Winfreen, and C. Tang. Emergence of preferred structures in a simple model
of protein folding. Science, 273:666–669, 1996.
[20] Per-Anker Lindgard and H. Bohr. Magic numbers in protein structures. Phys. Rev.
Letters, 77:779–782, 1996.
[21] P. G. Wolynes. Symmetry and the energy landscape of biomolecules. Proc. Natl. Acad.
Sci. (USA), 93:14249–14255, 1996.
19
[22] P. G. Wolynes. Folding nucleus and energy landscapes of larger proteins within the
capillarity approximation. Proc. Natl. Acad. Sci. USA, 94:6170–6175, 1997.
[23] J.H. Cate, A.R. Gooding, E. Podell, K. Zhou, B.L. Golden, C.E. Kundrot, T.R. Cech,
and J.A. Doudna. Crystal structure of a group i ribozyme domain: Principles of rna
packing. Science, 273:1678–1685, 1996.
[24] Z. Guo and D. Thirumalai. Kinetics of protein folding: Nucleation mechanism, time
scales and pathways. Biopolymers, 36:83–103, 1995.
[25] E. I. Shakhnovich, V. Abkevich, and E. I. Shakhnovich. Conserved residues and the
mechanism of protein folding. Nature, 379:96–98, 1996.
[26] J. Wang and W. Wang. A computational approach to simplifying the protein folding
alphabet. Nat. Struct. Bio., 6:1033–1038, 1999.
[27] W.A. Eaton, V. Munoz, P.A. Thompson, E.R. Henry, and J. Hofrichter. Kinetics and
dynamics of loops, α-helices, β-hairpins, and fast-folding proteins. Acc. Chem. Res.,
31:745–753, 1998.
[28] T.E. Fisher, A.F. Oberhauser, M. Carrion-Vazquez, P.E. Marszalek, and J.M. Fer-
nandez. The study of protein mechanics with the atomic force microscope. Trends in
Biochemical Sciences, 24:379–384, 1999.
[29] E. Shakhnovich and A. Gutin. Formation of unique structure in polypeptide chains. the-
oretical investigation with the aid of replica approach. Biophysical Chemistry, 34:187–
199, 1989.
[30] R. Sayle and E.J. Milner-White. Rasmol: Biomolecular graphics for all. Trends in Bio-
chemical Sciences, 20:374–376, 1995.
20
FIGURES
Fig. (1) The 3D native structure of hemoglobin visualized using RasMol 2.6 [30]. The
linear sequence of amino acids of hemoglobin is given below the figure.
Fig. (2) Scaling of the number of MES C(MES) (squares) is shown for the hydrophobic
parameter B0 = −0.1 and ∆ = 0.6. Data are obtained for the cubic lattice. The pairs of
squares for each N represent the quenched averages for different samples of 30 sequences.
The number of compact structures C(CS) and self-avoiding conformations C(SAW) are also
displayed to underscore the dramatic difference of scaling behavior of C(MES) and C(CS)
(or C(SAW)). It is clear that C(MES) remains practically flat, i.e. it grows no faster than
lnN.
Fig. (3) Histogram of number of structures with a given number of associated sequences
Ns for 3D 3x3x3 case, in a log-log plot.
Fig. (4) Schematic illustration of the stages in the drastic reduction of sequence space in
the process of evolution to functionally competent protein structures.
Fig. (5) Native structure of acyl-coenzyme A binding protein (first NMR structure out of
29 deposited to PDB). The figure was created using RasMol 2.6 [30].
Fig. (6) The native conformation of fast folding sequence (N = 27) with random bond
potentials is shown. This structure has c = 22 non-bonded contacts, therefore it is not a
maximally compact conformation for which c = 28. The figure was created using RasMol
2.6 [30].
Fig. (7) (a) Thermodynamic functions computed for the sequence whose native state is
shown in Fig. (3). (a) Specific heat Cv (dotted line) and derivative of the radius of gyra-
tion with respect to temperature dRg /dT (dashed line) as a function of temperature. The
collapse temperature Tθ is determined from the peak of Cv and found to be 0.83. Tθ is
very close to the temperature, at which dRg /dT becomes maximum (0.86). This illustrates
that Tθ is indeed associated with the compaction of the chain. Temperature dependence of
fluctuations of overlap function ∆χ is given by solid line. The folding transition temperature
TF is obtained from the peak of ∆χ and for this sequence TF = 0.79. The curves are scaled
to fit one plot.
(b) Time dependence of the fraction of unfolded molecules Pu (t) for the sequence 74 calcu-
lated at folding conditions Ts ∼ < TF . The function Pu (t) is computed from a distribution
of first passage times τ1i . First passage time for a given initial condition is the first time
the trajectory reaches the native conformation. Typically an adequately converged distri-
bution is obtained by averaging over several hundred initial conditions. For the conditions
used in this simulation folding is two-state, therefore, Pu (t) is adequately fit with the single
exponential (thick solid line). The folding time τF obtained from the fit is 1.4 × 106 MCS.
Fig. (8) Plot of the folding times τF as a function of σT for the 22 sequences. This figures
shows that under the external conditions when the NBA is the most populated there is a
remarkable correlation between τF and σT . The correlation coefficient is 0.94. It is clear
that over a four orders of magnitude of folding times τF ≈ exp(−σT /σ0 ) where σ0 is a
constant. In both panels the filled and open circles are for the RB and KGS 27-mer models,
respectively. The open squares are for N = 36.
Fig. (9) Examples of folding trajectories at T = Ts derived from the condition < χ(Ts ) >=
0.21. (a) Fast folding trajectory as monitored by χ(t). It is seen that sequence reaches the
21
native state very rapidly in a two-state manner without being trapped in intermediates.
The first passage time for this trajectory is 277,912 MCS. (b) Slow folding trajectory for the
same sequence. The sequence becomes trapped in several intermediate states with large χ
en route to the native state. The first passage time is 11,442,793 MCS. Notice that the time
scales in both panels are dramatically different.
Fig. (10) The figure gives the foldability σ of 27-mer lattice chains with sets containing
different number of amino acids. The sets are generated according to scheme described
in [26]. The set of 20 amino acids is taken as a standard sample. Each sequence with
20 amino acids is optimized to fulfill the stability gap [5]. The residues in the standard
samples are substituted with four different sets containing smaller number of amino acids
[26]. The foldability of these substitution are indicated by the solid circles. The open
diamonds correspond to the sequences with same composition. However, the amino acids
are chosen from the reduced representation and the result sequence is optimized using the
stability gap [5].
22
1val - 2leu - 3ser - 4pro - 5ala - 6asp - 7lys - 8thr - 9asn - 10val - 11lys
12ala - 13ala - 14trp - 15gly - 16lys - 17val - 18gly - 19ala - 20his - 21ala - 22gly
23glu - 24tyr - 25gly - 26ala - 27glu - 28ala - 29leu - 30glu - 31arg - 32met - 33phe
34leu - 35ser - 36phe - 37pro - 38thr - 39thr - 40lys - 41thr - 42tyr - 43phe - 44pro
45his - 46phe - 47asp - 48leu - 49ser - 50his - 51gly - 52ser - 53ala - 54gln - 55val
56lys - 57gly - 58his - 59gly - 60lys - 61lys - 62val - 63ala - 64asp - 65ala - 66leu
67thr - 68asn - 69ala - 70val - 71ala - 72his - 73val - 74asp - 75asp - 76met - 77pro
78asn - 79ala - 80leu - 81ser - 82ala - 83leu - 84ser - 85asp - 86leu - 87his - 88ala
89his - 90lys - 91leu - 92arg - 93val - 94asp - 95pro - 96val - 97asn - 98phe - 99lys
100leu - 101leu - 102ser - 103his - 104cys - 105leu - 106leu - 107val - 108thr - 109leu - 110ala
111ala - 112his - 113leu - 114pro - 115ala - 116glu - 117phe - 118thr - 119pro - 120ala - 121val
122his - 123ala - 124ser - 125leu - 126asp - 127lys - 128phe - 129leu - 130ala - 131ser - 132val
133ser - 134thr - 135val - 136leu - 137thr - 138ser - 139lys - 140tyr - 141arg
Fig. (1)
23
Fig. (2)
24
1000
Number of structures
100
10
1
1 10 100 1000 10000
NS
Fig. (3)
25
Sequence space
Sequences which encode MES
Foldable sequences
Biologically competent sequences
Fig. (4)
26
Fig. (5)
27
Fig. (6)
28
Fig. (7)
29
Fig. (8)
30
Fig. (9)
31
Fig. (10)
32

Introducing Protein Folding Using Simple Models

Uploaded by

Copyright:

Available Formats

Introducing Protein Folding Using Simple Models

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Introducing Protein Folding Using Simple Models

Uploaded by

Copyright:

Available Formats

Introducing Protein Folding Using Simple Models

D. Thirumalai and D. K. Klimov

University of Maryland, College Park, MD 20742

Key Words: Biomolecular Folding, Symmetry, Designability, Evolution

II. RANDOM HETEROPOLYMER AS A CARICATURE OF PROTEINS

III. LATTICE MODELS OF PROTEINS

The computational protocol for describing protein folding mechanisms is straightforward

Characteristic Temperatures: The basic features of folding can be understood in terms of

Topological Frustration and Kinetic Partitioning Mechanism: Lattice models can

Classifying Folding Mechanisms in terms of σT : The various folding mechanisms

Moderate Folders: Sequences for which σl ≤ σT ≤ σh can be classified as moderate fold-

vij = v0 + β(λi + λj )δ[ri − rj ] (A1)

where v0 is repulsive short range interaction, λi is a quenched random variable which is

(a and b are replica indices) the expression for Z n becomes

Hef f = G(qab , qbab , ρa , φa ) + ln ζ(qbab , φa ) (A9)

In Eq. (10) HT {ra (s)} is

iqbab (r, r ′) = −β 2 λ2 ρa (r)ρb (r ′ ). (A15)

In Section (II) we considered a variational-type theory to treat the thermodynamics of

The distribution of the random couplings is given by

Sequences which encode MES

Biologically competent sequences

You might also like