DCA_PM_PLM
DCA_PM_PLM
DCA_PM_PLM
of the general problem of inverse statistical mechanics, where the task is to learn model parameters
(fields, couplings) from observables (magnetizations, correlations, samples) in large systems. In
the context of protein sequences, the approach has been referred to as direct-coupling analysis.
Here we show that the pseudolikelihood method, applied to 21-state Potts models describing the
statistical properties of families of evolutionarily related proteins, significantly outperforms existing
approaches to the direct-coupling analysis, the latter being based on standard mean-field techniques.
This improved performance also relies on a modified score for the coupling strength. The results
are verified using known crystal structures of specific sequence instances of various protein families.
Code implementing the new method can be found at http://plmdca.csc.kth.se/.
PACS numbers: 02.50.Tt – Inference methods, 87.10.Vg – Biological information, 87.15.Qt – Sequence
analysis, 87.14.E- – Proteins
ally expensive, since it requires the computation of the The paper is structured as follows: in Section II, we
partition function Z for an arbitrary set of parameters. review the ideas underlying PSP and DCA and explain
The exact but iterative procedure known as Boltzmann the biological hypotheses linking protein 3D structure to
machines [28] does in fact work on small systems, but it correlations in amino-acid sequences. We also review ear-
is out of question for the problem sizes considered in this lier approaches to DCA. In Section III, we describe the
paper. On the other hand, the mean-field equations of Potts model in the context of DCA and the properties
(1) read [29–31]: of exponential families. We further detail a maximum-
X likelihood (ML) approach as brought to bear on the in-
tanh−1 mi = hi + Jij mj . (2) verse Potts problem and discuss in more detail why this
j is impractical for realistic system sizes, and we intro-
duce, similarly to (3) above, the inverse Potts mean-field
From (2) and the fluctuation-dissipation relations an model algorithm for the DCA (mfDCA) and a pseudolike-
equation can be derived connecting the coupling coef- lihood maximization procedure (plmDCA). This section
ficients Jij and the correlation matrix c = (cij ) [10]: also covers algorithmic details of both models such as reg-
ularization and sequence reweighting. A further impor-
Jij = −(c−1 )ij . (3) tant issue is the selection of an interaction score, which
reduces coupling matrices to a scalar score, and allows for
Equations (2) and (3) exemplify typical aspects of in- ranking of couplings according to their ’strength’. In Sec-
verse statistical mechanics, and inference in large sys-
tion IV, we present results from prediction experiments
tems in general. On one hand, the parameter recon-
using mfDCA and plmDCA assessed against known crys-
struction using these two equations is not exact. It is tal structures. In Section V, we summarize our findings,
only approximate, because the mean-field equations (2)
put their implications into context, and discuss possible
are themselves only approximate. It also demands rea- future developments. The appendixes contain additional
sonably good sampling, as the matrix of correlations is
material supporting the main text.
not invertible unless it is of full rank, and small noise
on its O(N 2 ) entries may result in large errors in esti-
mating the Jij . On the other hand, this method is fast,
as fast as inverting a matrix, because one does not need II. PROTEIN STRUCTURE PREDICTION AND
DIRECT-COUPLING ANALYSIS
to compute Z. Except for mean-field methods as in (2),
approximate methods recently used to solve the inverse
Ising problem can be grouped as expansion in correla- Proteins are essential players in almost all biological
tions and clusters [16, 19], methods based on the Bethe processes. Primarily, proteins are linear chains, each site
approximation [17, 18, 20–22], and the pseudolikelihood being occupied by 1 out of 20 different amino acids. Their
method [23, 27]. function relies, however, on the protein fold, which refers
For PSP, it is not the Ising model but a 21-state Potts to the 3D conformation into which the amino-acid chain
model which is pertinent [1]: The model shall be learned curls. This fold guarantees, e.g., that the right amino
such that it represents the statistical features of large acids are exposed in the right positions at the protein
multiple sequence alignments of homologous (evolution- surface, thus forming biochemically active sites, or that
arily related) proteins, and to reproduce the statistics of the correct pairs of amino acids are brought into contact
correlated amino acid substitutions. This can be done to keep the fold thermodynamically stable.
with the Potts equivalent of Eq. (1), i.e. using a model Experimentally determining the fold, using e.g. x-ray
with pairwise interactions. As will be detailed below, crystallography or NMR tomography, is still a rather
strong interactions can be interpreted as indicators for costly and time-consuming process. On the other hand,
spatial vicinity of residues in the three-dimensional pro- every newly sequenced genome results in a large num-
tein fold, even if residues are well separated along the ber of newly predicted proteins. The number of se-
sequence – thus linking evolutionary sequence statistics quenced organisms now exceeds 3, 700, and continues
with protein structure. But which of all the inference to grow exponentially (genomesonline.org [32]). The
methods in inverse statistical mechanics, machine learn- most prominent database for protein sequences, Uniprot
ing or statistics is most suitable for treating real protein (uniprot.org [33]), lists about 25,000,000 different pro-
sequence data? How do the test results obtained for inde- tein sequences, whereas the number of experimentally
pendently generated equilibrium configurations of Potts determined protein structures is only around 85,000
models translate to the case of protein sequences, which (pdb.org [34]).
are neither independent nor equilibrium configurations However, the situation of structural biology is not as
of any well-defined statistical-physics model? The main hopeless as these numbers might suggest. First, proteins
goal of this paper is to move towards an answer to this have a modular architecture; they can be subdivided into
question by showing that the pseudolikelihood method is domains which, to a certain extent, fold and evolve as
a very powerful means to perform DCA, with a predic- units. Second, domains of a common evolutionary ori-
tion accuracy considerably out-performing methods pre- gin, i.e., so-called homologous domains, are expected to
viously assessed. almost share their 3D structure and to have related func-
3
DCA and PSICOV both appear capable of outperform- It is immediate that the probabilistic model which max-
ing the Bayesian network approach [2, 9], their relative imizes the entropy while satisfying Eq. (7) must take
efficiency is currently open to investigation, and is not the Potts model form. Finding a Potts model which
assessed in this work. matches empirical frequencies and correlations is there-
Finally, predicting amino acid contacts is not only fore referred to as a maximum-entropy inference [43, 44],
a goal in itself, but also a means to assemble protein cf. also [1, 38] for a formulation in terms of protein-
complexes [40, 41] and to predict full 3D protein struc- sequence modeling.
tures [3, 4, 42]. Such tasks require additional work, using On the other hand, we can use Eq. (6) as a varia-
the DCA results as input, and are outside the scope of tional ansatz and maximize the probability of the input
the present paper. data set {σ (b) }B
b=1 with respect to the model parameters
hi (σ) and Jij (σ, σ ′ ); this maximum-likelihood perspective
is used in the following. We note that the Ising and the
III. METHOD DEVELOPMENT Potts models (and most models which are normally con-
sidered in statistical mechanics) are examples of expo-
A. The Potts model nential families, and have the property that means and
correlations are sufficient statistics [45–47]. Given un-
Let σ = (σ1 , σ2 , · · · , σN ) represent the amino acid se- limited computing power to determine Z, reconstruction
quence of a domain with length N . Each σi takes on can not be done better using all the data compared to us-
values in {1, 2, ..., q}, with q = 21: one state for each of ing only (empirical) means and (empirical) correlations.
the 20 naturally occurring amino acids and one additional It is only when one cannot compute Z exactly and has
state to represent gaps. Thus, an MSA with B aligned to resort to approximate methods, that using directly all
sequences from a domain family can be written as an in- the data can bring any advantage.
teger array {σ (b) }Bb=1 , with one row per sequence and one
column per chain position. Given an MSA, the empirical
individual and pairwise frequencies can be calculated as
B
B. Model parameters and gauge invariance
1 X (b)
fi (k) = δ(σi , k),
B The total number of parameters of Eq, (6) is N q +
b=1
B N (N −1) 2
1 X (b) (b) 2 q ,but, in fact, the model as it stands is over-
fij (k, l) = δ(σi , k) δ(σj , l), (4) parameterized in the sense that distinct parameter sets
B
b=1 can describe the same probability distribution. It is easy
where δ(a, b) is the Kronecker symbol taking value 1 if to see that the number of nonredundant parameters is
both arguments are equal, and 0 otherwise. fi (k) is hence N (q − 1) + N (N2−1) (q − 1)2 , cf. an Ising model (q = 2),
the fraction of sequences for which the entry on position which has N (N2+1) parameters if written as in Eq. (1)
i is amino acid k, gaps counted as a 21st amino acid. but would have 2N 2 parameters if written in the form of
Similarly, fij (k, l) is the fraction of sequences in which Eq. (6).
the position pair (i, j) holds the amino acid combination A gauge choice for the Potts model, which eliminates
(k, l). Connected correlations are given as the overparametrization in a similar manner as in the
cij (k, l) = fij (k, l) − fi (k) fj (l). (5) Ising model (and reduces to that case for q = 2), is
in which hi (σi ) and Jij (σi , σj ) are parameters to be de- Jij (q, l) = Jij (k, q) = hi (q) = 0, (9)
termined through the constraints
X
P (σi = k) = P (σ) = fi (k), for all i, j(> i), k, and l, cf. [2]. This gauge choice
σ corresponds to a lattice gas model with q − 1 different
σi =k
X particle types, and a maximum occupation number one.
P (σi = k, σj = l) = P (σ) = fij (k, l). (7) Using either (8) or (9) reconstruction is well-defined,
σ
σj =l
and it is straight-forward to translate results obtained in
σi =k one gauge to the other.
5
Given a set of independent equilibrium configurations Pseudolikelihood substitutes the probability in (10) by
{σ (b) }B
b=1 of the model Eq. (6), the ordinary statistical the conditional probability of observing one variable σr
approach to inferring parameters {h, J} would be to let given observations of all the other variables σ \r . That
those parameters maximize the likelihood (i.e., the prob- is, the starting point is
ability of generating the data set for a given set of pa- (b)
rameters). This is equivalent to minimizing the (rescaled) P (σr = σr(b) |σ \r = σ \r )
negative log-likelihood function
N
(b) P (b) (b)
B
1 X exp hr (σr ) + Jri (σr , σi )
l=− log P (σ (b) ). (10) i=1
i6=r
B = , (14)
b=1
N
Pq (b)
For the Potts model (6), this becomes
P
l=1 exp hr (l) + Jri (l, σi )
q
N X i=1
X i6=r
′
H. Interaction scores (after altering the fields appropriately) and that Jij (k, l)
satisfy (8). A possible Frobenius norm score is hence
In the inverse Ising problem, each interaction is scored v
u q
uX
by one scalar coupling strength Jij . These can easily
be ordered, e.g. by absolute size. In the inverse Potts
FN
Sij = kJ′ij k2 = t ′ (k, l)2 .
Jij (27)
k,l=1
problem, each position pair (i, j) is characterized by a
whole q×q matrix Jij , and some scalar score Sij is needed
Lastly, we borrow an idea from Jones et al. [9], whose
in order to evaluate the ‘coupling strength’ of two sites.
PSICOV method also uses a norm rank (1-norm instead
In [1] and [2] the score used is based on direct infor-
of Frobenius norm), but adjusted by an average-product
mation (DI), i.e., the mutual information of a restricted
correction (APC) term. APC was introduced in [51] to
probability model not including any indirect coupling ef-
suppress effects from phylogenetic bias and insufficient
fects between the two positions to be scored. Its con-
sampling. Incorporating also this correction, we have
struction goes as follows: For each position pair (i, j),
our scoring function
(the estimate of) Jij is used to set up a ’direct distribu-
tion’ involving only nodes i and j, S·jF N Si·F N
CN FN
Sij = Sij − , (28)
(dir) S··F N
(k, l) ∼ exp Jij (k, l) + h′i,k + h′j,l .
Pij (23)
where CN stands for ‘corrected norm’. An imple-
h′i,k and h′j,l are new fields, computed as to ensure agree- mentation of plmDCA in MATLAB is available at
ment of the marginal single-site distributions with the http://plmdca.csc.kth.se/.
empirical individual frequency counts fi (k) and fj (l).
DI
The score Sij is now calculated as the mutual infor-
mation of P (dir)
: IV. EVALUATING THE PERFORMANCE OF
MFDCA AND PLMDCA ACROSS PROTEIN
q (dir) FAMILIES
!
DI
X (dir) Pij (k, l)
Sij = Pij (k, l) log . (24)
fi (k) fj (l)
k,l=1 We have performed numerical experiments using
mfDCA and plmDCA on a number of domain families
DI
A nice characteristic of Sij is its invariance with re- from the Pfam database; here we report and discuss the
spect to the gauge freedom of the Potts model, i.e., both results.
choices Eqs. (8) and (9) (or any other valid choice) gen-
DI
erate identical Sij .
In the pseudolikelihood approach, we prefer not to use A. Domain families, native structures, and
DI true-positive rates
Sij , as this would require a pseudocount λ to regularize
the frequencies in (24), introducing a third regulariza-
tion variable in addition to λh and λJ . Another possible The speed of mfDCA enabled Morcos et al. [2] to
scoring function, already mentioned but not used in [1], conduct a large-scale analysis using 131 families. PLM
is the Frobenius norm (FN) is computationally more demanding than NMFI, so we
v chose to start with a smaller collection of 17 families,
u q
uX listed in Table I. To ease the numerical effort, we chose
kJij k2 = t Jij (k, l)2 . (25) families with relatively small N and B. More precisely,
k,l=1 we selected families out of the first 115 Pfam entries (low
Pfam ID), which have (i) at most N = 130 residues, (ii)
DI between 2,000 and 22,000 sequences, and (iii) reliable
Unlike Sij , (25) is not independent of gauge choice, so
one must be a bit careful. As was noted in [1], the zero structural information (cf. the PDB entries provided in
sum gauge (8) minimizes the Frobenius norm, in a sense the table). No selection based on DCA performance was
making (8) the most appropriate gauge choice for the done. In the appendix, a number of longer proteins is
score (25). Recall from above that our pseudolikelihood studied. The mfDCA performance on the selected fami-
uses the full representation and fixes the gauge by the lies was found to be coherent with the larger data set of
regularization terms Rl2 . Our procedure is therefore to Morcos et al..
first infer the interaction parameters using the pseudo- To reliably assess how good a contact prediction is,
likelihood and the regularization, and then to change to something to regard as a ’gold standard’ is helpful. For
the zero-sum gauge: each of the 17 families we have therefore selected one rep-
resentative high-resolution X-ray crystal structure (reso-
′
Jij (k, l) = Jij (k, l) − Jij (·, l) − Jij (k, ·) + Jij (·, ·), (26) lution below 3Å), see Table I for the corresponding PDB
identification.
where ‘·’ denotes average over the concerned position. From these native protein structures, we have ex-
One can show that (26) preserves the probabilities of (6) tracted position-position distances d(i, j) for each pair of
8
TABLE I. Domain families included in our study, listed with Pfam ID, length N , number of sequences B (after removal of
duplicate sequences), number of effective sequences Bef f (under x = 0.9, i.e., 90% threshold for reweighting), and the PDB
and UniProt specifications for the structure used to access the DCA prediction quality.
500
Number of occurrences
600
1000 400
400 300
500 200
200
100
0 0 0
0 8.5 20 40 0 8.5 20 40 0 8.5 20 40
Distance (Å) Distance (Å) Distance (Å)
FIG. 2. (Color online) Histograms of crystal-structure distances pooled from all 17 families, when including all pairs (left),
pairs with |j − i| > 4 (center), and pairs with |j − i| > 14 (right). The vertical line is our contact cutoff 8.5Å.
sequence positions, by measuring the minimal distance peak at 3-5Å to presumably correspond to short-range
between any two heavy atoms belonging to the amino interactions like hydrogen bonds or secondary-structure
acids present in these positions. The left panel of Fig. 2 contacts, whereas the last peak likely corresponds to
shows the distribution of these distances in all considered long-range, possibly water-mediated interactions. These
families. Three peaks protrude from the background dis- peaks contain the nontrivial information we would like
tribution: one at small distances below 1.5Å, a second at to extract from sequence data using DCA. In order to
about 3-5Å and a third at about 7-8Å. The first peak cor- accept the full second peak, we have chosen a distance
responds to the peptide bonds between sequence neigh- cutoff of 8.5Å for true contacts, slightly larger than the
bors, whereas the other two peaks correspond to non- value of 8Å used in [2].
trivial contacts between amino acids, which may be dis- Accuracy results are here reported primarily using
tant along the protein backbone, as can be seen from the true-positive (TP) rates, also the principal measurement
center and right panels of Fig. 2, which collect only dis- in [2] and [9]. The TP rate for p is the fraction of the p
tances between positions i and j with minimal separation strongest-scored pairs which are actually contacts in the
|j − i| ≥ 5 resp. |j − i| ≥ 15. Following [2], we take the crystal structure, defined as described above. To exem-
9
DI
cies when it gets to the evaluations of Sij . We append
’DI’ to the NMFI name, since, later on, we will also try
CN
the Sij score for NMFI (which we will call NMFI-CN).
With l2 regularization in the PLM algorithm, out-
comes were robust against the precise choice of λh ; TP FIG. 3. (Color online) Contact-detection results for the 17
rates were almost identical when λh was changed between families, sorted by Bef f . The y-axes are TP rates and x-
0.001 and 0.1. We therefore chose λh = 0.01 for all ex- axes are the number of predicted contacts p, based on pairs
with |j − i| > 4. The three curves for each method are the
periments. What mattered, rather, was the coupling reg-
three regularization levels yielding highest TP rates across all
ularization parameters λJ , for which we did a systematic families. The thickened curve highlights the best one out of
scan from λJ = 0 and up using step-size 0.005. these three (λ = Bef f for NMFI and λJ = 0.01 for PLM).
So, to summarize, the results reported here are based
on x = 0.9, cutoff 8.5Å, λh = 0.01, and λ and λJ drawn
from collections of values as described above. In the following discussion, we leave out all results for
NMFI-DI(true) and focus on PLM vs. NMFI-DI, i.e., the
version used in [2]. All plots remaining in this section use
C. Main comparison of mfDCA and plmDCA the optimal regularization values: λ = Bef f for NMFI
and λJ = 0.01 for PLM.
Fig. 3 shows TP rates for the different families and TP rates only classify pairs as contacts (d(i, j) < 8.5Å)
methods. We see that the TP rates of plmDCA (PLM) or noncontacts (d(i, j) ≥ 8.5Å). To give a more detailed
are consistently higher than those of mfDCA (NMFI), view of how scores correlate with spatial separation, we
especially for families with large Bef f . For what con- show in Fig. 4 a scatter plot of the score vs. distance for
cerns the two NMFI versions: NMFI-DI(true) avoids the all pairs in all 17 families. PLM and NMFI-DI both man-
strong failure seen in NMFI-DI for PF00084, but for most age to detect the peaks seen in the true distance distribu-
other families, see in particular PF00014 and PF00081, tion of Fig. 2, in the sense that high scores are observed
the performance instead drops using marginals without almost exclusively at distances below 8.5Å. Both meth-
DI
pseudocounts in the Sij calculation. For both NMFI-DI ods agree that interactions get, on average, progressively
and NMFI-DI(true), the best regularization was found to weaker going from peak one, to two, to three, and finally
be λ = 1 · Bef f , the same value as used in [2]. For PLM, to the bulk. We note that the dots scatter differently
the best parameter choice was λJ = 0.01. Interestingly, across the PLM and NMFI-DI figures, reflecting the two
DI
this same regularization parameter was optimal for ba- separate scoring techniques: Sij are strictly nonnega-
sically all families. This is somewhat surprising, since tive, whereas APC corrected norms can assume negative
both N and Bef f span quite wide ranges (48-130 and values. We also observe how sparse the extracted signal
1277-14125 respectively). is: most spatially close pairs do not show elevated scores.
10
FIG. 4. (Color online) Score plotted against distance for all position pairs in all 17 families for PLM (left) and NMFI-DI (right).
The vertical line is our contact cutoff at 8.5Å.
However, from the other side, almost all strongly coupled induced by pairs which are not in contact in the X-ray
pairs are close, so the biological hypothesis of Sec. II is crystal structure used for evaluating prediction accuracy.
well supported here. The important point – and this seems to be a distinguish-
Fig. 5 shows scatter plots of scores for PLM and NMFI- ing feature of maximum-entropy models as compared to
DI for some selected families. Qualitatively the same pat- simpler correlation-based methods – is to find a sufficient
terns were observed for all families. The points are clearly number of well-distributed contacts to enable de-novo 3D
correlated, so, to some extent, PLM and NMFI-DI agree structure prediction [3–5, 40–42]. In this sense, it is more
on the interaction strengths. Due to the different scoring important to achieve high accuracy of the first predicted
schemes, we would not expect numerical coincidence of contacts, than intermediate accuracy for many predic-
scores. Many of PLM’s top-scoring position pairs have tions.
also top scores for NMFI-DI and vice versa. The largest In summary, the results suggest that the PLM method
discrepancy is in how much more strongly NMFI-DI re- offers some interesting progress compared to NMFI. How-
sponds to pairs with small |j − i|; the crosses tend to ever, let us also note that in the comparison we also had
shoot out to the right. PLM agrees that many of these to change both scoring and regularization styles. It is
neighbor pairs interact strongly, but, unlike NMFI-DI, it thus conceivable that a NMFI with the new scoring and
also shows rivaling strengths for many |j − i| > 4-pairs. regularization could be more competitive with PLM. In-
deed, upon further investigation, detailed in Appendix B,
An even more detailed picture is given by considering
we found that part of the improvement in fact does stem
contact maps, see Fig. 6. The tendency observed in the
from the new score. In Appendix C, where we extend our
last scatter plots remains: NMFI-DI has a larger por-
tion of highly scored pairs in the neighbor zone, which results to longer Pfam domains, we therefore add results
from NMFI-CN, an updated version of the code used in
are the middle stretches in these figures. An important CN DI
observation is, however, that clusters of contacting pairs [2] which scores by Sij instead of Sij .
with long 1D sequence separation are captured by both
algorithms.
Note, that only a relatively small fraction of contacts D. Run times
is uncovered by DCA, before false positives start to ap-
pear, and that many native contacts are missed. How- In general, NMFI, which is merely a matrix inversion,
ever, the aim of DCA cannot be to reproduce a com- is very quick compared with PLM; most families in this
plete contact map: It is based on sequence correlations study took only seconds to run through the NMFI code.
alone (e.g. it cannot predict contacts for 100% conserved In contrast to the message-passing based method used
residues), it does not consider any physico-chemical prop- in [1], a DCA using PLM is nevertheless feasible for all
erty of amino acids (they are seen as abstract letters), it protein families in Pfam. The objective function in PLM
does not consider the need to be embeddable in 3D. Fur- is a sum over nodes and samples and its execution time
thermore, proteins may undergo conformational changes is therefore expected to depend both on B (number of
(e.g. allostery) or homo-dimerize, so coevolution may be members of a protein family in Pfam) and N (length of
11
PF00028 PF00035
90 90
1.4 1.4 80 80 60 60
PF00018 PF00028
1.2 1.2 70 70 50 50
60 60 40 40
1 1 50 50
40 40 30 30
PLM (CN)
0.8 0.8 30 30 20 20
0.6 20 20
0.6 10 10 10 10
0.4 00 00
0.4
0.2
0.2
0
0
−0.2 PF00043 PF00046
−0.2
0.1 0.2 0.3 0.4 0.1 0.2 0.3 0.4 0.5
NMFI−DI (DI)
1.4 1.4
1.2 PF00043 1.2 PF00111 NMFI−DI PLM NMFI−DI PLM
1 1
0.8 0.8
0.6 0.6 90 90
80 80 50 50
0.4 0.4 70 70 40 40
60 60
0.2 0.2 50 50 30 30
40 40
0 0 30 30 20 20
20 20 10 10
−0.2 −0.2 10 10
0.2 0.4 0.6 0.1 0.2 0.3 0.4 00 00
FIG. 5. (Color online) Scatter plots of interaction scores for FIG. 6. (Color online) Predicted contact maps for PLM (right
PLM and NMFI-DI from four families. For all plots, the axes part, black symbols) and NMFI-DI (left part, red symbols
are as indicated by the top left figure. The distance unit in online) for four families. A pair (i, j)’s placement in the plots
the top box is Å. is found by matching positions i and j on the axes. Native
contacts are indicated in gray. True and false positives are
represented by circles and crosses, respectively. Each figure
shows the 1.5N strongest ranked pairs (including neighbors)
the aligned sequences in a protein family). for that family.
On a standard desktop computer, using a basic
MATLAB-interfaced C-implementation of conjugate gra-
dient descent, the run times for PF00014, PF00017, and as large as 10N , suggesting that plmDCA can be made
PF00018 (small N and B) were 50, 160 and 90 respec- competitive not only in terms of accuracy, but also in
tively. For PF00041 (small N but larger B) one run terms of computational speed.
took 15 min. For domains with larger N , like those in Finally, all of these times were obtained cold-starting
Appendix C, run times grow approximately apace. For with all fields and couplings at 0. Presumably, one can
example, the run times for PF00026 (N = 314) and improve by using an appropriate initial guess obtained,
PF00006 (N = 215) were 80 and 65 min respectively. say, from NMFI. This has however not been implemented
A well-known alternative use of pseudolikelihoods is to here.
minimize each gr separately. While slightly more crude,
such an ’asymmetric’ variant of plmDCA would amount
to N independent (multiclass) logistic regression prob- V. DISCUSSION
lems, which would make parallel execution trivial on up
to N cores. A rigorous examination of the asymmetric In this work, we have shown that a direct-couping anal-
version’s performance is beyond the scope of the present ysis built on pseudolikelihood maximization (plmDCA)
work, but initial tests suggest that even on a single pro- consistently outperforms the previously described mean-
cessor it requires convergence times almost an order of field based analysis (mfDCA), as assessed across a num-
magnitude smaller than the symmetric one (which we ber of large protein-domain families. The advantage of
used in this work), while still yielding almost exactly the the pseudolikelihood approach was found to be partially
same TP rates. Using N processors, the above mentioned intrinsic, and partly contingent on using a sampling-
run times could thus, in principle, be dropped by a factor corrected Frobenius norm to score inferred direct statis-
12
tical coupling matrices. ing the sequence statistics by pairwise Potts models, most
On one hand, this improvement might not be sur- extractable information might already be extracted from
prising: it is known that for very large data sets PLM the MSA. However, it may well also be that there is al-
becomes asymptotically equivalent to full maximum- ternative information hidden in the sequences, for which
likelihood inference, whereas mean-field inference re- we would need to go beyond pairwise models, or inte-
mains intrinsically approximate, and this may result in grate the physico-chemical properties of different amino
an improved PLM performance also for finite data sets acids into the procedure, or extract even more informa-
[23]. tion from large sets of evolutionarily related amino-acid
On the other hand, the above advantage holds if and sequences. DCA is only a step in this direction.
only if the following two conditions are fulfilled: Data In our work we have seen that simple sampling cor-
are drawn independently from a probability distribution, rections, more precisely sequence reweighting and the
and this probability distribution is the Boltzmann dis- average-product correction of interaction scores, lead to
tribution of a Potts model. None of these two con- an increased accuracy in predicting 3D contacts of amino
ditions actually hold for real protein sequences. On acids, which are distant on the protein’s backbone. It is,
artificial data, refined mean-field methods (Thouless- however, clear that these somewhat heuristic statistical
Anderson-Palmer equations, Bethe approximation) also fixes cannot correct for the complicated hierarchical phy-
lead to improved model inference as compared to NMFI, logenetic relationships between proteins, and that more
cf. e.g. [14, 16, 17, 21], but no such improvement has sophisticated methods would be needed to disentangle
been observed in real protein data [2]. The results of phylogenetic from functional correlations in massive se-
the paper are therefore interesting and highly nontriv- quence data. To do so is an open challenge, which would
ial. They also suggest that other model-learning meth- leave the field of equilibrium inverse statistical mechan-
ods from statistics such as ’Contrastive Divergence’ [52] ics, but where methods of inverse statistical mechanics
or the more recent ’Noise-Contrastive Estimation’ [53], may still play a useful role.
could be explored to further increase our capacity to ex- ACKNOWLEDGMENTS
tract structural information from protein sequence data.
Disregarding the improvements, we find that overall This work was supported by the Academy of Finland
the predicted contact pairs for plmDCA and mfDCA are as part of its Finland Distinguished Professor program,
highly overlapping, illustrating the robustness of DCA project 129024/Aurell, and through the Center of Excel-
results with respect to the algorithmic implementation. lence COIN. Discussions with S. Cocco and R. Monasson
This observations suggests that, in the context of model- are gratefully acknowledged.
[1] M. Weigt, R. White, H. Szurmant, J. Hoch, and T. Hwa, [10] H. J. Kappen and F. B. Rodriguez, in Advances in Neural
Proc. Natl. Acad. Sci. U. S. A. 106, 67 (2009) Information Processing Systems (The MIT Press, 1998)
[2] F. Morcos, A. Pagnani, B. Lunt, A. Bertolino, D. S. pp. 280–286
Marks, C. Sander, R. Zecchina, J. N. Onuchic, T. Hwa, [11] E. Schneidman, M. Berry, R. Segev, and W. Bialek, Na-
and M. Weigt, Proc. Natl. Acad. Sci. U. S. A. 108, ture 440, 1007 (2006)
E1293 (2011) [12] A. Braunstein and R. Zecchina, Phys. Rev. Lett. 96,
[3] D. S. Marks, L. J. Colwell, R. P. Sheridan, T. A. Hopf, 030201 (2006)
A. Pagnani, R. Zecchina, and C. Sander, PLoS One 6, [13] A. Braunstein, A. Pagnani, M. Weigt, and R. Zecchina,
e28766 (2011) J. Stat. Mech. 2008, P12001 (2008)
[4] J. Sulkowska, F. Morcos, M. Weigt, T. Hwa, and [14] Y. Roudi, J. A. Hertz, and E. Aurell, Front.
J. Onuchic, Proc. Natl. Acad. Sci. U. S. A. 109, 10340 Comput. Neurosci. 3 (2009), doi:“bibinfo doi
(2012) 10.3389/neuro.10.022.2009
[5] T. Nugent and D. T. Jones, Proc. Natl. Acad. Sci. U. S. [15] S. Cocco, S. Leibler, and R. Monasson, Proc. Natl. Acad.
A. 109, E1540 (2012) Sci. U. S. A. 106, 14058 (2009)
[6] M. Punta, P. C. Coggill, R. Y. Eberhardt, J. Mistry, [16] V. Sessak and R. Monasson, J. Phys. A: Math. Theor.
J. G. Tate, C. Boursnell, N. Pang, K. Forslund, G. Ceric, 42, 055001 (2009)
J. Clements, A. Heger, L. Holm, E. L. L. Sonnhammer, [17] M. Mézard and T. Mora, J. Physiol. (Paris) 103, 107
S. R. Eddy, A. Bateman, and R. D. Finn, Nucleic Acids (2009)
Res. 40, D290 (2012) [18] E. Marinari and V. V. Kerrebroeck, J. Stat. Mech.(2010),
[7] L. Burger and E. van Nimwegen, PLoS Comput. Biol. 6, doi:“bibinfo doi 10.1088/1742-5468/2010/02/P02008
E1000633 (2010) [19] S. Cocco and R. Monasson, Phys. Rev. Lett. 106, 090601
[8] S. Balakrishnan, H. Kamisetty, J. Carbonell, S. Lee, and (2011)
C. Langmead, Proteins: Struct., Funct., Bioinf. 79, 1061 [20] F. Ricci-Tersenghi, J. Stat. Mech.(2012)
(2011) [21] H. Nguyen and J. Berg, J. Stat. Mech.(2012)
[9] D. T. Jones, D. W. A. Buchan, D. Cozzetto, and M. Pon- [22] H. Nguyen and J. Berg, Phys. Rev. Lett. 109 (2012)
til, Bioinformatics 28, 184 (2012)
13
[23] E. Aurell and M. Ekeberg, Phys. Rev. Lett. 108, 090201 [53] M. Gutmann and A. Hyvärinen, J. Mach. Learn. Res. 13,
(2012) 307 (2012)
[24] A. Hyvärinen, J. Karhunen, and E. Oja, Independent
Component Analysis, Adaptive and Learning Systems for
Signal Processing, Communications, and Control (John Appendix A: Circle plots
Wiley & Sons, 2001)
[25] J. Rissanen, Information and Complexity in Statistical
Modeling (Springer, 2007) To get a sense of how false positives distribute across
[26] M. J. Wainwright and M. I. Jordan, the domains, we draw interactions into circles in Fig. 7.
Foundations and Trends in Machine Learning 1, 1 Among erroneously predicted contacts there is some ten-
(2008) dency towards loopiness, especially for NMFI-DI; the
[27] P. Ravikumar, M. J. Wainwright, and J. D. Lafferty, Ann. black lines tend to ‘bounce around’ in the circles. It
Stat. 38, 1287 (2010) hence seems that relatively few nodes are responsible for
[28] H. Ackley, E. Hinton, and J. Sejnowski, Cogn. Sci. 9, 147 many of the false positives. We performed an explicit
(1985) check of the data columns belonging to these ‘bad’ nodes,
[29] G. Parisi, Statistical Field Theory (Addison Wesley, and we found that they often contained strongly biased
1988)
data, i.e., had a few large f i (k). In such cases, it seemed
[30] L. Peliti, Statistical Mechanics in a Nutshell (Princeton
University Press, 2011) ISBN ISBN-13: 9780691145297 that NMFI-DI was more prone than PLM to report a
[31] K. Fischer and J. Hertz, Spin Glasses (Cambridge Uni- (predicted) interaction.
versity Press, 1993) ISBN 0521447771, 9780521447775
[32] I. Pagani, K. Liolios, J. Jansson, I. Chen, T. Smirnova,
B. Nosrat, and M. V.M., Nucleic Acids Res. 40, D571
Crystal structure PLM NMFI−DI
(2012)
[33] The Uniprot Consortium, Nucleic Acids Res. 40, D71
PF00028
(2012)
[34] H. Berman, G. Kleywegt, H. Nakamura, and J. Markley,
Structure 20, 391 (2012)
[35] U. Göbel, C. Sander, R. Schneider, and A. Valencia,
Proteins: Struct., Funct., Genet. 18, 309 (1994)
[36] S. W. Lockless and R. Ranganathan, Science 286, 295
(1999)
PF00035
(1957)
[44] E. T. Jaynes, Physical Review Series II 108, 171190
(1957)
[45] G. Darmois, C.R. Acad. Sci. Paris 200, 12651266 (1935),
in French
[46] E. Pitman and J. Wishart,
Math. Proc. Cambridge Philos. Soc. 32, 567579 (1936)
[47] B. Koopman, Trans. Am. Math. Soc. 39, 399409 (1936)
[48] A. Kholodenko, J. Stat. Phys. 58, 357 (1990)
[49] T. Tanaka, Neural Comput. 12, 19511968 (2000)
FIG. 7. (Color online) Connections for four families overlaid
[50] S. ichi Amari, S. Ikeda, and H. Shimokawa, “Informa-
on circles. Position ‘1’ is indicated by a dash. The leftmost
tion geometry and mean field approximation: the alpha-
column shows contacts in the crystal structure (dark gray for
projection approach,” in Advanced Mean Field Methods
d(i, j) < 5Å and light gray for 5Å≤ d(i, j) < 8.5Å). The other
– Theory and Practice (MIT Press, 2001) Chap. 16, pp.
two columns show the top 1.5N strongest ranked |j − i| > 4-
241–257, ISBN 0-262-15054-9
pairs for PLM and NMFI-DI, with gray for true positives and
[51] S. D. Dunn, L. M. Wahl, and G. B. Gloor, Bioinformatics
black for false positives.
24, 333 (2008)
[52] G. Hinton, Neural Comput. 14, 17711800 (2002)
14
Appendix B: Other scores for naive mean-field is in fact about the same for both methods. Still, PLM
inversion maintains a somewhat higher TP rates overall.
0.5
1 10 100
Number of predicted pairs
FIG. 10. (Color online) Contact-detection results for all the
CN
families in our study (sorted by Bef f ), now with the Sij
FIG. 8. (Color online) Mean TP rates, using pairs with |j − score for NMFI. The y-axes are TP rates and the x-axes
DI
i| > 4, for NMFI with old score Sij CDI
, new APC score Sij , are the number of predicted contacts p, based on pairs with
CN
and the norm score Sij . Each curve corresponds to the best |j − i| > 4. The three curves for each method are the three
λ for that particular score. regularization levels yielding highest TP rates across all fam-
ilies. The thickened curve highlights the best one out of these
three (λ = 2.3 · Bef f for NMFI-CN and λJ = 0.01 for PLM).
Motivated by the results of Fig. 8, we decided to com-
CN
pare NMFI and PLM under the Sij score. All figures
in this paragraph show the best regularization for each Figure 11 shows scatter plots for the same families as
CN
method, unless otherwise stated. Figure 9 shows score vs. in Fig. 5 but using the Sij scoring for NMFI. The points
distance for all pairs in all families. Unlike Fig. 4, the now lie more clearly on a line, from which we conclude
two plots now show very similar profiles. We note, how- that the bends in Fig. 5 were likely a consequence of
CN differing scores. Yet, the trends seen in Fig. 5 remain:
ever, that NMFI’s Sij scores trend two to three times
larger than PLM’s (the scales on the vertical axes are NMFI gives more attention to neighbor pairs than does
different). Perhaps this is an inherent feature of these PLM.
methods, or simply a consequence of the different types In Fig. 12, we recreate the contact maps of Fig. 6 with
of regularization. NMFI-CN in place of NMFI-DI and find that the plots
Figure 10 shows the same situation as Fig. 3, but us- are more symmetric. As expected, asymmetry is seen
CN primarily for small |j − i|; NMFI tends to crowd these
ing Sij to score NMFI. The three best regularization
choices for NMFI-CN turned out the same as before, i.e., regions with lots of loops.
λ = 1 · Bef f , λ = 1.5 · Bef f and λ = 2.3 · Bef f , but To investigate why NMFI assembles so many top-
the best out of these three was now λ = 2.3 · Bef f (in- scored pairs in certain neighbor regions, we performed
stead of λ = 1 · Bef f ). Comparing Fig. 3 and Fig. 10 one an explicit check of the associated MSA columns. A rel-
can see that the difference between the two methods is evant regularity was observed: when gaps appear in a
now smaller; for several families, the prediction quality sequence, they tend to do so in long strands. The pic-
15
FIG. 9. (Color online) Score plotted against distance for all position pairs in all 17 families for PLM (left) and NMFI-CN
(right). The vertical line is our contact cutoff at 8.5Å.
··· 1.5 1.5
· · · 6 5 9 7 268 7 4 4 2 2 · · ·
PF00028
1 PF00018
· · · 1 1 1 1 111 1 1 1 2 8 · · · 1
PLM (CN)
· · · 6 5 2 7 238 9 5 4 2 3 · · ·
0.5
· · · 0.5
3 7 4 7 268 7 9 4 2 3 · · ·
.
· · · 3 7 4 7 238 8 9 4 2 9 · · ·
0
0
· · · 1 1 1 1 111 4 5 4 2 9 · · ·
· · · 8 5 9 7 298 7 4 4 2 4 · · · 0 2 4 0 1 2 3 4
NMFI−CN (CN)
· · · 1 1 1 1 111 1 1 1 2 4 · · ·
··· 1 PF00043 1 PF00111
0.5 0.5
PF00028 PF00035
90 90
80 80 60 60
70 70 50 50
60 60 40 40
50 50
40 40 30 30
30 30 20 20
20 20
10 10 10 10
00 00
PF00043 PF00046
90 90
80 80 50 50
70 70 40 40
60 60
50 50 30 30
40 40
30 30 20 20
20 20 10 10
10 10
00 00
here was chosen x = 0.9. Perhaps the gap effect can the wider range of 50-400. The list is given in Table II.
be dampened by stricter definition of sequence unique- We here kept the reweighting level at x = 0.8 as in [2],
ness? In Fig. 14 we show another set of TP rates, but while the TP rates were again calculated using the cutoff
now for x = 0.75. We also include results for NMFI run 8.5Å. The pseudocount strength for NMFI was varied in
on alignment files from which all sequences with more the same interval as in the main text. We did not try
than 20% gaps have been removed. The best regulariza- to optimize the PLM regularization parameters for this
tion choice for each method turned out the same as in trial, but merely used λh = λJ = 0.01 as determined for
Fig. 10: λ = 2.3 · Bef f for NMFI-CN and λJ = 0.01 for the smaller families in the main text.
PLM. Overall, PLM maintains the same advantage over
NMFI-CN it had in Fig. 10. Removing gappy sequences
seems to trim down more TP rates than it raises, prob-
ably since useful information in the nongappy parts is Figure 15 shows qualitatively the same behavior as in
discarded unnecessarily. the smaller set of families: TP rates increase partly from
DI CN
changing from the Sij score to the Sij score, and partly
from changing from NMFI to PLM. Our positive results
PF00076 PF00041 PF00027 PF00107
thus do not seem to be particular to short-length families.
N=70, Beff=8298 N=85, Beff=7718 N=91, Beff=7082 N=130, Beff=7050
1.0 1.0 1.0 1.0
0.9 0.9 0.9 0.9
0.8 0.8 0.8 0.8
0.7 0.7 0.7 0.7
0.6 0.6 0.6 0.6
0.5
1 10 100
0.5
1 10 100
0.5
1 10 100
0.5
1 10 100 Apart from the average TP rate for each value of p
PF00028
N=93, B =5474
PF00043
N=95, B =3467
PF00084
N=56, B =3176
PF00111
N=78, B =3110
(p’th strongest predicted interactions) one can also eval-
eff eff eff eff
1.0 1.0 1.0 1.0 uate performance by different criteria. In this larger sur-
0.9 0.9 0.9 0.9
0.8 0.8 0.8 0.8 vey we investigated the distribution of values of p such
0.7 0.7 0.7 0.7
0.6 0.6 0.6 0.6 that the TP rate in a family is one. Fig. 16 shows the
0.5 0.5 0.5 0.5
1 10 100 1 10 100 1 10 100 1 10 100 histograms of the number of families for which the top
PF00011 PF00013 PF00018 PF00035
N=102, B =2367
eff
N=58, B =2095
eff
N=48, B =2030
eff
N=67, B =1645
eff
p predictions are correct, clearly showing that the differ-
1.0
0.9
1.0
0.9
1.0
0.9
1.0
0.9
ence between PLM and NMFI (using the two scores) pri-
0.8
0.7
0.8
0.7
0.8
0.7
0.8
0.7 marily occurs at the high end. The difference in average
0.6
0.5
0.6
0.5
0.6
0.5
0.6
0.5 performance between PLM and NMFI at least partially
1 10 100 1 10 100 1 10 100 1 10 100
PF00014 PF00046 PF00017 PF00081
stems from PLM getting more strongest contact predic-
N=53, B =1319
eff
N=57, B =1300
eff
N=77, B =1133
eff
N=82, B =645
eff tions with 100% accuracy.
1.0 1.0 1.0 1.0
0.9 0.9 0.9 0.9
0.8 0.8 0.8 0.8
0.7 0.7 0.7 0.7
0.6 0.6 0.6 0.6
0.5 0.5 0.5 0.5
1 10 100 1 10 100 1 10 100 1 10 100
PF00105
N=70, B =644
eff
1.0
0.9 PLM (75% rew.)
0.8
0.7 NMFI−CN (75% rew.)
0.6 NMFI−CN (90% rew., no >20%−gap alignments)
0.5
1 10 100
1
PLM
NMFI−DI
NMFI−CN
0.9
families in our study. The y-axes are TP rates and x-axes are 0.8
the number of predicted contacts p, based on pairs with |j −
i| > 4. Two curves are included for reweighting margin x =
0.75, and one for reweighting margin x = 0.9 after deletion 0.7
of all sequences with more than 20% gaps. Results for each
method correspond to the regularization level yielding highest 0.6
TP rates across all families (λ = 2.3 · Bef f for both NMFI-CN
and λJ = 0.01 for PLM).
0.5
1 10 100
Number of predicted pairs
10
PLM
9 NMFI−DI
NMFI−CN
8
Number of families
6
0
5 13 21 29 37 45 53 61 69 77
Number of pairs
TABLE II. Domain families included in our extended study, listed with Pfam ID, length N , number of sequences B (after
removal of duplicate sequences), number of effective sequences Bef f (under x = 0.8, i.e., 80% threshold for reweighting), and
the PDB and UniProt specifications for the structure used to access the DCA prediction quality.