DCA_PM_PLM

Download as pdf or txt
Download as pdf or txt
You are on page 1of 19

Improved contact prediction in proteins:

Using pseudolikelihoods to infer Potts models


Magnus Ekeberg1 , Cecilia Lövkvist3 ,Yueheng Lan5 , Martin Weigt6 , Erik Aurell2,3,4,†∗
(Dated: January 15, 2013)
Spatially proximate amino acids in a protein tend to coevolve. A protein’s 3D structure hence
leaves an echo of correlations in the evolutionary record. Reverse engineering 3D structures from
such correlations is an open problem in structural biology, pursued with increasing vigor as more and
more protein sequences continue to fill the data banks. Within this task lies a statistical inference
problem, rooted in the following: correlation between two sites in a protein sequence can arise from
firsthand interaction, but can also be network-propagated via intermediate sites; observed correlation
is not enough to guarantee proximity. To separate direct from indirect interactions is an instance
arXiv:1211.1281v2 [q-bio.QM] 12 Jan 2013

of the general problem of inverse statistical mechanics, where the task is to learn model parameters
(fields, couplings) from observables (magnetizations, correlations, samples) in large systems. In
the context of protein sequences, the approach has been referred to as direct-coupling analysis.
Here we show that the pseudolikelihood method, applied to 21-state Potts models describing the
statistical properties of families of evolutionarily related proteins, significantly outperforms existing
approaches to the direct-coupling analysis, the latter being based on standard mean-field techniques.
This improved performance also relies on a modified score for the coupling strength. The results
are verified using known crystal structures of specific sequence instances of various protein families.
Code implementing the new method can be found at http://plmdca.csc.kth.se/.

PACS numbers: 02.50.Tt – Inference methods, 87.10.Vg – Biological information, 87.15.Qt – Sequence
analysis, 87.14.E- – Proteins

I. INTRODUCTION actions mediated via multi-step paths across other ele-


ments. Correlations are, in general, straightforward to
In biology, new and refined experimental techniques compute from raw data, whereas parameters describing
have triggered a rapid increase in data availability during the true causal ties are not. The network of direct in-
the last few years. Such progress needs to be accompa- teractions can be thought of as hidden beneath observ-
nied by the development of appropriate statistical tools able correlations, and untwisting it is a task of inher-
to treat growing data sets. An example of a branch un- ent intricacy. In PSP, using mathematical means to
dergoing intense growth in the amount of existing data dispose of the network-mediated correlations observable
is protein structure prediction (PSP), which, due to the in alignments of evolutionarily related (and structurally
strong relation between a protein’s structure and its func- conserved) proteins is necessary to get optimal results
tion, is one central topic in biology. As we shall see, one [1, 2, 7–9] and yields improvements worth the computa-
can accurately estimate the 3D structure of a protein by tional strain put on the analysis. This approach to PSP,
identifying which amino-acid positions in its chain are which we refer to as direct-coupling analysis (DCA), is
statistically coupled over evolutionary time scales [1–5]. the focus of this paper.
Much of the experimental output is today readily acces- In a more general setting, the problem of inferring
sible through public databases such as Pfam [6], which interactions from observations of instances amounts to
collects over 13,000 families of evolutionarily related pro- inverse statistical mechanics, a field which has been
tein domains, the largest of them containing more than intensively pursued in statistical physics over the last
2 × 105 different amino-acid sequences. Such databases decade [10–23]. Similar tasks were formulated earlier in
allow researchers to easily access data, to extract infor- statistics and machine learning, where they have been
mation from it, and to confront their results. called model learning and inference [24–27]. To illustrate
A recurring difficulty when dealing with interacting this concretely, let us start from an Ising model,
systems is distinguishing direct interactions from inter-  
N
1 X X
P (σ1 , . . . , σN ) = exp  hi σi + Jij σi σj 
Z i=1 1≤i<j≤N
∗ 1 Engineering Physics Program, KTH Royal Institute of Tech- (1)
nology, 100 44 Stockholm, Sweden, 2 ACCESS Linnaeus Cen- and its magnetizations mi = ∂hi log Z and connected cor-
ter, KTH, Sweden, 3 Department of Computational Biology, Al- relations cij = ∂Jij log Z − mi mj . Counting the number
baNova University Center, 106 91 Stockholm, Sweden,4 Aalto
of observables (mi and cij ) and the number of parameters
University School of Science, Helsinki, Finland, 5 Department
of Physics, Tsinghua University, Beijing 100084, P. R. China, (hi and Jij ) it is clear that perfect knowledge of the mag-
6 Université Pierre et Marie Curie, UMR7238 – Laboratoire de netizations and correlations should suffice to determine
Génomique des Microorganismes, 15 rue de l’Ecole de Médecine, the external fields and the couplings exactly. It is, how-
75006 Paris, France,† [email protected] ever, also clear that such a process must be computation-
2

ally expensive, since it requires the computation of the The paper is structured as follows: in Section II, we
partition function Z for an arbitrary set of parameters. review the ideas underlying PSP and DCA and explain
The exact but iterative procedure known as Boltzmann the biological hypotheses linking protein 3D structure to
machines [28] does in fact work on small systems, but it correlations in amino-acid sequences. We also review ear-
is out of question for the problem sizes considered in this lier approaches to DCA. In Section III, we describe the
paper. On the other hand, the mean-field equations of Potts model in the context of DCA and the properties
(1) read [29–31]: of exponential families. We further detail a maximum-
X likelihood (ML) approach as brought to bear on the in-
tanh−1 mi = hi + Jij mj . (2) verse Potts problem and discuss in more detail why this
j is impractical for realistic system sizes, and we intro-
duce, similarly to (3) above, the inverse Potts mean-field
From (2) and the fluctuation-dissipation relations an model algorithm for the DCA (mfDCA) and a pseudolike-
equation can be derived connecting the coupling coef- lihood maximization procedure (plmDCA). This section
ficients Jij and the correlation matrix c = (cij ) [10]: also covers algorithmic details of both models such as reg-
ularization and sequence reweighting. A further impor-
Jij = −(c−1 )ij . (3) tant issue is the selection of an interaction score, which
reduces coupling matrices to a scalar score, and allows for
Equations (2) and (3) exemplify typical aspects of in- ranking of couplings according to their ’strength’. In Sec-
verse statistical mechanics, and inference in large sys-
tion IV, we present results from prediction experiments
tems in general. On one hand, the parameter recon-
using mfDCA and plmDCA assessed against known crys-
struction using these two equations is not exact. It is tal structures. In Section V, we summarize our findings,
only approximate, because the mean-field equations (2)
put their implications into context, and discuss possible
are themselves only approximate. It also demands rea- future developments. The appendixes contain additional
sonably good sampling, as the matrix of correlations is
material supporting the main text.
not invertible unless it is of full rank, and small noise
on its O(N 2 ) entries may result in large errors in esti-
mating the Jij . On the other hand, this method is fast,
as fast as inverting a matrix, because one does not need II. PROTEIN STRUCTURE PREDICTION AND
DIRECT-COUPLING ANALYSIS
to compute Z. Except for mean-field methods as in (2),
approximate methods recently used to solve the inverse
Ising problem can be grouped as expansion in correla- Proteins are essential players in almost all biological
tions and clusters [16, 19], methods based on the Bethe processes. Primarily, proteins are linear chains, each site
approximation [17, 18, 20–22], and the pseudolikelihood being occupied by 1 out of 20 different amino acids. Their
method [23, 27]. function relies, however, on the protein fold, which refers
For PSP, it is not the Ising model but a 21-state Potts to the 3D conformation into which the amino-acid chain
model which is pertinent [1]: The model shall be learned curls. This fold guarantees, e.g., that the right amino
such that it represents the statistical features of large acids are exposed in the right positions at the protein
multiple sequence alignments of homologous (evolution- surface, thus forming biochemically active sites, or that
arily related) proteins, and to reproduce the statistics of the correct pairs of amino acids are brought into contact
correlated amino acid substitutions. This can be done to keep the fold thermodynamically stable.
with the Potts equivalent of Eq. (1), i.e. using a model Experimentally determining the fold, using e.g. x-ray
with pairwise interactions. As will be detailed below, crystallography or NMR tomography, is still a rather
strong interactions can be interpreted as indicators for costly and time-consuming process. On the other hand,
spatial vicinity of residues in the three-dimensional pro- every newly sequenced genome results in a large num-
tein fold, even if residues are well separated along the ber of newly predicted proteins. The number of se-
sequence – thus linking evolutionary sequence statistics quenced organisms now exceeds 3, 700, and continues
with protein structure. But which of all the inference to grow exponentially (genomesonline.org [32]). The
methods in inverse statistical mechanics, machine learn- most prominent database for protein sequences, Uniprot
ing or statistics is most suitable for treating real protein (uniprot.org [33]), lists about 25,000,000 different pro-
sequence data? How do the test results obtained for inde- tein sequences, whereas the number of experimentally
pendently generated equilibrium configurations of Potts determined protein structures is only around 85,000
models translate to the case of protein sequences, which (pdb.org [34]).
are neither independent nor equilibrium configurations However, the situation of structural biology is not as
of any well-defined statistical-physics model? The main hopeless as these numbers might suggest. First, proteins
goal of this paper is to move towards an answer to this have a modular architecture; they can be subdivided into
question by showing that the pseudolikelihood method is domains which, to a certain extent, fold and evolve as
a very powerful means to perform DCA, with a predic- units. Second, domains of a common evolutionary ori-
tion accuracy considerably out-performing methods pre- gin, i.e., so-called homologous domains, are expected to
viously assessed. almost share their 3D structure and to have related func-
3

ties of a correlation-based route to protein sequence anal-


ysis, and these authors also outline a maximum-entropy
approach to get at direct interactions. Weigt et al. [1]
successfully executed this program subsequently called
direct-coupling analysis: the accuracy in predicting con-
tacts strongly increases when direct interactions are used
instead of raw correlations.
To computationally solve the task of inferring inter-
actions in a Potts model, [1] employed a generalization
of the iterative message-passing algorithm susceptibility
propagation previously developed for the inverse Ising
problem [17]. Methods in this class are expected to out-
perform mean-field based reconstruction methods similar
FIG. 1. (Color online) Left panel: small MSA with two posi- to (3) if the underlying graph of direct interactions is lo-
tions of correlated amino-acid occupancy. Right panel: hypo- cally close to tree-like, an assumption which may or may
thetical corresponding spatial conformation, bringing the two
not be true in a given application such as PSP. A sub-
correlated positions into direct contact.
stantial draw-back of susceptibility propagation as used
in [1] is that it requires a rather large amount of auxiliary
variables, and that DCA could therefore only be carried
tion. They can therefore be collected in protein domain out on protein sequences that are not too long. In [2],
families: the Pfam database (pfam.sanger.ac.uk [6]) this obstacle was overcome by using instead a simpler
lists almost 14,000 different domain families, and the mean-field method, i.e., the generalization of (3) to a 21-
number of sequences collected in each single family ranges state Potts model. As discussed in [2], this broadens the
roughly from 102 to 105 . In particular the larger fami- reach of the DCA to practically all families currently in
lies, with more than 1,000 members, are of interest to Pfam. It improves the computational speed by a factor
us, as we argue that their natural sequence variability of about 103 –104 , and it appears also to be more accurate
contains important statistical information about the 3D than the susceptibility-propagation based method of [1]
structure of its member proteins, and can be exploited to in predicting contact pairs. The reason behind this third
successfully address the PSP problem. advantage of mean-field over susceptibility propagation
Two types of data accessible via the Pfam database as an approximate method of DCA is unknown at this
are especially important to us. The first is the multiple time.
sequence alignment (MSA), a table of the amino acid Pseudolikelihood maximization (PLM) is an alternative
sequences of all the protein domains in the family lined method developed in mathematical statistics to approxi-
up to be as similar as possible. A (small and illustrative) mate maximum-likelihood inference, which breaks down
example is shown in Fig. 1 (left panel). Normally, not all the a priori exponential time complexity of computing
members of a family can be lined up perfectly, and the partition functions in exponential families [39]. On the
alignment therefore contains both amino acids and gaps. inverse Ising problem it was first used by Ravikumar et
At some positions, an alignment will be highly specific al. [27], albeit in the context of graph sign-sparsity recon-
(cf. the second, fully conserved column in Fig. 1), while struction; two of us showed recently that it outperforms
at others it will be more variable. The second data type many other approximate inverse Ising schemes on the
concerns the crystal structure of one or several members Sherrington-Kirkpatrick model, and in several other ex-
of a family. Not all families provide this second type of amples [23]. Although this paper reports the first use of
data. We discuss its use for an a posteriori assessment of the PLM method in DCA, the idea of using pseudolikeli-
our inference results in detail in Sec. IV. hoods for PSP is not completely novel. Balakrishnan et
The Potts-model based inference uses only the first al. [8] devised a version of this idea, but using a set-up
data type, i.e., the sequence data. Small spatial separa- rather different from that in [2], regarding, for example,
tion between amino acids in a protein, cf. the right panel what portions of the data bases and which measures of
of Fig. 1, encourages co-occurrence of favorable amino- prediction accuracy were used, and also, not couched in
acid combinations, cf. the left panel of Fig. 1. This spices the language of inverse statistical mechanics. While a
the sequence record with correlations, which can be re- competitive evaluation between [2] and [8] is still open,
liably determined in sufficiently large MSAs. However, we have not attempted such a comparison in this work.
the use of such correlations for predicting 3D contacts as Other ways of deducing direct interactions in PSP, not
a first step to solve the PSP problem remained of limited motivated from the Potts model but in somewhat simi-
success [35–37], since they can be induced both by direct lar probabilistic settings, have been suggested in the last
interactions (amino acid A is close to amino acid B), and few years. A fast method utilizing Bayesian networks
by indirect interactions (amino acids A and B are both was provided by Burger and van Nimwegen [7]. More
close to amino acid C). Lapedes et al. [38] were the first recently, Jones et al. [9] introduced a procedure called
to address, in a purely theoretical setting, these ambigui- PSICOV (Protein Sparse Inverse COVariance). While
4

DCA and PSICOV both appear capable of outperform- It is immediate that the probabilistic model which max-
ing the Bayesian network approach [2, 9], their relative imizes the entropy while satisfying Eq. (7) must take
efficiency is currently open to investigation, and is not the Potts model form. Finding a Potts model which
assessed in this work. matches empirical frequencies and correlations is there-
Finally, predicting amino acid contacts is not only fore referred to as a maximum-entropy inference [43, 44],
a goal in itself, but also a means to assemble protein cf. also [1, 38] for a formulation in terms of protein-
complexes [40, 41] and to predict full 3D protein struc- sequence modeling.
tures [3, 4, 42]. Such tasks require additional work, using On the other hand, we can use Eq. (6) as a varia-
the DCA results as input, and are outside the scope of tional ansatz and maximize the probability of the input
the present paper. data set {σ (b) }B
b=1 with respect to the model parameters
hi (σ) and Jij (σ, σ ′ ); this maximum-likelihood perspective
is used in the following. We note that the Ising and the
III. METHOD DEVELOPMENT Potts models (and most models which are normally con-
sidered in statistical mechanics) are examples of expo-
A. The Potts model nential families, and have the property that means and
correlations are sufficient statistics [45–47]. Given un-
Let σ = (σ1 , σ2 , · · · , σN ) represent the amino acid se- limited computing power to determine Z, reconstruction
quence of a domain with length N . Each σi takes on can not be done better using all the data compared to us-
values in {1, 2, ..., q}, with q = 21: one state for each of ing only (empirical) means and (empirical) correlations.
the 20 naturally occurring amino acids and one additional It is only when one cannot compute Z exactly and has
state to represent gaps. Thus, an MSA with B aligned to resort to approximate methods, that using directly all
sequences from a domain family can be written as an in- the data can bring any advantage.
teger array {σ (b) }Bb=1 , with one row per sequence and one
column per chain position. Given an MSA, the empirical
individual and pairwise frequencies can be calculated as
B
B. Model parameters and gauge invariance
1 X (b)
fi (k) = δ(σi , k),
B The total number of parameters of Eq, (6) is N q +
b=1
B N (N −1) 2
1 X (b) (b) 2 q ,but, in fact, the model as it stands is over-
fij (k, l) = δ(σi , k) δ(σj , l), (4) parameterized in the sense that distinct parameter sets
B
b=1 can describe the same probability distribution. It is easy
where δ(a, b) is the Kronecker symbol taking value 1 if to see that the number of nonredundant parameters is
both arguments are equal, and 0 otherwise. fi (k) is hence N (q − 1) + N (N2−1) (q − 1)2 , cf. an Ising model (q = 2),
the fraction of sequences for which the entry on position which has N (N2+1) parameters if written as in Eq. (1)
i is amino acid k, gaps counted as a 21st amino acid. but would have 2N 2 parameters if written in the form of
Similarly, fij (k, l) is the fraction of sequences in which Eq. (6).
the position pair (i, j) holds the amino acid combination A gauge choice for the Potts model, which eliminates
(k, l). Connected correlations are given as the overparametrization in a similar manner as in the
cij (k, l) = fij (k, l) − fi (k) fj (l). (5) Ising model (and reduces to that case for q = 2), is

A (generalized) Potts model is the simplest probabilis- q


X q
X q
X
tic model P (σ) which can reproduce the empirically ob- Jij (k, s) = Jij (s, l) = hi (s) = 0, (8)
served fi (k) and fij (k, l). In analogy to (1) it is defined s=1 s=1 s=1
as
for all i, j(> i), k, and l. Alternatively, we can choose a
 
N
1 X X
gauge where one index, say i = q, is special, and measure
P (σ) = exp  hi (σi ) + Jij (σi , σj ) , (6)
Z all interaction energies with respect to this value, i.e.,
i=1 1≤i<j≤N

in which hi (σi ) and Jij (σi , σj ) are parameters to be de- Jij (q, l) = Jij (k, q) = hi (q) = 0, (9)
termined through the constraints
X
P (σi = k) = P (σ) = fi (k), for all i, j(> i), k, and l, cf. [2]. This gauge choice
σ corresponds to a lattice gas model with q − 1 different
σi =k
X particle types, and a maximum occupation number one.
P (σi = k, σj = l) = P (σ) = fij (k, l). (7) Using either (8) or (9) reconstruction is well-defined,
σ
σj =l
and it is straight-forward to translate results obtained in
σi =k one gauge to the other.
5

C. The inverse Potts problem E. Pseudolikelihood maximization

Given a set of independent equilibrium configurations Pseudolikelihood substitutes the probability in (10) by
{σ (b) }B
b=1 of the model Eq. (6), the ordinary statistical the conditional probability of observing one variable σr
approach to inferring parameters {h, J} would be to let given observations of all the other variables σ \r . That
those parameters maximize the likelihood (i.e., the prob- is, the starting point is
ability of generating the data set for a given set of pa- (b)
rameters). This is equivalent to minimizing the (rescaled) P (σr = σr(b) |σ \r = σ \r )
negative log-likelihood function  
N
(b) P (b) (b)
B
1 X exp hr (σr ) + Jri (σr , σi )
l=− log P (σ (b) ). (10) i=1
i6=r
B =  , (14)
b=1 
N
Pq (b)
For the Potts model (6), this becomes
P
l=1 exp hr (l) + Jri (l, σi )
q
N X i=1
X i6=r

l(h, J) = log Z − fi (k)hi (k) (11)


i=1 k=1
where, for notational convenience, we take Jri (l, k) to
q mean Jir (k, l) when i < r. Given an MSA, we can maxi-
mize the conditional likelihood by minimizing
X X
− fij (k, l)Jij (k, l).
1≤i<j≤N k,l=1 B
1 X h
(b)
i
l is differentiable, so minimizing it means looking for a gr (hr , Jr ) = − log P{hr ,Jr } (σr = σr(b) |σ \r = σ \r ) .
point at which ∂hi (k) l = 0 and ∂Jij (k,l) l = 0. Hence, ML B
b=1
estimates will satisfy (15)
Note that this only depends on hr and Jr = {Jir }i6=r ,
∂hi (k) log Z − fi (k) = 0,
i.e., on the parameters featuring node r. If (15) is used
∂Jij (k,l) log Z − fij (k, l) = 0. (12) for all r this leads to unique values for the parameters hr
To achieve this minimization computationally, we need to but typically different predictions for Jrq and Jqr (which
be able to calculate the partition function Z of Eq. (6) for should be the same). Minimizing (15) must therefore be
any realization of the parameters {h, J}. This problem supplemented by some procedure on how to reconcile dif-
is computationally intractable for any reasonable system ferent values of Jrq and Jqr ; one way would be to simply
size. Approximate minimization is essential, and we will use their average 12 (Jrq + JTqr ) [27].
show that even relatively simple approximation schemes We here reconcile different Jrq and Jqr by minimizing
lead to accurate PSP results. an objective function adding gr for all nodes:
N
X
D. Naive mean-field inversion
lpseudo (h, J) = gr (hr , Jr ) (16)
r=1
N B
The mfDCA algorithm in [2] is based on the simplest 1 XX h
(b)
i
=− log P{hr ,Jr } (σr = σr(b) |σ \r = σ \r ) .
and computationally most efficient approximation, i.e., B r=1
b=1
naive mean-field inversion (NMFI). It starts from the
proper generalization of (2), cf. [48], and then uses lin- Minimizers of lpseudo generally do not minimize l; the re-
ear response: The J’s in the lattice-gas gauge Eq. (9) placement of likelihood with pseudolikelihood alters the
become: outcome. Note, however, that replacing l by lpseudo re-
solves the computational intractability of the parameter
N MF I
Jij (k, l) = −(C−1 )ab , (13) optimization problem: instead of depending on the full
where a = (q − 1)(i − 1) + k and b = (q − 1)(j − 1) + l. partition function, the normalization of the conditional
The matrix C is the N (q − 1) × N (q − 1) covariance ma- probability (14) contains only a single summation over
trix assembled by joining the N (q − 1) × N (q − 1) values the q = 21 Potts states. The intractable average over
cij (k, l) as defined in Eq. (5), but leaving out the last the N − 1 conditioning spin variables is replaced by an
state q. In Eq. (13), i, j ∈ {1, ..., N } are site indices, empirical average over the data set in Eq. (16).
and k, l run from 1 to q − 1. Under gauge Eq. (9), all
the other coupling parameters are zero. The term ’naive’
has become customary in the inverse statistical mechan- F. Regularization
ics literature, often used to highlight the difference to a
Thouless-Anderson-Palmer level inversion or one based A Potts model describing a protein family with se-
on the Bethe approximation. The original meaning of quences of 50-300 amino acids requires ca. 5 · 105 to
this term lies, as far as we are aware, in information ge- 2 · 107 parameters. At present, few protein families are in
ometry [49, 50]. this range in size, and regularization is therefore needed
6

to avoid overfitting. In NMFI, the problem results in an G. Sequence reweighting


empirical covariance matrix which typically is not of full
rank, and Eq. (13) is not well-defined. In [2], one of the Maximum-likelihood inference of Potts models relies
authors therefore used the pseudocount method where – as discussed above – on the assumption that the B
frequencies and empirical correlations are adjusted using sample configurations in our data set are independently
a regularization variable λ: generated from Eq. (6). This assumption is not correct
" # for biological sequence data, which have a phylogenetic
B
1 λ X (b) bias. In particular, in the databases there are many pro-
fi (k) = + δ(σi , k) , (17) tein sequences from related species, which did not have
λ+B q
b=1 enough time of independent evolution to reach statistical
" B
#
1 λ X (b) (b) independence. Furthermore, the selection of sequenced
fij (k, l) = + δ(σi , k) δ(σj , l) . species in the genomic databases is dictated by human
λ+B q2
b=1 interest, and not by the aim to have an as independent
as possible sampling in the space of all functional amino-
The pseudocount is a proxy for many observations, which acid sequences. A way to mitigate effects of uneven sam-
– if they existed – would increase the rank of the corre- pling, employed in [2], is to equip each sequence σ (b) with
lation matrix; the pseudocount method hence promotes a weight wb which regulates its impact on the parameter
invertibility of the matrix in Eq. (13). It was observed in estimates. Sequences deemed unworthy of independent-
[2] that for good performance in DCA, the pseudocount sample status (too similar to other sequences) can then
parameter λ has to be taken fairly large, on the order of have their weight lowered, whereas sequences which are
B. quite different from all other sequences will contribute
In the PLM method, we take the standard route of with a higher weight to the amino-acid statistics.
adding a penalty term to the objective function: A simple but efficient way (cf. [2]) is to measure
the similarity sim(σ (a) , σ (b) ) of two sequences σ (a) and
{hP LM , JP LM } = argmin{lpseudo (h, J)+R(h, J)}. (18) σ (b) as the fraction of conserved positions (i.e., identical
{h,J}
amino acids), and compare this fraction to a preselected
threshold x , 0 < x < 1. The weight given to a sequence
The turnout is then a trade-off between likelihood σ (b) can then be set to wb = m1b , where mb is the number
maximization and whatever qualities R is pushing for.
of sequences in the MSA similar to σ (b) :
Ravikumar et al. [27] pioneered the use of l1 regulariz-
ers for the inverse Ising problem, which forces a fraction
mb = |{a ∈ {1, ..., B} : sim(σ (a) , σ (b) ) ≥ x}|. (20)
of parameters to assume value zero, thus effectively re-
ducing the number of parameters. This approach is not
appropriate here since we are concerned with the accu- In [2], a suitable threshold x was found to be 0.8, re-
sults only weakly dependent on this choice throughout
racy (resp. divergence) of the strongest predicted cou-
plings; for our purposes it makes no substantial differ- 0.7 < x < 0.9. We have here followed the same procedure
using threshold x = 0.9. The corresponding reweighted
ence whether weak couplings are inferred to be small or
set precisely to 0. Our choice for R is therefore the sim- frequency counts then become
pler l2 norm " B
#
1 λ X (b)
fi (k) = + wb δ(σi , k) , (21)
N λ + Bef f q
X X b=1
Rl2 (h, J) = λh ||hr ||22 + λJ ||Jij ||22 , (19) " B
#
r=1 1≤i<j≤N 1 λ X (b) (b)
fij (k, l) = + wb δ(σi , k) δ(σj , l) ,
λ + Bef f q 2
b=1
using two regularization parameters λh and λJ for field
and coupling parameters. An advantage of a regular- PB
where Bef f = b=1 wb becomes a measure of the number
izer is that it eliminates the need to fix a gauge, since of effectively nonredundant sequences.
among all parameter sets related by a gauge transforma- In the pseudolikelihood we use the direct analog of
tion, i.e., all parameter sets resulting in the same Potts Eq. (21), i.e.,
model, there will be exactly one set which minimizes
the strictly convex regularizer. For the case of the l2 lpseudo (h, J) (22)
norm, it can be shown thatPthis leads to a gauge where
Pq B N
Jij (k, s) = λλJh hi (k), qs=1 Jij (s, l) = λλJh hj (l), and 1 X X h
(b)
i
Pq s=1 =− wb log P{hr ,Jr } (σr = σr(b) |σ \r = σ \r ) .
s=1 hi (s) = 0 for all i, j(> i), k, and l. Bef f
b=1 r=1
To summarize this discussion: For NMFI, we regu-
larize with pseudocounts under the gauge constraints As in the frequency counts, each sequence is considered
Eq. (9). For PLM, we regularize with Rl2 under the full to contribute a weight wb , instead of the standard weight
parametrization. one used in i.i.d. samples.
7


H. Interaction scores (after altering the fields appropriately) and that Jij (k, l)
satisfy (8). A possible Frobenius norm score is hence
In the inverse Ising problem, each interaction is scored v
u q
uX
by one scalar coupling strength Jij . These can easily
be ordered, e.g. by absolute size. In the inverse Potts
FN
Sij = kJ′ij k2 = t ′ (k, l)2 .
Jij (27)
k,l=1
problem, each position pair (i, j) is characterized by a
whole q×q matrix Jij , and some scalar score Sij is needed
Lastly, we borrow an idea from Jones et al. [9], whose
in order to evaluate the ‘coupling strength’ of two sites.
PSICOV method also uses a norm rank (1-norm instead
In [1] and [2] the score used is based on direct infor-
of Frobenius norm), but adjusted by an average-product
mation (DI), i.e., the mutual information of a restricted
correction (APC) term. APC was introduced in [51] to
probability model not including any indirect coupling ef-
suppress effects from phylogenetic bias and insufficient
fects between the two positions to be scored. Its con-
sampling. Incorporating also this correction, we have
struction goes as follows: For each position pair (i, j),
our scoring function
(the estimate of) Jij is used to set up a ’direct distribu-
tion’ involving only nodes i and j, S·jF N Si·F N
CN FN
Sij = Sij − , (28)
(dir) S··F N
(k, l) ∼ exp Jij (k, l) + h′i,k + h′j,l .

Pij (23)
where CN stands for ‘corrected norm’. An imple-
h′i,k and h′j,l are new fields, computed as to ensure agree- mentation of plmDCA in MATLAB is available at
ment of the marginal single-site distributions with the http://plmdca.csc.kth.se/.
empirical individual frequency counts fi (k) and fj (l).
DI
The score Sij is now calculated as the mutual infor-
mation of P (dir)
: IV. EVALUATING THE PERFORMANCE OF
MFDCA AND PLMDCA ACROSS PROTEIN
q (dir) FAMILIES
!
DI
X (dir) Pij (k, l)
Sij = Pij (k, l) log . (24)
fi (k) fj (l)
k,l=1 We have performed numerical experiments using
mfDCA and plmDCA on a number of domain families
DI
A nice characteristic of Sij is its invariance with re- from the Pfam database; here we report and discuss the
spect to the gauge freedom of the Potts model, i.e., both results.
choices Eqs. (8) and (9) (or any other valid choice) gen-
DI
erate identical Sij .
In the pseudolikelihood approach, we prefer not to use A. Domain families, native structures, and
DI true-positive rates
Sij , as this would require a pseudocount λ to regularize
the frequencies in (24), introducing a third regulariza-
tion variable in addition to λh and λJ . Another possible The speed of mfDCA enabled Morcos et al. [2] to
scoring function, already mentioned but not used in [1], conduct a large-scale analysis using 131 families. PLM
is the Frobenius norm (FN) is computationally more demanding than NMFI, so we
v chose to start with a smaller collection of 17 families,
u q
uX listed in Table I. To ease the numerical effort, we chose
kJij k2 = t Jij (k, l)2 . (25) families with relatively small N and B. More precisely,
k,l=1 we selected families out of the first 115 Pfam entries (low
Pfam ID), which have (i) at most N = 130 residues, (ii)
DI between 2,000 and 22,000 sequences, and (iii) reliable
Unlike Sij , (25) is not independent of gauge choice, so
one must be a bit careful. As was noted in [1], the zero structural information (cf. the PDB entries provided in
sum gauge (8) minimizes the Frobenius norm, in a sense the table). No selection based on DCA performance was
making (8) the most appropriate gauge choice for the done. In the appendix, a number of longer proteins is
score (25). Recall from above that our pseudolikelihood studied. The mfDCA performance on the selected fami-
uses the full representation and fixes the gauge by the lies was found to be coherent with the larger data set of
regularization terms Rl2 . Our procedure is therefore to Morcos et al..
first infer the interaction parameters using the pseudo- To reliably assess how good a contact prediction is,
likelihood and the regularization, and then to change to something to regard as a ’gold standard’ is helpful. For
the zero-sum gauge: each of the 17 families we have therefore selected one rep-
resentative high-resolution X-ray crystal structure (reso-

Jij (k, l) = Jij (k, l) − Jij (·, l) − Jij (k, ·) + Jij (·, ·), (26) lution below 3Å), see Table I for the corresponding PDB
identification.
where ‘·’ denotes average over the concerned position. From these native protein structures, we have ex-
One can show that (26) preserves the probabilities of (6) tracted position-position distances d(i, j) for each pair of
8

Family ID N B Bef f (90%) PDB ID UniProt entry UniProt residues


PF00011 102 5024 3481 2bol TSP36 TAESA 106-206
PF00013 58 6059 3785 1wvn PCBP1 HUMAN 281-343
PF00014 53 2393 1812 5pti BPT1 BOVIN 39-91
PF00017 77 2732 1741 1o47 SRC HUMAN 151-233
PF00018 48 5073 3354 2hda YES HUMAN 97-144
PF00027 91 12129 9036 3fhi KAP0 BOVIN 154-238
PF00028 93 12628 8317 2o72 CADH1 HUMAN 267-366
PF00035 67 3093 2254 1o0w RNC THEMA 169-235
PF00041 85 15551 10631 1bqu IL6RB HUMAN 223-311
PF00043 95 6818 5141 6gsu GSTM1 RAT 104-192
PF00046 57 7372 3314 2vi6 NANOG MOUSE 97-153
PF00076 70 21125 14125 1g2e ELAV4 HUMAN 48-118
PF00081 82 3229 1510 3bfr SODM YEAST 27-115
PF00084 56 5831 4345 1elv C1S HUMAN 359-421
PF00105 70 2549 1277 1gdc GCR RAT 438-507
PF00107 130 17864 12114 1a71 ADH1E HORSE 203-338
PF00111 78 7848 5805 1a70 FER1 SPIOL 58-132

TABLE I. Domain families included in our study, listed with Pfam ID, length N , number of sequences B (after removal of
duplicate sequences), number of effective sequences Bef f (under x = 0.9, i.e., 90% threshold for reweighting), and the PDB
and UniProt specifications for the structure used to access the DCA prediction quality.

500
Number of occurrences

600
1000 400

400 300

500 200
200
100

0 0 0
0 8.5 20 40 0 8.5 20 40 0 8.5 20 40
Distance (Å) Distance (Å) Distance (Å)

FIG. 2. (Color online) Histograms of crystal-structure distances pooled from all 17 families, when including all pairs (left),
pairs with |j − i| > 4 (center), and pairs with |j − i| > 14 (right). The vertical line is our contact cutoff 8.5Å.

sequence positions, by measuring the minimal distance peak at 3-5Å to presumably correspond to short-range
between any two heavy atoms belonging to the amino interactions like hydrogen bonds or secondary-structure
acids present in these positions. The left panel of Fig. 2 contacts, whereas the last peak likely corresponds to
shows the distribution of these distances in all considered long-range, possibly water-mediated interactions. These
families. Three peaks protrude from the background dis- peaks contain the nontrivial information we would like
tribution: one at small distances below 1.5Å, a second at to extract from sequence data using DCA. In order to
about 3-5Å and a third at about 7-8Å. The first peak cor- accept the full second peak, we have chosen a distance
responds to the peptide bonds between sequence neigh- cutoff of 8.5Å for true contacts, slightly larger than the
bors, whereas the other two peaks correspond to non- value of 8Å used in [2].
trivial contacts between amino acids, which may be dis- Accuracy results are here reported primarily using
tant along the protein backbone, as can be seen from the true-positive (TP) rates, also the principal measurement
center and right panels of Fig. 2, which collect only dis- in [2] and [9]. The TP rate for p is the fraction of the p
tances between positions i and j with minimal separation strongest-scored pairs which are actually contacts in the
|j − i| ≥ 5 resp. |j − i| ≥ 15. Following [2], we take the crystal structure, defined as described above. To exem-
9

plify TP rates, let us jump ahead and look at Fig. 3. For


PLM and protein family PF00076, the TP rate is 1 up PF00076 PF00107 PF00041 PF00027
CN N=70, B =14125 N=130, B =12114 N=85, B =10631 N=91, B =9036
to p = 80, which means that all 80 top-Sij pairs are eff eff eff eff
1.0 1.0 1.0 1.0
genuine contacts in the crystal structure. At p = 200, 0.9
0.8
0.9
0.8
0.9
0.8
0.9
0.8
the TP rate has dropped to 0.78, so 0.78 · 200 = 156 of 0.7
0.6
0.7
0.6
0.7
0.6
0.7
0.6
CN
the top 200 top-Sij pairs are contacts, while 44 are not. 0.5
1 10 100
0.5
1 10 100
0.5
1 10 100
0.5
1 10 100
PF00028 PF00111 PF00043 PF00084
N=93, B =8317 N=78, B =5805 N=95, B =5141 N=56, B =4345
eff eff eff eff
1.0 1.0 1.0 1.0
0.9 0.9 0.9 0.9
B. Parameter settings 0.8 0.8 0.8 0.8
0.7 0.7 0.7 0.7
0.6 0.6 0.6 0.6
0.5 0.5 0.5 0.5
1 10 100 1 10 100 1 10 100 1 10 100
To set the stage for comparison, we started by running PF00013 PF00018 PF00011 PF00046
initial trials on the 17 families using both NMFI and N=58, B =3785
eff
N=48, B =3554 N=102, B =3481 N=57, B =3314
eff eff eff
1.0 1.0 1.0 1.0
PLM with many different regularization and reweighting 0.9 0.9 0.9 0.9
0.8 0.8 0.8 0.8
strengths. Reweighting indeed raised the TP rates, and, 0.7 0.7 0.7 0.7
0.6 0.6 0.6 0.6
as reported in [2] for 131 families, results seemed robust 0.5
1 10 100
0.5
1 10 100
0.5
1 10 100
0.5
1 10 100
toward the exact choice of the limit x around 0.7 ≤ x ≤ PF00035 PF00014 PF00017 PF00081
N=67, Beff=2254 N=53, Beff=1812 N=77, Beff=1741 N=82, Beff=1510
0.9. We chose x = 0.9 to use throughout the study. 1.0 1.0 1.0 1.0
In what follows, NMFI results are reported using the 0.9
0.8
0.9
0.8
0.9
0.8
0.9
0.8
same list of pseudocounts as in Fig. S11 in [2]: λ = 0.7
0.6
0.7
0.6
0.7
0.6
0.7
0.6
0.5 0.5 0.5 0.5
w·Bef f with w = {0.11, 0.25, 0.43, 0.67, 1.0, 1.5, 2.3, 4.0, 1 10 100 1 10 100 1 10 100 1 10 100

9.0}. During our analysis we also ran intermediate values, PF00105


N=70, B =1277
eff
and we found this covering to be sufficiently dense. We 1.0
0.9 PLM
give outputs from two versions of NMFI: NMFI-DI and 0.8
0.7 NMFI−DI
NMFI-DI(true). The former uses pseudocounts for all 0.6
0.5
NMFI−DI(true)

calculations, whereas the latter switches to true frequen- 1 10 100

DI
cies when it gets to the evaluations of Sij . We append
’DI’ to the NMFI name, since, later on, we will also try
CN
the Sij score for NMFI (which we will call NMFI-CN).
With l2 regularization in the PLM algorithm, out-
comes were robust against the precise choice of λh ; TP FIG. 3. (Color online) Contact-detection results for the 17
rates were almost identical when λh was changed between families, sorted by Bef f . The y-axes are TP rates and x-
0.001 and 0.1. We therefore chose λh = 0.01 for all ex- axes are the number of predicted contacts p, based on pairs
with |j − i| > 4. The three curves for each method are the
periments. What mattered, rather, was the coupling reg-
three regularization levels yielding highest TP rates across all
ularization parameters λJ , for which we did a systematic families. The thickened curve highlights the best one out of
scan from λJ = 0 and up using step-size 0.005. these three (λ = Bef f for NMFI and λJ = 0.01 for PLM).
So, to summarize, the results reported here are based
on x = 0.9, cutoff 8.5Å, λh = 0.01, and λ and λJ drawn
from collections of values as described above. In the following discussion, we leave out all results for
NMFI-DI(true) and focus on PLM vs. NMFI-DI, i.e., the
version used in [2]. All plots remaining in this section use
C. Main comparison of mfDCA and plmDCA the optimal regularization values: λ = Bef f for NMFI
and λJ = 0.01 for PLM.
Fig. 3 shows TP rates for the different families and TP rates only classify pairs as contacts (d(i, j) < 8.5Å)
methods. We see that the TP rates of plmDCA (PLM) or noncontacts (d(i, j) ≥ 8.5Å). To give a more detailed
are consistently higher than those of mfDCA (NMFI), view of how scores correlate with spatial separation, we
especially for families with large Bef f . For what con- show in Fig. 4 a scatter plot of the score vs. distance for
cerns the two NMFI versions: NMFI-DI(true) avoids the all pairs in all 17 families. PLM and NMFI-DI both man-
strong failure seen in NMFI-DI for PF00084, but for most age to detect the peaks seen in the true distance distribu-
other families, see in particular PF00014 and PF00081, tion of Fig. 2, in the sense that high scores are observed
the performance instead drops using marginals without almost exclusively at distances below 8.5Å. Both meth-
DI
pseudocounts in the Sij calculation. For both NMFI-DI ods agree that interactions get, on average, progressively
and NMFI-DI(true), the best regularization was found to weaker going from peak one, to two, to three, and finally
be λ = 1 · Bef f , the same value as used in [2]. For PLM, to the bulk. We note that the dots scatter differently
the best parameter choice was λJ = 0.01. Interestingly, across the PLM and NMFI-DI figures, reflecting the two
DI
this same regularization parameter was optimal for ba- separate scoring techniques: Sij are strictly nonnega-
sically all families. This is somewhat surprising, since tive, whereas APC corrected norms can assume negative
both N and Bef f span quite wide ranges (48-130 and values. We also observe how sparse the extracted signal
1277-14125 respectively). is: most spatially close pairs do not show elevated scores.
10

FIG. 4. (Color online) Score plotted against distance for all position pairs in all 17 families for PLM (left) and NMFI-DI (right).
The vertical line is our contact cutoff at 8.5Å.

However, from the other side, almost all strongly coupled induced by pairs which are not in contact in the X-ray
pairs are close, so the biological hypothesis of Sec. II is crystal structure used for evaluating prediction accuracy.
well supported here. The important point – and this seems to be a distinguish-
Fig. 5 shows scatter plots of scores for PLM and NMFI- ing feature of maximum-entropy models as compared to
DI for some selected families. Qualitatively the same pat- simpler correlation-based methods – is to find a sufficient
terns were observed for all families. The points are clearly number of well-distributed contacts to enable de-novo 3D
correlated, so, to some extent, PLM and NMFI-DI agree structure prediction [3–5, 40–42]. In this sense, it is more
on the interaction strengths. Due to the different scoring important to achieve high accuracy of the first predicted
schemes, we would not expect numerical coincidence of contacts, than intermediate accuracy for many predic-
scores. Many of PLM’s top-scoring position pairs have tions.
also top scores for NMFI-DI and vice versa. The largest In summary, the results suggest that the PLM method
discrepancy is in how much more strongly NMFI-DI re- offers some interesting progress compared to NMFI. How-
sponds to pairs with small |j − i|; the crosses tend to ever, let us also note that in the comparison we also had
shoot out to the right. PLM agrees that many of these to change both scoring and regularization styles. It is
neighbor pairs interact strongly, but, unlike NMFI-DI, it thus conceivable that a NMFI with the new scoring and
also shows rivaling strengths for many |j − i| > 4-pairs. regularization could be more competitive with PLM. In-
deed, upon further investigation, detailed in Appendix B,
An even more detailed picture is given by considering
we found that part of the improvement in fact does stem
contact maps, see Fig. 6. The tendency observed in the
from the new score. In Appendix C, where we extend our
last scatter plots remains: NMFI-DI has a larger por-
tion of highly scored pairs in the neighbor zone, which results to longer Pfam domains, we therefore add results
from NMFI-CN, an updated version of the code used in
are the middle stretches in these figures. An important CN DI
observation is, however, that clusters of contacting pairs [2] which scores by Sij instead of Sij .
with long 1D sequence separation are captured by both
algorithms.
Note, that only a relatively small fraction of contacts D. Run times
is uncovered by DCA, before false positives start to ap-
pear, and that many native contacts are missed. How- In general, NMFI, which is merely a matrix inversion,
ever, the aim of DCA cannot be to reproduce a com- is very quick compared with PLM; most families in this
plete contact map: It is based on sequence correlations study took only seconds to run through the NMFI code.
alone (e.g. it cannot predict contacts for 100% conserved In contrast to the message-passing based method used
residues), it does not consider any physico-chemical prop- in [1], a DCA using PLM is nevertheless feasible for all
erty of amino acids (they are seen as abstract letters), it protein families in Pfam. The objective function in PLM
does not consider the need to be embeddable in 3D. Fur- is a sum over nodes and samples and its execution time
thermore, proteins may undergo conformational changes is therefore expected to depend both on B (number of
(e.g. allostery) or homo-dimerize, so coevolution may be members of a protein family in Pfam) and N (length of
11

PF00028 PF00035

Neighbor pair: |i−j|<5

Contact: 0 ≤ true distance<8.5 (and |i−j|≥ 5) NMFI−DI


NMFI−DI PLM PLM
Non contact: true distance ≥ 8.5 (and |i−j|≥5)

90 90
1.4 1.4 80 80 60 60
PF00018 PF00028
1.2 1.2 70 70 50 50
60 60 40 40
1 1 50 50
40 40 30 30
PLM (CN)

0.8 0.8 30 30 20 20
0.6 20 20
0.6 10 10 10 10
0.4 00 00
0.4
0.2
0.2
0
0
−0.2 PF00043 PF00046
−0.2
0.1 0.2 0.3 0.4 0.1 0.2 0.3 0.4 0.5
NMFI−DI (DI)
1.4 1.4
1.2 PF00043 1.2 PF00111 NMFI−DI PLM NMFI−DI PLM
1 1
0.8 0.8
0.6 0.6 90 90
80 80 50 50
0.4 0.4 70 70 40 40
60 60
0.2 0.2 50 50 30 30
40 40
0 0 30 30 20 20
20 20 10 10
−0.2 −0.2 10 10
0.2 0.4 0.6 0.1 0.2 0.3 0.4 00 00

FIG. 5. (Color online) Scatter plots of interaction scores for FIG. 6. (Color online) Predicted contact maps for PLM (right
PLM and NMFI-DI from four families. For all plots, the axes part, black symbols) and NMFI-DI (left part, red symbols
are as indicated by the top left figure. The distance unit in online) for four families. A pair (i, j)’s placement in the plots
the top box is Å. is found by matching positions i and j on the axes. Native
contacts are indicated in gray. True and false positives are
represented by circles and crosses, respectively. Each figure
shows the 1.5N strongest ranked pairs (including neighbors)
the aligned sequences in a protein family). for that family.
On a standard desktop computer, using a basic
MATLAB-interfaced C-implementation of conjugate gra-
dient descent, the run times for PF00014, PF00017, and as large as 10N , suggesting that plmDCA can be made
PF00018 (small N and B) were 50, 160 and 90 respec- competitive not only in terms of accuracy, but also in
tively. For PF00041 (small N but larger B) one run terms of computational speed.
took 15 min. For domains with larger N , like those in Finally, all of these times were obtained cold-starting
Appendix C, run times grow approximately apace. For with all fields and couplings at 0. Presumably, one can
example, the run times for PF00026 (N = 314) and improve by using an appropriate initial guess obtained,
PF00006 (N = 215) were 80 and 65 min respectively. say, from NMFI. This has however not been implemented
A well-known alternative use of pseudolikelihoods is to here.
minimize each gr separately. While slightly more crude,
such an ’asymmetric’ variant of plmDCA would amount
to N independent (multiclass) logistic regression prob- V. DISCUSSION
lems, which would make parallel execution trivial on up
to N cores. A rigorous examination of the asymmetric In this work, we have shown that a direct-couping anal-
version’s performance is beyond the scope of the present ysis built on pseudolikelihood maximization (plmDCA)
work, but initial tests suggest that even on a single pro- consistently outperforms the previously described mean-
cessor it requires convergence times almost an order of field based analysis (mfDCA), as assessed across a num-
magnitude smaller than the symmetric one (which we ber of large protein-domain families. The advantage of
used in this work), while still yielding almost exactly the the pseudolikelihood approach was found to be partially
same TP rates. Using N processors, the above mentioned intrinsic, and partly contingent on using a sampling-
run times could thus, in principle, be dropped by a factor corrected Frobenius norm to score inferred direct statis-
12

tical coupling matrices. ing the sequence statistics by pairwise Potts models, most
On one hand, this improvement might not be sur- extractable information might already be extracted from
prising: it is known that for very large data sets PLM the MSA. However, it may well also be that there is al-
becomes asymptotically equivalent to full maximum- ternative information hidden in the sequences, for which
likelihood inference, whereas mean-field inference re- we would need to go beyond pairwise models, or inte-
mains intrinsically approximate, and this may result in grate the physico-chemical properties of different amino
an improved PLM performance also for finite data sets acids into the procedure, or extract even more informa-
[23]. tion from large sets of evolutionarily related amino-acid
On the other hand, the above advantage holds if and sequences. DCA is only a step in this direction.
only if the following two conditions are fulfilled: Data In our work we have seen that simple sampling cor-
are drawn independently from a probability distribution, rections, more precisely sequence reweighting and the
and this probability distribution is the Boltzmann dis- average-product correction of interaction scores, lead to
tribution of a Potts model. None of these two con- an increased accuracy in predicting 3D contacts of amino
ditions actually hold for real protein sequences. On acids, which are distant on the protein’s backbone. It is,
artificial data, refined mean-field methods (Thouless- however, clear that these somewhat heuristic statistical
Anderson-Palmer equations, Bethe approximation) also fixes cannot correct for the complicated hierarchical phy-
lead to improved model inference as compared to NMFI, logenetic relationships between proteins, and that more
cf. e.g. [14, 16, 17, 21], but no such improvement has sophisticated methods would be needed to disentangle
been observed in real protein data [2]. The results of phylogenetic from functional correlations in massive se-
the paper are therefore interesting and highly nontriv- quence data. To do so is an open challenge, which would
ial. They also suggest that other model-learning meth- leave the field of equilibrium inverse statistical mechan-
ods from statistics such as ’Contrastive Divergence’ [52] ics, but where methods of inverse statistical mechanics
or the more recent ’Noise-Contrastive Estimation’ [53], may still play a useful role.
could be explored to further increase our capacity to ex- ACKNOWLEDGMENTS
tract structural information from protein sequence data.
Disregarding the improvements, we find that overall This work was supported by the Academy of Finland
the predicted contact pairs for plmDCA and mfDCA are as part of its Finland Distinguished Professor program,
highly overlapping, illustrating the robustness of DCA project 129024/Aurell, and through the Center of Excel-
results with respect to the algorithmic implementation. lence COIN. Discussions with S. Cocco and R. Monasson
This observations suggests that, in the context of model- are gratefully acknowledged.

[1] M. Weigt, R. White, H. Szurmant, J. Hoch, and T. Hwa, [10] H. J. Kappen and F. B. Rodriguez, in Advances in Neural
Proc. Natl. Acad. Sci. U. S. A. 106, 67 (2009) Information Processing Systems (The MIT Press, 1998)
[2] F. Morcos, A. Pagnani, B. Lunt, A. Bertolino, D. S. pp. 280–286
Marks, C. Sander, R. Zecchina, J. N. Onuchic, T. Hwa, [11] E. Schneidman, M. Berry, R. Segev, and W. Bialek, Na-
and M. Weigt, Proc. Natl. Acad. Sci. U. S. A. 108, ture 440, 1007 (2006)
E1293 (2011) [12] A. Braunstein and R. Zecchina, Phys. Rev. Lett. 96,
[3] D. S. Marks, L. J. Colwell, R. P. Sheridan, T. A. Hopf, 030201 (2006)
A. Pagnani, R. Zecchina, and C. Sander, PLoS One 6, [13] A. Braunstein, A. Pagnani, M. Weigt, and R. Zecchina,
e28766 (2011) J. Stat. Mech. 2008, P12001 (2008)
[4] J. Sulkowska, F. Morcos, M. Weigt, T. Hwa, and [14] Y. Roudi, J. A. Hertz, and E. Aurell, Front.
J. Onuchic, Proc. Natl. Acad. Sci. U. S. A. 109, 10340 Comput. Neurosci. 3 (2009), doi:“bibinfo doi
(2012) 10.3389/neuro.10.022.2009
[5] T. Nugent and D. T. Jones, Proc. Natl. Acad. Sci. U. S. [15] S. Cocco, S. Leibler, and R. Monasson, Proc. Natl. Acad.
A. 109, E1540 (2012) Sci. U. S. A. 106, 14058 (2009)
[6] M. Punta, P. C. Coggill, R. Y. Eberhardt, J. Mistry, [16] V. Sessak and R. Monasson, J. Phys. A: Math. Theor.
J. G. Tate, C. Boursnell, N. Pang, K. Forslund, G. Ceric, 42, 055001 (2009)
J. Clements, A. Heger, L. Holm, E. L. L. Sonnhammer, [17] M. Mézard and T. Mora, J. Physiol. (Paris) 103, 107
S. R. Eddy, A. Bateman, and R. D. Finn, Nucleic Acids (2009)
Res. 40, D290 (2012) [18] E. Marinari and V. V. Kerrebroeck, J. Stat. Mech.(2010),
[7] L. Burger and E. van Nimwegen, PLoS Comput. Biol. 6, doi:“bibinfo doi 10.1088/1742-5468/2010/02/P02008
E1000633 (2010) [19] S. Cocco and R. Monasson, Phys. Rev. Lett. 106, 090601
[8] S. Balakrishnan, H. Kamisetty, J. Carbonell, S. Lee, and (2011)
C. Langmead, Proteins: Struct., Funct., Bioinf. 79, 1061 [20] F. Ricci-Tersenghi, J. Stat. Mech.(2012)
(2011) [21] H. Nguyen and J. Berg, J. Stat. Mech.(2012)
[9] D. T. Jones, D. W. A. Buchan, D. Cozzetto, and M. Pon- [22] H. Nguyen and J. Berg, Phys. Rev. Lett. 109 (2012)
til, Bioinformatics 28, 184 (2012)
13

[23] E. Aurell and M. Ekeberg, Phys. Rev. Lett. 108, 090201 [53] M. Gutmann and A. Hyvärinen, J. Mach. Learn. Res. 13,
(2012) 307 (2012)
[24] A. Hyvärinen, J. Karhunen, and E. Oja, Independent
Component Analysis, Adaptive and Learning Systems for
Signal Processing, Communications, and Control (John Appendix A: Circle plots
Wiley & Sons, 2001)
[25] J. Rissanen, Information and Complexity in Statistical
Modeling (Springer, 2007) To get a sense of how false positives distribute across
[26] M. J. Wainwright and M. I. Jordan, the domains, we draw interactions into circles in Fig. 7.
Foundations and Trends in Machine Learning 1, 1 Among erroneously predicted contacts there is some ten-
(2008) dency towards loopiness, especially for NMFI-DI; the
[27] P. Ravikumar, M. J. Wainwright, and J. D. Lafferty, Ann. black lines tend to ‘bounce around’ in the circles. It
Stat. 38, 1287 (2010) hence seems that relatively few nodes are responsible for
[28] H. Ackley, E. Hinton, and J. Sejnowski, Cogn. Sci. 9, 147 many of the false positives. We performed an explicit
(1985) check of the data columns belonging to these ‘bad’ nodes,
[29] G. Parisi, Statistical Field Theory (Addison Wesley, and we found that they often contained strongly biased
1988)
data, i.e., had a few large f i (k). In such cases, it seemed
[30] L. Peliti, Statistical Mechanics in a Nutshell (Princeton
University Press, 2011) ISBN ISBN-13: 9780691145297 that NMFI-DI was more prone than PLM to report a
[31] K. Fischer and J. Hertz, Spin Glasses (Cambridge Uni- (predicted) interaction.
versity Press, 1993) ISBN 0521447771, 9780521447775
[32] I. Pagani, K. Liolios, J. Jansson, I. Chen, T. Smirnova,
B. Nosrat, and M. V.M., Nucleic Acids Res. 40, D571
Crystal structure PLM NMFI−DI
(2012)
[33] The Uniprot Consortium, Nucleic Acids Res. 40, D71

PF00028
(2012)
[34] H. Berman, G. Kleywegt, H. Nakamura, and J. Markley,
Structure 20, 391 (2012)
[35] U. Göbel, C. Sander, R. Schneider, and A. Valencia,
Proteins: Struct., Funct., Genet. 18, 309 (1994)
[36] S. W. Lockless and R. Ranganathan, Science 286, 295
(1999)
PF00035

[37] A. A. Fodor and R. W. Aldrich, Proteins: Struct., Funct.,


Bioinf. 56, 211 (2004)
[38] A. S. Lapedes, B. G. Giraud, L. Liu, and G. D. Stormo,
Lecture Notes-Monograph Series: Statistics in Molecular
Biology and Genetics 33, 236 (1999)
[39] J. Besag, The Statistician 24, 179195 (1975)
[40] A. Schug, M. Weigt, J. Onuchic, T. Hwa, and H. Szur-
PF00043

mant, Proc. Natl. Acad. Sci. U. S. A. 106, 22124 (2009)


[41] A. E. Dago, A. Schug, A. Procaccini, J. A. Hoch,
M. Weigt, , and H. Szurmant, Proc. Natl. Acad. Sci. U.
S. A. 109, 10148 (2012)
[42] T. A. Hopf, L. J. Colwell, R. Sheridan, B. Rost,
C. Sander, and D. S. Marks, Cell 149, 1607 (2012)
[43] E. T. Jaynes, Physical Review Series II 106, 620630
PF00046

(1957)
[44] E. T. Jaynes, Physical Review Series II 108, 171190
(1957)
[45] G. Darmois, C.R. Acad. Sci. Paris 200, 12651266 (1935),
in French
[46] E. Pitman and J. Wishart,
Math. Proc. Cambridge Philos. Soc. 32, 567579 (1936)
[47] B. Koopman, Trans. Am. Math. Soc. 39, 399409 (1936)
[48] A. Kholodenko, J. Stat. Phys. 58, 357 (1990)
[49] T. Tanaka, Neural Comput. 12, 19511968 (2000)
FIG. 7. (Color online) Connections for four families overlaid
[50] S. ichi Amari, S. Ikeda, and H. Shimokawa, “Informa-
on circles. Position ‘1’ is indicated by a dash. The leftmost
tion geometry and mean field approximation: the alpha-
column shows contacts in the crystal structure (dark gray for
projection approach,” in Advanced Mean Field Methods
d(i, j) < 5Å and light gray for 5Å≤ d(i, j) < 8.5Å). The other
– Theory and Practice (MIT Press, 2001) Chap. 16, pp.
two columns show the top 1.5N strongest ranked |j − i| > 4-
241–257, ISBN 0-262-15054-9
pairs for PLM and NMFI-DI, with gray for true positives and
[51] S. D. Dunn, L. M. Wahl, and G. B. Gloor, Bioinformatics
black for false positives.
24, 333 (2008)
[52] G. Hinton, Neural Comput. 14, 17711800 (2002)
14

Appendix B: Other scores for naive mean-field is in fact about the same for both methods. Still, PLM
inversion maintains a somewhat higher TP rates overall.

We also investigated NMFI performance using the


DI CN
APC term for the Sij scoring and using our new Sij PF00076 PF00107 PF00041 PF00027
score. In the second case we first switch the parameter N=70, Beff=14125 N=130, Beff=12114 N=85, Beff=10631 N=91, Beff=9036
1.0 1.0 1.0 1.0
constraints from (9) to (8) using (26). Mean TP rates us- 0.9 0.9 0.9 0.9
0.8 0.8 0.8 0.8
ing the modified score are shown in Fig. 8. We observe 0.7 0.7 0.7 0.7
0.6 0.6 0.6 0.6
DI
that APC in Sij scoring increases TP rates slightly, 0.5
1 10 100
0.5
1 10 100
0.5
1 10 100
0.5
1 10 100
CN
while Sij scoring can improve TP rates overall. We PF00028
N=93, B =8317
PF00111
N=78, B =5805
PF00043
N=95, B =5141
PF00084
N=56, B =4345
eff eff eff eff
remark, however, that for the second-highest ranked in- 1.0 1.0 1.0 1.0
DI 0.9 0.9 0.9 0.9
teraction (p = 2) NMFI with the original Sij (NMFI-DI) 0.8 0.8 0.8 0.8
0.7 0.7 0.7 0.7
CN
ties with NMFI with Sij (NMFI-CN). 0.6
0.5
0.6
0.5
0.6
0.5
0.6
0.5
1 10 100 1 10 100 1 10 100 1 10 100
PF00013 PF00018 PF00011 PF00046
N=58, Beff=3785 N=48, Beff=3554 N=102, Beff=3481 N=57, Beff=3314
1.0 1.0 1.0 1.0
0.9 0.9 0.9 0.9
0.8 0.8 0.8 0.8
1.0 0.7 0.7 0.7 0.7
0.6 0.6 0.6 0.6
0.5 0.5 0.5 0.5
1 10 100 1 10 100 1 10 100 1 10 100
PF00035 PF00014 PF00017 PF00081
0.9 N=67, B =2254
eff
N=53, B =1812
eff
N=77, B =1741
eff
N=82, B =1510
eff
1.0 1.0 1.0 1.0
0.9 0.9 0.9 0.9
Mean TP rate

0.8 0.8 0.8 0.8


0.7 0.7 0.7 0.7
0.8 0.6 0.6 0.6 0.6
0.5 0.5 0.5 0.5
1 10 100 1 10 100 1 10 100 1 10 100
NMFI−DI PF00105
N=70, B =1277
eff
0.7 NMFI−CDI 1.0
0.9 PLM
0.8
NMFI−CN 0.7 NMFI−CN
0.6
0.5
0.6 1 10 100

0.5
1 10 100
Number of predicted pairs
FIG. 10. (Color online) Contact-detection results for all the
CN
families in our study (sorted by Bef f ), now with the Sij
FIG. 8. (Color online) Mean TP rates, using pairs with |j − score for NMFI. The y-axes are TP rates and the x-axes
DI
i| > 4, for NMFI with old score Sij CDI
, new APC score Sij , are the number of predicted contacts p, based on pairs with
CN
and the norm score Sij . Each curve corresponds to the best |j − i| > 4. The three curves for each method are the three
λ for that particular score. regularization levels yielding highest TP rates across all fam-
ilies. The thickened curve highlights the best one out of these
three (λ = 2.3 · Bef f for NMFI-CN and λJ = 0.01 for PLM).
Motivated by the results of Fig. 8, we decided to com-
CN
pare NMFI and PLM under the Sij score. All figures
in this paragraph show the best regularization for each Figure 11 shows scatter plots for the same families as
CN
method, unless otherwise stated. Figure 9 shows score vs. in Fig. 5 but using the Sij scoring for NMFI. The points
distance for all pairs in all families. Unlike Fig. 4, the now lie more clearly on a line, from which we conclude
two plots now show very similar profiles. We note, how- that the bends in Fig. 5 were likely a consequence of
CN differing scores. Yet, the trends seen in Fig. 5 remain:
ever, that NMFI’s Sij scores trend two to three times
larger than PLM’s (the scales on the vertical axes are NMFI gives more attention to neighbor pairs than does
different). Perhaps this is an inherent feature of these PLM.
methods, or simply a consequence of the different types In Fig. 12, we recreate the contact maps of Fig. 6 with
of regularization. NMFI-CN in place of NMFI-DI and find that the plots
Figure 10 shows the same situation as Fig. 3, but us- are more symmetric. As expected, asymmetry is seen
CN primarily for small |j − i|; NMFI tends to crowd these
ing Sij to score NMFI. The three best regularization
choices for NMFI-CN turned out the same as before, i.e., regions with lots of loops.
λ = 1 · Bef f , λ = 1.5 · Bef f and λ = 2.3 · Bef f , but To investigate why NMFI assembles so many top-
the best out of these three was now λ = 2.3 · Bef f (in- scored pairs in certain neighbor regions, we performed
stead of λ = 1 · Bef f ). Comparing Fig. 3 and Fig. 10 one an explicit check of the associated MSA columns. A rel-
can see that the difference between the two methods is evant regularity was observed: when gaps appear in a
now smaller; for several families, the prediction quality sequence, they tend to do so in long strands. The pic-
15

FIG. 9. (Color online) Score plotted against distance for all position pairs in all 17 families for PLM (left) and NMFI-CN
(right). The vertical line is our contact cutoff at 8.5Å.

ture can be illustrated by the following hypothetical MSA


(in our implementation, the gap state is 1): Neighbor pair: |i−j|<5

Contact: 0 ≤ true distance < 8.5 (and |i−j| ≥ 5)

Non contact: true distance ≥ 8.5 (and |i−j| ≥ 5)

 
··· 1.5 1.5
· · · 6 5 9 7 268 7 4 4 2 2 · · ·
PF00028
 
1 PF00018
· · · 1 1 1 1 111 1 1 1 2 8 · · · 1
 
PLM (CN)

· · · 6 5 2 7 238 9 5 4 2 3 · · ·
 
  0.5
· · · 0.5
 3 7 4 7 268 7 9 4 2 3 · · ·
.
· · · 3 7 4 7 238 8 9 4 2 9 · · ·
  0
0
· · · 1 1 1 1 111 4 5 4 2 9 · · ·
 
 
· · · 8 5 9 7 298 7 4 4 2 4 · · · 0 2 4 0 1 2 3 4
  NMFI−CN (CN)
· · · 1 1 1 1 111 1 1 1 2 4 · · ·
··· 1 PF00043 1 PF00111

0.5 0.5

We recall that gaps (’1’ states) are necessary for sat- 0 0

isfactory alignment of the sequences in a family and


0 1 2 3 4 0 1 2 3 4
that in our procedure we treat gaps just another amino
acid, with its associated interaction parameters. We then
make the obvious observation that independent samples
from a Potts model will only contain long subsequences
of the same state with low probability. In other words,
FIG. 11. (Color online) Scatter plots of interaction scores for
the model to which we fit the data cannot describe long PLM and NMFI-CN from four families. For all plots, the axes
stretches of ’1’ states, which is a feature of the data. It is are as indicated by the top left figure. The distance unit in
hence quite conceivable that the two methods handle this the top box is Å.
discrepancy between data and models differently since we
do expect this gap effect to generate large Jij (1, 1) for at
least some pairs with small |j − i|.
16

PF00028 PF00035

NMFI−CN PLM NMFI−CN PLM

90 90
80 80 60 60
70 70 50 50
60 60 40 40
50 50
40 40 30 30
30 30 20 20
20 20
10 10 10 10
00 00

PF00043 PF00046

PLM NMFI−CN PLM


NMFI−CN

90 90
80 80 50 50
70 70 40 40
60 60
50 50 30 30
40 40
30 30 20 20
20 20 10 10
10 10
00 00

FIG. 12. (Color online) Predicted contact maps for PLM


(right part, black symbols) and NMFI-CN (left part, green
symbols online) from four families. A pair (i, j)’s placement
in the plots is found by matching positions i and j on the
axes. Contacts are indicated by gray (dark for d(i, j) < 5Å
and light for 5Å≤ d(i, j) < 8.5Å). True and false positives are FIG. 13. Scatter plots of estimated Jij,kl = Jij (k, l) from
represented by circles and crosses, respectively. Each figure PF00014 (top) and PF00043 (bottom). Black dots are ‘gap–
shows the 1.5N strongest ranked pairs (including neighbors) gap’ interactions (k = l = 1), dark gray dots are ‘gap–amino-
for that family. acid’ interactions (k = 1 and l 6= 1, or k 6= 1 and l = 1),
and light gray dots are ‘amino-acid–amino-acid’ interactions
(k 6= 1 and l 6= 1).

Jones et al. [9] disregarded contributions from gaps in


their scoring by simply skipping the gap state when do-
ing their norm summations. We tried this but found no
significant improvement for either method. The change
seemed to affect only pairs with small |j − i| (which is
reasonable), and our TP rates are based on pairs with
|j − i| > 4. If gap interactions are indeed responsible for
reduced prediction qualities, removing their input during
scoring is just a Band-Aid type solution. A better way
Figure 13 shows scatter plots for all coupling param- would be to suppress them already in the parameter es-
eters Jij (k, l) in PF00014, which has a modest amount timation step. That way, all interplay would have to be
of gap sections, and in PF00043, which has relatively accounted for without them. Whether or not there are
many. As outlines above, the Jij (1, 1)-parameters are ways to effectively handle the inference problem in PSP
among the largest in magnitude, especially for PF00043. by ignoring gaps or treating them differently, is an issue
We also note that the black dots steer to the right; NMFI which goes beyond the scope of this work.
clearly reacts more strongly to the gap-gap interactions We also investigated whether the gap effect depends on
than PLM. the sequence similarity reweighting factor x, which up to
17

here was chosen x = 0.9. Perhaps the gap effect can the wider range of 50-400. The list is given in Table II.
be dampened by stricter definition of sequence unique- We here kept the reweighting level at x = 0.8 as in [2],
ness? In Fig. 14 we show another set of TP rates, but while the TP rates were again calculated using the cutoff
now for x = 0.75. We also include results for NMFI run 8.5Å. The pseudocount strength for NMFI was varied in
on alignment files from which all sequences with more the same interval as in the main text. We did not try
than 20% gaps have been removed. The best regulariza- to optimize the PLM regularization parameters for this
tion choice for each method turned out the same as in trial, but merely used λh = λJ = 0.01 as determined for
Fig. 10: λ = 2.3 · Bef f for NMFI-CN and λJ = 0.01 for the smaller families in the main text.
PLM. Overall, PLM maintains the same advantage over
NMFI-CN it had in Fig. 10. Removing gappy sequences
seems to trim down more TP rates than it raises, prob-
ably since useful information in the nongappy parts is Figure 15 shows qualitatively the same behavior as in
discarded unnecessarily. the smaller set of families: TP rates increase partly from
DI CN
changing from the Sij score to the Sij score, and partly
from changing from NMFI to PLM. Our positive results
PF00076 PF00041 PF00027 PF00107
thus do not seem to be particular to short-length families.
N=70, Beff=8298 N=85, Beff=7718 N=91, Beff=7082 N=130, Beff=7050
1.0 1.0 1.0 1.0
0.9 0.9 0.9 0.9
0.8 0.8 0.8 0.8
0.7 0.7 0.7 0.7
0.6 0.6 0.6 0.6
0.5
1 10 100
0.5
1 10 100
0.5
1 10 100
0.5
1 10 100 Apart from the average TP rate for each value of p
PF00028
N=93, B =5474
PF00043
N=95, B =3467
PF00084
N=56, B =3176
PF00111
N=78, B =3110
(p’th strongest predicted interactions) one can also eval-
eff eff eff eff
1.0 1.0 1.0 1.0 uate performance by different criteria. In this larger sur-
0.9 0.9 0.9 0.9
0.8 0.8 0.8 0.8 vey we investigated the distribution of values of p such
0.7 0.7 0.7 0.7
0.6 0.6 0.6 0.6 that the TP rate in a family is one. Fig. 16 shows the
0.5 0.5 0.5 0.5
1 10 100 1 10 100 1 10 100 1 10 100 histograms of the number of families for which the top
PF00011 PF00013 PF00018 PF00035
N=102, B =2367
eff
N=58, B =2095
eff
N=48, B =2030
eff
N=67, B =1645
eff
p predictions are correct, clearly showing that the differ-
1.0
0.9
1.0
0.9
1.0
0.9
1.0
0.9
ence between PLM and NMFI (using the two scores) pri-
0.8
0.7
0.8
0.7
0.8
0.7
0.8
0.7 marily occurs at the high end. The difference in average
0.6
0.5
0.6
0.5
0.6
0.5
0.6
0.5 performance between PLM and NMFI at least partially
1 10 100 1 10 100 1 10 100 1 10 100
PF00014 PF00046 PF00017 PF00081
stems from PLM getting more strongest contact predic-
N=53, B =1319
eff
N=57, B =1300
eff
N=77, B =1133
eff
N=82, B =645
eff tions with 100% accuracy.
1.0 1.0 1.0 1.0
0.9 0.9 0.9 0.9
0.8 0.8 0.8 0.8
0.7 0.7 0.7 0.7
0.6 0.6 0.6 0.6
0.5 0.5 0.5 0.5
1 10 100 1 10 100 1 10 100 1 10 100
PF00105
N=70, B =644
eff
1.0
0.9 PLM (75% rew.)
0.8
0.7 NMFI−CN (75% rew.)
0.6 NMFI−CN (90% rew., no >20%−gap alignments)
0.5
1 10 100

1
PLM
NMFI−DI
NMFI−CN
0.9

FIG. 14. (Color online) Contact-detection results for all the


Mean TP rate

families in our study. The y-axes are TP rates and x-axes are 0.8
the number of predicted contacts p, based on pairs with |j −
i| > 4. Two curves are included for reweighting margin x =
0.75, and one for reweighting margin x = 0.9 after deletion 0.7
of all sequences with more than 20% gaps. Results for each
method correspond to the regularization level yielding highest 0.6
TP rates across all families (λ = 2.3 · Bef f for both NMFI-CN
and λJ = 0.01 for PLM).
0.5
1 10 100
Number of predicted pairs

Appendix C: Extension to 28 protein families


FIG. 15. (Color online) Mean TP rates over the larger set of
To sample a larger set of families, we conducted an ad- 28 families for PLM with λJ = 0.01 and varying regularization
ditional survey of 28 families, now covering lengths across values for NMFI-CN and NMFI-DI.
18

10
PLM
9 NMFI−DI
NMFI−CN
8

Number of families
6

0
5 13 21 29 37 45 53 61 69 77
Number of pairs

FIG. 16. (Color online) Distribution of ’perfect’ accuracy for


the three methods. The x-axis shows the number of top-
ranked pairs for which the TP rates stays at one, and the
y-axis shows the number of families.
19

ID N B Bef f (80%) PDB ID UniProt entry UniProt residues


PF00006 215 10765 641 2r9v ATPA THEMA 149-365
PF00011 102 5024 2725 2bol TSP36 TAESA 106-206
PF00014 53 2393 1478 5pti BPT1 BOVIN 39-91
PF00017 77 2732 1312 1o47 SRC HUMAN 151-233
PF00018 48 5073 335 2hda YES HUMAN 97-144
PF00025 175 2946 996 1fzq ARL3 MOUSE 3-177
PF00026 314 3851 2075 3er5 CARP CRYPA 105-419
PF00027 91 12129 7631 3fhi KAP0 BOVIN 154-238
PF00028 93 12628 6323 2o72 CADH1 HUMAN 267-366
PF00032 102 14994 684 1zrt CYB RHOCA 282-404
PF00035 67 3093 1826 1o0w RNC THEMA 169-235
PF00041 85 15551 8691 1bqu IL6RB HUMAN 223-311
PF00043 95 6818 4052 6gsu GSTM1 RAT 104-192
PF00044 151 6206 1422 1crw G3P PANVR 1-148
PF00046 57 7372 1761 2vi6 NANOG MOUSE 97-153
PF00056 142 4185 1120 1a5z LDH THEMA 1-140
PF00059 108 5293 3258 1lit REG1A HUMAN 53-164
PF00071 161 10779 3793 5p21 RASH HUMAN 5-165
PF00073 171 9524 487 2r06 POLG HRV14 92-299
PF00076 70 21125 10113 1g2e ELAV4 HUMAN 48-118
PF00081 82 3229 890 3bfr SODM YEAST 27-115
PF00084 56 5831 3453 1elv C1S HUMAN 359-421
PF00085 104 10569 6137 3gnj VWF HUMAN 1691-1863
PF00091 216 8656 917 2r75 FTSZ AQUAE 9-181
PF00092 179 3936 1786 1atz VWF HUMAN 1691-1863
PF00105 70 2549 816 1gdc GCR RAT 438-507
PF00108 264 6839 2688 3goa FADA SALTY 1-254

TABLE II. Domain families included in our extended study, listed with Pfam ID, length N , number of sequences B (after
removal of duplicate sequences), number of effective sequences Bef f (under x = 0.8, i.e., 80% threshold for reweighting), and
the PDB and UniProt specifications for the structure used to access the DCA prediction quality.

You might also like