Nonnegative Decomposition of Multivariate Information
Paul L. Williams1, ∗ and Randall D. Beer1, 2
1 Cognitive Science Program and 2 School of Informatics and Computing,
Indiana University, Bloomington, Indiana 47406 USA
(Dated: April 16, 2010)
Of the various attempts to generalize information theory to multiple variables, the most widely utilized,
interaction information, suffers from the problem that it is sometimes negative. Here we reconsider from first
principles the general structure of the information that a set of sources provides about a given variable. We begin
with a new definition of redundancy as the minimum information that any source provides about each possible
outcome of the variable, averaged over all possible outcomes. We then show how this measure of redundancy
induces a lattice over sets of sources that clarifies the general structure of multivariate information. Finally, we
use this redundancy lattice to propose a definition of partial information atoms that exhaustively decompose the
Shannon information in a multivariate system in terms of the redundancy between synergies of subsets of the
sources. Unlike interaction information, the atoms of our partial information decomposition are never negative
and always support a clear interpretation as informational quantities. Our analysis also demonstrates how the
negativity of interaction information can be explained by its confounding of redundancy and synergy.
PACS numbers: 89.70.-a, 87.19.lo, 87.10.Vg, 89.75.-k
Keywords: information theory, interaction information, redundancy, synergy, multivariate interaction
I. INTRODUCTION
From its roots in Shannon’s seminal work on reliability
and coding in communication systems, information theory
has grown into a ubiquitous general tool for the analysis of
complex systems, with application in neuroscience, genetics,
physics, machine learning, and many other areas. Somewhat
surprisingly, the vast majority of work in information theory
concerns only the simplest possible case: the information that
a single variable provides about another. This is quantified
by Shannon’s mutual information, which is by far the most
widely used concept from information theory [1]. The second
most popular concept, conditional mutual information, considers interactions between multiple variables in only the most
rudimentary sense: it seeks to eliminate the influence of other
variables in order to isolate the dependency between two variables of interest. In contrast, many of the most interesting and
challenging scientific questions, such as many-body problems
in physics [2], n-person games in game theory [3], and population coding in neuroscience [4, 5], involve understanding the
structure of interactions between three or more variables.
The two main attempts to generalize information theory
to multivariate interactions are the total correlation proposed
by Watanabe [6] (also known as the multivariate constraint
[7], multiinformation [8], and integration [9]) and the interaction information of McGill [10] (also known as multiple
mutual information [11], co-information [12], and synergy
[13]). The total correlation, as its name suggests, measures
the total amount of dependency between a set of variables as
a single monolithic quantity. Thus, the total correlation does
not provide any insight into how dependencies are distributed
amongst the variables, i.e., it says nothing about the structure
of multivariate information.
In contrast, interaction information was proposed as a measure of the amount of information bound up in a set of variables beyond that which is present in any subset of those variables. Thus, entropy and mutual information correspond to
first- and second-order interaction information, respectively,
and together with its third-, fourth-, and higher-order variants,
interaction information provides a way of characterizing the
structure of multivariate information. Interaction information
is also the natural generalization of mutual information when
Shannon entropy is viewed as a signed measure on information
diagrams [12, 14, 15]. However, the wider use of interaction information has largely been hampered by the “odd” [12]
and “unfortunate” [15] property that, for three or more variables, the interaction information can be negative (see also
[11, 14, 16–18]). For information as it is commonly understood, it is entirely unclear what it means for one variable to
provide “negative information” about another. Moreover, as
we demonstrate below, the confusing property of negativity is
actually symptomatic of deeper problems regarding the interpretation of interaction information for larger systems. As a
result, there remains no generally accepted extension of information theory for characterizing the structure of multivariate
interactions.
Here we formulate a new perspective on the structure of
multivariate information. Beginning from first principles, we
consider the general structure of the information that a set of
sources provides about a given variable. We propose a new
definition of redundancy as the minimum information that any
source provides about each outcome of the variable, averaged
over all possible outcomes. Then we show how this definition
can be used to exhaustively decompose the Shannon information in a multivariate system into partial information atoms
consisting of redundancies between synergies of subsets of the
sources. We also demonstrate that partial information forms
a lattice that clarifies the general structure of multivariate information. Unlike interaction information, the atoms of our
partial information decomposition are never negative and always support a clear interpretation as informational quantities.
Finally, our analysis also demonstrates how the negativity of
interaction information can be explained by its confounding of
redundant and synergistic interactions.
II. THE STRUCTURE OF MULTIVARIATE INFORMATION
Suppose we are given a random variable S and a random
vector R = {R1 , R2 , . . . , Rn−1 }. Then our goal is to decompose the information that R provides about S in terms of the
partial information contributed either individually or jointly by
various subsets of R. For example, in a neuroscience context,
S may correspond to a stimulus that takes on different values
and R to the evoked responses of different neurons. In this
case, we would like to quantify the information that the joint
neural response provides about the stimulus, and to distinguish
between information due to responses of individual neurons
versus combinations of them [5, 13].
Consider the simplest case of a system with three variables.
How much total information does R = {R1 , R2 } provide
about S? How do R1 and R2 contribute to the total information? The answer to the first question is given by the mutual
information I(S; R1 , R2 ), while for the latter we can identify
three distinct possibilities. First, R1 may provide information
that R2 does not, or vice versa (unique information). For example, if R1 is a copy of S and R2 is a degenerate random variable,
then the total information from R reduces to the unique information from R1 . Second, R1 and R2 may provide the same
or overlapping information (redundancy). For example, if R1
and R2 are both copies of S then they redundantly provide
complete information. Third, the combination of R1 and R2
may provide information that is not available from either alone
(synergy). A well-known example for binary variables is the
exclusive-OR function S = R1 ⊕ R2 , in which case R1 and
R2 individually provide no information but together provide
complete information. Thus, intuitively, the total information
from R decomposes into unique information from R1 and R2 ,
redundant information shared by R1 and R2 , and synergistic
information contributed jointly by R1 and R2 (FIG. 1).

FIG. 1. Structure of multivariate information for 3 variables. Labelled regions correspond to unique information (Unq), redundancy (Rdn), and synergy (Syn).

In sum, for three variables we can identify unique information, redundancy, and synergy as the basic atoms of multivariate information. In fact, as later developments will clarify, unique information is best thought of as a degenerate form of redundancy or synergy, so that redundancy and synergy alone constitute the basic building blocks of multivariate information. In particular, we will find that various combinations of redundancy and synergy, which may at first sound paradoxical, play a fundamental role in structuring multivariate information in higher dimensions. Next we proceed to formalize these ideas, beginning with the problem of defining a measure of redundancy.

III. MEASURING REDUNDANCY
Let A1 , A2 , . . . , Ak be nonempty and potentially overlapping subsets of R, which we call sources. How can we quantify
the redundant information that all sources provide about S?
Of course, the information supplied by each Ai is given
simply by I(S; Ai ), the mutual information between S and
Ai . However, it is crucial to note that mutual information is
actually a measure of average or expected information, where
the expected value is taken over outcomes of the random variables. Thus, for instance, two sources might provide the same
average amount of information, while also providing information about different outcomes of S. Stated formally, the
information provided by a source A can be written as
I(S; A) = \sum_s p(s) \, I(S = s; A) \quad (1)
where the specific information I(S = s; A) quantifies the information associated with a particular outcome s of S. Various
definitions of specific information have been proposed to quantify different relationships between S and A (see Appendix A),
but for our purposes the most useful is
I(S = s; A) = \sum_a p(a|s) \left[ \log \frac{1}{p(s)} - \log \frac{1}{p(s|a)} \right]. \quad (2)
The term 1/p(s) is called the surprise of s, so I(S = s; A) is the
average reduction in surprise of s given knowledge of A. In
other words, I(S = s; A) quantifies the information that A
provides about each particular outcome s ∈ S, while I(S; A)
is the expected value of this quantity over all outcomes of S.
Given these considerations, a natural measure of redundancy
is the expected value of the minimum information that any
source provides about each outcome of S, or
I_{\min}(S; \{A_1, A_2, \ldots, A_k\}) = \sum_s p(s) \min_{A_i} I(S = s; A_i). \quad (3)
Imin captures the idea that redundancy is the information common to all sources (the minimum information that any source
provides), while taking into account that sources may provide
information about different outcomes of S. Note that, like the
mutual information, Imin is also an expected value of specific
information terms.
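To make Eqs. (2) and (3) concrete, here is a minimal Python sketch that computes the specific information I(S = s; A) and Imin directly from a joint distribution stored as a dictionary. The function names and the dictionary-based representation are our own illustrative choices, not part of the paper.

```python
from math import log2

# Joint distribution p(s, r1, r2) as {(s, r1, r2): probability}.
# Example: the XOR gate S = R1 xor R2 with equiprobable inputs.
pjoint = {(0, 0, 0): 0.25, (1, 0, 1): 0.25, (1, 1, 0): 0.25, (0, 1, 1): 0.25}

def marginal(p, idx):
    """Marginal distribution over the variables at positions idx."""
    out = {}
    for outcome, prob in p.items():
        key = tuple(outcome[i] for i in idx)
        out[key] = out.get(key, 0.0) + prob
    return out

def specific_info(p, s, source):
    """I(S = s; A) of Eq. (2); S is position 0, source is a tuple of positions."""
    ps = marginal(p, (0,))[(s,)]
    pa = marginal(p, source)
    psa = marginal(p, (0,) + source)
    total = 0.0
    for a, pa_val in pa.items():
        p_s_and_a = psa.get((s,) + a, 0.0)
        if p_s_and_a == 0.0:
            continue
        p_a_given_s = p_s_and_a / ps          # p(a|s)
        p_s_given_a = p_s_and_a / pa_val      # p(s|a)
        total += p_a_given_s * (log2(1.0 / ps) - log2(1.0 / p_s_given_a))
    return total

def i_min(p, sources):
    """Imin(S; {A1, ..., Ak}) of Eq. (3)."""
    ps = marginal(p, (0,))
    return sum(prob * min(specific_info(p, s, a) for a in sources)
               for (s,), prob in ps.items())

# For XOR, each source alone is uninformative, so Imin is 0 bits,
# while the joint source {R1, R2} gives the full 1 bit.
print(i_min(pjoint, [(1,), (2,)]))   # ~0.0
print(i_min(pjoint, [(1, 2)]))       # ~1.0 (self-redundancy = I(S; R1, R2))
```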
Imin also has several important properties that further support its interpretation as a measure of redundancy. First, Imin
is nonnegative, a property that follows directly from the nonnegativity of specific information (see Appendix D). Second,
Imin is less than or equal to I(S; Ai ) for all Ai ’s, with equality if and only if I(S = s; Ai ) = I(S = s; Aj ) for all i
and j and all s ∈ S. Thus, as one would hope, the amount
of redundant information is bounded by the information provided by each source, with equality if and only if all sources
provide the exact same information about S. Finally, and
closely related to the previous property, for a given source
A the amount of information redundant with A is maximal
for Imin (S; {A}) = I(S; A). In other words, redundant information is maximized by the “self-redundancy,” analogous
to the property that mutual information is maximized by the
self-information I(S; S) = H(S).
What are the distinct ways in which collections of sources
might contribute redundant information? Formally, answering
this question means identifying the domain of Imin . Thus
far, we have assumed that the natural domain is the collection
of all possible sets of sources, but in fact this can be greatly
simplified. To illustrate, consider two sources, A and B, with
A a subset of B. Clearly, any information provided by A
is also provided by B, so the redundancy between A and B
reduces to the self-redundancy for A,
Imin (S; {A, B}) = Imin (S; {A}) = I(S; A).
Furthermore, for any source C, it follows that
Imin (S; {A, B, C}) = Imin (S; {A, C}).
Extending
this idea, for any collection of sources where some are
supersets of others, the redundancy for that collection is
equivalent to the redundancy with all supersets removed. Thus,
the domain for Imin can be reduced to the collection of all
sets of sources such that no source is a superset of any other.
Formally, this set can be written as
A(R) = {α ∈ P1 (P1 (R)) : ∀Ai , Aj ∈ α, Ai ⊄ Aj }, (4)
where P1 (R) = P(R) \ {∅} is the set of all nonempty subsets of R. Henceforth, we will denote elements of A(R),
corresponding to collections of sources, with bracketed expressions containing only the indices for each source. For instance,
{{R1 , R2 }} will be {12}, {{R1 }, {R2 , R3 }} will be {1}{23},
and so forth.
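For small systems the domain A(R) of Eq. (4) can be enumerated directly. The sketch below (our own illustrative code and function names) lists all collections of nonempty, pairwise incomparable subsets of R and prints them in the index notation used here; for |R| = 2 it yields the 4 collections and for |R| = 3 it yields 18.

```python
from itertools import combinations

def nonempty_subsets(indices):
    """All nonempty subsets of R, each represented as a frozenset of indices."""
    return [frozenset(c) for r in range(1, len(indices) + 1)
            for c in combinations(indices, r)]

def antichains(indices):
    """A(R) of Eq. (4): collections in which no source is a superset of another."""
    sources = nonempty_subsets(indices)
    result = []
    for k in range(1, len(sources) + 1):
        for collection in combinations(sources, k):
            if all(not a < b and not b < a for a, b in combinations(collection, 2)):
                result.append(collection)
    return result

def label(collection):
    """Render a collection in the paper's index notation, e.g. {1}{23}."""
    return ''.join('{' + ''.join(map(str, sorted(s))) + '}'
                   for s in sorted(collection, key=lambda s: (len(s), sorted(s))))

print([label(a) for a in antichains([1, 2])])   # the 4 collections for |R| = 2
print(len(antichains([1, 2, 3])))               # 18 collections for |R| = 3
```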
The possibilities for redundancy are also naturally structured,
which is shown by extending the same line of reasoning to
define an ordering ⪯ on the elements of A(R). Consider two
collections of sources, α, β ∈ A(R), where for each source
B ∈ β there exists a source A ∈ α with A a subset of B. This
means that for each source B ∈ β there is a source A ∈ α such
that A provides no more information than B. The redundant
information shared by all B ∈ β must therefore at least include
any redundant information shared by all A ∈ α. Thus, we can
define a partial order over the elements of A(R) such that one
element (collection of sources) is considered to precede another
if and only if the latter provides any redundant information
that the former provides. The ordering relation ⪯ is formally defined as

∀α, β ∈ A(R), (α ⪯ β ⇔ ∀B ∈ β, ∃A ∈ α, A ⊆ B). (5)

Applying this ordering to the elements of A(R) produces a redundancy lattice, in which a higher element provides at least as much redundant information as a lower one (FIG. 2; see Appendix C).

The redundancy lattice provides a wealth of insight into the structure of redundancy. For instance, from the redundancy lattice it is possible to read off some of the properties of Imin noted earlier. The property that redundancy for a source is maximized by the self-redundancy can be seen from the fact that any node corresponding to an individual source appears higher in the redundancy lattice than any other node involving that source. For example, in FIG. 2B, the node labeled {12}, corresponding to the self-redundancy for the source {R1, R2}, occurs higher than nodes labeled {12}{13}, {12}{13}{23}, and {3}{12}. Another property of Imin that can be seen from these diagrams relates to the top and bottom elements of the lattice. The top element corresponds to the self-redundancy for R, reflecting the fact that Imin is bounded from above by the total amount of information provided by R. At the other end of the spectrum, the bottom element corresponds to the redundant information that each individual element of R provides, with all other possibilities for redundancy falling between these two extremes.

IV. PARTIAL INFORMATION DECOMPOSITION
The redundant information associated with each node of the
redundancy lattice includes, but is not limited to, the redundant
information provided by all nodes lower in the lattice. Thus,
moving from node to node up the lattice, Imin can be thought
of as a kind of “cumulative information function,” effectively
integrating the information provided by increasingly inclusive
collections of sources. Next, we derive an inverse of Imin
called the partial information function (PI-function). Whereas
Imin quantifies cumulative information, the PI-function measures the partial information contributed uniquely by each
particular collection of sources. This partial information will
form the atoms into which we decompose the total information
that R provides about S.
For a collection of sources α ∈ A(R), the PI-function,
denoted ΠR , is defined implicitly by
I_{\min}(S; \alpha) = \sum_{\beta \preceq \alpha} \Pi_R(S; \beta). \quad (6)
Formally, ΠR corresponds to the Möbius inverse of Imin
[19, 20]. From this relationship, it is clear that ΠR can be
calculated recursively as
\Pi_R(S; \alpha) = I_{\min}(S; \alpha) - \sum_{\beta \prec \alpha} \Pi_R(S; \beta). \quad (7)
Put into words, ΠR (S; α) quantifies the information provided
redundantly by the sources of α that is not provided by any
simpler collection of sources (i.e., any β lower than α on the redundancy lattice).

FIG. 2. Redundancy lattice for (A) 3 and (B) 4 variables.

In Appendix D, it is shown that ΠR can
be written in closed form as
\Pi_R(S; \alpha) = I_{\min}(S; \alpha) - \sum_s p(s) \max_{\beta \in \alpha^-} \min_{B \in \beta} I(S = s; B) \quad (8)
where α− represents the nodes immediately below α in the
redundancy lattice. From this formulation, it is readily shown
that ΠR is nonnegative (see Appendix D), and thus can be
naturally interpreted as an informational quantity associated
with the sources of α.
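To make the inversion of Eq. (7) concrete, here is a minimal sketch for the three-variable case R = {R1, R2}, whose redundancy lattice has only the four nodes {1}{2}, {1}, {2}, and {12}. The ordering test implements Eq. (5); the Imin values are those of the XOR example and would in practice come from a routine like the earlier Imin sketch. The names and data layout are our own.

```python
# Collections of sources are tuples of sources; each source is a frozenset of indices.
A1_2 = (frozenset({1}), frozenset({2}))   # {1}{2}
A1 = (frozenset({1}),)                    # {1}
A2 = (frozenset({2}),)                    # {2}
A12 = (frozenset({1, 2}),)                # {12}
lattice = [A1_2, A1, A2, A12]

def precedes(alpha, beta):
    """alpha is below or equal to beta in the redundancy lattice (Eq. 5)."""
    return all(any(a <= b for a in alpha) for b in beta)

def pi_atoms(imin):
    """Eq. (7): recover each PI-atom by subtracting all strictly lower atoms from Imin."""
    atoms = {}
    # Visit nodes bottom-up (smaller down-set first) so lower atoms are already known.
    for alpha in sorted(lattice, key=lambda a: sum(precedes(b, a) for b in lattice)):
        lower = sum(atoms[beta] for beta in lattice
                    if beta != alpha and precedes(beta, alpha))
        atoms[alpha] = imin[alpha] - lower
    return atoms

# Imin values for the XOR example S = R1 xor R2 (computed with the earlier sketch):
imin_xor = {A1_2: 0.0, A1: 0.0, A2: 0.0, A12: 1.0}
print(pi_atoms(imin_xor))
# -> the redundancy {1}{2} and the unique terms vanish; the synergy {12} carries the full 1 bit.
```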
The decomposition of mutual information into a sum of
PI-terms follows from
I(S; A) = I_{\min}(S; \{A\}) = \sum_{\beta \preceq \{A\}} \Pi_R(S; \beta). \quad (9)
For the 3-variable case R = {R1 , R2 }, Equation (9) yields
I(S; R1) = ΠR(S; {1}) + ΠR(S; {1}{2}) (10)

and

I(S; R1, R2) = ΠR(S; {1}) + ΠR(S; {2}) + ΠR(S; {1}{2}) + ΠR(S; {12}). (11)
The relationship between these equations can be represented
as a partial information (PI) diagram (FIG. 3A), which illustrates the way in which the total information that R provides about S is distributed across various combinations of
sources. Furthermore, comparing this diagram with FIG.
1 makes immediately clear the meaning of each partial information term. First, from Equation (8), we have that
ΠR (S; {1}{2}) = Imin (S; {1}{2}), which, from the definition of Imin , corresponds to the redundancy for R1 and R2 .
The unique information for R1 is given by ΠR (S; {1}) =
I(S; R1 ) − Imin (S; {1}{2}), which is the total information
from R1 minus the redundancy, and likewise for R2 . Finally,
the additional information provided by the combination of
R1 and R2 is given by ΠR (S; {12}), corresponding to their
synergy.
To fix ideas, consider the example in FIG. 4A. From the
symmetry of the distribution, it is clear that R1 and R2 must
provide the same amount of information about S. Indeed, this
is easily verified, with I(S; R1) = I(S; R2) = −(1/3) log(1/3) − (2/3) log(2/3). However, it is also clear that R1 and R2 provide
information about different outcomes of S. In particular, given
knowledge of R1 , one can determine conclusively whether
or not outcome S = 2 occurs (which is not the case for R2 ),
and likewise for R2 and outcome S = 1. This feature is
captured by ΠR(S; {1}) = ΠR(S; {2}) = 1/3, indicating that R1 and R2 each provide 1/3 bits of unique information about S.
The redundant information, ΠR (S; {1}{2}) = log 3 − log 2,
captures the fact that knowledge of either R1 or R2 reduces
uncertainty about S from three equally likely outcomes to
two. Finally, R1 and R2 also provide 1/3 bits of synergistic information, i.e., ΠR(S; {12}) = 1/3. This value reflects the
fact that R1 and R2 together uniquely determine whether or
not outcome S = 0 occurs, which is not true for R1 or R2
alone.
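The numbers quoted for FIG. 4A can be checked from the three-variable identities implied by Eqs. (10)-(11): Rdn = Imin(S;{1}{2}), Unq_i = I(S;R_i) − Rdn, and Syn = I(S;R1,R2) − I(S;R1) − I(S;R2) + Rdn. The sketch below is our own code, and the distribution is the one we read off FIG. 4A (three equiprobable joint outcomes in which R1 flags S = 2 and R2 flags S = 1); it reproduces log 3 − log 2, 1/3, 1/3, and 1/3 bits.

```python
from math import log2

# FIG. 4A as we read it: three equiprobable outcomes (s, r1, r2).
p = {(0, 0, 0): 1/3, (1, 0, 1): 1/3, (2, 1, 0): 1/3}

def marginal(joint, idx):
    out = {}
    for outcome, prob in joint.items():
        key = tuple(outcome[i] for i in idx)
        out[key] = out.get(key, 0.0) + prob
    return out

def mutual_info(joint, xs, ys):
    """I(X; Y) for the variables at positions xs and ys."""
    px, py, pxy = marginal(joint, xs), marginal(joint, ys), marginal(joint, xs + ys)
    return sum(pr * log2(pr / (px[k[:len(xs)]] * py[k[len(xs):]]))
               for k, pr in pxy.items() if pr > 0)

def specific_info(joint, s, source):
    """I(S = s; A) of Eq. (2)."""
    ps = marginal(joint, (0,))[(s,)]
    pa = marginal(joint, source)
    psa = marginal(joint, (0,) + source)
    return sum((psa[(s,) + a] / ps) *
               (log2(1 / ps) - log2(1 / (psa[(s,) + a] / pa[a])))
               for a in pa if psa.get((s,) + a, 0.0) > 0)

rdn = sum(prob * min(specific_info(p, s, (1,)), specific_info(p, s, (2,)))
          for (s,), prob in marginal(p, (0,)).items())
unq1 = mutual_info(p, (0,), (1,)) - rdn
unq2 = mutual_info(p, (0,), (2,)) - rdn
syn = (mutual_info(p, (0,), (1, 2)) - mutual_info(p, (0,), (1,))
       - mutual_info(p, (0,), (2,)) + rdn)
print(rdn, unq1, unq2, syn)   # ~0.585 (= log 3 - log 2), 1/3, 1/3, 1/3
```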
Note that, unlike mutual information or interaction information, partial information is not symmetric. For instance,
the synergistic information that R1 and R2 provide about S is
not in general equal to the synergistic information that S and
R2 provide about R1 . This property is also illustrated by the
example in FIG. 4A. Given knowledge of S, one can uniquely
determine the outcome of R1 (and R2 ), so that S provides complete information about both. Thus, it is not possible for the
combination of S and R2 to provide any additional synergistic
information about R1 , since there is no remaining uncertainty
about R1 when S is known. In contrast, as was just noted, R1
and R2 provide 1/3 bits of synergistic information about S. This
asymmetry accounts for our decision to focus on information
about a particular variable S throughout, since in general the
analysis will differ depending on the variable of interest. Note
that total information is also asymmetric in this sense, i.e., in
general I(S; R1 , R2 ) ≠ I(R1 ; S, R2 ) (though, of course, it is
symmetric in the sense that I(S; R1 , R2 ) = I(R1 , R2 ; S)).
The general structure of PI-diagrams becomes clear when we
consider the decomposition for four variables (FIG. 3B). First,
note that all of the possibilities for three variables are again
present for four. In particular, each element of R can provide
unique information (regions labeled {1}, {2}, and {3}), information redundantly with one other variable ({1}{2}, {1}{3},
and {2}{3}), or information synergistically with one other variable ({12}, {13}, and {23}). Additionally, information can
be provided redundantly by all three variables ({1}{2}{3})
or provided by their three-way synergy ({123}). More interesting are the new kinds of terms representing combinations
of redundancy and synergy. For instance, the regions marked
{1}{23}, {2}{13}, and {3}{12} represent information that is
available redundantly from either one variable considered individually or the other two considered together. Or, for instance,
the region labeled {12}{13}{23} represents the information
provided redundantly by the three possible two-way synergies.
In general, the PI-atom for a collection of sources corresponds
to the information provided redundantly by the synergies of all sources in the collection. This point also clarifies our earlier claim that unique information is best thought of as a degenerate case: unique information corresponds to the combination of first-order redundancy and first-order synergy.

FIG. 3. Partial information diagrams for (A) 3 and (B) 4 variables.

FIG. 4. Probability distributions for S ∈ {0, 1, 2} and R1 , R2 ∈ {0, 1}. Black tiles represent equiprobable outcomes. White tiles are zero-probability outcomes.
In general, a PI-diagram for n variables, S and R =
{R1 , R2 , . . . , Rn−1 }, consists of the following (see Fig. S2
in Appendix E). First, for each element Ri ∈ R there is a
region corresponding to I(S; Ri ). Then, for every subset A
of R with two or more elements, I(S; A) is depicted as a
region containing I(S; A) for all A ∈ A but not coextensive with ⋃_{A∈A} I(S; A). The difference between I(S; A) and ⋃_{A∈A} I(S; A) represents the synergy for A, the information
gained from the combined knowledge of all elements in A
that is not available from any subset. In addition, regions of
the diagram intersect generically, representing all possibilities
for redundancy. In total, a PI-diagram is composed of the
(n − 1)-th Dedekind number [21] of PI-atoms, same as the cardinality of A(R) (see Appendix C). As described above, each
PI-atom represents the redundancy of synergies for a particular
collection of sources, corresponding to one distinct way for the
components of R to contribute information about S.
Finally, it is instructive to consider the relationship between the redundancy lattice and PI-diagram for n variables.
First, we note that Imin is analogous to set intersection for
PI-diagrams, consistent with the idea of redundancy as overlapping information. Specifically, Imin(S; {A1, A2, . . . , Ak}) corresponds to the region ⋂_i I(S; Ai). From this correspondence between Imin and set intersection, we can establish the following connection: for α, β ∈ A(R), α is lower than β in the redundancy lattice if and only if ⋂_{A∈α} I(S; A) is a subset of ⋂_{B∈β} I(S; B) in the PI-diagram. Consequently, the redundancy lattice and PI-diagram can be viewed as complementary
representations of the same structure, with the PI-diagram a collapsed version of the redundancy lattice formed by embedding
regions according to the lattice ordering.
V. WHY INTERACTION INFORMATION IS SOMETIMES NEGATIVE
We next show how PI-decomposition can be used to understand the conditions under which interaction information, the
standard generalization of mutual information to multivariate
interactions, is negative. The interaction information for three
variables is given by
I(S; R1 ; R2 ) = I(S; R1 |R2 ) − I(S; R1 ) (12)
and for n > 3 variables is defined recursively as
I(S; R1 ; R2 ; . . . ; Rn−1 ) = I(S; R1 ; R2 ; . . . ; Rn−2 |Rn−1 ) − I(S; R1 ; R2 ; . . . ; Rn−2 ) (13)
where the conditional interaction information is defined by
simply including the conditioning in all terms of the original
definition. Interaction information is symmetric for all permutations of its arguments, and is traditionally interpreted as
the information shared by all n variables beyond that which is
shared by any subset of those variables.
For 3-variable interaction information, a positive value is
naturally interpreted as indicating a situation in which any one
variable of the system enhances the correlation between the
other two. For example, a positive value for Equation (12) indicates that knowledge of R2 enhances the correlation between S
and R1 (and likewise for all other variable permutations). Thus,
in the terminology used here, a positive value for I(S; R1 ; R2 )
indicates the presence of synergy. On the other hand, a negative
value for I(S; R1 ; R2 ) indicates a situation in which any one
variable accounts for or “explains away” [22] the correlation
between the other two. In other words, a negative value for
I(S; R1 ; R2 ) indicates redundancy. Indeed, I(S; R1 ; R2 ) is a
widely used measure of synergy in neuroscience, where it is
interpreted in exactly this way [23–26].
The PI-decomposition for 3-variable interaction information
(FIG. 5A; see also Fig. S3 in Appendix E) confirms this interpretation, with I(S; R1 ; R2 ) equal to the difference between
the synergistic and the redundant information, i.e.,
I(S; R1 ; R2 ) = ΠR (S; {12}) − ΠR (S; {1}{2}). (14)
Thus, it is indeed the case that positive values indicate synergy
and negative values indicate redundancy.
However, PI-decomposition also makes clear that
I(S; R1 ; R2 ) confounds redundancy and synergy, with the
meaning of interaction information ambiguous for any system
that exhibits a mixture of the two (cf. [27], who suggest the
possibility of mixed redundancy and synergy, but without
attempting to disentangle them). For instance, consider again
the example in FIG. 4A. As described earlier, in this case R1
and R2 provide log 3 − log 2 bits of redundant information and
1/3 bits of synergistic information. Consequently, I(S; R1 ; R2 )
is negative because there is more redundancy than synergy,
despite the fact that the system clearly exhibits synergistic
interactions. As a second example, consider the distribution in
FIG. 4B. In this case, R1 and R2 provide 1/2 bits of redundant
information, corresponding to the fact that knowledge of
either R1 or R2 reduces uncertainty about the outcomes
S = 0 and S = 2. Additionally, R1 and R2 provide 1/2 bits
of synergistic information, reflecting the fact that R1 and
R2 together provide complete information about outcomes
S = 0 and S = 2, which is not true for either alone. Thus, the
interaction information in this case is equal to zero despite
the presence of both redundant and synergistic interactions,
because redundancy and synergy are balanced.

FIG. 5. PI-decomposition of interaction information for (A) 3 and (B) 4 variables. Blue and red regions represent PI-terms that are added and subtracted, respectively. The green region in (B) represents a PI-term that is subtracted twice.
The situation is worse for four-variable interaction information, which is known to violate the interpretation that positive values indicate (pure) synergy and negative values indicate (pure) redundancy [12, 28]. To demonstrate, consider
the case of 3-parity, which is the higher-order form of the
exclusive-OR, or 2-parity, function mentioned earlier. In this
case, we have a system of four binary random variables, S
and R = {R1 , R2 , R3 }, where the eight outcomes for R are
equiprobable and S = R1 ⊕ R2 ⊕ R3 . Intuitively, this corresponds to a case of pure synergy, since the value of S can
be determined only when all of the Ri are known. Indeed,
using Eq. (13) we find that I(S; R1 ; R2 ; R3 ) for this system
is equal to +1 bit, as expected from the interpretation that
positive values indicate synergy. However, now consider a
second system of binary variables, this time where the two
outcomes of S are equiprobable and R1 , R2 , and R3 are all
copies of S. Clearly this corresponds to a case of pure redundancy, since the value of S can be determined uniquely from
knowledge of any Ri , but I(S; R1 ; R2 ; R3 ) for this system is
again equal to +1 bit, same as the case of pure synergy. Thus,
a completely redundant system is assigned a positive value for
the interaction information, in clear violation of the idea that
redundancy is indicated by negative values. Worse still, the
4-variable interaction information fails to distinguish between
the polar opposites of purely synergistic and purely redundant
information.
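Both four-variable examples can be checked numerically with the alternating-sum form of interaction information given in the caption of Fig. S4. The code below is our own sketch (variable positions: S first, then R1, R2, R3), and it returns +1 bit for both the 3-parity system and the pure-redundancy system of copies.

```python
from math import log2
from itertools import product, combinations

def marginal(joint, idx):
    out = {}
    for outcome, prob in joint.items():
        key = tuple(outcome[i] for i in idx)
        out[key] = out.get(key, 0.0) + prob
    return out

def mutual_info(joint, xs, ys):
    px, py, pxy = marginal(joint, xs), marginal(joint, ys), marginal(joint, xs + ys)
    return sum(pr * log2(pr / (px[k[:len(xs)]] * py[k[len(xs):]]))
               for k, pr in pxy.items() if pr > 0)

def interaction_info_4(joint):
    """I(S; R1; R2; R3) via the alternating sum in the caption of Fig. S4."""
    total = 0.0
    for k in (1, 2, 3):
        sign = (-1) ** (3 - k)   # + for the full set, alternating for smaller subsets
        for sub in combinations((1, 2, 3), k):
            total += sign * mutual_info(joint, (0,), sub)
    return total

# 3-parity: S = R1 xor R2 xor R3, all eight inputs equiprobable (pure synergy).
parity = {(r1 ^ r2 ^ r3, r1, r2, r3): 1 / 8
          for r1, r2, r3 in product((0, 1), repeat=3)}
# Pure redundancy: R1, R2, R3 are all copies of an unbiased S.
copies = {(s, s, s, s): 1 / 2 for s in (0, 1)}

print(interaction_info_4(parity))   # +1 bit
print(interaction_info_4(copies))   # also +1 bit, despite pure redundancy
```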
The PI-decomposition for 4-variable interaction information
(FIG. 5B; see also Fig. S4 in Appendix E) clarifies why this is
the case. In terms of PI-atoms, I(S; R1 ; R2 ; R3 ) is given by
ΠR (S; {123}) + ΠR (S; {1}{2}{3})
−ΠR (S; {1}{23}) − ΠR (S; {2}{13}) − ΠR (S; {3}{12})
−ΠR (S; {12}{13}) − ΠR (S; {12}{23}) − ΠR (S; {13}{23})
− 2 × ΠR (S; {12}{13}{23}). (15)
Thus, I(S; R1 ; R2 ; R3 ) is equal to the sum of third-order synergy ({123}) and third-order redundancy ({1}{2}{3}), minus
the information provided redundantly by a first- and second-order synergy ({1}{23}, {2}{13}, and {3}{12}), minus the
information provided redundantly by two second-order synergies ({12}{13}, {12}{23}, and {13}{23}), and minus twice
the information provided redundantly by all three second-order
synergies ({12}{13}{23}). Thus, systems with pure synergy
and pure redundancy have the same value for I(S; R1 ; R2 ; R3 )
because 4-variable interaction information adds in the highest-order synergy and redundancy terms. More generally, the PI-decomposition for I(S; R1 ; R2 ; R3 ) shows why it is difficult
to interpret as a meaningful quantity, and as one might expect
the story only becomes more complicated in higher dimensions. Thus, although one can readily decompose interaction
information into a collection of partial information contributions, and understand the conditions under which it will be
positive or negative depending on the relative magnitudes of
these contributions, the utility of interaction information for
larger systems is unclear.
VI. DISCUSSION

The main objective of this paper has been to quantify multivariate information in such a way that the structure of variable interactions is illuminated. This was accomplished by first defining a general measure of redundant information, Imin, which satisfies a number of intuitive properties for a measure of redundancy. Next, it was shown that Imin induces a lattice structure over the set of possible information sources, referred to as the redundancy lattice, which characterizes the distinct ways that information can be distributed amongst a set of sources. From this lattice, a measure of partial information was derived that captures the unique information contributed by each possible combination of sources. It was then shown that mutual information decomposes into a sum of these partial information terms, so that the total information provided by a source is broken down into a collection of partial information contributions. Moreover, it was demonstrated that each of these terms supports clear interpretation as a particular combination of redundant and synergistic interactions between specific subsets of variables. Finally, we discussed the relationship between partial information decomposition and interaction information, the current de facto measure of multivariate interactions, and used partial information to clarify the confusing property that interaction information is sometimes negative.

One obvious challenge with applying these ideas is that the number of partial information terms grows rapidly for larger
systems. For instance, with 9 variables there are more than
5 × 10^22 possibilities [29], and beyond that the Dedekind numbers are not even currently known. Thus, clearly an important
direction for future work is to determine efficient ways of calculating partial information terms for larger systems. To this end,
the lattice structure of the terms is likely to play an essential
role. As with any ordered data structure, the fact that the space
of possibilities is highly organized can be readily exploited for
efficient use. For instance, as a simple example, if Imin is calculated in a descending fashion over the nodes of the redundancy
lattice and at a certain juncture has a value of zero, all of the
terms below that node can immediately be eliminated simply
from the monotonicity of Imin (see Appendix D). Moreover, if
the Markov property or any other constraints hold between the
variables, many of the possible partial information terms can
also be excluded. Finally, these considerations notwithstanding, it should also be emphasized that 3-variable interaction
is the current state of the art, and thus even the simplest form
of partial information decomposition can be used to address a
number of outstanding questions.
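As a sketch of the pruning idea just mentioned (our own illustrative code; the lattice, its ordering, and the Imin oracle are left abstract as arguments), a descending sweep can skip the entire down-set of any node at which Imin vanishes: by the monotonicity and nonnegativity results of Appendix D, every Imin value, and hence every PI-atom, below that node must then be zero.

```python
def nonzero_candidates(lattice, precedes, imin_value):
    """Return the nodes whose PI-atoms still need to be computed.

    lattice: iterable of nodes; precedes(a, b): a is below-or-equal b in the
    redundancy lattice; imin_value(node): Imin at that node.  If Imin is zero at
    a node, it is zero on the node's whole down-set (monotonicity), so every
    atom there is zero as well and the down-set can be skipped.
    """
    nodes = list(lattice)
    skipped, remaining = set(), []
    # Visit higher nodes first (a larger down-set means a higher position).
    for node in sorted(nodes, key=lambda a: -sum(precedes(b, a) for b in nodes)):
        if node in skipped:
            continue
        if imin_value(node) == 0.0:
            skipped.update(b for b in nodes if precedes(b, node))
        else:
            remaining.append(node)
    return remaining
```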
In physics, for example, 3-variable interactions have been
explored in relation to the non-separability of quantum systems
[30] and in the study of many-body correlation effects [31]. In
neuroscience, the concepts of synergy and redundancy for three
variables have been examined in the context of neural coding in
a number of theoretical and empirical investigations [23–26, 32,
33]. In genetics, multivariate dependencies arise in the analysis
of gene-gene and gene-environment interactions in studies of
human disease susceptibility [28, 34, 35]. Moreover, similar
issues have also been explored in machine learning [22, 27, 36],
ecology [37], quantum information theory [38], information
geometry [39], rough set analysis [40], and cooperative game
theory [41]. Thus, in all of these cases, the 3-variable form of
partial information decomposition can be applied immediately
to illuminate the structure of multivariate dependencies, while
the general form provides a clear way forward in the study of
more complex systems of interactions.
ACKNOWLEDGMENTS

We thank O. Sporns, J. Beggs, A. Kolchinsky, and L. Yaeger
for helpful comments. This work was supported in part by NSF
grant IIS-0916409 (to R.D.B.) and an NSF IGERT traineeship
(to P.L.W.).
[1] C. E. Shannon and W. Weaver, The Mathematical Theory of
Communication (Univ of Illinois Press, 1949).
[2] D. Pines, The Many-Body Problem (Addison-Wesley, 1997).
[3] R. D. Luce and H. Raiffa, Games and Decisions: Introduction
and Critical Survey (Dover, 1989).
[4] P. Dayan and L. F. Abbott, Theoretical Neuroscience: Computational and Mathematical Modeling of Neural Systems (MIT
Press, 2001).
[5] F. Rieke, D. Warland, R. de Ruyter van Steveninck, and
W. Bialek, Spikes: Exploring the Neural Code (MIT Press,
1999).
[6] S. Watanabe, IBM Journal of Research and Development, 4, 66
(1960).
[7] W. R. Garner, Uncertainty and Structure as Psychological Concepts (Wiley, 1962).
[8] M. Studeny and J. Vejnarova, Learning in Graphical Models,
261 (1998).
[9] G. Tononi, O. Sporns, and G. M. Edelman, Proc Natl Acad Sci
USA, 91, 5033 (1994).
[10] W. J. McGill, Psychometrika, 19, 97 (1954).
[11] T. S. Han, Information and Control, 46, 26 (1980).
[12] A. J. Bell, Proceedings of ICA2003, 921 (2003).
[13] T. J. Gawne and B. J. Richmond, J Neurosci, 13, 2758 (1993).
[14] R. W. Yeung, Information Theory and Network Coding (Springer, 2008).
[15] T. M. Cover and J. A. Thomas, Elements of Information Theory, 2nd ed. (Wiley-Interscience, 2006).
[16] S. Takano, Proc Jpn Acad, 50, 109 (1974).
[17] T. Tsujishita, Adv Appl Math, 16, 269 (1995).
[18] Z. Zhang and R. W. Yeung, IEEE Trans Inf Theory, 44, 1440 (1998).
[19] G. Rota, Probability Theory and Related Fields, 2, 340 (1964).
[20] R. P. Stanley, Enumerative Combinatorics, Vol. 1 (Cambridge Univ Press, 1997).
[21] L. Comtet, Advanced Combinatorics: The Art of Finite and Infinite Expansions (Springer, 1974).
[22] J. Pearl and G. Shafer, Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference (Morgan Kaufmann, 1988).
[23] N. Brenner, S. P. Strong, R. Koberle, W. Bialek, and R. de Ruyter van Steveninck, Neural Comput, 12, 1531 (2000).
[24] S. Panzeri, S. R. Schultz, A. Treves, and E. T. Rolls, Proc R Soc B, 266, 1001 (1999).
[25] E. Schneidman, W. Bialek, and M. J. Berry, J Neurosci, 23, 11539 (2003).
[26] P. E. Latham and S. Nirenberg, J Neurosci, 25, 5195 (2005).
[27] A. Jakulin and I. Bratko, arXiv preprint cs/0308002 (2003).
[28] D. Anastassiou, Mol Syst Biol, 3, 1 (2007).
[29] D. Wiedemann, Order, 8, 5 (1991).
[30] N. J. Cerf and C. Adami, Phys Rev A, 55, 3371 (1997).
[31] H. Matsuda, Phys Rev E, 62, 3096 (2000).
[32] I. Gat and N. Tishby, Advances in NIPS, 111 (1999).
[33] N. S. Narayanan, E. Y. Kimchi, and M. Laubach, J Neurosci, 25, 4207 (2005).
[34] J. H. Moore, J. C. Gilbert, C. T. Tsai, F. T. Chiang, T. Holden, N. Barney, and B. C. White, J Theor Biol, 241, 252 (2006).
[35] P. Chanda, A. Zhang, D. Brazeau, L. Sucheston, J. L. Freudenheim, C. Ambrosone, and M. Ramanathan, Am J Hum Genet, 81, 939 (2007).
[36] D. J. C. MacKay, Information Theory, Inference, and Learning Algorithms (Cambridge Univ Press, 2003).
[37] L. Orlóci, M. Anand, and V. D. Pillar, Community Ecol, 3, 217 (2002).
[38] V. Vedral, Rev Mod Phys, 74, 197 (2002).
[39] S. Amari, IEEE Trans Inf Theory, 47, 1701 (2001).
[40] G. Gediga and I. Düntsch, in Rough-Neural Computing, edited by S. K. Pal, L. Polkowski, and A. Skowron (Physica Verlag, Heidelberg, 2003).
[41] M. Grabisch and M. Roubens, Int J Game Theory, 28, 547 (1999).
[42] M. R. DeWeese and M. Meister, Network, 10, 325 (1999).
[43] D. A. Butts, Network, 14, 177 (2003).
[44] B. A. Davey and H. A. Priestley, Introduction to Lattices and Order, 2nd ed. (Cambridge Univ Press, 2002).
[45] G. A. Grätzer, General Lattice Theory, 2nd ed. (Birkhäuser, 2003).
[46] J. Crampton and G. Loizou, "Two partial orders on the set of antichains," (2000), research note.
[47] J. Crampton and G. Loizou, International Mathematical Journal, 1, 223 (2001).
[48] S. M. Ross, A First Course in Probability, 8th ed. (Prentice Hall, 2009).
Appendix A: Measures of Specific Information
Measures of specific information are discussed in [42] in
the context of quantifying the information that specific neural
responses provide about a stimulus ensemble. For random
variables S and R, representing stimuli and responses, respectively, the information that R provides about S is decomposed
according to
I(S; R) = \sum_{r \in R} p(r) \, i_r(r) \quad (A1)
and
i_r(r) = H(S) − H(S|r) (A2)
where H(S) is the entropy of S and i_r(r) is the response-specific information associated with each r ∈ R. The response-specific information quantifies the change in uncertainty about
S when response r is observed. In [42], it is shown that ir
is the unique measure of specific information that satisfies
additivity, though it is also possible for ir to be negative.
To distinguish the different role played by stimuli as opposed
to responses, an alternative measure of specific information
is proposed in [43]. The stimulus-specific information for an
outcome s ∈ S is defined as
i_s(s) = \sum_{r \in R} p(r|s) \, i_r(r). \quad (A3)
Like the response-specific information, the weighted average of
is (s) gives the mutual information I(S; R). Stimulus-specific
information quantifies the extent to which a particular stimulus
s tends to evoke responses that are informative about the entire
ensemble S (responses with high values for ir ).
Finally, both [42] and [43] also discuss I(S = s; R), the
measure of specific information used here (Eq. (2)). In [43],
I(S = s; R) is described as the reduction in surprise of a particular stimulus s gained from each response, averaged over all
responses associated with that stimulus. Thus, whereas is (s)
weights each response r according to the information that it
contributes about the entire ensemble S, I(S = s; R) quantifies only the information that R provides about the particular
outcome S = s. In [42], it is proven that I(S = s; R) is the
only measure of specific information that is strictly nonnegative.
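To make the distinctions between these measures concrete, the following sketch (our own toy distribution and function names) evaluates i_r, i_s, and I(S = s; R) for a small example: i_r comes out negative for one response, I(S = s; R) is nonnegative for every outcome, and all three families average back to I(S; R).

```python
from math import log2

# Toy joint distribution p(s, r) (our own example): S in {0, 1, 2}, R in {0, 1}.
p = {(0, 0): 1/4, (1, 0): 1/4, (2, 0): 1/4, (0, 1): 1/4}

def marginal(joint, pos):
    out = {}
    for outcome, prob in joint.items():
        out[outcome[pos]] = out.get(outcome[pos], 0.0) + prob
    return out

ps, pr = marginal(p, 0), marginal(p, 1)
H_S = -sum(q * log2(q) for q in ps.values())

def i_r(r):
    """Response-specific information of Eq. (A2): H(S) - H(S|r)."""
    post = {s: p.get((s, r), 0.0) / pr[r] for s in ps}
    return H_S + sum(q * log2(q) for q in post.values() if q > 0)

def i_s(s):
    """Stimulus-specific information of Eq. (A3)."""
    return sum((p.get((s, r), 0.0) / ps[s]) * i_r(r) for r in pr)

def spec_info(s):
    """Specific information of Eq. (2), for the single source R."""
    return sum((p[(s, r)] / ps[s]) *
               (log2(1 / ps[s]) - log2(1 / (p[(s, r)] / pr[r])))
               for r in pr if p.get((s, r), 0.0) > 0)

print([round(i_r(r), 3) for r in pr])        # i_r(0) is negative here
print([round(i_s(s), 3) for s in ps])        # i_s inherits i_r and is negative for s = 1, 2 here
print([round(spec_info(s), 3) for s in ps])  # always nonnegative (Theorem 1)
# Each family averages back to I(S; R):
print(round(sum(pr[r] * i_r(r) for r in pr), 3),
      round(sum(ps[s] * i_s(s) for s in ps), 3),
      round(sum(ps[s] * spec_info(s) for s in ps), 3))
```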
Appendix B: Lattice Theory Definitions
Here we review only the basic concepts of lattice theory
needed for supporting proofs. For a thorough treatment, see
[44, 45].
Definition 1. A pair ⟨X, ≤⟩ is a partially ordered set or poset if ≤ is a binary relation on X that is reflexive, transitive and antisymmetric.
Definition 2. Let Y ⊆ X. Then a ∈ Y is a maximal element in Y if for all b ∈ Y, a ≤ b ⇒ a = b. A minimal element is defined dually. We denote the set of maximal elements of Y by \overline{Y} and the set of minimal elements by \underline{Y}.
FIG. S1: Basic lattice-theoretic concepts. (A) Hasse diagram of the lattice ⟨P(X), ⊆⟩ for X = {1, 2, 3}. (B) An example of a chain (blue nodes) and an antichain (red nodes). (C) The top ⊤ and bottom ⊥ are shown in gray. Green nodes correspond to {1, 2, 3}−, the set of elements covered by {1, 2, 3}. The orange region represents ↓{1, 3}, the down-set of {1, 3}.
Definition 3. Let ⟨X, ≤⟩ be a poset, and let Y ⊆ X. An element x ∈ X is an upper bound for Y if for all y ∈ Y, y ≤ x. A lower bound for Y is defined dually.
Definition 4. An element x ∈ X is the least upper bound or supremum for Y, denoted sup Y, if x is an upper bound of Y and x ≤ z for every upper bound z of Y. The greatest lower bound or infimum for Y, denoted inf Y, is defined dually.
Definition 5. A poset ⟨X, ≤⟩ is a lattice if, and only if, for all x, y ∈ X both inf{x, y} and sup{x, y} exist in X. If ⟨X, ≤⟩ is a lattice, it is common to write x ∧ y, the meet of x and y, and x ∨ y, the join of x and y, for inf{x, y} and sup{x, y}, respectively. For Y ⊆ X, we use ⋀Y and ⋁Y to denote the meet and join of all elements in Y, respectively.
Definition 6. For a, b ∈ X, we say that a is covered by b (or b covers a) if a < b and a ≤ c < b ⇒ a = c. The set of elements that are covered by b is denoted by b−.

The classic example of a lattice is the power set of a set X ordered by inclusion, denoted ⟨P(X), ⊆⟩. Lattices are naturally represented by Hasse diagrams, in which nodes correspond to members of X and an edge exists between elements x and y if x covers y. FIG. S1A depicts the Hasse diagram for the lattice ⟨P(X), ⊆⟩ with X = {1, 2, 3}.
Definition 7. If ⟨X, ≤⟩ is a poset, Y ⊆ X is a chain if for all a, b ∈ Y either a ≤ b or b ≤ a. Y is an antichain if a ≤ b only if a = b. FIG. S1B shows examples of a chain and an antichain.

Definition 8. If there exists an element ⊥ ∈ X with the property that ⊥ ≤ x for all x ∈ X, we call ⊥ the bottom element of X. The top element of X, denoted by ⊤, is defined dually.

Definition 9. For any x ∈ X, we define
↓x = {y ∈ X : y ≤ x} and ↓̇x = {y ∈ X : y < x},
where ↓x and ↓̇x are called the down-set and strict down-set of x, respectively.
FIG. S1C illustrates the concepts of top and bottom elements,
covering relations, and down-sets.
Appendix C: A(R) and the Redundancy Lattice
Formally, A(R) corresponds to the set of antichains on the lattice ⟨P(R), ⊆⟩ (excluding the empty set). The cardinality of this set for |R| = n − 1 is given by the (n − 1)-th Dedekind number, which for n = 2, 3, 4, . . . is 1, 4, 18, 166, 7579, . . . ([21], p. 273). The fact that ⟨A(R), ⪯⟩ forms a lattice, which we call the redundancy lattice, is proven in [46], where the corresponding lattice is denoted ⟨A(X), ⪯′⟩ (see also [47]).
As shown in [46], the meet (∧) and join (∨) for this lattice are given by

α ∧ β = α ∪ β (A4)

and

α ∨ β = ↑α ∩ ↑β. (A5)
Appendix D: Supporting Proofs
Theorem 1. I(S = s; A) is nonnegative.
Proof.
I(S = s; A) = D(p(a|s) ‖ p(a)) ≥ 0
where D is the Kullback-Leibler distance and the last step
follows from the information inequality ([15], p. 26).
Lemma 1. I(S = s; A) increases monotonically on the lattice ⟨P(R), ⊆⟩.
Proof. Consider A, B with A ⊂ B ⊆ R. Let C = B \ A ≠ ∅. Then we have

I(S = s; B) - I(S = s; A)
= \sum_b p(b|s) \log \frac{p(s,b)}{p(s)p(b)} - \sum_a p(a|s) \log \frac{p(s,a)}{p(s)p(a)}
= \sum_a \sum_c p(a,c|s) \log \frac{p(s,a,c)}{p(s)p(a,c)} - \sum_a \sum_c p(a,c|s) \log \frac{p(s,a)}{p(s)p(a)}
= \sum_a \sum_c p(a,c|s) \log \frac{p(s,c|a)}{p(s|a)p(c|a)}
= \sum_a p(a|s) \sum_c p(c|a,s) \log \frac{p(c|a,s)}{p(c|a)}
= \sum_a p(a|s) \, D(p(c|a,s) \,\|\, p(c|a)) \ge 0.
Theorem 2. Imin increases monotonically on the lattice ⟨A(R), ⪯⟩.

Proof. We proceed by contradiction. Assume there exists α, β ∈ A(R) with α ≺ β and Imin(S; β) < Imin(S; α). Then, from Eq. (3), there must exist B ∈ β such that I(S = s; B) < I(S = s; A) for some outcome s ∈ S and for all A ∈ α. Thus, from Lemma 1, there does not exist A ∈ α such that A ⊆ B. However, since α ≺ β by assumption, there exists A ∈ α such that A ⊆ B.

Theorem 3. ΠR can be stated in closed form as

\Pi_R(S;\alpha) = I_{\min}(S;\alpha) - \sum_{k=1}^{|\alpha^-|} (-1)^{k-1} \sum_{\substack{\mathcal{B} \subseteq \alpha^- \\ |\mathcal{B}| = k}} I_{\min}\Big(S; \bigwedge \mathcal{B}\Big). \quad (A6)

Proof. For B ⊆ A(R), define the set-additive function f as

f(\mathcal{B}) = \sum_{\beta \in \mathcal{B}} \Pi_R(S;\beta).

From Eq. (6), it follows that Imin(S; α) = f(↓α) and

\Pi_R(S;\alpha) = f(\downarrow \alpha) - f(\dot{\downarrow} \alpha) = f(\downarrow \alpha) - f\Big(\bigcup_{\beta \in \alpha^-} \downarrow \beta\Big).

Applying the principle of inclusion-exclusion ([20], p. 64), we have

= f(\downarrow \alpha) - \sum_{k=1}^{|\alpha^-|} (-1)^{k-1} \sum_{\substack{\mathcal{B} \subseteq \alpha^- \\ |\mathcal{B}| = k}} f\Big(\bigcap_{\gamma \in \mathcal{B}} \downarrow \gamma\Big)

and it is a basic result of lattice theory that for any lattice L and A ⊆ L, \bigcap_{a \in A} \downarrow a = \downarrow(\bigwedge A) ([44], p. 57), so we have

= f(\downarrow \alpha) - \sum_{k=1}^{|\alpha^-|} (-1)^{k-1} \sum_{\substack{\mathcal{B} \subseteq \alpha^- \\ |\mathcal{B}| = k}} f\Big(\downarrow\Big(\bigwedge \mathcal{B}\Big)\Big)

= I_{\min}(S;\alpha) - \sum_{k=1}^{|\alpha^-|} (-1)^{k-1} \sum_{\substack{\mathcal{B} \subseteq \alpha^- \\ |\mathcal{B}| = k}} I_{\min}\Big(S; \bigwedge \mathcal{B}\Big).

Lemma 2 (Maximum-minimums identity). Let A be a set of numbers. The maximum-minimums identity states that

\max A = \sum_{k=1}^{|A|} (-1)^{k-1} \sum_{\substack{B \subseteq A \\ |B| = k}} \min B,

or conversely,

\min A = \sum_{k=1}^{|A|} (-1)^{k-1} \sum_{\substack{B \subseteq A \\ |B| = k}} \max B.

Proof. It is proven in a number of introductory texts, e.g. [48].

Theorem 4. ΠR can be stated in closed form as

\Pi_R(S;\alpha) = I_{\min}(S;\alpha) - \sum_s p(s) \max_{\beta \in \alpha^-} \min_{B \in \beta} I(S = s; B). \quad (A7)

Proof. Combining Eqs. (A6) and (3) yields

\Pi_R(S;\alpha) = I_{\min}(S;\alpha) - \sum_{k=1}^{|\alpha^-|} (-1)^{k-1} \sum_{\substack{\mathcal{B} \subseteq \alpha^- \\ |\mathcal{B}| = k}} \sum_s p(s) \min_{B \in \bigwedge \mathcal{B}} I(S = s; B)

= I_{\min}(S;\alpha) - \sum_s p(s) \sum_{k=1}^{|\alpha^-|} (-1)^{k-1} \sum_{\substack{\mathcal{B} \subseteq \alpha^- \\ |\mathcal{B}| = k}} \min_{B \in \bigwedge \mathcal{B}} I(S = s; B)

and by Lemma 1 and Eq. (A4),

= I_{\min}(S;\alpha) - \sum_s p(s) \sum_{k=1}^{|\alpha^-|} (-1)^{k-1} \sum_{\substack{\mathcal{B} \subseteq \alpha^- \\ |\mathcal{B}| = k}} \min_{\beta \in \mathcal{B}} \min_{B \in \beta} I(S = s; B).

Then, applying Lemma 2 we have

= I_{\min}(S;\alpha) - \sum_s p(s) \max_{\beta \in \alpha^-} \min_{B \in \beta} I(S = s; B).

Theorem 5. ΠR is nonnegative.

Proof. If α = ⊥, ΠR(S; α) = Imin(S; α) and ΠR(S; α) ≥ 0 follows from the nonnegativity of Imin. To prove it for α ≠ ⊥, we proceed by contradiction. Assume there exists α ∈ A(R) \ {⊥} such that ΠR(S; α) < 0. Applying Eq. (3) to Theorem 4 and combining summations yields

\Pi_R(S;\alpha) = \sum_s p(s) \Big\{ \min_{A \in \alpha} I(S = s; A) - \max_{\beta \in \alpha^-} \min_{B \in \beta} I(S = s; B) \Big\}.

From this equation, it is clear that there must exist β ∈ α− such that for all B ∈ β, I(S = s; A) < I(S = s; B) for some outcome s ∈ S and some A ∈ α. Thus, from Lemma 1, there does not exist B ∈ β such that B ⊆ A. However, since β ≺ α by definition, there exists B ∈ β such that B ⊆ A.
Appendix E: Supplementary Figures
FIG. S2: Constructing a PI-diagram for 4 variables. (A) For each element Ri ∈ R there is a region corresponding to I(S; Ri). (B-E) For each subset A of R with two or more elements, I(S; A) is depicted as a region containing I(S; A) for all A ∈ A but not coextensive with ⋃_{A∈A} I(S; A). Regions of the diagram intersect generically, representing all possibilities for redundancy.
FIG. S3: Computing the PI-decomposition for 3-variable interaction information. (A-B) Term-by-term calculation of
I(S; R1 ; R2 ) = I(S; R1 , R2 ) − I(S; R1 ) − I(S; R2 ). Blue and red regions represent PI-terms that are added and subtracted, respectively.
FIG. S4: Computing the PI-decomposition for 4-variable interaction information. (A-F) Term-by-term calculation of
I(S; R1 ; R2 ; R3 ) = I(S; R1 , R2 , R3 ) − I(S; R1 , R2 ) − I(S; R1 , R3 ) − I(S; R2 , R3 ) + I(S; R1 ) + I(S; R2 ) + I(S; R3 ). Blue and red
regions represent PI-terms that are added and subtracted, respectively. Green regions represent PI-terms that are subtracted twice.