
Nonnegative Decomposition of Multivariate Information


Paul L. Williams* and Randall D. Beer
Cognitive Science Program and School of Informatics and Computing, Indiana University, Bloomington, Indiana 47406 USA
(Dated: April 16, 2010)
arXiv:1004.2515v1 [cs.IT] 14 Apr 2010
* [email protected]

Of the various attempts to generalize information theory to multiple variables, the most widely utilized, interaction information, suffers from the problem that it is sometimes negative. Here we reconsider from first principles the general structure of the information that a set of sources provides about a given variable. We begin with a new definition of redundancy as the minimum information that any source provides about each possible outcome of the variable, averaged over all possible outcomes. We then show how this measure of redundancy induces a lattice over sets of sources that clarifies the general structure of multivariate information. Finally, we use this redundancy lattice to propose a definition of partial information atoms that exhaustively decompose the Shannon information in a multivariate system in terms of the redundancy between synergies of subsets of the sources. Unlike interaction information, the atoms of our partial information decomposition are never negative and always support a clear interpretation as informational quantities. Our analysis also demonstrates how the negativity of interaction information can be explained by its confounding of redundancy and synergy.

PACS numbers: 89.70.-a, 87.19.lo, 87.10.Vg, 89.75.-k
Keywords: information theory, interaction information, redundancy, synergy, multivariate interaction

I. INTRODUCTION

From its roots in Shannon’s seminal work on reliability and coding in communication systems, information theory has grown into a ubiquitous general tool for the analysis of complex systems, with application in neuroscience, genetics, physics, machine learning, and many other areas. Somewhat surprisingly, the vast majority of work in information theory concerns only the simplest possible case: the information that a single variable provides about another. This is quantified by Shannon’s mutual information, which is by far the most widely used concept from information theory [1]. The second most popular concept, conditional mutual information, considers interactions between multiple variables in only the most rudimentary sense: it seeks to eliminate the influence of other variables in order to isolate the dependency between two variables of interest. In contrast, many of the most interesting and challenging scientific questions, such as many-body problems in physics [2], n-person games in game theory [3], and population coding in neuroscience [4, 5], involve understanding the structure of interactions between three or more variables.

The two main attempts to generalize information theory to multivariate interactions are the total correlation proposed by Watanabe [6] (also known as the multivariate constraint [7], multiinformation [8], and integration [9]) and the interaction information of McGill [10] (also known as multiple mutual information [11], co-information [12], and synergy [13]). The total correlation, as its name suggests, measures the total amount of dependency between a set of variables as a single monolithic quantity. Thus, the total correlation does not provide any insight into how dependencies are distributed amongst the variables, i.e., it says nothing about the structure of multivariate information.
In contrast, interaction information was proposed as a measure of the amount of information bound up in a set of variables beyond that which is present in any subset of those variables. Thus, entropy and mutual information correspond to first- and second-order interaction information, respectively, and together with its third-, fourth-, and higher-order variants, interaction information provides a way of characterizing the structure of multivariate information. Interaction information is also the natural generalization of mutual information when Shannon entropy is viewed as a signed measure on information diagrams [12, 14, 15]. However, the wider use of interaction information has largely been hampered by the “odd” [12] and “unfortunate” [15] property that, for three or more variables, the interaction information can be negative (see also [11, 14, 16–18]). For information as it is commonly understood, it is entirely unclear what it means for one variable to provide “negative information” about another. Moreover, as we demonstrate below, the confusing property of negativity is actually symptomatic of deeper problems regarding the interpretation of interaction information for larger systems. As a result, there remains no generally accepted extension of information theory for characterizing the structure of multivariate interactions.

Here we formulate a new perspective on the structure of multivariate information. Beginning from first principles, we consider the general structure of the information that a set of sources provides about a given variable. We propose a new definition of redundancy as the minimum information that any source provides about each outcome of the variable, averaged over all possible outcomes. Then we show how this definition can be used to exhaustively decompose the Shannon information in a multivariate system into partial information atoms consisting of redundancies between synergies of subsets of the sources. We also demonstrate that partial information forms a lattice that clarifies the general structure of multivariate information. Unlike interaction information, the atoms of our partial information decomposition are never negative and always support a clear interpretation as informational quantities. Finally, our analysis also demonstrates how the negativity of interaction information can be explained by its confounding of redundant and synergistic interactions.

II. THE STRUCTURE OF MULTIVARIATE INFORMATION

Suppose we are given a random variable S and a random vector R = {R1, R2, ..., Rn−1}.
Then our goal is to decompose the information that R provides about S in terms of the partial information contributed either individually or jointly by various subsets of R. For example, in a neuroscience context, S may correspond to a stimulus that takes on different values and R to the evoked responses of different neurons. In this case, we would like to quantify the information that the joint neural response provides about the stimulus, and to distinguish between information due to responses of individual neurons versus combinations of them [5, 13].

Consider the simplest case of a system with three variables. How much total information does R = {R1, R2} provide about S? How do R1 and R2 contribute to the total information? The answer to the first question is given by the mutual information I(S; R1, R2), while for the latter we can identify three distinct possibilities. First, R1 may provide information that R2 does not, or vice versa (unique information). For example, if R1 is a copy of S and R2 is a degenerate random variable, then the total information from R reduces to the unique information from R1. Second, R1 and R2 may provide the same or overlapping information (redundancy). For example, if R1 and R2 are both copies of S then they redundantly provide complete information. Third, the combination of R1 and R2 may provide information that is not available from either alone (synergy). A well-known example for binary variables is the exclusive-OR function S = R1 ⊕ R2, in which case R1 and R2 individually provide no information but together provide complete information. Thus, intuitively, the total information from R decomposes into unique information from R1 and R2, redundant information shared by R1 and R2, and synergistic information contributed jointly by R1 and R2 (FIG. 1).

FIG. 1. Structure of multivariate information for 3 variables. Labelled regions correspond to unique information (Unq), redundancy (Rdn), and synergy (Syn).

In sum, for three variables we can identify unique information, redundancy, and synergy as the basic atoms of multivariate information. In fact, as later developments will clarify, unique information is best thought of as a degenerate form of redundancy or synergy, so that redundancy and synergy alone constitute the basic building blocks of multivariate information. In particular, we will find that various combinations of redundancy and synergy, which may at first sound paradoxical, play a fundamental role in structuring multivariate information in higher dimensions. Next we proceed to formalize these ideas, beginning with the problem of defining a measure of redundancy.

III. MEASURING REDUNDANCY

Let A1, A2, ..., Ak be nonempty and potentially overlapping subsets of R, which we call sources. How can we quantify the redundant information that all sources provide about S? Of course, the information supplied by each Ai is given simply by I(S; Ai), the mutual information between S and Ai. However, it is crucial to note that mutual information is actually a measure of average or expected information, where the expected value is taken over outcomes of the random variables. Thus, for instance, two sources might provide the same average amount of information, while also providing information about different outcomes of S. Stated formally, the information provided by a source A can be written as

    I(S; A) = Σ_s p(s) I(S = s; A)    (1)

where the specific information I(S = s; A) quantifies the information associated with a particular outcome s of S. Various definitions of specific information have been proposed to quantify different relationships between S and A (see Appendix A), but for our purposes the most useful is

    I(S = s; A) = Σ_a p(a|s) [ log(1/p(s)) − log(1/p(s|a)) ].    (2)

The term log(1/p(s)) is called the surprise of s, so I(S = s; A) is the average reduction in surprise of s given knowledge of A. In other words, I(S = s; A) quantifies the information that A provides about each particular outcome s ∈ S, while I(S; A) is the expected value of this quantity over all outcomes of S.

Given these considerations, a natural measure of redundancy is the expected value of the minimum information that any source provides about each outcome of S, or

    Imin(S; {A1, A2, ..., Ak}) = Σ_s p(s) min_{Ai} I(S = s; Ai).    (3)
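Before turning to the properties of Imin, the following minimal Python sketch illustrates Eqs. (2) and (3) numerically, using the exclusive-OR example from Section II. The dictionary encoding of the joint distribution and all function names are illustrative choices for this sketch, not constructs from the paper.

```python
from collections import defaultdict
from math import log2

# Joint distribution p(s, r1, r2) as a dict; here the XOR example S = R1 xor R2.
# (Illustrative encoding: position 0 is S, positions 1 and 2 are R1 and R2.)
pjoint = {(r1 ^ r2, r1, r2): 0.25 for r1 in (0, 1) for r2 in (0, 1)}

def marginal(p, idx):
    """Marginal distribution over the variables at positions idx."""
    out = defaultdict(float)
    for outcome, prob in p.items():
        out[tuple(outcome[i] for i in idx)] += prob
    return out

def specific_info(p, s, source_idx):
    """I(S=s; A) of Eq. (2): expected reduction in surprise of s given the source."""
    p_s = marginal(p, (0,))[(s,)]
    p_a = marginal(p, source_idx)
    p_sa = marginal(p, (0,) + source_idx)
    total = 0.0
    for (s2, *a), prob in p_sa.items():
        if s2 != s:
            continue
        a = tuple(a)
        p_a_given_s = prob / p_s            # p(a|s)
        p_s_given_a = prob / p_a[a]         # p(s|a)
        total += p_a_given_s * (log2(1 / p_s) - log2(1 / p_s_given_a))
    return total

def i_min(p, sources):
    """Imin of Eq. (3): expected minimum specific information over the sources."""
    return sum(ps * min(specific_info(p, s, src) for src in sources)
               for (s,), ps in marginal(p, (0,)).items())

# For XOR, each source {R1} or {R2} alone carries no information about S:
print(i_min(pjoint, [(1,), (2,)]))   # 0.0
print(i_min(pjoint, [(1, 2)]))       # 1.0  (self-redundancy = I(S; R1,R2))
```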
Imin captures the idea that redundancy is the information common to all sources (the minimum information that any source provides), while taking into account that sources may provide information about different outcomes of S. Note that, like the mutual information, Imin is also an expected value of specific information terms.

Imin also has several important properties that further support its interpretation as a measure of redundancy. First, Imin is nonnegative, a property that follows directly from the nonnegativity of specific information (see Appendix D). Second, Imin is less than or equal to I(S; Ai) for all Ai’s, with equality if and only if I(S = s; Ai) = I(S = s; Aj) for all i and j and all s ∈ S. Thus, as one would hope, the amount of redundant information is bounded by the information provided by each source, with equality if and only if all sources provide the exact same information about S. Finally, and closely related to the previous property, for a given source A the amount of information redundant with A is maximal for Imin(S; {A}) = I(S; A). In other words, redundant information is maximized by the “self-redundancy,” analogous to the property that mutual information is maximized by the self-information I(S; S) = H(S).

What are the distinct ways in which collections of sources might contribute redundant information? Formally, answering this question means identifying the domain of Imin. Thus far, we have assumed that the natural domain is the collection of all possible sets of sources, but in fact this can be greatly simplified. To illustrate, consider two sources, A and B, with A a subset of B. Clearly, any information provided by A is also provided by B, so the redundancy between A and B reduces to the self-redundancy for A,

    Imin(S; {A, B}) = Imin(S; {A}) = I(S; A).

Furthermore, for any source C, it follows that Imin(S; {A, B, C}) = Imin(S; {A, C}). Extending this idea, for any collection of sources where some are supersets of others, the redundancy for that collection is equivalent to the redundancy with all supersets removed. Thus, the domain for Imin can be reduced to the collection of all sets of sources such that no source is a superset of any other. Formally, this set can be written as

    A(R) = { α ∈ P1(P1(R)) : ∀ Ai, Aj ∈ α, Ai ⊄ Aj },    (4)

where P1(R) = P(R) \ {∅} is the set of all nonempty subsets of R. Henceforth, we will denote elements of A(R), corresponding to collections of sources, with bracketed expressions containing only the indices for each source. For instance, {{R1, R2}} will be {12}, {{R1}, {R2, R3}} will be {1}{23}, and so forth.
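For small systems, the domain A(R) of Eq. (4) can be enumerated by brute force: generate every nonempty collection of nonempty subsets of R and keep only those collections in which no source contains another. A rough sketch (the helper names are illustrative; the resulting counts match the Dedekind numbers quoted later in the text and in Appendix C):

```python
from itertools import chain, combinations

def nonempty_subsets(items):
    items = list(items)
    return [frozenset(c) for r in range(1, len(items) + 1)
            for c in combinations(items, r)]

def antichains(R):
    """A(R) of Eq. (4): collections of sources, no source a superset of another."""
    sources = nonempty_subsets(R)
    result = []
    for alpha in chain.from_iterable(combinations(sources, k)
                                     for k in range(1, len(sources) + 1)):
        if all(not a < b and not b < a for a in alpha for b in alpha if a != b):
            result.append(frozenset(alpha))
    return result

# Cardinalities follow the Dedekind numbers 1, 4, 18, 166, ... cited in the paper:
print(len(antichains({1})))        # 1
print(len(antichains({1, 2})))     # 4   -> {1}, {2}, {12}, {1}{2}
print(len(antichains({1, 2, 3})))  # 18
```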
The possibilities for redundancy are also naturally structured, which is shown by extending the same line of reasoning to define an ordering ⪯ on the elements of A(R). Consider two collections of sources, α, β ∈ A(R), where for each source B ∈ β there exists a source A ∈ α with A a subset of B. This means that for each source B ∈ β there is a source A ∈ α such that A provides no more information than B. The redundant information shared by all B ∈ β must therefore at least include any redundant information shared by all A ∈ α. Thus, we can define a partial order over the elements of A(R) such that one element (collection of sources) is considered to precede another if and only if the latter provides any redundant information that the former provides. The ordering relation ⪯ is formally defined as

    ∀α, β ∈ A(R), (α ⪯ β ⇔ ∀B ∈ β, ∃A ∈ α, A ⊆ B).    (5)

Applying this ordering to the elements of A(R) produces a redundancy lattice, in which a higher element provides at least as much redundant information as a lower one (FIG. 2; see Appendix C).

FIG. 2. Redundancy lattice for (A) 3 and (B) 4 variables.

The redundancy lattice provides a wealth of insight into the structure of redundancy. For instance, from the redundancy lattice it is possible to read off some of the properties of Imin noted earlier. The property that redundancy for a source is maximized by the self-redundancy can be seen from the fact that any node corresponding to an individual source appears higher in the redundancy lattice than any other node involving that source. For example, in FIG. 2B, the node labeled {12}, corresponding to the self-redundancy for the source {R1, R2}, occurs higher than nodes labeled {12}{13}, {12}{13}{23}, and {3}{12}. Another property of Imin that can be seen from these diagrams relates to the top and bottom elements of the lattice. The top element corresponds to the self-redundancy for R, reflecting the fact that Imin is bounded from above by the total amount of information provided by R. At the other end of the spectrum, the bottom element corresponds to the redundant information that each individual element of R provides, with all other possibilities for redundancy falling between these two extremes.

IV. PARTIAL INFORMATION DECOMPOSITION

The redundant information associated with each node of the redundancy lattice includes, but is not limited to, the redundant information provided by all nodes lower in the lattice. Thus, moving from node to node up the lattice, Imin can be thought of as a kind of “cumulative information function,” effectively integrating the information provided by increasingly inclusive collections of sources. Next, we derive an inverse of Imin called the partial information function (PI-function). Whereas Imin quantifies cumulative information, the PI-function measures the partial information contributed uniquely by each particular collection of sources. This partial information will form the atoms into which we decompose the total information that R provides about S.

For a collection of sources α ∈ A(R), the PI-function, denoted ΠR, is defined implicitly by

    Imin(S; α) = Σ_{β ⪯ α} ΠR(S; β).    (6)

Formally, ΠR corresponds to the Möbius inverse of Imin [19, 20]. From this relationship, it is clear that ΠR can be calculated recursively as

    ΠR(S; α) = Imin(S; α) − Σ_{β ≺ α} ΠR(S; β).    (7)

Put into words, ΠR(S; α) quantifies the information provided redundantly by the sources of α that is not provided by any simpler collection of sources (i.e., any β lower than α on the redundancy lattice). In Appendix D, it is shown that ΠR can be written in closed form as

    ΠR(S; α) = Imin(S; α) − Σ_s p(s) max_{β ∈ α⁻} min_{B ∈ β} I(S = s; B)    (8)

where α⁻ represents the nodes immediately below α in the redundancy lattice. From this formulation, it is readily shown that ΠR is nonnegative (see Appendix D), and thus can be naturally interpreted as an informational quantity associated with the sources of α.

The decomposition of mutual information into a sum of PI-terms follows from

    I(S; A) = Imin(S; {A}) = Σ_{β ⪯ {A}} ΠR(S; β).    (9)

For the 3-variable case R = {R1, R2}, Equation (9) yields

    I(S; R1) = ΠR(S; {1}) + ΠR(S; {1}{2})    (10)

and

    I(S; R1, R2) = ΠR(S; {1}) + ΠR(S; {2}) + ΠR(S; {1}{2}) + ΠR(S; {12}).    (11)
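Given Imin and the ordering of Eq. (5), the recursion of Eq. (7) can be evaluated directly by walking the redundancy lattice. The sketch below is a naive implementation that reuses the antichains and i_min helpers (and the XOR distribution pjoint) from the earlier sketches; it is illustrative code rather than the authors' implementation.

```python
def precedes(alpha, beta):
    """Eq. (5): alpha precedes beta iff every source in beta contains some source in alpha."""
    return all(any(a <= b for a in alpha) for b in beta)

def pi_atoms(p, R):
    """Partial information Pi_R(S; alpha) for every alpha in A(R), via the recursion of Eq. (7)."""
    lattice = antichains(R)
    atoms = {}

    def pi(alpha):
        if alpha not in atoms:
            below = [beta for beta in lattice if beta != alpha and precedes(beta, alpha)]
            sources = [tuple(sorted(a)) for a in alpha]
            atoms[alpha] = i_min(p, sources) - sum(pi(beta) for beta in below)
        return atoms[alpha]

    for alpha in lattice:
        pi(alpha)
    return atoms

# For the XOR example, every atom is zero except the synergy term {12}:
for alpha, value in pi_atoms(pjoint, {1, 2}).items():
    print([sorted(a) for a in alpha], round(value, 3))
```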
The relationship between these equations can be represented as a partial information (PI) diagram (FIG. 3A), which illustrates the way in which the total information that R provides about S is distributed across various combinations of sources. Furthermore, comparing this diagram with FIG. 1 makes immediately clear the meaning of each partial information term. First, from Equation (8), we have that ΠR(S; {1}{2}) = Imin(S; {1}{2}), which, from the definition of Imin, corresponds to the redundancy for R1 and R2. The unique information for R1 is given by ΠR(S; {1}) = I(S; R1) − Imin(S; {1}{2}), which is the total information from R1 minus the redundancy, and likewise for R2. Finally, the additional information provided by the combination of R1 and R2 is given by ΠR(S; {12}), corresponding to their synergy.

To fix ideas, consider the example in FIG. 4A. From the symmetry of the distribution, it is clear that R1 and R2 must provide the same amount of information about S. Indeed, this is easily verified, with I(S; R1) = I(S; R2) = −(1/3) log(1/3) − (2/3) log(2/3). However, it is also clear that R1 and R2 provide information about different outcomes of S. In particular, given knowledge of R1, one can determine conclusively whether or not outcome S = 2 occurs (which is not the case for R2), and likewise for R2 and outcome S = 1. This feature is captured by ΠR(S; {1}) = ΠR(S; {2}) = 1/3, indicating that R1 and R2 each provide 1/3 bits of unique information about S. The redundant information, ΠR(S; {1}{2}) = log 3 − log 2, captures the fact that knowledge of either R1 or R2 reduces uncertainty about S from three equally likely outcomes to two. Finally, R1 and R2 also provide 1/3 bits of synergistic information, i.e., ΠR(S; {12}) = 1/3. This value reflects the fact that R1 and R2 together uniquely determine whether or not outcome S = 0 occurs, which is not true for R1 or R2 alone.

Note that, unlike mutual information or interaction information, partial information is not symmetric. For instance, the synergistic information that R1 and R2 provide about S is not in general equal to the synergistic information that S and R2 provide about R1. This property is also illustrated by the example in FIG. 4A. Given knowledge of S, one can uniquely determine the outcome of R1 (and R2), so that S provides complete information about both. Thus, it is not possible for the combination of S and R2 to provide any additional synergistic information about R1, since there is no remaining uncertainty about R1 when S is known. In contrast, as was just noted, R1 and R2 provide 1/3 bits of synergistic information about S. This asymmetry accounts for our decision to focus on information about a particular variable S throughout, since in general the analysis will differ depending on the variable of interest. Note that total information is also asymmetric in this sense, i.e., in general I(S; R1, R2) ≠ I(R1; S, R2) (though, of course, it is symmetric in the sense that I(S; R1, R2) = I(R1, R2; S)).
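Returning to the example of FIG. 4A, the quoted values are easy to reproduce with the pi_atoms sketch above, assuming the figure places probability 1/3 on each of the outcomes (s, r1, r2) = (0,0,0), (1,0,1), (2,1,0) (an assumption on our part, since the figure itself is not reproduced here).

```python
from math import log2

# Assumed encoding of FIG. 4A: three equiprobable outcomes (s, r1, r2).
p_fig4a = {(0, 0, 0): 1/3, (1, 0, 1): 1/3, (2, 1, 0): 1/3}

def label(alpha):
    """Render a collection of sources in the paper's index notation, e.g. {1}{2}."""
    return "".join("{" + "".join(str(i) for i in sorted(a)) + "}"
                   for a in sorted(alpha, key=sorted))

for alpha, value in pi_atoms(p_fig4a, {1, 2}).items():
    print(label(alpha), round(value, 3))

# Expected output (order may vary), matching the values quoted in the text:
#   {1}{2}  0.585   redundancy = log2(3) - 1
#   {1}     0.333   unique information from R1
#   {2}     0.333   unique information from R2
#   {12}    0.333   synergy
# The four atoms sum to I(S; R1,R2) = log2(3) ~= 1.585 bits, and
# I(S; R1) = log2(3) - 2/3 ~= 0.918 bits splits as unique + redundancy.
```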
The general structure of PI-diagrams becomes clear when we consider the decomposition for four variables (FIG. 3B). First, note that all of the possibilities for three variables are again present for four. In particular, each element of R can provide unique information (regions labeled {1}, {2}, and {3}), information redundantly with one other variable ({1}{2}, {1}{3}, and {2}{3}), or information synergistically with one other variable ({12}, {13}, and {23}). Additionally, information can be provided redundantly by all three variables ({1}{2}{3}) or provided by their three-way synergy ({123}).

More interesting are the new kinds of terms representing combinations of redundancy and synergy. For instance, the regions marked {1}{23}, {2}{13}, and {3}{12} represent information that is available redundantly from either one variable considered individually or the other two considered together. Or, for instance, the region labeled {12}{13}{23} represents the information provided redundantly by the three possible two-way synergies. In general, the PI-atom for a collection of sources corresponds to the information provided redundantly by the synergies of all sources in the collection. This point also clarifies our earlier claim that unique information is best thought of as a degenerate case: unique information corresponds to the combination of first-order redundancy and first-order synergy.

FIG. 3. Partial information diagrams for (A) 3 and (B) 4 variables.

FIG. 4. Probability distributions for S ∈ {0, 1, 2} and R1, R2 ∈ {0, 1}. Black tiles represent equiprobable outcomes. White tiles are zero-probability outcomes.

In general, a PI-diagram for n variables, S and R = {R1, R2, ..., Rn−1}, consists of the following (see Fig. S2 in Appendix E). First, for each element Ri ∈ R there is a region corresponding to I(S; Ri). Then, for every subset A ⊆ R with two or more elements, I(S; A) is depicted as a region containing I(S; Ri) for each Ri ∈ A but not coextensive with ⋃_{Ri ∈ A} I(S; Ri). The difference between I(S; A) and ⋃_{Ri ∈ A} I(S; Ri) represents the synergy for A, the information gained from the combined knowledge of all elements in A that is not available from any subset. In addition, regions of the diagram intersect generically, representing all possibilities for redundancy. In total, a PI-diagram is composed of the (n − 1)-th Dedekind number [21] of PI-atoms, which is the same as the cardinality of A(R) (see Appendix C). As described above, each PI-atom represents the redundancy of synergies for a particular collection of sources, corresponding to one distinct way for the components of R to contribute information about S.

Finally, it is instructive to consider the relationship between the redundancy lattice and the PI-diagram for n variables. First, we note that Imin is analogous to set intersection for PI-diagrams, consistent with the idea of redundancy as overlapping information. Specifically, Imin(S; {A1, A2, ..., Ak}) corresponds to the region ⋂_i I(S; Ai). From this correspondence between Imin and set intersection, we can establish the following connection: for α, β ∈ A(R), α is lower than β in the redundancy lattice if and only if ⋂_{A ∈ α} I(S; A) is a subset of ⋂_{B ∈ β} I(S; B) in the PI-diagram. Consequently, the redundancy lattice and PI-diagram can be viewed as complementary representations of the same structure, with the PI-diagram a collapsed version of the redundancy lattice formed by embedding regions according to the lattice ordering.

V. WHY INTERACTION INFORMATION IS SOMETIMES NEGATIVE

We next show how PI-decomposition can be used to understand the conditions under which interaction information, the standard generalization of mutual information to multivariate interactions, is negative. The interaction information for three variables is given by

    I(S; R1; R2) = I(S; R1|R2) − I(S; R1)    (12)

and for n > 3 variables is defined recursively as

    I(S; R1; R2; ...; Rn−1) = I(S; R1; R2; ...; Rn−2 | Rn−1) − I(S; R1; R2; ...; Rn−2)    (13)

where the conditional interaction information is defined by simply including the conditioning in all terms of the original definition.
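Equation (12) can be checked numerically with the helpers introduced earlier (marginal and the joint-distribution dictionaries pjoint and p_fig4a); the sketch below uses the chain rule to rewrite it purely in terms of mutual information. This is an illustrative calculation, not code from the paper.

```python
from math import log2

def entropy(p, idx):
    """Shannon entropy (bits) of the variables at positions idx of the joint dict."""
    return -sum(q * log2(q) for q in marginal(p, idx).values() if q > 0)

def mutual_info(p, x, y):
    """I(X;Y) for the variable groups at positions x and y."""
    return entropy(p, x) + entropy(p, y) - entropy(p, x + y)

def interaction_info_3(p):
    """Eq. (12), rewritten with the chain rule:
       I(S;R1;R2) = I(S;R1|R2) - I(S;R1) = I(S;R1,R2) - I(S;R1) - I(S;R2)."""
    return (mutual_info(p, (0,), (1, 2))
            - mutual_info(p, (0,), (1,))
            - mutual_info(p, (0,), (2,)))

print(round(interaction_info_3(pjoint), 3))    # +1.0 for XOR: pure synergy
print(round(interaction_info_3(p_fig4a), 3))   # -0.252 for FIG. 4A:
# synergy minus redundancy = 1/3 - (log2(3) - 1), as Eq. (14) below makes explicit.
```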
Interaction information is symmetric for all permutations of its arguments, and is traditionally interpreted as the information shared by all n variables beyond that which is shared by any subset of those variables. For 3-variable interaction information, a positive value is naturally interpreted as indicating a situation in which any one variable of the system enhances the correlation between the other two. For example, a positive value for Equation (12) indicates that knowledge of R2 enhances the correlation between S and R1 (and likewise for all other variable permutations). Thus, in the terminology used here, a positive value for I(S; R1; R2) indicates the presence of synergy. On the other hand, a negative value for I(S; R1; R2) indicates a situation in which any one variable accounts for or “explains away” [22] the correlation between the other two. In other words, a negative value for I(S; R1; R2) indicates redundancy. Indeed, I(S; R1; R2) is a widely used measure of synergy in neuroscience, where it is interpreted in exactly this way [23–26].

The PI-decomposition for 3-variable interaction information (FIG. 5A; see also Fig. S3 in Appendix E) confirms this interpretation, with I(S; R1; R2) equal to the difference between the synergistic and the redundant information, i.e.,

    I(S; R1; R2) = ΠR(S; {12}) − ΠR(S; {1}{2}).    (14)

Thus, it is indeed the case that positive values indicate synergy and negative values indicate redundancy. However, PI-decomposition also makes clear that I(S; R1; R2) confounds redundancy and synergy, with the meaning of interaction information ambiguous for any system that exhibits a mixture of the two (cf. [27], who suggest the possibility of mixed redundancy and synergy, but without attempting to disentangle them). For instance, consider again the example in FIG. 4A. As described earlier, in this case R1 and R2 provide log 3 − log 2 bits of redundant information and 1/3 bits of synergistic information. Consequently, I(S; R1; R2) is negative because there is more redundancy than synergy, despite the fact that the system clearly exhibits synergistic interactions.

As a second example, consider the distribution in FIG. 4B. In this case, R1 and R2 provide 1/2 bit of redundant information, corresponding to the fact that knowledge of either R1 or R2 reduces uncertainty about the outcomes S = 0 and S = 2. Additionally, R1 and R2 provide 1/2 bit of synergistic information, reflecting the fact that R1 and R2 together provide complete information about outcomes S = 0 and S = 2, which is not true for either alone. Thus, the interaction information in this case is equal to zero despite the presence of both redundant and synergistic interactions, because redundancy and synergy are balanced.

The situation is worse for four-variable interaction information, which is known to violate the interpretation that positive values indicate (pure) synergy and negative values indicate (pure) redundancy [12, 28]. To demonstrate, consider the case of 3-parity, which is the higher-order form of the exclusive-OR, or 2-parity, function mentioned earlier. In this case, we have a system of four binary random variables, S and R = {R1, R2, R3}, where the eight outcomes for R are equiprobable and S = R1 ⊕ R2 ⊕ R3. Intuitively, this corresponds to a case of pure synergy, since the value of S can be determined only when all of the Ri are known. Indeed, using Eq. (13) we find that I(S; R1; R2; R3) for this system is equal to +1 bit, as expected from the interpretation that positive values indicate synergy.
However, now consider a second system of binary variables, this time where the two outcomes of S are equiprobable and R1, R2, and R3 are all copies of S. Clearly this corresponds to a case of pure redundancy, since the value of S can be determined uniquely from knowledge of any Ri, but I(S; R1; R2; R3) for this system is again equal to +1 bit, same as the case of pure synergy. Thus, a completely redundant system is assigned a positive value for the interaction information, in clear violation of the idea that redundancy is indicated by negative values. Worse still, the 4-variable interaction information fails to distinguish between the polar opposites of purely synergistic and purely redundant information.

FIG. 5. PI-decomposition of interaction information for (A) 3 and (B) 4 variables. Blue and red regions represent PI-terms that are added and subtracted, respectively. The green region in (B) represents a PI-term that is subtracted twice.

The PI-decomposition for 4-variable interaction information (FIG. 5B; see also Fig. S4 in Appendix E) clarifies why this is the case. In terms of PI-atoms, I(S; R1; R2; R3) is given by

    I(S; R1; R2; R3) = ΠR(S; {123}) + ΠR(S; {1}{2}{3})
                       − ΠR(S; {1}{23}) − ΠR(S; {2}{13}) − ΠR(S; {3}{12})
                       − ΠR(S; {12}{13}) − ΠR(S; {12}{23}) − ΠR(S; {13}{23})
                       − 2 × ΠR(S; {12}{13}{23}).    (15)

Thus, I(S; R1; R2; R3) is equal to the sum of third-order synergy ({123}) and third-order redundancy ({1}{2}{3}), minus the information provided redundantly by a first- and second-order synergy ({1}{23}, {2}{13}, and {3}{12}), minus the information provided redundantly by two second-order synergies ({12}{13}, {12}{23}, and {13}{23}), and minus twice the information provided redundantly by all three second-order synergies ({12}{13}{23}). Thus, systems with pure synergy and pure redundancy have the same value for I(S; R1; R2; R3) because 4-variable interaction information adds in the highest-order synergy and redundancy terms. More generally, the PI-decomposition for I(S; R1; R2; R3) shows why it is difficult to interpret as a meaningful quantity, and as one might expect the story only becomes more complicated in higher dimensions. Thus, although one can readily decompose interaction information into a collection of partial information contributions, and understand the conditions under which it will be positive or negative depending on the relative magnitudes of these contributions, the utility of interaction information for larger systems is unclear.
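The failure of 4-variable interaction information described above is easy to reproduce. The sketch below evaluates it through the term-by-term expansion shown in Fig. S4, reusing mutual_info from the earlier sketch; the two distribution encodings are assumptions consistent with the text's descriptions of 3-parity and of three copies of S.

```python
# 3-parity: the eight outcomes of R are equiprobable and S = R1 xor R2 xor R3.
p_parity = {(r1 ^ r2 ^ r3, r1, r2, r3): 1/8
            for r1 in (0, 1) for r2 in (0, 1) for r3 in (0, 1)}

# Pure redundancy: S is an equiprobable bit and R1, R2, R3 are all copies of S.
p_copies = {(s, s, s, s): 1/2 for s in (0, 1)}

def interaction_info_4(p):
    """I(S;R1;R2;R3) via the term-by-term expansion shown in Fig. S4:
       I(S;R123) - I(S;R12) - I(S;R13) - I(S;R23) + I(S;R1) + I(S;R2) + I(S;R3)."""
    def mi(src):
        return mutual_info(p, (0,), src)
    return (mi((1, 2, 3)) - mi((1, 2)) - mi((1, 3)) - mi((2, 3))
            + mi((1,)) + mi((2,)) + mi((3,)))

print(round(interaction_info_4(p_parity), 3))   # +1.0 for pure synergy ...
print(round(interaction_info_4(p_copies), 3))   # +1.0 ... and also for pure redundancy
```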
VI. DISCUSSION

The main objective of this paper has been to quantify multivariate information in such a way that the structure of variable interactions is illuminated. This was accomplished by first defining a general measure of redundant information, Imin, which satisfies a number of intuitive properties for a measure of redundancy. Next, it was shown that Imin induces a lattice structure over the set of possible information sources, referred to as the redundancy lattice, which characterizes the distinct ways that information can be distributed amongst a set of sources. From this lattice, a measure of partial information was derived that captures the unique information contributed by each possible combination of sources. It was then shown that mutual information decomposes into a sum of these partial information terms, so that the total information provided by a source is broken down into a collection of partial information contributions. Moreover, it was demonstrated that each of these terms supports clear interpretation as a particular combination of redundant and synergistic interactions between specific subsets of variables. Finally, we discussed the relationship between partial information decomposition and interaction information, the current de facto measure of multivariate interactions, and used partial information to clarify the confusing property that interaction information is sometimes negative.

One obvious challenge with applying these ideas is that the number of partial information terms grows rapidly for larger systems. For instance, with 9 variables there are more than 5 × 10^22 possibilities [29], and beyond that the Dedekind numbers are not even currently known. Thus, clearly an important direction for future work is to determine efficient ways of calculating partial information terms for larger systems. To this end, the lattice structure of the terms is likely to play an essential role. As with any ordered data structure, the fact that the space of possibilities is highly organized can be readily exploited for efficient use. For instance, as a simple example, if Imin is calculated in a descending fashion over the nodes of the redundancy lattice and at a certain juncture has a value of zero, all of the terms below that node can immediately be eliminated simply from the monotonicity of Imin (see Appendix D). Moreover, if the Markov property or any other constraints hold between the variables, many of the possible partial information terms can also be excluded.

Finally, these considerations notwithstanding, it should also be emphasized that 3-variable interaction is the current state of the art, and thus even the simplest form of partial information decomposition can be used to address a number of outstanding questions. In physics, for example, 3-variable interactions have been explored in relation to the non-separability of quantum systems [30] and in the study of many-body correlation effects [31]. In neuroscience, the concepts of synergy and redundancy for three variables have been examined in the context of neural coding in a number of theoretical and empirical investigations [23–26, 32, 33]. In genetics, multivariate dependencies arise in the analysis of gene-gene and gene-environment interactions in studies of human disease susceptibility [28, 34, 35]. Moreover, similar issues have also been explored in machine learning [22, 27, 36], ecology [37], quantum information theory [38], information geometry [39], rough set analysis [40], and cooperative game theory [41]. Thus, in all of these cases, the 3-variable form of partial information decomposition can be applied immediately to illuminate the structure of multivariate dependencies, while the general form provides a clear way forward in the study of more complex systems of interactions.

ACKNOWLEDGMENTS

We thank O. Sporns, J. Beggs, A. Kolchinsky, and L. Yaeger for helpful comments. This work was supported in part by NSF grant IIS-0916409 (to R.D.B.) and an NSF IGERT traineeship (to P.L.W.).
[1] C. E. Shannon and W. Weaver, The Mathematical Theory of Communication (Univ of Illinois Press, 1949).
[2] D. Pines, The Many-Body Problem (Addison-Wesley, 1997).
[3] R. D. Luce and H. Raiffa, Games and Decisions: Introduction and Critical Survey (Dover, 1989).
[4] P. Dayan and L. F. Abbott, Theoretical Neuroscience: Computational and Mathematical Modeling of Neural Systems (MIT Press, 2001).
[5] F. Rieke, D. Warland, R. de Ruyter van Steveninck, and W. Bialek, Spikes: Exploring the Neural Code (MIT Press, 1999).
[6] S. Watanabe, IBM Journal of Research and Development, 4, 66 (1960).
[7] W. R. Garner, Uncertainty and Structure as Psychological Concepts (Wiley, 1962).
[8] M. Studeny and J. Vejnarova, Learning in Graphical Models, 261 (1998).
[9] G. Tononi, O. Sporns, and G. M. Edelman, Proc Natl Acad Sci USA, 91, 5033 (1994).
[10] W. J. McGill, Psychometrika, 19, 97 (1954).
[11] T. S. Han, Information and Control, 46, 26 (1980).
[12] A. J. Bell, Proceedings of ICA2003, 921 (2003).
[13] T. J. Gawne and B. J. Richmond, J Neurosci, 13, 2758 (1993).
[14] R. W. Yeung, Information Theory and Network Coding (Springer, 2008).
[15] T. M. Cover and J. A. Thomas, Elements of Information Theory, 2nd ed. (Wiley-Interscience, 2006).
[16] S. Takano, Proc Jpn Acad, 50, 109 (1974).
[17] T. Tsujishita, Adv Appl Math, 16, 269 (1995).
[18] Z. Zhang and R. W. Yeung, IEEE Trans Inf Theory, 44, 1440 (1998).
[19] G. Rota, Probability Theory and Related Fields, 2, 340 (1964).
[20] R. P. Stanley, Enumerative Combinatorics, Vol. 1 (Cambridge Univ Press, 1997).
[21] L. Comtet, Advanced Combinatorics: The Art of Finite and Infinite Expansions (Springer, 1974).
[22] J. Pearl and G. Shafer, Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference (Morgan Kaufmann, 1988).
[23] N. Brenner, S. P. Strong, R. Koberle, W. Bialek, and R. de Ruyter van Steveninck, Neural Comput, 12, 1531 (2000).
[24] S. Panzeri, S. R. Schultz, A. Treves, and E. T. Rolls, Proc R Soc B, 266, 1001 (1999).
[25] E. Schneidman, W. Bialek, and M. J. Berry, J Neurosci, 23, 11539 (2003).
[26] P. E. Latham and S. Nirenberg, J Neurosci, 25, 5195 (2005).
[27] A. Jakulin and I. Bratko, arXiv preprint cs/0308002 (2003).
[28] D. Anastassiou, Mol Syst Biol, 3, 1 (2007).
[29] D. Wiedemann, Order, 8, 5 (1991).
[30] N. J. Cerf and C. Adami, Phys Rev A, 55, 3371 (1997).
[31] H. Matsuda, Phys Rev E, 62, 3096 (2000).
[32] I. Gat and N. Tishby, Advances in NIPS, 111 (1999).
[33] N. S. Narayanan, E. Y. Kimchi, and M. Laubach, J Neurosci, 25, 4207 (2005).
[34] J. H. Moore, J. C. Gilbert, C. T. Tsai, F. T. Chiang, T. Holden, N. Barney, and B. C. White, J Theor Biol, 241, 252 (2006).
[35] P. Chanda, A. Zhang, D. Brazeau, L. Sucheston, J. L. Freudenheim, C. Ambrosone, and M. Ramanathan, Am J Hum Genet, 81, 939 (2007).
[36] D. J. C. MacKay, Information Theory, Inference, and Learning Algorithms (Cambridge Univ Press, 2003).
[37] L. Orlóci, M. Anand, and V. D. Pillar, Community Ecol, 3, 217 (2002).
[38] V. Vedral, Rev Mod Phys, 74, 197 (2002).
[39] S. Amari, IEEE Trans Inf Theory, 47, 1701 (2001).
[40] G. Gediga and I. Düntsch, in Rough-Neural Computing, edited by S. K. Pal, L. Polkowski, and A. Skowron (Physica Verlag, Heidelberg, 2003).
[41] M. Grabisch and M. Roubens, Int J Game Theory, 28, 547 (1999).
[42] M. R. DeWeese and M. Meister, Network, 10, 325 (1999).
[43] D. A. Butts, Network, 14, 177 (2003).
[44] B. A. Davey and H. A. Priestley, Introduction to Lattices and Order, 2nd ed. (Cambridge Univ Press, 2002).
[45] G. A. Grätzer, General Lattice Theory, 2nd ed. (Birkhäuser, 2003).
[46] J. Crampton and G. Loizou, “Two partial orders on the set of antichains,” (2000), research note.
[47] J. Crampton and G. Loizou, International Mathematical Journal, 1, 223 (2001).
[48] S. M. Ross, A First Course in Probability, 8th ed. (Prentice Hall, 2009).
Appendix A: Measures of Specific Information

Measures of specific information are discussed in [42] in the context of quantifying the information that specific neural responses provide about a stimulus ensemble. For random variables S and R, representing stimuli and responses, respectively, the information that R provides about S is decomposed according to

    I(S; R) = Σ_{r ∈ R} p(r) i_r(r)    (A1)

and

    i_r(r) = H(S) − H(S|r)    (A2)

where H(S) is the entropy of S and i_r(r) is the response-specific information associated with each r ∈ R. The response-specific information quantifies the change in uncertainty about S when response r is observed. In [42], it is shown that i_r is the unique measure of specific information that satisfies additivity, though it is also possible for i_r to be negative.

To distinguish the different role played by stimuli as opposed to responses, an alternative measure of specific information is proposed in [43]. The stimulus-specific information for an outcome s ∈ S is defined as

    i_s(s) = Σ_{r ∈ R} p(r|s) i_r(r).    (A3)

Like the response-specific information, the weighted average of i_s(s) gives the mutual information I(S; R). Stimulus-specific information quantifies the extent to which a particular stimulus s tends to evoke responses that are informative about the entire ensemble S (responses with high values for i_r).

Finally, both [42] and [43] also discuss I(S = s; R), the measure of specific information used here (Eq. (2)). In [43], I(S = s; R) is described as the reduction in surprise of a particular stimulus s gained from each response, averaged over all responses associated with that stimulus. Thus, whereas i_s(s) weights each response r according to the information that it contributes about the entire ensemble S, I(S = s; R) quantifies only the information that R provides about the particular outcome S = s. In [42], it is proven that I(S = s; R) is the only measure of specific information that is strictly nonnegative.
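To see how the three measures behave differently, the following sketch evaluates i_r(r) (Eq. (A2)), i_s(s) (Eq. (A3)), and I(S = s; R) (Eq. (2)) on a small, deliberately asymmetric joint distribution chosen purely for illustration (it does not come from the paper); it reuses marginal and specific_info from the sketches in the main text.

```python
from math import log2

# Illustrative example (not from the paper): S, R binary with
# p(0,0) = 1/2, p(0,1) = 1/4, p(1,1) = 1/4, so observing R = 1 increases
# uncertainty about S relative to the prior.
p_ab = {(0, 0): 1/2, (0, 1): 1/4, (1, 1): 1/4}

def i_r(p, r):
    """Response-specific information i_r(r) = H(S) - H(S|r) (Eq. A2); can be negative."""
    ps = marginal(p, (0,))
    h_s = -sum(q * log2(q) for q in ps.values() if q > 0)
    pr = marginal(p, (1,))[(r,)]
    cond = {s: p.get((s, r), 0.0) / pr for (s,) in ps}
    h_s_given_r = -sum(q * log2(q) for q in cond.values() if q > 0)
    return h_s - h_s_given_r

def i_s(p, s):
    """Stimulus-specific information i_s(s) = sum_r p(r|s) i_r(r) (Eq. A3)."""
    ps = marginal(p, (0,))[(s,)]
    return sum((prob / ps) * i_r(p, r) for (s2, r), prob in p.items() if s2 == s)

for s in (0, 1):
    print(s, round(i_s(p_ab, s), 3), round(specific_info(p_ab, s, (1,)), 3))
# Here i_r(1) = H(S) - 1 is negative and i_s(1) inherits that negativity,
# whereas I(S=s;R) stays nonnegative; all three measures average to
# I(S;R) ~= 0.311 bits under their respective weightings.
```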
Appendix B: Lattice Theory Definitions

Here we review only the basic concepts of lattice theory needed for supporting proofs. For a thorough treatment, see [44, 45].

Definition 1. A pair ⟨X, ≤⟩ is a partially ordered set or poset if ≤ is a binary relation on X that is reflexive, transitive and antisymmetric.

Definition 2. Let Y ⊆ X. Then a ∈ Y is a maximal element in Y if for all b ∈ Y, a ≤ b ⇒ a = b. A minimal element is defined dually. We denote the set of maximal elements of Y by Ȳ and the set of minimal elements by Y̲.

Definition 3. Let ⟨X, ≤⟩ be a poset, and let Y ⊆ X. An element x ∈ X is an upper bound for Y if for all y ∈ Y, y ≤ x. A lower bound for Y is defined dually.

Definition 4. An element x ∈ X is the least upper bound or supremum for Y, denoted sup Y, if x is an upper bound of Y and x ≤ z for every upper bound z of Y. The greatest lower bound or infimum for Y, denoted inf Y, is defined dually.

Definition 5. A poset ⟨X, ≤⟩ is a lattice if, and only if, for all x, y ∈ X both inf{x, y} and sup{x, y} exist in X.

If ⟨X, ≤⟩ is a lattice, it is common to write x ∧ y, the meet of x and y, and x ∨ y, the join of x and y, for inf{x, y} and sup{x, y}, respectively. For Y ⊆ X, we use ⋀Y and ⋁Y to denote the meet and join of all elements in Y, respectively.

The classic example of a lattice is the power set of a set X ordered by inclusion, denoted ⟨P(X), ⊆⟩. Lattices are naturally represented by Hasse diagrams, in which nodes correspond to members of X and an edge exists between elements x and y if x covers y. FIG. S1A depicts the Hasse diagram for the lattice ⟨P(X), ⊆⟩ with X = {1, 2, 3}.

FIG. S1: Basic lattice-theoretic concepts. (A) Hasse diagram of the lattice ⟨P(X), ⊆⟩ for X = {1, 2, 3}. (B) An example of a chain (blue nodes) and an antichain (red nodes). (C) The top ⊤ and bottom ⊥ are shown in gray. Green nodes correspond to {1, 2, 3}⁻, the set of elements covered by {1, 2, 3}. The orange region represents ↓{1, 3}, the down-set of {1, 3}.

Definition 6. For a, b ∈ X, we say that a is covered by b (or b covers a) if a < b and a ≤ c < b ⇒ a = c. The set of elements that are covered by b is denoted by b⁻.

Definition 7. If ⟨X, ≤⟩ is a poset, Y ⊆ X is a chain if for all a, b ∈ Y either a ≤ b or b ≤ a. Y is an antichain if a ≤ b only if a = b. FIG. S1B shows examples of a chain and an antichain.

Definition 8. If there exists an element ⊥ ∈ X with the property that ⊥ ≤ x for all x ∈ X, we call ⊥ the bottom element of X. The top element of X, denoted by ⊤, is defined dually.

Definition 9. For any x ∈ X, we define ↓x = {y ∈ X : y ≤ x} and ↓̇x = {y ∈ X : y < x}, where ↓x and ↓̇x are called the down-set and strict down-set of x, respectively. FIG. S1C illustrates the concepts of top and bottom elements, covering relations, and down-sets.

Appendix C: A(R) and the Redundancy Lattice

Formally, A(R) corresponds to the set of antichains on the lattice ⟨P(R), ⊆⟩ (excluding the empty set). The cardinality of this set for |R| = n − 1 is given by the (n − 1)-th Dedekind number, which for n = 2, 3, 4, ... is 1, 4, 18, 166, 7579, ... ([21], p. 273). The fact that ⟨A(R), ⪯⟩ forms a lattice, which we call the redundancy lattice, is proven in [46], where the corresponding lattice is denoted ⟨A(X), ⪯′⟩ (see also [47]). As shown in [46], the meet (∧) and join (∨) for this lattice are given by

    α ∧ β = α ∪ β    (A4)

and

    α ∨ β = ↑α ∩ ↑β.    (A5)

Appendix D: Supporting Proofs

Theorem 1. I(S = s; A) is nonnegative.

Proof. I(S = s; A) = D(p(a|s) ‖ p(a)) ≥ 0, where D is the Kullback-Leibler distance and the last step follows from the information inequality ([15], p. 26).

Lemma 1. I(S = s; A) increases monotonically on the lattice ⟨P(R), ⊆⟩.

Proof. Consider A, B with A ⊂ B ⊆ R. Let C = B \ A ≠ ∅. Then we have

    I(S = s; B) − I(S = s; A)
      = Σ_b p(b|s) log [ p(s, b) / (p(s) p(b)) ] − Σ_a p(a|s) log [ p(s, a) / (p(s) p(a)) ]
      = Σ_a Σ_c p(a, c|s) log [ p(s, a, c) / (p(s) p(a, c)) ] − Σ_a Σ_c p(a, c|s) log [ p(s, a) / (p(s) p(a)) ]
      = Σ_a Σ_c p(a, c|s) log [ p(s, c|a) / (p(s|a) p(c|a)) ]
      = Σ_a p(a|s) Σ_c p(c|a, s) log [ p(c|a, s) / p(c|a) ]
      = Σ_a p(a|s) D(p(c|a, s) ‖ p(c|a)) ≥ 0.

Theorem 2. Imin increases monotonically on the lattice ⟨A(R), ⪯⟩.

Proof. We proceed by contradiction. Assume there exist α, β ∈ A(R) with α ≺ β and Imin(S; β) < Imin(S; α). Then, from Eq. (3), there must exist B ∈ β such that I(S = s; B) < I(S = s; A) for some outcome s ∈ S and for all A ∈ α. Thus, from Lemma 1, there does not exist A ∈ α such that A ⊆ B. However, since α ≺ β by assumption, there exists A ∈ α such that A ⊆ B.
Theorem 3. ΠR can be stated in closed form as

    ΠR(S; α) = Imin(S; α) − Σ_{k=1}^{|α⁻|} (−1)^{k−1} Σ_{𝓑 ⊆ α⁻, |𝓑| = k} Imin(S; ⋀𝓑).    (A6)

Proof. For 𝓑 ⊆ A(R), define the set-additive function f as

    f(𝓑) = Σ_{β ∈ 𝓑} ΠR(S; β).

From Eq. (6), it follows that Imin(S; α) = f(↓α) and

    ΠR(S; α) = f(↓α) − f(↓̇α) = f(↓α) − f( ⋃_{β ∈ α⁻} ↓β ).

Applying the principle of inclusion-exclusion ([20], p. 64), we have

    ΠR(S; α) = f(↓α) − Σ_{k=1}^{|α⁻|} (−1)^{k−1} Σ_{𝓑 ⊆ α⁻, |𝓑| = k} f( ⋂_{γ ∈ 𝓑} ↓γ )

and it is a basic result of lattice theory that for any lattice L and A ⊆ L, ⋂_{a ∈ A} ↓a = ↓(⋀A) ([44], p. 57), so we have

    ΠR(S; α) = f(↓α) − Σ_{k=1}^{|α⁻|} (−1)^{k−1} Σ_{𝓑 ⊆ α⁻, |𝓑| = k} f(↓(⋀𝓑))
             = Imin(S; α) − Σ_{k=1}^{|α⁻|} (−1)^{k−1} Σ_{𝓑 ⊆ α⁻, |𝓑| = k} Imin(S; ⋀𝓑).

Lemma 2 (Maximum-minimums identity). Let A be a set of numbers. The maximum-minimums identity states that

    max A = Σ_{k=1}^{|A|} (−1)^{k−1} Σ_{B ⊆ A, |B| = k} min B

or conversely,

    min A = Σ_{k=1}^{|A|} (−1)^{k−1} Σ_{B ⊆ A, |B| = k} max B.

Proof. It is proven in a number of introductory texts, e.g. [48].

Theorem 4. ΠR can be stated in closed form as

    ΠR(S; α) = Imin(S; α) − Σ_s p(s) max_{β ∈ α⁻} min_{B ∈ β} I(S = s; B).    (A7)

Proof. Combining Eqs. (A6) and (3) yields

    ΠR(S; α) = Imin(S; α) − Σ_{k=1}^{|α⁻|} (−1)^{k−1} Σ_{𝓑 ⊆ α⁻, |𝓑| = k} Σ_s p(s) min_{B ∈ ⋀𝓑} I(S = s; B)
             = Imin(S; α) − Σ_s p(s) Σ_{k=1}^{|α⁻|} (−1)^{k−1} Σ_{𝓑 ⊆ α⁻, |𝓑| = k} min_{B ∈ ⋀𝓑} I(S = s; B)

and by Lemma 1 and Eq. (A4),

             = Imin(S; α) − Σ_s p(s) Σ_{k=1}^{|α⁻|} (−1)^{k−1} Σ_{𝓑 ⊆ α⁻, |𝓑| = k} min_{β ∈ 𝓑} min_{B ∈ β} I(S = s; B).

Then, applying Lemma 2 we have

             = Imin(S; α) − Σ_s p(s) max_{β ∈ α⁻} min_{B ∈ β} I(S = s; B).

Theorem 5. ΠR is nonnegative.

Proof. If α = ⊥, ΠR(S; α) = Imin(S; α) and ΠR(S; α) ≥ 0 follows from the nonnegativity of Imin. To prove it for α ≠ ⊥, we proceed by contradiction. Assume there exists α ∈ A(R) \ {⊥} such that ΠR(S; α) < 0. Applying Eq. (3) to Theorem 4 and combining summations yields

    ΠR(S; α) = Σ_s p(s) { min_{A ∈ α} I(S = s; A) − max_{β ∈ α⁻} min_{B ∈ β} I(S = s; B) }.

From this equation, it is clear that there must exist β ∈ α⁻ such that for all B ∈ β, I(S = s; A) < I(S = s; B) for some outcome s ∈ S and some A ∈ α. Thus, from Lemma 1, there does not exist B ∈ β such that B ⊆ A. However, since β ≺ α by definition, there exists B ∈ β such that B ⊆ A.

Appendix E: Supplementary Figures

FIG. S2: Constructing a PI-diagram for 4 variables. (A) For each element Ri ∈ R there is a region corresponding to I(S; Ri). (B-E) For each subset A ⊆ R with two or more elements, I(S; A) is depicted as a region containing I(S; Ri) for each Ri ∈ A but not coextensive with ⋃_{Ri ∈ A} I(S; Ri). Regions of the diagram intersect generically, representing all possibilities for redundancy.

FIG. S3: Computing the PI-decomposition for 3-variable interaction information. (A-B) Term-by-term calculation of I(S; R1; R2) = I(S; R1, R2) − I(S; R1) − I(S; R2). Blue and red regions represent PI-terms that are added and subtracted, respectively.

FIG. S4: Computing the PI-decomposition for 4-variable interaction information. (A-F) Term-by-term calculation of I(S; R1; R2; R3) = I(S; R1, R2, R3) − I(S; R1, R2) − I(S; R1, R3) − I(S; R2, R3) + I(S; R1) + I(S; R2) + I(S; R3). Blue and red regions represent PI-terms that are added and subtracted, respectively. Green regions represent PI-terms that are subtracted twice.