Academia.eduAcademia.edu

Lattice duality: The origin of probability and entropy

2005, Neurocomputing

Bayesian probability theory is an inference calculus, which originates from a generalization of inclusion on the Boolean lattice of logical assertions to a degree of inclusion represented by a real number. Dual to this lattice is the distributive lattice of questions constructed from the ordered set of down-sets of assertions, which forms the foundation of the calculus of inquiry-a generalization of information theory. In this paper we introduce this novel perspective on these spaces in which machine learning is performed and discuss the relationship between these results and several proposed generalizations of information theory in the literature.

In Press: Special Issue of Neurocomputing journal dedicated to "Geometrical Methods in Neural Networks and Learning" Lattice Duality: The Origin of Probability and Entropy Kevin H. Knuth NASA Ames Research Center, Mail Stop 269-3, Moffett Field CA 94035-1000, USA Abstract Bayesian probability theory is an inference calculus, which originates from a generalization of inclusion on the Boolean lattice of logical assertions to a degree of inclusion represented by a real number. Dual to this lattice is the distributive lattice of questions constructed from the ordered set of down-sets of assertions, which forms the foundation of the calculus of inquiry—a generalization of information theory. In this paper we introduce this novel perspective on these spaces in which machine learning is performed and discuss the relationship between these results and several proposed generalizations of information theory in the literature. Key words: probability, entropy, lattice, information theory, Bayesian inference, inquiry PACS: 1 Introduction It has been known for some time that probability theory can be derived as a generalization of Boolean implication to degrees of implication represented by real numbers [13,14]. Straightforward consistency requirements dictate the form of the sum and product rules of probability, and Bayes’ theorem [13,14,49,48,22,36], which forms the basis of the inferential calculus, also known as inductive inference. However, in machine learning applications it is often times more useful to rely on information theory [47] in the design of an algorithm. On the surface, the connection between information theory and probability theory seems clear—information depends on entropy and entropy Email address: [email protected] (Kevin H. Knuth). Preprint submitted to Elsevier Science 22 November 2004 is a logarithmically-transformed version of probability. However, as I will describe, there is a great deal more structure lying below this seemingly placid surface. Great insight is gained by considering a set of logical statements as a Boolean lattice. I will show how this lattice of logical statements gives rise to a dual lattice of possible questions that can be asked. The lattice of questions has a measure, analogous to probability, which I will demonstrate is a generalized entropy. This generalized entropy not only encompasses information theory, but also allows for new quantities and relationships, several of which already have been suggested in the literature. A problem can be solved in either the space of logical statements or in the space of questions. By better understanding the fundamental structures of these spaces, their relationships to one another, and their associated calculi we can expect to be able to use them more effectively to perform automated inference and inquiry. In §2, I provide an overview of order theory, specifically partially-ordered sets and lattices. I will introduce the notion of extending inclusion on a finite lattice to degrees of inclusion effectively extending the algebra to a calculus, the rules of which are derived in the appendix. These ideas are used to recast the Boolean algebra of logical statements and to derive the rules of the inferential calculus (probability theory) in §3. I will focus on finite spaces of statements rather than continuous spaces. In §4, I will use order theory to generate the lattice of questions from the lattice of logical statements. I will discuss how consistency requirements lead to a generalized entropy and the inquiry calculus, which encompasses information theory. In §5 I discuss the use of these calculi and their relationships to several proposed generalizations of information theory. 2 Partially-Ordered Sets and Lattices 2.1 Order Theory and Posets In this section, I introduce some basic concepts of order theory that are necessary in this development to understand the spaces of logical statements and questions. Order theory works to capture the notion of ordering elements of a set. The central idea is that one associates a set with a binary ordering relation to form what is called a partially-ordered set, or a poset for short. The ordering relation, generically written ≤, satisfies reflexivity, antisymmetry, and 2 Fig. 1. Diagrams of posets described in the text. A. Natural numbers ordered by ‘less than or equal to’, B. Π3 the lattice of partitions of three elements ordered by ‘is contained by’, C. 23 the lattice of all subsets of three elements {a, b, c} ordered by ‘is a subset of’. transitivity, so that for elements a, b, and c we have P 1. For all a, a ≤ a (Reflexivity) P 2. If a ≤ b and b ≤ a, then a = b (Antisymmetry) P 3. If a ≤ b and b ≤ c, then a ≤ c (Transitivity) The ordering a ≤ b is generally read ‘b includes a’. In cases where a ≤ b and a = b, we write a < b. If it is true that a < b, but there does not exist an element x in the set such that a < x < b, then we write a ≺ b, read ‘b covers a’, indicating that b is a direct successor to a in the hierarchy induced by the ordering relation. This concept of covering can be used to construct diagrams of posets. If an element b includes an element a then it is drawn higher in the diagram. If b covers a then they are connected by a line. These poset diagrams (or Hasse diagrams) are useful in visualizing the order induced on a set by an ordering relation. Figure 1 shows three posets. The first is the natural numbers ordered by the usual ‘is less than or equal to’. The second is Π3 the lattice of partitions of three elements. A partition y includes a partition x, x ≤ y, when every cell of x is contained in a cell of y. The third poset, denoted 23 , is the powerset of the set of three elements P({a, b, c}), ordered by set inclusion ⊆. The orderings in Figures 1b and c are called partial orders since some elements are incomparable with respect to the ordering relation. For example, since it is neither true that {a} ≤ {b} or that {b} ≤ {a}, the elements {a} and {b} are incomparable, written {a}||{b}. In contrast, the ordering in Figure 1a is a total order, since all pairs of elements are comparable with respect to the ordering relation. A poset P possesses a greatest element if there exists an element ⊤ ∈ P , called 3 the top, where x ≤ ⊤ for all x ∈ P . Dually, the least element ⊥ ∈ P , called the bottom, exists when ⊥ ≤ x for all x ∈ P . For example, the top of Π3 is the partition 123 where all elements are in the same cell. The bottom of Π3 is the partition 1|2|3 where each element is in its own cell. The elements that cover the bottom are called atoms. For example, in 23 the atoms are the singleton sets {a}, {b}, and {c}. Given a pair of elements x and y, their upper bound is defined as the set of all z ∈ P such that x ≤ z and y ≤ z. In the event that a unique least upper bound exists, it is called the join, written x ∨ y. Dually, we can define the lower bound and the greatest lower bound, which if it exists, is called the meet, x ∧ y. Graphically the join of two elements can be found by following the lines upward until they first converge on a single element. The meet can be found by following the lines downward. In the lattice of subsets of the powerset 23 , the join ∨, corresponds to the set union ∪, and the meet ∧ corresponds to the set intersection ∩. Elements that cannot be expressed as a join of two elements are called join-irreducible elements. In the lattice 23 , these elements are the atoms. Last, the dual of a poset P , written P ∂ can be formed by reversing the ordering relation, which can be visualized by flipping the poset diagram upside-down. This action exchanges joins and meets and is the reason that their relations come in pairs, as we will see below. There are different notions of duality and the notion after which this paper is titled will be discussed later. 2.2 Lattices A lattice L is a poset where the join and meet exist for every pair of elements. We can view the lattice as a set of objects ordered by an ordering relation, with the join ∨ and meet ∧ describing the hierarchical structure of the lattice. This is a structural viewpoint. However, we can also view the lattice from an operational viewpoint as an algebra on the space of elements. The algebra is defined by the operations ∨ and ∧ along with any other relations induced by the structure of the lattice. Dually, the operations of the algebra uniquely determine the ordering relation, and hence the lattice structure. Viewed as operations, the join and meet obey the following properties for all x, y, z ∈ L L1. x ∨ x = x, x∧x=x (Idempotency) L2. x ∨ y = y ∨ x, L3. x ∨ (y ∨ z) = (x ∨ y) ∨ z, L4. x ∨ (x ∧ y) = x ∧ (x ∨ y) = x x∧y =y∧x (Commutativity) x ∧ (y ∧ z) = (x ∧ y) ∧ z (Associativity) (Absorption) 4 The fact that lattices are algebras can be seen by considering the consistency relations, which express the relationship between the ordering relation and the join and meet operations. x≤y ⇔ x∧y =x (Consistency Relations) x∨y =y Lattices that obey the distributivity relation D1. x ∧ (y ∨ z) = (x ∧ y) ∨ (x ∧ z) (Distributivity of ∧ over ∨) and its dual D2. x ∨ (y ∧ z) = (x ∨ y) ∧ (x ∨ z) (Distributivity of ∨ over ∧) are called distributive lattices. All distributive lattices can be expressed in terms of elements consisting of sets ordered by set inclusion. A lattice is complemented if for every element x in the lattice, there exists a unique element ∼ x such that C1. x∨ ∼ x = ⊤ C2. x∧ ∼ x = ⊥ (Complementation) Note that the lattice 23 (Fig. 1c) is complemented, whereas the lattice Π3 (Fig. 1b) is not. 2.3 Inclusion and the Incidence Algebra Inclusion on a poset can be quantified by a function called the zeta function    1 if x ≤ y ζ(x, y) =   0 if x  y (zeta function) (1) which describes whether the element y includes the element x. This function belongs to a class of real-valued functions f (x, y) of two variables defined on a poset, which are non-zero only when x ≤ y. This set of functions comprises the incidence algebra of the poset [44]. The sum of two functions f (x, y) + g(x, y) in the incidence algebra is defined the usual way by h(x, y) = f (x, y) + g(x, y), 5 (2) as is multiplication by a scalar h(x, y) = λf (x, y). However, the product of two functions is found by taking the convolution over the interval of elements in the poset  f (x, z)g(z, y). (3) h(x, y) = x≤z≤y To invert functions in the incidence algebra, one must rely on the Möbius function µ(x, y), which is the inverse of the zeta function [46,44,3]  ζ(x, z)µ(z, y) = δ(x, y), (4) x≤z≤y where δ(x, y) is the Kronecker delta function. These functions are the generalized analogues of the familiar Riemann zeta function and the Möbius function in number theory, where the poset is the set of natural numbers ordered by ‘divides’. We will see that they play an important role both in inferential reasoning as an extension of inclusion on the Boolean lattice of logical statements, and in the quantification of inquiry. 2.4 Degrees of Inclusion It is useful to generalize this notion of inclusion on a poset. I first introduce the dual of the zeta function, ζ ∂ (x, y), which quantifies whether x includes y, that is    1 if x ≥ y ∂ (dual of the zeta function) (5) ζ (x, y) =   0 if x  y Note that the dual of the zeta function on a poset P is equivalent to the zeta function defined on its dual P ∂ , since the ordering relation is simply reversed. I will generalize inclusion by introducing the function z(x, y), 1     1 if x ≥ y    z(x, y) =  0 if x ∧ y = ⊥ (degrees of inclusion) (6)      z otherwise, where 0 < z < 1. where inclusion on the poset is generalized to degrees of inclusion represented by real numbers. 2 This new function quantifies the degree to which x includes y. This generalization is asymmetric in the sense that the condition where ζ ∂ (x, y) = 1 is preserved, whereas the condition where ζ ∂ (x, y) = 0 has been modified. The motivation here is that, if we are certain that x includes y then we want to indicate this knowledge. However, if we know that x does not 1 2 I have dropped the ∂ symbol since the definition is clear. This function need not be normalized to unity, as we will see later. 6 include y, then we can quantify the degree to which x includes y. In this sense, the algebra is extended to a calculus. Later, I will demonstrate the utility of such a generalization. The values of the function z must be consistent with the poset structure. In the case of a lattice, when the arguments are transformed using the algebraic manipulations of the lattice, the corresponding values of z must be consistent with these transformations. By enforcing this consistency, we can derive the rules by which the degrees of inclusion are to be manipulated. This method of requiring consistency with the algebraic structure was first used by Cox to prove that the sum and product rules of probability theory are the only rules consistent with the underlying Boolean algebra [13,14]. The rules for the distributive lattices I will describe below are derived in the appendix, and the general methodology is discussed in greater detail elsewhere [36]. Consider a distributive lattice D and elements x, y, t ∈ D. Given the degree to which x includes t, z(x, t), and the degree to which y includes t, z(y, t), we would like to be able to determine the degree to which the join x ∨ y includes t, z(x ∨ y, t). In the appendix, I show that consistency with associativity of the join requires that z(x ∨ y, t) = z(x, t) + z(y, t) − z(x ∧ y, t). (7) For a join of multiple elements x1 , x2 , . . . , xn , this degree is found by z(x1 ∨ x2 ∨ · · · ∨ xn , t) =  i z(xi , t) −  z(xi ∧ xj , t) + i<j  z(xi ∧ xj ∧ xk , t) − · · · , (8) i<j<k which I will call the sum rule for distributive lattices. This sum rule exhibits Gian-Carlo Rota’s inclusion-exclusion principle, where terms are added and subtracted to avoid double-counting of the join-irreducible elements in the join [31,44,3]. The inclusion-exclusion principle is a consequence of the Möbius function for distributive lattices, which leads to an alternating sum and difference as one sums down the interval in the lattice. This demonstrates that the form of the sum rule is inextricably tied to the underlying lattice structure [36]. For the meet of two elements x ∧ y, it is clear that we can use (7) to obtain z(x ∧ y, t) = z(x, t) + z(y, t) − z(x ∨ y, t). (9) However, another useful form can be obtained by requiring consistency with distributivity. In the appendix, I show that this consistency constraint leads to z(x ∧ y, t) = Cz(x, t)z(y, x ∧ t), (10) 7 which is the product rule for distributive lattices. The constant C acts as a normalization factor, and is necessary when these degrees are normalized to values other than unity. Last, requiring consistency with commutativity of the meet leads to the analog of Bayes’ Theorem for distributive lattices z(y, x ∧ t) = z(y, t)z(x, y ∧ t) . z(x, t) (11) One does not think typically of Bayes’ Theorem outside of the context of probability theory, however, it is a general rule that is applicable to all distributive lattices. As I will demonstrate, it can be used in computing probabilities among logical assertions, as well as in working with questions. 2.5 Measures and Valuations The fact that one can define functions that take lattice elements to real numbers was utilized by Rota, who used this to develop and promote the field of geometric probability [45,31]. The more familiar term measure typically refers to a function µ defined on a Boolean lattice B, which takes elements of a Boolean lattice to a real number. For example, given x ∈ B, µ : x → R. The term valuation is a more general term that takes a lattice element x ∈ L to a real number, v : x → R, or more generally to an element of a commutative ring with identity [46], v : x → A. The function z, introduced above (6), is a bi-valuation since it takes two lattice elements x, y ∈ L as its arguments, z : x, y → R. When applied to a Boolean lattice, the function z is also a measure. 3 Logical Statements George Boole [9,10] was the first to understand the algebra of logical statements, which I will interchangeably call logical assertions. Boole’s algebra is so familiar, that I will spend little effort in describing it. In this algebra, there are two binary relations called conjunction (AND) and disjunction (OR), and one unary operation called complementation (NOT). The binary operations are commutative, associative, and distributive. Let us now adopt a different perspective, where we view this Boolean structure as a set of logical statements ordered by logical implication. A statement x includes a statement y, y ≤ x, when y implies x, written y → x. Thus the ordering relation ≤ is represented by →. Logical implication as an ordering 8 Fig. 2. The lattice of assertions A3 generated by three atomic assertions a, k, and n. The ideals of this lattice form the lattice of ideal questions I3 ordered by set inclusion ⊆, which is isomorphic to A3 . The the maximum of the set of statements in any ideal maps back to the assertion lattice. The statements comprising three ideals are highlighted on the right. relation among a set of logical assertions sets up a partial order on the set and forms a Boolean lattice. The join and meet are identified with the disjunction and conjunction, respectively. In this case the order-theoretic notation for the join and the meet conveniently matches the logical notation for the disjunction and conjunction. However, one must remember that the join and meet describe different operations in different lattices. A Boolean lattice follows L1-L4, D1, D2, C1 and C2, which is neatly summarized by saying that it is a complemented distributive lattice. To better picture this, consider a simple example [35] concerning the matter of ‘Who stole the tarts made by the Queen of Hearts?’ For the sake of this example, let us say that there are three mutually exclusive statements, one of which answers this question: a = ‘Alice stole the tarts!’ k = ‘The Knave of Hearts stole the tarts!’ n = ‘No one stole the tarts!’ The lattice A3 generated by these assertions is shown in Figure 2. The bottom element of the lattice is called the logical absurdity. It is always false, and as such, it implies every other statement in the lattice. The three statements a, k, and n are the atoms which cover the bottom. All other logical statements in this space can be generated from joins of these three statements. For example, the statement a∨k is the statement ‘Either Alice or the Knave stole the tarts! ’ The top element ⊤ = a∨k ∨n is called the truism, since it is trivially true that ‘Either Alice, the Knave, or nobody stole the tarts! ’. The truism is implied by 9 every other statement in the lattice. Since the lattice is Boolean, each element in the lattice has a unique complement (C1, C2). The statement a =‘Alice stole the tarts! ’ has as its complement the statement ∼ a = k ∨ n =‘Either the Knave or no one stole the tarts! ’ This statement ∼ a is equivalent to ‘Alice did not steal the tarts! ’ Last, note that this lattice (Fig. 2) is isomorphic to the lattice of powersets 23 (Fig. 1c), therefore Boolean algebra describes the operations on a powerset as well as implication on a set of logical statements. 3.1 The Order-Theoretic Origin of Probability Theory Deductive reasoning describes the act of using the Boolean lattice structure to determine whether one logical statement implies another given partial knowledge of relations among a set of logical statements. From the perspective of posets, this equates to determining whether one element of a poset includes another element given some partial knowledge of inclusion among a set of poset elements. Since inclusion on a poset is encoded by the zeta function ζ and its dual ζ ∂ (5), either of these functions can be used to quantify implication on A and perform deductive reasoning. Inductive reasoning or inference is different from deductive reasoning in the sense that it incorporates a notion of uncertainty not found in the Boolean lattice structure. Just as ζ ∂ quantifies deductive reasoning, its generalization z quantifies inductive reasoning. Probability 3 is simply this function z (6) defined on the Boolean lattice A, p(x|y) = z(x, y), (12) so that implication on the lattice is generalized to degrees of implication represented by real numbers     1 if x → y    p(y|x) =  0 if x ∧ y = ⊥ (probability) (13)      p otherwise, where 0 < p < 1. To make this more concrete, consider the example in Figure 2. Clearly, ⊤ ≥ a, which is equivalent to a → ⊤, so that ζ ∂ (⊤, a) = 1 and p(⊤|a) = 1. Now, ⊤  a and a ∧ ⊤ = a, therefore ζ ∂ (a, ⊤) = 0 and p(a|⊤) = p where 0 < p < 1. While the truism, ⊤ =‘Either Alice or the Knave or no one stole the tarts! ’, 3 I could call this degree of implication plausibility or perhaps by a new term, however we will see that this quantity follows all of the rules of probability theory. Since there is neither an operational nor a mathematical difference between this degree of implication and probability, I see no need to indicate a difference semantically. 10 does not imply that a =‘Alice stole the tarts! ’, the degree to which the truism implies that Alice stole the tarts, p(a|⊤), is a very useful quantity. This is the essence of inductive reasoning. Since probability is the function z defined on the Boolean lattice A, the rules by which probabilities may be manipulated derive directly from requiring that probability be consistent with the underlying Boolean algebra. Since Boolean algebras are distributive, we have already shown that there are three rules for manipulating probabilities: the sum rule of probability p(x ∨ y|t) = p(x|t) + p(y|t) − p(x ∧ y|t), (14) which is equivalent to (7), the product rule of probability p(x ∧ y|t) = p(x|t)p(y|x ∧ t), (15) which is equivalent to (10) with C = 1, and Bayes’ theorem p(y|x ∧ t) = p(y|t)p(x|y ∧ t) , p(x|t) (16) which is equivalent to (11). These three rules constitute the inferential calculus, which is a generalization of the Boolean algebra of logical assertions. There are several very important points to be made here. Probabilities are functions of pairs of logical statements and quantify the degree to which one logical statement implies another. For this reason, they are necessarily conditional. Since the rules by which probabilities are to be manipulated derive directly from consistency with the underlying Boolean algebra, probability theory is literally an extension of logic, as argued so effectively by E.T. Jaynes [28]. 3.2 Join-Irreducible Statements and Assigning Probabilities The join-irreducible elements of the Boolean lattice of logical statements are the atoms that cover the absurdity ⊥. This set of atomic statements {a1 , a2 , . . . , aN } comprises the exhaustive set of mutually exclusive assertions that form the basis of this lattice. All other statements in the lattice can be found by taking joins of these atoms. Given assignments of the degree to which ⊤ implies each of the atoms in the set {a1 , a2 , . . . aN }, (eg. p(ai |⊤)), the probabilities for all other statements in the lattice can be computed using the sum rule of the inferential calculus. 4 This was proven by Gian-Carlo Rota [46, Theorem 1, Corollary 2, p.35] who showed that: 4 In the case where assertions representing data have not been brought into the assertion lattice via the lattice product, these probability assignments can be considered assignments of the prior probabilities. See §3.3 for a note on lattice products and inference. 11 Theorem 1 (Rota, Assigning Valuations [46]) A valuation in a finite distributive lattice is uniquely determined by the values it takes on the set of join-irreducibles of L, and these values can be arbitrarily assigned. Furthermore, given the degree to which ⊤ implies an assertion x, the degree to which any other element y of the lattice implies x can be found via the product rule p(x|y) = p(x|y ∧ ⊤) = p(x ∧ y|⊤) . p(y|⊤) (17) Thus by assigning the probability that the truism implies each of the atoms, all the other probabilities can be uniquely determined using the inferential calculus. What is even more remarkable here, is that Rota proved that the values of the probabilities can be arbitrarily assigned. This means that there are no constraints imposed by the lattice structure, or equivalently the Boolean algebra, on the values of the probabilities. Thus the inferential calculus tells us nothing about assigning probabilities. Objective assignments can only be made by relying on additional consistency principles, such as symmetry, constraints, and consistency with other aspects of the problem at hand. Examples of useful principles are Jaynes’ Principle of Maximum Entropy [26] and his Principle of Group Invariance [25], which is a generalization of the Principle of Indifference [7,37,30]. Once these assignments are made, the inferential calculus, induced by consistency with order-theoretic principles, dictates the remaining probabilities. 3.3 Remarks on Lattice Products Two spaces of logical statements can be combined by taking the lattice product, which can be written as the Cartesian product of the lattice elements. By equating the bottom elements of the two spaces, we get the subdirect product, which is a distributive lattice. Such products of lattices are very important in inference, since it is exactly what one does when one takes a lattice of hypotheses and combines them with data. The product rule and Bayes’ theorem are extremely useful in these situations where the probabilities are assigned on the two lattices forming the product. These issues are discussed in more detail elsewhere [36]. 12 4 The Algebra and Calculus of Questions In his last published work exploring the relationships between inference and inquiry [15], Cox defined a question as the set of all logical statements that answer it. At first glance, this definition is strikingly simple. However, with further thought one sees that it captures the essence of a question and does so in a form that is accessible to mathematical investigation. In the previous section on logical statements, the modern viewpoint of lattice theory may seem like overkill. Its heavy mathematics are not necessary to reach the same conclusions that one reaches by simply working with Boole’s algebra. In addition, while some of new insight is gained, there is little there to change how one uses probability theory to solve inference problems. Here however, we will find lattice theory to be of great advantage by enabling us to visualize relations among sets of assertions that comprise the sets of answers to questions. 4.1 Down-sets and Ideals If a logical statement x answers a question, then any statement y such that y → x, or equivalently in order-theoretic notation, y ≤ x, answers the same question. Thus a question is not defined by just any set of logical statements, it is defined by a set that is closed when going down the assertion lattice. Such a set is called a down-set [16] Definition 1 (Down-set) A down-set is a subset J of an ordered set L where if a ∈ J, x ∈ L, x ≤ a then x ∈ J. Given an arbitrary subset K of L, we write the down-set formed from K as J = ↓K = {y ∈ L|∃x ∈ K where y ≤ x}. Let us begin exploring questions by considering the down-set formed from a set containing a single element {x}, which we write 5 as X = ↓{x} ≡ ↓x. Given any logical statement x in the Boolean lattice of assertions, we can consider the down-set formed from that assertion x X = ↓x = {y | y → x ∀ x, y ∈ A} (18) Such a down-set is called an ideal [8,16], and to emphasize this I have called these questions ideal questions to denote the fact that they are ideals of the assertion lattice A [34]. 5 Note that I am using lowercase letters to represent logical statements, uppercase letters to represent questions, and script letters to represent an entire lattice. 13 We are now in a position to compare questions. Two questions are equivalent if they ask the same thing—or equivalently when they are answered by the same set of assertions. The questions ‘Is it raining? ’ and ‘Is it not raining? ’ are both answered by either the statement ‘It is raining! ’ or the statement ‘It is not raining! ’ and all the statements that imply them. Thus our two questions ask the same thing and are therefore equivalent. Furthermore, if one question X is a subset of another question Y in a space of questions Q, then answering the question X will necessarily answer the question Y . This means that we can use the binary ordering relation ‘is a subset of’ to implement the ordering relation ‘answers’ and therefore order the set of questions. The set of ideal questions I ordered by set inclusion forms a lattice (Fig. 2) isomorphic to the original assertion lattice [8]. Thus there is a one-to-one onto mapping from each statement x ∈ A to its ideal question X ∈ Q. The atomic assertions map to atomic questions, each of which has only two possible answers. For example, the statement k from our previous example maps to K = ↓k = {k, ⊥}, (19) which is answered by either ‘The knave stole the tarts! ’ or the absurdity ⊥. 6 Robert Fry calls these atomic questions elementary questions [19], since you basically receive either exactly what you asked or no useful answer. The nonatomic statements map to more complex questions, such as KN = ↓k ∨ n = {k ∨ n, k, n, ⊥}, (20) where the symbol KN is considered to be a single symbol representing a single question formed from the down-set of the join of the statements k and n. Similarly, I will use AKN to represent the question AKN = ↓a ∨ k ∨ n. The lattice of ideal questions I can be mapped back to the lattice of assertions A by selecting the maximum element in the set. Ideal questions are certainly strange constructions, since they accept as an answer the single top statement that defines them. Thus it is possible that by asking an ideal question, no new or useful information will be attained. It is best to think of ideal questions as a mathematical generalization of our everyday questions—just as the negative numbers and zero are generalizations of our counting numbers. While they can’t describe a concrete situation, they are surely useful mathematical constructs. 7 6 It is admittedly strange that an absurd (false) statement is allowed to answer the question. This is a direct consequence of the fact that in Boolean algebra a false statement implies every other statement. We could easily define questions to exclude the bottom element, but this is not necessary to fully understand the algebra. 7 Thanks to Ariel Caticha for this analogy. 14 4.2 The Lattice of Questions We can construct more complex questions by considering down-sets, which are set unions of the ideals of the assertion lattice. For example, the question T =‘Who stole the tarts? ’ is formed from the union of the three elementary questions T = A ∪ K ∪ N. (21) Since A = ↓a = {a, ⊥} K = ↓k = {k, ⊥} N = ↓n = {n, ⊥} (22) the question T = A ∪ K ∪ N can be written as T = {a, k, n, ⊥}. (23) In this way, T is defined by its set of possible answers, including the absurdity. Note that the question T admits no ambiguous answers (such as a ∨ k). That is, by asking T one will discover precisely who stole the tarts. We could also ask the binary question B =‘Did or did not Alice steal the tarts? ’. This question can be written as the down-set formed from a =‘Alice stole the tarts! ’, and its complement ∼ a =‘Either the Knave or no one stole the tarts! ’ B = ↓a ∪ ↓∼ a = ↓a ∪ ↓k ∨ n = {a, ⊥} ∪ {k ∨ n, k, n, ⊥} = {k ∨ n, a, k, n, ⊥}, (24) where any one of the statements in the set will answer the question B. We can write B compactly as B = A ∪ KN . This construction produces every possible question that can be asked given the lattice of assertions A, see Figure 3. Since questions are sets, the set of questions ordered by set inclusion forms a poset ordered by ⊆, which is a distributive lattice [8,16]. More specifically, this construction results in the ordered set of down-sets of A, which is written O(A). Thus the Boolean lattice 2N is mapped to the free distributive lattice 8 F D(N ). Even though F D(N ) is 8 The term free refers to the fact that lattice is generated by considering all conceivable joins and meets of N elements, in this case the elements AK, AN , KN . This technique is just another way to construct the lattice. 15 Fig. 3. The ordered set of down-sets of the lattice of assertions A results in the lattice of all possible questions Q = O(A) ordered by ⊆. The join-irreducible elements of Q are the ideal questions, which are isomorphic to the lattice of assertions A ∼ J(Q). The down-sets corresponding to several questions, including the questions T and B discussed in the text, are illustrated on the right. a lattice of sets and is distributive, it is not complemented [34]. Thus questions in general have no complements. The question lattice Q is closed under set union and set intersection, which correspond to the join and the meet, respectively. Therefore, T = A∪K ∪N ≡ A ∨ K ∨ N . Unfortunately, the terminology introduced by Cox is at odds with the order-theoretic terminology, since the joint question is formed from the meet of two questions, and it asks what the two questions ask jointly, whereas the join of two questions, the common question, asks what the two questions ask in common. Consider the two questions T and B. Since T ⊆ B, the question T necessarily answers the question B. Thus asking ‘Who stole the tarts? ’ will resolve ‘Did or did not Alice steal the tarts? ’ The converse is not true, since if one asks ‘Did or did not Alice steal the tarts? ’, the reply could be ‘Either the Knave or no one stole the tarts! ’, which still does not answer ‘Who stole the tarts? ’ Thus the question T lies below the question B in the lattice Q3 indicating that T answers B. The consistency relations (discussed in §2.2) can be used to better visualize these relationships. Consider again the two questions T =‘Who stole the tarts? ’ and B =‘Did or did not Alice steal the tarts? ’. The join of these two questions T ∨ B asks what they ask in common, which is ‘Did or did not Alice 16 steal the tarts? ’. Whereas, their meet T ∧ B asks what they ask jointly, which is ‘Who stole the tarts? ’. So we have that T ∨B =B T ∧B =T which, by the consistency relations, implies that T ⊆ B. This can also be determined by taking the set union for the join and the set intersection for the meet, and working with expressions for the sets defining T and B. 4.3 The Central Issue Just as statements can be true or false, questions can be real or vain. Cox defined a real question as a question that is answered by at least one true statement, whereas a vain question is a question that is answered by no true statement [15]. In Lewis Carroll’s Alice’s Adventures in Wonderland, it turned out that no one stole the tarts. Thus, any question not allowing for that possibility is a vain question—there does not exist a true answer that will resolve the issue. When the truth values of the statements are not known, a question Q is only assured to be a real question when it is answered by every one of the atomic statements of A, or equivalently when Q ∧ Qi = ⊥ for all elementary questions Qi ∈ Q. Put simply, all possibilities must be accounted for. Previously, I called these questions assuredly real questions [34], which I will shorten here to real questions. The set of real questions is a sublattice R of the lattice Q. That is, it is closed under joins and meets. The bottom of R is the smallest real question, and it answers all other questions in R. It is formed from the join of all of the elementary questions, and as such it does not accept an ambiguous answer. For this reason, I call it the central issue. In our example, the central issue is the question T =‘Who stole the tarts? ’. Resolving the central issue will answer all the other real questions. Recall that by answering T =‘Who stole the tarts? ’, we necessarily will have answered B =‘Did or did not Alice steal the tarts? ’ As one ascends the real sublattice, the questions become more and more ambiguous. For example, the question AN ∪ KN will narrow down the inquiry, resolving whether it was Alice or the Knave, but not necessarily ruling out that no one stole the tarts. 17 4.4 Duality between the Assertions and Questions The question lattice Q is formed by taking the ordered set of down-sets of the assertion lattice, which can be represented by the map O : A → Q, so that Q = O(A). The join-irreducible questions J(Q) are the ideal questions, which by themselves form a lattice that is isomorphic to the assertion lattice A, which can be represented by the map J : A → Q. Thus we have two isomorphisms Q ∼ O(A) and A ∼ J(Q). This correspondence, called Birkhoff ’s Representation Theorem [16], holds for all finite ordered sets A. The lattice Q is called the dual of J(Q), and the lattice A is called the dual of O(A), so that the assertion lattice and the question lattice are dual to each other. This is of course a different notion of duality than I introduced earlier. What is surprising is that the join-irreducible map takes lattice products to sums of lattices, so that the map J acts like a logarithm, whereas the map O acts like the exponential function [16]. 4.5 The Geometry of Questions There are some interesting relationships between the lattice of questions and geometric constructs based on simplexes. A simplex is the simplest possible polytope in a space of given dimension. In zero dimensions, a 0-simplex is a point. A 1-simplex is a line segment, which consists of two 0-simplexes connected by a line. A 2-simplex is a triangle consisting of three 0-simplexes joined by three 1-simplexes, in conjunction with the filled in interior. The 3-simplex is a tetrahedron. Finally the n-simplex is an n-hypertetrahedron. Since, an n − 1 simplex can be used to construct an n-simplex, we can order these simplexes with an ordering relation ‘contains’. For example, if two 0simplexes, A and B, are used to create a 1-simplex AB, we write A ≤ AB and B ≤ AB. We can also define a join of a m-simplex with an n-simplex as a geometric object akin to the set union of the two simplexes. Such an object is called a simplicial complex. The set of all simplicial complexes formed from N distinct 0-simplexes forms the free distributive lattice F D(N ) [31]. We can identify each n-simplex with an ideal question in the question lattice formed by taking the down-set of the join of n assertions. This allows us to set up a one-to-one correspondence between the set of questions and the set of simplicial complexes. The lattice of questions is thus isomorphic to the lattice of simplicial complexes (Figure 4). Another interesting isomorphism can be established. Hypergraphs are graphs with generalized edges that can connect more than a single node. By identifying each n-simplex with an n-hypergraph, a one-to-one correspondence can 18 Fig. 4. The lattice of questions (A) is isomorphic to the lattice of simplicial complexes (B). The atomic questions A, K, and N are isomorphic to the three 0-simplexes. The real questions are isomorphic to the simplicial complexes that include every 0-simplex in the space. Questions are not only related to these geometric constructs, they are also isomorphic to hypergraphs. Since low-order hypergraphs look like low-order simplicial complexes, the lattice of hypergraphs with three generators is almost identical. The only exception is that instead of the 2-simplex at the top, we have the hypergraph connecting the three points (see top of B). be made between simplicial complexes and hypergraphs. Thus, a lattice of hypergraphs can be constructed, which is isomorphic to both the lattice of simplicial complexes and the lattice of questions. The relationship between hypergraphs and information theory was noted by Tony Bell [4]. I will show that the lattice on which Bell’s co-information is a valuation is precisely the question lattice [34]. 4.6 The Inquiry Calculus The algebra of questions provides us with the operations with which questions can be manipulated. Given two questions, we can form the common question 19 and the joint question using the join and meet respectively. Inclusion on the lattice Q indicates whether one question is answered by another. We now extend this algebra to a calculus by generalizing inclusion on this lattice to degrees of inclusion represented by real numbers just as we did on the Boolean lattice. Consider two questions, one of which I will call an outstanding issue I, and the other an inquiry Q. The degree to which the issue I is resolved (or answered) by the inquiry Q is a measure of the relevance of the inquiry to the issue. This is expressed mathematically by defining, d(I|Q) = z(I, Q), (25) which is explicitly written as d(I|Q) =     1 if I ≥ Q    (relevance) 0 if I ∧ Q = ⊥       d otherwise, where 0 < d < 1. (26) If the degree is low, then the inquiry has little relevance to the issue. If it is zero, the inquiry does not resolve the issue, and thus is not relevant. For this reason, I call this degree the relevance 9 , which I will write as d(I|Q) = z(I, Q). This can be read as the degree to which I is answered by Q, which is the same as the degree to which I includes Q on the lattice. In practice one would most likely work with real questions and compute quantities like d(T |B), which is the degree to which asking ‘Did or did not Alice steal the tarts? ’ resolves ‘Who stole the tarts? ’. This quantity d(T |B) measures the relevance of the question B to the issue T . The rules of the calculus are straightforward, since they were developed earlier for distributive lattices in general and applied to the assertions in the form of 9 This degree was called ‘bearing’ by Cox, and Robert Fry adopted the symbol b, which is an upside-down p to reflect the relationship with probability. Caticha suggested the name ‘relevance’ since its Latin origin would make it more accessible to speakers whose native language was not English. To reflect the relationship between the relevance d and the probability p as valuations equivalent to the generalized zeta function z defined on their respective lattices, I have chosen to define the relevance as d, which reverses the order of the arguments from Fry’s function b. Thus d(I|Q) ≡ b(Q|I). The practical difficulty in notation arises from the fact that when working with probability p(x|i), we hold the premise i constant and vary the hypothesis x; whereas when working with relevance d(I|Q), we typically hold the issue I constant and vary the question Q. This is due to the fact that our desired calculations on the question lattice are ‘upside-down’ with respect to those on the assertion lattice (a fact which actually led to Cox defining the algebra backwards in his 1979 paper). For this reason, the function b may be a more intuitive operational notation. 20 probability. There is the sum rule for the relevance of a question Q to the join of two questions X ∨ Y d(X ∨ Y |Q) = d(X|Q) + d(Y |Q) − d(X ∧ Y |Q), (27) and its generalization d(X1 ∨ X2 ∨ · · · ∨ Xn |Q) =  d(Xi |Q) − i  d(Xi ∧ Xj |Q) + i<j  d(Xi ∧ Xj ∧ Xk |Q) − · · · , (28) i<j<k the product rule d(X ∧ Y |Q) = Cd(X|Q)d(Y |X ∧ Q), (29) and a Bayes’ theorem analog d(Y |X ∧ Q) = d(Y |Q)d(X|Y ∧ Q) , d(X|Q) (30) where the constant C in the product rule is the value of the relevance d(⊤|X). Relevances, like probabilities, need not be normalized to one. With the rules of the inquiry calculus in hand, and with Rota’s theorem [46, Theorem 1, Corollary 2, p.35], we can take relevances assigned to the joinirreducible elements J(Q) (ideal questions) and compute the relevances between all pairs of questions on the lattice. However, we need an objective means by which to assign these relevances for the ideal questions. 4.7 Entropy from Consistency To assign relevances, we must maintain consistency with both the algebraic properties of the question lattice Q and the probability assignments on the Boolean lattice A. While I outline how this is done below, the detailed proofs will be published elsewhere. Clearly from Rota’s theorem, we need only determine the relevances of the ideal questions. Once those are assigned, the rest follow from the inquiry calculus. I will show that the form of the relevance is uniquely determined by requiring consistency between the probability measure defined on the assertion lattice A and the relevance measure defined on its isomorphic counterpart, the lattice of ideal questions I. I make a single assumption: the degree to which the top question ⊤ answers a join-irreducible question X depends only on the probability of the assertion x from which the question X was generated. That 21 is, given the ideal question X = ↓x d(X|⊤) = H(p(x|⊤)), (31) where H is a function to be determined. In this way, the relevance of the ideal question is required to be consistent with the probability assignments on A. Below I show that the lattice structure and the induced inquiry calculus imposes four constraints sufficient to uniquely determine the functional form of H (up to a constant). This is a conceptual improvement over the three assumptions made by Shannon in the derivation of his entropy [47]. First, the sum rule (27) demands that given three questions X, Y, Q ∈ Q the relevance is additive only when X ∧ Y = ⊥ d(X ∨ Y |Q) = d(X|Q) + d(Y |Q), iff X ∧ Y = ⊥. (additivity) (32) However, in general the sum rule (28) gives a result that is subadditive d(X ∨ Y |Q) ≤ d(X|Q) + d(Y |Q). (subadditivity) (33) This is a result of the generalized sum rule, which includes additional terms to avoid double-counting the overlap between the two questions. Commutativity of the join (L2) requires that d(X1 ∨X2 ∨· · ·∨Xn |Q) = d(Xπ(1) ∨Xπ(2) ∨· · ·∨Xπ(n) |Q) (symmetry) (34) for all permutations (π(1), π(2) · · · , π(n)) of (1, 2, · · · , n). Thus the relevance must be symmetric with respect to the order of the joins. Last, we consider what happens when an assertion f , known to be false, is added to the system. Associated with this assertion f is a question F =↓f ∈ Q. Now consider the relevance d(X1 ∨X2 ∨· · ·∨Xn ∨F |Q). Since f is known to be false, it can be identified with the absurdity ⊥, and the lattice A collapses from 2n+1 to 2n . The associated question F is then identified with F = ↓⊥ = ⊥, where it is understood that the first ⊥ refers to the bottom of the lattice A and the second refers to the bottom of the lattice Q. Since X ∨ ⊥ = X, we require, for consistency, d(X1 ∨X2 ∨· · ·∨Xn ∨F |Q) = d(X1 ∨X2 ∨· · ·∨Xn |Q). (expansibility) (35) This requirement is called expansibility. I now define a partition question as a real question where its set of answers are neatly partitioned. More specifically Definition 2 (Partition Question) A partition question is a real question  P ∈ R formed from the join of a set of ideal questions P = ni=1 Xi where ∀ Xj , Xk ∈ J(Q), Xj ∧ Xk = ⊥ when j = k. 22 I will denote the set of partition questions by P. There are five partition questions in our earlier example: AKN , N ∪ AK, K ∪ AN , A ∪ KN , and A ∪ K ∪ N , which form a lattice isomorphic to the partition lattice Π3 in Figure 1b. For partition questions, the relevance can be easily computed using (32) d( n  Xi |⊤) = i=1 n  H(p(xi |⊤)). (36) i=1 Writing the right-hand side as a function Kn of the n probabilities, we get d( n  Xi |⊤) = Kn (p1 , p2 , · · · , pn ), (37) i=1 where I have written pi = p(xi |⊤). An important result from Aczél et al. [2] states that if this function Kn satisfies additivity (32), subadditivity (33), symmetry (34), and expansibility (35), then the unique form of the function Kn is a linear combination of the Shannon and Hartley entropies Kn (p1 , p2 , · · · , pn ) = a Hm (p1 , p2 , · · · , pn ) + b o Hm (p1 , p2 , · · · , pn ), (38) where a, b are arbitrary non-negative constants, the Shannon entropy [47] is defined as n Hm (p1 , p2 , · · · , pn ) = −  pi log2 pi , (39) i=1 and the Hartley entropy [24] is defined as o Hm (p1 , p2 , · · · , pn ) = log2 N (P ), (40) where N (P ) is the number of non-zero arguments pi . An additional condition suggested by Aczél states that the Shannon entropy is the unique solution if the result is to be small for small probabilities [2]. That is, that the relevance varies continuously as a function of the probability. For the remainder of this work, I will assume that this is the case. This result is important since it rules out the use of other types of entropy, such as Renyi and Tsallis entropies, for the purposes of inference and inquiry as described here. Given these results, it is straightforward to show that the degree to which the top question answers an ideal question (31) can be written as d(X|⊤) = −ap(x|⊤) log2 p(x|⊤), (41) which is proportional to the probability-weighted surprise. With this result in hand, the inquiry calculus enables us to calculate all the other relevances of pairs of questions in the lattice. By requiring consistency with the lattice structure and assuming that the relevance of an ideal question is a continuous 23 function of the probability of its corresponding assertion, we have found that the relevance of a partition question is equal to the Shannon entropy. Thus d(A ∨ K ∨ N |⊤) ∝ −pa log2 pa − pk log2 pk − pn log2 pn , (42) where pa ≡ p(a|⊤), · · · , and we have set the arbitrary constant a = 1. Whereas, d(A ∨ KN |⊤) ∝ −pa log2 pa − pk∨n log2 pk∨n , (43) where pk∨n ≡ p(k ∨ n|⊤). The inquiry calculus allows us to compute the degree to which the question T =‘Who stole the tarts? ’ is answered by the question B =‘Did or did not Alice steal the tarts? ’ by d(T |B) = d(T |B ∧ ⊤) (44) d(B|⊤) d(T |⊤) d(B|⊤) = d(B|T ) d(T |⊤) d(B|⊤) =C , d(T |⊤) = d(B|T ∧ ⊤) where C is the chosen normalization constant for the relevance. By assigning probabilities to the different cases, this is easily computed using the equations (42) and (43) above. The relevance of ⊤ to issues such as AN ∪ KN ≡ AN ∨ KN is even more interesting, since this must be computed using the sum rule d(AN ∨ KN |⊤) = d(AN |⊤) + d(KN |⊤) − d(AN ∧ KN |⊤), (45) which is clearly related to the mutual information between AN and KN I(AN ; KN ) = H(AN ) + H(KN ) − H(AN, KN ), (46) although the information-theoretic notation obscures the conditionality of these measures. Thus the relevance of the common question is related to the mutual information, which describes what information is shared by the two questions. The term d(AN ∧ KN |⊤) is then identified as being related to the joint entropy. In the context of information theory, Cox’s choice in naming the common question and joint question is more clear. However, the inquiry calculus holds new possibilities. Relevances such as d(AN ∨ KN |Q), which are also related to the mutual information (45), can be computed using Bayes’ Theorem as in (44). Furthermore, the relevance of questions comprised of the join of multiple questions must then be computed using the 24 generalized sum rule (28), which is related to the sum rule via the Möbius function for the lattice. Combined with the Shannon entropy for relevance, this leads to the generalized entropy conjectured by Cox as the appropriate measure [14,15]. Furthermore, one can see that this also includes as special cases the multi-information introduced by McGill [42], the entropy space studied by Han [23], and the co-information, rediscovered by Bell [4], which are all valuations on the question lattice [34]. We now have a well-founded generalization of information theory, where the relationships among a set of any number of questions can be quantified. 5 Discussion There are some significant deviations in this work from Cox’s initial explorations. First, the question algebra is the free distributive algebra—not a Boolean algebra as Cox suggested. Cox actually first believed (correctly) that the algebra could not possibly be Boolean [14][pp. 52-3], but then later assumed that complements of questions exist [15][pp. 151-2], which led to the false conclusion that the algebra must be Boolean. This belief, that questions follow a Boolean algebra and therefore possess complements, spread to several early papers following Cox, including two of my own. The second major deviation is that the ordering relation that I have used for questions is reversed from the one implicitly adopted by Cox. This led to a version of the consistency relations in Cox’s work that is reversed from the consistency relations used in order theory, where joins and meets of questions are swapped. This is related to the third deviation, where I have adopted a notation for relevance that is consistent with the function z from which it and probability both derive. Thus, I use d(I|Q) to describe the degree to which the issue I is resolved by the question Q, which is in keeping with the notation for probability where p(x|t) is the degree to which the statement x is implied by the statement t. This function d(I|Q) is equivalent to Fry’s notation b(Q|I) [19]. While the two can be used interchangeably, Fry’s notation may be more intuitive since we are used to varying the quantity before the solidus, as in probability theory, rather than the quantity after. Nevertheless, I would like to stress that Cox’s achievement in penetrating the realm of questions is remarkable—especially considering the historical focus on the logic of assertions and the relative lack of attention paid to questions. It should be noted that several notable exceptions exist (eg. [12,6]), however much of the focus is on syntax and semantics of questions rather than the underlying mathematical structure. Cox’s method for deriving probability theory from consistency has both supporters and critics. It is my experience that many criticisms originate from a lack of understanding of how physicists use symmetries, constraints and consistency to narrow down the form of an unknown function in a problem. In 25 fortunate circumstances, this approach leads to a unique solution, or at least a useful solution, with perhaps an arbitrary constant. In more challenging situations, such as using symmetries to understand the behavior of the strong nuclear force, this approach may only exclude possibilities. Other criticisms seem to focus on details related to probability and plausibility. By deriving these rules for degrees of inclusion defined on distributive lattices, I have taken any disagreement to a wider arena—specifically one where probability and plausibility are no longer the issue. The fact that the application of this methodology of deriving laws from ordering relations [36] leads to probability theory, information theory, geometric probability [45,31], quantum mechanics [11], and perhaps to measures in new spaces [36] suggests that there are important insights to be gained here. It is remarkable that Cox’s insight led him to correctly conjecture the relationship between relevance and probability as a generalized entropy [14,15]. While generalizations to information theory go back to its inception [42], the relationship between entropy and lattices was first recognized by Watanabe [50] and was further studied by Han [23] who used Rota’s Principle of Inclusion and Exclusion to understand these higher-order entropies on a Boolean sublattice. 10 More recently these higher-order entropies reappeared in the form of Bell’s co-informations, which he realized were related to lattice structures [4]. Independently working with Cox’s definition of a question I identified the structure of the question lattice [34], and in this paper I have shown that consistency requirements uniquely determine the relevance measures as being derived from Shannon entropy. Cox’s generalized entropies arise through the application of the sum rule for relevance, which derives from Rota’s InclusionExclusion Principle. The product rule and Bayes Theorem then allow one to compute the relevance of any question on the lattice to any other in terms of ratios of generalized entropies. In retrospect, the relationship between questions and information theory is not surprising. For example, the design of a communication channel can be viewed as designing a question to be asked of a transmitter. Experimental design, which is also a form of question-asking, has relied on entropy [38,17,40]. Active learning has made experimental design an active process and entropy has found a role here as well alongside Bayesian inference [41,39]. Questionasking is also important when searching a space for an optimal solution [27,43]. Generalizations of information theory have found use in searching for causal interactions among a set of variables. Transfer entropy is designed to extend mutual information to address the asymmetric issue of causality [29]. Given two time-series, X and Y , the transfer entropy can be neatly expressed in terms of the relevance of the common question Xi+1 ∨ Yi minus the relevance 10 A special thanks to Aleks Jakulin who pointed me to these two important references while this paper was in review. 26 of Xi ∨ Xi+1 ∨ Yi , where Xi =‘What is the value of the time series X at the time i? ’ Last, Robert Fry has been working to extend the inquiry calculus to cybernetic control [20] and neural network design [18,21]. Given the scope of these applications, it will be interesting to see the implications that a detailed understanding of the inquiry calculus will have on future developments. It is already known that some problems can be solved in both the space of assertions and the space of questions. For example, the Infomax Independent Component Analysis (ICA) algorithm [5] is a machine learning algorithm that was originally derived using information theory. However, by considering a source separation problem in terms of the logical statements describing the physical situation, one can derive ICA using probability theory [32]. The information-theoretic derivation can be interpreted in terms of maximizing the relevance of the common question X ∨ Y , where X =‘What are the recorded signals? ’ and Y =‘How well have we modelled the source activity? ’ This is accomplished by using the prior probability distribution of the source amplitudes to encode how well the sources have been modelled [33]. 11 This notion of encoding answers to questions is important to inductive inquiry. In this paper, I have laid out the relationship between the algebra of a finite set of logical statements and the algebra of their corresponding questions. The Boolean lattice of assertions A gives rise to the free distributive lattice of questions Q as the ordered set of down-sets of the assertion lattice. The joinirreducible elements of the question lattice form a lattice that is isomorphic to the original assertion lattice. Thus the assertion lattice A is dual to the question lattice Q in the sense of Birkhoff’s representation theorem. Furthermore, I showed that the question lattice is isomorphic to both the lattice of simplicial complexes and the lattice of hypergraphs, which connects questions to geometric constructs. By generalizing the zeta function on each lattice, I have demonstrated that their algebras can be generalized to calculi, which effectively enable us to measure statements and questions. Probability theory is the calculus on A, and as such it is literally an extension of Boolean logic. The inquiry calculus, which is analogous to probability theory, relies on a measure called relevance, which describes the degree to which one question answers another. I have shown that the relevance of partition questions is uniquely determined to be related to the Shannon entropy thus ruling out the use of other entropies for use in inference and inquiry. By applying the sum and product rules, I have shown that one can compute generalized entropies, which are generalizations of familiar information-theoretic quantities such as mutual information, and include as special cases McGill’s multi-informations, Cox’s generalized entropies, and Bell’s co-informations. Traditional informa11 Note that complements of questions are used erroneously and unnecessarily in [33], and that the notation differs from that introduced in this present work as described earlier. 27 tion theory is thus only a small part of the inquiry calculus, which now offers new possibilities. An understanding of these fundamental relationships will be essential to utilizing the full power of these mathematical constructs. Acknowledgements This work supported by the NASA IDU/IS/CICT Program and the NASA Aerospace Technology Enterprise. I am deeply indebted to Ariel Caticha, Bob Fry, Carlos Rodrı́guez, Janos Aczél, Ray Smith, Myron Tribus, Ali Abbas, Aleks Jakulin, Jeffrey Jewell, and Bernd Fischer for insightful and inspiring discussions, and many invaluable remarks and comments. References [1] Aczél J. Lectures on Functional Equations and Their Applications. New York:Academic Press, 1966. [2] Aczél J., Forte B. and Ng C.T. Why the Shannon and Hartley entropies are ‘natural’. Adv. Appl. Prob., Vol. 6, pp. 131–146, 1974. [3] Barnabei M. and Pezzoli E. Gian-Carlo Rota on combinatorics. In Möbius functions (ed. J.P.S. Kung), Boston:Birkhauser, pp. 83–104, 1995. [4] Bell A.J. The co-information lattice. Proceedings of the Fifth International Workshop on Independent Component Analysis and Blind Signal Separation: ICA 2003 (eds. S. Amari, A. Cichocki, S. Makino and N. Murata), 2003. [5] Bell A.J. and Sejnowski T.J. An information maximisation approach to blind separation and blind deconvolution, Neural Computation, Vol.7, No. 6, pp. 1129–1159, 1996. [6] Belnap N.D. Jr. & Steel T.B. Jr. The Logic of Questions and Answers, New Haven:Yale Univ. Press, 1976. [7] Bernoulli J. Ars conjectandi, Basel:Thurnisiorum, 1713. [8] Birkhoff G.D. Lattice Theory, Providence:American Mathematical Society, 1967. [9] Boole G. The calculus of logic. Dublin Mathematical Journal, Vol. 3, pp. 183– 198, 1848. [10] Boole G. An investigation of the laws of thought. London:Macmillan, 1854. [11] Caticha A. Consistency, amplitudes and probabilities in quantum theory. Phys. Rev. A, Vol. 57, pp. 1572–1582, 1998. 28 [12] Cohen F. What is a question?, The Monist, Vol. 39, pp. 350–364, 1929. [13] Cox R.T. Probability, frequency, and reasonable expectation. Am. J. Physics, Vol. 14, pp. 1–13, 1946. [14] Cox, R.T. The algebra of probable inference. Baltimore:Johns Hopkins Press, 1961. [15] Cox R.T. Of inference and inquiry. In The Maximum Entropy Formalism (eds. R. D. Levine & M. Tribus). pp. 119–167, Cambridge:MIT Press, 1979. [16] Davey B.A. & Priestley H.A. Introduction to Lattices and Order. Cambridge:Cambridge Univ. Press, 2002. [17] Fedorov V.V. Theory of Optimal Experiments, New York:Academic Press, 1972. [18] Fry R.L. Observer-participant models of neural processing. IEEE Trans. Neural Networks, Vo. 6, pp. 918–928, 1995. [19] Fry R.L. Maximum entropy and Bayesian methods. Electronic course notes (525.475), Johns Hopkins University, 1999. [20] Fry R.L. The engineering of cybernetic systems. In Bayesian Inference and Maximum Entropy Methods in Science and Engineering, Baltimore MD, USA, August 2001 (ed. R. L. Fry). New York:AIP, pp. 497–528, 2002. [21] Fry R.L., Sova R.M. A logical basis for neural network design. In Implementation Techniques: Neural Network Systems Techniques and Applications, Vol. 3, (ed. C. Leondes). London:Academic Press, 1998. [22] Garrett A.J.M Whence the laws of probability? Maximum Entropy and Bayesian Methods. Boise, Idaho, USA, 1997 (eds. G. J. Erickson, J. T. Rychert & C. R. Smith), Dordrecht:Kluwer Academic Publishers, pp. 71–86, 1998. [23] Han T.S. Linear dependence structure of the entropy space. Inform. Contr., Vol. 29, pp. 337-368, 1975. [24] Hartley R.V. Transmission of information. Bell System Tech. J., Vol. 7, pp. 535–563, 1928. [25] Jaynes E.T. Prior Probabilities. IEEE Trans. Syst. Sci. Cyb. Vol. SSC-4, pp. 227–, 1968. [26] Jaynes E.T. Where do we stand on maximum entropy? In The Maximum Entropy Formalism (eds. R. D. Levine & M. Tribus), pp. 15–118, Cambridge:MIT Press, 1979. [27] Jaynes E.T. Entropy and search theory. In Maximum Entropy and Bayesian Methods in Inverse Problems (eds. C. R. Smith & W. T. Grandy, Jr.), Dordrecht:Reidel, pp. 443–, 1985. [28] Jaynes E.T. Probability theory: the logic of science. Cambridge:Cambridge Univ. Press, 2003. 29 [29] Kaiser A., Schreiber T. Information transfer in continuous processes, Physica D, Vol. 166, pp. 43–62, 2002. [30] Keynes J.M. A treatise on probability. London:Macmillan, 1921. [31] Klain D.A. & Rota G.-C. Introduction to geometric probability. Cambridge:Cambridge Univ. Press, 1997. [32] Knuth K.H. A Bayesian approach to source separation. In Proceedings of the First International Workshop on Independent Component Analysis and Signal Separation: ICA’99 (eds. J.-F. Cardoso, C. Jutten and P. Loubaton), Aussois, France, pp. 283–288, 1999. [33] Knuth K.H. Source separation as an exercise in logical induction. In Bayesian Inference and Maximum Entropy Methods in Science and Engineering, Paris 2000 (ed. A. Mohammad-Djafari), AIP Conference Proceedings Vol. 568, Melville NY:American Institute of Physics, pp. 340–349, 2001. [34] Knuth K.H. What is a question? In Bayesian Inference and Maximum Entropy Methods in Science and Engineering, Moscow ID, USA, August 2002 (ed. C. Williams). AIP Conference Proceedings Vol. 659, Melville NY:AIP, , pp. 227– 242, 2002. [35] Knuth K.H. Intelligent machines in the 21st century: foundations of inference and inquiry, Phil. Trans. Roy. Soc. Lond. A, Vol. 361, No. 1813, pp. 2859-2873, 2003. [36] Knuth K.H. Deriving laws from ordering relations. In press: Bayesian Inference and Maximum Entropy Methods in Science and Engineering, Jackson Hole WY, USA, August 2003 (ed. G. J. Erickson). AIP Conference Proceedings Vol. 707, Melville NY:AIP, , pp. 204–235, 2004. [37] Laplace P.S. Théorie analytique des probabilités. Paris:Courcier Imprimeur, 1812. [38] Lindley D.V. On the measure of information provided by an experiment. Ann. Math. Statist. Vol. 27, pp. 986–1005, 1956. [39] Loredo T.J. Bayesian adaptive exploration. In press: Bayesian Inference and Maximum Entropy Methods in Science and Engineering, Jackson Hole WY, USA, August 2003 (ed. G. J. Erickson). AIP Conference Proceedings Vol. 707, Melville NY:AIP, 2004. [40] Luttrell S.P. The use of transinformation in the design of data sampling schemes for inverse problems. Inverse Problems Vol. 1, pp. 199–218, 1985. [41] MacKay D.J.C. Information-based objective functions for active data selection. Neural Computation Vol. 4 No. 4, pp. 589–603, 1992. [42] McGill W.J. Multivariate information transmission. IEEE Trans. Info. Theory, Vol. 4, pp. 93–111, 1955. 30 [43] Pierce J.G. A new look at the relation between information theory and search theory. In The Maximum Entropy Formalism (eds. R. D. Levine & M. Tribus), pp. 339–402, Cambridge:MIT Press, 1979. [44] Rota G.-C. On the foundations of combinatorial theory I. Theory of Möbius functions. Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete, Vol. 2, pp. 340–368, 1964. [45] Rota G.-C. Geometric probability. The Mathematical Intelligencer, Vol. 20, pp. 11–16, 1998. [46] Rota G.-C. On the combinatorics of the Euler characteristic. In Studies in Pure Mathematics (Presented to Richard Rado), London:Academic Press, pp. 221–233, 1971. [47] Shannon C.E. and Weaver W. A mathematical theory of communication. Chicago:Univ. of Illinois Press, 1949. [48] Smith C.R. and Erickson G.J. Probability theory and the associativity equation. In Maximum Entropy and Bayesian Methods, (ed. P. Fougere). Dordrecht:Kluwer, pp. 17–30, 1990. [49] Tribus M. Rational Descriptions, Decisions and Designs, New York:Pergamon Press, 1969. [50] Watanabe S. Knowing and Guessing, New York:Wiley, 1969. APPENDIX: Deriving the rules of the calculi In this appendix, I derive the sum rule, product rule, and Bayes’ theorem analog for distributive lattices. These rules are equally applicable to probability on the Boolean lattice of assertions A, to relevance on the free distributive lattice of questions Q, as well as any other distributive lattice. The first derivation of these rules was by Cox for complemented distributive lattices (Boolean lattices) [13,14]. The derivations rely on maintaining consistency between the proposed calculus and the properties of the algebra. In Cox’s derivations, he relied on consistency with complementation to obtain the sum rule, and consistency with associativity of the meet to obtain the product rule. The derivation of Bayes’ theorem is, in contrast, well-known since it follows directly from commutativity of the meet. An interesting variation of Cox’s derivation for Boolean algebra relying on a single algebraic operation (NAND) was introduced by Anthony Garrett [22]. Below, I expound on the derivations introduced by Ariel Caticha, which rely on associativity and distributivity [11]. The implications of Caticha’s derivation are profound, since his results imply that the sum rule, the product rule, and Bayes’ theorem are consistent with distributive lattices in general. These implications are discussed in greater detail elsewhere [36]. 31 Consistency with Associativity Consider a distributive lattice D, two join-irreducible elements, a, b ∈ J(D), where a ∧ b = ⊥, and a third element t ∈ D such that a ∧ t = ⊥ and b ∧ t = ⊥. We begin by introducing a degree of inclusion (see Eqn. 6) represented by the function φ, so that the degree to which a includes t is given by φ(a, t). We would like to be able to compute the degree to which the join a ∨ b includes t. In terms of probability, this is the degree to which t implies a ∨ b. Since a ∧ b = ⊥, this degree of inclusion can only be a function of φ(a, t) and φ(b, t), which can be written as φ(a ∨ b, t) = S(φ(a, t), φ(b, t)). (A-1) The function S, will tell us how to use φ(a, t) and φ(a, t) to compute φ(a∨b, t). The hope is that the consistency constraint will be sufficient to identify the form of S—we will see that this is the case. The function S must maintain consistency with the distributive algebra D. Consider another join-irreducible element c ∈ J(D) where a∧c = ⊥, b∧c = ⊥, and form the element (a ∨ b) ∨ c. We can use associativity of the lattice to write this element two ways (a ∨ b) ∨ c = a ∨ (b ∨ c). (A-2) Consistency requires that each expression gives the same result when the degree of inclusion is calculated S(φ(a ∨ b, t), φ(c, t)) = S(φ(a, t), φ(b ∨ c, t)). (A-3) Applying S to the arguments φ(a ∨ b, t) and φ(b ∨ c, t) above, we get S(S(φ(a, t), φ(b, t)), φ(c, t)) = S(φ(a, t), S(φ(b, t), φ(c, t))). (A-4) This can be further simplified by letting u = φ(a, t), v = φ(b, t), and w = φ(c, t) resulting in a functional equation for the function S, which Aczél called the associativity equation [1, pp. 253-273]. S(S(u, v), w) = S(u, S(v, w)). (A-5) The general solution [1], is S(u, v) = f (f −1 (u) + f −1 (v)), (A-6) where f is an arbitrary function. This is simplified by letting g = f −1 g(S(u, v)) = g(u) + g(v). 32 (A-7) Writing this in terms of the original expressions we get, g(φ(a ∨ b, t)) = g(φ(a, t)) + g(φ(b, t)), (A-8) which reveals that there exists a function g : R → R re-mapping these numbers to a more convenient representation. Defining z(a, t) ≡ g(φ(a, t)) we get z(a ∨ b, t) = z(a, t) + z(b, t), (A-9) which is the sum rule for the join of two join-irreducible elements. This rule can be extended to all elements in D by using the Möbius function for the lattice, or equivalently the inclusion-exclusion relation, which avoids double-counting the elements in the calculation [31,44,3,36]. This leads to the generalized sum rule for the join of two arbitrary elements z(a ∨ b, t) = z(a, t) + z(b, t) − z(a ∧ b, t), (A-10) and z(x1 ∨ x2 ∨ · · · ∨ xn , t) =  i z(xi , t) −  z(xi ∧ xj , t) + i<j  z(xi ∧ xj ∧ xk , t) − · · · (A-11) i<j<k for the join of multiple arbitrary elements x1 , x2 , . . . , xn [36]. Consistency with Distributivity Given x, y, t ∈ D, we would like to be able to compute the degree to which the meet x ∧ y includes t, written z(x ∧ y, t). We can easily use (A-10) to obtain z(x ∧ y, t) = z(x, t) + z(y, t) − z(x ∨ y, t). (A-12) However, another form can be found by requiring consistency with distributivity D1. Following Cox [13,14], and relying on the consistency arguments given by Jaynes [28], Tribus [49], and Smith and Erickson [48], this degree can be written two ways as a function P of two arguments z(x ∧ y, t) = P (z(x, t), z(y, x ∧ t)) = P (z(y, t), z(x, y ∧ t)), (A-13) where the function P will tell us how to do the calculation. The two expressions on the right are a consequence of commutativity, which we will address later. We will focus for now on the first expression of P , and consider five elements a, b, r, s, t ∈ D where a ∧ b = ⊥ and r ∧ s = ⊥. By considering distributivity 33 D1 of the meet over the join, we can write a ∧ (r ∨ s) two ways a ∧ (r ∨ s) = (a ∧ r) ∨ (a ∧ s). (A-14) Consistency with distributivity D1 requires that their relevances calculated these two ways are equal. Using the sum rule (A-10) and the form of P (A13), distributivity requires that P (z(a, t), z(r ∨ s, a ∧ t)) = z(a ∧ r, t) + z(a ∧ s, t), (A-15) which simplifies to P (z(a, t), z(r, a ∧ t) + z(s, a ∧ t)) = P (z(a, t), z(r, a ∧ t)) + P (z(a, t), z(s, a ∧ t)). (A-16) Defining u = z(a, t), v = z(r, a ∧ t), and w = z(s, a ∧ t), the equation above can be written as P (u, v + w) = P (u, v) + P (u, w). (A-17) This functional equation for P captures the essence of distributivity D1. We will now show that P (u, v + w) is linear in its second argument. Defining k = w + v, and writing (A-17) as P (u, k) = P (u, v) + P (u, w), (A-18) we can compute the second derivative with respect to x. Using the chain rule we find that ∂k ∂ ∂ ∂k ∂ ∂ ∂ = = = = , (A-19) ∂v ∂v ∂k ∂k ∂w ∂k ∂w so that the second derivative with respect to k can be written as ∂ ∂ ∂2 = . ∂k 2 ∂v ∂w (A-20) The second derivative of P (u, k) with respect to k is then easily shown to be ∂2 P (u, k) = 0, ∂k 2 (A-21) which implies that the function P is linear in its second argument P (u, v) = A(u)v + B(u), (A-22) where A and B are functions to be determined. Substitution of (A-22) into (A-17) gives B(u) = 0. Now we consider (a ∨ b) ∧ r, which using D1 can be written as (a ∨ b) ∧ r = (a ∧ r) ∨ (b ∧ r), 34 (A-23) gives a similar functional equation P (v + w, u) = P (v, u) + P (w, u), (A-24) where u = z(r, t), v = z(a, r ∧t), w = z(b, r ∧t). Following the approach above, we see that P is also linear in its first argument P (u, v) = A(v)u. (A-25) Together with (A-22), the general solution is P (u, v) = Cuv, (A-26) where C is an arbitrary constant. Thus we have the product rule z(x ∧ y, t) = Cz(x, t)z(y, x ∧ t), (A-27) which tells us the degree to which t includes the meet x ∧ y. The constant C acts as a normalization factor, and is necessary when these valuations are normalized to values other than unity. It should be noted that this only satisfies the distributivity of the meet over the join D1. There are reasons why D1 is preferred over D2, which are discussed elsewhere [36]. Consistency with Commutativity Commutativity of the meet is the reason that there are two forms for the function P , the product rule (A-13) z(x ∧ y, t) = Cz(x, t)z(y, x ∧ t) (A-28) z(y ∧ x, t) = Cz(y, t)z(x, y ∧ t). Equating the degrees (A-28) and (A-29) results in (A-29) and Cz(x, t)z(y, x ∧ t) = Cz(y, t)z(x, y ∧ t), (A-30) which leads to Bayes’ theorem z(y, x ∧ t) = z(y, t)z(x, y ∧ t) . z(x, t) (A-31) This demonstrates that there is a sum rule, a product rule, and a Bayes’ Theorem analog for bi-valuations defined this way on all distributive lattices. This realization clears up the mystery as to why some quantities in science act like probabilities, but clearly are not probabilities [36]. 35