Bayesian Classification Theory

Technical Report FIA-90-12-7-01


Robin Hanson John Stutz Peter Cheeseman
Sterling Software NASA RIACS∗
Artificial Intelligence Research Branch
NASA Ames Research Center, Mail Stop 244-17
Moffett Field, CA 94035, USA
Email: <last-name>@ptolemy.arc.nasa.gov

∗ Research Institute for Advanced Computer Science

Abstract

The task of inferring a set of classes and class descriptions most likely to explain a given data set can be placed on a firm theoretical foundation using Bayesian statistics. Within this framework, and using various mathematical and algorithmic approximations, the AutoClass system searches for the most probable classifications, automatically choosing the number of classes and the complexity of class descriptions. A simpler version of AutoClass has been applied to many large real data sets, has discovered new independently-verified phenomena, and has been released as a robust software package. Recent extensions allow attributes to be selectively correlated within particular classes, and allow classes to inherit, or share, model parameters through a class hierarchy. In this paper we summarize the mathematical foundations of AutoClass.

1 Introduction

The task of supervised classification - i.e., learning to predict class memberships of test cases given labeled training cases - is a familiar machine learning problem. A related problem is unsupervised classification, where training cases are also unlabeled. Here one tries to predict all features of new cases; the best classification is the least "surprised" by new cases. This type of classification, related to clustering, is often very useful in exploratory data analysis, where one has few preconceptions about what structures new data may hold.

We have previously developed and reported on AutoClass [Cheeseman et al., 1988a; Cheeseman et al., 1988b], an unsupervised classification system based on Bayesian theory. Rather than just partitioning cases, as most clustering techniques do, the Bayesian approach searches in a model space for the "best" class descriptions. A best classification optimally trades off predictive accuracy against the complexity of the classes, and so does not "overfit" the data. Such classes are also "fuzzy"; instead of each case being assigned to a class, a case has a probability of being a member of each of the different classes.

AutoClass III, the most recent released version, combines real and discrete data, allows some data to be missing, and automatically chooses the number of classes from first principles. Extensive testing has indicated that it generally produces significant and useful results, but is primarily limited by the simplicity of the models it uses, rather than, for example, by inadequate search heuristics. AutoClass III assumes that all attributes are relevant, that they are independent of each other within each class, and that classes are mutually exclusive. Recent extensions, embodied in AutoClass IV, let us relax two of these assumptions, allowing attributes to be selectively correlated and to have more or less relevance via a class hierarchy.

This paper summarizes the mathematical foundations of AutoClass, beginning with the Bayesian theory of learning, and then applying it to increasingly complex classification problems, from various single class models up to hierarchical class mixtures. For each problem, we describe our assumptions in words and mathematics, and then give the resulting evaluation and estimation functions for comparing models and making predictions. The derivations of these results from these assumptions, however, are not given.
2 Bayesian Learning

Bayesian theory gives a mathematical calculus of degrees of belief, describing what it means for beliefs to be consistent and how they should change with evidence. This section briefly reviews that theory, describes an approach to making it tractable, and comments on the resulting tradeoffs. In general, a Bayesian agent uses a single real number to describe its degree of belief in each proposition of interest. This assumption, together with some other assumptions about how evidence should affect beliefs, leads to the standard probability axioms. This result was originally proved by Cox [Cox, 1946] and has been reformulated for an AI audience [Heckerman, 1990]. We now describe this theory.

2.1 Theory

Let E denote some evidence that is known or could potentially be known to an agent; let H denote a hypothesis specifying that the world is in some particular state; and let the sets of possible evidence E and possible states of the world H each be mutually exclusive and exhaustive sets. For example, if we had a coin that might be two-headed the possible states of the world might be "ordinary coin", "two-headed coin". If we were to toss it once the possible evidence would be "lands heads", "lands tails".

In general, P(ab|cd) denotes a real number describing an agent's degree of belief in the conjunction of propositions a and b, conditional on the assumption that propositions c and d are true. The propositions on either side of the conditioning bar "|" can be arbitrary Boolean expressions. More specifically, π(H) is a "prior" describing the agent's belief in H before, or in the absence of, seeing evidence E, π(H|E) is a "posterior" describing the agent's belief after observing some particular evidence E, and L(E|H) is a "likelihood" embodying the agent's theory of how likely it would be to see each possible evidence combination E in each possible world H.

To be consistent, beliefs must be non-negative, 0 ≤ P(a|b) ≤ 1, and normalized, so that Σ_H π(H) = 1 and Σ_E L(E|H) = 1. That is, the agent is sure that the world is in some state and that some evidence will be observed. The likelihood and the prior together give a "joint" probability J(EH) ≡ L(E|H) π(H) of both E and H. Normalizing the joint gives Bayes' rule, which tells how beliefs should change with evidence;

    \pi(H|E) = \frac{J(EH)}{\sum_H J(EH)} = \frac{L(E|H)\,\pi(H)}{\sum_H L(E|H)\,\pi(H)} .

When the set of possible Hs is continuous, the prior π(H) becomes a differential dπ(H), and the sums over H are replaced by integrals. Similarly, continuous Es have a differential likelihood dL(E|H), though any real evidence ∆E will have a finite probability ∆L(E|H) ≈ (dL(E|H)/dE) ∆E.

In theory, all an agent needs to do in any given situation is to choose a set of states H, an associated likelihood function describing what evidence is expected to be observed in those states, a set of prior expectations on the states, and then collect some relevant evidence. Bayes' rule then specifies the appropriate posterior beliefs about the state of the world, which can be used to answer most questions of interest. An agent can combine these posterior beliefs with its utility over states U(H), which says how much it prefers each possible state, to choose an action A which maximizes its expected utility

    EU(A) = \sum_H U(H)\,\pi(H|EA) .
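As a concrete illustration of this calculus, the following minimal Python sketch (ours, not part of the original report) applies Bayes' rule to the coin example above; the prior and likelihood numbers are assumptions chosen only for the demonstration.

    # A minimal sketch of Bayes' rule for the two-state coin example.
    # The prior and likelihood numbers below are illustrative assumptions.

    priors = {"ordinary coin": 0.9, "two-headed coin": 0.1}          # pi(H)
    likelihoods = {                                                  # L(E|H)
        "ordinary coin":   {"lands heads": 0.5, "lands tails": 0.5},
        "two-headed coin": {"lands heads": 1.0, "lands tails": 0.0},
    }

    def posterior(evidence):
        """Return pi(H|E) by normalizing the joint J(EH) = L(E|H) * pi(H)."""
        joint = {h: likelihoods[h][evidence] * priors[h] for h in priors}
        total = sum(joint.values())                                  # sum_H J(EH)
        return {h: j / total for h, j in joint.items()}

    print(posterior("lands heads"))   # ordinary coin ~0.82, two-headed coin ~0.18
    print(posterior("lands tails"))   # tails rules out the two-headed coin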
2.2 Practice

In practice this theory can be difficult to apply, as the sums and integrals involved are often mathematically intractable. So one must use approximations. Here is our approach.

Rather than consider all possible states of the world, we focus on some smaller space of models, and do all of our analysis conditional on an assumption S that the world really is described by one of the models in our space. As with most modeling, this assumption is almost certainly false, but it makes the analysis tractable. With time and effort we can make our models more complex, expanding our model space in order to reduce the effect of this simplification.

The parameters which specify a particular model are split into two sets. First, a set of discrete parameters T describe the general form of the model, usually by specifying some functional form for the likelihood function. For example, T might specify whether two variables are correlated or not, or how many classes are present in a classification. Second, free variables in this general form, such as the magnitude of the correlation or the relative sizes of the classes, constitute the remaining continuous model parameters V.

We generally prefer a likelihood¹ L(E|VTS) which is mathematically simple and yet still embodies the kinds of complexity we believe to be relevant. Similarly, we prefer a simple prior distribution dπ(VT|S) over this model space, allowing the resulting V integrals, described below, to be at least approximated. A prior that predicts the different parameters in V independently, through a product of terms for each different parameter, often helps. We also prefer the prior to be as broad and uninformative as possible, so our software can be used in many different problem contexts, though in principle we could add specific domain knowledge through an appropriate prior. Finally we prefer a prior that gives nearly equal weight to different levels of model complexity, resulting in a "significance test". Adding more parameters to a model then induces a cost, which must be paid for by a significantly better fit to the data before the more complex model is preferred.

Sometimes the integrable priors are not broad enough, containing meta-parameters which specify some part of model space to focus on, even though we have no prior expectations about where to focus. In these cases we "cheat" and use simple statistics collected from the evidence we are going to use, to help set these priors². For example, see Sections 4.2, 4.5.

¹ Note that when a variable like V sits in a probability expression where a proposition should be, it stands for a proposition that the variable has a particular value.
² This is cheating because the prior is supposed to be independent of evidence.

The joint can now be written as dJ(EVT|S) = L(E|VTS) dπ(VT|S) and, for a reasonably-complex problem, is usually a very rugged distribution in VT, with an immense number of sharp peaks distributed widely over a huge high-dimensional space. Because of this we despair of directly normalizing the joint, as required by Bayes' rule, or of communicating the detailed shape of the posterior distribution.

Instead we break the continuous V space into regions R surrounding each sharp peak, and search until we tire for combinations RT for which the "marginal" joint

    M(ERT|S) \equiv \int_{V \in R} dJ(EVT|S)

is as large as possible. The best few such "models" RT are then reported, even though it is usually almost certain that more probable models remain to be found. Each model RT is reported by describing its marginal joint M(ERT|S), its discrete parameters T, and estimates of typical values of V in the region R, like the mean estimate of V:

    E(V|ERTS) \equiv \frac{\int_{V \in R} V \, dJ(EVT|S)}{M(ERT|S)}

or the V for which dJ(EVT|S) is maximum in R. While these estimates are not invariant under reparameterizations of the V space, and hence depend on the syntax with which the likelihood was expressed, the peak is usually sharp enough that such differences don't matter.

Reporting only the best few models is usually justified, since the models weaker than this are usually many orders of magnitude less probable than the best one. The main reason for reporting models other than the best is to show the range of variation in the models, so that one can judge how different the better, not yet found, models might be.

The decision to stop searching for better models RT than the current best can often be made in a principled way by using estimates of how much longer it would take to find a better model, and how much better that model would be.

If the fact that a data value is unknown might be informative, one can model "unknown" as just another possible (discrete) data value; otherwise the likelihood for an unknown value is just a sum over the possible known values.

To make predictions with these resulting models, a reasonable approximation is to average the answer from the best few peaks, weighted by the relative marginal joints. Almost all of the weight is usually in the best few, justifying the neglect of the rest.
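The prediction-averaging step just described can be sketched as follows; since marginal joints typically differ by many orders of magnitude, the weights are best computed in log space. The models and numbers below are illustrative assumptions, not output of AutoClass.

    import math

    # Hypothetical reported models: each has a log marginal joint log M(ERT|S)
    # and a prediction for some quantity of interest. Values are made up.
    models = [
        {"log_marginal": -1041.2, "prediction": 3.1},
        {"log_marginal": -1046.8, "prediction": 2.7},
        {"log_marginal": -1060.5, "prediction": 5.0},
    ]

    def averaged_prediction(models):
        """Weight each model's answer by its relative marginal joint."""
        log_m = [m["log_marginal"] for m in models]
        top = max(log_m)                                  # stabilize the exponentials
        weights = [math.exp(lm - top) for lm in log_m]
        total = sum(weights)
        return sum(w / total * m["prediction"] for w, m in zip(weights, models))

    print(averaged_prediction(models))
    # Almost all of the weight comes from the best peak, as noted in the text.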
2.3 Tradeoffs

Bayesian theory offers the advantages of being theoretically well-founded and empirically well-tested [Berger, 1985]. It offers a clear procedure whereby one can almost "turn the crank", modulo doing integrals and search, to deal with any new problem. The machinery automatically trades off the complexity of a model against its fit to the evidence. Background knowledge can be included in the input, and the output is a flexible mixture of several different "answers," with a clear and well-founded decision theory [Berger, 1985] to help one use that output.

Disadvantages include being forced to be explicit about the space of models one is searching in, though this can be good discipline. One must deal with some difficult integrals and sums, although there is a huge literature to help one here. And one must often search large spaces, though most any technique will have to do this, and the joint probability provides a good local evaluation function. Finally, it is not clear how one can take the computational cost of doing a Bayesian analysis into account without a crippling infinite regress.

Some often perceived disadvantages of Bayesian analysis are really not problems in practice. Any ambiguities in choosing a prior are generally not serious, since the various possible convenient priors usually do not disagree strongly within the regions of interest. Bayesian analysis is not limited to what is traditionally considered "statistical" data, but can be applied to any space of models about how the world might be. For a general discussion of these issues, see [Cheeseman, 1990].

We will now illustrate this general approach by applying it to the problem of unsupervised classification.

3 Model Spaces Overview

3.1 Conceptual Overview

In this paper we deal only with attribute-value, not relational, data.³ For example, medical cases might be described by medical forms with a standard set of entries or slots. Each slot could be filled only by elements of some known set of simple values, like numbers, colors, or blood-types. (In this paper, we will only deal with real and discrete attributes.)

³ Nothing in principle prevents a Bayesian analysis of more complex model spaces that predict relational data.

We would like to explain this data as consisting of a number of classes, each of which corresponds to a differing underlying cause for the symptoms described on the form. For example, different patients might fall into classes corresponding to the different diseases they suffer from.

To do a Bayesian analysis of this, we need to make this vague notion more precise, choosing specific mathematical formulas which say how likely any particular combination of evidence would be. A natural way to do this is to say that there are a certain number of classes, that a random patient has a certain probability to come from each of them, and that the patients are distributed independently – once we know all about the underlying classes then learning about one patient doesn't help us learn what any other patient will be like.

In addition, we need to describe how each class is distributed. We need a "single class" model saying how likely any given evidence is, given that we know what class the patient comes from. Thus we build the multi-class model space from some other pre-existing model space, which can be arbitrarily complex. (In fact, much of this paper will be spent describing various single class models.) In general, the more complex each class can be, the less of a need there is to invoke multiple classes to explain the variation in the data.

The simplest way to build a single-class model is to predict each attribute independently, i.e., build it from attribute-specific models. A class has a distribution for each attribute and, if you know the class of a case, learning the values of one attribute doesn't help you predict the value of any other attributes. For real attributes one can use a standard normal distribution, characterized by some specific mean and a variance around that mean. For discrete attributes one can use the standard multinomial distribution, characterized by a specific probability for each possible discrete value.

Up to this point we have described the model space of AutoClass III. AutoClass IV goes beyond this by introducing correlation and inheritance. Correlation is introduced by removing the assumption that attributes are independent within each class. The simplest way to do this is to let all real attributes covary, and let all discrete attributes covary. The standard way for real attributes to covary is the multivariate normal, which basically says that there is some other set of attributes one could define, as linear combinations of the attributes given, which vary independently according to normal distributions. A simple way to let discrete attributes covary is to define one super-attribute whose possible values are all possible combinations of the values of the attributes given.

If there are many attributes, the above ways to add correlation introduce a great many parameters in the models, making them very complex and, under the usual priors, much less preferable than simpler independent models. What we really want are simpler models which only allow partial covariance. About the simplest way to do this is to say that, for a given class, the attributes clump together in blocks of inter-related attributes. All the attributes in a block covary with each other, but not with the attributes in other blocks. Thus we can build a block model space from the covariant model spaces.

Even this simpler form of covariance introduces more parameters than the independent case, and when each class must have its own set of parameters, multiple classes are penalized more strongly. Attributes which are irrelevant to the whole classification, like a medical patient's favorite color, can be particularly costly. To reduce this cost, one can allow classes to share the specification of parameters associated with some of their independent blocks. Irrelevant attributes can then be shared by all classes at a minimum cost.

Rather than allow arbitrary combinations of classes to share blocks, it is simpler to organize the classes as leaves of a tree. Each block can be placed at some node in this tree, to be shared by all the leaves below that node. In this way different attributes can be explained at different levels of an abstraction hierarchy. For medical patients the tree might have "viral infections" near the root, predicting fevers, and some more specific viral disease near the leaves, predicting more disease-specific symptoms. Irrelevant attributes like favorite-color would go at the root.

3.2 Notation Summary

For all the models to be considered in this paper, the evidence E will consist of a set of I cases, an associated set K of attributes, of size⁴ K, and case attribute values Xik, which can include "unknown." For example, medical case number 8, described as (age = 23, blood-type = A, . . .), would have X8,1 = 23, X8,2 = A, etc.

⁴ Note we use script letters like K for sets, and matching ordinary letters K to denote their size.

In the next two sections we will describe applications of Bayesian learning theory to various kinds of models which could explain this evidence, beginning with simple model spaces and building more complex spaces from them. We begin with a single class. First, a single attribute is considered, then multiple independent attributes, then fully covariant attributes, and finally selective covariance. In the next section we combine these single classes into class mixtures. Table 1 gives an overview of the various spaces.

Table 1: Model Spaces

  Space | Description         | V          | T         | R | Subspaces        | Compute Time
  SD1   | Single Discrete     | ql         |           |   |                  | I
  SR1   | Single Real         | µ σ        |           |   |                  | I
  SI    | Independent Attrs   | Vk         |           |   | S1 ≡ SD1 or SR1  | IK
  SD    | Covariant Discrete  | ql1l2...   |           |   |                  | IK
  SR    | Covariant Real      | µk Σkk'    |           |   |                  | (I + K)K²
  SV    | Block Covariance    | Vb         | B Kb      |   | SB ≡ SD or SR    | NK(IKb + Kb²)
  SM    | Flat Class Mixture  | αc Vc      | C         | R | SC ≡ SI or SV    | NKC(IKb + Kb²)
  SH    | Tree Class Mixture  | αc Vc      | Jc Kc Tc  | R | SC ≡ SI or SV    | NKC(IKb + Kb²)

For each space S we will describe the continuous parameters V, any discrete model parameters T, normalized likelihoods dL(E|VTS), and priors dπ(VT|S). Most spaces have no discrete parameters T, and only one region R, allowing us to usually ignore these parameters. Approximations to the resulting marginals M(ERT|S) and estimates E(V|ERTS) will be given, but not derived. These will often be given in terms of general functions F, so that they may be reused later on. As appropriate, comments will be made about algorithms and computational complexity. All of the likelihood functions considered here assume the cases are independent, i.e.,

    L(E|VTS) = \prod_i L(E_i|VTS)

so we need only give L(Ei|VTS) for each space, where Ei ≡ {Xi1, Xi2, Xi3, . . . , XiK}.
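In code, this case-independence assumption simply means that log likelihoods add across cases. The following small sketch is ours; the single-attribute "model" passed in is a stand-in, not one of the model spaces defined below.

    import math

    def log_likelihood(evidence, per_case_likelihood, params):
        """log L(E|VTS) = sum_i log L(E_i|VTS) under the case-independence assumption.

        `per_case_likelihood` is whatever single-case model is in use; the one
        used below is only an illustrative stand-in.
        """
        return sum(math.log(per_case_likelihood(case, params)) for case in evidence)

    # Illustrative: a single discrete attribute with value probabilities q_l.
    def discrete_case_likelihood(case, q):
        return q[case]

    evidence = ["heads", "heads", "tails", "heads"]     # four cases, one attribute
    q = {"heads": 0.7, "tails": 0.3}
    print(log_likelihood(evidence, discrete_case_likelihood, q))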
4 Single Class Models

4.1 Single Discrete Attribute - SD1

A discrete attribute k allows only a finite number of possible values l ∈ [1, 2, ..., L] for any Xi. "Unknown" is usually treated here as just another possible value. A set of independent coin tosses, for example, might have L = 3 with l1 = heads, l2 = tails, and l3 = "unknown". We make the assumption SD1 that there is only one discrete attribute, and that the only parameters are the continuous parameters V = q1 . . . qL consisting of the likelihoods L(Xi|V SD1) = q_{l=Xi} for each possible value l. In the coin example, q1 = .7 would say that the coin was so "unbalanced" that it has a 70 percent chance of coming up heads each time.

There are only L − 1 free parameters since normalization requires Σ_l ql = 1. For this likelihood, all that matters from the data are the number of cases with each value⁵, Il = Σ_i δ_{Xi,l}. In the coin example, I1 would be the number of heads. Such sums are called "sufficient statistics" since they summarize all the information relevant to a model.

⁵ Note that δuv denotes 1 when u = v and 0 otherwise.

We choose a prior

    d\pi(V|S_{D1}) = dB(q_1 \ldots q_L \,|\, L) \equiv \frac{\Gamma(aL)}{\Gamma(a)^L} \prod_l q_l^{\,a-1}\, dq_l

which for a > 0 is a special case of a beta distribution [Berger, 1985] (Γ(y) is the Gamma function [Spiegel, 1968]). This formula is parameterized by a, a "hyperparameter" which can be set to different values to specify different priors. Here we set a = 1/L. This simple problem has only one maximum, whose marginal is given by

    M(E|S_{D1}) = F_1(I_1, \ldots, I_L, I, L) \equiv \frac{\Gamma(aL)\, \prod_l \Gamma(I_l + a)}{\Gamma(aL + I)\, \Gamma(a)^L} .

We have abstracted the function F1, so we can refer to it later. The prior above was chosen because it has a form similar to the likelihood (and is therefore a "conjugate" prior), and to make the following mean estimate of ql particularly simple

    E(q_l|E S_{D1}) = F_2(I_l, I, L) \equiv \frac{I_l + a}{I + aL} = \frac{I_l + \frac{1}{L}}{I + 1}

for a = 1/L. F2 is also abstracted out for use later. Note that while F2(Il, I, L) is very similar to the classical estimate of Il/I, F2 is defined even when I = 0. Using a hash table, these results can be computed in order I numerical steps, independent of L.

4.2 Single Real Attribute - SR1

Real attribute values Xi specify a small range of the real line, with a center xi and a precision, ∆xi, assumed to be much smaller than other scales of interest. For example, someone's weight might be measured as 70 ± 1 kilograms. For scalar attributes, which can only be positive, like weight, it is best to use the logarithm of that variable [Aitchison and Brown, 1957].

For SR1, where there is only one real attribute, we assume the standard normal distribution, where the sufficient statistics are the data mean x̄ = (1/I) Σ_i xi, the geometric mean precision ∆x̄ = (∏_i ∆xi)^{1/I}, and the standard deviation s given by s² = (1/I) Σ_i (xi − x̄)². V consists of a model mean µ and deviation σ, and the likelihood is given by the standard normal distribution.

    dL(x_i|V S_{R1}) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{1}{2}\left(\frac{x_i - \mu}{\sigma}\right)^2} dx_i .

For example, people's weight might be distributed with a mean of 80 kilograms and a deviation of 15. Since all real data have a finite width, we replace dx with ∆x to approximate the likelihood ∆L(Xi|V SR1) = ∫_{∆x} dL(xi|V SR1) = (∆xi/dxi) dL(xi|V SR1).

As usual, we choose priors that treat the parameters in V independently.

    d\pi(V|S_{R1}) = d\pi(\mu|S_{R1})\, d\pi(\sigma|S_{R1})

We choose a prior on the mean to be flat in the range of the data,

    d\pi(\mu|S_{R1}) = dR(\mu \,|\, \mu_+, \mu_-)

where µ+ = max xi, µ− = min xi, by using the general uniform distribution

    dR(y \,|\, y_+, y_-) \equiv \frac{dy}{y_+ - y_-} \quad \text{for } y \in [y_-, y_+].

A flat prior is preferable because it is non-informative, but note that in order to make it normalizable we must cheat and use information from the data to cut it off at some point. In the single attribute case, we can similarly choose a flat prior in log(σ).

    d\pi(\sigma|S_{R1}) = dR(\log\sigma \,|\, \log\Delta\mu,\ \log \min_i \Delta x_i)

where ∆µ = µ+ − µ−. The posterior again has just one peak, so there is only one region R, and the resulting marginal is

    M(E|S_{R1}) = \frac{\sqrt{\pi}\;\Gamma(\frac{I-1}{2})}{2\,(\pi I)^{I/2}} \cdot \frac{1}{\log(\Delta\mu / \min_i \Delta x_i)} \cdot \frac{\overline{\Delta x}^{\,I}}{s^{I-1}\,\Delta\mu} .

Note that this joint is dimensionless. The estimates are simply E(µ|ESR1) = x̄, and E(σ|E) = √(I/(I+1)) s. Computation here takes order I steps, used to compute the sufficient statistics.
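Because the functions F1 and F2 of Section 4.1 are reused throughout the rest of the paper, a small Python sketch of them may help; it works in log space via the log-Gamma function to avoid overflow. The coin counts below are illustrative, and this is our sketch rather than AutoClass code.

    from math import lgamma, exp

    def log_F1(counts, L):
        """log F1(I_1,...,I_L, I, L): the marginal M(E|S_D1) with a = 1/L."""
        a = 1.0 / L
        I = sum(counts)
        return (lgamma(a * L) - lgamma(a * L + I)
                + sum(lgamma(n + a) for n in counts)
                - L * lgamma(a))

    def F2(I_l, I, L):
        """F2(I_l, I, L): mean estimate of q_l, defined even when I = 0."""
        a = 1.0 / L
        return (I_l + a) / (I + a * L)

    # Coin example with L = 3 values (heads, tails, "unknown"); counts are made up.
    counts = [7, 3, 0]
    I = sum(counts)
    print(exp(log_F1(counts, L=3)))          # marginal joint of this single-attribute model
    print([F2(n, I, 3) for n in counts])     # estimated q_l for each value; they sum to 1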
4.3 Independent Attributes - SI

We now introduce some notation for collecting sets of indexed terms like Xik. A single such term inside a {} will denote the set of all such indexed terms collected across all of the indices, like i and k in E = {Xik} ≡ {Xik such that i ∈ [1, . . . , I], k ∈ K}. To collect across only some of the indices we use ∪_k, as in Ei = ∪_k Xik ≡ {Xi1, Xi2, . . .}, all the evidence for a single case i.

The simplest way to deal with cases having multiple attributes is to assume SI that they are all independent, i.e., treating each attribute as if it were a separate problem. In this case, the parameter set V partitions into parameter sets Vk = ∪_{l} q_{lk} or [µk, σk], depending on whether that k is discrete or real. The likelihood, prior, and joint for multiple attributes are all simple products of the results above for one attribute: S1 = SD1 or SR1 — i.e.,

    L(E_i|V S_I) = \prod_k L(X_{ik}|V_k S_1),

    d\pi(V|S_I) = \prod_k d\pi(V_k|S_1),

and

    M(E|S_I) = \prod_k J(E(k)|S_1)

where E(k) ≡ ∪_i Xik, all the evidence associated with attribute k. The estimates E(Vk|ESI) = E(Vk|E(k)S1) are exactly the same. Computation takes order IK steps here.

4.4 Fully Covariant Discretes - SD

A model space SD which allows a set K of discrete attributes to fully covary (i.e., contribute to a likelihood in non-trivial combinations) can be obtained by treating all combinations of base attribute values as particular values of one super-attribute, which then has L = ∏_k Lk values — so L can be a very large number! V consists of terms like q_{l1 l2 ...lK}, indexed by all the attributes. Il generalizes to

    I_{l_1 l_2 \ldots l_K} = \sum_i \prod_k \delta_{x_{ik} l_k} .

Given this transformation, the likelihoods, etc. look the same as before:

    L(E_i|V S_D) = q_{l_1 l_2 \ldots l_K}, \quad \text{where each } l_k = X_{ik},

    d\pi(V|S_D) = dB(\{q_{l_1 l_2 \ldots l_K}\} \,|\, L),

    M(E|S_D) = F_1(\{q_{l_1 l_2 \ldots l_K}\} \text{ counts } \{I_{l_1 l_2 \ldots l_K}\}, I, L) = F_1(\{I_{l_1 l_2 \ldots l_K}\}, I, L), \quad \text{and}

    E(q_{l_1 l_2 \ldots l_K}|E S_D) = F_2(I_{l_1 l_2 \ldots l_K}, I, L).⁶

Computation takes order IK steps here. This model could, for example, use a single combined hair-color eye-color attribute to allow a correlation between people being blond and blue-eyed.

⁶ F1 and F2 are defined in Section 4.1.

4.5 Fully Covariant Reals - SR

If we assume SR that a set K of real-valued attributes follow the multivariate normal distribution, we replace the σk² above with a model covariance matrix Σkk' and sk² with a data covariance matrix

    S_{kk'} = \frac{1}{I} \sum_i (x_{ik} - \overline{x}_k)(x_{ik'} - \overline{x}_{k'}) .

The Σkk' must be symmetric, with Σkk' = Σk'k, and "positive definite", satisfying Σ_{kk'} yk Σkk' yk' > 0 for any vector yk. The likelihood for a set of attributes K is⁷

    dL(E_i|V S_R) = dN(E_i, \{\mu_k\}, \{\Sigma_{kk'}\}, K) \equiv \frac{e^{-\frac{1}{2} \sum_{kk'} (x_k - \mu_k)\, \Sigma^{\mathrm{inv}}_{kk'}\, (x_{k'} - \mu_{k'})}}{(2\pi)^{K/2}\, |\Sigma_{kk'}|^{1/2}} \prod_k dx_k ,

where dN is the multivariate normal in K dimensions.

⁷ Σ^{inv}_{ab} denotes the matrix inverse of Σab, satisfying Σ_b Σ^{inv}_{ab} Σbc = δac, and |Σab| denotes the determinant of the matrix of components {Σab}.

As before, we choose a prior that takes the means to be independent of each other, and independent of the covariance

    d\pi(V|S_R) = d\pi(\{\Sigma_{kk'}\}|S_R)\, \prod_k d\pi(\mu_k|S_{R1}),

so the estimates of the means remain the same, E(µk|ESR) = x̄k. We choose the prior on Σkk' to use an inverse Wishart distribution [Mardia et al., 1979]

    d\pi(\{\Sigma_{kk'}\}|S_R) = dW^{\mathrm{inv}}_K(\{\Sigma_{kk'}\} \,|\, \{G_{kk'}\}, h)
        \equiv \frac{|G_{kk'}|^{h/2}\, |\Sigma_{kk'}|^{-\frac{h+K+1}{2}}\, e^{-\frac{1}{2}\sum_{kk'} \Sigma^{\mathrm{inv}}_{kk'} G_{k'k}}}{2^{Kh/2}\, \pi^{K(K-1)/4}\, \prod_{a=1}^{K} \Gamma(\frac{h+1-a}{2})}\ \prod_{k \le k'} d\Sigma_{kk'}

which is normalized (integrates to 1) for h ≥ K and Σkk' symmetric positive definite. This is a "conjugate" prior, meaning that it makes the resulting posterior dπ({Σkk'}|ESR) take the same mathematical form as the prior. This choice makes the resulting integrals manageable, but requires us to choose an h and all the components of Gkk'. We choose h = K to make the prior as broad as possible, and for Gkk' we "cheat" and choose Gkk' = Skk δkk' in order to avoid overly distorting the resulting marginal

    M(E|S_R) = \left( \prod_{a=1}^{K} \frac{\Gamma(\frac{I+h-a}{2})}{\Gamma(\frac{1+h-a}{2})} \right) \frac{|G_{kk'}|^{h/2}}{I^{K/2}\, \pi^{\frac{K(I-1)}{2}}\, |I S_{kk'} + G_{kk'}|^{\frac{I+h-1}{2}}}\ \prod_k \frac{\overline{\Delta x}_k^{\,I}}{\Delta\mu_k}

and estimates

    E(\Sigma_{kk'}|E S_R) = \frac{I S_{kk'} + G_{kk'}}{I + h - K - 2} = \frac{I + \delta_{kk'}}{I - 2}\, S_{kk'} .

If we choose Gkk' too large it dominates the estimates, and if Gkk' is too small the marginal is too small. The compromise above should only overestimate the marginal somewhat, since it in effect pretends to have seen previous data which agrees with the data given. Note that the estimates are undefined unless I > 2. Computation here takes order (I + K)K² steps. At present, we lack a satisfactory way to approximate the above marginal when some values are unknown.
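The covariance bookkeeping above can be sketched in a few lines of Python, assuming numpy and using the choices h = K and Gkk' = Skk δkk' described in the text; the data matrix is random and purely illustrative.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3))          # I = 100 cases, K = 3 real attributes (illustrative)
    I, K = X.shape

    x_bar = X.mean(axis=0)                              # E(mu_k|E S_R) = data means
    S = (X - x_bar).T @ (X - x_bar) / I                 # data covariance S_kk'

    h = K                                               # broadest normalizable prior
    G = np.diag(np.diag(S))                             # "cheating" choice G_kk' = S_kk * delta_kk'

    # Posterior estimate of the model covariance Sigma_kk' (requires I > 2)
    Sigma_est = (I * S + G) / (I + h - K - 2)           # = (I + delta_kk')/(I - 2) * S_kk'
    print(Sigma_est)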
4.6 Block Covariance - SV

Rather than just having either full independence or full dependence of attributes, we prefer a model space SV where some combinations of attributes may covary while others remain independent. This allows us to avoid paying the cost of specifying covariance parameters when they cannot buy us a significantly better fit to the data.

We partition the attributes K into B blocks Kb, with full covariance within each block and full independence between blocks. Since we presently lack a model allowing different types of attributes to covary, all the attributes in a block must be of the same type. Thus real and discrete attributes may not mutually covary.

We are aware of other models of partial dependence, such as the trees of Chow and Liu described in [Pearl, 1988], but choose this approach because it includes the limiting cases of full dependence and full independence.

The evidence E partitions block-wise into E(Kb) (using Ei(K) ≡ ∪_{k∈K} Xik and E(K) ≡ {Ei(K)}), each with its own sufficient statistics; and the parameters V partition into parameters Vb = {q_{l1 l2 ...lK}} or [{Σkk'}, {µk}]. Each block is treated as a different problem, except that we now also have discrete parameters T to specify which attributes covary, by specifying B blocks and the {Kb} attributes in each block. Thus the likelihood

    L(E_i|V T S_V) = \prod_b^{B} L(E_i(K_b)\,|\,V_b S_B)

is a simple product of block terms SB = SD or SR assuming full covariance within each block, and the estimates E(Vb|ET SV) = E(Vb|E(Kb)SB) are the same as before.

We choose a prior which predicts the block structure B{Kb} independently of the parameters Vb within each independent block

    d\pi(V T|S_V) = \pi(B \{K_b\}|S_V)\, \prod_b d\pi(V_b|S_B)

which results in a similarly decomposed marginal

    M(E T|S_V) = \pi(B \{K_b\}|S_V)\, \prod_b M(E(K_b)|S_B) .

We choose a block structure prior

    \pi(B \{K_b\}|S_V) = \frac{1}{K_R\, Z(K_R, B_R)\; K_D\, Z(K_D, B_D)} ,

where K_R is the set of real attributes and B_R is the number of real blocks (and similarly for K_D and B_D). This says that it is equally likely that there will be one or two or three, etc. blocks, and, given the number of blocks, each possible way to group attributes is equally likely. This is normalized using Z(A, U), given by

    Z(A, U) \equiv \sum_{u=1}^{U} (-1)^{u-1}\, \frac{(U - u + 1)^A}{(U - u + 1)!\,(u - 1)!} ,

which gives the number of ways one can partition a set with A elements into U subsets. This prior prefers the special cases of full covariance and full independence, since there are fewer ways to make these block combinations. For example, in comparing the hypothesis that each attribute is in a separate block (i.e., all independent) with the hypothesis that only one particular pair of attributes covary together in a block of size two, this prior will penalize the covariance hypothesis in proportion to the number of such pairs possible. Thus this prior includes a "significance test", so that a covariance hypothesis will only be chosen if the added fit to the data from the extra covariance is enough to overcome this penalty.

Computation here takes order NK(IKb + Kb²) steps, where N is the number of search trials done before quitting, which would be around (K − 1)! for a complete search of the space. Kb is an average, over both the search trials and the attributes, of the block size of real attributes (and unity for discrete attributes).
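The block-structure penalty can be sketched directly from the Z(A, U) formula above; the comparison printed at the end mirrors the all-independent versus one-covarying-pair example in the text. This is our illustration, with a made-up attribute count.

    from math import factorial

    def Z(A, U):
        """Number of ways to partition a set of A elements into U subsets (Section 4.6)."""
        total = sum((-1) ** (u - 1) * (U - u + 1) ** A
                    / (factorial(U - u + 1) * factorial(u - 1))
                    for u in range(1, U + 1))
        return round(total)

    # Example: K_R = 5 real attributes (an illustrative count, not from the paper).
    K_R = 5
    all_independent = Z(K_R, K_R)        # 1 way: every attribute in its own block
    one_pair_covaries = Z(K_R, K_R - 1)  # C(5,2) = 10 ways to choose which pair shares a block
    print(all_independent, one_pair_covaries)
    # Under the block-structure prior above, the pairwise-covariance hypothesis is
    # penalized by the ratio Z(5,4)/Z(5,5) = 10, in proportion to the number of pairs.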
5 Class Mixtures

5.1 Flat Mixtures - SM

The above model spaces SC = SV or SI can be thought of as describing a single class, and so can be extended by considering a space SM of simple mixtures of such classes [D.M.Titterington et al., 1985]. Figure 1 shows how this model, with SC = SI, can fit a set of artificial real-valued data in five dimensions.

[Figure 1: AutoClass III Finds Three Classes. We plot attributes 1 vs. 2, and 3 vs. 4 for an artificial data set. One σ deviation ovals are drawn around the centers of the three classes.]

In this model space the likelihood

    L(E_i|V T S_M) = \sum_c^{C} \alpha_c\, L(E_i|V_c T_c S_C)

sums over products of "class weights" αc, which give the probability that any case would belong to class c of the C classes, and class likelihoods describing how members of each class are distributed. In the limit of large C this model space is general enough to be able to fit any distribution arbitrarily closely, and hence is "asymptotically correct".

The parameters T = [C, {Tc}] and V = [{αc}, {Vc}] combine parameters for each class and parameters describing the mixture. The prior is similarly broken down as

    d\pi(V T|S_M) = F_3(C)\, C!\; dB(\{\alpha_c\}|C)\, \prod_c d\pi(V_c T_c|S_C)

where F3(C) ≡ 6/(π²C²) for C > 0 and is just one arbitrary choice of a broad prior over integers. The αc are treated as if the choice of class were another discrete attribute, except that a C! is added because classes are not distinguishable a priori.

Except in very simple problems, the resulting joint dJ(EVT|S) has many local maxima, and so we must now focus on regions R of the V space. To find such local maxima we use the "EM" algorithm [Dempster et al., 1977], which is based on the fact that at a maximum the class parameters Vc can be estimated from weighted sufficient statistics. Relative likelihood weights

    w_{ic} = \frac{\alpha_c\, L(E_i|V_c T_c S_C)}{L(E_i|V T S_M)}

give the probability that a particular case i is a member of class c. These weights satisfy Σ_c wic = 1, since every case must really belong to one of the classes. Using these weights we can break each case into "fractional cases", assign these to their respective classes, and create new "class data" E_c = ∪_{ik} [Xik, wic] with new weighted-class sufficient statistics obtained by using weighted sums Σ_i wic instead of sums Σ_i. For example, Ic = Σ_i wic, x̄kc = (1/Ic) Σ_i wic xik, I_{l1...lK c} = Σ_i wic ∏_k δ_{xik lk}, and ∆x̄kc = (∏_i ∆xik^{wic})^{1/Ic}. Substituting these statistics into any previous class likelihood function L(E|Vc Tc SC) gives a weighted likelihood L'(E_c|Vc Tc SC) and associated new estimates and marginals.

At the maximum, the weights wic should be consistent with estimates of V = {[αc, Vc]} from E(Vc|ERSM) = E'(Vc|E_c SC) and E(αc|ERSM) = F2(Ic, I, C). To reach a maximum we start out at a random seed and repeatedly use our current best estimates of V to compute the wic, and then use the wic to re-estimate the V, stopping when they both predict each other. Typically this takes 10 − 100 iterations. This procedure will converge from any starting point, but converges more slowly near the peak than second-order methods.

Integrating the joint in R can't be done directly because the product of a sum in the full likelihood is hard to decompose, but if we use fractional cases to approximate the likelihood

    L(E_i|V T R S_M) = \sum_c^{C} \alpha_c\, L(E_i|V_c T_c S_C) \cong \prod_c \left( \alpha_c\, L(E_i|V_c T_c S_C) \right)^{w_{ic}} ,

holding the wic fixed, we get an approximate joint:

    M(ERT|S_M) \cong F_3(C)\, C!\; F_1(\{I_c\}, I, C)\, \prod_c M'(E_c T|S_C) .

Our standard search procedure combines an explicit search in C with a random search in all the other parameters. Each trial begins converging from classes built around C random case pairs. The C is chosen randomly from a log-normal distribution fit to the Cs of the 6 − 10 best trials seen so far, after trying a fixed range of Cs to start. We also have developed alternative search procedures which selectively merge and split classes according to various heuristics. While these usually do better, they sometimes do much worse.

The marginal joints of the different trials generally follow a log-normal distribution, allowing us to estimate during the search how much longer it will take on average to find a better peak, and how much better it is likely to be.

In the simpler model space SMI where SC = SI the computation is order N I C K, where C averages over the search trials. N is the number of possible peaks, out of the immense number usually present, that a computation actually examines. In the covariant space SMV where SC = SV this becomes NKC(IKb + Kb²).
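The EM cycle just described can be sketched for the simplest setting, a flat mixture over a single real attribute: compute the weights wic from the current parameters, form weighted sufficient statistics, and re-estimate. The following Python sketch assumes numpy; the data, class count, and initialization are illustrative, and it omits most of the priors, the precision terms ∆xi, and the marginal evaluation of the full system.

    import numpy as np

    rng = np.random.default_rng(1)
    # Illustrative 1-D data drawn from two overlapping groups.
    x = np.concatenate([rng.normal(-2.0, 1.0, 150), rng.normal(3.0, 1.5, 100)])
    I, C = len(x), 2

    # Random-seed initialization: class means from random cases, unit deviations, equal weights.
    mu = rng.choice(x, C, replace=False)
    sigma = np.ones(C)
    alpha = np.full(C, 1.0 / C)

    def normal_pdf(x, mu, sigma):
        return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)

    for _ in range(100):                 # typically 10-100 iterations suffice
        # w_ic = alpha_c L(E_i|V_c) / sum_c alpha_c L(E_i|V_c)
        joint = alpha * normal_pdf(x[:, None], mu, sigma)       # shape (I, C)
        w = joint / joint.sum(axis=1, keepdims=True)

        # Weighted sufficient statistics: I_c, class means, class deviations.
        I_c = w.sum(axis=0)
        mu = (w * x[:, None]).sum(axis=0) / I_c
        sigma = np.sqrt((w * (x[:, None] - mu) ** 2).sum(axis=0) / I_c)

        # Class weights via F2(I_c, I, C), the mean estimate from Section 4.1.
        a = 1.0 / C
        alpha = (I_c + a) / (I + a * C)

    print(mu, sigma, alpha)   # should roughly recover the two generating groups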
5.2 Class Hierarchy and Inheritance - SH

The above class mixture model space SM can be generalized to a hierarchical space SH by replacing the above set of classes with a tree of classes. Leaves of the tree, corresponding to the previous classes, can now inherit specifications of class parameters from "higher" (closer to the root) classes. For the purposes of the parameters specified at a class, all of the classes below that class pool their weight into one big class. Parameters associated with "irrelevant" attributes are specified independently at the root. Figure 2 shows how a class tree, this time with SC = SV, can better fit the same data as in Figure 1. See [Hanson et al., 1991] for more about this comparison.

[Figure 2: AutoClass IV Finds Class Tree ×10^120 Better. Lists of attribute numbers denote covariant blocks within each class, and the ovals now indicate the leaf classes.]

The tree of classes has one root class r. Every other class c has one parent class Pc, and every class has Jc child classes given by Ccj, where the index j ranges over the children of a class. Each child class has a weight αcj relative to its siblings, with Σ_j^{Jc} αcj = 1, and an absolute weight αCcj = αcj αc, with αr = 1.

While other approaches to inheritance are possible, here each class is given an associated set of attributes Kc, which it predicts independently through a likelihood L(Ei(Kc)|Vc Tc SC) and which no class above or below it predicts. To avoid having redundant trees which describe the same likelihood function, only Kr can be empty, and non-leaves must have Jc ≥ 2.

We need to ensure that all attributes are predicted somewhere at or above each leaf class. So we call Ac the set of attributes which are predicted at or below each class, start with Ar = K, and then recursively partition each Ac into attributes Kc "kept" at that class, and hence predicted directly by it, and the remaining attributes to be predicted at or below each child, ACcj. For leaves Ac = Kc.

Expressed in terms of the leaves the likelihood is again a mixture:

    L(E_i|V T S_H) = \sum_{c\,:\,J_c = 0} \alpha_c \prod_{c' = c,\, P_c,\, P_{P_c},\, \ldots,\, r} L(E_i(K_{c'})\,|\,V_{c'} T_{c'} S_C)

allowing the same EM procedure as before to find local maxima. The case weights here wci = Σ_j^{Jc} wCcj,i (with wri = 1) sum as in the flat mixture case and define class statistics E_c(Kc) = ∪_{k∈Kc, i} [Xik, wci].

We also choose a similar prior, though it must now specify the Kc as well:

    d\pi(V T|S_H) = \prod_c d\pi(J_c K_c \,|\, A_c S_H)\; J_c!\;\, dB(\cup_j \alpha_{cj}|J_c)\;\, d\pi(V_c T_c \,|\, K_c S_C)

    d\pi(J_c K_c \,|\, A_c S_H) = F_3(J_c - 1)\, \frac{K_c!\, (A_c - K_c)!}{(A_c + \delta_{rc})\, A_c!}

for all subsets Kc of Ac of size in the range [1 − δcr, Ac], except that F3(Jc − 1) is replaced by δ0Jc when Ac = Kc. Note that this prior is recursive, as the prior for each class depends on what attributes have been chosen for its parent class.

This prior says that each possible number of attributes kept is equally likely, and given the number to be kept, each particular combination is equally likely. This prior prefers the simpler cases of Kc = Ac and Kc = 1 and so again offers a significance test. In comparing the hypothesis that all attributes are kept at a class with a hypothesis that all but one particular attribute will be kept at that class, this prior penalizes the all-but-one hypothesis in proportion to the number of attributes that could have been kept instead.

The marginal joint becomes

    M(ERT|S_H) \cong \prod_c d\pi(J_c K_c \,|\, A_c S_H)\; J_c!\;\, F_1(\cup_j I_{C_{cj}}, I_c, J_c)\; M'(E_c(K_c)\, T_c|S_C)

and estimates are again E(Vc|ERSH) = E'(Vc|E_c(Kc)SC) and E(αcj|ERSH) = F2(ICcj, Ic, Jc).

In the general case of SHV, where SC = SV, computation again takes NKC(IKb + Kb²), except that the C is now also an average of, for each k, the number of classes in the hierarchy which use that k (i.e., have k ∈ Kc). Since this is usually less than the number of leaves, the model SH is typically cheaper to compute than SM for the same number of leaves.

Searching in this most complex space SHV is challenging. There are a great many search dimensions where one can trade off simplicity and fit to the data, and we have only begun to explore possible heuristics. Blocks can be merged or split, classes can be merged or split, blocks can be promoted or demoted in the class tree, EM iterations can be continued farther, and one can try a random restart to seek a new peak. But even the simplest approaches to searching a more general model space seem to do better than smarter searches of simpler spaces.

6 Conclusion

The Bayesian approach to unsupervised classification describes each class by a likelihood function with some free parameters, and then adds in a few more parameters to describe how those classes are combined. Prior expectations on those parameters VT combine with the evidence E to produce a marginal joint M(ERT|S) which is used as an evaluation function for classifications in a region R near some local maxima of the continuous parameters V and with some choice of discrete model parameters T. This evaluation function optimally trades off the complexity of the model with its fit to the data, and is used to guide an open-ended search for the best classification.

In this paper we have applied this theory to model spaces of varying complexity in unsupervised classification. For each space we provide a likelihood, prior, marginal joint, and estimates. This should provide enough information to allow anyone to reproduce AutoClass, or to use the same evaluation functions in other contexts where these models might be relevant.

References

[Aitchison and Brown, 1957] J. Aitchison and J.A.C. Brown. The Lognormal Distribution. Cambridge at the University Press, 1957.
[Berger, 1985] J. O. Berger. Statistical Decision Theory and Bayesian Analysis. Springer-Verlag, New York, 1985.

[Cheeseman et al., 1988a] P. Cheeseman, J. Kelly, M. Self, J. Stutz, W. Taylor, and D. Freeman. AutoClass: a Bayesian classification system. In Proceedings of the Fifth International Conference on Machine Learning, 1988.

[Cheeseman et al., 1988b] P. Cheeseman, M. Self, J. Kelly, J. Stutz, W. Taylor, and D. Freeman. Bayesian classification. In Seventh National Conference on Artificial Intelligence, pages 607–611, Saint Paul, Minnesota, 1988.

[Cheeseman, 1990] Peter Cheeseman. On finding the most probable model. In Jeff Shrager and Pat Langley, editors, Computational Models of Discovery and Theory Formation, pages 73–96. Morgan Kaufmann, Palo Alto, 1990.

[Cox, 1946] R.T. Cox. Probability, frequency and reasonable expectation. American Journal of Physics, 14(1), 1946.

[Dempster et al., 1977] A.P. Dempster, N.M. Laird, and D.B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. J. Roy. Statist. Soc. B, 39:1–38, 1977.

[D.M.Titterington et al., 1985] D.M. Titterington, A.F.M. Smith, and U.E. Makov. Statistical Analysis of Finite Mixture Distributions. John Wiley & Sons, New York, 1985.

[Hanson et al., 1991] R. Hanson, J. Stutz, and P. Cheeseman. Bayesian classification with correlation and inheritance. In 12th International Joint Conference on Artificial Intelligence, Sydney, Australia, 1991.

[Heckerman, 1990] David Heckerman. Probabilistic interpretations for MYCIN's certainty factors. In Glenn Shafer and Judea Pearl, editors, Readings in Uncertain Reasoning, pages 298–312. Morgan Kaufmann, San Mateo, California, 1990.

[Mardia et al., 1979] K.V. Mardia, J.T. Kent, and J.M. Bibby. Multivariate Analysis. Academic Press Inc., New York, 1979.

[Pearl, 1988] Judea Pearl. Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann, San Mateo, 1988.

[Spiegel, 1968] Murray Spiegel. Mathematical Handbook of Formulas and Tables. McGraw-Hill Book Company, New York, 1968.
