Factorial Hidden Markov Models

Z. Ghahramani and M. I. Jordan

Machine Learning, 1-31 (1997)
© 1997 Kluwer Academic Publishers, Boston. Manufactured in The Netherlands.
1. Introduction
Due to its flexibility and to the simplicity and efficiency of its parameter estimation algorithm, the hidden Markov model (HMM) has emerged as one of the basic statistical tools for modeling discrete time series, finding widespread application in the areas of speech recognition (Rabiner & Juang, 1986) and computational molecular biology (Krogh, Brown, Mian, Sjolander, & Haussler, 1994). An HMM is essentially a mixture model, encoding information about the history of a time series in the value of a single multinomial variable (the hidden state), which can take on one of K discrete values. This multinomial assumption supports an efficient parameter estimation algorithm, the Baum-Welch algorithm, which considers each of the K settings of the hidden state at each time step. However, the multinomial assumption also severely limits the representational capacity of HMMs. For example, to represent 30 bits of information about the history of a time sequence, an HMM would need K = 2^30 distinct states. On the other hand, an HMM with a distributed state representation could achieve the same task with 30 binary state variables (Williams & Hinton, 1991). This paper addresses the problem of constructing efficient learning algorithms for hidden Markov models with distributed state representations.
The need for distributed state representations in HMMs can be motivated in two ways. First, such representations let the model automatically decompose the state space into features that decouple the dynamics of the process that generated the data. Second, distributed state representations simplify the task of modeling time series that are known a priori to be generated from an interaction of multiple, loosely-coupled processes. For example, a speech signal generated by the superposition of multiple simultaneous speakers can potentially be modeled with such an architecture.
Williams and Hinton (1991) first formulated the problem of learning in HMMs with distributed state representations and proposed a solution based on deterministic Boltzmann learning.¹ The approach presented in this paper is similar to Williams and Hinton's in that it can also be viewed from the framework of statistical mechanics and mean field theory. However, our learning algorithm is quite different in that it makes use of the special structure of HMMs with a distributed state representation, resulting in a significantly more efficient learning procedure. Anticipating the results in Section 3, this learning algorithm obviates the need for the two-phase procedure of Boltzmann machines, has an exact M step, and makes use of the forward-backward algorithm from classical HMMs as a subroutine. A different approach comes from Saul and Jordan (1995), who derived a set of rules for computing the gradients required for learning in HMMs with distributed state spaces. However, their methods can only be applied to a limited class of architectures.
Hidden Markov models with distributed state representations are a particular class of probabilistic graphical models (Pearl, 1988; Lauritzen & Spiegelhalter, 1988), which represent probability distributions as graphs in which the nodes correspond to random variables and the links represent conditional independence relations. The relation between hidden Markov models and graphical models has recently been reviewed in Smyth, Heckerman and Jordan (1997). Although exact probability propagation algorithms exist for general graphical models (Jensen, Lauritzen, & Olesen, 1990), these algorithms are intractable for densely-connected models such as the ones we consider in this paper. One approach to dealing with this issue is to utilize stochastic sampling methods (Kanazawa et al., 1995). Another approach, which provides the basis for the algorithms described in the current paper, is to make use of variational methods (cf. Saul, Jaakkola, & Jordan, 1996).
In the following section we define the probabilistic model for factorial HMMs, and in Section 3 we present algorithms for inference and learning. In Section 4 we describe empirical results comparing exact and approximate algorithms for learning on the basis of time complexity and model quality. We also apply factorial HMMs to a real time series data set consisting of the melody lines from a collection of chorales by J. S. Bach. We discuss several generalizations of the probabilistic model in Section 5.
Figure 1. (a) A directed acyclic graph (DAG) specifying conditional independence relations for a hidden Markov model. Each node is conditionally independent from its non-descendants given its parents. (b) A DAG representing the conditional independence relations in a factorial HMM with M = 3 underlying Markov chains.
2. The probabilistic model

We generalize the HMM state representation by letting the state be represented by a collection of state variables

S_t = ( S_t^{(1)}, \ldots, S_t^{(m)}, \ldots, S_t^{(M)} ),

each of which can take on K^{(m)} values. We refer to these models as factorial hidden Markov models, as the state space consists of the cross product of these state variables. For simplicity, we will assume that K^{(m)} = K for all m, although all the results we present can be trivially generalized to the case of differing K^{(m)}. Given that the state space of this factorial HMM consists of all K^M combinations of the S_t^{(m)} variables, placing no constraints on the state transition structure would result in a K^M × K^M transition matrix. Such an unconstrained system is uninteresting for several reasons: it is equivalent to an HMM with K^M states; it is unlikely to discover any interesting structure in the K state variables, as all variables are allowed to interact arbitrarily; and both the time complexity and sample complexity of the estimation algorithm are exponential in M.
We therefore focus on factorial HMMs in which the underlying state transitions are constrained. A natural structure to consider is one in which each state variable evolves according to its own dynamics, and is a priori uncoupled from the other state variables:

P(S_t \mid S_{t-1}) = \prod_{m=1}^{M} P(S_t^{(m)} \mid S_{t-1}^{(m)}).    (3)

A graphical representation for this model is presented in Figure 1 (b). The transition structure for this system can be represented as M distinct K × K matrices. Generalizations that allow coupling between the state variables are briefly discussed in Section 5.
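To make the parameter savings concrete, here is a minimal sketch (an editorial illustration, not part of the original paper; the array layout and names are assumptions) that stores the M uncoupled K × K matrices of equation (3) and evaluates a factored transition probability without ever materializing the K^M × K^M matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
M, K = 3, 2  # three binary chains, as in Figure 1(b)

# One K x K row-stochastic matrix per chain: O(M K^2) parameters instead
# of the K^M x K^M entries of the equivalent unconstrained HMM.
P = [rng.dirichlet(np.ones(K), size=K) for _ in range(M)]
# Convention here: P[m][i, j] = P(S_t^(m) = j | S_{t-1}^(m) = i).

def factored_transition_prob(prev_state, state):
    """Equation (3): the product over chains of per-chain transitions."""
    return np.prod([P[m][prev_state[m], state[m]] for m in range(M)])

print(factored_transition_prob((0, 1, 0), (1, 1, 0)))
```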
As shown in Figure 1 (b), in a factorial HMM the observation at time step t can depend on all the state variables at that time step. For continuous observations, one simple form for this dependence is linear Gaussian; that is, the observation Y_t is a Gaussian random vector whose mean is a linear function of the state variables. We represent the state variables as K × 1 vectors, where each of the K discrete values corresponds to a 1 in one position and 0 elsewhere. The resulting probability density for a D × 1 observation vector Y_t is

P(Y_t \mid S_t) = |C|^{-1/2} (2\pi)^{-D/2} \exp\left\{ -\frac{1}{2} (Y_t - \mu_t)' C^{-1} (Y_t - \mu_t) \right\},    (4a)

where

\mu_t = \sum_{m=1}^{M} W^{(m)} S_t^{(m)}.    (4b)

Each W^{(m)} matrix is a D × K matrix whose columns are the contributions to the means for each of the settings of S_t^{(m)}, C is the D × D covariance matrix, ' denotes matrix transpose, and | · | is the matrix determinant operator.
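A small sketch of this observation model (our illustration; the names and the SciPy dependency are assumptions, not part of the paper):

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(1)
M, K, D = 3, 2, 4
W = [rng.normal(size=(D, K)) for _ in range(M)]  # W^(m): D x K contributions
C = np.eye(D)                                    # D x D observation covariance

def one_hot(k, K):
    v = np.zeros(K)
    v[k] = 1.0
    return v

# Equation (4b): the mean is the sum of one column from each W^(m),
# selected by the one-hot state vectors S_t^(m).
S = [one_hot(rng.integers(K), K) for _ in range(M)]
mu = sum(W[m] @ S[m] for m in range(M))

# Equation (4a): Gaussian density of the observation around that mean.
Y_t = rng.normal(size=D)
print(multivariate_normal(mean=mu, cov=C).pdf(Y_t))
```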
One way to understand the observation model in equations (4a) and (4b) is to consider the marginal distribution for Y_t, obtained by summing over the possible states. There are K settings for each of the M state variables, and thus there are K^M possible mean vectors, obtained by forming sums of M columns where one column is chosen from each of the W^{(m)} matrices. The resulting marginal density of Y_t is thus a Gaussian mixture model, with K^M Gaussian mixture components each having a constant covariance matrix C. This static mixture model, without inclusion of the time index and the Markov dynamics, is a factorial parameterization of the standard mixture of Gaussians model that has interest in its own right (Zemel, 1993; Hinton & Zemel, 1994; Ghahramani, 1995). The model we are considering in the current paper extends this model by allowing Markov dynamics in the discrete state variables underlying the mixture. Unless otherwise stated, we will assume the Gaussian observation model throughout the paper.
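As a concrete check of this count (our worked example, not from the paper), take M = 2 chains with K = 2 states each, and let e_k denote the kth standard basis vector. The achievable means are

\mu \in \left\{ W^{(1)} e_k + W^{(2)} e_{k'} \;:\; k, k' \in \{1, 2\} \right\},

one per choice of columns, giving K^M = 2^2 = 4 mixture components, each with covariance C.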
The hidden state variables at one time step, although marginally independent, become conditionally dependent given the observation sequence. This can be determined by applying the semantics of directed graphs, in particular the d-separation criterion (Pearl, 1988), to the graphical model in Figure 1 (b). Consider the Gaussian model in equations (4a)-(4b). Given an observation vector Y_t, the posterior probability of each of the settings of the hidden state variables is proportional to the probability of Y_t under a Gaussian with mean \mu_t. Since \mu_t is a function of all the state variables, the probability of a setting of one of the state variables will depend on the setting of the other state variables.³ This dependency effectively couples all of the hidden state variables for the purposes of calculating posterior probabilities and makes exact inference intractable for the factorial HMM.
Sampling once from each of the T M hidden variables in the model results in a new sample of the hidden state of the model and requires O(T M K) operations. The sequence of overall states resulting from each pass of Gibbs sampling defines a Markov chain over the state space of the model. Assuming that all probabilities are bounded away from zero, this Markov chain is guaranteed to converge to the posterior probabilities of the states given the observations (Geman & Geman, 1984). Thus, after some suitable time, samples from the Markov chain can be taken as approximate samples from the posterior probabilities. The first-order and second-order statistics needed to estimate ⟨S_t^{(m)}⟩, ⟨S_t^{(m)} S_t^{(n)'}⟩ and ⟨S_{t-1}^{(m)} S_t^{(m)'}⟩ are collected using the states visited, and the probabilities estimated during this sampling process are used in the approximate E step of EM.⁴
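A minimal sketch of one such Gibbs pass (our illustration; the array conventions, function name, and SciPy dependency are assumptions):

```python
import numpy as np
from scipy.stats import multivariate_normal

def gibbs_sweep(states, Y, W, P, pi, C, rng):
    """One pass over all T*M hidden variables.

    states: (T, M) array of current state indices (modified in place).
    Y: (T, D) observations; W: list of M (D, K) matrices;
    P: list of M (K, K) matrices with P[m][i, j] = P(j | i);
    pi: list of M (K,) initial state distributions.
    """
    T, M = states.shape
    K = P[0].shape[0]
    for t in range(T):
        for m in range(M):
            # Mean contribution of all chains except m (cf. equation (4b)).
            mu_rest = sum(W[n][:, states[t, n]] for n in range(M) if n != m)
            logp = np.empty(K)
            for k in range(K):
                # Likelihood of Y_t with chain m set to k, times the
                # transitions into and out of this variable.
                logp[k] = multivariate_normal(mu_rest + W[m][:, k], C).logpdf(Y[t])
                logp[k] += np.log(pi[m][k]) if t == 0 else np.log(P[m][states[t - 1, m], k])
                if t < T - 1:
                    logp[k] += np.log(P[m][k, states[t + 1, m]])
            p = np.exp(logp - logp.max())
            states[t, m] = rng.choice(K, p=p / p.sum())
    return states
```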
where we have made use of Jensen's inequality in the last step. The difference between the left-hand side and the right-hand side of this inequality is given by the Kullback-Leibler divergence (Cover & Thomas, 1991):

KL(Q‖P) = \sum_{\{S_t\}} Q(\{S_t\}) \log \frac{Q(\{S_t\})}{P(\{S_t\} \mid \{Y_t\})}.    (6)
Figure 2. (a) The completely factorized variational approximation assumes that all the state variables are independent (conditional on the observation sequence). (b) The structured variational approximation assumes that the state variables retain their Markov structure within each chain, but are independent across chains.
Q(\{S_t\} \mid \theta) = \prod_{t=1}^{T} \prod_{m=1}^{M} Q(S_t^{(m)} \mid \theta_t^{(m)}).    (7)
The variational parameters, θ = {θ_t^{(m)}}, are the means of the state variables, where, as before, a state variable S_t^{(m)} is represented as a K-dimensional vector with a 1 in the kth position and 0 elsewhere, if the mth Markov chain is in state k at time t. The elements of the vector θ_t^{(m)} therefore define the state occupation probabilities for the multinomial variable S_t^{(m)} under the distribution Q:

Q(S_t^{(m)} \mid \theta_t^{(m)}) = \prod_{k=1}^{K} \left( \theta_{t,k}^{(m)} \right)^{S_{t,k}^{(m)}}, \quad \text{where } S_{t,k}^{(m)} \in \{0, 1\}, \; \sum_{k=1}^{K} S_{t,k}^{(m)} = 1.    (8)
This completely factorized approximation is often used in statistical physics, where it provides the basis for simple yet powerful mean field approximations to statistical mechanical systems (Parisi, 1988).

To make the bound as tight as possible we vary θ separately for each observation sequence so as to minimize the KL divergence. Taking the derivatives of (6) with respect to θ_t^{(m)} and setting them to zero, we obtain the set of fixed point equations (see Appendix C) defined by

\theta_t^{(m)\,\text{new}} = \varphi\left\{ W^{(m)\prime} C^{-1} \tilde{Y}_t^{(m)} - \frac{1}{2} \Delta^{(m)} + (\log P^{(m)})\, \theta_{t-1}^{(m)} + (\log P^{(m)})'\, \theta_{t+1}^{(m)} \right\},    (9a)

where \tilde{Y}_t^{(m)} is the residual error in Y_t given the predictions from all the state variables not including m:

\tilde{Y}_t^{(m)} \equiv Y_t - \sum_{\ell \neq m} W^{(\ell)} \theta_t^{(\ell)},    (9b)
Δ^{(m)} is the vector of diagonal elements of W^{(m)'} C^{-1} W^{(m)}, and φ{·} is the softmax operator, which maps a vector A into a vector B of the same size, with elements

B_i = \frac{\exp\{A_i\}}{\sum_j \exp\{A_j\}},    (10)

and log P^{(m)} denotes the elementwise logarithm of the transition matrix P^{(m)}.
The first term of (9a) is the projection of the error in reconstructing the observation onto the weights of state vector m: the more a particular setting of a state vector can reduce this error, the larger its associated variational parameter. The second term arises from the fact that the second order correlation ⟨S_t^{(m)} S_t^{(m)'}⟩ evaluated under the variational distribution is a diagonal matrix composed of the elements of θ_t^{(m)}. The last two terms introduce dependencies forward and backward in time.⁵ Therefore, although the posterior distribution over the hidden variables is approximated with a completely factorized distribution, the fixed point equations couple the parameters associated with each node with the parameters of its Markov blanket. In this sense, the fixed point equations propagate information along the same pathways as those defining the exact algorithms for probability propagation.
The following may provide an intuitive interpretation of the approximation being made by this distribution. Given a particular observation sequence, the hidden state variables for the M Markov chains at time step t are stochastically coupled. This stochastic coupling is approximated by a system in which the hidden variables are uncorrelated but have coupled means. The variational or "mean-field" equations solve for the deterministic coupling of the means that best approximates the stochastically coupled system.
Each hidden state vector is updated in turn using (9a), with a time complexity of O(T M K^2) per iteration. Convergence is determined by monitoring the KL divergence in the variational distribution between successive iterations; in practice convergence is very rapid (about 2 to 10 iterations of (9a)). Once the fixed point equations have converged, the expectations required for the E step can be obtained as a simple function of the parameters (equations (C.6)-(C.8) in Appendix C).
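A sketch of one sweep of these fixed point equations (our illustration, under the array conventions noted in the docstring; the boundary handling follows note 5):

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def mean_field_pass(theta, Y, W, logpi, logP, Cinv):
    """One sweep of equations (9a)-(9b) over all state vectors.

    theta: (M, T, K) variational means, updated in place.
    Y: (T, D); W: (M, D, K); logpi: (M, K) log priors;
    logP: (M, K, K) with logP[m][k, j] = log P(S_t = k | S_{t-1} = j);
    Cinv: (D, D) inverse covariance.
    """
    M, T, K = theta.shape
    # Delta^(m): diagonal of W^(m)' C^-1 W^(m), for each chain.
    delta = np.stack([np.diag(W[m].T @ Cinv @ W[m]) for m in range(M)])
    for t in range(T):
        for m in range(M):
            # Residual observation with chain m's contribution removed (9b).
            Y_res = Y[t] - sum(W[l] @ theta[l, t] for l in range(M) if l != m)
            a = W[m].T @ Cinv @ Y_res - 0.5 * delta[m]
            # Note 5: the (log P) theta_{t-1} term becomes log pi at t = 1,
            # and the forward term is absent at t = T.
            a += logpi[m] if t == 0 else logP[m] @ theta[m, t - 1]
            if t < T - 1:
                a += logP[m].T @ theta[m, t + 1]
            theta[m, t] = softmax(a)
    return theta
```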
Note that the arguments presented in the previous section did not hinge on the form of the approximating distribution. Therefore, more structured variational approximations can be obtained by using more structured variational distributions Q. Each such Q provides a lower bound on the log likelihood and can be used to obtain a learning algorithm.
We write the structured variational approximation as

Q(\{S_t\} \mid \theta) = \frac{1}{Z_Q} \prod_{m=1}^{M} Q(S_1^{(m)} \mid \theta) \prod_{t=2}^{T} Q(S_t^{(m)} \mid S_{t-1}^{(m)}, \theta),    (11a)

where Z_Q is a normalization constant ensuring that Q integrates to one and

Q(S_1^{(m)} \mid \theta) = \prod_{k=1}^{K} \left( h_{1,k}^{(m)} \pi_k^{(m)} \right)^{S_{1,k}^{(m)}},    (11b)

Q(S_t^{(m)} \mid S_{t-1}^{(m)}, \theta) = \prod_{k=1}^{K} \left( h_{t,k}^{(m)} \sum_{j=1}^{K} P_{k,j}^{(m)} S_{t-1,j}^{(m)} \right)^{S_{t,k}^{(m)}}
    = \prod_{k=1}^{K} \left( h_{t,k}^{(m)} \prod_{j=1}^{K} \left( P_{k,j}^{(m)} \right)^{S_{t-1,j}^{(m)}} \right)^{S_{t,k}^{(m)}},    (11c)

where the last equality follows from the fact that S_{t-1}^{(m)} is a vector with a 1 in one position and 0 elsewhere. The parameters of this distribution are θ = {π^{(m)}, P^{(m)}, h_t^{(m)}}: the original priors and state transition matrices of the factorial HMM and a time-varying bias for each state variable. Comparing equations (11a)-(11c) to equation (1), we can see that the K × 1 vector h_t^{(m)} plays the role of the probability of an observation (P(Y_t | S_t) in (1)) for each of the K settings of S_t^{(m)}. For example, Q(S_{1,j}^{(m)} = 1 \mid \theta) \propto h_{1,j}^{(m)} P(S_{1,j}^{(m)} = 1 \mid \theta) corresponds to having an observation at time t = 1 that under state S_{1,j}^{(m)} = 1 has probability h_{1,j}^{(m)}.
Intuitively, this approximation uncouples the M Markov chains and attaches to each state variable a distinct fictitious observation. The probability of this fictitious observation can be varied so as to minimize the KL divergence between Q and P. Applying the same arguments as before, we obtain a set of fixed point equations for h_t^{(m)} that minimize KL(Q‖P), as detailed in Appendix D:

h_t^{(m)\,\text{new}} = \exp\left\{ W^{(m)\prime} C^{-1} \tilde{Y}_t^{(m)} - \frac{1}{2} \Delta^{(m)} \right\},    (12a)

where Δ^{(m)} is defined as before, and where we redefine the residual error to be

\tilde{Y}_t^{(m)} \equiv Y_t - \sum_{\ell \neq m} W^{(\ell)} \langle S_t^{(\ell)} \rangle.    (12b)
The parameter h_t^{(m)} obtained from these fixed point equations is the observation probability associated with state variable S_t^{(m)} in hidden Markov model m. Using these probabilities, the forward-backward algorithm is used to compute a new set of expectations for ⟨S_t^{(m)}⟩, which are fed back into (12a) and (12b). This algorithm is therefore used as a subroutine in the minimization of the KL divergence.

Note the similarity between equations (12a)-(12b) and equations (9a)-(9b) for the completely factorized system. In the completely factorized system, since ⟨S_t^{(m)}⟩ = θ_t^{(m)}, the fixed point equations can be written explicitly in terms of the variational parameters. In the structured approximation, the dependence of ⟨S_t^{(m)}⟩ on h_t^{(m)} is computed via the forward-backward algorithm. Note also that (12a) does not contain terms involving the prior, π^{(m)}, or transition matrix, P^{(m)}. These terms cancel as a result of our choice of approximation.
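A sketch of this inner loop (ours; the log-space forward-backward routine and all names are illustrative assumptions, under the same array conventions as before):

```python
import numpy as np
from scipy.special import logsumexp

def forward_backward(logpi, logP, logh):
    """Posterior state means for one chain whose per-step 'observation'
    probabilities are the fictitious observations h, in log space.
    logpi: (K,); logP: (K, K) with logP[k, j] = log P(k | j); logh: (T, K)."""
    T, K = logh.shape
    alpha = np.zeros((T, K))
    beta = np.zeros((T, K))
    alpha[0] = logpi + logh[0]
    for t in range(1, T):
        alpha[t] = logh[t] + logsumexp(logP + alpha[t - 1], axis=1)
    for t in range(T - 2, -1, -1):
        beta[t] = logsumexp(logP + (logh[t + 1] + beta[t + 1])[:, None], axis=0)
    gamma = alpha + beta
    return np.exp(gamma - logsumexp(gamma, axis=1, keepdims=True))

def structured_e_step(Y, W, pi, P, Cinv, n_iter=10):
    """Alternate the fixed point equation (12a) for h with exact
    forward-backward within each chain."""
    M = len(W)
    T = Y.shape[0]
    K = W[0].shape[1]
    delta = [np.diag(W[m].T @ Cinv @ W[m]) for m in range(M)]
    ES = [np.full((T, K), 1.0 / K) for _ in range(M)]  # <S_t^(m)>
    for _ in range(n_iter):
        for m in range(M):
            # Residual (12b), for all t at once; rows are time steps.
            Y_res = Y - sum(ES[l] @ W[l].T for l in range(M) if l != m)
            logh = Y_res @ Cinv @ W[m] - 0.5 * delta[m]  # log of (12a)
            ES[m] = forward_backward(np.log(pi[m]), np.log(P[m]), logh)
    return ES
```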
4. Experimental results

To investigate learning and inference in factorial HMMs we conducted two experiments. The first experiment compared the different approximate and exact methods of inference on the basis of computation time and the likelihood of the model obtained from synthetic data. The second experiment sought to determine whether the decomposition of the state space in factorial HMMs presents any advantage in modeling a real time series data set that we might assume to have complex internal structure: Bach's chorale melodies.
Fifteen sets of parameters were generated for each of the four problem sizes. Algorithms were run for a maximum of 100 iterations of EM or until convergence, defined as the iteration k at which the log likelihood L(k), or approximate log likelihood if an approximate algorithm was used, satisfied

L(k) - L(k-1) < 10^{-5} \, (L(k-1) - L(2)).

At the end of learning, the log likelihoods on the training and test set were computed for all models using the exact algorithm. Also included in the comparison was the log likelihood of the training and test sets under the true model that generated the data. The test set log likelihood for N observation sequences is defined as

\sum_{n=1}^{N} \log P(Y_1^{(n)}, \ldots, Y_T^{(n)} \mid \hat{\phi}),

where \hat{\phi} is obtained by maximizing the log likelihood over a training set that is disjoint from the test set. This provides a measure of how well the model generalizes to a novel observation sequence from the same distribution as the training data.
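For concreteness, the stopping rule above reads as follows in code (our sketch; `loglik` is assumed to hold L(1), L(2), ... in order):

```python
def converged(loglik, tol=1e-5):
    """Relative-improvement criterion: stop when the latest EM gain falls
    below tol times the total gain accumulated since iteration 2."""
    if len(loglik) < 3:
        return False
    return loglik[-1] - loglik[-2] < tol * (loglik[-2] - loglik[1])
```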
Results averaged over 15 runs for each algorithm on each of the four problem sizes (a total of 300 runs) are presented in Table 1. Even for the smallest problem size (M = 3 and K = 2), the standard HMM with K^M states suffers from overfitting: the test set log likelihood is significantly worse than the training set log likelihood. As expected, this overfitting problem becomes worse as the size of the state space increases; it is particularly serious for M = 5 and K = 3.

For the factorial HMMs, the log likelihoods for each of the three approximate EM algorithms were compared to the exact algorithm. Gibbs sampling appeared to have the poorest performance: for each of the three smaller size problems its log likelihood was significantly worse than that of the exact algorithm on both the training sets and test sets (p < 0.05). This may be due to insufficient sampling. However, we will soon see that running the Gibbs sampler for more than 10 samples, while potentially improving performance, makes it substantially slower than the variational methods. Surprisingly, the Gibbs sampler appears to do quite well on the largest size problem, although the differences to the other methods are not statistically significant.
The performance of the completely factorized variational approximation was not statistically significantly different from that of the exact algorithm on either the training set or the test set for any of the problem sizes. The performance of the structured variational approximation was not statistically different from that of the exact method on three of the four problem sizes, and appeared to be better on one of the problem sizes (M = 5, K = 2; p < 0.05). Although this result may be a fluke arising from random variability, there is another more interesting (and speculative) explanation. The exact EM algorithm implements unconstrained maximization of F, as defined in Section 3.6, while the variational methods maximize F subject to a constrained distribution. These constraints could presumably act as regularizers, reducing overfitting.
There was a large amount of variability in the final log likelihoods for the models learned by all the algorithms. We subtracted the log likelihood of the true generative model from that of each trained model to eliminate the main effect of the randomly sampled generative model and to reduce the variability due to training and test sets.
Table 1. Comparison of the factorial HMM on four problems of varying size. The negative log likelihood for the training and test set, plus or minus one standard deviation, is shown for each problem size and algorithm, measured in bits per observation (log likelihood in bits divided by NT) relative to the log likelihood under the true generative model for that data set.⁷ True is the true generative model (the log likelihood per symbol is defined to be zero for this model by our measure); HMM is the hidden Markov model with K^M states; Exact is the factorial HMM trained using an exact E step; Gibbs is the factorial HMM trained using Gibbs sampling; CFVA is the factorial HMM trained using the completely factorized variational approximation; SVA is the factorial HMM trained using the structured variational approximation.
M   K   Algorithm   Training Set      Test Set
------------------------------------------------
3   2   True          0.00 ± 0.39       0.00 ± 0.39
        HMM           1.19 ± 0.67       2.29 ± 1.02
        Exact         0.88 ± 0.80       1.05 ± 0.72
        Gibbs         1.67 ± 1.23       1.78 ± 1.22
        CFVA          1.06 ± 1.20       1.20 ± 1.11
        SVA           0.91 ± 1.02       1.04 ± 1.01
------------------------------------------------
3   3   True          0.00 ± 0.19       0.00 ± 0.20
        HMM           0.76 ± 0.67       9.81 ± 2.55
        Exact         1.02 ± 1.04       1.26 ± 0.99
        Gibbs         2.21 ± 0.91       2.50 ± 0.87
        CFVA          1.24 ± 1.50       1.50 ± 1.53
        SVA           0.64 ± 0.88       0.90 ± 0.84
------------------------------------------------
5   2   True          0.00 ± 0.60       0.00 ± 0.57
        HMM           0.83 ± 0.82      11.57 ± 3.71
        Exact         2.29 ± 1.19       2.51 ± 1.21
        Gibbs         3.25 ± 1.17       3.35 ± 1.14
        CFVA          1.73 ± 1.34       2.07 ± 1.74
        SVA           1.34 ± 1.07       1.53 ± 1.05
------------------------------------------------
5   3   True          0.00 ± 0.30       0.00 ± 0.29
        HMM          -4.80 ± 0.52     175.35 ± 84.74
        Exact         4.23 ± 2.28       4.49 ± 2.24
        Gibbs         3.63 ± 1.13       3.95 ± 1.14
        CFVA          4.85 ± 0.68       5.14 ± 0.69
        SVA           3.99 ± 1.57       4.30 ± 1.62
[Figure 3: four panels, (a)-(d), each plotting negative log likelihood (bits, 0-5) against iterations of EM (0-100).]

Figure 3. Learning curves for five runs of each of the four learning algorithms for factorial HMMs: (a) exact; (b) completely factorized variational approximation; (c) structured variational approximation; and (d) Gibbs sampling. A single training set sampled from the M = 3, K = 2 problem size was used for all these runs. The solid lines show the negative log likelihood per observation (in bits) relative to the true model that generated the data, calculated using the exact algorithm. The circles denote the point at which the convergence criterion was met and the run ended. For the three approximate algorithms, the dashed lines show an approximate negative log likelihood.⁸
One important remaining source of variance was the random seed used in each training run, which determined the initial parameters and the samples used in the Gibbs algorithm. All algorithms appeared to be very sensitive to this random seed, suggesting that different runs on each training set found different local maxima or plateaus of the likelihood (Figure 3). Some of this variability could be eliminated by explicitly adding a regularization term, which can be viewed as a prior on the parameters in maximum a posteriori parameter estimation. Alternatively, Bayesian (or ensemble) methods could be used to average out this variability by integrating over the parameter space.
[Figure 4: bar chart of time per iteration of EM against state space size K^M, for (M, K) = (3, 2), (3, 3), (5, 2), (5, 3), (10, 2), and (15, 2); legend: HMM, Exact fHMM, Gibbs fHMM.]

Figure 4. Time per iteration of EM on a Silicon Graphics R4400 processor running Matlab.

The timing comparisons confirm the fact that both the standard HMM and the exact E step factorial HMM are extremely slow for models with large state spaces (Figure 4). Gibbs sampling was slower than the variational methods even when limited
to ten samples of each hidden variable per iteration of EM.⁶ Since one pass of the variational fixed point equations has the same time complexity as one pass of Gibbs sampling, and since the variational fixed point equations were found to converge very quickly, these experiments suggest that Gibbs sampling is not as competitive time-wise as the variational methods. The time per iteration for the variational methods scaled well to large state spaces.
To approximate the E step for factorial HMMs we used the structured variational approximation. This choice was motivated by three considerations. First, for the size of state space we wished to explore, the exact algorithms were prohibitively slow. Second, the Gibbs sampling algorithm did not appear to present any advantages in speed or performance, and it required some heuristic method for determining the number of samples. Third, theoretical arguments suggest that the structured approximation should in general be superior to the completely factorized variational approximation, since more of the dependencies of the original model are preserved.
The test set log likelihoods for the HMMs, shown in Figure 5 (a), exhibited the typical U-shaped curve demonstrating a trade-off between bias and variance (Geman, Bienenstock, & Doursat, 1992). HMMs with fewer than 10 states did not predict well, while HMMs with more than 40 states overfit the training data and therefore provided a poor model of the test data. Out of the 75 runs, the highest test set log likelihood per observation was 9.0 bits, obtained by an HMM with 30 hidden states.⁹
The factorial HMM provides a more satisfactory model of the chorales from three points of view. First, the time complexity is such that it is possible to consider models with significantly larger state spaces; in particular, we fit models with up to 1000 states. Second, given the componential parametrization of the factorial HMM, these large state spaces do not require excessively large numbers of parameters relative to the number of data points. In particular, we saw no evidence of overfitting even for the largest factorial HMM, as seen in Figures 5 (c) and (d). Finally, this approach resulted in significantly better predictors; the test set likelihood for the best factorial HMM was an order of magnitude larger than the test set likelihood for the best HMM, as Figure 5 (d) reveals.
While the factorial HMM is clearly a better predictor than a single HMM, it
should be acknowledged that neither approach produces models that are easily
interpretable from a musicological point of view. The situation is reminiscent of
that in speech recognition, where HMMs have proved their value as predictive
models of the speech signal without necessarily being viewed as causal generative
models of speech. A factorial HMM is clearly an impoverished representation of
musical structure, but its promising performance as a predictor provides hope that it could serve as a step on the way toward increasingly structured statistical models for music and other complex multivariate time series.

[Figure 5: four panels plotting test set log likelihood per event (5-15 bits) against size of state space (1-100 for panel (a); 1-1000 for panels (b)-(d)).]

Figure 5. Test set log likelihood per event of the Bach chorale data set as a function of the number of states for (a) HMMs, and factorial HMMs with (b) K = 2, (c) K = 3, and (d) K = 4 (heavy dashed line) and K = 5 (solid line). Each symbol represents a single run; the lines indicate the mean performances. The thin dashed line in (b)-(d) indicates the log likelihood per observation of the best run in (a). The factorial HMMs were trained using the structured approximation. For both methods the true likelihood was computed using the exact algorithm.
... \exp\left\{ \sum_{m=1}^{M} W^{(m)} S_t^{(m)} \right\}. Like the Gaussian model, this specification is again compact, using M matrices of size D × K. (As in the linear-Gaussian model, the W^{(m)} are overparametrized, since they can each model the overall mean of Y_t, as shown in Appendix A.) While the nonlinearity induced by the softmax function makes both the E step and M step of the algorithm more difficult, iterative numerical methods can be used in the M step, whereas Gibbs sampling and variational methods can again be used in the E step (see Neal, 1992; Hinton et al., 1995; and Saul et al., 1996, for discussions of different approaches to learning in sigmoid networks).
... the others as outputs or response variables. The goal, in these cases, is to model the conditional density of the output sequence given the input sequence. In machine learning terminology, unconditional density estimation is unsupervised while conditional density estimation is supervised.
Several algorithms for learning in hidden Markov models that are conditioned on inputs have recently been presented in the literature (Cacciatore & Nowlan, 1994; Bengio & Frasconi, 1995; Meila & Jordan, 1996). Given a sequence of input vectors {X_t}, the probabilistic model for an input-conditioned factorial HMM is

P(\{S_t, Y_t\} \mid \{X_t\}) = \prod_{m=1}^{M} P(S_1^{(m)} \mid X_1)\, P(Y_1 \mid S_1^{(m)}, X_1) \prod_{t=2}^{T} \prod_{m=1}^{M} P(S_t^{(m)} \mid S_{t-1}^{(m)}, X_t)\, P(Y_t \mid S_t^{(m)}, X_t).    (14)

The model depends on the specification of P(Y_t | S_t^{(m)}, X_t) and P(S_t^{(m)} | S_{t-1}^{(m)}, X_t), which are conditioned both on a discrete state variable and on a (possibly continuous) input vector. The approach used in Bengio and Frasconi's Input-Output HMMs (IOHMMs) suggests modeling P(S_t^{(m)} | S_{t-1}^{(m)}, X_t) as K separate neural networks, one for each setting of S_{t-1}^{(m)}. This decomposition ensures that a valid probability transition matrix is defined at each point in input space if a sum-to-one constraint (e.g., softmax nonlinearity) is used in the output of these networks.
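A minimal sketch of this decomposition (an editorial illustration, not from the cited papers; each "network" here is a single linear-softmax layer, the simplest form satisfying the sum-to-one constraint):

```python
import numpy as np

def softmax(a, axis=-1):
    e = np.exp(a - a.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class TransitionNets:
    """K networks for one chain m: network j maps the input x_t to the
    distribution P(S_t^(m) = . | S_{t-1}^(m) = j, x_t)."""

    def __init__(self, K, input_dim, rng):
        self.A = rng.normal(scale=0.1, size=(K, K, input_dim))  # per-network weights
        self.b = np.zeros((K, K))                               # per-network biases

    def transition_matrix(self, x):
        # Row j is the softmax output of network j, so every row sums to
        # one: a valid transition matrix at each point in input space.
        return softmax(self.A @ x + self.b, axis=1)

rng = np.random.default_rng(2)
nets = TransitionNets(K=3, input_dim=5, rng=rng)
print(nets.transition_matrix(rng.normal(size=5)).sum(axis=1))  # -> [1. 1. 1.]
```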
Using the decomposition of each conditional probability into K networks, inference in input-conditioned factorial HMMs is a straightforward generalization of the algorithms we have presented for factorial HMMs. The exact forward-backward algorithm in Appendix B can be adapted by using the appropriate conditional probabilities. Similarly, the Gibbs sampling procedure is no more complex when conditioned on inputs. Finally, the completely factorized and structured approximations can also be generalized readily if the approximating distribution has a dependence on the input similar to the model's. If the probability transition structure P(S_t^{(m)} | S_{t-1}^{(m)}, X_t) is not decomposed as above, but has a complex dependence on the previous state variable and input, inference may become considerably more complex.
Depending on the form of the input conditioning, the Maximization step of learning may also change considerably. In general, if the output and transition probabilities are modeled as neural networks, the M step can no longer be solved exactly and a gradient-based generalized EM algorithm must be used. For log-linear models, the M step can be solved using an inner loop of iteratively reweighted least-squares (McCullagh & Nelder, 1989).
6. Conclusion

In this paper we have examined the problem of learning for a class of generalized hidden Markov models with distributed state representations. This generalization provides both a richer modeling tool and a method for incorporating prior structural information about the state variables underlying the dynamics of the system generating the data. Although exact inference in this class of models is generally intractable, we provided a structured variational approximation that can be computed tractably. This approximation forms the basis of the Expectation step in an EM algorithm for learning the parameters of the model. Empirical comparisons to several other approximations and to the exact algorithm show that this approximation is both efficient to compute and accurate. Finally, we have shown that ...
Appendix A

The M step

The M step equations for each parameter are obtained by setting the derivatives of Q with respect to that parameter to zero. We start by expanding Q using equations (1)-(4b):
"
Q= 1X T
Y 0 C 1Yt 2 X Y 0 C 1W (m) hS (m) i
M
2 t=1 t m=1
t t
M n #
X m)0 io
M X
(m)0 n
+ tr W C 1W (n)hSt St
( ) (
m=1 n=1
X
M
(m)0
XT X M n 0 o
+ hS1 i log (m) + tr (logP (m) )hSt(m1) St(m) i logZ; (A.1)
m=1 t=2 m=1
where tr is the trace operator for square matrices and Z is a normalization term
independent of the states and observations ensuring that the probabilities sum to
one.
Setting the derivatives of Q with respect to the output weights to zero, we obtain a linear system of equations for the W^{(m)}:

\frac{\partial Q}{\partial W^{(m)}} = \sum_{t=1}^{T} \Big[ \sum_{n=1}^{M} W^{(n)} \langle S_t^{(n)} S_t^{(m)\prime} \rangle - Y_t \langle S_t^{(m)\prime} \rangle \Big] = 0.    (A.2)

Assuming Y_t is a D × 1 vector, let S_t be the MK × 1 vector obtained by concatenating the S_t^{(m)} vectors, and let W be the D × MK matrix obtained by concatenating the W^{(m)} matrices (of size D × K). Then solving (A.2) results in

W^{\text{new}} = \Big( \sum_{t=1}^{T} Y_t \langle S_t' \rangle \Big) \Big( \sum_{t=1}^{T} \langle S_t S_t' \rangle \Big)^{\dagger},    (A.3)

where † is the Moore-Penrose pseudo-inverse. Note that the model is overparameterized, since the D × 1 means of each of the W^{(m)} matrices add up to a single mean. Using the pseudo-inverse removes the need to explicitly subtract this overall mean from each W^{(m)} and estimate it separately as another parameter.
To estimate the priors, we solve ∂Q/∂π^{(m)} = 0 subject to the constraint that they sum to one, obtaining

\pi^{(m)\,\text{new}} = \langle S_1^{(m)} \rangle.    (A.4)
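In code, the weight and prior updates (A.3)-(A.4) amount to a single pseudo-inverse solve (our sketch, using the concatenated layout defined above; the names are assumptions):

```python
import numpy as np

def m_step_W(Y, ES, ESS):
    """Equation (A.3) for the concatenated weight matrix W (D x MK).

    Y: (T, D) observations; ES: (T, M*K) concatenated <S_t>;
    ESS: (T, M*K, M*K) per-step second moments <S_t S_t'>."""
    YS = Y.T @ ES                    # sum_t Y_t <S_t>'  -> (D, M*K)
    SS = ESS.sum(axis=0)             # sum_t <S_t S_t'>  -> (M*K, M*K)
    return YS @ np.linalg.pinv(SS)   # Moore-Penrose pseudo-inverse

def m_step_priors(ES1, M, K):
    """Equation (A.4): the new prior for chain m is simply <S_1^(m)>."""
    return [ES1[m * K:(m + 1) * K] for m in range(M)]
```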
Appendix B

Exact forward-backward algorithm

Here we specify an exact forward-backward recursion for computing the posterior probabilities of the hidden states in a factorial HMM. It differs from a straightforward application of the forward-backward algorithm on the equivalent K^M state HMM in that it does not depend on a K^M × K^M transition matrix. Rather, it makes use of the independence of the underlying Markov chains to sum over M transition matrices of size K × K.
Using the notation \{Y\}_t^r to mean the observation sequence Y_t, \ldots, Y_r, we define

\alpha_t = P(S_t^{(1)}, S_t^{(2)}, \ldots, S_t^{(M)}, \{Y\}_1^t \mid \phi)
\alpha_t^{(0)} = P(S_t^{(1)}, S_t^{(2)}, \ldots, S_t^{(M)}, \{Y\}_1^{t-1} \mid \phi)
\alpha_t^{(1)} = P(S_{t-1}^{(1)}, S_t^{(2)}, \ldots, S_t^{(M)}, \{Y\}_1^{t-1} \mid \phi)
    \vdots
\alpha_t^{(M)} = P(S_{t-1}^{(1)}, \ldots, S_{t-1}^{(M)}, \{Y\}_1^{t-1} \mid \phi) = \alpha_{t-1}.
At the end of the forward recursions, the likelihood of the observation sequence is the sum of the K^M elements in \alpha_T.

Similarly, to obtain the backward recursions we define

\beta_t = P(\{Y\}_{t+1}^T \mid S_t^{(1)}, \ldots, S_t^{(M)}, \phi)
\beta_{t-1}^{(M)} = P(\{Y\}_t^T \mid S_t^{(1)}, \ldots, S_t^{(M)}, \phi)
    \vdots
\beta_{t-1}^{(1)} = P(\{Y\}_t^T \mid S_t^{(1)}, S_{t-1}^{(2)}, \ldots, S_{t-1}^{(M)}, \phi)

\langle S_{t-1}^{(m)} S_t^{(m)\prime} \rangle = \frac{ \sum_{\{S_{t-1}^{(n)}, S_t^{(r)} \,:\, n \neq m,\, r \neq m\}} \alpha_{t-1}\, P(S_t \mid S_{t-1})\, P(Y_t \mid S_t)\, \beta_t }{ \sum_{S_{t-1}, S_t} \alpha_{t-1}\, P(S_t \mid S_{t-1})\, P(Y_t \mid S_t)\, \beta_t }.    (B.8)
Appendix C
Completely factorized variational approximation
Using the definition of the probabilistic model given by equations (1)-(4b), the posterior probability of the states given an observation sequence can be written as

P(\{S_t\} \mid \{Y_t\}, \phi) = \frac{1}{Z} \exp\{ -H(\{S_t, Y_t\}) \},    (C.1)

where Z is a normalization constant ensuring that the probabilities sum to one and

H(\{S_t, Y_t\}) = \frac{1}{2} \sum_{t=1}^{T} \Big( Y_t - \sum_{m=1}^{M} W^{(m)} S_t^{(m)} \Big)' C^{-1} \Big( Y_t - \sum_{m=1}^{M} W^{(m)} S_t^{(m)} \Big)
    - \sum_{m=1}^{M} S_1^{(m)\prime} \log \pi^{(m)} - \sum_{t=2}^{T} \sum_{m=1}^{M} S_t^{(m)\prime} (\log P^{(m)}) S_{t-1}^{(m)}.    (C.2)
Similarly, the probability distribution given by the variational approximation (7)-(8) can be written as

Q(\{S_t\} \mid \theta) = \frac{1}{Z_Q} \exp\{ -H_Q(\{S_t\}) \},    (C.3)

where

H_Q(\{S_t\}) = -\sum_{t=1}^{T} \sum_{m=1}^{M} S_t^{(m)\prime} \log \theta_t^{(m)}.    (C.4)
Using this notation, and denoting expectation with respect to the variational distribution using angled brackets ⟨·⟩, the KL divergence is

KL(Q‖P) = \langle H \rangle - \langle H_Q \rangle - \log Z_Q + \log Z.    (C.5)

Three facts can be verified from the definition of the variational approximation:

\langle S_t^{(m)} \rangle = \theta_t^{(m)}    (C.6)

\langle S_{t-1}^{(m)} S_t^{(m)\prime} \rangle = \theta_{t-1}^{(m)} \theta_t^{(m)\prime}    (C.7)

\langle S_t^{(m)} S_t^{(n)\prime} \rangle = \theta_t^{(m)} \theta_t^{(n)\prime} \text{ if } m \neq n, \quad \operatorname{diag}\{\theta_t^{(m)}\} \text{ if } m = n,    (C.8)
where diag is an operator that takes a vector and returns a square matrix with the elements of the vector along its diagonal, and zeros everywhere else. The KL divergence can therefore be expanded to

KL = \sum_{t=1}^{T} \sum_{m=1}^{M} \theta_t^{(m)\prime} \log \theta_t^{(m)} + \frac{1}{2} \sum_{t=1}^{T} \Big[ Y_t' C^{-1} Y_t - 2 \sum_{m=1}^{M} Y_t' C^{-1} W^{(m)} \theta_t^{(m)}
    + \sum_{m=1}^{M} \sum_{n \neq m} \operatorname{tr}\big\{ W^{(m)\prime} C^{-1} W^{(n)} \theta_t^{(n)} \theta_t^{(m)\prime} \big\} + \sum_{m=1}^{M} \operatorname{tr}\big\{ W^{(m)\prime} C^{-1} W^{(m)} \operatorname{diag}\{\theta_t^{(m)}\} \big\} \Big]
    - \sum_{m=1}^{M} \theta_1^{(m)\prime} \log \pi^{(m)} - \sum_{t=2}^{T} \sum_{m=1}^{M} \theta_t^{(m)\prime} (\log P^{(m)}) \theta_{t-1}^{(m)} - \log Z_Q + \log Z.
Appendix D
Structured approximation
For the structured approximation, H_Q is defined as

H_Q(\{S_t\}) = -\sum_{m=1}^{M} S_1^{(m)\prime} \log \pi^{(m)} - \sum_{t=2}^{T} \sum_{m=1}^{M} S_t^{(m)\prime} (\log P^{(m)}) S_{t-1}^{(m)} - \sum_{t=1}^{T} \sum_{m=1}^{M} S_t^{(m)\prime} \log h_t^{(m)}.    (D.1)
Using (C.2), we write the KL divergence as

KL = \sum_{t=1}^{T} \sum_{m=1}^{M} \langle S_t^{(m)} \rangle' \log h_t^{(m)} + \frac{1}{2} \sum_{t=1}^{T} \Big[ Y_t' C^{-1} Y_t - 2 \sum_{m=1}^{M} Y_t' C^{-1} W^{(m)} \langle S_t^{(m)} \rangle
    + \sum_{m=1}^{M} \sum_{n \neq m} \operatorname{tr}\big\{ W^{(m)\prime} C^{-1} W^{(n)} \langle S_t^{(n)} \rangle \langle S_t^{(m)} \rangle' \big\}
    + \sum_{m=1}^{M} \operatorname{tr}\big\{ W^{(m)\prime} C^{-1} W^{(m)} \operatorname{diag}\{ \langle S_t^{(m)} \rangle \} \big\} \Big] - \log Z_Q + \log Z.    (D.2)
Since KL is independent of π^{(m)} and P^{(m)}, the first thing to note is that these parameters of the structured approximation remain equal to the equivalent parameters of the true system. Now, taking derivatives with respect to log h^{(n)}, we get

\frac{\partial KL}{\partial \log h^{(n)}} = \langle S^{(n)} \rangle + \sum_{t=1}^{T} \sum_{m=1}^{M} \Big[ \log h_t^{(m)} - W^{(m)\prime} C^{-1} Y_t + \sum_{\ell \neq m} W^{(m)\prime} C^{-1} W^{(\ell)} \langle S_t^{(\ell)} \rangle + \frac{1}{2} \Delta^{(m)} \Big]' \frac{\partial \langle S_t^{(m)} \rangle}{\partial \log h^{(n)}} - \langle S^{(n)} \rangle.    (D.3)

The last term, which we obtained by making use of the fact that

\frac{\partial \log Z_Q}{\partial \log h^{(n)}} = \langle S^{(n)} \rangle,    (D.4)

cancels out the first term. Setting the terms inside the brackets in (D.3) equal to zero yields equation (12a).
Acknowledgments

We thank Lawrence Saul for helpful discussions and Geoffrey Hinton for support. This project was supported in part by a grant from the McDonnell-Pew Foundation, by a grant from ATR Human Information Processing Research Laboratories, by a gift from Siemens Corporation, and by grant N00014-94-1-0777 from the Office of Naval Research. Zoubin Ghahramani was supported by a grant from the Ontario Information Technology Research Centre.
Notes

1. For related work on inference in distributed state HMMs, see Dean and Kanazawa (1989).

2. In speech, neural networks are generally used to model P(S_t | Y_t); this probability is converted to the observation probabilities needed in the HMM via Bayes rule.

3. If the columns of W^{(m)} and W^{(n)} are orthogonal for every pair of state variables, m and n, and C is a diagonal covariance matrix, then the state variables will no longer be dependent given the observation. In this case there is no "explaining away": each state variable is modeling the variability in the observation along a different subspace.

4. A more Bayesian treatment of the learning problem, in which the parameters are also considered hidden random variables, can be handled by Gibbs sampling by replacing the "M step" with sampling from the conditional distribution of the parameters given the other hidden variables (for example, see Tanner and Wong, 1987).

5. The first term is replaced by log π^{(m)} for t = 1; the second term does not appear for t = T.

6. All samples were used for learning; that is, no samples were discarded at the beginning of the run. Although ten samples is too few to even approach convergence, it provides a run-time roughly comparable to the variational methods. The goal was to see whether this "impatient" Gibbs sampler would be able to compete with the other approximate methods.
7. Lower values suggest a better probabilistic model: a value of one, for example, means that it would take one bit more than the true generative model to code each observation vector. Standard deviations reflect the variation due to training set, test set, and the random seed of the algorithm. Standard errors on the mean are a factor of 3.8 smaller.

8. For the variational methods these dashed lines are equal to minus the lower bound on the log likelihood, except for a normalization term which is intractable to compute and can vary during learning, resulting in the apparent occasional increases in the bound.

9. Since the attributes were modeled as real numbers, the log likelihoods are only a measure of relative coding cost. Comparisons between these likelihoods are meaningful, whereas to obtain the absolute cost of coding a sequence, it is necessary to specify a discretization level.

10. This is analogous to the fully-connected Boltzmann machine with N units (Hinton & Sejnowski, 1986), in which every binary unit is coupled to every other unit using O(N^2) parameters, rather than the O(2^N) parameters required to specify the complete probability table.
References
Baum, L., Petrie, T., Soules, G., & Weiss, N. (1970). A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. The Annals of Mathematical Statistics, 41, 164-171.

Bengio, Y., & Frasconi, P. (1995). An input-output HMM architecture. In G. Tesauro, D. S. Touretzky, & T. K. Leen (Eds.), Advances in neural information processing systems 7, pp. 427-434. Cambridge, MA: MIT Press.

Cacciatore, T. W., & Nowlan, S. J. (1994). Mixtures of controllers for jump linear and non-linear plants. In J. D. Cowan, G. Tesauro, & J. Alspector (Eds.), Advances in neural information processing systems 6, pp. 719-726. San Francisco, CA: Morgan Kaufmann.

Conklin, D., & Witten, I. H. (1995). Multiple viewpoint systems for music prediction. Journal of New Music Research, 24, 51-73.

Cover, T., & Thomas, J. (1991). Elements of information theory. New York: John Wiley.

Dawid, A. P. (1992). Applications of a general propagation algorithm for probabilistic expert systems. Statistics and Computing, 2, 25-36.

Dean, T., & Kanazawa, K. (1989). A model for reasoning about persistence and causation. Computational Intelligence, 5, 142-150.

Dempster, A., Laird, N., & Rubin, D. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society Series B, 39, 1-38.

Geman, S., Bienenstock, E., & Doursat, R. (1992). Neural networks and the bias/variance dilemma. Neural Computation, 4, 1-58.

Geman, S., & Geman, D. (1984). Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6, 721-741.

Ghahramani, Z. (1995). Factorial learning and the EM algorithm. In G. Tesauro, D. S. Touretzky, & T. K. Leen (Eds.), Advances in neural information processing systems 7, pp. 617-624. Cambridge, MA: MIT Press.

Heckerman, D. (1995). A tutorial on learning Bayesian networks (Technical Report MSR-TR-95-06). Redmond, WA: Microsoft Research.

Hinton, G. E., & Sejnowski, T. J. (1986). Learning and relearning in Boltzmann machines. In D. E. Rumelhart & J. L. McClelland (Eds.), Parallel distributed processing: Explorations in the microstructure of cognition. Volume 1: Foundations. Cambridge, MA: MIT Press.

Hinton, G. E., & Zemel, R. S. (1994). Autoencoders, minimum description length, and Helmholtz free energy. In J. D. Cowan, G. Tesauro, & J. Alspector (Eds.), Advances in neural information processing systems 6, pp. 3-10. San Francisco, CA: Morgan Kaufmann.

Jensen, F. V., Lauritzen, S. L., & Olesen, K. G. (1990). Bayesian updating in recursive graphical models by local computations. Computational Statistical Quarterly, 4, 269-282.
Jordan, M. I., Ghahramani, Z., & Saul, L. K. (1997). Hidden Markov decision trees. In M. Mozer, M. Jordan, & T. Petsche (Eds.), Advances in neural information processing systems 9. Cambridge, MA: MIT Press.

Jordan, M. I., & Jacobs, R. (1994). Hierarchical mixtures of experts and the EM algorithm. Neural Computation, 6, 181-214.

Kanazawa, K., Koller, D., & Russell, S. J. (1995). Stochastic simulation algorithms for dynamic probabilistic networks. In P. Besnard & S. Hanks (Eds.), Uncertainty in Artificial Intelligence: Proceedings of the Eleventh Conference, pp. 346-351. San Francisco, CA: Morgan Kaufmann.

Krogh, A., Brown, M., Mian, I. S., Sjolander, K., & Haussler, D. (1994). Hidden Markov models in computational biology: Applications to protein modeling. Journal of Molecular Biology, 235, 1501-1531.

Lauritzen, S. L., & Spiegelhalter, D. J. (1988). Local computations with probabilities on graphical structures and their application to expert systems. Journal of the Royal Statistical Society B, 157-224.

McCullagh, P., & Nelder, J. (1989). Generalized linear models. London: Chapman & Hall.

Meila, M., & Jordan, M. I. (1996). Learning fine motion by Markov mixtures of experts. In D. S. Touretzky, M. C. Mozer, & M. E. Hasselmo (Eds.), Advances in neural information processing systems 8, pp. 1003-1009. Cambridge, MA: MIT Press.

Merz, C. J., & Murphy, P. M. (1996). UCI Repository of machine learning databases [http://www.ics.uci.edu/~mlearn/MLRepository.html]. Irvine, CA: University of California, Department of Information and Computer Science.

Neal, R. M. (1992). Connectionist learning of belief networks. Artificial Intelligence, 56, 71-113.

Neal, R. M. (1993). Probabilistic inference using Markov chain Monte Carlo methods (Technical Report CRG-TR-93-1). Toronto, Ontario: University of Toronto, Department of Computer Science.

Neal, R. M., & Hinton, G. E. (1993). A new view of the EM algorithm that justifies incremental and other variants. Unpublished manuscript, Department of Computer Science, University of Toronto, Ontario.
Parisi, G. (1988). Statistical field theory. Redwood City, CA: Addison-Wesley.

Pearl, J. (1988). Probabilistic reasoning in intelligent systems: Networks of plausible inference. San Mateo, CA: Morgan Kaufmann.

Rabiner, L. R., & Juang, B. H. (1986). An introduction to hidden Markov models. IEEE Acoustics, Speech & Signal Processing Magazine, 3, 4-16.

Saul, L. K., & Jordan, M. I. (1997). Mixed memory Markov models. In D. Madigan & P. Smyth (Eds.), Proceedings of the 1997 Conference on Artificial Intelligence and Statistics. Ft. Lauderdale, FL.

Saul, L., Jaakkola, T., & Jordan, M. I. (1996). Mean field theory for sigmoid belief networks. Journal of Artificial Intelligence Research, 4, 61-76.

Saul, L., & Jordan, M. I. (1995). Boltzmann chains and hidden Markov models. In G. Tesauro, D. S. Touretzky, & T. K. Leen (Eds.), Advances in neural information processing systems 7, pp. 435-442. Cambridge, MA: MIT Press.

Saul, L., & Jordan, M. I. (1996). Exploiting tractable substructures in intractable networks. In D. S. Touretzky, M. C. Mozer, & M. E. Hasselmo (Eds.), Advances in neural information processing systems 8, pp. 486-492. Cambridge, MA: MIT Press.

Smyth, P., Heckerman, D., & Jordan, M. I. (1997). Probabilistic independence networks for hidden Markov probability models. Neural Computation, 9, 227-269.

Stolcke, A., & Omohundro, S. (1993). Hidden Markov model induction by Bayesian model merging. In S. J. Hanson, J. D. Cowan, & C. L. Giles (Eds.), Advances in neural information processing systems 5, pp. 11-18. San Francisco, CA: Morgan Kaufmann.

Tanner, M. A., & Wong, W. H. (1987). The calculation of posterior distributions by data augmentation (with discussion). Journal of the American Statistical Association, 82, 528-550.

Viterbi, A. J. (1967). Error bounds for convolutional codes and an asymptotically optimal decoding algorithm. IEEE Transactions on Information Theory, IT-13, 260-269.

Williams, C. K. I., & Hinton, G. E. (1991). Mean field networks that learn to discriminate temporally distorted strings. In D. Touretzky, J. Elman, T. Sejnowski, & G. Hinton (Eds.), Connectionist models: Proceedings of the 1990 summer school, pp. 18-22. San Francisco, CA: Morgan Kaufmann.

Zemel, R. S. (1993). A minimum description length framework for unsupervised learning. Ph.D. thesis, Department of Computer Science, University of Toronto, Toronto, Canada.