Factorial Hidden Markov Models

Z. Ghahramani and M. I. Jordan

Machine Learning, 1-31 (1997)
© 1997 Kluwer Academic Publishers, Boston. Manufactured in The Netherlands.
1. Introduction
Due to its flexibility and to the simplicity and efficiency of its parameter estimation algorithm, the hidden Markov model (HMM) has emerged as one of the basic statistical tools for modeling discrete time series, finding widespread application in the areas of speech recognition (Rabiner & Juang, 1986) and computational molecular biology (Krogh, Brown, Mian, Sjolander, & Haussler, 1994). An HMM is essentially a mixture model, encoding information about the history of a time series in the value of a single multinomial variable (the hidden state), which can take on one of K discrete values. This multinomial assumption supports an efficient parameter estimation algorithm, the Baum-Welch algorithm, which considers each of the K settings of the hidden state at each time step. However, the multinomial assumption also severely limits the representational capacity of HMMs. For example, to represent 30 bits of information about the history of a time sequence, an HMM would need K = 2^30 distinct states. On the other hand, an HMM with a distributed state representation could achieve the same task with 30 binary state variables (Williams & Hinton, 1991). This paper addresses the problem of constructing efficient learning algorithms for hidden Markov models with distributed state representations.
The need for distributed state representations in HMMs can be motivated in two ways. First, such representations let the model automatically decompose the state space into features that decouple the dynamics of the process that generated the data. Second, distributed state representations simplify the task of modeling time series that are known a priori to be generated from an interaction of multiple, loosely-coupled processes. For example, a speech signal generated by the superposition of multiple simultaneous speakers can potentially be modeled with such an architecture.
Williams and Hinton (1991) first formulated the problem of learning in HMMs with distributed state representations and proposed a solution based on deterministic Boltzmann learning.¹ The approach presented in this paper is similar to Williams and Hinton's in that it can also be viewed from the framework of statistical mechanics and mean field theory. However, our learning algorithm is quite different in that it makes use of the special structure of HMMs with a distributed state representation, resulting in a significantly more efficient learning procedure. Anticipating the results in Section 3, this learning algorithm obviates the need for the two-phase procedure of Boltzmann machines, has an exact M step, and makes use of the forward-backward algorithm from classical HMMs as a subroutine. A different approach comes from Saul and Jordan (1995), who derived a set of rules for computing the gradients required for learning in HMMs with distributed state spaces. However, their methods can only be applied to a limited class of architectures.
Hidden Markov models with distributed state representations are a particular class of probabilistic graphical models (Pearl, 1988; Lauritzen & Spiegelhalter, 1988), which represent probability distributions as graphs in which the nodes correspond to random variables and the links represent conditional independence relations. The relation between hidden Markov models and graphical models has recently been reviewed in Smyth, Heckerman and Jordan (1997). Although exact probability propagation algorithms exist for general graphical models (Jensen, Lauritzen, & Olesen, 1990), these algorithms are intractable for densely-connected models such as the ones we consider in this paper. One approach to dealing with this issue is to utilize stochastic sampling methods (Kanazawa et al., 1995). Another approach, which provides the basis for the algorithms described in the current paper, is to make use of variational methods (cf. Saul, Jaakkola, & Jordan, 1996).
In the following section we define the probabilistic model for factorial HMMs, and in Section 3 we present algorithms for inference and learning. In Section 4 we describe empirical results comparing exact and approximate algorithms for learning on the basis of time complexity and model quality. We also apply factorial HMMs to a real time series data set consisting of the melody lines from a collection of chorales by J. S. Bach. We discuss several generalizations of the probabilistic model in Section 5.
Figure 1. (a) A directed acyclic graph (DAG) specifying conditional independence relations for a hidden Markov model. Each node is conditionally independent from its non-descendants given its parents. (b) A DAG representing the conditional independence relations in a factorial HMM with M = 3 underlying Markov chains.
2. The probabilistic model

We generalize the HMM state representation by letting the state be represented by a collection of state variables

S_t = ( S_t^{(1)}, \ldots, S_t^{(m)}, \ldots, S_t^{(M)} ),

each of which can take on K^{(m)} values. We refer to these models as factorial hidden Markov models, as the state space consists of the cross product of these state variables. For simplicity, we will assume that K^{(m)} = K for all m, although all the results we present can be trivially generalized to the case of differing K^{(m)}. Given that the state space of this factorial HMM consists of all K^M combinations of the S_t^{(m)} variables, placing no constraints on the state transition structure would result in a K^M × K^M transition matrix. Such an unconstrained system is uninteresting for several reasons: it is equivalent to an HMM with K^M states; it is unlikely to discover any interesting structure in the K state variables, as all variables are allowed to interact arbitrarily; and both the time complexity and sample complexity of the estimation algorithm are exponential in M.
We therefore focus on factorial HMMs in which the underlying state transitions are constrained. A natural structure to consider is one in which each state variable evolves according to its own dynamics, and is a priori uncoupled from the other state variables:

P(S_t \mid S_{t-1}) = \prod_{m=1}^{M} P(S_t^{(m)} \mid S_{t-1}^{(m)}).    (3)

A graphical representation for this model is presented in Figure 1 (b). The transition structure for this system can be represented as M distinct K × K matrices. Generalizations that allow coupling between the state variables are briefly discussed in Section 5.
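To make the parameter savings concrete, here is a minimal sketch (an editorial illustration, not part of the original paper; the array layout and names are assumptions) that stores the M uncoupled K × K matrices of equation (3) and evaluates a factored transition probability without ever materializing the K^M × K^M matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
M, K = 3, 2  # three binary chains, as in Figure 1(b)

# One K x K row-stochastic matrix per chain: O(M K^2) parameters instead
# of the K^M x K^M entries of the equivalent unconstrained HMM.
P = [rng.dirichlet(np.ones(K), size=K) for _ in range(M)]
# Convention here: P[m][i, j] = P(S_t^(m) = j | S_{t-1}^(m) = i).

def factored_transition_prob(prev_state, state):
    """Equation (3): the product over chains of per-chain transitions."""
    return np.prod([P[m][prev_state[m], state[m]] for m in range(M)])

print(factored_transition_prob((0, 1, 0), (1, 1, 0)))
```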
As shown in Figure 1 (b), in a factorial HMM the observation at time step t can depend on all the state variables at that time step. For continuous observations, one simple form for this dependence is linear Gaussian; that is, the observation Y_t is a Gaussian random vector whose mean is a linear function of the state variables. We represent the state variables as K × 1 vectors, where each of the K discrete values corresponds to a 1 in one position and 0 elsewhere. The resulting probability density for a D × 1 observation vector Y_t is

P(Y_t \mid S_t) = |C|^{-1/2} (2\pi)^{-D/2} \exp\left\{ -\frac{1}{2} (Y_t - \mu_t)' C^{-1} (Y_t - \mu_t) \right\},    (4a)

where

\mu_t = \sum_{m=1}^{M} W^{(m)} S_t^{(m)}.    (4b)

Each W^{(m)} matrix is a D × K matrix whose columns are the contributions to the means for each of the settings of S_t^{(m)}, C is the D × D covariance matrix, ' denotes matrix transpose, and | · | is the matrix determinant operator.
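A small sketch of this observation model (our illustration; the names and the SciPy dependency are assumptions, not part of the paper):

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(1)
M, K, D = 3, 2, 4
W = [rng.normal(size=(D, K)) for _ in range(M)]  # W^(m): D x K contributions
C = np.eye(D)                                    # D x D observation covariance

def one_hot(k, K):
    v = np.zeros(K)
    v[k] = 1.0
    return v

# Equation (4b): the mean is the sum of one column from each W^(m),
# selected by the one-hot state vectors S_t^(m).
S = [one_hot(rng.integers(K), K) for _ in range(M)]
mu = sum(W[m] @ S[m] for m in range(M))

# Equation (4a): Gaussian density of the observation around that mean.
Y_t = rng.normal(size=D)
print(multivariate_normal(mean=mu, cov=C).pdf(Y_t))
```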
One way to understand the observation model in equations (4a) and (4b) is to consider the marginal distribution for Y_t, obtained by summing over the possible states. There are K settings for each of the M state variables, and thus there are K^M possible mean vectors, obtained by forming sums of M columns where one column is chosen from each of the W^{(m)} matrices. The resulting marginal density of Y_t is thus a Gaussian mixture model, with K^M Gaussian mixture components each having a constant covariance matrix C. This static mixture model, without inclusion of the time index and the Markov dynamics, is a factorial parameterization of the standard mixture of Gaussians model that has interest in its own right (Zemel, 1993; Hinton & Zemel, 1994; Ghahramani, 1995). The model we are considering in the current paper extends this model by allowing Markov dynamics in the discrete state variables underlying the mixture. Unless otherwise stated, we will assume the Gaussian observation model throughout the paper.
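As a concrete check of this count (our worked example, not from the paper), take M = 2 chains with K = 2 states each, and let e_k denote the kth standard basis vector. The achievable means are

\mu \in \left\{ W^{(1)} e_k + W^{(2)} e_{k'} \;:\; k, k' \in \{1, 2\} \right\},

one per choice of columns, giving K^M = 2^2 = 4 mixture components, each with covariance C.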
The hidden state variables at one time step, although marginally independent, become conditionally dependent given the observation sequence. This can be determined by applying the semantics of directed graphs, in particular the d-separation criterion (Pearl, 1988), to the graphical model in Figure 1 (b). Consider the Gaussian model in equations (4a)-(4b). Given an observation vector Y_t, the posterior probability of each of the settings of the hidden state variables is proportional to the probability of Y_t under a Gaussian with mean \mu_t. Since \mu_t is a function of all the state variables, the probability of a setting of one of the state variables will depend on the setting of the other state variables.³ This dependency effectively couples all of the hidden state variables for the purposes of calculating posterior probabilities and makes exact inference intractable for the factorial HMM.
Sampling once from each of the T M hidden variables in the model results in a new sample of the hidden state of the model and requires O(T M K) operations. The sequence of overall states resulting from each pass of Gibbs sampling defines a Markov chain over the state space of the model. Assuming that all probabilities are bounded away from zero, this Markov chain is guaranteed to converge to the posterior probabilities of the states given the observations (Geman & Geman, 1984). Thus, after some suitable time, samples from the Markov chain can be taken as approximate samples from the posterior probabilities. The first-order and second-order statistics needed to estimate ⟨S_t^{(m)}⟩, ⟨S_t^{(m)} S_t^{(n)'}⟩ and ⟨S_{t-1}^{(m)} S_t^{(m)'}⟩ are collected using the states visited, and the probabilities estimated during this sampling process are used in the approximate E step of EM.⁴
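A minimal sketch of one such Gibbs pass (our illustration; the array conventions, function name, and SciPy dependency are assumptions):

```python
import numpy as np
from scipy.stats import multivariate_normal

def gibbs_sweep(states, Y, W, P, pi, C, rng):
    """One pass over all T*M hidden variables.

    states: (T, M) array of current state indices (modified in place).
    Y: (T, D) observations; W: list of M (D, K) matrices;
    P: list of M (K, K) matrices with P[m][i, j] = P(j | i);
    pi: list of M (K,) initial state distributions.
    """
    T, M = states.shape
    K = P[0].shape[0]
    for t in range(T):
        for m in range(M):
            # Mean contribution of all chains except m (cf. equation (4b)).
            mu_rest = sum(W[n][:, states[t, n]] for n in range(M) if n != m)
            logp = np.empty(K)
            for k in range(K):
                # Likelihood of Y_t with chain m set to k, times the
                # transitions into and out of this variable.
                logp[k] = multivariate_normal(mu_rest + W[m][:, k], C).logpdf(Y[t])
                logp[k] += np.log(pi[m][k]) if t == 0 else np.log(P[m][states[t - 1, m], k])
                if t < T - 1:
                    logp[k] += np.log(P[m][k, states[t + 1, m]])
            p = np.exp(logp - logp.max())
            states[t, m] = rng.choice(K, p=p / p.sum())
    return states
```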
where we have made use of Jensen's inequality in the last step. The difference between the left-hand side and the right-hand side of this inequality is given by the Kullback-Leibler divergence (Cover & Thomas, 1991):

KL(Q‖P) = \sum_{\{S_t\}} Q(\{S_t\}) \log \frac{Q(\{S_t\})}{P(\{S_t\} \mid \{Y_t\})}.    (6)
Figure 2. (a) The completely factorized variational approximation assumes that all the state variables are independent (conditional on the observation sequence). (b) The structured variational approximation assumes that the state variables retain their Markov structure within each chain, but are independent across chains.
Q(\{S_t\} \mid \theta) = \prod_{t=1}^{T} \prod_{m=1}^{M} Q(S_t^{(m)} \mid \theta_t^{(m)}).    (7)
The variational parameters, θ = {θ_t^{(m)}}, are the means of the state variables, where, as before, a state variable S_t^{(m)} is represented as a K-dimensional vector with a 1 in the kth position and 0 elsewhere, if the mth Markov chain is in state k at time t. The elements of the vector θ_t^{(m)} therefore define the state occupation probabilities for the multinomial variable S_t^{(m)} under the distribution Q:

Q(S_t^{(m)} \mid \theta_t^{(m)}) = \prod_{k=1}^{K} \left( \theta_{t,k}^{(m)} \right)^{S_{t,k}^{(m)}}, \quad \text{where } S_{t,k}^{(m)} \in \{0, 1\}, \; \sum_{k=1}^{K} S_{t,k}^{(m)} = 1.    (8)
This completely factorized approximation is often used in statistical physics, where it provides the basis for simple yet powerful mean field approximations to statistical mechanical systems (Parisi, 1988).

To make the bound as tight as possible we vary θ separately for each observation sequence so as to minimize the KL divergence. Taking the derivatives of (6) with respect to θ_t^{(m)} and setting them to zero, we obtain the set of fixed point equations (see Appendix C) defined by

\theta_t^{(m)\,\text{new}} = \varphi\left\{ W^{(m)\prime} C^{-1} \tilde{Y}_t^{(m)} - \frac{1}{2} \Delta^{(m)} + (\log P^{(m)})\, \theta_{t-1}^{(m)} + (\log P^{(m)})'\, \theta_{t+1}^{(m)} \right\},    (9a)

where \tilde{Y}_t^{(m)} is the residual error in Y_t given the predictions from all the state variables not including m:

\tilde{Y}_t^{(m)} \equiv Y_t - \sum_{\ell \neq m} W^{(\ell)} \theta_t^{(\ell)},    (9b)
Δ^{(m)} is the vector of diagonal elements of W^{(m)'} C^{-1} W^{(m)}, and φ{·} is the softmax operator, which maps a vector A into a vector B of the same size, with elements

B_i = \frac{\exp\{A_i\}}{\sum_j \exp\{A_j\}},    (10)

and log P^{(m)} denotes the elementwise logarithm of the transition matrix P^{(m)}.
The first term of (9a) is the projection of the error in reconstructing the observation onto the weights of state vector m: the more a particular setting of a state vector can reduce this error, the larger its associated variational parameter. The second term arises from the fact that the second order correlation ⟨S_t^{(m)} S_t^{(m)'}⟩ evaluated under the variational distribution is a diagonal matrix composed of the elements of θ_t^{(m)}. The last two terms introduce dependencies forward and backward in time.⁵ Therefore, although the posterior distribution over the hidden variables is approximated with a completely factorized distribution, the fixed point equations couple the parameters associated with each node with the parameters of its Markov blanket. In this sense, the fixed point equations propagate information along the same pathways as those defining the exact algorithms for probability propagation.
The following may provide an intuitive interpretation of the approximation being made by this distribution. Given a particular observation sequence, the hidden state variables for the M Markov chains at time step t are stochastically coupled. This stochastic coupling is approximated by a system in which the hidden variables are uncorrelated but have coupled means. The variational or "mean-field" equations solve for the deterministic coupling of the means that best approximates the stochastically coupled system.
Each hidden state vector is updated in turn using (9a), with a time complexity of O(T M K^2) per iteration. Convergence is determined by monitoring the KL divergence in the variational distribution between successive iterations; in practice convergence is very rapid (about 2 to 10 iterations of (9a)). Once the fixed point equations have converged, the expectations required for the E step can be obtained as a simple function of the parameters (equations (C.6)-(C.8) in Appendix C).
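A sketch of one sweep of these fixed point equations (our illustration, under the array conventions noted in the docstring; the boundary handling follows note 5):

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def mean_field_pass(theta, Y, W, logpi, logP, Cinv):
    """One sweep of equations (9a)-(9b) over all state vectors.

    theta: (M, T, K) variational means, updated in place.
    Y: (T, D); W: (M, D, K); logpi: (M, K) log priors;
    logP: (M, K, K) with logP[m][k, j] = log P(S_t = k | S_{t-1} = j);
    Cinv: (D, D) inverse covariance.
    """
    M, T, K = theta.shape
    # Delta^(m): diagonal of W^(m)' C^-1 W^(m), for each chain.
    delta = np.stack([np.diag(W[m].T @ Cinv @ W[m]) for m in range(M)])
    for t in range(T):
        for m in range(M):
            # Residual observation with chain m's contribution removed (9b).
            Y_res = Y[t] - sum(W[l] @ theta[l, t] for l in range(M) if l != m)
            a = W[m].T @ Cinv @ Y_res - 0.5 * delta[m]
            # Note 5: the (log P) theta_{t-1} term becomes log pi at t = 1,
            # and the forward term is absent at t = T.
            a += logpi[m] if t == 0 else logP[m] @ theta[m, t - 1]
            if t < T - 1:
                a += logP[m].T @ theta[m, t + 1]
            theta[m, t] = softmax(a)
    return theta
```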
Note that the arguments presented in the previous section did not hinge on the form of the approximating distribution. Therefore, more structured variational approximations can be obtained by using more structured variational distributions Q. Each such Q provides a lower bound on the log likelihood and can be used to obtain a learning algorithm.
We write the structured variational approximation as

Q(\{S_t\} \mid \theta) = \frac{1}{Z_Q} \prod_{m=1}^{M} Q(S_1^{(m)} \mid \theta) \prod_{t=2}^{T} Q(S_t^{(m)} \mid S_{t-1}^{(m)}, \theta),    (11a)

where Z_Q is a normalization constant ensuring that Q integrates to one and

Q(S_1^{(m)} \mid \theta) = \prod_{k=1}^{K} \left( h_{1,k}^{(m)} \pi_k^{(m)} \right)^{S_{1,k}^{(m)}},    (11b)

Q(S_t^{(m)} \mid S_{t-1}^{(m)}, \theta) = \prod_{k=1}^{K} \left( h_{t,k}^{(m)} \sum_{j=1}^{K} P_{k,j}^{(m)} S_{t-1,j}^{(m)} \right)^{S_{t,k}^{(m)}}
    = \prod_{k=1}^{K} \left( h_{t,k}^{(m)} \prod_{j=1}^{K} \left( P_{k,j}^{(m)} \right)^{S_{t-1,j}^{(m)}} \right)^{S_{t,k}^{(m)}},    (11c)

where the last equality follows from the fact that S_{t-1}^{(m)} is a vector with a 1 in one position and 0 elsewhere. The parameters of this distribution are θ = {π^{(m)}, P^{(m)}, h_t^{(m)}}: the original priors and state transition matrices of the factorial HMM and a time-varying bias for each state variable. Comparing equations (11a)-(11c) to equation (1), we can see that the K × 1 vector h_t^{(m)} plays the role of the probability of an observation (P(Y_t | S_t) in (1)) for each of the K settings of S_t^{(m)}. For example, Q(S_{1,j}^{(m)} = 1 \mid \theta) \propto h_{1,j}^{(m)} P(S_{1,j}^{(m)} = 1 \mid \theta) corresponds to having an observation at time t = 1 that under state S_{1,j}^{(m)} = 1 has probability h_{1,j}^{(m)}.
Intuitively, this approximation uncouples the M Markov chains and attaches to each state variable a distinct fictitious observation. The probability of this fictitious observation can be varied so as to minimize the KL divergence between Q and P. Applying the same arguments as before, we obtain a set of fixed point equations for h_t^{(m)} that minimize KL(Q‖P), as detailed in Appendix D:

h_t^{(m)\,\text{new}} = \exp\left\{ W^{(m)\prime} C^{-1} \tilde{Y}_t^{(m)} - \frac{1}{2} \Delta^{(m)} \right\},    (12a)

where Δ^{(m)} is defined as before, and where we redefine the residual error to be

\tilde{Y}_t^{(m)} \equiv Y_t - \sum_{\ell \neq m} W^{(\ell)} \langle S_t^{(\ell)} \rangle.    (12b)
The parameter h_t^{(m)} obtained from these fixed point equations is the observation probability associated with state variable S_t^{(m)} in hidden Markov model m. Using these probabilities, the forward-backward algorithm is used to compute a new set of expectations for ⟨S_t^{(m)}⟩, which are fed back into (12a) and (12b). This algorithm is therefore used as a subroutine in the minimization of the KL divergence.

Note the similarity between equations (12a)-(12b) and equations (9a)-(9b) for the completely factorized system. In the completely factorized system, since ⟨S_t^{(m)}⟩ = θ_t^{(m)}, the fixed point equations can be written explicitly in terms of the variational parameters. In the structured approximation, the dependence of ⟨S_t^{(m)}⟩ on h_t^{(m)} is computed via the forward-backward algorithm. Note also that (12a) does not contain terms involving the prior, π^{(m)}, or transition matrix, P^{(m)}. These terms cancel as a result of our choice of approximation.
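A sketch of this inner loop (ours; the log-space forward-backward routine and all names are illustrative assumptions, under the same array conventions as before):

```python
import numpy as np
from scipy.special import logsumexp

def forward_backward(logpi, logP, logh):
    """Posterior state means for one chain whose per-step 'observation'
    probabilities are the fictitious observations h, in log space.
    logpi: (K,); logP: (K, K) with logP[k, j] = log P(k | j); logh: (T, K)."""
    T, K = logh.shape
    alpha = np.zeros((T, K))
    beta = np.zeros((T, K))
    alpha[0] = logpi + logh[0]
    for t in range(1, T):
        alpha[t] = logh[t] + logsumexp(logP + alpha[t - 1], axis=1)
    for t in range(T - 2, -1, -1):
        beta[t] = logsumexp(logP + (logh[t + 1] + beta[t + 1])[:, None], axis=0)
    gamma = alpha + beta
    return np.exp(gamma - logsumexp(gamma, axis=1, keepdims=True))

def structured_e_step(Y, W, pi, P, Cinv, n_iter=10):
    """Alternate the fixed point equation (12a) for h with exact
    forward-backward within each chain."""
    M = len(W)
    T = Y.shape[0]
    K = W[0].shape[1]
    delta = [np.diag(W[m].T @ Cinv @ W[m]) for m in range(M)]
    ES = [np.full((T, K), 1.0 / K) for _ in range(M)]  # <S_t^(m)>
    for _ in range(n_iter):
        for m in range(M):
            # Residual (12b), for all t at once; rows are time steps.
            Y_res = Y - sum(ES[l] @ W[l].T for l in range(M) if l != m)
            logh = Y_res @ Cinv @ W[m] - 0.5 * delta[m]  # log of (12a)
            ES[m] = forward_backward(np.log(pi[m]), np.log(P[m]), logh)
    return ES
```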
4. Experimental results

To investigate learning and inference in factorial HMMs we conducted two experiments. The first experiment compared the different approximate and exact methods of inference on the basis of computation time and the likelihood of the model obtained from synthetic data. The second experiment sought to determine whether the decomposition of the state space in factorial HMMs presents any advantage in modeling a real time series data set that we might assume to have complex internal structure: Bach's chorale melodies.
Fifteen sets of parameters were generated for each of the four problem sizes. Algorithms were run for a maximum of 100 iterations of EM or until convergence, defined as the iteration k at which the log likelihood L(k), or approximate log likelihood if an approximate algorithm was used, satisfied

L(k) - L(k-1) < 10^{-5} \, (L(k-1) - L(2)).

At the end of learning, the log likelihoods on the training and test set were computed for all models using the exact algorithm. Also included in the comparison was the log likelihood of the training and test sets under the true model that generated the data. The test set log likelihood for N observation sequences is defined as

\sum_{n=1}^{N} \log P(Y_1^{(n)}, \ldots, Y_T^{(n)} \mid \hat{\phi}),

where \hat{\phi} is obtained by maximizing the log likelihood over a training set that is disjoint from the test set. This provides a measure of how well the model generalizes to a novel observation sequence from the same distribution as the training data.
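For concreteness, the stopping rule above reads as follows in code (our sketch; `loglik` is assumed to hold L(1), L(2), ... in order):

```python
def converged(loglik, tol=1e-5):
    """Relative-improvement criterion: stop when the latest EM gain falls
    below tol times the total gain accumulated since iteration 2."""
    if len(loglik) < 3:
        return False
    return loglik[-1] - loglik[-2] < tol * (loglik[-2] - loglik[1])
```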
Results averaged over 15 runs for each algorithm on each of the four problem sizes (a total of 300 runs) are presented in Table 1. Even for the smallest problem size (M = 3 and K = 2), the standard HMM with K^M states suffers from overfitting: the test set log likelihood is significantly worse than the training set log likelihood. As expected, this overfitting problem becomes worse as the size of the state space increases; it is particularly serious for M = 5 and K = 3.

For the factorial HMMs, the log likelihoods for each of the three approximate EM algorithms were compared to the exact algorithm. Gibbs sampling appeared to have the poorest performance: for each of the three smaller size problems its log likelihood was significantly worse than that of the exact algorithm on both the training sets and test sets (p < 0.05). This may be due to insufficient sampling. However, we will soon see that running the Gibbs sampler for more than 10 samples, while potentially improving performance, makes it substantially slower than the variational methods. Surprisingly, the Gibbs sampler appears to do quite well on the largest size problem, although the differences to the other methods are not statistically significant.
The performance of the completely factorized variational approximation was not statistically significantly different from that of the exact algorithm on either the training set or the test set for any of the problem sizes. The performance of the structured variational approximation was not statistically different from that of the exact method on three of the four problem sizes, and appeared to be better on one of the problem sizes (M = 5, K = 2; p < 0.05). Although this result may be a fluke arising from random variability, there is another more interesting (and speculative) explanation. The exact EM algorithm implements unconstrained maximization of F, as defined in Section 3.6, while the variational methods maximize F subject to a constrained distribution. These constraints could presumably act as regularizers, reducing overfitting.
There was a large amount of variability in the final log likelihoods for the models learned by all the algorithms. We subtracted the log likelihood of the true generative model from that of each trained model to eliminate the main effect of the randomly sampled generative model and to reduce the variability due to training and test sets.
Table 1. Comparison of the factorial HMM on four problems of varying size. The negative log likelihood for the training and test set, plus or minus one standard deviation, is shown for each problem size and algorithm, measured in bits per observation (log likelihood in bits divided by NT) relative to the log likelihood under the true generative model for that data set.⁷ True is the true generative model (the log likelihood per symbol is defined to be zero for this model by our measure); HMM is the hidden Markov model with K^M states; Exact is the factorial HMM trained using an exact E step; Gibbs is the factorial HMM trained using Gibbs sampling; CFVA is the factorial HMM trained using the completely factorized variational approximation; SVA is the factorial HMM trained using the structured variational approximation.
M   K   Algorithm   Training Set      Test Set
------------------------------------------------
3   2   True          0.00 ± 0.39       0.00 ± 0.39
        HMM           1.19 ± 0.67       2.29 ± 1.02
        Exact         0.88 ± 0.80       1.05 ± 0.72
        Gibbs         1.67 ± 1.23       1.78 ± 1.22
        CFVA          1.06 ± 1.20       1.20 ± 1.11
        SVA           0.91 ± 1.02       1.04 ± 1.01
------------------------------------------------
3   3   True          0.00 ± 0.19       0.00 ± 0.20
        HMM           0.76 ± 0.67       9.81 ± 2.55
        Exact         1.02 ± 1.04       1.26 ± 0.99
        Gibbs         2.21 ± 0.91       2.50 ± 0.87
        CFVA          1.24 ± 1.50       1.50 ± 1.53
        SVA           0.64 ± 0.88       0.90 ± 0.84
------------------------------------------------
5   2   True          0.00 ± 0.60       0.00 ± 0.57
        HMM           0.83 ± 0.82      11.57 ± 3.71
        Exact         2.29 ± 1.19       2.51 ± 1.21
        Gibbs         3.25 ± 1.17       3.35 ± 1.14
        CFVA          1.73 ± 1.34       2.07 ± 1.74
        SVA           1.34 ± 1.07       1.53 ± 1.05
------------------------------------------------
5   3   True          0.00 ± 0.30       0.00 ± 0.29
        HMM          -4.80 ± 0.52     175.35 ± 84.74
        Exact         4.23 ± 2.28       4.49 ± 2.24
        Gibbs         3.63 ± 1.13       3.95 ± 1.14
        CFVA          4.85 ± 0.68       5.14 ± 0.69
        SVA           3.99 ± 1.57       4.30 ± 1.62
[Figure 3: four panels, (a)-(d), each plotting negative log likelihood (bits, 0-5) against iterations of EM (0-100).]

Figure 3. Learning curves for five runs of each of the four learning algorithms for factorial HMMs: (a) exact; (b) completely factorized variational approximation; (c) structured variational approximation; and (d) Gibbs sampling. A single training set sampled from the M = 3, K = 2 problem size was used for all these runs. The solid lines show the negative log likelihood per observation (in bits) relative to the true model that generated the data, calculated using the exact algorithm. The circles denote the point at which the convergence criterion was met and the run ended. For the three approximate algorithms, the dashed lines show an approximate negative log likelihood.⁸
One important remaining source of variance was the random seed used in each training run, which determined the initial parameters and the samples used in the Gibbs algorithm. All algorithms appeared to be very sensitive to this random seed, suggesting that different runs on each training set found different local maxima or plateaus of the likelihood (Figure 3). Some of this variability could be eliminated by explicitly adding a regularization term, which can be viewed as a prior on the parameters in maximum a posteriori parameter estimation. Alternatively, Bayesian (or ensemble) methods could be used to average out this variability by integrating over the parameter space.
[Figure 4: bar chart of time per iteration of EM against state space size K^M, for (M, K) = (3, 2), (3, 3), (5, 2), (5, 3), (10, 2), and (15, 2); legend: HMM, Exact fHMM, Gibbs fHMM.]

Figure 4. Time per iteration of EM on a Silicon Graphics R4400 processor running Matlab.

The timing comparisons confirm the fact that both the standard HMM and the exact E step factorial HMM are extremely slow for models with large state spaces (Figure 4). Gibbs sampling was slower than the variational methods even when limited
to ten samples of each hidden variable per iteration of EM.⁶ Since one pass of the variational fixed point equations has the same time complexity as one pass of Gibbs sampling, and since the variational fixed point equations were found to converge very quickly, these experiments suggest that Gibbs sampling is not as competitive time-wise as the variational methods. The time per iteration for the variational methods scaled well to large state spaces.
To approximate the E step for factorial HMMs we used the structured variational approximation. This choice was motivated by three considerations. First, for the size of state space we wished to explore, the exact algorithms were prohibitively slow. Second, the Gibbs sampling algorithm did not appear to present any advantages in speed or performance, and it required some heuristic method for determining the number of samples. Third, theoretical arguments suggest that the structured approximation should in general be superior to the completely factorized variational approximation, since more of the dependencies of the original model are preserved.
The test set log likelihoods for the HMMs, shown in Figure 5 (a), exhibited the typical U-shaped curve demonstrating a trade-off between bias and variance (Geman, Bienenstock, & Doursat, 1992). HMMs with fewer than 10 states did not predict well, while HMMs with more than 40 states overfit the training data and therefore provided a poor model of the test data. Out of the 75 runs, the highest test set log likelihood per observation was 9.0 bits, obtained by an HMM with 30 hidden states.⁹
The factorial HMM provides a more satisfactory model of the chorales from three points of view. First, the time complexity is such that it is possible to consider models with significantly larger state spaces; in particular, we fit models with up to 1000 states. Second, given the componential parametrization of the factorial HMM, these large state spaces do not require excessively large numbers of parameters relative to the number of data points. In particular, we saw no evidence of overfitting even for the largest factorial HMM, as seen in Figures 5 (c) and (d). Finally, this approach resulted in significantly better predictors; the test set likelihood for the best factorial HMM was an order of magnitude larger than the test set likelihood for the best HMM, as Figure 5 (d) reveals.
While the factorial HMM is clearly a better predictor than a single HMM, it
should be acknowledged that neither approach produces models that are easily
interpretable from a musicological point of view. The situation is reminiscent of
that in speech recognition, where HMMs have proved their value as predictive
models of the speech signal without necessarily being viewed as causal generative
models of speech. A factorial HMM is clearly an impoverished representation of
musical structure, but its promising performance as a predictor provides hope that it could serve as a step on the way toward increasingly structured statistical models for music and other complex multivariate time series.

[Figure 5: four panels plotting test set log likelihood per event (5-15 bits) against size of state space (1-100 for panel (a); 1-1000 for panels (b)-(d)).]

Figure 5. Test set log likelihood per event of the Bach chorale data set as a function of the number of states for (a) HMMs, and factorial HMMs with (b) K = 2, (c) K = 3, and (d) K = 4 (heavy dashed line) and K = 5 (solid line). Each symbol represents a single run; the lines indicate the mean performances. The thin dashed line in (b)-(d) indicates the log likelihood per observation of the best run in (a). The factorial HMMs were trained using the structured approximation. For both methods the true likelihood was computed using the exact algorithm.
... \exp\left\{ \sum_{m=1}^{M} W^{(m)} S_t^{(m)} \right\}. Like the Gaussian model, this specification is again compact, using M matrices of size D × K. (As in the linear-Gaussian model, the W^{(m)} are overparametrized, since they can each model the overall mean of Y_t, as shown in Appendix A.) While the nonlinearity induced by the softmax function makes both the E step and M step of the algorithm more difficult, iterative numerical methods can be used in the M step, whereas Gibbs sampling and variational methods can again be used in the E step (see Neal, 1992; Hinton et al., 1995; and Saul et al., 1996, for discussions of different approaches to learning in sigmoid networks).
... the others as outputs or response variables. The goal, in these cases, is to model the conditional density of the output sequence given the input sequence. In machine learning terminology, unconditional density estimation is unsupervised while conditional density estimation is supervised.
Several algorithms for learning in hidden Markov models that are conditioned on inputs have recently been presented in the literature (Cacciatore & Nowlan, 1994; Bengio & Frasconi, 1995; Meila & Jordan, 1996). Given a sequence of input vectors {X_t}, the probabilistic model for an input-conditioned factorial HMM is

P(\{S_t, Y_t\} \mid \{X_t\}) = \prod_{m=1}^{M} P(S_1^{(m)} \mid X_1)\, P(Y_1 \mid S_1^{(m)}, X_1) \prod_{t=2}^{T} \prod_{m=1}^{M} P(S_t^{(m)} \mid S_{t-1}^{(m)}, X_t)\, P(Y_t \mid S_t^{(m)}, X_t).    (14)

The model depends on the specification of P(Y_t | S_t^{(m)}, X_t) and P(S_t^{(m)} | S_{t-1}^{(m)}, X_t), which are conditioned both on a discrete state variable and on a (possibly continuous) input vector. The approach used in Bengio and Frasconi's Input-Output HMMs (IOHMMs) suggests modeling P(S_t^{(m)} | S_{t-1}^{(m)}, X_t) as K separate neural networks, one for each setting of S_{t-1}^{(m)}. This decomposition ensures that a valid probability transition matrix is defined at each point in input space if a sum-to-one constraint (e.g., softmax nonlinearity) is used in the output of these networks.
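A minimal sketch of this decomposition (an editorial illustration, not from the cited papers; each "network" here is a single linear-softmax layer, the simplest form satisfying the sum-to-one constraint):

```python
import numpy as np

def softmax(a, axis=-1):
    e = np.exp(a - a.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class TransitionNets:
    """K networks for one chain m: network j maps the input x_t to the
    distribution P(S_t^(m) = . | S_{t-1}^(m) = j, x_t)."""

    def __init__(self, K, input_dim, rng):
        self.A = rng.normal(scale=0.1, size=(K, K, input_dim))  # per-network weights
        self.b = np.zeros((K, K))                               # per-network biases

    def transition_matrix(self, x):
        # Row j is the softmax output of network j, so every row sums to
        # one: a valid transition matrix at each point in input space.
        return softmax(self.A @ x + self.b, axis=1)

rng = np.random.default_rng(2)
nets = TransitionNets(K=3, input_dim=5, rng=rng)
print(nets.transition_matrix(rng.normal(size=5)).sum(axis=1))  # -> [1. 1. 1.]
```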
Using the decomposition of each conditional probability into K networks, inference in input-conditioned factorial HMMs is a straightforward generalization of the algorithms we have presented for factorial HMMs. The exact forward-backward algorithm in Appendix B can be adapted by using the appropriate conditional probabilities. Similarly, the Gibbs sampling procedure is no more complex when conditioned on inputs. Finally, the completely factorized and structured approximations can also be generalized readily if the approximating distribution has a dependence on the input similar to the model's. If the probability transition structure P(S_t^{(m)} | S_{t-1}^{(m)}, X_t) is not decomposed as above, but has a complex dependence on the previous state variable and input, inference may become considerably more complex.
Depending on the form of the input conditioning, the Maximization step of learning may also change considerably. In general, if the output and transition probabilities are modeled as neural networks, the M step can no longer be solved exactly and a gradient-based generalized EM algorithm must be used. For log-linear models, the M step can be solved using an inner loop of iteratively reweighted least-squares (McCullagh & Nelder, 1989).
6. Conclusion

In this paper we have examined the problem of learning for a class of generalized hidden Markov models with distributed state representations. This generalization provides both a richer modeling tool and a method for incorporating prior structural information about the state variables underlying the dynamics of the system generating the data. Although exact inference in this class of models is generally intractable, we provided a structured variational approximation that can be computed tractably. This approximation forms the basis of the Expectation step in an EM algorithm for learning the parameters of the model. Empirical comparisons to several other approximations and to the exact algorithm show that this approximation is both efficient to compute and accurate. Finally, we have shown that ...
Appendix A

The M step

The M step equations for each parameter are obtained by setting the derivatives of Q with respect to that parameter to zero. We start by expanding Q using equations (1)-(4b):
"
Q= 1X T
Y 0 C 1Yt 2 X Y 0 C 1W (m) hS (m) i
M
2 t=1 t m=1
t t
M n #
X m)0 io
M X
(m)0 n
+ tr W C 1W (n)hSt St
( ) (
m=1 n=1
X
M
(m)0
XT X M n 0 o
+ hS1 i log (m) + tr (logP (m) )hSt(m1) St(m) i logZ; (A.1)
m=1 t=2 m=1
where tr is the trace operator for square matrices and Z is a normalization term
independent of the states and observations ensuring that the probabilities sum to
one.
Setting the derivatives of Q with respect to the output weights to zero, we obtain a linear system of equations for the W^{(m)}:

\frac{\partial Q}{\partial W^{(m)}} = \sum_{t=1}^{T} \Big[ \sum_{n=1}^{M} W^{(n)} \langle S_t^{(n)} S_t^{(m)\prime} \rangle - Y_t \langle S_t^{(m)\prime} \rangle \Big] = 0.    (A.2)

Assuming Y_t is a D × 1 vector, let S_t be the MK × 1 vector obtained by concatenating the S_t^{(m)} vectors, and let W be the D × MK matrix obtained by concatenating the W^{(m)} matrices (of size D × K). Then solving (A.2) results in

W^{\text{new}} = \Big( \sum_{t=1}^{T} Y_t \langle S_t' \rangle \Big) \Big( \sum_{t=1}^{T} \langle S_t S_t' \rangle \Big)^{\dagger},    (A.3)

where † is the Moore-Penrose pseudo-inverse. Note that the model is overparameterized, since the D × 1 means of each of the W^{(m)} matrices add up to a single mean. Using the pseudo-inverse removes the need to explicitly subtract this overall mean from each W^{(m)} and estimate it separately as another parameter.
To estimate the priors, we solve ∂Q/∂π^{(m)} = 0 subject to the constraint that they sum to one, obtaining

\pi^{(m)\,\text{new}} = \langle S_1^{(m)} \rangle.    (A.4)
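In code, the weight and prior updates (A.3)-(A.4) amount to a single pseudo-inverse solve (our sketch, using the concatenated layout defined above; the names are assumptions):

```python
import numpy as np

def m_step_W(Y, ES, ESS):
    """Equation (A.3) for the concatenated weight matrix W (D x MK).

    Y: (T, D) observations; ES: (T, M*K) concatenated <S_t>;
    ESS: (T, M*K, M*K) per-step second moments <S_t S_t'>."""
    YS = Y.T @ ES                    # sum_t Y_t <S_t>'  -> (D, M*K)
    SS = ESS.sum(axis=0)             # sum_t <S_t S_t'>  -> (M*K, M*K)
    return YS @ np.linalg.pinv(SS)   # Moore-Penrose pseudo-inverse

def m_step_priors(ES1, M, K):
    """Equation (A.4): the new prior for chain m is simply <S_1^(m)>."""
    return [ES1[m * K:(m + 1) * K] for m in range(M)]
```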
Appendix B

Exact forward-backward algorithm

Here we specify an exact forward-backward recursion for computing the posterior probabilities of the hidden states in a factorial HMM. It differs from a straightforward application of the forward-backward algorithm on the equivalent K^M state HMM in that it does not depend on a K^M × K^M transition matrix. Rather, it makes use of the independence of the underlying Markov chains to sum over M transition matrices of size K × K.
Using the notation \{Y\}_t^r to mean the observation sequence Y_t, \ldots, Y_r, we define

\alpha_t = P(S_t^{(1)}, S_t^{(2)}, \ldots, S_t^{(M)}, \{Y\}_1^t \mid \phi)
\alpha_t^{(0)} = P(S_t^{(1)}, S_t^{(2)}, \ldots, S_t^{(M)}, \{Y\}_1^{t-1} \mid \phi)
\alpha_t^{(1)} = P(S_{t-1}^{(1)}, S_t^{(2)}, \ldots, S_t^{(M)}, \{Y\}_1^{t-1} \mid \phi)
    \vdots
\alpha_t^{(M)} = P(S_{t-1}^{(1)}, \ldots, S_{t-1}^{(M)}, \{Y\}_1^{t-1} \mid \phi) = \alpha_{t-1}.
At the end of the forward recursions, the likelihood of the observation sequence is the sum of the K^M elements in \alpha_T.

Similarly, to obtain the backward recursions we define

\beta_t = P(\{Y\}_{t+1}^T \mid S_t^{(1)}, \ldots, S_t^{(M)}, \phi)
\beta_{t-1}^{(M)} = P(\{Y\}_t^T \mid S_t^{(1)}, \ldots, S_t^{(M)}, \phi)
    \vdots
\beta_{t-1}^{(1)} = P(\{Y\}_t^T \mid S_t^{(1)}, S_{t-1}^{(2)}, \ldots, S_{t-1}^{(M)}, \phi)

\langle S_{t-1}^{(m)} S_t^{(m)\prime} \rangle = \frac{ \sum_{\{S_{t-1}^{(n)}, S_t^{(r)} \,:\, n \neq m,\, r \neq m\}} \alpha_{t-1}\, P(S_t \mid S_{t-1})\, P(Y_t \mid S_t)\, \beta_t }{ \sum_{S_{t-1}, S_t} \alpha_{t-1}\, P(S_t \mid S_{t-1})\, P(Y_t \mid S_t)\, \beta_t }.    (B.8)
Appendix C
Completely factorized variational approximation
Using the definition of the probabilistic model given by equations (1)-(4b), the posterior probability of the states given an observation sequence can be written as

P(\{S_t\} \mid \{Y_t\}, \phi) = \frac{1}{Z} \exp\{ -H(\{S_t, Y_t\}) \},    (C.1)

where Z is a normalization constant ensuring that the probabilities sum to one and

H(\{S_t, Y_t\}) = \frac{1}{2} \sum_{t=1}^{T} \Big( Y_t - \sum_{m=1}^{M} W^{(m)} S_t^{(m)} \Big)' C^{-1} \Big( Y_t - \sum_{m=1}^{M} W^{(m)} S_t^{(m)} \Big)
    - \sum_{m=1}^{M} S_1^{(m)\prime} \log \pi^{(m)} - \sum_{t=2}^{T} \sum_{m=1}^{M} S_t^{(m)\prime} (\log P^{(m)}) S_{t-1}^{(m)}.    (C.2)
Similarly, the probability distribution given by the variational approximation (7)-(8) can be written as

Q(\{S_t\} \mid \theta) = \frac{1}{Z_Q} \exp\{ -H_Q(\{S_t\}) \},    (C.3)

where

H_Q(\{S_t\}) = -\sum_{t=1}^{T} \sum_{m=1}^{M} S_t^{(m)\prime} \log \theta_t^{(m)}.    (C.4)
Using this notation, and denoting expectation with respect to the variational distribution using angled brackets ⟨·⟩, the KL divergence is

KL(Q‖P) = \langle H \rangle - \langle H_Q \rangle - \log Z_Q + \log Z.    (C.5)

Three facts can be verified from the definition of the variational approximation:

\langle S_t^{(m)} \rangle = \theta_t^{(m)}    (C.6)

\langle S_{t-1}^{(m)} S_t^{(m)\prime} \rangle = \theta_{t-1}^{(m)} \theta_t^{(m)\prime}    (C.7)

\langle S_t^{(m)} S_t^{(n)\prime} \rangle = \theta_t^{(m)} \theta_t^{(n)\prime} \text{ if } m \neq n, \quad \operatorname{diag}\{\theta_t^{(m)}\} \text{ if } m = n,    (C.8)
where diag is an operator that takes a vector and returns a square matrix with the elements of the vector along its diagonal, and zeros everywhere else. The KL divergence can therefore be expanded to

KL = \sum_{t=1}^{T} \sum_{m=1}^{M} \theta_t^{(m)\prime} \log \theta_t^{(m)} + \frac{1}{2} \sum_{t=1}^{T} \Big[ Y_t' C^{-1} Y_t - 2 \sum_{m=1}^{M} Y_t' C^{-1} W^{(m)} \theta_t^{(m)}
    + \sum_{m=1}^{M} \sum_{n \neq m} \operatorname{tr}\big\{ W^{(m)\prime} C^{-1} W^{(n)} \theta_t^{(n)} \theta_t^{(m)\prime} \big\} + \sum_{m=1}^{M} \operatorname{tr}\big\{ W^{(m)\prime} C^{-1} W^{(m)} \operatorname{diag}\{\theta_t^{(m)}\} \big\} \Big]
    - \sum_{m=1}^{M} \theta_1^{(m)\prime} \log \pi^{(m)} - \sum_{t=2}^{T} \sum_{m=1}^{M} \theta_t^{(m)\prime} (\log P^{(m)}) \theta_{t-1}^{(m)} - \log Z_Q + \log Z.
Appendix D
Structured approximation
For the structured approximation, H_Q is defined as

H_Q(\{S_t\}) = -\sum_{m=1}^{M} S_1^{(m)\prime} \log \pi^{(m)} - \sum_{t=2}^{T} \sum_{m=1}^{M} S_t^{(m)\prime} (\log P^{(m)}) S_{t-1}^{(m)} - \sum_{t=1}^{T} \sum_{m=1}^{M} S_t^{(m)\prime} \log h_t^{(m)}.    (D.1)
Using (C.2), we write the KL divergence as

KL = \sum_{t=1}^{T} \sum_{m=1}^{M} \langle S_t^{(m)} \rangle' \log h_t^{(m)} + \frac{1}{2} \sum_{t=1}^{T} \Big[ Y_t' C^{-1} Y_t - 2 \sum_{m=1}^{M} Y_t' C^{-1} W^{(m)} \langle S_t^{(m)} \rangle
    + \sum_{m=1}^{M} \sum_{n \neq m} \operatorname{tr}\big\{ W^{(m)\prime} C^{-1} W^{(n)} \langle S_t^{(n)} \rangle \langle S_t^{(m)} \rangle' \big\}
    + \sum_{m=1}^{M} \operatorname{tr}\big\{ W^{(m)\prime} C^{-1} W^{(m)} \operatorname{diag}\{ \langle S_t^{(m)} \rangle \} \big\} \Big] - \log Z_Q + \log Z.    (D.2)
Since KL is independent of π^{(m)} and P^{(m)}, the first thing to note is that these parameters of the structured approximation remain equal to the equivalent parameters of the true system. Now, taking derivatives with respect to log h^{(n)}, we get

\frac{\partial KL}{\partial \log h^{(n)}} = \langle S^{(n)} \rangle + \sum_{t=1}^{T} \sum_{m=1}^{M} \Big[ \log h_t^{(m)} - W^{(m)\prime} C^{-1} Y_t + \sum_{\ell \neq m} W^{(m)\prime} C^{-1} W^{(\ell)} \langle S_t^{(\ell)} \rangle + \frac{1}{2} \Delta^{(m)} \Big]' \frac{\partial \langle S_t^{(m)} \rangle}{\partial \log h^{(n)}} - \langle S^{(n)} \rangle.    (D.3)

The last term, which we obtained by making use of the fact that

\frac{\partial \log Z_Q}{\partial \log h^{(n)}} = \langle S^{(n)} \rangle,    (D.4)

cancels out the first term. Setting the terms inside the brackets in (D.3) equal to zero yields equation (12a).
Acknowledgments

We thank Lawrence Saul for helpful discussions and Geoffrey Hinton for support. This project was supported in part by a grant from the McDonnell-Pew Foundation, by a grant from ATR Human Information Processing Research Laboratories, by a gift from Siemens Corporation, and by grant N00014-94-1-0777 from the Office of Naval Research. Zoubin Ghahramani was supported by a grant from the Ontario Information Technology Research Centre.
Notes

1. For related work on inference in distributed state HMMs, see Dean and Kanazawa (1989).

2. In speech, neural networks are generally used to model P(S_t | Y_t); this probability is converted to the observation probabilities needed in the HMM via Bayes rule.

3. If the columns of W^{(m)} and W^{(n)} are orthogonal for every pair of state variables, m and n, and C is a diagonal covariance matrix, then the state variables will no longer be dependent given the observation. In this case there is no "explaining away": each state variable is modeling the variability in the observation along a different subspace.

4. A more Bayesian treatment of the learning problem, in which the parameters are also considered hidden random variables, can be handled by Gibbs sampling by replacing the "M step" with sampling from the conditional distribution of the parameters given the other hidden variables (for example, see Tanner and Wong, 1987).

5. The first term is replaced by log π^{(m)} for t = 1; the second term does not appear for t = T.

6. All samples were used for learning; that is, no samples were discarded at the beginning of the run. Although ten samples is too few to even approach convergence, it provides a run-time roughly comparable to the variational methods. The goal was to see whether this "impatient" Gibbs sampler would be able to compete with the other approximate methods.
7. Lower values suggest a better probabilistic model: a value of one, for example, means that it would take one bit more than the true generative model to code each observation vector. Standard deviations reflect the variation due to training set, test set, and the random seed of the algorithm. Standard errors on the mean are a factor of 3.8 smaller.

8. For the variational methods these dashed lines are equal to minus the lower bound on the log likelihood, except for a normalization term which is intractable to compute and can vary during learning, resulting in the apparent occasional increases in the bound.

9. Since the attributes were modeled as real numbers, the log likelihoods are only a measure of relative coding cost. Comparisons between these likelihoods are meaningful, whereas to obtain the absolute cost of coding a sequence, it is necessary to specify a discretization level.

10. This is analogous to the fully-connected Boltzmann machine with N units (Hinton & Sejnowski, 1986), in which every binary unit is coupled to every other unit using O(N^2) parameters, rather than the O(2^N) parameters required to specify the complete probability table.
References
Baum, L., Petrie, T., Soules, G., & Weiss, N. (1970). A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. The Annals of Mathematical Statistics, 41, 164-171.

Bengio, Y., & Frasconi, P. (1995). An input-output HMM architecture. In G. Tesauro, D. S. Touretzky, & T. K. Leen (Eds.), Advances in neural information processing systems 7, pp. 427-434. Cambridge, MA: MIT Press.

Cacciatore, T. W., & Nowlan, S. J. (1994). Mixtures of controllers for jump linear and non-linear plants. In J. D. Cowan, G. Tesauro, & J. Alspector (Eds.), Advances in neural information processing systems 6, pp. 719-726. San Francisco, CA: Morgan Kaufmann.

Conklin, D., & Witten, I. H. (1995). Multiple viewpoint systems for music prediction. Journal of New Music Research, 24, 51-73.

Cover, T., & Thomas, J. (1991). Elements of information theory. New York: John Wiley.

Dawid, A. P. (1992). Applications of a general propagation algorithm for probabilistic expert systems. Statistics and Computing, 2, 25-36.

Dean, T., & Kanazawa, K. (1989). A model for reasoning about persistence and causation. Computational Intelligence, 5, 142-150.

Dempster, A., Laird, N., & Rubin, D. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society Series B, 39, 1-38.

Geman, S., Bienenstock, E., & Doursat, R. (1992). Neural networks and the bias/variance dilemma. Neural Computation, 4, 1-58.

Geman, S., & Geman, D. (1984). Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6, 721-741.

Ghahramani, Z. (1995). Factorial learning and the EM algorithm. In G. Tesauro, D. S. Touretzky, & T. K. Leen (Eds.), Advances in neural information processing systems 7, pp. 617-624. Cambridge, MA: MIT Press.

Heckerman, D. (1995). A tutorial on learning Bayesian networks (Technical Report MSR-TR-95-06). Redmond, WA: Microsoft Research.

Hinton, G. E., & Sejnowski, T. J. (1986). Learning and relearning in Boltzmann machines. In D. E. Rumelhart & J. L. McClelland (Eds.), Parallel distributed processing: Explorations in the microstructure of cognition. Volume 1: Foundations. Cambridge, MA: MIT Press.

Hinton, G. E., & Zemel, R. S. (1994). Autoencoders, minimum description length, and Helmholtz free energy. In J. D. Cowan, G. Tesauro, & J. Alspector (Eds.), Advances in neural information processing systems 6, pp. 3-10. San Francisco, CA: Morgan Kaufmann.

Jensen, F. V., Lauritzen, S. L., & Olesen, K. G. (1990). Bayesian updating in recursive graphical models by local computations. Computational Statistical Quarterly, 4, 269-282.
Jordan, M. I., Ghahramani, Z., & Saul, L. K. (1997). Hidden Markov decision trees. In M. Mozer, M. Jordan, & T. Petsche (Eds.), Advances in neural information processing systems 9. Cambridge, MA: MIT Press.

Jordan, M. I., & Jacobs, R. (1994). Hierarchical mixtures of experts and the EM algorithm. Neural Computation, 6, 181-214.

Kanazawa, K., Koller, D., & Russell, S. J. (1995). Stochastic simulation algorithms for dynamic probabilistic networks. In P. Besnard & S. Hanks (Eds.), Uncertainty in Artificial Intelligence: Proceedings of the Eleventh Conference, pp. 346-351. San Francisco, CA: Morgan Kaufmann.

Krogh, A., Brown, M., Mian, I. S., Sjolander, K., & Haussler, D. (1994). Hidden Markov models in computational biology: Applications to protein modeling. Journal of Molecular Biology, 235, 1501-1531.

Lauritzen, S. L., & Spiegelhalter, D. J. (1988). Local computations with probabilities on graphical structures and their application to expert systems. Journal of the Royal Statistical Society B, 157-224.

McCullagh, P., & Nelder, J. (1989). Generalized linear models. London: Chapman & Hall.

Meila, M., & Jordan, M. I. (1996). Learning fine motion by Markov mixtures of experts. In D. S. Touretzky, M. C. Mozer, & M. E. Hasselmo (Eds.), Advances in neural information processing systems 8, pp. 1003-1009. Cambridge, MA: MIT Press.

Merz, C. J., & Murphy, P. M. (1996). UCI Repository of machine learning databases [http://www.ics.uci.edu/~mlearn/MLRepository.html]. Irvine, CA: University of California, Department of Information and Computer Science.

Neal, R. M. (1992). Connectionist learning of belief networks. Artificial Intelligence, 56, 71-113.

Neal, R. M. (1993). Probabilistic inference using Markov chain Monte Carlo methods (Technical Report CRG-TR-93-1). Toronto, Ontario: University of Toronto, Department of Computer Science.

Neal, R. M., & Hinton, G. E. (1993). A new view of the EM algorithm that justifies incremental and other variants. Unpublished manuscript, Department of Computer Science, University of Toronto, Ontario.
Parisi, G. (1988). Statistical field theory. Redwood City, CA: Addison-Wesley.

Pearl, J. (1988). Probabilistic reasoning in intelligent systems: Networks of plausible inference. San Mateo, CA: Morgan Kaufmann.

Rabiner, L. R., & Juang, B. H. (1986). An introduction to hidden Markov models. IEEE Acoustics, Speech & Signal Processing Magazine, 3, 4-16.

Saul, L. K., & Jordan, M. I. (1997). Mixed memory Markov models. In D. Madigan & P. Smyth (Eds.), Proceedings of the 1997 Conference on Artificial Intelligence and Statistics. Ft. Lauderdale, FL.

Saul, L., Jaakkola, T., & Jordan, M. I. (1996). Mean field theory for sigmoid belief networks. Journal of Artificial Intelligence Research, 4, 61-76.

Saul, L., & Jordan, M. I. (1995). Boltzmann chains and hidden Markov models. In G. Tesauro, D. S. Touretzky, & T. K. Leen (Eds.), Advances in neural information processing systems 7, pp. 435-442. Cambridge, MA: MIT Press.

Saul, L., & Jordan, M. I. (1996). Exploiting tractable substructures in intractable networks. In D. S. Touretzky, M. C. Mozer, & M. E. Hasselmo (Eds.), Advances in neural information processing systems 8, pp. 486-492. Cambridge, MA: MIT Press.

Smyth, P., Heckerman, D., & Jordan, M. I. (1997). Probabilistic independence networks for hidden Markov probability models. Neural Computation, 9, 227-269.

Stolcke, A., & Omohundro, S. (1993). Hidden Markov model induction by Bayesian model merging. In S. J. Hanson, J. D. Cowan, & C. L. Giles (Eds.), Advances in neural information processing systems 5, pp. 11-18. San Francisco, CA: Morgan Kaufmann.

Tanner, M. A., & Wong, W. H. (1987). The calculation of posterior distributions by data augmentation (with discussion). Journal of the American Statistical Association, 82, 528-550.

Viterbi, A. J. (1967). Error bounds for convolutional codes and an asymptotically optimal decoding algorithm. IEEE Transactions on Information Theory, IT-13, 260-269.

Williams, C. K. I., & Hinton, G. E. (1991). Mean field networks that learn to discriminate temporally distorted strings. In D. Touretzky, J. Elman, T. Sejnowski, & G. Hinton (Eds.), Connectionist models: Proceedings of the 1990 summer school, pp. 18-22. San Francisco, CA: Morgan Kaufmann.

Zemel, R. S. (1993). A minimum description length framework for unsupervised learning. Ph.D. thesis, Department of Computer Science, University of Toronto, Toronto, Canada.