AML_mod2
Models
hidden Markov model
• doubly (or bivariate) stochastic process in which an
underlying stochastic process that is not observable can
only be observed through another stochastic process that
produces a sequence of observations
- state process
- observation process
a little story about HMM
[Figure: a two-state HMM with hidden states H = Hot (s2) and C = Cold (s3). From the start state s1 the chain enters H with probability .8 and C with probability .2. Transitions: a_HH = .7, a_HC = .3, a_CH = .4, a_CC = .6. Each state has its own emission distribution over the symbols 1, 2, 3:
B1 (Hot): P(1|H) = .2, P(2|H) = .4, P(3|H) = .4
B2 (Cold): P(1|C) = .5, P(2|C) = .4, P(3|C) = .1]
HMM: definition
λ = (S, V, A, B, Π)
set of states: S = {s_1, ..., s_N}
set of observation symbols: V = {v_1, ..., v_M}
state transition probabilities: A = [a_ij], where a_ij ≜ P(q_{t+1} = s_j | q_t = s_i), ∀i, j = 1, ..., N
(q_t, t = 1, ..., T: state at time t)
observation probabilities: B = [b_j(m)], where b_j(m) ≜ P(o_t = v_m | q_t = s_j), ∀j = 1, ..., N
(o_t, t = 1, ..., T: observation at time t)
initial state distribution: Π = [π_i], where π_i ≜ P(q_1 = s_i), ∀i = 1, ..., N
stochastic constraints on HMM
Σ_{j=1}^{N} a_ij = 1, ∀i
Σ_{i=1}^{N} π_i = 1
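As a quick illustration (not part of the original slides), the two-state Hot/Cold model above can be written down as numpy arrays; the row-sum checks encode the stochastic constraints. The names pi, A, B are my own.

```python
# A minimal sketch of the Hot/Cold HMM as numpy arrays (names are illustrative).
# States: 0 = Hot, 1 = Cold; observation symbols v_1, v_2, v_3 -> indices 0, 1, 2.
import numpy as np

pi = np.array([0.8, 0.2])            # initial state distribution Π
A = np.array([[0.7, 0.3],            # a_ij = P(q_{t+1} = s_j | q_t = s_i)
              [0.4, 0.6]])
B = np.array([[0.2, 0.4, 0.4],       # b_j(m) = P(o_t = v_m | q_t = s_j)
              [0.5, 0.4, 0.1]])

# stochastic constraints: each row of A and B, and Π itself, sums to 1
assert np.allclose(A.sum(axis=1), 1.0)
assert np.allclose(B.sum(axis=1), 1.0)
assert np.isclose(pi.sum(), 1.0)
```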
assumptions
Markov assumption:
P(q_t | q_1 ... q_{t−1}) = P(q_t | q_{t−1})
output independence:
P(o_t | q_1 ... q_t ... q_T, o_1 ... o_t ... o_T) = P(o_t | q_t)
multinomial observations:
P(o_t | q_t = s_j, λ) = ∏_{m=1}^{M} b_j(m)^{r_t^m}
where r_t^m = 1 if o_t = v_m, and 0 otherwise
3 fundamental problems
likelihood problem:
Given an HMM λ = (A, B, Π) and an observation sequence O, determine the likelihood P(O|λ)
decoding problem:
Given an HMM λ = (A, B, Π) and an observation sequence O, discover the best hidden state sequence Q
learning problem:
Given an observation sequence O, the set of states S, and the set of symbols V in the HMM, learn the HMM parameters A and B
likelihood: probability evaluation
O = o_1 o_2 ... o_T (observation sequence), Q = q_1 q_2 ... q_T (state sequence)
P(O|Q, λ) = ∏_{t=1}^{T} P(o_t | q_t, λ)
P(O|λ) = Σ_{all Q} P(O|Q, λ) P(Q|λ); summing directly over all N^T state sequences is intractable, which motivates the forward variable
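A small sketch (my own, not from the slides) that evaluates P(O|λ) by brute force, enumerating every state sequence Q and summing P(O|Q, λ)P(Q|λ); it is feasible only for tiny T and mainly shows why the forward algorithm is needed.

```python
# Brute-force P(O|λ): sum P(O, Q|λ) over all N^T state sequences Q.
import itertools
import numpy as np

pi = np.array([0.8, 0.2])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.2, 0.4, 0.4], [0.5, 0.4, 0.1]])
O = [2, 0, 1]                          # observation indices, e.g. symbols 3, 1, 2

total = 0.0
for Q in itertools.product(range(len(pi)), repeat=len(O)):
    p = pi[Q[0]] * B[Q[0], O[0]]       # P(q_1) P(o_1 | q_1)
    for t in range(1, len(O)):
        p *= A[Q[t-1], Q[t]] * B[Q[t], O[t]]
    total += p                          # accumulate P(O, Q | λ)
print(total)                            # P(O|λ)
```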
likelihood: forward variable
α_t(j) ≜ P(o_1 o_2 ... o_t, q_t = s_j | λ)
[Figure: trellis fragment showing every state s_i at time t−1 connected to state s_j at time t through a_ij, with o_t emitted at time t]
likelihood: forward algorithm
1. Initialization
α_1(i) = π_i b_i(o_1), 1 ≤ i ≤ N
2. Induction
α_{t+1}(j) = [Σ_{i=1}^{N} α_t(i) a_ij] b_j(o_{t+1}), 1 ≤ t ≤ T−1, 1 ≤ j ≤ N
3. Termination
P(O|λ) = Σ_{i=1}^{N} α_T(i)
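A hedged numpy sketch of the forward algorithm exactly as stated above; forward, A, B, pi, and the observation-index list O are names I introduce here, following the conventions of the earlier sketch.

```python
import numpy as np

def forward(O, A, B, pi):
    """Forward algorithm: returns the alpha trellis and P(O|lambda)."""
    N, T = A.shape[0], len(O)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, O[0]]                      # initialization: α_1(i) = π_i b_i(o_1)
    for t in range(1, T):
        alpha[t] = (alpha[t-1] @ A) * B[:, O[t]]    # induction: [Σ_i α_t(i) a_ij] b_j(o_{t+1})
    return alpha, alpha[-1].sum()                   # termination: P(O|λ) = Σ_i α_T(i)

pi = np.array([0.8, 0.2])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.2, 0.4, 0.4], [0.5, 0.4, 0.1]])
alpha, likelihood = forward([2, 0, 1], A, B, pi)
```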
backward variable
β_t(i) ≜ P(o_{t+1} o_{t+2} ... o_T | q_t = s_i, λ)
β_t(i) = Σ_{j=1}^{N} a_ij b_j(o_{t+1}) β_{t+1}(j)
[Figure: trellis fragment showing state s_i at time t connected to every state s_j at time t+1 through a_ij, with o_{t+1} emitted at time t+1]
backward procedure
1. initialization
β_T(i) = 1, 1 ≤ i ≤ N
2. induction
β_t(i) = Σ_{j=1}^{N} a_ij b_j(o_{t+1}) β_{t+1}(j), t = T−1, T−2, ..., 1, 1 ≤ i ≤ N
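A matching sketch of the backward procedure, under the same assumed conventions (O holds observation indices; time is 0-based inside the code).

```python
import numpy as np

def backward(O, A, B):
    """Backward procedure: beta[t, i] = P(O[t+1:] | state s_i at step t, lambda)."""
    N, T = A.shape[0], len(O)
    beta = np.zeros((T, N))
    beta[-1] = 1.0                                   # initialization: β_T(i) = 1
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, O[t+1]] * beta[t+1])     # induction: Σ_j a_ij b_j(o_{t+1}) β_{t+1}(j)
    return beta
```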
decoding: Viterbi algorithm
γ_t(i) ≜ P(q_t = s_i | O, λ) = P(O, q_t = s_i | λ) / P(O|λ)
= P(O, q_t = s_i | λ) / Σ_{i=1}^{N} P(O, q_t = s_i | λ) = α_t(i) β_t(i) / Σ_{i=1}^{N} α_t(i) β_t(i)
[Figure: trellis around time t; α_t(i) summarizes the paths ending in s_i at time t, β_t(i) the paths continuing from s_i]
most likely state at time t
q_t* = arg max_{1≤i≤N} [γ_t(i)], 1 ≤ t ≤ T
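A short sketch (mine) computing the posterior γ_t(i) and the individually most likely state at each step, assuming alpha and beta come from the forward/backward sketches above.

```python
import numpy as np

def posterior_states(alpha, beta):
    """gamma[t, i] = alpha_t(i) beta_t(i) / sum_i alpha_t(i) beta_t(i)."""
    gamma = alpha * beta
    gamma /= gamma.sum(axis=1, keepdims=True)
    return gamma, gamma.argmax(axis=1)      # arg max_i γ_t(i) for each t
```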
Viterbi algorithm
δ_t(i) ≜ max_{q_1 q_2 ... q_{t−1}} P(q_1 q_2 ... q_{t−1}, q_t = s_i, o_1 o_2 ... o_t | λ)
δ_{t+1}(j) = [max_i δ_t(i) a_ij] · b_j(o_{t+1})
ψ_t(j) ≜ the state at time t−1 that maximizes δ_{t−1}(i) a_ij (the backpointer used to recover the path)
Viterbi algorithm
1. initialization
δ_1(i) = π_i b_i(o_1), ψ_1(i) = 0, 1 ≤ i ≤ N
2. recursion
δ_t(j) = max_{1≤i≤N} [δ_{t−1}(i) a_ij] · b_j(o_t)
ψ_t(j) = arg max_{1≤i≤N} [δ_{t−1}(i) a_ij], 2 ≤ t ≤ T, 1 ≤ j ≤ N
3. termination
P* = max_{1≤i≤N} [δ_T(i)]
q_T* = arg max_{1≤i≤N} [δ_T(i)]
4. path backtracking
q_t* = ψ_{t+1}(q_{t+1}*), t = T−1, T−2, ..., 1
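A hedged numpy sketch of the Viterbi recursion and backtracking listed above; the function name and array layout are my own.

```python
import numpy as np

def viterbi(O, A, B, pi):
    """Most likely state sequence and its probability P*."""
    N, T = A.shape[0], len(O)
    delta = np.zeros((T, N))
    psi = np.zeros((T, N), dtype=int)
    delta[0] = pi * B[:, O[0]]                      # initialization
    for t in range(1, T):
        scores = delta[t-1][:, None] * A            # scores[i, j] = δ_{t-1}(i) a_ij
        psi[t] = scores.argmax(axis=0)              # ψ_t(j)
        delta[t] = scores.max(axis=0) * B[:, O[t]]  # δ_t(j)
    q = np.zeros(T, dtype=int)
    q[-1] = delta[-1].argmax()                      # termination
    for t in range(T - 2, -1, -1):
        q[t] = psi[t+1, q[t+1]]                     # path backtracking
    return q, delta[-1].max()
```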
learning problem
choose λ = (A, B, Π) such that the likelihood P(O|λ) is maximized
[Figure: the two-state Hot/Cold HMM with unknown ("?") start, transition, and emission probabilities to be learned; Q = {Hot, Cold}, O = {1, 3, 2, 2, 1, …}]
learning problem
ξ_t(i, j) ≜ P(q_t = s_i, q_{t+1} = s_j | O, λ)
= P(q_t = s_i, q_{t+1} = s_j, O | λ) / P(O|λ)
= α_t(i) a_ij b_j(o_{t+1}) β_{t+1}(j) / P(O|λ)
= α_t(i) a_ij b_j(o_{t+1}) β_{t+1}(j) / Σ_{i=1}^{N} Σ_{j=1}^{N} α_t(i) a_ij b_j(o_{t+1}) β_{t+1}(j)
[Figure: trellis fragment showing the transition a_ij from s_i at time t to s_j at time t+1, with o_{t+1} emitted at time t+1]
learning problem
γ_t(i) = Σ_{j=1}^{N} ξ_t(i, j)
Σ_{t=1}^{T−1} γ_t(i) = expected # of transitions from state s_i in O
Σ_{t=1}^{T−1} ξ_t(i, j) = expected # of transitions from state s_i to state s_j in O
(re)estimation of HMM parameters
π̂_i = expected frequency in state s_i at t = 1 = γ_1(i)
â_ij = expected # of transitions from state s_i to state s_j / expected # of transitions from state s_i
     = Σ_{t=1}^{T−1} ξ_t(i, j) / Σ_{t=1}^{T−1} γ_t(i)
b̂_j(m) = expected # of times in state s_j observing v_m / expected # of times in state s_j
       = Σ_{t=1, s.t. o_t=v_m}^{T} γ_t(j) / Σ_{t=1}^{T} γ_t(j)
forward-backward algorithm (Baum-Welch)
initialize A, B, and Π
iterate until convergence:
E-step:
γ_t(i) = α_t(i) β_t(i) / Σ_{i=1}^{N} α_t(i) β_t(i), ∀t, i
ξ_t(i, j) = α_t(i) a_ij b_j(o_{t+1}) β_{t+1}(j) / Σ_{i=1}^{N} Σ_{j=1}^{N} α_t(i) a_ij b_j(o_{t+1}) β_{t+1}(j), ∀t, i, j
M-step:
π̂_i = γ_1(i)
â_ij = Σ_{t=1}^{T−1} ξ_t(i, j) / Σ_{t=1}^{T−1} γ_t(i)
b̂_j(m) = Σ_{t=1, s.t. o_t=v_m}^{T} γ_t(j) / Σ_{t=1}^{T} γ_t(j)
where
α_t(j) ≜ P(o_1 o_2 ... o_t, q_t = s_j | λ) = [Σ_{i=1}^{N} α_{t−1}(i) a_ij] b_j(o_t)
β_t(i) ≜ P(o_{t+1} o_{t+2} ... o_T | q_t = s_i, λ) = Σ_{j=1}^{N} a_ij b_j(o_{t+1}) β_{t+1}(j)
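A sketch of a single Baum-Welch iteration following the E-step/M-step above; it assumes the forward and backward functions from the earlier sketches and multinomial observations with M symbols.

```python
import numpy as np

def baum_welch_step(O, A, B, pi):
    """One EM iteration; returns re-estimated (A, B, pi)."""
    T, N, M = len(O), A.shape[0], B.shape[1]
    alpha, _ = forward(O, A, B, pi)
    beta = backward(O, A, B)

    # E-step: γ_t(i) and ξ_t(i, j)
    gamma = alpha * beta
    gamma /= gamma.sum(axis=1, keepdims=True)
    xi = np.zeros((T - 1, N, N))
    for t in range(T - 1):
        xi[t] = alpha[t][:, None] * A * B[:, O[t+1]] * beta[t+1]
        xi[t] /= xi[t].sum()

    # M-step
    pi_new = gamma[0]                                           # π̂_i = γ_1(i)
    A_new = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]    # â_ij
    B_new = np.zeros((N, M))
    O = np.asarray(O)
    for m in range(M):
        B_new[:, m] = gamma[O == m].sum(axis=0)                 # numerator of b̂_j(m)
    B_new /= gamma.sum(axis=0)[:, None]
    return A_new, B_new, pi_new
```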
multiple training sequences
X ≜ {O^k}_{k=1}^{K}
P(X|λ) = ∏_{k=1}^{K} P(O^k|λ)
π̂_i = (1/K) Σ_{k=1}^{K} γ_1^k(i)
â_ij = Σ_{k=1}^{K} Σ_{t=1}^{T_k−1} ξ_t^k(i, j) / Σ_{k=1}^{K} Σ_{t=1}^{T_k−1} γ_t^k(i)
b̂_j(m) = Σ_{k=1}^{K} Σ_{t=1, s.t. o_t^k=v_m}^{T_k} γ_t^k(j) / Σ_{k=1}^{K} Σ_{t=1}^{T_k} γ_t^k(j)
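A sketch of how the expected counts can be pooled over K training sequences before normalizing, mirroring the multi-sequence re-estimates above; it reuses the forward/backward sketches, and the accumulator names are my own.

```python
import numpy as np

def baum_welch_multi(seqs, A, B, pi):
    """One EM iteration over a list of observation-index sequences."""
    N, M = B.shape
    A_num, A_den = np.zeros((N, N)), np.zeros(N)
    B_num, B_den = np.zeros((N, M)), np.zeros(N)
    pi_acc = np.zeros(N)
    for O in seqs:
        alpha, _ = forward(O, A, B, pi)
        beta = backward(O, A, B)
        gamma = alpha * beta
        gamma /= gamma.sum(axis=1, keepdims=True)
        for t in range(len(O) - 1):
            xi = alpha[t][:, None] * A * B[:, O[t+1]] * beta[t+1]
            A_num += xi / xi.sum()                   # Σ_t ξ_t^k(i, j)
        A_den += gamma[:-1].sum(axis=0)              # Σ_t γ_t^k(i)
        O = np.asarray(O)
        for m in range(M):
            B_num[:, m] += gamma[O == m].sum(axis=0)
        B_den += gamma.sum(axis=0)
        pi_acc += gamma[0]
    return A_num / A_den[:, None], B_num / B_den[:, None], pi_acc / len(seqs)
```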
model selection
• tuning the topology of an HMM (see the sketch after this list)
- zeroing out some impossible (or unnecessary) transitions: a_ij = 0
- moving forward only: a_ij = 0, for j < i
- no big jumps: a_ij = 0, for j > i + τ
• # of states
- determined using prior information
- can be fine-tuned by cross-validation, checking the likelihood of validation sequences
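A small sketch (my own) of how a constrained topology can be imposed by masking the transition matrix: forward moves only, with jumps bounded by τ, then renormalizing the rows.

```python
import numpy as np

def left_to_right_mask(N, tau):
    """mask[i, j] = 1 exactly when i <= j <= i + tau (a_ij = 0 elsewhere)."""
    mask = np.zeros((N, N))
    for i in range(N):
        for j in range(N):
            if i <= j <= i + tau:
                mask[i, j] = 1.0
    return mask

A = np.random.rand(5, 5) * left_to_right_mask(5, tau=2)   # zero out forbidden transitions
A /= A.sum(axis=1, keepdims=True)                          # restore Σ_j a_ij = 1
```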
finite mixture
P(o_t = v_m) = Σ_{j=1}^{N} P(o_t = v_m | q_t = s_j) P(q_t = s_j)
for continuous observations modeled with a per-state Gaussian, the re-estimates become
μ̂_j = Σ_t γ_t(j) o_t / Σ_t γ_t(j)
σ̂_j² = Σ_t γ_t(j) (o_t − μ̂_j)² / Σ_t γ_t(j)
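A brief sketch of the Gaussian re-estimation above, assuming gamma comes from an E-step and o is a 1-D array of continuous observations (names are mine).

```python
import numpy as np

def gaussian_reestimate(o, gamma):
    """Per-state weighted mean and variance from the state posteriors gamma."""
    w = gamma / gamma.sum(axis=0)                    # normalized γ_t(j) per state
    mu = (w * o[:, None]).sum(axis=0)                # μ̂_j
    var = (w * (o[:, None] - mu) ** 2).sum(axis=0)   # σ̂²_j
    return mu, var
```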
earthquake example
(source: D. W. Chambers, et al., Hidden Markov model forecasting of earthquakes)
[Figure: a three-state HMM (s1, s2, s3) with unknown transition probabilities; each state s_i has its own observation density p(o_t | q_t = s_i)]
earthquake: forecasting problem
Given an HMM with parameters (A, θ_1, θ_2, θ_3, Π), where state s_i emits exponentially distributed observations with mean θ_i, and the observations up to the present, find the forecast density:
p(o_{t+1} = x | o_1 o_2 ... o_t, λ)
= Σ_{i=1}^{3} p(o_{t+1} = x, q_{t+1} = s_i | o_1 o_2 ... o_t, λ)
= Σ_{i=1}^{3} p(o_{t+1} = x | q_{t+1} = s_i, o_1 ... o_t, λ) · p(q_{t+1} = s_i | o_1 ... o_t, λ)
= Σ_{i=1}^{3} (1/θ_i) e^{−x/θ_i} · p(q_{t+1} = s_i | o_1 ... o_t, λ)
= Σ_{i=1}^{3} (1/θ_i) e^{−x/θ_i} · p(q_{t+1} = s_i, o_1 ... o_t | λ) / p(o_1 ... o_t | λ)
= Σ_{i=1}^{3} (1/θ_i) e^{−x/θ_i} · [Σ_{j=1}^{3} α_t(j) a_ji] / [Σ_{j=1}^{3} α_t(j)]
earthquake: forecasting problem
P(o_{t+1} ≤ x | o_1 o_2 ... o_t, λ)
= Σ_{i=1}^{3} (1 − e^{−x/θ_i}) · [Σ_{j=1}^{3} α_t(j) a_ji] / [Σ_{j=1}^{3} α_t(j)]
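A sketch of evaluating the forecast probability above, given the current row of the forward trellis (alpha_t), the transition matrix, and the state means θ_i; all inputs would come from a fitted model, and the function name is my own.

```python
import numpy as np

def forecast_prob(x, alpha_t, A, theta):
    """P(o_{t+1} <= x | o_1..o_t, lambda) for exponential state densities."""
    w = (alpha_t @ A) / alpha_t.sum()            # p(q_{t+1} = s_i | o_1..o_t, λ)
    return float(np.sum((1.0 - np.exp(-x / theta)) * w))
```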
earthquake: the results
• θ_1 = 1.3, θ_2 = 17.42, θ_3 = 27.92
• 1226 forecasts were made to find the probability of another earthquake within 7 days
• proportion of times an earthquake did actually occur within 7 days:

forecast probability bin   [.28, .32)   [.32, .36)   [.36, .50)   [.50, 1]
observed proportion          .292         .300         .421        .625
classification
• set of HMMs, each one modeling the sequences
belonging to one class
arg max_i P(λ_i|O) = arg max_i [ P(O|λ_i) P(λ_i) / Σ_j P(O|λ_j) P(λ_j) ]
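A sketch of MAP classification with one HMM per class, reusing the forward sketch; hmms is my own structure, a list of (A, B, pi, prior) tuples.

```python
import numpy as np

def classify(O, hmms):
    """Return the index i maximizing P(O|λ_i) P(λ_i)."""
    scores = [forward(O, A, B, pi)[1] * prior for (A, B, pi, prior) in hmms]
    return int(np.argmax(scores))
```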
numerical issues
• scaling α and β
- for large t, the α_t(i) computation will exceed the precision range (even in double precision); see the log-space sketch after this list
• smoothing
- symbol vm that does not appear in the training sequence will
make bj(m) = 0
• imbalance between emission and transition probabilities
- bj(m) << aij
- P(O|λ) becomes mostly influenced by bj(m)
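One common way to handle the underflow problem is to work in log space rather than rescale α and β at every step; the sketch below (my own, not the slides' scaling scheme) computes log P(O|λ) with a log-space forward pass.

```python
import numpy as np
from scipy.special import logsumexp

def log_forward(O, A, B, pi):
    """log P(O|lambda); zero probabilities become -inf in log space."""
    logA, logB, logpi = np.log(A), np.log(B), np.log(pi)
    log_alpha = logpi + logB[:, O[0]]
    for t in range(1, len(O)):
        log_alpha = logsumexp(log_alpha[:, None] + logA, axis=0) + logB[:, O[t]]
    return logsumexp(log_alpha)
```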
[application example slides, figures omitted:]
HMM for Hot Topic Detection
user state mining for games
automatic composition of Mozart-style music
readings
• K. Seymore, A. McCallum, and R. Rosenfeld, “Learning Hidden Markov
Model Structure for Information Extraction,” Proc. AAAI Workshop
on Machine Learning for Information Extraction, 1999.
• A. Srivastava, et al., “Credit Card Fraud Detection Using Hidden Markov
Model,” IEEE Trans. Dependable and Secure Computing, vol. 5, no. 1,
pp. 37-48, 2008.
• S. M. Siddiqi and A. W. Moore, “Fast Inference and Learning in Large-
State-Space HMMs,” Proc. 22nd Int’l Conf. Machine Learning, 2005.
• O. Netzer, J. M. Lattin, and V. Srinivasan, “A Hidden Markov Model of
Customer Relationship Dynamics,” Marketing Science, vol. 27, no. 2, pp.
185-204, 2008.
• M. J. Zaki, C. D. Carothers, and B. K. Szymanski, “VOGUE: A Variable
Order Hidden Markov Model with Duration Based on Frequent Sequence
Mining,” ACM Trans. Knowledge Discovery from Data, vol. 4, no. 1,
2010.
• J. Kleinberg, “Bursty and Hierarchical Structure in Streams,” Proc. 8th
ACM SIGKDD Int’l Conf. Knowledge Discovery and Data Mining, 2002.
references
• E. Alpaydin, Introduction to Machine Learning, The MIT
Press, 2004.
• O. C. Ibe, Markov Processes for Stochastic Modeling,
Academic Press, 2009.
• F. Jelinek, Statistical Methods for Speech Recognition, The
MIT Press, 1997.
• D. Jurafsky and J. H. Martin, Speech and Language Processing, 2nd Ed., Prentice Hall, 2009.
• T. Koski, Hidden Markov Models for Bioinformatics, Kluwer
Academic Publishers, 2001.
• L. Rabiner and B-H. Juang, Fundamentals of Speech
Recognition, Prentice Hall, 1993.