L11.2 Probabilistic Models: EM


Introduction to Machine Learning and Data Mining

Khoat Than
School of Information and Communication Technology
Hanoi University of Science and Technology

2022
Content

¡ Introduction to Machine Learning & Data Mining


¡ Unsupervised learning
¡ Supervised learning
¡ Probabilistic modeling
¨ Expectation maximization

¡ Practical advice
Difficult situations
¡ No closed-form solution for the learning/inference problem?
(a closed-form solution cannot be found directly)
¨ The previous examples are easy cases, as we can find closed-form solutions by setting the gradient to zero.
¨ Many models (e.g., GMM) do not admit a closed-form solution
¡ No explicit expression of the density/mass function?
(no explicit formula is available for computation)

¡ Intractable inference (the inference problem is computationally infeasible)


¨ Inference in many probabilistic models is NP-hard
[Sontag & Roy, 2011; Tosh & Dasgupta, 2019]

Expectation maximization
The EM algorithm
GMM revisit
q Consider learning a GMM, with K Gaussian distributions, from the training data $D = \{\boldsymbol{x}_1, \boldsymbol{x}_2, \ldots, \boldsymbol{x}_M\}$.
q The density function is $p(\boldsymbol{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}, \boldsymbol{\phi}) = \sum_{k=1}^{K} \phi_k \, \mathcal{N}(\boldsymbol{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$
¨ $\boldsymbol{\phi} = (\phi_1, \ldots, \phi_K)$ represents the weights of the Gaussians, $P(z = k \mid \boldsymbol{\phi}) = \phi_k$.
¨ Each multivariate Gaussian has density
$$\mathcal{N}(\boldsymbol{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) = \frac{1}{\sqrt{\det(2\pi \boldsymbol{\Sigma}_k)}} \exp\left(-\frac{1}{2} (\boldsymbol{x} - \boldsymbol{\mu}_k)^{\top} \boldsymbol{\Sigma}_k^{-1} (\boldsymbol{x} - \boldsymbol{\mu}_k)\right)$$

q MLE tries to maximize the following log-likelihood function


$$L(\boldsymbol{\mu}, \boldsymbol{\Sigma}, \boldsymbol{\phi}) = \sum_{i=1}^{M} \log \sum_{k=1}^{K} \phi_k \, \mathcal{N}(\boldsymbol{x}_i \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$$
q We cannot find a closed-form solution!
q Naïve gradient descent: repeat until convergence
¨ Optimize $L(\boldsymbol{\mu}, \boldsymbol{\Sigma}, \boldsymbol{\phi})$ w.r.t. $\boldsymbol{\phi}$ while fixing $(\boldsymbol{\mu}, \boldsymbol{\Sigma})$. Still hard.
¨ Optimize $L(\boldsymbol{\mu}, \boldsymbol{\Sigma}, \boldsymbol{\phi})$ w.r.t. $(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ while fixing $\boldsymbol{\phi}$.
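To make the objective concrete, here is a minimal NumPy/SciPy sketch that evaluates this log-likelihood; the array names `X` (shape M×d), `weights`, `means`, and `covs` are illustrative assumptions, not notation from the slides.

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_log_likelihood(X, weights, means, covs):
    """Log-likelihood of data X under a GMM.

    X: (M, d) data matrix; weights: (K,); means: (K, d); covs: (K, d, d).
    """
    M, K = X.shape[0], len(weights)
    # densities[i, k] = phi_k * N(x_i | mu_k, Sigma_k)
    densities = np.zeros((M, K))
    for k in range(K):
        densities[:, k] = weights[k] * multivariate_normal.pdf(X, mean=means[k], cov=covs[k])
    # L = sum_i log sum_k phi_k * N(x_i | mu_k, Sigma_k)
    return np.sum(np.log(densities.sum(axis=1)))
```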
GMM revisit: K-means
q GMM: we need to know
¨ Among the K Gaussian components, which one generates an instance x? (the index z of the Gaussian component)
¨ The parameters of the individual Gaussian components: $\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k, \phi_k$

q K-means: we need to know
¨ Among the K clusters, to which cluster does an instance x belong? (the cluster index z)
¨ The parameters of the individual clusters: the means

q Idea for GMM?
¨ Compute $P(z \mid \boldsymbol{x}, \boldsymbol{\mu}, \boldsymbol{\Sigma}, \boldsymbol{\phi})$? (note $\sum_{k=1}^{K} P(z = k \mid \boldsymbol{x}, \boldsymbol{\mu}, \boldsymbol{\Sigma}, \boldsymbol{\phi}) = 1$) (soft assignment)
¨ Update the parameters of the individual Gaussians: $\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k, \phi_k$

q K-means training:
¨ Step 1: assign each instance x to the nearest cluster (the cluster index z for each x) (hard assignment)
¨ Step 2: recompute the means of the clusters
GMM: lower bound
q Idea for GMM?
¨ Step 1: compute $P(z \mid \boldsymbol{x}, \boldsymbol{\mu}, \boldsymbol{\Sigma}, \boldsymbol{\phi})$ (note $\sum_{k=1}^{K} P(z = k \mid \boldsymbol{x}, \boldsymbol{\mu}, \boldsymbol{\Sigma}, \boldsymbol{\phi}) = 1$)
¨ Step 2: update the parameters of the Gaussian components: $\boldsymbol{\theta} = (\boldsymbol{\mu}, \boldsymbol{\Sigma}, \boldsymbol{\phi})$

¡ Consider the log-likelihood function
$$L(\boldsymbol{\theta}) = \log P(\boldsymbol{D} \mid \boldsymbol{\theta}) = \sum_{i=1}^{M} \log \sum_{k=1}^{K} \phi_k \, \mathcal{N}(\boldsymbol{x}_i \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$$
¨ Too complex to optimize directly with the gradient
¨ Note that $\log P(\boldsymbol{x} \mid \boldsymbol{\theta}) = \log P(\boldsymbol{x}, z \mid \boldsymbol{\theta}) - \log P(z \mid \boldsymbol{x}, \boldsymbol{\theta})$. Taking the expectation over $P(z \mid \boldsymbol{x}, \boldsymbol{\theta})$, and since $-\mathbb{E}_{z \mid \boldsymbol{x}, \boldsymbol{\theta}} \log P(z \mid \boldsymbol{x}, \boldsymbol{\theta})$ is an entropy and hence non-negative,
$$\log P(\boldsymbol{x} \mid \boldsymbol{\theta}) = \mathbb{E}_{z \mid \boldsymbol{x}, \boldsymbol{\theta}} \log P(\boldsymbol{x}, z \mid \boldsymbol{\theta}) - \mathbb{E}_{z \mid \boldsymbol{x}, \boldsymbol{\theta}} \log P(z \mid \boldsymbol{x}, \boldsymbol{\theta}) \ge \mathbb{E}_{z \mid \boldsymbol{x}, \boldsymbol{\theta}} \log P(\boldsymbol{x}, z \mid \boldsymbol{\theta})$$

¡ Maximizing $L(\boldsymbol{\theta})$ can be done by maximizing the lower bound
$$LB(\boldsymbol{\theta}) = \sum_{\boldsymbol{x} \in \boldsymbol{D}} \mathbb{E}_{z \mid \boldsymbol{x}, \boldsymbol{\theta}} \log P(\boldsymbol{x}, z \mid \boldsymbol{\theta}) = \sum_{\boldsymbol{x} \in \boldsymbol{D}} \sum_{z} P(z \mid \boldsymbol{x}, \boldsymbol{\theta}) \log P(\boldsymbol{x}, z \mid \boldsymbol{\theta})$$
GMM: maximize the lower bound
¨ Step 1: compute $P(z \mid \boldsymbol{x}, \boldsymbol{\mu}, \boldsymbol{\Sigma}, \boldsymbol{\phi})$ (note $\sum_{k=1}^{K} P(z = k \mid \boldsymbol{x}, \boldsymbol{\mu}, \boldsymbol{\Sigma}, \boldsymbol{\phi}) = 1$)
¨ Step 2: update the parameters of the Gaussian components: $\boldsymbol{\theta} = (\boldsymbol{\mu}, \boldsymbol{\Sigma}, \boldsymbol{\phi})$

¡ Bayes' rule: $P(z \mid \boldsymbol{x}, \boldsymbol{\theta}) = P(\boldsymbol{x} \mid z, \boldsymbol{\theta}) P(z \mid \boldsymbol{\phi}) / P(\boldsymbol{x}) = \phi_z \, \mathcal{N}(\boldsymbol{x} \mid \boldsymbol{\mu}_z, \boldsymbol{\Sigma}_z) / C$,
where $C = \sum_k \phi_k \, \mathcal{N}(\boldsymbol{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$ is the normalizing constant.
¨ Meaning that one can compute $P(z \mid \boldsymbol{x}, \boldsymbol{\theta})$ if $\boldsymbol{\theta}$ is known
¨ Denote $T_{ki} = P(z = k \mid \boldsymbol{x}_i, \boldsymbol{\theta})$ for any index $k = 1, \ldots, K$ and $i = 1, \ldots, M$
¡ How about $\boldsymbol{\phi}$?
¨ $\phi_z = P(z \mid \boldsymbol{\phi}) = P(z \mid \boldsymbol{\theta}) = \int P(z, \boldsymbol{x} \mid \boldsymbol{\theta}) \, d\boldsymbol{x} = \int P(z \mid \boldsymbol{x}, \boldsymbol{\theta}) P(\boldsymbol{x} \mid \boldsymbol{\theta}) \, d\boldsymbol{x} = \mathbb{E}_{\boldsymbol{x}} P(z \mid \boldsymbol{x}, \boldsymbol{\theta}) \approx \frac{1}{M} \sum_{\boldsymbol{x} \in \boldsymbol{D}} P(z \mid \boldsymbol{x}, \boldsymbol{\theta}) = \frac{1}{M} \sum_{i=1}^{M} T_{zi}$

¡ Then the lower bound can be maximized w.r.t. each individual $(\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$:
$$LB(\boldsymbol{\theta}) = \sum_{\boldsymbol{x} \in \boldsymbol{D}} \sum_{z} P(z \mid \boldsymbol{x}, \boldsymbol{\theta}) \log\left[P(\boldsymbol{x} \mid z, \boldsymbol{\theta}) P(z \mid \boldsymbol{\theta})\right]$$
$$= \sum_{i=1}^{M} \sum_{k=1}^{K} \frac{T_{ki}}{2} \left[ -(\boldsymbol{x}_i - \boldsymbol{\mu}_k)^{\top} \boldsymbol{\Sigma}_k^{-1} (\boldsymbol{x}_i - \boldsymbol{\mu}_k) - \log \det(2\pi \boldsymbol{\Sigma}_k) \right] + \mathrm{constant}$$
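The two quantities on this slide, the responsibilities $T_{ki}$ obtained from Bayes' rule and the resulting estimate of $\boldsymbol{\phi}$, could be computed as in the following sketch (same hypothetical `X`, `weights`, `means`, `covs` arrays as before):

```python
import numpy as np
from scipy.stats import multivariate_normal

def responsibilities(X, weights, means, covs):
    """T[k, i] = P(z = k | x_i, theta), computed with Bayes' rule."""
    K, M = len(weights), X.shape[0]
    T = np.zeros((K, M))
    for k in range(K):
        # numerator: phi_k * N(x_i | mu_k, Sigma_k)
        T[k] = weights[k] * multivariate_normal.pdf(X, mean=means[k], cov=covs[k])
    T /= T.sum(axis=0, keepdims=True)  # divide by the normalizing constant C
    return T

# Mixing weights: phi_k ~ (1/M) * sum_i T[k, i]
# phi_new = responsibilities(X, weights, means, covs).mean(axis=1)
```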
GMM: EM algorithm
¡ Input: training data $\boldsymbol{D} = \{\boldsymbol{x}_1, \boldsymbol{x}_2, \ldots, \boldsymbol{x}_M\}$, $K > 0$
¡ Output: model parameters $\boldsymbol{\mu}, \boldsymbol{\Sigma}, \boldsymbol{\phi}$
¡ Initialize $\boldsymbol{\mu}^{(0)}, \boldsymbol{\Sigma}^{(0)}, \boldsymbol{\phi}^{(0)}$ randomly
¨ $\boldsymbol{\phi}^{(0)}$ must be non-negative and sum to 1.

¡ At iteration $t$:
¨ E step: compute $T_{ki}^{(t)} = P(z = k \mid \boldsymbol{x}_i, \boldsymbol{\theta}^{(t)}) = \phi_k^{(t)} \mathcal{N}(\boldsymbol{x}_i \mid \boldsymbol{\mu}_k^{(t)}, \boldsymbol{\Sigma}_k^{(t)}) / C$ for any index $k = 1, \ldots, K$ and $i = 1, \ldots, M$
¨ M step: update, for each $k$,
$$\phi_k^{(t+1)} = \frac{a_k}{M}, \quad \text{where } a_k = \sum_{i=1}^{M} T_{ki}^{(t)};$$
$$\boldsymbol{\mu}_k^{(t+1)} = \frac{1}{a_k} \sum_{i=1}^{M} T_{ki}^{(t)} \boldsymbol{x}_i; \qquad \boldsymbol{\Sigma}_k^{(t+1)} = \frac{1}{a_k} \sum_{i=1}^{M} T_{ki}^{(t)} \left(\boldsymbol{x}_i - \boldsymbol{\mu}_k^{(t+1)}\right) \left(\boldsymbol{x}_i - \boldsymbol{\mu}_k^{(t+1)}\right)^{\top}$$

¡ If not converged, go to iteration $t + 1$.
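Putting the E and M steps together, the following is a hedged implementation sketch of this EM loop for a GMM (random initialization, a fixed iteration budget, and a small ridge on the covariances are choices of the sketch, not of the slides):

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iter=100, seed=0):
    """Fit a K-component GMM to data X of shape (M, d) with EM."""
    rng = np.random.default_rng(seed)
    M, d = X.shape
    # Initialization: means picked from the data, identity covariances, uniform weights
    means = X[rng.choice(M, size=K, replace=False)].astype(float)
    covs = np.array([np.eye(d) for _ in range(K)])
    weights = np.full(K, 1.0 / K)

    for _ in range(n_iter):
        # E step: T[k, i] = P(z = k | x_i, theta^(t))
        T = np.zeros((K, M))
        for k in range(K):
            T[k] = weights[k] * multivariate_normal.pdf(X, mean=means[k], cov=covs[k])
        T /= T.sum(axis=0, keepdims=True)

        # M step: phi_k = a_k / M; mu_k and Sigma_k are T-weighted averages
        a = T.sum(axis=1)                     # a_k = sum_i T_ki
        weights = a / M
        means = (T @ X) / a[:, None]
        for k in range(K):
            diff = X - means[k]
            covs[k] = (T[k, :, None] * diff).T @ diff / a[k]
            covs[k] += 1e-6 * np.eye(d)       # small ridge for numerical stability

    return weights, means, covs
```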


GMM: example 1
¡ We wish to model the height of a person
¨ We collected a dataset from 10 people in Hanoi + 10 people in Sydney
D={1.6, 1.7, 1.65, 1.63, 1.75, 1.71, 1.68, 1.72, 1.77, 1.62, 1.75, 1.80, 1.85,
1.65, 1.91, 1.78, 1.88, 1.79, 1.82, 1.81}

[Figure: fitted GMM densities with 2 components and with 3 components]
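As a quick usage sketch for this height example (here using scikit-learn's GaussianMixture, which also fits the model with EM, rather than the slides' own code):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

heights = np.array([1.6, 1.7, 1.65, 1.63, 1.75, 1.71, 1.68, 1.72, 1.77, 1.62,
                    1.75, 1.80, 1.85, 1.65, 1.91, 1.78, 1.88, 1.79, 1.82, 1.81]).reshape(-1, 1)

gmm = GaussianMixture(n_components=2, random_state=0).fit(heights)
print(gmm.weights_)        # mixing weights phi
print(gmm.means_.ravel())  # component means
print(gmm.covariances_)    # component covariances
```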
GMM: example 2
¡ A GMM is fitted to a 2-dimensional dataset for clustering.

[Figure: the fitted GMM clusters, from initialization to convergence]

https://en.wikipedia.org/wiki/Expectation-maximization_algorithm
GMM: comparison with K-means
q K-means:
¨ Step 1: hard assignment
¨ Step 2: the means
→ similar shape for the clusters?

q GMM clustering:
¨ Soft assignment of data to the clusters
¨ Parameters $\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k, \phi_k$
→ different shapes for the clusters

https://en.wikipedia.org/wiki/Expectation-maximization_algorithm
General models
¡ We can use the EM algorithm in more general cases.
¡ Consider a model $B(\boldsymbol{x}, \boldsymbol{z}; \boldsymbol{\theta})$ with an observed variable x, a hidden variable z, and parameters $\boldsymbol{\theta}$
¨ x depends on z and $\boldsymbol{\theta}$, while z may depend on $\boldsymbol{\theta}$
¨ Mixture models: each observed data point has a corresponding latent variable, specifying the mixture component which generated the data point

¡ The learning task is to find a specific model, from the model family parameterized by $\boldsymbol{\theta}$, that maximizes the log-likelihood of the training data D:
$$\boldsymbol{\theta}^* = \arg\max_{\boldsymbol{\theta}} \log P(\boldsymbol{D} \mid \boldsymbol{\theta})$$
¡ We assume that D consists of i.i.d. samples of x, that the log-likelihood function can be expressed analytically, and that $LB(\boldsymbol{\theta}) = \sum_{\boldsymbol{x} \in \boldsymbol{D}} \mathbb{E}_{z \mid \boldsymbol{x}, \boldsymbol{\theta}} \log P(\boldsymbol{x}, z \mid \boldsymbol{\theta})$ can be computed easily
¨ Since there is a latent variable, MLE may not have a closed-form solution
The Expectation Maximization algorithm
¡ The Expectation maximization (EM) algorithm was introduced in 1977 by
Arthur Dempster, Nan Laird, and Donald Rubin.
¡ The EM algorithm maximizes the lower bound of the log-likelihood
$$L(\boldsymbol{\theta}; \boldsymbol{D}) = \log P(\boldsymbol{D} \mid \boldsymbol{\theta}) \ge LB(\boldsymbol{\theta}) = \sum_{\boldsymbol{x} \in \boldsymbol{D}} \mathbb{E}_{z \mid \boldsymbol{x}, \boldsymbol{\theta}} \log P(\boldsymbol{x}, z \mid \boldsymbol{\theta})$$

¡ Initialization: $\boldsymbol{\theta}^{(0)}$, $t = 0$
¡ At iteration $t$:
¨ E step: compute the expectation $Q(\boldsymbol{\theta} \mid \boldsymbol{\theta}^{(t)}) = \sum_{\boldsymbol{x} \in \boldsymbol{D}} \mathbb{E}_{z \mid \boldsymbol{x}, \boldsymbol{\theta}^{(t)}} \log P(\boldsymbol{x}, z \mid \boldsymbol{\theta})$
(compute the expectation function Q while fixing the value $\boldsymbol{\theta}^{(t)}$ known from the previous step)
¨ M step: find $\boldsymbol{\theta}^{(t+1)} = \arg\max_{\boldsymbol{\theta}} Q(\boldsymbol{\theta} \mid \boldsymbol{\theta}^{(t)})$
(find the point $\boldsymbol{\theta}^{(t+1)}$ that maximizes the function Q)

¡ If not converged, go to iteration $t + 1$.
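In code, this general procedure reduces to a short skeleton; the callables `e_step` and `m_step` below are hypothetical placeholders that a specific model (e.g., the GMM above) would supply:

```python
import numpy as np

def em(theta0, e_step, m_step, max_iter=100, tol=1e-6):
    """Generic EM loop (sketch).

    e_step(theta): returns the expected statistics under P(z | x, theta),
                   i.e., everything needed to define Q(. | theta^(t)).
    m_step(stats): returns argmax_theta Q(theta | theta^(t)) as a flat array.
    """
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iter):
        stats = e_step(theta)        # E step
        new_theta = m_step(stats)    # M step
        if np.linalg.norm(new_theta - theta) < tol:  # theta barely changes
            return new_theta
        theta = new_theta
    return theta
```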


EM: convergence condition
¡ Different conditions can be used to check convergence
¨ $LB(\boldsymbol{\theta})$ does not change much between two consecutive iterations
¨ $\boldsymbol{\theta}$ does not change much between two consecutive iterations

¡ In practice, we sometimes need to limit the maximum number of iterations
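These stopping rules might look as follows in code (a small illustrative helper; the names `lb_history`, `theta_old`, `theta_new` are hypothetical):

```python
import numpy as np

def should_stop(lb_history, theta_old, theta_new, t, tol=1e-6, max_iter=200):
    """Stop when LB or theta barely changes, or the iteration budget runs out."""
    lb_stalled = len(lb_history) >= 2 and abs(lb_history[-1] - lb_history[-2]) < tol
    theta_stalled = np.linalg.norm(np.asarray(theta_new) - np.asarray(theta_old)) < tol
    return lb_stalled or theta_stalled or t >= max_iter
```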
EM: some properties
¡ The EM algorithm is guaranteed to return a stationary point of the lower bound $LB(\boldsymbol{\theta})$
(the EM algorithm is guaranteed to converge to a stationary point of the lower-bound function)
¨ It may be a local maximum

¡ Since it maximizes the lower bound, EM does not necessarily return the maximizer of the log-likelihood function
(EM may not return the maximum point of the log-likelihood function)
¨ No guarantee exists
¨ This can be seen in the multimodal case, where the log-likelihood function is non-concave
[Figure: a multimodal distribution]
¡ The Baum-Welch algorithm is a special case of EM for hidden Markov models
EM, mixture model, and clustering
¡ Mixture model: we assume the data population is composed of K
different components (distributions), and each data point is generated
from one of those components
¨ E.g., Gaussian mixture model, categorical mixture
model, Bernoulli mixture model,…
¨ The mixture density function can be written as
$$f(\boldsymbol{x}; \boldsymbol{\theta}, \boldsymbol{\phi}) = \sum_{k=1}^{K} \phi_k \, f_k(\boldsymbol{x} \mid \boldsymbol{\theta}_k)$$
where $f_k(\boldsymbol{x} \mid \boldsymbol{\theta}_k)$ is the density of the k-th component


¡ We can interpret a mixture distribution as partitioning the data space into different regions, each associated with a component
(a mixture distribution partitions the data space into different regions, each of which corresponds to one component of the mixture)
¡ Hence, mixture models provide solutions for clustering
¡ The EM algorithm provides a natural way to learn mixture models
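For instance, once the responsibilities $P(z = k \mid \boldsymbol{x}_i)$ are available from EM (e.g., the matrix `T` in the sketches above), a hard clustering can be read off by taking the most probable component for each point:

```python
import numpy as np

# T has shape (K, M) with T[k, i] = P(z = k | x_i, theta), e.g. from the E step above.
def cluster_assignments(T):
    """Hard clustering: assign each point to its most probable component."""
    return np.argmax(T, axis=0)   # shape (M,), values in {0, ..., K-1}
```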
EM: limitation
¡ When the lower bound $LB(\boldsymbol{\theta})$ does not admit easy computation of the expectation or maximization steps
¨ Admixture models, Bayesian mixture models
¨ Hierarchical probabilistic models
¨ Nonparametric models
¡ EM finds a point estimate, hence easily gets stuck at a local maximum
¡ In practice, EM is sensitive to initialization
¨ Is it good to use the idea of K-means++ for initialization? (see the sketch below)
¡ Sometimes EM converges slowly in practice
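One hedged way to act on the K-means++ question is to seed the GMM parameters from a k-means++ initialized K-means run (an assumption of this sketch, using scikit-learn's KMeans; it presumes every cluster receives at least two points):

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeanspp_init(X, K, seed=0):
    """Initialize GMM parameters from a k-means++ seeded K-means run."""
    km = KMeans(n_clusters=K, init="k-means++", n_init=10, random_state=seed).fit(X)
    means = km.cluster_centers_
    weights = np.bincount(km.labels_, minlength=K) / len(X)
    covs = np.array([np.cov(X[km.labels_ == k].T) + 1e-6 * np.eye(X.shape[1])
                     for k in range(K)])
    return weights, means, covs
```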
Further?
¡ Variational inference
¨ Inference for more general models
¡ Deep generative models
¨ Neural networks + probability theory
¡ Bayesian neural networks
¨ Neural networks + Bayesian inference

¡ Amortized inference
¨ Neural networks for doing Bayesian inference
¨ Learning to do inference
Reference
¡ Blei, David M., Alp Kucukelbir, and Jon D. McAuliffe. "Variational inference: A review for
statisticians." Journal of the American Statistical Association 112, no. 518 (2017): 859-877.

¡ Blundell, Charles, Julien Cornebise, Koray Kavukcuoglu, and Daan Wierstra. "Weight
Uncertainty in Neural Network." In International Conference on Machine Learning (ICML), pp.
1613-1622. 2015.

¡ Dempster, A.P.; Laird, N.M.; Rubin, D.B. (1977). "Maximum Likelihood from Incomplete Data via
the EM Algorithm". Journal of the Royal Statistical Society, Series B. 39 (1): 1-38.

¡ Gal, Yarin, and Zoubin Ghahramani. "Dropout as a Bayesian approximation: Representing model uncertainty in deep learning." In ICML, pp. 1050-1059. 2016.

¡ Ghahramani, Zoubin. "Probabilistic machine learning and artificial intelligence." Nature 521,
no. 7553 (2015): 452-459.

¡ Kingma, Diederik P., and Max Welling. "Auto-encoding variational Bayes." In International Conference on Learning Representations (ICLR), 2014.

¡ Jordan, Michael I., and Tom M. Mitchell. "Machine learning: Trends, perspectives, and
prospects." Science 349, no. 6245 (2015): 255-260.

¡ Tosh, Christopher, and Sanjoy Dasgupta. “The Relative Complexity of Maximum Likelihood
Estimation, MAP Estimation, and Sampling.” In COLT, PMLR 99:2993-3035, 2019.

¡ Sontag, David, and Daniel Roy. "Complexity of inference in Latent Dirichlet Allocation." In Advances in Neural Information Processing Systems, 2011.
