
Artificial Intelligence: A Modern Approach
Fourth Edition

Chapter 20

Learning Probabilistic Models

Copyright © 2021 Pearson Education, Inc. All Rights Reserved
Outline
♦ Statistical Learning

♦ Learning with Complete Data

♦ Learning with Hidden Variables: The EM Algorithm

Introduction
♦ Agents can handle uncertainty by using the methods of probability and
decision theory.
♦ However, they need to learn their probabilistic theories of the world from
experience.
♦ This chapter explains how they can do that.
♦ The key concepts are data and hypotheses.
♦ Data are evidence: instantiations of some or all of the random variables
describing the domain.
♦ The hypotheses in this chapter are probabilistic theories of how the domain
works, including logical theories as a special case.

Statistical Learning Example
Suppose there are five kinds of bags of candies:
h1: 100% cherry candies
h2: 75% cherry candies + 25% lime candies
h3: 50% cherry candies + 50% lime candies
h4: 25% cherry candies + 75% lime candies
h5: 100% lime candies

Then we observe candies drawn from some bag:


What kind of bag is it? What flavour will the next candy be?
• Given a new bag of candy, the random variable H (for hypothesis) denotes
the type of the bag, with possible values h1 through h5.
• H is not directly observable, of course. As the pieces of candy are opened
and inspected, data are revealed—D1, D2,..., DN, where each Di is a random
variable with possible values cherry and lime.
• Agent’s Goal: Predict the flavor of the next candy.
Full Bayesian learning
• Bayesian learning simply calculates the probability of each hypothesis, given the
data, and makes predictions on that basis.
• Predictions are made by using all the hypotheses, weighted by their probabilities,
rather than by using just a single “best” hypothesis. In this way, learning is
reduced to probabilistic inference.
• Let D represent all the data, with observed value d. The key quantities in the
Bayesian approach are the hypothesis prior, P(hi), and the likelihood of the data
under each hypothesis, P(d|hi).
Full Bayesian learning
View learning as Bayesian updating of a probability distribution over the
hypothesis space
H is the hypothesis variable, values h1, h2, . . ., prior P(H)

jth observation dj gives the outcome of random variable Dj


training data d = d1 , . . . , dN
Given the data so far, each hypothesis has a posterior probability:
P(hi|d) = α P(d|hi) P(hi)   (Bayes’ rule)

where P(d|hi) is called the likelihood


Predictions use a likelihood-weighted average over the hypotheses:
P(X|d) = Σi P(X|hi) P(hi|d)
where each hypothesis determines a probability distribution over X.
No need to pick one best-guess hypothesis!
Example

Suppose there are five kinds of bags of candies


10% are h1: 100% cherry candies
20% are h2: 75% cherry candies + 25% lime candies
40% are h3: 50% cherry candies + 50% lime candies
20% are h4: 25% cherry candies + 75% lime candies
10% are h5: 100% lime candies

Then we observe candies drawn from some bag:


What kind of bag is it? What flavour will the next candy be?
Assumption: observations are independent and identically distributed (i.i.d.), so the
likelihood of the data is:

P(d|hi) = ∏j P(dj|hi)

For example, if the bag is really an all-lime bag (h5) and the first 10 candies are all
lime, then P(d|h3) = 0.5^10, because half the candies in an h3 bag are lime.
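As a concrete illustration, the sketch below (Python, not part of the original slides; the names and the all-lime observation sequence are illustrative) computes the posteriors P(hi|d) and the prediction P(next candy is lime | d) for the priors given above.

priors = [0.1, 0.2, 0.4, 0.2, 0.1]        # P(h1) .. P(h5), from the slide
p_cherry = [1.0, 0.75, 0.5, 0.25, 0.0]    # P(cherry | hi), from the slide

def posteriors(observations):
    # P(hi | d) for a sequence of "cherry"/"lime" observations (i.i.d. assumption)
    unnorm = []
    for prior, pc in zip(priors, p_cherry):
        likelihood = 1.0
        for obs in observations:
            likelihood *= pc if obs == "cherry" else (1.0 - pc)
        unnorm.append(prior * likelihood)
    alpha = 1.0 / sum(unnorm)             # normalization constant
    return [alpha * u for u in unnorm]

def predict_lime(observations):
    # P(next candy is lime | d) = sum_i P(lime | hi) P(hi | d)
    return sum((1.0 - pc) * p for pc, p in zip(p_cherry, posteriors(observations)))

d = ["lime"] * 10                          # hypothetical data: ten lime candies in a row
print(posteriors(d))                       # posterior mass concentrates on h5
print(predict_lime(d))                     # approaches 1 as more limes are observed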
Posterior probability of hypotheses

[Figure: posterior probabilities P(h1|d), ..., P(h5|d) as a function of the number of
samples in d (0 to 10).]
Prediction probability

[Figure: P(next candy is lime | d) as a function of the number of samples in d (0 to 10).]
MAP approximation
• Bayesian predictions are optimal regardless of the dataset size.
• However, the hypothesis space is usually very large or infinite.
• Summing over the hypothesis space is often intractable, so we need to resort
to approximate or simplified methods (e.g., there are 2^64 =
18,446,744,073,709,551,616 Boolean functions of 6 attributes).
• Common approach: Make prediction based on a single most probable
hypothesis.
• Maximum a posteriori (MAP) learning: choose hMAP maximizing P (hi|d)

I.e., maximize P(d|hi) P(hi), or equivalently minimize −log2 P(d|hi) − log2 P(hi)
• For deterministic hypotheses, P (d|hi) is 1 if consistent, 0 otherwise (e.g. h1
in our example)
⇒ MAP = simplest consistent hypothesis
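A minimal sketch of the MAP approximation for the same example (it reuses the illustrative priors and p_cherry lists from the earlier sketch): pick the single hypothesis that maximizes P(d|hi)P(hi) and predict with it alone.

def map_predict_lime(observations):
    # Choose h_MAP maximizing P(d|hi)P(hi), then predict with that single hypothesis
    best_i, best_score = 0, -1.0
    for i, (prior, pc) in enumerate(zip(priors, p_cherry)):
        likelihood = 1.0
        for obs in observations:
            likelihood *= pc if obs == "cherry" else (1.0 - pc)
        score = prior * likelihood         # proportional to P(hi | d)
        if score > best_score:
            best_i, best_score = i, score
    return 1.0 - p_cherry[best_i]          # P(lime | h_MAP)

print(map_predict_lime(["lime"] * 3))      # h5 already wins after three limes -> 1.0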

ML approximation
For large data sets, prior becomes irrelevant
Maximum likelihood (ML) learning: choose hML maximizing P (d|hi )
I.e., simply get the best fit to the data; identical to MAP for uniform
prior (which is reasonable if all hypotheses are of the same complexity)
ML is the “standard” (non-Bayesian) statistical learning method

Learning with Complete Data
• The task of learning a probability model, given data that are assumed to
be generated from that model, is called density estimation.
• Density estimation is a form of unsupervised learning.
• We will assume the data are complete; that is, each data point contains
values for every variable in the probability model being learned.

Bayesian Network Model

(a) Bayesian network model for the case of candies with an unknown
proportion of cherries and limes.
(b) Model for the case where the wrapper color depends (probabilistically) on
the candy flavor.

ML parameter learning in Bayes nets
Bag from a new manufacturer; what fraction θ of cherry candies?
[Figure: Bayes net with a single node Flavor and parameter P(F = cherry) = θ]
Any θ is possible: continuum of hypotheses hθ
θ is a parameter for this simple (binomial) family of models
Suppose we unwrap N candies, c cherries and ℓ = N − c limes
These are i.i.d. (independent, identically distributed) observations, so

P(d|hθ) = ∏j=1..N P(dj|hθ) = θ^c (1 − θ)^ℓ

Maximize this w.r.t. θ, which is easier for the log-likelihood:

L(d|hθ) = log P(d|hθ) = ∑j=1..N log P(dj|hθ) = c log θ + ℓ log(1 − θ)

dL(d|hθ)/dθ = c/θ − ℓ/(1 − θ) = 0   ⇒   θ = c/(c + ℓ) = c/N

Seems sensible, but causes problems with 0 counts!
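A minimal sketch of this maximum-likelihood estimate (names are illustrative); it also shows the zero-count problem: before any cherry has been seen, the estimate θ = 0 asserts that cherries are impossible.

def ml_theta(observations):
    # Maximum-likelihood estimate theta = c/N for P(F = cherry)
    c = sum(1 for o in observations if o == "cherry")
    return c / len(observations)

print(ml_theta(["cherry", "lime", "cherry", "cherry"]))  # 0.75
print(ml_theta(["lime", "lime"]))                        # 0.0 -- the zero-count problem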

Multiple parameters
Red/green wrapper depends probabilistically on flavor:
[Figure: Bayes net Flavor → Wrapper, with parameters P(F = cherry) = θ and P(W = red | F): θ1 for cherry, θ2 for lime]
Likelihood for, e.g., a cherry candy in a green wrapper:

P(F = cherry, W = green | hθ,θ1,θ2)
  = P(F = cherry | hθ,θ1,θ2) P(W = green | F = cherry, hθ,θ1,θ2)
  = θ (1 − θ1)
N candies, rc red-wrapped cherry candies, gc green-wrapped cherry candies, etc.:

P(d|hθ,θ1,θ2) = θ^c (1 − θ)^ℓ · θ1^rc (1 − θ1)^gc · θ2^rℓ (1 − θ2)^gℓ

Take the logarithm:

L = [c log θ + ℓ log(1 − θ)] + [rc log θ1 + gc log(1 − θ1)] + [rℓ log θ2 + gℓ log(1 − θ2)]

Multiple parameters contd.
Derivatives of L contain only the relevant parameter:

∂L/∂θ  = c/θ − ℓ/(1 − θ) = 0      ⇒  θ  = c/(c + ℓ)
∂L/∂θ1 = rc/θ1 − gc/(1 − θ1) = 0  ⇒  θ1 = rc/(rc + gc)
∂L/∂θ2 = rℓ/θ2 − gℓ/(1 − θ2) = 0  ⇒  θ2 = rℓ/(rℓ + gℓ)

With complete data, parameters can be learned separately
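A minimal counting sketch of these closed-form estimates (data layout and names are illustrative; it assumes at least one candy of each flavor has been observed). Each parameter is estimated from its own counts, independently of the others.

def ml_params(candies):
    # candies: list of (flavor, wrapper) pairs, e.g. ("cherry", "green")
    c  = sum(1 for f, w in candies if f == "cherry")
    l  = len(candies) - c
    rc = sum(1 for f, w in candies if f == "cherry" and w == "red")
    rl = sum(1 for f, w in candies if f == "lime" and w == "red")
    theta  = c / (c + l)          # P(F = cherry)           = c / N
    theta1 = rc / c               # P(W = red | F = cherry) = rc / (rc + gc)
    theta2 = rl / l               # P(W = red | F = lime)   = rl / (rl + gl)
    return theta, theta1, theta2

data = [("cherry", "red"), ("cherry", "green"), ("lime", "red"), ("lime", "green")]
print(ml_params(data))            # (0.5, 0.5, 0.5)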

Naive Bayes Models

Assuming Boolean variables, the parameters are θ = P(C = true), θi1 = P(Xi = true | C = true),
and θi2 = P(Xi = true | C = false).

With observed attribute values x1, ..., xn, the probability of each class is given by

P(C | x1, ..., xn) = α P(C) ∏i P(xi | C)

A deterministic prediction can be obtained by choosing the most likely class

The method learns fairly well but not as well as decision-tree learning; this is
presumably because the true hypothesis (which is a decision tree) is not
representable exactly using a naive Bayes model.

Naive Bayes learning turns out to do surprisingly well in a wide range of
applications; the boosted version is one of the most effective general-purpose
learning algorithms.

Scales well to large problems: with n Boolean attributes there are just 2n + 1
parameters.
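A minimal sketch of naive Bayes learning and prediction with Boolean attributes (the data layout, the function names, and the add-one smoothing are illustrative choices, not from the slides). Training is just counting; prediction applies the product formula above and normalizes.

def train_naive_bayes(examples):
    # examples: list of (x, c), x a tuple of n booleans, c a boolean class label
    n = len(examples[0][0])
    pos = [x for x, c in examples if c]
    neg = [x for x, c in examples if not c]
    theta = len(pos) / len(examples)                     # P(C = true)
    # add-one smoothing to avoid zero counts (an illustrative choice)
    theta_i1 = [(sum(x[i] for x in pos) + 1) / (len(pos) + 2) for i in range(n)]
    theta_i2 = [(sum(x[i] for x in neg) + 1) / (len(neg) + 2) for i in range(n)]
    return theta, theta_i1, theta_i2

def predict(params, x):
    # P(C = true | x), obtained by normalizing P(C) * prod_i P(xi | C)
    theta, t1, t2 = params
    p_true, p_false = theta, 1 - theta
    for i, xi in enumerate(x):
        p_true  *= t1[i] if xi else (1 - t1[i])
        p_false *= t2[i] if xi else (1 - t2[i])
    return p_true / (p_true + p_false)

examples = [((True, False), True), ((True, True), True), ((False, False), False)]
print(predict(train_naive_bayes(examples), (True, False)))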

Naive Bayes Models

The learning curve for naive Bayes learning applied to the restaurant problem
from Chapter 18, compared with decision-tree learning.

Example: linear Gaussian model
[Figure: (a) a linear Gaussian model P(y|x), a Gaussian over y whose mean depends
linearly on x; (b) data points generated from the model, with the best-fit line.]

Assume data are generated as follows:

P(y|x) = (1/(σ√(2π))) exp(−(y − (θ1 x + θ2))² / (2σ²))

Maximize the conditional likelihood

P(d | hθ1,θ2) = ∏j=1..N P(yj | xj)

with respect to θ1, θ2. Maximizing the log-likelihood is equivalent to

minimizing E = ∑j=1..N (yj − (θ1 xj + θ2))²

That is, minimizing the sum of squared errors gives the ML solution for a linear fit
assuming Gaussian noise of fixed variance.
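A minimal numpy sketch (with illustrative synthetic data) showing that the ML fit under this model is just a least-squares line fit.

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 100)
y = 0.7 * x + 0.2 + rng.normal(0, 0.05, 100)      # linear model plus Gaussian noise

# Least squares = ML estimate of (theta1, theta2) under fixed-variance Gaussian noise
A = np.column_stack([x, np.ones_like(x)])
(theta1, theta2), *_ = np.linalg.lstsq(A, y, rcond=None)
print(theta1, theta2)                              # close to 0.7 and 0.2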
Learning with Complete Data

Bayesian parameter learning

Hypothesis prior: the Bayesian approach to parameter learning starts by defining
a prior probability distribution over the possible hypotheses.

In the Bayesian view, θ is the (unknown) value of a random variable Θ that
defines the hypothesis space.

The hypothesis prior is just the prior distribution P(Θ).

Thus, P(Θ = θ) is the prior probability that the bag has a fraction θ of cherry
candies.

If we don’t know anything about the possible values of θ, then we can use
P(Θ) = Uniform[0,1](θ); the uniform density is a member of the beta family of
distributions.

Each beta distribution is defined by two hyperparameters a and b such that

beta[a,b](θ) = α θ^(a−1) (1 − θ)^(b−1)

(α is a normalization constant; the mean is a/(a + b), and larger a + b gives a more
sharply peaked distribution).
Learning with Complete Data

Bayesian parameter learning

Suppose the prior is beta[a,b](θ) and we observe a cherry; then we have

P(θ | D1 = cherry) = α P(D1 = cherry | θ) P(θ) = α′ θ · θ^(a−1) (1 − θ)^(b−1) = beta[a+1,b](θ)

so the posterior is also a beta distribution: the beta family is conjugate to this likelihood.
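A minimal sketch of this conjugate update: observing a cherry increments a, observing a lime increments b (the function name and observation sequence are illustrative).

def update_beta(a, b, observation):
    # Posterior hyperparameters after one observation, starting from beta[a, b]
    return (a + 1, b) if observation == "cherry" else (a, b + 1)

a, b = 1, 1                          # uniform prior = beta[1, 1]
for obs in ["cherry", "cherry", "lime"]:
    a, b = update_beta(a, b, obs)
print(a, b)                          # posterior is beta[3, 2]
print(a / (a + b))                   # posterior mean estimate of theta = 0.6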

Learning with Complete Data

Bayesian parameter learning

Assuming Parameter independence

A Bayesian network that corresponds to a Bayesian learning process.


Posterior distributions for the parameter variables Θ, Θ1, and Θ2 can be
inferred from their prior distributions and the evidence in the Flavori and
Wrapperi variables.
Learning with Complete Data

Density estimation with nonparametric models

• It is possible to learn a probability model without making any assumptions
about its structure and parameterization, by adopting nonparametric methods.
Two possibilities:
• k-nearest neighbors (used here for density estimation rather than regression or
classification): to estimate the unknown probability density at a query point x, we can
simply measure the density of the data points in the neighborhood of x.

• Use kernel functions

(a) A 3D plot of the mixture of Gaussians.

(b) A 128-point sample of points from the mixture, together with two query points
(small squares) and their 10-nearest-neighborhoods (medium and large circles).
Learning with Complete Data

Density estimation using k-nearest-neighbors, for k= 3, 10, and 40 respectively.


k = 3 is too spiky, 40 is too smooth, and 10 is just about right.
The best value for k can be chosen by cross-validation
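A minimal sketch of k-nearest-neighbors density estimation in two dimensions, under the common estimator p(x) ≈ k / (N · volume of the smallest disc around x containing its k nearest neighbors); the data and names are illustrative.

import math

def knn_density(x, data, k=10):
    # Estimate p(x) as k / (N * area of the disc reaching the k-th nearest neighbor)
    dists = sorted(math.dist(x, p) for p in data)
    radius = dists[k - 1]                          # distance to the k-th nearest neighbor
    return k / (len(data) * math.pi * radius ** 2)

data = [(0.1 * i, 0.05 * i) for i in range(50)]    # illustrative 2D sample
print(knn_density((0.5, 0.25), data, k=10))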

Learning with Complete Data

Kernel density estimation using Gaussian kernels with width w = 0.02, 0.07, and 0.20,
respectively. w = 0.07 is about right.
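A minimal one-dimensional sketch of kernel density estimation, p(x) = (1/N) Σj K(x, xj) with a Gaussian kernel of width w (the data are illustrative).

import math

def kernel_density(x, data, w=0.07):
    # p(x) = (1/N) * sum_j Gaussian(x; mean = x_j, std = w)
    norm = 1.0 / (w * math.sqrt(2 * math.pi))
    return sum(norm * math.exp(-((x - xj) ** 2) / (2 * w ** 2)) for xj in data) / len(data)

data = [0.2, 0.25, 0.3, 0.7, 0.75]                # illustrative 1D sample
print(kernel_density(0.25, data, w=0.07))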

Learning With Hidden Variables: The EM Algorithm
Unsupervised clustering: Learning mixtures of Gaussians
• Unsupervised clustering is the problem of discerning multiple categories in a collection of
objects. The problem is unsupervised because the category labels are not given.
• Unsupervised clustering begins with data.
• Next, we need to understand what kind of probability distribution might have generated the data.
• Clustering presumes that the data are generated from a mixture distribution, P. Such a distribution
has k components, each of which is a distribution in its own right.
• Let the random variable C denote the component, with values 1,...,k; then the mixture distribution is
given by

P(x) = ∑i=1..k P(C = i) P(x | C = i)

where x refers to the values of the attributes for a data point.

• For continuous data, a natural choice is the mixture of Gaussians (multivariate Gaussian). The
parameters of a mixture of Gaussians are wi = P(C = i) (the weight of each
component), µi (the mean of each component), and Σi (the covariance of each component).

Learning With Hidden Variables: The EM Algorithm

Many real-world problems have hidden variables (sometimes called latent variables)
which are not observable in the data.

Expectation-maximization helps with hidden variables


• Pretend that we know the parameters for the model, then infer the probability that each
data point belongs to each component.
• Refit the components to the data, where each component is fitted to the entire data set
with each point weighted by the probability that it belongs to that component.
• The process iterates until convergence

Essentially, we are “completing” the data by inferring probability distributions over the
hidden variables—which component each data point belongs to—based on the
current model.

Learning With Hidden Variables: The EM Algorithm
Unsupervised clustering: Learning mixtures of Gaussians
• For the mixture of Gaussians, initialize the mixture-model parameters arbitrarily and iterate the E-
step (Expectation Step) & M-step (Maximization Step)
• E-Step: Compute the probabilities pij = P(C=i | xj), the probability that datum xj was generated by
component i. By Bayes’ rule, we have pij = α P(xj | C=i) P(C=i). The term P(xj | C=i) is just the
probability at xj of the ith Gaussian, and the term P(C=i) is just the weight parameter for the ith
Gaussian. Define ni = ∑j pij, the effective number of data points currently assigned to
component i.
• M-Step: Compute the new means, covariances, and component weights using the following steps
in sequence (this finds the new values of the parameters that maximize the log likelihood of the
data, given the expected values of the hidden indicator variables):

µi ← ∑j pij xj / ni
Σi ← ∑j pij (xj − µi)(xj − µi)ᵀ / ni
wi ← ni / N

where N is the total number of data points.
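A minimal sketch of these two steps for a one-dimensional mixture of Gaussians (fixed number of iterations, illustrative data and initialization; a fuller implementation would monitor convergence of the log likelihood).

import math, random

def em_gmm_1d(xs, k=2, iters=50):
    # Arbitrary initialization of weights, means, and variances
    w, mu, var = [1.0 / k] * k, random.sample(xs, k), [1.0] * k
    for _ in range(iters):
        # E-step: p[i][j] = P(C = i | x_j), by Bayes' rule with the current parameters
        p = [[w[i] * math.exp(-(x - mu[i]) ** 2 / (2 * var[i])) / math.sqrt(2 * math.pi * var[i])
              for x in xs] for i in range(k)]
        for j in range(len(xs)):
            s = sum(p[i][j] for i in range(k))
            for i in range(k):
                p[i][j] /= s
        # M-step: re-estimate weights, means, and variances from the weighted data
        for i in range(k):
            ni = sum(p[i])
            mu[i] = sum(pij * x for pij, x in zip(p[i], xs)) / ni
            var[i] = sum(pij * (x - mu[i]) ** 2 for pij, x in zip(p[i], xs)) / ni + 1e-6
            w[i] = ni / len(xs)
    return w, mu, var

xs = [random.gauss(0, 1) for _ in range(200)] + [random.gauss(5, 1) for _ in range(200)]
print(em_gmm_1d(xs))                               # weights near 0.5, means near 0 and 5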


• Observations: the log likelihood for the final learned model slightly exceeds
that of the original model, and EM increases the log likelihood of the data at
every iteration.
Learning With Hidden Variables: The EM Algorithm

Learning Bayesian networks with hidden variables

To learn a Bayesian network with hidden variables, we apply the same insights
that worked for mixtures of Gaussians.

Example: a situation in which there are two bags of candies that have been
mixed together.
• Candies are described by three features: in addition to the Flavor and the
Wrapper, some candies have a Hole in the middle and some do not.
• The distribution of candies in each bag is described by a naive Bayes
model: the features are independent, given the bag, but the conditional
probability distribution for each feature depends on the bag.

• 1000 samples were generated from a model with the true parameters:

Learning With Hidden Variables: The EM Algorithm

Learning Bayesian networks with hidden variables


Start with arbitrary values for model parameters.

The expected count of (Bag = 1) is the sum, over all candies, of the probability that the
candy came from bag 1:

N̂(Bag = 1) = ∑j P(Bag = 1 | flavorj, wrapperj, holesj)

which, using Bayes’ rule and applying conditional independence, becomes

N̂(Bag = 1) = ∑j [P(flavorj | Bag = 1) P(wrapperj | Bag = 1) P(holesj | Bag = 1) P(Bag = 1)]
             / [∑i P(flavorj | Bag = i) P(wrapperj | Bag = i) P(holesj | Bag = i) P(Bag = i)]

The expected count of cherry candies from bag 1 is given by the analogous sum restricted
to the candies whose flavor is cherry:

N̂(Flavor = cherry, Bag = 1) = ∑j: flavorj = cherry P(Bag = 1 | Flavorj = cherry, wrapperj, holesj)
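A minimal sketch of this expected-count computation for the candy model (the parameter layout and the particular numbers are illustrative; the model structure follows the naive Bayes assumption on the slide).

def p_bag1(candy, params):
    # P(Bag = 1 | flavor, wrapper, holes) by Bayes' rule under the naive Bayes model
    theta, tF, tW, tH = params          # tF = (P(cherry|bag1), P(cherry|bag2)), etc.
    flavor, wrapper, holes = candy
    def lik(bag):
        p  = tF[bag] if flavor == "cherry" else 1 - tF[bag]
        p *= tW[bag] if wrapper == "red" else 1 - tW[bag]
        p *= tH[bag] if holes else 1 - tH[bag]
        return p
    num = lik(0) * theta
    return num / (num + lik(1) * (1 - theta))

params = (0.6, (0.6, 0.4), (0.6, 0.4), (0.6, 0.4))           # illustrative current parameters
candies = [("cherry", "red", True), ("lime", "green", False)]
expected_bag1 = sum(p_bag1(c, params) for c in candies)                        # N(Bag = 1)
expected_cherry_bag1 = sum(p_bag1(c, params) for c in candies if c[0] == "cherry")
print(expected_bag1, expected_cherry_bag1)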

Learning With Hidden Variables: The EM Algorithm

Learning Bayesian networks with hidden variables

(a) A mixture model for candy. The proportions of different flavors, wrappers, and
presence of holes depend on the bag, which is not observed.
(b) Bayesian network for a Gaussian mixture. The mean and covariance of the
observable variables X depend on the component C.

Learning With Hidden Variables: The EM Algorithm

Graphs showing the log likelihood of the data, L, as a function of the EM iteration:
(a) for the Gaussian mixture model; (b) for the Bayesian network.

Learning With Hidden Variables: The EM Algorithm

Learning hidden Markov models

One application of EM involves learning the transition probabilities in hidden
Markov models (HMMs).

A hidden Markov model can be represented by a dynamic Bayes net with a
single discrete state variable.

Each data point consists of an observation sequence of finite length

To estimate the transition probability θij from state i to state j,

• calculate the expected proportion of times that the system undergoes a
transition to state j when in state i:

θ̂ij = ∑t N̂(Xt+1 = j, Xt = i) / ∑t N̂(Xt = i)

Summary
Bayesian learning methods formulate learning as a form of probabilistic
inference, using the observations to update a prior distribution over hypotheses.

Maximum a posteriori (MAP) learning selects a single most likely hypothesis
given the data.

Maximum-likelihood learning simply selects the hypothesis that maximizes the
likelihood of the data; it is equivalent to MAP learning with a uniform prior.

Naive Bayes learning is a particularly effective technique that scales well to large problems.

When some variables are hidden, local maximum likelihood solutions can be
found using the EM algorithm

Nonparametric models represent a distribution using the collection of data
points.
