
Artificial Intelligence: A Modern Approach
Fourth Edition

Chapter 20

Learning Probabilistic Models

Copyright © 2021 Pearson Education, Inc. All Rights Reserved
Outline
♦ Statistical Learning

♦ Learning with Complete Data

♦ Learning with Hidden Variables: The EM Algorithm

Introduction
♦ Agents can handle uncertainty by using the methods of probability and
decision theory.
♦ However, they need to learn their probabilistic theories of the world from
experience.
♦ This chapter explains how they can do that.
♦ The key concepts are data and hypotheses.
♦ Data are evidence: instantiations of some or all of the random variables
describing the domain.
♦ The hypotheses in this chapter are probabilistic theories of how the domain
works, including logical theories as a special case.

Statistical Learning Example
Suppose there are five kinds of bags of candies:
h1: 100% cherry candies
h2: 75% cherry candies + 25% lime candies
h3: 50% cherry candies + 50% lime candies
h4: 25% cherry candies + 75% lime candies
h5: 100% lime candies

Then we observe candies drawn from some bag:


What kind of bag is it? What flavour will the next candy be?
• Given a new bag of candy, the random variable H (for hypothesis) denotes
the type of the bag, with possible values h1 through h5.
• H is not directly observable, of course. As the pieces of candy are opened
and inspected, data are revealed—D1, D2,..., DN, where each Di is a random
variable with possible values cherry and lime.
• Agent’s Goal: Predict the flavor of the next candy.
Full Bayesian learning
• Bayesian learning simply calculates the probability of each hypothesis, given the
data, and makes predictions on that basis.
• Predictions are made by using all the hypotheses, weighted by their probabilities,
rather than by using just a single “best” hypothesis. In this way, learning is
reduced to probabilistic inference.
• Let D represent all the data, with observed value d. The key quantities in the
Bayesian approach are the hypothesis prior, P(hi), and the likelihood of the data
under each hypothesis, P(d|hi).
Full Bayesian learning
View learning as Bayesian updating of a probability distribution over the
hypothesis space
H is the hypothesis variable, values h1, h2, . . ., prior P(H)

jth observation dj gives the outcome of random variable Dj


training data d = d1 , . . . , dN
Given the data so far, each hypothesis has a posterior probability:
P(hi|d) = α P(d|hi) P(hi)   (Bayes’ rule)

where P(d|hi) is called the likelihood


Predictions use a likelihood-weighted average over the hypotheses:
P(X|d) = Σi P(X|hi) P(hi|d)
where each hypothesis determines a probability distribution over X.
No need to pick one best-guess hypothesis!
Example

Suppose there are five kinds of bags of candies


10% are h1: 100% cherry candies
20% are h2: 75% cherry candies + 25% lime candies
40% are h3: 50% cherry candies + 50% lime candies
20% are h4: 25% cherry candies + 75% lime candies
10% are h5: 100% lime candies

Then we observe candies drawn from some bag:


What kind of bag is it? What flavour will the next candy be?
Assumption: observations are independent and identically distributed (i.i.d.), so the
likelihood of the data is:

P(d|hi) = ∏j P(dj|hi)

For example, if the bag is really an all-lime bag (h5) and the first 10 candies are all
lime, then P(d|h3) = 0.5^10, because half the candies in an h3 bag are lime.
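As a concrete illustration, the sketch below (Python, not part of the original slides; the names and the all-lime observation sequence are illustrative) computes the posteriors P(hi|d) and the prediction P(next candy is lime | d) for the priors given above.

priors = [0.1, 0.2, 0.4, 0.2, 0.1]        # P(h1) .. P(h5), from the slide
p_cherry = [1.0, 0.75, 0.5, 0.25, 0.0]    # P(cherry | hi), from the slide

def posteriors(observations):
    # P(hi | d) for a sequence of "cherry"/"lime" observations (i.i.d. assumption)
    unnorm = []
    for prior, pc in zip(priors, p_cherry):
        likelihood = 1.0
        for obs in observations:
            likelihood *= pc if obs == "cherry" else (1.0 - pc)
        unnorm.append(prior * likelihood)
    alpha = 1.0 / sum(unnorm)             # normalization constant
    return [alpha * u for u in unnorm]

def predict_lime(observations):
    # P(next candy is lime | d) = sum_i P(lime | hi) P(hi | d)
    return sum((1.0 - pc) * p for pc, p in zip(p_cherry, posteriors(observations)))

d = ["lime"] * 10                          # hypothetical data: ten lime candies in a row
print(posteriors(d))                       # posterior mass concentrates on h5
print(predict_lime(d))                     # approaches 1 as more limes are observed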
Posterior probability of hypotheses

[Figure: posterior probabilities P(h1|d), ..., P(h5|d) as a function of the number of
samples in d (0 to 10).]
Prediction probability

[Figure: P(next candy is lime | d) as a function of the number of samples in d (0 to 10).]
MAP approximation
• Bayesian predictions are optimal regardless of the dataset size.
• However, the hypothesis space is usually very large or infinite.
• Summing over the hypothesis space is often intractable, so we need to resort
to approximate or simplified methods (e.g., there are 2^64 =
18,446,744,073,709,551,616 Boolean functions of 6 attributes).
• Common approach: Make prediction based on a single most probable
hypothesis.
• Maximum a posteriori (MAP) learning: choose hMAP maximizing P (hi|d)

I.e., maximize P(d|hi) P(hi), or equivalently minimize −log2 P(d|hi) − log2 P(hi)
• For deterministic hypotheses, P (d|hi) is 1 if consistent, 0 otherwise (e.g. h1
in our example)
⇒ MAP = simplest consistent hypothesis
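A minimal sketch of the MAP approximation for the same example (it reuses the illustrative priors and p_cherry lists from the earlier sketch): pick the single hypothesis that maximizes P(d|hi)P(hi) and predict with it alone.

def map_predict_lime(observations):
    # Choose h_MAP maximizing P(d|hi)P(hi), then predict with that single hypothesis
    best_i, best_score = 0, -1.0
    for i, (prior, pc) in enumerate(zip(priors, p_cherry)):
        likelihood = 1.0
        for obs in observations:
            likelihood *= pc if obs == "cherry" else (1.0 - pc)
        score = prior * likelihood         # proportional to P(hi | d)
        if score > best_score:
            best_i, best_score = i, score
    return 1.0 - p_cherry[best_i]          # P(lime | h_MAP)

print(map_predict_lime(["lime"] * 3))      # h5 already wins after three limes -> 1.0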

ML approximation
For large data sets, prior becomes irrelevant
Maximum likelihood (ML) learning: choose hML maximizing P (d|hi )
I.e., simply get the best fit to the data; identical to MAP for uniform
prior (which is reasonable if all hypotheses are of the same complexity)
ML is the “standard” (non-Bayesian) statistical learning method

Learning with Complete Data
• The task of learning a probability model, given data that are assumed to
be generated from that model, is called density estimation.
• Density estimation is a form of unsupervised learning.
• We will assume the data are complete; that is, each data point contains
values for every variable in the probability model being learned.

Bayesian Network Model

(a) Bayesian network model for the case of candies with an unknown
proportion of cherries and limes.
(b) Model for the case where the wrapper color depends (probabilistically) on
the candy flavor.

ML parameter learning in Bayes nets
Bag from a new manufacturer; what fraction θ of cherry candies?
[Figure: Bayes net with a single node Flavor and parameter P(F = cherry) = θ]
Any θ is possible: continuum of hypotheses hθ
θ is a parameter for this simple (binomial) family of models
Suppose we unwrap N candies, c cherries and ℓ = N − c limes
These are i.i.d. (independent, identically distributed) observations, so

P(d|hθ) = ∏j=1..N P(dj|hθ) = θ^c (1 − θ)^ℓ

Maximize this w.r.t. θ, which is easier for the log-likelihood:

L(d|hθ) = log P(d|hθ) = ∑j=1..N log P(dj|hθ) = c log θ + ℓ log(1 − θ)

dL(d|hθ)/dθ = c/θ − ℓ/(1 − θ) = 0   ⇒   θ = c/(c + ℓ) = c/N

Seems sensible, but causes problems with 0 counts!
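A minimal sketch of this maximum-likelihood estimate (names are illustrative); it also shows the zero-count problem: before any cherry has been seen, the estimate θ = 0 asserts that cherries are impossible.

def ml_theta(observations):
    # Maximum-likelihood estimate theta = c/N for P(F = cherry)
    c = sum(1 for o in observations if o == "cherry")
    return c / len(observations)

print(ml_theta(["cherry", "lime", "cherry", "cherry"]))  # 0.75
print(ml_theta(["lime", "lime"]))                        # 0.0 -- the zero-count problem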

Multiple parameters
Red/green wrapper depends probabilistically on flavor:
[Figure: Bayes net Flavor → Wrapper, with parameters P(F = cherry) = θ and P(W = red | F): θ1 for cherry, θ2 for lime]
Likelihood for, e.g., a cherry candy in a green wrapper:

P(F = cherry, W = green | hθ,θ1,θ2)
  = P(F = cherry | hθ,θ1,θ2) P(W = green | F = cherry, hθ,θ1,θ2)
  = θ (1 − θ1)
N candies, rc red-wrapped cherry candies, gc green-wrapped cherry candies, etc.:

P(d|hθ,θ1,θ2) = θ^c (1 − θ)^ℓ · θ1^rc (1 − θ1)^gc · θ2^rℓ (1 − θ2)^gℓ

Take the logarithm:

L = [c log θ + ℓ log(1 − θ)] + [rc log θ1 + gc log(1 − θ1)] + [rℓ log θ2 + gℓ log(1 − θ2)]

Multiple parameters contd.
Derivatives of L contain only the relevant parameter:

∂L/∂θ  = c/θ − ℓ/(1 − θ) = 0      ⇒  θ  = c/(c + ℓ)
∂L/∂θ1 = rc/θ1 − gc/(1 − θ1) = 0  ⇒  θ1 = rc/(rc + gc)
∂L/∂θ2 = rℓ/θ2 − gℓ/(1 − θ2) = 0  ⇒  θ2 = rℓ/(rℓ + gℓ)

With complete data, parameters can be learned separately
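A minimal counting sketch of these closed-form estimates (data layout and names are illustrative; it assumes at least one candy of each flavor has been observed). Each parameter is estimated from its own counts, independently of the others.

def ml_params(candies):
    # candies: list of (flavor, wrapper) pairs, e.g. ("cherry", "green")
    c  = sum(1 for f, w in candies if f == "cherry")
    l  = len(candies) - c
    rc = sum(1 for f, w in candies if f == "cherry" and w == "red")
    rl = sum(1 for f, w in candies if f == "lime" and w == "red")
    theta  = c / (c + l)          # P(F = cherry)           = c / N
    theta1 = rc / c               # P(W = red | F = cherry) = rc / (rc + gc)
    theta2 = rl / l               # P(W = red | F = lime)   = rl / (rl + gl)
    return theta, theta1, theta2

data = [("cherry", "red"), ("cherry", "green"), ("lime", "red"), ("lime", "green")]
print(ml_params(data))            # (0.5, 0.5, 0.5)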

Naive Bayes Models

Assuming Boolean variables, the parameters are θ = P(C = true), θi1 = P(Xi = true | C = true),
and θi2 = P(Xi = true | C = false).

With observed attribute values x1, ..., xn, the probability of each class is given by

P(C | x1, ..., xn) = α P(C) ∏i P(xi | C)

A deterministic prediction can be obtained by choosing the most likely class

The method learns fairly well but not as well as decision-tree learning; this is
presumably because the true hypothesis (which is a decision tree) is not
representable exactly using a naive Bayes model.

Naive Bayes learning turns out to do surprisingly well in a wide range of
applications; the boosted version is one of the most effective general-purpose
learning algorithms.

Scales well to large problems: with n Boolean attributes there are just 2n + 1
parameters.
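A minimal sketch of naive Bayes learning and prediction with Boolean attributes (the data layout, the function names, and the add-one smoothing are illustrative choices, not from the slides). Training is just counting; prediction applies the product formula above and normalizes.

def train_naive_bayes(examples):
    # examples: list of (x, c), x a tuple of n booleans, c a boolean class label
    n = len(examples[0][0])
    pos = [x for x, c in examples if c]
    neg = [x for x, c in examples if not c]
    theta = len(pos) / len(examples)                     # P(C = true)
    # add-one smoothing to avoid zero counts (an illustrative choice)
    theta_i1 = [(sum(x[i] for x in pos) + 1) / (len(pos) + 2) for i in range(n)]
    theta_i2 = [(sum(x[i] for x in neg) + 1) / (len(neg) + 2) for i in range(n)]
    return theta, theta_i1, theta_i2

def predict(params, x):
    # P(C = true | x), obtained by normalizing P(C) * prod_i P(xi | C)
    theta, t1, t2 = params
    p_true, p_false = theta, 1 - theta
    for i, xi in enumerate(x):
        p_true  *= t1[i] if xi else (1 - t1[i])
        p_false *= t2[i] if xi else (1 - t2[i])
    return p_true / (p_true + p_false)

examples = [((True, False), True), ((True, True), True), ((False, False), False)]
print(predict(train_naive_bayes(examples), (True, False)))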

Naive Bayes Models

The learning curve for naive Bayes learning applied to the restaurant problem
from Chapter 18, compared with decision-tree learning.

Example: linear Gaussian model
[Figure: (a) a linear Gaussian model P(y|x), a Gaussian over y whose mean depends
linearly on x; (b) data points generated from the model, with the best-fit line.]

Assume data are generated as follows:

P(y|x) = (1/(σ√(2π))) exp(−(y − (θ1 x + θ2))² / (2σ²))

Maximize the conditional likelihood

P(d | hθ1,θ2) = ∏j=1..N P(yj | xj)

with respect to θ1, θ2. Maximizing the log-likelihood is equivalent to

minimizing E = ∑j=1..N (yj − (θ1 xj + θ2))²

That is, minimizing the sum of squared errors gives the ML solution for a linear fit
assuming Gaussian noise of fixed variance.
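A minimal numpy sketch (with illustrative synthetic data) showing that the ML fit under this model is just a least-squares line fit.

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 100)
y = 0.7 * x + 0.2 + rng.normal(0, 0.05, 100)      # linear model plus Gaussian noise

# Least squares = ML estimate of (theta1, theta2) under fixed-variance Gaussian noise
A = np.column_stack([x, np.ones_like(x)])
(theta1, theta2), *_ = np.linalg.lstsq(A, y, rcond=None)
print(theta1, theta2)                              # close to 0.7 and 0.2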
Learning with Complete Data

Bayesian parameter learning

Hypothesis prior: the Bayesian approach to parameter learning starts by defining
a prior probability distribution over the possible hypotheses.

In the Bayesian view, θ is the (unknown) value of a random variable Θ that
defines the hypothesis space.

The hypothesis prior is just the prior distribution P(Θ).

Thus, P(Θ = θ) is the prior probability that the bag has a fraction θ of cherry
candies.

If we don’t know anything about the possible values of θ, then we can use
P(Θ) = Uniform[0,1](θ); the uniform density is a member of the beta family of
distributions.

Each beta distribution is defined by two hyperparameters a and b such that

beta[a,b](θ) = α θ^(a−1) (1 − θ)^(b−1)

(α is a normalization constant; the mean is a/(a + b), and larger a + b gives a more
sharply peaked distribution).
Learning with Complete Data

Bayesian parameter learning

Suppose the prior is beta[a,b](θ) and we observe a cherry; then we have

P(θ | D1 = cherry) = α P(D1 = cherry | θ) P(θ) = α′ θ · θ^(a−1) (1 − θ)^(b−1) = beta[a+1,b](θ)

so the posterior is also a beta distribution: the beta family is conjugate to this likelihood.
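A minimal sketch of this conjugate update: observing a cherry increments a, observing a lime increments b (the function name and observation sequence are illustrative).

def update_beta(a, b, observation):
    # Posterior hyperparameters after one observation, starting from beta[a, b]
    return (a + 1, b) if observation == "cherry" else (a, b + 1)

a, b = 1, 1                          # uniform prior = beta[1, 1]
for obs in ["cherry", "cherry", "lime"]:
    a, b = update_beta(a, b, obs)
print(a, b)                          # posterior is beta[3, 2]
print(a / (a + b))                   # posterior mean estimate of theta = 0.6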

Learning with Complete Data

Bayesian parameter learning

Assuming Parameter independence

A Bayesian network that corresponds to a Bayesian learning process.


Posterior distributions for the parameter variables Θ, Θ1, and Θ2 can be
inferred from their prior distributions and the evidence in the Flavori and
Wrapperi variables.
Learning with Complete Data

Density estimation with nonparametric models

• It is possible to learn a probability model without making any assumptions
about its structure and parameterization, by adopting nonparametric methods.
Two possibilities:
• k-nearest neighbors (used here for density estimation rather than regression or
classification): to estimate the unknown probability density at a query point x, we can
simply measure the density of the data points in the neighborhood of x.

• Use kernel functions

(a) A 3D plot of the mixture of Gaussians.

(b) A 128-point sample of points from the mixture, together with two query points
(small squares) and their 10-nearest-neighborhoods (medium and large circles).
Learning with Complete Data

Density estimation using k-nearest-neighbors, for k= 3, 10, and 40 respectively.


k = 3 is too spiky, 40 is too smooth, and 10 is just about right.
The best value for k can be chosen by cross-validation
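A minimal sketch of k-nearest-neighbors density estimation in two dimensions, under the common estimator p(x) ≈ k / (N · volume of the smallest disc around x containing its k nearest neighbors); the data and names are illustrative.

import math

def knn_density(x, data, k=10):
    # Estimate p(x) as k / (N * area of the disc reaching the k-th nearest neighbor)
    dists = sorted(math.dist(x, p) for p in data)
    radius = dists[k - 1]                          # distance to the k-th nearest neighbor
    return k / (len(data) * math.pi * radius ** 2)

data = [(0.1 * i, 0.05 * i) for i in range(50)]    # illustrative 2D sample
print(knn_density((0.5, 0.25), data, k=10))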

Learning with Complete Data

Kernel density estimation using Gaussian kernels with width w = 0.02, 0.07, and 0.20,
respectively. w = 0.07 is about right.
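A minimal one-dimensional sketch of kernel density estimation, p(x) = (1/N) Σj K(x, xj) with a Gaussian kernel of width w (the data are illustrative).

import math

def kernel_density(x, data, w=0.07):
    # p(x) = (1/N) * sum_j Gaussian(x; mean = x_j, std = w)
    norm = 1.0 / (w * math.sqrt(2 * math.pi))
    return sum(norm * math.exp(-((x - xj) ** 2) / (2 * w ** 2)) for xj in data) / len(data)

data = [0.2, 0.25, 0.3, 0.7, 0.75]                # illustrative 1D sample
print(kernel_density(0.25, data, w=0.07))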

Learning With Hidden Variables: The EM Algorithm
Unsupervised clustering: Learning mixtures of Gaussians
• Unsupervised clustering is the problem of discerning multiple categories in a collection of
objects. The problem is unsupervised because the category labels are not given.
• Unsupervised clustering begins with data.
• Next, we need to understand what kind of probability distribution might have generated the data.
• Clustering presumes that the data are generated from a mixture distribution, P. Such a distribution
has k components, each of which is a distribution in its own right.
• Let the random variable C denote the component, with values 1,...,k; then the mixture distribution is
given by

P(x) = ∑i=1..k P(C = i) P(x | C = i)

where x refers to the values of the attributes for a data point.

• For continuous data, a natural choice is the mixture of Gaussians (multivariate Gaussian). The
parameters of a mixture of Gaussians are wi = P(C = i) (the weight of each
component), µi (the mean of each component), and Σi (the covariance of each component).

Learning With Hidden Variables: The EM Algorithm

Many real-world problems have hidden variables (sometimes called latent variables)
which are not observable in the data.

Expectation-maximization helps with hidden variables


• Pretend that we know the parameters for the model, then infer the probability that each
data point belongs to each component.
• Refit the components to the data, where each component is fitted to the entire data set
with each point weighted by the probability that it belongs to that component.
• The process iterates until convergence

Essentially, we are “completing” the data by inferring probability distributions over the
hidden variables—which component each data point belongs to—based on the
current model.

Learning With Hidden Variables: The EM Algorithm
Unsupervised clustering: Learning mixtures of Gaussians
• For the mixture of Gaussians, initialize the mixture-model parameters arbitrarily and iterate the E-
step (Expectation Step) & M-step (Maximization Step)
• E-Step: Compute the probabilities pij = P(C=i | xj), the probability that datum xj was generated by
component i. By Bayes’ rule, we have pij = α P(xj | C=i) P(C=i). The term P(xj | C=i) is just the
probability at xj of the ith Gaussian, and the term P(C=i) is just the weight parameter for the ith
Gaussian. Define ni = ∑j pij, the effective number of data points currently assigned to
component i.
• M-Step: Compute the new means, covariances, and component weights using the following steps
in sequence (this finds the new values of the parameters that maximize the log likelihood of the
data, given the expected values of the hidden indicator variables):

µi ← ∑j pij xj / ni
Σi ← ∑j pij (xj − µi)(xj − µi)ᵀ / ni
wi ← ni / N

where N is the total number of data points.
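A minimal sketch of these two steps for a one-dimensional mixture of Gaussians (fixed number of iterations, illustrative data and initialization; a fuller implementation would monitor convergence of the log likelihood).

import math, random

def em_gmm_1d(xs, k=2, iters=50):
    # Arbitrary initialization of weights, means, and variances
    w, mu, var = [1.0 / k] * k, random.sample(xs, k), [1.0] * k
    for _ in range(iters):
        # E-step: p[i][j] = P(C = i | x_j), by Bayes' rule with the current parameters
        p = [[w[i] * math.exp(-(x - mu[i]) ** 2 / (2 * var[i])) / math.sqrt(2 * math.pi * var[i])
              for x in xs] for i in range(k)]
        for j in range(len(xs)):
            s = sum(p[i][j] for i in range(k))
            for i in range(k):
                p[i][j] /= s
        # M-step: re-estimate weights, means, and variances from the weighted data
        for i in range(k):
            ni = sum(p[i])
            mu[i] = sum(pij * x for pij, x in zip(p[i], xs)) / ni
            var[i] = sum(pij * (x - mu[i]) ** 2 for pij, x in zip(p[i], xs)) / ni + 1e-6
            w[i] = ni / len(xs)
    return w, mu, var

xs = [random.gauss(0, 1) for _ in range(200)] + [random.gauss(5, 1) for _ in range(200)]
print(em_gmm_1d(xs))                               # weights near 0.5, means near 0 and 5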


• Observations: the log likelihood for the final learned model slightly exceeds
that of the original model, and EM increases the log likelihood of the data at
every iteration.
Learning With Hidden Variables: The EM Algorithm

Learning Bayesian networks with hidden variables

To learn a Bayesian network with hidden variables, we apply the same insights
that worked for mixtures of Gaussians.

Example: a situation in which there are two bags of candies that have been
mixed together.
• Candies are described by three features: in addition to the Flavor and the
Wrapper, some candies have a Hole in the middle and some do not.
• The distribution of candies in each bag is described by a naive Bayes
model: the features are independent, given the bag, but the conditional
probability distribution for each feature depends on the bag.

• 1000 samples were generated from a model with the true parameters:

Learning With Hidden Variables: The EM Algorithm

Learning Bayesian networks with hidden variables


Start with arbitrary values for model parameters.

The expected count of (Bag = 1) is the sum, over all candies, of the probability that the
candy came from bag 1:

N̂(Bag = 1) = ∑j P(Bag = 1 | flavorj, wrapperj, holesj)

which, using Bayes’ rule and applying conditional independence, becomes

N̂(Bag = 1) = ∑j [P(flavorj | Bag = 1) P(wrapperj | Bag = 1) P(holesj | Bag = 1) P(Bag = 1)]
             / [∑i P(flavorj | Bag = i) P(wrapperj | Bag = i) P(holesj | Bag = i) P(Bag = i)]

The expected count of cherry candies from bag 1 is given by the analogous sum restricted
to the candies whose flavor is cherry:

N̂(Flavor = cherry, Bag = 1) = ∑j: flavorj = cherry P(Bag = 1 | Flavorj = cherry, wrapperj, holesj)
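A minimal sketch of this expected-count computation for the candy model (the parameter layout and the particular numbers are illustrative; the model structure follows the naive Bayes assumption on the slide).

def p_bag1(candy, params):
    # P(Bag = 1 | flavor, wrapper, holes) by Bayes' rule under the naive Bayes model
    theta, tF, tW, tH = params          # tF = (P(cherry|bag1), P(cherry|bag2)), etc.
    flavor, wrapper, holes = candy
    def lik(bag):
        p  = tF[bag] if flavor == "cherry" else 1 - tF[bag]
        p *= tW[bag] if wrapper == "red" else 1 - tW[bag]
        p *= tH[bag] if holes else 1 - tH[bag]
        return p
    num = lik(0) * theta
    return num / (num + lik(1) * (1 - theta))

params = (0.6, (0.6, 0.4), (0.6, 0.4), (0.6, 0.4))           # illustrative current parameters
candies = [("cherry", "red", True), ("lime", "green", False)]
expected_bag1 = sum(p_bag1(c, params) for c in candies)                        # N(Bag = 1)
expected_cherry_bag1 = sum(p_bag1(c, params) for c in candies if c[0] == "cherry")
print(expected_bag1, expected_cherry_bag1)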

Learning With Hidden Variables: The EM Algorithm

Learning Bayesian networks with hidden variables

(a) A mixture model for candy. The proportions of different flavors, wrappers, and
presence of holes depend on the bag, which is not observed.
(b) Bayesian network for a Gaussian mixture. The mean and covariance of the
observable variables X depend on the component C.

Learning With Hidden Variables: The EM Algorithm

Graphs showing the log likelihood of the data, L, as a function of the EM iteration:
(a) for the Gaussian mixture model; (b) for the Bayesian network.

Learning With Hidden Variables: The EM Algorithm

Learning hidden Markov models

One application of EM involves learning the transition probabilities in hidden
Markov models (HMMs).

A hidden Markov model can be represented by a dynamic Bayes net with a
single discrete state variable.

Each data point consists of an observation sequence of finite length

To estimate the transition probability θij from state i to state j,

• calculate the expected proportion of times that the system undergoes a
transition to state j when in state i:

θ̂ij = ∑t N̂(Xt+1 = j, Xt = i) / ∑t N̂(Xt = i)

Summary
Bayesian learning methods formulate learning as a form of probabilistic
inference, using the observations to update a prior distribution over hypotheses.

Maximum a posteriori (MAP) learning selects a single most likely hypothesis
given the data.

Maximum-likelihood learning simply selects the hypothesis that maximizes the
likelihood of the data; it is equivalent to MAP learning with a uniform prior.

Naive Bayes learning is a particularly effective technique that scales well to large problems.

When some variables are hidden, local maximum likelihood solutions can be
found using the EM algorithm

Nonparametric models represent a distribution using the collection of data
points.
