Artificial Intelligence: A Modern Approach
Fourth Edition
Chapter 20
Copyright © 2021 Pearson Education, Inc. All Rights Reserved
Outline
♦ Statistical Learning
♦ Learning with Complete Data
♦ Learning with Hidden Variables: The EM Algorithm
Introduction
♦ Agents can handle uncertainty by using the methods of probability and
decision theory.
♦ However, they need to learn their probabilistic theories of the world from
experience.
♦ This chapter focuses on how such theories can be learned from data.
♦ The focus is on data and hypotheses.
♦ Data are evidence: instantiations of some or all of the random variables
describing the domain.
♦ The hypotheses in this chapter are probabilistic theories of how the domain
works, including logical theories as a special case.
Statistical Learning Example
Suppose there are five kinds of bags of candies:
h1: 100% cherry candies
h2: 75% cherry candies + 25% lime candies
h3: 50% cherry candies + 50% lime candies
h4: 25% cherry candies + 75% lime candies
h5: 100% lime candies
[Figure: posterior probabilities of the hypotheses, P(h1 | d), ..., P(h5 | d), as a function of the number of samples in d.]
[Figure: Bayesian prediction probability P(next candy is lime | d) as a function of the number of samples in d.]
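These curves can be reproduced with a short sketch. This is a minimal illustration, assuming the textbook's prior (0.1, 0.2, 0.4, 0.2, 0.1) over h1, ..., h5 and an all-lime observation sequence; the function names are ours.

    import numpy as np

    priors = np.array([0.1, 0.2, 0.4, 0.2, 0.1])    # assumed prior P(h1)..P(h5)
    p_lime = np.array([0.0, 0.25, 0.5, 0.75, 1.0])  # P(lime | h_i) for h1..h5

    def posterior(data):
        """P(h_i | d) for a list of observations, each 'lime' or 'cherry'."""
        post = priors.copy()
        for candy in data:
            post = post * (p_lime if candy == 'lime' else 1.0 - p_lime)
        return post / post.sum()

    def predict_lime(data):
        """Bayesian prediction: P(next is lime | d) = sum_i P(lime | h_i) P(h_i | d)."""
        return float(posterior(data) @ p_lime)

    for n in range(11):                              # mirrors the x-axis of the figures
        print(n, posterior(['lime'] * n), predict_lime(['lime'] * n))

With no data the prediction is 0.5; as more lime candies are observed it approaches 1, as in the figure.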
MAP approximation
• Bayesian predictions are optimal regardless of the size of the dataset.
• However, the hypothesis space is usually very large or infinite.
• Summing over the hypothesis space is often intractable, so we resort to
approximate or simplified methods (e.g., there are 2^(2^6) = 18,446,744,073,709,551,616
Boolean functions of 6 attributes).
• Common approach: make the prediction based on a single most probable
hypothesis.
• Maximum a posteriori (MAP) learning: choose hMAP maximizing P(hi | d).
ML approximation
For large data sets, the prior becomes irrelevant.
Maximum likelihood (ML) learning: choose hML maximizing P(d | hi),
i.e., simply get the best fit to the data; identical to MAP for a uniform
prior (which is reasonable if all hypotheses are of the same complexity).
ML is the “standard” (non-Bayesian) statistical learning method.
(A sketch applying both MAP and ML to the candy example follows below.)
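A minimal sketch of both approximations on the candy example (same assumed prior and hypotheses as the Bayesian sketch above; argmax index 0 corresponds to h1):

    import numpy as np

    priors = np.array([0.1, 0.2, 0.4, 0.2, 0.1])    # assumed prior P(h1)..P(h5)
    p_lime = np.array([0.0, 0.25, 0.5, 0.75, 1.0])  # P(lime | h_i)

    def likelihoods(data):
        """P(d | h_i) for each hypothesis; data is a list of 'lime'/'cherry'."""
        n_lime = sum(c == 'lime' for c in data)
        return p_lime ** n_lime * (1.0 - p_lime) ** (len(data) - n_lime)

    d = ['lime'] * 3
    h_ml  = int(np.argmax(likelihoods(d)))            # best fit to the data alone
    h_map = int(np.argmax(likelihoods(d) * priors))   # likelihood weighted by the prior
    print(h_ml, h_map)                                # both pick index 4 (h5) after three limes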
Learning with Complete Data
• The task of learning a probability model, given data that are assumed to
be generated from that model, is called density estimation.
• Density estimation is a form of unsupervised learning.
• We will assume the data are complete: each data point contains
values for every variable in the probability model being learned.
Bayesian Network Model
(a) Bayesian network model for the case of candies with an unknown
proportion of cherries and limes.
(b) Model for the case where the wrapper color depends (probabilistically) on
the candy flavor.
ML parameter learning in Bayes nets
Bag from a new manufacturer; what fraction θ of cherry candies?
Any θ is possible: continuum of hypotheses hθ
θ is a parameter for this simple (binomial) family of models
Suppose we unwrap N candies, c cherries and l = N − c limes
These are i.i.d. (independent, identically distributed) observations, so
    P(d | hθ) = ∏j=1..N P(dj | hθ) = θ^c · (1 − θ)^l
Maximize this w.r.t. θ, which is easier for the log-likelihood:
    L(d | hθ) = log P(d | hθ) = ∑j=1..N log P(dj | hθ) = c log θ + l log(1 − θ)
    dL(d | hθ)/dθ = c/θ − l/(1 − θ) = 0  ⇒  θ = c/(c + l) = c/N
Seems sensible, but causes problems with 0 counts!
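A minimal sketch of the resulting estimator (the candy list is hypothetical):

    from collections import Counter

    def ml_theta(flavors):
        """ML estimate of theta = P(F = cherry) from observed flavors."""
        counts = Counter(flavors)
        c, l = counts['cherry'], counts['lime']
        return c / (c + l)        # theta = c / N; breaks down when both counts are 0

    print(ml_theta(['cherry', 'cherry', 'lime', 'cherry']))   # 0.75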
Multiple parameters
Red/green wrapper depends probabilistically on P(F=cherry )
flavor:
Likelihood for, e.g., cherry candy in green wrapper:
Flavor
P (F = cherry , W = green | F P(W=red | F)
h= P2 )(F = cherry |h
θ,θ1,θ )P (W = green |F = cherry , h cherry 1
θ,θ1 , θ,θ1 ,
) θ 2 θ 2 lime 2
= θ · (1 − θ1 ) Wrapper
N candies, rc red-wrapped cherry candies, etc.:
17
Multiple parameters contd.
Derivatives of L contain only the relevant parameter:
    ∂L/∂θ  = c/θ − l/(1 − θ)    = 0  ⇒  θ  = c/(c + l)
    ∂L/∂θ1 = rc/θ1 − gc/(1 − θ1) = 0  ⇒  θ1 = rc/(rc + gc)
    ∂L/∂θ2 = rl/θ2 − gl/(1 − θ2) = 0  ⇒  θ2 = rl/(rl + gl)
With complete data, the ML estimate of each parameter depends only on its own counts;
a sketch appears below.
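A minimal sketch of the decomposed estimates, assuming a hypothetical list of (flavor, wrapper) observations:

    def ml_params(candies):
        """candies: list of (flavor, wrapper) pairs, e.g. ('cherry', 'green')."""
        c  = sum(f == 'cherry' for f, w in candies)
        l  = len(candies) - c
        rc = sum(f == 'cherry' and w == 'red' for f, w in candies)
        rl = sum(f == 'lime' and w == 'red' for f, w in candies)
        theta  = c / (c + l)      # P(F = cherry)
        theta1 = rc / c           # P(W = red | F = cherry); assumes c > 0
        theta2 = rl / l           # P(W = red | F = lime);  assumes l > 0
        return theta, theta1, theta2

    print(ml_params([('cherry', 'red'), ('cherry', 'green'),
                     ('lime', 'green'), ('lime', 'green')]))   # (0.5, 0.5, 0.0)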
Naive Bayes Models
With observed attribute values x1, ..., xn, the probability of each class is given by
    P(C | x1, ..., xn) = α P(C) ∏i P(xi | C)
The method learns fairly well but not as well as decision-tree learning; this is
presumably because the true hypothesis (which is a decision tree) is not exactly
representable by a naive Bayes model.
It scales well to large problems: with n Boolean attributes there are just 2n + 1
parameters.
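A minimal sketch of the prediction rule with hand-set (hypothetical) parameters for two classes and three Boolean attributes:

    prior = {'pos': 0.6, 'neg': 0.4}              # P(C), assumed values
    cond  = {'pos': [0.9, 0.7, 0.2],              # P(x_i = True | C) per attribute, assumed
             'neg': [0.3, 0.4, 0.8]}

    def predict(x):
        """x: list of Booleans. Returns the normalized class posterior P(C | x1..xn)."""
        scores = {}
        for c in prior:
            p = prior[c]
            for xi, t in zip(x, cond[c]):
                p *= t if xi else (1.0 - t)       # naive Bayes: attributes independent given C
            scores[c] = p
        z = sum(scores.values())                  # normalization constant (alpha = 1/z)
        return {c: s / z for c, s in scores.items()}

    print(predict([True, False, True]))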
Naive Bayes Models
The learning curve for naive Bayes learning applied to the restaurant problem
from Chapter 18, compared with decision-tree learning.
Example: linear Gaussian model
[Figure: (left) a linear Gaussian model P(y | x); (right) data points with the maximum-likelihood fit line.]
Assume the data are generated as follows:
    P(y | x) = (1 / (σ√(2π))) e^(−(y − (θ1 x + θ2))² / (2σ²))
Maximizing the likelihood of N examples is equivalent to
    minimizing E = ∑j=1..N (yj − (θ1 xj + θ2))²
That is, minimizing the sum of squared errors gives the ML solution for a linear
fit assuming Gaussian noise of fixed variance.
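A minimal sketch on synthetic data; np.polyfit performs exactly this sum-of-squared-errors minimization (the true parameters 2.0 and 0.5 are made up):

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.uniform(0, 1, 50)
    y = 2.0 * x + 0.5 + rng.normal(0, 0.1, 50)   # y = theta1*x + theta2 + Gaussian noise

    theta1, theta2 = np.polyfit(x, y, deg=1)     # minimizes sum_j (y_j - (theta1*x_j + theta2))^2
    print(theta1, theta2)                        # close to 2.0 and 0.5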
Bayesian Parameter Learning
Thus, P(Θ = θ) is the prior probability that the bag has a fraction θ of cherry
candies.
Nonparametric Density Estimation
Kernel density estimation using Gaussian kernels with widths w = 0.02, 0.07, and
0.20, respectively. w = 0.07 is about right.
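A minimal sketch of a one-dimensional Gaussian-kernel density estimate (the data are synthetic; w corresponds to the kernel width on the slide):

    import numpy as np

    def kde(query, data, w):
        """Average of Gaussian kernels of width w centred on the data points."""
        q = np.asarray(query, dtype=float)[:, None]
        d = np.asarray(data, dtype=float)[None, :]
        k = np.exp(-0.5 * ((q - d) / w) ** 2) / (w * np.sqrt(2 * np.pi))
        return k.mean(axis=1)

    rng = np.random.default_rng(0)
    data = np.concatenate([rng.normal(0.3, 0.05, 200), rng.normal(0.7, 0.10, 200)])
    grid = np.linspace(0, 1, 101)
    print(kde(grid, data, w=0.07).round(2))      # estimated density on [0, 1]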
Learning With Hidden Variables: The EM Algorithm
Unsupervised clustering: learning mixtures of Gaussians
• Unsupervised clustering is the problem of discerning multiple categories in a collection of
objects. The problem is unsupervised because the category labels are not given.
• Unsupervised clustering begins with data.
• Next, we need to understand what kind of probability distribution might have generated the data.
• Clustering presumes that the data are generated from a mixture distribution, P. Such a distribution
has k components, each of which is a distribution in its own right.
• Let the random variable C denote the component, with values 1,...,k; then the mixture distribution is
given by
    P(x) = ∑i=1..k P(C = i) P(x | C = i)
where x refers to the values of the attributes for a data point.
• For continuous data, a natural choice is a mixture of Gaussians (multivariate Gaussians). The
parameters of a mixture of Gaussians are wi = P(C = i) (the weight of each
component), µi (the mean of each component), and Σi (the covariance of each component).
Learning With Hidden Variables: The EM Algorithm
Many real-world problems have hidden variables (sometimes called latent variables)
which are not observable in the data.
Essentially, we are “completing” the data by inferring probability distributions over the
hidden variables—which component each data point belongs to—based on the
current model.
Learning With Hidden Variables: The EM Algorithm
Unsupervised clustering: learning mixtures of Gaussians
• For the mixture of Gaussians, initialize the mixture-model parameters arbitrarily and iterate the
E-step (expectation step) and M-step (maximization step).
• E-step: Compute the probabilities pij = P(C = i | xj), the probability that datum xj was generated by
component i. By Bayes’ rule, we have pij = αP(xj | C = i)P(C = i). The term P(xj | C = i) is just the
probability at xj of the ith Gaussian, and the term P(C = i) is just the weight parameter for the ith
Gaussian. Define ni = ∑j pij, the effective number of data points currently assigned to
component i.
• M-step: Compute the new means, covariances, and component weights in sequence:
    µi ← ∑j pij xj / ni,   Σi ← ∑j pij (xj − µi)(xj − µi)⊤ / ni,   wi ← ni / N
  This finds the new values of the parameters that maximize the log likelihood of the data,
  given the expected values of the hidden indicator variables. (A code sketch follows below.)
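A minimal sketch of the E-step/M-step loop for a one-dimensional mixture of k Gaussians (initialization, data, and iteration count are arbitrary; a fuller version would monitor the log likelihood for convergence):

    import numpy as np

    def em_gmm(x, k=2, iters=50):
        rng = np.random.default_rng(0)
        w   = np.full(k, 1.0 / k)                 # component weights w_i = P(C = i)
        mu  = rng.choice(x, k, replace=False)     # component means
        var = np.full(k, np.var(x))               # component variances
        for _ in range(iters):
            # E-step: p[i, j] = P(C = i | x_j) by Bayes' rule
            diff = x - mu[:, None]                                        # shape (k, n)
            pdf = np.exp(-0.5 * diff ** 2 / var[:, None]) / np.sqrt(2 * np.pi * var[:, None])
            p = w[:, None] * pdf
            p = p / p.sum(axis=0)
            n_i = p.sum(axis=1)                   # effective counts per component
            # M-step: re-estimate means, variances, then weights
            mu  = (p @ x) / n_i
            var = (p * (x - mu[:, None]) ** 2).sum(axis=1) / n_i
            w   = n_i / len(x)
        return w, mu, var

    rng = np.random.default_rng(1)
    x = np.concatenate([rng.normal(-2, 1.0, 300), rng.normal(3, 0.5, 200)])
    print(em_gmm(x))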
Learning With Hidden Variables: The EM Algorithm
To learn a Bayesian network with hidden variables, we apply the same insights
that worked for mixtures of Gaussians.
Example: a situation in which there are two bags of candies that have been
mixed together.
• Candies are described by three features: in addition to the Flavor and the
Wrapper, some candies have a Hole in the middle and some do not.
• The distribution of candies in each bag is described by a naive Bayes
model: the features are independent, given the bag, but the conditional
probability distribution for each feature depends on the bag.
Learning With Hidden Variables: The EM Algorithm
The expected count of (Bag = 1) is the sum, over all candies, of the probability that the
candy came from bag 1.
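In symbols, a hedged reconstruction of the update the slide describes (writing θ^(1) for the estimated prior probability that a candy comes from bag 1):

    \hat{N}(Bag=1) = \sum_{j=1}^{N} P(Bag=1 \mid flavor_j, wrapper_j, holes_j),
    \qquad \theta^{(1)} \leftarrow \hat{N}(Bag=1) / N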
Learning With Hidden Variables: The EM Algorithm
(a) A mixture model for candy. The proportions of different flavors, wrappers, and
presence of holes depend on the bag, which is not observed.
(b) Bayesian network for a Gaussian mixture. The mean and covariance of the
observable variables X depend on the component C.
Summary
Bayesian learning methods formulate learning as a form of probabilistic
inference, using the observations to update a prior distribution over hypotheses.
When some variables are hidden, local maximum likelihood solutions can be
found using the EM algorithm.