Notes4_BayesianLearning

Download as pdf or txt
Download as pdf or txt
You are on page 1of 8

Note Set 4: Bayesian Learning

Padhraic Smyth
Department of Computer Science
University of California, Irvine
January 2024

1 Introduction

In this set of notes we introduce a different approach to parameter estimation and learning: the Bayesian
approach. We outline the concepts that form the basis for Bayesian thinking, discuss how these ideas can be
applied to parameter estimation for various models, and conclude with a discussion of some of the broader
aspects of Bayesian learning.
The key aspect of Bayesian estimation is that we treat unknown quantities, such as parameters θ, as
random variables that we can put distributions over, where these distributions reflect our uncertainty about
their values. This is in contrast to the maximum likelihood approach for parameter estimation where an
unknown parameter θ is treated as an unknown but fixed quantity, rather than as a random variable.
As with maximum likelihood, in Bayesian estimation we begin by defining a likelihood function p(D|θ)
for some observed data D conditioned on some unknown set of parameters θ. The likelihood function
will be different for different problems depending on the nature of the data D (e.g., real-valued, discrete,
multivariate, sequential, and so on) and depending on the nature of the modeling assumptions we make (e.g.,
Gaussian, Markov, and so on). The definition of the likelihood in Bayesian estimation is done in the same
way as it is done for maximum likelihood estimation, i.e., by making some simplifying assumptions about
a probabilistic “generative” model for how the observed data might have been generated.
Where Bayesian estimation differs is that it treats θ as a random variable. Specifically, we can apply
Bayes rule to the problem of parameter estimation:
p(D|θ)p(θ)
p(θ|D) = (1)
p(D)
This equation has introduced some new quantities we have not encountered before:

1. A prior density p(θ) on the unknown parameters θ. This is a density function that reflects our prior
belief on the possible values of θ. For example, if we had prior knowledge for some particular problem
that 0 ≤ θ ≤ 10, and that all values of θ in that range were equally likely, they we could defined p(θ)
as the uniform density p(θ) = U [0, 10]. The prior density p(θ) is (in principle at least) defined based
only on our prior knowledge about the problem, before we see any data D.

1
Notes on Bayesian Learning 2

2. A posterior density p(θ|D) on θ. Like the prior, this is another density function on θ, but now is a
conditional density, conditioned on the data D. It reflects our posterior belief about θ, having seen
the data D. For example, if our prior were p(θ) = U [0, 10], and θ represented the mean value for a
Gaussian distribution, and we observed data with values D = {5.2, 4.8, 5.9, 5.1, 4.6} we might expect
our posterior density to be more concentrated around the average of these values compared to a flat
uniform prior from 0 to 10.

3. A normalization term P (D): we can compute this by integrating over parameter space, i.e., P (D) =
R R
p(θ, D)dθ = p(D|θ)p(θ)dθ. This normalization term P (D) is sometimes referred to as the
marginal likelihood: it is the likelihood of the data marginalizing over all possible parameter values
with some prior p(θ).

We will not pay much attention to the normalization term p(D) for now other than to note that the numerator
term on the right in Equation 1, p(D|θ)p(θ), requires normalization in order to insure that p(θ|D) is a density
function.

Ignoring the normalization term, we can write our earlier equation in a simple proportional form:

p(θ|D) ∝ p(D|θ) p(θ). (2)

This is the classic definition of Bayesian learning. The posterior p(θ|D) is proportional to the likelihood
p(D|θ) times the prior p(θ).

Bayesian estimation can be viewed as an information processing procedure: we begin with the prior,
we update our prior information with observed data D, we combine the prior with a likelihood for the data,
and we arrive at a posterior on θ given the data and the prior. This updating procedure corresponds to a
very natural sequential notion of incorporating new information (data) by updating our prior beliefs (the
prior) to arrive at a new posterior set of beliefs about the world (as represented by θ). Given this view it
is not surprising that Bayesian estimation has found wide application in artificial intelligence and machine
learning, particularly for problems such as computer vision, robotics, online learning, and more, where an
agent is continually encountering new data D and needs to update its internal model of the world around it.

2 Example: Bayesian Estimation of a Bernoulli Parameter

Consider tossing a binary-valued (Bernoulli) random variable X that takes value 1 with probability θ and 0
with probability 1 − θ, and where the value of the parameter θ is unknown and we wish to estimate it from
data. This is the simple Bernoulli model we discussed in earlier notes. Now say we observe a data set D
of outcomes where D = {x1 , . . . , xN } and xi ∈ {0, 1} represents the outcome of the ith observation of X,
and where we will assume that the observations are conditionally independent given θ.
Notes on Bayesian Learning 3

2.1 Binomial Likelihood

As discussed earlier the binomial likelihood for this problem can be defined as
N
Y
p(D|θ) = L(θ) = P (xi |θ)
i=1
= θ (1 − θ)N −r
r

where r is the number of 1’s observed in the data and N − r is the number of 0’s.

2.2 Beta Prior

Our parameter θ is bounded between 0 and 1. To define a prior on θ we need a density function that is (a)
restricted to the interval [0, 1] and (b) has some flexibility in its shape (so that we can specify different priors
on θ by varying the parameters of the prior distribution).

A useful prior in this context is the Beta density, defined as


1
P (θ) = Be(θ; α, β) = θα−1 (1 − θ)β−1 (3)
B(α, β)

where 0 < θ < 1. The two parameters of this density function are α > 0 and β > 0. B(α, β) is a
normalization term to ensure that the density integrates to 11 .

The parameters of the prior (α and β here) are referred to as hyperparameters. In our discussion below
we will assume that these are fixed before we do our Bayesian analysis: in this classical Bayesian setup the
data analyst fixed the values of the hyperparameters a priori, based on their prior knowledge, before looking
at the data.

The mean of the Beta prior is


α
E[θ] = .
α+β
Since α and β are positive, the mean of the Beta prior is the relative size of α to the sum α + β. And it
turns out that the strength of our prior (how narrow it is around the mean) is proportional to α + β. So,
Be(20, 20) is a much stronger (narrower) prior than Be(1, 1), but they both have an expected value of 0.5
for our unknown parameter θ. A prior such as Be(9, 1) has an expected value of 0.9, and a “strength” of 10,
and so on. We can use α and β to express different prior beliefs about the value of θ.

2.3 Posterior Density for θ

Now that we have a prior (a Beta density) and a likelihood (binomial), we can use Bayes rule to derive
an expression for the posterior density p(θ|D) which is what we are primarily interested in for a Bayesian
Γ(α)Γ(β) R∞
1
B(α, β) = Γ(α+β)
, where Γ(z) = 0
xz−1 e−x dx, z > 0, is the gamma function.
Notes on Bayesian Learning 4

estimation problem, i.e.,

p(θ|D) ∝ p(D|θ)p(θ)
∝ θr (1 − θ)N −r θα−1 (1 − θ)β−1
= θr+α−1 (1 − θ)N −r+β−1
′ ′
= θα −1 (1 − θ)β −1

where α′ = α+r and β ′ = β +N −r. In our derivation above we dropped all the terms (from the likelihood
and the prior) that don’t involve θ.
Note that the final expression for p(θ|D) is itself in the form of a Beta density Be(α′ , β ′ ). Thus, we
have the nice result that the Beta prior has been updated to a Beta posterior2 .
The posterior mean is defined as

α′ r+α
Ep(θ|D) [θ] = = (4)
α′ + β ′ N +α+β
α
It is informative to compare this expression to the mean of the prior, E[θ] = α+β . We see that the posterior
mean can be seen as an updated version of the prior where we have (in effect) added r counts to α and N − r
counts to β.
In this sense, we can think of α and β in the prior as “pseudocounts”: for example if α = 2 and β = 8
we can think of this prior as representing 10 pseudocounts in total, 2 “successes” and 8 “failures.” Looking
at Ep(θ|D) [θ] above we can see that when N < 10 (in this example) that the information from the prior will
tend to dominate the information from the data (in terms of how the posterior mean is defined), and as N
increases above 10, the prior has less influence on the posterior density for θ.
In particular, recall that θ̂M L = Nr is the maximum likelihood estimate of θ for a binomial likelihood.
With a little algebra, we can rewrite the posterior mean as
α r
Ep(θ|D) [θ] = γN + (1 − γN )
α+β N
= γN (prior mean) + (1 − γN ) (ML estimate)
α+β
where γN = α+β+N . We see that the posterior mean of θ is a convex combination of the prior mean and the
maximum likelihood estimate: if the prior mean is strong relative to the data (α + β >> N and γN → 1)
the prior will dominate the data (the likelihood). And for any fixed prior values, as N → ∞, we have that
γN → 0, the posterior mean will converge to the maximum likelihood estimate Nr and the prior is ignored.
This is generally true in Bayesian estimation: as N gets large, the prior plays less of a role: and conversely,
with small amounts of data the prior can play a significant role.
We have focused above on the posterior mean Ep(θ|D) [θ], but it is also true that the variance (“width”) of
the posterior will tend to narrow as N increases. The variance of the posterior p(θ|D) reflects our uncertainty
2
Whenever a prior and posterior density are the same functional form, we say that the prior is conjugate to the likelihood (i.e.,
in this case the Beta prior is conjugate to the Binomial likelihood).
Notes on Bayesian Learning 5

about θ as we see more data and in general this variance will get narrower and narrower as we get more data,
eventually converging to a delta function centered at the maximum likelihood estimate.

We can also interpret the prior as providing a form of smoothing, i.e., denoting θ̂M P E be the mean
posterior estimate of θ.
r+α
θ̂M P E = Ep(θ|D) [θ] = (5)
N +α+β
The “raw” maximum likelihood estimate corresponds to Nr . Having positive α and β in the equation is a
form of smoothing: it particular we can interpret θ̂M P E as a smoothed estimate of θ̂M L , where the extreme
cases of estimates of θ = 0 or 1 are avoided in situations where r = 0 or r = N .

3 Different Types of Parameter Estimates

Given what we know so far about maximum likelihood and Bayesian estimation we can define a few different
types of parameter estimates:

Maximum Likelihood: θ̂M L = arg max p(D|θ)


θ
M AP
Maximum A Posteriori (MAP): θ̂ = arg max p(θ|D) = arg max p(D|θ)p(θ)
θ θ
Z
Mean Posterior Estimate (MPE): θ̂M P E = Ep(θ|D) [θ] = p(θ|D) θ dθ

The maximum a posteriori (MAP) and mean posterior estimate (MPE) differ only in the sense that one
of them selects the mode of the posterior density (MAP) and the other selects the mean of the posterior
(MPE). If the posterior density is symmetric, they will be the same, otherwise they will differ. Which one
to use in practice may come down to how the estimates will be used later on, i.e., what the purpose is of
computing the point estimate.

The MAP estimate has the property that if the prior is flat, i.e., p(θ) ∝ c where c is some constant, then
the ML and MAP estimates are the same, i.e., they both maximize the likelihood.

Note that all three of the parameter estimates defined above are point estimates, i.e., they only provide
a single number to summarize what we have learned about θ given the data (and given the prior if we are
being Bayesian). In contrast, the “fully Bayesian” approach is to use (or report) the full posterior p(θ|D).
This makes sense in contexts such as public policy or medicine where a human will look at p(θ|D) and
then make a decision—and if p(θ|D) is very wide (i.e., a lot of uncertainty) then the decision might be to
go back and get more data before making a final decision. In contrast, in machine learning, θ is not just a
single parameter but could consist of a set of thousands or even millions of parameters (e.g., all the weights
in a deep neural network). In this situation it is often not practical computationally to work with the full
posterior p(θ|D) and point estimates θ̂ (such as MAP or MPE estimates) are widely used.
Notes on Bayesian Learning 6

4 Generalization to K Outcomes (Multinomial Model)

We can generalize from binary outcomes to K-ary outcomes. Say X is a discrete-variable taking values
from 1 to K and we have IID data D = {x1 , . . . , xN }, xi ∈ {1, . . . , K}, with the usual IID assumption.
As an example xi might be a randomly-selected word from a corpus of documents (e.g., all Web pages in
Wikipedia) and K could be a large number (the vocabulary of words in our model), e.g., K = 100, 000.

Let θk = p(xi = k) be the probability of the kth outcome, with N


P
k=1 θk = 1 and let θ = (θ1 , . . . , θK )
be the set of all of the θ’s. The multinomial likelihood is defined as
N
Y K
Y
L(θ) = p(θ|D) = p(xi |θ) = θkrk (6)
i=1 k=1
PK
where rk is the number of times that xi = k in the data D, xi ∈ {0, 1, 2, . . .}, and k=1 rk = N . Recall
from earlier that the maximum likelihood estimate of θk is θ̂kM L = rNk .

In order to pursue a Bayesian approach for estimation we need to define a prior for θ. For the multinomial
likelihood the conjugate prior for θ is the Dirichlet density:

p(θ) = Dirichlet(α), α = (α1 , . . . , αK )


K
θkαk −1
Y

k=1

(Note that a special case of the Dirichlet density is the Beta(α, β) density, setting K = 2 and α = α1 , β =
α2 .)

The Dirichlet density p(θ) is defined over the (K−1)-dimensional simplex, defined by θk > 0, K
P
k=1 θk =
1. The mean of a Dirichlet is
αk
Ep(θ) [θk ] = P (7)
αk
As with the Beta density, the relative size of each αk to the sum of the α’s reflects the location of the mean
for θk , and the sum of the α’s reflects how narrow the density function is.

A non-symmetric Dirichlet prior is one where the αk ’s can be different to each other. For text, the αk ’s
could be proportional to frequency of words in English text (e.g., as measured from Web pages or a corpus
of newspapers): we might then be interested in estimating the relative frequencies of words, θk in some
new corpus where words might have different probabilities of occurring (e.g., in the biomedical research
literature, or in a set of blogs).

Multiplying the Dirichlet prior and the multinomial likelihood it is straightforward to show that the
posterior density for θ given the data D is another Dirichlet:

p(θ|D) = Dirichlet(α1′ , . . . , αK

) (8)

where αk′ = αk + rk (again this is a generalization of the solution for the K = 2 case).
Notes on Bayesian Learning 7

The posterior mean is


rk + αk
Ep(θ|D) [θk ] = P . (9)
N + αk
This is directly analogous to what we found for the Beta binomial case, i.e.,

• Each αk can be thought of as a “prior pseudocount” for outcome k,


P
• The sum αk represents the strength of the prior
rk
• The posterior mean approaches θ̂kM L = N as N → ∞.

5 Bayesian parameter estimation of a Gaussian mean

Consider the case where we have a real valued random variable X, where p(X) is Gaussian, i.e., x ∼
N (µ, σ 2 ). In our example below we will assume that the mean µ is unknown and the variance σ 2 is known
(for simplicity). Assume we have an observed data set D = {x1 , . . . , xN }, xi ∈ R, with the usual assump-
tion that the xi ’s are conditionally independent:
N
1 2
e− 2σ2 (xi −µ)
Y
L(µ) ∝
i=1

1 2
For the Gaussian likelihood, the conjugate prior for µ is also Gaussian: p(µ) = N (µ0 , s2 ) ∝ e− s2 (µ−µ0 ) .
Here µ0 is the mean of our prior for µ and s2 is the variance of our prior (so the smaller s2 is, the more
confidence we are that µ should be close to the prior mean µ0 ).
We can derive the posterior p(µ|D) as follows:

p(µ|D) ∝ p(D|µ)p(µ)
 
N
1 2 1 2
 e− 2σ2 (xi −µ)  e− 2s2 (µ−µ0 )
Y

i=1
  
N
X xi − µ 2 µ − µ0 2 
   
∝ exp −1/2  +

σ s

i=1

 !2
µ − µN
∝ exp −1/2 ,
σN

where µN = γ µ̂M L + (1 − γ)µ0 with µ̂M L = N1 N


P
i=1 xi being the maximum likelihood estimate. We see
that the posterior mean given the data D is a weighted combination of the maximum likelihood estimate and
the prior mean, where the weighting coefficient is
N s2
γ= . (10)
N s2 + σ 2
Notes on Bayesian Learning 8

So with no data, γ = 0, and we rely entirely on our prior. As N → ∞, γ → 1, and our estimate approaches
the maximum likelihood estimate. s2 measures the strength of our prior.

The posterior variance is defined as


2 s2 σ 2
σN = (11)
N s2 + σ 2
Or equivalently, in terms of precisions (the reciprocal of the variance), σ12 = σN2 + 1
s2
. So the precision of
N
our posterior is the sum of the precision of our data and the precision of our prior.

Additional Reading

The notes above cover only some of the basic introductory ideas in Bayesian learning. For additional reading
on Bayesian methods, both in machine learning and statistics, students may be interested in the following:

• Section 4.6 on Bayesian Statistics in Kevin Murphy’s book Probabilistic Machine Learning: An
Introduction, https://probml.github.io/pml-book/book1.html.

• David Barber’s book Bayesian Reasoning and Machine Learning: Sections 9.1 and 9.2 on Learning
as Inference) and Chapter 12 on Bayesian Model Selection: http://web4.cs.ucl.ac.uk/
staff/D.Barber/pmwiki/pmwiki.php?n=Brml.Online

• The “bible” of Bayesian Statistics, Bayesian Data Analysis, by Gelman et al: http://www.
stat.columbia.edu/˜gelman/book/

• Tutorial article on Bayesian aspects of neural networks and deep learning: https://arxiv.
org/pdf/2007.06823.pdf; and a more detailed review of priors in deep learning, https:
//onlinelibrary.wiley.com/doi/full/10.1111/insr.12502.

You might also like