Bayesian Machine Learning
Bayesian Machine Learning 23
Preliminaries 23
Challenge: Predicting a Coin Toss 23
The Bayesian Machine Learning Framework 23
(1) Model specification 23
(2) Parameter estimation 23
(3) Model Evaluation 24
(4) Prediction 25
We're Done! 26
Bayesian Evidence as a Model Performance Criterion 26
Bayesian Machine Learning and the Scientific Method Revisited 27
Now Solve the Example Problem: Predicting a Coin Toss 27
Coin toss example (1): Model Specification 28
Coin toss example (2): Parameter estimation 29
Coin toss example (3): Model Evaluation 29
Coin Toss Example (4): Prediction 30
Coin Toss Example: What did we learn? 30
Code Example: Bayesian evolution for the coin toss 30
From Posterior to Point-Estimate 32
Some Well-known Point-Estimates 33
Bayesian vs Maximum Likelihood Learning 33
Report Card on Maximum Likelihood Estimation 33
OPTIONAL SLIDES 34
The Kullback-Leibler Divergence 34
Factor Graphs 35
Preliminaries 35
Why Factor Graphs? 35
Factor Graph Construction Rules 36
Equality Nodes for Branching Points 36
Probabilistic Models as Factor Graphs 37
Inference by Closing Boxes 38
Sum-Product Algorithm 39
Sum-Product Messages for the Equality Node
Message Passing Schedules 40
Terminal Nodes and Processing Observations 41
Automating Bayesian Inference by Message Passing 41
Example: Bayesian Linear Regression by Message Passing 41
Final thoughts: Modularity and Abstraction 44
OPTIONAL SLIDES 44
Sum-Product Messages for Multiplication Nodes
Code example: Gaussian forward and backward messages for the Addition node 45
Code Example: forward and backward messages for the Matrix Multiplication node 45
Example: Sum-Product Algorithm to infer a posterior 45
Code for Sum-Product Algorithm to infer a posterior 46
Continuous Data and the Gaussian Distribution 47
Preliminaries 47
Example Problem 47
The Gaussian Distribution 48
Why the Gaussian? 49
Transformations and Sums of Gaussian Variables 49
Example: Gaussian Signals in a Linear System 49
Bayesian Inference for the Gaussian 50
(Multivariate) Gaussian Multiplication 51
Code Example: Product of Two Gaussian PDFs 51
Bayesian Inference with multiple Observations 52
Maximum Likelihood Estimation for the Gaussian 53
Conditioning and Marginalization of a Gaussian 53
Code Example: Joint, Marginal, and Conditional Gaussian Distributions 54
Example: Conditioning of Gaussian 55
Recursive Bayesian Estimation 55
Code Example: Kalman Filter 56
Product of Normally Distributed Variables
Code Example: Product of Gaussian Distributions 58
Solution to Example Problem 58
Summary 59
OPTIONAL SLIDES 59
Inference for the Precision Parameter of the Gaussian
Discrete Data and the Multinomial Distribution 60
Preliminaries 61
Discrete Data: the 1-of-K Coding Scheme 61
The Categorical Distribution 61
Bayesian Density Estimation for a Loaded Die 61
Categorical, Multinomial and Related Distributions 63
Maximum Likelihood Estimation for the Multinomial 63
Recap Maximum Likelihood Estimation for Gaussian and Multinomial Distributions 64
Regression 64
Preliminaries 64
Regression - Illustration 64
Regression vs Density Estimation 65
Bayesian Linear Regression 65
Example Predictive Distribution 68
Maximum Likelihood Estimation for Linear Regression Model 69
Least-Squares Regression 69
Not Identically Distributed Data 70
Code Example: Least Squares vs Weighted Least Squares 70
OPTIONAL SLIDES 71
Some Useful Matrix Calculus
Derivation Predictive Distribution
Generative Classification 72
Preliminaries 72
Challenge: an apple or a peach? 72
Generative Classification Problem Statement 73
1 - Model specification 73
Computing the log-likelihood 74
2 - Parameter Inference for Classification 74
3 - Application: Class prediction for new Data 75
Discrimination Boundaries 76
Code Example: Working out the "apple or peach" example problem
Recap Generative Classification 77
Discriminative Classification 77
Preliminaries 77
Challenge: difficult class-conditional data distributions 77
Main Idea of Discriminative Classification 78
Model Specification for Bayesian Logistic Regression 78
Some Notes on the Model 79
Parameter Inference
Application: the predictive distribution 80
The Laplace Approximation 80
Working out the Laplace Approximation 81
Bayesian Logistic Regression with the Laplace Approximation 82
Using the Laplace-approximated parameter posterior to evaluate the predictive distribution 82
ML Estimation for Discriminative Classification 83
Code Example: ML Estimation for Discriminative Classification 84
Recap Classification 85
OPTIONAL SLIDES 85
Proof of gradient and Hessian for Laplace Approximation of Posterior
Proof of Derivative of Log-likelihood for Logistic Regression
Latent Variable Models and Variational Bayes 87
Preliminaries 87
Illustrative Example : Density Modeling for the Old Faithful Data Set 87
Unobserved Classes 88
The Gaussian Mixture Model 88
The Marginal Distribution for the GMM 89
GMM is a very flexible model 89
Latent Variable Models 89
Inference for GMM is Difficult 89
Inference When Information is in the Form of Constraints 90
The Method of Maximum Relative Entropy 90
Relative Entropy, KL Divergence and Free Energy 91
Example KL divergence for Gaussians 92
The Free Energy Functional and Variational Bayes 92
FE Minimization as Approximate Bayesian Inference 93
Constrained FE Minimization 94
Example: FEM for the Gaussian Mixture Model (CAVI Approach) 94
Code Example: FEM for GMM on Old Faithful data set 96
Interesting Decompositions of the Free Energy Functional 98
Summary 99
OPTIONAL SLIDES
FE Minimization with Mean-field Factorization Constraints: the CAVI Approach 99
FE Minimization by the Expectation-Maximization (EM) Algorithm
Code Example: EM-algorithm for the GMM on the Old-Faithful data set 101
Message Passing for Free Energy Minimization 103
The Local Free Energy in a Factor Graph 103
Variational Message Passing 104
Dynamic Models 105
Preliminaries 105
Example Problem 105
Dynamical Models 105
State-space Models 106
Hidden Markov Models and Linear Dynamical Systems 106
Common Signal Processing Tasks as Message Passing-based Inference 107
Kalman Filtering
Multi-dimensional Kalman Filtering 109
Code Example: Kalman Filtering and the Cart Position Tracking Example Revisited 109
The Cart Tracking Problem Revisited: Inference by Message Passing 110
Recap Dynamical Models 111
OPTIONAL SLIDES 111
Proof of Kalman filtering equations including evidence updating
Extensions of Generative Gaussian Models 113
Intelligent Agents and Active Inference 113
Preliminaries 113
Agents 114
Illustrative Example: Steering a cart to a parking spot 114
Karl Friston and the Free Energy Principle 114
Execution of an AIF Agent 114
The Generative Model in an AIF agent 115
Specification of AIF Agent's model and Environmental Dynamics 116
State Updating in the AIF Agent 116
Policy Updating in an AIF Agent 117
Active Inference Analysis: exploitation-exploration dilemma 118
AIF Agents learn both the Problem and Solution 119
The Brain's Action-Perception Loop by FE Minimization 119
The Engineering Challenge: Synthetic AIF Agents 120
Factor Graph Approach to Modeling of an Active Inference Agent 121
How to minimize FE: Online Active Inference 121
The Cart Parking Problem Revisited 122
Video 123
Extensions and Comments 124
Final Thoughts 124
OPTIONAL SLIDES 124
In an AIF Agent, Actions fulfill Desired Expectations about the Future 125
Proof $q^*(u) = \arg\min_q H[q] \propto p(u)\exp(-G(u))$ 126
What Makes a Good Agent? The Good Regulator Theorem 126
5SSD0 Course Syllabus
Learning Goals
This course provides an introduction to Bayesian machine learning and information processing systems. The Bayesian approach affords a unified and
consistent treatment of many useful information processing systems.
Upon successful completion of the course, students should be able to:
understand the essence of the Bayesian approach to information processing.
specify a solution to an information processing problem as a Bayesian inference task on a probabilistic model.
design a probabilistic model by specifying a likelihood function and a prior distribution;
code the solution in a probabilistic programming package.
execute the Bayesian inference task either analytically or approximately.
evaluate the resulting solution by examination of Bayesian evidence.
be aware of the properties of commonly used probability distributions such as the Gaussian, Gamma and multinomial distribution; models such as
hidden Markov models and Gaussian mixture models; and inference methods such as the Laplace approximation, variational Bayes and message
passing in a factor graph.
Important Links
1. The course homepage http://bmlip.nl (or try https://biaslab.github.io/teaching/bmlip ) contains links to all materials such as lecture notes and video
lectures.
2. The Piazza course site will be used for Q&A and communication.
3. The Canvas course site will be sparingly used for communication (mostly by ESA staff)
Materials
Source materials are available at github repo at https://github.com/bertdv/BMLIP. You do not need to bother with this site. If you spot an error in the
materials, please raise the issue at Piazza.
Study Guide
The lecture and PP notes contain the mandatory materials. Some lecture notes are extended by a reading assignment, see the first cell in the lecture
notes. These reading assignments are also part of the mandatory materials.
Slides that are not required for the exam are moved to the end of the notes and preceded by an OPTIONAL SLIDES header.
During the scheduled classroom meetings, I will not teach all materials in the lecture notes.
Rather, I will first discuss a summary of the lecture notes and then be available for any additional questions that you may still have.
Each class also comes with a set of exercises. They are often a bit challenging and test more of your quantitative skills than you will need for the exam.
When doing exercises, feel free to make use of Sam Roweis' cheat sheets for Matrix identities and Gaussian identities, which are also accessible from the course homepage.
We also provide a Video Guide to each lecture. These Video Guides aim to cover the main points or (expected) sticky issues in a lecture. They are optional and were recorded in 2020.
Piazza (Q&A)
We will be using Piazza for Q&A and news. The system is highly catered to getting you help fast and efficiently from both classmates and the teaching
staff.
Sign up for Piazza today if you have not done so. And install the Piazza app on your phone!
The quicker you begin asking questions on Piazza (rather than via emails), the quicker you'll benefit from the collective knowledge of your classmates
and instructors. We encourage you to ask questions when you're struggling to understand a concept—you can even do so anonymously.
Unless it is a personal issue, pose your course-related questions at Piazza (in the right folder).
Piazza has a great search feature. Use search before putting in new questions.
Exam Guide
You are not allowed to use books nor bring printed or handwritten formula sheets to the exam. Difficult-to-remember formulas are supplied on the exam sheet.
The tested material consists of the lecture + PP notes (+ reading assignments as assigned in the first cell/slide of each lecture notebook).
The class homepage contains two representative practice exams from previous terms.
OPTIONAL SLIDES
Preliminaries
Goal
Top-level overview of machine learning
Materials
Mandatory
this notebook
Optional
pre-recorded video guide and live class (2020)
Study Bishop pp. 1-4
Machine Learning relates to building models from data and using these models in applications.
Problem: Suppose we want to develop an algorithm for a complex process about which we have little knowledge (so hand-programming is not
possible).
Solution: Get the computer to develop the algorithm by itself by showing it examples of the behavior that we want.
Practically, we choose a library of models, and write a program that picks a model and tunes it to fit the data.
This field is known in various scientific communities with slight variations under different names such as machine learning, statistical inference, system
identification, data mining, source coding, data compression, data science, etc.
Machine learning technology uses the scientific inquiry loop to develop models and use these models in applications.
Supervised Learning: Given examples of inputs and corresponding desired outputs, predict outputs on future inputs.
Examples: classification, regression, time series prediction
Unsupervised Learning: (a.k.a. density estimation). Given only inputs, automatically discover representations, features, structure, etc.
Examples: clustering, outlier detection, compression
Trial Design: (a.k.a. experimental design, active learning). Learn to make actions that optimize some performance criterion about the expected future.
Examples: playing games like chess, self-driving cars, robotics.
Two major approaches include reinforcement learning and active inference
Reinforcement Learning: Given an observed sequence of input signals and (occasionally observed) rewards for those inputs, learn to select
actions that maximize expected future rewards.
Active inference: Given an observed sequence of input signals and a prior probability distribution about future observations, learn to select
actions that minimize expected prediction errors (i.e., minimize actual minus predicted sensation).
Other stuff, like preference learning, learning to rank, etc., can often be (re-)formulated as special cases of either a supervised, unsupervised or trial
design problem.
Supervised Learning
Given observations of desired input-output behavior D = {(x1 , y1 ), … , (xN , yN )} (with xi inputs and yi outputs), the goal is to estimate the
conditional distribution p(y|x) (i.e., how does y depend on x?).
Classification
Regression
Same problem statement as classification but now the target variable is a real-valued vector.
Regression is sometimes called curve fitting.
Unsupervised Learning
Given data D = {x1 , … , xN }, model the (unconditional) probability distribution p(x) (a.k.a. density estimation). The two primary applications are
clustering and compression (a.k.a. dimensionality reduction).
Clustering
Group data into clusters such that all data points in a cluster have similar properties.
Clustering can be interpreted as ''unsupervised classification''.
Compression (a.k.a. dimensionality reduction)
The output of the coder is much smaller in size than the original, but if the coded signal is further processed by a decoder, then the result is very close (or exactly equal) to the original.
Usually, the compressed image comprises continuously valued variables. In that case, compression can be interpreted as ''unsupervised regression''.
Given the state of the world (obtained from sensory data), the agent must learn to produce actions (like making a movement or making a decision)
that optimize some performance criterion about the expected future.
In contrast to supervised and unsupervised learning, an agent is able to affect its data set by making actions, e.g., a robot can change its input video
data stream by turning the head of its camera.
In this course, we focus on the active inference approach to trial design, see the Intelligent Agent lesson for details.
Preliminaries
Goal
Review of probability theory as a theory for rational/logical reasoning with uncertainties (i.e., a Bayesian interpretation)
Materials
Mandatory
Data Analysis: A Bayesian Tutorial (the notes below contain an excerpt from this book)
In this lesson we introduce Probability Theory (PT) again. As we will see in the next lessons, PT is all you need to make sense of machine learning,
artificial intelligence, statistics, etc.
Problem: Given a disease with prevalence of 1% and a test procedure with sensitivity ('true positive' rate) of 95% and specificity ('true negative' rate) of
85% , what is the chance that somebody who tests positive actually has the disease?
Define an event (or "proposition") A as a statement, whose truth can be contemplated by a person, e.g.,
I
= 'All known life forms require water'
and a new piece of information
Richard T. Cox, 1946 developed a calculus for rational reasoning about how to represent and update the degree of beliefs about the truth value of
events when faced with new information.
In developing this calculus, only some very agreeable assumptions were made, e.g.,
(Transitivity). If the belief in A is greater than the belief in B, and the belief in B is greater than the belief in C, then the belief in A must be greater than the belief in C.
(Consistency). If the belief in an event can be inferred in two different ways, then the two ways must agree on the resulting belief.
This effort resulted in confirming that the sum and product rules of Probability Theory are the only proper rational way to process belief intensities.
⇒ Probability theory (PT) provides the theory of optimal processing of incomplete information (see Cox theorem, and Caticha, pp.7-24), and as
such provides the (only) proper quantitative framework for drawing conclusions from a finite (read: incomplete) data set.
Machine learning concerns drawing conclusions about model parameter settings from (a finite set of) data and therefore PT provides the optimal
calculus for machine learning.
In general, nearly all interesting questions in machine learning can be stated in the following form (a conditional probability):
p(whatever-we-want-to-know |
whatever-we-do-know)
where p(a|b) means the probability that a is true, given that b is true.
Examples
Predictions
p( future-observations |
past-observations )
Classify a received data point x
p( x-belongs-to-class-k | x )
Update a model based on a new observation
p( model-parameters |
new-observation,
past-observations )
The interpretation of a probability as a degree-of-belief about the truth value of an event is also called the Bayesian interpretation.
In the Bayesian interpretation, the probability is associated with a state-of-knowledge (usually held by a person).
For instance, in a coin tossing experiment, p(tail) = 0.4 should be interpreted as the belief that there is a 40% chance that tail comes up if
the coin were tossed.
Under the Bayesian interpretation, PT calculus (sum and product rules) extends boolean logic to rational reasoning with uncertainty.
The Bayesian interpretation contrasts with the frequentist interpretation of a probability as the relative frequency that an event would occur under
repeated execution of an experiment.
For instance, if the experiment is tossing a coin, then p(tail) = 0.4 means that in the limit of a large number of coin tosses, 40% of outcomes
turn up as tail.
The Bayesian viewpoint is more generally applicable than the frequentist viewpoint, e.g., it is hard to apply the frequentist viewpoint to events like '
it will rain tomorrow'.
The Bayesian viewpoint is clearly favored in the machine learning community. (In this class, we also strongly favor the Bayesian interpretation).
events
We write the denial of A, i.e. the event not-A, as Ā.
Given two events A and B , we write the conjunction "A ∧ B" as "A, B" or "AB". The conjunction AB is true only if both A and B are true.
We will write the disjunction "A ∨ B" as "A + B", which is true if either A or B is true or both A and B are true.
Note that, if X is a variable, then an assignment X = x (with x a value, e.g., X = 5) can be interpreted as an event.
probabilities
For any event A , with background knowledge I , the conditional probability of A given I , is written as
p(A|I) .
All probabilities are in principle conditional probabilities of the type p(A|I), since there is always some background knowledge.
We often write p(A) rather than p(A|I) if the background knowledge I is assumed to be obviously present. E.g., p(A) rather than
p( A | the-sun-comes-up-tomorrow ).
(In the context of random variable assignments) we often write p(x) rather than p(X = x), assuming that the reader understands the context.
In an apparent effort to further abuse notational conventions, p(X) denotes the full distribution over random variable X , i.e., the distribution for all
assignments for X .
If X is a discretely valued variable, then p(X = x) is a probability mass function (PMF) with 0 ≤ p(X = x) ≤ 1 and normalization
∑x p(x) = 1.
If X is continuously valued, then p(X = x) is a probability density function (PDF) with p(X = x) ≥ 0 and normalization ∫x p(x)dx = 1.
Note that if X is continuously valued, then the value of the PDF p(x) is not necessarily ≤ 1. E.g., a uniform distribution on the continuous domain [0, 0.5] has value p(x) = 2.
The following product and sum rules are also known as the axioms of probability theory, but as discussed above, under some mild assumptions, they can be derived as the unique rules for rational reasoning under uncertainty (Cox theorem, 1946, and Caticha, 2012, pp.7-26).
Sum rule. The disjunction of two events A and B given background I is given by
p(A + B|I) = p(A|I) + p(B|I) − p(A, B|I)
Product rule. The conjunction of two events A and B given background I is given by
p(A, B|I) = p(A|B, I) p(B|I) = p(B|A, I) p(A|I)
All legitimate probabilistic relations can be derived from the sum and product rules!
Two events A and B are said to be independent if the probability of one is not altered by information about the truth of the other, i.e.,
p(A|B) = p(A)
⇒ If A and B are independent, given I , then the product rule simplifies to
p(A, B|I) = p(A|I)p(B|I)
Two events A1 and A2 are said to be mutually exclusive if they cannot be true simultaneously, i.e., if p(A1, A2) = 0.
⇒ For mutually exclusive events, the sum rule simplifies to
p(A 1 + A 2 ) = p(A 1 ) + p(A 2 )
A set of events A1, A2, …, AN is said to be collectively exhaustive if one of the statements is necessarily true, i.e., A1 + A2 + ⋯ + AN = TRUE, or equivalently
p(A1 + A2 + ⋯ + AN) = 1
Note that, if A1, A2, …, AN are both mutually exclusive and collectively exhaustive (MECE) events, then
∑_{n=1}^{N} p(An) = p(A1 + … + AN) = 1
We mentioned that every inference problem in PT can be evaluated through the sum and product rules. Next, we present two useful corollaries: (1)
Marginalization and (2) Bayes rule
If X ∈ 𝒳 and Y ∈ 𝒴 are random variables over finite domains, then it follows from the above considerations about MECE events that
∑_{Y ∈ 𝒴} p(X, Y) = p(X).
Summing Y out of a joint distribution p(X, Y ) is called marginalization and the result p(X) is sometimes referred to as the marginal probability.
Note that this is just a generalized sum rule. In fact, Bishop (p.14) (and some other authors as well) calls this the sum rule.
If Y is continuously valued, then marginalization is carried out by integration: p(X) = ∫ p(X, Y) dY.
p(θ|D) = p(D|θ) p(θ) / p(D)
This formula is called Bayes rule (or Bayes theorem). While Bayes rule is always true, a particularly useful application occurs when D refers to an
observed data set and θ is set of model parameters. In that case,
the prior probability p(θ) represents our state-of-knowledge about proper values for θ, before seeing the data D.
the posterior probability p(θ|D) represents our state-of-knowledge about θ after we have seen the data.
⇒ Bayes rule tells us how to update our knowledge about model parameters when facing new data. Hence,
Some nomenclature associated with Bayes rule:
p(θ|D) = p(D|θ) × p(θ) / p(D)
where p(θ|D) is called the posterior, p(D|θ) the likelihood, p(θ) the prior, and p(D) the evidence.
Note that the evidence (a.k.a. marginal likelihood) can be computed from the numerator through marginalization, since
p(D) = ∫ p(D|θ) p(θ) dθ
Hence, having access to likelihood and prior is in principle sufficient to compute both the evidence and the posterior. To emphasize that point, Bayes rule is sometimes written as a transformation:
p(θ|D) ⋅ p(D) = p(D|θ) ⋅ p(θ)
For a given data set D, the posterior probabilities of the parameters scale relatively against each other as
p(θ|D) ∝ p(D|θ)p(θ)
⇒ All that we can learn from the observed data is contained in the likelihood function p(D|θ). This is called the likelihood principle.
Consider a distribution p(D|θ), where D relates to variables that are observed (i.e., a "data set") and θ are model parameters.
In general, p(D|θ) is just a function of the two variables D and θ. We distinguish two interpretations of this function, depending on which variable is
observed (or given by other means).
The sampling distribution p(D|θ = θ₀) (which is a function of D only) describes a probability distribution for data D, assuming that it is generated by the given model with parameters fixed at θ = θ₀.
In a machine learning context, often the data is observed and θ is the free variable. In that case, for given observations D = D₀, the likelihood function (which is a function only of the model parameters θ) is defined as
L(θ) ≜ p(D = D₀|θ)
Note that L(θ) is not a probability distribution for θ since in general ∑_θ L(θ) ≠ 1.
Code Example: Sampling Distribution and Likelihood Function for the Coin Toss
Consider the following simple model for the outcome (head or tail) y ∈ {0, 1} of a biased coin toss with parameter θ ∈ [0, 1]:
p(y|θ) ≜ θ^y (1 − θ)^{1−y}
We can plot both the sampling distribution p(y|θ = 0.8) and the likelihood function L(θ) = p(y = 1|θ).
In [1]: using PyPlot
#using Plots
p(y,θ) = θ.^y .* (1 .- θ).^(1 .- y)   # Bernoulli distribution
f = figure()
# Sampling distribution p(y|θ=0.8) for y ∈ {0,1}  (reconstructed lines; the stem plot layout is an assumption)
θ = 0.8
subplot(221); stem([0, 1], p([0, 1], θ));
title("Sampling distribution");
xlim([-0.5,1.5]); ylim([0,1]); xlabel("y"); ylabel("p(y|θ=$(θ))");
# Likelihood function L(θ) = p(y|θ) as a function of θ, for the observed value of y
subplot(222);
_θ = 0:0.01:1
y = 1.0 # Plot p(y=1 | θ)
plot(_θ, p(y,_θ))
title("Likelihood function");
xlabel("θ");
ylabel("L(θ) = p(y=$y|θ)");
The (discrete) sampling distribution is a valid probability distribution. However, the likelihood function L(θ) clearly isn't, since ∫₀¹ L(θ) dθ ≠ 1.
Probabilistic Inference
p( whatever-we-want-to-know |
whatever-we-already-know )
For example:
p( Mr.S.-killed-Mrs.S. | he-has-her-blood-on-his-shirt )
p( transmitted-codeword | received-codeword )
p(X|Z) = p(X, Z) / p(Z)                              (product rule, 'p')
       = ∑_Y p(X, Y, Z) / ∑_{X,Y} p(X, Y, Z)          (sum rule, 's')
where the 's' and 'p' annotations indicate whether the sum or product rule was used.
In the rest of this course, we'll encounter many long probabilistic derivations. For each manipulation, you should be able to associate an 's' (for sum
rule), a 'p' (for product or Bayes rule) or an 'm' (for a simplifying model assumption) above any equality sign.
Problem: Given a disease D with prevalence of 1% and a test procedure T with sensitivity ('true positive' rate) of 95% and specificity ('true negative'
rate) of 85%, what is the chance that somebody who tests positive actually has the disease?
Solution: The given data are p(D = 1) = 0.01, p(T = 1|D = 1) = 0.95 and p(T = 0|D = 0) = 0.85. Then according to Bayes rule,
p(D = 1|T = 1) = p(T = 1|D = 1) p(D = 1) / p(T = 1)                                               (product rule, 'p')
               = p(T = 1|D = 1) p(D = 1) / [p(T = 1|D = 1) p(D = 1) + p(T = 1|D = 0) p(D = 0)]     (sum rule, 's')
               = (0.95 × 0.01) / (0.95 × 0.01 + 0.15 × 0.99) = 0.0601
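As a quick sanity check, the same numbers can be plugged into a couple of lines of Julia (a minimal sketch, no packages required; the variable names are illustrative):

# Hypothetical check of the medical-test calculation with Bayes rule
prevalence  = 0.01   # p(D=1)
sensitivity = 0.95   # p(T=1|D=1)
specificity = 0.85   # p(T=0|D=0)
p_T1 = sensitivity * prevalence + (1 - specificity) * (1 - prevalence)   # sum rule: p(T=1)
p_D1_given_T1 = sensitivity * prevalence / p_T1                           # Bayes rule
println("p(D=1|T=1) = ", round(p_D1_given_T1, digits=4))                  # ≈ 0.0601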
Note that p(sick|positive test) = 0.06 while p(positive test|sick) = 0.95. This is a huge difference that is sometimes called the "medical
test paradox" or the base rate fallacy.
Many people have trouble distinguishing p(A|B) from p(B|A) in their heads. This has led to major negative consequences. For instance, unfounded
convictions in the legal arena and even lots of unfounded conclusions in the pursuit of scientific results. See Ioannidis (2005) and Clayton (2021).
Problem: A bag contains one ball, known to be either white or black. A white ball is put in, the bag is shaken, and a ball is drawn out, which proves to
be white. What is now the chance of drawing a white ball?
Solution: Again, use Bayes and marginalization to arrive at p(white|data) = 2/3, see the Exercises notebook.
⇒ Note that probabilities describe a person's state of knowledge rather than a 'property of nature'.
Problem: A dark bag contains five red balls and seven green ones. (a) What is the probability of drawing a red ball on the first draw? Balls are not
returned to the bag after each draw. (b) If you know that on the second draw the ball was a green one, what is now the probability of drawing a red ball
on the first draw?
⇒ Again, we conclude that conditional probabilities reflect implications for a state of knowledge rather than temporal causality.
The mean μx and covariance matrix Σx of a (vector) variable x are defined as
μx = E[x] ≜ ∫ x p(x) dx
Σx ≜ E[(x − μx)(x − μx)ᵀ]
and the covariance between x and y as Σxy ≜ E[(x − μx)(y − μy)ᵀ] = E[x yᵀ] − μx μyᵀ.
Clearly, if x and y are independent, then Σxy = 0, since E[x yᵀ] = E[x] E[yᵀ] = μx μyᵀ.
Linear Transformations
Consider an arbitrary distribution p(X) with mean μx and variance Σx, and the linear transformation
Z = AX + b.
No matter the specification of p(X), we can derive that (see Exercises notebook)
μz = A μx + b        (SRG-3a)
Σz = A Σx Aᵀ         (SRG-3b)
(The tag (SRG-3a) refers to the corresponding equation number in Sam Roweis' Gaussian identities notes.)
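A quick Monte Carlo sanity check of SRG-3a and SRG-3b (a sketch; the non-Gaussian input distribution and the matrices A and b below are arbitrary illustrative choices, not taken from the notes):

using Statistics, LinearAlgebra
xs = [rand(2).^2 for _ in 1:100_000]             # samples from some (non-Gaussian) distribution p(X)
A = [1.0 2.0; 0.0 -1.0];  b = [0.5, 1.0]
zs = [A*x + b for x in xs]                        # samples of Z = AX + b
Xm = reduce(hcat, xs)';  Zm = reduce(hcat, zs)'   # sample matrices (rows = samples)
μx = vec(mean(Xm, dims=1));  μz = vec(mean(Zm, dims=1))
println(norm(μz - (A*μx + b)))                    # ≈ 0   (SRG-3a)
println(norm(cov(Zm) - A*cov(Xm)*A'))             # ≈ 0   (SRG-3b)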
Given eqs SRG-3a and SRG-3b (previous cell), you should now be able to derive the following: for any distribution of variables X and Y and sum Z = X + Y (proof by Exercise)
μz = μx + μy
Σz = Σx + Σy + 2Σxy
In particular, if X and Y are independent (so that Σxy = 0), this simplifies to
Σz = Σx + Σy
More generally, given two independent variables X and Y with PDFs px(x) and py(y), the PDF pz(z) for Z = X + Y is given by the convolution
pz(z) = ∫_{−∞}^{∞} px(x) py(z − x) dx
https://en.wikipedia.org/wiki/List_of_convolutions_of_probability_distributions shows how these convolutions work out for a few common probability
distributions.
In linear stochastic systems theory, the Fourier Transform of a PDF (i.e., the characteristic function) plays an important computational role. Why?
Consider the PDF of the sum of two independent Gaussian distributed variables X and Y:
pX(x) = 𝒩(x | μX, σX²)
pY(y) = 𝒩(y | μY, σY²)
Let Z = X + Y. Performing the convolution (nice exercise) yields a Gaussian PDF for Z:
pZ(z) = 𝒩(z | μX + μY, σX² + σY²).
In [2]: using PyPlot, Distributions
μx = 2.
σx = 1.
μy = 2.
σy = 0.5
μz = μx+μy; σz = sqrt(σx^2 + σy^2)
x = Normal(μx, σx)
y = Normal(μy, σy)
z = Normal(μz, σz)
range_min = minimum([μx-2*σx, μy-2*σy, μz-2*σz])
range_max = maximum([μx+2*σx, μy+2*σy, μz+2*σz])
range_grid = range(range_min, stop=range_max, length=100)
plot(range_grid, pdf.(x,range_grid), "k-")
plot(range_grid, pdf.(y,range_grid), "b-")
plot(range_grid, pdf.(z,range_grid), "r-")
legend([L"p_X", L"p_Y", L"p_Z"])
grid()
PDF for the Product of Two Variables
For two continuous random independent variables X and Y, with PDFs px(x) and py(y), the PDF of Z = XY is given by
pz(z) = ∫_{−∞}^{∞} (1/|x|) px(x) py(z/x) dx
For proof, see https://en.wikipedia.org/wiki/Product_distribution
Generally, this integral does not lead to an analytical expression for pz (z). For example, the product of two independent variables that are both
normally (Gaussian) distributed does not lead to a normal distribution.
Exception: the distribution of the product of two variables that both have log-normal distributions is again a lognormal distribution.
(If X has a normal distribution, then Y = exp(X) has a log-normal distribution.)
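As a quick numerical illustration (a sketch; the parameter values below are arbitrary choices), sampling shows that the product of two independent Gaussian variables is not Gaussian, whereas the product of two log-normal variables behaves like a log-normal variable:

using Distributions, Statistics
x = rand(Normal(2.0, 1.0), 100_000)
y = rand(Normal(1.0, 0.5), 100_000)
println("skewness of X*Y: ", skewness(x .* y))     # clearly ≠ 0, so the product is not Gaussian
u = rand(LogNormal(0.0, 0.5), 100_000)
v = rand(LogNormal(1.0, 0.3), 100_000)
w = log.(u .* v)                                   # if U*V is log-normal, log(U*V) must be Gaussian
println("skewness of log(U*V): ", skewness(w))     # ≈ 0, consistent with a log-normal product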
Variable Transformations
Suppose x is a discrete random variable with probability mass function Px(x), and y = h(x) is a one-to-one function with inverse x = g(y) = h⁻¹(y). Then
Py(y) = Px(g(y)).
Proof: Py(ŷ) = P(y = ŷ) = P(h(x) = ŷ) = P(x = g(ŷ)) = Px(g(ŷ)). □
If x is defined on a continuous domain, and px (x) is a probability density function, then probability mass is represented by the area under a (density)
curve. Let a = g(c) and b = g(d). Then
P(a ≤ x ≤ b) = ∫_a^b px(x) dx
             = ∫_{g(c)}^{g(d)} px(x) dx
             = ∫_c^d px(g(y)) dg(y)
             = ∫_c^d px(g(y)) g′(y) dy
             = P(c ≤ y ≤ d)
Equating the two probability masses (the integrand of the last integral is py(y)) leads to identification of the relation
py(y) = px(g(y)) g′(y)
If the transformation y = h(x) is not invertible, then x = g(y) does not exist. In that case, you can still work out the transformation by equating equivalent probability masses in the two domains.
Example: Let px(x) = 𝒩(x | μ, σ²) and y = h(x) = (x − μ)/σ. What is py(y)?
Solution: Note that h(x) is invertible with x = g(y) = σy + μ. The change-of-variable formula leads to
py(y) = px(g(y)) g′(y) = 𝒩(σy + μ | μ, σ²) ⋅ σ = 𝒩(y | 0, 1).
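A quick sampling check of this change-of-variables result (a sketch; the values μ = 2 and σ = 3 are arbitrary choices):

using Distributions, Statistics
μ, σ = 2.0, 3.0
x = rand(Normal(μ, σ), 100_000)    # samples from px(x) = 𝒩(x|μ, σ²)
y = (x .- μ) ./ σ                   # transformed samples y = h(x)
println(mean(y), "  ", std(y))      # ≈ 0 and ≈ 1, consistent with py(y) = 𝒩(y|0, 1)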
Summary
Probabilities should be interpreted as degrees of belief, i.e., a state-of-knowledge, rather than a property of nature.
We can do everything with only the sum rule and the product rule. In practice, Bayes rule and marginalization are often very useful for inference,
i.e., for computing
p( what-we-want-to-know |
what-we-already-know ) .
Bayes rule
p(θ|D) = p(D|θ) p(θ) / p(D)
is the fundamental rule for learning from data!
For a variable X with distribution p(X), mean μx and variance Σx, the mean and variance of the linear transformation Z = AX + b are given by
μz = A μx + b        (SRG-3a)
Σz = A Σx Aᵀ         (SRG-3b)
Preliminaries
Goals
Introduction to Bayesian (i.e., probabilistic) modeling
Materials
Mandatory
These lecture notes
Optional
Bishop pp. 68-74 (on the coin toss example)
Ariel Caticha - 2012 - Entropic Inference and the Foundations of Physics, pp.35-44 (section 2.9, on deriving Bayes rule for updating
probabilities)
Problem: We observe the following sequence of heads (h) and tails (t) when tossing the same coin repeatedly
D = {hthhtth} .
What is the probability that heads comes up next?
Suppose that your application is to predict a future observation x, based on N past observations D = {x1 , … , xN }.
The Bayesian design approach to solving this task involves four stages:
REPEAT
  1- Model specification
  2- Parameter estimation
  3- Model evaluation
UNTIL model performance is satisfactory
4- Apply model
In principle, based on the model evaluation results, you may want to re-specify your model and repeat the design process (a few times), until model
performance is acceptable.
Your first task is to propose a probabilistic model (m) for generating the observations x.
A probabilistic model m consists of a joint distribution p(x, θ|m) that relates observations x to model parameters θ. Usually, the model is proposed
in the form of a data generating distribution p(x|θ, m) and a prior p(θ|m).
You are responsible to choose the data generating distribution p(x|θ) based on your physical understanding of the data generating process. (For
brevity, since we are working on one given model m, we drop the given dependency on m from the notation).
You must also choose the prior p(θ) to reflect what you know about the parameter values before you see the data D.
Note that, for a given data set D = {x1, x2, …, xN} with independent observations xn, the likelihood factorizes as
p(D|θ) = ∏_{n=1}^{N} p(xn|θ),
so usually you select a model for generating one observation xn and then use (in-)dependence assumptions to combine these models into a likelihood
function for the model parameters.
The likelihood and prior both contain information about the model parameters. Next, you use Bayes rule to fuse these two information sources into a
posterior distribution for the parameters:
p(θ|D) = p(D|θ) p(θ) / p(D) ∝ p(D|θ) ⋅ p(θ)
(posterior ∝ likelihood × prior)
Note that there's no need for you to design some clever parameter estimation algorithm. Bayes rule is the parameter estimation algorithm. The
only complexity lies in the computational issues!
This "recipe" works only if the right-hand side (RHS) factors can be evaluated; the computational details can be quite challenging and this is what
machine learning is about.
In the framework above, parameter estimation was executed by "perfect" Bayesian reasoning. So is everything settled now?
No, there appears to be one remaining problem: how good really were our model assumptions p(x|θ) and p(θ)? We want to "score" the model
performance.
Note that this question is only interesting in practice if we have alternative models to choose from. After all, if you don't have an alternative model, any
value for the model evidence would still not lead you to switch to another model.
Let's assume that we have a set of candidate models, say M = {m1, …, mK}, where each model relates to a specific prior p(θ|mk) and likelihood p(D|θ, mk). Can we evaluate the relative performance of a model against another model from the set?
Start again with model specification. You must now specify a prior p(mk ) (next to the likelihood p(D|θ, mk ) and prior p(θ|mk )) for each of the
models and then solve the desired inference problem:
p(mk|D) = p(D|mk) p(mk) / p(D) ∝ p(mk) ⋅ p(D|mk)
(model posterior ∝ model prior × model evidence)
Note that, to evaluate the posterior for a model, you must calculate the "evidence", which can be interpreted as a likelihood function for the model.
You can now compare posterior distributions p(mk |D) for a set of models {mk } and decide on the merits of each model relative to alternative
models. This procedure is called Bayesian model comparison.
⇒ In a Bayesian framework, model estimation follows the same recipe as parameter estimation; it just works at one higher hierarchical level. Compare the required calculations:
Again, no need to invent a special algorithm for estimating the performance of your model. Straightforward application of probability theory
takes care of all that.
In principle, you could proceed with asking how good your choice for the candidate model set M was. You would have to provide a set of alternative
model sets {M 1 , M 2 , … , M M } with priors p(M m ) for each set and compute posteriors p(M m |D). And so forth ...
With the (relative) performance evaluation scores of your model in hand, you could now re-specify your model (hopefully an improved model) and
repeat the design process until the model performance score is acceptable.
As an aside, in the (statistics and machine learning) literature, performance comparison between two models is often reported by the Bayes Factor,
which is defined as the ratio of model evidences:
Bayes Factor ≜ p(D|m1) / p(D|m2)
  = [p(D, m1)/p(m1)] / [p(D, m2)/p(m2)]
  = [p(D, m1)/p(m1)] ⋅ [p(m2)/p(D, m2)]
  = [p(m1|D) p(D)/p(m1)] ⋅ [p(m2)/(p(m2|D) p(D))]
  = [p(m1|D)/p(m2|D)] ⋅ [p(m2)/p(m1)]
       (posterior ratio)     (prior ratio)
Hence, for equal model priors (p(m1 ) = p(m2 ) = 0.5), the Bayes Factor reports the posterior probability ratio for the two models.
In principle, any decision about which is the better model involves some ad hockery, but Jeffreys (1961) advises the following interpretation of the log-Bayes factor log₁₀ B₁₂ = log₁₀ [p(D|m1)/p(D|m2)]:

log₁₀ B₁₂      evidence for m1
0 to 0.5       not worth more than a bare mention
0.5 to 1       substantial
1 to 2         strong
> 2            decisive
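For instance, a couple of lines of Julia (the log-evidence values below are made-up numbers, purely for illustration) turn two model evidences into a log-Bayes factor and the corresponding Jeffreys verdict:

logev_m1, logev_m2 = -3.2, -5.9          # assumed log10-evidences of two models (hypothetical values)
log10_B12 = logev_m1 - logev_m2          # log10 Bayes factor for m1 vs m2
verdict = log10_B12 > 2 ? "decisive" : log10_B12 > 1 ? "strong" :
          log10_B12 > 0.5 ? "substantial" : "not worth more than a bare mention"
println("log10 B12 = $(log10_B12) → evidence for m1 is $(verdict)")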
(4) Prediction
Once we are satisfied with the evidence for a (trained) model, we can apply the model to our prediction/classification/etc task.
Given the data D, our knowledge about the yet unobserved datum x is captured by (everything is conditioned on the selected model)
p(x|D) = ∫ p(x, θ|D) dθ                  (sum rule, 's')
       = ∫ p(x|θ, D) p(θ|D) dθ           (product rule, 'p')
       = ∫ p(x|θ) ⋅ p(θ|D) dθ            (model assumption, 'm')
where p(x|θ) is the data generating distribution and p(θ|D) the parameter posterior.
In the last equation, the simplification p(x|θ, D) = p(x|θ) follows from our model specification. We assumed a parametric data generating distribution p(x|θ) with no explicit dependency on the data set D.
Again, no need to invent a special prediction algorithm. Probability theory takes care of all that. The complexity of prediction is just computational:
how to carry out the marginalization over θ.
Note that the application of the learned posterior p(θ|D) does not necessarily have to be a prediction task. We use it here as an example, but other applications (e.g., classification, regression, etc.) are of course also possible.
Alternatively, if you do need to work with one model (e.g. due to computational resource constraints), you can for instance select the model with largest
posterior p(mk |D) and use that model for prediction. This is called Bayesian model selection.
Bayesian model averaging is the principal way to apply PT to machine learning. You don't throw away information by discarding lesser performant
models, but rather use PT (marginalization of models) to compute
p(what-I-am-interested-in |
all available information)
exactly.
We're Done!
In principle, you now have the recipe in your hands to solve all your prediction/classification/regression etc. problems by the same method:
1. specify a model
2. train the model (by PT)
3. evaluate the model (by PT); if not satisfied, goto 1
4. apply the model (by PT)
Crucially, there is no need to invent clever machine learning algorithms, and there is no need to invent a clever prediction algorithm nor a need to
invent a model performance criterion. Instead, you propose a model and, from there on, you let PT reason about everything that you care about.
Your problems are only of a computational nature. Perhaps the integral to compute the evidence may not be analytically tractable, etc.
I'd like to convince you that Bayesian model evidence is an excellent criterion for assessing your model's performance. To do so, let us consider a
decomposition that relates model evidence to other highly-valued criteria such as accuracy and model complexity.
Given the data set D, the log-evidence for model m decomposes as follows (please check the derivation):
log p(D|m) = log p(D|m) ⋅ ∫ p(θ|D, m) dθ                                    (the integral evaluates to 1)
           = ∫ p(θ|D, m) log p(D|m) dθ
           = ∫ p(θ|D, m) log [ p(D|θ, m) p(θ|m) / p(θ|D, m) ] dθ            (Bayes rule)
           = ∫ p(θ|D, m) log p(D|θ, m) dθ  −  ∫ p(θ|D, m) log [ p(θ|D, m) / p(θ|m) ] dθ
              (data fit / accuracy)             (complexity)
The first term (data fit, also known as accuracy) measures how well the model predicts the data set D (as measured by log p(D|θ, m)), after having
learned from the data (because we marginalize θ with the posterior p(θ|D, m)). We want this term to be large.
The second term (complexity) quantifies the amount of information that the model absorbed through learning, i.e., by moving parameter beliefs from
p(θ|m) to p(θ|D, m). Technically, this term is the Kullback-Leibler divergence between posterior and prior. The KL divergence is always ≥ 0 and only
equals 0 if the posterior is equal to the prior, see OPTIONAL SLIDE. We want this term to be small.
The complexity term regularizes the Bayesian learning process automatically. If you prefer models with high Bayesian evidence, then you prefer models
that get a good data fit without need to learn much from the data set. These types of models are said to generalize well, since they can be applied to
different data sets without specific adaptations for each data set.
Focussing only on accuracy maximization could lead to overfitting. Focussing only on complexity minimization could lead to underfitting of the model.
Bayesian ML attends to both terms and avoids both underfitting and overfitting.
⇒ Bayesian learning automatically leads to models that generalize well. There is no need for early stopping or validation data sets. Just learn on
the full data set and all behaves well.
The Bayesian design process provides a unified framework for the Scientific Inquiry method. We can now add equations to the design loop. (Trial design
to be discussed in Intelligent Agent lesson.)
We observe the following sequence of heads (h) and tails (t) when tossing the same coin repeatedly:
D = {hthhtth} .
What is the probability that heads comes up next? We solve this in the next slides ...
x_k = h if heads comes up, and x_k = t if tails comes up.
Likelihood
Assume a Bernoulli distributed variable p(xk = h|μ) = μ for a single coin toss, leading to
p(D|μ) = ∏_{k=1}^{N} p(xk|μ) = μⁿ (1 − μ)^{N−n}
where n is the number of heads (h) in D and N the total number of tosses.
Prior
p(μ) = Beta(μ | α, β) = [Γ(α + β)/(Γ(α) Γ(β))] μ^{α−1} (1 − μ)^{β−1}
where the Gamma function is sort of a generalized factorial function. In particular, if α, β are integers, then
Γ(α + β)/(Γ(α) Γ(β)) = (α + β − 1)! / [(α − 1)! (β − 1)!]
A what distribution? Yes, the beta distribution is a conjugate prior for the binomial distribution, which means that the posterior for μ will again be a beta distribution (with updated parameters).
α and β are called hyperparameters, since they parameterize the distribution for another parameter (μ). E.g., α = β = 1 gives a uniform prior.
(Bishop Fig.2.2). Plots of the beta distribution Beta(μ|a, b) as a function of μ for various values of the hyperparameters a and b.
Multiplying the likelihood and the prior and rearranging terms yields
p(D|μ) p(μ) = [ Γ(α+β) Γ(n+α) Γ(N−n+β) / (Γ(α) Γ(β) Γ(N+α+β)) ] ⋅ [ Γ(N+α+β)/(Γ(n+α) Γ(N−n+β)) ⋅ μ^{n+α−1} (1 − μ)^{N−n+β−1} ]
                    (evidence p(D|m))                                   (posterior p(μ|D) = Beta(μ|n+α, N−n+β))
so the parameter posterior is again a beta distribution:
p(μ|D) = Beta(μ | n + α, N − n + β)
It follows from the above calculation (the first factor) that the evidence for model m can be analytically expressed as
p(D|m) = Γ(α+β) Γ(n+α) Γ(N−n+β) / (Γ(α) Γ(β) Γ(N+α+β))
Coin Toss Example (4): Prediction
Once we have accepted a model, let's apply it to the application, in this case, predicting future observations.
Marginalize over the parameter posterior to get the predictive PDF for a new coin toss x∙ , given the data D,
p(x∙ = h|D) = ∫₀¹ p(x∙ = h|μ) p(μ|D) dμ
            = ∫₀¹ μ × Beta(μ | n + α, N − n + β) dμ
            = (n + α) / (N + α + β)
This result is known as Laplace's rule of succession.
The above integral computes the mean of a beta distribution, which is given by E[x] = a/(a + b) for x ∼ Beta(a, b); see Wikipedia.
Finally, we're ready to solve our example problem: for D = {hthhtth} and uniform prior (α = β = 1), we get
p(x∙ = h|D) = (n + 1)/(N + 2) = (4 + 1)/(7 + 2) = 5/9
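A few lines of Julia (a minimal sketch; it simply re-does the arithmetic above with the Distributions package) confirm the posterior and the predictive probability:

using Distributions
D = ['h', 't', 'h', 'h', 't', 't', 'h']          # observed coin tosses
n, N = count(==('h'), D), length(D)               # number of heads and total number of tosses
α, β = 1, 1                                       # uniform Beta prior
posterior = Beta(n + α, N - n + β)                # p(μ|D) = Beta(μ|5, 4)
println("posterior: ", posterior)
println("p(x∙ = h | D) = ", mean(posterior))      # (n+α)/(N+α+β) = 5/9 ≈ 0.556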
What did we learn from the data? Before seeing any data, we think that
p(x∙ = h) = p(x∙ = h|D)|_{n=N=0} = α/(α + β).
Hence, α and β are prior pseudo-counts for heads and tails respectively.
After the N coin tosses, we think that p(x∙ = h|D) = (n + α)/(N + α + β).
Note that the posterior prediction can be rewritten as
p(x∙ = h|D) = α/(α + β) + N/(N + α + β) ⋅ (n/N − α/(α + β)),
and since the gain 0 ≤ N/(N + α + β) < 1, the Bayesian prediction lies between (fuses) the prior prediction α/(α + β) and the data-based prediction n/N. The data plays the role of "correcting" the prior prediction.
For large N, the gain goes to 1 and p(x∙ = h|D)|_{N→∞} → n/N, i.e., the prediction goes to the data-based prediction (the observed relative frequency).
Next, we code an example for a sequence of coin tosses, where we assume that the true coin generates data xn ∈ {0, 1} by a Bernoulli distribution:
p(xn | μ = 0.4) = 0.4^{xn} ⋅ 0.6^{1−xn}
So, this coin is biased!
In order to predict the outcomes of future coin tosses, we'll compare three models.
All models have the same (Bernoulli) data generating distribution, but they use different priors:
p(μ|m1 ) = Beta(μ|α = 1, β = 5)
p(μ|m2 ) = Beta(μ|α = 5, β = 1)
p(μ|m3 ) = Beta(μ|α = 8, β = 13)
For each model, we will report as a function of the total number of coin tosses, the posteriors
p(μ|D, m∙ ) ,
and the log-evidence in decibels
log10 p(D|m∙ ) .
In [1]: using Distributions, StatsPlots, SpecialFunctions
priors = [Beta(1., 5.), Beta(5., 1.), Beta(8, 13)] # specify prior distributions (you can add additional beta distributions)
log_e_priors = priors # save the prior distributions to compute the log-evidence later
posterior_distributions = [[d] for d in priors] # save a sequence of posterior distributions for every prior, starting with the prior itself
evidences = [[] for _ in priors] # maintain a vector of log-evidences to plot later
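The cell that simulates the coin tosses and updates each model's posterior and log-evidence is not reproduced above. A minimal sketch of such an update loop (the variable names samples and anim are assumptions taken from the code below, and the plotting details are illustrative) could look as follows:

using Random; Random.seed!(42)
samples = rand(Bernoulli(0.4), 100)                       # simulate tosses from the true (biased) coin
anim = @animate for N in 1:length(samples)
    n = sum(samples[1:N])                                 # number of heads after N tosses
    plt = plot(title="Posteriors after $N tosses", xlabel="μ", legend=:topleft)
    for (k, prior) in enumerate(priors)
        α, β = params(prior)
        push!(posterior_distributions[k], Beta(α + n, β + N - n))                    # conjugate update
        push!(evidences[k], (logbeta(α + n, β + N - n) - logbeta(α, β)) / log(10))   # log10 p(D|mk)
        plot!(plt, posterior_distributions[k][end], label="Model $k")
    end
    plt
end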
gif(anim, "anim_bay_ct.gif")
This is an animation. If you are viewing these notes in PDF format you can see the animation at
https://nbviewer.org/github/bertdv/BMLIP/blob/master/lessons/notebooks/Bayesian-Machine-Learning.ipynb
⇒ With more data, the relevance of the prior diminishes!
In [3]: #visualize the first 30 log-evidences of models over time
range = collect(1:length(samples))
labels = reshape([string("Model ", i) for i in 1:length(evidences)], (1, length(evidences)))
plot(range, evidences, label=labels, title="Log evidences", xlims=(0,30), ylims=(-10, 0), xlabel="# Datapoints seen")
Out[3]:
Over time, the log-evidences of our models converge to the same value. Can you explain this behaviour?
In the example above, Bayesian parameter estimation and prediction were tractable in closed-form. This is often not the case. We will need to
approximate some of the computations.
In the full Bayesian approach, prediction is performed by marginalizing over the parameter posterior,
p(x|D) = ∫ p(x|θ) p(θ|D) dθ.
If this integral is not tractable, the posterior p(θ|D) is sometimes summarized by a single point estimate θ̂, leading to the approximation p(x|D) ≈ p(x|θ̂). This is just the data generating distribution p(x|θ) evaluated at θ = θ̂, which is easy to evaluate.
A well-known point estimate is the Bayes estimate, which is the mean of the posterior:
θ̂_bayes = ∫ θ p(θ|D) dθ
Bayesian vs Maximum Likelihood Learning
1. Model Specification
   Bayesian: Choose a model m with data generating distribution p(x|θ, m) and parameter prior p(θ|m).
   Maximum Likelihood: Choose a model m with the same data generating distribution p(x|θ, m). No need for priors.
2. Learning
   Bayesian: Use Bayes rule to find the parameter posterior, p(θ|D) ∝ p(D|θ) p(θ).
   Maximum Likelihood: By Maximum Likelihood (ML) optimization, θ̂ = arg max_θ p(D|θ).
3. Prediction
   Bayesian: p(x|D) = ∫ p(x|θ) p(θ|D) dθ
   Maximum Likelihood: p(x|D) = p(x|θ̂)
Maximum Likelihood (ML) is MAP with a uniform prior. MAP is sometimes called a 'penalized' ML procedure:
θ̂_MAP = arg max_θ { log p(D|θ) + log p(θ) }
(good!). ML works rather well if we have a lot of data because the influence of the prior diminishes with more data.
(good!). Computationally often do-able. A useful fact that makes the optimization easier (since the logarithm is monotonically increasing):
arg max_θ p(D|θ) = arg max_θ log p(D|θ)
(bad). Cannot be used for model comparison! When doing ML estimation, the Bayesian model evidence always evaluates to zero because the prior probability mass under the likelihood function goes to zero. Therefore, when doing ML estimation, Bayesian model evidence cannot be used to evaluate model performance.
OPTIONAL SLIDES
The Kullback-Leibler Divergence (a.k.a. relative entropy) between two distributions q and p is defined as
D_KL[q, p] ≡ ∑_z q(z) log [ q(z)/p(z) ]
As an aside, note that DKL [q, p] ≠ DKL [p, q]. Both divergences are relevant.
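A tiny numerical illustration of this asymmetry (the two discrete distributions below are arbitrary choices, plain Julia with no packages):

q = [0.1, 0.4, 0.5]
p = [1/3, 1/3, 1/3]
kl(q, p) = sum(q .* log.(q ./ p))                            # D_KL[q, p]
println("D_KL[q,p] = ", kl(q, p), ",  D_KL[p,q] = ", kl(p, q))   # note: the two values differ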
Here is an animation that shows the KL divergence between two Gaussian distributions:
In [4]: using Distributions, StatsPlots, Plots.PlotMeasures, LaTeXStrings
μ_p = 0 # statistics of the distribution we keep constant (we vary the mean of q); feel free to change these
σ_p = 1
σ_q = 1
p = Normal(μ_p, σ_p)
# helper (not part of the original cell): closed-form KL divergence D_KL[q,p] between two univariate Gaussians
kl_gauss(μ_q) = log(σ_p/σ_q) + (σ_q^2 + (μ_q - μ_p)^2)/(2*σ_p^2) - 0.5
μ_seq = range(μ_p - 6, μ_p + 6, length=60) # sweep of q's mean (reconstructed)
kl = kl_gauss.(μ_seq)
anim = @animate for μ_q in μ_seq
    q = Normal(μ_q, σ_q)
    plot(p, xlims = (μ_p - 8, μ_p + 8), fill=(0, .2,), label=string("P"), linewidth=2, ylims=(0,0.5))
    plot!(q, fill=(0, .2,), label=string("Q"), linewidth=2, ylims=(0,0.5))
    plot!(twinx(), μ_seq, kl, xticks=:none, ylims=(0, maximum(kl) + 3), linewidth = 3,
        legend=:topright, xlims = (μ_p - 8, μ_p + 8), color="green", label=L"D_{KL}(Q || P)")
end
gif(anim, "anim_lat_kl.gif")
Note that this is an animation. If you are viewing these notes in PDF format you can see the animation at
https://nbviewer.org/github/bertdv/BMLIP/blob/master/lessons/notebooks/Bayesian-Machine-Learning.ipynb
Factor Graphs
Preliminaries
Goal
Introduction to Forney-style factor graphs and message passing-based inference
Materials
Mandatory
These lecture notes
Loeliger (2007), The factor graph approach to model based signal processing, pp. 1295-1302 (until section V)
Optional
Frederico Wadehn (2015), Probabilistic graphical models: Factor graphs and more video lecture (highly recommended)
References
Forney (2001), Codes on graphs: normal realizations
A probabilistic inference task gets its computational load mainly through the need for marginalization (i.e., computing integrals). E.g., for a model
p(x1 , x2 , x3 , x4 , x5 ), the inference task p(x2 |x3 ) is given by
p(x2|x3) = p(x2, x3) / p(x3)
         = ∫⋯∫ p(x1, x2, x3, x4, x5) dx1 dx4 dx5 / ∫⋯∫ p(x1, x2, x3, x4, x5) dx1 dx2 dx4 dx5
Since these computations (integrals or sums) suffer from the "curse of dimensionality", we often need to solve a simpler problem in order to get an
answer.
Factor graphs provide a computationally efficient approach to solving inference problems if the probabilistic model can be factorized.
Factorization helps. For instance, if
p(x1, x2, x3, x4, x5) = p(x1) p(x2, x3) p(x4) p(x5|x4),
then
p(x2|x3) = ∫⋯∫ p(x1) p(x2, x3) p(x4) p(x5|x4) dx1 dx4 dx5 / ∫⋯∫ p(x1) p(x2, x3) p(x4) p(x5|x4) dx1 dx2 dx4 dx5
         = p(x2, x3) / ∫ p(x2, x3) dx2
which is computationally much cheaper than the general case above.
In this lesson, we discuss how computationally efficient inference in factorized probability distributions can be automated by message passing-based
inference in factor graphs.
Consider a function
f(x1, x2, x3, x4, x5) = fa(x1, x2, x3) ⋅ fb(x3, x4, x5) ⋅ fc(x4)
The factorization of this function can be graphically represented by a Forney-style Factor Graph (FFG):
An FFG is an undirected graph subject to the following construction rules (Forney, 2001): (1) a node for every factor, (2) an edge (or half-edge) for every variable, and (3) a node is connected to an edge if and only if the corresponding variable appears in the corresponding factor.
Note that a variable can appear in maximally two factors in an FFG (since an edge has only two end points).
When a variable (say x2) nevertheless appears in more than two factors, we need a work-around. For the factor graph representation, we will instead consider the function g, obtained from f by replacing the extra occurrences of x2 with auxiliary variables x2′ and x2′′ and adding an equality factor
f=(x2, x2′, x2′′) ≜ δ(x2 − x2′) δ(x2 − x2′′)
Note that, through the introduction of the auxiliary variables X2′ and X2′′ and the factor f=(x2, x2′, x2′′), each variable in g appears in maximally two factors.
The constraint f=(x, x′, x′′) enforces that X = X′ = X′′ for every valid configuration.
Since integrating g over the auxiliary variables recovers f, i.e., f(x1, x2, …, x4) = ∫∫ g(x1, x2, x2′, x2′′, …, x4) dx2′ dx2′′, it follows that any inference problem on f can be executed by a corresponding inference problem on g.
For example, the (previously shown) graph for fa (x1 , x2 , x3 ) ⋅ fb (x3 , x4 , x5 ) ⋅ fc (x4 ) could represent the probabilistic model
Inference by Closing Boxes
Factorizations provide opportunities to cut on the amount of needed computations when doing inference. In what follows, we will use FFGs to process
these opportunities in an automatic way by message passing between the nodes of the graph.
As an example, assume we wish to compute the marginal
f̄(x3) ≜ ∑_{x1, x2, x4, x5, x6, x7} f(x1, x2, …, x7)
for a factorized function f of seven variables.
Due to the factorization and the Generalized Distributive Law, we can decompose this sum-of-products into an equivalent product-of-sums (by pushing each sum as far inward as possible), which is computationally (much) lighter than executing the full sum ∑_{x1,…,x7} f(x1, x2, …, x7).
Note that the new factor →μ_X3(x3) is obtained by multiplying all factors (fa, fb, fc) enclosed by the red dashed box, followed by marginalization (summing) over all enclosed variables (x1, x2).
This is the Closing-the-Box rule, which is a general recipe for marginalization of latent variables (inside the box): it leads to a new factor whose arguments are the variables (edges) that cross the box. For instance, the argument of the remaining factor →μ_X3(x3) is the variable on the edge that crosses the red box (x3).
Hence, →μ_X3(x3) can be interpreted as a message from the red box toward variable x3.
→
We drew directed edges in the FFG in order to distinguish forward messages μ ∙ (⋅) (in the same direction as the arrow of the edge) from backward
←
messages μ ∙ (⋅) (in opposite direction). This is just a notational convenience since an FFG is computationally an undirected graph.
Sum-Product Algorithm
Closing-the-box can also be interpreted as a message update rule for an outgoing message from a node. For a node f(y, x1, …, xn) with incoming messages →μ_X1(x1), →μ_X2(x2), …, →μ_Xn(xn), the outgoing message is given by (Loeliger (2007), pg.1299):

→μ_Y(y) = ∑_{x1, …, xn} →μ_X1(x1) ⋯ →μ_Xn(xn) ⋅ f(y, x1, …, xn)
(outgoing message)      (incoming messages)     (node function)
This is called the Sum-Product Message (SPM) update rule. (Look at the formula to understand why it's called the SPM update rule).
Note that all SPM update rules can be computed from information that is locally available at each node.
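To make the SPM rule concrete, here is a tiny numerical sketch for a discrete node with three variables; the node function f and the incoming messages below are arbitrary made-up tables, not taken from the lecture notes:

f = rand(2, 3, 3)                 # node function f[y, x1, x2] ≥ 0 over binary y and ternary x1, x2
μ_x1 = [0.2, 0.5, 0.3]            # incoming message over x1
μ_x2 = [0.6, 0.1, 0.3]            # incoming message over x2
# outgoing message: sum over x1, x2 of (incoming messages × node function)
μ_y = [sum(f[y, i, j] * μ_x1[i] * μ_x2[j] for i in 1:3, j in 1:3) for y in 1:2]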
If the factor graph for a function f has no cycles (i.e., the graph is a tree), then the marginal f̄(x3) = ∑_{x1, x2, x4, x5, x6, x7} f(x1, x2, …, x7) is given by multiplying the forward and backward messages on that edge:

f̄(x3) = →μ_X3(x3) ⋅ ←μ_X3(x3)

It follows that the marginal f̄(x3) can be efficiently computed through sum-product messages. Executing inference through SP message passing is called the Sum-Product Algorithm (or, alternatively, the belief propagation algorithm).
Just as a final note, inference by sum-product message passing is much like replacing the sum-of-products
ac + ad + bc + bd
by the following product-of-sums:
(a + b)(c + d) .
Which of these two computations is cheaper to execute?
As an example, let's evaluate the SP messages for the equality node f=(x, y, z) = δ(z − x) δ(z − y):

→μ_Z(z) = ∬ →μ_X(x) →μ_Y(y) δ(z − x) δ(z − y) dx dy
        = →μ_X(z) ∫ →μ_Y(y) δ(z − y) dy
        = →μ_X(z) →μ_Y(z)

By symmetry, this also implies (for the same equality node) that
←μ_X(x) = →μ_Y(x) ←μ_Z(x)   and   ←μ_Y(y) = →μ_X(y) ←μ_Z(y).
Let us now consider the case of Gaussian messages →μ_X(x) = 𝒩(x | →m_X, →V_X), →μ_Y(y) = 𝒩(y | →m_Y, →V_Y) and →μ_Z(z) = 𝒩(z | →m_Z, →V_Z). Let's also define the precision matrices →W_X ≜ →V_X⁻¹, and similarly for Y and Z. Then applying the SP update rule leads to multiplication of two Gaussian distributions (see Roweis notes), resulting in

→W_Z = →W_X + →W_Y
→W_Z →m_Z = →W_X →m_X + →W_Y →m_Y
It follows that message passing through an equality node is similar to applying Bayes rule, i.e., fusion of two information sources. Does this make
sense?
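As a small illustration of this fusion, a few lines of plain Julia (the function name and the numbers below are illustrative assumptions, not part of RxInfer) implement the equality-node update for univariate Gaussian messages:

# Equality-node update for two Gaussian messages in (mean, variance) form
function equality_node_message(m_x, v_x, m_y, v_y)
    w_x, w_y = 1/v_x, 1/v_y           # precisions
    w_z = w_x + w_y                    # W_Z = W_X + W_Y
    m_z = (w_x*m_x + w_y*m_y) / w_z    # W_Z m_Z = W_X m_X + W_Y m_Y
    return m_z, 1/w_z
end

m_z, v_z = equality_node_message(1.0, 2.0, 3.0, 0.5)   # fuse two "information sources"
println("fused message: N(z | $m_z, $v_z)")             # precision-weighted mean, reduced variance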
In a tree graph, start with messages from the terminals and keep passing messages through the internal nodes towards the "target" variable (x3 in
above problem) until you have both the forward and backward message for the target variable.
In a tree graph, if you continue to pass messages throughout the graph, the Sum-Product Algorithm computes exact marginals for all hidden variables.
If the graph contains cycles, we have in principle an infinite tree by "unrolling" the graph. In this case, the SP Algorithm is not guaranteed to find exact
marginals. In practice, if we apply the SP algorithm for just a few iterations ("unrolls"), then we often find satisfying approximate marginals.
We can use terminal nodes to represent observations, e.g., add a factor f(y) = δ(y − 3) to terminate the half-edge for variable Y if y = 3 is
observed.
Terminal nodes that carry observations are denoted by small black boxes.
The message out of a terminal node (attached to only 1 edge) is the factor itself. For instance, closing a box around terminal node fa(x1) would lead to
→μ_X1(x1) ≜ ∑_{enclosed variables} ∏_{enclosed factors} fa(x1) = fa(x1)
since fa(x1) is the only enclosed factor and there are no enclosed variables to sum over.
The message from a half-edge is 1 (one). You can verify this by imagining that a half-edge x can be terminated by a node function f(x) = 1 without
affecting any inference issue.
The foregoing message update rules can be worked out in closed form and put into tables (e.g., see Tables 1 through 6 in Loeliger (2007)) for many standard factors, such as essential probability distributions and operations such as additions, fixed-gain multiplications and branching (equality nodes).
In the optional slides below, we have worked out a few more update rules for the addition node and the multiplication node.
If the update rules for all node types in a graph have been tabulated, then inference by message passing comes down to executing a set of table-
lookup operations, thus creating a completely automatable Bayesian inference framework.
In our research lab BIASlab (FLUX 7.060), we are developing RxInfer, which is a (Julia) toolbox for automating Bayesian inference by message passing in
a factor graph.
Assume we want to estimate some function f : RD → R from a given data set D = {(x1 , y1 ), … , (xN , yN )}, with model assumption
yi = f(xi ) + ϵi.
model specification
We will assume a linear model with white Gaussian noise and a Gaussian prior on the coefficients $w$:

$$\begin{aligned}
y_i &= w^T x_i + \epsilon_i \\
\epsilon_i &\sim \mathcal{N}(0, \sigma^2) \\
w &\sim \mathcal{N}(0, \Sigma)
\end{aligned}$$

or equivalently

$$\begin{aligned}
p(y_i\,|\,x_i, w) &= \mathcal{N}(y_i\,|\,w^T x_i, \sigma^2) \\
p(w) &= \mathcal{N}(w\,|\,0, \Sigma)
\end{aligned}$$
We are interested in inferring the posterior p(w|D). We will execute inference by message passing on the FFG for the model.
The left figure shows the factor graph for this model.
The right figure shows the message passing scheme.
CODE EXAMPLE
Let's solve this problem by message passing-based inference with Julia's FFG toolbox RxInfer.
In [1]: using PyPlot, LinearAlgebra
# Parameters
Σ = 1e5 * Diagonal(I,3) # Covariance matrix of prior on w
σ2 = 2.0 # Noise variance
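The notebook cell that generates the training data did not survive the export; a minimal substitute that matches the plotting code below (quadratic features in z, with made-up ground-truth weights and sample size) could look like this:

```julia
using Random
Random.seed!(42)                               # seed chosen arbitrarily, for reproducibility

N = 30                                         # number of training samples (assumed)
z = 10 .* rand(N)                              # scalar inputs
x_train = [[1.0; zi; zi^2] for zi in z]        # features [1, z, z^2], matching the axis label below
w_true = [-5.0; 2.0; 0.3]                      # made-up ground-truth coefficients
y_train = [w_true'*x + sqrt(σ2)*randn() for x in x_train]   # noisy observations
```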
Now build the factor graph in RxInfer, perform sum-product message passing and plot results (mean of posterior).
In [2]: using RxInfer, Random
In [3]: # Build model
@model function linear_regression(N, Σ, σ2)
w ~ MvNormalMeanCovariance(constvar(zeros(3)),constvar(Σ))
x = datavar(Vector{Float64}, N)
y = datavar(Float64, N)
for i in 1:N
y[i] ~ NormalMeanVariance(dot(w , x[i]), σ2)
end
return w, x, y
end
# Run message passing algorithm
results = inference(
model = linear_regression(length(x_train), Σ, σ2 ),
data = (y = y_train, x = x_train),
returnvars = (w = KeepLast()),
iterations = 20
);
# Plot result
w = results.posteriors[:w]
println("Posterior distribution of w: $(w)")
scatter(z, y_train); xlabel(L"z"); ylabel(L"f([1.0, z, z^2]) + ϵ");
z_test = collect(0:0.2:12)
x_test = [[1.0; z; z^2] for z in z_test]
for i=1:10
w_sample = rand(results.posteriors[:w])
f_est(x) = (w_sample'*x)[1]
plot(z_test, map(f_est, x_test), "k-", alpha=0.3);
end
Final thoughts: Modularity and Abstraction
The great Michael Jordan (no, not this one, but this one), wrote:
I basically know of two principles for treating complicated systems in simple ways: the first is the principle of modularity and the second
is the principle of abstraction. I am an apologist for computational probability in machine learning because I believe that probability
theory implements these two principles in deep and intriguing ways — namely through factorization and through averaging. Exploiting
these two mechanisms as fully as possible seems to me to be the way forward in machine learning. — Michael Jordan, 1997 (quoted in
Fre98).
Factor graphs realize these ideas nicely, both visually and computationally.
Visually, the modularity of conditional independencies in the model are displayed by the graph structure. Each node hides internal complexity and by
closing-the-box, we can hierarchically move on to higher levels of abstraction.
Computationally, message passing-based inference uses the Distributive Law to avoid any unnecessary computations.
OPTIONAL SLIDES
Next, let us consider a multiplication by a fixed (invertible matrix) gain $f_A(x,y) = \delta(y - Ax)$:

$$\begin{aligned}
\overrightarrow{\mu}_{Y}(y) &= \int \overrightarrow{\mu}_{X}(x)\, \delta(y - Ax)\, \mathrm{d}x \\
&= \int \overrightarrow{\mu}_{X}(x)\, |A|^{-1}\delta(x - A^{-1}y)\, \mathrm{d}x \\
&= |A|^{-1}\,\overrightarrow{\mu}_{X}(A^{-1}y)\,.
\end{aligned}$$

For a Gaussian input message $\overrightarrow{\mu}_{X}(x) = \mathcal{N}(x\,|\,\overrightarrow{m}_X, \overrightarrow{V}_X)$, the output message is also Gaussian with

$$\overrightarrow{m}_Y = A\overrightarrow{m}_X \quad\text{and}\quad \overrightarrow{V}_Y = A\overrightarrow{V}_X A^T$$

since
$$\begin{aligned}
\overrightarrow{\mu}_{Y}(y) &= |A|^{-1}\,\overrightarrow{\mu}_{X}(A^{-1}y) \\
&\propto \exp\left(-\frac{1}{2}\left(A^{-1}y - \overrightarrow{m}_X\right)^T \overrightarrow{V}_X^{-1}\left(A^{-1}y - \overrightarrow{m}_X\right)\right) \\
&= \exp\left(-\frac{1}{2}\left(y - A\overrightarrow{m}_X\right)^T \underbrace{A^{-T}\overrightarrow{V}_X^{-1}A^{-1}}_{\left(A\overrightarrow{V}_X A^T\right)^{-1}}\left(y - A\overrightarrow{m}_X\right)\right) \\
&\propto \mathcal{N}\left(y\,|\,A\overrightarrow{m}_X,\, A\overrightarrow{V}_X A^T\right)\,.
\end{aligned}$$
Exercise: Prove that, for the same factor $\delta(y - Ax)$ and Gaussian messages, the (backward) sum-product message $\overleftarrow{\mu}_X$ is given by

$$\begin{aligned}
\overleftarrow{\xi}_X &= A^T\overleftarrow{\xi}_Y \\
\overleftarrow{W}_X &= A^T\overleftarrow{W}_Y A
\end{aligned}$$

where $\overleftarrow{\xi}_X \triangleq \overleftarrow{W}_X\overleftarrow{m}_X$ and $\overleftarrow{W}_X \triangleq \overleftarrow{V}_X^{-1}$ (and similarly for $Y$).
Code example: Gaussian forward and backward messages for the Addition node
Let's calculate the Gaussian forward and backward messages for the addition node in RxInfer.
Forward message on Z:
Out[4]:NormalMeanVariance{Float64}(μ=3.0, v=2.0)
In [5]: println("Backward message on X:")
@call_rule typeof(+)(:in1, Marginalisation) (m_out = NormalMeanVariance(3.0, 1.0), m_in2 = NormalMeanVariance(2.0, 1.0))
Backward message on X:
Out[5]:NormalMeanVariance{Float64}(μ=1.0, v=2.0)
Code Example: forward and backward messages for the Matrix Multiplication node
In the same way we can also investigate the forward and backward messages for the matrix multiplication ("gain") node
Forward message on Y:
Out[6]:NormalMeanVariance{Float64}(μ=4.0, v=16.0)
In [7]: println("Backward message on X:")
@call_rule typeof(*)(:in, Marginalisation) (m_out = NormalMeanVariance(2.0, 1.0), m_A = PointMass(4.0), meta = TinyCorrection())
Backward message on X:
Out[7]:NormalWeightedMeanPrecision{Float64}(xi=8.0, w=16.0)
Assume that we are interested in the posterior for $X$ after observing $Y_1 = \hat{y}_1$ and $Y_2 = \hat{y}_2$. The posterior for $X$ can be inferred by applying the sum-product algorithm to the following graph:
(Note that) we usually draw terminal nodes for observed variables in the graph by smaller solid-black squares. This is just to help the visualization of the
graph, since the computational rules are no different than for other nodes.
We'll use RxInfer to build the above graph, and perform sum-product message passing to infer the posterior p(x|y1 , y2 ). We assume p(y1 |x) and
p(y2 |x) to be Gaussian likelihoods with known variances:
p(y1 | x) = N (y1 | x, vy1 )
p(y2 | x) = N (y2 | x, vy2 )
Under this model, the posterior is given by:

$$\begin{aligned}
p(x\,|\,y_1,y_2) &\propto \underbrace{p(y_1|x)\,p(y_2|x)}_{\text{likelihood}}\cdot\underbrace{p(x)}_{\text{prior}} \\
&= \mathcal{N}(x\,|\,\hat{y}_1, v_{y_1})\,\mathcal{N}(x\,|\,\hat{y}_2, v_{y_2})\,\mathcal{N}(x\,|\,m_x, v_x)
\end{aligned}$$

so we can validate the answer by solving the Gaussian multiplication manually.
In [8]: # Data
y1_hat = 1.0
y2_hat = 2.0
x ~ NormalMeanVariance(constvar(0.0), constvar(4.0))
y1 ~ NormalMeanVariance(x, constvar(1))
y2 ~ NormalMeanVariance(x, constvar(2))
return x
end
# Calculate mean and variance of p(x|y1,y2) manually by multiplying 3 Gaussians (see lesson 4 for details)
v = 1 / (1/4 + 1/1 + 1/2)
m = v * (0/4 + y1_hat/1.0 + y2_hat/2.0)
println("Manual result: p(x|y1,y2) = ($(m), $(v))")
Preliminaries
Goal
Review of information processing with Gaussian distributions in linear systems
Materials
Mandatory
These lecture notes
Optional
Bishop pp. 85-93
MacKay - 2006 - The Humble Gaussian Distribution (highly recommended!)
Ariel Caticha - 2012 - Entropic Inference and the Foundations of Physics, pp.30-34, section 2.8, the Gaussian distribution
References
E.T. Jaynes - 2003 - Probability Theory, The Logic of Science (best book available on the Bayesian view on probability theory)
Example Problem
Consider a set of observations D = {x1 , … , xN } in the 2-dimensional plane (see Figure). All observations were generated by the same process. We
now draw an extra observation x∙ = (a , b) from the same data generating process. What is the probability that x∙ lies within the shaded rectangle S?
In [1]: using Distributions, PyPlot
N = 100
generative_dist = MvNormal([0,1.], [0.8 0.5; 0.5 1.0])
function plotObservations(obs::Matrix)
plot(obs[1,:], obs[2,:], "kx", zorder=3)
fill_between([0., 2.], 1., 2., color="k", alpha=0.4, zorder=2) # Shaded area
text(2.05, 1.8, "S", fontsize=12)
xlim([-3,3]); ylim([-2,4]); xlabel("a"); ylabel("b")
end
D = rand(generative_dist, N) # Generate observations from generative_dist
plotObservations(D)
x_dot = rand(generative_dist) # Generate x·
plot(x_dot[1], x_dot[2], "ro");
gcf()
Out[1]: (figure: scatter plot of the observations D, the new point x∙, and the shaded region S)

The multivariate Gaussian distribution for $x \in \mathbb{R}^M$ is given by

$$p(x|\mu,\Sigma) = \mathcal{N}(x\,|\,\mu,\Sigma) \triangleq \frac{1}{\sqrt{(2\pi)^M |\Sigma|}}\exp\left\{-\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)\right\}\,.$$

For the univariate (scalar) case, this reduces to

$$p(x|\mu,\sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left\{-\frac{(x-\mu)^2}{2\sigma^2}\right\}\,.$$
Alternatively, the canonical (a.k.a. natural or information) parameterization of the Gaussian distribution is given by

$$p(x|\eta,\Lambda) = \mathcal{N}_c(x\,|\,\eta,\Lambda) \propto \exp\left\{\eta^T x - \frac{1}{2}x^T\Lambda x\right\}$$

with precision matrix $\Lambda = \Sigma^{-1}$ and precision-weighted mean $\eta = \Sigma^{-1}\mu$.
Why is the Gaussian distribution so ubiquitously used in science and engineering? (see also Jaynes, section 7.14, and the whole chapter 7 in his book).
(2) Once the Gaussian has been attained, this form tends to be preserved. e.g.,
The convolution of two Gaussian functions is another Gaussian function (useful in sum of 2 variables and linear transformations)
The product of two Gaussian functions is another Gaussian function (useful in Bayes rule).
The Fourier transform of a Gaussian function is another Gaussian function.
In fact, after a linear transformation z = Ax + b , no matter how x is distributed, the mean and variance of z are always given by
μz = Aμx + b and Σz = AΣx A T , respectively (see probability theory review lesson). In case x is not Gaussian, higher order moments may
be needed to specify the distribution for z.
The sum of two independent Gaussian variables is also Gaussian distributed. Specifically, if x ∼ N (μx , Σx ) and y ∼ N (μy , Σy ), then the PDF
for z = x + y is given by
p(z) = N (x | μx , Σx ) ∗ N (y | μy , Σy )
= N (z | μx + μy , Σx + Σy ) (SRG-8)
The sum of two Gaussian distributions is NOT a Gaussian distribution. Why not?
Given independent variables $x \sim \mathcal{N}(\mu_x, \sigma_x^2)$ and $y \sim \mathcal{N}(\mu_y, \sigma_y^2)$, what is the PDF for $z = A\cdot(x-y)+b$? (see Exercises)
Think about the role of the Gaussian distribution for stochastic linear systems in relation to what sinusoidals mean for deterministic linear system
analysis.
Let's estimate a constant θ from one 'noisy' measurement x about that constant.
We assume the following measurement equations (the tilde ∼ means: 'is distributed as'):
x = θ+ϵ
ϵ ∼ N (0, σ 2 )
Also, let's assume a Gaussian prior for θ
θ ∼ N (μ0 , σ02 )
Model specification
Note that you can rewrite these specifications in probabilistic notation as follows:
p(x|θ) = N (x|θ, σ 2 )
p(θ) = N (θ|μ0 , σ02 )
(Notational convention). Note that we write ϵ ∼ N (0, σ 2 ) but not ϵ ∼ N (ϵ|0, σ 2 ), and we write p(θ) = N (θ|μ0 , σ02 ) but not
p(θ) = N (μ0 , σ02 ).
Inference
For simplicity, we assume that the variance σ 2 is given and will proceed to derive a Bayesian posterior for the mean θ. The case for Bayesian inference
for σ 2 with a given mean is discussed in the optional slides.
(Just as an aside,) this computational 'trick' for multiplying two Gaussians is called completing the square. The procedure makes use of the equality

$$ax^2 + bx + c_1 = a\left(x + \frac{b}{2a}\right)^2 + c_2$$

The resulting posterior is again a Gaussian, $p(\theta|x) = \mathcal{N}(\theta\,|\,\mu_1,\sigma_1^2)$, where

$$\begin{aligned}
\frac{1}{\sigma_1^2} &= \frac{\sigma_0^2 + \sigma^2}{\sigma_0^2\sigma^2} = \frac{1}{\sigma_0^2} + \frac{1}{\sigma^2} \\
\mu_1 &= \frac{\sigma_0^2 x + \sigma^2\mu_0}{\sigma^2 + \sigma_0^2} = \sigma_1^2\left(\frac{1}{\sigma_0^2}\mu_0 + \frac{1}{\sigma^2}x\right)
\end{aligned}$$
So, multiplication of two Gaussian distributions yields another (unnormalized) Gaussian with
posterior precision equals sum of prior precisions
posterior precision-weighted mean equals sum of prior precision-weighted means
where
$$\begin{aligned}
\Sigma_c^{-1} &= \Sigma_a^{-1} + \Sigma_b^{-1} \\
\Sigma_c^{-1}\mu_c &= \Sigma_a^{-1}\mu_a + \Sigma_b^{-1}\mu_b
\end{aligned}$$
⇒ Note that Bayesian inference is trivial in the canonical parameterization of the Gaussian, where we would get
Λc = Λa + Λb (precisions add)
ηc = ηa + ηb (precision-weighted means add)
This property is an important reason why the canonical parameterization of the Gaussian distribution is useful in Bayesian data processing.
Let's plot the exact product of two Gaussian PDFs as well as the normalized product according to the above derivation.
In [2]: using PyPlot, Distributions
d1 = Normal(0, 1) # μ=0, σ^2=1
d2 = Normal(3, 2) # μ=3, σ^2=4

# (Added) normalized Gaussian product: precisions and precision-weighted means add
w1 = 1/var(d1); w2 = 1/var(d2)
v_prod = 1/(w1 + w2)
m_prod = v_prod * (w1*mean(d1) + w2*mean(d2))
d_prod = Normal(m_prod, sqrt(v_prod))

# Plot stuff
x = range(-4, stop=8, length=100)
plot(x, pdf.(d1,x), "k")
plot(x, pdf.(d2,x), "b")
plot(x, pdf.(d1,x) .* pdf.(d2,x), "r-") # Plot the exact product
plot(x, pdf.(d_prod,x), "r:") # Plot the normalized Gaussian product
legend([L"\mathcal{N}(0,1)",
L"\mathcal{N}(3,4)",
L"\mathcal{N}(0,1) \mathcal{N}(3,4)",
L"Z^{-1} \mathcal{N}(0,1) \mathcal{N}(3,4)"]);
The solid and dotted red curves are identical up to a scaling factor Z .
xn = θ + ϵn
ϵn ∼ N (0, σ 2 )
and the same prior for θ:
θ ∼ N (μ0 , σ02 )
inference
$$p(\theta|D) \propto \underbrace{\mathcal{N}(\theta|\mu_0,\sigma_0^2)}_{\text{prior}}\cdot\underbrace{\prod_{n=1}^{N}\mathcal{N}(x_n|\theta,\sigma^2)}_{\text{likelihood}}$$
Using the property that precisions and precision-weighted means add when Gaussians are multiplied, we can immediately write the posterior
$$p(\theta|D) = \mathcal{N}(\theta\,|\,\mu_N,\sigma_N^2)$$

as

$$\begin{aligned}
\frac{1}{\sigma_N^2} &= \frac{1}{\sigma_0^2} + \sum_n\frac{1}{\sigma^2} \qquad&\text{(B-2.142)}\\
\mu_N &= \sigma_N^2\left(\frac{1}{\sigma_0^2}\mu_0 + \frac{1}{\sigma^2}\sum_n x_n\right) &\text{(B-2.141)}
\end{aligned}$$
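A small numerical check of B-2.141 and B-2.142 (a sketch with arbitrary prior settings and simulated data, not part of the original notebook):

```julia
using Distributions

m0, v0 = 0.0, 100.0                     # prior mean and variance (arbitrary choice)
σ2 = 2.0                                # known observation noise variance
x = rand(Normal(5.0, sqrt(σ2)), 50)     # simulated data set D (true θ = 5.0, assumed)
N = length(x)

vN = 1 / (1/v0 + N/σ2)                  # B-2.142: posterior precision = sum of precisions
mN = vN * (m0/v0 + sum(x)/σ2)           # B-2.141: precision-weighted means add
println("posterior p(θ|D) = N(θ | $mN, $vN)")
```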
We now have a posterior for the model parameters. Let's write down what we know about the next sample xN+1 .
$$p(x_{N+1}|D_N) = \int p(x_{N+1}|\theta)\,p(\theta|D_N)\,\mathrm{d}\theta = \mathcal{N}(x_{N+1}\,|\,\mu_N,\,\sigma_N^2 + \sigma^2)$$

Uncertainty about $x_{N+1}$ comprises both uncertainty about the parameter ($\sigma_N^2$) and observation noise ($\sigma^2$).
$$\mu_{ML} = \left.\mu_N\right|_{\sigma_0^2\rightarrow\infty} = \sigma_N^2\left(\frac{1}{\sigma^2}\sum_n x_n\right) = \frac{1}{N}\sum_{n=1}^N x_n$$
As expected, having an expression for the maximum likelihood estimate, it is now possible to rewrite the (Bayesian) posterior mean for $\theta$ as

$$\begin{aligned}
\underbrace{\mu_N}_{\text{posterior}} &= \sigma_N^2\left(\frac{1}{\sigma_0^2}\mu_0 + \frac{1}{\sigma^2}\sum_n x_n\right)\\
&= \frac{\sigma_0^2\sigma^2}{N\sigma_0^2 + \sigma^2}\left(\frac{1}{\sigma_0^2}\mu_0 + \frac{1}{\sigma^2}\sum_n x_n\right)\\
&= \frac{\sigma^2}{N\sigma_0^2 + \sigma^2}\mu_0 + \frac{N\sigma_0^2}{N\sigma_0^2 + \sigma^2}\mu_{ML}\\
&= \underbrace{\mu_0}_{\text{prior}} + \underbrace{\underbrace{\frac{N\sigma_0^2}{N\sigma_0^2 + \sigma^2}}_{\text{gain}}\cdot\underbrace{(\mu_{ML} - \mu_0)}_{\text{prediction error}}}_{\text{correction}} \qquad\text{(B-2.141)}
\end{aligned}$$
Hence, the posterior mean always lies somewhere between the prior mean μ0 and the maximum likelihood estimate (the "data" mean) μML .
Let $z = \begin{bmatrix} x \\ y\end{bmatrix}$ be jointly normally distributed as

$$p(z) = \mathcal{N}(z|\mu,\Sigma) = \mathcal{N}\left(\begin{bmatrix} x\\ y\end{bmatrix}\,\middle|\,\begin{bmatrix}\mu_x\\ \mu_y\end{bmatrix},\begin{bmatrix}\Sigma_x & \Sigma_{xy}\\ \Sigma_{yx} & \Sigma_y\end{bmatrix}\right)$$

Since covariance matrices are by definition symmetric, it follows that $\Sigma_x$ and $\Sigma_y$ are symmetric and $\Sigma_{xy} = \Sigma_{yx}^T$.

Let's factorize $p(z) = p(x,y)$ into $p(y|x)\cdot p(x)$ through conditioning and marginalization.

conditioning: $$p(y|x) = \mathcal{N}\left(y\,|\,\mu_y + \Sigma_{yx}\Sigma_x^{-1}(x - \mu_x),\; \Sigma_y - \Sigma_{yx}\Sigma_x^{-1}\Sigma_{xy}\right)$$

marginalization: $$p(x) = \mathcal{N}(x\,|\,\mu_x,\Sigma_x)$$
Hence, conditioning and marginalization in Gaussians leads to Gaussians again. This is very useful for applications to Bayesian inference in jointly
Gaussian systems.
With a natural parameterization of the Gaussian $p(z) = \mathcal{N}_c(z|\eta,\Lambda)$ with precision matrix $\Lambda = \Sigma^{-1} = \begin{bmatrix}\Lambda_x & \Lambda_{xy}\\ \Lambda_{yx} & \Lambda_y\end{bmatrix}$, the conditioning operation results in a simpler expression; see Bishop pg.90, eqs. 2.96 and 2.97.

As an exercise, interpret the formula for the conditional mean, $\mathrm{E}[y|x] = \mu_y + \Sigma_{yx}\Sigma_x^{-1}(x - \mu_x)$, as a prediction-correction operation.
μ = [1.0; 2.0]
Σ = [0.3 0.7;
0.7 2.0]
joint = MvNormal(μ,Σ)
marginal_x = Normal(μ[1], sqrt(Σ[1,1]))
#Plot p(x,y)
subplot(221)
x_range = y_range = range(-2,stop=5,length=1000)
joint_pdf = [ pdf(joint, [x_range[i];y_range[j]]) for j=1:length(y_range), i=1:length(x_range)]
imshow(joint_pdf, origin="lower", extent=[x_range[1], x_range[end], y_range[1], y_range[end]])
grid(); xlabel("x"); ylabel("y"); title("p(x,y)"); tight_layout()
# Plot p(x)
subplot(222)
plot(range(-2,stop=5,length=1000), pdf.(marginal_x, range(-2,stop=5,length=1000)))
grid(); xlabel("x"); ylabel("p(x)"); title("Marginal distribution p(x)"); tight_layout()
# Plot p(y|x)
x = 0.1
conditional_y_m = μ[2]+Σ[2,1]*inv(Σ[1,1])*(x-μ[1])
conditional_y_s2 = Σ[2,2] - Σ[2,1]*inv(Σ[1,1])*Σ[1,2]
conditional_y = Normal(conditional_y_m, sqrt.(conditional_y_s2))
subplot(223)
plot(range(-2,stop=5,length=1000), pdf.(conditional_y, range(-2,stop=5,length=1000)))
grid(); xlabel("y"); ylabel("p(y|x)"); title("Conditional distribution p(y|x)"); tight_layout()
As is clear from the plots, the conditional distribution is a renormalized slice from the joint distribution.
p(x | θ) = N (x | θ, σ 2 )
with a Gaussian prior for θ:
p(θ) = N (θ | μ0 , σ02 )
Let $z = \begin{bmatrix} x\\ \theta\end{bmatrix}$. The distribution for $z$ is then given by (Exercise)

$$p(z) = p\left(\begin{bmatrix} x\\ \theta\end{bmatrix}\right) = \mathcal{N}\left(\begin{bmatrix} x\\ \theta\end{bmatrix}\,\middle|\,\begin{bmatrix}\mu_0\\ \mu_0\end{bmatrix},\begin{bmatrix}\sigma_0^2+\sigma^2 & \sigma_0^2\\ \sigma_0^2 & \sigma_0^2\end{bmatrix}\right)$$

Direct substitution of the rule for Gaussian conditioning leads to the posterior (derivation as an Exercise):

$$p(\theta|x) = \mathcal{N}(\theta\,|\,\mu_1,\sigma_1^2)\,,$$

with

$$\begin{aligned}
K &= \frac{\sigma_0^2}{\sigma_0^2 + \sigma^2} \qquad&\text{($K$ is called: Kalman gain)}\\
\mu_1 &= \mu_0 + K\cdot(x - \mu_0)\\
\sigma_1^2 &= (1 - K)\,\sigma_0^2
\end{aligned}$$
⇒ Moral: For jointly Gaussian systems, we can do inference simply in one step by using the formulas for conditioning and marginalization.
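The following sketch (made-up numbers) verifies that the Kalman-gain form of the update gives the same posterior as the precision-addition form derived earlier:

```julia
m0, v0 = 0.0, 4.0      # prior p(θ) = N(μ0, σ0²), arbitrary numbers
σ2 = 1.0               # observation noise variance
x  = 2.5               # a single observation

# Kalman-gain (conditioning) form
K  = v0 / (v0 + σ2)
μ1 = m0 + K*(x - m0)
v1 = (1 - K)*v0

# Precision-addition (Gaussian multiplication) form
v1_alt = 1 / (1/v0 + 1/σ2)
μ1_alt = v1_alt * (m0/v0 + x/σ2)

println(isapprox(μ1, μ1_alt) && isapprox(v1, v1_alt))   # true: identical posteriors
```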
Consider the signal xt = θ + ϵt , where Dt = {x1 , … , xt } is observed sequentially (over time).
Problem: Derive a recursive algorithm for p(θ|Dt ), i.e., an update rule for (posterior) p(θ|Dt ) based on (prior) p(θ|Dt−1 ) and (new observation) xt
.
Model specification
Let's define the estimate after t observations (i.e., our solution ) as p(θ|Dt ) = N (θ | μt , σt2 ).
We define the joint distribution for $\theta$ and $x_t$, given background $D_{t-1}$, analogously to the previous example, with the prior now given by $p(\theta|D_{t-1}) = \mathcal{N}(\theta\,|\,\mu_{t-1},\sigma_{t-1}^2)$.

Inference

Applying the Gaussian conditioning formulas then yields the recursive update

$$\begin{aligned}
K_t &= \frac{\sigma_{t-1}^2}{\sigma_{t-1}^2 + \sigma^2} \qquad&\text{(Kalman gain)}\\
\mu_t &= \mu_{t-1} + K_t\cdot(x_t - \mu_{t-1})\\
\sigma_t^2 &= (1 - K_t)\,\sigma_{t-1}^2
\end{aligned}$$

This linear sequential estimator of mean and variance in Gaussian observations is called a Kalman Filter.
Note that the uncertainty about θ decreases over time (since 0 < (1 − Kt ) < 1). Since we assume that the statistics of the system do not change
(stationarity), each new sample provides new information.
Recursive Bayesian estimation is the basis for adaptive signal processing algorithms such as Least Mean Squares (LMS) and Recursive Least Squares
(RLS).
Let's implement the Kalman filter described above. We'll use it to recursively estimate the value of θ based on noisy observations.
In [4]: using PyPlot, Distributions

# (Reconstructed) one recursive Bayesian update: combine the current prior with one noisy observation
function perform_kalman_step(prior::Normal, x::Float64, noise_σ2::Float64)
    K = prior.σ^2 / (noise_σ2 + prior.σ^2)    # compute the Kalman gain
    posterior_μ = prior.μ + K*(x - prior.μ)   # update the posterior mean
    posterior_σ = prior.σ * sqrt(1.0 - K)     # update the posterior standard deviation
    return Normal(posterior_μ, posterior_σ)   # return the posterior distribution
end

# (Assumed) data-generating settings: a true constant θ observed n times with Gaussian noise
θ = 2.0; noise_σ2 = 0.3; n = 100
observations = θ .+ sqrt(noise_σ2) .* randn(n)
post_μ = Vector{Float64}(undef, n+1)          # posterior means per time step
post_σ2 = Vector{Float64}(undef, n+1)         # posterior variances per time step

prior = Normal(0, 1)   # specify the prior distribution (play with its parameterization to get a feeling for its effect)
post_μ[1] = prior.μ    # save prior mean and variance to show these in plot
post_σ2[1] = prior.σ^2
for (i, x) in enumerate(observations)                     # this loop demonstrates Bayesian learning on streaming data
    global prior
    posterior = perform_kalman_step(prior, x, noise_σ2)   # compute the posterior distribution given the observation
    post_μ[i + 1] = posterior.μ                           # save the mean of the posterior distribution
    post_σ2[i + 1] = posterior.σ^2                        # save the variance of the posterior distribution
    prior = posterior                                     # the posterior becomes the prior for future observations
end

obs_scale = collect(2:n+1)
plot(obs_scale, observations, "rx")            # scatter the observations
post_scale = collect(1:n+1)
plot(post_scale, post_μ, "b-")                 # lineplot our estimate of θ
plot(post_scale, θ*ones(n + 1), "k--")         # plot the true value of θ
fill_between(post_scale, post_μ-sqrt.(post_σ2), post_μ+sqrt.(post_σ2), color="b", alpha=0.3) # fill ±1 posterior std
legend([L"\theta", L"x[t]", L"\mu[t]"])
xlim((1, n)); xlabel(L"t"); grid()
#gcf() # uncomment this line if the plot does not show
The shaded area represents 2 standard deviations of posterior p(θ|D). The variance of the posterior is guaranteed to decrease monotonically for the
standard Kalman filter.
(We've seen that) the sum of two Gaussian distributed variables is also Gaussian distributed.

Does the product of two Gaussian distributed variables also have a Gaussian distribution?
No! In general this is a difficult computation. As an example, let's compute $p(z)$ for $Z = XY$ for the special case that $X\sim\mathcal{N}(0,1)$ and $Y\sim\mathcal{N}(0,1)$:

$$\begin{aligned}
p(z) &= \int_{X,Y} p(z|x,y)\,p(x,y)\,\mathrm{d}x\mathrm{d}y\\
&= \frac{1}{2\pi}\int \delta(z - xy)\, e^{-(x^2+y^2)/2}\,\mathrm{d}x\mathrm{d}y\\
&= \frac{1}{\pi}\int_0^\infty \frac{1}{x}\, e^{-(x^2 + z^2/x^2)/2}\,\mathrm{d}x\\
&= \frac{1}{\pi} K_0(|z|)\,,
\end{aligned}$$

where $K_n(z)$ is a modified Bessel function of the second kind.
We plot p(Z = XY ) and p(X)p(Y ) for X ∼ N (0, 1) and Y ∼ N (0, 1) to give an idea of how these distributions differ.
In [5]: using PyPlot, Distributions, SpecialFunctions
X = Normal(0,1)
Y = Normal(0,1)
pdf_product_std_normals(z::Vector) = (besselk.(0, abs.(z))./π)
range1 = collect(range(-4,stop=4,length=100))
plot(range1, pdf.(X, range1))
plot(range1, pdf.(X,range1).*pdf.(Y,range1))
plot(range1, pdf_product_std_normals(range1))
legend([L"p(X)=p(Y)=\mathcal{N}(0,1)", L"p(X)*p(Y)",L"p(Z=X*Y)"]); grid()
In short, Gaussian-distributed variables remain Gaussian in linear systems, but this is not the case in non-linear systems.
We apply maximum likelihood estimation to fit a 2-dimensional Gaussian model (m) to data set D. Next, we evaluate p(x∙ ∈ S|m) by (numerical)
integration of the Gaussian pdf over S : p(x∙ ∈ S|m) = ∫S p(x|m)dx.
In [6]: using HCubature, LinearAlgebra # Numerical integration package

# Maximum likelihood estimation of 2D Gaussian
N = length(sum(D,dims=1))
μ = 1/N * sum(D,dims=2)[:,1]
D_min_μ = D - repeat(μ, 1, N)
Σ = Hermitian(1/N * D_min_μ*D_min_μ')
m = MvNormal(μ, convert(Matrix, Σ));

# (Reconstructed) evaluate the estimated density on a grid for the contour plot
A = Matrix{Float64}(undef, 100, 100)
B = Matrix{Float64}(undef, 100, 100)
density = Matrix{Float64}(undef, 100, 100)
for i=1:100
    for j=1:100
        A[i,j] = a = (i-1)*6/100 - 2
        B[i,j] = b = (j-1)*6/100 - 3
        density[i,j] = pdf(m, [a,b])
    end
end
c = contour(A, B, density, 6, zorder=1)
PyPlot.set_cmap("cool")
clabel(c, inline=1, fontsize=10)

# Plot observations, x∙, and the contours of the estimated Gaussian density
plotObservations(D)
plot(x_dot[1], x_dot[2], "ro")

# (Reconstructed) numerical integration of the Gaussian pdf over the rectangle S = [0,2] × [1,2]
(val, err) = hcubature((x)->pdf(m,x), [0., 1.], [2., 2.])
println("p(x⋅∈S|m) ≈ $(val)")
p(x⋅∈S|m) ≈ 0.20381961095601722
Summary
Bayesian inference with Gaussian prior and likelihood leads to an analytically computable Gaussian posterior.
OPTIONAL SLIDES
Again, we consider an observed data set D = {x1 , x2 , … , xN } and try to explain these data by a Gaussian distribution.
We discussed earlier Bayesian inference for the mean with a given variance. Now we will derive a posterior for the variance if the mean is given.
(Technically, we will do the derivation for a precision parameter λ = σ −2 , since the discussion is a bit more straightforward for the precision
parameter).
model specification
$$\begin{aligned}
p(D|\lambda) &= \prod_{n=1}^N \mathcal{N}(x_n\,|\,\mu,\lambda^{-1})\\
&\propto \lambda^{N/2}\exp\left\{-\frac{\lambda}{2}\sum_{n=1}^N(x_n-\mu)^2\right\} \qquad\text{(B-2.145)}
\end{aligned}$$

The conjugate distribution for this function of $\lambda$ is the Gamma distribution, given by

$$\mathrm{Gam}(\lambda\,|\,a,b) = \frac{1}{\Gamma(a)}\, b^a\,\lambda^{a-1}\exp\{-b\lambda\}\,,\qquad a,b>0\,.$$

(Bishop fig.2.13). Plots of the Gamma distribution $\mathrm{Gam}(\lambda\,|\,a,b)$ for different values of $a$ and $b$.

The mean and variance of the Gamma distribution evaluate to $\mathrm{E}[\lambda] = \frac{a}{b}$ and $\mathrm{var}[\lambda] = \frac{a}{b^2}$.
inference
We will consider a prior p(λ) = Gam (λ | a 0 , b0 ), which leads by Bayes rule to the posterior
$$\begin{aligned}
p(\lambda\,|\,D) &\propto \underbrace{\lambda^{N/2}\exp\left\{-\frac{\lambda}{2}\sum_{n=1}^N(x_n-\mu)^2\right\}}_{\text{likelihood}}\cdot\underbrace{\frac{1}{\Gamma(a_0)}\,b_0^{a_0}\lambda^{a_0-1}\exp\{-b_0\lambda\}}_{\text{prior}}\\
&\propto \mathrm{Gam}\left(\lambda\,|\,a_N, b_N\right)
\end{aligned}$$

with

$$\begin{aligned}
a_N &= a_0 + \frac{N}{2} \qquad&\text{(B-2.150)}\\
b_N &= b_0 + \frac{1}{2}\sum_n(x_n-\mu)^2 &\text{(B-2.151)}
\end{aligned}$$
Hence the posterior is again a Gamma distribution. By inspection of B-2.150 and B-2.151, we deduce that we can interpret 2a 0 as the number of a
priori (pseudo-)observations.
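A brief numerical sketch of B-2.150 and B-2.151 (simulated data with an assumed true precision; not from the original notes):

```julia
using Distributions

μ = 0.0                                   # known mean
λ_true = 4.0                              # assumed true precision (variance 0.25)
x = rand(Normal(μ, sqrt(1/λ_true)), 200)  # simulated data set D
N = length(x)

a0, b0 = 1e-3, 1e-3                       # weak Gamma prior Gam(λ | a0, b0)
aN = a0 + N/2                             # B-2.150
bN = b0 + 0.5*sum((x .- μ).^2)            # B-2.151
println("posterior mean E[λ|D] = $(aN/bN)  (true λ = $λ_true)")
```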
Since the most uninformative prior is given by a 0 = b0 → 0, we can derive the maximum likelihood estimate for the precision as
$$\lambda_{ML} = \left.\mathrm{E}[\lambda]\right|_{a_0=b_0\rightarrow0} = \left.\frac{a_N}{b_N}\right|_{a_0=b_0\rightarrow0} = \frac{N}{\sum_{n=1}^N(x_n-\mu)^2}$$
Preliminaries
Goal
Simple Bayesian and maximum likelihood-based density estimation for discretely valued data sets
Materials
Mandatory
These lecture notes
Optional
Bishop pp. 67-70, 74-76, 93-94
Consider a coin-tossing experiment with outcomes $x\in\{0,1\}$ (tail and head) and let $0\leq\mu\leq1$ represent the probability of heads. This model can be written as a Bernoulli distribution:

$$p(x|\mu) = \mu^x(1-\mu)^{1-x}$$
Note that the variable x acts as a (binary) selector for the tail or head probabilities. Think of this as an 'if'-statement in programming.
Now consider a K -sided coin (e.g., a six-faced die (pl.: dice)). How should we encode outcomes?
It turns out that the one-hot coding scheme is mathematically more convenient!
Consider a K -sided die. We use a one-hot coding scheme. Assume the probabilities p(xk = 1) = μk with ∑k μk = 1. The data generating
distribution is then (note the similarity to the Bernoulli distribution)
$$p(x|\mu) = \mu_1^{x_1}\mu_2^{x_2}\cdots\mu_K^{x_K} = \prod_{k=1}^K \mu_k^{x_k} \qquad\text{(B-2.26)}$$
This generalized Bernoulli distribution is called the categorical distribution (or sometimes the 'multi-noulli' distribution).
Now let's proceed with Bayesian density estimation, i.e., let's learn the parameters for model p(x|θ) for an observed data set D = {x1 , … , xN } of
N IID rolls of a K-sided die, with
$$x_{nk} = \begin{cases} 1 & \text{if the $n$th throw landed on the $k$th face}\\ 0 & \text{otherwise}\end{cases}$$
Model specification
$$p(D|\mu) = \prod_n\prod_k \mu_k^{x_{nk}} = \prod_k \mu_k^{\sum_n x_{nk}} = \prod_k \mu_k^{m_k} \qquad\text{(B-2.29)}$$
where mk = ∑n xnk is the total number of occurrences that we 'threw' k eyes. Note that ∑k m k = N .
This distribution depends on the observations only through the quantities {mk }.
We need a prior for the parameters μ = (μ1 , μ2 , … , μK ) . In the binary coin toss example, we used a beta distribution that was conjugate with the
binomial and forced us to choose prior pseudo-counts.
The generalization of the beta prior to the K parameters {μk } is the Dirichlet distribution:
$$p(\mu|\alpha) = \mathrm{Dir}(\mu|\alpha) = \frac{\Gamma\left(\sum_k\alpha_k\right)}{\Gamma(\alpha_1)\cdots\Gamma(\alpha_K)}\prod_{k=1}^K \mu_k^{\alpha_k-1}$$
The Gamma function can be interpreted as a generalization of the factorial function to the real numbers. If $n$ is a natural number ($1,2,3,\ldots$), then $\Gamma(n) = (n-1)!$, where $(n-1)! = (n-1)\cdot(n-2)\cdots1$.
As before for the Beta distribution in the coin toss experiment, you can interpret α k − 1 as the prior number of (pseudo-)observations that the die
landed on the kth face.
By Bayes rule, the posterior for $\mu$ is

$$\begin{aligned}
p(\mu|D,\alpha) &\propto p(D|\mu)\cdot p(\mu|\alpha)\\
&\propto \prod_k \mu_k^{m_k}\cdot\prod_k \mu_k^{\alpha_k-1} = \prod_k \mu_k^{\alpha_k+m_k-1}\\
&\propto \mathrm{Dir}\left(\mu\,|\,\alpha + m\right) \qquad\text{(B-2.41)}\\
&= \frac{\Gamma\left(\sum_k(\alpha_k+m_k)\right)}{\Gamma(\alpha_1+m_1)\Gamma(\alpha_2+m_2)\cdots\Gamma(\alpha_K+m_K)}\prod_{k=1}^K\mu_k^{\alpha_k+m_k-1}
\end{aligned}$$

where $m = (m_1, m_2, \ldots, m_K)^T$.
We recognize the $(\alpha_k - 1)$'s as prior pseudo-counts; the Dirichlet distribution turns out to be a conjugate prior to the categorical/multinomial:

$$\underbrace{\mathrm{Dirichlet}}_{\text{posterior}} \propto \underbrace{\text{categorical}}_{\text{likelihood}}\cdot\underbrace{\mathrm{Dirichlet}}_{\text{prior}}$$

This is a generalization of the conjugate relation that we found for the binary coin toss:

$$\underbrace{\mathrm{Beta}}_{\text{posterior}} \propto \underbrace{\text{binomial}}_{\text{likelihood}}\cdot\underbrace{\mathrm{Beta}}_{\text{prior}}$$
Let's apply what we have learned about the loaded die to compute the probability that we throw the k-th face at the next toss.
$$\begin{aligned}
p(x_{\bullet,k}=1\,|\,D) &= \int p(x_{\bullet,k}=1\,|\,\mu)\,p(\mu|D)\,\mathrm{d}\mu\\
&= \int_0^1 \mu_k\times\mathrm{Dir}(\mu\,|\,\alpha+m)\,\mathrm{d}\mu\\
&= \mathrm{E}[\mu_k]\\
&= \frac{m_k + \alpha_k}{N + \sum_k\alpha_k}
\end{aligned}$$
(You can find the mean of the Dirichlet distribution at its Wikipedia site).
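A small sketch of this predictive rule for a die (the counts and prior pseudo-counts below are made up):

```julia
α = fill(1.0, 6)                 # Dirichlet prior pseudo-counts (uniform prior)
m = [12, 9, 11, 8, 25, 10]       # observed face counts m_k (made-up data)
N = sum(m)

p_next = (m .+ α) ./ (N + sum(α))   # p(x_{∙,k}=1 | D) = (m_k + α_k)/(N + Σ_k α_k)
println(p_next)                     # note the loaded 5th face
println(sum(p_next))                # sums to 1
```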
In the above derivation, we noticed that the data generating distribution for N die tosses D = {x1 , … , xN } only depends on the data
frequencies mk :
$$p(D|\mu) = \prod_n\underbrace{\prod_k\mu_k^{x_{nk}}}_{\text{categorical dist.}} = \prod_k\mu_k^{\sum_n x_{nk}} = \prod_k\mu_k^{m_k} \qquad\text{(B-2.29)}$$

A related distribution is the distribution over data frequency observations $D_m = \{m_1,\ldots,m_K\}$, which is called the multinomial distribution,

$$p(D_m|\mu) = \frac{N!}{m_1!m_2!\ldots m_K!}\prod_k\mu_k^{m_k}\,.$$
When used as a likelihood function for μ, it makes no difference whether you use p(D|μ) or p(Dm |μ). Why?
We can get the maximum likelihood estimate $\hat{\mu}_k$ for $\mu_k$ based on $N$ throws of a $K$-sided die from the Bayesian framework by using a uniform prior for $\mu$ and taking the mode of the posterior for $\mu$:

$$\begin{aligned}
\hat{\mu}_k &= \arg\max_{\mu_k} p(D|\mu)\\
&= \arg\max_{\mu_k} p(D|\mu)\cdot\mathrm{Uniform}(\mu)\\
&= \arg\max_{\mu_k} p(D|\mu)\cdot\left.\mathrm{Dir}(\mu|\alpha)\right|_{\alpha=(1,1,\ldots,1)}\\
&= \arg\max_{\mu_k}\left.p(\mu|D,\alpha)\right|_{\alpha=(1,1,\ldots,1)}\\
&= \arg\max_{\mu_k}\left.\mathrm{Dir}(\mu\,|\,m+\alpha)\right|_{\alpha=(1,1,\ldots,1)}\\
&= \frac{m_k}{\sum_k m_k} = \frac{m_k}{N}
\end{aligned}$$

where we used the fact that the maximum of the Dirichlet distribution $\mathrm{Dir}(\{\alpha_1,\ldots,\alpha_K\})$ is obtained at $(\alpha_k-1)/(\sum_k\alpha_k-K)$.
Of course, we shouldn't have to go through the full Bayesian framework to get the maximum likelihood estimate. Alternatively, we can find the
maximum of the likelihood directly.
The log-likelihood for the multinomial distribution is given by

$$\mathrm{L}(\mu) \triangleq \log p(D_m|\mu) \propto \sum_k m_k\log\mu_k$$

When doing ML estimation, we must obey the constraint $\sum_k\mu_k = 1$, which can be accomplished by a Lagrange multiplier (see Bishop App.E). The augmented log-likelihood with Lagrange multiplier is then

$$\mathrm{L}'(\mu) = \sum_k m_k\log\mu_k + \lambda\cdot\left(1 - \sum_k\mu_k\right)$$
For a multivariate Gaussian model $p(x_n|\theta) = \mathcal{N}(x_n|\mu,\Sigma)$, we obtain ML estimates

$$\begin{aligned}
\hat{\mu} &= \frac{1}{N}\sum_n x_n \qquad&\text{(sample mean)}\\
\hat{\Sigma} &= \frac{1}{N}\sum_n(x_n-\hat{\mu})(x_n-\hat{\mu})^T &\text{(sample variance)}
\end{aligned}$$

For discrete outcomes modeled by a categorical (multinomial) distribution we find

$$\hat{\mu}_k = \frac{1}{N}\sum_n x_{nk}\quad\left(=\frac{m_k}{N}\right) \qquad\text{(sample proportion)}$$
Regression
Preliminaries
Goal
Introduction to Bayesian (Linear) Regression
Materials
Mandatory
These lecture notes
Optional
Bishop pp. 152-158
In this and forthcoming lectures, we will make use of some elementary matrix calculus. The most important formulas are summarized at the
bottom of this notebook in an OPTIONAL SLIDE on matrix calculus. For derivations, see Bishop Appendix C.
Regression - Illustration
Given a set of (noisy) data measurements, find the 'best' relation between an input variable x ∈ RM and input-dependent outcomes y ∈ R.
Assume that we are interested in (a model for) the responses $y_n$ for given inputs $x_n$. In other words, we are interested in building a model for the conditional distribution $p(y|x)$.
Note that, since p(x, y) = p(y|x) p(x), building a model p(y|x) is similar to density estimation with the assumption that x is drawn from a
uniform distribution.
Next, we discuss (1) model specification, (2) Inference and (3) a prediction application for a Bayesian linear regression problem.
1. Model Specification
In a traditional regression model, we try to 'explain the data' by a purely deterministic function f(xn , w), plus a purely random term ϵn for
'unexplained noise':
yn = f (xn, w) + ϵn
In a linear regression model, i.e., linear w.r.t. the parameters w, we assume that
$$f(x_n,w) = \sum_{j=0}^{M-1} w_j\phi_j(x_n) = w^T\phi(x_n)$$

In ordinary linear regression, the noise process $\epsilon_n$ is zero-mean Gaussian with constant variance, i.e.,

$$\epsilon_n \sim \mathcal{N}(0,\beta^{-1})\,.$$

Hence, given a data set $D = \{(x_1,y_1),\ldots,(x_N,y_N)\}$, the likelihood for an ordinary linear regression model is

$$p(y\,|\,X,w,\beta) = \mathcal{N}(y\,|\,Xw,\beta^{-1}I) = \prod_n\mathcal{N}(y_n\,|\,w^T x_n,\beta^{-1}) \qquad\text{(B-3.10)}$$
where $w = \begin{pmatrix} w_1\\ w_2\\ \vdots\\ w_M\end{pmatrix}$, the $(N\times M)$-dim matrix $X = \begin{pmatrix} x_1^T\\ x_2^T\\ \vdots\\ x_N^T\end{pmatrix} = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1M}\\ x_{21} & x_{22} & \cdots & x_{2M}\\ \vdots & & & \vdots\\ x_{N1} & x_{N2} & \cdots & x_{NM}\end{pmatrix}$ and $y = \begin{pmatrix} y_1\\ y_2\\ \vdots\\ y_N\end{pmatrix}$.
p(w | α) = N (w | 0, α −1 I) (B-3.52)
For simplicity, we will assume that α and β are fixed and known.
2. Inference

The posterior for the weights follows from Bayes rule; in the canonical (precision) parameterization it is given by $p(w|D) = \mathcal{N}_c(w\,|\,\eta_N,\Lambda_N)$ with

$$\begin{aligned}
\eta_N &= \beta X^T y\\
\Lambda_N &= \beta X^T X + \alpha I
\end{aligned}$$

Or equivalently (in the moment parameterization of the Gaussian):

$$\begin{aligned}
p(w|D) &= \mathcal{N}(w\,|\,m_N,S_N) \qquad&\text{(B-3.49)}\\
m_N &= \beta S_N X^T y &\text{(B-3.53)}\\
S_N &= \left(\alpha I + \beta X^T X\right)^{-1} &\text{(B-3.54)}
\end{aligned}$$
(Bishop Fig.3.7) Illustration of sequential Bayesian learning for a simple linear model of the form y(x, w) = w0 + w1 x. (Bishop Fig.3.7, detailed
description at Bishop, pg.154.)
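The posterior and predictive formulas B-3.53, B-3.54 and B-3.59 translate directly into a few lines of Julia. The sketch below uses simulated data and arbitrary α, β values (not the notebook's example):

```julia
using LinearAlgebra

# Simulated data for y = w'x + noise, with made-up true weights
N, M = 30, 3
β, α = 2.0, 0.5                          # noise precision and prior precision (assumed)
X = [ones(N) randn(N, M-1)]              # N×M design matrix (first column = offset)
w_true = [1.0, -2.0, 0.5]
y = X*w_true .+ randn(N)./sqrt(β)

S_N = inv(α*I + β*(X'*X))                # B-3.54
m_N = β*S_N*X'*y                         # B-3.53

x_new = [1.0, 0.3, -0.7]                 # a new input x∙
μ_pred  = dot(m_N, x_new)                # predictive mean m_N' x∙
σ2_pred = 1/β + x_new'*S_N*x_new         # B-3.59: predictive variance
println("p(y∙|x∙,D) = N($μ_pred, $σ2_pred)")
```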
3. Application: Predictive Distribution

Assume we are interested in the distribution $p(y_\bullet\,|\,x_\bullet,D)$ for a new input $x_\bullet$. This can be worked out as (exercise B-3.10)

$$\begin{aligned}
p(y_\bullet\,|\,x_\bullet,D) &= \int p(y_\bullet\,|\,x_\bullet,w)\,p(w|D)\,\mathrm{d}w\\
&= \int\mathcal{N}(y_\bullet\,|\,w^T x_\bullet,\beta^{-1})\,\mathcal{N}(w\,|\,m_N,S_N)\,\mathrm{d}w\\
&= \int\mathcal{N}(z\,|\,y_\bullet,\beta^{-1})\,\mathcal{N}(z\,|\,x_\bullet^T m_N,\,x_\bullet^T S_N x_\bullet)\,\mathrm{d}z \qquad\text{(sub. $z = x_\bullet^T w$, switch $z$ and $y_\bullet$)}\\
&= \int\mathcal{N}(y_\bullet\,|\,m_N^T x_\bullet,\,\sigma_N^2(x_\bullet))\,\underbrace{\mathcal{N}(z\,|\,\cdot,\cdot)}_{\text{integrate this out}}\,\mathrm{d}z \qquad\text{(use Gaussian product formula SRG-6)}\\
&= \mathcal{N}\left(y_\bullet\,|\,m_N^T x_\bullet,\,\sigma_N^2(x_\bullet)\right)
\end{aligned}$$

with

$$\sigma_N^2(x_\bullet) = \beta^{-1} + x_\bullet^T S_N x_\bullet \qquad\text{(B-3.59)}$$

So, the uncertainty $\sigma_N^2(x_\bullet)$ about the output $y_\bullet$ contains both uncertainty about the process ($\beta^{-1}$) and about the model parameters ($x_\bullet^T S_N x_\bullet$).

(See the OPTIONAL SLIDE below for the step in this derivation where $\mathcal{N}(w\,|\,m_N,S_N)\,\mathrm{d}w$ gets replaced by $\mathcal{N}(z\,|\,x_\bullet^T m_N,\,x_\bullet^T S_N x_\bullet)\,\mathrm{d}z$.)
As an example, let's do Bayesian Linear Regression for a synthetic sinusoidal data set and a model with 9 Gaussian basis functions

$$\begin{aligned}
y_n &= \sum_{m=1}^{9} w_m\phi_m(x_n) + \epsilon_n\\
\phi_m(x_n) &= \exp\left(-\frac{(x_n-\mu_m)^2}{\sigma^2}\right)\\
\epsilon_n &\sim \mathcal{N}(0,\beta^{-1})
\end{aligned}$$
The predictive distributions for $y$ are shown in the following plots (Bishop, Fig.3.8).

And some plots of draws of posteriors for the functions $w^T\phi(x)$ (Bishop, Fig.3.9).
Maximum Likelihood Estimation for Linear Regression Model

Setting the gradient of the log-likelihood $\log p(y\,|\,X,w,\beta)$ to zero yields the maximum likelihood estimate $\hat{w}_{ML} = (X^T X)^{-1}X^T y$.

The matrix $X^\dagger \equiv (X^T X)^{-1}X^T$ is also known as the Moore-Penrose pseudo-inverse (which is sort-of-an-inverse for non-square matrices).
Note that if we have fewer training samples than input dimensions, i.e., if N < M , then XT X will not be invertible and maximum likelihood blows
up. The Bayesian solution does not suffer from this problem.
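In Julia, the ML/pseudo-inverse solution is one line; the sketch below uses a hypothetical design matrix and targets:

```julia
using LinearAlgebra

X = [ones(10) randn(10, 2)]       # hypothetical 10×3 design matrix
y = randn(10)                     # hypothetical targets
w_ml = pinv(X)*y                  # (XᵀX)⁻¹Xᵀy via the Moore-Penrose pseudo-inverse
println(w_ml ≈ (X'*X)\(X'*y))     # true: same solution as the normal equations
```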
Least-Squares Regression
(You may say that) we don't need to work with probabilistic models. E.g., there's also the deterministic least-squares solution: minimize the sum of squared errors,

$$\hat{w}_{LS} = \arg\min_w\sum_n\left(y_n - w^T x_n\right)^2 = \arg\min_w\,(y - Xw)^T(y - Xw)$$

Setting the gradient $\frac{\partial(y-Xw)^T(y-Xw)}{\partial w} = -2X^T(y - Xw)$ to zero yields the so-called normal equations $X^T X\hat{w}_{LS} = X^T y$ and consequently

$$\hat{w}_{LS} = (X^T X)^{-1}X^T y$$

which is the same answer as we got for the maximum likelihood weights $\hat{w}_{ML}$.
The LS solution implicitly makes the same assumptions as ordinary linear regression (IID Gaussian noise with constant variance). If you use the Least-Squares method, you cannot see (nor modify) these assumptions. The probabilistic method forces you to state all assumptions explicitly!
Not Identically Distributed Data
Let's do an example regarding changing our assumptions. What if we assume that the variance of the measurement error varies with the sampling
index, ϵn ∼ N (0, βn−1 )?
The same derivation as before then yields a weights posterior $p(w|D) = \mathcal{N}(w\,|\,m_N,S_N)$ with

$$\begin{aligned}
m_N &= S_N X^T\Lambda y\\
S_N &= \left(\alpha I + X^T\Lambda X\right)^{-1}
\end{aligned}$$

where $\Lambda \triangleq \mathrm{diag}(\beta_1,\ldots,\beta_N)$.
This maximum likelihood solution is also called the Weighted Least Squares (WLS) solution. (Note that we just stumbled upon it, the crucial aspect is
appropriate model specification!)
Note also that the dimension of Λ grows with the number of data points. In general, models for which the number of parameters grow as the number
of observations increase are called non-parametric models.
We'll compare the Least Squares and Weighted Least Squares solutions for a simple linear regression model with input-dependent noise:
$$\begin{aligned}
x &\sim \mathrm{Unif}[0,1]\\
y\,|\,x &\sim \mathcal{N}(f(x), v(x))\\
f(x) &= 5x - 2\\
v(x) &= 10 e^{2x^2} - 9.5\\
D &= \{(x_1,y_1),\ldots,(x_N,y_N)\}
\end{aligned}$$
In [1]: using PyPlot, LinearAlgebra
# Add constant to input so we can estimate both the offset and the slope
_x = [x ones(N)]
_x_test = hcat(x_test, ones(2))
# LS regression
w_ls = pinv(_x) * y
plot(x_test, _x_test*w_ls, "b-") # plot LS solution
# Weighted LS regression
W = Diagonal(1 ./ v(x)) # weight matrix
w_wls = inv(_x'*W*_x) * _x' * W * y
plot(x_test, _x_test*w_wls, "r-") # plot WLS solution
ylim([-5,8]); legend(["f(x)", "D", "LS linear regr.", "WLS linear regr."],loc=2);
OPTIONAL SLIDES
When doing derivatives with matrices, e.g., for maximum likelihood estimation, it is helpful to be familiar with some matrix calculus. We briefly recapitulate the formulas used in these lectures here.
In the derivation of the predictive distribution, we replaced N (w | mN , SN ) dw with N (z | xT∙ mN , xT∙ SN x∙ ) dz. Here we discuss why that is
allowed.
Generative Classification
Preliminaries
Goal
Mandatory
These lecture notes
Optional
Bishop pp. 196-202 (section 4.2 focusses on binary classification, whereas in these lecture notes we describe generative classification for
multiple classes).
Problem: You're given numerical values for the skin features roughness and color for 200 pieces of fruit, where for each piece of fruit you also know if it
is an apple or a peach. Now you receive the roughness and color values for a new piece of fruit but you don't get its class label (apple or peach). What is
the probability that the new piece is an apple?
function plot_fruit_dataset()
# Plot the data set and x_test
plot(X_apples[:,1], X_apples[:,2], "r+") # apples
plot(X_peaches[:,1], X_peaches[:,2], "bx") # peaches
plot(x_test[1], x_test[2], "ko") # 'new' unlabelled data point
legend(["Apples"; "Peaches"; "Apple or peach?"], loc=2)
xlabel(L"x_1"); ylabel(L"x_2"); xlim([-1,3]); ylim([-1,4])
end
plot_fruit_dataset();
Given is a data set $D = \{(x_1,y_1),\ldots,(x_N,y_N)\}$

inputs $x_n\in\mathbb{R}^M$ are called features.
outputs $y_n\in\mathcal{C}_k$, with $k=1,\ldots,K$; the discrete targets $\mathcal{C}_k$ are called classes.
We will again use the 1-of- K notation for the discrete classes. Define the binary class selection variable
$$y_{nk} = \begin{cases} 1 & \text{if } y_n\in\mathcal{C}_k\\ 0 & \text{otherwise}\end{cases}$$
The plan for generative classification: build a model for the joint pdf p(x, y) = p(x|y)p(y) and use Bayes to infer the posterior class probabilities
$$p(y|x) = \frac{p(x|y)\,p(y)}{\sum_{y'} p(x|y')\,p(y')} \propto p(x|y)\,p(y)$$
1 - Model specification
Likelihood
Assume Gaussian class-conditional distributions with equal covariance matrix across the classes: $p(x_n\,|\,\mathcal{C}_k) = \mathcal{N}(x_n\,|\,\mu_k,\Sigma)$.
Prior
p(Ck) = πk
Hence, using the one-hot coding formulation for $y_{nk}$, the generative model $p(x_n,y_n)$ can be written as

$$p(x_n,y_n) = \prod_{k=1}^K p(x_n,y_{nk}=1)^{y_{nk}} = \prod_{k=1}^K\left(\pi_k\cdot\mathcal{N}(x_n|\mu_k,\Sigma)\right)^{y_{nk}}$$
As usual, once the model has been specified, the rest (inference for parameters and model prediction) follows through straight probability theory.
The log-likelihood for the parameters, given the labeled data set, is

$$\begin{aligned}
\log p(D|\theta) &= \sum_{n,k} y_{nk}\log\mathcal{N}(x_n|\mu_k,\Sigma) + \sum_{n,k} y_{nk}\log\underbrace{\pi_k}_{p(y_{nk}=1)}\\
&= \underbrace{\sum_{n,k} y_{nk}\log\mathcal{N}(x_n|\mu_k,\Sigma)}_{\text{Gaussian}} + \underbrace{\sum_k m_k\log\pi_k}_{\text{multinomial}}
\end{aligned}$$

where $m_k = \sum_n y_{nk}$ is the number of samples in class $k$.
Maximization of the LLH for the GDA model breaks down into
Gaussian density estimation for parameters μk , Σ, since the first term contains exactly the log-likelihood for MVG density estimation. We've
already done this, see the Gaussian distribution lesson.
Multinomial density estimation for class priors πk , since the second term holds exactly the log-likelihood for multinomial density estimation, see
the Multinomial distribution lesson.
Now group the data into separate classes and do MVG ML estimation for class-conditional parameters (we've done this before as well):
$$\hat{\mu}_k = \frac{\sum_n y_{nk}x_n}{\sum_n y_{nk}} = \frac{1}{m_k}\sum_n y_{nk}x_n$$

$$\begin{aligned}
\hat{\Sigma} &= \frac{1}{N}\sum_{n,k} y_{nk}(x_n-\hat{\mu}_k)(x_n-\hat{\mu}_k)^T\\
&= \sum_k\hat{\pi}_k\cdot\underbrace{\left(\frac{1}{m_k}\sum_n y_{nk}(x_n-\hat{\mu}_k)(x_n-\hat{\mu}_k)^T\right)}_{\text{class-cond. variance}}\\
&= \sum_k\hat{\pi}_k\cdot\hat{\Sigma}_k
\end{aligned}$$

where $\hat{\pi}_k$ ($= m_k/N$), $\hat{\mu}_k$ and $\hat{\Sigma}_k$ are the sample proportion, sample mean and sample variance for the $k$th class, respectively.
Note that the binary class selection variable ynk groups data from the same class.
Let's apply the trained model to predict the class for a given 'new' input $x_\bullet$:

$$p(\mathcal{C}_k\,|\,x_\bullet,\hat{\theta}) = \frac{\hat{\pi}_k\,\mathcal{N}(x_\bullet|\hat{\mu}_k,\hat{\Sigma})}{\sum_{k'}\hat{\pi}_{k'}\,\mathcal{N}(x_\bullet|\hat{\mu}_{k'},\hat{\Sigma})} = \frac{1}{Z}\exp\{\beta_k^T x_\bullet + \gamma_k\}$$

where

$$\begin{aligned}
\beta_k &= \hat{\Sigma}^{-1}\hat{\mu}_k\\
\gamma_k &= -\frac{1}{2}\hat{\mu}_k^T\hat{\Sigma}^{-1}\hat{\mu}_k + \log\hat{\pi}_k\\
Z &= \sum_{k'}\exp\{\beta_{k'}^T x_\bullet + \gamma_{k'}\}\,. \qquad\text{(normalization constant)}
\end{aligned}$$
The softmax function is a smooth approximation to the max-function. Note that we did not a priori specify a softmax posterior, but rather it followed
from applying Bayes rule to the prior and likelihood assumptions.
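As a sketch (with made-up estimates for a two-class, two-feature problem), the posterior class probabilities follow directly from the βk and γk expressions above:

```julia
using LinearAlgebra

# Made-up ML estimates for a 2-class generative model with shared covariance
π_hat = [0.6, 0.4]
μ_hat = [[1.0, 1.0], [2.5, 0.0]]
Σ_hat = [1.0 0.2; 0.2 1.5]

β = [Σ_hat \ μ_hat[k] for k in 1:2]                                      # βk = Σ̂⁻¹ μ̂k
γ = [-0.5*μ_hat[k]'*(Σ_hat \ μ_hat[k]) + log(π_hat[k]) for k in 1:2]     # γk

x_new = [1.8, 0.5]                                                       # a new input x∙
logits = [β[k]'*x_new + γ[k] for k in 1:2]
posterior = exp.(logits) ./ sum(exp.(logits))                            # softmax over classes
println(posterior)
```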
Discrimination Boundaries
The class log-posterior log p(Ck |x) ∝ βkT x + γk is a linear function of the input features.
Thus, the contours of equal probability (discriminant functions) are lines (hyperplanes) in the feature space
How to classify a new input $x_\bullet$? The Bayesian answer is a posterior distribution $p(\mathcal{C}_k|x_\bullet)$. If you must choose, then the class with maximum posterior class probability,

$$k^* = \arg\max_k p(\mathcal{C}_k|x_\bullet)\,,$$

is an appealing decision.
We'll apply the above results to solve the "apple or peach" example problem.
In [2]: # Make sure you run the data-generating code cell first
p(apple|x=x∙) = 0.7326326217040139
Recap Generative Classification
p(x, Ck | θ) = πk ⋅ N (x|μk , Σ)
If the class-conditional distributions are Gaussian with equal covariance matrices across classes (Σk = Σ), then the discriminant functions are
hyperplanes in feature space.
ML estimation for {πk , μk , Σ} in the GCM model breaks down to simple density estimation for Gaussian and multinomial/categorical distributions.
Discriminative Classification
Preliminaries
Goal
Introduction to discriminative classification models
Materials
Mandatory
These lecture notes
Optional
Bishop pp. 213 - 217 (Laplace approximation)
Bishop pp. 217 - 220 (Bayesian logistic regression)
T. Minka (2005), Discriminative models, not discriminative training
Our task will be the same as in the preceding class on (generative) classification. But this time, the class-conditional data distributions look very non-
Gaussian, yet the linear discriminative boundary looks easy enough:
In [1]: using Random; Random.seed!(1234);
# Generate dataset {(x1,y1),...,(xN,yN)}
# x is a 2-d feature vector [x_1;x_2]
# y ∈ {false,true} is a binary class label
# p(x|y) is multi-modal (mixture of uniform and Gaussian distributions)
using PyPlot
include("./scripts/lesson8_helpers.jl")
N = 200
X, y = genDataset(N) # Generate data set, collect in matrix X and vector y
X_c1 = X[:,findall(.!y)]'; X_c2 = X[:,findall(y)]' # Split X based on class label
X_test = [3.75; 1.0] # Features of 'new' data point
function plotDataSet()
plot(X_c1[:,1], X_c1[:,2], "bx", markersize=8)
plot(X_c2[:,1], X_c2[:,2], "r+", markersize=8, fillstyle="none")
plot(X_test[1], X_test[2], "ko")
xlabel(L"x_1"); ylabel(L"x_2");
legend([L"y=0", L"y=1",L"y=?"], loc=2)
xlim([-2;10]); ylim([-4, 8])
end
plotDataSet();
In discriminative classification, we model the posterior class probability

$$p(y_n\in\mathcal{C}_k\,|\,x_n)$$

directly, without any assumptions on the class densities.
We will work this idea out for a 2-class problem. Assume a data set is given by D = {(x1 , y1 ), … , (xN , yN )} with xn ∈ RM and yn ∈ {0, 1}.
What model should we use for the posterior distribution p(yn ∈ Ck |xn )?
Likelihood
In Logistic Regression, we take inspiration from the generative approach, where the softmax function "emerged" as the posterior. Here, we choose the
2-class softmax function (which is called the logistic function) with linear discrimination bounderies for the posterior class probability:
p(yn = 1 | xn , w) = σ(wT xn ) .
where

$$\sigma(a) = \frac{1}{1 + e^{-a}}$$

is the logistic function.
(Bishop fig.4.9). The logistic function σ(a) = 1/(1 + e−a ) (red), together with the scaled probit function Φ(λa), for λ2 = π/8 (in blue). We will
use this approximation later in the Laplace approximation.
Adding the other class ($y_n = 0$) leads to the following posterior class distribution:

$$\begin{aligned}
p(y_n\,|\,x_n,w) &= \mathrm{Bernoulli}\left(y_n\,|\,\sigma(w^T x_n)\right)\\
&= \sigma(w^T x_n)^{y_n}\left(1 - \sigma(w^T x_n)\right)^{(1-y_n)} \qquad\text{(B-4.89)}\\
&= \sigma\left((2y_n - 1)w^T x_n\right)
\end{aligned}$$

Note that for the 3rd equality, we have made use of the fact that $\sigma(-a) = 1 - \sigma(a)$.

Each of these three expressions in B-4.89 is equivalent. We mention all three notational options since they all appear in the literature.
For the data set D = {(x1 , y1 ), … , (xN , yN )}, the likelihood function for the parameters w is then given by
$$p(D|w) = \prod_{n=1}^N\sigma\left((2y_n-1)w^T x_n\right)$$
This choice for the class posterior is called logistic regression, in analogy to linear regression:

$$\begin{aligned}
p(y_n|x_n,w) &= \mathcal{N}(y_n\,|\,w^T x_n,\beta^{-1}) \qquad&\text{(linear regression)}\\
p(y_n=1\,|\,x_n,w) &= \sigma(w^T x_n) &\text{(logistic regression)}
\end{aligned}$$

In the discriminative approach, the parameters $w$ are not structured into $\{\mu,\Sigma,\pi\}$. In principle, they are "free" parameters for which we can choose any value that seems appropriate. This provides the discriminative approach with more flexibility than the generative approach.
Prior
p(w) = N (w | m0 , S0 ) (B-4.140)
Note that for generative classification, for the sake of simplicity, we used maximum likelihood estimation for the model parameters. In this lesson on
discriminative classification, we specify both a prior and likelihood function for the parameters w, which allows us to compute a Bayesian posterior for
the weights. In principle, we could have used Bayesian parameter estimation for the generative classification model as well (but the math is not suited
for an introductory lesson).
In the optional paper by T. Minka (2005), you can read how the model assumptions for discriminative classification can be re-interpreted as a special
generative model (this paper not for exam).
As an exercise, please check that for logistic regression with $p(y_n=1\,|\,x_n,w) = \sigma(w^T x_n)$, the discrimination boundary, which can be computed from

$$\frac{p(y_n\in\mathcal{C}_1|x_n)}{p(y_n\in\mathcal{C}_0|x_n)} \overset{!}{=} 1\,,$$

is a straight line; see the Exercises.
Parameter Inference
$$\underbrace{p(w\,|\,D)}_{\text{posterior}} \propto p(w)\,p(D|w) = \underbrace{\mathcal{N}(w\,|\,m_0,S_0)}_{\text{prior}}\cdot\underbrace{\prod_{n=1}^N\sigma\left((2y_n-1)w^T x_n\right)}_{\text{likelihood}} \qquad\text{(B-4.142)}$$
Unfortunately, the posterior p(w | D) is not Gaussian and the evidence p(D) is also not analytically computable. (We will deal with this later).
After substitution of p(w|D) from B-4.142, we have closed-form expressions for both the posterior p(w|D) and the predictive distribution
p(y∙ = 1 ∣ x∙ , D). Unfortunately, these expressions contain integrals that are not analytically computable.
Many methods have been developed to approximate the integrals in order to get analytical or numerical solutions. Here, we present the Laplace
approximation, which is one of the simplest methods with broad applicability to Bayesian calculations.
The central idea of the Laplace approximation is to approximate a (possibly unnormalized) distribution f(z) by a Gaussian distribution q(z) .
The Laplace approximation usually serves one or both of the following two purposes:
1. To approximate a posterior distribution without closed-form expression by a Gaussian distribution.
2. To approximate (part of) the integrand in an integral with purpose to get an analytical solution for the integral.
Example
(Bishop fig.4.14a). Laplace approximation (in red) to the distribution p(z) ∝ exp(−z 2 /2)σ(20z + 4), where σ(a) = 1/(1 + e−a ). The Laplace
approximation is centered on the mode of p(z).
Note that, if q(z) is a Gaussian distribution, then log q(z) is a second-order polynomial in z, so we will find the Gaussian by fitting a parabola to
log f(z).
estimation of mean

The mean ($z_0$) of $q(z)$ is placed on the mode of $\log f(z)$, i.e.,

$$z_0 = \arg\max_z\log f(z)$$

estimation of variance

Note that since $\nabla\log f(z) = \frac{1}{f(z)}\nabla f(z)$ and the gradient $\left.\nabla f(z)\right|_{z=z_0}$ vanishes at the mode $z=z_0$, we can (Taylor) expand $\log f(z)$ around $z=z_0$ as

$$\begin{aligned}
\log f(z) &\approx \log f(z_0) + \underbrace{\left(\nabla\log f(z_0)\right)^T(z-z_0)}_{=0\text{ at }z=z_0} + \ldots + \frac{1}{2}(z-z_0)^T\left(\nabla\nabla\log f(z_0)\right)(z-z_0)\\
&= \log f(z_0) - \frac{1}{2}(z-z_0)^T A(z-z_0) \qquad\text{(B-4.131)}
\end{aligned}$$

where the Hessian matrix $A$ is defined by

$$A = -\left.\nabla\nabla\log f(z)\right|_{z=z_0}$$

Taking the exponent then gives

$$f(z) \approx f(z_0)\exp\left(-\frac{1}{2}(z-z_0)^T A(z-z_0)\right)$$

so that (after normalization) the Gaussian approximation is

$$q(z) = \mathcal{N}\left(z\,|\,z_0, A^{-1}\right) \qquad\text{(B-4.134)}$$
All we have done up to now is approximate a function f(z) by a Gaussian q(z) . This procedure is called the Laplace Approximation. Often, the
required integrals (for Bayesian marginalization) can be approximately computed if we replace f(z) by q(z) .
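As a small worked sketch (following the shape of Bishop's Fig. 4.14 example, with a crude grid search and a finite-difference Hessian standing in for a proper optimizer), we can compute a Laplace approximation to f(z) ∝ exp(−z²/2)σ(20z+4):

```julia
σ(a) = 1 / (1 + exp(-a))
logf(z) = -z^2/2 + log(σ(20z + 4))          # log of the unnormalized target density

# Find the mode z0 by a crude grid search (a gradient-based optimizer would also do)
zs = range(-3, 3, length=10_000)
z0 = zs[argmax(logf.(zs))]

# Hessian A = -d²/dz² log f(z) at z0, via a central finite difference
h = 1e-4
A = -(logf(z0+h) - 2*logf(z0) + logf(z0-h)) / h^2

println("Laplace approximation: q(z) = N(z | $z0, $(1/A))")
```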
Let's get back to the challenge of computing the predictive class distribution (B-4.145) for Bayesian logistic regression. We first work out the Gaussian
Laplace approximation q(w) to the posterior weight distribution
$$\underbrace{p(w|D)}_{\text{posterior}} \propto \underbrace{\mathcal{N}(w\,|\,m_0,S_0)}_{\text{prior}}\cdot\underbrace{\prod_{n=1}^N\sigma\left((2y_n-1)w^T x_n\right)}_{\text{likelihood}} \qquad\text{(B-4.142)}$$
Since we have a differentiable expression for log p(w|D), it is straightforward to compute the gradient and Hessian (for proof, see optional slide):
We can now use the gradient $\nabla_w\log p(w|D)$ to find the mode $w_N$ of $\log p(w|D)$ (e.g., by some gradient-based optimization procedure) and then use the Hessian $\nabla\nabla_w\log p(w|D)$ to get the variance of $q(w)$, leading to a Gaussian approximate weights posterior

$$q(w) = \mathcal{N}(w\,|\,w_N,S_N) \qquad\text{(B-4.144)}$$

with $S_N^{-1} = -\left.\nabla\nabla_w\log p(w|D)\right|_{w=w_N}$.
In the analytically unsolveable expressions for evidence and the predictive distribution (estimating the class of a new observation), we proceed with
using the Laplace approximation to the weights posterior. For a new observation x∙ , the class probability is now
$$\begin{aligned}
p(y_\bullet=1\,|\,x_\bullet,D) &= \int p(y_\bullet=1\,|\,x_\bullet,w)\,p(w|D)\,\mathrm{d}w\\
&\approx \int p(y_\bullet=1\,|\,x_\bullet,w)\cdot\underbrace{q(w)}_{\text{Gaussian}}\,\mathrm{d}w\\
&= \int\sigma(w^T x_\bullet)\cdot\mathcal{N}(w\,|\,w_N,S_N)\,\mathrm{d}w \qquad\text{(B-4.145)}
\end{aligned}$$
This looks better but we need two more clever tricks to evaluate this expression.
1. First, note that w appears in σ(wT x∙ ) as an inner product, so through substitution of a := wT x∙ , the expression simplifies to an integral over
the scalar a (see Bishop for derivation):
$$p(y_\bullet=1\,|\,x_\bullet,D) \approx \int\sigma(a)\,\mathcal{N}(a\,|\,\mu_a,\sigma_a^2)\,\mathrm{d}a \qquad\text{(B-4.151)}$$

where

$$\begin{aligned}
\mu_a &= w_N^T x_\bullet \qquad&\text{(B-4.149)}\\
\sigma_a^2 &= x_\bullet^T S_N x_\bullet &\text{(B-4.150)}
\end{aligned}$$
2. Secondly, while the integral of the product of a logistic function with a Gaussian is not analytically solvable, the integral of the product of a Gaussian cumulative distribution function (CDF, also known as the probit function) with a Gaussian does have a closed-form solution. Fortunately,

$$\Phi(\lambda a) \approx \sigma(a)$$

with the Gaussian CDF $\Phi(x) = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^x e^{-t^2/2}\,\mathrm{d}t$, $\lambda^2 = \pi/8$ and $\sigma(a) = 1/(1+e^{-a})$. Thus, substituting $\Phi(\lambda a)$ with $\lambda^2 = \pi/8$ for $\sigma(a)$ leads to

$$\begin{aligned}
p(y_\bullet=1\,|\,x_\bullet,D) &\approx \int\underbrace{\Phi(\lambda a)}_{\text{probit function}}\cdot\underbrace{\mathcal{N}(a\,|\,\mu_a,\sigma_a^2)}_{\text{Gaussian}}\,\mathrm{d}a\\
&= \Phi\left(\frac{\mu_a}{\sqrt{\lambda^{-2}+\sigma_a^2}}\right) \qquad\text{(B-4.152)}
\end{aligned}$$
We now have an approximate but closed-form expression for the predictive class distribution for a new observation with a Bayesian logistic regression model.
Note that, by Eq. B-4.143, the variance $S_N$ (and consequently $\sigma_a^2$) for the weight vector depends on the distribution of the training set. Large uncertainty about the weights (in areas with little training data and uninformative prior variance $S_0$) increases $\sigma_a^2$ and takes the posterior class probability eq. B-4.152 closer to 0.5. Does that make sense?
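The closed-form expression B-4.152 is easy to evaluate; the sketch below uses made-up values for wN, SN and x∙:

```julia
using Distributions, LinearAlgebra

Φ(x) = cdf(Normal(), x)                    # Gaussian CDF (probit function)

w_N = [1.0, -0.5]                          # made-up Laplace posterior mean
S_N = [0.5 0.1; 0.1 0.8]                   # made-up Laplace posterior covariance
x_new = [0.4, 1.2]                         # new input x∙

μ_a  = dot(w_N, x_new)                     # B-4.149
σ2_a = x_new' * S_N * x_new                # B-4.150
λ2   = π/8
p_class1 = Φ(μ_a / sqrt(1/λ2 + σ2_a))      # B-4.152
println("p(y∙=1 | x∙, D) ≈ $p_class1")
```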
Apparently, the Laplace approximation leads to closed-form solutions for Bayesian logistic regression (although admittedly, the derivation is no walk in the park).
Exam guide: The derivation of closed-form expression eq. B-4.152 for the predictive class distribution requires clever tricks and is therefore not
something that you should be able to reproduce at the exam without assistance. You should understand the Laplace Approximation though and be
able to work out simpler examples.
Rather than the computationally involved Laplace approximation for Bayesian inference, in practice, discriminative classification is often executed
through maximum likelihood estimation.
With the usual 1-of-K encoding scheme for classes (ynk = 1 if xn ∈ Ck , otherwise ynk = 0), the log-likelihood for a K-dimensional discriminative
classifier is
$$\begin{aligned}
\mathrm{L}(\theta) &= \log\prod_n\prod_k p(\mathcal{C}_k|x_n,\theta)^{y_{nk}}\\
&= \log\prod_n\prod_k\Bigg(\underbrace{\frac{e^{\theta_k^T x_n}}{\sum_j e^{\theta_j^T x_n}}}_{\text{softmax function}}\Bigg)^{y_{nk}}\\
&= \sum_{n,k} y_{nk}\log\left(\frac{e^{\theta_k^T x_n}}{\sum_j e^{\theta_j^T x_n}}\right)
\end{aligned}$$
Computing the gradient ∇θk L(θ) leads to (for proof, see optional slide below)
$$\nabla_{\theta_k}\mathrm{L}(\theta) = \sum_n\underbrace{\Bigg(\underbrace{y_{nk}}_{\text{target}} - \underbrace{\frac{e^{\theta_k^T x_n}}{\sum_j e^{\theta_j^T x_n}}}_{\text{prediction}}\Bigg)}_{\text{prediction error}}\cdot x_n$$
Compare this to the gradient for (ML) linear regression:

$$\nabla_\theta\mathrm{L}(\theta) = \sum_n\left(y_n - \theta^T x_n\right)x_n$$

In both cases, the gradient is a sum of prediction errors multiplied by the inputs $x_n$.
The parameter vector $\theta$ for logistic regression can be estimated through iterative gradient-based adaptation. E.g. (with iteration index $i$),

$$\hat{\theta}^{(i+1)} = \hat{\theta}^{(i)} + \eta\cdot\left.\nabla_\theta\mathrm{L}(\theta)\right|_{\theta=\hat{\theta}^{(i)}}$$
Note that, while in the Bayesian approach we get to update θ with Kalman-gain-weighted prediction errors (which is optimal), in the maximum
likelihood approach, we weigh the prediction errors with input values (which is less precise).
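A minimal gradient-ascent sketch for the 2-class case (synthetic data and a fixed step size η, both made up; not the course's optimisation-library-based code):

```julia
σ(a) = 1 / (1 + exp(-a))

# Synthetic 2-class data (made up): class 0 centered at the origin, class 1 shifted
X = vcat(randn(100, 2), randn(100, 2) .+ 2.0)     # 200×2 feature matrix
X = hcat(X, ones(200))                            # append a constant for the offset
y = vcat(zeros(100), ones(100))                   # class labels

w = zeros(3)                                      # initial weights
η = 0.01                                          # learning rate (assumed)
for i in 1:500
    global w                                      # script-safe global update
    pred = σ.(X*w)                                # p(y_n = 1 | x_n, w)
    w = w .+ η .* (X' * (y .- pred))              # ∇_w L = Σ_n (y_n − prediction_n)·x_n
end
println("ML weights: $w")
```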
Let us perform ML estimation of w on the data set from the introduction. To allow an offset in the discrimination boundary, we add a constant 1 to the
feature vector x. We only have to specify the (negative) log-likelihood and the gradient w.r.t. w. Then, we use an off-the-shelf optimisation library to
minimize the negative log-likelihood.
We plot the resulting maximum likelihood discrimination boundary. For comparison we also plot the ML discrimination boundary obtained from the
code example in the generative Gaussian classifier lesson.
x_test = [3.75;1.0]
println("P(C1|x•,θ) = $(p_1(x_test))")
P(C1|x•,θ) = 0.4016963543060162
The generative model gives a bad result because the feature distribution of one class is clearly non-Gaussian: the model does not fit the data well.
The discriminative approach does not suffer from this problem because it makes no assumptions about the feature distribution $p(x|y)$; it just estimates the conditional class distribution $p(y|x)$ directly.
Recap Classification
Generative: like density estimation, model the joint probability via $p(\mathcal{C}_k)p(x|\mathcal{C}_k) = \pi_k\mathcal{N}(x|\mu_k,\Sigma)$; this leads to a softmax posterior class probability, with parameters that can be estimated in one shot (closed-form ML).

Discriminative: like (linear) regression, model the conditional $p(\mathcal{C}_k|x,\theta)$ directly; we also choose a softmax posterior class probability, but now with 'free' parameters $\theta$.
OPTIONAL SLIDES
$$\underbrace{p(w|D)}_{\text{posterior}} \propto \underbrace{\mathcal{N}(w\,|\,m_0,S_0)}_{\text{prior}}\cdot\underbrace{\prod_{n=1}^N\sigma\big(\underbrace{(2y_n-1)w^T x_n}_{a_n}\big)}_{\text{likelihood}} \qquad\text{(B-4.142)}$$

from which it follows that

$$\log p(w|D) \propto -\frac{1}{2}\log|S_0| - \frac{1}{2}(w-m_0)^T S_0^{-1}(w-m_0) + \sum_n\log\sigma(a_n)$$

$$\begin{aligned}
\nabla_w\log p(w|D) &\propto S_0^{-1}(m_0-w) + \sum_n\frac{1}{\sigma(a_n)}\frac{\partial\sigma(a_n)}{\partial a_n}\frac{\partial a_n}{\partial w} \qquad\text{(SRM-5b)}\\
&= S_0^{-1}(m_0-w) + \sum_n(2y_n-1)(1-\sigma(a_n))\,x_n \qquad\text{(gradient)}
\end{aligned}$$

where we used $\sigma'(a) = \sigma(a)\cdot(1-\sigma(a))$.

For the Hessian, we continue to differentiate the transpose of the gradient, leading to

$$\begin{aligned}
\nabla\nabla_w\log p(w|D) &= \nabla_w\left(S_0^{-1}(m_0-w)\right)^T - \sum_n(2y_n-1)x_n\nabla_w\sigma(a_n)^T\\
&= -S_0^{-1} - \sum_n(2y_n-1)x_n\cdot\underbrace{\sigma(a_n)\cdot(1-\sigma(a_n))\cdot(2y_n-1)x_n^T}_{\frac{\partial\sigma(a_n)}{\partial a_n^T}\frac{\partial a_n^T}{\partial w}}\\
&= -S_0^{-1} - \sum_n\sigma(a_n)\cdot(1-\sigma(a_n))\,x_n x_n^T \qquad\text{(Hessian)}
\end{aligned}$$

where the last step uses $(2y_n-1)^2 = 1$.
The log-likelihood is

$$\mathrm{L}(\theta) = \log\prod_n\prod_k\underbrace{p(\mathcal{C}_k|x_n,\theta)}_{p_{nk}}{}^{y_{nk}} = \sum_{n,k} y_{nk}\log p_{nk}$$

Use the fact that the softmax $\phi_k \equiv e^{a_k}/\sum_j e^{a_j}$ has analytical derivative

$$\frac{\partial\phi_k}{\partial a_j} = \phi_k(\delta_{kj} - \phi_j)\,,$$

so that

$$\begin{aligned}
\nabla_{\theta_j}\mathrm{L}(\theta) &= \sum_{n,k}\frac{\partial\mathrm{L}_{nk}}{\partial p_{nk}}\cdot\frac{\partial p_{nk}}{\partial a_{nj}}\cdot\frac{\partial a_{nj}}{\partial\theta_j}\\
&= \sum_{n,k}\frac{y_{nk}}{p_{nk}}\cdot p_{nk}(\delta_{kj} - p_{nj})\cdot x_n\\
&= \sum_n\left(y_{nj} - p_{nj}\right)\cdot x_n\\
&= \sum_n\Bigg(\underbrace{y_{nj}}_{\text{target}} - \underbrace{\frac{e^{\theta_j^T x_n}}{\sum_{j'}e^{\theta_{j'}^T x_n}}}_{\text{prediction}}\Bigg)\cdot x_n
\end{aligned}$$
Preliminaries
Goal
Introduction to latent variable models and variational inference by Free energy minimization
Materials
Mandatory
These lecture notes
Ariel Caticha (2010), Entropic Inference
a tutorial on entropic inference, which is a generalization of Bayes rule and provides a foundation for variational inference.
Optional
Bishop (2016), pp. 461-486 (sections 10.1, 10.2 and 10.3)
references
Blei et al. (2017), Variational Inference: A Review for Statisticians
Lanczos (1961), The variational principles of mechanics
Senoz et al. (2021), Variational Message Passing and Local Constraint Manipulation in Factor Graphs
Dauwels (2007), On variational message passing on factor graphs
Caticha (2010), Entropic Inference
Shore and Johnson (1980), Axiomatic Derivation of the Principle of Maximum Entropy and the Principle of Minimum Cross-Entropy
Illustrative Example : Density Modeling for the Old Faithful Data Set
You're now asked to build a density model for a data set (Old Faithful, Bishop pg. 681) that clearly is not distributed as a single Gaussian:
Unobserved Classes
Classification problems with unobserved classes are called Clustering problems. The learning algorithm needs to discover the classes from the observed data.
The spread of the data in the illustrative example looks like it could be modeled by two Gaussians. Let's develop a model for this data set.
Let D = {xn } be a set of observations. We associate a one-hot coded hidden class label zn with each observation:
Again, this is the same model as we defined for the generative classification model (a Gaussian-Categorical model), but now with unobserved classes.

This model (with unobserved class labels) is known as a Gaussian Mixture Model (GMM).
The Marginal Distribution for the GMM
In the literature, the GMM is often introduced by the marginal distribution for an observed data point xn , given by
$$p(x_n) = \sum_{z_n} p(x_n,z_n) = \sum_{k=1}^K \pi_k\cdot\mathcal{N}(x_n\,|\,\mu_k,\Sigma_k) \qquad\text{(B-9.12)}$$
Note that Eq. B-9.12 is not the generative model. The generative model is the joint distribution p(x, z) over all variables, including the latent variables.
Eq. B-9.12 reveals the link to the name Gaussian mixture model. The priors πk are also called mixture coefficients.
GMMs are very popular models. They have decent computational properties and are universal approximators of densities (as long as there are
enough Gaussians of course)
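A minimal sketch of Eq. B-9.12 for a two-component mixture in 1-D (all parameter values made up):

```julia
using Distributions

π_k = [0.3, 0.7]                                    # mixture coefficients
components = [Normal(-2.0, 0.8), Normal(1.5, 1.2)]  # Gaussian components (made-up parameters)

# Marginal density p(x) = Σ_k π_k N(x | μ_k, Σ_k), Eq. B-9.12
p(x) = sum(π_k[k] * pdf(components[k], x) for k in 1:2)
println(p(0.0))

# Equivalently, Distributions.jl provides MixtureModel for the same marginal
gmm = MixtureModel(components, π_k)
println(pdf(gmm, 0.0))
```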
(In the above figure, the Gaussian components are shown in red and the pdf of the mixture models in blue).
The GMM contains both observed variables {xn }, (unobserved) parameters θ = {πk , μk , Σk } and unobserved (synonym: latent, hidden) variables
{znk }.
From a Bayesian viewpoint, both latent variables {znk } and parameters θ are just unobserved variables for which we can set a prior and compute a
posterior by Bayes rule.
Note that znk has a subscript n, hence its value depends not only on the cluster (k) but also on the n-th observation (in contrast to parameters).
These observation-dependent latent variables are generally a useful tool to encode structure in the model. Here (in the GMM), the latent variables
{znk } encode (unobserved) class membership.
By adding model structure through (equations among) latent variables, we can build very complex models. Unfortunately, inference in latent variable
models can also be much more complex than for fully observed models.
We recall here the log-likelihood for the Gaussian-Categorical Model (see the generative classification lesson):

$$\log p(D|\theta) = \underbrace{\sum_{n,k} y_{nk}\log\mathcal{N}(x_n|\mu_k,\Sigma)}_{\text{Gaussian}} + \underbrace{\sum_{n,k} y_{nk}\log\pi_k}_{\text{multinomial}}\,.$$

Since the class labels $y_{nk}\in\{0,1\}$ were given, this expression decomposed into a set of simple updates for the Gaussian and multinomial distributions. For both distributions, we have conjugate priors, so inference is easily solved.
However, for the Gaussian mixture model (same log-likelihood function with $z_{nk}$ replacing $y_{nk}$), the class labels $\{z_{nk}\}$ are unobserved and need to be estimated alongside the parameters.

There is no conjugate prior on the latent variables for the GMM likelihood function $p(\underbrace{D}_{\{x_n\}}\,|\,\underbrace{\{z_{nk}\},\{\mu_k,\Sigma_k,\pi_k\}}_{\text{all latent variables}})$.
In this lesson, we introduce an approximate Bayesian inference method known as Variational Bayes (VB) (also known as Variational Inference) that
can be used for Bayesian inference in models with latent variables. Later in this lesson, we will use VB to do inference in the GMM.
As a motivation for variational inference, we first discuss the foundation of inference by the Method of Maximum Relative Entropy, (Caticha, 2010).
In the probability theory lesson, we recognized Bayes rule as the fundamental rule for learning from data.
We will now generalize this notion and consider learning as a process that updates a prior into a posterior distribution whenever new information
becomes available.
In this context, new information is not necessarily a new observation, but could (for instance) also relate to adding constraints on the posterior
distribution.
Our prior beliefs about z are represented by p(z). In the following, we will write q(z) to denote a posterior distribution for z.
We might be interested in obtaining rational posterior beliefs q(z) in consideration of the following additional constraints:
1. Data Constraints: e.g., two observed data points x1 = 5 and x2 = 3, which lead to constraints q(x1 ) = δ(x1 − 5) and
q(x2 ) = δ(x2 − 3).
2. Form Constraints: e.g., we only consider the Gamma distribution for q(z) = Gam(z|α, β).
3. Factorization Constraints:, e.g., we consider independent marginals for the posterior distribution: q(z) = ∏k q(zk).
4. Moment Constraints: e.g., the first moment of the posterior is given by ∫ zq(z)dz = 3.
Note that this is not "just" applying Bayes rule to compute a posterior, which can only deal with data constraints as specified by the likelihood function.
Note also that observations can be rephrased as constraints on the posterior, e.g., observation x1 = 5 can be phrased as a constraint
q(x1 ) = δ(x1 − 5).
⇒ Updating a prior to a posterior on the basis of constraints on the posterior is more general than updating based on observations alone.
Caticha (2010) (based on earlier work by Shore and Johnson (1980)) developed the method of maximum (relative) entropy for rational updating of
priors to posteriors when faced with new information in the form of constraints.
Consider prior beliefs p(z) about z. New information in the form of constraints is obtained and we are interested in the "best update" to a posterior
q(z).
In order to define what "best update" means, we assume a ranking function S[q, p] that generates a preference score for each candidate posterior q
for a given p. The best update from p to q is identified as q ∗ = arg max q S[q, p].
Similarly to Cox's method for deriving probability theory, Caticha introduced the following mild criteria, based on a rational principle (the principle of minimal updating, see Caticha 2010), that the ranking function needs to adhere to:
1. Locality: local information has local effects.
2. Coordinate invariance: the system of coordinates carries no information.
3. Independence: When systems are known to be independent, it should not matter whether they are treated separately or jointly.
These three criteria uniquely identify the Relative Entropy (RE) as the proper ranking function:
$$
S[q,p] = -\sum_z q(z)\log\frac{q(z)}{p(z)}
$$
⇒ When information is supplied in the form of constraints on the posterior, we should select the posterior that maximizes the Relative Entropy, subject to the constraints. This is the Method of Maximum (Relative) Entropy (MRE).
Note that the Relative Entropy is technically a functional, i.e., a function of function(s) (since it is a function of probability distributions). The calculus of
functionals is also called variational calculus.
See Lanczos, The variational principles of mechanics (1961) for a great book on variational calculus. For a summary, see Bishop (2016), app.D.
It is customary to use square brackets (like S[q, p]) for functionals rather than round brackets, which we use for functions.
Also note the complementary relation between the Maximum Relative Entropy method and Probability Theory:
PT describes how to represent beliefs about events and how to calculate beliefs about joint and conditional events.
In contrast, the MRE method describes how to update beliefs when new information becomes available.
PT and the MRE method are both needed to describe the theory of optimal information processing. (As we will see below, Bayes rule is a special case of
the MRE method.)
In principle, entropies can always be considered as a relative score against a reference distribution. For instance, the score $-\sum_{z_i} q(z_i)\log q(z_i)$ can be interpreted as a score against the uniform distribution, i.e.,
$$
-\sum_{z_i} q(z_i)\log q(z_i) \propto -\sum_{z_i} q(z_i)\log\frac{q(z_i)}{\mathrm{Uniform}(z_i)}\,.
$$
Therefore, the "method of maximum relative entropy" is often simply known as the "method of maximum entropy".
The negative relative entropy is known as the Kullback-Leibler (KL) divergence:
$$
\mathrm{KL}[q,p] \triangleq \sum_z q(z)\log\frac{q(z)}{p(z)} \qquad \text{(B-1.113)}
$$
The Gibbs inequality (proof) is a famous theorem in information theory that states that
$$\mathrm{KL}[q,p] \geq 0$$
with equality iff $p=q$.
The KL divergence can be interpreted as a "distance" between two probability distributions. Note however that the KL divergence is an asymmetric
distance measure, i.e., in general KL[q, p] ≠ KL[p, q].
We also introduce here the notion of a (variational) Free Energy (FE) functional, which is just a generalization of the KL divergence that allows the prior to be unnormalized. Let $f(z) = Z \cdot p(z)$ be an unnormalized distribution, then the FE is defined as
$$\begin{align*}
F[q,p] &\triangleq \sum_z q(z)\log\frac{q(z)}{f(z)} \\
&= \sum_z q(z)\log\frac{q(z)}{Z\cdot p(z)} \\
&= \underbrace{\sum_z q(z)\log\frac{q(z)}{p(z)}}_{\text{KL divergence}\,\geq\,0} - \log Z
\end{align*}$$
Since the second term (log Z ) is constant, FE minimization (subject to constraints) with respect to q leads to the same posteriors as KL minimization
and RE maximization.
If we are only interested in minimizing FE with respect to q, then we'll write F[q] (rather than F[q, p]) for brevity.
In this class, we prefer to discuss inference in terms of minimizing Free Energy (subject to the constraints) rather than maximizing Relative Entropy, but
note that these two concepts are equivalent.
Example KL divergence for Gaussians
The following animation illustrates the KL-divergence between two Gaussian distributions.
In [1]: using Distributions, StatsPlots, Plots.PlotMeasures, LaTeXStrings
        μ_p = 0   # statistics of the distribution we keep constant (we'll vary the mean of q); feel free to change these
        σ_p = 1
        σ_q = 1
        p = Normal(μ_p, σ_p)
This is an animation. If you are viewing these notes in PDF format you can see the animation at
https://nbviewer.org/github/bertdv/BMLIP/blob/master/lessons/notebooks/Latent-Variable-Models-and-VB.ipynb
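Since the animation itself is not visible in this printout, the following small sketch (not part of the original notebook) computes the same quantity in closed form: the KL divergence between two univariate Gaussians $q=\mathcal{N}(\mu_q,\sigma_q^2)$ and $p=\mathcal{N}(\mu_p,\sigma_p^2)$ as a function of $\mu_q$.

    # KL[q, p] for univariate Gaussians (closed form):
    # KL = log(σ_p/σ_q) + (σ_q² + (μ_q − μ_p)²)/(2σ_p²) − 1/2
    μ_p, σ_p, σ_q = 0.0, 1.0, 1.0
    kl(μ_q) = log(σ_p/σ_q) + (σ_q^2 + (μ_q - μ_p)^2)/(2σ_p^2) - 1/2
    for μ_q in -2.0:1.0:2.0
        println("μ_q = $μ_q   KL[q,p] = $(kl(μ_q))")   # grows quadratically as q moves away from p
    end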
Let's get back to the issue of doing inference for models with latent variables (such as the GMM).
Consider a generative model specified by p(x, z) where x and z represent the observed and unobserved variables, respectively.
Assume further that $x$ has been observed as $x=\hat{x}$ and we are interested in performing inference, i.e., we want to compute the posterior $p(z|x=\hat{x})$ for the latent variables and we want to compute the evidence $p(x=\hat{x})$.
According to the Method of Maximum Relative Entropy, in order to find the correct posterior q(x, z), we should minimize
$$
F[q] = \sum_{x,z} q(x,z)\log\frac{q(x,z)}{p(x,z)}\,,\qquad \text{subject to the data constraint } x=\hat{x}\,.
$$
The data constraint fixes $q(x)=\delta(x-\hat{x})$, so the FE decomposes as
$$
F[q] = \underbrace{\sum_z q(z|\hat{x})\log\frac{q(z|\hat{x})}{p(z|\hat{x})}}_{\text{KL divergence}\,\geq\,0} - \log p(\hat{x})\,.
$$
The log-evidence term does not depend on q. Hence, the global minimum $q^*(z|\hat{x}) \triangleq \arg\min_q F[q]$ is obtained for
$$
q^*(z|\hat{x}) = p(z|\hat{x})\,,
$$
which is the correct Bayesian posterior.
Furthermore, the minimal free energy $F[q^*] = -\log p(\hat{x})$ equals minus log-evidence. (Or, equivalently, the evidence is given by $p(\hat{x}) = \exp(-F[q^*])$.)
This is an amazing result! Note that FE minimization transforms an inference problem (that involves integration) to an optimization problem! Generally,
optimization problems are easier to solve than integration problems.
Executing inference by minimizing the variational FE functional is called Variational Bayes (VB) or variational inference.
(As an aside), note that Bishop introduces in Eq. B-10.2 an Evidence Lower BOund (in modern machine learning literature abbreviated as ELBO) L(z)
that equals the negative FE. We prefer to discuss variational inference in terms of a free energy (but it is the same story as he discusses with the ELBO).
In the rest of this lesson, we are concerned with how to execute FE minimization (FEM) w.r.t q for the functional
$$
F[q] = \sum_z q(z)\log\frac{q(z)}{p(z|x)} - \log p(x)
$$
where $x$ has been observed and $z$ represents all latent variables.
To keep the notation uncluttered, in the following we write $x$ rather than $\hat{x}$, and we write simply $q(z)$ (rather than $q(z|\hat{x})$) for the posterior.
Due to restrictions in our optimization algorithm, we are often unable to fully minimize the FE to the global minimum $q^*(z)$, but rather get stuck in a local minimum $\hat{q}(z)$.
Note that, since $\mathrm{KL}[q(z),p(z|x)] \geq 0$ for any $q(z)$, the FE is always an upper bound on (minus) log-evidence, i.e.,
$$F[q] \geq -\log p(x)\,.$$
In practice, even if we cannot attain the global minimum, we can still use the local minimum $\hat{q}(z) \approx \arg\min_q F[q]$ to accomplish approximate Bayesian inference:
posterior: $\hat{q}(z) \approx p(z|x)$
evidence: $p(x) \approx \exp(-F[\hat{q}])$
Constrained FE Minimization
Generally speaking, it can be very helpful in the FE minimization task to add some additional constraints on q(z) (i.e., in addition to the data
constraints).
Once we add constraints to q (in addition to data constraints), we are no longer guaranteed that the minimum of the (constrained) FE coincides with
the solution by Bayes rule (which only takes data as constraints). So again, at best constrained FEM is only an approximation to Bayes rule.
There are three important cases of adding constraints to q(z) that often alleviate the FE minimization task:
1. mean-field factorization
Variational inference with mean-field factorization has been worked out in detail as the Coordinate Ascent Variational Inference (CAVI)
algorithm. See the Optional Slide on CAVI for details.
2. fixed-form parameterization
$$q(z) = \mathcal{N}(z|\mu,\Sigma)\,.$$
In this case, the functional minimization problem for $F[q]$ reduces to the minimization of a function $F(\mu,\Sigma)$ w.r.t. its parameters. We can often use standard gradient-based optimization methods to minimize the FE (see the sketch after this list).
The following image by David Blei illustrates this approach
3. the EM algorithm
We place some constraints both on the prior and the posterior for z (also to be discussed in an Optional Slide) that simplify FE minimization to maximum-likelihood estimation.
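As a concrete illustration of the fixed-form case, here is a minimal sketch (the model and all numbers are our own assumptions): we postulate $q(z)=\mathcal{N}(m,s^2)$ for a conjugate toy model $p(z)=\mathcal{N}(0,1)$, $p(x|z)=\mathcal{N}(z,1)$ with one observation, and minimize the free energy $F(m,s)$ by plain gradient descent. The exact posterior is $\mathcal{N}(x/2, 1/2)$, which the procedure recovers.

    # Fixed-form variational inference by gradient descent on F(m, s) (illustrative sketch)
    x = 2.0                                    # observed data point
    # Free energy of q(z) = N(m, s²) for p(z) = N(0,1), p(x|z) = N(z,1), up to constants:
    F(m, s) = -log(s) + 0.5*(m^2 + s^2) + 0.5*((x - m)^2 + s^2)
    dFdm(m, s) = 2m - x                        # analytic gradients
    dFds(m, s) = -1/s + 2s
    m, s = 0.0, 1.0                            # initialize the variational parameters
    for i in 1:2000
        global m -= 0.01 * dFdm(m, s)
        global s -= 0.01 * dFds(m, s)
    end
    println("q(z) ≈ N($(round(m, digits=3)), $(round(s^2, digits=3)))")   # ≈ N(1.0, 0.5)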
Let's get back to the illustrative example at the beginning of this lesson: we want to do density modeling for the Old Faithful data set.
model specification
$$
p(x,z|\theta) = p(x|z,\mu,\Lambda)\,p(z|\pi) = \prod_{n=1}^N\prod_{k=1}^K \left(\pi_k\cdot\mathcal{N}(x_n|\mu_k,\Lambda_k^{-1})\right)^{z_{nk}} \qquad \text{(B-10.37,38)}
$$
Let us introduce some priors for the parameters. We factorize the prior and choose conjugate distributions, i.e., a Dirichlet prior $p(\pi)=\mathrm{Dir}(\pi|\alpha_0)$ and Gaussian-Wishart priors $p(\mu_k|\Lambda_k)\,p(\Lambda_k) = \mathcal{N}\!\left(\mu_k|m_0,(\beta_0\Lambda_k)^{-1}\right)\mathcal{W}(\Lambda_k|W_0,\nu_0)$ (cf. B-10.39, B-10.40), so that the full generative model becomes
$$
p(x,z,\pi,\mu,\Lambda) = p(x|z,\mu,\Lambda)\,p(z|\pi)\,p(\pi)\,p(\mu|\Lambda)\,p(\Lambda) \qquad \text{(B-10.41)}
$$
with hyperparameters $\{\alpha_0, m_0, \beta_0, W_0, \nu_0\}$.
inference
Assume that we have observed $D = \{x_1, x_2, \ldots, x_N\}$ and are interested in inferring the posterior distribution for the tuning parameters:
$$
p(\theta|D) = \frac{p(x=D,\theta)}{p(x=D)} = \frac{\sum_z p(x=D,z,\pi,\mu,\Lambda)}{\sum_z \iiint p(x=D,z,\pi,\mu,\Lambda)\,\mathrm{d}\pi\,\mathrm{d}\mu\,\mathrm{d}\Lambda}
$$
but this is intractable (See Blei (2017), p861, eqs. 8 and 9).
Bishop shows that the equations for the optimal solutions (Eq. B-10.9) are analytically solvable for the GMM as specified above, leading to the following
variational update equations (for k = 1, … , K ):
$$\begin{align*}
\alpha_k &= \alpha_0 + N_k &&\text{(B-10.58)}\\
\beta_k &= \beta_0 + N_k &&\text{(B-10.60)}\\
m_k &= \frac{1}{\beta_k}\left(\beta_0 m_0 + N_k\bar{x}_k\right) &&\text{(B-10.61)}\\
W_k^{-1} &= W_0^{-1} + N_k S_k + \frac{\beta_0 N_k}{\beta_0 + N_k}\left(\bar{x}_k - m_0\right)\left(\bar{x}_k - m_0\right)^T &&\text{(B-10.62)}\\
\nu_k &= \nu_0 + N_k &&\text{(B-10.63)}
\end{align*}$$
where we used
$$\begin{align*}
\log\rho_{nk} &= \mathrm{E}\left[\log\pi_k\right] + \frac{1}{2}\mathrm{E}\left[\log|\Lambda_k|\right] - \frac{D}{2}\log(2\pi) - \frac{1}{2}\mathrm{E}\left[(x_n-\mu_k)^T\Lambda_k(x_n-\mu_k)\right] &&\text{(B-10.46)}\\
r_{nk} &= \frac{\rho_{nk}}{\sum_{j=1}^K \rho_{nj}} &&\text{(B-10.49)}\\
N_k &= \sum_{n=1}^N r_{nk} &&\text{(B-10.51)}\\
\bar{x}_k &= \frac{1}{N_k}\sum_{n=1}^N r_{nk}\,x_n &&\text{(B-10.52)}\\
S_k &= \frac{1}{N_k}\sum_{n=1}^N r_{nk}\left(x_n-\bar{x}_k\right)\left(x_n-\bar{x}_k\right)^T &&\text{(B-10.53)}
\end{align*}$$
Exam guide: Working out FE minimization for the GMM to these update equations (eqs B-10.58 through B-10.63) is not something that you need to
reproduce without assistance at the exam. Rather, the essence is that it is possible to arrive at closed-form variational update equations for the GMM.
You should understand though how FEM works conceptually and in principle be able to derive variational update equations for very simple models that
do not involve clever mathematical tricks.
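To make the role of these quantities concrete, here is a minimal sketch (our own, assuming Bishop's expressions B-10.64 through B-10.66 for the required expectations) of how the responsibilities $r_{nk}$ can be computed from the current variational parameters. The course notebook below relies on its own helper functions (updateR, updateMeanPrecisionPi) for these steps.

    using LinearAlgebra, SpecialFunctions

    # Compute responsibilities r_nk (B-10.46, B-10.49) from variational parameters (illustrative sketch).
    # X is a D×N data matrix; m is D×K; W is a vector of K (D×D) matrices; α, β, ν are length-K vectors.
    function responsibilities(X, α, β, m, W, ν)
        D, N = size(X); K = length(α)
        logρ = zeros(N, K)
        for k in 1:K
            Elogπ = digamma(α[k]) - digamma(sum(α))                                        # E[log π_k]
            ElogΛ = sum(digamma((ν[k] + 1 - i)/2) for i in 1:D) + D*log(2) + logdet(W[k])  # E[log|Λ_k|]
            for n in 1:N
                d = X[:, n] - m[:, k]
                Equad = D/β[k] + ν[k] * dot(d, W[k] * d)        # E[(x_n-μ_k)'Λ_k(x_n-μ_k)]
                logρ[n, k] = Elogπ + 0.5*ElogΛ - 0.5*D*log(2π) - 0.5*Equad
            end
        end
        ρ = exp.(logρ .- maximum(logρ, dims=2))                 # subtract row-max for numerical stability
        return ρ ./ sum(ρ, dims=2)                              # r_nk = ρ_nk / Σ_j ρ_nj
    end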
Below we exemplify training of a Gaussian Mixture Model on the Old Faithful data set by Free Energy Minimization, using the constraints as specified
above, e.g., (B-10.42).
In [2]: using DataFrames, CSV, LinearAlgebra, PDMats, SpecialFunctions
import PyPlot.contour, PyPlot.scatter
include("scripts/gmm_plot.jl") # Holds plotting function
old_faithful = CSV.read("datasets/old_faithful.csv",DataFrame);
X = convert(Matrix{Float64}, [old_faithful[!,1] old_faithful[!,2]]');#data matrix
N = size(X, 2) #number of observations
K = 6
        # (definitions of the helper functions updateR and updateMeanPrecisionPi are not shown in this printout)
max_iter = 15
#store the inference results in these vectors
ν = fill!(Array{Float64}(undef,K,max_iter),3)
β = fill!(Array{Float64}(undef,K,max_iter),1.0)
α = fill!(Array{Float64}(undef,K,max_iter),0.01)
R = Array{Float64}(undef,K,N,max_iter)
M = Array{Float64}(undef,2,K,max_iter)
Λ = Array{Float64}(undef,2,2,K,max_iter)
clusters_vb = Array{Distribution}(undef,K,max_iter) #clusters to be plotted
#initialize prior distribution parameters
M[:,:,1] = [[3.0; 60.0];[4.0; 70.0];[2.0; 50.0];[2.0; 60.0];[3.0; 100.0];[4.0; 80.0]]
for k=1:K
Λ[:,:,k,1] = [1.0 0;0 0.01]
R[k,:,1] = 1/(K)*ones(N)
clusters_vb[k,1] = MvNormal(M[:,k,1],PDMats.PDMat(convert(Matrix,Hermitian(inv(ν[1,1].*Λ[:,:,k,1])))))
end
#variational inference
for i=1:max_iter-1
#variational expectation
R[:,:,i+1] = updateR(Λ[:,:,:,i],M[:,:,i],α[:,i],ν[:,i],β[:,i])
#variational minimisation
M[:,:,i+1],β[:,i+1],Λ[:,:,:,i+1],ν[:,i+1],α[:,i+1] = updateMeanPrecisionPi(M[:,:,i],β[:,i],Λ[:,:,:,i],ν[:,i],α[:,
for k=1:K
clusters_vb[k,i+1] = MvNormal(M[:,k,i+1],PDMats.PDMat(convert(Matrix,Hermitian(inv(ν[k,i+1].*Λ[:,:,k,i+1])))))
end
end
Out[2]:
The generated figure looks much like Figure 10.6 in Bishop. The plots show results for a Variational Bayesian mixture of K = 6 Gaussians applied to the Old Faithful data set, in which the ellipses denote the one standard-deviation density contours for each of the components, and the color coding of the data points reflects the "soft" class label assignments. Components whose expected mixing coefficients are numerically indistinguishable from zero are not plotted.
Note that this procedure learns the number of classes (two), learns the class label for each observation, and learns the mean and variance for the Gaussian
data distributions.
The FE functional can be decomposed in various interesting ways, making use of p(x, z) = p(z|x)p(x) = p(x|z)p(z)
$$\begin{align*}
F[q] &= \underbrace{-\sum_z q(z)\log p(x,z)}_{\text{energy}} - \underbrace{\sum_z q(z)\log\frac{1}{q(z)}}_{\text{entropy}} &&\text{(EE)}\\
&= \underbrace{\sum_z q(z)\log\frac{q(z)}{p(z|x)}}_{\text{KL divergence}\,\geq\,0} - \underbrace{\log p(x)}_{\text{log-evidence}} &&\text{(DE)}\\
&= \underbrace{\sum_z q(z)\log\frac{q(z)}{p(z)}}_{\text{complexity}} - \underbrace{\sum_z q(z)\log p(x|z)}_{\text{accuracy}} &&\text{(CA)}
\end{align*}$$
These decompositions are very insightful (we will revisit them later) and we will label them respectively as energy-entropy (EE), divergence-evidence
(DE), and complexity-accuracy (CA) decompositions.
In the Bayesian Machine Learning lecture, we discussed the CA decomposition of Bayesian model evidence to support the interpretation of evidence as
a model performance criterion. Here, we recognize that FE allows a similar CA decomposition: minimizing FE increases data fit and decreases model
complexity. Hence, FE is a good model performance criterion.
The CA decomposition makes use of the prior p(z) and likelihood p(x|z), both of which are selected by the engineer, so the FE can be evaluated with
this decomposition!
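The following small sketch (the numbers are our own) evaluates the FE via the CA decomposition for a single binary latent variable, and confirms numerically that the minimum is attained at the Bayesian posterior with minimum value equal to minus log-evidence.

    # Evaluating F[q] = complexity - accuracy for a binary latent variable z (illustrative sketch)
    p_z  = [0.5, 0.5]          # prior p(z)
    p_xz = [0.9, 0.3]          # likelihood p(x=1 | z) for the observed value x=1
    F(q) = sum(q .* log.(q ./ p_z)) - sum(q .* log.(p_xz))
    πs = 0.01:0.01:0.99        # scan candidate posteriors q(z) = [π, 1-π]
    Fs = [F([π, 1 - π]) for π in πs]
    println("arg min F ≈ ", πs[argmin(Fs)])                                        # ≈ 0.75 = p(z=1|x=1)
    println("min F ≈ ", minimum(Fs), "   -log p(x) = ", -log(sum(p_z .* p_xz)))    # both ≈ 0.51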
The DE decomposition restates what we derived earlier, namely that the FE is an upper bound on the (negative) log-evidence. The bound is the KL divergence between the variational posterior q(z) and the (perfect) Bayesian posterior p(z|x). Global minimization of FE with only data constraints drives the KL divergence to zero and results in perfect Bayesian inference.
The EE decomposition provides a link to the second law of thermodynamics: Minimizing FE leads to entropy maximization, subject to constraints, where
in this case the constraints are imposed by the postulated generative model.
Summary
Latent variable models (LVM) contain a set of unobserved variables whose size grows with the number of observations.
LVMs can model more complex phenomena than fully observed models, but inference in LVM models is usually not analytically solvable.
The Free Energy (FE) functional transforms Bayesian inference computations (very large summations or integrals) to an optimization problem.
Inference by minimizing FE, also known as variational inference, is fully consistent with the "Method of Maximum Relative Entropy", which is by design the rational way to update beliefs from priors to posteriors when new information becomes available. Thus, FE minimization is the "correct" inference procedure that generalizes Bayes rule.
In general, global FE minimization is an unsolved problem and finding good local minima is at the heart of current Bayesian technology research. Three
simplifying constraints on the posterior q(z) in the FE functional are currently popular in practical settings:
mean-field assumptions
assuming a parametric form for q
EM algorithm
These constraints often make FE minimization implementable at the price of obtaining approximately Bayesian inference results.
The back-ends of Probabilistic Programming Languages (such as RxInfer) often contain lots of methods for automating constrained FE minimization.
OPTIONAL SLIDES
Let's work out FE minimization with additional mean-field (= full factorization) constraints:
$$
q(z) = \prod_{j=1}^m q_j(z_j)\,.
$$
In other words, the posteriors for $z_j$ are all considered independent. This is a strong constraint but it often leads to good solutions.
Given the mean-field constraints, it is possible to derive the following expression for the optimal solutions $q_j^*(z_j)$, for $j=1,\ldots,m$:
$$
\log q_j^*(z_j) \propto \mathrm{E}_{q_{-j}^*}\left[\log p(x,z)\right] = \sum_{z_{-j}} \underbrace{q_{-j}^*(z_{-j})}_{\text{"mean field"}}\,\log\underbrace{p(x,z)}_{\text{"field"}} \qquad \text{(B-10.9)}
$$
where we defined $q_{-j}^*(z_{-j}) \triangleq q_1^*(z_1)\,q_2^*(z_2)\cdots q_{j-1}^*(z_{j-1})\,q_{j+1}^*(z_{j+1})\cdots q_m^*(z_m)$.
Proof sketch (from Blei, 2017): we first rewrite the FE as a function of $q_j(z_j)$ only. Up to terms that do not depend on $q_j$, we have $F[q_j] = \sum_{z_j} q_j(z_j)\log q_j(z_j) - \sum_{z_j} q_j(z_j)\,\mathrm{E}_{q_{-j}}\left[\log p(x,z)\right]$, which is (up to a constant) a KL divergence between $q_j(z_j)$ and a distribution proportional to $\exp\left(\mathrm{E}_{q_{-j}}[\log p(x,z)]\right)$; this KL divergence is minimized by choosing $\log q_j^*(z_j) \propto \mathrm{E}_{q_{-j}}[\log p(x,z)]$.
This is not yet a full solution to the FE minimization task, since the solution $q_j^*(z_j)$ depends on expectations that involve the other solutions $q_{i\neq j}^*(z_{i\neq j})$, and each of these other solutions $q_{i\neq j}^*(z_{i\neq j})$ depends on an expectation that involves $q_j^*(z_j)$.
In practice, we solve this chicken-and-egg problem by an iterative approach: we first initialize all $q_j(z_j)$ (for $j=1,\ldots,m$) to an appropriate initial distribution and then cycle through the factors in turn by solving eq. B-10.9 and updating $q_{-j}^*(z_{-j})$ with the latest estimates. (See Blei, 2017, Algorithm 1, p864.)
This algorithm for approximating Bayesian inference is known as Coordinate Ascent Variational Inference (CAVI).
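A minimal CAVI sketch (our own, following Bishop Section 10.1.2; all numbers are assumed): a mean-field approximation $q_1(z_1)q_2(z_2)$ of a known bivariate Gaussian $p(z)=\mathcal{N}(\mu,\Lambda^{-1})$. The optimal factors are Gaussians with fixed precisions $\Lambda_{11},\Lambda_{22}$ whose means are updated in turn, each using the latest estimate of the other factor.

    # Coordinate Ascent Variational Inference for a toy bivariate Gaussian (illustrative sketch)
    μ = [1.0, 2.0]
    Λ = [2.0 0.8; 0.8 1.0]             # precision matrix of the target p(z)
    m1, m2 = 0.0, 0.0                  # initialize the variational means
    for i in 1:20
        global m1 = μ[1] - (Λ[1,2]/Λ[1,1]) * (m2 - μ[2])   # update q1(z1) given current q2(z2)
        global m2 = μ[2] - (Λ[2,1]/Λ[2,2]) * (m1 - μ[1])   # update q2(z2) given current q1(z1)
    end
    println("q1(z1) = N($m1, $(1/Λ[1,1])),  q2(z2) = N($m2, $(1/Λ[2,2]))")
    # The means converge to the true means (1, 2); the variances 1/Λ_kk underestimate the true
    # marginal variances, a well-known property of mean-field approximations.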
The EM algorithm is a special case of FE minimization that focuses on Maximum-Likelihood estimation for models with latent variables.
Consider a model $p(x,z,\theta)$ with observations $x$, latent variables $z$ and parameters $\theta$, and the FE functional
$$
F[q] = \sum_z\sum_\theta q(z,\theta)\log\frac{q(z,\theta)}{p(x,z,\theta)}\,.
$$
The EM algorithm makes the following simplifying assumptions:
1. The prior for the parameters is uninformative (uniform). This implies that $p(x,z,\theta) \propto p(x,z|\theta)$.
2. The posterior factorizes as $q(z,\theta) = q(z)\,q(\theta)$.
3. The posterior for the parameters is a delta function: $q(\theta) = \delta(\theta-\hat{\theta})$.
With these constraints, the FE simplifies to
$$
F[q,\theta] = \sum_z q(z)\log\frac{q(z)}{p(x,z|\theta)}\,.
$$
The EM algorithm minimizes this FE by iterating (iteration counter: $i$) over
$$\begin{align*}
L^{(i)}(\theta) &= \sum_z \overbrace{p(z|x,\theta^{(i-1)})}^{q^{(i)}(z)}\,\log p(x,z|\theta) &&\text{the E-step}\\
\theta^{(i)} &= \arg\max_\theta L^{(i)}(\theta) &&\text{the M-step}
\end{align*}$$
These choices are optimal for the given FE functional. In order to see this, consider the two decompositions
$$\begin{align*}
F[q,\theta] &= \underbrace{-\sum_z q(z)\log p(x,z|\theta)}_{\text{energy}} - \underbrace{\sum_z q(z)\log\frac{1}{q(z)}}_{\text{entropy}} &&\text{(EE)}\\
&= \underbrace{\sum_z q(z)\log\frac{q(z)}{p(z|x,\theta)}}_{\text{divergence}} - \underbrace{\log p(x|\theta)}_{\text{log-likelihood}} &&\text{(DE)}
\end{align*}$$
The DE decomposition shows that the FE is minimized for the choice q(z) := p(z|x, θ). Also, for this choice, the FE equals the (negative) log-evidence (which in this case simplifies to the log-likelihood).
The EE decomposition shows that the FE is minimized w.r.t. θ by minimizing the energy term. The energy term is computed in the E-step and optimized in the M-step.
(Note that in the EM literature, the energy term is often called the expected complete-data log-likelihood.)
In order to execute the EM algorithm, it is assumed that we can analytically execute the E- and M-steps. For a large set of models (including models
whose distributions belong to the exponential family of distributions), this is indeed the case and hence the large popularity of the EM algorithm.
The EM algorithm imposes rather severe assumptions on the FE (basically approximating Bayesian inference by maximum likelihood estimation). Over the past few years, the rise of Probabilistic Programming languages has dramatically increased the range of models for which the parameters can be estimated automatically by (approximate) Bayesian inference, so the popularity of EM is slowly waning. (More on this in the Probabilistic Programming lessons).
Code Example: EM-algorithm for the GMM on the Old-Faithful data set
We'll perform clustering on the data set from the illustrative example by fitting a GMM consisting of two Gaussians using the EM algorithm.
In [3]: using DataFrames, CSV, LinearAlgebra
import PyPlot.contour, PyPlot.scatter
include("scripts/gmm_plot.jl") # Holds plotting function
old_faithful = CSV.read("datasets/old_faithful.csv", DataFrame);
X = Array(Matrix{Float64}(old_faithful)')
N = size(X, 2)
Out[3]:
Note that you can step through the interactive demo yourself by running this script in julia. You can run a script in julia by
julia> include("path/to/script-name.jl")
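Since the EM update loop itself is not visible in this printout, here is a minimal, self-contained sketch (our own; the data and initialization are assumed, and it is not the course script) of EM for a one-dimensional mixture of two Gaussians:

    using Distributions, Random

    # EM for a 1-D two-component GMM (illustrative sketch)
    function em_gmm1d(x; iters=50)
        π1, μ, σ = 0.5, [1.0, 5.0], [1.0, 1.0]                  # initial parameter guesses
        for i in 1:iters
            # E-step: responsibilities γ_n = p(z_n = component 1 | x_n, θ)
            p1 = π1       .* pdf.(Normal(μ[1], σ[1]), x)
            p2 = (1 - π1) .* pdf.(Normal(μ[2], σ[2]), x)
            γ  = p1 ./ (p1 .+ p2)
            # M-step: re-estimate parameters from responsibility-weighted data
            N1, N2 = sum(γ), sum(1 .- γ)
            μ[1], μ[2] = sum(γ .* x)/N1, sum((1 .- γ) .* x)/N2
            σ[1] = sqrt(sum(γ .* (x .- μ[1]).^2)/N1)
            σ[2] = sqrt(sum((1 .- γ) .* (x .- μ[2]).^2)/N2)
            π1 = N1 / length(x)
        end
        return π1, μ, σ
    end

    Random.seed!(1)
    x = vcat(rand(Normal(2.0, 0.5), 150), rand(Normal(4.5, 0.8), 100))   # synthetic 1-D data
    println(em_gmm1d(x))    # recovers mixing weight ≈ 0.6 and the two component means/stds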
Consider an edge xj in a Forney-style factor graph for a generative model p(x) = p(x1 , x2 , … , xN ).
Assume that the graph structure (factorization) is specified by
$$
p(x) = \prod_{a=1}^M p_a(x_a)
$$
and that we posit a fully factorized (mean-field) variational posterior
$$
q(x) = \prod_{i=1}^N q_i(x_i)\,.
$$
The FE functional then evaluates to
$$
F[q] = \sum_x q(x)\log\frac{q(x)}{p(x)} = \sum_x \left(\prod_{i=1}^N q_i(x_i)\right)\log\frac{\prod_{i=1}^N q_i(x_i)}{\prod_{a=1}^M p_a(x_a)}
$$
which decomposes into
$$
F[q] = \sum_{a=1}^M \underbrace{\sum_{x_a}\left(\prod_{j\in N(a)} q_j(x_j)\right)\cdot\left(-\log p_a(x_a)\right)}_{\text{node energy } U[p_a]} - \sum_{i=1}^N \underbrace{\sum_{x_i} q_i(x_i)\log\frac{1}{q_i(x_i)}}_{\text{edge entropy } H[q_i]}
$$
In words, the FE decomposes into a sum of (expected) energies for the nodes minus the entropies on the edges.
Let us now consider the local free energy that is associated with the edge corresponding to $x_j$.
Apparently (see the previous slide), there are three contributions to the free energy for $x_j$:
$$
F[q_j] \propto \sum_{x_j} q_j(x_j)\log\frac{q_j(x_j)}{\nu_a(x_j)\cdot\nu_b(x_j)}
$$
where $\nu_a(x_j)$ and $\nu_b(x_j)$ are (variational) messages that arrive at edge $x_j$ from the two nodes $a$ and $b$ that connect to it.
These considerations lead to the Variational Message Passing procedure, which is an iterative free energy minimization procedure that can be executed
completely through locally computable messages.
1. Initialize all messages $q$ and $\nu$, e.g., $q(\cdot) \propto 1$ and $\nu(\cdot) \propto 1$.
2. Select an edge $z_k$ in the factor graph of $f(z_1,\ldots,z_m)$.
3. Compute the two messages $\overrightarrow{\nu}(z_k)$ and $\overleftarrow{\nu}(z_k)$ by applying the following generic rule:
$$
\overrightarrow{\nu}(y) \propto \exp\left(\mathrm{E}_q\left[\log g(x_1,\ldots,x_n,y)\right]\right)
$$
4. Compute the marginal $q(z_k)$
$$
q(z_k) \propto \overrightarrow{\nu}(z_k)\,\overleftarrow{\nu}(z_k)
$$
and send it to the two nodes connected to the edge $z_k$.
5. Iterate 2–4 until convergence.
Dynamic Models
Preliminaries
Goal
Introduction to dynamic (=temporal) Latent Variable Models, including the Hidden Markov Model and Kalman filter.
Materials
Mandatory
These lecture notes
Optional
Bishop pp.605-615 on Hidden Markov Models
Bishop pp.635-641 on Kalman filters
Faragher (2012), Understanding the Basis of the Kalman Filter
Minka (1999), From Hidden Markov Models to Linear Dynamical Systems
Example Problem
The hidden states are the position $z_t$ and velocity $\dot{z}_t$. We can apply an external acceleration/braking force $u_t$. (Noisy) observations are represented by $x_t$.
The equations of motion are given by
$$\begin{align*}
\begin{bmatrix} z_t \\ \dot{z}_t \end{bmatrix} &= \begin{bmatrix} 1 & \Delta t \\ 0 & 1 \end{bmatrix}\begin{bmatrix} z_{t-1} \\ \dot{z}_{t-1} \end{bmatrix} + \begin{bmatrix} (\Delta t)^2/2 \\ \Delta t \end{bmatrix} u_t + \mathcal{N}(0,\Sigma_z) \\
x_t &= \begin{bmatrix} z_t \\ \dot{z}_t \end{bmatrix} + \mathcal{N}(0,\Sigma_x)
\end{align*}$$
Task: Infer the position zt after 10 time steps. (Solution later in this lesson).
Dynamical Models
In this lesson, we consider models where the sequence order of observations matters, i.e., we wish to develop a generative model for an ordered observation sequence $p(x^T) = p(x_1, x_2, \ldots, x_T)$.
Generally, we will want to limit the depth of dependencies on previous observations. For example, a $K$th-order linear Auto-Regressive (AR) model that is given by
$$
p(x_t\,|\,x^{t-1}) = \mathcal{N}\left(x_t\,\Big|\,\sum_{k=1}^K a_k x_{t-k},\,\sigma^2\right)
$$
limits the dependencies to the past $K$ samples.
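A small illustrative sketch (our own; the coefficients and noise level are assumed): generate data from a 2nd-order AR model and recover the coefficients $a_k$ by least squares.

    using Random

    # Simulate an AR(2) process and estimate its coefficients by least squares (illustrative sketch)
    Random.seed!(3)
    a, σ = [0.6, 0.3], 0.1
    x = zeros(200)
    for t in 3:200
        x[t] = a[1]*x[t-1] + a[2]*x[t-2] + σ*randn()
    end
    Φ = hcat([x[t-1] for t in 3:200], [x[t-2] for t in 3:200])   # regressors (past samples)
    y = x[3:200]
    â = Φ \ y                                                     # least-squares estimate of (a_1, a_2)
    println("estimated AR coefficients: ", round.(â, digits=2))  # ≈ [0.6, 0.3]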
State-space Models
A limitation of AR models is that they need a lot of parameters in order to create a flexible model. E.g., if $x_t$ is an $M$-dimensional discrete variable, then a $K$th-order AR model will have about $M^K$ parameters.
Similar to our work on Gaussian Mixture models, we can create a flexible dynamic system by introducing latent (unobserved) variables $z^T \triangleq (z_1, z_2, \ldots, z_T)$ (one $z_t$ for each observation $x_t$). In dynamic systems, the $z_t$ are called state variables.
A state-space model assumes the following factorization for the joint distribution:
$$
p(x^T, z^T) = \underbrace{p(z_1)}_{\text{initial state}}\,\prod_{t=2}^T \underbrace{p(z_t\,|\,z_{t-1})}_{\text{state transitions}}\,\prod_{t=1}^T \underbrace{p(x_t\,|\,z_t)}_{\text{observations}}
$$
The condition $p(z_t\,|\,z^{t-1}) = p(z_t\,|\,z_{t-1})$ is called a 1st-order Markov condition.
A Hidden Markov Model (HMM) is a specific state-space model with discrete-valued state variables $z_t$.
Typically, $z_t$ is a $K$-dimensional one-hot coded latent 'class indicator' with transition probabilities $a_{jk} \triangleq p(z_{tk}=1\,|\,z_{t-1,j}=1)$, or equivalently,
$$
p(z_t|z_{t-1}) = \prod_{k=1}^K\prod_{j=1}^K a_{jk}^{z_{t-1,j}\cdot z_{tk}}
$$
The classical HMM also has discrete-valued observations, but in practice any (probabilistic) observation model $p(x_t|z_t)$ may be coupled to the hidden Markov chain.
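A small illustrative sketch (our own; all numbers are assumed): sampling a short sequence from an HMM with $K=2$ hidden states, transition probabilities $a_{jk}$, and a Gaussian observation model $p(x_t|z_t)$ coupled to the discrete Markov chain.

    using Distributions, Random

    # Ancestral sampling from a 2-state HMM with Gaussian observations (illustrative sketch)
    Random.seed!(2)
    A = [0.9 0.1;                  # a_jk = p(z_t = k | z_{t-1} = j)
         0.2 0.8]
    μ = [-1.0, 3.0]                # observation means for the two states
    z = 1                          # initial state
    for t in 1:10
        global z = rand(Categorical(A[z, :]))     # state transition p(z_t | z_{t-1})
        x = rand(Normal(μ[z], 0.5))               # observation p(x_t | z_t)
        println("t=$t  z=$z  x=$(round(x, digits=2))")
    end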
Another well-known state-space model with continuous-valued state variables $z_t$ is the (Linear) Gaussian Dynamical System (LGDS), which is defined as
$$\begin{align*}
p(z_t\,|\,z_{t-1}) &= \mathcal{N}\left(z_t\,|\,A z_{t-1},\,\Gamma\right)\\
p(x_t\,|\,z_t) &= \mathcal{N}\left(x_t\,|\,C z_t,\,\Sigma\right)
\end{align*}$$
Note that the joint distribution over all states and observations {(x1 , z1 ), … , (xt , zt )} is a (large-dimensional) Gaussian distribution. This means
that, in principle, every inference problem on the LGDS model also leads to a Gaussian distribution.
HMM's and LGDS's (and variants thereof) are at the basis of a wide range of complex information processing systems, such as speech and language
recognition, robotics and automatic car navigation, and even processing of DNA sequences.
As we have seen, inference tasks in linear Gaussian state space models can be analytically solved.
Alternatively, we could specify the generative model in a (Forney-style) factor graph and use automated message passing to infer the posterior over the
hidden variables. Some examples follow:
Filtering, a.k.a. state estimation: estimation of a state (at time step t), based on past and current (at t) observations.
Smoothing: estimation of a state based on both past and future observations. Needs backward messages from the future.
Prediction: estimation of future state or observation based only on observations of the past.
Kalman Filtering
Technically, a Kalman filter is the solution to the recursive estimation (inference) of the hidden state $z_t$ based on past observations in an LGDS, i.e., Kalman filtering solves the problem $p(z_t\,|\,x^t)$ based on the previous estimate $p(z_{t-1}\,|\,x^{t-1})$ and a new observation $x_t$ (in the context of the given model specification, of course).
Let's infer the Kalman filter for a scalar linear Gaussian dynamical system
$$\begin{align*}
z_t &= a z_{t-1} + \mathcal{N}(0,\sigma_z^2)\\
x_t &= c z_t + \mathcal{N}(0,\sigma_x^2)\,.
\end{align*}$$
Kalman filtering comprises inferring $p(z_t\,|\,x^t)$ from a given prior estimate $p(z_{t-1}\,|\,x^{t-1})$ (available after the previous time step) and a new observation $x_t$. Let us assume that
$$
p(z_{t-1}\,|\,x^{t-1}) = \mathcal{N}\left(z_{t-1}\,|\,\mu_{t-1},\,\sigma_{t-1}^2\right)\,. \qquad \text{(prior)}
$$
Note that everything is Gaussian, so it is in principle possible to execute the inference problem analytically and the result will be a Gaussian posterior $p(z_t\,|\,x^t) = \mathcal{N}(z_t\,|\,\mu_t,\sigma_t^2)$.
(In the following derivation we make use of the renormalization equality $\mathcal{N}(x\,|\,cz,\sigma^2) = \frac{1}{c}\,\mathcal{N}\!\left(z\,\big|\,\frac{x}{c},\left(\frac{\sigma}{c}\right)^2\right)$.)
$$\begin{align*}
p(z_t\,|\,x^t) &\propto p(x_t, z_t\,|\,x^{t-1}) \\
&= p(x_t\,|\,z_t)\,p(z_t\,|\,x^{t-1}) \\
&= \frac{1}{c}\mathcal{N}\!\left(z_t\,\Big|\,\frac{x_t}{c},\left(\frac{\sigma_x}{c}\right)^2\right)\,\int \frac{1}{a}\mathcal{N}\!\left(z_{t-1}\,\Big|\,\frac{z_t}{a},\left(\frac{\sigma_z}{a}\right)^2\right)\mathcal{N}\!\left(z_{t-1}\,|\,\mu_{t-1},\sigma_{t-1}^2\right)\mathrm{d}z_{t-1} \\
&\qquad\text{(use the Gaussian multiplication formula SRG-6)}\\
&\propto \mathcal{N}\!\left(z_t\,\Big|\,\frac{x_t}{c},\left(\frac{\sigma_x}{c}\right)^2\right)\cdot\mathcal{N}\!\left(z_t\,|\,a\mu_{t-1},\,\sigma_z^2+(a\sigma_{t-1})^2\right)\\
&\qquad\text{(use SRG-6 again)}\\
&\propto \mathcal{N}\!\left(z_t\,|\,\mu_t,\sigma_t^2\right)
\end{align*}$$
with
$$\begin{align*}
\rho_t^2 &= a^2\sigma_{t-1}^2 + \sigma_z^2 &&\text{(predicted variance)}\\
K_t &= \frac{c\rho_t^2}{c^2\rho_t^2+\sigma_x^2} &&\text{(Kalman gain)}\\
\mu_t &= \underbrace{a\mu_{t-1}}_{\text{prior prediction}} + K_t\cdot\underbrace{\left(x_t - ca\mu_{t-1}\right)}_{\text{prediction error}} &&\text{(posterior mean)}\\
\sigma_t^2 &= (1-c\cdot K_t)\,\rho_t^2 &&\text{(posterior variance)}
\end{align*}$$
Kalman filtering consists of computing/updating these last four equations for each new observation (xt ). This is a very efficient recursive algorithm to
estimate the state zt from all observations (until t).
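The four recursions above translate directly into a few lines of code. Here is a minimal sketch (our own; the parameters and the simulated data are assumed) of the scalar Kalman filter:

    # Scalar Kalman filter recursions (illustrative sketch)
    a, c, σz, σx = 1.0, 1.0, 0.1, 0.5
    z, xs = 0.0, Float64[]
    for t in 1:25                                # simulate the scalar LGDS
        global z = a*z + σz*randn()
        push!(xs, c*z + σx*randn())
    end
    μ, σ2 = 0.0, 1.0                             # prior on the initial state
    for x in xs
        ρ2 = a^2*σ2 + σz^2                       # predicted variance
        K  = c*ρ2 / (c^2*ρ2 + σx^2)              # Kalman gain
        global μ  = a*μ + K*(x - c*a*μ)          # posterior mean
        global σ2 = (1 - c*K)*ρ2                 # posterior variance
    end
    println("p(z_T | x^T) ≈ N($(round(μ, digits=2)), $(round(σ2, digits=4)))")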
It turns out that it's also possible to get an analytical result for $p(x_t\,|\,x^{t-1})$, which is the model evidence in a filtering context. See the optional slides for details.
The Kalman filter equations can also be derived for multidimensional state-space models. In particular, for the model
$$\begin{align*}
z_t &= A z_{t-1} + \mathcal{N}(0,\Gamma)\\
x_t &= C z_t + \mathcal{N}(0,\Sigma)
\end{align*}$$
the Kalman filter update equations for the posterior $p(z_t\,|\,x^t) = \mathcal{N}(z_t\,|\,\mu_t, V_t)$ are given by (see Bishop, pg.639)
$$\begin{align*}
P_t &= A V_{t-1} A^T + \Gamma &&\text{(predicted covariance)}\\
K_t &= P_t C^T\left(C P_t C^T + \Sigma\right)^{-1} &&\text{(Kalman gain)}\\
\mu_t &= A\mu_{t-1} + K_t\left(x_t - C A\mu_{t-1}\right) &&\text{(posterior mean)}\\
V_t &= \left(I - K_t C\right)P_t &&\text{(posterior covariance)}
\end{align*}$$
Code Example: Kalman Filtering and the Cart Position Tracking Example Revisited
We can now solve the cart tracking problem of the introductory example by implementing the Kalman filter.
In [1]: using Rocket, ReactiveMP, RxInfer, LinearAlgebra, PyPlot
include("scripts/cart_tracking_helpers.jl")
        # (simulation of the noisy cart data and initialization of m_z, V_z, A, b, u, Σz, Σx, noisy_x are not shown in this printout)
        for t = 2:n
global m_z, V_z, m_pred_z, V_pred_z
#predict
m_pred_z = A * m_z + b * u[t] # predictive mean
V_pred_z = A * V_z * A' + Σz # predictive covariance
#update
gain = V_pred_z * inv(V_pred_z + Σx) # Kalman gain
m_z = m_pred_z + gain * (noisy_x[t] - m_pred_z) # posterior mean update
V_z = (Diagonal(I,2)-gain)*V_pred_z # posterior covariance update
end
println("Prediction: ",MvNormalMeanCovariance(m_pred_z,V_pred_z))
println("Measurement: ", MvNormalMeanCovariance(noisy_x[n],Σx))
println("Posterior: ", MvNormalMeanCovariance(m_z,V_z))
plotCartPrediction(m_pred_z[1], V_pred_z[1], m_z[1], V_z[1], noisy_x[n][1], Σx[1][1]);
Prediction: MvNormalMeanCovariance(
μ: [41.59136133641072, 4.338236377111019]
Σ: [1.2958787328575079 0.3921572953097835; 0.3921572953130242 0.3415636711134632]
)
Measurement: MvNormalMeanCovariance(
μ: [41.89934474190233, 5.716098112818856]
Σ: [1.0 0.0; 0.0 2.0]
)
Posterior: MvNormalMeanCovariance(
μ: [41.864718407117785, 4.550823144862611]
Σ: [0.5516100293973586 0.15018972175285758; 0.15018972175409862 0.24143326063188655]
)
Let's solve the cart tracking problem by sum-product message passing in a factor graph like the one depicted above. All we have to do is create factor
nodes for the state-transition model p(zt |zt−1 ) and the observation model p(xt |zt ). Then we let RxInfer execute the message passing schedule.
In [2]: @model function cart_tracking(n, A, b, Σz, Σx, z_prev_m_0, z_prev_v_0, u)
            znodes = Vector{Any}(undef, n)
            # `z` is a sequence of hidden states
            z = randomvar(n)
            # `x` is a sequence of "clamped" observations
            x = datavar(Vector{Float64}, n)
            # (the definitions of the initial-state prior z_prior and of the constants cA, cB, cΣz, cΣx,
            #  i.e. the model-internal versions of A, b, Σz, Σx, are not shown in this printout)
            z_prev = z_prior
            for i in 1:n
                znodes[i], z[i] ~ MvNormalMeanCovariance(cA * z_prev + cB*u[i], cΣz)
                x[i] ~ MvNormalMeanCovariance(z[i], cΣx)
                z_prev = z[i]
            end
            return z, x, znodes
        end
Now that we've built the model, we can perform Kalman filtering by inserting measurement data into the model and performing message passing.
In [3]: z_prev_m_0 = noisy_x[1]
z_prev_v_0 = A * (1e8*Diagonal(I,2) * A') + Σz ;
result = inference(model=cart_tracking(n, A, b, Σz, Σx, z_prev_m_0, z_prev_v_0, u), data=(x=noisy_x,), free_energy=true)
μz_posterior, Σz_posterior = mean.(result.posteriors[:z])[end], cov.(result.posteriors[:z])[end];
prediction_z_1 = ReactiveMP.messageout(ReactiveMP.getinterface(result.returnval[end][end], :out))
prediction = ReactiveMP.materialize!(Rocket.getrecent(prediction_z_1));
println("Prediction: ",MvNormalMeanCovariance(mean(prediction), cov(prediction)))
println("Measurement: ", MvNormalMeanCovariance(noisy_x[n], Σx))
println("Posterior: ", MvNormalMeanCovariance(μz_posterior, Σz_posterior))
plotCartPrediction(mean(prediction)[1], cov(prediction)[1], μz_posterior[1], Σz_posterior[1], noisy_x[n][1], Σx[1][1])
Prediction: MvNormalMeanCovariance(
μ: [41.584690043736884, 4.334796582261093]
Σ: [1.2934227334046857 0.3916229823498387; 0.3916229823498387 0.3414332606222485]
)
Measurement: MvNormalMeanCovariance(
μ: [41.89934474190233, 5.716098112818856]
Σ: [1.0 0.0; 0.0 2.0]
)
Posterior: MvNormalMeanCovariance(
μ: [41.861811432022556, 4.548776671897666]
Σ: [0.551150997075792 0.1501469959500068; 0.1501469959500068 0.24141815274489328]
)
Note that both the analytical Kalman filtering solution and the message passing solution lead to the same results. The advantage of message passing-
based inference with RxInfer is that we did not need to derive any inference equations. RxInfer took care of all that.
Dynamical systems do not obey the sample-by-sample independence assumption, but they can still be specified, and their state and parameter estimation equations can be solved, with tools similar to those used for static models.
Two of the more famous and powerful models with latent states include the hidden Markov model (with discrete states) and the Linear Gaussian
dynamical system (with continuous states).
For the LGDS, the Kalman filter is a well-known recursive state estimation procedure. The Kalman filter can be derived through Bayesian update rules on
Gaussian distributions.
If anything changes in the model, e.g., the state noise is not Gaussian, then you have to re-derive the inference equations again from scratch and it may
not lead to an analytically pleasing answer.
⇒ Generally, we will want to automate the inference processes. As we discussed, message passing in a factor graph is a visually appealing method to
automate inference processes. We showed how Kalman filtering emerged naturally by automated message passing.
OPTIONAL SLIDES
Now let's prove the Kalman filtering equations, including evidence updating, by probabilistic calculus:
$$
\underbrace{p(z_t\,|\,x^t)}_{\text{posterior}}\cdot\underbrace{p(x_t\,|\,x^{t-1})}_{\text{evidence}} = p(x_t, z_t\,|\,x^{t-1}) = p(x_t\,|\,z_t)\,p(z_t\,|\,x^{t-1})
$$
The RHS can be further evaluated by making use of the Gaussian multiplication rules and the following renormalization equality:
$$
\mathcal{N}(x\,|\,cz,\sigma^2) = \frac{1}{c}\cdot\mathcal{N}\!\left(z\,\Big|\,\frac{x}{c},\left(\frac{\sigma}{c}\right)^2\right) \qquad \text{(renormalization)}\,.
$$
In particular, the RHS evaluates to
$$\begin{align*}
p(x_t\,|\,z_t)\,p(z_t\,|\,x^{t-1}) &= \frac{1}{c}\mathcal{N}\!\left(z_t\,\Big|\,\frac{x_t}{c},\left(\frac{\sigma_x}{c}\right)^2\right)\,\mathcal{N}\!\left(z_t\,|\,a\mu_{t-1},\,\sigma_z^2+(a\sigma_{t-1})^2\right)\\
&\qquad\text{(use SRG-6 again)}\\
&= \frac{1}{c}\mathcal{N}\!\left(\frac{x_t}{c}\,\Big|\,a\mu_{t-1},\left(\frac{\sigma_x}{c}\right)^2+\sigma_z^2+(a\sigma_{t-1})^2\right)\,\mathcal{N}\!\left(z_t\,|\,\mu_t,\sigma_t^2\right)\\
&\qquad\text{(use renormalization)}\\
&= \underbrace{\mathcal{N}\!\left(x_t\,|\,ca\mu_{t-1},\,\sigma_x^2+c^2\left(\sigma_z^2+a^2\sigma_{t-1}^2\right)\right)}_{\text{evidence } p(x_t|x^{t-1})}\cdot\underbrace{\mathcal{N}\!\left(z_t\,|\,\mu_t,\sigma_t^2\right)}_{\text{posterior } p(z_t|x^t)}
\end{align*}$$
with
$$\begin{align*}
\rho_t^2 &= a^2\sigma_{t-1}^2 + \sigma_z^2 &&\text{(predicted variance)}\\
K_t &= \frac{c\rho_t^2}{c^2\rho_t^2+\sigma_x^2} &&\text{(Kalman gain)}\\
\mu_t &= \underbrace{a\mu_{t-1}}_{\text{prior prediction}} + K_t\cdot\underbrace{\left(x_t - ca\mu_{t-1}\right)}_{\text{prediction error}} &&\text{(posterior mean)}\\
\sigma_t^2 &= (1-c\cdot K_t)\,\rho_t^2 &&\text{(posterior variance)}
\end{align*}$$
Using the methods of the previous lessons, it is possible to create your own new models based on stacking Gaussian and categorical distributions in new ways.
Preliminaries
Goal
Introduction to Active Inference and application to the design of synthetic intelligent agents
Materials
Mandatory
These lecture notes
Karl Friston - 2016 - The Free Energy Principle (video)
Optional
Raviv (2018), The Genius Neuroscientist Who Might Hold the Key to True AI.
Interesting article on Karl Friston, who is a leading theoretical neuroscientist working on a theory that relates life and intelligent behavior
to physics (and Free Energy minimization). (highly recommended)
Friston et al. (2022), Designing Ecosystems of Intelligence from First Principles
Friston's vision on the future of AI.
Van de Laar and De Vries (2019), Simulating Active Inference Processes by Message Passing
How to implement active inference by message passing in a Forney-style factor graph.
Agents
We begin with a motivating example that requires "intelligent" goal-directed decision making: assume that you are an owl and that you're hungry. What
are you going to do?
Have a look at Prof. Karl Friston's answer in this video segment on the cost function for intelligent behavior. (Do watch the video!)
Friston argues that intelligent decision making (behavior, action making) by an agent requires minimization of a functional of beliefs.
Friston further argues (later in the lecture and his papers) that this functional is a (variational) free energy (to be defined below), thus linking decision-
making and acting to Bayesian inference.
In fact, Friston's Free Energy Principle (FEP) claims that all biological self-organizing processes (including brain processes) can be described as Free
Energy minimization in a probabilistic model.
This includes perception, learning, attention mechanisms, recall, acting and decision making, etc.
Taking inspiration from the FEP, if we want to develop synthetic "intelligent" agents, we have (only) two issues to consider: the specification of the agent's generative model, and the minimization of the free energy in that model.
Consider an AIF agent with observations (sensory states) xt , latent internal states st and latent control states ut for t = 1, 2, ….
Actions a t are selected by the agent. Actions affect the environment and consequently affect future observations.
FORALL t DO
1. Execute the selected action $a_t$ and record the new observation $x_t$ (act-execute-observe).
2. Update the posterior over internal states: $q(s_t) = \arg\min_q F[q]$ (state updating).
3. Update the posterior over control states (policies): $q(u) = \arg\min_q H[q]$ (policy updating).
4. Select the next action $a_{t+1}$ by sampling the posterior over the next control state $q(u_{t+1})$ (action selection).
END
In the above algorithm, F[q] and H[q] are appropriately defined Free Energy functionals, to be discussed below. Next, we discuss these steps in more detail.
What should the agent's model p(x, s, u) be modeling? This question was (already) answered by Conant and Ashby (1970) as the good regulator theorem: every good regulator of a system must be a model of that system. See the OPTIONAL SLIDES for more information.
Conant and Ashby state: "The theorem has the interesting corollary that the living brain, so far as it is to be successful and efficient as a regulator for survival, must proceed, in learning, by the formation of a model (or models) of its environment."
Indeed, perception in brains is clearly affected by predictions about sensory inputs by the brain's own generative model.
In the above picture (The Gardener, by Giuseppe Arcimboldo, ca 1590 ), on the left you will likely see a bowl of vegetables, while the same picture
upside down elicits with most people the perception of a gardener's face rather than an upside-down vegetable bowl.
The reason is that the brain's model predicts to see straight-up faces with much higher probability than upside-down vegetable bowls.
So the agent's model p will be a model that aims to explain how environmental causes (latent states) lead to sensory observations.
In this notebook, for illustrative purposes, we specify the generative model at time step $t$ of an AIF agent as $p(x_t, s_t, u_t\,|\,s_{t-1})$.
We will assume that the agent interacts with an environment, which we represent by a dynamic model $R$.
Note that $R$ only needs to be specified for simulated environments. If we were to deploy the agent in a real-world environment, we would not need to specify $R$.
The agent's knowledge about environmental process R is expressed by its generative model p(xt , st , ut |st−1 ).
Note that we distinguish between control states and actions. Control states ut are latent variables in the agent's generative model. An action a t is a
realization of a control state as observed by the environment.
Observations $x_t = \hat{x}_t$ are generated by the environment and observed by the agent. Vice versa, actions $a_t = \hat{u}_t$ are generated by the agent and observed by the environment.
After the agent makes a new observation $x_t = \hat{x}_t$, it will update beliefs over its latent variables. First the internal state variables $s$.
Assume the following at time step t:
the state of the agent's model has already been updated to $q(s_{t-1}|\hat{x}_{1:t-1})$.
the agent has selected a new action $a_t = \hat{u}_t$.
the agent has recorded a new observation $x_t = \hat{x}_t$.
The state updating task is to infer $q(s_t|\hat{x}_{1:t})$, based on the previous estimate $q(s_{t-1}|\hat{x}_{1:t-1})$, the new data $\{a_t = \hat{u}_t,\ x_t = \hat{x}_t\}$, and the agent's generative model.
Technically, this is a Bayesian filtering task. In a real brain, this process is called perception.
We specify the following FE functional:
$$
F[q] = \sum_{s_t} \underbrace{q(s_t|\hat{x}_{1:t})}_{\text{state posterior}}\,\log\frac{q(s_t|\hat{x}_{1:t})}{\underbrace{p(\hat{x}_t|s_t)\,p(s_t|s_{t-1},\hat{u}_t)}_{\text{generative model w. new data}}\,\underbrace{q(s_{t-1}|\hat{x}_{1:t-1})}_{\text{state prior}}}
$$
The state updating task can be formulated as minimization of the above FE (see also AIF Algorithm, step 2):
$$
q(s_t|\hat{x}_{1:t}) = \arg\min_q F[q]\,.
$$
In case the generative model is a Linear Gaussian Dynamical System, minimization of the FE can be solved analytically in closed-form and leads to the
standard Kalman filter.
In case these (linear Gaussian) conditions are not met, we can still minimize the FE by other means and arrive at some approximation of the Kalman
filter, see for example Baltieri and Isomura (2021) for a Laplace approximation to variational Kalman filtering.
Once the agent has updated its internal states, it will turn to inferring the next action.
In order to select a good next action, we need to investigate and compare the consequences of a sequence of future actions.
A sequence of future actions $a = (a_{t+1}, a_{t+2}, \ldots, a_{t+T})$ is called a policy. Since relevant consequences are usually the result of a future action sequence rather than a single action, we will be interested in updating beliefs over policies.
In order to assess the consequences of a selected policy, we will, as a function of that policy, run the generative model forward-in-time to make
predictions about future observations xt+1:T .
Note that perception (state updating) precedes policy updating. In order to accurately predict the future, the agent first needs to understand the current state of the world.
Consider an AIF agent at time step $t$ with (future) observations $x = (x_{t+1}, x_{t+2}, \ldots, x_{t+T})$, latent future internal states $s = (s_t, s_{t+1}, \ldots, s_{t+T})$, and latent future control variables $u = (u_{t+1}, u_{t+2}, \ldots, u_{t+T})$.
From the agent's viewpoint, the evolution of these future variables is constrained by its generative model, rolled out into the future:
$$
p(x,s,u) = \underbrace{q(s_t)}_{\text{current state}}\cdot\underbrace{\prod_{k=t+1}^{t+T} p(x_k|s_k)\cdot p(s_k|s_{k-1},u_k)\,p(u_k)}_{\text{GM roll-out to future}}
$$
Consider the Free Energy functional for estimating posterior beliefs $q(s,u)$ over future states and control signals:
$$
H[q] = \sum_{x,s,u} q(x,s,u)\log\frac{q(s,u)}{p(x,s,u)}\,.
$$
In principle, this is a regular FE functional, with one difference to previous versions: since future observations $x$ have not yet occurred, $H[q]$ marginalizes not only over latent states $s$ and policies $u$, but also over future observations $x$.
We will update the beliefs over policies by minimization of the Free Energy functional $H[q]$. In the optional slides below, we prove that the solution to this optimization task is given by (see AIF Algorithm, step 3, above)
$$
q^*(u) \propto p(u)\exp\left(-G(u)\right)
$$
where the factor $p(u)$ is a prior over admissible policies, and the factor $\exp(-G(u))$ updates the prior with information about the future consequences of a selected policy $u$.
The function
$$
G(u) \triangleq \sum_{x,s} q(x,s|u)\log\frac{q(s|u)}{p(x,s|u)}
$$
is called the Expected Free Energy (EFE) for policy $u$.
Note that, since q ∗ (u) ∝ p(u) exp(−G(u)), the probability q ∗ (u) for selecting a policy u increases when EFE G(u) gets smaller.
Once the policy (control) variables have been updated, in simulated environments, it is common to assume that the next action a t+1 (an action is the
observed control variable by the environment) gets selected in proportion to the probability of the related control variable (see AIF Agent Algorithm,
step 4, above), i.e.,
$$\begin{align*}
\hat{u}_{t+1} &\sim q(u_{t+1}) &&\text{(select control realization by sampling)}\\
a_{t+1} &= \hat{u}_{t+1} &&\text{(transfer to environment)}
\end{align*}$$
Next, we analyze some properties of the EFE.
$$\begin{align*}
G(u) &= \sum_{x,s} q(x,s|u)\log\frac{q(s|u)}{p(x,s|u)}\\
&= \sum_{x,s} q(x,s|u)\log\frac{1}{p(x)} + \sum_{x,s} q(x,s|u)\log\frac{q(s|u)\,q(s|x)}{p(s|x,u)\,q(s|x)}\\
&= \sum_{x} q(x|u)\log\frac{1}{p(x)} + \sum_{x,s} q(x,s|u)\log\frac{q(s|u)}{q(s|x)} + \underbrace{\sum_{x,s} q(x,s|u)\log\frac{q(s|x)}{p(s|x,u)}}_{\mathrm{E}\left[D_{\mathrm{KL}}[q(s|x),\,p(s|x,u)]\right]\,\geq\,0}\\
&\geq \underbrace{\sum_{x} q(x|u)\log\frac{1}{p(x)}}_{\substack{\text{goal-seeking behavior}\\ \text{(exploitation)}}} - \underbrace{\sum_{x,s} q(x,s|u)\log\frac{q(s|x)}{q(s|u)}}_{\substack{\text{information-seeking behavior}\\ \text{(exploration)}}}
\end{align*}$$
Apparently, minimization of EFE leads to selection of policies that balances the following two imperatives:
1. minimization of the first term of $G(u)$, i.e., minimizing $\sum_x q(x|u)\log\frac{1}{p(x)}$, leads to policies ($u$) that align the inferred observations $q(x|u)$ under policy $u$ (i.e., predicted future observations under policy $u$) with a prior $p(x)$ on future observations. We are in control to choose any prior $p(x)$, and usually we choose a prior that aligns with desired (goal) observations. Hence, policies with low EFE lead to goal-seeking behavior (a.k.a. pragmatic behavior or exploitation). In the OPTIONAL SLIDES, we derive an alternative (perhaps clearer) expression to support this interpretation.
2. minimization of the second term, i.e., maximizing $\sum_{x,s} q(x,s|u)\log\frac{q(s|x)}{q(s|u)}$, which is the (conditional) mutual information between (posteriors on) future observations and states, for a given policy $u$. Thus, maximizing this term leads to actions that maximize the statistical dependency between future observations and states. In other words, a policy with low EFE also leads to information-seeking behavior (a.k.a. epistemic behavior or exploration).
(The third term $\sum_{x,s} q(x,s|u)\log\frac{q(s|x)}{p(s|x,u)}$ is an (expected) KL divergence between posterior and prior on the states. This can be interpreted as a complexity/regularization term, and $G(u)$ minimization will drive this term to zero.)
Seeking actions that balance goal-seeking behavior (exploitation) and information-seeking behavior (exploration) is a fundamental problem in the
Reinforcement Learning literature.
Active Inference solves the exploration-exploitation dilemma. Both objectives are served by EFE minimization without need for any tuning
parameters.
We highlight another great feature of FE minimizing agents. Consider an AIF agent (m) with generative model p(x, s, u|m).
$$\begin{align*}
F[q] &= \sum_{s,u} q(s,u)\log\frac{q(s,u)}{p(x,s,u|m)}\\
&= \underbrace{-\log p(x|m)}_{\substack{\text{problem}\\ \text{representation costs}}} + \underbrace{\sum_{s,u} q(s,u)\log\frac{q(s,u)}{p(s,u|x,m)}}_{\text{solution costs}}
\end{align*}$$
The first term, − log p(x|m), is the (negative log-) evidence for model m, given recorded data x.
Minimization of FE maximizes the evidence for the given model. The model captures the problem representation. A model with high evidence predicts
the data well and therefore "understands the world".
The second term scores the cost of inference. In almost all cases, the solution to a problem can be phrased as an inference task on the generative
model. Hence, the second term scores the accuracy of the inferred solution, for the given model.
FE minimization optimizes a balanced trade-off between a good-enough problem representation and a good-enough solution proposal for that model.
Since FE comprises both a cost for solution and problem representation, it is a neutral criterion that applies across a very wide set of problems.
A good solution to the wrong problem is not good enough. A poor solution to a great problem statement is not sufficient either. In order to solve a
problem well, we need both to represent the problem correctly (high model evidence) and we need to solve it well (low inference costs).
The above derivations are not trivial, but we have just shown that FE minimizing agents accomplish variational Bayesian perception (a la Kalman
filtering), and a balanced exploration-exploitation trade-off for policy selection.
Moreover, the FE by itself serves as a proper objective across a very wide range of problems, since it scores both the cost of the problem statement and
the cost of inferring the solution.
The current FEP theory claims that minimization of FE (and EFE) is all that brains do, i.e., FE minimization leads to perception, policy selection, learning,
structure adaptation, attention, learning of problems and solutions, etc.
The Engineering Challenge: Synthetic AIF Agents
A current big AI challenge is to design synthetic AIF agents based solely on FE/EFE minimization.
Executing a synthetic AIF agent often poses a large computational problem because of the following reasons:
1. For interesting problems (e.g. speech recognition, scene analysis), generative models may contain thousands of latent variables.
2. The FE function is a time-varying function, since it is also a function of observable variables.
3. An AIF agent must execute inference in real-time if it is engaged and embedded in a real world environment.
So, in practice, executing a synthetic AIF agent may lead to a task of minimizing a time-varying FE function of thousands of variables in real-time!
How to specify and execute a synthetic AIF agent is an active area of research.
There is no definitive solution approach to AIF agent modeling yet; we (BIASlab) think that (reactive) message passing in a factor graph representation
provides a promising path.
After selecting an action a t and making an observation xt , the FFG for the rolled-out generative model is given by the following FFG:
The open red nodes for $p(x_{t+k})$ specify desired future observations, whereas the open black boxes for $p(s_k|s_{k-1},u_k)$ and $p(x_k|s_k)$ reflect the agent's beliefs about how the world actually evolves (i.e., the veridical model).
The (brown) dashed box is the agent's Markov blanket. Given the states on the Markov blanket, the internal states of the agent are independent of the
state of the world.
1. act-execute-observe
2. update the latent variables and select an action
3. slide forward
Here we solve the cart parking problem as stated at the beginning of this lesson. We first specify a generative model for the agent's environment (which is
the observed noisy position of the cart) and then constrain future observations by a prior distribution that is located on the target parking spot. Next, we
schedule a message passing-based inference algorithm for the next action. This is followed by executing the Act-execute-observe --> infer --> slide
procedure to infer a sequence of consecutive actions. Finally, the position of the cart over time is plotted. Note that the cart converges to the target spot.
In [1]: using Pkg;Pkg.activate("probprog/workspace");Pkg.instantiate()
IJulia.clear_output();
In [2]: using PyPlot, LinearAlgebra, ForneyLab
# Load helper functions. Feel free to explore these
include("ai_agent/environment_1d.jl")
include("ai_agent/helpers_1d.jl")
include("ai_agent/agent_1d.jl")
        T = 10 # Lookahead
        # (construction of the factor graph, the prior parameters upsilon/gamma/phi, and the initial state s[1] are not shown in this printout)
        s_k_min = s[1]
for k=2:T
@RV u[k] ~ GaussianMeanVariance(0.0, upsilon) # Control prior
@RV s[k] ~ GaussianMeanPrecision(s_k_min + u[k], gamma) # State transition model
@RV o[k] ~ GaussianMeanPrecision(s[k], phi) # Observation model
GaussianMeanVariance(o[k],
placeholder(:m_o, var_id=:m_o_*k, index=k-1),
placeholder(:v_o, var_id=:v_o_*k, index=k-1)) # Goal prior
s_k_min = s[k]
end
Video
See this video to verify that the robot will be steered to the correct parking spot even after an "adversarial" intervention :). (This project was completed by Burak Ergul as part of his MSc graduation project).
Extensions and Comments
We also have a 2D version of this cart parking problem implemented on a Raspberry Pi-based robot. (Credits for this implementation to Thijs van de Laar and Burak Ergul).
Just to be sure, you don't need to memorize all FE/EFE decompositions nor are you expected to derive them on-the-spot. We present these
decompositions only to provide insight into the multitude of forces that underlie FEM-based action selection.
In a sense, the FEP is an umbrella for describing the mechanics and self-organization of intelligent behavior, in man and machines. Lots of sub-fields in
AI, such as reinforcement learning, can be interpreted as a special case of active inference under the FEP, see e.g., Friston et al., 2009.
Is EFE minimization really different from "regular" FE minimization? Not really, it appears that EFE minimization can be reformulated as a special case of
FE minimization. In other words, FE minimization is still the only game in town.
Active inference also completes the "scientific loop" picture. Under the FEP, experimental/trial design is driven by EFE minimization. Bayesian probability
theory (and FEP) contains all the equations for running scientific inquiry.
The big engineering challenge remains the computational load of AIF. The human brain consumes about 25 Watt and the neocortex only about 4 Watt
(which is about the power consumption of a bicycle light). This is multiple orders of magnitude cheaper than what we can engineer on silicon for similar
tasks.
Final Thoughts
In the end, all the state inference, parameter estimation, etc., in this lecture series could have been implemented by FE minimization in an appropriately
specified generative probabilistic model. However, the Free Energy Principle extends beyond state and parameter estimation. Driven by FE minimization,
brains change their structure as well over time. In fact, the FEP extends beyond brains to a general theory for biological self-organization, e.g., Darwin's
natural selection process may be interpreted as a FE minimization-driven model optimization process, and here's an article on FEP for predictive
processing in plants. Moreover, Constrained-FE minimization (rephrased as the Principle of Maximum Relative Entropy) provides an elegant framework
to derive most (if not all) physical laws, as Caticha exposes in his brilliant monograph on Entropic Physics. Indeed, the framework of FE minimization is
known in the physics community as the very fundamental Principle of Least Action that governs the equations-of-motion in nature.
So, the FEP is very fundamental and extends way beyond applications to machine learning. At our research lab at TU/e, we work on developing FEP-
based intelligent agents that go out into the world and autonomously learn to accomplish a pre-determined task, such as learning-to-walk or learning-
to-process-noisy-speech-signals. Feel free to approach us if you want to know more about that effort.
OPTIONAL SLIDES
In an AIF Agent, Actions fulfill Desired Expectations about the Future
In the derivations above, we decomposed the EFE into an upper bound on the sum of a goal-seeking and an information-seeking term. Here, we derive an alternative (exact) decomposition that more clearly reveals the goal-seeking objective.
We consider again the EFE and factorize the generative model p(x, s|u) = p′ (x)p(s|x, u) as a product of a target prior p′ (x) on observations
and a veridical state model p(s|x, u).
Through the target prior p′ (x), the agent declares which observations it wants to observe in the future. (The prime is just to distinguish the
semantics of a desired future from the model for the actual future).
Through the veridical state model p(s|x, u) , the agent implicitly declares its beliefs about how the world will actually generate observations.
Since
$$
p(s|x,u) = \frac{p(x|s)\,p(s|u)}{p(x|u)} = \frac{p(x|s)\,p(s|u)}{\sum_s p(x|s)\,p(s|u)}\,,
$$
it follows that in practice the agent may specify $p(s|x,u)$ implicitly by explicitly specifying a state transition model $p(s|u)$ and an observation model $p(x|s)$.
Hence, an AIF agent holds both a model for its beliefs about how the world will actually evolve AND a model for its beliefs about how it desires the
world to evolve!!
To highlight the role of these two models in the EFE, consider the following alternative EFE decomposition:
$$\begin{align*}
G(u) &= \sum_{x,s} q(x,s|u)\log\frac{q(s|u)}{p'(x)\,p(s|x,u)}\\
&= \sum_{x,s} q(x,s|u)\log\frac{q(s|u)}{p'(x)}\frac{1}{p(s|x,u)}\\
&= \sum_{x,s} q(x,s|u)\log\frac{q(s|u)}{p'(x)}\frac{p(x|u)}{p(x|s)\,p(s|u)} \qquad\text{(use Bayes)}\\
&= \sum_{x,s} q(x,s|u)\log\frac{q(s|u)}{p(x|s)\,p(s|u)}\frac{p(x|u)}{p'(x)}\\
&= \sum_{x,s} q(x,s|u)\log\frac{q(s|u)}{p(x|s)\,p(s|u)} + \sum_{x,s} q(x,s|u)\log\frac{p(x|u)}{p'(x)}\\
&= \sum_{x,s} p(s|u)\,p(x|s)\log\frac{p(s|u)}{p(x|s)\,p(s|u)} + \sum_{x,s} p(s|u)\,p(x|s)\log\frac{p(x|u)}{p'(x)} \qquad\text{(assume } q(x,s|u)=p(x|s)\,p(s|u)\text{)}\\
&= \sum_s p(s|u)\sum_x p(x|s)\log\frac{1}{p(x|s)} + \sum_x p(x|u)\log\frac{p(x|u)}{p'(x)}\\
&= \underbrace{\mathrm{E}_{p(s|u)}\big[H[p(x|s)]\big]}_{\text{ambiguity}} + \underbrace{D_{\mathrm{KL}}\big[p(x|u),\,p'(x)\big]}_{\text{risk}}
\end{align*}$$
In this derivation, we have assumed that we can use the generative model to make inferences in the "forward" direction. Hence, $q(s|u) = p(s|u)$ and $q(x|s) = p(x|s)$.
The terms "ambiguity" and "risk" have their origin in utility theory for behavioral ecocomics. Minimization of EFE leads to minimizing both ambiguity
and risk.
Ambiguous (future) states are states that map to large uncertainties about (future) observations. We want to avoid those ambiguous states, since they imply that the model is not capable of predicting how the world evolves. Ambiguity can be resolved by selecting information-seeking (epistemic) actions.
Minimization of the second term (risk) leads to choosing actions (u) that align predicted future observations (represented by p(x|u)) with desired
future observations (represented by p′ (x)). Agents minimize risk by selecting pragmatic (goal-seeking) actions.
⇒ Actions fulfill desired expectations about the future!
(return to related cell in main text).
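To make the ambiguity/risk reading concrete, here is a toy sketch (our own; every number is assumed) that scores two candidate policies by the decomposition above and selects one via $q(u)\propto p(u)\exp(-G(u))$ with a uniform $p(u)$:

    # Scoring policies by Expected Free Energy = ambiguity + risk (illustrative sketch)
    H(p)     = -sum(p .* log.(p))                         # entropy
    KL(p, q) = sum(p .* log.(p ./ q))                     # KL divergence
    p_x_given_s = [0.9 0.1;                               # p(x|s): rows s=1,2, columns x=1,2
                   0.5 0.5]                               # state 2 is ambiguous (uninformative)
    p_target    = [0.95, 0.05]                            # p'(x): the agent wants to observe x=1
    p_s_given_u = Dict(1 => [0.9, 0.1], 2 => [0.2, 0.8])  # predicted states under each policy
    G = Dict{Int,Float64}()
    for (u, p_s) in p_s_given_u
        p_x = p_x_given_s' * p_s                          # predicted observations p(x|u)
        ambiguity = sum(p_s[s] * H(p_x_given_s[s, :]) for s in 1:2)
        risk      = KL(p_x, p_target)
        G[u] = ambiguity + risk
    end
    q_u = [exp(-G[u]) for u in 1:2]
    q_u ./= sum(q_u)                                      # q(u) ∝ p(u)·exp(-G(u)), uniform p(u)
    println("G = $G,  q(u) = $(round.(q_u, digits=2))")   # the unambiguous, goal-aligned policy wins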
$$\begin{align*}
H[q] &= \sum_{x,s,u} q(x,s,u)\log\frac{q(s,u)}{p(x,s,u)}\\
&= \sum_{x,s,u} q(x,s|u)\,q(u)\log\frac{q(s|u)\,q(u)}{p(x,s|u)\,p(u)}\\
&= \sum_u q(u)\sum_{x,s} q(x,s|u)\log\frac{q(s|u)\,q(u)}{p(x,s|u)\,p(u)}\\
&= \sum_u q(u)\Big(\log q(u) + \log\frac{1}{p(u)} + \underbrace{\sum_{x,s} q(x,s|u)\log\frac{q(s|u)}{p(x,s|u)}}_{G(u)}\Big)\\
&= \sum_u q(u)\log\frac{q(u)}{p(u)\exp(-G(u))}
\end{align*}$$
This is a KL divergence. Minimization of $H[q]$ leads to the following posterior for the policy:
$$
q^*(u) = \frac{p(u)\exp(-G(u))}{\sum_u p(u)\exp(-G(u))}\,.
$$
According to Friston, an ``intelligent'' agent like a brain minimizes a variational free energy functional, which, in general, is a functional of a probability
distribution p and a variational posterior q.
What should the agent's model p be modeling? This question was (already) answered by Conant and Ashby (1970) as the Good Regulator Theorem: every good regulator of a system must be a model of that system.
A Quote from Conant and Ashby's paper (this statement was later finessed by Friston (2013)):
"The theory has the interesting corollary that the living brain, insofar as it is successful and efficient as a regulator for survival, must
proceed, in learning, by the formation of a model (or models) of its environment."
(Return to related cell in main text).