EE5434 Regression
• Unsupervised learning
• Create an internal representation of the input, e.g. form
clusters or extract features
• How do we know if a representation is good?
• This is the new frontier of machine learning because most big
datasets do not come with labels.
• Reinforcement learning
• Learn actions to maximize payoff
• Not much information in a payoff signal
• Payoff is often delayed
• Reinforcement learning is an important area that will not be
covered in this course.
Hypothesis Space
• One way to think about a supervised learning machine is as a device that
explores a “hypothesis space”.
• Each setting of the parameters in the machine is a different hypothesis
about the function that maps input vectors to output vectors.
• If the data is noise-free, each training example rules out a region of
hypothesis space.
• If the data is noisy, each training example scales the posterior
probability of each point in the hypothesis space in proportion to how
likely the training example is given that hypothesis.
• The art of supervised machine learning is in:
• Deciding how to represent the inputs and outputs
• Selecting a hypothesis space that is powerful enough to represent the
relationship between inputs and outputs but simple enough to be searched.
Searching a hypothesis space
• The obvious method is to first formulate a loss function and
then adjust the parameters to minimize the loss function.
• This allows the optimization to be separated from the
objective function that is being optimized.
• Bayesians do not search for a single set of parameter values
that do well on the loss function.
• They start with a prior distribution over parameter
values and use the training data to compute a posterior
distribution over the whole hypothesis space.
Some Loss Functions
• Squared difference between actual and target real-valued outputs.
• Number of classification errors
• Problematic for optimization because the error count is piecewise
constant: its derivative is zero almost everywhere and undefined at the jumps.
• Negative log probability assigned to the correct answer.
• This is usually the right function to use.
• In some cases it is the same as squared error (regression
with Gaussian output noise)
• In other cases it is very different (classification with
discrete classes needs cross-entropy error)
• The real aim of supervised learning is to do well on test data
that is not known during learning.
• Choosing the values for the parameters that minimize the
loss function on the training data is not necessarily the best policy.
• We want the learning machine to model the true
regularities in the data and to ignore the noise in the data.
• But the learning machine does not know which
regularities are real and which are accidental quirks of
the particular set of training examples we happen to have.
• So how can we be sure that the machine will generalize
correctly to new data?
Trading off the goodness of fit against the
complexity of the model
$$p(W \mid D) = \frac{p(W)\, p(D \mid W)}{p(D)}$$
• $p(W \mid D)$ is the posterior probability of weight vector $W$ given training data $D$.
• The normalizer is $p(D) = \int p(W)\, p(D \mid W)\, dW$.
Why we maximize sums of log probs
• We want to maximize the product of the probabilities of the
outputs on the training cases
• Assume the output errors on different training cases, c, are independent:
$$p(D \mid W) = \prod_c p(d_c \mid W)$$
• Because the log function is monotonic, it does not change
where the maxima are. So we can maximize sums of log probabilities:
$$\log p(D \mid W) = \sum_c \log p(d_c \mid W)$$
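A small numerical sketch of why sums of logs are preferred in practice. The per-case probabilities here are made-up random numbers, not outputs of any model from the course; the point is only that the raw product underflows while the log-sum stays finite.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-case probabilities p(d_c | W) for 2000 independent cases.
probs = rng.uniform(0.1, 0.9, size=2000)

# The raw product of 2000 numbers below 1 underflows in double precision.
likelihood = np.prod(probs)

# The sum of logs is finite, and because log is monotonic it is
# maximized by the same W as the product.
log_likelihood = np.sum(np.log(probs))

print("product of probabilities:", likelihood)
print("sum of log probabilities:", log_likelihood)
```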
$$y = w_1 x_1 + w_2 x_2 + b$$
Living area (feet$^2$), #bedrooms = 4, price = ?
Polynomial Curve Fitting
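$$y(\mathbf{x}, \mathbf{w}) = \mathbf{w}^T \mathbf{x}$$
$$\mathbf{w}^* = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{t}$$
• $\mathbf{t}$ is the vector of target values
• Vector $\mathbf{w} \in \mathbb{R}^{d+1}$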
Linear Regression - Derivation
• Write the problem in matrix form:
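$$L(f_{\mathbf{w}}) = \frac{1}{N} \sum_{i=1}^{N} (\mathbf{w}^T \mathbf{x}_i - y_i)^2 = \frac{1}{N} \lVert \mathbf{X}\mathbf{w} - \mathbf{y} \rVert_2^2$$
• Setting the gradient with respect to $\mathbf{w}$ to zero gives:
$$\mathbf{w} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}$$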
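The closed-form solution can be sketched in NumPy as follows. The data, `w_true`, and the bias column appended to `X` are assumptions made up for illustration; the sketch solves the linear system rather than forming the inverse explicitly, which is usually the more stable choice.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (illustrative assumption): N cases, d features plus a bias column.
N, d = 200, 3
X = np.hstack([rng.normal(size=(N, d)), np.ones((N, 1))])
w_true = np.array([1.5, -2.0, 0.5, 3.0])
y = X @ w_true + 0.1 * rng.normal(size=N)

# Normal equations: solve (X^T X) w = X^T y instead of inverting X^T X.
w_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Reference solution from NumPy's least-squares routine.
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print("normal equations:", w_hat)
print("lstsq           :", w_lstsq)
```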
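• $y_n = y(\mathbf{x}_n, \mathbf{w})$ is the model’s estimate of the most probable value; $t_n$ is the correct answer.
$$p(t_n \mid y_n) = p(y_n + \text{noise} = t_n \mid \mathbf{x}_n, \mathbf{w}) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(t_n - y_n)^2}{2\sigma^2}\right)$$
$$-\log p(t_n \mid y_n) = \log \sqrt{2\pi} + \log \sigma + \frac{(t_n - y_n)^2}{2\sigma^2}$$
• The $\log \sigma$ term can be ignored if sigma is fixed.
• The $1/(2\sigma^2)$ factor can be ignored if sigma is the same for every case.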
Multiple outputs
• If there are multiple outputs we can often treat the learning
problem as a set of independent problems, one per output.
• Not true if the output noise is correlated and changes from
case to case.
• Even though they are independent problems we can save work
by only multiplying the input vectors by the inverse covariance
of the input components once. For output k we have:
$$\mathbf{w}_k^* = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{t}_k$$
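A minimal sketch of the work-saving idea, assuming random illustrative data and a target matrix `T` whose column k holds the targets for output k. The shared factor $(\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T$ is computed once and reused for every output.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative data: N cases, d inputs, K outputs.
N, d, K = 100, 4, 3
X = rng.normal(size=(N, d))
T = rng.normal(size=(N, K))  # column k is the target vector t_k

# Shared factor (X^T X)^{-1} X^T, computed once for all outputs.
P = np.linalg.solve(X.T @ X, X.T)  # shape (d, N)

# Column k of W is w*_k.
W = P @ T

# Check against solving each output as a separate least-squares problem.
W_separate = np.column_stack(
    [np.linalg.lstsq(X, T[:, k], rcond=None)[0] for k in range(K)]
)
print(np.allclose(W, W_separate))
```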
• Updating the weights after each training case is called “online” learning. It can be more
efficient if the dataset is very redundant, and it is simple to implement in hardware.
• It is also called stochastic gradient descent if the training cases are picked
at random.
• Care must be taken with the learning rate to prevent divergent
oscillations, and the rate must decrease at the end to get a good fit.
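The points above can be sketched as a stochastic gradient descent loop for squared error. The data, the initial rate `eta0`, and the particular decay schedule are assumptions chosen for illustration, not values from the course.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression data (illustrative assumption), with a bias column.
N, d = 500, 2
X = np.hstack([rng.normal(size=(N, d)), np.ones((N, 1))])
w_true = np.array([2.0, -1.0, 0.5])
t = X @ w_true + 0.05 * rng.normal(size=N)

w = np.zeros(d + 1)
eta0 = 0.05       # initial learning rate; too large a value makes the updates diverge
n_steps = 20000

for step in range(n_steps):
    n = rng.integers(N)                    # pick one training case at random
    eta = eta0 / (1.0 + step / 2000.0)     # decay the rate to get a good final fit
    error = X[n] @ w - t[n]                # y_n - t_n for this case
    w -= eta * error * X[n]                # gradient of 0.5 * (y_n - t_n)^2

print("learned weights:", w)
```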
Illustrative Example
Looks good!
Polynomial Fitting
Polynomial Coefficients
Very large
Regularized least squares
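$$\tilde{E}(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \{ y(\mathbf{x}_n, \mathbf{w}) - t_n \}^2 + \frac{\lambda}{2} \lVert \mathbf{w} \rVert^2$$
$$\mathbf{w}^* = (\lambda \mathbf{I} + \mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{t}$$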
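A short sketch of regularized least squares on a degree-9 polynomial. The noisy sine data, the helper name `ridge_fit`, and the value of `lam` are assumptions for illustration. It compares the size of the coefficients with and without the penalty.

```python
import numpy as np

rng = np.random.default_rng(0)

# Ten noisy samples of a sine curve (illustrative assumption).
N = 10
x = np.linspace(0.0, 1.0, N)
t = np.sin(2 * np.pi * x) + 0.2 * rng.normal(size=N)

# Degree-9 polynomial design matrix: columns 1, x, x^2, ..., x^9.
X = np.vander(x, 10, increasing=True)

def ridge_fit(X, t, lam):
    """Return w* = (lam I + X^T X)^{-1} X^T t."""
    M = X.shape[1]
    return np.linalg.solve(lam * np.eye(M) + X.T @ X, X.T @ t)

w_unreg, *_ = np.linalg.lstsq(X, t, rcond=None)  # no regularizer
w_reg = ridge_fit(X, t, lam=1e-3)

print("||w|| without regularizer:", np.linalg.norm(w_unreg))
print("||w|| with regularizer   :", np.linalg.norm(w_reg))
```

The penalty trades a slightly worse fit on the training points for much smaller coefficients.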
A picture of the effect of the regularizer
Linear Basis Function Models
Minimizing the absolute error
$$\min_{\mathbf{w}} \sum_n \lvert t_n - \mathbf{w}^T \mathbf{x}_n \rvert$$
• This minimization involves solving a linear
programming problem.
• It corresponds to maximum likelihood estimation if
the output noise is modeled by a Laplacian instead of
a Gaussian.
$$p(t_n \mid y_n) = \frac{a}{2}\, e^{-a \lvert t_n - y_n \rvert}$$
$$-\log p(t_n \mid y_n) = a \lvert t_n - y_n \rvert + \text{const}$$
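A toy comparison of absolute and squared error, assuming the simplest possible model: a single constant prediction $w$ with $x_n = 1$. It uses a brute-force grid search over that one parameter, not a linear-programming solver, and the target values are made up. With one outlier in the targets, the absolute-error minimizer stays at the median while the squared-error minimizer moves to the mean.

```python
import numpy as np

# Hypothetical targets; the last value is an outlier.
t = np.array([1.0, 1.2, 0.9, 1.1, 10.0])

# Candidate values for the single weight w (constant prediction).
w_grid = np.linspace(0.0, 10.0, 10001)

# Total loss for each candidate w under the two criteria.
residuals = t[None, :] - w_grid[:, None]
abs_loss = np.abs(residuals).sum(axis=1)
sq_loss = (residuals ** 2).sum(axis=1)

w_abs = w_grid[np.argmin(abs_loss)]
w_sq = w_grid[np.argmin(sq_loss)]

print("absolute-error minimizer:", w_abs, " median:", np.median(t))
print("squared-error minimizer :", w_sq, " mean:", np.mean(t))
```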
Cross-Validation for Regression
• Test set method
• Leave-one-out
• K-fold cross validation
Test set method
• Pros
• Very simple
• Can then simply choose the method with the best test-
set score
• Cons
• Wastes data: we get an estimate of the best method to
apply to 30% less data
• If we don’t have much data, our test set might just be
lucky or unlucky (“the test-set estimator of performance has
high variance”)
LOOCV (Leave-one-out cross validation)
When you have done all points, report the mean error.
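A minimal sketch of leave-one-out cross validation for least-squares regression. The helper name `loocv_mse` and the synthetic data are assumptions for illustration. Each point is held out once, the model is fit on the rest, and the mean of the held-out squared errors is reported.

```python
import numpy as np

def loocv_mse(X, t):
    """Mean squared error over all leave-one-out splits."""
    N = X.shape[0]
    errors = np.empty(N)
    for i in range(N):
        keep = np.arange(N) != i                       # drop point i
        w, *_ = np.linalg.lstsq(X[keep], t[keep], rcond=None)
        errors[i] = (X[i] @ w - t[i]) ** 2             # error on the held-out point
    return errors.mean()

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=30)
t = 1.0 + 2.0 * x + 0.1 * rng.normal(size=30)

X_const = np.ones((30, 1))                             # constant model
X_linear = np.column_stack([np.ones_like(x), x])       # linear model

mse_const = loocv_mse(X_const, t)
mse_linear = loocv_mse(X_linear, t)
print("LOOCV MSE, constant model:", mse_const)
print("LOOCV MSE, linear model  :", mse_linear)
```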
LOOCV example 1
LOOCV example 2
K-fold Cross Validation
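A minimal sketch of K-fold cross validation used to compare two polynomial degrees. The helper name `kfold_mse`, the noisy sine data, and the choice K = 5 are assumptions for illustration.

```python
import numpy as np

def kfold_mse(X, t, k, rng):
    """Mean held-out squared error of least squares over k folds."""
    perm = rng.permutation(X.shape[0])       # shuffle before splitting
    folds = np.array_split(perm, k)
    fold_errors = []
    for i, held_out in enumerate(folds):
        train = np.concatenate([f for j, f in enumerate(folds) if j != i])
        w, *_ = np.linalg.lstsq(X[train], t[train], rcond=None)
        residual = X[held_out] @ w - t[held_out]
        fold_errors.append(np.mean(residual ** 2))
    return float(np.mean(fold_errors))

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=30)
t = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=30)

scores = {}
for degree in (1, 3):
    X = np.vander(x, degree + 1, increasing=True)
    # Same seed for every model so all degrees see the same folds.
    scores[degree] = kfold_mse(X, t, k=5, rng=np.random.default_rng(1))

print(scores)
```

Reusing the same fold assignment for every candidate keeps the comparison between models fair.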
The bias-variance trade-off
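(a figment of the frequentists’ lack of imagination?)
$$\left\langle \{ y(\mathbf{x}_n; D) - t_n \}^2 \right\rangle_D = \{ \langle y(\mathbf{x}_n; D) \rangle_D - t_n \}^2 + \left\langle \{ y(\mathbf{x}_n; D) - \langle y(\mathbf{x}_n; D) \rangle_D \}^2 \right\rangle_D$$
• The first term on the right is the squared bias and the second is the variance; $\langle \cdot \rangle_D$ averages over training sets $D$.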
$$y(x, \mathbf{w}) = w_0 + w_1 x$$
• It is possible to display the full posterior distribution over the
two-dimensional parameter space.
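• Likelihood:
$$p(\mathbf{t} \mid \mathbf{X}, \mathbf{w}, \beta) = \prod_{n=1}^{N} \mathcal{N}(t_n \mid \mathbf{w}^T \mathbf{x}_n, \beta^{-1})$$
• Conjugate prior ($\alpha$ is the inverse variance of the prior):
$$p(\mathbf{w} \mid \alpha) = \mathcal{N}(\mathbf{w} \mid \mathbf{0}, \alpha^{-1} \mathbf{I})$$
• Negative log posterior:
$$-\ln p(\mathbf{w} \mid \mathbf{t}) = \frac{\beta}{2} \sum_{n=1}^{N} (t_n - \mathbf{w}^T \mathbf{x}_n)^2 + \frac{\alpha}{2} \mathbf{w}^T \mathbf{w} + \text{const}$$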
The Bayesian interpretation of the
regularization parameter:
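Dividing the negative log posterior by $\beta$ suggests that the regularization parameter corresponds to $\lambda = \alpha / \beta$. The sketch below checks this numerically on made-up data: the posterior mean of the Gaussian model (the standard conjugate result, with `S_N_inv` and `m_N` as names chosen here) should coincide with the regularized least-squares solution at that $\lambda$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative data generated from the model itself.
N, M = 50, 3
alpha, beta = 2.0, 25.0                     # prior precision, noise precision
X = rng.normal(size=(N, M))
w_true = rng.normal(scale=alpha ** -0.5, size=M)
t = X @ w_true + rng.normal(scale=beta ** -0.5, size=N)

# Gaussian posterior: S_N^{-1} = alpha I + beta X^T X,  m_N = beta S_N X^T t.
S_N_inv = alpha * np.eye(M) + beta * X.T @ X
m_N = beta * np.linalg.solve(S_N_inv, X.T @ t)

# Regularized least squares with lambda = alpha / beta.
lam = alpha / beta
w_ridge = np.linalg.solve(lam * np.eye(M) + X.T @ X, X.T @ t)

print("posterior mean:", m_N)
print("ridge solution:", w_ridge)
```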
Using the posterior distribution
• If we can afford the computation, we ought to average the
predictions of all parameter settings using the posterior
distribution to weight the predictions:
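$$p(t_{\text{test}} \mid \mathbf{x}_{\text{test}}, \alpha, \beta, D) = \int p(t_{\text{test}} \mid \mathbf{x}_{\text{test}}, \beta, \mathbf{w})\, p(\mathbf{w} \mid \alpha, \beta, D)\, d\mathbf{w}$$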
PRML(Bishop): 3.3.2
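For the linear-Gaussian model this integral has a closed form: the predictive distribution is Gaussian with mean $\mathbf{m}_N^T \mathbf{x}$ and variance $1/\beta + \mathbf{x}^T \mathbf{S}_N \mathbf{x}$. A sketch under assumed values of $\alpha$ and $\beta$ and made-up data for the straight-line model $y = w_0 + w_1 x$; the function names are chosen here for illustration.

```python
import numpy as np

def posterior(X, t, alpha, beta):
    """Posterior mean m_N and covariance S_N of the weights."""
    M = X.shape[1]
    S_N = np.linalg.inv(alpha * np.eye(M) + beta * X.T @ X)
    m_N = beta * S_N @ X.T @ t
    return m_N, S_N

def predict(X_test, m_N, S_N, beta):
    """Predictive mean and variance for each row of X_test."""
    mean = X_test @ m_N
    # Noise variance plus the uncertainty in w projected onto each test input.
    var = 1.0 / beta + np.einsum("ij,jk,ik->i", X_test, S_N, X_test)
    return mean, var

rng = np.random.default_rng(0)
alpha, beta = 2.0, 25.0                      # assumed hyperparameters
x = rng.uniform(-1.0, 1.0, size=20)
t = -0.3 + 0.5 * x + rng.normal(scale=beta ** -0.5, size=20)
X = np.column_stack([np.ones_like(x), x])    # rows are (1, x_n)

m_N, S_N = posterior(X, t, alpha, beta)

# One test input inside the training range, one far outside it.
X_test = np.array([[1.0, 0.0], [1.0, 5.0]])
mean, var = predict(X_test, m_N, S_N, beta)
print("predictive mean    :", mean)
print("predictive variance:", var)
```

The variance never drops below the noise level $1/\beta$ and grows for inputs far from the training data.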
A way to see the covariance of the predictions
for different values of x
We sample models at random from the posterior and show the mean of
each model’s predictions.
Sequential Bayesian Regression
Bayesian model comparison
• We usually need to decide between many different models:
• Different numbers of basis functions
• Different types of basis functions
• Different strengths of regularizers
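$$p(\mathbf{t} \mid \mathbf{X}, \mathbf{w}, \beta) = \prod_{n=1}^{N} \mathcal{N}(t_n \mid \mathbf{w}^T \mathbf{x}_n, \beta^{-1})$$
$$-\ln p(\mathbf{w} \mid \mathbf{t}) = \frac{\beta}{2} \sum_{n=1}^{N} (t_n - \mathbf{w}^T \mathbf{x}_n)^2 + \frac{\alpha}{2} \mathbf{w}^T \mathbf{w} + \text{const}$$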
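$$p(\mathbf{w} \mid D, M_i) = \frac{p(\mathbf{w} \mid M_i)\, p(D \mid \mathbf{w}, M_i)}{p(D \mid M_i)}$$
• The normalizer is the evidence for model class $M_i$:
$$p(D \mid M_i) = \int p(D \mid \mathbf{w}, M_i)\, p(\mathbf{w} \mid M_i)\, d\mathbf{w}$$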
Using the evidence
• Now we use the evidence for a model class in exactly the same
way as we use the likelihood term for a particular setting of the parameters.
• The evidence gives us a posterior distribution over model
classes, provided we have a prior.
$$p(M_i \mid D) \propto p(M_i)\, p(D \mid M_i)$$
• For simplicity in making predictions we often just pick the
model class with the highest posterior probability. This is
called model selection.
• But we should still average over the parameter vectors for that
model class using the posterior distribution.
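A sketch of evidence-based comparison between polynomial model classes, assuming fixed and known $\alpha$ and $\beta$, a uniform prior over model classes, and made-up quadratic data. For the linear-Gaussian model the weights integrate out analytically, so the targets are jointly Gaussian with zero mean and covariance $\beta^{-1} \mathbf{I} + \alpha^{-1} \boldsymbol{\Phi} \boldsymbol{\Phi}^T$, where $\boldsymbol{\Phi}$ (a name chosen here) is the design matrix of the model class.

```python
import numpy as np

def log_evidence(Phi, t, alpha, beta):
    """ln p(t | M) for a linear-Gaussian model with design matrix Phi."""
    N = Phi.shape[0]
    # Marginal covariance of t after integrating out w.
    C = np.eye(N) / beta + Phi @ Phi.T / alpha
    _, logdet = np.linalg.slogdet(C)
    quad = t @ np.linalg.solve(C, t)
    return -0.5 * (N * np.log(2 * np.pi) + logdet + quad)

rng = np.random.default_rng(0)
N = 30
alpha, beta = 1.0, 100.0                     # assumed hyperparameters
x = rng.uniform(-1.0, 1.0, size=N)
# Hypothetical data from a quadratic plus Gaussian noise.
t = 0.5 - 1.0 * x + 2.0 * x ** 2 + rng.normal(scale=beta ** -0.5, size=N)

degrees = list(range(7))
log_ev = np.array(
    [log_evidence(np.vander(x, m + 1, increasing=True), t, alpha, beta)
     for m in degrees]
)

# Posterior over model classes under a uniform prior p(M_i).
post = np.exp(log_ev - log_ev.max())
post /= post.sum()

for m, le, p in zip(degrees, log_ev, post):
    print(f"degree {m}: log evidence = {le:8.2f}, posterior = {p:.3f}")
```

Model selection then amounts to picking the degree with the largest posterior probability; predictions for that class should still average over its weight posterior.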
Empirical Bayes
• Empirical Bayes (also called the evidence approximation)
means integrating out the parameters but maximizing over
the hyperparameters.
• It is more feasible than a fully Bayesian treatment of the hyperparameters and often works well.