EE5434 Regression
• Unsupervised learning
• Create an internal representation of the input e.g. form
clusters; extract features
• How do we know if a representation is good?
• This is the new frontier of machine learning because most big
datasets do not come with labels.
• Reinforcement learning
• Learn action to maximize payoff
• Not much information in a payoff signal
• Payoff is often delayed
• Reinforcement learning is an important area that will not be
covered in this course.
Hypothesis Space
• One way to think about a supervised learning machine is as a device that
explores a “hypothesis space”.
• Each setting of the parameters in the machine is a different hypothesis
about the function that maps input vectors to output vectors.
• If the data is noise-free, each training example rules out a region of
hypothesis space.
• If the data is noisy, each training example scales the posterior
probability of each point in the hypothesis space in proportion to how
likely the training example is given that hypothesis.
• The art of supervised machine learning is in:
• Deciding how to represent the inputs and outputs
• Selecting a hypothesis space that is powerful enough to represent the
relationship between inputs and outputs but simple enough to be
searched.
Searching a hypothesis space
• The obvious method is to first formulate a loss function and
then adjust the parameters to minimize the loss function.
• This allows the optimization to be separated from the
objective function that is being optimized.
• Bayesians do not search for a single set of parameter values
that do well on the loss function.
• They start with a prior distribution over parameter
values and use the training data to compute a posterior
distribution over the whole hypothesis space.
Some Loss Functions
• Squared difference between actual and target real-valued
outputs.
• Number of classification errors
• Problematic for optimization because the error count is
piecewise constant, so its derivative is zero almost everywhere.
• Negative log probability assigned to the correct answer.
• This is usually the right function to use.
• In some cases it is the same as squared error (regression
with Gaussian output noise)
• In other cases it is very different (classification with
discrete classes needs cross-entropy error)
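As a concrete illustration, the three losses above can be computed on a tiny hypothetical example (the targets and outputs below are made up for illustration):

```python
import numpy as np

# Hypothetical targets and model outputs.
t = np.array([1.0, 0.0, 1.0])   # binary targets
y = np.array([0.9, 0.2, 0.6])   # model outputs (probabilities for class 1)

# Squared difference between actual and target outputs.
squared_error = np.sum((y - t) ** 2)

# Number of classification errors (threshold at 0.5): piecewise constant in y.
class_errors = np.sum((y > 0.5).astype(float) != t)

# Negative log probability assigned to the correct answer (cross-entropy).
cross_entropy = -np.sum(t * np.log(y) + (1 - t) * np.log(1 - y))

print(squared_error, class_errors, cross_entropy)
```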
Generalization
• The real aim of supervised learning is to do well on test data
that is not seen during learning.
• Choosing the values for the parameters that minimize the
loss function on the training data is not necessarily the best
policy.
• We want the learning machine to model the true
regularities in the data and to ignore the noise in the data.
• But the learning machine does not know which
regularities are real and which are accidental quirks of
the particular set of training examples we happen to
pick.
• So how can we be sure that the machine will generalize
correctly to new data?
Trading off the goodness of fit against the
complexity of the model

    p(W | D) = p(W) p(D | W) / p(D)

Posterior probability of weight vector W given training data D.
The normalizing constant is p(D) = Σ_W p(W) p(D | W).
Why we maximize sums of log probs
• We want to maximize the product of the probabilities of the
outputs on the training cases
• Assume the output errors on different training cases, c, are
independent.
    p(D | W) = Π_c p(d_c | W)

• Because the log function is monotonic, it does not change
where the maxima are. So we can maximize sums of log
probabilities:

    log p(D | W) = Σ_c log p(d_c | W)
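A quick numerical sketch of why sums of logs are preferred in practice: the product of many small per-case probabilities underflows in floating point, while the sum of logs stays perfectly representable (the probabilities below are made up):

```python
import numpy as np

# Per-case likelihoods p(d_c | W) for 1000 independent training cases
# (hypothetical values, all 0.1).
probs = np.full(1000, 0.1)

product = np.prod(probs)         # 0.1^1000 underflows to exactly 0.0 in float64
log_sum = np.sum(np.log(probs))  # 1000 * log(0.1), no underflow

print(product, log_sum)
```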
    y = w_1 x_1 + w_2 x_2 + b

Living area (feet²), #bedrooms = 4, price = ?

Polynomial Curve Fitting

    y = w^T x

    w* = (X^T X)^{-1} X^T t,   where t is the vector of target values

• Vector w ∈ R^{d+1}
Linear Regression - Derivation
• Write the problem in matrix form:

    L(f_w) = (1/N) Σ_{i=1}^N (w^T x_i − y_i)² = (1/N) ||Xw − y||²_2

• Setting the gradient with respect to w to zero gives the closed-form solution:

    w = (X^T X)^{-1} X^T y
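A minimal sketch of the closed-form solution with NumPy, assuming synthetic data with made-up true weights (solving the normal equations directly; `np.linalg.lstsq` would be the numerically safer route on ill-conditioned data):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: y = 2*x1 - 3*x2 + 1 plus a little noise.
X = np.hstack([rng.normal(size=(100, 2)), np.ones((100, 1))])  # bias column
true_w = np.array([2.0, -3.0, 1.0])
y = X @ true_w + 0.01 * rng.normal(size=100)

# Normal equations: w = (X^T X)^{-1} X^T y.
w = np.linalg.solve(X.T @ X, X.T @ y)

print(w)  # close to [2, -3, 1]
```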
    y_n = y(x_n, w)

where t_n is the correct answer and y_n is the model's estimate of the most
probable value.

    p(t_n | y_n) = p(y_n + noise = t_n | x_n, w) = (1 / (√(2π) σ)) exp(−(t_n − y_n)² / (2σ²))

    −log p(t_n | y_n) = log √(2π) + log σ + (t_n − y_n)² / (2σ²)

The log σ term can be ignored if σ is fixed; the 1/(2σ²) scaling can be ignored
if σ is the same for every case. What remains is the squared error.
Multiple outputs
• If there are multiple outputs we can often treat the learning
problem as a set of independent problems, one per output.
• Not true if the output noise is correlated and changes from
case to case.
• Even though they are independent problems we can save work
by only multiplying the input vectors by the inverse covariance
of the input components once. For output k we have:
    w_k* = (X^T X)^{-1} X^T t_k
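A small sketch of the shared-work idea: solving the normal equations against the whole target matrix handles every output column at once, so the expensive factorization is done a single time (the data below is random and purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
T = rng.normal(size=(50, 4))   # 4 independent output columns (hypothetical)

# Solve against the whole target matrix: column k of W is w_k*.
W = np.linalg.solve(X.T @ X, X.T @ T)   # shape (3, 4)

# Same result as solving each output separately.
w0 = np.linalg.solve(X.T @ X, X.T @ T[:, 0])
print(np.allclose(W[:, 0], w0))  # True
```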
• Instead of solving in closed form, we can update the weights a little after
each training case is presented. This is called “online” learning. It can be more
efficient if the dataset is very redundant and it is simple to implement in
hardware.
• It is also called stochastic gradient descent if the training cases are picked
at random.
• Care must be taken with the learning rate to prevent divergent
oscillations, and the rate must decrease at the end to get a good fit.
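A minimal stochastic-gradient sketch for linear regression with a decaying learning rate, as described above (the data, initial learning rate, and decay schedule are all made-up choices):

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical data: y = 3*x - 1 plus noise; second column is the bias.
X = np.hstack([rng.normal(size=(200, 1)), np.ones((200, 1))])
y = X @ np.array([3.0, -1.0]) + 0.1 * rng.normal(size=200)

w = np.zeros(2)
lr = 0.05
for epoch in range(100):
    for i in rng.permutation(len(y)):   # random order: stochastic gradient descent
        err = X[i] @ w - y[i]
        w -= lr * err * X[i]            # gradient step on one case's squared error
    lr *= 0.97                          # decay the rate to settle into a good fit

print(w)  # near [3, -1]
```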
Illustrative Example
[figure: a fitted curve that matches the data — looks good!]
Polynomial Fitting
[figures: polynomial fits of increasing degree, showing overfitting]
Polynomial Coefficients
[table: the fitted coefficients become very large as the degree grows]
Regularized least squares

    Ẽ(w) = (1/2) Σ_{n=1}^N {y(x_n, w) − t_n}² + (λ/2) ||w||²

    w* = (λI + X^T X)^{-1} X^T t        (I is the identity matrix)
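A short sketch of regularized least squares in NumPy; the data is random and the value of λ is an arbitrary assumption. The regularizer shrinks the weight vector relative to ordinary least squares:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(30, 5))
t = X @ rng.normal(size=5) + 0.1 * rng.normal(size=30)

lam = 0.5          # regularization strength lambda (hypothetical value)
d = X.shape[1]

# Regularized least squares: w* = (lambda*I + X^T X)^{-1} X^T t
w_ridge = np.linalg.solve(lam * np.eye(d) + X.T @ X, X.T @ t)

# lambda -> 0 recovers the ordinary least-squares solution.
w_ols = np.linalg.solve(X.T @ X, X.T @ t)
print(np.linalg.norm(w_ridge), np.linalg.norm(w_ols))  # ridge norm is smaller
```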
A picture of the effect of the regularizer
Recap
Linear Basis Function Models
Minimizing the absolute error
    min over w:  Σ_n | t_n − w^T x_n |
• This minimization involves solving a linear
programming problem.
• It corresponds to maximum likelihood estimation if
the output noise is modeled by a Laplacian instead of
a Gaussian.
    p(t_n | y_n) = (a/2) exp(−a |t_n − y_n|)

    −log p(t_n | y_n) = a |t_n − y_n| + const
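A tiny numerical illustration of the Gaussian-vs-Laplacian difference: for a constant model y = c, the squared loss is minimized by the mean of the targets while the absolute loss is minimized by the median, which is robust to outliers (the data values are made up):

```python
import numpy as np

# Targets with one outlier (hypothetical data).
t = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

# Evaluate both losses on a fine grid of candidate constants c.
cs = np.linspace(0, 110, 100001)
l1 = np.abs(t[:, None] - cs[None, :]).sum(axis=0)      # absolute (Laplacian) loss
l2 = ((t[:, None] - cs[None, :]) ** 2).sum(axis=0)     # squared (Gaussian) loss

print(cs[l1.argmin()], np.median(t))  # ≈ 3.0: robust to the outlier
print(cs[l2.argmin()], np.mean(t))    # ≈ 22.0: dragged by the outlier
```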
Cross-Validation for Regression
• Test set method
• Leave-one-out
• K-fold cross validation
Test set method
• Pros
• Very simple
• Can then simply choose the method with the best test-
set score
• Cons
• Wastes data: we get an estimate of the best method to
apply to 30% less data
• If we don’t have much data, our test set might just be
lucky or unlucky (“test-set estimator of performance has
high variance”)
LOOCV (Leave-one-out cross validation)
Report the mean error once every point has been held out.
LOOCV example 1
LOOCV example 2
K-fold Cross Validation
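A minimal K-fold cross-validation sketch for choosing a polynomial degree; the data-generating function, noise level, and candidate degrees are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(4)
# Hypothetical data: noisy samples of sin(pi * x).
x = rng.uniform(-1, 1, 60)
t = np.sin(np.pi * x) + 0.1 * rng.normal(size=60)

def kfold_mse(degree, k=5):
    """Mean held-out squared error of a degree-`degree` polynomial fit."""
    idx = rng.permutation(len(x))
    folds = np.array_split(idx, k)
    errs = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        coeffs = np.polyfit(x[train], t[train], degree)
        pred = np.polyval(coeffs, x[test])
        errs.append(np.mean((pred - t[test]) ** 2))
    return np.mean(errs)

scores = {d: kfold_mse(d) for d in [1, 3, 9]}
print(scores)  # the cubic fit should beat the straight line on held-out data
```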
Summary
The bias-variance trade-off
(a figment of the frequentists’ lack of imagination?)

    ⟨(y(x_n; D) − t_n)²⟩_D = (⟨y(x_n; D)⟩_D − t_n)² + ⟨(y(x_n; D) − ⟨y(x_n; D)⟩_D)²⟩_D
                               “bias”²                 “variance”

where ⟨·⟩_D denotes an average over training sets D.
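The decomposition above can be estimated by simulation: fit many independently drawn datasets D and measure the bias² and variance terms directly (the target function, noise level, and degrees below are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(7)
x = np.linspace(-1, 1, 25)
true_f = np.sin(np.pi * x)   # hypothetical true regularity

def avg_fit(degree, n_datasets=200):
    """Bias^2 and variance of degree-`degree` polynomial fits over many D."""
    preds = []
    for _ in range(n_datasets):
        t = true_f + 0.3 * rng.normal(size=len(x))   # fresh noisy dataset D
        preds.append(np.polyval(np.polyfit(x, t, degree), x))
    preds = np.array(preds)
    bias2 = np.mean((preds.mean(axis=0) - true_f) ** 2)   # (<y>_D - t)^2
    variance = np.mean(preds.var(axis=0))                 # <(y - <y>_D)^2>_D
    return bias2, variance

results = {d: avg_fit(d) for d in [1, 3, 9]}
for d, (b2, v) in results.items():
    print(d, b2, v)   # bias falls and variance rises as the degree grows
```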
y ( x, w ) = w0 + w1x
• It is possible to display the full posterior distribution over the
two-dimensional parameter space.
    p(t | X, w, β) = Π_{n=1}^N N(t_n | w^T x_n, β^{-1})        likelihood

    p(w | α) = N(w | 0, α^{-1} I)                              conjugate prior
                                        (α is the inverse variance of the prior)

    −ln p(w | t) = (β/2) Σ_{n=1}^N (t_n − w^T x_n)² + (α/2) w^T w + const
The Bayesian interpretation of the
regularization parameter:  λ = α / β
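A quick numerical check of this interpretation: minimizing −ln p(w|t) gives exactly the ridge solution with λ = α/β (the data and precision values below are made up):

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(40, 3))
t = X @ np.array([1.0, -2.0, 0.5]) + 0.2 * rng.normal(size=40)

alpha, beta = 2.0, 25.0   # hypothetical prior / noise precisions
lam = alpha / beta        # the equivalent regularization parameter

# Minimizing (beta/2) sum (t_n - w^T x_n)^2 + (alpha/2) w^T w
# gives (beta X^T X + alpha I) w = beta X^T t ...
w_map = np.linalg.solve(beta * X.T @ X + alpha * np.eye(3), beta * X.T @ t)

# ... which is the ridge solution with lambda = alpha/beta.
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ t)
print(np.allclose(w_map, w_ridge))  # True
```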
Using the posterior distribution
• If we can afford the computation, we ought to average the
predictions of all parameter settings using the posterior
distribution to weight the predictions:
    p(t_test | x_test, α, β, D) = ∫ p(t_test | x_test, β, w) p(w | α, β, D) dw
PRML(Bishop): 3.3.2
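For the conjugate Gaussian model of PRML 3.3.2 this integral has a closed form, so the predictive average can be sketched directly (the basis, α, β, and data below are assumed values):

```python
import numpy as np

rng = np.random.default_rng(6)
alpha, beta = 2.0, 25.0   # hypothetical prior / noise precisions

# Hypothetical data with basis phi(x) = [1, x] and true weights [0.5, -1.0].
Phi = np.hstack([np.ones((20, 1)), rng.uniform(-1, 1, (20, 1))])
t = Phi @ np.array([0.5, -1.0]) + rng.normal(scale=beta ** -0.5, size=20)

# Posterior over w (PRML eqs. 3.53-3.54): N(w | m_N, S_N).
S_N = np.linalg.inv(alpha * np.eye(2) + beta * Phi.T @ Phi)
m_N = beta * S_N @ Phi.T @ t

# Predictive distribution at a new input (PRML eqs. 3.58-3.59):
phi_test = np.array([1.0, 0.3])
pred_mean = m_N @ phi_test
pred_var = 1.0 / beta + phi_test @ S_N @ phi_test  # noise + parameter uncertainty

print(pred_mean, pred_var)
```

Note how the predictive variance is strictly larger than the output noise alone: the second term carries the remaining uncertainty about w.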
A way to see the covariance of the predictions
for different values of x:
[figure: we sample models at random from the posterior and show each
sampled model’s predictions]
Sequential Bayesian Regression
Bayesian model comparison
• We usually need to decide between many different models:
• Different numbers of basis functions
• Different types of basis functions
• Different strengths of regularizers
    p(t | X, w, β) = Π_{n=1}^N N(t_n | w^T x_n, β^{-1})

    −ln p(w | t) = (β/2) Σ_{n=1}^N (t_n − w^T x_n)² + (α/2) w^T w + const
    p(w | D, M_i) = p(w | M_i) p(D | w, M_i) / p(D | M_i)

    p(D | M_i) = ∫ p(D | w, M_i) p(w | M_i) dw
Using the evidence
• Now we use the evidence for a model class in exactly the same
way as we use the likelihood term for a particular setting of the
parameters
• The evidence gives us a posterior distribution over model
classes, provided we have a prior.
    p(M_i | D) ∝ p(M_i) p(D | M_i)
• For simplicity in making predictions we often just pick the
model class with the highest posterior probability. This is
called model selection.
• But we should still average over the parameter vectors for that
model class using the posterior distribution.
Empirical Bayes
• Empirical Bayes (also called the evidence approximation)
means integrating out the parameters but maximizing over
the hyperparameters.
• It is more feasible than the full Bayesian treatment and often works well.