EE5434 Regression


EE5434

Machine Learning for Signal Processing Applications
Dr. LI Haoliang
Department of Electrical Engineering

Slides are adapted from Bishop and Hinton


Types of learning task
• Supervised learning
• Learn to predict output when given an input vector

• Unsupervised learning
• Create an internal representation of the input e.g. form
clusters; extract features
• How do we know if a representation is good?
• This is the new frontier of machine learning because most big
datasets do not come with labels.

• Reinforcement learning
• Learn action to maximize payoff
• Not much information in a payoff signal
• Payoff is often delayed
• Reinforcement learning is an important area that will not be
covered in this course.
Hypothesis Space
• One way to think about a supervised learning machine is as a device that
explores a “hypothesis space”.
• Each setting of the parameters in the machine is a different hypothesis
about the function that maps input vectors to output vectors.
• If the data is noise-free, each training example rules out a region of
hypothesis space.
• If the data is noisy, each training example scales the posterior
probability of each point in the hypothesis space in proportion to how
likely the training example is given that hypothesis.
• The art of supervised machine learning is in:
• Deciding how to represent the inputs and outputs
• Selecting a hypothesis space that is powerful enough to represent the
relationship between inputs and outputs but simple enough to be
searched.
Searching a hypothesis space
• The obvious method is to first formulate a loss function and
then adjust the parameters to minimize the loss function.
• This allows the optimization to be separated from the
objective function that is being optimized.
• Bayesians do not search for a single set of parameter values
that do well on the loss function.
• They start with a prior distribution over parameter
values and use the training data to compute a posterior
distribution over the whole hypothesis space.
Some Loss Functions
• Squared difference between actual and target real-valued
outputs.
• Number of classification errors
• Problematic for optimization because the error count is a step
function of the parameters: its derivative is zero almost
everywhere and undefined at the steps.
• Negative log probability assigned to the correct answer.
• This is usually the right function to use.
• In some cases it is the same as squared error (regression
with Gaussian output noise)
• In other cases it is very different (classification with
discrete classes needs cross-entropy error)
Generalization
• The real aim of supervised learning is to do well on test data
that is not known during learning.
• Choosing the values for the parameters that minimize the
loss function on the training data is not necessarily the best
policy.
• We want the learning machine to model the true
regularities in the data and to ignore the noise in the data.
• But the learning machine does not know which
regularities are real and which are accidental quirks of
the particular set of training examples we happen to
pick.
• So how can we be sure that the machine will generalize
correctly to new data?
Trading off the goodness of fit against the
complexity of the model

• It is intuitively obvious that you can only expect a model to
generalize well if it explains the data surprisingly well given
the complexity of the model.
• If the model has as many degrees of freedom as the data, it
can fit the data perfectly but so what?
• There is a lot of theory about how to measure the model
complexity and how to control it to optimize generalization.
• Some of this “learning theory” will be covered later in the
course, but it requires a whole course on learning theory
to cover it properly.
A sampling assumption
• Assume that the training examples are drawn independently
from the set of all possible examples.
• Assume that each time a training example is drawn, it
comes from an identical distribution (i.i.d)
• Assume that the test examples are drawn in exactly the
same way – i.i.d. and from the same distribution as the
training data.
• This is a reasonable assumption, but may not be true in some cases.
The probabilistic guarantee
$$E_{test} \;\le\; E_{train} + \left[\frac{h + h\log(2N/h) - \log(p/4)}{N}\right]^{1/2}$$

where N = size of training set, h = VC dimension of the model class (its complexity), and p = upper bound on the probability that this bound fails.

So if we train models with different complexity, we should pick
the one that minimizes this bound.

Actually, this is only sensible if we think the bound is fairly
tight, which it usually isn't. The theory provides insight, but in
practice we still need some witchcraft.
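To make the bound concrete, here is a minimal Python sketch that evaluates it for a few hypothetical (h, E_train) pairs; the training errors and dataset size below are made up for illustration:

```python
import numpy as np

def vc_bound(e_train, N, h, p=0.05):
    # E_test <= E_train + sqrt((h + h*log(2N/h) - log(p/4)) / N)
    return e_train + np.sqrt((h + h * np.log(2 * N / h) - np.log(p / 4)) / N)

# Hypothetical models: higher capacity h fits the training set better,
# but the complexity term of the bound grows with h.
N = 1000
for h, e_train in [(5, 0.20), (20, 0.12), (80, 0.08)]:
    print(f"h={h:3d}  bound on E_test = {vc_bound(e_train, N, h):.3f}")
```

In line with the caveat above, the absolute numbers are loose; at best the relative ordering across model complexities is informative.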
Using a validation set
• Divide the total dataset into three subsets:
• Training data is used for learning the parameters of the
model.
• Validation data is not used for learning but is used for
deciding what type of model and what amount of
regularization works best.
• Test data is used to get a final, unbiased estimate of how
well the network works. We expect this estimate to be
worse than on the validation data.
• We could then re-divide the total dataset to get
another unbiased estimate of the true error rate.
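A minimal sketch of such a three-way split in Python (the 60/20/20 ratios are illustrative choices, not prescribed by the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 1000                                      # hypothetical dataset size
idx = rng.permutation(N)

train_idx = idx[: int(0.6 * N)]               # learn the model parameters
val_idx = idx[int(0.6 * N): int(0.8 * N)]     # pick model type / regularization
test_idx = idx[int(0.8 * N):]                 # final, unbiased error estimate
```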
The Bayesian framework

• The Bayesian framework assumes that we always have a prior
distribution for everything.
• The prior may be very vague.
• When we see some data, we combine our prior distribution with a
likelihood term to get a posterior distribution.
• The likelihood term takes into account how probable the observed
data is given the parameters of the model.
• It favors parameter settings that make the data likely.
• It fights the prior
• With enough data the likelihood terms always win.
Bayes Theorem
$$p(D)\,p(W \mid D) \;=\; p(D, W) \;=\; p(W)\,p(D \mid W)$$

$$p(W \mid D) = \frac{p(W)\,p(D \mid W)}{p(D)}, \qquad p(D) = \sum_W p(W)\,p(D \mid W)$$

where $p(W)$ is the prior probability of weight vector $W$, $p(D \mid W)$ is the probability of the observed data given $W$, and $p(W \mid D)$ is the posterior probability of weight vector $W$ given training data $D$. The first line equates the joint probability $p(D, W)$ with its two factorizations into conditional probabilities.
Why we maximize sums of log probs
• We want to maximize the product of the probabilities of the
outputs on the training cases
• Assume the output errors on different training cases, c, are
independent.

$$p(D \mid W) = \prod_c p(d_c \mid W)$$
• Because the log function is monotonic, it does not change
where the maxima are. So we can maximize sums of log
probabilities

$$\log p(D \mid W) = \sum_c \log p(d_c \mid W)$$
A cheaper trick
• Suppose we completely ignore the prior over weight
vectors
• This is equivalent to giving all possible weight vectors the
same prior probability density.
• Then all we have to do is to maximize:

$$\log p(D \mid W) = \sum_c \log p(d_c \mid W)$$

• This is called maximum likelihood learning. It is very widely
used for fitting models in statistics.
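A tiny numeric check of why the log matters in practice, with hypothetical per-case probabilities: the product of many probabilities underflows, while the sum of logs stays well behaved and has the same maximizer because log is monotonic.

```python
import numpy as np

p_cases = np.full(10000, 0.9)      # hypothetical p(d_c | W), one per training case
print(np.prod(p_cases))            # 0.0 -- the raw product underflows
print(np.sum(np.log(p_cases)))     # log p(D|W) = sum_c log p(d_c|W), stays finite
```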
Supervised Maximum Likelihood Learning
• Minimizing the squared residuals is equivalent to maximizing
the log probability of the correct answer under a Gaussian
centered at the model's guess.

Here $d_c$ is the correct answer and $y_c$ is the model's estimate of the most probable value:

$$y_c = f(\text{input}_c, W)$$

$$p(\text{output} = d_c \mid \text{input}_c, W) = p(d_c \mid y_c) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(d_c - y_c)^2}{2\sigma^2}}$$

$$-\log p(\text{output} = d_c \mid \text{input}_c, W) = k + \frac{(d_c - y_c)^2}{2\sigma^2}$$
Supervised Maximum Likelihood Learning
• Finding a set of weights, W, that minimizes the squared
errors is exactly the same as finding a W that maximizes
the log probability that the model would produce the
desired outputs on all the training cases.
• We implicitly assume that zero-mean Gaussian noise is added
to the model’s actual output.
• We do not need to know the variance of the noise because
we are assuming it’s the same in all cases. So it just scales the
squared error.
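A quick numeric check of this equivalence (the noise level, target, and output below are hypothetical): the negative log probability under the Gaussian equals a constant plus the squared error scaled by 1/(2σ²).

```python
import numpy as np

sigma, d, y = 0.5, 1.3, 1.0                   # hypothetical noise level, target, output
neg_log_p = -np.log(np.exp(-(d - y) ** 2 / (2 * sigma**2))
                    / (np.sqrt(2 * np.pi) * sigma))
k = np.log(np.sqrt(2 * np.pi) * sigma)        # constant, independent of d and y
assert np.isclose(neg_log_p, k + (d - y) ** 2 / (2 * sigma**2))
```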
Regression
A Motivating Example

$$y = w_1 x_1 + w_2 x_2 + b$$

Living area (feet²), #bedrooms = 4, price = ?
Polynomial Curve Fitting

$$y(x, \mathbf{w}) = \sum_{j=0}^{M} w_j x^j = \mathbf{w}^T \mathbf{x}$$

Question: how to find w?


The loss function
• Fitting a model to data is typically done by finding the parameter
values that minimize some loss function.
• There are many possible loss functions. What criterion should we
use for choosing one?
• Choose one that makes the math easy (squared error)
• Choose one that makes the fitting correspond to maximizing
the likelihood of the training data given some noise model for
the observed outputs.
• Choose one that makes it easy to interpret the learned
coefficients (easy if mostly zeros)
• Choose one that corresponds to the real loss on a practical
application (losses are often asymmetric)
Sum-of-Squares Error Function

y: predicted output, t: ground truth

$$E(\mathbf{w}) = \frac{1}{2}\sum_{n=1}^{N}\{y(\mathbf{x}_n, \mathbf{w}) - t_n\}^2$$

Minimizing squared error
Matrix formulation
$$y = \mathbf{w}^T\mathbf{x}$$

$$\text{error} = \sum_n (t_n - \mathbf{w}^T\mathbf{x}_n)^2$$

$$\mathbf{w}^* = (\mathbf{X}^T\mathbf{X})^{-1}\,\mathbf{X}^T\mathbf{t}$$

where $\mathbf{w}^*$ is the vector of optimal weights, $\mathbf{t}$ is the vector of target values, $(\mathbf{X}^T\mathbf{X})^{-1}$ is the inverse of the covariance matrix of the input vectors, and $\mathbf{X}^T$ is the transposed design matrix, which has one input vector per column.
Linear Regression - Derivation
• Write the problem in matrix form:
$$\hat{L}(f_{\mathbf{w}}) = \frac{1}{N}\sum_{i=1}^{N}(\mathbf{w}^T\mathbf{x}_i - y_i)^2 = \frac{1}{N}\,\|\mathbf{X}\mathbf{w} - \mathbf{y}\|_2^2$$

• Concatenate the rows $\mathbf{x}_i$ and targets $y_i$:
• Matrix $\mathbf{X} \in \mathbb{R}^{N \times (d+1)}$
• Vector $\mathbf{y} \in \mathbb{R}^{N}$
• Vector $\mathbf{w} \in \mathbb{R}^{d+1}$
Linear Regression - Derivation
• Write the problem in matrix form:
$$\hat{L}(f_{\mathbf{w}}) = \frac{1}{N}\sum_{i=1}^{N}(\mathbf{w}^T\mathbf{x}_i - y_i)^2 = \frac{1}{N}\,\|\mathbf{X}\mathbf{w} - \mathbf{y}\|_2^2$$

• Find the gradient w.r.t. $\mathbf{w}$:

$$\nabla_{\mathbf{w}} \|\mathbf{X}\mathbf{w} - \mathbf{y}\|_2^2 = \nabla_{\mathbf{w}} (\mathbf{X}\mathbf{w} - \mathbf{y})^T (\mathbf{X}\mathbf{w} - \mathbf{y}) = \nabla_{\mathbf{w}} \left( \mathbf{w}^T\mathbf{X}^T\mathbf{X}\mathbf{w} - 2\,\mathbf{w}^T\mathbf{X}^T\mathbf{y} \right) = 2\,\mathbf{X}^T\mathbf{X}\mathbf{w} - 2\,\mathbf{X}^T\mathbf{y}$$

• Set the gradient to zero to get the minimizer:

$$\mathbf{X}^T\mathbf{X}\mathbf{w} = \mathbf{X}^T\mathbf{y} \quad\Longrightarrow\quad \mathbf{w} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$$

• Note: here we assume $(\mathbf{X}^T\mathbf{X})^{-1}$ exists unless otherwise specified.
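A minimal numpy sketch of this closed-form solution on synthetic data (all values hypothetical); solving the normal equations with `np.linalg.solve` is preferred to forming the inverse explicitly:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 100, 3
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d))])  # X in R^{N x (d+1)}, bias column first
w_true = rng.normal(size=d + 1)
y = X @ w_true + 0.1 * rng.normal(size=N)                  # noisy linear targets

w_hat = np.linalg.solve(X.T @ X, X.T @ y)                  # solves X^T X w = X^T y
print(np.max(np.abs(w_hat - w_true)))                      # small recovery error
```

If `y` is replaced by a matrix with several output columns, the same solve returns one weight vector per column, which matches the multiple-outputs slide below.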


Recap: When is minimizing the squared error equivalent
to Maximum Likelihood Learning?
• Minimizing the squared residuals is equivalent to maximizing
the log probability of the correct answer under a Gaussian
centered at the model's guess.

Here $t_n$ is the correct answer and $y_n = y(\mathbf{x}_n, \mathbf{w})$ is the model's estimate of the most probable value:

$$p(t_n \mid y_n) = p(y_n + \text{noise} = t_n \mid \mathbf{x}_n, \mathbf{w}) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(t_n - y_n)^2}{2\sigma^2}}$$

$$-\log p(t_n \mid y_n) = \log\sqrt{2\pi} + \log\sigma + \frac{(t_n - y_n)^2}{2\sigma^2}$$

The first two terms can be ignored if $\sigma$ is fixed; the $2\sigma^2$ in the denominator can be ignored if $\sigma$ is the same for every case.
Multiple outputs
• If there are multiple outputs we can often treat the learning
problem as a set of independent problems, one per output.
• Not true if the output noise is correlated and changes from
case to case.
• Even though they are independent problems we can save work
by only multiplying the input vectors by the inverse covariance
of the input components once. For output k we have:

$$\mathbf{w}_k^* = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{t}_k$$

The factor $(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T$ does not depend on $k$, so it only needs to be computed once.
Least mean squares: An alternative approach for
really big datasets
 +1 
w = w −  En ( )
weights after
seeing training learning vector of derivatives of the
case tau+1 rate squared error w.r.t. the weights
on the training case presented
at time tau.

• This is called "online" learning. It can be more efficient if the dataset is very
redundant and it is simple to implement in hardware.
• It is also called stochastic gradient descent if the training cases are picked
at random.
• Care must be taken with the learning rate to prevent divergent
oscillations, and the rate must decrease at the end to get a good fit.
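A minimal sketch of this online/LMS update in Python; the learning rate, decay schedule, and epoch count are arbitrary choices for illustration:

```python
import numpy as np

def lms(X, y, lr=0.01, n_epochs=50, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(n_epochs):
        for n in rng.permutation(len(y)):   # pick cases at random: SGD
            err = y[n] - w @ X[n]           # residual on this single case
            w += lr * err * X[n]            # w <- w - lr * grad of (1/2)*err^2
        lr *= 0.95                          # decay the rate to settle to a good fit
    return w
```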
Illustrative Example
Illustrative Example

How does the learning rate influence the results?

Pseudo-code of incremental learning
Beyond incremental learning: Overview of gradient descent methods
• Momentum
• Nesterov accelerated gradient
• Adagrad
• Adadelta
• RMSprop
• Adam
• AdaMax
• Nadam
• AMSGrad
Polynomial Fitting
Polynomial Fitting
Polynomial Fitting
Polynomial Fitting

Looks good!
Polynomial Fitting
Overfitting
Polynomial Coefficients

(The fitted coefficients become very large when the model overfits.)
Regularized least squares
$$\tilde{E}(\mathbf{w}) = \frac{1}{2}\sum_{n=1}^{N}\{y(\mathbf{x}_n, \mathbf{w}) - t_n\}^2 + \frac{\lambda}{2}\,\|\mathbf{w}\|^2$$

The penalty on the squared weights is mathematically compatible with the
squared error function, so we get a nice closed form for the optimal weights
with this regularizer:

$$\mathbf{w}^* = (\lambda\mathbf{I} + \mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{t}$$

where $\mathbf{I}$ is the identity matrix.
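A direct transcription of this closed form in numpy (a sketch; in practice the bias weight is often left out of the penalty, which this simple version ignores):

```python
import numpy as np

def ridge(X, t, lam):
    d = X.shape[1]
    # w* = (lam*I + X^T X)^{-1} X^T t, via a linear solve instead of an inverse
    return np.linalg.solve(lam * np.eye(d) + X.T @ X, X.T @ t)
```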
A picture of the effect of the regularizer

• The overall cost function is the sum of two parabolic bowls.
• The sum is also a parabolic bowl.
• The combined minimum lies on the line between the minimum
of the squared error and the origin.
• The regularizer just shrinks the weights.
Regularization
Other regularizers
• We do not need to use the squared error, provided
we are willing to do more computation.
• Other powers of the weights can be used.
The lasso: penalizing the absolute values of the
weights
$$\tilde{E}(\mathbf{w}) = \frac{1}{2}\sum_{n=1}^{N}\{y(\mathbf{x}_n, \mathbf{w}) - t_n\}^2 + \lambda\sum_i |w_i|$$
• Finding the minimum requires quadratic programming, but it is
still unique because the cost function is convex (a bowl plus an
inverted pyramid).
• As lambda is increased, many of the weights go to exactly zero.
• This is great for interpretation, and it is also pretty good for
preventing overfitting.
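A sketch of this zeroing-out behavior, assuming scikit-learn is available (its `Lasso` uses coordinate descent rather than quadratic programming, but minimizes the same kind of L1-penalized objective; the data and penalty values are hypothetical):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
w_true = np.array([3.0, -2.0] + [0.0] * 8)        # only 2 of 10 features matter
y = X @ w_true + 0.1 * rng.normal(size=100)

for alpha in [0.01, 0.1, 1.0]:                    # larger penalty -> more exact zeros
    n_nonzero = np.sum(Lasso(alpha=alpha).fit(X, y).coef_ != 0)
    print(alpha, n_nonzero)
```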
A geometrical view of the lasso compared with
a penalty on the squared weights

Notice that w1 = 0 at the optimum.
Increasing data size

Another option to mitigate the problem of overfitting


Linear Basis Function Models

Recap
Linear Basis Function Models
Linear Basis Function Models
Linear Basis Function Models
Minimizing the absolute error
min over w  | tn − w T
xn |
n
• This minimization involves solving a linear
programming problem.
• It corresponds to maximum likelihood estimation if
the output noise is modeled by a Laplacian instead of
a Gaussian.
$$p(t_n \mid y_n) = a\, e^{-a|t_n - y_n|}, \qquad -\log p(t_n \mid y_n) = a\,|t_n - y_n| + \text{const}$$
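A small sketch, assuming scipy is available: rather than setting up the linear program, we hand the nonsmooth absolute-error objective to a derivative-free optimizer, which is enough for a toy problem (the data below are hypothetical):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
t = X @ np.array([1.0, -1.0, 0.5]) + rng.laplace(scale=0.1, size=50)

# Minimize sum_n |t_n - w^T x_n|; Powell needs no derivatives, so the
# kinks of the absolute value are not a problem.
res = minimize(lambda w: np.sum(np.abs(t - X @ w)), x0=np.zeros(3), method="Powell")
print(res.x)
```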
Cross-Validation for Regression
• Test set method
• Leave-one-out
• K-fold cross validation
Test set method
Test set method
Test set method
Test set method
• Pros
• Very simple
• Can then simply choose the method with the best test-
set score
• Cons
• Wastes data: we get an estimate of the best method to
apply to 30% less data.
• If we don't have much data, our test set might just be
lucky or unlucky ("the test-set estimator of performance
has high variance").
LOOCV (Leave-one-out cross-validation)
LOOCV (Leave-one-out cross-validation)

Report the mean error once you have done all points.
LOOCV example 1
LOOCV example 2
K-fold Cross Validation
K-fold Cross Validation
K-fold Cross Validation
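A generic K-fold sketch in Python; `fit` and `predict` are hypothetical caller-supplied callables standing in for whatever regression method is being scored, and setting k equal to the number of points recovers LOOCV:

```python
import numpy as np

def kfold_mse(X, y, fit, predict, k=5, seed=0):
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    errs = []
    for i in range(k):
        test = folds[i]                                    # held-out fold
        train = np.concatenate([f for j, f in enumerate(folds) if j != i])
        model = fit(X[train], y[train])
        errs.append(np.mean((y[test] - predict(model, X[test])) ** 2))
    return np.mean(errs)                                   # mean error over all folds
```

One would call this once per candidate model or regularization strength and keep the setting with the lowest score.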
Summary
The bias-variance trade-off
(a figment of the frequentists' lack of imagination?)

• Imagine that the training set was drawn at random from a
whole set of training sets.
• The squared loss can be decomposed into a "bias" term and a
"variance" term.
• Bias = systematic error in the model's estimates.
• Variance = noise in the estimates caused by sampling noise in
the training set.
• There is also an additional loss due to the fact that the target
values are noisy.
• We eliminate this extra, irreducible loss from the math by
using the average target values (i.e. the unknown, noise-free
values).
The bias-variance decomposition
Let $y(\mathbf{x}_n; D)$ be the model's estimate for test case $n$ when trained on dataset $D$, and let $\bar{t}_n$ be the average target value for test case $n$. Angle brackets $\langle \cdot \rangle_D$ are physics notation for expectation over $D$:

$$\left\langle \left( y(\mathbf{x}_n; D) - \bar{t}_n \right)^2 \right\rangle_D = \left( \left\langle y(\mathbf{x}_n; D) \right\rangle_D - \bar{t}_n \right)^2 + \left\langle \left( y(\mathbf{x}_n; D) - \left\langle y(\mathbf{x}_n; D) \right\rangle_D \right)^2 \right\rangle_D$$

The "bias" term is the squared error of the average, over all training datasets, of the estimates. The "variance" term is the variance, over all training datasets, of the model's estimate.

See PRML (Bishop) page 149 for a derivation using a different notation.
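The decomposition can be checked by Monte Carlo: draw many training datasets from one generator, refit the same model on each, and split the squared error at a test point. The sinusoidal generator, noise level, and cubic fit below are all illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
x_test = 0.5
t_bar = np.sin(2 * np.pi * x_test)                # average (noise-free) target

preds = []
for _ in range(500):                              # 500 training datasets D
    x = rng.uniform(size=25)
    t = np.sin(2 * np.pi * x) + 0.3 * rng.normal(size=25)
    w = np.polyfit(x, t, deg=3)                   # refit a cubic on this D
    preds.append(np.polyval(w, x_test))
preds = np.array(preds)

bias2 = (preds.mean() - t_bar) ** 2               # squared error of the average estimate
var = preds.var()                                 # variance of the estimate over D
print(bias2 + var, np.mean((preds - t_bar) ** 2)) # the two sides agree
```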
How the regularization parameter affects the bias and
variance terms
(Figure: small regularization gives high variance and low bias; large regularization gives low variance and high bias.)


An example of the bias-variance trade-off
Beating the bias-variance trade-off
• We can reduce the variance term by averaging lots of models
trained on different datasets.
• This seems silly. If we had lots of different datasets it would
be better to combine them into one big training set.
• With more training data there will be much less variance.
• Weird idea: We can create different datasets by bootstrap
sampling of our single training dataset.
• This is called “bagging” and it works surprisingly well.
• But if we have enough computation it's better to do the right
Bayesian thing:
• Combine the predictions of many models using the
posterior probability of each parameter vector as the
combination weight.
The Bayesian approach
• Consider a very simple linear model that only has two
parameters:

$$y(x, \mathbf{w}) = w_0 + w_1 x$$
• It is possible to display the full posterior distribution over the
two-dimensional parameter space.

• The likelihood term is a Gaussian, so if we use a Gaussian prior
the posterior will be Gaussian:
• This is a conjugate prior. It means that the prior is just like
having already observed some data.
$$p(\mathbf{t} \mid \mathbf{X}, \mathbf{w}, \beta) = \prod_{n=1}^{N} \mathcal{N}(t_n \mid \mathbf{w}^T\mathbf{x}_n, \beta^{-1}) \qquad \text{(likelihood; } \beta^{-1} \text{ is the variance of the Gaussian output noise)}$$

$$p(\mathbf{w} \mid \alpha) = \mathcal{N}(\mathbf{w} \mid \mathbf{0}, \alpha^{-1}\mathbf{I}) \qquad \text{(conjugate prior; } \alpha \text{ is the inverse variance of the prior)}$$

$$-\ln p(\mathbf{w} \mid \mathbf{t}) = \frac{\beta}{2}\sum_{n=1}^{N}(t_n - \mathbf{w}^T\mathbf{x}_n)^2 + \frac{\alpha}{2}\,\mathbf{w}^T\mathbf{w} + \text{const}$$

The Bayesian interpretation of the regularization parameter: $\lambda = \alpha / \beta$.
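These formulas translate directly into code. A sketch of the Gaussian posterior for this model, following PRML eqs. 3.53-3.54, where the design matrix X stacks the input vectors:

```python
import numpy as np

def posterior(X, t, alpha, beta):
    # S_N^{-1} = alpha*I + beta*X^T X ;  m_N = beta * S_N X^T t
    d = X.shape[1]
    S_N = np.linalg.inv(alpha * np.eye(d) + beta * X.T @ X)
    m_N = beta * S_N @ X.T @ t
    return m_N, S_N
```

Note that the posterior mean reduces to the regularized least-squares solution with λ = α/β, matching the Bayesian reading of the regularization parameter above.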

Using the posterior distribution
• If we can afford the computation, we ought to average the
predictions of all parameter settings using the posterior
distribution to weight the predictions:
$$p(t_{\text{test}} \mid \mathbf{x}_{\text{test}}, \alpha, \beta, D) = \int p(t_{\text{test}} \mid \mathbf{x}_{\text{test}}, \beta, \mathbf{w})\; p(\mathbf{w} \mid \alpha, \beta, D)\; d\mathbf{w}$$

where $D$ is the training data, $\beta$ is the precision of the output noise, and $\alpha$ is the precision of the prior.
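Under the Gaussian posterior this integral is available in closed form (PRML eqs. 3.58-3.59); a sketch using the `posterior` helper above:

```python
import numpy as np

def predictive(x, m_N, S_N, beta):
    # Gaussian predictive: mean m_N^T x, variance 1/beta + x^T S_N x
    return m_N @ x, 1.0 / beta + x @ S_N @ x

# Sampling whole weight vectors from the posterior (used below to
# visualize the covariance of the predictions):
# w_samples = np.random.default_rng(0).multivariate_normal(m_N, S_N, size=5)
```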
The predictive distribution for noisy sinusoidal data modeled by a linear combination of nine radial basis functions. (PRML, Bishop: Section 3.3.2)
A way to see the covariance of the predictions
for different values of x

We sample models at random from the posterior and show the mean of each model's predictions.
Sequential Bayesian Regression
Sequential Bayesian Regression
Sequential Bayesian Regression
Sequential Bayesian Regression
Sequential Bayesian Regression
Bayesian model comparison
• We usually need to decide between many different models:
• Different numbers of basis functions
• Different types of basis functions
• Different strengths of regularizers

$$p(\mathbf{t} \mid \mathbf{X}, \mathbf{w}, \beta) = \prod_{n=1}^{N} \mathcal{N}(t_n \mid \mathbf{w}^T\mathbf{x}_n, \beta^{-1})$$

$$p(\mathbf{w} \mid \alpha) = \mathcal{N}(\mathbf{w} \mid \mathbf{0}, \alpha^{-1}\mathbf{I})$$

$$-\ln p(\mathbf{w} \mid \mathbf{t}) = \frac{\beta}{2}\sum_{n=1}^{N}(t_n - \mathbf{w}^T\mathbf{x}_n)^2 + \frac{\alpha}{2}\,\mathbf{w}^T\mathbf{w} + \text{const}$$
Bayesian model comparison
• We usually need to decide between many different models:
• Different numbers of basis functions
• Different types of basis functions
• Different strengths of regularizers

• The frequentist way to decide between models is to hold back a validation
set and pick the model that does best on the validation data.
• We have introduced three different ways of doing this (test set,
LOOCV, K-fold).
• The Bayesian alternative is to use all of the data for training each model
and to use the "evidence" to pick the best model (or to average over
models).
• The evidence is the marginal likelihood with the parameters integrated out.
Definition of the evidence
• The evidence is the normalizing term in the
expression for the posterior probability of a weight
vector given a dataset and a model class

$$p(\mathbf{w} \mid D, M_i) = \frac{p(\mathbf{w} \mid M_i)\; p(D \mid \mathbf{w}, M_i)}{p(D \mid M_i)}$$

$$p(D \mid M_i) = \int p(D \mid \mathbf{w}, M_i)\; p(\mathbf{w} \mid M_i)\; d\mathbf{w}$$
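For the Gaussian linear model above, this integral has a closed form; a sketch of the log evidence following PRML eq. 3.86 (useful for comparing basis sets or regularization strengths without a validation set):

```python
import numpy as np

def log_evidence(X, t, alpha, beta):
    N, M = X.shape
    A = alpha * np.eye(M) + beta * X.T @ X        # posterior precision
    m_N = beta * np.linalg.solve(A, X.T @ t)      # posterior mean
    E_mN = beta / 2 * np.sum((t - X @ m_N) ** 2) + alpha / 2 * m_N @ m_N
    return (M / 2 * np.log(alpha) + N / 2 * np.log(beta) - E_mN
            - np.linalg.slogdet(A)[1] / 2 - N / 2 * np.log(2 * np.pi))
```

Evaluating this for each candidate model class M_i gives p(D | M_i), which combines with the prior over models on the next slide.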
Using the evidence
• Now we use the evidence for a model class in exactly the same
way as we use the likelihood term for a particular setting of the
parameters
• The evidence gives us a posterior distribution over model
classes, provided we have a prior.

$$p(M_i \mid D) \propto p(M_i)\; p(D \mid M_i)$$
• For simplicity in making predictions we often just pick the
model class with the highest posterior probability. This is
called model selection.
• But we should still average over the parameter vectors for that
model class using the posterior distribution.
Empirical Bayes
• Empirical Bayes (also called the evidence approximation)
means integrating out the parameters but maximizing over
the hyperparameters.
• It's more feasible and often works well.
Empirical Bayes

$$p(t \mid \mathbf{x}, D) = \iiint p(t \mid \mathbf{x}, \beta, \mathbf{w})\; p(\mathbf{w} \mid \alpha, \beta, D)\; p(\alpha, \beta \mid D)\; d\mathbf{w}\, d\alpha\, d\beta$$

where $t$ and $\mathbf{x}$ are the target and input on the test case, $\beta$ is the precision of the output noise, $\alpha$ is the precision of the prior, and $D$ is the training data.

• The equation above is the right predictive distribution (assuming we do not
have hyperpriors for alpha and beta).
• The equation below is a more tractable approximation that works well if
the posterior distributions for alpha and beta are highly peaked (so the
distributions are well approximated by their most likely values
$\hat{\alpha}, \hat{\beta}$, the point estimates that maximize the evidence):

$$p(t \mid \mathbf{x}, D) \approx p(t \mid \mathbf{x}, D, \hat{\alpha}, \hat{\beta}) = \int p(t \mid \mathbf{x}, \hat{\beta}, \mathbf{w})\; p(\mathbf{w} \mid \hat{\alpha}, \hat{\beta}, D)\; d\mathbf{w}$$


Additional Reading
• P4-11, P138-147, P152-156 of the PRML Book
• For advanced machine learning contents (if you are
interested in exploring more)
• Take home exercise 1: PRML 3.4
Take-home exercise 2

X: 0, 0.2, 0.7, 1.5, 2.3, 3.0, 3.2
y: 0, 0.2, 0.7, 1.5, 2.3, 3.0, 3.2
Take-home exercise 2 (Cont.)
Take-home exercise 3
Solution: Intro Ex1
Solution: Intro Ex1 (Cont.)
Solution: Intro Ex2
Solution: Intro Ex2 (Cont.)
