CPSC540: Regularization, Nonlinear Prediction and Generalization


CPSC540

Regularization,
nonlinear prediction
and generalization

Nando de Freitas
January, 2013
University of British Columbia
Outline of the lecture
This lecture will teach you how to fit nonlinear functions by using
basis functions and how to control model complexity. The goal is for
you to:

• Learn how to derive ridge regression.
• Understand the trade-off between fitting the data and regularizing it.
• Learn polynomial regression.
• Understand that, if the basis functions are given, the problem of
  learning the parameters is still linear.
• Learn cross-validation.
• Understand the effects of the amount of data and the number of
  basis functions on generalization.
Regularization
Derivation
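The derivation on this slide is shown graphically; as a minimal sketch in LaTeX, assuming the same cost J(θ) that appears later in the lecture, setting the gradient to zero gives the familiar ridge estimate:

\begin{aligned}
J(\theta) &= (y - \Phi\theta)^\top (y - \Phi\theta) + \delta^2\,\theta^\top\theta \\
\nabla_\theta J(\theta) &= -2\,\Phi^\top(y - \Phi\theta) + 2\,\delta^2\theta = 0 \\
\hat{\theta}_{\mathrm{ridge}} &= (\Phi^\top\Phi + \delta^2 I)^{-1}\,\Phi^\top y
\end{aligned}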
Ridge regression as constrained optimization
Regularization paths
As δ increases, t(δ) decreases and each θi goes to zero.

[Figure: regularization paths, i.e. the coefficient profiles θ1, θ2, . . . , θ8 plotted against t(δ); from the Hastie, Tibshirani & Friedman book]

Going nonlinear via basis functions
We introduce basis functions φ(·) to deal with nonlinearity:

y(x) = φ(x)θ + ε

For example, φ(x) = [1, x, x²].

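Not from the slides, but as an illustration of this idea: a short numpy sketch that expands a scalar input into φ(x) = [1, x, x²] and fits θ by ordinary least squares. The function name and the toy data are my own.

import numpy as np

def poly_basis(x, degree=2):
    # Map a 1-D input array to the basis [1, x, x^2, ..., x^degree].
    return np.vander(x, N=degree + 1, increasing=True)

# Toy data from a noisy quadratic (illustrative values only).
rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 30)
y = 1.0 + 2.0 * x - 3.0 * x**2 + 0.1 * rng.standard_normal(x.shape)

Phi = poly_basis(x, degree=2)                      # 30 x 3 design matrix
theta, *_ = np.linalg.lstsq(Phi, y, rcond=None)    # still linear in theta
print(theta)                                       # roughly [1, 2, -3]

Note that the nonlinearity lives entirely in φ; the parameters θ still enter linearly, so ordinary least squares (or ridge) applies unchanged.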

Going nonlinear via basis functions
y(x) = φ(x)θ + ε

φ(x) = [1, x1, x2]        φ(x) = [1, x1, x2, x1², x2²]


Example: Ridge regression with a polynomial of degree 14

y(xi) = 1 θ0 + xi θ1 + xi² θ2 + . . . + xi¹³ θ13 + xi¹⁴ θ14

The i-th row of the design matrix is Φi = [1  xi  xi²  . . .  xi¹³  xi¹⁴], and the cost is

J(θ) = (y − Φθ)ᵀ(y − Φθ) + δ² θᵀθ

[Figure: the fitted curve y plotted against x for small δ, medium δ, and large δ]
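A sketch of this example in numpy (my own illustration, not the lecture's code or data): fit the degree-14 polynomial with the closed-form ridge solution for several values of δ and watch the coefficient magnitudes shrink as δ grows.

import numpy as np

def ridge_fit(Phi, y, delta):
    # Closed-form ridge estimate: (Phi^T Phi + delta^2 I)^{-1} Phi^T y.
    d = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + delta**2 * np.eye(d), Phi.T @ y)

rng = np.random.default_rng(1)
x = np.linspace(0, 1, 21)
y = np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(x.shape)  # illustrative data

Phi = np.vander(x, N=15, increasing=True)          # columns 1, x, ..., x^14

for delta in (0.01, 1.0, 100.0):                   # small, medium, large
    theta = ridge_fit(Phi, y, delta)
    print(delta, float(np.abs(theta).max()))       # coefficients shrink as delta grows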
Kernel regression and RBFs
We can use kernels or radial basis functions (RBFs) as features:
φ(x) = [κ(x, µ1, λ), . . . , κ(x, µd, λ)],   e.g.   κ(x, µi, λ) = exp(−‖x − µi‖²/λ)

y(xi) = φ(xi)θ = 1 θ0 + κ(xi, µ1, λ) θ1 + . . . + κ(xi, µd, λ) θd


We can choose the locations µ of the basis functions to be the inputs,
that is, µi = xi. These basis functions are then known as kernels.
The choice of the width λ is tricky, as illustrated below.
[Figure: the kernels and the resulting fits for too small λ, the right λ, and too large λ]
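As a sketch (my own code and data, with hypothetical names such as rbf_features), build the RBF feature map with centres at the training inputs and fit θ by ridge, comparing a few widths λ:

import numpy as np

def rbf_features(x, centres, lam):
    # kappa(x, mu_i, lambda) = exp(-||x - mu_i||^2 / lambda), one column per centre.
    diffs = x[:, None] - centres[None, :]
    return np.exp(-(diffs**2) / lam)

rng = np.random.default_rng(2)
x = np.linspace(0, 1, 40)
y = np.cos(4 * x) + 0.1 * rng.standard_normal(x.shape)     # illustrative data

mu = x.copy()                 # centres at the training inputs: mu_i = x_i
delta = 0.1                   # fixed ridge regularizer

for lam in (1e-4, 1e-2, 1.0): # too small, moderate, too large width
    Phi = np.column_stack([np.ones_like(x), rbf_features(x, mu, lam)])
    A = Phi.T @ Phi + delta**2 * np.eye(Phi.shape[1])
    theta = np.linalg.solve(A, Phi.T @ y)
    print(lam, float(np.mean((Phi @ theta - y) ** 2)))      # training error for each width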
The big question is: how do we choose the regularization coefficient,
the width of the kernels, or the polynomial order?
One Solution: cross-validation
K-fold cross-validation

The idea is simple: we split the training data into K folds; then, for each
fold k ∈ {1, . . . , K}, we train on all the folds but the k’th, and test on the
k’th, in a round-robin fashion.

It is common to use K = 5; this is called 5-fold CV.

If we set K = N, then we get a method called leave-one-out
cross-validation, or LOOCV, since in fold i we train on all the data
cases except for i, and then test on i.
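A minimal numpy sketch of K-fold cross-validation for choosing δ in ridge regression (my own illustration; cv_error, the toy data, and the candidate grid of δ values are assumptions, not from the slides):

import numpy as np

def cv_error(Phi, y, delta, K=5):
    # Average held-out squared error of ridge regression over K folds.
    N = len(y)
    folds = np.array_split(np.random.default_rng(0).permutation(N), K)
    errors = []
    for k in range(K):
        test = folds[k]
        train = np.setdiff1d(np.arange(N), test)
        A = Phi[train].T @ Phi[train] + delta**2 * np.eye(Phi.shape[1])
        theta = np.linalg.solve(A, Phi[train].T @ y[train])
        errors.append(np.mean((Phi[test] @ theta - y[test]) ** 2))
    return float(np.mean(errors))

# Illustrative data and a degree-14 polynomial basis, as in the running example.
rng = np.random.default_rng(3)
x = np.linspace(0, 1, 50)
y = np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(x.shape)
Phi = np.vander(x, N=15, increasing=True)

deltas = (1e-3, 1e-2, 1e-1, 1.0, 10.0)
scores = [cv_error(Phi, y, d) for d in deltas]
print(deltas[int(np.argmin(scores))])   # delta with the lowest cross-validation error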
Example: Ridge regression with polynomial of degree 14
Effect of data when we have the right model
yi = θ0 + xi θ1 + xi² θ2 + N(0, σ²)
Effect of data when the model is too simple
yi = θ0 + xi θ1 + xi² θ2 + N(0, σ²)
Effect of data when the model is very complex
yi = θ0 + xi θ1 + xi² θ2 + N(0, σ²)
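These three slides are shown as plots; to reproduce their spirit, here is a sketch under my own choice of parameters: draw N points from the true quadratic model above and fit polynomials of degree 1 (too simple), 2 (right), and 14 (very complex) for increasing N, comparing held-out error.

import numpy as np

rng = np.random.default_rng(4)
theta_true = np.array([1.0, -2.0, 3.0])      # theta_0, theta_1, theta_2 (made-up values)
sigma = 0.5

def sample(N):
    # Draw N points from the true quadratic model with Gaussian noise.
    x = rng.uniform(-1, 1, N)
    y = np.vander(x, 3, increasing=True) @ theta_true + sigma * rng.standard_normal(N)
    return x, y

x_test, y_test = sample(1000)
for N in (15, 100, 1000):
    x, y = sample(N)
    for degree in (1, 2, 14):                # too simple, right, very complex
        Phi = np.vander(x, degree + 1, increasing=True)
        theta, *_ = np.linalg.lstsq(Phi, y, rcond=None)
        pred = np.vander(x_test, degree + 1, increasing=True) @ theta
        print(N, degree, float(np.mean((pred - y_test) ** 2)))   # held-out error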
Confidence in the predictions
Next lecture
In the next lecture, we introduce Bayesian inference, and show how it
can provide us with an alternative way of learning a model from data.
