

Chapter 3

Nonlinear Regression I: Basis Functions and Penalized Regression

In this chapter we learn how to extend the linear regression model to model regression
functions that depend on the inputs in a nonlinear manner. The key idea is to derive a
set of transformed inputs that is very rich and flexible, and then apply ridge regression to
control estimation of the coefficients. For models with a small number of inputs upon which
the regression function depends in an additive manner, splines do exceptionally well at this
task. In later chapters we will see more complicated transformations and models, but the
basic idea of transforming the inputs and then controlling estimation via regularization is
fundamental.


3.1 Linear basis function models


Consider the linear regression model:

$$\mathbb{E}[Y_i \mid X_i = x] = \sum_{j=1}^{p} x_j \beta_j, \qquad x \in \mathbb{R}^p,\ \beta \in \mathbb{R}^p.$$

A linear basis function model models $f(x) = \mathbb{E}[Y \mid X = x]$ as a linear combination of known, chosen
basis functions. For simplicity, let $x \in \mathbb{R}$ (univariate):

$$\mathbb{E}[Y \mid X = x] = \sum_{j=1}^{m} h_j(x)\,\beta_j, \qquad \beta = (\beta_1, \dots, \beta_m)^\top.$$

The $h_j$ can be anything, and there can be any number of them. Examples:

$h_j(x) = x$: simple linear regression (cf. ESL, Ch. 5, p. 140).

$h_j(x) = x^2$: a quadratic term.

$h_1(x) = 1,\ h_2(x) = x,\ \dots,\ h_m(x) = x^{m-1}$: polynomial regression.

For intervals $I_j = [a_j, a_{j+1})$, let $h_j(x) = 1(x \in I_j)$: piecewise-constant (step-function) regression.
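To make the estimation step concrete, here is a minimal Python sketch (the data, basis choice, and names are invented for illustration, not taken from the chapter): once the transformed inputs $h_j(x_i)$ are assembled into a design matrix, the coefficients are estimated exactly as in ordinary linear regression.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, 200)
y = np.sin(3 * x) + rng.normal(0, 0.1, 200)   # made-up data for illustration

# Design matrix whose columns are h_j(x); here the polynomial basis 1, x, x^2, x^3 (m = 4).
H = np.column_stack([x**j for j in range(4)])

# Ordinary least squares on the transformed inputs.
beta_hat, *_ = np.linalg.lstsq(H, y, rcond=None)
fitted = H @ beta_hat
```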

We see that a number of clever procedures for predictive modelling can be written as
linear basis function models. This makes studying their estimation of interest.

3.1.1 Polynomial regression


A problem that must be overcome is how to choose the hj , both their form and number.
Consider the data in Figure 3.1 (a). I used a polynomial basis with m = 2, 4, 11 coefficients
(i.e. a linear, cubic, and degree-10 polynomial) to produce the fits in Figure 3.1 (b). The
cubic looks the best. Why doesn’t the m = 11 fit as well as the cubic though? It contains
the cubic as a special case, so as a statistical model it is exactly as “correct” as the cubic.
But it fits terribly.
There are two immediate explanations for this behaviour:

1. The polynomial basis functions are global: every hj is nonzero for every (nonzero) xi .
This means that a change in any xi will affect the shape of the entire fitted curve.

2. The polynomial basis functions are very flexible when the order is high: when m = n
the polynomial can actually interpolate the data, leading to extremely high variance
in the shape of the fitted curve.

There are two solutions which, when applied together, create an incredibly useful tool for
fitting nonlinear regression functions.

3.1.2 Piecewise polynomial regression


The first solution is to remove the global property of the basis by using piecewise polynomials:

Define breakpoints $\tau_0 < \tau_1 < \dots < \tau_d$. A piecewise polynomial is a function $f$ that is a polynomial
on each interval $[\tau_{j-1}, \tau_j)$.

Degree 0: $f(x) = \sum_j h_j(x)\,\beta_j$ with $h_j(x) = 1(\tau_{j-1} \le x < \tau_j)$, so $f$ is constant on each interval.

Degree 1: on each interval $[\tau_{j-1}, \tau_j)$ we have $f(x) = a_j + b_j x$, i.e. the basis contains both
$1(\tau_{j-1} \le x < \tau_j)$ and $x\,1(\tau_{j-1} \le x < \tau_j)$.
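A quick Python sketch of the degree-0 case (breakpoints and data invented for the example): each column of the design matrix is an interval indicator, so least squares simply returns the mean of $y$ within each interval.

```python
import numpy as np

def piecewise_constant_basis(x, breakpoints):
    """h_j(x) = 1(tau_{j-1} <= x < tau_j) for consecutive breakpoints tau_0 < ... < tau_d."""
    lo, hi = breakpoints[:-1], breakpoints[1:]
    return ((x[:, None] >= lo) & (x[:, None] < hi)).astype(float)

rng = np.random.default_rng(2)
x = rng.uniform(-1, 1, 200)
y = np.sin(3 * x) + rng.normal(0, 0.1, 200)

tau = np.linspace(-1, 1, 6)                        # 5 intervals on [-1, 1)
H = piecewise_constant_basis(x, tau)
beta_hat, *_ = np.linalg.lstsq(H, y, rcond=None)   # beta_j = mean of y on interval j
```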

[Figure 3.1: Polynomial, piecewise polynomial, and spline regression for a simulated data set
with n = 100. The true regression function is cubic. (a) Data and true function f. (b) Degree 1
(red), 3 (purple), and 10 (orange) polynomial regression. (c) Piecewise polynomial regression with
5 knots. (d) B-spline regression with 5 knots.]

This solves the global problem: changes to the data with $x \in [\tau_1, \tau_2)$ simply won’t affect the fitted
function in the range $[\tau_2, \tau_3)$ (and so on). Figure 3.1 (c) shows a piecewise polynomial
fit. It not only looks completely unrealistic, but despite being more flexible than a single
cubic polynomial, it doesn’t fit the data as well! The model is too flexible.
We don’t really believe that the regression function should change in a discontinuous
manner at a small number of essentially arbitrary points. We want to use piecewise polyno-
mials for their flexibility and local properties, but retain the continuity properties of a global
polynomial.
A basis that satisfies these requirements is the truncated power function basis:
$$h_j(x) = x^{j-1}, \quad j = 1, \dots, p, \qquad h_{p+\ell}(x) = (x - \tau_\ell)_+^{\,p-1}, \quad \ell = 1, \dots, k,$$

where $(u)_+ = \max(u, 0)$ and $\tau_1 < \dots < \tau_k$ are the knots (cf. ESL, p. 144).

This basis satisfies the following properties:

1. The number of parameters, $m$, is equal to the order of the polynomial basis, $p$, plus
the number of knots, $k$; $m = p + k$.

2. $h_{j+p}$ is a $p$th-order polynomial on $(\tau_{j-1}, \tau_j)$.

3. $f$ has $p - 2$ continuous derivatives at $x = \tau_j$ for any $j = 1, \dots, k$.

The number of parameters can also be computed by subtracting the number of continuity
constraints from the number of parameters in the unconstrained piecewise polynomial basis.
With $k$ knots there are $k + 1$ pieces, each an order-$p$ (degree $p-1$) polynomial:

piecewise polynomial: $p(k+1) = pk + p$ parameters;

continuity constraints: $(p-1)k$, i.e. continuity of $f, f', \dots, f^{(p-2)}$ at each of the $k$ knots;

remaining: $pk + p - (p-1)k = p + k = m$ parameters, as above.
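As a Python sketch (the function name and defaults are mine), the truncated power basis can be constructed directly; the columns are the $p$ monomials followed by one truncated power term per knot, giving the $m = p + k$ parameters counted above.

```python
import numpy as np

def truncated_power_basis(x, knots, p=4):
    """Truncated power basis of order p (degree p - 1) with the given interior knots:
    columns 1, x, ..., x^{p-1}, then (x - tau_l)_+^{p-1} for each knot tau_l."""
    x = np.asarray(x, dtype=float)
    monomials = [x**j for j in range(p)]                        # x^0, ..., x^{p-1}
    truncated = [np.maximum(x - t, 0.0)**(p - 1) for t in knots]
    return np.column_stack(monomials + truncated)               # n x (p + k) matrix
```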



[Figure 3.2 (panel "Points to interpolate"; y-axis labelled f(x) = sin(2π/x)): Polynomial (black) and
cubic B-spline (purple) interpolants of a set of 10 points. The polynomial is very “high-energy”,
while the spline is “calmer”.]

The monomial terms in the basis still pose a problem, and this basis will experience a
similar behaviour to the polynomial regression (Figure 3.1 (b)) for large order.
We know what properties we want our basis functions to have: local support, polynomial
within each local neighbourhood, and a number of continuous derivatives at the boundaries
of each local neighbourhood. These constraints define a function space, the space of all
piecewise polynomials on a given knot sequence satisfying these continuity conditions. Rather
than exploring this space manually by guessing its elements and hoping they have desirable
properties, if we could construct a basis for this space, then this would give us our basis for
nonlinear regression.

3.2 Splines
A spline is a thin piece of flexible material (e.g. wood) that is held in position at a number
of points. Between the points, physics predicts that the spline will take up the position of
least energy: it will be as “smooth” as possible.
A spline interpolant is a numerical analytic tool which smoothly interpolates a set of
points using piecewise polynomials with knots at each point, constrained to be some number
of times continuously differentiable at each point. See Figure 3.2. Shown are a Lagrange
polynomial interpolant and a spline interpolant. They both interpolate the points shown,
but the polynomial varies wildly between the points, while the spline takes up a sort of path
of least resistance. For the physical spline this is a consequence of energy minimization; we will
prove the mathematical analogue as a theorem in Section 3.2.2. Splines will be our go-to tool for
nonlinear regression via linear combinations of basis
functions.

3.2.1 B-splines
The “B” is for basis!
B-spline basis functions of order $p \in \mathbb{N}$ are piecewise polynomials of degree $p - 1$, constrained to
be some number of times continuously differentiable at a set of knots. Let $\tau_1 \le \tau_2 \le \cdots$ be a
non-decreasing knot sequence and write $b_{j,p} : \mathbb{R} \to \mathbb{R}$ for the $j$th B-spline basis function of
order $p$. Each $b_{j,p}$ is nonzero only on $[\tau_j, \tau_{j+p})$, and at a knot of multiplicity $m$ it is $p - 1 - m$
times continuously differentiable (so $p - 2$ times at a simple knot).

In fact, they span the space of all such functions, so that a spline function

$$f(x) = \sum_j b_{j,p}(x)\,\beta_j, \qquad \beta_j \text{ arbitrary},$$

(a linear combination of B-spline basis functions with coefficients $\beta_j$)
can represent any piecewise polynomial with such continuity constraints. The canonical
reference on the mathematical and computational properties of B-splines is de Boor (1978).
For the numerically-inclined, we note that B-splines can be constructed in
an abstract manner that is very useful for theoretical analysis by forming divided differences
of the truncated power basis above. When computation is of concern—as it is here—it is
more useful (and more common) to construct B-splines recursively.

Note: with $k$ interior knots and order-$p$ polynomials there are $k + 2p$ knots in total: $k$ interior
knots plus $p$ repeated at each boundary. Each B-spline basis function is nonzero on an interval
spanned by $p + 1$ consecutive knots, so the spline function has $k + p$ basis functions in total; the
largest one starts at the last interior knot.

The recursive definition of B-splines is as follows. Start with the indicator functions:

$$b_{j,1}(x) = \begin{cases} 1 & \tau_j \le x < \tau_{j+1}, \\ 0 & \text{otherwise.} \end{cases}$$

We call $b_{j,1}$ the $j$th B-spline of order 1 (alternatively, degree 0). Now, let:

$$b_{j,2}(x) = \frac{x - \tau_j}{\tau_{j+1} - \tau_j}\, b_{j,1}(x) + \frac{\tau_{j+2} - x}{\tau_{j+2} - \tau_{j+1}}\, b_{j+1,1}(x),$$

the piecewise-linear “hat” function supported on $[\tau_j, \tau_{j+2})$.

Finally, for any $p \in \mathbb{N}$ with $p > 1$, the $j$th B-spline of order $p$ is:

$$b_{j,p}(x) = \frac{x - \tau_j}{\tau_{j+p-1} - \tau_j}\, b_{j,p-1}(x) + \frac{\tau_{j+p} - x}{\tau_{j+p} - \tau_{j+1}}\, b_{j+1,p-1}(x).$$

The most common choice of p = 4 gives the cubic B-spline shown in purple in Figure 3.2.
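The recursion translates directly into code. Below is a minimal, unoptimized Python sketch (the function name and 0-based indexing are my own; in practice a library implementation, e.g. SciPy's, would be used).

```python
import numpy as np

def bspline_basis(j, p, knots, x):
    """Evaluate the j-th (0-based) B-spline basis function of order p (degree p - 1)
    at the points x, via the Cox-de Boor recursion; `knots` is non-decreasing."""
    if p == 1:                                    # order 1: indicator of [tau_j, tau_{j+1})
        return np.where((knots[j] <= x) & (x < knots[j + 1]), 1.0, 0.0)
    # Repeated knots give zero-length denominators; the convention is 0/0 = 0.
    left_den = knots[j + p - 1] - knots[j]
    right_den = knots[j + p] - knots[j + 1]
    left = 0.0 if left_den == 0 else (x - knots[j]) / left_den * bspline_basis(j, p - 1, knots, x)
    right = 0.0 if right_den == 0 else (knots[j + p] - x) / right_den * bspline_basis(j + 1, p - 1, knots, x)
    return left + right
```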
B-splines satisfy the following useful properties:
1. Local support: $b_{j,p}(x) = 0$ if $x \notin [\tau_j, \tau_{j+p})$.

2. Positivity: $b_{j,p}(x) > 0$ for all $x \in (\tau_j, \tau_{j+p})$.

3. Self-normalization: $\sum_j b_{j,p}(x) = 1$ for all $x$ between the boundary knots.
This last property is not obvious by simple inspection, and provides a useful sort of scale-
invariance when using B-splines in regression modelling.
A spline function is a linear combination of B-spline basis functions:

$$f(x) = \sum_{j=1}^{k+p} b_{j,p}(x)\,\beta_j,$$

where the coefficients $\beta_1, \dots, \beta_{k+p}$ determine the shape of the curve.
Notation for the knots varies considerably across different sources. We use the following
conventions:

Let $a < \tau_1 < \dots < \tau_k < b$, where $a$ and $b$ are boundary knots each repeated $p$ times, for a
total of $k + 2p$ knots; the number of interior knots, $k$, is the important number.

Finally, a natural cubic spline has the additional constraint that the spline function be
linear outside the boundary knots. This is usually enforced explicitly in implementations.

3.2.2 Roughness penalties


Splines were motivated empirically in Figure 3.2 as being a sort of “calm” interpolant: they
share with the polynomial interpolant the property that they interpolate the given set of points,
but between the points they seem to vary less than the polynomial. It turns out that this is
a fundamental property of splines, and that in fact splines are characterized entirely by this
property.
Consider a measure of the energy or roughness of a twice-continuously differentiable
function $f : \mathbb{R} \to \mathbb{R}$:

$$P(f) = \|f''\|^2 = \int f''(x)^2\,dx.$$

Now consider the following variational (optimization) problem:

$$\hat f = \operatorname*{argmin}_f \sum_{i=1}^n \big(Y_i - f(x_i)\big)^2 + \lambda \int f''(x)^2\,dx,$$

for a given set of points $(x_i, Y_i)$, $i = 1, \dots, n$.

An incredible fact is that this problem not only admits a unique solution, but that
solution is finite-dimensional. We prove the following “fundamental theorem” for splines:

Suppose $n \ge 2$ and $a < x_1 < \dots < x_n < b$. Let $g(x)$ be the unique natural cubic spline interpolant
to $(x_1, Y_1), \dots, (x_n, Y_n)$, and let $\tilde f$ be any twice continuously differentiable function that also
interpolates, i.e. $\tilde f(x_i) = Y_i$, $i = 1, \dots, n$. We want to show that

$$\int_a^b g''(x)^2\,dx \le \int_a^b \tilde f''(x)^2\,dx.$$

Let $h = \tilde f - g$, so that $h(x_i) = 0$ for every $i$. Then

$$\int_a^b \tilde f''(x)^2\,dx = \int_a^b g''(x)^2\,dx + 2\int_a^b g''(x)h''(x)\,dx + \int_a^b h''(x)^2\,dx,$$

and since $\int_a^b h''(x)^2\,dx \ge 0$, it suffices to show that the middle term is zero. Integrating by parts,

$$\int_a^b g''(x)h''(x)\,dx = \big[g''(x)h'(x)\big]_a^b - \int_a^b g'''(x)h'(x)\,dx.$$

But $g$ is linear for $x < x_1$ and $x > x_n$, so $g''(a) = g''(b) = 0$ and the boundary term vanishes.
For $x \in (a, b)$ with $x \ne x_i$ for any $i$, $g(x)$ is a cubic, so its third derivative is constant between
knots: define $c_j = g'''(x)$ for $x \in (x_j, x_{j+1})$. Then

$$\int_a^b g'''(x)h'(x)\,dx = \sum_j c_j \int_{x_j}^{x_{j+1}} h'(x)\,dx = \sum_j c_j\big(h(x_{j+1}) - h(x_j)\big) = 0,$$

because $h(x_j) = \tilde f(x_j) - g(x_j) = Y_j - Y_j = 0$ for every $j$ (both functions interpolate the $(x_i, Y_i)$).
Hence $\int_a^b g''(x)h''(x)\,dx = 0$ and

$$\int_a^b g''(x)^2\,dx \le \int_a^b \tilde f''(x)^2\,dx.$$

Now suppose, by way of contradiction, that the minimizer $\hat f$ of the penalized criterion is not a
spline. Let $g$ be the cubic spline that satisfies $g(x_i) = \hat f(x_i)$, $i = 1, \dots, n$. Then

$$\sum_{i=1}^n\big(Y_i - g(x_i)\big)^2 = \sum_{i=1}^n\big(Y_i - \hat f(x_i)\big)^2, \qquad \text{but} \qquad
\int g''(x)^2\,dx \le \int \hat f''(x)^2\,dx,$$

with strict inequality unless $\hat f$ is itself that spline, so $g$ attains a penalized objective value no
larger than that of $\hat f$: a contradiction. This completes the proof.

3.3 Regression splines


At this point, we could fit a nonlinear regression model by choosing some knots, modelling
the regression function as a spline function with those knots, and then estimating the weights
by least squares:

Model: $\mathbb{E}[Y \mid X = x] = f(x) = \sum_{j} b_{j,p}(x)\,\beta_j$, where $b_{j,p}$ is the $j$th B-spline basis function
with some chosen knot sequence.

Estimate: $\hat\beta = \operatorname*{argmin}_\beta \|Y - X\beta\|^2$, where $X_{ij} = b_{j,p}(x_i)$ is the spline design matrix.

This is a linear basis function model.
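A minimal Python sketch of this estimator (the helper name, knot choices, and data are mine, and I assume SciPy's `BSpline` is available; evaluating a `BSpline` whose coefficient array is the identity returns all basis functions at once):

```python
import numpy as np
from scipy.interpolate import BSpline

def bspline_design(x, knots, p=4):
    """Spline design matrix X[i, j] = b_{j,p}(x_i); `knots` must already include
    the p repeated boundary knots at each end (k + 2p values in total)."""
    nb = len(knots) - p                            # number of basis functions, k + p
    return BSpline(knots, np.eye(nb), p - 1)(x)    # column j is b_{j,p} evaluated at x

rng = np.random.default_rng(3)
x = np.sort(rng.uniform(-1, 1, 100))
y = x**3 - x + rng.normal(0, 0.1, 100)             # simulated data with a cubic truth

p = 4                                              # cubic B-splines (order 4)
interior = np.linspace(-0.8, 0.8, 5)               # 5 interior knots, chosen by hand
knots = np.concatenate([[-1.0] * p, interior, [1.0] * p])

X = bspline_design(x, knots, p)
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)   # least squares spline fit
fitted = X @ beta_hat
```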

The choice of knots matters. To build the spline interpolant in Figure 3.2, we used a
knot at each point. But we don’t want to interpolate our training data; we want to estimate
the regression function, which is the mean output at each input.
Figure 3.3 shows a spline with 2, 3, 5, and 20 knots fit by ordinary least squares.
How do you know which of these curves is best in practice, when all you would observe
is the points? We need some way of penalizing the complexity of the model fit.
When too few knots are used, the fitted curve is biased. This cannot be fixed: the model
simply is not flexible enough to capture the true regression function.
When too many knots are used, the fitted curve is too wiggly/rough: it varies wildly,
in an effort to best fit the data. The model is very flexible, and it would be able to fit the
true function if the data had less variance. We can fix this by penalizing roughness in the
estimation.

[Figure 3.3: B-spline regression for a nonlinear regression function (black) with 2 (red), 3
(purple), 5 (orange), and 20 (blue) knots.]

3.3.1 Fitting a spline by ridge regression


Let’s step back to the model:

$$Y_i \mid X_i = x_i \sim N\big(f(x_i), \sigma^2\big), \qquad f(x) = \mathbb{E}[Y \mid X = x],$$

where $f$ is an unknown function, assumed to be twice continuously differentiable.
The function f here is an infinite-dimensional parameter, in the precise sense that it
belongs to a vector space with infinite dimension. We need to restrict attention to some
finite-dimensional subspace in order to actually estimate f . Consider the least squares
minimization:

$$\hat f = \operatorname*{argmin}_f \sum_{i=1}^{n}\big(Y_i - f(x_i)\big)^2.$$

Any interpolant—such as an nth-degree polynomial interpolant—would clearly minimize


this, and be finite-dimensional. However, we have talked at length about how we do not just
want to interpolate the data.
Further, we have seen that the solution to the minimum roughness interpolation problem,

$$\hat f = \operatorname*{argmin}_f \sum_{i=1}^{n}\big(Y_i - f(x_i)\big)^2 + \lambda \int f''(x)^2\,dx,$$

is a cubic spline with n knots at the data values.



We combine these ideas to (finally) arrive at the actual method we’ll use for nonlinear
regression. Regression splines fix f to be a cubic spline with some pre-chosen large number
of knots,

$$\mathbb{E}[Y \mid X = x] = f(x) = \sum_{j=1}^{d} b_j(x)\,\beta_j,$$

hiding details about knot placement and counting, and about the boundary behaviour (we always
use cubic splines, $b_j = b_{j,4}$; just think of this as a basis expansion with $d \in \mathbb{N}$ parameters to
estimate), and then choose the spline weights to minimize the penalized least squares objective:

$$\hat\beta_\lambda = \operatorname*{argmin}_\beta \sum_{i=1}^n \big(Y_i - f(x_i)\big)^2 + \lambda \int f''(x)^2\,dx, \qquad f(x) = \sum_{j=1}^d b_j(x)\,\beta_j,$$

where $\lambda \ge 0$ is the smoothing parameter and the integral is the roughness penalty. Written in
vector form, with $Y = (Y_1, \dots, Y_n)^\top$ and spline design matrix $X_{ij} = b_j(x_i)$,

$$\hat\beta_\lambda = \operatorname*{argmin}_\beta \|Y - X\beta\|^2 + \lambda \int f''(x)^2\,dx.$$

The penalty term is tractable. We write it as a quadratic form:

If $f(x) = \sum_j b_j(x)\,\beta_j = b(x)^\top\beta$, where $b(x) = \big(b_1(x), \dots, b_d(x)\big)^\top$, then $f''(x) = b''(x)^\top\beta$ and

$$\int f''(x)^2\,dx = \int \beta^\top b''(x)\,b''(x)^\top\beta\,dx = \beta^\top\Big(\int b''(x)\,b''(x)^\top dx\Big)\beta = \beta^\top S\,\beta,$$

where the penalty matrix $S \in \mathbb{R}^{d\times d}$ satisfies

$$S_{ij} = \int b_i''(x)\,b_j''(x)\,dx.$$

The penalty matrix is computed by recognizing that the spline basis functions are piecewise
polynomials, so the integrand is a piecewise polynomial, and then applying either interpolation
(Wood, 2017a) or Gaussian quadrature.
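Here is a sketch of the quadrature approach in Python (names and the specific rule are my own choices, not the implementation referenced above). For cubic B-splines the second derivatives are piecewise linear, so a three-point Gauss-Legendre rule on each inter-knot interval integrates the products exactly.

```python
import numpy as np
from scipy.interpolate import BSpline

def penalty_matrix(knots, p=4, n_quad=3):
    """S[i, j] = integral of b_i''(x) b_j''(x) dx, computed interval by interval
    with Gauss-Legendre quadrature (exact for cubic B-splines)."""
    nb = len(knots) - p
    basis = BSpline(knots, np.eye(nb), p - 1)      # all basis functions at once
    d2 = basis.derivative(2)                       # their second derivatives
    nodes, weights = np.polynomial.legendre.leggauss(n_quad)
    S = np.zeros((nb, nb))
    for a, b in zip(knots[:-1], knots[1:]):        # zero-length intervals contribute nothing
        if b == a:
            continue
        t = 0.5 * (b - a) * nodes + 0.5 * (a + b)  # map quadrature nodes to [a, b]
        B2 = d2(t)                                 # shape (n_quad, nb)
        S += 0.5 * (b - a) * (B2.T * weights) @ B2
    return S
```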

We then write the objective in vector form:

$$\hat\beta_\lambda = \operatorname*{argmin}_\beta \|Y - X\beta\|^2 + \lambda\,\beta^\top S\beta.$$

Let $Q_\lambda(\beta) = L(\beta; Y) + P_\lambda(\beta)$, where $L(\beta; Y) = \|Y - X\beta\|^2$ and
$P_\lambda(\beta) = \lambda\,\beta^\top S\beta = \lambda\int f''(x)^2\,dx$. Solve $\partial Q_\lambda/\partial\beta = 0$:

$$\frac{\partial Q_\lambda}{\partial\beta} = -2X^\top(Y - X\beta) + 2\lambda S\beta = 0
\;\Longrightarrow\; X^\top Y = (X^\top X + \lambda S)\beta
\;\Longrightarrow\; \hat\beta_\lambda = (X^\top X + \lambda S)^{-1}X^\top Y,$$

a generalized ridge regression. Compare to ridge regression:

$$\hat\beta_\lambda = \operatorname*{argmin}_\beta \|Y - X\beta\|^2 + \lambda\|\beta\|^2 = (X^\top X + \lambda I)^{-1}X^\top Y.$$

Remark: $X^\top X + \lambda I \succ 0$ whenever $\lambda > 0$, because if $s_1 \ge \dots \ge s_d \ge 0$ are the singular
values of $X$, then $s_i^2 \ge 0$ are the eigenvalues of $X^\top X$, so $s_i^2 + \lambda > 0$ are the eigenvalues of
$X^\top X + \lambda I$.
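In code, the estimate is one linear solve. A Python sketch (reusing the hypothetical `bspline_design` and `penalty_matrix` helpers from the earlier sketches):

```python
import numpy as np

def fit_penalized_spline(X, y, S, lam):
    """Generalized ridge estimate beta_hat = (X'X + lam * S)^{-1} X'y,
    computed with a linear solve rather than an explicit inverse."""
    return np.linalg.solve(X.T @ X + lam * S, X.T @ y)

# Usage, with X and S built as in the earlier sketches:
# beta_hat = fit_penalized_spline(X, y, S, lam=1.0)
# fitted = X @ beta_hat
```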

Having solved for the spline weights, we now derive the sampling distribution of the estimator.



From the original model,

$$Y_i \sim N\big(f(x_i), \sigma^2\big) \text{ independently}, \quad \text{i.e.} \quad Y \sim N\big(f, \sigma^2 I_n\big), \qquad f = \big(f(x_1), \dots, f(x_n)\big)^\top,$$

and since $f(x) = \sum_j b_j(x)\,\beta_j$, we have $f = X\beta$ with $X$ the spline design matrix. We also obtain
the sampling distribution of the estimated spline weights:

$$\mathbb{E}[\hat\beta_\lambda] = (X^\top X + \lambda S)^{-1}X^\top\,\mathbb{E}[Y] = (X^\top X + \lambda S)^{-1}X^\top X\beta,$$

$$\mathrm{Cov}[\hat\beta_\lambda] = (X^\top X + \lambda S)^{-1}X^\top\,\mathrm{Cov}[Y]\,X(X^\top X + \lambda S)^{-1}
= \sigma^2(X^\top X + \lambda S)^{-1}X^\top X(X^\top X + \lambda S)^{-1},$$

so $\hat\beta_\lambda \sim N\big(\mathbb{E}[\hat\beta_\lambda], \mathrm{Cov}[\hat\beta_\lambda]\big)$.

We use this to form estimates of, and confidence intervals for, the fitted curve. For fixed $x$,
$\hat f(x) = b(x)^\top\hat\beta_\lambda$, where $b(x) = \big(b_1(x), \dots, b_d(x)\big)^\top$, so

$$\hat f(x) \sim N\big(\mathbb{E}[\hat f(x)], \mathrm{Var}[\hat f(x)]\big), \qquad
\mathbb{E}[\hat f(x)] = b(x)^\top\mathbb{E}[\hat\beta_\lambda], \qquad
\mathrm{Var}[\hat f(x)] = b(x)^\top\mathrm{Cov}[\hat\beta_\lambda]\,b(x),$$

and

$$\hat f(x) \pm z_{1-\alpha/2}\sqrt{\mathrm{Var}[\hat f(x)]}$$

is a $(1-\alpha)\cdot 100\%$ confidence interval for $f(x)$, where $z_{1-\alpha/2}$ is the $1-\alpha/2$ quantile of
$Z \sim N(0, 1)$. In practice the interval is computed on a fine grid $x_1^* < \dots < x_N^*$ with $x_1^* \approx x_1$
and $x_N^* \approx x_n$:

$$\hat f(x_\ell^*) \pm z_{1-\alpha/2}\sqrt{\mathrm{Var}[\hat f(x_\ell^*)]}, \qquad \ell = 1, \dots, N.$$
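A Python sketch of the pointwise interval on a grid (variable names and defaults are mine; $\sigma^2$ would be estimated as described next):

```python
import numpy as np
from scipy.interpolate import BSpline
from scipy.stats import norm

def spline_ci(x_grid, X, S, knots, beta_hat, sigma2, lam, p=4, alpha=0.05):
    """Pointwise (1 - alpha) confidence band for f on a grid of x values."""
    A_inv = np.linalg.inv(X.T @ X + lam * S)
    cov_beta = sigma2 * A_inv @ (X.T @ X) @ A_inv           # Cov[beta_hat_lambda]
    nb = len(knots) - p
    B = BSpline(knots, np.eye(nb), p - 1)(x_grid)           # rows are b(x)^T on the grid
    fhat = B @ beta_hat                                     # fitted curve on the grid
    var = np.einsum("ij,jk,ik->i", B, cov_beta, B)          # b(x)' Cov[beta_hat] b(x)
    z = norm.ppf(1 - alpha / 2)                             # z_{1 - alpha/2}
    return fhat - z * np.sqrt(var), fhat + z * np.sqrt(var)
```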

To estimate the residual variance, we use

$$\hat\sigma^2 = \frac{1}{n - \mathrm{EDF}_\lambda}\sum_{i=1}^n\big(Y_i - \hat f(x_i)\big)^2.$$

In linear regression the denominator is $n - p$ with $p = \dim(\beta)$; penalization changes the
appropriate denominator.
The denominator requires the effective degrees of freedom:

$$\mathrm{EDF}_\lambda = \mathrm{trace}(H_\lambda), \qquad \text{where } H_\lambda = X(X^\top X + \lambda S)^{-1}X^\top$$

is the hat matrix, which satisfies $\hat Y_\lambda = H_\lambda Y$.
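A Python sketch of these quantities (names are mine; using trace(X A⁻¹ Xᵀ) = trace(A⁻¹ XᵀX) avoids forming the n × n hat matrix):

```python
import numpy as np

def effective_df(X, S, lam):
    """EDF(lambda) = trace(H_lambda) with H_lambda = X (X'X + lam*S)^{-1} X'."""
    A_inv = np.linalg.inv(X.T @ X + lam * S)
    return np.trace(A_inv @ (X.T @ X))

def sigma2_hat(X, y, S, lam):
    """Residual variance estimate with the penalized denominator n - EDF(lambda)."""
    beta_hat = np.linalg.solve(X.T @ X + lam * S, X.T @ y)
    resid = y - X @ beta_hat
    return resid @ resid / (len(y) - effective_df(X, S, lam))
```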

Balancing the EDF with the model fit is also one way to choose $\lambda$: choose $\lambda$ to minimize

$$\underbrace{\sum_{i=1}^n\big(Y_i - \hat f_\lambda(x_i)\big)^2}_{\text{fit to the data}}
\;+\; \underbrace{\mathrm{trace}(H_\lambda)}_{\text{complexity of the model}}.$$

We will come back to this in Chapter 5.



3.3.2 Penalty nullspace


In ridge regression, the coefficients were shrunk towards zero. The penalized splines problem
reduces to ridge regression if the penalty matrix is taken to be the identity, $S = I$.

The nullspace of the penalty matrix determines which class of functions is unpenalized,
and hence which type of function we are smoothing towards. For ridge, the identity is full
rank so has a trivial nullspace, and we smooth towards the zero function. For penalized
splines, linear functions are unpenalized (a linear $f$ has $f'' \equiv 0$, hence $\beta^\top S\beta = 0$),
and the penalty matrix has rank $m - 2$, with nullspace spanned by the coefficient vectors
representing the constant and linear functions.

We therefore interpret the smoothing parameter as a sort of desired inverse magnitude
of deviation of f from a linear function; larger values pull f closer to linearity, and smaller
values allow it to be much rougher.

The idea of the penalty nullspace allows for a reparameterization of the model in a manner
that is convenient for fitting, useful for theoretical analysis, and helpful for understanding
some of the challenges with identifiability involved in moving from regression splines (one $f$)
to additive models (multiple $f_1, f_2, \dots$). We have the eigendecomposition

$$S = UDU^\top, \qquad D = \mathrm{diag}(d_1, \dots, d_{m-2}, 0, 0),$$

with $U$ orthogonal, and we reparameterize via $\beta = U\gamma$. Now split the spline design matrix:

$$XU = \big[\,X_R \;\; X_F\,\big],$$

where $X_R$ collects the columns corresponding to nonzero eigenvalues and $X_F$ the columns
corresponding to the zero eigenvalues, with $\gamma = (\gamma_R^\top, \gamma_F^\top)^\top$ partitioned accordingly. The
model is then:

$$\mathbb{E}[Y] = X\beta = XU\gamma = X_R\gamma_R + X_F\gamma_F.$$

The penalty is:

$$\lambda\,\beta^\top S\beta = \lambda\,\gamma^\top D\gamma = \lambda\,\gamma_R^\top D_R\gamma_R, \qquad D_R = \mathrm{diag}(d_1, \dots, d_{m-2}),$$

so $\gamma_F$ is unpenalized. The column space of $X_F$ is exactly the nullspace of $S$, and it can be
shown that $(1, x)$ is a basis for this space. Putting it all together, the model and penalized
likelihood are:

$$\mathbb{E}[Y] = X_F\gamma_F + X_R\gamma_R, \qquad \|Y - X_F\gamma_F - X_R\gamma_R\|^2 + \lambda\,\gamma_R^\top D_R\gamma_R.$$
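A small Python sketch of this split (the function name and tolerance are mine; it assumes $X$ and $S$ are built as in the earlier sketches):

```python
import numpy as np

def nullspace_reparam(X, S, tol=1e-10):
    """Split the penalized spline model into an unpenalized part X_F (the penalty
    nullspace) and a penalized part X_R, via the eigendecomposition S = U D U'."""
    d, U = np.linalg.eigh(S)                     # eigenvalues in ascending order
    null = d < tol * d.max()                     # (numerically) zero eigenvalues
    XF, XR = X @ U[:, null], X @ U[:, ~null]     # unpenalized and penalized columns
    DR = np.diag(d[~null])                       # penalty matrix for the penalized part
    return XF, XR, DR

# For a cubic spline penalty, XF has two columns and spans the same space as (1, x).
```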

So there is a sort of “secret” linear model plus an explicit deviation from linearity. This
form is especially useful when building much more complicated models in which identifiability
of parameters becomes a challenge. One example is an additive model, which is to regression
splines as multiple linear regression is to simple linear regression, i.e. we have multiple
functions $f_1, f_2, \dots$. This is the subject of the next chapter.
