

Chapter 3

Nonlinear Regression I: Basis Functions and Penalized Regression

In this chapter we learn how to extend the linear regression model to model regression
functions that depend on the inputs in a nonlinear manner. The key idea is to derive a
set of transformed inputs that is very rich and flexible, and then apply ridge regression to
control estimation of the coefficients. For models with a small number of inputs upon which
the regression function depends in an additive manner, splines do exceptionally well at this
task. In later chapters we will see more complicated transformations and models, but the
basic idea of transforming the inputs and then controlling estimation via regularization is
fundamental.


3.1 Linear basis function models


Consider the linear regression model:

$$\mathbb{E}[Y_i \mid X_i = x] = \sum_{j=1}^{p} x_j \beta_j, \qquad x \in \mathbb{R}^p,\ \beta \in \mathbb{R}^p.$$

A linear basis function model models $f(x) = \mathbb{E}[Y \mid X = x]$ as a linear combination of known, chosen
basis functions. For simplicity, let $x \in \mathbb{R}$ (univariate):

$$\mathbb{E}[Y \mid X = x] = \sum_{j=1}^{m} h_j(x)\,\beta_j, \qquad \beta = (\beta_1, \dots, \beta_m)^\top.$$

The $h_j$ can be anything, and there can be any number of them. Examples:

$h_j(x) = x$: simple linear regression (cf. ESL, Ch. 5, p. 140).

$h_j(x) = x^2$: a quadratic term.

$h_1(x) = 1,\ h_2(x) = x,\ \dots,\ h_m(x) = x^{m-1}$: polynomial regression.

For intervals $I_j = [a_j, a_{j+1})$, let $h_j(x) = 1(x \in I_j)$: piecewise-constant (step-function) regression.
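To make the estimation step concrete, here is a minimal Python sketch (the data, basis choice, and names are invented for illustration, not taken from the chapter): once the transformed inputs $h_j(x_i)$ are assembled into a design matrix, the coefficients are estimated exactly as in ordinary linear regression.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, 200)
y = np.sin(3 * x) + rng.normal(0, 0.1, 200)   # made-up data for illustration

# Design matrix whose columns are h_j(x); here the polynomial basis 1, x, x^2, x^3 (m = 4).
H = np.column_stack([x**j for j in range(4)])

# Ordinary least squares on the transformed inputs.
beta_hat, *_ = np.linalg.lstsq(H, y, rcond=None)
fitted = H @ beta_hat
```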

We see that a number of clever procedures for predictive modelling can be written as
linear basis function models. This makes studying their estimation of interest.

3.1.1 Polynomial regression


A problem that must be overcome is how to choose the hj , both their form and number.
Consider the data in Figure 3.1 (a). I used a polynomial basis with m = 2, 4, 11 coefficients
(i.e. a linear, cubic, and degree-10 polynomial) to produce the fits in Figure 3.1 (b). The
cubic looks the best. Why doesn’t the m = 11 fit as well as the cubic though? It contains
the cubic as a special case, so as a statistical model it is exactly as “correct” as the cubic.
But it fits terribly.
There are two immediate explanations for this behaviour:

1. The polynomial basis functions are global: every hj is nonzero for every (nonzero) xi .
This means that a change in any xi will affect the shape of the entire fitted curve.

2. The polynomial basis functions are very flexible when the order is high: when m = n
the polynomial can actually interpolate the data, leading to extremely high variance
in the shape of the fitted curve.

There are two solutions which, when applied together, create an incredibly useful tool for
fitting nonlinear regression functions.

3.1.2 Piecewise polynomial regression


The first solution is to remove the global property of the basis by using piecewise polynomials:

Define breakpoints $\tau_0 < \tau_1 < \dots < \tau_d$. A piecewise polynomial is a function $f$ that is a polynomial
on each interval $[\tau_{j-1}, \tau_j)$.

Degree 0: $f(x) = \sum_j h_j(x)\,\beta_j$ with $h_j(x) = 1(\tau_{j-1} \le x < \tau_j)$, so $f$ is constant on each interval.

Degree 1: on each interval $[\tau_{j-1}, \tau_j)$ we have $f(x) = a_j + b_j x$, i.e. the basis contains both
$1(\tau_{j-1} \le x < \tau_j)$ and $x\,1(\tau_{j-1} \le x < \tau_j)$.
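A quick Python sketch of the degree-0 case (breakpoints and data invented for the example): each column of the design matrix is an interval indicator, so least squares simply returns the mean of $y$ within each interval.

```python
import numpy as np

def piecewise_constant_basis(x, breakpoints):
    """h_j(x) = 1(tau_{j-1} <= x < tau_j) for consecutive breakpoints tau_0 < ... < tau_d."""
    lo, hi = breakpoints[:-1], breakpoints[1:]
    return ((x[:, None] >= lo) & (x[:, None] < hi)).astype(float)

rng = np.random.default_rng(2)
x = rng.uniform(-1, 1, 200)
y = np.sin(3 * x) + rng.normal(0, 0.1, 200)

tau = np.linspace(-1, 1, 6)                        # 5 intervals on [-1, 1)
H = piecewise_constant_basis(x, tau)
beta_hat, *_ = np.linalg.lstsq(H, y, rcond=None)   # beta_j = mean of y on interval j
```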

[Figure 3.1: Polynomial, piecewise polynomial, and spline regression for a simulated data set
with n = 100. The true regression function is cubic. (a) Data and true function f. (b) Degree 1
(red), 3 (purple), and 10 (orange) polynomial regression. (c) Piecewise polynomial regression with
5 knots. (d) B-spline regression with 5 knots.]

This solves the global problem: changes to the data with $x \in [\tau_1, \tau_2)$ simply won’t affect the fitted
function in the range $[\tau_2, \tau_3)$ (and so on). Figure 3.1 (c) shows a piecewise polynomial
fit. It not only looks completely unrealistic, but despite being more flexible than a single
cubic polynomial, it doesn’t fit the data as well! The model is too flexible.
We don’t really believe that the regression function should change in a discontinuous
manner at a small number of essentially arbitrary points. We want to use piecewise polyno-
mials for their flexibility and local properties, but retain the continuity properties of a global
polynomial.
A basis that satisfies these requirements is the truncated power function basis:
$$h_j(x) = x^{j-1}, \quad j = 1, \dots, p, \qquad h_{p+\ell}(x) = (x - \tau_\ell)_+^{\,p-1}, \quad \ell = 1, \dots, k,$$

where $(u)_+ = \max(u, 0)$ and $\tau_1 < \dots < \tau_k$ are the knots (cf. ESL, p. 144).

This basis satisfies the following properties:

1. The number of parameters, $m$, is equal to the order of the polynomial basis, $p$, plus
the number of knots, $k$; $m = p + k$.

2. $h_{j+p}$ is a $p$th-order polynomial on $(\tau_{j-1}, \tau_j)$.

3. $f$ has $p - 2$ continuous derivatives at $x = \tau_j$ for any $j = 1, \dots, k$.

The number of parameters can also be computed by subtracting the number of continuity
constraints from the number of parameters in the unconstrained piecewise polynomial basis.
With $k$ knots there are $k + 1$ pieces, each an order-$p$ (degree $p-1$) polynomial:

piecewise polynomial: $p(k+1) = pk + p$ parameters;

continuity constraints: $(p-1)k$, i.e. continuity of $f, f', \dots, f^{(p-2)}$ at each of the $k$ knots;

remaining: $pk + p - (p-1)k = p + k = m$ parameters, as above.
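As a Python sketch (the function name and defaults are mine), the truncated power basis can be constructed directly; the columns are the $p$ monomials followed by one truncated power term per knot, giving the $m = p + k$ parameters counted above.

```python
import numpy as np

def truncated_power_basis(x, knots, p=4):
    """Truncated power basis of order p (degree p - 1) with the given interior knots:
    columns 1, x, ..., x^{p-1}, then (x - tau_l)_+^{p-1} for each knot tau_l."""
    x = np.asarray(x, dtype=float)
    monomials = [x**j for j in range(p)]                        # x^0, ..., x^{p-1}
    truncated = [np.maximum(x - t, 0.0)**(p - 1) for t in knots]
    return np.column_stack(monomials + truncated)               # n x (p + k) matrix
```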



[Figure 3.2 (panel "Points to interpolate"; y-axis labelled f(x) = sin(2π/x)): Polynomial (black) and
cubic B-spline (purple) interpolants of a set of 10 points. The polynomial is very “high-energy”,
while the spline is “calmer”.]

The monomial terms in the basis still pose a problem, and this basis will experience a
similar behaviour to the polynomial regression (Figure 3.1 (b)) for large order.
We know what properties we want our basis functions to have: local support, polynomial
within each local neighbourhood, and a number of continuous derivatives at the boundaries
of each local neighbourhood. These constraints define a function space, the space of all
piecewise polynomials on a given knot sequence satisfying these continuity conditions. Rather
than exploring this space manually by guessing its elements and hoping they have desirable
properties, if we could construct a basis for this space, then this would give us our basis for
nonlinear regression.

3.2 Splines
A spline is a thin piece of flexible material (e.g. wood) that is held in position at a number
of points. Between the points, physics predicts that the spline will take up the position of
least energy: it will be as “smooth” as possible.
A spline interpolant is a numerical analytic tool which smoothly interpolates a set of
points using piecewise polynomials with knots at each point, constrained to be some number
of times continuously differentiable at each point. See Figure 3.2. Shown are a Lagrange
polynomial interpolant and a spline interpolant. They both interpolate the points shown,
but the polynomial varies wildly between the points, while the spline takes up a sort of path
of least resistance. For the physical spline this is a consequence of energy minimization; we will
prove the mathematical analogue as a theorem in Section 3.2.2. Splines will be our go-to tool for
nonlinear regression via linear combinations of basis
functions.

3.2.1 B-splines
The “B” is for basis!
B-spline basis functions of order $p \in \mathbb{N}$ are piecewise polynomials of degree $p - 1$, constrained to
be some number of times continuously differentiable at a set of knots. Let $\tau_1 \le \tau_2 \le \cdots$ be a
non-decreasing knot sequence and write $b_{j,p} : \mathbb{R} \to \mathbb{R}$ for the $j$th B-spline basis function of
order $p$. Each $b_{j,p}$ is nonzero only on $[\tau_j, \tau_{j+p})$, and at a knot of multiplicity $m$ it is $p - 1 - m$
times continuously differentiable (so $p - 2$ times at a simple knot).

In fact, they span the space of all such functions, so that a spline function

$$f(x) = \sum_j b_{j,p}(x)\,\beta_j, \qquad \beta_j \text{ arbitrary},$$

(a linear combination of B-spline basis functions with coefficients $\beta_j$)
can represent any piecewise polynomial with such continuity constraints. The canonical
reference on the mathematical and computational properties of B-splines is de Boor (1978).
For the numerically-inclined, we note that B-splines can be constructed in
an abstract manner that is very useful for theoretical analysis by forming divided differences
of the truncated power basis above. When computation is of concern—as it is here—it is
more useful (and more common) to construct B-splines recursively.

Note: with $k$ interior knots and order-$p$ polynomials there are $k + 2p$ knots in total: $k$ interior
knots plus $p$ repeated at each boundary. Each B-spline basis function is nonzero on an interval
spanned by $p + 1$ consecutive knots, so the spline function has $k + p$ basis functions in total; the
largest one starts at the last interior knot.

The recursive definition of B-splines is as follows. Start with the indicator functions:

$$b_{j,1}(x) = \begin{cases} 1 & \tau_j \le x < \tau_{j+1}, \\ 0 & \text{otherwise.} \end{cases}$$

We call $b_{j,1}$ the $j$th B-spline of order 1 (alternatively, degree 0). Now, let:

$$b_{j,2}(x) = \frac{x - \tau_j}{\tau_{j+1} - \tau_j}\, b_{j,1}(x) + \frac{\tau_{j+2} - x}{\tau_{j+2} - \tau_{j+1}}\, b_{j+1,1}(x),$$

the piecewise-linear “hat” function supported on $[\tau_j, \tau_{j+2})$.

Finally, for any $p \in \mathbb{N}$ with $p > 1$, the $j$th B-spline of order $p$ is:

$$b_{j,p}(x) = \frac{x - \tau_j}{\tau_{j+p-1} - \tau_j}\, b_{j,p-1}(x) + \frac{\tau_{j+p} - x}{\tau_{j+p} - \tau_{j+1}}\, b_{j+1,p-1}(x).$$

The most common choice of p = 4 gives the cubic B-spline shown in purple in Figure 3.2.
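The recursion translates directly into code. Below is a minimal, unoptimized Python sketch (the function name and 0-based indexing are my own; in practice a library implementation, e.g. SciPy's, would be used).

```python
import numpy as np

def bspline_basis(j, p, knots, x):
    """Evaluate the j-th (0-based) B-spline basis function of order p (degree p - 1)
    at the points x, via the Cox-de Boor recursion; `knots` is non-decreasing."""
    if p == 1:                                    # order 1: indicator of [tau_j, tau_{j+1})
        return np.where((knots[j] <= x) & (x < knots[j + 1]), 1.0, 0.0)
    # Repeated knots give zero-length denominators; the convention is 0/0 = 0.
    left_den = knots[j + p - 1] - knots[j]
    right_den = knots[j + p] - knots[j + 1]
    left = 0.0 if left_den == 0 else (x - knots[j]) / left_den * bspline_basis(j, p - 1, knots, x)
    right = 0.0 if right_den == 0 else (knots[j + p] - x) / right_den * bspline_basis(j + 1, p - 1, knots, x)
    return left + right
```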
B-splines satisfy the following useful properties:
1. Local support: $b_{j,p}(x) = 0$ if $x \notin [\tau_j, \tau_{j+p})$.

2. Positivity: $b_{j,p}(x) > 0$ for all $x \in (\tau_j, \tau_{j+p})$.

3. Self-normalization: $\sum_j b_{j,p}(x) = 1$ for all $x$ between the boundary knots.
This last property is not obvious by simple inspection, and provides a useful sort of scale-
invariance when using B-splines in regression modelling.
A spline function is a linear combination of B-spline basis functions:

$$f(x) = \sum_{j=1}^{k+p} b_{j,p}(x)\,\beta_j,$$

where the coefficients $\beta_1, \dots, \beta_{k+p}$ determine the shape of the curve.
Notation for the knots varies considerably across different sources. We use the following
conventions:

Let $a < \tau_1 < \dots < \tau_k < b$, where $a$ and $b$ are boundary knots each repeated $p$ times, for a
total of $k + 2p$ knots; the number of interior knots, $k$, is the important number.

Finally, a natural cubic spline has the additional constraint that the spline function be
linear outside the boundary knots. This is usually enforced explicitly in implementations.

3.2.2 Roughness penalties


Splines were motivated empirically in Figure 3.2 as being a sort of “calm” interpolant: they
share with the polynomial interpolant the property that they interpolate the given set of points,
but between the points they seem to vary less than the polynomial. It turns out that this is
a fundamental property of splines, and that in fact splines are characterized entirely by this
property.
Consider a measure of the energy or roughness of a twice-continuously differentiable
function $f : \mathbb{R} \to \mathbb{R}$:

$$P(f) = \|f''\|^2 = \int f''(x)^2\,dx.$$

Now consider the following variational (optimization) problem:

$$\hat f = \operatorname*{argmin}_f \sum_{i=1}^n \big(Y_i - f(x_i)\big)^2 + \lambda \int f''(x)^2\,dx,$$

for a given set of points $(x_i, Y_i)$, $i = 1, \dots, n$.

An incredible fact is that this problem not only admits a unique solution, but that
solution is finite-dimensional. We prove the following “fundamental theorem” for splines:

Suppose $n \ge 2$ and $a < x_1 < \dots < x_n < b$. Let $g(x)$ be the unique natural cubic spline interpolant
to $(x_1, Y_1), \dots, (x_n, Y_n)$, and let $\tilde f$ be any twice continuously differentiable function that also
interpolates, i.e. $\tilde f(x_i) = Y_i$, $i = 1, \dots, n$. We want to show that

$$\int_a^b g''(x)^2\,dx \le \int_a^b \tilde f''(x)^2\,dx.$$

Let $h = \tilde f - g$, so that $h(x_i) = 0$ for every $i$. Then

$$\int_a^b \tilde f''(x)^2\,dx = \int_a^b g''(x)^2\,dx + 2\int_a^b g''(x)h''(x)\,dx + \int_a^b h''(x)^2\,dx,$$

and since $\int_a^b h''(x)^2\,dx \ge 0$, it suffices to show that the middle term is zero. Integrating by parts,

$$\int_a^b g''(x)h''(x)\,dx = \big[g''(x)h'(x)\big]_a^b - \int_a^b g'''(x)h'(x)\,dx.$$

But $g$ is linear for $x < x_1$ and $x > x_n$, so $g''(a) = g''(b) = 0$ and the boundary term vanishes.
For $x \in (a, b)$ with $x \ne x_i$ for any $i$, $g(x)$ is a cubic, so its third derivative is constant between
knots: define $c_j = g'''(x)$ for $x \in (x_j, x_{j+1})$. Then

$$\int_a^b g'''(x)h'(x)\,dx = \sum_j c_j \int_{x_j}^{x_{j+1}} h'(x)\,dx = \sum_j c_j\big(h(x_{j+1}) - h(x_j)\big) = 0,$$

because $h(x_j) = \tilde f(x_j) - g(x_j) = Y_j - Y_j = 0$ for every $j$ (both functions interpolate the $(x_i, Y_i)$).
Hence $\int_a^b g''(x)h''(x)\,dx = 0$ and

$$\int_a^b g''(x)^2\,dx \le \int_a^b \tilde f''(x)^2\,dx.$$

Now suppose, by way of contradiction, that the minimizer $\hat f$ of the penalized criterion is not a
spline. Let $g$ be the cubic spline that satisfies $g(x_i) = \hat f(x_i)$, $i = 1, \dots, n$. Then

$$\sum_{i=1}^n\big(Y_i - g(x_i)\big)^2 = \sum_{i=1}^n\big(Y_i - \hat f(x_i)\big)^2, \qquad \text{but} \qquad
\int g''(x)^2\,dx \le \int \hat f''(x)^2\,dx,$$

with strict inequality unless $\hat f$ is itself that spline, so $g$ attains a penalized objective value no
larger than that of $\hat f$: a contradiction. This completes the proof.

3.3 Regression splines


At this point, we could fit a nonlinear regression model by choosing some knots, modelling
the regression function as a spline function with those knots, and then estimating the weights
by least squares:

Model: $\mathbb{E}[Y \mid X = x] = f(x) = \sum_{j} b_{j,p}(x)\,\beta_j$, where $b_{j,p}$ is the $j$th B-spline basis function
with some chosen knot sequence.

Estimate: $\hat\beta = \operatorname*{argmin}_\beta \|Y - X\beta\|^2$, where $X_{ij} = b_{j,p}(x_i)$ is the spline design matrix.

This is a linear basis function model.
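A minimal Python sketch of this estimator (the helper name, knot choices, and data are mine, and I assume SciPy's `BSpline` is available; evaluating a `BSpline` whose coefficient array is the identity returns all basis functions at once):

```python
import numpy as np
from scipy.interpolate import BSpline

def bspline_design(x, knots, p=4):
    """Spline design matrix X[i, j] = b_{j,p}(x_i); `knots` must already include
    the p repeated boundary knots at each end (k + 2p values in total)."""
    nb = len(knots) - p                            # number of basis functions, k + p
    return BSpline(knots, np.eye(nb), p - 1)(x)    # column j is b_{j,p} evaluated at x

rng = np.random.default_rng(3)
x = np.sort(rng.uniform(-1, 1, 100))
y = x**3 - x + rng.normal(0, 0.1, 100)             # simulated data with a cubic truth

p = 4                                              # cubic B-splines (order 4)
interior = np.linspace(-0.8, 0.8, 5)               # 5 interior knots, chosen by hand
knots = np.concatenate([[-1.0] * p, interior, [1.0] * p])

X = bspline_design(x, knots, p)
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)   # least squares spline fit
fitted = X @ beta_hat
```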

The choice of knots matters. To build the spline interpolant in Figure 3.2, we used a
knot at each point. But we don’t want to interpolate our training data; we want to estimate
the regression function, which is the mean output at each input.
Figure 3.3 shows a spline with 2, 3, 5, and 20 knots fit by ordinary least squares.
How do you know which of these curves is best in practice, when all you would observe
is the points? We need some way of penalizing the complexity of the model fit.
When too few knots are used, the fitted curve is biased. This cannot be fixed: the model
simply is not flexible enough to capture the true regression function.
When too many knots are used, the fitted curve is too wiggly/rough: it varies wildly,
in an effort to best fit the data. The model is very flexible, and it would be able to fit the
true function if the data had less variance. We can fix this by penalizing roughness in the
estimation.

[Figure 3.3: B-spline regression for a nonlinear regression function (black) with 2 (red), 3
(purple), 5 (orange), and 20 (blue) knots.]

3.3.1 Fitting a spline by ridge regression


Let’s step back to the model:

$$Y_i \mid X_i = x_i \sim N\big(f(x_i), \sigma^2\big), \qquad f(x) = \mathbb{E}[Y \mid X = x],$$

where $f$ is an unknown function, assumed to be twice continuously differentiable.
The function f here is an infinite-dimensional parameter, in the precise sense that it
belongs to a vector space with infinite dimension. We need to restrict attention to some
finite-dimensional subspace in order to actually estimate f . Consider the least squares
minimization:

$$\hat f = \operatorname*{argmin}_f \sum_{i=1}^{n}\big(Y_i - f(x_i)\big)^2.$$

Any interpolant—such as an nth-degree polynomial interpolant—would clearly minimize


this, and be finite-dimensional. However, we have talked at length about how we do not just
want to interpolate the data.
Further, we have seen that the solution to the minimum roughness interpolation problem,

$$\hat f = \operatorname*{argmin}_f \sum_{i=1}^{n}\big(Y_i - f(x_i)\big)^2 + \lambda \int f''(x)^2\,dx,$$

is a cubic spline with n knots at the data values.



We combine these ideas to (finally) arrive at the actual method we’ll use for nonlinear
regression. Regression splines fix f to be a cubic spline with some pre-chosen large number
of knots,

$$\mathbb{E}[Y \mid X = x] = f(x) = \sum_{j=1}^{d} b_j(x)\,\beta_j,$$

hiding details about knot placement and counting, and about the boundary behaviour (we always
use cubic splines, $b_j = b_{j,4}$; just think of this as a basis expansion with $d \in \mathbb{N}$ parameters to
estimate), and then choose the spline weights to minimize the penalized least squares objective:

$$\hat\beta_\lambda = \operatorname*{argmin}_\beta \sum_{i=1}^n \big(Y_i - f(x_i)\big)^2 + \lambda \int f''(x)^2\,dx, \qquad f(x) = \sum_{j=1}^d b_j(x)\,\beta_j,$$

where $\lambda \ge 0$ is the smoothing parameter and the integral is the roughness penalty. Written in
vector form, with $Y = (Y_1, \dots, Y_n)^\top$ and spline design matrix $X_{ij} = b_j(x_i)$,

$$\hat\beta_\lambda = \operatorname*{argmin}_\beta \|Y - X\beta\|^2 + \lambda \int f''(x)^2\,dx.$$

The penalty term is tractable. We write it as a quadratic form:

If $f(x) = \sum_j b_j(x)\,\beta_j = b(x)^\top\beta$, where $b(x) = \big(b_1(x), \dots, b_d(x)\big)^\top$, then $f''(x) = b''(x)^\top\beta$ and

$$\int f''(x)^2\,dx = \int \beta^\top b''(x)\,b''(x)^\top\beta\,dx = \beta^\top\Big(\int b''(x)\,b''(x)^\top dx\Big)\beta = \beta^\top S\,\beta,$$

where the penalty matrix $S \in \mathbb{R}^{d\times d}$ satisfies

$$S_{ij} = \int b_i''(x)\,b_j''(x)\,dx.$$

The penalty matrix is computed by recognizing that the spline basis functions are piecewise
polynomials, so the integrand is a piecewise polynomial, and then applying either interpolation
(Wood, 2017a) or Gaussian quadrature.
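Here is a sketch of the quadrature approach in Python (names and the specific rule are my own choices, not the implementation referenced above). For cubic B-splines the second derivatives are piecewise linear, so a three-point Gauss-Legendre rule on each inter-knot interval integrates the products exactly.

```python
import numpy as np
from scipy.interpolate import BSpline

def penalty_matrix(knots, p=4, n_quad=3):
    """S[i, j] = integral of b_i''(x) b_j''(x) dx, computed interval by interval
    with Gauss-Legendre quadrature (exact for cubic B-splines)."""
    nb = len(knots) - p
    basis = BSpline(knots, np.eye(nb), p - 1)      # all basis functions at once
    d2 = basis.derivative(2)                       # their second derivatives
    nodes, weights = np.polynomial.legendre.leggauss(n_quad)
    S = np.zeros((nb, nb))
    for a, b in zip(knots[:-1], knots[1:]):        # zero-length intervals contribute nothing
        if b == a:
            continue
        t = 0.5 * (b - a) * nodes + 0.5 * (a + b)  # map quadrature nodes to [a, b]
        B2 = d2(t)                                 # shape (n_quad, nb)
        S += 0.5 * (b - a) * (B2.T * weights) @ B2
    return S
```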

We then write the objective in vector form:

$$\hat\beta_\lambda = \operatorname*{argmin}_\beta \|Y - X\beta\|^2 + \lambda\,\beta^\top S\beta.$$

Let $Q_\lambda(\beta) = L(\beta; Y) + P_\lambda(\beta)$, where $L(\beta; Y) = \|Y - X\beta\|^2$ and
$P_\lambda(\beta) = \lambda\,\beta^\top S\beta = \lambda\int f''(x)^2\,dx$. Solve $\partial Q_\lambda/\partial\beta = 0$:

$$\frac{\partial Q_\lambda}{\partial\beta} = -2X^\top(Y - X\beta) + 2\lambda S\beta = 0
\;\Longrightarrow\; X^\top Y = (X^\top X + \lambda S)\beta
\;\Longrightarrow\; \hat\beta_\lambda = (X^\top X + \lambda S)^{-1}X^\top Y,$$

a generalized ridge regression. Compare to ridge regression:

$$\hat\beta_\lambda = \operatorname*{argmin}_\beta \|Y - X\beta\|^2 + \lambda\|\beta\|^2 = (X^\top X + \lambda I)^{-1}X^\top Y.$$

Remark: $X^\top X + \lambda I \succ 0$ whenever $\lambda > 0$, because if $s_1 \ge \dots \ge s_d \ge 0$ are the singular
values of $X$, then $s_i^2 \ge 0$ are the eigenvalues of $X^\top X$, so $s_i^2 + \lambda > 0$ are the eigenvalues of
$X^\top X + \lambda I$.
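In code, the estimate is one linear solve. A Python sketch (reusing the hypothetical `bspline_design` and `penalty_matrix` helpers from the earlier sketches):

```python
import numpy as np

def fit_penalized_spline(X, y, S, lam):
    """Generalized ridge estimate beta_hat = (X'X + lam * S)^{-1} X'y,
    computed with a linear solve rather than an explicit inverse."""
    return np.linalg.solve(X.T @ X + lam * S, X.T @ y)

# Usage, with X and S built as in the earlier sketches:
# beta_hat = fit_penalized_spline(X, y, S, lam=1.0)
# fitted = X @ beta_hat
```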

Having solved for the spline weights, we now derive the sampling distribution of the estimator.



From the original model,

$$Y_i \sim N\big(f(x_i), \sigma^2\big) \text{ independently}, \quad \text{i.e.} \quad Y \sim N\big(f, \sigma^2 I_n\big), \qquad f = \big(f(x_1), \dots, f(x_n)\big)^\top,$$

and since $f(x) = \sum_j b_j(x)\,\beta_j$, we have $f = X\beta$ with $X$ the spline design matrix. We also obtain
the sampling distribution of the estimated spline weights:

$$\mathbb{E}[\hat\beta_\lambda] = (X^\top X + \lambda S)^{-1}X^\top\,\mathbb{E}[Y] = (X^\top X + \lambda S)^{-1}X^\top X\beta,$$

$$\mathrm{Cov}[\hat\beta_\lambda] = (X^\top X + \lambda S)^{-1}X^\top\,\mathrm{Cov}[Y]\,X(X^\top X + \lambda S)^{-1}
= \sigma^2(X^\top X + \lambda S)^{-1}X^\top X(X^\top X + \lambda S)^{-1},$$

so $\hat\beta_\lambda \sim N\big(\mathbb{E}[\hat\beta_\lambda], \mathrm{Cov}[\hat\beta_\lambda]\big)$.

We use this to form estimates of, and confidence intervals for, the fitted curve. For fixed $x$,
$\hat f(x) = b(x)^\top\hat\beta_\lambda$, where $b(x) = \big(b_1(x), \dots, b_d(x)\big)^\top$, so

$$\hat f(x) \sim N\big(\mathbb{E}[\hat f(x)], \mathrm{Var}[\hat f(x)]\big), \qquad
\mathbb{E}[\hat f(x)] = b(x)^\top\mathbb{E}[\hat\beta_\lambda], \qquad
\mathrm{Var}[\hat f(x)] = b(x)^\top\mathrm{Cov}[\hat\beta_\lambda]\,b(x),$$

and

$$\hat f(x) \pm z_{1-\alpha/2}\sqrt{\mathrm{Var}[\hat f(x)]}$$

is a $(1-\alpha)\cdot 100\%$ confidence interval for $f(x)$, where $z_{1-\alpha/2}$ is the $1-\alpha/2$ quantile of
$Z \sim N(0, 1)$. In practice the interval is computed on a fine grid $x_1^* < \dots < x_N^*$ with $x_1^* \approx x_1$
and $x_N^* \approx x_n$:

$$\hat f(x_\ell^*) \pm z_{1-\alpha/2}\sqrt{\mathrm{Var}[\hat f(x_\ell^*)]}, \qquad \ell = 1, \dots, N.$$
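A Python sketch of the pointwise interval on a grid (variable names and defaults are mine; $\sigma^2$ would be estimated as described next):

```python
import numpy as np
from scipy.interpolate import BSpline
from scipy.stats import norm

def spline_ci(x_grid, X, S, knots, beta_hat, sigma2, lam, p=4, alpha=0.05):
    """Pointwise (1 - alpha) confidence band for f on a grid of x values."""
    A_inv = np.linalg.inv(X.T @ X + lam * S)
    cov_beta = sigma2 * A_inv @ (X.T @ X) @ A_inv           # Cov[beta_hat_lambda]
    nb = len(knots) - p
    B = BSpline(knots, np.eye(nb), p - 1)(x_grid)           # rows are b(x)^T on the grid
    fhat = B @ beta_hat                                     # fitted curve on the grid
    var = np.einsum("ij,jk,ik->i", B, cov_beta, B)          # b(x)' Cov[beta_hat] b(x)
    z = norm.ppf(1 - alpha / 2)                             # z_{1 - alpha/2}
    return fhat - z * np.sqrt(var), fhat + z * np.sqrt(var)
```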

To estimate the residual variance, we use

$$\hat\sigma^2 = \frac{1}{n - \mathrm{EDF}_\lambda}\sum_{i=1}^n\big(Y_i - \hat f(x_i)\big)^2.$$

In linear regression the denominator is $n - p$ with $p = \dim(\beta)$; penalization changes the
appropriate denominator.
The denominator requires the effective degrees of freedom:

$$\mathrm{EDF}_\lambda = \mathrm{trace}(H_\lambda), \qquad \text{where } H_\lambda = X(X^\top X + \lambda S)^{-1}X^\top$$

is the hat matrix, which satisfies $\hat Y_\lambda = H_\lambda Y$.
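A Python sketch of these quantities (names are mine; using trace(X A⁻¹ Xᵀ) = trace(A⁻¹ XᵀX) avoids forming the n × n hat matrix):

```python
import numpy as np

def effective_df(X, S, lam):
    """EDF(lambda) = trace(H_lambda) with H_lambda = X (X'X + lam*S)^{-1} X'."""
    A_inv = np.linalg.inv(X.T @ X + lam * S)
    return np.trace(A_inv @ (X.T @ X))

def sigma2_hat(X, y, S, lam):
    """Residual variance estimate with the penalized denominator n - EDF(lambda)."""
    beta_hat = np.linalg.solve(X.T @ X + lam * S, X.T @ y)
    resid = y - X @ beta_hat
    return resid @ resid / (len(y) - effective_df(X, S, lam))
```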

Balancing the EDF with the model fit is also one way to choose $\lambda$: choose $\lambda$ to minimize

$$\underbrace{\sum_{i=1}^n\big(Y_i - \hat f_\lambda(x_i)\big)^2}_{\text{fit to the data}}
\;+\; \underbrace{\mathrm{trace}(H_\lambda)}_{\text{complexity of the model}}.$$

We will come back to this in Chapter 5.



3.3.2 Penalty nullspace


In ridge regression, the coefficients were shrunk towards zero. The penalized splines problem
reduces to ridge regression if the penalty matrix is taken to be the identity, $S = I$.

The nullspace of the penalty matrix determines which class of functions is unpenalized,
and hence which type of function we are smoothing towards. For ridge, the identity is full
rank so has a trivial nullspace, and we smooth towards the zero function. For penalized
splines, linear functions are unpenalized (a linear $f$ has $f'' \equiv 0$, hence $\beta^\top S\beta = 0$),
and the penalty matrix has rank $m - 2$, with nullspace spanned by the coefficient vectors
representing the constant and linear functions.

We therefore interpret the smoothing parameter as a sort of desired inverse magnitude
of deviation of f from a linear function; larger values pull f closer to linearity, and smaller
values allow it to be much rougher.

The idea of the penalty nullspace allows for a reparameterization of the model in a manner
that is convenient for fitting, useful for theoretical analysis, and helpful for understanding
some of the challenges with identifiability involved in moving from regression splines (one $f$)
to additive models (multiple $f_1, f_2, \dots$). We have the eigendecomposition

$$S = UDU^\top, \qquad D = \mathrm{diag}(d_1, \dots, d_{m-2}, 0, 0),$$

with $U$ orthogonal, and we reparameterize via $\beta = U\gamma$. Now split the spline design matrix:

$$XU = \big[\,X_R \;\; X_F\,\big],$$

where $X_R$ collects the columns corresponding to nonzero eigenvalues and $X_F$ the columns
corresponding to the zero eigenvalues, with $\gamma = (\gamma_R^\top, \gamma_F^\top)^\top$ partitioned accordingly. The
model is then:

$$\mathbb{E}[Y] = X\beta = XU\gamma = X_R\gamma_R + X_F\gamma_F.$$

The penalty is:

$$\lambda\,\beta^\top S\beta = \lambda\,\gamma^\top D\gamma = \lambda\,\gamma_R^\top D_R\gamma_R, \qquad D_R = \mathrm{diag}(d_1, \dots, d_{m-2}),$$

so $\gamma_F$ is unpenalized. The column space of $X_F$ is exactly the nullspace of $S$, and it can be
shown that $(1, x)$ is a basis for this space. Putting it all together, the model and penalized
likelihood are:

$$\mathbb{E}[Y] = X_F\gamma_F + X_R\gamma_R, \qquad \|Y - X_F\gamma_F - X_R\gamma_R\|^2 + \lambda\,\gamma_R^\top D_R\gamma_R.$$
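A small Python sketch of this split (the function name and tolerance are mine; it assumes $X$ and $S$ are built as in the earlier sketches):

```python
import numpy as np

def nullspace_reparam(X, S, tol=1e-10):
    """Split the penalized spline model into an unpenalized part X_F (the penalty
    nullspace) and a penalized part X_R, via the eigendecomposition S = U D U'."""
    d, U = np.linalg.eigh(S)                     # eigenvalues in ascending order
    null = d < tol * d.max()                     # (numerically) zero eigenvalues
    XF, XR = X @ U[:, null], X @ U[:, ~null]     # unpenalized and penalized columns
    DR = np.diag(d[~null])                       # penalty matrix for the penalized part
    return XF, XR, DR

# For a cubic spline penalty, XF has two columns and spans the same space as (1, x).
```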

So there is a sort of “secret” linear model plus an explicit deviation from linearity. This
form is especially useful when building much more complicated models in which identifiability
of parameters becomes a challenge. One example is an additive model, which is to regression
splines as multiple linear regression is to simple linear regression, i.e. we have multiple
functions $f_1, f_2, \dots$. This is the subject of the next chapter.
