Chapter 3: Nonlinear Regression I: Basis Functions and Penalized Regression
In this chapter we learn how to extend the linear regression model to model regression
functions that depend on the inputs in a nonlinear manner. The key idea is to derive a
set of transformed inputs that is very rich and flexible, and then apply ridge regression to
control estimation of the coefficients. For models with a small number of inputs upon which
the regression function depends in an additive manner, splines do exceptionally well at this
task. In later chapters we will see more complicated transformations and models, but the
basic idea of transforming the inputs and then controlling estimation via regularization is
fundamental.
The standard linear regression model is
$$E(Y_i \mid X_i = x_i) = x_i^\top \beta = \sum_{j=1}^{p} x_{ij}\beta_j, \qquad x_i \in \mathbb{R}^p,\ \beta \in \mathbb{R}^p.$$
A linear basis function model models $f(x) = E(Y \mid X = x)$ as a linear combination of known, chosen basis
functions. For simplicity, let $x \in \mathbb{R}$ (univariate):
$$E(Y \mid X = x) = \sum_{j=1}^{m} h_j(x)\beta_j, \qquad \beta = (\beta_1, \dots, \beta_m)^\top.$$
The $h_j$ can be anything, and there can be any number of them. Examples (cf. ESL Ch. 5, p. 140):
- $h_j(x) = x$: simple linear regression;
- $h_j(x) = x^2$: quadratic regression;
- for intervals $I_j = [a_j, a_{j+1})$, $h_j(x) = \mathbb{1}(x \in I_j)$: piecewise constant regression.
We see that a number of clever procedures for predictive modelling can be written as
linear basis function models. This makes studying their estimation of interest.
1. The polynomial basis functions are global: every hj is nonzero for every (nonzero) xi .
This means that a change in any xi will affect the shape of the entire fitted curve.
2. The polynomial basis functions are very flexible when the order is high: when m+1 = n
the polynomial can actually interpolate the data, leading to extremely high variance
in the shape of the fitted curve.
There are two solutions which, when applied together, create an incredibly useful tool for
fitting nonlinear regression functions.
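As a minimal sketch of fitting a linear basis function model by ordinary least squares (the simulated data, cubic truth, and noise level are assumptions for illustration, not taken from these notes), with a polynomial basis whose order can be increased to reproduce the high-variance behaviour described above:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
x = np.sort(rng.uniform(-1.0, 1.0, n))
y = x**3 - x + rng.normal(0.0, 0.2, n)        # assumed cubic truth, as in Figure 3.1

def poly_design(x, m):
    """Design matrix with columns h_j(x) = x^(j-1), j = 1, ..., m."""
    return np.vander(x, N=m, increasing=True)

for m in (2, 4, 11):                           # degrees 1, 3, 10 as in Figure 3.1 (b)
    H = poly_design(x, m)
    beta_hat, *_ = np.linalg.lstsq(H, y, rcond=None)
    rss = float(np.sum((y - H @ beta_hat) ** 2))
    print(f"m = {m:2d} basis functions: RSS = {rss:.3f}")
```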
Define breakpoints (knots) $\tau_0 < \tau_1 < \cdots < \tau_d$. A piecewise polynomial is a function $f$ that is a polynomial on each interval $[\tau_{j-1}, \tau_j)$, $j = 1, \dots, d$. For example, for degree 1,
$$h_j(x) = \mathbb{1}(\tau_{j-1} \le x < \tau_j), \qquad f(x) = a_j + b_j x \ \text{ on } [\tau_{j-1}, \tau_j).$$
[Figure 3.1: Polynomial, piecewise polynomial, and spline regression for a simulated data set with n = 100. The true regression function is cubic. (a) Data and true function f. (b) Degree 1 (red), 3 (purple), and 10 (orange) polynomial regression. (c) Piecewise polynomial regression with 5 knots. (d) B-spline regression with 5 knots.]
This solves the global problem: changes to $x \in [\tau_1, \tau_2)$ simply won't affect the fitted
function in the range $[\tau_2, \tau_3)$ (and so on). Figure 3.1 (c) shows a piecewise polynomial
fit. It not only looks completely unrealistic, but despite being more flexible than a single
cubic polynomial, it doesn’t fit the data as well! The model is too flexible.
We don’t really believe that the regression function should change in a discontinuous
manner at a small number of essentially arbitrary points. We want to use piecewise polyno-
mials for their flexibility and local properties, but retain the continuity properties of a global
polynomial.
A basis that satisfies these requirements is the truncated power function basis:
(See ESL, p. 144.) For order $p$ (degree $p-1$) and knots $\tau_1 < \cdots < \tau_k$:
$$h_j(x) = x^{j-1}, \quad j = 1, \dots, p, \qquad h_{p+\ell}(x) = (x - \tau_\ell)_+^{\,p-1}, \quad \ell = 1, \dots, k,$$
where $(u)_+ = \max(u, 0)$.
1. The number of parameters, m, is equal to the order of the polynomial basis, p, plus
the number of knots, k; m = p + k.
2. $h_{j+p}$ is a $p$th-order polynomial on $(\tau_{j-1}, \tau_j)$.
The number of parameters can also be computed by subtracting the number of continuity
constraints from the number of parameters in the unconstrained piecewise polynomial basis: an order-$p$ piecewise polynomial on the $k+1$ intervals defined by $k$ knots has $(k+1)p$ parameters, and requiring continuity of $f, f', \dots, f^{(p-2)}$ at each of the $k$ knots imposes $k(p-1)$ constraints, leaving
$$(k+1)p - k(p-1) = p + k = m$$
free parameters.
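As a concrete check, with the numbers chosen purely for illustration: cubic splines have order $p = 4$, so with $k = 5$ knots,
$$(k+1)p = 6 \times 4 = 24 \text{ parameters}, \qquad k(p-1) = 5 \times 3 = 15 \text{ constraints}, \qquad 24 - 15 = 9 = p + k = m.$$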
[Figure 3.2: Polynomial (black) and cubic B-spline (purple) interpolants of a set of 10 points from f(x) = sin(2π/x). The polynomial is very "high-energy", while the spline is "calmer".]
The monomial terms in the basis still pose a problem: this basis will exhibit behaviour similar to polynomial regression (Figure 3.1 (b)) when the order is large.
We know what properties we want our basis functions to have: local support, polynomial
within each local neighbourhood, and a number of continuous derivatives at the boundaries
of each local neighbourhood. These constraints define a function space, the space of all
piecewise polynomials on a given knot sequence satisfying these continuity conditions. Rather
than exploring this space manually by guessing its elements and hoping they have desirable
properties, we can construct a basis for it; that basis then gives us our basis functions for
nonlinear regression.
3.2 Splines
A spline is a thin piece of flexible material (e.g. wood) that is held in position at a number
of points. Between the points, physics predicts that the spline will take up the position of
least energy: it will be as “smooth” as possible.
A spline interpolant is a numerical analytic tool which smoothly interpolates a set of
points using piecewise polynomials with knots at each point, constrained to be some number
of times continuously differentiable at each point. See Figure 3.2. Shown are a Lagrange
polynomial interpolant and a spline interpolant. They both interpolate the points shown,
but the polynomial varies wildly between the points, while the spline takes up a sort of path
of least resistance. This is a physical law, and we will prove it as a theorem in Section
3.2.2. Splines will be our go-to tool for nonlinear regression via linear combinations of basis
functions.
3.2.1 B-splines
The "B" is for basis! Fix an order $p \in \mathbb{N}$.
B-spline basis functions are piecewise polynomials constrained to be some number of
times continuously differentiable at a set of knots:
Let $a < \tau_1 < \cdots < \tau_k < b$ be a knot sequence (the precise knot conventions we use are given below).
The recursive definition of B-splines is as follows. Start with the indicator functions:
$$B_{i,1}(x) = \mathbb{1}(\tau_i \le x < \tau_{i+1}).$$
We call $B_{i,1}$ the $i$th B-spline of order 1 (alternatively, degree 0). Now, for $p \ge 2$, let:
$$B_{i,p}(x) = \frac{x - \tau_i}{\tau_{i+p-1} - \tau_i}\,B_{i,p-1}(x) + \frac{\tau_{i+p} - x}{\tau_{i+p} - \tau_{i+1}}\,B_{i+1,p-1}(x),$$
with the convention that any term with a zero denominator is taken to be zero. For example, $B_{3,2}$ is a piecewise-linear "tent" function supported on $[\tau_3, \tau_5)$.
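As a sketch of how the recursion can be evaluated directly (the knot vector below is an assumption for illustration; in practice one would use a library implementation such as scipy.interpolate.BSpline):

```python
import numpy as np

def bspline(x, t, i, p):
    """Evaluate the i-th B-spline of order p (degree p - 1) at x using the
    Cox-de Boor recursion; terms with a zero denominator are taken to be zero."""
    if p == 1:
        return 1.0 if t[i] <= x < t[i + 1] else 0.0
    left = right = 0.0
    if t[i + p - 1] > t[i]:
        left = (x - t[i]) / (t[i + p - 1] - t[i]) * bspline(x, t, i, p - 1)
    if t[i + p] > t[i + 1]:
        right = (t[i + p] - x) / (t[i + p] - t[i + 1]) * bspline(x, t, i + 1, p - 1)
    return left + right

# a cubic (order 4) B-spline on an assumed clamped knot vector
t = np.array([0., 0., 0., 0., 0.25, 0.5, 0.75, 1., 1., 1., 1.])
print([round(bspline(x, t, 3, 4), 3) for x in np.linspace(0, 1, 5, endpoint=False)])
```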
The most common choice of p = 4 gives the cubic B-spline shown in purple in Figure 3.2.
B-splines satisfy the following useful properties:
1. Local support: $B_{i,p}(x) = 0$ if $x \notin [\tau_i, \tau_{i+p})$.
2. Positivity: $B_{i,p}(x) > 0$ for all $x \in [\tau_i, \tau_{i+p})$.
3. Self-normalization: $\sum_{i=1}^{2p+m} B_{i,p}(x) = 1$ for all $x$ in the range spanned by the knots.
This last property is not obvious by simple inspection, and provides a useful sort of scale-invariance when using B-splines in regression modelling.
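A quick numerical check of the self-normalization (partition of unity) property, using SciPy's BSpline with an assumed clamped cubic knot vector:

```python
import numpy as np
from scipy.interpolate import BSpline

k = 3                                                   # degree 3, i.e. order p = 4
t = np.r_[[0.0] * (k + 1), [0.3, 0.5, 0.7], [1.0] * (k + 1)]  # assumed clamped knots
m = len(t) - k - 1                                      # number of basis functions
xs = np.linspace(0.0, 1.0, 9, endpoint=False)
B = np.array([BSpline(t, np.eye(m)[j], k)(xs) for j in range(m)])
print(np.allclose(B.sum(axis=0), 1.0))                  # True: the basis sums to 1
```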
A spline function is a linear combination of B-spline basis functions:
$$f(x) = \sum_{j=1}^{m} B_{j,p}(x)\,\beta_j,$$
where the coefficients $\beta_1, \dots, \beta_m$ determine the shape of the curve.
Notation for the knots varies considerably across different sources. We use the following
conventions:
Let $a < \tau_1 < \cdots < \tau_k < b$, with the boundary knots $a$ and $b$ each repeated $p$ times at the ends of the sequence; the important number is the number of basis functions, $m$.
Finally, a natural cubic spline has the additional constraint that the spline function be
linear outside the boundary knots. This is usually enforced explicitly in implementations.
Consider the penalized least squares problem
$$\mathcal{P}: \quad \min_{f}\ \sum_{i=1}^{n}\{y_i - f(x_i)\}^2 + \lambda \int_a^b \{f''(x)\}^2\,dx,$$
over all twice continuously differentiable functions $f$, with $\lambda > 0$.
An incredible fact is that this problem not only admits a unique solution, but that
solution is finite-dimensional. We prove the following "fundamental theorem" for splines: the minimizer of $\mathcal{P}$ is a natural cubic spline with knots at the observed inputs.

Suppose $n \ge 2$ and $a < x_1 < \cdots < x_n < b$. First, let $g$ be the natural cubic spline that interpolates a set of points $(x_i, y_i)$, i.e. $g(x_i) = y_i$, $i = 1, \dots, n$, and let $\tilde{g}$ be any other twice continuously differentiable interpolant of the same points. Write $h = \tilde{g} - g$. Since $g$ is a natural cubic spline, $g''(a) = g''(b) = 0$ and $g'''$ is constant, say equal to $c_j$, on each interval $(x_j, x_{j+1})$. Integrating by parts,
$$\int_a^b g''(x)\,h''(x)\,dx = -\int_a^b g'''(x)\,h'(x)\,dx = -\sum_{j} c_j\,\{h(x_{j+1}) - h(x_j)\}.$$
But $h(x_j) = \tilde{g}(x_j) - g(x_j) = y_j - y_j = 0$, because $g$ and $\tilde{g}$ both interpolate $(x_i, y_i)$, so
$$\int_a^b g''(x)\,h''(x)\,dx = 0,$$
and therefore
$$\int_a^b \{\tilde{g}''(x)\}^2\,dx = \int_a^b \{g''(x)\}^2\,dx + \int_a^b \{h''(x)\}^2\,dx \ \ge\ \int_a^b \{g''(x)\}^2\,dx.$$

Now suppose, by way of contradiction, that the minimizer $\hat{f}$ of $\mathcal{P}$ is not a spline. Let $g$ be the natural cubic spline that satisfies $g(x_i) = \hat{f}(x_i)$, $i = 1, \dots, n$. Then
$$\sum_{i=1}^{n}\{y_i - \hat{f}(x_i)\}^2 = \sum_{i=1}^{n}\{y_i - g(x_i)\}^2 \quad \text{but} \quad \int_a^b \{g''(x)\}^2\,dx < \int_a^b \{\hat{f}''(x)\}^2\,dx,$$
a contradiction. (The inequality is strict: equality above would force $h'' \equiv 0$, i.e. $h$ linear, and $h(x_i) = 0$ at $n \ge 2$ points would then give $h \equiv 0$, contradicting that $\hat{f}$ is not a spline.) The minimizer is therefore a natural cubic spline, and we can write its vector of fitted values as $X\beta$, where $X_{ij} = B_j(x_i)$ is the spline design matrix.
The choice of knots matters. To build the spline interpolant in Figure 3.2, we used a
knot at each point. But we don’t want to interpolate our training data; we want to estimate
the regression function, which is the mean output at each input.
Figure 3.3 shows a spline with 2, 3, 5, and 20 knots fit by ordinary least squares.
How do you know which of these curves is best in practice, when all you would observe
is the points? We need some way of penalizing the complexity of the model fit.
When too few knots are used, the fitted curve is biased. This cannot be fixed: the model
simply is not flexible enough to capture the true regression function.
When too many knots are used, the fitted curve is too wiggly/rough: it varies wildly,
in an effort to best fit the data. The model is very flexible, and it would be able to fit the
true function if the data had less variance. We can fix this by penalizing roughness in the
estimation.
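To see this trade-off numerically, here is a minimal sketch using SciPy's least-squares spline routine; the simulated data (cubic truth, noise level) and the placement of interior knots at quantiles of $x$ are assumptions made for illustration:

```python
import numpy as np
from scipy.interpolate import LSQUnivariateSpline

rng = np.random.default_rng(1)
n = 100
x = np.sort(rng.uniform(-1.0, 1.0, n))
y = x**3 - x + rng.normal(0.0, 0.2, n)                  # assumed cubic truth

# OLS cubic regression splines with increasing numbers of interior knots
for n_knots in (2, 3, 5, 20):
    interior = np.quantile(x, np.linspace(0, 1, n_knots + 2)[1:-1])
    fit = LSQUnivariateSpline(x, y, interior, k=3)
    rss = float(np.sum((y - fit(x)) ** 2))
    print(f"{n_knots:2d} knots: RSS = {rss:.3f}")       # RSS falls as knots are added
```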
[Figure 3.3: B-spline regression for a nonlinear regression function (black) with 2 (red), 3 (purple), 5 (orange), and 20 (blue) knots.]
Consider the model
$$Y_i \mid X_i = x_i \ \overset{\text{ind}}{\sim}\ N\{\mu(x_i), \sigma^2\}, \qquad \mu(x) = E(Y \mid X = x) = f(x),$$
where $f$ is an unknown function, assumed twice continuously differentiable.
The function f here is an infinite-dimensional parameter, in the precise sense that it
belongs to a vector space with infinite dimension. We need to restrict attention to some
finite-dimensional subspace in order to actually estimate f . Consider the least squares
minimization:
$$\hat{f} = \operatorname*{arg\,min}_{f}\ \sum_{i=1}^{n} \{Y_i - f(x_i)\}^2.$$
We combine these ideas to (finally) arrive at the actual method we’ll use for nonlinear
regression. Regression splines fix f to be a cubic spline with some pre-chosen large number
of knots,
and then choose the spline weights to minimize the penalized least squares objective:
$$\hat{\beta}_\lambda = \operatorname*{arg\,min}_{\beta}\ \sum_{i=1}^{n} \{Y_i - f(x_i)\}^2 + \lambda \int \{f''(x)\}^2\,dx,$$
where $f(x) = \sum_{j=1}^{m} B_j(x)\beta_j$, $\lambda \ge 0$ is the smoothing parameter, and the integral is the roughness penalty.
Write this in vector form: let $Y = (Y_1, \dots, Y_n)^\top$, let $b(x) = (B_1(x), \dots, B_m(x))^\top$, and let $X$ be the spline design matrix with entries $X_{ij} = B_j(x_i)$, i.e. with rows $b(x_i)^\top$. Then
$$\hat{\beta}_\lambda = \operatorname*{arg\,min}_{\beta}\ \|Y - X\beta\|^2 + \lambda \int \{f''(x)\}^2\,dx.$$
So
$$\int \{f''(x)\}^2\,dx = \beta^\top \left[\int b''(x)\,b''(x)^\top\,dx\right] \beta = \beta^\top S \beta, \qquad S_{ij} = \int B_i''(x)\,B_j''(x)\,dx.$$
The penalty matrix is computed by recognizing that the spline basis functions are (piecewise) polynomials, so the integrand is a polynomial, and then applying either interpolation
(Wood, 2017a) or Gaussian quadrature.
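Here is a minimal sketch of the quadrature approach using SciPy's BSpline objects; the knot layout is an assumption for illustration, and a low-order Gauss-Legendre rule on each inter-knot interval is exact because the integrand is a piecewise polynomial of low degree:

```python
import numpy as np
from scipy.interpolate import BSpline

def penalty_matrix(t, k=3):
    """S[i, j] = int B_i''(x) B_j''(x) dx for the degree-k B-spline basis on
    knot vector t, via Gauss-Legendre quadrature on each inter-knot interval."""
    m = len(t) - k - 1
    d2 = [BSpline(t, np.eye(m)[j], k).derivative(2) for j in range(m)]
    nodes, weights = np.polynomial.legendre.leggauss(k)   # exact up to degree 2k - 1
    S = np.zeros((m, m))
    for a, b in zip(t[:-1], t[1:]):
        if b <= a:                                        # skip repeated (zero-length) knots
            continue
        xq = 0.5 * (b - a) * nodes + 0.5 * (a + b)
        wq = 0.5 * (b - a) * weights
        D = np.array([f(xq) for f in d2])                 # m x (number of nodes)
        S += (D * wq) @ D.T
    return S

knots = np.r_[[0.0] * 4, np.linspace(0.2, 0.8, 4), [1.0] * 4]   # assumed cubic knot layout
S = penalty_matrix(knots, k=3)
print(S.shape, np.linalg.matrix_rank(S))                  # rank is m - 2, as discussed below
```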
Let
$$g(\beta) = \|Y - X\beta\|^2 + \lambda\,\beta^\top S\beta.$$
Solve $\nabla_\beta g(\beta) = 0$:
$$\nabla_\beta g(\beta) = -2X^\top(Y - X\beta) + 2\lambda S\beta = 0$$
$$\implies X^\top Y = (X^\top X + \lambda S)\beta$$
$$\implies \hat{\beta}_\lambda = (X^\top X + \lambda S)^{-1} X^\top Y,$$
a generalized ridge regression. Compare to ridge regression:
$$\hat{\beta}_\lambda = \operatorname*{arg\,min}_{\beta}\ \|Y - X\beta\|^2 + \lambda\|\beta\|^2 = (X^\top X + \lambda I)^{-1} X^\top Y.$$
Remark: $X^\top X + \lambda I \succ 0$ whenever $\lambda > 0$, because if $s_1, \dots, s_m \ge 0$ are the singular values of $X$, then $s_j^2 \ge 0$ are the eigenvalues of $X^\top X$, and so $s_j^2 + \lambda > 0$ are the eigenvalues of $X^\top X + \lambda I$.
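Putting the pieces together, here is a minimal sketch of the generalized ridge solution $\hat{\beta}_\lambda = (X^\top X + \lambda S)^{-1}X^\top Y$. It reuses the hypothetical penalty_matrix() from the quadrature sketch above; the simulated data, knot layout, and value of $\lambda$ are all assumptions for illustration:

```python
import numpy as np
from scipy.interpolate import BSpline

rng = np.random.default_rng(2)
n = 200
x = np.sort(rng.uniform(0.0, 1.0, n))
y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.3, n)       # assumed smooth truth

knots = np.r_[[0.0] * 4, np.linspace(0.05, 0.95, 20), [1.0] * 4]
k = 3
m = len(knots) - k - 1
X = np.column_stack([BSpline(knots, np.eye(m)[j], k)(x) for j in range(m)])
S = penalty_matrix(knots, k)                               # from the earlier sketch

lam = 1e-3
beta_hat = np.linalg.solve(X.T @ X + lam * S, X.T @ y)     # (X'X + lam*S)^{-1} X'y
fitted = X @ beta_hat
print(f"RSS = {float(np.sum((y - fitted) ** 2)):.3f}")
```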
The fitted curve is $\hat{f}(x) = b(x)^\top \hat{\beta}_\lambda$, with fitted values $X\hat{\beta}_\lambda$ at the observed inputs. Because $\hat{\beta}_\lambda$ is a linear function of $Y$,
we also obtain the sampling distribution of the estimated spline weights:
$$E(\hat{\beta}_\lambda) = (X^\top X + \lambda S)^{-1} X^\top E(Y) = (X^\top X + \lambda S)^{-1} X^\top X \beta,$$
$$\mathrm{Cov}(\hat{\beta}_\lambda) = (X^\top X + \lambda S)^{-1} X^\top \mathrm{Cov}(Y)\, X (X^\top X + \lambda S)^{-1} = \sigma^2 (X^\top X + \lambda S)^{-1} X^\top X (X^\top X + \lambda S)^{-1},$$
which we use to form estimates of, and confidence intervals for, the fitted curve:
$$\widehat{\mathrm{Var}}\{\hat{f}(x)\} = b(x)^\top\, \widehat{\mathrm{Cov}}(\hat{\beta}_\lambda)\, b(x), \qquad \hat{f}(x) \pm z_{1-\alpha/2}\sqrt{\widehat{\mathrm{Var}}\{\hat{f}(x)\}}.$$
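Continuing the penalized-fit sketch above, pointwise standard errors and a 95% band for the fitted curve can be formed from $\mathrm{Cov}(\hat{\beta}_\lambda)$; estimating $\sigma^2$ by the RSS divided by $n$ minus the trace of the hat matrix (the effective degrees of freedom, defined next) is a common choice but an assumption here, not something specified in the notes:

```python
# depends on X, y, S, lam, fitted, n from the penalized-fit sketch above
A = np.linalg.inv(X.T @ X + lam * S)
H = X @ A @ X.T                                            # hat matrix H_lambda
sigma2_hat = float(np.sum((y - fitted) ** 2) / (n - np.trace(H)))
cov_beta = sigma2_hat * A @ X.T @ X @ A                    # Cov(beta_hat_lambda)
se_fit = np.sqrt(np.einsum("ij,jk,ik->i", X, cov_beta, X)) # sqrt(b(x_i)' Cov b(x_i))
lower, upper = fitted - 1.96 * se_fit, fitted + 1.96 * se_fit
print(float(se_fit.mean()))
```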
The effective degrees of freedom (EDF) of the fit is
$$\mathrm{EDF}(\lambda) = \mathrm{trace}(H_\lambda), \qquad H_\lambda = X(X^\top X + \lambda S)^{-1}X^\top,$$
where the hat matrix $H_\lambda$ satisfies $\hat{Y}_\lambda = H_\lambda Y$. Balancing the EDF with the model fit is also one way to choose $\lambda$.
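As a self-contained toy illustration of how the EDF shrinks from the unpenalized fit towards the dimension of the penalty nullspace as $\lambda$ grows (the random design and the second-difference penalty standing in for $X$ and $S$ are assumptions, not the spline matrices from above):

```python
import numpy as np

rng = np.random.default_rng(3)
n, m = 100, 12
X = rng.normal(size=(n, m))
D = np.diff(np.eye(m), n=2, axis=0)           # second-difference matrix
S = D.T @ D                                   # rank m - 2, like the spline penalty

for lam in (0.0, 0.1, 1.0, 10.0, 1e6):
    H = X @ np.linalg.solve(X.T @ X + lam * S, X.T)
    print(f"lambda = {lam:g}: EDF = {np.trace(H):.2f}")   # falls from m towards 2
```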
The nullspace of the penalty matrix determines which class of functions is unpenalized,
and hence which type of function we are smoothing towards. For ridge, the identity is full rank and so has an empty nullspace, and we smooth towards the zero function. For penalized splines, linear functions are unpenalized, and the penalty matrix has rank $m - 2$, with nullspace corresponding to the constant and linear functions.
The idea of the penalty nullspace allows for reparameterization of the model in a manner
that is convenient for fitting, useful for theoretical analysis, and helpful for understanding
some of the challenges with identifiability involved in moving from regression splines (one f )
to additive models (multiple f1 , f2 , . . .). We have:
The column space of XF is exactly the nullspace of S, and it can be shown that (1, x) is
a basis for this space. Putting it all together, the model and penalized likelihood are:
So there is a sort of “secret” linear model plus an explicit deviation from linearity. This
form is especially useful when building much more complicated models in which identifiability
of parameters becomes a challenge. One example is an additive model, which is to regression
splines as multiple linear regression is to simple linear regression, i.e. we have multiple
functions $f_1, f_2, \dots$. This is the subject of the next chapter.