4 Smoothing
2024 Fall
Haoran Zhang
β̂ = (XᵀX)⁻¹ Xᵀ y    (2)
But if p > n, meaning we have more features than samples, which we often call the “high dimensional” (or “overparametrized”) setting, then we are in trouble ... the matrix XᵀX is not invertible, so the expression in (2) isn’t even well-defined
Moreover, the least squares optimization problem (1) does not have a unique solution in this case. Indeed, if β̃ is one solution, then any other vector of the form

β̃ + η,   η ∈ null(X)

also solves (1), where null(X) is the null space of the matrix X:

null(X) = {η ∈ Rᵖ : Xη = 0}
β̃ = (XᵀX)⁺ Xᵀ y    (4)

¹ Technically, this is only true if null(X) ̸⊥ ej, where ej is the jth standard basis vector.
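As a quick numerical check (numpy only; the toy dimensions are my own), the pseudoinverse solution (4) remains well-defined when p > n, even though XᵀX is singular:

```python
import numpy as np

# Toy high-dimensional setting: more features (p) than samples (n)
rng = np.random.default_rng(0)
n, p = 20, 50
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

# X'X is p x p but has rank at most n < p, so it is singular ...
print(np.linalg.matrix_rank(X.T @ X))     # at most 20

# ... yet the pseudoinverse solution (4) is still well-defined
beta_tilde = np.linalg.pinv(X.T @ X) @ X.T @ y

# In this p > n setting it interpolates the training data (zero residuals)
print(np.allclose(X @ beta_tilde, y))     # True, for a generic X like this one
```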
This is being driven by the variance of the predictions from least squares,
which grows out of control as p approaches n
(Aside: what happens with the prediction MSE when p > n? The an-
swer may surprise you. The MSE associated with (4) is actually quite
interesting and in some ways exotic when p > n. Typically we need p to
be much larger than n (away from the p = n barrier) in order for it to be
well-behaved. This has been the topic of a recent flurry of research in
statistics and machine learning ...)
Regularization to the rescue! This will finesse both of the problems de-
scribed above: it leads to nontrivial coefficient estimates, and often leads
to more accurate predictions, by reducing variance
In the regression setting, a general approach for regularization moves us
from (1) to solving:
min_β ∥y − Xβ∥₂² + λ P(β)    (5)
P(β) = ∥β∥₀ = Σ_{j=1}^p 1{βj ≠ 0},

P(β) = ∥β∥₁ = Σ_{j=1}^p |βj|,

P(β) = ∥β∥₂² = Σ_{j=1}^p βj².
These give rise to what we call best subset selection, the lasso, and
ridge regression, respectively
(Aside: calling ∥ · ∥0 the “ℓ0 norm” is a misnomer, as it is not actually a
norm: it does not satisfy positive homogeneity, i.e., ∥aβ∥0 = a∥β∥0 for
all a > 0. It would be more accurate to call it the “ℓ0 pseudonorm”, but
nearly everybody just calls it the “ℓ0 norm”)
Penalties like ℓ1 and ℓ2 only make sense if all the features (columns
of X ) are on the same scale (why?). Thus an important and standard
preprocessing step is to scale each feature (column of X ) so that it has
unit norm
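As a minimal sketch (plain numpy; the helper name scale_columns is mine, not from the lecture), one way to carry out this preprocessing step:

```python
import numpy as np

def scale_columns(X):
    """Scale each column (feature) of X to have unit Euclidean norm.

    Returns the scaled matrix and the norms, so fitted coefficients can be
    mapped back to the original scale afterwards if desired.
    """
    norms = np.linalg.norm(X, axis=0)
    norms[norms == 0] = 1.0            # guard against all-zero columns
    return X / norms, norms

# Example usage with a random design matrix
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))
X_scaled, norms = scale_columns(X)
print(np.linalg.norm(X_scaled, axis=0))  # all ones
```

Centering and scaling to unit standard deviation is an equally common convention; the point is simply that the columns must be comparable before the penalty is applied.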
min_{β ∈ Rᵖ} Σ_{i=1}^n (yi − xiᵀβ)² + λ Σ_{j=1}^p βj²
or equivalently
min_{β ∈ Rᵖ} ∥y − Xβ∥₂² + λ ∥β∥₂²    (6)
β̂ = (XᵀX + λI)⁻¹ Xᵀ y    (7)
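As a sketch (numpy only, with names of my own choosing), (7) can be computed by solving a linear system rather than forming an explicit inverse:

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Ridge coefficients via the closed form (X'X + lam*I)^{-1} X'y, as in (7).

    Solving the linear system is preferred to explicitly inverting X'X + lam*I.
    """
    n, p = X.shape
    A = X.T @ X + lam * np.eye(p)
    return np.linalg.solve(A, X.T @ y)

# Example: works even when p > n, since adding lam*I makes the system nonsingular
rng = np.random.default_rng(1)
X = rng.normal(size=(40, 60))
y = rng.normal(size=40)
beta_hat = ridge_fit(X, y, lam=10.0)
print(beta_hat.shape)   # (60,)
```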
[Figure: Ridge coefficient estimates for lags 0 through 40, plotted against log(lambda).]
min_{β ∈ Rᵖ} Σ_{i=1}^n (yi − xiᵀβ)² + λ Σ_{j=1}^p |βj|

or equivalently

min_{β ∈ Rᵖ} ∥y − Xβ∥₂² + λ ∥β∥₁    (8)
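The lasso has no closed form analogous to (7), but standard solvers are readily available. A small sketch using scikit-learn, if that library is acceptable here (note its Lasso scales the squared error by 1/(2n), so its alpha is a rescaled version of the λ above):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 20))
beta_true = np.zeros(20)
beta_true[:3] = [2.0, -1.5, 1.0]           # only 3 truly active features
y = X @ beta_true + 0.5 * rng.normal(size=100)

model = Lasso(alpha=0.1, fit_intercept=False)
model.fit(X, y)
print(np.sum(model.coef_ != 0))            # many coefficients are exactly zero
```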
The “classic” illustration comparing lasso and ridge, each in constrained form:

[FIGURE 3.11. Estimation picture for the lasso (left) and ridge regression (right). Shown are contours of the error and constraint functions. The solid blue areas are the constraint regions |β1| + |β2| ≤ t and β1² + β2² ≤ t², respectively, while the red ellipses are the contours of the least squares error function.]
[Figure: Lasso coefficient estimates for lags 0 through 40, plotted against lambda.]
That is, define a grid of λ values, fit ridge or lasso estimates for each λ,
let each one make predictions, and select the value that yields the best
CV error.
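A minimal sketch of this recipe for ridge, using numpy plus sklearn's KFold splitter; the toy data, grid, and fold count are arbitrary choices of mine:

```python
import numpy as np
from sklearn.model_selection import KFold

def ridge_fit(X, y, lam):
    """Closed-form ridge coefficients, as in (7)."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def cv_select_lambda(X, y, lambdas, n_splits=5, seed=0):
    """Return the lambda in the grid with the smallest mean CV squared error."""
    # (Generic shuffled folds; time-series-aware splits are another option.)
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    cv_err = np.zeros(len(lambdas))
    for train, test in kf.split(X):
        for k, lam in enumerate(lambdas):
            beta = ridge_fit(X[train], y[train], lam)   # fit on training folds
            resid = y[test] - X[test] @ beta            # predict on held-out fold
            cv_err[k] += np.mean(resid**2) / n_splits
    return lambdas[np.argmin(cv_err)], cv_err

# Example usage
rng = np.random.default_rng(3)
X = rng.normal(size=(80, 15))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(size=80)
lambdas = np.logspace(-2, 3, 30)
best_lam, cv_err = cv_select_lambda(X, y, lambdas)
print(best_lam)
```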
β̄j = (xjᵀ rj) / (xjᵀ xj)
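The quantity β̄j is the univariate least squares coefficient from regressing a residual rj on the single feature xj. Assuming it is meant as the building block of coordinate descent for the lasso (my reading, since the surrounding discussion is not reproduced here), a rough sketch of that algorithm, in which each coordinate update is a soft-thresholded, shrunken version of β̄j, is:

```python
import numpy as np

def soft_threshold(z, t):
    """Soft-thresholding operator S(z, t) = sign(z) * max(|z| - t, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_iters=200):
    """Coordinate descent for min_beta ||y - X beta||_2^2 + lam * ||beta||_1, as in (8)."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = np.sum(X**2, axis=0)          # x_j' x_j for each feature
    r = y - X @ beta                       # full residual
    for _ in range(n_iters):
        for j in range(p):
            r_j = r + X[:, j] * beta[j]    # partial residual, leaving feature j out
            b = X[:, j] @ r_j              # x_j' r_j; note bbar_j = b / col_sq[j]
            # Exact minimizer over beta_j: bbar_j shrunk toward zero, set to 0 when small
            beta[j] = soft_threshold(b, lam / 2.0) / col_sq[j]
            r = r_j - X[:, j] * beta[j]    # refresh the full residual
    return beta

# Example: recovers a sparse coefficient vector
rng = np.random.default_rng(4)
X = rng.normal(size=(100, 20))
beta_true = np.zeros(20); beta_true[:3] = [2.0, -1.5, 1.0]
y = X @ beta_true + 0.5 * rng.normal(size=100)
print(np.round(lasso_cd(X, y, lam=20.0), 2))
```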
We’ll now talk about smoothing, which for us will be a task that is mainly
about retrospective rather than prospective estimation (as in forecasting)
As with regression, most smoothing techniques are not actually specific
to time series data, so our notation will be generic to reflect this. There
are many smoothers out there; we will choose three to discuss that are generic in principle, but are quite popular in (and in part have roots in) the time series context
Recall the signal plus noise model from our earlier lectures,
yi = θi + ϵi , i = 1, . . . , n
θ̂i = Σ_{j=−k}^{k} aj yi−j,   i = 1, …, n    (9)
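For the common special case of equal weights aj = 1/(2k + 1), a quick numpy sketch (the boundary handling, truncating the window near the ends, is just one reasonable convention):

```python
import numpy as np

def moving_average(y, k):
    """Symmetric moving average with equal weights over a window of 2k+1 points.

    Near the boundaries, the average is taken over the available points only.
    """
    y = np.asarray(y, dtype=float)
    n = len(y)
    theta_hat = np.empty(n)
    for i in range(n):
        lo, hi = max(0, i - k), min(n, i + k + 1)
        theta_hat[i] = y[lo:hi].mean()
    return theta_hat

# Example on a noisy sine wave
t = np.linspace(0, 4 * np.pi, 200)
y = np.sin(t) + 0.3 * np.random.default_rng(5).normal(size=200)
smoothed = moving_average(y, k=5)
```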
[Figures: moving average and kernel smoother estimates of the Southern Oscillation Index series, plotted over time.]
min_{θ ∈ Rⁿ} Σ_{i=1}^n (yi − θi)² + λ Σ_{i=1}^{n−2} (θi − 2θi+1 + θi+2)²    (11)
We can see that the penalty is the squared ℓ2 norm of the terms θi − 2θi+1 + θi+2, i = 1, …, n − 2. Each of these is a second difference of θ, centered at a particular index
The HP filter is named after two econometricians who proposed the idea in the early 1980s. But it is worth noting that very similar ideas were around much, much earlier (the idea of penalizing according to squared first differences dates back to 1899, and according to squared third differences dates back to 1923)
θ̂ = (I + λDᵀD)⁻¹ y    (14)
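A compact numpy sketch of (14): build the (n − 2) × n second-difference matrix D and solve the linear system (dense algebra is used purely for clarity; in practice one would exploit the banded structure of I + λDᵀD):

```python
import numpy as np

def hp_filter(y, lam):
    """Hodrick-Prescott filter: solve (I + lam * D'D) theta = y, as in (14).

    D is the (n-2) x n second-difference matrix, so
    (D theta)_i = theta_i - 2*theta_{i+1} + theta_{i+2}.
    """
    y = np.asarray(y, dtype=float)
    n = len(y)
    D = np.diff(np.eye(n), n=2, axis=0)    # (n-2) x n second-difference matrix
    return np.linalg.solve(np.eye(n) + lam * D.T @ D, y)

# Example usage on a noisy quadratic trend
rng = np.random.default_rng(6)
t = np.arange(100)
y = 0.01 * (t - 50) ** 2 + 2 * rng.normal(size=100)
trend = hp_filter(y, lam=1000.0)
```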
[Figure: Hodrick-Prescott filter estimates fit to the winning men's Boston marathon times (from HA), for λ = 10, 100, and 1000.]
min_{θ ∈ Rⁿ} Σ_{i=1}^n (yi − θi)² + λ Σ_{i=1}^{n−2} |θi − 2θi+1 + θi+2|    (15)
Or in other words, the trend filter solution θ̂ will have a piecewise lin-
ear structure, with a kink at each point i such that θ̂i+1 ̸= (θ̂i + θ̂i+2 )/2.
These kink points (also called knot points) are chosen adaptively, based
on the data
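Unlike the HP filter, (15) has no closed-form solution. As a sketch only, here it is handed to the generic convex solver cvxpy (an assumption of mine; specialized algorithms for trend filtering exist and are much faster):

```python
import numpy as np
import cvxpy as cp

def trend_filter(y, lam):
    """Solve min_theta ||y - theta||_2^2 + lam * sum |second differences|, as in (15)."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    D = np.diff(np.eye(n), n=2, axis=0)    # second-difference matrix, as for the HP filter
    theta = cp.Variable(n)
    objective = cp.Minimize(cp.sum_squares(y - theta) + lam * cp.norm1(D @ theta))
    cp.Problem(objective).solve()
    return theta.value

# Example: the solution is piecewise linear, with kinks chosen adaptively
rng = np.random.default_rng(7)
y = np.concatenate([np.linspace(0, 5, 50), np.linspace(5, 2, 50)]) + 0.3 * rng.normal(size=100)
theta_hat = trend_filter(y, lam=10.0)
```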
[Figure: HP filter and trend filter estimates fit to the winning men's Boston marathon times (from HA), each at a few different values of λ; trend filter shown for λ = 21, 218, and 1810.]
to

yi ≈ β0 + Σ_{j=1}^p fj(xij),   i = 1, …, n