4 Smoothing


MA309: Time Series Analysis

2024 Fall

Haoran Zhang



Contents

1 High dimensional regression
2 Regularization
3 Smoothing



As in the regression lecture, let's suppose that we seek β ∈ R^p such that
for given samples x_i ∈ R^p and y_i ∈ R (predictor and response pairs),
i = 1, . . . , n,

    y_i ≈ x_i^T β,  i = 1, . . . , n

Equivalently, we can write y ∈ R^n for the response vector (with ith
component y_i) and X ∈ R^{n×p} for the feature matrix (with ith row x_i), and
say that we are seeking β such that y ≈ Xβ
Recall, the least squares estimates of the coefficients are given by
solving

    \min_{\beta \in \mathbb{R}^p} \sum_{i=1}^n (y_i - x_i^T \beta)^2 \;\Longleftrightarrow\; \min_{\beta \in \mathbb{R}^p} \|y - X\beta\|_2^2    (1)

where we use ∥ · ∥_2 for the Euclidean or ℓ_2 norm of a vector, defined for
a ∈ R^d as \|a\|_2^2 = \sum_{i=1}^d a_i^2



If p ≤ n and rank(X ) = p (here rank(X ) denotes the rank of the matrix
X ), then this produces the unique solution

β̂ = (X T X )−1 X T y (2)

But if p > n, meaning we have more features than samples (often called
the "high dimensional" or "overparametrized" setting), then we are in
trouble ... the matrix X^T X cannot be invertible, so the expression in (2)
isn't even well-defined
Moreover, the least squares optimization problem (1) does not have a
unique solution in this case. Indeed, if β̃ is one solution, then any other
vector of the form

β̂ = β̃ + η, where η ∈ null(X ) (3)

also solves (1), where null(X ) is the null space of the matrix X :

null(X ) = {η ∈ Rp : X η = 0}



When rank(X ) < p, the null space null(X ) is nontrivial, and since it is
a linear space: η ∈ null(X ) =⇒ cη ∈ null(X ) for any c ∈ R, we see
that from one least squares solution β̃, we can generate infinitely many
others in (3)
Furthermore, we can always take the “one least squares solution” to be:

β̃ = (X T X )+ X T y (4)

Here A^+ denotes the generalized inverse (also called the Moore-Penrose
pseudoinverse) of a matrix A. If you don't know what that is, then it
doesn't really matter for this lecture, but you can think of it precisely as
follows: among all solutions in (1), the solution in (4) is the unique one
having the smallest ℓ2 norm ∥β̃∥2
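
To make the minimum-norm solution and the null-space non-uniqueness concrete, here is a small numpy sketch (not from the slides; the dimensions and random data are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 50                          # p > n: the high dimensional setting
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

# Minimum-l2-norm least squares solution, as in (4): pinv(X) @ y = (X^T X)^+ X^T y
beta_min = np.linalg.pinv(X) @ y

# A direction in null(X): rows of V^T beyond rank(X) span the null space
_, _, Vt = np.linalg.svd(X)
eta = Vt[-1]
beta_other = beta_min + 5.0 * eta      # another least squares solution, as in (3)

print(np.allclose(X @ beta_min, X @ beta_other))             # True: same fitted values
print(np.linalg.norm(beta_min) < np.linalg.norm(beta_other)) # True: (4) has smallest norm
```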



There are actually two distinct troubles:
The first trouble involves the interpretation of the coefficients themselves.
Once p > n, and we find any least squares solution β̃ with β̃_j > 0 for
some j, then we can always find[1] another solution β̂ of the form (3) with
β̂_j < 0. Thus we cannot even consistently interpret the sign of any
estimated coefficient (let alone its magnitude)
The second trouble involves prediction. If p < n and ŷ_new = x_new^T β̂ is the
least squares prediction at a new predictor value x_new, whose associated
response is y_new, then under fairly standard conditions for regression
theory, the prediction MSE behaves as:

    \mathbb{E}\big[(y_{\rm new} - \hat y_{\rm new})^2\big] \approx \sigma^2 \, \frac{p}{n-p}

for large n and p, where σ^2 is the error variance.

This explodes as p approaches n!

[1] Technically, this is only true if null(X) ̸⊥ e_j, where e_j is the jth standard
basis vector.
This is being driven by the variance of the predictions from least squares,
which grows out of control as p approaches n
(Aside: what happens with the prediction MSE when p > n? The an-
swer may surprise you. The MSE associated with (4) is actually quite
interesting and in some ways exotic when p > n. Typically we need p to
be much larger than n (away from the p = n barrier) in order for it to be
well-behaved. This has been the topic of a recent flurry of research in
statistics and machine learning ...)



Regularization

Regularization to the rescue! This will finesse both of the problems de-
scribed above: it leads to nontrivial coefficient estimates, and often leads
to more accurate predictions, by reducing variance
In the regression setting, a general approach for regularization moves us
from (1) to solving:
    \min_{\beta} \; \|y - X\beta\|_2^2 + \lambda P(\beta)    (5)

Here P : R^p → R_+ is a penalty function, and λ ≥ 0 is a tuning parameter
(or regularization parameter), trading off the importance of the squared
loss and penalty (first and second terms in (5))
In other words, the larger the value of λ, the more weight we put on
penalizing large values of P(β), which results in estimates that we call
more “regularized”



Arguably the three canonical choices for penalties are based on the ℓ0 ,
ℓ1 , and ℓ2 norms:
    P(\beta) = \|\beta\|_0 = \sum_{j=1}^p 1\{\beta_j \neq 0\},

    P(\beta) = \|\beta\|_1 = \sum_{j=1}^p |\beta_j|,

    P(\beta) = \|\beta\|_2^2 = \sum_{j=1}^p \beta_j^2.

These give rise to what we call best subset selection, the lasso, and
ridge regression, respectively
(Aside: calling ∥ · ∥0 the “ℓ0 norm” is a misnomer, as it is not actually a
norm: it does not satisfy positive homogeneity, i.e., ∥aβ∥0 = a∥β∥0 for
all a > 0. It would be more accurate to call it the “ℓ0 pseudonorm”, but
nearly everybody just calls it the “ℓ0 norm”)



Critically, ∥ · ∥0 is not convex, while ∥ · ∥1 and ∥ · ∥2 are convex (note
that any norm is a convex function). This makes best subset selection
a nonconvex problem, and one that is generally very hard to solve in
practice except for small p.
Meanwhile, the lasso and ridge regression are both defined by convex
optimization problems, and are generally much easier to compute. We
won’t focus on best subset selection further in this lecture.
An important note: when there is an explicit intercept β0 in the model, we
typically do not want to penalize it, and so we modify the penalty P(β)
so that it excludes β0 . This will be implicit in everything below.
If we center y and each column of X, then β̂_0 = 0 regardless of P(β).

Penalties like ℓ1 and ℓ2 only make sense if all the features (columns
of X ) are on the same scale (why?). Thus an important and standard
preprocessing step is to scale each feature (column of X ) so that it has
unit norm



Ridge

The ridge estimates of the regression coefficients are given by solving

    \min_{\beta \in \mathbb{R}^p} \; \sum_{i=1}^n (y_i - x_i^T \beta)^2 + \lambda \sum_{j=1}^p \beta_j^2

or equivalently

    \min_{\beta \in \mathbb{R}^p} \; \|y - X\beta\|_2^2 + \lambda \|\beta\|_2^2    (6)

The problem in (6) has a closed-form solution, verified by differentiating
the criterion and setting the result equal to zero,

    \hat\beta = (X^T X + \lambda I)^{-1} X^T y    (7)

where I is the p × p identity matrix. This solution always exists (the matrix
X^T X + λI is always invertible for any λ > 0), regardless of the relative sizes of n, p
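
As a concrete illustration, here is a minimal numpy sketch of the closed-form ridge solution (7); the simulated data are an assumption for illustration, and in practice you would typically use a package such as glmnet:

```python
import numpy as np

def ridge(X, y, lam):
    # Ridge estimate (7): solve (X^T X + lam * I) beta = X^T y
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(1)
n, p = 20, 50                      # works even when p > n
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)
beta_hat = ridge(X, y, lam=1.0)
print(beta_hat.shape)              # (50,)
```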



The figure below gives an example of the ridge regression coefficient
estimates as a function of λ, fit on the cardiovascular mortality regression
data, with many features derived from lags of particulate levels

[Figure "Ridge": ridge coefficient estimates (Lag 0 through Lag 40) plotted against
log(lambda)]

Ridge estimate on the cardiovascular mortality regression problem, with lags of
particulate levels as features. (The intercept is estimated without a penalty, and
omitted from the plots for visualization purposes.)



Lasso

The lasso estimates of the regression coefficients are given by solving

    \min_{\beta \in \mathbb{R}^p} \; \sum_{i=1}^n (y_i - x_i^T \beta)^2 + \lambda \sum_{j=1}^p |\beta_j|

or equivalently

    \min_{\beta \in \mathbb{R}^p} \; \|y - X\beta\|_2^2 + \lambda \|\beta\|_1    (8)

A key difference is that the lasso estimates of the regression coefficients
are sparse. In other words, solving the lasso problem results in a vector
β̂ with many components exactly equal to zero, and a larger choice of λ
will result in more zeros
This doesn’t happen with ridge regression, whose coefficient estimates
are generically dense.
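
To see this sparsity in practice, here is a hedged sketch using scikit-learn's Lasso (a tooling assumption, not the course's code; scikit-learn scales the squared loss by 1/(2n), so its alpha matches the λ in (8) only up to a constant factor):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
n, p = 100, 40
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:3] = [2.0, -1.5, 1.0]                 # only 3 truly relevant features
y = X @ beta_true + 0.5 * rng.standard_normal(n)

for alpha in [0.01, 0.1, 1.0]:                   # alpha plays the role of lambda
    beta_hat = Lasso(alpha=alpha, max_iter=10000).fit(X, y).coef_
    print(alpha, int(np.sum(beta_hat != 0)))     # larger alpha -> fewer nonzero coefficients
```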



The "classic" illustration comparing lasso and ridge, each in constrained form:

[Figure (ESL Figure 3.11): estimation picture for the lasso (left) and ridge regression
(right). Shown are contours of the error and constraint functions. The solid blue
areas are the constraint regions |β_1| + |β_2| ≤ t and β_1^2 + β_2^2 ≤ t^2, respectively,
while the red ellipses are the contours of the least squares error function.]



See figure below for an example of the lasso coefficients on the cardio-
vascular mortality data. We can see that as λ grows, it sets many of the
coefficients of lagged features to zero

[Figure "Lasso": lasso coefficient estimates (Lag 0 through Lag 40) plotted against
lambda]

Lasso estimate on the cardiovascular mortality regression problem, with lags of
particulate levels as features.



In general, this sparsity-inducing behavior of the ℓ1 penalty allows the
lasso to perform variable selection in the working linear model. By zero-
ing out some coefficients exactly, it discards some features from having
any predictive influence in the fitted model. Many people like sparsity
because it leads to better interpretability
We should be clear that the lasso is not “better” than ridge in any general
sense, and neither is ridge “better” than the lasso. They each can help
tremendously with stabilizing coefficient estimates so as to lead to im-
proved predictive accuracy. They each do so by regularizing in different
ways
The most basic question to ask: is a sparse linear model likely to be a
good (or desirable) approximation to the true regression function? If so,
then the lasso can outperform ridge (or be preferable to it). On the other
hand, in problems where there are many underlying features that are
relevant for prediction, ridge can outperform the lasso
(And oftentimes people combine the two penalties, which gives rise to
the elastic net)



In time series regression, in the vein of examples we studied in the last
lecture, this would allow us to include many lags of a feature of interest,
or a few lags of many external covariates, and so on, and then apply a
ridge or lasso penalty
Then, you might wonder: how would we select λ? In fact, you already
know the answer (for problems with a predictive focus): use time series
cross-validation!

That is, define a grid of λ values, fit ridge or lasso estimates for each λ,
let each one make predictions, and select the value that yields the best
CV error.
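
A rough sketch of what this could look like, assuming a one-step-ahead (expanding window) scheme and scikit-learn's Ridge for the fits; both the scheme and the tooling are illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import Ridge

def ts_cv_lambda(X, y, lambdas, min_train=30):
    # One-step-ahead time series CV: train on the past, predict the next point
    errors = {lam: [] for lam in lambdas}
    for t in range(min_train, len(y)):
        for lam in lambdas:
            model = Ridge(alpha=lam).fit(X[:t], y[:t])
            pred = model.predict(X[t:t + 1])[0]
            errors[lam].append((y[t] - pred) ** 2)
    return min(lambdas, key=lambda lam: np.mean(errors[lam]))

rng = np.random.default_rng(3)
X = rng.standard_normal((200, 10))
y = X[:, 0] + 0.3 * rng.standard_normal(200)
print(ts_cv_lambda(X, y, lambdas=[0.1, 1.0, 10.0]))
```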



Coordinate descent for the lasso
Here is an algorithm used to compute lasso estimates in practice, called
coordinate descent. In fact, this is what is used by the glmnet package
internally. We repeat, cycling over j = 1, 2, . . . , p until convergence (until
the coefficient estimates do not really change anymore), the following
steps (with xj denoting the j th column of X ):
Compute the jth partial residual: r_j = y - \sum_{\ell \neq j} x_\ell \hat\beta_\ell
Compute the univariate least squares coefficient of r_j on x_j:

    \bar\beta_j = \frac{x_j^T r_j}{x_j^T x_j}

Update the jth lasso coefficient by "soft-thresholding" the above:

    \hat\beta_j =
    \begin{cases}
    \bar\beta_j + \lambda/(2\|x_j\|_2^2) & \text{if } \bar\beta_j < -\lambda/(2\|x_j\|_2^2) \\
    \bar\beta_j - \lambda/(2\|x_j\|_2^2) & \text{if } \bar\beta_j > \lambda/(2\|x_j\|_2^2) \\
    0 & \text{otherwise}
    \end{cases}

where recall λ is the lasso regularization parameter
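
Here is a minimal numpy sketch of that coordinate descent loop; it is not the glmnet implementation, just a direct transcription of the three steps above, using the soft-threshold λ/(2∥x_j∥₂²) implied by objective (8):

```python
import numpy as np

def soft_threshold(b, t):
    # Shrink b toward zero by t, setting it exactly to zero when |b| <= t
    return np.sign(b) * max(abs(b) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    # Coordinate descent for min ||y - X beta||_2^2 + lam * ||beta||_1
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            r_j = y - X @ beta + X[:, j] * beta[j]       # j-th partial residual
            bbar = X[:, j] @ r_j / (X[:, j] @ X[:, j])   # univariate LS coefficient
            beta[j] = soft_threshold(bbar, lam / (2 * X[:, j] @ X[:, j]))
    return beta

rng = np.random.default_rng(4)
X = rng.standard_normal((100, 20))
y = X[:, 0] - 2 * X[:, 1] + 0.1 * rng.standard_normal(100)
print(np.round(lasso_cd(X, y, lam=5.0), 2))              # most entries exactly 0
```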



Smoothing

We’ll now talk about smoothing, which for us will be a task that is mainly
about retrospective rather than prospective estimation (as in forecasting)
As with regression, most smoothing techniques are not actually specific
to time series data, so our notation will be generic to reflect this. There
are many smoothers out there, and we will choose three to discuss
that are generic in principle, but are quite popular in (and, in part,
have roots in) the time series context
Recall the signal plus noise model from our earlier lectures,

yi = θi + ϵi , i = 1, . . . , n

where ϵ_i, i = 1, . . . , n are white noise errors, and θ_i, i = 1, . . . , n denote
the unknown means that we want to estimate, assumed to be smoothly
varying across the index i.
Smoothing helps us to extract the trend θi .



Two broad classes of smoothers of interest are as follows. The first are
linear filters, which are (weighted) local averages, of the form

    \hat\theta_i = \sum_{j=-k}^{k} a_j y_{i-j},  \quad i = 1, . . . , n    (9)

for some weights a_j, j = −k, . . . , k


The second are penalized least squares smoothers, which are given by
solving

    \min_{\theta \in \mathbb{R}^n} \; \sum_{i=1}^n (y_i - \theta_i)^2 + \lambda P(\theta)    (10)

for some penalty function P : R^n → R_+ and tuning parameter λ ≥ 0


You should be able to clearly see the connection between smoothing
and regularization, with (10) being a special case of (5) (when X = I,
the identity matrix). Note that the penalties that are common and
useful in smoothing, as we'll see below, are more complex than simple
ℓ1 or ℓ2 penalties



Linear filters
A moving average (MA), either of centered or trailing type, is one of the
basic and widely-used examples of a linear filter, of the form (9). For
a window of (odd) length m = 2k + 1, a centered MA smoother uses
weights
    a_{-k} = \cdots = a_0 = \cdots = a_k = \frac{1}{m}

For a window of (odd or even) length m = k + 1, a trailing MA smoother
uses weights

    a_{-k} = \cdots = a_{-1} = 0 \quad\text{and}\quad a_0 = \cdots = a_k = \frac{1}{m}
(For either, centered or trailing MA smoothers, or really linear filters in
general, there will be annoying boundary issues to pay attention to ...
some software packages may just pad the response vector with 0s in
order to deal with them; but a more principled approach is probably to
renormalize the weights at the boundaries so that the filter is averaging
over the “right” number of observations; and the safest approach is to
just omit estimates at the boundaries, i.e., report NAs)
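
For concreteness, minimal numpy sketches of the centered and trailing MA smoothers, taking the "report NAs at the boundaries" option described above:

```python
import numpy as np

def centered_ma(y, k):
    # Centered moving average with window m = 2k + 1; NaN at the boundaries
    out = np.full(len(y), np.nan)
    for i in range(k, len(y) - k):
        out[i] = np.mean(y[i - k : i + k + 1])
    return out

def trailing_ma(y, m):
    # Trailing moving average over the current and previous m - 1 points
    out = np.full(len(y), np.nan)
    for i in range(m - 1, len(y)):
        out[i] = np.mean(y[i - m + 1 : i + 1])
    return out
```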



The following figure displays an example of centered MA smoothing on
the Southern Oscillation Index (SOI), which measures air pressure in the
Central Pacific Ocean. We can see that the results look fairly accurate
(they track the known cycles which occur every 3-7 years, due to the El
Nino effect), but the estimates look a bit “choppy”

[Figure "Moving average": moving average estimate overlaid on the Southern
Oscillation Index, plotted against Time (1950-1980s)]

Moving average estimate fit to the Southern Oscillation Index (from SS).



We can get smoother-looking estimates by using a smoother weight
sequence (with larger k). This is precisely what kernel smoothing does: it
uses k = n and

    a_j = \frac{K(j/b)}{\sum_{i=1}^n K(i/b)},  \quad j = 1, . . . , n

for a particular kernel function K : R → R and bandwidth b. Think of
the bandwidth as a tuning parameter, just like the regularization level λ
in ridge or lasso
A common choice of kernel is the Gaussian kernel, K(u) = e^{-u^2/2}. The
figure (next slide) gives an example of Gaussian kernel smoothing on the
SOI data again. We can see that the estimates look more smooth
(Note that a centered MA is actually kernel smoothing with a "boxcar"
kernel: K(u) = 1{|u| ≤ b})
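
A small numpy sketch of a Gaussian kernel smoother; here the weights are recomputed around each target index i (so they decay with the distance |j − i|), which is how the generic weight formula above is applied in practice:

```python
import numpy as np

def kernel_smooth(y, b):
    # Gaussian kernel smoother: weighted average of all points, weights decaying with distance
    n = len(y)
    idx = np.arange(n)
    out = np.empty(n)
    for i in range(n):
        w = np.exp(-((idx - i) / b) ** 2 / 2.0)   # K(u) = exp(-u^2 / 2) at u = (j - i) / b
        out[i] = np.sum(w * y) / np.sum(w)
    return out
```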



[Figure, two panels on the Southern Oscillation Index vs. Time (1950-1980s):
"Moving average" (top) and "Kernel smoother" (bottom)]



Hodrick-Prescott filter

The Hodrick-Prescott filter, or simply HP filter, is a penalized estimator
of the form (10), defined by

    \min_{\theta \in \mathbb{R}^n} \; \sum_{i=1}^n (y_i - \theta_i)^2 + \lambda \sum_{i=1}^{n-2} (\theta_i - 2\theta_{i+1} + \theta_{i+2})^2    (11)

We can see that the penalty is the squared ℓ2 norm of the terms θ_i − 2θ_{i+1} +
θ_{i+2}, i = 1, . . . , n − 2. Each of these is a second difference of θ, centered
at a particular index

The HP filter is named after two econometricians who proposed the idea
in the early 1980s. But it is worth noting that very similar ideas were
around much, much earlier (the idea of penalizing according to squared
first differences dates back to 1899, and according to squared third
differences dates back to 1923)



Like the ridge problem, which is similar in spirit (but aimed at solving a
different problem), the HP filter problem (11) has an exact solution. First
define the second difference matrix D ∈ R(n−2)×n by
 
    D = \begin{bmatrix}
    1 & -2 & 1 & 0 & 0 & \cdots & 0 & 0 & 0 \\
    0 & 1 & -2 & 1 & 0 & \cdots & 0 & 0 & 0 \\
    0 & 0 & 1 & -2 & 1 & \cdots & 0 & 0 & 0 \\
    \vdots & & & & & \ddots & & & \vdots \\
    0 & 0 & 0 & 0 & 0 & \cdots & 1 & -2 & 1
    \end{bmatrix}    (12)

and then rewrite the HP filter optimization problem (11) as

    \min_{\theta \in \mathbb{R}^n} \; \|y - \theta\|_2^2 + \lambda \|D\theta\|_2^2    (13)

The HP filter solution can be derived by differentiating the criterion in
(13), setting it equal to zero, and solving, which yields

    \hat\theta = (I + \lambda D^T D)^{-1} y    (14)
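
A minimal sketch of (14) using scipy's sparse routines (a tooling assumption); building D sparsely keeps the solve cheap since D is banded:

```python
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import spsolve

def hp_filter(y, lam):
    # HP filter (14): solve (I + lam * D^T D) theta = y
    n = len(y)
    # Second difference matrix D of shape (n-2) x n, as in (12)
    D = sparse.diags([1.0, -2.0, 1.0], offsets=[0, 1, 2], shape=(n - 2, n), format="csc")
    A = sparse.identity(n, format="csc") + lam * (D.T @ D)
    return spsolve(A, y)

t = np.linspace(0, 4 * np.pi, 300)
y = np.sin(t) + 0.3 * np.random.default_rng(5).standard_normal(300)
trend = hp_filter(y, lam=1000.0)
```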



The following figure shows an example of the HP filter fit to the winning
men’s times from the Boston marathon, at a few different values of the
regularization level λ. (What happens to the HP filter as λ → ∞?)

[Figure "Hodrick-Prescott filter": HP filter estimates at lambda = 10, 100, and 1000,
overlaid on winning Time vs. Year (1940-2020)]

HP filter estimates fit to the winning men's Boston marathon times (from HA),
at a few different values of λ.



The HP filter has a computational advantage over a linear filter with
smooth weights, such as the Gaussian kernel smoother (the HP filter
can be computed in linear-time because D is banded). In terms of their
qualitative behavior, the HP filter and a Gaussian kernel smoother are
pretty similar
In fact, though it is not at all obvious, the HP filter admits an equivalent form
as a kernel smoother, for a special implicit kernel
The HP filter is worth knowing about because it is quite a popular tool in
macroeconomics, where it is primarily used for time series decomposi-
tion: it is used to estimate a trend component in a time series, and then
the residuals from this are used to estimate a cyclic component. How-
ever, recently, some econometricians have criticized this practice, on the
basis that it can induce spurious correlations in the residuals, along with
other concerns.

Another reason the HP filter is worth learning about is that it provides us a
natural bridge to another method that we'll cover next, which acts
completely differently from any linear filter



Trend filter

The ℓ1 trend filter, or simply trend filter, is another penalized estimator of
the form (10), defined by

    \min_{\theta \in \mathbb{R}^n} \; \sum_{i=1}^n (y_i - \theta_i)^2 + \lambda \sum_{i=1}^{n-2} |\theta_i - 2\theta_{i+1} + \theta_{i+2}|    (15)

We can see that the penalty is the ℓ1 norm of second differences of θ.
Thus, compared to the HP filter, the only difference is in the use of an ℓ1
norm penalty, rather than a squared ℓ2 norm penalty
Just as in (13), we can rewrite the trend filter (15) more compactly as

    \min_{\theta \in \mathbb{R}^n} \; \|y - \theta\|_2^2 + \lambda \|D\theta\|_1    (16)

where D ∈ R^{(n−2)×n} is the second difference matrix in (12)
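
Unlike the HP filter there is no closed-form solution. Below is a small sketch using the cvxpy modeling library (a tooling assumption; specialized path and ADMM algorithms are much faster in practice):

```python
import numpy as np
import cvxpy as cp

def trend_filter(y, lam):
    # l1 trend filter (16), solved as a generic convex program
    n = len(y)
    D = np.diff(np.eye(n), n=2, axis=0)   # (n-2) x n second difference matrix, as in (12)
    theta = cp.Variable(n)
    objective = cp.Minimize(cp.sum_squares(y - theta) + lam * cp.norm1(D @ theta))
    cp.Problem(objective).solve()
    return theta.value
```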



Swapping the squared ℓ2 penalty with an ℓ1 penalty on second differ-
ences results in a very different behavior in the estimator. Think of the
analogy of ridge versus the lasso. Trend filtering yields estimates that
are sparse in second differences: if we denote by θ̂ the solution vector
in (16), then many components of D θ̂ will be exactly equal to zero, and a
larger choice of λ will generally result in more zeros
Since

    \hat\theta_i - 2\hat\theta_{i+1} + \hat\theta_{i+2} = 0 \;\Longleftrightarrow\; \hat\theta_{i+1} = \frac{\hat\theta_i + \hat\theta_{i+2}}{2}

this means that many components θ̂i will lie on the line defined by their
neighbors

Or in other words, the trend filter solution θ̂ will have a piecewise lin-
ear structure, with a kink at each point i such that θ̂i+1 ̸= (θ̂i + θ̂i+2 )/2.
These kink points (also called knot points) are chosen adaptively, based
on the data



The figure shows an example of the trend filter on Boston marathon data
again. The piecewise linear structure, and qualitative difference to the
HP filter, is very clear. (What is the behavior of trend filtering as λ → ∞?)

[Figure "Trend filter": trend filter estimates at lambda = 21, 218, and 1810,
overlaid on winning Time vs. Year (1940-2020)]

HP filter and trend filter estimates fit to the winning men’s Boston marathon times
(from HA), each at a few different values of λ.



So ... how do you choose the regularization level λ in the trend filter or
HP filter, or bandwidth b in a kernel smoother?
Cross-validation! This is actually closer to traditional CV, rather than
time series CV (since we are not in a forecasting setting, where we are
predicting the future), but just more structured in how we form the folds
To tune any one of the smoothers described above, we can define (say)
5 folds using every 5th point in the sequence. That is, we leave out
every 5th point, fit the smoother on the remaining points, over a grid of
tuning parameter values, and calculate the squared error on the held-out
points. To form an estimate at each held-out point, we can use linear
interpolation of the estimates at its neighbors.
Doing this 4 more times, where each time we shift the indices of the
held-out points forward by one, we will be able to compute the average
error over all points, and use this to select the tuning parameter value.
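
A sketch of this structured CV scheme, assuming a smoother with signature smoother(y, lam), such as the hp_filter sketch above; the fold construction and linear interpolation follow the description just given:

```python
import numpy as np

def cv_smoother(y, lambdas, smoother, n_folds=5):
    # Leave-every-5th-point-out CV; held-out fits obtained by linear interpolation
    n = len(y)
    idx = np.arange(n)
    errors = np.zeros(len(lambdas))
    for fold in range(n_folds):
        held = idx[fold::n_folds]                        # every n_folds-th point, shifted
        kept = np.setdiff1d(idx, held)
        for l, lam in enumerate(lambdas):
            fit_kept = smoother(y[kept], lam)            # fit on the remaining points
            fit_held = np.interp(held, kept, fit_kept)   # interpolate at held-out points
            errors[l] += np.sum((y[held] - fit_held) ** 2)
    return lambdas[int(np.argmin(errors))]

# e.g., best_lam = cv_smoother(y, lambdas=[10.0, 100.0, 1000.0], smoother=hp_filter)
```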
(An important side note: trend filtering can be extended to model a piece-
wise polynomial of an arbitrary degree, not just linear; as in the above,
the knots in this piecewise polynomial will be chosen adaptively based
on the data. So it can generate smoother-looking fits, like the HP filter or
kernel smoothing, but the key difference is that it has a property called
local adaptivity, which comes from its ability to select knots based on the
data.)


Additive models
To gear up to very briefly discuss additive models, we have to first make
clear the following point: each one of the smoothers described in the
last section can be extended to a problem where the input points in the
underlying smoothing problem are arbitrary: xi , i = 1, . . . , n (rather than
simply xi = i, as would be the case in standard time series)
Now let's introduce the additive model. This extends the basic
multivariate linear regression model:

    y_i \approx \beta_0 + \sum_{j=1}^p \beta_j x_{ij},  \quad i = 1, . . . , n

to

    y_i \approx \beta_0 + \sum_{j=1}^p f_j(x_{ij}),  \quad i = 1, . . . , n

where each fj is a smooth transformation to be estimated


In other words, we have moved from estimating linear transformations of
the features to nonlinear ones, which we fit using a smoother (traditional
packages use kernel smoothing, or spline smoothing, which is similar to
using the HP filter)
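
One classical way the f_j are estimated is backfitting: cycle over the features, smoothing the partial residuals against each feature in turn. A minimal numpy sketch, using a crude nearest-neighbor running mean purely for illustration (real packages use kernel or spline smoothers):

```python
import numpy as np

def smooth_1d(x, r, k=10):
    # Crude smoother: running mean of residuals r over the 2k+1 nearest x-values
    order = np.argsort(x)
    fitted = np.empty(len(x))
    for rank, i in enumerate(order):
        nbrs = order[max(0, rank - k) : rank + k + 1]
        fitted[i] = r[nbrs].mean()
    return fitted

def backfit(X, y, n_iter=20):
    # Backfitting for the additive model y ~ beta0 + sum_j f_j(x_j)
    n, p = X.shape
    beta0 = y.mean()
    f = np.zeros((n, p))
    for _ in range(n_iter):
        for j in range(p):
            r = y - beta0 - f.sum(axis=1) + f[:, j]   # partial residual excluding f_j
            f[:, j] = smooth_1d(X[:, j], r)
            f[:, j] -= f[:, j].mean()                 # center each f_j for identifiability
    return beta0, f
```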
