Adv Stat Inf
Table of contents

Preface

2 M-estimators
2.1 Robustness
2.2 Univariate parametric models
2.2.1 Univariate location model
2.2.2 Scale model
2.3 The sensitivity curve
2.4 The finite-sample breakdown value
2.5 M-estimators of univariate location
2.6 M-estimators of scale
2.7 M-estimators of univariate location with unknown scale
2.8 Computation of M-estimators
2.8.1 Computing an M-estimator of location
2.8.2 Computing an M-estimator of scale
2.8.3 Computing simultaneous M-estimates
2.9 Other robust estimators of univariate location and scale
2.10 Functional definition of M-estimators
2.11 Consistency of M-estimators
2.12 Asymptotic normality of M-estimators
2.13 The influence function
2.14 The asymptotic breakdown value
2.15 The maxbias curve

3 Regression M-estimators
3.1 Review of the LS method
3.2 M-estimators of regression
3.3 Computing M-estimators of regression
3.4 Robustness properties
3.4.1 Finite-sample breakdown value
3.4.2 Influence function
3.5 Asymptotic distribution of regression M-estimators

5.2.2 Regression Support Vector Machines

A Background
A.1 Basics of measure and probability theory
A.1.1 The abstract integral
A.1.2 Applications to probability theory
A.2 Elements of Functional Analysis
A.2.1 The Banach L^p(Ω) spaces
A.2.2 The Hilbert space L^2([0, 1])
A.2.3 Linear operators on L^2([0, 1])
A.3 Spline functions

Bibliography
Preface
This course discusses several methods for modern statistical inference. These mod-
ern inference methods are more flexible than classical inference techniques such as
least squares or maximum likelihood in the sense that they relax the assumptions that
need to be satisfied to obtain reliable results and/or can handle data analysis problems
that cannot be solved by the classical techniques. This course is the result of a joint ef-
fort of the colleagues of the Section of Statistics and Data Science at the Department of
Mathematics.
For practical computations and applications, the software package R (R Core Team,
2013) is used in this course. R is freely available at http://www.r-project.org. First-time users of R should download and install R following the instructions on this website.
An easy user interface for the R statistical computing environment is RStudio which
can be freely downloaded from http://www.rstudio.com/.
Here is a quick guideline to get started if you have no experience with R/RStudio yet. When opening RStudio you should see the R console window in the lower left corner. Here you can type commands after the prompt '>'. However, it is more convenient to type commands in the editor window in the upper left corner. These commands can then be executed by clicking the 'Run' button (which executes the selected line or block of code) or the 'Source' button (which executes the whole script). To learn more about the use of RStudio, select the 'RStudio Docs' option in the 'Help' menu of RStudio.
R is a modular system where functionality is organized in packages. Some of the basic packages are installed by default. If a specific package is needed for a statistical method, then it can be installed by selecting the 'Packages' tab in the lower right panel of RStudio and clicking the 'Install' button. After installing a package, you can select it from the list, which loads the package into R. When the package is loaded, its content is ready to use.
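Equivalently, a package can be installed and loaded from the console; as a small sketch (the package name boot is just an example here, not one required by the course):

> install.packages("boot")   # download and install the package (only needed once)
> library(boot)              # load the package into the current R session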
Help files are readily available by typing
> help.start()
Select ‘An Introduction to R’ for a basic manual. It can be a useful exercise to read and
execute some of the examples of this manual.
Chapter 1
2. We assume that the observations were drawn from some unknown population
6. Using the accuracy estimates, we can build confidence intervals, perform hy-
pothesis tests, . . .
The term statistical inference often refers specifically to the last two steps. The boot-
strap is a generally applicable method to perform these two steps. Schematically, sta-
tistical inference can be represented as follows.
$$\{x_1, \dots, x_n\} \sim F \;\Longrightarrow\; \hat\theta \;\Longrightarrow\; \text{standard error of } \hat\theta$$
• They rely on far fewer assumptions; for instance, they do not require normality.
• They can easily be applied in statistical models which are too complex for analytical (classical) solutions.
Let us specify in more detail what we mean by the accuracy of an estimate. Suppose we have a data set Xn = {x1, . . . , xn} from an unknown population F and we computed some estimate $\hat\theta = \hat\theta(X_n)$ for the quantity θ = θ(F). The accuracy obviously concerns the difference between $\hat\theta$ and θ. We can never know exactly how far our estimate is from θ (otherwise we would know the unknown θ). But we can figure out how far we can expect it to be. There are two main aspects to this accuracy: the bias of the estimate and its variability. When we know that the estimate is more or less unbiased, we are mainly interested in the variability of $\hat\theta$. To assess the accuracy, we take into account that the available
data set Xn is just one possible random sample from F. We could have had a different
sample of the same size and the question is then to what extent our estimate $\hat\theta$ could
be different. The accuracy of the estimate can be derived from its sampling distribu-
tion, which is the distribution of the estimator over all possible random samples from
F with the same sample size. The sampling distribution reflects the accuracy of the
estimate from which we can obtain
further developed by Tukey (Tukey, 1958). The word “jackknife” refers to the handy
‘knife’ that one should always have.
The basic idea behind the jackknife estimator lies in systematically recomputing
the estimate leaving out one observation at a time from the sample set. From this new
set of “observations” for the statistic an estimate for the bias can be calculated, as well
as an estimate for the variance of the statistic.
$$T_n = \theta(F_n),$$
The leave-one-out estimators $T_{n-1}^{(i)}$ are thus obtained by estimating θ(F) from the reduced sample $(X_1, \dots, X_{i-1}, X_{i+1}, \dots, X_n)$ of size n − 1.
Corresponding to the leave-one-out estimates, the 'pseudo-values' $T_{n,i}$ are now defined by
$$T_{n,i} = n\,T_n - (n-1)\,T_{n-1}^{(i)}, \qquad i = 1, \dots, n. \qquad (1.5)$$
The jackknife estimator for the mean of $T_n$ is thus equal to the average of the pseudo-values:
$$\hat E_{\mathrm{jack}}(T_n) = \frac{1}{n}\sum_{i=1}^n T_{n,i}. \qquad (1.6)$$
The jackknife estimator for the bias of $T_n$ is now given by
$$\widehat{\mathrm{Bias}}_{\mathrm{jack}}(T_n) = T_n - \hat E_{\mathrm{jack}}(T_n) = T_n - n\,T_n + \frac{n-1}{n}\sum_{i=1}^n T_{n-1}^{(i)},$$
and hence
$$\widehat{\mathrm{Bias}}_{\mathrm{jack}}(T_n) = (n-1)\left[\frac{1}{n}\sum_{i=1}^n T_{n-1}^{(i)} - T_n\right]. \qquad (1.7)$$
The jackknife estimator for the variance of $T_n$ is given by the sample variance of the pseudo-values $T_{n,i}$:
$$\widehat{\mathrm{Var}}_{\mathrm{jack}}(T_n) = \frac{1}{n}\left\{\frac{1}{n-1}\sum_{i=1}^n \Big(T_{n,i} - \hat E_{\mathrm{jack}}(T_n)\Big)^2\right\}. \qquad (1.8)$$
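As an illustration, the quantities in (1.5), (1.7) and (1.8) can be computed in R with a short function like the sketch below (the function name jackknife_est and the example statistic are ours, not part of the notes):

jackknife_est <- function(x, theta) {
  n <- length(x)
  Tn <- theta(x)
  T_loo <- sapply(1:n, function(i) theta(x[-i]))         # leave-one-out estimates T_{n-1}^(i)
  pseudo <- n * Tn - (n - 1) * T_loo                     # pseudo-values (1.5)
  E_jack <- mean(pseudo)                                 # jackknife estimate of the mean of Tn
  bias_jack <- (n - 1) * (mean(T_loo) - Tn)              # bias estimate (1.7)
  var_jack <- sum((pseudo - E_jack)^2) / (n * (n - 1))   # variance estimate (1.8)
  list(estimate = Tn, bias = bias_jack, variance = var_jack)
}

# example: the (biased) plug-in variance as the statistic Tn
x <- rnorm(20)
jackknife_est(x, function(v) mean((v - mean(v))^2))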
The motivation behind this estimation procedure can be explained as follows. Suppose that $T_n$ is asymptotically unbiased with an expansion of the form
$$E(T_n) = \theta + \frac{a(\theta)}{n} + \frac{b(\theta)}{n^2} + \cdots.$$
Now, for the bias-corrected jackknife estimator $\hat T_n = \frac{1}{n}\sum_{i=1}^n T_{n,i} = nT_n - \frac{n-1}{n}\sum_{i=1}^n T_{n-1}^{(i)}$ we have
$$\begin{aligned}
E(\hat T_n) &= E\Big(nT_n - \frac{n-1}{n}\sum_{i=1}^n T_{n-1}^{(i)}\Big) \\
&= nE(T_n) - \frac{n-1}{n}\sum_{i=1}^n E\big(T_{n-1}^{(i)}\big) \\
&= nE(T_n) - (n-1)E\big(T_{n-1}^{(1)}\big) \\
&= n\Big(\theta + \frac{a(\theta)}{n} + \frac{b(\theta)}{n^2} + \cdots\Big) - (n-1)\Big(\theta + \frac{a(\theta)}{n-1} + \frac{b(\theta)}{(n-1)^2} + \cdots\Big) \\
&= \theta + \frac{b(\theta)}{n} - \frac{b(\theta)}{n-1} + \cdots \\
&= \theta - \frac{b(\theta)}{n(n-1)} + \cdots.
\end{aligned}$$
If we compare with the expansion of $E(T_n)$, we can see that in this case the estimator $\hat T_n$ should be a more accurate estimator for θ than $T_n$.
1.2.1 Some examples
The mean of a random variable
Since
$$\bar X_{n-1}^{(i)} = \frac{1}{n-1}\sum_{\substack{j=1 \\ j\neq i}}^{n} X_j = \frac{1}{n-1}\big(n\bar X_n - X_i\big),$$
we find that
$$\begin{aligned}
T_{n,i} &= \sum_{i=1}^n X_i^2 - 2n\bar X_n^2 + n\bar X_n^2 - \sum_{\substack{j=1 \\ j\neq i}}^{n} X_j^2 + 2(n-1)\big(\bar X_{n-1}^{(i)}\big)^2 - (n-1)\big(\bar X_{n-1}^{(i)}\big)^2 \\
&= X_i^2 - n\bar X_n^2 + (n-1)\big(\bar X_{n-1}^{(i)}\big)^2 \\
&= X_i^2 - n\bar X_n^2 + \frac{1}{n-1}\big(n\bar X_n - X_i\big)^2 \\
&= X_i^2 - n\bar X_n^2 + \frac{n^2}{n-1}\bar X_n^2 - \frac{2n}{n-1}X_i\bar X_n + \frac{1}{n-1}X_i^2 \\
&= \frac{n}{n-1}X_i^2 + \frac{n^2 - n(n-1)}{n-1}\bar X_n^2 - \frac{2n}{n-1}X_i\bar X_n \\
&= \frac{n}{n-1}X_i^2 + \frac{n}{n-1}\bar X_n^2 - \frac{2n}{n-1}X_i\bar X_n \\
&= \frac{n}{n-1}\big(X_i - \bar X_n\big)^2.
\end{aligned}$$
We then find
$$\hat T_n = \frac{1}{n}\sum_{i=1}^n T_{n,i} = \frac{1}{n-1}\sum_{i=1}^n \big(X_i - \bar X_n\big)^2 = S_n^2,$$
the sample variance of the sample $X_1, \dots, X_n$. Note that $E(T_n) = \frac{n-1}{n}\,\theta$ and $E(\hat T_n) = \theta$. Hence, $\hat T_n$ is a better estimator than $T_n$.
where $X_{(1)} \le X_{(2)} \le \cdots \le X_{(n)}$ denote the order statistics of $X_1, \dots, X_n$. We calculate the pseudo-values $T_{n,i} = nT_n - (n-1)T_{n-1}^{(i)}$; for this we first look at $T_{n-1}^{(i)}$:
$$T_{n-1}^{(i)} = \begin{cases} X_{(m+1)} & i = 1,\dots,m \\ X_{(m)} & i = m+1,\dots,n, \end{cases}$$
so that
$$T_{n,i} = \begin{cases} nT_n - (n-1)X_{(m+1)} & i = 1,\dots,m \\ nT_n - (n-1)X_{(m)} & i = m+1,\dots,n, \end{cases}$$
and (using that $n = 2m$ and $T_n = \tfrac12\big(X_{(m)} + X_{(m+1)}\big)$)
$$\begin{aligned}
\hat T_n = \frac{1}{n}\sum_{i=1}^n T_{n,i} &= nT_n - \frac{n-1}{n}\, m\, X_{(m+1)} - \frac{n-1}{n}\,(n-m)\, X_{(m)} \\
&= nT_n - \frac{n-1}{n}\, m\big(X_{(m+1)} + X_{(m)}\big) \\
&= nT_n - \frac{n-1}{n}\, 2m\, T_n \\
&= nT_n - (n-1)T_n \\
&= T_n.
\end{aligned}$$
The jackknife estimate of the bias is therefore
$$T_n - \hat E_{\mathrm{jack}}(T_n) = T_n - \hat T_n = 0\,.$$
However, the jackknife estimator of the variance does not work. Let us calculate that
estimator. Recall (1.8) and calculate
$$T_{n,i} - \hat T_n = T_{n,i} - T_n = \begin{cases} (n-1)T_n - (n-1)X_{(m+1)} & i = 1,\dots,m \\ (n-1)T_n - (n-1)X_{(m)} & i = m+1,\dots,n \end{cases} = \begin{cases} (n-1)\big(T_n - X_{(m+1)}\big) & i = 1,\dots,m \\ (n-1)\big(T_n - X_{(m)}\big) & i = m+1,\dots,n. \end{cases}$$
It follows that
$$\begin{aligned}
\widehat{\mathrm{Var}}_{\mathrm{jack}}(T_n) &= \frac{1}{n}\left\{\frac{1}{n-1}\sum_{i=1}^n \big(T_{n,i} - \hat T_n\big)^2\right\} \\
&= \frac{1}{n}\left\{\frac{1}{n-1}\, m\,(n-1)^2\big(T_n - X_{(m+1)}\big)^2 + \frac{1}{n-1}\, m\,(n-1)^2\big(T_n - X_{(m)}\big)^2\right\} \\
&= \frac{1}{n}\, m\,(n-1)\Big[\big(T_n - X_{(m+1)}\big)^2 + \big(T_n - X_{(m)}\big)^2\Big] \\
&= \frac{1}{2}(n-1)\left[\Big(\tfrac12 X_{(m)} - \tfrac12 X_{(m+1)}\Big)^2 + \Big(\tfrac12 X_{(m+1)} - \tfrac12 X_{(m)}\Big)^2\right] \qquad\text{(recall } n = 2m\text{)} \\
&= \frac{1}{4}(n-1)\big(X_{(m+1)} - X_{(m)}\big)^2.
\end{aligned}$$
However, for symmetric densities the median $T_n$ has asymptotic variance
$$\frac{1}{4 f^2(0)},$$
i.e. $\mathrm{Var}(T_n) \approx \frac{1}{4 n f^2(0)}$ for large n, where $f = F'$. The two quantities differ even when $n \to \infty$: the ratio
$$\widehat{\mathrm{Var}}_{\mathrm{jack}}(T_n) \Big/ \frac{1}{4 n f^2(0)}$$
does not converge to 1 as $n \to \infty$. In conclusion, the jackknife estimator of the variance of the median is inconsistent (it fails to converge to the true sampling variance).
be assessed without long-winded and error-prone analytical calculation”. Nowadays
bootstrap methodology can be used for almost any type of inference. It can thus be
used to obtain standard errors and confidence regions, to perform hypothesis tests and
much more.
Figure 1.1 Histogram of B = 1000 sample means from random samples of size n = 100
from F = N (5, 1)
We can see that around 95% of the time the difference between the sample mean and
µ is less than 0.2, just as we of course know from the normal theory (i.e. sample mean
is distributed as N (µ, σ2 /n)).
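A histogram like Figure 1.1 can be reproduced with a few lines of R (B = 1000, n = 100 and F = N(5, 1) as in the text):

B <- 1000; n <- 100
xbar <- replicate(B, mean(rnorm(n, mean = 5, sd = 1)))   # B sample means, each from a new sample
hist(xbar, xlab = "sample mean")
mean(abs(xbar - 5) < 0.2)   # fraction of sample means within 0.2 of mu, roughly 0.95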
Considering a whole range of possible random samples from F would thus give us an
idea of the sampling distribution and thus of the accuracy of a single estimate (in prob-
ability terms). Of course, in practice we cannot draw extra samples from F, we only
have one. In the above example, even if we do not know µ and σ, this can be solved if
we are willing to assume that F is normal. Then, we can rely on the theory and take the
normal distribution as the sampling distribution. But if it is not reasonable to assume
that F is normal, then we cannot rely on the theoretical approximation anymore. Note
that the approximation based on repeated sampling would not need any assumptions
about the distribution F in contrast to the analytical approximation. The bootstrap
mimics the process of repeated sampling from F and thus allows us to approximate the sampling distribution without making possibly unrealistic assumptions.
Let us focus on the situation where we have a single sample Xn of independent and
identically distributed (i.i.d.) observations, which may be univariate or multivariate.
In short, the bootstrap resamples the data set Xn many times and recomputes $\hat\theta_n$ for each resample to find out about its variability.
We would like to have many extra samples from F but we do not know F, so what can we do? As always in statistics, we replace F by an estimate $\hat F$. The best estimate, in the absence of prior knowledge of the possible form of F, is given by the empirical distribution $F_n$ corresponding to the sample Xn. Its distribution function is given by
$$F_n(x) = \frac{\#\{X_i \le x\}}{n}.$$
Now that we have $F_n$ as an estimate of F, we can proceed by randomly drawing samples $X_n^*$ from $F_n$ (instead of F), compute the corresponding estimates $\hat\theta_n^*$, and consider the sampling distribution under $F_n$ as an estimate for the sampling distribution under F. We then use the following approximations:
$$\mathrm{var}[\hat\theta_n \mid F] \cong \mathrm{var}[\hat\theta_n^* \mid F_n]$$
$$E[\hat\theta_n \mid F] - \theta \cong E[\hat\theta_n^* \mid F_n] - \hat\theta_n$$
$$P[(\hat\theta_n - \theta) \le t \mid F] \cong P[(\hat\theta_n^* - \hat\theta_n) \le t \mid F_n]$$
The general bootstrap principle thus can be stated as follows: replace the unknown F by the estimate $F_n$ and obtain an approximation for the sampling distribution of $\hat\theta_n$ under $\hat F_n$ by repeatedly sampling from $F_n$. This implies that θ is replaced by $\hat\theta_n$ and $\hat\theta_n$ by $\hat\theta_n^*$.
Consider again the example above, where $F = N(\mu, \sigma^2)$ with µ = 5 and σ = 1. We took one random sample Xn of size n = 100 from F, which happened to give us a sample mean equal to $\hat\theta = 5.094$. Figure 1.2 shows the approximation of the distribution F by $F_n$ corresponding to Xn. We then constructed B = 1000 samples from $F_n$ corresponding to this Xn, each time computing the sample mean $\hat\theta_n^*$. The histogram of these estimates is represented in Figure 1.3. We see that the B estimates of the mean are centered around the sample mean 5.094 of the original sample, as could be expected. More importantly, the distribution of $\hat\theta_n^* - \hat\theta_n$ resembles the distribution of $\hat\theta_n - \theta$ (where θ = µ).
Figure 1.2 Distribution F and its approximation Fn using a sample of size n = 100.
Figure 1.3 Histogram of B = 1000 bootstrap means based on a random sample of size n = 100 from F = N (5, 1)
The samples X∗n from Fn are called bootstrap samples. They are samples with n in-
dependent observations that are all (identically) distributed according to the discrete
distribution Fn . The density corresponding to Fn puts probability mass 1/n on each
observation in Xn (assuming there are no ties), which means that a bootstrap sample
needs to be constructed by drawing n observations from Xn with replacement.
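In R, one such bootstrap sample is obtained by resampling the data vector with replacement; a minimal sketch (x denotes any numeric data vector):

> xstar <- sample(x, size = length(x), replace = TRUE)   # one bootstrap sample from Fn
> mean(xstar)                                            # the corresponding recomputed estimate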
distribution of $\hat\theta_n$. In practice this means that the original data set Xn is re-sampled with replacement to create many bootstrap samples and many re-computed estimates of $\hat\theta_n$.
$$\begin{array}{lcl}
X_n \sim F & \Longrightarrow & \hat\theta_n \\
\Downarrow & & \\
X_n^{*1}, X_n^{*2}, \dots, X_n^{*B} & \Longrightarrow & \hat\theta_n^{*1}, \hat\theta_n^{*2}, \dots, \hat\theta_n^{*B} \;\Rightarrow\; \widehat{\mathrm{var}}[\hat\theta_n^* \mid F_n],\;\; \hat E[\hat\theta_n^* \mid F_n] - \hat\theta_n,\;\; \hat P[(\hat\theta_n^* - \hat\theta_n) \le t \mid F_n]
\end{array}$$
In some simple cases, $E[\hat\theta_n^* \mid F_n]$ and $\mathrm{var}[\hat\theta_n^* \mid F_n]$ may be calculated analytically and thus it is not necessary to actually perform the resampling. As an example, we consider the estimation of a mean µ by the sample mean $\bar X_n$, where X is such that $E(X) = \mu$ and $\mathrm{Var}(X) = \sigma^2$ (but there is no further information about the distribution of X).
Let $(X_1^*, X_2^*, \dots, X_n^*)$ be an i.i.d. sample drawn from the empirical cumulative distribution $F_n$. We have
$$E_*(X_i^*) = E_{F_n}(X_i^*) = \frac{1}{n}\sum_{i=1}^n X_i = \bar X_n,$$
$$\mathrm{Var}_*(X_i^*) = \mathrm{Var}_{F_n}(X_i^*) = E_{F_n}\big(X_i^* - E_{F_n}(X_i^*)\big)^2 = \frac{1}{n}\sum_{i=1}^n (X_i - \bar X_n)^2 = \frac{n-1}{n}\,S_n^2.$$
Hence, regarding the sampling properties of $\bar X_n$, we have for the mean and the variance:
$$E(\bar X_n) = E_F(\bar X_n) = \mu \quad\text{and}\quad \mathrm{Bias}(\bar X_n) = 0,$$
$$\mathrm{Var}(\bar X_n) = \mathrm{Var}_F(\bar X_n) = \frac{\mathrm{Var}(X)}{n},$$
$$E_*(\bar X_n^*) = E_{F_n}(\bar X_n^*) = \frac{1}{n}\sum_{i=1}^n E_{F_n}(X_i^*) = \bar X_n,$$
$$\mathrm{Var}_*(\bar X_n^*) = \mathrm{Var}_{F_n}(\bar X_n^*) = \frac{1}{n}\,\mathrm{Var}_{F_n}(X_i^*) = \frac{1}{n}\,\frac{n-1}{n}\,S_n^2.$$
The latter expressions yield the bootstrap estimates of bias and variance of $T_n = \bar X_n$:
• the bootstrap estimate of the bias is $E_*(\bar X_n^*) - \bar X_n = 0$;
• the bootstrap estimate of the variance of $\bar X_n$ is $\mathrm{Var}_*(\bar X_n^*) = \frac{n-1}{n}\,\frac{S_n^2}{n} \approx \frac{S_n^2}{n}$.
However, in most cases we cannot calculate bootstrap estimates of bias and vari-
ance analytically. The re-sampling in the bootstrap procedure then amounts to cal-
culations through Monte Carlo simulation. This re-computing part then makes the
bootstrap a computer-intensive method: we may need thousands or more bootstrap samples! The re-sampling by computer simulation gives us only an approximation (or simulation analogue) to $\mathrm{var}[\hat\theta_n^* \mid F_n]$, $E[\hat\theta_n^* \mid F_n]$ and $P[(\hat\theta_n^* - \hat\theta_n) \le t \mid F_n]$. Hence the hats in the above scheme. In particular, the simulation analogues are defined as
$$\widehat{\mathrm{var}}[\hat\theta_n^* \mid F_n] = \frac{1}{B-1}\sum_{b=1}^{B}\big(\hat\theta_n^{*b} - (\hat\theta_n^*)_B\big)^2 \;\;(=: v_B^*) \qquad (1.9)$$
$$\hat E[\hat\theta_n^* \mid F_n] - \hat\theta_n = \frac{1}{B}\sum_{b=1}^{B}\big(\hat\theta_n^{*b} - \hat\theta_n\big) \;\;(=: b_B^*) \qquad (1.10)$$
$$\hat P[(\hat\theta_n^* - \hat\theta_n) \le t \mid F_n] = \frac{\#\{\hat\theta_n^{*b} - \hat\theta_n \le t\}}{B} \qquad (1.11)$$
where $\hat\theta_n^{*b},\ b = 1, \dots, B$ are the re-computed estimates, and $(\hat\theta_n^*)_B = \frac{1}{B}\sum_{b=1}^B \hat\theta_n^{*b}$.
Now, (1.9)-(1.11) are the usual bootstrap estimates for the variance, the bias and the sampling distribution of $\hat\theta_n$.
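The Monte Carlo versions (1.9)-(1.11) can be computed with an R sketch of the following form (the sample mean and the simulated data are only illustrative choices):

theta <- mean                        # the estimator of interest
x <- rnorm(50, mean = 5)             # the observed sample Xn (illustration only)
B <- 1000
theta_hat <- theta(x)
theta_star <- replicate(B, theta(sample(x, replace = TRUE)))   # recomputed estimates
v_B <- var(theta_star)                                  # (1.9): bootstrap variance estimate
b_B <- mean(theta_star) - theta_hat                     # (1.10): bootstrap bias estimate
P_B <- function(t) mean(theta_star - theta_hat <= t)    # (1.11): estimated sampling CDF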
Example: (Davison and Hinkley, 1997) The following table reports the data set
Xn = { x1 , . . . , xn }, where n = 10 and where xi = (yi , zi ). Each pair xi corresponds
to a city in the US and contains the 1920 (yi ) and 1930 (zi ) populations of that city, in
thousands.
obs 1 2 3 4 5 6 7 8 9 10
yi 138 93 61 179 48 37 29 23 30 2
zi 143 104 69 260 75 63 50 48 111 50
Of interest here is the ratio of the means, since this would enable us to estimate the to-
tal population growth in the US from 1920 to 1930. We assume that the given pairs are
a random sample from a bivariate distribution F, with X = (Y, Z ) denoting the corre-
sponding random variable. The quantity or parameter of interest is θ = E[ Z ]/E[Y ].
A natural estimate of θ is the empirical analogue: $\hat\theta_n = \bar z / \bar y$, i.e. the ratio between the two sample means. From the data, we obtain $\hat\theta_n = 1.520$.
We are then concerned about the accuracy of $\hat\theta_n$, especially because with n = 10
we have quite a small sample. We do not know anything about the underlying dis-
tribution F. We could make assumptions about it and use theoretical calculations.
Assuming bivariate normality would be an option, but probably not a very good one
(plot the data to see this). Another option is assuming that F is bivariate lognormal,
which might be better. If we would do so, it would be possible to do the necessary
calculations. But the calculations would be rather difficult and, above all, we cannot be sure that our assumption is correct. Instead we can avoid the calculations and the
assumptions, and use the bootstrap.
We need to re-sample the data: generate new samples, consisting again of n pairs,
by randomly drawing observations from the original sample Xn = { x1 , . . . , xn }, with
replacement. So in any newly generated sample, some of the 10 original cities will
appear more than once, and some will not appear at all. We generated B = 1000 boot-
strap samples. The following table lists how many times each observation appears in
the bootstrap samples and shows the resulting recomputed estimates $\hat\theta_n^*$.
obs x1 x2 x3 x4 x5 x6 x7 x8 x9 x10
original sample: 1 1 1 1 1 1 1 1 1 1   $\hat\theta_n = 1.520$
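This resampling can be carried out in R as follows (a sketch using the data from the table above):

y <- c(138, 93, 61, 179, 48, 37, 29, 23, 30, 2)      # 1920 populations
z <- c(143, 104, 69, 260, 75, 63, 50, 48, 111, 50)   # 1930 populations
theta_hat <- mean(z) / mean(y)                       # 1.520
B <- 1000
theta_star <- replicate(B, {
  idx <- sample(1:10, replace = TRUE)                # draw 10 cities (pairs) with replacement
  mean(z[idx]) / mean(y[idx])
})
var(theta_star)                  # bootstrap variance estimate v_B^*
mean(theta_star) - theta_hat     # bootstrap bias estimate b_B^*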
The histogram with all B = 1000 bootstrap recomputations $\hat\theta_n^{*1}, \dots, \hat\theta_n^{*B}$ is given in Figure 1.4. The vertical line indicates the original estimate $\hat\theta = 1.520$. The bootstrap estimate for the variance of $\hat\theta$ is
$$v_B^* = \frac{1}{B-1}\sum_{b=1}^{B}\big(\hat\theta_n^{*b} - (\hat\theta_n^*)_B\big)^2 = 0.0465,$$
so the bootstrap standard error equals $\sqrt{0.0465} = 0.216$. The bootstrap estimate for the bias of $\hat\theta_n$ is
$$b_B^* = \frac{1}{B}\sum_{b=1}^{B}\big(\hat\theta_n^{*b} - \hat\theta_n\big) = 0.034.$$
[Figure 1.4: histogram (Frequency) of the B = 1000 bootstrap recomputations $\hat\theta_n^{*1}, \dots, \hat\theta_n^{*B}$, with a vertical line at the original estimate $\hat\theta_n = 1.520$.]
The name comes from the English expression ”Pull yourself up by the bootstraps”
which means that you get yourself out of an almost impossible situation by using only
your own strength. This expression is sometimes said to originate from an 18th-century tale about Baron Münchhausen, who found himself at the bottom of a lake and got out by pulling himself up by his own bootstraps. Using the data itself to generate more
data seems like an analogous trick!
Sampling from the data and pretending it is new data often provokes the initial reaction that this is a kind of fraud, because no new information (extra data) or additional information (assumptions) is used. However, it turns out that it is not. Using the data only for estimation just means that the information provided by the data has not yet been fully exploited. Applying the bootstrap is a way to put the extra information in the data to use. Theoretical research and lots of practice have validated the bootstrap as a very
useful tool that can tackle a wide range of statistical problems. This can be summa-
rized as follows. Bootstrap doesn’t really give you something for nothing. It gives you
something you previously ignored.
bootstrap error and the simulation error.
The simulation error occurs when we cannot obtain the bootstrap distribution an-
alytically, but have to resort to an approximation based on computer simulation, i.e.
because we approximate $\mathrm{var}[\hat\theta_n^* \mid F_n]$ by $\widehat{\mathrm{var}}[\hat\theta_n^* \mid F_n]$, and similarly for the other quantities. This error decreases when the number of bootstrap samples B in the simulation
increases. By choosing B sufficiently large, we can limit the simulation error as much
as we want. Of course, this costs us computation time.
The bootstrap error occurs because we approximate the sampling distribution by the bootstrap distribution, i.e. because we approximate $\mathrm{var}[\hat\theta_n \mid F]$ by $\mathrm{var}[\hat\theta_n^* \mid F_n]$, and
similarly for the other quantities. This error is independent of the simulation error
and hence does not decrease with B. It does decrease with the sample size n, but
in practice n is fixed, of course. The bootstrap error occurs because the bootstrap
principle does not provide the exact sampling distribution, it only gives an estimate of
the true sampling distribution.
Hence, if we want (1.9)-(1.11) to be good estimates of (1.1)-(1.3), then (a) we should
take B large enough, and (b) we should hope that the bootstrap error is small... The
size of the bootstrap error is thus crucial. We can give two remarks about it.
1. It has been shown that the bootstrap error is small, perhaps surprisingly, for
many applications
2. However, an approximation of the form
$$P[\hat\theta_n \le t \mid F] \cong P[\hat\theta_n^* \le t \mid F_n]$$
will in general not be very good. That's why we usually work with
$$P[(\hat\theta_n - \theta) \le t \mid F] \cong P[(\hat\theta_n^* - \hat\theta_n) \le t \mid F_n]$$
or, even better, with the studentized version
$$P\Big[\frac{\hat\theta_n - \theta}{\sqrt{V}} \le t \;\Big|\; F\Big] \cong P\Big[\frac{\hat\theta_n^* - \hat\theta_n}{\sqrt{V^*}} \le t \;\Big|\; F_n\Big],$$
where V and V* are some estimates of the variance of $\hat\theta_n$ and $\hat\theta_n^*$.
The simulation error can always be made small, but how many bootstrap sam-
ples should then be drawn? The answer depends on the desired accuracy of the ob-
tained approximation, but there are some guidelines. Traditional guidelines state that
B = 200 is required to obtain accurate variance estimates, while at least B = 1000
is generally necessary for confidence intervals. However, since computer power has
increased enormously, higher numbers are often used nowadays.
Note that the empirical distribution Fn is discrete, even if F is continuous. There
are only a finite number of possible samples from Fn and therefore the distribution
of θbn∗1 , . . . , θbn∗ B will have some discrete nature. For a continuous distribution the true
sampling distribution will usually not have this characteristic. To explore whether
this could be a problem, let’s see how many different bootstrap samples there are.
This number is given by the number of ways you can put n balls into n boxes (cfr. the
number of balls in box j is the number of times observation j appears in the bootstrap
sample). Basic combinatorics tells us that this number is given by the combinatorial
number !
2n − 1
n−1
The following table yields the number of different bootstrap resamples for a few sam-
ple sizes:
n        5     10      15           20           50
count    126   92378   7.75 × 10^7  6.89 × 10^10  5.04 × 10^28
These large numbers show that the discreteness of the bootstrap distribution should
not be a problem when the sample size is not really small, at least if the statistic is a
sufficiently smooth function of the observations. Most statistics are indeed sufficiently
smooth, but some are not. The sample median is the prime example. If you try to
bootstrap the median for a small sample, you will see what happens! Note that the
combinatorial number gives you the total number of different bootstrap samples, but
not all these bootstrap samples are equally likely to occur. The most likely bootstrap
sample to be drawn is in fact the original sample itself.
Let us have a look at the different formulas for estimating the bias and the variance
of a statistic Tn by both methods.
For estimating the bias of a statistic $T_n$, the methods give us:
$$\widehat{\mathrm{Bias}}_{\mathrm{jack}}(T_n) = (n-1)\left[\frac{1}{n}\sum_{i=1}^n T_{n-1}^{(i)} - T_n\right],$$
and the bootstrap estimate for the variance leads to (with the Monte Carlo approximation)
$$\widehat{\mathrm{Var}}_{\mathrm{boot}}(T_n) = \mathrm{Var}_*(T_n) \approx \frac{1}{B}\sum_{b=1}^B\left(T_n^{*(b)} - \frac{1}{B}\sum_{\ell=1}^B T_n^{*(\ell)}\right)^2.$$
When comparing the jackknife formulas to the bootstrap expressions, we note that both jackknife formulas have an inflation factor $(n-1)$. This "inflation factor" is needed because the jackknife values $T_{n-1}^{(i)}$ are much closer to each other than the bootstrap values $T_n^{*(b)}$, which would result in underestimation of the bias and variance without correction by the inflation factor. The jackknife 'resamples' thus resemble $X = \{X_1, \dots, X_n\}$ much more closely than $X^* = \{X_1^*, \dots, X_n^*\}$ does.
Bootstrap or jackknife?
of bias and variance are easier to compute (at least for a not too large n).
In fact, it has been shown that the jackknife can be viewed as an approximation of the bootstrap: if $\hat\theta$ is a linear statistic (i.e. if we can write $\hat\theta = \mu + \sum_{i=1}^n \alpha(x_i)$ for some function α), then the jackknife is equivalent to the bootstrap. If $\hat\theta$ is nonlinear, as is usually the case, then Efron showed that the jackknife variance estimate is a bootstrap estimate corresponding to a linear approximation to $\hat\theta$. The jackknife is thus an approximation to the bootstrap and therefore inferior to it, especially for highly nonlinear statistics.
A major advantage of the bootstrap is its versatility. For example, it is able to estimate
quantiles of the sampling distribution, while the jackknife is limited to variance and
bias estimates.
If we are willing to assume that the sampling distribution of the estimator $\hat\theta_n$ is approximately $N(\theta, \sigma^2)$, then a 95% confidence interval for θ is given by $(\hat\theta_n - 1.96\,\sigma,\ \hat\theta_n + 1.96\,\sigma)$, as usual. The scale σ in this interval is unknown, so we can estimate it from the bootstrap distribution. The bootstrap confidence interval is thus obtained by simply replacing σ by the bootstrap standard error estimate:
$$\big(\hat\theta_n - 1.96\sqrt{v_B^*},\ \hat\theta_n + 1.96\sqrt{v_B^*}\big). \qquad (1.12)$$
We can go one (small) step further and also make use of the bootstrap distribution to estimate the bias. That means that we relax our assumption by assuming that the sampling distribution of $\hat\theta_n$ is approximately $N(\theta + \beta, \sigma^2)$. Hence, β represents the bias of the estimator $\hat\theta_n$. This leads to the bias-corrected normal interval:
$$\big(\hat\theta_n - b_B^* - 1.96\sqrt{v_B^*},\ \hat\theta_n - b_B^* + 1.96\sqrt{v_B^*}\big). \qquad (1.13)$$
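Given bootstrap replicates theta_star and the original estimate theta_hat (as in the earlier sketches, which are our own illustrations), the intervals (1.12) and (1.13) are obtained as:

v_B <- var(theta_star)
b_B <- mean(theta_star) - theta_hat
theta_hat + c(-1, 1) * 1.96 * sqrt(v_B)          # normal interval (1.12)
theta_hat - b_B + c(-1, 1) * 1.96 * sqrt(v_B)    # bias-corrected normal interval (1.13)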
When is the assumption of “approximately normal” justified? Of course, exact nor-
mality is practically only obtained when F itself is normal and θbn is the sample mean.
Approximate normality can be claimed in a lot more situations, in the sense that many
statistics or estimators are asymptotically normal. For example, the sample mean is
asymptotically normal even when F is not normal. In many other cases, especially
more complex situations, asymptotic normality is not true or just difficult to find out.
Moreover, even if θbn is asymptotically normal, the normal distribution may not yield
a good approximation for the sampling distribution at small n. Thus, the simple in-
terval (1.12) or (1.13) would be a good option only when asymptotic normality is pre-
sumed and when n is large.
$$P[W \le w_\alpha] = \alpha$$
Denote by $\hat\theta^*_{(1)} \le \ldots \le \hat\theta^*_{(B)}$ the order statistics corresponding to the bootstrap estimates $\hat\theta_n^{*1}, \ldots, \hat\theta_n^{*B}$. The quantile $w^*_\alpha$ is then usually estimated by $\hat w^*_\alpha = \hat\theta^*_{(\alpha(B+1))} - \hat\theta_n$, so that the interval $(\hat\theta_n - \hat w^*_{.975},\ \hat\theta_n - \hat w^*_{.025})$ becomes
$$\big(2\hat\theta_n - \hat\theta^*_{(.975(B+1))},\ 2\hat\theta_n - \hat\theta^*_{(.025(B+1))}\big). \qquad (1.14)$$
Remark 1: Why do we use $\hat\theta^*_{(\alpha(B+1))}$ to estimate the quantile $w^*_\alpha$ instead of $\hat\theta^*_{(\alpha B)}$? Suppose $Y_1, \dots, Y_N$ are i.i.d. with CDF K; then it can be shown for the order statistics that
$$E\big[K(Y_{(j)})\big] = \frac{j}{N+1}.$$
Hence, the j-th order statistic is a good estimate of the $j/(N+1)$-quantile rather than of the $j/N$-quantile.
Remark 2: Because of the use of the $\alpha(B+1)$-th order statistic, it is often recommended
to choose B such that, for example, .025( B + 1) and .975( B + 1) are integers. Therefore,
common choices are B = 999 or B = 4999. When α( B + 1) is not an integer, interpo-
lation should be used in principle. However, many people use rounded numbers of B
(e.g. 1000, 5000 or 10000) and αB anyway because for large values of B it doesn’t really
matter.
For example, when choosing B = 999, the 95% basic bootstrap interval becomes $\big(2\hat\theta_n - \hat\theta^*_{(975)},\ 2\hat\theta_n - \hat\theta^*_{(25)}\big)$.
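In R, with B = 999 bootstrap replicates theta_star generated as in the earlier sketches, the basic interval (1.14) can be computed as:

B <- length(theta_star)            # choose B = 999 so that .025(B+1) and .975(B+1) are integers
theta_sorted <- sort(theta_star)   # order statistics
lo <- round(0.025 * (B + 1)); hi <- round(0.975 * (B + 1))
c(2 * theta_hat - theta_sorted[hi], 2 * theta_hat - theta_sorted[lo])   # basic interval (1.14)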
The basic bootstrap interval is quite straightforward, but the disadvantage is that
the confidence interval generally is quite inaccurate because the approximation of W
by W* is often not good enough. Let's look at this in more detail. A central 1 − 2α confidence interval $(\hat\theta_L, \hat\theta_U)$ is supposed to have probability α of not covering the true θ from above or from below, i.e.
$$P[\theta < \hat\theta_L] = \alpha \quad\text{and}\quad P[\theta > \hat\theta_U] = \alpha. \qquad (1.15)$$
The nominal level α is often not reached exactly, but the difference becomes smaller
when the sample size increases. The accuracy of an approximate confidence interval
can be measured by how fast this difference decreases with increasing n. The basic
bootstrap interval is first-order accurate. This means that its error w.r.t. (1.15) goes to
zero at rate $1/\sqrt{n}$:
$$P[\theta < \hat\theta_L] = \alpha + O(n^{-1/2}),$$
and similarly for the upper limit. This is usually the same accuracy as can be obtained
by asymptotic approximations. However, when using the bootstrap we can do better.
It is possible to construct bootstrap confidence intervals that are second-order accurate, which means that the error w.r.t. (1.15) is of an order of magnitude smaller: it goes to zero at the much faster rate 1/n, i.e. $P[\theta < \hat\theta_L] = \alpha + O(n^{-1})$.

The bootstrap-t interval is based on the studentized statistic
$$z = \frac{\hat\theta_n - \theta}{\hat v_n^{1/2}}, \qquad (1.16)$$
where $\hat v_n$ is an estimate of $\mathrm{Var}[\hat\theta_n \mid F]$. We then need the quantiles $z_{.025}$ and $z_{.975}$ determined by
$$P\Big[z_{.025} \le \frac{\hat\theta_n - \theta}{\hat v_n^{1/2}} \le z_{.975}\Big] = 0.95,$$
so that a 95% confidence interval for θ is given by $\big(\hat\theta_n - \hat v_n^{1/2}\, z_{.975},\ \hat\theta_n - \hat v_n^{1/2}\, z_{.025}\big)$.
Using the bootstrap principle on the studentized sampling distribution, we can approximate the distribution of z by that of
$$z^* = \frac{\hat\theta_n^* - \hat\theta_n}{\hat v_n^{*\,1/2}}. \qquad (1.17)$$
Hence, we need recomputed statistics $z^{*1}, \dots, z^{*B}$. Let $z^*_{(1)}, \dots, z^*_{(B)}$ be the corresponding order statistics; the corresponding 95% bootstrap-t confidence interval is then defined as
$$\big(\hat\theta_n - \hat v_n^{1/2}\, z^*_{(.975(B+1))},\ \hat\theta_n - \hat v_n^{1/2}\, z^*_{(.025(B+1))}\big).$$
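For an estimator with a simple variance formula, such as the sample mean with $\hat v_n = S_n^2/n$, the bootstrap-t interval can be sketched in R as follows (the simulated data and B = 999 are illustrative choices of ours):

x <- rnorm(50, mean = 5); n <- length(x)
theta_hat <- mean(x); v_hat <- var(x) / n          # estimate and its variance estimate
B <- 999
z_star <- replicate(B, {
  xs <- sample(x, replace = TRUE)
  (mean(xs) - theta_hat) / sqrt(var(xs) / n)       # studentized bootstrap statistic (1.17)
})
z_sorted <- sort(z_star)
lo <- round(0.025 * (B + 1)); hi <- round(0.975 * (B + 1))
c(theta_hat - sqrt(v_hat) * z_sorted[hi],          # bootstrap-t interval
  theta_hat - sqrt(v_hat) * z_sorted[lo])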
(Disadvantage 1) What variance estimate $\hat v_n$ could be used here?
Bootstrap-t intervals can be more accurate, but a major disadvantage is the need for the variance estimates $\hat v_n$ and $\hat v_n^*$ (in every bootstrap sample). In some special
cases, like the sample mean, we may have an easy formula for the variance of the
bootstrapped estimate. (One could argue that, if this is the case, then we do not need
bootstrap in the first place: this is not entirely true, however, since a good variance
estimate does not necessarily mean we have a good estimate of the whole distribution.)
In most cases we do not have an easy way to obtain variance estimates. Still, two
relatively general methods for obtaining the variance estimates can be used:
1. Nested bootstrap: for each bootstrap sample $X_n^*$ with corresponding $\hat\theta_n^*$, we can generate B′ second-level bootstrap samples $X_n^{**1}, \dots, X_n^{**B'}$, randomly drawn with replacement from $X_n^*$, to obtain the variance estimate $\hat v_n^*$. This is generally applicable in principle, but can become impractical because of the high computational demands: for the second-level bootstrap variance estimates we need at least, say, B′ = 200. Considering that we need precise quantile estimates, the first-level bootstrap requires at least B = 999, say. Together this makes B(B′ + 1) > 200,000 re-computations of $\hat\theta_n$, which is extremely high if $\hat\theta_n$ is computed by a costly algorithm.
2. Nonparametric delta method: see e.g. Davison and Hinkley (1997, p. 46). If there is a way to compute or estimate the empirical influence values $l_i$ of each observation $i = 1, \dots, n$, then a common variance estimate is given by $\hat v_n = n^{-2}\sum_{i=1}^n l_i^2$,
with an analogous bootstrap version. Sometimes the li can be estimated from
an available formula for the theoretical influence function (see the Chapter on
M-estimators for details on influence functions), while in other cases jackknife-
type techniques can be used to obtain li , which then again becomes more time-
consuming, however.
However, often $\mathrm{var}[\hat\theta_n \mid F]$ depends on θ(F), and thus $\hat v_n^*$ will be related to $\hat\theta_n^*$. In most cases, we can find a transformation
$$\phi := h(\theta), \qquad \hat\phi_n := h(\hat\theta_n) \quad\text{and}\quad \hat\phi_n^* := h(\hat\theta_n^*) \qquad (1.18)$$
such that after transformation $\mathrm{var}[\hat\phi_n \mid F]$ is approximately independent of φ(F).
How can we find the appropriate parameter transformation $h(\cdot)$? Well, suppose we know the form of the dependency of $\mathrm{var}[\hat\theta_n \mid F]$ on θ, i.e. suppose we can write $\mathrm{var}[\hat\theta_n \mid F] = v(\theta)$ for some function $v(\cdot)$. Then for any well-behaved function $h(\cdot)$, a Taylor expansion yields
$$h(\hat\theta_n) \approx h(\theta) + h'(\theta)(\hat\theta_n - \theta),$$
which leads to
$$\mathrm{var}[\hat\phi_n \mid F] \approx \big(h'(\theta)\big)^2\, v(\theta),$$
which implies that the variance of $\hat\phi_n$ will be approximately constant (equal to 1) if
$$h(t) = \int^t \frac{du}{v(u)^{1/2}}. \qquad (1.19)$$
To apply this result, we need to know the function $v(\cdot)$. However, we can estimate this function from the bootstrap! Indeed, once we have applied the bootstrap on the original scale, we can take a look at the relation between the values of $\hat\theta_n^*$ and their respective variance estimates $\hat v_n^*$. If the variance estimates are fairly constant over the range of $\hat\theta_n^*$-values, then no transformation is required. If not, we use these bootstrap values to estimate a function v. We can then apply (1.19), do the bootstrap again on the transformed scale, and eventually back-transform the resulting interval to the original scale.
We now illustrate this procedure with an example. The left panel of Figure 1.5 shows $\hat v_n^*$ versus $\hat\theta_n^*$ for B = 999 bootstrap samples in the City data example. Here, the values of $\hat v_n^*$ were obtained by double bootstrap with B′ = 200. Clearly, the variance of $\hat\theta_n^*$ seems to be an increasing function of $\hat\theta_n^*$. There are various ways now to obtain an estimate of the function $v(\cdot)$. Nonparametric curve fitting is an option, or you can try to fit any parametric function. Here, $v(\theta) = (\theta - 1.1)^2/5$ turned out to give a nice fit to the data, as indicated by the line on the plot. We then applied the bootstrap again, but on the transformed scale based on (1.19), where $h(\cdot)$ was computed through numerical integration. The right panel of Figure 1.5 shows the relation between the resulting $\hat\phi_n^*$-values and their variance estimates. It is clear that the trend has indeed disappeared.
The confidence interval after variance-stabilizing transformation can then be back-
transformed to the original scale by numerical methods.
Let us compare the bootstrap based 95% confidence intervals for θ that we can
construct so far for the City data. Without any transformation we obtain:
Figure 1.5 (City data) Left: $\hat v_n^*$ versus $\hat\theta_n^*$ for 999 bootstrap samples, with the line indicating the approximate fit $\hat v_n^* \approx v(\hat\theta_n^*) = (\hat\theta_n^* - 1.1)^2/5$. Right: $\hat v_n^*$ versus $\hat\phi_n^*$ for 999 bootstrap samples using the transformation $\phi = h(\theta)$.
Note that in this example the variance-stabilization has a large effect on the basic
bootstrap confidence interval, but less on the bootstrap t-confidence interval. It is
sometimes argued that, once you have stabilized the variance through transformation,
it is sufficient to use basic bootstrap and there’s actually no need for a bootstrap-t
interval (since the denominator is supposed to be approximately constant). It is true
that after transformation both intervals will be close, but it is wise to use bootstrap-t
even after transformation (as pointed out by Davison and Hinkley, 1997).
The fact that transforming the parameter can improve the intervals implies that the intervals are not transformation-respecting: you may get different conclusions depending on the scale you are working on. This may be seen as a disadvantage and holds for both the bootstrap-t and the basic bootstrap interval. A side effect of this is that
these intervals may include impossible values. For example, in the City data, it is
quite realistic to assume that all cities have a positive growth (population decline does
not occur). Thus, θ < 1 is not reasonable, but still the untransformed basic bootstrap
interval includes such values. This can happen in bootstrap-t intervals as well. The
next methods, percentile and BCa intervals, do not suffer from this.
Now, from (1.21) and (1.22) we have that $\hat\phi_L$ is just the 0.025-quantile of the distribution of $\hat\phi_n^*$, and $\hat\phi_U$ is the 0.975-quantile of this distribution. Hence, given B bootstrap estimates $\hat\phi_n^{*1}, \dots, \hat\phi_n^{*B}$, a 95% CI for φ is given by $\big(\hat\phi^*_{(.025(B+1))},\ \hat\phi^*_{(.975(B+1))}\big)$. Since h is monotone, back-transforming gives the interval
$$\big(\hat\theta^*_{(.025(B+1))},\ \hat\theta^*_{(.975(B+1))}\big),$$
which is the bootstrap percentile interval. Hence, constructing this interval is very
simple! We don’t even need to know the transformation ϕ = h(θ ) when applying this
method. However, we do assume that h exists and leads to a pivot with a symmetric
distribution. This is a somewhat strong assumption which is not always satisfied and
implies that the method does not always work very well. Moreover, percentile inter-
vals are only first-order accurate, similarly to basic bootstrap intervals. However, the
method can be improved to second-order accuracy by some adjustments, which will
be described next.
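In R the percentile interval only requires selecting order statistics of the bootstrap replicates (theta_star and B = 999 as in the earlier sketches):

theta_sorted <- sort(theta_star)
B <- length(theta_star)
c(theta_sorted[round(0.025 * (B + 1))], theta_sorted[round(0.975 * (B + 1))])   # percentile interval
quantile(theta_star, c(0.025, 0.975))   # alternative: interpolated quantiles for arbitrary B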
Method 5: BCa percentile intervals
The assumption described in (1.20) is strong, since it implies that we can find a transformation such that (1) the estimate becomes unbiased and symmetrically distributed, and (2) its distribution does not change when moving from F to $\hat F_n$. These concerns were partially addressed when Efron proposed a bias-corrected version of the percentile interval, which was based on the assumption that $h(\cdot)$ exists such that
$$\hat\phi_n - \phi \sim \hat\phi_n^* - \hat\phi_n \sim N(-w\sigma, \sigma^2)$$
for some w.
for some w. This leads to so-called BC intervals, but it was shown that this still is a
strong assumption and that it hardly yields an improvement on the (basic) percentile
method.
Efron's next adjustment turned out to be more successful. He now assumed that $h(\cdot)$ exists such that
$$\text{(a)}\;\; \hat\phi_n \sim \phi + \sigma(\phi)(Z - w) \qquad\text{and}\qquad \text{(b)}\;\; \hat\phi_n^* \sim \hat\phi_n + \sigma(\hat\phi_n)(Z - w), \qquad Z \sim N(0,1), \qquad (1.24)$$
with $\sigma(\phi) = 1 + a\phi$. Here, w is called the bias-correction and a is called the acceleration factor. Allowing the variance of the sampling distribution to be non-constant and to depend on φ is a much weaker assumption that is much more realistic.
Suppose w and a are known. Then (a) in assumption (1.24) implies that
$$\hat\phi_n \sim \phi + (1 + a\phi)(Z - w).$$
By multiplying both sides by a, adding the constant 1 and taking the logarithm, we obtain
$$\log(1 + a\hat\phi_n) \sim \log(1 + a\phi) + \log\big(1 + a(Z - w)\big).$$
Inverting this relation, with Z replaced by its quantiles $\mp 1.96$, yields the 95% confidence limits
$$\{\hat\phi_L, \hat\phi_U\} = \hat\phi_n + \sigma(\hat\phi_n)\,\frac{w \pm 1.96}{1 - a(w \pm 1.96)}\,.$$
On the other hand, from (b) in (1.24) we have that the α-quantile of the distribution of $\hat\phi^*$ is
$$\hat\phi^*_\alpha = \hat\phi_n - w\,\sigma(\hat\phi_n) + \sigma(\hat\phi_n)\, z_\alpha, \qquad (1.25)$$
with $z_\alpha$ the α-quantile of $Z \sim N(0,1)$. Can we find values of α such that the quantile (1.25) matches with $\hat\phi_L$ and $\hat\phi_U$, respectively? Indeed, from
$$\hat\phi_n - w\,\sigma(\hat\phi_n) + \sigma(\hat\phi_n)\, z_\alpha = \hat\phi_n + \sigma(\hat\phi_n)\,\frac{w \pm 1.96}{1 - a(w \pm 1.96)}$$
it follows that the levels α should be defined by
$$z_\alpha = w + \frac{w \pm 1.96}{1 - a(w \pm 1.96)}.$$
This means that by choosing α as
$$\{\hat\alpha_L, \hat\alpha_U\} = \Phi\Big(w + \frac{w \pm 1.96}{1 - a(w \pm 1.96)}\Big),$$
we get that a 95% CI for φ based on B recalculated estimates is given by $\big(\hat\phi^*_{(\hat\alpha_L(B+1))},\ \hat\phi^*_{(\hat\alpha_U(B+1))}\big)$. Back-transforming then gives the 95% bias-corrected and accelerated (BCa) interval for θ:
$$\big(\hat\theta^*_{(\hat\alpha_L(B+1))},\ \hat\theta^*_{(\hat\alpha_U(B+1))}\big).$$
So, the BCa limits are just percentiles of $\hat\theta_n^{*1}, \dots, \hat\theta_n^{*B}$, but with adjusted level α. Again,
we do not need to know the transformation function h(.) to construct the intervals. It
can be shown that BCa intervals are generally second-order accurate. However, we
need to know w and a, so how can we obtain them?
The acceleration a: This parameter is more difficult to estimate. It can be shown that a good estimate is given by
$$\hat a = \frac{\sum_{i=1}^n \hat l_i^{\,3}}{6\big\{\sum_{i=1}^n \hat l_i^{\,2}\big\}^{3/2}},$$
with $\hat l_i = \tilde\theta_{n-1} - \hat\theta_{n-1}^{(i)}$ (any constant multiple of the $\hat l_i$ gives the same value of a), where $\hat\theta_{n-1}^{(i)}$ is the estimate of θ based on the original sample without observation i, known as the i-th jackknife estimate, and $\tilde\theta_{n-1}$ is the average of the n jackknife estimates.
To complete the example, we also give the percentile based 95% confidence inter-
vals for θ in the City data:
4. percentile: $(\hat\theta_L, \hat\theta_U) = (1.270, 2.129)$
Method        Order   TR
Normal        1st     -    inaccurate, but simple
Basic         1st     -    inaccurate, but simple (and pure bootstrap principle)
Bootstrap-t   2nd     -    accurate, but need $\hat\sigma$ and variance-stabilization
Percentile    1st     +    inaccurate, but simple (and intuitive)
BCa           2nd     +    accurate, but need $\hat a$
(TR = transformation-respecting)
Remark: the orders of accuracy above are meant for estimators which are approxi-
mately (i.e. asymptotically) normal. Otherwise, even theoretical consistency itself is
hard to show.
however, the rest remains the same (standard errors, the various confidence intervals,
. . . ). We explain this in more detail for the case of regression models.
Consider a data set of the form Xn = {(x1 , y1 ), . . . , (xn , yn )}, where xi is a p-variate
vector, and suppose we want to investigate the dependence of the y-variable on the
x-variable by assuming a linear regression model.
The aim of the regression analysis is then to estimate the unknown p-variate β from
the data Xn , and often also to find out which coefficients are different from zero. The
classical estimate $\hat\beta$ is the one obtained through least-squares minimization. The prob-
ability structure of the linear model is usually expressed as
$$y_i = x_i^t \beta + \epsilon_i \qquad (1.26)$$
where the error terms ϵi are assumed to be an i.i.d. sample from an unknown error
distribution F having expectation 0.
P = ( β, F ) =⇒ Xn = {(x1 , y1 ), . . . , (xn , yn )}
Hence, P is the probability distribution of the responses which depends on the error
distribution F and the parameter β. The distribution P depends also on the covariates
xi , which can be either fixed or random. In this approach we consider the xi to be fixed.
If we want to replace P by an estimate P̂, then we need (1) an estimate for β, and
(2) an estimate for F. The estimate for β we have immediately when estimating β for
the original data set. For F, we know that it is the underlying distribution of the errors
ϵi , so a natural estimate would be the empirical distribution Fn of the ϵi . However, the
$\epsilon_i$ are not available since β is unknown, but we can use $\hat\beta$ to compute the residuals
$$\hat\epsilon_i = y_i - x_i^t \hat\beta.$$
The bootstrap data-generating process can then be seen as
$$\hat P = (\hat\beta, \hat F_n) \Longrightarrow X_n^* = \{(x_1, y_1^*), \dots, (x_n, y_n^*)\}, \qquad y_i^* = x_i^t\hat\beta + \epsilon_i^*,$$
where the $\epsilon_i^*$ are drawn with replacement from the residuals $\hat\epsilon_1, \dots, \hat\epsilon_n$ (i.e. from their empirical distribution $\hat F_n$).
Note that since we are treating the xi as fixed, the resampled data are taken to have the
same design as the original data.
A quite different approach results from viewing the data as a random sample from
a p + 1-variate unknown distribution F:
F =⇒ Xn = {(x1 , y1 ), . . . , (xn , yn )}
which means that we consider both yi and xi as random variables. In fact, this keeps
us in the original, simple framework. Bootstrap samples are constructed by sampling
from the empirical distribution Fn , which just means drawing cases from the original
Xn with replacement (where by cases we mean the whole vectors $(x_i, y_i)$). So,
$$\hat F_n \Longrightarrow X_n^* = \{(x_1^*, y_1^*), \dots, (x_n^*, y_n^*)\},$$
with each pair $(x_i^*, y_i^*)$ drawn with replacement from the original cases.
In this approach, since the x-variable is just as random as the response y, the design in
the bootstrap samples will be different from the original sample. The bootstrap infer-
ence will thus reflect the uncertainty about the design.
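Both schemes are easy to program. The R sketch below uses a synthetic data frame dat with columns x and y standing in for a real data set; the lm fit plays the role of the least-squares estimate $\hat\beta$:

dat <- data.frame(x = runif(30, 50, 150))
dat$y <- 10 + 0.5 * dat$x + rnorm(30, sd = 5)      # synthetic regression data (illustration only)
fit <- lm(y ~ x, data = dat)                       # original LS fit
res <- residuals(fit); yfit <- fitted(fit); n <- nrow(dat)

# (a) error-resampling: keep the design fixed, resample residuals
dat_err <- dat
dat_err$y <- yfit + sample(res, n, replace = TRUE)
fit_err <- lm(y ~ x, data = dat_err)

# (b) case-resampling: resample whole cases (x_i, y_i)
idx <- sample(1:n, replace = TRUE)
fit_case <- lm(y ~ x, data = dat[idx, ])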
Example: The following plots show a regression dataset (pilot-plant data: y = acid
content determined by titration, x = acid content determined by extraction and weigh-
ing), with 5 random bootstrap samples based on the two approaches. Note that
error-resampling yields n different bootstrap observations every time, while case-
resampling generally produces multiple replicates of certain observations (which are
hidden in the plot due to overlapping). Note the gaps in the designs for case-
resampling.
• original + 5 bootstrap samples by ERROR-RESAMPLING
[Six scatter plots of the pilot-plant data (y = Titration): the original sample and 5 bootstrap samples obtained by error-resampling.]

• original + 5 bootstrap samples by CASE-RESAMPLING

[Six scatter plots of the pilot-plant data (y = Titration): the original sample and 5 bootstrap samples obtained by case-resampling.]
Resampling errors versus resampling cases
Example: To illustrate the second of these issues, the following plots show a regression
dataset where we have heteroscedasticity or non-constant error-variance (hormone as-
say data: y = measurement by new method, x = measurement by reference method).
Note how error-resampling causes the large residuals, corresponding to large x, to
end up at small values of x in the bootstrap samples. The specific x-dependence of
the error-variance is lost. Case-resampling seems a better choice, although the occa-
sional absence of that high leverage point may be disturbing in view of the third issue
above. Another possibility here would be to incorporate the non-constant variance in
the regression model, and use error-resampling based on this new model.
• original + 5 bootstrap samples by ERROR-RESAMPLING
[Six scatter plots of the hormone assay data (y = New method): the original sample and 5 bootstrap samples obtained by error-resampling.]

• original + 5 bootstrap samples by CASE-RESAMPLING

[Six scatter plots of the hormone assay data (y = New method): the original sample and 5 bootstrap samples obtained by case-resampling.]
Whatever approach you use (error or case resampling), once you have B bootstrap
samples you can re-compute the estimate $\hat\beta$ B times, and proceed as before to obtain
standard errors and confidence intervals for every component of β.
Remark 1: The approaches described here for linear models can be applied to generalized linear models as well. See e.g. Davison and Hinkley (1997).
Remark 2: There is substantial theory available for inference by least-squares (LS) es-
timation in linear regression, such that bootstrap would seem unnecessary. However,
this theory is based on the error distribution F being normal, which is an assumption
that cannot always be trusted. The bootstrap makes much weaker assumptions, lead-
ing to more reliable inference in many cases.
Remark 3: LS is just one possible method to estimate β. Many other estimators can
be used instead, such as Least-trimmed-squares (LTS), M- and S-estimators for ex-
ample. Such alternative (robust) regression estimators are increasingly popular, and
since normal-theory is much less available for these estimators than for LS, appropri-
ate bootstrap methods are an important inference tool in this context.
Apart from standard errors and confidence intervals, hypothesis testing is the main
application in statistical inference. The bootstrap can be used for testing, but things are
more complicated than for confidence intervals. The reason is that we need to estimate
the distribution of a certain test statistic under the null hypothesis H0 . And since we
do not know whether our data satisfies H0 , we cannot simply re-sample from the data.
The exact procedure depends on the type of test and type of data, but let us illustrate
some issues by means of an example.
Example: (Efron and Tibshirani, 1993) In a small experiment, 16 mice were randomly
assigned to a treatment group or a control group. The treatment was intended to
prolong survival after a test surgery. These are the observed survival times (in days)
after surgery:
Treatment: 94 197 16 38 99 141 23
Control: 52 104 146 10 51 30 40 27 46
Denote by x1,1 , . . . , x1,7 the observations in the Treatment group, having underlying
distribution F1 , and by x2,1 , . . . , x2,9 those in the Control group, having underlying
distribution F2 . We wonder if the treatment has had any significant effect. Formally,
we would like to test the hypothesis H0 : F1 = F2 .
(A) Testing H0 : F1 = F2
In case H0 holds, we expect the difference in mean survival time to be small; in case H0 is false we expect the difference to be large. Hence, a reasonable test statistic is
$$\hat\theta = \bar x_1 - \bar x_2 = 30.63.$$
Is this strong evidence against H0? To answer this, we need to know the distribution of $\hat\theta$ under H0. We can bootstrap, but simple (stratified) re-sampling from the data is not an option, since the data may or may not satisfy H0.
Under H0, both parts of the sample have a common distribution, say F0. In the absence of any knowledge of F0, the best way to estimate F0 is by the empirical distribution $\hat F_0$ of the data considered as one large sample of size n1 + n2 = 16. Hence, rewrite the data as
$$X_n = \{z_1, \dots, z_{16}\} = \{x_{1,1}, \dots, x_{1,7}, x_{2,1}, \dots, x_{2,9}\}.$$
Bootstrap samples are then constructed under H0 by drawing with replacement from $X_n = \{z_1, \dots, z_{16}\}$, yielding $X_n^* = \{z_1^*, \dots, z_{16}^*\}$. The result is then simply rewritten as
$$X_n^* = \{z_1^*, \dots, z_7^*, z_8^*, \dots, z_{16}^*\} \;\to\; \{x_{1,1}^*, \dots, x_{1,7}^*, x_{2,1}^*, \dots, x_{2,9}^*\}.$$
Constructing B bootstrap samples leads to $\hat\theta_1^*, \dots, \hat\theta_B^*$ and these give us an estimate of the distribution of the statistic under H0. Therefore, we can define the bootstrap p-value as
$$p_{\mathrm{boot}} = \frac{\#\{\hat\theta_b^* \ge \hat\theta\}}{B}.$$
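A sketch of this procedure in R, using the mouse data given above (B = 1000 is our choice):

treat <- c(94, 197, 16, 38, 99, 141, 23)
ctrl  <- c(52, 104, 146, 10, 51, 30, 40, 27, 46)
theta_hat <- mean(treat) - mean(ctrl)                 # observed statistic, 30.63
zall <- c(treat, ctrl)                                # pooled sample, estimating F0
n1 <- length(treat); n2 <- length(ctrl)
B <- 1000
theta_star <- replicate(B, {
  zs <- sample(zall, n1 + n2, replace = TRUE)         # bootstrap sample drawn under H0
  mean(zs[1:n1]) - mean(zs[(n1 + 1):(n1 + n2)])
})
p_boot <- mean(theta_star >= theta_hat)               # bootstrap p-value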
Instead of $\hat\theta = \bar x_1 - \bar x_2$ we could also use the studentized test statistic
$$\hat\theta = \frac{\bar x_1 - \bar x_2}{\hat\sigma\sqrt{1/n_1 + 1/n_2}},$$
where $\hat\sigma^2$ is the pooled variance of the two groups. This typically leads to similar results.
(B) Testing H0 : µ1 = µ2
What if we just want to test whether the two groups have equal means, without as-
suming anything about equal variances? Then, we can use the following test statistic:
$$\hat\theta = \frac{\bar x_1 - \bar x_2}{\sqrt{\hat\sigma_1^2/n_1 + \hat\sigma_2^2/n_2}}\,.$$
In the procedure described above, the bootstrap samples were constructed under the
assumption that the whole distribution was equal, not just the means. We have to
think of something different now.
This time we have to keep the two groups apart when re-sampling. We will adjust the data so that the groups have equal means: for example, define
$$\tilde x_{1,j} = x_{1,j} - \bar x_1 + \bar x \quad (j = 1,\dots,7), \qquad \tilde x_{2,j} = x_{2,j} - \bar x_2 + \bar x \quad (j = 1,\dots,9),$$
where $\bar x$ is the mean of the combined sample. Bootstrap samples
$$X_n^* = \{x_{1,1}^*, \dots, x_{1,7}^*, x_{2,1}^*, \dots, x_{2,9}^*\}$$
are then drawn group by group, with replacement, from
$$\tilde X_n = \{\tilde x_{1,1}, \dots, \tilde x_{1,7}, \tilde x_{2,1}, \dots, \tilde x_{2,9}\}.$$
For each bootstrap sample we recompute the test statistic, and the bootstrap p-value is again
$$p_{\mathrm{boot}} = \frac{\#\{\hat\theta_b^* \ge \hat\theta\}}{B}.$$
Consider now a one-sample problem, with just the data for the Treatment group.
Suppose some earlier experiment had found an average survival time of 129 days, and
suppose we want to test whether our observations have the same underlying mean.
So we want to test H0 : µ1 = 129.0. The obvious test statistic for this is
$$\hat\theta = \frac{\bar x_1 - 129.0}{\hat\sigma_1/\sqrt{n_1}}\,. \qquad (1.27)$$
This can be bootstrapped in the same way as in the test for H0 : µ1 = µ2 outlined
above: by translating the data first so that they have mean equal to 129.0.
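A sketch of the one-sample version in R (Treatment data, B = 1000; the one-sided direction of the p-value follows the formula above and should be adapted to the alternative of interest):

x1 <- c(94, 197, 16, 38, 99, 141, 23)
n1 <- length(x1)
theta_hat <- (mean(x1) - 129.0) / (sd(x1) / sqrt(n1))   # test statistic (1.27)
xt <- x1 - mean(x1) + 129.0                             # translate the data so that H0 holds
B <- 1000
theta_star <- replicate(B, {
  xs <- sample(xt, n1, replace = TRUE)
  (mean(xs) - 129.0) / (sd(xs) / sqrt(n1))
})
p_boot <- mean(theta_star >= theta_hat)                 # bootstrap p-value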
Connection with confidence intervals
There is an equivalent way to find the null distribution of (1.27), and that is by
simply bootstrapping from the un-translated data, and computing
$$\hat\theta^* = \frac{\bar x_1^* - \bar x_1}{\hat\sigma_1^*/\sqrt{n_1}}, \qquad (1.28)$$
with bootstrap p-value
$$p_{\mathrm{boot}} = \frac{\#\{\hat\theta_b^* \ge \hat\theta\}}{B}.$$
Both methods here are completely equivalent, and make the assumption that the dis-
tribution of the test statistic is pivotal, or translation-invariant. Just like the classical
t-test (which additionally assumes normality, of course).
A typical situation where this can be useful is for testing if a coefficient in a regression
model is significantly different from 0.
Summarized:
• The general idea is to adjust the data in some way such that it satisfies H0 , while
keeping much of the characteristics of the original data.
Further reading
A nice introduction to the bootstrap is given by Efron and Tibshirani (1993) while
an extensive overview of bootstrap methods is given in Davison and Hinkley (1997).
A more practical discussion can be found in e.g. Chernick (1999) while a mathematical
treatment of the subject can be found in Shao and Tu (1995).
Chapter 2
M-estimators
2.1 Robustness
All statistical methods rely on a number of assumptions. Typically two types of
assumptions occur. Distributional assumptions assume that the observations are gen-
erated from a specific distribution, often the Gaussian distribution. Classical linear
least squares regression for instance assumes that the errors are normally distributed
with constant variance. Secondly, even when such an explicit distribution is not re-
quired, classical statistical methods rely on the assumption that all observations are
generated from the same underlying process. However, in many real life applications
it is observed that data sets contain several atypical observations called outliers. These
are observations that deviate from the main pattern in the data.
Classical statistical methodology often turns out to be very sensitive to outliers.
A few deviating observations can have a huge impact on the results. On the other
hand, robust statistics constructs estimators that control the effects of outliers. Typically
robust methods give results similar to the classical estimates when there are no outliers, but keep giving appropriate results when outliers are present.
Three R packages with robust methods that will be frequently used are MASS (Ven-
ables and Ripley, 2002), robustbase (Rousseeuw et al., 2003) and rrcov (Todorov and
Filzmoser, 2009).
Consider a data set containing the logarithm of the annual incomes of 10 persons.
If the file annualincome.txt contains the data and is located in the working directory
of R, then provide some basic graphical displays (histogram, boxplot and normal qq-
plot).
> anninc <- scan("annualincome.txt")   # read the log incomes (assuming one value per line)
> boxplot(anninc,xlab="anninc")
> hist(anninc)
> qqnorm(anninc)
Figure 2.1 shows the resulting plots. From this graphical analysis, it is clear that
the log income of person 10 is quite large compared to the other log incomes.
[Figure 2.1: boxplot, histogram (Frequency) and normal QQ-plot (Sample Quantiles) of the log annual incomes.]
Now let us analyze the sample mean and the sample median.
> mean(anninc)
[1] 10.49203
> median(anninc)
[1] 9.978188
Observe that there is a large difference between both estimates. Now consider the data
set without observation 10, the outlier.
> anninc2 <- anninc[-10]   # the data set without observation 10 (the outlier)
> plot(anninc,rep(1,10),ylab="",xlab="log annual income",ylim=c(0.8,1.1))
> points(mean(anninc),0.95,pch=4,col="red")
> points(mean(anninc2),0.92,pch=4,col="red")
> points(median(anninc),0.89,pch=4,col="blue")
> points(median(anninc2),0.86,pch=4,col="blue")
> text(11.5,0.95,"mean with outlier",cex=0.8,col="red")
> text(10.9,0.92,"mean without outlier",cex=0.8,col="red")
> text(10.9,0.89,"median with outlier",cex=0.8,col="blue")
> text(10.9,0.86,"median without outlier",cex=0.8,col="blue")
Note that both the mean and the median of the sample without the outlier give ap-
proximately the same result as the median of the sample containing the outlier, but
a very different result from the mean of the sample containing the outlier. This illus-
trates that the sample mean is not a robust estimator: a single observation can have an
arbitrarily large effect on the result. For the median the influence of a single observation
is bounded. The sample median is an example of a robust estimator of location.
For univariate scale estimators the situation is similar. Consider the sample stan-
dard deviation s:
$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar x)^2.$$
Again an outlier can be arbitrarily influential: notice again the difference between using
the data set with or without the outlier:
> sqrt(var(anninc))
[1] 1.678832
> sqrt(var(anninc2))
[1] 0.2726027
For a robust scale estimator such as the Median Absolute Deviation (MAD), the difference is much smaller:
> mad(anninc)
[1] 0.2105529
> mad(anninc2)
[1] 0.1827097
At this point one might think that robust estimation could also be obtained by de-
tecting outliers and deleting them. However, it is important to realize that detection
of outliers is far from trivial. In the univariate case, one might suggest flagging points
as outliers when the distance to the mean is larger than 3 standard deviations. This is
based on the fact that for X ∼ N (0, 1), P(| X | > 3) = 0.003. Let us apply this rule to
the income data.
> round((anninc-mean(anninc))/sqrt(var(anninc)),digits=1)
[1] -0.6 -0.5 -0.2 -0.3 -0.2 -0.3 0.0 -0.3 -0.3 2.8
Observe that the standardized value of the outlier is actually smaller than 3! This is because the sample mean
is shifted and the sample variance is blown up due to the outlier. This is called the
masking effect: outliers can change non-robust estimates so dramatically, that outlier
detection using these estimates fails. However, when using the robust estimates in the
outlier detection rule, we do find that observation 10 is very far away from the other
observations.
> round((anninc-median(anninc))/mad(anninc),digits=1)
[1] -2.2 -1.4 0.9 -0.1 0.5 0.1 2.4 -0.3 -0.3 24.9
This illustrates the importance of using a robust method in order to detect outliers.
In the remainder of this chapter we will study the theory of M-estimators, which form a well-established class of robust estimators. We start with univariate estimation of
location and scale as in the example above and then continue with general properties
of M-estimators.
2.2 Univariate parametric models
We assume that we have a sample { x1 , . . . , xn } of n i.i.d. univariate random vari-
ables Xi . A parametric model assumes that Xi ∼ Fθ with { Fθ : θ ∈ Θ} a set of distri-
bution functions. We will assume that the distribution has a density f θ ( x ) = Fθ′ ( x ). An
important goal of statistics is to find a good estimator θ̂n of θ, that is, a real function of
the sample θ̂n ( x1 , . . . , xn ) such that θ̂n is close to the true parameter θ.
A common measure of quality is the mean squared error
MSE(θ̂_n) = E[(θ̂_n − θ)^2] = bias(θ̂_n)^2 + Var(θ̂_n),
with
bias(θ̂_n) = E[θ̂_n] − θ
and
Var(θ̂_n) = E[(θ̂_n − E[θ̂_n])^2].
An estimator is called unbiased if its bias is zero. For unbiased estimators the MSE equals the variance.
The Cramér-Rao information inequality states that for any unbiased estimator θ̂n
we have that
Var(θ̂_n) ⩾ \frac{1}{n I_θ},
where I_θ denotes the Fisher information of a single observation. Thus
0 ⩽ eff(θ̂_n) := \frac{1}{n \, Var(θ̂_n) \, I_θ} ⩽ 1.
The percentage eff(θ̂n ) is called the (finite sample) efficiency of θ̂n . If the effi-
ciency equals 1, then θ̂n has the smallest possible variance (and hence the small-
est possible MSE) of all unbiased estimators.
• An estimator is called consistent if for all θ ∈ Θ it holds that θ̂n → p θ. Here → p
denotes convergence in probability, that is, for any ε > 0 it holds that P(|θ̂n − θ | ≥
ε) → 0.
• An estimator is called asymptotically normal if √n(θ̂_n − θ) converges in distribution to N(0, V_θ) for some asymptotic variance V_θ. The asymptotic efficiency is then
0 ⩽ eff(θ̂_n) := \frac{1}{V_θ I_θ} ⩽ 1.
• The asymptotic relative efficiency of two estimators θ̂_A and θ̂_B is
ARE(θ̂_A, θ̂_B) = \frac{V_{θ̂_B}}{V_{θ̂_A}},
considered as a function of θ.
• The Maximum Likelihood Estimator (MLE) is the estimator maximizing the like-
lihood function:
θ̂_n = argmax_θ L(x_1, . . . , x_n; θ).
2.2.1 Univariate location model
The univariate location model assumes that the observations are i.i.d. with distribution function
F_µ(x) = F(x − µ),   (2.1)
where −∞ < µ < +∞ is the unknown location parameter and F is a continuous distribution with density f, hence f_µ(x) = F′_µ(x) = f(x − µ). (Note that F_µ is only a model
for the uncontaminated data. We do not model outliers.)
The goal is to find a good estimator µ̂n of µ. Most location estimators satisfy the
following desirable properties: the estimator is called location equivariant if for all b ∈ R
it holds that
µ̂n ( x1 + b, . . . , xn + b) = µ̂n ( x1 , . . . , xn ) + b
Consider for example the normal location model, where F = Φ and thus X_i ∼ N(µ, σ^2). The classical estimator of µ is the sample mean
X̄ = \frac{1}{n} \sum_{i=1}^{n} X_i,
while a robust alternative is the sample median, i.e. the middle order statistic X_{((n+1)/2)} for odd n (and the average of the two middle order statistics for even n), with X_{(i)} the ith order statistic. Under the assumption of normality the sample median
has a lower efficiency than the sample mean, namely 2/π ≈ 64%. But the situation
may be reversed if the model distribution F has longer tails.
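As a quick illustration of this efficiency statement, the finite-sample relative efficiency of the median with respect to the mean can be approximated by a small simulation. The sketch below is illustrative only (sample size, number of replications and the seed are arbitrary choices):
set.seed(1)
B <- 5000; n <- 100
est <- replicate(B, {x <- rnorm(n); c(mean(x), median(x))})
var(est[1, ]) / var(est[2, ])   # ratio of variances, close to 2/pi = 0.64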
2.2.2 Scale model
The univariate scale model assumes that the observations are i.i.d. with distribution function
F_σ(x) = F(x/σ),   (2.2)
where σ > 0 is the unknown scale parameter. As before F is a continuous distribution
with density f , but now
f_σ(x) = F′_σ(x) = \frac{1}{σ} f\left(\frac{x}{σ}\right).
Most scale estimators σ̂_n are location invariant, i.e. for all b ∈ R it holds that
σ̂_n(x_1 + b, . . . , x_n + b) = σ̂_n(x_1, . . . , x_n).
Other well-known scale estimators include the Inter Quartile Range (IQR), i.e. the difference between the third and the first sample quartile, and the Median Absolute Deviation (MAD),
MAD(x_1, . . . , x_n) = med_i |x_i − med_j x_j|,
which is usually normalized to be consistent at the normal model:
MADN(x_1, . . . , x_n) = \frac{MAD(x_1, . . . , x_n)}{Φ^{-1}(0.75)}.
The factor (Φ−1 (0.75))−1 = 1.4826 is called a consistency factor. Some people prefer
the notation MAD to include this factor, others do not. When using software one
should watch out which convention is used. In R, mad refers to the version with the
factor included.
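This convention can be checked directly on the income data: the argument constant of mad() controls the consistency factor (this is standard base-R behaviour, not code from the original example):
> mad(anninc)                 # normalized MAD, factor 1.4826 included
> mad(anninc, constant = 1)   # raw median absolute deviation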
2.3 The sensitivity curve
Consider for example the location model. The following function computes the sensi-
tivity curve of the sample mean for a given sample on a grid between −10 and 10.
A similar function can be written for the sample median. Both functions can be drawn
as follows.
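The original listing is not reproduced here; a minimal sketch could look as follows, assuming the standard sensitivity-curve definition SC_n(z) = (n + 1){T(x_1, . . . , x_n, z) − T(x_1, . . . , x_n)} (check this against the definition used earlier) and an arbitrary simulated sample:
sc <- function(x, z, stat = mean) {
  n <- length(x)
  (n + 1) * (stat(c(x, z)) - stat(x))   # effect of adding a single observation z
}
set.seed(1)
x    <- rnorm(20)
grid <- seq(-10, 10, length.out = 201)
plot(grid, sapply(grid, sc, x = x, stat = mean), type = "l",
     xlab = "z", ylab = "SC(z)")
lines(grid, sapply(grid, sc, x = x, stat = median), lty = 2)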
The sample mean is clearly more sensitive to outliers than the sample median. Its
sensitivity curve is unbounded as |z| increases. The sensitivity curve of the sample
median is bounded, reflecting the fact that the addition of a single point z to a sample
cannot have an arbitrarily large impact.
2.4 The finite-sample breakdown value
If the sensitivity curve is bounded, then one observation cannot have an arbitrarily
large impact on the estimator. But what happens when more than one point is altered?
To measure this, the finite-sample breakdown value (Donoho and Huber, 1983) plays
an important role.
In this situation:
ε^∗_n(µ̂_n, x) ⩽ \frac{1}{n} \left\lfloor \frac{n+1}{2} \right\rfloor ≈ \frac{1}{2}.   (2.8)
In the univariate scale model we have that Θ =]0, ∞[. The boundary of Θ is the
pair {0, +∞}. The breakdown value with respect to 0 is called the implosion breakdown
value. The one with respect to +∞ is called the explosion breakdown value. In this setting:
• The breakdown value of the sample standard deviation equals 1/n because it
explodes easily.
ε^∗_n(σ̂_n, x) ⩽ \frac{⌊n/2⌋}{n} ≈ \frac{1}{2}.
2.5 M-estimators of univariate location
Let ρ be an even and nondecreasing function of | x |, with ρ(0) = 0. If ρ is differen-
tiable, we define ψ = ρ′ . We then have that ψ is odd and ψ( x ) ⩾ 0 for x ⩾ 0. For the
location model, an M-estimator (Huber, 1964) is defined as
µ̂_n = argmin_µ \sum_{i=1}^{n} ρ(x_i − µ).   (2.9)
If ρ is differentiable, the minimizer satisfies the estimating equation
\sum_{i=1}^{n} ψ(x_i − µ̂_n) = 0.   (2.10)
Examples:
• ρ(x) = − log f(x). Then minimizing (2.9) amounts to maximizing the log-likelihood \sum_i \log f(x_i − µ), and thus µ̂_n equals the MLE of µ. M-estimators are thus generalizations of MLE-estimators.
• ρ(x) = x^2/2. Then ψ(x) = x and we have that
\sum_{i=1}^{n} (x_i − µ̂_n) = 0,
which means that µ̂n = x̄. Thus the sample mean is an M-estimator. At the
normal model, f ( x ) = ϕ( x ), it is the MLE.
• ρ(x) = |x|. Then ψ(x) = sign(x) and µ̂_n equals the sample median.
• The Huber function with parameter b > 0:
ρ(x) = x^2/2 if |x| ⩽ b,   ρ(x) = b|x| − b^2/2 if |x| > b,
and
ψ(x) = x if |x| ⩽ b,   ψ(x) = b·sign(x) if |x| > b.
Note that the limiting cases b → ∞ and b → 0 correspond to the mean and the
median.
• The Tukey bisquare (or biweight) function with parameter c > 0:
ρ(x) = \frac{c^2}{6}\left[1 − \left(1 − \frac{x^2}{c^2}\right)^3\right] if |x| ⩽ c,   ρ(x) = \frac{c^2}{6} if |x| > c,   (2.11)
and
ψ(x) = x \left(1 − \frac{x^2}{c^2}\right)^2 I(|x| ⩽ c).
The graphs of these ρ and ψ are depicted in Figure 2.2. For the mean both func-
tions are unbounded. For the median and the Huber M-estimator ψ is increasing and
bounded and ρ is unbounded. Such M-estimators are called monotone. One can prove
that the optimization problem in (2.9) is then a convex optimization problem and no lo-
cal minima exist. This is a big computational advantage. For the bisquare M-estimator
both ρ and ψ are bounded. Moreover ψ is not increasing: for large x the function ψ( x )
decreases. Such M-estimators are called redescending. Computationally this yields a
more difficult optimization since local minima can arise. This means that some solu-
tions of (2.10) may not correspond to the global minimum of (2.9).
You can check that location M-estimators as defined by (2.9)-(2.10) are location
equivariant, but not scale equivariant. M-estimators of location can achieve a high
breakdown value. If ψ is odd, bounded and nondecreasing, they attain the maximal
breakdown value (2.8). This implies that the Huber location M-estimator has maximal
breakdown value for all b < ∞.
It turns out that properties of M-estimators are determined by the ψ-function
in (2.10) rather than the loss function ρ in (2.9). Therefore, M-estimators are often
defined directly as solutions of an estimating equation as in (2.10). This is the common
approach for M-estimators of scale.
Figure 2.2 The ρ and ψ functions of the mean, the median, the Huber
estimator with b = 1.345, and the bisquare M-estimator with c = 4.68.
2.6 M-estimators of scale
Assume the scale model from (2.2). An M-estimator of scale satisfies an equation
of the form
\frac{1}{n} \sum_{i=1}^{n} ψ\left(\frac{x_i}{σ̂_n}\right) = 0.   (2.12)
Note that the factors x_i/σ̂_n are standardized deviations from the center of the distribution, which is 0 in this scale model. For symmetric distributions, it is natural that the contribution of a discrepancy is the same no matter whether x_i lies in the left or the right tail of the distribution. Hence, a natural requirement for ψ is to use an even function, and thus ρ functions as defined above are ideal choices. However, since ρ(t) ≥ 0 for all values of t, it is clear that generally (1/n) \sum_{i=1}^{n} ρ(x_i/σ̂_n) > 0. Therefore, we set ψ(t) = ρ(t) − δ, where the constant δ can be chosen to make the estimator σ̂_n consistent at a particular scale model, usually F_σ = N(0, σ^2). Expression (2.12) can now be
rewritten as
\frac{1}{n} \sum_{i=1}^{n} ρ\left(\frac{x_i}{σ̂_n}\right) = δ   (2.13)
for some 0 < δ < ρ(∞). Usually we set δ = \int_{\mathbb{R}} ρ(t) f(t)\,dt, which makes σ̂_n consistent.
Examples:
• ρ(x) is the bisquare function (2.11) and δ = ρ(∞)/2. The resulting σ̂_n is called the
bisquare scale M-estimator.
• ρ(x) = I(|x| > c) with c > 0 and δ = 0.5; then σ̂_n = \frac{1}{c} \, median_i(|x_i|).
• As in the case of the MLE, we can also take the ψ function of a location M-
estimator and construct an M-estimator of scale with the ρ function ρ( x ) =
xψ( x ).
Check that the scale M-estimators given by (2.13) are scale equivariant but not location
invariant.
2.7 M-estimators of univariate location with unknown
scale
The general location-scale model assumes that the xi are i.i.d. according to
F_{(µ,σ)}(x) = F\left(\frac{x − µ}{σ}\right)
where −∞ < µ < +∞ is the location parameter and σ > 0 is the scale parameter. In
this general model both µ and σ are assumed to be unknown, which is realistic. The
density is now
f_{(µ,σ)}(x) = F′_{(µ,σ)}(x) = \frac{1}{σ} f\left(\frac{x − µ}{σ}\right).
In this general situation we can still estimate location by the median and scale by
MADN which possess the required equivariance and invariance properties. But the
location M-estimators defined by (2.9) are not scale equivariant, so they depend on the
measurement units. If σ were known, the natural solution would be to divide xi − µ
in the formulas by this σ:
µ̂ = argmin_µ \sum_{i=1}^{n} ρ\left(\frac{x_i − µ}{σ}\right)   (2.14)
and thus
\sum_{i=1}^{n} ψ\left(\frac{x_i − µ̂}{σ}\right) = 0.   (2.15)
(Note that we have dropped the subscript n from µ̂ and σ̂ for simplicity.)
Since σ is unknown in practice, a first approach is to replace it by a preliminary robust scale estimate σ̂ and to solve
µ̂ = argmin_µ \sum_{i=1}^{n} ρ\left(\frac{x_i − µ}{σ̂}\right)   (2.16)
or, equivalently when ρ is differentiable,
\sum_{i=1}^{n} ψ\left(\frac{x_i − µ̂}{σ̂}\right) = 0.   (2.17)
Naturally, σ̂ should be a robust estimator of σ (e.g. the MADN) if one wants µ̂ to be a robust estimator of µ. For monotone M-estimators with a bounded
and odd ψ, the breakdown value of the location estimate µ̂ can be shown to be
equal to the breakdown value of the scale estimate σ̂, thus resulting in a maximal
breakdown value when the MADN is chosen. Using the standard deviation as
scale estimate would result in a zero breakdown value. For redescending M-
estimators, the breakdown value depends on the data. For the bisquare estimator
with the MADN as preliminary scale estimate, the breakdown value is 1/2 for
all practical purposes.
A second approach is to estimate µ and σ simultaneously, combining the location equation with a scale M-equation of the form
\frac{1}{n} \sum_{i=1}^{n} ρ_{scale}\left(\frac{x_i − µ̂}{σ̂}\right) = δ,   (2.19)
where we may choose ρscale differently from ρlocation . This approach is more com-
plicated than the first, and unfortunately it yields less robust estimators, so the
first approach is preferable.
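In R, the first approach (Huber location with the MAD as preliminary scale) and the simultaneous approach ("Huber's Proposal 2") are readily available in the MASS package; the sketch below assumes the anninc data from the beginning of this chapter are still loaded:
> library(MASS)
> huber(anninc, k = 1.345)   # location M-estimate with MAD scale (first approach)
> hubers(anninc)             # simultaneous Huber estimates of location and scale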
2.8 Computation of M-estimators
2.8.1 Computing an M-estimator of location
Here we follow the first approach above so we first compute a scale estimate σ̂,
typically by MADN. To solve (2.17) an iterative reweighting algorithm can be used.
Based on the ψ function for location, define
W(x) = ψ(x)/x   if x ≠ 0,
W(x) = ψ′(0)    if x = 0,   (2.20)
where the second line is just L’Hopital’s rule for x → 0. Then (2.17) can be written as
\sum_{i=1}^{n} \frac{x_i − µ̂}{σ̂} \, W\left(\frac{x_i − µ̂}{σ̂}\right) = 0.
Solving for µ̂ yields
µ̂ = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}   (2.21)
with wi = W (( xi − µ̂)/σ̂). Note that this is still an implicit equation, as the wi depend
on µ̂ itself. This suggests the following algorithm:
1. Compute σ̂ (e.g. the MADN) and take a robust starting value µ̂_0 (e.g. the sample median).
2. For k = 0, 1, 2, . . . compute the weights w_{k,i} = W((x_i − µ̂_k)/σ̂).
3. Set µ̂_{k+1} = \frac{\sum_{i=1}^{n} w_{k,i} x_i}{\sum_{i=1}^{n} w_{k,i}}.
4. Iterate steps 2–3 until the change |µ̂_{k+1} − µ̂_k| is negligible.
Because a weighted mean like (2.21) is a special case of weighted least squares, this al-
gorithm is called iteratively reweighted least squares (IRLS). If W ( x ) is bounded and non-
increasing for x > 0, the sequence µ̂k is guaranteed to converge to a solution of (2.17).
The weight functions for the Huber estimator and the Tukey bisquare estimator are
shown in Figure 2.3. We see that the weights for the Huber estimator converge to zero
at the rate 1/x, whereas the Tukey estimator assigns zero weight to large outliers.
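A bare-bones implementation of this algorithm for the Huber ψ could look as follows; the tolerance, maximum number of iterations and tuning constant are illustrative choices, and the result can be compared with MASS::huber():
huber_loc <- function(x, b = 1.345, tol = 1e-8, maxit = 100) {
  s  <- mad(x)        # MADN (R's mad already includes the factor 1.4826)
  mu <- median(x)     # robust starting value
  W  <- function(u) ifelse(abs(u) <= b, 1, b / abs(u))   # Huber weights psi(u)/u
  for (k in seq_len(maxit)) {
    w      <- W((x - mu) / s)
    mu_new <- sum(w * x) / sum(w)
    if (abs(mu_new - mu) < tol * s) break
    mu <- mu_new
  }
  mu_new
}
huber_loc(anninc)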
Figure 2.3 Weight functions for the Huber and Tukey M-estimators
2.8.2 Computing an M-estimator of scale
For a scale M-estimator, define W(x) = ρ(x)/x^2 for x ≠ 0. Equation (2.13) can then be rewritten as
σ̂^2 = \frac{1}{nδ} \sum_{i=1}^{n} W\left(\frac{x_i}{σ̂}\right) x_i^2,
which again suggests an iteratively reweighted algorithm: starting from an initial estimate σ̂_0 (e.g. the MADN), repeatedly update σ̂ by the right-hand side until convergence.
2.8.3 Computing simultaneous M-estimates
When location and scale are estimated simultaneously, both estimates are updated in each iteration:
1. Take robust starting values µ̂_0 (e.g. the median) and σ̂_0 (e.g. the MADN).
2. For k = 0, 1, . . . compute the weights w1_{k,i} = W_1((x_i − µ̂_k)/σ̂_k) and w2_{k,i} = W_2((x_i − µ̂_k)/σ̂_k), where W_1 and W_2 are the weight functions for location and scale, respectively.
3. Set
µ̂_{k+1} = \frac{\sum_{i=1}^{n} w1_{k,i} x_i}{\sum_{i=1}^{n} w1_{k,i}}   and   σ̂_{k+1} = \sqrt{\frac{1}{nδ} \sum_{i=1}^{n} w2_{k,i} (x_i − µ̂_k)^2}.
4. Stop when |µ̂_{k+1} − µ̂_k| < ε σ̂_{k+1} and |σ̂_{k+1}/σ̂_k − 1| < ε.
2.9 Other robust estimators of univariate location and
scale
The trimmed mean with parameter 0 ⩽ α < 0.5 and m = ⌊(n − 1)α⌋ is defined as
µ̂_α = \frac{1}{n − 2m} \sum_{i=m+1}^{n−m} x_{(i)}.
Its breakdown value is (m + 1)/n.
The least median of squares (LMS) location estimator (Rousseeuw, 1984) is defined as
µ̂ = argmin_µ \, med_i (x_i − µ)^2.
It equals the midpoint of the shortest ‘half’. This is the subsample containing h =
⌊n/2⌋ + 1 observations that possesses the shortest range. Its breakdown value is max-
imal. The length of the shortest half yields a robust estimator of scale.
The least trimmed squares (LTS) location estimator (Rousseeuw, 1984) is defined as
µ̂ = argmin_µ \sum_{i=1}^{h} (x − µ)^2_{(i)},
where (x − µ)^2_{(i)} is the ith order statistic of the set {(x_i − µ)^2 ; i = 1, . . . , n}, so the
differences are first squared and then ordered. The LTS is the mean of the subsample
with h observations that has the smallest standard deviation. Its breakdown value is
maximal for h = ⌊n/2⌋ + 1. The standard deviation of the ’best’ subsample yields a
robust scale estimator.
Many scale estimators such as Stdev and MAD require the estimation of a center
as a first step. The IQR does not use centering. A more robust scale estimator that
does not require centering is the Qn estimator (Rousseeuw and Croux, 1993). This
scale estimator is essentially the first quartile of all distances between two data points.
More precisely, the estimator is defined as
Q_n = 2.2219 \, \{|x_i − x_j| ; i < j\}_{(k)}   (2.23)
with k = \binom{h}{2} ≈ \binom{n}{2}/4 and h = ⌊n/2⌋ + 1. The breakdown value of Q_n is 50% whereas its
statistical efficiency at the normal model is 82%.
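Some of these estimators are directly available in R; the calls below use the income data again (the trimming convention of mean() may differ slightly from the definition of m above, and Qn() comes from the robustbase package cited earlier):
> mean(anninc, trim = 0.1)   # 10%-trimmed mean
> library(robustbase)
> Qn(anninc)                 # Qn scale estimator (50% breakdown, 82% efficiency)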
2.10 Functional definition of M-estimators
To study the asymptotic properties of M-estimators, we first give a general defini-
tion of M-estimators as statistical functionals.
A function T is called a statistical functional with domain DT , where DT is a subset
of the set of all distribution functions, iff
These conditions guarantee that a statistical functional can be computed at any sample
distribution Fn .
A statistical functional thus defines an estimator Tn through Tn := T ( Fn ). If T is
sufficiently regular, it often holds that T ( Fn ) → p T ( F ) and thus it makes sense to use
T ( Fn ) as an “estimator” of T ( F ). If our interest is in estimating the parameter θ of a
parametric family { Fθ ; θ ∈ Θ}, we call T Fisher-consistent iff T ( Fθ ) = θ. In that case it
is reasonable to estimate the parameter θ by T ( Fn ).
Examples:
• The mean is defined by the statistical functional T(F) = E_F[X]. At the sample distribution F_n,
T(F_n) = E_{F_n}[X] = \int x \, dF_n(x) = \frac{1}{n} \sum_{i=1}^{n} x_i.
From the law of large numbers we know that T(F_n) →_p T(F). Under the location model
X_i ∼ F_µ   where   F_µ(x) = F(x − µ),
we have T(F_µ) = E_{F_µ}[X] = µ + E_F[X], which equals µ iff E_F[X] = 0. Thus for a location model the mean is Fisher consistent iff the first moment of F equals 0. In that case T(F_n) →_p T(F_µ) = µ, so the sample mean is a consistent estimator for the parameter µ.
• The median is a statistical functional defined by T(F) = F^{-1}(1/2) = inf{x : F(x) ⩾ 1/2}.
Note that T(F) is well-defined for any F and thus its domain D_T is the set of all
probability distributions. For odd n, the corresponding estimator is the sample
median. If F is strictly increasing, one can prove that T ( Fn ) → p T ( F ). Under
the location model, T is Fisher consistent iff T ( F ) = 0. Thus if the median of F
equals 0, the sample median is a consistent estimator for µ.
M-estimators
M-estimators can be studied as statistical functionals as well. For a given function
Ψ(x, θ), define the functional T as the solution of the implicit equation
E_F[Ψ(X, T(F))] = 0.   (2.24)
At the sample distribution F_n this becomes
\frac{1}{n} \sum_{i=1}^{n} Ψ(x_i, T(F_n)) = 0.   (2.25)
For a location M-estimator, for instance, Ψ : R^2 → R : (u, v) ↦ ψ(u − v).
2.11 Consistency of M-estimators
For a general M-estimator the convergence of T ( Fn ) to T ( F ) is established under
some conditions on the function Ψ (see Huber, 1964). Theorem 2.1 states that, under these conditions (which include monotonicity of Ψ in its second argument), if T(F) is unique,
T ( Fn ) → p T ( F )
for the functional T defined by EF [Ψ( X, T ( F ))] = 0.
In the univariate location model one can check that the conditions are satisfied for
the mean, median, Huber M-estimator etc. Note that for redescending M-estimators
such as the bisquare, condition (ii) is not satisfied and thus the preceding theorem does
not apply immediately. However, a similar result can be proven in these cases as well,
under some extra technical conditions.
For a location M-estimator at a symmetric model we have E_{F_µ}[ψ(X − µ)] = \int ψ(x − µ) f(x − µ) \, dx = \int ψ(t) f(t) \, dt = 0, since ψ is odd and f is even. Therefore the parameter µ solves equation (2.24) defining the M-estimator, so it holds that T(F_µ) = µ, which proves Fisher consistency. For a scale M-estimator, we have that
E_{F_σ}\left[ψ\left(\frac{X}{σ}\right)\right] = E_{F_σ}\left[ρ\left(\frac{X}{σ}\right)\right] − δ = \int_{\mathbb{R}} ρ\left(\frac{x}{σ}\right) f_σ(x) \, dx − δ = \int_{\mathbb{R}} ρ(t) f(t) \, dt − δ.
2.12 Asymptotic normality of M-estimators
In statistics one is not only interested in consistency, but also in the rate of conver-
gence. Fortunately, many estimators are asymptotically normal. For M-estimators the
following theorem can be proven.
Theorem 2.2. Suppose that E[Ψ(X, T(F))^2] < ∞ and that the derivative of Ψ(x, θ) with respect to its second argument exists. Denote this derivative as the function Ψ̇. Assume that the conditions in Theorem 2.1 are satisfied and that E_F[Ψ̇(X, T(F))] is nonnull. Then it holds that
√n (T(F_n) − T(F)) → N(0, V(T, F))
with
V(T, F) = \frac{E_F[Ψ(X, T(F))^2]}{(E_F[Ψ̇(X, T(F))])^2}   (2.26)
for T defined by E_F[Ψ(X, T(F))] = 0.
• For the median we have Ψ(x, θ) = sign(x − θ). In this case the derivative of Ψ does not exist since the sign function is discontinuous in 0. However, one can write this derivative with a delta function: if a function g is discontinuous at a point u with left limit g(u⁻) and right limit g(u⁺), we can write the derivative of g in u as (g(u⁺) − g(u⁻)) δ_u. With this trick it is easy to see that
E_{F_µ}[Ψ̇(X, T(F_µ))] = −2 f(0).
At the normal model this means that the asymptotic variance equals π/2 ≈ 1.57 and is thus larger than that of the mean. This shows that the median is less efficient than the mean at the normal model, with ARE = 2/π ≈ 0.64. However, the opposite holds at some longer-tailed distributions.
2.13 The influence function
In classical statistics the focus is on the analysis of the approximation of T ( F ) by
T ( Fn ). In robust statistics we are also interested in the behavior of T when the distri-
bution F is slightly changed. To this end we consider a “neighborhood” around F of
contaminated distributions Fε,H defined as
Nε = { Fε,H = (1 − ε) F + εH : H ∈ H}
with H a suitable set of distributions. Intuitively, the value ε represents the percent-
age of outliers and the distribution H represents the distribution of these outliers. The
idea is to examine the difference between T ( F ) and T ( Fε,H ). For a robust estimator this
difference should not be too large. One can measure this difference in several ways. A
first concept is the influence function (Hampel, 1974; Hampel et al., 1986).
IF(z; T, F) = \lim_{ε ↓ 0} \frac{T(F_{ε,∆_z}) − T(F)}{ε}   (2.27)
(if the limit exists), with ∆z the discrete distribution putting all its probability mass
at z. The influence function thus measures the difference between applying T to the
‘clean’ distribution F and to the contaminated distribution Fε,∆z in the limit ε ↓ 0. Often
it can be interpreted as a limit case of the sensitivity curve (2.5) too.
The supremum of the influence function over all z,
γ^∗(T, F) = \sup_z |IF(z; T, F)|,
is called the gross error sensitivity (GES). In general, a lower GES indicates better robustness.
The following three properties hold under certain regularity conditions (without
proof):
1. If G ≈ F: T(G) ≈ T(F) + \int IF(z; T, F) \, dG(z).
2. \int IF(z; T, F) \, dF(z) = 0.
3. √n (T_n − T(F)) → N(0, V(T, F)) with V(T, F) = \int IF(z; T, F)^2 \, dF(z).
For an M-estimator satisfying (2.24), we have that
IF(z; T, F) = \frac{Ψ(z, T(F))}{−E_F[Ψ̇(X, T(F))]}.
Therefore, the influence function of an M-estimator is simply proportional to the func-
tion Ψ.
• For the mean under the location model, IF(z; T, F_µ) = z − µ, which is unbounded, so γ^∗ = ∞.
• For the median under the location model, IF(z; T, F_µ) = sign(z − µ)/(2 f(0)).
The GES thus equals (2 f(0))^{-1}. This illustrates that the median is more robust than the mean.
• For the Huber M-estimator with parameter b > 0, we find IF(z; T, F ) = ψ(z −
µ)/(2F (b) − 1) at a symmetric F. For the unit variance normal location model
this yields:
b          γ^∗(T, F)    eff(T)
b → 0        1.25        0.64
b = 1.5      1.73        0.96
b → ∞         ∞           1
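The entries of this table can be reproduced numerically from the formulas above (GES = b/(2F(b) − 1) and asymptotic variance (2.26)); the sketch below is a simple check at the standard normal model:
huber_ges_eff <- function(b) {
  psi   <- function(x) pmax(pmin(x, b), -b)               # Huber psi
  Epsi2 <- integrate(function(x) psi(x)^2 * dnorm(x), -Inf, Inf)$value
  Edpsi <- 2 * pnorm(b) - 1                               # E[psi'(X)] = P(|X| <= b)
  c(GES = b / Edpsi, eff = Edpsi^2 / Epsi2)               # eff = 1/V, V as in (2.26)
}
huber_ges_eff(1.5)     # approximately 1.73 and 0.96
huber_ges_eff(1.345)   # the common 95%-efficiency tuning constant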
2.14 The asymptotic breakdown value
For a parametric model { Fθ , θ ∈ Θ} with parameter space Θ, the asymptotic break-
down value (Hampel, 1971) of a statistical functional T at a distribution F is given by
ε^∗(T, F) = inf{ε > 0 : T(F_{ε,H}) can be arbitrarily close to the boundary of Θ}.   (2.28)
Intuitively this represents the minimum amount of contamination that is needed to de-
stroy the result of the functional T. Consider for example the location model. Then the
parameter space is R and its boundary is the pair {−∞, +∞}. Therefore the formula
of the breakdown value for the location model becomes
ε^∗(T, F) = inf{ε > 0 : \sup_{H} |T(F_{ε,H})| = ∞}.
In the univariate scale model (2.2), the parameter space is ]0, +∞[ with boundary points {0, +∞}. One defines the explosion breakdown value as
ε^∗_{exp}(T, F) = inf{ε > 0 : \sup_{H} T(F_{ε,H}) = ∞}
and the implosion breakdown value as
ε^∗_{imp}(T, F) = inf{ε > 0 : \inf_{H} T(F_{ε,H}) = 0}.
Often the asymptotic breakdown value is the limit of the finite sample breakdown
value of (2.6):
ε∗n ( T ( Fn ), Fn ) → ε∗ ( T, F ).
Examples:
• The asymptotic breakdown value of the mean is 0. Indeed, for every ε > 0,
\sup_{z} |E_{F_{ε,∆_z}}[X]| = ∞
and consequently
\sup_{H} |T(F_{ε,H})| = ∞.
• The highest possible asymptotic breakdown value for location equivariant loca-
tion estimators is 0.5.
• Any location M-estimator with nondecreasing, bounded and odd ψ function at-
tains this maximal breakdown value of 0.5. For nondecreasing ψ functions that
are not odd but satisfy ψ(−∞) = −k1 > −∞ and ψ(+∞) = k2 < ∞ it can be
proved that
ε^∗(T, F) = \frac{\min(k_1, k_2)}{k_1 + k_2}.
• The asymptotic breakdown value of the Stdev is 0, since the explosion break-
down value equals 0.
• For an M-estimator of scale (2.13) with even function ρ, the asymptotic break-
down value at a symmetric distribution F is
ε^∗(T, F) = \min\big(ε^∗_{exp}(T, F), \, ε^∗_{imp}(T, F)\big) = \min\left(\frac{δ}{ρ(∞)}, \, 1 − \frac{δ}{ρ(∞)}\right).
2.15 The maxbias curve
The maxbias curve of a statistical functional T at F measures the worst case bias for
every contamination percentage ε > 0 (contrary to the influence function which only
shows the limiting case ε ↓ 0 and only considers point contamination). For a location
model the maxbias curve is defined by
B(ε; T, F) = \sup_{G ∈ N_ε} |T(G) − T(F)|.
It follows that the asymptotic breakdown value can be derived from the maxbias curve:
ε^∗(T, F) = \inf\{ε > 0 : B(ε; T, F) = ∞\}.
For the median at a symmetric distribution F (with T(F) = 0) we obtain
B(ε; Med, F) = F^{-1}\left(\frac{1}{2(1 − ε)}\right).   (2.32)
In fact, under regularity conditions on F it has been shown (Huber, 1964) that
the median has the lowest maxbias curve among all translation equivariant esti-
mators of location (for all 0 ⩽ ε < 12 ). This is called the “min-max bias property”
of the median.
Figure 2.4 plots the maxbias curve (2.32) of the median at the standard normal distri-
bution (F = Φ). This graph combines the maxbias curve, the gross-error sensitivity
γ∗ ( T, F ) and the asymptotic breakdown value ε∗ (which is 12 in this case).
Figure 2.4 The maxbias curve sup_{G∈N_ε}(T(G) − T(F)) of the median at F = Φ, with slope γ^∗(T, F) at ε = 0 and a vertical asymptote at ε = ε^∗.
Chapter 3
Regression M-estimators
One of the reasons for the lasting success of M-estimators is their vast area of applications, of which regression analysis is an important example. In this chapter we discuss robust estimation of the parameters of linear regression models. M-estimators
for regression are constructed in the same way as location M-estimators. However,
regression poses some peculiar and difficult robustness problems making robust re-
gression and its applications an active area of research, even after nearly 40 years of
investigation. To motivate the need for robustness, let us consider a simple illustrative
example.
Example. The data in Table 3.1 correspond to an experiment on the speed of learn-
ing of rats. Times were recorded for a rat to go through a shuttle box in successive attempts. If the time exceeded 5 seconds, the rat received an electric shock for the duration of the next attempt. The data are the number of shocks received and the average
time for all attempts between shocks.
Figure 3.1 depicts the data and the straight line fitted by least-squares (LS) corre-
sponding to the linear regression model
y i = β 0 + β 1 x i + ϵi .
The relationship between the variables is seen to be roughly linear except for the
three upper left points. The LS line does not fit the bulk of the data, but is a compro-
mise between those three points and the other observations. The plot also contains the
more resistant least-absolute-deviations (LAD) estimator and the least-squares estima-
tor computed from the data without the three points in the upper left corner. These
two estimators give a better representation of the relation in the majority of the data,
while pointing out the exceptional character of points 1, 2 and 4 by their large devia-
tions from the fit.
The LS estimator fits a straight line by finding β̂_0 and β̂_1 such that the sum of squared residuals
\sum_{i=1}^{n} (y_i − β_0 − β_1 x_i)^2   (3.1)
is minimal. The least-absolute-deviations (LAD) estimator instead minimizes the sum of the absolute values of the residuals, \sum_{i=1}^{n} |y_i − β_0 − β_1 x_i|.
From our example, it can be seen that the LAD estimate is more resistant to outlying
observations. Intuitively, this is because the absolute value loss increases more slowly
than the squared loss. Note that LAD generalizes the sample median which is recov-
ered in the special case of location estimation.
Now, consider the more general case of a dataset consisting of n observations
(x1 , y1 ), . . . , (xn , yn ) with the xi being p-dimensional vectors, i.e., xi = ( xi1 , . . . , xip )⊤ ,
and the yi real random variables. As usual, the xi are referred to as predictor variables
while the yi are called responses. The data are assumed to follow the linear model
y_i = x_i^⊤ β_0 + ϵ_i,   i = 1, . . . , n,   (3.2)
where β_0 ∈ R^p is the unknown coefficient vector and the ϵ_i are i.i.d. error terms. If the model contains an intercept, it can be written as
y_i = β_0 + x_i^⊤ β_1 + ϵ_i,
where x_i = (x_{i2}, . . . , x_{ip})^⊤ and β_1 ∈ R^{p−1} with β_0^⊤ = (β_0, β_1^⊤). In a designed exper-
iment, the predictors xi are nonrandom (or fixed). Such situation often arises, for ex-
ample, in controlled lab experiments. When the data are observational, the predictors
xi are random variables generated from a p-variate distribution, for example, a mul-
tivariate Gaussian distribution. In this chapter we focus on fixed predictors, see the
course Robust Statistics for the case of random designs and the implications thereof.
With an estimator β̂_n at our disposal, the fitted values ŷ_i and residuals r_i of the fit are defined as
ŷ_i(β̂_n) = x_i^⊤ β̂_n   and   r_i(β̂_n) = y_i − ŷ_i(β̂_n).
3.1 Review of the LS method
Let us begin by reviewing some basic facts about the computation and theoretical
properties of the universally known least-squares estimator. Setting X = [x_1, . . . , x_n]^⊤ for the n × p matrix of predictors and y = (y_1, . . . , y_n)^⊤ for the response vector, the LS estimator satisfies the normal equations
X^⊤X \, β̂_{LS} = X^⊤ y.
If X has full rank, the solution is unique and given by
β̂_{LS} = (X^⊤X)^{-1} X^⊤ y.   (3.4)
If the model contains a constant term, then the first column of X is identically one , and
the full rank condition implies that no other column is constant. If X is not of full rank,
we have linear dependencies between the columns of X (extreme multicollinearity) and
other methods of estimation are needed (see following chapters).
The LS estimator satisfies the following three important, easily verifiable proper-
ties:
β̂_{LS}(X, y + Xγ) = β̂_{LS}(X, y) + γ   for all γ ∈ R^p,   (3.5)
β̂_{LS}(X, λy) = λ \, β̂_{LS}(X, y)   for all λ ∈ R,   (3.6)
β̂_{LS}(XA, y) = A^{-1} β̂_{LS}(X, y)   for all invertible A.   (3.7)
These properties are called regression, scale and affine equivariance, respectively. They
are desirable properties for any regression estimator, since they allow us to know how
the estimate changes under linear transformations of the data. Consider, for example,
transforming y according to
y⋆ = y + Xγ.
If model (3.2) holds for response y, then the transformed response satisfies y⋆ =
X( β0 + γ) + ϵ. Thus, this transformation amounts to adding γ to the coefficient vector
β0 and the value of (3.5) becomes clear. Likewise, if X⋆ = XA for some invertible A,
then y satisfies the model
y = X⋆ A−1 β0 + ϵ,
which is our linear model with X replaced by X⋆ and β0 by A−1 β0 . Again, it is de-
sirable that the estimate transforms in the same way. Finally, scale equivariance (3.6)
is concerned with the units of measurement of the response variable y. Consider, for
example, the situation where y is the height of sampled men measured in centimeters.
If it is now preferred to measure the height in meters, then scale equivariance guar-
antees that the estimate transforms in the same manner and thus there is no need to
estimate the regression coefficients anew.
In the fixed design case, it is well-known that the LS estimator is consistent and
asymptotically normal under suitable assumptions. The most common assumptions
are that the errors have mean zero and variance-covariance matrix σ2 I (spherical sym-
metry). Under these assumptions the LS estimator is unbiased, i.e.,
E\{β̂_{LS}\} = β_0.   (3.8)
Moreover, its variance-covariance matrix equals
Var\{β̂_{LS}\} = σ^2 (X^⊤X)^{-1},   (3.9)
so that its mean squared error is E\{\|β̂_{LS} − β_0\|^2\} = σ^2 \, tr\big((X^⊤X)^{-1}\big).
A related result is the normality of the LS estimator, which is extensively used for
inferential purposes. If the ϵ_i are normally distributed and X is of full rank, then β̂_{LS} is multivariate normal,
β̂_{LS} ∼ N_p\big(β_0, σ^2 (X^⊤X)^{-1}\big),   (3.10)
where N p (µ, Σ) denotes the p-variate normal distribution with mean vector µ and co-
variance matrix Σ. This follows immediately from the formula
(X^⊤X)^{1/2} (β̂_{LS} − β_0) = (X^⊤X)^{-1/2} X^⊤ ϵ = \sum_{i=1}^{n} α_{ni} ϵ_i,
where α_{n1}, . . . , α_{nn} are the columns of the (p × n) matrix (X^⊤X)^{-1/2} X^⊤. If the ϵ_i are not normally distributed (and they cannot be expected to be), but have a finite variance, then
using the central limit theorem it can be shown that for large n the estimator β̂_{LS} is approximately normal, with parameters given by (3.10), provided that the Lindeberg
condition holds for the xi . This condition ensures the absence of leverage points and
essentially boils down to
\max_{1⩽i⩽n} x_i^⊤ (X^⊤X)^{-1} x_i → 0   as n → ∞.
Notice that these are just the diagonal elements of the hat-matrix. See van der Vaart
(1998) and Huber and Ronchetti (2009) for more details.
Both consistency and asymptotic normality of the LS estimator have been derived
under the assumption that the error possesses two moments, which may not hold
in practice. The purpose of developing M-estimators for regression is to relax these
distributional assumptions as much as possible, while still ensuring the equivariance
properties given in (3.5)–(3.7) as well as a good trade-off between robustness and effi-
ciency. We now formally define this general class of estimators.
3.2 M-estimators of regression
Assume now that the errors ϵ_i are i.i.d. with a density of the form
\frac{1}{σ} f_0\left(\frac{ϵ}{σ}\right),
with σ a scale parameter. For the linear model (3.2) the y_i are then independent but clearly not identically distributed (their distribution depends on x_i). Indeed, each y_i has density
\frac{1}{σ} f_0\left(\frac{y − x_i^⊤ β_0}{σ}\right),   y ∈ \mathbb{R},
and, assuming a fixed value of σ, the likelihood function for β is
L(β) = \frac{1}{σ^n} \prod_{i=1}^{n} f_0\left(\frac{y_i − x_i^⊤ β}{σ}\right).
Maximizing this likelihood (equivalently, minimizing −\sum_i \log f_0((y_i − x_i^⊤ β)/σ)) leads, for differentiable f_0 with ψ_0 = −f_0'/f_0, to estimating equations of the form
\sum_{i=1}^{n} ψ_0\left(\frac{r_i(β)}{σ}\right) x_i = 0.   (3.12)
In particular, taking ψ_0(x) = sign(x) (the MLE score when f_0 is the double exponential density) gives the LAD estimator, which satisfies
\sum_{i=1}^{n} \mathrm{sign}\big(r_i(β̂)\big) \, x_i = 0.
If the model contains an intercept term, this implies that the residuals have zero me-
dian.
Unlike the LS estimator, there are in general no explicit expressions for the LAD
estimator. However, there exist very fast convex optimization algorithms to compute
it. A complication is that the LAD estimate may not be unique for ”small” samples.
The LAD estimate also has the intriguing property that it sets at least p residuals to
zero, so that the estimator fits exactly part of the data. We refer to Koenker (2005) for
more details concerning the LAD estimator.
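For illustration, both estimators are easily fitted in R. The rat data of Table 3.1 are not reproduced here, so a generic data frame d with columns y and x is assumed; the LAD fit uses the quantreg package, which is an additional assumption since it is not among the packages introduced in this course:
library(MASS)
fit_ls    <- lm(y ~ x, data = d)                      # least squares
fit_huber <- rlm(y ~ x, data = d, psi = psi.huber)    # Huber M-estimator, MAD scale
library(quantreg)
fit_lad   <- rq(y ~ x, tau = 0.5, data = d)           # LAD (median regression)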
More generally, we define regression M-estimators β̂_n as solutions of
β̂_n = \mathop{\mathrm{argmin}}_{β ∈ \mathbb{R}^p} \sum_{i=1}^{n} ρ\left(\frac{r_i(β)}{σ̂_n}\right),   (3.13)
where the loss function ρ is even, nondecreasing in |x| and satisfies ρ(0) = 0, as in the previous chapter. If ρ is differentiable with derivative ψ, the minimizer satisfies the estimating equations
\sum_{i=1}^{n} ψ\left(\frac{r_i(β)}{σ̂_n}\right) x_i = 0.   (3.14)
Note that the solution of (3.14) is not identical to the minimizer of (3.13), unless ρ
is convex. For non-convex functions, one has to account for the possibility of local
minima.
M-estimators based on convex ρ-functions are referred to as ”monotone regression
M-estimators”. The reason for this terminology is that the derivative of a differen-
tiable convex function is necessarily nondecreasing. Recall that convex functions are
differentiable almost everywhere, so that if they fail to be differentiable at some point,
then one can select a nondecreasing subgradient. M-estimators based on non-convex
ρ-functions, such as the Tukey-bisquare, are referred to as redescending M-estimators.
The main advantage of monotone M-estimators over their redescending counterparts
is their cheaper computation. On the other hand, the latter class of estimators can
achieve a better trade-off between robustness and efficiency but its computation re-
quires special care.
Finally, we discuss a common estimation strategy to obtain an estimate of the scale
σn . Recall that in the location-scale problem we estimated σ using the MAD. Here,
the equivalent procedure would be to first compute a LAD estimator, which does not
require knowledge of the scale. From its residuals we can obtain the analog of the
(normalized) MAD by taking the median of the nonnull absolute residuals:
σ̂_n = \frac{1}{0.675} \mathop{\mathrm{Med}}_{1 ⩽ i ⩽ n} \big(|r_i| : r_i ≠ 0\big).   (3.15)
The reason for using only nonnull residuals is that since at least p residuals are zero
for the LAD estimator, including all residuals could lead to underestimation of σ when
p is large. The standardization factor (0.675)−1 ensures that the estimator is Fisher-
consistent for the normal distribution, since for X ∼ N(0, 1) the median of | X | is 0.675.
Let us write σ̂_n in (3.15) as σ̂_n(X, y). Then, since the LAD estimator is regression, scale and affine equivariant, it follows that
σ̂_n(X, y + Xγ) = σ̂_n(X, y),   σ̂_n(XA, y) = σ̂_n(X, y)   and   σ̂_n(X, λy) = |λ| \, σ̂_n(X, y).   (3.16)
As in the location case, the estimating equations (3.14) can be rewritten in weighted form as
\sum_{i=1}^{n} w_i \, x_i \, \big(y_i − x_i^⊤ β̂_n\big) = 0,
with w_i = W(r_i(β̂_n)), where W(x) = ψ(x/σ̂_n)/x. This set of equations may be perceived as weighted normal equations, but the weights w_i depend upon the data through β̂_n. Nevertheless, this suggests an iterative scheme to calculate β̂_n.
1. Compute an initial estimate β̂_0 (e.g. the LAD fit) and the scale estimate σ̂_n from its residuals via (3.15).
2. For k = 0, 1, 2, . . . :
(a) Given β̂_k, compute r_{i,k}(β̂_k) and w_{i,k} = W\big(r_{i,k}(β̂_k)\big).
(b) Compute β̂_{k+1} by solving
\sum_{i=1}^{n} w_{i,k} \, x_i \, \big(y_i − x_i^⊤ β̂_{k+1}\big) = 0.
A sketch of this scheme in R is given below.
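A bare-bones version of this scheme for the Huber ψ is sketched here; the starting value, tuning constant and stopping rule are illustrative choices, and in practice one would use a packaged implementation such as MASS::rlm():
irls_huber <- function(X, y, sigma_hat, b = 1.345, maxit = 50, tol = 1e-8) {
  W    <- function(u) ifelse(abs(u) <= b, 1, b / abs(u))   # Huber weights psi(u)/u
  beta <- qr.solve(X, y)                                   # starting value (LS, illustrative)
  for (k in seq_len(maxit)) {
    r        <- as.vector(y - X %*% beta)
    w        <- W(r / sigma_hat)
    beta_new <- solve(t(X) %*% (w * X), t(X) %*% (w * y))  # weighted least-squares step
    if (max(abs(beta_new - beta)) < tol) break
    beta <- beta_new
  }
  drop(beta_new)
}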
In practice, the algorithm converges very fast. Often less than 10 iterations are
needed for reasonable tolerance levels. The following result provides theoretical justi-
fication for the IRLS algorithm.
Theorem 3.1. Assume that X is of full rank, ψ is continuous and W is nonincreasing in | x |
with a well-defined limit limx→0 W ( x ). Then the IRLS algorithm monotonically decreases the
objective function
h(β) = \sum_{i=1}^{n} ρ\left(\frac{r_i(β)}{σ̂_n}\right).
If ρ is strictly convex, then the sequence { βk }k converges to the unique stationary point βS of
h( β), satisfying
\sum_{i=1}^{n} ψ\left(\frac{r_i(β_S)}{σ̂_n}\right) x_i = 0.   (3.17)
The first set of conditions is satisfied for almost all commonly used ρ-functions, including the Huber and the bisquare loss functions. The condition that ψ is nondecreasing, i.e., that ρ is convex, serves to preserve the equivalence between (3.13) and (3.14). It can
be shown that omitting this assumption has the consequence that the algorithm still
converges, but convergence can be to a local minimum. For strictly convex ρ-functions
the theorem has the remarkable implication that convergence of the algorithm to the
stationary point in (3.17) is irrespective of the starting point.
Proof. To make the notation more tractable, we absorb σ̂_n into the ρ-function. That is, we write ρ(x) instead of ρ(x/σ̂_n). For r ⩾ 0 let g(r) := ρ(\sqrt{r}). Since ρ(r) = g(r^2) we have ψ(r) = 2 r g'(r^2) and consequently
W(r) = \frac{ψ(r)}{r} = 2 g'(r^2).   (3.18)
This implies that W (r ) is nonincreasing for r ≥ 0 if and only if g′ is nonincreasing.
We claim that g is concave, that is, the graph of g lies below its tangent lines. For differentiable functions this is equivalent to
g(y) − g(x) ⩽ g'(x)(y − x)   (3.19)
for all x ≠ y. To see that (3.19) holds, first assume that y > x. Then, by the mean value
theorem of calculus, we have
g ( y ) − g ( x ) = ( y − x ) g ′ ( ξ ),
for some ξ ∈ ( x, y). But since g′ is nonincreasing, g′ (ξ ) ≤ g′ ( x ) and (3.19) holds. The
case y < x is handled similarly. We have thus established that g is concave.
Now, define the matrix
U(β) = \sum_{i=1}^{n} W\big(r_i(β)\big) \, x_i x_i^⊤,
which, by the nonnegativity of W, is positive semidefinite for all β. Define also the function
f(β) = \mathop{\mathrm{argmin}}_{γ ∈ \mathbb{R}^p} \sum_{i=1}^{n} W\big(r_i(β)\big) (y_i − x_i^⊤ γ)^2,
so that the IRLS iteration can be written compactly as
β_{k+1} = f(β_k).
Note that this is a compact way of saying that for each iteration we are solving a
weighted least-squares problem with weights depending on the previous iteration.
A fixed point β_0, i.e., one satisfying f(β_0) = β_0, is, by definition, also a stationary point of h(β) satisfying (3.17), as ψ is continuous. Put for simplicity w_i := W(r_i(β_k)). Then β_{k+1} satisfies
\sum_{i=1}^{n} w_i x_i y_i = \sum_{i=1}^{n} w_i x_i x_i^⊤ β_{k+1} = U(β_k) \, β_{k+1}.   (3.20)
But since ri ( βk+1 ) − ri ( βk ) = ( βk − βk+1 )⊤ xi and ri ( βk+1 ) + ri ( βk ) = 2yi − xi⊤ ( βk +
βk+1 ) we have, using (3.20),
h(β_{k+1}) − h(β_k) ⩽ \frac{1}{2} (β_k − β_{k+1})^⊤ \sum_{i=1}^{n} w_i x_i x_i^⊤ (2β_{k+1} − β_k − β_{k+1})
= −\frac{1}{2} (β_k − β_{k+1})^⊤ U(β_k) (β_k − β_{k+1}) ⩽ 0,
as U(β) is positive semidefinite. This proves that the algorithm monotonically decreases h(β) and hence establishes the first part of the theorem.
To prove the convergence of { βk }k to βS , first note that since the sequence {h( βk )}k
is nonincreasing and bounded from below (by zero), it has a limit h0 . This implies that
βk is bounded, for if that were not the case there would exist a subsequence { βk j } j
tending to infinity. However, by assumption ρ is strictly convex and thus strictly in-
creasing. Therefore { h( βk j )} j would also tend to infinity contradicting the conver-
gence of {h( βk )}k .
Now, since { βk }k is bounded, it has a subsequence converging to a limit βS , which
by continuity of f satisfies
f ( βS ) = βS ,
and by our previous remarks this is a stationary point of h(β). If the limit is unique,
then βk → βS and we are done. Otherwise, there would exist a subsequence bounded
away from βS , which in turn would have a subsequence converging to a stationary
point different from βS . However, that is not possible because the stationary point in
(3.17) is unique due to strict convexity of ρ. The proof is complete. □
3.4 Robustness properties
3.4.1 Finite-sample breakdown value
As in the location case, the finite-sample breakdown value (FSB) of a regression estimator β̂_n at a sample (X, y) is
ε^⋆_n(β̂_n) = \max\big\{m/n : β̂_n(X, y_m) \text{ remains bounded } ∀ y_m ∈ \mathcal{Y}_m\big\},   (3.21)
where \mathcal{Y}_m is the set of length n vectors with at least n − m elements in common with y.
It is clear that for the LS estimator we have ε^⋆_n(β̂_{LS}) = 1/n, since the estimator is linear in y.
The FSB, in general, depends on how well-conditioned the predictor matrix X is.
Let k^⋆ = k^⋆(X) be the maximum number of x_i lying on the same subspace (hyperplane) of dimension smaller than p, that is,
k^⋆(X) = \max\big\{\#\{i : x_i ∈ V\} : V \text{ a linear subspace of dimension at most } p − 1\big\},
where a subspace of dimension 0 is the set {0}. In the case of simple straight-line
regression k⋆ is the maximum number of repeated xi . We always have k⋆ ≥ p − 1
as one needs at least p − 1 points to obtain a p − 1 dimensional hyperplane. If k⋆ =
p − 1 the data set is said to be in general position. This means that every set of p − 1
observations determines a unique hyperplane. It can be shown (see Maronna et al.,
2006) that for all regression equivariant estimators it holds that
ε^⋆_{max} = \frac{1}{n} \, \frac{n − k^⋆ − 1}{2} ⩽ \frac{1}{n} \, \frac{n − p}{2}.   (3.22)
The FSB of monotone regression estimators is, in general, difficult to describe pre-
cisely and no closed-form expressions exist. From the available results we know that
monotone M-estimators with bounded ψ attain the maximal FSB for one-way and two-
way analysis of variance but other than that their breakdown value is far below (3.22).
For example, in straight-line fitting, it can be shown that ε^⋆_n(β̂_n) ≈ 0.29. Moreover, the
breakdown value can depend on p and in certain cases we have ϵn⋆ = O( p−1/2 ). Thus,
for high-dimensional datasets, the FSB can be very close to zero.
Redescending M-estimators can reach a much higher FSB but normally one needs
to start from a robust estimator in order to avoid "bad" solutions. In particular, if β̂_{in} denotes an initial regression estimator, then Yohai (1987) shows that
ε^⋆_n(β̂_n) ⩾ \min\left\{ε^⋆_n(β̂_{in}), \; \frac{1}{2} \, \frac{1 − 2(p − 1)/n}{1 − (p − 1)/n}\right\},   (3.23)
for any sample in general position (i.e., k⋆ (X) = p − 1). Thus, for large n and fixed p
the FSB of redescending M-estimators can be arbitrarily close to (3.22), but only if the
initial estimator is itself very robust. Such estimators exist but are beyond our scope,
see the Robust Statistics course for some examples. Their common disadvantage is
their computational burden, which is far worse than that of M-estimators.
3.4.2 Influence function
As in the location case, the influence function of a regression M-estimator is proportional to ψ\big((y_0 − x_0^⊤ β_0)/σ\big), now multiplied by a term depending on the predictor x_0,
where, given that the design is fixed, we assume that outliers only lie in the response
space. It follows that the IF is bounded as a function of y0 whenever the score func-
tion ψ is bounded. The IF behaves rather differently for monotone and redescending score functions. In particular, for |y_0| → ∞ the term ψ\big((y_0 − x_0^⊤ β_0)/σ\big) approaches either \sup_x |ψ(x)| or 0, depending on whether ψ is monotone or redescending. Thus, redescending M-estimators are impervious to gross infinitesimal contamination.
3.5 Asymptotic distribution of regression M-estimators
At the true parameter β_0, the estimating equations (3.14) involve the terms ψ(ϵ_i/σ̂_n), as r_i(β_0) = y_i − x_i^⊤ β_0 = ϵ_i. Despite the fact that the errors ϵ_i are iid, the presence of the scale estimator σ̂_n makes the random variables ψ(ϵ_i/σ̂_n), i = 1, . . . , n, dependent. Hence, classical central limit theorems are not applicable.
With more advanced techniques we can show that if there exists a positive definite
matrix V such that
\lim_{n→∞} \frac{1}{n} \sum_{i=1}^{n} x_i x_i^⊤ = V,
then we have
\frac{1}{\sqrt{n}} \sum_{i=1}^{n} ψ\left(\frac{ϵ_i}{σ̂_n}\right) x_i \xrightarrow{D} N_p\Big(0, \, E\big\{ψ(ϵ_1/σ)^2\big\} \, V\Big)
and
\frac{1}{n} \sum_{i=1}^{n} ψ'\left(\frac{ϵ_i}{σ̂_n}\right) x_i x_i^⊤ \xrightarrow{P} E\big\{ψ'(ϵ_1/σ)\big\} \, V.
Therefore, by Slutsky's theorem,
\sqrt{n}\,(β̂_n − β_0) \xrightarrow{D} N_p\left(0, \; σ^2 \, \frac{E\{ψ(ϵ_1/σ)^2\}}{\big(E\{ψ'(ϵ_1/σ)\}\big)^2} \, V^{-1}\right).
The above derivation is completely heuristic and does not cover M-estimators derived
from non-differentiable ψ-functions (such as the Huber estimator). The following gen-
eral set of conditions serves to add the missing rigor.
(A) \lim_{n→∞} n^{-1} X^⊤X = V for some positive definite matrix V.
(B) \max_{1⩽i⩽n} \|x_i\|/\sqrt{n} → 0, as n → ∞.
(C) There exist finite constants κ and M such that for all x ∈ R and |y| < κ,
|ψ( x + y) − ψ( x )| ≤ M.
(E) There exists a σ > 0 such that σ̂_n \xrightarrow{P} σ.
(G) We have, for some δ > 0,
\lim_{t→0} \sup_{|α−σ|⩽δ} E\left\{\left|ψ\left(\frac{ϵ_1 + t}{α}\right) − ψ\left(\frac{ϵ_1}{α}\right)\right|^2\right\} = 0.
Theorem 3.2. Assume that conditions R1–R3 and (A)–(F) hold and further that ρ is a convex
loss function. Then, for the sequence of M-estimators β̂_n, it holds that
\sqrt{n}\,(β̂_n − β_0) \xrightarrow{D} N_p\left(0, \; σ^2 \, \frac{E\{ψ(ϵ_1/σ)^2\}}{|ξ(σ)|^2} \, V^{-1}\right),
where ξ is the quantity appearing in condition (F).
The proof of Theorem 3.2 for monotone M-estimators is based on the following
interesting lemma of Hjort and Pollard (1993). For redescending M-estimators the
result of the theorem is still true, but the mathematics to prove it are more complicated.
Lemma 3.3. Suppose A_n(δ) is a real convex process that can be represented as \frac{1}{2} δ^⊤ V δ + U_n^⊤ δ + r_n(δ) for some r_n(δ) tending to zero for each δ. If U_n \xrightarrow{D} U, then for the minimizer δ_n of A_n(δ), we have
δ_n \xrightarrow{D} −V^{-1} U.
Proof of Theorem 3.2. We aim to apply Lemma 3.3 and to this end consider the objec-
tive function
Z_n(δ) = \sum_{i=1}^{n} \left\{ρ\left(\frac{ϵ_i − x_i^⊤ δ/\sqrt{n}}{σ̂_n}\right) − ρ\left(\frac{ϵ_i}{σ̂_n}\right)\right\}.
We will show that Z_n(δ) can be represented as
Z_n(δ) = \frac{ξ(σ)}{2σ^2} δ^⊤ V δ + U_n^⊤ δ + r_n(δ),
with
U_n = −\frac{1}{σ̂_n \sqrt{n}} \sum_{i=1}^{n} ψ\left(\frac{ϵ_i}{σ̂_n}\right) x_i,
and r_n(δ) tending to zero pointwise, that is, for every δ ∈ \mathbb{R}^p. The claim would then follow from Lemma 3.3 and Slutsky's theorem.
We start by using the fundamental theorem of calculus to write
ρ\left(\frac{ϵ_i − x_i^⊤ δ/\sqrt{n}}{σ̂_n}\right) − ρ\left(\frac{ϵ_i}{σ̂_n}\right) = \frac{1}{σ̂_n} \int_0^{−x_i^⊤ δ/\sqrt{n}} \left\{ψ\left(\frac{ϵ_i + t}{σ̂_n}\right) − ψ\left(\frac{ϵ_i}{σ̂_n}\right)\right\} dt \; − \; \frac{1}{\sqrt{n}\,σ̂_n} \, x_i^⊤ δ \, ψ\left(\frac{ϵ_i}{σ̂_n}\right).
Consequently,
Z_n(δ) = Z_n^{(1)}(δ) + Z_n^{(2)}(δ),
with
Z_n^{(1)}(δ) = \frac{1}{σ̂_n} \sum_{i=1}^{n} \int_0^{−x_i^⊤ δ/\sqrt{n}} \left\{ψ\left(\frac{ϵ_i + t}{σ̂_n}\right) − ψ\left(\frac{ϵ_i}{σ̂_n}\right)\right\} dt,
Z_n^{(2)}(δ) = −\frac{1}{\sqrt{n}\,σ̂_n} \sum_{i=1}^{n} x_i^⊤ δ \, ψ\left(\frac{ϵ_i}{σ̂_n}\right).
We begin by showing that Z_n^{(2)}(δ) \xrightarrow{D} N_p\big(0, σ^{-2} E\{|ψ(ϵ_1/σ)|^2\} V\big). To see this, write
\frac{1}{\sqrt{n}} \sum_{i=1}^{n} x_i^⊤ δ \, ψ\left(\frac{ϵ_i}{σ̂_n}\right) = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} x_i^⊤ δ \, ψ\left(\frac{ϵ_i}{σ}\right) + \frac{1}{\sqrt{n}} \sum_{i=1}^{n} x_i^⊤ δ \left\{ψ\left(\frac{ϵ_i}{σ̂_n}\right) − ψ\left(\frac{ϵ_i}{σ}\right)\right\}.   (3.24)
By (F), the first term in the RHS of (3.24) is the sample mean of independent but not
identically distributed random variables with mean equal to zero. It has an asymptotic
Gaussian distribution provided that we can check the Lindeberg condition, which in
our case takes the form
max1≤i≤n |xi⊤ δ|
q → 0,
∑in=1 |xi⊤ δ|2
as n → ∞. Now,
√
max1≤i≤n |xi⊤ δ| max1≤i≤n ∥xi ∥/ n∥∥δ∥
q ≤ q → 0,
∑in=1 |xi⊤ δ|2 δ⊤ ∑in=1 xi xi⊤ /nδ
as n → ∞.
(1)
Now, consider Zn (δ), which we may decompose as
(1) (1) (1) (1)
Zn (δ) = E{ Zn (δ)} + ( Zn (δ) − E{ Zn (δ)}). (3.25)
To treat the first term in the right hand side of (3.25), recall that by the Schwarz in-
equality and (B),
√ √
max |xi⊤ δ|/ n ≤ max ∥xi ∥/ n∥δ∥ → 0.
1≤ i ≤ n 1≤ i ≤ n
For the second term in (3.25), we show that the individual summands are o_P(1) and the sum of their variances converges to zero, as n → ∞. To establish the former, note that by (B), \max_{1⩽i⩽n} |x_i^⊤ δ|/\sqrt{n} → 0. An application of (C) therefore
yields
\left|\int_0^{−x_i^⊤ δ/\sqrt{n}} \left\{ψ\left(\frac{ϵ_i + t}{α}\right) − ψ\left(\frac{ϵ_i}{α}\right)\right\} dt\right| ⩽ \left|\int_0^{−x_i^⊤ δ/\sqrt{n}} \left|ψ\left(\frac{ϵ_i + t}{α}\right) − ψ\left(\frac{ϵ_i}{α}\right)\right| dt\right| ⩽ M \, \frac{|x_i^⊤ δ|}{\sqrt{n}},
uniformly in i = 1, . . . , n. Conclude that
\max_{1⩽i⩽n} \left|\int_0^{−x_i^⊤ δ/\sqrt{n}} \left\{ψ\left(\frac{ϵ_i + t}{α}\right) − ψ\left(\frac{ϵ_i}{α}\right)\right\} dt\right| = o(1),
as n → ∞.
To bound the sum of the variances, first recall that for any random variable X,
Var{ X } ≤ E{ X 2 } and it thus suffices to bound the sum of the second moments. By
the Schwarz integral inequality, we have
\sum_{i=1}^{n} E\left\{\left|\int_0^{−x_i^⊤ δ/\sqrt{n}} \left\{ψ\left(\frac{ϵ_i + t}{α}\right) − ψ\left(\frac{ϵ_i}{α}\right)\right\} dt\right|^2\right\}
⩽ \sum_{i=1}^{n} \frac{|x_i^⊤ δ|}{\sqrt{n}} \, E\left\{\left|\int_0^{−x_i^⊤ δ/\sqrt{n}} \left|ψ\left(\frac{ϵ_i + t}{α}\right) − ψ\left(\frac{ϵ_i}{α}\right)\right|^2 dt\right|\right\}
= \sum_{i=1}^{n} \frac{|x_i^⊤ δ|^2}{n} \, o(1)
= o(1),
as n → ∞, where the second to last line follows from (G) and the last line from (A).
The claim now follows from the methods of van de Geer (2000, Chapter 8).
Combining these two derivations we have shown that
σ̂_n \, Z_n^{(1)}(δ) = \frac{ξ(σ)}{2σ} δ^⊤ V δ + o_P(1).
Therefore, by (E),
Z_n^{(1)}(δ) = \frac{ξ(σ)}{2σ^2} δ^⊤ V δ + o_P(1).
The result of the theorem now follows from Lemma 3.3. The proof is complete. □
The conditions of the theorem are focused on monotone M-estimators requiring
an auxiliary scale σ̂_n. They may be marginally weakened if the M-estimator does not require a scale estimator, in that (D) and (E) may be omitted entirely while (F) and (G)
only need to hold for α = 1. Can you then identify ξ in (F) for ψ( x ) = sign x ? What
does this say about the asymptotic distribution of the LAD estimator?
For M-estimators that do require scale estimators, the conditions of the theorem
can be satisfied for M-estimators without differentiable ψ-functions. Contrast this with
theorem 2.2 of the previous chapter. For example, for the Huber M-estimator it may
be verified that (C) and (D) are automatically satisfied while (F) is satisfied with
ξ(α) = 2F(bα) − 1,
for symmetric error distributions F. Moreover, (G) is satisfied as ψb ( x ) is Lipschitz and
therefore
\left|ψ_b\left(\frac{ϵ_1 + t}{α}\right) − ψ_b\left(\frac{ϵ_1}{α}\right)\right| ⩽ C(b) \, |t|.
The theorem also applies to M-estimators of location, as in this case conditions (A)
and (B) are trivially satisfied (why?). What is then the asymptotic distribution of the
Huber location estimator with a preliminary MAD scale? Does the answer change if
the IQR or Qn scale estimators are used instead?
Further reading
Regression with fixed covariate dimension and iid errors is but one of the myriad
applications of M-estimators. Through the course of the last 30 years M-estimators
have been applied in time series analysis, high dimensional statistics, non-parametric
statistics and functional data analysis (see the last chapter of this course) with consid-
erable success.
There are many texts devoted to the study of M-estimators. Of particular impor-
tance are Maronna et al. (2006) and Huber and Ronchetti (2009). These books clearly
present the underlying ideas, albeit in a not always rigorous manner. The theoreti-
cal properties of M-estimators are covered in great detail in van der Vaart (1998) and
van de Geer (2000), among others. The study of M-estimators in general metric spaces
is ongoing.
Chapter 4
4.1 Introduction
Linear regression models assume that the regression function E[Y | X ] depends lin-
early on the predictors X1 , . . . , Xd . Such models are simple and in many cases provide
a reasonable and well-interpretable description of the effect of the features on the out-
come. In some situations, simple linear models can even outperform fancier nonlinear
models, such as in cases with a small number of training data, cases with very noisy
data such that the trend (the signal) is hard to detect (low signal to noise ratio) or in
cases with sparse data (high-dimensional data problems).
The standard least squares estimator for linear models may become unstable when the number of available predictors is large in comparison to the sample size, a setting that is commonly called a high-dimensional data problem. If the dimension exceeds the sample size, the least squares estimator is not even unique anymore.
Hence, alternative methods are needed that provide more stable solutions. First, we
review relevant properties of standard least squares estimation for linear models to
explain its instability. We then continue with regularized estimation methods that al-
low for bias if this leads to a substantial decrease in variance and thus yields a more
stable model. Regularization is obtained by adding a penalty term to the least squares
objective function which aims to shrink the coefficient estimates towards zero. By
shrinking the coefficient estimates bias is introduced, but at the same time the vari-
ance is reduced.
A common assumption for high dimensional data problems is sparsity which in
this setting means that most available predictors are assumed to be noise variables
that do not help to explain the response. Hence, the final model should only contain a
small number of predictors. To achieve this goal, regularization methods use a penalty
term which allows coefficients to be shrunk exactly to zero, so that the corresponding predictor is effectively removed from the model. Extensive treatments of
sparse methods for high-dimensional data can be found in e.g. Bühlmann and van de
Geer (2011); Hastie et al. (2015).
Data are called ultra-high dimensional if the dimension grows at an exponential rate in the sample size. Regularization alone does not suffice to successfully determine a sparse
linear model from such ultra-high dimensional data. To address this issue, an initial
screening step can be applied which screens each candidate predictor for its potential
use in the final model. The goal of this step is to largely reduce the dimension by
removing the least interesting predictors. Regularization methods can then be applied
on the data set with reduced dimension to determine the final model.
Consider again a linear model for the regression function,
f(X) = β_0 + \sum_{j=1}^{d} β_j X_j.   (4.1)
Hence, the linear model assumes that the regression function E[Y | X ] can be approx-
imated reasonably well by the linear form (4.1). Recall that the model is linear in the
unknown parameters β 0 , . . . , β d . The predictor variables X1 , . . . , Xd can be measured
features, transformations of features, polynomial representations or other basis ex-
pansions, dummy variables, interactions, etc....
The unknown parameter vector β = (β_0, . . . , β_d) needs to be estimated from the training sample {(x_i, y_i); i = 1, . . . , n} with x_i ∈ \mathbb{R}^d. The standard least squares estimator minimizes
RSS(β) = \sum_{i=1}^{n} (y_i − f(x_i))^2 = \sum_{i=1}^{n} \Big(y_i − β_0 − \sum_{j=1}^{d} β_j x_{ij}\Big)^2 = (y − Xβ)^t (y − Xβ),   (4.2)
where X is the feature data matrix which starts with a column of ones to include the
intercept in the model. Differentiating (4.2) with respect to β and setting the result
equal to zero yields the estimating equation
Xt (y − Xβ) = 0 (4.3)
with solution
β̂ = (Xt X)−1 Xt y (4.4)
The fitted values for the training data are given by
ŷ = X β̂ = X (X^t X)^{-1} X^t y = H y,
where H = X (X^t X)^{-1} X^t is called the hat matrix. Note that the fitted values ŷ are a linear function of
the responses y. The linear operator H (which puts the ”hat” on y) does not depend
on y but only on the xi observations. The predicted value at a new input vector x0 is
obtained by ŷ0 = fˆ( x0 ) = (1, x0t ) β̂.
Least squares minimizes RSS( β) = (y − Xβ)t (y − Xβ) = ∥y − Xβ∥2 . Geometri-
cally, minimizing the euclidean norm means that β̂ is determined such that ŷ is the
orthogonal projection of the n-dimensional vector y onto the subspace spanned by the
columns x0 , . . . , xd of the matrix X, called the column space of X. Hence, the resid-
ual vector y − ŷ is orthogonal to this subspace as can be seen in (4.3). The hat matrix
H computes the orthogonal projection of y onto the column space of X and thus is a
projection matrix. This implies that H has some useful properties. It is a symmetric, positive semidefinite matrix that is idempotent (i.e. HH = H). Moreover, if n > d, then the rank of H equals the number of linearly independent columns in X. Hence, in absence of exact linear dependence among the predictors, the rank of H equals d + 1. Moreover, trace(H) = d + 1, the dimension of the projection space, which corresponds to the number of (free) parameters in the fit, also called the degrees of freedom of the estimation problem. Note that the trace of a square matrix is defined as
the sum of its diagonal elements. The more free parameters are included in a model,
the better it can fit the given data, but the more complex the resulting model is. The
degrees of freedom is thus a measure of model complexity.
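These properties are easy to verify numerically; the small check below uses simulated data (dimensions and seed are arbitrary):
set.seed(1)
n <- 30; d <- 4
X <- cbind(1, matrix(rnorm(n * d), n, d))   # design matrix including the intercept column
H <- X %*% solve(t(X) %*% X) %*% t(X)
all.equal(H, t(H))        # symmetric
all.equal(H %*% H, H)     # idempotent
sum(diag(H))              # trace = d + 1 = 5 degrees of freedom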
It can easily be shown that the variance-covariance matrix of β̂ is given by
Var(β̂) = σ^2 (X^t X)^{-1}.
Based on this result and an estimate of the error variance σ^2, confidence intervals and tests for the regression parameters can be constructed, as well as F-tests to check for significant differences between nested models.
The Gauss-Markov theorem is an important result for least squares regression. It
shows that the least squares estimator has the smallest variance among all unbiased
estimators that can be written as a linear transformation of the response y in the fol-
lowing sense. Consider any linear combination θ = at β with a ∈ Rd+1 , then the least
squares estimator of θ is
θ̂ = at β̂ = [ at (Xt X)−1 Xt ]y,
which is a linear transformation of y. Clearly, at β̂ is an unbiased estimator of θ. Now
consider any other linear, unbiased estimator of θ, that is, an estimator of θ of the form
θ̃ = ct y with c ∈ Rn such that E[θ̃ ] = θ. Then it holds that,
Var(θ̂ ) ≤ Var(θ̃ ).
When using the squared error loss function to measure prediction accuracy, we
focus on the mean squared error of estimators. Consider the prediction of the response
Y0 at a point x0 , assuming that the model is correct, hence the true response Y0 has the
form
Y0 = f ( x0 ) + ϵ0 .
Then, the expected prediction error of an estimate f̃(x_0) is
EPE(f̃(x_0)) = E[(Y_0 − f̃(x_0))^2] = σ^2 + MSE(f̃(x_0)),
where MSE(f̃(x_0)) = E[(f̃(x_0) − f(x_0))^2] is the mean squared error of the prediction at x_0 by the estimator
f˜. Therefore, expected prediction error and mean squared error of a prediction differ
only by the constant σ2 , representing the irreducible variance of the new observation
y0 around its expected value.
Now, assume that the linear model is correct, i.e. f ( x0 ) = x0t β. Consider any esti-
mate β̃ of the unknown regression parameters β, then the mean squared error of the
prediction f̃(x_0) = x_0^t β̃ can be decomposed further as
MSE(f̃(x_0)) = \big(E[f̃(x_0)] − x_0^t β\big)^2 + Var(f̃(x_0)).   (4.6)
The unbiasedness of the least squares estimator makes the first term in the mean
squared error decomposition (4.6) zero. Since the prediction f˜( x0 ) = x0t β̃ is a linear
combination of β̃, the Gauss-Markov theorem ensures that the least squares estimator
is optimal w.r.t. variance within an important class of unbiased estimators. That is, the
least squares estimator has the smallest mean squared error among all linear unbiased
estimators. However, this optimal unbiased estimator can still be highly unstable. This
can happen in situations with (many) candidate predictors among which only a few
have a significant relation with the outcome. The large fraction of noise variables then
causes a large variability of the least squares estimates. Also the presence of highly
correlated predictors (multicollinearity) will make an unbiased estimator such as least
squares highly variable.
To obtain a more stable, better performing solution, we can look for biased estima-
tors that lead to a smaller mean squared error, and thus to better predictions. These
biased estimators thus trade (a little) bias for a (much) larger reduction in variance.
From a broader viewpoint, almost all models are only an approximation of the true
mechanism that relates the features to the outcome, and hence these models are bi-
ased. Therefore, allowing for a little extra bias when selecting the optimal model from
the class of models considered is a small price to achieve a more stable, better per-
forming model. Picking the right model thus means trying to find the right balance
between bias and variance.
A low prediction accuracy due to high variability is a first reason why the least
squares estimator may not be satisfactory. A second reason is the possible loss of in-
terpretability. With a large number of predictors it is often of interest to determine
a small subset of predictors that exhibit the strongest influence on the outcome. The
high variability of least squares complicates the selection of these predictors. A biased
estimator obtained by shrinkage will be much more successful. With least squares, we
can take the subset selection approach to determine important variables or to find
stable submodels. Note that after subset selection we usually end up with a biased
regression model, a fact that is often ignored in regression analysis. Several selection
strategies exist such as all-subsets, backward elimination, or forward selection. Dif-
ferent selection criteria can be used to select the optimal model in these procedures.
For instance, minimizing prediction error on a test set or through cross-validation can
be used. Note that these selection methods are discrete in the sense that predictors
are either completely in or completely out of the model. This discreteness results in a
high variance of the selection process. The shrinkage methods discussed next are a more continuous alternative to subset selection and usually have a lower variance.
4.3 Regularization
4.3.1 Ridge Regression
Ridge regression (Hoerl and Kennard (1970b,a)) is one of the oldest shrinkage
methods. It shrinks the regression coefficients (slopes) by imposing a penalty on their
overall size. The ridge regression coefficients are obtained by minimizing a penalized
residual sum of squares
β̂_ridge = argmin_β [ ∑_{i=1}^{n} ( y_i − β_0 − ∑_{j=1}^{d} β_j x_{ij} )² + λ ∑_{j=1}^{d} β_j² ]    (4.7)
λ is a tuning parameter controlling the size of the penalty. The larger λ, the larger the
weight of the penalty term, and thus the more shrinkage that is imposed. The limit
case λ = 0 means no shrinkage and just yields the (unbiased) least squares solution.
Increasing λ increases the bias by shrinking the coefficients towards zero. Note that
no shrinkage is applied on the intercept term (β 0 ).
An equivalent formulation of the ridge regression problem is
β̂_ridge = argmin_β ∑_{i=1}^{n} ( y_i − β_0 − ∑_{j=1}^{d} β_j x_{ij} )²    (4.8)

subject to  ∥β∥²_2 = ∑_{j=1}^{d} β_j² ≤ s,    (4.9)
with s ≥ 0 a tuning parameter that corresponds one-to-one with λ in (4.7). Since the intercept is not penalized, it can be estimated by β̂_0 = ȳ; the response and the predictors can then be centered, such that the n × d matrix X is a centered feature matrix (without the usual column of ones, because there is no intercept anymore in the model).
After setting β_0 equal to 0, the ridge estimator in (4.7) can be rewritten in matrix notation as

β̂_ridge = argmin_β [ (y − Xβ)^t (y − Xβ) + λ β^t β ].    (4.10)

Setting the derivative with respect to β equal to zero yields the explicit solution

β̂_ridge = (X^t X + λI)⁻¹ X^t y.    (4.11)
Hence, the ridge regression solution is a linear function of the observed outcome y.
The solution is similar to the least squares solution (4.4) but adds a constant to the
diagonal of Xt X before inversion. This constant makes the matrix nonsingular even if
Xt X itself is singular. This means that ridge regression can be used even when there
are more predictors than observations or when there exists linear dependence between
the predictor variables. From (4.11) it follows that the fitted responses are obtained by

ŷ = X β̂_ridge = [X(X^t X + λI)⁻¹ X^t] y = S_λ^ridge y.

Hence, also for ridge regression the fitted values ŷ are a linear function of the responses y. The linear operator S_λ^ridge is called the ridge regression linear operator.
Similarly to least squares, the trace of the ridge regression linear operator, trace(S_λ^ridge), can be used as a measure of the model complexity, which is called the effective degrees of freedom (Hastie et al. (2001)). Clearly, the effective degrees of freedom
depends on the choice of the tuning parameter λ. Therefore, one way to choose the
value of the tuning parameter is by fixing the effective degrees of freedom of the ridge
regression solution.
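As a small illustration (not part of the original text), the following base R sketch computes the ridge coefficients (X^t X + λI)⁻¹ X^t y on centered, simulated data and the corresponding effective degrees of freedom trace(S_λ^ridge); all data and tuning values are hypothetical.

set.seed(1)
n <- 100; d <- 5
X <- scale(matrix(rnorm(n * d), n, d), center = TRUE, scale = FALSE)  # centered features
beta <- c(2, -1, 0.5, 0, 0)
y <- drop(X %*% beta + rnorm(n))
y <- y - mean(y)                        # center the response: no intercept needed

ridge_fit <- function(X, y, lambda) {
  d <- ncol(X)
  A <- crossprod(X) + lambda * diag(d)  # X'X + lambda I
  coef <- drop(solve(A, crossprod(X, y)))  # ridge coefficients, as in (4.11)
  S <- X %*% solve(A, t(X))             # ridge linear operator S_lambda
  list(coef = coef, df = sum(diag(S)))  # effective degrees of freedom = trace(S_lambda)
}

sapply(c(0, 1, 10, 100), function(l) ridge_fit(X, y, l)$df)

The last line shows how the effective degrees of freedom shrinks from d towards 0 as λ grows.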
To further examine the behavior of ridge regression we use the singular value de-
composition (SVD) of the centered feature matrix X, which is given by
X = U D V^t,    (4.12)

where U is an n × d matrix with orthonormal columns, D is a d × d diagonal matrix containing the singular values γ_1 ≥ γ_2 ≥ · · · ≥ γ_d ≥ 0 of X, and V is a d × d orthogonal matrix. Recall that a matrix P with d columns is called (column-)orthogonal if P^t P = I_d with I_d the d-dimensional identity matrix, which means that the d columns of P form an orthonormal basis for a d-dimensional space. In the
singular value decomposition the columns of U (corresponding to non-zero singular
values) are a basis for the space spanned by the columns of X. Similarly, the columns
of V (corresponding to non-zero singular values) are a basis for the subspace in Rd
spanned by the rows of X. For the special case of a d × d matrix V , orthogonality of
the matrix implies that V t = V −1 .
Using the singular value decomposition the fitted values by ridge regression can be rewritten as

ŷ_ridge = X β̂_ridge
        = [X(X^t X + λI)⁻¹ X^t] y
        = [U D V^t (V D U^t U D V^t + λI)⁻¹ V D U^t] y
        = [U D V^t (V D² V^t + λ V I V^t)⁻¹ V D U^t] y
        = [U D V^t (V (D² + λI) V^t)⁻¹ V D U^t] y
        = [U D V^t V (D² + λI)⁻¹ V^t V D U^t] y
        = [U D (D² + λI)⁻¹ D U^t] y,    (4.13)

so that S_λ^ridge = U D (D² + λI)⁻¹ D U^t,
which reveals how the effective degrees of freedom depend on the choice of the tuning
parameter λ. Since both U and D are independent of λ, we can see that larger values of λ lead to smaller values of trace(S_λ^ridge). Hence, more shrinkage leads to ridge re-
gression fits with a smaller effective degrees of freedom because the solution β̂ ridge is
constrained to be in a smaller region around the origin.
Furthermore, note that D(D² + λI)⁻¹ D is still a diagonal matrix, with diagonal elements

γ_j² / (γ_j² + λ);   j = 1, . . . , d.

Hence, the effective degrees of freedom equals trace(S_λ^ridge) = ∑_{j=1}^{d} γ_j² / (γ_j² + λ), which decreases from d (for λ = 0, if X has full column rank) towards 0 as λ increases.
Let us denote by u1 , . . . , ud the columns of U which are an orthonormal basis for the d-
dimensional subspace of Rn determined by the columns of X. Then, expression (4.13)
can be written as
ŷ_ridge = ∑_{j=1}^{d} [ γ_j² / (γ_j² + λ) ] (u_j^t y) u_j.    (4.14)
For least squares regression a similar derivation yields

ŷ_LS = X β̂_LS
     = [X(X^t X)⁻¹ X^t] y
     = [U D V^t (V D U^t U D V^t)⁻¹ V D U^t] y
     = [U D V^t (V D² V^t)⁻¹ V D U^t] y
     = [U D V^t V (D²)⁻¹ V^t V D U^t] y
     = [U D (D²)⁻¹ D U^t] y
     = [U U^t] y
     = ∑_{j=1}^{d} (u_j^t y) u_j,    (4.15)
which corresponds to setting λ = 0 in (4.14) as expected. Both least squares and ridge
regression thus look for an approximation of the response vector y in the subspace
spanned by the columns u1 , . . . , ud of U. Least squares finds the approximation by
projecting y onto this subspace. Ridge regression then shrinks the least squares co-
ordinate weights (u_j^t y) by the factor γ_j²/(γ_j² + λ). Hence, the smaller the squared singular value γ_j², the larger the amount of shrinkage in the direction u_j.
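As a numerical check on (4.13)-(4.14), the following sketch (reusing the simulated X, y and the ridge_fit helper from the previous illustrative code block) computes the ridge fitted values both directly and via the SVD shrinkage factors γ_j²/(γ_j² + λ).

lambda <- 10
sv <- svd(X)                                        # X = U D V'
shrink <- sv$d^2 / (sv$d^2 + lambda)                # shrinkage factors gamma_j^2/(gamma_j^2 + lambda)
yhat_svd <- drop(sv$u %*% (shrink * crossprod(sv$u, y)))   # expression (4.14)
yhat_direct <- drop(X %*% ridge_fit(X, y, lambda)$coef)    # direct ridge fit
max(abs(yhat_svd - yhat_direct))                    # numerically zero
sum(shrink)                                         # equals the effective degrees of freedom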
The eigen-decomposition of X^t X follows directly from the singular value decomposition:

X^t X = V D² V^t,    (4.16)

so the columns v_1, . . . , v_d of V are the eigenvectors of X^t X with eigenvalues γ_1² ≥ · · · ≥ γ_d², and the projections of the feature data on these directions are given by X v_j = γ_j u_j. Hence, in the directions u_j, the variance of the feature observations equals γ_j²/n.
Therefore, directions with small squared singular values γ2j correspond to directions
with small variance of the inputs, and ridge regression shrinks these directions the
most.
The sample covariance matrix S of the predictors (remember that they are centered
at zero) is given by Xt X/n. Therefore, the eigen-decomposition of Xt X in (4.16) corre-
sponds to the eigen-decomposition of the sample covariance matrix up to the factor
n. Consequently, the eigenvectors v1 , . . . , vd are the principal components of the feature
data (in d-dimensional space), that is directions with maximal variance subject to being
orthogonal to directions of earlier principal components. The directions u1 , . . . , ud are
thus the principal component directions of the input observations in the n-dimensional
space.
Least squares regression becomes highly variable if there are directions in the data with small variance of the predictors. The small amount of information in these directions
makes it very difficult to accurately estimate the slope of the linear surface fitting the
data in these directions. This makes the least squares solution very unstable. Ridge
regression protects against a potentially high variance in these directions (with small
variability) by shrinking the regression coefficients in these directions. Hence, ridge
regression does not allow a large slope in these directions and thus assumes (implic-
itly) that the response will vary most in directions with high variance of the predictors.
This is often a reasonable assumption and even if it does not hold, the data do not con-
tain sufficient information to reliably estimate the slope of the surface in directions
with small variance of the predictors.
4.3.2 The Lasso

The lasso replaces the L2 penalty of ridge regression by an L1 penalty on the coefficients. The lasso regression coefficients are obtained by solving

β̂_lasso = argmin_β ∑_{i=1}^{n} ( y_i − β_0 − ∑_{j=1}^{d} β_j x_{ij} )²    (4.17)

subject to  ∥β∥_1 = ∑_{j=1}^{d} |β_j| ≤ t,    (4.18)

with t ≥ 0 a tuning parameter. As for ridge regression, no shrinkage is applied on the intercept, which can be estimated by β̂_0 = ȳ. Hence, the intercept can again be removed by centering the
response as well.
Replacing the L2 ridge penalty by the L1 lasso penalty makes the optimal solution a nonlinear function of the observed responses y (there is no longer a linear operator mapping y to the fit). Therefore, a more complicated optimization problem needs to be solved, but the solution can still be found in a very efficient way.
The lasso penalty and ridge penalty have a different effect on the regression pa-
rameters. To explain the difference, let us consider an orthonormal feature matrix X,
that is, the predictors are uncorrelated and have variance 1. In this case the principal components are just the coordinate directions. Since all predictors have the same variance, ridge regression imposes an equal amount of shrinking on each of them. The ridge regression estimates are

(β̂_ridge)_j = (β̂_LS)_j / (1 + λ).    (4.19)
Note that these coefficient estimates decrease with increasing value of λ, but do not
become zero for λ < ∞.
To obtain the lasso estimates we have to solve the constrained optimization problem (4.17)-(4.18). The Lagrange formulation of the problem becomes

β̂_lasso = argmin_β [ ∑_{i=1}^{n} ( y_i − ∑_{j=1}^{d} β_j x_{ij} )² + γ ( ∑_{j=1}^{d} |β_j| − t ) ],

with multiplier γ ≥ 0.

In case X is orthonormal, we have that X^t X = I. Therefore, the least squares estimator in this case reduces to β̂_LS = X^t y and, up to terms that do not depend on β, we can rewrite the lasso optimization as

β̂_lasso = argmin_β ∑_{j=1}^{d} [ −2 (β̂_LS)_j β_j + β_j² + γ |β_j| ].
Since each term in this sum depends on a different component β j of β, the terms in the
sum can be minimized separately. That is,
(β̂_lasso)_j = argmin_{β_j} [ −2 (β̂_LS)_j β_j + β_j² + γ |β_j| ].    (4.20)
Note that the last two terms in (4.20) are always nonnegative. To find the solution we split the problem into three cases.

Case 1: (β̂_LS)_j > 0. In this case (β̂_lasso)_j ≥ 0 as well, to make the first term in (4.20) negative. The problem thus reduces to

(β̂_lasso)_j = argmin_{β_j} [ −2 (β̂_LS)_j β_j + β_j² + γ β_j ].

Setting the derivative w.r.t. β_j equal to zero yields (β̂_lasso)_j = (β̂_LS)_j − γ/2 whenever this solution is positive and (β̂_lasso)_j = 0 otherwise.
Case 2: (β̂_LS)_j < 0. In this case (β̂_lasso)_j ≤ 0 as well, to make the first term in (4.20) negative. The problem thus reduces to

(β̂_lasso)_j = argmin_{β_j} [ −2 (β̂_LS)_j β_j + β_j² − γ β_j ].

Setting the derivative w.r.t. β_j equal to zero yields (β̂_lasso)_j = (β̂_LS)_j + γ/2 whenever this solution is negative and (β̂_lasso)_j = 0 otherwise.
Case 3: (β̂_LS)_j = 0. In this case the first term in (4.20) vanishes and the minimum of the two remaining nonnegative terms is attained at (β̂_lasso)_j = 0.

Combining the three cases yields the soft-thresholding solution (β̂_lasso)_j = sign((β̂_LS)_j) (|(β̂_LS)_j| − γ/2)_+. The value of the multiplier γ is determined by the constraint that the solution satisfies ∑_{j=1}^{d} |(β̂_lasso)_j| = t. If t is sufficiently large such that no coefficients are completely shrunken to zero, this reduces to
t = ∑_{j=1}^{d} ( |(β̂_LS)_j| − γ/2 )
  = ∑_{j=1}^{d} |(β̂_LS)_j| − dγ/2
  = ∥β̂_LS∥_1 − dγ/2
  = t_0 − dγ/2,
which yields γ = 2(t0 − t)/d. When t is decreased such that one of the parameter
estimates becomes zero, then both t0 and d in this expression need to be adjusted ac-
cordingly. The lasso thus equally shrinks all the least squares coefficients until the
constraint (4.18) is satisfied. In contrast to the ridge regression coefficients in (4.19), lasso coefficient estimates are shrunken to exactly zero whenever their least squares estimate (β̂_LS)_j is smaller in absolute value than the threshold γ/2, after which the remaining nonzero coefficients are shrunken further.
The effect of the lasso is often called soft thresholding. Regular hard thresholding puts
coefficients below the cutoff equal to zero and only keeps the coefficients above the
cutoff. This is the way subset selection for least squares actually proceeds. Fixing a size
M < d for the optimal model, best subset selection in this case chooses the model with
the M predictors corresponding to the largest parameters (β̂_LS)_j in absolute value. The soft thresholding of the lasso can be seen as a continuous subset selection. The tuning parameter t determines the threshold γ/2 through the constraint (4.18): coefficients exceeding the threshold in absolute value are shrunken, while the others are set exactly to zero. As t decreases, the shrinkage increases and more coefficients become zero, which implies that the corresponding predictors are removed from the model.
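The soft- and hard-thresholding rules described above can be written as two small R functions; the least squares coefficients and the threshold below are purely illustrative.

# Soft thresholding: shrink towards zero by gamma/2 and cut off at zero
soft_threshold <- function(b_ls, thresh) sign(b_ls) * pmax(abs(b_ls) - thresh, 0)

# Hard thresholding (best subset of size M): keep the M largest coefficients unchanged
hard_threshold <- function(b_ls, M) ifelse(rank(-abs(b_ls)) <= M, b_ls, 0)

b_ls <- c(3.0, -1.5, 0.4, -0.2, 0.05)    # hypothetical least squares estimates
soft_threshold(b_ls, thresh = 0.5)        # small coefficients become exactly zero
hard_threshold(b_ls, M = 2)               # coefficients are either kept as-is or dropped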
Let us now consider the general nonorthonormal case. This situation is represented
in Figure 4.1. Ridge regression minimizes the residual sum of squares subject to the
ridge constraint (4.9) while the lasso minimizes the same residual sum of squares sub-
ject to the lasso constraint (4.18). Let us put s = t2 in the ridge constraint (4.9) such
that both the ridge and lasso constraints put a bound t on the norm (L2 or L1 ) of β. In
the parameter space, the ridge constraint (4.9) means that the solution should fall in a
ball of radius t around the origin. The lasso constraint (4.18) means that the solution should fall in a diamond with edge length √2 t around the origin.
Consider the residual sum of squares (4.2) for each point in the parameter space.
The optimum (minimum) is at β̂ LS and let us assume that it does not belong to the
ridge region nor the lasso region. Therefore, we have to allow a larger value of the
residual sum of squares to obtain a feasible solution. The residual sum of squares is a
Figure 4.1 Ridge regression (left) and lasso (right) solutions in the general
nonorthonormal case. The green area is the part of the parameter space that is fea-
sible according to the constraint in the penalty. The red ellipses represent areas of
equal residual sum of squares (RSS). Larger sizes correspond to larger RSS values (in-
creasing value of c).
quadratic function of β, and thus for values of the residual sum of squares above the
optimum, the set of solutions becomes an ellipsoid around the center β̂ LS . The value
of the residual sum of squares thus needs to be increased until the corresponding el-
lipsoid of solutions hits the ridge/lasso constraint region around the origin. In either
case, the intersection point with the ellipsoid is the optimal solution. Hence, the shape
of the feasible region around the origin determines the solution that is found. The
ridge regression ball has no corners and is equally likely to be hit anywhere. The diamond (and certainly the rhomboid in higher dimensions) has corners on the coordinate axes, where one or more parameter values equal zero. These corners are more likely to be hit by the ellipsoid because they 'stick out' in the region. In practice this means that ridge regression usually shrinks all regression coefficients but they remain nonzero, so the size of the model is not reduced. On the other hand, the lasso will shrink some coef-
ficients to zero and thus reduces the size of the model. Lasso shrinkage automatically
incorporates variable selection. This explains the name Lasso which stands for Least
Absolute Shrinkage and Selection Operator.
4.3.3 Other Regularization Methods
Ridge regression and lasso can be generalized by considering the general penalized
least squares problem
β̂_pen = argmin_β { (y − Xβ)^t (y − Xβ) + ∑_{j=1}^{d} p_λ(|β_j|) },    (4.21)
where p_λ(·) is a penalty function depending on a tuning parameter λ > 0. Fan and Li (2001) argue that a good penalty function should lead to an estimator with the following three properties.

Approximate unbiasedness: The estimator is nearly unbiased when the true coefficient is large, to avoid unnecessary modeling bias.

Sparsity: The estimator performs model selection by setting small estimated coefficients (β̂_LS)_j to zero such that the model complexity is reduced.

Continuity: The estimator is continuous in the data to avoid instability in model prediction.
The advantage of penalty functions having these properties is that the correspond-
ing regularization method can enjoy the oracle property which means that on the one
hand the zero components are estimated as zero with probability going to one and
on the other hand the coefficients of the nonzero components are estimated consis-
tently. Hence, asymptotically the method performs as well as an oracle procedure
which knows the correct submodel. Note that ridge regression does not satisfy the
sparsity property while the lasso does not satisfy the approximate unbiasedness. To
meet all three criteria, the smoothly clipped absolute deviation (SCAD) penalty has been introduced in Fan and Li (2001):

p_λ(t) = λt                                     if t ≤ λ
       = −(t² − 2aλt + λ²) / (2(a − 1))         if λ < t ≤ aλ    (4.22)
       = (a + 1)λ² / 2                          if t > aλ,
with λ > 0 and a > 2. The constant a is often set equal to 3.7 which can be motivated
by a Bayesian argument. For small values of β j , the SCAD penalty pλ (| β j |) in (4.22)
corresponds to the lasso which allows for model selection. On the other hand, for
Figure 4.2 The ridge regression (dotted, red), lasso (dashed, blue) and SCAD (solid, black) penalty functions.
large values of β j the SCAD penalty pλ (| β j |) is a constant. This implies that the es-
timator is nearly unbiased for large values of β j because shrinking these coefficients
only affects the sum of squares, but does not reduce the penalty. In between a smooth
transition from the absolute value to the constant is established. The penalty is shown
in Figure 4.2. Next to the SCAD penalty, several other penalties have been proposed in
the literature. For example, the elastic net penalty proposed in Zou and Hastie (2005)
uses a convex combination of the L2 norm penalty (ridge regression) and the L1 norm
penalty (lasso) to reach the best of both worlds. Not all regularization methods are
penalized least squares methods, see e.g. the Dantzig selector of Candes and Tao (2007).
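As an illustration, the SCAD penalty (4.22) can be evaluated and plotted with a few lines of R, next to the lasso penalty and a ridge-type penalty; the grid of t values and the tuning constants (λ = 1 and a = 3.7, as suggested in the text) are illustrative.

# SCAD penalty of Fan and Li (2001), as in (4.22), evaluated for |t|
scad_penalty <- function(t, lambda = 1, a = 3.7) {
  t <- abs(t)
  ifelse(t <= lambda,
         lambda * t,
         ifelse(t <= a * lambda,
                -(t^2 - 2 * a * lambda * t + lambda^2) / (2 * (a - 1)),
                (a + 1) * lambda^2 / 2))
}

tgrid <- seq(-4, 4, by = 0.1)
plot(tgrid, scad_penalty(tgrid), type = "l", ylim = c(0, 5),
     xlab = "t", ylab = expression(p[lambda](t)))   # SCAD: flat for large |t|
lines(tgrid, abs(tgrid), lty = 2)   # lasso-type penalty |t| (lambda = 1)
lines(tgrid, tgrid^2, lty = 3)      # ridge-type penalty t^2 (lambda = 1), clipped by ylim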
The idea of adding a penalty term in the optimization problem corresponding to
an estimation method, to obtain more stable and/or more parsimonious models, can
be applied much more widely than just to least squares methods. An obvious class
of methods which can be combined with penalization are parametric maximum like-
lihood estimators (MLE) which are minimizers of the negative loglikelihood corre-
sponding to the parametric statistical model. A convenient family of distributions in
this respect is the exponential family in which it is assumed that the density or proba-
bility function of each response y_i is of the form

f(y; θ_{0,i}) = exp( [y θ_{0,i} − b(θ_{0,i})] / ϕ_0 + c(y, ϕ_0) ),

where ϕ_0 is a scale parameter and b : R → R as well as c : R × R_+ → R are functions determining the shape of the distribution. The parameter θ_{0,i} is referred to as the natural parameter of the exponential family and is specific to each y_i. A common formulation
is

θ_{0,i} = x_i^t β_0,
which is equivalent to the use of the canonical link function to model the effect of co-
variates on the conditional mean function E(yi |xi ). Penalized maximum (log-) likelihood
estimators add a penalty term to the classical log-likelihood in order to increase stabil-
ity (ridge penalty) and/or sparsity (lasso penalty, SCAD penalty, . . . ). The objective
function now is

L(β, ϕ) = ∑_{i=1}^{n} [ (y_i θ_i − b(θ_i)) / ϕ + c(y_i, ϕ) ] + ∑_{j=1}^{d} p_λ(|β_j|),   with θ_i = x_i^t β,
and optimization is carried out with respect to β and ϕ. Note that penalized MLE is
a direct extension of penalized least squares in regression to other statistical models.
Logistic regression (for a binary outcome) is an example where penalized MLE is used for the same reasons as penalized LS in standard regression models with a continuous outcome.
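As a small illustration of penalized maximum likelihood (not taken from the text), the following base R sketch fits a ridge-penalized logistic regression by numerically minimizing the penalized negative log-likelihood with optim; the simulated data and the choice λ = 1 are illustrative, and in practice λ would be chosen by, e.g., cross-validation.

set.seed(2)
n <- 200; d <- 4
X <- matrix(rnorm(n * d), n, d)
eta <- 1 + X %*% c(1.5, -1, 0, 0)
y <- rbinom(n, 1, plogis(eta))            # binary outcome

# Penalized negative log-likelihood: -loglik + lambda * sum(beta_j^2); intercept unpenalized
pen_nll <- function(par, X, y, lambda) {
  beta0 <- par[1]; beta <- par[-1]
  eta <- beta0 + X %*% beta
  -sum(y * eta - log(1 + exp(eta))) + lambda * sum(beta^2)
}

fit <- optim(rep(0, d + 1), pen_nll, X = X, y = y, lambda = 1, method = "BFGS")
fit$par                                    # penalized estimates: slopes shrunken towards zero
coef(glm(y ~ X, family = binomial))        # unpenalized MLE for comparison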
4.4 Screening
While regularization methods can be used to obtain more stable and parsimonious
models from high dimensional data, they still run into problems when the dimension
of the data is much larger than its sample size. A typical example of such ultra-high
dimensional data is gene expression data where several thousands of variables have
been measured for a limited number of observations (at most a few hundred). In such
situations the computational burden of regularization methods such as the lasso or
SCAD is still very high. Moreover, while these regularization methods can produce
sparse models, they have only been shown to be model selection consistent under stringent design conditions if the dimension exceeds the sample size. Model selection con-
sistency means that the method selects the true model with probability going to one
when the sample size increases, if this true model is among the candidate models. The
stringent design conditions are hard or impossible to verify in practice, so it remains
unclear how well regularization models select useful models if they can be computed.
To address these issues Fan and Lv (2008) introduced sure independence screening.
The idea is to apply a two step procedure.
1. In the first step a dimension reduction/ feature selection procedure is applied
to reduce the dimension d from huge (ultra-high) to a high but manageable size
for regularization methods, e.g. dimension d˜ = O(n) or even o (n). This dimen-
sion reduction method should be fast and efficient in the sense that with high
probability no important variables are lost.
2. Apply a regularization method which is model selection consistent for this re-
duced setting or preferably even has the oracle property so that the correct sub-
model is selected with high probability.
Regularization methods for the second step have been discussed in the previous sec-
tion, so in this section we focus on feature selection procedures for the first step.
The ultra-high dimensional data setting can be formalized and studied by allowing
that the dimension d of the data depends on the sample size n, which can be made
explicit by adding the subscript n to the dimension (d = dn ), and letting the dimension
increase at an exponential rate with the sample size, i.e. d = exp(O(n^ξ)) for some ξ > 0.
Some of the issues that arise in ultra-high dimensional data are the following.
• The (n × d) design matrix X is a “fat” matrix with many more columns than rows.
The matrix Xt X is thus a huge matrix which is singular because it is of rank at
most n (≪ d). As a result, its entries can be very misleading. For example,
the maximum absolute sample correlation between independent predictors can
be very large which goes against our intuition from low dimensional settings.
To illustrate this consider d independent predictors X1 , . . . , Xd that each follow a
standard normal distribution. In this case the (n × d) design matrix X consists
of independent realizations of N(0, 1). We generated N = 500 datasets of size n = 50 (a small simulation along these lines is sketched after this list). In Figure 4.3 you can see the maximum absolute correlation between the independent predictors for design matrices X of dimension d = 10, 100, 1000 and
10 000. Multiple correlations between groups of predictors can be even much
higher than these maximum absolute correlations indicating that spurious linear
relations in high dimensional datasets can be very high.
• The number of nonzero coefficients may also depend on n, and the size of the smallest nonzero coefficient, min_{j: β_j ≠ 0} |β_j|, may decay with n. The estimate of such a coefficient may then become close in size to the estimates of the coefficients of noise variables.
• Predictors may have heavy tails such that the stringent conditions on the design
matrix for model selection consistency of regularization methods are not satis-
fied.
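A small base R simulation in the spirit of the experiment described in the first bullet (the values N = 500 and n = 50 are those mentioned in the text; the dimension d = 10 000 is omitted here only to keep the sketch fast):

set.seed(3)
max_abs_cor <- function(n, d) {
  X <- matrix(rnorm(n * d), n, d)          # d independent N(0,1) predictors
  R <- cor(X)
  max(abs(R[upper.tri(R)]))                # largest absolute pairwise sample correlation
}

# N = 500 datasets of size n = 50, for increasing dimension d
res <- sapply(c(10, 100, 1000), function(d) replicate(500, max_abs_cor(50, d)))
colMeans(res)    # spurious maximum correlations grow with the dimension d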
Figure 4.3 Densities of the maximum absolute sample correlation between independent standard normal predictors, for dimensions d = 10, 100, 1000 and 10 000 (N = 500 datasets of size n = 50).
Assume that the predictors have been standardized such that each column x_j of X satisfies x_j^t x_j = 1, and consider the vector of componentwise regression coefficients

ω = X^t y ∈ R^d.

Note that each component ω_j (j = 1, . . . , d) is just the least squares coefficient obtained by regressing y on predictor x_j alone. Alternatively, the components of ω are the correlations of the predictors with the response variable, rescaled by the standard deviation of the response.
A submodel Aγ is then selected by retaining the indices of the predictors corre-
sponding to the largest absolute coefficients in ω:
A_γ = {1 ≤ j ≤ d : |ω_j| ≥ |ω|_(⌈n(1−γ)⌉)},    (4.23)

where |ω|_(⌈n(1−γ)⌉) is the ⌈n(1 − γ)⌉ order statistic of |ω|, with ⌈n(1 − γ)⌉ the smallest integer larger than or equal to n(1 − γ), for some γ > 0. Note that this is equivalent to selecting the
d˜ = [γn] predictors which have largest marginal correlation with the response vari-
able ([γn] denotes the integer part of γn). Therefore, this approach is also called corre-
lation learning. The method thus filters out the predictors which have weak marginal
correlation with the response. This correlation learning method is called Sure Indepen-
dence Screening (SIS) because each available feature is used independently as predictor
to evaluate how useful it is to predict the response variable. Clearly, SIS is a hard-
thresholding method. The original model of size d is shrunken to a submodel of much
smaller size d˜ by completely removing all features whose marginal correlation with
the response is below the threshold in (4.23).
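A minimal base R sketch of the SIS step (4.23), assuming standardized predictors; the simulated data and the submodel size of order n/log(n) (one of the choices mentioned in the text) are illustrative.

set.seed(4)
n <- 50; d <- 1000
X <- scale(matrix(rnorm(n * d), n, d))        # standardized predictors
beta <- c(3, -2, 1.5, rep(0, d - 3))          # only the first three predictors matter
y <- as.numeric(scale(X %*% beta + rnorm(n)))

omega <- drop(crossprod(X, y))                # componentwise regression coefficients omega = X'y
d_tilde <- floor(n / log(n))                  # submodel size o(n), e.g. n/log(n)
A_hat <- order(abs(omega), decreasing = TRUE)[1:d_tilde]
sort(A_hat)                                   # indices retained by sure independence screening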
There is a close connection between SIS and ridge regression. Multiplying the ridge regression estimator β̂_ridge(λ) = (X^t X + λI)⁻¹ X^t y by λ gives

λ β̂_ridge(λ) = λ (X^t X + λI)⁻¹ X^t y = (X^t X / λ + I)⁻¹ X^t y.
It can thus be seen that λ β̂ ridge (λ) → Xt y = ω for λ → ∞. If we use the absolute ridge
regression coefficient estimates β̂ ridge (λ) to order the features and threshold features
with too small coefficients, then we obtain the same order whether we use β̂ ridge (λ)
or λ β̂ ridge (λ). Hence, SIS is a limiting case of thresholded ridge regression screening with
regularization parameter λ = ∞. That is, the case where the variability of the estimator
is as small as possible.
Thresholded ridge regression selects a submodel A_λ by retaining the indices of the predictors corresponding to the largest absolute coefficients in β̂_ridge(λ):

A_λ = {1 ≤ j ≤ d : |(β̂_ridge(λ))_j| ≥ |β̂_ridge(λ)|_(⌈n(1−γ)⌉)}.    (4.24)
For ultra-high dimensional data ridge regression is not model selection consistent, so
applying thresholded ridge regression will not lead to a good model. However, we can
apply an iterative procedure which reduces the number of features by a factor 1 − δ in
each iteration for some trimming fraction 0 < δ < 1. Hence, the first step yields the
submodel
A^1_{δ,λ} = {1 ≤ j ≤ d : |(β̂_ridge(λ))_j| ≥ |β̂_ridge(λ)|_(⌈(1−δ)d⌉)},

which contains [δd] features. Based on the features in A^1_{δ,λ}, the corresponding ridge regression solution β̂^(1)_ridge(λ) is computed, which is used to select a second submodel

A^2_{δ,λ} = {1 ≤ j ≤ d : |(β̂^(1)_ridge(λ))_j| ≥ |β̂^(1)_ridge(λ)|_(⌈(1−δ)d⌉)}.
This submodel A^2_{δ,λ} ⊂ A^1_{δ,λ} contains [δ²d] features. Iterating this process yields a sequence of submodels A^1_{δ,λ} ⊃ A^2_{δ,λ} ⊃ · · · ⊃ A^k_{δ,λ}, where the process is stopped when the submodel A^k_{δ,λ} has sufficiently small size. Note that this process generally yields a dif-
ferent solution than the thresholded ridge regression in (4.24) even if the final models
have the same size because the ridge regression coefficients are re-estimated in each it-
eration step based on the remaining features. The only exception is the case λ = ∞, i.e.
SIS, because the ranking of the features is determined by the componentwise regres-
sion coefficients in this case. SIS is thus a special case of iteratively thresholded ridge
regression screening. Moreover, it is the only case where iterations are unnecessary
and the solution does not depend on the choice of δ.
The sure screening property holds for a screening method if the selected submodel
contains all important variables with probability going to 1 if the sample size in-
creases. In Fan and Lv (2008) it is shown that the sure screening property holds for
iteratively thresholded ridge regression screening under some stringent technical con-
ditions. These conditions have implications for the choice of δ. On the one hand δ
cannot be too small (if λ < ∞) such that several iterations are needed, but on the
other hand δ can also not be too large such that the number of iterations k does not be-
come too large. For example, iteratively thresholded ridge regression screening where
only one variable is removed in each iteration until a submodel of size n − 1 is obtained
would not satisfy the sure screening property because too many iterations (d − n + 1)
are needed to reach the final solution and in each iteration there is a small probabil-
ity that an important variable is lost. Hence, next to the excessive computation time
required for this procedure, it is also not guaranteed that it provides a reliable solu-
tion which contains the important predictors with high probability. The conditions
for sure screening naturally imply that important predictors should have a marginal
correlation with the response variable that is sufficiently high. Moreover, collinearity
between the predictors cannot be too high either. Because these issues regularly occur
in practice, we consider in the next section an extension of SIS that is more flexible.
Fan and Lv (2008) therefore also proposed an iterative version of SIS, which proceeds as follows.

1. Apply SIS on the original dataset with all d features and select a submodel with o(n) (e.g. n/log(n)) candidate predictors. Apply a regularization method that performs variable selection to this submodel, such as the lasso, SCAD or the Dantzig selector. This yields a first subset with k_1 predictors, A_1 = {X_{i_1}, . . . , X_{i_{k_1}}}.
2. Remove the effect of the k_1 selected predictors from the response. That is, replace y by the residual vector obtained by regressing y on x_{i_1}, . . . , x_{i_{k_1}}. Apply Step 1 with this residual vector as response and using the remaining d − k_1 features as candidate predictors to select a second subset A_2 with k_2 predictors.
3. Repeat the procedure in the previous steps until l subsets of predictors have been obtained, where l is a number such that the model A = ∪_{i=1}^{l} A_i contains o(n) or at most O(n) predictors. (A simplified sketch of this loop is given after this list.)
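A highly simplified base R sketch of the ISIS loop, reusing the simulated X and y from the SIS sketch above. The selection within each step is reduced here to a plain SIS step; the regularization-based selection (lasso, SCAD or Dantzig selector) of Step 1 is omitted, so this only shows the structure of the iteration. All helper names and tuning values are hypothetical.

# Marginal screening step: indices of the k predictors most correlated with r
sis_select <- function(X, r, k) order(abs(crossprod(X, r)), decreasing = TRUE)[1:k]

isis_sketch <- function(X, y, k_per_step = 5, n_steps = 3) {
  selected <- integer(0)
  r <- y                                             # current working response
  for (l in seq_len(n_steps)) {
    remaining <- setdiff(seq_len(ncol(X)), selected)
    new_idx <- remaining[sis_select(X[, remaining, drop = FALSE], r, k_per_step)]
    selected <- c(selected, new_idx)
    r <- resid(lm(y ~ X[, selected, drop = FALSE]))  # remove the effect of selected predictors
  }
  sort(selected)
}

isis_sketch(X, y)    # union of the subsets selected over the iterations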
Removing from the response the effect of the already selected predictors in step 2 has
several advantages. On the one hand, it makes it possible to select important predictors that had
a weak correlation with the original response due to the presence of already selected
predictors because such features will get a high correlation with the residuals. On
the other hand, features that are not important but have a high correlation with the
response due to their association with already selected predictors will have a much
weaker correlation with the residuals because this residual vector is uncorrelated with
the selected predictors. Moreover, this iterative sure independence screening (ISIS)
has the additional advantage that important variables that are missed by SIS (i.e. by
applying only one iteration) can still be picked up in the next iterations of the ISIS
procedure. Hence, ISIS is much more likely to return a model that is close to the true
submodel, even when the stringent conditions for sure screening are not satisfied.
4.4.4 Extensions
Sure screening is not limited to ultra-high dimensional linear regression problems,
but the idea of ISIS can be extended to handle ultra-high dimensional data in other
settings such as classification, generalized linear models or additive models, see e.g.
Fan et al. (2009); Barut et al. (2016); Cui et al. (2015); Fan et al. (2011) or even model
free settings Li et al. (2012); Pan et al. (2019); Zhu et al. (2011).
Chapter 5
In this chapter we mainly discuss the use of kernel-based methods for classifica-
tion and regression problems. Kernels make it possible to (implicitly) transform and enlarge the predictor space, which results in more flexible (nonlinear) solutions in the original predictor space. Because of the flexibility of the transformed space, regularization is needed to obtain a stable solution. These methods are commonly called support vector machines. An in-depth treatment of support vector machines can be found in e.g. Schölkopf and Smola (2002); Christmann and Steinwart (2008). A discussion from a statistical perspective can be found in Hastie et al. (2001), while a more introductory exposition is given in James et al. (2014).
5.1 Classification
In classification we consider problems with a qualitative (categorical) outcome
variable G, where the labels of G can be denoted by 1, . . . , k without loss of gener-
ality. Often the outcome is dichotomous, that is, there are only two classes (k = 2).
Examples are infected vs. not-infected by a certain virus or bacteria, valid vs. fraudu-
lent transaction, or admitted vs. not admitted. In this section we consider such binary
classification problems. In this case the class label G is often represented by a single
binary variable Y which takes the values 0 or 1, or alternatively, -1 or 1.
In a classification problem the goal is to find good estimates of the class labels. Ideally, the method works well on the whole distribution of outcomes and features. In practice, these predictions can be considered both for the training data, expressing the goodness-of-fit quality of the method, and for new objects, expressing the prediction
capacity of the method. Furthermore, the purpose often also is to find a representation
that separates or discriminates the classes as much as possible to get more insight in the
differences between the classes and to determine the features that are associated with
these differences.
Given any vector of features x ∈ Rd , we would like to find the classification rule
that makes the least possible mistakes, or more generally, gives the smallest loss. In
the general case, we need to define a loss function L(r, s) that represents the price paid
for classifying an object from class r into class s, for all possible classes in G (of which there are 2 in this case). The standard setting of least possible mistakes corresponds to setting
L(r, s) = 1 if r ̸= s and zero otherwise, which is called the zero-one loss function. We
now would like to find the classifier Ĝ that minimizes the expected prediction error or
expected misclassification loss given by
EML(Ĝ(x)) = ∑_{r=1}^{k} L(r, Ĝ(x)) P(r | X = x).    (5.1)
Minimizing this expected loss pointwise in x yields

Ĝ(x) = argmin_{g∈{1,...,k}} ∑_{r=1}^{k} L(r, g) P(r | X = x).    (5.2)

The classifier is thus defined pointwise. For each feature vector x, the class estimate is given by the solution of (5.2). For the standard zero-one loss, the classifier simplifies
to
Ĝ(x) = argmin_{g∈{1,...,k}} ∑_{r ≠ g} P(r | X = x)    (5.3)
     = argmin_{g∈{1,...,k}} [1 − P(g | X = x)],

or equivalently

Ĝ(x) = argmax_{g∈{1,...,k}} P(g | X = x).    (5.4)
This solution is called the Bayes classifier. It says that we should classify an object with features x to the most probable class as determined by the conditional distribution P(G | X = x), that is, to the class g for which P(g | X = x) = max_{r∈{1,...,k}} P(r | X = x).

Figure 5.1 shows a mixture of observations from three different bivariate normal distributions together with the optimal Bayesian decision boundaries, which are quadratic functions in this case. With real data, of course, the true conditional distribution of the classes is unknown and the Bayes classifier (5.4) cannot be derived.
Figure 5.1 Mixture of three bivariate normal distributions with optimal Bayesian decision boundaries.

The error rate of the Bayes classifier, obtained as

E_X [ 1 − P(Ĝ(X) | X) ],

with Ĝ(x) the Bayes classifier, is called the optimal Bayes error rate. This is the error rate
we would obtain if the conditional distribution P(G | X) were known.
We start our discussion of support vector machines for classification by considering
methods that explicitly try to find linear boundaries that separate the two classes as
much as possible when the classes are separable. We then discuss the construction of
such optimal separating hyperplanes when the classes are not completely separable
anymore. Finally, we introduce kernels to add flexibility to this approach and discuss
extensions using different loss functions.
Consider a hyperplane H_β in R^d represented by an equation

f(x) = β_0 + β^t x = 0,    (5.7)

with β ∈ R^d and β_0 ∈ R. The unit vector β⋆ = β/∥β∥ is orthogonal to the hyperplane H_β. For a point x_0 in H_β, the signed distance of any point x to the hyperplane can then be defined as

d_sign(x, H_β) = (β⋆)^t (x − x_0).
The function is called signed distance because its absolute value is the orthogonal
distance from the point to the hyperplane but the function is positive for points on one
side (the positive side for which f ( x ) > 0) and negative for points on the other side
(the negative side for which f ( x ) < 0) of the hyperplane. Note that for any point x0 in
Hβ , it holds that f ( x0 ) = 0, which yields βt x0 = − β 0 . Therefore, the signed distance
can be rewritten as
d_sign(x, H_β) = (β⋆)^t (x − x_0)
              = (β^t x − β^t x_0) / ∥β∥
              = (β^t x + β_0) / ∥β∥
              = f(x) / ∥f′(x)∥,
since f ′ ( x ) = β in this case. Hence, the signed distance of a point x to the hyperplane
defined by f ( x ) = 0 is proportional to the value f ( x ). Separating hyperplane classi-
fiers try to find a linear function f ( x ) as in (5.7) such that f ( x ) > 0 for objects in class
Y = 1 and f ( x ) < 0 for objects in class Y = −1. Procedures that classify observations
by computing the sign of a linear combination of the input features were originally
called perceptrons when introduced in the engineering literature.
Let us assume that the data are separable, that is, there is at least one hyperplane that
completely separates the two classes. In this case, the solution of the separating hyper-
plane problem will often not be unique, but several hyperplanes completely separate
the classes. To make the solution unique, an additional condition can be imposed to
find the best possible separating hyperplane among all choices. The optimal separating
hyperplane is defined as the hyperplane that separates the two classes and maximizes
the distance from the hyperplane to the closest points from both classes. This extra
condition defines a unique solution to the separating hyperplane problem and maxi-
mizes the margin between the two classes on the training data. It can be expected that
it therefore also leads to better performance on test data.
If Hβ is a separating hyperplane, then Y = −1 for points on the negative side
( f ( x ) < 0) of Hβ and Y = 1 for points on the positive side ( f ( x ) > 0). It follows that
the (unsigned) distance of a point xi to the hyperplane Hβ is obtained by d( xi , Hβ ) =
yi ( βt xi + β 0 )/∥ β∥. Hence, a separating hyperplane satisfies d( xi , Hβ ) = yi ( βt xi +
β 0 )/∥ β∥ > 0 for all training points.
The optimal separating hyperplane is the solution to the problem
max_{β, β_0} C    (5.8)

subject to  y_i (β^t x_i + β_0) / ∥β∥ ≥ C;   i = 1, . . . , n.    (5.9)
The condition (5.9) can equivalently be written as
yi ( βt xi + β 0 ) ≥ C ∥ β∥; i = 1, . . . , n. (5.10)
Now suppose that for some value of C, there are values β and β 0 that satisfy condi-
tion (5.10). Let us consider a positive value γ, and define β̃ = γβ and β̃ 0 = γβ 0 . It
then follows that ∥ β̃∥ = γ∥ β∥ and condition (5.10) is equivalent to
γyi ( βt xi + β 0 ) ≥ γC ∥ β∥; i = 1, . . . , n
or yi ( β̃t xi + β̃ 0 ) ≥ C ∥ β̃∥; i = 1, . . . , n.
This means (not surprisingly) that if values β and β 0 satisfy the condition, then also
any multiple by a positive constant satisfies the constraint. To make the solution
unique, the value of ∥β∥ can be taken equal to an arbitrarily chosen constant. A convenient choice is to take

∥β∥ = 1/C.    (5.11)
The maximization problem (5.8) then becomes the minimization problem

min_{β, β_0} ∥β∥.

Minimizing the norm is equivalent to minimizing the squared norm, which is analytically more tractable. Hence, the constrained maximization problem (5.8)-(5.9) can be reformulated as

min_{β, β_0}  (1/2) ∥β∥²    (5.12)

subject to  y_i (β^t x_i + β_0) ≥ 1;   i = 1, . . . , n.    (5.13)
Note that condition (5.9) together with (5.11) implies that around the optimal sepa-
rating boundary there is a margin of width 1/∥ β∥ in both directions, that does not
contain any data points.
The optimal separating hyperplane problem as formulated in (5.12)-(5.13) consists
of a quadratic criterion that needs to be minimized under linear inequality constraints,
which is a standard convex optimization problem of operations research. It is well-
known that minimizing (5.12)-(5.13) is equivalent to minimizing w.r.t. β and β 0 the
Lagrange (primal) function
L_P = (1/2) ∥β∥² − ∑_{i=1}^{n} α_i [ y_i (β^t x_i + β_0) − 1 ]
    = (1/2) β^t β − ∑_{i=1}^{n} α_i [ y_i (β^t x_i + β_0) − 1 ],    (5.14)
where αi ≥ 0 are the Lagrange multipliers. Setting the derivatives w.r.t. β and β 0 equal
to zero yields the equations
β = ∑_{i=1}^{n} α_i y_i x_i    (5.15)

0 = ∑_{i=1}^{n} α_i y_i.    (5.16)
These first-order conditions thus define the parameter β as function of the Lagrange
multipliers αi . Hence, it remains to determine the optimal values of the Lagrange mul-
tipliers. To this end, the above first-order conditions are substituted in the Lagrange
function (5.14) which yields the Wolfe dual function
L_D = ∑_{i=1}^{n} α_i − (1/2) ∑_{i=1}^{n} ∑_{j=1}^{n} α_i α_j y_i y_j x_i^t x_j,    (5.17)

which needs to be maximized with respect to α_1, . . . , α_n under the constraints

α_i ≥ 0;   i = 1, . . . , n.    (5.18)

This is a standard convex optimization problem. The final solution must satisfy the Karush-Kuhn-Tucker conditions, well-known in operations research. These conditions are given by (5.13), (5.15)-(5.16), (5.18) and the condition

α_i [ y_i (β^t x_i + β_0) − 1 ] = 0;   i = 1, . . . , n.    (5.19)
Condition (5.19) implies that for each observation either α_i = 0 or y_i (β^t x_i + β_0) = 1.

• If y_i (β^t x_i + β_0) > 1, then the point x_i lies strictly outside the margin and α_i = 0.

• If α_i > 0, then y_i (β^t x_i + β_0) = 1, which means that the point x_i is on the boundary of the margin around the separating hyperplane.
From (5.15) it follows that the optimal solution (direction) β is defined as a linear com-
bination of the points xi with αi > 0. These points are the points on the boundary
of the margin around the separating hyperplane, called the support points. Hence, op-
timal separating hyperplanes focus mainly on points close to the decision boundary
and give zero weight to all other points (although all observations are needed to
determine the support points).
Once the optimal separating hyperplane H_β̂ has been found as the solution of the constrained optimization problem, new observations can be classified as

ŷ(x) = sign(β̂^t x + β̂_0).

In practice, it will often occur that there does not exist a separating hyperplane and thus there is no feasible solution to problem (5.12)-(5.13). Therefore, we extend the separating hyperplane problem in the next section to the case of nonseparable classes.
When the two classes overlap, the constraints (5.13) cannot all be satisfied. They can be relaxed by introducing nonnegative slack variables ξ_1, . . . , ξ_n and replacing (5.13) by the constraints

y_i (β^t x_i + β_0) ≥ 1 − ξ_i;   i = 1, . . . , n.
Hence, the value ξ i in the above constraint is proportional to the amount by which
each point xi falls at the wrong side of its margin. Therefore, to obtain a reasonable
solution, a bound needs to be set on the sum ∑_{i=1}^{n} ξ_i, the total amount by which points are allowed to lie at the wrong side of their margin. Let Q denote this bound, hence we require ∑_{i=1}^{n} ξ_i ≤ Q. Since the area of the margin between the two classes is not required to be empty anymore, it is often called a soft margin. Note that a training
point ( xi , yi ) is misclassified if ξ i > 1 because then yi and βt xi + β 0 have a different
sign, hence the constant Q determines the maximal number of misclassified training
points that is allowed.
A convenient formulation of the soft margin problem or support vector classifier problem is

min_{β, β_0} [ (1/2) ∥β∥² + γ ∑_{i=1}^{n} ξ_i ]    (5.20)

subject to  y_i (β^t x_i + β_0) ≥ 1 − ξ_i;   i = 1, . . . , n
            ξ_i ≥ 0;   i = 1, . . . , n,    (5.21)
where the tuning parameter γ > 0 now takes over the role of Q. A large value of γ puts a strong constraint on ∑_{i=1}^{n} ξ_i and thus allows few misclassifications, while a small value of γ puts little constraint on ∑_{i=1}^{n} ξ_i. Note that the limit case γ = ∞ corresponds
to the separable case since all slack variables ξ i should equal zero in this case. The
support vector classifier problem (5.20)-(5.21) still consists of a quadratic criterion with
linear constraints, hence a standard convex optimization problem. The minimization
problem is equivalent to minimizing w.r.t. β, β 0 and ξ i the Lagrange function
L_P = (1/2) ∥β∥² + γ ∑_{i=1}^{n} ξ_i − ∑_{i=1}^{n} α_i [ y_i (β^t x_i + β_0) − (1 − ξ_i) ] − ∑_{i=1}^{n} µ_i ξ_i,    (5.22)

where α_i ≥ 0 and µ_i ≥ 0 are the Lagrange multipliers. Setting the derivatives w.r.t. β, β_0 and ξ_i equal to zero yields the equations

β = ∑_{i=1}^{n} α_i y_i x_i    (5.23)

0 = ∑_{i=1}^{n} α_i y_i    (5.24)

α_i = γ − µ_i;   i = 1, . . . , n.    (5.25)
Note that equations (5.23) and (5.24) are the same as in the separable case discussed
in the previous section (equations (5.15) and (5.16)). The difference is that the last
equation relates the Lagrange multipliers αi to the constant γ. Substituting these first
order conditions in (5.22) yields the Wolfe dual objective function
L_D = ∑_{i=1}^{n} α_i − (1/2) ∑_{i=1}^{n} ∑_{j=1}^{n} α_i α_j y_i y_j x_i^t x_j.    (5.26)

Note that this is exactly the same dual function as in the separable case (expression (5.17)), but now it needs to be maximized w.r.t. α_1, . . . , α_n under the constraints

∑_{i=1}^{n} α_i y_i = 0
0 ≤ α_i ≤ γ.    (5.27)
The Karush-Kuhn-Tucker conditions satisfied by the solution include the constraints

α_i [ y_i (β^t x_i + β_0) − (1 − ξ_i) ] = 0;   i = 1, . . . , n    (5.28)
µ_i ξ_i = 0;   i = 1, . . . , n    (5.29)
y_i (β^t x_i + β_0) − (1 − ξ_i) ≥ 0;   i = 1, . . . , n.    (5.30)
From (5.23) it still follows that the solution for β is a linear combination of the observa-
tions with αi > 0. From condition (5.28) it follows that these points are observations for
which constraint (5.30) is exactly met. These observations are again called the support
vectors. Some of these support vectors will lie on the margin (ξ i = 0) which according
to (5.29) and (5.25) implies that 0 < αi < γ while others lie on the wrong side of their
margin (ξ i > 0) and are characterized by αi = γ.
As an illustration we consider an artificial two-dimensional dataset from Hastie
et al. (2001). The dataset is a two-class classification problem where each class consists
of a mixture of 10 bivariate Gaussian distributions with the same covariance matrix
but different centers. The training dataset contains 100 points from each class and is
displayed in Figure 5.2. The colored triangles in this plot represent the observations of
the training dataset. The observations of one class are plotted as red upper triangles
while the observations of the other class are plotted as green lower triangles. Next to
the training data, there is a test dataset which consists of a large grid of points indi-
cated by the small dots in Figure 5.2. This test set allows us to evaluate precisely the
performance of classification methods. The test grid also makes it possible to visualize
the decision boundary which clearly is highly nonlinear. Even with the true boundary,
there still are many misclassifications in the training data, which shows that the two
classes highly overlap. Indeed, even the optimal (Bayes) misclassification error rate
corresponding to the true boundary is quite large as it equals 21.0%.
In R several packages can be used to fit support vector machines. For the examples
in this chapter we use the function svm from the e1071 package. Fitting the support
vector classifier to the artificial training dataset yields the result shown in Figure 5.3.
As expected the boundary that separates the two classes is linear as can be seen from
the predicted labels for the test data. The misclassification rate on the test data is 28.8%,
so it is 7.8% higher than the optimal error rate. There are 63 support vectors in each
class. These support vectors are highlighted by their larger plotting symbol. It can be
seen from the plot that the support vectors consist of the points closest to the decision
boundary and the points on the wrong side of the decision boundary.
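The kind of fit described above can be illustrated with the svm function of the e1071 package mentioned in the text; the simulated two-class data and the cost value below are only illustrative and are not the mixture data of Hastie et al. (2001). The cost argument corresponds to the tuning parameter γ in (5.20); it is unrelated to the gamma argument of svm, which is a kernel parameter.

library(e1071)

set.seed(5)
n <- 100
x <- rbind(matrix(rnorm(n, mean = -1), n / 2, 2),   # class -1
           matrix(rnorm(n, mean =  1), n / 2, 2))   # class +1
y <- factor(rep(c(-1, 1), each = n / 2))
train <- data.frame(x1 = x[, 1], x2 = x[, 2], y = y)

# Support vector classifier: linear kernel
fit <- svm(y ~ x1 + x2, data = train, kernel = "linear", cost = 1, scale = FALSE)
summary(fit)                    # reports the number of support vectors
pred <- predict(fit, train)
mean(pred != train$y)           # training misclassification rate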
Figure 5.2 Artificial two-dimensional binary classification data with 100 observations
of each class in the training set.
Enlarging the feature space with transformations of the original predictors will generally achieve a better class separation on the training data and lead to nonlin-
ear boundaries in the original predictor space. Consider a transformed feature space
h( X ) = (h1 ( X ), . . . , h M ( X ))t which leads to a transformed set of feature vectors h( xi )
for the training data. Applying the support vector classifier in this enlarged space pro-
duces a boundary fˆ( x ) = h( x )t β̂ + β̂ 0 . This boundary is linear in the enlarged space
h( X ), but nonlinear in the original feature space X. The support vector classifier is
ŷ( x ) = sign( fˆ( x )) as before. By allowing the dimension of the transformed feature
space to become very large, it becomes possible to find a boundary that completely
separates the classes. However, the nonlinear boundary in the original space can be
very wiggly, indicating overfitting of the training data. In such cases the optimistic
zero training error will be countered by a high test error. To avoid overfitting, regular-
ization is needed to penalize roughness of the boundary such that rougher boundaries
are only allowed if they significantly improve the training classification error.
In the previous section we showed that the support vector optimization prob-
lem (5.20)-(5.21) could be rewritten as maximizing the dual function (5.26) subject to
the constraints (5.27). For a transformed predictor space h( X ) this becomes: maximize
the dual function
L_D = ∑_{i=1}^{n} α_i − (1/2) ∑_{i=1}^{n} ∑_{j=1}^{n} α_i α_j y_i y_j h(x_i)^t h(x_j)    (5.31)
Figure 5.3 Artificial two-dimensional binary classification dataset with test data clas-
sified according to the support vector classifier learned on the training data.
subject to the constraints

∑_{i=1}^{n} α_i y_i = 0
0 ≤ α_i ≤ γ.    (5.32)
Using (5.23) it follows that the support vector boundary can be written as
f(x) = β^t h(x) + β_0
     = ∑_{i=1}^{n} α_i y_i h(x)^t h(x_i) + β_0.    (5.33)
Note that for given αi and β, the constant β 0 can be determined as the solution of
yi f ( xi ) = 1 for support points xi on the margin (0 < αi < γ).
From (5.31) and (5.33) it follows that both the optimization problem (5.31) and its
solution (5.33) in the transformed feature space only depend on products h( x )t h( x ′ )
of the transformed features. This allows us to take this one step further and to con-
sider transformations of the features into an infinite-dimensional reproducing kernel
Hilbert space (RKHS) HK generated by a positive definite kernel K ( x, y). More details
about kernel functions are given in the next section. In this case, the scalar product
h( x )t h( x ′ ) needs to be replaced by the corresponding values of the kernel function
K ( x, x ′ ). Hence, if the transformation of the feature space is induced by a kernel func-
tion, then (5.31) can be rewritten as
L_D = ∑_{i=1}^{n} α_i − (1/2) ∑_{i=1}^{n} ∑_{j=1}^{n} α_i α_j y_i y_j K(x_i, x_j),    (5.34)
which needs to be maximized under constraints (5.32) and the solution (5.33) becomes
f(x) = ∑_{i=1}^{n} α_i y_i K(x, x_i) + β_0.    (5.35)
From (5.34)- (5.35) it can be seen that the support vector classifier optimization prob-
lem can be formulated and solved in the transformed space HK without the need to
explicitly specify the underlying transformation h( X ). The knowledge of the kernel
function is sufficient. The above dual problem still can easily be solved using the same
optimization techniques as for the support vector classifier (in the original space) in
the previous section. The resulting classification method is called a support vector ma-
chine (SVM).
Popular choices for the kernel function K(x, y) are the polynomial kernel of degree q, K(x, y) = (1 + x^t y)^q, and the Gaussian radial basis function kernel, K(x, y) = exp(−∥x − y∥²/(2σ²)).

Support vector machines fit in a general regularization framework in which one minimizes a penalized empirical loss of the form

min_{f ∈ H} [ ∑_{i=1}^{n} L(y_i, f(x_i)) + λ J(f) ],    (5.36)

where H is a (possibly infinite-dimensional) class of functions, L is a loss function and J(f) is a penalty that measures the complexity of the function f.

A positive definite kernel K(x, y) has an eigen-expansion
K(x, y) = ∑_{i=1}^{∞} γ_i ϕ_i(x) ϕ_i(y)    (5.37)

with γ_i ≥ 0 and ∑_{i=1}^{∞} γ_i² < ∞. The basis functions ϕ_i(x) form a basis for the (infinite-
dimensional) Hilbert space HK spanned by the kernel function K. This result is known
as Mercer’s theorem. Another application of this theorem and more discussion on this
result can be found in the chapter on functional data analysis. Functions in HK thus
have an expansion of the form
f(x) = ∑_{i=1}^{∞} a_i ϕ_i(x).    (5.38)
The kernel K(x, y) can be used to define an inner product in the Hilbert space. For any two functions f(x) = ∑_{i=1}^{∞} a_i ϕ_i(x) and g(x) = ∑_{i=1}^{∞} b_i ϕ_i(x) in H_K, the inner product is defined as

< f, g > = ∑_{i=1}^{∞} a_i b_i / γ_i.    (5.39)

This inner product leads to the definition of a norm in the Hilbert space H_K given by

∥f∥²_{H_K} = < f, f > = ∑_{i=1}^{∞} a_i² / γ_i.    (5.40)
Note that for any z ∈ R^d the function g(x) = K(x, z) = ∑_{i=1}^{∞} [γ_i ϕ_i(z)] ϕ_i(x) satisfies (5.38) with ∥g∥²_{H_K} = ∑_{i=1}^{∞} γ_i² ϕ_i(z)² / γ_i = ∑_{i=1}^{∞} γ_i ϕ_i(z)² = K(z, z) < ∞. Hence, g(x) = K(x, z) ∈ H_K. For a fixed z ∈ R^d, the function g(x) = K(x, z) will be denoted by K(., z) to indicate that we consider the kernel as a function of its first argument with the second argument fixed.
The penalty term J(f) in (5.36) is usually taken equal to J(f) = ∥f∥²_{H_K}. In this case, contributions to f of basis functions ϕ_i(x) with large eigenvalue γ_i are penalized less, while contributions of basis functions with small eigenvalue γ_i are penalized more.
The minimization problem (5.36) now becomes

min_{f ∈ H_K} [ ∑_{i=1}^{n} L(y_i, f(x_i)) + λ ∥f∥²_{H_K} ].    (5.42)
It has been shown that this problem has a finite-dimensional solution in HK that can
be written in the form
f(x) = ∑_{i=1}^{n} α_i K(x, x_i),    (5.44)
see e.g. Christmann and Steinwart (2008).
For the kernel function K it holds that

< K(., x_i), K(., x_j) > = ∑_{m=1}^{∞} γ_m ϕ_m(x_i) γ_m ϕ_m(x_j) / γ_m
                         = ∑_{m=1}^{∞} γ_m ϕ_m(x_i) ϕ_m(x_j)
                         = K(x_i, x_j).    (5.45)
This property is known as the reproducing property of the kernel K. It implies that any function g ∈ H_K can be approximated arbitrarily well by a function f of the form (5.44).
For a function f of the form (5.44) it follows further that
< f, K(., x_i) > = < ∑_{j=1}^{n} α_j K(., x_j), K(., x_i) >
                 = ∑_{j=1}^{n} α_j < K(., x_j), K(., x_i) >
                 = ∑_{j=1}^{n} α_j K(x_i, x_j).    (5.47)
Using (5.44) and (5.45) the penalty term J ( f ) = ∥ f ∥2HK can be rewritten as
∥f∥²_{H_K} = < f, f > = ∑_{i=1}^{n} ∑_{j=1}^{n} α_i α_j < K(., x_i), K(., x_j) >
           = ∑_{i=1}^{n} ∑_{j=1}^{n} α_i α_j K(x_i, x_j)
           = α^t K α,    (5.48)
where K denotes the n × n kernel matrix with entries K_{ij} = K(x_i, x_j). Substituting (5.44) and (5.48) into the regularization problem (5.42) yields

min_{α} [ ∑_{i=1}^{n} L( y_i, ∑_{j=1}^{n} α_j K(x_i, x_j) ) + λ α^t K α ],    (5.49)

which now is a finite-dimensional criterion that can be solved numerically.
that the infinite-dimensional problem (5.42) can be reduced to a finite-dimensional
problem as in (5.49) is often called the kernel property.
In general there are two important choices in regularization using kernels. The
choice of the kernel function K and the choice of loss function L. As mentioned before,
popular choices for the kernel function K ( x, y) are polynomial kernels and radial basis
function kernels.
Polynomial kernels

For d = 2 predictors, for example, the polynomial kernel of degree 2 equals

K(x, y) = (x^t y + 1)²
        = 1 + 2x_1 y_1 + 2x_2 y_2 + x_1² y_1² + x_2² y_2² + 2x_1 x_2 y_1 y_2,

which corresponds to the inner product h(x)^t h(y) of the transformed feature vectors with h(x) = (1, √2 x_1, √2 x_2, x_1², x_2², √2 x_1 x_2)^t. The solution of the corresponding SVM is thus a quadratic polynomial in the original features.

The Gaussian kernel K(x, y) = exp(−∥x − y∥²/(2σ²)) leads to a solution (5.44) that is an expansion in the Gaussian radial basis functions

g_j(x) = exp(−(1/2) (∥x − x_j∥/σ)²);   j = 1, . . . , n.

The radial kernel K(x, y) = ∥x − y∥² log(∥x − y∥) leads to a solution that is an expansion in the radial basis functions

g_j(x) = ∥x − x_j∥² log(∥x − x_j∥).
Figure 5.4 Artificial two-dimensional binary classification dataset with test data clas-
sified according to the polynomial SVM of degree 2 learned on the training data.
For our artificial data example, the SVM with Gaussian kernel yields the result in
Figure 5.6. The boundary is clearly much more flexible than for the polynomial ker-
nels due to the transformation to an infinite-dimensional space. The misclassification
rate of this Gaussian kernel SVM equals 22.1% which is only 1.1% higher than the op-
timal error rate and much better than the performance of the support vector classifier
and the polynomial SVMs. Note that there is some similarity between the decision
boundary of the Gaussian kernel SVM in Figure 5.6 and the optimal decision bound-
ary in Figure 5.2. The largest differences are in areas without training observations, so
without information for the SVM to determine the boundary.
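A Gaussian radial basis function SVM (and polynomial SVMs) can be fitted in the same way by changing the kernel argument of svm; this sketch reuses the simulated train data from the earlier svm example, and the tuning values are illustrative (in practice cost and gamma would be chosen by cross-validation, e.g. with tune.svm from e1071).

library(e1071)

# Radial basis kernel in e1071: K(x, y) = exp(-gamma * ||x - y||^2)
fit_rbf <- svm(y ~ x1 + x2, data = train, kernel = "radial",
               cost = 1, gamma = 0.5, scale = FALSE)
mean(predict(fit_rbf, train) != train$y)    # training error of the nonlinear fit

# Polynomial kernels of degree 2 or 3: kernel = "polynomial" with degree = 2 or 3
fit_poly <- svm(y ~ x1 + x2, data = train, kernel = "polynomial", degree = 2, coef0 = 1)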
Above we introduced support vector machines for classification by using the dual
formulation of the support vector classifier which immediately yields a computation-
ally attractive representation of SVMs. To show that classification SVMs fit in the gen-
eral RKHS regularization framework and to better understand SVMs, we now also
consider the original (primal) formulation of the support vector classifier.
For non-support vectors (αi = 0), condition (5.25) implies that µi = γ. It then
follows from condition (5.29) that ξ i = 0 for non-support vectors. Moreover, condi-
tion (5.30) becomes yi ( βt xi + β 0 ) ≥ 1 or equivalently
1 − yi ( βt xi + β 0 ) = 1 − yi f ( xi ) ≤ 0.
Figure 5.5 Artificial two-dimensional binary classification dataset with test data clas-
sified according to the polynomial SVM of degree 3 learned on the training data.
For support vectors (α_i > 0), condition (5.28) implies that

y_i (β^t x_i + β_0) − (1 − ξ_i) = 0,

or equivalently

0 ≤ ξ_i = 1 − y_i (β^t x_i + β_0) = 1 − y_i f(x_i).

Using these results, ∑_{i=1}^{n} ξ_i can be rewritten as

∑_{i=1}^{n} [1 − y_i f(x_i)]_+,    (5.50)

where [u]_+ = max(u, 0) denotes the positive part of u.
By inserting (5.50) into (5.20) and denoting λ = 1/(2γ) the support vector classifier
solves the optimization problem

min_{β, β_0} [ ∑_{i=1}^{n} [1 − y_i f(x_i)]_+ + λ ∥β∥² ].
Figure 5.6 Artificial two-dimensional binary classification dataset with test data clas-
sified according to the Gaussian RBF SVM learned on the training data.
Writing f(x) = ∑_{i=1}^{n} α̃_i K(x, x_i) (where α̃_i = y_i α_i in (5.35)), (5.52) can be rewritten as

min_{α̃,β₀} { ∑_{i=1}^{n} [1 − y_i f(x_i)]₊ + λα̃ᵗKα̃ },        (5.53)

where K is the kernel matrix. This is a minimization problem of the form (5.49) and is a special case of the more general problem

min_{f∈H_K} { ∑_{i=1}^{n} L(y_i, f(x_i)) + λJ(f) },

with the hinge loss L(y, f(x)) = [1 − y f(x)]₊, for example.
5.2 Regression
Regularized optimization in RKHS as in (5.42) can also be used in the regression context. Consider the linear regression model

f(x) = β₀ + xᵗβ.        (5.54)

If we take the common squared error loss for the loss function L, then (5.49) reduces to the penalized least squares problem

min_α { ∥y − Kα∥² + λαᵗKα },        (5.55)

with solution

α̂ = (K + λI)⁻¹y.        (5.56)

This yields

f̂(x) = ∑_{i=1}^{n} α̂_i K(x, x_i),

or, in vector notation,

f̂ = Kα̂ = K(K + λI)⁻¹y = (I + λK⁻¹)⁻¹y,
which shows that the fitted values again are a linear function of the responses y. This
least squares SVM for regression can be seen as a generalization of ridge regression.
Indeed, if we set K = XXt , i.e. no transformation is applied on the features, then
by using the SVD of X it can be shown that f̂ above is exactly the same as the ridge
regression solution ŷridge in (4.13).
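Since (5.56) only requires a linear solve, the least squares kernel fit is easy to compute directly. The sketch below is a minimal illustration with a Gaussian kernel and a fixed λ, both of which are arbitrary choices for this example.

# Least squares kernel regression following (5.56); sigma and lambda are illustrative.
set.seed(1)
n <- 100
x <- sort(runif(n))
y <- sin(2 * pi * x) + rnorm(n, sd = 0.2)

K <- exp(-as.matrix(dist(x))^2 / (2 * 0.1^2))   # Gaussian kernel matrix
lambda <- 0.01
alpha_hat <- solve(K + lambda * diag(n), y)     # (K + lambda I)^{-1} y
f_hat <- K %*% alpha_hat                        # fitted values

plot(x, y, pch = 16, col = "grey")
lines(x, f_hat, lwd = 2)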
We now discuss regression support vector machines, which use a different loss function that has an effect similar to the loss function used in support vector machines for classification, namely the ϵ-insensitive loss

L(y, f(x)) = [ |y − f(x)| − ϵ ]₊.

Hence, the loss is zero for observations with response y_i whose distance to the hyperplane f(x) is smaller than ϵ. For observations whose distance is larger than ϵ the loss equals |y − f(x)| − ϵ.
With this loss function, support vector regression solves the problem

min_{β₀,β} { ∑_{i=1}^{n} L(y_i, f(x_i)) + (λ/2)∥β∥² },        (5.57)

where as usual the penalty allows for a bias-variance trade-off and this regularization makes it possible to select more stable models. By introducing slack variables this problem can be rewritten as

min_{β₀,β}  (1/2)∥β∥² + (1/λ) ∑_{i=1}^{n} (ξ_i + ξ_i⋆)

subject to  y_i − f(x_i) ≤ ϵ + ξ_i,
            f(x_i) − y_i ≤ ϵ + ξ_i⋆,
            ξ_i ≥ 0,  ξ_i⋆ ≥ 0,   i = 1, . . . , n,
which leads to the Lagrange primal function

L_P = (1/2)∥β∥² + (1/λ) ∑_{i=1}^{n} (ξ_i + ξ_i⋆) + ∑_{i=1}^{n} α_i (y_i − β₀ − x_iᵗβ − ϵ − ξ_i)
      + ∑_{i=1}^{n} α_i⋆ (β₀ + x_iᵗβ − y_i − ϵ − ξ_i⋆) − ∑_{i=1}^{n} µ_i ξ_i − ∑_{i=1}^{n} µ_i⋆ ξ_i⋆,        (5.58)

where α_i, α_i⋆ ≥ 0 and µ_i, µ_i⋆ ≥ 0 are the Lagrange multipliers. Setting the derivatives w.r.t. β, β₀, ξ_i and ξ_i⋆ equal to zero yields the equations

β = ∑_{i=1}^{n} (α_i − α_i⋆) x_i        (5.59)
0 = ∑_{i=1}^{n} (α_i − α_i⋆)        (5.60)
α_i = 1/λ − µ_i,   i = 1, . . . , n        (5.61)
α_i⋆ = 1/λ − µ_i⋆,   i = 1, . . . , n        (5.62)

Substituting the first order conditions (5.59)-(5.62) in (5.58) yields the dual objective function

L_D = ∑_{i=1}^{n} (α_i − α_i⋆) y_i − ϵ ∑_{i=1}^{n} (α_i + α_i⋆) − (1/2) ∑_{i=1}^{n} ∑_{j=1}^{n} (α_i − α_i⋆)(α_j − α_j⋆) x_iᵗx_j.        (5.63)

The solution moreover satisfies the complementary slackness conditions

α_i (y_i − β₀ − x_iᵗβ − ϵ − ξ_i) = 0,   i = 1, . . . , n
α_i⋆ (β₀ + x_iᵗβ − y_i − ϵ − ξ_i⋆) = 0,   i = 1, . . . , n
α_i α_i⋆ = 0,   i = 1, . . . , n
ξ_i ξ_i⋆ = 0,   i = 1, . . . , n
(α_i − 1/λ) ξ_i = 0,   i = 1, . . . , n
(α_i⋆ − 1/λ) ξ_i⋆ = 0,   i = 1, . . . , n
These constraints imply that only a subset of the values α_i − α_i⋆ are different from zero and the associated observations are called the support vectors. The dual function and the fitted values f(x) = xᵗβ = ∑_{i=1}^{n} (α_i − α_i⋆) xᵗx_i both are determined only through the products xᵗx_i. Thus, we can generalize support vector regression by using kernels.
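In practice kernelized support vector regression is available in standard software. As an illustration (with simulated data and arbitrary tuning values), the sketch below fits ϵ-insensitive regression SVMs with linear, polynomial and radial kernels using the e1071 package, in the spirit of Figure 5.7.

# Regression SVMs with three kernels; data and tuning parameters are illustrative.
library(e1071)
set.seed(1)
x <- sort(runif(150, 0, 10))
y <- 2 * sin(x) + 0.1 * x^2 + rnorm(150, sd = 0.5)
dat <- data.frame(x = x, y = y)

fit_lin  <- svm(y ~ x, data = dat, type = "eps-regression", kernel = "linear",
                epsilon = 0.1, cost = 1)
fit_poly <- svm(y ~ x, data = dat, type = "eps-regression", kernel = "polynomial",
                degree = 3, epsilon = 0.1, cost = 1)
fit_rbf  <- svm(y ~ x, data = dat, type = "eps-regression", kernel = "radial",
                epsilon = 0.1, cost = 1)

plot(x, y, pch = 16, col = "grey")
lines(x, predict(fit_lin, dat), lty = 1, lwd = 2)
lines(x, predict(fit_poly, dat), lty = 2, lwd = 2)
lines(x, predict(fit_rbf, dat), lty = 3, lwd = 2)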
Figure 5.7 Two-dimensional regression dataset (bone versus age) fitted by regression SVM with linear, polynomial (degree 3) and radial kernel.
Chapter 6

Functional Data Analysis
Figure 6.1 Left: yearly temperature data from 73 weather stations across Spain. Right:
angle formed by the knee of 39 children going through a 20-point gait cycle.
While functional data are defined on an infinite set, such as an interval, a sphere,
. . . , in practice the curves are only observed at a finite grid of points. The grid at
which we observe our data X1 , . . . , Xn can be large and with high resolution or small
and sparse. In the former case, we say that the functional data is densely observed
while in the latter case we say that the functional data is sparsely observed. Of course,
often there is no clear cut-off between sparsely and densely observed functional data.
The weather data in the left panel of Figure 6.1 may be considered to be densely ob-
served while the knee angles in the right panel can probably be classified as sparsely
observed, due to the small number of discretization points. We will be mostly con-
cerned with dense functional data in this chapter.
A major focus in FDA is the shape of the observed functions or of functions which
summarize the properties of the data in some specific way. For example, it may be
of interest to know how a certain treatment affects the mean protein level of patients.
One would then be interested in constructing a curve which describes how this level
changes over time in a group of patients. Another example concerns magnetometer
data. Space physics researchers are not interested in the value of the magnetic field
every five seconds, but in the shape of the curve over several days. The shapes of
such curves at many locations allow researchers to make inferences about physical
processes occurring in near Earth space where no instruments are placed.
Assume now that we are in the possession of X1 , . . . , Xn independent and identi-
cally distributed curves. Typically, the first step in working with such functional data
is to express them by means of a basis expansion
X_i(t) ≈ ∑_{m=1}^{K} c_{im} B_m(t),   (i = 1, . . . , n).        (6.1)
In (6.1), the Bm are some standard collection of basis functions, like splines, wavelets
or trigonometric functions. Expansion (6.1) reflects the intuition that our data are ob-
servations from smooth functions that share some shape properties, and so can be
approximated as linear combinations of the same basis functions. If the discretization
grid is very large, (6.1) serves the practical purpose of replacing the original data by
a smaller collection of the coefficients cim . Moreover, if the discretization points differ
between subjects, the expansion puts the curves into a common domain so that they
are more readily comparable. Figure 6.2 presents some commonly used basis func-
tions, generated by the following R-code.
require(fda)
spline.basis = create.bspline.basis(rangeval = c(0,1), nbasis = 5)
par(mar = c(6.6, 6.6, 1.1, 1.1))  # margins: bottom, left, top, right
par(mgp = c(4.6, 1.6, 0))         # margin lines for axis title, labels and line
plot(spline.basis, lty = 1, lwd = 3, cex.lab = 2.5,
     cex.axis = 2.5, col = "black")
fourier.basis = create.fourier.basis(rangeval = c(0,1), nbasis = 5)
plot(fourier.basis, lty = 1, lwd = 3, col = "black",
     cex.lab = 2.5, cex.axis = 2.5)
Figure 6.2 Left: plot of five B-spline basis functions. Right: plot of the first five Fourier
basis functions.
Example (B-spline expansion of the Wiener process). The Wiener process, also known as the Brownian motion, is an appropriate limit of the random walk

S_i = ∑_{k=1}^{i} N_k,   N_k ∼ N(0, 1),   1 ≤ k ≤ i.
Figure 6.3 shows a realization of the random walk which can be viewed as a function
defined on [0, 1] by X (ti ) = Si along with its lower-dimensional representation ob-
tained by a B-spline basis expansion with 25 basis functions. The coefficients in the
B-spline expansion are estimated by the least-squares criterion.
Figure 6.3 Random walk and its expansion using 25 B-spline basis functions.
require(fda)
set.seed(2)
N <- rnorm(500)
Wiener <- cumsum(N)
plot(seq(0,1, len = 500), Wiener, xlab = "",
ylab = "", lwd = 3, type = "l", cex.lab = 2.5, cex.axis = 2.5)
B25.basis <- create.bspline.basis(rangeval = c(0, 1), nbasis = 25)
Wiener.fd <- smooth.basis(y = Wiener, fdParobj = B25.basis,
argvals = seq(0,1, len = 500))
lines(Wiener.fd, lwd = 3, col = "blue")
Notice that the B-spline representation captures the rough shape of the process but not
its finer details, for which a larger collection of B-splines is needed.
In classical statistics, observations are either scalars or vectors and the model
that is most commonly used stipulates that the observations are realizations of a ran-
dom variable or random vector which is a measurable mapping from some underlying
probability space (Ω, A, P) to the space of real numbers that is equipped with its usual
Borel σ-algebra (see, e.g., the appendix).
At the most basic level, functional data may be viewed as independent and identical copies of a stochastic process {X(t), t ∈ E} with E a compact interval of ℝ, which without loss of generality we shall take to be the [0, 1]-interval. The index t may refer
to time, for example. The process X may be entirely or partially observed. For every
ω ∈ Ω the function
t 7→ X (t, ω )
is called a path (or sample path) of X (t). See figures 6.1 and 6.3 for examples.
The measure-theoretic foundation of a stochastic process is that X(t) is a random variable for each fixed t, i.e., X(t, ω) is a measurable mapping from (Ω, A, P) to ℝ with its Borel sigma algebra. We say that two stochastic processes X₁ and X₂ are independent if for each d and (t₁, . . . , t_d) ∈ ℝ^d the vectors (X₁(t₁), . . . , X₁(t_d)) and (X₂(t₁), . . . , X₂(t_d)) are independent. Likewise, X₁ and X₂ are identically distributed if these vectors have the same distribution for every d and (t₁, . . . , t_d); in other words, if all joint distributions are the same. By induction, these properties may be extended to n stochastic processes X₁, . . . , X_n.
Of particular importance are the mean and covariance functions of the process {X(t), t ∈ [0, 1]} given by

µ(t) = E{X(t)}

and

K(s, t) = E{(X(s) − µ(s))(X(t) − µ(t))},

provided that the expectations exist. Processes with well-defined mean and covariance functions are referred to as second-order processes. The covariance function K(s, t) is positive semi-definite as for any choice of (α₁, . . . , α_d) ∈ ℝ^d

∑_{i=1}^{d} ∑_{j=1}^{d} α_i α_j K(t_i, t_j) = Var{ ∑_{j=1}^{d} α_j X(t_j) } ≥ 0.
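Once a process is observed on a grid, µ(t) and K(s, t) can be approximated pointwise by Monte Carlo averages over simulated paths. The sketch below does this for a scaled random walk; the grid size, number of replications and scaling are illustrative choices.

# Pointwise Monte Carlo approximation of mu(t) and K(s, t) for a discretized process.
set.seed(1)
m <- 200                                   # grid size
n_rep <- 1000                              # number of simulated paths
grid <- seq(0, 1, len = m)

# each row is one path of a scaled random walk
paths <- t(apply(matrix(rnorm(n_rep * m, sd = 1 / sqrt(m)), n_rep, m), 1, cumsum))

mu_hat <- colMeans(paths)                  # estimate of mu(t), close to 0
K_hat <- cov(paths)                        # m x m estimate of K(s, t) on the grid

max(abs(mu_hat))
K_hat[m / 2, m / 2]                        # approximately K(0.5, 0.5)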
We focus herein on second-order processes such that

E{|X(t_n) − X(t)|²} → 0   as n → ∞,        (6.4)

for any t ∈ [0, 1] and any sequence {t_n} ∈ [0, 1] converging to t. Such processes are
said to be mean-square continuous. We have the following useful characterization of
mean-square continuous processes.
Theorem 6.1. Let X be a second-order process. Then, X is mean-square continuous if, and
only if, its mean and covariance functions are continuous.
Proof: As |µ(s) − µ(t)| = |E{X(s) − X(t)}| ≤ (E{|X(s) − X(t)|²})^{1/2}, which tends to zero as s → t by virtue of (6.4), the mean function is continuous. Without loss of generality, assume that µ(t) = 0 in the following. Write

|K(s, t) − K(s′, t′)| ≤ |K(s, t) − K(s′, t)| + |K(s′, t) − K(s′, t′)|.

By the Cauchy-Schwarz inequality,

|K(s, t) − K(s′, t)| ≤ K^{1/2}(t, t) { E{|X(s) − X(s′)|²} }^{1/2}

and, by symmetry,

|K(s′, t) − K(s′, t′)| ≤ K^{1/2}(s′, s′) { E{|X(t) − X(t′)|²} }^{1/2},

and both bounds tend to zero as (s′, t′) → (s, t) by mean-square continuity, so that K is continuous.

The continuous covariance function K gives rise to the covariance operator K : L2([0, 1]) → L2([0, 1]) defined by

K(f)(s) = ∫_{[0,1]} K(s, t) f(t) dt.
To see that this operator indeed maps L2([0, 1]) into L2([0, 1]), notice that for every f ∈ L2([0, 1]) we have

∫_{[0,1]} | ∫_{[0,1]} K(s, t) f(t) dt |² ds ≤ ∫_{[0,1]} ∫_{[0,1]} |K(s, t)|² dt ds  ∫_{[0,1]} |f(t)|² dt < ∞,

as, by the continuity of K and the compactness of [0, 1]², the double integral above is finite (recall that continuous functions on compact domains are uniformly continuous).
6.2.2 The Karhunen-Loève theorem
As noted previously, K is symmetric and positive semi-definite. Hence, by Mercer’s
theorem (Theorem A.8 in the Appendix),
K(s, t) = ∑_{j=1}^{∞} λ_j e_j(s) e_j(t),        (6.6)

where the sum converges absolutely and uniformly and (λ_j, e_j) are the eigenvalue and eigenfunction pairs of K.
Our goal now is to develop an expansion for our stochastic process X in terms
of the eigenvalue and eigenfunction pairs of K. The first step in this direction is to
define the abstract L2 stochastic integral of a mean-square continuous process X. For
convenience we shall henceforth assume that the mean function µ(t) is equal to zero.
For each n consider any measurable partition E₁, . . . , E_{m_n} of [0, 1] such that each E_i ∈ B([0, 1]) and its diameter is less than 1/n. Let t_i be an arbitrary point in E_i for each i = 1, . . . , m_n. For any function f ∈ L2([0, 1]) define

I_X(f, {E_i, t_i : 1 ≤ i ≤ m_n}) = ∑_{i=1}^{m_n} X(t_i) ∫_{E_i} f(u) du.

To see that these sums converge in mean square, consider a second such partition {E′_j, t′_j : 1 ≤ j ≤ m′_n}. Then

E{ | I_X(f, {E_i, t_i}) − I_X(f, {E′_j, t′_j}) |² }
   = ∑_{i₁=1}^{m_n} ∑_{i₂=1}^{m_n} K(t_{i₁}, t_{i₂}) ∫_{E_{i₁}} f(u) du ∫_{E_{i₂}} f(v) dv
   + ∑_{j₁=1}^{m′_n} ∑_{j₂=1}^{m′_n} K(t′_{j₁}, t′_{j₂}) ∫_{E′_{j₁}} f(u) du ∫_{E′_{j₂}} f(v) dv
   − 2 ∑_{i=1}^{m_n} ∑_{j=1}^{m′_n} K(t_i, t′_j) ∫_{E_i} f(u) du ∫_{E′_j} f(v) dv.

As K is uniformly continuous, each double sum on the right-hand side can be made arbitrarily close to

∫_{[0,1]} ∫_{[0,1]} K(u, v) f(u) f(v) du dv,

so that the expectation above can be made arbitrarily small. The partition sums therefore form a Cauchy sequence in L2(Ω, A, P); their limit, which does not depend on the chosen partitions, is denoted by I_X(f).
Theorem 6.2. Let {X(t), t ∈ [0, 1]} be a mean-square continuous stochastic process with mean zero. Then, for any f, g ∈ L2([0, 1]),

(i) E{I_X(f)} = 0.

(ii) E{I_X(f) X(t)} = ∫_{[0,1]} K(u, t) f(u) du for any t ∈ [0, 1].

(iii) E{I_X(f) I_X(g)} = ∫_{[0,1]} ∫_{[0,1]} K(u, v) f(u) g(v) du dv.
Proof. We only prove assertion (i) as assertions (ii) and (iii) can be proven in a similar
manner. Observe that
by definition of the eigenvalue and eigenfunction pairs (λi , ϵi ). We are now ready to
state and prove the Karhunen-Loève theorem.
Theorem 6.3 (Karhunen-Loève). Let {X(t), t ∈ [0, 1]} be a mean-square continuous process with mean zero. Then,

sup_{t∈[0,1]} E{|X(t) − X_N(t)|²} → 0   as N → ∞,

where X_N(t) = ∑_{j=1}^{N} I_X(e_j) e_j(t).

Proof. By Theorem 6.2,

E{|X_N(t) − X(t)|²} = E{|X_N(t)|²} − 2E{X_N(t)X(t)} + E{|X(t)|²}
                    = ∑_{j=1}^{N} λ_j |e_j(t)|² − 2 ∑_{j=1}^{N} λ_j |e_j(t)|² + K(t, t)
                    = K(t, t) − ∑_{j=1}^{N} λ_j |e_j(t)|²,

which tends to zero uniformly in t ∈ [0, 1] by Mercer's theorem. □
6.2.3 Mean-square continuous processes in L2 ([0, 1])
We call the process X jointly measurable if the mapping (t, ω) ↦ X(t, ω) is measurable from ([0, 1] × Ω, σ(B([0, 1]) × A)) to ℝ, where B([0, 1]) denotes the Borel sets of [0, 1] and σ(B([0, 1]) × A) denotes the smallest σ-algebra that includes the Cartesian product B([0, 1]) × A (which is, in general, not a sigma-algebra).
If X is jointly measurable and X(·, ω) ∈ L2([0, 1]), then for every f ∈ L2([0, 1]), ⟨X, f⟩ is a measurable mapping from (Ω, A, P) to ℝ equipped with its Borel sigma-algebra. In other words, ⟨X, f⟩ is a random variable. A proof of this may be found in the standard treatment of Fubini's theorem. This important fact allows us to simplify the scores I_X(e_j) appearing in the Karhunen-Loève expansion.
Theorem 6.4. Let { X (t), t ∈ [0, 1]} be a mean-square continuous process with sample paths
in L2 ([0, 1]) that is jointly measurable. Then, for any f ∈ L2 ([0, 1]),
I_X(f) = ∫_{[0,1]} X(t, ω) f(t) dt,
almost surely.
Proof. Let { Ei , ti , 1 ≤ i ≤ mn } denote the partition of [0, 1] with maximal diameter less
than 1/n discussed previously. Under our assumptions ⟨ X, f ⟩ is a random variable.
Therefore, by Fubini's theorem,

E{ | I_X(f, {E_i, t_i, 1 ≤ i ≤ m_n}) − ∫_{[0,1]} X(t) f(t) dt |² }
   = ∑_{i₁=1}^{m_n} ∑_{i₂=1}^{m_n} K(t_{i₁}, t_{i₂}) ∫_{E_{i₁}} f(u) du ∫_{E_{i₂}} f(v) dv
   + ∫_{[0,1]} ∫_{[0,1]} K(u, v) f(u) f(v) du dv
   − 2 ∑_{i=1}^{m_n} ∫_{E_i} ∫_{[0,1]} K(t_i, v) f(u) f(v) dv du.

As in the construction of I_X(f), the uniform continuity of K implies that each term on the right-hand side can be made arbitrarily close to ∫_{[0,1]} ∫_{[0,1]} K(u, v) f(u) f(v) du dv, so that the above expectation tends to zero as n → ∞. Combined with the mean-square convergence of I_X(f, {E_i, t_i}) to I_X(f) this yields

E{| I_X(f) − ⟨X, f⟩ |²} = 0,

which proves the claim. □
In other words, for a jointly measurable, mean-square continuous process X with mean zero and sample paths in L2([0, 1]), the Karhunen-Loève expansion can be written as

X(t) = ∑_{j=1}^{∞} ⟨X, e_j⟩ e_j(t),

with the series converging uniformly in t ∈ [0, 1] (in mean square) and the scores ⟨X, e_j⟩ orthogonal random variables with variances equal to λ_j for all j. Since, by Mercer's theorem, λ_j → 0 as j → ∞, the path of the process can often be described quite well with the partial sum X_N for small N.
The Karhunen-Loève theorem says even more. In particular, the orthonormality of the e_j gives

E{∥X_N∥²} = ∑_{j=1}^{N} E{|⟨X, e_j⟩|²} = ∑_{j=1}^{N} λ_j.

Letting N → ∞ we obtain

E{∥X∥²} = ∑_{j=1}^{∞} λ_j.        (6.9)

This equality implies that the variance of X ∈ L2([0, 1]) equals the sum of the variances of the scores ⟨X, e_j⟩ and forms the basis for functional principal component analysis.
To illustrate these results, let us return to the Brownian motion example and obtain
expressions for the eigenvalue and eigenfunction pairs.
Example (Wiener process, continued). Let W (t) denote the stochastic process that
is the limit of the random walk. It can be shown that E{W (t)} = 0 and K (s, t) =
min(s, t). To determine the eigenvalues and eigenfunctions of the covariance operator
we need to find numbers λ j and functions e j such that
∫_0^1 min(s, t) e_j(s) ds = λ_j e_j(t),   t ∈ [0, 1].

Differentiating both sides with respect to t with Leibniz's formula and simplifying we get

∫_t^1 e_j(s) ds = λ_j e_j′(t).

Differentiating once more reveals that e_j(t) = −λ_j e_j″(t). A general solution for this differential equation is

e_j(t) = a sin(t/λ_j^{1/2}) + b cos(t/λ_j^{1/2}).

The above equations imply that e_j(0) = 0 and e_j′(1) = 0, so that b = 0 and a cos(1/λ_j^{1/2}) = 0, which leads to λ_j^{−1/2} = jπ − π/2, whence

λ_j = 4 / ((2j − 1)²π²)

and

e_j(s) = 2^{1/2} sin( (2j − 1)πs/2 ).
The eigenvalues are indeed decreasing and summable, while the eigenfunctions are
continuous (actually infinitely differentiable).
How well do the partial sums of the Karhunen-Loève expansion approximate the
trajectory of W (t)? Figure 6.4 shows that the rough shape of the trajectories may be
captured with only a handful of eigenfunctions and associated scores.
Figure 6.4 Random walk and its Karhunen-Loève approximation with N = 5 and N = 10 in blue and red, respectively.

N1 <- 5
E.m <- matrix(0, nrow = 500, ncol = N1)
for(j in 1:N1){
  E.m[,j] = sapply(seq(0,1, len = 500), FUN =
    function(x) 2^{1/2}*sin((2*j-1)*pi*x/2))
}
scores <- t(E.m)%*%Wiener/500
# eigenfunctions and scores for the N = 10 approximation
N2 <- 10
E.m2 <- matrix(0, nrow = 500, ncol = N2)
for(j in 1:N2){
  E.m2[,j] = sapply(seq(0,1, len = 500), FUN =
    function(x) 2^{1/2}*sin((2*j-1)*pi*x/2))
}
scores2 <- t(E.m2)%*%Wiener/500
# Karhunen-Loeve partial sums and plot of the random walk with its approximations
X.N1 <- E.m%*%scores
X.N2 <- E.m2%*%scores2
plot(seq(0,1, len = 500), Wiener, xlab = "", ylab = "",
     lwd = 3, type = "l", cex.lab = 2.5, cex.axis = 2.5)
lines(seq(0,1, len = 500), X.N1, lwd = 3, col = "blue")
lines(seq(0,1, len = 500), X.N2, lwd = 3, col = "red")
legend("topleft", col = c("blue", "red"), lty = c(1, 1), lwd = c(3, 3),
       legend = c("N = 5", "N = 10"), cex = 2)
as E{∥ X1 − µ∥22 } ≤ 2E{∥ X1 ∥22 } + 2∥µ∥22 < ∞. □
We now turn to the properties of the sample covariance estimator K̂(s, t). We assume without loss of generality that µ(t) = 0 in the Theorem below. To see this, notice that

K̂(s, t) = (1/n) ∑_{i=1}^{n} (X_i(s) − µ(s))(X_i(t) − µ(t)) − (µ̂(s) − µ(s))(µ̂(t) − µ(t)),

so that, by Theorem 6.5,

∫_{[0,1]} ∫_{[0,1]} |K̂(s, t) − K(s, t)|² ds dt
   ≤ 2 ∫_{[0,1]} ∫_{[0,1]} | (1/n) ∑_{i=1}^{n} (X_i(s) − µ(s))(X_i(t) − µ(t)) − E{(X(s) − µ(s))(X(t) − µ(t))} |² ds dt + O_P(1/n²).

Thus, up to terms of order n⁻² we can always assume that the process X is centered.
Theorem 6.6. Let X₁, . . . , X_n denote iid copies of a mean-square continuous process {X(t), t ∈ [0, 1]} that is jointly measurable with mean zero and E{∥X∥₂⁴} < ∞. Then

∫_{[0,1]} ∫_{[0,1]} E{|K̂(s, t) − K(s, t)|²} ds dt = O(1/n).
Proof. We have

∫_{[0,1]} ∫_{[0,1]} |K̂(s, t) − K(s, t)|² ds dt = (1/n²) ∫_{[0,1]} ∫_{[0,1]} | ∑_{i=1}^{n} [X_i(s)X_i(t) − E{X(s)X(t)}] |² ds dt,

and, by independence and identical distributions,

E| ∑_{i=1}^{n} [X_i(s)X_i(t) − E{X(s)X(t)}] |² = ∑_{i=1}^{n} E| X₁(s)X₁(t) − E{X(s)X(t)} |² ≤ n E{|X₁(s)X₁(t)|²},

where we have used E{|X − E{X}|²} ≤ E{|X|²} for all random variables X. Thus, interchanging expectation and integration (this is permissible for positive integrands) we have

∫_{[0,1]} ∫_{[0,1]} E{|K̂(s, t) − K(s, t)|²} ds dt ≤ (1/n) ∫_{[0,1]} ∫_{[0,1]} E{|X₁(s)X₁(t)|²} ds dt
   = (1/n) E{ ∫_{[0,1]} ∫_{[0,1]} |X₁(s)X₁(t)|² ds dt }
   = (1/n) E{∥X₁∥₂⁴}
   = O(1/n),

as E{∥X∥₂⁴} < ∞. □
One may also consider other modes of convergence of µ̂ and K̂ to their population counterparts. For example, for every t ∈ [0, 1] the strong law of large numbers (SLLN) yields the existence of a set A ∈ A such that P(A) = 0 and

(1/n) ∑_{i=1}^{n} X_i(t, ω) → µ(t)   as n → ∞,

for every ω ∉ A. The exceptional set A may depend on t, hence the SLLN only implies pointwise convergence. A stronger mode of convergence is the uniform distance given by

sup_{t∈[0,1]} |µ̂(t) − µ(t)|.
Convergence in this distance clearly implies both L2 and pointwise convergence. Gen-
erally, however, one needs stronger conditions on X and µ in order to ensure that the
uniform distance tends to zero either in probability or almost surely. Such conditions
may involve smooth trajectories, continuously differentiable mean function µ(t) etc.
Figure 6.5 provides an example of an estimated covariance function. Notice the
interesting pattern at the beginning and end of the year.
Figure 6.5 Estimated covariance surface for the Spanish weather data seen in Figure 6.1.
6.4 Functional principal component analysis
We have seen previously that the Karhunen-Loève expansion decomposes a mean-
square continuous and jointly measurable L2 ([0, 1])-process X with mean zero into a
series involving the orthonormal eigenfunctions of the covariance operator K. The
uniform convergence of this series implies that X can be well-approximated by
X_N(t) = ∑_{j=1}^{N} ⟨X, e_j⟩ e_j(t),
for some N depending on the desired quality of the approximation. However, the
eigenfunctions of K are only one of the possible orthonormal systems one can use
to express X. For example, one could instead use a Fourier or orthogonalized spline
system. Thus, one may ask if there exists an ”optimal” basis system with which we
can express X.
The answer to this question (if it even exists) depends on the optimality criterion
we choose to employ. Herein we call an N-dimensional orthonormal system u1 , . . . , u N
optimal if for any other N-dimensional orthonormal system w1 , . . . , w N it holds that
E{ ∥ X − ∑_{j=1}^{N} ⟨X, u_j⟩ u_j ∥₂² } ≤ E{ ∥ X − ∑_{j=1}^{N} ⟨X, w_j⟩ w_j ∥₂² }.        (6.10)
The next question would then be how to find an orthonormal system u1 , . . . , u N satis-
fying (6.10).
Although a seemingly difficult task, the assumed orthonormality greatly simplifies things. Indeed, note that for any orthonormal system w₁, . . . , w_N we have

E{ ∥ X − ∑_{j=1}^{N} ⟨X, w_j⟩ w_j ∥₂² } = E{∥X∥₂²} − 2 ∑_{j=1}^{N} E{|⟨X, w_j⟩|²} + ∑_{j=1}^{N} E{|⟨X, w_j⟩|²}
                                        = E{∥X∥₂²} − ∑_{j=1}^{N} E{|⟨X, w_j⟩|²}.        (6.11)

The first term in (6.11) does not depend on the w_j and the second term is non-negative. Thus, to identify the minimum we need to select the basis functions w_j, j = 1, . . . , N in such a way that ∑_{j=1}^{N} E{|⟨X, w_j⟩|²} is maximized. Now,

E{|⟨X, w_j⟩|²} = ∫_{[0,1]} ∫_{[0,1]} K(s, t) w_j(s) w_j(t) ds dt = ⟨K(w_j), w_j⟩,
so that the optimization problem involves the covariance operator of X. Theorem 6.7
provides the solution.
Theorem 6.7. Suppose the covariance operator K of a mean-square continuous, jointly measurable process with mean zero has eigenvalues satisfying λ₁ > λ₂ > . . . > λ_N > . . . > 0 with corresponding orthonormal eigenfunctions e_j. Then, for j ≤ N,

λ_j = sup{ ⟨K(f), f⟩ : ∥f∥₂ = 1, ⟨f, e_k⟩ = 0, 1 ≤ k ≤ j − 1 },

and the supremum is attained at f = e_j. Consequently, the optimal N-dimensional orthonormal system in the sense of (6.10) is given by the leading eigenfunctions e₁, . . . , e_N, and the corresponding approximation error equals

E{ ∥ X − ∑_{j=1}^{N} ⟨X, e_j⟩ e_j ∥₂² } = ∑_{j=N+1}^{∞} λ_j,

so that the error depends on the rate of decay of the eigenvalues λ_j. In the extreme case that λ_{N+k} = 0 for all k ∈ ℕ the representation incurs no approximation error whatsoever. In the general case, however, the error depends on the regularity of the
paths of X.
The analysis above is at the population level and in practice one has no knowledge
of either the covariance operator or its eigenvalues and corresponding eigenfunctions,
which have to be estimated from the sample. A natural estimator for the covariance
operator is
K̂(f)(t) = ∫_{[0,1]} K̂(s, t) f(s) ds.

Sample functional principal component analysis may be carried out through the use of K̂, which can also be shown to be bounded, positive-semidefinite and self-adjoint (show it!). Therefore, estimates for λ_j and e_j may be obtained by solving

K̂(ê_j)(t) = λ̂_j ê_j(t).        (6.13)
Practically, the solutions to (6.13) are obtained by discretizing the random functions
X1 , . . . , Xn and proceeding as in the multivariate case. A complication is that from a
sample of size n one can identify at most n unique eigenvalues and eigenfunctions.
The reason is that the range of K b is only the span of X1 , . . . , Xn which is at most an
n-dimensional subspace of L2 ([0, 1]).
Estimated eigenfunctions can be shown to minimize the sample equivalent of the previously discussed problem. In particular, we have

(1/n) ∑_{i=1}^{n} ∥ X_i − ∑_{j=1}^{N} ⟨X_i, ê_j⟩ ê_j ∥₂² ≤ (1/n) ∑_{i=1}^{n} ∥ X_i − ∑_{j=1}^{N} ⟨X_i, w_j⟩ w_j ∥₂²

for every orthonormal system w₁, . . . , w_N in L2([0, 1]).
Note, however, that eigenfunctions are only identified up to a sign change: ê_j estimates ĉ_j e_j with ĉ_j = sign(⟨ê_j, e_j⟩). These signs cannot be computed from the data, so in practice it must be ensured that the statistics we want to work with do not depend on ĉ_j. We have the following result which is given without proof.
Theorem 6.8. Let X₁, . . . , X_n denote iid copies of a mean-square continuous process {X(t), t ∈ [0, 1]} that is jointly measurable with E{∥X∥₂⁴} < ∞. If
Figure 6.6 plots the first three estimated eigenfunctions of the Spanish weather and gait datasets seen in Figure 6.1. The corresponding eigenvalues explain 99% and 82% of the total variance, respectively.
Figure 6.6 First three sample eigenfunctions of the Spanish weather and gait datasets
on the left and right, respectively.
require(fda.usc)
data("aemet")
temp <- aemet$temp$data
ctemp <- cov(temp)
etemp <- eigen(ctemp)
dim(etemp$vectors)
par(mar = c(6.6, 6.6, 2, 1.1)) # bottom, left, top, right
par(mgp = c(4.6, 1.6, 0)) # bottom, left, top, right
matplot(etemp$vectors[, 1:3], type = "l", lwd = 3, lty = 1, cex.lab = 2.5,
cex.axis = 2.5, cex = 2.5,
col = c("blue", "red", "darkgreen"), xlab = "Day", ylab = "")
In the scalar-on-function linear model, the responses y₁, . . . , y_n are related to the functional predictors X₁, . . . , X_n through

y_i = α₀ + ⟨X_i, β₀⟩ + ϵ_i,   (i = 1, . . . , n),        (6.15)

for some unknown α₀ ∈ ℝ and β₀ ∈ L2([0, 1]). The function β₀ is referred to as the coefficient (or weight) function while the errors ϵ_i are assumed to be iid with mean zero
and finite variance. The functional regression model (FLM) in (6.15) has a vast number
of applications. For example, it is often of interest to relate near-infrared reflectance
(NIR) spectra with octane measurements of gasoline samples. Figure 6.7 presents a
sample of 60 such data.
How can we estimate the unknown parameters (α0 , β 0 ) of the functional linear
model? There are roughly speaking two approaches: the basis expansion approach and
the penalization approach. Starting with the basis expansion approach, the main idea
is to reduce the FLM to a standard regression problem by expressing β 0 in a small
number of suitable basis functions. That is, we approximate β by
β̃_N(t) = ∑_{j=1}^{N} β_j B_j(t),        (6.16)

for some small N. In principle, B₁, . . . , B_N can be any system of basis functions such as Fourier and B-splines, but also the sample eigenfunctions of the covariance operator discussed in Section 6.4. Either way, the approximation in (6.16) works by substituting the infinite-dimensional β₀ with an N-dimensional approximation. Having done so, the coefficients β₁, . . . , β_N can be estimated by minimizing

L(α, β) = ∑_{i=1}^{n} |y_i − α − ⟨X_i, β̃_N⟩|²        (6.17)

over ℝ × ℝ^N. For N ≪ n, the Gram matrix is positive definite, hence there exists a unique solution to (6.17).
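A minimal sketch of the basis expansion approach on simulated data: the inner products ⟨X_i, B_j⟩ are approximated by Riemann sums on a common grid, after which (6.17) becomes an ordinary least squares problem. The simulated curves, the B-spline basis and N = 10 are illustrative assumptions.

# Basis expansion estimation of the functional linear model (illustrative simulation).
library(fda)
set.seed(1)
n <- 100; m <- 101
grid <- seq(0, 1, len = m)
X <- t(replicate(n, cumsum(rnorm(m, sd = 1 / sqrt(m)))))  # n x m matrix of curves
beta0 <- sin(2 * pi * grid)                               # true coefficient function
y <- 1 + drop(X %*% beta0) / m + rnorm(n, sd = 0.1)       # y_i = alpha0 + <X_i, beta0> + eps_i

N <- 10
bbasis <- create.bspline.basis(c(0, 1), nbasis = N)
B <- eval.basis(grid, bbasis)                             # m x N matrix of basis values
Z <- X %*% B / m                                          # Z[i, j] approximates <X_i, B_j>
fit <- lm(y ~ Z)
beta_hat <- B %*% coef(fit)[-1]                           # estimated coefficient function

plot(grid, beta0, type = "l", lwd = 2)
lines(grid, beta_hat, lwd = 2, lty = 2)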
The basis expansion approach is simple and intuitively appealing, yet the resulting
estimates are greatly dependent upon (i) the choice of the basis system and (ii) the
number of basis functions employed. Varying either of these is likely to cause great changes in the shape of the estimated coefficient function β̂(t). At both the theoretical and practical level, one also has to contend with the large biases that may be incurred by approximation (6.16). In particular, while with most basis systems that are dense in L2([0, 1]) we can write

β₀(t) = ∑_{j=1}^{∞} β_j B_j(t) = ∑_{j=1}^{N} β_j B_j(t) + δ(t),
the basis function approach essentially ignores the remainder term δ(t). The impor-
tance of this term generally depends on the (unknown) shape of β 0 with more compli-
cated coefficient functions requiring a large number of basis functions for an accurate
description. Thus, the basis function approach can only be recommended in cases
where one is only interested in the rough shape of β 0 and not its finer characteristics.
A more potent alternative that reduces the dependence on the basis system and
the number of basis functions involves using a large number of basis functions and
controlling the increase in the variance of the estimates with a penalty (hence the name
penalization). Assume now that N is so large, possibly even larger than n, so that δ(t)
is negligible. The penalization approach then solves

L_λ(α, β) = ∑_{i=1}^{n} |y_i − α − ⟨X_i, β̃_N⟩|² + λ I(β),        (6.18)

where I(β) = ∫_{[0,1]} |β″(t)|² dt, so that I(β̃_N) = βᵗPβ with P_ij = ∫_{[0,1]} B_i″(t) B_j″(t) dt. This penalty implicitly assumes that the basis functions B₁, . . . , B_N are twice differentiable with a second derivative in L2([0, 1]). Notice that this holds for the Fourier basis and for B-spline functions with p ≥ 3. The null space of I(β̃_N) consists precisely of affine functions of the form

β_N(t) = β₀ + β₁ t,
Figure 6.8 Penalized spline estimates for the gasoline data with 36 interior knots and
p = 4. Top left: λ = 0. Top right: λ = 0.5. Bottom left: λ = 20. Bottom right: λ
selected with generalized cross-validation.
for some (β₀, β₁) ∈ ℝ². Thus, for λ → 0 the estimated coefficient function becomes wiggly and complex while for λ → ∞ it resembles more and more a straight line.
Figure 6.8 illustrates the role of λ on the gasoline data.
require(refund)
data(gasoline)
y = gasoline$octane
x = gasoline$NIR
par(mar = c(6.6, 6.6, 1.1, 1.1))
par(mgp = c(4.6, 1.6, 0))
fit1 <- pfr(y~lf(x, k = 40, bs = "ps"), sp = 0)
fit2 <- pfr(y~lf(x, k = 40, bs = "ps"), sp = 0.5)
fit3 <- pfr(y~lf(x, k = 40, bs = "ps"), sp = 20)
fit4 <- pfr(y~lf(x, k = 40, bs = "ps"), method = "GCV.Cp")
E{|⟨X, β̂ − β₀⟩|²} = O_P( N log(n)/n ) + O_P( 1/N^{2(r+v)} ) + O_P(λ).

The quantity E{|⟨X, β̂ − β₀⟩|²} is called the prediction error of β̂ with respect to β₀ and
is very often the pseudo-metric employed in the literature. If we could choose the
rate of growth of N and the rate of decay of λ, then what would be the best rate of
convergence to zero for the prediction error?
Further reading
Functional data analysis is one of the most modern branches of statistics and there
is a blossoming literature covering many techniques from both practical and theoreti-
cal standpoints. The early expository book by Ramsay and Silverman (2006) has been
very influential and was a catalyst for further developments in FDA. A more recent
practical treatment with many worked out examples is Kokoszka and Reimherr (2017).
Theoretical aspects of FDA are covered in Horváth and Kokoszka (2012) and Hsing
and Eubank (2015) with the latter placing emphasis on reproducing kernel Hilbert
space methods.
Recent developments in FDA include robust functional data analysis and the study
of discretely sampled functional data or even the robust analysis of discretely sampled
functional data. Examples of such contributions are Sinova et al. (2018) and Kalogridis
and Van Aelst (2021).
Appendix A
Background
A measure µ on a space Ω is, at its simplest, a non-negative set function that is additive:

µ(A ∪ B) = µ(A) + µ(B),   A, B disjoint.

By induction, this property may be readily extended to any finite union of disjoint sets. Usually, we require something more, namely that the measure µ is sigma-additive, which means that

µ( ∪_{i=1}^{∞} A_i ) = ∑_{i=1}^{∞} µ(A_i),   A_i ⊂ Ω, A_i ∩ A_j = ∅, i ≠ j.        (A.1)
Sigma-additivity allows for a much richer mathematical theory but brings about
technical difficulties whenever the domain of the measure is every subset A of Ω.
Therefore, it is necessary to restrict the definition to a sigma-algebra (or sigma-field) of
subsets of Ω, that is, to a class of subsets of Ω often denoted by A satisfying the fol-
lowing three properties
(i) Ω ∈ A.
(ii) For A ∈ A, Ac = Ω \ A ∈ A.
(iii) For A_i ∈ A, i = 1, 2, . . ., we have ∪_{i=1}^{∞} A_i ∈ A.
If A is a sigma-algebra of a space Ω then (Ω, A) is said to be a measurable space and
the sets A ∈ A are said to be measurable. With these definitions, we call µ a measure on
(Ω, A) if it’s a nonnegative function on A satisfying (A.1). The triple (Ω, A, µ) is then
often referred to as a measure space.
The following are three important examples of measure spaces.
Example: Counting measure. Let Ω be countable and let A be the class of all subsets
of Ω. For any A ∈ A, let µ( A) be the number of points of A if A is finite and µ( A) = ∞
otherwise. This measure is called the counting measure.
Example: Lebesgue measure on ℝ^p. Let Ω be the whole of ℝ^p and let A be the smallest sigma algebra containing all half open rectangles

R = { x ∈ ℝ^p : a_i < x_i ≤ b_i, i = 1, . . . , p }.        (A.2)

The members of A are called Borel sets. It can be shown that this is a very large class containing, among others, all open and all closed subsets of ℝ^p. A fundamental result of measure theory states that there exists a unique measure µ defined over A which assigns to (A.2) the measure

µ(R) = (b₁ − a₁) . . . (b_p − a_p),

called the Lebesgue measure.

The abstract integral is constructed in three steps. First, a non-negative simple function s is a function taking only countably many values a₁, a₂, . . . on measurable sets, that is,

A_i = { x : s(x) = a_i } ∈ A.
An important special case of a simple function is the indicator I_A of a set A ∈ A, defined by

I_A(x) = I(x ∈ A) = 1 if x ∈ A and 0 if x ∉ A.

Second, for a pointwise increasing sequence of non-negative simple functions s_n define

f(x) := lim_{n→∞} s_n(x).        (A.3)
The limit exists because for every x the sequence is increasing but f ( x ) may be in-
finite. A non-negative function f with domain Ω will be called A-measurable or, for
short, measurable if there exists a pointwise increasing sequence of non-negative sim-
ple functions such that (A.3) holds for every x ∈ Ω.
Third, for an arbitrary function f (not necessarily non-negative), define its positive and negative parts by

f⁺(x) = max{f(x), 0}   and   f⁻(x) = max{−f(x), 0},

so that f = f⁺ − f⁻; the function f is called measurable if both f⁺ and f⁻ are measurable. Equivalently, f is measurable if for every Borel set B,

f⁻¹(B) = { x : f(x) ∈ B } ∈ A.
It follows from the definition of Borel sets that it is enough to check that {x : f(x) < b} ∈ A for every b ∈ ℝ. This shows in particular that if f : Ω → ℝ is continuous then it is measurable. As another important class of functions, consider the functions taking on a countable number of values, say a₁, a₂, . . .. Then f is measurable if, and only if, the sets
Ai = { x : f ( x ) = ai } ∈ A.
and therefore {x : f(x) ∈ B} = ∪_{i: a_i ∈ B} A_i ∈ A. It can further be shown that the sum or product of measurable functions is a measurable function itself. What is more, if f_n is a sequence of measurable functions then so are

sup_n f_n,   inf_n f_n,   lim sup_{n→∞} f_n   and   lim inf_{n→∞} f_n,

all of these quantities being defined pointwise. Whenever it exists, lim_{n→∞} f_n is also a measurable function.
The integral of a measurable function with respect to a measure µ can then be
defined in three corresponding steps.
(i) For a non-negative simple function s taking on values a_i on the sets A_i, define

∫_Ω s dµ = ∑_{i=1}^{∞} a_i µ(A_i).        (A.4)

(ii) For a non-negative measurable function f, written as the pointwise limit of an increasing sequence of simple functions s_n as in (A.3), define

∫_Ω f dµ = lim_{n→∞} ∫_Ω s_n dµ.        (A.5)

It can be shown that this limit does not depend on the particular sequence {s_n}. Thus, the definition of the integral is independent of the sequence of simple functions converging to f.
The definitions (A.4) and (A.5) do not preclude the possibility that ∫_Ω s dµ or ∫_Ω f dµ are infinite. A non-negative measurable function is said to be integrable (with respect to µ) if ∫_Ω f dµ < ∞.

(iii) An arbitrary measurable function f is said to be integrable if its positive and negative parts are integrable, in which case its integral is defined as

∫_Ω f dµ = ∫_Ω f⁺ dµ − ∫_Ω f⁻ dµ.        (A.6)
Let us consider two special cases of integrals with respect to measures µ. First, for y ∈ ℝ consider the Dirac measure defined on the set of all subsets of ℝ given by

δ_y(A) = 1 if y ∈ A and 0 otherwise,

for any A ⊂ ℝ. Is this measure sigma-additive on the set of all subsets of ℝ? For every function f we have

∫_Ω f dδ_y = f(y).

To see this let A_i denote a partition of ℝ, that is, A_i ∩ A_j = ∅ for i ≠ j and ∪_i A_i = ℝ. Any simple function s can then be represented as

s(x) = ∑_{i=1}^{∞} a_i I_{A_i}(x),

so that ∫ s dδ_y = ∑_{i=1}^{∞} a_i δ_y(A_i) = s(y), since δ_y(A_i) = 1 only for the unique set A_i that contains y. Passing to limits of simple functions yields ∫ f dδ_y = f(y) for every measurable f, as claimed.
Consider now the more complex measure

µ(A) = ∑_{n=0}^{∞} a_n δ_n(A),

with a_n = e⁻¹/n!, again on all subsets of ℝ. Verify that µ(ℝ) = 1. What is ∫ f dµ? The same reasoning as in the previous example shows that

∫_Ω f dµ = ∑_{n=0}^{∞} a_n f(n).
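Because the integral is just a weighted sum, it is easy to verify numerically; the truncation at n = 50 in the sketch below approximates the infinite sum.

# Numerical check of the integral with respect to mu(A) = sum_n a_n delta_n, a_n = e^{-1}/n!.
n_max <- 50                          # truncation of the infinite sum
a <- exp(-1) / factorial(0:n_max)
sum(a)                               # approximately 1, so mu(R) = 1

f <- function(x) x^2                 # an example integrand
sum(a * f(0:n_max))                  # the integral of f with respect to mu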
The defined integral has the properties one would expect of it. In particular, for any real numbers c₁, . . . , c_m and any integrable functions f₁, . . . , f_m, ∑ c_i f_i is also integrable and

∫_Ω ( ∑_{i=1}^{m} c_i f_i ) dµ = ∑_{i=1}^{m} c_i ∫_Ω f_i dµ.        (A.7)

Also, if f is measurable and g is integrable and if 0 ≤ f ≤ g then f is also integrable and

∫_Ω f dµ ≤ ∫_Ω g dµ.        (A.8)
Theorem A.1. Let f : [a, b] → ℝ be a measurable and Riemann integrable function. Then, if µ denotes the Lebesgue measure on the Borel sigma algebra of [a, b],

∫_{[a,b]} |f| dµ < ∞   and   ∫_{[a,b]} f dµ = ∫_a^b f(x) dx.
For a non-negative measurable function f and a set A ∈ A, the integral ∫_A f dµ equals zero whenever

f = 0 on A   or   µ(A) = 0.        (A.9)

Conversely, if f is almost everywhere non-negative on A,

∫_A f dµ = 0 =⇒ f = 0 almost everywhere on A.
Note that, as a special case of (A.10), if f and g are integrable functions differing only on a set of measure zero, that is, f = g a.e., then

∫_Ω f dµ = ∫_Ω g dµ.

A measure ν on (Ω, A) is said to be absolutely continuous with respect to µ if

µ(A) = 0 =⇒ ν(A) = 0.        (A.12)
A.1.2 Applications to probability theory
For statistical developments, the most important application of measure theory is its specialization to probability theory. A measure P defined over a measurable space (Ω, A) satisfying

P(Ω) = 1

is called a probability measure. If P is absolutely continuous with respect to a measure µ, so that

P(A) = ∫_A p dµ,   A ∈ A,

for a non-negative measurable function p, then p is called the probability density of P with respect to µ. Such densities are, of course, determined up to sets of µ-measure zero. Many distributions are either
discrete, that is, absolutely continuous with respect to the counting measure, or ab-
solutely continuous with respect to the Lebesgue measure. However, there are many
distributions that are neither and instead have both discrete and absolutely continuous
components.
A random variable X is a measurable function from (Ω, A, P) to (ℝ, B(ℝ)), with B(ℝ) denoting the Borel sigma-algebra of ℝ. The expectation of a random variable X, E{X}, is defined as

E{X} = ∫_Ω X dP.        (A.14)
The function F_X(a) = P(X ≤ a), a ∈ ℝ, is called the cumulative distribution function (cdf) of X and defines a unique Lebesgue-Stieltjes measure on the Borel sets of ℝ.
Theorem A.2. Let X denote a random variable with cumulative distribution F_X. Then

∫_Ω X dP = ∫_ℝ x dF_X(x).
Theorem A.2 shows that we only need to know the distribution of the random vari-
able X in order to compute its expectation and we do not need to know the underlying
(abstract) probability space (Ω, A, P).
If not just one but m real-valued aspects of an experiment are of interest, these
are represented by a measurable vector-valued function ( X1 , . . . , Xm ) defined over
(Ω, A, P) with joint cdf
F_X(a₁, . . . , a_m) = P(X₁ ≤ a₁, . . . , X_m ≤ a_m).

Likewise, this cdf defines a unique Lebesgue-Stieltjes measure on the Borel sigma algebra of ℝ^m.

Assume now that X is a random variable and its cdf F_X is absolutely continuous with respect to a measure µ and has Radon-Nikodym derivative f_X. What is E{X} in this case? From Theorem A.2 we know that E{X} = ∫_ℝ x dF_X(x), but much more can now be said.

Theorem A.3. Let X denote a random variable with cumulative distribution F_X. Then

∫_ℝ x dF_X(x) = ∫_ℝ x f_X(x) dµ(x).

For example, when µ is the Lebesgue measure, using Theorem A.1 we obtain

E{X} = ∫_ℝ x f_X(x) dx,

which is the classical expectation formula for a random variable with density f_X. Now assume that F_X is absolutely continuous with respect to the counting measure. What is E{X}?
L^p(Ω) := { f : Ω → ℝ : f measurable and ∫_Ω |f|^p dµ < ∞ },   p ∈ [1, ∞).

Such functions possess many interesting properties and arise in various contexts. We define

∥f∥_p := ( ∫_Ω |f|^p dµ )^{1/p},   p ∈ [1, ∞).

Note that ∥·∥_p is not positive definite: that is, ∥f∥_p = 0 implies only that f = 0 almost everywhere. An easily verifiable property of ∥·∥_p is that for all α ∈ ℝ

∥αf∥_p = |α| ∥f∥_p.        (A.15)
Two important inequalities on L p (Ω) spaces were discovered by Hölder and Minkowski
and relate the norm of the product of functions to the product of the norms and the
norm of the sum of functions to the sum of the norms. Specifically:
Theorem A.4 (Hölder’s inequality). Assume that f ∈ L p (Ω) and g ∈ Lq (Ω), where
p, q ∈ [1, ∞) and satisfy 1/p + 1/q = 1. Then f g ∈ L1 (Ω) and
| ∫_Ω f g dµ | ≤ ∫_Ω |f g| dµ ≤ ∥f∥_p ∥g∥_q.
Theorem A.5 (Minkowski’s inequality). Assume that f , g ∈ L p (Ω), p ∈ [1, ∞). Then
f + g ∈ L p (Ω) and
∥ f + g∥ p ≤ ∥ f ∥ p + ∥ g∥ p .
Theorem A.4 with p = q = 2 is referred to as the Cauchy-Schwarz inequality, or
simply the Schwarz inequality. Theorem A.5 combined with (A.15) shows that L p (Ω)
is a vector space and ∥ · ∥ p is a semi-norm, that is, a real-valued function on L p (Ω)
satisfying the subadditivity and absolute homogeneity properties but the positive def-
initeness property only holds up to sets of µ-measure zero.
Theorem A.4 and A.5 are valid for general measure spaces. When specialized to
the probability space (Ω, A, P) we have the useful inequalities
E{ XY } ≤ [E{| X | p }]1/p [E{|Y |q }]1/q ,
for X ∈ L p (Ω), Y ∈ Lq (Ω) and p, q ∈ [1, ∞), 1/p + 1/q = 1 and
[E{| X + Y | p }]1/p ≤ [E{| X | p }]1/p + [E{|Y | p }]1/p ,
for X, Y ∈ L p (Ω) and p ∈ [1, ∞).
Consider now a Cauchy sequence in L p (Ω), that is, a sequence f n ∈ L p (Ω) for
which for every ϵ > 0 there exists a large enough N such that
sup ∥ f n − f m ∥ p ≤ ϵ.
n,m≥ N
In Euclidean spaces sequences are convergent if and only if they are Cauchy. However,
in infinite dimensional function spaces this equivalence doesn’t always hold. Thus, it
is of interest to know in which spaces Cauchy sequences converge. Spaces with such
a property are called complete. For L p (Ω) spaces we have the remarkable result below.
Theorem A.6 (Riesz-Fischer). The spaces L p (Ω), p ∈ [1, ∞), are complete, i.e., every
Cauchy sequence f n ∈ L p (Ω) converges to some limit in f ∈ L p (Ω).
Complete (semi-)normed vector spaces are referred to as Banach spaces.
A.2.2 The Hilbert space L2 ([0, 1])
Of particular importance for functional data analysis is the space L2 (Ω, A, µ) with
Ω a closed, bounded and connected domain, which we shall without loss of generality
take to be the [0, 1]-interval, A is the Borel sigma algebra of Ω and µ is Lebesgue mea-
sure on [0, 1]. We use the simplified notation L2 ([0, 1]) in order to denote this space.
What makes the space L2 ([0, 1]) so attractive for theory and applications is our
ability to define an inner product on it with the result that its geometry resembles that
of finite-dimensional Euclidean spaces. Specifically, an inner product on L2([0, 1]) is a function ⟨·, ·⟩ : L2([0, 1]) × L2([0, 1]) → ℝ satisfying the usual symmetry, bilinearity and positive semi-definiteness properties. On L2([0, 1]) the canonical choice is

⟨f, g⟩ = ∫_{[0,1]} f(t) g(t) dt.

The semi-norm and inner product on L2([0, 1]) are related through ⟨f, f⟩ = ∥f∥₂². We call f and g orthogonal if ⟨f, g⟩ = 0.
Basis expansions are an important tool for functional data analysis and they may
be conveniently defined for functions in L2 ([0, 1]). We say that a set of functions
{e1 (t), e2 (t), . . .} is a basis in L2 ([0, 1]) if every function f ∈ L2 ([0, 1]) admits a unique
expansion
f(t) = ∑_{j=1}^{∞} f_j e_j(t),        (A.17)
where the coefficients { f j } depend on f and the convergence is in the norm of L2 ([0, 1])
(so not always pointwise). We say that {e1 (t), e2 (t), . . .} is an orthonormal basis if, in
addition, ⟨ei , e j ⟩ = 0 whenever i ̸= j and ∥ei ∥2 = 1 for all i.
What does convergence of the series in (A.17) with orthonormal e j imply for the
coefficients f j ? We have the following fundamental result due to Parseval.
(a) The series in (A.17) converges in the norm ∥·∥₂ if and only if the series ∑_{j=1}^{∞} |f_j|² converges.
(b) If (A.17) converges, then the coefficients f_j are given by ⟨f, e_j⟩ with f = ∑_j f_j e_j; hence in this case we have

f(t) = ∑_{j=1}^{∞} ⟨f, e_j⟩ e_j(t).
(c) For any f ∈ L2 ([0, 1]) the series in (A.17) with f j = ⟨ f , e j ⟩ converges in the norm of
L2 ([0, 1]).
Trigonometric (Fourier) functions form the best known orthonormal basis of L2([0, 1]). These functions are given by

e₁(t) = 1,   e_{2k}(t) = √2 sin(2πkt),   e_{2k+1}(t) = √2 cos(2πkt),   k = 1, 2, . . .
It is easy to show that the Fourier functions are orthonormal but less obvious that every
function in L2 ([0, 1]) can be expanded in such series. The truncated polynomials and
B-spline functions which are often used in functional data analysis are not orthogonal
but they are still linearly independent. This means that they can be made orthonormal
using the Gram-Schmidt process.
Since every vector space contains the zero element, linearity also implies M(a f) = a M(f) for all a ∈ ℝ and f ∈ L2([0, 1]), and M(0) = 0. We will call a linear operator simply an operator. An operator M is bounded if there is a constant K such that for any f ∈ L2([0, 1])

∥M(f)∥₂ ≤ K ∥f∥₂.
We say that a bounded operator M is self-adjoint if and only if, for all f , g ∈ L2 ([0, 1])
⟨ M( f ), g⟩ = ⟨ f , M( g)⟩.
To draw a parallel with linear algebra, let M also denote the operator on ℝ^p induced by a p × p matrix M. Self-adjointness then means that

xᵀMᵀy = ⟨M(x), y⟩ = ⟨x, M(y)⟩ = xᵀMy

for all x, y ∈ ℝ^p. This implies that M is indeed a symmetric matrix. In another parallel with linear algebra, an operator M is called positive-semidefinite if and only if,

⟨M(f), f⟩ ≥ 0,   f ∈ L2([0, 1]).
It is called positive-definite, if the equality can hold only for f = 0, i.e., the null space
of the operator only contains the zero element.
We say that a real number λ is an eigenvalue of an operator M if, and only if, there
exists a non-zero f ∈ L2 ([0, 1]) such that
M( f ) = λ f . (A.19)
If M is positive-semidefinite and f is an eigenfunction with ∥f∥₂ = 1, then

λ = ⟨f, M(f)⟩ ≥ 0,

so that all eigenvalues of a positive-semidefinite operator are non-negative.
Moreover, if f₁ and f₂ are eigenfunctions of a self-adjoint operator M corresponding to distinct eigenvalues λ₁ ≠ λ₂ with λ₁ ≠ 0, then

⟨f₁, f₂⟩ = (λ₁/λ₁) ⟨f₁, f₂⟩ = (1/λ₁) ⟨M(f₁), f₂⟩ = (1/λ₁) ⟨f₁, M(f₂)⟩ = (λ₂/λ₁) ⟨f₁, f₂⟩,

which is only possible if ⟨f₁, f₂⟩ = 0. Hence eigenfunctions corresponding to distinct eigenvalues are orthogonal.
Now let K(s, t) be a continuous kernel on [0, 1]² that is symmetric and positive semi-definite, and define the integral operator K : L2([0, 1]) → L2([0, 1]) given by

K(f)(s) = ∫_{[0,1]} K(s, t) f(t) dt.
It is easy to check that K is self-adjoint and positive semi-definite. Thus, its eigenvalues
λ j are necessarily ≥ 0. Mercer’s theorem provides a remarkable characterization of K
in terms of the eigenvalues and eigenfunctions of K.
Theorem A.8. If (λ_j, e_j) are the eigenvalue and eigenfunction pairs of K, then K has the representation

K(s, t) = ∑_{j=1}^{∞} λ_j e_j(t) e_j(s),   (s, t) ∈ [0, 1]²,
3. f has a ( p − 1)st derivative that is a step function with jumps at the knots
t1 , . . . , t K .
The space of spline functions of order p is denoted by S_K^p. The dimension of S_K^p can be heuristically derived as follows. The space of all pth order polynomials between each [t_i, t_{i+1}) has dimension (K + 1)p, but our smoothness requirements add (p − 1) constraints at each knot, hence K(p − 1) constraints in total. Therefore the dimension of S_K^p is

(K + 1)p − K(p − 1) = K + p.
A basis for S_K^p may be derived by means of the truncated polynomials

1, x, . . . , x^{p−1}, (x − t₁)₊^{p−1}, . . . , (x − t_K)₊^{p−1},        (A.20)

where x₊^{p−1} = (max{0, x})^{p−1}. To see that these functions are indeed a basis one needs to check first that every piecewise polynomial is in their span and secondly that they are linearly independent. The former is trivial while for the latter notice that if

f(x) = ∑_{i=0}^{p−1} α_i x^i + ∑_{j=1}^{K} β_j (x − t_j)₊^{p−1} = 0,

then restricting attention to x < t₁, where all truncated polynomials vanish, forces α₀ = . . . = α_{p−1} = 0, so that

f(x) = ∑_{j=1}^{K} β_j (x − t_j)₊^{p−1} = 0.

Considering successively x ∈ [t₁, t₂), [t₂, t₃), . . . then shows that β₁ = . . . = β_K = 0, which establishes linear independence. While the truncated polynomial basis is conceptually simple, it tends to be numerically unstable. For this reason, the standard choice in theoretical work and practical implementations nowadays is the B-spline basis, which we now briefly describe.
Much like truncated polynomials B-splines require a choice of knots and the speci-
fication of the order of the polynomial in order to be defined, although their construc-
tion is more involved. As an equivalent basis, one can derive the B-splines as a linear
combination of truncated polynomial functions, but it is often much more convenient
to define them through recursive formulae.
To explain this procedure we first need to augment the set of knots t1 < t2 < . . . <
tK by defining 2p additional boundary knots
t−( p−1) = . . . = t0 = 0
and
tK +1 = . . . = tK + p = 1.
The B-splines of order p with knots at t₁ < t₂ < . . . < t_K can then be recursively defined by

B_{i,p}(x) = ((x − t_i)/(t_{i+p−1} − t_i)) B_{i,p−1}(x) + ((t_{i+p} − x)/(t_{i+p} − t_{i+1})) B_{i+1,p−1}(x)        (A.21)

for i = −(p − 1), . . . , K, and the recursion starts from the step functions

B_{i,1}(x) = 1 if x ∈ [t_i, t_{i+1}) and 0 otherwise,

for i = −(p − 1), . . . , K + p. In using this definition we must employ the property that
a B-spline of order p corresponding to p + 1 coincident knots is the zero function. To
illustrate the use of (A.21) consider the case of linear (p = 2) B-splines. In this instance 2p = 4 additional knots are defined by t_{−1} = t₀ = 0 and t_{K+1} = t_{K+2} = 1. Since B_{−1,1} has coincident knots at t_{−1} and t₀ we have B_{−1,1} = 0 and, similarly, B_{K+1,1} = 0. Thus, (A.21) gives

B_{−1,2}(x) = ((t₁ − x)/(t₁ − t₀)) B_{0,1}(x)
B_{0,2}(x) = ((x − t₀)/(t₁ − t₀)) B_{0,1}(x) + ((t₂ − x)/(t₂ − t₁)) B_{1,1}(x)
...
B_{K,2}(x) = ((x − t_K)/(t_{K+1} − t_K)) B_{K,1}(x).
B-splines of order 2 are known as ”tent” functions (plot a few of them and find out
why).
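Using the fda package as in the earlier examples, a short sketch that plots a few order-2 B-splines (with an arbitrary choice of knots) makes the tent shape visible:

# Plot order-2 (linear) B-splines on [0, 1]; the knots are an illustrative choice.
require(fda)
tent.basis <- create.bspline.basis(rangeval = c(0, 1), norder = 2,
                                   breaks = seq(0, 1, by = 0.1))
plot(tent.basis, lty = 1, lwd = 3, col = "black")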
Higher order B-splines may be defined by continuing the recursion. We mention
three key properties of B-splines of general order p, namely
a) B_{i,p}(x) = 0 for x ∉ [t_i, t_{i+p}),

b) B_{i,p}(x) ≥ 0,

c) ∑_i B_{i,p}(x) = 1 for all x ∈ [0, 1].
Note that property a) implies that every B-spline can be positive for at most p con-
secutive intervals. This needs to be contrasted with truncated polynomials which can
be nonzero throughout the [0, 1]-interval. This local support property means that B-
splines Bi,p ( x ) and Bj,p ( x ) such that |i − j| > p will be orthogonal. Thus, while this
basis is still not completely orthogonal it is very nearly so and in fact it can be shown
that the B-spline basis has the smallest compact support among all bases spanning the spline subspace S_K^p. Figure A.1 illustrates the B-spline basis for 10 interior knots.
Figure A.1 Quadratic and cubic B-splines based on 10 equidistant knots in [0.05, 0.95].
Step functions, that is, functions in the span of B-splines of order 1, can be shown
to be dense in L2 ([0, 1]). In general, however, spline functions are exceptionally pow-
erful when the function to be approximated exhibits a higher degree of smoothness,
i.e., it has a few continuous derivatives. In that case, it can be shown that splines can
approximate the target function uniformly on its domain with an approximation er-
ror decaying like O(K − p ) where p is the number of continuous derivatives the target
function possesses.
Further reading
The material for this appendix has been collected from various sources. Our discus-
sion of measure and probability theory is drawn mainly from Lehmann and Casella
(2006) and Schilling (2017). Our material for functional analysis has been collected
from Kreyszig (1978) and Schilling (2017). A more thorough treatment of spline func-
tions may be found in de Boor (2001) and DeVore and Lorentz (1993).
Bibliography
Barut, E., Fan, J., and Verhasselt, A. (2016). Conditional sure independence screening. Journal of the
American Statistical Association, 111(515):1266–1277, DOI: 10.1080/01621459.2015.1092974, https:
//doi.org/10.1080/01621459.2015.1092974. PMID: 28360436.
Bühlmann, P. and van de Geer, S. (2011). Statistics for high-dimensional data. Springer Series in Statistics.
Springer, Heidelberg, ISBN: 978-3-642-20191-2, DOI: 10.1007/978-3-642-20192-9, http://dx.
doi.org/10.1007/978-3-642-20192-9. Methods, theory and applications.
Candes, E. and Tao, T. (2007). The dantzig selector: Statistical estimation when p is much larger than
n. Ann. Statist., 35(6):2313–2351, DOI: 10.1214/009053606000001523, https://doi.org/10.1214/
009053606000001523.
Christmann, A. and Steinwart, I. (2008). Support Vector Machines. Springer, New York.
Cui, H., Li, R., and Zhong, W. (2015). Model-free feature screening for ultrahigh dimensional
discriminant analysis. Journal of the American Statistical Association, 110(510):630–641, DOI:
10.1080/01621459.2014.920256, https://doi.org/10.1080/01621459.2014.920256.
Davison, A. and Hinkley, D. (1997). Bootstrap Methods and their Applications. Cambridge University
Press, Cambridge.
DeVore, R. and Lorentz, G. (1993). Constructive Approximation. Springer Verlag, New York.
Donoho, D. and Huber, P. (1983). The notion of breakdown point. In Bickel, P., Doksum, K., and Hodges,
J., editors, A Festschrift for Erich Lehmann, pages 157–184, Belmont. Wadsworth.
Efron, B. (1979). Bootstrap methods: Another look at the jackknife. Annals of Statistics, 7:1–26.
Efron, B. and Tibshirani, R. (1993). An Introduction to the Bootstrap. Chapman and Hall.
Fan, J., Feng, Y., and Song, R. (2011). Nonparametric independence screening in sparse ultra-high-
dimensional additive models. Journal of the American Statistical Association, 106(494):544–557, DOI:
10.1198/jasa.2011.tm09779, https://doi.org/10.1198/jasa.2011.tm09779.
Fan, J. and Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties.
Journal of the American Statistical Association, 96:1348–1360.
Fan, J. and Lv, J. (2008). Sure independence screening for ultrahigh dimensional feature
space. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70(5):849–
911, DOI: 10.1111/j.1467-9868.2008.00674.x, https://rss.onlinelibrary.wiley.com/doi/
abs/10.1111/j.1467-9868.2008.00674.x.
Fan, J., Samworth, R., and Wu, Y. (2009). Ultrahigh dimensional feature selection: Beyond the linear
model. J. Mach. Learn. Res., 10:2013–2038, ISSN: 1532-4435.
Ferraty, F. and Vieu, P. (2006). Nonparametric Functional Data Analysis: Theory and Practice (Springer Series
in Statistics). Springer-Verlag, Berlin, Heidelberg, ISBN: 0387303693.
Hampel, F. (1971). A general qualitative definition of robustness. The Annals of Mathematical Statistics,
42:1887–1896.
Hampel, F. (1974). The influence curve and its role in robust estimation. Journal of the American Statistical
Association, 69:383–393.
Hampel, F., Ronchetti, E., Rousseeuw, P., and Stahel, W. (1986). Robust Statistics: The Approach Based on
Influence Functions. Wiley, New York.
Hastie, T., Tibshirani, R., and Friedman, J. (2001). The Elements of Statistical Learning. Springer, New
York.
Hastie, T., Tibshirani, R., and Wainwright, M. (2015). Statistical Learning with Sparsity: The Lasso and
Generalizations. Chapman & Hall/CRC, ISBN: 1498712169.
Hjort, N. and Pollard, D. (1993). Asymptotics for minimizers of convex processes. Technical report, Yale
University.
Hodges, J. and Lehmann, E. (1963). Estimates of location based on rank tests. Annals of Mathematical
Statistics, 34:598–611.
Hoerl, A. and Kennard, R. (1970a). Ridge regression: Applications to nonorthogonal problems. Techno-
metrics, 12:69–82.
Hoerl, A. and Kennard, R. (1970b). Ridge regression: Biased estimation for nonorthogonal problems.
Technometrics, 12:55–67.
Horváth, L. and Kokoszka, P. (2012). Inference for functional data with applications, volume 200. Springer
Science & Business Media.
Hsing, T. and Eubank, R. (2015). Theoretical Foundations of Functional Data Analysis, with an Introduction
to Linear Operators. Wiley Series in Probability and Statistics. Wiley, ISBN: 9780470016916, https:
//books.google.be/books?id=YjsbAAAACAAJ.
Huber, P. (1964). Robust estimation of a location parameter. The Annals of Mathematical Statistics, 35:73–
101.
Huber, P. and Ronchetti, E. (2009). Robust Statistics. Wiley Series in Probability and Statistics. Wiley,
ISBN: 9781118210338.
James, G., Witten, D., Hastie, T., and Tibshirani, R. (2014). An Introduction to Statistical Learning: With
Applications in R. Springer Publishing Company, Incorporated, ISBN: 1461471370.
Kalogridis, I. and Van Aelst, S. (2021). Robust optimal estimation of location from discretely sampled
functional data. Scandinavian Journal of Statistics.
Koenker, R. (2005). Quantile Regression. Econometric Society Monographs. Cambridge University Press,
DOI: 10.1017/CBO9780511754098.
Kokoszka, P. and Reimherr, M. (2017). Introduction to Functional Data Analysis. Chapman & Hall
/ CRC numerical analysis and scientific computing. CRC Press, ISBN: 9781498746342, https:
//books.google.be/books?id=HIxIvgAACAAJ.
Kreyszig, E. (1978). Introductory Functional Analysis with Applications. Wiley classics library. Wiley, ISBN:
9780471507314, https://books.google.be/books?id=Va8rAAAAYAAJ.
Lehmann, E. L. and Casella, G. (2006). Theory of point estimation. Springer Science & Business Media.
Li, R., Zhong, W., and Zhu, L. (2012). Feature screening via distance correlation learning. Journal of the
American Statistical Association, 107(499):1129–1139, DOI: 10.1080/01621459.2012.695654, https:
//doi.org/10.1080/01621459.2012.695654. PMID: 25249709.
Maronna, R. A., Martin, D. R., and Yohai, V. J. (2006). Robust Statistics: Theory and Methods. Wiley, New
York.
Pan, W., Wang, X., Xiao, W., and Zhu, H. (2019). A generic sure independence screening procedure. Jour-
nal of the American Statistical Association, 114(526):928–937, DOI: 10.1080/01621459.2018.1462709,
https://doi.org/10.1080/01621459.2018.1462709.
R Core Team (2013). R: A Language and Environment for Statistical Computing. R Foundation for Statistical
Computing, Vienna, Austria, http://www.R-project.org/.
Ramsay, J. and Silverman, B. (2006). Functional Data Analysis. Springer Series in Statistics. Springer New
York, ISBN: 9780387227511, https://books.google.be/books?id=REzuyz_V6OQC.
Rousseeuw, P. (1984). Least median of squares regression. Journal of the American Statistical Association,
79:871–880.
Rousseeuw, P. and Croux, C. (1993). Alternatives to the median absolute deviation. Journal of the Amer-
ican Statistical Association, 88:1273–1283.
Rousseeuw, R., Croux, C., Todorov, V., Ruckstuhl, A., Salibian-Barrera, M., Verbeke, T., Koller, M.,
and Maechler, M. (2003). robustbase: Basic Robust Statistics. R package version 0.9-10, http://CRAN.
R-project.org/package=robustbase.
Ruppert, D., Wand, M., and Carroll, R. (2003). Semiparametric Regression. Cambridge Series in Statistical
and Probabilistic Mathematics. Cambridge University Press, ISBN: 9780521785167, https://books.
google.be/books?id=Y4uEvXFP2voC.
Schilling, R. L. (2017). Measures, integrals and martingales. Cambridge University Press.
Schölkopf, B. and Smola, A. J. (2002). Learning with Kernels: Support Vector Machines, Regularization,
Optimization, and Beyond. MIT Press, Cambridge, MA, USA, ISBN: 0262194759.
Sinova, B., Gonzalez-Rodriguez, G., and Van Aelst, S. (2018). M-estimators of location for functional
data. Bernoulli, 24(3):2328–2357.
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical
Society. Series B, 58:267–288.
Todorov, V. and Filzmoser, P. (2009). An object-oriented framework for robust multivariate analysis.
Journal of Statistical Software, 32(3):1–47, http://www.jstatsoft.org/v32/i03/.
Tukey, J. (1958). Bias and confidence in not quite large samples (abstract). Ann. Math. Statist., 29(2):614,
DOI: 10.1214/aoms/1177706647, https://doi.org/10.1214/aoms/1177706647.
van de Geer, S. (2000). Empirical Processes in M-Estimation. Cambridge Series in Statistical and Proba-
bilistic Mathematics. Cambridge University Press, ISBN: 9780521650021.
van der Vaart, A. W. (1998). Asymptotic Statistics. Cambridge Series in Statistical and Probabilistic
Mathematics. Cambridge University Press, DOI: 10.1017/CBO9780511802256.
Venables, W. N. and Ripley, B. D. (2002). Modern Applied Statistics with S. Springer, New York, fourth
edition, http://www.stats.ox.ac.uk/pub/MASS4. ISBN 0-387-95457-0.
Wood, S. (2017). Generalized Additive Models: An Introduction with R, Second Edition. Chapman &
Hall/CRC Texts in Statistical Science. CRC Press, ISBN: 9781498728379, https://books.google.
be/books?id=JTkkDwAAQBAJ.
Yohai, V. (1987). High breakdown point and high efficiency robust estimates for regression. The Annals
of Statistics, 15:642–656.
Zhu, L.-P., Li, L., Li, R., and Zhu, L.-X. (2011). Model-free feature screening for ultrahigh-
dimensional data. Journal of the American Statistical Association, 106(496):1464–1475, DOI:
10.1198/jasa.2011.tm10563, https://doi.org/10.1198/jasa.2011.tm10563.
Zou, H. and Hastie, T. (2005). Regularization and variable selection via the elastic
net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2):301–
320, DOI: 10.1111/j.1467-9868.2005.00503.x, https://rss.onlinelibrary.wiley.com/doi/
abs/10.1111/j.1467-9868.2005.00503.x.
Prof. dr. Stefan Van Aelst
KU Leuven, Department of Mathematics and
Leuven Statistics Research Centre (LStat)
Celestijnenlaan 200B, 3001 Leuven
Phone. +32 16 37 23 83
[email protected]