STATISTICS AND DATA SCIENCE

Advanced Statistical Methods

Prof. dr. Stefan Van Aelst


Dr. Ioannis Kalogridis

KU Leuven, Department of Mathematics

Academic year 2022-2023


Table of contents

Preface

1 The Jackknife and the Bootstrap
1.1 The framework
1.2 The jackknife
1.2.1 Some examples
1.2.2 An example of failure of the jackknife
1.3 The Bootstrap
1.3.1 The bootstrap idea
1.3.2 Quality of the bootstrap approximation
1.4 Jackknife and Bootstrap
1.5 Bootstrap confidence intervals
1.6 Regression models
1.7 Bootstrap hypothesis tests

2 M-estimators
2.1 Robustness
2.2 Univariate parametric models
2.2.1 Univariate location model
2.2.2 Scale model
2.3 The sensitivity curve
2.4 The finite-sample breakdown value
2.5 M-estimators of univariate location
2.6 M-estimators of scale
2.7 M-estimators of univariate location with unknown scale
2.8 Computation of M-estimators
2.8.1 Computing an M-estimator of location
2.8.2 Computing an M-estimator of scale
2.8.3 Computing simultaneous M-estimates
2.9 Other robust estimators of univariate location and scale
2.10 Functional definition of M-estimators
2.11 Consistency of M-estimators
2.12 Asymptotic normality of M-estimators
2.13 The influence function
2.14 The asymptotic breakdown value
2.15 The maxbias curve

3 Regression M-estimators
3.1 Review of the LS method
3.2 M-estimators of regression
3.3 Computing M-estimators of regression
3.4 Robustness properties
3.4.1 Finite-sample breakdown value
3.4.2 Influence function
3.5 Asymptotic distribution of regression M-estimators

4 Regularized Regression and Variable Screening
4.1 Introduction
4.2 Least squares regression
4.3 Regularization
4.3.1 Ridge Regression
4.3.2 The Lasso
4.3.3 Other Regularization Methods
4.4 Screening
4.4.1 Correlation learning
4.4.2 Iteratively thresholded ridge regression screening
4.4.3 Iterative sure independence screening
4.4.4 Extensions

5 Support Vector Machines
5.1 Classification
5.1.1 Separating Hyperplanes
5.1.2 Support Vector Classifiers
5.1.3 Support Vector Machines for Classification
5.1.4 Regularization with Kernels
5.2 Regression
5.2.1 Support Vector Regression
5.2.2 Regression Support Vector Machines

6 Functional Data Analysis
6.1 Definitions and examples
6.2 Mathematical foundations
6.2.1 Mean-square continuous processes
6.2.2 The Karhunen-Loève theorem
6.2.3 Mean-square continuous processes in L2([0, 1])
6.3 Mean and covariance estimation
6.4 Functional principal component analysis
6.5 Functional regression

A Background
A.1 Basics of measure and probability theory
A.1.1 The abstract integral
A.1.2 Applications to probability theory
A.2 Elements of Functional Analysis
A.2.1 The Banach Lp(Ω) spaces
A.2.2 The Hilbert space L2([0, 1])
A.2.3 Linear operators on L2([0, 1])
A.3 Spline functions

Bibliography
Preface

This course discusses several methods for modern statistical inference. These mod-
ern inference methods are more flexible than classical inference techniques such as
least squares or maximum likelihood in the sense that they relax the assumptions that
need to be satisfied to obtain reliable results and/or can handle data analysis problems
that cannot be solved by the classical techniques. This course is the result of a joint ef-
fort of the colleagues of the Section of Statistics and Data Science at the Department of
Mathematics.
For practical computations and applications, the software package R (R Core Team,
2013) is used in this course. R is freely available at http://www.r-project.org. First-time
users of R can download and install R following the instructions on this website.
An easy user interface for the R statistical computing environment is RStudio which
can be freely downloaded from http://www.rstudio.com/.
Here is a quick guideline to get started if you have no experience with R/RStudio
yet. When opening RStudio you should see the R console window in the lower left
corner. Here you can type commands behind the symbol ‘>’. However, it is more
convenient to type commands in the editor window in the upper left corner. These
commands can then be executed by clicking the ’run’ button (executes the selected
line or block of code) or the ’source’ button (executes the whole code). To learn more
about the use of RStudio, select the ’RStudio Docs’ option in the ’Help’ menu of RStudio.
R is a modular system where functionality is organized in packages. Some of the
basic packages are installed by default. If a specific package is needed for a statistical
method, then it can be installed by selecting the 'Packages' tab in the lower right panel
of RStudio and clicking the 'Install' button. After installing a package, you can select
it from the list, which loads the package into R. When the package is loaded, its
content is ready to use.
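The same steps can be carried out from the console. A minimal sketch (the boot package is
just an example here; it ships with standard R distributions and implements many of the
bootstrap methods discussed in Chapter 1):

# install a package once per machine; load it in every session that needs it
install.packages("boot")
library(boot)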
Help files are readily available by typing

> help.start()

Select ‘An Introduction to R’ for a basic manual. It can be a useful exercise to read and
execute some of the examples of this manual.
Chapter 1

The Jackknife and the Bootstrap

1.1 The framework


The usual framework in (frequentist) statistics is the following:
1. We have some data set, a collection of observations

2. We assume that the observations were drawn from some unknown population

3. We are interested in some quantities of this population

4. We use the data to estimate these quantities


Of course, the analysis should not end here. An estimate in itself does not give that
much information. Therefore:
5. We obtain estimates of the accuracy of the estimates

6. Using the accuracy estimates, we can build confidence intervals, perform hy-
pothesis tests, . . .
The term statistical inference often refers specifically to the last two steps. The boot-
strap is a generally applicable method to perform these two steps. Schematically, sta-
tistical inference can be represented as follows.
$$\boxed{\;\{x_1, \dots, x_n\} \sim F \;\;\Downarrow\;\; \hat\theta \text{ estimate for } \theta(F)\;} \;\Longrightarrow\; \begin{cases} \text{standard error of } \hat\theta \\ \text{confidence interval for } \theta(F) \\ \text{hypothesis tests for } \theta(F) \end{cases}$$

Classical inference is based on assumptions such as normality of the underlying


distribution, or large-sample asymptotics. Jackknife and bootstrap are the principal
alternatives to classical inference, and have two big advantages:

• They rely on far fewer assumptions; for instance, they do not require normality.

• They can easily be applied in statistical models that are too complex for analytical
(classical) solutions.

Let us specify in more detail what we mean by the accuracy of an estimate. Sup-
pose we have a data set Xn = { x1 , . . . , xn } from an unknown population F and we
computed some estimate θb = θb(Xn ) for the quantity θ = θ ( F ). The accuracy obviously
concerns the difference between θb and θ. We can never know exactly how far our es-
timate is from θ (otherwise we would know the unknown θ). But we can figure out
how far we can expect it to be. There are two main aspects about this accuracy:

• Is $\hat\theta$ on average close to $\theta$? ($\Rightarrow$ bias of $\hat\theta$)

• Is $\hat\theta$ highly variable? ($\Rightarrow$ variance/standard error of $\hat\theta$)

When we know that the estimate is more or less unbiased, we are mainly interested
in the variability of $\hat\theta$. To assess the accuracy, we take into account that the available
data set $\mathbf{X}_n$ is just one possible random sample from F. We could have had a different
sample of the same size, and the question is then to what extent our estimate $\hat\theta$ could
be different. The accuracy of the estimate can be derived from its sampling distribution,
which is the distribution of the estimator over all possible random samples from
F with the same sample size. From the sampling distribution we can obtain

$$\mathrm{var}[\hat\theta \mid F] \quad \text{(variance)} \tag{1.1}$$
$$E[\hat\theta \mid F] - \theta \quad \text{(bias)} \tag{1.2}$$
$$P[(\hat\theta - \theta) \le t \mid F] \quad \text{(for confidence intervals)} \tag{1.3}$$

The bootstrap is a generally applicable method to estimate the sampling distribu-


tion of an estimator while the jackknife is limited to estimating the bias and variance
of an estimator.

1.2 The jackknife


The jackknife is a technique for estimating the bias and standard error of an estima-
tor. The jackknife method was invented in 1956 by Quenouille (Quenouille, 1956) and

further developed by Tukey (Tukey, 1958). The word "jackknife" refers to the handy
pocket knife that one should always carry because it is useful in many situations.
The basic idea behind the jackknife estimator lies in systematically recomputing
the estimate, leaving out one observation at a time from the sample. From this new
set of "observations" for the statistic, an estimate for the bias can be calculated, as well
as an estimate for the variance of the statistic.

Let $X_1, X_2, \dots, X_n$ be an i.i.d. sample from the population X with cumulative
distribution function F. Let $\theta(F) = \theta$ be the quantity of interest. Denote by
$T_n = T(X_1, X_2, \dots, X_n)$ an estimator of $\theta$. Here, we consider estimators that can be
written as functions of the empirical distribution of the data, i.e.
$$T_n = \theta(F_n),$$
where $F_n$ is the empirical cumulative distribution function.


Let us define the leave-one-out estimators of $\theta(F)$ as
$$T_{n-1}^{(i)} = T(X_1, \dots, X_{i-1}, X_{i+1}, \dots, X_n), \quad i = 1, \dots, n. \tag{1.4}$$
The leave-one-out estimators $T_{n-1}^{(i)}$ are thus obtained by estimating $\theta(F)$ from the
reduced sample $(X_1, \dots, X_{i-1}, X_{i+1}, \dots, X_n)$ of size $n - 1$.
Corresponding to the leave-one-out estimates, the 'pseudo-values' $T_{n,i}$ are now defined by
$$T_{n,i} = nT_n - (n-1)T_{n-1}^{(i)}, \quad i = 1, \dots, n. \tag{1.5}$$

The jackknife estimator of the mean of $T_n$ is then defined as
$$\hat{E}_{jack}(T_n) = \hat{T}_n = \frac{1}{n}\sum_{i=1}^{n} T_{n,i} = \frac{1}{n}\sum_{i=1}^{n}\left[nT_n - (n-1)T_{n-1}^{(i)}\right] = nT_n - \frac{n-1}{n}\sum_{i=1}^{n} T_{n-1}^{(i)}. \tag{1.6}$$
The jackknife estimator for the mean of $T_n$ is thus equal to the average of the pseudo-values.
The jackknife estimator for the bias of $T_n$ is now given by
$$\widehat{\mathrm{Bias}}_{jack}(T_n) = T_n - \hat{E}_{jack}(T_n) = T_n - nT_n + \frac{n-1}{n}\sum_{i=1}^{n} T_{n-1}^{(i)},$$
and hence
$$\widehat{\mathrm{Bias}}_{jack}(T_n) = (n-1)\left[\frac{1}{n}\sum_{i=1}^{n} T_{n-1}^{(i)} - T_n\right]. \tag{1.7}$$
The jackknife estimator for the variance of $T_n$ is given by the sample variance of
the pseudo-values $T_{n,i}$:
$$\widehat{\mathrm{Var}}_{jack}(T_n) = \frac{1}{n}\left\{\frac{1}{n-1}\sum_{i=1}^{n}\left(T_{n,i} - \hat{E}_{jack}(T_n)\right)^2\right\}. \tag{1.8}$$

The motivation behind this estimation procedure can be explained as follows. Suppose
that $T_n$ is asymptotically unbiased with an expansion of the form
$$E(T_n) = \theta + \frac{a(\theta)}{n} + \frac{b(\theta)}{n^2} + \cdots.$$
Now, we have
$$E(\hat{T}_n) = E\left\{nT_n - \frac{n-1}{n}\sum_{i=1}^{n} T_{n-1}^{(i)}\right\} = nE(T_n) - \frac{n-1}{n}\sum_{i=1}^{n} E\left(T_{n-1}^{(i)}\right) = nE(T_n) - (n-1)E\left(T_{n-1}^{(1)}\right)$$
$$= n\left(\theta + \frac{a(\theta)}{n} + \frac{b(\theta)}{n^2} + \cdots\right) - (n-1)\left(\theta + \frac{a(\theta)}{n-1} + \frac{b(\theta)}{(n-1)^2} + \cdots\right)$$
$$= \theta + \frac{b(\theta)}{n} - \frac{b(\theta)}{n-1} + \cdots = \theta - \frac{b(\theta)}{n(n-1)} + \cdots.$$
Comparing with the expansion of $E(T_n)$, we see that the leading bias term of order $1/n$
has been removed, so $\hat{T}_n$ should be a more accurate estimator for $\theta$ than $T_n$.
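The jackknife recipe above translates directly into R. A minimal sketch (the function name
jackknife and the argument stat, a function mapping a sample to a scalar statistic, are our
own illustrative choices):

# Jackknife bias and variance estimates following (1.5), (1.7) and (1.8)
jackknife <- function(x, stat) {
  n <- length(x)
  Tn <- stat(x)
  loo <- sapply(seq_len(n), function(i) stat(x[-i]))  # leave-one-out estimates (1.4)
  pseudo <- n * Tn - (n - 1) * loo                    # pseudo-values (1.5)
  list(estimate = Tn,
       bias     = (n - 1) * (mean(loo) - Tn),         # (1.7)
       variance = sum((pseudo - mean(pseudo))^2) / (n * (n - 1)))  # (1.8)
}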

1.2.1 Some examples
The mean of a random variable

Suppose that $\theta(F) = E(X)$ and consider $T_n = \bar{X}_n$, the sample mean. We first
calculate the pseudo-values $T_{n,i}$:
$$T_{n,i} = nT_n - (n-1)T_{n-1}^{(i)} = n\bar{X}_n - (n-1)\frac{1}{n-1}\sum_{j \neq i} X_j = \sum_{j=1}^{n} X_j - \sum_{j \neq i} X_j = X_i.$$
So, we obtain
$$\hat{E}_{jack}(T_n) = \hat{T}_n = \frac{1}{n}\sum_{i=1}^{n} T_{n,i} = \bar{X}_n = T_n.$$
The jackknife estimator for the variance of $T_n$ becomes
$$\widehat{\mathrm{Var}}_{jack}(T_n) = \frac{1}{n}\left\{\frac{1}{n-1}\sum_{i=1}^{n}\left(T_{n,i} - \hat{E}_{jack}(T_n)\right)^2\right\} = \frac{1}{n}\left\{\frac{1}{n-1}\sum_{i=1}^{n}\left(X_i - \bar{X}_n\right)^2\right\} = \frac{1}{n}S_n^2,$$
with $S_n^2$ the sample variance of the sample $X_1, \dots, X_n$.

The variance of a random variable

Suppose that $\theta(F) = \mathrm{Var}(X) = E(X - E(X))^2$ and consider the empirical variance
$$T_n = \theta(F_n) = \frac{1}{n}\sum_{i=1}^{n}\left(X_i - \bar{X}_n\right)^2.$$
We first calculate the pseudo-values $T_{n,i}$:
$$T_{n,i} = nT_n - (n-1)T_{n-1}^{(i)} = \sum_{i=1}^{n}\left(X_i - \bar{X}_n\right)^2 - \sum_{j \neq i}\left(X_j - \bar{X}_{n-1}^{(i)}\right)^2.$$
Since
$$\bar{X}_{n-1}^{(i)} = \frac{1}{n-1}\sum_{j \neq i} X_j = \frac{1}{n-1}\left(n\bar{X}_n - X_i\right),$$
we find that
$$T_{n,i} = \sum_{i=1}^{n} X_i^2 - n\bar{X}_n^2 - \sum_{j \neq i} X_j^2 + (n-1)\left(\bar{X}_{n-1}^{(i)}\right)^2 = X_i^2 - n\bar{X}_n^2 + \frac{1}{n-1}\left(n\bar{X}_n - X_i\right)^2$$
$$= X_i^2 - n\bar{X}_n^2 + \frac{n^2}{n-1}\bar{X}_n^2 - \frac{2n}{n-1}X_i\bar{X}_n + \frac{1}{n-1}X_i^2 = \frac{n}{n-1}X_i^2 + \frac{n}{n-1}\bar{X}_n^2 - \frac{2n}{n-1}X_i\bar{X}_n$$
$$= \frac{n}{n-1}\left(X_i - \bar{X}_n\right)^2.$$
We then find
$$\hat{T}_n = \frac{1}{n}\sum_{i=1}^{n} T_{n,i} = \frac{1}{n-1}\sum_{i=1}^{n}\left(X_i - \bar{X}_n\right)^2 = S_n^2,$$
the sample variance of the sample $X_1, \dots, X_n$. Note that $E(T_n) = \frac{n-1}{n}\theta$ while
$E(\hat{T}_n) = \theta$. Hence, $\hat{T}_n$ is a better estimator than $T_n$.
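Both examples can be checked numerically with the jackknife() sketch above; the bias-corrected
estimate $T_n - \widehat{\mathrm{Bias}}_{jack}(T_n) = \hat{T}_n$ then reproduces $S_n^2$:

set.seed(1)
x <- rnorm(50)
out <- jackknife(x, function(s) mean((s - mean(s))^2))  # T_n = empirical variance
out$estimate - out$bias  # equals var(x) = S_n^2 (up to floating-point error)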

1.2.2 An example of failure of the jackknife


The statistics to be analyzed need to be 'smooth' in the data, meaning that small
changes in the data set cause only small changes in the statistic. We provide an example
to illustrate this.
Example: The median. Let
$$\theta(F) = F^{-1}(0.5) = \inf\{x : F(x) \ge 0.5\}.$$
Take an i.i.d. sample $X_1, \dots, X_n$ from X having distribution F, and consider the case
that $n = 2m$ (even sample size). Then
$$T_n = \frac{1}{2}\left(X_{(m)} + X_{(m+1)}\right),$$

where $X_{(1)} \le X_{(2)} \le \cdots \le X_{(n)}$ denote the order statistics of $X_1, \dots, X_n$. We calculate
the pseudo-values $T_{n,i} = nT_n - (n-1)T_{n-1}^{(i)}$; for this we first look at $T_{n-1}^{(i)}$:
$$T_{n-1}^{(i)} = \begin{cases} X_{(m+1)} & i = 1, \dots, m \\ X_{(m)} & i = m+1, \dots, n. \end{cases}$$
So the pseudo-values are
$$T_{n,i} = \begin{cases} nT_n - (n-1)X_{(m+1)} & i = 1, \dots, m \\ nT_n - (n-1)X_{(m)} & i = m+1, \dots, n, \end{cases}$$
and, since $n - m = m$,
$$\hat{T}_n = \frac{1}{n}\sum_{i=1}^{n} T_{n,i} = nT_n - \frac{n-1}{n}mX_{(m+1)} - \frac{n-1}{n}(n-m)X_{(m)} = nT_n - \frac{n-1}{n}m\left(X_{(m+1)} + X_{(m)}\right)$$
$$= nT_n - \frac{n-1}{n}2mT_n = nT_n - (n-1)T_n = T_n.$$
The jackknife estimator of the bias is
$$T_n - \hat{E}_{jack}(T_n) = T_n - \hat{T}_n = 0.$$

However, the jackknife estimator of the variance does not work. Let us calculate that
estimator. Recall (1.8) and calculate
$$T_{n,i} - \hat{T}_n = T_{n,i} - T_n = \begin{cases} (n-1)\left(T_n - X_{(m+1)}\right) & i = 1, \dots, m \\ (n-1)\left(T_n - X_{(m)}\right) & i = m+1, \dots, n. \end{cases}$$
It follows that
$$\widehat{\mathrm{Var}}_{jack}(T_n) = \frac{1}{n}\left\{\frac{1}{n-1}\sum_{i=1}^{n}\left(T_{n,i} - \hat{T}_n\right)^2\right\} = \frac{1}{n}\cdot\frac{1}{n-1}\left\{m(n-1)^2\left(T_n - X_{(m+1)}\right)^2 + m(n-1)^2\left(T_n - X_{(m)}\right)^2\right\}$$
$$= \frac{m(n-1)}{n}\left\{\left(T_n - X_{(m+1)}\right)^2 + \left(T_n - X_{(m)}\right)^2\right\} = \frac{n-1}{2}\left\{\left(\tfrac{1}{2}X_{(m)} - \tfrac{1}{2}X_{(m+1)}\right)^2 + \left(\tfrac{1}{2}X_{(m+1)} - \tfrac{1}{2}X_{(m)}\right)^2\right\}$$
$$= \frac{n-1}{4}\left(X_{(m+1)} - X_{(m)}\right)^2.$$
However, for symmetric densities (with median zero) one has that the variance of $T_n$ satisfies
$$n \, \mathrm{Var}(T_n) \to \frac{1}{4f^2(0)},$$
where $f = F'$. The jackknife estimate does not reproduce this limit: even as $n \to \infty$,
$$n \, \widehat{\mathrm{Var}}_{jack}(T_n) \not\to \frac{1}{4f^2(0)}.$$
In conclusion, the jackknife estimator of the variance of the median is inconsistent (it
fails to converge to the true asymptotic variance).
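This inconsistency is easy to see in a small simulation. The following sketch (the N(0, 1)
population, n = 100 and the replication counts are arbitrary choices) compares the jackknife
estimate with the Monte Carlo sampling variance of the median, which for N(0, 1) is
approximately $\pi/(2n)$:

set.seed(42)
n <- 100
jack.var.median <- function(x) {
  loo <- sapply(seq_along(x), function(i) median(x[-i]))
  pseudo <- length(x) * median(x) - (length(x) - 1) * loo
  sum((pseudo - mean(pseudo))^2) / (length(x) * (length(x) - 1))
}
var(replicate(2000, median(rnorm(n))))           # Monte Carlo truth, about pi/(2n)
mean(replicate(200, jack.var.median(rnorm(n))))  # jackknife: noticeably too large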

1.3 The Bootstrap


The bootstrap is a data-driven procedure for statistical inference. Originally (see
Efron, 1979), it was proposed as a method to assess the accuracy of statistical esti-
mation procedures as described in Efron and Tibshirani (1993): “The bootstrap is a
computer-based method for assigning measures of accuracy to statistical estimates”.
In most applications the bootstrap cannot be performed analytically, but is executed
in practice by computer simulation. With the increase of computing power, the pop-
ularity of the bootstrap grew tremendously. A main reason for its popularity is its
simplicity and its generality in the sense that it does not depend on assumptions that
may be complicated and hard to verify in practice. In Davison and Hinkley (1997)
this was formulated as: “The key idea is to resample from the original data [. . . ] to
create replicate datasets, from which the variability of the quantities of interest can
be assessed without long-winded and error-prone analytical calculation”. Nowadays
bootstrap methodology can be used for almost any type of inference. It can thus be
used to obtain standard errors and confidence regions, to perform hypothesis tests and
much more.

1.3.1 The bootstrap idea


If we could easily obtain random samples repeatedly, then we could approximate
the sampling distribution. Suppose $\mathbf{X}_n^1, \mathbf{X}_n^2, \dots, \mathbf{X}_n^B$ were all random samples from F.
Then, we could obtain the corresponding estimates $\hat\theta_n^1, \hat\theta_n^2, \dots, \hat\theta_n^B$.
For example, suppose F = N (µ, σ2 ) with µ = 5 and σ = 1. We constructed B =
1000 random samples of size n = 100 from F, and computed the sample means as
estimates for the population mean µ. The histogram in Figure 1.1 represents the B
estimates:
[Figure 1.1: Histogram of B = 1000 sample means from random samples of size n = 100 from F = N(5, 1).]

We can see that around 95% of the time the difference between the sample mean and
µ is less than 0.2, just as we of course know from the normal theory (i.e. sample mean
is distributed as N (µ, σ2 /n)).

Considering a whole range of possible random samples from F would thus give us an
idea of the sampling distribution and thus of the accuracy of a single estimate (in prob-
ability terms). Of course, in practice we cannot draw extra samples from F, we only
have one. In the above example, even if we do not know µ and σ, this can be solved if

we are willing to assume that F is normal. Then, we can rely on the theory and take the
normal distribution as the sampling distribution. But if it is not reasonable to assume
that F is normal, then we cannot rely on the theoretical approximation anymore. Note
that the approximation based on repeated sampling would not need any assumptions
about the distribution F, in contrast to the analytical approximation. The bootstrap
mimics the process of repeated sampling from F and thus allows us to approximate the
sampling distribution without making possibly unrealistic assumptions.
Let us focus on the situation where we have a single sample $\mathbf{X}_n$ of independent and
identically distributed (i.i.d.) observations, which may be univariate or multivariate.
In short, the bootstrap reshuffles the data set $\mathbf{X}_n$ many times and computes $\hat\theta_n$ again
for each reshuffle to find out about its variability.
We would like to have many extra samples from F but we do not know F, so what
can we do? As always in statistics, we replace F by an estimate $\hat{F}$. The best estimate,
in the absence of prior knowledge of the possible form of F, is given by the empirical
distribution $F_n$ corresponding to the sample $\mathbf{X}_n$. Its distribution function is given by
$$F_n(x) = \frac{\#\{X_i \le x\}}{n}.$$
Now that we have $F_n$ as an estimate of F, we can proceed by randomly drawing samples
$\mathbf{X}_n^*$ from $F_n$ (instead of F), compute the corresponding estimates $\hat\theta_n^*$, and consider
the sampling distribution under $F_n$ as an estimate for the sampling distribution under
F. We then use the following approximations:
$$\mathrm{var}[\hat\theta_n \mid F] \cong \mathrm{var}[\hat\theta_n^* \mid F_n]$$
$$E[\hat\theta_n \mid F] - \theta \cong E[\hat\theta_n^* \mid F_n] - \hat\theta_n$$
$$P[(\hat\theta_n - \theta) \le t \mid F] \cong P[(\hat\theta_n^* - \hat\theta_n) \le t \mid F_n]$$

The general bootstrap principle thus can be stated as follows. Replace the unknown
F by the estimate $F_n$ and obtain an approximation for the sampling distribution of $\hat\theta_n$
under $F_n$ by repeatedly sampling from $F_n$. This implies that $\theta$ is replaced by $\hat\theta_n$ and $\hat\theta_n$
by $\hat\theta_n^*$.
Consider again the example above, where $F = N(\mu, \sigma^2)$ with $\mu = 5$ and $\sigma = 1$. We
took one random sample $\mathbf{X}_n$ of size n = 100 from F, which happened to give us a sample
mean equal to $\hat\theta = 5.094$. Figure 1.2 shows the approximation of the distribution
F by $F_n$ corresponding to $\mathbf{X}_n$. We then constructed B = 1000 samples from $F_n$ corresponding
to this $\mathbf{X}_n$, each time computing the sample mean $\hat\theta_n^*$. The histogram of these
estimates is represented in Figure 1.3. We see that the B estimates of the mean are centered
around the sample mean 5.094 of the original sample, as could be expected. More
importantly, the distribution of $\hat\theta_n^* - \hat\theta_n$ resembles the distribution of $\hat\theta_n - \theta$ (where $\theta = \mu$).

[Figure 1.2: Distribution F and its approximation $F_n$ using a sample of size n = 100.]

[Figure 1.3: Histogram of B = 1000 bootstrap means based on a random sample of size n = 100 from F = N(5, 1).]

The samples $\mathbf{X}_n^*$ from $F_n$ are called bootstrap samples. They are samples with n independent
observations that are all (identically) distributed according to the discrete
distribution $F_n$. The density corresponding to $F_n$ puts probability mass 1/n on each
observation in $\mathbf{X}_n$ (assuming there are no ties), which means that a bootstrap sample
needs to be constructed by drawing n observations from $\mathbf{X}_n$ with replacement.

Theoretically, the bootstrap replaces F by its estimate $F_n$ to approximate the sampling
distribution of $\hat\theta_n$. In practice this means that the original data set $\mathbf{X}_n$ is re-sampled
with replacement to create many bootstrap samples and many re-computed estimates of $\hat\theta_n$:
$$\mathbf{X}_n \sim F \;\Longrightarrow\; \hat\theta_n$$
$$\Downarrow$$
$$\mathbf{X}_n^{*1}, \mathbf{X}_n^{*2}, \dots, \mathbf{X}_n^{*B} \;\Longrightarrow\; \hat\theta_n^{*1}, \hat\theta_n^{*2}, \dots, \hat\theta_n^{*B} \;\Longrightarrow\; \widehat{\mathrm{var}}[\hat\theta_n^* \mid F_n],\;\; \hat{E}[\hat\theta_n^* \mid F_n] - \hat\theta_n,\;\; \hat{P}[(\hat\theta_n^* - \hat\theta_n) \le t \mid F_n]$$

In some simple cases, $E[\hat\theta_n^* \mid F_n]$ and $\mathrm{var}[\hat\theta_n^* \mid F_n]$ may be calculated analytically and thus
it is not necessary to actually perform the resampling. As an example, we consider the
estimation of a mean $\mu$ by the sample mean $\bar{X}_n$, where X is such that $E(X) = \mu$ and
$\mathrm{Var}(X) = \sigma^2$ (but there is no further information about the distribution of X).
Let $(X_1^*, X_2^*, \dots, X_n^*)$ be an i.i.d. sample drawn from the empirical cumulative distribution
$F_n$. We have:

• In the real world: $X \sim F$, so $E_F(X) = \mu$ and $\mathrm{Var}_F(X) = \sigma^2$.

• In the bootstrap world: $X_i^* \mid (X_1, \dots, X_n) \sim F_n$, so
$$E_*(X_i^*) = E_{F_n}(X_i^*) = \frac{1}{n}\sum_{i=1}^{n} X_i = \bar{X}_n,$$
$$\mathrm{Var}_*(X_i^*) = \mathrm{Var}_{F_n}(X_i^*) = E_{F_n}\left(X_i^* - E_{F_n}(X_i^*)\right)^2 = \frac{1}{n}\sum_{i=1}^{n}\left(X_i - \bar{X}_n\right)^2 = \frac{n-1}{n}S_n^2.$$

Hence, regarding the sampling properties of $\bar{X}_n$, we have for the mean and the variance:

• In the real world:
$$E(\bar{X}_n) = E_F(\bar{X}_n) = \mu \quad \text{and} \quad \mathrm{Bias}(\bar{X}_n) = 0,$$
$$\mathrm{Var}(\bar{X}_n) = \mathrm{Var}_F(\bar{X}_n) = \frac{\mathrm{Var}(X)}{n}.$$

• In the bootstrap world:
$$E_*(\bar{X}_n^*) = E_{F_n}(\bar{X}_n^*) = \frac{1}{n}\sum_{i=1}^{n} E_{F_n}(X_i^*) = \bar{X}_n,$$
$$\mathrm{Var}_*(\bar{X}_n^*) = \mathrm{Var}_{F_n}(\bar{X}_n^*) = \frac{1}{n}\mathrm{Var}_{F_n}(X_i^*) = \frac{1}{n}\frac{n-1}{n}S_n^2.$$

The latter expressions yield the bootstrap estimates of bias and variance of $T_n = \bar{X}_n$:
the bootstrap estimate of the bias is $E_*(\bar{X}_n^*) - \bar{X}_n = 0$, and the bootstrap estimate of
the variance of $\bar{X}_n$ is $\mathrm{Var}_*(\bar{X}_n^*) = \frac{n-1}{n}\frac{S_n^2}{n} \approx \frac{S_n^2}{n}$.
However, in most cases we cannot calculate bootstrap estimates of bias and variance
analytically. The re-sampling in the bootstrap procedure then amounts to calculations
through Monte Carlo simulation. This re-computing part makes the bootstrap a
computer-intensive method: we may need thousands or more bootstrap samples!
The re-sampling by computer simulation gives us only an approximation (or
simulation analogue) to $\mathrm{var}[\hat\theta_n^* \mid F_n]$, $E[\hat\theta_n^* \mid F_n]$ and $P[(\hat\theta_n^* - \hat\theta_n) \le t \mid F_n]$; hence the
hats in the scheme above. In particular, the simulation analogues are defined as
$$\widehat{\mathrm{var}}[\hat\theta_n^* \mid F_n] = \frac{1}{B-1}\sum_{b=1}^{B}\left(\hat\theta_n^{*b} - \bar\theta_n^{*}\right)^2 \;\;(=: v_B^*) \tag{1.9}$$
$$\hat{E}[\hat\theta_n^* \mid F_n] - \hat\theta_n = \frac{1}{B}\sum_{b=1}^{B}\hat\theta_n^{*b} - \hat\theta_n \;\;(=: b_B^*) \tag{1.10}$$
$$\hat{P}[(\hat\theta_n^* - \hat\theta_n) \le t \mid F_n] = \frac{\#\{(\hat\theta_n^{*b} - \hat\theta_n) \le t\}}{B} \tag{1.11}$$
where $\hat\theta_n^{*b}$, $b = 1, \dots, B$, are the re-computed estimates and $\bar\theta_n^{*} = \frac{1}{B}\sum_{b=1}^{B}\hat\theta_n^{*b}$.
Now, (1.9)-(1.11) are the usual bootstrap estimates for the variance, the bias and the
sampling distribution of $\hat\theta_n$.
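In code, the Monte Carlo approximations (1.9)-(1.11) take only a few lines. A nonparametric
bootstrap sketch (function and argument names are ours; stat is any function of the sample):

bootstrap <- function(x, stat, B = 1000) {
  theta.hat  <- stat(x)
  theta.star <- replicate(B, stat(sample(x, replace = TRUE)))  # resamples from F_n
  list(estimate   = theta.hat,
       variance   = var(theta.star),               # v*_B as in (1.9)
       bias       = mean(theta.star) - theta.hat,  # b*_B as in (1.10)
       theta.star = theta.star)  # replicates; (1.11) is the ecdf of theta.star - theta.hat
}

set.seed(123)
sqrt(bootstrap(rnorm(100, mean = 5), mean)$variance)  # bootstrap SE of the sample mean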
Example (Davison and Hinkley, 1997). The following table reports the data set
$\mathbf{X}_n = \{x_1, \dots, x_n\}$, where n = 10 and where $x_i = (y_i, z_i)$. Each pair $x_i$ corresponds
to a city in the US and contains its 1920 ($y_i$) and 1930 ($z_i$) populations, in thousands.

obs    1    2    3    4    5    6    7    8    9   10
y_i  138   93   61  179   48   37   29   23   30    2
z_i  143  104   69  260   75   63   50   48  111   50

Of interest here is the ratio of the means, since this would enable us to estimate the total
population growth in the US from 1920 to 1930. We assume that the given pairs are
a random sample from a bivariate distribution F, with X = (Y, Z) denoting the corresponding
random variable. The quantity or parameter of interest is $\theta = E[Z]/E[Y]$.
A natural estimate of $\theta$ is the empirical analogue $\hat\theta_n = \bar{z}/\bar{y}$, i.e. the ratio between the
two sample means. From the data, we obtain $\hat\theta_n = 1.520$.
We are then concerned about the accuracy of $\hat\theta_n$, especially because with n = 10
we have quite a small sample. We do not know anything about the underlying distribution
F. We could make assumptions about it and use theoretical calculations.
Assuming bivariate normality would be an option, but probably not a very good one
(plot the data to see this). Another option is assuming that F is bivariate lognormal,
which might be better. If we did so, it would be possible to do the necessary
calculations. But the calculations would be rather difficult and, above all, we cannot be
sure that our assumption is correct. Instead we can avoid both the calculations and the
assumptions, and use the bootstrap.
We need to re-sample the data: generate new samples, consisting again of n pairs,
by randomly drawing observations from the original sample Xn = { x1 , . . . , xn }, with
replacement. So in any newly generated sample, some of the 10 original cities will
appear more than once, and some will not appear at all. We generated B = 1000 boot-
strap samples. The following table lists how many times each observation appears in
the bootstrap samples and shows the resulting recomputed estimates θbn∗ .

obs                x1   x2   x3   x4   x5   x6   x7   x8   x9  x10
original sample:    1    1    1    1    1    1    1    1    1    1    $\hat\theta_n = 1.520$
boot sample 1:      0    1    0    1    2    0    1    2    3    0    $\hat\theta_n^{*1} = 1.863$
boot sample 2:      2    0    0    1    1    1    0    1    1    3    $\hat\theta_n^{*2} = 1.658$
boot sample 3:      2    2    1    1    0    2    0    0    0    2    $\hat\theta_n^{*3} = 1.345$
boot sample 4:      1    0    1    2    1    0    3    0    0    2    $\hat\theta_n^{*4} = 1.519$
boot sample 5:      1    0    0    2    0    0    0    2    2    3    $\hat\theta_n^{*5} = 1.860$
boot sample 6:      1    1    0    2    0    2    0    1    0    3    $\hat\theta_n^{*6} = 1.577$
boot sample 7:      1    0    1    1    1    1    1    3    1    0    $\hat\theta_n^{*7} = 1.548$
boot sample 8:      1    0    1    1    3    1    1    0    0    2    $\hat\theta_n^{*8} = 1.537$
boot sample 9:      0    0    3    1    2    0    0    1    1    2    $\hat\theta_n^{*9} = 1.701$
...

The histogram with all B = 1000 bootstrap recomputations $\hat\theta_n^{*1}, \dots, \hat\theta_n^{*B}$ is given in
Figure 1.4. The vertical line indicates the original estimate $\hat\theta = 1.520$. The bootstrap
estimate for the variance of $\hat\theta$ is
$$v_B^* = \frac{1}{B-1}\sum_{b=1}^{B}\left(\hat\theta_n^{*b} - \bar\theta_n^{*}\right)^2 = 0.0465,$$
so the bootstrap standard error equals $\sqrt{0.0465} = 0.216$. The bootstrap estimate for
the bias of $\hat\theta_n$ is
$$b_B^* = \frac{1}{B}\sum_{b=1}^{B}\hat\theta_n^{*b} - \hat\theta_n = 0.034.$$
Summarized: estimated ratio = 1.520 (standard error = 0.216, bias = 0.034).
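The whole City data computation can be reproduced in a few lines (a sketch; with a
different seed the numbers vary slightly around the values reported above):

y <- c(138, 93, 61, 179, 48, 37, 29, 23, 30, 2)     # 1920 populations
z <- c(143, 104, 69, 260, 75, 63, 50, 48, 111, 50)  # 1930 populations
ratio <- function(idx) mean(z[idx]) / mean(y[idx])  # statistic on resampled indices
set.seed(1)
B <- 1000
theta.hat  <- ratio(1:10)                                     # 1.520
theta.star <- replicate(B, ratio(sample(10, replace = TRUE))) # resample the pairs
sd(theta.star)                # bootstrap standard error, about 0.22
mean(theta.star) - theta.hat  # bootstrap bias estimate, about 0.03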

[Figure 1.4: Histogram of the B = 1000 bootstrap recomputations $\hat\theta^*$; the vertical line indicates the original estimate $\hat\theta = 1.520$.]

Why the name 'bootstrap'?

The name comes from the English expression "pull yourself up by the bootstraps",
which means that you get yourself out of an almost impossible situation by using only
your own strength. This expression is sometimes said to originate from an 18th-century
tale of Baron Münchhausen, who found himself at the bottom of a lake and got
out by pulling himself up by his own bootstraps. Using the data itself to generate more
data seems like an analogous trick!

Sampling from the data and pretending it is new data often triggers the initial reaction
that this is a kind of fraud, because no new information (extra data) or additional
information (assumptions) is used. However, it turns out that it is not. Using the data only
for estimation just means that the information provided by the data has not yet been fully
exploited. Applying the bootstrap is a way to put the extra information in the data
to use. Theoretical research and lots of practice have validated the bootstrap as a very
useful tool that can tackle a wide range of statistical problems. This can be summarized
as follows: the bootstrap does not really give you something for nothing; it gives you
something you previously ignored.

1.3.2 Quality of the bootstrap approximation


When we compare the estimates (1.9)-(1.11) to the true variance, bias and sampling
distribution in (1.1)-(1.3), we can see that there are two different sources of error: the
bootstrap error and the simulation error.
The simulation error occurs when we cannot obtain the bootstrap distribution analytically,
but have to resort to an approximation based on computer simulation, i.e.
because we approximate $\mathrm{var}[\hat\theta_n^* \mid F_n]$ by $\widehat{\mathrm{var}}[\hat\theta_n^* \mid F_n]$, and similarly for the other
quantities. This error decreases when the number of bootstrap samples B in the simulation
increases. By choosing B sufficiently large, we can limit the simulation error as much
as we want. Of course, this costs us computation time.
The bootstrap error occurs because we approximate the sampling distribution by
the bootstrap distribution, i.e. because we approximate $\mathrm{var}[\hat\theta_n \mid F]$ by $\mathrm{var}[\hat\theta_n^* \mid F_n]$, and
similarly for the other quantities. This error is independent of the simulation error
and hence does not decrease with B. It does decrease with the sample size n, but
in practice n is fixed, of course. The bootstrap error occurs because the bootstrap
principle does not provide the exact sampling distribution; it only gives an estimate of
the true sampling distribution.
Hence, if we want (1.9)-(1.11) to be good estimates of (1.1)-(1.3), then (a) we should
take B large enough, and (b) we should hope that the bootstrap error is small. The
size of the bootstrap error is thus crucial. We can give two remarks about it.

1. It has been shown that the bootstrap error is, perhaps surprisingly, small for
many applications.

2. We have to attempt to make it as small as possible by choosing the appropriate
quantity to bootstrap, especially when it comes to the whole sampling distribution
or quantile estimates. For example, the approximation
$$P[\hat\theta_n \le t \mid F] \cong P[\hat\theta_n^* \le t \mid F_n]$$
will in general not be very good. That is why we usually work with
$$P[(\hat\theta_n - \theta) \le t \mid F] \cong P[(\hat\theta_n^* - \hat\theta_n) \le t \mid F_n],$$
which already gives a better approximation. However, often this approximation
can still be improved, e.g. by considering
$$P[(\hat\theta_n - \theta)/V^{1/2} \le t \mid F] \cong P[(\hat\theta_n^* - \hat\theta_n)/V_*^{1/2} \le t \mid F_n],$$
where V and $V_*$ are some estimates of the variance of $\hat\theta_n$ and $\hat\theta_n^*$.

The simulation error can always be made small, but how many bootstrap sam-
ples should then be drawn? The answer depends on the desired accuracy of the ob-
tained approximation, but there are some guidelines. Traditional guidelines state that
B = 200 is required to obtain accurate variance estimates, while at least B = 1000
is generally necessary for confidence intervals. However, since computer power has
increased enormously, higher numbers are often used nowadays.
Note that the empirical distribution $F_n$ is discrete, even if F is continuous. There
are only a finite number of possible samples from $F_n$ and therefore the distribution
of $\hat\theta_n^{*1}, \dots, \hat\theta_n^{*B}$ will have some discrete nature. For a continuous distribution the true
sampling distribution will usually not have this characteristic. To explore whether
this could be a problem, let us see how many different bootstrap samples there are.
This number is given by the number of ways you can put n balls into n boxes (cf. the
number of balls in box j is the number of times observation j appears in the bootstrap
sample). Basic combinatorics tells us that this number is given by the binomial coefficient
$$\binom{2n-1}{n-1}.$$
The following table gives the number of different bootstrap resamples for a few sample sizes:

n:            5      10           15            20            50
resamples:  126   92378   7.75 × 10^7   6.89 × 10^10   5.04 × 10^28

These large numbers show that the discreteness of the bootstrap distribution should
not be a problem when the sample size is not really small, at least if the statistic is a
sufficiently smooth function of the observations. Most statistics are indeed sufficiently
smooth, but some are not. The sample median is the prime example. If you try to
bootstrap the median for a small sample, you will see what happens! Note that the
combinatorial number gives you the total number of different bootstrap samples, but
not all these bootstrap samples are equally likely to occur. The most likely bootstrap
sample to be drawn is in fact the original sample itself.
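The entries of the table follow directly from the binomial coefficient; in R:

n <- c(5, 10, 15, 20, 50)
choose(2 * n - 1, n - 1)  # number of distinct bootstrap samples for each n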

1.4 Jackknife and Bootstrap


Both methods, the jackknife method and the bootstrap method, estimate the vari-
ability of a statistic from the variability of that statistic between subsamples, rather
than from parametric assumptions. The jackknife is a less general technique than the
bootstrap and explores the sample variation differently. However, the jackknife is easier
to apply to complex sampling schemes than the bootstrap.

Let us have a look at the different formulas for estimating the bias and the variance
of a statistic Tn by both methods.

For estimating the bias of a statistic $T_n$, the methods give us
$$\widehat{\mathrm{Bias}}_{jack}(T_n) = (n-1)\left[\frac{1}{n}\sum_{i=1}^{n} T_{n-1}^{(i)} - T_n\right]$$
for the jackknife method and
$$\mathrm{Bias}_*(T_n) \approx \widehat{\mathrm{Bias}}_{boot}(T_n) = \frac{1}{B}\sum_{b=1}^{B} T_n^{*(b)} - T_n$$
for the bootstrap method.

For estimating the variance of a statistic $T_n$, the jackknife method gives
$$\widehat{\mathrm{Var}}_{jack}(T_n) = \frac{1}{n}\left\{\frac{1}{n-1}\sum_{i=1}^{n}\left(T_{n,i} - \hat{E}_{jack}(T_n)\right)^2\right\}
= \frac{1}{n(n-1)}\sum_{i=1}^{n}\left(nT_n - (n-1)T_{n-1}^{(i)} - \left[nT_n - \frac{n-1}{n}\sum_{j=1}^{n} T_{n-1}^{(j)}\right]\right)^2$$
$$= (n-1)\left[\frac{1}{n}\sum_{i=1}^{n}\left(T_{n-1}^{(i)} - \frac{1}{n}\sum_{j=1}^{n} T_{n-1}^{(j)}\right)^2\right],$$
and the bootstrap estimate for the variance leads to (with the Monte Carlo approximation)
$$\mathrm{Var}_*(T_n) \approx \widehat{\mathrm{Var}}_{boot}(T_n) = \frac{1}{B}\sum_{b=1}^{B}\left(T_n^{*(b)} - \frac{1}{B}\sum_{\ell=1}^{B} T_n^{*(\ell)}\right)^2.$$

When comparing the jackknife formulas to the bootstrap expressions, we note that
both jackknife formulas have an inflation factor (n - 1). This "inflation factor" is
needed because the jackknife values $T_{n-1}^{(i)}$ are much closer to each other than the bootstrap
values $T_n^{*(b)}$, which would result in underestimation of the bias and variance
without correction by the inflation factor. The jackknife 'resamples' thus resemble
$\mathbf{X} = \{X_1, \dots, X_n\}$ much more closely than $\mathbf{X}^* = \{X_1^*, \dots, X_n^*\}$ does.

Bootstrap or jackknife?

In terms of the practical performance of the jackknife and the bootstrap we remark
the following.
Since the jackknife uses only n specific resamples from $\mathbf{X} = \{X_1, \dots, X_n\}$, whereas
the bootstrap makes use of much more information on the data (B resamples), we expect
better performance of the bootstrap. On the other hand, the jackknife estimates
of bias and variance are easier to compute (at least for a not too large n).
In fact, it has been shown that the jackknife can be viewed as an approximation of the
bootstrap. If $\hat\theta$ is a linear statistic (i.e. if we can write $\hat\theta = \mu + \sum_{i=1}^{n} \alpha(x_i)$ for some
function $\alpha$), then the jackknife is equivalent to the bootstrap. If $\hat\theta$ is nonlinear, as is
usually the case, then Efron showed that the jackknife variance estimate is a bootstrap
estimate corresponding to a linear approximation to $\hat\theta$. The jackknife is thus an approximation
to the bootstrap and therefore inferior to it, especially for highly nonlinear statistics.

A major advantage of the bootstrap is its versatility. For example, it is able to estimate
quantiles of the sampling distribution, while the jackknife is limited to variance and
bias estimates.

1.5 Bootstrap confidence intervals


Obtaining the bootstrap estimates for the variance ($v_B^*$) or bias ($b_B^*$) of an estimator
is pretty straightforward. In principle, confidence intervals can also be easily constructed.
But there are a number of different methods for constructing confidence
intervals and the best ones are more complicated. We now give an overview of the
most common methods to construct confidence intervals based on the bootstrap distribution
of an estimator.

Method 1: The simplest bootstrap intervals

If we are willing to assume that the sampling distribution of the estimator $\hat\theta_n$ is
approximately $N(\theta, \sigma^2)$, then a 95% confidence interval for $\theta$ is given by
$$(\hat\theta_n - 1.96\,\sigma,\; \hat\theta_n + 1.96\,\sigma),$$
as usual. The scale $\sigma$ in this interval is unknown, so we can estimate it from the bootstrap
distribution. The bootstrap confidence interval is thus obtained by simply replacing
$\sigma$ by the bootstrap standard error estimate:
$$(\hat\theta_n - 1.96\sqrt{v_B^*},\; \hat\theta_n + 1.96\sqrt{v_B^*}). \tag{1.12}$$
We can go one (small) step further and also make use of the bootstrap distribution
to estimate the bias. That means that we relax our assumption by assuming that the
sampling distribution of $\hat\theta_n$ is approximately $N(\theta + \beta, \sigma^2)$. Hence, $\beta$ represents the bias
of the estimator $\hat\theta_n$. This leads to the bias-corrected normal interval:
$$(\hat\theta_n - b_B^* - 1.96\sqrt{v_B^*},\; \hat\theta_n - b_B^* + 1.96\sqrt{v_B^*}). \tag{1.13}$$

When is the assumption of "approximately normal" justified? Of course, exact normality
practically holds only when F itself is normal and $\hat\theta_n$ is the sample mean.
Approximate normality can be claimed in many more situations, in the sense that many
statistics or estimators are asymptotically normal. For example, the sample mean is
asymptotically normal even when F is not normal. In many other cases, especially
more complex situations, asymptotic normality does not hold or is just difficult to establish.
Moreover, even if $\hat\theta_n$ is asymptotically normal, the normal distribution may not yield
a good approximation for the sampling distribution at small n. Thus, the simple interval
(1.12) or (1.13) is a good option only when asymptotic normality can be presumed
and n is large.

Method 2: Basic bootstrap intervals

Having to assume a normal sampling distribution is not really satisfactory, because
the bootstrap is then not really better than the analytical approximation. Therefore,
we now discuss several methods which do not assume normality, but directly use the
bootstrap estimate of the sampling distribution. First note the following.
In general, denote the distribution of $\hat\theta_n - \theta$ by W, and define its quantiles $w_\alpha$ by
$$P[W \le w_\alpha] = \alpha.$$
Then, a 95% confidence interval is derived as follows:
$$0.95 = P[w_{.025} \le W \le w_{.975}] = P[w_{.025} \le \hat\theta_n - \theta \le w_{.975}]$$
$$= P[w_{.025} - \hat\theta_n \le -\theta \le w_{.975} - \hat\theta_n] = P[\hat\theta_n - w_{.975} \le \theta \le \hat\theta_n - w_{.025}].$$
Hence, given an estimate $\hat\theta_n$, a 95% confidence interval is given by
$$(\hat\theta_n - w_{.975},\; \hat\theta_n - w_{.025}).$$
With $\hat\theta_n - \theta \sim W$, applying the general bootstrap principle means approximating
W by $W^*$, where $\hat\theta_n^* - \hat\theta_n \sim W^*$ is the centered bootstrap distribution. Thus, the
estimated interval becomes
$$(\hat\theta_n - w^*_{.975},\; \hat\theta_n - w^*_{.025}).$$
In general, we need to estimate the quantiles $w^*_{.975}$ and $w^*_{.025}$ through simulation, i.e.
based on the recomputed estimates $\hat\theta_n^{*1}, \dots, \hat\theta_n^{*B}$.
Denote by $\hat\theta^*_{(1)} \le \dots \le \hat\theta^*_{(B)}$ the order statistics corresponding to the bootstrap
estimates $\hat\theta_n^{*1}, \dots, \hat\theta_n^{*B}$. The quantile $w^*_\alpha$ is then usually estimated by
$$\hat{w}^*_\alpha = \hat\theta^*_{(\alpha(B+1))} - \hat\theta_n.$$
The corresponding 95% basic bootstrap interval is then defined as
$$(\hat\theta_n - \hat{w}^*_{.975},\; \hat\theta_n - \hat{w}^*_{.025}),$$
or equivalently
$$(\hat\theta_n - (\hat\theta^*_{(.975(B+1))} - \hat\theta_n),\; \hat\theta_n - (\hat\theta^*_{(.025(B+1))} - \hat\theta_n)),$$
which yields
$$(2\hat\theta_n - \hat\theta^*_{(.975(B+1))},\; 2\hat\theta_n - \hat\theta^*_{(.025(B+1))}). \tag{1.14}$$

Remark 1: Why do we use $\hat\theta^*_{(\alpha(B+1))}$ to estimate the quantile $w^*_\alpha$ instead of $\hat\theta^*_{(\alpha B)}$?
Suppose $Y_1, \dots, Y_N$ are i.i.d. with CDF K; then it can be shown for the order statistics that
$$E[Y_{(j)}] \approx K^{-1}(j/(N+1)).$$
Hence, the j-th order statistic is a good estimate of the j/(N+1)-quantile rather than
of the j/N-quantile.
Remark 2: Because of the use of the $\alpha(B+1)$-th order statistic, it is often recommended
to choose B such that, for example, .025(B+1) and .975(B+1) are integers. Therefore,
common choices are B = 999 or B = 4999. When $\alpha(B+1)$ is not an integer, interpolation
should be used in principle. However, many people use rounded numbers of B
(e.g. 1000, 5000 or 10000) and $\alpha B$ anyway, because for large values of B it does not
really matter.
For example, when choosing B = 999, the 95% basic bootstrap interval becomes
$$(2\hat\theta_n - \hat\theta^*_{(975)},\; 2\hat\theta_n - \hat\theta^*_{(25)}).$$
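Continuing the City data code from Section 1.3.1, the basic bootstrap interval (1.14) is read
off from the ordered bootstrap replicates (a sketch with B = 999, so that the required order
statistics are well defined):

B <- 999
theta.star <- sort(replicate(B, ratio(sample(10, replace = TRUE))))
c(2 * theta.hat - theta.star[round(0.975 * (B + 1))],  # lower limit
  2 * theta.hat - theta.star[round(0.025 * (B + 1))])  # upper limit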

Method 3: Bootstrap-t intervals

The basic bootstrap interval is quite straightforward, but the disadvantage is that
the confidence interval generally is quite inaccurate, because the approximation of W
by $W^*$ is often not good enough. Let us look at this in more detail. A central $1 - 2\alpha$
confidence interval $(\hat\theta_L, \hat\theta_U)$ is supposed to have probability $\alpha$ of not covering the true
$\theta$ from above or from below, i.e.
$$P[\theta < \hat\theta_L] = \alpha = P[\theta > \hat\theta_U]. \tag{1.15}$$
The nominal level $\alpha$ is often not reached exactly, but the difference becomes smaller
when the sample size increases. The accuracy of an approximate confidence interval
can be measured by how fast this difference decreases with increasing n. The basic
bootstrap interval is first-order accurate. This means that its error w.r.t. (1.15) goes to
zero at rate $1/\sqrt{n}$:
$$P[\theta < \hat\theta_L] = \alpha + O(n^{-1/2}),$$
and similarly for the upper limit. This is usually the same accuracy as can be obtained
by asymptotic approximations. However, when using the bootstrap we can do better.
It is possible to construct bootstrap confidence intervals that are second-order accurate,
which means that the error w.r.t. (1.15) is of an order of magnitude smaller: it goes to
zero at the much faster rate 1/n:
$$P[\theta < \hat\theta_L] = \alpha + O(n^{-1}).$$
We now present such a method.


The basic bootstrap interval approximates the distribution of θbn − θ by that of θbn∗ −
θbn . The bootstrap-t interval studentizes these statistics to improve the accuracy of the
approximation. The studentized version of θbn − θ is defined as

θbn − θ
z= (1.16)
vb1/2
n

where vbn is an estimate of Var[θbn | F]. We then need the quantiles z0.025 and z0.975 deter-
mined by
θbn − θ
P[z.025 ≤ 1/2 ≤ z.975 ] = 0.95,
vbn
so that a 95% confidence interval for θ is given by

(θbn − vb1/2 b1/2


n z.975 , θn − v
b n z.025 ).

Using the bootstrap principle on the studentized sampling distribution, we can ap-
proximate the distribution of z by that of

θbn∗ − θbn
z∗ = (1.17)
vb∗n 1/2

Hence, we need recomputed statistics z∗1 , . . . , z∗ B . Let z∗(1) , . . . , z∗( B) be the correspond-
ing order statistics, then the corresponding 95% bootstrap-t confidence interval is de-
fined as
∗ ∗
(θbn − vb1/2
n z , θbn − vb1/2
(.975( B+1)) n z (.025( B+1))).

22
(Disadvantage 1) What variance estimate $\hat{v}_n$ could be used here?

Bootstrap-t intervals can be more accurate, but a major disadvantage is the need
for the variance estimates $\hat{v}_n$ and $\hat{v}_n^*$ (in every bootstrap sample). In some special
cases, like the sample mean, we may have an easy formula for the variance of the
bootstrapped estimate. (One could argue that, if this is the case, then we do not need
the bootstrap in the first place; this is not entirely true, however, since a good variance
estimate does not necessarily mean we have a good estimate of the whole distribution.)
In most cases we do not have an easy way to obtain variance estimates. Still, two
relatively general methods for obtaining the variance estimates can be used:

1. Nested bootstrap: at each bootstrap sample $\mathbf{X}_n^*$ with corresponding $\hat\theta_n^*$, we can
generate B' second-level bootstrap samples $\mathbf{X}_n^{**1}, \dots, \mathbf{X}_n^{**B'}$, randomly drawn
with replacement from $\mathbf{X}_n^*$, to obtain the variance estimate $\hat{v}_n^*$ (a code sketch
of this scheme follows after this list). This is generally applicable in principle, but
can become impractical because of the high computational demands: for the second-level
bootstrap variance estimates we need at least, say, B' = 200. Considering that we need
precise quantile estimates, the first-level bootstrap requires at least B = 999, say.
Together this makes B(B' + 1) > 200,000 re-computations of $\hat\theta_n$, which is extremely
high if $\hat\theta_n$ is computed by a costly algorithm.

2. Nonparametric delta method: see e.g. Davison and Hinkley (1997, p. 46). If there
is a way to compute or estimate the empirical influence values $l_i$ of each observation
$i = 1, \dots, n$, then a common variance estimate is given by $\hat{v}_n = n^{-2}\sum_{i=1}^{n} l_i^2$,
with an analogous bootstrap version. Sometimes the $l_i$ can be estimated from
an available formula for the theoretical influence function (see the chapter on
M-estimators for details on influence functions), while in other cases jackknife-type
techniques can be used to obtain $l_i$, which then again becomes more time-consuming,
however.
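A sketch of the nested-bootstrap scheme from item 1, for an i.i.d. sample (the function name
boot.t.ci and its arguments are our own choices; note the B(B' + 1) re-computations
discussed above):

boot.t.ci <- function(x, stat, B = 999, B2 = 200, conf = 0.95) {
  theta.hat <- stat(x)
  v.hat <- var(replicate(B2, stat(sample(x, replace = TRUE))))  # variance for original sample
  z.star <- replicate(B, {
    xs <- sample(x, replace = TRUE)                                 # first-level resample
    v.star <- var(replicate(B2, stat(sample(xs, replace = TRUE))))  # second level
    (stat(xs) - theta.hat) / sqrt(v.star)                           # studentized value (1.17)
  })
  q <- unname(quantile(z.star, c((1 + conf) / 2, (1 - conf) / 2)))
  theta.hat - sqrt(v.hat) * q  # returns c(lower, upper)
}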

(Disadvantage 2) Variance stabilizing transformation

Another problem with bootstrap-t intervals is their instability in smaller samples. It
turns out that the problem is associated with the variance $\mathrm{var}[\hat\theta_n \mid F]$ being dependent
on $\theta(F)$. Indeed, it has been shown that bootstrap-t intervals work best if the
numerator and denominator in (1.17) are independent. (Cf. Student-t intervals for
samples from $N(\mu, \sigma^2)$: the denominator in the t-statistic is independent of $\mu$.)
However, often $\mathrm{var}[\hat\theta_n \mid F]$ depends on $\theta(F)$, and thus $\hat{v}_n^*$ will be related to $\hat\theta_n^*$. In
most cases, we can find a transformation
$$\phi := h(\theta), \quad \hat\phi_n := h(\hat\theta_n) \quad \text{and} \quad \hat\phi_n^* := h(\hat\theta_n^*) \tag{1.18}$$
such that after transformation $\mathrm{var}[\hat\phi_n \mid F]$ is approximately independent of $\phi(F)$.
How can we find the appropriate parameter transformation h(.)? Well, suppose
we know the form of the dependency of $\mathrm{var}[\hat\theta_n \mid F]$ on $\theta$, i.e. suppose we can write
$\mathrm{var}[\hat\theta_n \mid F] = v(\theta)$ for some function v(.). Then for any well-behaved function h(.), a
Taylor expansion yields
$$h(\hat\theta_n) \approx h(\theta) + h'(\theta)(\hat\theta_n - \theta),$$
which leads to
$$\mathrm{var}[\hat\phi_n \mid F] \approx (h'(\theta))^2 v(\theta),$$
which implies that the variance of $\hat\phi_n$ will be approximately constant (equal to 1) if
$$h(t) = \int^t \frac{du}{v(u)^{1/2}}. \tag{1.19}$$
To apply this result, we need to know the function v(.). However, we can estimate
this function from the bootstrap! Indeed, once we have applied the bootstrap on the
original scale, we can take a look at the relation between the values of $\hat\theta_n^*$ and their
respective variance estimates $\hat{v}_n^*$. If the variance estimates are fairly constant over the
range of $\hat\theta_n^*$-values, then no transformation is required. If not, we use these bootstrap
values to estimate a function v. We can then apply (1.19), do the bootstrap again on the
transformed scale, and eventually back-transform the resulting interval to the original scale.

We now illustrate this procedure with an example. The left panel of Figure 1.5
shows $\hat{v}_n^*$ versus $\hat\theta_n^*$ for B = 999 bootstrap samples in the City data example. Here,
the values of $\hat{v}_n^*$ were obtained by double bootstrap with B' = 200. Clearly, the variance
of $\hat\theta_n^*$ seems to be an increasing function of $\hat\theta_n^*$. There are various ways now to obtain
an estimate of the function v(.). Nonparametric curve fitting is an option, or you can try
to fit a parametric function. Here, $v(\theta) = (\theta - 1.1)^2/5$ turned out to give a nice fit to
the data, as indicated by the line on the plot. We then applied the bootstrap again, but
on the transformed scale based on (1.19), where h(.) was computed through numerical
integration. The right panel of Figure 1.5 shows the relation between the resulting
$\hat\phi_n^*$-values and their variance estimates. It is clear that the trend has indeed disappeared.
The confidence interval after the variance-stabilizing transformation can then be
back-transformed to the original scale by numerical methods.
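For the fitted $v(\theta) = (\theta - 1.1)^2/5$, (1.19) can be evaluated by numerical integration
as in the text. A sketch, assuming all bootstrap values $\hat\theta_n^*$ exceed 1.1 (as in this
example; the lower integration limit is an arbitrary constant that merely shifts h):

h <- function(t)
  integrate(function(u) 1 / sqrt((u - 1.1)^2 / 5), lower = 1.2, upper = t)$value
phi.star <- sapply(theta.star, h)  # bootstrap replicates on the transformed scale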
Let us compare the bootstrap-based 95% confidence intervals for $\theta$ that we can
construct so far for the City data. Without any transformation we obtain:

1. simple bootstrap: $(\hat\theta_L, \hat\theta_U) = (1.098, 1.943)$

2. basic bootstrap: $(\hat\theta_L, \hat\theta_U) = (0.912, 1.771)$

[Figure 1.5 (City data). Left: $\hat{v}_n^*$ versus $\hat\theta_n^*$ for 999 bootstrap samples, with a line indicating the approximate fit $\hat{v}_n^* \approx v(\hat\theta_n^*) = (\hat\theta_n^* - 1.1)^2/5$. Right: $\hat{v}_n^*$ versus $\hat\phi_n^*$ for 999 bootstrap samples using the transformation $\phi = h(\theta)$.]

3. bootstrap-t: $(\hat\theta_L, \hat\theta_U) = (1.249, 2.168)$

With variance stabilization we obtain:

3'. bootstrap-t: $(\hat\theta_L, \hat\theta_U) = (h^{-1}(\hat\phi_L), h^{-1}(\hat\phi_U)) = (1.235, 2.188)$

2'. basic bootstrap: $(\hat\theta_L, \hat\theta_U) = (h^{-1}(\hat\phi_L), h^{-1}(\hat\phi_U)) = (1.272, 2.138)$

Note that in this example the variance stabilization has a large effect on the basic
bootstrap confidence interval, but less on the bootstrap-t confidence interval. It is
sometimes argued that, once you have stabilized the variance through transformation,
it is sufficient to use the basic bootstrap and there is actually no need for a bootstrap-t
interval (since the denominator is supposed to be approximately constant). It is true
that after transformation both intervals will be close, but it is wise to use bootstrap-t
even after transformation (as pointed out by Davison and Hinkley, 1997).

(Disadvantage 3) Not transformation-respecting

The fact that transforming the parameter can improve the intervals implies that
the intervals are not transformation-respecting: you may get different conclusions
depending on the scale you are working on. This may be seen as a disadvantage and
holds for both the bootstrap-t and the basic bootstrap interval. A side effect is that
these intervals may include impossible values. For example, in the City data, it is
quite realistic to assume that all cities have positive growth (population decline does
not occur). Thus, $\theta < 1$ is not reasonable, but the untransformed basic bootstrap
interval still includes such values. This can happen with bootstrap-t intervals as well.
The next methods, percentile and BCa intervals, do not suffer from this.

Method 4: Percentile intervals

As seen in the previous methods, confidence intervals may be improved by working
on a transformed scale, but finding a good transformation is not easy. The next
methods are based on the assumption that such an appropriate transformation exists,
but we do not need to know its form.
Suppose that a monotone increasing transformation h(.) as in (1.18) exists. Moreover,
assume that the sampling distribution of the transformed estimator is approximately normal:
$$\hat\phi_n^* - \hat\phi_n \overset{(a)}{\sim} \hat\phi_n - \phi \overset{(b)}{\sim} N(0, \sigma^2) \tag{1.20}$$
(hence, the distribution of $\hat\phi_n$ is pivotal (a) and normal (b)). From (b) it follows that a
95% CI for $\phi$ is given by $(\hat\phi_L, \hat\phi_U)$ with
$$\{\hat\phi_L, \hat\phi_U\} = \hat\phi_n \pm 1.96\,\sigma. \tag{1.21}$$
On the other hand, from (a) it follows that
$$\hat\phi_n^* \sim N(\hat\phi_n, \sigma^2). \tag{1.22}$$
Now, from (1.21) and (1.22) we have that $\hat\phi_L$ is just the 0.025-quantile of the distribution
of $\hat\phi_n^*$, and $\hat\phi_U$ is the 0.975-quantile of this distribution. Hence, given B bootstrap
estimates $\hat\phi_n^{*1}, \dots, \hat\phi_n^{*B}$, a 95% CI for $\phi$ is given by
$$(\hat\phi^*_{(.025(B+1))},\; \hat\phi^*_{(.975(B+1))}).$$
Since h(.) is assumed to be monotone increasing, the transformation is order-preserving.
Back-transforming this interval to the $\theta$-scale thus yields
$$(\hat\theta^*_{(.025(B+1))},\; \hat\theta^*_{(.975(B+1))}), \tag{1.23}$$
which is the bootstrap percentile interval. Hence, constructing this interval is very
simple! We do not even need to know the transformation $\phi = h(\theta)$ when applying this
method. However, we do assume that h exists and leads to a pivot with a symmetric
distribution. This is a somewhat strong assumption which is not always satisfied and
implies that the method does not always work very well. Moreover, percentile intervals
are only first-order accurate, similarly to basic bootstrap intervals. However, the
method can be improved to second-order accuracy by some adjustments, which will
be described next.
Method 5: BCa percentile intervals

The assumption described in (1.20) is strong, since it implies that we can find a
transformation such that (1) the estimate becomes unbiased and symmetrically distributed,
and (2) its distribution does not change when moving from F to $F_n$. These
concerns were partially addressed when Efron proposed a bias-corrected version of
the percentile interval, which was based on the assumption that h(.) exists such that
$$\hat\phi_n^* - \hat\phi_n \sim \hat\phi_n - \phi \sim N(-w\sigma, \sigma^2)$$
for some w. This leads to so-called BC intervals, but it was shown that this still is a
strong assumption and that it hardly yields an improvement on the (basic) percentile method.
Efron's next adjustment turned out to be more successful. He now assumed that h(.)
exists such that
$$\hat\phi_n \overset{(a)}{\sim} N(\phi - w\sigma(\phi), \sigma^2(\phi)) \quad \text{and} \quad \hat\phi_n^* \overset{(b)}{\sim} N(\hat\phi_n - w\sigma(\hat\phi_n), \sigma^2(\hat\phi_n)) \tag{1.24}$$
with $\sigma(\phi) = 1 + a\phi$. Here, w is called the bias-correction and a is called the acceleration
factor. Allowing the variance of the sampling distribution to be non-constant and to
depend on $\phi$ is a much weaker assumption that is much more realistic.
Suppose w and a are known. Then (a) in assumption (1.24) implies that
$$\hat\phi_n \sim \phi + (1 + a\phi)(Z - w), \quad \text{with } Z \sim N(0, 1).$$
By multiplying both sides by a, adding the constant 1 and taking the logarithm, we obtain
$$\log(1 + a\hat\phi_n) \sim \log(1 + a\phi) + \log(1 + a(Z - w)).$$
From this, we obtain the 95% confidence limits for $\phi$ as
$$\{\hat\phi_L, \hat\phi_U\} = \hat\phi_n + \sigma(\hat\phi_n)\,\frac{w \pm 1.96}{1 - a(w \pm 1.96)}.$$
On the other hand, from (b) in (1.24) we have that the $\alpha$-quantile of the distribution of
$\hat\phi_n^*$ is
$$\hat\phi^*_\alpha = \hat\phi_n - w\sigma(\hat\phi_n) + \sigma(\hat\phi_n) z_\alpha, \tag{1.25}$$
with $z_\alpha$ the $\alpha$-quantile of $Z \sim N(0, 1)$. Can we find values of $\alpha$ such that the quantile
(1.25) matches $\hat\phi_L$ and $\hat\phi_U$, respectively? Indeed, from
$$\hat\phi_n - w\sigma(\hat\phi_n) + \sigma(\hat\phi_n) z_\alpha = \hat\phi_n + \sigma(\hat\phi_n)\,\frac{w \pm 1.96}{1 - a(w \pm 1.96)}$$
it follows that the levels $\alpha$ should be defined by
$$z_\alpha = w + \frac{w \pm 1.96}{1 - a(w \pm 1.96)}.$$
This means that by choosing $\alpha$ as
$$\{\hat\alpha_L, \hat\alpha_U\} = \Phi\left(w + \frac{w \pm 1.96}{1 - a(w \pm 1.96)}\right)$$
we get that a 95% CI for $\phi$ based on B recalculated estimates is given by
$$(\hat\phi^*_{(\hat\alpha_L(B+1))},\; \hat\phi^*_{(\hat\alpha_U(B+1))}).$$
Back-transforming then gives the 95% bias-corrected and accelerated (BCa) interval for $\theta$:
$$(\hat\theta^*_{(\hat\alpha_L(B+1))},\; \hat\theta^*_{(\hat\alpha_U(B+1))}).$$
So, the BCa limits are just percentiles of $\hat{\theta}_n^{*1}, \ldots, \hat{\theta}_n^{*B}$, but with adjusted levels α. Again,
we do not need to know the transformation function h(.) to construct the intervals. It
can be shown that BCa intervals are generally second-order accurate. However, we
need to know w and a, so how can we obtain them?

The bias-correction w: Note from (1.24) that
$$P[\hat{\theta}_n^* \le \hat{\theta}_n] = P[\hat{\varphi}_n^* \le \hat{\varphi}_n] = \Phi(w)$$
such that an easy estimate for w is given by
$$\hat{w} = \Phi^{-1}\left(\frac{\#\{\hat{\theta}_n^{*b} \le \hat{\theta}_n\}}{B+1}\right)$$

The acceleration a: This parameter is more difficult to estimate. It can be shown that
a good estimate is given by
$$\hat{a} = \frac{\sum_{i=1}^n \hat{l}_i^3}{6\left\{\sum_{i=1}^n \hat{l}_i^2\right\}^{3/2}}$$
where $\hat{l}_i$ is some estimate of the influence value $l_i$ of observation i. The most common
way of estimating $l_i$ is by the jackknife:
$$\hat{l}_i = (n-1)\left(\tilde{\theta}_{n-1} - \hat{\theta}_{n-1}^{(i)}\right)$$
where $\hat{\theta}_{n-1}^{(i)}$ is the estimate of θ based on the original sample without observation i,
known as the ith jackknife estimate, and $\tilde{\theta}_{n-1}$ is the average of the n jackknife esti-
mates.
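For illustration, both constants are straightforward to compute in R. The following is
a minimal sketch, not part of any package: it assumes a sample x, a generic estimator
est() (e.g. the mean) and a vector boot.est of B bootstrap estimates; all names are ours.

# Minimal sketch: estimating the BCa constants w and a.
bca.constants <- function(x, est, boot.est) {
  n <- length(x)
  B <- length(boot.est)
  theta.hat <- est(x)
  # bias-correction: proportion of bootstrap estimates at or below theta.hat
  w <- qnorm(sum(boot.est <= theta.hat) / (B + 1))
  # jackknife estimates: recompute est() leaving out one observation at a time
  theta.jack <- sapply(1:n, function(i) est(x[-i]))
  l <- (n - 1) * (mean(theta.jack) - theta.jack)
  # acceleration factor
  a <- sum(l^3) / (6 * sum(l^2)^(3/2))
  list(w = w, a = a)
}

The adjusted levels $\hat{\alpha}_L$ and $\hat{\alpha}_U$ then follow by plugging $\hat{w}$ and $\hat{a}$ into the formula above.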
To complete the example, we also give the percentile based 95% confidence inter-
vals for θ in the City data:

4. percentile: $(\hat{\theta}_L, \hat{\theta}_U) = (1.270, 2.129)$

5. BCa: $(\hat{\theta}_L, \hat{\theta}_U) = (1.225, 2.119)$

($\hat{a}$ in BCa estimated by the jackknife; all intervals based on R = 999). Both intervals
are similar to the basic bootstrap and bootstrap-t confidence intervals after the variance-
stabilizing transformation. The advantage is that now we do not have to know the
transformation explicitly, but only assume implicitly that an appropriate transforma-
tion exists.
In the table below we summarize the properties of the 5 methods to construct con-
fidence intervals based on the bootstrap distribution of an estimator.

  Method        Order   TR
  Normal        1st     -     inaccurate, but simple
  Basic         1st     -     inaccurate, but simple (and pure bootstrap principle)
  Bootstrap-t   2nd     -     accurate, but need $\hat{\sigma}$ and variance-stabilization
  Percentile    1st     +     inaccurate, but simple (and intuitive)
  BCa           2nd     +     accurate, but need $\hat{a}$
  (TR = transformation-respecting)

Remark: the orders of accuracy above are meant for estimators which are approxi-
mately (i.e. asymptotically) normal. Otherwise, even theoretical consistency itself is
hard to show.

1.6 Regression models


Until now, we assumed that the data at hand were in their simplest form: one sample
of n independent and identically distributed observations. If the data structure is more
complex than that, we can still apply the bootstrap. That is, we can still use the data
to construct new (re-)samples on which to re-compute the estimates, so that we can
estimate the sampling distribution. The difference with the simple situation lies in how
exactly one should construct these bootstrap samples. Since we can no longer always
state that
$$F \Longrightarrow X_n$$
for some distribution F, we cannot just apply the principle of "replacing F with $F_n$".
Indeed, we now have to think of the data as being the result of an unknown probability
model P:
$$P \Longrightarrow X_n$$
which is more general, and we need to estimate the probability model P in order to
construct new samples. Once we have properly constructed such bootstrap samples,

however, the rest remains the same (standard errors, the various confidence intervals,
. . . ). We explain this in more detail for the case of regression models.
Consider a data set of the form $X_n = \{(x_1, y_1), \ldots, (x_n, y_n)\}$, where $x_i$ is a p-variate
vector, and suppose we want to investigate the dependence of the y-variable on the
x-variable by assuming a linear regression model:
$$E[y_i \,|\, x_i] = x_i^t \beta$$
The aim of the regression analysis is then to estimate the unknown p-variate β from
the data $X_n$, and often also to find out which coefficients are different from zero. The
classical estimate $\hat{\beta}$ is the one obtained through least-squares minimization. The prob-
ability structure of the linear model is usually expressed as
$$y_i = x_i^t \beta + \epsilon_i \tag{1.26}$$
where the error terms $\epsilon_i$ are assumed to be an i.i.d. sample from an unknown error
distribution F having expectation 0.

Approach 1: resampling errors

So, we can see the data-generating process as
$$P = (\beta, F) \Longrightarrow X_n = \{(x_1, y_1), \ldots, (x_n, y_n)\}$$
Hence, P is the probability distribution of the responses, which depends on the error
distribution F and the parameter β. The distribution P also depends on the covariates
$x_i$, which can be either fixed or random. In this approach we consider the $x_i$ to be fixed.
If we want to replace P by an estimate $\hat{P}$, then we need (1) an estimate for β, and
(2) an estimate for F. The estimate for β we have immediately when estimating β for
the original data set. For F, we know that it is the underlying distribution of the errors
$\epsilon_i$, so a natural estimate would be the empirical distribution $F_n$ of the $\epsilon_i$. However, the
$\epsilon_i$ are not available since β is unknown, but we can use $\hat{\beta}$ to compute the residuals
$$\hat{\epsilon}_i = y_i - x_i^t \hat{\beta}$$
and estimate F by the empirical distribution of the $\hat{\epsilon}_i$. If we consider the $x_i$ as fixed,
then we construct bootstrap samples by sampling according to $\hat{P} = (\hat{\beta}, F_n)$ as follows.
We first sample the bootstrap error terms $\{\epsilon_1^*, \ldots, \epsilon_n^*\}$ by drawing with replacement
from $\{\hat{\epsilon}_1, \ldots, \hat{\epsilon}_n\}$. Then, the bootstrap responses $y_i^*$ are generated according to the
model (1.26):
$$y_i^* = x_i^t \hat{\beta} + \epsilon_i^*$$

The bootstrap data-generating process can then be seen as
$$\hat{P} = (\hat{\beta}, F_n) \Longrightarrow X_n^* = \{(x_1, y_1^*), \ldots, (x_n, y_n^*)\}$$
Note that since we are treating the $x_i$ as fixed, the resampled data are taken to have the
same design as the original data.
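A minimal sketch of one error-resampling step in base R is given below; it assumes
a design matrix X (n x p, including an intercept column) and a response vector y, and
the function name boot.errors is ours.

# Minimal sketch of error (residual) resampling for a linear model.
boot.errors <- function(X, y) {
  fit <- lm.fit(X, y)                          # least-squares fit on the data
  n <- length(y)
  eps.star <- sample(fit$residuals, n, replace = TRUE)   # resample residuals
  y.star <- drop(X %*% fit$coefficients) + eps.star      # fixed design
  lm.fit(X, y.star)$coefficients               # re-computed estimate beta*
}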

Approach 2: resampling cases

A quite different approach results from viewing the data as a random sample from
a (p+1)-variate unknown distribution F:
$$F \Longrightarrow X_n = \{(x_1, y_1), \ldots, (x_n, y_n)\}$$
which means that we consider both $y_i$ and $x_i$ as random variables. In fact, this keeps
us in the original, simple framework. Bootstrap samples are constructed by sampling
from the empirical distribution $F_n$, which just means drawing cases from the original
$X_n$ with replacement (where by cases we mean the whole vectors $(x_i, y_i)$). So,
$$F_n \Longrightarrow X_n^* = \{(x_1, y_1)^*, \ldots, (x_n, y_n)^*\}$$
In this approach, since the x-variable is just as random as the response y, the design in
the bootstrap samples will be different from the original sample. The bootstrap infer-
ence will thus reflect the uncertainty about the design.
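The corresponding case-resampling step is even simpler; again a sketch under the
same assumptions on X and y, with a function name of our choosing.

# Minimal sketch of case resampling for a linear model.
boot.cases <- function(X, y) {
  n <- length(y)
  idx <- sample(1:n, n, replace = TRUE)   # draw whole cases (x_i, y_i)
  lm.fit(X[idx, , drop = FALSE], y[idx])$coefficients
}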

Example: The following plots show a regression dataset (pilot-plant data: y = acid
content determined by titration, x = acid content determined by extraction and weigh-
ing), with 5 random bootstrap samples based on the two approaches. Note that
error-resampling yields n different bootstrap observations every time, while case-
resampling generally produces multiple replicates of certain observations (which are
hidden in the plot due to overlapping). Note the gaps in the designs for case-
resampling.
• original + 5 bootstrap samples by ERROR-RESAMPLING

[Figure: 2 x 3 panel of scatterplots of Titration versus Extraction — the original pilot-plant data and 5 error-resampled bootstrap samples.]

• original + 5 bootstrap samples by CASE-RESAMPLING


[Figure: 2 x 3 panel of scatterplots of Titration versus Extraction — the original pilot-plant data and 5 case-resampled bootstrap samples; replicated cases overlap in the plots.]

Resampling errors versus resampling cases

• Can we interpret the original design as clearly fixed or clearly random?


In some situations, the design should obviously be considered as fixed. For example,
imagine a scientist who collects data in a dose-response experiment. That scientist
usually chooses the exact doses beforehand and therefore the x-variable, i.e. the dose,
is not sampled but fixed. Mimicking this process in the bootstrap then should mean
resampling errors, not cases. Other situations naturally lead to interpreting the covari-
ates x as random such that resampling cases is the appropriate bootstrap approach.

• Do we fully trust our regression model?


The regression model (1.26) says that the errors are identically distributed, which im-
plies that the error variance does not depend on x. This is a strong assumption which is
often violated (heteroscedasticity), and resampling errors fails when it is violated. Re-
sampling cases, on the other hand, remains valid in the case of heteroscedastic errors.
More generally, if the model is wrong, the bootstrap samples obtained by resampling
errors will not look like the original data.

• How much do we value the principle of fixed information content?


The design determines the information content of a sample. In principle, inference
is supposed to be specific to the information in the data. When we resample cases,
we introduce extra variation in this information. The estimated standard error of $\hat{\beta}$
will then not exactly reflect the precision associated with the sample of covariates
actually observed. Resampling errors, on the other hand, does reflect the precision
of the estimates for the sample of covariates observed and is preferred in this view.
However, if we think of the inference as reflecting variation of the estimates due to
random sampling from the data-generating process, then resampling cases becomes
more appropriate.

Example: To illustrate the second of these issues, the following plots show a regression
dataset where we have heteroscedasticity or non-constant error-variance (hormone as-
say data: y = measurement by new method, x = measurement by reference method).
Note how error-resampling causes the large residuals, corresponding to large x, to
end up at small values of x in the bootstrap samples. The specific x-dependence of
the error-variance is lost. Case-resampling seems a better choice, although the occa-
sional absence of that high leverage point may be disturbing in view of the third issue
above. Another possibility here would be to incorporate the non-constant variance in
the regression model, and use error-resampling based on this new model.

• original + 5 bootstrap samples by ERROR-RESAMPLING

[Figure: 2 x 3 panel of scatterplots of New method versus Reference method — the original hormone assay data and 5 error-resampled bootstrap samples.]

• original + 5 bootstrap samples by CASE-RESAMPLING


[Figure: 2 x 3 panel of scatterplots of New method versus Reference method — the original hormone assay data and 5 case-resampled bootstrap samples.]

Whatever approach you use (error or case resampling), once you have B bootstrap
samples you can re-compute the estimate $\hat{\beta}$ B times, and proceed as before to obtain
standard errors and confidence intervals for every component of β.

Remark 1: The approaches described here for linear models can be applied to gener-
alized linear models as well. See e.g. Davison and Hinkley (1997).

Remark 2: There is substantial theory available for inference by least-squares (LS) es-
timation in linear regression, so the bootstrap might seem unnecessary there. However,
this theory relies on the error distribution F being normal, an assumption that cannot
always be trusted. The bootstrap makes much weaker assumptions, leading to more
reliable inference in many cases.

Remark 3: LS is just one possible method to estimate β. Many other estimators can
be used instead, such as least-trimmed-squares (LTS), M- and S-estimators, for ex-
ample. Such alternative (robust) regression estimators are increasingly popular, and
since normal theory is far less developed for these estimators than for LS, appropriate
bootstrap methods are an important inference tool in this context.

1.7 Bootstrap hypothesis tests

Apart from standard errors and confidence intervals, hypothesis testing is the main
application in statistical inference. The bootstrap can be used for testing, but things are
more complicated than for confidence intervals. The reason is that we need to estimate
the distribution of a certain test statistic under the null hypothesis H0 . And since we
do not know whether our data satisfies H0 , we cannot simply re-sample from the data.
The exact procedure depends on the type of test and type of data, but let us illustrate
some issues by means of an example.

Example: (Efron and Tibshirani, 1993) In a small experiment, 16 mice were randomly
assigned to a treatment group or a control group. The treatment was intended to
prolong survival after a test surgery. These are the observed survival times (in days)
after surgery:
Treatment: 94 197 16 38 99 141 23
Control: 52 104 146 10 51 30 40 27 46
Denote by x1,1 , . . . , x1,7 the observations in the Treatment group, having underlying
distribution F1 , and by x2,1 , . . . , x2,9 those in the Control group, having underlying
distribution F2 . We wonder if the treatment has had any significant effect. Formally,
we would like to test the hypothesis H0 : F1 = F2 .

(A) Testing H0 : F1 = F2

In case H0 holds, we expect the difference in mean survival time to be small; in case
H0 is false we expect the difference to be large. Hence, a reasonable test statistic is
$$\hat{\theta} = \bar{x}_1 - \bar{x}_2 = 30.63$$
Is this strong evidence against H0? To answer this, we need to know the distribution
of $\hat{\theta}$ under H0. We can bootstrap, but simple (stratified) re-sampling from the data is
not an option, since the data may or may not satisfy H0.

How to construct bootstrap samples under H0 ?

Under H0, both parts of the sample have a common distribution, say F0. In the
absence of any knowledge of F0, the best way to estimate F0 is by the empirical dis-
tribution $\hat{F}_0$ of the data considered as one large sample of size $n_1 + n_2 = 16$. Hence,
rewrite the data:
$$X_n = \{x_{1,1}, \ldots, x_{1,7}, x_{2,1}, \ldots, x_{2,9}\} \rightarrow \{z_1, \ldots, z_7, z_8, \ldots, z_{16}\}$$
Bootstrap samples are then constructed under H0 by drawing with replacement from
$X_n = \{z_1, \ldots, z_{16}\}$, yielding $X_n^* = \{z_1^*, \ldots, z_{16}^*\}$. The result is then simply rewritten as
$$X_n^* = \{z_1^*, \ldots, z_7^*, z_8^*, \ldots, z_{16}^*\} \rightarrow \{x_{1,1}^*, \ldots, x_{1,7}^*, x_{2,1}^*, \ldots, x_{2,9}^*\}$$
and we re-compute the statistic $\hat{\theta}^* = \bar{x}_1^* - \bar{x}_2^*$.

The bootstrap p-value

Constructing B bootstrap samples leads to $\hat{\theta}_1^*, \ldots, \hat{\theta}_B^*$ and these give us an estimate
of the distribution of the statistic under H0. Therefore, we can define the bootstrap
p-value as
$$p_{boot} = \frac{\#\{\hat{\theta}_b^* \ge \hat{\theta}\}}{B}.$$
Instead of $\hat{\theta} = \bar{x}_1 - \bar{x}_2$ we could also use the studentized test statistic
$$\hat{\theta} = \frac{\bar{x}_1 - \bar{x}_2}{\hat{\sigma}\sqrt{1/n_1 + 1/n_2}}$$
where $\hat{\sigma}^2$ is the pooled variance of the two groups. This typically leads to similar
results.
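For illustration, the whole procedure fits in a few lines of base R (a minimal sketch;
variable names are ours). Note that theta.hat reproduces the value 30.63 reported above.

# Minimal sketch of the bootstrap test for H0: F1 = F2 on the mouse data.
treat <- c(94, 197, 16, 38, 99, 141, 23)
ctrl  <- c(52, 104, 146, 10, 51, 30, 40, 27, 46)
z <- c(treat, ctrl)                         # pooled sample, estimates F0
theta.hat <- mean(treat) - mean(ctrl)       # = 30.63
B <- 999
theta.star <- replicate(B, {
  zs <- sample(z, length(z), replace = TRUE)   # resample under H0
  mean(zs[1:7]) - mean(zs[8:16])               # first 7 play "treatment"
})
p.boot <- sum(theta.star >= theta.hat) / B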

(B) Testing H0 : µ1 = µ2

What if we just want to test whether the two groups have equal means, without as-
suming anything about equal variances? Then, we can use the following test statistic:
$$\hat{\theta} = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\hat{\sigma}_1^2/n_1 + \hat{\sigma}_2^2/n_2}}$$
In the procedure described above, the bootstrap samples were constructed under the
assumption that the whole distribution was equal, not just the means. We have to
think of something different now.

How to construct bootstrap samples under H0 ?

This time we have to keep the two groups apart when re-sampling. We will adjust
the data so that the groups have equal means. Let $\bar{z}$ denote the mean of the combined
sample and define
$$\tilde{x}_{1,i} = x_{1,i} - \bar{x}_1 + \bar{z} \quad\text{and}\quad \tilde{x}_{2,i} = x_{2,i} - \bar{x}_2 + \bar{z}$$
Then draw stratified bootstrap samples
$$X_n^* = \{x_{1,1}^*, \ldots, x_{1,7}^*, x_{2,1}^*, \ldots, x_{2,9}^*\}$$
from
$$\tilde{X}_n = \{\tilde{x}_{1,1}, \ldots, \tilde{x}_{1,7}, \tilde{x}_{2,1}, \ldots, \tilde{x}_{2,9}\}$$
and re-compute the test statistic:
$$\hat{\theta}^* = \frac{\bar{x}_1^* - \bar{x}_2^*}{\sqrt{\hat{\sigma}_1^{2*}/n_1 + \hat{\sigma}_2^{2*}/n_2}}$$
Just as before, based on B such bootstrap samples, the p-value is
$$p_{boot} = \frac{\#\{\hat{\theta}_b^* \ge \hat{\theta}\}}{B}.$$
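A sketch of this stratified scheme in base R, continuing with the mouse data (names
are again ours):

# Minimal sketch of the bootstrap test for H0: mu1 = mu2 (unequal variances).
treat <- c(94, 197, 16, 38, 99, 141, 23)
ctrl  <- c(52, 104, 146, 10, 51, 30, 40, 27, 46)
zbar <- mean(c(treat, ctrl))
t.tilde <- treat - mean(treat) + zbar     # shift both groups to a common mean
c.tilde <- ctrl  - mean(ctrl)  + zbar
stat <- function(a, b)                    # studentized difference of means
  (mean(a) - mean(b)) / sqrt(var(a)/length(a) + var(b)/length(b))
theta.hat <- stat(treat, ctrl)
B <- 999
theta.star <- replicate(B, {
  ts <- sample(t.tilde, length(t.tilde), replace = TRUE)  # stratified draws
  cs <- sample(c.tilde, length(c.tilde), replace = TRUE)
  stat(ts, cs)
})
p.boot <- sum(theta.star >= theta.hat) / B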

(C) Testing H0 : µ1 = 129.0

Consider now a one-sample problem, with just the data for the Treatment group.
Suppose some earlier experiment had found an average survival time of 129 days, and
suppose we want to test whether our observations have the same underlying mean.
So we want to test H0: µ1 = 129.0. The obvious test statistic for this is
$$\hat{\theta} = \frac{\bar{x}_1 - 129.0}{\hat{\sigma}_1/\sqrt{n_1}} \tag{1.27}$$
This can be bootstrapped in the same way as in the test for H0: µ1 = µ2 outlined
above: by translating the data first so that they have mean equal to 129.0.

Connection with confidence intervals

There is an equivalent way to find the null distribution of (1.27), and that is by
simply bootstrapping from the un-translated data, and computing
$$\hat{\theta}^* = \frac{\bar{x}_1^* - \bar{x}_1}{\hat{\sigma}_1^*/\sqrt{n_1}} \tag{1.28}$$
As before, based on B bootstrap samples, the p-value is
$$p_{boot} = \frac{\#\{\hat{\theta}_b^* \ge \hat{\theta}\}}{B}.$$
Both methods here are completely equivalent, and make the assumption that the dis-
tribution of the test statistic is pivotal, or translation-invariant. Just like the classical
t-test (which additionally assumes normality, of course).

The bootstrapped statistics (1.28) point to a connection with bootstrap-t intervals.
Indeed, the 95% bootstrap-t confidence interval in this case consists of exactly those
values µ0 that are not rejected at the 5% level by the test described here.

Using confidence intervals for hypothesis testing

Of course, whenever we are testing a hypothesis of the form H0 : θ = θ0 , we can


use the duality between tests and confidence intervals. Select any type of confidence
interval. Then the associated test is:

“reject the null hypothesis H0 : θ = θ0 at the α-level when the (1 − α)100%


confidence interval for θ does not contain θ0 .”

A typical situation where this can be useful is for testing if a coefficient in a regression
model is significantly different from 0.
Summarized:

• By bootstrapping test statistics we avoid assumptions like normality.

• The general idea is to adjust the data in some way such that it satisfies H0 , while
keeping much of the characteristics of the original data.

• If possible, an easier way is to use duality with confidence intervals.

Further reading
A nice introduction to the bootstrap is given by Efron and Tibshirani (1993) while
an extensive overview of bootstrap methods is given in Davison and Hinkley (1997).
A more practical discussion can be found in e.g. Chernick (1999) while a mathematical
treatment of the subject can be found in Shao and Tu (1995).

Chapter 2

M-estimators

2.1 Robustness
All statistical methods rely on a number of assumptions. Typically two types of
assumptions occur. Distributional assumptions assume that the observations are gen-
erated from a specific distribution, often the Gaussian distribution. Classical linear
least squares regression for instance assumes that the errors are normally distributed
with constant variance. Secondly, even when such an explicit distribution is not re-
quired, classical statistical methods rely on the assumption that all observations are
generated from the same underlying process. However, in many real life applications
it is observed that data sets contain several atypical observations called outliers. These
are observations that deviate from the main pattern in the data.
Classical statistical methodology often turns out to be very sensitive to outliers.
A few deviating observations can have a huge impact on the results. On the other
hand, robust statistics constructs estimators that control the effects of outliers. Typically
robust methods give similar results as classical estimates when there are no outliers,
but keep giving appropriate results when outliers are present.
Three R packages with robust methods that will be frequently used, are MASS (Ven-
ables and Ripley, 2002), robustbase (Rousseeuw et al., 2003) and rrcov (Todorov and
Filzmoser, 2009).
Consider a data set containing the logarithm of the annual incomes of 10 persons.
If the file annualincome.txt contains the data and is located in the working directory
of R, then provide some basic graphical displays (histogram, boxplot and normal qq-
plot).

> data <- read.table("annualincome.txt",col.names="logIncome")


> anninc <- data$logIncome
> par(mfrow=c(1,3))

> boxplot(anninc,xlab="anninc")
> hist(anninc)
> qqnorm(anninc)

Figure 2.1 shows the resulting plots. From this graphical analysis, it is clear that
the log income of person 10 is quite large compared to the other log incomes.

Figure 2.1 Exploratory graphs (boxplot, histogram and normal QQ-plot) for the annual income data.

Now let us analyze the sample mean and the sample median.

> mean(anninc)
[1] 10.49203
> median(anninc)
[1] 9.978188

Observe that there is a large difference between both estimates. Now consider the data
set without observation 10, the outlier.

> anninc2 <- anninc[-10]


> mean(anninc2)
[1] 9.967398
> median(anninc2)
[1] 9.957906

A simple visualization can be obtained by the following commands:

> plot(anninc,rep(1,10),ylab="",xlab="log annual income",ylim=c(0.8,1.1))
> points(mean(anninc),0.95,pch=4,col="red")
> points(mean(anninc2),0.92,pch=4,col="red")
> points(median(anninc),0.89,pch=4,col="blue")
> points(median(anninc2),0.86,pch=4,col="blue")
> text(11.5,0.95,"mean with outlier",cex=0.8,col="red")
> text(10.9,0.92,"mean without outlier",cex=0.8,col="red")
> text(10.9,0.89,"median with outlier",cex=0.8,col="blue")
> text(10.9,0.86,"median without outlier",cex=0.8,col="blue")

which results in the following figure:


[Figure: dot plot of the log annual incomes, with the mean and median computed with and without the outlier marked below the data.]

Note that both the mean and the median of the sample without the outlier give ap-
proximately the same result as the median of the sample containing the outlier, but
a very different result from the mean of the sample containing the outlier. This illus-
trates that the sample mean is not a robust estimator: a single observation can have an
arbitrarily large effect on the result. For the median the influence of a single observation
is bounded. The sample median is an example of a robust estimator of location.
For univariate scale estimators the situation is similar. Consider the sample stan-
dard deviation s:
$$s^2 = \frac{1}{n-1}\sum_{i=1}^n (x_i - \bar{x})^2$$
Again an outlier can be arbitrarily influential: notice again the difference between using
the data set with or without the outlier:

> sqrt(var(anninc))
[1] 1.678832

> sqrt(var(anninc2))
[1] 0.2726027

For a robust scale estimator, e.g. the (normalized) Median Absolute Deviation (MAD),
$$\text{MADN} = 1.4826\,\text{med}_i(|x_i - \text{med}_j(x_j)|),$$

the effect of individual observations is again much smaller:

> mad(anninc)
[1] 0.2105529
> mad(anninc2)
[1] 0.1827097

Exercise: retake the previous analysis, replacing observation 10 by arbitrarily chosen
values. Compute the sample mean, sample median, sample standard deviation and
MADN. Convince yourself that the median and the MADN are indeed much more
robust.

At this point one might think that robust estimation could also be obtained by de-
tecting outliers and deleting them. However, it is important to realize that detection
of outliers is far from trivial. In the univariate case, one might suggest to flag points
as outliers when the distance to the mean is larger than 3 standard deviations. This is
based on the fact that for X ∼ N (0, 1), P(| X | > 3) = 0.003. Let us apply this rule to
the income data.

round((anninc-mean(anninc))/sqrt(var(anninc)),digits=1)
[1] -0.6 -0.5 -0.2 -0.3 -0.2 -0.3 0.0 -0.3 -0.3 2.8

Observe that the result is actually smaller than 3! This is because the sample mean
is shifted and the sample variance is blown up due to the outlier. This is called the
masking effect: outliers can change non-robust estimates so dramatically, that outlier
detection using these estimates fails. However, when using the robust estimates in the
outlier detection rule, we do find that observation 10 is very far away from the other
observations.

round((anninc-median(anninc))/mad(anninc),digits=1)
[1] -2.2 -1.4 0.9 -0.1 0.5 0.1 2.4 -0.3 -0.3 24.9

This illustrates the importance of using a robust method in order to detect outliers.
In the remainder of this chapter we will study the theory of M-estimators, which
is a well-established class of robust estimators. We start with univariate estimation of
location and scale as in the example above and then continue with general properties
of M-estimators.

2.2 Univariate parametric models
We assume that we have a sample { x1 , . . . , xn } of n i.i.d. univariate random vari-
ables Xi . A parametric model assumes that Xi ∼ Fθ with { Fθ : θ ∈ Θ} a set of distri-
bution functions. We will assume that the distribution has a density f θ ( x ) = Fθ′ ( x ). An
important goal of statistics is to find a good estimator θ̂n of θ, that is, a real function of
the sample θ̂n ( x1 , . . . , xn ) such that θ̂n is close to the true parameter θ.

Recall the following concepts:


• The Mean Squared Error (MSE) of an estimator:

MSE(θ̂n ) = E[(θ̂n − θ )2 ].

The MSE is often used as a criterion to measure the accuracy of an estimator. It


can be decomposed as

MSE(θ̂n ) = Var[θ̂n ] + bias(θ̂n )2

with

bias(θ̂n ) = E[θ̂n ] − θ

and

Var[θ̂n ] = E[(θ̂n − E[θ̂n ])2 ].

An estimator is called unbiased if its bias is zero. For unbiased estimators the
MSE equals the variance.

• Under some regularity conditions, the Fisher information of θ in the model is
  defined as
  $$I_\theta = E\left[\left(\frac{\partial}{\partial\theta} \ln f_\theta(x)\right)^2\right].$$
  The Cramér-Rao information inequality states that for any unbiased estimator $\hat{\theta}_n$
  we have that
  $$\text{Var}(\hat{\theta}_n) \ge \frac{1}{n I_\theta}.$$
  Thus
  $$0 \le \text{eff}(\hat{\theta}_n) := \frac{1}{n \text{Var}(\hat{\theta}_n) I_\theta} \le 1.$$
  The percentage $\text{eff}(\hat{\theta}_n)$ is called the (finite sample) efficiency of $\hat{\theta}_n$. If the effi-
  ciency equals 1, then $\hat{\theta}_n$ has the smallest possible variance (and hence the small-
  est possible MSE) of all unbiased estimators.

• An estimator is called consistent if for all θ ∈ Θ it holds that θ̂n → p θ. Here → p
denotes convergence in probability, that is, for any ε > 0 it holds that P(|θ̂n − θ | ≥
ε) → 0.

• An estimator is called asymptotically normal with asymptotic variance $V_\theta$ if
  $\sqrt{n}(\hat{\theta}_n - \theta) \to_d Y$ with $Y \sim N(0, V_\theta)$, where $\to_d$ denotes convergence in dis-
  tribution. (That is, the cdf of $\sqrt{n}(\hat{\theta}_n - \theta)$ converges to the cdf of $N(0, V_\theta)$ in every
  point.)
  This implies that $V_\theta$ is the limit of $n\,\text{Var}[\hat{\theta}_n]$ and not the limit of $\text{Var}[\hat{\theta}_n]$.
  If the Fisher information $I_\theta$ exists, we define the asymptotic efficiency of an un-
  biased and asymptotically normal estimator as
  $$0 \le \text{eff}(\hat{\theta}_n) := \frac{1}{V_\theta I_\theta} \le 1.$$
  Furthermore, the Asymptotic Relative Efficiency (ARE) of two unbiased and
  asymptotically normal estimators $\hat{\theta}_A$ and $\hat{\theta}_B$ is defined as
  $$\text{ARE}(\hat{\theta}_A, \hat{\theta}_B) = \frac{V_{\hat{\theta}_B}}{V_{\hat{\theta}_A}}$$
  which can also be written as $\text{eff}(\hat{\theta}_A)/\text{eff}(\hat{\theta}_B)$ when $I_\theta$ exists.

• The likelihood function is the joint density
  $$L(x_1, \ldots, x_n; \theta) = \prod_{i=1}^n f_\theta(x_i)$$
  considered as a function of θ.

• The Maximum Likelihood Estimator (MLE) is the estimator maximizing the like-
  lihood function:
  $$\hat{\theta}_n = \operatorname{argmax}_\theta L(x_1, \ldots, x_n; \theta).$$

2.2.1 Univariate location model


A simple example of a parametric model is the univariate location model. It as-
sumes that x1 , . . . , xn are independent and identically distributed (i.i.d.) as

Fµ ( x ) = F ( x − µ) (2.1)

where −∞ < µ < +∞ is the unknown location parameter and F is a continuous distri-
bution with density f , hence f µ ( x ) = Fµ′ ( x ) = f ( x − µ). (Note that Fµ is only a model

for the uncontaminated data. We do not model outliers.)

The goal is to find a good estimator µ̂n of µ. Most location estimators satisfy the
following desirable properties: the estimator is called location equivariant if for all b ∈ R
it holds that

µ̂n ( x1 + b, . . . , xn + b) = µ̂n ( x1 , . . . , xn ) + b

and it is called scale equivariant if for all a ̸= 0

µ̂n ( ax1 , . . . , axn ) = aµ̂n ( x1 , . . . , xn ).

This guarantees that estimates are independent of the measurement units.

Often f is assumed to be symmetric. The simplest location model assumes that F


is normal (gaussian) with mean 0 and known standard deviation σ, that is

Xi ∼ N (µ, σ2 ).

If this assumption holds exactly, then the sample mean
$$\bar{X} = \frac{1}{n}\sum_{i=1}^n X_i$$
is an optimal estimator: it is the MLE, it is unbiased, asymptotically normal, and has
the highest possible efficiency (100%).

The sample median is another well-known location estimator:
$$\text{Med} = \begin{cases} X_{\left(\frac{n+1}{2}\right)} & \text{if } n \text{ is odd} \\[4pt] \frac{1}{2}\left(X_{\left(\frac{n}{2}\right)} + X_{\left(\frac{n}{2}+1\right)}\right) & \text{if } n \text{ is even} \end{cases}$$
with $X_{(i)}$ the ith order statistic. Under the assumption of normality the sample median
has a lower efficiency than the sample mean, namely 2/π ≈ 64%. But the situation
may be reversed if the model distribution F has longer tails.

2.2.2 Scale model


The scale model states that the i.i.d. random variables $X_i$ satisfy
$$X_i \sim F_\sigma \quad\text{with}\quad F_\sigma(x) = F\left(\frac{x}{\sigma}\right) \tag{2.2}$$
where σ > 0 is the unknown scale parameter. As before F is a continuous distribution
with density f, but now
$$f_\sigma(x) = F_\sigma'(x) = \frac{1}{\sigma} f\left(\frac{x}{\sigma}\right).$$

Most scale estimators σ̂n are location invariant, i.e. for all b ∈ R it holds that

σ̂n ( x1 + b, . . . , xn + b) = σ̂n ( x1 , . . . , xn ) (2.3)

and they are typically scale equivariant, i.e. for all a ̸= 0

σ̂n ( ax1 , . . . , axn ) = | a|σ̂n ( x1 , . . . , xn ). (2.4)

The most common scale estimator is the sample standard deviation
$$\text{Stdev}(x_1, \ldots, x_n) = \sqrt{\frac{1}{n-1}\sum_{i=1}^n (x_i - \bar{x})^2}.$$
Other well-known scale estimators include the Inter Quartile Range (IQR) defined as
$$\text{IQR}(x_1, \ldots, x_n) = x_{(\lceil 0.75n\rceil)} - x_{(\lfloor 0.25n\rfloor)}$$
and the Median Absolute Deviation (MAD):
$$\text{MAD}(x_1, \ldots, x_n) = \text{median}_i(|x_i - \text{Med}(x_1, \ldots, x_n)|).$$

If F is a gaussian distribution, Stdev is a consistent estimator of the parameter σ,


and it is also the most efficient estimator. Note that neither the MAD nor the IQR are
consistent the way they are defined above. Under normality one can prove that the
MAD converges to cσ and the IQR to 2cσ with c = Φ−1 (0.75) = 0.6745. We typically
use the normalized MAD given by
$$\text{MADN}(x_1, \ldots, x_n) = \frac{\text{MAD}(x_1, \ldots, x_n)}{\Phi^{-1}(0.75)}.$$

The factor (Φ−1 (0.75))−1 = 1.4826 is called a consistency factor. Some people prefer
the notation MAD to include this factor, others do not. When using software one
should watch out which convention is used. In R, mad refers to the version with the
factor included.

2.3 The sensitivity curve

Let { Fθ } be a parametric model. For a sample ( x1 , . . . , xn ) and an estimator θ̂n the


sensitivity curve in the point z is defined by

SC(z) = (n + 1)(θ̂n+1 ( x1 , . . . , xn , z) − θ̂n ( x1 , . . . , xn )). (2.5)

Consider for example the location model. The following function computes the sensi-
tivity curve of the sample mean for a given sample on a grid between −10 and 10.

scmean <- function(sample) {
  n <- length(sample)
  sc <- rep(0,201)  # "repeat": array with 201 elements, all zero
  for (z in -100:100){
    # factor (n+1) as in definition (2.5)
    sc[z+101] <- (n+1)*(mean(c(sample,z/10))-mean(sample))
  }
  sc  # the output
}

A similar function can be written for the sample median; a sketch is given next, after
which both functions can be drawn.
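The following minimal sketch defines such a function (the name scmedian matches the
call in the plotting commands below; the function itself is ours, not from a package):

scmedian <- function(sample) {
  n <- length(sample)
  sc <- rep(0,201)
  for (z in -100:100){
    # factor (n+1) as in definition (2.5)
    sc[z+101] <- (n+1)*(median(c(sample,z/10))-median(sample))
  }
  sc
}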

> sample <- rnorm(10) # randomly generated from N(0,1)


> res <- scmean(sample)
> plot((-100:100)/10,res,xlab="z",ylab="SC(z)")
> resmed <- scmedian(sample)
> points((-100:100)/10,resmed)

The sample mean is clearly more sensitive to outliers than the sample median. Its
sensitivity curve is unbounded as |z| increases. The sensitivity curve of the sample
median is bounded, reflecting the fact that the addition of a single point z to a sample
cannot have an arbitrarily large impact.

[Figure: sensitivity curves SC(z) of the sample mean (unbounded) and the sample median (bounded) for a N(0,1) sample of size 10.]
2.4 The finite-sample breakdown value
If the sensitivity curve is bounded, then one observation cannot have an arbitrarily
large impact on the estimator. But what happens when more than one point is altered?
To measure this, the finite-sample breakdown value (Donoho and Huber, 1983) plays
an important role.

In a given sample $x = (x_1, \ldots, x_n)$, replace any m data points $x_{i_1}, \ldots, x_{i_m}$ by arbi-
trary values $y_1, \ldots, y_m$. Call the new data set $(z_1, \ldots, z_n)$. Let Θ be the parameter space
of the parametric model. The finite-sample breakdown value of an estimator $\hat{\theta}_n$ is
$$\varepsilon_n^*(\hat{\theta}_n; x) = \min\left\{\frac{m}{n} : \hat{\theta}_n(z_1, \ldots, z_n) \text{ approaches the boundary of } \Theta\right\}. \tag{2.6}$$
In most cases $\varepsilon_n^*$ does not depend on the sample x (for samples satisfying some weak
conditions).

Consider first the univariate location model. Then Θ = R and
$$\varepsilon_n^*(\hat{\mu}_n; x) = \min\left\{\frac{m}{n} : \max_{i_1,\ldots,i_m}\, \sup_{y_1,\ldots,y_m} |\hat{\mu}_n(z_1, \ldots, z_n)| = \infty\right\}. \tag{2.7}$$
In this situation:

• The breakdown value of the sample mean equals 1/n.

• For any location equivariant location estimator $\hat{\mu}_n$ it holds that
  $$\varepsilon_n^*(\hat{\mu}_n, x) \le \frac{1}{n}\left\lfloor\frac{n+1}{2}\right\rfloor \approx \frac{1}{2}. \tag{2.8}$$

• The sample median attains this upper bound.

In the univariate scale model we have that Θ =]0, ∞[. The boundary of Θ is the
pair {0, +∞}. The breakdown value with respect to 0 is called the implosion breakdown
value. The one with respect to +∞ is called the explosion breakdown value. In this setting:

• The breakdown value of the sample standard deviation equals 1/n because it
explodes easily.

• For any equivariant scale estimator $\hat{\sigma}_n$ it holds that
  $$\varepsilon_n^*(\hat{\sigma}_n, x) \le \frac{\lfloor n/2\rfloor}{n} \approx \frac{1}{2}.$$

• The MAD attains this upper bound at samples without ties.

2.5 M-estimators of univariate location
Let ρ be an even and nondecreasing function of |x|, with ρ(0) = 0. If ρ is differen-
tiable, we define ψ = ρ′. We then have that ψ is odd and ψ(x) ⩾ 0 for x ⩾ 0. For the
location model, an M-estimator (Huber, 1964) is defined as
$$\hat{\mu}_n = \operatorname{argmin}_\mu \sum_{i=1}^n \rho(x_i - \mu). \tag{2.9}$$
If ρ is differentiable, differentiating (2.9) with respect to µ yields
$$\sum_{i=1}^n \psi(x_i - \hat{\mu}_n) = 0. \tag{2.10}$$

Examples:

• ρ(x) = − ln f(x), hence ψ(x) = − f′(x)/f(x). Then it follows from (2.9) that
  $$\hat{\mu}_n = \operatorname{argmin}_\mu \sum_{i=1}^n -\ln(f(x_i - \mu)) = \operatorname{argmax}_\mu \ln \prod_{i=1}^n f(x_i - \mu)$$
  and thus $\hat{\mu}_n$ equals the MLE of µ. M-estimators are thus generalizations of MLE-
  estimators.

• ρ(x) = x²/2. Then ψ(x) = x and we have that
  $$\sum_{i=1}^n (x_i - \hat{\mu}_n) = 0$$
  which means that $\hat{\mu}_n = \bar{x}$. Thus the sample mean is an M-estimator. At the
  normal model, f(x) = φ(x), it is the MLE.

• ρ(x) = |x|. Then it follows from (2.9) that
  $$\hat{\mu}_n = \operatorname{argmin}_\mu \sum_{i=1}^n |x_i - \mu|.$$
  The sample median attains this minimum, so it is an M-estimator with ψ(x) =
  sign(x). It is the MLE at the double exponential distribution $f(x) = \frac{1}{2}e^{-|x|}$ (also
  called the Laplace distribution).

• Huber M-estimator with parameter b > 0:
  $$\rho(x) = \begin{cases} x^2/2 & \text{if } |x| \le b \\ b|x| - b^2/2 & \text{if } |x| > b \end{cases}$$
  and
  $$\psi(x) = \begin{cases} x & \text{if } |x| \le b \\ b\,\text{sign}(x) & \text{if } |x| > b \end{cases}$$
  Note that the limiting cases b → ∞ and b → 0 correspond to the mean and the
  median.

• Tukey's bisquare M-estimator with parameter c > 0:
  $$\rho(x) = \begin{cases} \dfrac{x^2}{2} - \dfrac{x^4}{2c^2} + \dfrac{x^6}{6c^4} & \text{if } |x| \le c \\[4pt] c^2/6 & \text{if } |x| > c \end{cases} \tag{2.11}$$
  and
  $$\psi(x) = x\left(1 - \frac{x^2}{c^2}\right)^2 I(|x| \le c).$$
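For use in the computations later on, these ψ functions are easy to code directly in R.
A minimal sketch (the names huber.psi and tukey.psi are ours; these are not the MASS
or robustbase implementations):

huber.psi <- function(x, b = 1.345) {
  # psi(x) = x in the center, b*sign(x) in the tails
  pmin(pmax(x, -b), b)
}
tukey.psi <- function(x, c = 4.68) {
  # redescending: identically zero outside [-c, c]
  x * (1 - (x/c)^2)^2 * (abs(x) <= c)
}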

The graphs of these ρ and ψ are depicted in Figure 2.2. For the mean both func-
tions are unbounded. For the median and the Huber M-estimator ψ is increasing and
bounded and ρ is unbounded. Such M-estimators are called monotone. One can prove
that the optimization problem in (2.9) is then a convex optimization problem and no lo-
cal minima exist. This is a big computational advantage. For the bisquare M-estimator
both ρ and ψ are bounded. Moreover ψ is not increasing: for large x the function ψ( x )
decreases. Such M-estimators are called redescending. Computationally this yields a
more difficult optimization since local minima can arise. This means that some solu-
tions of (2.10) may not correspond to the global minimum of (2.9).
You can check that location M-estimators as defined by (2.9)-(2.10) are location
equivariant, but not scale equivariant. M-estimators of location can achieve a high
breakdown value. If ψ is odd, bounded and nondecreasing, they attain the maximal
breakdown value (2.8). This implies that the Huber location M-estimator has maximal
breakdown value for all b < ∞.
It turns out that properties of M-estimators are determined by the ψ-function
in (2.10) rather than the loss function ρ in (2.9). Therefore, M-estimators are often
defined directly as solutions of an estimating equation as in (2.10). This is the common
approach for M-estimators of scale.

Figure 2.2 The ρ and ψ functions of the mean, the median, the Huber
estimator with b = 1.345, and the bisquare M-estimator with c = 4.68.

2.6 M-estimators of scale
Assume the scale model from (2.2). An M-estimator of scale satisfies an equation
of the form
$$\frac{1}{n}\sum_{i=1}^n \psi\left(\frac{x_i}{\hat{\sigma}_n}\right) = 0 \tag{2.12}$$
Note that the factors $x_i/\hat{\sigma}_n$ are actually standardized deviations from the center of the
distribution, which is 0 in this scale model. For symmetric distributions, it is natu-
ral that the contribution of a discrepancy is the same no matter whether $x_i$ lies in the
left or right tail of the distribution. Hence, a natural requirement for ψ is to use an
even function, and thus ρ functions as defined above are ideal choices. However, since
ρ(t) ≥ 0 for all values of t it is clear that generally $\frac{1}{n}\sum_{i=1}^n \rho(x_i/\hat{\sigma}_n) > 0$. Therefore, we set
ψ(t) = ρ(t) − δ, where the constant δ can be chosen to make the estimator $\hat{\sigma}_n$ consis-
tent at a particular scale model, usually $F_\sigma = N(0, \sigma^2)$. Expression (2.12) can now be
rewritten as
$$\frac{1}{n}\sum_{i=1}^n \rho\left(\frac{x_i}{\hat{\sigma}_n}\right) = \delta \tag{2.13}$$
for some 0 < δ < ρ(∞). Usually we set $\delta = \int \rho(t) f(t)\,dt$, which makes $\hat{\sigma}_n$ consistent.

Examples:

• If ρ(x) = −x f′(x)/f(x) and δ = 1, then $\hat{\sigma}_n$ equals the MLE estimator of σ. (Note
  that ρ(x) = xψ(x) where ψ(x) = −f′(x)/f(x) corresponds to the MLE estimator
  for location we saw before.)

• ρ(x) = x². If δ = 1, then $\hat{\sigma}_n = \sqrt{\frac{1}{n}\sum_{i=1}^n x_i^2}$, which is called the root mean square.
  This is the MLE of σ at the normal scale model with known µ = 0.

• ρ(x) = |x| and δ = 1. Then $\hat{\sigma}_n = \frac{1}{n}\sum_{i=1}^n |x_i|$. This is the MLE at the double
  exponential (Laplace) distribution.

• ρ(x) is the bisquare function (2.11) and $\delta = \frac{1}{2}\rho(\infty)$. The resulting $\hat{\sigma}_n$ is called the
  bisquare scale M-estimator.

• ρ(x) = I(|x| > c) with c > 0 and δ = 0.5; then $\hat{\sigma}_n = \frac{1}{c}\,\text{median}_i(|x_i|)$.

• As in the case of the MLE, we can also take the ψ function of a location M-
  estimator and construct an M-estimator of scale with the ρ function ρ(x) = xψ(x).

Check that the scale M-estimators given by (2.13) are scale equivariant but not location
invariant.

2.7 M-estimators of univariate location with unknown
scale
The general location-scale model assumes that the $x_i$ are i.i.d. according to
$$F_{(\mu,\sigma)}(x) = F\left(\frac{x-\mu}{\sigma}\right)$$
where −∞ < µ < +∞ is the location parameter and σ > 0 is the scale parameter. In
this general model both µ and σ are assumed to be unknown, which is realistic. The
density is now
$$f_{(\mu,\sigma)}(x) = F'_{(\mu,\sigma)}(x) = \frac{1}{\sigma} f\left(\frac{x-\mu}{\sigma}\right).$$
In this general situation we can still estimate location by the median and scale by the
MADN, which possess the required equivariance and invariance properties. But the
location M-estimators defined by (2.9) are not scale equivariant, so they depend on the
measurement units. If σ were known, the natural solution would be to divide $x_i - \mu$
in the formulas by this σ:
$$\hat{\mu} = \operatorname{argmin}_\mu \sum_{i=1}^n \rho\left(\frac{x_i - \mu}{\sigma}\right) \tag{2.14}$$
and thus
$$\sum_{i=1}^n \psi\left(\frac{x_i - \hat{\mu}}{\sigma}\right) = 0. \tag{2.15}$$

Since σ is unknown, two strategies can be followed:

1. A simple solution is to estimate σ beforehand using a scale estimator that is lo-
   cation invariant and scale equivariant, and to plug the scale estimate $\hat{\sigma}$ into ex-
   pression (2.14):
   $$\hat{\mu} = \operatorname{argmin}_\mu \sum_{i=1}^n \rho\left(\frac{x_i - \mu}{\hat{\sigma}}\right) \tag{2.16}$$
   which yields the equation
   $$\sum_{i=1}^n \psi\left(\frac{x_i - \hat{\mu}}{\hat{\sigma}}\right) = 0. \tag{2.17}$$

(Note that we have dropped the subscript n from µ̂ and σ̂ for simplicity.)
Naturally, σ̂ should be a robust estimator of σ (e.g. the MADN) if one wants
µ̂ to be a robust estimator of µ. For monotone M-estimators with a bounded

and odd ψ, the breakdown value of the location estimate µ̂ can be shown to be
equal to the breakdown value of the scale estimate σ̂, thus resulting in a maximal
breakdown value when the MADN is chosen. Using the standard deviation as
scale estimate would result in a zero breakdown value. For redescending M-
estimators, the breakdown value depends on the data. For the bisquare estimator
with the MADN as preliminary scale estimate, the breakdown value is 1/2 for
all practical purposes.

2. An alternative approach is to simultaneously estimate both parameters µ and σ
   by a system of equations combining (2.10) and (2.13):
   $$\sum_{i=1}^n \psi_{location}\left(\frac{x_i - \hat{\mu}}{\hat{\sigma}}\right) = 0 \tag{2.18}$$
   $$\frac{1}{n}\sum_{i=1}^n \rho_{scale}\left(\frac{x_i - \hat{\mu}}{\hat{\sigma}}\right) = \delta \tag{2.19}$$
   where we may choose $\rho_{scale}$ differently from $\rho_{location}$. This approach is more com-
   plicated than the first, and unfortunately it yields less robust estimators, so the
   first approach is preferable.

2.8 Computation of M-estimators
2.8.1 Computing an M-estimator of location
Here we follow the first approach above, so we first compute a scale estimate $\hat{\sigma}$,
typically by the MADN. To solve (2.17) an iterative reweighting algorithm can be used.
Based on the ψ function for location, define
$$W(x) = \begin{cases} \psi(x)/x & \text{if } x \ne 0 \\ \psi'(0) & \text{if } x = 0 \end{cases} \tag{2.20}$$
where the second line is just L'Hôpital's rule for x → 0. Then (2.17) can be written as
$$\sum_{i=1}^n \left(\frac{x_i - \hat{\mu}}{\hat{\sigma}}\right) W\left(\frac{x_i - \hat{\mu}}{\hat{\sigma}}\right) = 0.$$
It follows that $\hat{\mu}$ can be interpreted as a weighted mean:
$$\hat{\mu} = \frac{\sum_{i=1}^n w_i x_i}{\sum_{i=1}^n w_i} \tag{2.21}$$
with $w_i = W((x_i - \hat{\mu})/\hat{\sigma})$. Note that this is still an implicit equation, as the $w_i$ depend
on $\hat{\mu}$ itself. This suggests the following algorithm:

1. Compute $\hat{\sigma}$. Set $\hat{\mu}_0 = \text{Med}(x)$.

2. For k = 0, 1, ... compute the weights
   $$w_{k,i} = W((x_i - \hat{\mu}_k)/\hat{\sigma}).$$

3. Set $\hat{\mu}_{k+1} = \dfrac{\sum_{i=1}^n w_{k,i} x_i}{\sum_{i=1}^n w_{k,i}}$.

4. Stop when $|\hat{\mu}_{k+1} - \hat{\mu}_k| < \varepsilon\hat{\sigma}$.

Because a weighted mean like (2.21) is a special case of weighted least squares, this al-
gorithm is called iteratively reweighted least squares (IRLS). If W(x) is bounded and non-
increasing for x > 0, the sequence $\hat{\mu}_k$ is guaranteed to converge to a solution of (2.17).
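The algorithm takes only a few lines of R. The sketch below implements it for the
Huber ψ with the MADN as preliminary scale (a minimal version of our own, not the
MASS or robustbase implementations):

# Minimal sketch: IRLS for the Huber location M-estimator with MADN scale.
huberM.loc <- function(x, b = 1.345, eps = 1e-8, maxit = 100) {
  sigma <- mad(x)      # MADN (R's mad() includes the 1.4826 factor)
  mu <- median(x)      # starting value
  W <- function(u) ifelse(u == 0, 1, pmin(pmax(u, -b), b) / u)  # psi(u)/u
  for (k in 1:maxit) {
    w <- W((x - mu) / sigma)
    mu.new <- sum(w * x) / sum(w)             # weighted mean update (2.21)
    if (abs(mu.new - mu) < eps * sigma) break
    mu <- mu.new
  }
  mu.new
}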

The weight functions for the Huber estimator and the Tukey bisquare estimator are
shown in Figure 2.3. We see that the weights for the Huber estimator converge to zero
at the rate 1/x, whereas the Tukey estimator assigns zero weight to large outliers.

Figure 2.3 Weight functions for the Huber (b = 1.345 and b = 2.5) and Tukey
bisquare (c = 2.5 and c = 4.68) M-estimators.

2.8.2 Computing an M-estimator of scale


Based on $\rho_{scale}$ we now define the weight function
$$W(x) = \begin{cases} \rho(x)/x^2 & \text{if } x \ne 0 \\ \rho''(0)/2 & \text{if } x = 0 \end{cases} \tag{2.22}$$
(the value at 0 being the limit of $\rho(x)/x^2$ as x → 0). Then (2.13) is equivalent to
$$\hat{\sigma}^2 = \frac{1}{n\delta}\sum_{i=1}^n W\left(\frac{x_i}{\hat{\sigma}}\right) x_i^2$$

which suggests the following iterative reweighting algorithm:

1. For k = 0, 1, ... compute the weights
   $$w_{k,i} = W\left(\frac{x_i}{\hat{\sigma}_k}\right).$$

2. Set $\hat{\sigma}_{k+1} = \sqrt{\dfrac{1}{n\delta}\sum_{i=1}^n w_{k,i}\, x_i^2}$.

3. Stop when $|\hat{\sigma}_{k+1}/\hat{\sigma}_k - 1| < \varepsilon$.
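The same scheme is easily coded; the sketch below does it for the bisquare ρ, computing
δ numerically so that the estimator is consistent at the normal model. The tuning
constant c = 1.547 is a common choice (it then gives 50% breakdown); the function
name and defaults are our own assumptions.

# Minimal sketch: iterative computation of the bisquare M-estimator of scale.
bisqM.scale <- function(x, c = 1.547, eps = 1e-8, maxit = 100) {
  rho <- function(u) ifelse(abs(u) <= c,
                            u^2/2 - u^4/(2*c^2) + u^6/(6*c^4), c^2/6)
  delta <- integrate(function(t) rho(t) * dnorm(t), -Inf, Inf)$value
  W <- function(u) ifelse(u == 0, 1/2, rho(u) / u^2)  # rho''(0)/2 = 1/2 at 0
  sigma <- mad(x)                                     # robust starting value
  n <- length(x)
  for (k in 1:maxit) {
    w <- W(x / sigma)
    sigma.new <- sqrt(sum(w * x^2) / (n * delta))
    if (abs(sigma.new / sigma - 1) < eps) break
    sigma <- sigma.new
  }
  sigma.new
}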

2.8.3 Computing simultaneous M-estimates


In this case the previous algorithms can be combined as follows:

1. Compute starting values $\hat{\mu}_0$ and $\hat{\sigma}_0$.

2. For k = 0, 1, ... compute the weights
   $$w_{1k,i} = W_1((x_i - \hat{\mu}_k)/\hat{\sigma}_k) \quad\text{and}\quad w_{2k,i} = W_2((x_i - \hat{\mu}_k)/\hat{\sigma}_k)$$
   with $W_1$ and $W_2$ as in (2.20) and (2.22).

3. Set
   $$\hat{\mu}_{k+1} = \frac{\sum_{i=1}^n w_{1k,i}\, x_i}{\sum_{i=1}^n w_{1k,i}} \quad\text{and}\quad \hat{\sigma}_{k+1} = \sqrt{\frac{1}{n\delta}\sum_{i=1}^n w_{2k,i}(x_i - \hat{\mu}_k)^2}.$$

4. Stop when $|\hat{\mu}_{k+1} - \hat{\mu}_k| < \varepsilon\hat{\sigma}_{k+1}$ and $|\hat{\sigma}_{k+1}/\hat{\sigma}_k - 1| < \varepsilon$.

2.9 Other robust estimators of univariate location and
scale
The trimmed mean with parameter 0 ⩽ α < 0.5 and m = ⌊(n−1)α⌋ is defined as
$$\hat{\mu}_\alpha = \frac{1}{n - 2m}\sum_{i=m+1}^{n-m} x_{(i)}.$$
Its breakdown value is (m+1)/n.

The Hodges-Lehmann estimator (Hodges and Lehmann, 1963):
$$\hat{\mu} = \operatorname{med}_{i \le j}\left(\frac{x_i + x_j}{2}\right)$$
has breakdown value $1 - \sqrt{1/2} \approx 0.293$ (for n → ∞).
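Both estimators are one-liners in base R; for instance, using the income data anninc
from Section 2.1 (a quick sketch):

x <- anninc                              # e.g. the income data from Section 2.1
mean(x, trim = 0.2)                      # 20%-trimmed mean
pw <- outer(x, x, "+") / 2               # all pairwise averages (x_i + x_j)/2
median(pw[lower.tri(pw, diag = TRUE)])   # Hodges-Lehmann estimate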

The least median of squares (LMS) location estimator (Rousseeuw, 1984) is defined as
$$\hat{\mu} = \operatorname{argmin}_\mu\, \operatorname{med}_i (x_i - \mu)^2.$$
It equals the midpoint of the shortest 'half', i.e. the subsample containing h =
⌊n/2⌋ + 1 observations that possesses the shortest range. Its breakdown value is max-
imal. The length of the shortest half yields a robust estimator of scale.

The least trimmed squares (LTS) location estimator (Rousseeuw, 1984) is defined as
$$\hat{\mu} = \operatorname{argmin}_\mu \sum_{i=1}^h (x - \mu)^2_{(i)}$$
where $(x - \mu)^2_{(i)}$ is the i-th order statistic of the set $\{(x_i - \mu)^2;\ i = 1, \ldots, n\}$, so the
differences are first squared and then ordered. The LTS is the mean of the subsample
with h observations that has the smallest standard deviation. Its breakdown value is
maximal for h = ⌊n/2⌋ + 1. The standard deviation of the 'best' subsample yields a
robust scale estimator.

Many scale estimators such as the Stdev and the MAD require the estimation of a
center as a first step. The IQR does not use centering. A more robust scale estimator
that does not require centering is the Qn estimator (Rousseeuw and Croux, 1993). This
scale estimator is essentially the first quartile of all distances between two data points.
More precisely, the estimator is defined as
$$Q_n = 2.219\,\{|x_i - x_j|;\ i < j\}_{(k)} \tag{2.23}$$
with $k = \binom{h}{2} \approx \binom{n}{2}/4$ and $h = \lfloor n/2\rfloor + 1$. The breakdown value of $Q_n$ is 50%, whereas its
statistical efficiency at the normal model is 82%.

2.10 Functional definition of M-estimators
To study the asymptotic properties of M-estimators, we first give a general defini-
tion of M-estimators as statistical functionals.
A function T is called a statistical functional with domain $D_T$, where $D_T$ is a subset
of the set of all distribution functions, iff

1. For all $(x_1, \ldots, x_n)$ the empirical distribution function (edf) given by $F_n(t) = \frac{1}{n}\sum_{i=1}^n I(x_i \le t)$ belongs to $D_T$;

2. The function $(x_1, \ldots, x_n) \to T(F_n)$ is a Borel function on $\mathbb{R}^n$.

These conditions guarantee that a statistical functional can be computed at any sample
distribution Fn .
A statistical functional thus defines an estimator Tn through Tn := T ( Fn ). If T is
sufficiently regular, it often holds that T ( Fn ) → p T ( F ) and thus it makes sense to use
T ( Fn ) as an “estimator” of T ( F ). If our interest is in estimating the parameter θ of a
parametric family { Fθ ; θ ∈ Θ}, we call T Fisher-consistent iff T ( Fθ ) = θ. In that case it
is reasonable to estimate the parameter θ by T ( Fn ).

Examples:

• The expected value is a statistical functional defined by $T(F) = E_F[X] = \int x\,dF(x)$
  with $D_T = \{F : \int |x|\,dF(x) < \infty\}$. The corresponding estimator is the sample
  mean since
  $$T(F_n) = E_{F_n}[X] = \int x\,dF_n = \frac{1}{n}\sum_{i=1}^n x_i.$$
  From the law of large numbers we know that $T(F_n) \to_p T(F)$. Under the location
  model
  $$X_i \sim F_\mu \quad\text{where}\quad F_\mu(x) = F(x - \mu)$$
  this functional is Fisher-consistent iff $T(F_\mu) = \mu$ for all µ, which happens when
  $$T(F_\mu) = E_{F_\mu}[X] = \int x\,dF_\mu(x) = \int x f_\mu(x)\,dx = \int x f(x - \mu)\,dx = \int t f(t)\,dt + \mu = E_F(X) + \mu$$
  equals µ. Thus for a location model the mean is Fisher consistent iff the first
  moment of F equals 0. In that case $T(F_n) \to_p T(F_\mu) = \mu$, so the sample mean is a
  consistent estimator for the parameter µ.

• The median is a statistical functional defined by

T ( F ) = F −1 (0.5) = inf{ x : F ( x ) ⩾ 0.5}.

Note that T ( F ) is well-defined for any F and thus its domain DT is the set of all
probability distributions. For odd n, the corresponding estimator is the sample
median. If F is strictly increasing, one can prove that T ( Fn ) → p T ( F ). Under
the location model, T is Fisher consistent iff T ( F ) = 0. Thus if the median of F
equals 0, the sample median is a consistent estimator for µ.

M-estimators
M-estimators can be studied as statistical functionals as well. For a given function
Ψ(x, θ) define the functional T as the solution of the following implicit equation:
$$E_F[\Psi(X, T(F))] = 0. \tag{2.24}$$
The corresponding estimator solves the finite-sample version
$$E_{F_n}[\Psi(X, T(F_n))] = 0$$
which is written more clearly as
$$\frac{1}{n}\sum_{i=1}^n \Psi(x_i, T(F_n)) = 0. \tag{2.25}$$
For example, if we take
$$\Psi : \mathbb{R}^2 \to \mathbb{R} : (u, v) \to \psi(u - v)$$
then $T(F_n)$ corresponds to a univariate location M-estimator (2.10). Taking
$$\Psi : \mathbb{R}^2 \to \mathbb{R} : (u, v) \to \rho\left(\frac{u}{v}\right) - \delta$$
we obtain an M-estimator of scale (2.13).

2.11 Consistency of M-estimators
For a general M-estimator the convergence of T ( Fn ) to T ( F ) is established under
some conditions on the function Ψ (see Huber, 1964).

Theorem 2.1. Suppose

(i) $E_F[|\Psi(X, \theta)|] < \infty$ for every θ
(ii) $\Psi(x, \theta)$ is a nonincreasing function of θ
(iii) $\lim_{\theta \to -\infty} \Psi(x, \theta) > 0 > \lim_{\theta \to +\infty} \Psi(x, \theta)$

then, if T(F) is unique,
$$T(F_n) \to_p T(F)$$
for the functional T defined by $E_F[\Psi(X, T(F))] = 0$.

In the univariate location model one can check that the conditions are satisfied for
the mean, median, Huber M-estimator etc. Note that for redescending M-estimators
such as the bisquare condition (ii ) is not satisfied and thus the preceding theorem does
not apply immediately. However, a similar result can be proven in these cases as well,
under some extra technical conditions.

To obtain consistency we also have to verify that the M-estimator in question is
Fisher consistent, i.e. $T(F_\theta) = \theta$. In a specific situation this can usually be checked. Let
us first consider location M-estimators. If F is a symmetric distribution and ψ is odd,
$$E_{F_\mu}[\psi(X - \mu)] = \int_{\mathbb{R}} \psi(x - \mu) f_\mu(x)\,dx = \int_{\mathbb{R}} \psi(t) f(t)\,dt = 0$$
since ψ is odd and f is even. Therefore the parameter µ solves equation (2.24) defining
the M-estimator, so it holds that $T(F_\mu) = \mu$, which proves Fisher consistency. For a
scale M-estimator, we have that
$$E_{F_\sigma}\left[\psi\left(\frac{X}{\sigma}\right)\right] = E_{F_\sigma}\left[\rho\left(\frac{X}{\sigma}\right) - \delta\right] = \int_{\mathbb{R}} \rho\left(\frac{x}{\sigma}\right) f_\sigma(x)\,dx - \delta = \int_{\mathbb{R}} \rho(t) f(t)\,dt - \delta.$$
To obtain Fisher-consistency we need $E_{F_\sigma}[\psi(X/\sigma)] = 0$, which is indeed obtained if
we set $\delta = \int \rho(t) f(t)\,dt$ as mentioned before.

2.12 Asymptotic normality of M-estimators
In statistics one is not only interested in consistency, but also in the rate of conver-
gence. Fortunately, many estimators are asymptotically normal. For M-estimators the
following theorem can be proven.

Theorem 2.2. Suppose that $E[\Psi(X, T(F))^2] < \infty$ and that the derivative of $\Psi(x, \theta)$ with
respect to its second argument exists. Denote this derivative as the function $\dot{\Psi}$. Assume that
the conditions in Theorem 2.1 are satisfied and that $E_F[\dot{\Psi}(X, T(F))]$ is nonzero. Then it holds
that
$$\sqrt{n}(T(F_n) - T(F)) \to N(0, V(T, F))$$
with
$$V(T, F) = \frac{E_F[\Psi(X, T(F))^2]}{(E_F[\dot{\Psi}(X, T(F))])^2} \tag{2.26}$$
for T defined by $E_F[\Psi(X, T(F))] = 0$.

In other words, M-estimators with an appropriate Ψ function are asymptotically
normal with asymptotic variance V(T, F) given by (2.26).

Consider for example the gaussian location model with variance 1.

• For the mean we have that
  $$E_{F_\mu}[\Psi(X, T(F_\mu))^2] = E_{F_\mu}[(X - T(F_\mu))^2] = E_{F_\mu}[(X - \mu)^2] = \text{Var}_{F_\mu}[X] = 1.$$
  Moreover $\dot{\Psi} = -1$ and thus the asymptotic variance equals 1.

• For the median $\Psi(x, T(F_\mu)) = \text{sign}(x - T(F_\mu))$ and thus
  $$E_{F_\mu}[\Psi(X, T(F_\mu))^2] = 1.$$
  In this case the derivative of Ψ does not exist since the sign function is discon-
  tinuous in 0. However, one can write this derivative with a delta function: if
  a function g is discontinuous at a point u with left limit $g(u^-)$ and right limit
  $g(u^+)$, we can write the derivative of g in u as $(g(u^+) - g(u^-))\delta_u$. With this trick
  it is easy to see that
  $$E_{F_\mu}[\dot{\Psi}(X, T(F_\mu))] = -2f(0).$$
  At the normal model this means that the asymptotic variance equals $\pi/2 \approx 1.57$
  and is thus larger than that of the mean. This shows that the median is less
  efficient than the mean at the normal model, with ARE = $2/\pi = 0.64$. However,
  the opposite holds at some longer-tailed distributions.

2.13 The influence function
In classical statistics the focus is on the analysis of the approximation of T ( F ) by
T ( Fn ). In robust statistics we are also interested in the behavior of T when the distri-
bution F is slightly changed. To this end we consider a “neighborhood” around F of
contaminated distributions Fε,H defined as

Nε = { Fε,H = (1 − ε) F + εH : H ∈ H}

with H a suitable set of distributions. Intuitively, the value ε represents the percent-
age of outliers and the distribution H represents the distribution of these outliers. The
idea is to examine the difference between T ( F ) and T ( Fε,H ). For a robust estimator this
difference should not be too large. One can measure this difference in several ways. A
first concept is the influence function (Hampel, 1974; Hampel et al., 1986).

The influence function of a functional T at the distribution F in z is defined by
$$\text{IF}(z; T, F) = \lim_{\varepsilon \downarrow 0} \frac{T(F_{\varepsilon,\Delta_z}) - T(F)}{\varepsilon} \tag{2.27}$$
(if the limit exists), with $\Delta_z$ the discrete distribution putting all its probability mass
at z. The influence function thus measures the difference between applying T to the
'clean' distribution F and to the contaminated distribution $F_{\varepsilon,\Delta_z}$, in the limit ε ↓ 0. Often
it can be interpreted as a limit case of the sensitivity curve (2.5) too.

Special interest goes to the worst case influence
$$\gamma^*(T, F) = \sup_z \|\text{IF}(z; T, F)\|$$
which is called the gross error sensitivity (GES). In general, a lower GES indicates better
robustness.

The following three properties hold under certain regularity conditions (without
proof):

1. If $G \approx F$: $T(G) \approx T(F) + \int \text{IF}(z; T, F)\,dG(z)$.

2. $\int \text{IF}(z; T, F)\,dF(z) = 0$.

3. Let $V(T, F) = \int \text{IF}(z; T, F)^2\,dF(z)$; then
   $$\sqrt{n}(T_n - T(F)) \to N(0, V(T, F)).$$
   This is often a convenient way to compute the asymptotic variance.

For an M-estimator satisfying (2.24), we have that
$$\text{IF}(z; T, F) = \frac{\Psi(z, T(F))}{-E_F[\dot{\Psi}(X, T(F))]}.$$
Therefore, the influence function of an M-estimator is simply proportional to the func-
tion Ψ.

Consider for example the location model:

• For the mean we have $\Psi(x, \mu) = x - \mu$ and thus $\text{IF}(z; T, F) = z - \mu$. The gross
  error sensitivity is +∞. This tells us that the influence of an infinitesimally small
  amount of contamination can be arbitrarily large.

• For the median we have $\Psi(x, \mu) = \text{sign}(x - \mu)$, hence
  $$\text{IF}(z; T, F) = \text{sign}(z - \mu)/(2f(0)).$$
  The GES thus equals $(2f(0))^{-1}$. This illustrates that the median is more robust
  than the mean.

• For the Huber M-estimator with parameter b > 0, we find $\text{IF}(z; T, F) = \psi(z - \mu)/(2F(b) - 1)$ at a symmetric F. For the unit variance normal location model
  this yields:

      b          γ*(T, F)   eff(T)
      b → 0      1.25       0.64
      b = 1.5    1.73       0.96
      b → ∞      ∞          1

  Note the typical robustness-efficiency tradeoff: for b → 0 (the median) we have
  the lowest gross error sensitivity, but also the lowest efficiency. For b → ∞ (the
  mean) we obtain the best efficiency, but also the highest GES. Intermediate values
  of b provide a compromise between these extremes, e.g. b = 1.5.
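These numbers are easy to verify numerically at the unit-variance normal model; a
quick sketch (the function name is ours):

# Numerical check of the Huber GES and efficiency at the standard normal.
ges.eff <- function(b) {
  denom <- 2 * pnorm(b) - 1                      # E[psi'(X)] at N(0,1)
  Epsi2 <- integrate(function(x) pmin(abs(x), b)^2 * dnorm(x),
                     -Inf, Inf)$value            # E[psi(X)^2]
  c(GES = b / denom, eff = denom^2 / Epsi2)
}
ges.eff(1.5)   # approx. GES = 1.73, eff = 0.96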

2.14 The asymptotic breakdown value
For a parametric model { Fθ , θ ∈ Θ} with parameter space Θ, the asymptotic break-
down value (Hampel, 1971) of a statistical functional T at a distribution F is given by

ε∗ ( T, F ) = inf{ε > 0 : T ( Fε,H ) can be arbitrary close to the boundary of Θ}. (2.28)

Intuitively this represents the minimum amount of contamination that is needed to de-
stroy the result of the functional T. Consider for example the location model. Then the
parameter space is R and its boundary is the pair {−∞, +∞}. Therefore the formula
of the breakdown value for the location model becomes:

$$\varepsilon^*(T, F) = \inf\{\varepsilon > 0 : \sup_H |T(F_{\varepsilon,H})| = \infty\}. \tag{2.29}$$

In the univariate scale model (2.2), the parameter space is ]0, +∞[ with boundary
points {0, +∞}. One defines the explosion breakdown value as
$$\varepsilon^*_{exp}(T, F) = \inf\{\varepsilon > 0 : \sup_H T(F_{\varepsilon,H}) = +\infty\}$$
and the implosion breakdown value as
$$\varepsilon^*_{imp}(T, F) = \inf\{\varepsilon > 0 : \inf_H T(F_{\varepsilon,H}) = 0\}.$$
The overall breakdown value is defined as the smallest of these:
$$\varepsilon^*(T, F) = \min\left(\varepsilon^*_{exp}(T, F), \varepsilon^*_{imp}(T, F)\right).$$

Often the asymptotic breakdown value is the limit of the finite-sample breakdown
value of (2.6):
$$\varepsilon_n^*(T(F_n), F_n) \to \varepsilon^*(T, F).$$

Examples:

• The mean has asymptotic breakdown value ε∗(T, F) = 0. This follows from

      E_{F_{ε,∆z}}[X] = (1 − ε)E_F[X] + εE_{∆z}[X] = (1 − ε)E_F[X] + εz,

  which becomes arbitrarily large as z → ∞, and thus

      sup_z |E_{F_{ε,∆z}}[X]| = ∞,

  and consequently

      sup_H |T(F_{ε,H})| = ∞

  for every ε > 0.

• The highest possible asymptotic breakdown value for location equivariant loca-
tion estimators is 0.5.

• Any location M-estimator with nondecreasing, bounded and odd ψ function at-
tains this maximal breakdown value of 0.5. For nondecreasing ψ functions that
are not odd but satisfy ψ(−∞) = −k1 > −∞ and ψ(+∞) = k2 < ∞ it can be
proved that
      ε∗(T, F) = min(k₁, k₂)/(k₁ + k₂).
• The asymptotic breakdown value of the Stdev is 0, since the explosion break-
down value equals 0.

• The asymptotic breakdown value of the MAD is 0.5 at continuous distributions.

• For an M-estimator of scale (2.13) with even function ρ, the asymptotic break-
down value at a symmetric distribution F is

      ε∗(T, F) = min( ε∗_exp(T, F), ε∗_imp(T, F) ) = min( δ/ρ(∞), 1 − δ/ρ(∞) ).

2.15 The maxbias curve
The maxbias curve of a statistical functional T at F measures the worst case bias for
every contamination percentage ε > 0 (contrary to the influence function which only
shows the limiting case ε ↓ 0 and only considers point contamination). For a location
model the maxbias curve is defined by

    maxbias(ε; T, F) = sup_{G∈N_ε} |T(G) − T(F)| = sup_H |T(F_{ε,H}) − T(F)|.    (2.30)

It follows that the asymptotic breakdown value can be derived from the maxbias
curve:

ε∗ ( T, F ) = inf{ε > 0 : maxbias(ε; T, F ) = ∞}.

At the scale model, we instead consider the explosion maxbias curve

    maxbias_exp(ε; T, F) = sup_H T(F_{ε,H})

and the implosion maxbias curve

    maxbias_imp(ε; T, F) = inf_H T(F_{ε,H}).

Then we obtain

ε∗ ( T, F ) = inf{ε > 0 : maxbiasexp (ε; T, F ) = ∞ or maxbiasimp (ε; T, F ) = 0}.

The maxbias curve is often hard to compute.

Examples at the location model:

• The mean has maxbias(ε; T, F ) = +∞ for all ε > 0, since ε∗ ( T, F ) = 0.

• For an M-estimator with bounded, nondecreasing and odd ψ function with


ψ(∞) = k, and for F continuous and symmetric around zero, it can be proven
that

    E_F[ψ( X + maxbias(ε; T, F) )] = εk/(1 − ε).    (2.31)
The maxbias curve of a specific M-estimator can be found by solving this equa-
tion. For instance, when T is the median we have k = 1 and some calculations
show that

      maxbias(ε; Med, F) = F^{−1}( 1/(2(1 − ε)) ).    (2.32)

70
In fact, under regularity conditions on F it has been shown (Huber, 1964) that
the median has the lowest maxbias curve among all translation equivariant esti-
mators of location (for all 0 ⩽ ε < 1/2). This is called the "min-max bias property"
of the median.

Figure 2.4 plots the maxbias curve (2.32) of the median at the standard normal distri-
bution (F = Φ). This graph combines the maxbias curve, the gross-error sensitivity
γ∗(T, F) and the asymptotic breakdown value ε∗ (which is 1/2 in this case).

[Figure: the maxbias curve sup_{G∈N_ε} |T(G) − T(F)| as a function of ε, starting at 0 with slope γ∗(T, F) and with a vertical asymptote at ε = ε∗.]

Figure 2.4 Maxbias curve of the median at the standard normal.
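Formula (2.32) underlying the figure is also easy to evaluate directly; a minimal numerical sketch (again assuming Python with numpy and scipy) is:

    # Sketch: the maxbias curve (2.32) of the median at F = Phi.
    import numpy as np
    from scipy.stats import norm

    eps = np.linspace(0.0, 0.45, 10)
    mb = norm.ppf(1.0 / (2.0 * (1.0 - eps)))   # F^{-1}( 1/(2(1 - eps)) )
    print(np.round(mb, 3))
    # the slope near eps = 0 approaches the GES 1/(2 phi(0)) ~ 1.2533,
    # and mb diverges as eps -> 1/2, in line with eps* = 1/2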

Chapter 3

Regression M-estimators

One of the reasons for the lasting success of M-estimators is their vast area of applications, of which regression analysis is an important example. In this chapter we
discuss robust estimation of the parameters of linear regression models. M-estimators
for regression are constructed in the same way as location M-estimators. However,
regression poses some peculiar and difficult robustness problems making robust re-
gression and its applications an active area of research, even after nearly 40 years of
investigation. To motivate the need for robustness, let us consider a simple illustrative
example.

Example. The data in Table 3.1 correspond to an experiment on the speed of learning of rats. Times were recorded for a rat to go through a shuttle box in successive attempts. If the time exceeded 5 seconds, the rat received an electric shock for the duration of the next attempt. The data are the number of shocks received and the average time for all attempts between shocks.

Shocks Time Shocks Time


0 11.4 8 5.7
1 11.9 9 4.4
2 7.1 10 4.0
3 14.2 11 2.8
4 5.9 12 2.6
5 6.1 13 2.4
6 5.4 14 5.2
7 3.1 15 2.0

Table 3.1 Rats data (Maronna et al., 2006)

Figure 3.1 depicts the data and the straight line fitted by least-squares (LS) corresponding to the linear regression model

y i = β 0 + β 1 x i + ϵi .

Figure 3.1 Rats data with the three estimators superimposed.

The relationship between the variables is seen to be roughly linear except for the
three upper left points. The LS line does not fit the bulk of the data, but is a compro-
mise between those three points and the other observations. The plot also contains the
more resistant least-absolute-deviations (LAD) estimator and the least-squares estima-
tor computed from the data without the three points in the upper left corner. These
two estimators give a better representation of the relation in the majority of the data,
while pointing out the exceptional character of points 1, 2 and 4 by their large devia-
tions from the fit.
The LS estimator fits a straight line by finding β̂₀ and β̂₁ such that the sum of squared residuals

    ∑_{i=1}^{n} |y_i − β₀ − β₁x_i|²,    (3.1)

is minimized. Note that location estimation is a special case, obtained by setting β 1 = 0


in (3.1). In this case, the solution is the sample mean, i.e., the LS estimate of location is
the average of the yi s. In contrast, the LAD estimator minimizes
    ∑_{i=1}^{n} |y_i − β₀ − β₁x_i|.

From our example, it can be seen that the LAD estimate is more resistant to outlying
observations. Intuitively, this is because the absolute value loss increases more slowly
than the squared loss. Note that LAD generalizes the sample median which is recov-
ered in the special case of location estimation.
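The LS and LAD fits of Figure 3.1 can be reproduced with standard software. The following is a minimal sketch (assuming Python with numpy and statsmodels; LAD is computed as median, i.e. quantile, regression):

    # Sketch: LS and LAD fits for the rats data of Table 3.1.
    import numpy as np
    import statsmodels.api as sm

    shocks = np.arange(16.0)
    time = np.array([11.4, 11.9, 7.1, 14.2, 5.9, 6.1, 5.4, 3.1,
                     5.7, 4.4, 4.0, 2.8, 2.6, 2.4, 5.2, 2.0])
    X = sm.add_constant(shocks)

    beta_ls = sm.OLS(time, X).fit().params              # least squares
    beta_lad = sm.QuantReg(time, X).fit(q=0.5).params   # LAD (median regression)
    print(beta_ls, beta_lad)

The LAD fit follows the bulk of the points more closely, in line with the discussion above.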
Now, consider the more general case of a dataset consisting of n observations
(x1 , y1 ), . . . , (xn , yn ) with the xi being p-dimensional vectors, i.e., xi = ( xi1 , . . . , xip )⊤ ,
and the yi real random variables. As usual, the xi are referred to as predictor variables
while the yi are called responses. The data are assumed to follow the linear model

yi = xi⊤ β0 + ϵi , i = 1, . . . , n, (3.2)

where β0 is a p-dimensional vector of unknowns to be estimated from the data and ϵ_i are random variables (the "errors"). When the model has a constant term, the first coordinate of each x_i is taken to be 1 and we may more explicitly write

yi = β 0 + xi⊤ β1 + ϵi ,

where xi = ( xi2 , . . . , xip )⊤ and β1 ∈ R p−1 with β0⊤ = ( β 0 , β1⊤ ). In a designed exper-
iment, the predictors x_i are nonrandom (or fixed). Such a situation often arises, for ex-
ample, in controlled lab experiments. When the data are observational, the predictors
xi are random variables generated from a p-variate distribution, for example, a mul-
tivariate Gaussian distribution. In this chapter we focus on fixed predictors, see the
course Robust Statistics for the case of random designs and the implications thereof.
With an estimator β̂_n at our disposal, the fitted values ŷ_i and residuals r_i of the fit are defined as

    ŷ_i(β̂_n) = x_i^⊤ β̂_n    and    r_i(β̂_n) = y_i − ŷ_i(β̂_n),

respectively. The dependence of the fitted values and residuals on β̂_n will be suppressed whenever that is clear from the context. In order to combine robustness and efficiency, we discuss regression M-estimators β̂_n defined as minimizers of an objective function of the form

    M_n(β) := ∑_{i=1}^{n} ρ( r_i(β)/σ̂_n ).    (3.3)

Here, ρ is a non-negative, nondecreasing function of |x| satisfying ρ(0) = 0, and σ̂_n is an auxiliary scale estimate that is required to make β̂_n scale equivariant (see below). The LS and LAD estimators correspond to ρ(t) = t² and ρ(t) = |t|, respectively. In these two cases, σ̂_n becomes a constant factor outside the summation sign and minimizing (3.3) is equivalent to minimizing ∑_{i=1}^{n} r_i²(β) or ∑_{i=1}^{n} |r_i(β)|, respectively. Thus neither the LS nor the LAD estimator requires an auxiliary scale estimator.

3.1 Review of the LS method
Let us begin by reviewing some basic facts about the computation and theoretical properties of the universally known least-squares estimator. Setting X = [x₁, . . . , x_n]^⊤, the n × p matrix with rows x_i^⊤, it may be seen that with ρ(t) = t² the solution β̂_LS to (3.3) satisfies

    X^⊤X β̂_LS = X^⊤y.

Hence, if X has full rank,

    β̂_LS = (X^⊤X)^{−1} X^⊤y.    (3.4)

If the model contains a constant term, then the first column of X is identically one, and
the full rank condition implies that no other column is constant. If X is not of full rank,
we have linear dependencies between the columns of X (extreme multicollinearity) and
other methods of estimation are needed (see following chapters).
The LS estimator satisfies the following three important, easily verifiable proper-
ties:

    β̂_LS(X, y + Xγ) = β̂_LS(X, y) + γ    for all γ ∈ R^p,    (3.5)

    β̂_LS(X, λy) = λ β̂_LS(X, y)    for all λ ∈ R,    (3.6)

and for all nonsingular p × p matrices A,

    β̂_LS(XA, y) = A^{−1} β̂_LS(X, y).    (3.7)

These properties are called regression, scale and affine equivariance, respectively. They
are desirable properties for any regression estimator, since they allow us to know how
the estimate changes under linear transformations of the data. Consider, for example,
transforming y according to

y⋆ = y + Xγ.

If model (3.2) holds for response y, then the transformed response satisfies y⋆ =
X( β0 + γ) + ϵ. Thus, this transformation amounts to adding γ to the coefficient vector
β0 and the value of (3.5) becomes clear. Likewise, if X⋆ = XA for some invertible A,
then y satisfies the model

y = X⋆ A−1 β0 + ϵ,

which is our linear model with X replaced by X⋆ and β0 by A−1 β0 . Again, it is de-
sirable that the estimate transforms in the same way. Finally, scale equivariance (3.6)
is concerned with the units of measurement of the response variable y. Consider, for

76
example, the situation where y is the height of sampled men measured in centimeters.
If it is now preferred to measure the height in meters, then scale equivariance guar-
antees that the estimate transforms in the same manner and thus there is no need to
estimate the regression coefficients anew.
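The three equivariance properties are easy to check numerically; a small sketch (assuming Python with numpy; the data below are simulated purely for illustration) is:

    # Sketch: numerical check of the equivariance properties (3.5)-(3.7).
    import numpy as np

    rng = np.random.default_rng(0)
    n, p = 30, 3
    X = rng.normal(size=(n, p))
    y = rng.normal(size=n)
    ls = lambda X, y: np.linalg.solve(X.T @ X, X.T @ y)

    gamma, lam, A = np.ones(p), 2.5, rng.normal(size=(p, p))
    print(np.allclose(ls(X, y + X @ gamma), ls(X, y) + gamma))      # (3.5)
    print(np.allclose(ls(X, lam * y), lam * ls(X, y)))              # (3.6)
    print(np.allclose(ls(X @ A, y), np.linalg.solve(A, ls(X, y))))  # (3.7)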
In the fixed design case, it is well-known that the LS estimator is consistent and
asymptotically normal under suitable assumptions. The most common assumptions
are that the errors have mean zero and variance-covariance matrix σ2 I (spherical sym-
metry). Under these assumptions the LS estimator is unbiased, i.e.,

    E{β̂_LS} = β₀    (3.8)

and has variance-covariance matrix given by

    Var{β̂_LS} = σ²(X^⊤X)^{−1}.    (3.9)

The assumption of centred (mean-zero) errors ϵi may be relaxed to the requirement


that E{ϵ1 } = α, for some α ∈ R. In that case, it can be shown that the bias will only
affect the intercept and not the slopes. From (3.8)-(3.9) and the mean-squared error
decomposition we obtain

    E{∥β̂_LS − β₀∥²} = σ² trace( (X^⊤X)^{−1} ).

A related result is the normality of the LS estimator, which is extensively used for
inferential purposes. If the ϵ_i are normally distributed and X is of full rank, then β̂_LS is multivariate normal,

    β̂_LS ∼ N_p( β₀, σ²(X^⊤X)^{−1} ),    (3.10)

where N p (µ, Σ) denotes the p-variate normal distribution with mean vector µ and co-
variance matrix Σ. This follows immediately from the formula
n
(X⊤ X)1/2 ( β
b − β ) = (X⊤ X)−1/2 X⊤ ϵ =
LS 0 ∑ αni ϵi ,
i =1

where α_{n1}, . . . , α_{nn} are the columns of the (p × n) matrix (X^⊤X)^{−1/2}X^⊤. If the ϵ_i are not normally distributed (and they cannot be expected to be), but have a finite variance, then using the central limit theorem it can be shown that for large n the estimator β̂_LS is approximately normal, with parameters given by (3.10), provided that the Lindeberg condition holds for the x_i. This condition ensures the absence of leverage points and essentially boils down to

    lim_{n→∞} max_{1≤i≤n} x_i^⊤(X^⊤X)^{−1} x_i = 0.    (3.11)

Notice that these are just the diagonal elements of the hat-matrix. See van der Vaart
(1998) and Huber and Ronchetti (2009) for more details.
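The quantities appearing in (3.11) can be inspected directly as the leverages; a minimal sketch (numpy assumed, simulated design purely for illustration) is:

    # Sketch: the Lindeberg quantities in (3.11) are the hat-matrix diagonals.
    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 4))
    G = np.linalg.inv(X.T @ X)
    h = np.einsum('ij,jk,ik->i', X, G, X)   # h_i = x_i' (X'X)^{-1} x_i
    print(h.max())                          # small if there are no leverage points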
Both consistency and asymptotic normality of the LS estimator have been derived
under the assumption that the error possesses two moments, which may not hold
in practice. The purpose of developing M-estimators for regression is to relax these
distributional assumptions as much as possible, while still ensuring the equivariance
properties given in (3.5)–(3.7) as well as a good trade-off between robustness and effi-
ciency. We now formally define this general class of estimators.

3.2 M-estimators of regression


As we have seen previously, M-estimators of location and scale arise as a general-
ization of maximum likelihood estimation. Regression M-estimators are obtained in
the same way. Let us assume model (3.2) with fixed X, where ϵ_i has a density

    σ^{−1} f₀(ϵ/σ),

with σ a scale parameter. For the linear model (3.2) the y_i are independent but clearly not identically distributed (their distribution depends on x_i). Indeed, each y_i has density

    σ^{−1} f₀( (y − x_i^⊤β₀)/σ ),    y ∈ R,
and, assuming a fixed value of σ, the likelihood function for β is

    L(β) = σ^{−n} ∏_{i=1}^{n} f₀( (y_i − x_i^⊤β)/σ ).

The maximum likelihood estimator β̂ is obtained by maximizing L(β) or, equivalently, the log-likelihood. Setting derivatives equal to zero, we find that β̂ solves

    ∑_{i=1}^{n} ψ₀( r_i(β)/σ ) x_i = 0,    (3.12)

where ψ₀ = ρ₀′ with ρ₀ = − log f₀. If f₀ is the standard normal density, then β̂ is the LS estimator. If f₀ is the double-exponential (Laplace) density, then β̂ is the LAD estimator. In this case, it can be seen that β̂ satisfies

    ∑_{i=1}^{n} sign( r_i(β̂) ) x_i = 0.

If the model contains an intercept term, this implies that the residuals have zero me-
dian.
Unlike the LS estimator, there are in general no explicit expressions for the LAD
estimator. However, there exist very fast convex optimization algorithms to compute
it. A complication is that the LAD estimate may not be unique for "small" samples.
The LAD estimate also has the intriguing property that it sets at least p residuals to
zero, so that the estimator fits exactly part of the data. We refer to Koenker (2005) for
more details concerning the LAD estimator.
More generally, we define regression M-estimators β̂_n as solutions

    β̂_n = argmin_{β∈R^p} ∑_{i=1}^{n} ρ( r_i(β)/σ̂_n ),    (3.13)

for some data-determined scale estimate σ̂_n and a ρ-function satisfying the following properties.

R1 ρ( x ) is even and a nondecreasing function of | x |.

R2 ρ(0) = 0.

R3 If ρ( x ) is bounded, it is also assumed that ρ(∞) = 1.

As in the location problem, M-estimators do not need to correspond to maximum-


likelihood estimators.
In most situations bσn is a preliminary estimator of scale. Simultaneous estimation
using a scale M-estimating equation is also possible, but as in the location-scale case
this often leads to increased computational burden and inferior robustness properties.
If the scale σ is known, it may be readily verified that M-estimates are regression and
affine equivariant but not necessarily scale equivariant (see the following subsections).
If ρ is continuously differentiable with derivative ψ, then it can immediately be seen that the M-estimator β̂_n satisfies

    ∑_{i=1}^{n} ψ( r_i(β̂_n)/σ̂_n ) x_i = 0.    (3.14)

Note that the solution of (3.14) is not identical to the minimizer of (3.13), unless ρ
is convex. For non-convex functions, one has to account for the possibility of local
minima.
M-estimators based on convex ρ-functions are referred to as "monotone regression M-estimators". The reason for this terminology is that the derivative of a differen-
tiable convex function is necessarily nondecreasing. Recall that convex functions are
differentiable almost everywhere, so that if they fail to be differentiable at some point,

then one can select a nondecreasing subgradient. M-estimators based on non-convex
ρ-functions, such as the Tukey-bisquare, are referred to as redescending M-estimators.
The main advantage of monotone M-estimators over their redescending counterparts
is their cheaper computation. On the other hand, the latter class of estimators can
achieve a better trade-off between robustness and efficiency but its computation re-
quires special care.
Finally, we discuss a common estimation strategy to obtain an estimate of the scale
σn . Recall that in the location-scale problem we estimated σ using the MAD. Here,
the equivalent procedure would be to first compute a LAD estimator, which does not
require knowledge of the scale. From its residuals we can obtain the analog of the
(normalized) MAD by taking the median of the nonnull absolute residuals:

    σ̂_n = (1/0.675) Med_{1≤i≤n}( |r_i| : r_i ≠ 0 ).    (3.15)

The reason for using only nonnull residuals is that since at least p residuals are zero
for the LAD estimator, including all residuals could lead to underestimation of σ when
p is large. The standardization factor (0.675)−1 ensures that the estimator is Fisher-
consistent for the normal distribution, since for X ∼ N(0, 1) the median of | X | is 0.675.
Let us write σ̂_n in (3.15) as σ̂_n(X, y). Then, since the LAD estimator is regression, scale and affine equivariant, it follows that

    σ̂_n(X, y + Xγ) = σ̂_n(X, y),    σ̂_n(XA, y) = σ̂_n(X, y),    σ̂_n(X, λy) = |λ| σ̂_n(X, y),    (3.16)

for all γ ∈ R^p, nonsingular A ∈ R^{p×p} and λ ∈ R. We say that σ̂_n is regression and affine invariant, and scale equivariant. If we obtain a regression M-estimator by solving (3.13) or (3.14) with this σ̂_n, then (3.16) implies that β̂_n is regression, affine and scale equivariant.
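A minimal sketch of the scale estimate (3.15) (assuming Python with numpy and statsmodels for the LAD start; the numerical threshold used to detect the null residuals is an illustrative choice) is:

    # Sketch: the auxiliary scale (3.15) from the nonnull LAD residuals.
    import numpy as np
    import statsmodels.api as sm

    def lad_scale(X, y):
        r = y - X @ sm.QuantReg(y, X).fit(q=0.5).params   # LAD residuals
        r = np.abs(r[np.abs(r) > 1e-10])                  # drop the (at least p) null residuals
        return np.median(r) / 0.675                       # normalized MAD analog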

3.3 Computing M-estimators of regression


The computation of M-estimators in the regression context is usually carried out through a generalization of the IRLS algorithm discussed before. The algorithm assumes that ψ(x) is continuous and lim_{x→0} ψ(x)/x is finite, so that for some special cases, such as ψ(x) = sign(x), one has to employ different optimization methods (see, e.g., Koenker, 2005).
Define the weight function W(x) = σ̂_n ψ(x/σ̂_n)/x. It is easy to see that the root of (3.14) satisfies

    ∑_{i=1}^{n} w_i x_i ( y_i − x_i^⊤β̂_n ) = 0,

with w_i = W(r_i(β̂_n)). This set of equations may be perceived as weighted normal equations, but the weights w_i depend upon the data through β̂_n. Nevertheless, this suggests an iterative scheme to calculate β̂_n.

1. Compute an initial LAD estimator β̂₀ and compute σ̂_n from (3.15).

2. For k = 0, 1, 2, . . . :

   (a) Given β̂_k, compute r_{i,k} = r_i(β̂_k) and w_{i,k} = W(r_{i,k}).

   (b) Compute β̂_{k+1} by solving

       ∑_{i=1}^{n} w_{i,k} x_i ( y_i − x_i^⊤β̂_{k+1} ) = 0.

3. Stop when max_i |r_{i,k} − r_{i,k+1}| < σ̂_n ϵ, for some predetermined tolerance ϵ.
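A compact sketch of this scheme for the Huber ψ (assuming Python with numpy; the cutoff b = 1.345 is illustrative, and the starting value and scale are taken as given, e.g. from LAD and (3.15)) is:

    # Sketch: IRLS for a regression M-estimator with the Huber psi.
    import numpy as np

    def irls_huber(X, y, beta0, sigma, b=1.345, eps=1e-8, max_iter=50):
        beta = beta0.copy()
        r_old = y - X @ beta
        for _ in range(max_iter):
            u = r_old / sigma
            # Huber weights: W(r) = psi(r/sigma)/(r/sigma), with W(0) = 1
            w = np.where(np.abs(u) <= b, 1.0, b / np.maximum(np.abs(u), 1e-12))
            XtW = X.T * w
            beta = np.linalg.solve(XtW @ X, XtW @ y)   # weighted normal equations
            r_new = y - X @ beta
            if np.max(np.abs(r_new - r_old)) < sigma * eps:
                break
            r_old = r_new
        return beta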

In practice, the algorithm converges very fast. Often less than 10 iterations are
needed for reasonable tolerance levels. The following result provides theoretical justi-
fication for the IRLS algorithm.
Theorem 3.1. Assume that X is of full rank, ψ is continuous and W is nonincreasing in |x| with a well-defined limit lim_{x→0} W(x). Then the IRLS algorithm monotonically decreases the objective function

    h(β) = ∑_{i=1}^{n} ρ( r_i(β)/σ̂_n ).

If ρ is strictly convex, then the sequence {β_k}_k converges to the unique stationary point β_S of h(β), satisfying

    ∑_{i=1}^{n} ψ( r_i(β_S)/σ̂_n ) x_i = 0.    (3.17)

The first set of conditions is satisfied for almost all commonly used ρ-functions, including the Huber and the bisquare loss functions. The condition that ψ is nondecreasing, i.e., that ρ is convex, serves to preserve the equivalence between (3.13) and (3.14). It can be shown that omitting this assumption has the consequence that the algorithm still converges, but convergence can be to a local minimum. For strictly convex ρ-functions, the theorem has the remarkable implication that the algorithm converges to the stationary point in (3.17) irrespective of the starting point.
Proof. To make the notation more tractable, we absorb σ̂_n into the ρ-function. That is, we write ρ(x) instead of ρ(x/σ̂_n). For r ≥ 0 let g(r) := ρ(√r). Since ρ(r) = g(r²) we have ψ(r) = 2r g′(r²) and consequently

    W(r) = ψ(r)/r = 2g′(r²).    (3.18)
This implies that W (r ) is nonincreasing for r ≥ 0 if and only if g′ is nonincreasing.
We claim that g is concave, that is, the graph of g lies below the tangent line. For
differentiable functions this is equivalent to

g(y) ≤ g( x ) + g′ ( x )(y − x ), (3.19)

for all x ̸= y. To see that (3.19) holds, first assume that y > x. Then, by the mean value
theorem of calculus, we have

g ( y ) − g ( x ) = ( y − x ) g ′ ( ξ ),

for some ξ ∈ ( x, y). But since g′ is nonincreasing, g′ (ξ ) ≤ g′ ( x ) and (3.19) holds. The
case y < x is handled similarly. We have thus established that g is concave.
Now, define the matrix

    U(β) = ∑_{i=1}^{n} W(r_i(β)) x_i x_i^⊤,

which, by the nonnegativity of W, is positive semidefinite for all β. Define also the function

    f(β) = argmin_{γ∈R^p} ∑_{i=1}^{n} W(r_i(β)) ( y_i − x_i^⊤γ )².

The algorithm can then be written as

    β_{k+1} = f(β_k).

Note that this is a compact way of saying that for each iteration we are solving a
weighted least-squares problem with weights depending on the previous iteration.
A fixed point β₀, i.e., one satisfying f(β₀) = β₀, is, by definition, also a stationary point of h(β) satisfying (3.17), as ψ is continuous. Put for simplicity w_i := W(r_i(β_k)). Then β_{k+1} satisfies

    ∑_{i=1}^{n} w_i x_i y_i = ∑_{i=1}^{n} w_i x_i x_i^⊤ β_{k+1} = U(β_k) β_{k+1}.    (3.20)

Using (3.19) and (3.18) we have

    h(β_{k+1}) − h(β_k) ≤ ∑_{i=1}^{n} g′( |r_i(β_k)|² ) ( |r_i(β_{k+1})|² − |r_i(β_k)|² )
                       = (1/2) ∑_{i=1}^{n} w_i ( r_i(β_{k+1}) − r_i(β_k) )( r_i(β_{k+1}) + r_i(β_k) ).

But since r_i(β_{k+1}) − r_i(β_k) = (β_k − β_{k+1})^⊤ x_i and r_i(β_{k+1}) + r_i(β_k) = 2y_i − x_i^⊤(β_k + β_{k+1}), we have, using (3.20),

    h(β_{k+1}) − h(β_k) ≤ (1/2) (β_k − β_{k+1})^⊤ ∑_{i=1}^{n} w_i x_i x_i^⊤ ( 2β_{k+1} − β_k − β_{k+1} )
                       = −(1/2) (β_k − β_{k+1})^⊤ U(β_k) (β_k − β_{k+1}) ≤ 0,

as U(β) is positive semidefinite. This proves that the algorithm monotonically decreases h(β) and hence establishes the first part of the theorem.
To prove the convergence of {β_k}_k to β_S, first note that since the sequence {h(β_k)}_k is nonincreasing and bounded from below (by zero), it has a limit h₀. This implies that {β_k}_k is bounded, for if that were not the case there would exist a subsequence {β_{k_j}}_j tending to infinity. However, by assumption ρ is strictly convex and thus strictly increasing. Therefore {h(β_{k_j})}_j would also tend to infinity, contradicting the convergence of {h(β_k)}_k.
Now, since {β_k}_k is bounded, it has a subsequence converging to a limit β_S, which by continuity of f satisfies

    f(β_S) = β_S,

and by our previous remarks this is a stationary point of h(β). If the limit is unique, then β_k → β_S and we are done. Otherwise, there would exist a subsequence bounded away from β_S, which in turn would have a subsequence converging to a stationary point different from β_S. However, that is not possible because the stationary point in (3.17) is unique due to the strict convexity of ρ. The proof is complete. □

3.4 Robustness properties

3.4.1 Finite-sample breakdown value


Recall the definition of the finite-sample breakdown value for univariate location
and scale estimators. This concept can directly be extended to the regression setting.
Since X is considered fixed we may measure the finite-sample breakdown value (FSB)
by

b ) = max m : β
n o
ϵn⋆ ( β n
b (X, ym ) remains bounded ∀ym ∈ Ym , (3.21)
n
where Ym is the set of length n vectors with at least n − m elements in common with y.
It is clear that for the LS estimator we have ϵ⋆_n(β̂_LS) = 1/n, since the estimator is linear in y.

The FSB, in general, depends on how well-conditioned the predictor matrix X is.
Let k⋆ = k⋆ (X) be the maximum number of xi lying on the same subspace (hyperplane)
of dimension smaller than p, that is,

    k⋆(X) = max{ #{i : θ^⊤x_i = 0} : θ ∈ R^p, θ ≠ 0 },

where a subspace of dimension 0 is the set {0}. In the case of simple straight-line
regression k⋆ is the maximum number of repeated xi . We always have k⋆ ≥ p − 1
as one needs at least p − 1 points to obtain a p − 1 dimensional hyperplane. If k⋆ =
p − 1 the data set is said to be in general position. This means that every set of p − 1
observations determines a unique hyperplane. It can be shown (see Maronna et al.,
2006) that for all regression equivariant estimators it holds that
    ϵ_max = (1/n) ⌊(n − k⋆ − 1)/2⌋ ≤ (1/n) ⌊(n − p)/2⌋.    (3.22)
The FSB of monotone regression estimators is, in general, difficult to describe pre-
cisely and no closed-form expressions exist. From the available results we know that
monotone M-estimators with bounded ψ attain the maximal FSB for one-way and two-
way analysis of variance but other than that their breakdown value is far below (3.22).
For example, in straight-line fitting it can be shown that ϵ⋆_n(β̂_n) ≈ 0.29. Moreover, the breakdown value can depend on p, and in certain cases we have ϵ⋆_n = O(p^{−1/2}). Thus, for high-dimensional datasets, the FSB can be very close to zero.
Redescending M-estimators can reach a much higher FSB, but normally one needs to start from a robust estimator in order to avoid "bad" solutions. In particular, if β̂_in denotes an initial regression estimator, then Yohai (1987) shows

    ϵ⋆_n(β̂_n) ≥ min( ϵ⋆_n(β̂_in), (1/2) · (1 − 2(p − 1)/n)/(1 − (p − 1)/n) ),    (3.23)
for any sample in general position (i.e., k⋆ (X) = p − 1). Thus, for large n and fixed p
the FSB of redescending M-estimators can be arbitrarily close to (3.22), but only if the
initial estimator is itself very robust. Such estimators exist but are beyond our scope,
see the Robust Statistics course for some examples. Their common disadvantage is
their computational burden, which is far worse than that of M-estimators.

3.4.2 Influence function


For the fixed design, the influence function (IF) of regression M-estimators, monotone and redescending alike, can be shown to be

    IF(y₀, β̂, F) = σ [ ∑_{i=1}^{n} E_F{ ψ′( (y_i − x_i^⊤β₀)/σ ) } x_i x_i^⊤ ]^{−1} ∑_{i=1}^{n} ψ( (y₀ − x_i^⊤β₀)/σ ) x_i,

where, given that the design is fixed, we assume that outliers only lie in the response space. It follows that the IF is bounded as a function of y₀ whenever the score function ψ is bounded. The IF behaves rather differently for monotone and redescending score functions. In particular, for |y₀| → ∞ the term ψ( (y₀ − x_i^⊤β₀)/σ ) approaches either sup_x |ψ(x)| or 0, depending on whether ψ is monotone or redescending. Thus, redescending M-estimators are impervious to gross infinitesimal contamination.

3.5 Asymptotic distribution of regression M-estimators


Consider the regression M-estimator β̂_n solving (3.13). What can we say about its asymptotic distribution? In general, the asymptotic distribution depends on the type of ρ-function used, the covariates x_i and the errors ϵ_i. We assume herein that the ϵ_i are independent and identically distributed (iid), but this may not hold in all practical settings (e.g., in time series analysis).
For an appreciation of the difficulties in deriving the distribution of regression M-estimators, consider a first-order Taylor expansion of the score function in (3.14) about the true parameter β₀, i.e.,

    ∑_{i=1}^{n} ψ( r_i(β̂_n)/σ̂_n ) x_i ≈ ∑_{i=1}^{n} ψ( r_i(β₀)/σ̂_n ) x_i − ∑_{i=1}^{n} ψ′( r_i(β₀)/σ̂_n ) (x_i x_i^⊤/σ̂_n) ( β̂_n − β₀ ).

Multiplying by √n and rearranging terms yields

    √n( β̂_n − β₀ ) ≈ [ (1/n) ∑_{i=1}^{n} ψ′( ϵ_i/σ̂_n ) x_i x_i^⊤/σ̂_n ]^{−1} (1/√n) ∑_{i=1}^{n} ψ( ϵ_i/σ̂_n ) x_i,

as r_i(β₀) = y_i − x_i^⊤β₀ = ϵ_i. Despite the fact that the errors ϵ_i are iid, the presence of the scale estimator σ̂_n makes the random variables ψ(ϵ_i/σ̂_n), i = 1, . . . , n, dependent. Hence, classical central limit theorems are not applicable.
With more advanced techniques we can show that if there exists a positive definite matrix V such that

    lim_{n→∞} (1/n) ∑_{i=1}^{n} x_i x_i^⊤ = V,

then we have

    (1/√n) ∑_{i=1}^{n} ψ( ϵ_i/σ̂_n ) x_i →_D N_p( 0, E{ψ(ϵ₁/σ)²} V )

and

    (1/n) ∑_{i=1}^{n} ψ′( ϵ_i/σ̂_n ) x_i x_i^⊤ →_P E{ψ′(ϵ₁/σ)} V.

Therefore, by Slutsky's theorem,

    √n( β̂_n − β₀ ) →_D N_p( 0, σ² ( E{ψ(ϵ₁/σ)²} / (E{ψ′(ϵ₁/σ)})² ) V^{−1} ).

The above derivation is completely heuristic and does not cover M-estimators derived
from non-differentiable ψ-functions (such as the Huber estimator). The following gen-
eral set of conditions serves to add the missing rigor.

(A) There exists a positive definite matrix V such that

    lim_{n→∞} n^{−1} X^⊤X = V.

(B) The observations x₁, . . . , x_n satisfy

    max_{1≤i≤n} ∥x_i∥ = o(√n).

(C) There exist finite constants κ and M such that for all x ∈ R and |y| < κ,

|ψ( x + y) − ψ( x )| ≤ M.

(D) For any ϵ > 0 there exists an Mϵ > 0 such that

|ψ(tx ) − ψ(sx )| ≤ Mϵ |t − s|,

for any t > ϵ, s > ϵ and −∞ < x < ∞.

(E) There exists a σ > 0 such that σ̂_n →_P σ.

(F) There exists a δ > 0 such that

    sup_{|α−σ|≤δ} | E{ψ(ϵ₁/α + t)} − ξ(α) t | = o(t)    as t → 0,

for some function ξ(α) that is continuous at σ and satisfies ξ(σ) > 0.

(G) We have

    lim_{t→0} sup_{|α−σ|≤δ} E{ |ψ(ϵ₁/α + t) − ψ(ϵ₁/α)|² } = 0,

for some δ > 0.

Theorem 3.2. Assume that conditions R1–R3 and (A)–(G) hold and further that ρ is a convex loss function. Then, for the sequence of M-estimators β̂_n, it holds that

    √n( β̂_n − β₀ ) →_D N_p( 0, σ² ( E{ψ(ϵ₁/σ)²} / |ξ(σ)|² ) V^{−1} ),

with ξ(·) the function appearing in (F).

The proof of Theorem 3.2 for monotone M-estimators is based on the following
interesting lemma of Hjort and Pollard (1993). For redescending M-estimators the
result of the theorem is still true, but the mathematics to prove it are more complicated.

Lemma 3.3. Suppose A_n(δ) is a real convex process that can be represented as (1/2) δ^⊤Vδ + U_n^⊤δ + r_n(δ), for some r_n(δ) tending to zero for each δ. If U_n →_D U, then for the minimizer δ_n of A_n(δ) we have

    δ_n →_D −V^{−1}U.

Proof of Theorem 3.2. We aim to apply Lemma 3.3 and to this end consider the objective function

    Z_n(δ) = ∑_{i=1}^{n} { ρ( (ϵ_i − x_i^⊤δ/√n)/σ̂_n ) − ρ( ϵ_i/σ̂_n ) }.

Clearly, Z_n is a convex process, as ρ is convex. Moreover, Z_n(δ) is minimized by √n(β̂_n − β₀). We show that Z_n(δ) can be represented as

    Z_n(δ) = ( ξ(σ)/(2σ²) ) δ^⊤Vδ + U_n^⊤δ + r_n(δ),

with

    U_n = −(1/(σ̂_n√n)) ∑_{i=1}^{n} ψ(ϵ_i/σ) x_i,

and r_n(δ) tending to zero pointwise, that is, for every δ ∈ R^p. The claim would then follow from Lemma 3.3 and Slutsky's theorem.
We start by using the fundamental theorem of calculus to write

    ρ( (ϵ_i − x_i^⊤δ/√n)/σ̂_n ) − ρ( ϵ_i/σ̂_n )
        = (1/σ̂_n) ∫₀^{−x_i^⊤δ/√n} { ψ( (ϵ_i + t)/σ̂_n ) − ψ( ϵ_i/σ̂_n ) } dt − (1/(√n σ̂_n)) x_i^⊤δ ψ( ϵ_i/σ̂_n ).

Consequently,

    Z_n(δ) = Z_n^{(1)}(δ) + Z_n^{(2)}(δ),

with

    Z_n^{(1)}(δ) = (1/σ̂_n) ∑_{i=1}^{n} ∫₀^{−x_i^⊤δ/√n} { ψ( (ϵ_i + t)/σ̂_n ) − ψ( ϵ_i/σ̂_n ) } dt,

    Z_n^{(2)}(δ) = −(1/(√n σ̂_n)) ∑_{i=1}^{n} x_i^⊤δ ψ( ϵ_i/σ̂_n ).

We begin by showing that Z_n^{(2)}(δ) →_D N( 0, σ^{−2} E{|ψ(ϵ₁/σ)|²} δ^⊤Vδ ). To see this, write

    (1/√n) ∑_{i=1}^{n} x_i^⊤δ ψ( ϵ_i/σ̂_n ) = (1/√n) ∑_{i=1}^{n} x_i^⊤δ ψ( ϵ_i/σ ) + (1/√n) ∑_{i=1}^{n} x_i^⊤δ { ψ( ϵ_i/σ̂_n ) − ψ( ϵ_i/σ ) }.    (3.24)
By (F), the first term in the RHS of (3.24) is the sample mean of independent but not identically distributed random variables with mean equal to zero. It has an asymptotic Gaussian distribution provided that we can check the Lindeberg condition, which in our case takes the form

    max_{1≤i≤n} |x_i^⊤δ| / ( ∑_{i=1}^{n} |x_i^⊤δ|² )^{1/2} → 0,

as n → ∞. Now,

    max_{1≤i≤n} |x_i^⊤δ| / ( ∑_{i=1}^{n} |x_i^⊤δ|² )^{1/2} ≤ ( max_{1≤i≤n} ∥x_i∥/√n ) ∥δ∥ / ( δ^⊤( ∑_{i=1}^{n} x_i x_i^⊤/n )δ )^{1/2} → 0,

by (A) and (B). Thus, by (E) and Slutsky's theorem,

    (1/(σ̂_n√n)) ∑_{i=1}^{n} x_i^⊤δ ψ( ϵ_i/σ ) →_D N( 0, σ^{−2} E{ψ(ϵ₁/σ)²} δ^⊤Vδ ),
as required.
To bound the second term in (3.24), first note that by (E), for n large enough, σ/2 < σ̂_n < 3σ/2 with high probability. Thus, applying (D) yields the existence of an M_σ > 0 such that

    | ψ( ϵ_i/σ₁ ) − ψ( ϵ_i/σ₂ ) | ≤ M_σ |σ₁ − σ₂|,

for any σ₁, σ₂ ∈ (σ/2, 3σ/2). It now follows from Lemma 8.5 of van de Geer (2000) that

    (1/√n) ∑_{i=1}^{n} x_i^⊤δ { ψ( ϵ_i/σ̂_n ) − ψ( ϵ_i/σ ) } = o_P(1),

as n → ∞.
Now, consider Z_n^{(1)}(δ), which we may decompose as

    Z_n^{(1)}(δ) = E{Z_n^{(1)}(δ)} + ( Z_n^{(1)}(δ) − E{Z_n^{(1)}(δ)} ).    (3.25)

To treat the first term on the right-hand side of (3.25), recall that by the Schwarz inequality and (B),

    max_{1≤i≤n} |x_i^⊤δ|/√n ≤ ( max_{1≤i≤n} ∥x_i∥/√n ) ∥δ∥ → 0.

Therefore, applying (F) and Fubini's theorem yields

    E{ σ̂_n Z_n^{(1)}(δ) } = ∑_{i=1}^{n} ∫₀^{−x_i^⊤δ/√n} E{ ψ( (ϵ₁ + t)/σ̂_n ) } dt
                         = (1/σ̂_n) ∑_{i=1}^{n} ∫₀^{−x_i^⊤δ/√n} { ξ(σ̂_n) t + o(t) } dt
                         = ( ξ(σ̂_n)/(2σ̂_n n) ) δ^⊤ ∑_{i=1}^{n} x_i x_i^⊤ δ + o(1)
                         →_P ( ξ(σ)/(2σ) ) δ^⊤Vδ,

as n → ∞, where to identify the limit we have used (A), (E) and the continuity of ξ at σ assumed in (F).
To complete the proof of the theorem we need to show that σ̂_n Z_n^{(1)}(δ) − E{σ̂_n Z_n^{(1)}(δ)} is o_P(1) as n → ∞. As preliminary steps, we show that for any α ∈ (σ − δ, σ + δ) with sufficiently small δ, the maxima of the random variables

    ∫₀^{−x_i^⊤δ/√n} { ψ( (ϵ_i + t)/α ) − ψ( ϵ_i/α ) } dt,    (i = 1, . . . , n),

are o_P(1) and that the sum of their variances converges to zero, as n → ∞. To establish the former, note that by (B), max_{1≤i≤n} |x_i^⊤δ|/√n → 0. An application of (C) therefore yields

    | ∫₀^{−x_i^⊤δ/√n} { ψ( (ϵ_i + t)/α ) − ψ( ϵ_i/α ) } dt | ≤ ∫₀^{|x_i^⊤δ|/√n} | ψ( (ϵ_i + t)/α ) − ψ( ϵ_i/α ) | dt ≤ M |x_i^⊤δ|/√n,

uniformly in i = 1, . . . , n. Conclude that

    max_{1≤i≤n} | ∫₀^{−x_i^⊤δ/√n} { ψ( (ϵ_i + t)/α ) − ψ( ϵ_i/α ) } dt | = o(1),

as n → ∞.
To bound the sum of the variances, first recall that for any random variable X, Var{X} ≤ E{X²}, and it thus suffices to bound the sum of the second moments. By the Schwarz integral inequality, we have

    ∑_{i=1}^{n} E{ ( ∫₀^{−x_i^⊤δ/√n} { ψ( (ϵ_i + t)/α ) − ψ( ϵ_i/α ) } dt )² }
        ≤ ∑_{i=1}^{n} ( |x_i^⊤δ|/√n ) ∫₀^{|x_i^⊤δ|/√n} E{ | ψ( (ϵ_i + t)/α ) − ψ( ϵ_i/α ) |² } dt
        = ∑_{i=1}^{n} ( |x_i^⊤δ|²/n ) o(1)
        = o(1),
as n → ∞, where the second-to-last line follows from (G) and the last line from (A). The claim now follows from the methods of van de Geer (2000, Chapter 8).
Combining these two derivations we have shown that

    σ̂_n Z_n^{(1)}(δ) = ( ξ(σ)/(2σ) ) δ^⊤Vδ + o_P(1).

Therefore, by (E),

    Z_n^{(1)}(δ) = ( ξ(σ)/(2σ²) ) δ^⊤Vδ + o_P(1).
The result of the theorem now follows from Lemma 3.3. The proof is complete. □
The conditions of the theorem are focused on monotone M-estimators requiring an auxiliary scale σ̂_n. They may be marginally weakened if the M-estimator does not require a scale estimator, in that (D) and (E) may be omitted entirely while (F) and (G) only need to hold for α = 1. Can you then identify ξ in (F) for ψ(x) = sign(x)? What does this say about the asymptotic distribution of the LAD estimator?
For M-estimators that do require scale estimators, the conditions of the theorem can be satisfied for M-estimators without differentiable ψ-functions. Contrast this with Theorem 2.2 of the previous chapter. For example, for the Huber M-estimator it may be verified that (C) and (D) are automatically satisfied, while (F) is satisfied with

    ξ(α) = 2F(b/α) − 1,

for symmetric error distributions F. Moreover, (G) is satisfied as ψ_b(x) is Lipschitz and therefore

    | ψ_b( ϵ₁/α + t ) − ψ_b( ϵ₁/α ) | ≤ C(b)|t|.

The theorem also applies to M-estimators of location, as in this case conditions (A)
and (B) are trivially satisfied (why?). What is then the asymptotic distribution of the
Huber location estimator with a preliminary MAD scale? Does the answer change if
the IQR or Qn scale estimators are used instead?
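As a partial pointer for these questions, the asymptotic variance of Theorem 3.2 is easy to evaluate numerically in the location case; a sketch (Python with numpy and scipy assumed; b = 1.345 is an illustrative cutoff, and the scale estimator is taken to be consistent with σ = 1 at the standard normal) is:

    # Sketch: asymptotic variance of the Huber location estimator at N(0, 1),
    # using E[psi^2] / |xi(1)|^2 with xi(1) = 2F(b) - 1.
    import numpy as np
    from scipy.stats import norm
    from scipy.integrate import quad

    b = 1.345
    Epsi2, _ = quad(lambda x: np.clip(x, -b, b)**2 * norm.pdf(x), -np.inf, np.inf)
    xi = 2 * norm.cdf(b) - 1
    print(Epsi2 / xi**2)   # ~1.05, i.e. about 95% efficiency relative to the mean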

Further reading
Regression with fixed covariate dimension and iid errors is but one of the myriad
applications of M-estimators. Through the course of the last 30 years M-estimators
have been applied in time series analysis, high dimensional statistics, non-parametric
statistics and functional data analysis (see the last chapter of this course) with consid-
erable success.
There are many texts devoted to the study of M-estimators. Of particular impor-
tance are Maronna et al. (2006) and Huber and Ronchetti (2009). These books clearly
present the underlying ideas, albeit in a not always rigorous manner. The theoreti-
cal properties of M-estimators are covered in great detail in van der Vaart (1998) and
van de Geer (2000), among others. The study of M-estimators in general metric spaces
is ongoing.

Chapter 4

Regularized Regression and Variable


Screening

4.1 Introduction
Linear regression models assume that the regression function E[Y | X ] depends lin-
early on the predictors X1 , . . . , Xd . Such models are simple and in many cases provide
a reasonable and well-interpretable description of the effect of the features on the out-
come. In some situations, simple linear models can even outperform fancier nonlinear
models, such as in cases with a small number of training data, cases with very noisy
data such that the trend (the signal) is hard to detect (low signal to noise ratio) or in
cases with sparse data (high-dimensional data problems).
The standard least squares estimator for linear models may become unstable when the number of available predictors is large in comparison to the sample size of the data, a setting commonly referred to as a high-dimensional data problem. If the dimension exceeds the sample size, the least squares estimator is not even unique anymore.
Hence, alternative methods are needed that provide more stable solutions. First, we
review relevant properties of standard least squares estimation for linear models to
explain its instability. We then continue with regularized estimation methods that al-
low for bias if this leads to a substantial decrease in variance and thus yields a more
stable model. Regularization is obtained by adding a penalty term to the least squares
objective function which aims to shrink the coefficient estimates towards zero. By
shrinking the coefficient estimates bias is introduced, but at the same time the vari-
ance is reduced.
A common assumption for high dimensional data problems is sparsity which in
this setting means that most available predictors are assumed to be noise variables
that do not help to explain the response. Hence, the final model should only contain a
small number of predictors. To achieve this goal, regularization methods use a penalty

term which allows to shrink coefficients completely to zero such that the correspond-
ing predictor actually is not included in the model anymore. Extensive treatments of
sparse methods for high-dimensional data can be found in e.g. Bühlmann and van de
Geer (2011); Hastie et al. (2015).
Data are called ultra-high dimensional if the dimension grows at an exponential rate in the sample size. Regularization does not suffice to successfully determine a sparse
linear model from such ultra-high dimensional data. To address this issue, an initial
screening step can be applied which screens each candidate predictor for its potential
use in the final model. The goal of this step is to largely reduce the dimension by
removing the least interesting predictors. Regularization methods can then be applied
on the data set with reduced dimension to determine the final model.

4.2 Least squares regression


The standard linear regression model has the form

    f(X) = β₀ + ∑_{j=1}^{d} β_j X_j.    (4.1)

Hence, the linear model assumes that the regression function E[Y | X ] can be approx-
imated reasonably well by the linear form (4.1). Recall that the model is linear in the
unknown parameters β 0 , . . . , β d . The predictor variables X1 , . . . , Xd can be measured
features, transformations of features, polynomial representations or other basis ex-
pansions, dummy variables, interactions, etc.
The unknown parameter vector β = (β₀, . . . , β_d) needs to be estimated from the training sample {(x_i, y_i); i = 1, . . . , n} with x_i ∈ R^d. The standard least squares estimator minimizes

    RSS(β) = ∑_{i=1}^{n} ( y_i − f(x_i) )²
           = ∑_{i=1}^{n} ( y_i − β₀ − ∑_{j=1}^{d} β_j x_{ij} )²
           = (y − Xβ)^t (y − Xβ),    (4.2)

where X is the feature data matrix which starts with a column of ones to include the
intercept in the model. Differentiating (4.2) with respect to β and setting the result
equal to zero yields the estimating equation

Xt (y − Xβ) = 0 (4.3)

with solution
β̂ = (Xt X)−1 Xt y (4.4)
The fitted values for the training data are given by

ŷ = X β̂ = X(Xt X)−1 Xt y = Hy,

where H is called the hat matrix. Note that the fitted values ŷ are a linear function of
the responses y. The linear operator H (which puts the ”hat” on y) does not depend
on y but only on the xi observations. The predicted value at a new input vector x0 is
obtained by ŷ0 = fˆ( x0 ) = (1, x0t ) β̂.
Least squares minimizes RSS( β) = (y − Xβ)t (y − Xβ) = ∥y − Xβ∥2 . Geometri-
cally, minimizing the euclidean norm means that β̂ is determined such that ŷ is the
orthogonal projection of the n-dimensional vector y onto the subspace spanned by the
columns x0 , . . . , xd of the matrix X, called the column space of X. Hence, the resid-
ual vector y − ŷ is orthogonal to this subspace as can be seen in (4.3). The hat matrix
H computes the orthogonal projection of y onto the column space of X and thus is a
projection matrix. This implies that H has some useful properties. It is a symmetric, positive semidefinite matrix that is idempotent (i.e., HH = H). Moreover, if n > d, then the rank of H equals the number of linearly independent columns in X. Hence, in the absence of exact linear dependence among the predictors, the rank of H equals d + 1. Moreover, trace(H) = d + 1, the dimension of the projection space, which corresponds to the number of (free) parameters in the fit, also called the degrees of freedom of the estimation problem. Note that the trace of a square matrix is defined as the sum of its diagonal elements. The more free parameters are included in a model, the better it can fit the given data, but the more complex the resulting model is. The degrees of freedom is thus a measure of model complexity.
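These properties are easy to verify numerically; a small sketch (numpy assumed, simulated data purely for illustration) is:

    # Sketch: the hat matrix is an orthogonal projection with trace d + 1.
    import numpy as np

    rng = np.random.default_rng(0)
    n, d = 50, 3
    X = np.column_stack([np.ones(n), rng.normal(size=(n, d))])
    H = X @ np.linalg.solve(X.T @ X, X.T)      # H = X (X^t X)^{-1} X^t
    print(np.allclose(H @ H, H), np.trace(H))  # idempotent; trace = d + 1 = 4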
It can easily be shown that the variance-covariance matrix of β̂ is given by

Cov( β̂) = σ2 (Xt X)−1 ,

where the error variance σ2 is usually estimated by


    σ̂² = ( 1/(n − (d + 1)) ) ∑_{i=1}^{n} ( y_i − ŷ_i )²,

which is an unbiased estimator of σ2 .


To make inference about the regression model and its parameters, we need to make
additional assumptions. We need to assume that the linear model (4.1) is the correct
model for the conditional mean E[Y | X ] and that the independent (additive) errors
ϵi all have a Gaussian distribution with mean zero and constant variance σ2 , that is,
ϵi ∼ N (0, σ2 ). Using this assumption, hypothesis tests and confidence intervals for

the regression parameters can be constructed, as well as F-tests to check for significant
differences between nested models.
The Gauss-Markov theorem is an important result for least squares regression. It
shows that the least squares estimator has the smallest variance among all unbiased
estimators that can be written as a linear transformation of the response y in the fol-
lowing sense. Consider any linear combination θ = at β with a ∈ Rd+1 , then the least
squares estimator of θ is
θ̂ = at β̂ = [ at (Xt X)−1 Xt ]y,
which is a linear transformation of y. Clearly, at β̂ is an unbiased estimator of θ. Now
consider any other linear, unbiased estimator of θ, that is, an estimator of θ of the form
θ̃ = c^t y with c ∈ R^n such that E[θ̃] = θ. Then it holds that

    Var(θ̂) ≤ Var(θ̃).

When using the squared error loss function to measure prediction accuracy, we
focus on the mean squared error of estimators. Consider the prediction of the response
Y0 at a point x0 , assuming that the model is correct, hence the true response Y0 has the
form
Y0 = f ( x0 ) + ϵ0 .
Then, the expected prediction error of an estimate f˜( x0 ) is

E[(Y0 − f˜( x0 ))2 ] = E[([Y0 − f ( x0 )] + [ f ( x0 ) − f˜( x0 )])2 ]


= E[(Y0 − f ( x0 ))2 ] + E[( f ( x0 ) − f˜( x0 ))2 ]
= E[ϵ02 ] + MSE( f˜( x0 ))
= σ2 + MSE( f˜( x0 )),

where MSE( f˜( x0 )) is the mean squared error of the prediction at x0 by the estimator
f˜. Therefore, expected prediction error and mean squared error of a prediction differ
only by the constant σ2 , representing the irreducible variance of the new observation
y0 around its expected value.
Now, assume that the linear model is correct, i.e. f ( x0 ) = x0t β. Consider any esti-
mate β̃ of the unknown regression parameters β, then the mean squared error of the
prediction f˜( x0 ) = x0t β̃ can be decomposed further as follows

MSE( f˜( x0 )) = E[( f ( x0 ) − f˜( x0 ))2 ] (4.5)


= E[( x0t ( β − β̃))2 ]
= E[( x0t ( β − E[ β̃] + E[ β̃] − β̃))2 ]
= ( x0t β − E[ x0t β̃])2 + E[(E[ x0t β̃] − x0t β̃)2 ]
= bias( f˜( x0 ))2 + Var( f˜( x0 )). (4.6)

The unbiasedness of the least squares estimator makes the first term in the mean
squared error decomposition (4.6) zero. Since the prediction f˜( x0 ) = x0t β̃ is a linear
combination of β̃, the Gauss-Markov theorem ensures that the least squares estimator
is optimal w.r.t. variance within an important class of unbiased estimators. That is, the
least squares estimator has the smallest mean squared error among all linear unbiased
estimators. However, this optimal unbiased estimator can still be highly unstable. This
can happen in situations with (many) candidate predictors among which only a few
have a significant relation with the outcome. The large fraction of noise variables then
causes a large variability of the least squares estimates. Also the presence of highly
correlated predictors (multicollinearity) will make an unbiased estimator such as least
squares highly variable.

To obtain a more stable, better performing solution, we can look for biased estima-
tors that lead to a smaller mean squared error, and thus to better predictions. These
biased estimators thus trade (a little) bias for a (much) larger reduction in variance.
From a broader viewpoint, almost all models are only an approximation of the true
mechanism that relates the features to the outcome, and hence these models are bi-
ased. Therefore, allowing for a little extra bias when selecting the optimal model from
the class of models considered is a small price to achieve a more stable, better per-
forming model. Picking the right model thus means trying to find the right balance
between bias and variance.

A low prediction accuracy due to high variability is a first reason why the least
squares estimator may not be satisfactory. A second reason is the possible loss of in-
terpretability. With a large number of predictors it is often of interest to determine
a small subset of predictors that exhibit the strongest influence on the outcome. The
high variability of least squares complicates the selection of these predictors. A biased
estimator obtained by shrinkage will be much more successful. With least squares, we
can take the subset selection approach to determine important variables or to find
stable submodels. Note that after subset selection we usually end up with a biased
regression model, a fact that is often ignored in regression analysis. Several selection
strategies exist such as all-subsets, backward elimination, or forward selection. Dif-
ferent selection criteria can be used to select the optimal model in these procedures.
For instance, minimizing prediction error on a test set or through cross-validation can
be used. Note that these selection methods are discrete in the sense that predictors
are either completely in or completely out of the model. This discreteness results in a
high variance of the selection process. Shrinkage methods discussed next are a more
continuous alternative to subset selection which usually have a lower variance.

4.3 Regularization
4.3.1 Ridge Regression
Ridge regression (Hoerl and Kennard, 1970a,b) is one of the oldest shrinkage methods. It shrinks the regression coefficients (slopes) by imposing a penalty on their overall size. The ridge regression coefficients are obtained by minimizing a penalized residual sum of squares:

    β̂_ridge = arg min_β { ∑_{i=1}^{n} ( y_i − β₀ − ∑_{j=1}^{d} β_j x_{ij} )² + λ ∑_{j=1}^{d} β_j² }.    (4.7)

λ is a tuning parameter controlling the size of the penalty. The larger λ, the larger the
weight of the penalty term, and thus the more shrinkage that is imposed. The limit
case λ = 0 means no shrinkage and just yields the (unbiased) least squares solution.
Increasing λ increases the bias by shrinking the coefficients towards zero. Note that
no shrinkage is applied on the intercept term (β 0 ).
An equivalent formulation of the ridge regression problem is

    β̂_ridge = arg min_β ∑_{i=1}^{n} ( y_i − β₀ − ∑_{j=1}^{d} β_j x_{ij} )²    (4.8)

    subject to ∥β∥₂² = ∑_{j=1}^{d} β_j² ≤ s,    (4.9)

which is a formulation as a constrained optimization problem. This formulation more


explicitly shows the size constraint on the regression coefficients. From (4.9) it follows
that the regression coefficients are constrained to lie in a ball around the origin with radius √s. This constraint is useful to address the problem of multicollinearity among
the predictors. In case of correlated predictors the unbiased least squares coefficient
estimates have a high variability because a large positive coefficient on one predictor
can be (almost) canceled by a large negative coefficient on another highly correlated
predictor (in case of a positive correlation). The size constraint (4.9) in ridge regression
prevents this effect from occurring.
As with all penalized methods, ridge regression is not equivariant under scaling
and centering of the predictors. Therefore, it is advised to standardize the predictors
(to mean zero and variance 1) when fitting the ridge regression model. It follows immediately that the intercept parameter can be estimated by β̂₀ = (1/n) ∑_{i=1}^{n} y_i = ȳ.
Hence, the intercept can be removed from the model by centering the response as
well, so there is one parameter less in the model. In the remainder we consider the
case where the response is centered and the predictors have been centered and scaled,

such that the n × d matrix X is a centered feature matrix (without the usual column of ones, because there is no intercept in the model).
After setting β₀ equal to 0, the ridge estimator in (4.7) can be rewritten in matrix notation as

    β̂_ridge = arg min_β { (y − Xβ)^t (y − Xβ) + λ β^t β },    (4.10)

where β = (β₁, . . . , β_d)^t. By differentiating (4.10) w.r.t. β and setting the derivative equal to zero, the estimating equation for the ridge regression estimator is obtained. Solving this equation yields the ridge regression solution, which is given by

    β̂_ridge = (X^t X + λI)^{−1} X^t y.    (4.11)

Hence, the ridge regression solution is a linear function of the observed outcome y.
The solution is similar to the least squares solution (4.4) but adds a constant to the
diagonal of Xt X before inversion. This constant makes the matrix nonsingular even if
Xt X itself is singular. This means that ridge regression can be used even when there
are more predictors than observations or when there exists linear dependence between
the predictor variables. From (4.11) it follows that the fitted responses are obtained by

    ŷ = X β̂_ridge = [X(X^t X + λI)^{−1} X^t] y = S_λ^{ridge} y.

Hence, also for ridge regression the fitted values ŷ are a linear function of the responses y. The linear operator S_λ^{ridge} is called the ridge regression linear operator.
Similarly as for least squares, the trace of the ridge regression linear operator, trace(S_λ^{ridge}), can be used as a measure of model complexity, called the effective degrees of freedom (Hastie et al., 2001). Clearly, the effective degrees of freedom depends on the choice of the tuning parameter λ. Therefore, one way to choose the value of the tuning parameter is by fixing the effective degrees of freedom of the ridge regression solution.
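A minimal sketch of the ridge solution (4.11) and its effective degrees of freedom (numpy assumed; X denotes the centered and scaled feature matrix and y the centered response; the trace formula used below is derived via the SVD in (4.13)) is:

    # Sketch: ridge estimate (4.11) and effective degrees of freedom.
    import numpy as np

    def ridge(X, y, lam):
        d = X.shape[1]
        beta = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
        gamma = np.linalg.svd(X, compute_uv=False)   # singular values of X
        edf = np.sum(gamma**2 / (gamma**2 + lam))    # trace of the ridge operator
        return beta, edf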

To further examine the behavior of ridge regression we use the singular value de-
composition (SVD) of the centered feature matrix X, which is given by

X = U DV t (4.12)

where D is a d × d diagonal matrix with diagonal entries γ1 ≥ γ2 ≥ · · · ≥ γd ≥ 0


which are called the singular values of X. The number of singular values strictly larger
than zero corresponds to the rank of X. The matrix U is an n × d orthogonal matrix
and V is a d × d orthogonal matrix. Note that an m × d matrix P is called orthogonal

if Pt P = I d with I d the d-dimensional identity matrix. This means that the d columns
of the orthogonal matrix are an orthonormal basis for a d-dimensional space. In the
singular value decomposition the columns of U (corresponding to non-zero singular
values) are a basis for the space spanned by the columns of X. Similarly, the columns
of V (corresponding to non-zero singular values) are a basis for the subspace in Rd
spanned by the rows of X. For the special case of a d × d matrix V , orthogonality of
the matrix implies that V t = V −1 .
Using the singular value decomposition the fitted values by ridge regression can
be rewritten as

ŷridge = X β̂ ridge
= [X(Xt X + λI )−1 Xt ]y
= [U DV t (V DU t U DV t + λI )−1 V DU t ]y
= [U DV t (V D2 V t + λV IV t )−1 V DU t ]y
= [U DV t (V ( D2 + λI )V t )−1 V DU t ]y
= [U DV t V ( D2 + λI )−1 V t V DU t ]y
= [U D ( D2 + λI )−1 DU t ]y (4.13)

From (4.13) it follows that

    S_λ^{ridge} = U D(D² + λI)^{−1} D U^t,

which reveals how the effective degrees of freedom depends on the choice of the tuning parameter λ. Since both U and D are independent of λ, we can see that larger values of λ lead to smaller values of trace(S_λ^{ridge}). Hence, more shrinkage leads to ridge regression fits with a smaller effective degrees of freedom, because the solution β̂_ridge is constrained to lie in a smaller region around the origin.

Furthermore, note that D(D² + λI)^{−1}D is still a diagonal matrix with diagonal elements

    γ_j²/(γ_j² + λ),    j = 1, . . . , d.

Let us denote by u1 , . . . , ud the columns of U which are an orthonormal basis for the d-
dimensional subspace of Rn determined by the columns of X. Then, expression (4.13)
can be written as
    ŷ_ridge = ∑_{j=1}^{d} ( γ_j²/(γ_j² + λ) ) (u_j^t y) u_j.    (4.14)

For least squares regression a similar derivation yields

    ŷ_LS = X β̂_LS
         = [X(X^t X)^{−1} X^t] y
         = [U D V^t (V D U^t U D V^t)^{−1} V D U^t] y
         = [U D V^t (V D² V^t)^{−1} V D U^t] y
         = [U D V^t V (D²)^{−1} V^t V D U^t] y
         = [U D (D²)^{−1} D U^t] y
         = [U U^t] y
         = ∑_{j=1}^{d} (u_j^t y) u_j,    (4.15)

which corresponds to setting λ = 0 in (4.14) as expected. Both least squares and ridge
regression thus look for an approximation of the response vector y in the subspace
spanned by the columns u1 , . . . , ud of U. Least squares finds the approximation by
projecting y onto this subspace. Ridge regression then shrinks the least squares co-
ordinate weights (utj y) by the factor γ2j /(γ2j + λ). Hence, the smaller the squared
singular value γ2j , the larger the amount of shrinkage in the direction u j .
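The equivalence between the closed form (4.11) and the shrunken expansion (4.14) can be verified numerically; a sketch (numpy assumed, simulated centered data for illustration) is:

    # Sketch: ridge fits via (4.11) coincide with the SVD form (4.14).
    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.normal(size=(40, 5)); X -= X.mean(axis=0)   # centered features
    y = rng.normal(size=40); y -= y.mean()              # centered response
    lam = 2.0
    U, gamma, Vt = np.linalg.svd(X, full_matrices=False)
    yhat_svd = U @ (gamma**2 / (gamma**2 + lam) * (U.T @ y))
    yhat_dir = X @ np.linalg.solve(X.T @ X + lam * np.eye(5), X.T @ y)
    print(np.allclose(yhat_svd, yhat_dir))              # True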

To understand which directions correspond to larger amounts of shrinkage, we consider the d × d matrix of squares and cross-products:

    X^t X = V D U^t U D V^t = V D² V^t.    (4.16)
The matrix X^t X is a symmetric and positive semidefinite square matrix. Hence, the matrix has a real eigen-decomposition of the form PΛP^t, where Λ is a diagonal matrix whose diagonal elements are the eigenvalues of the matrix and P is an orthogonal matrix whose columns are the corresponding eigenvectors. Since V is an orthogonal matrix and D² is a diagonal matrix, (4.16) is the eigen-decomposition of the d-dimensional square matrix X^t X. Hence, γ_j², j = 1, . . . , d, are the eigenvalues of the matrix and the columns v₁, . . . , v_d of V are the corresponding eigenvectors, which form an orthonormal basis of R^d. Now let us express the input observations, and thus the feature data matrix X, in this new basis. The coordinates of the input observations for each of the basis vectors v_j, j = 1, . . . , d, are given by z_j = Xv_j. The sample variance in the direction v_j is thus

    Var(z_j) = Var(Xv_j) = (v_j^t X^t X v_j)/n = (v_j^t V D² V^t v_j)/n = γ_j²/n,
since V t v j = e j where e j is a d-dimensional vector with zero everywhere except for
component j which equals 1. Moreover, we also have that
z j = Xv j = U DV t v j = U De j = γ j Ue j = γ j u j .

Hence, in the directions u j , the variance of the feature observations equals γ2j /n.
Therefore, directions with small squared singular values γ2j correspond to directions
with small variance of the inputs, and ridge regression shrinks these directions the
most.
The sample covariance matrix S of the predictors (remember that they are centered
at zero) is given by Xt X/n. Therefore, the eigen-decomposition of Xt X in (4.16) corre-
sponds to the eigen-decomposition of the sample covariance matrix up to the factor
n. Consequently, the eigenvectors v1 , . . . , vd are the principal components of the feature
data (in d-dimensional space), that is directions with maximal variance subject to being
orthogonal to directions of earlier principal components. The directions u1 , . . . , ud are
thus the principal component directions of the input observations in the n-dimensional
space.

Least squares regression becomes highly variable if there are directions along which the data have small variance. The small amount of information in these directions makes it very difficult to accurately estimate the slope of the linear surface fitting the data in these directions. This makes the least squares solution very unstable. Ridge regression protects against the potentially high variance of the estimates in these low-variability directions by shrinking the regression coefficients in these directions. Hence, ridge regression does not allow a large slope in these directions and thus assumes (implicitly) that the response will vary most in directions with high variance of the predictors. This is often a reasonable assumption, and even if it does not hold, the data do not contain sufficient information to reliably estimate the slope of the surface in directions with small variance of the predictors.

4.3.2 The Lasso

The lasso (Tibshirani (1996)) is a shrinkage method like ridge regression, but instead of using the (squared) L₂-norm ∑_{j=1}^d β_j² for the penalty term, it uses the L₁ norm ∑_{j=1}^d |β_j|:

    β̂_lasso = arg min_β ∑_{i=1}^n (y_i − β₀ − ∑_{j=1}^d β_j x_ij)²    (4.17)

    subject to ∥β∥₁ = ∑_{j=1}^d |β_j| ≤ t.    (4.18)

The constant t is a tuning parameter. A value t ≥ t₀ = ∥β̂_LS∥₁ implies that no shrinkage is applied. Values 0 < t < t₀ will shrink the least squares coefficients towards zero. The tuning parameter t can be selected to optimize prediction error. Again, no shrinkage is applied on the intercept term, so after centering the predictors the intercept can be estimated by β̂₀ = ȳ. Hence, the intercept can again be removed by centering the response as well.
Replacing the L₂ ridge penalty by the L₁ lasso penalty makes the optimal solution a nonlinear function of the observed responses y (instead of a linear operator). Therefore, a more complicated optimization problem needs to be solved, but the solution can still be found in a very efficient way.
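In practice the lasso is rarely solved by hand; for instance, the glmnet package (assumed installed here) computes the whole solution path efficiently. Note that glmnet works with the Lagrangian form of (4.17)-(4.18), so its tuning parameter plays the role of the multiplier γ introduced below (up to scaling) rather than the bound t. A minimal sketch:

    # Lasso with glmnet; alpha = 1 gives the lasso, alpha = 0 gives ridge.
    library(glmnet)
    set.seed(2)
    n <- 100; d <- 10
    X <- matrix(rnorm(n * d), n, d)
    y <- X[, 1] - 2 * X[, 2] + rnorm(n)
    fit <- glmnet(X, y, alpha = 1)          # solution path over a grid of lambdas
    cvfit <- cv.glmnet(X, y, alpha = 1)     # tuning parameter by cross-validation
    coef(cvfit, s = "lambda.min")           # sparse coefficient vector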

The lasso penalty and ridge penalty have a different effect on the regression pa-
rameters. To explain the difference, let us consider an orthonormal feature matrix X,
that is the predictors are uncorrelated and have variance 1. In this case the principal
components are just the coordinate directions. Since all predictors have the same vari-
ance, ridge regression imposes an equal amount of shrinking on each of them. The
ridge regression estimates are

    (β̂_ridge)_j = (β̂_LS)_j / (1 + λ).    (4.19)
Note that these coefficient estimates decrease with increasing value of λ, but do not
become zero for λ < ∞.
To obtain the lasso estimates we have to solve the constrained optimization problem (4.17)-(4.18). The Lagrange formulation of the problem becomes

    β̂_lasso = arg min_β { ∑_{i=1}^n (y_i − ∑_{j=1}^d β_j x_ij)² + γ (∑_{j=1}^d |β_j| − t) }
             = arg min_β (y − Xβ)^t (y − Xβ) + γ ∥β∥₁
             = arg min_β y^t y − β^t X^t y − y^t X β + β^t X^t X β + γ ∥β∥₁
             = arg min_β −2 β^t X^t y + β^t X^t X β + γ ∥β∥₁,

with multiplier γ ≥ 0.
In case X is orthonormal, we have that X^t X = I. Therefore, the least squares estimator in this case reduces to β̂_LS = X^t y and we can rewrite the lasso optimization as

    β̂_lasso = arg min_β −2 β^t β̂_LS + β^t β + γ ∥β∥₁
             = arg min_β ∑_{j=1}^d ( −2 (β̂_LS)_j β_j + β_j² + γ |β_j| ).

Since each term in this sum depends on a different component β_j of β, the terms in the sum can be minimized separately. That is,

    (β̂_lasso)_j = arg min_{β_j} ( −2 (β̂_LS)_j β_j + β_j² + γ |β_j| ).    (4.20)
Note that the last two terms are always nonnegative. To find the solution we split the problem into three cases.

Case 1: (β̂_LS)_j > 0. In this case (β̂_lasso)_j ≥ 0 as well, to make the first term in (4.20) negative. The problem thus reduces to

    (β̂_lasso)_j = arg min_{β_j} ( −2 (β̂_LS)_j β_j + β_j² + γ β_j ).

Setting the derivative w.r.t. β_j equal to zero yields (β̂_lasso)_j = (β̂_LS)_j − γ/2 whenever this solution is positive, and (β̂_lasso)_j = 0 otherwise.
Case 2: (β̂_LS)_j < 0. In this case (β̂_lasso)_j ≤ 0 as well, to make the first term in (4.20) negative. The problem thus reduces to

    (β̂_lasso)_j = arg min_{β_j} ( −2 (β̂_LS)_j β_j + β_j² − γ β_j ).

Setting the derivative w.r.t. β_j equal to zero yields (β̂_lasso)_j = (β̂_LS)_j + γ/2 whenever this solution is negative, and (β̂_lasso)_j = 0 otherwise.
Case 3: (β̂_LS)_j = 0. The problem now reduces to

    (β̂_lasso)_j = arg min_{β_j} ( β_j² + γ |β_j| ),

with solution (β̂_lasso)_j = 0.
Combining the three cases yields the lasso solution

    (β̂_lasso)_j = sign((β̂_LS)_j) (|(β̂_LS)_j| − γ/2)₊,

where the function (·)₊ returns the value between brackets when that value is positive and returns zero otherwise.
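This rule translates directly into R; a sketch (the function name is ours):

    # Soft thresholding of least squares coefficients (orthonormal X).
    soft_threshold <- function(b_ls, gamma) {
      sign(b_ls) * pmax(abs(b_ls) - gamma / 2, 0)
    }
    soft_threshold(c(-1.5, -0.2, 0.1, 2), gamma = 1)
    # [1] -1.0  0.0  0.0  1.5   (small coefficients are set exactly to zero)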
The Karush-Kuhn-Tucker (KKT) conditions for the optimum imply that

    γ (∑_{j=1}^d |(β̂_lasso)_j| − t) = 0,

so either γ = 0 or ∑_{j=1}^d |(β̂_lasso)_j| = t. This condition determines the relation between the parameter γ and the tuning parameter t in (4.18). If γ = 0, then β̂_lasso = β̂_LS, which happens when t₀ = ∥β̂_LS∥₁ ≤ t. In the other situation, we obtain

    t = ∑_{j=1}^d |(β̂_lasso)_j|
      = ∑_{j=1}^d |sign((β̂_LS)_j) (|(β̂_LS)_j| − γ/2)₊|
      = ∑_{j=1}^d (|(β̂_LS)_j| − γ/2)₊.
If t is sufficiently large such that no coefficients are completely shrunken to zero, this reduces to

    t = ∑_{j=1}^d (|(β̂_LS)_j| − γ/2)
      = ∑_{j=1}^d |(β̂_LS)_j| − dγ/2
      = ∥β̂_LS∥₁ − dγ/2
      = t₀ − dγ/2,
which yields γ = 2(t₀ − t)/d. When t is decreased such that one of the parameter estimates becomes zero, then both t₀ and d in this expression need to be adjusted accordingly. The lasso thus equally shrinks all the least squares coefficients until the constraint (4.18) is satisfied. In contrast to the ridge regression coefficients in (4.19), lasso coefficient estimates are shrunken to zero whenever their least squares estimate (β̂_LS)_j is in absolute value smaller than the threshold γ/2, after which the remaining nonzero coefficients are shrunken further.

The effect of the lasso is often called soft thresholding. Regular hard thresholding sets coefficients below the cutoff equal to zero and only keeps the coefficients above the cutoff. This is the way subset selection for least squares actually proceeds. Fixing a size M < d for the optimal model, best subset selection in this case chooses the model with the M predictors corresponding to the largest parameters (β̂_LS)_j. The soft thresholding of the lasso can be seen as a continuous subset selection. The tuning parameter t specifies the threshold: as soon as ∥β̂_LS∥₁ exceeds t, the coefficients are shrunken. The shrinkage increases as t decreases, and for small values of t some coefficients become zero, which implies that the corresponding predictors are removed from the model.
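For comparison, a sketch of hard thresholding (best subset selection of size M in the orthonormal case; the function name is ours):

    # Hard thresholding: keep the M largest |coefficients|, zero out the rest.
    hard_threshold <- function(b_ls, M) {
      keep <- rank(-abs(b_ls), ties.method = "first") <= M
      ifelse(keep, b_ls, 0)
    }
    hard_threshold(c(-1.5, -0.2, 0.1, 2), M = 2)
    # [1] -1.5  0.0  0.0  2.0   (survivors are not shrunken)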
Let us now consider the general nonorthonormal case. This situation is represented
in Figure 4.1. Ridge regression minimizes the residual sum of squares subject to the
ridge constraint (4.9) while the lasso minimizes the same residual sum of squares sub-
ject to the lasso constraint (4.18). Let us put s = t² in the ridge constraint (4.9) such that both the ridge and lasso constraints put a bound t on the norm (L₂ or L₁) of β. In the parameter space, the ridge constraint (4.9) means that the solution should fall in a ball of radius t around the origin. The lasso constraint (4.18) means that the solution should fall in a diamond around the origin, with vertices at distance t from the origin on the coordinate axes.
Consider the residual sum of squares (4.2) for each point in the parameter space.
The optimum (minimum) is at β̂ LS and let us assume that it does not belong to the
ridge region nor the lasso region. Therefore, we have to allow a larger value of the
residual sum of squares to obtain a feasible solution.

Figure 4.1 Ridge regression (left) and lasso (right) solutions in the general nonorthonormal case. The green area is the part of the parameter space that is feasible according to the constraint in the penalty. The red ellipses represent areas of equal residual sum of squares (RSS). Larger sizes correspond to larger RSS values (increasing value of c).

The residual sum of squares is a
quadratic function of β, and thus for values of the residual sum of squares above the
optimum, the set of solutions becomes an ellipsoid around the center β̂ LS . The value
of the residual sum of squares thus needs to be increased until the corresponding el-
lipsoid of solutions hits the ridge/lasso constraint region around the origin. In either
case, the intersection point with the ellipsoid is the optimal solution. Hence, the shape
of the feasible region around the origin determines the solution that is found. The
ridge regression ball has no corners and is equally likely to be hit anywhere. The diamond (and certainly its higher-dimensional analogue) has corners at the coordinate axes, where one or more parameter values equal zero. These corners are more likely to be hit by the ellipsoid because they 'stick out' in the region. In practice this means that ridge regression usually shrinks all regression coefficients but they remain nonzero, so the size of the model is not reduced. On the other hand, the lasso will shrink some coefficients to zero and thus reduces the size of the model. Lasso shrinkage automatically incorporates variable selection. This explains the name lasso, which stands for Least Absolute Shrinkage and Selection Operator.
4.3.3 Other Regularization Methods

Ridge regression and lasso can be generalized by considering the general penalized least squares problem

    β̂_pen = arg min_β { (y − Xβ)^t (y − Xβ) + ∑_{j=1}^d p_λ(|β_j|) },    (4.21)

where p_λ(t) is a penalty function that is nondecreasing and continuously differentiable on [0, ∞). The lasso is obtained for the choice p_λ(t) = λt and ridge regression for the choice p_λ(t) = λt².
More complex penalty functions can now be considered that guarantee desirable properties for the resulting estimator. For high-dimensional regression problems, the following three desirable properties have been proposed in Fan and Li (2001).

Approximate unbiasedness: If the true coefficient β_j is large, then the corresponding estimate (β̂_pen)_j should be close to the least squares estimate (β̂_LS)_j to avoid unnecessary bias.

Sparsity: The estimator performs model selection by setting small estimated coefficients (β̂_pen)_j to zero such that the model complexity is reduced.

Continuity: The estimator is a continuous function of the data to avoid instability in model estimation and prediction.

The advantage of penalty functions having these properties is that the corresponding regularization method can enjoy the oracle property, which means that on the one hand the zero components are estimated as zero with probability going to one, and on the other hand the coefficients of the nonzero components are estimated consistently. Hence, asymptotically the method performs as well as an oracle procedure which knows the correct submodel. Note that ridge regression does not satisfy the sparsity property, while the lasso does not satisfy the approximate unbiasedness. To meet all three criteria the smoothly clipped absolute deviation (SCAD) penalty has been introduced by Fan and Li (2001):

    p_λ(t) =  λt                                 if t ≤ λ
              −(t² − 2aλt + λ²) / (2(a − 1))     if λ < t ≤ aλ        (4.22)
              (a + 1)λ² / 2                      if t > aλ,
with λ > 0 and a > 2. The constant a is often set equal to 3.7 which can be motivated
by a Bayesian argument. For small values of β j , the SCAD penalty pλ (| β j |) in (4.22)
corresponds to the lasso, which allows for model selection.

Figure 4.2 The ridge regression (dotted, red), lasso (dashed, blue) and SCAD (solid, black) penalty functions.

On the other hand, for

large values of β_j the SCAD penalty p_λ(|β_j|) is a constant. This implies that the estimator is nearly unbiased for large values of β_j, because shrinking these coefficients only affects the sum of squares but does not reduce the penalty. In between, a smooth transition from the linear lasso part to the constant is established. The penalty is shown
in Figure 4.2. Next to the SCAD penalty, several other penalties have been proposed in
the literature. For example, the elastic net penalty proposed in Zou and Hastie (2005)
uses a convex combination of the L2 norm penalty (ridge regression) and the L1 norm
penalty (lasso) to reach the best of both worlds. Not all regularization methods are penalized least squares methods, see e.g. the Dantzig selector of Candes and Tao (2007).
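The SCAD penalty (4.22) is straightforward to write down as an R function; a sketch (the function name is ours), which can be used to reproduce the solid curve in Figure 4.2:

    # SCAD penalty of (4.22), evaluated at |t|, with the usual a = 3.7.
    scad_penalty <- function(t, lambda, a = 3.7) {
      t <- abs(t)
      ifelse(t <= lambda,
             lambda * t,
             ifelse(t <= a * lambda,
                    -(t^2 - 2 * a * lambda * t + lambda^2) / (2 * (a - 1)),
                    (a + 1) * lambda^2 / 2))
    }
    curve(scad_penalty(x, lambda = 1), from = -4, to = 4, ylab = "penalty")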
The idea of adding a penalty term in the optimization problem corresponding to
an estimation method, to obtain more stable and/or more parsimonious models, can
be applied much more widely than just to least squares methods. An obvious class
of methods which can be combined with penalization are parametric maximum like-
lihood estimators (MLE) which are minimizers of the negative loglikelihood corre-
sponding to the parametric statistical model. A convenient family of distributions in
this respect is the exponential family in which it is assumed that the density or proba-
bility function of each response y_i is of the form

    f(y; θ_{0,i}) = exp( (y θ_{0,i} − b(θ_{0,i})) / ϕ₀ + c(y, ϕ₀) ),

where ϕ₀ is a scale parameter and b : ℝ → ℝ as well as c : ℝ × ℝ₊ → ℝ are functions
determining the shape of the distribution. The parameter θ_{0,i} is referred to as the natural parameter of the exponential family and is specific to each y_i. A common formulation is

    θ_{0,i} = x_i^⊤ β₀,
which is equivalent to the use of the canonical link function to model the effect of co-
variates on the conditional mean function E(y_i | x_i). Penalized maximum (log-)likelihood estimators add a penalty term to the negative classical log-likelihood in order to increase stability (ridge penalty) and/or sparsity (lasso penalty, SCAD penalty, . . . ). The objective function now is

    L(β, ϕ) = −∑_{i=1}^n ( (y_i θ_i − b(θ_i)) / ϕ + c(y_i, ϕ) ) + ∑_{j=1}^d p_λ(|β_j|)

and minimization is carried out with respect to β and ϕ, consistent with viewing the MLE as the minimizer of the negative log-likelihood. Note that penalized MLE is a direct extension of penalized least squares in regression to other statistical models. Logistic (binary outcome) regression is an example where penalized MLE is used for the same reasons as penalized LS in standard continuous outcome regression models.
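As an illustration, penalized logistic regression can again be fitted with glmnet by choosing the binomial family; a sketch with simulated data:

    # Lasso-penalized logistic regression (penalized MLE for a binary outcome).
    library(glmnet)
    set.seed(3)
    n <- 200; d <- 20
    X <- matrix(rnorm(n * d), n, d)
    y <- rbinom(n, 1, plogis(X[, 1] - X[, 2]))       # only two active predictors
    cvfit <- cv.glmnet(X, y, family = "binomial", alpha = 1)
    coef(cvfit, s = "lambda.1se")                    # sparse logistic coefficients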

4.4 Screening
While regularization methods can be used to obtain more stable and parsimonious models from high dimensional data, they still run into problems when the dimension of the data is much larger than the sample size. A typical example of such ultra-high dimensional data is gene expression data, where several thousands of variables have been measured for a limited number of observations (at most a few hundred). In such situations the computational burden of regularization methods such as the lasso or SCAD is still very high. Moreover, while these regularization methods can produce sparse models, they have only been shown to be model selection consistent under stringent design conditions if the dimension exceeds the sample size. Model selection consistency means that the method selects the true model with probability going to one when the sample size increases, if this true model is among the candidate models. The stringent design conditions are hard or impossible to verify in practice, so it remains unclear how well regularization methods select useful models, even when they can be computed.
To address these issues Fan and Lv (2008) introduced sure independence screening.
The idea is to apply a two step procedure.
1. In the first step a dimension reduction/feature selection procedure is applied to reduce the dimension d from huge (ultra-high) to a high but manageable size for regularization methods, e.g. dimension d̃ = O(n) or even o(n). This dimension reduction method should be fast and efficient in the sense that with high probability no important variables are lost.
2. Apply a regularization method which is model selection consistent for this re-
duced setting or preferably even has the oracle property so that the correct sub-
model is selected with high probability.

Regularization methods for the second step have been discussed in the previous sec-
tion, so in this section we focus on feature selection procedures for the first step.
The ultra-high dimensional data setting can be formalized and studied by allowing the dimension d of the data to depend on the sample size n, which can be made explicit by adding the subscript n to the dimension (d = d_n), and letting the dimension increase at an exponential rate with the sample size, i.e. d = exp(O(n^ξ)) for some ξ > 0. Some of the issues that arise in ultra-high dimensional data are the following.

• The (n × d) design matrix X is a “fat” matrix with many more columns than rows. The matrix X^t X is thus a huge matrix which is singular because it is of rank at most n (≪ d). As a result, its entries can be very misleading. For example, the maximum absolute sample correlation between independent predictors can be very large, which goes against our intuition from low dimensional settings. To illustrate this, consider d independent predictors X₁, . . . , X_d that each follow a standard normal distribution. In this case the (n × d) design matrix X consists of independent realizations of N(0, 1). We generated N = 500 datasets of size n = 50. In Figure 4.3 you can see the maximum absolute correlation between the independent predictors for design matrices X of dimension d = 10, 100, 1000 and 10 000 (a small sketch of this experiment is given after this list). Multiple correlations between groups of predictors can be even much higher than these maximum absolute correlations, indicating that spurious linear relations in high dimensional datasets can be very strong.

• The d_n dimensional population covariance matrix may become ill-conditioned as the sample size grows, which makes the problem more unstable and thus the variable selection harder.

• The number of nonzero coefficients may also depend on n, and the size of the smallest absolute nonzero coefficient min_i {|β_i| : β_i ≠ 0} may decay with n. The estimates of such coefficients may then become comparable in size to the estimates of the coefficients for noise variables.

• Predictors may have heavy tails such that the stringent conditions on the design
matrix for model selection consistency of regularization methods are not satis-
fied.
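The spurious-correlation phenomenon from the first issue is easy to reproduce; a sketch of the experiment behind Figure 4.3 (with fewer replications to keep it fast; the function name is ours):

    # Maximum absolute sample correlation between d independent N(0,1) predictors.
    set.seed(4)
    max_abs_cor <- function(n, d) {
      X <- matrix(rnorm(n * d), n, d)
      R <- cor(X)
      max(abs(R[upper.tri(R)]))
    }
    summary(replicate(100, max_abs_cor(n = 50, d = 1000)))
    # values far above what low-dimensional intuition would suggest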

Figure 4.3 Maximal absolute spurious correlation in independent datasets of size n = 50 and increasing dimension.

4.4.1 Correlation learning


To screen out the most promising predictors, Fan and Lv (2008) propose to use componentwise regressions. As before, let the predictors be standardized and the response be centered, and set

    ω = X^t y.

Note that each component ω_j (j = 1, . . . , d) is just the least squares coefficient obtained by regressing y on predictor x_j. Alternatively, the components of ω are the correlations of the predictors with the response variable, rescaled by the standard deviation of the response.
A submodel A_γ is then selected by retaining the indices of the predictors corresponding to the largest absolute coefficients in ω:

    A_γ = {1 ≤ j ≤ d : |ω_j| ≥ |ω|_(⌈n(1−γ)⌉)},    (4.23)

where |ω|_(⌈n(1−γ)⌉) is the ⌈n(1 − γ)⌉-th order statistic of |ω|, with ⌈n(1 − γ)⌉ the smallest integer larger than or equal to n(1 − γ), for some γ > 0. Note that this is equivalent to selecting the d̃ = [γn] predictors which have the largest marginal correlation with the response variable ([γn] denotes the integer part of γn). Therefore, this approach is also called correlation learning. The method thus filters out the predictors which have weak marginal
correlation with the response. This correlation learning method is called Sure Indepen-
dence Screening (SIS) because each available feature is used independently as predictor
to evaluate how useful it is to predict the response variable. Clearly, SIS is a hard-
thresholding method. The original model of size d is shrunken to a submodel of much
smaller size d˜ by completely removing all features whose marginal correlation with
the response is below the threshold in (4.23).
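A minimal sketch of SIS in R (the function name is ours): standardize the predictors, compute ω = X^t y, and keep the predictors with the largest |ω_j|:

    # Sure independence screening by marginal correlation.
    sis <- function(X, y, gamma = 0.5) {
      X <- scale(X)
      y <- y - mean(y)
      omega <- drop(crossprod(X, y))           # componentwise regression coefficients
      d_tilde <- floor(gamma * nrow(X))        # submodel size [gamma * n]
      order(abs(omega), decreasing = TRUE)[seq_len(d_tilde)]
    }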

4.4.2 Iteratively thresholded ridge regression screening


SIS is a special case of iteratively thresholded ridge regression screening. When
d > n the least squares estimator is not uniquely defined, but the instability of the
least squares solution can be addressed by considering the ridge regression solution
β̂_ridge(λ) in (4.11), where we now make explicit its dependence on the tuning parameter λ. For
λ → 0 the ridge regression solution approaches the unstable least squares estimator.
Hence, to obtain stable results sufficiently large values of λ should be used. Let us
now examine what happens if λ → ∞. First, we consider

    λ β̂_ridge(λ) = λ (X^t X + λI)⁻¹ X^t y = (X^t X/λ + I)⁻¹ X^t y.

It can thus be seen that λ β̂_ridge(λ) → X^t y = ω for λ → ∞. If we use the absolute ridge regression coefficient estimates β̂_ridge(λ) to order the features and threshold features with too small coefficients, then we obtain the same order whether we use β̂_ridge(λ) or λ β̂_ridge(λ). Hence, SIS is a limiting case of thresholded ridge regression screening with regularization parameter λ = ∞. That is, the case where the variability of the estimator is as small as possible.
Thresholded ridge regression selects a submodel A_λ by retaining the indices of the predictors corresponding to the largest absolute coefficients in β̂_ridge(λ):

    A_λ = {1 ≤ j ≤ d : |(β̂_ridge(λ))_j| ≥ |β̂_ridge(λ)|_(⌈n(1−γ)⌉)}.    (4.24)

For ultra-high dimensional data ridge regression is not model selection consistent, so applying thresholded ridge regression will not lead to a good model. However, we can apply an iterative procedure which reduces the number of features by a factor 1 − δ in each iteration, for some trimming fraction 0 < δ < 1. Hence, the first step yields the submodel

    A^1_{δ,λ} = {1 ≤ j ≤ d : |(β̂_ridge(λ))_j| ≥ |β̂_ridge(λ)|_(⌈(1−δ)d⌉)},

which contains [δd] features. Based on the features in A^1_{δ,λ} the corresponding ridge regression solution β̂^(1)_ridge(λ) is computed, which is used to select a second submodel

    A^2_{δ,λ} = {1 ≤ j ≤ d : |(β̂^(1)_ridge(λ))_j| ≥ |β̂^(1)_ridge(λ)|_(⌈(1−δ)d⌉)}.
This submodel A^2_{δ,λ} ⊂ A^1_{δ,λ} contains [δ²d] features. Iterating this process yields a sequence of submodels A^1_{δ,λ} ⊃ A^2_{δ,λ} ⊃ · · · ⊃ A^k_{δ,λ}, where the process is stopped when the submodel A^k_{δ,λ} has sufficiently small size. Note that this process generally yields a different solution than the thresholded ridge regression in (4.24), even if the final models have the same size, because the ridge regression coefficients are re-estimated in each iteration step based on the remaining features. The only exception is the case λ = ∞, i.e. SIS, because the ranking of the features is determined by the componentwise regression coefficients in this case. SIS is thus a special case of iteratively thresholded ridge regression screening. Moreover, it is the only case where iterations are unnecessary and the solution does not depend on the choice of δ.
The sure screening property holds for a screening method if the selected submodel
contains all important variables with probability going to 1 if the sample size in-
creases. In Fan and Lv (2008) it is shown that the sure screening property holds for
iteratively thresholded ridge regression screening under some stringent technical conditions. These conditions have implications for the choice of δ. On the one hand, δ cannot be too small (if λ < ∞), such that several iterations are needed; on the other hand, δ can also not be too large, such that the number of iterations k does not become too large. For example, iteratively thresholded ridge regression screening where
only one variable is removed in each iteration until a submodel of size n − 1 is obtained
would not satisfy the sure screening property because too many iterations (d − n + 1)
are needed to reach the final solution and in each iteration there is a small probabil-
ity that an important variable is lost. Hence, next to the excessive computation time
required for this procedure, it is also not guaranteed that it provides a reliable solu-
tion which contains the important predictors with high probability. The conditions
for sure screening naturally imply that important predictors should have a marginal
correlation with the response variable that is sufficiently high. Moreover, collinearity
between the predictors cannot be too high either. Because these issues regularly occur
in practice, we consider in the next section an extension of SIS that is more flexible.

4.4.3 Iterative sure independence screening


To improve the screening performance in cases where the stringent conditions for
sure screening are not satisfied, the following iterative procedure can be followed.

1. Apply SIS on the original dataset with all d features and select a submodel with
o (n) ( e.g. n/ log(n)) candidate predictors. Apply a regularization method that
performs variable selection to this submodel such as lasso, SCAD or Dantzig
selector. This yields a first subset with k₁ predictors A₁ = {X_{i₁}, . . . , X_{i_{k₁}}}.

2. Remove the effect of the k₁ selected predictors from the response. That is, replace y by the residual vector obtained by regressing y on x_{i₁}, . . . , x_{i_{k₁}}. Apply Step 1 with this residual vector as response and using the remaining d − k₁ features as candidate predictors to select a second subset A₂ with k₂ predictors.

3. Repeat the procedure in the previous steps until l subsets of predictors have been
obtained where l is a number such that the model A = ∪il=1 Ai contains o (n) or
at most O(n) predictors.

4. Apply a regularization method that performs variable selection to the submodel A to determine the final model. (A sketch of one round of this procedure is given after this list.)
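As referenced in step 4, here is a sketch of one ISIS round (steps 1-2), reusing the sis() function from the previous subsection and glmnet for the variable selection step; all names are ours and edge cases (e.g. an empty selection) are ignored:

    # One ISIS round: screen, select with the lasso, then take residuals.
    library(glmnet)
    isis_round <- function(X, y, gamma = 0.3) {
      cand <- sis(X, y, gamma)                           # step 1: screening
      cvfit <- cv.glmnet(X[, cand, drop = FALSE], y)     # step 1: regularization
      b <- as.numeric(coef(cvfit, s = "lambda.min"))[-1]
      sel <- cand[b != 0]                                # selected predictors
      res <- residuals(lm(y ~ X[, sel, drop = FALSE]))   # step 2: remove their effect
      list(selected = sel, residuals = res)
    }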

Removing from the response the effect of the already selected predictors in step 2 has several advantages. On the one hand, it allows to select important predictors that had a weak correlation with the original response due to the presence of already selected predictors, because such features will get a high correlation with the residuals. On the other hand, features that are not important but have a high correlation with the response due to their association with already selected predictors will have a much weaker correlation with the residuals, because this residual vector is uncorrelated with the selected predictors. Moreover, this iterative sure independence screening (ISIS) has the additional advantage that important variables that are missed by SIS (i.e. by applying only one iteration) can still be picked up in the next iterations of the ISIS procedure. Hence, ISIS is much more likely to return a model that is close to the true submodel, even when the stringent conditions for sure screening are not satisfied.

4.4.4 Extensions
Sure screening is not limited to ultra-high dimensional linear regression problems; the idea of ISIS can be extended to handle ultra-high dimensional data in other settings such as classification, generalized linear models or additive models, see e.g. Fan et al. (2009); Barut et al. (2016); Cui et al. (2015); Fan et al. (2011), or even model-free settings, see Li et al. (2012); Pan et al. (2019); Zhu et al. (2011).
Chapter 5

Support Vector Machines

In this chapter we mainly discuss the use of kernel-based methods for classification and regression problems. Kernels allow to (implicitly) transform and enlarge the predictor space, which results in more flexible (nonlinear) solutions in the original predictor space. Because of the flexibility of the transformed space, regularization is needed to obtain a stable solution. These methods are commonly called support vector machines. An in-depth treatment of support vector machines can be found in e.g. Schölkopf and Smola (2002); Christmann and Steinwart (2008). A discussion from a statistical perspective can be found in Hastie et al. (2001), while a more elementary exposition is given in James et al. (2014).

5.1 Classification
In classification we consider problems with a qualitative (categorical) outcome
variable G, where the labels of G can be denoted by 1, . . . , k without loss of gener-
ality. Often the outcome is dichotomous, that is, there are only two classes (k = 2).
Examples are infected vs. not-infected by a certain virus or bacteria, valid vs. fraudu-
lent transaction, or admitted vs. not admitted. In this section we consider such binary
classification problems. In this case the class label G is often represented by a single
binary variable Y which takes the values 0 or 1, or alternatively, -1 or 1.
In a classification problem the goal is to find good estimates of the class labels. Ideally, the method works well on the whole distribution of outcomes and features. In practice, these predictions can be considered both for the training data, expressing the goodness-of-fit quality of the method, and for new objects, expressing the prediction capacity of the method. Furthermore, the purpose often also is to find a representation
that separates or discriminates the classes as much as possible to get more insight in the
differences between the classes and to determine the features that are associated with
these differences.
Given any vector of features x ∈ ℝ^d, we would like to find the classification rule that makes the least possible mistakes, or more generally, gives the smallest loss. In the general case, we need to define a loss function L(r, s) that represents the price paid for classifying an object from class r to class s, for all possible classes r, s ∈ {1, . . . , k} (with k = 2 in this case). The standard setting of least possible mistakes corresponds to setting L(r, s) = 1 if r ≠ s and zero otherwise, which is called the zero-one loss function. We now would like to find the classifier Ĝ that minimizes the expected prediction error or expected misclassification loss given by

    EML(Ĝ(x)) = ∑_{r=1}^k L(r, Ĝ(x)) P(r | X = x).    (5.1)

This leads to the classifier

    Ĝ(x) = argmin_{g ∈ {1,...,k}} ∑_{r=1}^k L(r, g) P(r | X = x).    (5.2)

The classifier is thus defined pointwise. For each feature vector x, the class estimate is given by the solution of (5.2). For the standard zero-one loss, the classifier simplifies to

    Ĝ(x) = argmin_{g ∈ {1,...,k}} ∑_{r ≠ g} P(r | X = x).    (5.3)

Since ∑_{r=1}^k P(r | X = x) = 1 = P(g | X = x) + ∑_{r ≠ g} P(r | X = x), we can rewrite (5.3) as

    Ĝ(x) = argmin_{g ∈ {1,...,k}} [1 − P(g | X = x)]

or equivalently

    Ĝ(x) = argmax_{g ∈ {1,...,k}} P(g | X = x).    (5.4)

This solution is called the Bayes classifier. It says that we should classify an object with features x to the most probable class as determined by the conditional distribution P(G | X = x), that is

    Ĝ(x) = r if P(r | X = x) = max_{g ∈ {1,...,k}} P(g | X = x).    (5.5)
Figure 5.1 shows a mixture of observations from three different bivariate normal distri-
butions together with the optimal Bayesian decision boundaries which are quadratic
functions in this case. With real data, of course the true conditional distribution of the
classes is unknown and the Bayes classifier (5.4) cannot be derived.
Figure 5.1 Mixture of three bivariate normal distributions with optimal Bayesian decision boundaries.

The error rate of the Bayes classifier, obtained as

    E_X [EML(Ĝ(X))],    (5.6)

with Ĝ(x) the Bayes classifier, is called the optimal Bayes error rate. This is the error rate we would obtain if the conditional distribution P(G | X) were known.
We start our discussion of support vector machines for classification by considering
methods that explicitly try to find linear boundaries that separate the two classes as
much as possible when the classes are separable. We then discuss the construction of
such optimal separating hyperplanes when the classes are not completely separable
anymore. Finally, we introduce kernels to add flexibility to this approach and discuss
extensions using different loss functions.

5.1.1 Separating Hyperplanes

Two well-known classification methods, Fisher's linear discriminant analysis and logistic regression, both estimate linear decision boundaries that separate classes by
modeling the conditional distribution P( G = r | X ) for each class r = 1, . . . , k. In con-
trast, separating hyperplane classifiers are procedures that explicitly try to construct
linear decision boundaries that separate the data into different classes as well as pos-
sible. We consider the two-class problem and the two classes are coded by the class
variable Y which takes the values −1 and +1. A linear decision boundary is a hyperplane H_β represented by an equation

    f(x) = β₀ + β^t x = 0,    (5.7)

with β ∈ ℝ^d. If two points x₁ and x₂ belong to the same hyperplane H_β, then f(x₁) = f(x₂) = 0, and thus also f(x₁) − f(x₂) = β^t (x₁ − x₂) = 0. Hence,

    β* = β / ∥β∥

is the normalized vector of the direction orthogonal to the hyperplane H_β. The signed distance of any point x to a hyperplane H_β is the signed distance between the point x and its orthogonal projection x₀ on the hyperplane H_β:

    d_sign(x, H_β) = (β*)^t (x − x₀).

The function is called signed distance because its absolute value is the orthogonal
distance from the point to the hyperplane but the function is positive for points on one
side (the positive side for which f ( x ) > 0) and negative for points on the other side
(the negative side for which f(x) < 0) of the hyperplane. Note that for any point x₀ in H_β, it holds that f(x₀) = 0, which yields β^t x₀ = −β₀. Therefore, the signed distance can be rewritten as

    d_sign(x, H_β) = (β*)^t (x − x₀)
                   = (β^t x − β^t x₀) / ∥β∥
                   = (β^t x + β₀) / ∥β∥
                   = f(x) / ∥f′(x)∥,
since f ′ ( x ) = β in this case. Hence, the signed distance of a point x to the hyperplane
defined by f ( x ) = 0 is proportional to the value f ( x ). Separating hyperplane classi-
fiers try to find a linear function f ( x ) as in (5.7) such that f ( x ) > 0 for objects in class
Y = 1 and f ( x ) < 0 for objects in class Y = −1. Procedures that classify observations
by computing the sign of a linear combination of the input features were originally
called perceptrons when introduced in the engineering literature.
Let us assume that the data are separable, that is, there is at least one hyperplane that
completely separates the two classes. In this case, the solution of the separating hyper-
plane problem will often not be unique, but several hyperplanes completely separate
the classes. To make the solution unique, an additional condition can be imposed to
find the best possible separating hyperplane among all choices. The optimal separating
hyperplane is defined as the hyperplane that separates the two classes and maximizes
the distance from the hyperplane to the closest points from both classes. This extra
condition defines a unique solution to the separating hyperplane problem and maxi-
mizes the margin between the two classes on the training data. It can be expected that
it therefore also leads to better performance on test data.
If Hβ is a separating hyperplane, then Y = −1 for points on the negative side
( f ( x ) < 0) of Hβ and Y = 1 for points on the positive side ( f ( x ) > 0). It follows that
the (unsigned) distance of a point xi to the hyperplane Hβ is obtained by d( xi , Hβ ) =
yi ( βt xi + β 0 )/∥ β∥. Hence, a separating hyperplane satisfies d( xi , Hβ ) = yi ( βt xi +
β 0 )/∥ β∥ > 0 for all training points.
The optimal separating hyperplane is the solution to the problem

    max_{β, β₀} C    (5.8)

    subject to y_i (β^t x_i + β₀) / ∥β∥ ≥ C;  i = 1, . . . , n.    (5.9)

The condition (5.9) can equivalently be written as

    y_i (β^t x_i + β₀) ≥ C ∥β∥;  i = 1, . . . , n.    (5.10)

Now suppose that for some value of C, there are values β and β₀ that satisfy condition (5.10). Let us consider a positive value γ, and define β̃ = γβ and β̃₀ = γβ₀. It then follows that ∥β̃∥ = γ∥β∥ and condition (5.10) is equivalent to

    γ y_i (β^t x_i + β₀) ≥ γ C ∥β∥;  i = 1, . . . , n
    or  y_i (β̃^t x_i + β̃₀) ≥ C ∥β̃∥;  i = 1, . . . , n.

This means (not surprisingly) that if values β and β₀ satisfy the condition, then any multiple by a positive constant also satisfies the constraint. To make the solution unique, the value of ∥β∥ can be set equal to an arbitrarily chosen constant. A convenient choice is to take

    ∥β∥ = 1/C.    (5.11)
The maximization problem (5.8) then becomes the minimization problem

    min_{β, β₀} ∥β∥.

Minimizing the norm is equivalent to minimizing the squared norm, which is analytically more tractable. Hence, the constrained maximization problem (5.8)-(5.9) can be reformulated as

    min_{β, β₀} (1/2) ∥β∥²    (5.12)

    subject to y_i (β^t x_i + β₀) ≥ 1;  i = 1, . . . , n.    (5.13)
Note that condition (5.9) together with (5.11) implies that around the optimal sepa-
rating boundary there is a margin of width 1/∥ β∥ in both directions, that does not
contain any data points.
The optimal separating hyperplane problem as formulated in (5.12)-(5.13) consists of a quadratic criterion that needs to be minimized under linear inequality constraints, which is a standard convex optimization problem of operations research. It is well-known that minimizing (5.12)-(5.13) is equivalent to minimizing w.r.t. β and β₀ the Lagrange (primal) function

    L_P = (1/2) ∥β∥² − ∑_{i=1}^n α_i [y_i (β^t x_i + β₀) − 1]
        = (1/2) β^t β − ∑_{i=1}^n α_i [y_i (β^t x_i + β₀) − 1],    (5.14)

where α_i ≥ 0 are the Lagrange multipliers. Setting the derivatives w.r.t. β and β₀ equal to zero yields the equations

    β = ∑_{i=1}^n α_i y_i x_i    (5.15)

    0 = ∑_{i=1}^n α_i y_i.    (5.16)

These first-order conditions thus define the parameter β as a function of the Lagrange multipliers α_i. Hence, it remains to determine the optimal values of the Lagrange multipliers. To this end, the above first-order conditions are substituted in the Lagrange function (5.14), which yields the Wolfe dual function

    L_D = ∑_{i=1}^n α_i − (1/2) ∑_{i=1}^n ∑_{j=1}^n α_i α_j y_i y_j x_i^t x_j.    (5.17)

This dual function needs to be maximized w.r.t. α₁, . . . , α_n under the constraints

    ∑_{i=1}^n α_i y_i = 0

    α_i ≥ 0.    (5.18)

This is a standard convex optimization problem. The final solution must satisfy the Karush-Kuhn-Tucker conditions, well-known in operations research. These conditions are given by (5.13), (5.15)-(5.16), (5.18) and the condition

    α_i [y_i (β^t x_i + β₀) − 1] = 0;  i = 1, . . . , n.    (5.19)

This condition implies that
• If αi > 0, then yi ( βt xi + β 0 ) = 1 which means that the point xi is on the boundary
of the margin around the separating hyperplane.

• If yi ( βt xi + β 0 ) > 1, then xi is not on the boundary of the margin and αi = 0.

From (5.15) it follows that the optimal solution (direction) β is defined as a linear combination of the points x_i with α_i > 0. These points are the points on the boundary of the margin around the separating hyperplane, called the support points. Hence, optimal separating hyperplanes focus mainly on points close to the decision boundary and give zero weight to all other points (although all observations are needed to determine the support points).

Once the optimal separating hyperplane H_β̂ has been found as solution of the constrained optimization problem, new observations can be classified as

    ŷ(x) = sign f̂(x) = sign(β̂^t x + β̂₀).

In practice, it will often occur that there does not exist a separating hyperplane, and thus there is no feasible solution to problem (5.12)-(5.13). Therefore, we extend the separating hyperplane problem in the next section to the case of nonseparable classes.

5.1.2 Support Vector Classifiers


We now modify the optimal separating hyperplane problem (5.12)-(5.13) such that
it can find an optimal separating hyperplane even when the two classes are not com-
pletely separable by a linear bound. To allow for overlap of the classes, we still try
to find a hyperplane that maximizes the margin between the two classes, but we now
allow some data points to fall on the wrong side of their margin. Formally this can
be accomplished by introducing slack variables ξ i ≥ 0; i = 1, . . . , n associated with the
data points. With these slack variables condition (5.13) is modified to

yi ( βt xi + β 0 ) ≥ 1 − ξ i ; i = 1, . . . , n.

Hence, the value ξ_i in the above constraint is proportional to the amount by which each point x_i falls at the wrong side of its margin. Therefore, to obtain a reasonable solution, a bound needs to be set on the sum ∑_{i=1}^n ξ_i, the total amount by which points are allowed to lie at the wrong side of their margin. Let Q denote this bound, hence we require ∑_{i=1}^n ξ_i ≤ Q. Since the area of the margin between the two classes is not required to be empty anymore, it is often called a soft margin. Note that a training point (x_i, y_i) is misclassified if ξ_i > 1, because then y_i and β^t x_i + β₀ have a different sign; hence the constant Q determines the maximal number of misclassified training points that is allowed.
A convenient formulation of the soft margin problem or support vector classifier problem is

    min_{β, β₀} [ (1/2) ∥β∥² + γ ∑_{i=1}^n ξ_i ]    (5.20)

    subject to y_i (β^t x_i + β₀) ≥ 1 − ξ_i;  i = 1, . . . , n
               ξ_i ≥ 0;  i = 1, . . . , n,    (5.21)

where the tuning parameter γ > 0 now takes over the role of Q. A large value of γ puts a strong constraint on ∑_{i=1}^n ξ_i and thus allows few misclassifications, while a small value of γ puts little constraint on ∑_{i=1}^n ξ_i. Note that the limit case γ = ∞ corresponds to the separable case, since all slack variables ξ_i should equal zero in this case. The support vector classifier problem (5.20)-(5.21) still consists of a quadratic criterion with linear constraints, hence a standard convex optimization problem. The minimization problem is equivalent to minimizing w.r.t. β, β₀ and ξ_i the Lagrange function

    L_P = (1/2) ∥β∥² + γ ∑_{i=1}^n ξ_i − ∑_{i=1}^n α_i [y_i (β^t x_i + β₀) − (1 − ξ_i)] − ∑_{i=1}^n μ_i ξ_i,    (5.22)

where α_i ≥ 0 and μ_i ≥ 0 are the Lagrange multipliers. Setting the derivatives w.r.t. β, β₀ and ξ_i equal to zero yields the equations

    β = ∑_{i=1}^n α_i y_i x_i    (5.23)

    0 = ∑_{i=1}^n α_i y_i    (5.24)

    α_i = γ − μ_i;  i = 1, . . . , n.    (5.25)

Note that equations (5.23) and (5.24) are the same as in the separable case discussed
in the previous section (equations (5.15) and (5.16)). The difference is that the last
equation relates the Lagrange multipliers αi to the constant γ. Substituting these first
order conditions in (5.22) yields the Wolfe dual objective function
n
1 n n
LD = ∑ αi − 2 i∑ ∑ αi α j yi y j xit x j (5.26)
i =1 =1 j =1

Note that this is exactly the same dual function as in the separable case (expres-
sion (5.17)) but now it needs to be maximized w.r.t. α1 , . . . , αn under the constraints
n
∑ αi yi = 0
i =1

0 ≤ αi ≤ γ. (5.27)

122
The Karush-Kuhn-Tucker conditions satisfied by the solution include the constraints

    α_i [y_i (β^t x_i + β₀) − (1 − ξ_i)] = 0;  i = 1, . . . , n    (5.28)

    μ_i ξ_i = 0;  i = 1, . . . , n    (5.29)

    y_i (β^t x_i + β₀) − (1 − ξ_i) ≥ 0;  i = 1, . . . , n.    (5.30)

From (5.23) it still follows that the solution for β is a linear combination of the observations with α_i > 0. From condition (5.28) it follows that these points are observations for which constraint (5.30) is exactly met. These observations are again called the support vectors. Some of these support vectors will lie on the margin (ξ_i = 0), which according to (5.29) and (5.25) implies that 0 < α_i < γ, while others lie on the wrong side of their margin (ξ_i > 0) and are characterized by α_i = γ.
As an illustration we consider an artificial two-dimensional dataset from Hastie
et al. (2001). The dataset is a two-class classification problem where each class consists
of a mixture of 10 bivariate Gaussian distributions with the same covariance matrix
but different centers. The training dataset contains 100 points from each class and is
displayed in Figure 5.2. The colored triangles in this plot represent the observations of
the training dataset. The observations of one class are plotted as red upper triangles
while the observations of the other class are plotted as green lower triangles. Next to
the training data, there is a test dataset which consists of a large grid of points indi-
cated by the small dots in Figure 5.2. This test set allows us to evaluate precisely the
performance of classification methods. The test grid also makes it possible to visualize
the decision boundary which clearly is highly nonlinear. Even with the true boundary,
there still are many misclassifications in the training data, which shows that the two
classes highly overlap. Indeed, even the optimal (Bayes) misclassification error rate
corresponding to the true boundary is quite large as it equals 21.0%.
In R several packages can be used to fit support vector machines. For the examples
in this chapter we use the function svm from the e1071 package. Fitting the support
vector classifier to the artificial training dataset yields the result shown in Figure 5.3.
As expected the boundary that separates the two classes is linear as can be seen from
the predicted labels for the test data. The misclassification rate on the test data is 28.8%,
so it is 7.8% higher than the optimal error rate. There are 63 support vectors in each
class. These support vectors are highlighted by their larger plotting symbol. It can be
seen from the plot that the support vectors consist of the points closest to the decision
boundary and the points on the wrong side of the decision boundary.
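A sketch of such a fit with svm() from e1071, on freshly simulated two-class data rather than the mixture data above; note that the cost argument of svm() plays the role of the tuning parameter γ in (5.20), while its gamma argument is a kernel parameter used later:

    # Linear support vector classifier with svm() from the e1071 package.
    library(e1071)
    set.seed(5)
    n <- 100
    X <- matrix(rnorm(2 * n), ncol = 2)
    y <- factor(ifelse(X[, 1] + X[, 2] + rnorm(n, sd = 0.5) > 0, 1, -1))
    fit <- svm(X, y, kernel = "linear", cost = 10)
    fit$index                       # row indices of the support vectors
    table(predict(fit, X), y)       # training confusion table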

5.1.3 Support Vector Machines for Classification


The support vector classifier can be made more flexible by enlarging the feature
space using transformations.

Figure 5.2 Artificial two-dimensional binary classification data with 100 observations of each class in the training set.

The linear boundaries obtained in the transformed space

will generally achieve a better class separation on the training data and lead to nonlin-
ear boundaries in the original predictor space. Consider a transformed feature space
h( X ) = (h1 ( X ), . . . , h M ( X ))t which leads to a transformed set of feature vectors h( xi )
for the training data. Applying the support vector classifier in this enlarged space pro-
duces a boundary fˆ( x ) = h( x )t β̂ + β̂ 0 . This boundary is linear in the enlarged space
h( X ), but nonlinear in the original feature space X. The support vector classifier is
ŷ( x ) = sign( fˆ( x )) as before. By allowing the dimension of the transformed feature
space to become very large, it becomes possible to find a boundary that completely
separates the classes. However, the nonlinear boundary in the original space can be
very wiggly, indicating overfitting of the training data. In such cases the optimistic
zero training error will be countered by a high test error. To avoid overfitting, regular-
ization is needed to penalize roughness of the boundary such that rougher boundaries
are only allowed if they significantly improve the training classification error.
In the previous section we showed that the support vector optimization prob-
lem (5.20)-(5.21) could be rewritten as maximizing the dual function (5.26) subject to
the constraints (5.27).

Figure 5.3 Artificial two-dimensional binary classification dataset with test data classified according to the support vector classifier learned on the training data.

For a transformed predictor space h(X) this becomes: maximize the dual function

    L_D = ∑_{i=1}^n α_i − (1/2) ∑_{i=1}^n ∑_{j=1}^n α_i α_j y_i y_j h(x_i)^t h(x_j)    (5.31)

under the constraints

    ∑_{i=1}^n α_i y_i = 0

    0 ≤ α_i ≤ γ.    (5.32)

Using (5.23) it follows that the support vector boundary can be written as

    f(x) = β^t h(x) + β₀
         = ∑_{i=1}^n α_i y_i h(x)^t h(x_i) + β₀.    (5.33)

Note that for given αi and β, the constant β 0 can be determined as the solution of
yi f ( xi ) = 1 for support points xi on the margin (0 < αi < γ).
From (5.31) and (5.33) it follows that both the optimization problem (5.31) and its
solution (5.33) in the transformed feature space only depend on products h( x )t h( x ′ )
of the transformed features. This allows us to take this one step further and to con-
sider transformations of the features into an infinite-dimensional reproducing kernel
Hilbert space (RKHS) HK generated by a positive definite kernel K ( x, y). More details
about kernel functions are given in the next section. In this case, the scalar product
h( x )t h( x ′ ) needs to be replaced by the corresponding values of the kernel function

K(x, x′). Hence, if the transformation of the feature space is induced by a kernel function, then (5.31) can be rewritten as

    L_D = ∑_{i=1}^n α_i − (1/2) ∑_{i=1}^n ∑_{j=1}^n α_i α_j y_i y_j K(x_i, x_j),    (5.34)

which needs to be maximized under constraints (5.32), and the solution (5.33) becomes

    f(x) = ∑_{i=1}^n α_i y_i K(x, x_i) + β₀.    (5.35)

From (5.34)- (5.35) it can be seen that the support vector classifier optimization prob-
lem can be formulated and solved in the transformed space HK without the need to
explicitly specify the underlying transformation h( X ). The knowledge of the kernel
function is sufficient. The above dual problem still can easily be solved using the same
optimization techniques as for the support vector classifier (in the original space) in
the previous section. The resulting classification method is called a support vector ma-
chine (SVM).
Popular choices for the kernel function K are

• polynomial kernels K(x, y) = (x^t y + 1)^m for some m ≥ 1 (which still yield a finite-dimensional transformed space),

• the Gaussian kernel K(x, y) = exp(−(1/2)(∥x − y∥/σ)²),

• the radial basis kernel K(x, y) = ∥x − y∥² log(∥x − y∥),

• the neural network kernel K(x, x′) = tanh(κ₁ (x^t x′) + κ₂).

The sketch after this list shows how such kernels are selected in the svm function.
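A sketch (self-contained simulated data); in e1071 the radial kernel is parametrized as exp(−gamma ∥x − y∥²), so gamma = 1/(2σ²) matches the Gaussian kernel above:

    # Polynomial and Gaussian kernel SVMs with e1071.
    library(e1071)
    set.seed(6)
    n <- 200
    X <- matrix(rnorm(2 * n), ncol = 2)
    y <- factor(ifelse(X[, 1]^2 + X[, 2]^2 > 1.5, 1, -1))       # nonlinear classes
    fit_poly <- svm(X, y, kernel = "polynomial", degree = 2, coef0 = 1)
    fit_rbf <- svm(X, y, kernel = "radial", gamma = 0.5)        # sigma = 1
    mean(predict(fit_rbf, X) == y)                              # training accuracy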

5.1.4 Regularization with Kernels

The classification SVM above is a particular case of the following general regularization problem

    min_{f ∈ H} [ ∑_{i=1}^n L(y_i, f(x_i)) + λ J(f) ],    (5.36)

where L(y, f(x)) is a loss function, J(f) is a penalty term, and H is an (infinite-dimensional) space of functions on which the penalty term J(f) is well-defined. Under general assumptions, it can be shown that this problem has a finite-dimensional solution. We consider the important subclass of problem (5.36) where the space of functions H is a RKHS H_K generated by a positive definite kernel K(x, y). Moreover, we assume that the penalty term J(f) can be written in terms of the kernel as well.
A positive definite kernel K(x, y) has an eigen-expansion

    K(x, y) = ∑_{i=1}^∞ γ_i ϕ_i(x) ϕ_i(y)    (5.37)

with γ_i ≥ 0 and ∑_{i=1}^∞ γ_i² < ∞. The basis functions ϕ_i(x) form a basis for the (infinite-dimensional) Hilbert space H_K spanned by the kernel function K. This result is known as Mercer's theorem. Another application of this theorem and more discussion on this result can be found in the chapter on functional data analysis. Functions in H_K thus have an expansion of the form

    f(x) = ∑_{i=1}^∞ a_i ϕ_i(x).    (5.38)

The kernel K(x, y) allows to define an inner product in the Hilbert space. For any two functions f(x) = ∑_{i=1}^∞ a_i ϕ_i(x) and g(x) = ∑_{i=1}^∞ b_i ϕ_i(x) in H_K, the inner product is defined as

    ⟨f, g⟩ = ∑_{i=1}^∞ a_i b_i / γ_i.    (5.39)

This inner product leads to the definition of a norm in the Hilbert space H_K given by

    ∥f∥²_{H_K} = ⟨f, f⟩ = ∑_{i=1}^∞ a_i² / γ_i.    (5.40)

Functions f in H_K satisfy the constraint

    ∥f∥²_{H_K} < ∞.    (5.41)

Note that for any z ∈ ℝ^d the function g(x) = K(x, z) = ∑_{i=1}^∞ [γ_i ϕ_i(z)] ϕ_i(x) satisfies (5.38), with ∥g∥²_{H_K} = ∑_{i=1}^∞ γ_i² ϕ_i(z)² / γ_i = ∑_{i=1}^∞ γ_i ϕ_i(z)² = K(z, z) < ∞. Hence, g(x) = K(x, z) ∈ H_K. For a fixed z ∈ ℝ^d, the function g(x) = K(x, z) will be denoted by K(., z) to indicate that we consider the kernel as a function of its first argument with the second argument fixed.

The penalty term J(f) in (5.36) is usually taken equal to J(f) = ∥f∥²_{H_K}. In this case, contributions to f of basis functions ϕ_i(x) with large eigenvalue γ_i will get penalized less, while contributions of basis functions with small eigenvalue γ_i will get penalized more.
The minimization problem (5.36) now becomes

    min_{f ∈ H_K} [ ∑_{i=1}^n L(y_i, f(x_i)) + λ ∥f∥²_{H_K} ].    (5.42)

Using (5.38) and (5.40), this can be rewritten as

    min_{a_j; j=1,...,∞} [ ∑_{i=1}^n L(y_i, ∑_{j=1}^∞ a_j ϕ_j(x_i)) + λ ∑_{j=1}^∞ a_j² / γ_j ].    (5.43)
It has been shown that this problem has a finite-dimensional solution in H_K that can be written in the form

    f(x) = ∑_{i=1}^n α_i K(x, x_i),    (5.44)

see e.g. Christmann and Steinwart (2008).
For the kernel function K it holds that

    ⟨K(., x_i), K(., x_j)⟩ = ∑_{m=1}^∞ γ_m ϕ_m(x_i) γ_m ϕ_m(x_j) / γ_m
                           = ∑_{m=1}^∞ γ_m ϕ_m(x_i) ϕ_m(x_j)
                           = K(x_i, x_j).    (5.45)

For any g(x) = ∑_{i=1}^∞ b_i ϕ_i(x) in H_K it holds that

    ⟨g, K(., x_i)⟩ = ∑_{m=1}^∞ b_m γ_m ϕ_m(x_i) / γ_m = ∑_{m=1}^∞ b_m ϕ_m(x_i) = g(x_i).    (5.46)

This property is known as the reproducing property of the kernel K. It implies that any function g ∈ H_K can be approximated arbitrarily well by a function f of form (5.44).
For a function f of the form (5.44) it follows further that

    ⟨f, K(., x_i)⟩ = ⟨∑_{j=1}^n α_j K(., x_j), K(., x_i)⟩
                   = ∑_{j=1}^n α_j ⟨K(., x_j), K(., x_i)⟩
                   = ∑_{j=1}^n α_j K(x_i, x_j).    (5.47)

Using (5.44) and (5.45) the penalty term J(f) = ∥f∥²_{H_K} can be rewritten as

    ∥f∥²_{H_K} = ⟨f, f⟩ = ∑_{i=1}^n ∑_{j=1}^n α_i α_j ⟨K(., x_i), K(., x_j)⟩
               = ∑_{i=1}^n ∑_{j=1}^n α_i α_j K(x_i, x_j)
               = α^t K α,    (5.48)

where α = (α₁, . . . , α_n)^t and K is an n by n symmetric matrix with elements K_ij = K(x_i, x_j), called the kernel matrix. Note that (5.48) implies that the symmetric kernel matrix is positive semidefinite.
Using (5.46)-(5.48) the minimization problem (5.42) can be rewritten as

    min_α L(y, Kα) + λ α^t K α,    (5.49)
which now is a finite-dimensional criterion that can be solved numerically. The fact
that the infinite-dimensional problem (5.42) can be reduced to a finite-dimensional
problem as in (5.49) is often called the kernel property.
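For example, with squared-error loss L(y, f(x)) = (y − f(x))², setting the gradient of (5.49) to zero gives (K + λI)α = y when K is invertible, which is kernel ridge regression. A sketch under that assumption:

    # Kernel ridge regression: a closed-form instance of criterion (5.49).
    set.seed(7)
    n <- 40
    x <- sort(runif(n, -2, 2))
    y <- sin(2 * x) + rnorm(n, sd = 0.2)
    K <- outer(x, x, function(a, b) exp(-0.5 * ((a - b) / 0.3)^2))  # Gaussian kernel
    lambda <- 0.1
    alpha <- solve(K + lambda * diag(n), y)
    fhat <- drop(K %*% alpha)    # fitted values f(x_i) = sum_j alpha_j K(x_i, x_j)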
In general there are two important choices in regularization using kernels. The
choice of the kernel function K and the choice of loss function L. As mentioned before,
popular choices for the kernel function K ( x, y) are polynomial kernels and radial basis
function kernels.

Polynomial kernels

For x, y ∈ ℝ^d, the polynomial kernel K(x, y) = (x^t y + 1)^m has M = (d+m)!/(d! m!) eigenfunctions that span the space of polynomials in ℝ^d with maximal degree equal to m. For example, for degree 2 polynomials in ℝ² the kernel becomes

    K(x, y) = (x^t y + 1)²
            = 1 + 2x₁y₁ + 2x₂y₂ + x₁²y₁² + x₂²y₂² + 2x₁x₂y₁y₂

and M = 6 in this case. The corresponding transformation is given by

    h(x) = (1, √2 x₁, √2 x₂, x₁², x₂², √2 x₁x₂)^t.

Note that indeed K(x, y) = h(x)^t h(y).
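This identity is easy to check numerically (a sketch):

    # Check that the degree-2 polynomial kernel equals h(x)^t h(y).
    h <- function(x) c(1, sqrt(2) * x[1], sqrt(2) * x[2],
                       x[1]^2, x[2]^2, sqrt(2) * x[1] * x[2])
    x <- c(0.3, -1.2); y <- c(0.7, 0.4)
    (sum(x * y) + 1)^2     # K(x, y)
    sum(h(x) * h(y))       # h(x)^t h(y): the same value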
We consider again the artificial data of Figure 5.2. The polynomial SVM with m = 2
yields the result shown in Figure 5.4. The misclassification rate of the degree 2 poly-
nomial SVM equals 27.3%, so the added flexibility of the degree 2 polynomial is not
sufficient to markedly improve the misclassification rate of the support vector classi-
fier.
Increasing the degree of the polynomial SVM to m = 3 yields the result shown in
Figure 5.5. The misclassification rate of this degree 3 polynomial SVM equals 27.7%,
so while the extra flexibility leads to a more nonlinear boundary as can be seen in
Figure 5.5, this extra flexibility does not yet lead to a better performance.

Radial basis function kernels

The Gaussian kernel K(x, y) = exp(−(1/2)(∥x − y∥/σ)²) leads to a solution that is an expansion in Gaussian radial basis functions

    g_j(x) = exp(−(1/2)(∥x − x_j∥/σ)²);  j = 1, . . . , n.

The radial kernel K(x, y) = ∥x − y∥² log(∥x − y∥) leads to a solution that is an expansion in the radial basis functions

    g_j(x) = ∥x − x_j∥² log(∥x − x_j∥).

Figure 5.4 Artificial two-dimensional binary classification dataset with test data classified according to the polynomial SVM of degree 2 learned on the training data.

For our artificial data example, the SVM with Gaussian kernel yields the result in
Figure 5.6. The boundary is clearly much more flexible than for the polynomial ker-
nels due to the transformation to an infinite-dimensional space. The misclassification
rate of this Gaussian kernel SVM equals 22.1% which is only 1.1% higher than the op-
timal error rate and much better than the performance of the support vector classifier
and the polynomial SVMs. Note that there is some similarity between the decision
boundary of the Gaussian kernel SVM in Figure 5.6 and the optimal decision bound-
ary in Figure 5.2. The largest differences are in areas without training observations, so
without information for the SVM to determine the boundary.

Above we introduced support vector machines for classification by using the dual
formulation of the support vector classifier which immediately yields a computation-
ally attractive representation of SVMs. To show that classification SVMs fit in the gen-
eral RKHS regularization framework and to better understand SVMs, we now also
consider the original (primal) formulation of the support vector classifier.
For non-support vectors (αi = 0), condition (5.25) implies that µi = γ. It then
follows from condition (5.29) that ξ i = 0 for non-support vectors. Moreover, condi-
tion (5.30) becomes yi ( βt xi + β 0 ) ≥ 1 or equivalently

1 − yi ( βt xi + β 0 ) = 1 − yi f ( xi ) ≤ 0.

[Plot: X2 versus X1 — SVM with polynomial kernel, degree 3]

Figure 5.5 Artificial two-dimensional binary classification dataset with test data clas-
sified according to the polynomial SVM of degree 3 learned on the training data.

For support vectors (αi > 0), condition (5.28) implies that

yi(βᵗxi + β0) − (1 − ξi) = 0

or equivalently

0 ≤ ξi = 1 − yi(βᵗxi + β0) = 1 − yi f(xi).

Using these results, ∑_{i=1}^n ξi can be rewritten as

∑_{i=1}^n [1 − yi f(xi)]₊   (5.50)

where the function [x]₊ is defined as [x]₊ = x if x > 0 and [x]₊ = 0 if x ≤ 0.

By inserting (5.50) into (5.20) and denoting λ = 1/(2γ) the support vector classifier
solves the optimization problem

min_{β,β0} [ ∑_{i=1}^n [1 − yi f(xi)]₊ + λ∥β∥² ].

[Plot: X2 versus X1 — SVM with Gaussian kernel]

Figure 5.6 Artificial two-dimensional binary classification dataset with test data clas-
sified according to the Gaussian RBF SVM learned on the training data.

In a transformed predictor space h(X) this becomes

min_{β,β0} [ ∑_{i=1}^n [1 − yi(βᵗh(xi) + β0)]₊ + λ∥β∥² ].   (5.51)

Now, consider again a kernel function K(x, x′) = ∑_{m=1}^∞ γm ϕm(x)ϕm(x′) as in (5.37)
with associated basis functions hm(x) = √γm ϕm(x). Then, βᵗh(x) can be rewritten as

βᵗh(x) = ∑_{m=1}^∞ βm hm(x) = ∑_{m=1}^∞ βm √γm ϕm(x) = ∑_{m=1}^∞ θm ϕm(x)

where θm = √γm βm. Since ∥β∥² = ∑_{m=1}^∞ βm² = ∑_{m=1}^∞ θm²/γm it follows that (5.51)
can be rewritten as

min_{θ,β0} [ ∑_{i=1}^n [1 − yi(∑_{m=1}^∞ θm ϕm(xi) + β0)]₊ + λ ∑_{m=1}^∞ θm²/γm ]   (5.52)

which corresponds to the formulation of the general kernel minimization problem
in (5.43) and thus has a finite-dimensional solution

f(x) = β0 + ∑_{j=1}^n α̃j K(x, xj)

(where α̃i = yi αi in (5.35)) and (5.52) can be rewritten as

min_{α̃,β0} [ ∑_{i=1}^n [1 − yi f(xi)]₊ + λα̃ᵗKα̃ ]   (5.53)

where K is the kernel matrix. This is a minimization problem of the form (5.49) and is
a special case of the more general problem

min_{f∈H} [ ∑_{i=1}^n L(yi, f(xi)) + λJ(f) ]

which corresponds to (5.36) for the choice L(yi, f(xi)) = [1 − yi f(xi)]₊. A natural
generalization is to use loss functions other than the standard SVM loss function
in (5.53), such as the logistic regression loss function

L(yi, f(xi)) = log(1 + exp(−yi f(xi))),

for example.

5.2 Regression
Regularized optimization in RKHS as in (5.42) can also be used in the regression
context. Consider the linear regression model

f ( x ) = β 0 + x t β. (5.54)

If we take the common squared error loss for the loss function L, then (5.49) reduces
to a penalized least squares problem

min_α (y − Kα)ᵗ(y − Kα) + λαᵗKα   (5.55)

with solution

α̂ = (K + λI)⁻¹y.   (5.56)

This yields

f̂(x) = ∑_{i=1}^n α̂i K(x, xi),

or

f̂ = Kα̂ = K(K + λI)⁻¹y = (I + λK⁻¹)⁻¹y,

which shows that the fitted values again are a linear function of the responses y. This
least squares SVM for regression can be seen as a generalization of ridge regression.
Indeed, if we set K = XXt , i.e. no transformation is applied on the features, then
by using the SVD of X it can be shown that f̂ above is exactly the same as the ridge
regression solution ŷridge in (4.13).
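A minimal R sketch of the solution (5.56) on simulated data (the Gaussian kernel and all names are illustrative choices, not prescribed by the notes):

set.seed(1)
x <- seq(0, 1, len = 50)
y <- sin(2 * pi * x) + rnorm(50, sd = 0.2)
gauss_kernel <- function(s, t, sigma = 0.1) exp(-0.5 * ((s - t) / sigma)^2)
K <- outer(x, x, gauss_kernel)
lambda <- 0.01
alpha_hat <- solve(K + lambda * diag(50), y)   # (K + lambda I)^{-1} y
fitted <- K %*% alpha_hat                      # fitted values f-hat = K alpha-hat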
We now discuss regression support vector machines using a different loss func-
tion that has an effect similar to the loss function used in support vector machines for
classification.

5.2.1 Support Vector Regression


For the support vector classifier in Section 5.1.2 we determined the boundary with
largest margin that separated the two classes. The separation was not complete in the
sense that we allowed that a number of observations fell on the wrong side of their
boundary. Similarly, for the linear regression model (5.54) we now look for the hyper-
plane such that all observations lie within a band of width ϵ around the hyperplane.
Again, we do not expect that such a hyperplane exists for sufficiently small values of
ϵ, thus we allow that a number of observations falls outside the band. This leads to
the ϵ-insensitive loss function

L(y, f ( x )) = max(0, |y − f ( x )| − ϵ).

Hence, the loss is zero for observations whose response yi lies within distance ϵ of the
hyperplane f(x). For observations farther than ϵ from the hyperplane the loss equals
|y − f(x)| − ϵ.
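A two-line R sketch of this loss (the function name is ours):

eps_loss <- function(r, eps = 0.5) pmax(0, abs(r) - eps)   # r = y - f(x)
plot(seq(-2, 2, len = 201), eps_loss(seq(-2, 2, len = 201)), type = "l",
     xlab = "y - f(x)", ylab = "loss")   # flat at zero inside the band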
With this loss function, support vector regression solves the problem

min_{β0,β} [ ∑_{i=1}^n L(yi, f(xi)) + (λ/2)∥β∥² ],   (5.57)

where as usual the penalty allows for a bias-variance trade-off and this regularization
makes it possible to select more stable models. By introducing slack variables this
problem can be rewritten as

min_{β0,β} [ ½∥β∥² + (1/λ) ∑_{i=1}^n (ξi + ξi⋆) ]

subject to   yi − f(xi) ≤ ϵ + ξi,   f(xi) − yi ≤ ϵ + ξi⋆,   ξi ≥ 0,   ξi⋆ ≥ 0,   i = 1, . . . , n,

which leads to the Lagrange primal function

LP = ½∥β∥² + (1/λ) ∑_{i=1}^n (ξi + ξi⋆) + ∑_{i=1}^n αi(yi − β0 − xiᵗβ − ϵ − ξi)
     + ∑_{i=1}^n αi⋆(β0 + xiᵗβ − yi − ϵ − ξi⋆) − ∑_{i=1}^n µiξi − ∑_{i=1}^n µi⋆ξi⋆   (5.58)

where αi, αi⋆ ≥ 0 and µi, µi⋆ ≥ 0 are the Lagrange multipliers. Setting the derivatives
w.r.t. β, β0, ξi and ξi⋆ equal to zero yields the equations

β = ∑_{i=1}^n (αi − αi⋆)xi   (5.59)
0 = ∑_{i=1}^n (αi − αi⋆)   (5.60)
αi = 1/λ − µi;   i = 1, . . . , n   (5.61)
αi⋆ = 1/λ − µi⋆;   i = 1, . . . , n   (5.62)
Substituting the first order conditions (5.59)–(5.62) in (5.58) yields the dual objective
function

LD = ∑_{i=1}^n (αi − αi⋆)yi − ϵ ∑_{i=1}^n (αi + αi⋆) − ½ ∑_{i=1}^n ∑_{j=1}^n (αi − αi⋆)(αj − αj⋆) xiᵗxj   (5.63)

which needs to be maximized w.r.t. αi, αi⋆, i = 1, . . . , n, under the constraints

∑_{i=1}^n (αi − αi⋆) = 0,   0 ≤ αi ≤ 1/λ,   0 ≤ αi⋆ ≤ 1/λ.
The Karush-Kuhn-Tucker conditions satisfied by the solution include the constraints

αi(yi − β0 − xiᵗβ − ϵ − ξi) = 0;   i = 1, . . . , n
αi⋆(β0 + xiᵗβ − yi − ϵ − ξi⋆) = 0;   i = 1, . . . , n
αiαi⋆ = 0;   i = 1, . . . , n
ξiξi⋆ = 0;   i = 1, . . . , n
(αi − 1/λ)ξi = 0;   i = 1, . . . , n
(αi⋆ − 1/λ)ξi⋆ = 0;   i = 1, . . . , n

These constraints imply that only a subset of the values αi − αi⋆ are different from zero
and the associated observations are called the support vectors. The dual function and
the fitted values f(x) = xᵗβ = ∑_{i=1}^n (αi − αi⋆)xᵗxi both are determined only through
the products xᵗxi. Thus, we can generalize support vector regression by using kernels.

5.2.2 Regression Support Vector Machines


We now consider the regression model

f(x) = β0 + h(x)ᵗβ = β0 + ∑_{m=1}^M βm hm(x)

where h(x) = (h1(x), . . . , hM(x))ᵗ is a transformation of the features, e.g. an expan-
sion in basis functions. To estimate β and β0 we need to minimize (5.57) or, equiva-
lently, we need to maximize the dual function (5.63) with the products xiᵗxj replaced
by h(xi)ᵗh(xj). This yields the solution f(x) = h(x)ᵗβ = ∑_{i=1}^n (αi − αi⋆)h(x)ᵗh(xi).
As in Section 5.1.3 we can now introduce a kernel function K(x, x′) = ∑_{m=1}^∞ γm ϕm(x)ϕm(x′)
as in (5.37) and define the associated basis functions as hm(x) = √γm ϕm(x). Then, for
any points x and x′ in Rᵈ the product h(x)ᵗh(x′) = ∑_{m=1}^∞ γm ϕm(x)ϕm(x′) = K(x, x′).
Hence, if the transformation of the feature space is induced by a kernel function, the
products in the dual function (5.63) are given by K(xi, xj) and the corresponding solu-
tion becomes f(x) = ∑_{i=1}^n (αi − αi⋆)K(x, xi).
To illustrate regression SVMs we use a simple dataset which is collected to investi-
gate the relation between age and size of the jawbone. The n = 53 observations can be
seen in Figure 5.7 below. The relation between bone size and age is clearly nonlinear
which is confirmed by the bad fit provided by the regression SVM with linear kernel.
A polynomial fit is more flexible than a linear fit. It can be seen that with a polynomial
kernel of degree 3 the fit is better than the linear fit, but still not entirely satisfactory.
Finally, we also consider a regression SVM with radial kernel. This yields a flexible
(nonparametric) fit to the data that better represents the nonlinear relation between
these two variables. The support vectors for the regression SVM with radial kernel
are marked in Figure 5.7 as large red dots. It can be seen that the support vectors are
the observations furthest from the fit, in line with the ϵ-insensitive loss function that is
used.
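A hedged R sketch of such fits via the e1071 package, on simulated data standing in for the jawbone example (the package choice, names and data-generating curve are ours):

library(e1071)
set.seed(3)
d <- data.frame(age = runif(53, 0, 50))
d$bone <- 120 * d$age / (10 + d$age) + rnorm(53, sd = 8)   # nonlinear toy relation
fit_rad <- svm(bone ~ age, data = d, kernel = "radial", epsilon = 0.1)
fit_pol <- svm(bone ~ age, data = d, kernel = "polynomial", degree = 3, epsilon = 0.1)
ord <- order(d$age)
plot(d$age, d$bone, xlab = "age", ylab = "bone")
lines(d$age[ord], predict(fit_rad, d)[ord], lwd = 3, col = "red")
d[fit_rad$index, ]   # the support vectors of the radial fit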

[Plot: bone versus age with linear, polynomial and radial regression SVM fits]

Figure 5.7 Two-dimensional regression dataset fitted by regression SVM with linear,
polynomial (degree 3) and radial kernel.

Chapter 6

Functional Data Analysis

6.1 Definitions and examples


Functional data analysis (FDA) is a modern branch of statistics that deals with the
analysis of samples of curves, possibly combined with vectors or scalars. For example,
a satellite image processed to show surface temperature can be viewed as a function
X defined on a sphere where X (t, s) represents the temperature at latitude t and longi-
tude s. Clearly, due to finite resolution, the values of X are available only at a finite grid
of points, but the temperature exists at every location (t, s), so it is natural to view X as
a function on the whole sphere. This means that X is intrinsically infinite-dimensional.
However, since temperature cannot vary too much with small changes in latitude and
longitude, X can be assumed to be a smooth function of latitude and longitude. Fig-
ure 6.1 presents two examples from meteorology and biology, respectively.

Figure 6.1 Left: yearly temperature data from 73 weather stations across Spain. Right:
angle formed by the knee of 39 children going through a 20-point gait cycle.

While functional data are defined on an infinite set, such as an interval, a sphere,
. . . , in practice the curves are only observed at a finite grid of points. The grid at
which we observe our data X1 , . . . , Xn can be large and with high resolution or small
and sparse. In the former case, we say that the functional data is densely observed
while in the latter case we say that the functional data is sparsely observed. Of course,
often there is no clear cut-off between sparsely and densely observed functional data.
The weather data in the left panel of Figure 6.1 may be considered to be densely ob-
served while the knee angles in the right panel can probably be classified as sparsely
observed, due to the small number of discretization points. We will be mostly con-
cerned with dense functional data in this chapter.
A major focus in FDA is the shape of the observed functions or of functions which
summarize the properties of the data in some specific way. For example, it may be
of interest to know how a certain treatment affects the mean protein level of patients.
One would then be interested in constructing a curve which describes how this level
changes over time in a group of patients. Another example concerns magnetometer
data. Space physics researchers are not interested in the value of the magnetic field
every five seconds, but in the shape of the curve over several days. The shapes of
such curves at many locations allow researchers to make inferences about physical
processes occurring in near Earth space where no instruments are placed.
Assume now that we are in possession of X1, . . . , Xn independent and identically
distributed curves. Typically, the first step in working with such functional data
is to express them by means of a basis expansion

Xi(t) ≈ ∑_{m=1}^K cim Bm(t),   (i = 1, . . . , n).   (6.1)

In (6.1), the Bm are some standard collection of basis functions, like splines, wavelets
or trigonometric functions. Expansion (6.1) reflects the intuition that our data are ob-
servations from smooth functions that share some shape properties, and so can be
approximated as linear combinations of the same basis functions. If the discretization
grid is very large, (6.1) serves the practical purpose of replacing the original data by
a smaller collection of the coefficients cim . Moreover, if the discretization points differ
between subjects, the expansion puts the curves into a common domain so that they
are more readily comparable. Figure 6.2 presents some commonly used basis func-
tions, generated by the following R-code.

require(fda)
spline.basis = create.bspline.basis(rangeval = c(0,1), nbasis = 5)
par(mar = c(6.6, 6.6, 1.1, 1.1)) # bottom, left, top, right
par(mgp = c(4.6, 1.6, 0)) # margin lines: axis title, axis labels, axis line
plot(spline.basis, lty = 1, lwd = 3, cex.lab = 2.5,
cex.axis = 2.5, col = "black")
fourier.basis = create.fourier.basis(rangeval = c(0,1), nbasis = 5)
plot(fourier.basis, lty = 1, lwd = 3, col = "black",
cex.lab = 2.5, cex.axis = 2.5)

Figure 6.2 Left: plot of five B-spline basis functions. Right: plot of the first five Fourier
basis functions.

The R function create.bspline.basis contains additional arguments which specify
the degree, the smoothness and the positions of the B-spline basis functions, but
we use the defaults here. Spline functions are an extremely versatile tool for smooth-
ing and have become very prominent in functional data analysis.
In the Fourier basis the first function is the constant function, then there are sine
and cosine functions of decreasing periods, or equivalently increasing frequencies. A
rough rule is that one uses the Fourier basis for periodic data, that is, data which have
approximately the same values at the beginning and the end of the interval and the
B-spline basis otherwise.

Example (B-spline expansion of the Wiener process). The Wiener process, also known
as the Brownian motion, is an appropriate limit of the random walk

Si = ∑_{k=1}^i Nk,   Nk ∼ N(0, 1), 1 ≤ k ≤ i.

Figure 6.3 shows a realization of the random walk which can be viewed as a function
defined on [0, 1] by X (ti ) = Si along with its lower-dimensional representation ob-
tained by a B-spline basis expansion with 25 basis functions. The coefficients in the
B-spline expansion are estimated by the least-squares criterion.

Figure 6.3 Random walk and its expansion using 25 B-spline basis functions.

Figure 6.3 is generated with the following R-code.

require(fda)
set.seed(2)
N <- rnorm(500)
Wiener <- cumsum(N)
plot(seq(0,1, len = 500), Wiener, xlab = "",
ylab = "", lwd = 3, type = "l", cex.lab = 2.5, cex.axis = 2.5)
B25.basis <- create.bspline.basis(rangeval = c(0, 1), nbasis = 25)
Wiener.fd <- smooth.basis(y = Wiener, fdParobj = B25.basis,
argvals = seq(0,1, len = 500))
lines(Wiener.fd, lwd = 3, col = "blue")

Notice that the B-spline representation captures the rough shape of the process but not
its finer details, for which a larger collection of B-splines is needed.

6.2 Mathematical foundations


6.2.1 Mean-square continuous processes
In order to perform inference, e.g., estimation with associated measures of uncer-
tainty or hypothesis tests, we need to construct a mathematical model for our func-
tions. In classical statistics, observations are either scalars or vectors and the model
that is most commonly used stipulates that the observations are realizations of a ran-
dom variable or random vector which is a measurable mapping from some underlying
probability space (Ω, A, P) to the space of real numbers that is equipped with its usual
Borel σ-algebra (see, e.g., the appendix).
At the most basic level, functional data may be viewed as independent and identi-
cal copies of a stochastic process {X(t), t ∈ E} with E a compact interval of R which,
without loss of generality, we shall take to be the interval [0, 1]. The index t may refer
to time, for example. The process X may be entirely or partially observed. For every
ω ∈ Ω the function

t ↦ X(t, ω)

is called a path (or sample path) of X(t). See Figures 6.1 and 6.3 for examples.
The measure-theoretic foundation of a stochastic process is that X(t) is a random
variable for each fixed t, i.e., X(t, ω) is a measurable mapping from (Ω, A, P) to R with
its Borel sigma algebra. We say that two stochastic processes X1 and X2 are independent
if for each d and (t1, . . . , td) ∈ Rᵈ the vectors

( X1 (t1 ), . . . , X1 (td )) and ( X2 (t1 ), . . . , X2 (td ))

are independent. The processes X1 and X2 are identically distributed if

(X1(t1), . . . , X1(td)) =ᴰ (X2(t1), . . . , X2(td)).

In other words, if all joint distributions are the same. By induction, these properties
may be extended to n stochastic processes X1 , . . . , Xn .
Of particular importance are the mean and covariance functions of the process
{X(t), t ∈ [0, 1]} given by

µ(t) = E{X(t)},   t ∈ [0, 1],   (6.2)

and

K(s, t) = Cov{X(s), X(t)},   (s, t) ∈ [0, 1]²,   (6.3)

provided that the expectations exist. Processes with well-defined mean and covariance
functions are referred to as second-order processes. The covariance function K(s, t) is
positive semi-definite as for any choice of (α1, . . . , αd) ∈ Rᵈ

∑_{i=1}^d ∑_{j=1}^d αi αj K(ti, tj) = Var{ ∑_{j=1}^d αj X(tj) } ≥ 0.

We focus herein on second-order processes such that

lim_{n→∞} E{|X(tn) − X(t)|²} = 0,   (6.4)

for any t ∈ [0, 1] and any sequence {tn} ⊂ [0, 1] converging to t. Such processes are
said to be mean-square continuous. We have the following useful characterization of
mean-square continuous processes.
Theorem 6.1. Let X be a second-order process. Then, X is mean-square continuous if, and
only if, its mean and covariance functions are continuous.
Proof: As

E{| X (s) − X (t)|2 } = K (s, s) + K (t, t) − 2K (s, t) + |µ(s) − µ(t)|2 ,

the continuity of µ and K implies (6.4).


To show the other direction, we first see that the continuity of µ follows from the
Schwarz inequality, i.e.,

|µ(s) − µ(t)| = |E{ X (s) − X (t)}| ≤ {E{| X (s) − X (t)|2 }}1/2 ,

which tends to zero as s → t by virtue of (6.4). Without loss of generality, assume that
µ(t) = 0 in the following. Write

K (s, t) − K (s′ , t′ ) = K (s, t) − K (s′ , t) + K (s′ , t) − K (s′ , t′ ).

The Schwarz inequality then gives

|K(s, t) − K(s′, t)| ≤ K^{1/2}(t, t) {E{|X(s) − X(s′)|²}}^{1/2},

and, by symmetry,
|K(s′, t) − K(s′, t′)| ≤ K^{1/2}(s′, s′) {E{|X(t) − X(t′)|²}}^{1/2},

so the mean-square continuity of X is seen to imply continuity of K. □


The covariance operator of a mean-square continuous process is defined as the inte-
gral operator K : L2 ([0, 1]) → L2 ([0, 1]) given by
K(f)(s) = ∫_{[0,1]} K(s, t) f(t) dt,   f ∈ L²([0, 1]).   (6.5)

To see that this operator indeed maps L²([0, 1]) into L²([0, 1]), notice that for every
f ∈ L²([0, 1]) we have

∫_{[0,1]} | ∫_{[0,1]} K(s, t) f(t) dt |² ds ≤ ∫_{[0,1]} ∫_{[0,1]} |K(s, t)|² dt ds ∫_{[0,1]} |f(t)|² dt < ∞,

as, by the continuity of K and the compactness of [0, 1]², the double integral above is fi-
nite (recall that continuous functions on compact domains are uniformly continuous).

6.2.2 The Karhunen-Loève theorem
As noted previously, K is symmetric and positive semi-definite. Hence, by Mercer's
theorem (Theorem A.8 in the Appendix),

K(s, t) = ∑_{j=1}^∞ λj ej(s)ej(t),   (6.6)

where the sum converges absolutely and uniformly and (λj, ej) are eigenvalue and
eigenfunction pairs of K.
Our goal now is to develop an expansion for our stochastic process X in terms
of the eigenvalue and eigenfunction pairs of K. The first step in this direction is to
define the abstract L² stochastic integral of a mean-square continuous process X. For
convenience we shall henceforth assume that the mean function µ(t) is equal to zero.
For each n consider any measurable partition E1, . . . , Emn of [0, 1] such that each Ei ∈
B([0, 1]) and its diameter is less than 1/n. Let ti be an arbitrary point in Ei for each
i = 1, . . . , mn. For any function f ∈ L²([0, 1]) define

IX(f, {Ei, ti : 1 ≤ i ≤ mn}) = ∑_{i=1}^{mn} X(ti) ∫_{Ei} f(u) du.

Now take another such partition {E′j, t′j : 1 ≤ j ≤ m′n}. Then,

E{ |IX(f, {Ei, ti : 1 ≤ i ≤ mn}) − IX(f, {E′j, t′j : 1 ≤ j ≤ m′n})|² }
   = ∑_{i1=1}^{mn} ∑_{i2=1}^{mn} K(ti1, ti2) ∫_{Ei1} f(u) du ∫_{Ei2} f(v) dv
   + ∑_{j1=1}^{m′n} ∑_{j2=1}^{m′n} K(t′j1, t′j2) ∫_{E′j1} f(u) du ∫_{E′j2} f(v) dv
   − 2 ∑_{i=1}^{mn} ∑_{j=1}^{m′n} K(ti, t′j) ∫_{Ei} f(u) du ∫_{E′j} f(v) dv.

As K is uniformly continuous, each double sum on the right-hand side can be made
arbitrarily close to

∫_{[0,1]} ∫_{[0,1]} K(u, v) f(u) f(v) du dv,

for large enough n and n′. Hence IX(f, {Ei, ti : 1 ≤ i ≤ mn}) is a Cauchy sequence
in L²(Ω, A, P) and, by completeness, there exists a random variable IX(f) to which
IX(f, {Ei, ti : 1 ≤ i ≤ mn}) converges in mean-square as n → ∞. Moreover, this limit
is independent of the choice of the partition {Ei, ti : 1 ≤ i ≤ mn}.
The random variable IX ( f ) satisfies a number of important properties, summarized
in Theorem 6.2 below.

Theorem 6.2. Let {X(t), t ∈ [0, 1]} be a mean-square continuous stochastic process with
mean zero. Then, for any f, g ∈ L²([0, 1]),

(i) E{IX(f)} = 0.

(ii) E{IX(f)X(t)} = ∫_{[0,1]} K(u, t) f(u) du for any t ∈ [0, 1].

(iii) E{IX(f)IX(g)} = ∫_{[0,1]} ∫_{[0,1]} K(u, v) f(u) g(v) du dv.

Proof. We only prove assertion (i) as assertions (ii) and (iii) can be proven in a similar
manner. Observe that

|E{ IX ( f )}|2 = |E{ IX ( f ) − IX ( f , { Ei , ti : 1 ≤ i ≤ mn })}|2


≤ E{| IX ( f ) − IX ( f , { Ei , ti : 1 ≤ i ≤ mn })|2 },

which tends to 0 as n → ∞, by mean-square convergence. □


Let us look at assertion (iii) of Theorem 6.2 more closely. The assertion holds for all
f, g ∈ L²([0, 1]), hence putting f = ei and g = ej with ei and ej eigenfunctions of K we
obtain

E{IX(ei)IX(ej)} = ∫_{[0,1]} ∫_{[0,1]} K(u, v) ei(u) ej(v) du dv = δij λi,   (6.7)

by definition of the eigenvalue and eigenfunction pairs (λi, ei). We are now ready to
state and prove the Karhunen-Loève theorem.
state and prove the Karhunen-Loève theorem.
Theorem 6.3 (Karhunen-Loève). Let {X(t), t ∈ [0, 1]} be a mean-square continuous pro-
cess with mean zero. Then,

lim_{N→∞} sup_{t∈[0,1]} E{|X(t) − XN(t)|²} = 0,   (6.8)

where XN(t) = ∑_{j=1}^N IX(ej) ej(t).

Proof. Using Theorem 6.2 and (6.7) we see that

E{|XN(t) − X(t)|²} = E{|XN(t)|²} − 2E{XN(t)X(t)} + E{|X(t)|²}
   = ∑_{j=1}^N λj|ej(t)|² − 2 ∑_{j=1}^N λj|ej(t)|² + K(t, t)
   = K(t, t) − ∑_{j=1}^N λj|ej(t)|²,

which tends to zero uniformly in [0, 1] by Mercer's theorem. □


The random variables IX (e j ) are sometimes referred to as the principal component
scores or simply scores. The partial sum X N (t) is the N-term Karhunen-Loève expan-
sion and (6.8) says that the L2 (Ω, A, P) distance between X (t) and its Karhunen-Loève
expansion can be made arbitrarily small uniformly in t.

6.2.3 Mean-square continuous processes in L2 ([0, 1])

The Karhunen-Loève expansion is reminiscent of classical expansions for deter-


ministic L2 ([0, 1]) functions, e.g., the Fourier expansion. Parseval’s theorem (Theo-
rem A.7 in the appendix) strongly suggests in this case that the abstract scores IX (e j )
are simply ⟨ X, e j ⟩, but under which conditions is this true for a mean-square continu-
ous process X? This section provides the answer.
At an intuitive level, for expressions such as ⟨X, ej⟩ to make sense we have to min-
imally require that for every ω ∈ Ω the paths t ↦ X(t, ω) belong to L²([0, 1]). More-
over, for ⟨X, ej⟩ to be a random variable we have to impose another measurability
condition. We shall, in particular, assume that X(t, ω) is a jointly measurable function
from [0, 1] × (Ω, A, P) to R. This measurability condition entails that for any open set
O ⊂ R we have

{(t, ω ) ∈ [0, 1] × Ω : X (t, ω ) ∈ O} ∈ σ(B([0, 1]) × A),

where B([0, 1]) denotes the Borel sets of [0, 1] and σ(B([0, 1]) × A) denotes the smallest
σ-algebra that includes the Cartesian product B([0, 1]) × A (which is, in general, not a
sigma-algebra).
If X is jointly measurable and X(·, ω) ∈ L²([0, 1]), then for every f ∈ L²([0, 1]),
⟨X, f⟩ is a measurable mapping from (Ω, A, P) to R equipped with its Borel sigma-
algebra. In other words, ⟨ X, f ⟩ is a random variable. A proof of this may be found in
the standard treatment of Fubini’s theorem. This important fact allows us to simplify
the scores IX (e j ) appearing in the Karhunen-Loève expansion.

Theorem 6.4. Let {X(t), t ∈ [0, 1]} be a mean-square continuous process with sample paths
in L²([0, 1]) that is jointly measurable. Then, for any f ∈ L²([0, 1]),

IX(f) = ∫_{[0,1]} X(t, ω) f(t) dt,

almost surely.

Proof. Let { Ei , ti , 1 ≤ i ≤ mn } denote the partition of [0, 1] with maximal diameter less
than 1/n discussed previously. Under our assumptions ⟨ X, f ⟩ is a random variable.

Therefore, by Fubini's theorem,

E{ |IX(f, {Ei, ti : 1 ≤ i ≤ mn}) − ∫_{[0,1]} X(t) f(t) dt|² }
   = ∑_{i1=1}^{mn} ∑_{i2=1}^{mn} K(ti1, ti2) ∫_{Ei1} f(u) du ∫_{Ei2} f(v) dv
   + ∫_{[0,1]} ∫_{[0,1]} K(u, v) f(u) f(v) du dv
   − 2 ∑_{i=1}^{mn} ∫_{Ei} f(u) du ∫_{[0,1]} K(ti, v) f(v) dv.

By mean-square convergence, as n → ∞ the left-hand side tends to

E{|IX(f) − ⟨X, f⟩|²},

and the right-hand side tends to zero by uniform continuity of K. □


A corollary of Theorem 6.4 is that IX(ej) = ⟨X, ej⟩ and the Karhunen-Loève expan-
sion becomes

X(t) = ∑_{j=1}^∞ ⟨X, ej⟩ej(t),

with the series converging uniformly in t ∈ [0, 1] and ⟨X, ej⟩ orthogonal random vari-
ables with variances equal to λj for all j. Since, by Mercer's theorem, λj → 0 as
j → ∞, the path of the process can often be described quite well with the partial sum
XN for small N.
The Karhunen-Loève theorem says even more. In particular, the orthonormality of
the ej gives

E{∥XN∥²} = ∑_{j=1}^N E{|⟨X, ej⟩|²} = ∑_{j=1}^N λj.

For N → ∞, by uniform mean-square convergence, we get

E{∥X∥²} = ∑_{j=1}^∞ λj.   (6.9)

This equality implies that the variance of X ∈ L2 ([0, 1]) equals the sum of the variances
of the scores ⟨ X, e j ⟩ and forms the basis for functional principal component analysis.
To illustrate these results, let us return to the Brownian motion example and obtain
expressions for the eigenvalue and eigenfunction pairs.

Example (Wiener process, continued). Let W(t) denote the stochastic process that
is the limit of the random walk. It can be shown that E{W(t)} = 0 and K(s, t) =
min(s, t). To determine the eigenvalues and eigenfunctions of the covariance operator
we need to find numbers λj and functions ej such that

∫₀¹ min(s, t) ej(s) ds = λj ej(t),

which is the same as

∫₀ᵗ s ej(s) ds + t ∫ₜ¹ ej(s) ds = λj ej(t).

Differentiating both sides with respect to t with Leibniz's formula and simplifying we
get

∫ₜ¹ ej(s) ds = λj e′j(t).

Differentiating once more reveals that ej(t) = −λj e″j(t). A general solution for this
differential equation is

ej(s) = a sin(s/λj^{1/2}) + b cos(s/λj^{1/2}).

The above equations imply that ej(0) = 0 and e′j(1) = 0, so that b = 0 and
a cos(1/λj^{1/2}) = 0, which leads to λj^{−1/2} = jπ − π/2, whence

λj = 4/((2j − 1)²π²)

and

ej(s) = 2^{1/2} sin( (2j − 1)πs/2 ).

The eigenvalues are indeed decreasing and summable, while the eigenfunctions are
continuous (actually infinitely differentiable).
How well do the partial sums of the Karhunen-Loève expansion approximate the
trajectory of W (t)? Figure 6.4 shows that the rough shape of the trajectories may be
captured with only a handful of eigenfunctions and associated scores.

N1 <- 5
E.m <- matrix(0, nrow = 500, ncol = N1)
for(j in 1:N1){
  E.m[,j] = sapply(seq(0,1, len = 500), FUN =
            function(x) 2^{1/2}*sin((2*j-1)*pi*x/2))
}
scores <- t(E.m)%*%Wiener/500
X.N1 <- E.m%*%scores

N2 <- 10
E.m <- matrix(0, nrow = 500, ncol = N2)
for(j in 1:N2){
  E.m[,j] = sapply(seq(0,1, len = 500), FUN =
            function(x) 2^{1/2}*sin((2*j-1)*pi*x/2))
}
scores <- t(E.m)%*%Wiener/500
X.N2 <- E.m%*%scores
plot(seq(0,1, len = 500), Wiener, xlab = "",
     ylab = "", lwd = 3, type = "l", cex.lab = 2.5, cex.axis = 2.5)
lines(seq(0,1, len = 500), X.N1, lwd = 3, col = "blue")
lines(seq(0,1, len = 500), X.N2, lwd = 3, col = "red")
legend("topleft", col = c("blue", "red"), lty = c(1, 1), lwd = c(3, 3),
       legend = c("N = 5", "N = 10"), cex = 2)

Figure 6.4 Random walk and its Karhunen-Loève approximation with N = 5 and
N = 10 in blue and red, respectively.

6.3 Mean and covariance estimation


The mean and covariance functions µ(t) and K(s, t) of a stochastic process {X(t), t ∈
[0, 1]} are population quantities and are in practice unknown. In this section we as-
sume that we are in possession of n independent and identically distributed copies
X1, . . . , Xn of a jointly measurable mean-square continuous process X with sample
paths in L²([0, 1]). We assume moreover that the processes are fully observed. Intu-
itive estimators for µ(t) and K(s, t) are

µ̂(t) = (1/n) ∑_{i=1}^n Xi(t),   and   K̂(s, t) = (1/n) ∑_{i=1}^n (Xi(s) − µ̂(s))(Xi(t) − µ̂(t)).
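On a common discretization grid these estimators are one line of R each; a minimal sketch, assuming the curves are stored as the rows of an n × Tg matrix X (all names and the placeholder data are ours):

n <- 40; Tg <- 100
X <- matrix(rnorm(n * Tg), n, Tg)   # placeholder curves on a grid of Tg points
mu_hat <- colMeans(X)               # pointwise sample mean
Xc <- sweep(X, 2, mu_hat)           # centered curves
K_hat <- crossprod(Xc) / n          # discretized K-hat(s, t)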
How well do these quantities approximate µ(t) and K (s, t) and under which con-
ditions? Theorems 6.5 and 6.6 provide partial answers.
Theorem 6.5. Let X1, . . . , Xn denote iid copies of a mean-square continuous process {X(t), t ∈
[0, 1]} that is jointly measurable and E{∥X∥₂²} < ∞. Then

E{∥µ̂ − µ∥₂²} = O(1/n),

as n → ∞.
Proof. We have

E{∥µ̂ − µ∥₂²} = (1/n²) ∑_{i=1}^n ∑_{j=1}^n E{⟨Xi − µ, Xj − µ⟩}
   = (1/n²) ∑_{i=1}^n ∑_{j=1}^n E{E{⟨Xi − µ, Xj − µ⟩|Xj}}
   = (1/n²) ∑_{i=1}^n ∑_{j=1}^n E{⟨E{Xi − µ|Xj}, Xj − µ⟩}.

But, by independence, E{Xi − µ|Xj} = 0 for i ̸= j and E{Xi − µ|Xi} = Xi − µ. It now
follows that

E{∥µ̂ − µ∥₂²} = (1/n²) ∑_{i=1}^n E{⟨Xi − µ, Xi − µ⟩} = (1/n) E{∥X1 − µ∥₂²} = O(1/n),

as E{∥X1 − µ∥₂²} ≤ 2E{∥X1∥₂²} + 2∥µ∥₂² < ∞. □
We now turn to the properties of the sample covariance estimator K̂(s, t). We as-
sume without loss of generality that µ(t) = 0 in the theorem below. To see this, notice
that

K̂(s, t) = (1/n) ∑_{i=1}^n (Xi(s) − µ(s))(Xi(t) − µ(t)) − (µ̂(s) − µ(s))(µ̂(t) − µ(t)),

so that, by Theorem 6.5,

∫_{[0,1]} ∫_{[0,1]} |K̂(s, t) − K(s, t)|² ds dt
   ≤ 2 ∫_{[0,1]} ∫_{[0,1]} | (1/n) ∑_{i=1}^n (Xi(s) − µ(s))(Xi(t) − µ(t)) − E{(X(s) − µ(s))(X(t) − µ(t))} |² ds dt
   + OP(1/n²).

Thus, up to terms of order n⁻² we can always assume that the process X is centered.
Theorem 6.6. Let X1, . . . , Xn denote iid copies of a mean-square continuous process {X(t), t ∈
[0, 1]} that is jointly measurable with mean zero and E{∥X∥₂⁴} < ∞. Then

∫_{[0,1]} ∫_{[0,1]} E{|K̂(s, t) − K(s, t)|²} ds dt = O(1/n).
Proof. We have

∫_{[0,1]} ∫_{[0,1]} |K̂(s, t) − K(s, t)|² ds dt = (1/n²) ∫_{[0,1]} ∫_{[0,1]} | ∑_{i=1}^n [Xi(s)Xi(t) − E{X(s)X(t)}] |² ds dt,

and, by independence and identical distributions,

E{ | ∑_{i=1}^n [Xi(s)Xi(t) − E{X(s)X(t)}] |² } = ∑_{i=1}^n E{ |X1(s)X1(t) − E{X(s)X(t)}|² } ≤ nE{ |X1(s)X1(t)|² },

where we have used E{|X − E{X}|²} ≤ E{|X|²} for all random variables X. Thus,
interchanging expectation and integration (this is permissible for positive integrands)
we have

∫_{[0,1]} ∫_{[0,1]} E{|K̂(s, t) − K(s, t)|²} ds dt ≤ (1/n) ∫_{[0,1]} ∫_{[0,1]} E{|X1(s)X1(t)|²} ds dt
   = (1/n) E{ ∫_{[0,1]} ∫_{[0,1]} |X1(s)X1(t)|² ds dt }
   = (1/n) E{∥X1∥₂⁴}
   = O(1/n),

as E{∥X∥₂⁴} < ∞. □
One may also consider other modes of convergence of µ̂ and K̂ to their population
counterparts. For example, for every t ∈ [0, 1] the strong law of large numbers (SLLN)
yields the existence of a set A ∈ A such that P(A) = 0 and

(1/n) ∑_{i=1}^n Xi(t, ω) → µ(t),   as n → ∞,

for every ω ∉ A. The exceptional set A may depend on t, hence the SLLN only implies
pointwise convergence. A stronger mode of convergence is the uniform distance given
by

sup_{t∈[0,1]} |µ̂(t) − µ(t)|.

Convergence in this distance clearly implies both L² and pointwise convergence. Gen-
erally, however, one needs stronger conditions on X and µ in order to ensure that the
uniform distance tends to zero either in probability or almost surely. Such conditions
may involve smooth trajectories, a continuously differentiable mean function µ(t), etc.
Figure 6.5 provides an example of an estimated covariance function. Notice the
interesting pattern at the beginning and end of the year.

Figure 6.5 Estimated covariance surface for the Spanish weather data seen in Fig-
ure 6.1.

6.4 Functional principal component analysis
We have seen previously that the Karhunen-Loève expansion decomposes a mean-
square continuous and jointly measurable L²([0, 1])-process X with mean zero into a
series involving the orthonormal eigenfunctions of the covariance operator K. The
uniform convergence of this series implies that X can be well-approximated by

XN(t) = ∑_{j=1}^N ⟨X, ej⟩ej(t),

for some N depending on the desired quality of the approximation. However, the
eigenfunctions of K are only one of the possible orthonormal systems one can use
to express X. For example, one could instead use a Fourier or orthogonalized spline
system. Thus, one may ask if there exists an "optimal" basis system with which we
can express X.
The answer to this question (if it even exists) depends on the optimality criterion
we choose to employ. Herein we call an N-dimensional orthonormal system u1, . . . , uN
optimal if for any other N-dimensional orthonormal system w1, . . . , wN it holds that

E{ ∥X − ∑_{j=1}^N ⟨X, uj⟩uj∥₂² } ≤ E{ ∥X − ∑_{j=1}^N ⟨X, wj⟩wj∥₂² }.   (6.10)

The next question would then be how to find an orthonormal system u1, . . . , uN satis-
fying (6.10).
fying (6.10).
Although a seemingly difficult task, the assumed orthonormality greatly simplifies
things. Indeed, note that for any orthonormal system w1, . . . , wN we have

E{ ∥X − ∑_{j=1}^N ⟨X, wj⟩wj∥₂² } = E{∥X∥₂²} − 2 ∑_{j=1}^N E{|⟨X, wj⟩|²} + ∑_{j=1}^N E{|⟨X, wj⟩|²}
   = E{∥X∥₂²} − ∑_{j=1}^N E{|⟨X, wj⟩|²}.   (6.11)

The first term in (6.11) does not depend on the wj and the second term is non-negative.
Thus, to identify the minimum we need to select the basis functions wj, j = 1, . . . , N in
such a way that ∑_{j=1}^N E{|⟨X, wj⟩|²} is maximized. Now,

E{|⟨X, wj⟩|²} = E{⟨⟨X, wj⟩X, wj⟩} = ⟨Kwj, wj⟩,   (6.12)

so that the optimization problem involves the covariance operator of X. Theorem 6.7
provides the solution.

Theorem 6.7. Suppose the covariance operator K of a mean-square continuous, jointly measur-
able process with mean zero has eigenvalues satisfying λ1 > λ2 > . . . > λN > . . . > 0
with corresponding orthonormal eigenfunctions ej. Then, for j ≤ N,

λj = sup{⟨Kf, f⟩ : ∥f∥₂ = 1, ⟨f, ek⟩ = 0, 1 ≤ k ≤ j − 1}.

The supremum is attained if f = ej and the maximizing function is unique up to a sign.

Proof. Let f ∈ L²([0, 1]) be such that ∥f∥₂ = 1 and ⟨f, ek⟩ = 0, 1 ≤ k ≤ j − 1. By
Mercer's theorem,

⟨Kf, f⟩ = ∑_{k=1}^∞ λk⟨f, ek⟩² = ∑_{k=j}^∞ λk⟨f, ek⟩²,

from which we obtain

0 ≤ ⟨Kf, f⟩ ≤ λj ∑_{k=j}^∞ ⟨f, ek⟩² ≤ λj∥f∥₂² = λj.

Furthermore, since {ej} is an orthonormal set and ⟨Kej, ej⟩ = λj, the supremum is
attained if, and only if, f = ej or f = −ej. □
Theorem 6.7 implies that the eigenfunctions of K provide optimal representations
of X in the sense that the expected L²-error is minimized. Via the Karhunen-Loève
decomposition we can even obtain the error of this representation precisely:

E{ ∥X − ∑_{j=1}^N ⟨X, ej⟩ej∥₂² } = ∑_{j=N+1}^∞ λj,

so that the error depends on the rate of decay of the eigenvalues λj. In the extreme
case that λ_{N+k} = 0 for all k ∈ ℕ the representation incurs no approximation error
whatsoever. In the general case, however, the error depends on the regularity of the
paths of X.
The analysis above is at the population level and in practice one has no knowledge
of either the covariance operator or its eigenvalues and corresponding eigenfunctions,
which have to be estimated from the sample. A natural estimator for the covariance
operator is

K̂f(t) = ∫_{[0,1]} K̂(s, t) f(s) ds.

Sample functional principal component analysis may be carried out through the use
of K̂, which can also be shown to be bounded, positive semi-definite and self-adjoint
(show it!). Therefore, estimates for λj and ej may be obtained by solving

K̂ êj(t) = λ̂j êj(t).   (6.13)

Practically, the solutions to (6.13) are obtained by discretizing the random functions
X1, . . . , Xn and proceeding as in the multivariate case. A complication is that from a
sample of size n one can identify at most n unique eigenvalues and eigenfunctions.
The reason is that the range of K̂ is only the span of X1, . . . , Xn, which is at most an
n-dimensional subspace of L²([0, 1]).
Estimated eigenfunctions can be shown to minimize the sample equivalent of the
previously discussed problem. In particular, we have

(1/n) ∑_{i=1}^n ∥Xi − ∑_{j=1}^N ⟨Xi, êj⟩êj∥₂² ≤ (1/n) ∑_{i=1}^n ∥Xi − ∑_{j=1}^N ⟨Xi, wj⟩wj∥₂²

for all N-dimensional orthonormal systems w1, . . . , wN. Furthermore, the associated
error can be shown to be

(1/n) ∑_{i=1}^n ∥Xi − ∑_{j=1}^N ⟨Xi, êj⟩êj∥₂² = ∑_{j=N+1}^n λ̂j,   (6.14)

for all N < n.


How well do estimated eigenvalues and eigenfunctions obtained from (6.13) ap-
proximate their population counterparts? It is worth noting that the eigenfunctions
are only unique up to a sign, as if ej is an eigenfunction of K then so is −ej. Thus, if êj is
an estimate computed from the data we can only hope that ĉj êj is close to ej, where

ĉj = sign(⟨êj, ej⟩).

These values cannot be computed from the data, so in practice it must be ensured that
the statistics we want to work with do not depend on ĉj. We have the following result
which is given without proof.

Theorem 6.8. Let X1, . . . , Xn denote iid copies of a mean-square continuous process {X(t), t ∈
[0, 1]} that is jointly measurable with E{∥X∥₂⁴} < ∞. If

λ1 > λ2 > . . . > λN > . . . > 0,

then for each 1 ≤ j ≤ N,

E{∥ĉj êj − ej∥₂²} = O(1/n)   and   E{|λ̂j − λj|²} = O(1/n).

Figure 6.6 plots the first three estimated eigenfunctions of the Spanish weather and
gait datasets seen in Figure 6.1. The corresponding eigenvalues explain 99% and 82%
of the total variance, respectively.

Figure 6.6 First three sample eigenfunctions of the Spanish weather and gait datasets
on the left and right, respectively.

require(fda.usc)
data("aemet")
temp <- aemet$temp$data
ctemp <- cov(temp)
etemp <- eigen(ctemp)
dim(etemp$vectors)
par(mar = c(6.6, 6.6, 2, 1.1)) # bottom, left, top, right
par(mgp = c(4.6, 1.6, 0)) # margin lines: axis title, axis labels, axis line
matplot(etemp$vectors[, 1:3], type = "l", lwd = 3, lty = 1, cex.lab = 2.5,
cex.axis = 2.5, cex = 2.5,
col = c("blue", "red", "darkgreen"), xlab = "Day", ylab = "")
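As a short addition to this chunk, the shares of variance explained by the leading eigenvalues (quoted in the text) can be read off the same eigendecomposition:

cumsum(etemp$values[1:5]) / sum(etemp$values)   # cumulative share of total variance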

6.5 Functional regression


It is often of interest to explore functional data in relation to either other functional
data or random variables. Herein we shall examine the most popular of these models,
which assumes that we are in possession of tuples (X1, y1), . . . , (Xn, yn) consisting of
mean-square continuous jointly measurable L²([0, 1]) processes Xi and real random
variables yi satisfying the model

yi = α0 + ∫_{[0,1]} β0(t) Xi(t) dt + ϵi,   (i = 1, . . . , n),   (6.15)

for some unknown α0 ∈ R and β0 ∈ L²([0, 1]). The function β0 is referred to as the
coefficient (or weight) function while the errors ϵi are assumed to be iid with mean zero

and finite variance. The functional linear model (FLM) in (6.15) has a vast number
of applications. For example, it is often of interest to relate near-infrared reflectance
(NIR) spectra with octane measurements of gasoline samples. Figure 6.7 presents a
sample of 60 such data.

Figure 6.7 Left: NIR spectra. Right: octane measurements.

How can we estimate the unknown parameters (α0, β0) of the functional linear
model? There are roughly speaking two approaches: the basis expansion approach and
the penalization approach. Starting with the basis expansion approach, the main idea
is to reduce the FLM to a standard regression problem by expressing β0 in a small
number of suitable basis functions. That is, we approximate β0 by

β̃N(t) = ∑_{j=1}^N βj Bj(t),   (6.16)

for some small N. In principle, B1, . . . , BN can be any system of basis functions such as
Fourier and B-splines, but also the sample eigenfunctions of K discussed in Section 6.4.
Either way, the approximation in (6.16) works by substituting the infinite-dimensional
β0 with an N-dimensional approximation. Having done so, the coefficients β1, . . . , βN
can be estimated by minimizing

L(α, β) = ∑_{i=1}^n |yi − α − ⟨Xi, β̃N⟩|²,   (6.17)

over R × Rᴺ. For N ≪ n, the Gram matrix is positive definite, hence there exists a
unique solution to (6.17).
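A minimal R sketch of the basis expansion approach, assuming the curves are stored as rows of an n × Tg matrix X observed on an equispaced grid of [0, 1] (the placeholder data and all names are ours):

library(fda)
n <- 60; Tg <- 200; tgrid <- seq(0, 1, len = Tg)
X <- matrix(rnorm(n * Tg), n, Tg)      # placeholder curves
y <- rnorm(n)                          # placeholder responses
B <- eval.basis(tgrid, create.bspline.basis(c(0, 1), nbasis = 7))
Z <- X %*% B / Tg                      # Z[i, j] approximates <X_i, B_j>
fit <- lm(y ~ Z)                       # least squares for alpha and beta_1, ..., beta_N
beta_hat <- B %*% coef(fit)[-1]        # estimated coefficient function on the grid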
The basis expansion approach is simple and intuitively appealing, yet the resulting
estimates depend greatly upon (i) the choice of the basis system and (ii) the
number of basis functions employed. Varying either of these is likely to cause great
changes in the shape of the estimated coefficient function β̂(t). At both theoretical and
practical levels, one also has to contend with the large biases that may be incurred by
approximation (6.16). In particular, while with most basis systems that are dense in
L²([0, 1]) we can write

β0(t) = ∑_{j=1}^∞ βj Bj(t) = ∑_{j=1}^N βj Bj(t) + δ(t),

the basis function approach essentially ignores the remainder term δ(t). The impor-
tance of this term generally depends on the (unknown) shape of β 0 with more compli-
cated coefficient functions requiring a large number of basis functions for an accurate
description. Thus, the basis function approach can only be recommended in cases
where one is only interested in the rough shape of β 0 and not its finer characteristics.
A more potent alternative that reduces the dependence on the basis system and
the number of basis functions involves using a large number of basis functions and
controlling the increase in the variance of the estimates with a penalty (hence the name
penalization). Assume now that N is so large, possibly even larger than n, that δ(t)
is negligible. The penalization approach then solves

Lλ(α, β) = ∑_{i=1}^n |yi − α − ⟨Xi, β̃N⟩|² + λI(β),   (6.18)

for some penalty parameter λ ≥ 0 and a penalty function I : Rᴺ → R₊. The penalty
parameter λ places a premium on values of β that lead to large values of I(β). For λ = 0
the penalty drops out and we just have the basis expansion estimator but with larger
N. On the other hand, for λ → ∞ the penalty term dominates the objective function
(6.18), forcing the estimated coefficient function to lie in the null space of I(·).
(6.18) forcing the estimated coefficient function to lie in the null space of I(·).
Possible examples of penalties include the ridge and lasso penalties seen in Chap-
ter 4. More often, however, we place the penalty directly on the shape of the estimated
coefficient function with the goal of producing smooth interpretable estimates. A pop-
ular penalty in this respect is given by

I(β̃N) = ∫_{[0,1]} |β̃″N(t)|² dt = β⊤Pβ,

where Pij = ∫_{[0,1]} B″i(t) B″j(t) dt. This penalty implicitly assumes that the basis func-
tions B1, . . . , BN are twice differentiable with a second derivative in L²([0, 1]). Notice
that this holds for the Fourier basis and B-spline functions with p ≥ 3. The null space of
I(β̃N) consists precisely of affine functions of the form

β̃N(t) = β0 + β1t,

Figure 6.8 Penalized spline estimates for the gasoline data with 36 interior knots and
p = 4. Top left: λ = 0. Top right: λ = 0.5. Bottom left: λ = 20. Bottom right: λ
selected with generalized cross-validation.

for some (β0, β1) ∈ R². Thus, for λ → 0 the estimated coefficient function becomes
wiggly and complex while for λ → ∞ it resembles more and more a straight line.
Figure 6.8 illustrates the role of λ on the gasoline data.

require(refund)
data(gasoline)
y = gasoline$octane
x = gasoline$NIR
par(mar = c(6.6, 6.6, 1.1, 1.1))
par(mgp = c(4.6, 1.6, 0))
fit1 <- pfr(y~lf(x, k = 40, bs = "ps"), sp = 0)
fit2 <- pfr(y~lf(x, k = 40, bs = "ps"), sp = 0.5)
fit3 <- pfr(y~lf(x, k = 40, bs = "ps"), sp = 20)

fit4 <- pfr(y~lf(x, k = 40, bs = "ps"), method = "GCV.Cp")

plot(fit1, se = FALSE, rug = FALSE, xlab = "t", ylab = "",
     lwd = 3, cex.lab = 2.5, cex.axis = 2.5)
plot(fit2, se = FALSE, rug = FALSE, xlab = "t", ylab = "",
lwd = 3, cex.lab = 2.5, cex.axis = 2.5)
plot(fit3, se = FALSE, rug = FALSE, xlab = "t", ylab = "",
lwd = 3, cex.lab = 2.5, cex.axis = 2.5)
plot(fit4, se = FALSE, rug = FALSE, xlab = "t", ylab = "",
lwd = 3, cex.lab = 2.5, cex.axis = 2.5)

The theoretical properties of penalized estimators tend to be involved, as they de-


pend on (i) the smoothness of β 0 (t), (ii) the choice of the basis system, (iii) the rate of
growth of N as n → ∞ and (iv) the rate of decay of λ as n → ∞. Note that N needs
to increase with n in order to eliminate the approximation bias δ(t) while λ needs to
decrease in order to eliminate the penalization bias.
In the particular case of a B-spline approximation with the integrated squared
derivative penalty recent research has shown that if β0 has uniformly bounded deriva-
tives up to order r with rth derivative satisfying a Lipschitz condition of order v ∈
(0, 1) and p > r, then

E{|⟨X, β̂ − β0⟩|²} = OP(N log(n)/n) + OP(1/N^{2(r+v)}) + OP(λ).

The quantity E{|⟨X, β̂ − β0⟩|²} is called the prediction error of β̂ with respect to β0 and
is very often the pseudo-metric employed in the literature. If we could choose the
rate of growth of N and the rate of decay of λ, then what would be the best rate of
convergence to zero for the prediction error?

Further reading
Functional data analysis is one of the most modern branches of statistics and there
is a blossoming literature covering many techniques from both practical and theoreti-
cal standpoints. The early expository book by Ramsay and Silverman (2006) has been
very influential and was a catalyst for further developments in FDA. A more recent
practical treatment with many worked out examples is Kokoszka and Reimherr (2017).
Theoretical aspects of FDA are covered in Horváth and Kokoszka (2012) and Hsing
and Eubank (2015) with the latter placing emphasis on reproducing kernel Hilbert
space methods.

Recent developments in FDA include robust functional data analysis and the study
of discretely sampled functional data or even the robust analysis of discretely sampled
functional data. Examples of such contributions are Sinova et al. (2018) and Kalogridis
and Van Aelst (2021).

Appendix A

Background

A.1 Basics of measure and probability theory


A.1.1 The abstract integral
The most natural example of a "measure" is that of length, area or volume of sets in
one-, two- or three-dimensional Euclidean space. As in these special cases, a measure µ
assigns non-negative (not necessarily finite) values to sets in some space Ω. A measure
µ is thus a set function; the value it assigns to a set A ⊂ Ω will be denoted by µ( A).
In generalization of the properties of length, area and volume, a measure is re-
quired to be additive, that is, to satisfy

µ ( A ∪ B ) = µ ( A ) + µ ( B ), A, B disjoint.

By induction, this property may be readily extended to any finite union of disjoint
sets. Usually, we require something more, namely that the measure µ is sigma-additive
which means that

µ( ∪_{i=1}^∞ Ai ) = ∑_{i=1}^∞ µ(Ai),   Ai ⊂ Ω, Ai ∩ Aj = ∅, i ̸= j.   (A.1)

Sigma-additivity allows for a much richer mathematical theory but brings about
technical difficulties whenever the domain of the measure is every subset A of Ω.
Therefore, it is necessary to restrict the definition to a sigma-algebra (or sigma-field) of
subsets of Ω, that is, to a class of subsets of Ω often denoted by A satisfying the fol-
lowing three properties

(i) Ω ∈ A.

(ii) For A ∈ A, Ac = Ω \ A ∈ A.
(iii) For Ai ∈ A, ∪_{i=1}^∞ Ai ∈ A.

If A is a sigma-algebra of a space Ω then (Ω, A) is said to be a measurable space and
the sets A ∈ A are said to be measurable. With these definitions, we call µ a measure on
(Ω, A) if it is a nonnegative function on A satisfying (A.1). The triple (Ω, A, µ) is then
often referred to as a measure space.
The following are three important examples of measure spaces.
Example: Counting measure. Let Ω be countable and let A be the class of all subsets
of Ω. For any A ∈ A, let µ( A) be the number of points of A if A is finite and µ( A) = ∞
otherwise. This measure is called the counting measure.
Example: Lebesgue measure on Rᵖ. Let Ω be the whole of Rᵖ and let A be the smallest
sigma algebra containing all half-open rectangles

R = {(x1, . . . , xp) : ai < xi ≤ bi},   −∞ < ai < bi < ∞.   (A.2)

The members of A are called Borel sets. It can be shown that this is a very large class
containing, among others, all open and all closed subsets of Rᵖ. A fundamental result
of measure theory states that there exists a unique measure µ defined over A which
assigns to (A.2) the measure

µ(R) = (b1 − a1) · · · (bp − ap),

that is, its volume; µ is called the Lebesgue measure.


Example: Lebesgue-Stieltjes measure on R. Let Ω be R and let A denote the Borel
sets of R. Let F denote a non-decreasing right-continuous function on R (such as
F(x) = x). Another important result of measure theory states that there exists a unique
measure µF defined over the Borel sets of R such that

µF((a, b]) = F(b) − F(a).

This measure is called the Lebesgue-Stieltjes measure and it is often abbreviated as F. Lebesgue-
Stieltjes measures can also be defined on the Borel sets of Rᵖ, but their construction
becomes more complicated.
A third principal concept needed in addition to sigma algebra and measure is that
of the integral of a real-valued function f with respect to measure µ. However, before
defining this integral, it is necessary to specify a suitable class of functions f . This will
be done in three steps.
First, consider the class of real-valued functions s called simple, which take on only
a finite number of values, say a1 , . . . , am and for which the sets

Ai = { x : s( x ) = ai } ∈ A.

An important special case of a simple function is the indicator I_A of a set A ∈ A defined
by

I_A(x) = I(x ∈ A) = 1 if x ∈ A, and 0 if x ∉ A.

Second, let s1 , s2 , . . . be a pointwise increasing sequence of non-negative simple


functions, that is, for every x ∈ Ω, s1 ( x ) ≤ s2 ( x ) ≤ . . .. Let

f(x) := lim_{n→∞} sn(x).   (A.3)

The limit exists because for every x the sequence is increasing but f ( x ) may be in-
finite. A non-negative function f with domain Ω will be called A-measurable or, for
short, measurable if there exists a pointwise increasing sequence of non-negative sim-
ple functions such that (A.3) holds for every x ∈ Ω.
Third, for an arbitrary function f (not necessarily non-negative), define its positive
and negative parts by

f + ( x ) = max{0, f ( x )}, f − ( x ) = − min{ f ( x ), 0},

so that f + and f − are both non-negative and f ( x ) = f + ( x ) − f − ( x ). Then a function


with domain Ω will be called measurable if both its positive and negative parts are
measurable.
Measurable functions constitute a very large class which has a simple alternative
characterization. Indeed, it can be shown that f : Ω → R is measurable if, and only if,
for every Borel set B ⊂ R, the set

f⁻¹(B) = {x : f(x) ∈ B} ∈ A.

It follows from the definition of Borel sets that it is enough to check that {x : f(x) <
b} ∈ A for every b ∈ R. This shows in particular that if f : Ω → R is continuous
then it is measurable. As another important class of functions, consider the functions
taking on a countable number of values, say a1 , a2 , . . .. Then f is measurable if, and
only if, the sets

Ai = { x : f ( x ) = ai } ∈ A.

Indeed, for any such f we have the representation

f(x) = ∑_{i=1}^∞ ai I_{Ai}(x),

and therefore {x : f(x) ∈ B} = ∪_{i : ai ∈ B} Ai ∈ A. The sum or product of measurable
functions is a measurable function itself. What is more, if fn is a sequence of measurable
functions then so are

sup_{n∈ℕ} fn,   inf_{n∈ℕ} fn,   lim sup_{n→∞} fn,   lim inf_{n→∞} fn,

all of these quantities being defined pointwise. Whenever it exists, lim_{n→∞} fn is also a
measurable function.
The integral of a measurable function with respect to a measure µ can then be
defined in three corresponding steps.

(i) For a non-negative simple function s taking on values ai on the sets Ai, define

∫ s dµ = ∑_i ai µ(Ai),   (A.4)

where we follow the convention that ai µ(Ai) = 0 when ai = 0 and µ(Ai) = ∞.

(ii) For a non-negative measurable function f given by (A.3) define

∫_Ω f dµ = lim_{n→∞} ∫_Ω sn dµ.   (A.5)

The limit exists because the monotonicity of sn implies the monotonicity of
∫ sn dµ. This definition is meaningful because it can be shown that if sn and s′n
are two sequences of simple functions tending to f pointwise then

lim_{n→∞} ∫_Ω sn dµ = lim_{n→∞} ∫_Ω s′n dµ.

Thus, the definition of the integral is independent of the sequence of simple func-
tions converging to f.

The definitions (A.4) and (A.5) do not preclude the possibility that ∫_Ω s dµ or
∫_Ω f dµ are infinite. A non-negative measurable function is said to be integrable
(with respect to µ) if ∫_Ω f dµ < ∞.
R

(iii) An arbitrary measurable function f is said to be integrable if its positive and neg-
ative parts are integrable, in which case its integral is defined as

∫_Ω f dµ = ∫_Ω f⁺ dµ − ∫_Ω f⁻ dµ.   (A.6)

Whenever we want to emphasize the variable of integration we write

∫_Ω f dµ = ∫_Ω f(x) dµ(x).

Let us consider two special cases of integrals with respect to measures µ. First, for
y ∈ R consider the Dirac measure, defined on the set of all subsets of R, given by

δy(A) = 1 if y ∈ A, and δy(A) = 0 otherwise,

for any A ⊂ R. (Is this measure sigma-additive on the set of all subsets of R?) For every
function f we have

∫ f dδy = f(y).

To see this let {Ai} denote a partition of R, that is, Ai ∩ Aj = ∅ for i ̸= j and ∪i Ai = R.
Any simple function s can then be represented as

s(x) = ∑_i ai I_{Ai}(x).

Now, since y is in exactly one Ai, say Ak, then

∫ s dδy = ak = s(y).

Therefore, by the definition of the integral

∫ f dδy = lim_{n→∞} ∫ sn dδy = lim_{n→∞} sn(y) = f(y),

as claimed.
Consider now the more complex measure

µ(A) = ∑_{n=0}^∞ an δn(A),

with an = e⁻¹/n!, again defined on all subsets of R. (Verify that µ(R) = 1.) What is
∫ f dµ? The same reasoning as in the previous example shows that

∫ f dµ = ∑_{n=0}^∞ an f(n).
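A quick numerical check of this identity in R (the an here are the Poisson(1) probabilities; the test function f and the truncation point are arbitrary choices of ours):

f <- function(x) x^2 + 1
n <- 0:50                      # truncation of the infinite sum
a <- exp(-1) / factorial(n)
sum(a)                         # approximately 1, confirming mu(R) = 1
sum(a * f(n))                  # the integral of f with respect to mu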

The defined integral has the properties one would expect of it. In particular, for any
real numbers c1, . . . , cm and any integrable functions f1, . . . , fm, ∑ ci fi is also integrable
and

∫_Ω ∑_{i=1}^m ci fi dµ = ∑_{i=1}^m ci ∫_Ω fi dµ.   (A.7)

Also, if f is measurable and g is integrable and if 0 ≤ f ≤ g then f is also integrable
and

∫_Ω f dµ ≤ ∫_Ω g dµ.   (A.8)

As a last extension of the concept of integral define


∫_A f dµ = ∫_Ω I_A f dµ.

Note that for measurable f and A ∈ A, I A ( x ) f ( x ) is also measurable (why?). Constant


functions are easily shown to be measurable, hence putting f = 1 yields
∫_A dµ = µ(A).
Obviously, properties (A.7) and (A.8) continue to hold when ∫_Ω is replaced by ∫_A.
Specializing to the Lebesgue measure µ we have the following important result
linking the Lebesgue and (classical) Riemann integrals for measurable functions in R.
The result is also valid for real-valued measurable functions with domain in Rᵖ but
the conditions become more technical.

Theorem A.1. Let f : [a, b] → R be a measurable and Riemann integrable function. Then, if
µ denotes the Lebesgue measure on the Borel sigma algebra of [a, b],

∫_{[a,b]} |f| dµ < ∞   and   ∫_{[a,b]} f dµ = ∫_a^b f(x) dx.

It is often useful to know under what conditions an integrable function f satisfies

        ∫_A f dµ = 0.

This will clearly be the case when either

        f = 0 on A

or

        µ(A) = 0.                                                       (A.9)

More generally, it will be the case whenever

        f = 0 almost everywhere on A,                                   (A.10)

that is, f is zero except on a subset of A having measure zero.

Conversely, if f is almost everywhere non-negative on A, then

        ∫_A f dµ = 0 =⇒ f = 0 almost everywhere on A,

and if f is almost everywhere (a.e.) positive on A, then

        ∫_A f dµ = 0 =⇒ µ(A) = 0.

Note that, as a special case of (A.10), if f and g are integrable functions differing only on a set of measure zero, that is, f = g a.e., then

        ∫_Ω f dµ = ∫_Ω g dµ.

A consequence is that functions can never be determined uniquely by their integrals, but only up to sets of measure zero.
For a non-negative integrable function f, let us now consider

        v(A) = ∫_A f dµ,                                                (A.11)

as a set function defined on A. Then v is non-negative, sigma-finite and sigma-additive, and hence a measure on (Ω, A).
If µ and v are two measures defined on the same measurable space (Ω, A), it is a question of central importance whether there exists a function f such that (A.11) holds for every A ∈ A. By (A.9), a necessary condition for such a representation is clearly that

        µ(A) = 0 =⇒ v(A) = 0.                                           (A.12)

When (A.12) holds, v is said to be absolutely continuous with respect to µ. It is a surprising and basic fact, known as the Radon-Nikodym theorem, that (A.12) is not only necessary but also sufficient for the existence of a function f satisfying (A.11) for all A ∈ A. The resulting function f is called the Radon-Nikodym derivative of v with respect to µ. This f is not unique, because it can be changed on a set of µ-measure zero without affecting the integral (A.11). However, it is unique a.e. µ in the sense that if g is any other integrable function satisfying (A.11), then f = g a.e. µ. A useful consequence of this result is that

        ∫_A f dµ = 0 for all A ∈ A                                      (A.13)

implies that f = 0 a.e. µ.

A.1.2 Applications to probability theory
For statistical developments, the most important application of measure theory is its specialization to probability theory. A measure P defined on a measurable space (Ω, A) satisfying

        P(Ω) = 1

is called a probability measure (or probability distribution). If P is absolutely continuous with respect to a measure µ with Radon-Nikodym derivative p, so that

        P(A) = ∫_A p dµ,

then p is called the probability density of P with respect to µ. Such densities are, of course, determined up to sets of µ-measure zero. Many distributions are either discrete, that is, absolutely continuous with respect to the counting measure, or absolutely continuous with respect to the Lebesgue measure. However, there are also distributions that are neither and instead have both a discrete and an absolutely continuous component.
A random variable X is a measurable function from (Ω, A, P) to (ℝ, B(ℝ)), with B(ℝ) denoting the Borel sigma-algebra of ℝ. The expectation of a random variable X, E{X}, is defined as

        E{X} = ∫ X dP.                                                  (A.14)

The function

        F_X(a) = P(X^{−1}(−∞, a]) = P(X ≤ a),   a ∈ ℝ,

is called the cumulative distribution function (or cdf for short) of the random variable X. It is easy to check that F_X is non-decreasing and right-continuous and therefore leads to a unique Lebesgue-Stieltjes measure on (ℝ, B(ℝ)). We have the following extremely useful result.

Theorem A.2. Let X denote a random variable with cumulative distribution F_X. Then

        ∫_Ω X dP = ∫_ℝ x dF_X(x).
Theorem A.2 shows that we only need to know the distribution of the random vari-
able X in order to compute its expectation and we do not need to know the underlying
(abstract) probability space (Ω, A, P).

If not just one but m real-valued aspects of an experiment are of interest, these are represented by a measurable vector-valued function (X_1, ..., X_m) defined on (Ω, A, P) with joint cdf

        F_X(a_1, ..., a_m) = P(X_1 ≤ a_1, ..., X_m ≤ a_m).

Likewise, this cdf defines a unique Lebesgue-Stieltjes measure on the Borel sigma algebra of ℝ^m.
Assume now that X is a random variable whose cdf F_X is absolutely continuous with respect to a measure µ, with Radon-Nikodym derivative f_X. What is E{X} in this case? From Theorem A.2 we know that E{X} = ∫_ℝ x dF_X(x), but much more can now be said.

Theorem A.3. Let X denote a random variable whose cumulative distribution F_X is absolutely continuous with respect to µ with density f_X. Then

        ∫_ℝ x dF_X(x) = ∫_ℝ x f_X(x) dµ(x).
For example, when µ is the Lebesgue measure, using Theorem A.1 we obtain

        E{X} = ∫_ℝ x f_X(x) dx,

which is the classical expectation formula for a random variable with density f_X. Now assume that F_X is absolutely continuous with respect to the counting measure. What is E{X}?
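
Both cases can be checked numerically. In the R sketch below (our own illustration, with hypothetical choices of distribution), the Lebesgue case becomes a numerical integral of x f_X(x) and the counting-measure case a sum of x f_X(x).

    ## Lebesgue measure: X ~ Exponential(rate = 2), density f_X(x) = 2 exp(-2x),
    ## so E{X} = integral of x f_X(x) dx = 1/2.
    integrate(function(x) x * dexp(x, rate = 2), lower = 0, upper = Inf)$value

    ## Counting measure on {0, 1, 2, ...}: X ~ Binomial(10, 0.3), so the density
    ## is the probability mass function and E{X} = sum of x f_X(x) = 3.
    x <- 0:10
    sum(x * dbinom(x, size = 10, prob = 0.3))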

A.2 Elements of Functional Analysis


A.2.1 The Banach L^p(Ω) spaces
Throughout this subsection we work with Borel measurable functions on a measure space (Ω, A, µ). We are interested in functions whose absolute p-th power is integrable. More precisely, we are interested in the function spaces

        L^p(Ω) := { f : Ω → ℝ : f measurable and ∫ |f|^p dµ < ∞ },   p ∈ [1, ∞).

Such functions possess many interesting properties and arise in various contexts. We define

        ∥f∥_p := ( ∫ |f|^p dµ )^{1/p},   p ∈ [1, ∞).

Clearly, f ∈ L^p(Ω) if f is measurable and ∥f∥_p < ∞. It is easy to show that for f = 0 we have ∥f∥_p = 0. However, the converse is only true up to sets of µ-measure zero.

That is, ∥f∥_p = 0 implies only that f = 0 almost everywhere. An easily verifiable property of ∥·∥_p is that for all α ∈ ℝ,

        ∥α f∥_p = |α| ∥f∥_p.                                            (A.15)
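
For concreteness, the ∥·∥_p norm of a function on [0, 1] (with µ the Lebesgue measure) can be approximated numerically. Below is a small R sketch of ours; the function f(x) = x − 0.5 is a hypothetical example.

    ## Numerical L^p norm on [0, 1] with respect to the Lebesgue measure.
    lp_norm <- function(f, p) integrate(function(x) abs(f(x))^p, 0, 1)$value^(1 / p)
    f <- function(x) x - 0.5                 # a hypothetical example function
    lp_norm(f, 1); lp_norm(f, 2)             # its L^1 and L^2 norms
    alpha <- 3                               # absolute homogeneity (A.15):
    all.equal(lp_norm(function(x) alpha * f(x), 2), abs(alpha) * lp_norm(f, 2))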
Two important inequalities on L^p(Ω) spaces were discovered by Hölder and Minkowski; they bound the norm of a product of functions by the product of their norms, and the norm of a sum of functions by the sum of their norms. Specifically:
Theorem A.4 (Hölder's inequality). Assume that f ∈ L^p(Ω) and g ∈ L^q(Ω), where p, q ∈ (1, ∞) satisfy 1/p + 1/q = 1. Then f g ∈ L^1(Ω) and

        ∫_Ω f g dµ ≤ ∫_Ω |f g| dµ ≤ ∥f∥_p ∥g∥_q.

Theorem A.5 (Minkowski's inequality). Assume that f, g ∈ L^p(Ω), p ∈ [1, ∞). Then f + g ∈ L^p(Ω) and

        ∥f + g∥_p ≤ ∥f∥_p + ∥g∥_p.
Theorem A.4 with p = q = 2 is referred to as the Cauchy-Schwarz inequality, or simply the Schwarz inequality. Theorem A.5 combined with (A.15) shows that L^p(Ω) is a vector space and ∥·∥_p is a semi-norm, that is, a real-valued function on L^p(Ω) satisfying the subadditivity and absolute homogeneity properties, while the positive definiteness property only holds up to sets of µ-measure zero.
Theorems A.4 and A.5 are valid for general measure spaces. When specialized to the probability space (Ω, A, P), we have the useful inequalities

        E{XY} ≤ [E{|X|^p}]^{1/p} [E{|Y|^q}]^{1/q},

for X ∈ L^p(Ω), Y ∈ L^q(Ω), p, q ∈ (1, ∞) with 1/p + 1/q = 1, and

        [E{|X + Y|^p}]^{1/p} ≤ [E{|X|^p}]^{1/p} + [E{|Y|^p}]^{1/p},

for X, Y ∈ L^p(Ω) and p ∈ [1, ∞).
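
As a small numerical illustration (our own sketch, not part of the text), the R code below checks the first inequality in its Cauchy-Schwarz form p = q = 2, with the expectations replaced by sample means over a large simulated sample; the normal distributions for X and Y are hypothetical choices.

    ## Monte Carlo check of E{XY} <= [E{X^2}]^{1/2} [E{Y^2}]^{1/2}.
    set.seed(1)
    n <- 1e5
    X <- rnorm(n)                      # hypothetical choices of X and Y
    Y <- rnorm(n, mean = 1)
    lhs <- mean(X * Y)
    rhs <- sqrt(mean(X^2)) * sqrt(mean(Y^2))
    c(lhs = lhs, rhs = rhs, holds = lhs <= rhs)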
Consider now a Cauchy sequence in L^p(Ω), that is, a sequence f_n ∈ L^p(Ω) such that for every ϵ > 0 there exists a large enough N for which

        sup_{n,m ≥ N} ∥f_n − f_m∥_p ≤ ϵ.

In Euclidean spaces sequences are convergent if and only if they are Cauchy. However, in infinite-dimensional function spaces this equivalence does not always hold. Thus, it is of interest to know in which spaces Cauchy sequences converge. Spaces with this property are called complete. For L^p(Ω) spaces we have the remarkable result below.
Theorem A.6 (Riesz-Fischer). The spaces L^p(Ω), p ∈ [1, ∞), are complete, i.e., every Cauchy sequence f_n ∈ L^p(Ω) converges to some limit f ∈ L^p(Ω).
Complete (semi-)normed vector spaces are referred to as Banach spaces.

A.2.2 The Hilbert space L^2([0, 1])
Of particular importance for functional data analysis is the space L^2(Ω, A, µ) with Ω a closed, bounded and connected domain, which we shall without loss of generality take to be the interval [0, 1], A the Borel sigma algebra of Ω, and µ the Lebesgue measure on [0, 1]. We use the simplified notation L^2([0, 1]) to denote this space.
What makes the space L^2([0, 1]) so attractive for theory and applications is our ability to define an inner product on it, with the result that its geometry resembles that of finite-dimensional Euclidean spaces. Specifically, an inner product on L^2([0, 1]) is a function ⟨·, ·⟩ : L^2([0, 1]) × L^2([0, 1]) → ℝ satisfying

(i) ⟨f, g⟩ = ⟨g, f⟩ for all f, g ∈ L^2([0, 1]).

(ii) ⟨α f + β g, h⟩ = α ⟨f, h⟩ + β ⟨g, h⟩ for all f, g, h ∈ L^2([0, 1]) and α, β ∈ ℝ.

(iii) ⟨f, f⟩ > 0 for every f that is not equal to zero almost everywhere.

The classical inner product on L^2([0, 1]) is given by

        ⟨f, g⟩ = ∫_{[0,1]} f(t) g(t) dµ(t).                             (A.16)

The semi-norm and inner product on L^2([0, 1]) are related through ⟨f, f⟩ = ∥f∥_2^2. We call f and g orthogonal if ⟨f, g⟩ = 0.
Basis expansions are an important tool for functional data analysis and they may be conveniently defined for functions in L^2([0, 1]). We say that a set of functions {e_1(t), e_2(t), ...} is a basis of L^2([0, 1]) if every function f ∈ L^2([0, 1]) admits a unique expansion

        f(t) = ∑_{j=1}^{∞} f_j e_j(t),                                  (A.17)

where the coefficients {f_j} depend on f and the convergence is in the norm of L^2([0, 1]) (so not necessarily pointwise). We say that {e_1(t), e_2(t), ...} is an orthonormal basis if, in addition, ⟨e_i, e_j⟩ = 0 whenever i ≠ j and ∥e_i∥_2 = 1 for all i.
What does convergence of the series in (A.17) with orthonormal e_j imply for the coefficients f_j? We have the following fundamental result due to Parseval.

Theorem A.7 (Parseval). Let {e_j} be an orthonormal sequence in L^2([0, 1]). Then

(a) The series in (A.17) converges in the norm ∥·∥_2 if and only if the series ∑_{j=1}^{∞} |f_j|^2 converges.

(b) If (A.17) converges, then the coefficients f_j are given by ⟨f, e_j⟩ with f = ∑_j f_j e_j; hence in this case we have

        f(t) = ∑_{j=1}^{∞} ⟨f, e_j⟩ e_j(t).

(c) For any f ∈ L^2([0, 1]) the series in (A.17) with f_j = ⟨f, e_j⟩ converges in the norm of L^2([0, 1]).

A notable consequence of the theorem is that if (A.17) converges, then

        ∥f∥_2^2 = ∑_{j=1}^{∞} |f_j|^2.                                  (A.18)

Trigonometric (Fourier) functions form the best-known orthonormal basis of L^2([0, 1]). These functions are given by

        e_1(x) = 1, e_2(x) = 2^{1/2} sin(2πx), e_3(x) = 2^{1/2} cos(2πx), e_4(x) = 2^{1/2} sin(4πx), ...

It is easy to show that the Fourier functions are orthonormal, but it is less obvious that every function in L^2([0, 1]) can be expanded in such a series. The truncated polynomials and B-spline functions that are often used in functional data analysis are not orthogonal, but they are still linearly independent. This means that they can be made orthonormal using the Gram-Schmidt process.
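
The orthonormality of the Fourier functions and the identity (A.18) are easy to inspect numerically. The R sketch below (our own illustration; f(x) = x(1 − x) is a hypothetical example) approximates the inner products by numerical integration.

    ## Orthonormality check: <e_i, e_j> should be 1 when i = j and 0 otherwise.
    e <- list(function(x) rep(1, length(x)),
              function(x) sqrt(2) * sin(2 * pi * x),
              function(x) sqrt(2) * cos(2 * pi * x),
              function(x) sqrt(2) * sin(4 * pi * x))
    ip <- function(f, g) integrate(function(x) f(x) * g(x), 0, 1)$value
    round(outer(1:4, 1:4, Vectorize(function(i, j) ip(e[[i]], e[[j]]))), 6)

    ## Truncated Parseval check for f(x) = x(1 - x): the sum of squared
    ## coefficients approaches ||f||_2^2 as more basis functions are added.
    f  <- function(x) x * (1 - x)
    fj <- sapply(e, function(ej) ip(f, ej))      # coefficients <f, e_j>
    c(sum_fj_sq = sum(fj^2), norm_sq = ip(f, f))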

A.2.3 Linear operators on L^2([0, 1])
In order to describe key techniques discussed in chapter 6, we need to introduce some concepts related to linear operators on L^2([0, 1]). A mapping M : L^2([0, 1]) → L^2([0, 1]) is called a linear operator (or a linear transformation) if for any scalars a, b ∈ ℝ we have

        M(a f + b g) = a M(f) + b M(g),   f, g ∈ L^2([0, 1]).

Since every vector space contains the zero element, this also implies M(a f) = a M(f) for all a ∈ ℝ and f ∈ L^2([0, 1]), and M(0) = 0. We will call a linear operator simply an operator. An operator M is bounded if there is a constant K such that for any f ∈ L^2([0, 1]),

        ∥M(f)∥ ≤ K ∥f∥.

The norm of an operator M is defined as

        ∥M∥ = sup_{f ∈ L^2([0,1]), f ≠ 0} ∥M(f)∥ / ∥f∥.

174
We say that a bounded operator M is self-adjoint if and only if, for all f, g ∈ L^2([0, 1]),

        ⟨M(f), g⟩ = ⟨f, M(g)⟩.

Think of this as a symmetry condition. In finite-dimensional spaces operators can be represented by matrices; consequently, if M is a self-adjoint operator on ℝ^p with matrix representative M, we have

        x^⊤ M^⊤ y = ⟨M(x), y⟩ = ⟨x, M(y)⟩ = x^⊤ M y,

for all x, y ∈ ℝ^p. This implies that M is indeed a symmetric matrix. In another parallel with linear algebra, an operator M is called positive semi-definite if and only if

        ⟨M(f), f⟩ ≥ 0,   f ∈ L^2([0, 1]).

It is called positive definite if equality can hold only for f = 0, i.e., the null space of the operator contains only the zero element.
We say that a real number λ is an eigenvalue of an operator M if and only if there exists a non-zero f ∈ L^2([0, 1]) such that

        M(f) = λ f.                                                     (A.19)

This f ∈ L^2([0, 1]) is referred to as the eigenfunction corresponding to the eigenvalue λ. Since M is linear we can always assume that ∥f∥ = 1; otherwise we can just divide both sides of (A.19) by ∥f∥. Observe that the eigenvalues of positive semi-definite operators are then necessarily non-negative, since

        λ = ⟨f, M(f)⟩ ≥ 0.

Eigenfunctions corresponding to distinct non-zero eigenvalues of positive semi-definite, self-adjoint operators are orthogonal. Indeed, let (λ_1, f_1) and (λ_2, f_2) be two eigenvalue-eigenfunction pairs with λ_1 ≠ λ_2. Then

        ⟨f_1, f_2⟩ = (λ_1/λ_1) ⟨f_1, f_2⟩ = (1/λ_1) ⟨M(f_1), f_2⟩ = (1/λ_1) ⟨f_1, M(f_2)⟩ = (λ_2/λ_1) ⟨f_1, f_2⟩,

with λ_2/λ_1 ≠ 1. This is only possible if ⟨f_1, f_2⟩ = 0. An operator can have an infinite number of eigenvalues and, what is more, this number may even be uncountably infinite. However, the operators that we most often encounter in functional data analysis have at most a countably infinite number of eigenvalues, with zero as their only accumulation point. This is a consequence of the fact that these operators are compact.
We close this section with a result that is useful in the study of Support Vector Machines and Functional Data Analysis. Let K be a continuous kernel on [0, 1]^2 that is symmetric and positive semi-definite, and define the integral operator K : L^2([0, 1]) → L^2([0, 1]) given by

        K(f)(s) = ∫_{[0,1]} K(s, t) f(t) dt.

It is easy to check that K is self-adjoint and positive semi-definite. Thus, its eigenvalues λ_j are necessarily ≥ 0. Mercer's theorem provides a remarkable characterization of K in terms of the eigenvalues and eigenfunctions of K.

Theorem A.8. If (λ_j, e_j) are the eigenvalue and eigenfunction pairs of K, then K has the representation

        K(s, t) = ∑_{j=1}^{∞} λ_j e_j(t) e_j(s),   (s, t) ∈ [0, 1]^2,

with the sum converging absolutely and uniformly.

Theorem A.8 generalizes the well-known spectral decomposition of a symmetric positive semi-definite matrix. The absolute and uniform convergence of the series has important practical implications; see chapter 6.
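
The eigenpairs of K, and with them the Mercer representation, can be approximated by discretizing the kernel on a grid and taking an eigendecomposition. The R sketch below (our own illustration, for the hypothetical choice K(s, t) = min(s, t), the covariance kernel of Brownian motion) shows the idea.

    ## Grid approximation of the eigenpairs of the integral operator K:
    ## eigen(Kmat/m) approximates the lambda_j, and the eigenvectors, rescaled
    ## by sqrt(m), approximate eigenfunctions orthonormal in L^2([0, 1]).
    m  <- 200
    s  <- (1:m - 0.5) / m                 # midpoint grid on [0, 1]
    Kmat <- outer(s, s, pmin)             # kernel matrix K(s_i, s_j)
    eg <- eigen(Kmat / m, symmetric = TRUE)
    lambda <- eg$values
    efun   <- eg$vectors * sqrt(m)
    ## Truncated Mercer reconstruction with J terms:
    J <- 20
    K_J <- efun[, 1:J] %*% diag(lambda[1:J]) %*% t(efun[, 1:J])
    max(abs(Kmat - K_J))                  # small, and shrinks as J grows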

A.3 Spline functions


A spline function is defined as a piecewise polynomial that is smoothly connected at its joints (knots), where the degree of smoothness is pre-specified. Thus, although regular polynomials are infinitely differentiable, spline functions are only required to have a finite number of continuous derivatives. This modest requirement results in great flexibility and avoids the problems, known as Runge's phenomenon (de Boor, 2001), associated with smoothing or interpolating functions with polynomials.
To describe spline functions in detail one needs two ingredients: the knots and the desired degree of smoothness, as measured by the number of continuous derivatives the spline should possess. For a given closed interval, say the [0, 1]-interval without loss of generality, define an increasing sequence of knots 0 = t_0 < t_1 < t_2 < ... < t_K < t_{K+1} = 1. Then f is a spline function of order p with knots at the t_i if and only if

1. f is a piecewise polynomial of order p (degree p − 1) on each interval [t_i, t_{i+1}) for i = 0, ..., K.

2. f has p − 2 continuous derivatives.

3. f has a (p − 1)st derivative that is a step function with jumps at the knots t_1, ..., t_K.

The space of spline functions of order p is denoted by S_K^p. The dimension of S_K^p can be heuristically derived as follows. The space of all p-th order polynomials on the intervals [t_i, t_{i+1}) has dimension (K + 1)p, but our smoothness requirements add (p − 1) constraints at each knot¹, hence K(p − 1) constraints in total. Therefore the dimension of S_K^p is

        (K + 1)p − K(p − 1) = K + p.
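
This dimension count can be verified directly in R (a sketch of ours, using the bs() function from the base splines package): with K interior knots and splines of order p (degree p − 1), the B-spline basis matrix with the intercept included has exactly K + p columns.

    ## Check dim(S_K^p) = K + p for K = 10 interior knots and order p = 4 (cubic).
    library(splines)
    K <- 10; p <- 4
    knots <- seq(0, 1, length.out = K + 2)[-c(1, K + 2)]   # K interior knots
    x <- seq(0, 1, length.out = 101)
    B <- bs(x, knots = knots, degree = p - 1, intercept = TRUE)
    ncol(B)                                                # K + p = 14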
A basis for S_K^p may be derived by means of the truncated polynomials

        1, x, ..., x^{p−1}, (x − t_1)_+^{p−1}, ..., (x − t_K)_+^{p−1},          (A.20)

where x_+^{p−1} = (max{0, x})^{p−1}. To see that these functions are indeed a basis, one needs to check first that every spline in S_K^p is in their span and secondly that they are linearly independent. The former is trivial, while for the latter notice that if

        f(x) = ∑_{i=0}^{p−1} α_i x^i + ∑_{j=1}^{K} β_j (x − t_j)_+^{p−1} = 0,

for all x ∈ [0, 1], then taking any x ∈ (0, t_1) implies

        ∑_{i=0}^{p−1} α_i x^i = 0,

and since by the fundamental theorem of algebra a polynomial of degree (p − 1) can have at most p − 1 zeroes unless it is the zero polynomial, we conclude that α_0 = α_1 = ... = α_{p−1} = 0 and

        f(x) = ∑_{j=1}^{K} β_j (x − t_j)_+^{p−1} = 0.

Inductively repeating the procedure by examining the intervals (t_1, t_2), ..., (t_K, t_{K+1}) reveals that for f(x) = 0 we must also have β_1 = ... = β_K = 0, from which it follows that the truncated polynomials are linearly independent for any choice of distinct knots.
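
The truncated power basis (A.20) is easy to assemble in R. The sketch below (our own code; the knot placement and order are hypothetical choices) confirms that the design matrix has full column rank, i.e., the basis functions are linearly independent, and also exposes the poor conditioning discussed next.

    ## Build the truncated power basis 1, x, ..., x^{p-1}, (x - t_j)_+^{p-1}.
    trunc_basis <- function(x, knots, p) {
      poly <- outer(x, 0:(p - 1), `^`)                       # 1, x, ..., x^(p-1)
      tp   <- outer(x, knots, function(x, t) pmax(0, x - t)^(p - 1))
      cbind(poly, tp)                                        # n x (K + p) matrix
    }
    x <- seq(0, 1, length.out = 200)
    X <- trunc_basis(x, knots = seq(0.1, 0.9, by = 0.1), p = 4)
    qr(X)$rank == ncol(X)       # TRUE: the basis functions are linearly independent
    kappa(crossprod(X))         # very large condition number: numerically unstable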
Truncated polynomial bases are very useful for understanding the mechanics of spline functions, but they are far from orthogonal in L^2([0, 1]). This is an important practical disadvantage, as it can lead to badly conditioned design matrices and numerical instability, particularly when one uses a large number of knots. Therefore, in practice it is advisable to work with equivalent bases, that is, bases spanning the same subspace but with more stable numerical properties. The most common
¹ Continuity plus continuity of (p − 2) derivatives.

choice in theoretical work and practical implementations nowadays is the B-spline basis, which we now briefly describe.
Much like truncated polynomials, B-splines require a choice of knots and the specification of the order of the polynomial in order to be defined, although their construction is more involved. As an equivalent basis, one can derive the B-splines as linear combinations of truncated polynomial functions, but it is often much more convenient to define them through recursive formulae.
To explain this procedure we first need to augment the set of knots t_1 < t_2 < ... < t_K by defining 2p additional boundary knots

        t_{−(p−1)} = ... = t_0 = 0   and   t_{K+1} = ... = t_{K+p} = 1.

The B-splines of order p with knots at t_1 < t_2 < ... < t_K can then be recursively defined by

        B_{i,p}(x) = ((x − t_i)/(t_{i+p−1} − t_i)) B_{i,p−1}(x) + ((t_{i+p} − x)/(t_{i+p} − t_{i+1})) B_{i+1,p−1}(x)          (A.21)

for i = −(p − 1), ..., K, where the recursion starts from the step functions

        B_{i,1}(x) = 1 if x ∈ [t_i, t_{i+1}), and B_{i,1}(x) = 0 otherwise,

for i = −(p − 1), ..., K + p. In using this definition we must employ the convention that a B-spline of order p corresponding to p + 1 coincident knots is the zero function. To illustrate the use of (A.21), consider the case of linear (p = 2) B-splines. In this instance 2p = 4 additional knots are defined by t_{−1} = t_0 = 0 and t_{K+1} = t_{K+2} = 1. Since B_{−1,1} has coincident knots at t_{−1} and t_0 we have B_{−1,1} = 0 and, similarly, B_{K+1,1} = 0. Thus, (A.21) gives
        B_{−1,2}(x) = ((t_1 − x)/(t_1 − t_0)) B_{0,1}(x),

        B_{0,2}(x) = ((x − t_0)/(t_1 − t_0)) B_{0,1}(x) + ((t_2 − x)/(t_2 − t_1)) B_{1,1}(x),

        ...

        B_{K,2}(x) = ((x − t_K)/(t_{K+1} − t_K)) B_{K,1}(x).
B-splines of order 2 are known as "tent" functions (plot a few of them and find out why).
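
The recursion (A.21) translates directly into code. Below is a minimal R implementation of ours, using the convention that terms with coincident knots (zero denominators) vanish, cross-checked against splineDesign() from the base splines package; the knot configuration is a hypothetical choice.

    ## Recursive evaluation of B_{i,p}; t_all is the augmented knot vector,
    ## re-indexed 1, ..., K + 2p, and i indexes into t_all.
    bspline <- function(x, i, p, t) {
      if (p == 1) return(as.numeric(x >= t[i] & x < t[i + 1]))
      w1 <- if (t[i + p - 1] > t[i]) (x - t[i]) / (t[i + p - 1] - t[i]) else 0
      w2 <- if (t[i + p] > t[i + 1]) (t[i + p] - x) / (t[i + p] - t[i + 1]) else 0
      w1 * bspline(x, i, p - 1, t) + w2 * bspline(x, i + 1, p - 1, t)
    }
    K <- 4; p <- 2                                     # linear B-splines
    t_all <- c(rep(0, p), seq(0, 1, length.out = K + 2)[-c(1, K + 2)], rep(1, p))
    x <- seq(0, 0.999, length.out = 400)               # stay left of the endpoint
    B <- sapply(1:(K + p), function(i) bspline(x, i, p, t_all))
    matplot(x, B, type = "l")                          # the "tent" functions
    range(rowSums(B))                                  # property c): sums to one

    ## Cross-check against the base splines package:
    library(splines)
    max(abs(B - splineDesign(t_all, x, ord = p)))      # essentially zero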
Higher-order B-splines may be defined by continuing the recursion. We mention three key properties of B-splines of general order p, namely

a) B_{i,p}(x) = 0 for x ∉ [t_i, t_{i+p}),

b) B_{i,p}(x) ≥ 0,

c) ∑_{i=−(p−1)}^{K} B_{i,p}(x) = 1 for all x ∈ [0, 1].

Note that property a) implies that every B-spline can be positive on at most p consecutive intervals. This should be contrasted with the truncated polynomials, which can be nonzero throughout the [0, 1]-interval. The local support property means that B-splines B_{i,p}(x) and B_{j,p}(x) with |i − j| > p are orthogonal. Thus, while this basis is still not completely orthogonal, it is very nearly so; in fact it can be shown that the B-spline basis has the smallest compact support among all bases spanning the spline subspace S_K^p. Figure A.1 illustrates the B-spline basis for 10 interior knots.

Figure A.1 Quadratic and cubic B-splines based on 10 equidistant knots in [0.05, 0.95].

Step functions, that is, functions in the span of B-splines of order 1, can be shown to be dense in L^2([0, 1]). In general, however, spline functions are exceptionally powerful when the function to be approximated exhibits a higher degree of smoothness, i.e., when it has a few continuous derivatives. In that case, it can be shown that splines can approximate the target function uniformly on its domain with an approximation error decaying like O(K^{−p}), where p is the number of continuous derivatives the target function possesses.
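
This rate can be observed empirically. The R sketch below (our own illustration; the target sin(2πx) and the grid-based least-squares projection are hypothetical choices) projects a smooth function onto cubic spline spaces with increasing K and reports the maximum approximation error, which decays rapidly as K grows.

    ## Empirical look at the spline approximation error for a smooth target:
    ## project f(x) = sin(2*pi*x) onto S_K^p (cubic, p = 4) by least squares
    ## on a fine grid and record the maximum absolute error as K increases.
    library(splines)
    f <- function(x) sin(2 * pi * x)
    x <- seq(0, 1, length.out = 2000)
    err <- sapply(c(4, 8, 16, 32), function(K) {
      knots <- seq(0, 1, length.out = K + 2)[-c(1, K + 2)]
      B <- bs(x, knots = knots, degree = 3, intercept = TRUE)
      max(abs(f(x) - B %*% qr.coef(qr(B), f(x))))
    })
    err                      # decreases quickly as K doubles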

Further reading
The material for this appendix has been collected from various sources. Our discus-
sion of measure and probability theory is drawn mainly from Lehmann and Casella
(2006) and Schilling (2017). Our material for functional analysis has been collected

from Kreyszig (1978) and Schilling (2017). A more thorough treatment of spline func-
tions may be found in de Boor (2001) and DeVore and Lorentz (1993).

Bibliography

Barut, E., Fan, J., and Verhasselt, A. (2016). Conditional sure independence screening. Journal of the
American Statistical Association, 111(515):1266–1277, DOI: 10.1080/01621459.2015.1092974, https:
//doi.org/10.1080/01621459.2015.1092974. PMID: 28360436.

Bühlmann, P. and van de Geer, S. (2011). Statistics for high-dimensional data. Springer Series in Statistics.
Springer, Heidelberg, ISBN: 978-3-642-20191-2, DOI: 10.1007/978-3-642-20192-9, http://dx.
doi.org/10.1007/978-3-642-20192-9. Methods, theory and applications.

Candes, E. and Tao, T. (2007). The Dantzig selector: Statistical estimation when p is much larger than n. Ann. Statist., 35(6):2313–2351, DOI: 10.1214/009053606000001523, https://doi.org/10.1214/009053606000001523.

Chernick, M. R. (1999). Bootstrap Methods: A Practitioner's Guide. Wiley.

Christmann, A. and Steinwart, I. (2008). Support Vector Machines. Springer, New York.

Cui, H., Li, R., and Zhong, W. (2015). Model-free feature screening for ultrahigh dimensional
discriminant analysis. Journal of the American Statistical Association, 110(510):630–641, DOI:
10.1080/01621459.2014.920256, https://doi.org/10.1080/01621459.2014.920256.

Davison, A. and Hinkley, D. (1997). Bootstrap Methods and their Applications. Cambridge University
Press, Cambridge.

de Boor, C. (2001). A Practical Guide to Splines. Springer Verlag, New York.

DeVore, R. and Lorentz, G. (1993). Constructive Approximation. Springer Verlag, New York.

Donoho, D. and Huber, P. (1983). The notion of breakdown point. In Bickel, P., Doksum, K., and Hodges,
J., editors, A Festschrift for Erich Lehmann, pages 157–184, Belmont. Wadsworth.

Efron, B. (1979). Bootstrap methods: Another look at the jackknife. Annals of Statistics, 7:1–26.

Efron, B. and Tibshirani, R. (1993). An Introduction to the Bootstrap. Chapman and Hall.

Fan, J., Feng, Y., and Song, R. (2011). Nonparametric independence screening in sparse ultra-high-
dimensional additive models. Journal of the American Statistical Association, 106(494):544–557, DOI:
10.1198/jasa.2011.tm09779, https://doi.org/10.1198/jasa.2011.tm09779.

Fan, J. and Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties.
Journal of the American Statistical Association, 96:1348–1360.

Fan, J. and Lv, J. (2008). Sure independence screening for ultrahigh dimensional feature
space. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70(5):849–
911, DOI: 10.1111/j.1467-9868.2008.00674.x, https://rss.onlinelibrary.wiley.com/doi/
abs/10.1111/j.1467-9868.2008.00674.x.

Fan, J., Samworth, R., and Wu, Y. (2009). Ultrahigh dimensional feature selection: Beyond the linear
model. J. Mach. Learn. Res., 10:2013–2038, ISSN: 1532-4435.

Ferraty, F. and Vieu, P. (2006). Nonparametric Functional Data Analysis: Theory and Practice (Springer Series
in Statistics). Springer-Verlag, Berlin, Heidelberg, ISBN: 0387303693.

Hampel, F. (1971). A general qualitative definition of robustness. The Annals of Mathematical Statistics,
42:1887–1896.

Hampel, F. (1974). The influence curve and its role in robust estimation. Journal of the American Statistical
Association, 69:383–393.

Hampel, F., Ronchetti, E., Rousseeuw, P., and Stahel, W. (1986). Robust Statistics: The Approach Based on
Influence Functions. Wiley, New York.

Hastie, T., Tibshirani, R., and Friedman, J. (2001). The Elements of Statistical Learning. Springer, New
York.

Hastie, T., Tibshirani, R., and Wainwright, M. (2015). Statistical Learning with Sparsity: The Lasso and
Generalizations. Chapman & Hall/CRC, ISBN: 1498712169.

Hjort, N. and Pollard, D. (1993). Asymptotics for minimizers of convex processes. Technical report, Yale
University.

Hodges, J. and Lehmann, E. (1963). Estimates of location based on rank tests. Annals of Mathematical
Statistics, 34:598–611.

Hoerl, A. and Kennard, R. (1970a). Ridge regression: Applications to nonorthogonal problems. Techno-
metrics, 12:69–82.

Hoerl, A. and Kennard, R. (1970b). Ridge regression: Biased estimation for nonorthogonal problems.
Technometrics, 12:55–67.

Horváth, L. and Kokoszka, P. (2012). Inference for functional data with applications, volume 200. Springer
Science & Business Media.

Hsing, T. and Eubank, R. (2015). Theoretical Foundations of Functional Data Analysis, with an Introduction
to Linear Operators. Wiley Series in Probability and Statistics. Wiley, ISBN: 9780470016916, https:
//books.google.be/books?id=YjsbAAAACAAJ.

Huber, P. (1964). Robust estimation of a location parameter. The Annals of Mathematical Statistics, 35:73–
101.

Huber, P. and Ronchetti, E. (2009). Robust Statistics. Wiley Series in Probability and Statistics. Wiley,
ISBN: 9781118210338.

James, G., Witten, D., Hastie, T., and Tibshirani, R. (2014). An Introduction to Statistical Learning: With
Applications in R. Springer Publishing Company, Incorporated, ISBN: 1461471370.

Kalogridis, I. and Van Aelst, S. (2021). Robust optimal estimation of location from discretely sampled
functional data. Scandinavian Journal of Statistics.

Koenker, R. (2005). Quantile Regression. Econometric Society Monographs. Cambridge University Press,
DOI: 10.1017/CBO9780511754098.

Kokoszka, P. and Reimherr, M. (2017). Introduction to Functional Data Analysis. Chapman & Hall
/ CRC numerical analysis and scientific computing. CRC Press, ISBN: 9781498746342, https:
//books.google.be/books?id=HIxIvgAACAAJ.

Kreyszig, E. (1978). Introductory Functional Analysis with Applications. Wiley classics library. Wiley, ISBN:
9780471507314, https://books.google.be/books?id=Va8rAAAAYAAJ.

Lehmann, E. L. and Casella, G. (2006). Theory of point estimation. Springer Science & Business Media.

Li, R., Zhong, W., and Zhu, L. (2012). Feature screening via distance correlation learning. Journal of the
American Statistical Association, 107(499):1129–1139, DOI: 10.1080/01621459.2012.695654, https:
//doi.org/10.1080/01621459.2012.695654. PMID: 25249709.

Maronna, R. A., Martin, D. R., and Yohai, V. J. (2006). Robust Statistics: Theory and Methods. Wiley, New
York.

Pan, W., Wang, X., Xiao, W., and Zhu, H. (2019). A generic sure independence screening procedure. Jour-
nal of the American Statistical Association, 114(526):928–937, DOI: 10.1080/01621459.2018.1462709,
https://doi.org/10.1080/01621459.2018.1462709.

R Core Team (2013). R: A Language and Environment for Statistical Computing. R Foundation for Statistical
Computing, Vienna, Austria, http://www.R-project.org/.

Quenouille, M. H. (1956). Notes on bias in estimation. Biometrika, 43(3/4):353–360, ISSN: 00063444, http://www.jstor.org/stable/2332914.

Ramsay, J. and Silverman, B. (2006). Functional Data Analysis. Springer Series in Statistics. Springer New
York, ISBN: 9780387227511, https://books.google.be/books?id=REzuyz_V6OQC.

Rousseeuw, P. (1984). Least median of squares regression. Journal of the American Statistical Association,
79:871–880.

Rousseeuw, P. and Croux, C. (1993). Alternatives to the median absolute deviation. Journal of the Amer-
ican Statistical Association, 88:1273–1283.

Rousseeuw, P., Croux, C., Todorov, V., Ruckstuhl, A., Salibian-Barrera, M., Verbeke, T., Koller, M., and Maechler, M. (2003). robustbase: Basic Robust Statistics. R package version 0.9-10, http://CRAN.R-project.org/package=robustbase.

Ruppert, D., Wand, M., and Carroll, R. (2003). Semiparametric Regression. Cambridge Series in Statistical
and Probabilistic Mathematics. Cambridge University Press, ISBN: 9780521785167, https://books.
google.be/books?id=Y4uEvXFP2voC.

Schilling, R. L. (2017). Measures, integrals and martingales. Cambridge University Press.

Schölkopf, B. and Smola, A. J. (2002). Learning with Kernels: Support Vector Machines, Regularization,
Optimization, and Beyond. MIT Press, Cambridge, MA, USA, ISBN: 0262194759.

Shao, J. and Tu, D. (1995). The Jackknife and the Bootstrap. Springer.

Sinova, B., Gonzalez-Rodriguez, G., and Van Aelst, S. (2018). M-estimators of location for functional
data. Bernoulli, 24(3):2328–2357.

Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical
Society. Series B, 58:267–288.

Todorov, V. and Filzmoser, P. (2009). An object-oriented framework for robust multivariate analysis.
Journal of Statistical Software, 32(3):1–47, http://www.jstatsoft.org/v32/i03/.

Tukey, J. (1958). Bias and confidence in not quite large samples (abstract). Ann. Math. Statist., 29(2):614,
DOI: 10.1214/aoms/1177706647, https://doi.org/10.1214/aoms/1177706647.

van de Geer, S. (2000). Empirical Processes in M-Estimation. Cambridge Series in Statistical and Proba-
bilistic Mathematics. Cambridge University Press, ISBN: 9780521650021.

van der Vaart, A. W. (1998). Asymptotic Statistics. Cambridge Series in Statistical and Probabilistic
Mathematics. Cambridge University Press, DOI: 10.1017/CBO9780511802256.

Venables, W. N. and Ripley, B. D. (2002). Modern Applied Statistics with S. Springer, New York, fourth
edition, http://www.stats.ox.ac.uk/pub/MASS4. ISBN 0-387-95457-0.

Wood, S. (2017). Generalized Additive Models: An Introduction with R, Second Edition. Chapman &
Hall/CRC Texts in Statistical Science. CRC Press, ISBN: 9781498728379, https://books.google.
be/books?id=JTkkDwAAQBAJ.

Yohai, V. (1987). High breakdown point and high efficiency robust estimates for regression. The Annals
of Statistics, 15:642–656.

Zhu, L.-P., Li, L., Li, R., and Zhu, L.-X. (2011). Model-free feature screening for ultrahigh-
dimensional data. Journal of the American Statistical Association, 106(496):1464–1475, DOI:
10.1198/jasa.2011.tm10563, https://doi.org/10.1198/jasa.2011.tm10563.

Zou, H. and Hastie, T. (2005). Regularization and variable selection via the elastic
net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2):301–
320, DOI: 10.1111/j.1467-9868.2005.00503.x, https://rss.onlinelibrary.wiley.com/doi/
abs/10.1111/j.1467-9868.2005.00503.x.

Prof. dr. Stefan Van Aelst
KU Leuven, Department of Mathematics and
Leuven Statistics Research Centre (LStat)
Celestijnenlaan 200B, 3001 Leuven
Phone. +32 16 37 23 83
[email protected]
