
Nonparametric Methods: Harmless Econometrics of the Unknown
Preliminary and Incomplete

Oliver Linton1

University of Cambridge

January 17, 2013

1 Corresponding author: Department of Economics, University of Cambridge, Austin Robinson Building, Sidgwick Avenue, Cambridge CB3 9DD, United Kingdom. E-mail: [email protected]. Web page: https://sites.google.com/site/oliverlinton/oliver-linton.
Contents

1 Introduction

2 Nonparametric Estimation
2.1 Introduction
2.2 C.D.F. and Density Estimation
2.3 Estimation of Regression Functions
2.3.1 Alternative Estimation Paradigms
2.4 Asymptotic Properties
2.4.1 Empirical Process
2.4.2 Density Estimation
2.4.3 Regression Function
2.4.4 Confidence Intervals
2.4.5 Bootstrap Confidence Intervals
2.5 Bandwidth Selection
2.5.1 Plug-in
2.5.2 Cross Validation
2.5.3 Uniform Consistency
2.5.4 Optimality
2.5.5 Some Nonasymptotic Results

3 Additive Regression
3.1 Introduction
3.2 Identification and Interpretation
3.3 Quantile Regression
3.4 Interaction Models
3.5 Parametric Components and Discrete Covariates
3.6 Endogeneity

4 Generalized Additive and Other Models
4.1 Introduction
4.2 Generalized Additive Models
4.2.1 Hazard Function
4.3 Transformation Models
4.4 Homothetic Functions
4.5 General Separability
4.6 Nonparametric Transformation Models
4.7 Non-Separable Models

5 Time Series and Panel Data
5.1 Mean Response
5.2 Volatility
5.3 Panel Data Models
5.3.1 Standard Panel Regression
5.3.2 Nonlinear Generalized Additive Models
5.3.3 A Semiparametric Fama-French Model

6 Marginal Integration
6.1 Introduction
6.2 The Method
6.3 Instrumental Variables
6.4 Asymptotic Properties
6.5 Oracle Estimation
6.6 Generalized Additive Models
6.7 Oracle Performance
6.7.1 Conditional Moment Specification
6.7.2 Full Model Specification

7 Backfitting
7.1 Introduction
7.2 Classical Backfitting
7.3 Smooth Backfitting
7.4 Generalized Additive Models
7.5 Asymptotic Properties
7.6 Bandwidth Choice and Model Selection

8 Appendix
8.1 Stochastic Dominance
8.2 Uniform Laws of Large Numbers (ULLN)
8.3 Stochastic Equicontinuity
8.3.1 Unnormalized Process
8.3.2 Normalized Process
8.4 Proofs of Theorems
8.5 General Theory for Local Nonlinear Estimators
8.5.1 Consistency of the Nadaraya-Watson Estimator
8.5.2 Asymptotic Normality of the Nadaraya-Watson Estimator
8.5.3 Backfitting in Linear Regression


As we know, there are known knowns. There are things we know we know. We also know there are known unknowns. That is to say, we know there are some things we do not know. But there are also unknown unknowns, the ones we don't know we don't know.
Donald Rumsfeld, U.S. Secretary of Defense, February 12th, 2002, Department of Defense news briefing.
Chapter 1

Introduction

These lectures are about nonparametric estimation. Parametric models arise frequently in economics and are of central importance. However, such models only arise when one has imposed specific functional forms on utility or production functions, like Cobb-Douglas or Leontief. Without these ad hoc assumptions one only gets much milder restrictions on functional form, like concavity, symmetry, homogeneity, etc. The nonparametric approach is based on the belief that parametric models are usually misspecified and may result in incorrect inferences. By not restricting the functional form one obtains valid inferences for a much larger range of circumstances. In practice, the applicability depends on the sample size and the quality of the data available.

There are many cases in economics where nonparametric methods have been used and are considered important. Applications arise in cross-section and in time series settings. Many of these applications are routine estimation of nonparametric functions like densities and regression functions: for example, the density of stock returns or the income distribution of a country at a point in time, or the nonparametric regression of one variable on another. In insurance, one is often interested in the conditional hazard function, which represents the probability of dying in a small interval given that you have survived until now. The theory and methods for carrying out such estimation are well understood, and we will spend some time reviewing them in these notes. In more recent times there has been interest in a variety of models with nonparametric components that are not defined as regressions of observable variables and have more complicated structures.


This review covers work on the specification and estimation of what we call structured nonparametric models, that is, models that contain unknown functions but which are restricted in some way, so as to achieve a low effective dimensionality. All these concepts need to be defined more carefully, but it may be helpful to fix ideas by starting with a simple model, nonparametric regression. Suppose that we observe i.i.d. observations $\{(Y_i, X_i), i = 1, \ldots, n\}$ with $Y_i \in \mathbb{R}$ and $X_i \in \mathbb{R}^d$, and let $m(x) = E(Y_i \mid X_i = x)$ denote the regression function. We are interested in estimating this regression function from the available data, and various quantities derived from it, like the average partial effect. We shall focus here on the case where $X$ is continuously distributed. In this context, the curse of dimensionality is that the quality of estimates decreases with the dimension $d$. The term curse of dimensionality actually comes from Bellman in the context of computing high dimensional dynamic programs, but it has also been used by many different authors in this nonparametric statistical context. Stone (1980, 1982) and Ibragimov and Hasminskii (1980) showed that the optimal rate in $L_2$ for estimating $m$ is $n^{-r/(2r+d)}$, with $r$ a measure of the smoothness of $m$. This rate of convergence can be very slow for large dimensions $d$. Silverman (1983) computed the exact mean squared error for a multivariate density estimate that used the optimal bandwidth, and showed how rapidly the performance deteriorates with dimension. In effect, it is impossible to estimate the regression function $m$ with any accuracy for the sort of sample sizes available in practice when $d \geq 4$.

One way out of this is to assume a compromise between a fully nonparametric specification and a fully parametric specification. Semiparametric models have played a prominent part in econometrics in the last twenty-five years, and these go some way towards bridging this gap. In particular, partially linear specifications, like that of Robinson (1987), allow one to separate out some parametric effects from some nonparametric effects and to obtain parametric theory for the estimation of the parametric part. However, this again suffers from the curse of dimensionality when the nonparametric part is multidimensional, and this affects both the quality of estimation of the nonparametric part and the quality of the approximation for the parametric part, Linton (1995). Another popular approach is the index type assumption, where functions of vector arguments are replaced by functions of a linear (or other parametric) combination of their elements, see Ichimura (1998). This approach works well in some settings and leads to parsimonious representations for many functional forms. The main argument against it is perhaps its arbitrariness. We instead will focus on a fully nonparametric class of alternatives, called additive regression, which we treat next.
This writes the regression function as the sum of one-dimensional functions, which circumvents the curse of dimensionality. An additional advantage of this approach is that one can display the univariate functions graphically, whereas one cannot display multidimensional regression functions this way. This visualization aids the understanding of empirical phenomena and the development of further modelling. There are a number of extensions of the basic setting, which we review later, that involve some transformations before invoking additivity. According to Luce and Tukey (1964), additivity is basic to science; they propose an axiom scheme that asks whether there exists a labelling of the variables such that they produce an additive effect. It is certainly hard to think of functions that are not additive in some sense, i.e., after transformations or relabelling of variables.
Chapter 2

Nonparametric Estimation

2.1 Introduction
Smoothing techniques have a long history starting at least in 1857 when the Saxon economist Engel
found the law named after him. He analyzed Belgian data on household expenditure, using what
we would now call the regressogram. Whittaker (1923) used a graduation method for regression
curve estimation which one would now call spline smoothing. Nadaraya (1964) and Watson (1964)
provided an extension for general random design based on kernel methods. In time series, Daniell
(1946) introduced the smoothed periodogram for consistent estimation of the spectral density. Fix
and Hodges (1951) extended this for the estimation of a probability density. Rosenblatt (1956) proved
asymptotic consistency of the kernel density estimator. These methods have developed considerably
in the last twenty-five years, and are now frequently used by applied statisticians. The massive increase in computing power as well as the increased availability of large cross-sectional and high-frequency financial time-series datasets are partly responsible for the popularity of these methods.

2.2 C.D.F. and Density Estimation


Suppose that we have an i.i.d. sample $X_1, \ldots, X_n$ drawn from some population with c.d.f. $F(x) = \Pr[X_i \leq x]$. The c.d.f. is itself of interest in a number of applications, in particular in specifying and testing hypotheses about populations. Let $X_1$ and $X_2$ be two variables (incomes, returns/prospects) at either two different points in time, or for different regions or countries, or with or without a program (treatment). Let $X_{ki}$, $i = 1, \ldots, N$, $k = 1, 2$, denote the not necessarily i.i.d. observations. Let $\mathcal{U}_1$ denote the class of all von Neumann-Morgenstern type utility functions $u$ such that $u' \geq 0$ (increasing). Also, let $\mathcal{U}_2$ denote the class of all utility functions in $\mathcal{U}_1$ for which $u'' \leq 0$ (concavity). Let $F_1(x)$ and $F_2(x)$ denote the corresponding cumulative distribution functions.

Definition 1 $X_1$ First Order Stochastically Dominates $X_2$, denoted $X_1 \succeq_{FSD} X_2$, if and only if either:
(1) $E[u(X_1)] \geq E[u(X_2)]$ for all $u \in \mathcal{U}_1$, with strict inequality for some $u$; or
(2) $F_1(x) \leq F_2(x)$ for all $x$, with strict inequality for some $x$.

Definition 2 $X_1$ Second Order Stochastically Dominates $X_2$, denoted $X_1 \succeq_{SSD} X_2$, if and only if either:
(1) $E[u(X_1)] \geq E[u(X_2)]$ for all $u \in \mathcal{U}_2$, with strict inequality for some $u$; or
(2) $\int_{-\infty}^{x} F_1(t)\,dt \leq \int_{-\infty}^{x} F_2(t)\,dt$ for all $x$, with strict inequality for some $x$.

This says that we can learn from the c.d.f.s of two outcomes whether one is preferred over another according to a big class of utility functions. We estimate the c.d.f. $F$ by the empirical c.d.f.
$$F_n(x) = \frac{1}{n} \sum_{i=1}^{n} 1(X_i \leq x).$$
This estimator obeys $0 \leq F_n(x) \leq 1$ and is weakly increasing. It is a step function with jumps of height $1/n$ (assuming no ties; ties occur with probability zero for continuously distributed data). This is a CADLAG (continue à droite, limite à gauche) function. The integrated c.d.f. can be estimated by
$$I_n(x) = \int_{-\infty}^{x} F_n(t)\,dt = \frac{1}{n} \sum_{i=1}^{n} (x - X_i)\,1(X_i \leq x).$$
The estimator satisfies $I_n(x) \geq 0$ but grows with $x$ and is unbounded in $x$.
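As an illustration, here is a minimal numpy sketch of $F_n$ and $I_n$; the function names and the simulated data are ours, not from the text. Comparing the two curves for two samples is how the dominance criteria in Definitions 1 and 2 are checked in practice.

```python
import numpy as np

def ecdf(x_grid, X):
    """Empirical c.d.f. F_n evaluated at each point of x_grid."""
    X = np.asarray(X)
    return np.array([np.mean(X <= x) for x in x_grid])

def integrated_ecdf(x_grid, X):
    """Integrated empirical c.d.f. I_n(x) = (1/n) sum (x - X_i) 1(X_i <= x)."""
    X = np.asarray(X)
    return np.array([np.mean((x - X) * (X <= x)) for x in x_grid])

# Example: compare two simulated income distributions at a few grid points.
rng = np.random.default_rng(0)
X1, X2 = rng.lognormal(0.0, 0.5, 500), rng.lognormal(0.1, 0.5, 500)
grid = np.linspace(0.5, 3.0, 6)
print(ecdf(grid, X1) - ecdf(grid, X2))                          # sign pattern relates to FSD
print(integrated_ecdf(grid, X1) - integrated_ecdf(grid, X2))    # relates to SSD
```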
The Radon-Nikodym theorem states that, given a measurable space, if a $\sigma$-finite measure $P$ is absolutely continuous with respect to Lebesgue measure, then there is a measurable function $f$ taking values in $[0, \infty)$ such that
$$F(x) = \int_{-\infty}^{x} f(t)\,dt,$$

where $f$ is called the density function of $F$. The density function is of interest in many applications because it is often more informative than the c.d.f. The density function can be interpreted as the derivative of the c.d.f.:
$$f(x) = \lim_{h \to 0} \frac{1}{2h} \Pr[x - h \leq X_i \leq x + h] = \lim_{h \to 0} \frac{1}{2h} E[1(x - h \leq X_i \leq x + h)].$$
However, the density function cannot be estimated by the derivative of $F_n(x)$, since this is discontinuous at the sample points and constant elsewhere. However, a numerical derivative with small $h$ would be
$$\hat{f}(x) = \frac{1}{2h}\,[F_n(x + h) - F_n(x - h)].$$
This can be written in the form
$$\hat{f}(x) = \frac{1}{2nh} \sum_{i=1}^{n} 1(|X_i - x| \leq h).$$

We define now a more general class of estimators. Let $h$ be a scalar bandwidth and $K(\cdot)$ a kernel satisfying $\int K(u)\,du = 1$, and let $K_h(\cdot) = h^{-1} K(h^{-1}\cdot)$. A kernel $K$ is said to be of order $q$ if
$$\int u^j K(u)\,du = 0, \quad j = 1, \ldots, q-1, \quad \text{and} \quad \int u^q K(u)\,du < \infty. \qquad (2.1)$$
The integrals here are over the support of the kernel, which in general is some compact interval or the real line. Frequently, attention is restricted to $K$ a probability density function symmetric about zero, for which $q = 2$. Then let
$$\hat{f}(x) = \frac{1}{n} \sum_{i=1}^{n} K_h(X_i - x)$$
be an estimate of the density of $X_i$. This estimate is non-negative and integrates to one in the special case where the support of $X$ is the entire real line. If there are restrictions on the support of $X$ it may be advisable to use a more complicated kernel that has two or more parameters, see Chen (1999).
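The following is a small numpy sketch of the Rosenblatt-Parzen estimator $\hat{f}(x)$ with an Epanechnikov kernel; the names and simulated data are illustrative only.

```python
import numpy as np

def epanechnikov(u):
    """Second order kernel with compact support [-1, 1]."""
    return 0.75 * (1.0 - u**2) * (np.abs(u) <= 1.0)

def kde(x_grid, X, h, K=epanechnikov):
    """Rosenblatt-Parzen estimator f_hat(x) = (1/(n h)) sum_i K((x - X_i)/h)."""
    X = np.asarray(X)
    u = (np.asarray(x_grid)[:, None] - X[None, :]) / h   # (m, n) array of scaled distances
    return K(u).mean(axis=1) / h

rng = np.random.default_rng(1)
X = rng.standard_normal(400)
print(kde(np.array([-1.0, 0.0, 1.0]), X, h=0.4))
```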
A simple approach is just to renormalize the kernel, and we use this frequently below. Suppose that the support of $X$ is $\mathcal{X}$. Then let
$$K_h(u, v) = 1(u, v \in \mathcal{X})\, \frac{K_h(u - v)}{\int_{\mathcal{X}} K_h(w - v)\,dw},$$
where $K$ is a symmetric probability density function with support $[-1, 1]$. With this construction $K_h(u, v) = K_h(u - v)$ for $v$ at interior points, so $K_h(u, v)$ differs from $K_h(u - v)$ only at the boundary. The norming gives that $\int_{\mathcal{X}} K_h(u, v)\,du = 1$. Then let
$$\hat{f}(x) = \frac{1}{n} \sum_{i=1}^{n} K_h(x, X_i).$$
It follows that $\int_{\mathcal{X}} \hat{f}(x)\,dx = 1$, so the integrated bias is exactly zero. The pointwise bias in the boundary region, however, is of order $h$. One can estimate derivatives of $f$ by the derivatives of $\hat{f}(x)$, provided $K$ is chosen to be smooth enough.
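A sketch of the renormalized boundary kernel estimator $\frac{1}{n}\sum_i K_h(x, X_i)$ above, assuming an Epanechnikov kernel on $[-1,1]$ and support $[0,1]$; the closed-form integrated kernel used for the norming and the function names are our own conveniences, not part of the text.

```python
import numpy as np

def epa(u):
    return 0.75 * (1.0 - u**2) * (np.abs(u) <= 1.0)

def epa_cdf(t):
    """Integral of the Epanechnikov kernel over (-infinity, t]."""
    t = np.clip(t, -1.0, 1.0)
    return 0.5 + 0.75 * t - 0.25 * t**3

def boundary_kde(x_grid, X, h, support=(0.0, 1.0)):
    """KDE with the renormalized kernel K_h(u, v); integrates to one over the support."""
    a, b = support
    X = np.asarray(X)
    x = np.asarray(x_grid)[:, None]
    raw = epa((x - X[None, :]) / h) / h                     # K_h(x - X_i)
    # norming constant: integral of K_h(w - X_i) over the support, one per X_i
    norm = epa_cdf((b - X) / h) - epa_cdf((a - X) / h)
    return (raw / norm[None, :]).mean(axis=1)

rng = np.random.default_rng(2)
X = rng.uniform(0.0, 1.0, 500)                               # uniform density on [0, 1]
print(boundary_kde(np.array([0.01, 0.5, 0.99]), X, h=0.1))   # close to 1 even near the edges
```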
In some cases we are interested in obtaining the same order of bias at the boundary as in the interior. We consider a general class of so-called boundary kernels that are functions of two arguments, $K(t, u)$, where the parameter $t$ controls the support of the kernel. Suppose that $K(t, u)$ has support $[-1, t]$ and satisfies $\int_{-1}^{t} u^j K(t, u)\,du = 0$, $j = 1, \ldots, r-1$, and $\int_{-1}^{t} u^r K(t, u)\,du < \infty$, as for regular kernels. This type of boundary modification unfortunately requires that $K(t, u)$ take negative values, but it does make the bias of the same magnitude everywhere.
Let $\bar{K}_h(x) = \bar{K}(x/h) = \int_{-\infty}^{x} K_h(x')\,dx'$, a smooth increasing c.d.f., and let
$$\tilde{F}_n(x) = (\bar{K}_h * F_n)(x) = \int \bar{K}_h(x - y)\,dF_n(y) = \frac{1}{n} \sum_{i=1}^{n} \bar{K}_h(x - X_i)$$
be a corresponding smoothed estimator of the c.d.f., where $*$ denotes convolution. Then
$$\hat{f}(x) = \tilde{F}_n'(x) = (K_h * F_n)(x) = \int K_h(x - y)\,dF_n(y).$$
We can now provide another interpretation of the kernel density and c.d.f. estimators: $\tilde{F}_n(x)$ and $\hat{f}(x)$ are the c.d.f. and density functions, respectively, of a sample of the random variables $Y_i = X_i + h\varepsilon_i$, conditional on $X_1, \ldots, X_n$, where $\varepsilon_i$ has density $K$. That is, we introduce a small measurement error $\varepsilon_i$ to the original data. Specifically, we have
$$\Pr[Y_i \leq x \mid X_i] = \Pr\left[\varepsilon_i \leq \frac{x - X_i}{h}\right] = \bar{K}_h(x - X_i).$$
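A brief sketch of the smoothed c.d.f. estimator $\tilde{F}_n$, with the integrated Epanechnikov kernel standing in for $\bar{K}$; the names are illustrative.

```python
import numpy as np

def epa_int(t):
    """Integrated Epanechnikov kernel: a smooth c.d.f. supported on [-1, 1]."""
    t = np.clip(t, -1.0, 1.0)
    return 0.5 + 0.75 * t - 0.25 * t**3

def smoothed_ecdf(x_grid, X, h):
    """F_tilde_n(x) = (1/n) sum_i Kbar((x - X_i)/h), a smoothed empirical c.d.f."""
    X = np.asarray(X)
    return epa_int((np.asarray(x_grid)[:, None] - X[None, :]) / h).mean(axis=1)

rng = np.random.default_rng(3)
X = rng.standard_normal(300)
print(smoothed_ecdf(np.array([-1.0, 0.0, 1.0]), X, h=0.3))
```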

2.3 Estimation of Regression Functions


We shall for the most part assume i.i.d. sampling, as would be appropriate for cross-sectional data. Suppose that one observes i.i.d. observations $(X_i, Y_i)$ for $i = 1, \ldots, n$, where the response $Y_i$ is real valued and where the covariates $X_i = (X_{1i}, \ldots, X_{di})$ take values in $\mathbb{R}^d$. Define the regression function $m(x) = E(Y \mid X = x)$, a quantity of interest that summarizes how the covariates affect the response. This can also be defined as the measurable function that minimizes
$$E[\{Y - g(X)\}^2]. \qquad (2.2)$$
Then we can write this in regression model format
$$Y_i = m(X_i) + \varepsilon_i, \quad i = 1, \ldots, n, \qquad (2.3)$$
where $\varepsilon_i$ is a random error, independent over observations, that satisfies $E(\varepsilon_i \mid X_i = x) = 0$. Then $m(\cdot)$ is the regression function of $Y$ on $X$. It is usual also to assume that $\mathrm{var}(Y_i \mid X_i = x) = \sigma^2(x) < \infty$. The dimensionality of $X$ and the smoothness of $m$ determine how well it can be estimated.

We discuss a number of estimators of $m(x)$; many of these are linear smoothers of the form $\sum_{i=1}^{n} w_{ni}(x) Y_i$, for some weighting sequence $\{w_{ni}(x)\}_{i=1}^{n}$ depending only on $X_1, \ldots, X_n$, but they arise from different motivations and possess different statistical properties. The methods we consider are appropriate for both random design, where $(X_i, Y_i)$ are i.i.d., and fixed design, where the $X_i$ are fixed in repeated samples. In the random design case, $X$ is an ancillary statistic, and standard statistical practice, see Cox and Hinkley (1974), is to conduct inference conditional on the sample $\{X_i\}_{i=1}^{n}$. However, many papers in the literature prove theoretical properties unconditionally, and we shall, for ease of exposition, present results in this form. We also quote most results only for the case where $X$ is scalar, although we discuss the extension to multivariate data. We restrict our attention to independent sampling, but extensions to the dependent sampling case are straightforward.

Recall that
$$m(x) = \frac{\int y f(x, y)\,dy}{\int f(x, y)\,dy}, \qquad (2.4)$$
where $f(x, y)$ is the joint density of $(X, Y)$. A natural way to estimate $m(\cdot)$ is first to compute an estimate of $f(x, y)$ and then to integrate it according to this formula.

A kernel density estimate $\hat{f}(x, y)$ of $f(x, y)$ is
$$\hat{f}(x, y) = \frac{1}{n} \sum_{i=1}^{n} K_h(x, X_i)\, K_h(y, Y_i).$$
We have (ignoring the limits of integration):
$$\int \hat{f}(x, y)\,dy = \frac{1}{n} \sum_{i=1}^{n} K_h(x, X_i), \qquad \int y \hat{f}(x, y)\,dy = \frac{1}{n} \sum_{i=1}^{n} K_h(x, X_i)\, Y_i.$$
Plugging these into the numerator and denominator of (2.4) we obtain the Nadaraya-Watson kernel estimate
$$\hat{m}(x) = \frac{\sum_{i=1}^{n} K_h(x, X_i)\, Y_i}{\sum_{i=1}^{n} K_h(x, X_i)}. \qquad (2.5)$$
The bandwidth $h$ determines the degree of smoothness of $\hat{m}$. In fact, this definition makes sense also when $x \in \mathbb{R}^d$, in which case $K$ is understood to be defined on a subset of $\mathbb{R}^d$ and $K_h(\cdot) = h^{-d} K(h^{-1}\cdot)$. In the multivariate case, one can also allow more general bandwidth schemes, but for simplicity we shall not consider this.
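A minimal sketch of the Nadaraya-Watson estimator (2.5); function names and data are ours, and the zero-denominator case noted below is flagged with a NaN.

```python
import numpy as np

def epa(u):
    return 0.75 * (1.0 - u**2) * (np.abs(u) <= 1.0)

def nadaraya_watson(x_grid, X, Y, h, K=epa):
    """Local constant (Nadaraya-Watson) estimate of m(x) = E(Y | X = x)."""
    X, Y = np.asarray(X), np.asarray(Y)
    W = K((np.asarray(x_grid)[:, None] - X[None, :]) / h)   # unnormalized weights
    denom = W.sum(axis=1)
    return (W @ Y) / np.where(denom > 0, denom, np.nan)     # NaN where no data in window

rng = np.random.default_rng(4)
X = rng.uniform(-2, 2, 500)
Y = np.sin(X) + 0.3 * rng.standard_normal(500)
print(nadaraya_watson(np.array([-1.0, 0.0, 1.0]), X, Y, h=0.3))
```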
We can interpret the kernel estimator as the minimizer of the local least squares criterion function
$$Q_n(\theta) = \sum_{i=1}^{n} K_h(x, X_i)\,(Y_i - \theta)^2, \qquad (2.6)$$
that is, $\hat{\theta}(x) = \arg\min_{\theta \in \mathbb{R}} Q_n(\theta) = \hat{m}_h(x)$. The Nadaraya-Watson estimator is linear in $Y$:
$$\hat{m}(x) = \sum_{i=1}^{n} w_{ni}(x)\, Y_i, \qquad (2.7)$$
where $w_{ni}(x) = K_h(x - X_i) / \sum_{j=1}^{n} K_h(x - X_j)$ depends only on the covariates $X_1, \ldots, X_n$. The weights satisfy $\sum_{i=1}^{n} w_{ni}(x) = 1$, so that if $Y_i \mapsto a + bY_i$, then $\hat{m}(x) \mapsto a + b\hat{m}(x)$. When $K \geq 0$, the weights are probability weights, since they also satisfy $w_{ni}(x) \in [0, 1]$. This estimator is always defined at the sample points $X_i$, but at any other point $x$ there is a positive (though very rapidly vanishing) probability that $K_h(x - X_i) = 0$ for all $i$, in which case the denominator is zero and the estimator is undefined.
In fact, one can also set up the global objective function
$$Q_n(\theta(\cdot)) = \int \sum_{i=1}^{n} K_h(x, X_i)\,(Y_i - \theta(x))^2\,d\mu(x), \qquad (2.8)$$

where now the parameter is a function $\theta(\cdot)$ and $\mu$ is any positive measure absolutely continuous with respect to Lebesgue measure on the support of $X$. It can be shown that the function $\hat{\theta}(\cdot)$ that minimizes this criterion is exactly $\hat{m}(x)$ for each $x$. Specifically, let $\theta(x) = \hat{\theta}(x) + \tau g(x)$ for any function $g$, and compute the first order condition
$$\frac{\partial}{\partial \tau} Q_n(\theta(\cdot))\Big|_{\tau = 0} = -2 \int \sum_{i=1}^{n} K_h(x, X_i)\,\big(Y_i - \hat{\theta}(x)\big)\, g(x)\,d\mu(x) = 0.$$
This must hold for every function $g$ possessing certain regularity. It follows by the Euler-Lagrange theorem that
$$\sum_{i=1}^{n} K_h(x, X_i)\,\big(Y_i - \hat{\theta}(x)\big) = 0$$
for each $x$. Therefore, the resulting estimator is exactly the Nadaraya-Watson estimator. This global definition does, however, suggest how we might impose additional restrictions on the NW estimator, by introducing side constraints in the optimization (2.8). We can also interpret this as a projection problem. We can think of the data $Y = (Y_1, \ldots, Y_n)^{\top}$ as an element of the space $\mathcal{F}$ of $n$ functions $\{f^i : i = 1, \ldots, n : f^i \text{ are functions from } \mathbb{R} \text{ to } \mathbb{R}\}$. We do this by putting $f^i(x) \equiv Y_i$. We define the following semi-norm on $\mathcal{F}$:
$$\|f\|^2 = \frac{1}{n} \sum_{i=1}^{n} \int f^i(x)^2\, K_h(X_i - x)\,dx. \qquad (2.9)$$
In this way we can think of the local constant estimator as the projection of the data onto the space $\mathcal{F}$.
In practice, one estimates at a grid of points $x_1, \ldots, x_m$. Often one takes $m = n$ and $x_i = X_i$, the covariate values. In that case one obtains
$$\hat{m} = W y,$$
where $\hat{m}$ and $y$ are the $n \times 1$ estimator and dependent variable vectors respectively, while $W$ is the $n \times n$ smoother matrix with typical element $W_{ij} = w_{nj}(X_i)$. Let
$$K = [K_h(X_i, X_j)]_{i,j}$$
and let $W = K ./ Ki$, where $./$ denotes elementwise division of each row of $K$ by the corresponding element of $Ki$ (the vector of row sums). We will talk more later about the properties of the matrix $W$, but note that by construction $Wi = i$, where $i$ is the $n \times 1$ vector of ones.
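The smoother matrix construction can be sketched directly; the code below builds $K$, row-normalizes it, and checks that $Wi = i$. Names are illustrative.

```python
import numpy as np

def smoother_matrix(X, h, K=lambda u: 0.75 * (1 - u**2) * (np.abs(u) <= 1)):
    """n x n matrix W with W[i, j] = K_h(X_i - X_j) / sum_k K_h(X_i - X_k)."""
    X = np.asarray(X)
    Kmat = K((X[:, None] - X[None, :]) / h)        # the matrix K in the text
    return Kmat / Kmat.sum(axis=1, keepdims=True)  # divide each row by its sum

rng = np.random.default_rng(5)
X = np.sort(rng.uniform(0, 1, 200))
Y = np.cos(2 * np.pi * X) + 0.2 * rng.standard_normal(200)
W = smoother_matrix(X, h=0.1)
m_hat = W @ Y                                      # fitted values at the sample points
print(np.allclose(W.sum(axis=1), 1.0))             # rows sum to one: W i = i
```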

Local Polynomial Estimators

The Nadaraya-Watson estimator can be regarded as the solution of the minimization problem (2.6), and consequently we call this estimator the local constant estimator. We might generalize this to allow for a local linear shape; specifically, consider the objective function
$$Q_n(\theta_0, \theta_1) = \sum_{i=1}^{n} K_h(x, X_i)\,(Y_i - \theta_0 - \theta_1 (X_i - x))^2, \qquad (2.10)$$
where $\theta_0, \theta_1$ are parameters to be determined. The motivation for this comes from the Taylor expansion
$$m(X_i) = m(x) + m'(x)(X_i - x) + \ldots,$$
from which we can see that $\theta_0$ can be identified with $m(x)$, while $\theta_1$ can be identified with $m'(x)$. The sample objective function (2.10) will have a unique solution provided essentially $\#\{i : K_h(x, X_i) \neq 0\} \geq 2$. In this case,
$$\begin{pmatrix} \hat{\theta}_0 \\ \hat{\theta}_1 \end{pmatrix} = \begin{pmatrix} \sum_{i=1}^{n} K_h(x, X_i) & \sum_{i=1}^{n} K_h(x, X_i)(X_i - x) \\ \sum_{i=1}^{n} K_h(x, X_i)(X_i - x) & \sum_{i=1}^{n} K_h(x, X_i)(X_i - x)^2 \end{pmatrix}^{-1} \begin{pmatrix} \sum_{i=1}^{n} K_h(x, X_i)\, Y_i \\ \sum_{i=1}^{n} K_h(x, X_i)(X_i - x)\, Y_i \end{pmatrix}.$$
This method simultaneously estimates both the function and its derivative. The estimator is linear in the response variable.
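A sketch of the local linear estimator, solving the weighted least squares problem (2.10) at each grid point; it assumes at least two observations fall inside each window, as noted above, and the names are ours.

```python
import numpy as np

def local_linear(x_grid, X, Y, h, K=lambda u: 0.75 * (1 - u**2) * (np.abs(u) <= 1)):
    """Local linear fit: returns (m_hat(x), m'_hat(x)) at each grid point."""
    X, Y = np.asarray(X), np.asarray(Y)
    out = np.empty((len(x_grid), 2))
    for j, x in enumerate(np.asarray(x_grid)):
        w = K((x - X) / h)
        Z = np.column_stack([np.ones_like(X), X - x])   # regressors (1, X_i - x)
        A = Z.T @ (w[:, None] * Z)                      # weighted normal equations
        b = Z.T @ (w * Y)
        out[j] = np.linalg.solve(A, b)                  # beta0 = m(x), beta1 = m'(x)
    return out

rng = np.random.default_rng(6)
X = rng.uniform(-2, 2, 500)
Y = X**2 + 0.3 * rng.standard_normal(500)
print(local_linear([0.0, 1.0], X, Y, h=0.4))            # roughly (0, 0) and (1, 2)
```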
We consider the more general local polynomial case for vector arguments, which we warn has complicated notation. For a vector $t = (t_1, \ldots, t_d)$, define the $p$th order polynomial
$$P_\theta(t) = \theta_0 + \sum_{j=1}^{d} \theta_{1j} t_j + \ldots + \frac{1}{p!} \sum_{s : |s| = p} \theta_{ps} t^s = \theta^{\top} \pi_{d,p}(t), \qquad (2.11)$$
where $s = (s_1, \ldots, s_d)$ with $t^s = t_1^{s_1} \cdots t_d^{s_d}$ and $|s| = \sum_{j=1}^{d} s_j$. Here, $\theta = (\theta_0, \theta_1, \ldots, \theta_p)$ is a vector of parameters, where $\theta_1 = (\theta_{11}, \ldots, \theta_{1d}), \ldots, \theta_p = (\theta_{p(p,0,\ldots,0)}, \ldots, \theta_{p(0,\ldots,0,p)})$, while $\pi_{d,p}(t) = (1, t_1, \ldots, t_d, \ldots, t_d^p)^{\top}$. The total number of multi-indices $s$ with $|s| = \ell$ is $N_\ell = (\ell + d - 1)! / \{(d-1)!\,\ell!\}$, and the total number of parameters in $\theta$ is $q(p, d) = \sum_{\ell=0}^{p} N_\ell$. The problem with this method is just that in high dimensions the number of local parameters to be estimated is very large. If $d = 1$, then $q(p, d) = p + 1$. Let $\hat{\theta}$ minimize
$$\sum_{i=1}^{n} K_h(x - X_i)\,\{Y_i - P_\theta(X_i - x)\}^2 \qquad (2.12)$$
with respect to $\theta \in \mathbb{R}^{q(p,d)}$. Then $\hat{\theta}_0$ serves as an estimator of $m(x)$, while $\hat{\theta}_j$ estimates the $j$th derivatives of $m$. A variation on these estimators, called LOWESS, was first considered in Cleveland (1979), who employed a nearest neighbor window. Fan (1992) establishes an asymptotic approximation for the case where $p = 1$, which he calls the local linear estimator $\hat{m}_{h,1}(x)$.

The local linear estimator is unbiased when $m$ is linear, while the Nadaraya-Watson estimator may be biased, depending on the marginal density of the design. Higher order polynomials can achieve bias reduction, see Fan and Gijbels (1992) and Ruppert and Wand (1992). The local polynomial estimator also embodies a natural boundary adjustment when the order is odd, so that one can work with standard kernels. This class of estimators is perhaps the most popular, judged by journal article counts.

One issue in the multivariate case is that the estimator is only well-defined at a point $x$ when there are at least $q(p, d)$ observations within the kernel window, i.e., $\#\{i : K_h(x - X_i) \neq 0\} \geq q(p, d)$. This is not necessarily satisfied, even at sample observations, i.e., $x \in \{X_1, \ldots, X_n\}$.

Local Likelihood

The principle underlying the local polynomial estimator can be generalized in a number of ways. Tibshirani (1984) introduced the local likelihood procedure, in which an arbitrary parametric regression function $g(x; \theta)$ substitutes for the polynomial in (2.12). Fan, Heckman and Wand (1992) develop theory for a nonparametric estimator in a Generalized Linear Model (GLIM) in which, for example, a probit likelihood function replaces the polynomial in (2.12).

Suppose that $f(y \mid g(x))$ is the density function (or frequency function) of $Y \mid X$, where $f$ is known and $g$ is an unknown function related to the mean through a known link, i.e., for known $h$, $h(g(x)) = m(x)$. Then let $\hat{\theta}_0, \ldots, \hat{\theta}_p$ minimize
$$\ell_n(\theta) = -\sum_{i=1}^{n} K_h(x - X_i)\,\log f(Y_i \mid P_\theta(X_i - x))$$
with respect to $\theta \in \mathbb{R}^{p+1}$. Then $\hat{\theta}_0$ serves as an estimator of $g(x)$, while $\hat{\theta}_j$ estimates the $j$th derivative of $g$, and $h(\hat{\theta}_0)$ serves as an estimator of $m(x)$. This includes the standard local polynomial estimator as a special case when $f$ is the normal density function. Suppose that $Y$ is binary; then $f(y \mid g(x)) = \Phi(g(x))^y\,[1 - \Phi(g(x))]^{1-y}$ for a known c.d.f. $\Phi$. The advantage of this method is that it imposes the restrictions implied by the data. See Fan, Heckman, and Wand (1992) and also Gozalo and Linton (1999).
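A hedged sketch of the local likelihood idea for binary $Y$ with a logistic link and a local linear polynomial; it is not the implementation of the papers cited above, and the optimizer and names are our own choices.

```python
import numpy as np
from scipy.optimize import minimize

def local_logit(x0, X, Y, h, K=lambda u: 0.75 * (1 - u**2) * (np.abs(u) <= 1)):
    """Local linear logit: maximize the kernel-weighted log-likelihood at x0.
    Returns (g_hat(x0), g'_hat(x0)); m_hat(x0) = logistic(g_hat(x0))."""
    X, Y = np.asarray(X), np.asarray(Y)
    w = K((x0 - X) / h)
    Z = np.column_stack([np.ones_like(X), X - x0])

    def neg_loglik(beta):
        eta = Z @ beta
        # Bernoulli/logit log-density, weighted by the kernel (logaddexp is stable)
        return -np.sum(w * (Y * eta - np.logaddexp(0.0, eta)))

    return minimize(neg_loglik, x0=np.zeros(2), method="BFGS").x

rng = np.random.default_rng(7)
X = rng.uniform(-2, 2, 1000)
Y = (rng.uniform(size=1000) < 1 / (1 + np.exp(-(X - 0.5)))).astype(float)
beta = local_logit(0.0, X, Y, h=0.5)
print(1 / (1 + np.exp(-beta[0])))   # estimate of m(0) = P(Y = 1 | X = 0), about 0.38 here
```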

Local GMM

In many situations, information about a model is expressed through conditional moment restrictions, as in Hansen and Singleton (1988). In the local setting, suppose that one instead has conditional moment restrictions of the form, for some unknown function $g$,
$$E\left[\rho(Y, g(X)) \mid X = x\right] = 0,$$
where $\rho$ is a vector of moment conditions. For example, suppose that $E(Y \mid X = x) = m(x)$ and $\mathrm{var}(Y \mid X = x) = m(x)$; see Gagliardini and Gourieroux (2005) and Gozalo and Linton (1999). Then estimation can proceed by minimizing $\|G_n(\theta)\|$ with respect to $\theta$, where
$$G_n(\theta) = \sum_{i=1}^{n} K_h(x - X_i)\,\rho(Y_i, P_\theta(X_i - x))\,\pi_{d,p}(X_i - x), \qquad (2.13)$$
where $\|A\|$ is some vector norm, while $\pi_{d,p}(\cdot)$ is defined in (2.11). Let $\hat{\theta}_0$ serve as an estimator of $g(x)$.

Quantile Regression

To estimate conditional quantiles we use a version of local likelihood. The main difference is that $m(x)$ is no longer interpreted as the conditional mean but as some other location parameter. Also, the criterion function need not be smooth. Let $\hat{m}(x) = \hat{\theta}_0$, where $\hat{\theta}$ is any minimizer of the criterion function
$$\sum_{i=1}^{n} K_h(x - X_i)\,\rho_\tau(Y_i - P_\theta(X_i - x))$$
with $\rho_\tau(u) = u(\tau - 1(u < 0))$, where $\tau \in (0, 1)$ is the quantile of interest. In general the solution is easy to compute but is not unique, so some additional restriction has to be imposed to obtain a well defined solution; see Chaudhuri (1991).
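For the local constant case the minimizer of the kernel-weighted check function is a kernel-weighted sample quantile, which the following sketch exploits; the names and the tie-breaking rule (take the smallest such value) are our own choices.

```python
import numpy as np

def local_quantile(x0, X, Y, h, tau, K=lambda u: 0.75 * (1 - u**2) * (np.abs(u) <= 1)):
    """Local constant conditional tau-quantile: minimizes sum_i K_h(x0 - X_i) rho_tau(Y_i - theta).
    The minimizer is a kernel-weighted sample quantile of the Y_i."""
    X, Y = np.asarray(X), np.asarray(Y)
    w = K((x0 - X) / h)
    idx = np.argsort(Y)
    Y_s, w_s = Y[idx], w[idx]
    cum = np.cumsum(w_s)
    if cum[-1] == 0:
        return np.nan                                   # no observations in the window
    # smallest Y value whose cumulative weight share reaches tau
    return Y_s[np.searchsorted(cum, tau * cum[-1])]

rng = np.random.default_rng(8)
X = rng.uniform(-2, 2, 2000)
Y = X + (1 + 0.5 * np.abs(X)) * rng.standard_normal(2000)
print(local_quantile(1.0, X, Y, h=0.3, tau=0.5))        # conditional median near 1.0
```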

2.3.1 Alternative Estimation Paradigms


Spline Estimators

For any estimate $\hat{m}$ of $m$, the residual sum of squares (RSS) is defined as
$$\mathrm{RSS} = \sum_{i=1}^{n} \{Y_i - \hat{m}(X_i)\}^2,$$
which is a widely used criterion, in parametric contexts, for generating estimators of regression functions. However, the RSS is minimized by any $\hat{m}$ interpolating the data, assuming no ties in the $X$'s. To avoid this problem it is necessary to add a stabilizer. Most work is based on the stabilizer
$$J(\hat{m}) = \int \{\hat{m}''(u)\}^2\,du,$$
although see Ansley, Kohn and Wong (1993) and Koenker, Ng and Portnoy (1993) for alternatives. The cubic spline estimator $\hat{m}$ is the (unique) minimizer of
$$R_\lambda(\hat{m}) = \sum_{i=1}^{n} \{Y_i - \hat{m}(X_i)\}^2 + \lambda \int \{\hat{m}''(u)\}^2\,du. \qquad (2.14)$$
The spline $\hat{m}$ has the following properties: it is a cubic polynomial between two successive $X$-values; at the observation points $\hat{m}(\cdot)$ and its first two derivatives are continuous; at the boundary of the observation interval the spline is linear. This characterization of the solution to (2.14) allows the integral term on the right hand side to be replaced by a quadratic form, see Eubank (1988) and Wahba (1990), and computation of the estimator proceeds by standard, although computationally intensive, matrix techniques.

The smoothing parameter $\lambda$ controls the degree of smoothness of the estimator $\hat{m}$. As $\lambda \to 0$, $\hat{m}$ interpolates the observations, while as $\lambda \to \infty$, $\hat{m}$ tends to a least squares regression line. Although $\hat{m}$ is linear in the $Y$ data, see Härdle (1990, pp. 58-59), its dependency on the design and on the smoothing parameter is rather complicated. This has resulted in rather less treatment of the statistical properties of these estimators, except in rather simple settings, although see Wahba (1990); in fact, the extension to multivariate design is not straightforward. However, splines are asymptotically equivalent to kernel smoothers, as Silverman (1984) showed. The equivalent kernel is
$$K(u) = \frac{1}{2} \exp\left(-\frac{|u|}{\sqrt{2}}\right) \sin\left(\frac{|u|}{\sqrt{2}} + \frac{\pi}{4}\right), \qquad (2.15)$$
which is of fourth order, since its first three moments are zero, while the equivalent bandwidth $h = h(\lambda, X_i)$ is
$$h(\lambda, X_i) = \lambda^{1/4} n^{-1/4} f(X_i)^{-1/4}. \qquad (2.16)$$
One advantage of spline estimators over kernels is that global inequality and equality constraints can be imposed more conveniently: for example, it may be desirable to restrict the smooth to pass through a particular point, see Jones (1985). Silverman (1985) discusses a Bayesian interpretation of the spline procedure.

Series or Sieve Estimators

Series estimators have received considerable attention in the econometrics literature, following Elbadawi, Gallant and Souza (1983). This theory is very much tied to the structure of Hilbert space. Suppose that the function $m$ has an expansion, for all $x$,
$$m(x) = \sum_{j=0}^{\infty} \beta_j \varphi_j(x), \qquad (2.17)$$
in terms of the orthogonal basis functions $\{\varphi_j\}_{j=0}^{\infty}$ and their coefficients $\{\beta_j\}_{j=0}^{\infty}$. Suitable basis systems include the Legendre polynomials described in Härdle (1990) and the Fourier series used in Gallant and Souza (1991). A simple method of estimating $m(x)$ involves first selecting a basis system and a truncation sequence $\tau(n)$, where $\tau(n)$ is an integer less than $n$, and then regressing $Y_i$ on $\varphi_{\tau i} = (\varphi_0(X_i), \ldots, \varphi_{\tau}(X_i))^{\top}$. Let $\{\hat{\beta}_j\}_{j=0}^{\tau(n)}$ be the least squares parameter estimates; then
$$\hat{m}(x) = \sum_{j=0}^{\tau(n)} \hat{\beta}_j \varphi_j(x) = \sum_{i=1}^{n} W_{ni}(x)\, Y_i, \qquad (2.18)$$
where $W_n(x) = (W_{n1}(x), \ldots, W_{nn}(x))^{\top}$, with
$$W_n(x)^{\top} = \varphi_x^{\top} (\Phi^{\top} \Phi)^{-1} \Phi^{\top}, \qquad (2.19)$$
where $\varphi_x = (\varphi_0(x), \ldots, \varphi_{\tau}(x))^{\top}$ and $\Phi = (\varphi_{\tau 1}, \ldots, \varphi_{\tau n})^{\top}$.

Chen (2007) gives an up to date account of these methods for a variety of models, including additive nonparametric regression in both cross-sectional and time series settings. These estimators are typically very easy to compute. In addition, the extension to additive structures and semiparametric models is convenient, see Andrews and Whang (1990) and Andrews (1991), and accommodation of shape restrictions and endogeneity is also quite simple. Finally, series estimators can adapt to the smoothness of $m$: provided $\tau(n)$ grows at a sufficiently fast rate, the optimal rate of convergence for the smoothness class of $m$ can be established, see Stone (1982), while fixed-window order $r$ kernel estimators achieve at best a rate of convergence of $n^{-r/(2r+1)}$. On the downside, although it is possible to obtain pointwise normality for the centred and scaled estimators, it is not possible to obtain simple expressions for bias and variance as it is for kernel estimators.
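A sketch of the series estimator (2.18), using a raw polynomial basis for simplicity; Legendre or Fourier bases plug into the same least squares step. The names and the choice of truncation parameter are illustrative.

```python
import numpy as np

def series_fit(X, Y, J):
    """Series (sieve) regression of Y on a polynomial basis of order J.
    Returns a function x -> m_hat(x) built from the least squares coefficients."""
    X, Y = np.asarray(X), np.asarray(Y)
    Phi = np.vander(X, J + 1, increasing=True)        # basis phi_j(x) = x^j, j = 0..J
    coef, *_ = np.linalg.lstsq(Phi, Y, rcond=None)    # least squares coefficient estimates
    return lambda x: np.vander(np.atleast_1d(x), J + 1, increasing=True) @ coef

rng = np.random.default_rng(9)
X = rng.uniform(-1, 1, 500)
Y = np.exp(X) + 0.2 * rng.standard_normal(500)
m_hat = series_fit(X, Y, J=4)                         # truncation parameter tau(n) = 4
print(m_hat(np.array([-0.5, 0.0, 0.5])))              # close to exp(x) at these points
```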

2.4 Asymptotic Properties

2.4.1 Empirical Process


The asymptotic properties of the empirical c.d.f. were established in 1933 in two separate papers by Glivenko and Cantelli. We have

Theorem 1 As $n \to \infty$,
$$\sup_{-\infty < x < \infty} |F_n(x) - F(x)| \overset{a.s.}{\longrightarrow} 0.$$

Remark. The only condition is that the $X_i$ are i.i.d.; note, though, that since $F$ is a distribution function it has at most a countable number of discontinuities and is right continuous. Note also that the supremum is over a non-compact set; in much subsequent work generalizing this theorem it has been necessary to restrict attention to compact sets. The proof of Theorem 1 exploits some special structure, specifically that for each $x$, $1(X_i \leq x)$ is Bernoulli with success probability $F(x)$. This proof is very special and uses the structure of the empirical c.d.f. quite heavily.
We now turn to the limiting distribution of $F_n$. It is quite easy to see that for each $x \in \mathbb{R}$,
$$\sqrt{n}\,(F_n(x) - F(x)) \Longrightarrow N\big(0, F(x)(1 - F(x))\big),$$
so that the variance is small at both endpoints. We consider $F_n(\cdot)$ as a function estimator of the function $F(\cdot)$, and so we seek a Functional Central Limit Theorem (FCLT). For this we need to introduce certain continuous time stochastic processes.

A continuous time Wiener (or Brownian motion) process $W$ satisfies:
1. $W_t - W_s$ is independent of $\mathcal{F}_s = \sigma\{W_u : u \leq s\}$;
2. $W_t - W_s \sim N(0, t - s)$.

The existence of such a process is guaranteed by Kolmogorov's extension theorem. It can be shown that $W_t$ has continuous sample paths, and even sample paths that are locally Hölder continuous up to exponent $\gamma < 1/2$, but are nowhere locally Hölder continuous for any $\gamma > 1/2$ and in particular are nowhere differentiable. The sample paths are of unbounded variation.

A Brownian bridge is a continuous-time stochastic process $B(t)$, $t \in [0, 1]$, whose probability distribution is the conditional probability distribution of a Wiener process $W(t)$ given that $W(1) = 0$. A process $B$ is a Brownian bridge if it satisfies
$$\mathrm{cov}(B(t), B(s)) = \min\{t, s\} - ts.$$
It is a Gaussian process. Furthermore, we can write
$$B(t) = W(t) - tW(1).$$
We now describe the more general functional central limit theorem for $F_n(\cdot)$.

Theorem 2 (Donsker, 1952) As $n \to \infty$,
$$\sqrt{n}\,(F_n(\cdot) - F(\cdot)) \Longrightarrow B(F(\cdot)),$$
where $B$ is the Brownian bridge.

The process $B(F(\cdot))$ is also Gaussian, with
$$\mathrm{cov}\big(B(F(t)), B(F(s))\big) = \min\{F(t), F(s)\} - F(t)F(s).$$

2.4.2 Density Estimation


We now turn to density estimation. The results here are much more difficult to obtain than for the empirical process and generally require further conditions.

Theorem 3 Suppose that:
(i) The density $f$ has a continuous second derivative at the interior point $x$, and $f(x) > 0$;
(ii) The kernel $K$ is symmetric about zero and continuous on its compact support. Let $\|K\|_2^2 = \int K^2(u)\,du$ and $\mu_2(K) = \int u^2 K(u)\,du$;
(iii) $h \to 0$ and $nh \to \infty$ such that $\lim_{n \to \infty} nh^5 < \infty$.
Then
$$\sqrt{nh}\left[\hat{f}(x) - f(x) - h^2 b(x)\right] \Longrightarrow N(0, v(x)),$$
where
$$v(x) = \|K\|_2^2\, f(x) \quad \text{and} \quad b(x) = \frac{\mu_2(K)}{2}\, f''(x).$$

The variance is proportional to the density at the point $x$. We will discuss the CLT more in the context of regression. We now turn to the question of an FCLT. Unlike the empirical process, the kernel density estimator is not "tight", meaning specifically that for any $x' \neq x$, $\hat{f}(x)$ and $\hat{f}(x')$ are asymptotically mutually independent. Specifically,
$$nh\,\mathrm{cov}\big(\hat{f}(x), \hat{f}(x')\big) \to 0.$$

This means that we cannot expect an FCLT to hold in a simple way, because any limiting process would have to have rather strange properties, like having independent points (not increments, as for the Wiener process). We next turn to uniform convergence of the density estimator.

There are several types of consistency results, either pointwise or uniform.

Theorem 4 (Nadaraya (1965)) Suppose that $K$ is of bounded variation and integrates to one, that $f$ is uniformly continuous on $\mathbb{R}$, and that $h \to 0$ and $nh^2 \to \infty$. Then
$$\sup_{x \in \mathbb{R}} |\hat{f}(x) - f(x)| \overset{a.s.}{\longrightarrow} 0.$$

This theorem places very weak assumptions on the kernel and density but somewhat stronger conditions on the bandwidth sequence. Note that the convergence is uniform over the entire real line. It is possible to establish uniform consistency of the kernel density estimator under weaker conditions on the bandwidth sequence, like $nh/\log n \to \infty$, at the expense of stronger conditions on $K$.

Theorem 5 (Silverman (1978)) Suppose that $K$ is uniformly continuous with modulus of continuity $w$ and of bounded variation, that $\int |K(x)|\,dx < \infty$ and $K(x) \to 0$ as $|x| \to \infty$, that $\int K(x)\,dx = 1$, and that $\int |x \log |x||^{1/2}\,|dK(x)| < \infty$. Suppose that $f$ is uniformly continuous. Then, provided $h \to 0$ and $nh/\log n \to \infty$,
$$\sup_{x \in \mathbb{R}} |\hat{f}(x) - f(x)| \overset{a.s.}{\longrightarrow} 0.$$
Suppose additionally that $\int_0^1 [\log(1/u)]^{1/2}\,d\psi(u) < \infty$, where $\psi(u) = \{w(u)\}^{1/2}$, and that $nh(\log n)^{-2}\log(1/h) \to \infty$ and $\sum_{n=1}^{\infty} h_n^{\lambda} < \infty$ for some $\lambda$; then
$$\sup_{x \in \mathbb{R}} \big|\hat{f}(x) - E[\hat{f}(x)]\big| = O\left(\sqrt{\frac{\log(1/h)}{nh}}\right) \quad a.s.$$

This result uses arguments that are special to the univariate case.
Silverman (1978) proves uniform consistency of the kernel density estimator under the assumption that $f$ is uniformly continuous. This assumption is innocuous for densities with unbounded support but rather restrictive for those living on a bounded interval. For example, it rules out the uniform density. The assumption is needed for handling the bias term $E[\hat{f}(x)] - f(x)$, and the results hold true for $\sup_{x \in \mathbb{R}} |\hat{f}(x) - E[\hat{f}(x)]|$ without this assumption. Recently, Giné and Guillou (2002) have shown the following.

Theorem 6 (Giné and Guillou (2002)) Suppose that the kernel $K$ is a bounded function of bounded variation. Suppose that $h \to 0$ monotonically, such that $nh^d/|\log h| \to \infty$ and $|\log h|/\log\log n \to \infty$. Suppose further that the density $f$ is bounded. Then
$$\sup_{x \in \mathbb{R}^d} \big|\hat{f}(x) - E[\hat{f}(x)]\big| = O\left(\sqrt{\frac{\log(1/h)}{nh^d}}\right) \quad a.s.$$

This result is quite remarkable in terms of the weakness of the conditions. To establish consistency of $\hat{f}(x)$, though, we also need to analyze the term $\sup_{x \in \mathbb{R}} |E[\hat{f}(x)] - f(x)|$. This requires additional conditions. For example, one might assume that $f$ is uniformly continuous, like Silverman (1978). To establish a rate one needs stronger conditions, specifically smoothness. We note that uniform continuity is an appropriate condition for densities with unbounded support but does rule out many densities with compact support, for example the uniform density. For those cases different conditions are appropriate. The bias term is handled by making a change of variables. We have
$$E[\hat{f}(x)] = \int K_h(x - X)\, f(X)\,dX,$$
and if we transform $X \mapsto u = (x - X)/h$ the integrand becomes $K(u) f(x - uh)\,du$. If the support of $X$ is $\mathbb{R}$, then the range of integration of $u$ is not affected. Then
$$\int K(u) f(x - uh)\,du = f(x) \int K(u)\,du - h f'(x) \int K(u) u\,du + \frac{h^2}{2} \int K(u) u^2 f''(x^*(u, h))\,du$$
$$= f(x) + \frac{h^2}{2} f''(x) \int K(u) u^2\,du + \frac{h^2}{2} \int K(u) u^2 \left[f''(x^*(u, h)) - f''(x)\right]\,du,$$
where $x^*(u, h)$ is a point intermediate between $x$ and $x - uh$, and we use the fact that $\int K(u)\,du = 1$ and $\int K(u) u\,du = 0$.

If the support of $X$ is some interval $[\underline{x}, \overline{x}]$, then the range of integration of $u$ becomes $[(x - \overline{x})/h, (x - \underline{x})/h]$. When $x$ is an interior point this interval tends towards $(-\infty, \infty)$, but if $x = \overline{x}$, then the interval tends towards $(0, \infty)$.

2.4.3 Regression Function


Stone (1977) gave the following result for linear estimators
$$\hat{m}(x) = \sum_{i=1}^{n} w_{ni}(x)\, Y_i,$$
where the $\{w_{ni}(x)\}$ depend only on the covariates $X_1, \ldots, X_n$.

Theorem 7 (Stone (1977)) Let $\{w_{ni}(x)\}$ be a sequence of weights and let $X, X_1, \ldots, X_n$ be i.i.d. Suppose that the following conditions hold:
(1) There is a $C \geq 1$ such that for every nonnegative Borel measurable function $f$,
$$E\left[\sum_{i=1}^{n} |w_{ni}(X)|\, f(X_i)\right] \leq C\, E f(X);$$
(2) There is a $D \geq 1$ such that
$$\Pr\left[\sum_{i=1}^{n} |w_{ni}(X)| \leq D\right] = 1;$$
(3)
$$\sum_{i=1}^{n} |w_{ni}(X)|\, 1(\|X_i - X\| > a) \overset{P}{\longrightarrow} 0 \quad \text{for all } a > 0;$$
(4)
$$\sum_{i=1}^{n} |w_{ni}(X)| \overset{P}{\longrightarrow} 1;$$
(5)
$$\max_{1 \leq i \leq n} |w_{ni}(X)| \overset{P}{\longrightarrow} 0.$$
Then $\{w_{ni}(x)\}$ are consistent in the sense that whenever $E[|Y|^r] < \infty$,
$$E\left[|\hat{m}(X) - m(X)|^r\right] \to 0. \qquad (2.20)$$

These are quite weak conditions. Note that for many regression estimators the weights $\{w_{ni}(x)\}$ are probability weights, i.e., they lie between zero and one and sum to one. In this case conditions (1), (2), and (4) are quite natural. A consequence of condition (1) is that $E[|\hat{m}(X)|] < \infty$ whenever $E[|Y|] < \infty$. Condition (3) is trivially satisfied for kernel estimators with kernels of bounded support. Condition (5) is also satisfied for many estimators; for nearest neighbor estimators, for example, it is trivial. Stone (1977) shows that these conditions can be satisfied for a range of estimators. The standard local linear estimator does not satisfy these conditions, however. He shows how to modify local linear estimators to make them probability weights and thereby to satisfy the conditions. Kohler (2000) suggests an alternative way of doing this, by restricting the optimization in (2.12) to a bounded parameter space (that is allowed to expand slowly with sample size). Stone also shows how to apply these results to nonlinear estimators like conditional quantile estimators.

Devroye and Wagner (1980) showed that the kernel estimator, with kernel $K$ non-negative, bounded, bounded away from zero at the origin, and compactly supported, satisfies (2.20) provided only $h \to 0$ and $nh^d \to \infty$.

Theorem 8 (Local Linear) Suppose that:
(i) The marginal density $f$ of the covariates is continuous at the interior point $x$ and $f(x) > 0$;
(ii) The regression function $m(x) = E(Y \mid X = x)$ is twice differentiable and $m''(x)$ is continuous at $x$; the variance function $\sigma^2(x) = \mathrm{var}(Y \mid X = x)$ is continuous and positive at $x$;
(iii) The kernel $K$ is continuous on its compact support;
(iv) $E(|Y|^{2+\delta}) < \infty$ for some $\delta > 0$;
(v) $h \to 0$ and $nh \to \infty$ such that $\lim_{n \to \infty} nh^5 < \infty$.
Then
$$\sqrt{nh}\left[\hat{m}_{LL}(x) - m(x) - h^2 b(x)\right] \Longrightarrow N(0, v(x)),$$
where
$$v(x) = \|K\|_2^2\, \frac{\sigma^2(x)}{f(x)} \quad \text{and} \quad b(x) = \frac{\mu_2(K)}{2}\, m''(x).$$

Corollary 9 (Nadaraya-Watson) Suppose that (i)-(v) above hold and that also $f''(\cdot)$ exists and is continuous at $x$, and that $K$ is a second order kernel. Then
$$\sqrt{nh}\left[\hat{m}_{NW}(x) - m(x) - h^2 b_{NW}(x)\right] \Longrightarrow N(0, v(x)),$$
where
$$b_{NW}(x) = \frac{\mu_2(K)}{2}\, \frac{(mf)''(x) - m(x) f''(x)}{f(x)} = \frac{\mu_2(K)}{2} \left(m''(x) + \frac{2 m'(x) f'(x)}{f(x)}\right).$$

The bias of the NW estimator depends on $f$ and on its derivatives. The local linear estimator by contrast has bias uniformly of order $h^2$, and is design adaptive, i.e., its bias only depends on $m''(x)$. The regularity conditions for the local linear estimator are weaker, since no derivatives of $f$ are needed, just continuity; see Fan and Gijbels (1996). When the evaluation point $x$ is at the boundary or close to the boundary, the Nadaraya-Watson estimator suffers badly from boundary bias. Specifically, in this case the bias is $O(h)$ for any point $x_n$ that lies within $h$ of the boundary. This is because the change of variables argument we use to obtain the bias no longer applies. The local linear estimator does not suffer so badly in the boundary region, and its bias is of the same magnitude as at interior points, namely $h^2$.

For both estimators the mean squared error at any interior point $x$ is $O(h^4) + O(1/nh)$, and the best rate is given by taking $h \propto n^{-1/5}$, in which case the mean squared error is of order $n^{-4/5}$. This is larger than in the parametric case, where the mean squared error declines like $n^{-1}$ and usually contains only a variance term. The asymptotic distribution for both estimators has a bias and is nuisance parameter dependent. To obtain correct confidence intervals we would have to estimate $m''(x)$, $f(x)$ and $\sigma^2(x)$ in the case of the local linear estimator, and $m'(x)$, $m''(x)$, $f(x)$, $f'(x)$, and $\sigma^2(x)$ in the case of the Nadaraya-Watson estimator, which in either case is an even more difficult problem than the one we started out with. Therefore, in practice it is usual only to estimate the variance and to argue that the bias is of smaller order [which would be the case if $h = o(n^{-1/5})$]. This is called undersmoothing.

The variance of the estimator increases as $f(x) \to 0$, which reflects the intuition that the smaller is $f(x)$, the fewer are the number of observations on average that are captured in the smoothing window $K_h(x - X_i)$. Suppose that $f(x) = 0$ or $f(x) = \infty$. Then we may still obtain consistency and asymptotic normality, but at slower (faster) rates of convergence, see Hengartner and Linton (1999). The intuition is that even in these cases the "empirical density" does have positive information, that is, $\sum_{i=1}^{n} K_h(x - X_i) \gg 0$; even if $\sum_{i=1}^{n} K_h(x - X_i)/n \to 0$ as $n \to \infty$, there may exist some $\delta > 0$ such that $\sum_{i=1}^{n} K_h(x - X_i)/n^{\delta}$ stays bounded away from zero (in probability) as $n \to \infty$.

The variance of the estimator depends on the kernel through $\|K\|_2^2$ and the bias depends on the kernel through $\mu_2(K)$. There is some work on optimal kernels that trade off these two contributions. The Epanechnikov (quadratic) kernel is known to be optimal amongst compactly supported kernels.

Suppose that $\sigma^2(\cdot)$ and/or $f(\cdot)$ is (boundedly) discontinuous at the point $x$ but that $m$ is continuous at $x$. Then the estimators are still consistent but the asymptotic variance changes to
$$\frac{\sigma^2(x_-) f(x_-) \int_{-1}^{0} K(u)^2\,du + \sigma^2(x_+) f(x_+) \int_{0}^{1} K(u)^2\,du}{\left[f(x_-) \int_{-1}^{0} K(u)\,du + f(x_+) \int_{0}^{1} K(u)\,du\right]^2},$$
where for a function $g$, $g(x_+) = \lim_{t \downarrow x} g(t)$ and $g(x_-) = \lim_{t \uparrow x} g(t)$. Under symmetry of the kernel this simplifies to $\|K\|_2^2\,[\sigma^2(x_-) f(x_-) + \sigma^2(x_+) f(x_+)] / \{(f(x_-) + f(x_+))^2/2\}$. What happens to the bias? If $m(x)$ is discontinuous at $x$, then the estimator converges to $(m(x_-) + m(x_+))/2$ under symmetry of the kernel.
We can allow the marginal density and error variance (or distribution) to vary with $i$ and $n$, denoted $f_{ni}(\cdot)$ and $\sigma^2_{ni}(\cdot)$, provided these functions are uniformly smooth, etc. In this case the limiting variance is
$$\|K\|_2^2\, \frac{\frac{1}{n}\sum_{i=1}^{n} f_{ni}(x)\, \sigma^2_{ni}(x)}{\left[\frac{1}{n}\sum_{i=1}^{n} f_{ni}(x)\right]^2}.$$
An extreme example is when $X_i = i/n$, i.e., purely deterministic, in which case $f_{ni}(x) \to 1$ as $n \to \infty$. The theory is as above, where $f$ can be interpreted as a limiting or average density. This would be called a fixed design. A number of estimators have different properties depending on whether the design is fixed or random, but the local linear and local constant estimators have essentially identical properties in these two cases.
Suppose that $E[|Y|^2] = \infty$ but $E[|Y|^{1+\delta}] < \infty$ for some $\delta \in (0, 1)$. Then we still have consistency, but with slower rates and possibly non-normal limiting distributions (this is just as in the parametric case). Stute (1986) established the almost sure convergence of the Nadaraya-Watson estimator under only weak moment conditions, namely $E(|Y|^{1+\delta}) < \infty$.

The compact support condition on the kernel can be weakened to just $K$ bounded and $\int |K(u)|\,u^2\,du < \infty$.
Consider the case where $X$ is multivariate, of dimension $d$. Then the variance is of order $n^{-1} h^{-d}$ and the bias is of order $h^2$, so the optimal bandwidth that balances the two contributions to mean squared error is of order $n^{-1/(d+4)}$ and results in mean squared error of order $n^{-4/(d+4)}$. In general with multivariate data one can do better by choosing bandwidths of different magnitudes for each direction, especially if the function has different smoothness properties in each direction, meaning the derivatives $\partial^{\rho_j} m(x)/\partial x_j^{\rho_j}$ exist for $j = 1, \ldots, d$, where $\rho_1, \ldots, \rho_d$ may be different. In this case, choose bandwidths $h_j$ and assume that the kernel is of higher order than $\max_{1 \leq j \leq d} \rho_j$. The bias is then of order $\sum_{j=1}^{d} h_j^{\rho_j}$ and the variance is of order $n^{-1} h_1^{-1} \cdots h_d^{-1}$. One can take $h_j = h_1^{\rho_1/\rho_j}$ and then solve for $h_1$, which yields an optimal rate according to mean squared error of order $n^{-\bar{\rho}/(2\bar{\rho}+d)}$, where the effective smoothness $\bar{\rho}$ is defined through the harmonic mean $\bar{\rho} = d / \sum_{j=1}^{d} 1/\rho_j$.
If the data are from a time series, the same result continues to hold provided the time series is weakly dependent; see Robinson (1983), Masry (1995), Masry and Fan (1997). The dependence structure of the data gets washed out by the kernel estimation, because this essentially shuffles the data according to the level of $X$ and then only takes a small fraction of that data. Of course, the dependence does matter in higher order terms, and may affect the practical performance of estimators and confidence intervals. Interestingly, the results for local polynomials require some bound on the joint density of $Y_t, Y_{t-j}$, for $j = 1, \ldots$; denoting this density by $f_{0,j}$, their condition 2 requires that $f_{0,j}(u, v) \leq M < \infty$ uniformly over $j$ for all $u, v$ in a neighborhood of the point of interest. In fact, the self-normalized quantity continues to be asymptotically normal even with strong dependence and even a unit root in the covariates.

2.4.4 Confidence Intervals


The asymptotic distributions contained in the above results can be used to calculate pointwise confidence intervals for the local constant and local linear estimators. In practice it is usual to ignore the bias term, since this is rather complicated, depending on higher derivatives of the regression function and perhaps on the derivatives of the density of $X$. This approach can be justified when a bandwidth is chosen that makes the bias relatively small. That is, we suppose that $h^2$ is small relative to $1/\sqrt{nh}$, i.e., $h = o(n^{-1/5})$. In this case, the interval
$$C_\alpha = \hat{m}(x) \pm z_{\alpha/2} \sqrt{\widehat{\mathrm{var}}[\hat{m}(x)]},$$

where $\widehat{\mathrm{var}}[\hat{m}(x)]$ is a consistent estimate of the asymptotic variance of $\hat{m}(x)$, is a valid $1 - \alpha$ confidence set in the sense that
$$\Pr[m(x) \in C_\alpha] \to 1 - \alpha.$$

To get estimates of $\mathrm{var}[\hat{m}(x)]$ we exploit the linearity of the local constant and local linear estimates. That is, they both can be written in the form
$$\hat{m}(x) = \sum_{i=1}^{n} w_{ni}(x)\, Y_i, \qquad (2.21)$$
where $\{w_{ni}(x)\}$ only depend on the design. This is also true of a much wider class of estimators than kernels or local linear. One could argue that the conditional distribution is the appropriate framework for inference here, since the covariates are ancillary. In this case the calculations leading to the asymptotic variance are particularly easy for any linear smoother of this type. We have
$$\mathrm{var}\{\hat{m}(x) \mid X_1, \ldots, X_n\} = \sum_{i=1}^{n} w_{ni}^2(x)\, \sigma_i^2,$$
where $\sigma_i^2 = E(\varepsilon_i^2 \mid X_i)$. Note that this is exactly true for any linear smoother of the form (2.21). If the error terms were normally distributed, then $\hat{m}(x)$ itself would also be normally distributed, conditional on the design. In general, although we may not be able to prove it, we can expect that $\hat{m}(x)$ is asymptotically normal after location and scale adjustment. Thus we expect that, under appropriate regularity conditions,
$$\frac{\hat{m}(x) - E\{\hat{m}(x) \mid X_1, \ldots, X_n\}}{\sqrt{\sum_{i=1}^{n} w_{ni}^2(x)\, \hat{\varepsilon}_i^2}} = \frac{\hat{m}(x) - E\{\hat{m}(x) \mid X_1, \ldots, X_n\}}{\sqrt{\sum_{i=1}^{n} w_{ni}^2(x)\, \sigma_i^2}} + o_p(1) \Longrightarrow N(0, 1), \qquad (2.22)$$
where $\hat{\varepsilon}_i = Y_i - \hat{m}(X_i)$ are the nonparametric residuals. The result (2.22) is the basis for confidence intervals for any linear smoother. This case includes splines, series, local polynomials, nearest neighbors, and the many hybrid modifications thereof. It also includes multidimensional estimates and standard estimates of derivatives. Thus the confidence interval becomes
$$C_\alpha = \hat{m}(x) \pm z_{\alpha/2} \sqrt{\hat{v}_1(x)},$$
where $\hat{v}_1(x) = \sum_{i=1}^{n} w_{ni}^2(x)\, \hat{\varepsilon}_i^2$. One could instead separate out the estimator of variance from the weights,
$$\hat{v}_2(x) = \hat{\sigma}^2(x) \sum_{i=1}^{n} w_{ni}^2(x),$$
where $\hat{\sigma}^2(x) = \sum_{i=1}^{n} w_{ni}(x)\, \hat{\varepsilon}_i^2$. Some authors also impose homoskedasticity in the construction of confidence intervals, since there are a number of good estimators of $\sigma^2$ in that case, like
$$\hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} \hat{\varepsilon}_i^2.$$
In special cases there are additional estimators based on the specific structure of the limiting distribution. So for kernel and local linear estimators we might consider
$$\hat{v}_3(x) = \frac{1}{nh}\, \|K\|_2^2\, \frac{\hat{\sigma}^2(x)}{\hat{f}(x)},$$
where $\hat{f}(x)$ is the standard kernel density estimate. Typically one finds that $\hat{v}_1(x)$, $\hat{v}_2(x)$ and $\hat{v}_3(x)$ differ, although the difference is not large.
although the dierence is not large.
Pn
In many cases, we have the additional condition that i=1 wni (x) = 1 and so

X
n X
n
b
m(x) m(x) = wni (x)"i + wni (x)fm(Xi ) m(x)g:
i=1 i=1

We have
b
var[m(x)] b
= E(var[m(x) b
jX1 ; : : : ; Xn ]) + var(E[m(x) jX1 ; : : : ; Xn ]);
P
where the second term is of smaller order when ni=1 wni (x) = 1. It follows that

X
n
2
b
var[m(x)] b
' var[m(x) jX1 ; : : : ; Xn ] = wni (x) 2 (Xi );
i=1

so that conditional and unconditional variance are approximately the same.


P
Some estimators that do not satisfy ni=1 wni (x) = 1 have var(E[m(x)
b jX1 ; : : : ; Xn ]) of the same
b
magnitude as E(var[m(x) jX1 ; : : : ; Xn ]) and so the asymptotics are dierent. One might argue that
b
one should only care about the var[m(x) jX1 ; : : : ; Xn ] because of the ancillarity of the covariate.

One can also subtract off the mean from the residuals $\hat{\varepsilon}_i$, since they are not guaranteed to have mean zero and generally do not. This does not affect the consistency of the standard error estimates, but it can affect the bias.

2.4.5 Bootstrap Confidence Intervals


Suppose that $X_1, \ldots, X_n$ are i.i.d. with density $f$. Consider the kernel estimator
$$\hat{f}(x) = \frac{1}{n} \sum_{i=1}^{n} K_h(x - X_i).$$
Suppose that the conditions of Theorem 3 hold and in particular that the bandwidth sequence satisfies $h \propto n^{-1/5}$, so that
$$n^{2/5}\{\hat{f}(x) - f(x)\} \Longrightarrow N(b(x), v(x))$$
for some $b(x), v(x)$. We now investigate a bootstrap algorithm for approximating the distribution of $n^{2/5}\{\hat{f}(x) - f(x)\}$. Ideally, we would like to take account of both bias and variance; the usual asymptotic approach ignores the bias.

Suppose that we resample with replacement from $\mathcal{X}_n = (X_1, \ldots, X_n)$, and let
$$\hat{f}^*(x) = \frac{1}{n} \sum_{i=1}^{n} K_h(x - X_i^*).$$
We might take $\mathcal{L}\{n^{2/5}(\hat{f}^*(x) - \hat{f}(x)) \mid \mathcal{X}_n\}$ as an approximation to $\mathcal{L}\{n^{2/5}(\hat{f}(x) - f(x))\}$. Unfortunately,
$$E\{\hat{f}^*(x) \mid \mathcal{X}_n\} = \frac{1}{nh} \sum_{i=1}^{n} E\left[K\left(\frac{x - X_i^*}{h}\right) \Big| \mathcal{X}_n\right] = \frac{1}{h} E\left[K\left(\frac{x - X^*}{h}\right) \Big| \mathcal{X}_n\right],$$
where $X^*$ puts mass $1/n$ at each $X_j$, so that
$$\frac{1}{h} E\left[K\left(\frac{x - X^*}{h}\right) \Big| \mathcal{X}_n\right] = \frac{1}{nh} \sum_{j=1}^{n} K\left(\frac{x - X_j}{h}\right) = \hat{f}(x).$$
30 CHAPTER 2 NONPARAMETRIC ESTIMATION

In other words, f^ (x) is a conditionally unbiased estimate of f^(x); so that the mean is completely
wrong. However, the variance is correct, i.e.,

1 X
n
x Xi
varff^ (x)jX n g = var K Xn
n2 h2 i=1 h

1 x Xi
= var K Xn
nh2 h
2 ( )2 3
1 41 X
n
x Xi
2
1 X
n
x Xi
= K K 5
nh2 n i=1
h n i=1
h
" #
1 X
n 2
1 x Xi
= K hfb2 (x)
nh nh i=1 h

1
= f (x) kKk2 f1 + op (1)g;
nh

which is the asymptotic variance of f^(x): The central limit theorem is also valid.
There are two obvious ways of correcting this problem. We can work instead with bandwidths
h = o(n 1=5 ) so that the bias is not present in the limiting distribution of f^(x): The second approach
is to make an explicit bias correction which requires estimation of f 00 : We consider a more appealing
approach which is correct but does not require explicit estimation of the higher derivatives of f:
The proposal is to resample from a smoothed version of f; e.g., f^(x). Generate a sample
fU1 ; : : : ; Un g of U [0; 1]s, and then let

X1 = F^ 1
(U1 ); : : : ; Xn = F^ 1
(Un );

Rx
where F^ (x) = 1
f^(z)dz. Now let

1 X x Xi
f^ (x) = K
nh n
2.4 ASYMPTOTIC PROPERTIES 31

as before. However, now we have


1 x Xi
E[f^ (x)jX n ] = EK
h h
Z
1 x z ^
= K f (z)dz
h h
Z
= K(u)f^(x uh)du

^ h2 ^00 (x);
= f (x) + 2 (K)f
2
which implies that
h2
E[f^ (x) f^(x)jX n ] = (K)f 00 (x);
2 2
provided f^00 (x) !p f^(x): The only problem here is that for the consistency of f^00 (x) we would require
that nh5 ! 1; which rules out the optimal bandwidth h / n 1=5 : Therefore, we resample from
Rx
F^g (x) = 1 f^g (z)dz; where f^g (x) is a kernel density estimate constructed from the bandwidth g:
This gives Z
1 x X h2
E[f^ (x)jX n ] = K f^g (x)dx = f^g (x) + (K)f^g00 (x);
n h 2 2
which includes g = 0 and g = h as special cases. Now take Lff^ (x) f^g (x)jX n g as an estimate
for Lff^h (x) f (x)g: Provided ng 5 ! 1,

f^g00 (x) !p f 00 (x)

and so
h2
E[f^ (x)jX n ] f^g (x) = 2 (K)f
00
(x);
2
which is the same as the mean of f^h (x) f (x):
We now consider methods for Nonparametric Regression. Suppose that

Yi = m(Xi ) + "i ;
2
where (Yi ; Xi ) are i.i.d. m(x) = E(Y jX = x) and var(Y jX = x) = (x). The most common
method for bootstrapping in regression models is residual resampling, Mammen (1993).
Residual resampling (Wild Bootstrap)
32 CHAPTER 2 NONPARAMETRIC ESTIMATION

1. Calculate residuals ^"i = Yi b h (Xi ), i = 1; : : : ; n


m

2. Draw v1 ; : : : ; vn from a distribution with mean zero and variance one

3. Let "i = vib


"i , i = 1; : : : ; n

b g (Xi ) + "i , i = 1; : : : ; n with bandwidth g.


4. Yi = m

5. Calculate bootstrap nonparametric estimate


Pn
i=1 Kh (x Xi )Yi
b h (x) = P
m n
i=1 Kh (x Xi )

b h ( ) m( ) use the computable conditional


6. To approximate the distribution of any functional of m
b h( )
distribution of m b g ( ).
m
1=5
Theorem 10 Suppose that h / n , g=h ! 1, g ! 0, K is bounded support and symmetric about
zero, m 2 C2 . Then,

p p
b h (x)
L[ nh(m b g (x))jdata]
m b h (x)
L[ nh(m b
m(x))] ! 0:

We can also employ "iid resamplng", that is draw f(Xi ; Yi ); i = 1; : : : ; ng from the joint empirical
distribution f(Xi ; Yi ); i = 1; : : : ; ng: Then compute
Pn
i=1 Kh (x Xi )Yi
mb h (x) = P n :
i=1 Kh (x Xi )
This method has the same issue as for density estimation, that is, we can obtain the variance but
not the bias by this method. To obtain correct inference taking account of hte bias we should need
to resample from a smoothed joint distribution.

2.5 Bandwidth Selection


We describe several methods of bandwidth selection for nonparametric regression estimation. We
b ) of the function m( ): In the sequel ( ) is
rst dene some performance criteria for an estimate m(
some weighting function dened on the support of X.
2.5 BANDWIDTH SELECTION 33

1. Pointwise MSE
dM P (m(x);
^ m(x)) = E fm(x)
^ m(x)g2

2. Integrated MSE Z
dM I (m;
^ m) = E fm(x)
^ m(x)g2 (x)dx

3. Average S.E.
1X
n
dA (m;
^ m) = fm(X
^ j) m(Xj )g2 (Xj )
n j=1

4. Integrated S.E. Z
dI (m;
^ m) = fm(x)
^ m(x)g2 (x)f (x)dx

5. Conditional MISE
dC (m; ^ m)jX1 ; : : : ; Xn g:
^ m) = EfdI (m;

Let hj be the bandwidth sequences that minimize the corresponding criterion dj :


The mean squared error criteria actually have explicit formulae for the optimal bandwidth. Recall
that for univariate NW regression, the asymptotic mean squared error at the point x is
Z 2 Z 2
1 2 (x) 2 h4 00 2m0 (x)f 0 (x)
dM P (m(x);
^ m(x)) = K (u)du + m (x) + u2 K(u)du
nh f (x) 4 f (x)

a(x)
+ b(x)h4 :
nh
An optimal bandwidth can be dened as one that minimizes this criterion; this bandwidth will satisfy
the following rst order condition
a(x)
= 4b(x)h3 ;
nh2
which solves to give
1=5
a(x) 1=5
hM P (x) = n :
4b(x)
So the optimal bandwidth depends on the unknown quantities
2
(x); f (x); f 0 (x); m0 (x); and m00 (x);
34 CHAPTER 2 NONPARAMETRIC ESTIMATION

and changes with each point x: Frequently, people work with an Integrated mean squared error
criterion dM I (m(x);
^ m(x)); in which case the optimal bandwidth is
R 1=5
a(x) (x)dx
hM I = R n 1=5 ;
4 b(x) (x)dx
2
and the optimal bandwidth depends on only averages of (x); f (x); f 0 (x); m0 (x); and m00 (x):
We now discuss specic methods of selecting bandwidths from data.

2.5.1 Plug-in
This involves nonparametrically estimating the unknown quantities in a(x) and b(x) by b
a(x) and
bb(x); say, and then let
" #1=5 "R #1=5
b b
a (x) b
a (x) (x)dx
hM P (x) = n 1=5 ; b
hM I = R n 1=5 :
b
4b(x) b
4 b(x) (x)dx
a(x) !p a(x) and bb(x) !p b(x); then
Provided b

jb
hM P (x) hM P (x)j
!p 0;
hM P (x)
while if supx: (x)>0 jb
a(x) a(x)j !p 0 and supx: (x)>0
bb(x) b(x) !p 0; then

jb
hM I (x) hM I (x)j
sup !p 0:
x: (x)>0 hM I (x)
The disadvantage of this method is that one must estimate the derivatives of m and f; which are
typically poorly behaved estimates [the variance of a kernel estimate of m00 (x) is of order 1=nh5 ]:
Silverman (1986) suggests a compromise method he called "rule of thumb". This involves speci-
fying an auxiliary parametric model for the data distribution and using this to infer a simple formula
for the optimal bandwidths. In density estimation, assuming a Gaussian distribution, his approach
^ = 1:06^ n 1=5 (based on a Gaussian kernel), where ^ is the estimated
yields the simple formula h
standard deviation or a robust estimator thereof. Because of the simplicity and intuitive nature of
the formula, this method is widely used in applications even for unrelated estimation problems like
regression estimation.
2.5 BANDWIDTH SELECTION 35

2.5.2 Cross Validation

The principle of cross validation and the term was rst introduced by M. Stone (1974). Rudemo
(1984) introduced least squares cross validation for density estimation. C. Stone (1984) rst proved
that such a method could produce optimal bandwidths.
This approach is based on an approximation to ASE or ISE. Thus

1X
n
dA (m;
^ m) = fm(X
^ j) m(Xj )g2 (Xj )
n j=1

1X 2X 1X
n n n
= ^ j )2 (Xj )
m(X m(X
^ j )m(Xj ) (Xj ) + m(Xj )2 (Xj ):
n j=1 n j=1 n j=1

The last term does not depend on the bandwidth, so we drop it from consideration. The rst term
just depends on the data and so can be computed easily. The problem arises with the second term,
Pn
and in particular ^ j )m(Xj ) (Xj )=n: We clearly cant just substitute m(Xj ) by m(X
j=1 m(X ^ j ):
However, we might replace it by an unbiased estimator [in the conditional distribution], which is Yj :
This is the equivalent to taking

1X
n
r(h) = fYj ^ j )g2 (Xj )
m(X
n j=1

as the bandwidth criterion. Unfortunately, this method will lead us to select h = 0 always, because
then m(X
^ j ) = Yj for all j: What has gone wrong? The problem is that m(X
^ j ) depends on all the
Y 0 s in the sample, i.e.,
X
n X
n
m(X
^ j) = wjl Yl = wjj Yj + wjl Yl ;
l=1 l=1
l6=j

so that
1X 1X 1 XX
n n n n
m(X
^ j )Yj (Xj ) = wjj Yj2 (Xj ) + wjl Yl Yj (Xj ):
n j=1 n j=1 n j=1 l=1
l6=j
36 CHAPTER 2 NONPARAMETRIC ESTIMATION

We have
" #
1X 1X
n n
E wjj Yj2 (Xj )jX = wjj fm2 (Xj ) + 2
(Xj )g (Xj )
n j=1 n j=1

K(0) 1 X fm2 (Xj ) + 2 (Xj )g (Xj )


n
' ;
nh n j=1 f (Xj )

which is the same magnitude as the variance eect we are trying to pick up. Therefore, r(h) is a
downward biased estimator of dA (m;
^ m): There are two solutions to this problem.
Pn P
^ j )2 (Xj )=n and nj=1 m(X
First, we can estimate j=1 m(X ^ j )m(Xj ) (Xj )=n by

1X 1X 2
n n
^ j (Xj )Yj (Xj ) and
m m
^ (Xj ) (Xj );
n j=1 n j=1 j

where m
^ j (Xj ) is the leave-out-jestimator:
1
P x Xi
(n 1)h i6=j K h
Yi 1 X x Xi
m^ j (x) = ; f^j (x) = K :
f^j (x) (n 1)h i6=j
h

In conclusion, let

1X
n
CV (h) = fYj ^ j (Xj )g2 (Xj ):
m
n j=1
^ cv 2 Hn to minimize CV (h) for some set Hn ; and then let m
Choose h ^ h^ cv ( ): An equivalent method
which has some advantages computationally, is to let

1X 2X
n n
2
CV (h) = fYj m(X
^ j )g (Xj ) + wjj Yj2 (Xj ):
n j=1 n j=1

This latter approach is similar in spirit to the model selection ideas of time series.
We next give a theorem due to Hrdle and Marron (1985) [see also Stone (1984) for density
estimation], which established the optimality of this method.
2.5 BANDWIDTH SELECTION 37

A1. Hn = [h(n); h(n)]


1 1
h(n) C n , h(n) Cn , C; > 0

A2. K is Hlder continuous, i.e.,

jK(x1 ) K(x2 )j cjx1 x2 j , >0


R
and juj K(u)du < 1.

A3. The regression function m and the marginal density f are Hlder continuous.

A4. The conditional moments are bounded by constants Ci

E(jY ji jX = x) Ci for all x; for i = 1; 2; : : :

A5. The marginal density f (x) of x is compactly supported and is bounded from below on the
support of w.

^ cv is asymp-
Theorem 11 Suppose that the assumptions A1-A5 are satised. Then the bandwidth h
totically optimal with respect to distances dA , dI and dC , in the sense that with probability one

d(m^ h^ cv ; m) ^ cv
h
!1 ; ! 1.
inf h2Hn d(m ^ h ; m) ^ opt
h

The conditions of this theorem are very weak in some respects. Specically, the amount of
smoothness assumed for m and f is almost nil. This means that the bandwidth selection method
is automatically adapting to the amount of smoothness. In the full proof one must take account of
an general magnitude for d(m
^ h ; m) and of a parameter set that is much larger than the one we
considered, which is why the theorem is stated in this fashion. Finally, in the special case we worked
with one can also establish the stronger result

n2=5 fm
b bhcv (x) b bhopt (x)g !p 0:
m
38 CHAPTER 2 NONPARAMETRIC ESTIMATION

We remark that cross validation has also been applied to density estimation, and to other smooth-
ing problems like robust or quantile regression using the leave one out principle. Specically, Leung
(2005) considers the bandwidth seleciton criterion

1X
n
CV R(h) = (Yj m
^ j (Xj ));
n j=1

where m
^ j is now a leave-one-out robust regression smoother and is a robust objective function. He
provides conditions under which this method yields the optimal bandwidth asymptotically.
There was a lot of work, mostly in density estimation, rening cross-validation to produce band-
width selection methods with better perfomance, see Wand and Jones (1995).
One issue is that with time series data the usual leave-one-out formula does not work.

2.5.3 Uniform Consistency


In this section we discuss the uniform consistency for kernel regression estimators, that is we look
for conditions under which
b
km mkp = Op ( n ) or Oa:s: ( n )

for some sequence n # 0; where


8
>
> R 1=p
< jg(x)jp d (x) if p < 1
kgkp =
>
>
: supx2C jg(x)j if p = 1

with C a compact set and ( ) some measure. We shall concentrate on the L1 distance, which is
usually the most di cult to work with. These results are especially important for the analysis of
semiparametric estimators which involve averages of nonparametric estimates evaluated at a large
number of points. They are also relevant for many other estimation and testing problems.

Theorem 12 (Local Linear, Masry (1996)). Suppose that:


(i) The marginal density f of the covariates is continuous on the compact set X and inf x2X f (x) > 0;
(ii) The regression function m00 (:) is Lipschitz continuous on X ;
2.5 BANDWIDTH SELECTION 39

(iii) The kernel K is Lipschitz continuous on its compact support;


(iv) For some > 0; E(jY j2+ ) < 1;
=(2+ )
(v) h ! 0 and nh ! 1 such that n h= log nflog n(log log n)1+ g2=(2+ )
! 1 for some > 0.
Then !
1=2
log n
b
km mk1 = O + O(h2 ) a.s:
nh

This is a simplied version of Masry (1996). By taking h = O((log n=n)1=5 ) we obtain the best
possible rate of (log n=n)2=5 : We need n =(2+ )
n 1=5
! 1; which requires that > 1=2:
b
Einhmahl and Mason (2000) establish more precise results for the stochastic part of m m;
obtaining the precise rate of almost sure convergence.
Unlike in the density estimation case we must restrict our attention to compact sets. The reason
is due to the presence of the marginal covariate density in the denominator. For unbounded support,
f (x) ! 0 as x ! 1 and so supx 1=f (x) = 1: It is possible to extend these results to allow Cn to
expand with sample size at some rate although this slows down the rate of convergence depending
on the tails of the marginal distribution of the covariate, Andrews (1995), Hansen (2006), Lu et al.
b
(2010). Further results for weighted norms. Law of the iterated logarithm. Results for jjm mjjp :
For kernel regression we can further show that there exists a bounded continuous function ( )
such that
log n
b
sup jm(x) m(x) h2 (x) Ln (x)j = O + o(h2 ) a:s:;
x2X nhd
where
1 X
n
Ln (x) = Kh (x; Xi )"i :
nf (x) i=1

This shows the remainder terms in the expansion are small. This result is extended in Kong, Linton,
and Xia (2010) to nonlinear estimators in a time series setting and to estimation of the derivatives
of the regression function.
40 CHAPTER 2 NONPARAMETRIC ESTIMATION

Functional Central Limit Theorem

b
Above we established an upper bound on the order of magnitude of jjm mjj1 : We now seek to
rene this result into a limiting distribution. We shall work with kernel estimates throughout and
will make assumptions to guarantee that the bias term is small.
The rst type of result is a local functional central limit theorem for the kernel estimator. Fix
an interior point x and let
p
n (t) = b + th)
nh [m(x m(x + th)] ; t 2 [ T; T ]

for some xed T: Then we already have pointwise convergence in distribution of n (t): Under addi-
tional conditions it can be shown that

n( ) =) Z( ); (2.23)

where Z( ) is some Gaussian process. It follows that supt2[ T;T ] j n (t)j has the distribution of
supt2[ T;T ] jZ(t)j: A similar result can be established for the local in bandwidth process
p
n (t) = n1 b t (x)
t [m m(x)] ; t 2 [ T; T ];

where h = tn for some given : We obtain likewise a functional CLT like (2.23). These results
have a number of applications from establishing the e ciency of plug-in estimators to testing theory.
Let
b
m(x) m(x)
Tn = sup jTn (x)j ; Tn (x) = p ;
x2C b
var[m(x)]
b
where C is some compact set contained in the support of X; while var[m(x)] is the asymptotic
variance or conditional variance. We know that (with undersmoothing) Tn (x) is asymptotically
p
standard normal for each x; but that Tn = Op ( log n): We will establish that there exists increasing
sequences an ; bn such that
2e t
Pr [an (Tn bn ) t] ! e ; (2.24)

i.e., Tn is asymptotically Gumbel. This result was rst proved in Bickel and Rosenblatt (1973) for the
one dimensional density case and Rosenblatt (1976) for the d-dimensional density case, and Johnston
2.5 BANDWIDTH SELECTION 41

(1982) for univariate local constant nonparametric regression. Before we explore this result further,
lets see why it might be important.
One application of this result is to the limiting distribution of estimates of nonparametric bounds
for covariate eects in the presence of selection, see Manski (1994). The main use of (2.24) is in
setting uniform condence intervals. The condence intervals we have provided
p
b
m(x) z =2 b
var[m(x)]

have been valid for a single point. However, we are usually interested in the function m at a number
of dierent points in which case simply plotting out the above interval for each x will not give the
right level. There are two main approaches to providing correct condence intervals. One is to
use Bonferroni type inequalities to correct the level [see Savin (1984) and Hrdle (1991) for further
b ) as a random variable and
discussion of this] and the second approach is to treat the function m(
b with the property
use stochastic process limit theory. In other words, we nd a set of functions C(m)
that
b =1
Pr [m 2 C(m)]

for large n: This is provided by the limit theory (2.24) by letting

b = fm( ) : an (Tn
C(m) bn ) c g;

where c solves exp( 2 exp( c )) = 1 ; which leads to bands of the form


c p
b
m(x) (bn + c m(x)]
) var[ b all x;
an
c m(x)]
where var[ b b
is some estimate of var[m(x)]: This intervals has the correct coverage. In practice,
these intervals do not work terribly well for the reasons discussed in Hall (1993). A better approach
is based on the bootstrap, which we will cover later on.
We now present the main result.

Theorem 13 Suppose that


42 CHAPTER 2 NONPARAMETRIC ESTIMATION

2
1. The functions m; f; are all twice continuously dierentiable on C:

2. The kernel is symmetric about zero and dierentiable with bounded support [ A; A] for some
A; where K( A) = 0:

3. For all k; E(jY jk jX = x) Ck < 1:

1
4. h = O(n ) with 5
< < 31 :

Then,
2e t
Pr [an (Tn bn ) t] ! e

with
p log C2 p
bn = 2 log h + p ; an = 2 log h;
2 log h
where C = kK 0 k2 kKk2 :

2.5.4 Optimality
Stone (1982) established what is the optimal rate of convergence for nonparametric regression under
certain conditions. In particular for a class of distributions for (Y; X) he found the sequence n such
that for positive constants c

b
lim inf sup Pr [jjm mjjq c n ] = 1:
n!1 m2M
b m2M

Here, q 2 (0; 1] and the Lq norm is taken over a compact set D Rd . The set M includes all
estimators. The set M determines the di culty. When M includes just functions from a particular
1=2
parametric class, one can usually obtain rate n : When M includes d-dimensional functions that
r=(2r+d)
are r times continuously dierentiable on D; the optimal rate (for q < 1) is n . This bound
b such that
is achievable if there exists an estimator m

b
lim sup Pr [jjm mjjq c0 n ] = 0
n!1 m2M
2.5 BANDWIDTH SELECTION 43

for some constant c0 : Stone exhibited a rate optimal estimator.


Fan (1993) has investigated optimality under a mean squared error criterion. Let

b
Rn (M; M) = inf sup E (m(x) m(x))2
b
m2M m2M

be the pointwise MSE optimal bound for an interior point x. He showed that the (best) local
linear estimator comes within 0:8962 of the bound asymptotically when M is chosen to include all
estimators. He also showed that when M is restricted to the class of estimators linear in Y; the (best
modied) local linear estimator achieves the bound. In this sense the local linear estimator is Best
Linear Asymptotic Minimax (BLAM). Fan modied the local linear by the inclusion of a trimming
2
factor of order n in the denominator to ensure that the moments existed.
Recent work has concentrated on "adaptive estimation". This can be explained as follows. Let
M be the class of regression functions and let M1 M be a subclass such that the optimal rate of
estimation on M1 is better than on M: For example, M1 could be the class of additive functions with
r=(2r+1)
smoothness r; in which case the optimal L2 rate for estimation is of order n1 =n ; whereas
r=(2r+d)
the optimal rate on M is of order n =n : An adaptive procedure is one such that

lim sup sup E n


2
b adapt
(m) km mk2 < 1;
n!1 m2M

where n (m) = n1 if m 2 M1 and n (m) = n if m 2 M nM1 . See Homann and Lepski (2002).

2.5.5 Some Nonasymptotic results


We consider some non-asymptotic results. Suppose that we consider the criterion

X
n
Q= b i)
E (m(X m(Xi ))2 jX1 ; : : : ; Xn
i=1

otherwise known as the trace mean squared error criterion associated with the n b =
1 vector m
(m(X b n ))> : Suppose that m
b 1 ); : : : ; m(X b is linear, i.e.,

b = Wy;
m (2.25)
44 CHAPTER 2 NONPARAMETRIC ESTIMATION

where y = (Y1 ; : : : ; Yn )> and W is an n n matrix just depending on the covariates. Write the
regression model as
y = m + ";

where m = (m(X1 ); : : : ; m(Xn ))> and " = ("1 ; : : : ; "n )> ; and suppose that E ["jX] = 0 and E ""> jX =
2
" In : Then

2 >
Q= " tr(WW ) + tr(bb> );

where the bias is b = (W In )m and the rst term is the variance. Dene the symmetric matrix
1=2
Wc = In (W In )> (W In ) :

Cohen (1966) showed that the estimator

b c = Wc y
m

has smaller Q - in particular, its bias is the same but its variance is smaller unless W is symmetric,
specically
tr(Wc Wc> ) tr(WW> ):

This follows because


1=2
Wc> Wc W> W = In + (W In )> (W In ) 2 (W In )> (W In ) W> W
1=2
= 2(In f
W) 2 (W In )> (W In ) ;

f = (W+W> )=2: Then tr((In W)


where W f 2 ) tr((W In )> (W In )) is equivalent to showing that
f 2 WW> ) 0: This follows because we can write tr(W
tr(W f 2 WW> ) = tr((W W> )> (W
W> ))=4 0.
This says that any estimator of m of the form (2.25) for which W is not symmetric is inadmissible
according to the trace mean squared error criterion and gives a concrete way of improving estimators.
Kernel and local polynomial estimators have asymmetric W matrices, and are inadmissible, although
asymptotically this inadmissibility disappears as we know. Only spline estimators amongst the
commonly used estimators have symmetric W.
Chapter 3

Additive Regression

3.1 Introduction
We rst introduce the additive structure. Suppose that

X
d
m(x) = m0 + mj (xj ); (3.1)
j=1

where mj (:) are one-dimensional unknown functions and m0 is a constant. The variables Xj aect the
mean of the response only through the scalar valued function mj : We shall maintain that the model
structure (3.1) is true, but if it is not true, then we will later provide an interpretation of our model
ts as providing the closes additive t to the regression function in the sense of minimizing (2.2).
This simplifying structure is present in many models of economic behavior starting with Leontie
(1947); see Deaton and Muellbauer (1980) for examples. Additivity is also widely used in parametric
and semiparametric models of economic data. The model restriction (3.1) means for example that

@ 2m
(x) = 0 (3.2)
@xj @xj 0

for all x and all j 6= j 0 ; so that it rules out interaction eects between the covariates. Likewise
the average partial eects E[@m(X)=dx] = (E[m01 (X1 )]; : : : ; E[m0d (Xd )]) do not depend on the joint
density of the covariates only the marginal distributions.

45
46 CHAPTER 3 ADDITIVE REGRESSION

Our purpose here is to investigate a very general class of statistical models that combine addi-
tive separability with an unrestricted functional form for the covariate eects; this general class of
structures are generically called additive nonparametric regression models. Stone (1985) showed in
a precise mathematical way how these models circumvent the curse of dimensionality. Specically,
he showed that the optimal rate of convergence in L2 distance for estimating m or mj is of order
r=(2r+1)
n in probability, where r is a smoothness index of the functions mj . He constructed an
estimator based on splines that achieved the optimal rate of convergence. In the statistical literature
the additive regression model has been advanced in the eighties largely by the work of Buja, Hastie
and Tibshirani (1989) and Hastie and Tibshirani (1991).
The additive regression model can be written as

Yi = m0 + m1 (X1i ) + : : : + md (Xdi ) + "i ; (3.3)

where the error variables "i satisfy E("i jXi ) = 0 a.s. We shall maintain throughout that var("i jXi ) =
2
(Xi ) < 1 a.s. The constant m0 and the functions m1 ; : : : ; md are unknown and have to be
estimated from the data. We next discuss how these functions can be identied as they are not
generally explicitly dened as conditional expectations except in the special case when the covariates
are mutually independent.

3.2 Identication and Interpretation


Write for each j; x = (xj ; x j ); X = (Xj ; X j ); and Xi = (Xji ; X ji ): We shall suppose for simplicity
that X are absolutely continuous with respect to Lebesgue measure on some set X (usually a compact
subset of Rd ) and have a density function f (x); which has marginals fj (xj ) and f j (x j ) for all j: Note
P
that replacing e.g. mj (xj ) by mj (xj )+m0j ; j = 1; : : : d; such that dj=1 m0j = 0 would not change the
P
sum m0 + dj=1 mj (xj ). So the model remains unchanged although the functions mj changed, i.e.,
the functions are not all separately identied. Let Q be a probability measure with marginals Qj : For
example, Q could be the distribution of X: For identiability we make the additional "assumption"
that Z
mj (xj )dQj (xj ) = 0 (3.4)
3.2 IDENTIFICATION AND INTERPRETATION 47

for j = 1; : : : ; d. Dene Z
gj (xj ) = m(x)dQ j (x j ); (3.5)

where Q j (x j ) is the d 1 dimensional marginal probability measure. It follows that


d Z
X
gj (xj ) = m0 + mj (xj ) + mk (xk )dQ j (x j ) = m0 + mj (xj ):
k6=j

so that gj (xj ) is equal to mj (xj ) upto an additive constant. Since we have assumed that mj are mean
zero with respect to Qj , Z
mj (xj ) = gj (xj ) gj (xj )dQj (xj ): (3.6)
Z
m0 = m(x)dQ(x): (3.7)

This gives a constructive approach to identication. If Q were the joint distribution of X; then
m0 = E(Y ) and Emj (Xj ) = 0; j = 1; : : : ; d. This uniquely identies the functions mj and the
constant. Note that if m is not additive, the quantities gj (xj ) and mj (xj ) are still well-dened and
meaningful as partial average eects.
An alternative identication strategy can be to normalize the components at particular points.
For example, suppose that mj (xj0 ) = 0; say, for some xj0 for each j. This might be convenient in
some cases as particular values of the covariate might have special meaning, like zero input should
produce zero output. In this case, mj (xj ) = m(0; : : : ; 0; xj ; 0; : : : ; 0) and m0 = m(0; : : : ; 0): We will
discuss this later.
We next give an alternative interpretation of the additive model, which leads also to identication.
Recall the denition of the regression function as the measurable function m that minimizes the least
squares criterion

E fY m(X)g2 ; (3.8)

where E(Y 2 ) < 1: This can be characterized as a projection problem in Hilbert space. Let H be
the Hilbert space of square integrable functions of X; then the regression function m is the function
in H that minimizes (3.8) over H. Now consider the population problem of nding additive functions
48 CHAPTER 3 ADDITIVE REGRESSION

to minimize (3.8). For economy of notation we shall assume that E(Y ) = 0 and drop the intercept
d
m0 : Dene the subspace of additive functions Hadd = j=1 Hj H, where Hj is the space of square
integrable functions of Xj with expectation zero. Then the function that minimizes (3.8) over Hadd ;
P
denoted m (x) = dj=1 mj (xj ); satises the set of rst order conditions:

m1 (x1 ) = E(Y jX1 = x1 ) E[m2 (X2 )jX1 = x1 ] E [md (Xd )jX1 = x1 ] ;


.. .
. = .. (3.9)
md (xd ) = E(Y jXd = xd ) E [m1 (X1 )jXd = xd ] E [md 1 (Xd 1 )jXd = xd ] :

This can be represented more compactly as Pj fY m (X)g = 0; j = 1; : : : ; d; where Pj ( ) = E ( jXj )


is the projection operator onto the subspace Hj : The projection theorem says that the element in
Hadd is unique, but the individual components need not be so in general. However, with a further
assumption we may guarantee essential uniqueness of the components. Specically, suppose that

X
d
rj (Xj ) = 0 a.s. =) rj 0 a.s., j = 1; : : : ; d: (3.10)
j=1

This rules out what is called concurvityin Hastie and Tibshirani (1991).1 It is a generalization of
the usual full rank condition on the cross product matrix in linear regression. It rules out not just
linear functional relationships but also any nonlinear additive functional relationships, so that one
cannot include powers of Xj or interaction products like Xj Xk : The stronger assumption that the
vector X is absolutely continuous implies (3.10).
We can represent the rst order conditions symbolically as in Hastie and Tibshirani (1990):

1
An example is given by the joint distribution such that Pr(X1 < 0; X2 < 0) = 1=2 = Pr(X1 0; X2 0); in which
case the functions g1 (x1 ) = 1(x1 0) 1=2; and g2 (x2 ) = 1(x2 < 0) 1=2; satisfy g1 (X1 ) + g2 (X2 ) = 0 a.s.
3.2 IDENTIFICATION AND INTERPRETATION 49

0 10 1 0 1
B I P1 P1 C B m1 C B P1 Y C
B CB C B C
B CB C B C
BP I C
P2 C B m2 C
B B C
B 2 C B P2 Y C
B CB C B C: (3.11)
B . . . CB . C = B . C
B .. . . .. C B C B C
B C B .. C B .. C
B CB C B C
@ A@ A @ A
Pd Pd I md Pd Y

This can also be thought of as a system of linear integral equations (Fredholm equations of type
2) in the functions (m1 ; : : : ; md ); see Carrasco, Florens, and Renault (2006). One can solve the
linear system (3.11) to write (m1 ; : : : ; md ) uniquely in terms of the projection operators P1 ; : : : ; Pd :
This gives a relationship between the functions mj (:) and the conditional expectations E ( jXk ) ;
k = 1; : : : ; d, which is quite complicated and implicit. However, we are able to use the projection
structure to great eect.
At this point, we draw a connection with a more general class of problems where the quantity
of interest is a function or vector of functions m(:) that is only implicitly dened but is known to
satisfy a system of linear Fredholm integral equation of the second kind in the space L2 (f ) for some
density f: Specically, Z
m(x) = m (x) + H(x; y)m(y)f (y)dy;

where the function m (x) and the operator H(x; y) are dened explicitly in terms of the distribution
of some observable quantities. The above equation fall into this category as do many others, and we
shall give some examples below. Carrasco, Florens, and Renault (2006) present an extensive review
of the methods and applications involved here. We write this equation in short hand

m = m + Hm; (3.12)

where H is operator and m is intercept. The key questions here are: (1) Does there exist a unique
solution to this equation? (2) Is the solution continuous in some sense? (3) How to compute
b y) are available on m (x)
b (x) and H(x;
estimators of m(x) in practice given noisy observations m
and H(x; y)? (4) Asymptotic distributions and inference? (5) Optimality. There may be many such
equations?
50 CHAPTER 3 ADDITIVE REGRESSION

The key properties are to do with the nature of the operator or family of operators H. We will
make use of the following condition.
Assumption A1. The operator H(x; y) is Hilbert-Schmidt
Z Z
H(x; y)2 f (x)f (y)dxdy < 1: (3.13)

Under Assumption A1, H is a continuous compact operator. If it is also self-adjoint it has a


countable number of real eigenvalues2 :

1 > j 1j j 2j ::::;
X
1
2
j < 1:
j=1

This condition is satised in many cases under quite weak conditions on the data generating process.
Certainly, in the additive regression case it is satised.
Another key condition is that for a constant < 1; j < for all j 1. To verify this condition
requires some special arguments that depend on the problem at hand. If this condition holds, we get
that I H has eigenvalues bounded from below by 1 > 0. Therefore, I H is strictly positive
denite and so invertible. So we can directly solve the integral equation and write

m = (I H) 1 m :

That is, the function m is uniquely identied from the equation (3.12). Furthermore, the inverse
operation is continuous and the equation is therefore called, well-posed, in the sense of Chen (2007,
p5560).
We comment on some further properties that are of importance when constructing estimators. If
also
j 1 j < 1; (3.14)
P1
then, m = j=0 Hj m : Furthermore, the sequence of successive approximations

mn = m + Hmn 1 ; n = 1; 2; : : :
2
An operator H is self adjoint if hHm; gi = hm; Hgi for all m; g in L2 (f ):
The eigenvalues are real numbers for which there exists eigenfunctions ej (:) such that Hej = j ej :
3.2 IDENTIFICATION AND INTERPRETATION 51

converges rapidly to m from any starting point. In cases dened by projections this condition is
generally satised.
There are some situations where j 1 j j kj 1; and so the conditions that guarantee
convergence of the successive approximations method (3.14) is not satised. In that case, one has to
transform the integral equation in order to obtain an equation which is more regular before applying
successive approximations.
The projection theorem also says what happens when the true regression function m is not
additive. This is important for interpretation and also because it relates to e ciency in a certain
sense. The projection will take the the function m into the closest member of Hadd to m; where
distance is measured by the expected squared error.
We next show that the mapping (3.6) can also be given a projection interpretation. Let

Z X
d Z Z
I (m)(x) = m(x)dQ j (x j )dQj (xj ) + m(x)dQ j (x j ) m(x)dQ j (x j )dQj (xj ) ,
j=1

be the integration map, that takes a function m(x) in the space of additive functions. It is easy
to see that I is a linear idempotent map from H into itself, and moreover I (m) = m if m 2 Hadd :
However, I is not self-adjoint, i.e., it is not an orthogonal projection with respect to the norm induced
by expectation with respect to the joint distribution of the covariates. However, if we change the
denition of norm on the space H we can nd an interpretation of I as an orthogonal projection.
d
Specically, if distance is calculated by the product measure j=1 Qj ; then I is self-adjoint, and
hence an orthogonal projection, Nielsen and Linton (1998). In other words we can consider I (m)

as the solution to the minimization problem


Z ( X
d
)2
m(x) m0 mj (xj ) d (x); (3.15)
j=1

d
where = j=1 Qj : This gives an interpretation to the mapping and says what happens when the
true function itself is not additive.
What happens if condition (3.10) is violated? In general the projection onto the space of addi-
tive functions is unique, but the individual components are not uniquely identied even after the
52 CHAPTER 3 ADDITIVE REGRESSION

normalization.

3.3 Quantile Regression


A currently popular alternative model to regression is quantile regression. Let FY jX denote the
strictly monotonic conditional distribution function of Y given X: The -quantile regression function
q (Y jX = x) = m (x) is the function such that

m (x) = infft : FY jX (tjx) g:

It can also be dened as the function that minimizes E (Y g(X)) over all measurable functions
g, where (x) = (2 1)x + jxj=2: There is a big literature on parametric quantile regression,
>
see Koenker (2008) for full disclosure, where say m (x) = x: An important distinction can be
made between moderate quantiles, for which 2 (0; 1); extreme quantiles for which 2 f0; 1g; see
Chernozhukov (2000), and intermediate quantiles for which ! f0; 1g. For direct comparison with
the mean, the median is a popular choice, while for some applications, one cares about quantiles in
some tail, like = 0:05:
As in ordinary nonparametric regression one suers a curse of dimensionality when the dimension
of x is large and one wants to allow the function m (x) to not be restricted other than via smoothness
conditions. Consider the additive quantile regression model

X
d
m (x) = m 0 + m j (xj ):
j=1

This model can be specied in terms of (3.3), where q ("i jXi ) = 0 a.s. One could be concerned with
a single quantile level ; but usually one wants to display multiple quantiles in which case one needs
a stronger assumption on the error term. For all quantiles to have the additive structure, one needs
to assume that "i is independent of Xi ; which is a strong assumption. Note that, when it exists,
R1
the regression function E(Y jX = x) = 0 m (x)d ; so that if all quantiles are additive then so is
the mean regression function. There is an intermediate quantity, the conditional expected shortfall,
R
which satises E (Y jX = x) = E(Y 1(Y m (X))jX = x) = 0 m (x)d :
3.4 INTERACTION MODELS 53

Study of additive quantile regression is discussed in Hastie and Tibshirani (1990). See Gooijer
and Zerom (2003) for a more recent study.
The identication issues are more less the same as in for ordinary regression models. What is
dierent is that we do not have the projection theory associated with the Hilbert space formulation
we had for regression.

3.4 Interaction models


Although additivity is a powerful assumption, there may be some contexts where it is inappropriate
and where the interaction between certain variables is of central interest. Suppose that

X
d X
Y = m0 + mj (Xj ) + mj;k (Xj ; Xk ) + "; (3.16)
j=1 1 j<k d

where E("jX) = 0; and mj ; mj;k ; k = j + 1; : : : ; d and j = 1; : : : ; d are unknown functions. We have

@ 2m @ 2 mj;k
(x) = (xj ; xk ): (3.17)
@xj @xk @xj @xk
Furthermore, the average partial eects will depend on the joint distributions of pairs of covariates.
An example is given by the translog production function where Y is log of output and mj (xj ) =
cj log xj ; where xj is some input, while mj;k (xj ; xk ) = cj;k log xj log xk : For identication it is natural
R R R
to assume mj (xj )dQj (xj ) = 0 and mj;k (xj ; xk )dQj (xj ) = mj;k (xj ; xk )dQk (xk ) = 0: In this case
dene Z Z
gj (xj ) = m(x)dQ j (x j ) ; gj;k (xj ; xj ) = m(x)dQ j;k (x j;k ):

Then
gj (xj ) = mj (xj ) + cj

gj;k (xj ; xj ) gj (xj ) gk (xk ) = mj;k (xj ; xk ) + cj;k ;

for constants cj ; cj;k ; which gives a constructive identication of the functions mj ; mj;k :
Sperlich, Tjostheim, and Yang (2002) proposed this identication scheme and estimated the
model using kernel methods. Andrews and Whang (1990) proposed an estimator in this case based
54 CHAPTER 3 ADDITIVE REGRESSION

on series estimation, they established rates of convergence for the regression function. Breiman
PL Y d
(1991) investigated a model of the form L (x) = `=1 m`;j (xj ); which shares some features with
j=1
(3.16). This class of models (as L varies) is dense in the class of square integrable functions, so it
can be thought of as a general approximation strategy.

3.5 Parametric Components and Discrete Covariates

In many applications there are good reasons for considering at least some of the component functions
to be driven by parametric, for example, linear, specications. These reasons might be due to
prior subject-based knowledge or because the covariate in question is discrete. Suppose that Xj 2
f0; 1g; j = 1; 2; then E(Y jX1 = x1 ; X2 = x2 ) can only take four distinct values, which exactly
matches the number of distinct values generated by the set of additive functions m1 (x1 ) + m2 (x2 );
as xj 2 f0; 1g; j = 1; 2: So additivity is not a restriction in this case. Furthermore, without loss of
generality we can write mj (xj ) = j xj ; so that linearity is not a restriction either. For this reason
we may wish to consider the class of partially additive, partially linear functions

X
da db
X
m(x) = mj (xaj ) + b
j xj ; (3.18)
j=1 j=1

where we have partitioned x = (xa ; xb ); where xa 2 Rda and xb 2 Rdb with da + db = d: The
identication issues are not really changed much in this setting, but estimation procedures should
take of the additional restrictions embodied in (3.18). See Fan and Li (2003). Additional issues
arise also in the discussion of e ciency with regard to the parametric components, Bickel, Klaassen,
Ritov, and Wellner (1993).
Linear index models, corresponding to the case where m(x) = x> , are a very common semi-
parametric specication that arises in a variety of contexts, particularly limited dependent variable
models. See Powell (1994) for a survey.
3.6 ENDOGENEITY 55

3.6 Endogeneity
Endogeneity is an important, if not central issue in economics. Suppose the model (3.3) holds except
that some of the regressors are exogenous and some are endogenous. Specically, let Xj1 ; : : : ; Xjr be
exogenous, but Xk1 ; : : : ; Xks are endogenous (with r + s = d); and suppose that

E ["i jX] = (Xk1 i ; : : : ; Xks i ) 6= 0

This may be considered a strong assumption. In this case, what happens? We note that
X
d
E[Yi jX = x] = m0 + mj (xj ) + (xk1 ; : : : ; xks )
j=1
Xr
= m0 + mj` (xj` ) + (xk1 ; : : : ; xks );
`=1
Ps
where (xk1 ; : : : ; xks ) = l=1 mk (xkl ) + (xk1 ; : : : ; xks ): This means the function m(x) is partially
additive, additive in the components of the exogenous variables but not with respect to the endoge-
nous variables. Therefore, the exogenous eects are identied but in order to identify the endogenous
eects, we need some additional assumptions. A common assumption is that of the existence of an
instrumental variable, see for example Newey, Powell, and Vela (1999). In general this leads to a
more complicated estimation problem. To see this, consider the special case with one endogenous
covariate X1 and one instrumental variables Z: We have
Z
E(Y jZ = z) = E[m1 (X)jZ = z] = m1 (x)fXjZ (xjz)dx; (3.19)

where fXjZ is the conditional density of X given Z: We can estimate E(Y jZ = z) and fXjZ (xjz)
from the data, so consider these observable quantities. The equation (3.19) is an example of a type
1 linear integral equation, called NPIV (nonparametric instrumental variables). Specically, we may
write m = Hm1 ; where H is the conditional expectation operator determined by fXjZ : Under some
conditions, we can invert H to obtain m1 = H 1 m: However, the inverse is not continuous, hence
the problem is called, ill-posed, see Chen (2007).
Estimation in some special cases can be tractable. In the partial linear partial additive special
case where the endogenous variables enter linearly, the endogeneity can be handled by standard IV
estimation methods.
Chapter 4

Generalized Additive and Other Models

4.1 Introduction
In this chapter we discuss a more general class of models that possess additivity, but only after
some transformation, or sequence of transformations. The additive regression function is suitable for
many applications, but there are good reasons to seek more general structures. These more general
structures may be suggested by economic theory or simply by the type of restrictions certain types
of data should satisfy to be logically consistent.

4.2 Generalized Additive Models


Consider the generalized additive model in which there is some known transformation G for which

X
d
G fm(x)g = m0 + mj (xj ); (4.1)
j=1

where m(x) = E(Y jX = x): In fact, m could be the quantile regression or some other estimable
nonparametric function, but most precedent exists with the regression function. This includes the
additive model as a special case when G is the identity and also the multiplicative model when
G is the logarithm. In general however, the specication (4.1) makes the function m(x) no longer

57
58 CHAPTER 4 GENERALIZED ADDITIVE AND OTHER MODELS

additively separable, since the individual functions mj now enter in a nonlinear way, so that (3.2) no
longer holds. However, the elasticity function

@ ln m=@xj ln m0j (xj )


j:k (x) = (x) = = j:k (xj ; xk )
@ ln m=@xk ln m0k (xk )

is restricted to only depend on (xj ; xk ); and in a particular way.


Note that (4.1) is a partial model specication and we have not restricted in any way the variance
or other aspects of the conditional distribution L(Y jX) of Y given X. A full model specication,
widely used in this context, is to assume that L(Y jX) belongs to an exponential family with known
link function G and mean m: For example, with V (m(x)) = var(Y jX = x) the conditional log
likelihood of Y = yjX = x is of the form Q(m(x); y); where @Q(m; y)=@m = (y m)=V (m): This
class of models was called Generalized Additive by Hastie and Tibshirani (1991). This arises in the
context of limited dependent variable models. For example, if Yi is binary we might take G to be
the inverse of a c.d.f. F; so that the model is
!
X
d
Pr [Yi = 1jX = x] = F m0 + mj (xj ) = m(x):
j=1

In this case, var(Y jX = x) = m(x)(1 m(x)): Popular choices for F include the probit link, where F
is the normal cdf, and the logit link, where F is the logit cdf, but more exible structures are possible,
see Morgan (1992, chapter 4) for a number of such models like the Aranda-Ordaz and Copenhaver-
Mielke models. We may wish to allow for parametric transformations. A leading example would
be
((1 m=(1 m)) 1
G (m) = log ;

which nests the logit [ = 1] and the complementary log-log [ ! 0] as special cases. Such a
specication can arise from missclassication of a binary dependent variable as in Copas (1988) and
Hausman, Abrevaya and Scott-Morton (1998), that is, suppose that Pr(Y = 1jX = x) = m(x) =
P
F (m0 + dj=1 mj (xj )) for some known link function F = G 1 ; but that when Y = 1 we erroneously
observe Y = 0 with probability 1 and when Y = 0 we erroneously observe Y = 1 with probability
2: Then
4.2 GENERALIZED ADDITIVE MODELS 59

Pr (Y = 1jX = x) = Pr (Y = 1jX = x) (1 1) + Pr (Y = 0jX = x) 2

X
d
= 2 + (1 1 2 )F (m0 + mj (xj )); (4.2)
j=1

1 1
which is of the form (??) with G = 2 + (1 1 2 )G .
Another example is from count data. Suppose that Yi 2 f0; 1; 2; : : :g with conditional distribution

m(x)k
Pr (Y = kjX = x) = exp [ m(x)]
k!
for some function m(x) = E(Y jX = x) that satises (4.1) with G = log : Applications of count data
include patents.
Stone (1986) showed that the optimal rate of convergence for estimating m or mj in L2 is of order
r=(2r+1)
n ; where r is a smoothness index of the functions mj . He constructed an estimator based on
splines that achieved the optimal rate of convergence.
The generalized additive model with known G raises only minor new issues for identication and
estimation, since we can obtain the left hand side of (4.1) and then employ the same integration
techniques we applied for additive models. The calculation is a bit more complicated for "objective
function" approach. Consider the expected likelihood

E [Q (m(Xi ); Yi )] (4.3)

and like in (3.8) dene the components m0 ; m1 ( ); : : : ; md ( ) as the minimizers of (4.3) over the space
of additive functions Hadd : Provided Q is globally convex in m; such a minimizer exists and is unique,
subject to the same normalization issues discussed above. The rst order conditions from this are
(for j = 1; : : : ; d)

E [w(m(Xi )) fYi F (m(Xi ))g jXji = xj ] = 0


E [w(m(Xi )) fYi F (m(Xi ))g] = 0;
60 CHAPTER 4 GENERALIZED ADDITIVE AND OTHER MODELS
Pd
where w(m) = F 0 (G(m)) =V (m) and m(x) = m0 + j=1 mj (xj ); which is a nonlinear system of
equations in m0 ; m1 ( ); : : : ; md ( ): The study of these equations is more complex than for the purely
additive case. Let us describe this as a system of d + 1 equations in d functions and one constant,
shortly denoted by S(m) = 0:

The case where G is unknown does however raise new issues. We will return to this later.

In some respects, econometricians would prefer the partial model specication in which we keep
(4.1), but do not restrict ourselves to the exponential family. This exibility is a relevant consideration
for many datasets where there is overdispersion or heterogeneity.

A function m : Rd ! R is homogeneous of degree r if

m(cx) = cr m(x) (4.4)

for any positive constant c: This property is often assumed in microeconomics models, Deaton and
Muellbauer (1980). In statistical terms it embodies a type of dimensionality reduction property. Write
x in polar coordinates as ; , where is length and is direction, so contains the same information
as x=jjxjj. Any function M ( ) automatically corresponds to a homogeneous function m(x), dened
by m(x) = M ( ). Therefore, the nonparametric function m is really determined completely by
the nonparametric function M; whose argument has one less dimension. Polar coordinates have a
natural interpretation in many economic applications of homothetic functions. For example, if h is
a production function then M ( ) denes the marginal rates of substitution among inputs x, while
denes the scale of production, with economies of scale given by the dependence of h on . Without
loss of generality we shall assume that 0; although the estimation technology can be applied
in the case where can be negative. Tripathi and Kim (2000) discuss estimation of homogenous
functions. If we take logs log m(x) = log + log M ( ) we end up with an additive function with
one of the components completely known. There is a more general class of "almost homogenous"
functions, Aczel (1960, p233) where m(cr1 x1 ; : : : ; crd xd ) = cr m(x1 ; : : : ; xd ) for some r1 ; : : : ; rd ; r and
for all positive constants c:
4.2 GENERALIZED ADDITIVE MODELS 61

4.2.1 Hazard Function


For iid data, the hazard function (t) = f (t)=(1
F (t)) is non-negative but otherwise unrestricted.
Rt
We are also interested in the cumulated hazard function (t) = 1
(s)ds; which is a weakly
increasing function. The hazard function and density function are in one to one correspondence so
that f (t) = (t) exp( (t)): The hazard function represents the failure intensity given surivival to
that point. It is widely used in labour economics and nance, but especially in medical statistics,
where it is an improtant parameter in many studies.
In situations where covariates are available some model structure is desirable. The proportional
hazard model of Cox (1972) is a special case of the generalized additive model with full distributional
specication; in particular, it allows a nonparametric baseline hazard and a log linear (hence additive)
covariate eect. We consider a generalization of this specication, which also allows time dependent
covariates whose eects is nonparametric. Aalen (1978) laid down the standard framework that is
based on counting processes, and which can accommodate a wide variety of censoring mechanisms.
Let N(n) (t) = (N1 (t); : : : ; Nn (t)) be a n-dimensional counting process with respect to an increasing,
(n)
right-continuous, complete ltration Ft ; t 2 [0; T ]; i.e., N(n) is adapted to the ltration and has
components Ni ; which are right-continuous step-functions, zero at time zero, with jumps of size one
such that no two components jump simultaneously. Here, Ni (t) records the number of observed
failures for the ith individual during the time interval [0; t]; and is dened over the whole period
[taken to be [0; T ]; where T is nite]. Suppose that Ni has intensity

1 (n)
i (t) = lim P Ni (t + ) Ni (t) = 1jFt = (t; Xi (t))Yi (t); (4.5)
#0

where Yi is a predictable process taking values in f0; 1g, indicating (by the value 1) when the ith
individual is observed to be at risk, while Xi is a d-dimensional predictable covariate process. The
function (t; x) represents the failure rate for an individual at risk at time t with covariate Xi (t) = x:
This is of interest in many areas from mortality to unemployment to high frequency nance.
The main object of interest is the hazard function ( ) and functionals computed from it: Linton,
Nielsen, and Van der Geer (2005) consider the case that is restricted to be separable either additively
or multiplicatively. These are contained as a special case of the generalized additive hazard model
62 CHAPTER 4 GENERALIZED ADDITIVE AND OTHER MODELS

X
d
G( (t; x)) = m0 + m0 (t) + mj (xj ); (4.6)
j=1

where G is a known monotonic function, while m0 ( ) and mj ( ) are unknown functions. If G is


the identity, we obtain the additive hazard model, while if G is the log function we obtain the
multiplicative hazard model.
The Cox model is a special case of log transformed (4.6) with mj (xj ) = j xj for some parameters
j: Honda (2005) investigates log additive hazard models, i.e., he writes (t; x) = 0 (t) exp( 1 (x1 ) +

2 (x2 )); and proposes estimators of the functions j( ) based on a local polynomial partial likelihood
principle, that partials out the baseline hazard function 0 (t): The exponential specication ensures
that the resulting eect is positive, although the estimation procedure that is needed is nonlinear.
Separable nonparametric models have been investigated previously in hazard estimation by Andersen,
Borgan, Gill, and Keiding (1992).

4.3 Transformation Models


Taking transformations of the data has been an integral part of statistical practice for many years.
Transformations have been used to aid interpretability as well as to improve statistical performance.
An important contribution to this methodology was made by Box and Cox (1964) who proposed a
parametric power family of transformations that nested the logarithm and the level. They suggested
that the power transformation, when applied to the dependent variable in a linear regression setting,
might induce normality, error variance homogeneity, and additivity of eects. They proposed estima-
tion methods for the regression and transformation parameters. Carroll and Ruppert (1984) applied
this and other transformations to both dependent and independent variables. A number of other de-
pendent variable transformations have been suggested, for example the Zellner-Revankar transform,
see Zellner and Revankar (1969). The transformation methodology has been quite successful and a
large literature exists on this subject for parametric models, see Carroll and Ruppert (1988). There
are also a number of applications to economics data: see Zarembka (1968), Zellner and Revankar
(1969), Heckman and Polachek (1974), Ehrlich (1977), Hulten and Wyko (1981).
4.3 TRANSFORMATION MODELS 63

Suppose that
X
d
(Y ) = mj (Xj ) + "; (4.7)
j=1

where " is independent of X; while is a class of monotonic transformations, where 2 . Among


the many examples of interest, the following ones are used most commonly:
y 1
(Box-Cox) (y) =
(Zellner-Revankar) (y) = ln y + y 2
(Arcsinh) (y) = sinh 1 ( y)= ;
The arcsinh transform is discussed in Johnson (1949) and more recently in Robinson (1991). The
main advantage of the arcsinh transform is that it works for y taking any value, while the Box-Cox
and the Zellner-Revankar transforms are only dened if y is positive. For these transformations,
the error term cannot be normally distributed except for a few isolated parameters, and so the
Gaussian likelihood is misspecied. In fact, as Amemiya and Powell (1981) point out, the resulting
estimators (in the parametric case) are inconsistent when only n ! 1. Bickel and Doksum (1981)
proposed an alternative transformation to the Box-Cox, which is consistent with y 2 R, namely,
(y) = sign(y)(jyj 1)= ; however, this transformation is not dierentiable in y at y = 0: We shall
discuss identication of the transformation below, but clearly, given ; the model is simply additive
so that our identication results for the additive components would apply.
Linton, Chen, Wang, and Hrdle (1997) proposed to estimate the parameters of the transforma-
tion by either an instrumental variable method or a pseudo-likelihood method based on Gaussian
". Horowitz (1996) considered the reverse case where is an unknown monotonic transformation
but mj are linear functions.
Due to the monotonicity of and independence of " from X; the conditional quantile function
q (Y jX) satises
X
d
(q (Y jX = x)) = mj (xj ) + w ;
j=1

where w is the quantile of ": This connects the generalized additive models with the transformation
models. In particular, with known and assuming w = 0; we obtain (4.1) except that we have the
quantile regression function instead of the mean regression function on the left hand side.
64 CHAPTER 4 GENERALIZED ADDITIVE AND OTHER MODELS

4.4 Homothetic functions


Suppose that there exist functions h and g such that

m(x) = h[g(x)]; (4.8)

where g is linearly (without loss of generality) homogeneous and h is strictly monotonic. Then we say
that m(x) is homothetic. This is the rst case we have considered of the type, unknown unknowns,
i.e., an unknown function of an unknown function.
Homothetic and homothetically separable functions are commonly used in models of consumer preferences and firm production. The function $m(x)$ could be a utility or consumer cost function recovered from estimated consumer demand functions via revealed preference theory, or it could be a directly estimated production or producer cost function. Some examples of homothetic functions used in economics are provided in Chiang (1984). Zellner and Ryu (1998) perform empirical comparisons of a large number of different homothetic functional forms for production. Blackorby, Primont, and Russell (1978) provide an extensive study of the properties of homothetically separable functions and their applications. See also Matzkin (1994) for a general survey on imposing restrictions of economic theory on nonparametric estimators.
In many applications the functions $h$ and $g$ are of direct interest; e.g., the returns to scale of a homothetic production function is defined as the log derivative of $h$ with respect to $g$. Matzkin (1992) provides a consistent estimator for the binary threshold crossing model $y = I[g(x) + \varepsilon \geq 0]$, where $g(x)$ is homogeneous and $\varepsilon$ is independent of $x$. This threshold crossing model has $E(y|x) = h[g(x)]$, where $h$ is the distribution function of $\varepsilon$, and so is equivalent to our framework with $m(x) = E(y|x)$.\footnote{A motivating example Matzkin provides for the threshold crossing model is where $g(x)$ is a constant returns to scale cost function for a project, $\varepsilon$ is the firm's benefit or return from undertaking the project (which is unknown to the researcher), and $y$ indicates whether the firm embarks on the project, which it does if the benefit exceeds the cost.} In an unpublished manuscript, Newey and Matzkin (1993) propose an estimator of Matzkin's (1992) model and provide (without derivation) an associated limiting distribution.
How do we identify this model? We first identify $g$. For given values $x, x_0$, suppose we can find a scalar $u$ such that $m(x) = m(u x_0)$, a "match". Then $g(x) = g(u x_0) = u\, g(x_0)$, so that $u = u(x, x_0) = g(x)/g(x_0)$. For any weighting function $w$, we have
\[
\frac{u(x, x_0)}{\int u(x, x_0)\, w(x)\, dx} = \frac{g(x)}{\int g(x)\, w(x)\, dx},
\]
which identifies $g(\cdot)$ up to a free scale normalization. We can obtain more information by combining the results of this procedure for different values of $x_0$. One can then estimate $h(\cdot)$ by the nonparametric regression of $m$ on $g$; this would be an example of a generated regression, see below, because the conditioning variable has to be estimated from the data itself. A minimal computational sketch of the matching step is given below.
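The matching construction translates directly into a simple numerical procedure. The sketch below is ours and purely illustrative: it assumes an estimate `m_hat` of the regression function is already available (e.g., from a multivariate kernel smoother) and recovers $g$ up to scale on a set of points by solving the matching equation $m(x) = m(u x_0)$ for $u$ by bisection.

```python
import numpy as np
from scipy.optimize import brentq

def match_u(m_hat, x, x0, u_lo=1e-3, u_hi=1e3):
    # Solve m_hat(x) = m_hat(u * x0) for the scalar u by bisection.
    # Requires m_hat(u * x0) to be monotone in u over [u_lo, u_hi], which
    # holds when h is strictly monotone and g is linearly homogeneous.
    target = m_hat(x)
    return brentq(lambda u: m_hat(u * x0) - target, u_lo, u_hi)

def estimate_g(m_hat, X_points, x0, weights=None):
    # Recover g at the points X_points, normalized so that the weighted
    # average of g over those points equals one (the free scale choice).
    u = np.array([match_u(m_hat, x, x0) for x in X_points])
    w = np.ones(len(u)) if weights is None else weights
    return u / np.average(u, weights=w)
```

In practice one would combine estimates obtained from several base points $x_0$ and then regress $\widehat{m}$ on $\widehat{g}$ to recover $h$.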

4.5 General Separability

There are a number of different forms of separability used in economics; see Leontief (1947) and Goldman and Uzawa (1964). Let $N_1, \ldots, N_S$ be a partition of $\{1, \ldots, d\}$, and let $m$ be a utility function defined over the goods $x$. Separability properties are defined in terms of the MRS (marginal rate of substitution) $MRS_{ij}(x)$ between two goods $i$ and $j$. Strong separability is the property that $MRS_{ij}(x)$ does not depend on goods $k$ for all $i \in N_s$, $j \in N_t$, and $k \notin N_s \cup N_t$. Weak separability is the property that $MRS_{ij}(x)$ does not depend on goods $k$ for all $i, j \in N_s$ and $k \notin N_s$. Pearce separability is the property that $MRS_{ij}(x)$ does not depend on goods $k$ for all $i, j \in N_s$ and $k \neq i, j$. Goldman and Uzawa (1964) prove that these properties are equivalent to functional form restrictions on the utility function.
Strong Separability: For functions $m_j$ of $x^{(s)} = \{x_l : l \in N_s\}$ and a monotonic function $h$ of a scalar argument,
\[
m(x) = h\{m_1(x^{(1)}) + \cdots + m_S(x^{(S)})\}. \tag{4.9}
\]
Weak Separability: For a function $H$ of $S$ arguments,
\[
m(x) = H\{m_1(x^{(1)}), \ldots, m_S(x^{(S)})\}. \tag{4.10}
\]
Pearce Separability: For a function $H$ of $S$ arguments and additive functions $m_j$,
\[
m(x) = H\{m_1(x^{(1)}), \ldots, m_S(x^{(S)})\}. \tag{4.11}
\]
We present a number of econometric models that fall into the strongly separable classification.
One derivation of this model comes from ordinary additive regression models in which the dependent variable is censored, truncated, binary, or otherwise limited. These are models in which $Y^* = \sum_{j=1}^{d} m_j(X_j) + \varepsilon$ for some unobserved $Y^*$ and $\varepsilon$, where $\varepsilon$ is independent of $X$ with an absolutely continuous distribution function, and what is observed is $(Y, X)$, where $Y$ is some function of $Y^*$ such as $Y = 1(Y^* \geq 0)$, or $Y = Y^* \mid Y^* \geq 0$, or $Y = 1(Y^* \geq 0)\, Y^*$, in which case $m(x) = E[Y \mid X = x]$ or $m(x) = \mathrm{med}[Y \mid X = x]$. The function $h$ would then be the distribution function or quantile function of $\varepsilon$. Threshold or selection equations in particular are commonly of this form, having $Y = 1[m(X) + \varepsilon \geq Z]$, where $Z$ is some threshold, e.g., a price or a bid, with $m(X) + \varepsilon$ equalling willingness to pay or a reservation price; see, e.g., Lewbel, McFadden, and Linton (2010). Monotonicity of $h$ holds automatically in most of these examples because $h$ either equals or is a monotonic transformation of a distribution function.

This model may also arise in a nonparametric regression model with an unknown transformation of the dependent variable, $\Lambda(Y) = m(X) + \varepsilon$, where $\varepsilon$ has an absolutely continuous distribution function $h$ and is independent of $X$, $\Lambda$ is an unknown monotonic transformation, and $m$ is an unknown regression function. It follows that the conditional distribution of $Y$ given $X$, $F_{Y|X}$, has the form $h(\Lambda(y) - m(x))$. A number of copula models also satisfy this structure. For example, strict Archimedean copulas can be written in this form, where the joint distribution of $X = (X_1, X_2)$ is such that $F_X(x_1, x_2) = \psi^{-1}\big(\psi(F_1(x_1)) + \psi(F_2(x_2))\big)$, where $F_1, F_2$ are the marginal distributions of $X_1$ and $X_2$ respectively, $\psi$ is a continuous, strictly decreasing, convex function from $[0,1]$ to $[0,\infty)$ such that $\psi(1) = 0$, and $\psi^{-1}$ denotes its inverse.

Pinkse (2001) discusses identification and nonparametric estimation in the weakly separable case where both $H$ and the $m_j$ functions are unknown. However, in Pinkse's specification, the $m_j$ are only identified up to an arbitrary monotonic transformation.
Pearce separability corresponds somewhat to a multiple index model, where the indices are additive nonparametric functions rather than linear ones.

We now consider the question of whether one can identify all the functions on the right hand side of (4.9) from knowledge of the function $m(\cdot)$. Horowitz (2001) discusses this question in the case where each $x^{(s)}$ is a scalar variable. There are some additional identification constraints that are needed. Specifically, one needs location, sign, and scale normalizations, as well as requiring at least two non-constant functions $m_j, m_k$. His identification strategy is based on the observation that, since $\partial m(x)/\partial x_j = h'\{m_1(x_1) + \cdots + m_d(x_d)\}\, m_j'(x_j)$,
\[
\frac{m_j'(x_j)}{m_k'(x_k)} = \frac{\partial m(x)/\partial x_j}{\partial m(x)/\partial x_k} \tag{4.12}
\]
for all $x$. Then, by integrating this relation against any weighting function $w$ and using the normalizations and further integration, one can infer $m_j(\cdot)$.
We consider the slightly weaker structure where
\[
m(x) = h[M(x)] = h[m_1(x_1) + m_2(x_2)], \tag{4.13}
\]
where $x_1$ may be a vector. In this case, the function $m_1(\cdot)$ need not be additive, so that the Horowitz strategy is not so attractive. Observe that the model is unchanged if $m_1$ and $m_2$ are replaced by $m_1 + c_1$ and $m_2 + c_2$, respectively, and $h(m)$ is replaced by $\widetilde{h}(m) = h(m - c_1 - c_2)$. Similarly, it remains unchanged if $m_1$ and $m_2$ are replaced by $c m_1$ and $c m_2$, respectively, for some $c \neq 0$, and $h(m)$ is replaced by $\widetilde{h}(m) = h(m/c)$. Therefore, as is commonly the case in this literature, location and scale normalizations are needed to make identification possible. We will describe and discuss these normalizations below, but first we state the following conditions, which are assumed to hold throughout our exposition.
We suppose that the function $h$ is strictly monotonic and that $h$, $m_1$, and $m_2$ are continuous and differentiable with respect to any mixture of their arguments. We also suppose that $m_2$ has finite first derivative, $m_2'$, over its entire support, and that $m_2'(x_{20}) = 1$ for some $x_{20}$. We also suppose that $h(0) = m_0$, where $m_0$ is a constant. We require that the image of $m(x_1, x_2)$ over its entire support is replicated once $m$ is evaluated at $x_{20}$ for all $x$. This assumption implies that $s(x_1, x_2) \equiv \partial m(x_1, x_2)/\partial x_2$ is a well defined function for all $x$. Then, for the random variables $m(X_1, X_2)$ and $s(X_1, X_2)$, define the function $q(t, x_2)$ by
\[
q(t, x_2) = E[\, s(X) \mid m(X) = t,\; X_2 = x_2\,]. \tag{4.14}
\]

The assumed strict monotonicity of $h$ ensures that $h^{-1}$, the inverse function of $h$, is well defined over its entire support. Let $h'$ be the first derivative of $h$. Under the conditions just stated (Assumption I), we claim that
\[
M(x) \equiv m_1(x_1) + m_2(x_2) = \int_{m_0}^{m(x)} \frac{dt}{q(t, x_{20})}. \tag{4.15}
\]

It follows from the model that $s(x) = h'[M(x)]\, m_2'(x_2)$, so that
\begin{align*}
E[\, s(X) \mid m(X) = t, X_2 = x_{20}\,] &= E\big[\, h'[M(X)]\, m_2'(X_2) \mid m(X) = t, X_2 = x_{20}\,\big] \\
&= E\big[\, h'(h^{-1}(m(X)))\, m_2'(X_2) \mid m(X) = t, X_2 = x_{20}\,\big] \\
&= h'(h^{-1}(t))\, m_2'(x_{20}),
\end{align*}
and $q(t, x_{20}) = h'[h^{-1}(t)]\, m_2'(x_{20})$. Then, using the change of variables $m = h^{-1}(t)$, and noticing that $h'[h^{-1}(t)] = h'(m)$ and $dt = h'(m)\, dm$, we obtain
\[
\int_{m_0}^{m(x)} \frac{dt}{q(t, x_{20})}
= \int_{m_0}^{m(x)} \frac{dt}{h'[h^{-1}(t)]\, m_2'(x_{20})}
= \int_{h^{-1}[m_0]}^{h^{-1}[m(x)]} \frac{h'(m)\, dm}{h'(m)\, m_2'(x_{20})}
= \big( h^{-1}[m(x)] - h^{-1}[m_0] \big)\, \frac{1}{m_2'(x_{20})}
= M(x) \equiv m_1(x_1) + m_2(x_2),
\]
as required, since $h^{-1}[m_0] = 0$ and $m_2'(x_{20}) = 1$. The equation (4.15) involves only quantities that are estimable from the data.
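The identification argument suggests a plug-in estimation strategy: estimate $m$ by a multivariate smoother, estimate $s$ by a derivative smoother, estimate $q(\cdot, x_{20})$ by regressing the fitted $s$ on the fitted $m$ among observations with $X_2$ near $x_{20}$, and integrate numerically as in (4.15). The following is a rough sketch of ours under those assumptions; it ignores trimming, bandwidth choice, and the generated-regressor issue, all of which matter in practice.

```python
import numpy as np

def nw(x0, x, y, h):
    # one-dimensional Nadaraya-Watson smoother with a Gaussian kernel
    w = np.exp(-0.5 * ((x0[:, None] - x[None, :]) / h) ** 2)
    return (w @ y) / w.sum(axis=1)

def estimate_M(eval_idx, m_hat, s_hat, X2, x20, m0, h, n_grid=200):
    # m_hat, s_hat: fitted values of m and s = dm/dx2 at the sample points
    # eval_idx: indices of the points at which M is to be evaluated
    near = np.abs(X2 - x20) < h             # observations with X2 close to x20
    t_grid = np.linspace(m0, m_hat[near].max(), n_grid)
    q_grid = nw(t_grid, m_hat[near], s_hat[near], h)   # q(t, x20) approximation
    M = []
    for mx in m_hat[eval_idx]:
        mask = t_grid <= mx                 # integrate 1/q from m0 up to m(x)
        M.append(np.trapz(1.0 / q_grid[mask], t_grid[mask]))
    return np.array(M)
```

The same bandwidth is used here for the $X_2$-window and the smoothing over $t$ purely for brevity.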

4.6 Nonparametric Transformation Models

Suppose that
\[
\Lambda(Y) = G(m_1(X_1), \ldots, m_d(X_d)) + \varepsilon, \tag{4.16}
\]
where $\varepsilon$ is independent of $X$, while $G$ is a known function and $\Lambda$ is an unknown monotonic function. Special cases of $G$ are $G(z) = \sum_{j=1}^{d} z_j$ and $G(z) = \prod_{j=1}^{d} z_j$. The general model in which $\Lambda$ is monotonic and $G(z) = \sum_{j=1}^{d} z_j$ was previously addressed in Breiman and Friedman (1985), who suggested estimation procedures based on the iterative backfitting method, which they called ACE. However, they did not provide many results about the statistical properties of their procedures. Linton, Chen, Wang, and Härdle (1997) considered the model with $\Lambda = \Lambda_\theta$ parametric and additive $G$, $G(z) = \sum_{j=1}^{d} z_j$. They proposed to estimate the parameters of the transformation by either an instrumental variable method or a pseudo-likelihood method based on Gaussian $\varepsilon$. Horowitz (1996) considered the reverse case where $\Lambda$ is an unknown monotonic transformation but the $m_j$ are linear functions.
The case where $G(z) = \sum_{j=1}^{d} z_j$ corresponds to the principles expounded in Luce and Tukey (1964), namely that there is some measurement scheme $(\Lambda(y), m_1(x_1), \ldots, m_d(x_d))$ in which the effect of $X$ on $Y$ is additive.
Suppose that
\[
\Lambda(Y) = m(X) + \varepsilon, \tag{4.17}
\]
where $\varepsilon$ is independent of $X$ with unknown distribution $F_\varepsilon$, and the functions $\Lambda$ and $m$ are unknown. Then
\begin{align}
F_{Y|X}(y, x) &= \Pr[Y \leq y \mid X = x] = F_\varepsilon(\Lambda(y) - m(x)) \tag{4.18} \\
f_{Y|X}(y, x) &= f_\varepsilon(\Lambda(y) - m(x))\, \lambda(y), \nonumber
\end{align}
where $\lambda(y) = \partial \Lambda(y)/\partial y$ and $f_\varepsilon(e) = F_\varepsilon'(e)$. Ekeland, Heckman, and Nesheim (2004) show that this model is identifiable up to a couple of normalizations under smoothness conditions on $(F_\varepsilon, \Lambda, m)$ and monotonicity conditions on $\Lambda$ and $F_\varepsilon$. The basic idea is to note that for each $j$
\[
\frac{\partial F_{Y|X}(y, x)/\partial y}{\partial F_{Y|X}(y, x)/\partial x_j} = -\frac{\lambda(y)}{\partial m(x)/\partial x_j}. \tag{4.19}
\]
Then, by integrating out either $y$ or $x$, one obtains $\lambda(\cdot)$ up to a constant or $\partial m(\cdot)/\partial x_j$ up to a constant. By further integrations one obtains $\Lambda(\cdot)$ and $m(\cdot)$ up to constants. One then obtains $F_\varepsilon$ by inverting the relationship (4.18) and imposing the normalizations. Horowitz (1996) covers the special case where $m(x)$ is linear.
The above arguments show that for identification it is not necessary to restrict $\Lambda$, $m$, or $F_\varepsilon$ beyond monotonicity, smoothness, and normalization restrictions. However, the implied estimation strategy can be very complicated; see for example Lewbel and Linton (2006). In addition, the fully nonparametric model does not at all reduce the curse of dimensionality in comparison with the unrestricted conditional distribution $F_{Y|X}(y, x)$, which makes the practical relevance of the identification result limited.

Finally, we mention a nonparametric identification result of Breiman and Friedman (1985). They defined functions $\Lambda(\cdot), m_1(\cdot), \ldots, m_d(\cdot)$ as minimizers of the least squares objective function
\[
e^2(\Lambda, m_1, \ldots, m_d) = \frac{E\Big[\big\{\Lambda(Y) - \sum_{j=1}^{d} m_j(X_j)\big\}^2\Big]}{E[\Lambda(Y)^2]} \tag{4.20}
\]
for general random variables $Y, X_1, \ldots, X_d$. They showed the existence of minimizers of (4.20) and showed that the set of minimizers forms a finite dimensional linear subspace (of an appropriate class of functions) under additional conditions. These conditions were that: (i) $\Lambda(Y) - \sum_{j=1}^{d} m_j(X_j) = 0$ a.s. implies that $\Lambda(Y) = 0$ and $m_j(X_j) = 0$ a.s., $j = 1, \ldots, d$; (ii) $E[\Lambda(Y)] = 0$, $E[m_j(X_j)] = 0$, $E[\Lambda(Y)^2] < \infty$, and $E[m_j^2(X_j)] < \infty$; (iii) the conditional expectation operators $E[\Lambda(Y) \mid X_j]$ and $E[m_j(X_j) \mid Y]$, $j = 1, \ldots, d$, are compact. This result does not require any model assumptions like conditional moment restrictions or independent errors, but it has more limited scope.
An important class of models can be defined through the conditional moment restriction
\[
E\big[\, \rho(\Lambda(Y), m_1(X_1), \ldots, m_d(X_d)) \mid Z\,\big] = 0, \tag{4.21}
\]
where $\rho$ is a known vector function and $Z$ are instrumental variables, possibly including some of the covariates $X$; see Ai and Chen (2003), Blundell, Chen, and Kristensen (2007), and Chen (2007). The restriction is supposed to uniquely identify the functions $\Lambda(\cdot), m_1(\cdot), \ldots, m_d(\cdot)$. Let
\begin{align*}
\rho_j(t(\cdot), g_j(x_j), g_{-j}(\cdot)) &= E\big[\, \rho(t(Y), g_1(X_1), \ldots, g_d(X_d)) \mid X_j = x_j\,\big] \\
&= \int \rho(t(y), g_1(x_1), \ldots, g_d(x_d))\, f_{Y|X}(y|x)\, f_{X|X_j}(x|x_j)\, dy\, dx_{-j}
\end{align*}
for any functions $t, g_1, \ldots, g_d$. This yields a nonlinear system of integral equations,
\begin{align*}
\rho_1(m_1(x_1), m_2(\cdot), \ldots, m_d(\cdot)) &= 0 \\
&\ \ \vdots \\
\rho_d(m_d(x_d), m_1(\cdot), \ldots, m_{d-1}(\cdot)) &= 0.
\end{align*}



4.7 Non Separable Models


According to Luce and Tukey (1964), additivity is basic to science. It is certainly hard to think of functions that are not additive in some sense, i.e., after transformations or relabelling of variables. We present a well known result by the famous mathematician Kolmogorov.

Theorem 1. [Kolmogorov (1957)] There exist $d$ constants $\lambda_j > 0$, $j = 1, \ldots, d$, $\sum_{j=1}^{d} \lambda_j \leq 1$, and $2d + 1$ continuous strictly increasing functions $\phi_k$, $k = 1, \ldots, 2d+1$, which map $[0,1]$ into $[0,1]$ and have the property that for each continuous function $m$ from $[0,1]^d$ to $\mathbb{R}$,
\[
m(x_1, \ldots, x_d) = \sum_{k=1}^{2d+1} g\bigg( \sum_{j=1}^{d} \lambda_j\, \phi_k(x_j) \bigg)
\]
for some function $g$ continuous on $[0,1]$.

This interesting result says that every continuous function has an additive-like representation. However, this turns out not to be so helpful when it comes to the curse of dimensionality. Specifically, if the function $m$ is smooth of a certain order, it is not guaranteed that the functions $g, \phi_k$ are smooth of the same order. This shows in some sense that smoothness and dimensionality are linked together. Furthermore, the functions $g, \phi_k$ may not be of interest in themselves.
A general class of non separable models consists of
\[
Y = m(X, U), \tag{4.22}
\]
where $X \in \mathbb{R}^d$ and $Y \in \mathbb{R}$ are observed, while $U$ is an unobserved covariate. The function $m$ is monotone in $U$. If we normalize the distribution of $U$ to be uniform on $[0,1]$ (which is a natural thing to do for identification), we obtain
\[
q_\alpha(Y \mid X = x) = m(x, \alpha).
\]
See also the work of Chesher and of Heckman and Vytlacil on identification in nonseparable models.
Torgovitsky (2011) considers the model where $X$ is endogenous but there is some instrument $Z$ with the property that $U$ is independent of $Z$ and the copula of $(X, U)$ is independent of $Z$.
Chapter 5

Time Series and Panel Data

There are many applications of these ideas in time series. In time series, the dimensionality of the
covariate space is often huge, because a priori all lagged values of a variable can be included in the
model. This means that in some cases it is desirable to employ semiparametric features that allow
some restriction on the dynamics. It also means that some kind of model selection procedure is
desirable. Also, the notions of stationarity and mixing are in some cases questionable and deserving
of study. Although linear time series models have been very popular, nonlinearity has been greatly
emphasized more recently for a variety of reasons.
Let $\{(Y_t, X_t)\}$ be a jointly stationary process. We introduce here the mixing coefficient. Let $\mathcal{F}_a^b$ be the $\sigma$-algebra of events generated by the random variables $\{(Y_t, X_t),\ a \leq t \leq b\}$. A stationary stochastic process $\{(Y_t, X_t)\}$ is strongly mixing if
\[
\sup_{A \in \mathcal{F}_{-\infty}^{0},\, B \in \mathcal{F}_{k}^{\infty}} \big| P[A \cap B] - P[A]\, P[B] \big| = \alpha(k) \to 0, \quad \text{as } k \to \infty,
\]
and $\alpha(k)$ is called the strong mixing coefficient. A large class of models satisfy this restriction, and one can even specify how rapidly $\alpha(k)$ declines to zero with $k$; see for example Carrasco and Chen (199?). In fact, for many arguments, stationarity per se is not strictly needed, as some uniformity in the mixing conditions can replace it. There are some specific examples of nonstationarity that are of interest. For example, suppose that $X_t = t/T$. This case can easily be accommodated in the results, but some results can change. For example, in nonparametric regression with
\[
Y_t = m(t/T) + \varepsilon_t,
\]
with $\varepsilon_t$ a stationary and mixing process, we find that kernel estimation has a variance that depends on the long run variance of $\varepsilon_t$, not just the short run variance (which is the case when $X_t$ is instead a stationary mixing process). An alternative to stationarity is the concept of local stationarity. The stochastic process $\{X_{t,T}\}$ is called locally stationary if there exists a stochastic process $\{\widetilde{X}_{u,t}\}$ such that
\[
\Pr\Big[ \max_{1 \leq t \leq T} \big| X_{t,T}(\omega) - \widetilde{X}_{t/T,\, t}(\omega) \big| \leq D_T\, T^{-1} \Big] = 1 \tag{5.1}
\]
for all $T$, where $\{D_T\}$ is a measurable positive process satisfying $\sup_T E(|D_T|^{4+\delta}) < \infty$ for some $\delta > 0$.

5.1 Mean Response

The classical time series models are the stationary invertible ARMA(p,q) class analyzed in Box and Jenkins (196?). In this case, we may write $A(L) Y_t = B(L) \varepsilon_t$, where $\varepsilon_t$ is an iid innovation process, while $A(L) = a_0 - a_1 L - \cdots - a_p L^p$ and $B(L) = b_0 - b_1 L - \cdots - b_q L^q$ are lag polynomials. Under the invertibility condition we may write $Y_t = A(L)^{-1} B(L) \varepsilon_t = C(L) \varepsilon_t$ and $D(L) Y_t = B(L)^{-1} A(L) Y_t = \varepsilon_t$, so that $E(Y_t \mid \mathcal{F}_{t-1}) = \sum_{j=1}^{\infty} d_j Y_{t-j}$, where $\mathcal{F}_t$ is the sigma field generated by the past history of $Y_t$.
Generalizing the autoregressive class of models, we can consider the nonparametric $d$th order autoregression
\[
Y_t = m(Y_{t-1}, \ldots, Y_{t-d}) + \varepsilon_t, \tag{5.2}
\]
where $\varepsilon_t$ is a martingale difference sequence, i.e., $E(\varepsilon_t \mid \mathcal{F}_{t-1}) = 0$, where $\mathcal{F}_t$ is the sigma field generated by the past history of $Y_t$, while $m$ is an unknown function. In this case, $m(Y_{t-1}, \ldots, Y_{t-d}) = E(Y_t \mid \mathcal{F}_{t-1}) = E(Y_t \mid Y_{t-1}, \ldots, Y_{t-d})$. One can also include exogenous covariates in the conditioning set, whence
\[
Y_t = m(Y_{t-1}, \ldots, Y_{t-d}, X_t, \ldots, X_{t-d'}) + \varepsilon_t
\]

for some $d'$. This might be called an ARMAX model. Robinson (1983) contains results for both types of structure. The inclusion of moving average terms is more problematic. Suppose that $d = 1$ in (5.2) and $\varepsilon_t = \eta_t - \theta \eta_{t-1}$, where the $\eta_t$ are i.i.d. Then $m$ is no longer the conditional expectation of $Y_t$ given the past. However, we can write $Y_t = m(Y_{t-1}) + (1 - \theta L)\eta_t$, and provided $|\theta| < 1$, we have
\[
Y_t + \theta Y_{t-1} + \theta^2 Y_{t-2} + \cdots = m(Y_{t-1}) + \theta\, m(Y_{t-2}) + \theta^2 m(Y_{t-3}) + \cdots + \eta_t,
\]
that is, the right hand side is an infinite order additive model (see below for more discussion of this type of model) and the left hand side is also a weighted average of current and past outcomes.
As discussed already, the general model (5.2) suffers from the curse of dimensionality. An additive approximation may be fruitful, i.e., letting $m(y_1, \ldots, y_d) = \sum_{j=1}^{d} m_j(y_j)$. In this way, one can allow an arbitrary but finite lag order. This model was treated in Tjøstheim and Auestad (1994). Cai and Masry (2000) treat the more general model with additional covariates.
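For concreteness, a minimal local constant (Nadaraya-Watson) fit of the first-order nonparametric autoregression $Y_t = m(Y_{t-1}) + \varepsilon_t$ can be coded as follows; this is our own illustrative sketch with a Gaussian kernel and a user-supplied bandwidth.

```python
import numpy as np

def nw_autoreg(y, y_eval, h):
    """Local constant estimate of m(y) = E[Y_t | Y_{t-1} = y] from a series y."""
    x_lag, resp = y[:-1], y[1:]                 # (Y_{t-1}, Y_t) pairs
    w = np.exp(-0.5 * ((y_eval[:, None] - x_lag[None, :]) / h) ** 2)
    return (w @ resp) / w.sum(axis=1)

# usage: m_hat = nw_autoreg(y, np.linspace(y.min(), y.max(), 100), h=0.5)
```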

5.2 Volatility

For some time series, especially financial ones, it is also of interest to model the conditional variance, so we may consider models of the form
\[
Y_t = m(Y_{t-1}, \ldots, Y_{t-d}) + \sigma(Y_{t-1}, \ldots, Y_{t-d})\, \varepsilon_t, \tag{5.3}
\]
where $E(\varepsilon_t \mid \mathcal{F}_{t-1}) = 0$ and $E(\varepsilon_t^2 - 1 \mid \mathcal{F}_{t-1}) = 0$. In this case, $\sigma^2(Y_{t-1}, \ldots, Y_{t-d}) = \mathrm{var}(Y_t \mid \mathcal{F}_{t-1})$ is treated as an unknown but smooth measurable function. This class of models can be called nonparametric ARCH, after Engle (1982). Masry and Tjøstheim (1995) treat this general case, discussing stationarity and mixing properties as well as estimation and inference methods. The nonparametric ARCH literature apparently begins with Pagan and Schwert (1990) and Pagan and Hong (1991). They considered the case where $\sigma_t^2 = \sigma^2(Y_{t-1})$, where $\sigma^2(\cdot)$ is a smooth but unknown function, and the multilag version $\sigma_t^2 = \sigma^2(Y_{t-1}, Y_{t-2}, \ldots, Y_{t-d})$.
Obviously, when $d$ is at all large, these methods do not work so well, which motivates looking at additive models
\[
\sigma_t^2 = c_v + \sum_{j=1}^{d} \sigma_j^2(Y_{t-j}).
\]

This is a natural extension of the original ARCH model, where $\sigma_j^2(y) = \alpha_j y^2$. The news impact function can be defined as the way new information affects volatility, holding past volatility constant. In this case, the news impact function is separable and equal to $\sigma_1^2(\cdot)$. The nonparametric models allow for general news impact functions, including both symmetric and asymmetric functions, and so accommodate the leverage effect [Nelson (1991)]. Yang, Härdle, and Nielsen (1999) proposed an alternative nonlinear ARCH model in which the conditional mean is additive, but the volatility is multiplicative: $\sigma_t^2 = c_v \prod_{j=1}^{d} \sigma_j^2(Y_{t-j})$. Their estimation strategy is based on the method of partial means/marginal integration using local linear fits as a pilot smoother. Kim and Linton (2002) generalize this model to allow for arbitrary [but known] transformations, i.e.,
\[
G(\sigma_t^2) = c_v + \sum_{j=1}^{d} \sigma_j^2(Y_{t-j}),
\]
where $G(\cdot)$ is a known function like log or level. Carroll, Mammen, and Härdle (2001) consider the model $\sigma_t^2 = \sum_{j=1}^{d} \sigma_j^2(Y_{t-j})$, where the functions $\sigma_j^2(Y_{t-j})$ are further restricted to be of the form $\theta^{j-1} m(Y_{t-j})$ for some unknown scalar parameter $\theta$ and single unknown function $m(\cdot)$. This is convenient when fitting many lags, a point that was made in justifying the GARCH(1,1) model of Bollerslev (1986). The parameter $\theta$ determines the persistence of the process.
These separable models deal with the curse of dimensionality but still do not capture the persistence of volatility, and specifically they do not nest the favourite GARCH(1,1) process, in which $d = \infty$. Linton and Mammen (2005) analyse a class of semiparametric ARCH models that has both general functional form aspects and flexible dynamics. Consider the process
\[
\sigma_t^2 = \sum_{j=1}^{\infty} \psi_j(\theta)\, m(Y_{t-j}), \tag{5.4}
\]
where $\theta \in \Theta \subset \mathbb{R}^p$ and $m \in \mathcal{M}$, where $\mathcal{M} = \{m : \text{measurable}\}$. The coefficients $\psi_j(\theta)$ satisfy at least $\psi_j(\theta) \geq 0$ and $\sum_{j=1}^{\infty} \psi_j(\theta) < \infty$ for all $\theta \in \Theta$. In the special case that $\psi_j(\theta) = \theta^{j-1}$ with $0 < \theta < 1$, we can rewrite (5.4) as a difference equation in the unobserved variance
\[
\sigma_t^2 = \theta \sigma_{t-1}^2 + m(Y_{t-1}), \quad t = 1, 2, \ldots, \tag{5.5}
\]
which is essentially the Engle and Ng (1993) PNP model. This model is consistent with a stationary GARCH(1,1) structure for the unobserved variance when $m(y) = \alpha + \beta y^2$ for some parameters $\alpha, \beta$.

It also includes other parametric models as special cases: the Glosten, Jagannathan, and Runkle (1993) model, taking $m(y) = \alpha + \beta y^2 + \gamma y^2 1(y < 0)$; the Engle (1990) asymmetric model, taking $m(y) = \alpha + \beta(y + \gamma)^2$; and the Engle and Bollerslev (1986) model, taking $m(y) = \alpha + \beta |y|^{\gamma}$.
The new feature here is that $\sigma_t^2$ is a latent process. This makes the question of identification and estimation more tricky. We next discuss identification of the function $m$. For simplicity, we consider the special case (5.5) where $\theta$ is known. Define $m_0$ to be the minimizer of the following population least squares criterion function
\[
S(m) = E\Bigg[ \Big\{ Y_t^2 - \sum_{j=1}^{\infty} \theta^{j-1} m(Y_{t-j}) \Big\}^2 \Bigg]. \tag{5.6}
\]
The first order condition for this minimization is of the form (3.12), where
\[
m^*(y) = (1 - \theta^2) \sum_{j=1}^{\infty} \theta^{j-1} E\big[ Y_0^2 \mid Y_{-j} = y \big],
\qquad
H(y, x) = -\sum_{j \neq 0} \theta^{|j|}\, \frac{f_{0,j}(y, x)}{f(y)\, f(x)},
\]
where $f_{0,j}$ (respectively $f$) denotes the density of $(Y_0, Y_j)$ (respectively $Y_0$). In this case, the operator may satisfy the Hilbert-Schmidt condition (5.17), for example, if the joint densities $f_{0,j}(y, x)$ are uniformly bounded for $j \neq 0$ and $|x|, |y| \leq c$, and the density $f_0(x)$ is bounded away from 0 for $|x| \leq c$, for some finite $c$. The counterpart to (3.10) is that there exists no $m$ with $\|m\|_2 = 1$ such that $\sum_{j=1}^{\infty} \theta^{j-1} m(Y_{t-j}) = 0$ with probability one. This condition rules out a certain "concurvity" in the stochastic process. That is, the data cannot be functionally related in this particular way. It is a natural generalization to our situation of the condition that the regressors not be linearly related in a linear regression. A special case of this condition was used in Weiss (1986) and Kristensen and Rahbek (2003) for identification in parametric ARCH models; see also the arguments used in Lumsdaine (1996, Lemma 5) and Robinson and Zaffaroni (2002, Lemma 9). This condition is straightforward to verify. We can now show that, for a constant $0 < \theta < 1$, the eigenvalues $\lambda_j$ of $H$ satisfy $\lambda_j < 1$. To prove this, note that for $m$ with $\|m\|_2 = 1$,
\begin{align*}
0 &< E\Bigg[ \bigg( \sum_{j=1}^{\infty} \theta^{j-1} m(Y_{t-j}) \bigg)^{2} \Bigg] \\
&= \kappa_\theta \int m^2(x) f_0(x)\, dx + \int\!\!\int m(x)\, m(y) \sum_{|k| \geq 1} \psi_k(\theta)\, f_{0,k}(x, y)\, dx\, dy \\
&= \kappa_\theta \bigg[ \int m^2(x) f_0(x)\, dx - \int m(x)\, (Hm)(x)\, f_0(x)\, dx \bigg],
\end{align*}
where $\kappa_\theta = \sum_{j=1}^{\infty} \theta^{2j-2}$ is a positive constant depending on $\theta$, and $\psi_k(\theta) = \sum_{j} \theta^{2j+|k|-2} = \kappa_\theta\, \theta^{|k|}$. For eigenfunctions $m$ of $H$ with eigenvalue $\lambda$ this shows that $\int m^2(x) f_0(x)\, dx - \lambda \int m^2(x) f_0(x)\, dx > 0$. Therefore $\lambda_j < 1$ for $j \geq 1$.
We next discuss a further property that leads to an iterative solution method rather than a direct inversion. If it holds that $\sup_j |\lambda_j| < 1$, then $m = \sum_{j=0}^{\infty} H^j m^*$. In this case the sequence of successive approximations $m^{[n]} = m^* + H m^{[n-1]}$, $n = 1, 2, \ldots$, converges in norm geometrically fast to $m$ from any starting point. This sort of property has been established in other related problems, see Hastie and Tibshirani (1990) for discussion, and is the basis of most estimation algorithms in this area. Unfortunately, the conditions that guarantee convergence of the successive approximations method are not likely to be satisfied here, even in the special case that $\psi_j(\theta) = \theta^{j-1}$. The reason is that the unit function is always an eigenfunction of $H$, with eigenvalue determined by $-\sum_{j \neq 0} \theta^{|j|} \cdot 1 = \lambda \cdot 1$, which implies that $\lambda = -2\theta/(1 - \theta)$. This is less than one in absolute value only when $\theta < 1/3$. This implies that we will not be able to use directly the particularly convenient method of successive approximations [i.e., backfitting] for estimation; however, with some modifications it can be applied, see Linton and Mammen (2003). A schematic version of the basic iteration on a grid is given below.
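To illustrate the successive approximation idea (not the modified algorithm of Linton and Mammen, which requires additional steps), the sketch below iterates $m^{[n]} = m^{*} + H m^{[n-1]}$ on a grid, given grid versions of $m^{*}$ and of the kernel $H(y,x) f_0(x)$; as discussed above, it is only guaranteed to converge when the operator norm is below one. All names are ours.

```python
import numpy as np

def successive_approx(m_star, H_f0, dx, n_iter=100, tol=1e-8):
    """
    m_star : (G,) values of m* on a grid
    H_f0   : (G, G) matrix with entries H(y_i, x_j) * f0(x_j), so that
             (H m)(y_i) is approximated by (H_f0 @ m) * dx
    dx     : grid spacing used for the Riemann approximation of the integral
    """
    m = m_star.copy()
    for _ in range(n_iter):
        m_new = m_star + (H_f0 @ m) * dx
        if np.max(np.abs(m_new - m)) < tol:
            return m_new
        m = m_new
    return m
```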
We should mention some work by Audrino and Bühlmann (2001): their model is that $\sigma_t^2 = \sigma^2(Y_{t-1}, \sigma_{t-1}^2)$ for some smooth but unknown function $\sigma^2(\cdot)$, and it includes the PNP model as a special case. However, although they proposed an estimation algorithm, they did not establish the distribution theory of their estimator.
Another possibility is index models of the form $\sigma_t^2 = \sigma^2\big( \sum_{j=1}^{d} \beta_j Y_{t-j}^2 \big)$, where $\sigma^2(\cdot)$ is an unknown function; see for example Xia, Tong, Li, and Zhu (2002).

The above studies all take place within a stationary and mixing world. Recent developments
have allowed some theoretical analysis for nonparametric techniques in nonstationary environments,
Wang and Phillips (2009ab). They show that the usual issue of endogeneity is less important in the
nonparametric nonstationary world. Schienle (2008) considers additive nonparametric autoregression
for nonstationary Harris-recurrent processes.

5.3 Panel Data Models


Panel data are found in many contexts. Traditionally, the term is associated with a series of household surveys conducted over time on the same individuals, for which the cross-sectional dimension is large and the time series dimension is short. Parametric methods appropriate for this kind of data can be found in Hsiao (1986). There has also been some work on semiparametric models for such data, see for example Kyriazidou (1997). The increase in the length of time series available for these data has led to some interest in the application of time series concepts, see for example Arellano (2003). More recently, there has been work on panel data with large cross-section and time series dimensions, especially in finance, where the datasets can be large along both dimensions, and in macro, where there are many series with modest length time series. Some recent works include Phillips and Moon (1999), Bai and Ng (2002), Bai (2003, 2004), and Pesaran (2006). These authors have addressed a variety of issues including nonstationarity, estimation of unobserved factors, and model selection. They all work with essentially parametric models.

5.3.1 Standard Panel Regression


We first show that the concepts we discussed above are useful in analyzing panel data models with nonparametric components. Suppose that
\[
Y_{it} = \alpha_i + m(X_{it}) + \varepsilon_{it}, \tag{5.7}
\]
where $\alpha_i$, $i = 1, \ldots, n$, are unobserved fixed effects. When $m(x) = \beta^{\top} x$, this is a standard linear panel data regression model, Hsiao (1986). Then $(\alpha_1, \ldots, \alpha_n)$ is a large dimensional nuisance parameter, which is known to cause problems for maximum likelihood estimation in the parametric case. A common approach here is to time difference the data, in which case
\[
Y_{it} - Y_{i,t-1} = m(X_{it}) - m(X_{i,t-1}) + u_{it}, \tag{5.8}
\]
where $u_{it} = \varepsilon_{it} - \varepsilon_{i,t-1}$. This eliminates the nuisance parameters and creates an additive regression model with two functions, where there is a restriction between the components and where the error process is a moving average of the original errors. If the original errors satisfied $E[\varepsilon_{it} \mid X_{i1}, \ldots, X_{iT}] = 0$, then the errors in (5.8) also satisfy this restriction. Porter (1996) first studied estimation of the panel data regression model using (5.7). Note that this strategy extends easily to the case of multiple covariates,
\[
Y_{it} = \alpha_i + \sum_{j=1}^{d} m_j(X_{jit}) + \varepsilon_{it},
\]
where the $m_j$ are unknown functions. See also Lee and Kondo (2000).
See also Fengler, Härdle, and Mammen (2006) and Mammen, Støve, and Tjøstheim (2006).
Consider the following model where the data are generated as an unbalanced panel:
\[
Y_{i,t} = \theta_t + g(X_{i,t}) + u_{i,t}, \qquad i = 1, \ldots, n_t, \quad t = 1, \ldots, T, \tag{5.9}
\]
where the unobserved errors $(u_{i,t})_{i,t}$ satisfy at least the conditional moment restriction $E[u_{i,t} \mid X_{i,t}, \theta_t] = 0$. Here, $(\theta_t)_t$ is an unobserved time series, while $(X_{i,t})_{i,t}$ are observed covariates. We shall assume throughout that $(\theta_t)_t$ is independent of the observed covariates and errors. Our framework is consistent with the influential model of Lee and Carter (1992) for US mortality.
The model is a semiparametric panel data model, and some aspects of it have been discussed recently in, for example, Fan and Li (2004), Fan, Huang, and Li (2007), and Mammen, Støve, and Tjøstheim (2006), although our assumptions will be more general in some cases and our focus is different. In practice we expect the distribution of observed and unobserved variables to change over time, and this is allowed for in our model. For example, we wish to allow the covariates to have potentially time-varying densities $f_t$, i.e., $X_{i,t} \sim f_t$, $i = 1, \ldots, n_t$.
Observe that the mean of $Y_{i,t}$ is $E[Y_{i,t}] = E[\theta_t] + \int g(x) f_t(x)\, dx$. Without further restrictions, the mean of the latent process $\{\theta_t\}$ and the function $g(\cdot)$ are not separately identified. Clearly, we may subtract a constant from $\theta_t$ and add it to the function $g$ without changing the distribution of the observed data. In the context of additive models, for example Linton and Nielsen (1995), it is common to assume that $E[g(X)] = 0$. However, since we wish to allow for the possibility that the covariate distribution is nonstationary, this is not an attractive assumption. One could instead assume that, for example, $E[g(X_{i,1})] = 0$, which would be consistent with nonstationary covariates. We instead put restrictions on the process $\{\theta_t\}$. A restriction on the mean of $\theta_t$ would effectively rule out nonstationarity in that component. Therefore, we shall impose that $\theta_1 = 0$ (one could choose an arbitrary initial value instead, if this has a better interpretation). This is consistent with the process $\{\theta_t\}$ being a unit root process starting from the origin. It also allows the process $\{\theta_t\}$ to be asymptotically stationary. We remark that there is an air of arbitrariness in the decomposition between $\theta_t$ and $g(X_{i,t})$, and whatever restriction is imposed cannot get around this. The quantity $\varphi_t = E(Y_{it} \mid \theta_t, X_{it}) = \theta_t + g(X_{it})$ is invariant to the choice of identifying restriction. However, the quantity $\varphi_t$ contains two sources of nonstationarity, $\theta_t$ and the changing mean of $g$ due to the changing covariate distribution. It is of interest to separate out these two sources of nonstationarity by examining $\theta_t$ and $f_t$ separately.

We close this section with some motivation for considering the model (5.9). The model captures the general idea of an underlying and unobserved trend modifying the effect of a covariate on a response. For example, suppose that the output of a firm $Q$ is determined by inputs capital $K$ and labour $L$, but the production function $F$ is subject to technological change $a$ that affects all firms in the industry. This could be captured by the deterministic equation $Q = a F(K, L)$. Taking logs and adding a random error yields the specification (5.9) with $Y_{it} = \log Q_{it}$, $\theta_t = \log a_t$, and $g(\cdot) = \log F(\cdot)$. Note that $\partial \log Q / \partial \log a = 1$, so this specification imposes so-called Hicks-neutral technical change. In this case, the Total Factor Productivity or Solow residual is the change in $\theta_t$, the part of growth not explainable by measurable changes in the inputs. In the popular special case where the production function is homothetic, one can replace $F(K_{it}, L_{it})$ by $f(X_{it})$, where $X_{it}$ is the scalar capital to labour ratio. Traditional econometric work chose particular functional forms for $F$, like Cobb-Douglas or CES, and made $\theta_t$ a polynomial function of time. However, there is no general agreement on the form of production functions, see Jorgenson (1986), and so it is well motivated to treat $g$ as a nonparametric function. Likewise, it is restrictive to assume a particular form that underpins how the technology should change, and so we do not restrict the relationship $t \mapsto \theta_t$. The model assumption that $\theta_1 = 0$ has a natural interpretation in this case, as it corresponds to $a_1 = 1$, in which case $Q_{i1} = F(K_{i1}, L_{i1})$ is a baseline level of production. A simple iterative scheme consistent with this normalization is sketched below.
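One simple way to compute estimates consistent with the identification discussion above (not necessarily the procedure analysed in the references cited) is to alternate between cross-sectional averaging for $\theta_t$ and pooled kernel smoothing for $g$, imposing $\theta_1 = 0$ at each step. The sketch below assumes a balanced panel stored as arrays `Y` and `X` of shape `(n, T)` and uses a Gaussian kernel; all names are ours.

```python
import numpy as np

def estimate_panel_trend(Y, X, h, n_iter=20):
    """Alternate between theta_t (time effects) and g (pooled kernel regression)."""
    n, T = Y.shape
    theta = np.zeros(T)
    for _ in range(n_iter):
        # pooled Nadaraya-Watson fit of Y_it - theta_t on X_it
        resp = (Y - theta[None, :]).ravel()
        x = X.ravel()
        def g_hat(x0):
            w = np.exp(-0.5 * ((x0 - x) / h) ** 2)
            return np.sum(w * resp) / np.sum(w)
        g_fit = np.vectorize(g_hat)(X)
        # update time effects by cross-sectional averaging, normalizing theta_1 = 0
        theta = np.mean(Y - g_fit, axis=0)
        theta = theta - theta[0]
    return theta, g_hat
```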

5.3.2 Nonlinear Generalized Additive Models


For many microeconomic panel data series there is a very large cross-section and a short time series. A common concern in this case is the modelling of individual heterogeneity. Imbens and Wooldridge (2007) give a succinct outline of the main issues.
Consider a binary outcome panel data model, where
\[
\Pr[Y_{it} = 1 \mid X_{it}, \alpha_i] = F\bigg( \alpha_i + \sum_{j=1}^{d} m_j(X_{jit}) \bigg), \tag{5.10}
\]
where $F$ is a known link function, for example the probit link, while the $\alpha_i$ are unobserved individual specific effects, possibly correlated with $X_{i1}, \ldots, X_{iT}$. Consider first the special case where the individual effects are independent of the covariates (the random effects case), with distribution $H$. Then we have
\[
\Pr[Y_{it} = 1 \mid X_{it}] = \int F\bigg( \alpha + \sum_{j=1}^{d} m_j(X_{jit}) \bigg) dH(\alpha) = \overline{F}\bigg( \sum_{j=1}^{d} m_j(X_{jit}) \bigg), \tag{5.11}
\]
where $\overline{F}(t) = \int F(\alpha + t)\, dH(\alpha)$. If $H$ is unknown, then $\overline{F}$ is an unknown but strictly increasing function of $t$, and so the covariate functions and $\overline{F}$ in (5.11) can be identified by the techniques above. One could then obtain $H$ from $\overline{F}$ by deconvolution techniques.
Imbens and Wooldridge (2007) define some of the main parameters of interest. Define
\[
\beta_j(x, \alpha) = \frac{\partial}{\partial x_j} E(Y_{it} \mid X_{it} = x, \alpha_i = \alpha).
\]
Then the partial effect at the average (PEA) is
\[
PEA(x) = \beta_j(x, \overline{\alpha}),
\]
where $\overline{\alpha}$ is the mean of the distribution of $\alpha$; the average partial effect (APE) is
\[
APE(x) = \int \beta_j(x, \alpha)\, dH(\alpha);
\]
and the average structural function (ASF) is
\[
ASF(x) = \int E(Y_{it} \mid X_{it} = x, \alpha_i = \alpha)\, dH(\alpha).
\]

Altonji and Matzkin (2005) suggest instead allowing the distribution of $\alpha$ to vary with $x$ through functions of $x$. Suppose that the distribution of $\alpha_i \mid X_{i1}, \ldots, X_{iT}$ is the same as that of $\alpha_i \mid \overline{X}_i$. One possible assumption here is that $\alpha_i = \sum_{j=1}^{d} a_j(\overline{X}_{ji}) + \eta_i$, where $\eta_i$ is independent of $X$ and has unknown distribution $H$. Then
\[
\Pr\big[ Y_{it} = 1 \mid X_{it} = x, \overline{X}_i = \overline{x} \big]
= \int F\bigg( \sum_{j=1}^{d} a_j(\overline{x}_j) + \sum_{j=1}^{d} m_j(x_j) + \eta \bigg) dH(\eta)
= \overline{F}\bigg( \sum_{j=1}^{d} a_j(\overline{x}_j) + \sum_{j=1}^{d} m_j(x_j) \bigg).
\]
Provided $(\overline{X}_i, X_{it})$ has a non-degenerate distribution, the quantities of interest are identified. This can be guaranteed provided we drop one time period.

5.3.3 A Semiparametric Fama French Model


Consider the model
\[
Y_{i,t+1} = f_{u,t+1} + \sum_{j=1}^{d} m_j(X_{jit})\, f_{j,t+1} + \varepsilon_{i,t+1},
\]
where: the $X_{jit}$ are observed (continuous) covariates; the $Y_{it}$ are observed returns; $f_t = (f_{ut}, f_{1t}, \ldots, f_{dt})$ is an unobserved, strictly exogenous stochastic process with unspecified dynamics satisfying at least
\[
E[\varepsilon \mid X, f] = 0.
\]
We treat the vector $f \in \mathbb{R}^{T(d+1)}$ as unknown fixed parameters (independent of everything else), while $m_1(\cdot), \ldots, m_d(\cdot)$ are unknown but smooth functions. The cross-section dimension $n$ and the time series length $T$ are large.
Consider the least squares objective function
\[
Q_T(f, g) = E\Bigg[ \frac{1}{T} \sum_{t=1}^{T} \bigg\{ Y_{it} - f_{ut} - \sum_{j=1}^{d} g_j(X_{ji})\, f_{jt} \bigg\}^2 \Bigg]. \tag{5.12}
\]

Next we turn to the characterization of $g$ given $f$. The first-order condition defining the criterion-minimizing function $g_j(x_j)$ at the value $x_j$ is
\[
\frac{1}{T}\sum_{t=1}^{T} f_{jt}\, E[Y_{it} \mid X_{ji} = x_j]
= \frac{1}{T}\sum_{t=1}^{T} f_{jt} f_{ut} + m_j(x_j)\, \frac{1}{T}\sum_{t=1}^{T} f_{jt}^2
+ \frac{1}{T}\sum_{t=1}^{T} \sum_{k \neq j} f_{jt} f_{kt}\, E[m_k(X_{ki}) \mid X_{ji} = x_j]. \tag{5.13}
\]
Doing this for each $j = 1, \ldots, d$, we obtain a system of implicit linear equations for $m$ given $f$, i.e., a system of integral equations (of type 2) in the functional parameter $m$; see Mammen, Linton, and Nielsen (1999) and Linton and Mammen (2005). We next argue that there exists a unique solution to these equations. Define:
\[
m_j^*(x_j) = \frac{\sum_{t=1}^{T} f_{jt}\, E[(Y_{it} - f_{ut}) \mid X_{ji} = x_j]}{\sum_{t=1}^{T} f_{jt}^2},
\qquad
H_{jk}(x_j, x_k) = \frac{f_{j,k}(x_j, x_k)}{f_j(x_j) f_k(x_k)},
\qquad
\rho_{jk} = \frac{\sum_{t=1}^{T} f_{jt} f_{kt}}{\sum_{t=1}^{T} f_{jt}^2},
\]
where $f_{j,k}$ is the joint density of $(X_{ji}, X_{ki})$. We drop the dependency on $T$ in the notation for simplicity. Then, for $j = 1, \ldots, d$, we have the system of linear equations in the space $L_2(f)$:
\[
m_j = m_j^* - \sum_{k \neq j} \rho_{jk}\, \mathcal{H}_j m_k, \tag{5.14}
\]
where $(\mathcal{H}_j g_k)(x_j) = \int H_{jk}(x_j, x_k)\, g_k(x_k)\, f_k(x_k)\, dx_k$. In the absence of the $\rho_{jk}$, and with a different $m_j^*$, (5.14) is the system of equations that defines the additive nonparametric regression model, Mammen, Linton, and Nielsen (1999). We can write the system of equations (5.14) as
\[
\begin{pmatrix}
I & \rho_{12}\mathcal{H}_1 & \cdots & \rho_{1d}\mathcal{H}_1 \\
\rho_{21}\mathcal{H}_2 & I & \cdots & \rho_{2d}\mathcal{H}_2 \\
\vdots & & \ddots & \vdots \\
\rho_{d1}\mathcal{H}_d & \cdots & & I
\end{pmatrix}
\begin{pmatrix} m_1 \\ m_2 \\ \vdots \\ m_d \end{pmatrix}
= \mathcal{H}(\rho)\, m = m^* =
\begin{pmatrix} m_1^* \\ m_2^* \\ \vdots \\ m_d^* \end{pmatrix},
\]
c.f. Hastie and Tibshirani (1991, 5.5 and 5.6). The question is whether there exists a unique solution to this system of equations, so that we can write $m = \mathcal{H}(\rho)^{-1} m^*$. The associated system with $\rho_{jk} = 1$ for all $j, k$ has been well studied in the literature, and the system $\mathcal{H} m = m^*$ has a unique solution.
Consider the case $d = 2$. By substitution we obtain the equations
\begin{align}
(I - \rho_{12}\rho_{21}\mathcal{H}_1\mathcal{H}_2)\, m_1 &= m_1^* - \rho_{12}\mathcal{H}_1 m_2^* \tag{5.15} \\
(I - \rho_{12}\rho_{21}\mathcal{H}_2\mathcal{H}_1)\, m_2 &= m_2^* - \rho_{21}\mathcal{H}_2 m_1^*. \tag{5.16}
\end{align}
For there to exist a solution to these equations it suffices that $I - \rho_{12}\rho_{21}\mathcal{H}_1\mathcal{H}_2$ and $I - \rho_{12}\rho_{21}\mathcal{H}_2\mathcal{H}_1$ be invertible. Suppose that the Hilbert-Schmidt condition holds:
\[
\int\!\!\int \frac{f_{k,j}(x, x')^2}{f_j(x)\, f_k(x')}\, dx\, dx' < \infty, \quad \text{for all } j, k. \tag{5.17}
\]
This is satisfied under standard conditions. Then it holds that the operator norms of the compositions of the operators satisfy $\|\mathcal{H}_1\mathcal{H}_2\| < 1$ and $\|\mathcal{H}_2\mathcal{H}_1\| < 1$. Then, since
\[
\rho_{12}\rho_{21} = \frac{\big(\sum_{t=1}^{T} f_{1t} f_{2t}\big)^2}{\sum_{t=1}^{T} f_{1t}^2\, \sum_{t=1}^{T} f_{2t}^2} \in [0, 1]
\]
by the Cauchy-Schwarz inequality, it follows that $\|\rho_{12}\rho_{21}\mathcal{H}_1\mathcal{H}_2\| < 1$ and $\|\rho_{12}\rho_{21}\mathcal{H}_2\mathcal{H}_1\| < 1$, so that $I - \rho_{12}\rho_{21}\mathcal{H}_1\mathcal{H}_2$ and $I - \rho_{12}\rho_{21}\mathcal{H}_2\mathcal{H}_1$ are invertible and there exists a unique solution to (5.15), $m_1 = (I - \rho_{12}\rho_{21}\mathcal{H}_1\mathcal{H}_2)^{-1}(m_1^* - \rho_{12}\mathcal{H}_1 m_2^*)$, and to (5.16), $m_2 = (I - \rho_{12}\rho_{21}\mathcal{H}_2\mathcal{H}_1)^{-1}(m_2^* - \rho_{21}\mathcal{H}_2 m_1^*)$. This shows the unique identification of the functions $m_j$. Furthermore, we can write
\[
m_1 = \sum_{k=0}^{\infty} (\rho_{12}\rho_{21}\mathcal{H}_1\mathcal{H}_2)^{k} \big( m_1^* - \rho_{12}\mathcal{H}_1 m_2^* \big).
\]
The sum converges geometrically fast, which suggests that iterative methods (which amount to taking a finite truncation of the infinite sum) will converge rapidly to the solution and be independent of starting values, c.f. Hastie and Tibshirani (1991, pp. 118-120).
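The geometric-series representation suggests a simple iterative algorithm once grid versions of the conditional expectation operators are available. The sketch below (our own illustration, for $d = 2$ with discretized operators) iterates $m_1 \leftarrow m_1^* - \rho_{12} \mathcal{H}_1 m_2$ and $m_2 \leftarrow m_2^* - \rho_{21} \mathcal{H}_2 m_1$ to a fixed point.

```python
import numpy as np

def solve_two_function_system(m1_star, m2_star, H1, H2, rho12, rho21,
                              n_iter=200, tol=1e-10):
    """
    m1_star, m2_star : (G,) grid values of m_1^*, m_2^*
    H1, H2           : (G, G) matrices approximating the operators
                       (H_1 m_2)(x_1) and (H_2 m_1)(x_2) on the grids
    """
    m1, m2 = m1_star.copy(), m2_star.copy()
    for _ in range(n_iter):
        m1_new = m1_star - rho12 * (H1 @ m2)
        m2_new = m2_star - rho21 * (H2 @ m1_new)
        if max(np.max(np.abs(m1_new - m1)), np.max(np.abs(m2_new - m2))) < tol:
            return m1_new, m2_new
        m1, m2 = m1_new, m2_new
    return m1, m2
```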
Chapter 6

Marginal Integration

6.1 Introduction
This method is due to Linton and Nielsen (1995), who called it Marginal Integration, to Newey (1994), who called it the Partial Mean, and to Tjøstheim and Auestad (1994), who called it Projection. These were completely independent contributions published in different journals, with different emphases and results. The basic idea is to form empirical counterparts to (3.5)-(3.7).

6.2 The Method

We use the relations (3.6) to generate estimators. In practice we have to replace $m$ by an unrestricted nonparametric regression estimator $\widehat{m}(x)$, called the input or pilot estimator (and perhaps the $Q_{-j}, Q_j$ by estimates $\widehat{Q}_{-j}, \widehat{Q}_j$ when they are unknown), and approximate the integral by some method. We then let
\begin{align}
\widetilde{g}_j(x_j) &= \int \widehat{m}(x)\, d\widehat{Q}_{-j}(x_{-j}) \tag{6.1} \\
\widetilde{m}_j(x_j) &= \widetilde{g}_j(x_j) - \int \widetilde{g}_j(x_j)\, d\widehat{Q}_j(x_j) \tag{6.2} \\
\widetilde{m}(x) &= \widehat{c} + \sum_{j=1}^{d} \widetilde{m}_j(x_j), \qquad \widehat{c} = \int \widehat{m}(x)\, d\widehat{Q}(x). \tag{6.3}
\end{align}


There are many choices for $\widehat{m}$ here, for example the multivariate local constant estimator or the local polynomial estimators discussed above. There are several common choices of weighting measure $Q_{-j}$: (a) $\widehat{Q}_{-j}$ is the empirical distribution $\widehat{F}_{-j}$ of $X_{-j,i}$; (b) $\widehat{Q}_{-j}$ is the integral of a kernel estimate $\widehat{f}_{-j}$ of the density of $X_{-j,i}$; (c) $Q_{-j}$ is the integral of some fixed density $q_{-j}$ defined on a subset of the support of $X_{-j,i}$. The integration in (6.1) can be done by standard numerical integration techniques when $d$ is small in cases (b) and (c), but this becomes problematic for larger dimensions. However, the choice of the empirical distribution in (a) makes this computation very simple, since it is replaced by a sample average. Linton and Nielsen (1995) and Fan, Mammen, and Härdle (1998) consider the choice of optimal weighting.
An important feature of marginal integration is that in principle it can be applied to a variety of functions $m$, not just regression functions. Therefore, we can see immediately how to compute additive quantile regression or additive hazard functions.
The common implementation of the integration method is computationally demanding. This is based on taking choice (a), the empirical distribution of the covariates $X_{-j}$. In this case,
\[
\widetilde{g}_j(x_j) = \frac{1}{n} \sum_{i=1}^{n} \widehat{m}(x_j, X_{-j,i})
\]
for each point of interest $x_j$. If we compute $\widetilde{m}_j(x_j)$ at each sample observation $X_{ji}$, in effect one needs to compute $\widehat{m}(X_{ji}, X_{-j,l})$ for $i, l = 1, \ldots, n$, i.e., one has to compute $n$ different $n \times n$ smoothing matrices. Furthermore, if the support of $X$ is restricted in some way (for example, if $x_1$ is the age of an individual and $x_2$ is the time since they incurred a disability, then $x_2 \leq x_1$), then the method can break down, since the pairs $(X_{ji}, X_{-j,l})$ may not lie in the support of $X$. This means one has to truncate the empirical distribution to a subset of the support. Cheze et al. (2003) suggested a way of doing this in a systematic way to avoid neglecting usable information. A minimal sketch of the empirical-measure implementation is given below; we then discuss an alternative approach.
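The following is our own bare-bones implementation of the empirical-measure version of the marginal integration estimator with a product Gaussian kernel pilot; it computes $\widetilde{g}_j$ at a grid of points and applies a crude recentering in the spirit of (6.2). It is written for clarity rather than speed.

```python
import numpy as np

def nw_multivariate(x0, X, Y, h):
    # local constant pilot estimator at a single point x0 (a d-vector)
    u = (X - x0[None, :]) / h
    w = np.exp(-0.5 * np.sum(u**2, axis=1))
    return np.sum(w * Y) / np.sum(w)

def marginal_integration(j, xj_grid, X, Y, h):
    """Estimate m_j on xj_grid by averaging the pilot over the empirical
    distribution of the other covariates, then recentering."""
    n, d = X.shape
    g = np.empty(len(xj_grid))
    for a, xj in enumerate(xj_grid):
        vals = []
        for i in range(n):
            x0 = X[i].copy()
            x0[j] = xj                       # the point (x_j, X_{-j,i})
            vals.append(nw_multivariate(x0, X, Y, h))
        g[a] = np.mean(vals)
    return g - np.mean(g)                    # crude recentering over the grid
```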

6.3 Instrumental Variables

Letting $\eta_i = \sum_{k \neq j} m_k(X_{ki}) + \varepsilon_i$, we rewrite the model (3.3) as
\[
Y_i = g_j(X_{ji}) + \eta_i = c + m_j(X_{ji}) + \eta_i, \tag{6.4}
\]
which is a classical example of "omitted variable" regression. That is, although (6.4) appears to take the form of a univariate nonparametric regression model, smoothing $Y$ on $X_j$ will incur a bias due to the omitted variable $\eta$, because $\eta$ contains $X_{-j}$, which in general depends on $X_j$. One solution to this is suggested by the classical econometric notion of an instrumental variable. That is, we look for an instrument $Z$ such that
\[
E(Z \mid X_j) \neq 0, \qquad E(Z\eta \mid X_j) = 0 \tag{6.5}
\]
with probability one.
Note the contrast with the marginal integration method. In that approach one defines $m_j$ by some unconditional expectation,
\[
m_j(x_j) = E[m(x_j, X_{-j})\, W(X_{-j})]
\]
for some weighting function $W$ that depends only on $X_{-j}$ and which satisfies
\[
E[W(X_{-j})] = 1, \qquad E[W(X_{-j})\, m_{-j}(X_{-j})] = 0.
\]
If such a random variable $Z$ exists, we have
\[
E(ZY \mid X_j) = E(Z \mid X_j)\, g_j(X_j),
\]
so that
\[
g_j(x_j) = \frac{E(ZY \mid X_j = x_j)}{E(Z \mid X_j = x_j)}. \tag{6.6}
\]
This suggests that we estimate the function $m_j(\cdot)$ by nonparametric smoothing of $ZY$ on $X_j$ and of $Z$ on $X_j$. In typical parametric models the choice of instrument is usually not obvious and requires some caution, since there the issue is about endogeneity. However, our additive model has a natural class of instruments: $f_{-j}(X_{-j})/f(X)$ times any measurable function of $X_j$ will do. Suppose that we take
\[
Z(X) = \frac{f_j(X_j)\, f_{-j}(X_{-j})}{f(X)}. \tag{6.7}
\]

We have
\begin{align*}
E(Z\eta \mid X_j) &= E\Big( Z \sum_{k \neq j} m_k(X_k) \,\Big|\, X_j \Big) \\
&= \int \frac{f_j(X_j)\, f_{-j}(x_{-j})}{f(X_j, x_{-j})} \sum_{k \neq j} m_k(x_k)\, \frac{f(X_j, x_{-j})}{f_j(X_j)}\, dx_{-j} \\
&= \sum_{k \neq j} \int m_k(x_k)\, f_{-j}(x_{-j})\, dx_{-j} \\
&= 0,
\end{align*}
using $E(Z\varepsilon \mid X_j) = 0$ and the normalization $E[m_k(X_k)] = 0$. Furthermore, $E(Z \mid X_j) = 1$, so that $g_j(x_j) = E(ZY \mid X_j = x_j)$.


Of course, the distribution of the covariates is rarely known a priori. In practice, we have to rely on estimates of these quantities. Let $\widehat{f}(\cdot)$, $\widehat{f}_j(\cdot)$, and $\widehat{f}_{-j}(\cdot)$ be kernel estimates of the densities $f(\cdot)$, $f_j(\cdot)$, and $f_{-j}(\cdot)$, respectively. Then the feasible procedure is defined by replacing the instrumental variable $Z_i$ by $\widehat{Z}_i = \widehat{f}_j(X_{ji})\, \widehat{f}_{-j}(X_{-j,i}) / \widehat{f}(X_i)$ and computing an internally normalized one-dimensional smooth of $\widehat{Z}_i Y_i$ on $X_{ji}$. Thus
\[
\widetilde{g}_j(x_j) = \frac{1}{nh} \sum_{i=1}^{n} K\Big( \frac{x_j - X_{ji}}{h} \Big) \frac{\widehat{Z}_i Y_i}{\widehat{f}_j(X_{ji})}
= \frac{1}{nh} \sum_{i=1}^{n} K\Big( \frac{x_j - X_{ji}}{h} \Big) \frac{\widehat{f}_{-j}(X_{-j,i})}{\widehat{f}(X_i)}\, Y_i \tag{6.8}
\]
is our estimate of $g_j(x_j) = c + m_j(x_j)$. It has several interpretations in addition to the above instrumental variable estimate. First, it is a version of the one-dimensional regression smoother, but adjusting internally by a conditional density estimate,
\[
\widehat{f}_{j|-j}(X_{ji} \mid X_{-j,i}) = \frac{\widehat{f}(X_{ji}, X_{-j,i})}{\widehat{f}_{-j}(X_{-j,i})},
\]
instead of by a marginal density estimate. Second, one can think of (6.8) as a one-dimensional standard Nadaraya-Watson (externalized) regression smoother of the adjusted data $\widehat{Y}_i$ on $X_{ji}$, where $\widehat{Y}_i = \widehat{f}_j(x_j)\, \widehat{f}_{-j}(X_{-j,i})\, Y_i / \widehat{f}(X_{ji}, X_{-j,i})$. Finally, note that $\widetilde{g}_j(X_{ji})$ can be interpreted as a marginal integration estimator in which the pilot estimator is a fully internalized smoother [see Jones, Davies, and Park (1994)] and the integrating measure is the empirical covariate one,

\[
\widehat{m}(x) = \frac{1}{nh^d} \sum_{i=1}^{n} K\Big( \frac{x - X_i}{h} \Big) \frac{Y_i}{\widehat{f}(X_i)},
\]
rather than the Nadaraya-Watson estimator: by interchanging the orders of summation, we obtain
\begin{align*}
\widetilde{g}_j(X_{ji}) &= \frac{1}{nh} \sum_{l=1}^{n} K\Big( \frac{X_{ji} - X_{jl}}{h} \Big) \frac{\widehat{f}_{-j}(X_{-j,l})}{\widehat{f}(X_l)}\, Y_l \\
&= \frac{1}{nh} \sum_{l=1}^{n} K\Big( \frac{X_{ji} - X_{jl}}{h} \Big) \frac{Y_l}{\widehat{f}(X_l)} \bigg\{ \frac{1}{nh^{d-1}} \sum_{k=1}^{n} K\Big( \frac{X_{-j,k} - X_{-j,l}}{h} \Big) \bigg\} \\
&= \frac{1}{n^2 h^d} \sum_{k=1}^{n} \sum_{l=1}^{n} K\Big( \frac{X_{ji} - X_{jl}}{h} \Big) K\Big( \frac{X_{-j,k} - X_{-j,l}}{h} \Big) \frac{Y_l}{\widehat{f}(X_l)} \\
&= \frac{1}{n} \sum_{k=1}^{n} \bigg\{ \frac{1}{nh^d} \sum_{l=1}^{n} K\Big( \frac{X_{ji} - X_{jl}}{h} \Big) K\Big( \frac{X_{-j,k} - X_{-j,l}}{h} \Big) \frac{Y_l}{\widehat{f}(X_l)} \bigg\} \\
&= \frac{1}{n} \sum_{k=1}^{n} \widehat{m}(X_{ji}, X_{-j,k}),
\end{align*}
where $\widehat{m}(X_{ji}, X_{-j,k})$ is an internally normalized pilot smoother.
The main advantage that the local instrumental variable method has is in terms of computational cost. There is a convenient matrix formula for the IV estimator in terms of the smoother matrices
\[
K_j = \big( K_h(X_{ji} - X_{jl}) \big)_{i,l}, \qquad W_j = K_j ./ (K_j\, i),
\]
where $i$ is the $n$-vector of ones, and $.*$ and $./$ denote element-by-element multiplication and division, respectively. Specifically, we have
\[
\widetilde{g}_j = W_j \Big[ \big\{ (K_j\, i) .* (K_1 .* \cdots .* K_{j-1} .* K_{j+1} .* \cdots .* K_d\, i) ./ (K_1 .* \cdots .* K_d\, i) \big\} .* y \Big].
\]
The marginal integration method actually needs $n^2$ regression smoothings evaluated at the pairs $(X_{ji}, X_{-j,l})$ for $i, l = 1, \ldots, n$, while the backfitting method requires $nr$ operations, where $r$ is the number of iterations needed to achieve convergence. The instrumental variable procedure, in contrast, takes at most $2n$ operations of kernel smoothing in a preliminary step for estimating the instrumental variable, and another $n$ operations for the regressions. Thus, it can easily be combined with the bootstrap method, whose computational cost often becomes prohibitive in the case of marginal integration [see Kim, Linton, and Hengartner (1999)]. An illustrative implementation of (6.8) is given below.
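Here is a direct (unoptimized) translation of (6.8) into code, our own sketch with product Gaussian kernels and a common bandwidth for the density estimates; in practice one would use different bandwidths for the pilot densities and the final smooth.

```python
import numpy as np

def kde(X_eval, X, h):
    # product Gaussian kernel density estimate; X_eval is (m, d), X is (n, d)
    d = X.shape[1]
    u = (X_eval[:, None, :] - X[None, :, :]) / h
    k = np.exp(-0.5 * np.sum(u**2, axis=2)) * (2.0 * np.pi) ** (-d / 2)
    return np.mean(k, axis=1) / h**d

def iv_component(j, xj_grid, X, Y, h):
    """Local IV estimate of g_j = c + m_j via equation (6.8)."""
    n, d = X.shape
    f_all = kde(X, X, h)                                  # f_hat(X_i)
    X_minus_j = np.delete(X, j, axis=1)
    f_minus_j = kde(X_minus_j, X_minus_j, h)              # f_hat_{-j}(X_{-j,i})
    weights = f_minus_j / f_all
    out = np.empty(len(xj_grid))
    for a, xj in enumerate(xj_grid):
        k = np.exp(-0.5 * ((xj - X[:, j]) / h) ** 2) / (h * np.sqrt(2.0 * np.pi))
        out[a] = np.mean(k * weights * Y)                 # (1/n) sum K_h * Z_hat * Y
    return out
```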

6.4 Asymptotic Properties

We next discuss the asymptotic properties of these estimates, both when additivity holds and when it does not.
We first present the asymptotic properties of the integration class of estimators. We consider the case when the additive model is true, and when it is not. Why does it work? The basic motivation for the integration estimator is that integration = averaging = variance reduction; this is the basis of much of estimation theory. It is true that the second order effect in the integration estimator deteriorates with dimension [as in semiparametric problems]. However, I would argue that the second order effect in the backfitting estimator deteriorates when the number of iterations is small; ceteris paribus, higher dimensions lead to fewer iterations.
Perhaps the more significant disadvantage of the integration method is that the curse of dimensionality does not get completely eliminated. Thus one must use bias reduction arguments to achieve the optimal rate in high dimensions, and one might expect poor small sample performance relative to the asymptotics.
We explicitly treat the case with a fixed integrating factor determined using a known density $q$ with marginals $q_j$ and $q_{-j}$, that is,
\begin{align*}
\widetilde{g}_j(x_j) &= \int \widehat{m}(x)\, q_{-j}(x_{-j})\, dx_{-j} \\
\widetilde{m}_j(x_j) &= \widetilde{g}_j(x_j) - \int \widetilde{g}_j(x_j)\, q_j(x_j)\, dx_j \\
\widetilde{m}(x) &= \widehat{m}_0 + \sum_{j=1}^{d} \widetilde{m}_j(x_j), \qquad \widehat{m}_0 = \int \widehat{m}(x)\, q(x)\, dx.
\end{align*}

These estimators are linear whenever the unrestricted estimator $\widehat{m}(x)$ is linear. Specifically,
\[
\widetilde{m}_j(x_j) = \sum_{i=1}^{n} \widetilde{w}_{ni}(x_j)\, Y_i,
\]
where $\widetilde{w}_{ni}(x_j)$ depends only on the covariates $X_1, \ldots, X_n$. In fact, $\widetilde{w}_{ni}(x_j) = \int w_{ni}(x)\, q_{-j}(x_{-j})\, dx_{-j} - \int w_{ni}(x)\, q_{-j}(x_{-j})\, q_j(x_j)\, dx$.
First we argue that if the weight sequence is consistent for $\widehat{m}(x)$ in the sense of Stone (1977), then it is consistent for $\int m(x)\, q_{-j}(x_{-j})\, dx_{-j}$. We just present results for uniform consistency. Similar results hold for $E[|\widetilde{m}_j(X_j) - m_j(X_j)|^p]$.

Theorem 3. Suppose that $\sup_{x \in \mathcal{X}} |\widehat{m}(x) - m(x)| = o_P(\delta_n)$ for some sequence $\delta_n \to 0$. Then
\begin{align*}
\sup_{x_j \in \mathcal{X}_j} |\widetilde{m}_j(x_j) - m_j(x_j)| &= o_P(\delta_n), \\
\sup_{x \in \mathcal{X}} |\widetilde{m}(x) - m(x)| &= o_P(\delta_n).
\end{align*}

The downside of these results is that they require $\widehat{m}(x)$ to be consistent in some sense, which requires that $nh^d \to \infty$ at least. In the sequel we wish to obtain the asymptotic distribution of the marginal integration estimator, and we expect it to converge at the one-dimensional rate. To establish this we need to make a more detailed analysis. A leading question is whether we can achieve optimal rates of convergence for given smoothness, which requires that $h \propto n^{-1/(2r+1)}$, which can only happen when $d < 2r + 1$. The following theorem does not achieve that objective, since it requires smoothness to increase with dimensionality. We use the following regularity conditions:

A1. The kernel $K$ is symmetric about zero and of order $r$, i.e., $\int K(u) u^j\, du = 0$, $j = 1, \ldots, r-1$. Furthermore, $K$ is supported on $[-1, 1]$, bounded, and Lipschitz continuous, i.e., there exists a finite constant $c$ such that $|K(u) - K(v)| \leq c\, |u - v|$ for all $u, v$.

A2. The functions $m(\cdot)$ and $f(\cdot)$ are $r$-times continuously differentiable in each direction, where $r \geq (d-1)/2$. The support of $X$ is the set $\mathcal{X}$.

A3. The density $q(\cdot)$ is continuous on its compact support, which is $\mathcal{Q} = \prod_{j=1}^{d} [\underline{x}_j, \overline{x}_j] \subseteq \mathcal{X}$. $f$ is bounded away from zero on $\mathcal{Q}$.

A4. The conditional variance $\sigma^2(x) = \mathrm{var}(Y \mid X = x)$ is continuous, and is bounded away from zero and infinity on $\mathcal{Q}$.

A5. $E[|Y|^{\rho}] < \infty$ for some $\rho > 5/2$.

A6. $h = \gamma n^{-1/(2r+1)}$ for some $\gamma \in (0, \infty)$.

Define $D^r g(x_1, \ldots, x_d) = \sum_{j=1}^{d} \partial^r g(x_1, \ldots, x_d)/\partial x_j^r$ for any positive integer $r$.

Theorem 4. Suppose that A1-A6 hold. Then,
\begin{align*}
n^{r/(2r+1)} [\widetilde{m}_j(x_j) - m_j(x_j)] &\Longrightarrow N[b_j(x_j), v_j(x_j)] \\
n^{r/(2r+1)} [\widetilde{m}(x) - m(x)] &\Longrightarrow N[b(x), v(x)]
\end{align*}
in distribution, where $b(x) = \sum_{j=1}^{d} b_j(x_j)$, $v(x) = \sum_{j=1}^{d} v_j(x_j)$, and
\begin{align*}
b_j(x_j) &= \gamma^r\, \frac{\mu_r(K)}{r!} \bigg[ \int b_{NW}(x)\, q_{-j}(x_{-j})\, dx_{-j} - \int\!\!\int b_{NW}(x)\, q_{-j}(x_{-j})\, q_j(x_j)\, dx_{-j}\, dx_j \bigg] \\
b_{NW}(x) &= \frac{D^r(mf)(x) - m(x)\, D^r f(x)}{f(x)} \\
v_j(x_j) &= \gamma^{-1}\, \|K\|_2^2 \int \frac{\sigma^2(x)}{f(x)}\, q_{-j}^2(x_{-j})\, dx_{-j}.
\end{align*}
We make some comments on the asymptotic distribution. First, the bias reflects the recentering that goes into $\widetilde{m}_j(x_j)$, but otherwise it is the average of the bias of the input or pilot estimator. We have not in this result assumed that additivity holds; if it does hold, then the bias expression can be simplified, since for example $D^r m(x) = \sum_{j=1}^{d} m_j^{(r)}(x_j)$ is additive. The asymptotic variance is not exactly the average variance of the pilot estimator, which is $\gamma^{-1} \|K\|_2^{2d}\, \sigma^2(x)/f(x)$, since it involves the square of the weighting function. We give some discussion of this issue in the appendix below. The variance reduction that yields the faster rate of convergence changes the form of the asymptotic variance in this way. The form of the asymptotic variance simplifies when $X_j$ and $X_{-j}$ are mutually independent, $\sigma^2(\cdot)$ is constant, and $q_{-j}$ is the covariate density $f_{-j}$: in that case
\[
v_j(x_j) = \gamma^{-1}\, \|K\|_2^2\, \frac{\sigma^2}{f_j(x_j)},
\]

which is the asymptotic variance of the one-dimensional kernel estimator. The asymptotic variance depends on the choice of $q_{-j}$. Linton and Nielsen (1995) obtained the optimal choice in terms of integrated variance over the support of $x_j$ under homoskedasticity; in the general case the optimal weighting is $q_{-j}^{opt}(x_{-j}) = \nu^{-1}(x_{-j}) / \int \nu^{-1}(x_{-j})\, dx_{-j}$, where $\nu(x_{-j}) = \int \{ f_j(x_j)\, \sigma^2(x) / f(x) \}\, dx_j$, in which case the asymptotic variance constant is proportional to $1 / \int \nu^{-1}(x_{-j})\, dx_{-j}$. The results for the empirical weighting are obtained by just replacing $q_{-j}$ by $f_{-j}$ in the bias and asymptotic variance formulae. Fan, Härdle, and Mammen (1998) consider the case with some discrete variables.
The individual estimates $\widetilde{m}_j(x_j)$ are asymptotically uncorrelated, which explains why the variance of $\widetilde{m}(x)$ is just the sum of the individual variances.
Suppose that the observations are from a time series.
There are several ways to conduct inference about the functions $m_j$ and $m$. The simplest approach is to use the pointwise asymptotic normal distribution to construct standard errors that are asymptotically valid at a single or finite number of points. By undersmoothing the estimation one eliminates bias terms from the asymptotic distribution, and then one just requires a consistent estimate of $v_j(x_j)$. In the plug-in method we compute
\[
\widehat{v}_j(x_j) = \gamma^{-1}\, \|K\|_2^2 \int \frac{\widehat{\sigma}^2(x)}{\widehat{f}(x)}\, q_{-j}^2(x_{-j})\, dx_{-j},
\]
where $\widehat{\sigma}^2(x)$ is estimated from the residuals $\widehat{\varepsilon}_i = Y_i - \widehat{m}(X_i)$ or $\widetilde{\varepsilon}_i = Y_i - \widetilde{m}(X_i)$, e.g.,
\[
\widehat{\sigma}^2(x) = \sum_{i=1}^{n} w_{ni}(x)\, \widehat{\varepsilon}_i^2,
\]
where $w_{ni}(x)$ are the smoothing weights. This method is consistent, so that $\widehat{v}_j(x_j) \to v_j(x_j)$ with probability one. Therefore, the interval $\widetilde{m}_j(x_j) \pm z_{\alpha/2}\, n^{-r/(2r+1)} \sqrt{\widehat{v}_j(x_j)}$ has asymptotic coverage of $1 - \alpha$, where $z_{\alpha}$ is the normal critical value. This result can be applied to a finite number of points using the asymptotic independence of the estimates at distinct points. The result also applies to the function $m$ itself. To obtain uniform confidence intervals we can apply results of ..
We next discuss the results of Hengartner and Sperlich (2006), which do achieve the optimal rate for given smoothness regardless of dimensionality. They propose a specific type of integration estimator with an internally normalized pilot, i.e.,
\[
\widetilde{m}_j(x_j) = \int \widehat{m}(x)\, q_{-j}(x_{-j})\, dx_{-j} - \int \widehat{m}(x)\, q(x)\, dx,
\qquad
\widehat{m}(x) = \frac{1}{n} \sum_{i=1}^{n} K_h(x - X_i)\, \frac{Y_i}{\widehat{f}(X_i)},
\]
where $\widehat{f}(x)$ is a standard kernel smoother. They argue that by imposing additional smoothness on the integrating density one can obtain rate optimality.

B1. The multivariate regression function $m(x) = E(Y \mid X = x)$ is $r$ times continuously differentiable in $x_1$, and the conditional variance $\sigma^2(x) = \mathrm{var}(Y \mid X = x)$ is finite and Lipschitz continuous.

B2. The joint density of the covariates $f$ is compactly supported, Lipschitz continuous, and strictly bounded away from zero on the interior of the support of $Q$.

B3. The product measure $Q$ has continuous density $q$ (with respect to Lebesgue measure) bounded away from zero and infinity. Further, the support of $Q$ is contained in the support of $f$.

B4. The bandwidths satisfy $h_j = c n^{-1/(2r+1)}$ for some $c$ with $0 < c < \infty$, $h_\ell = o(1)$, and $n h_1 \cdots h_d \to \infty$.

B5. The density $q_j(x_j) = \int q(x)\, dx_{-j}$ has $r + 1$ continuous and bounded derivatives.

Let
\begin{align*}
b_j(x_j) &= c^r\, \frac{\mu_r(K)}{r!} \bigg[ \frac{1}{f_j(x_j)}\, D^r (m_j f_j)(x_j) - \int m_j(x_j)\, D^r q_j(x_j)\, dx_j \bigg] \\
v_j(x_j) &= c^{-1}\, \frac{\|K\|_2^2}{f_j(x_j)} \Bigg[ \int \big\{ \sigma^2(x) + m^2(x) \big\}\, \frac{q_{-j}^2(x_{-j})}{f(x_{-j} \mid x_j)^2}\, f(x_{-j} \mid x_j)\, dx_{-j} - \bigg( \int m(x)\, q_{-j}(x_{-j})\, dx_{-j} \bigg)^2 \Bigg].
\end{align*}

Theorem 5. Suppose that A1 and B1-B5 hold. Then,
\begin{align*}
n^{r/(2r+1)} [\widetilde{m}_j(x_j) - m_j(x_j)] &\Longrightarrow N[b_j(x_j), v_j(x_j)] \\
n^{r/(2r+1)} [\widetilde{m}(x) - m(x)] &\Longrightarrow N[b(x), v(x)]
\end{align*}
in distribution, where $b(x) = \sum_{j=1}^{d} b_j(x_j)$ and $v(x) = \sum_{j=1}^{d} v_j(x_j)$.
This result is free of the curse of dimensionality. Furthermore, note that the smoothness condition is only with regard to the direction of interest. The reason for this is that the averaging over $x_{-j}$ smooths out the other components. Thus the function
\[
\widetilde{e}_k(u) = \int m(x + u\, e_k)\, q(x)\, dx
\]
(with $e_k$ the $k$-th unit vector) is differentiable in $u$ even when $m_k$ is not differentiable, since by a change of variables and a Taylor expansion we have
\[
\widetilde{e}(u) = \int m(t)\, q(t - u)\, dt
= \int m(t)\, q(t)\, dt - u \int m(t)\, q'(t)\, dt + \frac{u^2}{2} \int m(t)\, q''(t - u^*)\, dt
\]
for some intermediate value $u^*$. Thus the smoothness of $m$ is not needed here. The method oversmooths in the direction not of interest and relies on the recentering to exactly kill the bias.
One question that is not addressed here is that of estimation of the full function. For example, suppose the functions $m_1, \ldots, m_d$ have different smoothness parameters, say $r_1, \ldots, r_d$. If we use different bandwidths $h_1, \ldots, h_d$ in the estimation, with polynomial order tuned to the smoothest case, we obtain the following error terms:
\[
\sum_{j=1}^{d} O(h_j^{r_j}) + \sum_{j=1}^{d} O_p\bigg( \frac{1}{\sqrt{n h_j}} \bigg),
\]
and one cannot achieve a better rate for the sum. If we choose $h_j = n^{-1/(2 r_j + 1)}$, then the rate is dominated by the least smooth component.
The downside of the result is that the asymptotic variance is quite complicated and in general larger than the asymptotic variance of the usual integration estimator. We show below, however, that one can improve the performance of both of these estimators by making a further modification.

6.5 Oracle Estimation

Our purpose here is to define a standard by which to measure estimators of the components. The notion of efficiency in nonparametric models is not as clear and well understood as it is in parametric models. In particular, pointwise mean squared error comparisons do not provide a simple ranking between estimators like kernels, splines, and nearest neighbors. While minimax efficiencies can in principle serve this purpose, they are hard to work with and even harder to justify. An oracle in Greek and Roman polytheism was an agency or medium, usually a priest or a priestess, through which the gods were supposed to speak or prophesy. Suppose that one knew $m_k(\cdot)$, $k\ne j$, perhaps from such an oracle. In this case, one can estimate $m_j(x_j)$ by a one-dimensional (odd order) local polynomial regression smoother $\hat m_j^{orc}$ of the partial error
$$U_{ji} = Y_i - \sum_{k\ne j}m_k(X_{ki}) \tag{6.9}$$

on $X_{ji}$. As we have seen, under standard regularity conditions,
$$\sqrt{nh}\left[\hat m_j^{orc}(x_j) - m_j(x_j) - h^r b^{orc}(x_j)\right] \Longrightarrow N\{0, v^{orc}(x_j)\},$$
where $b^{orc}(\cdot)$ and $v^{orc}(\cdot)$ are bounded continuous functions,
$$b^{orc}(x_j) = c^r\frac{\mu_r(K)}{r!}m_j^{(r)}(x_j),\qquad v^{orc}(x_j) = c^{-1}\nu_r(K)\frac{\sigma_j^2(x_j)}{f_j(x_j)}, \tag{6.10}$$
where $\mu_r(K)$ and $\nu_r(K)$ are constants depending on the kernel function $K$, while
$$\sigma_j^2(x_j) = \operatorname{var}(\varepsilon\mid X_j=x_j) = \int\sigma^2(x)\frac{f(x)}{f_j(x_j)}\,dx_{-j}.$$
If $h\propto n^{-1/(2r+1)}$, then the optimal rate of $n^{-r/(2r+1)}$ is achieved. If one also imposed the knowledge that $E[m_j(X_j)]=0$ one would recenter the estimate and the bias would be recentered, i.e., $m_j^{(r)}(x_j)\mapsto m_j^{(r)}(x_j) - \int m_j^{(r)}(x_j)f_j(x_j)\,dx_j$. By an application of the Cauchy--Schwarz inequality, $\hat m_j^{orc}(x_j)$ has smaller variance than any integration based procedure that uses the same kernel under homoskedasticity. In that case we have
$$1 = \left[\int\frac{q_{-j}(x_{-j})}{f_{-j|j}(x_{-j}\mid x_j)^{1/2}}\,f_{-j|j}(x_{-j}\mid x_j)^{1/2}\,dx_{-j}\right]^2 \le \int\frac{q_{-j}^2(x_{-j})}{f_{-j|j}(x_{-j}\mid x_j)}\,dx_{-j}\int f_{-j|j}(x_{-j}\mid x_j)\,dx_{-j} = f_j(x_j)\int\frac{q_{-j}^2(x_{-j})}{f(x)}\,dx_{-j}.$$

The notion of efficiency here is tied to asymptotic variance, which yields mean squared error holding bias constant, and comes from the classical parametric theory of likelihood. The local likelihood method was introduced in Tibshirani (1984) and has been applied in many other contexts. Tibshirani (1984, Chapter 5) presents the justification for the local likelihood method (in the context of an exponential family): he shows that its asymptotic variance is the same as the asymptotic variance of the MLE of a correctly specified parametric model at the point of interest using the same number of observations as the local likelihood method.

Linton (1996) showed that one can achieve the same efficiency as the oracle estimator by replacing the partial residuals by a suitable estimate. Specifically, define the estimated partial residuals
$$\tilde U_{ji} = Y_i - \sum_{k\ne j}\tilde m_k(X_{ki}),$$
where $\tilde m_k$ are preliminary, e.g., integration-based, estimators. Then let $\tilde m_j^{2step}(x_j)$ be the smooth of $\tilde U_{ji}$ against $X_{ji}$. Under various conditions,
$$n^{r/(2r+1)}\left[\hat m_j^{orc}(x_j) - \tilde m_j^{2step}(x_j)\right] = o_p(1).$$
See Linton (1996) and Linton, Hengartner and Kim (1997). This estimator can be interpreted as a one-step backfitting from specific consistent starting values provided by the integration method. The result that one step is enough is similar to that obtained in parametric problems: Rothenberg and Leenders (1965) and Bickel (1971).
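The two-step construction is straightforward to code. The sketch below (our own illustrative implementation on simulated data, not a published one) uses a crude marginal-integration pilot for the nuisance component and then a one-dimensional local linear smooth of the estimated partial residuals; kernels and bandwidths are convenience choices rather than optimal ones.

```python
import numpy as np

def local_linear(x0, x, y, h):
    """Local linear smooth of y on x at the point x0 with a Gaussian kernel."""
    w = np.exp(-0.5 * ((x - x0) / h) ** 2)
    D = np.column_stack([np.ones_like(x), x - x0])
    beta = np.linalg.solve(D.T @ (D * w[:, None]), (D * w[:, None]).T @ y)
    return beta[0]                               # intercept = fitted value at x0

def pilot_component(Y, X, j, xj, h):
    """Crude marginal-integration pilot for m_j(xj): average the full-dimensional
    Nadaraya-Watson estimator over the empirical distribution of X_{-j}."""
    n, d = X.shape
    vals = np.empty(n)
    for i in range(n):
        z = X[i].copy(); z[j] = xj
        w = np.exp(-0.5 * ((X - z) / h) ** 2).prod(axis=1)
        vals[i] = (w @ Y) / w.sum()
    return vals.mean()

rng = np.random.default_rng(1)
n = 200
X = rng.uniform(-1, 1, size=(n, 2))
m1, m2 = lambda u: np.sin(np.pi * u), lambda u: u ** 3
Y = m1(X[:, 0]) + m2(X[:, 1]) + 0.3 * rng.standard_normal(n)

h = 0.25
# step 1: pilot estimates of the nuisance component m_2 at the data points
m2_pilot = np.array([pilot_component(Y, X, j=1, xj=x2, h=h) for x2 in X[:, 1]])
m2_pilot -= m2_pilot.mean()                      # impose E[m_2(X_2)] = 0
# step 2: smooth the estimated partial residuals on X_1 (the oracle regression)
U1 = Y - Y.mean() - m2_pilot
grid = np.linspace(-0.8, 0.8, 9)
m1_2step = np.array([local_linear(x0, X[:, 0], U1, h) for x0 in grid])
print(np.round(m1_2step - m1(grid), 2))          # small apart from noise and bias
```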

6.6 Generalized Additive Models

Estimation of $m_j(x_j)$ in (4.1) by marginal integration can be carried out in the analogous fashion, since
$$g_j(x_j) = \int G(m(x))\,dQ_{-j}(x_{-j}) = c + m_j(x_j) + \sum_{k\ne j}\int m_k(x_k)\,dQ_{-j}(x_{-j}) \equiv m_j(x_j) + c_j.$$
Let $\hat m(x)$ be an estimator of $E(Y\mid X=x)$. Then let
$$\tilde g_j(x_j) = \int G(\hat m(x))\,dQ_{-j}(x_{-j}) \tag{6.11}$$
$$\tilde m_j(x_j) = \tilde g_j(x_j) - \int\tilde g_j(x_j)\,dQ_j(x_j) \tag{6.12}$$
$$\tilde m(x) = F\!\left(\hat c + \sum_{j=1}^d\tilde m_j(x_j)\right),\qquad \hat c = \int G(\hat m(x))\,dQ(x). \tag{6.13}$$
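A minimal sketch of the recipe (6.11)–(6.13) for a binary-response generalized additive model with logistic link is given below; the integrating measure is taken to be the empirical distribution of the covariates, the link is clipped for numerical stability, and the function names are ours rather than from any package.

```python
import numpy as np

def nw(x, X, Y, h):
    """Full-dimensional Nadaraya-Watson estimate of E[Y|X=x] (product Gaussian kernel)."""
    w = np.exp(-0.5 * ((X - x) / h) ** 2).prod(axis=1)
    return (w @ Y) / w.sum()

# simulated generalized additive model: P(Y=1|X) = F(m1(x1) + m2(x2)), F logistic
rng = np.random.default_rng(2)
n, d, h = 500, 2, 0.25
X = rng.uniform(-1, 1, size=(n, d))
eta = np.sin(np.pi * X[:, 0]) + 0.8 * X[:, 1]
Y = (rng.uniform(size=n) < 1 / (1 + np.exp(-eta))).astype(float)

clip = lambda m: np.clip(m, 1e-3, 1 - 1e-3)
G = lambda m: np.log(clip(m) / (1 - clip(m)))    # link G = F^{-1}
F = lambda s: 1 / (1 + np.exp(-s))               # inverse link

def g_tilde(j, xj):
    # (6.11): average G(m_hat) over the empirical distribution of X_{-j}, x_j held fixed
    vals = []
    for i in range(n):
        z = X[i].copy(); z[j] = xj
        vals.append(G(nw(z, X, Y, h)))
    return np.mean(vals)

grid = np.linspace(-0.8, 0.8, 9)
g1 = np.array([g_tilde(0, v) for v in grid])
m1_tilde = g1 - g1.mean()                        # (6.12): recentre (grid approximation of the Q_1 integral)
c_hat = np.mean([G(nw(x, X, Y, h)) for x in X])  # (6.13): overall constant
print(np.round(m1_tilde, 2), round(c_hat, 2))
# the fitted surface would then be F(c_hat + sum_j m_j_tilde(x_j)), as in (6.13)
```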

The instrumental variable approach can also be applied to generalized additive models. Under additivity we have
$$m_j(X_j) = \frac{E\left[Z\,G(m(X))\mid X_j\right]}{E\left[Z\mid X_j\right]} \tag{6.14}$$
for the $Z$ defined in (6.7). Since $m(\cdot)$ is unknown, we need consistent estimates of $m(X)$ in a preliminary step, and then the calculation in (6.14) is feasible.

We first give the properties of the marginal integration estimators defined in (6.11)–(6.13), where $\hat m(x)$ is the local constant estimator.
Theorem 8. Suppose that A1–A6 hold for $r=2$ and that $F$, $G$ are twice continuously differentiable over the relevant compact interval. Then,
$$n^{r/(2r+1)}\left[\tilde m_j(x_j) - m_j(x_j)\right] \Longrightarrow N\left[b_j(x_j), v_j(x_j)\right]$$
$$n^{r/(2r+1)}\left[\tilde m(x) - m(x)\right] \Longrightarrow N\left[b(x), v(x)\right]$$
in distribution, where $b(x) = F'\big(m_0+\sum_{j=1}^dm_j(x_j)\big)\sum_{j=1}^db_j(x_j)$,
$v(x) = \big\{F'\big(m_0+\sum_{j=1}^dm_j(x_j)\big)\big\}^2\sum_{j=1}^dv_j(x_j)$, and
$$b_j(x_j) = \frac{\mu_r(K)}{r!}\left[\int G'(m(x))\,b_{NW}(x)\,q_{-j}(x_{-j})\,dx_{-j} - \int G'(m(x))\,b_{NW}(x)\,q_{-j}(x_{-j})\,q_j(x_j)\,dx\right]$$
$$v_j(x_j) = c^{-1}\|K\|_2^2\int G'(m(x))^2\,\sigma^2(x)\frac{q_{-j}^2(x_{-j})}{f(x)}\,dx_{-j}.$$

This result parallels the main result for the integration estimator in additive nonparametric regression. The variance has an additional factor due to $G'(m(x))^2$ that is equal to one in the additive model. Note that even though $G(m(x))$ is additive, the function $m$ is not, which explains the form of the bias. We have, for example, $\partial m(x)/\partial x_j = F'(G(m(x)))m_j'(x_j)$ and $\partial^2m(x)/\partial x_j^2 = F''(G(m(x)))(m_j'(x_j))^2 + F'(G(m(x)))m_j''(x_j)$, so one can simplify the bias formula a bit. These estimators are inefficient, as was shown in Linton (1997).

The argument is basically that
$$G(\hat m(x)) - G(m(x)) \simeq G'(m(x))\{\hat m(x)-m(x)\} + \frac{1}{2}G''(m(x))\{\hat m(x)-m(x)\}^2,$$
where the second term is of order $h^4 + 1/nh^d$. The first term is just like the usual estimation error for additive nonparametric regression.

Härdle, Huet and Mammen (2004) propose a bootstrap algorithm for conducting inference in a generalized additive model. They also allowed for parametric effects.

6.7 Oracle Performance

We extend the concept of oracle efficiency to nonlinear models like generalized additive models. In this case it is usually not possible to define the oracle residuals (6.9). Instead we suggest the following solution: impose our knowledge about $m_0 + m_{-j}(X_{-ji})$ inside a suitable criterion function. This emphasizes that the concept is tied to the specific estimation procedure one would use in a one-dimensional setting. This seems reasonable enough. We next show how this is applied in the context of a partial model specification, or conditional moment restrictions, and a full model specification or likelihood setting.

6.7.1 Conditional Moment Specification

We suppose that the distribution of $Y\mid X$ is completely unspecified apart from the restriction (4.21). In this case, it is appropriate to look at sample first order condition estimators. In particular, consider the local polynomial GMM estimator (2.13) specialized to the univariate case, i.e., consider the objective function
$$\|G_{nj}(\theta; x_j)\| = \left\|\sum_{i=1}^n K_h(x_j - X_{ji})\,\rho\big(Y_i;\, m_{-j}(X_{-ji});\, \theta^\top P_{1,p}(X_{ji}-x_j)\big)\,P_{1,p}(X_{ji}-x_j)\right\|,$$
where $P_{1,p}(u) = (1,u,\dots,u^p)^\top$. Let $\hat\theta = \arg\min_{\theta\in\mathbb R^{p+1}}\|G_{nj}(\theta;x_j)\|$, and let $\tilde m_j(x_j) = \hat\theta_0(x_j)$. The properties of this estimator can be obtained by straightforward techniques. Let $\rho_j(y,x,z) = \partial\rho(y,x,z)/\partial z$, and denote
$$i_{jj}(x_j) = \int\rho_j\big(y, m_{-j}(x_{-j}), m_j(x_j)\big)\,f_{Y|X}(y\mid x)f_X(x)\,dy\,dx_{-j}$$
$$\omega_j(x_j) = \int\rho^2\big(y, m_{-j}(x_{-j}), m_j(x_j)\big)\,f_{Y|X}(y\mid x)f_X(x)\,dy\,dx_{-j}.$$

Theorem 9. Suppose that the conditions A given in the appendix hold. Then:
$$n^{r/(2r+1)}\left[\tilde m_j(x_j) - m_j(x_j)\right] \Longrightarrow N\left[b_j(x_j), v_j(x_j)\right],$$
$$b_j(x_j) = \frac{\mu_p(K)}{p!}m_j^{(p)}(x_j),\qquad v_j(x_j) = \nu_p(K)\frac{\omega_j(x_j)}{i_{jj}^2(x_j)}.$$
A special case of this is given by the generalized additive model where it is assumed that (4.1) holds, i.e.,
$$\rho\big(y, m_{-j}(x_{-j}), m_j(x_j)\big) = w(x)\big(y - F(m_0 + m_{-j}(x_{-j}) + m_j(x_j))\big),$$
where $w$ is any function of $x$. By taking $w(x) = F'\big(m_0 + m_{-j}(x_{-j}) + m_j(x_j)\big)$, together with the local polynomial weights $P_{1,p}(X_{ji}-x_j)$, we obtain the first order condition from the least squares criterion
$$Q_n(\theta) = \frac{1}{n}\sum_{i=1}^n K_h(x_j - X_{ji})\left[Y_i - F\big(m_0 + m_{-j}(X_{-ji}) + \theta^\top P_{1,p}(X_{ji}-x_j)\big)\right]^2. \tag{6.15}$$
In this case
$$i_{jj}(x_j) = \int F'[G\{m(x)\}]^2f_X(x)\,dx_{-j},\qquad \omega_j(x_j) = \int\sigma^2(x)F'[G\{m(x)\}]^2f_X(x)\,dx_{-j}.$$
When $Y\mid X$ is homoskedastic with constant variance $\sigma^2$, $\omega_j(x_j)$ is proportional to $i_{jj}(x_j)$ and one obtains the simpler asymptotic variance
$$\frac{\sigma^2}{nh}\frac{\|K\|_2^2}{i_{jj}(x_j)}.$$
In this case, the asymptotic variance is less than that of the Linton and Härdle (1996) procedure. The relevant comparison is between
$$V_{LH} = \int\frac{1}{F'[G\{m(x)\}]^2}\frac{q_{-j}^2(x_{-j})}{f(x)}\,dx_{-j}$$
[we have used the fact that $G'\{m(x)\} = 1/F'[G\{m(x)\}]$], and
$$V_E = \frac{1}{\int F'[G\{m(x)\}]^2f(x)\,dx_{-j}}.$$
Applying the Cauchy--Schwarz inequality, one obtains $V_E\le V_{LH}$, and the oracle estimator has lower variance than the integration estimator. In the heteroskedastic case, however, it is not possible to (uniformly) rank the two estimators unless the form of heteroskedasticity is restricted in some way; see the next section.

The bias of $\tilde m_j(x_j)$ is what you would expect if $m_0 + m_{-j}(\cdot)$ were known to be exactly zero. In the Linton and Härdle procedure there is an additional multiplicative factor in the bias,
$$\int\frac{q_{-j}(x_{-j})}{F'[G\{m(x)\}]}\,dx_{-j},$$
which can be either greater or less than one.
Note that $\tilde m_j(x_j)$ is not guaranteed to satisfy $\int\tilde m_j(x_j)q_j(x_j)\,dx_j = 0$, but the recentred estimate
$$\tilde m_j^c(x_j) = \tilde m_j(x_j) - \int\tilde m_j(x_j)q_j(x_j)\,dx_j$$
does have this property. In fact, the variances of $\tilde m_j^c(x_j)$ and $\tilde m_j(x_j)$ are the same to first order, while the bias of $\tilde m_j^c(x_j)$ has $m_j''(x_j)$ replaced by $m_j''(x_j) - \int m_j''(x_j)q_j(x_j)\,dx_j$. According to integrated mean squared error, then, we are better off recentering because
$$\int\left[m_j''(x_j) - \int m_j''(x_j)q_j(x_j)\,dx_j\right]^2q_j(x_j)\,dx_j \le \int m_j''(x_j)^2q_j(x_j)\,dx_j.$$

6.7.2 Full Model Specification

In many situations, the entire conditional distribution of $Y\mid X$ is completely specified by the mean equation (4.1), e.g., in one-parameter exponential families. Thus, suppose that the conditional distribution of $Y$ given $X=x$ belongs to the family
$$f_{Y|X}(y\mid x) = \exp\{yS(x) - b(S(x)) + c(y)\} \tag{6.16}$$
for known functions $b(\cdot)$ and $c(\cdot)$. Suppose also that (4.1) holds and that $G$ is the canonical link, i.e., $G = (b')^{-1}$, so that $b'(S(x)) = m(x)$ and $\sigma^2(x) = b''(S(x))$, while $S(x) = (G\circ b')^{-1}(G(m(x))) \equiv S_0\{G(m(x))\}$. See Gourieroux, Monfort, and Trognon (1984a,b) for parametric theory and applications in economics. We can take account of the additional information contained in (6.16) by employing the partial pseudo-likelihood criterion function
$$\ell_n(\theta) = \frac{1}{n}\sum_{i=1}^n K_h(x_j - X_{ji})\left\{Y_iS_i(\theta) - b(S_i(\theta))\right\}, \tag{6.17}$$
where $S_i(\theta) = S_0\big(m_0 + m_{-j}(X_{-ji}) + \theta^\top P_{1,p}(X_{ji}-x_j)\big)$. Let $\tilde\theta$ maximize $\ell_n(\theta)$, and let $\hat m_j^*(x_j) = \tilde\theta_0(x_j)$ be our infeasible estimate of $m_j(x_j)$. We have the following result.

Theorem 11. Under the regularity conditions A given in the appendix, we have under the specification (6.16):
$$n^{r/(2r+1)}\left[\hat m_j^*(x_j) - m_j(x_j)\right] \Longrightarrow N\left[b_j(x_j), v_j(x_j)\right],$$
$$b_j(x_j) = \frac{\mu_p(K)}{p!}m_j^{(p)}(x_j),\qquad v_j(x_j) = \nu_p(K)\frac{1}{i_j(x_j)},$$
$$i_j(x_j) = \int b''\big[S_0\{G(m(x))\}\big]\,S_0'\big[G(m(x))\big]^2f_X(x)\,dx_{-j}.$$
This estimator is more efficient than both the integration estimator and the two-step estimator based on the least squares criterion when (6.16) is true. The bias is as in Theorem 1, and is design adaptive. We next discuss a number of leading examples and calculate the information quantity $i_j(x_j)$ for them.
Examples

Binomial. Suppose that $Y_i\in\{0,1\}$ and that $m(x) = \Pr(Y=1\mid X=x)$, where for some known $G$ we have (4.1). Then take $S = \ln F/(1-F)$ and $b' = F = G^{-1}$. Therefore, under (6.16) the asymptotic variance of $\hat m_j^*(x_j)$ is proportional to $1/i_j(x_j)$, where
$$i_j(x_j) = \int\frac{F'[G\{m(x)\}]^2}{m(x)\{1-m(x)\}}f(x)\,dx_{-j},$$
while the variance of the Linton and Härdle (1996) procedure is proportional to
$$V_{LH}(x_j) = \int\frac{m(x)\{1-m(x)\}}{F'[G\{m(x)\}]^2}\frac{q_{-j}^2(x_{-j})}{f(x)}\,dx_{-j},$$
where for any joint distribution we have
$$\frac{1}{i_j(x_j)} \le V_{LH}(x_j)$$
by the Cauchy--Schwarz inequality.
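To illustrate the infeasible (oracle) local likelihood fit in the binomial case, the following Python sketch maximizes the kernel-weighted Bernoulli log-likelihood by Newton--Raphson with the nuisance part $m_0 + m_{-j}(X_{-ji})$ entered as a known offset; it is a toy illustration under our own naming and tuning choices, not a published implementation.

```python
import numpy as np

def local_logit_oracle(xj0, Xj, offset, Y, h, p=1, iters=25):
    """Oracle local likelihood fit at xj0 for a binomial generalized additive model:
    maximizes sum_i K_h(xj0 - Xj_i) [Y_i s_i - log(1 + exp(s_i))] over theta,
    where s_i = offset_i + theta' (1, Xj_i - xj0, ...) and offset_i is the (oracle)
    known nuisance part m_0 + m_{-j}(X_{-j,i}). Plain Newton-Raphson iteration."""
    u = Xj - xj0
    D = np.column_stack([u ** k for k in range(p + 1)])     # local polynomial design
    w = np.exp(-0.5 * (u / h) ** 2)                         # Gaussian kernel weights
    theta = np.zeros(p + 1)
    for _ in range(iters):
        s = offset + D @ theta
        mu = 1 / (1 + np.exp(-s))
        score = D.T @ (w * (Y - mu))
        hess = (D * (w * mu * (1 - mu))[:, None]).T @ D
        theta += np.linalg.solve(hess, score)
    return theta[0]                                          # estimate of m_j(xj0)

# toy model: P(Y=1|X) = logistic(m1(x1) + m2(x2)), and the oracle knows m2
rng = np.random.default_rng(3)
n = 600
X = rng.uniform(-1, 1, size=(n, 2))
m1, m2 = lambda u: np.sin(np.pi * u), lambda u: 0.7 * u
Y = (rng.uniform(size=n) < 1 / (1 + np.exp(-(m1(X[:, 0]) + m2(X[:, 1]))))).astype(float)

grid = np.linspace(-0.8, 0.8, 9)
fit = [local_logit_oracle(x0, X[:, 0], m2(X[:, 1]), Y, h=0.25) for x0 in grid]
print(np.round(np.array(fit) - m1(grid), 2))
```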

Poisson. Suppose that $Y_i\in\{0,1,2,\dots\}$ with conditional distribution
$$\Pr(Y=k\mid X=x) = \exp[-m(x)]\frac{m(x)^k}{k!}$$
for some function $m(x) = E(Y\mid X=x)$ that satisfies (4.1) with $G=\log$. Poisson regression models with flexible functional form have been considered in Hausman, Hall, and Griliches (1984). The variance of the Linton and Härdle (1996) procedure is proportional to
$$V_{LH}(x_j) = \exp(-m_0 - m_j(x_j))\int\frac{\exp(-m_{-j}(x_{-j}))\,q_{-j}^2(x_{-j})}{f(x)}\,dx_{-j},$$
while under (6.16)
$$\frac{1}{i_j(x_j)} = \frac{\exp(-m_0 - m_j(x_j))}{\int\exp(m_{-j}(x_{-j}))f(x)\,dx_{-j}}.$$

Variance Models (ARCH). Suppose that with probability one $E(Y\mid X=x)=0$ and
$$\operatorname{var}(Y_i\mid X_i=x) = \sigma^2(x) = F\big[m_0 + m_j(x_j) + m_{-j}(x_{-j})\big] \tag{6.18}$$
for some positive function $F$. When $F=\exp$ and $X_i = (Y_{i-1},\dots,Y_{i-d})$ we have the multiplicative volatility model of Yang and Härdle (1997). The partial least squares criterion would be
$$Q_n(\theta) = \frac{1}{2n}\sum_{i=1}^n K_h(x_j - X_{ji})\left[Y_i^2 - \sigma_i^2(\theta)\right]^2,$$
where $\sigma_i^2(\theta) = F\big[m_0 + m_{-j}(x_{-j}) + \theta_0 + \theta_1(X_{ji}-x_j)\big]$, which is of the form (6.15) with $Y$ replaced by $Y^2$; while the partial pseudo-likelihood function [for which one posits that $Y\mid X=x$ is $N(0,\sigma^2(x))$] is
$$\ell_n(\theta) = -\frac{1}{2n}\sum_{i=1}^n K_h(x_j - X_{ji})\left[\log\sigma_i^2(\theta) + \frac{Y_i^2}{\sigma_i^2(\theta)}\right], \tag{6.19}$$
which is of the form (6.17) with $Y$ replaced by $Y^2$. In this case,
$$v_j(x_j) = \int\frac{\kappa_4(x)+2}{4}\left(\frac{F'}{F}\right)^2\{m_0 + m_j(x_j) + m_{-j}(x_{-j})\}\,f(x)\,dx_{-j}$$
$$i_{jj}(x_j) = i_j(x_j) = \frac{1}{2}\int\left(\frac{F'}{F}\right)^2\{m_0 + m_j(x_j) + m_{-j}(x_{-j})\}\,f(x)\,dx_{-j},$$
where $\kappa_4(x)$ is the conditional fourth cumulant of $Y\mid X=x$. When $F=\exp$, the information is constant; in fact $i_j(x_j)=1/2$. When $F$ is the identity, $i_j(x_j) = \int F^{-2}\{m_0 + m_j(x_j) + m_{-j}(x_{-j})\}\,f(x)\,dx_{-j}/2$.
Chapter 7

Backfitting

7.1 Introduction

We next discuss an alternative method for fitting additive models called "backfitting". The name is a bit strange and reminiscent of a recent Oscar-winning movie that featured two cowboys enjoying the views of Montana, but this method has a nice starting point and some practical and conceptual advantages. This method has been extensively discussed and generalized.

7.2 Classical Backfitting

Recall the system (3.11) of operator equations. How do we use this to obtain estimators of the functions $m_j$? One approach is to find a sample analogue of the first order system and solve it. This can be thought of as an infinite dimensional Z-estimator in the terminology of Van der Vaart (1998). The sample analogue of the projection operator $P_j$ is the sample smoothing matrix $W_j$ (the $n\times n$ smoother matrix used in computing $\hat E(\cdot\mid X_j)$). Therefore, consider the corresponding sample first order condition
$$\underbrace{\begin{pmatrix} I & W_1 & \cdots & W_1\\ W_2 & I & \cdots & W_2\\ \vdots & & \ddots & \vdots\\ W_d & W_d & \cdots & I\end{pmatrix}}_{\mathcal W:\ nd\times nd}\underbrace{\begin{pmatrix}\tilde m_1\\ \tilde m_2\\ \vdots\\ \tilde m_d\end{pmatrix}}_{\tilde m:\ nd\times 1} = \underbrace{\begin{pmatrix}W_1y\\ W_2y\\ \vdots\\ W_dy\end{pmatrix}}_{w:\ nd\times 1} \tag{7.1}$$
where $y = (Y_1,\dots,Y_n)^\top$ and $\tilde m_j = (\tilde m_j(X_{j1}),\dots,\tilde m_j(X_{jn}))^\top$. The estimator $\tilde m$ can then be defined through $\tilde m = \mathcal W^{-1}w$ when this inverse exists. However, in practice the inversion of $\mathcal W$ is quite difficult when $n$ is large. One issue here is that for many smoothers $W_ji = i$, where $i$ is the $n\times 1$ vector of ones, which means that one cannot guarantee uniqueness of the solutions to (7.1). Hastie and Tibshirani (1990) recommended recentering the smoothers, so that we replace $W_j$ by $W_j^* = (I - ii^\top/n)W_j$, in which case $W_j^*i = 0$.
In the bivariate case there is a simple solution to (7.1):
$$\tilde m_1 = \left[I - (I - W_1W_2)^{-1}(I - W_1)\right]y$$
$$\tilde m_2 = \left[I - (I - W_2W_1)^{-1}(I - W_2)\right]y,$$
provided the inverses exist. These only involve inverting $n\times n$ matrices. However, in some cases even this can be computationally impossible, say if $n$ is huge. In addition, direct unrestricted matrix inversion can be quite inaccurate even for moderately sized matrices. Therefore, an alternative approach is called for. This is based on the idea of alternating projections, a subject that originates with Von Neumann (1933). We give the following result, which generalizes that and is due to Halperin (1962).

Theorem 2. Suppose that $H$ is a Hilbert space, $S$ is a subspace of $H$ and $M$ is the projection onto $S$. Let $H_1,\dots,H_J$ be closed subspaces of $H$ so that $S = H_1\cap\dots\cap H_J$, and let $M_j$ be the projection onto $H_j$, $j=1,\dots,J$. Form a cycle
$$T = M_JM_{J-1}\cdots M_1.$$
Then $T^K$ converges strongly to $M$ as $K\to\infty$; that is,
$$T^K(f) \to M(f)$$
as $K\to\infty$ for any $f\in H$.

This says that by computing a sequence of perhaps simpler projections one can obtain the projection of interest in the limit. The rate at which this process converges depends on the angle between the subspaces; if this is less than one, then convergence is geometrically fast. In the additive nonparametric regression case the projections $P_j$ are just the conditional expectation operators $E(\cdot\mid X_j)$, but the projection $M$ onto (in this case) the space of additive functions $S$ is more complicated and not available in "closed form". One can arrive at $M$ by a sequence of one dimensional projections.
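The following small numpy illustration of alternating projections (two planes in $\mathbb R^3$ intersecting in a line) shows the cycle $T = M_2M_1$ converging to the projection onto the intersection, exactly as the theorem asserts; the subspaces are our own toy choices.

```python
import numpy as np

def proj(A):
    """Orthogonal projection matrix onto the column space of A."""
    return A @ np.linalg.pinv(A)

# two planes in R^3 whose intersection is the line spanned by (1, 1, 1)
H1 = np.array([[1., 0.], [1., 1.], [1., 0.]])   # the plane {x1 = x3}
H2 = np.array([[1., 0.], [1., 0.], [1., 1.]])   # the plane {x1 = x2}
S = np.array([[1.], [1.], [1.]])                # their intersection

M1, M2, M = proj(H1), proj(H2), proj(S)
T = M2 @ M1                                      # one cycle of alternating projections
f = np.array([3., -1., 2.])

approx = f.copy()
for K in range(30):
    approx = T @ approx                          # T^K f
print(np.round(approx, 4), np.round(M @ f, 4))   # both equal the projection onto S
```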
We next apply the theorem to the problem at hand, of estimating additive nonparametric regression. One can directly apply this result to the population problem, but the application to the empirical one is a bit more problematic. But if we take it literally, we should compute the sequence of one dimensional nonparametric smooths
$$\tilde m_j^{[r]} = W_j\left(y - \sum_{j'\ne j}\tilde m_{j'}^{[r-1]}\right)$$
for $j=1,\dots,d$ and $r=0,1,\dots$ until convergence. One issue is that the operators $W_j$ are not necessarily projections, since they are not idempotent nor are they symmetric in general. Buja, Hastie and Tibshirani (1989) prove convergence of this algorithm only for the case that the matrices $W_j$ are projections.
In practice, the backfitting (Gauss--Seidel) algorithm is often used instead; this is one of a number of well-developed methods for solving large linear systems, Golub and Van Loan (1996). This modification just involves using the most up-to-date information on each component within each cycle. The algorithm is as follows (a code sketch is given at the end of this subsection):

1. For each $j=1,\dots,d$ compute $\tilde m_j^{[0]} = W_jy$; and

2. For each $j=1,\dots,d$ and $r=1,2,\dots$
$$\tilde m_j^{[r]} = W_j\left(y - \sum_{j'<j}\tilde m_{j'}^{[r]} - \sum_{j'>j}\tilde m_{j'}^{[r-1]}\right);$$

3. Repeat until some convergence criterion is satisfied, based for example on the sum of squared residuals.

Each step involves just one dimensional smoothing. The estimators are linear in $y$. There are some problems with this algorithm. A sufficient condition ($d=2$) for convergence of backfitting is that either $\|W_1W_2\|<1$ or both $W_1$ and $W_2$ are symmetric smoothers, e.g., cubic splines. The standard approach is to iteratively solve empirical versions of the above equations; see Breiman and Friedman (1985), Buja, Hastie and Tibshirani (1989), and Hastie and Tibshirani (1990, 1991). These estimators are computed at each observation point and so are quite computationally demanding as written when $n$ or $d$ is large.
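For concreteness, here is a minimal sketch of the Gauss--Seidel backfitting loop with recentred Nadaraya--Watson smoother matrices on simulated data; the smoother, kernel, bandwidth and convergence rule are illustrative choices of ours, not part of any particular published implementation.

```python
import numpy as np

def smoother_matrix(x, h):
    """n x n Nadaraya-Watson smoother matrix for a single covariate (Gaussian kernel)."""
    K = np.exp(-0.5 * ((x[:, None] - x[None, :]) / h) ** 2)
    return K / K.sum(axis=1, keepdims=True)

def backfit(Y, X, h, iters=50, tol=1e-8):
    """Classical backfitting (Gauss-Seidel) with centred smoothers W_j* = (I - ii'/n)W_j."""
    n, d = X.shape
    C = np.eye(n) - np.ones((n, n)) / n
    W = [C @ smoother_matrix(X[:, j], h) for j in range(d)]
    alpha = Y.mean()
    m = [W[j] @ (Y - alpha) for j in range(d)]              # step 1: initial smooths
    for _ in range(iters):
        m_old = [mj.copy() for mj in m]
        for j in range(d):                                  # step 2: cycle with fresh info
            partial = Y - alpha - sum(m[k] for k in range(d) if k != j)
            m[j] = W[j] @ partial
        if sum(np.max(np.abs(m[j] - m_old[j])) for j in range(d)) < tol:
            break                                           # step 3: convergence check
    return alpha, m

rng = np.random.default_rng(4)
n = 300
X = rng.uniform(-1, 1, size=(n, 2))
Y = 1 + np.sin(np.pi * X[:, 0]) + X[:, 1] ** 2 - 1 / 3 + 0.3 * rng.standard_normal(n)
alpha, m = backfit(Y, X, h=0.2)
print(round(alpha, 2), round(np.corrcoef(m[0], np.sin(np.pi * X[:, 0]))[0, 1], 3))
```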

7.3 Smooth Backfitting

We now discuss an alternative estimation method introduced in Mammen, Linton, and Nielsen (1999). Instead of working with an empirical version of the projection first order conditions, they start from an empirical analogue of the projection problem itself. They define estimates $\tilde m_j(\cdot)$ as the minimizers of the following empirical norm
$$\|\hat m - m\|_{\hat f}^2 = \int\left[\hat m(x) - m_0 - m_1(x_1) - \dots - m_d(x_d)\right]^2\hat f(x)\,dx, \tag{7.2}$$
where the minimization runs over all functions $m(x) = m_0 + \sum_jm_j(x_j)$ with $\int m_j(x_j)\hat f_j(x_j)\,dx_j = 0$. Here, $\hat f(x) = n^{-1}\sum_{i=1}^n\prod_{j=1}^dK_h(x_j, X_{ji})$ is the density estimator with marginals $\hat f_j(x_j) = \int\hat f(x)\,dx_{-j}$ [this is the one-dimensional kernel density estimate $\hat f_j(x_j) = n^{-1}\sum_{i=1}^nK_h(x_j, X_{ji})$], while $\hat m(x)$ is the unrestricted Nadaraya--Watson estimator
$$\hat m(x) = \frac{\sum_{i=1}^n\prod_{j=1}^dK_h(x_j, X_{ji})\,Y_i}{\sum_{i=1}^n\prod_{j=1}^dK_h(x_j, X_{ji})}. \tag{7.3}$$

A minimizer of (7.2) exists if the density estimate $\hat f$ is non-negative. Equation (7.2) means that $\tilde m(x) = \tilde m_0 + \tilde m_1(x_1) + \dots + \tilde m_d(x_d)$ is the projection in the space $L_2(\hat f)$ of $\hat m$ onto the subspace of additive functions $\{m\in L_2(\hat f) : m(x) = m_0 + m_1(x_1) + \dots + m_d(x_d)\}$. This is a central point. For projection operators backfitting is well understood (method of alternating projections, see above). Therefore, this interpretation will enable us to understand convergence of the backfitting algorithm and the asymptotics of $\tilde m_j$. We remark that not every backfitting algorithm based on iterative smoothing can be interpreted as an alternating projection method. We also remark that the estimator defined as a minimizer of (7.2) can be thought of as an infinite dimensional M-estimator in the terminology of Van der Vaart (1998), which emphasizes the difference between this method and the classical backfitting method described above.
The solution to (7.2) is characterized by the following system of equations ($j=1,\dots,d$):
$$\tilde m_j(x_j) = \int\hat m(x)\frac{\hat f(x)}{\hat f_j(x_j)}\,dx_{-j} - \sum_{k\ne j}\int\tilde m_k(x_k)\frac{\hat f(x)}{\hat f_j(x_j)}\,dx_{-j} - \tilde m_0 \tag{7.4}$$
$$0 = \int\tilde m_j(x_j)\hat f_j(x_j)\,dx_j. \tag{7.5}$$
Straightforward algebra gives
$$\int\hat m(x)\frac{\hat f(x)}{\hat f_j(x_j)}\,dx_{-j} = \frac{n^{-1}\sum_{i=1}^nK_h(x_j, X_{ji})Y_i}{\hat f_j(x_j)} = \hat m_j(x_j),$$
because $\int\prod_{\ell\ne j}K_h(x_\ell, X_{\ell i})\,dx_{-j} = 1$, where $\hat m_j(x_j)$ is exactly the corresponding univariate Nadaraya--Watson estimator. Furthermore, $\tilde m_0 = \int\hat m(x)\hat f(x)\,dx$, and because $\int\prod_{\ell=1}^dK_h(x_\ell, X_{\ell i})\,dx = 1$, we find, as in Hastie and Tibshirani (1991), that $\tilde m_0 = n^{-1}\sum_{i=1}^nY_i$, i.e., that $\tilde m_0$ is the sample mean. Therefore, $\tilde m_0$ is a $\sqrt n$-consistent estimate of the population mean and the randomness from this estimation is of smaller order and can be effectively ignored. Note also that
$$\tilde m_0 = \int\hat m_j(x_j)\hat f_j(x_j)\,dx_j = \bar Y\qquad\text{for } j=1,\dots,d. \tag{7.6}$$

We therefore define a backfitting estimator $\tilde m_j(x_j)$, $j=1,\dots,d$, as a solution to the system of equations [$j=1,\dots,d$]
$$\tilde m_j(x_j) = \hat m_j(x_j) - \sum_{k\ne j}\int\tilde m_k(x_k)\frac{\hat f(x)}{\hat f_j(x_j)}\,dx_{-j} - \tilde m_0,$$
$$0 = \int\tilde m_j(x_j)\hat f_j(x_j)\,dx_j,$$
with $\tilde m_0$ defined by (7.6). Up to now we have assumed that multivariate estimates of the density and of the regression function exist for all $x$. This assumption is not reasonable for large dimensions $d$ (or at least such estimates can perform very poorly). Furthermore, this assumption is not necessary.
Note that (7.4) can be rewritten as
$$\tilde m_j(x_j) = \hat m_j(x_j) - \sum_{k\ne j}\int\tilde m_k(x_k)\frac{\hat f_{j,k}(x_j,x_k)}{\hat f_j(x_j)}\,dx_k - \tilde m_0, \tag{7.7}$$
where $\hat f_{j,k}(x_j,x_k) = n^{-1}\sum_{i=1}^nK_h(x_j, X_{ji})K_h(x_k, X_{ki})$ is the two-dimensional marginal of the full dimensional kernel density estimate $\hat f(x)$. In this equation only one and two dimensional marginals of $\hat f$ are used. The integrals are computed numerically. The estimator can be computed on a grid of points in the covariate support $I_1\times\dots\times I_d$, so it does not need residuals as in the standard backfitting approach.

This estimator has been called smooth backfitting (SBF) by Nielsen and Sperlich (2005). We do not have to calculate (partial) residuals for the iteration. Moreover, it is enough to calculate each component only on the same grid as we use for calculating the integrals in equation (7.7). Finally, neither the $\hat m_j$, nor $\bar Y$, nor the expressions $\hat f_{jk}(x_j,x_k)/\hat f_j(x_j)$ need to be updated in the iteration; we must calculate them only once. This can be done simultaneously, resulting in a computationally efficient procedure that works satisfactorily even in very large dimensional nonparametric regression problems.
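The pairwise-marginal form (7.7) makes a grid implementation simple. The sketch below (our own, with Gaussian kernels, no boundary correction, and Riemann-sum integrals on a common grid) precomputes $\hat m_j$, $\hat f_j$ and $\hat f_{jk}$ once and then iterates the update (7.7) together with the recentring (7.5).

```python
import numpy as np

def sbf(Y, X, grid, h, iters=30):
    """Smooth backfitting via the pairwise-marginal update (7.7) on a common grid."""
    n, d = X.shape
    G, dx = len(grid), grid[1] - grid[0]
    K = np.array([np.exp(-0.5 * ((grid[:, None] - X[None, :, j]) / h) ** 2)
                  / (np.sqrt(2 * np.pi) * h) for j in range(d)])   # shape d x G x n
    f_j = K.mean(axis=2)                                           # marginal densities on grid
    mhat_j = (K @ Y) / (n * f_j)                                   # univariate NW estimators
    f_jk = np.einsum('jgn,khn->jkgh', K, K) / n                    # pairwise marginals
    m0 = Y.mean()
    m = np.zeros((d, G))
    for _ in range(iters):
        for j in range(d):
            corr = np.zeros(G)
            for k in range(d):
                if k != j:
                    corr += (f_jk[j, k] @ m[k]) * dx / f_j[j]      # int m_k f_jk / f_j dx_k
            m[j] = mhat_j[j] - m0 - corr
            m[j] -= (m[j] * f_j[j]).sum() / f_j[j].sum()           # impose (7.5)
    return m0, m

rng = np.random.default_rng(5)
n = 500
X = rng.uniform(0, 1, size=(n, 2))
Y = 2 + np.sin(2 * np.pi * X[:, 0]) + (X[:, 1] - 0.5) ** 2 + 0.3 * rng.standard_normal(n)
grid = np.linspace(0.05, 0.95, 46)
m0, m = sbf(Y, X, grid, h=0.1)
print(round(m0, 2))
```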
The main difference from classical backfitting is that here the conditional expectations in the system are estimated by smoothing over the whole vector (or rather all pairwise elements) instead of smoothing only over the one-dimensional subvectors. Owing to this extra smoothing the new backfitting is called smooth, leading to the name SBE (smooth backfitting estimator). Specifically, consider the partially smoothed cdf estimator
$$F_n(x_k^0, x_j^0) = \frac{1}{n}\sum_{i=1}^nK_h(X_{ji}, x_j^0)\,1(X_{ki}\le x_k^0).$$
In classical backfitting, the quantity $E[m_k(X_k)\mid X_j=x_j]$ is replaced by $\int\tilde m_k(x_k^0)\,dF_n(x_k^0, x_j^0)/\hat f_j(x_j)$, whereas in smooth backfitting we use $\int\tilde m_k(x_k)\hat f_{j,k}(x_j,x_k)\,dx_k/\hat f_j(x_j)$.
7.4 Generalized Additive Models

Yu, Park, and Mammen (2008) have defined a smooth backfitting method for estimating generalized additive models. Define the expected quasi-likelihood $E[Q(m(X_i),Y_i)\mid X_i=x]$ and its smoothed empirical counterpart
$$\frac{1}{n}\sum_{i=1}^n\int Q\big(m(x), Y_i\big)K_h(x, X_i)\,dx, \tag{7.8}$$
which is a straight generalization of (7.2). We can define the estimated additive components as the minimizers of (7.8). Writing $\eta_\theta(x) = \theta_0 + \sum_{j=1}^d\theta_j(x_j)$, define for $j=1,\dots,d$,
$$\hat S_0(\theta) = \int\frac{\tilde m(x) - F(\eta_\theta(x))}{V\big(F(\eta_\theta(x))\big)\,G'\big(F(\eta_\theta(x))\big)}\,\hat f(x)\,dx$$
$$\hat S_j(\theta; x_j) = \int\frac{\tilde m(x) - F(\eta_\theta(x))}{V\big(F(\eta_\theta(x))\big)\,G'\big(F(\eta_\theta(x))\big)}\,\hat f(x)\,dx_{-j},$$
where $\tilde m(x)$ is the boundary corrected Nadaraya--Watson estimator, and let $\hat S_\theta(x) = (\hat S_0, \hat S_1(x_1),\dots,\hat S_d(x_d))$. Here $\theta = (\theta_0,\theta_1(\cdot),\dots,\theta_d(\cdot))$ collects the constant and the component functions. Then the first order condition to (7.8) is
$$\hat S_{\hat\theta}(x) = 0.$$
For uniqueness, one must normalize the components somehow. The major hurdle in analysing the estimator is that it solves a nonlinear system of equations, as opposed to the smooth backfitting in additive models, which is a linear system derived from projection in Hilbert space. The approach they take is to employ a double iteration scheme which consists of inner and outer iterations. This amounts to a weighted least squares scheme similar to the "Fisher scoring" advocated in Hastie and Tibshirani (1991). The main advantage of this is that, given the weighting scheme, one has a projection problem just like in the additive case, so that they can apply the results for convergence of the SBF under weighting. They prove that the algorithm converges geometrically fast under some global concavity condition on $Q$ and other technical conditions. They obtain the asymptotic properties of the resulting estimators, and these satisfy the oracle property. This theory is completely free of the curse of dimensionality.
Hegland et al. (1999) give an implementation of the backfitting algorithm for generalized additive models which is suitable for parallel computing. This implementation is designed to handle large data sets such as those occurring in data mining, with several millions of observations on several hundreds of variables. For such large data sets it is crucial to have a fast, parallel implementation for fitting generalized additive models to allow an exploratory analysis of the data within a reasonable time. The approach used divides the data into several blocks (groups) and fits a (generalized) additive model to each block. These models are then merged into a single, final model. It is shown that this approach is very efficient as it allows the algorithm to adapt to the structure of the parallel computer (number of processors and amount of internal memory).
Lee, Mammen, and Park (2010) proposed a SBF estimator for quantile regression. Unlike in additive regression or generalized additive regression, there is no projection theory lurking even remotely nearby. Instead, they require an initial consistent estimator (that converges at some algebraic rate) and then update according, essentially, to the oracle objective function. Take $\hat m_0$, $\hat m_j(\cdot)$, $j=1,\dots,d$, as initial consistent estimators. Then let
$$\hat m_j(x_j) = \arg\min_{\theta}\sum_{i=1}^n\int K_h(x_j, X_{ji})\,\rho_\tau\!\left(Y_i - \hat m_0 - \sum_{\ell\ne j}\hat m_\ell^{old}(x_\ell) - \theta\right)\prod_{\ell\ne j}K_h(x_\ell, X_{\ell i})\,dx_\ell, \tag{7.9}$$
where $\rho_\tau$ denotes the quantile check function and the integration is over the support of $X_{-j}$. We then iterate until convergence. This can be shown to be like a weighted least squares procedure. They show that their estimator is asymptotically the same as the estimator obtained from (7.9).

7.5 Asymptotic Properties

Hastie and Tibshirani (1990) contains some discussion of the statistical properties of their estimators. Specifically, since the estimators of $m_j$ and $m$ are linear, if the data are i.i.d. we can obtain the conditional bias and variance: if $\hat m = Wy$ for smoothing matrix $W$, we have
$$E[\hat m\mid X] - m = (W - I)m,\qquad \operatorname{var}(\hat m\mid X) = W\Sigma W^\top, \tag{7.10}$$
where $\Sigma = \operatorname{diag}\{\sigma_1^2,\dots,\sigma_n^2\}$ with $\sigma_i^2 = \operatorname{var}(\varepsilon_i\mid X_i)$. Although they do not prove that a central limit theorem is valid, they do suggest how one might obtain plausible confidence intervals from (7.10).
Opsomer and Ruppert (1997) presented a complete theory for the two-dimensional local polynomial version of the classical backfitting algorithm of Buja et al. (1989). They made a boundary adjustment to the kernel local polynomial. They supposed the order of the polynomial was $r+1$ (odd order) and they used bandwidths $h_1, h_2$, but we shall just restrict attention to the common bandwidth case. They assume homoskedastic errors, but this is clearly not needed for their main results. They maintain throughout that the additive model is correct.

Assumptions

C1. The kernel $K$ is bounded and continuous, it has compact support and its first derivative has a finite number of sign changes over its support.

C2. The densities $f$, $f_1$ and $f_2$ are bounded and continuous, have compact support and their first derivatives have a finite number of sign changes over their supports. Also, $f_j(x_j) > 0$ for all $x = (x_1,x_2)$ in the joint support and
$$\sup_{x_1,x_2}\left|\frac{f(x_1,x_2)}{f_1(x_1)f_2(x_2)} - 1\right| < 1. \tag{7.11}$$

C3. As $n\to\infty$, $h\to 0$ and $nh/\log n\to\infty$.

C4. The functions $m_1$ and $m_2$ are $r$ times continuously differentiable.

C5. $\varepsilon_i$ is i.i.d. with mean zero and variance $\sigma^2$ independent of $X_i$.
Theorem 6. Suppose that Assumptions C1--C3 hold. Then $I - W_1W_2$ is invertible with probability one for large enough $n$. For an interior observation point $X_i$ the conditional bias and variance of $\hat m_j(X_{ji})$ can be approximated by
$$E\left[\hat m_j(X_{ji}) - m_j(X_{ji})\mid X_1,\dots,X_n\right] = h^r\beta_j(X_{ji}) + o_p(h^r),$$
$$\operatorname{var}\left[\{\hat m_j(X_{ji}) - m_j(X_{ji})\}\mid X_1,\dots,X_n\right] = \frac{1}{nh}\frac{\sigma^2}{f_j(X_{ji})}\|K_{equiv}\|_2^2 + o_p(n^{-1}h^{-1}).$$
The asymptotic variance is the same as if the other component were known, and so the estimator has the so-called oracle property, more of which later. However, the conditional bias is a nasty mess: it depends on the joint distribution of both covariates, and so is not design adaptive like the usual odd order local polynomial estimator. Furthermore, the crucial condition in this work is given by (7.11), which restricts the correlation between the regressors; see Fig. 2 of Opsomer and Ruppert (1997).
The proof involves deriving properties of the large matrices $W_j$; in particular, they show that
$$\left[W_1W_2\right]_{ii} = \frac{1}{n}\left(\frac{f(X_{1i},X_{2i})}{f_1(X_{1i})f_2(X_{2i})} - 1\right) + o(n^{-1})$$
element by element. Therefore
$$(I - W_1W_2)^{-1} = I + \sum_{j=1}^\infty(W_1W_2)^j = I + O(n^{-1})$$
and so
$$\tilde m_1 - m_1 = \left[I - (I - W_1W_2)^{-1}(I - W_1)\right]\varepsilon + \text{bias} \simeq W_1\varepsilon + \text{bias}.$$

Opsomer (2000) extended the results of Opsomer and Ruppert (1997) to the general case. The counterpart to condition (7.11) is condition (8) on page 5 of Opsomer (2000), and is specified in terms of the smoother matrices rather than the underlying covariate density. His other conditions are similar and his asymptotic mean squared error expansion is also similar to that above, possessing the oracle property. Most importantly, the estimator is free of the curse of dimensionality, since the convergence rate is that of the one dimensional smoother regardless of the relationship between dimensionality and smoothness.

It should be noted that the condition (7.11) is sufficient but not necessary.

Smooth Backfitting

Some of the shortcomings of classical backfitting were overcome when Mammen et al. (1999) defined a new backfitting-type estimator employing the interpretation of Mammen et al. (2001) of the local polynomial kernel estimator as a projection. The additive regression estimator is defined as the projection down onto the additive space with respect to the norm that is defined by the local polynomial kernel estimator. Geometrically speaking, the estimator of Mammen et al. (1999) is therefore very easy to understand: the estimator is simply the projection of the data onto the additive space of interest. Besides its naturalness, there are clear advantages of this new smooth backfitting estimator (SBE): efficiency, robustness and calculability, and all this under rather weak assumptions. These advantages of the SBE are based on the fact that it is oracle efficient.

We next present the result for the smooth backfitting procedure.
C1. The kernel $K$ is bounded, has compact support ($[-C_1, C_1]$, say), is symmetric about zero, and is Lipschitz continuous, i.e., there exists a positive finite constant $C_2$ such that $|K(u)-K(v)|\le C_2|u-v|$.

C2. The $d$-dimensional vector $X$ has compact support $[0,1]^d$ and its density $f$ (with marginals $f_j$) is bounded away from zero and infinity on $[0,1]^d$.

C3. For some $\alpha > 5/2$, $E(|Y|^\alpha) < \infty$. The conditional variance $\sigma^2(x) = \operatorname{var}[Y\mid X=x]$ is continuous on $[0,1]^d$.

C4. The function $m''$ exists and is continuous. The derivative $f'$ exists and is continuous.

Define a constant $b_0$ and functions $b_j$ on $\mathbb R$ [with $\int b_j(x_j)f_j(x_j)\,dx_j = 0$] by
$$(b_0,b_1,\dots,b_d) = \arg\min_{b_0,\dots,b_d}\int\left[b(x) - b_0 - b_1(x_1) - \dots - b_d(x_d)\right]^2f(x)\,dx, \tag{7.12}$$
where
$$b(x) = \sum_{j=1}^d\left[m_j'(x_j)\frac{\partial}{\partial x_j}\log f(x) + \frac{1}{2}m_j''(x_j)\right]\mu_2(K)$$
is the bias function of the Nadaraya--Watson estimator. Let $\sigma_j^2(x_j) = \operatorname{var}[Y - m(X)\mid X_j=x_j]$.

Theorem 7. Suppose that the additive model holds and that conditions C1--C4 hold, and that Nadaraya--Watson backfitting smoothing is used, i.e., $\hat m_j$, $\hat f_j$ and $\hat f_{j,k}$ are defined as above and $\tilde m_j$ is defined by (7.7). Suppose additionally that $n^{1/5}h\to c$ for a constant $c$. Then the algorithm converges geometrically fast. Specifically, there exist constants $0<\gamma<1$ and $c_0>0$ such that, with probability tending to one, the following inequality holds:
$$\int\left[\tilde m_j^{[r]}(x_j) - \tilde m_j(x_j)\right]^2f_j(x_j)\,dx_j \le c_0\gamma^{2r}\left(1 + \sum_{j=1}^d\int\{\tilde m_j^{[0]}(x_j)\}^2f_j(x_j)\,dx_j\right). \tag{7.13}$$
Here, the functions $\tilde m_1^{[0]}(x_1),\dots,\tilde m_d^{[0]}(x_d)$ are the starting values of the backfitting algorithm. Furthermore, the following convergence holds in distribution for any $x_j\in(0,1)$:
$$n^{2/5}\left[\tilde m_j(x_j) - m_j(x_j)\right] \Longrightarrow N\left(c^2b_j(x_j), v_j(x_j)\right),$$
where $b_j(x_j)$ was defined above and $v_j(x_j) = c^{-1}\|K\|_2^2\,\sigma_j^2(x_j)/f_j(x_j)$, $j=1,\dots,d$. Consequently,
$$n^{2/5}\left[\tilde m(x) - m(x)\right] \Longrightarrow N\left(c^2\sum_{j=1}^db_j(x_j), \sum_{j=1}^dv_j(x_j)\right).$$
The bias does not correspond to the bias of a one dimensional smoother, except where this is additive. MLN show that for a local linear implementation one obtains the same result but with bias
$$b_j(x_j) = \frac{1}{2}\mu_2(K)\left[m_j''(x_j) - \int m_j''(x_j)f_j(x_j)\,dx_j\right].$$
This estimator is oracle efficient. Horowitz, Klemelä, and Mammen (2002) have shown that this estimator is oracle BLAM, like the local linear regression is for standard nonparametric regression. It is easy to obtain consistent estimates of $v_j(x_j)$: given residuals $\tilde\varepsilon_i = Y_i - \tilde c - \sum_{j=1}^d\tilde m_j(X_{ji})$, one estimates $\sigma_j^2(x_j)$ from the smooth of $\tilde\varepsilon_i^2$ on $X_{ji}$, and $\hat f_j$ as the standard univariate kernel density estimator.

7.6 Bandwidth Choice and Model Selection

Sperlich and coauthors considered plug-in methods for bandwidth choice based on estimating the asymptotic bias and variance of the additive estimator. This can be shown to work in selecting asymptotically optimal bandwidths according to pointwise or integrated mean squared error. Nielsen and Sperlich (2005) considered cross-validation of the bandwidths, i.e., choosing the bandwidth $h$ to minimize
$$CV(h) = \sum_{i=1}^n\left(Y_i - \hat m_{-i}(X_i)\right)^2,$$
where $\hat m_{-i}(\cdot)$ denotes the SBF estimator of $m$ excluding observation $i$.
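The cross-validation principle is easy to illustrate. The sketch below computes the leave-one-out criterion for a simple univariate Nadaraya--Watson smoother over a bandwidth grid; for the SBF estimator one would instead refit the backfitting estimator without observation $i$, which we omit here for brevity.

```python
import numpy as np

def loo_cv(h, x, y):
    """Leave-one-out cross-validation criterion CV(h) for a Nadaraya-Watson smoother."""
    K = np.exp(-0.5 * ((x[:, None] - x[None, :]) / h) ** 2)
    np.fill_diagonal(K, 0.0)                    # exclude observation i from its own fit
    m_minus_i = (K @ y) / K.sum(axis=1)
    return np.sum((y - m_minus_i) ** 2)

rng = np.random.default_rng(6)
n = 300
x = rng.uniform(0, 1, n)
y = np.sin(2 * np.pi * x) + 0.3 * rng.standard_normal(n)
grid = np.linspace(0.02, 0.3, 15)
cv = [loo_cv(h, x, y) for h in grid]
print(round(grid[int(np.argmin(cv))], 3))       # CV-selected bandwidth
```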
Mammen and Park (2005) proposed an automatic method for smooth backfitting based on penalized least squares. They consider the case where kernel estimation with different bandwidths $h_1,\dots,h_d$ is allowed, but we just present the common bandwidth case here for simplicity. To select the bandwidth they consider the penalized (trimmed) least squares objective function
$$PLS(h) = \frac{1}{n}\sum_{i=1}^nw_n(X_i)\,\hat\varepsilon_i^2(h)\left(1 + \frac{2dK(0)}{nh}\right),$$
where $\hat\varepsilon_i(h) = Y_i - \hat m_0 - \sum_{j=1}^d\hat m_j(X_{ji})$ are the residuals from the backfitting procedure computed using bandwidth $h$ and kernel $K$, while $w_n(x)$ is a trimming function. They show that the bandwidth that minimizes $PLS$, denoted $\hat h_{PLS}$, is asymptotically optimal in the sense that $(\hat h_{PLS} - \hat h_{ASE})/\hat h_{ASE}\to 0$ with probability one, where $\hat h_{ASE}$ minimizes the averaged squared error $ASE(h)$, with
$$ASE(h) = \frac{1}{n}\sum_{i=1}^nw_n(X_i)\left\{\hat m_0 + \sum_{j=1}^d\hat m_j(X_{ji}) - m_0 - \sum_{j=1}^dm_j(X_{ji})\right\}^2.$$
Kauermann and Opsomer (2008) proposed a method of bandwidth choice for generalized additive models under full specification.
Chapter 8

Appendix

8.1 Stochastic Dominance

We give a rough argument for the equivalence of the two definitions of first order stochastic dominance. We suppose for simplicity that both random variables have compact support $[a,b]$ and that $F_j(a)=0$ and $F_j(b)=1$ for $j=1,2$. Let $f_j$ be the density of $F_j$. Let $U\in\mathcal U_1$, meaning that it has $U'(x)\ge 0$. By integration by parts
$$E_1U(X) - E_2U(X) = \int_a^bU(x)[f_1(x)-f_2(x)]\,dx = \big[(F_1(x)-F_2(x))U(x)\big]_a^b - \int_a^b[F_1(x)-F_2(x)]U'(x)\,dx = -\int_a^b[F_1(x)-F_2(x)]U'(x)\,dx,$$
which has the opposite sign of $F_1(x)-F_2(x)$. To prove the necessity, one constructs a special utility function that coincides with the interval of cdf domination. See Levy (2006) for the full argument.

8.2 Uniform Laws of Large Numbers (ULLN)

Suppose that $T_n(\theta)$ is a stochastic process with index $\theta\in\Theta$. We shall discuss results of the form
$$\sup_{\theta\in\Theta}|T_n(\theta)| \overset{p}{\longrightarrow} 0. \tag{8.1}$$
In our applications, $T_n(\theta) = G_n(\theta) - G(\theta)$ will have the special form of an unscaled empirical process
$$T_n(\theta) = \frac{1}{n}\sum_{i=1}^n\left[m(X_i,\theta) - E\{m(X_i,\theta)\}\right] \tag{8.2}$$
for some random sequence $X_i$, usually i.i.d. [but perhaps only stationary and ergodic], and $m(\cdot,\theta)$ a family of functions where the parameter space $\Theta$ is a fixed subset of Euclidean $k$-space. A related application is to the partial sum process
$$T_n(\pi) = \frac{1}{n}\sum_{i=1}^{[\pi n]}\left[m(X_i) - E\{m(X_i)\}\right], \tag{8.3}$$
where $[x]$ denotes the integer part of $x$.

The ULLN result (8.1) is a stronger requirement than a pointwise LLN except when $\Theta$ is a finite set of points. Sufficient conditions for the pointwise strong law of large numbers to apply to (8.2) at a given $\theta$ are that (a) $X_i$ are i.i.d.; and (b) $E[|m(X_i,\theta)|]<\infty$ for the given $\theta$. We will discuss below what additional conditions are needed for the uniform result.

The earliest ULLN results were by Glivenko and by Cantelli, both published in the Italian Actuarial Journal in 1933 [in the same year and journal, Kolmogorov established the limiting distribution of the normalized process]. This argument is very special and uses the structure of the empirical c.d.f. quite a lot. We next give a result that is more widely applicable. The cost will be that we need to impose some restrictions on the function $m$ and the parameter space $\Theta$. We shall combine the known pointwise results with a sort of continuity condition, which says that the process does not move too much when the parameter is changed a little. In fact, we shall need more than just ordinary continuity, as the following example illustrates.
Example. Consider the following family of functions $\{f_n(\cdot),\ n=1,2,\dots\}$,
$$f_n(x) = \frac{x^2}{x^2 + (1-nx)^2},\qquad 0\le x\le 1,\ n=1,2,\dots$$
Then,
$$|f_n(x)|\le 1,\qquad \lim_{n\to\infty}f_n(x) = 0 \ \text{for all fixed } 0\le x\le 1,$$
but
$$f_n\!\left(\frac{1}{n}\right) = 1 \ \text{for all } n,\quad\text{i.e.,}\quad \lim_{n\to\infty}\sup_{0\le x\le 1}|f_n(x) - f(x)| \ne 0,$$
where $f(x)=0$ is the pointwise limit.
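A quick numerical check of this example confirms the failure of uniform convergence:

```python
import numpy as np

f = lambda x, n: x**2 / (x**2 + (1 - n * x)**2)

x_grid = np.linspace(0.0, 1.0, 100001)
for n in [10, 100, 1000]:
    print(n, round(f(0.3, n), 6), round(f(x_grid, n).max(), 6))
# the value at any fixed x (here x = 0.3) tends to 0, but the supremum stays at 1,
# attained at x = 1/n, which drifts toward the boundary as n grows
```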

The following condition rules out this sort of behaviour. A family of functions $\{f_n(\cdot),\ n=1,\dots\}$ is equicontinuous on a set $\mathcal X$ if for any $\varepsilon>0$ there exists $\delta>0$ such that
$$\sup_{|x-y|<\delta}|f_n(x) - f_n(y)| < \varepsilon$$
for all $x,y\in\mathcal X$ and for all $n=1,2,\dots$ In our theorem, we will have to consider a stochastic version of this condition.

Theorem 14. Suppose that Assumptions 2.1–2.3 given below hold. Then
$$\sup_{\theta\in\Theta}|T_n(\theta)| \overset{p}{\longrightarrow} 0.$$

2.1. $\Theta$ is a compact subset of a Euclidean space.

2.2. (Pointwise convergence) For each $\theta\in\Theta$, $|T_n(\theta)| \overset{p}{\to} 0$.

2.3. (Stochastic equicontinuity) For all $\varepsilon,\eta>0$, there exists $\delta>0$ such that
$$\limsup_{n\to\infty}\Pr\left\{\sup_{\theta\in\Theta}\ \sup_{\theta'\in N(\theta,\delta)}|T_n(\theta') - T_n(\theta)| > \eta\right\} < \varepsilon,$$
where each $N(\theta,\delta)$ is an open ball of radius $\delta$ centered at $\theta$.


Proof. Given $\varepsilon>0$, take $\delta$ as in Assumption 2.3, and let $\{N(\theta_j,\delta)\}_{j=1}^J$ be a finite cover of $\Theta$, i.e.,
$$\Theta \subseteq \bigcup_{j=1}^JN(\theta_j,\delta).$$
The proof follows from the following facts:

F1. $\sup_{\theta\in\Theta}|T_n(\theta)| \le \max_{1\le j\le J}\sup_{\theta\in N(\theta_j,\delta)}|T_n(\theta)|$.

F2. By the triangle inequality, for any $\theta,\theta_j$, we have
$$|T_n(\theta)| = |T_n(\theta_j) + T_n(\theta) - T_n(\theta_j)| \le |T_n(\theta_j)| + |T_n(\theta) - T_n(\theta_j)|.$$

F3. If $|T_n(\theta)| > \eta > 0$, then either $|T_n(\theta_j)| > \eta/2$ or $|T_n(\theta) - T_n(\theta_j)| > \eta/2$.

F4. $\max_{1\le j\le J}|T_n(\theta_j)| \to_p 0$.

We have
$$\limsup_{n\to\infty}\Pr\left\{\sup_{\theta\in\Theta}|T_n(\theta)| > \varepsilon\right\} \le \limsup_{n\to\infty}\Pr\left\{\max_{1\le j\le J}\sup_{\theta'\in N(\theta_j,\delta)}|T_n(\theta') - T_n(\theta_j)| + \max_{1\le j\le J}|T_n(\theta_j)| > \varepsilon\right\}\quad\text{[by F1 and F2]}$$
$$\le \limsup_{n\to\infty}\Pr\left\{\sup_{\theta\in\Theta}\sup_{\theta'\in N(\theta,\delta)}|T_n(\theta') - T_n(\theta)| > \varepsilon/2\right\} + \limsup_{n\to\infty}\Pr\left\{\max_{1\le j\le J}|T_n(\theta_j)| > \varepsilon/2\right\}\quad\text{[by F3]}$$
$$< \varepsilon/2 + \varepsilon/2 = \varepsilon\quad\text{[by Assumption 2.3 and F4]}.$$
This is true for any $\varepsilon$, which implies the result.


This says that a sufficient condition for (8.1) is that $G_n(\theta)\to_pG(\theta)$ for each $\theta$ and that the process $T_n(\theta) = G_n(\theta) - G(\theta)$ is stochastically equicontinuous uniformly in $\theta$.

8.3 Stochastic Equicontinuity

We first repeat the definition for a general process along with two equivalent definitions.

Definitions.

1.1. For all $\eta,\varepsilon>0$, there exists $\delta>0$ such that
$$\limsup_{n\to\infty}\Pr\left[\sup_{\{\theta_1,\theta_2:\|\theta_1-\theta_2\|\le\delta\}}\|T_n(\theta_1) - T_n(\theta_2)\| > \eta\right] < \varepsilon.$$

1.2. For all deterministic sequences $\{\delta_n\}$ with $\delta_n\downarrow 0$ we have
$$\sup_{\{\theta_1,\theta_2:\|\theta_1-\theta_2\|\le\delta_n\}}\|T_n(\theta_1) - T_n(\theta_2)\| \to_p 0.$$

1.3. For all random sequences $\{\hat\theta_{n1}\}$ and $\{\hat\theta_{n2}\}$ with $\|\hat\theta_{n1}-\hat\theta_{n2}\|\to_p0$, we have
$$T_n(\hat\theta_{n1}) - T_n(\hat\theta_{n2}) \to_p 0.$$

These definitions can also be applied to the normalized process $\nu_n(\theta) = \sqrt n(G_n - G)$, although in our applications it then suffices to have $\theta_2 = \theta_0$ and $\hat\theta_{n2} = \theta_0$.
8.3.1 Unnormalized Process

Here we give a result which guarantees stochastic equicontinuity for the unnormalized process $T_n(\theta) = G_n(\theta) - G(\theta)$.

Theorem 15. Suppose that
$$T_n(\theta) = \frac{1}{n}\sum_{i=1}^n\left[m(X_i,\theta) - E\{m(X_i,\theta)\}\right],$$
where $X_i$ are i.i.d., and

3.1. $m(x,\theta)$ is continuous in $\theta$ at each $\theta$ on a set of $x\in\mathcal X$ that has probability one.

3.2. $\sup_{\theta\in\Theta}|m(x,\theta)| \le d(x)$, where $E[d(X)]<\infty$.

3.3. $\Theta$ is a compact subset of a Euclidean space.

Then the stochastic equicontinuity condition 2.3 holds.

Proof. Let
$$y_n = \sup_{\theta\in\Theta}\ \sup_{\theta'\in N(\theta,\delta)}\frac{1}{n}\sum_{i=1}^n|m(X_i,\theta') - m(X_i,\theta)|.$$
Then
$$E(y_n) \le 2E\left[\sup_{\theta\in\Theta}|m(X,\theta)|\right] < \infty,$$
and by dominated convergence $y_n\to 0$ with probability one and $E(y_n)\to 0$ as $\delta\to 0$.

Note that
$$\sup_{\theta\in\Theta}\sup_{\theta'\in N(\theta,\delta)}|T_n(\theta') - T_n(\theta)| = \sup_{\theta\in\Theta}\sup_{\theta'\in N(\theta,\delta)}\left|\frac{1}{n}\sum_{i=1}^n\left[m(X_i,\theta') - Em(X_i,\theta') - m(X_i,\theta) + Em(X_i,\theta)\right]\right| \le y_n + E(y_n).$$
Therefore,
$$\limsup_{n\to\infty}\Pr\left\{\sup_{\theta\in\Theta}\sup_{\theta'\in N(\theta,\delta)}|T_n(\theta') - T_n(\theta)| > \varepsilon\right\} \tag{8.4}$$
$$\le \limsup_{n\to\infty}\Pr\left\{y_n + E(y_n) > \varepsilon\right\} \le \limsup_{n\to\infty}\frac{E[y_n + E(y_n)]}{\varepsilon}\ \text{[by the Markov inequality]} = \frac{2}{\varepsilon}E(y_n) \tag{8.5}$$
$$\to 0\ \text{as }\delta\to 0,$$
i.e., we can find $\delta(\varepsilon)$ such that (8.5), and hence (8.4), are less than $\varepsilon$.

Assumption 3.1 can be verified even for indicator functions like $\{x'\theta>0\}$, $\theta\ne 0$, as show up in LAD and censored LAD regression, provided $X$ has a continuous distribution, in which case the set of discontinuity points is of probability zero. Let
$$Y_n(\theta) = \sup_{\theta'\in N(\theta,\delta)}\frac{1}{n}\sum_{i=1}^n\left|\{X_i'\theta'>0\} - \{X_i'\theta>0\}\right|,$$
and let $\mathcal X_0(\theta,\delta) = \{X_1,\dots,X_n : Y_n(\theta)\ne 0\}$. When $X_i$ is continuously distributed, there does not exist a hyperplane of dimension $d-1$ that contains more than two $X$'s with positive probability. Therefore, no matter how $\delta$ is chosen it is not possible to make enough of $|\{X_i'\theta'>0\} - \{X_i'\theta>0\}|$ equal to one. Therefore, $\sup_{\theta\in\Theta}Y_n(\theta)$ converges to zero as $\delta\to 0$.

However, when $X_i$ is discrete, $\mathcal X_0(\theta,\delta)$ can have positive probability for some $\theta$ and for all $\delta$. For example, suppose that $d=2$ and that
$$X_i = \begin{cases}(1,1) & \text{with probability } p\\ (-1,1) & \text{with probability } 1-p.\end{cases}$$
Then, when $\theta\in\{(c,-c) : c\in\mathbb R\}$, $Y_n(\theta)=1$ with probability one and this is true for any $\delta$.

8.3.2 Normalized Process

We now turn to a discussion of stochastic equicontinuity for the normalized process
$$\nu_n(\theta) = \sqrt n\{G_n(\theta) - G(\theta)\} = \frac{1}{\sqrt n}\sum_{i=1}^n\left[m(X_i,\theta) - E\{m(X_i,\theta)\}\right]. \tag{8.6}$$
To establish this property requires considerably more work than was necessary for the unnormalized process.

This condition is readily verified for the case that $\nu_n(\theta)$ is linear in $\theta$, i.e., $\nu_n(\theta) = \theta'\frac{1}{\sqrt n}\sum_{i=1}^n(X_i - E(X_i))$, since
$$\|\nu_n(\theta_1) - \nu_n(\theta_2)\| = \left\|(\theta_1 - \theta_2)'\frac{1}{\sqrt n}\sum_{i=1}^n(X_i - E(X_i))\right\| \le \|\theta_1 - \theta_2\|\left\|\frac{1}{\sqrt n}\sum_{i=1}^n(X_i - E(X_i))\right\|$$
by the Cauchy--Schwarz inequality. Therefore,
$$\sup_{\{\theta_1,\theta_2:\|\theta_1-\theta_2\|\le\delta_n\}}\|\nu_n(\theta_1) - \nu_n(\theta_2)\| \le \delta_n\left\|\frac{1}{\sqrt n}\sum_{i=1}^n(X_i - E(X_i))\right\| = o_p(1).$$

Consider now the more general case (8.6) where $m$ is differentiable in $\theta$ but nonlinear in $\theta$. Then
$$\|\nu_n(\theta_1) - \nu_n(\theta_2)\| \doteq \|\dot\nu_n(\theta_{12})(\theta_1 - \theta_2)\|,$$
where the process
$$\dot\nu_n(\theta) = \frac{1}{\sqrt n}\sum_{i=1}^n\left[\frac{\partial m(X_i,\theta)}{\partial\theta} - E\frac{\partial m(X_i,\theta)}{\partial\theta}\right] = n^{-1/2}\sum_{i=1}^n\varepsilon(X_i,\theta)$$
is $O_p(1)$ for each $\theta$. It seems plausible that if, say, $\partial m(X_i,\theta)/\partial\theta$ were bounded, then
$$\sup_{\theta\in\Theta}\|\dot\nu_n(\theta)\| = O_p(1)$$
and the stochastic equicontinuity condition would hold. This is not a proof. Furthermore, there are many instances in which the function $m$ is not differentiable, such as in LAD and censored LAD estimation where $m(x,\theta) = \{x'\theta>0\}$. Furthermore, when $\theta$ is infinite dimensional, as occurs in semiparametric GMM problems, then it is not clear how to proceed.
We now outline some of the general theory which is available. The basic idea is that we must measure how big the family $\mathcal F = \{m(x,\theta),\ \theta\in\Theta\}$ is. If $\mathcal F$ is too large, then the process $\nu_n(\theta)$ will not be stochastically equicontinuous, because small changes in $\theta$ may lead to large changes in $\nu_n(\theta)$. Size will obviously depend on the pseudo-metric $\rho$ [a pseudo-metric is a metric except that $\rho(f,g)=0$ does not necessarily mean that $f=g$] that is used to measure distance in $\mathcal F$. A fairly general class of metrics is given by the so-called $L_p(Q)$ norms
$$\rho_{pQ}(f,g) = \begin{cases}\left[\int\{f(x)-g(x)\}^p\,dQ(x)\right]^{1/p} & 1\le p<\infty\\[4pt] \sup_{x\in\operatorname{supp}(Q)}|f(x)-g(x)| & p=\infty\end{cases}$$
for some measure $Q$. Whatever metric is chosen, we define a neighborhood of $f$ as a closed ball
$$B(f,\delta) = \{g : \rho(f,g)\le\delta\}.$$
We now give three alternative measures of the size of a function class $\mathcal F$ with respect to a distance $\rho$:

Definitions.

2.1. Metric Entropy
$$H(\delta,\rho,\mathcal F) = \log N(\delta,\rho,\mathcal F),$$
where $N$ is the smallest number of closed balls of radius $\delta$ needed to cover $\mathcal F$.

2.2. Metric Capacity
$$C(\delta,\rho,\mathcal F) = \log D(\delta,\rho,\mathcal F),$$
where $D$ is the largest number for which there exist points $f_1,\dots,f_D\in\mathcal F$ such that $\rho(f_i,f_j)>\delta$.

2.3. Bracketing Entropy
$$B(\delta,\rho,\mathcal F) = \log N^B(\delta,\rho,\mathcal F),$$
where $N^B$ is the cardinality of the smallest $\delta$-bracket set $\mathcal S_\delta$, where $\mathcal S_\delta$ has the property that for all $f\in\mathcal F$ there exist $f_L,f_U\in\mathcal S_\delta$ such that
$$f_L\le f\le f_U,\qquad \rho(f_L,f_U)\le\delta.$$

The numbers $N$, $D$ and $N^B$ all tend to infinity as $\delta$ goes to zero. There are some well known relationships between them:
$$N(\delta,\rho,\mathcal F) \le D(\delta,\rho,\mathcal F) \le N\!\left(\tfrac{\delta}{2},\rho,\mathcal F\right);\qquad N(\delta,\rho,\mathcal F) \le N^B(2\delta,\rho,\mathcal F).$$
We next consider some examples.

Examples.

1. $\mathcal F = \{m(x,\theta) = \theta,\ \theta\in\Theta=[0,1]\}$, $\rho$ is Euclidean distance. In this case, it is easy to see that $N(\delta,\rho,\mathcal F) = 1/\delta$.

2. $\mathcal F = \{m(x,\theta) = \{x\theta>0\},\ \theta\in\mathbb R\}$, where $x$ is scalar. There are actually only three distinct functions in $\mathcal F$:
$$g_1(x) = \begin{cases}1 & \text{if } x>0\\ 0 & \text{if } x\le 0\end{cases},\qquad g_2(x) = \begin{cases}0 & \text{if } x>0\\ 1 & \text{if } x\le 0\end{cases},\qquad g_3(x) = 0.$$
Therefore, $\mathcal F$ has constant entropy.

3. $\mathcal F = \{m(x,\theta) = \{x'\theta>0\},\ \theta\in\mathbb R^2\}$. Suppose that $X$ has a compact support. Now there are genuinely an uncountable infinity of functions in $\mathcal F$. However, we can index each function by an angle $\omega\in[0,2\pi]$, where $\omega$ is the angle that the line $\theta_1x_1+\theta_2x_2 = 0$ makes with the horizontal axis. For each $\delta$ we can cover $\mathcal F$ by $O(1/\delta)$ functions with indexes $2\pi j/[\delta^{-1}]$, $j=1,2,\dots,[\delta^{-1}]$. It is easy to see that the entropy is $O(1/\delta)$ when the metric space is $L_p(Q)$ for finite $p$ [since the functions are only zero or one, the distance between two functions is just the Lebesgue measure of a certain set].

4. When $x\in\mathbb R^d$, the entropy of $\mathcal F = \{m(x,\theta) = \{x'\theta>0\},\ \theta\in\Theta\}$ is $O(1/\delta^{d-1})$.

We now state a theorem about stochastic equicontinuity.

Theorem 16. The process $\nu_n(\theta)$ is stochastically equicontinuous if either:

(a) $$\int_0^1\sup_{Q\in\mathcal Q}\sqrt{H\big(\delta\,\rho_{2Q}(F,0),\rho_{2Q},\mathcal F\big)}\,d\delta < \infty,$$
where $\mathcal Q$ is the set of all distributions with finite support, $\rho_{2Q}$ is the $L_2(Q)$ distance, and $F$ is a [square integrable] envelope function with $f\le F$ for all $f\in\mathcal F$;

(b) $$\int_0^1\sqrt{B\big(\delta\,\rho_{2P}(F,0),\rho_{2P},\mathcal F\big)}\,d\delta < \infty,$$
where $P$ is the distribution of $X$.

Note that in the above examples, the entropy is polynomial and always satisfies part (a) of the theorem. Thus
$$\int_0^1\sqrt{(d-1)\log\frac{1}{\delta}}\,d\delta < \infty,$$
which can be verified by a simple change of variables $\delta\mapsto u = \log\delta$.

We finally state three important classes of functions that satisfy the conditions of our theorem.

Examples.

1. Type I. Either $\mathcal F = \{m(x,\theta) = x'\theta,\ \theta\in\mathbb R^d\}$ or
$$\mathcal F = \{m(x,\theta) = h(x'\theta),\ \theta\in\mathbb R^d,\ h \text{ of bounded variation}\}.$$
This includes indicators and sign functions.

2. Type II. $\mathcal F = \{m(x,\theta),\ \theta\in\mathbb R^d\}$, where for some square integrable function $R$,
$$\|m(x,\theta) - m(x,\theta')\| \le R(x)\|\theta - \theta'\|.$$

3. Type III. $\mathcal F = \{m(x,\theta),\ \theta\in\mathbb R^d\}$, where
$$\left[E\left\{\sup_{\{\theta_1,\theta_2:\|\theta_1-\theta_2\|\le\delta\}}|m(X,\theta_1) - m(X,\theta_2)|^p\right\}\right]^{1/p} \le C\delta^\psi$$
for some finite $C$, positive $\psi$, and some $p\ge 2$.

Type I and II satisfy condition (a), while Type III satisfies condition (b) of the theorem.

Finally, one can combine various function classes and still preserve the entropy condition. For example, if $\mathcal F_1$ and $\mathcal F_2$ satisfy entropy condition (a) or (b), then so do [with appropriate envelopes and index changes]: $|\mathcal F_1| = \{|f_1| : f_1\in\mathcal F_1\}$; $\mathcal F_1+\mathcal F_2 = \{f_1+f_2 : f_1\in\mathcal F_1,\ f_2\in\mathcal F_2\}$; $\mathcal F_1\cdot\mathcal F_2 = \{f_1f_2 : f_1\in\mathcal F_1,\ f_2\in\mathcal F_2\}$; $\mathcal F_1\vee\mathcal F_2 = \{\max\{f_1,f_2\} : f_1\in\mathcal F_1,\ f_2\in\mathcal F_2\}$; and $\mathcal F_1\wedge\mathcal F_2 = \{\min\{f_1,f_2\} : f_1\in\mathcal F_1,\ f_2\in\mathcal F_2\}$.

8.4 Proofs of Theorems

Proof of Theorem 1. We assume for simplicity that $F$ is continuous and strictly monotonic. Let $x_{jk}$ be the value of $x$ that satisfies $F(x_{jk}) = j/k$ for integers $j,k$ with $j\le k$. For any $x$ between $x_{jk}$ and $x_{j+1,k}$,
$$F(x_{jk}) \le F(x) \le F(x_{j+1,k}),\qquad F_n(x_{jk}) \le F_n(x) \le F_n(x_{j+1,k}),$$
while $0\le F(x_{j+1,k}) - F(x_{jk}) \le 1/k$, so that
$$F_n(x) - F(x) \le F_n(x_{j+1,k}) - F(x_{jk}) \le F_n(x_{j+1,k}) - F(x_{j+1,k}) + \frac{1}{k}$$
$$F_n(x) - F(x) \ge F_n(x_{j,k}) - F(x_{j+1,k}) \ge F_n(x_{j,k}) - F(x_{j,k}) - \frac{1}{k}.$$
Therefore, for any $x$ and $k$,
$$|F_n(x) - F(x)| \le \max_{1\le j\le k}|F_n(x_{jk}) - F(x_{jk})| + \frac{1}{k}. \tag{8.7}$$
Since the right hand side of (8.7) does not depend on $x$, we can replace the left hand side by $\sup_{-\infty<x<\infty}|F_n(x) - F(x)|$.

Now let $A_{jk} = \{\omega : F_n(x_{jk})\to F(x_{jk})\}$. By the pointwise strong law of large numbers, $\Pr(A_{jk}) = 1$. Furthermore, $\Pr(A_k) = 1$, where
$$A_k = \bigcap_{j=1}^kA_{jk} = \left\{\omega : \max_{1\le j\le k}|F_n(x_{jk}) - F(x_{jk})| \to 0\right\},$$
since
$$\Pr(A_k^c) = \Pr\left\{\bigcup_{j=1}^kA_{jk}^c\right\} \le \sum_{j=1}^k\Pr(A_{jk}^c) = 0.$$
It follows that $\Pr(A) = 1$, where $A = \bigcap_{k=1}^\infty A_k$. Therefore, we can make the right hand side of (8.7) arbitrarily small with probability one.
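A small simulation illustrates the conclusion of Theorem 1 (the Glivenko--Cantelli property) for standard normal data; the supremum is evaluated at the jump points of $F_n$, where it is attained.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(7)
for n in [100, 1000, 10000, 100000]:
    x = np.sort(rng.standard_normal(n))
    F = norm.cdf(x)                                  # true cdf at the sample points
    Fn_right = np.arange(1, n + 1) / n               # F_n at each jump
    Fn_left = np.arange(0, n) / n                    # F_n just before each jump
    sup_diff = max(np.abs(Fn_right - F).max(), np.abs(Fn_left - F).max())
    print(n, round(sup_diff, 4))
# sup_x |F_n(x) - F(x)| shrinks toward zero, consistent with the Glivenko-Cantelli theorem
```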
Proof of Theorem 2. By the triangle inequality
$$\sup_{x\in\mathbb R}\left|\hat f(x) - f(x)\right| \le \sup_{x\in\mathbb R}\left|\hat f(x) - E[\hat f(x)]\right| + \sup_{x\in\mathbb R}\left|E[\hat f(x)] - f(x)\right|.$$
We can write
$$\hat f(x) - E[\hat f(x)] = \frac{1}{h}\int K\!\left(\frac{x-y}{h}\right)d[F_n(y) - F(y)].$$
Since $K$ has bounded variation and $F_n(y) - F(y)$ is a continuous-from-the-right step function with $F_n(-\infty) = F(-\infty) = 0$ and $F_n(\infty) = F(\infty) = 1$, we can apply integration by parts (Carter and van Brunt (2000)). Let $S$ be the set of points $y$ where both $K\left(\frac{x-y}{h}\right)$ and $F_n(y) - F(y)$ are discontinuous (this is a subset of the sample points $X_1,\dots,X_n$), and define $\Delta g(a) = g(a^+) - g(a^-)$. Then
$$\frac{1}{h}\int K\!\left(\frac{x-y}{h}\right)d[F_n(y) - F(y)] = -\frac{1}{h}\int[F_n(y)-F(y)]\,dK\!\left(\frac{x-y}{h}\right) + \frac{1}{h}\Big[K\!\left(\tfrac{x-\cdot}{h}\right)(F_n - F)\Big](\mathbb R) + \frac{1}{h}\sum_{a\in S}\Delta K\!\left(\tfrac{x-a}{h}\right)\Delta(F_n - F)(a).$$
Since $F_n(-\infty) - F(-\infty) = 0$, the term $\big[K\left(\tfrac{x-\cdot}{h}\right)(F_n - F)\big](\mathbb R) = 0$. The last term is also zero with probability one, because the discontinuity points of $K\left(\frac{x-y}{h}\right)$ (which are countable in number by bounded variation) do not coincide with the discontinuity points of $F_n(y) - F(y)$ when $X$ is continuously distributed. Therefore,
$$\sup_{x\in\mathbb R}\left|\hat f(x) - E[\hat f(x)]\right| = \sup_{x\in\mathbb R}\left|\int K_h(x-y)\,d[F_n(y)-F(y)]\right| \le \sup_{x\in\mathbb R}\int|F_n(y)-F(y)|\,d|K_h(x-y)| \le \frac{1}{h}\sup_{x\in\mathbb R}|F_n(x)-F(x)|\,TVAR(|K|) = \frac{Z_n}{n^{1/2}h}\,TVAR(|K|)$$
for the tight sequence of random variables $Z_n = \sqrt n\sup_{x\in\mathbb R}|F_n(x)-F(x)|$, where $TVAR$ denotes total variation. This holds in probability but also almost surely.

Then note that
$$E[\hat f(x)] = \frac{1}{h}\int K\!\left(\frac{x-y}{h}\right)f(y)\,dy \equiv (f*K_h)(x).$$
This is called the convolution of $f$ with $K_h$ and is an object well treated in real analysis. In particular, there are many different results covering $f*K_h\to f$ in various senses under various conditions. In particular, it is not necessary for $K$ or $f$ to have compact support.

Proposition (Folland). Suppose that $K\in L_1$ and $\int K = 1$:

(a) If $f\in L_p$ ($1\le p<\infty$), then $f*K_h\to f$ in the $L_p$ norm as $h\to 0$.
(b) If $f$ is bounded and uniformly continuous, then $f*K_h\to f$ uniformly as $h\to 0$.

(c) If $f\in L_\infty$ and $f$ is continuous on an open set $U$, then $f*K_h\to f$ uniformly on compact subsets of $U$ as $h\to 0$.

(d) Suppose that $|K(x)|\le C(1+|x|)^{-d-\epsilon}$ for some $C,\epsilon\in(0,\infty)$. If $f\in L_p$ ($1\le p\le\infty$), then $f*K_h(x)\to f(x)$ for every $x$ in the Lebesgue set of $f$, in particular for every $x$ at which $f$ is continuous.

Suppose that $f$ has compact support. For any $M$ we can write $K(x) = K^A(x) + K^B(x)$, where
$$K^A(x) = K(x)1(|x|\le M)\quad\text{and}\quad K^B(x) = K(x)1(|x|>M).$$
Then $f*K_h = f*K_h^A + f*K_h^B$ and
$$\int|f*K_h - f|\,dx \le \int\left|f*K_h^A - f\!\int\!K_h^A\right|dx + \int|f*K_h^B|\,dx + \int|f|\int|K_h^B|\,dx.$$
The last two terms are bounded by $2\int|K_h^B|\,dx = 2\int_{|u|>M}|K(u)|\,du$, which can be made arbitrarily small by choosing $M$ large. Define the modulus of continuity $\omega_f$ of a function $f$,
$$\omega_f(t) = \sup_{x\in\mathbb R}\sup_{|z|\le t}|f(x+z) - f(x)|.$$
For a uniformly continuous function $f$, $\omega_f(t)\to 0$ as $t\to 0$. Then
$$\int\left|f*K_h^A - f\!\int\!K_h^A\right|dx \le \int\!\!\int|f(x-y) - f(x)|\,|K_h^A(y)|\,dy\,dx \le 2C\,\omega_f(2Mh)\int|K(y)|\,dy,$$
where $[-C,C]$ contains the support of $f$. This last quantity tends to zero for $M$ fixed as $h\to 0$.

To prove the lemma for general $f$, approximate $f$ by $g$ with compact support such that $\int|f-g| < \eta$ for some $\eta>0$. Then
$$\int|f*K_h - f|\,dx \le \int|f*K_h - g*K_h|\,dx + \int|f-g|\,dx + \int|g*K_h - g|\,dx \le 2\int|f-g|\,dx + \int|g*K_h - g|\,dx.$$
Then since we can take $\eta$ arbitrarily small the result is established.


8.5 General Theory for Local Nonlinear Estimators

We provide theory for a general class of estimators defined as finding zeros of an estimated moment condition $G_n(\theta;x)\in\mathbb R^q$, where $G_n(\theta;x)$ is defined on the set $\Theta\times\mathcal X$. The population moment condition $G(\theta;x)$ is defined on the set $\Theta\times\mathcal X$, where $\Theta,\mathcal X$ are both compact and $G$ is continuous. The following proposition is an extension of a standard consistency result of Pakes and Pollard (1989, Theorem 3.1). It is closely related to Lemma A1 of Newey and Powell (2003), although our results are more applicable for pointwise theory. See also Horowitz and Mammen (2004). We do not require continuity in $\theta$ or $x$ of the estimated moment function $G_n(\theta;x)$.

Proposition 1. Suppose that:

(i) For each $x\in\mathcal X$, $\hat\theta(x)\in\mathbb R^p$ (with $p\le q$) is any sequence such that
$$\sup_{x\in\mathcal X}\left[\|G_n(\hat\theta(x);x)\| - \inf_{\theta\in\Theta}\|G_n(\theta;x)\|\right] = o_p(1).$$

(ii) $G(\theta_0(x);x) = 0$ for all $x$, and for all $\varepsilon>0$ there exists $\delta>0$ such that
$$\inf_{x\in\mathcal X}\ \inf_{\theta\in\Theta:\|\theta-\theta_0(x)\|\ge\varepsilon}\|G(\theta;x)\| > \delta.$$

(iii) As $n\to\infty$,
$$\sup_{x\in\mathcal X}\sup_{\theta\in\Theta}\|G_n(\theta;x) - G(\theta;x)\| \overset{P}{\to} 0.$$

Then for all $x\in\mathcal X$,
$$\hat\theta(x) - \theta_0(x) \overset{P}{\to} 0. \tag{8.8}$$

The first condition is just that $\hat\theta(x)$ is an approximate minimizer of $\|G_n(\theta;x)\|$ uniformly over $x$; this condition is commonly used in cases where the objective function is not continuous, Pakes and Pollard (1989). The second condition is an identification condition that holds uniformly over $x$. The third condition requires uniform laws of large numbers to hold over $\Theta\times\mathcal X$.

Note that by the Kumagai (1980) implicit function theorem, $\theta_0(x)$ is a continuous function of $x\in\mathcal X$.
Proof of Proposition 1. We have for all $x\in\mathcal X$
$$\|G(\hat\theta(x);x)\| \le \|G_n(\hat\theta(x);x)\| + \|G_n(\hat\theta(x);x) - G(\hat\theta(x);x)\| \le \|G_n(\theta_0(x);x)\| + \sup_{\theta\in\Theta}\|G_n(\theta;x) - G(\theta;x)\| + o_p(1) \le \|G(\theta_0(x);x)\| + 2\sup_{\theta\in\Theta}\|G_n(\theta;x) - G(\theta;x)\| + o_p(1) = o_p(1),$$
using the triangle inequality, (i), and (iii). Then note that for all $x\in\mathcal X$
$$\Pr\left[\|\hat\theta(x) - \theta_0(x)\| > \varepsilon\right] \le \Pr\left[\|G(\hat\theta(x);x)\| > \delta\right],$$
because $0 < \delta \le \inf_{\|\theta-\theta_0(x)\|\ge\varepsilon}\|G(\theta;x)\| \le \|G(\hat\theta(x);x)\|$ whenever $\|\hat\theta(x) - \theta_0(x)\| > \varepsilon$. Therefore, (8.8) follows.

To establish uniformity over $x\in\mathcal X$ we need further conditions. For example, we could replace the identification condition as follows.

(ii') Suppose that $G(\theta_0(x);x) = 0$ for all $x$ and that for all $\varepsilon>0$ there exists $\delta>0$ such that
$$\inf_{\theta(\cdot):\,\sup_{x\in\mathcal X}\|\theta(x)-\theta_0(x)\|\ge\varepsilon}\ \sup_{x\in\mathcal X}\|G(\theta(x);x)\| > \delta.$$
Then we can show as above that
$$\sup_{x\in\mathcal X}\|G(\hat\theta(x);x)\| = o_p(1),$$
which implies that (8.8) holds uniformly over $x\in\mathcal X$. There are alternative approaches here that make use of smoothness of $G_n$, or an explicit structure like a sample average and exponential inequalities for such averages.

We next establish the limiting distributions. The following result is an extension of Theorem 3 of Pakes and Pollard (1989). The extension is to estimators with more general rates of convergence; it also takes account of the failure of a global stochastic equicontinuity condition for nonparametric estimators by providing a more targeted condition (iii)(a),(b), which uses a local stochastic equicontinuity condition [see Müller and Stadtmüller (1987) for comparison]. Also, condition (iv) reflects the nonparametric setting through a bias term and has a strengthening for purposes of uniformity.
Notice that we do not require any smoothness on the estimated moment function $G_n(\theta;x)$, although it should be well approximated by the smooth function $G$.

Proposition 2. Suppose that the following conditions hold for some sequence $j_n\to\infty$:

(i) For each $x$, $\hat\theta(x)\in\Theta$ is any sequence such that
$$\sup_{x\in\mathcal X}\left[\|G_n(\hat\theta(x);x)\| - \inf_{\theta\in\Theta}\|G_n(\theta;x)\|\right] = o_p(j_n^{-1}).$$

(ii) (a) The function $G(\theta;x)$ is differentiable in $\theta$ at $\theta = \theta_0(x)$ with derivative matrix $\Gamma(x) = \partial G(\theta_0(x);x)/\partial\theta$ of full rank uniformly in $x\in\mathcal X$.
(b) The derivative matrix $\partial G(\theta;x)/\partial\theta$ is, uniformly in $x\in\mathcal X$, Hölder continuous in $\theta$ with exponent $\varsigma>0$.

(iii) There is a sequence $\delta_n\to 0$ such that:
(a) For some sequence of positive numbers $\{\epsilon_n\}$ that converges to zero,
$$\sup_{x\in\mathcal X}\ \sup_{\|\theta-\theta_0(x)\|\le\epsilon_n}\delta_n^{-1}\|G_n(\theta;x) - G(\theta;x)\| \overset{P}{\to} 0.$$
(b) For every sequence of positive numbers $\{\zeta_n\}$ that converges to zero,
$$\sup_{x\in\mathcal X}\ \sup_{\delta_n^{-1}\|\theta-\theta_0(x)\|\le\zeta_n}j_n\|G_n(\theta;x) - G(\theta;x) - G_n(\theta_0(x);x)\| \overset{P}{\to} 0.$$

(iv) For some positive definite and finite matrix $V(x)$ and deterministic sequence $b_n(x)\to 0$ with $\limsup j_n\sup_x\|b_n(x)\|<\infty$,
$$j_n\sup_{x\in\mathcal X}\|G_n(\theta_0(x);x) - b_n(x) - Z_n(x)\| = o_p(1),\quad\text{where}$$
for each $x$, $j_nZ_n(x) \Longrightarrow N(0, V(x))$, and for some $r\ge 1$, $\sup_{x\in\mathcal X}\|Z_n(x)\| = O_p(j_n^{-1}\log^rn)$.

(v) $\theta_0(x)$ is an interior point of $\Theta$ for all $x$.

Then
$$\sup_{x\in\mathcal X}\left\|\hat\theta(x) - \theta_0(x) + (\Gamma^\top\Gamma)^{-1}\Gamma^\top(x)b_n(x) + (\Gamma^\top\Gamma)^{-1}\Gamma^\top(x)Z_n(x)\right\| = o_p(j_n^{-1}),\quad\text{where}$$
8.5 GENERAL THEORY FOR LOCAL NONLINEAR ESTIMATORS 139

for each x; jn (I > I) 1 I > (x)Zn (x) =) N (0; (I > I) 1 I > V I(I > I) 1 (x)):

Proof of Proposition 2. The proof is similar to that of Theorem 3 of Pakes and Pollard (1989). We first do the pointwise argument for a fixed $x$. Condition (ii) transfers the rate on $G_n$ in (iii)(a) to the same rate on $\hat\theta(x) - \theta_0(x)$, because for all $x$, $\|G(\theta; x)\| \ge C(x)\|\theta - \theta_0(x)\|$ for $\theta$ close to $\theta_0(x)$, where $\inf_{x\in\mathcal{X}} C(x) > 0$. Therefore,
$$\varepsilon_n^{-1}\|\hat\theta(x) - \theta_0(x)\| = o_p(1) \tag{8.9}$$
for each $x$. Having obtained $\varepsilon_n$-consistency of $\hat\theta(x)$, we then use condition (iii)(b) to obtain $j_n$-consistency along the lines of Pakes and Pollard (1989). Specifically, there exists a sequence $\xi_n \to 0$ such that $\Pr[\varepsilon_n^{-1}\|\hat\theta(x) - \theta_0(x)\| \ge \xi_n] \to 0$, and so the supremum in (iii)(b) covers $\hat\theta(x)$ with probability tending to one. It follows, by the triangle inequality and (iii)(b), that with probability tending to one
$$\|G(\hat\theta(x); x)\| - \|G_n(\hat\theta(x); x)\| - \|G_n(\theta_0(x); x)\| \le \|G_n(\hat\theta(x); x) - G(\hat\theta(x); x) - G_n(\theta_0(x); x)\| = o_p(j_n^{-1}). \tag{8.10}$$
Therefore,
$$\|G(\hat\theta(x); x)\| \le \|G_n(\hat\theta(x); x)\| + \|G_n(\theta_0(x); x)\| + o_p(j_n^{-1}) \le 2\|G_n(\theta_0(x); x)\| + o_p(j_n^{-1}) \le 2\|b_n(x) + Z_n(x)\| + o_p(j_n^{-1}) \tag{8.11}$$
by (i) and (iv). It follows that $\|G(\hat\theta(x); x)\| = O_p(j_n^{-1})$, and so
$$j_n\|\hat\theta(x) - \theta_0(x)\| = O_p(1). \tag{8.12}$$
Let
$$L_n(\theta; x) = I(x)(\theta - \theta_0(x)) + b_n(x) + Z_n(x).$$
By similar arguments to Pakes and Pollard (1989), one shows that $\|G_n(\hat\theta(x); x) - L_n(\hat\theta(x); x)\| = o_p(j_n^{-1})$. The minimizing value of $\|L_n(\theta; x)\|$ is $\theta^*(x) = \theta_0(x) - (I^{\top}I)^{-1}I^{\top}(x)(b_n(x) + Z_n(x))$, and it can be shown that $\|G_n(\theta^*(x); x) - L_n(\theta^*(x); x)\| = o_p(j_n^{-1})$. It follows that $\|\hat\theta(x) - \theta^*(x)\| = o_p(j_n^{-1})$. The pointwise asymptotic normality of $\hat\theta(x)$ then follows along the lines of their proof.

The extension to uniformity over $x$ proceeds as follows. By the triangle inequality, we have
$$\sup_{x\in\mathcal{X}}\|G(\hat\theta(x); x)\| - \sup_{x\in\mathcal{X}}\|G_n(\hat\theta(x); x)\| - \sup_{x\in\mathcal{X}}\|G_n(\theta_0(x); x)\| \le \sup_{x\in\mathcal{X}}\|G_n(\hat\theta(x); x) - G(\hat\theta(x); x) - G_n(\theta_0(x); x)\| = o_p(j_n^{-1}), \tag{8.13}$$
from which we obtain that
$$\sup_{x\in\mathcal{X}}\|G(\hat\theta(x); x)\| \le 2\sup_{x\in\mathcal{X}}\|G_n(\theta_0(x); x)\| + o_p(j_n^{-1}) \le 2\sup_{x\in\mathcal{X}}\|b_n(x) + Z_n(x)\| + o_p(j_n^{-1}).$$
Then, using $\sup_{x\in\mathcal{X}}\|G(\theta(x); x)\| \ge \big(\inf_{x\in\mathcal{X}} C(x)\big)\sup_{x\in\mathcal{X}}\|\theta(x) - \theta_0(x)\|$ for $\theta(x)$ uniformly close to $\theta_0(x)$, one obtains that $\sup_{x\in\mathcal{X}}\|\hat\theta(x) - \theta_0(x)\| = O_p(j_n^{-1}\log^r n)$. By the triangle inequality,
$$\begin{aligned}
\sup_{x\in\mathcal{X}}\|G_n(\hat\theta(x); x) - L_n(\hat\theta(x); x)\| &\le \sup_{x\in\mathcal{X}}\|G_n(\hat\theta(x); x) - G(\hat\theta(x); x) - G_n(\theta_0(x); x)\| \\
&\quad + \sup_{x\in\mathcal{X}}\|G_n(\theta_0(x); x) - b_n(x) - Z_n(x)\| \\
&\quad + \sup_{x\in\mathcal{X}}\|G(\hat\theta(x); x) - I(x)(\hat\theta(x) - \theta_0(x))\| \\
&= o_p(j_n^{-1}),
\end{aligned}$$
since, by the mean value theorem and the Hölder continuity condition (ii)(b),
$$\sup_{x\in\mathcal{X}}\|G(\hat\theta(x); x) - I(x)(\hat\theta(x) - \theta_0(x))\| = O_p\big((j_n^{-1}\log^r n)^{1+\varsigma}\big) = o_p(j_n^{-1}).$$
By similar arguments one shows that $\sup_{x\in\mathcal{X}}\|G_n(\theta^*(x); x) - L_n(\theta^*(x); x)\| = o_p(j_n^{-1})$. Finally, one obtains that $\sup_{x\in\mathcal{X}}\|\hat\theta(x) - \theta^*(x)\| = o_p(j_n^{-1})$ by the same arguments as in Pakes and Pollard (1989).

8.5.1 Consistency of the Nadaraya-Watson Estimator


We can apply our general results to the Nadaraya-Watson estimator. Technically, however, we would have to restrict our attention to a compact $\Theta$, which is not strictly needed for the NW estimator, since it is available in closed form and can take any value in $\mathbb{R}$. Furthermore, since it is in closed form, more direct methods can be applied to obtain its limiting behaviour.

Consider the following moment function:
$$G_n(\theta; x) = \frac{1}{n}\sum_{i=1}^n K_h(x - X_i)\{Y_i - \theta\}.$$

We seek to establish the consistency and asymptotic normality of the zero of $G_n(\theta; x)$, which is the Nadaraya-Watson kernel estimator. We shall suppose that $x$ is an interior point of the support of $X$; that is, there is an open ball $B(x; \delta)$ [for some small $\delta > 0$] that is totally contained in the support of $X$. We take the corresponding limit to be
$$G(\theta; x) = \{m(x) - \theta\}f_X(x).$$
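Since $G_n(\theta;x)$ is linear in $\theta$, its zero is available in closed form and coincides with the Nadaraya-Watson ratio. The following short sketch illustrates this; the data-generating process, kernel, and bandwidth are illustrative choices, not taken from the text.

```python
import numpy as np

def kh(u, h):
    """Scaled Gaussian kernel K_h(u) = K(u/h)/h."""
    return np.exp(-0.5 * (u / h)**2) / (h * np.sqrt(2.0 * np.pi))

def nw_as_zero(x, X, Y, h):
    """Solve (1/n) sum_i K_h(x - X_i)(Y_i - theta) = 0 for theta:
    the solution is the Nadaraya-Watson estimator of m(x) = E[Y | X = x]."""
    w = kh(x - X, h)
    return np.sum(w * Y) / np.sum(w)

rng = np.random.default_rng(1)
n = 2000
X = rng.uniform(-1.0, 1.0, n)
Y = np.sin(np.pi * X) + 0.3 * rng.standard_normal(n)
print(nw_as_zero(0.5, X, Y, h=0.15), np.sin(np.pi * 0.5))
```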

Then it is clear that $\theta_0 = m(x)$ is the unique zero of $G(\theta; x)$, so that the pointwise identification condition is automatically satisfied for all $x$ with $f_X(x) > 0$. We must show the uniform convergence of $G_n$ to $G$. We have
$$\|G_n(\theta; x) - G(\theta; x)\| \le \|G_n(\theta; x) - E_X G_n(\theta; x)\| + \|E_X G_n(\theta; x) - EG_n(\theta; x)\| + \|EG_n(\theta; x) - G(\theta; x)\|,$$
where $E_X$ denotes expectation conditional on $X_1,\ldots,X_n$, and it suffices to work on these three terms separately. First, note that
$$E_X G_n(\theta; x) = \frac{1}{n}\sum_{i=1}^n K_h(x - X_i)\{m(X_i) - \theta\},$$
$$EG_n(\theta; x) = E\,E_X G_n(\theta; x) = E\big[K_h(x - X_i)\{m(X_i) - \theta\}\big] = \int K_h(x - X)\{m(X) - \theta\}f(X)\,dX$$
by iterated expectation. Then apply the change of variables $u = (X - x)/h$ (using the symmetry of $K$) to write
$$\int_{\underline{x}}^{\overline{x}} K_h(x - X)\{m(X) - \theta\}f(X)\,dX = \int_{(\underline{x}-x)/h}^{(\overline{x}-x)/h} K(u)\{m(x + uh) - \theta\}f(x + uh)\,du,$$
where $\underline{x}$ and $\overline{x}$ are the lower and upper limits respectively of the support of $X$. Provided the point $x$ is such that $x - \underline{x} > ch$ and $\overline{x} - x > ch$ for some finite $c$, and the kernel $K$ has finite support, we can replace the limits of integration by those of the kernel [e.g., $-1, 1$]. In the sequel we shall assume this is the case. Therefore, by dominated convergence,
$$\sup_{\theta\in\Theta}\|EG_n(\theta; x) - G(\theta; x)\| = \sup_{\theta\in\Theta}\left|\int_{-1}^{1} K(u)\{m(x + uh) - \theta\}f(x + uh)\,du - \{m(x) - \theta\}f_X(x)\right| = o(1)$$
for any compact set $\Theta$, provided $m$ and $f$ are continuous at $x$.

Second,

$$G_n(\theta; x) - E_X G_n(\theta; x) = \frac{1}{n}\sum_{i=1}^n K_h(x - X_i)\varepsilon_i,$$
where $\varepsilon_i$ is an independent sequence satisfying $E(\varepsilon_i\mid X_i) = 0$ with probability one. By a law of large numbers for triangular arrays of independent random variables [of the form $\sum_{i=1}^n Z_{ni}$], we get, provided $\sup_n E[|K_h(x - X_i)\varepsilon_i|] < \infty$, that
$$G_n(\theta; x) - E_X G_n(\theta; x) = o_p(1),$$
and since this random sequence does not depend on $\theta$, convergence is uniform over $\theta \in \Theta$. Likewise,
$$E_X G_n(\theta; x) - EG_n(\theta; x) = \frac{1}{n}\sum_{i=1}^n\Big\{K_h(x - X_i)(m(X_i) - \theta) - E\big[K_h(x - X_i)(m(X_i) - \theta)\big]\Big\} =: \frac{1}{n}\sum_{i=1}^n \eta_n(X_i; \theta)$$
is a sum of independent mean-zero random variables that is $o_p(1)$ by the same reasoning. The uniformity in $\theta$ comes from the linear way in which $\theta$ enters and the compactness of $\Theta$. Specifically, $\sup_n\sup_{\theta\in\Theta} E[|\eta_n(X_i; \theta)|] < \infty$.

8.5.2 Asymptotic Normality of the Nadaraya-Watson Estimator


In our case, we take
$$1/\varrho_n = \max\Big\{1/\sqrt{nh},\ \sup_{\theta\in\Theta}\|EG_n(\theta; x) - G(\theta; x)\|\Big\}.$$
Clearly,
$$\|G(\theta; x)\| = |m(x) - \theta|\,f_X(x) = C|\theta - \theta_0|,$$
where $C = f_X(x)$, so that (ii) is satisfied. We now turn to (iii). Let $r_\theta(z) = \{m(z) - \theta\}f(z)$ for any $z$. Now, provided $r_\theta$ is twice continuously differentiable at $x$, we have
$$\begin{aligned}
EG_n(\theta; x) - G(\theta; x) &= \int K(u)\{r_\theta(x + uh) - r_\theta(x)\}\,du \\
&= h\,r_\theta'(x)\int K(u)u\,du + \frac{h^2}{2}\int u^2 K(u)\,r_\theta''(x^*(u,h))\,du,
\end{aligned}$$
where $x^*(u,h)$ lies between $x$ and $x + uh$. Provided $\int K(u)u\,du = 0$, the first term drops out. By dominated convergence we then have
$$EG_n(\theta; x) - G(\theta; x) = \frac{h^2}{2}\int u^2 K(u)\,du\ r_\theta''(x)\{1 + o(1)\}, \tag{8.14}$$
where $r_\theta''(x) = (mf)''(x) - \theta f''(x)$ and the error is uniform in $\theta \in \Theta$. Furthermore, provided $E\big[\tfrac{1}{h}K^2\big(\tfrac{x - X_i}{h}\big)\varepsilon_i^2\big] < \infty$, we have
$$G_n(\theta; x) - E_X G_n(\theta; x) = \frac{1}{n}\sum_{i=1}^n K_h(x - X_i)\varepsilon_i = O_p(1/\sqrt{nh}),$$
because this random sequence is mean zero. We have
$$E\left[\frac{1}{nh}\frac{1}{h}K^2\Big(\frac{x - X_i}{h}\Big)\varepsilon_i^2\right] = \frac{1}{nh}\int K^2(u)\sigma^2(x + uh)f_X(x + uh)\,du = \frac{1}{nh}\sigma^2(x)f_X(x)\int K^2(u)\,du\,\{1 + o(1)\}.$$
Furthermore,
$$E_X G_n(\theta; x) - EG_n(\theta; x) = \frac{1}{n}\sum_{i=1}^n\Big\{K_h(x - X_i)(m(X_i) - \theta) - E\big[K_h(x - X_i)(m(X_i) - \theta)\big]\Big\} = o_p(1/\sqrt{nh})$$
uniformly in $\|\theta - \theta_0\| \le \delta_n$. Again this is a sum of mean-zero independent random variables, and
$$E\left[\frac{1}{nh}\frac{1}{h}K^2\Big(\frac{x - X_i}{h}\Big)(m(X_i) - \theta)^2\right] = \frac{1}{nh}\int K^2(u)(m(x + uh) - \theta)^2 f_X(x + uh)\,du = o\big(1/(nh)\big)$$
for $\|\theta - \theta_0\| \le \delta_n$. Finally, $G_n(\theta_0; x) = O_p(1/\sqrt{nh})$ by the same arguments.
In this case, $\theta$ is scalar and $I(x) = -f_X(x)$. For (iii), it clearly suffices to show that
$$\sup_{\|\theta - \theta_0\|\le\delta_n}\big\|\varrho_n\big[EG_n(\theta; x) - G(\theta; x)\big] - \varrho_n\big[EG_n(\theta_0; x) - G(\theta_0; x)\big]\big\| = o(1),$$
which follows because $r_\theta'' - r_{\theta_0}'' = -(\theta - \theta_0)f''$, so that
$$\sup_{\|\theta - \theta_0\|\le\delta_n}\big\|EG_n(\theta; x) - G(\theta; x) - \big\{EG_n(\theta_0; x) - G(\theta_0; x)\big\}\big\| = \sup_{\|\theta - \theta_0\|\le\delta_n}\frac{h^2}{2}\left|\int u^2 K(u)\big\{r_\theta''(x^*(u,h)) - r_{\theta_0}''(x^*(u,h))\big\}\,du\right| = o(1/\varrho_n)$$

and
$$\sup_{\|\theta - \theta_0\|\le\delta_n}\big\|E_X G_n(\theta; x) - EG_n(\theta; x) - \big\{E_X G_n(\theta_0; x) - EG_n(\theta_0; x)\big\}\big\| = \sup_{\|\theta - \theta_0\|\le\delta_n}\left\|\frac{1}{n}\sum_{i=1}^n\big\{K_h(x - X_i) - EK_h(x - X_i)\big\}(m(x) - \theta)\right\| = o_p(1/\sqrt{nh}).$$
The final result combines a central limit theorem for
$$G_n(\theta_0; x) - E_X G_n(\theta_0; x) = \frac{1}{n}\sum_{i=1}^n K_h(x - X_i)\varepsilon_i$$
with the bias result (8.14). We use the following CLT.


Lindeberg's CLT for Triangular Arrays. Consider the triangular array $Z_{n1},\ldots,Z_{nr_n}$ such that:

1. For each $n$, the random variables $Z_{n1},\ldots,Z_{nr_n}$ are mutually independent;

2. $EZ_{nj} = 0$, $j = 1,\ldots,r_n$, and $n = 1,2,\ldots$;

3. $\sum_{j=1}^{r_n} E(Z_{nj}^2) = 1$;

4. For all $\varepsilon > 0$,
$$\lim_{n\to\infty}\sum_{j=1}^{r_n} E\big[Z_{nj}^2\,1(|Z_{nj}| > \varepsilon)\big] = 0.$$

Then
$$\sum_{j=1}^{r_n} Z_{nj} \Longrightarrow N(0,1).$$

Take
$$U_{ni} = \frac{1}{nh}K\Big(\frac{x - X_i}{h}\Big)\varepsilon_i = w_{ni}\varepsilon_i, \qquad Z_{ni} = U_{ni}\Big/\sqrt{\textstyle\sum_{j=1}^n \mathrm{var}(U_{nj}\mid X_1,\ldots,X_n)}.$$
We have to check that, with probability one,
$$\frac{1}{\sum_{i=1}^n w_{ni}^2\sigma^2(X_i)}\sum_{i=1}^n E\left[w_{ni}^2\varepsilon_i^2\,1\Bigg(|w_{ni}\varepsilon_i| > c\sqrt{\sum_{i=1}^n w_{ni}^2\sigma^2(X_i)}\Bigg)\right] \to 0 \tag{8.15}$$
as $n \to \infty$ for all $c > 0$. Letting $v_{ni} = w_{ni}\sigma(X_i)\big/\sqrt{\sum_{j=1}^n w_{nj}^2\sigma^2(X_j)}$ and $\eta_i = \varepsilon_i/\sigma(X_i)$, and noting that $\sum_{i=1}^n v_{ni}^2 = 1$, (8.15) is bounded by
$$\max_{1\le i\le n} E\big[\eta_i^2\,1(|\eta_i| > c/v_{ni})\big],$$
which tends to zero provided that
$$\max_{1\le i\le n} v_{ni} = \frac{\max_{1\le i\le n}|w_{ni}\sigma(X_i)|}{\sqrt{\sum_{i=1}^n w_{ni}^2\sigma^2(X_i)}} \to 0. \tag{8.16}$$
We have already shown that $\sum_{i=1}^n w_{ni}^2\sigma^2(X_i) = O_p(1/nh)$, and indeed that it is of exactly this order in probability. Provided $K$ and $\sigma^2(\cdot)$ are bounded,
$$\max_{1\le i\le n}|w_{ni}\sigma(X_i)| \le \frac{1}{nh}\max_{1\le i\le n}K\Big(\frac{x - X_i}{h}\Big)\max_{1\le i\le n}\sigma(X_i) = O(1/nh),$$
and (8.16) is satisfied.
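To illustrate the Lindeberg argument numerically, the following sketch simulates the standardized sum $\sum_i Z_{ni}$ built from the kernel weights $w_{ni}$; its empirical mean and standard deviation should be close to 0 and 1 when $nh$ is large. The uniform design, the heteroskedastic error specification, the Gaussian kernel, and the bandwidth are all illustrative assumptions, not from the text.

```python
import numpy as np

rng = np.random.default_rng(2)

def kernel(u):
    """Standard Gaussian kernel K(u)."""
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)

def standardized_sum(n, h, x0=0.0):
    """One draw of sum_i Z_ni with Z_ni = w_ni eps_i / sqrt(sum_j w_nj^2 sigma^2(X_j))."""
    X = rng.uniform(-1.0, 1.0, n)
    sigma = 0.5 + 0.25 * X**2                  # conditional standard deviation (illustrative)
    eps = sigma * rng.standard_normal(n)       # E[eps_i | X_i] = 0
    w = kernel((x0 - X) / h) / (n * h)         # w_ni = K((x - X_i)/h) / (n h)
    return np.sum(w * eps) / np.sqrt(np.sum(w**2 * sigma**2))

draws = np.array([standardized_sum(n=2000, h=0.1) for _ in range(2000)])
print(draws.mean(), draws.std())               # approximately 0 and 1
```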


Consider the median kernel smoother
$$\hat\theta(x) = \arg\min_{\theta}\sum_{i=1}^n K_h(x - X_i)\,|Y_i - \theta|.$$
The minimizer is not unique, but any rule to select $\hat\theta(x)$ will do. More generally, we can allow $\{w_{ni}(x)\}$ to be smoother weights that satisfy $\sum_{i=1}^n w_{ni}(x) = 1$. Just like the usual median, the local median is a nonlinear function of the $Y$'s. Note that $\hat\theta(x)$ solves the first-order condition
$$\hat\theta(x) = \arg\mathrm{zero}_\theta\ G_n(\theta), \qquad G_n(\theta) = \sum_{i=1}^n w_{ni}(x)\big\{1(Y_i - \theta > 0) - 1(Y_i - \theta \le 0)\big\}.$$
Let
$$G(\theta) = E\big[G_n(\theta)\mid X_1,\ldots,X_n\big] = \sum_{i=1}^n w_{ni}(x)\{1 - 2F_i(\theta)\},$$

where $F_i(\theta) = \Pr(Y_i \le \theta\mid X_i)$. Note that
$$G'(\theta(x)) = -2\sum_{i=1}^n w_{ni}(x)F_i'(\theta(x)) \simeq -2f_x(\theta(x)),$$
where $f_x = F_x'$ and $F_x(\cdot) = \Pr(Y \le \cdot\mid X = x)$, by a Taylor expansion and using the fact that $\sum_{i=1}^n w_{ni}(x) = 1$. Therefore,
$$\hat\theta(x) - \theta(x) = \frac{\sum_{i=1}^n w_{ni}(x)\big\{1(Y_i - \theta(x) > 0) - 1(Y_i - \theta(x) \le 0)\big\}}{2f_x(\theta(x))} + O_p(h^2)$$
by standard arguments. Therefore, we have [with undersmoothing]
$$\frac{\hat\theta(x) - E\big\{\hat\theta(x)\mid X_1,\ldots,X_n\big\}}{\Big(\sum_{i=1}^n w_{ni}(x)^2\big/4f_x(\theta(x))^2\Big)^{1/2}} \Longrightarrow N(0,1).$$

See Jones and Marron (1990) for further discussion.
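A minimal computational sketch of the local median follows: it can be computed as a weighted median of the $Y_i$ with weights $K_h(x - X_i)$, which minimizes the kernel-weighted absolute-error criterion. The heavy-tailed $t$ errors, Gaussian kernel, and bandwidth are illustrative choices, not from the text.

```python
import numpy as np

def kh(u, h):
    """Scaled Gaussian kernel K_h(u) = K(u/h)/h."""
    return np.exp(-0.5 * (u / h)**2) / (h * np.sqrt(2.0 * np.pi))

def local_median(x, X, Y, h):
    """Weighted median of Y with weights w_i = K_h(x - X_i); this minimizes
    sum_i K_h(x - X_i) |Y_i - theta| (any minimizer-selection rule is fine)."""
    w = kh(x - X, h)
    order = np.argsort(Y)
    cdf = np.cumsum(w[order]) / np.sum(w)
    return Y[order][np.searchsorted(cdf, 0.5)]

rng = np.random.default_rng(3)
n = 3000
X = rng.uniform(-1.0, 1.0, n)
Y = np.cos(np.pi * X) + rng.standard_t(df=3, size=n)   # conditional median is cos(pi * x)
print(local_median(0.3, X, Y, h=0.15), np.cos(np.pi * 0.3))
```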

8.5.3 Backfitting in Linear Regression


We consider a very familiar example to illustrate this result. Suppose that we have $y, x_1,\ldots,x_J$, vectors in $\mathbb{R}^n$. Define the projection operators $P_j = x_j(x_j^{\top}x_j)^{-1}x_j^{\top}$, $P = x(x^{\top}x)^{-1}x^{\top}$ (where $x = (x_1,\ldots,x_J)$), $M_j = I_n - P_j$, and $M = I_n - P$. Thus
$$y = Py + My = \hat{y} + \hat{u}.$$
The residual after $K$ cycles of the backfitting algorithm is
$$\hat{u}_K = T^K y, \qquad T = M_J M_{J-1}\cdots M_1.$$
Then
$$T^K y \to My = \hat{u}.$$

This just says that one can compute the regression of $y$ on $x_1,\ldots,x_J$ by computing a sequence of one-dimensional linear regressions.

We now show that $T$ is a strict contraction on $C(x)$, the column space of $x$. First, for any vector $z$ we have
$$\|Tz\| \le \|M_J\|\,\|M_{J-1}\cdots M_1 z\| \le \|M_{J-1}\cdots M_1 z\| \le \cdots \le \|z\|,$$
since $\|M_j\| = \lambda_{\max}(M_j) = 1$ ($M_j$ are symmetric idempotent). Furthermore, if $\|Tz\| = \|z\|$, then $\|M_1 z\| = \|z\|$, which implies that $z$ is orthogonal to the space spanned by $x_1$, i.e., $z \in C(x_1)^{\perp}$. Similarly we obtain that $z \in C(x_j)^{\perp}$, $j = 2,\ldots,J$. In other words, if $\|Tz\| = \|z\|$, then $z \in C(x)^{\perp}$. Therefore, if $z \in C(x)$, it must be that
$$\|Tz\| \le (1 - \gamma)\|z\| < \|z\|$$
for some $0 < \gamma \le 1$. Also, if $z \in C(x)$, then $Tz \in C(x)$. Hence
$$\|T^K z\| \le (1 - \gamma)\|T^{K-1}z\| \le \cdots \le (1 - \gamma)^K\|z\|.$$
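A short numerical sketch of this result (illustrative three-regressor design, not from the text): apply the one-dimensional residual makers $M_j$ in turn and verify that $T^K y$ approaches the full least-squares residual $My$.

```python
import numpy as np

rng = np.random.default_rng(4)
n, J = 200, 3
x = rng.standard_normal((n, J))                  # columns x_1, ..., x_J
y = x @ np.array([1.0, -2.0, 0.5]) + rng.standard_normal(n)

def resid_maker(col):
    """M_j = I - x_j (x_j' x_j)^{-1} x_j', applied to a vector."""
    return lambda z: z - col * (col @ z) / (col @ col)

M = [resid_maker(x[:, j]) for j in range(J)]

def T(z):
    """One backfitting cycle: T = M_J M_{J-1} ... M_1 (apply M_1 first)."""
    for Mj in M:
        z = Mj(z)
    return z

# Full least-squares residual u_hat = M y
beta_hat, *_ = np.linalg.lstsq(x, y, rcond=None)
u_hat = y - x @ beta_hat

z = y.copy()
for _ in range(50):                              # T^K y for K = 50 cycles
    z = T(z)
print(np.max(np.abs(z - u_hat)))                 # should be essentially zero
```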
Bibliography

[1] Ai, C., and X. Chen (2003). Efficient Estimation of Models with Conditional Moment Restrictions Containing Unknown Functions, Econometrica vol. 71, 1795-1843.

[2] Ai, C., and X. Chen (2007). Estimation of possibly misspecified semiparametric conditional moment restriction models with different conditioning variables, Journal of Econometrics 141, 5-43.

[3] Akaike, H. (1970): Statistical predictor identification, Annals of the Institute of Statistical Mathematics 22, 203-217.

[4] Akaike, H. (1974): A new look at the statistical model identification. IEEE Transactions on Automatic Control AC-19, 716-723.

[5] Andrews, D.W.K., (1991): Asymptotic Normality of Series Estimators for Nonparametric and
Semiparametric Regression Models.Econometrica 59, 307-346.

[6] Andrews (1995). Nonparametric Kernel Estimation for Semiparametric Models, Econometric
Theory 11, 560-596.

[7] Andrews, D.W.K., and Y-J. Whang (1990): Additive and Interactive Regression Models:
Circumvention of the Curse of Dimensionality,Econometric Theory 6, 466-479.

[8] Andrews, D.W.K., (1994a): Asymptotics for semiparametric econometric models via stochas-
tic equicontinuity, Econometrica 62, 43-72.


[9] Andrews, D.W.K., (1994b): Empirical process method in econometrics,in The Handbook of
Econometrics, vol. IV, eds. D.F. McFadden and R.F. Engle III. North Holland.

[10] Ansley, C.F., R. Kohn, and C. Wong (1993): Nonparametric spline regression with prior
information,Biometrika 80, 75-88.

[11] Audrino, F., and P. Bühlmann (2001): Tree-structured GARCH models, Journal of the Royal Statistical Society, Series B, 63, 727-744.

[12] Bickel, P.J., and M. Rosenblatt (1973). On some global measures of the deviations of density
function estimates. Annals of Statistics 1, 1071-1095.

[13] Bickel, P.J., Klaassen, C.A.J., Ritov, Y. and J.A. Wellner (1993): Efficient and adaptive estimation for semiparametric models. The Johns Hopkins University Press, Baltimore and London.

[14] Bierens, H.J., (1987): Kernel Estimators of Regression Functions.in Advances in Economet-
rics: Fifth World Congress, Vol 1, ed. by T.F. Bewley. Cambridge University Press.

[15] Blackorby, C., D. Primont and R. R. Russell, (1978), Duality, Separability, and Func-
tional Structure: Theory and Economic Applications. New York: North Holland.

[16] Blundell, R., X. Chen, and D. Kristensen (2007). Semi-Nonparametric IV Estimation of Shape
Invariant Engel Curves. Econometrica, vol. 75, 1613-1670

[17] Bosq, D. (1998): Nonparametric Statistics for Stochastic Processes. Estimation and Prediction.
Springer, Berlin.

[18] Box, G.E.P. and D.R. Cox, (1964), An analysis of transformations, Journal of the Royal Sta-
tistical Society, Series B, 26, 211-252.

[19] Breiman, L. and J.H. Friedman, (1985), Estimating optimal transformations for multiple re-
gression and correlation (with discussion), Journal of the American Statistical Association, 80,
580-619.

[20] Brillinger, D.R., (1980) Time Series, Data analysis and Theory. Holden-Day.

[21] Brown, L.D., M.G. Low, and L.H. Zhao (1997): Superefficiency in nonparametric function estimation, The Annals of Statistics 25, 2607-2625.

[22] Buja, A., T. Hastie and R. Tibshirani. (1989). Linear smoothers and additive models (with
discussion). Annals of Statistics 17, 453-555.

[23] Cai, Z., and E. Masry (2000). Nonparametric estimation of additive nonlinear ARX time series: Local linear fitting and projections. Econometric Theory, 16, 465-501.

[24] Carrasco, M., J.P. Florens, and E. Renault (2002), Linear Inverse problems in Struc-
tural Econometrics,Forthcoming in Handbook of Econometrics, volume 6, eds. J.J. Heckman
and E. Leamer.

[25] Carroll, R., E. Mammen, and W. Härdle (2002): Estimation in an additive model when the components are linked parametrically, Econometric Theory 18, 886-912.

[26] Chen, X. (2007). Large Sample Sieve Estimation of Semi-nonparametric Models, chapter 76
in Handbook of Econometrics, Vol. 6B, 2007, eds. James J. Heckman and Edward E.Leamer,
North-Holland.

[27] Chen, X., O.B. Linton and I. Van Keilegom, (2003), Estimation of semiparametric models
when the criterion function is not smooth, Econometrica, 71, 1591-1608.

[28] Chen, X., and E. Ghysels (2010): News - Good or Bad - and its impact on volatility predictions over multiple horizons, Review of Financial Studies (forthcoming).

[29] Chesher, A. (2003). Identification in Nonseparable Models, Econometrica, 71, 1401-1444.

[30] Cheze, N., J-M Poggi, and B. Portier (2003). Partial and Recombined Estimators for Nonlinear
Additive Models. Statistical Inference for Stochastic Processes 6: 155197.

[31] Christopheit, N. and S.G.N. Hoderlein (2005) Local Partitioned Regression, Manu-
script, Mannheim University.

[32] Claeskens, G. and M. Aerts (2000). On local estimating equations in additive multiparameter
models. Statistics & Probability Letters 49, 139-148.

[33] Coppejans, M. (2004) On Kolmogorov's representation of functions of several variables by functions of one variable. Journal of Econometrics 123, 1-31.

[34] Deaton, A. and J. Muellbauer (1980): Economics and Consumer Behavior. Cambridge Univer-
sity Press: Cambridge.

[35] Debbarh, M., and V. Viallon (2008) Uniform limit laws of the logarithm for estimators of
the additive regression function in the presence of right censored data Electronic Journal of
Statistics Vol. 2, 516541

[36] Dette, H. and C. von Lieres und Wilkau (2001) Testing Additivity by Kernel-Based Methods: What Is a Reasonable Test? Bernoulli, Vol. 7, No. 4, 669-697.

[37] Dette, H., J.C. Pardo-Fernandez and I. Van Keilegom (2009) Goodness-of-Fit Tests for Multiplicative Models with Dependent Data. Scandinavian Journal of Statistics, Vol. 36, 782-799.

[38] Dette, H. and R. Scheder (2009). Estimation of additive quantile regression. Ann Inst Stat
Math DOI 10.1007/s10463-009-0225-5

[39] Ekeland, I., J.J. Heckman, and L. Nesheim (2004). Identication and estimation of Hedonic
Models. Journal of Political Economy, 112, S60-S109.

[40] Engle, R.F. and V.K. Ng (1993): Measuring and Testing the impact of news on volatility,
The Journal of Finance XLVIII, 1749-1778.

[41] Fan, J. (1992): Design-adaptive nonparametric regression, Journal of the American Statis-
tical Association 82, 998-1004.

[42] Fan, J. (1993): Local linear regression smoothers and their minimax e ciencies, Annals of
Statistics 21, 196-216.

[43] Fan, J., and J. Chen (1997): One-step Local Quasi-Likelihood Estimation,Journal of the
Royal Statistical Society 61, 927-943.

[44] Fan, J., and I. Gijbels (1996), Local Polynomial Modelling and Its Applications Chapman
and Hall.

[45] Fan, J., W. Härdle, and E. Mammen (1998). Direct Estimation of Low-Dimensional Components in Additive Models. The Annals of Statistics, Vol. 26, No. 3, pp. 943-971.

[46] Fan, J., and J. Jiang (2005). Nonparametric Inferences for Additive Models. Journal of the
American Statistical Association, Vol. 100, No. 471

[47] Fan, J., and Q. Yao (1998): Efficient estimation of conditional variance functions in stochastic regression, Biometrika 85, 645-660.

[48] Fan, Y. and Q. Li (2003). A kernel-based method for estimating additive partially linear models.
Statistica Sinica 13, 739-762

[49] Friedman, J.H. and W. Stuetzle (1981), Projection Pursuit Regression, Journal of the American Statistical Association, 76, 817-823.

[50] Gao, J., Lu, Z. and Tjøstheim, D. (2006). Estimation in semiparametric spatial regression. Ann. Statist. 34, 1395-1435.

[51] Goldman, S. M. and H. Uzawa, (1964), A Note On Separability and Demand Analysis,
Econometrica, 32, 387-398.

[52] G. Golub & C. van Loan (1996), Matrix computations, third edition, The Johns Hopkins
University Press, London

[53] Gooijer, Jan G De, Dawit Zerom (2003). On Additive Conditional Quantiles With High-
Dimensional Covariates. Journal of the American Statistical Association. 98(461): 135-146.

[54] Gorman, W. M., (1959), Separable Utility and Aggregation,Econometrica, 27, 469-481.

[55] Gozalo, P., and O. Linton (2000): Local nonlinear least squares estimation: Using para-
metric information nonparametrically,The Journal of Econometrics 99, 63-106.

[56] Gozalo, P., and O. Linton (2001): A Nonparametric Test of Additivity in Generalized
Nonparametric Regression with estimated parameters, (with P. Gozalo). Journal of Econo-
metrics 104, 1-48.

[57] Härdle, W. (1990): Applied Nonparametric Regression. Cambridge University Press, Cambridge.

[58] Härdle, W., S. Huet, and E. Mammen (2004) Bootstrap inference in semiparametric generalized additive models. Econometric Theory, 20, 265-300.

[59] Härdle, W., W. Kim and G. Tripathi (2001): Nonparametric Estimation of Additive Models With Homogeneous Components, Economics Essays: A Festschrift for Werner Hildenbrand, eds. G. Debreu, W. Neuefeind, and W. Trockel, 159-179, Berlin: Springer.

[60] Härdle, W., S. Sperlich, and V. Spokoiny (2001) Structural tests in additive regression. Journal of the American Statistical Association, Vol. 96, No. 456.

[61] Haag, B. (2008). Non-parametric Regression Tests Using Dimension Reduction Techniques.
Scandinavian Journal of Statistics, Vol. 35: 719738, 2008

[62] Hastie, T. J. and R. Tibshirani, (1990), Generalized Additive Models, Chapman and Hall:
London.

[63] Heckman, J.J., H. Ichimura, and P. Todd (1998) Matching As An Econometric Evaluation
Estimator. Review of Economic Studies 65, 261-294.

[64] Hegland, M., I. McIntosh, and B. Turlach (1999). A parallel solver for generalised additive
models. Computational Statistics & Data Analysis 31, 377-396

[65] Hengartner, N.W. and S. Sperlich, (2005), Rate optimal estimation with the integration method
in the presence of many covariates, Journal of Multivariate Analysis, 95, 246-272.

[66] Honda, T. (2005). Estimation in additive cox models by marginal integration. Annals of the
Institute of Statistical Mathematics 3, 403-423.

[67] Horowitz, J., (1996), Semiparametric Estimation of a Regression Model with an Unknown
Transformation of the Dependent Variable, Econometrica, 64, 103-137.

[68] Horowitz, J., (2001), Nonparametric Estimation of a Generalized Additive Model With An
Unknown Link Function, Econometrica, 69, 499-513.

[69] Horowitz, J., Klemelä, J., and Mammen, E. (2006), Optimal estimation in additive regression models, Bernoulli 12, 271-298.

[70] Horowitz, J., and Mammen, E. (2004), Nonparametric Estimation of an Additive Model With A Link Function, Annals of Statistics 32, 2412-2443.

[71] Horowitz, J., and Mammen, E. (200?), Oracle efficient nonparametric estimation of an additive model with an unknown link function.

[72] Horowitz, J., and Mammen, E. (2007), Rate-optimal estimation for a general class of nonparametric regression models with unknown link function. The Annals of Statistics, Vol. 35, No. 6, 2589-2619.

[73] Ibragimov, I.A. and R.Z. Hasminskii, (1980), On nonparametric estimation of regression, Soviet
Math. Dokl., 21, 810-814.

[74] Jiang, J, Y. Fan and J. Fan (2010). Estimation in additive models with highly or nonhighly
correlated covariates. The Annals of Statistics, Vol. 38, No. 3, 14031432

[75] Kauermann, G. and J.D. Opsomer (200?) Generalized Cross-Validation for Bandwidth Selection of Backfitting Estimates in

[76] Kim, W., and O. Linton (2002): A Local Instrumental Variable Estimation method for
Generalized Additive Volatility Models,Forthcoming in Econometric Theory.

[77] Lee, M.J. and Y. Kondo (200?) Nonparametric Derivative Estimation for Related-Effect Panel Data.

[78] Leontief, W. (1947). Introduction to a theory of the internal structure of functional relationships. Econometrica, 15, 361-373.

[79] Lewbel, A., and O. Linton (2006). Nonparametric Matching and Efficient Estimators of Homothetically Separable Functions. Econometrica.

[80] Lin, X. and R. Carroll (2006). Semiparametric estimation in general repeated measures problems. J. R. Statist. Soc. B 68, Part 1, pp. 69-88.

[81] Linton, O.B. (1996): Efficient estimation of additive nonparametric regression models, Biometrika 84, 469-474.

[82] Linton, O.B. (2000): Efficient estimation of generalized additive nonparametric regression models, Econometric Theory 16, 502-523.

[83] Linton, O.B., R. Chen, N. Wang and W. Härdle (1997), An analysis of transformations for additive nonparametric regression, Journal of the American Statistical Association, 92, 1512-1521.

[84] Linton, O. and W. Härdle (1996), Estimating additive regression models with known links, Biometrika 83, 529-540.

[85] Linton, O. and E. Mammen (2005), Estimating semiparametric ARCH(∞) models by kernel smoothing, Econometrica, 73, 771-836.

[86] Linton, O.B. and J.P. Nielsen (1995), A kernel method of estimating structured nonparametric regression using marginal integration, Biometrika, 82, 93-100.

[87] Lu, Z., A. Lundervold, D. Tjøstheim, and Q. Yao (2007). Exploring spatial nonlinearity using additive approximation. Bernoulli 13(2), 447-472.

[88] Luce, R.D., and J.W. Tukey (1964). Simultaneous conjoint measurement: a new type of fun-
damental measurement. Journal of Mathematical Psychology 1, 1-27.

[89] Mammen, E., O.B. Linton and J.P. Nielsen (1999), The existence and asymptotic properties of a backfitting projection algorithm under weak conditions, Annals of Statistics, 27, 1443-1490.

[90] Mammen, E., and J.P. Nielsen (2003). Generalised structured models. Biometrika, 90, 3, pp. 551-566.

[91] Mammen, E. and B.U. Park (2005), Bandwidth selection for smooth backfitting in additive models, Annals of Statistics, 33, 1260-1294.

[92] Mammen, E. and B.U. Park (2006). A Simple Smooth Backfitting Method for Additive Models. The Annals of Statistics, Vol. 34, No. 5, 2252-2271.

[93] Masry, E. (1996), Multivariate local polynomial regression for time series: Uniform strong
consistency and rates,J. Time Ser. Anal. 17, 571-599.

[94] Masry, E., and D. Tjøstheim (1995): Nonparametric estimation and identification of nonlinear ARCH time series: strong convergence and asymptotic normality, Econometric Theory 11, 258-289.

[95] Matzkin, R. L. (1992), Nonparametric and Distribution-Free Estimation of the Binary


Threshold Crossing and the Binary Choice Models,Econometrica, 60, 239-70

[96] Matzkin, R. L. (1994), Restrictions of Economic Theory in Nonparametric Methods, in


Handbook of Econometrics, vol. iv, ed. by R. F. Engle and D. L. McFadden, 2523-2558, Ams-
terdam: Elsevier.

[97] Matzkin, R. L. (2003), Nonparametric Estimation of Nonadditive Random Functions,


Econometrica, 71, 1339-1375.

[98] Neumeyer, N., and I. Van Keilegom (2010). Estimating the error distribution in nonparametric
multiple regression with applications to model testing Journal of Multivariate Analysis 101,
1067-1078

[99] Neumeyer, N., and S. Sperlich (2006). Comparison of Separable Components in Different Samples. Scandinavian Journal of Statistics.

[100] Newey, W.K. (1994). Kernel estimation of partial means. Econometric Theory. 10, 233-253.

[101] Newey, W. K., J. L. Powell, and F. Vella (1999), Nonparametric Estimation of


Triangular Simultaneous Equations Models,Econometrica, 67, 567-603.

[102] Newey, W. K. and Powell, J. L. (2003): Instrumental variables estimation for nonpara-
metric regression models,Forthcoming in Econometrica.

[103] Nielsen, J.P., O.B. Linton and Bickel, P.J. (1998). On a semiparametric survival model with flexible covariate effect. Annals of Statistics, 26, 215-241.

[104] Nielsen, J.P. and S. Sperlich (2005), Smooth backfitting in practice, Journal of the Royal Statistical Society, Series B, 61, 43-61.

[105] Opsomer, J. D. and D. Ruppert (1997): Fitting a bivariate additive model by local
polynomial regression,Annals of Statistics 25, 186 - 211.

[106] Opsomer, J.-D. (2000), Asymptotic Properties of Backfitting Estimators, Journal of Multivariate Analysis, 73, 166-179.

[107] Pagan, A.R., and G.W. Schwert (1990): Alternative models for conditional stock volatil-
ity,Journal of Econometrics 45, 267-290.

[108] Pagan, A.R., and Y.S. Hong (1991): Nonparametric Estimation and the Risk Premium,
in W. Barnett, J. Powell, and G.E. Tauchen (eds.) Nonparametric and Semiparametric Methods
in Econometrics and Statistics, Cambridge University Press.

[109] Pardinas, J.R., C. Cadarso-Suarez and W. Gonzalez-Manteiga (2005). Testing for interactions
in generalized additive models: Application to SO2 pollution data. Statistics and Computing
15: 289299.

[110] Park, J. and B. Seifert (2010). Local additive estimation. J. R. Statist. Soc. B 72, Part 2, pp.
171191

[111] Pinkse, J., (2001), Nonparametric Regression Estimation Using Weak Separability,
http://www.econometricsociety.org/meetings/wc00/pdf/1241.pdf

[112] Poo, J.R., S. Sperlich, and P. Vieu (2003). Semiparametric Estimation of Separable Models with Possibly Limited Dependent Variables. Econometric Theory, Vol. 19, No. 6, pp. 1008-1039.

[113] Porter, J. (1996). Essays in Econometrics, M.I.T. Ph.D. Dissertation.

[114] Primont, D. and D. Primont, (1994), Homothetic Non-parametric Production Models,


Economics Letters, 45, 191-195.

[115] Qian, J. and L. Wang (2009) Estimating Semiparametric Panel Data Models by Marginal
Integration. Available at http://mpra.ub.uni-muenchen.de/18850/

[116] Robinson, P. M. (1988), Root-N-Consistent Semiparametric Regression,Econometrica, 56,


931954.

[117] Schienle, M. (2008). Nonparametric Regression with Stationary and Nonstationary Variables.
Discussion paper no. 36. University of Mannheim.

[118] Severance-Lossin, E., and S. Sperlich (1995). Estimation of derivatives for additively separable
models. SFB 373 Discussion Paper no. 60.

[119] Severini, T.A., and W.H. Wong (1992), Profile likelihood and conditionally parametric models, Annals of Statistics, 20, 1768-1802.

[120] Sperlich, S., O.B. Linton and W. Härdle (1999), Integration and backfitting methods in additive models: finite sample properties and comparison, Test, 8, 419-458.

[121] Sperlich, S., D. Tjøstheim and L. Yang (2002), Nonparametric estimation and testing of interaction in additive models, Econometric Theory, 18, 197-251.

[122] Stone, C.J., (1980), Optimal rates of convergence for nonparametric estimators, Annals of
Statistics, 8, 1348-1360.

[123] Stone, C.J., (1982), Optimal global rates of convergence for nonparametric regression, Annals
of Statistics, 8, 1040-1053.

[124] Stone, C.J. (1985). Additive regression and other nonparametric models. Annals of Statistics,
13, 685-705.

[125] Stone, C.J., (1986), The dimensionality reduction principle for generalized additive models,
Annals of Statistics, 14, 592-606.

[126] Studer, M., B. Seifert, and T. Gasser (2005). Nonparametric Regression Penalizing Deviations
from Additivity. The Annals of Statistics, Vol. 33, No. 3, pp. 1295-1329.

[127] Tibshirani, R. (1984). Local Likelihood Estimation. PhD Stanford University.

[128] Tjøstheim, D. and B. Auestad (1994), Nonparametric identification of nonlinear time series: projections, Journal of the American Statistical Association, 89, 1398-1409.

[129] Tripathi, G. and W. Kim, (2001), Nonparametric Estimation of Homogeneous Functions,


unpublished manuscript.

[130] Wang, J. and L. Yang (2009). Efficient and fast spline-backfitted kernel smoothing of additive models. Ann Inst Stat Math 61, 663-690.

[131] Wang, J. and L. Yang (2009). Spline-backfitted kernel smoothing of nonlinear additive autoregression model. The Annals of Statistics, Vol. 35, No. 6, 2474-2503.

[132] Wang, Q. and P. C. B. Phillips (2009a). Asymptotic Theory for Local Time Density Estimation
and Nonparametric Cointegrating Regression, Econometric Theory, 25, 710-738.

[133] Wang, Q. and P. C. B. Phillips (2009b). Structural Nonparametric Cointegrating Regression,


Econometrica, 77, 1901-1948.

[134] Yang, L. (2004). Confidence bands for additive regression.

[135] Yang, L., W. Härdle, and J.P. Nielsen (1999): Nonparametric Autoregression with Multiplicative Volatility and Additive Mean, Journal of Time Series Analysis 20, 579-604.

[136] Yang, L., B. Park, L. Xue, and W. Härdle (2006). Estimation and Testing for Varying Coefficients in Additive Models With Marginal Integration. Journal of the American Statistical Association, Vol. 101, No. 475, Theory and Methods.

[137] Yang, L., S. Sperlich, and W. Härdle (2003). Derivative estimation and testing in generalized additive models. Journal of Statistical Planning and Inference 115, 521-542.

[138] Yu, K., B.U. Park, and E. Mammen (2008). Smooth backfitting in generalized additive models. Annals of Statistics 36, 228-260.

[139] Xia, Y., H. Tong, W.K. Li, and Z. Lu (2002): Single-index diusion models and their
estimation,Journal of the Royal Statistical Society, Series B 64, 363-410.

[140] Xiao, Z., O. Linton, R. Carroll, and E. Mammen (2003): More Efficient Local Polynomial Estimation in Nonparametric Regression with Autocorrelated Errors, Journal of the American Statistical Association 98, 980-992.

[141] Chambers, J.M., Cleveland, W.S., Kleiner, B., and P.A. Tukey (1983). Graphical Methods for Data Analysis. Duxbury Press.

[142] Chan, K.C., G. Karolyi, F. Longstaff and A. Sanders (1992). An Empirical Comparison of Alternative Models of the Short-Term Interest Rate. Journal of Finance 47, 1209-1227.

[143] Chen, S.X. (1999). Beta kernel estimators for density functions. Computational Statistics and
Data Analysis 31, 131-145.

[144] Chen, X. and X. Shen (1998): Sieve Extremum Estimates for Weakly Dependent Data,
Econometrica, 66, 289-314.

[145] Cleveland, W.S., (1979): Robust Locally Weighted Regression and Smoothing Scatterplots.
Journal of the American Statistical Association 74, 829-836.

[146] Cohen, A. (1966). All admissible linear estimates of the mean vector. Ann. Math. Statist. 37,
458-463.

[147] Cox, D.R., and D.V. Hinkley (1974): Theoretical Statistics. Chapman and Hall.

[148] Craven, P. and Wahba, G. (1979): Smoothing noisy data with spline functions, Numer.
Math. 31, 377403.

[149] Daniell, P.J., (1946): Discussion of paper by M.S. Bartlett, Journal of the Royal Statistical
Society Supplement 8:27.

[150] Darolles, S., J.P. Florens, and E. Renault (2002): Nonparametric instrumental regression,
Working paper, GREMAQ, Toulouse.

[151] Devroye, L. (1981). On the almost everywhere convergence of nonparametric function estimates.
Annals of Statistics 9, 1310-1319.

[152] Einmahl, U., and D.M. Mason (2000). An empirical process approach to the uniform consistency of kernel-type function estimators. Journal of Theoretical Probability 13, 1-37.

[153] Eubank, R.L., (1988): Smoothing Splines and Nonparametric Regression. Marcel Dekker.

[154] Fix, E. and J.L. Hodges (1951): Discriminatory analysis, nonparametric estimation: consis-
tency properties, Report no 4, Project no 21-49-004, USAF School of Aviation Medicine,
Randolph Field, Texas.

[155] Gasser, T. and H.G. Müller (1984): Estimating regression functions and their derivatives by the kernel method, Scandinavian Journal of Statistics 11, 171-185.

[156] Gasser, T., Müller, H.G., and V. Mammitzsch (1985): Kernels for nonparametric curve estimation, Journal of the Royal Statistical Society Series B 47, 238-252.

[157] Giné, E., and A. Guillou (2002). Rates of strong uniform consistency for multivariate kernel density estimators. Ann. I. H. Poincaré 6, 907-921.

[158] Glad, I.K. (1998): Parametrically guided nonparametric regression,Scandinavian Journal of


Statistics 25, 4.

[159] Gorman, W. M., (1959), Separable Utility and Aggregation,Econometrica, 27, 469-481.

[160] Gyor, L., Hrdle, W., Sarda, P., and P. Vieu (1990): Nonparametric Curve Estimation from
Time Series. Lecture Notes in Statistics, 60. Heidelberg, New York: Springer-Verlag.

[161] Hall, P., (1992): The Bootstrap and Edgeworth Expansion. Springer-Verlag, New-York.

[162] Hall, P. (1993): On Edgeworth Expansion and Bootstrap Confidence Bands in Nonparametric Curve Estimation, Journal of the Royal Statistical Society Series B 55, 291-304.

[163] Hall, P. and I. Johnstone (1992): Empirical functional and efficient smoothing parameter selection (with discussion). Journal of the Royal Statistical Society Series B 54, 475-530.

[164] Härdle, W., Hall, P. and Marron, J.S. (1988): How far are automatically chosen regression smoothing parameters from their optimum? (with discussion). Journal of the American Statistical Association 83, 86-99.

[165] Härdle, W., Hall, P. and Marron, J.S. (1992): Regression smoothing parameters that are not far from their optimum, Journal of the American Statistical Association 87, 227-233.

[166] Härdle, W., and Marron, J.S. (1985): Optimal bandwidth selection in nonparametric regression function estimation, Annals of Statistics 13, 1465-81.

[167] Härdle, W., A.B. Tsybakov, and L. Yang (1996): Nonparametric vector autoregression, Discussion Paper, SFB 373, Humboldt-Universität Berlin.

[168] Hall, P. (1993). On Edgeworth expansion and bootstrap confidence bands in nonparametric curve estimation. Journal of the Royal Statistical Society, Series B 55, 291-304.

[169] Hart, J. and P. Vieu (1990): Data-driven bandwidth choice for density estimation based on dependent data, Annals of Statistics 18, 873-890.

[170] Hart, D. and T.E. Wehrly (1986): Kernel regression estimation using repeated measurements data, Journal of the American Statistical Association 81, 1080-1088.

[171] Johnston, G.J. (1982). Probabilities of maximal deviations for nonparametric regression func-
tion estimates. Journal of Multivariate Analysis 12, 402-414.

[172] Krieger, A.M., and J. Pickands (1981). Weak convergence and efficient density estimation at a point. The Annals of Statistics 9, 1066-1078.

[173] Lu, Z. (2001): Kernel density estimation for time series under generalized conditions: As-
ymptotic normality and applications, Annals of the Institute of Statistical Mathematics 53,
447-468.

[174] Mack, Y. P. (1981): Local properties of k-N N regression estimates, SIAM J. Alg. Disc.
Meth. 2, 31123.

[175] Mack, Y.P. and H.G. Müller (1989): Derivative estimation in nonparametric regression with random predictor variable, Sankhya, Ser. A, 51, 59-72.

[176] Mammen, E., (1992): When does bootstrap work? Asymptotic results and simulations,
Springer Verlag, Berlin.

[177] Marron, J.S. and D. Nolan (1989): Canonical kernels for density estimation, Statistics and
Probability Letters 7, 191-195.

[178] Marron, J.S. and M.P.Wand (1992): Exact Mean Integrated Squared Error. Annals of Sta-
tistics 20, 712-736.

[179] Masry, E. (1996b): Multivariate regression estimation: Local polynomial fitting for time series, Stochastic Processes and their Applications 65, 81-101.

[180] Müller, H.G. (1988): Nonparametric Regression Analysis of Longitudinal Data. Lecture Notes in Statistics, Vol. 46. Heidelberg/New York: Springer-Verlag.

[181] Nadaraya, E.A., (1964): On estimating regression,Theory of Probability and its Applications
10, 186-190.

[182] Newey, W. K. and Powell, J. L. (2003): Instrumental variables estimation for nonparametric
regression models,Forthcoming in Econometrica.

[183] Newey, W. K., J. L. Powell, and F. Vella (1999), Nonparametric Estimation of Triangular
Simultaneous Equations Models,Econometrica, 67, 567-603.

[184] Pagan, A.R., and A. Ullah (1999): Nonparametric Econometrics Cambridge University Press,
Cambridge.

[185] Rice, J. A. (1984): Bandwidth choice for nonparametric regression Annals of Statistics 12,
121530.

[186] Robinson, P.M. (1983): Nonparametric Estimators for Time Series. Journal of Time Series
Analysis 185-208.

[187] Robinson, P. M. (1988): Root-N-Consistent Semiparametric Regression, Econometrica, 56,


931-954.

[188] Robinson, P. M. (1991): Automatic Frequency Domain Inference on Semiparametric and


Nonparametric Models.Econometrica 59, 1329-1364.

[189] Rosenblatt, M., (1956): Remarks on some nonparametric estimates of a density function,
Annals of Mathematical Statistics 27, 642-669.

[190] Ruppert, D., and M.P.Wand (1992): Multivariate Locally Weighted Least Squares Regres-
sion,Rice University, Technical Report no 92-4.

[191] Schuster, E.F., (1972): Joint asymptotic distribution of the estimated regression function at
a nite number of distinct points,Annals of Mathematical Statistics 43, 84-8.

[192] Shibata, R.(1981): An optimal selection of regression variables,Biometrika, 68, 4554.

[193] Silverman, B. W. (1978): Weak and Strong uniform consistency of the kernel estimate of a
density and its derivatives,Annals of Statistics 6, 177-184.

[194] Silverman, B. W. (1984): Spline smoothing: the equivalent variable kernel method. Annals
of Statistics 12, 898916.

[195] Silverman, B. W. (1985): Some aspects of the Spline Smoothing approach to Non-parametric
Regression Curve Fitting,Journal of the Royal Statistical Society Series B 47, 1-52

[196] Silverman, B. W. (1986). Density estimation for statistics and data analysis. London: Chapman
and Hall.

[197] Silverman, B. (1978). Weak and strong consistency of a kernel estimate of a density and its
derivatives. Annals of Statistics 6, 177-184.

[198] Stute, W. (1984): Asymptotic normality of nearest neighbor regression function estimates.
Annals of Statistics 12, 917-926.

[199] Tikhonov, A.N. (1963): Regularization of incorrectly posed problems,Soviet Math., 4, 1624
1627.

[200] Wahba, G. (1990): Spline Models for Observational Data. CBMS-NSF Regional Conference
Series in Applied Mathematics, no 59.

[201] Watson, G.S. (1964): Smooth regression analysis,Sankhya Series A 26, 359-372.

[202] Weinstock, R. (1952): Calculus of Variations with applications to physics and engineering.
Dover Publications inc, New York.

[203] Whittaker, E.T., (1923): On a new method of graduation, Proc. Edinburgh Math.Soc 41,
63-75.
