Ast Part 1
Walter Zucchini
October 2003
Contents

1 Density Estimation
1.1 Introduction
1.1.1 The probability density function
1.1.2 Nonparametric estimation
1.2 Kernel density estimation
1.2.1 Weighting functions
1.2.2 Kernels
1.2.3 Estimating pdfs with bounded support
1.3 Properties of the kernel density estimator
1.3.1 Measures of accuracy
1.3.2 Bias and variance
1.3.3 Optimal bandwidth
1.3.4 Optimal kernels
1.4 Bandwidth selection
1.4.1 Subjective selection
1.4.2 Selection with reference to a standard distribution
1.4.3 Cross-validation
1.4.4 Plug-in estimator
1.4.5 Concluding remarks
Chapter 1
Density Estimation
1.1 Introduction
1.1.1 The probability density function
A parametric approach is to assume that f(x) belongs to a given family of distributions, e.g. the normal, and to estimate the unknown parameters from the data:

$$\hat f(x) = \frac{1}{\sqrt{2\pi\hat\sigma^2}}\, e^{-(x-\hat\mu)^2 / 2\hat\sigma^2},$$

where $\hat\mu = \frac{1}{n}\sum_{i=1}^{n} x_i$ and $\hat\sigma^2 = \frac{1}{n}\sum_{i=1}^{n} (x_i - \hat\mu)^2$.
This approach has advantages as long as the distributional assumption is correct, or at
least is not seriously wrong. It is easy to apply and it yields (relatively) stable estimates.
The main disadvantage of the parametric approach is its lack of flexibility. Each parametric family of distributions imposes restrictions on the shapes that f(x) can have. For example, the density function of the normal distribution is symmetrical and bell-shaped, and is therefore unsuitable for representing skewed or bimodal densities.
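For concreteness, a minimal R sketch of the parametric approach; the sample vector samp is illustrative, not the car data analysed below:

samp <- c(12.6, 15.1, 9.8, 21.4, 17.9, 14.2, 11.3, 19.7, 16.4, 13.8)
mu   <- mean(samp)                    # muhat = (1/n) sum x_i
sig2 <- mean((samp - mu)^2)           # sigmahat^2 = (1/n) sum (x_i - muhat)^2
curve(dnorm(x, mean = mu, sd = sqrt(sig2)), from = 0, to = 30,
      ylab = "Density")               # the fitted normal density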
1.1.2 Nonparametric estimation
The idea of the nonparametric approach is to avoid restrictive assumptions about the form of f(x) and to estimate it directly from the data. A well-known nonparametric estimator of the pdf is the histogram. It has the advantage of simplicity, but it also has disadvantages: it lacks continuity, and in terms of various mathematical measures of accuracy there exist alternative nonparametric estimators that are superior to histograms.
To construct a histogram one needs to select a left bound, or starting point, $x_0$, and the bin width, b. The bins are of the form $[x_0 + (i-1)b,\ x_0 + ib)$, $i = 1, 2, \ldots, m$. The estimator of f(x) is then given by

$$\hat f(x) = \frac{1}{n} \cdot \frac{\text{number of observations in the same bin as } x}{b}$$

More generally one can also use bins of different widths, in which case

$$\hat f(x) = \frac{1}{n} \cdot \frac{\text{number of observations in the same bin as } x}{\text{width of the bin containing } x}$$
The choice of bins, especially the bin widths, has a substantial effect on the shape and other properties of $\hat f(x)$. This is illustrated in the example that follows.
Example 1
We consider a population of 689 new cars of a certain model. Of interest here is the amount (in DM) paid by the customers for optional extras, such as a radio, hubcaps, special upholstery, etc. The top histogram in Figure 1.1 relates to the entire population; the bottom histogram is for a random sample of size 10 from the population.
Figure 1.2 shows three histogram estimates of f(x) for the sample, for different bin widths. Note that the estimates are piecewise constant and that they are strongly influenced by the choice of bin width. The bottom right hand graph is an example of a so-called kernel estimator of f(x). We will be examining such estimators in more detail.
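The histogram estimator is simple to code directly. A minimal R sketch, in which the function name hist.est and the choices of x0 and b are illustrative (samp is the vector defined in the sketch above):

hist.est <- function(x, samp, x0, b) {
  n <- length(samp)
  i <- floor((x - x0) / b)                        # index of the bin containing x
  count <- sum(samp >= x0 + i * b & samp < x0 + (i + 1) * b)
  count / (n * b)                                 # (1/n) * (count in bin) / b
}
hist.est(15, samp, x0 = 0, b = 5)                 # fhat(15) with bins [0,5), [5,10), ...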
1.2 Kernel density estimation
1.2.1 Weighting functions
From the definition of the pdf, f(x), of a random variable, X, one has that

$$P(x - h < X < x + h) = \int_{x-h}^{x+h} f(t)\,dt \approx 2h f(x)$$
Figure 1.1: Histogram for all cars in the population and for a random sample of size n = 10.
Figure 1.2: Histograms with different bin widths and a kernel estimate of f (x) for the same
sample.
and hence

$$f(x) \approx \frac{1}{2h}\, P(x - h < X < x + h). \tag{1.1}$$

The above probability can be estimated by a relative frequency in the sample, hence

$$\hat f(x) = \frac{1}{2h} \cdot \frac{\text{number of observations in } (x-h,\, x+h)}{n}. \tag{1.2}$$

This estimator can be written in the form

$$\hat f(x) = \frac{1}{n} \sum_{i=1}^{n} w(x - x_i, h), \tag{1.3}$$

where w(t, h) = 1/(2h) for |t| < h and zero otherwise, i.e. a rectangular weighting function centred on each observation. It is left to the reader to check that the resulting $\hat f(x)$ is indeed a pdf. Examples of $\hat f(x)$ based on this weighting function and four different values of h are shown in Figure 1.3.
Instead of using rectangles in (1.3) one could use other weighting functions, for example
triangles:
$$w(t, h) = \begin{cases} \dfrac{1}{h}\left(1 - |t|/h\right) & \text{for } |t| < h, \\[1ex] 0 & \text{otherwise.} \end{cases}$$
Again it is left to the reader to check that the resulting $\hat f(x)$ is indeed a pdf. Examples of $\hat f(x)$ based on the triangular weighting function and four different values of h are shown in Figure 1.4. Note that here too larger values of h lead to smoother estimates $\hat f(x)$.
Another alternative weighting function is the Gaussian:
$$w(t, h) = \frac{1}{\sqrt{2\pi}\, h}\, e^{-t^2 / 2h^2}, \qquad -\infty < t < \infty.$$
Figure 1.5 shows $\hat f(x)$ based on this weighting function for different values of h. Again the fluctuations in $\hat f(x)$ decrease with increasing h.
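These estimators are easy to implement directly. A minimal R sketch of (1.3) for the three weighting functions; the function names are illustrative and samp is the sample used in the earlier sketches:

w.rect  <- function(t, h) ifelse(abs(t) < h, 1 / (2 * h), 0)
w.tri   <- function(t, h) ifelse(abs(t) < h, (1 - abs(t) / h) / h, 0)
w.gauss <- function(t, h) dnorm(t, sd = h)           # Gaussian weighting function
fhat <- function(x, w, h) mean(w(x - samp, h))       # (1/n) sum_i w(x - x_i, h)
xx <- seq(0, 30, by = 0.1)
plot(xx, sapply(xx, fhat, w = w.tri, h = 2), type = "l", ylab = "Density")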
Figure 1.3: Estimates of f(x) for different values of h (rectangular kernel; bw = 0.5, 1, 2, 4). The abbreviation bw (short for bandwidth) is used here instead of h.
Figure 1.4: Estimates of f(x) based on the triangular weighting function (bw = 0.5, 1, 2, 4).
Figure 1.5: Estimates of f(x) based on the Gaussian weighting function (bw = 0.5, 1, 2, 4).
1.2.2 Kernels
The above weighting functions, w(t, h), are all of the form

$$w(t, h) = \frac{1}{h} K\!\left(\frac{t}{h}\right), \tag{1.4}$$

where K is called the kernel. Substituting (1.4) in (1.3) gives the kernel density estimator $\hat f(x) = \frac{1}{nh}\sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h}\right)$. Some well-known kernels are listed in Table 1.1.
Kernel         K(t)                                  Efficiency (exact and to 4 d.p.)
Epanechnikov   (3/4)(1 - t^2) for |t| < 1            1
Biweight       (15/16)(1 - t^2)^2 for |t| < 1        (3087/3125)^{1/2} = 0.9939
Triangular     1 - |t| for |t| < 1                   (243/250)^{1/2} = 0.9859
Gaussian       (2 pi)^{-1/2} e^{-t^2/2}              (36 pi/125)^{1/2} = 0.9512
Rectangular    1/2 for |t| < 1                       (108/125)^{1/2} = 0.9295

Table 1.1: Some well-known kernels and their efficiencies (to be discussed in Section 1.3.4).
In general any function having the following properties can be used as a kernel:
$$\text{(a)} \int K(z)\,dz = 1 \qquad \text{(b)} \int zK(z)\,dz = 0 \qquad \text{(c)} \int z^2 K(z)\,dz =: k_2 < \infty \tag{1.5}$$
It follows that any symmetric pdf is a kernel. However, non-pdf kernels can also be used, e.g. kernels for which K(z) < 0 for some values of z. Such kernels have the disadvantage that $\hat f(x)$ may be negative for some values of x.
Kernel estimation of pdfs is characterized by the kernel, K, which determines the shape of the weighting function, and the bandwidth, h, which determines the width of the weighting function and hence the amount of smoothing. These two components determine the properties of $\hat f(x)$. Considerable research has been carried out (and continues to be carried out) on the question of how one should select K and h in order to optimize the properties of $\hat f(x)$. This issue will be discussed in the sections that follow.
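In R both components can be set directly in the built-in density() function: the kernel argument selects K and bw sets h. For example (samp as above):

d1 <- density(samp, bw = 2, kernel = "epanechnikov")
d2 <- density(samp, bw = 2, kernel = "rectangular")
plot(d1, main = "Epanechnikov vs. rectangular, bw = 2")
lines(d2, lty = 2)                      # dashed: rectangular kernel

Note that density() scales all kernels so that bw is the standard deviation of the smoothing kernel, which makes bandwidths comparable across kernels.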
1.2.3 Estimating pdfs with bounded support
In many situations the values that a random variable, X, can take on are restricted, for example to the interval [0, ∞), that is f(x) = 0 for x < 0. We say that the support of f(x) is [0, ∞). Similarly, if X can only take on values in the interval (a, b) then f(x) = 0 for x ∉ (a, b); the support of f(x) is (a, b).
In such situations it is clearly desirable that the estimator $\hat f(x)$ has the same support as f(x). Direct application of kernel smoothing methods does not guarantee this property and so they need to be modified when f(x) has bounded support. The simplest method of solving this problem is to use a transformation. The idea is to estimate the pdf of a transformed random variable Y = t(X) which has unbounded support. Suppose that the
pdf of Y is given by g(y). Then the relationship between f and g is given by
$$f(x) = g(t(x))\,t'(x). \tag{1.6}$$

One estimates g(y) from the transformed observations $y_i = t(x_i)$ and sets

$$\hat f(x) = \hat g(t(x))\,t'(x). \tag{1.7}$$

Example 2
Suppose that the support of f(x) is [0, ∞). A convenient transformation is then Y = log(X), for which t'(x) = 1/x, so that $\hat f(x) = \hat g(\log x)/x$. The resulting estimator has support [0, ∞). Figure 1.6 provides an illustration of this case for the sample considered in Example 1.
(a) The graph on the top left gives the estimated density $\hat f(x)$ obtained without restrictions on the support. Note that $\hat f(x) > 0$ for some x < 0.
(b) The graph on the top right shows a modified version of the $\hat f(x)$ obtained in (a), namely

$$\hat f_c(x) = \begin{cases} \dfrac{\hat f(x)}{\int_0^{\infty} \hat f(x)\,dx} & \text{for } x \ge 0, \\[1ex] 0 & \text{for } x < 0. \end{cases} \tag{1.8}$$

Here $\hat f_c(x)$ is set equal to zero for x < 0 and $\hat f(x)$ is rescaled so that the area under the estimated density equals one.
(c) The graph on the bottom left shows a kernel estimator of g(y), that is the density of Y = log(X).
(d) The graph on the bottom right shows the transformed estimator $\hat f(x)$ obtained via $\hat g(y)$.
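A minimal R sketch of this log-transformation device, using density() for the estimate of g and then applying (1.7); samp is the illustrative sample used earlier, and the grid is taken from the density object itself:

g <- density(log(samp))                          # ghat(y) for y = log(x)
x <- exp(g$x)                                    # back-transform the grid
plot(x, g$y / x, type = "l", ylab = "Density")   # fhat(x) = ghat(log x) / x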
Example 3
Suppose that the support of f(x) is (a, b). Then a simple transformation t : (a, b) → (−∞, ∞) is $t(x) = \log\left(\frac{x-a}{b-x}\right)$. Here $t'(x) = \frac{1}{x-a} + \frac{1}{b-x}$ and so

$$\hat f(x) = \hat g\!\left(\log\frac{x-a}{b-x}\right)\left(\frac{1}{x-a} + \frac{1}{b-x}\right). \tag{1.9}$$
Figure 1.6: Application of the log transformation for $\hat f(x)$ with support [0, ∞): the unrestricted estimate (Gaussian kernel, default bw = 2.34), the cut-off and rescaled estimate (1.8), the estimate of the density of log(X), and the transformed estimator.
Figure 1.7: Application of the logit transformation for $\hat f(x)$ with support (0, 25).
Figure 1.8: Application of the normit transformation for f(x) with support (0,25).
Figure 1.7 provides an illustration of this case with a = 0 and b = 25. The four graphs shown are analogous to those in Example 2 but with

$$\hat f_c(x) = \begin{cases} \dfrac{\hat f(x)}{\int_a^b \hat f(x)\,dx} & \text{for } a < x < b, \\[1ex] 0 & \text{otherwise.} \end{cases} \tag{1.10}$$
Example 4
An alternative transformation t : (a, b) → (−∞, ∞) can be based on the inverse of a continuous probability distribution function, such as the normal, e.g. $t(x) = \Phi^{-1}\!\left(\frac{x-a}{b-a}\right)$, where Φ is the distribution function of the standard normal (the normit transformation). The estimator is then

$$\hat f(x) = \begin{cases} \hat g\!\left(\Phi^{-1}\!\left(\dfrac{x-a}{b-a}\right)\right) \dfrac{1}{(b-a)\,\phi\!\left(\Phi^{-1}\!\left(\frac{x-a}{b-a}\right)\right)} & \text{for } a < x < b, \\[1ex] 0 & \text{otherwise,} \end{cases} \tag{1.11}$$

where φ(x) is the pdf of a standard normal distribution.
The application of this transformation is illustrated in Figure 1.8 in which the four figures
are analogous to those in the previous two examples.
The above three examples illustrate that the transformation procedure can lead to a considerable change in the appearance of the estimate $\hat f(x)$. By applying kernel smoothing to the transformed values one is, in effect, applying a different kernel at each point in the estimation of f(x).
1.3 Properties of the kernel density estimator
1.3.1 Measures of accuracy
There are various ways to quantify the accuracy of a density estimator. We will focus here on the mean squared error (MSE) and its two components, namely bias and standard error (or variance). We note that the MSE of $\hat f(x)$ is a function of the argument x:

$$\begin{aligned} MSE(\hat f(x)) &= E(\hat f(x) - f(x))^2 \\ &= (E\hat f(x) - f(x))^2 + E(\hat f(x) - E\hat f(x))^2 \\ &= \text{Bias}^2(\hat f(x)) + Var(\hat f(x)) \end{aligned} \tag{1.12}$$

A measure of the global accuracy of $\hat f(x)$ is the mean integrated squared error (MISE):

$$MISE(\hat f) = E\int (\hat f(x) - f(x))^2\,dx \tag{1.13}$$
$$= \int MSE(\hat f(x))\,dx = \int \text{Bias}^2(\hat f(x))\,dx + \int Var(\hat f(x))\,dx$$
1.3.2 Bias and variance
The expected value of $\hat f(x)$ is

$$E(\hat f(x)) = \frac{1}{n}\sum_{i=1}^{n} \frac{1}{h}\, E\,K\!\left(\frac{x - x_i}{h}\right) = \frac{1}{n}\sum_{i=1}^{n} \frac{1}{h}\int K\!\left(\frac{x - t}{h}\right) f(t)\,dt = \frac{1}{h}\int K\!\left(\frac{x - t}{h}\right) f(t)\,dt \tag{1.14}$$

Substituting $z = \frac{x - t}{h}$, i.e. $t = x - hz$, $\left|\frac{dz}{dt}\right| = \frac{1}{h}$, yields

$$E(\hat f(x)) = \int K(z)\, f(x - hz)\,dz$$

and a Taylor expansion of f(x − hz) about x gives

$$E(\hat f(x)) = \int K(z) f(x)\,dz - \int K(z)\, hz\, f'(x)\,dz + \int K(z)\, \frac{(hz)^2}{2}\, f''(x)\,dz + o(h^2) \tag{1.15}$$
$$= f(x)\int K(z)\,dz - h f'(x)\int zK(z)\,dz + \frac{h^2}{2} f''(x)\int z^2 K(z)\,dz + o(h^2)$$
$$= f(x) + \frac{h^2}{2}\, k_2\, f''(x) + o(h^2), \tag{1.16}$$

using the kernel properties (1.5). Thus

$$\text{Bias}(\hat f(x)) \approx \frac{h^2}{2}\, k_2\, f''(x) \tag{1.17}$$

and $\text{Bias}(\hat f(x)) \to 0$ as $h \to 0$. The bias depends on $k_2$, the variance of the kernel, and on $f''(x)$, the curvature of the density at the point x.
The variance of $\hat f(x)$ is given by

$$Var(\hat f(x)) = Var\!\left(\frac{1}{nh}\sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h}\right)\right) = \frac{1}{n^2 h^2}\sum_{i=1}^{n} Var\, K\!\left(\frac{x - x_i}{h}\right)$$

Now

$$Var\, K\!\left(\frac{x - x_i}{h}\right) = E\,K^2\!\left(\frac{x - x_i}{h}\right) - \left(E\,K\!\left(\frac{x - x_i}{h}\right)\right)^2 = \int K^2\!\left(\frac{x - t}{h}\right) f(t)\,dt - \left(\int K\!\left(\frac{x - t}{h}\right) f(t)\,dt\right)^2$$

and, since $\frac{1}{h}\int K\!\left(\frac{x - t}{h}\right) f(t)\,dt = f(x) + \text{Bias}(\hat f(x))$,

$$Var(\hat f(x)) = \frac{1}{n}\int \frac{1}{h^2}\, K^2\!\left(\frac{x - t}{h}\right) f(t)\,dt - \frac{1}{n}\left(f(x) + \text{Bias}(\hat f(x))\right)^2$$

Substituting $z = \frac{x - t}{h}$ one obtains

$$Var(\hat f(x)) = \frac{1}{nh}\int K(z)^2 f(x - hz)\,dz - \frac{1}{n}\left(f(x) + o(h^2)\right)^2$$

and hence, for large n and small h,

$$Var(\hat f(x)) \approx \frac{f(x)\int K^2(z)\,dz}{nh} \tag{1.18}$$
We note that the variance decreases as h increases.
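Approximations (1.17) and (1.18) are easy to check by simulation. A small R sketch, assuming a standard normal f and a Gaussian kernel, so that $k_2 = 1$, $j_2 = 1/(2\sqrt{\pi})$ and $f''(0) = -\phi(0)$ (the values of x, h and n are illustrative):

set.seed(1)
h <- 0.3; n <- 500; x <- 0
fhat0 <- replicate(2000, {
  samp <- rnorm(n)
  mean(dnorm((x - samp) / h)) / h        # fhat(0), Gaussian kernel
})
c(bias.sim = mean(fhat0) - dnorm(0),
  bias.apx = -h^2 / 2 * dnorm(0),        # (1.17): (h^2/2) k2 f''(0)
  var.sim  = var(fhat0),
  var.apx  = dnorm(0) / (2 * sqrt(pi) * n * h))   # (1.18)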
The above approximations for the bias and variance of $\hat f(x)$ lead to

$$MSE(\hat f(x)) = \text{Bias}^2(\hat f(x)) + Var(\hat f(x)) \approx \frac{1}{4} h^4 k_2^2\, f''(x)^2 + \frac{1}{nh}\, f(x)\, j_2, \tag{1.19}$$

where $k_2 := \int z^2 K(z)\,dz$ and $j_2 := \int K(z)^2\,dz$. Integrating (1.19) over x gives the corresponding approximation to the MISE:

$$MISE(\hat f) \approx \frac{1}{4} h^4 k_2^2\, \beta(f) + \frac{1}{nh}\, j_2, \qquad \text{where } \beta(f) := \int f''(x)^2\,dx. \tag{1.20}$$
1.3.3 Optimal bandwidth
Expression (1.20) is the measure that we use to quantify the performance of the estimator. We can find the optimal bandwidth by minimizing (1.20) with respect to h. The first derivative is given by

$$\frac{d\, MISE(\hat f)}{dh} = h^3 k_2^2\, \beta(f) - \frac{1}{nh^2}\, j_2.$$

Setting this equal to zero yields the optimal bandwidth, $h_{opt}$, for the given pdf and kernel:

$$h_{opt} = \left(\frac{1}{n} \cdot \frac{\gamma(K)}{\beta(f)}\right)^{1/5}, \tag{1.21}$$
where $\gamma(K) := j_2\, k_2^{-2}$. Substituting (1.21) for h in (1.20) gives the minimal MISE for the given pdf and kernel. After some manipulation this can be shown to be

$$MISE_{opt}(\hat f) = \frac{5}{4}\left(\frac{\beta(f)\, j_2^4\, k_2^2}{n^4}\right)^{1/5}. \tag{1.22}$$
We note that $h_{opt}$ depends on the sample size, n, and the kernel, K. However, it also depends on the unknown pdf, f, through the functional β(f). Thus, as it stands, expression (1.21) is not applicable in practice. However, the plug-in estimator of $h_{opt}$, to be discussed later, is simply expression (1.21) with β(f) replaced by an estimator.
1.3.4 Optimal kernels
The MISE($\hat f$) can also be minimized with respect to the kernel used. It can be shown (see, e.g., Wand and Jones, 1995) that the Epanechnikov kernel is optimal in this respect:

$$K(z) = \begin{cases} \dfrac{3}{4\sqrt{5}}\left(1 - \tfrac{1}{5} z^2\right) & \text{for } |z| < \sqrt{5}, \\[1ex] 0 & \text{otherwise.} \end{cases}$$
This result together with (1.22) enables one to examine the impact of kernel choice on MISE$_{opt}(\hat f)$. The efficiency of a kernel, K, relative to the optimal Epanechnikov kernel, $K_{EP}$, is defined as

$$\text{Eff}(K) = \left(\frac{MISE_{opt}(\hat f)\ \text{using}\ K_{EP}}{MISE_{opt}(\hat f)\ \text{using}\ K}\right)^{5/4} = \left(\frac{k_2^2\, j_2^4\ \text{using}\ K_{EP}}{k_2^2\, j_2^4\ \text{using}\ K}\right)^{1/4} \tag{1.23}$$
The efficiencies of a number of well-known kernels are given in Table 1.1. It is clear that the choice of kernel has a rather limited impact on the efficiency.
The rectangular kernel, for example, has an efficiency of approximately 93%. This can be interpreted as follows: the MISE$_{opt}(\hat f)$ obtained using an Epanechnikov kernel with n = 93 is approximately equal to the MISE$_{opt}(\hat f)$ obtained using a rectangular kernel with n = 100.
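Definition (1.23) is easy to check numerically. A sketch in R, computing $k_2$ and $j_2$ by numerical integration (the function eff.const and the kernel definitions are illustrative, with the Epanechnikov kernel in the rescaled form given above):

eff.const <- function(K, lo, hi) {          # (k2^2 * j2^4)^(1/4) for kernel K
  k2 <- integrate(function(z) z^2 * K(z), lo, hi)$value
  j2 <- integrate(function(z) K(z)^2, lo, hi)$value
  (k2^2 * j2^4)^(1/4)
}
K.ep  <- function(z) 3 / (4 * sqrt(5)) * (1 - z^2 / 5) * (abs(z) < sqrt(5))
K.rec <- function(z) 0.5 * (abs(z) < 1)
eff.const(K.ep, -sqrt(5), sqrt(5)) / eff.const(dnorm, -Inf, Inf)  # approx 0.9512
eff.const(K.ep, -sqrt(5), sqrt(5)) / eff.const(K.rec, -1, 1)      # approx 0.9295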
Figure 1.9: The pdf for the car example and a histogram for a sample of size 20.
1.4 Bandwidth selection
1.4.1 Subjective selection
One can experiment by using different bandwidths and simply select one that looks right for the type of data under investigation. Figure 1.9 shows the pdf (for the car example) and a histogram for a random sample of size n = 20. Figure 1.10 shows kernel density estimates (based on a Gaussian kernel) of f(x) using four different bandwidths.
Also shown is the density of the population. The latter is usually unknown in practice (otherwise we wouldn't need to estimate it using a sample). Clearly h = 0.5 is too small, and h = 3 is too large. Appropriate here is a bandwidth greater than 1 but less than 3.
1.4.2 Selection with reference to a standard distribution
Here one selects the bandwidth that would be optimal for a particular pdf; convenient here is the normal. We note that one is not assuming that f(x) is normal; rather one is selecting the h which would be optimal if the pdf were normal. In this case it can be shown that

$$\beta(f) = \int f''(x)^2\,dx = \frac{3}{8\sqrt{\pi}\,\sigma^5}$$

and using a Gaussian kernel leads to

$$h_{opt} = \left(\frac{4}{3n}\right)^{1/5}\sigma \approx \frac{1.06\,\sigma}{n^{1/5}}. \tag{1.24}$$

Figure 1.10: The pdf for the car example and kernel density estimates using a Gaussian kernel and different bandwidths (h = 0.5, 1, 2, 3).
The normal distribution is not a wiggly distribution; it is unimodal and bell-shaped. It is therefore to be expected that (1.24) will be too large for multimodal distributions. Secondly, to apply (1.24) one has to estimate σ. The usual estimator, the sample standard deviation, is not robust; it overestimates σ if some outliers (extreme observations) are present and thereby increases $\hat h_{opt}$ even more. To overcome these problems Silverman (1986) proposed the following estimator:

$$\hat h_{opt} = \frac{0.9\,\hat\sigma}{n^{1/5}}, \tag{1.25}$$

where $\hat\sigma = \min(s,\, R/1.34)$, $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar x)^2$ and R is the interquartile range of the data. The constant 1.34 derives from the fact that the interquartile range of a $N(\mu, \sigma^2)$ random variable, X, is 1.34σ, i.e. $P\{|X - \mu| < 0.67\sigma\} = 0.5$.
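Rule (1.25) is the default bandwidth in R: the built-in function bw.nrd0() implements exactly this formula, so a hand computation should agree with it (samp as in the earlier sketches):

sig <- min(sd(samp), IQR(samp) / 1.34)     # robust scale estimate
0.9 * sig * length(samp)^(-1/5)            # hhat_opt from (1.25)
bw.nrd0(samp)                              # R's built-in version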
Figure 1.11: The cross-validation criterion (top left) and the estimated pdf using three different bandwidth selectors, namely cross-validation (bottom left, h_cv = 0.78), normal-based (top right, h_norm = 2.9) and plug-in (bottom right, h_sj = 1.41).
The expression (1.25) is used as the default option in the R function density. It is also used as a starting value in some more sophisticated iterative estimators of the optimal bandwidth. The top right graph in Figure 1.11 shows the estimated density obtained with this method of estimating $h_{opt}$.
1.4.3 Cross-validation
The technique of cross-validation will be discussed in more detail in the chapter on model selection. At this point we will only outline its application to the problem of estimating optimal bandwidths. By definition

$$MISE(\hat f) = E\int (\hat f(x) - f(x))^2\,dx = E\left[\int \hat f(x)^2\,dx - 2\int \hat f(x) f(x)\,dx + \int f(x)^2\,dx\right]$$
The third term does not depend on the sample or on the bandwidth. An approximately unbiased estimator of the sum of the first two terms is given by

$$\widehat M_{CV}(\hat f) = \frac{1}{n}\sum_{i=1}^{n} \int \hat f_{-i}(x)^2\,dx \; - \; \frac{2}{n}\sum_{i=1}^{n} \hat f_{-i}(x_i), \tag{1.26}$$

where $\hat f_{-i}(x)$ is the estimated density at the argument x using the original sample apart from observation $x_i$. One computes $\widehat M_{CV}(\hat f)$ for different values of h and estimates the optimal value, $h_{opt}$, by the h which minimizes $\widehat M_{CV}(\hat f)$. The top left hand graph in Figure 1.11 shows the curve $\widehat M_{CV}(\hat f)$ for the sample of car data. The bottom left hand graph shows the corresponding estimated density. In this example cross-validation has selected a bandwidth that is too small and the resulting estimate of f has not been sufficiently smoothed.
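A direct, if slow, R sketch of criterion (1.26) for a Gaussian kernel; a closed-form version of this idea is implemented in R's built-in bw.ucv() (samp as above, and the grid of h values is illustrative):

mcv <- function(h, samp) {
  n  <- length(samp)
  fi <- function(x, i) mean(dnorm((x - samp[-i]) / h)) / h   # fhat_{-i}(x)
  t1 <- mean(sapply(1:n, function(i)
          integrate(Vectorize(function(x) fi(x, i)^2), -Inf, Inf)$value))
  t2 <- 2 / n * sum(sapply(1:n, function(i) fi(samp[i], i)))
  t1 - t2
}
hs <- seq(0.5, 4, by = 0.25)
hs[which.min(sapply(hs, mcv, samp = samp))]   # h minimizing the criterion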
1.4.4 Plug-in estimator
The idea, developed by Sheather and Jones (1991), is to estimate h from (1.21) by applying a separate smoothing technique to estimate f''(x) and hence β(f). For details see, e.g., Wand and Jones (1995), Section 3.6. An R function to carry out the computations is available in the R library sm of Bowman and Azzalini (1997).
Figure 1.11 shows that for the car example considered here the plug-in estimator yields the most sensible estimate of f.
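Base R also provides the Sheather-Jones bandwidth directly as bw.SJ(), so the three selectors of Figure 1.11 can be compared in one line each (samp as above):

bw.nrd0(samp)    # normal-based (Silverman)
bw.ucv(samp)     # (unbiased) cross-validation
bw.SJ(samp)      # Sheather-Jones plug-in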
1.4.5 Concluding remarks
The above bandwidth selectors represent only a sample of the many suggestions that have been offered in the recent literature. Some alternatives are described in Wand and Jones (1995), in which the theory is given in more detail. These authors also offer recommendations regarding which estimators should be used; the plug-in estimator outlined above is one of their recommendations.
The above methodology can be extended to bivariate (and multivariate) pdfs. It is also possible to provide approximate confidence bands for the estimated pdfs. These subjects will be covered in the presentations by the participants of the course.