On Optimal and Data-Based Histograms: Department of Mathematical Sciences, Rice University, Houston, Texas
Printed in Great Britain
SUMMARY
In this paper we derive the formula for the optimal histogram bin width that asymptotically
minimizes the integrated mean squared error. Monte Carlo methods are used to verify
the usefulness of this formula for small samples. A data-based procedure for choosing the bin
width parameter is proposed, which assumes a Gaussian reference standard and requires only
the sample size and an estimate of the standard deviation. The sensitivity of the procedure is
investigated using several probability models which violate the Gaussian assumption.
1. INTRODUCTION
The histogram is the classical nonparametric density estimator, probably dating from the
mortality studies of John Graunt in 1662 (Westergaard, 1968, p. 22). Today the histogram
remains an important statistical tool for displaying and summarizing data. In addition it
provides a consistent estimate of the true underlying probability density function. Present
guidelines for constructing histograms do not directly address the issues of estimation bias
and variance. Rather, they draw heavily on the investigator's intuition and past experience.
In this paper we propose new guidelines that reduce the subjectivity involved in histogram
construction by considering a mean squared error criterion.
2. BACKGROUND
We consider only histograms defined on an equally spaced mesh $\{t_{ni}: -\infty < i < \infty\}$ with bin
width $h_n = t_{n(i+1)} - t_{ni}$, where $n$ denotes the sample size and emphasizes the dependence of
the mesh and bin width on the sample size. For a fixed point $x$, the mean squared error of a
histogram estimate, $\hat{f}(x)$, of the true density value, $f(x)$, is defined by
$$\mathrm{MSE}(x) = E\{\hat{f}(x) - f(x)\}^2.$$
For a random sample of size $n$ from $f$, Cencov (1962) proved that $\mathrm{MSE}(x)$ asymptotically
converges to zero at a rate proportional to $n^{-2/3}$, that is, $\mathrm{MSE}(x) = O(n^{-2/3})$. This rate is fairly
close to the Cramér-Rao lower bound of $O(n^{-1})$. The integrated mean squared error represents
a global error measure of a histogram estimate and is defined by
$$\mathrm{IMSE} = \int E\{\hat{f}(x) - f(x)\}^2\,dx.$$
Since it is the shape of the density that is of most interest, the IMSE is more relevant than the
mean squared error of the density height. The IMSE of a histogram also converges to zero
as $O(n^{-2/3})$.
To achieve these rates of convergence requires proper choice of the two parameters of the
histogram, the bin width hn and the relative position of the mesh. The latter is determined by
any particular mesh point, say $t_{n0}$. Statistical texts suggest various methods for choosing
these two parameters. First the bin width is determined indirectly by choosing an appropriate
number of bins over the sample range. Most authors advise that 6-20 bins are usually
adequate for real data sets (Haber & Runyon, 1969, p. 33; Guttman & Wilks, 1965, p. 59).
Larson (1975, p. 15) suggests using $1 + 2.2\log_{10} n$ bins as a first choice, similar to a formula
proposed by Sturges in 1926. The final choice for $h_n$ is a convenient whole number or fraction,
often related to the accuracy with which the data are measured. Next, $t_{n0}$ is picked so that
the data do not fall on the bin boundaries. If we assume that the data are measured to
infinite accuracy, then the choice of $t_{n0}$ becomes less important as the sample size increases.
Since we are focusing on consistency, we shall assume $t_{n0} = 0$ in the sequel. However, the
choice of $h_n$ is quite important. If $h_n$ is too small, then the histogram will be too rough; on
the other hand, if $h_n$ is too large, then the histogram will be too smooth, equivalent statistically
to large variance and large bias, respectively. The proper choice for $h_n$ should balance
the bias and variance by minimizing, for example, the integrated mean squared error.
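The bin-count rules quoted above can be compared numerically. The following is a minimal sketch, not part of the original paper: the function names are hypothetical, and the form $1 + \log_2 n$ used for Sturges' rule is the commonly quoted one and is an assumption here, since the text only names the rule.

```python
import math

def larson_bins(n: int) -> float:
    """Larson's (1975) first-choice bin count, 1 + 2.2 * log10(n)."""
    return 1 + 2.2 * math.log10(n)

def sturges_bins(n: int) -> float:
    """Sturges' rule in its commonly quoted form, 1 + log2(n) (assumed form)."""
    return 1 + math.log2(n)

def width_from_bin_count(sample_range: float, bins: float) -> float:
    """Bin width implied by spreading a whole number of bins over the sample range."""
    return sample_range / math.ceil(bins)

if __name__ == "__main__":
    for n in (25, 100, 1000, 100000):
        print(n, round(larson_bins(n), 1), round(sturges_bins(n), 1))
```

Both rules grow only logarithmically in the sample size, so the implied bin width shrinks very slowly as more data arrive.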
Let $\nu_n(x)$ be the number of values falling in $I_n(x)$. Then $\nu_n(x)$ has a binomial distribution
$B\{n, p_n(x)\}$. The histogram estimate is given by the random variable
$$\hat{f}(x) = \nu_n(x)/(nh_n),$$
with expectation
$$E\{\hat{f}(x)\} = p_n(x)/h_n.$$
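The histogram estimate defined above, a bin count divided by the sample size times the bin width, can be sketched as follows. This is an illustrative implementation rather than the author's code; it assumes half-open bins of the form [t0 + i*h, t0 + (i+1)*h) with the mesh point `t0` defaulting to 0, and the function name is hypothetical.

```python
import numpy as np

def histogram_estimate(x, sample, h, t0=0.0):
    """Evaluate the histogram density estimate at the points x.

    The value at x is (number of sample values in the bin containing x) / (n * h),
    on the equally spaced mesh t0 + i*h.
    """
    sample = np.asarray(sample, dtype=float)
    x = np.atleast_1d(np.asarray(x, dtype=float))
    n = sample.size
    # Integer index of the bin containing each sample value and each evaluation point
    sample_bins = np.floor((sample - t0) / h).astype(int)
    x_bins = np.floor((x - t0) / h).astype(int)
    # Count the sample values that share a bin with each evaluation point
    counts = np.array([(sample_bins == b).sum() for b in x_bins])
    return counts / (n * h)
```

For the four values 0.1, 0.2, 0.7, 1.5 and a bin width of 0.5, the estimate is 1.0 on [0, 0.5), 0.5 on [0.5, 1), and 0 on [1, 1.5).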
Combining, we have that
$$\mathrm{MSE}(x) = \frac{f(x)}{nh_n} + \tfrac{1}{4}h_n^2 f'(x)^2 + f'(x)^2\{x - t_n(x)\}^2 - h_n f'(x)^2\{x - t_n(x)\} + O(1/n + h_n^3). \qquad (1)$$
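Integration of equation (1) over the real line implies that
$$\mathrm{IMSE} = \frac{1}{nh_n} + \tfrac{1}{4}h_n^2\int f'(x)^2\,dx + \int f'(x)^2\{x - t_n(x)\}^2\,dx - h_n\int f'(x)^2\{x - t_n(x)\}\,dx + O(1/n + h_n^3). \qquad (2)$$
The third term in (2) satisfies
$$\int_{-\infty}^{\infty} f'(x)^2\{x - t_n(x)\}^2\,dx = \sum_{i=-\infty}^{\infty}\int_{t_{ni}}^{t_{n(i+1)}} f'(x)^2 (x - t_{ni})^2\,dx = \sum_{i=-\infty}^{\infty}\int_0^{h_n} f'(y + t_{ni})^2\, y^2\,dy \approx \frac{1}{h_n}\int_0^{h_n} y^2\,dy \int_{-\infty}^{\infty} f'(x)^2\,dx = \tfrac{1}{3}h_n^2\int_{-\infty}^{\infty} f'(x)^2\,dx$$
by standard numerical integration approximations. A similar analysis for the fourth term
in (2) yields
$$-h_n\int_{-\infty}^{\infty} f'(x)^2\{x - t_n(x)\}\,dx \approx -\tfrac{1}{2}h_n^2\int_{-\infty}^{\infty} f'(x)^2\,dx.$$
Therefore
$$\mathrm{IMSE} = \frac{1}{nh_n} + \tfrac{1}{12}h_n^2\int_{-\infty}^{\infty} f'(x)^2\,dx + O(1/n + h_n^3).$$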
for each of 1000 generated samples and then averaged over the number of repetitions to
obtain an estimate of the IMSE. The optimal bin widths predicted by equation (5) were quite
close to the empirically observed optimal bin widths for the Monte Carlo study even for
samples as small as 25. The estimated IMSE also increased as $(c^3 + 2)/(3c)$ for bin widths
differing from the empirically optimal bin width by the factor c.
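A Monte Carlo estimate of the IMSE of the kind described above can be sketched as follows. This is a hedged illustration, not the author's experimental code: it assumes a standard Gaussian density, a mesh with a mesh point at 0, a Riemann sum on a grid truncated at plus or minus 8, and hypothetical function names and default settings (for example `reps=200`, whereas the study used 1000 samples).

```python
import numpy as np

def imse_inflation(c):
    """Relative IMSE (c**3 + 2) / (3 c) for a bin width off by the factor c."""
    return (c**3 + 2.0) / (3.0 * c)

def monte_carlo_imse(h, n, reps=200, seed=0, half_width=8.0, grid_points=4001):
    """Average integrated squared error of N(0,1) histograms with bin width h."""
    rng = np.random.default_rng(seed)
    # Integration grid and the true standard Gaussian density on it
    x = np.linspace(-half_width, half_width, grid_points)
    dx = x[1] - x[0]
    f = np.exp(-0.5 * x**2) / np.sqrt(2.0 * np.pi)
    # Equally spaced mesh i*h covering the grid, with a mesh point at 0
    m = int(np.ceil(half_width / h)) + 1
    edges = h * np.arange(-m, m + 1)
    # Bin index of every grid point, shifted to start at 0
    idx = np.floor(x / h).astype(int) + m
    total = 0.0
    for _ in range(reps):
        sample = rng.standard_normal(n)
        counts, _ = np.histogram(sample, bins=edges)
        f_hat = counts[idx] / (n * h)            # histogram estimate on the grid
        total += np.sum((f_hat - f) ** 2) * dx   # integrated squared error
    return total / reps
```

Evaluating `monte_carlo_imse` over a range of bin widths for fixed `n` gives an empirical curve whose minimum can be compared with a predicted optimum, and `imse_inflation(c)` gives the asymptotic penalty quoted above, for example a factor of 5/3 when the bin width is doubled.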
5. DATA-BASED HISTOGRAMS
The optimal choice for $h_n$ requires knowledge of the true underlying density $f$. This knowledge
is rare. In another context Tukey (1977, p. 623) has suggested using the Gaussian
density as a reference standard, to be used cautiously but frequently. Therefore, we propose
the data-based choice for the bin width
$$h_n = 3.49\, s\, n^{-1/3}, \qquad (6)$$
where $s$ is an estimate of the standard deviation. Although the Gaussian density forms the
[Fig. 1 panel axes: skewness coefficient; kurtosis coefficient; distance between modes.]
Fig. 1. Ratio of theoretical bin width for several non-Gaussian probability densities to the
theoretical bin width for a Gaussian density with the same variance.
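The data-based rule in equation (6) needs only the sample size and an estimate of the standard deviation, so it is straightforward to compute. The sketch below is illustrative: the function names are hypothetical, and the use of the sample standard deviation with divisor n - 1 (`ddof=1`) is an assumption, since the text asks only for "an estimate of the standard deviation".

```python
import numpy as np

def bin_width_from_summary(std_estimate, n):
    """Data-based bin width 3.49 * s * n**(-1/3) from summary statistics."""
    return 3.49 * std_estimate * n ** (-1.0 / 3.0)

def data_based_bin_width(sample):
    """Data-based bin width computed directly from a one-dimensional sample."""
    sample = np.asarray(sample, dtype=float)
    # ddof=1 (divisor n - 1) is an assumed choice of standard deviation estimate
    return bin_width_from_summary(sample.std(ddof=1), sample.size)

def mesh_edges(sample, h):
    """Bin edges i*h (mesh point at 0) covering the whole sample."""
    sample = np.asarray(sample, dtype=float)
    first = np.floor(sample.min() / h)
    last = np.floor(sample.max() / h) + 1
    return h * np.arange(first, last + 1)
```

The edges returned by `mesh_edges` can be passed to any histogram routine that accepts explicit bin boundaries.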
6. EXAMPLES
In Fig. 2 we display three histograms of a Monte Carlo N(0,1) sample of size 1000 which
has a sample standard deviation equal to 1.011 with $h$ = 0.176, 0.353 and 0.706, the second
choice obtained from (6). Many statisticians prefer a smaller bin width and a rougher histogram,
leaving the final smoothing to be done by eye.
[Fig. 2 panels: three histograms, each plotted against the scale of observations.]
Fig. 2. Histograms of 1000 pseudorandom Gaussian numbers for three bin widths: the data-based
choice and that choice perturbed by a factor of 2.
To illustrate extremely large sample sizes, Kendall & Stuart (1969, p. 8) consider a histogram
of the ages of 301,785 Australian bridegrooms (1907-14) with a bin width of 3 years.
The sample standard deviation and skewness for these data are 7.97 and 1.93, respectively.
Thus the data-based choice for $h_n$ is 0.41 years. Applying a skewness correction factor of
0.43 using Fig. 1(a), the final data-based choice is 0.18 years. Thus the sample is of sufficient
size to use a bin width of 1 year or even 3 months if the data were recorded to sufficient
accuracy.
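The arithmetic behind the bridegroom example can be checked directly, using the data-based rule of multiplying 3.49 by the standard deviation estimate and by the sample size raised to the power -1/3. The intermediate values below are rounded and are shown only as a worked check of the figures quoted above.

```latex
\begin{align*}
n^{1/3} &= 301785^{1/3} \approx 67.08, \\
h_n &= 3.49 \times 7.97 \times n^{-1/3} \approx 27.82 / 67.08 \approx 0.41 \text{ years}, \\
0.43 \times 0.41 &\approx 0.18 \text{ years}.
\end{align*}
```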
7. DISCUSSION
We have considered the optimal construction of histograms given either knowledge of the
true underlying density or, more commonly, given only the data. Waterman & Whiteman
(1978) have recently carried out a similar attack for Rosenblatt's kernel estimator. Kernel
estimates converge faster than histograms to the true density, and therefore integrated mean
squared error is more sensitive to the choice of the smoothing parameter; see also Silverman
(1978). Furthermore, kernel estimates require the entire data set for evaluation. Thus in
some modern automated data collectors, it is often more economical to summarize sequentially
relatively more samples, calibrating the histogram using a small training sample.
Some recently developed nonparametric techniques for density estimation start with a
histogram and then smooth it; see, for example, Boneva, Kendall & Stefanov (1971). Our
procedures could be used to construct the required histogram directly from the data. We
remark that our analysis extends easily to histograms in higher dimensions.
It should be possible to further reduce the integrated mean squared error by using an
unequally spaced mesh. However, the algorithms required would surely be iterative and
would require the entire data set. It is easier to discount rougher estimates in the tails or to
construct a rootgram as suggested by Tukey (1977, p. 543).
This research was supported in part by the National Heart, Lung, and Blood Institute, the
National Institutes of Health, the Department of Health, Education and Welfare. The
author would like to thank a referee for helpful comments.
REFERENCES
BONEVA, L. I., KENDALL, D. G. & STEFANOV, I. (1971). Spline transformations: Three new diagnostic aids
for the statistical data-analyst (with discussion). J. R. Statist. Soc. B 33, 1-70.
CENCOV, N. N. (1962). Estimation of an unknown distribution density from observations. Soviet Math. 3,
1559-62.
GUTTMAN, I. & WILKS, S. S. (1965). Introductory Engineering Statistics. New York: Wiley.