
Robust Estimation of Mean and Variance Using Environmental Data Sets with Below Detection Limit Observations

Anita Singh

Lockheed-Martin Environmental Systems & Technologies Company


980 Kelly Johnson Drive, Las Vegas, NV 89119

John Nocerino

United States Environmental Protection Agency, National Exposure Research Laboratory, P.O. Box 93478, Las Vegas, NV 89193-3478
SUMMARY

Scientists, especially environmental scientists, often encounter trace level concentrations that are typically reported as less than a certain limit of detection, L. Type I, left-censored data arise when certain low values lying below L are ignored or unknown because they cannot be measured accurately. In many environmental quality assurance and quality control (QA/QC) and groundwater monitoring applications of the United States Environmental Protection Agency (U.S. EPA), values smaller than L are not required to be reported. However, practitioners still need to obtain reliable estimates of the population mean, μ, and the standard deviation (sd), σ. The problem becomes complex when a small number of high concentrations are observed together with a substantial number of concentrations below the detection limit. The high outlying values contaminate the underlying censored sample, leading to distorted estimates of μ and σ. The U.S. EPA, through the National Exposure Research Laboratory - Las Vegas (NERL-LV), under the Office of Research and Development (ORD), has research interests in developing statistically rigorous robust estimation procedures for contaminated left-censored data sets. Robust estimation procedures based upon a proposed (PROP) influence function are shown to result in reliable estimates of the population mean and sd from contaminated left-censored samples. It is also observed that the robust estimates thus obtained, with or without the outliers, are in close agreement with the corresponding classical estimates after the removal of outliers. Several classical and robust methods for the estimation of μ and σ using left-censored (truncated) data sets with potential outliers have been reviewed and evaluated.

Key Words: Type I censoring, Type II censoring, left-censored (truncated) data, detection limit, robust statistics, Monte Carlo simulation, mean square error (MSE), PROP influence function, unbiased maximum likelihood estimation (UMLE), Cohen's maximum likelihood estimation, Persson and Rootzen's restricted maximum likelihood estimation (RMLE)

1. INTRODUCTION

The processing of the analytical results of environmental samples containing potentially hazardous chemicals is often complicated by the fact that some of these pollutants are present at trace levels, which cannot be measured reliably and are therefore reported as results lying numerically below a certain limit of detection, L. This results in left-censored data sets. In many environmental monitoring applications, values smaller than L are not even required to be reported. However, since the presence of some of these toxic pollutants (e.g., dioxin) in the various environmental media can pose a threat to human health and the environment even at trace level concentrations, these non-detects cannot be ignored or deleted (as is often done in practice) from subsequent analyses. For site characterization purposes, such as establishing mean contamination levels at various parts of a polluted site, it is desirable to obtain reliable estimates of μ and σ using the left-censored data sets. The problem gets complicated when some outliers are also present in conjunction with the non-detects. Also, sometimes in environmental applications, non-detects (e.g., due to matrix effects) exceed the observed values, adding to the complexity of the estimation procedures. Improperly obtained estimates of these parameters can result in inaccurate estimates of cleanup standards, which in turn can lead to incorrect remediation decisions at a polluted site. In this article, emphasis is given to obtaining robust estimates of the population mean and sd using left-censored data sets with potential outliers in the right tail of a data set. In this study, it is assumed that all non-detects are smaller than the observed values.

In general, censoring means that observations at one or both extremes (tails) are not available. In Type I censoring, the point of censoring (e.g., the detection limit, L) is fixed a priori for all observations, and the number, k (> 0), of censored observations varies. In Type II censoring, the number of censored observations, k, is fixed a priori, and the point(s) of censoring vary. For example, Type II right-censoring (large values are not available) typically occurs in life testing and reliability applications. In a life testing application, n items (e.g., electronic components) are subjected to a life test which terminates as soon as (n-k) of the n data values have been observed (failed). The lifetimes of the remaining k surviving items are unavailable, that is, censored.

The estimation of the parameters of normal and lognormal populations from censored samples has been studied by several researchers, including Cohen [1950, 1959], Persson and Rootzen [1977], Gleit [1985], Schneider [1986], and Gilliom and Helsel [1986]. A myriad of estimation procedures for Type I left-censored data exist in the literature, including simple substitution methods and several rigorous procedures such as Cohen's maximum likelihood estimation (MLE) procedure, Persson and Rootzen's restricted MLE (RMLE) method, and regression methods (Gilliom and Helsel [1986], Newman, Dixon, and Pinder [1989]). The commonly used substitution methods are: replacement of below detection limit data by zero, by half of the detection limit (L/2), or by the detection limit, L, itself (a short code sketch of these rules follows this paragraph). Using Monte Carlo simulation experiments, several researchers, including Gleit [1985], Gilliom and Helsel [1986], and Haas and Scheff [1990], concluded that the data substitution methods result in a biased estimate of the population mean. In practice, probably due to computational ease, these substitution methods are commonly used in many environmental applications. Depending upon the sample size, n, and the censoring intensity, k, substitution of the censored values by L/2 is one of the recommended methods in some U.S. EPA guidance documents, such as the Guidance for Data Quality Assessment [1996]. None of the simulation studies conducted so far included the unbiased maximum likelihood estimation (UMLE) method. Also, the results and conclusions of the above-mentioned studies are not directly comparable for reasons discussed in the following paragraphs.
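To make the substitution rules concrete, the following minimal Python sketch (the helper name and example data are illustrative, not taken from the paper) computes the mean and sd after replacing the k non-detects by 0, L/2, or L:

```python
import numpy as np

def substitution_estimates(observed, n_censored, L, fill="half"):
    """Estimate mean and sd of a left-censored sample by replacing each
    non-detect with 0, L/2, or L (the simple substitution methods)."""
    fill_value = {"zero": 0.0, "half": L / 2.0, "full": L}[fill]
    completed = np.concatenate([np.full(n_censored, fill_value), observed])
    # ddof=1 gives the usual unbiased sample variance
    return completed.mean(), completed.std(ddof=1)

# Illustrative example: 8 detected values and 4 non-detects below L = 1.0
detected = np.array([1.29, 1.16, 1.33, 1.56, 1.29, 1.47, 1.46, 1.13])
for rule in ("zero", "half", "full"):
    m, s = substitution_estimates(detected, n_censored=4, L=1.0, fill=rule)
    print(rule, round(m, 3), round(s, 3))
```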
Gleit [1985] performed simulation experiments for a fixed detection limit, L, for various censoring intensities. Based upon Dempster, Laird, and Rubin's [1977] expectation-maximization (EM) algorithm, Gleit used the conditional expected values of the order statistics of the Gaussian distribution for the censored observations. Based on the low mean square error (MSE) criterion, he recommended the use of the EM method, which replaces all of the non-detects by the conditional expected value of the order statistics as given by equation (11) below. Gleit's simulation experiments did not include Persson and Rootzen's RMLE method or any of the regression methods.
Gilliom and Helsel [1986] performed simulation experiments for various distributions and several levels of censoring intensity. Their simulation experiments used a detection limit computed from the distribution used. For example, for a normal distribution with mean 5 and sd 2, N(5, 2), and for censoring intensities of 30% and 60%, L will be 5 + 2*z(0.30) = 5 - 2(0.525) = 3.95 and 5 + 2*z(0.60) = 5 + 2(0.255) = 5.51, respectively, where z(α) represents the value of the standard normal deviate such that the area to its left is α. Thus, the limit, L, changes with the censoring intensity. They concluded that the extrapolation regression approach on the log-transformed data results in an estimate of the population mean with the smallest root mean square error (RMSE). Their simulation study did not include the RMLE method and the EM algorithm.
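Since the computed detection limit is simply a normal quantile, it can be reproduced with a one-line helper (hypothetical name, using scipy):

```python
from scipy.stats import norm

def computed_detection_limit(mu, sd, censoring_fraction):
    """Detection limit that censors the given fraction of a N(mu, sd)
    population: L = mu + sd * z_p, with z_p the p-th standard normal quantile."""
    return mu + sd * norm.ppf(censoring_fraction)

print(computed_detection_limit(5.0, 2.0, 0.30))  # about 3.95
print(computed_detection_limit(5.0, 2.0, 0.60))  # about 5.51
```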

Haas and Scheff [1990] and Lecher [1991] compared the performance of classical estimation methods for left-censored samples in terms of bias and MSE. They also used a detection limit computed from the distribution used in their simulation experiments. They concluded that the bias-corrected RMLE procedure results in estimates as good as those of Cohen's MLE method, and also that the RMLE method possesses lower bias and MSE than the regression and substitution methods. They also suggested that the RMLE is less sensitive to deviations from normality. Their study did not include the EM method. The objective of the present article is to develop robust procedures which yield reliable estimates of the population parameters from left-censored data sets in the presence of outliers, and also to compare the performances of the various estimation procedures. The authors performed Monte Carlo simulation experiments for both the fixed and the computed detection limit cases to assess the performances of the various classical and robust procedures in terms of bias and MSE. Several methods, including the EM algorithm, MLE, UMLE, RMLE, and the regression method, have been considered. The results of a couple of simulation runs are presented here to demonstrate the differences in the performances of these methods for the two cases: 1) L stays fixed for all censoring levels, and 2) L is computed based on the distribution used and, therefore, varies with the censoring intensity. In environmental applications, the first case (fixed detection limit) occurs quite frequently.

The occurrence of non-detects in combination with potential outliers is inevitable in data sets originating from environmental applications. A data set resulting from such a combination of non-detects in the left tail of the distribution and high concentrations in the right tail typically does not follow a well-known statistical distribution. The problem gets complex when multiple detection limits (reporting limits) are present. In practice, such a data set may have been obtained from two or more populations with significantly different mean concentrations, such as one coming from the clean background part of the site and another obtained from a contaminated part of the site. Unfortunately, many times such a data set can be modeled incorrectly by a lognormal distribution (which could pass the lognormality test). Also, a normally distributed data set with a few extreme (high) observations can be incorrectly modeled by a lognormal distribution, with the lognormal assumption hiding the outliers [Singh, Singh, and Engelhardt (1997, 2000)]. An example is discussed next to elaborate on this point.

Example 1. A simulated data set with 15 observations has been obtained from a mixture of two normal populations. Ten observations (representing background) were generated from a normal distribution with a mean of 100 and an sd of 50, and five observations (representing contamination) were generated from a normal distribution with a mean of 1000 and an sd of 100. The mean of this mixture distribution is 400. The generated data are: 180.51, 2.33, 48.67, 187.07, 120.21, 87.96, 136.75, 24.47, 82.23, 128.38, 850.91, 1041.73, 901.92, 1027.18, and 1229.94. The data set failed the normality test based on several goodness-of-fit tests, such as the Shapiro-Wilk W-test (W = 0.7572) and the Kolmogorov-Smirnov (K-S = 0.35) test. However, when these tests were carried out on the log-transformed data, the test statistics were insignificant at the α = 0.05 level of significance, with W = 0.8957 and K-S = 0.168, suggesting that a lognormal distribution provides a reasonable fit to the data. Based upon those tests, one might conclude that the observed data come from a single background lognormal population, a situation which occurs frequently in practice. It is, therefore, warranted to make sure that the data come from a single population before one would try to use a lognormal distribution on a data set.

As with full uncensored samples, the classical procedures used on left-censored data sets with potential high outliers result in distorted estimates of location and scale. A brief description of some of the above-mentioned procedures is given in Section 2, and some real and simulated data sets are discussed in Section 3. Results of a few simulation runs for the various classical methods are given in Section 4, and the conclusions are summarized in Section 5. The simulation results based on the robust procedures are in agreement with those of the classical procedures without outliers. However, due to the length of the present article, results of the complete Monte Carlo study (using data sets with outliers) to assess the performances of the various classical and robust procedures are not included here and will be submitted for publication at a later date.

2. MATHEMATICAL FORMULATION

In the following, it has been implicitly assumed that the data set under consideration has been obtained from a single normal population, perhaps after a suitable Box-Cox [1964] type transformation (including the log-transformation), with unknown mean, μ, and sd, σ. Cohen [1950, 1959] derived the maximum likelihood (ML) equations for censored samples and prepared tables of the constants needed to obtain the MLEs of μ and σ. The ML equations are solved iteratively using a suitable numerical method, such as the Newton-Raphson method. Some computer programs (e.g., UNCENSOR by Newman et al. [1989]) are available to compute the MLEs and the RMLEs from left-censored samples obtained from normal and lognormal populations. Some of these classical and robust methods have been incorporated into a computer program, CENSOR (see Scout: A Data Analysis Program), for the estimation of μ and σ from left-censored data sets with potential outliers.


Let x1, x2, ..., xn be a random sample from a normal, N(μ, σ), population, with k non-detects, x1, x2, ..., xk, lying numerically below the detection limit, L. Let φ and Φ be the probability density function (pdf) and the cumulative distribution function (cdf) of the standard normal distribution, respectively. The logarithm of the likelihood function is given as follows:
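The displayed equation did not survive extraction; for a Type I left-censored normal sample the log-likelihood takes the standard form

\ln L(\mu,\sigma) \;=\; \text{const} \;+\; k\,\ln \Phi\!\left(\frac{L-\mu}{\sigma}\right) \;-\; (n-k)\ln \sigma \;-\; \frac{1}{2\sigma^{2}}\sum_{i=k+1}^{n}\left(x_i-\mu\right)^{2},

where the sum runs over the (n-k) detected observations.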

A brief description of some of the robust procedures to estimate population parameters from contaminated

left-censored samples is given as follows.

Robust Procedures

When dealing with data sets originating from environmental applications, one is faced with the dual problem of the occurrence of below detection limit concentrations (non-detects) in the left tail and possibly some extreme concentrations in the right tail of the distribution of the contaminant (e.g., lead) under consideration. The presence of outliers leads to distorted estimates of the population mean, μ, and the sd, σ. It is, therefore, important that these unusual observations in both tails of the distribution be treated adequately. For full uncensored data sets, simple robust estimates such as the trimmed mean or the Winsorized mean (Hoaglin, Mosteller, and Tukey [1983]) are sometimes used to estimate the population mean in the presence of outliers. For example, a 100p% trimmed mean is obtained by using only the middle n(1-2p) data values; the np values are omitted from each of the two (left and right) tails of the data set. Gilbert [1987] suggested the use of the Winsorized and trimmed means for the estimation of μ and σ for left-censored data sets. Depending upon the censoring intensity, the use of the trimmed and Winsorized means has also been recommended in some guidance documents, such as the Guidance for Data Quality Assessment [1996]. Helsel [1990] discussed the use of non-parametric, distribution-free procedures, and of a simple robust pair, (median, MAD/0.6745), to estimate μ and σ, where MAD represents the median absolute deviation. Gilliom and Helsel [1986] suggested the use of least-squares regression on the log-transformed data to obtain robust estimates of μ and σ from left-censored data sets. A code sketch of these simple robust estimators is given below.
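The sketch below (illustrative data, standard numpy/scipy calls) shows the trimmed mean, the Winsorized mean, and the simple robust pair (median, MAD/0.6745):

```python
import numpy as np
from scipy import stats

# Illustrative detected values, with two high outliers at the end
detected = np.array([1.29, 1.16, 1.33, 1.56, 1.29, 1.47, 1.46, 1.13, 3.86, 6.25])

# 10% trimmed mean: drops the np smallest and np largest values (p = 0.10)
trimmed = stats.trim_mean(detected, proportiontocut=0.10)

# Winsorized mean: the extreme values are pulled in to the cut points instead
winsorized = stats.mstats.winsorize(detected, limits=(0.10, 0.10)).mean()

# Simple robust pair (median, MAD/0.6745) for location and scale
med = np.median(detected)
mad_scale = np.median(np.abs(detected - med)) / 0.6745

print(trimmed, winsorized, med, mad_scale)
```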

In this article, we use robust M-estimation procedures based on the notion of the influence function (Hampel [1974]), which assigns reduced weights to the outlying observations. For full uncensored data sets, several robust procedures exist in the literature for the estimation of the population mean and variance (Huber [1981], Rousseeuw and Leroy [1984], Staudte and Sheather [1990], Singh and Nocerino [1995]). For left-censored data sets, in order to identify and subsequently assign reduced weights to the outliers that may be present in the right tail of a data set, the robust sample mean and the robust sd, obtained using the (n-k) observed values, need to be computed first. These values are then used in the various estimation methods, such as the MLE, UMLE, RMLE, and EM methods, to obtain robust estimates of the population mean and sd. Singh and Nocerino [1995] showed that, for full data sets, the PROP influence function works very well for 1) the identification of multiple multivariate outliers, and 2) the robust estimation of the population mean vector and the dispersion matrix. In this article, those techniques are extended to obtain robust estimates of the population mean and variance using left-censored data sets with outliers.
The PROP influence function (Singh [1993]) and the corresponding iteratively obtained sample mean and sd based on the (n-k) detected observations are given as follows.
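The PROP weight equations themselves are not reproduced in this excerpt. Purely for orientation, the sketch below shows the general shape of such an iterative reweighting scheme using a generic one-sided Huber-type weight; it is a stand-in for illustration, not the PROP influence function itself, and the cutoff c and function name are assumptions:

```python
import numpy as np

def one_sided_huber_mean_sd(x, c=1.5, tol=1e-6, max_iter=100):
    """Iteratively reweighted mean and sd from the detected values,
    downweighting only large observations in the right tail (left-tail
    detected values keep unit weight).  Generic Huber-type stand-in,
    NOT the PROP influence function of the paper."""
    x = np.asarray(x, dtype=float)
    mu = np.median(x)
    sd = np.median(np.abs(x - mu)) / 0.6745           # MAD-based start
    for _ in range(max_iter):
        d = (x - mu) / sd                             # signed standardized distances
        w = np.where(d <= c, 1.0, c / d)              # reduce weight in right tail only
        mu_new = np.sum(w * x) / np.sum(w)
        sd_new = np.sqrt(np.sum(w * (x - mu_new) ** 2) / (np.sum(w) - 1.0))
        if abs(mu_new - mu) < tol and abs(sd_new - sd) < tol:
            return mu_new, sd_new
        mu, sd = mu_new, sd_new
    return mu, sd
```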

unmask outliers. This is especially true when the sample size is small (e.g., 20 or less). It needs to be pointed out that the outlier identification procedures based on influence functions typically identify extreme observations in both tails of the underlying distribution. When dealing with left-censored data sets, one is concerned with the identification of outlying observations that might be present in the right tail of the distribution; therefore, reduced weights are to be assigned only to those extreme observations found in the right tail. Each of the detected values in the left tail is assigned a unit weight.

Cohen's Method

Cohen's MLEs for the mean and the variance are obtained by solving the following equations:
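The equations themselves were lost in extraction. Cohen's [1959] solution is commonly written in the form

\hat{\mu} \;=\; \bar{x} - \hat{\lambda}\,(\bar{x} - L), \qquad \hat{\sigma}^{2} \;=\; s^{2} + \hat{\lambda}\,(\bar{x} - L)^{2},

where \bar{x} and s^{2} are the mean and variance of the (n-k) detected values, and \hat{\lambda} = \lambda(h, \hat{\gamma}) is Cohen's tabulated auxiliary function of the censored fraction h = k/n and \hat{\gamma} = s^{2}/(\bar{x} - L)^{2}. This presumably corresponds to the equations referenced in the text.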

Expectation Maximization (EM) Algorithm

Dempster, Laird, and Rubin [1977] developed the EM algorithm to maximize the likelihood function based upon censored and missing data. The iterative EM algorithm works on the observed values, proceeding as if no observations had been censored. At the initial iteration, using the observed (n-k) data values, one could start with some convenient estimates of μ and σ, such as the sample mean and sd, or a simple one-step robust pair represented by the median and MAD/0.6745. The iterations are defined as successively maximizing the expectation of the conditional likelihood function of the complete data, given the type of censoring. Gleit [1985] used this procedure for left-censored samples and found it to possess a lower MSE than the various other substitution and likelihood procedures. For the single detection limit case, the estimates of μ and σ at the (j+1)th iteration are given as follows (Shumway, Azari, and Johnson [1989]):
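The displayed updates, presumably the paper's equations (9) through (11), did not survive extraction. For a single detection limit, the standard censored-normal EM updates are, with \xi_j = (L - \mu_j)/\sigma_j and \lambda(\xi_j) = \phi(\xi_j)/\Phi(\xi_j),

\hat{c}_j \;=\; E[X \mid X < L] \;=\; \mu_j - \sigma_j\,\lambda(\xi_j) \quad \text{(the replacement value for each non-detect)},

\mu_{j+1} \;=\; \frac{1}{n}\Big[k\,\hat{c}_j + \sum_{i=k+1}^{n} x_i\Big],

\sigma_{j+1}^{2} \;=\; \frac{1}{n}\Big[k\big(v_j + (\hat{c}_j - \mu_{j+1})^{2}\big) + \sum_{i=k+1}^{n}(x_i - \mu_{j+1})^{2}\Big], \qquad v_j \;=\; \sigma_j^{2}\big[1 - \xi_j\,\lambda(\xi_j) - \lambda(\xi_j)^{2}\big].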

Thus, the EM method is an iterative substitution method in which, at each iteration, all of the non-detects are replaced by the same conditional expected value as given by equation (11). In the presence of outliers, the conditional expected value given by equation (11) gets distorted (e.g., becomes negative), which results in inadequate estimates from equations (9) and (10). Typically, contaminant concentrations are non-negative, and substituting a negative value for the non-detects would be inappropriate. In these cases, the non-detects have to be replaced by zero or by half of the detection limit, L/2 (see Example 4). In this article, whenever the conditional expected value became negative, it was replaced by L/2. As shown in the examples to follow, the robust EM estimation procedure takes care of this problem by assigning reduced weights to the outlying observations. The robust EM estimates at the (j+1)th iteration are given as follows:
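The robust weighted updates themselves did not survive extraction. For reference, the following is a minimal Python sketch of the classical EM iteration described above, without the robust downweighting; the function name, tolerances, and stopping rule are illustrative assumptions:

```python
import numpy as np
from scipy.stats import norm

def em_censored_normal(observed, k, L, n_iter=200, tol=1e-8):
    """Classical (non-robust) EM for a singly left-censored normal sample:
    at each step the k non-detects are replaced by E[X | X < L] under the
    current (mu, sd), and mu, sd are re-estimated."""
    x = np.asarray(observed, dtype=float)
    n = len(x) + k
    mu, sd = x.mean(), x.std(ddof=1)
    for _ in range(n_iter):
        xi = (L - mu) / sd
        lam = norm.pdf(xi) / norm.cdf(xi)
        c = mu - sd * lam                          # E[X | X < L]
        v = sd**2 * (1.0 - xi * lam - lam**2)      # Var[X | X < L]
        mu_new = (x.sum() + k * c) / n
        var_new = (np.sum((x - mu_new) ** 2) + k * (v + (c - mu_new) ** 2)) / n
        sd_new = np.sqrt(var_new)
        converged = abs(mu_new - mu) < tol and abs(sd_new - sd) < tol
        mu, sd = mu_new, sd_new
        if converged:
            break
    return mu, sd
```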

Restricted Maximum Likelihood (RMLE) Method

Persson and Rootzen [1977] obtained the restricted likelihood estimates by simplifying the ML equations. The likelihood function can be written as follows:


The mean, μ, and sd, σ, can be estimated in two ways: 1) by using the intercept and slope of the fit


replaced by L itself. The mean and variance are then computed using the replacement values. It is well known that the OLS estimates of the intercept and slope (Rousseeuw and Leroy [1984]), and, hence, the mean, the sd, and the extrapolated non-detects, get distorted even by the presence of a single outlier. In the presence of outliers, the use of the log-transformation alone will not result in robust estimates of the intercept and slope. Robust regression methods, as given by Rousseeuw and Leroy [1984] and by Singh and Nocerino [1995], may be used to obtain robust estimates of the slope and intercept. Some examples are considered next to illustrate the procedures described here. The discussion and use of robust regression for censored data sets is beyond the scope of the present article. All computations are performed using the CENSOR program. In the following, all replacement values (when applicable) for the non-detects are listed in parentheses. Since the substitution methods do result in biased estimates of μ and σ, their computations are omitted from most of the examples and simulation results discussed in the following sections.
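For readers who wish to experiment with the regression (extrapolation) approach, the rough Python sketch below regresses the ordered detected values on normal quantiles and extrapolates the fitted line to impute the non-detects. The Blom plotting positions and the helper name are assumptions; the paper's equation (21) may use a different convention, and, being OLS based, this sketch is not robust to outliers:

```python
import numpy as np
from scipy.stats import norm

def regression_estimates(observed, k, log_scale=False):
    """Regression-on-order-statistics sketch: the intercept and slope of the
    fit estimate the mean and sd, and the line extrapolated to the k lowest
    plotting positions gives imputed non-detects."""
    y = np.sort(np.asarray(observed, dtype=float))
    if log_scale:
        y = np.log(y)
    n = len(y) + k
    # Blom-type plotting positions for the detected ranks k+1, ..., n
    ranks = np.arange(k + 1, n + 1)
    q = norm.ppf((ranks - 0.375) / (n + 0.25))
    slope, intercept = np.polyfit(q, y, 1)
    # Extrapolate to the censored ranks 1, ..., k
    q_cens = norm.ppf((np.arange(1, k + 1) - 0.375) / (n + 0.25))
    imputed = intercept + slope * q_cens
    if log_scale:
        imputed = np.exp(imputed)
    return intercept, slope, imputed
```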


3. MORE EXAMPLES

Example 2. A simulated data set of size 15 was obtained from a normal population with mean, μ = 1.33, and sd, σ = 0.2, i.e., N(1.33, 0.2), with L = 1.0 and k = 2. The left-censored data are: <1.0, <1.0, 1.2883, 1.1612, 1.1560, 1.3251, 1.1568, 1.5638, 1.2914, 1.3253, 1.2884, 1.4688, 1.4581, 1.3641, 1.1342. The sample mean and sd obtained using the 13 observed data values were 1.306 and 0.134, respectively. The classical and robust procedures produced similar results, which are given in Table 1. For this simulated data set, observe that all of the methods, except the first two substitution methods, gave similar results.
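As an illustration of the likelihood machinery (a sketch only, not the CENSOR program used to produce the tables), the censored-normal MLE for this data set can be computed by numerically maximizing the log-likelihood given in Section 2; the resulting values should be comparable to the likelihood-based entries in Table 1:

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize

# Detected values from Example 2 (k = 2 non-detects below L = 1.0)
x = np.array([1.2883, 1.1612, 1.1560, 1.3251, 1.1568, 1.5638, 1.2914,
              1.3253, 1.2884, 1.4688, 1.4581, 1.3641, 1.1342])
k, L = 2, 1.0

def neg_loglik(theta):
    mu, log_sd = theta
    sd = np.exp(log_sd)                      # keep sd positive
    return -(k * norm.logcdf((L - mu) / sd)
             + np.sum(norm.logpdf(x, loc=mu, scale=sd)))

res = minimize(neg_loglik, x0=[x.mean(), np.log(x.std(ddof=1))],
               method="Nelder-Mead")
mu_hat, sd_hat = res.x[0], np.exp(res.x[1])
print(round(mu_hat, 3), round(sd_hat, 3))    # ML estimates of mu and sd
```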


Example 3. Next, the data set of Example 2 was contaminated with two outliers, 3.8561 and 6.2513, from a normal, N(5, 2), population. The outliers distorted the classical estimates of the mean and the sd for all of the methods, and also distorted the intercept and slope of the OLS regression. The sample mean and sd for the 15 observed data points were 1.806 and 1.4, respectively. The results are given in Table 2. Notice that, for the EM algorithm, the outliers distorted the conditional replacement value from 0.91 to 0.025. The robust MLE, RMLE, and EM methods, on the other hand, resulted in fairly accurate estimates, and the robust results of Table 2, with the outliers, and the classical results of Table 1, without the outliers, are in close agreement.

Example 4. This left-censored data set is taken from the U.S. EPA RCRA guidance document [1992]. The detection limit is 1450. The data, with 3 non-detects and 21 observed values, are: <1450, <1450, <1450, 1850, 1760, 1710, 1575, 1475, 1780, 1790, 1780, 1790, 1800, 1800, 1840, 1820, 1860, 1780, 1760, 1800, 1900, 1770, 1790, 1780. The sample mean and sd obtained using the 21 observed values are 1771.90 and 92.702, respectively. The classical and robust estimates are summarized in Table 3. Note that the substitution by L/2 method, one of the most frequently used methods in environmental applications, resulted in a biased estimate of the mean with the highest variability. All of the likelihood methods and the EM method produced comparable estimates.


In order to further illustrate how the presence of outliers distorts these estimates, three arbitrarily chosen outliers, 7000, 8000, and 11000, were added to the data set of this example. The relevant classical and robust statistics for the contaminated data set are summarized below in Table 4. The classical observed sample mean and sd for the data with the outliers (24 values) are 2633.75 and 2410.35. Using equation (21), the intercept and slope for the left-censored contaminated data set are 2216.51 and 2061.25. Use of this OLS fit resulted in distorted negative values for the extrapolated non-detects, which are given in Table 4.


Substitution values for the non-detects are given in parentheses, identified by one asterisk for the regression method and by two asterisks for the EM method. For the classical EM procedure, the estimated non-detects became negative. Notice that the robust results for the MLE, the RMLE, and the EM-based algorithm are in close agreement with or without the outliers, as can be seen by comparing Tables 3 and 4. Also note that the robust replacement value of 1385.97 for the EM algorithm is in agreement with the corresponding classical value without the outliers.

Next, the log-transformation of the data set with the outliers is considered. The corresponding estimates are given below in Table 5. The classical mean and sd for the observed log-transformed data are 7.675 and 0.537. In the following, all back-transformation results are obtained using equation (22). The outliers distorted the estimates of the mean and the sd for all of the methods, including the regression method. The OLS fit on the log-transformed data is given in Figure 3. The intercept and slope are 7.577 and 0.481, respectively. Using this fit, the estimated non-detects are 6.62, 6.83, and 6.95, which, when converted back to the original units, are 749.95, 925.19, and 1043.15, respectively. The resulting estimates (using all 27 data points) of the mean and the sd in the original units are 2441.79 and 2333.93, which are obviously influenced by the three outliers. Thus, as mentioned above, the log-transformation alone cannot produce robust regression estimates. The only advantage of using the log-transformed data is that the replacement values for the non-detects did not become negative for the regression and the EM method.
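Equation (22) is not reproduced in this excerpt; presumably it is the standard lognormal back-transformation, under which estimates (\hat{\mu}_y, \hat{\sigma}_y) obtained on the log scale convert to original-unit mean and variance as

\hat{\mu}_x \;=\; \exp\!\big(\hat{\mu}_y + \hat{\sigma}_y^{2}/2\big), \qquad \hat{\sigma}_x^{2} \;=\; \exp\!\big(2\hat{\mu}_y + \hat{\sigma}_y^{2}\big)\big(\exp(\hat{\sigma}_y^{2}) - 1\big).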


Figures 4 through 6 show the MSEs, and Figures 7 and 8 show the bias, for the various procedures when L = 2.0. Figures 9 through 11 show the MSEs, and Figures 12 through 14 display the bias, for the various estimation methods when L = 4.0. It is observed that, for samples of smaller sizes (e.g., less than 10), the UMLE method yields a higher bias than the MLE, RMLE, and EM methods (Figures 7 and 12). However, when L (e.g., 1 or 2) is much smaller than the mean, μ (e.g., 5), the UMLE method has the smallest MSE and the EM method has the largest MSE for samples of all sizes. When L is closer (e.g., 4 or 5) to the mean, μ, for samples of


smaller sizes (Figures 9 and 12), the MSE and bias for the UMLE become larger than those of the EM, MLE, and RMLE methods, with the EM method having the smallest MSE and bias. This observation concurs with Gleit's [1985] findings for the EM method. But as the sample size increases (e.g., becomes 15 or larger), as expected, the situation reverses, and the UMLE method results in the smallest MSE, while the EM and regression methods yield larger MSEs (Figures 10 and 11). Note that, to some extent, this behavior of the MSE and bias of the EM method is similar to that of the substitution by L/2 or L methods, except that, as the sample size increases, the MSE and bias for the latter two substitution methods become much greater than those of the EM method, as can be seen in Figures 15-18 and 24. The EM method, after all, is just a substitution method in which all of the non-detects are replaced by an optimally obtained conditional expected value. From Figures 13 and 14, it is noticed that the bias for the EM and regression methods becomes fairly large as the sample size increases. Also, it is observed that, as the sample size increases, the bias of the MLE and RMLE methods starts becoming smaller (in magnitude) than that of the UMLE method for all censoring intensities (Figures 13-14). From all of these graphs, it is observed that the bias and MSEs obtained using the RMLE and Cohen's MLE methods are stable, always stay very close to each other, and lie in the middle of the respective bias and MSE of the other methods for all sample sizes and censoring levels. Actually, in most cases, the RMLE method even results in a smaller bias and MSE than the MLE method, as can be seen from Figures 8, 10, 11, 13, 14, and 16-18. Also, note that the differences in the MSE of the three MLE methods (UMLE, Cohen's, and RMLE) decrease as the sample size increases.

Computed Detection Limit

Figures 19-22 are the graphs of the MSE, and Figures 23-27 show the bias, for various sample sizes and censoring levels. From these graphs, as expected, it is observed that the MSE for all of the methods increases with the detection limit, L, or the censoring intensity, for all sample sizes. It is observed that, for small sample sizes, the EM method (the optimal replacement by the conditional expected value method) results in a smaller MSE and bias (Figures 19 and 23), especially when the detection limit starts coming closer to the mean value. The bias for the EM and L/2 methods becomes unacceptably high with increased sample


size, as can be seen in Figures 24 and 25. As observed earlier, note that, as the sample size increases, the UMLE method results in the smallest MSE, and the substitution by L/2 method, the EM method, and the regression method yield MSEs much larger than those of the three MLE methods. However, the UMLE method results in a bias which is larger than those of the MLE and RMLE methods. This increase in the bias of the UMLE becomes quite noticeable with increases in the sample size, the detection limit, and the censoring intensity. This is especially true when the censoring level starts exceeding 30%. Moreover, from all of these graphs, it is observed that both the bias and the MSEs obtained using the RMLE and Cohen's MLE methods are stable and always stay close together for all of the sample sizes and censoring levels. Also, as noticed earlier, the RMLE method does result in a smaller bias and MSE than the MLE method (Figures 20-25) most of the time. These observations concur with the conclusions derived by Haas and Scheff [1990].

5. SUMMARY AND CONCLUSIONS

In this article, two questions which arise when dealing with left-censored data sets have been addressed. Those two questions are: 1) Which method should be used for the estimation of the population mean and sd from left-censored data sets? and 2) What is an appropriate robust estimation procedure in the presence of potential outliers in the right tail of the distribution? The various substitution methods are simple, but do not perform well in most cases, as they yield estimates with a larger bias and MSE than those obtained using the MLE methods. Also, it is observed that, for larger sample sizes (e.g., 15 or more), the EM method results in a bias and MSE larger than those of the MLE methods. The examples presented here lead to the conclusion that the OLS regression-based approaches cannot be recommended for routine use. The estimated non-detects obtained by extrapolating the fitted model many times result in infeasible estimates, which become negative or even greater than the detection limit, L. The examples and the simulation results presented in this article clearly establish that, in most cases, the three MLE methods (Cohen's MLE, UMLE, and RMLE) perform better than the various substitution and regression methods. All of the classical estimation procedures, including the maximum likelihood and substitution methods, result in distorted estimates in the presence of outliers. In the presence of outliers, the EM method


sometimes produces negative estimates of the non-detects, which in turn result in a biased estimate of the population mean. The OLS regression models get distorted by the outlying observations; therefore, regression estimates obtained using the raw or the log-transformed data are no longer reliable. Thus, the OLS regression method based on the log-transformed data is not a true robust method. It is observed that the robust estimation procedure based on the PROP influence function results in stable and reliable estimates of the population parameters. Moreover, the resulting robust estimates, with or without the outliers, and the classical estimates, without the outliers, stay in close agreement. The performance of the various estimation methods described here depends upon several factors, such as the sample size, the censoring intensity, and the value of the detection limit, L. The conclusions derived from the simulation results and graphs presented in this article are summarized as follows.

- When the detection limit, L, is closer to the population mean, it is observed that, for samples of smaller sizes (e.g., 5-10), the EM method and the other substitution methods, such as the L/2 method, result in a smaller bias and MSE than the three MLE (UMLE, Cohen's, and RMLE) procedures. However, as the sample size increases (e.g., 15 or larger), the EM method, along with the other substitution methods, results in a higher bias and a larger MSE than the three likelihood procedures.

- For values of L much smaller than the mean, the UMLE method results in the smallest MSE for samples of all sizes.

- The differences in the MSE of the three MLE methods decrease as the sample size increases.

- The simulation results suggest that, although the UMLE method does result in the smallest MSE for samples of size 15 or larger, the bias of the UMLE becomes larger (in magnitude) than that of the MLE and RMLE methods. This increase in the bias of the UMLE method becomes quite noticeable as the detection limit increases and the censoring intensity starts exceeding 30%. Thus, for higher censoring intensities, the MLE or the RMLE method may be used to obtain estimates of the population mean and sd from a left-censored data set.

- The RMLE method is simple and results in estimates which are in close agreement with Cohen's ML estimates. It is observed that the bias and MSEs obtained using the RMLE and Cohen's MLE methods are stable and always stay close together for all sample sizes and censoring intensities. Actually, in most cases, the RMLE method results in a smaller bias and MSE than those obtained using Cohen's MLE method. This is especially true as the sample size increases.

Using the examples and results described here, the following recommendations can be made.

- For data sets with potential outliers, the robust estimation procedures based on influence functions, such as the PROP influence function, should be used for the estimation of the population parameters.


- For samples of small sizes (e.g., 10 observations or less), the EM method or the substitution by L/2 (or L) method may be used, especially when L is closer to the mean.

- For samples of larger sizes (e.g., 15 observations or more), the UMLE method may be used for censoring levels of 30% or less.

- However, since the differences in the MSE of the three MLE methods (UMLE, Cohen's, and RMLE) decrease as the sample size increases, and in order to make things easier for a typical user, it is recommended that, for larger sample sizes or for samples with censoring levels exceeding 30%, the much simpler RMLE method be used for the estimation of the population parameters.

The results of our study clearly establish that one should stay away from the substitution methods, especially when the sample size is larger than 10 observations.

NOTICE
The U.S. Environmental Protection Agency (EPA), through its Office of Research and Development (ORD), funded and collaborated in the research described here. It has been subjected to the Agency's peer review and has been approved as an EPA publication. The U.S. Government has a non-exclusive, royalty-free license in and to any copyright covering this article.


REFERENCES

Box, G.E.P., and Cox, D.R., An Analysis of Transformations, Journal of the Royal Statistical Society, Series B, 26, pp. 211-252, 1964.

