
Improved power in multinomial goodness-of-fit tests

2002, Journal of the Royal Statistical Society: Series D (The Statistician)


Ayanendranath Basu, Applied Statistics Unit, Indian Statistical Institute, 203 B. T. Road, Calcutta 700 035, India.
Surajit Ray, Department of Statistics, Penn State University, University Park, PA 16802, USA.
Chanseok Park, Dept. of Mathematical Sciences, Clemson University, Clemson, SC 29634-0975, USA.
Srabashi Basu, Stat-Math Unit, Indian Statistical Institute, 203 B. T. Road, Calcutta 700 035, India.

Submitted to JRSS(D): 26 Mar 2001. Revised version submitted: 22 Jan 2002. Published: Sep 2002, vol. 51, no. 3, pp. 381-393.

Summary. Pearson's chi-square and the log likelihood ratio chi-square statistics are fundamental tools in goodness-of-fit testing. Cressie and Read (1984) constructed a general family of divergences which includes both statistics as special cases. This family is indexed by a single parameter, and divergences at either end of the scale are more powerful against alternatives of one type while being rather poor against the opposite type. Here we present several new goodness-of-fit testing procedures which have reasonably high power against both kinds of alternatives. Graphical studies illustrate the advantages of the new methods.

Keywords: Disparities, empty cell penalty, goodness-of-fit, power divergence.

1. Introduction

Pearson's chi-square and the log likelihood ratio statistics are long standing techniques in goodness-of-fit testing under multinomial set-ups. Many authors have investigated the scope and relative performance of these tests, and have compared them with other less popular statistics such as the Neyman modified chi-square statistic, the modified log likelihood ratio statistic (based on the Kullback-Leibler divergence) and the test statistic based on the Hellinger distance. See, for example, Cochran (1952), Watson (1959), Hoeffding (1965), West and Kempthorne (1972), Moore and Spruill (1975), Chapman (1976), Larntz (1978), and Koehler and Larntz (1980). Cressie and Read (1984), hereafter referred to as C&R, and Read and Cressie (1988) presented a unified approach to goodness-of-fit testing in multinomial models through the family of power divergences {I^λ : λ ∈ ℝ}. Pearson's chi-square, the likelihood disparity (generating the log likelihood ratio statistic), the (twice, squared) Hellinger distance, the Kullback-Leibler divergence and the Neyman modified chi-square are indexed by λ = 1, 0, −1/2, −1 and −2 respectively. Based on a comparative study, Read and Cressie (1988) recommend I^{2/3} as a compromise candidate among the different test statistics, although they note several desirable properties of the other test statistics, including Pearson's chi-square I^1 (see, e.g., Sections 4.5 and 6.7 and Appendix A11 of Read and Cressie). Basu and Sarkar (1994) considered the disparity test statistics, a more general class of goodness-of-fit test statistics which includes the power divergence statistics and is based on the minimum disparity estimation approach of Lindsay (1994). A disparity is characterized by a function G(·), which gives geometrical insight into how the disparity controls the "outliers" and "inliers", representing departures from the null in opposing directions. The present paper is motivated by the observation of C&R that there is a reverse order hierarchy in the powers of the goodness-of-fit tests within the power divergence family for the "bump" alternatives compared to the "dip" alternatives under the equiprobable null hypothesis.
Bump alternatives are those where k − 1 cells of a multinomial with k cells have equal probability, while the remaining cell has a higher probability than the rest. Dip alternatives are similar, except that the one remaining cell has lower probability than the rest. In this paper we try to explain why the above behavior of the power divergence test statistics is natural, and make a preliminary attempt to provide some new tests with reasonably high power against both kinds of alternatives.

Three new sets of goodness-of-fit test statistics are considered. The first is based upon "penalized versions" of the power divergence statistics, while the second represents a judicious combination of the members of the power divergence family. The third method is based on entirely new families of disparities sensitive to both kinds of departures. In Section 2 we describe the disparity test statistics and introduce the power divergence family of C&R and the blended weight Hellinger distance family. The equiprobable null, along with the dip and bump alternatives, is also described in this section. In Section 3 the new test statistics are proposed and their usefulness is demonstrated through graphical studies. Section 4 provides a small comparative study where the performance of some of the new tests is compared to the present standard. The last section contains concluding remarks.

2. The equiprobable null hypothesis, the alternatives and the disparity test statistics

For a sequence of n observations on a multinomial distribution with probability vector π = (π_1, ..., π_k), ∑_{i=1}^k π_i = 1, let X = (x_1, ..., x_k) denote the observed frequencies for the k categories and p = (p_1, ..., p_k) = (x_1/n, ..., x_k/n) denote the observed proportions. One is often interested in a simple null hypothesis such as

$$H_0: \pi_i = \pi_{0i}, \quad i = 1, 2, \ldots, k, \qquad (1)$$

where π_{0i}, i = 1, ..., k, are known constants. This completely specifies the null hypothesis. In particular, the equiprobable null hypothesis for this set-up is obtained when one uses π_{0i} = 1/k for all i. For the equiprobable null hypothesis, consider the following set of alternatives:

$$H_1: \pi_i = \frac{1}{k}\left(1 - \frac{\eta}{k-1}\right), \; i = 1, 2, \ldots, k-1, \qquad \pi_k = \frac{1+\eta}{k}, \qquad (2)$$

where the value of η is between −1 and k − 1. Note that for η > 0 a bump alternative and for η < 0 a dip alternative is obtained.

The ability of a statistic to discern the veracity of the null hypothesis depends on the sensitivity of the statistic to deviations from the null hypothesis. We will distinguish between two kinds of deviations in this context. "Outliers" will represent those cells where p_i > π_{0i}; these are the cells which have more cases than predicted. The "inliers", on the other hand, represent those cells with fewer cases than predicted.

Let G be a strictly convex, thrice differentiable, nonnegative function on [−1, ∞) with G(0) = 0, G^{(1)}(0) = 0 and G^{(2)}(0) = 1, where G^{(i)} denotes the i-th derivative of G. Suppose that G^{(3)}(0) is finite and G^{(3)} is continuous at 0. Then the disparity ρ_G(p, π_0) between p and π_0 defined in (1) is given by

$$\rho_G(p, \pi_0) = \sum_{i=1}^k G\left(\frac{p_i}{\pi_{0i}} - 1\right)\pi_{0i} = \sum_{i=1}^k G(\delta_i)\,\pi_{0i}, \qquad \delta_i = \frac{p_i}{\pi_{0i}} - 1. \qquad (3)$$

For a disparity ρ_G, consider D_{ρ_G}(p, π_0) = 2nρ_G(p, π_0) as a test statistic for the simple null hypothesis defined in (1). We will call δ_i the "Pearson residual" at cell i. For a positive δ_i the i-th cell is an outlier, and for a negative δ_i it is an inlier.
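As a concrete illustration of (2) and (3), the following minimal sketch computes the disparity statistic D_{ρ_G} for a user-supplied G and builds the dip/bump alternative vector. We use Python for all illustrative sketches in this version (the authors' own Splus code is mentioned in Section 5), and the function names here (disparity_stat, dip_bump_alternative) are ours, not the authors'. Pearson's G(δ) = δ²/2 serves as a sanity check, since the resulting statistic must coincide with the classical chi-square formula.

```python
import numpy as np

def disparity_stat(x, pi0, G):
    """Disparity test statistic D = 2n * sum_i G(delta_i) * pi0_i,
    where delta_i = p_i/pi0_i - 1 is the Pearson residual at cell i."""
    x = np.asarray(x, dtype=float)
    pi0 = np.asarray(pi0, dtype=float)
    n = x.sum()
    p = x / n                          # observed proportions
    delta = p / pi0 - 1.0              # Pearson residuals
    return 2.0 * n * np.sum(G(delta) * pi0)

def dip_bump_alternative(k, eta):
    """Alternative (2): cell k has probability (1 + eta)/k, the other
    k - 1 cells share the rest equally; eta > 0 is a bump, eta < 0 a dip."""
    pi = np.full(k, (1.0 - eta / (k - 1.0)) / k)
    pi[-1] = (1.0 + eta) / k
    return pi

# sanity check: Pearson's G reproduces sum (x_i - n*pi0_i)^2 / (n*pi0_i)
x = np.array([7, 4, 6, 3])
pi0 = np.full(4, 0.25)                 # equiprobable null, k = 4
print(disparity_stat(x, pi0, lambda d: d**2 / 2.0))   # disparity form
print(np.sum((x - 20 * pi0)**2 / (20 * pi0)))         # classical form
```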
The power divergence test statistics 2nI^λ(p, π_0) are generated by the power divergence family

$$I^\lambda(p, \pi_0) = \frac{1}{\lambda(\lambda+1)} \sum_{i=1}^k p_i \left[\left(\frac{p_i}{\pi_{0i}}\right)^{\lambda} - 1\right], \qquad \lambda \in \mathbb{R}, \qquad (4)$$

corresponding to

$$G(\delta) = \frac{(\delta+1)^{\lambda+1} - (\delta+1)}{\lambda(\lambda+1)} - \frac{\delta}{\lambda+1}. \qquad (5)$$

In particular, the Pearson chi-square statistic D_{PCS} is generated by the function G(δ) = δ²/2 (i.e. λ = 1), and the Hellinger distance statistic D_{HD} corresponds to G(δ) = 2[(δ+1)^{1/2} − 1]² (i.e. λ = −1/2). The likelihood ratio chi-square statistic is generated by I^0, which corresponds to the limiting case of the form in (4) as λ → 0. C&R showed that the test statistic 2nI^λ has an asymptotic χ²_{k−1} distribution under the simple null hypothesis H_0, for all λ ∈ ℝ. Basu and Sarkar (1994) generalised this to show that all disparity test statistics 2nρ_G have this same asymptotic distribution under the null. The blended weight Hellinger distance family BWHD_τ, 0 ≤ τ ≤ 1, defined by

$$BWHD_\tau(p, \pi_0) = \frac{1}{2} \sum_{i=1}^k \frac{(p_i - \pi_{0i})^2}{\left[\tau p_i^{1/2} + (1-\tau)\pi_{0i}^{1/2}\right]^2}, \qquad (6)$$

corresponds to

$$G(\delta) = \frac{\delta^2}{2\left[\tau(\delta+1)^{1/2} + (1-\tau)\right]^2}.$$

The (twice, squared) Hellinger distance is the member of BWHD_τ with τ = 1/2.

C&R (1984, Table 2) determined the exact power of the disparity test statistics for various values of η based on different members of the power divergence family, for the specific case n = 20, k = 4, and significance level α = 0.05. The exact powers are calculated for the appropriately randomized test of size α = 0.05 by enumerating all possible samples and calculating the appropriate critical value from the probabilities of these samples under the null. The power is then easily evaluated from the probabilities of the samples under the alternative. The results of C&R as well as Read (1984) show that for η > 0 the exact power of the tests increases with λ, while for η < 0 the exact power decreases with λ.

Fig. 1. Plot of G(δ) for various members of the power divergence family (λ = −2, −1, 0, 1, 2).

Thus the sensitivity of a disparity test statistic depends on how the defining G function treats the outliers and inliers. In Figure 1 we present the G functions for several members of the power divergence family. For large positive values of λ the functions are fairly flat on the negative side of the δ-axis but curve away rapidly on the positive side. Thus the test statistics for large positive values of λ are strongly sensitive to outliers, while presenting a relatively dampened response to inliers. The opposite is true for large negative values of λ. As a result, large positive values of λ lead to high power against bump alternatives while being poor against dip alternatives, and the findings are reversed for large negative values of λ.
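The exact power computation described above is mechanical enough to sketch. The code below enumerates all frequency vectors for the given n and k, orders them by the value of the statistic, and randomizes on the boundary sample to attain size α exactly; this is one of several valid randomization schemes, and C&R's own construction may break ties differently. It reuses disparity_stat and dip_bump_alternative from the sketch in Section 2.

```python
import numpy as np
from scipy.stats import multinomial

def compositions(n, k):
    """All vectors of k nonnegative integer counts summing to n."""
    if k == 1:
        yield (n,)
        return
    for first in range(n + 1):
        for rest in compositions(n - first, k - 1):
            yield (first,) + rest

def exact_power(stat_fn, pi0, pi1, n=20, alpha=0.05):
    """Exact power of a size-alpha randomized test that rejects for
    large values of stat_fn, by complete enumeration of all samples."""
    samples = [np.array(s) for s in compositions(n, len(pi0))]
    t = np.array([stat_fn(s) for s in samples])
    p0 = np.array([multinomial.pmf(s, n, pi0) for s in samples])
    p1 = np.array([multinomial.pmf(s, n, pi1) for s in samples])
    order = np.argsort(-t)                  # most extreme samples first
    p0, p1 = p0[order], p1[order]
    cum0 = np.cumsum(p0)
    m = int(np.searchsorted(cum0, alpha))   # boundary sample index
    head = cum0[m - 1] if m > 0 else 0.0    # null mass rejected outright
    gamma = (alpha - head) / p0[m]          # randomization probability
    return p1[:m].sum() + gamma * p1[m]

# e.g. power of the Pearson test against the bump alternative eta = 1.5:
pi0 = np.full(4, 0.25)
pearson = lambda x: disparity_stat(x, pi0, lambda d: d**2 / 2.0)
print(exact_power(pearson, pi0, dip_bump_alternative(4, 1.5)))
```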
3. Proposed test statistics

3.1. The penalized divergence statistics

In this section we propose a class of divergences obtained by introducing an empty cell penalty to the disparities within the class of power divergences. Recall that the disparities with large positive values of λ are sensitive against outliers but not against inliers. The empty cell penalty introduced here makes these disparities simultaneously sensitive to empty cells, which are the extreme cases of inliers. Formally, we rewrite the power divergence family defined in (4) as

$$I^\lambda(p, \pi_0) = \sum_{p_i \neq 0} \left\{\frac{p_i}{\lambda(\lambda+1)}\left[\left(\frac{p_i}{\pi_{0i}}\right)^{\lambda} - 1\right] + \frac{\pi_{0i} - p_i}{\lambda+1}\right\} + \frac{1}{\lambda+1}\sum_{p_i = 0} \pi_{0i}.$$

Note that ordinarily the disparity puts the weight 1/(λ + 1) on the empty cells (cells with p_i = 0). For large positive values of λ this weight is fairly small. An artificial empty cell penalty can hike up the weight of the empty cells to a suitably large value so that it increases the sensitivity of the statistics to dip alternatives. Thus we consider the penalized power divergence family given by

$$I_h^\lambda(p, \pi_0) = \sum_{p_i \neq 0} \left\{\frac{p_i}{\lambda(\lambda+1)}\left[\left(\frac{p_i}{\pi_{0i}}\right)^{\lambda} - 1\right] + \frac{\pi_{0i} - p_i}{\lambda+1}\right\} + h \sum_{p_i = 0} \pi_{0i},$$

where h represents the penalty weight. One can use 2nI_h^λ(p, π_0) as the goodness-of-fit test statistic for testing the null hypothesis (1). Ideal choices would be divergences with large positive values of λ used in conjunction with large positive values of h. Following Park et al. (2001, Theorem 2.1), one can show that 2nI_h^λ(p, π_0) has an asymptotic χ²_{k−1} distribution under the null hypothesis. Penalized divergences have also been used by Harris and Basu (1994), Basu and Basu (1998), and Park et al. (2001) to make the divergences less sensitive to empty cells, thereby increasing the robustness of their parameter estimation properties.
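In code, the penalty changes only the weight attached to the empty cells. A minimal sketch under the same hypothetical naming conventions as before (valid for λ not in {0, −1}):

```python
import numpy as np

def penalized_pd_stat(x, pi0, lam, h):
    """Penalized power divergence statistic 2n * I_h^lam(p, pi0):
    nonempty cells contribute the usual power divergence terms, while
    each empty cell contributes h * pi0_i in place of its natural
    weight pi0_i / (lam + 1). Requires lam not in {0, -1}."""
    x = np.asarray(x, dtype=float)
    pi0 = np.asarray(pi0, dtype=float)
    n = x.sum()
    p = x / n
    nz = p > 0                                       # nonempty cells
    core = np.sum(p[nz] * ((p[nz] / pi0[nz])**lam - 1.0)
                  / (lam * (lam + 1.0))
                  + (pi0[nz] - p[nz]) / (lam + 1.0))
    return 2.0 * n * (core + h * pi0[~nz].sum())

# with h = 1/(lam + 1) this reduces to the ordinary statistic 2n*I^lam
```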
For illustration we computed the exact powers for the equiprobable null hypothesis with n = 20, k = 4, and significance level α = 0.05, for the power divergence statistic with λ = 2, as well as for its penalized version with penalty weight h = 5. The powers of the statistics for values of η between −1 and 2 are plotted in Figure 2. Notice that the penalty clearly leads to a significant increase in power for large negative values of η without any appreciable loss in power for positive values of η.

Fig. 2. Comparison of the power functions of the power divergence test statistic with λ = 2, with and without the penalty (h = 5).

Figure 3 gives another graphical illustration of the effect of the penalty, where we plot the powers for different values of the penalty weight h for the equiprobable null hypothesis with n = 20, k = 5, α = 0.05, and for two specific values of η determining two alternatives of opposite types. However, in this case, instead of choosing the penalties to be preassigned specific values, we have chosen the penalty weight h to be equal to γ/(λ + 1), that is, γ times the natural weight of the empty cell for the disparity. We have used γ = 0.01, 0.5, 1, 2 in our graphs.

Fig. 3. The effect of the penalty on the power at two fixed alternatives for the power divergence test statistics (γ = 0.01, 0.5, 1, 2; panel (a): η = −0.9, panel (b): η = 1.5).

Figure 3a exhibits the powers of these penalized statistics for η = −0.9, which gives some idea of the change in power due to the penalty effect for this value of η. For example, for λ = 1 and λ = 2, the segments of the vertical lines along these values of λ between γ = 1 and γ = 2 represent the increase in power for these statistics at η = −0.9 when the penalty weights are double their natural weights. Notice also that the increase in power between γ = 1 and γ = 2 is negligible for very large values of λ, the reason being that for such values of λ the ordinary weights of the empty cells are so small that even doubling them has little effect in terms of improving the power. Figure 3b illustrates the effect of the penalty under a set-up identical to that of Figure 3a, but now the value of η is 1.5. In this case the line corresponding to γ = 1 lies above the line for γ = 2, exhibiting that there is a loss in power due to doubling the penalty weight. Again the vertical line segments for λ = 1 and λ = 2 between these two values of γ quantify this decrease in power, but clearly the losses in these situations are substantially smaller than the gain for the dip alternative in Figure 3a. A comparison of Figures 3a and 3b shows that, roughly for values of λ between 1/2 and 2, the application of the penalty is probably meaningful and the gains outweigh the losses even by conservative estimates.

3.2. The combined divergence statistics

Another option for improved power is to choose a disparity test statistic where the G(·) function is a combination of the functions of two different disparities, so that the test is sensitive to both dip and bump alternatives. Recall that disparities corresponding to large positive values of λ are highly sensitive to outliers (corresponding to δ > 0), while those corresponding to large negative λ are highly sensitive to inliers (corresponding to δ < 0). Thus, for example, if one chooses a G(·) that matches the G function for λ = −1 on the negative side of the axis and the corresponding function of λ = 2 on the positive side, the statistic is obviously going to be strongly influenced by both inliers and outliers (see Figure 1). Technically such combinations amount to choosing a combined divergence ρ_G(p, π) defined by the function

$$G(\delta) = \begin{cases} G_1(\delta), & \text{if } \delta \leq 0, \\ G_2(\delta), & \text{if } \delta > 0, \end{cases}$$

where G_1 and G_2 are convex functions satisfying all the properties mentioned in Section 2. Notice that the combined G is itself a convex function satisfying G(0) = 0, G^{(1)}(0) = 0, and G^{(2)}(0) = 1. For a combined divergence ρ_G, consider D_{ρ_G} = 2nρ_G(p, π_0) as a test statistic for the simple null hypothesis in (1). The following theorem, proved in the appendix, shows that the test statistics for combined divergences have the same asymptotic χ² distribution when the null hypothesis is true.

Theorem 1. The test statistic D_{ρ_G} corresponding to the combined divergence ρ_G has an asymptotic χ²_{k−1} distribution under the null hypothesis in (1).

As an illustration we present the following case. In Figure 4 we present the power for the combination where G_1 corresponds to the G function of λ = −1, while G_2 corresponds to the G function of λ = 2. Notice that the combined divergence test statistic is quite close to the best cases (among 2nI^{−1} and 2nI^{2}) in terms of power for most values of η, while being substantially better than the worst cases for all the alternatives. In particular, the power of the combined divergence test is very close to the best case for negative values of η.

Fig. 4. Power comparison of a combined divergence statistic (the (−1, 2) combination) with the simple power divergence statistics λ = −1 and λ = 2.
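A combined divergence is easy to assemble from the power divergence G functions: one λ governs the inlier side and another the outlier side. A sketch, again under our own naming; the λ = 0 and λ = −1 members are coded through their standard limiting forms:

```python
import numpy as np

def G_pd(lam):
    """G function (5) of the power divergence family; lam = 0 and
    lam = -1 are the usual limits (likelihood disparity and
    Kullback-Leibler divergence respectively)."""
    def G(d):
        d = np.asarray(d, dtype=float)
        if lam == 0:
            # note: returns nan at d = -1, where the limit value is 1
            return (d + 1.0) * np.log1p(d) - d
        if lam == -1:
            return d - np.log1p(d)
        return ((d + 1.0)**(lam + 1.0) - (d + 1.0)) / (lam * (lam + 1.0)) \
            - d / (lam + 1.0)
    return G

def G_combined(lam_in, lam_out):
    """Piecewise G of Section 3.2: lam_in governs inliers (delta <= 0),
    lam_out governs outliers (delta > 0). For lam_in <= -1 the value at
    delta = -1 (an empty cell) is infinite; see Section 4."""
    G1, G2 = G_pd(lam_in), G_pd(lam_out)
    return lambda d: np.where(np.asarray(d) <= 0.0, G1(d), G2(d))

# the (-1, 2) combination of Figure 4, plugged into disparity_stat:
# D = disparity_stat(x, pi0, G_combined(-1.0, 2.0))
```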
3.3. The PBHM_τ statistics

We now consider a third set of goodness-of-fit tests expected to perform well against both dip and bump alternatives. The divergences used here are mixtures of Pearson's chi-square with members of the blended weight Hellinger distance (BWHD_τ) family, resulting in the PBHM_τ (Pearson-blended Hellinger mixture) family indexed by the parameter τ. The relevant G function for this mixture disparity is

$$G_{PBHM_\tau}(\delta) = \tau\,\frac{\delta^2}{2} + (1-\tau)\,\frac{\delta^2}{2\left[\tau(\delta+1)^{1/2} + (1-\tau)\right]^2}, \qquad \tau \in (0, 1).$$

Fig. 5. Plot of the G(δ) function for various members of the PBHM family (τ = 0 (PCS), 0.1, 0.3, 0.5, 0.7, 0.9), together with the Neyman modified chi-square (NCS).

Figure 5 presents the graphs of the G functions for several members of this family, together with those for the Pearson and Neyman modified chi-square statistics. The graphs demonstrate the sensitivity of the new mixture disparities to both outliers and inliers. For illustration, the exact powers of the 2nPBHM_τ(p, π_0) statistic with τ = 0.5 are computed with n = 20, k = 4 and significance level α = 0.05 for different values of η, and are presented in Figure 6 together with the power function of the Pearson chi-square statistic. Once again notice that the 2nPBHM_τ disparity test statistic shows a comparatively larger increase in power for large negative values of η, together with a relatively smaller loss in power for large positive values of η.

Fig. 6. Power comparison of PBHM_{τ = 0.5} with Pearson's χ².

Since the G functions of the PBHM_τ disparities satisfy all the properties listed in Section 2, the 2nPBHM_τ statistics have asymptotic chi-square distributions for each τ.
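The PBHM_τ G function is a one-liner under the same conventions (G_pbhm is our hypothetical name); the statistic of Figure 6 is obtained by plugging it into disparity_stat:

```python
import numpy as np

def G_pbhm(tau):
    """G function of the PBHM_tau family: a tau : (1 - tau) mixture of
    the Pearson chi-square G with the BWHD_tau G."""
    def G(d):
        d = np.asarray(d, dtype=float)
        blend = tau * np.sqrt(d + 1.0) + (1.0 - tau)
        return tau * d**2 / 2.0 + (1.0 - tau) * d**2 / (2.0 * blend**2)
    return G

# e.g. D = disparity_stat(x, pi0, G_pbhm(0.5))
```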
3.4. Composite null hypothesis

In this paper we have mostly focused on the simple null hypothesis (more specifically on the equiprobable null hypothesis) so that the concepts are easily explained. However, it is not difficult to show that the results extend to the case of the composite hypothesis, where the cell probabilities are a function of a parameter θ of dimension s < k − 1, under the regularity conditions of Birch (1964), and provided a BAN (best asymptotically normal) estimator of θ is used. In particular, the conclusions of Theorem 4.3 of Basu and Sarkar (1994) remain valid under the above conditions for all three classes of new disparities developed in Sections 3.1-3.3. The proofs are straightforward and are not included here so as to retain the applied focus of the paper; however, we present a data example with a composite null hypothesis in the following section.

4. A comparison of some of the test statistics

In this paper we have suggested three classes of new goodness-of-fit statistics. In terms of individual test statistics, this amounts to millions of choices. Precise recommendations about their use will require extensive future research, including large scale comparisons involving several scenarios. However, for the benefit of the applied statistician, whose final interest would be in the choice of a particular statistic, we present some limited comparisons which provide an initial indication of the test statistics that could serve as reasonable alternatives to the currently available tests. To do this we provide one set of exact power comparisons with several test statistics, and also consider a couple of data examples to contrast the different methods. As C&R and Read and Cressie (1988) recommended the I^{2/3} divergence as a compromise candidate, we use this statistic as the basis of comparison.

Exact power comparisons: For the exact power comparisons we retain the set-up considered in most of our numerical studies, i.e. we use the equiprobable null hypothesis, the dip and bump alternatives as functions of η, sample size n = 20, and number of groups k = 4. The statistics we use in this comparison correspond to (a) λ = 2/3, (b) the penalized statistic with λ = 2 and penalty weight 1, (c) the penalized statistic with λ = 1 and penalty weight 5, (d) the combined statistic for λ = −1 and λ = 2, (e) the combined statistic for λ = −2 and λ = 1, and (f) the PBHM_τ statistic with τ = 0.5. The calculated exact powers at level α = 0.05 are graphically presented in Figure 7. It appears that, except for the combined (λ = −2, λ = 1) statistic, all the other five statistics are very close in terms of exact attained power in this case. While this does not show any of the other statistics to be actually "better" than the λ = 2/3 case, it does show that there are many other compromise candidates with practically identical performance in this case.

Fig. 7. Power comparison of some of the new statistics with 2nI^{2/3}.

Next we present a couple of data examples. The examples are not restricted to equiprobable null types, but are of a more complex nature representing the complexities of real life. Since the basic premise of our construction is to make the test statistics more sensitive to deviations from the null irrespective of the nature of the null, we expect the new statistics to exhibit their high sensitivity with these real data as well.

Data Example 1: This dataset is taken from Agresti (1990; Table 3.10, page 72). A total of 182 psychiatric patients on drugs were classified according to their diagnosis. The frequency distribution of the diagnoses is given in Table 1.

Table 1. Frequency distribution of diagnoses of psychiatric patients

  Diagnosis              Frequency
  Schizophrenia          105
  Affective Disorder     12
  Neurosis               18
  Personality Disorder   47
  Special Symptoms       0

Consider testing the null hypothesis where the probability vector is given by π_0 = (0.56, 0.06, 0.09, 0.25, 0.04). The chi-square critical value at 4 degrees of freedom and level of significance 0.05 is 9.488. The ordinary statistics for λ = 2, 1, and 2/3 cannot reject the null hypothesis with this critical value, the corresponding test statistics being 5.273, 7.689, and 9.142 respectively. However, for the penalized statistics (notice that the data set contains one empty cell) corresponding to λ = 1 and λ = 2 with penalty weight h = 1, the test statistics are 14.969 and 14.980 respectively, which reject the null hypothesis comfortably. Similarly, the combined statistics for the (λ = −1/2, λ = 1) and (λ = −1/2, λ = 2) combinations are, respectively, 29.529 and 29.540. Once again they comfortably reject the null hypothesis, unlike the ordinary statistics corresponding to λ = 2 and λ = 1. Notice that we have not computed the (λ = −1, λ = 2) or (λ = −2, λ = 1) combinations. This is because, if there are empty cells, any combination component for the inliers with λ = −1 or less will make the statistic infinite. Of course, technically we reject for such statistics, but this is not very informative. The PBHM_{0.5} statistic for this data set is 18.602. For this example, therefore, the new statistics considered here lead to rejection, whereas the λ = 2/3 statistic as well as some of the other ordinary ones fail to reject. With an expected frequency of over 7 in the last cell being matched against an observed frequency of zero, the null hypothesis is certainly in doubt in our opinion.
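For the record, the computations of Example 1 can be re-run from the sketches given earlier (disparity_stat, G_pd, penalized_pd_stat, G_combined and G_pbhm are our hypothetical helpers, not the authors' Splus code); the printed values should agree with the statistics quoted above up to rounding.

```python
import numpy as np

x = np.array([105, 12, 18, 47, 0])              # Table 1 frequencies
pi0 = np.array([0.56, 0.06, 0.09, 0.25, 0.04])  # hypothesized vector

# ordinary power divergence statistics (lam = 2, 1, 2/3)
for lam in (2.0, 1.0, 2.0 / 3.0):
    print(lam, disparity_stat(x, pi0, G_pd(lam)))
# penalized statistics with h = 1 (the data contain one empty cell)
for lam in (1.0, 2.0):
    print(lam, penalized_pd_stat(x, pi0, lam, h=1.0))
# combined statistics (-1/2, 1) and (-1/2, 2), and the PBHM_0.5 statistic;
# compare each value with the chi-square critical value 9.488 (4 df, 0.05)
for pair in ((-0.5, 1.0), (-0.5, 2.0)):
    print(pair, disparity_stat(x, pi0, G_combined(*pair)))
print(disparity_stat(x, pi0, G_pbhm(0.5)))
```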
Data Example 2: Next we consider the time passage example data (Read and Cressie 1988; pp. 12-16), which studies the relationship between life stresses and illnesses in Oakland, California. The data are in the form of an 18-cell multinomial, with the frequencies representing the total number of respondents for each month who indicated one stressful event between 1 and 18 months prior to interview (see Read and Cressie for more details). The null hypothesis H_0: π_i = 1/18, i = 1, ..., 18, is rejected by each of the ordinary test statistics. However, if one considers a loglinear time trend model H_0: log π_i = ϑ + βi, i = 1, ..., 18, the model fit appears to be much better. Notice that the null no longer completely specifies the probability structure. Expected frequencies based on maximum likelihood estimates of ϑ and β are given in Read and Cressie (1988; Table 2.2). The test statistics are now compared with the critical value of a chi-square with 16 degrees of freedom (rather than 17), which at level of significance 0.05 is 26.296. The λ = 2/3 statistic equals 23.076 and fails to reject the null (in fact, so do all the statistics with λ between 0 and 3). However, the combined statistics for (λ = −1, λ = 2) and (λ = −2, λ = 1) are 35.271 and 44.840 respectively, with the null being rejected in both cases. The PBHM_{0.5} statistic equals 24.629; although larger than the λ = 2/3 statistic, this also fails to reject the null. On the whole this analysis shows that the time trend model is on the borderline of being significant, but some of the newer statistics are more likely to reject the null than the compromise suggested by C&R. Notice that the penalized statistics are not meaningful in this case because the data have no empty cell.

5. Concluding Remarks

In this paper we have presented some new candidates for goodness-of-fit testing in multinomial models. A final and definitive recommendation will require much deeper research, but we feel that the initial indications are promising enough for some of the proposed statistics to be pursued further. It does appear that some of these tests are competitive with the compromise suggested by C&R, and some of them are more sensitive than I^{2/3} in certain cases. What is the cost incurred in making the test statistics more sensitive to both kinds of deviations? While all the test statistics considered by us have asymptotic chi-square distributions, we believe that the cost of making the test statistics more sensitive is paid through a slower convergence to the asymptotic chi-square limit. We do not perceive this to be a problem for large sample sizes, but for small to moderate sample sizes the level of our tests may be a little off from the nominal levels if one uses the chi-square critical values. At present we are investigating better small sample approximations to the null distributions in the spirit of Read (1984). However, exact critical values in small samples and simulated critical values in moderate samples can be easily calculated for the simple null.
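For the simple null, simulated critical values of the kind mentioned above take only a few lines: draw multinomial samples under H_0, evaluate the statistic on each, and take the empirical (1 − α) quantile. A minimal sketch under our running notation (independent of the authors' own Splus code referenced below):

```python
import numpy as np

def simulated_critical_value(stat_fn, pi0, n, alpha=0.05, reps=100000,
                             seed=0):
    """Monte Carlo (1 - alpha) quantile of the null distribution of a
    disparity statistic for a multinomial(n, pi0) sample."""
    rng = np.random.default_rng(seed)
    draws = rng.multinomial(n, pi0, size=reps)
    stats = np.array([stat_fn(x) for x in draws])
    return np.quantile(stats, 1.0 - alpha)

# e.g. a critical value for the PBHM_0.5 statistic with n = 20, k = 4:
# pi0 = np.full(4, 0.25)
# c = simulated_critical_value(
#         lambda x: disparity_stat(x, pi0, G_pbhm(0.5)), pi0, n=20)
```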
The authors will be happy to provide code for the determination of the above, so that one can perform accurate level α goodness-of-fit tests when applying the proposed methods (which may be too inaccurate with χ² critical values in small samples). The Splus codes are readily available from the site http://www.stat.psu.edu/~surajit/goodness/. In conclusion, we emphasize again that this is a small preliminary study involving only a limited number of scenarios. To determine the general scope of the proposed statistics, more extensive studies are necessary, which the authors hope to undertake in the future.

Appendix

Proof of Theorem 1. Consider the combined divergence ρ_G(p, π) defined by

$$G(\delta) = \begin{cases} G_1(\delta), & \text{if } \delta \leq 0, \\ G_2(\delta), & \text{if } \delta > 0, \end{cases}$$

where G_1 and G_2 are convex functions satisfying the properties listed in Section 2. For a combined divergence ρ_G(p, π), let D_{ρ_G} = 2nρ_G(p, π_0) be the test statistic for the hypothesis defined in (1). By a Taylor series expansion of the test statistic (as a function of p_i around π_{0i}) we get

$$\rho_G(p, \pi_0) = \sum_{i=1}^k G\left(\frac{p_i - \pi_{0i}}{\pi_{0i}}\right)\pi_{0i} = \sum_{i=1}^k G(0)\,\pi_{0i} + \sum_{i=1}^k (p_i - \pi_{0i})\,G^{(1)}(0) + \frac{1}{2}\sum_{i=1}^k \frac{(p_i - \pi_{0i})^2}{\pi_{0i}}\,G^{(2)}(0) + \frac{1}{6}\sum_{i=1}^k \frac{(p_i - \pi_{0i})^3}{\pi_{0i}^2}\,G^{(3)}\left(\pi_{0i}^{-1}\xi_i - 1\right) = S_1 + S_2 + S_3 + S_4,$$

say, where ξ_i lies on the line segment joining p_i and π_{0i}. Note that (a) G_1(0) = G_2(0) = 0 and G_1^{(1)}(0) = G_2^{(1)}(0) = 0, so that G(0) = G^{(1)}(0) = 0; (b) G^{(2)} exists everywhere and G_1^{(2)}(0) = G_2^{(2)}(0) = G^{(2)}(0) = 1; and (c) both p_i and π_{0i} are nonnegative terms that sum to 1 over i. The first two terms S_1 and S_2 are, therefore, equal to 0. Next note that

$$|6nS_4| = \left|\sum_{i=1}^k \frac{n(p_i - \pi_{0i})^3}{\pi_{0i}^2}\,G^{(3)}\left(\pi_{0i}^{-1}\xi_i - 1\right)\right| \leq \left[\sup_i |p_i - \pi_{0i}|\right]\left[\sup_i \left|G^{(3)}\left(\pi_{0i}^{-1}\xi_i - 1\right)\right|\right] \sum_{i=1}^k \frac{n(p_i - \pi_{0i})^2}{\pi_{0i}^2}.$$

Here sup_i |p_i − π_{0i}| = o_p(1) and ∑_{i=1}^k n(p_i − π_{0i})²/π_{0i}² = O_p(1). Moreover, |ξ_i − π_{0i}| ≤ |p_i − π_{0i}| = o_p(1) for every i; notice that (ξ_i − π_{0i}) and (π_{0i}^{-1}ξ_i − 1) have the same sign, and by the assumptions G_1^{(3)}(0) and G_2^{(3)}(0) are finite and G_1^{(3)} and G_2^{(3)} are continuous at 0. Hence sup_i |G^{(3)}(π_{0i}^{-1}ξ_i − 1)| = O_p(1), and therefore 6nS_4 = o_p(1). The result then follows by noting that

$$2nS_3 = \sum_{i=1}^k \frac{n(p_i - \pi_{0i})^2}{\pi_{0i}}$$

is the Pearson chi-square statistic, whose asymptotic χ²_{k−1} distribution under the simple null hypothesis is well known.

References

[1] Agresti, A. (1990). Categorical Data Analysis. John Wiley & Sons, New York.
[2] Basu, A. and Basu, S. (1998). Penalized minimum disparity methods for multinomial models. Statistica Sinica, 8, 841-860.
[3] Basu, A. and Sarkar, S. (1994). On disparity based goodness-of-fit tests for multinomial models. Statist. Probab. Lett., 19, 307-312.
[4] Birch, M. W. (1964). A new proof of the Pearson-Fisher theorem. Ann. Math. Statist., 35, 817-824.
[5] Chapman, J. W. (1976). A comparison of the χ², -2 log R, and the multinomial probability criteria for significance testing when expected frequencies are small. J. Amer. Statist. Assoc., 71, 854-863.
[6] Cochran, W. G. (1952). The χ² test of goodness-of-fit. Ann. Math. Statist., 23, 315-345.
[7] Cressie, N. and Read, T. R. C. (1984). Multinomial goodness-of-fit tests. J. Roy. Statist. Soc. B, 46, 440-464.
[8] Harris, I. R. and Basu, A. (1994). Hellinger distance as a penalized log likelihood. Commun. Statist. Comput. Simul., 23, 1097-1113.
[9] Hoeffding, W. (1965).
Asymptotically optimal tests for multinomial distributions. Ann. Math. Statist., 36, 369-408.
[10] Koehler, K. J. and Larntz, K. (1980). An empirical investigation of goodness-of-fit statistics for sparse multinomials. J. Amer. Statist. Assoc., 75, 336-344.
[11] Larntz, K. (1978). Small sample comparisons of exact levels of chi-squared goodness-of-fit statistics. J. Amer. Statist. Assoc., 73, 253-263.
[12] Lindsay, B. G. (1994). Efficiency versus robustness: the case for minimum Hellinger distance and related methods. Ann. Statist., 22, 1081-1114.
[13] Moore, D. S. and Spruill, M. C. (1975). Unified large-sample theory of general chi-squared statistics for tests of fit. Ann. Statist., 3, 599-616.
[14] Park, C., Basu, A. and Harris, I. R. (2001). Tests of hypothesis in multiple samples based on penalized disparities. J. Korean Statist. Soc., 30, 347-366.
[15] Read, T. R. C. (1984). Small sample comparisons for power divergence goodness-of-fit statistics. J. Amer. Statist. Assoc., 79, 929-935.
[16] Read, T. R. C. and Cressie, N. (1988). Goodness-of-Fit Statistics for Discrete Multivariate Data. Springer-Verlag, New York.
[17] Watson, G. S. (1959). Some recent results in chi-square goodness-of-fit tests. Biometrics, 15, 440-468.
[18] West, E. N. and Kempthorne, O. (1972). A comparison of χ² and likelihood ratio tests for composite alternatives. J. Statist. Comput. Simul., 1, 1-33.