Vol. 4, No. 1
www.doaj.org
Aims
The Directory of Open Access Journals covers free, full-text, quality-controlled scientific and scholarly journals. It aims to cover all subjects and languages.
Increase visibility of open access journals.
Simplify use.
Promote increased usage leading to higher impact.
Scope
The Directory aims to be comprehensive and cover all open access scientific and scholarly journals that use a quality control system to guarantee the content. All subject areas and languages will be covered.
In DOAJ, browse by subject: Agriculture and Food Sciences; Biology and Life Sciences; Chemistry; General Works; History and Archaeology; Law and Political Science; Philosophy and Religion; Social Sciences; Arts and Architecture; Business and Economics; Earth and Environmental Sciences; Health Sciences; Languages and Literatures; Mathematics and Statistics; Physics and Astronomy; Technology and Engineering.
Contact: Lotte Jørgensen, Project Coordinator, Lund University Libraries, Head Office. E-mail: [email protected]. Tel: +46 46 222 34 31.
Funded by
Hosted by
www.soros.org
www.lu.se
Journal of Modern Applied Statistical Methods May, 2005, Vol. 4, No. 1, 2-352
Bruno D. Zumbo
Associate Editor Measurement, Evaluation, & Research Methodology University of British Columbia
Vance W. Berger
Assistant Editor Biometry Research Group National Cancer Institute
Todd C. Headrick
Assistant Editor Educational Psychology and Special Education Southern Illinois University-Carbondale
Harvey Keselman
Assistant Editor Department of Psychology University of Manitoba
Alan Klockars
Assistant Editor Educational Psychology University of Washington
Patric R. Spence
Editorial Assistant Department of Communication Wayne State University
Editorial Board
Subhash Chandra Bagui Department of Mathematics & Statistics University of West Florida J. Jackson Barnette School of Public Health University of Alabama at Birmingham Vincent A. R. Camara Department of Mathematics University of South Florida Ling Chen Department of Statistics Florida International University Christopher W. Chiu Test Development & Psychometric Rsch Law School Admission Council, PA Jai Won Choi National Center for Health Statistics Hyattsville, MD Rahul Dhanda Forest Pharmaceuticals New York, NY John N. Dyer Dept. of Information System & Logistics Georgia Southern University Matthew E. Elam Dept. of Industrial Engineering University of Alabama Mohammed A. El-Saidi Accounting, Finance, Economics & Statistics, Ferris State University Felix Famoye Department of Mathematics Central Michigan University Barbara Foster Academic Computing Services, UT Southwestern Medical Center, Dallas Shiva Gautam Department of Preventive Medicine Vanderbilt University Dominique Haughton Mathematical Sciences Department Bentley College Scott L. Hershberger Department of Psychology California State University, Long Beach Joseph Hilbe Departments of Statistics/ Sociology Arizona State University SinHo Jung Dept. of Biostatistics & Bioinformatics Duke University Jong-Min Kim Statistics, Division of Science & Math University of Minnesota Harry Khamis Statistical Consulting Center Wright State University Kallappa M. Koti Food and Drug Administration Rockville, MD Tomasz J. Kozubowski Department of Mathematics University of Nevada Kwan R. Lee GlaxoSmithKline Pharmaceuticals Collegeville, PA Hee-Jeong Lim Dept. of Math & Computer Science Northern Kentucky University Balgobin Nandram Department of Mathematical Sciences Worcester Polytechnic Institute J. Sunil Rao Dept. of Epidemiology & Biostatistics Case Western Reserve University Karan P. Singh University of North Texas Health Science Center, Fort Worth Jianguo (Tony) Sun Department of Statistics University of Missouri, Columbia Joshua M. Tebbs Department of Statistics Kansas State University Dimitrios D. Thomakos Department of Economics Florida International University Justin Tobias Department of Economics University of California-Irvine Dawn M. VanLeeuwen Agricultural & Extension Education New Mexico State University David Walker Educational Tech, Rsrch, & Assessment Northern Illinois University J. J. Wang Dept. of Advanced Educational Studies California State University, Bakersfield Dongfeng Wu Dept. of Mathematics & Statistics Mississippi State University Chengjie Xiong Division of Biostatistics Washington University in St. Louis Andrei Yakovlev Biostatistics and Computational Biology University of Rochester Heping Zhang Dept. of Epidemiology & Public Health Yale University INTERNATIONAL Mohammed Ageel Dept. of Mathematics, & Graduate School King Khalid University, Saudi Arabia Mohammad Fraiwan Al-Saleh Department of Statistics Yarmouk University, Irbid-Jordan Keumhee Chough (K.C.) Carriere Mathematical & Statistical Sciences University of Alberta, Canada Michael B. C. Khoo Mathematical Sciences Universiti Sains, Malaysia Debasis Kundu Department of Mathematics Indian Institute of Technology, India Christos Koukouvinos Department of Mathematics National Technical University, Greece Lisa M. Lix Dept. 
of Community Health Sciences University of Manitoba, Canada Takis Papaioannou Statistics and Insurance Science University of Piraeus, Greece Nasrollah Saebi School of Mathematics Kingston University, UK Keming Yu Department of Statistics University of Plymouth, UK
Regular Articles

11-34      Biao Zhang    Testing the Goodness of Fit of Multivariate Multiplicative-intercept Risk Models Based on Case-control Data
35-42      Panagiotis Mantalos    Two Sides of the Same Coin: Bootstrapping the Restricted vs. Unrestricted Model
43-52      John P. Wendell, Sharon P. Cox    Coverage Properties of Optimized Confidence Intervals for Proportions
53-62      Rand R. Wilcox, Mitchell Earleywine    Inferences about Regression Interactions via a Robust Smoother with an Application to Cannabis Problems
63-74      Stan Lipovetsky, Michael Conklin    Regression by Data Segments via Discriminant Analysis
75-80      W. A. Abu-Dayyeh, Z. R. Al-Rawi, M. MA. Al-Momani    Local Power for Combining Independent Tests in the Presence of Nuisance Parameters for the Logistic Distribution
81-89      B. Sango Otieno, C. Anderson-Cook    Effect of Position of an Outlier on the Influence Curve of the Measures of Preferred Direction for Circular Data
90-99      Inger Persson, Harry Khamis    Bias of the Cox Model Hazard Ratio
           David A. Walker    Bias Affiliated with Two Variants of Cohen's d When Determining U1 as a Measure of the Percent of Non-Overlap
           C. Anderson-Cook, Kathryn Prewitt    Some Guidelines for Using Nonparametric Methods for Modeling Data from Response Surface Designs
           Gibbs Y. Kanyongo    Determining the Correct Number of Components to Extract from a Principal Components Analysis: A Monte Carlo Study of the Accuracy of the Scree Plot
134-139    Abdullah Almasri, Ghazi Shukur    Testing the Causal Relation Between Sunspots and Temperature Using Wavelets Analysis
140-154    Leming Qu    Bayesian Wavelet Estimation of Long Memory Parameter
           Model-Selection-Based Monitoring of Structural Change
           On the Power Function of Bayesian Tests with Application to Design of Clinical Trials: The Fixed-Sample Case
172-186    Vincent Camara, Chris P. Tsokos    Bayesian Reliability Modeling Using Monte Carlo Integration
187-213    Michael C. Long, Ping Sa    Right-tailed Testing of Variance for Non-Normal Distributions
214-226    Hasan Hamdan, John Nolan, Melanie Wilson, Kristen Dardia    Using Scale Mixtures of Normals to Model Continuously Compounded Returns
227-239    Michael B.C. Khoo, T. F. Ng    Enhancing the Performance of a Short Run Multivariate Control Chart for the Process Mean
240-250    Paul A. Nakonezny, Joseph Lee Rodgers    An Empirical Evaluation of the Retrospective Pretest: Are There Advantages to Looking Back?
251-274    Yonghong Jade Xu    An Exploration of Using Data Mining in Educational Research
275-282    Bruno D. Zumbo, Kim H. Koh    Manifestation of Differences in Item-Level Characteristics in Scale-Level Measurement Invariance Tests of Multi-Group Confirmatory Factor Analyses
Brief Report

283-287    J. Thomas Kellow    Exploratory Factor Analysis in Two Measurement Journals: Hegemony by Default

Early Scholars

288-299    Ling Chen, Mariana Drane, Robert F. Valois, J. Wanzer Drane    Multiple Imputation for Missing Ordinal Data
"2@$
Journal of Modern Applied Statistical Methods May, 2005, Vol. 4, No. 1, 2-352
JMASM Algorithms and Code

312-318    JMASM17: An Algorithm and Code for Computing Exact Critical Values for Friedman's Nonparametric ANOVA (Visual Basic)
319-332    J. I. Odiase, S. M. Ogbonmwan    JMASM18: An Algorithm for Generating Unconditional Exact Permutation Distribution for a Two-Sample Experiment (Visual Fortran)
333-342    David A. Walker    JMASM19: A SPSS Matrix for Determining Effect Sizes From Three Categories: r and Functions of r, Differences Between Proportions, and Standardized Differences Between Means (SPSS)
JMASM is an independent print and electronic journal (http://tbf.coe.wayne.edu/jmasm) designed to provide an outlet for the scholarly works of applied nonparametric or parametric statisticians, data analysts, researchers, classical or modern psychometricians, quantitative or qualitative evaluators, and methodologists. Work appearing in Regular Articles, Brief Reports, and Early Scholars is externally peer reviewed, with input from the Editorial Board; work appearing in Statistical Software Applications and Review and in JMASM Algorithms and Code is internally reviewed by the Editorial Board.
Three areas are appropriate for JMASM: (1) development or study of new statistical tests or procedures, or the comparison of existing statistical tests or procedures, using computer-intensive Monte Carlo, bootstrap, jackknife, or resampling methods; (2) development or study of nonparametric, robust, permutation, exact, and approximate randomization methods; and (3) applications of computer programming, preferably in Fortran (all other programming environments are welcome), related to statistical algorithms, pseudo-random number generators, simulation techniques, and self-contained executable code to carry out new or interesting statistical methods.
Elegant derivations, as well as articles with no take-home message to practitioners, have low priority. Articles based on Monte Carlo (and other computer-intensive) methods designed to evaluate new or existing techniques or practices, particularly as they relate to novel applications of modern methods to everyday data analysis problems, have high priority. Problems may arise from applied statistics and data analysis; experimental and nonexperimental research design; psychometry, testing, and measurement; and quantitative or qualitative evaluation. They should relate to the social and behavioral sciences, especially education and psychology. Applications from other traditions, such as actuarial statistics, biometrics or biostatistics, chemometrics, econometrics, environmetrics, jurimetrics, quality control, and sociometrics are welcome. Applied methods from other disciplines (e.g., astronomy, business, engineering, genetics, logic, nursing, marketing, medicine, oceanography, pharmacy, physics, political science) are acceptable if the demonstration holds promise for the social and behavioral sciences.
Internet Sponsor Paula C. Wood, Dean College of Education, Wayne State University
e-mail: [email protected]
Journal of Modern Applied Statistical Methods May, 2005, Vol. 4, No. 1, 2-10
Rand R. Wilcox
Department of Psychology
This article considers a J by K ANOVA design where all JK groups are dependent and where groups are to be compared based on medians. Two general approaches are considered. The first is based on an omnibus test for no main effects and no interactions, and the other tests each member of a collection of relevant linear contrasts. Based on an earlier paper dealing with multiple comparisons, an obvious speculation is that a particular bootstrap method should be used. One of the main points here is that, in general, this is not the case for the problem at hand. The second main result is that, in terms of Type I errors, the second approach, where multiple hypotheses are tested based on relevant linear contrasts, performs about as well as or better than the omnibus method, and in some cases it offers a distinct advantage.

Keywords: Repeated measures designs, robust methods, kernel density estimators, bootstrap methods, linear contrasts, multiple comparisons, familywise error rate

Rand R. Wilcox ([email protected]) is a Professor of Psychology at the University of Southern California, Los Angeles.

Introduction

Consider a J by K ANOVA design where all JK groups are dependent. Let theta_jk (j = 1, ..., J; k = 1, ..., K) represent the (population) medians corresponding to these JK groups. This article is concerned with two strategies for dealing with main effects and interactions. The first is to perform an omnibus test for no main effects and no interactions by testing

H_0: C theta = 0,   (1)

where theta is a column vector containing the JK population medians theta_jk, and C is a matrix having full row rank that reflects the null hypothesis of interest. (The first K elements of theta are theta_11, ..., theta_1K, the next K elements are theta_21, ..., theta_2K, and so on.) The second approach uses a collection of linear contrasts, rather than a single omnibus test, and now the goal is to control the probability of at least one Type I error.

A search of the literature indicates that there are very few results on comparing the medians of dependent groups using a direct estimate of the medians of the marginal distributions, and there are no results for the situation at hand. In an earlier article (Wilcox, 2004), two methods were considered for performing all pairwise comparisons among a collection of dependent groups. The first uses an estimate of the appropriate standard error stemming from the influence function of a single order statistic.
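For readers who want to see how the null hypotheses in (1) are encoded, the following is a minimal Python sketch of the usual Kronecker-product construction of C for main effects and interactions described later in the article. The function name and the NumPy implementation are mine, not part of the paper; it is a sketch under the standard convention that C_J collects successive differences.

import numpy as np

def contrast_matrices(J, K):
    """Contrast matrices for H0: C theta = 0, where theta stacks the JK medians
    row by row (theta_11, ..., theta_1K, theta_21, ...)."""
    CJ = np.eye(J - 1, J) - np.eye(J - 1, J, k=1)   # rows (1, -1, 0, ...), (0, 1, -1, ...)
    CK = np.eye(K - 1, K) - np.eye(K - 1, K, k=1)
    jJ = np.ones((1, J))                            # row vector of ones
    jK = np.ones((1, K))
    C_A = np.kron(CJ, jK)     # main effects for Factor A
    C_B = np.kron(jJ, CK)     # main effects for Factor B
    C_AB = np.kron(CJ, CK)    # interactions
    return C_A, C_B, C_AB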
Let X_(1)k <= ... <= X_(n)k be the observations associated with the kth variable written in ascending order. Two estimates of the population median are relevant here. The first is

theta_hat_k = X_(m)k,

where again m = [.5n + .5], and the other is theta_hat_k = M_k, the usual sample median based on X_1k, ..., X_nk. Although the focus is on estimating the median with q = .5, the results given here apply to any q, 0 < q < 1. Let f_k be the marginal density of the kth variable and let

IF_q(x) = (q - 1)/f(x_q)   if x < x_q,
        = 0                if x = x_q,
        = q/f(x_q)         if x > x_q,

in which case

X_(m) ~ x_q + (1/n) sum_{i=1}^n IF_q(X_i)   (2)

(Bahadur, 1966; also see Staudte & Sheather, 1990). Now consider the situation where sampling is from a bivariate distribution. Let X_ik (i = 1, ..., n; k = 1, 2) be a random sample of n pairs of observations, and let

V_1 = (q - 1)^2 P(X_1 <= x_q1, X_2 <= x_q2).

Then the covariance between X_(m)1 and X_(m)2 is

sigma^2_12 = (V_1 + V_2 + V_3 + V_4) / (n f_1(x_q1) f_2(x_q2)).   (3)

Also, (2) yields a well-known expression for the squared standard error of X_(m)1, namely,

sigma^2_11 = q(1 - q) / (n f_1^2(x_q1)).
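Before turning to the density estimate that these expressions require, here is a minimal Python sketch of the single-order-statistic median estimate and the squared standard error q(1 - q)/(n f^2(x_q)) just given. The function names are mine, and the density value at the median is assumed to come from some external estimator (for example, the adaptive kernel estimator described next).

import numpy as np

def single_order_statistic_median(x):
    """Median estimate theta_hat = X_(m) with m = [.5n + .5] (1-based index)."""
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    m = int(np.floor(0.5 * n + 0.5))   # [.5n + .5]
    return x[m - 1]

def squared_se_median(n, density_at_median, q=0.5):
    """Squared standard error q(1 - q) / (n f^2(x_q)) of the single order statistic."""
    return q * (1.0 - q) / (n * density_at_median ** 2)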
To estimate f_k(x_q), a variation of an adaptive kernel density estimator is used (e.g., Silverman, 1986), which is based in part on an initial estimate obtained via a so-called expected frequency curve (e.g., Wilcox, 2005; cf. Davies & Kovac, 2004). To elaborate, let MAD_k be the median absolute deviation associated with the kth marginal distribution, which is the median of the values |X_1k - M_k|, ..., |X_nk - M_k|. Under normality, MADN_k = MAD_k/.6745 estimates the standard deviation, in which case x is close to X_ik if x is within beta standard deviations of X_ik, that is, if |X_ik - x| <= beta MADN_k. Let

N_k(x) = {i : |X_ik - x| <= beta MADN_k}.

That is, N_k(x) indexes the set of all X_ik values that are close to x. Then an initial estimate of f_k(x) is taken to be

f_hat_k(x) = (1 / (2 beta MADN_k n)) sum_i I(i in N_k(x)),

where I is the indicator function. Here, beta = .8 is used. The adaptive kernel density estimate is computed as follows. Let

log g_hat = (1/n) sum_i log f_hat_k(X_i)

and

lambda_i = (f_hat_k(X_ik) / g_hat)^(-a),

where a is a sensitivity parameter satisfying 0 <= a <= 1. Based on comments by Silverman (1986), a = .5 is used. Then the adaptive kernel estimate of f_k is taken to be

f_tilde(x) = (1/n) sum_i (1 / (h lambda_i)) K{h^(-1) lambda_i^(-1) (x - X_i)},

where

K(t) = (3 / (4 sqrt(5))) (1 - t^2/5),   |t| < sqrt(5),
     = 0,                               otherwise,

is the Epanechnikov kernel, and following Silverman (1986, p. 47-48), the span is

h = 1.06 A / n^(1/5),

where A = min(s, IQR/1.34), s is the standard deviation, and IQR is the interquartile range based on X_1k, ..., X_nk. Here, IQR is estimated via the ideal fourths. Let ell = [(n/4) + (5/12)], that is, (n/4) + (5/12) rounded down to the nearest integer, and let

h = (n/4) + (5/12) - ell.

Then

q_hat_1 = (1 - h) X_(ell) + h X_(ell+1),   (4)

q_hat_2 = (1 - h) X_(ell') + h X_(ell'-1),   (5)

and IQR = q_hat_2 - q_hat_1.
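The following Python sketch pulls the pieces of this density estimate together. It is not the author's code; the normalization of the expected-frequency-curve pilot estimate and the choice ell' = n - ell + 1 for the ideal fourths are assumptions where the extracted text is incomplete, and the function names are mine.

import numpy as np

def ideal_fourths(x):
    """Quartiles via the ideal fourths (equations (4) and (5))."""
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    ell = int(np.floor(n / 4 + 5 / 12))        # ell = [(n/4) + (5/12)]
    h = n / 4 + 5 / 12 - ell                   # fractional part
    q1 = (1 - h) * x[ell - 1] + h * x[ell]     # (1 - h) X_(ell) + h X_(ell+1)
    ellp = n - ell + 1                         # assumption: ell' = n - ell + 1
    q2 = (1 - h) * x[ellp - 1] + h * x[ellp - 2]
    return q1, q2

def epanechnikov(t):
    return np.where(np.abs(t) < np.sqrt(5), 0.75 * (1 - t ** 2 / 5) / np.sqrt(5), 0.0)

def adaptive_kernel_density(x, grid, a=0.5, beta=0.8):
    """Adaptive kernel estimate built from an expected-frequency-curve pilot (a sketch)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    mad = np.median(np.abs(x - np.median(x)))
    madn = mad / 0.6745
    # pilot (expected frequency curve) evaluated at the observations
    counts = np.array([np.sum(np.abs(x - xi) <= beta * madn) for xi in x])
    pilot = counts / (2 * beta * madn * n)
    lam = (pilot / np.exp(np.mean(np.log(pilot)))) ** (-a)
    q1, q2 = ideal_fourths(x)
    A = min(x.std(ddof=1), (q2 - q1) / 1.34)
    h = 1.06 * A / n ** 0.2
    grid = np.asarray(grid, dtype=float)
    return np.array([np.mean(epanechnikov((g - x) / (h * lam)) / (h * lam)) for g in grid])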
An estimate of V_1 is

V_hat_1 = ((q - 1)^2 / n) sum_{i=1}^n I(X_i1 <= X_(m)1, X_i2 <= X_(m)2).

Estimates of V_2, V_3 and V_4 are obtained in a similar manner. The resulting estimate of the covariance between X_(m)1 and X_(m)2 is labeled sigma_hat^2_12. Of course, the squared standard error of X_(m)1 can be estimated in a similar fashion and is labeled sigma_hat^2_11.

An alternative approach is to use a bootstrap method, a possible appeal of which is that the usual sample median can be used when n is even. Generate a bootstrap sample by resampling with replacement n pairs of values from the X_ik, yielding X*_ik (i = 1, ..., n; k = 1, 2). For fixed k, let M*_k be the usual sample median based on the bootstrap sample and corresponding to the kth marginal distribution. Repeat this B times, yielding M*_bk, b = 1, ..., B. Then an estimate of the covariance between M_1 and M_2 is

tau_hat_12 = (1 / (B - 1)) sum_{b=1}^B (M*_b1 - Mbar_1)(M*_b2 - Mbar_2),

where

Mbar_k = sum_{b=1}^B M*_bk / B.

Now consider the more general case of a J by K design and suppose (1) is to be tested. Based on the results in the previous section, two test statistics are considered. The first estimates the population medians with a single order statistic, X_(m), and the second uses the usual sample median, M. Let X_ijk be a random sample of n_j vectors of observations from the jth group (i = 1, ..., n_j; j = 1, ..., J; k = 1, ..., K). Let theta_hat be the vector of estimated medians and let V_hat be the corresponding estimate of its covariance matrix, with elements v_hat_jk. The omnibus test statistic is

Q = theta_hat' C' (C V_hat C')^(-1) C theta_hat.   (6)

As is well known, the usual choices for C for main effects for Factor A, main effects for Factor B, and for interactions are C = C_J x j'_K, C = j'_J x C_K and C = C_J x C_K, respectively, where C_J is a J-1 by J matrix with rows (1, -1, 0, ..., 0), (0, 1, -1, 0, ..., 0), ..., (0, ..., 0, 1, -1), j'_J is a 1 by J matrix of ones, and x denotes the (right) Kronecker product.

There remains the problem of approximating the null distribution of Q. Based on results in Wilcox (2003, chapter 11) when comparing groups using a 20% trimmed mean, an obvious speculation is that Q has, approximately, an F distribution with nu_1 and nu_2 degrees of freedom. For main effects for Factor A, main effects for Factor B, and for interactions, nu_1 is equal to J-1, K-1 and (J-1)(K-1), respectively. As for nu_2, it is estimated based on the data, but an analog of this method for medians was not quite satisfactory in simulations; the actual probability of a Type I error was too far below the nominal level. A better approach was simply to take nu_2 = infinity, which will be assumed henceforth. This will be called method A. An alternative approach is to proceed exactly as in method A, only estimate the .5 quantiles with the usual sample median and replace V_hat with the bootstrap estimate described above. (Here, B = 100 is used.) This will be called method B.

An Approach Based on Linear Contrasts

Another approach to analyzing the two-way ANOVA design under consideration is to test hypotheses about a collection of linear contrasts appropriate for studying main effects and interactions. Consider, for example, main effects for Factor A; one could perform all pairwise comparisons among the

theta_bar_j = sum_k theta_jk.

That is, for every j < j', test

H_0: theta_bar_j = theta_bar_j'.

There is the problem of controlling the probability of at least one Type I error among the (J^2 - J)/2 hypotheses to be tested, and here this is done with a method derived by Rom (1990). Interactions can be studied by testing hypotheses about all of the relevant (J^2 - J)(K^2 - K)/4 tetrad differences, and of course, main effects for Factor B can be handled in a similar manner. For convenience, attention is focused on Factor A (the first factor). Here, theta_bar_j is simply estimated with

theta_bar_hat_j = sum_k theta_hat_jk.

Writing

theta_bar_j - theta_bar_j' = sum c_jk theta_jk

for appropriately chosen contrast coefficients c_jk, then of course an estimate of the squared standard error of theta_bar_hat_j - theta_bar_hat_j' is the corresponding quadratic form in the estimated variances and covariances, and the test statistic, T, is sum c_jk theta_hat_jk divided by the square root of this quantity. Based on results in Wilcox (2004), the null distribution of T is approximated with a Student's T distribution with n - 1 degrees of freedom.
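Before turning to Rom's method, the following is a minimal Python sketch of two estimation ingredients just described: the bootstrap covariance matrix of the marginal sample medians used by method B, and the omnibus statistic Q of equation (6). The function names and the use of NumPy are mine, not the author's; his own S-PLUS and R code (mentioned in the conclusion) is the authoritative implementation.

import numpy as np

rng = np.random.default_rng(0)

def boot_cov_medians(x, B=100):
    """Bootstrap covariance matrix of the K marginal sample medians; rows of x are the
    n dependent vectors, resampled with replacement; B = 100 as used for method B."""
    x = np.asarray(x, dtype=float)
    n, K = x.shape
    meds = np.empty((B, K))
    for b in range(B):
        idx = rng.integers(0, n, n)            # resample whole vectors (rows)
        meds[b] = np.median(x[idx], axis=0)
    return np.cov(meds, rowvar=False, ddof=1)  # divisor B - 1, as in the text

def omnibus_Q(theta_hat, V_hat, C):
    """Q = theta' C' (C V C')^{-1} C theta from equation (6)."""
    Ct = C @ theta_hat
    return float(Ct @ np.linalg.solve(C @ V_hat @ C.T, Ct))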
To elaborate on controlling the probability of at least one Type I error with Rom's method, and still focusing on Factor A, let D = (J^2 - J)/2 be the number of hypotheses to be tested and let P_1, ..., P_D be the corresponding p-values. Put the p-values in descending order, yielding P_[1] >= P_[2] >= ... >= P_[D]. Then proceed as follows:

1. Set k = 1.
2. If P_[k] <= d_k, where d_k is read from Table 1, stop and reject all D hypotheses; otherwise, go to step 3. (If k > 10, use d_k = alpha/k.)
3. Increment k by 1. If P_[k] <= d_k, stop and reject all hypotheses having a significance level less than or equal to d_k.
4. If P_[k] > d_k, repeat step 3.
5. Continue until a significant result is obtained or all D hypotheses have been tested.

Testing the collection of relevant linear contrasts in this manner will be called method C.

Table 1: Critical values d_k for Rom's method.

k     alpha = .05    alpha = .01
1     .05000         .01000
2     .02500         .00500
3     .01690         .00334
4     .01270         .00251
5     .01020         .00201
6     .00851         .00167
7     .00730         .00143
8     .00639         .00126
9     .00568         .00112
10    .00511         .00101

A Simulation Study

Simulations were used to study the small-sample properties of the methods just described. Vectors of observations were generated from multivariate normal distributions having a common correlation, rho. To study the effect of non-normality, observations were transformed to various g-and-h distributions (Hoaglin, 1985), which contain the standard normal distribution as a special case. If Z has a standard normal distribution, then

X = ((exp(gZ) - 1)/g) exp(hZ^2/2)   if g > 0,
  = Z exp(hZ^2/2)                   if g = 0,

has a g-and-h distribution, where g and h are parameters that determine the first four moments. The four distributions used here were the standard normal (g = h = 0.0), a symmetric heavy-tailed distribution (h = 0.5, g = 0.0), an asymmetric distribution with relatively light tails (h = 0.0, g = 0.5), and an asymmetric distribution with heavy tails (g = h = 0.5). Table 2 shows the skewness (kappa_1) and kurtosis (kappa_2) for each distribution considered. For h = .5, the third and fourth moments are not defined and so no values for the kurtosis are reported. Additional properties of the g-and-h distribution are summarized by Hoaglin (1985).

Table 2: Some properties of the g-and-h distribution.

g      h      kappa_1    kappa_2
0.0    0.0    0.00       3.0
0.0    0.5    0.00       --
0.5    0.0    1.81       8.9
0.5    0.5    --         --
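A short Python sketch of Rom's sequentially rejective procedure, following the steps and the Table 1 critical values above, may make the stepwise logic concrete. The function name is mine, and the sketch handles only alpha = .05 or .01 (the two columns tabled).

ROM_D = {0.05: [.05000, .02500, .01690, .01270, .01020, .00851, .00730, .00639, .00568, .00511],
         0.01: [.01000, .00500, .00334, .00251, .00201, .00167, .00143, .00126, .00112, .00101]}

def rom_reject(pvals, alpha=0.05):
    """Return a boolean array indicating which of the D hypotheses are rejected."""
    import numpy as np
    p = np.asarray(pvals, dtype=float)
    order = np.argsort(-p)                  # indices of p-values in descending order
    D = len(p)
    reject = np.zeros(D, dtype=bool)
    for k in range(1, D + 1):               # k = 1 corresponds to the largest p-value
        d_k = ROM_D[alpha][k - 1] if k <= 10 else alpha / k
        if p[order[k - 1]] <= d_k:
            reject = p <= d_k               # reject all hypotheses with p-value <= d_k
            break
    return reject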
Table 3: Estimated probability of a Type I error, J = K = 2, n = 20, alpha = .05.

              Method A          Method B          Method C
g     h       Factor A  Inter   Factor A  Inter   Factor A  Inter
0.0   0.0     .074      .068    .046      .050    .051      .052
0.0   0.0     .072      .073    .032      .036    .048      .048
0.0   0.5     .046      .045    .048      .053    .025      .027
0.0   0.5     .049      .036    .047      .038    .026      .027
0.5   0.0     .045      .053    .045      .044    .045      .049
0.5   0.0     .044      .024    .047      .029    .043      .048
0.5   0.5     .030      .038    .030      .038    .021      .020
0.5   0.5     .019      .027    .032      .015    .023      .024
Table 4: Estimated Type I error rates using Methods A and C, J = 2, K = 3, n = 20, alpha = .05.

Method A, Inter:       .043  .023  .026  .015  .039  .016  .023  .010
Method C, Factor A:    .059  .056  .026  .031  .053  .052  .024  .025
Method C, Factor B:    .044  .057  .018  .023  .040  .047  .015  .024
Method C, Inter:       .049  .047  .047  .019  .025  .045  .050  .019  .023
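For readers who want to replicate the kind of simulation design described above, the following is a minimal Python sketch that generates dependent vectors with a common correlation and transforms the marginals to a g-and-h distribution. The specific rho shown is illustrative only (the paper's full set of rho values is not recoverable from this copy), and the variable names are mine.

import numpy as np

def g_and_h(z, g=0.0, h=0.0):
    """Transform standard normal deviates to a g-and-h distribution:
    X = ((exp(gZ) - 1)/g) exp(hZ^2/2) for g > 0 and X = Z exp(hZ^2/2) for g = 0."""
    z = np.asarray(z, dtype=float)
    x = (np.exp(g * z) - 1.0) / g if g > 0 else z.copy()
    return x * np.exp(h * z ** 2 / 2.0)

rng = np.random.default_rng(1)
rho, n, JK = 0.5, 20, 4                       # rho = 0.5 is an illustrative value
cov = np.full((JK, JK), rho) + (1 - rho) * np.eye(JK)
Z = rng.multivariate_normal(np.zeros(JK), cov, size=n)
X = g_and_h(Z, g=0.5, h=0.0)                  # n dependent vectors with g-and-h marginals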
interactions, respectively. For method C, the estimates were .057, .042 and .068.

Conclusion

In summary, the bootstrap version of method A (method B) does not seem to have any practical value based on the criterion of controlling the probability of a Type I error. This is in contrast to the situations considered in Wilcox (2004), where pairwise multiple comparisons among J dependent groups were considered. A possible appeal of method B is that it uses the usual sample median when n is even rather than a single order statistic, but at the cost of risking actual Type I error probabilities well below the nominal level. Methods A, B and C perform well in terms of avoiding Type I error probabilities well above the nominal level, but methods A and B become too conservative in certain situations where method C continues to perform reasonably well. It seems that applied researchers rarely have interest in an omnibus hypothesis only; the goal is to know which levels of the factor differ. Because the linear contrasts can be tested in a manner that controls FWE, all indications are that method C is the best method for routine use. Finally, S-PLUS and R functions are available from the author for applying method C. Please ask for the function mwwmcp.

References

Bahadur, R. R. (1966). A note on quantiles in large samples. Annals of Mathematical Statistics, 37, 577-580.
Davies, P. L., & Kovac, A. (2004). Densities, spectral densities and modality. Annals of Statistics, 32, 1093-1136.
Dawson, M., Schell, A., Rissling, A., & Wilcox, R. R. (2004). Evaluative learning and awareness of stimulus contingencies. Unpublished technical report, Department of Psychology, University of Southern California.
Hoaglin, D. C. (1985). Summarizing shape numerically: The g-and-h distributions. In D. Hoaglin, F. Mosteller, & J. Tukey (Eds.), Exploring data tables, trends, and shapes (pp. 461-515). New York: Wiley.
Robey, R. R., & Barcikowski, R. S. (1992). Type I error and the number of iterations in Monte Carlo studies of robustness. British Journal of Mathematical and Statistical Psychology, 45, 283-288.
Rom, D. M. (1990). A sequentially rejective test procedure based on a modified Bonferroni inequality. Biometrika, 77, 663-666.
Silverman, B. W. (1986). Density estimation for statistics and data analysis. New York: Chapman and Hall.
Staudte, R. G., & Sheather, S. J. (1990). Robust estimation and testing. New York: Wiley.
Wilcox, R. R. (2003). Applying contemporary statistical techniques. San Diego, CA: Academic Press.
Wilcox, R. R. (2004). Pairwise comparisons of dependent groups based on medians. Computational Statistics & Data Analysis, submitted.
Wilcox, R. R. (2005). Introduction to robust estimation and hypothesis testing (2nd ed.). San Diego, CA: Academic Press.
Journal of Modern Applied Statistical Methods May, 2005, Vol. 4, No.1, 11-34
Regular Articles

Testing the Goodness of Fit of Multivariate Multiplicative-intercept Risk Models Based on Case-control Data
Biao Zhang
Department of Mathematics, The University of Toledo

The validity of the multivariate multiplicative-intercept risk model with I + 1 categories based on case-control data is tested. After reparametrization, the assumed risk model is equivalent to an (I + 1)-sample semiparametric model in which the I ratios of two unspecified density functions have known parametric forms. By identifying this (I + 1)-sample semiparametric model, which is of intrinsic interest in general (I + 1)-sample problems, with an (I + 1)-sample semiparametric selection bias model, we propose a weighted Kolmogorov-Smirnov-type statistic to test the validity of the multivariate multiplicative-intercept risk model. Some asymptotic results associated with the proposed test statistic are established, as is an optimal property for the maximum semiparametric likelihood estimator of the parameters in the (I + 1)-sample semiparametric selection bias model. In addition, a bootstrap procedure is proposed, along with some results on analysis of two real data sets.

Key words: Biased sampling problem, bootstrap, Kolmogorov-Smirnov two-sample statistic, logistic regression, mixture sampling, multivariate Gaussian process, semiparametric selection bias model, strong consistency, weak convergence
Biao Zhang is a Professor in the Department of Mathematics at the University of Toledo. His research interests include categorical data analysis and empirical likelihood. The author wishes to thank Xin Deng and Shuwen Wan for their help in the manuscript conversion process. Email: [email protected].
Introduction

Let Y be a multicategory response variable with I + 1 categories and X be the associated p x 1 covariate vector. When the possible values of the response variable Y are denoted by y = 0, 1, ..., I and the first category (0) is the baseline category, Hsieh, Manski, and McFadden (1985) introduced the following multivariate multiplicative-intercept risk model:

P(Y = i | X = x) / P(Y = 0 | X = x) = theta*_i r_i(x; beta_i),   i = 1, ..., I,   (1)

where theta*_1, ..., theta*_I are positive intercept parameters and r_1(x; beta_1), ..., r_I(x; beta_I) have known parametric forms. The class of multivariate multiplicative-intercept risk models includes the multivariate logistic regression models and the multivariate odds-linear models discussed by Weinberg and Sandler (1991) and Wacholder and Weinberg (1994). By generalizing earlier works of Anderson (1972, 1979), Farewell (1979), and Prentice and Pyke (1979) in the context of the logistic regression models, Weinberg and Wacholder (1993) and Scott and Wild (1997) showed that under model (1), a prospectively derived analysis, including parameter estimates and standard errors for beta_1, ..., beta_I, is asymptotically correct in case-control studies. In this article, testing the validity of model (1) based on case-control data as specified below is considered.

Let pi_i = P(Y = i) and g_i(x) = f(x | Y = i) be the conditional density or frequency function of X given Y = i for i = 0, 1, ..., I. If f(x) is the marginal distribution of X, then applying Bayes' rule yields

f(x | Y = i) = P(Y = i | X = x) f(x) / pi_i,   i = 0, 1, ..., I.

It is seen that

f(x | Y = i) / f(x | Y = 0) = (pi_0 / pi_i) P(Y = i | X = x) / P(Y = 0 | X = x).

Consequently,

g_i(x) = f(x | Y = i) = (pi_0 / pi_i) theta*_i r_i(x; beta_i) f(x | Y = 0) = exp[alpha_i + s_i(x; beta_i)] g_0(x),   i = 1, ..., I,

where alpha_i = log(pi_0 theta*_i / pi_i) and s_i(x; beta_i) = log r_i(x; beta_i). Under case-control sampling, X_i1, ..., X_in_i is a random sample from g_i for i = 0, 1, ..., I, so that

X_01, ..., X_0n_0   i.i.d.   g_0(x),
X_i1, ..., X_in_i   i.i.d.   g_i(x) = exp[alpha_i + s_i(x; beta_i)] g_0(x),   i = 1, ..., I,   (2)

with the I + 1 samples jointly independent. Note that model (2) is equivalent to an (I + 1)-sample semiparametric model in which the ith (i = 1, ..., I) ratio of a pair of unspecified density functions g_i and g_0 has a known parametric form, and thus is of intrinsic interest in general (I + 1)-sample problems. Model (2) is equivalent to model (1); it is an (I + 1)-sample semiparametric selection bias model with weight functions w_0(x, alpha, beta) = 1 and

w_i(x, alpha, beta) = exp[alpha_i + s_i(x; beta_i)],   i = 1, ..., I,

depending on the unknown parameters alpha and beta. The s-sample semiparametric selection bias model was proposed by Vardi (1985) and was further developed by Gilbert, Lele, and Vardi (1999). Vardi (1982, 1985), Gill, Vardi, and Wellner (1988), and Qin (1993) discussed estimating distribution functions in biased sampling models with known weight functions. Weinberg and Wacholder (1990) considered more flexible design and analysis of case-control studies with biased sampling. Qin and Zhang (1997) and Zhang (2002) considered goodness-of-fit tests for logistic regression models based on case-control data, whereas Zhang (2000) considered testing the validity of model (2) when I = 1. The focus in this article is to test the validity of model (2) for I >= 1.

Let {T_1, ..., T_n} denote the pooled sample {X_01, ..., X_0n_0; X_11, ..., X_1n_1; ...; X_I1, ..., X_In_I} with n = sum_{i=0}^I n_i. Throughout this article, let alpha = (alpha_1, ..., alpha_I)' and beta = (beta_1', ..., beta_I')', and let

G_tilde_i(t) = n_i^{-1} sum_{j=1}^{n_i} I[X_ij <= t],   i = 0, 1, ..., I,

be the empirical distribution functions based on the sample X_i1, ..., X_in_i from the ith category.
For a more complete survey of developments in empirical likelihood, see Hall and La Scala (1990) and Owen (1991). This article is structured as follows: in the method section, a test statistic is proposed by deriving the maximum semiparametric likelihood estimator of G_i under model (2). Some asymptotic results are then presented, along with an optimal property for the maximum semiparametric likelihood estimator of (alpha, beta). This is followed by a bootstrap procedure which allows one to find P-values of the proposed test. Also reported are some results on analysis of two real data problems. Finally, proofs of the main theoretical results are offered.

Methodology

Based on the observed data in (2), the likelihood function can be written as

L(alpha, beta, G_0) = prod_{i=0}^{I} prod_{j=1}^{n_i} exp[alpha_i + s_i(X_ij; beta_i)] dG_0(X_ij)
                    = [prod_{k=1}^{n} p_k] prod_{i=1}^{I} prod_{j=1}^{n_i} exp[alpha_i + s_i(X_ij; beta_i)],

where p_k = dG_0(T_k) for k = 1, ..., n and, by convention, alpha_0 = 0 and s_0(.; beta_0) = 0. The distribution function G_0 places mass p_k >= 0 on T_k with total mass unity, sum_{k=1}^n p_k = 1, subject to

sum_{k=1}^{n} p_k {exp[alpha_i + s_i(T_k; beta_i)] - 1} = 0,   i = 1, ..., I.

Similar to the approach of Owen (1988, 1990) and Qin and Lawless (1994), it can be shown by using the method of Lagrange multipliers that for fixed (alpha, beta), the maximum is attained at

p_k = (1/n_0) . 1 / (1 + sum_{i=1}^{I} rho_i exp[alpha_i + s_i(T_k; beta_i)]),   k = 1, ..., n,

where rho_i = n_i / n_0. Therefore, the profile log-likelihood is

l(alpha, beta) = - sum_{k=1}^{n} log{1 + sum_{i=1}^{I} rho_i exp[alpha_i + s_i(T_k; beta_i)]} + sum_{i=1}^{I} sum_{j=1}^{n_i} [alpha_i + s_i(X_ij; beta_i)] - n log n_0.

Next, maximize l(alpha, beta) over (alpha, beta). Let (alpha_hat, beta_hat), with alpha_hat = (alpha_hat_1, ..., alpha_hat_I)' and beta_hat = (beta_hat_1', ..., beta_hat_I')', be the maximizer; it satisfies the system of score equations

d l(alpha, beta) / d alpha_u = n_u - sum_{k=1}^{n} rho_u exp[alpha_u + s_u(T_k; beta_u)] / (1 + sum_{m=1}^{I} rho_m exp[alpha_m + s_m(T_k; beta_m)]) = 0,   u = 1, ..., I,

d l(alpha, beta) / d beta_u = sum_{j=1}^{n_u} d_u(X_uj; beta_u) - sum_{k=1}^{n} rho_u exp[alpha_u + s_u(T_k; beta_u)] d_u(T_k; beta_u) / (1 + sum_{m=1}^{I} rho_m exp[alpha_m + s_m(T_k; beta_m)]) = 0,   u = 1, ..., I,   (3)

where d_u(t; beta_u) = d s_u(t; beta_u) / d beta_u for u = 1, ..., I. Write p_hat_k for p_k evaluated at (alpha_hat, beta_hat):

p_hat_k = (1/n_0) . 1 / (1 + sum_{i=1}^{I} rho_i exp[alpha_hat_i + s_i(T_k; beta_hat_i)]),   k = 1, ..., n.   (4)

On the basis of the p_hat_k in (4), it can be proposed to estimate G_i(t), under model (2), by

G_hat_i(t) = sum_{k=1}^{n} p_hat_k exp[alpha_hat_i + s_i(T_k; beta_hat_i)] I[T_k <= t]
           = (1/n_0) sum_{k=1}^{n} exp[alpha_hat_i + s_i(T_k; beta_hat_i)] / (1 + sum_{m=1}^{I} rho_m exp[alpha_hat_m + s_m(T_k; beta_hat_m)]) I[T_k <= t],   i = 0, 1, ..., I.   (5)

Note that G_hat_i is the maximum semiparametric likelihood estimator of G_i under model (2) for i = 0, 1, ..., I, and recall that

G_tilde_i(t) = n_i^{-1} sum_{j=1}^{n_i} I[X_ij <= t]

is the empirical distribution function based on the sample X_i1, ..., X_in_i from the ith (i = 0, 1, ..., I) category. Moreover, let

Delta_ni(t) = sqrt(n) (G_hat_i(t) - G_tilde_i(t))   and   Delta_ni = sup_t |Delta_ni(t)|,   i = 0, 1, ..., I.

Then Delta_ni is the discrepancy between the two estimators G_hat_i(t) and G_tilde_i(t), and thus measures the departure from the assumption of the multivariate multiplicative-intercept risk model (1) within the ith (i = 1, ..., I) pair of category i and the baseline category (0). Since

sum_{i=0}^{I} rho_i [G_hat_i(t) - G_tilde_i(t)] = 0,

the I + 1 discrepancies are not functionally independent, and it is proposed to employ the weighted average

Delta_n = (1/(I + 1)) sum_{i=0}^{I} rho_i Delta_ni   (6)

to assess the validity of model (2). Clearly, the proposed test statistic Delta_n measures the global departure from the assumption of the multivariate multiplicative-intercept risk model (1). Because the same value of Delta_n occurs no matter which category is the baseline category, there is a symmetry among the I + 1 category designations for such a global test. Thus, the choice of the baseline category in model (1) is arbitrary for testing the validity of model (1) or model (2) based on Delta_n. Note that the test statistic Delta_n reduces to that of Zhang (2000) when I = 1, which concerns the equality of G_0 and G_1; this fact, along with the fact that G_tilde_0 and G_hat_0 are, respectively, the nonparametric maximum likelihood estimators of G_0 without and with the assumption of model (2), motivates the use of a weighted average of the I + 1 discrepancies between G_hat_i and G_tilde_i (i = 0, 1, ..., I), where G_hat_i is derived by employing the empirical likelihood method developed by Owen (1988, 1990).

Remark 1: The test statistic Delta_n can also be applied to mixture sampling data in which a sample of n = sum_{i=0}^{I} n_i members is randomly selected from the whole population with n_0, n_1, ..., n_I being random (Day & Kerridge, 1967). Let (X_k, Y_k), k = 1, ..., n, be a random sample from the joint distribution of (X, Y); then the likelihood has the form of
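To make the estimation and test construction above concrete, the following is a minimal Python sketch for the special case s_i(x; beta_i) = x beta_i with a scalar covariate (p = 1), i.e., the multivariate logistic regression setting used in the examples. The function name, the BFGS optimizer, and the evaluation of the supremum only at the pooled sample points are my choices, not the author's implementation; continuous data (no ties) are assumed.

import numpy as np
from scipy.optimize import minimize

def fit_and_test(samples):
    """samples is a list of I + 1 one-dimensional arrays X_0, X_1, ..., X_I
    (category 0 is the baseline). Returns (alpha_hat, beta_hat, Delta_n)."""
    n_i = np.array([len(s) for s in samples])
    I = len(samples) - 1
    n0, n = n_i[0], n_i.sum()
    rho = n_i[1:] / n0                                   # rho_i = n_i / n_0, i = 1, ..., I
    T = np.concatenate(samples)                          # pooled sample T_1, ..., T_n

    def neg_profile_loglik(par):
        alpha, beta = par[:I], par[I:]
        w = np.exp(alpha[None, :] + np.outer(T, beta))   # w_i(T_k) = exp(alpha_i + T_k beta_i)
        ll = -np.sum(np.log(1.0 + w @ rho))              # -sum_k log(1 + sum_i rho_i w_i(T_k))
        for i in range(1, I + 1):
            ll += np.sum(alpha[i - 1] + samples[i] * beta[i - 1])
        return -ll

    res = minimize(neg_profile_loglik, np.zeros(2 * I), method="BFGS")
    alpha, beta = res.x[:I], res.x[I:]

    # p_k from (4) and fitted distribution functions from (5), baseline weight w_0 = 1
    w = np.exp(alpha[None, :] + np.outer(T, beta))
    p = 1.0 / (n0 * (1.0 + w @ rho))
    w_all = np.column_stack([np.ones(n), w])
    order = np.argsort(T)

    # Delta_ni = sup_t |sqrt(n) (G_hat_i(t) - G_tilde_i(t))|, evaluated at the pooled points
    Delta = np.empty(I + 1)
    for i in range(I + 1):
        G_hat = np.cumsum((p * w_all[:, i])[order])
        G_tilde = np.searchsorted(np.sort(samples[i]), T[order], side="right") / n_i[i]
        Delta[i] = np.sqrt(n) * np.max(np.abs(G_hat - G_tilde))

    rho_all = n_i / n0                                   # rho_0 = 1 included
    Delta_n = np.sum(rho_all * Delta) / (I + 1)          # test statistic (6)
    return alpha, beta, Delta_n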
Let
with
i =0
j =1
system of score equations in (7). Then comparing (7) with (3) implies that
estimates of are identical under the retrospective sampling scheme and the prospective sampling scheme. In addition, the two estimated asymptotic variance-covariance matrices for and based on the observed information matrices coincide. See also Remarks 3 and 4 below. Asymptotic results In this section, the asymptotic properties in (5) and the proposed test statistic n in (6) are studied. To this end, let ( (0) , (0) ) be the true value of ( , ) under model (2) with
where * = (1* , , I* ) . Suppose that the sample data in model (2) are collected prospectively, then the (prospective) likelihood function is, by (1),
L ( * , ) = P (Y = i | X = X ij ) =
i =0 j =1 i = 0 j =1
k =1
m =1
* u
k =1
1+
* exp[ sm (Tk ; m )] m =1 m
n k =1
1+
I * m =1 m
exp[ sm (Tk ; m )]
du (Tk ; u ) = 0, (7)
u = 1,
,I.
u* exp[su (Tk ; u )]
( * , ) = u
nu j =1
= 0,
u = 1,
, I,
d i (t ; i ) =
du ( X uj ; u )
Di (t; i ) = i = 1,
nu
exp[su (Tk ; u )]
* u I
Write =
I i =0
i and
si (t ; i ) , i d i (t; i ) 2 si (t; i ) = i i i
, I,
( * , ) u*
Throughout this article, it is assumed that i = ni / n0 (i = 0,1, , I ) is positive and finite and remains fixed as n =
I i =0
log 1 +
* m exp[ sm (Tk ; m )] .
and
(0) = ( 10 ,
( , ) =
*
nj
[log + si ( X ij ; i )]
* i
i = 0 j =1 I
1+
I * m =1 m m
r ( X ij ; m )
(0) = (10 ,
, p 0 )
, p 0 ) .
ni .
ni
ni
i*ri ( X ij ; i )
where i = P(Y = i ) for i = 1, , I . The first expression is a prospective decomposition and the second one is a retrospective decomposition. Remark 2: In light of Anderson (1972, 1979), the case-control data may be treated as the prospective data to compute the maximum likelihood estimate of ( * , ) under model (1),
= [ i f ( X ij | Y = i)],
ni
= ( 1 ,
( * , )
* = (1* , , I* )
and
for
, I)
v=0,v u
,I ,
u0
v = 0,v u
,I ,
S=
S21 S22
Buv (t ) =
(A3) For i = 1,
v exp [ v 0 + s ( y; vo )] ,
u v = 0,1,
, I,
q1 j
such that
k = 1,
= S 1 (1 + )
D+J 0
0 , 0
uu s22 =
I uv s22 ,
u = 0,1, , I ,
S11 S21
the true parameter point (0) such that for all t the function ri (t ; i ) (i = 1,
d u ( y , u ) d ( y , )]dG0 ( y )
u v = 0,1, , I ,
matrix having elements {11 , , I 1} on the main diagonal. In order to formulate the results, the following assumptions are stated. (A1) There exists a neighborhood 0 of
uv s22 = 1+1
k = 1, 2, h = 0,1,
u = 0,1,
s =
uu 21
s , , I,
uv 21
Akh (t ) =
t
S21 = ( s )
uv 21 u ,v =1, , I ,
m exp[ m0 + sm ( y; m 0 )]
,I,
dG0 ( y ), (8)
, I ) admits all
u v = 0,1,
,I
exp
+ s ( y ; ) exp + s ( y ; ) u uo v v0 vo d ( y , ) dG ( y ) u u 0 I 1+ i =1 exp[ + s ( y ; )] i i0 i i0
v = 0,v u
s = 1+
uv 21 1
buu (t ) =
uu s11 =
I uv s11 ,
u = 0,1, , I ,
v exp [v 0 + s ( y; vo )] du (t; u ),
u v = 0,1, , I, buv (t; , ),
u = 0,1, , I ,
u v = 0,1,
, I,
v =0,v u
uv s11 = 1+1
auu (t ) =
2 si (t; i ) Q2 (t ) for all 0 ik il and k , l = 1, , p, where
such that
where
( (0) , (0) ) ( , ) = ( , ) = (
(0) , (0) )
and result,
q2 j =
j = 1, 2.
(0)
q3 =
where S is obtained from S with ( (0) , (0) ) replaced by ( , ) and G0 replaced by G0 . Remark 4: Because S 1 is the prospectively derived asymptotic variancecovariance matrix of ( * , ) on the basis of the prospective likelihood function given in Remark 2, it is seen from the expression for the asymptotic variance-covariance matrix of
the
estimate ( , ) defined in (3). Theorem 8 concerns the strong consistency and the asymptotic distribution of ( , ) Theorem 1: Suppose that model (2) and Assumptions (A1) (A4) hold. Suppose further
that S is positive definite. (a) As n , with probability 1 there exists a sequence ( , ) of roots of the system of score equations (2.1) such that ( , ) is strongly consistent
a. s .
for
estimating
( , ) ( (0) , (0) ) .
(b) As n , it may be written
(0) (0)
1 = S 1 n
+o p (n 1/ 2 ),
(9)
method derived by employing the following of moments . Motivated by the work of Gill,
( , ) with respect to
= S (1 + )
(A4) For i = 1,
D+J 0
(0)
( (0) , (0) ) ( , ) = ( , ) = (
(0) , (0) )
.As
N ( p +1) I (0, ).
(10)
I i =0
i exp[ i + si ( y; i )]
j =1
taken
for
for fixed ( , ) by
I i =0 n
i exp[ i + si ( y; i )]
I
be the empirical distribution function based on the sample X i1 , , X in from the i th response category. Let i (t ; , ) be a real function from
(0)
defined in (18) of the proof section. Moreover, the maximum semiparametric likelihood estimator ( , ) is optimal in the sense that
= ni i (t ; , )d G i (t ) = ni i (t ; , )d G i (t ) = EGi [ni i (T ; , )]
In the following case, considered, although the results can be naturally generalized to the case of p > 1 . The weak convergence of
, I} . p = 1 is
n (G 0 G 0 ,
, G I G I ) is
EGi [ni i (T ; , )]
measurable functions { i (t ; , ) : i = 1,
under G i for i = 1,
,I :
where
= V 1 B (V ) 1 with V and B
R p to R p +1 for i = 1, , I and let (t ; , ) = ( 1 (t ; , ), , I (t; , )) . Then, for a particular choice of (t ; , ) , ( , ) can be estimated by matching the
(0)
N ( p +1) I (0,
for i = 0,1,
1 n0
exp[i + si (Tk ; i )]
k =1
exp[i + si (Tk ; i )] i =0 i
, I . Let G i (t ) = ni 1
I[Tk t ]
ni
I j =1 [ X ij t ]
),
exp[ i + si ( y; i )]
theorem demonstrates
I[ y t ]dFn ( y )
that the choice of i (t; , ) = (1, di (t; i )) for i = 1, , I is optimal in the sense that the difference between the asymptotic variance-covariance matrices of
G i (t ) =
n n0
solution to the system of equations in (11). Note that ( , ) depends on the choice of
= ( 1 ,
, I ) and = ( 1 ,
empirical distribution function of the pooled sample {T1 , , Tn } . Then Gi can be estimated
Let
i = 0,1,
, I.
Fn (t ) =
1 n
i =1 [Ti t ]
be
the
It is easy to see that the above system of equations reduces to the system of score equations in (3) if i (t; , ) = (1, di (t; i )) is
i = 1,
, I . Let
exp[ i + si ( y; i )]
I[ y t ]dF ( y ),
k =1
n Gi (t ) = n0
Gi
for i = 1, , I . In other words, ( , ) can be estimated by seeking a root to the following system of equations:
Li ( , ) =
n I
ni
i ( X ij , , ) = 0i = 1, , I .
i (Tk , , )
(11)
( , )
with
, I ) be a
now established to a multivariate Gaussian process by representing G i G i (i = 0,1, , I ) as the mean of a sequence of independent and identically distributed stochastic processes with a remainder term of order o p ( n 1/ 2 ) . Theorem 3: Suppose that model (2) and Assumptions (A1) (A4) hold. Suppose further
(13)
and
t
the
remainder
term
Rin (t ) satisfies
(14)
i {sup t | Wi (t ) |} w ) = . According to Theorem 3 and the continuous Mapping Theorem (Billingsley, 1968, p. 30):
n
sup Rin (t ) = o p (n 1/ 2 ).
lim P(n w1 )
n
G1 G1 GI GI
WI
in
D I +1[, ]
(15)
Thus, the proposed goodness of fit test procedure has the following decision rule: reject model (2) at level if n > w1 . In order for this proposed test procedure to be useful in practice, the distribution of and the (1 ) -quantile w1
(W0 , W1 ,
, WI ) is a multivariate Gaussian
1 I +1
I i =0
G0 G0
W0 W1
=P
As a result,
= lim P
satisfies P( I 11 +
1 H (t ) = ( A (t ), A (t )) S 1 2i 2i n i 1i
1 n0
n k =1
exp[ i 0 + si (Tk ; i 0 )]
1+
exp[ m 0 + sm (Tk ; m 0 )] m =1 m
I[Tk t ] ,
Theorem 3 forms the basis for testing the validity of model (2) on the basis of the test statistic n in (6). Let w denote the quantile of the distribution of
1 I +1 I i =0
i {supt | Wi (t ) |},
I i =0
1 I i { sup n | Gi (t ) Gi (t ) |} w1 I +1 i =0 t
H1i (t )
,I ,
EWi ( s )W j (t ) = 1+
Bij ( s )
i j
( A1i ( s ), A2i ( s )) S 1
i j ,I. (16)
A (t ) 1j , A (t ) 2j
i j = 0,1,
i.e.,
EWi (t ) = 0,
, I,
Unfortunately, no analytic expressions appear to be available for the distribution function of (1/(I + 1)) sum_{i=0}^{I} rho_i sup_t |W_i(t)| or the quantile function thereof. A way out is to employ a bootstrap procedure as described in the next section.

A Bootstrap Procedure

In this section is presented a bootstrap procedure which can be employed to approximate the quantile w_{1-alpha} defined at the end of the last section. If model (1) is valid, since theta* = (theta*_1, ..., theta*_I) is not estimable in general on the basis of the case-control data T_1, ..., T_n, bootstrap data can only be generated, respectively, from G_hat_0, G_hat_1, ..., G_hat_I, where G_hat_i (i = 0, 1, ..., I) is given by (5). Specifically, let X*_i1, ..., X*_in_i be generated from G_hat_i for i = 0, 1, ..., I, where {(X*_i1, ..., X*_in_i): i = 0, 1, ..., I} are jointly independent. Let {T*_1, ..., T*_n} denote the combined bootstrap sample {X*_01, ..., X*_0n_0; X*_11, ..., X*_1n_1; ...; X*_I1, ..., X*_In_I}, and let (alpha_hat*, beta_hat*), with alpha_hat* = (alpha_hat*_1, ..., alpha_hat*_I)' and beta_hat* = (beta_hat*_1', ..., beta_hat*_I')', denote the maximum semiparametric likelihood estimator of (alpha, beta) based on the bootstrap sample. Let

p_hat*_k = (1/n_0) . 1 / (1 + sum_{i=1}^{I} rho_i exp[alpha_hat*_i + s_i(T*_k; beta_hat*_i)]),   k = 1, ..., n,

G_hat*_i(t) = (1/n_0) sum_{k=1}^{n} exp[alpha_hat*_i + s_i(T*_k; beta_hat*_i)] / (1 + sum_{m=1}^{I} rho_m exp[alpha_hat*_m + s_m(T*_k; beta_hat*_m)]) I[T*_k <= t],

and

G_tilde*_i(t) = (1/n_i) sum_{j=1}^{n_i} I[X*_ij <= t]

for i = 0, 1, ..., I, where alpha_hat*_0 = 0 and s_0(.; beta_0) = 0. Then the corresponding bootstrap version of the test statistic Delta_n in (6) is given by

Delta*_n = (1/(I + 1)) sum_{i=0}^{I} rho_i Delta*_ni,   with   Delta*_ni = sup_t |Delta*_ni(t)|   and   Delta*_ni(t) = sqrt(n) (G_hat*_i(t) - G_tilde*_i(t)),   i = 0, 1, ..., I.

To see the validity of the proposed bootstrap procedure, the proofs of Theorems 1 and 3 can be mimicked with slight modification to show the following theorem. The details are omitted here.

Theorem 4: Suppose that model (2) and Assumptions (A1)-(A4) hold, and suppose further that S is positive definite. (a) Along almost all sample sequences T_1, T_2, ..., given (T_1, ..., T_n), as n -> infinity,

sqrt(n) ((alpha_hat*, beta_hat*) - (alpha_hat, beta_hat)) ->_d N_{(p+1)I}(0, Sigma).

(b) Along almost all sample sequences T_1, T_2, ..., given (T_1, ..., T_n), as n -> infinity,

sqrt(n) (G_hat*_0 - G_hat_0, G_hat*_1 - G_hat_1, ..., G_hat*_I - G_hat_I)  converges weakly to  (W_0, W_1, ..., W_I)  in D^{I+1}[-infinity, infinity],

where (W_0, W_1, ..., W_I) is the multivariate Gaussian process defined in Theorem 3.
It follows that Delta*_n = (1/(I + 1)) sum_{i=0}^{I} rho_i Delta*_ni has the same limiting behavior as does Delta_n = (1/(I + 1)) sum_{i=0}^{I} rho_i Delta_ni. Thus, the bootstrap distribution of Delta*_n can be used to approximate the null distribution of Delta_n and, in particular, the P-value of the proposed test.

Example 1: Agresti (1990) analyzed, by employing the continuation-ratio logit model, the relationship between the concentration level of an industrial solvent and the outcome for pregnant mice in a developmental toxicity study. The complete data set is listed on page 320 of his book. Let X denote concentration level (in mg/kg per day) and Y represent pregnancy outcome, in which Y = 0, 1, and 2 stand for three possible outcomes: Normal, Malformation, and Non-live. Here this data set is analyzed on the basis of the multivariate logistic regression model. Because the sample data (X_i, Y_i), i = 1, ..., n, can be thought of as being drawn independently and identically from the joint distribution of (X, Y), Remark 1 implies that the test statistic Delta_n in (6) can be used to test the validity of the multivariate logistic regression model. Under model (2), (alpha_hat_1, beta_hat_1, alpha_hat_2, beta_hat_2) = (3.33834, 0.01401, 2.52553, 0.01191) and Delta_n = 0.49439, with the observed P-value equal to 0 based on 1000 bootstrap replications of Delta*_n. Because n_0 = 1000, n_1 = 199, and n_2 = 236, lambda*_1 = log theta*_1 and lambda*_2 = log theta*_2 can be estimated by alpha_hat_1 + log(n_1/n_0) and alpha_hat_2 + log(n_2/n_0), respectively. Figure 1 displays the curves of G_hat_0 and G_tilde_0 (left panel), the curves of G_hat_1 and G_tilde_1 (middle panel), and the curves of G_hat_2 and G_tilde_2 (right panel) based on this data set. The middle and right panels indicate strong evidence of the lack of fit of the multivariate logistic regression model to these data within the categories for Malformation and Non-live.
Figure 1. Example 1: Developmental toxicity study with pregnant mice. Left panel: estimated cumulative distribution functions G_hat_0 (solid curve) and G_tilde_0 (dashed curve). Middle panel: estimated cumulative distribution functions G_hat_1 (solid curve) and G_tilde_1 (dashed curve). Right panel: estimated cumulative distribution functions G_hat_2 (solid curve) and G_tilde_2 (dashed curve).
Example 2: Table 9.12 in Agresti (1990, p. 339) contains data for the 63 alligators caught in Lake George. Here the relationship between the alligator length and the primary food choice of alligators is analyzed by employing the multivariate logistic regression model. Let X denote length of alligator (in meters) and Y represent primary food choice, in which Y = 0, 1, and 2 stand for three categories: Other, Fish, and Invertebrate. Since the sample data (X_i, Y_i), i = 1, ..., 63, can be thought of as being drawn independently and identically from the joint distribution of (X, Y), Remark 1 implies that the test statistic Delta_n in (6) can be used to test the validity of the multivariate logistic regression model. For the male data, we find (alpha_hat_1, beta_hat_1, alpha_hat_2, beta_hat_2) = (..., 2.60093) and Delta_n = 1.33460 with the observed P-value identical to 0.389 based on 1000 bootstrap replications of Delta*_n. For the female data, we find (alpha_hat_1, beta_hat_1, alpha_hat_2, beta_hat_2) = (5.58723, 2.57174, 2.70962, 1.50304) and Delta_n = 1.63346 with the observed P-value equal to 0.249 based on 1000 bootstrap replications of Delta*_n. For the combined male and female data, (alpha_hat_1, beta_hat_1, alpha_hat_2, beta_hat_2) = (..., 2.38837) and Delta_n = 1.73676 is found with the observed P-value identical to 0.225 based on 1000 bootstrap replications of Delta*_n, indicating that we can ignore the gender effect on primary food choice. Because n_0 = 10, n_1 = 33, and n_2 = 20, lambda*_1 = log theta*_1 and lambda*_2 = log theta*_2 can be estimated by -0.19542 + log(33/10) = 0.99850 and 4.48780 + log(20/10) = 5.18094, respectively. Figures 2-4 display the curves of G_hat_0 and G_tilde_0 (left panel), the curves of G_hat_1 and G_tilde_1 (middle panel), and the curves of G_hat_2 and G_tilde_2 (right panel) based, respectively, on the male, female, and combined data sets. For the combined data, the curve of G_hat_1 (G_hat_2) bears a resemblance to that of G_tilde_1 (G_tilde_2), whereas the dissimilarity between the curves of G_hat_0 and G_tilde_0 indicates some evidence of lack of fit of the multivariate logistic regression model to these data within the baseline category for Other.

Proofs

First presented are four lemmas, which will be used in the proof of the main results. The proofs of Lemmas 1, 2, and 3 are lengthy yet straightforward and are therefore omitted here. Throughout this section, the norm of an m1 x m2 matrix
A = (a_ij)_{m1 x m2} is defined by ||A|| = (sum_{i=1}^{m1} sum_{j=1}^{m2} a_ij^2)^{1/2}, m1, m2 >= 1. Furthermore, in addition to the notation in (8), some further notation is introduced. Write Q_i11 = (s_11^{1i}, ..., s_11^{Ii})', Q_i21 = ((s_21^{1i})', ..., (s_21^{Ii})')', and Q_i = (Q_i11', Q_i21')' for i = 0, 1, ..., I, and let S_n11 = -n^{-1} d^2 l(alpha^(0), beta^(0)) / d alpha d alpha' and S_n21 = -n^{-1} d^2 l(alpha^(0), beta^(0)) / d beta d alpha'.
Figure 2. Example 2: Primary food choice for 39 male Florida alligators. Left panel: estimated cumulative distribution functions G_hat_0 (solid curve) and G_tilde_0 (dashed curve). Middle panel: estimated cumulative distribution functions G_hat_1 (solid curve) and G_tilde_1 (dashed curve). Right panel: estimated cumulative distribution functions G_hat_2 (solid curve) and G_tilde_2 (dashed curve).
Figure 3. Example 2: Primary food choice for 24 female Florida alligators. Left panel: estimated cumulative distribution functions G_hat_0 (solid curve) and G_tilde_0 (dashed curve). Middle panel: estimated cumulative distribution functions G_hat_1 (solid curve) and G_tilde_1 (dashed curve). Right panel: estimated cumulative distribution functions G_hat_2 (solid curve) and G_tilde_2 (dashed curve).
Figure 4. Example 2: Primary food choice for 63 male and female Florida alligators. Left panel: estimated cumulative distribution functions G_hat_0 (solid curve) and G_tilde_0 (dashed curve). Middle panel: estimated cumulative distribution functions G_hat_1 (solid curve) and G_tilde_1 (dashed curve). Right panel: estimated cumulative distribution functions G_hat_2 (solid curve) and G_tilde_2 (dashed curve).
H3i (t ) =
n
Lemma 3: Suppose that model (2) holds definite. For
and
is positive S s t , we have
k =1
{1 +
, , I.
= 1+
2 i I
( A1i (t ), A2 i (t )) S 1
A2i ( s )
2 i
k = 0, k i
Bik ( s ) Bik (t )
i j
I k =0
( A1i ( s ), A2 i (s )) S 1
A2 j (t )
1+
i j
i2
1+
[Gi (s ) Bii (s )]
+
I k = 0, k i
1+
2 j
Bij ( s )G j (t ) + , I.
2 i
1+
3 i
i = 0, 1,
i j
i j
1+
2 j
Bij ( s )G j (t ) +
1+
i2 j
i j = 0, 1,
, I.
1+
Bik ( s ) Bik (t )
i j = 0, 1,
Lemma 4: Suppose that model (2) and Assumption (A2) hold. If S is positive definite and G0 is continuous, then the stochastic process
Bij ( s )
1+
I k =0
Bik ( s ) B jk (t )
Gi ( s ) Bij (t ),
i = 0, 1,
algebra that
1+
Bik ( s ) B jk (t )
1+
i2 j
Gi ( s ) Bij (t ),
n=
1+
ni
for
D+J 0 = . S BS = S (1 + ) 0 0 Lemma 2: Suppose that model (2) holds and S is positive definite. For s t ,
1 1 1
I 1+ 1 Var Qi Qi , =S ( ( 0 ) , ( 0 ) ) n i=0 i
( ( 0 ) , ( 0 ) )
1+
i3
i = 0, 1,
Lemma 1: Suppose that model (2) holds and S is positive definite. Let J be an I I matrix of 1 elements and let 1 1 D = Diag( 1 , , I ) denote the I I diagonal matrix having elements 1 1 {1 , , I } on the main diagonal, then
nH 2i (t ))
A1i (s )
nH 2i (t ))
A1 j (t )
n [ H1i (t ) Gi (t ) H 2i (t )]
i
1
k =0, k i
i
1 ni
ni U ii (t ) nH 2i (t ),
(17)
measure
of
where
U ii (t ) =
, I , where x is the measure with mass one at x . Then, it can be shown that k = 0, 1,
nk ( Pnk PX k 1 )( I ( ,t ] fik )
U ik (t ) =
nk j =1
i exp[i 0 + si ( X kj ; i 0 )] k I[ X
I m =1
1+
k = 0, 1,
fii ( y ) =
1+
fik ( y ) =
k i = 0, 1,
ni
j =1
1+
i I[ X
ij t ]
= nk U ik (t ),
1 nk
kj t ]
As a result, there exist I + 1 zero-mean Gaussian processes Vi 0 , Vi1 , , ViI such that
nk U ik Vik i, k = 0,1,
Thus, the
m exp[ m 0 + sm ( X kj ; m 0 )]
D[, ] for i = 0, 1, , I . These results, along with (17), imply that the stochastic process
{ n [ H 1i (t ) Gi (t ) H 2i (t )], t } is tight in D[, ] for i = 0, 1, , I . The
proof is complete.

Proof of Theorem 1: For part (a), let
m = 0,m i I m =1
m exp[ m 0 + sm ( y; m 0 )]
m exp[ m 0 + sm ( y; m 0 )]
B = {( , ) :|| ( 0) || 2 + || ( 0) || 2 2 }
be the ball with center at the true parameter point ( ( 0 ) , ( 0 ) ) and radius for some > 0 . For small , it can be shown that we can expand n 1 ( , ) on the surface of B about
, I.
( ( 0 ) , ( 0 ) ) to find
j =1
1+
nk U ik (t )
, I.
kj
1+
Let Pnk =
1 nk
nk
X be the empirical
, X knk
for
X k1 ,
i, k = 0,1,
, I.
on D[, ], . ,I
stochastic process
1 1 ( , ) ( ( 0 ) , ( 0 ) ) = Wn1 + Wn 2 + Wn 3 , n n
where
( ( 0 ) , ( 0) )
1 Wn1 = ( ( 0 ) , ( 0 ) ) n ( ( 0 ) , ( 0) ) ,
(0) 1 Wn 2 = ( (0 ) , (0 ) ) S n , (0) 2
and Wn 3 satisfies | Wn 3 | c3 3 for some constant c3 > 0 and sufficiently large n with ( (0) , (0) ) a .s. probability 1. Because 1 0 and n 1 ( (0) , (0) ) a .s. 0 by the strong law of large n numbers, it follows that for any given > 0 , with probability 1 | Wn1 | 2 3 for sufficiently large n . Furthermore, because
a. s .
inf
x 0
x Sx 1 , x x 2
2
where 1 > 0 is the smallest eigenvalue of S . As a result, Wn 2 < c 2 2 for sufficiently large
<
( , ) has
a local maximum in the interior of B . Because at a local maximum the score equations (3) must be satisfied it follows that for any sufficiently small > 0 and sufficiently large n , with probability 1, the system of score equations (3) has a solution ( , ) within B . Because
~ ~
~ ~
( ( 0) , ( 0) ) . For part (b), since ( , ) is strongly consistent by part (a), expanding and
( (0) , (0) )
~ ~
( (0) , (0) )
at ( ( 0 ) , ( 0 ) ) gives
0= = +
( , ) ( (0) , (0) )
( , )
2 ( (0) , (0) )
( (0) )
2 ( (0) , (0) )
( (0) ) + o p ( n ),
0= = +
( (0) , (0) )
1
2
2 ( (0) , (0) )
( (0) )
> 0 for
2 ( (0) , (0) )
nS n ~
( 0)
(0)
= S 1 B1/ 2
( (0) , (0) )
and
( (0) , (0) )
has mean 0, it follows from the multivariate central limit theorem that
1 1 S ( (0) , (0) ) n
N ( p +1) I ( 0,
).
i = 0,1,
ij v12 =
1 1+ i exp[i 0 + si (t ; i 0 )] j exp[ j 0 + s j (t ; j 0 )] 1+
I m =1
( (0) , (0) )
ii v11 =
I ij v11 ,
1 = S 1 + o p (n 1 / 2 ), ( (0) , (0) ) n
( (0) , (0) )
1 1+ i exp[i 0 + si (t ; i 0 )] j exp[ j 0 + s j (t ; j 0 )] 1+
exp[ m + sm (t ; m 0 )] m =1 m
, I,
, I,
,I
m exp[m + sm (t ; m 0 )]
+o p (n 1/ 2 )
1 + o p (n 1 n ) n ( (0) , (0) )
( (0) , (0) )
1 1/ 2 B ( (0) , (0) ) n
d
(0)
1 1 S ( (0) , (0) ) n
( (0) , (0) )
1 1 S ( (0) , (0) ) n
( (0) , (0) )
(0)
+ o p ( n ) . ( ( 0) , ( 0) )
( (0) , (0) ) ( (0) , (0) )
( ( 0) , ( 0) )
1 1/ 2 B ((0) , (0) ) n
= O p (1) by
( (0) , (0) )
((0) , (0) )
N ( p +1) I ( 0, I ( p +1) I ) ,
d
).
I j = 0, j i
1 ni
n k =1
V12 = (v )
ij 12 i , j =1, , I
, V = (V11 , V12 ),
, LI ( , )) ,
( 0)
and
asymptotically independent because it can be shown after very extensive algebra that
((0) , (0) ) ((0) , (0) ) 1L ( , ) S 1 ((0) , (0) ) , V ((0) , (0) ) (0) (0)
Cov
S 1
= 0.
(0)
Rin (t ) = o p (n 1 / 2 ) rin (t ) + o p ( n ).
It follows from part (b) of Theorem 3 that n = O p (n 1 / 2 ) . Furthermore, it can be shown
Consequently, there is
strongly for i = 0,1, , I and consistent, applying a first-order Taylor expansion and Theorem 1 gives, uniformly in t ,
11
21
W1 WI
H1I GI H 2 I
~ ~ ( , ) is
EH 0i (t ) = i
of
0 Var
H10 G0 H 20 H G H
W0
in D I +1[, ].
(20)
(0)
(0)
(0) (0)
(0)
in
= H1i (t )
1 = V 1 L( ( 0) , ( 0) ) + o p (n 1 / 2 ). n
(0) (0) (0) (0)
+o p (n 1/ 2 ) rin (t ) + o p ( n )
= H1i (t ) H 2 i (t ) + Rin (t ),
i = 0,1,
, I, (19)
are
i = 0,1,
, I,
(0)
rin (t ) = ( H 0 i (t ) EH 0i (t ), H 3i (t ) EH 3i (t ))
B = Var
1 L( (0) , (0) ) . n
( (0) )
(18)
rin (t ) + o p ( n )
( (0) , (0) )
L( , ) = ( L ( , ), 1
1+
ii v12 =
ij v12 , i = 0,1,
, I,
Gi (t ) =
I m =1
m exp[ m + sm (Tk ; m )]
( (0) )
i j i j k =0 k 1+ 1+ Bij ( s )G j (t ) + 2 Gi ( s ) Bij (t ) + 2 i j i j
1
i = 0,1,
, I,
( A (s ), A (s))S 1 1i 2i j I
A (t ) 2j
1+
k = 0 k ik i j
B ( s) B
jk
(t )
Cov( nH 2i ( s ),
nH 2i (t ))
1+
2 j
Bij (s )G j (t )
Bij (s ) 1
1+
i2 j
Gi ( s ) Bij (t )
1+
[Gi ( s ) Bii ( s )]
i j
i j
( A1i ( s ), A2i (s )) S 1
= EWi (s )Wi (t ),
i j = 0, 1,
, I.
2 i
( A1i ( s ), A2 i ( s)) S 1
I
A2i (t )
1+
i2
1
[Gi ( s ) Bii ( s )]
2 i
( A1i ( s ), A2i (s )) S 1
i = 0,1,
, I,
Cov( n [ H1i ( s ) Gi ( s )] nH 2i ( s ), n [ H (t ) G (t )] nH (t ))
1j j 2j
1+
B ( s ) B (t )
A1i (t ) A2 i (t )
= EWi ( s )Wi (t ),
A1i (t )
It then follows from the multivariate central limit theorem for sample means and the CramerWold device that the finite-dimensional distributions of
n ( H 10 G0 H 20 ,
, H 1I G I H 2 I )
converge weakly to those of (W0 , , WI ) . Thus, in order to prove (20), it is enough to show that the process
{ n ( H10 (t ) G0 (t ) H 20 (t ), , H (t ) G (t ) H (t )) , t }
1I I 2I
[, ]. But this has been established by Lemma 4 for continuous G0 . Thus, (20) has been proven when G0 is
is tight in D continuous. Suppose now that G0 is an arbitrary distribution function over [, ] . Define the inverse of G0 , or quantile function associated with G0 , by G0 ( x) = inf{ t : G0 (t ) x},
1
I +1
x (0,1). Let i1 ,
, ini
be
independent
random variables having the same density function h i ( x ) = exp[ i + s i ( G 0 1 ( x ); i )] on (0,1) for i = 0,1, , I and assume that
1+
Under the assumption that the underlying distribution function G0 is continuous (20) is proven. According to (16) and Lemmas 2 and 3, we have for s t ,
Bij ( s )
Bik ( s ) B jk (t )
A (t ) 1j
A1 j (t ) A2 j (t )
BIAO ZHANG
{( i1 , , ini ) : i = 0,1, , I}
are jointly
33
independent. Thus, we have the following ( I + 1) -sample semiparametric model analogous to (2):
01 , , 0 n ~ h0 ( x) = I ( 0,1) ( x ),
0
i .i .d .
i1 ,
i .i .d .
i = 0,1,
, I.
(21)
, X ini ) and
the same
(G ( i1 ),
1 0
, G ( ini ))
1 0
have
distribution, i.e.,
( X i1 ,
, X ini ) = (G01 ( i1 ),
i = 0,1,
pooled
, I . Let { 1 ,
random
, n } denote the
; I1 ,
References Agresti, A. (1990). Categorical data analysis. NY: John Wiley & Sons. Anderson, J. A. (1972). Separate sample logistic discrimination. Biometrika, 59, 19-35. Anderson, J. A. (1979). Robust inference using logistic models. International Statistical Institute Bulletin, 48, 35-53. Billingsley, P. (1968). Convergence of probability measures. NY: John Wiley & Sons. Day, N. E., & Kerridge, D. F. (1967). A general maximum likelihood discriminant. Biometrics, 23, 313-323. Farewell, V. (1979). Some results on the estimation of logistic models based on retrospective data. Biometrika, 66, 27-32. Gilbert, P., Lele, S., & Vardi, Y. (1999). Maximum likelihood estimation in semiparametric selection bias models with application to AIDS vaccine trials. Biometrika, 86, 27-43. Gill, R. D., Vardi, Y., & Wellner, J. A. (1988). Large sample theory of empirical distributions in biased sampling models. Annals of Statistics, 16, 1069-1112. Hall, P., & La Scala, B. (1990). Methodology and algorithms of empirical likelihood, International Statistical Review,58, 109-127.
{ 01 ,
, 0n0 ; 11 ,
d
, 1n1 ;
, H1I (t ) GI (t )]
H 2 I (t ))
D
and
, H1I GI H 2 I )
n ( H10 G0 H 20 ,
(W0 ,
WI )
in
the
Skorohod
topology
and
34
Hsieh, D. A., Manski, C. F., & McFadden, D. (1985). Estimation of response probabilities from augmented retrospective observations. Journal of the American Statistical Association, 80, 651-662. Owen, A. B. (1988). Empirical likelihood ratio confidence intervals for a single functional. Biometrika, 75, 237-249. Owen, A. B. (1990). Empirical likelihood confidence regions. Annals of Statistics, 18, 90-120. Owen, A. B. (1991). Empirical likelihood for linear models. Annals of Statistics, 19, 1725-1747. Prentice, R. L., & Pyke, R. (1979). Logistic disease incidence models and casecontrol studies. Biometrika, 66, 403-411. Qin, J. (1993). Empirical likelihood in biased sample problems. Annals of Statistics, 21, 1182-96. Qin, J. (1998). Inferences for casecontrol and semiparametric two-sample density ratio models. Biometrika 85, 619-30. Qin, J., & Lawless, J. (1994). Empirical likelihood and general estimating equations. Annals of Statistics, 22, 300-325. Qin, J., & Zhang, B. (1997). A goodness of fit test for logistic regression models based on case-control data. Biometrika, 84, 609-618. Scott, A. J., & Wild, C. J. (1997). Fitting regression models to case-control data by maximum likelihood. Biometrika, 84, 57-71. Sen, P. K., & Singer, J. M. (1993). Large sample methods in statistics: an introduction with applications. NY: Chapman & Hall.
Journal of Modern Applied Statistical Methods May, 2005, Vol. 4, No. 1, 35-42
Two Sides Of The Same Coin: Bootstrapping The Restricted Vs. Unrestricted Model
Panagiotis Mantalos
Department of Statistics Lund University, Sweden
The properties of the bootstrap test for restrictions are studied in two versions: 1) bootstrapping under the null hypothesis, restricted, and 2) bootstrapping under the alternative hypothesis, unrestricted. This article demonstrates the equivalence of these two methods, and illustrates the small sample properties of the Wald test for testing Granger-Causality in a stable stationary VAR system by Monte Carlo methods. The analysis regarding the size of the test reveals that, as expected, both bootstrap tests have actual sizes that lie close to the nominal size. Regarding the power of the test, the Wald and bootstrap tests share the same power as the use of the Size-Power Curves on a correct size-adjusted basis. Key words: Bootstrap, Granger-Causality, VAR system, Wald test
Introduction When studying the small sample properties of a test procedure by comparing different tests, two aspects are of importance: a) to find the test that has actual size closest to the nominal size, and given that (a) holds, and b) to find the test that has the greatest power. In most cases, however, the distributions of the test statistic used are known only asymptotically and, unfortunately, unless the sample size is very large, the tests may not have the correct size. Inferential comparisons and judgements based on them might be misleading. Gregory and Veall (1985) can be consulted for an illustrative example. One of the ways to deal with this situation is to use the bootstrap. The use of this procedure is increasing with the advent of personal computers.
However, the issue of the bootstrap test, even it is applied, is not trivial. One of the problems is that one needs to decide how to resample the data, and whether to resample under the null hypothesis or under the alternative hypothesis. By bootstrapping under the null hypothesis, an approximation is made of the distribution of the test statistic, thereby generating more robust critical values for our test statistic. Alternately, by bootstrapping under the alternative hypothesis, an approximation is made of the distribution of the parameter, and is subsequently used to make inferences. In either case, it does not matter whether the nature of the theoretical distribution of the parameter estimator or the theoretical distribution of the test statistic is known. What matters is that the bootstrap technique approximates those distributions. In this article, the bootstrap test procedure shows that a) by bootstrapping under the null hypothesis (that is, bootstrapping the restricted model), and b) by bootstrapping under the alternative hypothesis (that is, bootstrapping the unrestricted model) will lead to the same results.
Panagiotis Mantalos, Department of Statistics Lund University, Box 743 SE-22007 Lund, Sweden. E-mail: [email protected]
35
36
Define
(0, ) .
where q and r are fixed (q x 1) vectors and R is a fixed q K matrix with rank q. It is possible to base a test of H 0 on the Wald criterion
Ts = ( R - r ) Var ( R )
( R - r ).
Ts*
( R( =
- ) . Var ( R * )
*
a)
y =X +
* 1
,...,
* T
Bootstrap critical values The bootstrap technique improves the critical values, so that the true size of the test approaches its nominal value. The principle of bootstrap critical values is to draw a number of Bootstrap samples from the model under the null hypothesis, calculate the Bootstrap test statistic Ts* , and compare it with the observed test statistic. The bootstrap procedure for calculation of the critical values is given by the following steps:
H o : R = r vs. H1 : R
that
r,
(2)
(3)
Bootstrap-hypothesis testing One of the important considerations for generating the yt* leading to the bootstrap critical values is whether to impose the null hypothesis on the model from which is generated the yt* . However, some authors, including Jeong and Chung (2001), argued for bootstrapping under the alternative hypothesis. Let the data speak is their principle in apply the bootstrap. The bootstrap procedure to resample the data from the unrestricted model consists of the following steps: a) b) Estimate the test statistic as in (3) Use the adjusted OLS residuals ( i
where y is an n 1 vector, X is an n
( )
y=X +
)
(1)
c) Then, calculate the test statistic Ts* as in (3), i.e., by applying the Wald test procedure to the (4) model. Repeat this step b times and take the (1-)th quintile of the bootstrap distribution of Ts* to obtain the - level Bootstrap critical values ( ct* ). Reject Ho if Ts ct* . Among articles that advocated this approach are Horowitz (1994) and Mantalos and Shukur (1998), whereas Davidson and MacKinnon (1999) and Mantalos (1998) advocated the estimate of the P-value. A bootstrap estimate of the P-value for testing is P*{ Ts* Ts }.
y * = X 0 +
The properties of the two different methods will be illustrated and investigated using Monte Carlo methods. The Residual Bootstrap, (RB), will be used to study the properties of the test procedure when the errors are identically and independently distributed. To provide an example that is easy to be extended to a more general hypothesis, it is convenient to use the Wald test for restrictions for testing Granger-causality in a stable stationary VAR system.
Use
the
adjusted
OLS
* 1
residuals
* T
,...,
data.
(4)
) i
(5)
PANAGIOTIS MANTALOS
By repeating this step b times the (1) quintile can be used of the bootstrap distribution of the (5) as the - level Bootstrap critical values ( ct* ). Reject Ho if Ts ct* . The bootstrap estimate of the P-value is P*{ Ts* Ts }. Since Efrons (1979) introduction of the bootstrap as a computer-based method for evaluating the accuracy of a statistic, there have been significant theoretical refinements of the technique. Horowitz (1994) and Hall and Horowitz (1996) discussed the method and showed that bootstrap tests are more reliable than asymptotic tests, and can be used to improve finite-sample performance. They provided a heuristic explanation of why the bootstrap provides asymptotic refinements to the critical values of test statistics. See Hall (1992) for a wider discussion on bootstrap refinements based on the Edgeworth expansion. Davidson and MacKinnon (1999) provided an explanation of why the bootstrap provides asymptotic refinements to the p- values of a test. The same authors conclude that by using the bootstrap critical values or bootstrap test, the size distortion of a bootstrap test is at
th
* b) Unrestricted: yu = X + *
37
. (7)
and
and
Because the right-hand components of the (10) and (11) are equal,
= ) (
Two sides of the same coin Consider the general linear model
y=X +
(1)
It is not difficult to see from (12) that the same results from the both methods are expected: there are two sides to the same coin. These results will be illustrated by a Monte Carlo experiment. Wald test for restrictions in a VAR model Consider a data-generation process (DGP) that consists of the k-dimensional multiple time series generated by the VAR(p) process
1 p
and suppose that the interest is in testing the q independent linear restrictions
GDPs are:
1t
kt
a) Restricted:
(6)
y * = X 0 + R
independent white noise process with nonsingular covariance matrix and, for j = 1, ... ,
where
, ...,
is a zero mean
denoted by
H o : R = r vs. H1 : R
r.
(2) be
yt = A1 yt
... +A p yt
(X X) 1 X
(X X) 1 X
* ( X X ) 1 X yu
(X X) 1 X
* (X X ) 1 X yR
(X X ) 1 X
(8)
(9)
(10)
(11)
(12)
(13)
38
+
jt
Y : = y1 ,
and
1 T
By using this notation, for t = 1, , T, the VAR (p) model including a constant term ( v ) can be written compactly as
Y = BZ + .
Then, the LS estimator of the B is
)
B = YZ ZZ
(15)
be the vector the LS estimators of the parameters, where vec[.] denotes the vectorization operator that stacks the columns of the argument matrix. Then,
Ts _ wald
1 = ( R p ) R ( ZZ ) R
where denotes weak convergence in distribution and the [ k 2 ( p ) x k 2 ( p ) ] covariance matrix p is non-singular. Now, suppose that in testing independent linear restrictions is of interest
p
Ho : R
= s vs. H1 : R
s,
(17)
T 1/ 2 ( p p )
N ( 0, p ) ,
(16)
Ts*_ wald p = ( R * ) R ( Z * Z * ) * R
1
Let
vec A1 ,
, Ap be the vector of
,A p
The null hypothesis of no Granger-causality may be expressed in terms of the coefficients of VAR process as
Ho : R
= 0 vs. H1 : R
: =
k x T matrix.
Z : = Z0 ,
(
, ZT
p 1
yt
(kp+1) x T matrix
be the estimate of the residual covariance matrix. Then, the diagonal elements of
Zt : =
yt
(kp +1) x 1
matrix,
Let
( ZZ )
(14)
1 = ( R p - s) R ( ZZ ) R
B : = v, A1 ,
<
, yT
k,
where q and s are fixed (q x 1) vectors and R is a fixed [q x k 2 p ] matrix with rank q. We can base a test of H 0 on the Wald criterion
k x T matrix,
, Ap
Ts _ wald = ( R p - s) Var ( R p - s)
( R p - s) .
(18)
( R p - s).
(20)
0.
(21)
( R p )
(22)
p ( R * )
(23)
PANAGIOTIS MANTALOS
39
Ts*_ wald
1 p = ( R * - R p ) R ( Z * Z * ) * R
p ( R * - R p )
and MacKinnon (1998) because they are easy to interpret. The P-value plot is used to study the size, and the Size-Power curves is used to study the power of the tests. The graphs, the P-value plots and Size-Power curves are based on the empirical distribution function, the EDF of the
for the unrestricted form. Methodology Monte Carlo experiment This section illustrates various generalizations of the Granger-causality tests in VAR systems with stationary variables, using Monte Carlo methods. The estimated size is calculated by observing how many times the null is rejected in repeated samples under conditions where the null is true. The following VAR(1) process is generated:
For the P-value plots, if the distribution used to compute the ps terms is correct, each of the ps terms should be distributed uniformly on (0,1). Therefore the resulting graph should be close to the 45o line. Furthermore, to judge the reasonableness of the results, a 95% confidence interval is used for the nominal size ( 0 ) as:
0
y1t is Granger-noncausal for y2t and if 0, y1t causes y2t . Therefore, = 0 is used to study
the size of the tests. The order p of the process is assumed to be known. Because this assumption might be too optimistic, a VAR(2) is fitted: yt = v A1 yt 1 A 2 yt 2 . t For each time series, 20 pre-sample values were generated with zero initial conditions, taking net sample sizes of T = 25 and 50. The Bootstrap test statistic ( Ts* ) is calculated. As for b, which is the size of the bootstrap sample used to estimate bootstrap critical values and the P-value, b = 399 is used. Note that there are no initial bootstrap observations in bootstrap procedure. Next presented are the results of the Monte Carlo experiment concerning the sizes of the various versions of the tests statistics using the VAR(2) model. Graphical methods are used that were developed and illustrated by Davidson
where
~ N 0, I 2
, yt = y1t . y2 t . If = 0,
1/ 2
yt =
0.5
0.3 yt 0.5
(25)
of Monte Carlo replications. Results that lie between these bounds will be considered satisfactory. For example, if the nominal size is 5%, define a result as reasonable if the estimated size lies between 3.6% and 6.4%. The P-value plots also make it possible and easy to distinguish between tests that systematically over-reject or under-reject, and those that reject the null hypothesis about the right proportion of the time. Figure 1 shows the truncated P-value plots for the actual size of the bootstrap and the Wald tests, using 25 and 50 observations. Looking at these curves, it is not difficult to make the inference that both the bootstrap tests perform adequately, as they lie inside the confidence bounds. However, using the asymptotic critical values, the Wald test shows a tendency to over-reject the null hypothesis. The superiority of the bootstrap test over the Wald test, concerning the size of the tests, is considerable, and more noticeable in small samples of size 25. The power of the Wald and bootstrap tests by using sample sizes of 25 and 50 observations was examined. The power function is estimated by calculating the rejection frequencies in 1000 replications using the value = 2. The Size-Power Curves are used to compare the estimated power functions of the alternative test statistics. This proved to be quite
(1 N
(24)
P-values, denoted as F x j .
40
adequate, because those tests that gave reasonable results regarding size usually differed very little regarding power. The same processes are followed for the size investigation to evaluate the EDFs denoted random numbers used to estimate the size of the tests. Size-Power Curves are used to plot the estimated power functions against the nominal size. The estimated power functions are plotted
) (
by F
xj
PANAGIOTIS MANTALOS
41
Figure 1. P-values Plots Estimated Size of the Wald and Bootstrap Tests. Figure 1a: 25 observations Figure 1b: 50 observations
Dash 3Dot lines: 95% Confidence interval Figure 2. Estimated Power of the Wald and Bootstrap Tests. Figure 2a: 25 observations Figure 2b: 50 observations
Figure3. Size-adjusted Power of the Wald and Bootstrap Tests. Figure 3a: 25 observations Figure 3b: 50 observations
42
Davidson, R., & MacKinnon, J. G. (1998). The size distortion of bootstrap tests. Econometric Theory, 15, 361-376. Davidson, R., & MacKinnon, J. G. (1996). The Power of bootstrap tests. Discussion paper, Queens University, Kingston, Ontario. Davidson, R., & MacKinnon, J. G. (1998). Graphical methods for investigating the size and power of test statistics. The Manchester School, 66, 1-26. Efron, B. (1979). Bootstrap methods: Another look at the jackknife. Annals of Statistics 7, 1-26. Gregory, A. W., & Veall, M. R. (1985). Formulating Wald tests of nonlinear restrictions. Econometrica 53, 1465-1468. Hall, P. (1992). The bootstrap and Edgeworth expansion. New York: SpringerVerlag. Hall, P., & Horowitz, J. L. (1996). Bootstrap critical values for tests based on generalized - method - of - moments estimators. Econometrica, 64, 891916.
Journal of Modern Applied Statistical Methods May, 2005, Vol. 4, No.1, 43-52
Wardell (1997) provided a method for constructing confidence intervals on a proportion that modifies the Clopper-Pearson (1934) interval by allowing for the upper and lower binomial tail probabilities to be set in a way that minimizes the interval width. This article investigates the coverage properties of these optimized intervals. It is found that the optimized intervals fail to provide coverage at or above the nominal rate over some portions of the binomial parameter space but may be useful as an approximate method. Key words: Attribute, Bernoulli, dichotomous, exact, sampling where Cn , CL* ( p ) is the coverage probability for a particular method with a nominal confidence level CL* for samples of size n taken from a population with binomial parameter p and I(i, p ) is 1 if the interval contains p when y = i and 0 otherwise. The actual confidence level of a method for a given CL* and n
Introduction A common task in statistics is to form a confidence interval on the binomial proportion p. The binomial probability distribution function is defined as
Pr [Y = y | p , n ] = b ( p, n, y ) n y p (1 p n y ) , = y
where the proportion of elements with a specified characteristic in the population is p, the sample size is n, and y is the outcome of the random variable Y representing the number of elements with a specified characteristic in the sample. The coverage probability for a given value of p is
(CL )
n ,CL*
Exact confidence interval methods (Blyth & Still, 1983) have the property that CLn, CL* CL* for all n, and CL* . The most commonly used exact method is due to Clopper and Pearson (1934) and is based on inverting binomial tests of H 0 : p = p 0 . The upper bound of the ClopperPearson interval (U) is the solution in p 0 to the equation
n i= y
43
John P. Wendell is Professor, College of Business Administration, University of Hawai`i at M noa. E-mail: [email protected]. Sharon Cox is Assistant Professor, College of Business Administration, University of Hawai`i at M noa.
except that when y = n , U = 1 . The lower bound, L, is the solution in p 0 to the equation
y i =0
i =0
Cn ,CL* ( p ) =
I(i, p ) b( p, n, i ),
b ( p0 , n, i ) = U ,
b ( p0 , n, i ) = L , CL* = 1
where
44
bounds are determined by inverting hypothesis tests, both U and L are set a priori and remain fixed regardless of the value of y. In practice, the values of U and L are often set to U = L = / 2 . Wardell (1997) modified the ClopperPearson bounds by replacing the condition that U and L are fixed with the condition that only is fixed. This allows to be partitioned differently between U and L for each sample outcome y. Wardell (1997) provided an algorithm for accomplishing this partitioning in such a way that the confidence interval width is minimized for each y. Intervals calculated in this way are referred to here as optimized intervals. Wardell (1997) was concerned with determining the optimized intervals and not the coverage properties of the method. The purpose of this article is to investigate the coverage properties. Coverage Properties of Optimized Intervals Figure 1 plots Cn ,.95 ( p ) against p for sample sizes of 5, 10, 20, and 50. The discontinuity evident in the Figure 1 plots is due to the abrupt change in the coverage probability when p is at U or L for any of the n + 1 confidence intervals. Berger and Coutant (2001) demonstrated that the optimized interval method is an approximate and not an exact method by showing that CL5,.95 = .9375 < .95 . Figure 1 confirms the Berger and Coutant result and extends it to sample sizes of 10, 20, and 50. Agresti and Coull (1998) argued that some approximate methods have advantages over exact methods that make them preferable in many applications. In particular, they recommended two approximate methods for use by practitioners: the score method and adjusted Wald method. The interval bounds for the score method are
2 / (1 + z / 2 / n ) ,
where p = y / n and zc is the 1 c quantile of the standard normal distribution. The adjusted Wald method interval bounds are
p z / 2 p (1 p ) / ( n + 4 ) ,
where
~ = ( y + 2 ) / (n + 4 ) . p
One measure of the usefulness of an approximate method is the average coverage probability over the parameter space when p has a uniform distribution. This measure is used by Agresti and Coull (1998). Ideally, the average coverage probability should equal the nominal coverage probability. Figure 2 is a plot of the average coverage probabilities for the optimized interval, adjusted Wald and score methods for sample sizes of 1 to 100 and nominal confidence levels of .80, .90, 95, and .99. Both the adjusted Wald and the score method perform better on this measure than the optimized interval method in the sense the average coverage probability is closer to the nominal across all of the nominal confidence levels and sample sizes. However, the optimized interval method has the desirable property that the average coverage probability never falls below the nominal for any of the points plotted. The score method is below the nominal for the entire range of sample sizes at the nominal confidence level of .99 and the same is true for the adjusted Wald method at the nominal confidence level of .80.
p+
2 p (1 p ) + z / 2 / 4n / n
45
Figure 1. Coverage Probabilities of Optimized Intervals Across Binomial Parameter p. The disjointed lines plot the actual coverage probabilities of the optimized interval method across the entire range of values of p at a nominal confidence level of .95 for sample sizes of 5, 10, 20, and 50. The discontinuities occur at the boundary points of the n + 1 confidence intervals. The horizontal dotted line is at the nominal confidence level of .95. For all four sample sizes the actual coverage probability falls below the nominal for some values of p, demonstrating that the optimized bounds method is not an exact method.
46
Figure 2. Average Coverage Probabilities of Three Approximate Methods. The scatter is of the average coverage probabilities of three approximate methods when p is uniformly distributed for sample sizes of from 1 to 100 with nominal confidence levels of .80, .90, .95, and .99. The optimized interval method is indicated by a o, the adjusted Wald method by a +, and the score method by a <. The horizontal dotted line is at the nominal confidence level. The optimized interval methods average coverage probability tends to be further away from the nominal than the other two methods for all four nominal confidence levels and is always higher than the nominal. The average coverage probabilities of the other two methods tend to be closer to, and sometimes below, the nominal level.
47
uniform-weighted root mean squared error of the average coverage probabilities about the nominal confidence level. Ideally, this mean squared error would equal zero. Figure 3 plots the root mean squared error for the three methods over the same range of sample sizes and nominal confidence levels as Figure 2. The relative performance of the three methods for this metric varies according to the nominal confidence level. Each method has at least one nominal confidence level where the root mean squared error is furthest from zero for most of the sample sizes. The score method is worst at nominal confidence level of .99, the adjusted Wald at .80, and the optimized interval method at both .90 and .95. Agresti and Coull (1998) also advocated comparing one method directly to another by measuring the proportion of the parameter space where the coverage probability is closer to the nominal for one method than the other. Figure 4 plots this metric for both the score method and the adjusted Wald method versus the optimized interval method for the same sample sizes and nominal confidence levels as Figures 2 and 3. The results are mixed. At the .99 nominal confidence level the coverage of the adjusted Wald method is closer to the nominal in less than 50% of the range of p for all sample sizes, whereas the score method is closer for more than 50% of the range of p for all sample sizes above 40. At the other three nominal confidence levels both the adjusted Wald and score methods are usually closer to the nominal than the optimized interval method in more 50% of the range of p when sample sizes are greater than 20 and less than 50% for smaller sample sizes. Neither method is closer than the optimized interval method to the nominal confidence level in more than 65% of the range of p for any of the pairs of sample sizes and nominal confidence levels. Another metric of interest is the proportion of the range of p where the coverage probability is less than the nominal. For exact methods, this proportion is zero by definition. For approximate methods, a small proportion of
Coull (1998) is
(C
n ,CL*
( p ) CL* ) dp , the
2
the range of p with coverage probabilities less than the nominal level is preferred. Figure 5 plots this metric over the same sample sizes and nominal confidence levels as Figures 2 to 4. The optimized interval method is closer to zero than the other methods for almost all of the sample sizes and nominal confidence levels. The adjusted Wald is the next best, with the score method performing the worst on this metric. The approximate methods all have the * property that CLn ,CL* < CL for most values of
U and L for all y. As a result, the CL* = 1 level optimized intervals must be
contained
*
within the Clopper-Pearson CL = 1 2 level intervals. Because the Clopper-Pearson method is an exact method, it * follows directly that CLn ,CL* CL for all n
and CL* . The score and the adjusted Wald method have no such restriction on CLn ,CL* . Figure 6 plots the actual coverage probability of the optimized interval method against sample sizes ranging from 1 to 100 for nominal confidence levels of .80, .90, .95, and .99. Figure 6 shows that the optimized method is always below the nominal except for very small sample sizes. It is often within a distance of /2 of the nominal confidence level, particularly for sample sizes over 20. The performance of the adjusted Wald method for this metric is very similar to the optimized interval method for sample sizes over 10 at the .95 and .99 confidence level. At the .80 and .90 confidence level the adjusted Wald performs very badly, with coverage probabilities of zero for all of the sample sizes when the nominal level is .80. The score method is the opposite, with actual confidence levels substantially below the nominal at the .95 and .99 nominal levels and closer at the .90 and .80 levels.
48
Figure 3. Root Mean Square Error of Three Approximate Methods. The scatter is of the uniformweighted root mean squared error of the average coverage probabilities of three approximate methods when p is uniformly distributed for sample sizes of from 1 to 100 with nominal confidence levels of .80, .90, .95, and .99. The optimized interval method is indicated by a o, the adjusted Wald method by a +, and the score method by a <. The relative performance of the three methods for this metric varies according to the nominal confidence level. Each method has at least one nominal confidence level where the root mean squared error is furthest from zero for most of the sample sizes.
49
Figure 4. Proportion of Values of p Where Coverage is Closer to Nominal. The scatter is of the proportion of the uniformly distributed values of p for which the adjusted Wald or score method has actual coverage probability closer to the nominal coverage probability than the optimized method for sample sizes of from 1 to 100 with nominal confidence levels of .80, .90, .95, and .99. The adjusted Wald method is indicated by a o and the score method by a +. The horizontal dotted line is at 50%. At the .80, .90, and .95 nominal confidence levels both the adjusted Wald and Score method tend to have coverage probabilities closer to the nominal for more than half the range of p sample sizes over 20 and this is also true for the score method at a nominal confidence level of .99. For the adjusted Wald at nominal confidence level of .99, and for both methods with sample sizes less than 20, the coverage probability is closer to the nominal than the optimized method for less than half the range of for p.
50
Figure 5. Proportion of p Where Coverage is Less Than the Nominal. The scatter is of the proportion of the uniformly distributed values of p for which a coverage method has actual coverage probability less than the nominal coverage probability for sample sizes of from 1 to 100 with nominal confidence levels of .80, .90, .95, and .99. The optimized interval method is indicated by a o, the adjusted Wald method by a +, and the score method by a <. In general, the optimized interval method has a smaller proportion of the range of p where the actual coverage probability is less than the nominal than the other two methods and this proportion tends to decrease as the sample size increases while it increases for the adjusted Wald and stays at approximately the same level for the score method.
51
Figure 6. Actual Confidence Levels. The scatter is of the actual confidence levels for three approximate methods for sample sizes of from 1 to 100 with nominal confidence levels of .80, .90, .95, and .99. The optimized interval method is indicated by a o, the adjusted Wald method by a +, and the score method by a <. No actual confidence levels for any sample size are shown for the adjusted Wald method at a nominal confidence level of .80 or for sample sizes less than four at a nominal confidence level of .90. The actual confidence level is zero at all of those points. The upper horizontal dotted line is at the nominal confidence level and the lower dotted line is at the nominal confidence level minus a. The actual confidence level for the optimized bound method is always less than nominal level except for very small sample sizes, but it is never less than the nominal level minus a. The actual confidence level of the other two methods can be substantially less than the nominal.
52
The optimized interval method is not an exact method. It should not be used in applications where it is essential that the actual coverage probability be at or above the nominal confidence level across the entire parameter space. For applications where an exact method is not required the optimized method is worth consideration. Figures 2 6 demonstrate that none of the three approximate methods considered in this paper is clearly superior for all of the metrics across all of the sample sizes and nominal confidence levels considered. The investigator needs to determine which metrics are most important and then consult Figures 2 6 to determine which method performs best for those metrics at the sample size and nominal confidence level that will be used. If the distance of the actual confidence level from the nominal confidence level and the proportion of the parameter space where coverage falls below the nominal are important considerations then the optimized bound method will often be a good choice.
Journal of Modern Applied Statistical Methods May, 2005, Vol. 4, No.1, 53-62
Inferences About Regression Interactions Via A Robust Smoother With An Application To Cannabis Problems
Rand R. Wilcox Mitchell Earleywine
Department of Psychology University of Southern California, Los Angeles
A flexible approach to testing the hypothesis of no regression interaction is to test the hypothesis that a generalized additive model provides a good fit to the data, where the components are some type of robust smoother. A practical concern, however, is that there are no published results on how well this approach controls the probability of a Type I error. Simulation results, reported here, indicate that an appropriate choice for the span of the smoother is required so that the actual probability of a Type I error is reasonably close to the nominal level. The technique is illustrated with data dealing with cannabis problems where the usual regression model for interactions provides a poor fit to the data. Key words: Robust smoothers, curvature, interactions
Introduction A combination of extant regression methods provides a very flexible and robust approach to detecting and modeling regression interactions. In particular, both curvature and nonnormality are allowed. The main goal in this paper is to report results on the small-sample properties of this approach when a particular robust smoother is used to approximate the regression surface. The main result is that in order to control the probability of a Type I error, an appropriate choice for the span must be used which is a function of the sample size. However, before addressing this issue, we provide a motivating example for considering smoothers when investigating interactions. A well-known approach to detecting and modeling regression interactions is to assume that for a sample of n vectors of observations,
(Y ,X ,X ), i i1 i2
Yi = 0 + 1 X i1 + 2 X i 2 + 3 X i1 X i 2 + i , (1)
i=1,...,n, where is independent of X i1 and
X i2 ,
E ( ) = 0.
The
hypothesis
of
no
interaction corresponds to
H 0 : 3 = 0.
This approach appears to have been first suggested by Saunders (1956). A practical issue is whether this approach is flexible enough to detect and to model an interaction if one exists. We consider data collected by the second author to illustrate that at least in some situations, a more flexible model is required. The data deal with cannabis problems among adult males. Responses from n=296 males were obtained where the two regressors were the participants use of cannabis ( X 1 ) and consumption of alcohol ( X 2 ). The dependent measure (Y) reflected cannabis dependence as measured by the number of DSM-IV symptoms reported. An issue of interest was determining whether the amount of alcohol consumed alters the association between Y and the amount of cannabis used, and there is the issue of
Rand R. Wilcox ([email protected]) is a Professor of Psychology at the University of Southern California, Los Angeles. M. Earleywine is an Associate Professor at the University of Southern California, Los Angeles.
53
54
REGRESSION INTERACTIONS VIA A ROBUST SMOOTHER H 0 : 3 = 0 is .18 when testing at the .05 level with a sample size of twenty. Of course, in this case, standard diagnostics can be used to detect the curvature, but experience with smoothers suggest that dealing with curvature is not always straightforward. Suppose instead Y = X 1 + X 12 + | X 2 | + , so there is no interaction even though there is a nonlinear association. Then with a sample size of fifty, and when testing at the .05 level, the probability of rejecting H 0 : 3 = 0 is .30. In contrast, using the more flexible method described here, the probability of rejecting the hypothesis of no interaction is .042. If we ignore the result that (1) is an inadequate model for the cannabis data and simply test H 0 : 3 = 0 (using least squares in conjunction with a conventional T test), or if we test H : =0 using a more robust hypothesis 0 3 testing method derived for the least squares estimator that is based on a modified percentile bootstrap method (Wilcox, 2003), or when using various robust estimators (such as an Mestimator with Schweppe weights or when using the Coakley-Hettmansperger estimator), we reject. But an issue is whether we reject because there is indeed an interaction, or because the model provides an inadequate representation of the data. And another concern is that by using an invalid model, an interaction might be masked. A more general and more flexible approach when investigating interactions is to test the hypothesis that there exists some functions f1
and f 2 such that
understanding how the association changes if an interaction exists. Using a method derived by Stute, Gonzlez-Manteiga and Presedo-Quindimil (1998), it is possible to test the hypothesis that the model given by equation (1) provides a good fit to the data. If, for example,
Yi = 0 + 1 X i1 + 2 X i 2 + 3 X i1 X i22 + i ,
then there is an interaction, but the family of regression equations given by (1) is inappropriate. The Stute et al. method can be applied using the S-PLUS or R function lintest in Wilcox (2003). Estimating the unknown parameters via least squares, this hypothesis is rejected at the .05 level. A criticism is that when testing the hypothesis that (1) is an appropriate model for the data, and when using the ordinary least squares estimator when estimating the unknown parameters, the probability of a Type I error might not be controlled (Wilcox, 2003). Replacing the least squares estimator with various robust estimators corrects this problem. Here, using the robust M-estimator derived by Coakley and Hettmansperger (1993), or using a generalization of the Theil-Sen estimator to multiple predictors (see Wilcox, 2005), again the hypothesis is rejected. Moreover, the R (or SPLUS) function pmodchk in Wilcox (2005) provides a graphical check of how well the model given by (1) fits the data when a least squares estimate of the parameters is used, versus a more flexible fit based on what is called a running interval smoother, and a poor fit based on (1) is indicated. Robust variations give similar results. So, at least in this case, an alternative and more flexible approach to testing the hypothesis of no interaction seems necessary. To provide more motivation for a more flexible approach when modeling interactions, note that equation (1) implies a nonlinear association between Y versus X 1 and X 2 . A concern, however, is that a nonlinear association does not necessarily imply an interaction. If, for example, X , X and are independent and 1 2 have standard normal distributions, and if 2 Y = X 1 + X 2 + , the probability of rejecting
Y = 0 + f1 ( X 1 ) + f 2 ( X 2 ) + .
(2)
Equation (2) is called a generalized additive model, a general discussion of which can be found in Hastie and Tibshirani (1990). A special case is where f1 ( X 1 ) = 1 X 1 , f 2 ( X 2 ) = 2 X 2 , but (2) allows situations where the regression surface is not necessarily a plane, even when there is no interaction. If the model represented by (2) is true, then there is no interaction in the following sense. Pick any two values for X 2 ,
55
1 For completeness, Barry (1993) derived a method for testing the hypothesis of no interaction assuming an ANOVA-type decomposition where
regression lines without forcing them to have a particular shape such as a straight line. As with most smoothers, the running interval smoother is based in part on something called a span, , which plays a role when determining whether the value X is close to a particular value of X 1 (or X 2 ). Details are provided in Appendix A. There are many ways of fitting the model given by (2). Here, the focus is on a method where the goal is to estimate a robust measure of location associated with Y, given ( X1 , X 2 ) , because of the many known advantages such measures have (e.g., Hampel, Ronchetti, Rousseeuw & Stahel, 1986; Huber, 1981; Staudte & Sheather, 1990; Wilcox, 2003, 2005). Primarily for convenience, the focus is on a 20% trimmed mean, but various robust Mestimators are certainly a possibility. The advantages associated with robust measures of location include an enhanced ability to control the probability of a Type I error in situations where methods based on means are known to fail, and substantial gains in power, over methods based on means, even under slight departures from normality. (Comments about using the mean, in conjunction with the proposed method, are made in the final section of this paper.) Here, the main reason for not using a robust M-estimator (with say, Hubers ), is that this estimator requires division by the median absolute deviation (MAD) statistic, and in some situations considered here, when the sample size is small, MAD is zero. The running interval smoother provides a predicted value for Y, given ( X i1 , X i 2 ) , say
Y = 0 + f1 ( X 1 ) + f 2 ( X 2 ) + f 3 ( X 1, X 2 ) + ,
in which case the hypothesis of no interaction is
H 0 : f3 ( X 1 , X 2 ) 0.
Barry (1993) used a Bayesian approach assuming that the (conditional) mean of Y is to be estimated and that prior distributions for f1 ,
Yi ; see Appendix A. Next, compute the residuals ri = Yi Yi . If the model given by (2)
is true, meaning that there is no interaction, then the regression surface when predicting r, given ( X1 , X 2 ) , should be a horizontal plane. The hypothesis that this regression surface is indeed a horizontal plane can be tested using the method derived by Stute et al. (1998). The details can be found in Appendix B.
56
Simulations were conducted as a partial check on the ability of the method, just outlined, to control the probability of a Type I error. Values for X 1 , X 2 and were generated from four types of distributions: normal, symmetric and heavy-tailed, asymmetric and light-tailed, and asymmetric and heavy-tailed. For non-normal distributions, observations were generated from a g-and-h distribution which is described in Appendix C. The goal was to check on how the method performs under normality, plus what would seem like extreme departures from normality, with the idea that if good performance is obtained under extreme departures from normality, the method should perform reasonably well with data encountered in practice. The correlation between X 1 and X 2 was taken to be either =0 or =.5. Initial simulation results revealed that the actual probability of a Type I error, when testing at the .05 level, is sensitive to the span, . (Hrdle & Mammen, 1993, report a similar result for a method somewhat related to the problem at hand.) If the span is too large, the actual Type I error probability can drop well below the nominal level. When testing at the .05 level, simulations were used to approximate a reasonable choice for . Here, the span corresponding to the sample sizes 20, 30, 50, 80 and 150 are taken to be .4, .36, .18, .15 and .09, respectively. It is suggested that when 20n150, interpolation based on these values be used, and for n>150 use a span equal to .09. For n>150 and sufficiently large, perhaps the actual Type I error probability is well below the nominal level, but exactly how the span should be modified when n>150 is an issue that is in need of further investigation. Table 1 contains , the estimated probability of making a Type I error when testing at the .05 level. n=20, and when Y= or 2 Y = X 1 + X 2 + . (The g and h values are explained in Appendix C.) Simulations were also run when Y = X 1 + X 2 + , the results were very similar to the case Y= , so for brevity they are not reported. No situation was found
57
Y= g 0.0 h 0.0 g 0.0 0.0 0.5 0.5 0.0 0.0 0.5 0.5 0.0 0.0 0.5 0.5 0.0 0.0 0.5 0.5 h 0.0 0.5 0.0 0.5 0.0 0.5 0.0 0.5 0.0 0.5 0.0 0.5 0.0 0.5 0.0 0.5 =0 .033 .039 .045 .037 .031 .032 .033 .029 .029 .031 .040 .029 .028 .026 .035 .020 =.5 .034 .034 .043 .035 .032 .024 .031 .024 .022 .020 .039 .027 .024 .017 .029 .015
2 Y = X1 + X 2 +
0.0
0.5
0.5
0.0
0.5
0.5
=0 .047 .026 .045 .035 .019 .020 .016 .023 .036 .032 .037 .025 .024 .015 .014 .015
=.5 .035 .031 .034 .032 .015 .012 .013 .013 .022 .014 .028 .020 .003 .003 .006 .007
58
Figure 2: A smooth of the residuals stemming from the generalized additive model versus the two predictors.
When there is no interaction, all three regression lines should be approximately parallel which is not the case. The regression lines corresponding to X 2 = 0.73 and -0.352 are reasonably parallel, and they are approximately horizontal suggesting that there is little association between Y and X for these special 1 cases.
But for X 3 = 0.332 , the association changes, particularly in the right portion of Figure 1 where the association becomes more positive. If the data are split into two groups according to whether X is less than the median i2 of the values X 12 ,..., X n 2 , -0.352, and then create a smooth between Y and X 1 , the result is shown in right panel of Figure 3.
59
Conclusion In principle, the method in this article can be used with any measure of location. It is noted, however, that if the 20% trimmed mean is replaced by the sample mean, poor and unstable control over the probability of a Type I error results. Finally, all of the methods used in this paper are easily applied using the S-PLUS or R functions in Wilcox (2005). (These functions can be downloaded as described in chapter 1.) Information about S-PLUS can be obtained from www.insightful.com, and R is a freeware variant of S-PLUS that can be downloaded from www.R-project.org. For convenience, the relevant functions for the problem at hand have been combined into a single function called adtest. If, for example, the X values are stored in an S-PLUS matrix x, and the Y values are stored in y, the command adtest(x,y) tests the hypothesis that the model given by (2) is true.
References Barry, D. (1993). Testing for additivity of a regression function. Annals of Statistics, 21, 235-254. Cleveland, W. S., & Devlin, S. J. (1988). Locally-weighted Regression: An Approach to Regression Analysis by Local Fitting. Journal of the American Statistical Association, 83, 596-610. Coakley, C. W., & Hettmansperger, T. P. (1993). A bounded influence, high breakdown, efficient regression estimator. Journal of the American Statistical Association, 88, 872-880. Dette, H. (1999). A consistent test for the functional form of a regression based on a difference of variances estimator. Annals of Statistics, 27, 1012-1040. Fan, J. (1993). Local linear smoothers and their minimax efficiencies. The Annals of Statistics, 21, 196-216.
60
= +1
In terms of efficiency (achieving a small standard error relative to the usual sample mean), 20% trimming performs very well under normality but continues to perform well in situations where the sample mean performs poorly (e.g., Rosenberger & Gasko, 1983). Now, we describe the running interval smoother in the one-predictor case. Consider a random sample ( X1 , Y1 ),..., ( X n , Yn ) and let be some constant that is chosen in a manner to be described. The constant is called the span. The median absolute deviation (MAD), based on X 1 ,..., X n , is the median of the n values
| X i X | MADN .
Thus, for normal distributions, X is close to X i if X is within standard deviations of X i . Then
Hampel, F. R., Ronchetti, E. M., Rousseeuw, P. J., & Stahel, W. A. (1986). Robust Statistics. New York: Wiley. Hardle, W. (1990). Applied Nonparametric Regression. Cambridge: Cambridge University Press. Hardle, W., & Mammen, E. (1993). Comparing non-parametric versus parametric regression fits. Annals of Statistics, 21, 19261947. Hastie, T. J., & Tibshirani, R. J. (1990). Generalized Additive Models. New York: Chapman and Hall. Hoaglin, D. C. (1985). Summarizing shape numerically: The g-and-h distribution. In D. Hoaglin, F. Mosteller & J. Tukey (Eds.) Exploring Data Tables Trends and Shapes. New York: Wiley. Huber, P. J. (1981). Robust Statistics. New York: Wiley. Rosenberger, J. L., & Gasko, M. (1983). Comparing location estimators: Trimmed means, medians, and trimean. In D. Hoaglin, F. Mosteller and J. Tukey (Eds.) Underststanding Robust and Exploratory Data Analysis. (pp. 297-336). New York: Wiley. Samarov, A. M. (1993). Exploring regression structure using nonparametric functional estimation. Journal of the American Statistical Association, 88, 836-847. Saunders, D. R. (1956). Moderator variables in prediction. Educational and Psychological Measurement, 16, 209-222. Stute, W., Gonzlez-Manteiga, W. G. & Presedo-Quindimil, M. P. (1998). Bootstrap approximations in model checks for regression. Journal of the American Statistical Association, 93, 141-149. Wilcox, R. R. (2003). Applying Contemporary Statistical Techniques. San Diego, CA: Academic Press. Wilcox, R. R. (2005). Introduction to Robust Estimation and Hypothesis Testing, 2nd Edition. San Diego, CA: Academic Press.
1 Xt = n2
W( i ) .
61
fj
(j=1,
2).
Here,
Let
j j running interval smooth based on the jth predictor, ignoring the other predictor under investigation. Next, iterate as follows. 1. 2. Increment k. Let
where
f1k ( X 1 ) = S1 (Y f 2k 1 ( X 2 ) | X 1 )
and
The test statistic is the maximum absolute value of all the R j values. That is, the test statistic is
f 2k ( X 2 ) = S2 (Y f1k 1 ( X 1 ) | X 2 ).
3. Repeat steps 1 and 2 until convergence. Finally, estimate 0 with the 20% trimmed i=1,...,n. The computations are performed by R or S-PLUS function adrun in Wilcox (2005).
D = max | R j | .
An appropriate critical value is estimated with the wild bootstrap method as follows. Generate U1 ,..., U n from a uniform distribution and set
Appendix B
This appendix describes the method for testing the hypothesis of no interaction. Fit the generalized additive model as described in
Appendix A yielding Yi , and let ri = Yi Yi , i=1,...,n. The goal is to test the hypothesis that the regression surface, when predicting the residuals, given ( X i1 , X i 2 ) , is a horizontal plane. This is done using the wild bootstrap method derived by Stute, Gonzlez-Manteiga and Presedo-Quindimil (1998). Let rt be the 20% trimmed mean based on the residuals r1 ,..., rn . Fix j and set I i = 1 if simultaneously
X i1 X j1 and X i 2 X j 2 , otherwise I i = 0 .
mean
of
the
values
Yi
f jk (Yi | X ij ) ,
Vi = 12(U i .5),
i* = vVi , i
and
ri* = rt + vi* .
Then based on the n pairs of points ( X 1 ,
D D(*u ) .
Rj
1 n
Ii (ri rt ) I i vi ,
(3)
vi = ri rt .
(4)
62
Details regarding the simulations are as follows. Observations were generated where the marginal distributions have a g-and-h distribution (Hoaglin, 1985) which includes the normal distribution as a special case. More precisely, observations Z ij , (i=1,...,n; j=1, 2) were initially generated from a multivariate normal distribution having correlation , then the marginal distributions were transformed to
exp( gZ ij ) 1 g
X ij =
where g and h are parameters that determine the third and fourth moments. The four (marginal) g-and-h distributions examined were the standard normal (g=h=0), a symmetric heavytailed distribution (g=0, h=.5), an asymmetric distribution with relatively light tails (g=.5, h=0), and an asymmetric distribution with heavy tails (g=h=.5). Here, two choices for were considered: 0 and .5.
2 Zexp( hZ ij / 2),
if g = 0
g 0.0 0.0 0.5 0.5 h 0.0 0.5 0.0 0.5
1
0.00 0.00 1.75 ---
2
3.0 --8.9 ---
Journal of Modern Applied Statistical Methods May, 2005, Vol. 4, No.1, 63-74
It is known that two-group linear discriminant function can be constructed via binary regression. In this article, it is shown that the opposite relation is also relevant it is possible to present multiple regression as a linear combination of a main part, based on the pooled variance, and Fisher discriminators by data segments. Presenting regression as an aggregate of the discriminators allows one to decompose coefficients of the model into sum of several vectors related to segments. Using this technique provides an understanding of how the total regression model is composed of the regressions by the segments with possible opposite directions of the dependency on the predictors. Key words: Regression, discriminant analysis, data segments
Introduction Linear Discriminant Analysis (LDA) was introduced by Fisher (1936) for classification of observations into two groups by maximizing the ratio of between-group variance to within-group variance (Rao, 1973; Lachenbruch, 1979; Hand, 1982; Dillon & Goldstein, 1984; McLachlan, 1992; Huberty, 1994). For two-group LDA, the Fisher linear discriminant function can be represented as a linear regression of a binary variable (groups indicator) by the predictors (Fisher, 1936; Anderson, 1958; Ladd, 1966; Hastie, Tibshirani & Buja, 1994; Ripley, 1996). Many-group LDA can be described in terms of the Canonical Correlations Analysis (Bartlett, 1938; Kendall & Stuart, 1966; Dillon & Goldstein, 1984; Lipovetsky, Tishler, & Conklin, 2002). LDA is used in various applications, for example, in marketing research
Stan Lipovetsky joined GfK Custom Research as a Research Manager in 1998. His primary areas of research are multivariate statistics, multiple criteria decision making, econometrics, and marketing research. Email him at: [email protected]. Michael Conklin is Senior Vice-President, Analytic Services for GfK Custom Research. His research interests include Bayesian methods and the analysis of categorical and ordinal data.
(Morrison, 1974; Hora & Wilcox, 1982; Lipovetsky & Conklin, 2004). Considered in this article is the possibility of presenting a multiple regression by segmented data as a linear combination of the Fisher discriminant functions. This technique is based on the relationship between total and pooled variances. Using this approach, we can interpret regression as an aggregate of discriminators, that allows us to decompose the coefficients of regression into a sum of vectors related to the data segments. Such a decomposition helps explain how a regression by total data could have the opposite direction of the dependency on the predictors, in comparison with the coefficients related to each segment. These effects correspond to well-known Simpsons and Lords paradoxes (Blyth, 1972; Holland & Rubin, 1983; Good & Mittal, 1987; Pearl, 2000; Rinott & Tam, 2003; Skrondal & Rabe-Hesketh, 2004; Wainer & Brown, 2004), and to treatment and causal effects in the models (Arminger, Clogg & Sobel, 1995; Rosenbaum, 1995; Winship & Morgan, 1999). The article is organized as follows. Linear discriminant analysis and its relation to binary regression are first described. The next section considers regression by segmented data and its decomposition by Fisher discriminators, followed by a numerical example and a summary.
63
64
Consider the main features of LDA. Denote X a data matrix of n by p order consisting of n rows of observations by p variables x1, x2, , xp. Also denote y a vector of size n consisting of binary values 1 or 0 that indicate belonging of each observations to one or another class. Suppose there are n1 observations in the first class (y =1), n2 observations in the second class (y =0), and total number of observations n=n1+n2 . Construct a linear aggregate of x-variables:
(6)
z = Xa,
(1)
that is a generalized eigenproblem. The matrix at the left-hand side (6) is of the rank one because it equals the outer product of a vector of the group means differences. So the problem (6) has just one eigenvalue different from zero and can be simplified. Using a constant of the scalar product c = (m (1) m ( 2) )a , reduces (6) to the linear system:
where a is a vector of p-th order of unknown parameters, and z is an n-th order vector of the aggregate scores. Averaging scores z (1) within each group yields two aggregates:
S pool a = q (m (1) m ( 2) ) ,
(7)
z (1) = m (1) a , z ( 2) = m ( 2) a ,
(1) (2)
(2)
a = S 1 (m (1) m ( 2) ) , pool
(8)
where m and m are vectors of p-th order of mean values m (j1) and m (j 2 ) of each j-th variable xj within the first and second group of observations, respectively. The maximum squared distance between two groups ||z(1)-z(2)||2 = ||(m(1)-m(2))a||2 versus the pooled variance of scores aSpool a defines the objective for linear discriminator:
that defines Fisher famous two-group linear discriminator (up to an arbitrary constant). The same Fisher discriminator (8) can be obtained if instead of the pooled matrix (4) the total matrix of second-moments defined as a cross-product XX of the centered data is used, so the elements of this matrix are:
F=
(1)
( 2)
(1)
( 2)
(3)
with elements of the pooled matrix defined by combined cross-products of both groups:
i =1
i =1 nt
t =1 i =1
(5)
t =1 i =1
n2
( x ji m )( xki m )
(2) j (2) k
( S pool ) jk =
n1 i =1
where mj corresponds to mean value of each xj by total sample of size n. Similarly to transformation known in the analysis of variance, consider decomposition of the crossproduct (9) into several items when the total set of n observations is divided into subsets with sizes nt with t = 1, 2, , T :
( S tot ) jk =
n
( x ji m j )( x ki mk ) =
T
i =1
a (m
m )(m m )a , a S pool a
( S tot ) jk =
( x ji m j )( x ki m k ) ,
(9)
nt
t =1 i =1
nt
( (m (jt ) m j ) x kit ) T t =1
t =1 i =1
( nt (m (jt ) m j )(m kt ) mk ) .
(10)
65
where A is a non-singular square n-th order matrix, u and v are vectors of n-th order, the matrix in the left-hand side (13) is inverted and solution obtained:
1 a = S tot (m (1) m ( 2) ) q
where m (t ) is a vector of mean values m (t ) of j each j-th variable within t-th group, and m is a vector of means for all variables by the total sample. Consider the case of two groups, T=2. Then (11) can be reduced to
m ( 2)
t =1
(S
pool
that is a generalized eigenproblem for the many groups. Denoting the scalar products at the lefthand side (17) as some constants ct =( m (t ) -m)a , . (13) the solution of (22) via a linear combination of Fisher discriminators is presented:
= q (m(1) m (2) )
( A + uv ) 1 = A 1
A 1uv A 1 , 1 + u A 1v
(14)
In the case of two groups we have simplification (12) that reduces the eigenproblem (17) to the solution (8). But the discriminant functions in
a=
T t =1
1 ct nt S pool (m (t ) m) .
n1 m
+ n2 m n1 + n2
(1)
( 2)
+ n2 m
( 2)
n1m (1) + n 2 m ( 2) n1 + n 2
n m (1) + n2 m ( 2) 1 n1 + n 2
t =1
S tot = S pool +
nt ( m
(t )
m)(m
(t)
m) , (11)
1 + h (m
(1)
S 1 (m (1) m ( 2) ) . pool
Comparison of (8) and (15) shows that both discriminant functions coincide (up to unimportant in LDA constant in the denominator (15)), so we can use Stot instead of Spool . This feature of proportional solutions for the pooled or total matrices holds for more than two classification groups as well. Consider a criterion of maximizing ratio (3) of betweengroup to the within-group variances for many groups. Using the relation (11) yields:
n1m (1) + n2 m ( 2) n1 + n2
F=
(12)
. (16)
(18)
66
multi-group LDA with the pooled matrix or the total matrix in (17) are the same (up to a normalization) a feature similar to two group LDA (15). To show this, rewrite (17) using (16) in terms of these two matrices as a generalized eigenproblem:
a = ( X X ) 1 X y .
(24)
(19)
to
(S
1 pool
(16) with the total matrix in denominator, another generalized eigenproblem is obtained:
Matrix of the second moments XX in (23) for the centered data is the same matrix S tot (9). If the dependent variable y is binary, then the vector Xy is proportional to the vector of differences between mean values by two groups m (1) m ( 2) , and solution (24) is proportional to the solution (15) for the discriminant function defined via S tot . As it was shown in (15), the results of LDA are essentially the same with both S tot or S pool matrices. Although the Fisher discriminator can be obtained in regular linear regression of the binary group indicator variable by the predictors, a linear regression with binary output can also be interpreted as a Fisher discriminator. Predictions z=Xa (21) by the regression model are proportional to the classification (1) by the discriminator (15). Regression as an Aggregate of Discriminators Now, the regression is described by data segments presented via an aggregate of discriminators. Suppose the data are segmented; for instance, the segments are defined by clustering the independent variables, or by several intervals within a span of the dependent variable variation. Identify the segments by index t =1,,T to present the total second-moment X matrix S tot = X as the sum (11) of the pooled second-moment matrix S pool and the total of outer products for the vectors of deviations of each segments means from the total means. Using the relation (11), the normal system of equations (23) for linear regression is represented as follows:
(20)
with eigenvalues and eigenvectors b in this 1 case. Multiplying S pool by the relation (20), it is
1 represented as ( S pool S tot ) b = (1 /(1 )) b . Both
problems (19) and (20) are reduced to the 1 eigenproblem for the same matrix S pool S tot with the eigenvalues connected as (1+)(1-)=1 and with the coinciding eigenvectors a and b. Now, consider some properties of linear regression related to discriminant analysis. Multiple regression can be presented in a matrix form as a model:
y = Xa+ ,
(21)
where Xa is a vector of theoretical values of the dependent variable y (corresponding to the linear aggregate z (1)), and denotes a vector of errors. The Least Squares objective for minimizing is:
LS =
= ( y Xa)( y Xa)
= y y 2aX y + aX Xa
. (22)
t =1
S pool +
nt (m (t ) m)(m (t ) m) a = X y .
(25)
67
t =1 i =1
where St are the matrices of second moments within each t-th segment. Introducing the constants
t =1
t =1
t =1
1 = S pool
1 = S pool X y
S 1 ( m (1) m ( 2 ) ) , pool
(32) where h is the same constant as in (12). It can be seen that the vector of coefficients for twosegment regression, similarly to the general solution (29), equals the main part apool (30) minus a constant (in the parentheses at the righthand side (32) multiplied by the discriminator (8).
case the sum in (25) coincides with the total second-moment matrix, so the regular regression
h( m (1) m ( 2 ) ) S 1 X y pool
so the vector apool corresponds to the main part of the total vector in (29) of the regression coefficients defined via the pooled matrix (26), and additional vectors at correspond to Fisher discriminators (8) between each t-th particular segment and total data set. Decomposition (29) shows that regression coefficients a consist of the part apool and a linear aggregate (with weights ntct) of Fisher discriminators at of the segments versus total data. It is interesting to note that if to increase number of segments up to the number of observations (T=n, with only one observation in each segment) then each variables mean in any segment coincides with the original observation ( ( itself, mkt ) = x kit ) , so S pool = 0 in (26). In this
a pool
T t =1
nt ct at
1 a = S pool X y
T t =1
nt ct S 1 (m (t ) m) pool
.
(29)
(30)
Thus, for T segments there are only T-1 independent discriminators. Consider a simple case of two segments in data. In difference to the described two-group LDA problem (12)-(15) and its relation to the binary linear regression (24), we can have a nonbinary output, for instance, a continuous dependent variable. Using the derivation (12)(15) for the inversion of the matrix of the normal system of equations (25), the solution (29) is obtained for two-segment linear regression in explicit form:
1 a = S tot X y 1 hS pool (m (1) m ( 2 ) )( m (1) m ( 2 ) ) S 1 pool
S pool a = X y
nt ct ( m ( t ) m) .
(28)
= S 1 pool
defined similarly to those in derivation (17)-(18), reducing the system (25) to:
t =1
1 = S pool
T t =1
nt (m(t ) m) nt m (t ) m
T
ct = (m (t ) m)a ,
t =1
S pool =
nt
St ,
(26)
solution can be seen as an aggregate of the discriminators by each observation versus total vector of means. The obtained decomposition (29) is useful for interpretation, but it still contains the unknown parameters ct (27) that need to be estimated. First, notice that the Fisher discriminators at (30) of each segment versus entire data, are restricted by the relation:
T
(27)
nt at =
T t =1
1 nt S pool (m (t ) m)
(31)
nt = 0
X y
68
where y = y ~ pool is a vector of difference y between empirical and predicted by pooled variance theoretical values of the dependent variable. The relation (36) is also a model of regression of the dependent variable y by the
where A is a non-singular matrix and u1v1 + u 2 v 2 are two outer products of vectors. The derivation for the inverted matrix of such a structure is given in the Appendix. In this case, the system (25) can be presented in the notations:
A = S pool , u1 = v1 = n1 (m (1) m) , u2 = v2 = n2 (m(2) m)
(33) Applying the formula (A16) with definitions (33), we obtain solution of the system (25) for three segments. In accordance with the relations (29)(31), this solution is expressed via the vector apool and two Fisher discriminators. In a general case of any number T of segments, the parameters ct in the decomposition (29) can be obtained in the following procedure. Theoretical values of the dependent variable are predicted by the regression model (28) as follows:
y new predictors - the Fisher classifications ~t (35). This regression can be constructed in the Least Squares approach (22)-(24). In difference to the regression (21) by possibly many independent x variables, the model (36) contains y just a few regressors ~t , because a number of segments is usually small. Regression decomposition (25)-(35) uses the segments within the independent variables, that is expressed in presentation of the total second-moment matrix of x-s at the left-hand side (25) via the pooled matrix of x-s (26). However, there is also a vector Xy of the x-s cross-products with the dependent variable y at the right-hand side of normal system of equations (25). The decomposition of this vector can also be performed by the relations (10)-(11). Suppose, we use the same segments for all x-s and y variables, then: X y ( X y )tot = ( X y ) pool +
where y
(t )
T
y = X a = XS 1 X y pool
t =1
(35) All the vectors in (35) can be found from the data, so using ~ (34) in the regression (21), the y model is reduced to:
where xj is a column of observations for the j-th variable in the X matrix. Using the presentation (37)-(38) in place of the vector Xy in (29)-(30) yields a more detailed decomposition of the vector apool by the segments within the dependent variable data. In the other relations (32), or (34)(35), this further decomposition can be used as
ct yt
t =1
(34)
dependent variable in each t-th segment and the total mean. The elements of the vector ( X ) pool y in (37) are defined due to (10)-(11) as:
( x j y ) pool =
t =1 i =1
T 1
T 1
nt (m(t ) m)( y (t ) y )
t =1
nt
t =1
Another analytical result can be obtained for three segments in data, when a general solution (29) contains two discriminators. For this case we extended the Sherman-Morrison formula (14) to the inversion of a matrix A + u1v1 + u 2 v 2 ,
y =
ct ~t + , y
(36)
(37)
69
problem with its inversion. At the same time the pooled matrix obtained as a sum of segmented matrices (26), is usually less ill-conditioned. The numerical simulations showed that the condition numbers of the pooled matrices are regularly many times less than these values of the related total second-moment matrices. It means that working with a pooled matrix in (30) yields more robust results, not as prone to multicollinearity effects as in a regular regression approach. Numerical example Consider an example from a real research project with 550 observations, where the dependent variable is customer overall satisfaction with a bank merchants services, and the independent variables are: x1 satisfaction with the account set up; x2 satisfaction with communication; x3 satisfaction with how sales representatives answer questions; x4 satisfaction with information needed for account application; x5 satisfaction with the account features; x6 satisfaction with rates and fees; x7 satisfaction with time to deposit into account. All variables are measured with a ten-point scale from absolutely non-satisfied to absolutely satisfied (1 to 10 values). The pair correlations of all variables are positive. The data is considered in three segments of non-satisfied, neutral, and definitely satisfied customers, where the segments correspond to the values of the dependent variable from 1 to 5, from 6 to 9, and 10, respectively. Consider the segments contribution into the regression coefficients and into the total model quality. The coefficients of regression for the standardized variables are presented in the last column of Table 1. The coefficient of multiple determination for this model is R2=0.485, and Fstatistics equals 73.3, so the quality of the regression is good. The first four columns in Table 1 present inputs to the coefficients of regression from the pooled variance of the independent variables combined with the pooled variance of the dependent variable and three segments (37)-(38). The sum of these items in the next column comprises the pooled subtotal apool (30).
nt ct (m m)
(t )
t =1
where the vectors by segments and the constants are defined as:
1 at = S pool (m (t ) m) , t = nt ( y (t ) y c t ) .
Thus, the solution (29)-(30) is in this case reduced to the linear combination of discriminant functions at with the weights t , without the apool input. This solution corresponds to the classification (18) by several groups in discriminant analysis. The parameters t can be estimated as it is described in the procedure (32)(36). If we work with a centered data, a vector of total means by x-variables m = 0 and the mean value y = 0 , so these items can be omitted in all the formulae. A useful property of the solution (30) consists in the inversion of the pooled matrix S pool instead of inversion of the total matrix
t =1
t =1
nt (m (t ) m)( y (t ) y )
t at ,
(39)
(40)
70
Table 1. Regression Decomposition by the Items of Pooled Variance and Discriminators. Fisher Regression Pooled Variance of Predictors Discriminators Total Pooled Segment Segment Segment Pooled Segment Segment 1 2 3 Subtotal 1 3 Dependent .116 .026 .015 .064 .222 -.011 -.044 .166 .007 .149 .001 .049 .206 -.064 -.034 .108 .008 .232 -.006 .048 .282 -.100 -.033 .149 -.035 .005 .021 .077 .068 -.002 -.053 .013 .039 .101 -.016 -.028 .096 -.044 .019 .072 .054 .325 .012 .142 .533 -.141 -.098 .294 .048 .102 .018 .095 .262 -.044 -.065 .153 Table 2. Regression Decomposition by Segments. Core Input Net Variable Coefficient Effect x1 .131 .072 x2 .008 .005 x3 .003 .001 x4 -.014 -.006 x5 .023 .008 x6 .066 .037 x7 .065 .026 2 R .143 R2 share 29% Segment 1 Segment3 Net Net Coefficient Effect Coefficient Effect .015 .008 .020 .011 .084 .046 .015 .008 .131 .069 .015 .008 .003 .001 .024 .011 .057 .020 -.009 -.003 .184 .103 .044 .025 .058 .023 .030 .012 .271 .071 56% 15% Regression Total Net Coefficient Effect .166 .091 .108 .059 .149 .078 .013 .006 .072 .025 .294 .165 .153 .061 .485 100%
Variable x1 x2 x3 x4 x5 x6 x7
71
multiple determination defined by the scalar product of the standardized coefficients of regression aj and the vector of pair correlations ryj of the dependent variable and each j-th independent variables, so ryj=(Xy)j. Items ryjaj in total R2 are called the net effects of each predictor: R 2 = ry1 a1 + ry 2 a 2 + ...ryn a n . The net effects for core, two segment items, and their total (that is equal to the net effects obtained by the total coefficients of regression) are shown in Table 2. The net effects can be also used for finding the important predictors in each component of total regression. Summing net effects within their columns in Table 2 yields a splitting of total R2 =.485 into its core (R2 =.143), segment-1 (R2 =.271), and segment-3 (R2 =.071) components. In the last row of Table 2 we see that the core and two segments contribute to total coefficient of multiple determination by 29%, 56%, and 15%, respectively. Thus, the main share in the regression is produced by segment-1 of the dissatisfaction influence. Conclusion Relations between linear discriminant analysis and multiple regression modeling were considered using decomposition of total matrix of second moments of predictors into pooled matrix and outer products of the vectors of segment means. It was demonstrated that regression coefficients can be presented as an aggregate of several items related to the pooled segments and Fisher discriminators. The relations between regression and discriminant analyses demonstrate how a total regression model is composed of the regressions by the segments with possible opposite directions of the dependency on the predictors. Using the suggested approach can provide a better understanding of regression properties and help to find an adequate interpretation of regression results. References Anderson T. W. (1958) An introduction to multivariate statistical analysis. New York: Wiley and Sons.
72
Arminger, G., Clogg C. C., & Sobel M. E., (Eds.) (1995). Handbook of statistical modeling for the social and behavioral sciences. London: Plenum Press. Bartlett M. S. (1938). Farther aspects of the theory of multiple regression. Proceedings of the Cambridge Philosophical Society, 34, 33-40. Blyth, C. R. (1972). On Simpsons paradox and the sure-thing principle. Journal of the American Statistical Association, 67, 364366. Conklin, M., Powaga, K., & Lipovetsky, S. (2004). Customer satisfaction analysis: identification of key drivers. European Journal of Operational Research, 154/3, 819-827. Dillon, W. R., & Goldstein, M. (1984) . Multivariate analysis: methods and applications. New York: Wiley and Sons. Fisher, R. A. (1936) The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7, 179-188. Good, I. J., & Mittal, Y. (1987). The amalgamation and geometry of two-by-two contingency tables. The Annals of Statistics, 15, 694-711. Hand, D. J. (1982). Kernel discriminant analysis. New York: Research Studies Press. Hastie, T., Tibshirani, R., & Buja, A. (1994). Flexible discriminant analysis by optimal scoring. Journal of the American Statistical Association, 89, 1255-1270. Harville, D. A. (1997) Matrix algebra from a statistician perspective. New York: Springer. Holland, P. W., & Rubin D. B. (1983). On Lord's paradox. In: H. Wainer & S. Messick (Eds.) Principals of Modern Psychological Measurement, 3-25. Hillsdale, NJ: Lawrence Earlbaum. Hora, S. C., & Wilcox, J. B. (1982). Estimation of error rates in several-population discriminant analysis. Journal of Marketing Research, 19, 57-61. Huberty, C. H. (1994). Applied discriminant analysis. New York: Wiley and Sons. Kendall, M. G. & Stuart, A. (1966). The advanced theory of statistics, vol. III. London: Griffin. Lachenbruch, P. A. (1979). Discriminant analysis. Biometrics, 35, 69-85.
73
q11 = v1 A1u1 , q12 = v1 A1u2 , q21 = v2 A1u1 , q22 = v2 A1u2 , c1 = v1 A1b, c2 = v2 A1b.
(A1) (A7) Considering equations (A6) by the elements of vector u1 and by the elements of vector u2, we obtain a system with two unknown parameters k1 and k2:
( A + uv ) 1 = A 1
A uv A 1 + u A 1v
1
singular matrix of n-th order, and u1v1 + u 2 v 2 is a matrix of the rank 2, arranged via two outer products u1 v1 and u 2 v of the vectors of n-th 2 order. Suppose we need to invert such a matrix to solve a linear system: ( A + u1v1 + u 2 v )a = b , 2
(A2)
where a is a vector of unknown coefficients and b is a given vector. Opening the parentheses, we get an expression:
Aa + u1 k1 + u 2 k 2 = b ,
(A3)
k1 = (v1a ) , k 2 = (v 2 a) ,
Solution a can be found from (A3) as:
(A4)
a = A1
a = A b k1 A u1 k 2 A u 2 .
(A5) with the constants defined in (A7). The expression in the figure parentheses (A11) defines the inverted matrix of the system (A2). It can be easily proved by multiplying the matrix in (A2) by the matrix in (A11), that yields the uniform matrix. In a simple case when both pairs of the vectors are equal, or u1v1 = u 2 v 2 ,
Substituting the solution (A5) into the system (A2) and opening the parentheses yields a vector equation:
(A6)
where the following notations are used for the known constants defined by the bilinear forms:
they can be denoted as u1v1 = u 2 v 2 = 0.5uv , and the expression (A12) reduces to the formula (A1). We can explicitly present the inverted matrix (A11) as follows:
where k1 and k2 are unknown parameters defined as scalar products of the vectors:
is well known in various theoretical and practical statistical evaluations. It is convenient to use 1 when the inverted matrix A is already known, so the inversion of A + uv can be expressed via A 1 due to the formula (A1). We extend this formula to the inversion of a matrix with two pairs of vectors. Consider a matrix A + u1v1 + u 2 v 2 , where A is a square non-
(A8)
(A9)
= (1 + q11 )(1 + q22 ) q12 q21 = (1 + v1 A1u1 )(1 + v2 A1u2 )-(v1 A1u2 )(v2 A1u1 )
(A10)
74
( A + u1v1 + u 2 v2 ) 1 = A 1
(A12) For the important case of a symmetric matrix A, each of the bilinear forms (A7) can be equally presented by the transposed expression, for instance,
A 1 (u1u u 2 u1 ) A 1 (v1v v 2 v1 ) A 1 2 2
(A15) with the determinant defined in (A10). In a special case of the outer products of each vector by itself, when u1 = v1 and u 2 = v 2 , the formula (A15) transforms into:
q11 = v1 A1u1 = u1 A1v1, q12 = v1 A1u2 = u2 A1v1, q21 = v2 A1u1 = u1 A1v2 , q22 = v2 A1u2 = u2 A1v2 .
Using the property (A13) we simplify the numerator of the second ratio in (A12) to following: (A13)
A 1 (u1u1 + u 2 u 2 ) A 1
A1u1u A1v1v2 A1 + A1u2u1 A1v2v1 A1 2 A1u1u2 A1v2 v1 A1 A1u2 u1 A1v1v2 A1 (A14) = A1 (u1u2 u2u1 ) A1 (v1v2 v2 v1 ) A1.
A 1u1v 2 A 1u 2 v1 A 1 A 1u 2 v1 A 1u1v A 1 2
( A + u1v1 + u 2 v ) 1 = A 1 2 A 1 (u1v1 + u 2 v 2 ) A 1
Journal of Modern Applied Statistical Methods May, 2005, Vol. 4, No.1, 75-80
Local Power For Combining Independent Tests in The Presence of Nuisance Parameters For The Logistic Distribution
W. A. Abu-Dayyeh Z. R. Al-Rawi M. MA. Al-Momani
Department of Statistics, Faculty of Science Yarmouk University Irbid-Jordan
Four combination methods of independent tests for testing a simple hypothesis versus one-sided alternative are considered viz. Fisher, the logistic, the sum of P-values and the inverse normal method in case of logistic distribution. These methods are compared via local power in the presence of nuisance parameters for some values of using simple random sample. Key words: combination method; independent tests; logistic distribution; local power; simple random sample; nuisance parameter.
Introduction Combining independent tests of hypotheses is an important and popular statistical practice. Usually, data about a certain phenomena comes from different sources in different times, so we want to combine these data to study such phenomena. Many authors have considered the problem of combining (n) independent tests of hypotheses. For simple null hypotheses, Little and Folks (1971), studied four methods for combining a finite number of independent tests. They found that the Fisher method is better than the other three methods via Bahadur efficiency. Again, Little and Folks (1973) studied all methods of combining a finite number of independent tests and thy found that the Fisher's method is optimal under some mild conditions.
W. A. Abu-Dayyeh, Department of Mathematical Sciences, Dhahran, Saudi Arabia M. MA. Al-Momani, Department of Mathematical Sciences, Dhahran, Saudi Arabia Z. R. Al-Rawi Chairman, Department of Statistics Yarmouk University, Irbid, Jordan For correspondence regarding this article, send Email to [email protected]. This work was carried out with financial support from the Yarmouk University Research Council.
Brown, Cohen and Strawderman (1976) have shown that such all tests form a complete class. Abu-Dayyeh and Bataineh (1992) showed that the Fisher's method is strictly dominated by the sum of P-values method via Exact Bahadur Slop in case of combining an infinite number of independent shifted exponential tests when the sample size remains finite. Also, Abu-Dayyeh (1992) showed that under certain conditions that the local limit of the ratio of the Exact Bahadur efficiency of two tests equivalent to the Pitman efficiency between the two tests where these tests are based on sum of iid r.vs. Again AbuDayyeh and El-Masri (1994) studied the problem of combining (n) independent tests as (n ) in case of triangular distribution using six methods viz. sum of P-values, inverse normal, logistic, Fisher, minimum of P-values and maximum of P-values. They showed that the sum of P-values is better than all other methods. Abu-Dayyeh (1997) extended the definition of the local power of tests to the case of having nuisance parameters. He derived the local power for any symmetric test in the case of a bivariate normal distribution with known correlation coefficient, and then he applied it to the combination methods. Specific Problem Suppose there is (n) simple hypotheses:
i 0i i
75
H0(i) :
vs
H1(i) :
>
0i
i=1,2,,n (1)
76
Where 0i is known for i=1,2,,n and H0(i) is rejected for sufficiently large values of some continuous real valued test statistic T(i) , i=1,2,,n and we want to combine the (n) hypotheses into one hypothesis as follows:
2 n 01 02
vs
Many methods have been used for combining several tests of hypotheses into one overall test. Among these methods are the nonparametric (omnibus) methods that combine the P-values of the different tests. The P-value of the i-th hypothesis is given by:
Pi = Pi ) (T (
H0 (i)
(i )
t ) = 1 Fi ) (t ) (
H0 (i) (i)
where FH0 (t) is the cdf of T under H0 . Note that Pi ~ U(0,1) under H0(i). Considered in this article is the case of = i , where 1 , 2 , ..., r 0 fixed constants and is the unknown parameter.
i
c = (2 ), (1 ) 4
A(2)
Then T (1) , T ( 2 ) , ..., T ( r ) are independent r.vs such that for i = 1,2,.., r and we want to test
E (1 , 2 ) (L )
(4)
following 1 =
and therefore considered is the problem of combining a finite number of independent tests by looking at the Local Power of tests which is defined for a test by:
A(3)
LP () = inf
E ()
=0
(5)
E (1 , 2 ) (S )
where
0 , = ( 1 , 2 ,..., r ), i 0 , i = 1,2,..., r , in
case of logistic distribution. Compared (5) for the four methods of combining tests for the location family of distributions when r = 2 and A(4)
H0 : = 0 H1 : > 0 `
vs
KL =
(y 1 + e ) y
c
( y 2 )( y 1)
3
1 e c (c + 1)
KS =
c 2 (3 2 c ) , and c = 2 . 6
>
0i
for (2)
H0: ( 1,
, ,
)=(
, ,
0n
(3)
E (1 , 2 ) (F )
a
=0
= K F (1 + 2 ) ,where
a=e
c 2
KF =
1 e
2 y dy , y3
and
y=0
= K L (1 + 2 ) ,
where
(1 e )
c 2
=0
= K S (1 + 2 ) ,where
77
=0
= K N (1 + 2 ) ,where
y2 y3
E ( S ) (1 , 2 ,3 )
=0
1 a= and c = 2 1 (1 ) . ( c )
Proofs of the previous lemma are similar to proofs of lemma 2, so we will not write it. Lemma 2 Let X 1 , X 2 , X 3 be independent r.vs such that X i ~ Logistic ( i ,1) for i = 1,2,3 . Then B(4)
=0
= K N (1 + 2 + 3 )
where
KN =
ab
=0
i =1
= K F (1 + 2 + 3 )
a
where y2 y3
a = ( c ) ,
dy
KF =
1 e
1 c
B(3)
1 =
(u 1)(v 1) c 1 1 (u 1)(v 1) + e
KL =
1 v3
du dv ,
where f ( xi i ) is the
1 1 u2 v2
du dv
E (1 , 2 ,3 ) ( F )
= 1
(1 F ) f ( xi i ) dxi
i =1
2 y 1 + c ln ( y ) 2
c = 3 1 (1 ) .
Now, we will prove just B(1), because the proof of the others can be done in the same way. Proof of B(1):
E ( , = , ) ( F )
E (1 , 2 ,3 ) ( F ) B(1)
= KF
11
b = c 1 (v ) ,
F f ( xi i ) dxi ,
i =1
1 c
1 y
dy ,
KS =
E ( N ) (1 , 2 , 3 )
i =1
= KS
i = K S (1 + 2 + 3 ) where
c 3 (2 c ) and c = 3 6 . 12
3 i =1
= KN
and
78
so,
E ( F ) (1 , 2 ,3 )
1 (1 F ) f ( xi i ) dxi i =1
3
that x 2 ln
(e x + 1) 1
3
e 2
=0
\
Let a = ln e 2 1 ,
dx1dx2 dx3
b = ln
i e
xi
d = ln
get
By symmetric of xi we have
(e x + 1)(e x + 1) 1
2 3
e 2
where
1F =
i =1
0,
that x1 ln also
1+ e
I2 =
x2 2
(e x + 1)(e x + 1)
2 3
e 2
1 ,
Also,
b
1+ e 2 ln ( p1 ) 2 ln ( p 2 ) 2 ln ( p3 ) c implies
pi =
1
xi
, i = 1,2,3
1, 2
ln ( pi ) c
u = 1 + e x1 to get that I1 =
, so , I1 = 1 e
c 2
2
o.w
(e x + 1)(e x + 1).
3
2 x3 c 2 x2 x3 3
KF =
a b
x2 x2
(1+ e
( e e ) 1 e e +1 e +1 dxdx ( ( )( )) ) (1+ e )
x3 x3 2 1
e x2
(1F )
ex1
x1 2
ex2
x2
(1+e ) (1+e
( e e ) dxdx dx ) (1+e )
x3 2x3 3 2 x3 2
E (1 , 2 , 3 ) (F ) KF =
KF =
=0
i =1
i K F , where
x1 x1 2
x2 x2
(1 + e ) (1 + e
d
( e e ) dx dx dx . ) (1 + e )
x3 2 x3 2 x3 3 1 2 3
Let I1 =
(1 + e )
e x1
x1 2
1 + ed
( xi ) =
( e ) (1 + e )
xi 2 xi
(e x + 1) 1
3
e 2
for i = 1, 2, 3.
and let
( x3 ) + f ( x1 ) f \ ( x2 ) f ( x3 ) + f \ ( x1 ) f ( x2 ) f ( x3 )
f ( x1 ) f ( x2 ) f
E ( F ) (1 , 2 , 3 )
(1 F )
and x3 ln e 2 1 respectively.
, then we will
let
)(
79
= 1 e e
c
c 2
(e
x3
+1
KF =
Finally put
d
y = 1 + e x3 we get
2y
d =e 2
3
i =1
i =1
i =1
under
H0 ,
completes the proof. Also, here for the logistic distribution we will compare the Local Power for the previous four tests numerically. So from tables (1) and (2) when = 0.01 and r = 2 the sum of p-values method is the best method followed by the inverse normal method, the logistic method and Fisher method respectively, but for all of the other values of and r the inverse normal method is the best method followed by the sum of p-values method followed by logistic method and the worst method is Fisher method.
because 2
ln ( pi ) ~ (2 ) 6
then
c = (2 ), (1 ) , which 6
= P 2 0
ln ( pi ) c = 1 P 2 0
KF =
1 e
1+
ln ( pi ) c ,
c ln ( y ) 2
a x3
y2 y3
= 1 e
c (e x + 1) 1 + 2 ln ( e x + 1)
(e
x3
+1
e x2
References
(1 + e )
x2 2
dx2 1 dx x2 ) 2 (1 + e
b
c 2
(e
x3
+1
c ln e x3 + 1 2
dy ,
Abu-Dayyeh, W. A. (1989). Bahadur exact slope, pitman efficiency and local power combining independent tests. Ph.D. Thesis, University of Illinois at Urbana Champaign. Abu-Dayyeh, W. A. (1992). Exact bahadur efficiency. Pakistan Journal of Statistics, 8 (2), 53-61. Abu-Dayyeh, W. A., & Bataineh (1992). Comparing the exact Bahadur of the Fisher and sum of P-values methods in case of shifted exponential distribution. Mutah slopes. Journal for Research and Studies, 8, 119-130. Abu-Dayyeh, W. A., & El-Masri. (1994). Combining independent tests of triangular distribution. Statistics & Probability Letters, 21, 195-202. Abu-Dayyeh, W. A. (1997). Local power of tests in the prescience of nuisance parameters with an application. The Egyptian Statistical Journal ISSR, 41, 1-9. Little, R. C., & Folks, L. J. (1971). Asymptotic optimality of Fishers method of combining independent tests. Journal of the American Statistical Association, 66, 802-806. Little, R. C., & Folks, L. J. (1973). Asymptotic optimality of Fishers method of combining Independent Tests II. Journal of the American Statistical Association, 68, 193-194.
80
KF
0.0073833607 0.0174059352 0.0326662436
KL
0.0081457298 0.0192749938 0.0361783939
KS
0.0090571910 0.0212732200 0.0394590744
KN
0.0089064740 0.0214554551 0.0415197403
KF
0.0062419188 0.0144747833 0.0267771426
KL
0.0071070250 0.0165023359 0.0304639648
KS
0.0080425662 0.0183583839 0.0332641762
KN
0.0083424342 0.0199610766 0.0381565019
Journal of Modern Applied Statistical Methods May, 2005, Vol. 4, No.1, 81-89
Effect Of Position Of An Outlier On The Influence Curve Of The Measures Of Preferred Direction For Circular Data
B. Sango Otieno
Department of Statistics Grand Valley State University
Christine M. Anderson-Cook
Statistical Sciences Group Los Alamos National Laboratory
Circular or angular data occur in many fields of applied statistics. A common problem of interest in circular data is estimating a preferred direction and its corresponding distribution. It is complicated by the wrap-around effect on the circle, which exists because there is no natural minimum or maximum. The usual statistics employed for linear data are inappropriate for directional data, as they do not account for its circular nature. The robustness of the three common choices for summarizing the preferred direction (the sample circular mean, sample circular median and a circular analog of the Hodges-Lehmann estimator) are evaluated via their influence functions. Key words: Circular distribution, directional data, influence function, outlier
Introduction The notion of preferred direction in circular data is analogous to the center of a distribution for data on a linear scale. Unlike in linear data where a center always exists, if data are uniformly distributed around the circle, then there is no natural preferred direction. Therefore, it is appropriate and desirable that all sensible measures of preferred direction are undefined if the sample data are equally spaced around the circle. This article considers estimating the preferred direction for a sample of unimodal circular data. Three choices for summarizing the preferred direction are the mean direction, the median direction (Fisher 1993) and the HodgesLehmann estimate (Otieno & Anderson-Cook, 2003a).
The sample mean direction is a common choice for moderately large samples, because when combined with a measure of sample dispersion, it acts as a summary of the data suitable for comparison and amalgamation with other such information. The sample mean is obtained by treating the data as vectors of length one unit and using the direction of their resultant vector. Given a set of circular observations 1 , . . ., n , each observations is measured as a unit vector with coordinates from the origin of (cos( i ), sin ( i )) , i = 1, . . ., n. The resultant vector of these n unit vectors is obtained by summing them componentwise to get the resultant vector
i =1 i =1
R =
(C
+ S 2 ).
Jamalamadaka and SenGupta (2001), show that the sample circular mean direction is location invariant, that is, if the data are shifted by a certain amount, the value of the sample
81
B. Sango Otieno is Assistant Professor at Grand Valley State University. He is a member of Institute of Mathematical Statistics and Michigan Mathematics Teachers Association. Email: [email protected]. Christine AndersonCook is a Statistician at Los Alamos National Laboratory. She is a member of the American Statistical Association, and the American Society of Quality. Email: [email protected].
sample circular mean is the angle corresponding to the mean resultant vector
R=
R=
cos( i ),
82
circular mean direction also changes by that amount. An alternative, the sample median, can be thought of as the location of the circumference of the circle that balances the number of observations on the two halves of the circle, Otieno and Anderson-Cook (2003b). The sample median direction of angles 1 , . . .,
median, Fisher (1993). A third measure of preferred direction for circular data is the circular Hodges-Lehmann estimate of preferred direction, subsequently referred to as HL. This is the circular median of all pairwise circular means of the data (Otieno & Anderson-Cook, 2003a). As with the linear case, there are three possible methods for calculating
i =1
1 d ( ) = n
is the circular
83
circular r.v is said to have a wrapped normal (WN) distribution if its pdf is
= exp
I ( ) 1 2 , = A( ) = 1 2 I 0 ( )
two have a
0 , < 2
2
and
0 < ,
where
j =0
turn
is
approximately
equivalent
approximately equivalent to N ,
no preferred direction. As increase from zero, f ( ) peaks higher about . The von Mises is symmetric since it has the property f ( + ) = f ( ) , for all , where addition or subtraction is modulo 2 . With the uniform or isotropic distribution, however, the total probability is spread out uniformly on the circumference of a circle; that is, all directions are equally likely. It thus represents the state of no preferred direction. The von Mises is similar in importance to the Normal distribution on the line, (Mardia,1972). When 2 , the von Mises distribution VM ( , ), can be approximated by the Wrapped Normal distribution WN ( , ) , which is a symmetric unimodal distribution obtained by wrapping a normal N , 2 distribution around the circle. A
r =0
are the modified Bessel functions of order zero and order one, respectively. Based on the difficulty in distinguishing the two distributions, Collett and Lewis (1981) concluded that decision on whether to use a von Mises model or a Wrapped Normal model, depends on which of the two is most convenient. The Wrapped Normal distribution WN ( , ) is obtained by wrapping the N , 2 distribution onto the circle, where 2 = 2 log , which implies
that, = exp
is the modified Bessel function of order zero. The concentration parameter, , quantifies the dispersion. If is zero,
and
I 1 ( ) =
4 j j2
[(r + 1)!r!]
1 2
2 r +1
, which in to
I 0 ( ) = (2 ) 1 exp[xos ( )]d =
2j
I 0 ( ) = (2 )
p =1
f W ( ) = (2 ) + 1
1
p cos[ p( )] ,
2
close
84
1 . This approximation is very 2 accurate for > 10 (Mardia & Jupp, 2000). 1 Note 2 = 2 log A( ) and 2 =
are the estimates of approximated by WN ( , ) and N ,
WN , exp
when VM ( , ) is
Consider a circular distribution F which is unimodal and symmetric about the unknown direction 0 . The influence function (IF) for the circular mean direction is given by
IF ( ) =
length
sin ( 0 )
respectively. Figure 1 shows how the WN and N approximations are related for various values of concentration parameter, , using the following approximation,
A( ) 1
1 1 1 2 3 . . . , 2 8 8
Jammalamadaka & SenGupta (2001, p. 290). The circular median is rotationally invariant as shown by Ackermann (1997). Lenth (1981), and, Wehrly and Shine (1981) studied the robustness properties of both the circular mean and median using influence curves, and revealed that the circular mean is quite robust, in contrast to the mean for linear data on the real line. Durcharme and Milasevic (1987), show that in the presence of outliers, the circular median is more efficient than the mean direction. Many authors, including He and Simpson (1992), advocate the use of circular median as an estimate of preferred direction, especially in situations where the data are not from the von Mises distribution. The Hodges-Lehmann estimator, on the other hand is a compromise between the occasionally non-robust circular mean and the more robust circular median. Unlike the circular median which downweights outliers significantly but is sensitive to rounding and grouping (Wehrly & Shine, 1981), the HL estimate downweights outliers more sparingly and is more robust to rounding and grouping. The circular HL estimator has comparable efficiency to mean and is superior to median; see Otieno and Anderson-Cook (2003a). Other properties of this estimate are explored and compared to those of circular mean and circular median in Otieno and Anderson-Cook (2003a). S-Plus or R functions for computing this estimate are available by request from the authors.
derivative are bounded by 1 , see Wehrly and Shine (1981). Another result due to Wehrly and Shine (1981) is the influence function of the circular median. Without loss of generality for notational simplicity, assume that [0, ] . The influence function for the circular median direction is given by
1 sgn ( 0 ) 2 , IF ( ) = [ f ( 0 ) f ( 0 + )]
where f ( 0 ) is the probability density function of the underlying distribution of the data at the hypothesized mean direction 0 , and sgn(x) = 1, 0, or -1 as x > 0, x = 0, or x < 0, respectively. Wehrly and Shine (1981) and Watson (1986) evaluated the robustness of the circular mean via an influence function introduced by Hampel (1968, 1974) and concluded that the estimator is somewhat robust to fixed amounts of contamination and to local shifts, since its influence function is bounded. The influence curve for the circular median, however, has a jump at the antimode. This implies that the circular median is sensitive to rounding or grouping of data (Wehrly & Shine, 1981).
( 0 < < 0 + ) ,
1 2
is
85
observation.
Estimated Variance
0.0
0.2
0.4
0.6
10
15 Concentration Parameter
i j . is equivalent to the pairwise circular mean of i and j , Otieno and AndersonCook,(2003a). The functional of the circular
1 , 2
where F ( ) = P( ) = F (2 )h( )d , Hettmansperger & McKean (1998, p.3,10-11). For a sample from a von Mises distribution with a limited range of concentrated parameter values, 2 , the influence function of the circular HL estimator c HL is given by
cumulative density function of 1 , . . ., n . Note that this influence function is a centered and scaled cdf and is therefore bounded. Note that, it is also discontinuous at the antimode, like the influence function of the circular median. Figure 2 are plots of the influence functions of the circular mean, circular median and the circular HL estimators for preferred direction for various concentration parameters. The range of the data values is
=1 to 8.
+ j )
, and 2 =
20
25
30
IF ( ) =
F ( )
1 2, 1
2
where
F(.)
is
the
radians to 2
86
Kappa = 1
2 1.0 1.5
Kappa = 2
Influence Function
Influence Function
-2
-1
-3
-2
-1
-1.5
-0.5
0.5
-3
-2
-1
Kappa = 4
1.0 1.0
Kappa = 8
Influence Function
Influence Function
0.5
-0.5
0.0
-1.0
-3
-2
-1
-1.0 -3
-0.5
0.0
0.5
-2
-1
Notice that all the estimators have curves which are bounded. Also, as the data becomes more concentrated (with increasing), the influence function of the circular median changes least followed by the circular HL estimator. This is similar to the linear case.
Also, as increases, the bound for the influence function for all the three measures decreases, however, overall the bound of the influence function for the mean is largest for angles closest to
87
occurs at
or
while for both the median and HL, the maximum occurs uniformly for a range away from the preferred direction. Overall, HL seems like a compromise between the mean and the median. A Practical Example Consider the following example of Frog migration data Collett (1980), shown in Figure 3. The data relates the homing ability of the Northern cricket frog, Acris crepitans, as studied by Ferguson, et. al.(1967). A number of frogs were collected from mud flats of an abandoned stream meander and taken to a test pen lying to the north of the collection point. After 30 hours enclosure within a dark environmental chamber, 14 of them were released and the directions taken by these frogs recorded (taking 0 0 to be due North), Table 1. In order to compute the sample mean of these data, consider them as unit vectors, the resultant vector of these 14 unit vectors is obtained by summing them componentwise to
i =1 i =1
by observations nearest the center of the data followed by HL. The influence of an outlier on the sample circular median is bounded at either a constant positive or a constant negative value, regardless of how far the outlier is from the center of the data. On the other hand, the HL estimator is influenced less by observations near the center, and reflects the presence of the outlier. The influence curve for the circular mean is similar to that of the redescending function (See Andrews et. al., 1972 for details). Conclusion Like in the linear case, it is helpful to decide what aspects of the data are of interest. For example, in the case of distributions that are not symmetric or have outliers, like in the case of the Frog migration data, the circular mean and circular median are measuring different characteristics of the data. Hence one needs to choose which aspect of the data is of most interest. For data that are close to uniformly distributed or have rounding or grouping, it is wise to avoid the median since its estimate is prone to undesirable jumps. Either of the other two measures perform similarly. For data spread on a smaller fraction of the circle, with a natural break in the data, the median is least sensitive to outliers. The mean is typically most responsive to outliers, while HL gives some, but not too much weight to outliers. Overall, the circular HL is a good compromise between circular mean and circular median, like its counterpart for linear data. The HL estimator is less robust to outliers compared to the median, however it is an efficient alternative, since it has a smaller circular variance, Otieno and Anderson-Cook, (2003a). The HL estimator also provides a robust alternative to the mean especially in situations where the model of choice of circular data (the von Mises distribution) is in doubt. Overall, the circular HL estimate is a solid alternative to the established circular mean and circular median with some of the desirable features of each.
The sample circular mean is the angle corresponding to the mean resultant vector
R= R =
(C
mean is -0.977
(124 ),
0
concentration parameter, = 2.21 for the best fitting von Mises.(Table A.3, Fisher, 1993, p. 224). The circular median is -0.816 133.25 0 and circular Hodges-Lehmann is -0.969 124.5 0 . Using = 2.21 , Figure 4 gives the influence curves of the mean, median and HL. Note that the measure least influenced by observation x, a presumed outlier, is the circular mean, since x is nearer to the antimode. However, the circular median is influenced most
get R =
cos( i ),
sin ( i ) = (C , S ) , say.
88
O O OO
O O O O O O O Preferred Direction
Figure 4: Influence curves for the three measures for data with a single outlier
h HL d Median m Mean
Influence Function
0.5
1.0
o o oo
d m o o o o oo o o o h
0.0
-1.0
-0.5
-3
-2
-1
89
Jammalamadaka, S. R., & SenGupta, A. (2001). Topics in circular statistics, world scientific. New Jersey. Mardia, K. V. (1972) Statistics of directional data. London: Academic Press. Mardia, K. V., & Jupp, P. E. (2000). Directional statistics. Chichester: Wiley. Otieno, B. S. (2002) An alternative estimate of preferred direction for circular data. Ph.D Thesis., Department of Statistics, Virginia Tech. Blacksburg: VA. Otieno, B. S., & Anderson-Cook, C. M. (2003a). Hodges-Lehmann estimator of preferred direction for circular. Virginia Tech Department of Statistics Technical Report, 03-3. Otieno, B. S., & Anderson-Cook, C. M. (2003b). A More efficient way of obtaining a unique median estimate for circular data. Journal of Modern Applied Statistical Methods, 3, 334-335. Rao, J. S. (1984) Nonparametric methods in directional data. In P. R. Krishnaiah and P. K. Sen. (Eds.), Handbook of Statistics, 4, pp. 755-770. Amsterdam: Elsevier Science Publishers. Stephens, M. A. (1963). Random walk on a circle. Biometrika, 50, 385-390. Watson, G. S. (1986). Some estimation theory on the sphere. Annals of the Institute of Statistical Mathematics, 38, 263-275. Wehrly, T., & Shine, E. P. (1981). Influence curves of estimates for directional data. Biometrika, 68, 334-335.
Journal of Modern Applied Statistical Methods May, 2005, Vol. 4, No. 1, 90-99
Harry Khamis
Statistical Consulting Center Wright State University
The hazard ratio estimated with the Cox model is investigated under proportional and five forms of nonproportional hazards. Results indicate that the highest bias occurs for diverging hazards with early censoring, and for increasing and crossing hazards under a high censoring rate. Key words: censoring proportion, proportional hazards, random censoring, survival analysis, type I censoring
Introduction In recent decades, survival analysis techniques have been extended far beyond the medical, biomedical, and reliability research areas to fields such as engineering, criminology, sociology, marketing, insurance, economics, etc. The study of survival data has previously focused on predicting the probability of response, survival, or mean lifetime, and comparing the survival distributions. More recently, the identification of risk and/or prognostic factors related to response, survival, and the development of a certain condition has become equally important (Lee, 1992). Conventional statistical methods are not adequate to analyze survival data because some observations are censored, i.e., for some observations there is incomplete information about the time to the event of interest. A common type of censoring in practice is Type I censoring, where the event of interest is observed only if it occurs prior to some pre-
Harry Khamis, Statistical Consulting Center Wright State University, Dayton, OH. 45435. Email him: [email protected].
specified time, such as the closing of the study or the end of the follow-up. The most common approach for modeling covariate effects in survival data uses the Cox Proportional Hazards Regression Model (Cox, 1972), which takes into account the effect of censored observations. As the name indicates, the Cox model relies on the assumption of proportional hazards, i.e., the assumption that the effect of a given covariate does not change over time. If this assumption is violated, then the Cox model is invalid and results deriving from the model may be erroneous. A great number of procedures, both numerical and graphical, for assessing the validity of the proportional hazards assumption have been proposed over the years. Some of the procedures require partitioning of failure time, some require categorization of covariates, some include a spline function, and some can be applied to the untransformed data set. However, no method is known to be definitively better than the others in determining nonproportionality. Some authors recommended using numerical tests, e.g., Hosmer and Lemeshow (1999). Others recommended graphical procedures, because they believe that the proportional hazards assumption only approximates the correct model for a covariate and that any formal test, based on a large enough sample, will reject the null hypothesis of proportionality (Klein & Moeschberger, 1997, p. 354). Power studies to compare some numerical tests have been performed; see, e.g.,
90
91
procedure followed by a steady decline in risk as the patient recovers (see, e.g., Kline & Moeschberger, 1997). In the Cox model, the relation between the distribution of event time and the covariates z (a p x 1 vector) is described in terms of the hazard rate for an individual at time t:
(t)
lim
t 0
where T is the random variable under study: time until the event of interest occurs. Thus, for small t, (t) t is approximately the conditional probability that the event of interest occurs in the interval [t, t + t], given that it has not occurred before time t. There are many general shapes for the 0. hazard rate; the only restriction is (t) Models with increasing hazard rates may arise when there is natural aging or wear. Decreasing hazard functions are less common, but may occur when there is a very early likelihood of failure, such as in certain types of electronic devices or in patients experiencing certain types of transplants. A bathtub-shaped hazard is appropriate in populations followed from birth. During an early period deaths result, primarily from infant diseases, after which the death rate stabilizes, followed by an increasing hazard rate due to the natural aging process. Finally, if the hazard rate is increasing early and eventually begins declining, then the hazard is termed humpshaped. This type of hazard rate is often used in modeling survival after successful surgery, where there is an initial increase in risk due to infection or other complications just after the
where 0(t) is the baseline hazard rate, an unknown (arbitrary) function giving the value of the hazard function for the standard set of conditions z = 0, and is a p x 1 vector of unknown parameters. The partial likelihood is asymptotically consistent estimate of (Andersen & Gill, 1982; Cox, 1975, and Tsiatis, 1981). The ratio of the hazard functions for two individuals with covariate values z and z* is (t,z)/ (t,z*) = exp[ '(z z*)], an expression that does not depend on t. Thus, the hazard functions are proportional over time. The factor exp( 'z) describes the hazard ratio for an individual with covariates z relative to the hazard at a standard z = 0. The usual interpretation of the hazard ratio, exp( 'z), requires that (1) holds. There is no clear interpretation if the hazards are not proportional. Of principal interest in a Cox regression analysis is to determine whether a given covariate influences survival, i.e. to estimate the hazard ratio for that covariate. The behavior of the hazard ratio estimated with the Cox model when the underlying assumption of proportional hazards is false (i.e., when the hazards are not proportional) is investigated in this paper. To assess the Cox estimates under nonproportional hazards, the estimates are compared to an exact calculation of the geometric average of the hazard ratio described in the next section. An average hazard ratio does not reflect the truth exactly since the hazard ratio is changing with time when the proportionality assumption is not in force. However, it can provide an approximate standard against which to compare the Cox model estimates. Because the estimation of the hazard ratio from the Cox model cannot be done analytically (Klein & Moeschberger, 1997), the comparison is made by simulations.
(t,z) =
0(t)exp(
'z),
(1)
92
Average hazard ratio The average hazard ratio (AHR) is defined as (Kalbfleisch & Prentice, 1981):
0
(W) =
0 1 1 2 2
d=
Methodology Simulation strategy The hazard ratio estimates from the Cox model are evaluated under six scenarios: (1) proportional hazards, (2) increasing hazards, (3) decreasing hazards, (4) crossing hazards,
For early censoring, a percentage of the lifetimes are randomly chosen and multiplied by a random number generated from the uniform distribution. The percentage chosen is the same as the censoring proportion. The parameters of the uniform distribution are chosen so that the censoring times are short in order to achieve the effect of early censoring. For late censoring, a percentage of the longest lifetimes are chosen; this percentage is slightly larger than the censoring proportion. Of those lifetimes, a percentage corresponding to the censoring time
When the parametric forms of the survivor functions are unknown, the AHR (2) can still be used; in this case, the Kaplan-Meier product-limit estimates for the two groups are used as the survivor functions (Kaplan & Meier, 1958). However, (2) then only holds for uncensored data. The AHR function for censored data can be found in Kalbfleisch and Prentice, 1981.
- [(
)/(
t=
ts tc
if t s t c if t s > t c
where 1(t) and 2(t) are the hazard functions of two groups and W(t) is a survivor or weighting function. The weight function can be chosen to reflect the relative importance attached to hazard ratios in different time periods. Here, W(t) depends on the general shape of the failure time distribution and is defined as W(t) = S1 (t)S2 (t), where S1(t) and S2(t) are the survivor functions (i.e., one minus the cumulative distribution function) for the two groups, and > 0. The value = weights the hazard ratio at time t according to the geometric average of the two survivor functions. Values of > will assign greater weight to the early times while < assigns greater weight to later times. Here, = will be used. For Weibull distributed lifetimes with and shape parameter , the scale parameter survival function is S(t) = exp[-( t) ] and the AHR estimator (2) can be written
(W)
= - [ 1(t)/ 2(t)]dW(t),
(2)
93
censoring the estimate is generally unbiased regardless of sample size or censoring proportion. Decreasing Hazards Survival times are generated from the Weibull distribution where =0.9, =1 for group 1, and =0.75, =3 for group 2. The AHR is 0.44 for this situation. The percent of the bias for the mean Cox model estimate relative to the AHR is given in Table 3. The Cox estimates fall below the AHR. These estimates decrease slightly with increasing censoring proportion. The estimates for early censoring are slightly less biased than for random or late censoring at the higher censoring proportions. The bias is not heavily influenced by sample size. Crossing Hazards Survival times are generated from the Weibull distribution where =2.5, =0.3 for group 1, and =0.9, =2 for group 2. The AHR is 15.4 for this situation. The percent of the bias for the mean Cox model estimate relative to the AHR is given in Table 4. The bias of the Cox estimates tends to be much smaller for 10% and 25% censoring proportions compared to the 50% censoring proportion. For 50% censoring, the Cox model tends to overestimate the AHR. The bias decreases with increasing sample size, especially for high censoring proportions. Diverging Hazards Survival times are generated from the Weibull distribution where =0.9, =1.0 for group 1, and =1.5, =2 for group 2. The AHR is 0.536 for this situation. The percent of the bias for the mean Cox model estimate relative to the AHR is given in Table 5. The Cox estimates are larger for random and late censoring than for early censoring at the highest censoring proportion. Generally, the sample size has little effect on the bias. For early censoring, the percent bias is approximately 20% and is not strongly affected by sample size or censoring proportion.
94
Table 1. Proportional Hazards: percent bias of Cox model estimates relative to average hazard rate of 2.0. Sample Size per Group
Censoring Random
% Censored 10% 25% 50% 10% 25% 50% 10% 25% 50%
100 2.0 2.0 3.0 2.5 3.5 7.0 2.0 2.5 3.5
Early
Late
Table 2. Increasing Hazards: percent bias of Cox model estimates relative to the average hazard ratio of 1.20.

Sample Size      Random                  Early                   Late
per Group     10%    25%    50%      10%    25%    50%      10%    25%    50%
30           -6.7   -9.2  -15.0     -4.2   -4.2   -1.7     -7.5  -12.5  -20.8
50           -7.5  -10.8  -17.5     -5.8   -5.8   -5.0     -9.2  -14.2  -22.5
100          -8.3  -10.8  -18.3     -6.7   -5.8   -5.8    -10.0  -15.0  -23.3
Table 3. Decreasing Hazards: percent bias of Cox model estimates relative to the average hazard ratio of 0.441.

Sample Size      Random                  Early                   Late
per Group     10%    25%    50%      10%    25%    50%      10%    25%    50%
30           -2.0   -4.3   -9.5     -1.4   -2.7   -5.4     -2.0   -4.9  -10.9
50           -3.2   -5.7  -11.3     -2.5   -3.6   -5.9     -3.4   -6.8  -12.9
100          -3.2   -5.9  -12.2     -2.3   -3.6   -6.6     -3.6   -7.3  -13.8
Table 4. Crossing Hazards: percent bias of Cox model estimates relative to the average hazard ratio of 15.4.

Sample Size      Random                  Early                   Late
per Group     10%    25%    50%      10%    25%    50%      10%    25%    50%
30            5.8   19.5   73.3      1.3    9.1   32.5     -1.9   -0.6  100.6
50           -7.1    4.5   52.6    -11.0   -5.2    8.4    -12.9   -5.8   81.8
100         -14.9   -5.2   34.4    -18.8  -15.6   -6.5    -19.5   -8.4   67.5
Table 5. Diverging Hazards: percent bias of Cox model estimates relative to the average hazard ratio of 0.536.

Sample Size      Random                  Early                   Late
per Group     10%    25%    50%      10%    25%    50%      10%    25%    50%
30          -16.2  -10.4    7.8    -19.0  -19.0  -18.8    -16.4   -6.9   18.5
50          -18.3  -12.9    3.7    -20.9  -21.3  -21.8    -18.5   -9.3   13.9
100         -19.2  -14.2    1.1    -22.0  -22.6  -23.7    -19.4  -10.4   12.3
Table 6. Converging Hazards: percent bias of Cox model estimates relative to the average hazard ratio of 7.15.

Sample Size      Random                  Early                   Late
per Group     10%    25%    50%      10%    25%    50%      10%    25%    50%
30           -8.9   -5.6    4.0     -9.4   -6.2    2.4    -10.2   -7.3   10.5
50          -11.2   -8.3    1.9    -11.3   -8.8   -0.8    -12.4   -8.4    9.2
100         -12.2   -9.4   -0.6    -12.4  -10.2   -4.3    -13.1   -8.1    8.1
plot of the hazard curves. From this information, one can approximate the magnitude and nature of the risk of biased estimation of the hazard ratio by the Cox model. Generally, the least biased estimates are obtained for the lower censoring proportions (10% and 25%), except for diverging hazards. In terms of bias, early censoring is problematic only for diverging hazards; late censoring is problematic for increasing and crossing hazards with the 50% censoring rate; and random censoring is problematic for crossing hazards with the 50% censoring rate. The case corresponding to the least occurrence of severe bias is the one involving random censoring with a censoring rate of 25% or less.

In practice, the experimenter typically has some control over sample size and perhaps the censoring proportion. For instance, the experimenter may be able to minimize censoring proportion, depending on the situation, through effective study design and experimental protocol. Minimizing the censoring rate is generally recommended, especially for increasing and crossing hazards. Early censoring is appreciably affected by censoring proportion only for constant and crossing hazards. Sample size has the strongest effect on constant and crossing hazards, especially at higher censoring proportions, where higher sample sizes lead to less biased estimates.

In practical applications, the proportional hazards assumption is never met precisely. If the deviation from the proportional hazards assumption is severe, then remedial measures should be taken. However, in many instances the model diagnostics reveal only a small to moderate deviation from the proportional hazards assumption. In these cases, the Cox model estimate of the hazard ratio is used for interpretation purposes in the presence of small to moderate assumption violations. This study characterizes the consequences of this interpretation in terms of bias, taking into account censoring rate, type of censoring, type of nonproportional hazards, and sample size. The general results indicate that the percent bias relative to the AHR is under 20% in all but a few specific instances, as outlined above.
Table 7. Percent bias of the average Cox regression model estimates of the hazard ratio relative to the AHR, averaged over sample size (asterisks mark cells in which the percent bias was under 20% in absolute value).

Hazards:      Constant      Increasing    Decreasing    Crossing      Diverging     Converging
% censoring   10  25  50    10  25  50    10  25  50    10  25  50    10  25  50    10  25  50
Random         *   *   *     *   *   *     *   *   *     *   *  53     *   *   *     *   *   *
Early          *   *   *     *   *   *     *   *   *     *   *   *   -21 -21 -21     *   *   *
Late           *   *   *     *   * -22     *   *   *     *   *  83     *   *   *     *   *   *
Lee, E. T. (1992). Statistical methods for survival data analysis (2nd ed.). Oklahoma City: John Wiley & Sons.
Ngandu, N. H. (1997). An empirical comparison of statistical tests for assessing the proportional hazards assumption of Cox's model. Statistics in Medicine, 16, 611-626.
Persson, I. (2002). Essays on the assumption of proportional hazards in Cox regression. Acta Universitatis Upsaliensis. Unpublished Ph.D. dissertation.
Quantin, C., Moreau, T., Asselain, B., Maccario, J., & Lellouch, J. (1996). A regression survival model for testing the proportional hazards hypothesis. Biometrics, 52, 874-885.
Song, H. H., & Lee, S. (2000). Comparison of goodness of fit tests for the Cox proportional hazards model. Communications in Statistics - Simulation and Computation, 29, 187-206.
Tsiatis, A. A. (1981). A large sample study of Cox's regression model. The Annals of Statistics, 9, 93-108.
Journal of Modern Applied Statistical Methods May, 2005, Vol. 4, No.1, 100-105
Bias Affiliated With Two Variants Of Cohen's d When Determining U1 As A Measure Of The Percent Of Non-Overlap
David A. Walker Educational Research and Assessment Department Northern Illinois University
Variants of Cohen's d, in this instance dt and dadj, have the largest influence on U1 measures used with smaller sample sizes, specifically when n1 and n2 = 10. This study indicated that bias for variants of d, which influences U1 measures, tends to subside and become more manageable, in terms of precision of estimation, at around 1% to 2% when n1 and n2 = 20. Thus, depending on the direction of the influence, both dt and dadj are likely to manage bias in the U1 measure quite well for small to moderate sample sizes. Key words: Non-overlap, effect size, Cohen's d
Introduction
In his seminal work on power analysis, Jacob Cohen (1969; 1988) derived an effect size measure, Cohen's d, as the standardized difference between two sample means. Using the n, M, and SD from two sample groups, d provided "score distances in units of variability" (p. 21) by translating the means into a common metric of standard deviation units pertaining to the degree of departure from the null hypothesis. The common formula for Cohen's d (1988) is
d = (X̄1 - X̄2) / σ̂pooled,    (1)

where σ̂pooled is the pooled standard deviation of the two groups.

If no n, M, or SD for the two groups is reported, Cohen's d can be calculated from the t value and degrees of freedom, termed dt here, where it is assumed that n1 and n2 are equal (Rosenthal, 1991):

dt = 2t / √df,    (2)

where t = the t value and df = n1 + n2 - 2.

Kraemer (1983) noted that the distribution of Cohen's d was skewed and heavy tailed, and Hedges (1981) found that d was a positively biased effect size estimate. Hedges proposed an approximate, modified estimator of d, which will be termed dadj here, obtained by multiplying d by the correction factor

c(m) = 1 - 3 / (4m - 1),    (3)

where m = n1 + n2 - 2.

David Walker is an Assistant Professor at Northern Illinois University. His research interests include structural equation modeling, effect sizes, factor analyses, predictive discriminant analysis, predictive validity, weighting, and bootstrapping. Email: [email protected].

Cohen (1969; 1988) revisited the idea of group overlap, which was studied by Tilton (1937), and the degree of overlap (O) between two distributions, a topic also examined, in close proximity to the time of Cohen's initial work (i.e., 1969), by Elster and Dunnette (1971). This resulted in the U1 measure, which was derived from d as a percent of non-overlap. As Cohen (1988) explained, "If we maintain the assumption that the populations being compared are normal and with equal variability, and conceive them further as equally numerous, it is possible to define measures of non-overlap (U1) associated with d"
(p. 21). Algebraically, U1 is related to the cumulative normal distribution and is expressed as (Cohen, 1988):
U1 = (2P_{d/2} - 1) / P_{d/2} = (2U2 - 1) / U2,    (4)
where d = Cohen's d value, P = the percentage of the area falling below a given normal deviate, and U2 = P_{d/2}. In SPSS (Statistical Package for the Social Sciences) syntax, U1 is calculated using the following expressions:

Compute U = CDF.NORMAL((ABS(d)/2),0,1).
Compute U1 = (2*U-1)/U*100.
Execute.

where d = Cohen's d value, ABS = absolute value, and CDF.NORMAL = the cumulative probability that a value from a normal distribution with M = 0 and SD = 1 is less than or equal to the absolute value of d/2. Thus, the link between d and U1 was seen by Cohen (1988) in that d is taken as a deviate in the unit normal curve and P [from expression 4] as "the percentage of the area (population of cases) falling below a given normal deviate" (p. 23).

For Cohen (1988), non-overlap was the extent to which an experiment or intervention had had an effect of separating the two populations of interest. A high percentage of non-overlap indicated that the two populations were separated greatly. When d = 0, there was 0% non-overlap and U1 = 0 also, or, as Cohen (1988) noted, either population distribution "is perfectly superimposed on the other" (p. 21); therefore, the two populations were identical. The assumptions for the percentage of population non-overlap are that (1) the comparison populations have normality and (2) equal variability. Further, Cohen (1988) added that the U1 measure would also hold for samples from two groups if the samples approach the conditions of "normal distribution, equal variability, and equal sample size" (p. 68).
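For readers without SPSS, the following minimal Python sketch reproduces expressions (1) through (4) and the syntax above. The function names are mine, and the printed checks are only informal (for example, d = .20 gives a U1 of about 14.8%, close to Cohen's tabled 14.7%).

from math import sqrt
from scipy.stats import norm

def cohens_d(m1, m2, s1, s2, n1, n2):
    # expression (1): d = (M1 - M2) / pooled SD
    pooled = sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
    return (m1 - m2) / pooled

def d_t(t, n1, n2):
    # expression (2): d_t = 2t / sqrt(df), assuming n1 = n2
    return 2.0 * t / sqrt(n1 + n2 - 2)

def d_adj(d, n1, n2):
    # expression (3): d_adj = c(m) * d with c(m) = 1 - 3/(4m - 1), m = n1 + n2 - 2
    m = n1 + n2 - 2
    return (1.0 - 3.0 / (4.0 * m - 1.0)) * d

def u1(d):
    # expression (4) and the SPSS syntax: U1 = (2P - 1)/P * 100 with P = Phi(|d|/2)
    p = norm.cdf(abs(d) / 2.0)
    return 100.0 * (2.0 * p - 1.0) / p

print(round(u1(0.2), 1))                 # about 14.8, close to Cohen's tabled 14.7%
print(round(u1(d_adj(0.5, 10, 10)), 1))  # U1 via d_adj for d = .5, n1 = n2 = 10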
Cohen (1988, p. 22) went on to produce Table 2.2.1, which consisted of non-overlap percentages for values of d. Assuming a normal distribution, this table showed that, for example, a value of d = .20 would have a corresponding U1 = 14.7%, or a percentage of non-overlap of just over 14%. That is, the distribution of scores for the treatment group was separated from the distribution of scores for the non-treatment group by only a small amount, which was reflected in the small effect size of .20. As the value of d increased, so would the percentage of non-overlap between the two distributions of scores, indicating that the two groups differed more considerably.

Methodology
After an extensive review of the literature, it was found that very few studies included effect size indices with tests for statistical significance, and none produced a U1 measure when any of the variants of d were reported. Further, beyond studies by, for example, Hedges (1981) or Kraemer (1983) related to the upward bias and skewness associated with d in small samples, it appears that d as a percent of non-overlap has not been studied in the scholarly literature to evaluate any bias affiliated with the variants of d, dt and dadj, substituted for it in the calculation of U1, beyond what has been provided by Cohen (1988). Thus, the intent of this research was to examine U1 under varying sizes of d and n (i.e., n1 = n2). That is, this research looked at d values of .2, .5, .8, 1.00, and 1.50, which represent typically small to extremely large effect sizes in educational research. The sizes of n were 10, 20, 40, 50, 80, and 120, which represent small to large sample sizes in educational research. It should be noted, though, as was first discussed by Glass, McGaw, and Smith (1981) and reiterated by Cohen (1988) about the previously mentioned d effect size target values and their importance: "these proposed conventions were set forth throughout with much diffidence, qualifications, and invitations not to employ them if possible. The values chosen had no more reliable a basis than my own intuition. They were offered as conventions because they were needed in
a research climate characterized by a neglect of attention to issues of magnitude" (p. 532).

Using the work of Aaron, Kromrey, and Ferron (1998), this study's tables display the bias and proportional bias found in each U1 measure obtained via both dt and dadj. As noted in the Aaron et al. research, the current study defines bias as the difference between the tabled value of U1, derived from the standard d formula and presented by Cohen (1988) as Table 2.2.1, and the presented U1 value resulting from dt and dadj, respectively. Proportional bias, or "the size of [the] bias as a proportion of the actual effect size estimate" (Aaron et al., p. 9), is defined as the bias found above divided by the presented estimate for U1 derived from both dt and dadj, respectively (see Tables 1 and 2).

Results
Using syntax written in SPSS v. 12.0 to obtain the results of the study, Tables 1 and 2 indicated, as would be expected, that regardless of the variant of d used, as the value of d increased, the bias in U1 decreased. For example, Table 1 shows that at a small value of d = .2, and also at a moderate value of d = .5, the bias for small to moderate sample sizes ranged from about 1% to over 4%. As the value of d increased into the large effect size range of d = .8 to 1.50, the bias for the same sample sizes ranged from about 3% to under 1%. The bias related to the U1 measure was similar for both forms of d used in this study: with both variants of d, small sample sizes had about 3% to 4% bias, moderate sample sizes about 1%, and
Table 1 Continued (U1 obtained via dt; only the d = .50 panel and the proportional bias rows of the remaining panels were recoverable).

d = .50
n1 = n2                        10      20      40      50      80     120
U1                           33.0    33.0    33.0    33.0    33.0    33.0
U1 via dt                    34.5    33.7    33.4    33.3    33.2    33.1
Bias (U1 - U1 dt)             1.5      .7      .4      .3      .2      .1
Proportional Bias
(Bias / U1 dt)               .044    .021    .012    .009    .006    .003

Proportional Bias (Bias / U1 dt) for the later panels:
d = .80                      .037    .019    .008    .006    .004    .002
d = 1.00                     .035    .018    .009    .005    .004    .002
d = 1.50                     .028    .014    .007    .006    .003    .001

Table 2 (U1 obtained via dadj; recoverable panels).

Proportional Bias (Bias / U1 dadj):
d = .80                      .033    .015    .006    .006    .004    .002

d = 1.00
n1 = n2                        10      20      40      50      80     120
U1                           55.4    55.4    55.4    55.4    55.4    55.4
U1 via dadj                  53.8    54.7    55.1    55.1    55.2    55.3
Proportional Bias
(Bias / U1 dadj)             .030    .013    .005    .005    .004    .002

d = 1.50
n1 = n2                        10      20      40      50      80     120
U1                           70.7    70.7    70.7    70.7    70.7    70.7
U1 via dadj                  69.1    69.9    70.3    70.4    70.5    70.6
Proportional Bias
(Bias / U1 dadj)             .023    .011    .006    .004    .003    .001
References
Aaron, B., Kromrey, J. D., & Ferron, J. M. (1998, November). Equating r-based and d-based effect size indices: Problems with a commonly recommended formula. Paper presented at the annual meeting of the Florida Educational Research Association, Orlando, FL.
Cohen, J. (1969). Statistical power analysis for the behavioral sciences. New York: Academic Press.
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Lawrence Erlbaum Associates.
Elster, R. S., & Dunnette, M. D. (1971). The robustness of Tilton's measure of overlap. Educational and Psychological Measurement, 31, 685-697.
Glass, G. V., McGaw, B., & Smith, M. L. (1981). Meta-analysis in social research. Beverly Hills, CA: Sage.
Hedges, L. V. (1981). Distribution theory for Glass's estimator of effect size and related estimators. Journal of Educational Statistics, 6, 107-128.
Huberty, C. J., & Mourad, S. A. (1980). Estimation in multiple correlation/prediction. Educational and Psychological Measurement, 40, 101-112.
Kraemer, H. C. (1983). Theory of estimation and testing of effect sizes: Use in meta-analysis. Journal of Educational Statistics, 8, 93-101.
Rosenthal, R. (1991). Meta-analytic procedures for social research. Newbury Park, CA: Sage Publications.
Thompson, K. N., & Schumacker, R. E. (1997). An evaluation of Rosenthal and Rubin's binomial effect size display. Journal of Educational and Behavioral Statistics, 22, 109-117.
Tilton, J. W. (1937). The measurement of overlapping. Journal of Educational Psychology, 28, 656-662.
Journal of Modern Applied Statistical Methods May, 2005, Vol. 4, No. 1, 106-119
Some Guidelines For Using Nonparametric Methods For Modeling Data From Response Surface Designs
Christine M. Anderson-Cook
Statistical Sciences Group Los Alamos National Laboratory
Kathryn Prewitt
Mathematics and Statistics Arizona State University
Traditional response surface methodology focuses on modeling responses using parametric models, with designs chosen to balance cost against adequate estimation of parameters and prediction in the design space. Using nonparametric smoothing to approximate the response surface offers both opportunities and problems. This article explores some conditions under which these methods can be appropriately used to increase the flexibility of the surfaces modeled. The Box and Draper (1987) printing ink study is considered to illustrate the methods. Key words: Edge-corrections, data sparseness, bandwidth, lowess, Nadaraya-Watson
Introduction
In his review of the current status and future directions in response surface methodology, Myers (1999) suggests that one of the new frontiers is to utilize nonparametric methods for response surface modeling. Explored in this article are some of the key issues influencing the success of these methods used together. Combining nonparametric smoothing approaches, which typically depend on space-filling samples of points in the desired prediction region, with response surface designs, which primarily focus on an economy of points for adequate prediction of prespecified parametric models, presents some unique challenges. Nonparametric approaches are typically used either as an exploratory data analytic tool in conjunction with a parametric method, or exclusively because a parametric model did not provide the necessary sensitivity to curvature.
Christine Anderson-Cook is a Technical Staff Member at Los Alamos National Laboratory. Her research interests include response surface methodology and design of experiments. Email: [email protected]. Kathryn Prewitt is Associate Professor at Arizona State University. Her research interests include nonparametric function estimation methods, time series and goodness-of-fit. Email: [email protected]
The number and location of design points impose a limitation on the order of the polynomial the parametric model can accommodate. This, in turn, imposes a limitation on the type of curvature of the fitted model. Standard response surface techniques using parametric models often assume a quadratic model. Nonparametric techniques assume a certain amount of smoothness, but do not impose a form for the curvature of the target function. Local polynomial models, which fit a polynomial model within a window of the data, can pick up important curvature which a parametric fit typically cannot. Issues of what designs are suitable for utilizing nonparametric methods, appropriate choices of smoother types, and bandwidth considerations will all be discussed. Important limitations exist for incorporating these methods into surface modeling, because ill-defined or nonsensical models can easily be generated without careful consideration of how to blend the method and design. Vining and Bohn (1998) utilized the Gasser-Mueller (G-M) estimator (see Gasser & Mueller, 1984), a kernel-based smoothing method, to estimate the process variance for a dual response system for the Box and Draper (1987) printing ink study. In that study, a full 3³ factorial design was used with three replicates per combination of factors. Each
modeling should likely be to moderate this range of observed variability to more closely reflect what is believed to be realistic for the actual process. If the modeling undersmoothes the data (approaching interpolation between observed points), there is a risk of basing the dual response optimization on non-reproducible idiosyncrasies of the data. If the data are oversmoothed, important curvature is flattened, making it difficult to find the best location for the process. This perpetual problem of modeling is doubly important here, as the results of the model are used both to determine weights for the modeling of the mean of the process and for the optimization of the global process through the dual modeling paradigm. Hence, as different models for the variability are considered, predicted ranges will be noted throughout the design space.

Reviewed in this article are some of the basics of nonparametric methods, and their implications for a designed experiment with limited sample size and a structured layout of design points are discussed. Different nonparametric approaches are then compared to the existing parametric choices and to those presented in Vining and Bohn (1998) for this particular example, and some general recommendations are given for how to sensibly and appropriately use nonparametric methods for response surface designs.

Smoothing Methods
Smoothing methods are distinct from traditional response surface parametric modeling in that they use different subsets of the data and different weightings for the selected points at different locations in the design space. There are several popular nonparametric smoothing methods, such as the Nadaraya-Watson estimator (Nadaraya, 1964; Watson, 1964), which fits a constant to the data in a window; the Gasser-Mueller estimator (Gasser & Mueller, 1984), which is a convolution-type estimator; spline smoothing (Eubank, 1999); and local polynomial methods (Fan & Gijbels, 1996), which fit polynomials in the local data window.
Figure 2: Range of observed responses likely with different values for a variety of standard deviations
The data are modeled as

Yi = m(Xi) + σ(Xi) εi,    (1)

with E(εi) = 0, Var(εi) = 1, and σ²(x) = Var(Y | X = x).

The smoothing function m(·) is also called the regression function, E(Y | X = x). It is assumed that the variance of the error term for the problem of modeling the printing ink standard deviations is reasonably constant. Kernel smoothing nonparametric methods involve the choice of a kernel function and a bandwidth as a smoothing parameter, which determines the window of data to be utilized in the estimation process. The idea is to weight the data according to its closeness to the target location; hence, to estimate m(x0), greater weight is given to the Yi values with associated Xi values close to x0.

Spline smoothing methods are categorized as a nonparametric technique and involve a smoothing parameter but no kernel function. One of the few references to an application of nonparametric methods to response surface problems is Hardy et al. (1997), who explored the use of R-splines with a significantly larger number of design points and with the goal of selecting variables for the regression model rather than obtaining a plausible curve.

There are special considerations when using nonparametric methods for the printing data problem, which are outlined next. Most of the literature regarding nonparametric methods shows application to space-filling designs and a larger number of sample points. The printing example has 27 data points, which is significantly smaller than the data typically seen in the smoothing literature. Most of these points are on the boundary or edge. It is known that nonparametric estimators can exhibit so-called boundary effects. If a method such as the Gasser-Mueller estimator (Gasser & Mueller, 1984) is used, the bias is bounded but not decreasing with increased sample size, as one would want, unless kernel functions called boundary kernels are used. This means that a different kernel needs to be used when a point is on the boundary. Local polynomial methods of order greater than 1 naturally incorporate the necessary boundary kernels. These methods are easily explained by comparing them to a weighted least squares problem where the kernel function provides the weights and the estimate is provided by solving a familiar-looking matrix operation. Most of the nonparametric methods literature provides results and examples for sample sizes much larger than the printing data, and also provides leading terms of the bias and variance to describe the behavior of the estimator, which implies that there are terms that become negligible as n grows large. The problem here is thus much different than has been addressed before: the sample size is small, the design is not space-filling, and most of the points are on the edge.

Bandwidth Issues
One of the most important choices to make when using a nonparametric method of function estimation is the smoothing parameter. For kernel methods, the bandwidth is such a parameter. Large bandwidths provide very smooth estimates, and smaller bandwidths produce a more noisy summary of the underlying relationship. The reason for this behavior can be seen in the leading terms of the bias and variance for a point in the interior in the univariate explanatory variable case; the leading variance term is

Var(m̂(x)) ≈ σ²(x) ∫ K(u)² du / (n b f(x)),    (2)

where f(x) is the density of the X explanatory variable, K(·) the kernel function, and b the bandwidth. The effect of the bandwidth can be observed: large values of the bandwidth increase the bias and reduce the variance of the predicted function; small values decrease the bias and increase the variance. This difficulty is called the bias-variance tradeoff. Bandwidth selection methods can be local (potentially changing at each point at which the function is to be estimated) or global (where a single bandwidth is used for the entire curve). Typically, the bandwidth is chosen to minimize an optimality criterion, such as an estimate of the leading terms of the MSE or cross-validation (see Eubank, 1988; Fan & Gijbels, 1995). The error sum of squares

SSE = Σ_{i=1}^{n} (Yi - m̂(Xi))²    (3)

cannot be used for this purpose as a measure of goodness of fit because, without a parametric form for m(·), SSE is minimized with m̂(Xi) = Yi; i.e., the curve estimate which minimizes this quantity is obtained by connecting the points. The purpose of the bandwidth selection method is essentially to solve the bias-variance tradeoff difficulty described previously. The second derivative in the bias term suggests that these estimators typically underestimate peaks and overestimate valleys, which is sometimes an argument for using a local bandwidth choice, since the expectation would be to use a smaller bandwidth in regions where there is more curvature. Because the number of points in the problem is small, it is more sensible to use a global bandwidth, one bandwidth for the entire curve. There are not enough points to justify accurate estimation of different local bandwidths. This is not to say that it may not be discovered in the future that different bandwidths should in fact be used to estimate different portions of the surface, but existing methods (Fan & Gijbels, 1995; Prewitt, 2003) will not work. Methods for local bandwidth selection have relied on the fact that each candidate bandwidth for a particular point x0 will incorporate additional data points as the bandwidth candidates become larger, which may not be the case for this problem.
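The bandwidth trade-off and a cross-validation choice of a single global bandwidth can be illustrated with a small univariate sketch. This is not the authors' procedure or data: the Nadaraya-Watson estimator, the leave-one-out criterion, and the synthetic sine-curve data below are stand-ins chosen only to make the idea concrete.

import numpy as np

def epan(u):
    # Epanechnikov kernel, K(u) = 0.75(1 - u^2) on |u| <= 1
    return 0.75 * (1.0 - u**2) * (np.abs(u) <= 1.0)

def nw(x0, x, y, b):
    # univariate Nadaraya-Watson (local constant) estimate at x0
    w = epan((x0 - x) / b)
    return np.sum(w * y) / np.sum(w) if w.sum() > 0 else np.nan

def loo_cv(x, y, b):
    # leave-one-out cross-validation score for a global bandwidth b
    errs = []
    for i in range(len(x)):
        mask = np.arange(len(x)) != i
        fit = nw(x[i], x[mask], y[mask], b)
        if np.isfinite(fit):
            errs.append((y[i] - fit) ** 2)
    return np.mean(errs)

rng = np.random.default_rng(2)
x = np.sort(rng.uniform(0, 1, 60))
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, 60)   # synthetic data only
grid = np.linspace(0.05, 0.5, 10)
scores = [loo_cv(x, y, b) for b in grid]
print(grid[int(np.argmin(scores))])   # cross-validation choice of the global bandwidth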
The kernel function used is the Epanechnikov kernel,

K(u) = 0.75 (1 - u²) I(|u| ≤ 1),    (4)

which is simple and has optimal properties (Mueller, 1988). At the point x = (x1, x2, x3), both estimators are weighted least squares estimates with a kernel function as the weight. The two methods can be described as follows, where m̂_C(x) is the local polynomial with a fitted constant. Let the weight function be defined as

K_b(x, Xi) = (1/b³) Π_{j=1}^{3} K((x_j - X_ij) / b).

This is called a product kernel because it is the product of three univariate kernel functions. The kernel function equals zero when data points are outside the window defined by the bandwidth and has a nonzero weight when Xi is inside the window. It is appropriate to use the same bandwidth in each of the three directions because the scaling of the coded variables in the set-up of the response surface design makes units comparable in all directions.

The definition below resembles a weighted least squares estimator when a constant is fit:

m̂_C(x) = β̂0, where β̂0 = arg min_{β0} Σ_{i=1}^{n} (Yi - β0)² K_b(x, Xi),    (5)

which is equivalent to the Nadaraya-Watson form m̂_NW(x) = Σ_{i=1}^{n} K_b(x, Xi) Yi / Σ_{i=1}^{n} K_b(x, Xi).

The second method considered is defined below and resembles a weighted least squares estimator where a plane is fit, with the data centered at x so that the desired estimator is β̂0; the "LL" stands for local linear, with no higher order terms:

m̂_LL(x) = β̂0, where (β̂0, β̂1, β̂2, β̂3) = arg min Σ_{i=1}^{n} (Yi - β0 - β1(x1 - X_i1) - β2(x2 - X_i2) - β3(x3 - X_i3))² K_b(x, Xi).    (6)

One can also think of the above estimators as motivated by a desire to estimate m(x) by using the first few terms of its Taylor expansion: m̂_NW(x) is constructed by considering an interval around x and estimating the first term of the Taylor expansion around x, whereas m̂_LL(x) uses estimates of the first-order terms of the Taylor expansion as an estimate of m(x).

Printing Example Smoothing
It has already been noted that there are some particular issues concerning the application of nonparametric smoothing to a sparse, small set of data with the vast majority of design locations on the edges. A related issue to consider is what type of surface is possible or likely. If the variability of the process can change very quickly and dramatically within the range of the design space, then the 3³ factorial design is an inadequate choice and should be replaced by a much larger space-filling design. However, if the surface should change moderately slowly throughout the region, then the 3³ design may be adequate. As well, if the surface is likely to be relatively smooth and undergoes changes slowly, then a nonparametric method should be selected and a bandwidth that
uses information from several nearby points to estimate the surface locally. Examined now are some of the implications of choosing different bandwidths for this 3³ factorial design. The 3³ factorial design is comprised of 27 locations on the cube: 8 corner points, 12 edge points, 6 face center points, and one center point. Notably, all but one of the points are on the edge of the design space. This is standard practice for parametric estimation, because D- and G-efficiency both benefit from maximal spread of points to the edges of the design space. However, this set-up, coupled with the extremely small sample size, is highly unusual for nonparametric approaches.

One of the advantages of the structured locations selected for a response surface design is that it allows the investigation of the characteristics of estimation for different nonparametric bandwidth choices. For example, using the Epanechnikov kernel weighting function, the number of design points that will be used for estimation at each of the four categories of design points can be specified. Table 1 shows the effect of bandwidth on different locations, as well as the range of the non-zero weights for particular bandwidths used for the local estimation. Bandwidths less than 0.5 of the total range of each variable use only the observation at that location, while a bandwidth of 1 uses all observations. The weights associated with each design location change for different bandwidths. As the bandwidth increases, not only do more locations get used, but their relative contributions to the estimate also become more comparable. For example, for a bandwidth of 0.6 at one of the design points, the observation at the location to be estimated is weighted approximately 35 times more (1.25 / 0.036) than the most distant nonzero-weighted observations. For a design point and a bandwidth of 1, this ratio drops to 2.4 (0.75 / 0.316) and the points used are also further away.

Various authors have considered different models for the standard deviation for this data set. Parametric models considered include a linear model in all three factors on log(standard deviation + 1), shown in Figure 3(a), with an R² of 29.4%. The transformation of the standard deviation was done to improve fit, and to avoid
Table 1: Number of points contributing to local estimation for the 3³ factorial, with Epanechnikov kernel.
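The body of Table 1 did not survive, but counts of this kind are straightforward to reproduce. The sketch below is my own illustration for the 3³ design with the Epanechnikov product kernel defined above; the bandwidth here is expressed in the coded units of the design, which is an assumption and not necessarily the same scaling (fraction of the variable range) used in the article's tables.

import numpy as np
from itertools import product

def epan(u):
    # Epanechnikov kernel, K(u) = 0.75(1 - u^2) for |u| <= 1
    return 0.75 * (1.0 - u**2) * (np.abs(u) <= 1.0)

def product_kernel(x, X, b):
    # K_b(x, X_i) = (1/b^3) * prod_j K((x_j - X_ij)/b)
    return np.prod(epan((x - X) / b), axis=1) / b**3

# 3^3 factorial in coded units: 8 corners, 12 edge midpoints, 6 face centres, 1 centre
X = np.array(list(product([-1.0, 0.0, 1.0], repeat=3)))

def n_contributing(x, b):
    # number of design points receiving strictly positive weight at location x
    return int(np.sum(product_kernel(np.asarray(x), X, b) > 0))

for b in (0.9, 1.5, 2.5):                        # illustrative bandwidths in coded units
    print(b,
          n_contributing((0.0, 0.0, 0.0), b),    # centre point
          n_contributing((1.0, 1.0, 1.0), b))    # corner point
# at intermediate bandwidths the corner uses far fewer neighbours than the centre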
Table 2: Summary of Prediction Values for Lowess and Local Average on Untransformed Standard Deviations.
Table 3: Summary of Prediction Values for Lowess and Local Average on Transformed Log (Standard Deviations+1).
Figure 3: Contour plots for best Linear and Quadratic parametric models based on the Box and Draper (1987) data for log (standard deviations + 1).
Figure 4: Contour plots of local average models for untransformed response with bandwidths 0.8 and 1.
Figure 5: Contour plots of lowess models for untransformed response with bandwidths 0.6, 0.8 and 1.
Figure 6: Contour plots of local average models for logarithm transformed response with bandwidths 0.8 and 1.
Figure 7: Contour plots of lowess models for logarithm transformed response with bandwidths 0.8 and 1.
method, which either yields good responsiveness if using the values near the edge, or wide extrapolation when this point is removed. This seems to imply some superiority for the C method, which outperforms both of the parametric models and appears to retain some useful predictive ability even when used for extrapolation. Based on an overall assessment of all characteristics of the methods considered, the Nadaraya-Watson local averaging (C) method with a bandwidth of either 0.8 or 1.0 emerges as the leading choice. The bandwidth of 1.0 uses all of the data, with diminishing weights for more distant points. The 0.8 bandwidth excludes points on the opposite side of the design space for corner, edge, and face-center points. Both of these models allow for greater flexibility than either of the parametric models, by allowing greater adaptability of the shape of the surface, while also utilizing a significant proportion of the data for estimation. They provide enough smoothing to produce a surface that is likely consistent with underlying assumptions of how the standard deviation of the process might vary across the range of the design space.

Conclusion
Based on the sparseness of the data sets typical of many response surface designs, it should be evident that nonparametric methods must be used with care to avoid nonsensical results. However, the printing ink example has demonstrated that nonparametric models have real potential for helping with modeling responses when the restrictions of a parametric model are too limiting. The ability to adapt the shape of the surface locally is desirable, and can be achieved even when there are only a small number of values observed across the range of each variable. It is particularly important to consider a priori what the surface, range, and ratio of maximum to minimum predicted values might reasonably be. The chosen method should balance optimizing fit while still maintaining characteristics of the appropriate shape.
Table 5: Number of points contributing to local estimation for different Central Composite Designs and bandwidths.
Due to the large number of points on the edge of the design space, which is highly desirable for D- and G-efficiency when using a parametric model, a smoother which is insensitive to edge effects is recommended. The local averaging smoother (C) performed quite well, although the (LL) estimator supposedly has superior boundary behavior in the bias term, both in its order and through the boundary kernel adjustment. The reason for this apparent contradiction may again be that the sample size is small and the boundary order results depend on larger sample sizes, or, as pointed out in Ruppert and Wand (1994), that the boundary variance of the (LL) estimator may be larger than the boundary variance of the (C) estimator. Consequently, the local averaging (C) estimator is recommended for this problem. The local first-order polynomial works well in many standard applications, where the proportion of edge points is small, but does not seem to be a suggested choice for most response surface designs. To avoid near-interpolation, a moderate to large bandwidth needs to be used. Table 5 considers perhaps the most popular class of response surface designs, the Central Composite Design. It gives the number of points used for estimation for the different types of points for a
References
Box, G. E. P., & Draper, N. R. (1987). Empirical model-building and response surfaces. Wiley.
Del Castillo, E., & Montgomery, D. C. (1993). A nonlinear programming solution to the dual response problem. Journal of Quality Technology, 25, 199-204.
Eubank, R. L. (1988). Spline smoothing and nonparametric regression. Marcel Dekker.
Eubank, R. L. (1999). Nonparametric regression and spline smoothing. Marcel Dekker.
Fan, J., & Gijbels, I. (1995). Data-driven bandwidth selection in local polynomial fitting: Variable bandwidth and spatial adaptation. Journal of the Royal Statistical Society, Series B, 57, 371-394.
Fan, J., & Gijbels, I. (1996). Local polynomial modelling and its applications. Chapman & Hall.
Gasser, T., & Mueller, H.-G. (1984). Estimating regression functions and their derivatives by the kernel method. Scandinavian Journal of Statistics, 11, 171-185.
Hardy, S. W., Nychka, D. W., Haaland, P. D., & O'Connell, M. (1997). Process modeling with nonparametric response surface methods. In ASA Proceedings of the Section on Physical and Engineering Sciences (pp. 163-172). Alexandria, VA: American Statistical Association.
Lin, D. K. J., & Tu, W. (1995). Dual response surface optimization. Journal of Quality Technology, 27, 34-39.
Myers, R. H. (1999). Response surface methodology - current status and future directions. Journal of Quality Technology, 31, 30-44.
Nadaraya, E. (1964). Some new estimates for distribution functions. Theory of Probability and its Applications, 9, 497-500.
Prewitt, K. (2003). Efficient bandwidth selection in non-parametric regression. Scandinavian Journal of Statistics, 30, 75-92.
Prewitt, K., & Lohr, S. (2002). Condition indices and bandwidth choice in local polynomial regression. Under review.
Ruppert, D., & Wand, M. P. (1994). Multivariate locally weighted least squares regression. The Annals of Statistics, 22, 1346-1370.
Seifert, B., & Gasser, T. (1996). Finite-sample variance of local polynomials: Analysis and solutions. Journal of the American Statistical Association, 91, 267-275.
Vining, G. G., & Bohn, L. L. (1998). Response surfaces for the mean and variance using a nonparametric approach. Journal of Quality Technology, 30, 282-291.
Watson, G. (1964). Smooth regression analysis. Sankhya, Series A, 26, 359-372.
Journal of Modern Applied Statistical Methods May, 2005, Vol. 4, No. 1, 120-133
Determining The Correct Number Of Components To Extract From A Principal Components Analysis: A Monte Carlo Study Of The Accuracy Of The Scree Plot.
Gibbs Y. Kanyongo
Department of Foundations and Leadership Duquesne University
This article pertains to the accuracy of the scree plot in determining the correct number of components to retain under different conditions of sample size, component loading, and variable-to-component ratio. The study employs Monte Carlo simulations in which the population parameters were manipulated, data were generated, and the scree plot was then applied to the generated scores. Key words: Monte Carlo, factor analysis, principal component analysis, scree plot
Introduction
In social science research, one of the decisions that quantitative researchers make is determining the number of components to extract from a given set of data. This is achieved through several factor analytic procedures. The scree plot is one of the most common methods used for determining the number of components to extract. It is available in most statistical software, such as the Statistical Package for the Social Sciences (SPSS) and Statistical Analysis Software (SAS). Factor analysis is a term used to refer to statistical procedures used in summarizing relationships among variables in a parsimonious but accurate manner. It is a generic term that includes several types of analyses, including (a) common factor analysis, (b) principal component analysis (PCA), and (c) confirmatory factor analysis (CFA). According to Merenda (1997), common factor analysis may be used when a primary goal of the research is to investigate how well a new set of data fits a particular well-established model. On the other
hand, Stevens (2002) noted that principal components analysis is usually used to identify the factor structure or model for a set of variables. In contrast, CFA is based on a strong theoretical foundation that allows the researcher to specify an exact model in advance. In this article, principal components analysis is of primary interest.

Principal component analysis
Principal component analysis develops a small set of uncorrelated components based on the scores on the variables. Tabachnick and Fidell (2001) pointed out that components empirically summarize the correlations among the variables. PCA is a more appropriate method than CFA if there are no hypotheses about components prior to data collection; that is, it is used for exploratory work. When one measures several variables, the correlation between each pair of variables can be arranged in a table of correlation coefficients between the variables. The diagonals in the matrix are all 1.0 because each variable theoretically has a perfect correlation with itself. The off-diagonal elements are the correlation coefficients between pairs of variables. The existence of clusters of large correlation coefficients between subsets of variables suggests that those variables are related and could be measuring the same underlying dimension or concept. These underlying dimensions are called components.
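As a toy illustration of this idea (not taken from the article), the eigenvalues of a small made-up correlation matrix with two clusters of correlated variables show how clusters surface as dominant components:

import numpy as np

# Hypothetical correlation matrix for 4 variables: variables 1-2 and 3-4
# form two correlated clusters, with weak cross-cluster correlation.
R = np.array([
    [1.0, 0.7, 0.1, 0.1],
    [0.7, 1.0, 0.1, 0.1],
    [0.1, 0.1, 1.0, 0.7],
    [0.1, 0.1, 0.7, 1.0],
])
eigenvalues = np.sort(np.linalg.eigvalsh(R))[::-1]
print(np.round(eigenvalues, 2))  # two eigenvalues well above the rest -> two components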
Gibbs Y. Kanyongo is assistant professor in the Department of Foundations and Leadership at Duquesne University. 410A Canevin Hall, Pittsburgh, PA 15282. Email: [email protected].
A component is a linear combination of variables; it is an underlying dimension of a set of items. Suppose, for instance a researcher is interested in studying the characteristics of freshmen students. Next, a large sample of freshmen are measured on a number of characteristics like personality, motivation, intellectual ability, family socio-economic status, parents characteristics, and physical characteristics. Each of these characteristics is measured by a set of variables, some of which are correlated with one another. An analysis might reveal correlation patterns among the variables that are thought to show the underlying processes affecting the behavior of freshmen students. Several individual variables from the personality trait may combine with some variables from motivation and intellectual ability to yield an independence component. Variables from family socio-economic status might combine with other variables from parents characteristics to give a family component. In essence what this means is that the many variables will eventually be collapsed into a smaller number of components. Velicer et. al., (2000) noted that a central purpose of PCA is to determine if a set of p observed variables can be represented more parsimoniously by a set of m derived variables (components) such that m < p. In PCA the original variables are transformed into a new set of linear combinations (principal components). Gorsuch (1983) described the main aim of component analysis as to summarize the interrelationships among the variables in a concise but accurate manner. This is often achieved by including the maximum amount of information from the original variables in as few derived components as possible to keep the solution understandable. Stevens (2002) noted that if we have a single group of participants measured on a set of variables, then PCA partitions the total variance by first finding the linear combination of variables that accounts for the maximum amount of variance. Then the procedure finds a second linear combination, uncorrelated with the first component, such that it accounts for the next largest amount of variance, after removing the variance attributable to the first component from the system. The third principal component is
constructed to be uncorrelated with the first two, and accounts for the third largest amount of variance in the system. This process continues until all possible components are constructed. The final result is a set of components that are not correlated with each other in which each derived component accounts for unique variance in the dependent variable. Uses of principal components analysis Principal component analysis is important in a number of situations. When several tests are administered to the same examinees, one aspect of validation may involve determining whether there are one or more clusters of tests on which examinees display similar relative performances. In such a case, PCA functions as a validation procedure. It helps evaluate how many dimensions or components are being measured by a test. Another situation is in exploratory regression analysis when a researcher gathers a moderate to a large number of predictors to predict some dependent variable. If the number of predictors is large relative to the number of participants, PCA may be used to reduce the number of predictors. If so, then the sample size to variable ratio increases considerably and the possibility of the regression equation holding up under cross-validation is much better (Stevens 2002). Here, PCA is used as a variable reduction scheme because the number of simple correlations among the variables can be very large. It also helps in determining if there is a small number of underlying components, which might account for the main sources of variation in such a complex set of correlations. If there are 30 variables or items, 30 different components are probably not being measured. It therefore makes sense to use some variable reduction scheme that will indicate how the variables or items cluster or hang together. The use of PCA on the predictors is also a way of attacking the multicollinearity problem (Stevens, 2002). Multicollinearity occurs when predictors are highly correlated with each other. This is a problem in multiple regression because the predictors account for the same variance in the dependent variable. This redundancy makes the regression model less accurate in as far as the number of predictors required to explain the
variance in the dependent variable in a parsimonious way is concerned. This is so because several predictors will have common variance in the dependent variable. The use of PCA creates new components, which are uncorrelated; the order in which they enter the regression equation makes no difference in terms of how much variance in the dependent variable they will account for. Principal component analysis is also useful in the development of a new instrument. A researcher gathers a set of items, say 50 items designed to measure some construct like attitude toward education, sociability or anxiety. In this situation PCA is used to cluster highly correlated items into components. This helps determine empirically how many components account for most of the variance on an instrument. The original variables in this case are the items on the instrument. Stevens (2002) pointed out several limitations (e.g., reliability consideration and robustness) of the k group MANOVA (Multivariate Analysis of Variance) when a large number of criterion variables are used. He suggests that when there are a large number of potential criterion variables, it is advisable to perform a PCA on them in an attempt to work with a smaller set of new criterion variables. The scree plot The scree plot is one of the procedures used in determining the number of factors to retain in factor analysis, and was proposed by Cattell (1966). With this procedure eigenvalues are plotted against their ordinal numbers and one examines to find where a break or a leveling of the slope of the plotted line occurs. Tabachnick and Fidell (2001) referred to the break point as the point where a line drawn through the points changes direction. The number of factors is indicated by the number of eigenvalues above the point of the break. The eigenvalues below the break indicate error variance. An eigenvalue is the amount of variance that a particular variable or component contributes to the total variance. This corresponds to the equivalent number of variables that the component represents. Kachigan, (1991) provided the following explanation: a component associated with an eigenvalue of 3.69 indicates that the
that these distributions are chosen because their properties are understood and because in many cases they provide good models for variables of interest to applied researchers. Using Monte Carlo simulations in this study has the advantage that the population parameters are known and can be manipulated; that is, the internal validity of the design is strong although this will compromise the external validity of the results. According to Brooks et al. (1999), Monte Carlo simulations perform functions empirically through the analysis of random samples from populations whose characteristics are known to the researcher. That is, Monte Carlo methods use computer assisted simulations to provide evidence for problems that cannot be solved mathematically, such as when the sampling distribution is unknown or hypothesis is not true. Mooney, (1997) pointed that the principle behind Monte Carlo simulation is that the behavior of a statistic in a random sample can be assessed by the empirical process of actually drawing many random samples and observing this behavior. The idea is to create a pseudo-population through mathematical procedures for generating sets of numbers that resemble samples of data drawn from the population. Mooney (1997) further noted that other difficult aspects of the Monte Carlo design are writing the computer code to simulate the desired data conditions and interpreting the estimated sampling plan, data collection, and data analysis. An important point to note is that a Monte Carlo design takes the same format as a standard research design. This was noted by Brooks et al., (1999) when they wrote It should be noted that Monte Carlo design is not very different from more standard research design, which typically includes identification of the population, description of the sampling plan, data collection and data analysis (p. 3). Methodology Sample size (n) Sample size is the number of participants in a study. In this study, sample size is the number of cases generated in the Monte Carlo simulation. Previous Monte Carlo studies
by Velicer et al. (2000), Velicer and Fava (1998), and Guadagnoli and Velicer (1988) found sample size to be one of the factors that influences the accuracy of procedures in PCA. This variable had three levels (75, 150, and 225). These values were chosen to cover both the lower and the higher ends of the range of values found in many applied research situations.

Component loading (aij)
Field (2000) defined a component loading as the Pearson correlation between a component and a variable. Gorsuch (1983) defined it as a measure of the degree of generalizability found between each variable and each component. A component loading reflects a quantitative relationship, and the further the component loading is from zero, the more one can generalize from that component to the variable. Velicer and Fava (1998) and Velicer et al. (2000) found the magnitude of the component loading to be one of the factors having the greatest effect on accuracy within PCA. This condition had two levels (.50 and .80). These values were chosen to represent a moderate coefficient (.50) and a very strong coefficient (.80).

Variable-to-component ratio (p:m)
This is the number of variables per component, measured by counting the number of variables correlated with each component in the population conditions. The number of variables per component has repeatedly been found to influence the accuracy of the results, with more variables per component producing more stable results. Two levels for this condition were used (8:1 and 4:1). Because the number of variables in this study was fixed at 24, these two ratios yielded three and six components, respectively.

Number of variables
This study set the number of variables at a constant of 24, meaning that for the variable-to-component ratio of 8:1, eight variables loaded onto each of three components, and for the variable-to-component ratio of 4:1, four variables loaded onto each of six components (see Appendixes A to D).
Generation of population correlation matrices
A pseudo-population is an artificial population from which samples used in Monte Carlo studies are derived. In this study, the underlying population correlation matrices were generated for each possible aij and p:m combination, yielding a total of four matrices (see Appendixes E to H). The population correlation matrices were generated in the following manner using the RANCORR program by Hong (1999):
1. The factor pattern matrix was specified based on the combination of values for p:m and aij (see Appendixes A to D).
2. After the factor pattern matrix was specified and the program executed, a population correlation matrix was produced for each combination of conditions.
3. The program was executed four times to yield four different population correlation matrices, one correlation matrix for each combination of conditions (see Appendixes E to H).
After the population correlation matrices were generated, the Multivariate Normal Data Generator (MNDG) program (Brooks, 2002) was used to generate samples from the population correlation matrices. This program generated multivariate normally distributed data. A total of 12 cells were created based on the combination of n, p:m, and aij. For each cell, 30 replications were done to give a total of 360 samples, essentially meaning that 360 scree plots were generated. Each of the samples had a predetermined factor structure, since the parameters were set by the researcher. The scree plots were then examined to see if they extracted the exact number of components set by the researcher.

Interpretation of the scree plots
The scree plots were given to two raters with some experience in interpreting scree plots. These raters were graduate students in Educational Research and Evaluation and had taken a number of courses in Educational Statistics and Measurement.
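The sketch below shows one generic way to carry out the kind of simulation described, using only NumPy; it is not the RANCORR or MNDG programs. The pattern matrix mirrors Appendix A (eight variables loading .80 on each of three components), the implied population correlation matrix is taken as the pattern matrix times its transpose with a unit diagonal, and the seed and sample size are arbitrary.

import numpy as np

def pattern_to_corr(loading=0.8, vars_per_comp=8, n_comp=3):
    # orthogonal pattern matrix as in Appendix A: blocks of `vars_per_comp`
    # variables loading `loading` on one component and .00 elsewhere
    p = vars_per_comp * n_comp
    lam = np.zeros((p, n_comp))
    for k in range(n_comp):
        lam[k * vars_per_comp:(k + 1) * vars_per_comp, k] = loading
    r = lam @ lam.T            # implied common variance
    np.fill_diagonal(r, 1.0)   # unit variances -> correlation matrix
    return r

rng = np.random.default_rng(1)
R = pattern_to_corr()
X = rng.multivariate_normal(np.zeros(R.shape[0]), R, size=225)   # one simulated sample
eigenvalues = np.sort(np.linalg.eigvalsh(np.corrcoef(X, rowvar=False)))[::-1]
print(np.round(eigenvalues[:6], 2))  # three large eigenvalues, then a sharp drop (scree elbow)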
An examination of Figures 1 and 2 shows that when the component loading was .80, it was relatively clear where the cut-off point was for determining the number of components to extract. Figure 1 clearly shows that six
Figure 1. The scree plot for variable-to-component ratio of 4:1, component loading of .80 (eigenvalues plotted against component number).
Figure 2. The scree plot for variable-to-component ratio of 8:1, component loading of .80 (eigenvalues plotted against component number).
components were extracted and in Figure 2, three components were extracted. These two plots show why it was easy for the raters to have more agreement for component loading of .80. This was not the case when component loading was .50 as the raters had few cases of agreement and more cases of disagreement.
In Table 2, when component loading was .50, the two raters agreed correctly only 28 times and agreed wrongly 97 times. Compared to component loading of .80, the scree plot was not as accurate when component loading was .50. This finding is consistent with that of Zwick and Velicer (1986), who noted in their study that "The raters in this study showed greater agreement at higher than at lower component
loading levels" (p. 440). Figures 3 and 4 show typical scree plots that were obtained for a component loading of .50. In Figure 3, the number of components extracted was supposed to be six, but it is not clear from the plot where the cut-off point is for six components. One can see why there were a lot of disagreements between the two raters when component loading was low. In Figure 4, the plot was supposed to extract three components, but it is not quite clear, even to an experienced rater, how many components should be extracted from this plot. These cases show how difficult it is to use the scree plot, especially in exploratory studies when the researcher does not know the number of components that exist.

Table 2. A cross tabulation of the measure of agreement between rater 1 and rater 2 when component loading was .50.

                 Rater 2 = 0   Rater 2 = 1   Total
Rater 1 = 0           97             5         102
Rater 1 = 1           50            28          78
Total                147            33         180
Reports of rater reliability on the scree plot have ranged from very good (Cattell & Jaspers, 1967) to quite poor (Crawford & Koopman, 1979). This wide range, and the fact that data encountered in real-life situations rarely have a perfect structure with high component loadings, make it difficult to recommend this procedure as a stand-alone procedure for practical use in determining the number of components. Generally, most real data have low to moderate component loadings, which makes the scree plot an unreliable procedure of choice (Zwick & Velicer, 1986). The second part of question one was to consider the percentages of time that the scree plots were accurate in determining the exact number of components, and those percentages were computed for each cell (see Table 5). In
Figure 3. The scree plot for variable-to-component ratio of 4:1, component loading of .50 (eigenvalue plotted against component number).

Figure 4. The scree plot for variable-to-component ratio of 8:1, component loading of .50 (eigenvalue plotted against component number).
Table 5. Performance of the scree plot (as a percentage) under different conditions of variable-to-component ratio, component loading, and sample size.
V-C-R   Comp. loading   Sample size   Rater 1   Rater 2   Consensus
4:1     .50             75            73%       67%       23%
4:1     .50             150           20%       10%       20%
4:1     .50             225           10%       16%       10%
4:1     .80             75            10%       26%       75%
4:1     .80             150           16%       23%       100%
4:1     .80             225           3%        13%       100%
8:1     .50             75            33%       63%       47%
8:1     .50             150           13%       57%       23%
8:1     .50             225           27%       77%       47%
8:1     .80             75            87%       100%      100%
8:1     .80             150           100%      100%      97%
8:1     .80             225           100%      100%      100%
Generally, the findings of this study are in agreement with previous studies that found mixed results on the scree plot. The subjectivity in the interpretation of the procedure makes it unreliable as a stand-alone procedure. The scree plot would probably be useful in confirmatory factor analysis to provide a quick check of the factor structure of the data; in that case the researcher already knows the structure of the data, as opposed to exploratory studies where the structure of the data is unknown. If used in exploratory factor analysis, the scree plot can be misleading even for experienced researchers because of its subjectivity. Based on the findings of this study, it is recommended that the scree plot not be used as a stand-alone procedure in determining the number of components to retain. Researchers should use it together with other procedures such as parallel analysis or Velicer's Minimum Average Partial (MAP) procedure. In situations where the scree plot is the only procedure available, users should be very cautious, and they can do so in confirmatory studies but not exploratory studies.

References

Brooks, G. P. (2002). MNDG. http://oak.cats.ohiou.edu/~brooksg/mndg.htm

Brooks, G. P., Barcikowski, R. S., & Robey, R. R. (1999). Monte Carlo simulation for perusal and practice. Paper presented at the meeting of the American Educational Research Association, Montreal, Quebec, Canada. (ERIC Document Reproduction Service No. ED449178)

Cattell, R. B. (1966). The scree test for the number of factors. Multivariate Behavioral Research, 1, 245-276.

Cattell, R. B., & Jaspers, J. (1967). A general plasmode for factor analytic exercises and research. Multivariate Behavioral Research Monographs, 3, 1-212.
Steiner, D. L. (1998). Factors affecting reliability of interpretations of scree plots. Psychological Reports, 83, 689-694.

Tabachnick, B. G., & Fidell, L. S. (2001). Using multivariate statistics (4th ed.). Needham Heights, MA: Pearson.

Velicer, F. W., Eaton, C. A., & Fava, J. L. (2000). Construct explication through factor or component analysis: A review and evaluation of alternative procedures for determining the number of factors or components. In R. D. Goffin & E. Helmes (Eds.), Problems and solutions in human assessment (pp. 42-71). Boston: Kluwer Academic Publishers.

Appendix A: Population Pattern Matrix, p:m = 8:1 (p = 24, m = 3, aij = .80)
                    Components (m)
p            1         2         3
1-8         .80       .00       .00
9-16        .00       .80       .00
17-24       .00       .00       .80
Zoski, K. W., & Jurs, S. (1990). Priority determination in surveys: An application of the scree test. Evaluation Review, 14, 214-219.

Zwick, R. W., & Velicer, F. V. (1986). Comparison of five rules for determining the number of components to retain. Psychological Bulletin, 99, 432-442.
Journal of Modern Applied Statistical Methods May, 2005, Vol. 4, No.1, 134-139
Testing The Causal Relation Between Sunspots And Temperature Using Wavelet Analysis
Abdullah Almasri
Department of Statistics Lund University, Sweden
Ghazi Shukur
Departments of Economics and Statistics, Jönköping University and Växjö University
Investigated and tested in this article is the causal nexus between sunspots and temperature, using statistical methodology and causality tests. Because this kind of relationship cannot be properly captured in the short run (daily, monthly or yearly data), the relationship is investigated in the long run using very low frequency wavelet-decomposed data, namely D8 (128-256 months). Results indicate that during the period 1854-1989, the causal nexus between these two series is, as expected, of one-directional form, i.e., from sunspots to temperature. Key words: Wavelets, time scale, causality tests, sunspots, temperature
Introduction

The Sun is the energy source that powers Earth's weather and climate, and therefore it is natural to ask whether changes in the Sun could have caused past climate variations and might cause future changes. At some level the answer must be yes. Recently, concerns about human-induced global warming have focused attention on just how much climatic change the Sun could produce. Accordingly, many authors have tried to investigate the relation between sunspots and climate change. For example, Friis-Christensen (1997) compared observations of cloud cover and cosmic particles and concluded that variation in global cloud cover was correlated with the cosmic ray flux from 1980 to 1995. They proposed that the observed variation in cloud
Abdullah Almasri also holds an appointment in the Department of Primary Care and General Practice, University of Birmingham, UK. Email: [email protected]. Ghazi Shukur holds appointments in the Departments of Economics and Statistics, Jönköping University and Växjö University, Sweden.
cover seemed to be caused by the varying cosmic ray flux related to solar activity, and postulated that an accompanying change in the earth's albedo could explain the observed correlations between solar activity and climate. However, Jorgensen and Hansen (2000) showed that there is no evidence supporting the mechanism by which cosmic rays affect cloud cover and hence climate. Nevertheless, most of these studies suffered from a lack of statistical methodology. In this study, well-selected statistical tools are used to investigate the causal relation between the sunspots and the temperature. A vector autoregressive (VAR) model, which allows for a causality test, is constructed and applied to low frequency wavelet-decomposed data. Proceeding in this manner, the nature of the causal relation between these two variables can be seen. The wavelet is a fairly new approach to analysing data (e.g., Daubechies, 1992) that is becoming increasingly popular for a wide range of applications (e.g., time series analyses). The approach is still relatively unfamiliar in areas such as statistics with environmental applications. The idea behind using this technique is based on the fact that the time period (time scale) of the analysis is very crucial for determining those aspects that are relatively more important and those that are relatively less important. In time series one can envisage a cascade of time scales
Methodology

For a sample X of dyadic length T = 2^M, the discrete wavelet transform (DWT) W produces wavelet coefficients d_j at scales j = 1, 2, ..., J and T/2^J scaling coefficients. Thus, the first T − T/2^J elements of d = WX are wavelet coefficients and the last T/2^J elements are scaling coefficients, where J ≤ M. Notice that the length of X coincides with the length of d (the length of d_j is 2^(M−j), and the length of s_J is 2^(M−J)).

The multiresolution analysis of the data leads to a better understanding of wavelets. The idea behind multiresolution analysis is to express X as the sum of several new series, each of which is related to variations in X at a certain scale. Because the matrix W is orthonormal, the time series may be reconstructed from the wavelet coefficients d by using X = W'd. Partitioning the columns of W' commensurate with the partitioning of d gives W' = [W_1 W_2 ... W_J V_J], where W_j is a T × T/2^j matrix and V_J is a T × T/2^J matrix, so that

X = W'd = Σ_{j=1}^{J} W_j d_j + V_J s_J = Σ_{j=1}^{J} D_j + S_J.

Each term D_j = W_j d_j (and S_J = V_J s_J) is a T × 1 vector, so its length coincides with the length of X. Because the terms at different scales represent components of X at different resolutions, the approximation is called a multiresolution decomposition; see Percival and Mofjeld (1997). As mentioned earlier, the wavelet decompositions in this article are made with respect to the Symmlet basis, using the S-Plus Wavelets package produced by StatSci of MathSoft and written by Bruce and Gao (1996). Figure 3 shows the multiresolution analysis of order J = 6 based on the Symmlets of length 8.
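As a concrete illustration of the decomposition just described, the following minimal MATLAB sketch computes a pyramid DWT with the simple Haar filter, returning the wavelet coefficients d_j and the remaining scaling coefficients; it is for intuition only, since the article's own computations use the Symmlet(8) filters in the S-Plus Wavelets module and a different level-indexing convention.

function [d, s] = haar_dwt(x, J)
% Haar DWT of a column vector x of dyadic length T = 2^M, for J <= M levels.
% d{j} holds the wavelet (detail) coefficients at level j; s holds the
% scaling (smooth) coefficients remaining after J levels (length T/2^J).
s = x(:);
d = cell(J, 1);
for j = 1:J
    odd  = s(1:2:end);
    even = s(2:2:end);
    d{j} = (even - odd) / sqrt(2);   % detail coefficients at level j
    s    = (even + odd) / sqrt(2);   % smooth coefficients passed to the next level
end
end

For example, [d, s] = haar_dwt(x, 6) on a series of dyadic length yields six detail vectors whose reconstructed counterparts correspond to the D_1, ..., D_6 components of a multiresolution analysis.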
Monthly temperature for the northern hemisphere for the years 1854-1989, from the database held at the Climate Research Unit of the University of East Anglia, Norwich, England (Briffa & Jones, 1992). The numbers are the temperature (degrees C) difference from the monthly average over the period 1950-1979.
The Causality Between the Sunspots and Temperature

Next, the wavelet analysis is used to investigate the hypothesis that the sunspots may affect the temperature. This is done mainly by using a causality test. Because this kind of relationship cannot be captured properly in the short run (daily, monthly or yearly data), only the long-run relationship is investigated, either by using 10-20 year data (which are not available in this case) or very low frequency wavelet-decomposed data such as D8 (128-256 months), which is what is used in this article (see Figure 3). This is done empirically by constructing a VAR model that allows for a causality test in the Granger sense. Causality is intended in the sense of Granger (1969); that is, to know whether one variable precedes the other variable or whether they are contemporaneous. The Granger approach to the question of whether sunspots (Sun) cause temperature (Tem) is to see how much of the current value of the second variable can be explained by past values of the first variable. (Tem) is said to be Granger-caused by (Sun) if (Sun) helps in the prediction of (Tem), or equivalently, if the coefficients of the lagged (Sun) are statistically significant in a regression of (Tem) on (Sun). Empirically, one can test for causality in the Granger sense by means of the following vector autoregressive (VAR) model:

Tem_t = a_0 + Σ_{i=1}^{k} a_i Tem_{t−i} + Σ_{i=1}^{k} b_i Sun_{t−i} + e_{1t},   (1)

Sun_t = c_0 + Σ_{i=1}^{k} c_i Tem_{t−i} + Σ_{i=1}^{k} f_i Sun_{t−i} + e_{2t},   (2)

where e_{1t} and e_{2t} are error terms, which are assumed to be independent white noise with zero mean. The number of lags, k, is decided by using the Schwarz (1978) information criterion, in what follows referred to as SC. According to Granger and Newbold (1986), causality can be tested in the following way: a joint F-test is constructed for the inclusion of the lagged values of (Sun) in (1) and for the lagged values of (Tem) in (2). The null hypothesis for each F-test is that the added coefficients are zero, and therefore that the lagged (Sun) does not reduce the variance of the (Tem) forecasts (i.e., the coefficients on the lagged Sun terms in (1) are jointly zero for all i), or that the lagged (Tem) does not reduce the variance of the (Sun) forecasts (i.e., the coefficients on the lagged Tem terms in (2) are jointly zero for all i). If neither null hypothesis is rejected, the results are considered inconclusive. However, if both F-tests reject the null hypothesis, the result is labeled a feedback mechanism. A unique direction of causality can only be indicated when one of the pair of F-tests rejects and the other accepts the null hypothesis, which should be the case in this study. Moreover, before testing for causality, the augmented Dickey-Fuller (1979, 1981) test, in what follows referred to as ADF, is applied to decide the integration order of each variable. When looking at the wavelet-decomposed data for sunspots and temperature used here, i.e., the D8 series in Figure 3 below, the ADF test results indicate that each variable is integrated of order zero, i.e., I(0), so that both series are stationary, implying that the VAR model can be estimated by standard statistical tools.
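The following is a minimal MATLAB sketch of the single-equation Granger F-test described above, assuming tem and sun are column vectors holding the D8-decomposed series; the function name and settings are illustrative, and the authors' full procedure (lag selection by SC, ADF pretesting) is not reproduced here.

function p = granger_pval(y, x, k)
% Test H0: lags of x do not Granger-cause y, in a regression of y on a
% constant, k lags of y and k lags of x (the unrestricted model).
T  = length(y);
Y  = y(k+1:T);
Zr = ones(T-k, 1);                    % restricted model: constant + lags of y
for i = 1:k
    Zr = [Zr, y(k+1-i:T-i)];
end
Zu = Zr;                              % unrestricted model adds lags of x
for i = 1:k
    Zu = [Zu, x(k+1-i:T-i)];
end
rssR = sum((Y - Zr*(Zr\Y)).^2);       % restricted residual sum of squares
rssU = sum((Y - Zu*(Zu\Y)).^2);       % unrestricted residual sum of squares
df2  = (T-k) - size(Zu, 2);           % residual degrees of freedom
F    = ((rssR - rssU)/k) / (rssU/df2);
p    = betainc(df2/(df2 + k*F), df2/2, k/2);  % upper-tail F_{k,df2} probability
end

For instance, granger_pval(tem, sun, 3) gives the p-value for the hypothesis that sunspots do not Granger-cause temperature in a VAR(3), and reversing the first two arguments gives the test in the other direction.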
Results

According to the model selection criterion proposed by Schwarz (1978), the model that minimizes this criterion is found to be the VAR(3). When this model is used to test for causality, the inference drawn is that only (Sun) Granger-causes (Tem). The test results can be found in Table 1 below. This means that the causal nexus between these two series is of a one-directional form, i.e., from (Sun) to (Tem). This is fairly reasonable, since it is not logical to assume that the temperature on the earth should have any significant effect on the sunspots.

Table 1: Testing results for the Granger causality.

Null Hypothesis                          P-value
Sun does not Granger-cause Tem           0.0037
Tem does not Granger-cause Sun           0.5077
Conclusion

The main purpose of this article is to model the causal relationship between sunspots and temperature. Although other studies exist for a similar purpose, they are not based on careful statistical modeling. Moreover, these studies have sometimes ended up with conflicting results and inferences. Here, well-selected statistical methodology for estimating and testing the causal relation between these two variables is used. Very low frequency wavelet-decomposed data indicate that, during the period 1854-1989, the causal nexus between these two series is of the expected one-directional form, i.e., from sunspots to temperature.
References Bruce, A. G., & Gao, H.-Y. (1996). Applied wavelet analysis through S-Plus, New York: Springer-Verlag.
Granger, C. W. J. (1969). Investigating causal relations by econometric models and cross-spectral methods. Econometrica, 37, 24-36.

Percival, D. B., & Mofjeld, H. O. (1997). Analysis of subtidal coastal sea level fluctuations using wavelets. Journal of the American Statistical Association, 92(439), 868-880.

Percival, D. B., & Walden, A. T. (2000). Wavelet methods for time series analysis. Cambridge, UK: Cambridge University Press.

Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics, 6, 461-464.
Journal of Modern Applied Statistical Methods May, 2005, Vol. 4, No.1, 140-154
A Bayesian wavelet estimation method for estimating the parameters of a stationary I(d) process is presented as a useful alternative to the existing frequentist wavelet estimation methods. The effectiveness of the proposed method is demonstrated through Monte Carlo simulations. Sampling from the posterior distribution is carried out through Markov chain Monte Carlo (MCMC), easily implemented in the WinBUGS software package. Key words: Bayesian method, wavelet, discrete wavelet transform (DWT), I(d) process, long memory
Introduction

Stationary processes exhibiting long range dependence have been widely studied since the works of Granger and Joyeux (1980) and Hosking (1981). Long range dependence has found applications in many areas, including economics, finance, geosciences, hydrology, and statistics. The estimation of the long-memory parameter of a stationary long-memory process is one of the important tasks in studying such a process. There exist parametric, non-parametric, and semi-parametric methods of estimation for the long-memory parameter in the literature. In the parametric method, the long-memory parameter is one of several parameters that determine the parametric model; hence the usual classical methods, such as maximum likelihood estimation, can be applied. The non-parametric method, which does not assume a restricted parametric form for the model, usually estimates the parameter by regressing the logarithm of some sampling statistic.
The widely used Geweke and Porter-Hudak (1983) estimation method belongs to the non-parametric methods. The semi-parametric method makes intermediate assumptions by not specifying the covariance structure at short ranges. The article by Bardet et al. (2003) surveyed some semi-parametric estimation methods and compared their finite-sample performance by Monte Carlo simulation. The wavelet has been widely used in statistics, especially in time series, as a powerful multiresolution analysis tool since the 1990s; see Vidakovic (1999) for a reference from the statistical perspective. The wavelet's strength rests in its ability to localize a process in both time and frequency scale simultaneously. This article presents a Bayesian wavelet estimation method for the long-memory parameter d and variance σ_ε² of a stationary long-memory I(d) process, implemented in the MATLAB computing environment and the WinBUGS software package.

Methodology

A time series {X_t} is a fractionally integrated process, I(d), if it follows
Leming Qu is an Assistant Professor of Statistics, 1910 University Dr., Boise, ID 83725-1555. Email: [email protected]. His research interests include wavelets in statistics, time series, Bayesian analysis, statistical and computational inverse problems, and nonparametric and semiparametric regression.
(1 − L)^d X_t = ε_t,

where ε_t ~ i.i.d. N(0, σ_ε²) and L is the lag operator defined by LX_t = X_{t−1}. The parameter d is not necessarily an integer, so that fractional differencing is allowed. The process {X_t} is stationary if |d| < 0.5. The fractional differencing operator (1 − L)^d is defined by the general binomial expansion

(1 − L)^d = Σ_{k=0}^{∞} C(d, k) (−L)^k,   where C(d, k) = Γ(d + 1) / [Γ(k + 1) Γ(d − k + 1)]

and Γ(·) is the usual Gamma function. Denote the autocovariance function of {X_t} as γ(k), that is, γ(k) = E(X_t X_s) where k = |t − s|. The formula for γ(k) of a stationary I(d) process is well known (Beran, 1994, p. 63):

γ(0) = σ_ε² Γ(1 − 2d) / [Γ(1 − d)]²,   γ(k) = γ(0) Γ(k + d) Γ(1 − d) / [Γ(d) Γ(k + 1 − d)],  k ≥ 1.

Let ω = WX denote the DWT of X = (X_1, ..., X_T)', with T = 2^J. The smoothed wavelet coefficient vector at the lowest resolution level j_0, for which we use j_0 = 0 in this article, is c_{j0} = (c_{j0,0}, c_{j0,1}, ..., c_{j0,2^{j0}−1})'. At resolution level j, the detailed wavelet coefficient vector is d_j = (d_{j,0}, d_{j,1}, ..., d_{j,2^j−1})' for j = 0, 1, ..., J − 1. McCoy and Walden (1996) argued heuristically that the DWT coefficients of X have the following distribution:

d_{j,k} ~ N(0, σ_j²),  j = 0, 1, ..., J − 1,  k = 0, 1, ..., 2^j − 1,

approximately uncorrelated due to the whitening property of the DWT. The σ_j², j = −1, 0, 1, ..., J − 1 (with j = −1 referring to the scaling coefficient), depend on d and σ_ε² as

σ_j² = 2^{J−j} σ_ε² · 2 · 4^{−d} ∫_{2^{−(J−j+1)}}^{2^{−(J−j)}} [sin(πf)]^{−2d} df.   (1)

When J − j ≥ 2, sin(πf) ≈ πf over the range of integration, so that σ_j² can be simplified as

σ_j² ≈ (2π)^{−2d} σ_ε² 2^{2d(J−j)} (2 − 2^{2d}) / (1 − 2d),
where j = −1, 0, ..., J − 2. McCoy and Walden (1996) used these facts to estimate d and σ_ε² by the maximum likelihood method. They demonstrated through simulation that d could be estimated as well as, or better by, wavelet methods than by the best Fourier-based method. Jensen (1999) derived a similar result about the distribution of the wavelet coefficients and, using the fact that Var(d_{j,k}) ≈ σ² 2^{−2jd}, he used the ordinary least squares (OLS) method to estimate d. That is, by regressing the log of the sample variance of the wavelet coefficients at resolution level j against log(2^{2j}) for j = 2, 3, ..., J − 2, he obtained the OLS estimate of d. The sample variance of the wavelet coefficients at resolution level j is estimated by the sample second moment of the observed wavelet coefficients at that level. Vannucci and Corradi (1999, Section 5) proposed a Bayesian approach. They used independent priors, assuming an Inverse Gamma distribution for σ² and a Beta distribution for 2d. They did not use formula (1); instead, they used a recursive algorithm to compute the variances of the wavelet coefficients. The posterior inference is done through a Markov chain Monte Carlo (MCMC) sampling procedure. They did
not give details of the implementation in the paper. McCoy and Walden (1996) did not give the variance of their estimates. Jensen (1999) only estimated d using the OLS method, and it is not clear how σ_ε² is estimated. In both cases, the estimated d cannot be guaranteed to lie in the range (−0.5, 0.5). Here, a Bayesian approach is proposed to estimate d and σ_ε² in the same spirit as Vannucci and Corradi (1999, Section 5). The distinction of this article from Vannucci and Corradi (1999) is that, firstly, the explicit formula (1) is used for the variances of the wavelet coefficients at resolution level j instead of the recursive algorithm; secondly, the MCMC is implemented in the WinBUGS software package.

Denote by θ = (d, σ_ε²) the parameters of the model for the data ω. If a prior distribution π(θ) of θ is chosen, i.e., θ ~ π(θ), then by the Bayes formula the posterior distribution of θ is

π(θ | ω) ∝ f(ω | θ) π(θ),

where f(ω | θ) is the likelihood of the data given the parameters θ, which is the density of the multivariate normal distribution N(0, Σ) with

Σ = diag(σ_{−1}², Σ_0, Σ_1, ..., Σ_{J−1}),   Σ_j = diag(σ_j², ..., σ_j²),

where Σ_j, for j = 0, 1, ..., J − 1, is a 2^j × 2^j diagonal matrix.

The first prior considered is the Jeffreys prior, based on

J(θ) = E[ −∂² ln f(ω | θ) / ∂θ ∂θ' ];

simple calculation shows that this prior is proportional to 1/σ_ε². The second prior consists of independent priors on d and σ_ε², i.e., π(θ) = π(d) π(σ_ε²). The prior for d + 0.5 is Beta(α, β), where α > 0 and β > 0 are hyperparameters. This prior restricts |d| < 0.5, thus imposing stationarity on the time series. When α = β = 1, the prior is the noninformative uniform prior. When historical information or expert opinion is available, α and β can be selected to reflect this extra information, giving an informative prior. Hyperpriors can also be used on α and β to reflect uncertainty about them, thus forming a hierarchical Bayesian model. A Gamma(γ_1, γ_2) prior is chosen for the precision τ = 1/σ_ε², where γ_1 > 0 and γ_2 > 0 are hyperparameters. When γ_1 and γ_2 are close to zero, the prior for σ_ε² is practically equivalent to π(σ_ε²) ∝ 1/σ_ε², an improper prior. The non-informative prior π(σ_ε²) ∝ I_(0,+∞)(σ_ε²) can also be chosen.

The inference about θ is based on the posterior distribution π(θ | ω). MCMC methods are popular for drawing repeated samples from the intractable π(θ | ω). The focus here is on the implementation of Gibbs sampling for estimating d and σ_ε² in WinBUGS. The easy programming in the WinBUGS software provides practitioners a useful and convenient tool for carrying out the computation.
Simulation

The MCMC sampling is carried out in the WinBUGS software package. WinBUGS is the current Windows-based version of BUGS (Bayesian inference Using Gibbs Sampling), a newly developed, user-friendly and free software package for general-purpose Bayesian computation (Lunn et al., 2000). It is developed by the MRC Biostatistics Group, Institute of Public Health (www.mrc-bsu.cam.ac.uk/bugs), Cambridge. In WinBUGS programming, the user only needs to specify the full proper data distribution and the prior distributions; WinBUGS will then use appropriate sampling methods to sample from the posterior distribution. In this Monte Carlo experiment, the proposed Bayesian approach is compared with the approaches of McCoy and Walden (1996) and Jensen (1999). Different values of d and N and different prior distributions π(θ) are used to determine the effectiveness of the estimation procedure. Two different wavelet bases were also used to compare the effect of this choice. The Davis and Harte (1987) algorithm was used to generate the I(d) process because of its efficiency compared to other, computationally intensive methods (McLeod & Hipel, 1978). This algorithm generates a Gaussian time series with the specified autocovariances by the discrete Fourier transform and the discrete inverse Fourier transform. It is well known that the Fast Fourier Transform (FFT) can be carried out in O(N log N) operations, so the computation is fast. The generation of the I(d) process using the Davis and Harte algorithm and the DWT of the generated I(d) process are carried out in MATLAB 6.5 on a Pentium III running Windows 2000. The DWT tool used is WAVELAB802, developed by the team from the Statistics Department of Stanford University (http://www-stat.stanford.edu/~wavelab). The following two wavelet bases were chosen for comparison: (a) the Haar wavelet; (b) LA(8), the Daubechies least asymmetric compactly supported wavelet basis with four vanishing moments (see p. 198 of Daubechies, 1992). Periodic boundary handling is used. The data of the discrete wavelet transformed I(d)
process is first saved to a file in R data file format. Then WinBUGS 1.4 is activated under MATLAB to run a script file that implements the proposed Bayesian estimation procedure. The estimation results from WinBUGS 1.4 are then converted to MATLAB variables for further use. The model parameters are estimated under the following independent priors on d and σ_ε²:

(a) d ~ Unif(−0.5, 0.5),  1/σ_ε² ~ Gamma(0.01, 0.01);
(b) d ~ Unif(−0.5, 0.5),  σ_ε² ~ Unif(0, 1000).

The prior (a) is practically equivalent to Jeffreys' noninformative prior:

π(d, σ_ε²) ∝ (1/σ_ε²) I_(0,+∞)(σ_ε²) I_(−0.5,0.5)(d).

BUGS only allows the use of proper prior specifications, so a non-informative or improper prior distribution can be regarded as the limit of a corresponding proper prior. The estimation results using the proposed Bayesian approach for the simulated I(d) process, together with the methods of Jensen (1999) and McCoy and Walden (1996), are found in Table 1 for Haar wavelets and Table 2 for LA(8) wavelets. For the chosen prior, the tables report the estimated posterior mean and posterior standard deviation (SD). In addition, the 95% credible intervals of the parameters, formed from the 2.5% and 97.5% quantiles of the random samples, are tabulated in parentheses below the values of Mean and SD. In all cases, two independent chains of 10,500 iterations each were run, keeping every tenth draw after a burn-in of 500, with random initial values. The posterior inference is based on the resulting 2,000 random samples. For the case of N = 256, d = 0.1, σ_ε² = 1.0 and prior (b), Figure 1 shows the trace of the random samples and the kernel estimates of the posterior densities of the parameters.
The autocorrelation function of the random samples shows very little autocorrelation in the drawn series. The two parallel chains mix well after a small number of initial steps. All other convergence diagnostics indicate good convergence behavior. In most cases, the Bayesian wavelet estimates of d and σ_ε² are quite good; they are very close to the truth. The 95% credible interval
Table 1: Estimation of the simulated I(d) process when N=256 Using Haar Basis. Prior (a) Parameter d=0.1 Jensen 0.1620 MW 0.1629 Mean 0.1686 (0.0739, SD 0.0499 0.2711) 0.0931 1.2460) 0.0465 0.2827) 0.0972 1.3150) 0.0351 0.4902) 0.0934 1.2395) 0.0470 0.1663) 0.1918 2.5745) 0.0477 0.2858) 0.1729 2.3275) 0.0467 0.4069) 0.1540 2.1385) Mean 0.1692 (0.0768, 1.0485 (0.8791, 0.1880 0.1008, 1.1068 (0.9289, 0.4284 (0.3489, 1.0571 (0.8822, 0.0719 (-0.0172 2.1943 (1.8570, 0.1847 (0.0938, 1.9791 (1.6715, 0.3105 (0.2165, 1.8305 (1.5300, Prior (b) SD 0.0499 0.2674) 0.0977 1.2540) 0.0462 0.2854) 0.1021 1.3220) 0.0364 0.4901) 0.0975 1.2640) 0.0472 0.1679) 0.1948 2.5975) 0.0462 0.2785) 0.1770 2.3675) 0.0476 0.4079) 0.1619 2.1665)
2 = 1.0
d=0.25 0.1431
1.0226
1.0452 (0.8801,
0.1858
0.1887 (0.1049,
2 = 1.0
d=0.4 0.4121
1.0789
1.1000 (0.9331,
0.4384
0.4301 (0.3567,
= 1.0
2
1.0189
1.0445 (0.8775,
d=0.1
0.1227
0.0681
0.0709 (-0.0176,
2 = 2.0
d=0.25 0.2468
2.1482
2.1787 (1.8455,
0.1855
0.1858 (0.0995,
2 = 2.0
d=0.4 0.2154
1.9369
1.9674 (1.6570,
0.3096
0.3127 (0.2238,
= 2.0
2
1.7783
1.8130 (1.5435,
Table 2: Estimation of the simulated I(d) process when N=256 Using LA(8) Basis. Prior (a) Parameter d=0.1 Jensen 0.0759 MW 0.1701 Mean 0.1757 (0.0894, SD 0.0466 0.2734) 0.0935 1.2295) 0.0508 0.3680) 0.0916 1.2295) 0.0359 0.4905) 0.0953 1.2370) 0.0529 0.2278) 0.1926 2.5765) 0.0535 0.3697) 0.1609 2.2165) 0.0454 0.4045) 0.1635 2.1510) Mean 0.1755 (0.0936, 1.0270 (0.8626, 0.2661 (0.1705, 1.0412 (0.8824, 0.4295 (0.3536, 1.0502 (0.8826, 0.1151 (0.0183, 2.1694 (1.8235, 0.2637 (0.1630, 1.8849 (1.5795, 0.3117 (0.2236, 1.7995 (1.5145, Prior (b) SD 0.0446 0.2707) 0.0899 1.2190) 0.0502 0.3741) 0.0888 1.2255) 0.0362 0.4895) 0.0932 1.2450) 0.0535 0.2298) 0.1894 2.5650) 0.0540 0.3761) 0.1709 2.2420) 0.0463 0.4040) 0.1595 2.1225)
2 = 1.0
d=0.25 0.0904
1.0037
1.0222 (0.8529,
0.2611
0.2651 (0.1681,
2 = 1.0
d=0.4 0.4906
1.0154
1.0398 (0.8791,
0.4369
0.4304 (0.3548,
= 1.0
2
1.0148
1.0413 (0.8669,
d=0.1
0.0542
0.1110
0.1175 (0.0166,
2 = 2.0
d=0.25 0.1977
2.1233
2.1594 (1.8185,
0.2609
0.2608 (0.1556,
2 = 2.0
d=0.4 0.2632
1.8372
1.8745 (1.5870,
0.3111
0.3130 (0.2257,
= 2.0
2
1.7469
1.7942 (1.5080,
Frequentist Comparison

The estimates of the three methods were also compared across repeatedly simulated I(d) processes. Figure 2 shows the box plots of the estimates of d and σ_ε² for 200 replicates with N = 128, d = 0.25 and σ_ε² = 1.0. Figure 3 shows the box plots of the estimates for 200 replicates with N = 128, d = 0.40 and σ_ε² = 1.0. The x-axis labels in the box plots read as follows: 'JH' denotes the Jensen method using Haar; 'JL' denotes the Jensen method using LA(8); and so forth. Because of the long computation time associated with the Gibbs sampling for the large number of simulated I(d) processes, the burn-in was limited to 100 iterations and the number of random samples to 500. Because only the posterior mean was calculated from the generated random samples, not much information was lost even with this slightly shorter chain. For the estimates of d, the mean square errors of the McCoy and Walden method and of the Bayesian method under the two priors are very similar, and they are all smaller than that of Jensen's OLS. LA(8) gives less biased estimates than Haar. The mean estimates of d given by the Bayesian method using LA(8) are similar to those of McCoy and Walden. For all methods, it seems the estimates of d and σ_ε² are a little
Vidakovic, B. (1999). Statistical modeling by wavelets. New York: Wiley-Interscience.
WinBUGS 1.4: MRC Biostatistics Group, Institute of Public Health, Cambridge University, Cambridge, 2003, www.mrc-bsu.cam.ac.uk/bugs
Appendix: This appendix includes the MATLAB code for first simulating the I(d) process, then transforming it by DWT, and the WinBUGS program for the MCMC computation. In the WinBUGS programming, the symbol ~ is for the stochastic node which has the specified distribution denoted on the right side, the symbol is for the deterministic node which has the specified expression denoted on the right side. All the likelihood function, the prior distributions and initial values of the nodes without parents must be specified in the programs. The MATLAB code: function x=Generatex(J, d, sig2eps) %Generate the I(d) process %input: %J: where N=2^J sample size %d: long memory parameter of the I(d) process, abs(d)<0.5 %sig2eps: $\sigma_\epsilon^2$ %output: %x: the time series N=2^J; c=[]; % generate the autocovariance function by the formular of covariance % function for LRD c(1)=sig2eps*gamma(1-2*d)/((gamma(1-d))^2); %for i=1:N-1 c(i+1)=c(1)*gamma(i+d)*gamma(1-d)/(gamma(d)*gamma(i+1-d)); end; for i=1:N-1 c(i+1)=c(i)*(i+d-1)/(i-d); end; x=GlrdDH(c);
function x=GlrdDH(c); %GlrdDH.m Generating the stationary gaussion time seriess with specified % autocovariance series c % using Davis and Hartes method, Appendix of `Tests for Hurst Effect, % Biometrika, V74, No. 1 (Mar., 1987), 95-101 %c: autocovariance series [temp, N]=size(c); %c is a row vector cCirculant=[]; for i=1:N-2 cCirculant(i)=c(N-i); end; cFull=[]; cFull=[c cCirculant];
g=[]; g=fft(cFull); %Fast Fourier Transform of cFull Z=[]; Z=complex(normrnd(0,1,1,N), normrnd(0,1, 1,N)); Z(1)=normrnd(0,sqrt(2)); %Be careful to specify sqrt(2), if you want variance of Z(1) to be 2 Z(N)=normrnd(0,sqrt(2)); ZCirculant=[]; for i=1:N-2 ZCirculant(i)=conj(Z(N-i)); end; ZFull=[]; ZFull=[Z ZCirculant]; X=[]; X=ifft(ZFull.*sqrt(g))*sqrt(N-1); x=[]; x=real(X(1:N)); function [dJensen, dMW, sigMW, dBS, sigBS]=GetdHatSig2Hat(x, j0, filter) %Wavelet estimation of Long Range Dependence parameters % %input: %x: the observed I(d) process %j0: lowest resolution level of the DWT %filter: wavelet filter %output: %dJensen: estimate of d by Jensen 1999 %dMW: estimate of d by McCoy & Walden 1996 %sigMW: estimate of $\sigma_\epsilon^2$ by McCoy & Walden 1996 %dBS: estimate of d by Bayesian Wavelet Method for prior (a), (b) %dBS.a, dBS.b %sigBS: estimate of $\sigma_\epsilon^2$ by Bayesian Wavelet Method for prior (a), (b) %sigBS.a, sigBS.b N=length(x); J=log2(N); w=[]; w = FWT_PO(x,j0,filter); %w is a coulmn vector resolution=[]; % data used in WinBUGS14 resolution(1:2^j0,1)=j0-1; for j = j0:(J-1) resolution(2^j+1 : 2^(j+1),1)=j; end; vwj=[]; for j=j0+1:(J-1) vwj(j, :)=[j, mean(w(dyad(j)).^2)];
end; tempd=[]; tempd=-[ones(J-2,1), log(2.^(2*vwj(2:J-1,1)))]\log(vwj(2:J-1,2)); dJensen=tempd(2); OPTIONS=optimset(@fminbnd); dMW=fminbnd(@NcllhMW, -0.5, 0.5, OPTIONS, j0, w, J); sigMW=findSig2epsHat(dMW, j0, w, J); n=N-2^(J-1); %the first n data of w, approximation of variance
%function mat2bugs() converts matlab variable to BUGS data file mat2bugs(c:\WorkDir\LRD_data.txt, w,w, twopowl, 2^j0, n, n, N, N, resolution, resolution, J, J, pi, pi,K,500); %set the current directory at MATLAB to C:\Program Files\WinBUGS14\ cd C:\Program Files\WinBUGS14\; %prior (a) dos(WinBUGS14 /par BWIdSt_a.odc); Sa=bugs2mat(C:\WorkDir\bugsIndex.txt, C:\WorkDir\bugs1.txt); dBS.a=mean(Sa.d); %the posterior mean as the estimate of d sigBS.a=mean(Sa.sig2eps); %the posterior mean as the estimate of sig2eps %prior (b) dos(WinBUGS14 /par BWIdSt_b.odc); Sb=bugs2mat(C:\WorkDir\bugsIndex.txt, C:\WorkDir\bugs1.txt); dBS.b=mean(Sb.d); %the posterior mean as the estimate of d sigBS.b=mean(Sb.sig2eps); %the posterior mean as the estimate of sig2eps cd C:\WorkDir; function y=NcllhMW(d, j0, w, J); %NcllhMW.m --- Negative Concentrated log likelihood of McCoy & Walden % %input: %d: the long memory parameter, a value in (0,0.5) %j0: Lowest Resolution Level %w: w=Wx, x is the observed time series %J: N=2^J sample size % %output: %y: Negative Concentrated log likelihood for the given data w
m=J-j; bmP(j+1)=2*4^(-d)*quad(@sinf,2^(-m-1),2^(-m),[],[],d); %by McCoy & Waldens formula, P37, (2.9) smP(j+1)=2^m*bmP(j+1); end; bpp1P=gamma(1-2*d)/((gamma(1-d))^2)-sum(bmP); %B_{p+1} in McCoy & Waldens notation, p=J here spp1P=2^J*bpp1P*(bpp1P>0); %S_{p+1} in McCoy & Waldens notation, it should be nonnegative if spp1P>0 sig2epsHat=w(1)^2/spp1P; else sig2epsHat=0; end; sumlogsmP=0; for j = j0:(J-1) sig2epsHat=sig2epsHat+sum(w(2^j+1 : 2^(j+1)).^2)/smP(j+1); sumlogsmP=sumlogsmP+2^j*log(smP(j+1)); end; sig2epsHat=sig2epsHat/N; %McCoy & Walden, Page 49, formular (5.1) y=N*log(sig2epsHat)+log(spp1P)+sumlogsmP; %McCoy & Walden, Page 49 function sig2epsHat=findSig2epsHat(d, j0, w, J); %find Sig2epsHat by McCoy & Walden Page 49, formular (5.1) % %input: %d: the long memory parameter, a value calculated by function NcllhMW(); %j0: Lowest Resolution Level %J: N=2^J sample size N=2^J; bmP=[]; smP=[]; for j=j0:(J-1) %j is the resolution level m=J-j; bmP(j+1)=2*4^(-d)*quad(@sinf,2^(-m-1),2^(-m),[],[],d); %by McCoy & Waldens formula, P37, (2.9) smP(j+1)=2^m*bmP(j+1); end; bpp1P=gamma(1-2*d)/((gamma(1-d))^2)-sum(bmP); %B_{p+1} in McCoy & Waldens notation, p=J here spp1P=2^J*bpp1P*(bpp1P>0); %S_{p+1} in McCoy & Waldens notation, it should be nonnegative
N=2^J; bmP=[]; smP=[]; for j=j0:(J-1) %j is the resolution level if spp1P>0 sig2epsHat=w(1)^2/spp1P; else sig2epsHat=0; end; for j = j0:(J-1) sig2epsHat=sig2epsHat+sum(w(2^j+1 : 2^(j+1)).^2)/smP(j+1); end; sig2epsHat=sig2epsHat/N; \end{verbatim} The WinBUGS script file: BWIdSt\_a.odc check(C:/MyDir/LRD_model_a.odc) data(C:/MyDir/LRD_data.txt) compile(1) gen.inits() update(100) set(d) set(sig2eps) update(500) coda(*, C:/Documents and Settings/MyDir/bugs) #save(C:/Documents and Settings/MyDirlog.txt) quit() The WinBUGS model file: LRD_model_a.odc model { # This takes care of the father wavelet coefficients from level L+1 to J-1 # which are detailed wavelet coefficients, $D$ for (i in twopowl+1:n) { tau[i]<-1/(pow(2*pi, -2*d)*sig2eps*pow(2, 2*d*(J-resolution[i])) *(2-pow(2,2*d))/(1-2*d)) w[i] ~ dnorm (0, tau[i]) }
#The following takes care the wavelet coefficients at the resolution level J-1. #It uses the exact formula instead of the approximation. for (i in 1:K) { sinf[i]<-pow(sin(pi*(0.25+i/(4*K))),-2*d)} integration<-sum(sinf[])/(4*K) B1<-2*pow(4,-d)*sig2eps*integration tau1<-1/(2*B1) #S_1=2*B_1 in McCoy & Walden 1996s notation
for (i in (n+1): N) { w[i] ~ dnorm (0, tau1) } # This takes care of the scaling coefficients on the lowest level $j_0=L$ # which are mother wavlelet coefficients, $C$ # twopowl <- pow(2, L) for (jp1 in 1:J-1) { #jp1=j+1, m=J-j b[jp1]<-(2*pow(2*pi, -2*d)*sig2eps*pow(2, -(J-jp1+1)*(1-2*d)) *(1-pow(2,2*d-1))/(1-2*d)) } bpp1<-sig2eps*exp(loggam(1-2*d))/pow(exp(loggam(1-d)),2)-sum(b[])-B1; #B_{p+1} in McCoy & Waldens notation spp1<-pow(2,J)*bpp1*step(bpp1)+1.0E-6; #S_{p+1} in McCoy & Waldens notation, this should be positive tau0 <- 1/spp1 for (i in 1: twopowl) { w[i] ~ dnorm(0, tau0) } #note: m=J-resolution[i] in McCoy & Waldens 1996 paper # prior (a) d~dunif(-0.5, 0.5) sig2eps<-1/ tau2 tau2~dgamma(1.0E-2,1.0E-2) #prior (b) # d~dunif(-0.5, 0.5) # sig2eps~dunif(0,1000) }
Journal of Modern Applied Statistical Methods May, 2005, Vol. 4, No.1, 155-162
Monitoring structural change is performed not by hypothesis testing but by model selection using a modified Bayesian information criterion. It is found that concerning detection accuracy and detection speed, the proposed method shows better performance than the hypothesis-testing method. Two advantages of the proposed method are also discussed. Key words: Modified Bayesian information criterion, model selection, monitoring, structural change
Introduction

Deciding whether a time series has a structural change is tremendously important for forecasters and policymakers. If the data generating process (DGP) changes in ways not anticipated, then forecasts lose accuracy. In the real world, not only historical analysis but also real-time analysis should be performed, because new data arrive steadily and the data structure changes gradually. Given a previously estimated model, the arrival of new data presents the challenge of whether yesterday's model can explain today's data. This is why real-time detection of structural change is an essential task. Such forward-looking methods are closely related to the sequential test in the statistics literature but receive little attention in econometrics except for Chu, Stinchcombe, and White (1996) and Leisch, Hornik, and Kuan (2000). Chu et al. (1996) proposed two tests for monitoring potential structural changes: the fluctuation and CUSUM monitoring tests. In their fluctuation test, when new observations are obtained, estimates are computed sequentially from all available data (the historical and the newly
Kosei Fukuda is Associate Professor of Economics at Nihon University, Japan. He has served as an economist in the Economic Planning Agency of the Japanese government (1986-2000). Email: [email protected]
obtained sample) and compared to the estimate based only on the historical sample. The null hypothesis of no change is rejected if the difference between these two estimates becomes too large. One drawback of their test, however, is that it is less sensitive to a change occurring late in the monitoring period. Leisch et al. (2000) proposed the generalized fluctuation test, which includes the fluctuation test of Chu et al. (1996) as a special case, and showed that their tests have roughly equal sensitivity to a change occurring early or late in the monitoring period. Two drawbacks of their test, however, are that there is no objective criterion for selecting the window sizes, and that it has low power in small samples. In this article, a model-selection-based monitoring of structural change is presented. The existence of structural change is examined not by hypothesis testing but by model selection, using a modified Bayesian information criterion proposed by Liu, Wu, and Zidek (1997). Liu et al. (1997) presented a segmented linear regression model and proposed a model-selection method for determining the number and location of changepoints. Their criterion has been applied to examine what happened in historical data sets, but it has not been applied to examine what happens in real time. Therefore, this criterion is applied here to monitor structural change. In this method, whether the observed time series contains a structural change is determined as the result of model selection from a battery of alternative models with and without structural change.
Another contribution of this article is the introduction of a minimum length for each segment (L). Liu et al. (1997) paid little attention to this topic and set the minimum length equal to the number of explanatory variables. This possibly leads to an over-fitting problem in small samples. In order to overcome this problem, L = 10 is set arbitrarily and practically in the simulations, and this yields better performance than the Liu et al. method. The rest of the article is organized as follows. First, the hypothesis-testing method and the model-selection method are reviewed briefly. Next, simulation results are shown to illustrate the efficacy of the proposed method. Finally, conclusions and discussions are presented.

Methodology

The Leisch et al. (2000) hypothesis-testing method

Leisch et al. (2000) considered the following regression model:

y_i = x_i'β_i + ε_i,  i = 1, ..., T, T+1, ...,   (1)

where x_i is the n × 1 vector of explanatory variables and ε_i is an i.i.d. disturbance term. Suppose an economist is currently at time T and has observed historical data (y_i, x_i), i = 1, ..., T. He takes as given that the parameter vector β_i was constant and unknown historically. Consider testing the null hypothesis that β_i remains constant against the alternative that β_i changes at some unknown point in the future. Leisch et al. (2000) first considered tests based on recursive estimates and showed that the Chu et al. (1996) fluctuation test is a special case of this class of tests. They write the Chu et al. fluctuation test as

max RE_T(λ) = max_{k = T+1, ..., [Tλ]} (k/T) | σ̂_T^{−1} Q_T^{1/2} (β̂_k − β̂_T) |,   (2)

where

β̂_k = (Σ_{i=1}^{k} x_i x_i')^{−1} Σ_{i=1}^{k} x_i y_i,   Q_T = (1/T) Σ_{i=1}^{T} x_i x_i',   σ̂_T² = (1/T) Σ_{t=1}^{T} (y_t − x_t'β̂_T)²,   (3)

and the null hypothesis of no change is rejected if the statistic crosses a boundary q(·) chosen so that

lim_{T→∞} P[ (k/T) | σ̂_T^{−1} Q_T^{1/2} (β̂_k − β̂_T) | < q(k/T)  for all T+1 ≤ k ≤ [Tλ] ] = P[ | W⁰(t) | < q(t), 1 ≤ t ≤ λ ],   (4)

where W⁰ is the generalized Brownian bridge on [0, λ], as shown by Chu et al. (1996). max RE_T(λ) is thus determined by the boundary-crossing probability of W⁰ on [1, λ]. The boundary depends on a constant a²; choosing a² = 7.78 and a² = 6.25 gives 95% and 90% monitoring boundaries, respectively.

Leisch et al. (2000) next considered tests based on moving estimates. Define the moving OLS estimates computed from windows of a constant size [Th], where 0 < h ≤ 1 and [Th] > n, as

β̂_T(k, [Th]) = (Σ_{i=k−[Th]+1}^{k} x_i x_i')^{−1} Σ_{i=k−[Th]+1}^{k} x_i y_i,   k = [Th], [Th]+1, ...   (5)

They propose tests on the maximum and the range of the fluctuation of the moving estimates (the max ME and range ME tests), whose critical values z(h) are determined by the limiting result

lim_{T→∞} P[ ([Th]/T) | σ̂_T^{−1} Q_T^{1/2} (β̂_T(k, [Th]) − β̂_T) | < z(h) √(2 log₊(k/T))  for all T+1 ≤ k ≤ [Tλ] ] = [F_1(z(h), λ)]^n,   (8)

together with a corresponding result, (9), in which the maximum of the absolute components of the standardized moving-estimates process is replaced by their range, with limit [F_2(z(h), λ)]^n, where log₊ t = 1 if t ≤ e and log₊ t = log t if t > e. In contrast with the boundary-crossing probability of (4), the F_i (i = 1, 2) do not have analytic forms. Nevertheless, the critical values z(h) can be obtained via simulations, and some typical values are shown in Leisch et al. (2000).

The Liu et al. model-selection method

Liu et al. (1997) considered the following segmented linear regression model:

y_t = x_t'β_i + ε_t,  t = T_{i−1}+1, ..., T_i,  i = 1, ..., m+1,   (10)

where T_0 = 0 and T_{m+1} = T. The indices (T_1, ..., T_m), or the changepoints, are explicitly treated as unknown. In addition, the following condition is newly imposed here:

T_i − T_{i−1} ≥ L,  i = 1, ..., m+1,   (11)

where L is the minimum admissible segment length. Liu et al. (1997) estimated m, the number of changepoints, and T_1, ..., T_m, by minimizing the modified Schwarz criterion (Schwarz, 1978), referred to as LWZ, which trades off the segmented sum of squared residuals

S_T(T_1, ..., T_m) = Σ_{i=1}^{m+1} Σ_{t=T_{i−1}+1}^{T_i} (y_t − x_t'β̂_i)²   (12)

against a penalty that grows with the number of estimated parameters at the rate c_0 (ln T)^(2+δ_0), where c_0 > 0 and δ_0 > 0 are constants chosen by the user.
Liu et al. (1997) suggested using δ_0 = 0.1 and c_0 = 0.299; here, small simulations are implemented to examine how the detection of structural change is affected by changing these two parameter values in the next section. This criterion is an extended version of the Yao (1988) criterion, whose penalty grows at the rate ln T rather than (ln T)^(2+δ_0).

Two data generating processes are considered: DGP1, in which the mean of y_t is constant throughout (no structural change), and DGP2, in which the mean shifts from 2.0 to 2.8 at t = T/2; in both, y_t equals its mean plus e_t, where e_t is generated from i.i.d. N(0, 1). Considered are historical samples of sizes T = 50, 100, 200, 400, L = 1, 10, c_0 = 0.01, 0.05, 0.1, 0.3, 0.5, and δ_0 = 0.01, 0.05, 0.1, 0.2. The number of replications is 1,000. Table 1 shows frequency counts of selecting structural-change models using the Liu et al. information criterion. First, consider comparing the performance of the two pairs (c_0 = 0.1, δ_0 = 0.05) and (c_0 = 0.299, δ_0 = 0.1). The former significantly outperforms the latter, particularly in the structural-change cases of T = 50 and 100; the pair (c_0 = 0.299, δ_0 = 0.1) imposes too heavy a penalty to select structural-change models correctly. Next, consider comparing L = 1 and L = 10. The latter outperforms the former, particularly in the small samples of T = 50 and 100. In the case of L = 1, it sometimes happens that a structural change is incorrectly detected at the very beginning or end of the sample.

Monitoring structural change via the Leisch et al. simulations

In Leisch et al. (2000), the DGP for examining empirical size is the same as DGP1. They show the performance of the max RE, max ME, and range ME tests and consider moving window sizes h = 0.25, 0.5, 1, and λ = 10 for the expected monitoring period [Tλ]. However, the DGP for examining empirical power is not the same as DGP2: the mean changes from 2.0 to 2.8 at 1.1T or 3T. Similarly to Leisch et al. (2000), only the results for the 10% significance level are reported. All experiments were repeated 1,000 times.
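To illustrate the model-selection decision that the proposed method makes at each monitoring point, the following MATLAB sketch compares a constant-mean model with the best single-changepoint model under a modified-BIC-type penalty; the penalty form c0*(ln T)^(2+delta0) per additional parameter and the function name are assumptions made for illustration and need not match the exact LWZ formula in Liu et al. (1997).

function [hasChange, tauHat] = select_change(y, L, c0, delta0)
% Decide between a constant-mean model and a one-changepoint-in-mean model,
% with minimum segment length L, using a penalized least-squares criterion.
T    = length(y);
pen  = c0 * log(T)^(2 + delta0);             % assumed penalty per parameter
rss0 = sum((y - mean(y)).^2);
crit0 = (T/2)*log(rss0/T) + 1*pen;           % no change: one mean parameter
best = Inf; tauHat = NaN;
for tau = L:(T - L)                          % candidate changepoints respect (11)
    y1 = y(1:tau); y2 = y(tau+1:T);
    rss1 = sum((y1 - mean(y1)).^2) + sum((y2 - mean(y2)).^2);
    crit1 = (T/2)*log(rss1/T) + 3*pen;       % two means plus one changepoint
    if crit1 < best
        best = crit1; tauHat = tau;
    end
end
hasChange = best < crit0;
end

In the monitoring setting, this comparison would be rerun as each new observation arrives, and a detection is declared at the first monitoring point at which the changepoint model is selected.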
Table 1 Frequency counts of selecting structural change models L 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 T 50 50 50 50 50 100 100 100 100 100 200 200 200 200 200 400 400 400 400 400 50 50 50 50 50 100 100 100 100 100 200 200 200 200 200 400 400 400 400 400
c0
0.01 0.05 0.1 0.3 0.5 0.01 0.05 0.1 0.3 0.5 0.01 0.05 0.1 0.3 0.5 0.01 0.05 0.1 0.3 0.5 0.01 0.05 0.1 0.3 0.5 0.01 0.05 0.1 0.3 0.5 0.01 0.05 0.1 0.3 0.5 0.01 0.05 0.1 0.3 0.5
0
0.01 0.590 0.379 0.192 0.024 0.001 0.653 0.360 0.147 0.003 0.001 0.697 0.297 0.077 0.000 0.000 0.710 0.271 0.047 0.000 0.000 0.368 0.212 0.111 0.009 0.001 0.461 0.230 0.091 0.001 0.000 0.550 0.226 0.071 0.001 0.000 0.596 0.176 0.033 0.000 0.000 0.05 0.585 0.360 0.172 0.018 0.000 0.647 0.348 0.137 0.002 0.000 0.692 0.275 0.064 0.000 0.000 0.703 0.240 0.034 0.000 0.000 0.366 0.206 0.101 0.008 0.001 0.457 0.217 0.078 0.000 0.000 0.539 0.210 0.058 0.001 0.000 0.587 0.155 0.026 0.000 0.000
for DGP1 0.1 0.583 0.347 0.155 0.012 0.000 0.639 0.327 0.112 0.001 0.000 0.677 0.244 0.047 0.000 0.000 0.684 0.202 0.026 0.000 0.000 0.362 0.200 0.091 0.008 0.001 0.449 0.200 0.068 0.000 0.000 0.531 0.191 0.046 0.000 0.000 0.580 0.124 0.019 0.000 0.000 0.2 0.576 0.305 0.128 0.008 0.000 0.623 0.269 0.070 0.001 0.000 0.656 0.186 0.026 0.000 0.000 0.660 0.136 0.014 0.000 0.000 0.352 0.176 0.075 0.003 0.001 0.434 0.164 0.051 0.000 0.000 0.512 0.147 0.028 0.000 0.000 0.552 0.094 0.006 0.000 0.000 0.01 0.968 0.919 0.826 0.330 0.080 0.999 0.994 0.959 0.591 0.183 1.000 1.000 0.999 0.906 0.514 1.000 1.000 1.000 0.999 0.952 0.933 0.869 0.763 0.344 0.101 0.993 0.980 0.948 0.552 0.157 1.000 1.000 0.999 0.900 0.514 1.000 1.000 1.000 0.999 0.942
for DGP2 0.1 0.968 0.908 0.798 0.266 0.055 0.999 0.992 0.946 0.456 0.112 1.000 1.000 0.999 0.843 0.331 1.000 1.000 1.000 0.998 0.868 0.931 0.851 0.726 0.279 0.059 0.992 0.974 0.930 0.446 0.094 1.000 0.999 0.999 0.840 0.320 1.000 1.000 1.000 0.997 0.869 0.2 0.968 0.893 0.746 0.200 0.028 0.999 0.987 0.925 0.349 0.056 1.000 1.000 0.997 0.707 0.202 1.000 1.000 1.000 0.992 0.706 0.930 0.840 0.694 0.212 0.028 0.992 0.969 0.907 0.342 0.039 1.000 0.999 0.997 0.722 0.176 1.000 1.000 1.000 0.981 0.696
0.05 0.968 0.913 0.817 0.301 0.068 0.999 0.994 0.951 0.530 0.142 1.000 1.000 0.999 0.881 0.430 1.000 1.000 1.000 0.999 0.924 0.932 0.863 0.745 0.315 0.080 0.993 0.977 0.943 0.504 0.132 1.000 1.000 0.999 0.875 0.423 1.000 1.000 1.000 0.999 0.918
One fundamental difference between the Leisch et al. method and the proposed method is whether the changepoint is estimated. In the Leisch et al. method, the changepoint estimation cannot be performed; in order to do so, another step is needed. As in Chu et al. (1996), for example, it is possible to define the changepoint as the point at which the maximum of the LR statistics is attained over the period from the starting point to the first hitting point. In contrast, the proposed method yields not only the first hitting point but also the changepoint simultaneously. This is because, in the proposed method, the best model is selected at each monitoring point from a battery of alternative models obtained by changing the changepoint subject to condition (11), including the no-structural-change model. The proposed method is therefore very computer intensive. Table 2 shows frequency counts of selecting structural change models. In the LWZ criterion, the pair (c_0 = 0.1, δ_0 = 0.05) is used, considering the results of the preceding simulations. In the cases of no structural change, the YAO criterion (L = 1 and L = 10) and the LWZ criterion (L = 1) show poor performance. In contrast, the performance of the LWZ criterion (L = 10) is comparable to the hypothesis-testing methods. In addition, it is shown that the more samples are obtained, the better the performance, because a larger penalty ((ln T)^2.05) is imposed in the LWZ criterion than in the YAO criterion (ln T). In the cases of structural change, the proposed method using the LWZ criterion (L = 10) outperforms the hypothesis-testing methods, particularly in the late-change case. The max ME and range ME tests with small window sizes of h = 1/4 and h = 1/2 show poor performance in small samples. More interesting features are shown in Table 3. Concerning the mean detection delay, the proposed method using the LWZ criterion (L = 10) significantly outperforms the hypothesis-testing methods. One fundamental drawback of the Leisch et al. method is that it remains unknown how small h should be. The smaller h is used, the quicker
Table 2. Frequency counts of selecting structural change models T 25 50 100 200 300 25 50 100 200 300 28 55 110 220 330 CP YAO L=1 0.838 0.822 0.852 0.840 0.868 0.986 1.000 1.000 1.000 1.000 L=10 0.401 0.472 0.538 0.582 0.590 0.961 0.999 1.000 1.000 1.000 LWZ L=1 0.424 0.245 0.138 0.054 0.020 0.941 0.994 1.000 1.000 1.000 L=10 0.146 0.104 0.067 0.031 0.012 0.890 0.989 1.000 1.000 1.000 0.994 1.000 1.000 1.000 1.000 max-RE 0.088 0.078 0.073 0.084 0.087 0.931 0.996 1.000 1.000 1.000 0.691 0.953 1.000 1.000 1.000 max-ME h=1/4 0.091 0.090 0.109 0.090 0.090 0.685 0.948 1.000 1.000 1.000 0.445 0.828 0.997 1.000 1.000 h=1/2 0.104 0.108 0.109 0.105 0.098 0.832 0.992 1.000 1.000 1.000 0.660 0.966 1.000 1.000 1.000 h=1 0.121 0.105 0.109 0.108 0.103 0.925 1.000 1.000 1.000 1.000 0.823 0.992 1.000 1.000 1.000 range-ME h=1/4 0.058 0.051 0.065 0.054 0.060 0.108 0.206 0.607 0.955 0.998 0.034 0.072 0.320 0.824 0.979 h=1/2 0.065 0.064 0.055 0.053 0.055 0.277 0.640 0.966 0.999 1.000 0.100 0.386 0.847 0.998 1.000 h=1 0.081 0.049 0.061 0.045 0.042 0.660 0.950 0.999 1.000 1.000 0.417 0.858 0.999 1.000 1.000
161
25 75 0.999 0.996 0.994 50 150 1.000 1.000 1.000 100 300 1.000 1.000 1.000 200 600 1.000 1.000 1.000 300 900 1.000 1.000 1.000 Note: CP denotes change point.
Table 3. The mean and standard deviation of detection delay T 25 50 100 200 300 25 50 100 200 300 T 25 50 100 200 300 Change point 28 55 110 220 330 75 150 300 600 900 Change point 28 55 110 220 330 YAO L=1 16( 23) 15( 18) 15( 11) 16( 11) 16( 11) 15( 13) 15( 11) 16( 10) 17( 10) 18( 10) h=1/4 22( 22) 30( 32) 25( 16) 30( 10) 37( 11) L=10 25( 26) 21( 21) 19( 11) 19( 10) 19( 10) 19( 13) 19( 11) 19( 9) 19( 10) 20( 9) max-ME h=1/2 23( 25) 26( 26) 30( 11) 42( 13) 53( 15) LWZ L=1 17( 24) 20( 26) 22( 18) 24( 16) 27( 15) 21( 20) 23( 17) 26( 14) 30( 15) 34( 15) h=1 24( 19) 32( 19) 44( 13) 62( 16) 76( 19) L=10 25( 25) 27( 30) 25( 19) 26( 15) 27( 14) 25( 20) 25( 17) 27( 13) 31( 14) 34( 15) h=1/4 32( 10) 49( 15) 73( 43) 75( 63) 71( 50) max-RE 28( 36) 25( 27) 24( 16) 27( 14) 30( 15) 69( 45) 104( 69) 127( 75) 147( 73) 165( 80) range-ME h=1/2 30( 14) 44( 26) 60( 47) 66( 20) 80( 15) h=1 33( 21) 46( 29) 58( 14) 82( 16) 102( 18) 40( 22) 61( 45) 69( 33) 91( 17) 111( 19)
25 75 26( 28) 27( 29) 29( 28) 37( 8) 35( 13) 50 150 39( 47) 34( 38) 37( 29) 48( 16) 55( 32) 100 300 30( 28) 33( 19) 47( 18) 74( 43) 77( 67) 200 600 31( 12) 45( 15) 64( 25) 98(114) 74( 24) 300 900 38( 12) 55( 18) 78( 30) 84( 86) 87( 17) Note: The number in each parenthesis indicates standard deviation.
Liu, J., Wu, S., Zidek, J. V. (1997). On segmented multivariate regressions. Statistica Sinica, 7, 497-525.
Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics, 6, 461-464. Yao, Y.-C. (1988). Estimating the number of change-points via Schwarz criterion. Statistics and Probability Letters, 6, 181-189.
Journal of Modern Applied Statistical Methods May, 2005, Vol. 4, No.1, 163-171
On The Power Function Of Bayesian Tests With Application To Design Of Clinical Trials: The Fixed-Sample Case
Lyle Broemeling
Department of Biostatistics and Applied Mathematics University of Texas MD Anderson Cancer Center
Dongfeng Wu
Department of Mathematics and Statistics Mississippi State University
Using a Bayesian approach to clinical trial design is becoming more common. For example, at the MD Anderson Cancer Center, Bayesian techniques are routinely employed in the design and analysis of Phase I and II trials. It is important that the operating characteristics of these procedures be determined as part of the process when establishing a stopping rule for a clinical trial. This study determines the power function for some common fixed-sample procedures in hypothesis testing, namely the one and two-sample tests involving the binomial and normal distributions. Also considered is a Bayesian test for multi-response (response and toxicity) in a Phase II trial, where the power function is determined. Key words: Bayesian; power analysis; sample size; clinical trial
Introduction

The Bayesian approach to testing hypotheses is becoming more common. For example, a recent review volume (Crowley, 2001) contains many contributions in which Bayesian considerations play a prominent role in the design and analysis of clinical trials. Also, in an earlier Bayesian review (Berry & Stangl, 1996), methods are explained and demonstrated for a wide variety of studies in the health sciences, including the design and analysis of Phase I and II studies. At our institution, the Bayesian approach is often used to design such studies; see Berry (1985, 1987, 1988), Berry and Fristed (1985), Berry and Stangl (1996), Thall and Russell (1998), Thall, Estey, and Sung (1999), Thall, Lee, and Tseng (1999), Thall and Chang (1999), and Thall et al. (1998) for some recent references where Bayesian ideas have been the primary consideration in designing Phase I and II studies. Of related interest in the design of a trial is the estimation of sample size based on Bayesian principles, where Smeeton and Adcock (1997) provided a review of formal decision-theoretic ideas in choosing the sample size. Typically, the statistician, along with the investigator, will use information from previous related studies to formulate the null and alternative hypotheses and to determine what prior information is to be used for the Bayesian analysis. With this information, the Bayesian design parameters that determine the critical region of the test are given, the power function is calculated, and lastly the sample size is determined as part of the design process. In this study, only fixed-sample size procedures are used. First, one-sample binomial and normal tests will be considered, then two-sample tests for binomial and normal populations, and lastly a test for the multinomial parameters of a multi-response Phase II trial will be considered. For each test, the null and alternative hypotheses will be formulated and the power function determined. Each case will be illustrated with an example, where the power function is calculated for several values of the Bayesian design parameters.
Lyle Broemeling is Research Professor in the Department of Biostatistics and Applied Mathematics at the University of Texas MD Anderson Cancer Center. Email: [email protected]. Dongfeng Wu got her PhD from the University of California, Santa Barbara. She is an assistant professor at Mississippi State University.
163
164
For the design of a typical Phase II trial, the investigator and statistician use prior information on previous related studies to develop a test of hypotheses. If the main endpoint is response to therapy, the test can be formulated as a sample from a binomial population, thus if Bayesian methods are to be employed, prior information for a Beta prior must be determined. However, if the response is continuous, the design can be based on a onesample normal population. Information from previous related studies and from the investigators experience will be used to determine the null and alternative hypotheses, as well as the other design parameters that determine the critical region of the test. The critical region of a Bayesian test is given by the event that the posterior probability of the alternative hypothesis will exceed some threshold value. Once a threshold value is used, the power function of the test can be calculated. The power function of the test is determined by the sample size, the null and alternative hypotheses, and the above-mentioned threshold value. Results Binomial population Consider a random sample from a Bernoulli population with parameters n and , where n is the number of patients and is the probability of a response. Let X be the number of responses among n patients, and suppose the null hypotheses is H: 0 versus the alternative A: > 0 . From previous related studies and the experience of the investigators, the prior information for is determined to be Beta(a,b), thus the posterioir distribution of is Beta (x+a, n-x+b), where x is the observed number of responses among n patients. The null hypothesis is rejected in favor of the alternative when Pr[ > 0 / x, n] > , (1)
where the outer probability is with respect to the conditional distribution of X given . The power (2) at a given value of is interpreted as a simulation as follows: (a) select (n, ), and set S=0, (b) generate a X~Binomial(n, ), (c) generate a ~Beta(x+a, n-x+b), (d) if Pr [ > 0 / x, n] > , let the counter S =S+1, otherwise let S=S, (e) repeat (b)-(d) M times, where M is large, and (f) select another and repeat (b)-(d). The power of the test is thus S/M and can be used to determine a sample size by adjusting the threshold , the probability of a
Type I error g( 0 ), and the desired power at a particular value of the alternative. The approach taken is fixing the Type I error at and finding n so that the power is some predetermined value at some value of deemed to be important by the design team. This will involve adjusting the critical region by varying the value of the threshold . An example of this method is provided in the next section. The above hypotheses are one-sided, however it is easy to adjust the above testing procedure for a sharp null hypothesis. Normal Population Let N( , ) denote a normal population with mean and precision , where both are unknown and suppose we want to test the null hypothesis H: = 0 versus A:
1
where is usually some large value as .90, .95, or .99. The above equation determines the
BROEMELING & WU
_
165
and variance
non-informative prior distribution for and , the Bayesian test is to reject the null in favor of the alternative if the posterior probability P of the alternative hypothesis satisfies P > , where P = D 2 /D and, D = D 1 + D 2 . It can be shown that D 1 = { (n/2)}2 [ n( 0 and
_ n/2
. Using a
it can be shown that the power (size) of the test at 0 is 1- . Thus in this sense, the Bayesian and classical t-test are equivalent. Two binomial populations Comparing two binomial populations is a common problem in statistics and involves the null hypothesis H: 1 = 2 versus the
(3) (4)
} /{(2 )
n/2
alternative A: 1 2 , where 1 and 2 are parameters from two Bernoulli populations. Assuming uniform priors for these two populations, it can be shown that the Bayesian test is to reject H in favor of A if the posterior probability P of the alternative hypothesis satisfies P > , where P = D 2 /D, (8) (9)
x ) 2 + (n-1) s 2 ] n / 2 }
( n 1) / 2
(5)
} } (6)
(2 )
( n 1) / 2
[(n-1)
( n 1) / 2
where is the prior probability of the null hypothesis. The power function of the test is g( , ) = Pr X / , [ P >
R and >0
/ n,
x, s 2 ],
(7)
: x2 ) ( x1 + x2 + 1)(n1 + n2 x1 x2 ) } (n1 + n2 + 2) ,
1 1
1 1
BC(n :x
)BC( n2
where BC(n,x) is the binomial coefficient x from n. Also, D 2 = (1- )(n 1 +1) (n 2 +1) , where is the prior probability of the null hypothesis. X 1 and X 2 are the number of responses from the two binomial populations with parameters (n 1 , 1 ) and ( n2 , 2 ) respectively. The alternative hypothesis is twosided, however the testing procedure is easily revised for one-sided hypotheses. In order to choose sample sizes n 1 and n 2 , one must calculate the power function g( 1 , 2 ) = Pr x , x
1
where P is given by (3) and the outer probability is with respect to the conditional distribution of X given and . The above test is for a two-sided alternative, but the testing procedure is easily revised for one-sided hypotheses. This will be used to find the sample size in an example to be considered in a following section. In the case when the null and alternative hypotheses are H: 0 and A: > 0 and the prior distribution for the parameters is f( , ) 1 / , where H is rejected in favor of A whenever Pr[
/ 1 , 2
[P
>
x1 , x2 , n1 , n2 ], (1 , 2 ) (0,1) x (0,1)
(10)
166
where P is given by (9) and the outer probability is with respect to the conditional distribution of X 1 and X 2 , given 1 and 2 . As given above, (10) can be evaluated by a simulation procedure similar to that described in 3.1. Two normal populations Consider two normal populations with means 1 and 2 and precisions 1 and 2 respectively, and suppose the null and alternative hypotheses are H: 1 2 and A: respectively. Assuming a noninformative prior for the parameters, namely f( 1 , 2 , 1, 2 ) = 1/ 1 2 , one can show that the posterior distribution of the two means is such that 1 and 2 are independent and i
_
1 > 2
+ 1, n2 + 1,..., nk + 1) .
A typical hypothesis testing problem, see [14], is given by the null hypothesis ( k=4), where H: 1 + 2 k12 or 1 + 3 k13 versus the alternative A: 1
xi
_
n i / si ), where n i is the
2
xi and si
The null hypothesis states that the response rate 1 + 2 is less than some historical value or that the toxicity rate some historical value
1 + 3
mean
x i , and precision n i / si
_
. It is known that
( i -
x i )(
n i / si )
2 1/ 2
is rejected if the response rate is larger than the historical or the toxicity rate is too low compared to the historical. Pr[ A /data]> (13)
has a Students t-
distribution with n i -1 degrees of freedom. Therefore the null hypothesis is rejected if Pr[ 1 > 2 /data]> . (11)
where is some threshold value. This determines the critical region of the test, thus the power function is g( )= Pr n / { Pr[ A / data] >
Multinomial Populations Consider a multinomial population with k categories and corresponding probabilities i ,
i =k
for i=1,2,,k. Suppose there are n patients and that n i belong to the i-th category.
i= 1,2,,k, where
i =1
= 1 and
0 < i < 1
n = (n1 , n2 ,..., nk )
The power function will be illustrated for the multinomial test of hypothesis with a Phase I trial, where response to therapy and toxicity are considered in designing the trial.
i =1
f( / data)
i n
i =k
i =k
i
i =
is
is greater than
},
(14)
given
BROEMELING & WU
Examples The above problems in hypothesis testing are illustrated by computing the power function of some Bayesian tests that might be used in the design of a Phase II trial. One-Sample Binomial No prior information Consider a typical Phase II trial, where the historical rate for toxicity was determined as .20. The trial is to be stopped if this rate exceeds the historical value. See Berry (1993) for a good account of Bayesian stopping rules in clinical trials. Toxicity rates are carefully defined in the study protocol and are based on the NCI list of toxicities. The null and alternative hypotheses are given as H:
167
The Bayesian test behaves in a reasonable way. For the conventional type I error of .05, a sample size of N=125 would be sufficient to detect the difference .3 versus .2 with a power of .841. It is interesting to note that the usual binomial test, with alpha = .05 and power .841, requires a sample of size 129 for the same alternative value of . For the same alpha and power, one would expect the Bayesian (with a uniform prior for ) and the binomial tests to behave in the same way in regard to sample size. With prior information Suppose the same problem is considered as above, but prior information is available with 50 patients, 10 of whom have experienced toxicity. The null and alternative hypotheses are as above, however the null is rejected whenever (16) Pr[ > / x, n] > , where is independent of ~ Beta(10,40). This can be considered as a one-sample problem where a future study is to be compared to a historical control. As above, compute the power function (see Table 2) of this Bayesian test with the same sample sizes and threshold values in Table 1. The power of the test is .758, .865, and .982 for = .4 for N= 125, 205, and 500, respectively. This illustrates how important is prior information in testing hypotheses. If the hypothesis is rejected with the critical region Pr[ >.2 / x, n] > , (17) the power (Table 1) will be larger than the corresponding power (Table 2) determined by the critical region (16), because of the additional posterior variability introduced by the historical information contained in . Thus, larger sample sizes are required with (16) to achieve the same power as with the test given by (17).
.20
and A:
> .20 ,
(15)
where is the probability of toxicity. The null hypothesis is rejected if the posterior probability of the alternative hypothesis is greater than the threshold value . The power curve for the following scenarios will be computed (see Equation 2), with sample sizes n = 125, 205, and 500, threshold values = .90, .95, .99, M=1000, and
power increases with and for given N and , the power decreases with , and for given and
null value 0 = .20. It is seen that the power of the test at = .30 and = .95, is .841, .958, and .999 for n = 125, 205, and 500, respectively. Note that for a given N and , the
168
0 .1 .2 .3 .4 .5 .6 .7 .8 .9 1.0
.90 0,0,0 0,0,0 .107,.099,.08 .897,.97,1 1,1,1 1,1,1 1,1,1 1,1,1 1,1,1 1,1,1 1,1,1
.95 0,0,0 0,0,0 .047,.051,.05 .841,.958,.999 1,1,1 1,1,1 1,1,1 1,1,1 1,1,1 1,1,1 1,1,1
.99 0,0,0 0,0,0 .013,.013,.008 .615,.82,.996 .996,1,1 1,1,1 1,1,1 1,1,1 1,1,1 1,1,1 1,1,1
0 .1 .2 .3 .4 .5 .6 .7 .8 .9 1.0
.90 0,0,0 0,0,0 .016,.001,.000 .629,.712,.850 .996,.999,1 1,1,1 1,1,1 1,1,1 1,1,1 1,1,1 1,1,1
.95 0,0,0 0,0,0 .002,.000,.000 .362,.374,.437 .973,.998,1 1,1,1 1,1,1 1,1,1 1,1,1 1,1,1 1,1,1
.99 0,0,0 0,0,0 .000,.000,.000 .004,.026,.011 .758,.865,.982 .999,1,1 1,1,1 1,1,1 1,1,1 1,1,1 1,1,1
Two Binomial Populations The case of two binomial populations was introduced in section 4.2, where equation (10) gives the power function for testing H: 1 = 2 versus the alternative A: 1 2 . In this example, let n 1 = 20 = n 2 be the sample sizes of the two groups and suppose the prior probability of the null hypotheses is = .5. The power at each point ( 1 , 2 ) is calculated via simulation, using equation (10) with = .90. Table 3 lists the power function for this test.
When the power is calculated with the usual two-sample, two-tailed, binomial test with ( 1 , 2 ) = (.3, .9), the power is .922, which is almost equivalent to the above Bayesian test. This is to be expected, because we are using a uniform prior density for both Bernoulli parameters. It is not too uncommon to have two binomial populations in a Phase II setting, where 1 and 2 are response rates to therapy. alpha = .013, sample sizes n 1 = 20 = n 2 , and
BROEMELING & WU
169
2 1
.1 .2 .3 .4 .5 .6 .7 .8 .9 1 .1 .004 .031 .171 .368 .619 .827 .950 .996 1 1 .2 .032 .011 .028 .098 .289 .527 .775 .928 .996 1 .3 .135 .028 .006 .025 .100 .237 .464 .768 .946 1 .4 .360 .106 .029 .013 .022 .086 .254 .491 .840 1 .5 .621 .281 .107 .028 .007 .035 .113 .316 .647 .984 .6 .842 .536 .252 .075 .017 .005 .037 .132 .359 .873 .7 .958 .744 .487 .244 .108 .027 .013 .028 .156 .567 .8 .992 .913 .767 .542 .291 .116 .049 .010 .037 .200 .9 1 .997 .961 .847 .640 .357 .171 .040 .006 .017 1 1 1 1 .999 .981 .882 .587 .205 .014 .000
A Phase II trial with toxicity and response rates With Phase II trials, response to therapy is usually taken to be the main endpoint, however in reality one is also interested in the toxicity rate, thus it is reasonable to consider both when designing the study. Most Phase II trials are conducted not only to estimate the response rate, but to learn more about the toxicity. In such a situation, the patients can be classified by both endpoints as follows: Table 4. Number of and Probability of Patients by Response and Toxicity. Toxicity Response Yes No Yes (n , ) (n , )
1 1 2 2
r = 1 + 2 and the rate of toxicity be t = 1 + 3 , where 1 is the probability a patient will experience
Let the response rate be toxicity and respond to therapy, and n 1 is the number of patients who fall into that category. Following Petroni and Conoway (2001), let the null hypothesis be H: r
r0
or t
t0
No
(n 3 , 3 )
(n 4 , 4 )
where r 0 and t 0 are given and estimated by the historical rates in previous trials.
> r0
and t
< t0 ,
170
t r
.2 .3 .4 .5 .6 .7 .8
.2
.3
.4
.5
.30. That is, the alternative hypothesis is that the response rate exceeds .40 and the toxicity rate is less than .30, and the null is rejected in favor of the alternative if the latter has a posterior probability in excess of . Table 5 gives the power for n=100 patients and threshold = .90. From above, the power of the test is .818 when ( r , t ) = (.7, .2), and the test behaves in a reasonable way. When the parameter values are such that the response rate is in excess of .40 and the toxicity rate is less than or equal to .30, the power is higher, relative to those parameter values when the null hypothesis is true. Conclusion We have provided a way to assess the sampling properties of some Bayesian tests of hypotheses used in the design and analysis of Phase II clinical trials. The one-sample binomial scenario is the most common in a Phase II trial, where the response to therapy is typically binary. We think it is important to know the power function of a critical region that is determined by Bayesian considerations, just as it is with any other test.
t0 =
r0
= .40 and
The Bayesian approach has one major advantage and that is prior information, and when this is used in the design of the trial, the power of the test will be larger then if prior information had not been used. We have confined this investigation to the fixed-sample case, but will seek to expand the results to the more realistic situation where Bayesian sequential stopping rules will be used to design Phase II studies. References Berry D. A. (1985). Interim analysis in clinical trials: Classical versus Bayesian approaches. Statistics in Medicine, 4, 521-526. Berry D. A. (1987). Interim analysis in clinical trials: the role of the likelihood principal. The American Statistician, 41, 117122. Berry D. A. (1988). Interim analysis in clinical research. Cancer Investigations, 5, 469477. Berry D. A. (1993). A case for Bayesianism in clinical trials (with discussion). Statistics in Medicine, 12, 1377-1404. Berry D. A., & Fristed, B. (1985). Bandit problems. Sequential allocation of experiments. New York: Chapman-Hall.
BROEMELING & WU
Berry D. A., & Stangl D. K., (1996). Bayesian methods in health-related research,. Bayesian Biostatistics. (In D. A. Berry, & D. Stangl, eds.) New York: Marcel Dekker Inc., p. 3 66. Crowley J. (2001). Handbook of statistics in clinical oncology. New York: Marcel Dekker Inc. Petroni G. R., & Conoway M. R. (2001). Designs based on toxicity and response. (In J. Crowley, ed.) Handbook of Statistics in Clinical Oncology. New York: Marcel-Dekker Inc., p. 105 118. Smeeton N. C., & Adcock C. J. (1997, special issue). (eds.) Sample size determination. The Statistician, 4.
171
Thall P. F, Simon R., Ellenberg S. S., & Shrager R. (1988). Optimal two-stage designs for clinical trials with binary responses. Statistics in Medicine, 71, 571-579. Thall P. F., & Russell K. E. (1998). A strategy for dose-finding and safety monitoring based on efficacy and adverse outcomes in phase I/II clinical trials. Biometrics, 54, 251-264. Thall P. F., Estey E. H., & Sung H. G. (1999). A new statistical method for dosefinding based on efficacy and toxicity in early phase clinical trials. Investigational New Drugs, 17, 155-167. Thall P. F., Lee J. J., & Tseng C. H. (1999). Accrual strategies for Phase I trials with delayed patient outcome. Statistics in Medicine, 18, 1155-1169. Thall P. F., & Cheng S. C. (1999) Treatment comparisons based on twodimensional safety and efficacy alternatives in oncology trials. Biometrics, 55, 746-753.
Journal of Modern Applied Statistical Methods May, 2005, Vol. 4, No.1, 172-186
The aim of this article is to introduce the concept of Monte Carlo Integration in Bayesian estimation and Bayesian reliability analysis. Using the subject concept, approximate estimates of parameters and reliability functions are obtained for the three-parameter Weibull and the gamma failure models. Four different loss functions are used: square error, Higgins-Tsokos, Harris, and a logarithmic loss function proposed in this article. Relative efficiency is used to compare results obtained under the above mentioned loss functions. Key words: Estimation, loss functions, Monte Carlo Integration, Monte Carlo Simulation, reliability functions, relative efficiency.
Introduction In this article, the concept of Monte Carlo Integration (Berger, 1985) is used to obtain approximate estimates of the Bayes rule that is ultimately used to derive estimates of the reliability function. Monte Carlo Integration is used to first obtain approximate Bayesian estimates of the parameter inherent in the failure model, and using this estimate directly, obtain approximate Bayesian estimates of the reliability function. Secondly, the subject concept is used to directly obtain Bayesian estimates of the reliability function. In the present modeling effort, the threeparameter Weibull and the gamma failure models are considered, that are respectively defined as follows:
c f ( x; a, b, c)= ( x a) c1 e b x a; b, c > 0
( x a ) c b
(1) where a, b and c are respectively the location, scale and shape parameters; and
1 g ( x; , ) = x 1e , ( ) x
(2)
( ) =
For each of the above underlying failure models, approximate Bayesian estimates will be obtained for the subject parameter and the reliability function with the squared error, the Higgins-Tsokos, the Harris, and a proposed logarithmic loss functions. The loss functions
172
Vincent A. R. Camara earned a Ph.D. in Mathematics/Statistics. His research interests include the theory and applications of Bayesian and empirical Bayes analyses with emphasis on the computational aspect of modeling. Chris P. Tsokos is a Distinguished Professor of Mathematics and Statistics. His research interests are in statistical analysis and modeling, operations research, reliability analysis-ordinary and Bayesian, time series analysis.
where and are respectively the shape and scale parameters. For these two failure models, consider the scale parameters b and to behave as random variables that follow the lognormal distribution which is given by
1 Ln ( ) 2
2
(3).
173
LLn ( R, R)= Ln
Square error loss function The popular square error loss function places a small weight on estimates near the true value and proportionately more weight on extreme deviation from the true value of the parameter. Its popularity is due to its analytical tractability in Bayesian reliability modeling. The squared error loss is defined as follows:
R R
It places a small weight on estimates whose ratios to the true value are close to one, and proportionately more weight on estimates whose ratios to the true value are significantly different
from one. R(t ) and R (t ) represent respectively the true reliability function and its estimate. Methodology Considering the fact that the reliability of a system at a given time t is the probability that the system fails at a time greater or equal to t, the reliability function corresponding to the three-parameter Weibull failure model is given by
LSE ( R, R)= R R
(4)
Higgins-Tsokos loss function The Higgins-Tsokos loss function places a heavy penalty on extreme over-or underestimation. That is, it places an exponential weight on extreme errors. The Higgins-Tsokos loss function is defined as follows:
LHT ( R, R)=
f1e f 2 ( R R ) + f 2 e f1 ( R R ) 1, f1 + f 2
R (t ) = e
(t a )c b
f1 , f 2
0.
( )
where (l1 , l2 ) denotes the incomplete gamma function. When is an integer, equation (9) becomes
To our knowledge, the properties of the Harris loss function have not been fully investigated. However it is based on the premises that if the system is 0.99 reliable then on the average it should fail one time in 100, whereas if the reliability is 0.999 it should fail one time in 1000. Thus, it is ten times as good. Logarithmic loss function The logarithmic loss function characterizes the strength of the loss logarithmically, and offers useful analytical tractability. This loss function is defined as:
. Consider the situation where there are m independent random variables X 1 , X 2 ,..., X m with the same probability density function dF ( x| ) , and each of them having n realizations, that is, X 1 : x11 , x 21 , ..., x n1 ; X 2 : x12 , x 22 , ..., x n2 ; . ;
R(t )
= e
X m : x1m , x 2 m , ..., x nm
R(t ) =
1 R 1 R
L H ( R, R ) =
,k
0.
(6)
1 t i =0 i !
,l
0.
(7)
(8)
(9)
, t >0
is obtained
from the n realizations x1 j , x2 j ,... xnj , where j = 1,...m. Repeating this independent procedure k times, a sequence of MVUE is obtained for the
m g ( i ) L( x; i ) ( i ) i =1 h ( i ) = lim m m L( x ; i ) ( i ) i =1 h ( i )
g ( ) B( HT )
m L ( x ; i ) ( i ) L ( x ; ) ( ) d = lim m i =1 h ( i )
(11)
Ln
Equations (10) and (11) imply that the posterior expected value of g( ) is given by
f1 , f 2
0.
f g ( ) e 2 L( x; ) ( )d
g ( ) L( x; ) ( )d E ( g ( ) | x )= L( x; ) ( )d
g ( ) L( x; ) ( )d 1 g ( ) g ( ) B( H ) = 1 L( x; ) ( )d 1 g ( )
f g ( ) e 1 L( x; ) ( )d
Note that E h represents the expectation with respect to the probability density function h, and g( ) is any function of which assures convergence of the integral; also, h( ) mimics the posterior density function. For the special case where g ( ) = 1 , equation (10) yields
h ( ) m g ( i ) L ( x ; i ) ( i ) lim ^ m i =1 h ( i )
g ( ) L ( x ; ) ( )
This approach is used to obtain approximate Bayesian estimates of g ( ) , for the different loss functions under study. Approximate Bayesian estimates of the parameter and the reliability are then obtained by replacing g ( ) by and R(t) respectively in the derived expressions corresponding to the approximate Bayesian estimates of g ( ) . The Bayesian estimates used to obtain approximate Bayesian estimates of the function g ( ) are the following when the squared error, the Higgins-Tsokos, the Harris and the proposed logarithmic loss functions are used:
(10).
g ( ) L( x; ) ( )d g ( ) B( SE ) = L( x; ) ( )d
1 f +f 1 2
(12)
175
Using equation (12) and the above Bayesian decision rules, approximate Bayesian estimates of g( ) corresponding respectively to the squared error, the Higgins-Tsokos, the Harris and the proposed logarithmic loss functions are respectively given by the following expressions when m replicates are considered.
m g ( i ) L ( x; i ) ( i ) i=1 h( i ) (14) g ( ) E ( SE ) = m L ( x; i ) ( i ) i=1 h( i )
f g ( i ) me 1 L ( x; i ) ( i ) i =1 1 h( i ) g ( ) E ( HT ) = Ln f1 + f 2 f g ( i ) me 2 L ( x; i ) ( i ) i=1 h ( i )
g ( ) E ( Ln) =e
First, use the above general functional forms of the Bayesian estimates of g( ) to obtain approximate Bayesian estimates of the random parameter inherent in the underlying failure model. Furthermore, these estimates are used to obtain approximate Bayesian reliability estimates. Second, use the above functional forms to directly obtain approximate Bayesian estimates of the reliability function. Three-parameter Weibull underlying failure model In this case the parameter , discussed above, will correspond to the scale parameter b. The location and shape parameters a and c are considered fixed. The likelihood function corresponding to n independent random variables following the three-parameter Weibull failure model is given by
(15)
f1 , f 2
0.
(16) and
Furthermore, it can be shown that S n is a sufficient statistic for the parameter b, and a minimum variance unbiased estimator of b is given by
n
where S n =
n i =1
( xi a) c .
(17)
(18)
( xi a ) c n
.
i =1
The
p ( y | b )=
1 b y e , y > 0 ,b > 0 . b
(19)
j =1
m j =1 e
i =1
e 0
j =1
f1 , f 2
b E(H )
m
bj 1 b j 1 1 b j
j =1
Approximate Bayesian estimates for the scale parameter b and the reliability function R(t ) are obtained, with the use of equations (18) and (22), by replacing respectively g (b) by b and
j =1
b j 1
(22)
Sn
nL n ( b j ) + ( c 1)
i =1
1 L n ( xi a ) 2
(25)
Sn
nL n ( b j ) + ( c 1)
L n ( xi a )
i =1
1 2
Ln ( b j )
f2 b j
Sn
n L n ( b j ) + ( c 1)
L n ( xi a )
i =1
1 2
(24)
Ln ( b j )
b 1 n
(21)
j =1
n
f1 b j
Sn
n L n ( b j ) + ( c 1)
L n ( xi a )
i =1
1 2
Ln (b j )
E (e )= E e n
b
( xi a ) c
b E ( HT ) =
1 Ln f1 + f 2
n
Using equation (20) and the fact that the X i s are independent, the moment generating function of the minimum variance unbiased estimator of the parameter b is
(23)
Ln (b j )
= (1 b) 1
(20)
b j e
n Ln ( b j ) S L n ( xi a ) 1 n n L n ( b j ) + ( c 1) 2 i =1 b j
E (e y )=
1 e b0
1 y ( ) b
dy
177
1 Ln f1 + f 2
Sn b
j
R E ( H T ) (t ) =
j =1
(26) The approximate Bayesian estimates of the reliability corresponding to the first method are therefore given by
1 bE
f1 , f 2
R
m
R Eb (t , a , c|bE ) =
t>a ,
(27)
and
(t a )
j =1
bj
R E ( Ln ) ( t ) = e
h1 (b i ) :
.
(31) Gamma underlying failure model The likelihood function corresponding to n independent random variables following the two-parameter gamma underlying failure model can be written under the following form:
R E ( SE ) (t )
m bj
e
m
bj
j =1
bj
j =1
= e
e
n
where S n
'
xi .
i =1
Sn
nLn ( b j ) + ( c 1)
Ln ( xi a )
i =1
Sn
nLn ( b j ) + ( c 1)
i =1
1 Ln ( b j ) 2
( t a )c
1 Ln ( b j ) Ln ( xi a ) 2
(28)
,
L2 ( x, ; )
1
' Sn n Ln ( )
j =1
( 1)
n i =1
Ln ( xi ) nLn ( ( ))
(32)
n Sn 1 nLn ( b j ) + ( c 1 ) Ln ( x i a ) 2 b j i =1
2 Ln ( b j )
where b E stands respectively for the above approximate Bayesian estimates of the scale parameter b. Approximate Bayesian reliability estimates corresponding to the second method are also derived by replacing g ( ) by R(t) in equations (14), (15), (16) and (17). The obtained estimates corresponding respectively to the squared error, the Higgins-Tsokos, the Harris and the proposed logarithmic loss functions are respectively given by the following expressions,
1
( t a )c
j =1
1 e
n Sn 1 nLn ( b j ) + ( c 1 ) Ln ( x i a ) 2 b j i =1
Sn
nLn (b
) + ( c 1 )
L n ( xi a )
i =1
1 2
Ln (b
(30)
2 Ln ( b j )
( t a )c
j =1
( t a )c
1 e
j 2
Sn
nLn (b
)+ ( c 1)
L n ( xi a )
i =1
1 2
b E ( Ln ) = e
j =1
e 0,
j =1
E (H )
(t ) =
( t a )c
n
f2e
Sn
n Ln ( b j ) + ( c 1)
L n ( xi a )
i =1
1 2
(29)
Ln (b j )
2 n S 1 Ln ( b j ) Ln ( x i a ) n nLn ( b j ) + ( c 1 ) 2 b j i =1
Ln ( b j ) e
f1 e
n L n ( b j ) + ( c 1 )
L n ( xi a )
e
( t a )c b j
i =1
1 2
j =1
n
Ln (b j )
Ln ( b j )
( t a )c b j
E ( SE )
m
i =1
E (e )= E (e
i =1
xi n
)
(33)
E ( HT )
e
f2
j
j =1
' Sn
f1 , f 2
is given by
1 1
j
j =1
The i s that are the minimum variance unbiased estimates of the scale parameter will
j 1
and
2 n ' Sn 1 Ln ( j ) nLn ( j ) + ( 1 ) Ln ( xi ) 2 j i =1
j =1
Ln ( j ) e
E ( Ln ) = e
h2 ( i ) :
j =1
2 n ' Sn 1 Ln ( j ) Ln ( xi ) nLn ( j ) + ( 1) 2 j i =1
play the role of the i ' s . Considering the lognormal prior, equation (14), (15), (16) and (17) yield the following approximate Bayesian estimates of the scale parameter corresponding respectively to the squared error, the Higgins-Tsokos, the Harris and the proposed lognormal loss functions, after
' Sn
n L n ( j ) + ( 1)
n i =1
(37)
(38)
Approximate Bayesian estimates for the scale parameter and the reliability function R(t) are obtained, with the use of equations (32) and (34) by replacing respectively g( ) by and R(t ) in equations (14), (15), (16) and (17).
(34)
m j =1
e
j
' Sn
n L n ( j ) + ( 1)
n i =1
1 Ln ( j ) Ln ( x i ) 2
h2 ( , | )=
( n ) ( n ) n
n 1
, > 0
E(H )
1 Ln ( j ) Ln ( x i ) 2
2
e 0
j =1
n L n (
) + ( 1 )
n i =1
1 L n ( xi ) 2
(36)
f1
' Sn
n L n (
) + ( 1 )
L n ( xi )
i =1
1 2
Ln (
= 1 n
is a minimum variance unbiased estimator of , and its moment generating function is given by
j =1
1 Ln f1 + f 2
Ln (
2
' Sn
n L n ( j ) + ( 1)
n i =1
(35)
xi
je
j =1 1 Ln ( j ) Ln ( x i ) 2
2
i =1
' Sn
n L n ( j ) + ( 1)
Ln ( x i )
1 Ln ( j ) 2
179
1 Ln f1 + f 2
( )
j
,
R E ( t , | E )
( )
(39) where
j =1
j =1
R E ( SE ) (t ) =
m
j =1
1 ( )
Sn
'
(40)
j =1
(43)
j =1
Ln 1
()
2 n ' Ln ( j ) Sn nLn ( j )+ ( 1) Ln ( xi ) 1 2 i =1 m j e j =1
i =1
(,
n Ln ( j ) + ( 1)
Ln ( xi )
Ln ( j )
Sn
'
n Ln ( j ) + ( 1)
Ln ( xi )
i =1
( ,
1 Ln ( j )
and
R E ( Ln ) (t ) =
t ) j
2 ' n Sn 1 Ln ( j ) nLn ( j )+ ( 1) Ln ( xi ) 2 i =1 j
h2 ( i ) :
( ) e t j =1 ( , )
' Sn
n Ln ( j ) + ( 1)
n i =1
(42)
( ,
)
j 1 Ln ( j ) Ln ( xi ) 2
2
' Sn
n Ln ( j ) + ( 1)
Ln ( xi )
i =1
( ) ( ,
estimate of the scale parameter . The approximate Bayesian reliability estimates corresponding to the second method are obtained by replacing g ( ) by R(t ) in equations(14), (15), (16) and (17). The obtained estimates corresponding respectively to the squared error, the Higgins-Tsokos, the Harris and the proposed logarithmic loss functions are given by the following expressions, after
f1 , f 2
0,
R E ( H ) (t ) =
1 Ln ( j ) 2
2
f2 1
' Sn
n L n (
) + ( 1 )
L n ( xi )
i =1
1 2
(41)
= 1
, t
E ( )
j =1
( ,
f1 1
' Sn
n L n (
) + ( 1 )
L n ( xi )
i =1
1 2
Ln (
Ln (
( ,
IMSE ( R E (t )) =
R E (t ) R(t )
dt
(44)
If the relative efficiency is smaller than one, the Bayesian estimate corresponding to the squared error loss is less efficient. The squared error will be more efficient if the relative efficiency is greater than one. If the relative efficiency is approximately equal to one, the Bayesian reliability estimates are equally efficient. Numerical Simulations In the numerical simulations, Bayesian and approximate Bayesian estimates of the scale parameter for the gamma failure model and the lognormal prior will be compared, when the squared error loss is used and the shape parameter is considered fixed. Second, the new approach will be implemented, and approximate Bayesian reliability estimates will be obtained for the three-parameter Weibull and the gamma failure model under the squared error, the Higgins-Tsokos (with f1 = 1, f 2 = 1 ), the Harris, and the logarithmic loss functions, respectively. Comparison between Bayesian estimates and approximate Bayesian estimates of the scale parameter Using the square error loss function, the gamma underlying failure model and the lognormal prior, Table 1 gives estimates of the scale parameter when the shape parameter is fixed and equal to one.
Define the relative efficiency as the ratio of the IMSE of the approximate Bayesian reliability estimates using a challenging loss function to that of the popular squared error loss. The relative efficiencies of the Higgins-Tsokos, the Harris and the proposed logarithmic loss are respectively defined as follows:
~
Eff ( HT )
IMSE( R E ( HT ) (t ))
~
IMSE( R E ( SE ) (t ))
Eff ( H )
=
~
IMSE( R E ( H ) (t ))
~
IMSE( R E ( SE ) (t ))
2
R E ( H ) (t ) R(t ) dt R E ( SE ) (t ) R(t ) dt
~
and
Eff ( Ln)=
IMSE( R E ( Ln ) (t ))
~
IMSE( R E ( SE ) (t ))
R E ( SE ) (t ) R(t )
R E ( HT ) (t ) R(t )
dt
2
dt
Relative Efficiency with Respect to the Squared Error Loss To compare our results, the criterion of integrated mean square error, IMSE, of the approximate Bayesian reliability estimate
R E ( SE ) (t ) R(t )
R E ( Ln ) (t ) R(t )
dt
.
dt
181
Table 1.
Lognormal prior
True value of
Bayesian estimate of
= 1, = 0.5
Number of replicates m 1 2 3 4 5 6 7 1 2 3 4 5 6 7 1 2 3 4 5 6 7 1 2 3 4 5 6 7
1.1688
0.9795 0.9883 1.0796 1.0625 1.0385 1.0899 1.0779 0.9795 0.9880 1.0351 0.9943 0.9665 0.9945 1.0017 1.9591 1.9766 2.1555 2.1162 2.0658 2.1679 2.1467 1.9591 1.9761 2.0704 1.9886 1.9331 1.9892 2.0034
= 4, = 9
1.0561
= 3, = 0.8
2.2808
= 8, = 12
2.0376
Let
R Eb ( SE ) (t ) , R Eb ( SE ) (t ) , R Eb ( HT ) (t ) ,
R Eb ( HT ) (t ) , R Eb ( H ) (t ) , R Eb ( H ) (t )
, R Eb( Ln ) (t ) and R Eb ( Ln ) (t )
represent, respectively, the approximate Bayesian reliability estimates obtained with the approximate Bayesian reliability estimates of the scale parameter b, and the ones obtained by direct computation, when the squared error, the Higgins-Tsokos, the Harris and the proposed logarithmic loss functions are used. These estimates are given below in Table 2. Table 3
gives the approximate Bayesian reliability estimates obtained directly using equations (28), (29), (30) and (31).
Gamma failure model G( = 1, = 1) A typical sample of thirty failure times that are randomly generated from G( = 1, = 1) is given below. 0.95497 2.69516 1.26364 0.54999 1.44922 0.31084 3.13788 0.51249 0.57911 0.77497 0.09670 1.47495 1.60653 0.64000 0.78403 1.47283 0.11715 0.22012 0.50421 1.07792 0.09107 0.56762 0.94337 0.62536 1.08172 0.47580 0.92341 3.81572 0.14532 1.08156
The obtained minimum variance unbiased estimates of the scale parameter b are given below
b1
The obtained minimum variance unbiased estimates of the scale parameter are given below.
b2
b3
1 = 1009127916 . 2 3
= 1140808468 . = 0.9991268436
These minimum variance unbiased estimates will be used along with likelihood function and the lognormal prior f (b; = 0.34, = 0115) .
183
Table 2.
R(t )
Approximation IMSE Relative efficiency with respect to
R Eb ( SE ) (t )
2
1 ( t 1) 2 1.1251
1 ( t 1) 2 0.9758
R Eb ( Ln ) (t )
e 1.1242 3.6301 3 10
1 ( t 1) 2
e ( t 1)
0 0
e 2.381010 4 1.0
1 ( t 1) 2 1.1251
15.25
R Eb( SE ) (t )
The above approximate Bayesian estimates yield good estimates of the true reliability function. Table 3.
Time t 1.00001 1.25 1.50 1.75 2.00 2.25 2.50 2.75 3.00 3.25 3.50 3.75 4.00
R(t ) 1.0000 0.9394 0.7788 0.5698 0.3679 0.2096 0.1054 0.0468 0.0183 0.0063 0.0019 0.0005 0.0001
R Eb( SE ) (t ) 1.0000 0.9459 0.8005 0.6062 0.4108 0.2492 0.1354 0.0659 0.0287 0.0112 0.0039 0.0012 0.0003
R Eb ( HT ) (t ) 1.0000 0.9459 0.8005 0.6062 0.4108 0.2492 0.1354 0.0659 0.0287 0.0112 0.0039 0.0012 0.0003
R Eb ( H ) (t ) 1.0000 0.9459 0.8008 0.6066 0.4112 0.2495 0.1355 0.0659 0.0287 0.0112 0.0039 0.0012 0.0003
R Eb( Ln ) (t ) 1.0000 0.9459 0.8005 0.6061 0.4105 0.2488 0.1349 0.0655 0.0284 0.0110 0.0038 0.0012 0.0003
Let
R E ( SE ) (t ) , R E ( SE ) (t ) , R E ( HT ) (t ) ,
R E ( HT ) (t ) , R E ( H ) (t )
R E ( H ) (t ), R E ( Ln ) (t )
and R E ( Ln ) (t ) represent respectively the approximate Bayesian reliability estimates obtained with the approximate Bayesian estimate of , and the ones obtained by direct computation, when the squared error, the
Higgins-Tsokos, the Harris and the proposed logarithmic loss functions are used. These estimates are given in Table 5 and Table 6. For computational convenience, the results presented in Table 3 are used to obtain approximate estimates of the analytical forms of the various approximate Bayesian reliability expressions under study. The results are given in Table 4. Table 6 gives the approximate Bayesian reliability estimates obtained directly by using equations (40), (41), (42) and (43). For computational convenience, the results presented in Table 6 are used to obtain approximate estimates of the analytical forms of the various approximate Bayesian reliability expressions under study. The results are given in Table 7.
Table 4.
R (t )
Approximation IMSE
R Eb ( SE ) (t )
2
R Eb( HT ) (t )
e
1 ( t 1) 2 1.1251
R Eb( H ) (t )
e
1 ( t 1) 2 1.1251
R Eb( Ln ) (t )
e
1 ( t 1) 2 1.1251
e (t 1)
0
1 ( t 1) 2 1.1251
2.381310 3 1
2.381310 3 1
2.381310 3 1
2.381310 3 1
R Eb( SE ) (t )
Table 5.
R (t )
R E ( SE ) (t ) e
t 1.0311
R E ( HT ) (t )
t 1.1250
R E ( H ) ( t )
t 0.9758
R E ( Ln ) (t )
t 1.1242
Approximation IMSE
e t
0.0
2.38100810 4
3.676471 3 10
1.48203410 4
3.630931 3 10
1.0
15.44
0.62
15.25
R E ( SE ) (t )
185
Table 6.
Time t 10 100 1.00 2.00 3.00 4.00 5.00 6.00 7.00 8.00 9.00 10.00
R(t ) 1.0000 0.3679 0.1353 0.9498 0.0183 0.0067 0.0025 0.0009 0.0003 0.0001 0.0000
R E ( SE ) (t ) 1.0000 0.3786 0.1437 0.0547 0.0209 0.0080 0.0031 0.0012 0.0005 0.0002 0.0001
R E ( HT ) (t ) 1.0000 0.4108 0.1690 0.0696 0.0287 0.0118 0.0049 0.0020 0.0008 0.0003 0.0001
R E ( H ) ( t ) 1.0000 0.4112 0.1692 0.0697 0.0287 0.0118 0.0049 0.0020 0.0008 0.0003 0.0001
R E ( Ln ) 1.0000 0.4105 0.1685 0.0692 0.0284 0.0117 0.0048 0.0020 0.0008 0.0003 0.0001
Table 7.
R E ( SE ) (t )
R E ( HT ) (t ) e 3.676471 3 10 15.44
t 1.1250
R E ( H ) ( t ) e 3.67647110 3 15.44
t 1.1250
R E ( Ln ) (t ) e 3.630931 3 10 15.25
t 1.1242
e 2.38100810 4 1.0
t 1.0311
R E ( SE ) (t )
The above approximate Bayesian estimates yield good estimates of the true reliability function.
Journal of Modern Applied Statistical Methods May, 2005, Vol. 4, No.1, 187-213
Ping Sa
Mathematics and Statistics University of North Florida
A new test of variance for non-normal distribution with fewer restrictions than the current tests is proposed. Simulation study shows that the new test controls the Type I error rate well, and has power performance comparable to the competitors. In addition, it can be used without restrictions. Key words: Edgeworth expansion, Type I error rate, power performance
Introduction Testing the variance is crucial for many real world applications. Frequently, companies are interested in controlling the variation of their products and services because a large variation in a product or service indicates poor quality. Therefore, a desired maximum variance is frequently established for some measurable characteristic of the products of a company. In the past, most of the research in statistics concentrated on the mean, and the variance has drawn less attention. This article is about testing the hypothesis that the variance is 2 equal to a hypothesized value o versus the alternative that the variance is larger than the hypothesized value. This statistical test will be referred to as a right-tailed test in further discussion. The chi-square test is the most commonly used procedure to test a single variance of a population. Once a random sample of size n is taken, the individual values X i , the sample mean X , the sample variance S 2 , and 2 specified ( o ) are used to compute the chisquared test statistic 2 = ( n 1) S 2 / 2 , which 0 is distributed (2n 1) under H 0 . The 2 statistic is used for hypothesis tests concerning 2 when a normal population is assumed. It is well known that the chi-square test statistic is not robust against departures from normality such as when skewness and kurtosis are present. This can lead to rejecting H 0 much more frequently than indicated by the nominal alpha level, where alpha is the probability of rejecting H 0 when
H 0 is true.
Practical alternatives to the 2 test are needed for testing the variance of non-normal distributions. There are nonparametric methods such as bootstrap and jackknife (see Efron & Tibshirani, 1993). The bootstrap requires extensive computer calculations and some programming ability by the practitioner making the method infeasible for some people. Although the jackknife method is easier to implement, it is a linear approximation to the bootstrap method and can give poor results when the statistic estimate is nonlinear. Another alternative is presented in Kendall (1994) and Lee and Sa (1998). The robust chi-square statistic r2 which has the form chi-square of freedom,
Michael C. Long (MA, University of North Florida) is a Research Associate and Statistician in the Department of Health, State of Florida. Email: [email protected]. Ping Sa (Ph. D. University of South Carolina) is a Professor of Mathematics and Statistics at the University of North Florida. Her recent scholarly activities have involved research in multiple comparisons and quality control. Email: [email protected].
187
kurtosis coefficient. The critical value for test rejection is 2, where is the smallest integer, which is greater than or equal to (n-
1) d . Because d is a function of the sample kurtosis coefficient alone, this could create
performance problems for r2 test with skewed distributions. Lee and Sa (1996) derived a new method for a right-tailed variance test of symmetric heavy-tailed distributions using an Edgeworth expansion (see Bickel & Doksum, 1977), and an inversion type of Edgeworth expansion provided by Hall (1983),
P ( ) / ( ) x + 1 ( x 2 1) / 6
= ( x) + o(1 / n ) ,
)
(1)
1 =
where is any statistic, and , ( ) and 1 are the mean, standard deviation and coefficient
of skewness of , respectively. ( x ) is the standard normal distribution function. They considered the variable S 2 / 2 , and the variable admitted the inversion of the Edgeworth expansion above as follows:
1
where k i is the i th sample cumulant. They approximated their decision rule even further using a Taylor series expansion of
S2
K4 2 + 4 n n 1
x + 1 ( x 2 1) / 6
Z 1 = Z a( Z 2 1) + 2a 2 ( Z 3 Z ) > z .
= ( x) + o(1 / n ) ,
(2)
where K 4 = E ( X ) 4 3( E ( X ) 2 ) 2 and
1 =
E (S 2 2 ) 3 (E (S 2 2 ) 2 )
3 2
, the coefficient of
skewness of S 2 , provided all the referred moments exist. The population coefficient of skewness equals K 3 / ( ) = 0 under symmetric and heavy-tailed assumptions, and the population coefficient of kurtosis equals K 4 / 4 > 0, where K i is the i th cumulant (see
2 3
After a simulation study, their study found their test provided a controlled Type I error rate as well as good power performance when sample size is moderate or large (p. 51). Lee and Sa (1998) performed another study on a right-tailed test of variance for skewed distributions. A method similar to the previously proposed study was employed with the primary difference being in the estimated coefficient of skewness, 1 . The population coefficient of skewness, K 3 / ( 2 ) 3 , was assumed zero in the heavy-tailed distribution study and estimated for the skewed distribution study. Their study performed a preliminary
k 4 2( S ) + n n 1
2 2
2 Z > z + 1 ( z 1) / 6 ,
(3)
S2
Z=
2 0
1
, and
k4 n 0
4
2 + n 1
1 3n 1 8n 2 (S 2 ) 3 k 4 S 2 k6 + 2 2 2 (n 1) n 2
(4)
LONG & SA
simulation study for the best form of Z and found
189 n o
S2
Z=
2 0
1 + 2 n 1
k4 nS 0
2 2
( x ) + n 2 p1 ( x ) ( x ) + + n 2 p j ( x ) ( x ) + ,
(5) where ( x ) = (2 ) e Normal density
1 2 x2 2
cumulants of o . From Hall (1992), the Edgeworth expansion for the sample variance is
= ( x ) + n 2 p1 ( x ) ( x ) + + n 2 p j ( x ) ( x ) + ,
(6) where
1 x2 1 p1 = - B1 + B2 , B1 = ( 4 1) 2 , 6
Methodology
( 4 1) 2 ( 6 3 4 6 32 + 2) j = E{( X ) } j ,
B2
=
3
o . If
n o is asymptotically normally n o
and =
E( X )4 4 .
n (see Hall,
x+n
= ( x ) + o ( n 1 2 )
To
test
2 Ho : 2 = o
versus
) may
be
n (S 2 2 )
n (S 2 2 )
to be the Z with controllable Type I error rates as well as good power performance. Hence, the motivation for this study is to develop an improved method for right-tailed tests of variance for non-normal distributions. A test is desired which works for both skewed and heavy-tailed distributions and also has fewer restrictions from assumptions. This test should work well for multiple sample sizes and significance levels. The test proposed uses a general Edgeworth expansion to adjust for the non-normality of the distribution and considers the variable S 2 that admits an inversion of the general Edgeworth expansion. A detailed explanation of the new method is provided in the next section. In the Simulation Study Section, the simulation study is introduced for determining whether the previously proposed tests or the new test has the best true level of significance or power. The results of the simulation are discussed in the section of Simulation Results. Conclusions of the study are rendered at the end.
(x ) =
)x
=
j
1 2
x2 1 B1 + B2 6
(7)
2 + B z 1 , B1 2 6
(8) where z is the upper percentage point of the standard normal distribution,
2 S2 0
Z=
S4 , B1 = k4 + 2 S 4
k + 12k4 S 2 + 4k32 + 8( S 2 )3 B2 = 6 . 3 (k 4 + 2 S 4 ) 2
Simulation Study Details for the simulation study are provided in this section. The study is used to compare Type I error rates and the associated power performance of the different right-tail tests for variance. Distributions Examined Distributions were chosen to achieve a range of skewness (0.58 to 9.49) or kurtosis (-1.00 to 75.1) for comparing the test procedures. The skewed distributions considered in the study included Weibull with scale parameter = 1.0 and shape parameters = 0.5, 0.8, 2.0 (see Kendall, 1994), Lognormal( = 0, = 1) , (see Evans, Hastings, & Peacock, 2000), Gamma with scale parameter 1.0 and shape parameters = 0.15,1.2,4.0 (see Evans, Hastings, & Peacock, 2000), 10 Inverse Guassian distributions with = 1.0 , scale parameters = 0.1 to 25.0 with skewness ranging from 0.6 to 9.49 (see Chhikara & Folks, 1989 and Evans, Hastings, & Peacock, 2000), Exponential with = 1.0 and = 1.0 (see Evans, Hastings, & Peacock, 2000), Chi-square with degrees of freedom ( = 1, 2, 3, 4, 8, 12, 16, 24), and a polynomial function of the standard normal distribution Barnes2 (see Fleishman 1978).
The heavy-tailed distributions considered included Students T ( = 5,6,8,16,32,40), 10 JTB ( , ) distributions with ( = 0, = 1) and various , values including Laplace( =2.0, =1.0) , (see Johnson, Tietjen, & Beckman, 1980), and special designed distributions which are polynomial functions of the standard normal distribution: Barnes1 and Barnes3 having kurtosis 6.0 and 75.1 respectively (see Fleishman 1978). All the heavy-tailed distributions are symmetric with the exception of Barnes3. Barnes3 has skewness of .374 which is negligible in comparison to the kurtosis of 75.1. Therefore, Barnes3 was considered very close to symmetric. Simulation Description Simulations were run using Fortran 90 for Windows on an emachines etower 400i PC computer. All the Type I error and power comparisons for the test procedures used a simulation size of 100,000 in order to reduce experimental noise. Fortran 90 IMSL library was used to generate random numbers from these distributions: Weibull, Lognormal, Gamma, Exponential, Chi-square, Normal and Students T. In addition, the Inverse Gaussian, JTB, Barnes1, Barnes2, and Barnes3 random variates were created with Fortran 90 program subroutines using the IMSL librarys random number generator for normal, gamma, and uniform in various parts of the program. The following tests were compared in the simulation study: 1) 2 = (n 1) S 2 / 2 ; the decision rule is 0
2 Reject H 0 if 2 > n 1, . 2 2) r2 = (n 1)dS 2 0 ; the decision rule is
Reject H 0 if r2 > 2, , where is the smallest integer that is greater than or equal to (n-1) d .
LONG & SA
S2
3) Zs =
2 0
2 S2 0
191
1
from Lee and Sa
Z4=
k4 2 + 2 2 nS 0 n 1
(n 1)k 4 2S 4 + n(n + 1) n + 1
2 S2 0 4 4 k 4 0 2 0 + n 1 nS 4 2 2 S 0
Z5=
Zs a( Zs 2 1) + 2a 2 ( Zs 3 Zs ) > z . S2
4) Zh =
02
k4 n 0
4
1
from Lee and Sa and Z6=
2 n 1
4 2 k 4 0 2 0 + n 1 nS 2
, where
z + n
Z2 =
Z3 =
k 4 2S 4 + n n 1
1 2
2 + B z 1 . B1 2 6
2. Calculate: X , S 2 , k 3 , k 4 , k 6 , 1 , B1 , B 2 .
, 3. Calculate all the test statistics: 2 , r2 , Zs, Zh, Zn, Z2, Z3, Z4, Z5, and Z6. 4. Find the critical value for each test considered.
k is multiplied to each variate. The traditional power studies were performed by multiplying the distribution observations by k to create a new set of observations yielding a variance k times larger than the H 0 value. Steps 1 through 6 above would then be implemented for the desired values of k , sample sizes, and significance levels. The power would then be the proportion of 100,000 rejected for the referenced value of k , sample size, and significance level. This method has been criticized by many researchers since tests with high Type I error rates frequently have high power also. Tests with high Type I error rates usually have fixed lower critical points relative to other tests and therefore reject more easily when the true variance is increased. Hence, these tests tend to have higher power. Some researchers are using a method to correct this problem. With k =1, the critical point for each test under investigation is adjusted till the proportion rejected out of 100,000 is the same as the desired nominal level. The concept is that the tests can be compared better for power afterward since all the tests have critical points adjusted to approximately the same Type I error rate. Once this is accomplished, steps 1 through 7 above are performed for each k under consideration to get a better power comparison between the different tests at that level of k .
that the
LONG & SA
distributions. Although there are still some inflated cases, they are not severe. These results are understandable since the r2 test only adjusts for the kurtosis of the sampled distribution and not the skewness. The Z2 tests Type I error rates reported in Tables 1 and 2 were extremely conservative for most of the skewed distributions. It becomes even more conservative when the coefficient of skewness gets larger. In fact, the Z2 test is so conservative it is rarely inflated for any of the skewed or heavy-tailed distribution cases. Similar to the Z2 test, test Zh performs quite conservatively in all the skewed distributions as well. However, it performs differently under heavy-tailed distributions. The Type I error rates become closer to the nominal level except for one distribution, and there are even a few inflated cases. The exception in the heavy-tailed distributions is the Barnes3. In this case, test Zh is extremely conservative for all the nominal levels. Under the skewed distribution, the Zs test performs well for the sample size 40 and the nominal level 0.05. However, the Type I error rates become more or less uncontrollable when either the alpha level gets small or the sample size is reduced. These results confirmed the recommendations of Lee and Sa (1998) that Zs is more suitable for moderate to large sample sizes and alpha levels not too small. Although Zs was specifically designed for the skewed distributions, it actually works reasonably well for the heavy-tailed distributions as long as the sample size and/or the alpha level are not too small. Generally speaking, the proposed test Z6 controls Type I error rates the best in both the skewed distribution cases and the heavytailed distribution cases. Only under some skewed distributions with both small alpha and small sample size were there a few inflated Type I error rates. However, the rates of inflation are at much more acceptable level than some others. Power Comparison Results One of the objectives of the study is to find one test for non-normal distributions with an improved Type I error rate and power over earlier tests. It was suspected that tests with very
193
conservative Type I error rates might have lower power than other tests since it is harder to reject with these tests. Because tests Zh and Z2 were extremely conservative for the skewed distributions, exploratory power simulations were run on a couple of mildly skewed distributions with Zs, Zh, Z2, and Z6 to further decrease the potential tests. The preliminary power comparisons confirmed our suspicion. Both Zh and Z2 have extremely low power even when k is as large as 6.0. Therefore, Z2 will not be looked at further since Z6 is the better performer of the new tests. Also, the Zh tests power is unacceptable, but it will still be compared for the heavy-tailed distributions since that is what it was originally designed for. The results of the preliminary power study are reported in Long and Sa (2003). Tables 5 and 6 provide the partial results from the new type of power comparisons, and Tables 7 and 8 consist of some results from the traditional type of power study. Based on the complete power study in Long and Sa (2003), the following expected similarities can be found for the power performance of the tests between the skewed and heavy-tailed distributions regardless of the type of power study. When the sample size decreases from 40 to 20, the power 2 decreases. As the k in k 0 increases, the power increases. When the significance level decreases from 0.10 to 0.01, the power decreases more than the decrease experienced with the sample size decrease. As the skewness of the skewed distribution decreases, the power increases. As the kurtosis of the heavy-tailed distribution decreases, the power increases overall with a slight decrease from the T(5) distribution to the Laplace distribution. The primary difference overall between the skewed and heavy-tailed distributions is that the power is better for the heavy-tailed distributions when comparing the same sample size, significance level, and k . In fact, the power increases more quickly over the levels of k for the heavy-tailed distributions versus the skewed distributions, with a more noticeable difference at the higher levels of kurtosis and skewness respectively. Some specific observations are summarized as follows:
LONG & SA
195
Table 1. Comparison of Type I Error Rates when n=40, Skewed Distributions Distribution
=0.01 ______________________
2
____
, r2 Zs Zh Z2 Z6 (skewness) _____________________________________________________________________________ IG (1.0,0.1) (9.49) .1616 .0259 .0004 .0001 .0121 .0429 .0250 .0003 .0000 .0110 .0237 .0003 .0000 .0100
Weibull(1.0,0.5) (6.62) LN(0,1) (6.18) IG(1.0,0.25) (6.00) Gamma(1.0,0.15) (5.16) IG(1.0,0.5) (4.24) Chi(1) (2.83) Exp(1.0) (2.00) Chi(2) (2.00) Barnes2 (1.75) IG(1.0,25.0) (0.60)
.1522 .0198 .0012 .0001 .0090 .0349 .0188 .0011 .0001 .0082 .0177 .0010 .0001 .0074 .1325 .0274 .1671 .0349 .0156 .0012 .0001 .0073 .0148 .0011 .0001 .0065 .0141 .0009 .0000 .0057 .0192 .0014 .0002 .0093 .0179 .0013 .0001 .0082 .0168 .0011 .0001 .0074
.1704 .0166 .0025 .0003 .0092 .0322 .0154 .0024 .0003 .0081 .0144 .0022 .0003 .0073 .1538 .0271 .1282 .0194 .0135 .0032 .0005 .0077 .0126 .0029 .0004 .0069 .0117 .0028 .0004 .0061 .0113 .0073 .0019 .0094 .0102 .0069 .0017 .0085 .0094 .0065 .0015 .0077
.0949 .0119 .0115 .0045 .0116 .0159 .0110 .0109 .0041 .0104 .0100 .0103 .0037 .0097 .0922 .0150 .0114 .0114 .0045 .0109 .0103 .0107 .0041 .0099 .0095 .0100 .0038 .0091
.0716 .0141 .0154 .0079 .0150 .0127 .0127 .0146 .0072 .0137 .0116 .0138 .0065 .0124 .0217 .0092 .0102 .0113 .0089 .0107 .0090 .0104 .0078 .0095 .0081 .0093 .0067 .0084
NOTE: Entries are the estimated proportion of samples rejected in 100,000 simulated samples for Zs, Zh, Z2, and Z6 test using z , ( z + t , n1 ) / 2 , and t , n1 critical points (first, second, and third numbers in column Zs, Zh, Z2, and Z6) and chi-square and robust chi-square test (first and second) on the column 2 , r2 .
Table 1 (continued). Comparison of Type I Error Rates when n=40, Skewed Distributions Distribution
=0.05 _______________
____
2 , r2 Zs Zh Z2 Z6 (skewness) ______________________________________________________________________________ IG (1.0,0.1) .1859 .0532 .0015 .0007 .0448 (9.49) .0761 .0520 .0015 .0007 .0433 .0509 .0014 .0006 .0419
Weibull(1.0,0.5) (6.62) LN(0,1) (6.18) IG(1.0,0.25) (6.00) Gamma(1.0,0.15) (5.16) IG(1.0,0.5) (4.24) Chi(1) (2.83) Exp(1.0) (2.00) Chi(2) (2.00) Barnes2 (1.75) IG(1.0,25.0) (0.60) .1899 .0467 .0037 .0017 .0402 .0683 .0454 .0035 .0016 .0387 .0442 .0033 .0015 .0372 .1701 .0610 .1992 .0719 .0415 .0043 .0022 .0362 .0404 .0040 .0021 .0347 .0392 .0039 .0019 .0331 .0479 .0446 .0022 .0417 .0467 .0437 .0019 .0401 .0454 .0418 .0017 .0385
.2148 .0486 .0078 .0043 .0430 .0743 .0469 .0075 .0039 .0412 .0454 .0072 .0035 .0397 .1994 .0672 .1906 .0622 .0442 .0094 .0050 .0395 .0423 .0090 .0046 .0378 .0408 .0087 .0043 .0360 .0439 .0203 .0136 .0431 .0421 .0197 .0130 .0416 .0406 .0191 .0124 .0397
.1583 .0441 .0299 .0229 .0460 .0559 .0424 .0289 .0218 .0442 .0408 .0279 .0209 .0425 .1557 .0545 .0430 .0293 .0226 .0453 .0414 .0285 .0214 .0434 .0399 .0278 .0204 .0415
.1414 .0485 .0388 .0340 .0531 .0549 .0466 .0376 .0324 .0511 .0451 .0364 .0309 .0493 .0732 .0442 .0429 .0407 .0429 .0498 .0413 .0390 .0410 .0477 .0397 .0376 .0389 .0454
NOTE: Entries are the estimated proportion of samples rejected in 100,000 simulated samples for Zs, Zh, Z2, and Z6 test using z , ( z + t , n1 ) / 2 , and t , n1 critical points (first, second, and third numbers in column Zs, Zh, Z2, and Z6) and chi-square and robust chi-square test (first and second) on the column 2 , r2 .
Table 2. Comparison of Type I Error Rates when n = 20, Skewed Distributions (α = 0.01)
Columns: χ², χ²r | Zs | Zh | Z2 | Z6; rows: distribution (skewness)
2 , r2 Zs Zh Z2 Z6 (skewness) ______________________________________________________________________________ IG (1.0,0.1) .1215 .0342 .0003 .0003 .0149 (9.49) .0443 .0321 .0003 .0003 .0122 .0302 .0003 .0002 .0104
Weibull(1.0,0.5) (6.62) LN(0,1) (6.18) IG(1.0,0.25) (6.00) Gamma(1.0,0.15) (5.16) IG(1.0,0.5) (4.24) Chi(1) (2.83) Exp(1.0) (2.00) Chi(2) (2.00) Barnes2 (1.75) IG(1.0,25.0) (0.60) .1227 .0294 .0012 .0012 .0139 .0386 .0270 .0011 .0011 .0115 .0249 .0009 .0009 .0098 .1082 .0316 .1295 .0406 .0246 .0013 .0014 .0119 .0226 .0012 .0012 .0100 .0209 .0010 .0011 .0083 .0307 .0015 .0015 .0142 .0281 .0014 .0014 .0120 .0258 .0013 .0012 .0098
.1408 .0296 .0024 .0025 .0152 .0396 .0269 .0021 .0021 .0128 .0243 .0019 .0018 .0108 .1272 .0336 .1096 .0265 .0258 .0029 .0030 .0141 .0231 .0024 .0026 .0119 .0208 .0022 .0023 .0102 .0228 .0067 .0079 .0185 .0201 .0059 .0070 .0161 .0176 .0051 .0061 .0139
.0810 .0203 .0092 .0107 .0191 .0202 .0175 .0079 .0093 .0165 .0153 .0067 .0080 .0144 .0825 .0205 .0206 .0095 .0111 .0196 .0180 .0083 .0097 .0168 .0156 .0071 .0082 .0145
.0680 .0228 .0127 .0159 .0238 .0192 .0198 .0112 .0137 .0206 .0171 .0097 .0119 .0180 .0213 .0095 .0134 .0105 .0098 .0120 .0113 .0087 .0079 .0095 .0095 .0072 .0064 .0076
NOTE: Entries are the estimated proportion of samples rejected in 100,000 simulated samples for the Zs, Zh, Z2, and Z6 tests using the z_α, (z_α + t_{α,n−1})/2, and t_{α,n−1} critical points (first, second, and third numbers in the Zs, Zh, Z2, and Z6 columns) and for the chi-square and robust chi-square tests (first and second numbers in the χ², χ²r column).
Table 2 (continued). Comparison of Type I Error Rates when n = 20, Skewed Distributions (α = 0.05)
Columns: χ², χ²r | Zs | Zh | Z2 | Z6; rows: distribution (skewness)
, r2 Zs Zh Z2 Z6 (skewness) ______________________________________________________________________________ IG (1.0,0.1) (9.49) .1451 .0566 .0014 .0015 .0459 .0736 .0547 .0013 .0014 .0430 .0530 .0011 .0012 .0399 .1538 .0534 .0033 .0039 .0444 .0706 .0514 .0031 .0035 .0412 .0493 .0028 .0031 .0385 .1377 .0603 .0471 .0482 .0057 .0397 .0451 .0435 .0051 .0369 .0431 .0406 .0046 .0343
Weibull(1.0,0.5) (6.62) LN(0,1) (6.18) IG(1.0,0.25) (6.00) Gamma(1.0,0.15) (5.16) IG(1.0,0.5) (4.24) Chi(1) (2.83) Exp(1.0) (2.00) Chi(2) (2.00) Barnes2 (1.75) IG(1.0,25.0) (0.60)
.1652 .0579 .0046 .0053 .0473 .0760 .0552 .0041 .0047 .0437 .0528 .0038 .0043 .0407 .1805 .0604 .0073 .0079 .0505 .0575 .0568 .0069 .0072 .0471 .0549 .0064 .0064 .0438 .1686 .0560 .0089 .0104 .0484 .0725 .0535 .0083 .0095 .0446 .0509 .0077 .0087 .0416 .1635 .0545 .0176 .0215 .0523 .0669 .0515 .0165 .0200 .0486 .0484 .0155 .0186 .0455 .1394 .0529 .0260 .0313 .0544 .0604 .0496 .0241 .0291 .0506 .0468 .0226 .0272 .0473 .1406 .0543 .0264 .0317 .0565 .0605 .0511 .0245 .0293 .0524 .0482 .0229 .0273 .0489 .1307 .0560 .0342 .0416 .0617 .0587 .0530 .0321 .0389 .0577 .0499 .0302 .0364 .0542 .0687 .0449 .0377 .0433 .0507 .0437 .0419 .0349 .0398 .0464 .0388 .0322 .0365 .0424
NOTE: Entries are the estimated proportion of samples rejected in 100,000 simulated samples for the Zs, Zh, Z2, and Z6 tests using the z_α, (z_α + t_{α,n−1})/2, and t_{α,n−1} critical points (first, second, and third numbers in the Zs, Zh, Z2, and Z6 columns) and for the chi-square and robust chi-square tests (first and second numbers in the χ², χ²r column).
Table 3. Comparison of Type I Error Rates when n = 40, Heavy-tailed Distributions (α = 0.01)
Columns: χ², χ²r | Zs | Zh | Z2 | Z6; rows: distribution (kurtosis)
2 , r2 Zs Zh Z2 Z6 (kurtosis) ______________________________________________________________________________ Barnes3 .1269 .0167 .0001 .0000 .0060 (75.1) .0280 .0158 .0001 .0000 .0052 .0151 .0001 .0000 .0047
T(5) (6.00) Barnes1 (6.00) T(6) (3.00) Laplace(2.0,1.0) (3.00) JTB(4.0,1.0) (0.78) T(16) (0.50) JTB(1.25,0.5) (0.24) T(32) (0.21) JTB(2.0,0.5) (-0.30) .0629 .0075 .0084 .0027 .0058 .0111 .0066 .0079 .0024 .0050 .0059 .0074 .0021 .0045 .1081 .0118 .0126 .0021 .0089 .0188 .0105 .0119 .0019 .0078 .0093 .0111 .0017 .0068 .0526 .0103 .0085 .0108 .0044 .0075 .0076 .0100 .0040 .0067 .0067 .0092 .0034 .0059
.0608 .0099 .0138 .0043 .0092 .0124 .0089 .0130 .0038 .0081 .0080 .0120 .0034 .0072 .0246 .0103 .0127 .0082 .0106 .0098 .0092 .0118 .0074 .0095 .0084 .0109 .0067 .0084 .0198 .0103 .0118 .0088 .0104 .0095 .0092 .0107 .0079 .0092 .0083 .0098 .0070 .0083 .0134 .0102 .0112 .0097 .0108 .0089 .0091 .0101 .0086 .0095 .0081 .0090 .0075 .0083 .0139 .0091 .0100 .0084 .0093 .0083 .0084 .0093 .0075 .0083 .0076 .0085 .0067 .0074 .0061 .0064 .0068 .0060 .0061 .0055 .0056 .0059 .0051 .0052 .0049 .0052 .0043 .0044
NOTE: Entries are the estimated proportion of samples rejected in 100,000 simulated samples for the Zs, Zh, Z2, and Z6 tests using the z_α, (z_α + t_{α,n−1})/2, and t_{α,n−1} critical points (first, second, and third numbers in the Zs, Zh, Z2, and Z6 columns) and for the chi-square and robust chi-square tests (first and second numbers in the χ², χ²r column).
Table 3 (continued). Comparison of Type I Error Rates when n = 40, Heavy-tailed Distributions (α = 0.05)
Columns: χ², χ²r | Zs | Zh | Z2 | Z6; rows: distribution (kurtosis)
, r2 Zs Zh Z2 Z6 (kurtosis) ______________________________________________________________________________ Barnes3 (75.1) T(5) (6.00) Barnes1 (6.00) T(6) (3.00) Laplace(2.0,1.0) (3.00) JTB(4.0,1.0) (0.78) T(16) (0.50) JTB(1.25,0.5) (0.24) T(32) (0.21) JTB(2.0,0.5) (-0.30) .1554 .0390 .0011 .0003 .0315 .0590 .0380 .0011 .0002 .0302 .0371 .0010 .0002 .0290 .1184 .0362 .0262 .0198 .0369 .0456 .0348 .0254 .0188 .0352 .0332 .0247 .0178 .0335 .1786 .0492 .0327 .0201 .0484 .0655 .0472 .0317 .0190 .0462 .0453 .0308 .0179 .0444 .1054 .0449 .0376 .0310 .0257 .0400 .0360 .0300 .0243 .0381 .0345 .0290 .0231 .0363
.1263 .0417 .0359 .0268 .0449 .0500 .0400 .0349 .0254 .0431 .0385 .0338 .0241 .0413 .0770 .0447 .0428 .0429 .0506 .0464 .0429 .0410 .0409 .0487 .0414 .0396 .0391 .0466 .0683 .0436 .0419 .0438 .0498 .0448 .0419 .0402 .0420 .0479 .0402 .0388 .0401 .0457 .0577 .0445 .0431 .0481 .0515 .0441 .0428 .0414 .0459 .0493 .0411 .0400 .0442 .0474 .0591 .0444 .0434 .0471 .0510 .0444 .0425 .0419 .0448 .0489 .0407 .0402 .0430 .0467 .0381 .0344 .0355 .0396 .0405 .0348 .0327 .0338 .0377 .0385 .0312 .0323 .0359 .0366
NOTE: Entries are the estimated proportion of samples rejected in 100,000 simulated samples for the Zs, Zh, Z2, and Z6 tests using the z_α, (z_α + t_{α,n−1})/2, and t_{α,n−1} critical points (first, second, and third numbers in the Zs, Zh, Z2, and Z6 columns) and for the chi-square and robust chi-square tests (first and second numbers in the χ², χ²r column).
Table 4. Comparison of Type I Error Rates when n = 20, Heavy-tailed Distributions (α = 0.01)
Columns: χ², χ²r | Zs | Zh | Z2 | Z6; rows: distribution (kurtosis)
, r2 Zs Zh Z2 Z6 (kurtosis) ______________________________________________________________________________ Barnes3 (75.1) T(5) (6.00) Barnes1 (6.00) T(6) (3.00) Laplace(2.0,1.0) (3.00) JTB(4.0,1.0) (0.78) T(16) (0.50) JTB(1.25,0.5) (0.24) T(32) (0.21) JTB(2.0,0.5) (-0.30) .0964 .0241 .0001 .0001 .0076 .0290 .0221 .0001 .0001 .0062 .0207 .0001 .0001 .0049 .0543 .0151 .0072 .0056 .0100 .0147 .0125 .0060 .0046 .0082 .0107 .0052 .0037 .0063 .0590 .0205 .0084 .0059 .0136 .0225 .0178 .0072 .0048 .0111 .0153 .0062 .0039 .0092 .0461 .0131 .0146 .0088 .0070 .0110 .0122 .0075 .0055 .0088 .0104 .0062 .0044 .0070
.0053 .0165 .0105 .0083 .0139 .0153 .0138 .0089 .0068 .0113 .0117 .0077 .0055 .0092 .0238 .0143 .0115 .0100 .0126 .0107 .0118 .0096 .0079 .0098 .0098 .0081 .0061 .0076 .0184 .0128 .0104 .0092 .0108 .0093 .0106 .0086 .0073 .0084 .0089 .0072 .0058 .0066 .0138 .0138 .0120 .0104 .0115 .0094 .0114 .0099 .0079 .0087 .0096 .0081 .0062 .0069 .0134 .0121 .0103 .0087 .0101 .0079 .0099 .0084 .0066 .0076 .0079 .0066 .0050 .0056 .0059 .0091 .0075 .0054 .0057 .0051 .0076 .0059 .0038 .0040 .0061 .0046 .0026 .0028
NOTE: Entries are the estimated proportion of samples rejected in 100,000 simulated samples for the Zs, Zh, Z2, and Z6 tests using the z_α, (z_α + t_{α,n−1})/2, and t_{α,n−1} critical points (first, second, and third numbers in the Zs, Zh, Z2, and Z6 columns) and for the chi-square and robust chi-square tests (first and second numbers in the χ², χ²r column).
Table 4 (continued). Comparison of Type I Error Rates when n = 20, Heavy-tailed Distributions (α = 0.05)
Columns: χ², χ²r | Zs | Zh | Z2 | Z6; rows: distribution (kurtosis)
2 , r2 Zs Zh Z2 Z6 (kurtosis) ______________________________________________________________________________ Barnes3 .1184 .0430 .0009 .0007 .0319 (75.1) .0544 .0414 .0008 .0005 .0294 .0397 .0007 .0005 .0268
T(5) (6.00) Barnes1 (6.00) T(6) (3.00) Laplace(2.0,1.0) (3.00) JTB(4.0,1.0) (0.78) T(16) (0.50) JTB(1.25,0.5) (0.24) T(32) (0.21) JTB(2.0,0.5) (-0.30) .1034 .0439 .0233 .0249 .0440 .0489 .0409 .0215 .0225 .0398 .0383 .0199 .0206 .0362 .1509 .0570 .0244 .0243 .0544 .0674 .0537 .0225 .0220 .0496 .0502 .0206 .0201 .0456 .0968 .0482 .0449 .0283 .0228 .0469 .0417 .0260 .0279 .0428 .0388 .0240 .0254 .0395
.1166 .0493 .0303 .0324 .0516 .0537 .0458 .0281 .0298 .0475 .0427 .0261 .0271 .0439 .0742 .0468 .0386 .0436 .0520 .0463 .0434 .0361 .0400 .0479 .0404 .0335 .0367 .0443 .0658 .0440 .0381 .0430 .0494 .0429 .0408 .0350 .0391 .0454 .0377 .0324 .0355 .0415 .0587 .0457 .0417 .0483 .0529 .0434 .0420 .0387 .0439 .0483 .0391 .0357 .0401 .0441 .0583 .0447 .0406 .0462 .0512 .0430 .0415 .0375 .0421 .0468 .0382 .0344 .0382 .0423 .0387 .0359 .0350 .0394 .0410 .0338 .0325 .0320 .0350 .0364 .0298 .0291 .0313 .0325
NOTE: Entries are the estimated proportion of samples rejected in 100,000 simulated samples for the Zs, Zh, Z2, and Z6 tests using the z_α, (z_α + t_{α,n−1})/2, and t_{α,n−1} critical points (first, second, and third numbers in the Zs, Zh, Z2, and Z6 columns) and for the chi-square and robust chi-square tests (first and second numbers in the χ², χ²r column).
Table 5. New Power Comparisons for Skewed Distributions, Upper-Tailed Rejection Region when σx² = kσ0², significance level 0.100, n = 40
Columns, for each k: χ², χ²r | Zs | Z6; rows: distribution (skewness)
Weibull(1.0,0.5) .101 .102 .101 .280 .315 .315 (6.62) .098 .099 .098 .303 .309 .308 Gamma(1.0,0.15) (5.16) IG(1.0,0.6) (3.87) Chi(2) (2.00) .099 .100 .100 .100 .098 .098 .098 .098 .099 .099 .098 .098 .100 .101 .099 .102 .098 .100 .318 .340 .382 .432 .634 .697 .339 .344 .340 .345 .439 .441 .437 .447 .703 .704 .703 .708
k = 1.0
k = 2. 0
________________ 2 , r2__________ Zs Z6 .439 .499 .501 .485 .494 .493 .490 .523 .528 .523 .524 .528 .612 .695 .698 .685 .694 .703 .903 .940 .940 .937 .940 .941
k = 3.0
n = 40 (continued) ________________ 2 , r2 Zs Z6 (skewness) ________________________________________ Weibull(1.0,0.5) .563 .634 .636 (6.62) .619 .629 .629 Gamma(1.0,0.15) (5.16) IG(1.0,0.6) (3.87) Chi(2) (2.00) .611 .648 .762 .828 .975 .987 .648 .653 .649 .654 .837 .839 .836 .842 .987 .988 .987 .988 Distribution
k = 4. 0
________________ ________________ 2 , r2 Zs Z6 2 , r2 Zs __________ Z6 .623 .729 .731 .725 .797 .799 .715 .725 .725 .784 .793 .794 .697 .731 .852 .906 .993 .997 .731 .736 .732 .737 .912 .914 .912 .916 .997 .997 .997 .997 .763 .793 .906 .946 .998 .999 .794 .798 .794 .799 .950 .951 .950 .952 .999 .999 .999 .999
k = 5.0
k = 6. 0
NOTE: Entries are the estimated proportion of samples rejected in 100,000 simulated samples for the Zs and Z6 tests using the z_α and (z_α + t_{α,n−1})/2 critical points (first and second numbers in the Zs and Z6 columns) and for the chi-square and robust chi-square tests (first and second numbers in the χ², χ²r column).
Table 5 (continued). New Power Comparisons for Skewed Distributions, Upper-Tailed Rejection Region when σx² = kσ0², significance level 0.100, n = 20
k = 1.0
_______ ___
k = 2. 0
________ ___
k = 3.0
Z6 ______ ____ .343 .382 .385 .374 .380 .384 .375 .389 .459 .511 .729 .777 .394 .395 .519 .520 .393 .394 .531 .528
(skewness) 2 , r2 Zs Z6 2 , r2 Zs Z6 ________________________________________________________ Weibull(1.0,0.5) .100 .101 .102 .231 .253 .255 (6.62) .101 .100 .101 .248 .251 .254 Gamma(1.0,0.15) (5.16) IG(1.0,0.6) (3.87) Chi(2) (2.00) .100 .100 .099 .098 .099 .099 .101 .101 .098 .098 .102 .100 .100 .100 .101 .100 .102 .098 .254 .263 .295 .325 .469 .514 .266 .265 .267 .266 .331 .340 .332 .337 .525 .527 .521 .519
2 , r2 Zs
________________
n = 20 (continued) Distribution
k = 4.0
______ ___
k = 5.0
________ ___
k = 6.0
________________
(skewness) 2 , r2 Zs Z6 2 , r2 Zs Z6 2 , r2 Zs Z6 _________________________________________________________ __________ Weibull(1.0,0.5) .432 .481 .484 .502 .557 .560 .570 .627 .631 (6.62) .471 .478 .483 .546 .554 .559 .616 .625 .629 Gamma(1.0,0.15) (5.16) IG(1.0,0.6) (3.87) Chi(2) (2.00) .465 .483 .586 .648 .862 .898 .488 .490 .657 .658 .903 .901 .487 .488 .667 .665 .904 .900 .532 .557 .551 .558 .676 .748 .739 .748 .925 .952 .949 .951 .556 .557 .757 .755 .952 .950 .585 .606 .742 .802 .959 .974 .611 .612 .811 .811 .975 .975 .610 .610 .818 .816 .976 .975
NOTE: Entries are the estimated proportion of samples rejected in 100,000 simulated samples for the Zs and Z6 tests using the z_α and (z_α + t_{α,n−1})/2 critical points (first and second numbers in the Zs and Z6 columns) and for the chi-square and robust chi-square tests (first and second numbers in the χ², χ²r column).
Table 6. New Power Comparisons for Heavy-tailed Distributions, Upper-Tailed Rejection Region when σx² = kσ0², significance level 0.100, n = 40
k = 1.0
_______ ___
k = 2. 0
________ ___
k = 3.0
___________________
(kurtosis) 2 , r2 Zs Zh Z6 2 , r2 Zs Zh Z6 2 , r2 Zs Zh Z6 ________________________________________________________________ _______ Barnes3 .101 .102 .100 .099 .266 .413 .460 .418 .457 .904 .934 .913 (75.1) .099 .098 .098 .097 .381 .405 .457 .416 .874 .898 .933 .912 T(5) (6.00) Laplace(2,1) (3.00) T(8) (1.50) .099 .099 .101 .100 .102 .101 .101 .102 .097 .099 .099 .101 .099 .099 .101 .101 .101 .101 .102 .101 .099 .102 .098 .102 .775 .841 .853 .844 .840 .842 .856 .846 .766 .801 .797 .801 .798 .801 .821 .801 .845 .902 .903 .905 .901 .904 .903 .905 n = 40 (continued) Distribution .978 .989 .991 .990 .989 .990 .991 .990 .968 .978 .976 .979 .978 .979 .980 .979 .995 .997 .997 .997 .996 .997 .997 .997
k = 4. 0
__________________
k = 5.0
___________________
k = 6. 0
___________________
2 , r2 Zs Zh Z6 2 , r2 Zs Zh Z6 (kurtosis) ________________________________________ Barnes3 .737 .998 .999 .999 .963 1.00 1.00 1.00 (75.1) .997 .998 .999 .998 1.00 1.00 1.00 1.00
T(5) (6.00) Laplace(2,1) (3.00) T(8) (1.50) 1.00 .999 .999 .999 .999 .999 .999 .999 .995 .997 .996 .997 .997 .997 .998 .997 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 .999 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
2 , r2 Zs Zh
Z6
.997 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
_______
NOTE: Entries are the estimated proportion of samples rejected in 100,000 simulated samples for the Zs, Zh, and Z6 tests using the z_α and (z_α + t_{α,n−1})/2 critical points (first and second numbers in the Zs, Zh, and Z6 columns) and for the chi-square and robust chi-square tests (first and second numbers in the χ², χ²r column).
Table 6 (continued). New Power Comparisons for Heavy-tailed Distributions, Upper-Tailed Rejection Region when σx² = kσ0², significance level 0.100, n = 20
k = 1.0
_______ ___
k = 2. 0
________ ___
k = 3.0
___________________
2 , r2 Zs Zh Z6 2 , r2 Zs Zh Z6 (kurtosis) _______________________________________ Barnes3 .100 .099 .099 .100 .217 .302 .323 .314 (75.1) .101 .101 .099 .098 .290 .306 .331 .309
T(5) (6.00) Laplace(2,1) (3.00) T(8) (1.50) .102 .102 .101 .101 .100 .102 .101 .102 .099 .099 .101 .098 .099 .102 .101 .099 .102 .100 .100 .100 .101 .102 .098 .098 .584 .646 .662 .648 .637 .648 .662 .652 .565 .601 .613 .598 .560 .608 .604 .598 .691 .714 .715 .714 .714 .716 .714 .711 n = 20 (continued) Distribution
2 , r2 Zs Zh Z6
_______ .355 .733 .778 .763 .714 .739 .774 .755 .868 .907 .914 .908 .900 .908 .914 .907
.834 .861 .863 .859 .860 .864 .862 .858 .931 .940 .938 .940 .940 .941 .936 .939
k = 4. 0
k = 5.0
Z6
k = 6. 0
2 , r2 Zs Zh Z6 2 , r2 Zs Zh (kurtosis) ________________________________________
Barnes3 (75.1) T(5) (6.00) Laplace(2,1) (3.00) T(8) (1.50)
2 , r2 Zs Zh
Z6
_______
.656 .958 .967 .966 .899 .993 .996 .995 .854 .960 .973 .964 .992 .993 .996 .994 .960 .973 .976 .974 .986 .992 .992 .992 .972 .974 .976 .975 .992 .992 .992 .993 .936 .950 .951 .949 .973 .980 .978 .980 .950 .951 .950 .949 .980 .981 .986 .980 .984 .986 .984 .986 .996 .997 .996 .997 .986 .986 .984 .986 .997 .997 .996 .997
.975 .999 .999 .999 .998 .999 .999 .999 .995 .997 .997 .997 .997 .997 .997 .997 .988 .992 .990 .991 .992 .992 .992 .991 .999 .999 .999 .999 .999 .999 .999 .999
NOTE: Entries are the estimated proportion of samples rejected in 100,000 simulated samples for the Zs, Zh, and Z6 tests using the z_α and (z_α + t_{α,n−1})/2 critical points (first and second numbers in the Zs, Zh, and Z6 columns) and for the chi-square and robust chi-square tests (first and second numbers in the χ², χ²r column).
Table 7. Traditional Power Comparisons for Skewed Distributions, Upper-Tailed Rejection Region when σx² = kσ0², significance level 0.100, n = 40
k = 1.0
________ ___
k = 2. 0
___ ____ __
k = 3.0
________________
2 , r2 Zs Z6 2 , r2 Zs Z6 (skewness) __________________________________________ Weibull(1.0,0.5) .207 .078 .078 .464 .270 .272 (6.62) .100 .077 .077 .307 .267 .269
Gamma(1.0,0.15) (5.16) IG(1.0,0.6) (3.87) Chi(2) (2.00) .245 .114 .229 .104 .201 .096 .088 .089 .087 .087 .081 .083 .079 .081 .085 .092 .083 .090 .529 .318 .322 .361 .315 .318 .600 .403 .409 .440 .399 .406 .789 .680 .695 .698 .676 .692 n = 40 (continued) Distribution
2 , r2 Zs
.638 .488 .694 .542 .805 .696 .959 .936 .448 .446 .500 .497 .666 .663 .930 .929
k = 4.0
________ ___
k = 5.0
___ ____ __
k = 6.0
________________
2 , r2 Zs Z6 2 , r2 Zs Z6 (skewness) __________________________________________ Weibull(1.0,0.5) .749 .585 .589 .822 .687 .691 (6.62) .622 .582 .586 .717 .684 .688
Gamma(1.0,0.15) (5.16) IG(1.0,0.6) (3.87) Chi(2) (2.00) .786 .664 .902 .837 .992 .987 .628 .626 .818 .816 .986 .985 .631 .628 .823 .821 .987 .987 .846 .746 .948 .910 .998 .997 .715 .718 .713 .715 .899 .903 .898 .901 .997 .997 .997 .997
2 , r2 Zs
.870 .788 .883 .802 .971 .949 1.00 .999 .762 .762 .776 .774 .942 .941 .999 .999
NOTE: Entries are the estimated proportion of samples rejected in 100,000 simulated samples for the Zs and Z6 tests using the z_α and (z_α + t_{α,n−1})/2 critical points (first and second numbers in the Zs and Z6 columns) and for the chi-square and robust chi-square tests (first and second numbers in the χ², χ²r column).
k = 1.0
________ ___
k = 2. 0
___ ____ __
k = 3.0
________________
(skewness) 2 , r2 Zs Z6 2 , r2 Zs Z6 2 , r2 Zs Z6 ______________________________________________________________________________ Weibull(1.0,0.5) .173 .080 .080 .354 .218 .220 .482 .336 .340 (6.62) .097 .078 .078 .245 .214 .215 .364 .332 .334 Gamma(1.0,0.15) (5.16) IG(1.0,0.6) (3.87) Chi(2) (2.00) .206 .112 .197 .106 .183 .103 .092 .090 .089 .086 .093 .090 .093 .090 .091 .088 .103 .099 .402 .282 .457 .335 .613 .518 .252 .248 .310 .304 .503 .496 .254 .249 .317 .310 .523 .515 .533 .408 .628 .519 .833 .780 .377 .372 .495 .489 .770 .765 .380 .374 .504 .497 .785 .779
n = 20 (continued) Distribution
k = 4. 0
________________
k = 5.0
________________
k = 6. 0
________________
2 , r2 Zs Z6 2 , r2 Zs Z6 2 , r2 Zs Z6 (skewness) ______________________________________________________________________________ Weibull(1.0,0.5) .578 .439 .443 .646 .516 .521 .699 .577 .582 (6.62) .466 .433 .437 .541 .511 .514 .601 .572 .576
Gamma(1.0,0.15) (5.16) IG(1.0,0.6) (3.87) Chi(2) (2.00) .615 .502 .741 .653 .924 .898 .471 .466 .634 .629 .893 .890 .473 .467 .643 .637 .901 .897 .677 .574 .816 .747 .964 .951 .546 .541 .731 .727 .949 .947 .548 .542 .739 .734 .953 .951 .722 .627 .863 .810 .982 .974 .601 .597 .797 .794 .973 .972 .604 .598 .805 .800 .976 .975
NOTE: Entries are the estimated proportion of samples rejected in 100,000 simulated samples for the Zs and Z6 tests using the z_α and (z_α + t_{α,n−1})/2 critical points (first and second numbers in the Zs and Z6 columns) and for the chi-square and robust chi-square tests (first and second numbers in the χ², χ²r column).
Table 8. Traditional Power Comparisons for Heavy-tailed Distributions, Upper-Tailed Rejection Region when σx² = kσ0², significance level 0.100, n = 40
k = 1.0
___________________
k = 2. 0
___________________
k = 3.0
___________________
2 , r2 Zs Zh Z6 2 , r2 Zs Zh Z6 2 , r2 Zs Zh Z6 (kurtosis) ______________________________________________________________________________ Barnes3 .171 .066 .005 .065 .432 .312 .116 .317 .840 .827 .666 .846 (75.1) .088 .065 .004 .064 .344 .308 .113 .312 .836 .824 .659 .842
T(5) (6.00) Laplace(2,1) (3.00) T(8) (1.50) .159 .077 .053 .085 .086 .076 .052 .083 .178 .087 .067 .097 .094 .085 .066 .095 .141 .086 .073 .097 .090 .084 .071 .095 .863 .814 .768 .830 .820 .811 .765 .827 .857 .784 .736 .799 .793 .781 .733 .795 .916 .889 .873 .901 .891 .887 .871 .899 n = 40 (continued) Distribution .990 .985 .972 .987 .986 .985 .971 .987 .954 .973 .958 .976 .975 .973 .958 .975 .997 .995 .993 .996 .995 .995 .993 .996
k = 4. 0
_______ ___
k = 5.0
_______ ___
k = 6. 0
___________________
2 , r2 Zs Zh Z6 2 , r2 Zs Zh Z6 (kurtosis) ________________________________________________ Barnes3 .994 .994 .871 .995 1.00 1.00 .894 1.00 (75.1) .994 .993 .867 .995 1.00 1.00 .891 1.00
T(5) (6.00) Laplace(2,1) (3.00) T(8) (1.50) .999 .999 .992 .999 .999 .999 .992 .999 .998 .997 .992 .997 .997 .997 .992 .997 1.00 1.00 .999 1.00 1.00 1.00 .999 1.00 1.00 1.00 .995 1.00 1.00 1.00 .995 1.00 1.00 1.00 .997 1.00 1.00 1.00 .997 1.00 1.00 1.00 1.00 1.00 1.00 1.00 .999 1.00
2 , r2 Zs
Zh
Z6
1.00 1.00 .907 1.00 1.00 1.00 .903 1.00 1.00 1.00 .996 1.00 1.00 1.00 .996 1.00 1.00 1.00 .999 1.00 1.00 1.00 .999 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
NOTE: Entries are the estimated proportion of samples rejected in 100,000 simulated samples for the Zs, Zh, and Z6 tests using the z_α and (z_α + t_{α,n−1})/2 critical points (first and second numbers in the Zs, Zh, and Z6 columns) and for the chi-square and robust chi-square tests (first and second numbers in the χ², χ²r column).
k = 1.0
___
k = 2. 0
________ ___
k = 3.0
___________________
2 , r2 Zs Zh Z6 2 , r2 Zs Zh Z6 2 , r2 Zs Zh Z6 (kurtosis) ______________________________________________________________________________ Barnes3 .132 .063 .004 .062 .287 .225 .062 .230 .596 .581 .425 .609 (75.1) .078 .062 .004 .059 .238 .220 .058 .223 .588 .572 .410 .597
T(5) (6.00) Laplace(2,1) (3.00) T(8) (1.50) .143 .080 .050 .091 .086 .077 .047 .087 .164 .090 .061 .102 .096 .086 .058 .098 .134 .087 .068 .102 .090 .084 .065 .098 .678 .607 .519 .634 .614 .600 .508 .626 .679 .584 .482 .609 .594 .577 .471 .600 .741 .690 .636 .717 .692 .682 .627 .710 n = 20 (continued) Distribution .913 .885 .823 .898 .888 .882 .815 .894 .895 .852 .769 .864 .857 .848 .759 .860 .945 .929 .899 .938 .931 .927 .894 .936
k = 4. 0
_______ ___
k = 5.0
________ ___
k = 6. 0
___________________
2 , r2 Zs Zh Z6 2 , r2 Zs Zh Z6 (kurtosis) _______________________________________________ Barnes3 .912 .908 .779 .925 .985 .983 .869 .987 (75.1) .910 .904 .768 .921 .984 .983 .862 .987
T(5) (6.00) Laplace(2,1) (3.00) T(8) (1.50) .976 .967 .927 .972 .968 .966 .922 .970 .964 .947 .888 .952 .949 .945 .881 .950 .988 .984 .967 .986 .983 .983 .965 .985 .993 .990 .963 .991 .990 .989 .959 .991 .986 .979 .940 .982 .980 .978 .935 .981 .997 .996 .986 .996 .996 .995 .984 .996
2 , r2 Zs
Zh
Z6
.997 .996 .812 .997 .996 .996 .886 .997 .998 .997 .976 .997 .997 .996 .973 .997 .994 .991 .963 .992 .992 .991 .959 .992 .999 .999 .992 .999 .999 .999 .991 .999
NOTE: Entries are the estimated proportion of samples rejected in 100,000 simulated samples for the Zs, Zh, and Z6 tests using the z_α and (z_α + t_{α,n−1})/2 critical points (first and second numbers in the Zs, Zh, and Z6 columns) and for the chi-square and robust chi-square tests (first and second numbers in the χ², χ²r column).
Table 9. Comparison of Type I Error Rates for Zs and Z6 when n = 30, Skewed Distributions
=0.10 =0.05 =0.02 =0.01 _________ __________ __ __ (skewness) Zs Z6 Zs Z6 Zs Z6 Zs Z6 ______________________________________________________________________________ IG(1.0,0.1) .0805 .0792 .0549 .0454 .0378 .0224 .0301 .0138 (9.49) .0792 .0775 .0534 .0435 .0361 .0206 .0286 .0121 .0779 .0759 .0518 .0416 .0348 .0189 .0273 .0108
Distribution Weibull(1,0.5) .0802 (6.62) .0788 .0775 LN(0,1) (6.18) IG(1.0,0.25) (6.00) .0804 .0786 .0769 .0517 .0437 .0500 .0416 .0484 .0396 .0447 .0381 .0431 .0361 .0415 .0342 .0512 .0432 .0494 .0409 .0478 .0388 .0538 .0472 .0517 .0448 .0499 .0427 .0503 .0447 .0481 .0421 .0464 .0397 .0490 .0477 .0468 .0453 .0448 .0430 .0463 .0486 .0441 .0459 .0420 .0437 .0472 .0494 .0450 .0468 .0428 .0446 .0518 .0570 .0497 .0546 .0474 .0520 .0441 .0505 .0418 .0478 .0398 .0452 .0420 .0483 .0399 .0456 .0377 .0433 .0305 .0184 .0288 .0168 .0273 .0150 .0256 .0158 .0243 .0145 .0231 .0132 .0324 .0198 .0305 .0181 .0290 .0164 .0298 .0200 .0280 .0179 .0265 .0161 .0264 .0175 .0245 .0158 .0227 .0141 .0241 .0214 .0221 .0193 .0203 .0176 .0229 .0226 .0210 .0206 .0195 .0189 .0233 .0230 .0214 .0210 .0196 .0191 .0265 .0284 .0245 .0262 .0225 .0240 .0204 .0216 .0185 .0198 .0168 .0178 .0187 .0202 .0169 .0180 .0153 .0163 .0234 .0110 .0219 .0095 .0204 .0082 .0197 .0091 .0181 .0078 .0166 .0069 .0231 .0104 .0214 .0091 .0198 .0079 .0212 .0110 .0196 .0098 .0178 .0085 .0182 .0101 .0165 .0090 .0149 .0080 .0155 .0128 .0138 .0115 .0126 .0102 .0145 .0141 .0129 .0125 .0115 .0110 .0146 .0140 .0128 .0123 .0116 .0111 .0169 .0181 .0151 .0161 .0136 .0145 .0109 .0112 .0098 .0094 .0087 .0081 .0110 .0109 .0097 .0094 .0086 .0079
.0722 .0729 .0706 .0710 .0693 .0693 .0833 .0818 .0802 .0835 .0816 .0797
Gamma(1,.15) .0877 .0890 (5.16) .0856 .0863 .0837 .0840 IG(1.0,0.5) (4.24) Chi(1) (2.83) Exp(1.0) (2.00) Chi(2) (2.00) Barnes2 (1.75) IG(1.0,25.0) (0.60) Chi(24) (0.58) .0828 .0811 .0803 .0886 .0864 .0843 .0864 .0833 .0814 .0942 .0915 .0890
.0880 .0970 .0857 .0944 .0835 .0918 .0894 .0872 .0848 .0978 .0951 .0930
.0933 .1048 .0912 .1022 .0891 .0995 .0865 .0841 .0816 .0868 .0845 .0821 .1021 .0990 .0963 .1017 .0990 .0963
NOTE: Entries are the estimated proportion of samples rejected in 100,000 simulated samples for the Zs and Z6 tests using the z_α, (z_α + t_{α,n−1})/2, and t_{α,n−1} critical points (first, second, and third numbers in the Zs and Z6 columns).
=0.10 =0.05 =0.02 =0.01 _________ __ __ _________ (skewness) Zs Z6 Zs Z6 Zs Z6 Zs Z6 ______________________________________________________________________________ Barnes3 .0644 .0631 .0390 .0303 .0261 .0135 .0196 .0067 (75.1) .0630 .0613 .0379 .0286 .0249 .0121 .0186 .0056 .0615 .0596 .0367 .0270 .0238 .0108 .0175 .0047
Distribution T(5) (6.00) Barnes1 (6.00) T(6) (3.00) Laplace(2,1) (3.00) JTB(4.0,1.0) (0.78) T(16) (0.50) .0795 .0887 .0775 .0861 .0754 .0835 .1014 .1096 .0988 .1066 .0965 .1035 .0823 .0799 .0777 .0932 .0903 .0875 .0385 .0388 .0365 .0365 .0347 .0342 .0517 .0507 .0490 .0477 .0468 .0448 .0407 .0431 .0385 .0404 .0365 .0381 .0444 .0474 .0423 .0448 .0401 .0422 .0455 .0516 .0431 .0490 .0409 .0461 .0441 .0504 .0417 .0476 .0397 .0450 .0441 .0518 .0419 .0486 .0398 .0459 .0436 .0501 .0415 .0476 .0391 .0452 .0350 .0408 .0327 .0382 .0306 .0355 .0170 .0144 .0157 .0128 .0143 .0113 .0234 .0191 .0215 .0169 .0197 .0151 .0180 .0170 .0163 .0151 .0148 .0134 .0203 .0199 .0186 .0179 .0170 .0161 .0203 .0212 .0185 .0193 .0168 .0172 .0195 .0205 .0179 .0184 .0160 .0165 .0190 .0203 .0172 .0183 .0156 .0163 .0186 .0196 .0169 .0175 .0151 .0157 .0131 .0131 .0117 .0113 .0105 .0098 .0103 .0075 .0088 .0065 .0077 .0054 .0146 .0107 .0128 .0091 .0113 .0076 .0102 .0088 .0089 .0075 .0078 .0062 .0124 .0113 .0108 .0098 .0096 .0084 .0117 .0114 .0103 .0099 .0092 .0083 .0112 .0107 .0099 .0092 .0087 .0078 .0116 .0114 .0100 .0098 .0087 .0081 .0107 .0103 .0093 .0086 .0083 .0074 .0067 .0055 .0059 .0044 .0049 .0034
.0879 .0911 .0857 .0893 .0836 .0879 .0894 .1045 .0872 .1008 .0851 .0979 .0882 .0859 .0836 .1035 .1007 .0977
JTB(1.25,0.5) .0895 .1059 (0.24) .0856 .1017 .0827 .0988 T(32) (0.21) JTB(2.0,0.5) (-0.30) .0884 .1049 .0859 .1019 .0834 .0992 .0769 .0943 .0743 .0903 .0705 .0868
NOTE: Entries are the estimated proportion of samples rejected in 100,000 simulated samples for the Zs and Z6 tests using the z_α, (z_α + t_{α,n−1})/2, and t_{α,n−1} critical points (first, second, and third numbers in the Zs and Z6 columns).
Conclusion

This study proposed a new right-tailed test of the variance of non-normal distributions. The test is adapted from Hall's (1992) inverse Edgeworth expansion for the variance, with the purpose of finding a new test with fewer restrictive assumptions and no need for knowledge of the distribution type. To this end, the study compared the Type I error rates and power of previously known tests with its own. Of the previous tests and six new tests examined in the study, Z6 had the best performance for right-tailed tests. The Z6 test outperforms the χ² test by far, while performing much better than the χ²r test on skewed distributions and better on heavy-tailed distributions. The Z6 test does not need the original assumptions of the Zs test that the coefficient of skewness of the parent distribution is greater than 2 or that the distribution is skewed. Additionally, the Z6 test performs better overall than the Zs test, since Zs performs poorly at smaller alpha levels. The Z6 test, unlike Zh, does not need the original assumptions that the population coefficient of skewness is zero in the heavy-tailed distribution or that the distribution is heavy-tailed. Also, the Z6 test performs better for skewed distributions than the Zh test, which has low power at lower alphas. Finally, when considering the Type I error rates, both distribution types, and power, the Z6 test has the best performance overall. The Z6 test can be used for both types of distributions with good power performance and superior Type I error rates. Therefore, the Z6 test is a good choice for right-tailed tests of variance with non-normal distributions.

References

Efron, B., & Tibshirani, R. J. (1993). An introduction to the bootstrap. New York: Chapman & Hall.
Kendall, S. M. (1994). Distribution theory. New York: Oxford University Press.
Lee, S. J., & Sa, P. (1998). Testing the variance of skewed distributions. Communications in Statistics: Simulation and Computation, 27(3), 807-822.
Lee, S. J., & Sa, P. (1996). Testing the variance of symmetric heavy-tailed distributions. Journal of Statistical Computation and Simulation, 56, 39-52.
Bickel, P. J., & Doksum, K. A. (1977). Mathematical statistics: Basic ideas and selected topics. San Francisco: Holden-Day.
Hall, P. (1983). Inverting an Edgeworth expansion. The Annals of Statistics, 11(2), 569-576.
Kendall, S. M., & Stuart, A. (1969). The advanced theory of statistics, Vol. 1: Distribution theory (4th ed.). London: Griffin.
Hall, P. (1992). The bootstrap and Edgeworth expansion. New York: Springer-Verlag.
Evans, M., Hastings, N., & Peacock, B. (2000). Statistical distributions (3rd ed.). New York: John Wiley & Sons.
Chhikara, R. S., & Folks, J. L. (1989). The inverse Gaussian distribution: Theory, methodology and applications. Marcel Dekker, Inc.
Fleishman, A. I. (1978). A method of simulating non-normal distributions. Psychometrika, 43, 521-531.
Johnson, M. E., Tietjen, G. L., & Beckman, R. J. (1980). A new family of probability distributions with applications to Monte Carlo studies. Journal of the American Statistical Association, 75(370), 276-279.
Long, M., & Sa, P. (2003). The simulation results for right-tailed testing of variance for non-normal distributions (Technical Report #030503). Department of Mathematics and Statistics, University of North Florida. http://www.unf.edu/coas/mathstat/CRCS/CRTechRep.htm
Journal of Modern Applied Statistical Methods May, 2005, Vol. 4, No.1, 214-226
John Nolan
Department of Mathematics and Statistics American University
Melanie Wilson
Department of Mathematics Allegheny College
Kristen Dardia
Department of Mathematics and Statistics James Madison University
A new method for estimating the parameters of scale mixtures of normals (SMN) is introduced and evaluated. The new method is called UNMIX and is based on minimizing the weighted square distance between exact values of the density of the scale mixture and estimated values using kernel smoothing techniques over a specified grid of x-values and a grid of potential scale values. Applications of the method are made in modeling the continuously compounded return, CCR, of stock prices. Modeling this ratio with UNMIX proves promising in comparison with other existing techniques that use only one normal component, or those that use more than one component based on the EM algorithm as the method of estimation. Key words: Expectation-Maximization algorithm, UNMIX, kernel density smoothing, expected return
Introduction

The study of univariate scale mixtures of normals, SMN, has long been of interest to statisticians continuously hunting for better methods to model probability density functions. Modeling using these mixtures has many applications, from genetics and medicine to economic and population studies. More specifically, one can use SMN to model any data that is seemingly normally distributed and has a high kurtosis. Using SMN allows for the tails of
Hasan Hamdan is Assistant Professor at James Madison University. His research interests are in mixture models, sampling and mathematical statistics. Email: [email protected]. John Nolan is a Professor at American University. His research interests are in probability, stochastic processes and mathematical statistics. Melanie Wilson is a graduate student in statistics at Duke and Kristen Dardia is graduating from James Madison University. This research was partially supported by NSF grant number NSF-DMS 0243845.
the density to be heavier than those in the normal density, giving better coverage for data that varies greatly from the mean. The most common estimation method for the parameters of the mixtures is the EM algorithm of Dempster, Laird, and Rubin (1977). This method is based on finding the maximum likelihood estimate of the parameters for a given data set. The EM algorithm performs well in cases where the distance between the means of the components is relatively large. However, when estimating the parameters of a mixture of normals where all of the components have the same mean but different variances, the EM algorithm gives a poor estimate when these variances are small and close. In this article, we elaborate on a new approach to estimation, UNMIX, proposed by Hamdan and Nolan (2004). The UNMIX program uses kernel smoothing techniques to get an empirical estimate of the density of the data. It then estimates the parameters of the mixture by minimizing the weighted least squares distance between the values from the empirical density and the new scale mixture density over a pre-specified grid of x-values and a grid of potential scale values.
The density function of the form of an SMN is introduced in Section 2. Next, in Section 3, techniques of estimation of SMN are listed and brief background on the common EM algorithm is also presented. In Section 4, the density of CCR is estimated for different stocks with SMN using the UNMIX program and using a single normal. The density is also estimated using the EM algorithm and the results are compared. Finally, some suggestions for improving this method are made in the conclusion section.

Methodology

A random variable X is a scale mixture of normals, or SMN, if X = AZ, where Z ~ N(0,1), A > 0, and A and Z are independent. Here N(0,1) is the standard normal variable with mean 0 and standard deviation 1. Therefore, X has a probability density function

f(x) = ∫ (1/σ) φ(x/σ) π(dσ),   (1)

where φ is the standard normal density and the mixing measure π is the distribution of A. An SMN can be either an infinite or a finite mixture, depending upon the mixing measure π. If the mixing measure is discrete and A takes on a finite number of values, say σ1, ..., σm with respective probabilities λ1, ..., λm, then

f(x) = Σ_{j=1}^{m} (1/σj) φ(x/σj) λj.   (2)

A common finite mixture, called the contaminated normal, occurs when A takes on two values, with σ1 < σ2 and λ1 > λ2. In this case the density function simplifies to

f(x) = (λ1/σ1) φ(x/σ1) + ((1 − λ1)/σ2) φ(x/σ2).
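A minimal R sketch of equation (2), assuming only the finite-mixture form above; dsmn is an illustrative helper name, not part of the article's software.

```r
# dsmn evaluates equation (2): a finite scale mixture of normals with scales
# sigma_1,...,sigma_m and weights lambda_1,...,lambda_m (summing to 1).
# Note that (1/sigma) * phi(x/sigma) is simply dnorm(x, mean = 0, sd = sigma).
dsmn <- function(x, sigma, lambda) {
  comp <- sapply(seq_along(sigma),
                 function(j) lambda[j] * dnorm(x, mean = 0, sd = sigma[j]))
  rowSums(matrix(comp, nrow = length(x)))   # sum over the m components
}

# Contaminated normal: two components with lambda_1 + lambda_2 = 1
dsmn(c(-1, 0, 1), sigma = c(1, 3), lambda = c(0.9, 0.1))
```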
Some common examples of infinite SMN are the generalized t distribution, the exponential power family, and sub-Gaussian distributions. The following theorem gives the characteristics necessary for a distribution to be an SMN with mean zero.

Theorem (Schoenberg, 1938): Given any random variable X with density f(x), X is a scale mixture of normals if and only if f(√x) is a completely monotone function. See Feller (1971) for the definition of a completely monotone function.

As seen above, when A takes on a finite number of values, the density of X can be written more simply in the same manner as equation (2). When π is not concentrated at a finite number of points, Hamdan and Nolan (2004) give a constructive method for discretizing π so that equation (2) is uniformly close to equation (1).

Estimating Scale Mixtures of Normals

In estimating an SMN one needs to find the following: the number of components, the estimated parameters of each component, and the estimated weights of each component. We highlight some of the important developments in this area. The problem of estimating SMN has been the subject of a large, diverse body of literature. Dempster, Laird, and Rubin (1977) introduced the EM algorithm for approximating the maximum likelihood estimates, and other methods have since been developed based on the EM algorithm. A robust, powerful approach based on minimum distance estimation is analyzed by Beran (1977) and Donoho and Liu (1988). Zhang (1990) used Fourier methods to derive kernel estimators and provided lower and upper bounds for the optimal rate of convergence. Priebe (1994) developed a nonparametric maximum likelihood technique from related methods of kernel estimation and finite mixtures.

EM algorithm

The EM algorithm developed by Dempster, Laird, and Rubin (1977) is based on finding the maximum likelihood estimates of the components, parameters, and weights of a mixture density of the form

h(x) = Σ_{j=1}^{k} λj fj(x).

Given observations x1, ..., xn, the data are completed by letting each xi correspond to a yi. The new yi is a vector giving the initial value xi and also a sequence of values zi1, ..., zik which tells the location of the x value, that is, yi = (xi, zi1, ..., zik), where zij = 1 if xi comes from component j and zij = 0 otherwise. Therefore, the only missing values are the labels zi1, ..., zik. Next, the maximum likelihood estimate of each yi is found in the Expectation step of the EM algorithm: initial guesses are made for the parameters of each component, and the Expectation step computes, for each observation xi, the posterior probability τji that xi belongs to component j.
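The following is a generic EM sketch in R for a zero-mean scale mixture, not the authors' implementation; it only illustrates the E-step responsibilities τji and the usual M-step updates for the weights and scales.

```r
# Generic EM iteration for a k-component scale mixture of normals with common
# mean 0 (an illustration, not the authors' code).
em_smn <- function(x, sigma, lambda, iters = 200) {
  for (it in seq_len(iters)) {
    # E-step: tau[i, j] = posterior probability that x_i came from component j
    num <- sapply(seq_along(sigma),
                  function(j) lambda[j] * dnorm(x, sd = sigma[j]))
    tau <- num / rowSums(num)
    # M-step: update the weights and the component scales
    lambda <- colMeans(tau)
    sigma  <- sqrt(colSums(tau * x^2) / colSums(tau))
  }
  list(lambda = lambda, sigma = sigma)
}

# Example: data from a contaminated normal, started from rough guesses
x <- c(rnorm(900, sd = 1), rnorm(100, sd = 3))
em_smn(x, sigma = c(0.5, 2), lambda = c(0.5, 0.5))
```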
The UNMIX approach is based on discretizing the mixture over a pre-specified grid of x-values and a grid of potential sigma values. Given a sample of size n from the mixture, we fix a grid of possible sigma values (called the σ-grid) and possible x-values (called the x-grid), x1, ..., xk, where k ≥ m. In order to obtain an estimate f̂(x) of f(x) for each x in the x-grid, we use kernel smoothing techniques, discussed briefly at the end of the section. Our model is

yi = Σ_{j=1}^{m} (1/σj) φ(xi/σj) λj + εi,  i = 1, ..., k,

where yi = f̂(xi). Writing φij = (1/σj) φ(xi/σj), the weighted sum of squares to be minimized is

S(λ) = Σ_{i=1}^{k} wi² ( yi − Σ_{j=1}^{m} φij λj )².

We will use wi = 1 throughout. However, if the data are heavy-tailed, then one can try different weights until a good fit is found (in the heavy-tailed case, a good strategy might be weighting the points that are close to the mean of the x-grid less than those that are far from the mean of the x-grid). Next, consider the problem as a quadratic programming problem with two constraints: Σ_{j=1}^{m} λj = 1 and λj ≥ 0 for all j. Expanding S(λ):

S(λ) = Σ_{i=1}^{k} wi² yi² − 2 Σ_{i=1}^{k} wi² yi Σ_{j=1}^{m} φij λj + Σ_{i=1}^{k} wi² ( Σ_{j=1}^{m} φij λj )( Σ_{l=1}^{m} φil λl ).

To simplify, let c = (1/2) Σ_{i=1}^{k} wi² yi², a constant. Reformulating the problem in a matrix environment, we let g be the (m × 1) vector defined as

g = −( Σ_{i=1}^{k} wi² yi φi1, ..., Σ_{i=1}^{k} wi² yi φim )′

and H the (m × m) matrix with (j, l) entry Σ_{i=1}^{k} wi² φij φil, so that

S(λ) = 2( c + g′λ + (1/2) λ′Hλ ).

Therefore, minimizing S(λ) is equivalent to minimizing

g′λ + (1/2) λ′Hλ

subject to the programming constraints Σ_{j=1}^{m} λj = 1 and Aλ ≥ b, where A is the (m × m) identity matrix and b′ = (0, ..., 0) is of order (m × 1). A quadratic programming routine, QPSOLVE, which is a Fortran subroutine, is used to solve this problem. UNMIX is a Splus program that takes the sample, x-grid, r-grid, and a vector of weights as the input and calls QPSOLVE. The program's output is a vector of estimated weights over the given r-grid.

In obtaining an estimate f̂(x), kernel smoothing techniques were used. One important variable in density estimates using kernel smoothing techniques is the bandwidth. In general, a large bandwidth over-smoothes the density curve, and a small bandwidth can under-smooth it. In essence, the bandwidth controls how widely the kernel function is spread about the point of interest. If there are a large number of values xi near x, then the weight of x is relatively large and the estimate of the density at x will also be large. There are four sources of variability involved when using UNMIX to estimate an SMN. The first is the sampling variability; the second is due to the method of density estimation and bandwidth used; the third is the choice of the x-grid; and the fourth is the choice of the r-grid. Controlling sampling variability can be done by increasing the sample size. However, controlling the variability introduced by the method of density estimation requires care and investigation of the sample and bandwidth used. For example, we can weight the observations by using their distance from the center. There is considerable literature on how to pick the most effective bandwidth, including articles by Hardle and Marron (1985) and Muller (1985). For the purposes of this article, when using the UNMIX program, the default bandwidth based on the literature given in R-Software is used. UNMIX performs well for estimating distributions with a high kurtosis but loses accuracy for data that is extremely concentrated about the mean. However, these difficulties can be overcome due to the flexibility of the program in terms of fitting the data. In particular, the r-grid and the weights can be changed interactively in a systematic way until a good fit is found. We have found that the most useful x-grid is evenly distributed and symmetric about the mode, where the distance from the mode on both sides is the absolute maximum of the sample data (the mode is 0 in this case). This allows the x-grid to cover all data points. Similar care is needed in creating the σ-grid.
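A hedged R sketch of the fitting step just described: the article calls the Fortran routine QPSOLVE from an S-Plus program, whereas this sketch substitutes quadprog::solve.QP, and xgrid, sgrid, and w are illustrative names for the x-grid, σ-grid, and weights.

```r
library(quadprog)   # solve.QP handles the equality and non-negativity constraints

fit_unmix <- function(x, xgrid, sgrid, w = rep(1, length(xgrid))) {
  # Kernel estimate of the density, evaluated on the x-grid
  kd  <- density(x, from = min(xgrid), to = max(xgrid), n = 512)
  y   <- approx(kd$x, kd$y, xout = xgrid)$y
  # Phi[i, j] = (1/sigma_j) * phi(x_i / sigma_j)
  Phi <- sapply(sgrid, function(s) dnorm(xgrid, sd = s))
  # Objective: (1/2) lambda' H lambda + g' lambda; a tiny ridge keeps H positive definite
  H <- t(Phi) %*% diag(w^2) %*% Phi + 1e-8 * diag(length(sgrid))
  g <- -t(Phi) %*% (w^2 * y)
  # Constraints: sum(lambda) = 1 (equality), lambda_j >= 0
  Amat <- cbind(rep(1, length(sgrid)), diag(length(sgrid)))
  bvec <- c(1, rep(0, length(sgrid)))
  sol  <- solve.QP(Dmat = H, dvec = drop(-g), Amat = Amat, bvec = bvec, meq = 1)
  list(sigma = sgrid, lambda = sol$solution)
}

# Example: fit a sigma-grid to a simulated contaminated-normal sample
x <- c(rnorm(900, sd = 1), rnorm(100, sd = 3))
fit_unmix(x, xgrid = seq(-6, 6, length.out = 51), sgrid = seq(0.5, 4, by = 0.5))
```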
The continuously compounded return, CCR, can now be estimated as follows, with S_{T−τ} denoting the stock price τ time units earlier:

ln( S_T / S_{T−τ} ) ~ N( (μ − σ²/2)τ, σ²τ ),

where μ and σ² are estimated from the observed log returns xi = ln( Si / S_{i−1} ) by

x̄ = (1/n) Σ_{i=1}^{n} xi   and   S² = 1/(n − 1) Σ_{i=1}^{n} ( xi − x̄ )².

Comparing Normal Estimate to UNMIX Estimate

We now estimate and compare the density of the CCR using a single normal curve and a scale mixture of normals. Taking advantage of Yahoo's (an internet search engine) extensive finance resources, three stocks were found whose price quotes showed
relatively high volatility: Ciber Inc., ExxonMobil, and Continental Airlines. For each of the stocks, we sampled the weekly closing prices over the past four years, from July 14, 2000 to July 14, 2004. The natural log of the return was taken to find the CCR for each stock. Modeling with the single normal method described above and with the UNMIX program, their performances were compared against the empirical density found using kernel smoothing techniques. The empirical density is then used to estimate the density over an x-grid of 51 equally-spaced points between -4S and 4S, where S is the sample standard deviation. Because the empirical density can be made very close to the true density at any given point, it is treated as the true density in each of the following error calculations, which are presented in Tables 1, 2, and 3.

Example 1: In this example, the density of the CCR of Ciber Inc. stock is estimated. The normal estimate based on the random walk assumption has a mean of -.00686 and a standard deviation of .09041. The estimated SMN was found using the UNMIX program and has 4 components with a weight vector of (.52951, .07374, .39415, .00260) and an estimated σ-vector of (.12266, .06048, .03885, .03750). The estimated densities were evaluated on the same x-grid and the results are shown in Figure 1. In Figure 2, the three density estimates were found for an x-grid located in the right tail of the distribution of the CCR; it consists of 25 equally-spaced points between .2 and .45. Using the normal assumption, the probability of any sample point falling in this range is approximately .012, and it is approximately .035 when the scale mixture assumption is used. Though this probability is not high, most density estimation techniques do not recover the tails well, where the most extreme occurrences can be found. This could be very problematic in finance and risk analysis.
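A sketch of the computations in Example 1, assuming `prices` holds the weekly closing prices (not supplied here); the weight and σ vectors are the values reported in the text, and dsmn() refers to the earlier sketch.

```r
ccr <- function(prices) diff(log(prices))      # x_i = ln(S_i / S_{i-1})

x <- ccr(prices)                               # `prices`: weekly closes (placeholder)
c(mean = mean(x), sd = sd(x))                  # about -.00686 and .09041 in the text

lambda <- c(.52951, .07374, .39415, .00260)    # UNMIX weights reported in Example 1
sigma  <- c(.12266, .06048, .03885, .03750)    # UNMIX sigma-vector reported in Example 1
xgrid  <- seq(-4 * sd(x), 4 * sd(x), length.out = 51)

f_norm  <- dnorm(xgrid, mean = mean(x), sd = sd(x))   # single-normal estimate
f_unmix <- dsmn(xgrid, sigma, lambda)                 # dsmn() from the earlier sketch
```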
[Figure 1. Estimated density of the CCR of Ciber Inc. (Sample Den, Unmix Est, Em Est); N = 211, Bandwidth = 0.02045.]
[Figure 2. Estimated density of the CCR of Ciber Inc. in the right tail, approximately 0.25 to 0.45 (Sample Den, Normal Est, Unmix Est); N = 211, Bandwidth = 0.02045.]
Notice in Figure 2 that estimating with SMN produces a better fit in the tails. In contrast to overestimating the rate of return in the body, where around 95% of the data are located, the normal curve tends to underestimate the density in the tails. As in our examples, the distributions for the CCR tend to have fatter tails than the proposed normal has. Because the tails of the data are heavy, the scale mixture estimation will produce a better fit than the normal.
Under the single normal assumption, the 95% confidence interval for the mean of the CCR is (-.1767, .1767). Equivalently and by exponentiation, the interval for the mean rate of return is (.8381, 1.1933). The corresponding UNMIX estimate is found to be (.8469, 1.1808). In comparison to UNMIX, the normal curve tends to overestimate the rate of return in the body of the density. Though the gap does not seem large when investing a small amount, for
Next, the performance of the UNMIX method is compared to the EM algorithm in estimating the density of the CCR of the same three stocks. The number of components to be used with the EM is also unknown, and there are many ways to estimate it. Here, we tried two-, three-, four-, and five-component mixtures. There was no noticeable difference between the four-component mixture and the five-component mixture, so the four-component mixture was used for our examples. The parameters were then estimated using the EM algorithm and compared to those found using the UNMIX estimation. The initialization of the parameters was somewhat arbitrary, because our goal is to find the best density fit and not to investigate the speed or convergence of these estimation methods. The λ's were initialized such that each component has an equal weight of .25; the μ's were initialized such that μ1 equals the mean of the sample and μ2, μ3, and μ4 equal .2, .4, and .8 times the mean of the sample, respectively. The σ's were then initialized for each component in the same manner as the μ's. For each of the three examples, the process was repeated 50 times and the mean of the parameter estimates was taken as the final EM estimate. The estimated densities of the stocks are shown in Figure 7. Notice that the EM estimate tends to overestimate the mean of the empirical density, which is a consequence of the fitted component having a very small variance. The EM captures the skewness of the density better, but in general UNMIX outperforms it. This is seen in the fact that, in the three examples, the EM algorithm produces both a greater maximum and a greater average error, as summarized in Table 5.
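The error summaries reported in the tables are maximum and average distances from the empirical density over the x-grid; a minimal sketch, assuming absolute differences, is:

```r
# Maximum and average (absolute) distance between a density estimate and the
# empirical density over the x-grid, as summarized in the error tables.
density_errors <- function(f_emp, f_est) {
  c(max = max(abs(f_emp - f_est)), avg = mean(abs(f_emp - f_est)))
}
```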
[Figure 3. Estimated density of the CCR of ExxonMobil stock (Sample Den, Normal Est, Unmix Est); N = 207, Bandwidth = 0.007839.]
Figure 4: Estimated Density of CCR in the right tail of Exxon Mobile stock. Probability of being in the tail is approximately .0372.
[Figure 4 (plot): density estimates in the right tail, approximately 0.06 to 0.14 (Sample Den, Normal Est, Unmix Est); N = 207, Bandwidth = 0.007839.]
[Figure 5. Estimated density of the CCR of Continental Airlines stock (Sample Den, Normal Est, Unmix Est); N = 206, Bandwidth = 0.02272.]
Figure 6: Estimated Density in the right tail of CCR for Continental Airlines stock. Probability of being in this tail is .0186.
[Figure 6 (plot): density estimates in the right tail, approximately 0.25 to 0.45 (Sample Den, Normal Est, Unmix Est); N = 206, Bandwidth = 0.02272.]
Table 1: Maximum and average errors of the Normal and UNMIX estimates of CCR for Ciber Inc.
Error       Max. Norm.   Max. UNMIX   Avg. Norm.   Avg. UNMIX
Body Den.   1.6440       .7874        .3645        .1559
Tail Den.   .2630        .1188        .0499        .0203
Table 2: Maximum and average errors of the Normal and UNMIX estimates of CCR for ExxonMobile stock.
Error       Max. Norm.   Max. UNMIX   Avg. Norm.   Avg. UNMIX
Body Den.   1.9545       1.7341       .5838        .4015
Tail Den.   1.0582       .4712        .2268        .1283
Table 3: Maximum and average errors of the Normal and UNMIX estimates of CCR for Continental Airline stock.
Error       Max. Norm.   Max. UNMIX   Avg. Norm.   Avg. UNMIX
Body Den.   1.3104       .8375        .26911       .02090
Tail Den.   .1897        .1410        .0699        .0579
Table 4: Bounds for the middle 95% probability of the distribution for the CCR of Ciber Inc., ExxonMobil, and Continental Airlines in both the normal and UNMIX estimates.
Stock          Normal             UNMIX
Ciber Inc.     (.8381, 1.1933)    (.8469, 1.1808)
ExxonMobile    (.9428, 1.0607)    (.9462, 1.0569)
Continental    (.8334, 1.200)     (.8416, 1.1882)
Figure 7: Estimated Density of CCR using the UNMIX program and the EM algorithm for (a) Ciber Inc.; (b) ExxonMobil; (c) Continental Airlines.
[Figure 7 plots (a)-(c): estimated densities of CCR using the UNMIX program and the EM algorithm (Sample Den, Unmix Est, Em Est); (a) N = 211, Bandwidth = 0.02045; (c) N = 206, Bandwidth = 0.02272.]
Table 5: Maximum and average errors of the UNMIX and EM estimates of CCR for all examples.
Error          Max. EM   Max. UNMIX   Avg. EM   Avg. UNMIX
Ciber Inc.     1.2832    .7971        .1592     .1542
ExxonMobile    2.143     1.7269       .4059     .3985
Continental    1.2987    .8320        .2193     .2048
Conclusion

Estimation of the CCR of stocks has been an interest of both statisticians and financiers due to the importance of producing accurate models for the data. As evidenced by the previous examples, UNMIX allows this analysis to occur with smaller error in comparison to the single normal assumption and the common methods based on the EM algorithm. Although the EM algorithm is well developed and allows for different locations and different scales, it sometimes has practical difficulties. For example, when trying to find the MLE of the parameters, it might find a large local maximum that occurs as a consequence of a fitted component having a very small (but nonzero) variance. Also, there are still some problems associated with initializing the parameters, including the number of components. However, UNMIX fitted the data better than the EM. We believe that it will always fit the data well, because it is based on minimizing the weighted distance between the empirical density and the mixture over a given grid. However, in terms of estimating the actual parameters, more work needs to be done, because the EM still does a better job of estimating the actual values, as we have seen in many simulated examples where the actual mixtures are known.

There are some areas where UNMIX can be improved. First, what would make it most widely applicable is the possibility of handling not only scale but also location conditions. Improvements to the program can also be made by developing guidelines to choose an optimal x-grid and r-grid. Finally, we can improve the empirical density estimate by using optimal kernel functions and bandwidths. Applications of the UNMIX program extend beyond the scope of the stock market. The program can be used to model distributions with relatively high probabilities of outlying events. Staying in the realm of finance, the program can be used to estimate exchange rates. However, there are also many examples outside of the finance field, including fitting extreme data. For example, the UNMIX program was used to fit the density of some heavy-tailed data. These data were generated from the class of stable densities, which have infinite variance and are known to be infinite-variance mixtures of normals, such as the Cauchy density. Although more work needs to be done, the UNMIX method looks promising for fitting such data.

References

Akaike, H. (1954). An approximation to the density function. Annals of the Institute of Statistical Mathematics, 6, 127-132.
Beran, R. (1977). Minimum Hellinger distance estimates for parametric models. Annals of Statistics, 5, 445-463.
Biernacki, C., Celeux, G., & Govaert, G. (2003). Choosing starting values for the EM algorithm for getting the highest likelihood in multivariate Gaussian mixture models. Computational Statistics and Data Analysis, 41, 561-575.
Bozdogan, H. (1993). Choosing the number of component clusters in the mixture model using a new informational complexity criterion of the inverse-Fisher information matrix. Information and Classification, 40-54.
Clark, P. K. (1973). A subordinated stochastic process model with finite variance for speculative prices. Econometrica, 41, 135-155.
Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, 39, 1-38.
Dick, N. P., & Bowden, D. C. (1973). Maximum likelihood estimation for mixtures of two normal distributions. Biometrics, 29, 781-790.
Donoho, D. L., & Liu, R. C. (1988). The automatic robustness of minimum distance functionals. Annals of Statistics, 16, 552-586.
Epps, T. W., & Epps, M. L. (1976). The stochastic dependence of security price changes and transaction volumes: Implications for the mixture-of-distributions hypothesis. Econometrica, 44, 305-321.
Fama, E. F. (1965). The behavior of stock market prices. Journal of Business, 38, 34-105.
Feller, W. J. (1971). An introduction to probability theory and its applications (2nd ed., Vol. II). New York: Wiley.
Fix, E., & Hodges, J. L. (1989). Discriminatory data analysis - nonparametric discrimination: Consistency properties. International Statistical Review, 57, 238-247.
Glasserman, P., Heidelberger, P., & Shahabuddin, P. (2000). Portfolio value-at-risk with heavy-tailed risk factors. IBM Research Report.
Journal of Modern Applied Statistical Methods May, 2005, Vol. 4, No.1, 227-239
Enhancing The Performance Of A Short Run Multivariate Control Chart For The Process Mean
Michael B.C. Khoo T. F. Ng
School of Mathematical Sciences Universiti Sains, Malaysia
Short run production is becoming more important in manufacturing industries as a result of increased emphasis on just-in-time (JIT) techniques, job shop settings and synchronous manufacturing. Short run production or more commonly short run is characterized by an environment where the run of a process is short. To meet these new challenges and requirements, numerous univariate and multivariate control charts for short run have been proposed. In this article, an approach of improving the performance of a short run multivariate chart for individual measurements will be proposed. The new chart is based on a robust estimator of process dispersion. Key words: Short run, process mean, process dispersion, quality characteristic, in-control, out-of-control
Introduction
Let X_n = (X_{n1}, X_{n2}, ..., X_{np})′ denote the p × 1 vector of quality characteristics made on a part. Assume that X_n, n = 1, 2, ..., are independent and identically distributed (i.i.d.) multivariate normal, N_p(μ, Σ), observations, where X_{nj} is the observation on variable (quality characteristic) j at time n. Define the estimated mean vector obtained from a sequence of random multivariate observations X_1, X_2, ..., X_n as X̄_n = (X̄_1, X̄_2, ..., X̄_p)′, where

X̄_j = (1/n) Σ_{i=1}^{n} X_{ij}

is the sample mean of variable j made from the first n observations. Table 1 gives the additional notations that are required in the article.

Michael B. C. Khoo (Ph.D., Universiti Sains Malaysia, 2001) is a lecturer at Universiti Sains Malaysia. His research interests are statistical process control and reliability analysis. Email: [email protected]. T. F. Ng is a graduate student in the School of Mathematical Sciences, Universiti Sains Malaysia.

Table 1. Additional Notations.
Φ(·) - the standard normal cumulative distribution function
Φ^{-1}(·) - the inverse of the standard normal cumulative distribution function
H_v(·) - the chi-squared cumulative distribution function with v degrees of freedom
F_{v1,v2}(·) - the Snedecor F cumulative distribution function with (v1, v2) degrees of freedom

The following four cases (see Khoo & Quah, 2002) of μ and Σ known and unknown give the standard normal V statistics for the short run multivariate chart based on individual measurements. Because the V statistics follow a standard normal distribution, the limits of the chart can be based on the 1-of-1, 3-of-3, 4-of-5 and EWMA tests, which are discussed in a later section.

Case KK: μ = μ_0, Σ = Σ_0, both known

T_n² = (X_n − μ_0)′ Σ_0^{-1} (X_n − μ_0)

and

V_n = Φ^{-1}{H_p(T_n²)}, n = 1, 2, ...   (1)

Case UK: μ unknown, Σ = Σ_0 known

T_n² = (X_n − X̄_{n−1})′ Σ_0^{-1} (X_n − X̄_{n−1})

and

V_n = Φ^{-1}{H_p[((n − 1)/n) T_n²]}, n = 2, 3, ...   (2)

Case KU: μ = μ_0 known, Σ unknown

T_n² = (X_n − μ_0)′ S_{0,n−1}^{-1} (X_n − μ_0), where S_{0,n} = (1/n) Σ_{i=1}^{n} (X_i − μ_0)(X_i − μ_0)′,

and

V_n = Φ^{-1}{F_{p,n−p}[((n − p)/(p(n − 1))) T_n²]}, n = p + 1, p + 2, ...   (3)

Case UU: μ and Σ both unknown

T_n² = (X_n − X̄_{n−1})′ S_{n−1}^{-1} (X_n − X̄_{n−1}), where S_n = (1/(n − 1)) Σ_{i=1}^{n} (X_i − X̄_n)(X_i − X̄_n)′,

and

V_n = Φ^{-1}{F_{p,n−p−1}[((n − 1)(n − p − 1)/(np(n − 2))) T_n²]}, n = p + 2, p + 3, ...   (4)

In Eqs. (1)-(4), p represents the number of quality characteristics that are monitored simultaneously, i.e., p ≥ 2.
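To make Eqs. (1) and (4) concrete, the following minimal sketch (illustrative Python, not the authors' code; SciPy's chi-squared, F and normal distributions are assumed, and the function names are hypothetical) computes the V statistic for cases KK and UU from a single incoming observation.

# Minimal sketch of the V statistics in Eqs. (1) and (4); not the authors' code.
import numpy as np
from scipy.stats import chi2, f, norm

def v_case_kk(x, mu0, sigma0_inv):
    """Eq. (1): V_n = Phi^{-1}{H_p(T_n^2)} with mu0 and Sigma0 known."""
    p = len(mu0)
    d = x - mu0
    t2 = d @ sigma0_inv @ d
    return norm.ppf(chi2.cdf(t2, df=p))

def v_case_uu(x, xbar_prev, s_prev_inv, n, p):
    """Eq. (4): mu and Sigma both estimated from the first n - 1 observations."""
    d = x - xbar_prev
    t2 = d @ s_prev_inv @ d
    scaled = (n - 1) * (n - p - 1) / (n * p * (n - 2)) * t2
    return norm.ppf(f.cdf(scaled, p, n - p - 1))

In practice these would be applied sequentially, updating the running mean vector and sample covariance matrix after each new observation.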
Enhanced Short Run Multivariate Control Chart for Individual Measurements

The short run multivariate chart statistics in Eqs. (1) and (2) are based on the known covariance matrix, while those of Eqs. (3) and (4) are based on the estimated covariance matrix, i.e., the sample covariance matrix. It is shown in Ref. 1 that the performance of the chart based on the V statistics in Eqs. (3) and (4) is inferior to that of cases KK and UK in Eqs. (1) and (2), respectively. Thus, in this article an approach to enhance the performance of the short run multivariate chart for cases KU and UU is proposed by replacing the estimators of the process dispersion, i.e., S_{0,n} and S_n in Eqs. (3) and (4) respectively, with a robust estimator of scale based on a modified mean square successive difference (MSSD) approach. Holmes and Mergen (1993) and Seber (1984) provide discussions of the MSSD approach. The new estimator of the process dispersion is denoted by S_MSSD, while the new V statistic is represented by V_MSSD.
KHOO & NG
The following formulas give the new standard normal V_MSSD statistics for cases KU and UU. Note that all the notations used here are similar to those defined in the previous section.

Case KU: μ = μ_0 known, Σ unknown

For odd numbered observations, i.e., when n is an odd number,

T²_MSSD,n = (X_n − μ_0)′ S_MSSD,n−1^{-1} (X_n − μ_0),

where

S_MSSD,n−1 = (1/2) Σ_{i=2,4,6,...}^{n−1} (X_i − X_{i−1})(X_i − X_{i−1})′,

and

V_MSSD,n = Φ^{-1}{F_{p,(n−2p+1)/2}[((n − 2p + 1)/(2p)) T²_MSSD,n]}, n = 2p + 1, 2p + 3, ...   (5a)

For even numbered observations, i.e., when n is an even number,

T²_MSSD,n = (X_n − μ_0)′ S_MSSD,n−2^{-1} (X_n − μ_0),

where

S_MSSD,n−2 = (1/2) Σ_{i=2,4,6,...}^{n−2} (X_i − X_{i−1})(X_i − X_{i−1})′,

and

V_MSSD,n = Φ^{-1}{F_{p,(n−2p)/2}[((n − 2p)/(2p)) T²_MSSD,n]}, n = 2p + 2, 2p + 4, ...   (5b)

Case UU: μ and Σ both unknown

For odd numbered observations, i.e., when n is an odd number,

T²_MSSD,n = (X_n − X̄_{n−1})′ S_MSSD,n−1^{-1} (X_n − X̄_{n−1})

and

V_MSSD,n = Φ^{-1}{F_{p,(n−2p+1)/2}[((n − 2p + 1)(n − 1)/(2np)) T²_MSSD,n]}, n = 2p + 1, 2p + 3, ...   (6a)

For even numbered observations, i.e., when n is an even number,

T²_MSSD,n = (X_n − X̄_{n−1})′ S_MSSD,n−2^{-1} (X_n − X̄_{n−1})

and

V_MSSD,n = Φ^{-1}{F_{p,(n−2p)/2}[((n − 2p)(n − 1)/(2np)) T²_MSSD,n]}, n = 2p + 2, 2p + 4, ...   (6b)

For the V_MSSD statistics in Eqs. (5a), (5b), (6a) and (6b) above, p is the number of quality characteristics monitored simultaneously, hence p ≥ 2.
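The following sketch (illustrative Python rather than the authors' implementation; the function names are hypothetical) shows how the modified MSSD estimator and the V_MSSD statistic of Eq. (5a) can be computed for case KU with an odd observation number.

# Minimal sketch of Eq. (5a); illustrative only, not the authors' code.
import numpy as np
from scipy.stats import f, norm

def s_mssd(X):
    """0.5 * sum of (X_i - X_{i-1})(X_i - X_{i-1})' over i = 2, 4, 6, ...
    X is an (m x p) array of the observations available so far (m even)."""
    diffs = X[1::2] - X[0::2][: len(X[1::2])]      # pairs (X2-X1), (X4-X3), ...
    return 0.5 * sum(np.outer(d, d) for d in diffs)

def v_mssd_case_ku_odd(X, mu0):
    """Eq. (5a), n odd and n >= 2p + 1: S_MSSD,n-1 is built from X_1, ..., X_{n-1}."""
    n, p = X.shape
    s = s_mssd(X[: n - 1])                          # S_MSSD,n-1
    d = X[-1] - mu0
    t2 = d @ np.linalg.inv(s) @ d
    scaled = (n - 2 * p + 1) / (2 * p) * t2
    return norm.ppf(f.cdf(scaled, p, (n - 2 * p + 1) / 2))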
Tests for Shifts in the Mean Vector

Because all the V_MSSD statistics are standard normal random variables, the following tests will be used in the detection of shifts in the mean vector. Given a sequence of V_MSSD statistics, i.e., V_MSSD,a+1, V_MSSD,a+2, ..., V_MSSD,m, ..., where V_MSSD,a represents the control chart statistic, V_MSSD, at observation a, the tests are defined as follows:

The 1-of-1 Test: When V_MSSD,m is plotted, the test signals a shift in μ if |V_MSSD,m| > 3, i.e., V_MSSD,m falls beyond ±3.

The 3-of-3 Test: When V_MSSD,m is plotted, the test signals a shift in μ if V_MSSD,m, V_MSSD,m−1 and V_MSSD,m−2 all exceed 1. This test requires the availability of three consecutive V_MSSD statistics.

The 4-of-5 Test: When V_MSSD,m is plotted, the test signals a shift in μ if at least four of the five values V_MSSD,m, V_MSSD,m−1, ..., V_MSSD,m−4 exceed 1. This test can only be used if five consecutive V_MSSD statistics are available.
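A compact sketch of these three tests, written for any sequence of standard normal charting statistics such as V_MSSD (illustrative Python; only the upper-sided versions of the 3-of-3 and 4-of-5 rules described above are coded, and the function names are assumptions):

def one_of_one(v):
    """Signal if the latest statistic falls outside +/-3."""
    return abs(v[-1]) > 3

def three_of_three(v):
    """Signal if the three most recent statistics all exceed 1."""
    return len(v) >= 3 and all(x > 1 for x in v[-3:])

def four_of_five(v):
    """Signal if at least four of the five most recent statistics exceed 1."""
    return len(v) >= 5 and sum(x > 1 for x in v[-5:]) >= 4

Applied to the V_MSSD column of Table 6 below, the 3-of-3 rule first signals at observation 24 and the 4-of-5 rule at observation 25, matching the example discussed later.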
In addition to these tests, the EWMA chart computed from a sequence of the V_MSSD statistics is also considered. The EWMA chart is defined as follows:

Z_MSSD,m = λ V_MSSD,m + (1 − λ) Z_MSSD,m−1, m = a, a + 1, ...   (7)

where Z_MSSD,a−1 = 0 and a is an integer representing the starting point of the monitoring of a process. The UCL of an EWMA chart is K √(λ/(2 − λ)), where λ is the smoothing constant and K is the control limit constant. For the simulation study in this paper, the values of (λ, K) used are (0.25, 2.90), which gives UCL = 1.096, i.e., similar to that in Ref. 1.

Evaluating the Performance of the Enhanced Short Run Multivariate Chart

A simulation study is performed using SAS version 8 to study the performance of the enhanced short run multivariate chart for individual measurements. To enable a comparison to be made between the performance of the new short run chart and the chart proposed in Ref. 1, the simulation study of the new bivariate chart is conducted under the same conditions as those of Ref. 1. The on-target mean vector is μ_0 = (0, 0)′, the shifted mean vector is μ_S = (δ, 0)′, and the in-control covariance matrix Σ_0 has unit variances and correlation ρ (ρ = 0 and ρ = 0.5 are considered, see Tables 2-5).
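A minimal sketch of the EWMA scheme in Eq. (7) under the (λ, K) = (0.25, 2.90) configuration quoted above; the function name and return convention are illustrative assumptions.

# Illustrative sketch of Eq. (7); UCL = K * sqrt(lambda / (2 - lambda)) = 1.096.
import math

def ewma_signals(v, lam=0.25, k=2.90):
    """Return the indices (within v) at which the EWMA exceeds its UCL."""
    ucl = k * math.sqrt(lam / (2.0 - lam))
    z, out = 0.0, []                     # Z_{a-1} = 0
    for i, x in enumerate(v):
        z = lam * x + (1.0 - lam) * z
        if z > ucl:
            out.append(i)
    return out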
Table 2. Simulation Results of the Enhanced Short Run Multivariate Chart for Cases KU and UU based on μ_0 = (0, 0)′, μ_S = (δ, 0)′ and ρ = 0.
c = 10
1-of-1 3-of-3 4-of-5
EWMA
c = 20
1-of-1 3-of-3 4-of-5
EWMA
c = 50
1-of-1 3-of-3 4-of-5
EWMA
μ_S = (δ, 0)′
0.0 KU UU KU UU KU UU KU UU KU UU KU UU KU UU KU UU KU UU 0.039 0.036 0.055 0.040 0.111 0.049 0.225 0.064 0.409 0.091 0.611 0.126 0.787 0.173 0.965 0.292 0.998 0.423 0.152 0.156 0.220 0.171 0.423 0.221 0.721 0.308 0.919 0.434 0.986 0.574 0.998 0.718 1.000 0.910 1.000 0.980 0.111 0.111 0.169 0.126 0.360 0.168 0.681 0.247 0.910 0.371 0.986 0.516 0.999 0.678 1.000 0.897 1.000 0.981 0.116 0.118 0.173 0.133 0.394 0.174 0.739 0.261 0.947 0.387 1.000 0.534 1.000 0.681 1.000 0.897 1.000 0.978 0.032 0.035 0.049 0.037 0.123 0.063 0.277 0.112 0.510 0.189 0.740 0.293 0.894 0.430 0.992 0.695 1.000 0.883 0.130 0.123 0.194 0.149 0.422 0.239 0.746 0.396 0.943 0.611 0.994 0.799 1.000 0.927 1.000 0.996 1.000 1.000 0.086 0.079 0.130 0.100 0.343 0.171 0.703 0.329 0.931 0.550 0.994 0.769 0.999 0.914 1.000 0.995 1.000 1.000 0.087 0.087 0.140 0.108 0.396 0.189 0.779 0.362 0.970 0.609 0.998 0.815 1.000 0.939 1.000 0.998 1.000 1.000 0.040 0.037 0.070 0.054 0.167 0.114 0.390 0.240 0.665 0.431 0.882 0.660 0.974 0.849 1.000 0.988 1.000 1.000 0.113 0.113 0.187 0.158 0.440 0.305 0.790 0.578 0.972 0.841 0.999 0.969 1.000 0.997 1.000 1.000 1.000 1.000 0.066 0.063 0.121 0.096 0.352 0.228 0.746 0.505 0.968 0.813 0.998 0.968 1.000 0.998 1.000 1.000 1.000 1.000 0.064 0.069 0.126 0.102 0.420 0.266 0.846 0.594 0.991 0.893 1.000 0.989 1.000 1.000 1.000 1.000 1.000 1.000
4.0 5.0
Table 3. Simulation Results of the Enhanced Short Run Multivariate Chart for Cases KU and UU based on μ_0 = (0, 0)′, μ_S = (δ, 0)′ and ρ = 0.5.
c = 10
1-of-1 3-of-3 4-of-5
EWMA
c = 20
1-of-1 3-of-3 4-of-5
EWMA
c = 50
1-of-1 3-of-3 4-of-5
EWMA
μ_S = (δ, 0)′
0.0 KU UU KU UU KU UU KU UU KU UU KU UU KU UU KU UU KU UU 0.039 0.036 0.063 0.040 0.141 0.055 0.304 0.078 0.525 0.112 0.750 0.157 0.894 0.217 0.994 0.371 1.000 0.537 0.152 0.156 0.238 0.179 0.513 0.245 0.826 0.364 0.971 0.522 0.998 0.679 1.000 0.822 1.000 0.962 1.000 0.996 0.111 0.111 0.189 0.132 0.459 0.191 0.805 0.307 0.968 0.466 0.997 0.639 1.000 0.796 1.000 0.958 1.000 0.995 0.116 0.118 0.191 0.137 0.499 0.198 0.859 0.317 0.988 0.478 1.000 0.645 1.000 0.793 1.000 0.959 1.000 0.996 0.032 0.035 0.058 0.040 0.164 0.076 0.382 0.144 0.648 0.255 0.864 0.404 0.965 0.556 0.999 0.826 1.000 0.959 0.130 0.123 0.214 0.163 0.520 0.281 0.863 0.498 0.988 0.734 0.999 0.900 1.000 0.976 1.000 1.000 1.000 1.000 0.086 0.079 0.149 0.113 0.447 0.216 0.842 0.429 0.985 0.692 0.999 0.888 1.000 0.974 1.000 1.000 1.000 1.000 0.087 0.087 0.166 0.118 0.519 0.237 0.901 0.484 0.995 0.744 1.000 0.924 1.000 0.985 1.000 1.000 1.000 1.000 0.040 0.037 0.078 0.061 0.227 0.148 0.518 0.322 0.821 0.572 0.961 0.810 0.997 0.941 1.000 0.999 1.000 1.000 0.113 0.113 0.210 0.171 0.546 0.387 0.900 0.709 0.995 0.932 1.000 0.994 1.000 1.000 1.000 1.000 1.000 1.000 0.066 0.063 0.138 0.108 0.462 0.295 0.885 0.662 0.996 0.923 1.000 0.995 1.000 1.000 1.000 1.000 1.000 1.000 0.064 0.069 0.157 0.120 0.571 0.359 0.949 0.760 0.999 0.973 1.000 0.999 1.000 1.000 1.000 1.000 1.000 1.000
2.0 2.5
Table 4. Simulation Results of the Short Run Multivariate Chart in Ref. 1 for Cases KU and UU based on μ_0 = (0, 0)′, μ_S = (δ, 0)′ and ρ = 0.
c = 10
1-of-1 3-of-3 4-of-5
EWMA
c = 20
1-of-1 3-of-3 4-of-5
EWMA
c = 50
1-of-1 3-of-3 4-of-5
EWMA
μ_S = (δ, 0)′
0.0 KU UU KU UU KU UU KU UU KU UU KU UU KU UU KU UU KU UU 0.041 0.040 0.048 0.041 0.052 0.043 0.056 0.041 0.069 0.049 0.096 0.064 0.131 0.096 0.268 0.194 0.473 0.355 0.102 0.103 0.120 0.100 0.178 0.112 0.253 0.128 0.340 0.164 0.434 0.215 0.522 0.269 0.663 0.372 0.747 0.448 0.052 0.052 0.069 0.054 0.110 0.062 0.172 0.074 0.248 0.104 0.337 0.145 0.425 0.184 0.561 0.258 0.652 0.304 0.038 0.039 0.056 0.040 0.093 0.051 0.157 0.065 0.247 0.091 0.342 0.133 0.442 0.181 0.605 0.292 0.730 0.397 0.037 0.039 0.049 0.040 0.072 0.052 0.093 0.067 0.132 0.096 0.193 0.151 0.290 0.232 0.569 0.484 0.832 0.769 0.103 0.100 0.133 0.106 0.233 0.143 0.387 0.216 0.558 0.329 0.713 0.468 0.833 0.611 0.949 0.804 0.984 0.900 0.046 0.049 0.073 0.053 0.149 0.084 0.286 0.141 0.469 0.241 0.650 0.381 0.789 0.528 0.933 0.733 0.980 0.851 0.044 0.041 0.066 0.049 0.151 0.080 0.321 0.148 0.536 0.270 0.741 0.428 0.882 0.603 0.984 0.854 0.999 0.957 0.042 0.038 0.057 0.051 0.113 0.087 0.184 0.144 0.292 0.233 0.445 0.368 0.617 0.539 0.914 0.873 0.996 0.987 0.103 0.101 0.153 0.131 0.312 0.225 0.581 0.417 0.821 0.652 0.949 0.841 0.991 0.947 1.000 0.996 1.000 1.000 0.056 0.050 0.088 0.070 0.221 0.154 0.493 0.320 0.785 0.585 0.943 0.809 0.991 0.942 1.000 0.997 1.000 1.000 0.042 0.043 0.088 0.069 0.263 0.167 0.617 0.393 0.903 0.713 0.991 0.921 1.000 0.991 1.000 1.000 1.000 1.000
0.5 1.0
Table 5. Simulation Results of the Short Run Multivariate Chart in Ref. 1 for Cases KU and UU based on μ_0 = (0, 0)′, μ_S = (δ, 0)′ and ρ = 0.5.
=0 s = (,0) 0.0 KU UU KU UU KU UU KU UU KU UU KU UU KU UU KU UU KU UU c = 10
1-of-1 3-of-3 4-of-5
EWMA
c = 20
1-of-1 3-of-3 4-of-5
EWMA
c = 50
1-of-1 3-of-3 4-of-5
EWMA
0.041 0.040 0.047 0.042 0.054 0.042 0.061 0.047 0.085 0.062 0.127 0.091 0.187 0.139 0.394 0.293 0.653 0.518
0.102 0.103 0.124 0.102 0.196 0.115 0.286 0.139 0.399 0.182 0.501 0.250 0.590 0.317 0.724 0.424 0.801 0.489
0.052 0.052 0.072 0.055 0.124 0.068 0.202 0.087 0.308 0.119 0.402 0.167 0.490 0.217 0.626 0.288 0.700 0.325
0.038 0.039 0.059 0.041 0.120 0.050 0.199 0.077 0.305 0.121 0.421 0.173 0.527 0.229 0.686 0.354 0.802 0.473
0.037 0.039 0.052 0.041 0.077 0.056 0.109 0.079 0.171 0.126 0.269 0.218 0.418 0.341 0.751 0.678 0.944 0.909
0.103 0.100 0.141 0.115 0.274 0.165 0.465 0.266 0.650 0.416 0.804 0.578 0.900 0.717 0.977 0.883 0.995 0.949
0.046 0.049 0.082 0.063 0.190 0.098 0.374 0.181 0.588 0.325 0.769 0.495 0.884 0.645 0.970 0.831 0.993 0.911
0.044 0.041 0.082 0.049 0.201 0.097 0.428 0.199 0.679 0.364 0.857 0.564 0.951 0.733 0.996 0.935 1.000 0.989
0.042 0.038 0.065 0.054 0.129 0.098 0.234 0.181 0.387 0.314 0.589 0.508 0.789 0.719 0.981 0.965 1.000 0.999
0.103 0.101 0.166 0.144 0.391 0.281 0.700 0.527 0.916 0.785 0.984 0.927 0.998 0.979 1.000 1.000 1.000 1.000
0.056 0.050 0.101 0.079 0.295 0.197 0.638 0.440 0.903 0.744 0.985 0.922 0.998 0.983 1.000 1.000 1.000 1.000
0.042 0.043 0.102 0.079 0.355 0.217 0.789 0.553 0.976 0.870 0.999 0.983 1.000 0.999 1.000 1.000 1.000 1.000
0.5 1.0
Table 6. VMSSD and V Statistics for Case UU.

n     X1       X2       Vn       VMSSD,n
1     1.404    0.268    –        –
2     0.624    1.392    –        –
3     0.454    0.755    –        –
4     -1.768   -1.902   1.162    –
5     -0.224   0.140    -1.452   -1.650
6     -0.082   0.734    -0.585   -1.214
7     1.146    0.484    -0.190   -0.327
8     1.816    0.906    0.222    0.058
9     -1.245   -1.555   0.482    0.296
10    -0.976   -0.340   -0.199   0.023
11    -0.621   -1.058   -0.266   -0.393
12    -0.080   -0.710   -0.507   -0.800
13    0.742    -0.146   -0.202   0.042
14    -0.543   -0.818   -0.824   -0.654
15    -2.335   -2.801   1.437    1.507
16    -0.848   -1.176   -0.808   -0.415
17    -0.431   0.590    0.836    0.742
18    1.369    1.863    0.769    0.955
19    0.283    0.197    -1.659   -1.405
20    0.850    0.149    -0.155   0.028
21    0.819    -0.277   0.395    0.580
22    1.706    0.564    0.780    1.085
23    1.198    -1.313   2.181    2.434
24    2.863    0.211    2.049    2.737
25    2.141    0.438    0.545    1.657
26    1.823    0.474    -0.023   0.987
27    1.609    0.414    -0.366   0.630
28    2.811    2.192    1.191    1.650
29    0.170    -0.650   -0.987   -0.676
30    -0.776   -1.186   -0.193   0.347
31    -0.111   -0.613   -1.216   -0.838
32    1.400    0.302    -0.656   0.313
33    1.584    0.337    -0.403   0.609
34    2.047    0.585    0.080    1.203
35    0.481    0.690    -0.153   0.667
36    3.773    2.495    1.693    2.545
37    1.891    1.871    0.673    1.256
38    2.169    1.073    -0.160   0.420
39    1.761    1.191    -0.400   0.049
40    1.184    -0.113   -0.531   0.132

Note. Entries marked – are not defined for case UU (Vn starts at n = p + 2 = 4 and VMSSD,n at n = 2p + 1 = 5).
Figure 1. Plot of the VMSSD,n statistics against observation number.

Figure 2. Plot of the Vn statistics against observation number.
test in Tables 2, 3, 4 and 5 are almost the same. The results also show that the performance of the enhanced chart based on the basic 1-of-1 rule is superior to the chart proposed in Ref. 1. An Example of Application An example will be given to show how the proposed enhanced short run multivariate chart is put to work. To simulate an in-control process, 20 bivariate observations are generated using SAS version 8 from a N 2 ( 0 , 0 ) distribution. For an o.o.c. process, with a shift in the mean vector, the next 20 bivariate observations are generated from a N 2 ( S , 0 ) distribution.
Here, μ_0 = (0, 0)′, the in-control covariance matrix Σ_0 has unit variances and correlation ρ = 0.8, and μ_S is the shifted (out-of-control) mean vector, with a shift of size 1.3. The 40 bivariate observations generated are substituted into Eqs. (6a) and (6b) to compute the corresponding V_MSSD statistics for case UU. Similarly, these 40 observations are substituted into Eq. (4) to compute the corresponding V statistics for case UU. The computed V and V_MSSD statistics are summarized in Table 6. Figures 1 and 2 show the plotted V_MSSD and V statistics respectively. For the enhanced chart based on the V_MSSD statistics, the 3-of-3 test signals an o.o.c. at observation 24 while the 4-of-5 test signals at observation 25. The chart proposed in Ref. 1 based on the V statistics fails to detect a shift in the mean vector.

Conclusion

It is shown in this paper that the enhanced chart based on a robust estimator of scale, i.e., S_MSSD, gives excellent improvement over the existing short run multivariate chart proposed in Khoo & Quah (2002). The proofs of how the V_MSSD statistics for cases KU and UU are derived are shown in the Appendix.
References

Holmes, D. S., & Mergen, A. E. (1993). Improving the performance of the T² control chart. Quality Engineering, 5(4), 619-625.

Khoo, M. B. C., & Quah, S. H. (2002). Proposed short runs multivariate control charts for the process mean. Quality Engineering, 14(4), 603-621.

Seber, G. A. F. (1984). Multivariate observations. New York: John Wiley and Sons.
Appendix

In this section, it will be shown that the V_MSSD statistics in Eqs. (5a), (5b), (6a) and (6b) are N(0, 1) random variables. All the notations used here are already defined in the earlier sections. The following theorems taken from Seber (1984) are used:

Theorem A. Suppose that y ~ N_p(0, Σ), W ~ W_p(n, Σ), and y and W are statistically independent. Assume that the distributions are nonsingular, i.e., Σ > O, and n ≥ p, so that W^{-1} exists with probability 1. Let

T² = n y′ W^{-1} y;   (A1)

then

[(n − p + 1)/(pn)] T² ~ F_{p, n−p+1}.   (A2)

Theorem B. Suppose that X_1, X_2, ..., X_n are independently and identically distributed (i.i.d.) as N_p(0, Σ); then

Σ_{i=1}^{n} X_i X_i′ ~ W_p(n, Σ),   (A3)

where W_p(n, Σ) is the Wishart distribution with n degrees of freedom.

Equation (5a): Case KU
We need to show that, for odd numbered observations, i.e., when n is an odd number,

T²_MSSD,n = (X_n − μ_0)′ S_MSSD,n−1^{-1} (X_n − μ_0) ~ [2p/(n − 2p + 1)] F_{p,(n−2p+1)/2} for n > 2p − 1, i.e., n = 2p+1, 2p+3, ....

Proof: If X_j, j = 1, 2, 3, ..., are i.i.d. N_p(μ, Σ), then (X_i − X_{i−1})/√2 ~ N_p(0, Σ), i = 2, 4, 6, ..., so that, by Theorem B,

S_MSSD,n−1 ~ W_p((n − 1)/2, Σ).   (A4)

Because μ = μ_0 is known,

X_n − μ_0 ~ N_p(0, Σ).   (A5)

Define T²_MSSD,n = (X_n − μ_0)′ S_MSSD,n−1^{-1} (X_n − μ_0); then, substituting (A4) and (A5) into (A1) and (A2) of Theorem A,

[((n−1)/2 − p + 1)/(p (n−1)/2)] · ((n−1)/2) (X_n − μ_0)′ S_MSSD,n−1^{-1} (X_n − μ_0) ~ F_{p, (n−1)/2 − p + 1},

i.e.,

[(n − 2p + 1)/(2p)] (X_n − μ_0)′ S_MSSD,n−1^{-1} (X_n − μ_0) ~ F_{p, (n−2p+1)/2}.

Equation (5b): Case KU
We need to show that, for even numbered observations, i.e., when n is an even number,

T²_MSSD,n = (X_n − μ_0)′ S_MSSD,n−2^{-1} (X_n − μ_0) ~ [2p/(n − 2p)] F_{p,(n−2p)/2}, i.e., n = 2p+2, 2p+4, ....

Proof: As before, (X_i − X_{i−1})/√2 ~ N_p(0, Σ), i = 2, 4, 6, ..., so that

S_MSSD,n−2 ~ W_p((n − 2)/2, Σ),   (A6)

and because μ = μ_0 is known,

X_n − μ_0 ~ N_p(0, Σ).   (A7)

Define T²_MSSD,n = (X_n − μ_0)′ S_MSSD,n−2^{-1} (X_n − μ_0); then, substituting (A6) and (A7) into (A1) and (A2) of Theorem A,

[((n−2)/2 − p + 1)/(p (n−2)/2)] · ((n−2)/2) (X_n − μ_0)′ S_MSSD,n−2^{-1} (X_n − μ_0) ~ F_{p, (n−2)/2 − p + 1},

i.e.,

[(n − 2p)/(2p)] (X_n − μ_0)′ S_MSSD,n−2^{-1} (X_n − μ_0) ~ F_{p, (n−2p)/2}.

Equation (6a): Case UU
We need to show that, for odd numbered observations, i.e., when n is an odd number,

T²_MSSD,n = (X_n − X̄_{n−1})′ S_MSSD,n−1^{-1} (X_n − X̄_{n−1}) ~ [2np/((n − 2p + 1)(n − 1))] F_{p,(n−2p+1)/2}, n = 2p+1, 2p+3, ....

Proof: As before,

S_MSSD,n−1 ~ W_p((n − 1)/2, Σ).   (A8)

Because μ is unknown, X̄_{n−1} ~ N_p(μ, (1/(n−1)) Σ), so that X_n − X̄_{n−1} ~ N_p(0, [1 + 1/(n−1)] Σ), i.e.,

√((n−1)/n) (X_n − X̄_{n−1}) ~ N_p(0, Σ).   (A9)

Define T²_MSSD,n = (X_n − X̄_{n−1})′ S_MSSD,n−1^{-1} (X_n − X̄_{n−1}); then, substituting (A8) and (A9) into (A1) and (A2) of Theorem A,

[((n−1)/2 − p + 1)/(p (n−1)/2)] · ((n−1)/2) · ((n−1)/n) (X_n − X̄_{n−1})′ S_MSSD,n−1^{-1} (X_n − X̄_{n−1}) ~ F_{p, (n−1)/2 − p + 1},

i.e.,

[(n − 2p + 1)(n − 1)/(2np)] (X_n − X̄_{n−1})′ S_MSSD,n−1^{-1} (X_n − X̄_{n−1}) ~ F_{p, (n−2p+1)/2}.

Equation (6b): Case UU
We need to show that, for even numbered observations, i.e., when n is an even number,

T²_MSSD,n = (X_n − X̄_{n−1})′ S_MSSD,n−2^{-1} (X_n − X̄_{n−1}) ~ [2np/((n − 2p)(n − 1))] F_{p,(n−2p)/2}, n = 2p+2, 2p+4, ....

Proof: As before,

S_MSSD,n−2 ~ W_p((n − 2)/2, Σ),   (A10)

and because μ is unknown, X̄_{n−1} ~ N_p(μ, (1/(n−1)) Σ) and X_n − X̄_{n−1} ~ N_p(0, [n/(n−1)] Σ), i.e.,

√((n−1)/n) (X_n − X̄_{n−1}) ~ N_p(0, Σ).   (A11)

Define T²_MSSD,n = (X_n − X̄_{n−1})′ S_MSSD,n−2^{-1} (X_n − X̄_{n−1}); then, substituting (A10) and (A11) into (A1) and (A2) of Theorem A,

[((n−2)/2 − p + 1)/(p (n−2)/2)] · ((n−2)/2) · ((n−1)/n) (X_n − X̄_{n−1})′ S_MSSD,n−2^{-1} (X_n − X̄_{n−1}) ~ F_{p, (n−2)/2 − p + 1},

i.e.,

[(n − 2p)(n − 1)/(2np)] (X_n − X̄_{n−1})′ S_MSSD,n−2^{-1} (X_n − X̄_{n−1}) ~ F_{p, (n−2p)/2}.
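As a supplement to these proofs (not part of the original appendix), the distributional claim of Theorem A can be checked numerically; the sketch below simulates y and W for p = 2 and m = 15 (both values chosen arbitrarily) and compares the scaled T² values to the stated F distribution with a Kolmogorov-Smirnov test.

# Numerical check of Theorem A: (m - p + 1)/(p m) * T^2 should follow F(p, m - p + 1).
import numpy as np
from scipy.stats import f, kstest

rng = np.random.default_rng(1)
p, m, reps = 2, 15, 5000
sigma = np.array([[1.0, 0.5], [0.5, 1.0]])

scaled_t2 = []
for _ in range(reps):
    y = rng.multivariate_normal(np.zeros(p), sigma)
    z = rng.multivariate_normal(np.zeros(p), sigma, size=m)
    w = z.T @ z                                   # Wishart W_p(m, Sigma) by Theorem B
    t2 = m * y @ np.linalg.inv(w) @ y
    scaled_t2.append((m - p + 1) / (p * m) * t2)

print(kstest(scaled_t2, f(p, m - p + 1).cdf))     # should not reject F(p, m - p + 1)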
Journal of Modern Applied Statistical Methods May, 2005, Vol. 4, No.1, 240-250
An Empirical Evaluation Of The Retrospective Pretest: Are There Advantages To Looking Back?
Paul A. Nakonezny
Center for Biostatistics and Clinical Science
University of Oklahoma
This article builds on research regarding response shift effects and retrospective self-report ratings. Results suggest moderate evidence of a response shift bias in the conventional pretest-posttest treatment design in the treatment group. The use of explicitly worded anchors on response scales, as well as the measurement of knowledge ratings (a cognitive construct) in an evaluation methodology setting, helped to mitigate the magnitude of a response shift bias. The retrospective pretest-posttest design provides a measure of change that is more in accord with the objective measure of change than is the conventional pretest-posttest treatment design, for the setting and experimental conditions used in the present study. Key words: Response shift bias, quasi-experimentation, retrospective pretest-posttest design, retrospective pretest, measuring change
Introduction

More than 30 years after Cronbach and Furby (1970) posited their compelling question, "How we should measure change--or should we?", the properties of the change score continue to attract much attention in educational and psychological measurement. Self-report evaluations are frequently used to measure change in treatment and educational training interventions. In using self-report instruments, it is assumed that a subject's understanding of the standard of measurement for the dimension being measured will not change from pretest to posttest (Cronbach & Furby, 1970).
Paul A. Nakonezny is an Assistant Professor of Biostatistics in the Center for Biostatistics and Clinical Science at the University of Texas Southwestern Medical Center, 6363 Forest Park Rd., Suite 651, Dallas, TX 75235. Email: [email protected]. Joseph Lee Rodgers is a Professor of Psychology in the Department of Psychology at the University of Oklahoma, Norman, OK, 73019.
If the standard of measurement is not comparable between the pretest and posttest scores, however, then self-report evaluations in pretest-posttest treatment designs may be contaminated by a response shift bias (Howard & Dailey, 1979; Howard, Ralph, Gulanick, Maxwell, Nance, & Gerber, 1979; Maxwell & Howard, 1981). A response shift becomes a bias if the experimental intervention changes the subject's internal evaluation standard for the dimension measured and, hence, changes the subject's interpretation of the anchors of a response scale. When a response shift is presumably a result of the treatment, a treatment-induced response shift bias should occur in the treatment group and not in the control group. However, another possible source of contamination in response shifts, for both the treatment and control groups, is exposure to the conventional pretest, which could have a priming effect and confounding influence on subsequent self-report ratings (Hoogstraten, 1982; Spranger & Hoogstraten, 1989). A response shift, nevertheless, results in different scale units (metrics) at the posttest than at the pretest, which could produce systemic errors of measurement that threaten evaluation of the basic treatment effect.
objective measurement that is rooted in the philosophy of logical positivism (an epistemology in the social sciences that views subjective measures as obstacles toward an objective science of measurement). Second, retrospective self-reports are susceptible to a response-style bias (e.g., memory distortion, subjects' current attitudes and moods, subject acquiescence, social desirability), which could presumably affect ratings in both the treatment and control groups. Nonetheless, in self-report pretest-posttest treatment designs, previous psychometric research has demonstrated empirical support for the retrospective pretest-posttest difference scores over the traditional pretest-posttest change scores in providing an index of change more in agreement with objective measures of change on both cognitive and behavioral dimensions (e.g., Hoogstraten, 1982; Howard & Dailey, 1979; Howard, Millham, Slaten, & O'Donnell, 1981; Howard, Ralph, Gulanick, Maxwell, Nance, & Gerber, 1979; Howard, Schmeck, & Bray, 1979; Sprangers & Hoogstraten, 1989). The purpose of this article is to build on a previous line of research, by Howard and colleagues and Hoogstraten and Sprangers, on response shift effects and retrospective self-report ratings. Specifically, the current study examined (a) response shift bias in the self-report pretest-posttest treatment design in an evaluation setting, (b) the validity of the retrospective pretest-posttest design in estimating treatment effects, (c) the effect of memory distortion on retrospective self-report pretests, and (d) the effect of pretesting on subsequent and retrospective self-report ratings.

Methodology

A cross-sectional quasi-experimental pre-post treatment design (Cook & Campbell, 1979) with data from 240 participants was used to address the research objectives of this study. The design included a treatment group and a no-treatment comparison group. Participants in the treatment group were 124 students enrolled in an undergraduate epidemiology course (Class A) and participants in the no-treatment comparison group were 116 students enrolled in an
undergraduate health course (Class B). The 240 participants were undergraduate students who attended a large public university in the state of Texas during the Spring semester of 2002 and who met the following criteria for inclusion in the study: (a) at least 18 years of age, (b) must not have taken an epidemiology course or a course that addressed infectious disease epidemiology, and (c) must not have been concurrently enrolled in Class A and Class B. Participants signed a consent form approved by the Institutional Review Board of the University and received bonus class points for participating. The gender composition was 29 males and 211 females, and the age range was 18 to 28 years (with an average age of 20.61 years, SD = 2.46). The racial distribution of the study sample included 181 (75.4%) Caucasians, 37 (15.4%) African Americans, 13 (5.4%) Hispanics, and 9 (3.8%) Asians. Participant characteristics by group are reported in Table 1. The treatment in this design was a series of lectures on infectious disease epidemiology that was part of the course content in Class A, but not in Class B. Participants' knowledge of infectious disease epidemiology, the basic construct in this study, was measured with a one-item self-report instrument and with a ten-item objective instrument, and the same item-scale instruments were used for both the treatment and no-treatment comparison groups. Each instrument was operationalized as the mean of the items measuring each scale, and was scored so that a higher score equaled more knowledge of infectious disease epidemiology. The conventional self-report instrument, which was used in both the pretest and posttest measurement settings, consisted of one item that asked participants to respond to the following question: How much do you know about the principles of Infectious Disease Epidemiology? The current study measured this one item using a six-point Likert-type scale that ranged from 0 (not much at all) to 5 (very very much), with verbal labels for the intermediate scale points. The retrospective self-report pretest, which was similar to the conventional self-report
Note. a: The F statistic was used to test for mean age differences between the treatment group and the no-treatment comparison group. b: The chi-square statistic was used to test for differences between the treatment group and the no-treatment comparison group on gender, race, and classification, respectively.
group and participants in the no-treatment comparison group (who were not exposed to the treatment) completed the objective posttest. The objective posttest was identical to the objective pretest. One week after completion of the objective posttest (time 3), participants in both the treatment and no-treatment comparison groups completed the self-report posttest and the retrospective self-report pretest. Participants first completed the self-report posttest and, while keeping the posttest in front of them, they then filled out the retrospective self-report pretest.
The self-report posttest was identical to the conventional self-report pretest. The retrospective self-report pretest was similar to the conventional self-report pretest, but the wording of the question accounted for the retrospective time frame. Lastly, about one month after completion of the self-report posttest and retrospective self-report pretest, at the end of the academic semester (time 4), participants in both the treatment and no-treatment comparison groups completed the recalled self-report pretest, which permitted a memory test of the initial/conventional self-report pretest completed at the outset of the academic semester (time 1)
and, thus, yielded a test for a response-style bias of the retrospective self-report pretest rating. The recalled self-report pretest consisted of one item that asked participants to respond to the following question: Four months ago, at the beginning of the semester, you were asked how much you knew about Infectious Disease Epidemiology. Please recall, remember, and be as accurate as possible, how you responded at that time regarding your knowledge level of Infectious Disease Epidemiology (i.e., how did you respond at that time?). The current study measured this one item using a six-point Likert-type scale similar to that described above. The research objectives of this study were addressed by analyzing the series of pretest and posttest ratings using the dependent t test, the Pearson product-moment correlation (r), and analysis of variance (ANOVA). Estimates of the magnitude of the effect size were also computed (Rosenthal, Rosnow, & Rubin, 2000). The effect size estimators that accompanied the dependent t test and the ANOVA were Cohen's (1988) d and eta-square (η²), respectively. The Pearson product-moment correlation (r) was also used as the effect size estimator in the specific regression analyses. To test the response shift hypothesis, the dependent t test was carried out comparing the retrospective self-report pretest to the conventional self-report pretest within the treatment and no-treatment comparison groups. The dependent t test also was used to compare the recalled self-report pretest to the conventional self-report pretest, which tested for the effect of memory distortion in the retrospective pretest-posttest design. The Pearson correlation between the recalled self-report pretest and the conventional self-report pretest and between the recalled self-report pretest and the retrospective self-report pretest also was used to test for memory distortion. To examine the relative validity of the retrospective pretest-posttest design in estimating treatment effects, a simple correlation analysis was further used to assess the relationship between the self-reported measures of change and the objective measure of change in both the conventional and retrospective pretest-posttest designs.
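For readers who wish to reproduce this style of analysis, the sketch below (with made-up ratings, not the study data) shows a dependent t test, a paired-data version of Cohen's d, and the Pearson correlation in SciPy; the paired d used here (mean difference divided by the standard deviation of the differences) is one common convention and may differ from the article's exact computation.

# Illustrative sketch with hypothetical data; not the study's analysis code.
import numpy as np
from scipy import stats

conv_pre = np.array([1, 2, 1, 0, 3, 2, 1, 2, 1, 0])    # conventional self-report pretest
retro_pre = np.array([1, 1, 0, 0, 2, 2, 1, 1, 0, 0])   # retrospective self-report pretest

t, p_value = stats.ttest_rel(retro_pre, conv_pre)       # dependent (paired) t test
diff = retro_pre - conv_pre
d = diff.mean() / diff.std(ddof=1)                      # Cohen's d for paired scores
r = stats.pearsonr(retro_pre, conv_pre)                 # Pearson product-moment correlation
print(t, p_value, d, r)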
Objective
Pretest Posttest
1.10 1.01
2.34 0.81
0.86 0.87
1.13 0.91
1.82 0.76
4.06 0.98
0.99 1.14
2.21 0.92
0.91 0.80
1.30 0.95
3.48 1.01
2.43 0.77
1.03 0.93
1.26 0.86
3.66 0.93
No-Treatment Comparison Group (n = 116) Self-Report Pretest Condition Condition 1 M SD Condition 2 M SD Condition 3 M SD Condition 4 M SD
1.09 0.98 0.71 0.86 0.99 0.89 1.68 0.56 1.67 0.67 Pretest Posttest Retro Recalled
Objective
Pretest Posttest
0.79 0.82
0.86 0.87
0.52 0.78
0.83 0.85
1.67 0.66
1.50 0.79
1.07 0.84
1.19 0.75
0.73 0.87
1.03 0.91
1.82 0.61
1.43 0.89
0.67 0.71
0.83 0.79
1.55 0.72
Note. Retro = retrospective self-report pretest; Recalled = recalled self-report pretest (used to test for the threat of memory distortion). Participants in condition 1 completed both the self-report and objective pretests; Participants in condition 2 completed the objective pretest; Participants in condition 3 completed the selfreport pretest; Participants in condition 4 completed neither the self-report pretest nor the objective pretest. All participants, regardless of the assigned condition, completed the posttests as well as the retrospective and recalled self-report pretests. The sample size per condition by group was approximately equal.
change score with the objective pre/post measure of change (r = .26, p < .18). Conversely, as anticipated, for the notreatment comparison group averaged across conditions 1 and 2, the magnitude of the correlation between the conventional pre/post self-report change score and the objective pre/post measure of change, r = .27, p < .16, was greater than the correlation between the retrospective pre/post self-report change score and the objective pre/post change score, r = .04, p < .75, albeit neither was significant. Memory Distortion The effect of memory distortion within the retrospective pretest-posttest design was also examined. For the treatment group, averaged across conditions 1 and 3, the results of the dependent t test revealed no significant mean difference between the recalled self-report pretest (M = 1.22, SD = .93) and the conventional self-report pretest (M = 1.05, SD = 1.07), t(61) = 1.56, p < .12, M = .17, SD = .89, d = 0.19 (Table 2). Further, the no-treatment comparison group had nearly identical average scores on the recalled self-report pretest (M = .933, SD = .882) and the conventional self-report pretest (M = .935, SD = .832), averaged across conditions 1 and 3, suggesting no significant mean difference, t(54) = -0.01, p < .99, M = -0.002, SD = .85, d = -0.002 (Table 2). The dependent t test results suggest no significant presence of memory distortion in the retrospective pretest-posttest treatment design. A simple correlation analysis also was used to test for memory distortion. The Pearson correlations between the recalled pre/post selfreport change score and the conventional pre/post self-report change score, averaged across conditions 1 and 3, and between the recalled pre/post self-report change score and the retrospective pre/post self-report change score, averaged across all four conditions, were significant and reasonably high in the treatment group (r = .64 and r = .63, respectively, ps <.0001) and in the no-treatment comparison group (r = .54 and r = .56, respectively, ps <.0001). Further, the Pearson correlations between the recalled self-report pretest and the conventional self-report pretest, averaged across
Treatment Effects in the Retrospective Pre/Post Design The principal focus of the current study was to evaluate the validity of the retrospective pretest-posttest design in estimating treatment effects. The findings of the present study favor the retrospective pre/post self-report measure of change in providing a measure of self-reported change that better reflects the objective index of change on a construct of knowledge rating. This finding is in line with previous psychometric research (e.g., Hoogstraten, 1982; Howard & Dailey, 1979; Howard et al., 1979; Howard, Schmeck, & Bray, 1979; Spranger & Hoogstraten, 1989), and is most likely a result of the self-report posttest and the retrospective selfreport pretest being filled out with respect to the same internal standard, the same metric. This, therefore, mitigates the treatment-induced response shift bias, minimizes errors of measurement, and provides an unconfounded and unbiased estimate of the treatment effect (Howard et al., 1979). Although there is empirical support for the retrospective pretest-posttest difference scores over the conventional pretest-posttest change scores in providing an index of change more in agreement with objective measures of change, this is not to suggest that the conventional self-report pretest should be substituted by the retrospective self-report rating. Rather, in light of the findings of this study as well as those from previous studies, the suggestion put forward is that retrospective selfreport pretests could be used in at least three evaluation research settings: (a) to test for and attenuate a response shift bias in the conventional pretest-posttest treatment design, (b) when conventional pretest data or concurrent data are not available, or (c) when researchers want to measure change on dimensions not included in earlier-wave longitudinal data. Testing for Threats to Validity Also evaluated were the potential threats of memory distortion and pretesting effect to the internal validity of the retrospective pretestposttest treatment design in the current study.
Retrospective self-report ratings could be limited by memory lapses and pretests could exert a confounding influence on subsequent self-report ratings, including retrospective ratings, which could threaten evaluation of the treatment effect (Collins et al., 1985; Howard & Dailey, 1979; Sprangers & Hoogstraten, 1989). In general, the present study found no significant presence of memory distortion or a pretesting effect in the retrospective pretest-posttest treatment design used in the current study. This is not to suggest that memory distortion or a pretesting effect should not be accounted for as potential threats to the basic retrospective pretest-posttest design. Rather, what this finding suggests is that memory distortion and pretesting are not influencing the interpretation of the treatment effect in the type of retrospective pretest-posttest design used in the present study. The conventional self-report pretest and the recalled self-report pretest were only separated by four months, which may have in part mitigated the effect of memory distortion. Previous research (e.g., Finney, 1981; Howard, Dailey, & Gulanick, 1979; Howard, Schmeck, & Bray, 1979; Maisto et al., 1982), nonetheless, suggests that a pretesting effect can be mitigated and moderate-to-high recall accuracy is possible when cognitive constructs are measured (such as knowledge ratings) and when retrospective questions are specific and anchors on response scales are explicit (these conditions are consistent with those used in this study). An Application of the Retrospective Pre/Post Design In this section, a study by Nakonezny, Rodgers, and Nussbaum (2003) which applied the retrospective pretest-posttest treatment design to a unique research setting is briefly described. Nakonezny et al. (2003) examined the effect of later life parental divorce on solidarity in the relationship between the adult child and older parent. This examination was achieved by testing the buffering hypothesis that greater levels of predivorce solidarity in the adult child/older parent relationship buffers damage to postdivorce solidarity. The unique and uncommon nature of the phenomenon of later life parental divorce, however, precluded access
using a conventional pretest-posttest treatment design in evaluation research settings. Retrospective self-report pretests could be used, however, when conventional self-report pretest data are not available. In support of this scenario, we present an example of an innovative application of the retrospective pretest-posttest treatment design in a social science research setting. Finally, the ultimate value of this work may lie in its ability to renew interest in the retrospective pretest-posttest treatment design, to motivate future research, and to sharpen the empirical focus of that research.

References

Cohen, J. (1988). Statistical power analysis for the behavioral sciences. Hillsdale, NJ: Lawrence Erlbaum.

Collins, L. M., Graham, J. W., Hansen, W. B., & Johnson, C. A. (1985). Agreement between retrospective accounts of substance use and earlier reported substance use. Applied Psychological Measurement, 9, 301-309.

Cook, T. D., & Campbell, D. T. (1979). Quasi-experimentation: Design and analysis issues for field settings. Boston, MA: Houghton Mifflin Company.

Cronbach, L. J., & Furby, L. (1970). How we should measure change--or should we? Psychological Bulletin, 74, 68-80.

Finney, H. C. (1981). Improving the reliability of retrospective survey measures: Results of a longitudinal field survey. Evaluation Review, 5, 207-229.

Hoogstraten, J. (1982). The retrospective pretest in an educational training context. Journal of Experimental Education, 50, 200-204.

Howard, G. S., & Dailey, P. R. (1979). Response shift bias: A source of contamination of self-report measures. Journal of Applied Psychology, 64, 144-150.

Howard, G. S., Millham, J., Slaten, S., & O'Donnell, L. (1981). Influence of subjective response style effects on retrospective measures. Applied Psychological Measurement, 5, 89-100.
Howard, G. S., Ralph, K. M., Gulanick, N. A., Maxwell, S. E., Nance, D. W., & Gerber, S. K. (1979). Internal invalidity in pretest-posttest self-report evaluations and a reevaluation of retrospective pretests. Applied Psychological Measurement, 3, 1-23.

Howard, G. S., Schmeck, R. R., & Bray, J. H. (1979). Internal invalidity in studies employing self-report instruments: A suggested remedy. Journal of Educational Measurement, 16, 129-135.

Maisto, S. A., Sobell, L. C., Cooper, A. M., & Sobell, M. B. (1982). Comparison of two techniques to obtain retrospective reports of drinking behavior from alcohol abusers. Addictive Behaviors, 7, 33-38.
Journal of Modern Applied Statistical Methods May, 2005, Vol. 4, No.1, 251-274
Technology advances popularized large databases in education. Traditional statistics have limitations for analyzing large quantities of data. This article discusses data mining by analyzing a data set with three models: multiple regression, data mining, and a combination of the two. It is concluded that data mining is applicable in educational research. Key words: Data mining, large scale data analysis, quantitative educational research, Bayesian network, prediction
Introduction

In the last decade, with the availability of high-speed computers and low-cost computer memory (RAM), electronic data acquisition and database technology have allowed data collection methods that are substantially different from the traditional approach (Wegman, 1995). As a result, large data sets and databases are becoming increasingly popular in every aspect of human endeavor, including educational research. Different from the small, low-dimensional, homogeneous data sets collected in traditional research activities, computer-based data collection results in data sets of large volume and high dimensionality (Hand, Mannila, & Smyth, 2001; Wegman, 1995).
Yonghong Jade Xu is an Assistant Professor at the Department of Counseling, Educational Psychology, and Research, College of Education, the University of Memphis. The author wishes to thank Professor Darrell L. Sabers, Head of the Department of Educational Psychology at the University of Arizona, and Dr. Patricia B. Jones, Principal Research Specialist at the Center for Computing Information Technology at the University of Arizona, for their invaluable input pertaining to this study. Correspondence concerning this article should be addressed to Yonghong Jade Xu, Email: [email protected]
Many statisticians (e.g., Fayyad, 1997; Hand et al., 2001; Wegman, 1995) noticed some drawbacks of traditional statistical techniques when trying to extract valid and useful information from a large volume of data, especially those of a large number of variables. As Wegman (1995) argued, applying traditional statistical methods to massive data sets is most likely to fail because homogeneity is almost surely gone; any parametric model will almost surely be rejected by any hypothesis testing procedure; fashionable techniques such as bootstrapping are computationally too complex to be seriously considered for many of these data sets; random subsampling and dimensional reduction techniques are very likely to hide the very substructure that may be pertinent to the correct analysis of the data (p. 292). Moreover, because most of the large data sets are collected from convenient or opportunistic samples, selection bias puts in question any inferences from sample data to target population (Hand, 1999; Hand et al., 2001). The statistical challenge has stimulated research aiming at methods that can effectively examine large data sets to extract valid information (e.g., Daszykowski, Walczak, & Massart, 2002). New analytical techniques have been proposed and explored. Among them, some statisticians (e.g., Elder & Pregibon, 1996; Friedman, 1997; Hand, 1998, 1999, 2001; Wegman, 1995) paid attention to a new data analysis tool called data mining and knowledge discovery in database. Data mining is a process
Figure 1. An example of a BBN model. This graph illustrates the three major classes of elements of a Bayesian network; all variables, edges, and CP tables are for demonstration only and do not reflect the data and results of the current study in any way.
set of directed edges (arcs) between variables showing the causal/relevance relationships between variables, and also a CP table P(A | B1, B2, ..., Bn) attached to each variable A with parents B1, B2, ..., Bn. The CPs describe the strength of the beliefs given that the prior probabilities are true. Because in learning a previously unknown BBN the calculation of the probability of any branch requires all branches of the network to be calculated (Niedermayer, 1998), the practical difficulty of performing the propagation, even with the availability of high-speed computers, delayed the availability of software tools that could interpret the BBN and perform the complex computation until recently. Although the resulting ability to describe the network can be performed in linear time, given a relatively large number of variables and their product state space, the process of network discovery remains computationally impossible if an exhaustive search in the entire model space is required for finding the network of best prediction accuracy. As a compromise, some algorithms and utility functions are adopted to direct random selection of variable subsets in the BBN modeling process and to guide the search for the optimal subset with an evaluation function tracking the prediction accuracy (measured by
the classification error rate) of every attempted model (Friedman et al., 1997). That is, a stochastic variable subset selection is embedded into the BBN algorithms. The variable selection function conducts a search for the optimal subset using the BBN itself as a part of the evaluation function, the same algorithm that will be used to induce the final BBN prediction model. Some special features of the BBN are considered beneficial to analyzing large data sets. For instance, to define a finite product state space for calculating the CPs and learning the network, all continuous variables have to be discretized into a number of intervals (bins). With such discretization, variable relationships are measured as associations that do not assume linearity and normality, which minimizes the negative impacts of outliers and other types of irregularities inherent in secondary data sources. Variable discretization also makes a BBN flexible in handling different types of variables and eliminates the sample size as a factor influencing the amount of computation. With large databases available for research and policy making in education, this study is designed to assess whether the data mining approach can provide educational researchers with extra means and benefits in analyzing large-scale data sets.
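As a rough illustration of two of the ingredients described above (discretizing continuous variables and estimating a conditional probability table), the sketch below uses synthetic data rather than NSOPF:99 and ordinary pandas operations rather than the Belief Network PowerSoft software; the variable names are hypothetical.

# Illustrative sketch with synthetic data; not the study's BBN software.
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
n = 1000
rank = rng.integers(1, 5, size=n)                          # ordinal predictor (1-4)
salary = 30000 + 12000 * rank + rng.normal(0, 8000, n)     # continuous outcome

df = pd.DataFrame({"rank": rank,
                   "salary_bin": pd.cut(salary, bins=6, labels=False)})  # discretize

# Conditional probability table: rows = parent state, columns = salary bin.
cpt = pd.crosstab(df["rank"], df["salary_bin"], normalize="index")
print(cpt.round(2))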
Table 1. Name, Definition, and Measurement Scale of the 91 Variables from NSOPF:99.
Variable name Q25 Q26 Q29A1 Q29A2 Q29A3 Q29A4 Q29A5 Q29B1 Q29B2 Q29B3 Q29B4 Q29B5 Q29C1 Q29C2 Q29C3 Q29C4 Q29C5 Q2REC Q30B Q30C Q30D
Variable definition
Years teaching in higher education institution Positions outside higher education during career Career creative works, juried media Career creative works, non-juried media Career reviews of books, creative works Career books, textbooks, reports Career exhibitions, performances Recent sole creative works, juried media Recent sole creative works, non-juried media Recent sole reviews of books, works Recent sole books, textbooks, reports Recent sole presentations, performances Recent joint creative works, juried media Recent joint creative works, non-juried media Recent joint reviews of books, creative works Recent joint books, reports Recent joint presentations, performances Teaching credit or noncredit courses Hours/week unpaid activities at the institution Hours/week paid activities not at the institution Hours/week unpaid activities not at the institution
Scale
Interval Interval Interval Interval Interval Interval Interval Interval Interval Interval Interval Interval Interval Interval Interval Interval Interval Ordinal Interval Interval Interval
Table 1 Continued.
Variable name Q31A1 Q31A2 Q31A3 Q31A4 Q31A5 Q31A6 Q31A7 Q32A1 Q32A2 Q32B1 Q32B2 Q33 Q40 Q50 Q51 Q52 Q54_55RE Q58 Q59A Q61SREC Q64 Variable definition Time actually spent teaching undergrads (percentage) Time actually spent teaching graduates (percentage) Time actually spent at research (percentage) Time actually spent on professional growth (percentage) Time actually spent at administration (percentage) Time actually spent on service activity (percentage) Time actually spent on consulting (percentage) Number of undergraduate committees served on Number of graduate committees served on Number of undergraduate committees chaired Number of graduate committees chaired Total classes taught Total credit classes taught Total contact hours/week with students Total office hours/week Any creative work/writing/research PI / Co-PI on grants or contracts Total number of grants or contracts Total funds from all sources Work support availability Union status Scale Ratio Ratio Ratio Ratio Ratio Ratio Ratio Interval Interval Interval Interval Interval Interval Interval Interval Categorical Ordinal Interval Ratio Ordinal Categorical
Table 1 Continued.
Variable name Q76G Q7REC Q80 Q81 Q85 Q87 Q90 Q9REC X01_3 X01_60 X01_66 X01_82 X01_8REC X01_91RE DISCIPLINE X02_49 X03_49 X04_0 X04_41 X04_84 X08_0D Variable definition Consulting/freelance income Years on current job Number of dependents Gender Disability Marital status Citizenship status Years on achieved rank Principal activity Overall quality of research index Job satisfaction: other aspects of job Age Academic rank Highest educational level of parents Principal field of teaching/researching Individual instruction w/grad &1st professional students Number of students receiving individual instructions Carnegie classification of institution Total classroom credit hours Ethnicity in single category Doctoral, 4-year, or 2-year institution Scale Ratio Interval Interval Categorical Categorical Categorical Categorical Interval Categorical Ordinal Ordinal Interval Ordinal Ordinal Categorical Interval Interval Categorical Interval Categorical Ordinal
Table 1 Continued.
Variable name X08_0P X09_0RE X09_76 X10_0 X15_16 X21_0 X25_0 X37_0 X46_41 X47_41 SALARY Variable definition Private or public institution Degree of urbanization of location city Total income not from the institution Ratio: FTE enrollment / FTE faculty Years since highest degree Institution size: FTE graduate enrollment Institution size: Total FTE enrollment Bureau of Economic Analysis (BEA) regional codes Undergraduate classroom credit hours Graduate and First professional classroom credit hours Basic academic year salary Scale Categorical Ordinal Ratio Ratio Interval Interval Interval Categorical Interval Interval Ratio
Note. All data were based on respondent reported status during the 1998-99 academic year.
Analysis Three different prediction models were constructed and compared through the analysis of NSOPF:99; each of them had a variable reduction procedure and a prediction model based on the selected measures. The first model, Model I, was a multiple regression model with variables selected through statistical data reduction techniques; Model II was a data mining BBN model with an embedded variable selection procedure. A combination model, Model III, was also a multiple regression model, but built on variables selected by the data mining BBN approach. Model I. The first model started with variable reduction procedures that reduced the 90 NSOPF:99 variables (salary measure excluded) to a smaller group that can be efficiently manipulated by a multiple regression
procedure, and resulted in an optimal regression model based on the selected variables. According to the compensation theory and characteristics of the current data set, basic salary of the academic year as the dependent variable was log-transformed to improve its linear relationship with candidate independent variables. The variable reduction for Model I was completed in two phases. In the first phase, the dimensional structure of the variable space was examined with Exploratory Factor Analysis (EFA) and K-Means Cluster (KMC) analysis; based on the outcomes of the two techniques, variables were classified into a number of major dimensions. Because EFA measures variable relationships by linear correlation and KMC by Euclidian distance, only 82 variables on
dichotomous, ordinal, interval, or ratio scales were included. Two different techniques were used to scrutinize the underlying variable structure such that any potential bias associated with each of the individual approaches could be reduced. In EFA, different factor extraction methods were tried and followed by both orthogonal and oblique rotations of the set of extracted factors. The variable grouping was determined based on the matrices of factor loadings: variables that had a minimum loading of .35 on the same factor were considered as belonging to the same group. In the KMC analysis, the number of output clusters usually needs to be specified. When the exact number of variable clusters is unknown, the results of other procedures (e.g., EFA) can provide helpful information for estimating a range of possible number of clusters. Then the KMC can be run several times, each time with a different number of clusters specified within the range. The multiple runs of the KMC can also help to reduce the chance of getting a local optimal solution. Because variables were separated into mutually exclusive clusters, the interpretation of cluster identity was based on variables that had short distance from the cluster seed (the centroid). The results of the KMC analysis were compared with that of the EFA for similarities and differences. A final dimensional structure of the variable space was determined based on the consensus of the EFA and KMC outputs; each of the variable dimensions was labeled with a meaningful interpretation. During the second phase, one variable was selected from each dimension. Because of the different clustering methods used, variables in the same dimension might not share linear relationships. Taking into consideration that the final model of the analysis was of linear prediction, a method of extracting variables that account for more salary variance was desirable. Thus, for each cluster, the log-transformed salary was regressed on the variables within that cluster, and only one variable was chosen that associated with the greatest partial R2 change. Variables that did not show any strong relationships with any of the major groups, along with multilevel nominal variables that
could not be classified, were carried directly into the second stage of multiple regression modeling as candidate predictors and tested for their significance. Nominal variables were recoded into binary variables and possible interactions among the predictor variables were checked and included in the model if significant. Both forced entry and stepwise selection were used to search for the optimal model structure; if any of the variables was significant in one variable selection method, but nonsignificant in the other, a separate test on the variable was conducted in order to decide whether to include the variable in the final regression model. Finally, the proposed model was cross-checked with All Possible Subsets regression techniques including Max R and Cp evaluations to make sure the model was a good fit in terms of the model R2, adjusted R2, and the Cp value. Model II. The second prediction model was a BBN-based data mining model. To build the BBN model, all 91 original variables were input into a piece of software called the Belief Network Powersoft ; variables on interval and ratio scales were binned into category-like intervals because the network-learning algorithms require discrete values for a clear definition of a finite product state space of the input variables. Rather than logarithmical transformation, salary was binned into 24 intervals for the following reasons: first, logtransformation was not necessary because BBN is a robust nonmetric algorithm independent of any monotonic variable transformation. And second, a finite number of output classes is required in a Bayesian network construction. During the modeling process, variable selection was performed internally to find the subset with the best prediction accuracy. The BBN model learning was an automated process after reading in the input data. According to Chen and Greiner (1999), the authors of the software, two major tasks in the process are learning the graphical structure (variable relationships) and learning the parameters (CP tables). Learning the structure is the most computationally intensive task. The BBN software used in this study takes the network structure as a group of CP relationships (measured by statistical functions such as 2 statistic or mutual information test) connecting
Table 2 Continued.
Variable name Variable definition Variables from the original set DISCIPLINE Q12A Q12E Q12F Q19 Q26 Q30B Q31A4 Q31A6 Q64 Q80 Q81 Q85 Q87 Q90 X01_3 X01_91RE X04_0 X04_84 X37_0 Principal field of teaching/research Appointments: Acting Appointments: Clinical Appointments: Research Current position as primary employment Positions outside higher education during career Hours/week unpaid activities at the institution Time actually spent on professional growth (percentage) Time actually spent on service activity (percentage) Union status Number of dependents Gender Disability Marital status Citizenship status Principal activity Highest educational level of parents Carnegie classification of institution Ethnicity in single category Bureau of Economic Analysis (BEA) region code 10 1 1 1 1 1 1 1 1 3 1 1 1 3 3 1 1 14 3 8 df
Variable     Label
Intercept
Q29A1        Career creative works, juried media
X15_16       Years since highest degree
Q31A1        Time actually spent teaching undergrads (%)
Q31A5        Time actually spent at administration (%)
Q16A1REC     Highest degree type
Q29A3        Career reviews of books, creative works
Q76G         Consulting/freelance income
X01_66       Other aspects of job
X01_8REC     Academic rank
Q31A4        Time actually spent on professional growth (%)
Q31A6        Time actually spent on service activity (%)
Q81          Gender
BEA region codes (Baseline: Far West)
BEA1   New England
BEA2   Mid East
BEA3   Great Lakes
BEA4   Plains
-0.0608 0.0082 -0.0545 -0.0868 0.0058 0.0031 0.0006 0.0003 8.89 0.0021 16.27 0.5788 -3.86 0.0001 3.80 <.0001
Table 3 Continued.
Variable   Label
BEA5       Southeast
BEA6       Southwest
BEA7       Rocky Mountain
BEA8       U.S. Service schools
Parameter estimates: -0.0921, -0.0972, -0.1056, 0.1480; standard errors: 0.0084, 0.0198, 0.0148, 0.0142; t values and p > |t|: -7.97 <.0001, -3.07 <.0001, 0.56 <.0001, -3.82 0.2879
Principal field of teaching/research (Baseline: legitimate skip)
Variable    Label                          Parameter estimate   Standard error   t value   p > |t|
DSCPL1      Agriculture & home economics   -0.0279              0.0306           -0.91     0.3624
DSCPL2      Business                        0.1103              0.0228            4.84     <.0001
DSCPL3      Education                      -0.0643              0.0216           -2.98     0.0029
DSCPL4      Engineering                     0.0695              0.0246            2.82     0.0048
DSCPL5      Fine arts                      -0.0449              0.0241           -1.86     0.0627
DSCPL6      Health sciences                 0.0933              0.0182            5.12     <.0001
DSCPL7      Humanities                     -0.0641              0.0195           -3.29     0.001
DSCPL8      Natural sciences               -0.0276              0.0190           -1.45     0.148
DSCPL9      Social sciences                -0.0249              0.0202
DSCPL10     All other programs              0.0130              0.0194
Carnegie classification (Baseline: Private other Ph.D.)
Variable    Label                    Parameter estimate   Standard error   t value   p > |t|
STRATA1     Public comprehensive      0.0053              0.0236            0.22     0.8221
STRATA2     Private comprehensive    -0.0377              0.0263           -1.43     0.1525
STRATA3     Public liberal arts      -0.0041              0.0341           -0.12     0.9039
STRATA4     Private liberal arts     -0.0917              0.0260           -3.52     0.0004
Table 3 Continued.
Variable    Label                 Parameter estimate   Standard error   t value   p > |t|
STRATA5     Public medical         0.2630              0.0326            8.07     <.0001
STRATA6     Private medical        0.2588              0.0444            5.82     <.0001
STRATA7     Private religious     -0.1557              0.0523           -2.98     0.0029
STRATA8     Public 2-year          0.0386              0.0247            1.56     0.1185
STRATA9     Private 2-year        -0.0061              0.0574           -0.11     0.9155
STRATA10    Public other          -0.0207              0.0563           -0.37     0.7127
STRATA11    Private other         -0.0879              0.0428           -2.06     0.0399
STRATA12    Public research        0.0792              0.0228            3.47     0.0005
STRATA13    Private research       0.1428              0.0259            5.51     <.0001
STRATA14    Public other Ph.D.     0.0005              0.0254            0.02     0.984
Primary activity (Baseline: others)
Variable    Label                               Parameter estimate   Standard error   t value   p > |t|
PRIMACT1    Primary activity: teaching          -0.0541              0.0169           -3.21     0.0013
PRIMACT2    Primary activity: research          -0.0133              0.0199           -0.67     0.5039
PRIMACT3    Primary activity: administration     0.0469              0.0203            2.31     0.0211
Model II
To make the findings of the data mining BBN model comparable to the results of regression Model I, the second model started without any pre-specified knowledge such as the order of variables in dependence relationships, forbidden relations, or known causal relations. To evaluate variable relationships and simplify the model structure, the data mining software allows users to provide a threshold value that determines how strong a mutual relationship between two variables must be to be considered meaningful; relationships below this threshold are omitted from subsequent network structure learning (Cheng & Greiner, 1999). In the current analysis, a number of BBN learning processes were completed, each with a different threshold value specified, in order to search for an optimal model structure. Because generalizability to new data sets is an
Figure 2. The BBN model of salary prediction. Some of the directional relationships may be counterintuitive (e.g., Q31A1 → X04_0) as a result of data-driven learning. The CP tables are not included to avoid complexity. The definitions of the seven variables are:
a. SALARY: Basic salary of the academic year
b. Q29A1: Career creative works, juried media
c. Q31A1: Percentage of time actually spent teaching undergrads
d. X15_16: Years since highest degree
e. X01_8REC: Academic rank
f. X04_0: Carnegie classification of institutions
g. Q10AREC: Years since achieving tenure
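The edge-screening idea described for the BBN software (retaining only pairwise relationships whose strength, measured by a χ² statistic or mutual information, exceeds a user-supplied threshold) can be illustrated with a short sketch. This is not the Belief Network PowerSoft algorithm itself; the variables, sample, and cutoff value below are hypothetical.

```python
import numpy as np

def mutual_information(x, y):
    """Empirical mutual information (in nats) between two discrete arrays."""
    x, y = np.asarray(x), np.asarray(y)
    mi = 0.0
    for xv in np.unique(x):
        for yv in np.unique(y):
            p_xy = np.mean((x == xv) & (y == yv))
            if p_xy == 0:
                continue
            mi += p_xy * np.log(p_xy / (np.mean(x == xv) * np.mean(y == yv)))
    return mi

rng = np.random.default_rng(0)
rank = rng.integers(1, 5, size=2000)                                # binned "rank"-like variable
salary_bin = np.clip(rank + rng.integers(-1, 2, size=2000), 1, 5)   # related binned outcome
noise = rng.integers(1, 5, size=2000)                               # unrelated variable

THRESHOLD = 0.05   # hypothetical cutoff; in the software this value is user-supplied
for name, v in [("rank", rank), ("noise", noise)]:
    mi = mutual_information(salary_bin, v)
    decision = "keep edge" if mi > THRESHOLD else "drop edge"
    print(f"MI(salary_bin, {name}) = {mi:.3f} -> {decision}")
```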
(Fragment of a continued coefficient table.) Labels: Career creative works, juried media; Time actually spent teaching undergrads (%); Academic rank; Years since highest degree. t values and p values: -0.41, 0.6802; -1.91, 0.0563; 1.98, 0.0472; 5.60, <.0001; -1.85, 0.0648.
Note. The dependent variable was log-transformed SALARY (LOGSAL).

Table 5. Summary Information of Multiple Regression Models I and III
Source            df     Sum of squares   Mean square   F        Pr > F

Model I: Multiple regression with statistical variable selection
Model             47     621.4482         13.2223       142.46   <.0001
Error             6599   612.4897         0.0928
Corrected total   6646   1233.9379
Model III: Multiple regression with variables selected through BBN
Model             18     520.2949         28.90527      268.4    <.0001
Error             6632   714.3279         0.10769
Corrected total   6651   1234.6228
Note: 1. For Model I, R² = .5036, adjusted R² = .5001, and the standard error of estimate is 0.305. 2. For Model III, R² = .4214, adjusted R² = .4199, and the standard error of estimate is 0.328.
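The fit statistics quoted in the note follow directly from the sums of squares in Table 5. A quick check, assuming the usual ANOVA definitions (R² = SS_model/SS_total, adjusted R² = 1 - MS_error/MS_total, standard error of estimate = √MS_error), reproduces the reported values up to rounding:

```python
import math

models = {
    "Model I":   dict(df_model=47, df_error=6599, df_total=6646,
                      ss_model=621.4482, ss_error=612.4897, ss_total=1233.9379),
    "Model III": dict(df_model=18, df_error=6632, df_total=6651,
                      ss_model=520.2949, ss_error=714.3279, ss_total=1234.6228),
}

for name, m in models.items():
    r2 = m["ss_model"] / m["ss_total"]
    ms_error = m["ss_error"] / m["df_error"]
    adj_r2 = 1 - ms_error / (m["ss_total"] / m["df_total"])   # 1 - MS_error / MS_total
    f_value = (m["ss_model"] / m["df_model"]) / ms_error
    see = math.sqrt(ms_error)                                  # standard error of estimate
    print(f"{name}: R2={r2:.4f}, adj R2={adj_r2:.4f}, F={f_value:.2f}, SEE={see:.3f}")
```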
Because the measures of variable association do not assume any probabilistic form for the variable distributions, neither linearity nor normality was required in the analysis. Consequently, the nonmetric algorithms used to build the BBN model took the binned original SALARY measure as the predicted variable.

Model Selection
In the multiple regression analysis, every unique combination of the independent variables theoretically makes a candidate prediction model, although the modeling techniques produce candidate models that are mostly in a nested structural schema. Model comparison is part of the analysis process; human intervention is necessary to select the final model, which usually has a higher R² along with a simple and stable structure. In contrast, the learning of an optimal BBN model is the result of a search in a model space that consists of candidate models of substantially different structures. In the automated model discovery process, numerous candidate models were constructed and evaluated with criteria called score functions, and the one with the best prediction accuracy was output as the optimal choice.

Model Presentation
As a result of different approaches to summarizing data and different algorithms for analyzing data, the outputs of the multiple
regression and the BBN models are different. The final result of a multiple regression analysis is usually presented as a mathematical equation. For example, Model III can be written as:

Log(Salary) = 10.5410 + 0.0024 Q29A1 - 0.0030 Q31A1 + 0.0664 X01_8REC + 0.0088 X15_16 - 0.0385 STRATA1 - 0.0645 STRATA2 - 0.0315 STRATA3 - 0.1221 STRATA4 + 0.2933 STRATA5 + 0.2915 STRATA6 - 0.2095 STRATA7 - 0.0403 STRATA8 - 0.0371 STRATA9 - 0.0245 STRATA10 - 0.0871 STRATA11 + 0.0479 STRATA12 + 0.1543 STRATA13 - 0.0496 STRATA14 + error.  (1)

If a respondent received the highest degree three years ago (X15_16 = 3), had three publications in juried media (Q29A1 = 3), and spent 20% of work time teaching undergraduate classes (Q31A1 = 20) as an assistant professor (X01_8REC = 4) in a public research institution (STRATA12 = 1 and all other STRATA variables equal to 0), the predicted value of this individual's log-transformed salary would be 10.83 according to Equation 1 (about $50,418), with an estimated standard error indicating the level of uncertainty. The result of the BBN model is presented in a quite different way. For the above case, the BBN model would make a prediction
Table 6. An Example of the BBN Conditional Probability Tables.

Bin #   Salary range               Probability
1       Salary < 29600             0.0114
2       29600 < Salary < 32615     0.0012
3       32615 < Salary < 35015     0.0487
4       35015 < Salary < 37455     0.0655
5       37455 < Salary < 39025     0.0254
6       39025 < Salary < 40015     0.0263
7       40015 < Salary < 42010     0.0460
8       42010 < Salary < 44150     0.0950
9       44150 < Salary < 46025     0.0894
10      46025 < Salary < 48325     0.0552
11      48325 < Salary < 50035     0.1590
12      50035 < Salary < 53040     0.0728
13      53040 < Salary < 55080     0.0081
14      55080 < Salary < 58525     0.0672
15      58525 < Salary < 60010     0.0985
16      60010 < Salary < 64040     0.0140
17      64040 < Salary < 68010     0.0321
18      68010 < Salary < 72050     0.0142
19      72050 < Salary < 78250     0.0228
20      78250 < Salary < 85030     0.0098
21      85030 < Salary < 97320     0.0005
22      97320 < Salary < 116600    0.0170
23      116600 < Salary < 175090   0.0190
24      175090 < Salary            0.0005
Note. Salary was binned into 24 intervals. For this particular case, the product state is that the highest degree was obtained three years ago (X15_16 = 3), the respondent had three publications in juried media (Q29A1 = 3), and spent 20% of the time teaching undergraduate classes (Q31A1 = .2) as an untenured (Q10AREC = 0) assistant professor (X01_8REC = 5) in a public research institution (STRATA = 12, and all other binary variables were 0).
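For comparison with the BBN output above, the point prediction from Equation 1 for this case can be reproduced directly; the sketch below simply plugs the printed coefficients and the covariate values given in the text into the equation (terms whose covariates are zero are omitted), and the small difference from the reported $50,418 reflects rounding of the printed values.

```python
import math

# Coefficients from Equation 1 (Model III); only the terms whose covariates are
# nonzero for this respondent are listed.
intercept = 10.5410
b = {"Q29A1": 0.0024, "Q31A1": -0.0030, "X01_8REC": 0.0664,
     "X15_16": 0.0088, "STRATA12": 0.0479}

# Covariate values from the worked example in the text.
x = {"Q29A1": 3, "Q31A1": 20, "X01_8REC": 4, "X15_16": 3, "STRATA12": 1}

log_salary = intercept + sum(b[k] * x[k] for k in b)
print(round(log_salary, 2))          # about 10.83
print(round(math.exp(log_salary)))   # about 50,400 dollars
```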
linearity was dismissed, and significance tests were rarely necessary. However, the BBN model had some drawbacks as well. First, the BBN model, like most data mining models, is adapted to categorical variables. Continuous measures had to be binned to be handled appropriately, and this downgrading of the measurement scale inevitably costs some accuracy of information. It also became clear in the process of this study that the ability to identify the most important variable from a group of highly correlated measures is an important criterion for evaluating applied data analysis methods when handling a large number of variables, because redundant measures of the same constructs are common in large data sets and databases. The findings of this study indicate that the BBN is capable of performing such a task: Model II identified five variables from groups of measures on teaching, publication, experience, academic seniority, and institution parameters, the same five as those selected by the data reduction techniques in building Model I, because these five variables accounted for more variance of the predicted variable than their alternatives. In general, data mining has some unique features that can help to explore and analyze enormous amounts of data. Combining statistical and machine learning techniques in automated computer algorithms, data mining can be used to explore very large volumes of data with robustness against poor data quality such as nonnormality, outliers, and missing data. The inductive nature of data mining techniques is very practical for overcoming limitations of traditional statistics when dealing with large sample sizes. The random selection of subsets of variables in making accurate predictions simplifies the problems associated with a large number of variables. Nevertheless, the applicability of this new technique in educational and behavioral science has to be tailored to the specific needs of individual researchers and the goals of their studies. By introducing data mining, a tool that has been widely used in business management and scientific research, this study demonstrated an alternative approach to analyzing educational databases. A clear-cut answer is difficult regarding the differences and advantages of the
individual approaches. However, looking at a problem from different viewpoints is itself the essence of the study, and hopefully it can provide critical information for researchers to make their own assessments about how well these different models work to provide insight into the structure of, and to extract valuable information from, large volumes of data. By using confirmatory analysis to follow up the findings generated by data mining, educational researchers can virtually turn their large collections of data into a reservoir of knowledge to serve public interests.
References
Cheng, J., & Greiner, R. (1999). Comparing Bayesian network classifiers. Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence (UAI), Sweden, 101-108.
Cheng, J., Greiner, R., Kelly, J., Bell, D., & Liu, W. (2001). Learning Bayesian networks from data: An information-theory based approach. Artificial Intelligence, 137(1-2), 43-100.
Daszykowski, M., Walczak, B., & Massart, D. L. (2002). Representative subset selection. Analytica Chimica Acta, 468(1), 91-103.
Elder, J. F., & Pregibon, D. (1996). A statistical perspective on knowledge discovery in databases. In U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, & R. Uthurusamy (Eds.), Advances in knowledge discovery and data mining (pp. 83-113). Menlo Park, CA: AAAI Press.
Fayyad, U. M. (1997, August). Data mining and knowledge discovery in databases: Implications for scientific databases. Paper presented at the 9th International Conference on Scientific and Statistical Database Management (SSDBM '97), Olympia, WA.
Frawley, W. J., Piatetsky-Shapiro, G., & Matheus, C. J. (1991). Knowledge discovery in databases: An overview. In G. Piatetsky-Shapiro & W. J. Frawley (Eds.), Knowledge discovery in databases (pp. 1-27). Menlo Park, CA: AAAI Press.
Friedman, J. H. (1997). Data mining and statistics: What's the connection? In D. W. Scott (Ed.), Computing Science and Statistics: Vol. 29(1). Mining and Modeling Massive Data Sets in Sciences, and Business with a Subtheme in Environmental Statistics (pp. 3-9). (Available from the Interface Foundation of North America, Inc., Fairfax Station, VA 22039-7460).
Glymour, C., Madigan, D., Pregibon, D., & Smyth, P. (1997). Statistical themes and lessons for data mining. Data Mining and Knowledge Discovery, 1, 11-28.
Hand, D. J. (1998). Data mining: Statistics and more? The American Statistician, 52, 112-118.
Hand, D. J. (1999). Statistics and data mining: Intersecting disciplines. SIGKDD Explorations, 1, 16-19.
Hand, D. J., Mannila, H., & Smyth, P. (2001). Principles of data mining. Cambridge, MA: MIT Press.
Heckerman, D. (1997). Bayesian networks for data mining. Data Mining and Knowledge Discovery, 1, 79-119.
Michalski, R. S., Bratko, I., & Kubat, M. (1998). Machine learning and data mining: Methods and applications. Chichester: John Wiley & Sons.
National Center for Education Statistics. (2002). National Survey of Postsecondary Faculty 1999 (NCES Publication No. 2002151) [Restricted-use data file, CD-ROM]. Washington, DC: Author.
Niedermayer, D. (1998). An introduction to Bayesian networks and their contemporary applications. Retrieved September 24, 2003, from http://www.niedermayer.ca/papers/bayesian/
Thearling, K. (2003). An introduction to data mining: Discovering hidden value in your data warehouse. Retrieved July 6, 2003, from http://www.thearling.com/text/dmwhite/dmwhite.htm
Wegman, E. J. (1995). Huge data sets and the frontiers of computational feasibility. Journal of Computational and Graphical Statistics, 4(4), 281-295.
Yu, Y., & Johnson, B. W. (2002). Bayesian belief network and its applications (Tech. Rep. UVA-CSCS-BBN-001). Charlottesville, VA: University of Virginia, Center for Safety-Critical Systems.
Zhou, Z. (2003). Three perspectives of data mining. Artificial Intelligence, 143(1), 139-146.
Journal of Modern Applied Statistical Methods May, 2005, Vol. 4, No.1, 275-282
Manifestation Of Differences In Item-Level Characteristics In Scale-Level Measurement Invariance Tests Of Multi-Group Confirmatory Factor Analyses
Bruno D. Zumbo
University of British Columbia, Canada
Kim H. Koh
Nanyang Technological University, Singapore
If a researcher applies the conventional tests of scale-level measurement invariance through multi-group confirmatory factor analysis of a PC matrix and MLE to test hypotheses of strong and full measurement invariance when the researcher has a rating scale response format wherein the item characteristics are different for the two groups of respondents, do these scale-level analyses reflect (or ignore) differences in item threshold characteristics? Results of the current study demonstrate the inadequacy of judging the suitability of a measurement instrument across groups by only investigating the factor structure of the measure for the different groups with a PC matrix and MLE. Evidence is provided that item level bias can still be present when a CFA of the two different groups reveals an equivalent factorial structure of rating scale items using a PC matrix and MLE. Key words: multi-group confirmatory factor analysis, item response formats
Introduction
Broadly speaking, there are two general classes of statistical and psychometric techniques to examine measurement invariance across groups: (1) scale-level analyses, and (2) item-level analyses. The groups investigated for measurement invariance are typically formed by gender, ethnicity, or translated/adapted versions of a test. In scale-level analyses, the set of items comprising a test is often examined together using multi-group confirmatory factor analyses (Byrne, 1998; Jöreskog, 1971) that involve testing strong and full measurement invariance hypotheses. In the item-level analyses, the focus is on the invariant characteristics of each item, one item at a time. In setting the stage for this study, which involves a blending of ideas from scale- and item-level analyses (i.e., multi-group confirmatory factor analysis and item response theory), it is useful to compare and contrast overall frameworks for scale-level and item-level approaches to measurement invariance. Recent examples of this sort of comparison can be found in Raju, Laffitte, and Byrne (2002), Reise, Widaman, and Pugh (1993), and Zumbo (2003). In these studies, the impact of scaling on measurement invariance has not been examined. Hence, it is important for the current study to investigate to what extent the number of scale points affects the tests of measurement invariance hypotheses in multi-group confirmatory factor analysis.

Bruno D. Zumbo is Professor of Measurement, Evaluation and Research Methodology, as well as a member of the Department of Statistics and the Institute of Applied Mathematics at the University of British Columbia, Canada. Email: [email protected]. Kim H. Koh is Assistant Professor, Centre for Research in Pedagogy and Practice, National Institute of Education, Nanyang Technological University, Singapore. Email: [email protected]. An earlier version of this article was presented at the 2003 National Council on Measurement in Education (NCME) conference, Chicago, Illinois. We would like to thank Professor Greg Hancock for his comments on an earlier draft of this article.

Scale-level Analyses
There are several expositions and reviews of single-group and multi-group confirmatory factor analysis (e.g., Byrne, 1998; Steenkamp & Baumgartner, 1998; Vandenberg
& Lance, 2000); therefore this review will be very brief. In describing multi-group confirmatory factor analysis, consider a one-factor model: one latent variable and ten items all loading on that one latent variable. There are two sets of parameters of interest in this model: (1) the factor loadings corresponding to the paths from the latent variable to each of the items, and (2) the error variances, one for each of the items. The purpose of the multi-group confirmatory factor analysis is to investigate to what extent each, or both, of the two sets of model parameters (factor loadings and error variances) are invariant across the two groups. As Byrne (1998) noted, there are various hypotheses of measurement invariance that can be tested, from weak to strict invariance. That is, one can test whether the model in its entirety is completely invariant, i.e., whether the measurement model as specified in one group is completely reproduced in the other, including the magnitudes of the loadings and error variances. At the other extreme is an invariance in which the only thing shared between the groups is the overall pattern of the model, but neither the magnitudes of the loadings nor those of the error variances are the same for the two groups; i.e., the test has the same dimensionality, or configuration, but not the same magnitudes for the parameters.

Item-level Analyses
In item-level analyses, the framework is different than at the scale level. At the item level, measurement specialists typically consider (a) one item at a time, and (b) a unidimensional statistical model that incorporates one or more thresholds for an item response. That is, the response to an item is governed by referring the latent variable score to the threshold(s), and from this comparison the item response is determined. Consider the following example of a four-point Likert item, "How much do you like learning about mathematics?" The item responses are scored on a 4-point scale such as (1) Dislike a lot, (2) Dislike, (3) Like, and (4) Like a lot. This item, along with other items, serves as a set of observed ordinal variables, x's, to measure the latent continuous variable x*, namely attitudes toward learning mathematics. For each observed ordinal variable x, there is an underlying continuous variable x*. If x has m
ordered categories, then

x = i  if  τ_(i-1) < x* ≤ τ_i,  i = 1, 2, 3, ..., m,

where

τ_0 = -∞ < τ_1 < τ_2 < ... < τ_(m-1) < τ_m = +∞
are parameters called threshold values. For a variable x with m categories, there are m-1 unknown thresholds. Given that the above item has four response categories, there are three thresholds on the latent continuous variable. If one approaches the item-level analyses from a scale-level perspective, the item responding process is akin to the thresholds one invokes in computing a polychoric correlation matrix (Jöreskog & Sörbom, 1996). In an item-level analysis, measurement specialists often focus on differences in thresholds across the groups. That is, the focus is on determining whether the thresholds are the same for the two groups. If studying an achievement or knowledge test, it should be asked whether the items are equally difficult for the two groups, with the thresholds being used as measures of item difficulty (i.e., an item with a higher threshold is more difficult). These differences in thresholds are investigated by methods collectively called methods for detecting differential item functioning (DIF). In common measurement practice this sort of measurement invariance is examined, for each item, one item at a time, using a DIF detection method such as the Mantel-Haenszel (MH) test or logistic regression (conditioning on the observed scores), or methods based on item response theory (IRT). The IRT methods investigate the thresholds directly, whereas the non-IRT methods test the difference in thresholds indirectly by studying the observed response option proportions using categorical data analysis methods such as the MH or logistic regression methods (see Zumbo & Hubley, 2003, for a review).
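As a concrete illustration of the logistic regression DIF approach mentioned above, the sketch below simulates a single binary item that is harder for the focal group, conditions on an observed total score, and tests whether group membership adds to the prediction; all data, item parameters, and the anchor-test construction are invented for the example.

```python
import numpy as np
import statsmodels.api as sm
from scipy.stats import chi2

rng = np.random.default_rng(1)
n = 1000
group = rng.integers(0, 2, size=n)          # 0 = reference group, 1 = focal group
theta = rng.normal(size=n)                  # latent proficiency

# One studied binary item with a higher threshold (harder) for the focal group,
# plus ten anchor items used to form the observed conditioning score.
item = (theta > np.where(group == 1, 0.5, 0.0)).astype(int)
anchor = (theta[:, None] > rng.normal(size=(1, 10))).astype(int)
total = anchor.sum(axis=1)

# Compare a model with the conditioning score only to one that adds group membership.
X0 = sm.add_constant(np.column_stack([total]))
X1 = sm.add_constant(np.column_stack([total, group]))
fit0 = sm.Logit(item, X0).fit(disp=0)
fit1 = sm.Logit(item, X1).fit(disp=0)

lr = 2 * (fit1.llf - fit0.llf)              # likelihood-ratio test for uniform DIF
print(f"LR chi-square (1 df) = {lr:.2f}, p = {chi2.sf(lr, 1):.4f}")
```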
characteristics of the statistical decisions over the long run, i.e., over many replications. We study the rejection rates for a test of the statistical hypotheses in multi-group confirmatory factor analysis.

Methodology
A computer simulation was conducted to investigate whether item-level differences in thresholds manifest themselves in the tests of strong and full measurement invariance hypotheses in multi-group CFA of a Pearson covariance matrix with maximum likelihood estimation. A one-factor model with 38 items was simulated. A population covariance matrix was obtained based on the data reported in Zumbo (2000, 2003), which were based on the item characteristics of a sub-section of the TOEFL. Based on this covariance matrix, 100,000 simulees were generated on these 38 items from a multivariate normal distribution with marginal (univariate) means of zero and standard deviations of one. The simulation was restricted to a one-factor model because item-level methods (wherein differences in item thresholds, called DIF in that literature, are widely discussed) predominantly assume unidimensionality of the items; for example, the IRT, MH, and logistic regression DIF methods. The same item thresholds were used as those used by Bollen and Barb (1981) in their study of ordinal variables and the Pearson correlation. In short, this method partitions the continuum ranging from -3 to +3; the thresholds are those values that divide the continuum into equal parts. The example in Figure 1 is a three-point scale using the notation described above for x* and x. The item thresholds were applied to these 38 normally distributed item vectors to obtain the ordinal item responses. The simulation design involved two completely crossed factors: (i) the number of scale points, ranging from three to seven, and (ii) the percentage of items with different thresholds (i.e., the percentage of DIF items), ranging from zero to 42.1 percent (1, 4, 8, and 16 items out of the total of 38).
Figure 1. Standard normal density of x* (on the continuum from -3 to 3) with thresholds a1 and a2 defining the three categories of x.
Note: Number of categories for x: 3 (values 1, 2, 3). Item thresholds for x*: a1, a2 (values of -1 and 1).
Three to seven scale points were chosen in order to deal only with those numbers of scale points for which Byrne (1998) and others suggest the use of Pearson covariance matrices with maximum likelihood estimation for ordinal item data. The resulting simulation design is a five by five completely crossed design. The differences in thresholds were modeled based on suggestions from the item response theory (IRT) DIF literature for binary items. That is, the IRT DIF literature (e.g., Zumbo, 2003; Zwick & Ercikan, 1989) suggests that an item threshold difference of 0.50 standard deviations is a moderate DIF. This idea was extended and applied to each of the thresholds for the DIF item(s). For example, for a three-point item response scale, group one would have thresholds of -1.0 and 1.0 whereas group two would have thresholds of -0.5 and 1.5. Note that for both groups the latent variables were simulated with a mean of zero and a standard deviation of one. The same principle applies to the four- to seven-point scales.
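A minimal sketch of this data-generating step, assuming the Bollen and Barb equal-interval thresholds on the -3 to +3 continuum and a 0.5 standard deviation shift for the focal group's DIF item; the function names and sample size are illustrative.

```python
import numpy as np

def make_thresholds(n_points, low=-3.0, high=3.0):
    """Equal-interval thresholds dividing [low, high] into n_points parts (Bollen & Barb, 1981)."""
    return np.linspace(low, high, n_points + 1)[1:-1]

def categorize(x_star, thresholds):
    """Map continuous x* scores to ordinal categories 1..n_points."""
    return np.digitize(x_star, thresholds) + 1

rng = np.random.default_rng(2023)
x_star = rng.normal(size=10_000)          # latent scores for one item, mean 0 and SD 1

tau_reference = make_thresholds(3)        # three-point scale: thresholds at -1 and 1
tau_focal_dif = tau_reference + 0.5       # DIF item in the focal group: -0.5 and 1.5

x_reference = categorize(x_star, tau_reference)
x_focal = categorize(x_star, tau_focal_dif)
print("reference proportions:", np.bincount(x_reference)[1:] / x_reference.size)
print("focal (DIF) proportions:", np.bincount(x_focal)[1:] / x_focal.size)
```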
Given that both groups have the same latent variable mean and standard deviation, the different thresholds for the two groups (i.e., the DIF) imply that the item(s) performing differently across the two groups would have different item response distributions. It should be noted that the Bollen and Barb methodology results in symmetric Likert item responses that are normally distributed. The results in Table 1 allow one to compare the effect of having different thresholds in terms of skewness and kurtosis. The descriptive statistics reported in Table 1 were computed from a simulated sample of 100,000 continuous normal scores that were transformed with our methodology. For a continuous normal distribution, the skewness and kurtosis statistics reported would both be zero. Focusing first on the skewness, it can be seen in Table 1 that they range from -0.008 to 0.011 (with a common standard error of 0.008), indicating that, as expected, the Likert responses were originally nearly symmetrical. Applying the
Note: These statistics were computed from a sample of 100,000 responses using SPSS 11.5. In all cases, standard errors of the skewness and kurtosis were 0.008 and 0.015, respectively.
threshold difference, as described above, resulted in item responses that were nearly symmetrical for three, six, and seven scale points, and only slightly positively skewed (0.125 and 0.105) for the four and five scale points. In terms of kurtosis, there is very little change with the different thresholds, except for the three-point scale, for which the different thresholds resulted in a more platykurtic response distribution. The items on which the differences in thresholds were modeled were selected randomly. Thus, in the four-item condition, the item from the one-item condition was included and an additional three items were randomly selected. In the eight-item condition, the four items were included and an additional four items were randomly selected, and so on. The sample size for the multi-group CFA was three hundred per group, a sample size that is commonly seen in practice. The number of replications for each cell in the simulation design was 100. The nominal alpha was set at .05 for each invariance hypothesis test. It is important to note that the rejection rates reported in this paper are, technically, Type I error rates only for the no-DIF conditions. In the other cases, when DIF is present, the rejection rates represent the likelihood of rejecting the null hypothesis (for each of the full and strong
measurement invariance hypotheses) when the null is true at the unobserved latent variable level, but not necessarily true in the manifest variables because the thresholds are different across the groups. For each replication, the strong and full measurement invariance hypotheses were tested. These hypotheses were tested by comparing the baseline model (with no between-group constraints) to each of the strong and full measurement invariance models. That is, strong measurement invariance is the equality of item loadings, Lambda-X, and full measurement invariance is the equality of both item loadings and uniquenesses, Lambda-X and Theta-Delta, across groups. For each cell, we searched the LISREL output for the 100 replications for warning or error messages. A one-tailed 95% confidence interval was computed for each empirical error rate. The confidence interval is particularly useful in this context because there are only 100 replications, so the sampling variability of the empirical error rate should be taken into account. The upper confidence bound was compared to Bradley's (1978) criterion of liberal robustness. If the upper confidence bound was .075 or less, the cell met the liberal criterion.
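The invariance decisions themselves rest on chi-square difference tests between the unconstrained baseline model and each constrained model; a small helper illustrates the arithmetic, with made-up chi-square and degree-of-freedom values rather than output from the study.

```python
from scipy.stats import chi2

def chi_square_difference(chi2_constrained, df_constrained, chi2_baseline, df_baseline):
    """Chi-square difference (likelihood ratio) test for nested multi-group CFA models."""
    d_chi2 = chi2_constrained - chi2_baseline
    d_df = df_constrained - df_baseline
    return d_chi2, d_df, chi2.sf(d_chi2, d_df)

# Illustrative fit values only (not taken from the study): baseline (configural)
# model versus the loadings-constrained (strong invariance) model.
d_chi2, d_df, p = chi_square_difference(1402.7, 1368, 1350.2, 1330)
print(f"Delta chi-square = {d_chi2:.1f}, df = {d_df}, p = {p:.3f}")
# The invariance hypothesis is rejected at the nominal alpha of .05 when p < .05.
```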
Table 2. Rejection Rates for the Full (FI) and Strong (SI) Measurement Invariance Hypotheses, with and without DIF Present. The row labels give the percentage of items having different thresholds across the two groups (% of DIF items).

% of DIF items        Hypothesis   3 pt.         4 pt.        5 pt.        6 pt.        7 pt.
0 (no DIF items)      FI           .07 (.074)    .01 (.012)   .01 (.012)   .05 (.054)   .02 (.022)
                      SI           .03 (.033)    .03 (.033)   .04 (.043)   .03 (.033)   .06 (.064)
2.9 (1 item)          FI           .09 (.095)†   .02 (.022)   .01 (.012)   .00 (.000)   .02 (.022)
                      SI           .07 (.074)    .02 (.022)   .01 (.012)   .03 (.033)   .03 (.033)
10.5 (4 items)        FI           .04 (.043)    .03 (.033)   .03 (.033)   .03 (.033)   .03 (.033)
                      SI           .06 (.064)    .02 (.022)   .04 (.043)   .06 (.064)   .07 (.074)
21.1 (8 items)        FI           .08 (.084)†   .00 (.000)   .04 (.043)   .02 (.022)   .02 (.022)
                      SI           .04 (.043)    .00 (.000)   .04 (.043)   .01 (.012)   .07 (.074)
42.1 (16 items)       FI           .07 (.074)    .02 (.022)   .02 (.022)   .02 (.022)   .02 (.022)
                      SI           .04 (.043)    .02 (.022)   .06 (.064)   .05 (.054)   .02 (.022)
Note. The upper confidence bound is provided in parentheses next to the empirical error rate. Empirical error rates in the range of Bradley's liberal criterion are indicated in plain type, whereas empirical error rates that do not even satisfy the liberal criterion are identified with the symbol † and in bold font.
Results
To determine whether the tests of strong and full measurement invariance (using the chi-square difference tests arising from using a Pearson covariance matrix and maximum likelihood estimation in, for example, LISREL) are affected by differences in item thresholds, we examined the empirical error rates in each of the conditions of the simulation design. Table 2 lists the results of the simulation study. Each tabled value is the empirical error rate over the 100 replications with 300 respondents per group. (Upon searching the output for errors and warnings produced by LISREL, one case of a non-positive definite theta-delta (TD) matrix was found in each of the study cells involving three scale points with 2.9 and 21.1 percent DIF items. The one replication with this warning was excluded from the calculation of the error rate and upper 95% bound for those two cells; therefore, the cell statistics were calculated from 99 replications for those two cases.) The values in the range of Bradley's liberal criterion are indicated in plain type. Values that do not even satisfy the liberal criterion are identified with the symbol †. The results show that almost all of the empirical error rates are within the range of Bradley's liberal criterion. Only two cells have upper confidence bounds that exceed the criterion of .075, and both occur in the three-scale-point condition. This suggests that differences in item thresholds may have an impact on the full measurement invariance hypothesis in some conditions for measures with a three-point item response format, although this finding is seen in only two of the four conditions involving differences in thresholds. For scale points ranging from four to seven, the empirical error rates are either at or near the nominal error rate. Interestingly, the empirical error rates of the three scale points are
that are ultimately used to achieve the intended purpose, the scores may be contaminated by item-level bias and, ultimately, valid inferences from the test scores become problematic.

References
Bollen, K. A., & Barb, K. H. (1981). Pearson's r and coarsely categorized measures. American Sociological Review, 46, 232-239.
Bradley, J. V. (1978). Robustness? British Journal of Mathematical and Statistical Psychology, 31, 144-152.
Byrne, B. M. (1994). Testing for the factorial validity, replication, and invariance of a measuring instrument: A paradigmatic application based on the Maslach Burnout Inventory. Multivariate Behavioral Research, 29, 289-311.
Byrne, B. M. (1998). Structural Equation Modeling with LISREL, PRELIS, and SIMPLIS. Mahwah, NJ: Lawrence Erlbaum Associates.
Byrne, B. M., Shavelson, R. J., & Muthén, B. (1989). Testing for the equivalence of factor covariance and mean structures: The issue of partial measurement invariance. Psychological Bulletin, 105, 456-466.
Hambleton, R. K., & Patsula, L. (1999). Increasing the validity of adaptive tests: Myths to be avoided and guidelines for improving test adaptation practices. Journal of Applied Testing Technology, 1, 1-11.
Jöreskog, K. G. (1971). Simultaneous factor analysis in several populations. Psychometrika, 36, 409-426.
Jöreskog, K. G., & Sörbom, D. (1996). LISREL 8: User's reference guide. Chicago, IL: Scientific Software International.
Luecht, R. (1996). MIRTGEN 1.0 [Computer software].
Raju, N. S., Laffitte, L. J., & Byrne, B. M. (2002). Measurement equivalence: A comparison of methods based on confirmatory factor analysis and item response theory. Journal of Applied Psychology, 87, 517-529.
Reise, S. P., Widaman, K. F., & Pugh, R. H. (1993). Confirmatory factor analysis and item response theory: Two approaches for exploring measurement invariance. Psychological Bulletin, 114, 552-566.
Steenkamp, J. E. M., & Baumgartner, H. (1998). Assessing measurement invariance in cross-national consumer research. Journal of Consumer Research, 25, 78-90.
Van de Vijver, F. J. R., & Hambleton, R. K. (1996). Translating tests: Some practical guidelines. European Psychologist, 1, 89-99.
Vandenberg, R. J., & Lance, C. E. (2000). A review and synthesis of the measurement invariance literature: Suggestions, practices, and recommendations for organizational research. Organizational Research Methods, 3, 4-69.
Zumbo, B. D. (2000, April). The effect of DIF and impact on classical test statistics: Undetected DIF and impact, and the reliability and interpretability of scores from a language proficiency test. Paper presented at the National Council on Measurement in Education (NCME), New Orleans, LA.
Journal of Modern Applied Statistical Methods May, 2005, Vol. 4, No.1, 283-287
Brief Report Exploratory Factor Analysis in Two Measurement Journals: Hegemony by Default
J. Thomas Kellow
College of Education University of South Florida-Saint Petersburg
Exploratory factor analysis studies in two prominent measurement journals were explored. Issues addressed were: (a) factor extraction methods, (b) factor retention rules, (c) factor rotation strategies, and (d) saliency criteria for including variables. Many authors continue to use principal components extraction and orthogonal (varimax) rotation, and to retain factors with eigenvalues greater than 1.0. Key words: Factor analysis, principal components, current practice
Introduction Factor analysis has often been described as both an art and a science. This is particularly true of exploratory factor analysis (EFA), where researchers follow a series of analytic steps involving judgments more reminiscent of qualitative inquiry, an irony given the mathematical sophistication underlying EFA models. A number of issues must be considered before invoking EFA, such as sample size and the relationships between measured variables (see Tabachnick & Fidell, 2001, for an overview). Once EFA is determined to be appropriate, researchers must consider carefully decisions related to: (a) factor extraction methods, (b) rules for retaining factors, (c) factor rotation strategies, and (d) saliency criteria for including variables. There is considerable latitude regarding which methods may be appropriate or desirable in a particular analytic scenario (Fabrigar, Wegener, MacCallum, & Strahan, 1999).
Factor Extraction Methods There are numerous methods for initially deriving factors, or components in the case of principal component (PC) extraction. Although some authors (Snook & Gorsuch, 1989) have demonstrated that certain conditions involving the number of variables factored and initial communalities lead to essentially the same conclusions, the unthinking use of PC as an extraction mode may lead to a distortion of results. Stevens (1992) summarizes the views of prominent researchers, stating that: When the number of variables is moderately large (say > 30), and the analysis contains virtually no variables expected to have low communalities (e.g., .4), then practically any of the factor procedures will lead to the same interpretations. Differences can occur when the number of variables is fairly small (< 20), and some communalities are low. (p. 400) Factor Retention Rules Several methods have been proposed to evaluate the number of factors to retain in EFA. Although the dominant method seems to be to retain factors with eigenvalues greater than 1.0, this approach has been questioned by numerous authors (Zwick & Velicer, 1986; Thompson & Daniel, 1996). Empirical evidence suggests that, while under-factoring is probably the greater
J. Thomas Kellow is an Assistant Professor of Measurement and Research in the College of Education at the University of South FloridaSaint Petersburg. Email him at: [email protected]
danger, sole reliance on the eigenvalues-greater-than-1.0 criterion may result in retaining factors of trivial importance (Stevens, 1992). Other methods for retaining factors may be more defensible and perhaps more meaningful in interpreting the data. Indeed, after reviewing empirical findings on its utility, Preacher and MacCallum (2003) reported that "the general conclusion is that there is little justification for using the Kaiser criterion to decide how many factors to retain" (p. 23).

Factor Rotation Strategies
Once a decision has been made to retain a certain number of factors, these are often rotated in a geometric space to increase interpretability. Two broad options are available, one (orthogonal) assuming the factors are uncorrelated, and the second (oblique) allowing for correlations between the factors. Although the principle of parsimony may tempt the researcher to assume, for the sake of ease of interpretability, uncorrelated factors, Pedhazur and Schmelkin (1991) argued that both solutions should be considered. Indeed, it might be argued that it rarely is tenable to assume that multidimensional constructs, such as self-concept, are comprised of dimensions that are completely independent of one another. Although interpretation of factor structure is somewhat more complicated when using oblique rotations, these methods may better honor the reality of the phenomenon being investigated.

Saliency Criteria for Including Variables
Many researchers regard a factor loading (more aptly described as a pattern or structure coefficient) of .3 or above as worthy of inclusion in interpreting factors (Nunnally, 1978). This rationale is predicated on a rather arbitrary decision rule that 9% of variance accounted for makes a variable noteworthy. In a similar vein, Stevens (1992) offered .4 as a minimum for variable inclusion, as this means the variable shares at least 15% of its variance with a factor. Others (Cliff & Hamburger, 1967) argue for the statistical significance of a variable as an appropriate criterion for inclusion. As Hogarty, Kromrey, Ferron, and Hines (in press) noted, although a variety of rules of thumb of
were identified. In some instances the authors conducted two or more EFA analyses on split samples. For the present purposes, these were coded as separate studies. This resulted in 212 studies that invoked EFA models. Variables extracted from the EFA articles were: (a) factor extraction methods; (b) factor retention rules; (c) factor rotation strategies; and (d) saliency criteria for including variables.

Results

Factor Extraction Methods
The most common extraction method employed (64%) was principal components (PC). The next most popular choice was principal axis (PA) factoring (27%). Techniques such as maximum likelihood were infrequently invoked (6%). A modest percentage of authors (8%) conducted both PC and PA methods on their data and compared the results for similar structure.

Factor Retention Rules
The most popular method used for deciding the number of factors to retain was the Kaiser criterion of eigenvalues greater than 1.0; over 45% of authors used this method. Close behind in frequency of usage was the scree test (42%). Use of other methods, such as percent-of-variance-explained logics and parallel analysis, was comparatively infrequent (about 8% each). Many authors (41%) explored multiple criteria for factor retention. Among these authors, the most popular choice was a combination of the eigenvalues-greater-than-1.0 and scree methods (67%).

Factor Rotation Strategies
Virtually all of the EFA studies identified (96%) invoked some form of factor rotation. Varimax rotation was most often employed (47%), with Oblimin being the next most common (38%). Promax rotation also was used with a modest degree of frequency (11%). A number of authors (18%) employed both Varimax and Oblimin solutions to examine the influence of correlated factors on the resulting factor pattern/structure matrices.
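For readers weighing the retention rules tallied above, the two most common criteria (eigenvalues greater than 1.0 and the scree of observed eigenvalues) can be contrasted with Horn's parallel analysis in a few lines of generic code; the data below are simulated for illustration and are not from any of the reviewed studies.

```python
import numpy as np

def parallel_analysis(data, n_sims=100, percentile=95, seed=0):
    """Compare observed correlation-matrix eigenvalues with those from random normal data."""
    rng = np.random.default_rng(seed)
    n, p = data.shape
    observed = np.sort(np.linalg.eigvalsh(np.corrcoef(data, rowvar=False)))[::-1]
    random_eigs = np.empty((n_sims, p))
    for s in range(n_sims):
        noise = rng.normal(size=(n, p))
        random_eigs[s] = np.sort(np.linalg.eigvalsh(np.corrcoef(noise, rowvar=False)))[::-1]
    cutoffs = np.percentile(random_eigs, percentile, axis=0)
    return observed, cutoffs

# Toy data: two blocks of five variables, each block driven by one common factor.
rng = np.random.default_rng(1)
factors = rng.normal(size=(300, 2))
loadings = np.kron(np.eye(2), np.ones((1, 5)))
X = factors @ loadings + 0.7 * rng.normal(size=(300, 10))

observed, cutoffs = parallel_analysis(X)
print("Kaiser rule retains:", int(np.sum(observed > 1.0)))
print("Parallel analysis retains:", int(np.sum(observed > cutoffs)))
```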
Saliency Criteria for Including Variables
Thirty-one percent of EFA authors did not articulate a specific criterion for interpreting salient pattern/structure coefficients, preferring instead to examine the matrix in a logical fashion, considering not only the size of the pattern/structure coefficient, but also the discrepancy between coefficients for the same variable across different factors (components) and the logical fit of the variable with a particular factor. Of the 69% of authors who identified an a priori criterion as an absolute cutoff, 27% opted to interpret coefficients with a value of .3 or higher, while 24% chose the .4 value. Other criteria chosen with modest frequency (both about 6%) included .35 and .5 as absolute cutoff values. For the remaining authors who invoked an absolute criterion, values ranged from .25 to .8. A few (3%) of these values were determined based on the statistical significance of the pattern/structure coefficient.

Conclusion
Not surprisingly, the hegemony of default settings in major statistical packages continues to dominate the pages of EPM and PID. The Little Jiffy model espoused by Kaiser (1970), wherein principal components are rotated to the varimax criterion and all components with eigenvalues greater than 1.0 are retained, is alive and well. It should be noted that this situation is almost certainly not unique to EPM or PID authors. An informal perusal of a wide variety of educational and psychological journals that occasionally publish EFA results easily confirms the status of current practice. The rampant use of PC as an extraction method is not surprising given its status as the default in major statistical packages. Gorsuch (1983) has pointed out that, with respect to extraction methods, PC and factor models such as PA often yield comparable results when the number of variables is large and communalities (h²) also are large. Although comforting, authors are well advised to consider alternative extraction methods with their data even when these assumptions are met. When these assumptions are not met, such as when the rank of the factored matrix is small, there is
considerable measurement error, measurement error is not homogeneous across variables, and sampling error is small due to larger sample size, other extraction methods have more appeal (Thompson & Daniel, 1996, p. 202, italics added). The eigenvalues-greater-than-1.0 criterion was the most popular option for EFA analysts. A number of researchers, however, combined the eigenvalues-greater-than-1.0 criterion and the scree test, which is interesting inasmuch as both methods consult eigenvalues, only in different ways. A likely explanation is that both can be readily obtained in common statistical packages. Other approaches to ascertaining the appropriate number of factors (components), such as parallel analysis (Horn, 1965) and the bootstrap (Thompson, 1988), are available, as are methods based on the standard error scree (Zoski & Jurs, 1996). Each of these methods, however, requires additional effort on the part of the researcher. Nevertheless, EFA authors should consider alternatives for factor retention in much the same way that CFA authors consult the myriad fit indices available in model assessment. As Thompson and Daniel noted, "The simultaneous use of multiple decision rules is appropriate and often desirable" (p. 200). For authors invoking an absolute criterion for retaining variables, the .3 and .4 levels were by far the most popular. Researchers who feel compelled to set such arbitrary criteria often look to textbook authors to guide their choice. The latter criterion can be traced to Stevens (1992), who stated that "It would seem that one would want in general a variable to share at least 15% of its variance with the construct (factor) it is going to be used to help name. This means only using loadings (sic) which are about .4 or greater for interpretation purposes" (p. 384). The former rule appears to be attributable to Nunnally (1978), who claimed that "It is doubtful that loadings (sic) of any smaller size should be taken seriously, because they represent less than 10 percent of the variance" (p. 423). One-third of EFA authors chose not to adhere to a strict, and ultimately arbitrary, criterion for variable inclusion. Rather, these researchers considered the pattern/structure
Horn, J. L. (1965). A rationale and test for the number of factors in factor analysis. Psychometrika, 30, 179-185.
Kaiser, H. F. (1970). A second generation Little Jiffy. Psychometrika, 35, 401-415.
Nunnally, J. C. (1978). Psychometric theory (2nd ed.). New York: McGraw-Hill.
Pedhazur, E., & Schmelkin, L. (1991). Measurement, design, and analysis. Hillsdale, NJ: Erlbaum.
Preacher, K. J., & MacCallum, R. C. (2003). Repairing Tom Swift's electric factor analysis machine. Understanding Statistics, 2, 13-43.
Russell, D. W. (2002). In search of underlying dimensions: The use (and abuse) of factor analysis in Personality and Social Psychology Bulletin. Personality and Social Psychology Bulletin, 28, 1629-1646.
Snook, S. C., & Gorsuch, R. L. (1989). Component analysis versus common factor analysis: A Monte Carlo study. Psychological Bulletin, 106, 148-154.
Stevens, J. (1992). Applied multivariate statistics for the social sciences (2nd ed.). Hillsdale, NJ: Erlbaum.
Tabachnick, B. G., & Fidell, L. S. (2001). Using multivariate statistics (4th ed.). Needham Heights, MA: Allyn & Bacon.
Thompson, B. (1988). Program FACSTRAP: A program that computes bootstrap estimates of factor structure. Educational and Psychological Measurement, 48, 681-686.
Thompson, B. (2002). What future quantitative social science research could look like: Confidence intervals for effect sizes. Educational Researcher, 31(3), 24-31.
Thompson, B. (2004). Exploratory and confirmatory factor analysis: Understanding concepts and applications. Washington, DC: American Psychological Association.
Thompson, B., & Daniel, L. G. (1996). Factor analytic evidence for the construct validity of scores: A historical overview and some guidelines. Educational and Psychological Measurement, 56, 197-208.
Zoski, K. W., & Jurs, S. (1996). An objective counterpart to the visual scree test for factor analysis: The standard error scree. Educational and Psychological Measurement, 56, 443-451.
Zwick, W. R., & Velicer, W. F. (1986). Factors influencing five rules for determining the number of components to retain. Psychological Bulletin, 99, 432-442.
Journal of Modern Applied Statistical Methods May, 2005, Vol. 4, No.1, 288-299
Mariana Toma-Drane
University of South Carolina
Robert F. Valois
University of South Carolina
J. Wanzer Drane
University of South Carolina
Simulations were used to compare complete case analysis of ordinal data with analysis that includes multivariate normal (MVN) imputations. The MVN methods of imputation were not as good as using only complete cases. Bias and standard errors were measured against coefficients estimated from logistic regression and a standard data set. Key words: complete case analysis, missing data mechanism, multiple logistic regression

Introduction
Surveys are important sources of information in epidemiologic studies and other research as well, but they often encounter missing data (Patricia, 2002). Ordinal variables are very common in survey research; however, they challenge primary data collectors who might need to impute missing values of these variables because of their hierarchical nature with unequal intervals. The traditional approach, complete case analysis (CC), excludes from the analysis observations with any missing value among the variables of interest (Yuan, 2000). CC remains the most common method in the absence of readily available alternatives in software packages. However, using only complete cases could result in losing information about incomplete cases, thus biasing parameter estimates and compromising statistical power (Patricia, 2002). The multiple imputation (MI) procedure replaces each missing value with m plausible values generated under an appropriate model. These m multiply imputed datasets are then analyzed separately by using procedures for complete data to obtain the desired parameter estimates and standard errors. Results from the m analyses are then combined for inference by computing the mean of the m parameter estimates and a variance estimate that includes both a within-imputation and a between-imputation component (Rubin, 1987). MI has some desirable features, such as introducing appropriate random error into the imputation process and making it possible to obtain unbiased estimates of all parameters; allowing use of complete-data methods for data analysis; and producing more reasonable estimates of standard errors, thereby increasing the efficiency of estimates (Rubin, 1987). In addition, MI can be used with any kind of data and any kind of analysis without specialized software (Allison, 2000). MI appears to be a more attractive method for handling missing data in multivariate analysis compared to CC (King et al., 2001; Little & Rubin, 1989). However, certain requirements should be met for it to have these attractive properties. First, the data must be missing at random (MAR). Second, the model used to generate the imputed values must be correct in some sense. Third, the model used for the analysis must match up, in some sense, with the model used in the imputation
Ling Chen is a doctoral student in the Department of Statistics at the University of Missouri, Columbia. Mariana Toma-Drane is a doctoral student at the Norman J. Arnold School of Public Health, Department of Health Promotion, Education and Behavior. John Wanzer Drane is Professor of Biostatistics at USC and a Fellow of the American Academy of Health Behavior. Robert F. Valois is Professor of Health Promotion, Education and Behavior at USC and a Fellow of the American Academy of Health Behavior.
observed information. MCAR is a special case of MAR. The missing data mechanism is ignorable for likelihood-based inferences under both MCAR and MAR (Little & Rubin, 1987). Missing NI (nonignorable) occurs when the probability of response on Y depends on the value of Ymis and possibly the value of Yobs as well. The data used in this investigation are from the 1997 South Carolina Youth Risk Behavior Survey (SCYRBS). The total number of complete and partial questionnaires collected was 5,545. The survey employed two-stage cluster sampling with derived weightings designed to obtain a representative sample of all South Carolina public high school students in grades 9-12, with the exception of those in special education schools. The survey ran from March until June 1997. The questionnaire covers six categories of priority health-risk behaviors required by the Centers for Disease Control and Prevention; locally, two additional psychological categories of questions were added that include quality of life and life satisfaction (Valois, Zullig, Huebner & Drane, 2001). The six categories of priority health-risk behaviors among youth and young adults are those that contribute to unintentional and intentional injuries; tobacco use; alcohol and other drug use; sexual behaviors; dietary behaviors; and physical inactivity (Kolbe, 1990). The items on self-reported youth risk behaviors are Q10 through Q20. The six life-satisfaction variables, Q99 through Q104, are based on six domains: family, friends, school, self, living environment, and overall life satisfaction. Each of the questions has seven response options based on the Multidimensional Students' Life Satisfaction Scale (Seligson, Huebner & Valois, 2003). The response options are from the Terrible-to-Delighted Scale: 1 - terrible; 2 - unhappy; 3 - mostly dissatisfied; 4 - equally satisfied and dissatisfied; 5 - mostly satisfied; 6 - pleased; and 7 - delighted. The four race-gender groups, White Females (WF, 26.7%), White Males (WM, 26.0%), Black Females (BF, 26.0%) and Black Males (BM, 21.3%), accounted for almost equal percentages of the sample. The division of the sample into these four groups was due to the belief that the relationship between life satisfaction and youth risk behaviors varies
across different race-gender groups, as demonstrated in previous research (Valois, Zullig, Huebner & Drane, 2001).

Multiple Logistic Regression Analysis
Exploring the relationship between life satisfaction and youth risk behaviors motivated this study. Three covariates on ordinal scales were selected from the 1997 SCYRBS questionnaire (see the Appendix for details). They were dichotomized as Q10: DRKPASS (Riding with a drunk driver); Q14: GUNSCHL (Carrying a gun or other weapon on school property); and Q18: FIGHTIN (Physical fighting), respectively. Each of them was coded 1 for never (0 times) and 2 for ever (equal to or greater than 1 time), with 1 as the referent level. All six ordinal variables of life satisfaction (Q99 ~ Q104) were pooled for each participant to form a pseudo-continuous dependent variable ranging in score from 6 to 42, i.e., Lifesat = Q99 + Q100 + Q101 + Q102 + Q103 + Q104. The score was expressed as a Satisfaction Score (SS), with lower scores indicative of reduced satisfaction with life (Valois, Zullig, Huebner & Drane, 2001). SS ranging from 6 to 27 was categorized as dissatisfied. For the dichotomized outcome variable D2, the students in the dissatisfied group (D2 = 1) served as the risk group and the others as the referent group (D2 = 0). As defined, all four variables used in the logistic regression were dichotomous. DRKPASS, GUNSCHL and FIGHTIN were used as predictor variables, while D2 was the response, or criterion, variable. The three predictor variables are each independently associated with life dissatisfaction, with odds ratios (OR) ranging from 1.42 to 2.27; they are also associated with each other, with odds ratios ranging from 2.22 to 4.52. To incorporate the sampling design in the multiple logistic regression analysis, dichotomous logistic regression (PROC MULTILOG) was conducted using SAS-callable Survey Data Analysis (SUDAAN) for weighted data at an alpha level of 0.05 (Shah, Barnwell & Bieler, 1997) (see Appendix). The analyses were done separately for the four race-gender groups, and the regression coefficient (β) and the standard error of the regression coefficient (Se(β)) for each covariate were obtained.
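A simplified sketch of this variable construction and model is shown below. It ignores the survey weights and design effects that SUDAAN accounts for, and it uses a randomly generated toy data frame in place of the SCYRBS file, so it is only an approximation of the published analysis.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({f"Q{q}": rng.integers(1, 8, size=n) for q in range(99, 105)})
for item in ["Q10", "Q14", "Q18"]:
    df[item] = rng.integers(0, 5, size=n)    # toy counts of each risk behavior

# Life-satisfaction score and dichotomized outcome (scores 6-27 = dissatisfied).
df["Lifesat"] = df[[f"Q{q}" for q in range(99, 105)]].sum(axis=1)
df["D2"] = (df["Lifesat"] <= 27).astype(int)

# Dichotomize covariates: 1 = never (0 times), 2 = ever (1 or more times).
df["DRKPASS"] = np.where(df["Q10"] == 0, 1, 2)
df["GUNSCHL"] = np.where(df["Q14"] == 0, 1, 2)
df["FIGHTIN"] = np.where(df["Q18"] == 0, 1, 2)

X = sm.add_constant(df[["DRKPASS", "GUNSCHL", "FIGHTIN"]])
fit = sm.Logit(df["D2"], X).fit(disp=0)
print(fit.params)      # regression coefficients, beta
print(fit.bse)         # standard errors, Se(beta)
```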
Figure 1. Response distributions (percent) for Q10, Q14, and Q18.
Nine scenarios were created in which the covariates Q10 (DRKPASS), Q14 (GUNSCHL) and Q18 (FIGHTIN) were missing at the same rate (5%, 15% and 30%), while the life-satisfaction variables (Q99 ~ Q104) remained complete, as in the Complete Standard Dataset. In each scenario, 500 datasets with missing covariates were generated. Table 1 lists the missing data mechanisms for the covariates and the average percentage of complete cases (all three covariates complete) in the 500 datasets for each scenario. All the simulations were performed using SAS version 8.2 (2002).

Multiple Imputation

The missing covariates in each simulated dataset were then imputed five times using the SAS MI procedure (see Appendix). First, initial parameter estimates were obtained by running the Expectation-Maximization (EM) algorithm until convergence, up to a maximum of 1000 iterations. Using the EM estimates as starting values, 500 cycles of Markov Chain Monte Carlo (MCMC) full-data augmentation were run under a ridge prior with the hyperparameter set to 0.75 to generate five imputations. A multivariate normal model was applied in the data augmentation to the non-normal ordinal data without trying to meet the distributional assumptions of the imputation model. Three auxiliary variables (Q11, Q13 and Q19) as well as the outcome variable D2 were entered into the imputation model as if they were jointly normal, to increase the accuracy of the imputed values of Q10, Q14 and Q18 (Allison, 2000; Schafer, 1997 & 1998; Rubin, 1996). Maximum and minimum values for the imputed values were specified based on the scale of the response options for the 1997 SCYRBS questions. These specifications were necessary so that imputations were not made outside the range of the original variables. The continuously distributed imputes for Q10, Q14 and Q18 were rounded to the nearest category using a cutoff value of 0.5.
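The way missingness was imposed on a covariate can be pictured with a short R sketch (hypothetical data; the study itself used SAS). Only the MCAR case is shown; under MAR or NI the deletion probability would instead depend on observed variables or on the missing values themselves, respectively:

# Impose MCAR missingness at a target rate on a hypothetical 1-5 covariate.
set.seed(1)
q10 <- sample(1:5, 100, replace = TRUE)   # hypothetical complete covariate
rate <- 0.15                              # 5%, 15% or 30% in the study's scenarios
q10.mcar <- q10
q10.mcar[runif(100) < rate] <- NA         # MCAR: deletion independent of any data values
mean(is.na(q10.mcar))                     # realized missing rate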
Table 1. Simulated scenarios for datasets with missing covariates.

Scenario   Missing percentage    Average percentage    Missing data mechanism for each covariate
           of each covariate     of complete cases     Q10 (DRKPASS)   Q14 (GUNSCHL)   Q18 (FIGHTIN)
1          5%                    85.73%                MCAR            MCAR            MCAR
2          5%                    85.42%                MAR             MCAR            MCAR
3          5%                    85.55%                NI              MCAR            MCAR
4          15%                   61.34%                MCAR            MCAR            MCAR
5          15%                   61.19%                MAR             MCAR            MCAR
6          15%                   62.54%                NI              MCAR            MCAR
7          30%                   34.22%                MCAR            MCAR            MCAR
8          30%                   34.30%                MAR             MCAR            MCAR
9          30%                   34.10%                NI              MCAR            MCAR
Table 2. Logistic regression coefficients and standard error estimates in the 1997 SCYRBS Dataset and the Complete Standard Dataset.

Group                      DRKPASS             GUNSCHL             FIGHTIN
                           β       Se(β)       β       Se(β)       β       Se(β)
White female, N=1359       0.14    0.10        0.99    0.21        0.88    0.16
  (N=1361)                (0.16)  (0.11)      (0.94)  (0.23)      (0.84)  (0.16)
Black female, N=1335       0.03    0.14        0.69    0.28        0.36    0.15
  (N=1336)                (0.02)  (0.14)      (0.63)  (0.24)      (0.45)  (0.16)
White male, N=1338         0.32    0.17        0.10    0.17        0.43    0.13
  (N=1340)                (0.25)  (0.16)      (0.32)  (0.15)      (0.53)  (0.11)
Black male, N=1119         0.43    0.16        0.95    0.20        0.32    0.11
  (N=1119)                (0.35)  (0.14)      (0.94)  (0.23)      (0.52)  (0.11)

Note: β, logistic regression coefficient; Se(β), standard error of the logistic regression coefficient. Numbers in parentheses are the sample size, logistic regression coefficient and standard error from the Complete Standard Dataset.

An example comparing CC and MI across the nine scenarios among White Females is presented in Table 3. The histogram of the average AVB of β for each covariate in this example is shown in Figure 2. To evaluate the imputation procedure, the absolute value of bias (AVB) in the point estimates and the coverage probability were mainly considered. The coverage probability is defined as the probability that the true regression coefficient is covered by the estimated 95 percent confidence interval. Further, the percent AVB of β for each covariate, calculated by dividing the AVB by the corresponding true β, better compares the two methods with regard to bias. Bias greater than or equal to 10% is considered unacceptable.
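Under one plausible reading of these definitions, the summaries could be computed over the 500 simulated datasets as in the following R sketch (placeholder vectors; AVB is taken here as the absolute value of the average bias):

# Bias and coverage summaries for one covariate across simulation replicates.
true.beta <- 0.16                           # Complete Standard Dataset value (DRKPASS, White Females)
est <- rnorm(500, mean = 0.17, sd = 0.11)   # placeholder for the 500 estimated coefficients
se  <- rep(0.11, 500)                       # placeholder standard errors
AVB <- abs(mean(est) - true.beta)           # absolute value of bias
pct.AVB <- 100 * AVB / abs(true.beta)       # percent AVB; 10% or more deemed unacceptable
covered <- (est - 1.96 * se <= true.beta) & (true.beta <= est + 1.96 * se)
coverage <- mean(covered)                   # proportion of nominal 95% CIs covering the true value
c(AVB = AVB, percent.AVB = pct.AVB, coverage = coverage)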
Both CC and MI produced biased estimates of β in all the scenarios. CC showed little or no bias in the scenarios under MCAR: the AVB of β is consistently less than 0.05 for all three covariates, even with only about 34% complete cases (30% missing for each covariate). However, CC showed larger AVBs of β in the scenarios under MAR and NI than in those under MCAR with the same missing covariate rates. Further, MI was generally less successful than CC, because MI showed larger AVBs of β than CC in most of the scenarios, regardless of missing data mechanism and missing covariate rate. (Results for the other three race-gender groups are not shown here.)
Figure 2. Average AVBs (absolute values of bias) of the logistic regression coefficients for DRKPASS, GUNSCHL and FIGHTIN across the nine scenarios among White Females. S1 ~ S9 represent Scenario 1 ~ Scenario 9, respectively.
Table 3. Comparison of complete case (CC) and multiple imputation (MI) model results across the nine scenarios among White Females.

                     DRKPASS                   GUNSCHL                   FIGHTIN
                     true β = 0.16             true β = 0.94             true β = 0.84
                     true Se(β) = 0.11         true Se(β) = 0.23         true Se(β) = 0.16
                     AVB      Se(β)            AVB      Se(β)            AVB      Se(β)
Scenario 1   CC      0.0082   0.1178           0.0075   0.2574           0.0033   0.1740
             MI      0.0043   0.1088           0.0306   0.2414           0.0613   0.1627
Scenario 2   CC      0.0132   0.1190           0.0148   0.2725           0.0198   0.1768
             MI      0.0329   0.1094           0.0044   0.2448           0.0280   0.1602
Scenario 3   CC      0.0189   0.1162           0.0462   0.2712           0.0295   0.1759
             MI      0.0004   0.1061           0.0324   0.2470           0.0725   0.1593
Scenario 4   CC      0.0116   0.1467           0.0182   0.3302           0.0046   0.1964
             MI      0.0133   0.1151           0.0893   0.2591           0.1732   0.1574
Scenario 5   CC      0.0286   0.1521           0.0504   0.3666           0.0802   0.2039
             MI      0.1437   0.1166           0.0339   0.2610           0.0815   0.1555
Scenario 6   CC      0.0667   0.1451           0.0633   0.3517           0.0724   0.2105
             MI      0.0315   0.1137           0.0996   0.2628           0.1800   0.1556
Scenario 7   CC      0.0194   0.2097           0.0390   0.4840           0.0111   0.2478
             MI      0.0279   0.1237           0.1083   0.2704           0.3118   0.1523
Scenario 8   CC      0.0138   0.1991           0.0335   0.4312           0.1002   0.2347
             MI      0.2718   0.1227           0.0660   0.2738           0.0575   0.1500
Scenario 9   CC      0.0323   0.2227           0.0261   0.5611           0.0951   0.2661
             MI      0.0771   0.1344           0.1278   0.2828           0.3126   0.1506

Note: β, logistic regression coefficient; Se(β), standard error of the logistic regression coefficient; AVB, absolute value of bias (|estimated β − true β|).
Table 4. Coverage probability (%) in Scenarios 2 and 8 for White Females.

                Scenario 2              Scenario 8
                CC        MI            CC        MI
DRKPASS         96.8      94.2          95.0      77.0
GUNSCHL         96.4      99.0          94.0      90.4
FIGHTIN         95.0      93.8          87.0      88.2
Table 5. Average Correct Imputation Rate (%) for the three covariates, with and without natural logarithmic transformation before imputation.

                                Without transformation        With transformation
                                Scenario 2    Scenario 8      Scenario 2    Scenario 8
Original scale    Q10           15.94         21.25           40.04         52.40
                  Q14           83.20         83.22           89.47         89.54
                  Q18           31.40         29.21           50.80         50.75
Recoded           DRKPASS       47.81         41.05           65.14         65.52
                  GUNSCHL       86.77         86.75           92.11         91.81
                  FIGHTIN       49.20         47.39           66.00         63.22
Also, in most scenarios the percent AVB of β from MI is far greater than that from CC and exceeds the 10% acceptance level. This discrepancy was especially obvious for the scenarios under MCAR (Scenarios 1, 4 and 7). Moreover, the AVBs and percent AVBs from MI increased substantially as larger proportions of the covariates were missing. Interestingly, MI showed consistently decreased Se(β) for each covariate in all the scenarios, which is not surprising, because the standard error under MI is based on full datasets (Allison, 2001). Table 4 lists the coverage probabilities in Scenarios 2 and 8 among White Females as an example. In both scenarios, the coverage probabilities from MI are not all better than those from CC. Clearly, the current MVN-based multiple imputation did not perform as well as CC in generating unbiased regression estimates. To investigate how well the present MI actually imputed the missing non-normal ordinal covariates, Scenarios 2 and 8 were used to check the imputation efficiency, as the two scenarios have the same missing data mechanism but different missing covariate rates. The Average Correct Imputation Rate is calculated as the average proportion of correctly imputed observations among the missing covariates; a correct imputation occurs when the imputed value is identical to its true value in the Complete Standard Dataset. Table 5 displays the Average Correct Imputation Rates for the three covariates on both the original scales (Q10, Q14 and Q18) and the recoded scales (DRKPASS, GUNSCHL and FIGHTIN). The Average Correct Imputation Rates for Q10 and Q18 are lower than 32% in both scenarios. Recoding helped to improve imputation efficiency for all three covariates; this can be explained by the loss of precision (fewer categories) after recoding. Surprisingly, the Average Correct Imputation Rates for Q14 (GUNSCHL) are very close in the two scenarios; in addition, they are consistently and considerably higher than those for the other two covariates. This may be explained by the fact that a vast majority of its observations fall into one category (Figure 1). Natural logarithmic transformation of the three covariates was also attempted before multiple imputation to approximate normality (Table 5).
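The Average Correct Imputation Rate amounts to a simple comparison of the rounded imputes with the Complete Standard values among the originally missing entries, as in this R sketch with hypothetical vectors (the study itself used SAS):

# Proportion of originally missing entries that were imputed to exactly the true category.
truth <- c(1, 2, 1, 5, 3, 1)                     # hypothetical Complete Standard values
imp   <- c(1, 1, 1, 4, 3, 1)                     # hypothetical rounded imputations
miss  <- c(TRUE, TRUE, FALSE, TRUE, FALSE, TRUE) # which entries were originally missing
correct.rate <- mean(imp[miss] == truth[miss])
correct.rate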
Table 6. Five imputations for missing Q10, without rounding of the imputed values, from one random dataset in Scenario 8 (observations 50 through 65). Observed Q10 values are on the 1-5 scale with missing entries shown as '.'; true values come from the Complete Standard Dataset, and the imputed values (e.g., 1.7633, 2.3249, 1.2140) are continuous and must be rounded back to the 1-5 response scale.
Accumulating evidence suggests that MI is usually better than, and almost always no worse than, CC (Wu & Wu, 2001; Schafer, 1998; Allison, 2001; Little, 1992). Evidence provided by Schafer (1997, 2000) demonstrated that incomplete categorical (ordinal) data can often be imputed reasonably using algorithms based on a MVN model. However, this study did not show results consistent with Schafer's findings; this is mainly due to ignoring the normality assumption of the imputation model. It is known that sensitivity to model assumptions is an important issue for the consistency and efficiency of the normal maximum likelihood method applied to incomplete data. The improved, though still unsatisfactory, imputation after natural logarithmic transformation is a good demonstration of the importance of the normal model assumption. Moreover, normal ML methods do not guarantee consistent estimates, and they are certainly not necessarily efficient when the data
are non-normal (Little, 1992). The MVN-based MI procedure, not specifically tailored to highly skewed ordinal data, may have seriously distorted the ordinal variables' distributions or their relationships with other variables in this study, and therefore is not reliable for imputing highly skewed ordinal data. It has been suggested that highly skewed variables may well be transformed to approximate normality (Tabachnick & Fidell, 2000). Nevertheless, highly skewed ordinal variables with only four or five values can hardly be transformed to nearly normal variables, as shown by the unsatisfactory imputation efficiencies after natural logarithmic transformation. This study gives a warning that doing imputation without checking the distributional assumptions of the imputation model can lead to worse trouble than not imputing at all. In addition, rounding after MI should be further explored in terms of appropriate cutoff values; rounding can also bring its own bias into regression analysis in multiple imputation of categorical variables.
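For reference, the rounding and range restriction described earlier amount to something like the following R sketch (hypothetical continuous imputes; the cutoff of 0.5 corresponds to ordinary rounding):

imp.cont <- c(1.76, 2.32, 1.52, 4.61)          # hypothetical continuous imputes for Q10
imp.cat <- pmin(pmax(round(imp.cont), 1), 5)   # round to nearest category, clamp to the 1-5 scale
imp.cat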
Applied researchers can be reasonably confident in using CC to generate unbiased regression estimates even when large proportions of data are missing completely at random. For ordinal variables with highly skewed distributions, MVN-based MI cannot be expected to be superior to CC in generating unbiased regression estimates. Researchers doing imputation without checking the distributional assumptions of the imputation model can get into worse trouble than not imputing at all.

References

Allison, P. D. (2001). Missing data. Thousand Oaks, CA: Sage.
Allison, P. D. (2000). Multiple imputation for missing data: a cautionary tale. Sociological Methods & Research, 28, 301-309.
King, G., Honaker, J., Joseph, A., & Scheve, K. (2001). Analyzing incomplete political science data: an alternative algorithm for multiple imputation. American Political Science Review, 95(1), 49-69.
Kolbe, L. J. (1990). An epidemiologic surveillance system to monitor the prevalence of youth behaviors that most affect health. Journal of Health Education, 21(6), 44-48.
Little, R. J. A. (1992). Regression with missing X's: a review. Journal of the American Statistical Association, 87, 1227-1237.
Little, R. J. A., & Rubin, D. B. (1989). The analysis of social science data with missing values. Sociological Methods & Research, 18, 292-326.
Little, R. J. A., & Rubin, D. B. (1987). Statistical analysis with missing data. New York: John Wiley & Sons, Inc.
Patrician, P. A. (2002). Focus on research methods: multiple imputation for missing data. Research in Nursing & Health, 25, 76-84.
Rubin, D. B. (1996). Multiple imputation after 18+ years. Journal of the American Statistical Association, 91(434), 473-489.
Appendix A: 1997 SCYRBS Questionnaire items associated with the three covariates in the regression analysis

Question 10 (Q10). During the past 30 days, how many times did you ride in a car or other vehicle driven by someone who had been drinking alcohol?
1. 0 times  2. 1 time  3. 2 or 3 times  4. 4 or 5 times  5. 6 or more times

Question 14 (Q14). During the past 30 days, on how many days did you carry a weapon such as a gun, knife, or club on school property?
1. 0 days  2. 1 day  3. 2 or 3 days  4. 4 or 5 days  5. 6 or more days

Question 18 (Q18). During the past 12 months, how many times were you in a physical fight?
1. 0 times  2. 1 time  3. 2 or 3 times  4. 4 or 5 times  5. 6 or 7 times  6. 8 or 9 times  7. 10 or 11 times  8. 12 or more times

Appendix B: SAS Code

SAS PROC MI code for multiple imputation:

proc mi data=first.c&I out=outmi&I seed=6666 nimpute=5
        minimum=1 1 1 1 1 1 0 maximum=5 5 5 5 8 5 1 round=1 noprint;
   em maxiter=1000 converge=1E-10;
   mcmc impute=full initial=em prior=ridge=0.75 niter=500 nbiter=500;
   freq weight;
   var Q10 Q11 Q13 Q14 Q18 Q19 D2;
run;

Appendix C: SUDAAN Code

SUDAAN PROC MULTILOG code for multiple logistic regression analysis:

proc multilog data=stand filetype=sas design=wr noprint;
   nest stratum psu;
   weight weight;
   subpopn sexrace=1 / name=white female;
   subgroup D2 drkpass gunschl fightin;
   levels 2 2 2 2;
   reflevel drkpass=1 gunschl=1 fightin=1;
   model D2 = drkpass gunschl fightin;
   output beta sebeta / filename=junk_2 filetype=sas;
run;
Journal of Modern Applied Statistical Methods May, 2005, Vol. 4, No. 1, 300-311
JMASM Algorithms and Code JMASM16: Pseudo-Random Number Generation In R For Some Univariate Distributions
Hakan Demirtas School of Public Health
University of Illinois at Chicago
An increasing number of practitioners and applied researchers have started using the R programming system in recent years for their computing and data analysis needs. As far as pseudo-random number generation is concerned, the built-in generator in R does not contain some important univariate distributions. In this article, complementary R routines that could potentially be useful for simulation and computation purposes are provided. Key words: Simulation; computation; pseudo-random numbers
Introduction

Following upon the work of Demirtas (2004), pseudo-random generation functions written in R for some univariate distributions are presented. The built-in pseudo-random number generator in R does not have routines for some important univariate distributions. Built-in codes are available only for the following univariate distributions: uniform, normal, chi-square, t, F, lognormal, exponential, gamma, Weibull, Cauchy, beta, logistic, stable, binomial, negative binomial, Poisson, geometric, hypergeometric and Wilcoxon. The purpose of this article is to provide complementary R routines for generating pseudo-random numbers from some univariate distributions. In the next section, eighteen R functions are presented, of which the first thirteen correspond to distributions that are not contained in the generator (Codes 1-13). The quality of the resulting variates has not been tested in the computer science sense; however, the first three moments for each distribution were rigorously tested. For the purposes of most applications, fulfillment of this criterion should be a reasonable approximation to reality. The last 5 functions (Codes 14-18) address already available univariate distributions; the reason for their inclusion is that variates generated with these routines are of slightly better quality than those generated by the built-in code in terms of the above-mentioned criterion.

Functions for random number generation

The following abbreviations are used: PDF stands for the probability density function; PMF stands for the probability mass function; CDF stands for the cumulative distribution function; GA stands for the generation algorithm and EAA stands for an example of application areas; nrep stands for the number of identically and independently distributed random variates. The formal arguments other than nrep reflect the parameters in the PDF or PMF. E(X) and V(X) denote the expectation and the variance of the random variable X, respectively.
Hakan Demirtas is an Assistant Professor of Biostatistics at the University of Illinois at Chicago. His research interests are the analysis of incomplete longitudinal data, multiple imputation and Bayesian computing. E-mail address: [email protected].
Left truncated normal distribution

PDF: f(x | μ, σ, τ) = exp(−(x − μ)² / (2σ²)) / [√(2π) σ (1 − Φ((τ − μ)/σ))] for τ ≤ x < ∞, where Φ(·) is the standard normal CDF and μ, σ and τ are the mean, standard deviation and left truncation point, respectively. EAA: Modeling the tail behavior in simulation studies. GA: Robert's (1995) acceptance/rejection algorithm with a shifted exponential as the majorizing density. For μ = 0 and σ = 1, E(X) = exp(−τ²/2) / [√(2π)(1 − Φ(τ))]; V(X) is a complicated expression (see Code 1).

Left truncated gamma distribution

PDF: f(x | α, β) = x^(α−1) exp(−x/β) / [β^α Γ(α, τ/β)] for τ ≤ x < ∞, α > 1 and min(α, β) > 0, where α and β are the shape and scale parameters, respectively, τ is the cutoff point at which truncation occurs and Γ(a, z) is the (upper) incomplete gamma function. EAA: Modeling left-censored data. GA: An acceptance/rejection algorithm (Dagpunar, 1978) where the majorizing density is chosen to be a truncated exponential. E(X) = β Γ(α + 1, τ/β) / Γ(α, τ/β), V(X) = β² Γ(α + 2, τ/β) / Γ(α, τ/β) − E(X)². The procedure works best when τ is small (see Code 2).
Code 1. Left truncated normal distribution:

draw.left.truncated.normal<-function(nrep,mu,sigma,tau){
  if (sigma<=0){stop("Standard deviation must be positive!\n")}
  lambda.star<-(tau+sqrt(tau^2+4))/2
  accept<-numeric(nrep)
  for (i in 1:nrep){
    sumw<-0
    while (sumw<1){
      y<-rexp(1,lambda.star)+tau
      gy<-lambda.star*exp(lambda.star*tau)*exp(-lambda.star*y)
      fx<-exp(-(y-mu)^2/(2*sigma^2))/(sqrt(2*pi)*sigma*(1-pnorm((tau-mu)/sigma)))
      ratio1<-fx/gy ; ratio<-ratio1/max(ratio1)
      u<-runif(1) ; w<-(u<=ratio) ; accept[i]<-y[w] ; sumw<-sum(w)}}
  accept}
Code 2: Left truncated gamma distribution

draw.left.truncated.gamma<-function(nrep,alpha,beta,tau){
  if (tau<0){stop("Cutoff point must be positive!\n")}
  if (alpha<=1){stop("Shape parameter must be greater than 1!\n")}
  if (beta<=0){stop("Scale parameter must be positive!\n")}
  y<-numeric(nrep)
  for (i in 1:nrep){
    index<-0
    scaled.tau<-tau/beta
    lambda<-(scaled.tau-alpha+sqrt((scaled.tau-alpha)^2+4*scaled.tau))/(2*scaled.tau)
    while (index<1){
      u<-runif(1) ; u1<-runif(1)
      y[i]<-(-log(u1)/lambda)+tau
      w<-((1-lambda)*y[i]-(alpha-1)*(1+log(y[i])+log((1-lambda)/(alpha-1)))<=-log(u))
      index<-sum(w)}}
  y<-y*beta
  y}
Laplace (double exponential) distribution

PDF: f(x | α, λ) = (λ/2) exp(−λ|x − α|) for λ > 0, where α and λ are the location and scale parameters, respectively. EAA: Monte Carlo studies of robust procedures, because it has a heavier tail than the normal distribution. GA: A sample from an exponential distribution with rate λ is generated, then the sign is changed with probability 1/2 and the resulting variates are shifted by α. E(X) = α, V(X) = 2/λ² (see Code 3).

Inverse Gaussian distribution

PDF: f(x | μ, λ) = (λ / (2πx³))^(1/2) exp(−λ(x − μ)² / (2μ²x)) for x ≥ 0, μ > 0 and λ > 0, where μ and λ are the location and scale parameters, respectively. EAA: Reliability studies. GA: An acceptance/rejection algorithm developed by Michael et al. (1976). E(X) = μ, V(X) = μ³/λ (see Code 4).

Von Mises distribution

PDF: f(x | K) = exp(K cos(x)) / (2π I0(K)) for −π ≤ x ≤ π and K > 0, where I0(K) is a modified Bessel function of the first kind of order 0. EAA: Modeling directional data. GA: Acceptance/rejection method of Best and Fisher (1979) that uses a transformed folded Cauchy distribution as the majorizing density. E(X) = 0 (see Code 5).

Zeta (Zipf) distribution

PMF: f(x | α) = 1 / (ζ(α) x^α) for x = 1, 2, 3, ... and α > 1, where ζ(α) = Σ_{x=1}^{∞} x^(−α) is the Riemann zeta function. EAA: Modeling the frequency of random processes. GA: Acceptance/rejection algorithm of Devroye (1986). E(X) = ζ(α − 1)/ζ(α), V(X) = [ζ(α) ζ(α − 2) − (ζ(α − 1))²] / (ζ(α))² (see Code 6).
Code 3. Laplace (double exponential) distribution: draw.laplace<-function(nrep, alpha, lambda){ if (lambda<=0){stop("Scale parameter must be positive!\n")} y<-rexp(nrep,lambda) change.sign<-sample(c(0,1), nrep, replace = TRUE) y[change.sign==0]<--y[change.sign==0] ; laplace<-y+alpha laplace}
Code 4. Inverse Gaussian distribution: draw.inverse.gaussian<-function(nrep,mu,lambda){ if (mu<=0){stop("Location parameter must be positive!\n")} if (lambda<=0){stop("Scale parameter must be positive!\n")} inv.gaus<-numeric(nrep); for (i in 1:nrep){ v<-rnorm(1) ; y<-v^2 x1<-mu+(mu^2*y/(2*lambda))-(mu/(2*lambda))*(sqrt(4*mu*lambda*y+mu^2*y^2)) u<-runif(1) ; inv.gaus[i]<-x1 w<-(u>(mu/(mu+x1))) ; inv.gaus[i][w]<-mu^2/x1} inv.gaus}
Code 5. Von Mises distribution: draw.von.mises<-function(nrep,K){ if (K<=0){stop("K must be positive!\n")} x<-numeric(nrep) ; for (i in 1:nrep){ index<-0 ; while (index<1){ u1<-runif(1) ; u2<-runif(1); u3<-runif(1) tau<-1+(1+4*K^2)^0.5 ; rho<-(tau-(2*tau)^0.5)/(2*K) r<-(1+rho^2)/(2*rho) ; z<-cos(pi*u1) f<-(1+r*z)/(r+z) ; c<-K*(r-f) w1<-(c*(2-c)-u2>0) ; w2<-(log(c/u2)+1-c>=0) y<-sign(u3-0.5)*acos(f) ; x[i][w1|w2]<-y index<-1*(w1|w2)}} x}
Code 6. Zeta (Zipf) distribution draw.zeta<-function(nrep,alpha){ if (alpha<=1){stop("alpha must be greater than 1!\n")} zeta<-numeric(nrep) ; for (i in 1:nrep){ index<-0 ; while (index<1){ u1<-runif(1) ; u2<-runif(1) x<-floor(u1^(-1/(alpha-1))) ; t<-(1+1/x)^(alpha-1) w<-x<(t/(t-1))*(2^(alpha-1)-1)/(2^(alpha-1)*u2) zeta[i]<-x ; index<-sum(w)}} zeta}
304
Logarithmic distribution

PMF: f(x | θ) = −θ^x / (x log(1 − θ)) for x = 1, 2, 3, ... and 0 < θ < 1. EAA: Modeling the number of items processed in a given period of time. GA: The chop-down search method of Kemp (1981). E(X) = −θ / [(1 − θ) log(1 − θ)] (see Code 7).

Code 7. Logarithmic distribution:

draw.logarithmic<-function(nrep,theta){
  if ((theta<=0)|(theta>1)){stop("theta must be between 0 and 1!\n")}
  x<-numeric(nrep)
  for (i in 1:nrep){
    index<-0 ; x0<-1 ; u<-runif(1)
    while (index<1){
      t<--(theta^x0)/(x0*log(1-theta))
      px<-t ; w<-(u<=px) ; x[i]<-x0 ; u<-u-px
      index<-sum(w) ; x0<-x0+1}}
  x}

Beta-binomial distribution

PMF: f(x | n, α, β) = [n! / (x!(n − x)! B(α, β))] ∫_0^1 θ^(α+x−1) (1 − θ)^(n+β−1−x) dθ for x = 0, 1, 2, ..., n, α > 0 and β > 0, where n is the sample size, α and β are the shape parameters and B(α, β) is the complete beta function. EAA: Modeling overdispersion or extra variation in applications where clusters of separate binomial distributions arise. GA: First θ is generated from the appropriate beta distribution and then used as the success probability in a binomial with sample size n. E(X) = nα/(α + β), V(X) = nαβ(α + β + n) / [(α + β)²(α + β + 1)] (see Code 8).

Code 8. Beta-binomial distribution:

# Note: the first line below (function name and arguments) follows the draw.* naming
# pattern of the other codes, as the original header did not survive extraction.
draw.beta.binomial<-function(nrep,alpha,beta,n){
  if (floor(n)!=n){stop("Size must be an integer!\n")}
  if (floor(n)<2){stop("Size must be greater than 2!\n")}
  beta.variates<-numeric(nrep) ; beta.binom<-numeric(nrep)
  for (i in 1:nrep){
    beta.variates[i]<-rbeta(1,alpha,beta)
    beta.binom[i]<-rbinom(1,n,beta.variates[i])}
  beta.binom}
Rayleigh distribution

PDF: f(x | σ) = (x/σ²) exp(−x²/(2σ²)) for x ≥ 0 and σ > 0, where σ is the scale parameter. EAA: Modeling spatial patterns. GA: The inverse CDF method. E(X) = σ√(π/2), V(X) = σ²(4 − π)/2 (see Code 9).

Pareto distribution

PDF: f(x | a, b) = a b^a / x^(a+1) for 0 < b ≤ x < ∞ and a > 0, where a and b are the shape and location parameters, respectively. EAA: Gene filtering in microarray experiments. GA: The inverse CDF method. E(X) = ab/(a − 1) (see Code 10).

Non-central t distribution

Describes the ratio Y / √(U/ν), where U is a central chi-square random variable with ν degrees of freedom and Y is an independent normally distributed random variable with variance 1 and mean λ. EAA: Thermodynamic stability scores. GA: Based on arithmetic functions of normal and χ² variates (see Code 11).

Code 10. Pareto distribution:

draw.pareto<-function(nrep,shape,location){
  if (shape<=0){stop("Shape parameter must be positive!\n")}
  if (location<=0){stop("Location parameter must be positive!\n")}
  u<-runif(nrep)
  pareto<-location/(u^(1/shape))
  pareto}

Code 11. Non-central t distribution:

draw.noncentral.t<-function(nrep,nu,lambda){
  if (nu<=1){stop("Degrees of freedom must be greater than 1!\n")}
  x<-numeric(nrep)
  for (i in 1:nrep){
    x[i]<-rt(1,nu)+(lambda/sqrt(rchisq(1,nu)/nu))}
  x}

Non-central chi-squared distribution

PDF: f(x | ν, λ) = [exp(−(x + λ)/2) x^(ν/2−1) / 2^(ν/2)] Σ_{k=0}^{∞} (λx)^k / [4^k k! Γ(k + ν/2)] for 0 ≤ x < ∞, λ > 0 and ν > 1, where λ is the noncentrality parameter and ν is the degrees of freedom. Both ν and λ can be non-integers. EAA: Wavelets in biomedical imaging. GA: Based on the sum of squared standard normal deviates. E(X) = ν + λ, V(X) = 2ν + 4λ (see Code 12).

Code 12. Non-central chi-squared distribution:

draw.noncentral.chisquared<-function(nrep,df,ncp){
  if (ncp<0){stop("Non-Centrality parameter must be non-negative!\n")}
  if (df<=1){stop("Degrees of freedom must be greater than 1!\n")}
  x<-numeric(nrep)
  for (i in 1:nrep){
    df.int<-floor(df) ; df.frac<-df-df.int
    mui<-sqrt(ncp/df.int) ; jitter<-0
    if (df.frac!=0){jitter<-rchisq(1,df.frac)}
    x[i]<-sum((rnorm(df.int)+mui)^2)+jitter}
  x}

Code 13. Doubly non-central F distribution:

draw.noncentral.F<-function(nrep,df1,df2,ncp1,ncp2){
  if (ncp1<0){stop("Numerator non-centrality parameter must be non-negative!\n")}
  if (ncp2<0){stop("Denominator non-centrality parameter must be non-negative!\n")}
  if (df1<=1){stop("Numerator degrees of freedom must be greater than 1!\n")}
  if (df2<=1){stop("Denominator degrees of freedom must be greater than 1!\n")}
  x<-draw.noncentral.chisquared(nrep,df1,ncp1)/draw.noncentral.chisquared(nrep,df2,ncp2)
  x}
Standard t distribution

PDF: f(x | ν) = [Γ((ν + 1)/2) / (√(νπ) Γ(ν/2))] (1 + x²/ν)^(−(ν+1)/2) for −∞ < x < ∞, where ν is the degrees of freedom and Γ(·) is the complete gamma function. GA: A rejection polar method developed by Bailey (1994). E(X) = 0, V(X) = ν/(ν − 2) for ν > 2 (see Code 14).

Weibull distribution

PDF: f(x | α, β) = (α/β^α) x^(α−1) exp(−(x/β)^α) for 0 ≤ x < ∞ and min(α, β) > 0, where α and β are the shape and scale parameters, respectively. EAA: Modeling lifetime data. GA: The inverse CDF method. E(X) = β Γ(1 + 1/α), V(X) = β²[Γ(1 + 2/α) − Γ²(1 + 1/α)] (see Code 15).
Code 14. Standard t distribution: draw.t<-function(nrep,df){ if (df<=1){stop("Degrees of freedom must be greater than 1!\n")} x<-numeric(nrep) ; for (i in 1:nrep){ index<-0 ; while (index<1){ v1<-runif(1,-1,1) ; v2<-runif(1,-1,1); r2<-v1^2+v2^2 r<-sqrt(r2) ; w<-(r2<1) x[i]<-v1*sqrt(abs((df*(r^(-4/df)-1)/r2))) index<-sum(w)}} x}
Code 15. Weibull distribution: draw.weibull<-function(nrep, alpha, beta){ if ((alpha<=0)|(beta<=0)){ stop("alpha and beta must be positive!\n")} u<-runif(nrep) ; weibull<-beta*((-log(u))^(1/alpha)) weibull}
Gamma distribution when α < 1

PDF: f(x | α, β) = x^(α−1) exp(−x/β) / (β^α Γ(α)) for 0 ≤ x < ∞ and min(α, β) > 0, where α and β are the shape and scale parameters, respectively. EAA: Bioinformatics. GA: An acceptance/rejection algorithm developed by Ahrens and Dieter (1974) and Best (1983); it works when α < 1. E(X) = αβ, V(X) = αβ² (see Code 16).
Code 16. Gamma distribution when <1 draw.gamma.alpha.less.than.one<-function(nrep,alpha,beta){ if (beta<=0){stop("Scale parameter must be positive!\n")} if ((alpha<=0)|(alpha>=1)){ stop("Shape parameter must be between 0 and 1!\n")} x<-numeric(nrep) ; for (i in 1:nrep){ index<-0 ; while (index<1){ u1<-runif(1) ; u2<-runif(1) t<-0.07+0.75*sqrt(1-alpha) ; b<-1+exp(-t)*alpha/t v<-b*u1 ; w1<-(v<=1) ; w2<-(v>1) x1<-t*(v^(1/alpha)) ; w11<-(u2<=(2-x1)/(2+x1)) w12<-(u2<=exp(-x1)) ; x[i][w1&w11]<-x1[w1&w11] x[i][w1&!w11&w12]<-x1[w1&!w11&w12] x2=-log(t*(b-v)/alpha) ; y<-x2/t w21<-(u2*(alpha+y*(1-alpha))<=1) w22<-(u2<=y^(alpha-1)) ; x[i][w2&w21]<-x2[w2&w21] x[i][w2&!w21&w22]<-x2[w2&!w21&w22] index<-1*(w1&w11)+1*(w1&!w11&w12)+1*(w2&w21)+1*(w2&!w21&w22)}} x<-beta*x x}
Code 17. Gamma distribution when >1: draw.gamma.alpha.greater.than.one<-function(nrep,alpha,beta){ if (beta<=0){stop("Scale parameter must be positive!\n")} if (alpha<=1){stop("Shape parameter must be greater than 1!\n")} x<-numeric(nrep) ; for (i in 1:nrep){ index<-0 ; while (index<1){ u1<-runif(1); u2<-runif(1) v<-(alpha-1/(6*alpha))*u1/((alpha-1)*u2) w1<-((2*(u2-1)/(alpha-1))+v+(1/v)<=2) w2<-((2*log(u2)/(alpha-1))-log(v)+v<=1) x[i][w1]<-(alpha-1)*v ; x[i][!w1&w2]<-(alpha-1)*v index<-1*w1+1*(!w1&w2)}} x<-x*beta x}
Beta distribution when max(α, β) < 1

PDF: f(x | α, β) = x^(α−1) (1 − x)^(β−1) / B(α, β) for 0 ≤ x ≤ 1, 0 < α < 1 and 0 < β < 1, where α and β are the shape parameters and B(α, β) is the complete beta function. EAA: Analysis of biomedical signals. GA: An acceptance/rejection algorithm developed by Johnk (1964); it works when both parameters are less than 1. E(X) = α/(α + β), V(X) = αβ / [(α + β)²(α + β + 1)] (see Code 18).
Code 18. Beta distribution when max(,)<1: draw.beta.alphabeta.less.than.one<-function(nrep,alpha,beta){ if ((alpha>=1)|(alpha<=0)|(beta>=1)|(beta<=0)) { stop ("Both shape parameters must be between 0 and 1!\n")} x<-numeric(nrep) ; for (i in 1:nrep){ index<-0 ; while (index<1){ u1<-runif(1) ; u2<-runif(1) v1<-u1^(1/alpha) ; v2<-u2^(1/beta) summ<-v1+v2 ; w<-(summ<=1) x[i]<-v1/summ ; index<-sum(w)}} x}
Results for arbitrarily chosen parameter values

For each distribution, the parameters can take infinitely many values and the first two moments can fall virtually anywhere on the real line. The quality of the random variates was therefore tested by a broad range of simulations, to detect any potential aberrations in some subset of the parameter domains and to avoid selection biases. The empirical and theoretical moments for arbitrarily chosen parameter values are reported in Tables 1 and 2. Table 1 tabulates the theoretical and empirical means for each distribution for arbitrary parameter values; throughout the table, the number of replications (nrep) is chosen to be 10,000. A similar comparison is made for the variances, as shown in Table 2. In both tables, the deviations from the expected moments are found to be negligible, suggesting that the random number generation routines presented are accurate. These routines could be a handy addition to a practitioner's set of tools given the growing interest in R. However, the reader is invited to be cautious about the following issues: 1) It is not postulated that the algorithms presented are the most efficient; furthermore, the implementation of a given algorithm may not be optimal, and, given sufficient time and resources, one can write more efficient routines. 2) The quality of every random number generation process depends on the underlying uniform number generator.
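As a quick illustration of the kind of check reported in Tables 1 and 2, one might draw 10,000 Laplace variates with the routine in Code 3 and compare the sample moments with the theoretical values (a usage sketch, not part of the original article):

# Assumes draw.laplace from Code 3 has been sourced.
set.seed(123)
x <- draw.laplace(10000, alpha = 4, lambda = 2)
mean(x)   # theoretical mean is alpha = 4
var(x)    # theoretical variance is 2/lambda^2 = 0.5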
McCullough (1999) raised some questions about the quality of the S-Plus generator. At the time of this writing, a source that has tested the R generator is unknown to the author. In addition, only the differences between empirical and distributional moments have been examined for each distribution. More comprehensive and computer-science-minded tests are needed, possibly using the DIEHARD suite (Marsaglia, 1995) or other well-regarded test suites.
Table 1: Comparison of theoretical and empirical means for arbitrarily chosen parameter values (nrep = 10,000).

Distribution               Parameter(s)                  Theoretical mean   Empirical mean
Left truncated normal      μ=0, σ=1, τ=0.5               1.141078           1.143811
Left truncated gamma       α=4, β=2, τ=0.5               8.002279           8.005993
Laplace                    α=4, λ=2                      4                  3.999658
Inverse Gaussian           μ=1, λ=1                      1                  1.001874
Von Mises                  K=10                          0                  0.002232
Zeta (Zipf)                α=4                           1.110626           1.109341
Logarithmic                θ=0.6                         1.637035           1.637142
Beta-binomial              α=2, β=3, n=10                4                  4.016863
Rayleigh                   σ=4                           5.013257           5.018006
Pareto                     a=5, b=5                      6.25               6.248316
Non-central t              ν=5, λ=1                      1.189416           1.191058
Non-central chi-squared    ν=5, λ=2                      7                  7.004277
Doubly non-central F       n=5, m=10, λ1=2, λ2=3         0.667381           0.666293
Standard t                 ν=5                           0                  0.001263
Weibull                    α=5, β=5                      4.590844           4.587294
Gamma with α<1             α=0.3, β=0.4                  0.12               0.118875
Gamma with α>1             α=3, β=0.4                    1.2                1.200645
Beta with α<1 and β<1      α=0.7, β=0.4                  0.636363           0.636384

Table 2: Comparison of theoretical and empirical variances for arbitrarily chosen parameter values.

Distribution               Parameter(s)                  Theoretical variance   Empirical variance
Left truncated normal      μ=0, σ=1, τ=0.5               0.603826               0.602914
Left truncated gamma       α=4, β=2, τ=0.5               15.98689               15.86869
Laplace                    α=4, λ=2                      0.5                    0.502019
Inverse Gaussian           μ=1, λ=1                      1                      0.997419
Zeta (Zipf)                α=4                           0.545778               0.556655
Logarithmic                θ=0.6                         1.412704               1.4131545
Beta-binomial              α=2, β=3, n=10                6                      6.001696
Rayleigh                   σ=4                           6.867259               6.854438
Pareto                     a=5, b=5                      2.604167               2.604605
Non-central t              ν=5, λ=1                      1.918623               1.903359
Non-central chi-squared    ν=5, λ=2                      18                     18.09787
Doubly non-central F       n=5, m=10, λ1=2, λ2=3         0.348817               0.346233
Standard t                 ν=5                           1.666667               1.661135
Weibull                    α=5, β=5                      1.105749               1.098443
Gamma with α<1             α=0.3, β=0.4                  0.048                  0.047921
Gamma with α>1             α=3, β=0.4                    0.48                   0.481972
Beta with α<1 and β<1      α=0.7, β=0.4                  0.110193               0.110126
References

Ahrens, J. H., & Dieter, U. (1974). Computer methods for sampling from gamma, beta, Poisson and binomial distributions. Computing, 1, 223-246.
Bailey, R. W. (1994). Polar generation of random variates with the t-distribution. Mathematics of Computation, 62, 779-781.
Best, A. W. (1983). A note on gamma variate generators with shape parameter less than unity. Computing, 30, 185-188.
Best, D. J., & Fisher, N. I. (1979). Efficient simulation of the von Mises distribution. Applied Statistics, 28, 152-157.
Cheng, R. C. H., & Feast, G. M. (1979). Some simple gamma variate generators. Applied Statistics, 28, 290-295.
Dagpunar, J. S. (1978). Sampling of variates from a truncated gamma distribution. Journal of Statistical Computation and Simulation, 8, 59-64.
Demirtas, H. (2004). Pseudo-random number generation in R for commonly used multivariate distributions. Journal of Modern Applied Statistical Methods, 3, 385-497.
Devroye, L. (1986). Non-uniform random variate generation. New York: Springer-Verlag.
Johnk, M. D. (1964). Erzeugung von betaverteilten und gammaverteilten Zufallszahlen. Metrika, 8, 5-15.
Kemp, A. W. (1981). Efficient generation of logarithmically distributed pseudo-random variables. Applied Statistics, 30, 249-253.
Marsaglia, G. (1995). The Marsaglia random number CDROM, including the DIEHARD battery of tests of randomness. Department of Statistics, Florida State University, Tallahassee, Florida.
McCullough, B. D. (1999). Assessing the reliability of statistical software: Part 2. The American Statistician, 53, 149-159.
Michael, J. R., Schucany, W. R., & Haas, R. W. (1976). Generating random variates using transformations with multiple roots. The American Statistician, 30, 88-90.
Robert, C. P. (1995). Simulation of truncated random variables. Statistics and Computing, 13, 169-184.
Journal of Modern Applied Statistical Methods May, 2005, Vol. 4, No.1, 312-318
JMASM17: An Algorithm And Code For Computing Exact Critical Values For Friedmans Nonparametric ANOVA
Sikha Bagui Subhash Bagui
The University of West Florida, Pensacola
Provided in this article is an algorithm and code for computing exact critical values (or percentiles) for Friedman's nonparametric rank test for k related treatment populations using Visual Basic (VB.NET). The program can calculate critical values for any number of treatment populations (k) and block sizes (b) at any significance level (α). An exact critical value table was developed for k = 2(1)5 and b = 2(1)15. This table will be useful to practitioners since it is not available in standard nonparametric statistics texts. The program can also be used to compute other critical values. Key words: Friedman's test, randomized block designs (RBD), ANOVA, Visual Basic
Introduction

When experimenting (or dealing) with randomized block designs (RBDs) (or one-way repeated measures designs), if the normality of the treatment populations or the assumption of equal variances is not met, or the data are in ranks, it is recommended that Friedman's rank-based nonparametric test be used as an alternative to the conventional F test for the RBD (or one-way repeated measures analysis of variance) for k related treatment populations. This test was developed by Friedman (1937), and was designed to test the null hypothesis that all the k treatment populations are identical versus the alternative that at least two of the treatment populations differ in location. The test is based on a statistic that is a rank analogue of SST (total sum of squares) for the RBD and
is computed in the following manner. After the data from a RBD are obtained, the observed values in each of the b blocks are ranked from 1 (the smallest in the block) to k (the largest in the block). Let Ri denote the sum of the ranks of the values corresponding to treatment population i, i = 1, 2, ..., k. Then Friedman's test statistic is given by

Fr = [12 / (bk(k + 1))] Σ_{i=1}^{k} Ri² − 3b(k + 1).
Sikha Bagui is an Assistant Professor in the Department of Computer Science. Her areas of research are database and database design, data mining, pattern recognition, and statistical computing. Email: [email protected]. Subhash Bagui is a Professor in the Department of Mathematics and Statistics. His areas of research are statistical classification and pattern recognition, bio-statistics, construction of designs, tolerance regions, statistical computing and reliability. Email: [email protected].
If the null hypothesis is true, it is expected that the rankings will be randomly distributed within each block. If that is the case, the sums of the rankings for the treatment populations will be approximately equal, and the resulting value of Fr will be small. If the alternative hypothesis is true, the expectation is that this will lead to differences among the Ri values and correspondingly large values of Fr. Thus, the null hypothesis is rejected in favor of the alternative hypothesis for large values of Fr. The exact null sampling distribution of Fr is not known, but, as with the Kruskal-Wallis (1952) statistic, the null distribution of Friedman's Fr can be approximated by a chi-square (χ²) distribution with k − 1 degrees of freedom.
Under the null hypothesis, it is assumed that all observations for the treatment populations are from the same population. Therefore, to find the null distribution of the Fr statistic, first generate b uniform pseudo-random numbers from the interval (0,1) for each of the k treatment populations. Assume that the probability of a tie is zero. The random variates within each block are then ranked from 1 to k. The program then calculates the rank sum of each treatment population, Ri, and computes the value of the Fr statistic

Fr = [12 / (bk(k + 1))] Σ_{i=1}^{k} Ri² − 3b(k + 1).
This process is replicated a sufficient number of times until the null distribution of the Fr statistic is modeled adequately. The program then returns the critical value associated with a percentile fraction of 0.90, 0.95, 0.975, or 0.99 (or, equivalently, a significance level α of 0.10, 0.05, 0.025, or 0.01). In some cases the returned value may hold for a range of p-values. With an adequate number of runs, this VB.NET program yields the same values reported by Lehmann (1998) in Table M. Table 1 below provides critical values for the Fr test for b = 2(1)15, k = 2(1)5 and α = 0.1, 0.05, 0.025, 0.01. The notation F(1−α) in Table 1 denotes the (1 − α)100% percentile of the Fr statistic, which is equivalent to the level-α critical value of the Fr statistic. This table will be useful to practitioners since it is not available in standard statistics texts with a chapter on nonparametric statistics. The critical values in Table 1 were generated using 1 million replications in each case.
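The same simulation idea can be expressed compactly in R (a sketch for illustration only, not the authors' VB.NET program):

# Monte Carlo null distribution of Friedman's Fr for b blocks and k treatments.
sim.friedman <- function(b, k, nreps = 1e5) {
  fr <- numeric(nreps)
  for (r in 1:nreps) {
    ranks <- t(apply(matrix(runif(b * k), b, k), 1, rank))  # rank within each block
    R <- colSums(ranks)                                     # treatment rank sums
    fr[r] <- 12 / (b * k * (k + 1)) * sum(R^2) - 3 * b * (k + 1)
  }
  fr
}
quantile(sim.friedman(b = 6, k = 3, nreps = 20000), 0.95)   # approximate 95th percentile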
Table 1. Exact critical values F(1−α) of the Friedman Fr statistic for k = 2(1)5 treatment populations (columns) and b = 2(1)15 blocks. Within each block of rows below, the 14 values correspond to b = 2, 3, ..., 15.

k = 2
  F0.90:  2.0000 3.0000 4.0000 1.800  2.6667 3.5714 2.0000 2.7778 3.6000 2.2727 3.000  1.9231 2.5714 3.2667
  F0.95:  2.0000 3.0000 4.0000 5.0000 2.6667 3.5714 4.5000 2.7778 3.6000 4.4545 3.0000 3.7692 4.5714 3.2667
  F0.975: 2.0000 3.0000 4.0000 5.0000 2.6667 3.5714 4.5000 5.4444 3.6000 4.4545 5.3333 3.7692 4.5714 5.4000
  F0.99:  2.0000 3.0000 4.0000 5.0000 2.6667 7.0000 4.5000 5.4444 6.4000 7.3636 5.3333 6.2308 7.1429 5.4000

k = 3
  F0.90:  4.0000 4.6667 4.5000 4.8000 4.3333 4.5714 4.7500 4.6667 4.2000 4.9091 4.6667 4.5841 4.4286 4.8000
  F0.95:  4.0000 4.6667 6.0000 5.2000 6.3333 6.0000 5.2500 6.0000 5.6000 5.6364 6.1667 5.9469 5.5714 5.7333
  F0.975: 4.0000 6.0000 6.5000 6.4000 7.0000 7.1429 7.0000 6.8889 7.4000 7.0909 7.1667 7.2920 7.0000 6.9333
  F0.99:  4.0000 6.0000 6.5000 7.6000 8.3333 8.0000 7.7500 8.6667 8.6000 8.9091 8.6667 9.0796 9.0000 8.5333

k = 4
  F0.90:  5.4000 5.8000 6.0000 6.1200 6.2000 6.2571 6.1500 6.0667 6.2400 6.1636 6.1000 6.0462 6.2571 6.2000
  F0.95:  5.4000 7.0000 7.5000 7.3200 7.4000 7.6286 7.5000 7.5333 7.5600 7.5818 7.6000 7.6154 7.6286 7.5600
  F0.975: 6.0000 7.4000 8.1000 8.2800 8.6000 8.6571 8.8500 8.7333 8.8800 8.8909 9.0000 9.0000 9.0000 9.0800
  F0.99:  6.0000 8.2000 9.3000 9.7200 10.0000 10.3714 10.3500 10.4667 10.6800 10.6364 10.7000 10.7539 10.8857 10.7600

k = 5
  F0.90:  6.8000 7.2000 7.4000 7.5200 7.6000 7.6571 7.7000 7.6444 7.6800 7.7091 7.6667 7.6923 7.7143 7.6800
  F0.95:  7.2000 8.2667 8.6000 8.8000 8.9333 9.0286 9.2000 9.1556 9.2000 9.2363 9.2667 9.2923 9.3143 9.3333
  F0.975: 7.6000 9.3333 9.6000 10.0800 10.2667 10.5143 10.6400 10.5778 10.6400 10.7636 10.7333 10.7692 10.8000 10.8267
  F0.99:  7.6000 9.8667 11.0000 11.5200 11.8667 12.0043 12.2000 12.3556 12.4000 12.5818 12.5333 12.7385 12.7429 12.7467
References

Conover, W. J. (1999). Practical Nonparametric Statistics (3rd ed.). New York: Wiley.
Friedman, M. (1937). The use of ranks to avoid the assumption of normality implicit in the analysis of variance. Journal of the American Statistical Association, 32, 675-701.
Headrick, T. C. (2003). An algorithm for generating exact critical values for the Kruskal-Wallis one-way ANOVA. Journal of Modern Applied Statistical Methods, 2, 268-271.
Hollander, M., & Wolfe, D. A. (1973). Nonparametric Statistical Methods. New York: John Wiley.
Kruskal, W. H., & Wallis, W. A. (1952). Use of ranks in one-criterion variance analysis. Journal of the American Statistical Association, 47, 583-621.
Lehmann, E. L. (1998). Nonparametrics: Statistical Methods Based on Ranks. New Jersey: Prentice Hall.
Minitab. (2000). Minitab for Windows, release 13.3. Minitab, Inc., State College, PA.
SPSS. (2002). SPSS for Windows, version 11.0. SPSS, Inc., Chicago, IL.
Appendix:
Imports System.Windows.Forms Public Class Form1 Inherits System.Windows.Forms.Form Dim sum = 0, squared = 0, square_sum = 0, m = 0, n = 0, i = 0, j = 0, k = 0, l = 0, p = 0, q = 0, r As Integer Dim count = 0, v = 0, z As Integer Dim num As Single Dim percentile As Single Dim f As Single Dim file1 As System.IO.StreamWriter
316
Dim array1(,) As Single = New Single(,) {}
Dim array6(,) As Single = New Single(,) {}
Dim array3() As Single = New Single() {}
Dim array4() As Single = New Single() {}
Dim array5() As Integer = New Integer() {}
Dim array7() As Integer = New Integer() {}
Dim array8() As Single = New Single() {}
Dim array9() As Single = New Single() {}

Private Sub Form1_Load(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles MyBase.Load
    'Calling the random number generator
    Randomize()
End Sub

Private Sub Button1_Click(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles Button1.Click
    m = Val(TextBox1.Text) 'm is the number of rows(blocks)
    n = Val(TextBox2.Text) 'n is the number of columns
    z = Val(TextBox3.Text) 'z is the number of runs
    percentile = Val(TextBox4.Text) 'percentile value

    'Defining the arrays
    Dim array1(m, n) As Single
    Dim array6(m, n) As Single
    Dim array3(n) As Single
    Dim array4(n) As Single
    Dim array5(n) As Integer
    Dim array7(n) As Integer
    Dim array8(z) As Single
    Dim array9(z) As Single
Dim row, col As Integer Dim output As String For p = 1 To z output = " " 'creating initial m x n random array For row = 1 To m For col = 1 To n array1(row, col) = Rnd() output &= array1(row, col) & " " Next output &= vbCrLf Next j = 1 k = 1 For r = 1 To array1.GetUpperBound(0) 'pulling out one row For col = 1 To n 'array1.GetUpperBound(0) num = array1(j, col) array3(col) = num output &= array3(col) & " " Next
317
j = j + 1 'copying one row into new array For row = 1 To array3.GetUpperBound(0) array4(row) = array3(row) Next 'sorting one row Array.Sort(array3)
'ranking row For row = 1 To array4.GetUpperBound(0) For i = 1 To array3.GetUpperBound(0) If array4(row) = array3(i) Then array5(row) = i End If Next Next 'putting row back into two dimensional array For row = 1 To array5.GetUpperBound(0) output &= array5(row) & " " array6(k, row) = array5(row) Next output &= vbCrLf k = k + 1 Next output = " " 'displaying two dimensional array For row = 1 To array6.GetUpperBound(0) For col = 1 To n 'array6.GetUpperBound(0) output &= array6(row, col) & " " Next output &= vbCrLf Next 'summing columns in two dimensional array l = 1 sum = 0 square_sum = 0 For col = 1 To n 'array6.GetUpperBound(0) For row = 1 To array6.GetUpperBound(0) sum += array6(row, l) Next output = sum square_sum = sum * sum array7(l) = square_sum output &= vbCrLf output &= array7(l) & " " l = l + 1 sum = 0 square_sum = 0
318
        Next

        f = 0
        squared = 0
        For row = 1 To array7.GetUpperBound(0)
            squared += array7(row)
        Next
        output = squared
        output = " "
        f = Convert.ToSingle(12 / (m * n * (n + 1)) * squared - 3 * m * (n + 1))
        array8(p) = f
        output &= array8(p) & " "
        f = 0
        squared = 0
    Next

    output = " "
    For row = 1 To array8.GetUpperBound(0)
        output &= array8(row) & " "
    Next
    For row = 1 To array8.GetUpperBound(0)
        array9(row) = array8(row)
        output &= array9(row) & " "
    Next
    Array.Sort(array9) 'Array9 - sorted F values
    For row = 1 To array9.GetUpperBound(0)
        output &= array9(row) & " "
    Next
    output = " "
    count = 0
    For row = 1 To array9.GetUpperBound(0)
        count += 1
    Next
    output = count
    v = percentile * count
    output = " "
    output = array9(v)
    MessageBox.Show(output, "95% percentile value")
End Sub
End Class
Journal of Modern Applied Statistical Methods May, 2005, Vol. 4, No.1, 319-332
An Algorithm For Generating Unconditional Exact Permutation Distribution For A Two-Sample Experiment
J. I. Odiase S. M. Ogbonmwan
Department of Mathematics University of Benin, Nigeria
An algorithm that generates the unconditional exact permutation distribution of a 2 x n experiment is presented. The algorithm is able to handle ranks as well as actual observations. It makes it possible to obtain exact p-values for several statistics, especially when sample sizes are small and the application of large sample approximations is unreliable. An illustrative implementation is achieved and leads to the computation of exact p-values for the Mood test when the sample size is small. Key words: permutation test, Monte Carlo test, p-value, rank order statistic, Mood test
Introduction

An important part of statistical inference is the representation of observed data in terms of a p-value. In fact, the p-value plays a major role in determining whether to accept or reject the null hypothesis. The p-value assists in establishing whether the observed data are statistically significant, and so any statistical approach that guarantees its proper computation should be developed and employed in inferential statistics, so that the probability of making a type I error is exactly α. In practice, data are usually collected under varied conditions yet analyzed with distributional assumptions, such as that the data came from a normal distribution. It is advisable to avoid as much as possible making so many distributional
J. I. Odiase ([email protected]) is a Lecturer in the Department of Mathematics, University of Benin, Nigeria. His research interests include statistical computing and nonparametric statistics. S. M. Ogbonmwan ([email protected]) is an Associate Professor of Statistics, Department of Mathematics, University of Benin, Nigeria. His research interests include statistical computing and nonparametric statistics.
assumptions, because data are usually never collected under ideal or perfect conditions, that is, they do not conform perfectly to an assumed distribution or model employed in the analysis. The p-value obtained through the permutation approach turns out to be the most reliable because it is exact; see Agresti (1992) and Good (2000). If the experiment to be analyzed is made up of small or sparse data, large sample procedures for statistical inference are not appropriate (Senchaudhuri et al., 1995; Siegel & Castellan, 1988). In this article, consideration is given to the special case of 2 x n tables with row and column totals allowed to vary with each permutation; this seems more natural than fixing the row and column totals. This is the unconditional exact permutation approach, which is all-inclusive, rather than the constrained or conditional exact permutation approach of fixing row and column totals. The latter approach mainly addresses contingency tables (Agresti, 1992). Several approaches have been suggested as alternatives to the computationally intensive unconditional exact permutation; see Fisher (1935) and Agresti (1992) for a discussion of the exact conditional permutation distribution. Also see Efron (1979), Hall and Tajvidi (2002), Efron and Tibshirani (1993), and Opdyke (2003) for Monte Carlo approaches. Other approaches, like the Bayesian and the likelihood approaches, have also been found useful in obtaining exact permutation
distributions (Bayarri & Berger, 2004; Spiegelhalter, 2004). Large sample approximations are commonly adopted in several nonparametric tests as alternatives to tabulated exact critical values. The basic assumption required for such approximations to be reliable is that the sample size should be sufficiently large; however, there is no generally agreed upon definition of what constitutes a large sample size (Fahoome, 2002). Available software for exact inference is expensive, with varied restrictions on the implementation of exact permutation procedures, and computational time is highly prohibitive even with the very fast processor speeds of available personal computers. R. A. Fisher compiled by hand the 32,768 permutations of Charles Darwin's data on the height of cross-fertilized and self-fertilized Zea mays plants. The enormity of this task possibly discouraged Fisher from probing further into exact permutation tests (Ludbrook & Dudley, 1998). Permutation tests provide exact results, especially when complete enumeration is feasible. A comprehensive documentation of the properties of permutation tests can be found in Pesarin (2001). The problem with permutation tests has been their high computational demands, viz. space and time complexities. Most of the available permutation procedures sample from the permutation sample space rather than carrying out a complete enumeration of all possible distinct rearrangements; see Opdyke (2003) for a detailed listing of widely available permutation sampling procedures. Opdyke (2003), however, observed that most of the existing procedures can perform Monte Carlo sampling without replacement within a sample, but none can avoid the possibility of drawing the same sample more than once, thereby reducing the power of the permutation test. The purpose of this article is to fashion out a sure and efficient way of obtaining the unconditional exact permutation distribution by ensuring that a complete enumeration of all the distinct permutations of any 2-sample experiment is achieved. This produces exact p-values and therefore ensures that the probability of making a type I error is exactly α.
Let Xi = (xi1, xi2, ..., xin_i), i = 1, 2, where ni is the ith sample size. Also, let XN = (X1, X2), where N = n1 + n2. XN is composed of N independent and identically distributed random variables. There are

  (n1 + n2)! / (n1! n2!) = N! / (n1! n2!)

distinct permutations of the variates between the two samples, and for equal sample sizes (n1 = n2 = n) this equals N!/(n!)².

For all possible permutations of the N variates, systematically develop a pattern necessary for the algorithm required for the generation of all the distinct permutations. The presentation of the systematic generation of all the possible permutations of the N variates now follows. Examine an experiment of two samples (treatments), each with two variates, i.e.,

  x11  x21
  x12  x22

The expectation is to have 4!/(2! 2!) = 6 permutations, namely

  1:  x11  x21    2:  x21  x11    3:  x22  x21
      x12  x22        x12  x22        x12  x11

  4:  x11  x12    5:  x11  x21    6:  x21  x11
      x21  x22        x22  x12        x22  x12

Numbers 1 - 6 on top of the permutations represent the permutation numbers. The actual process of permuting the variates of the experiment reveals the following:

  original arrangement of the experiment     1 permutation
  exchange x11 with x2i, i = 1, 2            2 permutations
  exchange x12 with x2i, i = 1, 2            2 permutations
  exchange the samples (columns)             1 permutation

In an attempt to offer a mathematical explanation for the method of exchanges of variates leading to the algorithm, observe that

  C(2,0)C(2,0) = 1 x 1 = 1 permutation   (original arrangement of the experiment)
  C(2,1)C(2,1) = 2 x 2 = 4 permutations  (using one variate from the first sample)
  C(2,2)C(2,2) = 1 x 1 = 1 permutation   (exchange the samples, i.e., 2 variates)

  Total = 1 + 4 + 1 = 6,

where C(a, b) denotes the binomial coefficient "a choose b".
Observe that permutation (1) is the original arrangement, permutations (2) to (5) are obtained by using the elements of the first column to interchange the elements of the second column, one at a time, and permutation (6) is obtained by interchanging the columns of the original arrangement of the experiment, making use of the two elements in the first column.

Next, examine a 2-sample experiment where each sample has three variates, i.e.,

  x11  x21
  x12  x22
  x13  x23

The expectation is to have 6!/(3! 3!) = 20 permutations, which are given in Table 2. The process of permuting the variates reveals the following:

  original arrangement of the experiment                                        1 permutation
  exchange x11 with x2i, i = 1, 2, 3                                            3 permutations
  exchange x12 with x2i, i = 1, 2, 3                                            3 permutations
  exchange x13 with x2i, i = 1, 2, 3                                            3 permutations
  exchange two variates of the first sample with two of the second (3 x 3)      9 permutations
  exchange the samples (columns), i.e., three variates                          1 permutation

In an attempt to offer a mathematical explanation, observe that

  C(3,0)C(3,0) = 1 x 1 = 1 permutation   (original arrangement of the experiment)
  C(3,1)C(3,1) = 3 x 3 = 9 permutations  (using one variate from the first sample)
  C(3,2)C(3,2) = 3 x 3 = 9 permutations  (using two variates from the first sample)
  C(3,3)C(3,3) = 1 x 1 = 1 permutation   (exchange the samples, i.e., three variates)

  Total = 1 + 9 + 9 + 1 = 20.

Similarly, observe that permutation (1) is the original matrix, permutations (2) to (10) are obtained by using the elements of the first column to interchange the elements of the second column, one at a time, permutations (11) to (19) are obtained by using two elements of the first column to interchange the elements of the second column, and permutation (20) is obtained by interchanging the columns of the original arrangement of the experiment.

Continuing in the above fashion, clearly, the number of distinct permutations for any 2-sample experiment can be written as

  Σ_{i=0}^{n} C(n1, i) C(n2, i),  with C(a, b) = 0 for b > a,

which for equal sample sizes (n1 = n2 = n) becomes

  Σ_{i=0}^{n} C(n, i)² = C(2n, n) = N!/(n!)².

After obtaining all the distinct permutations from a complete enumeration, the statistic of interest is computed for each permutation. Each value of the statistic obtained from the complete enumeration occurs with probability n1! n2!/N!.
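The counting identity above is easy to verify numerically; for example, for two samples of five variates each (an R sketch, purely illustrative):

n1 <- 5; n2 <- 5
sum(choose(n1, 0:n1) * choose(n2, 0:n1))  # sum over i of C(n1,i)*C(n2,i) = 252
choose(n1 + n2, n1)                       # N!/(n1! n2!) = 252 distinct permutations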
Table 2. The 20 distinct permutations of the 2-sample experiment with three variates per sample (permutation 1 is the original arrangement x11, x12, x13 | x21, x22, x23; the remaining permutations are generated by the exchanges described above).
The exact permutation distribution of the statistic is thereafter obtained by simply tabulating the distinct values of the statistic against their probabilities of occurrence in the complete enumeration. This method of obtaining the unconditional exact permutation distribution also suffices when ranks of the observations of an experiment are used instead of the actual observations. In handling ranks with this approach, tied observations do not pose any problems, because the permutation process will be implemented as if the tied observations or ranks are distinct. Given an n x p experiment

  XN = ( x11  x21  ...  xp1 )
       ( x12  x22  ...  xp2 )
       ( ...            ... )
       ( x1n  x2n  ...  xpn ),   N = np,

with xij as actual observations, i = 1, 2, ..., p, j = 1, 2, ..., n, for some rank order statistic, replace these observations with ranks. In order to achieve this, do a combined ranking from the smallest to the largest observation. For equal sample sizes, this yields an n x p matrix of ranks represented as follows:

  RN = (Rn(1), Rn(2), ..., Rn(p)),   N = np,

where Rn(j) = (R1(j), R2(j), ..., Rn(j))' and Ri(j) is the ith rank for sample j; see Sen and Puri (1967) for an expository discussion of rank order statistics. At this stage, the method can now be applied to this matrix of ranks. Note that any rearrangement or permutation of this matrix of ranks can be used in generating all the other distinct permutations.
The first step in developing the algorithm is to formulate the matrix of ranks, by adopting the trivial permutation, because it does not matter what rearrangement of the actual matrix of ranks is used in initiating the process of permutation, that is,
1 2 3 n1
n1 + 1 n1 + 2 n1 + 3 n1 + n 2
i =0
where n is the number of variates in each sample (column) i.e., the balanced case. The computer algorithm now follows.
For the above matrix of ranks, ensure that ties are taken care of, by replacing ranks of tied observations with the mean of their ranks. In designing the computer algorithm for the method of complete enumeration via permutation described so far, it is intended that all statements should be read like sentences or as a sequence of commands. We write Set T 1, where Set is part of the statement language and T is a variable. Words that form the statement language required for this work include: do, od, else, for, if, fi, set, then, through, to, as used in Goodman and Hedetniemi (1977). To distinguish variable names from words in the statement language, variable names appear in full capital letters.
Results

Algorithm (RANK): Generation of the trivial matrix of ranks
Step 1. Set P ← number of treatments; K ← number of variates
Step 2. For I ← 1 to P do through Step 4
Step 3. For J ← 1 to K do through Step 4
Step 4. [X is the matrix of ranks] Set X(J, I) ← (I - 1)K + J od

For all possible permutations of the N samples of p subsets of size n, the model of the number of permutations required for the computer algorithm for an experiment of two samples is:
$\sum_{i=0}^{n} \binom{n}{i}\binom{n}{i}$ permutations.
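The count can be checked numerically; the short Python fragment below (illustrative only, not part of the article's program) confirms that the sum equals $\binom{2n}{n}$, giving, for example, the 252 permutations used later for the 2 x 5 experiment:

# Check: sum over i of C(n, i)^2 equals C(2n, n) for the balanced two-sample case.
from math import comb

for n in (2, 3, 5):
    total = sum(comb(n, i) ** 2 for i in range(n + 1))
    print(n, total, comb(2 * n, n))   # n = 5 gives 252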
Algorithm (PERMUTATION)
Step 1. For J1 ← 1 to K do through Step 5
Step 2. Set TEMP ← X(J1, P - 1), I1 ← P
Step 3. For J2 ← 1 to K do Step 5
Step 4. Set X(J1, P - 1) ← X(J2, I1), X(J2, I1) ← TEMP
Step 5. [Compute statistic and restore original values of X] od
Step 6. For I ← 1 to K - 1 do through Step 16
Step 7. Set TEMP1 ← X(I, P - 1)
Step 8. For J ← I + 1 to K do through Step 16
Step 9. Set TEMP2 ← X(J, P - 1)
Step 10. For L ← P to P do through Step 16
Step 11. For I1 ← 1 to K do through Step 16
Step 12. For L1 ← L to P do through Step 16
Step 13. If L = L1 then Set T ← I1 + 1 else Set T ← 1 fi
Step 14. For J1 ← T to K do Step 16
Step 15. Set X(I, P - 1) ← X(I1, L), X(I1, L) ← TEMP1, X(J, P - 1) ← X(J1, L1), X(J1, L1) ← TEMP2
Step 16. [Compute statistic and restore original values of X] od
Step 17. For I ← 1 to K - 2 do through Step 32
Step 18. Set TEMP1 ← X(I, P - 1)
Step 19. For J ← I + 1 to K - 1 do through Step 32
Step 20. Set TEMP2 ← X(J, P - 1)
Step 21. For M ← J + 1 to K do through Step 32
Step 22. Set TEMP3 ← X(M, P - 1)
Step 23. For L ← P to P do through Step 32
Step 24. For I1 ← 1 to K do through Step 32
Step 25. For L1 ← L to P do through Step 32
Step 26. If L = L1 then Set T ← I1 + 1 else Set T ← 1 fi
Step 27. For J1 ← T to K do through Step 32
Step 28. For L2 ← L1 to P do through Step 32
Step 29. If L1 = L2 then Set T1 ← J1 + 1 else Set T1 ← 1 fi
Step 30. For J2 ← T1 to K do Step 32
Step 31. Set X(I, P - 1) ← X(I1, L), X(I1, L) ← TEMP1, X(J, P - 1) ← X(J1, L1), X(J1, L1) ← TEMP2, X(M, P - 1) ← X(J2, L2), X(J2, L2) ← TEMP3
Step 32. [Compute statistic and restore original values of X] od
Step 33. For I ← 1 to K - 3 do through Step 53
Step 34. Set TEMP1 ← X(I, P - 1)
Step 35. For J ← I + 1 to K - 2 do through Step 53
Step 36. Set TEMP2 ← X(J, P - 1)
Step 37. For M ← J + 1 to K - 1 do through Step 53
Step 38. Set TEMP3 ← X(M, P - 1)
Step 39. For N ← M + 1 to K do through Step 53
Step 40. Set TEMP4 ← X(N, P - 1)
Step 41. For L ← P to P do through Step 53
Step 42. For I1 ← 1 to K do through Step 53
Step 43. For L1 ← L to P do through Step 53
Step 44. If L = L1 then Set T ← I1 + 1 else Set T ← 1 fi
Step 45. For J1 ← T to K do through Step 53
Step 46. For L2 ← L1 to P do through Step 53
Step 47. If L1 = L2 then Set T1 ← J1 + 1 else Set T1 ← 1 fi
Step 48. For J2 ← T1 to K do through Step 53
Step 49. For L3 ← L2 to P do through Step 53
Step 50. If L2 = L3 then Set T2 ← J2 + 1 else Set T2 ← 1 fi
Step 51. For J3 ← T2 to K do Step 53
Step 52. Set X(I, P - 1) ← X(I1, L), X(I1, L) ← TEMP1, X(J, P - 1) ← X(J1, L1), X(J1, L1) ← TEMP2, X(M, P - 1) ← X(J2, L2), X(J2, L2) ← TEMP3, X(N, P - 1) ← X(J3, L3), X(J3, L3) ← TEMP4 od
Step 53. [Compute statistic and restore original values of X] od
Step 54. [Interchange samples and compute statistic]
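As a high-level illustration of what RANK and PERMUTATION accomplish for two equal samples (this sketch is ours and is not the authors' FORTRAN translation), the same complete enumeration can be expressed in Python by generating every distinct assignment of the ranks 1, ..., 2n to the first sample and evaluating the statistic of interest on each arrangement:

# Enumerate every distinct split of the ranks 1..2n into two samples of size n
# and tabulate the values of a supplied statistic.
from itertools import combinations

def exact_distribution(n, statistic):
    ranks = list(range(1, 2 * n + 1))            # trivial matrix of ranks
    counts = {}
    total = 0
    for first in combinations(ranks, n):         # ranks assigned to sample 1
        second = [r for r in ranks if r not in first]
        value = statistic(list(first), second)
        counts[value] = counts.get(value, 0) + 1
        total += 1
    # each arrangement occurs with probability 1/total in the complete enumeration
    return {v: c / total for v, c in sorted(counts.items())}, total

For n = 5 the enumeration visits the 252 arrangements listed in the Appendix; as in the algorithm above, nothing needs to be stored beyond the running tabulation of the statistic.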
The PERMUTATION algorithm was translated into FORTRAN code and implemented in Intel Visual FORTRAN for a 2 x 5 experiment. The 252 distinct permutations generated are presented in the Appendix. The algorithm can be extended to any sample size, depending on the processor speed and memory space of the computer being used to implement the algorithm. For an optimal management of computer memory (space complexity), the permutations are not stored; each is discarded as soon as the statistic of interest has been computed.

By way of illustration, generate the p-values for a 2 x 5 experiment for the Mood test. For α = 0.05, Fahoome (2002) noted that the sample size should exceed 5 for the large sample approximation to be adopted for the Mood test. The unconditional permutation approach makes it possible to obtain exact p-values even for fairly large sample sizes. Given two samples, y11, y12, ..., y1n and y21, y22, ..., y2n, the test statistic for the Mood test is

$M = \sum_{i=1}^{n}\left(R_{1i} - \frac{2n+1}{2}\right)^{2}$

for equal sample sizes, where R1i is the rank of y1i, i = 1, 2, ..., n, obtained after carrying out a combined ranking for the two samples. The large sample approximation for equal samples is

$z = \frac{M - n(N^{2}-1)/12}{\sqrt{n^{2}(N+1)(N^{2}-4)/180}}$

where N = 2n and M is the Mood test statistic. The p-values obtained are presented in Table 3 and the distribution of the test statistic is represented graphically in Figure 1.
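A small helper for the large-sample approximation above (ours, written for equal samples only) may make the later comparison with the exact distribution easier to follow:

# Large-sample z approximation for the Mood statistic with n1 = n2 = n, N = 2n.
from math import sqrt

def mood_z(M, n):
    N = 2 * n
    mean = n * (N ** 2 - 1) / 12.0
    variance = n ** 2 * (N + 1) * (N ** 2 - 4) / 180.0
    return (M - mean) / sqrt(variance)

# mood_z(39.25, 5) is about -0.17 (0.17 in absolute value), consistent with the
# worked example that follows.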
Table 3. p-values for the Mood Statistic.

M      p(M)    p-value        M      p(M)    p-value
11.25  0.0079  0.0079         43.25  0.0397  0.6032
15.25  0.0079  0.0159         45.25  0.0476  0.6508
17.25  0.0159  0.0317         47.25  0.0714  0.7222
21.25  0.0317  0.0635         49.25  0.0397  0.7619
23.25  0.0317  0.0952         51.25  0.0397  0.8016
25.25  0.0159  0.1111         53.25  0.0476  0.8492
27.25  0.0397  0.1508         55.25  0.0397  0.8889
29.25  0.0476  0.1984         57.25  0.0159  0.9048
31.25  0.0397  0.2381         59.25  0.0317  0.9365
33.25  0.0397  0.2778         61.25  0.0317  0.9682
35.25  0.0714  0.3492         65.25  0.0159  0.9841
37.25  0.0476  0.3968         67.25  0.0079  0.9921
39.25  0.0397  0.4365         71.25  0.0079  1.0000
41.25  0.1270  0.5635
[Figure 1. Distribution of the Mood test statistic: Probability plotted against Test statistic (M).]
Clearly, results obtained from using the Normal distribution, which is the large sample asymptotic distribution for the Mood test, will certainly not be exactly the same as those obtained using the exact permutation distribution, especially for small sample sizes. The permutation approach produces the exact p-values.

Example

Consider the following example on page 278 of Freund (1979) on difference of means.

Table 2. Heat-producing capacity of coal in millions of calories per tonne

Mine 1:  8400  8230  8380  7860  7930
Mine 2:  7510  7690  7720  8070  7660
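The analysis reported in the next paragraph can be reproduced with a short Python fragment (illustrative only, not the authors' program); it computes the Mood statistic for the Table 2 data and its cumulative probability over the complete enumeration, and should agree with the values quoted below and in Table 3:

# Mood statistic and exact cumulative probability for the coal data.
from itertools import combinations

mine1 = [8400, 8230, 8380, 7860, 7930]
mine2 = [7510, 7690, 7720, 8070, 7660]
combined = sorted(mine1 + mine2)
N = len(combined)

def mood(ranks_of_sample1):
    centre = (N + 1) / 2.0
    return sum((r - centre) ** 2 for r in ranks_of_sample1)

# observed M: ranks of Mine 1 in the combined ranking (these data have no ties)
obs_ranks = [combined.index(y) + 1 for y in mine1]
M_obs = mood(obs_ranks)

# complete enumeration over all C(10, 5) = 252 assignments of ranks to sample 1
values = [mood(c) for c in combinations(range(1, N + 1), len(mine1))]
p_cum = sum(v <= M_obs for v in values) / len(values)
print(M_obs, round(p_cum, 4))   # expected to match M = 39.25 and Table 3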
Subjecting the data in Table 2 to the Mood test, the test statistic is M = 39.25, and from Table 3, which contains the unconditional exact permutation distribution of the Mood test statistic, the corresponding p-value is 0.4365. This exceeds α = 0.05, suggesting that we cannot reject the null hypothesis of no difference between the heat-producing capacities of coal from the two mines. Adopting the large sample Normal approximation for the Mood test, the calculated z is 0.17, which gives a p-value of 0.4325; this exceeds α/2 = 0.025, meaning that the observed data are compatible with the null hypothesis of no difference, as earlier obtained from the exact permutation test.

Conclusion

Several authors have attempted to obtain exact p-values for different statistics using the permutation approach. Two things have made their attempts an uphill task. First is the speed of computer required to perform a permutation test. Until recently, the speed of available computers has been grossly inadequate to handle complete enumeration for even small sample sizes. Recent advances in computer design have drawn researchers in this area closer to the realization of complete enumeration even for fairly large
Senchaudhuri, P., Mehta, C. R., & Patel, N. T. (1995). Estimating exact p-values by the method of control variates or Monte Carlo rescue. Journal of the American Statistical Association, 90, 640-648.
Siegel, S., & Castellan, N. J. (1988). Nonparametric statistics for the behavioral sciences (2nd ed.). New York: McGraw-Hill.
Spiegelhalter, D. J. (2004). Incorporating Bayesian ideas into health-care evaluation. Statistical Science, 19, 156-174.
Appendix: The 252 distinct permutations for the 2 x 5 experiment (four permutations per row; the number preceding each permutation is its index)
1 1 6 2 7 3 8 4 9 5 10 13 1 6 2 3 7 8 4 9 5 10 25 1 6 2 7 3 8 4 5 9 10 37 6 1 2 3 7 8 4 9 5 10
2 6 1 2 7 3 8 4 9 5 10 14 1 6 2 7 8 3 4 9 5 10 26 1 6 2 7 3 8 4 9 10 5 38 6 1 2 7 8 3 4 9 5 10
3 7 6 2 1 3 8 4 9 5 10 15 1 6 2 7 9 8 4 3 5 10 27 6 1 7 2 3 8 4 9 5 10 39 6 1 2 7 9 8 4 3 5 10
4 8 6 2 7 3 1 4 9 5 10 16 1 6 2 7 10 8 4 9 5 3 28 6 1 8 7 3 2 4 9 5 10 40 6 1 2 7 10 8 4 9 5 3
5 9 6 2 7 3 8 4 1 5 10 17 1 4 2 7 3 8 6 9 5 10 29 6 1 9 7 3 8 4 2 5 10 41 7 6 2 1 8 3 4 9 5 10
6 10 6 2 7 3 8 4 9 5 1 18 1 6 2 4 3 8 7 9 5 10 30 6 1 10 7 3 8 4 9 5 2 42 7 6 2 1 9 8 4 3 5 10
7 1 2 6 7 3 8 4 9 5 10 19 1 6 2 7 3 4 8 9 5 10 31 7 6 8 1 3 2 4 9 5 10 43 7 6 2 1 10 8 4 9 5 3
8 1 6 7 2 3 8 4 9 5 10 20 1 6 2 7 3 8 9 4 5 10 32 7 6 9 1 3 8 4 2 5 10 44 8 6 2 7 9 1 4 3 5 10
9 1 6 8 7 3 2 4 9 5 10 21 1 6 2 7 3 8 10 9 5 4 33 7 6 10 1 3 8 4 9 5 2 45 8 6 2 7 10 1 4 9 5 3
10 1 6 9 7 3 8 4 2 5 10 22 1 5 2 7 3 8 4 9 6 10 34 8 6 9 7 3 1 4 2 5 10 46 9 6 2 7 10 8 4 1 5 3
11 1 6 10 7 3 8 4 9 5 2 23 1 6 2 5 3 8 4 9 7 10 35 8 6 10 7 3 1 4 9 5 2 47 6 1 2 4 3 8 7 9 5 10
12 1 3 2 7 6 8 4 9 5 10 24 1 6 2 7 3 5 4 9 8 10 36 9 6 10 7 3 8 4 1 5 2 48 6 1 2 7 3 4 8 9 5 10
Appendix Continued: 49 6 1 2 7 3 8 9 4 5 10 61 7 6 2 1 3 5 4 9 8 10 73 1 6 7 2 10 8 4 9 5 3 85 1 6 8 7 3 2 10 9 5 4 97 1 3 2 4 6 8 7 9 5 10 109 1 3 2 7 6 8 4 5 9 10 121 1 6 2 4 3 5 7 9 8 10 50 6 1 2 7 3 8 10 9 5 4 62 7 6 2 1 3 8 4 5 9 10 74 1 6 8 7 9 2 4 3 5 10 86 1 6 9 7 3 8 10 2 5 4 98 1 3 2 7 6 4 8 9 5 10 110 1 3 2 7 6 8 4 9 10 5 122 1 6 2 4 3 8 7 5 9 10 51 7 6 2 1 3 4 8 9 5 10 63 7 6 2 1 3 8 4 9 10 5 75 1 6 8 7 10 2 4 9 5 3 87 1 2 6 5 3 8 4 9 7 10 99 1 3 2 7 6 8 9 4 5 10 111 1 6 2 3 7 5 4 9 8 10 123 1 6 2 4 3 8 7 9 10 5 52 7 6 2 1 3 8 9 4 5 10 64 8 6 2 7 3 1 4 5 9 10 76 1 6 9 7 10 8 4 2 5 3 88 1 2 6 7 3 5 4 9 8 10 100 1 3 2 7 6 8 10 9 5 4 112 1 6 2 3 7 8 4 5 9 10 124 1 6 2 7 3 4 8 5 9 10 53 7 6 2 1 3 8 10 9 5 4 65 8 6 2 7 3 1 4 9 10 5 77 1 2 6 4 3 8 7 9 5 10 89 1 2 6 7 3 8 4 5 9 10 101 1 6 2 3 7 4 8 9 5 10 113 1 6 2 3 7 8 4 9 10 5 125 1 6 2 7 3 4 8 9 10 5 54 8 6 2 7 3 1 9 4 5 10 66 9 6 2 7 3 8 4 1 10 5 78 1 2 6 7 3 4 8 9 5 10 90 1 2 6 7 3 8 4 9 10 5 102 1 6 2 3 7 8 9 4 5 10 114 1 6 2 7 8 3 4 5 9 10 126 1 6 2 7 3 8 9 4 10 5 55 8 6 2 7 3 1 10 9 5 4 67 1 2 6 3 7 8 4 9 5 10 79 1 2 6 7 3 8 9 4 5 10 91 1 6 7 2 3 5 4 9 8 10 103 1 6 2 3 7 8 10 9 5 4 115 1 6 2 7 8 3 4 9 10 5 127 6 1 7 2 8 3 4 9 5 10 56 9 6 2 7 3 8 10 1 5 4 68 1 2 6 7 8 3 4 9 5 10 80 1 2 6 7 3 8 10 9 5 4 92 1 6 7 2 3 8 4 5 9 10 104 1 6 2 7 8 3 9 4 5 10 116 1 6 2 7 9 8 4 3 10 5 128 6 1 7 2 9 8 4 3 5 10 57 6 1 2 5 3 8 4 9 7 10 69 1 2 6 7 9 8 4 3 5 10 81 1 6 7 2 3 4 8 9 5 10 93 1 6 7 2 3 8 4 9 10 5 105 1 6 2 7 8 3 10 9 5 4 117 1 4 2 5 3 8 6 9 7 10 129 6 1 7 2 10 8 4 9 5 3 58 6 1 2 7 3 5 4 9 8 10 70 1 2 6 7 10 8 4 9 5 3 82 1 6 7 2 3 8 9 4 5 10 94 1 6 8 7 3 2 4 5 9 10 106 1 6 2 7 9 8 10 3 5 4 118 1 4 2 7 3 5 6 9 8 10 130 6 1 8 7 9 2 4 3 5 10 59 6 1 2 7 3 8 4 5 9 10 71 1 6 7 2 8 3 4 9 5 10 83 1 6 7 2 3 8 10 9 5 4 95 1 6 8 7 3 2 4 9 10 5 107 1 3 2 5 6 8 4 9 7 10 119 1 4 2 7 3 8 6 5 9 10 131 6 1 8 7 10 2 4 9 5 3 60 6 1 2 7 3 8 4 9 10 5 72 1 6 7 2 9 8 4 3 5 10 84 1 6 8 7 3 2 9 4 5 10 96 1 6 9 7 3 8 4 2 10 5 108 1 3 2 7 6 5 4 9 8 10 120 1 4 2 7 3 8 6 9 10 5 132 6 1 9 7 10 8 4 2 5 3
Appendix Continued: 133 7 6 8 1 9 2 4 3 5 10 145 7 6 9 1 3 8 10 2 5 4 157 6 1 2 3 7 4 8 9 5 10 169 6 1 2 3 7 8 4 9 10 5 181 6 1 2 7 3 4 8 9 10 5 193 1 6 7 2 8 3 9 4 5 10 205 1 6 7 2 9 8 4 3 10 5 134 7 6 8 1 10 2 4 9 5 3 146 8 6 9 7 3 1 10 2 5 4 158 6 1 2 3 7 8 9 4 5 10 170 6 1 2 7 8 3 4 5 9 10 182 6 1 2 7 3 8 9 4 10 5 194 1 6 7 2 8 3 10 9 5 4 206 1 6 8 7 9 2 4 3 10 5 135 7 6 9 1 10 8 4 2 5 3 147 6 1 7 2 3 5 4 9 8 10 159 6 1 2 3 7 8 10 9 5 4 171 6 1 2 7 8 3 4 9 10 5 183 7 6 2 1 3 4 8 5 9 10 195 1 6 7 2 9 8 10 3 5 4 207 1 2 6 4 3 5 7 9 8 10 136 8 6 9 7 10 1 4 2 5 3 148 6 1 7 2 3 8 4 5 9 10 160 6 1 2 7 8 3 9 4 5 10 172 6 1 2 7 9 8 4 3 10 5 184 7 6 2 1 3 4 8 9 10 5 196 1 6 8 7 9 2 10 3 5 4 208 1 2 6 4 3 8 7 5 9 10 137 6 1 7 2 3 4 8 9 5 10 149 6 1 7 2 3 8 4 9 10 5 161 6 1 2 7 8 3 10 9 5 4 173 7 6 2 1 8 3 4 5 9 10 185 7 6 2 1 3 8 9 4 10 5 197 1 2 6 3 7 5 4 9 8 10 209 1 2 6 4 3 8 7 9 10 5 138 6 1 7 2 3 8 9 4 5 10 150 6 1 8 7 3 2 4 5 9 10 162 6 1 2 7 9 8 10 3 5 4 174 7 6 2 1 8 3 4 9 10 5 186 8 6 2 7 3 1 9 4 10 5 198 1 2 6 3 7 8 4 5 9 10 210 1 2 6 7 3 4 8 5 9 10 139 6 1 7 2 3 8 10 9 5 4 151 6 1 8 7 3 2 4 9 10 5 163 7 6 2 1 8 3 9 4 5 10 175 7 6 2 1 9 8 4 3 10 5 187 1 2 6 3 7 4 8 9 5 10 199 1 2 6 3 7 8 4 9 10 5 211 1 2 6 7 3 4 8 9 10 5 140 6 1 8 7 3 2 9 4 5 10 152 6 1 9 7 3 8 4 2 10 5 164 7 6 2 1 8 3 10 9 5 4 176 8 6 2 7 9 1 4 3 10 5 188 1 2 6 3 7 8 9 4 5 10 200 1 2 6 7 8 3 4 5 9 10 212 1 2 6 7 3 8 9 4 10 5 141 6 1 8 7 3 2 10 9 5 4 153 7 6 8 1 3 2 4 5 9 10 165 7 6 2 1 9 8 10 3 5 4 177 6 1 2 4 3 5 7 9 8 10 189 1 2 6 3 7 8 10 9 5 4 201 1 2 6 7 8 3 4 9 10 5 213 1 6 7 2 3 4 8 5 9 10 142 6 1 9 7 3 8 10 2 5 4 154 7 6 8 1 3 2 4 9 10 5 166 8 6 2 7 9 1 10 3 5 4 178 6 1 2 4 3 8 7 5 9 10 190 1 2 6 7 8 3 9 4 5 10 202 1 2 6 7 9 8 4 3 10 5 214 1 6 7 2 3 4 8 9 10 5 143 7 6 8 1 3 2 9 4 5 10 155 7 6 9 1 3 8 4 2 10 5 167 6 1 2 3 7 5 4 9 8 10 179 6 1 2 4 3 8 7 9 10 5 191 1 2 6 7 8 3 10 9 5 4 203 1 6 7 2 8 3 4 5 9 10 215 1 6 7 2 3 8 9 4 10 5 144 7 6 8 1 3 2 10 9 5 4 156 8 6 9 7 3 1 4 2 10 5 168 6 1 2 3 7 8 4 5 9 10 180 6 1 2 7 3 4 8 5 9 10 192 1 2 6 7 9 8 10 3 5 4 204 1 6 7 2 8 3 4 9 10 5 216 1 6 8 7 3 2 9 4 10 5
Appendix Continued: 217 218 219 220 221 222 223 224 225 226 1 3 1 3 1 3 1 3 1 3 1 3 1 6 1 6 1 6 1 6 2 4 2 4 2 4 2 7 2 7 2 7 2 3 2 3 2 3 2 7 6 5 6 8 6 8 6 4 6 4 6 8 7 4 7 4 7 8 8 3 7 9 7 5 7 9 8 5 8 9 9 4 8 5 8 9 9 4 9 4 8 10 9 10 10 5 9 10 10 5 10 5 9 10 10 5 10 5 10 5 229 230 231 232 233 234 235 236 237 238 6 1 6 1 7 6 6 1 6 1 6 1 6 1 7 6 6 1 6 1 7 2 8 7 8 1 7 2 7 2 7 2 8 7 8 1 7 2 7 2 9 8 9 2 9 2 8 3 8 3 9 8 9 2 9 2 3 4 3 4 10 3 10 3 10 3 4 5 4 9 4 3 4 3 4 3 8 5 8 9 5 4 5 4 5 4 9 10 10 5 10 5 10 5 10 5 9 10 10 5 241 242 243 244 245 246 247 248 249 250 7 6 6 1 6 1 6 1 6 1 7 6 1 2 1 2 1 2 1 2 8 1 2 3 2 3 2 3 2 7 2 1 6 3 6 3 6 3 6 7 3 2 7 4 7 4 7 8 8 3 8 3 7 4 7 4 7 8 8 3 9 4 8 5 8 9 9 4 9 4 9 4 8 5 8 9 9 4 9 4 10 5 9 10 10 5 10 5 10 5 10 5 9 10 10 5 10 5 10 5 Numbers 1 252 on top of the permutations represent the permutation numbers 227 6 1 7 2 8 3 9 4 5 10 239 6 1 7 2 3 8 9 4 10 5 251 1 6 7 2 8 3 9 4 10 5 228 6 1 7 2 8 3 10 9 5 4 240 6 1 8 7 3 2 9 4 10 5 252 6 1 7 2 8 3 9 4 10 5
Journal of Modern Applied Statistical Methods May, 2005, Vol. 4, No.1, 333-342
JMASM19: A SPSS Matrix For Determining Effect Sizes From Three Categories: r And Functions Of r, Differences Between Proportions, And Standardized Differences Between Means
David A. Walker Educational Research and Assessment Department Northern Illinois University
The program is intended to provide editors, manuscript reviewers, students, and researchers with an SPSS matrix to determine an array of effect sizes not reported, or the correctness of those reported, such as r-related indices, r-related squared indices, and measures of association, when the only data provided in the manuscript or article are the n, M, and SD (and sometimes proportions and t and F(1) values) for two-group designs. This program can create an internal matrix table to assist researchers in determining the size of an effect for commonly utilized r-related, mean difference, and difference in proportions indices when engaging in correlational and/or meta-analytic analyses. Key words: SPSS, syntax, effect size
Introduction Cohen (1988) defined effect size as the degree to which the phenomenon is present in the population (p. 9) or the degree to which the null hypothesis is false (p. 10). For many years, researchers, editorial boards, and professional organizations have called for the reporting of effect sizes with statistical significance testing (Cohen, 1965; Knapp, 1998; Levin, 1993; McLean & Ernest, 1998; Thompson, 1994; Wilkinson & The APA Task Force on Statistical Inference, 1999). However, research applied to this issue has indicated that most published studies do not supply measures of effect size with results garnered from statistical significance testing (Craig, Eison, & Metze, 1976; Henson & Smith, 2000; Vacha-Hasse, Nilsson, Reetz, Lance, & Thompson, 2000). When reported with statistically significant
results, effect size can provide information pertaining to the extent of the difference between the null hypothesis and the alternative hypothesis. Furthermore, effect sizes can show the magnitude of a relationship and the proportion of the total variance of an outcome that is accounted for (Cohen, 1988; Kirk, 1996; Shaver, 1985). Conversely, there have long been cautions affiliated with the use of effect sizes. For instance, over 20 years ago, Kraemer and Andrews (1982) pointed out that effect sizes have limitations in the sense that they can be a measure that clearly indicates clinical significance only in the case of normally distributed control measures and under conditions in which the treatment effect is additive and uncorrelated with pretreatment or control treatment responses. (p. 407) Hedges (1981) examined the influence of measurement error and invalidity on effect sizes and found that both of these problems tended to underestimate the standardized mean difference effect size. In addition, Prentice and Miller (1992) ascertained that, The statistical size of an effect is heavily dependent on the operationalization of the independent variables
David Walker is an Assistant Professor at Northern Illinois University. His research interests include structural equation modeling, effect sizes, factor analyses, predictive discriminant analysis, predictive validity, weighting, and bootstrapping. Email: [email protected].
and the choice of a dependent variable (p. 160). Robinson, Whittaker, Williams, and Beretvas (2003) warned that depending on the choice of which effect size is reported, in some cases important conclusions may be obscured rather than revealed (p. 52). Finally, Kraemer (1983), Sawilowsky (2003), and Onwuegbuzie and Levin (2003) cautioned that effect sizes are vulnerable to various primary assumptions. Onwuegbuzie and Levin cited nine limitations affiliated with effect sizes and noted generally that these measures: are sensitive to a number of factors, such as: the research objective; sampling design (including the levels of the independent variable, choice of treatment alternatives, and statistical analysis employed); sample size and variability; type and range of the measure used; and score reliability. (p. 135) Effect sizes fall into three categories: 1) product moment correlation (r) and functions of r; 2) differences between proportions; and 3) standardized differences between means (Rosenthal, 1991). The first category of effect size, the r-related indices, can be considered as based on the correlation between treatment and result (Levin, 1994). For this group, Effect size is generally reported as some proportion of the total variance accounted for by a given effect (Stewart, 2000, p. 687), or, as Cohen (1988) delineated this effect size, Another possible useful way to understand r is as a proportion of common elements between variables (p. 78). Cohen (1988) suggested that for r-related indices, values of .10, .30, and .50 should serve as indicators of small, medium, and large effect sizes, while for r-related squared indices, values of .01, .09, and .25 should serve as indicators of small, medium, and large, respectively. The differences between proportions group is constituted in measures, for example, such as the differences between independent population proportions (i.e., Cohens h) or the difference between a population proportion and .50 (i.e., Cohens g) (Cohen, 1988). Finally, the standardized differences between means encompasses measures of effect size in terms of mean difference and standardized mean
yet another purpose of this program is to offer researchers software that contains many of the formulae used in meta-analyses.

Methodology

The presented SPSS program will create an internal matrix table to assist researchers and students in determining the size of an effect for commonly utilized r-related, mean difference, and difference in proportions indices when engaging in correlational and/or meta-analytic analyses. Currently, the program produces nearly 50 effect sizes (see Appendix A for truncated results of the program's ability). This software program employs mostly data from published articles, and some simulated data, to demonstrate its uses in terms of effect size calculations. Most of the formulae incorporated into this program come from Aaron, Kromrey, and Ferron (1998), Agresti and Finlay (1997), Cohen (1988), Cohen and Cohen (1983), Cooper and Hedges (1994), Hays (1963; 1981), Hedges (1981), Hedges and Olkin (1985), Kelley (1935), Kraemer (1983), Kraemer and Andrews (1982), McGraw and Wong (1992), Olejnik and Algina (2000), Peters and Van Voorhis (1940), Richardson (1996), Rosenthal (1991), and Rosenthal, Rosnow, and Rubin (2000). It should be noted that with the r-related and the standardized differences between means effect sizes, there are numerous, algebraically related methods for calculating these indices, some of which have been provided, but not all, since the same value(s) would be repeated numerous times (see Cooper & Hedges, 1994, or Richardson, 1996, for the various formulae). Because this matrix is meant for between-group designs, k = 2, there are some specific assumptions that should be addressed. To run the program, it is assumed that the user has access to either n, M, and SD, or t or F(1) values from two-group comparisons. Also, this program was intended for post-test group comparison designs and not, for example, a one-group repeated measures design, which can be found in meta-analytic data sets as well. Certain effect sizes produced by the program that the user does not wish to view, or
that may be nonsensical pertaining to the research of study, should be disregarded. As well, a few of the measures developed for very specific research conditions, such as the Common Language effect size, may not be pertinent to many research situations and should be ignored if this is the case. The Mahalanobis Generalized Distance (D2) is an estimated effect size with p = .5 implemented as the proportion value in the formula. Some of the r-related squared indices may contain small values that are negative. This can occur when the MS (treatment) is < the MS (residual) (Peters & Van Voorhis, 1940), or when the t or F values used in the formulae to derive these effect size indices are < 1.00 (Hays, 1963). Finally, even with exact formulas, some of the computed values may be slightly inexact, as could the direction of a value depending on the users definition of the experimental and control groups. Program Description and Output As presented in the program output found in appendix A, the reader should note that they enter the M, SD, and n for both groups in the first lines of the syntax termed test. If they want to run just one set of data, they put it next to test 1. If more than one set of data are desired, they put the subsequent information in test 2 to however many tests they want to conduct. The matrix produced will group the effect sizes by the three categories noted previously and also related to an appropriate level of measurement. In parenthesis, after an effect size is displayed in the matrix, is a general explanation of that particular measure and any notes that should be mentioned such as used when there are ESS (equal sample sizes) or PEES (populations are of essentially equal size), yields a PRE (proportional reduction in error) interpretation, or examines the number of CP (concordant pairs) and DP (discordant pairs). Further, the matrix generates power values, based on calculations of alpha set at the .05 level, related to indices such as Cohens d, Glass delta, and Hedges g. Finally, because some of the standardized differences between means indices produce biased values under various conditions; numerous measures of effect for this group are provided for the user to obtain the proper measure(s) pertaining to specific
circumstances within the research context. The accuracy of the program was checked by an independent source whose hand calculations verified the formulas utilized throughout the program via various situations employing twogroup n, M, SD. Appendix B provides the full syntax for this program. To obtain an SPSS copy of the syntax, send an e-mail to the author. Reference Aaron, B., Kromrey, J. D., & Ferron, J. M. (1998, November). Equating r-based and dbased effect size indices: Problems with a commonly recommended formula. Paper presented at the annual meeting of the Florida Educational Research Association, Orlando, FL. Agresti, A., & Finlay, B. (1997). Statistical methods for the social sciences (3rd ed). Upper Saddle River, NJ: Prentice Hall. Cohen, J. (1965). Some statistical issues in psychological research. In B.B. Wolman (Ed.), Handbook of clinical psychology (pp. 95121). New York: Academic Press. Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Lawrence Erlbaum Associates, Publishers. Cohen, J., & Cohen, P. (1983). Applied Multiple Regression/Correlation Analysis for the Behavioral Sciences (2nd ed.). Hillsdale, NJ: Lawrence Erlbaum Associates, Publishers. Cooper, H., & Hedges, L. V. (Eds.). (1994). The handbook of research synthesis. New York: Russell Sage Foundation. Craig, J. R., Eison, C. L., & Metze, L. P. (1976). Significance tests and their interpretation: An example utilizing published research and 2. Bulletin of the Psychonomic Society, 7, 280-282. Glass, G. V., McGaw, B., & Smith, M. L. (1981). Meta-analysis in social research. Beverly Hills, CA: Sage. Hays, W. L. (1963). Statistics. New York: Holt, Rinehart & Winston. Hays, W. L. (1981). Statistics (3rd ed.). New York: Holt, Rinehart & Winston. Hedges, L. V. (1981). Distribution theory for Glass estimator of effect size and related estimators. Journal of Educational Statistics, 6, 107-128.
Onwuegbuzie, A. J., & Levin, J. R. (2003). Without supporting statistical evidence, where would reported measures of substantive importance lead? To no good effect. Journal of Modern Applied Statistical Methods, 2(1), 133151. Peters, C. C., & Van Voorhis, W. R. (1940). Statistical procedures and their mathematical bases. New York: McGraw-Hill. Prentice, D. A., & Miller, D. T. (1992). When small effects are impressive. Psychological Bulletin, 112(1), 160-164. Richardson, J. T. E. (1996). Measures of effect size. Behavior Research Methods, Instruments, & Computers, 28(1), 12-22. Robinson, D. H., Whittaker, T. A., Williams, N. J., & Beretvas, S. N. (2003). Its not effect sizes so much as comments about their magnitude that mislead readers. The Journal of Experimental Education, 72(1), 51-64. Rosenthal, R. (1991). (Series Ed.), Meta-analytic procedures for social research. Newbury Park, CA: Sage Publications. Rosenthal, R., Rosnow, R. L., & Rubin, D. B. (2000). Contrasts and effect sizes inbehavioral research: A correlational approach. Cambridge, England: Cambridge University Press.
Sawilowsky, S. S. (2003). Deconstructing arguments from the case against hypothesis testing. Journal of Modern Applied Statistical Methods, 2, 467-474 Shaver, J. (1985). Chance and nonsense. Phi Delta Kappan, 67, 57-60. Stewart, D. W. (2000). Testing statistical significance testing: Some observations of an agnostic. Educational and Psychological Measurement, 60, 685-690. Thompson, B. (1994). Guidelines for authors. Educational and Psychological Measurement, 54, 837-847. Vacha-Hasse, T., Nilsson, J, E., Reetz, D. R., Lance, T. S., & Thompson, B. (2000). Reporting practices and APA editorial policies regarding statistical significance and effect size. Theory & Psychology, 10, 413425. Wilkinson, L., & The APA Task Force on Statistical Inference. (1999). Statistical methods in psychology journals: Guidelines and explanations. American Psychologist, 54, 594604.
Standardized Differences Between Means, % of Nonoverlap (with d), and Power Glass Delta (Used When There are Unequal Variances and Calculated with the Control Group SD) __________ 1.2330 .8869 .0829 .6667
Hedges g (Used When There are Hedges g Small (Using t Sample Value Sizes) n1=n2) _________ _________ 1.1488 .8431 .0758 .6557 1.1634 .8602 .0769 .6667
Hedges g (Using U % of Cohens d) Nonoverlap Power _________ __________ _________ 1.1445 .8384 .0754 .6526 61.0362 49.9468 5.9506 41.4105 .9945 .7552 .0589 .6183
Proportion of Variance-Accounted-For Effect Sizes: 2x2 Dichotomous/Nominal Phi (The Mean Percent Difference Between Two Variables with Either Considered Causing the Other) __________ .4492 .3674 .0384 .3015
Tetrachoric Correlation (Estimation of Pearsons r for Continuous Variables Reduced to Dichotomies) ____________ .4492 .3674 .0384 .3015
Pearsons Coefficient of Contingency (C) (A Nominal Approximation of the Pearsonian correlation r) _____________ .4098 .3449 .0384 .2887
Sakodas Adjusted Pearsons C (Association Between Two Variables as a Percentage of Their Maximum Possible Variation) ____________ .5795 .4878 .0542 .4082
Proportion of Variance-Accounted-For Effect Sizes: Measures of Relationship(PEES) Pearsons r (If no t Value and for Equal n; Corrected for Bias in Formula) _________ .5090 .4037 .0391 .3223
Point Biserial r (Pearsons r for Dichotomous and Continuous Variables) ___________ .5028 .3951 .0384 .3162
Biserial r (r for Interval and Dichotomous Variables) ___________ .6300 .4950 .0481 .3962
Pearsons r (Using Cohens d with Equal n) _________ .5028 .3951 .0384 .3162
Pearsons r (Using Cohens d with Unequal n) _________ .5028 .3951 .0384 .3162
Pearsons r (Using t Value and for Equal n; Corrected for Bias) _________ .5090 .4037 .0391 .3223
Pearsons r (Using Hedges g with Unequal n) _________ .5042 .3970 .0386 .3176
Appendix A: Continued Proportion of Variance-Accounted-For Effect Sizes: Univariate Analyses (k=2, ESS) R Square (If no t Value and for Unequal n Corrected for Bias in Formula) _________ .2591 .1630 .0015 .1039
R Square (Using t Value and for Unequal n Corrected for Bias) _________ .2591 .1630 .0015 .1039
Adjusted R Square (Using t Value and for Unequal n) _________ .2339 .1177 -.0376 .0641
Proportion of Variance-Accounted-For Effect Sizes: Univariate Analyses (k=2, ESS)

Eta Square (Squared Correlation Ratio or the Percentage of Variation Effects Uncorrected for a Sample) ___________ .2528 .1561 .0015 .1000

Eta Square (Calculated with F Value) ___________ .2591 .1630 .0015 .1039

Omega Square (Corrected Estimates for the Population Effect) __________ .2528 .1561 .0015 .1000

Epsilon Square (Percentage of Variation Effects Uncorrected for a Sample) ___________ .2404 .1339 -.0177 .0804

Epsilon Square (Calculated with F Value) ___________ .2467 .1409 -.0177 .0844

Appendix B: Program Syntax

* Data enter *.
data list list /testno(f8.0) exprmean exprsd(2f9.3) exprn(f8.0) contmean contsd(2 f9.3) contn(f8.0).
* Put the M, SD, n for the Experimental Group followed by the Control Group.
Begin data
1 9.16 3.45 31 5.35 3.09 31
2 15.95 3.47 20 13.05 3.27 20
3 31.15 10.83 27 30.37 9.41 27
4 105 15 24 95 15 24
end data.
*****************************************************************************
Example References
1 Example of t and Cohen's d     JEE (2002), 70(4), 356-357
2 Example of F, Cohen's d, and Eta2     JEE (2002), 70(3), 235
3 Example of t and Eta2     JEE (2002), 70(4), 305-306
4 Example of d, r, r2, and CL     Psych Bulletin (1992), 111(2), 363
*****************************************************************************.
Appendix B: Continued
compute poold = ((exprn-1)*(exprsd**2)+(contn-1)*(contsd**2))/((exprn+contn)-2) . compute glassdel = (exprmean-contmean)/contsd. compute cohend = (exprmean-contmean)/sqrt(poold). compute clz = (exprmean-contmean)/sqrt(exprsd**2 + contsd**2). compute cl = CDFNORM(clz)*100. compute akf1 = (exprn+contn)**2. compute akf2 = 2*(exprn+contn). compute akf3 = akf1-akf2. compute akf4 = (akf3)/(exprn*contn). compute r2akf = (cohend**2)/(cohend**2+akf4). compute rakf = SQRT (r2akf). compute hedgesg = cohend*(1-(3/(4*(exprn+contn)-9))). compute ub = CDF.NORMAL((ABS(cohend)/2),0,1). compute U = (2*ub-1)/ub*100. compute critical = 0.05. compute h = (2*exprn*contn)/(exprn+contn). compute ncp = ABS((cohend*SQRT(h))/SQRT(2)). compute alpha = IDF.T(1-critical/2,exprn+contn-2). compute power1 = 1-NCDF.T(alpha,exprn+contn-2,NCP). compute power2 = 1-NCDF.T(alpha,exprn+contn-2,-NCP). compute B = power1 + power2. compute f2 = cohend ** 2 / 4 . compute f = ABS(cohend/2). compute eta2 = (f2) / (1 + f2) . compute eta = SQRT(eta2). compute epsilon2 = 1-(1-eta2) * (exprn + contn-1) / (exprn + contn-2). compute ttest = cohend * SQRT((exprn * contn) /( exprn + contn)). compute cohenda = 2*ttest/SQRT(exprn + contn-2). compute hedgesa = 2*ttest/SQRT(exprn + contn). compute hedgesb = cohend*SQRT((exprn + contn-2)/(exprn + contn)). compute hedgesn = (exprn + contn)/(2). compute hedgesnh = 1/(.5*((1/exprn) + (1/contn))). compute hedgesnn = sqrt(hedgesn/hedgesnh). compute r1= ttest/SQRT((ttest**2)+ exprn + contn-2). compute r = cohend/SQRT(cohend ** 2 + 4) . compute rd = cohend/SQRT((cohend ** 2 + 4*(hedgesnn))). compute rg = hedgesg/SQRT((hedgesg ** 2 + 4*(hedgesnn)*((exprn + contn-2)/(exprn + contn)))). compute phi = (r **2/(1+r **2)) **.5. compute phi2 = phi **2. compute taub = SQRT(phi **2). compute gktau = phi **2. compute zr = .5 * LN((1 + r) / (1 - r)) . compute zrbias = r/(2*(exprn + contn-1)). compute zrcor = zr - zrbias. compute rsquare = r **2 . compute rsquare1 = r1**2. compute adjr2 = rsquare - ((1-rsquare)*(2/(exprn + contn -3))) . compute adjr2a = rsquare1 - ((1-rsquare1)*(2/(exprn + contn -3))) . compute adjr2akf = r2akf - ((1-r2akf)*(2/(exprn + contn -3))) . compute k = SQRT(1-r **2). compute k2 = k **2. compute lambda = 1-rsquare. compute rpbs = SQRT(eta2). compute rbs = rpbs*1.253. compute rpbs2 = rpbs **2. compute ftest = ttest **2. compute omega2 = ftest / ((exprn + contn) + ftest). compute estomega = (ttest**2-1)/(ttest**2 + exprn + contn -1).
Appendix B: Continued
compute eta2f = (ftest)/(ftest + exprn + contn -2). compute esticc = (ftest-1)/(ftest + exprn + contn -2). compute c = SQRT(chi/ (exprn + contn+chi)). compute adjc = c/SQRT(.5). compute cramer = SQRT(chi/ (exprn + contn*1)). compute cramer2 = cramer **2. compute t = SQRT(chi/ (exprn + contn*1)). compute t2 = cramer **2. compute d2 = r **2/(r **2+1). compute w = SQRT (c **2/(1-c **2)). compute w2 = w **2. compute percenta = exprmean/(exprmean+contmean). compute percentb = exprsd/(exprsd+contsd). compute percentd = percenta-percentb. compute p = (exprmean*contsd)-(exprsd*contmean). compute q = (exprmean*contsd)+(exprsd*contmean). compute yulesq = p/q. compute taua = ((p-q)/((exprn+contn)*(exprn + contn-1)/2)). compute rr = (exprmean/(exprmean+contmean))/(exprsd/(exprsd+contsd)). compute rrr = 1-rr. compute odds = (exprmean/contmean)/(exprsd/contsd). compute tauc = 4*((p-q)/((exprn+contn)*(exprn+contn))). compute zb = SQRT(chi). compute coheng = exprsd - .50. compute cohenh = 2 * ARSIN(SQRT(.651)) - 2 * ARSIN(SQRT(.414)). compute cohenq = .55-zr. execute.
* FINAL REPORTS *.
FORMAT poold to cohenq (f9.4). VARIABLE LABELS testno 'Test'/ exprmean 'M1'/ exprsd 'SD1'/ exprn 'n1'/contmean 'M2'/ contsd 'SD2'/contn 'n2' /glassdel 'Glass Delta'/ cohend 'Cohens d (Using M & SD)'/ U 'U % of Nonoverlap'/ B 'Power'/ hedgesg 'Hedges g' /cohenda 'Cohens d (Using t Value n1=n2)'/hedgesa 'Hedges g (Using t Value n1=n2)'/hedgesb 'Hedges g (Using Cohens d)'/rd 'Pearsons r (Using Cohens d with Unequal n)'/ rg 'Pearsons r (Using Hedges g with Unequal n)'/ f2 'f Square (Proportion of Variance Accounted for by Difference in Population Membership)' /r2akf 'R Square (If no t Value and for Unequal n Corrected for Bias in Formula)'/eta2 'Eta Square (Squared Correlation Ratio or the Percentage of Variation Effects Uncorrected for a Sample)' /epsilon2 'Epsilon Square (Percentage of Variation Effects Uncorrected for a Sample' / omega2 'Omega Square (Corrected Estimates for the Population Effect)' /r 'Pearsons r (Using Cohens d with Equal n)' /r1 'Pearsons r (Using t Value and for Equal n; Corrected for Bias)' /rakf 'Pearsons r (If no t Value and for Equal n; Corrected for Bias in Formula)' /phi 'Phi (The Mean Percent Difference Between Two Variables with Either Considered Causing the Other)' /phi2 'Phi Coefficient Square (Proportion of Variance Shared by Two Dichotomies)' /zr 'Fishers Z (r is Transformed to be Distributed More Normally)'/w2 'w Square (Proportion of Variance Shared by Two Dichotomies)' /coheng 'Cohens g (Difference Between a Proportion and .50)' /cohenh 'Cohens h (Differences Between Proportions)' /cohenq 'Cohens q (One Case & Theoretical Value of r)' /rsquare 'R Square (d Value)' /rsquare1 'R Square (Using t Value and for Unequal n Corrected for Bias)'/adjr2 'Adjusted R Square (d Value)'/adjr2a 'Adjusted R Square (Using t Value and for Unequal n)'/adjr2akf 'Adjusted R Square (Unequal n and Corrected for Bias)'/ lambda 'Wilks Lambda (Small Values Imply Strong Association)' / t2 'T Square (Measure of Average Effect within an Association)' /d2 'D2 Mahalanobis Generalized Distance (Estimated with p = .5 as the Proportion of Combined Populations)' /rpbs 'Point Biserial r (Pearsons r for Dichotomous and Continuous Variables)' /rbs 'Biserial r (r for Interval and Dichotomous Variables)'/rpbs2 'r2 Point-Biserial (Proportion of Variance Accounted for by Classifying on a Dichotomous Variable Special Case Related to R2 and Eta2)' / f 'f (Non-negative and Non-directional and Related to d as an SD of Standardized Means when k=2 and n=n)' /k2 'k2 (r2/k2: Ratio of Signal to Noise Squared Indices)' / k 'Coefficient of Alienation (Degree of Non-Correlation: Together r/k are the Ratio of Signal to Noise)' /c 'Pearsons Coefficient of Contingency (C) (A Nominal Approximation of the Pearsonian correlation r)' /adjc 'Sakodas Adjusted Pearsons C (Association Between Two Variables as a Percentage of Their Maximum Possible Variation)' /cramer 'Cramers V (Association Between Two Variables as a Percentage of Their Maximum Possible Variation)'/odds 'Odds Ratio (The Chance of Faultering after Treatment or the Ratio of the Odds of Suffering Some Fate)'/ rrr 'Relative Risk Reduction (Amount that the Treatment Reduces Risk)'/ rr 'Relative Risk Coefficient (The Treatment Groups Amount of the Risk of the Control Group)'/ percentd 'Percent Difference'/ yulesq 'Yules Q (The Proportion of Concordances to the Total Number of Relations)'/ t 'Tshuprows T (Similar to Cramers V)' /w 'w (Amount of Departure from No Association)' /chi 'Chi Square(1)(Found from Known Proportions)' /eta 'Correlation Ratio (Eta or the 
Degree of Association Between 2 Variables)'/eta2f 'Eta Square (Calculated with F Value)'/epsilonf 'Epsilon Square (Calculated with F Value)'/esticc 'Estimated Population Intraclass Correlation Coefficient'/estomega 'Estimated Omega Square'/zrcor 'Fishers Z
Appendix B: Continued
Corrected for Bias (When n is Small)'/cl 'Common Language (Out of 100 Randomly Sampled Subjects (RSS) from Group 1 will have Score > RSS from Group 2)'/ taua 'Kendalls Tau a (The Proportion of the Number of CP and DP Compared to the Total Number of Pairs)'/ tetra 'Tetrachoric Correlation (Estimation of Pearsons r for Continuous Variables Reduced to Dichotomies)'/taub 'Kendalls Tau b (PRE Interpretations)'/ gktau 'Goodman Kruskal Tau (Amount of Error in Predicting an Outcome Utilizing Data from a Second Variable)'/cramer2 'Cramers V Square'/ tauc 'Kendalls Tau c (AKA Stuarts Tau c or a Variant of Tau b for Larger Tables)'/. REPORT FORMAT=LIST AUTOMATIC ALIGN(CENTER) /VARIABLES=testno exprmean exprsd exprn contmean contsd contn /TITLE "Descriptive Statistics". REPORT FORMAT=LIST AUTOMATIC ALIGN(CENTER) /VARIABLES=glassdel cohend cohenda hedgesg hedgesa hedgesb U B /TITLE "Standardized Differences Between Means, % of Nonoverlap (with d), and Power". REPORT FORMAT=LIST AUTOMATIC ALIGN(CENTER) /VARIABLES= percentd yulesq /TITLE "Proportion of Variance-Accounted-For Effect Sizes: 2x2 Dichotomous Associations". REPORT FORMAT=LIST AUTOMATIC ALIGN(CENTER) /VARIABLES= rr rrr odds /TITLE "Proportion of Variance-Accounted-For Effect Sizes: 2x2 Dichotomous Associations". REPORT FORMAT=LIST AUTOMATIC ALIGN (LEFT) MARGINS (*,90) /VARIABLES= chi phi tetra c adjc /TITLE "Proportion of Variance-Accounted-For Effect Sizes: 2x2 Dichotomous/Nominal". REPORT FORMAT=LIST AUTOMATIC ALIGN(CENTER) /VARIABLES= cramer w t /TITLE "Proportion of Variance-Accounted-For Effect Sizes: 2x2 Dichotomous/Nominal". REPORT FORMAT=LIST AUTOMATIC ALIGN(CENTER) /VARIABLES= taub tauc taua /TITLE "Proportion of Variance-Accounted-For Effect Sizes: 2x2 Ordinal Associations". REPORT FORMAT=LIST AUTOMATIC ALIGN(CENTER) /VARIABLES=gktau /TITLE "Proportion of Variance-Accounted-For Effect Sizes: 2x2 PRE Measures". REPORT FORMAT=LIST AUTOMATIC ALIGN(CENTER) /VARIABLES= phi2 cramer2 w2 t2 /TITLE"Proportion of Variance-Accounted-For Effect Sizes: Squared Associations". REPORT FORMAT=LIST AUTOMATIC ALIGN(CENTER) /VARIABLES=coheng cohenh cohenq /TITLE "Differences Between Proportions". REPORT FORMAT=LIST AUTOMATIC ALIGN(CENTER) /VARIABLES= f zr zrcor eta esticc /TITLE "Proportion of Variance-Accounted-For Effect Sizes:Measures of Relationship(PEES)". REPORT FORMAT=LIST AUTOMATIC ALIGN(CENTER) /VARIABLES= rpbs rbs r rd rakf r1 rg /TITLE "Proportion of Variance-Accounted-For Effect Sizes:Measures of Relationship(PEES)". REPORT FORMAT=LIST AUTOMATIC ALIGN(CENTER) /VARIABLES= k cl /TITLE "Proportion of Variance-Accounted-For Effect Sizes:Measures of Relationship(PEES)". REPORT FORMAT=LIST AUTOMATIC ALIGN(CENTER) /VARIABLES=rsquare r2akf rsquare1 adjr2 adjr2a /TITLE"Proportion of Variance-Accounted-For Effect Sizes:Univariate Analyses (k=2, ESS)". REPORT FORMAT=LIST AUTOMATIC ALIGN(CENTER) /VARIABLES=eta2 eta2f omega2 estomega epsilon2 epsilonf /TITLE"Proportion of Variance-Accounted-For Effect Sizes:Univariate Analyses (k=2, ESS)". REPORT FORMAT=LIST AUTOMATIC ALIGN(CENTER) /VARIABLES= rpbs2 k2 /TITLE"Proportion of Variance-Accounted-For Effect Sizes:Univariate Analyses (k=2, ESS)". REPORT FORMAT=LIST AUTOMATIC ALIGN(CENTER) /VARIABLES=f2 lambda d2 /TITLE"Proportion of Variance-Accounted-For Effect Sizes:Multivariate Analyses(k=2,ESS)".
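As a cross-check of the kind of quantities the program reports, a few of the computations above can be mirrored in a short Python sketch (this is only an illustration using the same formulas as the syntax; it is not part of the SPSS program). The input values are the first test line of the example data:

# Pooled variance, Cohen's d, Hedges' g, and r from d for a two-group design.
from math import sqrt

m1, sd1, n1 = 9.16, 3.45, 31     # experimental group (test 1 of the example data)
m2, sd2, n2 = 5.35, 3.09, 31     # control group

pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2)
cohen_d = (m1 - m2) / sqrt(pooled_var)
hedges_g = cohen_d * (1 - 3 / (4 * (n1 + n2) - 9))
r_from_d = cohen_d / sqrt(cohen_d ** 2 + 4)      # equal-n conversion used in the program

print(round(cohen_d, 4), round(hedges_g, 4), round(r_from_d, 4))

These expressions should return values of roughly 1.1634, 1.1488, and .5028, corresponding to test 1 entries in the Appendix A output.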
Journal of Modern Applied Statistical Methods May, 2005, Vol. 4, No.1, 343-351
Paul Mondragon
Brian Borchers
Department of Mathematics New Mexico Tech
Five readily available software packages were tested on nonlinear regression test problems from the NIST Statistical Reference Datasets. None of the packages was consistently able to obtain solutions accurate to at least three digits. However, two of the packages were somewhat more reliable than the others. Key words: nonlinear regression, Levenberg Marquardt, NIST StRD
Introduction

The goal of this study is to compare the nonlinear regression capabilities of several software packages using the nonlinear regression datasets available from the National Institute of Standards and Technology (NIST) Statistical Reference Datasets (National Institute of Standards and Technology [NIST], 2000). The nonlinear regression problems were solved by the NIST using quadruple precision (128 bits) and two public domain programs with different algorithms and different implementations; the convergence criterion was residual sum of squares (RSS) and the tolerance was 1E-36. Certified values were obtained by rounding the final solutions to 11 significant digits. Each of the two public domain programs, using only double precision, could achieve 10 digits of accuracy for every problem (McCullough, 1998). The software packages considered in this study are:

1. MATLAB codes by Hans Bruun Nielsen (2002).
2. GaussFit (Jefferys, Fitzpatrick, McArthur, & McCartney, 1998).
3. Gnuplot (Crawford, 1998).
4. Microsoft Excel (Mathews & Seymour, 1994).
5. Minpack (More, Garbow, & Hillstrom, 1980).

Hiebert (1981) compared 12 Fortran codes on 36 separate nonlinear least squares problems. Twenty-eight of the problems used by Hiebert are given by Dennis, Gay, and Welch (1977), with the other eight problems given by More, Garbow, and Hillstrom (1978). In their paper, More et al. (1978) used Fortran subroutines to test 35 problems. These 35 problems were a mixture of systems of nonlinear equations, nonlinear least squares, and unconstrained minimization. We are not aware of any other published studies in which codes were tested on the NIST nonlinear regression problems.

Methodology

Following McCullough (1998), accuracy is determined using the log relative error (LRE) formula,
Paul Mondragon is an Operations Research Analyst. Contact him at Paul.Mondragon@navy. mil. Brian Borchers is Professor of Mathematics. His research interests are in interior point methods for linear and semidefinite programming, with applications to combinatorial optimization problems. Contact him at [email protected].
$\mathrm{LRE} = -\log_{10}\!\left(\frac{|q - c|}{|c|}\right) \qquad (1)$

where q is the value of the parameter estimated by the code being tested and c is the certified value. In the event that q = c exactly, the LRE is not formally defined, but we set it equal to the number of digits in c. It is also possible for an LRE to exceed the number of digits in c; for example, it is possible to calculate an LRE of 11.4 even though c contains only 11 digits. This is because double precision floating point arithmetic uses binary, not decimal, arithmetic. In such a case, the LRE is set equal to the number of digits in c. Finally, any LRE less than one is set to zero.

Robustness is an important characteristic for a software package. In terms of accuracy, there is concern with each specific problem individually. Robustness, however, is a measure of how the software packages performed on the problems as a set. In other words, there must be a sense of how reliable the software package is so there may be some level of confidence that it will solve a particular nonlinear regression problem other than those listed in the NIST StRD. In this sense, robustness may very well be more important to the user than accuracy. Certainly the user would want parameter estimates to be accurate to some level, but accuracy to 11 digits is often not particularly useful in practical application. However, the user would want to be confident that the software package they are using will generate parameter estimates accurate to perhaps 3 or 4 digits on most any problem they attempt to solve. If, on the other hand, a software package is extremely accurate on some problems, but returns a solution which is not close to actual values on other problems, the user would want to use this software package with extreme caution. The codes were not compared on the basis of CPU time, for the reason that all of these codes solve (or fail to solve) all of the
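The LRE rule in equation (1), together with the capping conventions just described, can be collected into a small helper; this sketch is ours and is not taken from any of the packages tested:

# Log relative error with the conventions used in this study.
import math

def lre(q, c, digits_in_c=11):
    if q == c:                          # LRE not formally defined; use the digits in c
        return digits_in_c
    value = -math.log10(abs(q - c) / abs(c))
    value = min(value, digits_in_c)     # cannot exceed the number of digits in c
    return value if value >= 1.0 else 0.0   # anything below one digit is set to zero

# e.g. lre(238.95, 238.94212918) is roughly 4.5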
MINPACK Minpack (More et al., 1980) is a library of Fortran codes for solving systems of nonlinear equations and nonlinear least squares problems. Minpack is freely distributed via the Netlib web site and other sources. The algorithms proceed either from an analytic specification of the Jacobian matrix or directly from the problem functions. The paths include facilities for systems of equations with a banded Jacobian matrix, for least squares problems with a large amount of data, and for checking the consistency of the Jacobian matrix with the functions. For the problems involved in this study a program and a subroutine had to be written. The main program calls the lmder1 routine. The lmder1 routine calls two user written subroutines which compute function values and partial derivatives. Results The problems given in the NIST StRD dataset are provided with two separate initial starting positions for the estimated parameters. The first position, Start 1, is considered to be the more difficult because the initial values for the parameters are farther from the certified values than are the initial values given by Start 2. For this reason, one might expect that the solutions generated from Start 2 to be more accurate, or perhaps for the algorithm to take fewer iterations. It is interesting to note that in several cases the results from Start 2 are not more accurate based upon the minimum LRE recorded. The critical parameter used in the comparison of these software packages is the LRE as calculated in (1). The number of estimated parameters for these problems range from two to nine. It was decided that it would be beneficial for the results table to be as concise as possible, yet remain useful. As a result, after running a particular package from both starting values, the LRE for each estimated parameter was calculated. The minimum LRE for the estimated parameters from each starting position was then entered into the results table.
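Returning to the calling pattern described at the start of this section, the residual-plus-analytic-Jacobian interface that lmder1 expects can be illustrated through SciPy's wrapper around MINPACK (method="lm"). This sketch is not one of the test programs used in the study, and the model and data below are synthetic stand-ins rather than a NIST dataset:

# Fit y = b1 * (1 - exp(-b2 * x)) by Levenberg-Marquardt via MINPACK (through SciPy),
# supplying both the residual function and its analytic Jacobian.
import numpy as np
from scipy.optimize import least_squares

x = np.linspace(1.0, 10.0, 12)
y_obs = 5.0 * (1.0 - np.exp(-0.3 * x))        # synthetic data with known parameters

def residuals(b):
    return b[0] * (1.0 - np.exp(-b[1] * x)) - y_obs

def jacobian(b):                              # one column per parameter
    e = np.exp(-b[1] * x)
    return np.column_stack((1.0 - e, b[0] * x * e))

fit = least_squares(residuals, x0=[1.0, 1.0], jac=jacobian, method="lm")
print(fit.x)                                  # should recover roughly [5.0, 0.3]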
Problem
Misra1a
Start
1 2 1
Excel Gnuplot
4.8 6.1 4.2 4.6 4.0 4.9 0.0 0.0 4.7 4.6 5.8 5.8 4.9 4.9 4.2 4.3 3.9 3.9 5.1 5.1
GaussFit
10.0 10.0 7.4 8.6 8.0 8.5 0.0 7.9 8.7 8.6
HBN Minpack
11.0 10.3 10.6 9.1 10.3 10.1 4.9 5.1 6.9 6.9 7.7 7.7 2.4 2.4 7.5 7.5 3.3 3.3 8.0 3.3
Problem
Hahn1
Start
1 2 1
Excel Gnuplot
0.0 0.0 0.0 0.0 0.0 1.4 0.0 0.0 0.0 0.0 4.3 4.1 0.0 0.0 5.2 4.4 3.5 0.0 0.0 0.0 4.0 4.0 0.0 0.0 NS 3.7 10.0 10.0 5.4 5.4 4.8 5.0 5.9 5.9 5.8 5.9 4.1 5.1 1.6 2.2
GaussFit
0.0 0.0 0.0 1.4 NS NS 0.0 10.0 0.0 9.1 9.2 9.1 0.0 10.0 0.0 8.9 8.7 8.6 3.7 3.7
HBN Minpack
9.5 9.7 0.0 0.0 0.0 0.0 4.9 5.8 5.7 5.3 6.5 6.5 10.8 10.2 11.0 11.0 4.0 4.0 6.5 6.6 NS NS 0.0 0.0 7.6 7.5 4.3 4.3 3.5 3.5 2.4 2.4 7.6 7.6 7.6 7.6 0.0 0.0 0.0 0.0
Problem
MGH09
Start
1 2 1
Excel Gnuplot
0.0 5.0 1.7 1.5 0.0 5.6 5.3 5.2 0.0 0.0 0.0 5.1 0.0 3.2 0.0 0.0 3.6 3.6 3.2 4.4 4.5 3.8 4.2 4.1 NS 4.4 0.0 4.8 NS 2.6 6.4 6.7
Gaussfit
0.0 0.0 0.0 6.4 NS NS 8.0 8.3 0.0 0.0 0.0 8.3 NS NS NS NS
HBN Minpack
5.0 5.2 7.8 7.5 9.7 8.6 10.3 11.2 0.0 0.0 8.1 7.2 0.0 1.3 3.7 3.7 6.3 6.4 0.0 0.0 0.0 9.1 7.1 7.1 10.8 11.0 0.0 1.2 6.9 7.0 0.0 1.5
Notes: NS = the software package was unable to generate any numerical solution. A score of 0.0 implies that the package returned a solution in which at least one parameter was accurate to less than one digit.
GaussFit was unable to estimate all of the parameters to even one digit from the first starting position. From the second starting position GaussFit was able to estimate all of the parameters to over six digits correctly. This seemingly high dependence upon the starting values is a potential problem when using GaussFit for solving these nonlinear regression problems. There is no guarantee that one can find a starting value which is sufficiently close to the solution for GaussFit to effectively solve the problem. Gnuplot has an average LRE score of 4.6. While this is actually lower than the average LRE score for GaussFit, gnuplot is not so heavily dependent upon the starting position in order to solve the problem. Rather, much like Nielsens code, gnuplot seems quite capable of accurately estimating the parameter values to four digits whether the starting position is close or far from the certified values. Microsoft Excel did not solve these problems well at all. The average LRE score for Excel is 2.32. Excel did perform reasonably well on the problems with a lower level of difficulty. For the eight problems with a lower level of difficulty the average LRE was 4.18. While these are probably reasonable results for these problems, we can see that for the problems with a moderate or high level of difficulty Excel did very poorly. Such results as this would cause one to have serious questions as to Excel being able to solve any particular least squares regression problem. The Minpack library of Fortran codes also performed poorly on these particular problems. The average LRE for the twenty-six problems that Minpack did solve is 4.51. Minpack was significantly less accurate than the other packages on four of the problems, Misra1b, ENSO, Thurber, and Eckerle4. On the other hand, Minpack was considerably more accurate on the MGH10 problem. Minpack did not seem to be overly dependent upon starting position as in only two of the problems was there a significant difference in the minimum LRE for the different starting positions.
Table 2. Comparison of Robustness

Package                   N     P(%)
Gnuplot                   24    88.89%
Nielsen's MATLAB Code     23    85.19%
GaussFit                  17    62.96%
Minpack                   17    62.96%
Excel                     15    55.56%
Robustness Although the accuracy to which a particular software package is able to estimate the parameters is an important characteristic of the package, the ability of the package to solve a variety of nonlinear regression problems to an acceptable level of accuracy is perhaps more important to the user. Most users would like to have confidence that the particular software package in use is likely to estimate those parameters to an acceptable level of accuracy. What is an acceptable level of accuracy? Such a question as this might elicit a variety of responses simply depending upon the nature of the study, the data, the relative size of the parameters, and many other variables which may need to be considered. For the purposes of this study we will consider an acceptable level of accuracy to be three digits. In Table 2, the various software packages are compared by the number (and percentage) of the problems which they were able to estimate the parameters accurately to at least three digits from either starting position. Here, N is the number of problems which the package accurately estimated the parameters to at least three digits. P is the percentage of the problems which the package accurately estimated the parameters to at least three digits. It can easily be seen here that as far as the robustness of the packages is concerned there are two distinct divisions. Nielsens
MATLAB code, and Gnuplot were both able to attain the 3 digit level of accuracy for over 80% of the problems. GaussFit, Excel, and Minpack, on the other hand were able to attain that level of accuracy on less than 65% of the problems. Conclusion The robustness of the codes tested in this study is surprisingly poor. In many cases, the results were quite accurate from one starting point, and completely incorrect from another starting point. In some cases the codes failed with an error message indicating that no correct solution had been obtained, while in other cases an incorrect solution was returned without warning. Although some problems seemed to be easy for all of the codes from all of the starting points, there were other problems for which some codes easily solved the problem while other codes failed. In general, when reasonably accurate solutions were obtained, the solutions were typically accurate to five digits or better. It is suggested that users of these and other packages for nonlinear regression would be well advised to carefully check the results that they obtain. Some obvious strategies for checking the solution include running a code from several different starting points and solving the problem with more than one package.
References

Crawford, D. (1998). Gnuplot manual. Retrieved March 4, 2004, from www.ucc.ie/gnuplot/gnuplot.html.
Dennis, J. E., Gay, D. M., & Welch, R. E. (1977). An adaptive nonlinear least squares algorithm. NBER working paper 196, M.I.T./C.C.R.E.M.S., Cambridge, MA.
Hiebert, K. L. (1981). An evaluation of mathematical software that solves nonlinear least squares problems. ACM Transactions on Mathematical Software, 7(1), 1-16.
Jefferys, W. H., Fitzpatrick, M. J., McArthur, B. E., & McCartney, J. E. (1998). GaussFit: A system for least squares and robust estimation, user's manual. Austin: University of Texas.
Kitchen, A. M., Drachenberg, R., & Symanzik, J. (2003). Assessing the reliability of web-based statistical software. Computational Statistics, 18(1), 107-122.
Mathews, M., & Seymour, S. (1994). Excel for Windows: The complete reference (2nd ed.). New York: McGraw-Hill.
McCullough, B. D. (1998). Assessing reliability of statistical software: Part I. The American Statistician, 52(4), 358-366.
More, J. J., Garbow, B. S., & Hillstrom, K. E. (1978). Testing unconstrained optimization software. Rep. TM-324, Applied Math Division, Argonne National Laboratory, Argonne, IL.
More, J. J., Garbow, B. S., & Hillstrom, K. E. (1980). User guide for MINPACK-1. Report ANL-80-74. Argonne, IL: Argonne National Laboratory.
National Institute of Standards and Technology. (2000). Statistical Reference Datasets (StRD). Retrieved March 4, 2004, from http://www.itl.nist.gov/div898/strd/.
Nielsen, H. B. (2002). Nonlinear least squares problems. Retrieved March 4, 2004, from www.imm.dtu.dk/~hbn/Software/#LSQ.
Journal of Modern Applied Statistical Methods May, 2005, Vol. 4, No.1, 352
Now Available!
Performance
Outstanding performance on Intel architecture including Intel Pentium 4, Intel Xeon and Intel Itanium 2 processors, as well as support for Hyper-Threading Technology.
Compatibility
Plugs into Microsoft Visual Studio* .NET Microsoft PowerStation4 language and library support Strong compatibility with Compaq* Visual Fortran
Support
1 year of free product upgrades and Intel Premier Support
"The Intel Fortran Compiler 7.0 was first-rate, and Intel Visual Fortran 8.0 is even better. Intel has made a giant leap forward in combining the best features of Compaq Visual Fortran and Intel Fortran. This compiler continues to be a must-have tool for any Twenty-First Century Fortran migration or software development project."

Dr. Robert R. Trippi, Professor of Computational Finance, University of California, San Diego
programmersparadise.com/intel
NCSS
329 North 1000 East Kaysville, Utah 84037
[Figure: Histogram of SepalLength by Iris (Count vs. SepalLength)]
NCSS 2004 is a new edition of our popular NCSS statistical package that adds seventeen new procedures.
New Procedures
Two Independent Proportions
Two Correlated Proportions
One-Sample Binary Diagnostic Tests
Two-Sample Binary Diagnostic Tests
Paired-Sample Binary Diagnostic Tests
Cluster Sample Binary Diagnostic Tests
Meta-Analysis of Proportions
Meta-Analysis of Correlated Proportions
Meta-Analysis of Means
Meta-Analysis of Hazard Ratios
Curve Fitting
Tolerance Intervals
Comparative Histograms
ROC Curves
Elapsed Time Calculator
T-Test from Means and SDs
Hybrid Appraisal (Feedback) Model
Meta-Analysis
Procedures are available for combining studies that measure paired proportions, means, independent proportions, and hazard ratios. Plots include the forest plot, radial plot, and L'Abbé plot. Both fixed-effects and random-effects models are available for combining the results.
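As a rough illustration of the fixed- and random-effects combination described above, here is a generic inverse-variance sketch in Python; it is not NCSS's implementation, and the effect sizes and standard errors are invented:

```python
# Generic inverse-variance meta-analysis sketch (illustrative data).
import numpy as np

effects = np.array([0.30, 0.10, 0.45, 0.25])   # per-study effect estimates
se = np.array([0.12, 0.15, 0.20, 0.10])        # per-study standard errors

w = 1.0 / se**2                                 # fixed-effect weights
fixed = np.sum(w * effects) / np.sum(w)

# DerSimonian-Laird estimate of between-study variance (tau^2).
q = np.sum(w * (effects - fixed) ** 2)
df = len(effects) - 1
c = np.sum(w) - np.sum(w**2) / np.sum(w)
tau2 = max(0.0, (q - df) / c)

w_star = 1.0 / (se**2 + tau2)                   # random-effects weights
random_eff = np.sum(w_star * effects) / np.sum(w_star)

print(f"fixed-effect estimate   = {fixed:.3f}")
print(f"random-effects estimate = {random_eff:.3f} (tau^2 = {tau2:.3f})")
```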
Curve Fitting
This procedure combines several of our curve fitting programs into one module. It adds many new models, such as the Michaelis-Menten model. It analyzes curves from several groups, compares fitted models across groups using computer-intensive randomization tests, and computes bootstrap confidence intervals.
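For readers unfamiliar with the Michaelis-Menten model mentioned above, a minimal fitting sketch in Python; SciPy stands in here for the NCSS module, and the data are invented:

```python
# Fit the Michaelis-Menten model v = Vmax * S / (Km + S) to invented data.
import numpy as np
from scipy.optimize import curve_fit

def michaelis_menten(s, vmax, km):
    return vmax * s / (km + s)

s = np.array([0.5, 1.0, 2.0, 4.0, 8.0, 16.0])   # substrate levels
v = np.array([1.6, 2.6, 3.8, 4.9, 5.7, 6.1])    # observed responses

(vmax_hat, km_hat), cov = curve_fit(michaelis_menten, s, v, p0=(6.0, 2.0))
print(f"Vmax = {vmax_hat:.2f}, Km = {km_hat:.2f}")
```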
ROC Curves
This procedure generates both binormal and empirical (nonparametric) ROC curves. It computes comparative measures such as the whole and partial area under the ROC curve (AUC), and it provides statistical tests comparing AUCs and partial AUCs for paired and independent-sample designs.
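A small sketch of the empirical (nonparametric) AUC described above, using the standard rank (Mann-Whitney) formulation rather than NCSS's procedure; the scores and disease labels are invented:

```python
# Empirical AUC via the Mann-Whitney rank statistic (illustrative data).
import numpy as np
from scipy.stats import rankdata

scores = np.array([0.2, 0.4, 0.35, 0.8, 0.65, 0.9, 0.7, 0.1])
labels = np.array([0, 0, 0, 1, 1, 1, 0, 0])    # 1 = diseased, 0 = healthy

n_pos = int(labels.sum())
n_neg = len(labels) - n_pos
ranks = rankdata(scores)                        # midranks handle ties
auc = (ranks[labels == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
print(f"empirical AUC = {auc:.3f}")
```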
Documentation
The printed, 330-page manual, called NCSS User's Guide V, is available for $29.95. An electronic (PDF) version of the manual is included on the distribution CD and in the Help system.
Tolerance Intervals
This procedure calculates one- and two-sided tolerance intervals using both distribution-free (nonparametric) methods and normal-distribution (parametric) methods. Tolerance intervals are bounds between which a given percentage of a population falls.
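To make the idea concrete, here is a sketch of a one-sided normal (parametric) tolerance bound via the noncentral t distribution; this is a standard textbook construction, not necessarily the method NCSS uses, and the sample is simulated:

```python
# One-sided upper tolerance bound: with confidence 1-alpha, at least a
# proportion p of a normal population lies below xbar + k*s.
import numpy as np
from scipy.stats import norm, nct

rng = np.random.default_rng(1)
x = rng.normal(loc=100.0, scale=15.0, size=30)   # simulated sample
n, xbar, s = len(x), x.mean(), x.std(ddof=1)

p, alpha = 0.95, 0.05
delta = norm.ppf(p) * np.sqrt(n)                 # noncentrality parameter
k = nct.ppf(1 - alpha, df=n - 1, nc=delta) / np.sqrt(n)
print(f"upper tolerance bound = {xbar + k * s:.2f} (k = {k:.3f})")
```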
Two Proportions
Several new exact and asymptotic techniques were added for hypothesis testing (null, noninferiority, equivalence) and for calculating confidence intervals for the difference, ratio, and odds ratio. Designs may be independent or paired. Methods include Farrington & Manning, Gart & Nam, conditional and unconditional exact, Wilson's score, Miettinen & Nurminen, and Chen.
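As a baseline for the more specialized methods listed above, a minimal asymptotic (pooled z) test for the difference of two independent proportions; this is the classic textbook test, not the Farrington-Manning or Miettinen-Nurminen procedures, and the counts are hypothetical:

```python
# Classic pooled z-test for p1 - p2 with hypothetical counts.
from math import sqrt
from scipy.stats import norm

x1, n1 = 42, 120     # events / trials, group 1
x2, n2 = 27, 115     # events / trials, group 2

p1, p2 = x1 / n1, x2 / n2
p_pool = (x1 + x2) / (n1 + n2)
se = sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
z = (p1 - p2) / se
p_value = 2 * norm.sf(abs(z))
print(f"difference = {p1 - p2:.3f}, z = {z:.2f}, two-sided p = {p_value:.4f}")
```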
Comparative Histogram
This procedure displays a comparative histogram created by interspersing or overlaying the individual histograms of two or more groups or variables. This allows the direct comparison of the distributions of several groups.
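A quick sketch of an overlaid comparative histogram in Python with Matplotlib, standing in for the NCSS procedure; the two groups are simulated:

```python
# Overlay the histograms of two simulated groups for direct comparison.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
group_a = rng.normal(50, 6, size=150)
group_b = rng.normal(58, 8, size=150)

bins = np.linspace(30, 85, 23)                 # shared bin edges
plt.hist(group_a, bins=bins, alpha=0.5, label="Group A")
plt.hist(group_b, bins=bins, alpha=0.5, label="Group B")
plt.xlabel("Measurement")
plt.ylabel("Count")
plt.title("Comparative histogram (overlaid)")
plt.legend()
plt.show()
```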
My Payment Option:
___ Check enclosed ___ Please charge my: __VISA __ MasterCard ___Amex ___ Purchase order attached___________________________
Card Number ______________________________________ Exp ________ Signature______________________________________________________
Telephone: ( ) ____________________________________________________
Email: ____________________________________________________________
Ship to:
NAME ______________________________________________________________
ADDRESS ___________________________________________________________
ADDRESS ___________________________________________________________
CITY _____________________ STATE __________ ZIP/POSTAL CODE __________ COUNTRY __________
TO PLACE YOUR ORDER CALL: (800) 898-6109 FAX: (801) 546-3907 ONLINE: www.ncss.com MAIL: NCSS, 329 North 1000 East, Kaysville, UT 84037
[Figure: Y = Michaelis-Menten curve fit, Response vs. Temp, by Type]
[Figure: Histogram of SepalLength by Iris, Count vs. SepalLength]
[Figure: Forest plot of odds ratios by StudyId, grouped by treatment (Diet, Drug, Surgery)]
[Figure: ROC curve, Sensitivity vs. 1-Specificity]
PASS 2002
Power Analysis and Sample Size Software from NCSS
PASS performs power analysis and calculates sample sizes. Use it before you begin a study to calculate an appropriate sample size (it meets the requirements of government agencies that want technical justification of the sample size you have used), and use it after a study to determine whether your sample size was large enough. PASS calculates the sample sizes necessary to perform all of the statistical tests listed below. A power analysis usually involves several "what if" questions. PASS lets you solve for power, sample size, effect size, and alpha level, and it automatically creates appropriate tables and charts of the results. PASS is accurate: it has been extensively verified using books and reference articles, and proof of the accuracy of each procedure is included in the extensive documentation. PASS is a standalone system; although it is integrated with NCSS, you do not have to own NCSS to run it, and you can use it with any statistical software you want.
[Figure: Power vs. N1 by Alpha (0.01, 0.05, 0.10) for a two-sided t-test with M1=20.90, M2=17.80, S1=3.67, S2=3.01, N2=N1]
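A sketch of the power calculation behind the plotted scenario; the means, standard deviations, and alpha levels are taken from the figure title, the sample sizes are arbitrary illustration points, and the computation uses SciPy's noncentral t distribution rather than PASS itself:

```python
# Power of a two-sided, two-sample pooled t-test for the plotted scenario
# (M1=20.90, M2=17.80, S1=3.67, S2=3.01, N2=N1).
import numpy as np
from scipy.stats import t, nct

m1, m2, s1, s2 = 20.90, 17.80, 3.67, 3.01

def power_two_sample_t(n, alpha):
    sp = np.sqrt((s1**2 + s2**2) / 2.0)        # pooled SD with equal n
    ncp = (m1 - m2) / (sp * np.sqrt(2.0 / n))  # noncentrality parameter
    df = 2 * n - 2
    tcrit = t.ppf(1 - alpha / 2, df)
    return 1 - nct.cdf(tcrit, df, ncp) + nct.cdf(-tcrit, df, ncp)

for alpha in (0.01, 0.05, 0.10):
    for n in (10, 20, 30, 40, 50):
        print(f"alpha={alpha:.2f}  n per group={n:2d}  "
              f"power={power_two_sample_t(n, alpha):.3f}")
```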
PASS Beats the Competition! No other program calculates sample sizes and power for as many different statistical procedures as PASS does. Specifying your input is easy, especially with the online help and manual. PASS automatically displays charts and graphs along with numeric tables and text summaries in a portable format that can be cut and pasted into any word processor, so you can easily include the results in your proposal. Choose PASS. It's more comprehensive, easier to use, accurate, and less expensive than any other sample size program on the market.
Trial Copy Available
You can try out PASS by downloading it from our website. This trial copy is good for 30 days. We are sure you will agree that it is the easiest and most comprehensive power analysis and sample size program available. PASS sells for as little as $449.95.

System Requirements
PASS runs on Windows 95/98/ME/NT/2000/XP with at least 32 MB of RAM and 30 MB of hard disk space.
PASS comes with two manuals that contain tutorials, examples, annotated output, references, formulas, verification, and complete instructions on each procedure. And, if you cannot find an answer in the manual, our free technical support staff (which includes a PhD statistician) is available.
Analysis of Variance: Factorial AOV, Fixed Effects AOV, Geisser-Greenhouse, MANOVA*, Multiple Comparisons*, One-Way AOV, Planned Comparisons, Randomized Block AOV, Repeated Measures AOV*
Regression / Correlation: Correlations (one or two), Cox Regression*, Logistic Regression, Multiple Regression, Poisson Regression*, Intraclass Correlation, Linear Regression
Proportions: Chi-Square Test, Confidence Interval, Equivalence of McNemar*, Equivalence of Proportions, Fisher's Exact Test, Group Sequential Proportions, Matched Case-Control, McNemar Test, Odds Ratio Estimator, One-Stage Designs*, Proportions 1 or 2, Two-Stage Designs (Simon's), Three-Stage Designs*
Miscellaneous Tests: Exponential Means 1 or 2*, ROC Curves 1 or 2*, Variances 1 or 2
T Tests: Cluster Randomization, Confidence Intervals, Equivalence T Tests, Hotelling's T-Squared*, Group Sequential T Tests, Mann-Whitney Test, One-Sample T-Tests, Paired T-Tests, Standard Deviation Estimator, Two-Sample T-Tests, Wilcoxon Test
Survival Analysis: Cox Regression*, Logrank Survival - Simple, Logrank Survival - Advanced*, Group Sequential - Survival, Post-Marketing Surveillance, ROC Curves 1 or 2*
Group Sequential Tests: Alpha Spending Functions, Lan-DeMets Approach, Means, Proportions, Survival Curves
Equivalence: Means, Proportions, Correlated Proportions*
Miscellaneous Features: Automatic Graphics, Finite Population Corrections, Solves for any parameter, Text Summary, Unequal N's
*New in PASS 2002
NCSS Statistical Software 329 North 1000 East Kaysville, Utah 84037 Internet (download free demo version): http://www.ncss.com Email: [email protected] Toll Free: (800) 898-6109 Tel: (801) 546-0445 Fax: (801) 546-3907
PASS 2002 adds power analysis and sample size to your statistical toolbox
WHAT'S NEW IN PASS 2002? Thirteen new procedures have been added to PASS, as well as a new home-base window and a new Guide Me facility.

MANY NEW PROCEDURES: The new procedures include a new multifactor repeated measures program that includes multivariate tests, Cox proportional hazards regression, Poisson regression, MANOVA, equivalence testing when proportions are correlated, multiple comparisons, ROC curves, and Hotelling's T-squared.

TEXT STATEMENTS: The text output translates the numeric output into easy-to-understand sentences. These statements may be transferred directly into your grant proposals and reports.

GRAPHICS: The creation of charts and graphs is easy in PASS. These charts are easily transferred into other programs such as MS PowerPoint and MS Word.

NEW USER'S GUIDE II: A new, 250-page manual describes each new procedure in detail. Each chapter contains explanations, formulas, examples, and accuracy verification. The complete manual is stored in PDF format on the CD so that you can read and print out your own copy.

GUIDE ME: The new Guide Me facility makes it easy for first-time users to enter parameter values. The program literally steps you through those options that are necessary for the sample size calculation.

NEW HOME BASE: A new home-base window has been added just for PASS users. This window helps you select the appropriate program module.

COX REGRESSION: A new Cox regression procedure has been added to perform power analysis and sample size calculation for this important statistical technique.

REPEATED MEASURES: A new repeated-measures analysis module has been added that lets you analyze designs with up to three grouping factors and up to three repeated factors. The analysis includes both the univariate F test and three common multivariate tests, including Wilks' Lambda.

RECENT REVIEW: In a recent review, 17 of 19 reviewers selected PASS as the program they would recommend to their colleagues.
My Payment Options:
___ Check enclosed
___ Please charge my: __VISA __MasterCard __Amex ___ Purchase order enclosed
Card Number _______________________________________________Expires_______ Signature____________________________________________________ Please provide daytime phone: ( )_______________________________________________________
___ PASS 2002 User's Guide II (printed manual): $30.00.........$ _____ ___ PASS 2002 Upgrade CD for PASS 2000 users: $149.95 .......$ _____ Typical Shipping & Handling: USA: $9 regular, $22 2-day, $33 overnight. Canada: $19 Mail. Europe: $50 Fedex.......................$ _____ Total: ...................................................................................$ _____ FOR FASTEST DELIVERY, ORDER ONLINE AT
WWW.NCSS.COM
Email your order to [email protected] Fax your order to (801) 546-3907 NCSS, 329 North 1000 East, Kaysville, UT 84037 (800) 898-6109 or (801) 546-0445
Introducing GGUM2004
Item Response Theory Models for Unfolding
The new GGUM2004 software system estimates parameters in a family of item response theory (IRT) models that unfold polytomous responses to questionnaire items. These models assume that persons and items can be jointly represented as locations on a latent unidimensional continuum. A single-peaked, nonmonotonic response function is the key feature that distinguishes unfolding IRT models from traditional, "cumulative" IRT models. This response function suggests that a higher item score is more likely to the extent that an individual is located close to a given item on the underlying continuum. Such single-peaked functions are appropriate in many situations, including attitude measurement with Likert or Thurstone scales and preference measurement with stimulus rating scales. This family of models can also be used to determine the locations of respondents in particular developmental processes that occur in stages.

The GGUM2004 system estimates item parameters using marginal maximum likelihood, and person parameters are estimated using an expected a posteriori (EAP) technique. The program allows for up to 100 items with 2-10 response categories per item, and up to 2000 respondents. GGUM2004 is compatible with computers running updated versions of Windows 98 SE, Windows 2000, and Windows XP. The software is accompanied by a detailed technical reference manual and a new Windows user's guide. GGUM2004 is free and can be downloaded from:
http://www.education.umd.edu/EDMS/tutorials
GGUM2004 improves upon its predecessor (GGUM2000) in several important ways:
- It has a user-friendly graphical interface for running commands and displaying output.
- It offers real-time graphics that characterize the performance of a given model.
- It provides new item fit indices with desirable statistical characteristics.
- It allows for missing item responses, assuming the data are missing at random.
- It allows the number of response categories to vary across items.
- It estimates model parameters more quickly.
Start putting the power of unfolding IRT models to work in your attitude and preference measurement endeavors. Download your free copy of GGUM2004 today!
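To illustrate the single-peaked response function idea described above, here is a deliberately simplified squared-distance unfolding curve; it is a toy illustration, not GGUM2004's actual parameterization:

```python
# Simplified single-peaked (unfolding) response curve: agreement with an
# item is highest when the person's location theta is near the item's
# location delta. This is an illustrative toy model, not the GGUM itself.
import numpy as np

def unfolding_prob(theta, delta, discrimination=1.0):
    # Endorsement probability decays with squared distance (theta - delta)^2.
    return np.exp(-discrimination * (theta - delta) ** 2)

thetas = np.linspace(-3, 3, 7)
item_location = 0.5
for th in thetas:
    print(f"theta={th:+.1f}  P(agree)={unfolding_prob(th, item_location):.3f}")
```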
DataMineIt℠ announces PermuteIt™ v2.0
The fastest, most comprehensive and robust permutation test software on the market today.
Permutation tests are increasingly the statistical method of choice for addressing business questions and research hypotheses across a broad range of industries. Their distribution-free nature maintains test validity where many parametric tests (and even other nonparametric tests), encumbered by restrictive and often inappropriate data assumptions, fail miserably. The computational demands of permutation tests, however, have severely limited other vendors' attempts at providing usable permutation test software for anything but highly stylized situations or small datasets and few tests.

PermuteIt™ addresses this unmet need by utilizing a combination of algorithms to perform nonparametric permutation tests very quickly, often more than an order of magnitude faster than widely available commercial alternatives, when one sample is large and many tests and/or multiple comparisons are being performed (which is when runtimes matter most). PermuteIt™ can make the difference between making deadlines or missing them, since data inputs often need to be revised, resent, or recleaned, and one hour of runtime can quickly become 10, 20, or 30 hours. In addition to its speed even when one sample is large, some of the unique and powerful features of PermuteIt™ include:
- the availability to the user of a wide range of test statistics for performing permutation tests on continuous, count, and binary data, including: pooled-variance t-test; separate-variance Behrens-Fisher t-test, scale test, and joint tests for scale and location coefficients using nonparametric combination methodology; Brownie et al. modified t-test; skew-adjusted modified t-test; Cochran-Armitage test; exact inference; Poisson normal-approximate test; Fisher's exact test; Freeman-Tukey Double Arcsine test
- extremely fast exact inference (no confidence intervals, just exact p-values) for most count data and high-frequency continuous data, often several orders of magnitude faster than the most widely available commercial alternative
- the availability to the user of a wide range of multiple testing procedures, including: Bonferroni, Sidak, Stepdown Bonferroni, Stepdown Sidak, Stepdown Bonferroni and Stepdown Sidak for discrete distributions, Hochberg Step-up, FDR, Dunnett's one-step (for MCC under ANOVA assumptions), Single-step Permutation, Stepdown Permutation, Single-step and Stepdown Permutation for discrete distributions, and permutation-style adjustment of permutation p-values
- fast, efficient, and automatic generation of all pairwise comparisons
- efficient variance reduction under conventional Monte Carlo via self-adjusting permutation sampling when confidence intervals contain the user-specified critical value of the test
- maximum power, and the shortest confidence intervals, under conventional Monte Carlo via a new sampling optimization technique (see Opdyke, JMASM, Vol. 2, No. 1, May, 2003)
- fast permutation-style p-value adjustments for multiple comparisons (the code is designed to provide an additional speed premium for many of these resampling-based multiple testing procedures)
- simultaneous permutation testing and permutation-style p-value adjustment, although for relatively few tests at a time (this capability is not even provided as a preprogrammed option with any other software currently on the market)

For Telecommunications, Pharmaceuticals, fMRI data, Financial Services, Clinical Trials, Insurance, Bioinformatics, and just about any data-rich industry where large numbers of distributional null hypotheses need to be tested on samples that are not extremely small and parametric assumptions are either uncertain or inappropriate, PermuteIt™ is the optimal, and only, solution. To learn more about how PermuteIt™ can be used for your enterprise, and to obtain a demo version, please contact its author, J.D. Opdyke, President, DataMineIt℠, at [email protected] or www.DataMineIt.com.

DataMineIt℠ is a technical consultancy providing statistical data mining, econometric analysis, and data warehousing services and expertise to the industry, consulting, and research sectors. PermuteIt™ is its flagship product.
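For readers new to the technique the ad describes, a minimal two-sample permutation test of a difference in means; this is a generic Monte Carlo sketch in Python, unrelated to PermuteIt's proprietary algorithms, and the data are invented:

```python
# Two-sample permutation test for a difference in means (Monte Carlo version).
import numpy as np

rng = np.random.default_rng(3)
group_a = np.array([12.1, 14.3, 11.8, 15.2, 13.5, 12.9])
group_b = np.array([10.4, 11.7, 9.9, 12.3, 10.8, 11.1])

observed = group_a.mean() - group_b.mean()
pooled = np.concatenate([group_a, group_b])
n_a = len(group_a)

n_perm = 10000
count = 0
for _ in range(n_perm):
    perm = rng.permutation(pooled)
    diff = perm[:n_a].mean() - perm[n_a:].mean()
    if abs(diff) >= abs(observed):                 # two-sided comparison
        count += 1

p_value = (count + 1) / (n_perm + 1)               # add-one correction
print(f"observed difference = {observed:.2f}, permutation p = {p_value:.4f}")
```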
Print Subscriptions
Print subscriptions, including postage, are US $95 per year for professionals, US $47.50 per year for graduate students, and US $195 per year for libraries, universities, and corporations. Subscribers outside of the US and Canada pay a US $10 surcharge for additional postage. Online access is currently free at http://tbf.coe.wayne.edu/jmasm. Mail subscription requests with remittances to JMASM, P. O. Box 48023, Oak Park, MI, 48237. Email journal correspondence, other than manuscript submissions, to [email protected].
Notice To Advertisers
Send requests for advertising information to [email protected].
NEW IN 2004
25% discount on a new personal subscription, plus great discounts for students!
Further information including submission guidelines, subscription information and details of how to obtain a free sample copy are available at
www.blackwellpublishing.com/SIGN
STATISTICIANS
HAVE YOU VISITED THE MATHEMATICS GENEALOGY PROJECT?
[email protected]
The genealogy project is a not-for-profit endeavor supported by donations from individuals and sales of posters and t-shirts. If you would like to help this cause, please send your tax-deductible contribution to: Mathematics Genealogy Project, 300 Minard Hall, P. O. Box 5075, Fargo, North Dakota 58105-5075.