
The Chi-Square Test: Often Used and More Often Misinterpreted

Todd Michael Franke, Timothy Ho, and Christina A. Christie

American Journal of Evaluation, 2012, 33(3), 448-458. DOI: 10.1177/1098214011426594

Abstract

The examination of cross-classified category data is common in evaluation and research, with Karl Pearson's family of chi-square tests representing one of the most utilized statistical analyses for answering questions about the association or difference between categorical variables. Unfortunately, these tests are also among the more commonly misinterpreted statistical tests in the field. The problem is not that researchers and evaluators misapply the results of chi-square tests, but rather that they tend to overinterpret or incorrectly interpret the results, leading to statements that may have limited or no statistical support based on the analyses performed. This paper attempts to clarify any confusion about the uses and interpretations of the family of chi-square tests developed by Pearson, focusing primarily on the chi-square tests of independence and homogeneity of proportions (identity of distributions). A brief survey of the recent evaluation literature is presented to illustrate the prevalence of the chi-square test and to offer examples of how these tests are misinterpreted. While the omnibus form of all three tests in the Karl Pearson family of chi-square tests—independence, homogeneity, and goodness of fit—uses essentially the same formula, each of these three tests is, in fact, distinct, with specific hypotheses, sampling approaches, interpretations, and options following rejection of the null hypothesis. Finally, a little-known option, the use and interpretation of post hoc comparisons based on Goodman's procedure (Goodman, 1963) following the rejection of the chi-square test of homogeneity, is described in detail.

Keywords: chi-square test, quantitative methods, methods use, using chi-square test

Karl Pearson initially developed the chi-square test in 1900 and applied it to test the goodness of fit for frequency curves. Later, in 1904, he extended it to contingency tables to test for independence between rows and columns (Stigler, 1999). Since then, the Pearson family of chi-square tests has become one of the most common sets of statistical analyses in evaluation and social science research. Unfortunately, these tests are also among the more commonly misinterpreted statistical tests in the field. The problem is not that researchers and evaluators misapply the results of chi-square tests, but rather that they tend to overinterpret or incorrectly interpret the results, leading them to make statements that may have limited or no statistical support based on the analyses performed.

In this article, we attempt to clarify any confusion about the uses and interpretations of the family of chi-square tests developed by Pearson, focusing primarily on the chi-square tests of independence and homogeneity of proportions (identity of distributions). First, the family of chi-square statistics is presented, including the distinguishing features of and appropriate uses for each specific test. Next, a brief survey of the recent evaluation literature is presented to illustrate the prevalence of the chi-square test and to offer examples of how these tests are misinterpreted. Finally, a little-known option, the use of post hoc comparisons based on Goodman's procedure (Goodman, 1963) following the rejection of the chi-square test of homogeneity, is described.

The Karl Pearson Family of Chi-Square Tests

The chi-square test is computationally simple. It is used to examine independence across two categorical variables or to assess how well a sample fits the distribution of a known population (goodness of fit). The chi-square tests in the Karl Pearson family are not to be confused with others such as the Yates chi-square test (correction for continuity), the Mantel–Haenszel chi-square, or the Maxwell–Stuart tests of correlated proportions. Each of these has its own applications, though they all use the chi-square distribution as the reference distribution. In fact, many tests that assess model fit use the chi-square distribution as the reference distribution. For example, many covariance structure analyses, including factor analysis and structural equation modeling, assess model fit by comparing the sample covariances to those derived from the model. Again, while they are based on the same chi-square distribution, these tests are similar to the Karl Pearson family of tests only in that they compare an observed set of data to what is expected.
The omnibus form of all three tests in the Karl Pearson family of chi-square tests—goodness of fit, independence, and homogeneity—uses essentially the same formula. Each of these three tests is, in fact, distinct, with specific hypotheses, interpretations, and options following rejection of the null hypothesis. The formula for computing the test statistic is as follows:

\chi^2 = \sum_{i=1}^{n} \frac{(O_i - E_i)^2}{E_i},

where n is the number of cells in the table, O_i is the observed count in cell i, and E_i is the corresponding expected count. The obtained test statistic is compared against a critical value from the chi-square distribution with (r - 1)(c - 1) degrees of freedom.

The main difference across the three chi-square tests relates to the situations for which each is appropriate. The chi-square goodness of fit test is used when a sample is compared on a variable of interest against a population with known parameters. For example, a goodness of fit test might be applied to a survey sample to compare whether the ethnicity or income of the survey respondents is consistent with the known demographic makeup of the geographic locale from which the sample was drawn. The null and alternative hypotheses are:

Hypothesis_0: The data follow a specified distribution.
Hypothesis_A: The data do not follow the specified distribution.

The interpretation upon rejection is that the sample differs significantly from the population on the variable of interest.

The chi-square test of independence determines whether two categorical variables in a single sample are independent from or associated with each other. For example, a survey might be administered to 1,000 participants who each report their hair color and favorite ice cream flavor. The test would then be used to determine whether hair color and ice cream preference are independent of each other. The null and alternative hypotheses are as follows:

Hypothesis_0: The variables of interest are independent.
Hypothesis_A: The variables of interest are associated.

A significant test rejecting the null hypothesis would suggest that within the sample, one variable of interest is associated with a second variable of interest.

Finally, the chi-square test of homogeneity is used to determine whether two or more independent samples differ in their distributions on a single variable of interest. One common use of this test is to compare two or more groups or conditions on a categorical outcome. A significant test statistic indicates that the groups differ on the distribution of the variable of interest, but it does not indicate which of the groups are different or where the groups differ. The null and alternative hypotheses are as follows:

Hypothesis_0: The proportions between groups are the same.
Hypothesis_A: The proportions between groups are different.

We focus on the practical and important differences between the tests of independence and homogeneity because they are so frequently used in evaluation and applied research studies. Despite the fact that the formulation of the omnibus test statistic is the same for the test of independence and the test of homogeneity, these two tests differ in their sampling assumptions, null hypotheses, and options following a rejection.
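Before turning to those differences, the shared omnibus computation can be made concrete with a short sketch. The Python code below is our own illustration rather than anything from the article: the function name and the example counts are hypothetical, and expected counts are derived from the row and column margins, as is done for the tests of independence and homogeneity.

```python
# A minimal sketch of the omnibus Pearson chi-square computation.
# The function name and the example table are illustrative, not from the article.
from scipy.stats import chi2


def pearson_chi_square(observed):
    """Chi-square statistic, df, and p value for an r x c table of counts,
    with expected counts derived from the row and column margins."""
    r, c = len(observed), len(observed[0])
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(observed[i][j] for i in range(r)) for j in range(c)]
    total = sum(row_totals)

    statistic = 0.0
    for i in range(r):
        for j in range(c):
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed[i][j] - expected) ** 2 / expected

    df = (r - 1) * (c - 1)
    return statistic, df, chi2.sf(statistic, df)  # sf gives the upper-tail p value


# Hypothetical 2 x 3 table of counts, used only to exercise the function.
print(pearson_chi_square([[30, 45, 25], [20, 35, 45]]))
```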
The main difference between the two tests is how the data are collected and sampled. Specifically, the test of independence collects data on a single sample and then compares two variables within that sample to determine the relationship between them. The test of homogeneity intentionally collects data on two or more distinct groups (Note 1), as might be the case in a treatment or intervention study with a comparison group. The samples are then compared on a single variable of interest to test whether the proportions differ between them. Wickens (1989) presents a thoughtful and succinct description of these tests, as well as their sampling assumptions and hypotheses. In addition to the tests of homogeneity and independence, Wickens presents an additional alternative in which both margins are fixed, which he refers to as the "test of unrelated classification."

When data are collected on only a single sample, only the test of independence is valid, and only interpretations of association between variables can be made. When data on two or more samples are collected, the test of homogeneity is appropriate, and comparisons of proportions can be made across the multiple groups. When sampling occurs from multiple populations, and thus the homogeneity hypothesis is appropriate, it is also reasonable (although less interesting) to ask the independence question. In the earlier example regarding hair color and ice cream preference, if the researcher defined the populations by hair color and collected information on 500 brunettes and 500 blondes, these would constitute two independent samples. Comparisons of the proportions of blondes and brunettes by their ice cream preferences would be valid. When random assignment is used to assign participants to two or more conditions, these groups are by definition independent, and the test of homogeneity may be used to test for differences between the groups.

Table 1. Chi-Square Tests and Attributes

Attribute | Test of Independence | Test of Homogeneity | Test of Goodness of Fit
Sampling type | Single dependent sample | Two (or more) independent samples | Sample from population
Interpretation | Association between variables | Difference in proportions | Difference from population
Null hypothesis | No association between variables | No difference in proportions between groups | No difference in distribution between sample and population

Perhaps these distinctions can best be illustrated by the null hypothesis tested in each of these two tests. The null hypothesis of the chi-square test of independence states that there is no association between the two categorical variables. It can be written as H_0: \phi = 0 or H_0: V = 0; that is, the association between the two categorical variables, as measured by a phi (\phi) correlation for 2 x 2 contingency tables or by Cramér's V for larger tables, is zero, and the variables are independent:

H_0: \phi = 0, \quad H_A: \phi \neq 0 \qquad \text{or} \qquad H_0: V = 0, \quad H_A: V \neq 0.

The chi-square test of homogeneity compares the proportions between groups on a variable of interest. The null hypothesis can be presented in matrix form:

H_0: \begin{bmatrix} p_{11} = p_{12} = \cdots = p_{1k} \\ p_{21} = p_{22} = \cdots = p_{2k} \\ p_{31} = p_{32} = \cdots = p_{3k} \\ \vdots \\ p_{r1} = p_{r2} = \cdots = p_{rk} \end{bmatrix}, \qquad H_A: \text{the null is false.}

Rejection of the null hypothesis in the case of three or more groups allows the researcher to conclude only that the proportions differ between the groups, not which groups are different. Table 1 summarizes the distinctions between the three types of chi-square tests—specifically, the sampling required for each test, the correct interpretation of each test, and the null hypothesis assumed by each test.
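Because the independence null hypothesis above is stated in terms of \phi and Cramér's V, a brief sketch of how those effect sizes follow from the omnibus statistic may be helpful. The code below is our own illustration, not part of the article; the counts are hypothetical, and scipy's chi2_contingency supplies the omnibus statistic.

```python
# Illustrative sketch (not from the article): effect sizes tied to the
# independence null hypothesis. Phi applies to 2 x 2 tables; Cramer's V
# generalizes it to larger tables. The counts below are hypothetical.
import numpy as np
from scipy.stats import chi2_contingency

table = np.array([[120, 90, 40],   # e.g., hair color (rows)
                  [80, 110, 60]])  # by ice cream preference (columns)

statistic, p_value, df, expected = chi2_contingency(table, correction=False)
n = table.sum()
k = min(table.shape)                            # smaller of r and c
cramers_v = np.sqrt(statistic / (n * (k - 1)))  # equals phi when r = c = 2
print(f"chi2({df}) = {statistic:.2f}, p = {p_value:.4f}, V = {cramers_v:.3f}")
```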
One common misinterpretation of chi-square tests comes from not distinguishing between these three specific tests. Indeed, when most researchers state that they "utilized a chi-square test," they are typically referring to the chi-square test of independence. This lack of specificity often leads researchers to apply the interpretation of one test when another was actually conducted. For example, researchers often feel compelled to compare the proportions between groups, regardless of how the data were drawn. Most often, the data on two categorical variables are collected from a single sample (e.g., survey data), in which case the assumptions for the chi-square test of homogeneity are not met, and an interpretation comparing proportions between groups is not valid. Even in those situations where data are drawn from multiple samples and the test of homogeneity is appropriate, researchers seem unaware that procedures exist specifically to follow up after the rejection of the omnibus test. Consider the following null hypothesis:

H_0: \begin{bmatrix} p_{11} = p_{12} = p_{13} \\ p_{21} = p_{22} = p_{23} \end{bmatrix}.

A rejection in this case indicates that at least one proportion is different from at least one other proportion (Note 2). Often, a researcher will conduct a chi-square test, find a significant value, and then look for the cells with the largest disparity in proportions or frequencies to make a substantive interpretation. The proper procedure would involve conducting post hoc comparisons after the omnibus chi-square test to determine where the significant differences actually are. Post hoc procedures for chi-square tests are discussed in a later section.

Chi-Square Tests in Recent Evaluation Literature

A brief survey of recent evaluation literature was conducted in order to obtain a general sense of how often chi-square tests are used and how often researchers misinterpret the results. Surveying the evaluation literature is an approach that has been used by several researchers as a method for better understanding the methods and strategies used in evaluation practice. For example, Greene, Caracelli, and Graham (1989) included published evaluation studies in their sample when reviewing 57 empirical mixed-methods evaluations. Findings from the empirical study were used to refine a mixed-methods conceptual framework that had originally been developed from the theoretical literature and was intended to inform and guide practice. More recently, Miller and Campbell (2006) studied empowerment evaluation in practice by examining 47 case examples published from 1994 through June 2005 to determine the extent to which empowerment evaluation could be distinguished from evaluation approaches emphasizing similar elements, and the extent to which empowerment evaluation led to empowered outcomes for program beneficiaries.

For the current study, four prominent evaluation journals were selected for review: American Journal of Evaluation, Evaluation Review, Educational Evaluation and Policy Analysis, and Evaluation and Program Planning. Every article published in these four journals between January 2008 and August 2010 was reviewed.
These journals and periods were not intended to provide a comprehensive search of the evaluation literature, but mainly to obtain a picture of the prevalence of chi-square tests and the extent to which these tests are incorrectly interpreted. The vast majority of chi-square tests and misinterpretations probably exist in evaluation reports that are never read beyond a small circle of intended users, but we believe that the proliferation of chi-square test misinterpretations is exacerbated by evaluation literature that is read by a larger audience. After book reviews, section introductions, memoranda, and other editorial content were excluded, a total of 292 articles were available for review. Two graduate student researchers coded each article on a variety of measures, including whether inferential statistics were used and whether a chi-square test was used. For articles that used a chi-square test, additional codes identified whether the article contained the correct interpretation given the sampling procedure, whether post hoc interpretations were used, and whether post hoc tests were conducted.

Table 2 details the number of articles in each journal as well as how many used inferential quantitative statistics. Overall, just over a third (36.6%; n = 107) of the articles used some sort of inferential statistic, ranging from a simple t test to more advanced structural equation models. Of the 107 articles that used inferential statistics, 32 articles (29.9%) also used a chi-square test in the Karl Pearson family. Evaluation and Program Planning had the most articles employing a chi-square test (n = 12), while the American Journal of Evaluation had the fewest (n = 3).

Table 2. Use of Statistical Tests in Journal Articles

Journal | Total Number of Articles | Articles Using Inferential Statistics | Articles Using Chi-Square Test | Proportion of Articles Using Chi-Square Test (%)
American Journal of Evaluation | 65 | 16 | 3 | 18.75
Evaluation Review | 61 | 30 | 11 | 36.67
Educational Evaluation and Policy Analysis | 52 | 35 | 6 | 17.14
Evaluation and Program Planning | 114 | 26 | 12 | 46.15
Total | 292 | 107 | 32 | 29.91

Note. Proportions are relative to the number of articles using inferential statistics.

The 32 articles that used chi-square tests were further reviewed to determine whether the interpretations were justified. Often, researchers were not specific about which chi-square test was being used (only 1 of the 32 articles correctly specified the type of chi-square test conducted). To make the determination, then, coders reviewed the Method section of each article to identify which chi-square test would have been appropriate given the sampling design used. The interpretations of the chi-square tests presented in each article were then coded for the type of interpretation used, that is, whether an association claim was made between variables or whether a comparison of proportions was made between groups. This allowed the researchers to determine the type of chi-square test actually used in each article. Any discrepancy between a study's sampling design and the type of chi-square test used was coded as a nonvalid interpretation of the chi-square test. In addition, each of the 32 chi-square articles was coded on whether a post hoc interpretation was used, meaning that the author made comparisons across select rows and columns of the table. The results from these additional analyses are presented in Table 3. Overall, fewer than half of the chi-square articles (43.75%; n = 14) had interpretations that were justified by the type of chi-square test used.

Table 3. Description of Articles Using Chi-Square Analyses

Journal | Chi-Square Articles (N) | Valid Chi-Square Interpretation (N) | Valid Chi-Square Interpretation (%) | Post Hoc Interpretation (N) | Post Hoc Interpretation (%)
American Journal of Evaluation | 3 | 3 | 100.00 | 1 | 33.33
Evaluation Review | 11 | 4 | 36.36 | 4 | 36.36
Educational Evaluation and Policy Analysis | 6 | 2 | 33.33 | 2 | 33.33
Evaluation and Program Planning | 12 | 5 | 41.67 | 2 | 16.67
Total | 32 | 14 | 43.75 | 9 | 28.13
All three articles in the American Journal of Evaluation included a correct usage of the chi-square test, whereas only a third (two out of six) of the articles in Educational Evaluation and Policy Analysis did so. As shown in Table 3, 9 of the 32 articles that used chi-square tests (28.1%) included a post hoc interpretation. None of the articles used any post hoc analyses to justify their claims.

Hypothetical Example: Support Components for At-Risk Families

We offer a hypothetical example to illustrate the concepts described above and to guide readers through a proper chi-square post hoc analysis. In this scenario, suppose that researchers are investigating the impact of various family support components for families at risk for child abuse and neglect. Study participants were randomly assigned to receive parent education/life skills, connections to community resources, or wraparound services made up of the previous components plus case management. Using the county data system, a sample was drawn from each of these three conditions. The dependent variable of interest consisted of four outcome measures 12 months after the families' initial involvement with Child Protective Services (CPS): (a) a CPS rereferral, (b) a substantiated allegation, (c) the child's removal from home, or (d) no further involvement with CPS.

While randomization is often used to form independent groups, it is not a prerequisite for the appropriate use of the test of homogeneity. What is required is that the groups are identified and sampled intentionally. Table 4 shows the distribution of involvement with CPS across the three conditions.

Table 4. Involvement With CPS and Service Conditions

Outcome | Parent Education N (Col %) | Community Resources N (Col %) | Wraparound N (Col %) | Total N (Col %)
Rereferral to CPS | 38 (20.43) | 42 (22.34) | 49 (13.73) | 129 (17.65)
Substantiated allegation | 24 (12.90) | 18 (9.57) | 35 (9.80) | 77 (10.53)
Child removed | 27 (14.52) | 8 (4.26) | 15 (4.20) | 50 (6.84)
No new involvement with CPS | 97 (52.15) | 120 (63.83) | 258 (72.27) | 475 (64.98)
Total | 186 | 188 | 357 | 731

Note. CPS = Child Protective Services.

The null hypothesis is as follows:

H_0: \begin{bmatrix} p_{11} = p_{12} = p_{13} \\ p_{21} = p_{22} = p_{23} \\ p_{31} = p_{32} = p_{33} \\ p_{41} = p_{42} = p_{43} \end{bmatrix}, \qquad H_A: \text{the null is false.}

The obtained \chi^2(6) = 36.77 is significant at the conventional \alpha level of .05. The justified interpretation following the rejection of the null hypothesis is that the proportions are not equal across the three groups. Often at this point, researchers will conclude that the proportions are not equal and will want to compare specific conditions. For example, they might examine the "no new involvement" row and conclude that the wraparound condition (72.3%) is preferable to the parent education (52.2%) or community resources (63.8%) condition. Alternatively, a researcher may be interested in comparing the proportion of children removed across the conditions. It might be tempting to conclude that parent education (14.5%) is significantly different from community resources (4.26%) and wraparound (4.2%). However, these interpretations would be incorrect because there is no statistical justification for such claims based solely on the results of the omnibus test; the omnibus test indicates only that the conditions differ, not which conditions differ. Because the chi-square test is an omnibus test, post hoc procedures need to be conducted in order to compare individual conditions.
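Before turning to the post hoc procedure, the omnibus result can be checked in a few lines of code. The sketch below is our own addition (it relies on scipy's chi2_contingency rather than anything named in the article) and reproduces the reported \chi^2(6) = 36.77 from the Table 4 counts.

```python
# Sketch (our own addition): omnibus chi-square test on the Table 4 counts,
# with rows = CPS outcomes and columns = service conditions.
import numpy as np
from scipy.stats import chi2_contingency

#                    Parent ed.  Community res.  Wraparound
observed = np.array([[38,  42,  49],    # rereferral to CPS
                     [24,  18,  35],    # substantiated allegation
                     [27,   8,  15],    # child removed
                     [97, 120, 258]])   # no new involvement with CPS

statistic, p_value, df, expected = chi2_contingency(observed, correction=False)
print(f"chi-square({df}) = {statistic:.2f}, p = {p_value:.5f}")
# chi-square(6) = 36.77, so the omnibus null of equal proportions
# across conditions is rejected at alpha = .05.
```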
As previously mentioned, the procedure for comparing conditions or groups was developed by Goodman (1963) (Note 3). Similar to the comparison procedures following an analysis of variance (ANOVA), several different approaches—including Scheffé, Holm (Note 4), and Dunn-Bonferroni—are available for selecting the appropriate critical value. Also similar to ANOVA, the comparison often takes on the name associated with the formulation of the critical value. For the purposes of this article, the Scheffé post hoc values are presented because this represents the most conservative approach. For an alternative approach based on Dunn-Bonferroni, see Marascuilo and Serlin (1988). The Goodman procedure is described below.

The test statistic for each contrast is

Z = \frac{\hat{c}}{\sqrt{SE_c^2}}.

The same equation in expanded form, for a pairwise contrast, is

Z = \frac{W_1 p_1 - W_2 p_2}{\sqrt{W_1^2 \frac{p_1 q_1}{n_1} + W_2^2 \frac{p_2 q_2}{n_2}}},

where \hat{c} represents the linear combination of weights (W_k) and proportions (p_k) of the specific contrast:

\hat{c} = W_1 p_1 + W_2 p_2 + \cdots + W_k p_k, \qquad \text{where } W_1 + W_2 + \cdots + W_k = 0.

The denominator of the test statistic is the square root of the weighted standard error of the contrast:

SE_c^2 = W_1^2 SE_{p_1}^2 + W_2^2 SE_{p_2}^2 + \cdots + W_k^2 SE_{p_k}^2.

The standard error of each column proportion is the standard error of an estimated proportion:

SE_{p_k}^2 = \frac{p_k q_k}{N_k}, \qquad \text{where } q_k = 1 - p_k.

Once the obtained test statistic is found for a comparison of interest, it is compared to a critical value. The Scheffé critical value is found by taking the square root of the critical value used in the original omnibus chi-square analysis. In the example above, the chi-square omnibus critical value at the conventional \alpha level of .05 with (r - 1)(c - 1) = (4 - 1)(3 - 1) = 6 degrees of freedom is 12.59. The square root of this critical value, S = \sqrt{\chi^2_{\nu,\,1-\alpha}} = \sqrt{12.59} = 3.55, serves as the Scheffé critical value for all contrasts.

Referring back to our previous example, comparing wraparound (72.3%) to parent education (52.2%) on "no new involvement" leads to the following hypotheses:

Hypothesis_0: p(no new involvement | wraparound) = p(no new involvement | parent education)
Hypothesis_A: p(no new involvement | wraparound) ≠ p(no new involvement | parent education)

The test statistic is

Z = \frac{\left(\frac{357}{357}\right)(.7227) - \left(\frac{186}{186}\right)(.5215)}{\sqrt{\left(\frac{357}{357}\right)^2 \frac{(.7227)(.2773)}{357} + \left(\frac{186}{186}\right)^2 \frac{(.5215)(.4785)}{186}}} = \frac{.2012}{.0436} = 4.61.

Since this is a pairwise comparison, the weights 357/357 and 186/186 equal 1 and essentially drop out of the equation, both in the numerator and in the denominator. Given 4.61 > 3.55, we reject the null hypothesis and conclude that there is a statistically significant difference between these conditions.
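The contrast just worked out can also be expressed in a few lines of code. The sketch below is our own illustration of Goodman's procedure (the helper names are hypothetical, and the proportions are taken from Table 4), with the Scheffé critical value obtained from the omnibus degrees of freedom.

```python
# Sketch of Goodman's post hoc contrast with a Scheffe critical value.
# Helper names are our own; the proportions come from Table 4.
from math import sqrt
from scipy.stats import chi2


def goodman_z(props, ns, weights):
    """Z statistic for the contrast sum(W_k * p_k); weights must sum to zero."""
    c_hat = sum(w * p for w, p in zip(weights, props))
    se = sqrt(sum(w**2 * p * (1 - p) / n for w, p, n in zip(weights, props, ns)))
    return c_hat / se


def scheffe_critical(df, alpha=0.05):
    """Square root of the omnibus chi-square critical value."""
    return sqrt(chi2.ppf(1 - alpha, df))


# Wraparound vs. parent education on "no new involvement with CPS".
z = goodman_z(props=[258 / 357, 97 / 186], ns=[357, 186], weights=[1, -1])
print(f"Z = {z:.2f}, Scheffe critical value = {scheffe_critical(6):.2f}")
# Z = 4.61 exceeds 3.55, matching the hand calculation above.
```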
Comparisons can be performed within any row. If the researcher wanted to compare wraparound (4.2%) to parent education (14.5%) on whether a child was removed ("child removed"), the test statistic is

Z = \frac{\left(\frac{357}{357}\right)(.042) - \left(\frac{186}{186}\right)(.1452)}{\sqrt{\left(\frac{357}{357}\right)^2 \frac{(.042)(.958)}{357} + \left(\frac{186}{186}\right)^2 \frac{(.1452)(.8548)}{186}}} = \frac{.1031}{.0278} = 3.69.

Given 3.69 > 3.55, we reject the null hypothesis and conclude that there is a statistically significant difference between these conditions. A comparison between community resources (4.26%) and parent education (14.5%) produces a test statistic of 3.45 and is not significant, owing to the differing sample sizes and their impact on the standard error. This is an instance where simply examining the difference between the proportions, without conducting the appropriate post hoc test, might lead to a statistically unsupported conclusion. In both of these comparisons, the difference between parent education and the other condition was about .10; yet in one case there was a significant difference and in the other there was not, based on the critical value. A complete listing of all pairwise comparisons is given in Table 5.

Table 5. Pairwise Contrasts From the Hypothetical Example

Contrast | c | SE | Test Statistic
Rereferral | | |
Wraparound vs. parent education | .0670 | .0347 | 1.931
Wraparound vs. community resources | .0861 | .0354 | 2.432
Parent education vs. community resources | .0191 | .0424 | 0.451
Substantiated abuse | | |
Wraparound vs. parent education | .0310 | .0292 | 1.062
Wraparound vs. community resources | .0023 | .0306 | 0.075
Parent education vs. community resources | .0333 | .0326 | 1.020
Child removed | | |
Wraparound vs. parent education | .1031 | .0279 | 3.693
Wraparound vs. community resources | .0005 | .0182 | 0.030
Parent education vs. community resources | .1026 | .0297 | 3.451
No new case opened | | |
Wraparound vs. parent education | .2012 | .0436 | 4.612
Wraparound vs. community resources | .0844 | .0423 | 1.995
Parent education vs. community resources | .1168 | .0507 | 2.304

As noted previously, comparisons under this model are not limited to pairwise contrasts. The post hoc procedure can also be used to test complex contrasts. Suppose we want to compare wraparound to the combination of parent education and community resources on rereferral to CPS:

Z = \frac{\left(\frac{357}{357}\right)(.1373) - \left[\left(\frac{186}{374}\right)(.2043) + \left(\frac{188}{374}\right)(.2234)\right]}{\sqrt{\left(\frac{357}{357}\right)^2 \frac{(.1373)(.8627)}{357} + \left(\frac{186}{374}\right)^2 \frac{(.2043)(.7957)}{186} + \left(\frac{188}{374}\right)^2 \frac{(.2234)(.7766)}{188}}} = \frac{.0766}{.0273} = 2.81.

Unlike with the previous pairwise contrasts, the combination of parent education and community resources needs to be weighted for the respective contributions of the two conditions. Once this is done, the test statistic is calculated as before. Given 2.81 < 3.55, we do not reject the null hypothesis and conclude that there is not a statistically significant difference between the wraparound condition and the combination of parent education and community resources.
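For completeness, here is the complex contrast as a self-contained sketch. It is again our own illustration, with proportions taken from Table 4 and weights that sum to zero; small rounding differences from the hand calculation above are to be expected.

```python
# Sketch (our own illustration): complex contrast of wraparound against the
# weighted combination of parent education and community resources, on the
# "rereferral to CPS" row of Table 4.
from math import sqrt

props = [49 / 357, 38 / 186, 42 / 188]      # rereferral proportions
ns = [357, 186, 188]                        # condition sample sizes
weights = [1, -186 / 374, -188 / 374]       # weights sum to zero

c_hat = sum(w * p for w, p in zip(weights, props))
se = sqrt(sum(w**2 * p * (1 - p) / n for w, p, n in zip(weights, props, ns)))
print(f"Z = {c_hat / se:.2f}")
# The magnitude of Z falls below the Scheffe cutoff of 3.55, so the contrast
# is not significant, consistent with the conclusion in the text.
```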
Discussion

Common misconceptions about the chi-square test have been clarified in this article. Specifically, we have distinguished between the members of the Karl Pearson family of chi-square tests and presented post hoc procedures. Evaluators often need to examine the association between categorical variables or to compare groups or conditions on a categorical outcome, which explains the prevalence of these tests in the evaluation literature and in evaluation reports. However, effective use of the chi-square test, or of any other statistical test for that matter, depends on a clear understanding of the assumptions of the test and of what is actually being tested (the null hypothesis) in the statistical procedure.

A correct interpretation of the chi-square test, as of other statistical procedures, often depends on factors beyond the distributional assumptions and the characteristics of the data themselves—for example, individual observations must be independent of the other observations in the contingency table. When this is the case, the correct interpretation of the chi-square test rests on the sampling procedures and on how the data were collected. Furthermore, because the asymptotic approximation of the chi-square test is less precise at the extreme end of the distribution, the expected value of each cell needs to be greater than five.

The review of the evaluation literature reveals that in about half of the instances where a chi-square test was used, the wrong interpretation was presented. The appropriate interpretation of the results is directly tied to the null hypothesis under test, and the interpretation—whether of independence or of homogeneity—is limited to that hypothesis. More commonly, researchers prefer to interpret the chi-square test as a test of homogeneity, comparing groups on a variable of interest; when the sampling procedure precludes this claim, the researcher has misinterpreted the results of the chi-square test.

Researchers also tend to overinterpret the results of statistical tests. An omnibus chi-square test informs us that the distribution of observed values deviates from the expected values, but it does not tell us where the discrepancy is located in the contingency table. Often, researchers will make naïve comparisons between two or more groups without conducting any post hoc tests to determine whether the contrasts are significant. Many more complex statistical models exist, and we have faith that those procedures are being faithfully and thoughtfully applied. Although chi-square tests were found to be commonly misinterpreted in the recent evaluation literature, the results of these studies are not necessarily wrong. Rather, the problem is simply that there is often no statistical justification for some of the claims being made. Goodman's procedure is computationally simple, and there is little reason it cannot be conducted to justify significant contrasts. Our hope is that researchers and evaluators will be more thoughtful in using common statistical procedures and will consider more carefully what their results actually say.

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) received no financial support for the research, authorship, and/or publication of this article.

Notes

1. The two-sample test of proportions, which uses the Z distribution, is a special case of the test of homogeneity, employed when there are only two groups.
2. Comparisons in this context are not limited to pairwise contrasts. It is perfectly feasible that Groups 2 and 3 combined differ from Group 1 and are responsible for the significant result.
3. The approach presented here builds logically on the post hoc procedures following multiple group comparisons in analysis of variance (ANOVA) models. Goodman's approach is not the only one available for addressing pairwise comparisons, however. See Seaman and Hill (1996), Gardner (2000), and Delucchi (1993).
4. For information on the use of the Holm procedure, see Holm (1979).

References

Delucchi, K. L. (1993). On the use and misuse of chi-square. In G. Keren & C. Lewis (Eds.), A handbook for data analysis in the behavioral sciences (pp. 295–319). Hillsdale, NJ: Lawrence Erlbaum.

Gardner, R. C. (2000). Psychological statistics using SPSS for Windows. Upper Saddle River, NJ: Prentice Hall.

Goodman, L. (1963). Simultaneous confidence intervals for contrasts among multinomial populations. The Annals of Mathematical Statistics, 35, 716–725.

Greene, J. C., Caracelli, V. J., & Graham, W. F. (1989). Toward a conceptual framework for mixed-method evaluation designs. Educational Evaluation and Policy Analysis, 11, 255–274.

Holm, S. (1979). A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics, 6, 65–70.

Marascuilo, L., & Serlin, R. (1988). Statistical methods for the social and behavioral sciences. New York, NY: W. H. Freeman.

Miller, R. L., & Campbell, R. (2006). Taking stock of empowerment evaluation: An empirical review. American Journal of Evaluation, 27, 296–319. doi:10.1177/109821400602700303

Seaman, M. H., & Hill, C. C. (1996). Pairwise comparisons for proportions: A note on Cox and Key. Educational and Psychological Measurement, 56, 452–459.

Stigler, S. (1999). Statistics on the table: The history of statistical concepts and methods. Cambridge, MA: Harvard University Press.

Wickens, T. D. (1989). Multiway contingency tables analysis for the social sciences. Hillsdale, NJ: Lawrence Erlbaum.