
Choosing Between Parametric and Nonparametric Tests



Journal of Undergraduate Research at Minnesota State University, Mankato, Volume 9 (2009), Article 6
Russ Johnson, Minnesota State University, Mankato
Available at: https://cornerstone.lib.mnsu.edu/jur/vol9/iss1/6

Abstract

A common question in comparing two sets of measurements is whether to use a parametric testing procedure or a non-parametric procedure. The question is even more important when dealing with smaller samples. Here, using simulation, several parametric and nonparametric tests, namely the t-test, the normal test, the Wilcoxon rank-sum test, the van der Waerden score test, and the exponential score test, are compared.

Introduction

Consider two independent random samples x_1, x_2, ..., x_m and y_1, y_2, ..., y_n taken from two populations. To compare the two samples, a common practice is to compare their means, in other words to test the statistical hypothesis

H_0: \mu_1 = \mu_2 \quad \text{vs} \quad H_1: \mu_1 \neq \mu_2,

where H_0 denotes the null hypothesis, H_1 the alternative hypothesis, \mu_1 the first population mean, and \mu_2 the second population mean. Statistical tests of hypotheses rest on the principle that if the samples provide significant evidence against the null hypothesis H_0, then H_0 is rejected in favor of the alternative hypothesis H_1. The question is then how significant is significant: when do we say there is enough evidence? The answer is based on the idea of Type I error, the probability of rejecting H_0 when it is in fact true. The power of a test is the rate at which it rejects H_0 when H_0 should be rejected, in other words, how well the test detects that H_1 is true.

p-value

The observed level of significance (or Type I error) of a test is known as the p-value of the test. This is the probability of rejecting H_0 when it is in fact true. In this study we use a 5% level of significance; this is just one of many commonly used levels.

Parametric Tests

1. According to Reinard (2006), when the two population distributions are normal and the population variances \sigma_1^2 and \sigma_2^2 are unknown and unequal, the test statistic is

T = \frac{\bar{x} - \bar{y} - (\mu_1 - \mu_2)}{\sqrt{s_1^2/m + s_2^2/n}},

where \bar{x} = \frac{1}{m}\sum_{i=1}^{m} x_i, \; \bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i, \; s_1^2 = \frac{\sum_{i=1}^{m}(x_i - \bar{x})^2}{m-1}, \; s_2^2 = \frac{\sum_{i=1}^{n}(y_i - \bar{y})^2}{n-1},

and T has an approximate t-distribution with degrees of freedom

df = \frac{(A + B)^2}{A^2/(m-1) + B^2/(n-1)}, \quad \text{where } A = s_1^2/m \text{ and } B = s_2^2/n.
2. According to Hogg and Tanis (2008), when the two population distributions are normal and the population variances \sigma_1^2 and \sigma_2^2 are unknown but equal, the test statistic is

T = \frac{\bar{X} - \bar{Y} - (\mu_1 - \mu_2)}{s_p \sqrt{1/m + 1/n}}, \quad \text{where } s_p = \sqrt{\frac{(m-1)s_1^2 + (n-1)s_2^2}{m+n-2}},

and T has a t-distribution with m + n - 2 degrees of freedom.

3. According to Hogg and Tanis (2008), when the two population distributions are not assumed to be normal, the population variances \sigma_1^2 and \sigma_2^2 are unknown, and the sample sizes m and n are large, the test statistic is

Z = \frac{\bar{X} - \bar{Y} - (\mu_1 - \mu_2)}{\sqrt{s_1^2/m + s_2^2/n}},

where Z is approximately a standard normal variate.

Note that \mu_1 - \mu_2 = 0 in all three cases above, as per the null hypothesis. In general, however, it is not necessarily zero: if we want to test that one mean is at least a certain amount higher than the other, then \mu_1 - \mu_2 is that amount, and so on. The cases with known variances are not considered, as they are not common in practice.

In this paper we consider the first test and the third test, denoted TD and ZD, respectively. We also consider TP, in which the test statistics are computed as in TD and ZD but the p-value is obtained by considering all permutations of the data. For larger samples, TP uses random permutations instead of all possible permutations. The corresponding p-values are denoted PTD, PZD, and PTP for the t-test, the normal test, and the permutation test, respectively.

Non-parametric Tests

Wilcoxon Rank-Sum Test

In Higgins (2004) the Wilcoxon rank-sum test is computed as follows. Let m be the sample size of one group or treatment and n the sample size of the other. Combine the m + n observations into one group and rank the observations from smallest to largest: rank 1 goes to the smallest observation, rank 2 to the next smallest, and so on. It is common to have ties among observations in a data set; that is, two or more observations may have the same value. In this case the assignment of ranks is ambiguous, and to resolve the ambiguity the average rank is assigned to the tied observations. Find the observed rank sum W of treatment 1 (we may analyze either treatment 1 or treatment 2, since the hypotheses about \mu_1 and \mu_2 are symmetric). The p-value of the test is then computed either from the distribution of all possible permutations of the ranks or, for larger samples, from a normal approximation. For the two-sided test considered here, WR = max(R - W, W), where R is the sum of the ranks of the combined sample.

Permutation Distribution

In Higgins (2004) the permutation distribution test is performed as follows. Find all possible permutations of the ranks in which m ranks are assigned to treatment 1 and n ranks to treatment 2. For each permutation, find the sum of the ranks for treatment 1 (or treatment 2). Determine the two-sided p-value as

PWR = \frac{\#\{\text{permutations with } \max(R - U,\, U) \ge WR\}}{\binom{m+n}{m}},

where U is the sum of the ranks for treatment 1 (or treatment 2) under a permutation. When the sample sizes are so large that all permutations cannot be evaluated within a reasonable time, a reasonable number of random permutations (10,000 or 100,000, depending on time and computing facilities) can be used instead.
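To make the counting rule concrete, the following is a minimal sketch, not the code used in this study, of the two-sided rank-sum permutation p-value PWR. It assumes NumPy and SciPy are available; the function name, the enumeration cutoff, and the default number of random permutations are illustrative choices only.

# Sketch of the two-sided Wilcoxon rank-sum permutation test described above.
from itertools import combinations
from math import comb

import numpy as np
from scipy.stats import rankdata


def rank_sum_perm_pvalue(x, y, n_random=10000, seed=0):
    """Permutation p-value PWR for the two-sided rank-sum statistic."""
    data = np.concatenate([x, y])
    ranks = rankdata(data)            # tied observations receive the average rank
    m, n = len(x), len(y)
    R = ranks.sum()                   # sum of all ranks of the combined sample
    W = ranks[:m].sum()               # observed rank sum for treatment 1
    WR = max(R - W, W)                # two-sided statistic max(R - W, W)

    if comb(m + n, m) <= 100_000:     # enumerate every assignment of m ranks
        rank_sums = (ranks[list(idx)].sum()
                     for idx in combinations(range(m + n), m))
    else:                             # fall back to random permutations
        rng = np.random.default_rng(seed)
        rank_sums = (rng.permutation(ranks)[:m].sum()
                     for _ in range(n_random))

    count = total = 0
    for U in rank_sums:
        total += 1
        count += max(R - U, U) >= WR
    return count / total

The same counting scheme, applied to the original observations and the statistics of the parametric section instead of to the ranks, yields the permutation p-value PTP described earlier.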
Large Sample Approximation

According to Higgins (2004), for samples of size 10 or greater,

Z = \frac{W - E(W)}{\sqrt{V(W)}}

follows an approximate standard normal distribution and can therefore be used to obtain an approximate p-value. Here E(W) = m\mu and V(W) = \frac{mn\sigma^2}{m+n-1}, where \mu is the mean of all ranks of the combined sample and \sigma^2 is the population variance of all ranks of the combined sample, whether or not there are ties. Without ties,

\mu = \frac{m+n+1}{2} \quad \text{and} \quad \sigma^2 = \frac{(m+n-1)(m+n+1)}{12}.

Let the large-sample approximate p-value for the Wilcoxon rank-sum test be denoted PWZ.

van der Waerden Score Test

This test proceeds exactly as the Wilcoxon rank-sum test, with the ranks replaced by the van der Waerden scores. In Higgins (2004) the van der Waerden scores are defined by

V(i) = \Phi^{-1}\!\left(\frac{i}{m+n+1}\right),

where \Phi^{-1} denotes the inverse of the cdf of the standard normal distribution. The test statistic is the sum of the van der Waerden scores for treatment 1 (or treatment 2), and the p-value is computed by the methods described for the Wilcoxon rank-sum test, with the van der Waerden scores in place of the ranks. Let the permutation p-value for the van der Waerden score test be denoted PVS and the large-sample approximate p-value be denoted PVZ.

Exponential Score Test

This test also proceeds exactly as the Wilcoxon rank-sum test, with the ranks replaced by the exponential scores. In Higgins (2004) the exponential scores are defined by

\frac{1}{m+n}, \quad \frac{1}{m+n} + \frac{1}{m+n-1}, \quad \frac{1}{m+n} + \frac{1}{m+n-1} + \frac{1}{m+n-2}, \quad \ldots

The test statistic is the sum of the exponential scores for treatment 1 (or treatment 2), and the p-value is computed by the methods described for the Wilcoxon rank-sum test, with the exponential scores in place of the ranks. Let the permutation p-value for the exponential score test be denoted PES and the large-sample approximate p-value be denoted PEZ.
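The score vectors and the large-sample approximation above can be sketched as follows; this is an illustrative reading of the formulas rather than the author's code, and the function names are assumptions. With the ranks 1, ..., m+n used as scores the routine reproduces PWZ; with the score vectors below it gives PVZ and PEZ.

# Sketch of the van der Waerden and exponential scores and the
# large-sample normal approximation described above.
import numpy as np
from scipy.stats import norm, rankdata


def van_der_waerden_scores(N):
    # V(i) = Phi^{-1}( i / (N + 1) ) for i = 1, ..., N, with N = m + n
    i = np.arange(1, N + 1)
    return norm.ppf(i / (N + 1))


def exponential_scores(N):
    # i-th score = 1/N + 1/(N - 1) + ... + 1/(N - i + 1)
    return np.cumsum(1.0 / np.arange(N, 0, -1))


def normal_approx_pvalue(x, y, scores):
    """Two-sided large-sample p-value for the score sum of treatment 1."""
    data = np.concatenate([x, y])
    m, n = len(x), len(y)
    N = m + n
    idx = rankdata(data).astype(int) - 1   # assumes no ties, so ranks are 1..N
    s = scores[idx]                        # score attached to each observation
    W = s[:m].sum()                        # score sum for treatment 1
    mu, sigma2 = s.mean(), s.var()         # mean and population variance of all N scores
    EW, VW = m * mu, m * n * sigma2 / (N - 1)
    z = (W - EW) / np.sqrt(VW)
    return 2 * norm.sf(abs(z))             # two-sided p-value via |z|

Feeding the same score vectors, in place of the ranks, into the permutation routine sketched earlier gives the permutation p-values PVS and PES.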
There are certain parameters under which parametric methods have been suggested to be superior to nonparametric methods, and likewise there are instances where nonparametric methods are suggested over their parametric counterparts. According to Warner (2007), nonparametric methods should be used when the sample size is small, whereas parametric methods should be used when the sample size is large. Nonparametric methods are also said to be preferable when there is an outlier in the data. According to Hogg and Tanis (2008), when the population distribution is normal and the sample size n is as small as 4 or 5, the normal test should be a very adequate approximation. I also tested some parameters not considered or addressed by statisticians to see whether they suggest one method or the other. One of the parameters tested is whether different distributions have any effect on the performance of the two methods; the following three graphs (Figure A) illustrate the different distributions used. Different variances are also used, to see whether any effects become apparent, and the distance between the means is also changed, to see whether the methods pick up equally on the more severe difference.

Figure A: Distribution examples

Simulation Study

To investigate how the tests are related to estimates of the Type I error, 1000 samples of sizes 5, 8, 11, and 15 are selected from two independent normal populations. All nine p-values mentioned above (PTD, PZD, PTP, PWR, PWZ, PVS, PVZ, PES, and PEZ) are computed, and the number of p-values less than or equal to 0.05 is recorded. The choices are: (i) Population 1: normal with mean 1 and variance 1; Population 2: normal with mean 1 and variance 1; (ii) Population 1: normal with mean 1 and variance 1; Population 2: normal with mean 1 and variance 1, with an outlier. The proportions of rejections are displayed in Table 1. The values in Table 1 represent the rate at which the tests declared the means different when in fact they were the same. Each test was performed on these two distribution comparisons for the sample sizes 5, 8, 11, and 15.

Table 1: Estimates of the Level of Significance

N(1,1) vs N(1,1)
n     PTD    PZD    PTP    PWR    PWZ    PVS    PVZ    PES    PEZ
5     0.053  0.089  0.056  0.037  0.066  0.040  0.066  0.060  0.037
8     0.054  0.072  0.057  0.054  0.054  0.054  0.054  0.060  0.049
11    0.042  0.065  0.043  0.046  0.046  0.046  0.046  0.054  0.047
15    0.041  0.051  0.041  0.035  0.036  0.035  0.036  0.050  0.041

N(1,1) vs N(1,1) with outlier
n     PTD    PZD    PTP    PWR    PWZ    PVS    PVZ    PES    PEZ
5     0.013  0.036  0.051  0.030  0.063  0.063  0.057  0.030  0.031
8     0.004  0.018  0.032  0.030  0.030  0.030  0.029  0.020  0.030
11    0.014  0.019  0.036  0.039  0.039  0.039  0.039  0.034  0.039
15    0.019  0.029  0.043  0.041  0.041  0.041  0.048  0.041  0.041

To investigate the powers of the tests, samples are generated from populations having different means. The choices are: (i) Population 1: normal with mean 1 and variance 1; Population 2: normal with mean 3 and variance 1; (ii) Population 1: normal with mean 1 and variance 1; Population 2: normal with mean 5 and variance 2; (iii) Population 1: normal with mean 1 and variance 1; Population 2: normal with mean 2 and variance 1; (iv) Population 1: exponential with mean 1/3; Population 2: normal with mean 1 and variance 1; (v) Population 1: exponential with mean 1/3; Population 2: exponential with mean 1; (vi) Population 1: skewed bimodal with mean 3/8 and variance 7/9; Population 2: normal with mean 0 and variance 1; (vii) Population 1: skewed bimodal with mean 3/8 and variance 7/9; Population 2: normal with mean 3 and variance 1. For each of these choices the proportion of rejections is computed and displayed in Table 2. The values in Table 2 represent the rate at which the tests declared the means different when in fact they were different. Each test was performed on these seven distribution comparisons for the sample sizes 5, 8, 11, and 15.
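As an aside, rejection proportions of this kind can be estimated with a short simulation loop. The sketch below is illustrative rather than the code actually used for Tables 1 and 2; the generator functions are assumptions, and SciPy's Welch t-test merely stands in for PTD, with any of the nine p-values discussed above substitutable inside the loop.

# Sketch of a Monte Carlo estimate of a rejection rate
# (level of significance when the means are equal, power otherwise).
import numpy as np
from scipy.stats import ttest_ind


def rejection_rate(gen1, gen2, n, reps=1000, alpha=0.05, seed=0):
    rng = np.random.default_rng(seed)
    rejections = 0
    for _ in range(reps):
        x, y = gen1(rng, n), gen2(rng, n)
        p = ttest_ind(x, y, equal_var=False).pvalue   # Welch t-test, standing in for PTD
        rejections += p <= alpha
    return rejections / reps


# Example: level of significance for N(1,1) vs N(1,1), power for N(1,1) vs N(3,1)
norm_mean1 = lambda rng, n: rng.normal(1.0, 1.0, n)
norm_mean3 = lambda rng, n: rng.normal(3.0, 1.0, n)
for n in (5, 8, 11, 15):
    print(n, rejection_rate(norm_mean1, norm_mean1, n),
          rejection_rate(norm_mean1, norm_mean3, n))

The skewed bimodal population of scenarios (vi) and (vii) would be drawn as a mixture, choosing N(0,1) with probability 3/4 and N(3/2, (1/3)^2) with probability 1/4, as indicated in the figure captions and in Table 2.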
Table 2: Powers of the Tests

N(1,1) vs N(3,1)
n     PTD    PZD    PTP    PWR    PWZ    PVS    PVZ    PES    PEZ
5     0.762  0.867  0.762  0.681  0.767  0.681  0.767  0.735  0.681
8     0.967  0.990  0.970  0.961  0.961  0.961  0.961  0.940  0.921
11    0.994  0.995  0.994  0.989  0.989  0.989  0.989  0.981  0.980
15    1.000  1.000  0.999  0.999  0.999  0.999  0.999  0.994  0.994

N(1,1) vs N(5,2)
n     PTD    PZD    PTP    PWR    PWZ    PVS    PVZ    PES    PEZ
5     0.904  0.972  0.928  0.849  0.901  0.849  0.901  0.897  0.849
8     0.997  0.999  0.999  0.995  0.995  0.995  0.995  0.998  0.996
11    0.999  0.999  0.999  0.999  0.999  0.999  0.999  1.000  1.000
15    1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000

N(1,1) vs N(2,1)
n     PTD    PZD    PTP    PWR    PWZ    PVS    PVZ    PES    PEZ
5     0.270  0.393  0.270  0.209  0.291  0.209  0.291  0.264  0.209
8     0.454  0.519  0.464  0.441  0.441  0.441  0.441  0.411  0.373
11    0.589  0.638  0.597  0.566  0.566  0.573  0.566  0.569  0.544
15    0.743  0.765  0.743  0.729  0.731  0.730  0.731  0.680  0.664

Exp(1/3) vs N(1,1)
n     PTD    PZD    PTP    PWR    PWZ    PVS    PVZ    PES    PEZ
5     0.208  0.345  0.279  0.201  0.245  0.207  0.245  0.228  0.201
8     0.307  0.459  0.450  0.383  0.383  0.383  0.383  0.427  0.392
11    0.518  0.612  0.610  0.530  0.530  0.533  0.530  0.587  0.564
15    0.719  0.777  0.775  0.670  0.681  0.677  0.681  0.763  0.748

Exp(1/3) vs Exp(1)
n     PTD    PZD    PTP    PWR    PWZ    PVS    PVZ    PES    PEZ
5     0.133  0.287  0.264  0.186  0.263  0.188  0.263  0.249  0.186
8     0.316  0.442  0.454  0.399  0.399  0.399  0.399  0.450  0.397
11    0.517  0.624  0.624  0.555  0.555  0.560  0.555  0.615  0.595
15    0.722  0.777  0.782  0.679  0.682  0.683  0.682  0.768  0.747

(3/4)N(0,1) + (1/4)N(3/2, (1/3)^2) vs N(0,1)
n     PTD    PZD    PTP    PWR    PWZ    PVS    PVZ    PES    PEZ
5     0.096  0.153  0.097  0.069  0.103  0.069  0.103  0.095  0.069
8     0.100  0.145  0.104  0.108  0.108  0.108  0.108  0.127  0.109
11    0.127  0.166  0.125  0.128  0.128  0.130  0.128  0.156  0.137
15    0.163  0.179  0.162  0.153  0.155  0.154  0.155  0.196  0.182

(3/4)N(0,1) + (1/4)N(3/2, (1/3)^2) vs N(3,1)
n     PTD    PZD    PTP    PWR    PWZ    PVS    PVZ    PES    PEZ
5     0.930  0.978  0.932  0.878  0.933  0.884  0.933  0.929  0.878
8     0.998  0.999  0.999  0.998  0.998  0.998  0.998  0.996  0.996
11    1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000  0.999
15    1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000

We now analyze the various scenarios and compare the effectiveness of the parametric and non-parametric tests. We compare populations sharing the same distribution and populations with different distributions, with differing variances and means, and samples containing extreme outliers. We observe how well the tests detect that \mu_1 \neq \mu_2 when that is in fact the case.

We begin with two normal populations, each with mean 1 and variance 1. Since the means are equal, we are estimating the level of significance of the tests. We can see from Table 1 that PZD, the normal test, had slightly higher observed levels of significance for all four sample sizes. This difference, however, was not significant. The decision to reject or not reject H_0 depends entirely on the desired level of significance, and no test stood out so drastically that a majority of commonly used significance levels would lead different tests to different conclusions. All of the tests produced an observed level of significance of roughly 5%, except PZD when n = 5, and even that was off by less than 4%. Additionally, the large-sample approximation of the exponential score test, PEZ, produced a low observed level of significance when n = 5. The data discussed are plotted in Figure 1.
Figure 1: Type 1 Error; N(1,1) vs N(1,1)

Now we observe the results for similarly constructed populations with the addition of an outlier. Again, since the means are equal, we are estimating the levels of significance. It is apparent from the data displayed in Figure 2 that the values were on average lower than in Figure 1; that is, the tests rejected the true H_0 less often. When the sample size was n = 5, PWZ, PVS, and PES all produced values greater than 5%, while the rest produced lower values. For the larger sample sizes, however, all the tests performed similarly, producing values lower than 5%. While the observed levels of significance are somewhat greater for the nonparametric methods, they would still generally lead to the same conclusion about rejecting H_0: \mu_1 = \mu_2. The data discussed are plotted in Figure 2.

Figure 2: Type 1 Error; N(1,1) vs N(1,1) with outlier

In both cases the methods did not differ greatly in their observed levels of significance. While in certain circumstances some tests produced values greater than 5%, the tests that produced values less than 5% were not far below that level. On average, the difference between the parametric and nonparametric methods was rather small. Since there was not a great deal of difference in the performance of the tests across the different distributions and sample sizes, no single test, parametric or nonparametric, clearly performed better than the rest. We shall soon see that, when we turn to the power of the tests, the similarities become even more apparent.

We now consider how effective the tests were in detecting when H_0: \mu_1 = \mu_2 is not true. The first simulation compares two normal populations, each having variance 1, with means 1 and 3, respectively. When the sample size was n = 5, PZD had slightly greater power than the rest, while the other tests performed very similarly. As the sample size increased there was very little difference in performance among any of the tests, so the tests performed essentially equally well. The data discussed are plotted in the following graph.

Figure 3: Type 2 Error; simulation 1, N(1,1) vs N(3,1)

In the second simulation we analyze two normal populations, population 1 with mean 1 and variance 1 and population 2 with mean 5 and variance 2. Each of the tests picked up on this larger difference in means rather effectively, and this becomes even more apparent as the sample size increases. It is especially true when n = 15, where all of the tests had identical values.

Figure 4: Type 2 Error; simulation 2, N(1,1) vs N(5,2)

For the third simulation we analyze two normal populations, each having variance 1, with means 1 and 2, respectively. The normal test, PZD, produced a slightly higher value for the two smaller sample sizes.
The other tests performed similarly to one another for each of the sample sizes, and for the larger sample sizes PZD performed close to the other eight tests.

Figure 5: Type 2 Error; simulation 3, N(1,1) vs N(2,1)

In the fourth simulation we change the distribution of one of the samples to exponential with mean 1/3; the second population is normal with mean 1 and variance 1. As in the previous scenarios, the tests gave approximately the same result for all the sample sizes, with the differences between the tests decreasing as the sample size increased.

Figure 6: Type 2 Error; simulation 4, Exp(1/3) vs N(1,1)

The fifth simulation compares two exponential distributions with means 1/3 and 1, respectively. In a slight change of pace, none of the tests stood out either above or below the others for any of the sample sizes in detecting that H_0: \mu_1 = \mu_2 is false. When the sample size was n = 5, the tests all had values close to 20%-25%, and the tests had almost identical values for the three larger sample sizes.

Figure 7: Type 2 Error; simulation 5, Exp(1/3) vs Exp(1)

In the sixth and seventh simulations we compared a skewed bimodal distribution with normal distributions. In both trials the skewed bimodal distribution had mean 3/8 and variance 7/9, while the normal distributions had means 0 and 3, respectively, each with variance 1. In the sixth simulation the tests all performed similarly for all four sample sizes, producing values approximately 8% apart or less and staying below 20% in all cases. In the seventh trial, however, the tests all had values of 85% or higher, while still maintaining a maximum difference of 10%.

Figure 8: Type 2 Error; simulation 6, (3/4)N(0,1) + (1/4)N(3/2, 1/9) vs N(0,1)

Figure 9: Type 2 Error; simulation 7, (3/4)N(0,1) + (1/4)N(3/2, 1/9) vs N(3,1)

While there were instances where one of the tests had a slightly higher or lower value for a certain set of parameters, where a difference existed it was not large enough to be considered significant. In estimating both the power and the level of significance, none of the tests truly "outperformed" the others for any particular set of parameters. When estimating the observed level of significance, the nonparametric tests did prove to be consistently more effective than the parametric tests; however, this difference in performance was not enough to influence the decision of whether or not to reject H_0: \mu_1 = \mu_2. Consequently, comparing the set of parametric tests against the set of nonparametric tests, we did not observe that one set or the other had significantly higher power or more accurately attained the level of significance.
Contrary to the accepted criteria for deciding which to use, our research did not find a specific set of parameters for which parametric tests are the proper choice over nonparametric tests. A small sample size had only a small effect on the performance of the tests, and when the sample size increased the tests performed almost equivalently; this is the opposite of the accepted notion about the relative performance of parametric and nonparametric methods. Changing the variance also seemed to have no effect. When the difference between the means was greater, both sets of tests, parametric and nonparametric, picked up on this difference similarly. Even when comparing different distribution types, the tests performed relatively similarly to each other. Since there was no clear scenario in which parametric methods outperformed nonparametric methods or vice versa, the research was inconclusive. None of the tested parameters had an effect significant enough to cause a noticeable change in the outcome. Thus, the choice between parametric and nonparametric tests seems to be left to the preference of the person analyzing the data.

Bibliography

Higgins, James J. Introduction to Modern Nonparametric Statistics. Pacific Grove, CA: Brooks/Cole-Thomson, 2004.

Hogg, Robert V., and Tanis, Elliot A. A Brief Course in Mathematical Statistics. Upper Saddle River, NJ: Pearson Prentice Hall, 2008.

Reinard, John C. Communication Research Statistics. London, UK: Sage Publications, 2006.

Warner, Rebecca M. Applied Statistics: From Bivariate Through Multivariate Techniques. London, UK: Sage Publications, 2007.