Journal of Undergraduate Research at
Minnesota State University, Mankato
Volume 9
Article 6
2009
Choosing between Parametric and Non-parametric Tests
Russ Johnson
Minnesota State University, Mankato
Follow this and additional works at: https://cornerstone.lib.mnsu.edu/jur
Part of the Mathematics Commons, and the Probability Commons
Recommended Citation
Johnson, Russ (2009) "Choosing between Parametric and Non-parametric Tests," Journal of
Undergraduate Research at Minnesota State University, Mankato: Vol. 9 , Article 6.
Available at: https://cornerstone.lib.mnsu.edu/jur/vol9/iss1/6
This Article is brought to you for free and open access by the Undergraduate Research Center at Cornerstone: A
Collection of Scholarly and Creative Works for Minnesota State University, Mankato. It has been accepted for
inclusion in Journal of Undergraduate Research at Minnesota State University, Mankato by an authorized editor of
Cornerstone: A Collection of Scholarly and Creative Works for Minnesota State University, Mankato.
Johnson: Choosing between Parametric and Non-parametric Tests
Choosing Between Parametric and Non-parametric Tests
Abstract: A common question in comparing two sets of measurements is whether to use a
parametric testing procedure or a non-parametric procedure. The question is even more
important when dealing with smaller samples. Here, using simulation, several parametric and
non-parametric tests are compared: the t-test, the Normal test, the Wilcoxon Rank Sum test,
the van der Waerden Score test, and the Exponential Score test.
Introduction
Let us consider two independent random samples $x_1, x_2, \ldots, x_m$ and $y_1, y_2, \ldots, y_n$
taken from two populations. To compare the two samples, a common practice is to compare their
means, in other words to test the statistical hypothesis
$$H_0: \mu_1 = \mu_2 \quad \text{vs.} \quad H_1: \mu_1 \neq \mu_2,$$
where $H_0$ indicates the null hypothesis, $H_1$ indicates the alternative hypothesis, $\mu_1$
indicates the first population mean, and $\mu_2$ indicates the second population mean.
Statistical tests of hypotheses rest on the principle that if the samples provide
significant evidence against the null hypothesis ($H_0$), then $H_0$ is rejected in favor of the
alternative hypothesis ($H_1$). How significant is significant, and when is there enough
evidence? The answer is based on the idea of Type I error, the probability of rejecting $H_0$
when in fact it is true. The power of the test is the rate at which $H_0$ is rejected when it
should be rejected; in other words, how well the test detects that $H_1$ holds rather than $H_0$.
p-value
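The observed level of significance of a test is known as the p-value of the test. It is the
probability, computed assuming $H_0$ is true, of obtaining a test statistic at least as extreme
as the one observed. $H_0$ is rejected when the p-value is at most the chosen level of
significance, which is the Type I error rate we are willing to accept. In our study we use a 5%
level of significance. This, however, is just one of several commonly used levels.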
Parametric Tests
Published by Cornerstone: A Collection of Scholarly and Creative Works for Minnesota State University, Mankato, 2009
Journal of Undergraduate Research at Minnesota State University, Mankato, Vol. 9 [2009], Art. 6
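1. According to Reinard (2006), when the two population distributions are normal and the
population variances $\sigma_1^2$ and $\sigma_2^2$ are unknown and unequal, the test statistic is
$$T = \frac{\bar{x} - \bar{y} - (\mu_1 - \mu_2)}{\sqrt{\dfrac{s_1^2}{m} + \dfrac{s_2^2}{n}}},$$
where
$$\bar{x} = \frac{1}{m}\sum_{i=1}^{m} x_i, \quad \bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i, \quad
s_1^2 = \frac{\sum_{i=1}^{m} (x_i - \bar{x})^2}{m - 1}, \quad
s_2^2 = \frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n - 1},$$
and $T$ has a t-distribution with degrees of freedom
$$df = \frac{(A + B)^2}{\dfrac{A^2}{m - 1} + \dfrac{B^2}{n - 1}}, \quad
\text{where } A = \frac{s_1^2}{m} \text{ and } B = \frac{s_2^2}{n}.$$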
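The statistic and degrees of freedom in case 1 can be sketched in a few lines of Python. This is only an illustration using the standard library; the function name `welch_t` and the parameter `delta0` (the hypothesized value of $\mu_1 - \mu_2$) are assumptions introduced here, not names from the paper.

```python
import math
from statistics import mean, variance

def welch_t(x, y, delta0=0.0):
    """Return (T, df) for the unequal-variance two-sample t statistic."""
    m, n = len(x), len(y)
    a = variance(x) / m  # A = s1^2 / m (sample variance uses divisor m - 1)
    b = variance(y) / n  # B = s2^2 / n
    t = (mean(x) - mean(y) - delta0) / math.sqrt(a + b)
    df = (a + b) ** 2 / (a ** 2 / (m - 1) + b ** 2 / (n - 1))
    return t, df

# Hypothetical data, chosen only to exercise the function.
t_stat, df = welch_t([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
```

The p-value would then come from a t-distribution with `df` degrees of freedom, which the standard library does not provide; a statistics package would be needed for that step.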
2. According to Tanis and Hogg (2008), when the two population distributions are normal and the
population variances $\sigma_1^2$ and $\sigma_2^2$ are unknown but equal, the test statistic is
$$T = \frac{\bar{x} - \bar{y} - (\mu_1 - \mu_2)}{s_p \sqrt{\dfrac{1}{m} + \dfrac{1}{n}}}, \quad
\text{where } s_p = \sqrt{\frac{(m - 1) s_1^2 + (n - 1) s_2^2}{m + n - 2}},$$
and $T$ has a t-distribution with $m + n - 2$ degrees of freedom.
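As a rough illustration of case 2, the pooled statistic can be computed as follows. The function name `pooled_t` and the parameter `delta0` are assumptions made for this sketch.

```python
import math
from statistics import mean, variance

def pooled_t(x, y, delta0=0.0):
    """Return (T, df) for the equal-variance (pooled) two-sample t statistic."""
    m, n = len(x), len(y)
    # Pooled standard deviation s_p from the two sample variances.
    sp = math.sqrt(((m - 1) * variance(x) + (n - 1) * variance(y)) / (m + n - 2))
    t = (mean(x) - mean(y) - delta0) / (sp * math.sqrt(1 / m + 1 / n))
    return t, m + n - 2

# Hypothetical data, chosen only to exercise the function.
t_pooled, df_pooled = pooled_t([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
```

When $m = n$ the pooled statistic equals the unequal-variance statistic of case 1; only the degrees of freedom differ.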
3. According to Tanis and Hogg (2008), when the two population distributions are not assumed
to be normal, the population variances $\sigma_1^2$ and $\sigma_2^2$ are unknown, and the sample
sizes $m$ and $n$ are large, the test statistic is
$$Z = \frac{\bar{x} - \bar{y} - (\mu_1 - \mu_2)}{\sqrt{\dfrac{s_1^2}{m} + \dfrac{s_2^2}{n}}},$$
where $Z$ is the standard normal variate.
Note that $\mu_1 - \mu_2 = 0$ for all three cases above, as per the null hypothesis. In general
it is not necessarily zero: if we want to test that one mean is at least a given amount higher
than the other, then $\mu_1 - \mu_2$ is set to that amount, and so on. The cases with known
variances are not considered, as they are not common in practice.
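A minimal sketch of the normal test in case 3, including its two-sided p-value, is given below. The name `z_test` and the parameter `delta0` (the hypothesized difference $\mu_1 - \mu_2$, zero under the null hypothesis used in this paper) are assumptions for illustration.

```python
import math
from statistics import NormalDist, mean, variance

def z_test(x, y, delta0=0.0):
    """Return (Z, two-sided p-value) for the large-sample normal test."""
    m, n = len(x), len(y)
    z = (mean(x) - mean(y) - delta0) / math.sqrt(variance(x) / m + variance(y) / n)
    p = 2 * (1 - NormalDist().cdf(abs(z)))  # area in both tails beyond |Z|
    return z, p

# Hypothetical data, chosen only to exercise the function.
x, y = [1, 2, 3, 4, 5], [2, 4, 6, 8, 10]
z, p = z_test(x, y)
z_shift, p_shift = z_test(x, y, delta0=-3.0)  # hypothesized difference equals observed one
```

Setting `delta0` to a nonzero value corresponds to testing that one mean exceeds the other by a stated amount, as described in the note above.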
In this paper we consider the first test and the third test, denoted TD and ZD,
respectively. We also consider TP, in which the test statistics are computed as in TD and ZD
but the p-value is computed by considering all permutations of the data. For larger samples, TP
uses random permutations instead of all possible permutations. The corresponding p-values are
denoted PTD, PZD, and PTP for the t-test, the normal test, and the permutation test,
respectively.
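The idea behind a permutation p-value can be sketched as follows. As a simplifying assumption, this example uses the absolute difference in sample means as the statistic, whereas TP as described above recomputes the t-type statistic for each arrangement. The function name is hypothetical.

```python
from itertools import combinations

def perm_test_mean_diff(x, y):
    """Two-sided exact permutation p-value for the difference in means."""
    pooled = list(x) + list(y)
    m, n = len(x), len(y)
    total = sum(pooled)
    observed = abs(sum(x) / m - sum(y) / n)
    hits = splits = 0
    # Every choice of m positions for group 1 is one arrangement of the data.
    for idx in combinations(range(m + n), m):
        s1 = sum(pooled[i] for i in idx)
        diff = abs(s1 / m - (total - s1) / n)
        if diff >= observed - 1e-12:  # small tolerance for floating-point ties
            hits += 1
        splits += 1
    return hits / splits

# Hypothetical data: only 2 of the 20 arrangements are as extreme as the observed one.
p_perm = perm_test_mean_diff([1, 2, 3], [4, 5, 6])
```

For larger samples the loop over `combinations` would be replaced by a fixed number of random shuffles of the pooled data.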
Non-parametric Tests
Wilcoxon Rank-Sum Test
Higgins (2004) describes the Wilcoxon rank-sum test as follows. Let $m$ be the sample size
of one group or treatment, and $n$ be the sample size of the other. Combine the $m + n$
observations into one group, and rank the observations from smallest to largest. Let 1 be the
rank of the smallest observation, 2 the rank of the next smallest observation, and so on. It is
common to have ties among observations in a data set; that is, two or more observations may
have the same value. In this case, the assignment of ranks to the observations is ambiguous. To
resolve this ambiguity, the average rank is assigned to the tied observations.
Find the observed rank sum $W$ of treatment 1. (Note that we may analyze either treatment 1
or treatment 2, due to the equivalence of the statements $\mu_1 \neq \mu_2$ and
$\mu_2 \neq \mu_1$.) Then the p-value of the test is computed either by using the distribution
of all possible permutations of the ranks or by using a normal approximation for larger
samples. For the two-sided test considered here,
$$W_R = \max(R - W, W),$$
where $R$ is the sum of the ranks for the combined sample.
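The ranking step with average ranks for ties, and the statistic $W_R$, can be sketched as below. The function names `midranks` and `rank_sum_statistic` are assumptions introduced for this example.

```python
def midranks(values):
    """Ranks 1..N of the values, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def rank_sum_statistic(x, y):
    """Return (W, W_R): rank sum of treatment 1 and max(R - W, W)."""
    ranks = midranks(list(x) + list(y))
    w = sum(ranks[:len(x)])
    r = sum(ranks)
    return w, max(r - w, w)

# Hypothetical data with a three-way tie at 2.0, which receives rank 3 each.
w, w_r = rank_sum_statistic([1.1, 2.0, 2.0], [2.0, 3.5, 4.0])
```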
Permutation Distribution
Higgins (2004) describes the permutation distribution method as follows. Find all possible
permutations of the ranks in which $m$ ranks are assigned to treatment 1 and $n$ ranks are
assigned to treatment 2. For each permutation of the ranks, find the sum of the ranks for
treatment 1 (or treatment 2). Determine the two-sided p-value as
$$P_{WR} = \frac{\text{number of permutations with } \max(R - U, U) \geq W_R}{\dbinom{m + n}{m}},$$
where $U$ is the sum of the ranks for treatment 1 (or treatment 2) for a permutation.
When the sample sizes are so large that all permutations cannot be enumerated within a
reasonable time, a reasonable number of random permutations (10,000 or 100,000) can be used
instead, depending on the time and computing facilities available.
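The exact two-sided permutation p-value can be sketched as follows, assuming the scores (here, ranks) are supplied as a list whose first $m$ entries belong to treatment 1. The function name `perm_pvalue` is hypothetical.

```python
from itertools import combinations

def perm_pvalue(scores, m):
    """Two-sided permutation p-value; the first m scores are treatment 1."""
    r = sum(scores)
    w = sum(scores[:m])
    observed = max(r - w, w)  # W_R for the observed assignment
    hits = total = 0
    # Enumerate every way to assign m of the scores to treatment 1.
    for idx in combinations(range(len(scores)), m):
        u = sum(scores[i] for i in idx)
        if max(r - u, u) >= observed - 1e-12:  # tolerance for midrank arithmetic
            hits += 1
        total += 1
    return hits / total

# Hypothetical example: treatment 1 holds the three smallest ranks out of six.
p_wr = perm_pvalue([1, 2, 3, 4, 5, 6], 3)
```

Because the function only sees a list of scores, the same sketch applies unchanged when ranks are replaced by other scores.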
Large Sample Approximation
According to Higgins (2004), for larger samples (sample sizes of 10 or greater can be
considered large), the statistic
$$Z = \frac{W - E(W)}{\sqrt{V(W)}}$$
approximately follows a standard normal distribution and hence can be used to obtain an
approximate p-value. Here $E(W) = m\mu$ and $V(W) = \dfrac{mn\sigma^2}{m + n - 1}$, where $\mu$
is the mean of all ranks for the combined sample, whether or not there are ties, and $\sigma^2$
is the population variance of all ranks for the combined sample, whether or not there are ties.
Without ties, $\mu = \dfrac{m + n + 1}{2}$ and $\sigma^2 = \dfrac{(m + n + 1)(m + n - 1)}{12}$.
Let the large sample approximate p-value for the Wilcoxon Rank Sum test be denoted as PWZ.
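A sketch of the large-sample approximation follows. It computes $\mu$ and $\sigma^2$ directly from the combined scores, so it covers the tied case as well. The function name `rank_sum_normal_approx` is an assumption, and the first $m$ scores are taken to belong to treatment 1.

```python
import math
from statistics import NormalDist, fmean, pvariance

def rank_sum_normal_approx(scores, m):
    """Return (Z, two-sided p-value) for the sum of the first m scores."""
    total = len(scores)
    n = total - m
    mu = fmean(scores)          # mean of all scores in the combined sample
    sigma2 = pvariance(scores)  # population variance (divisor m + n)
    w = sum(scores[:m])
    e_w = m * mu
    v_w = m * n * sigma2 / (total - 1)
    z = (w - e_w) / math.sqrt(v_w)
    return z, 2 * (1 - NormalDist().cdf(abs(z)))

# Hypothetical example without ties: treatment 1 holds ranks 1..10 out of 20.
ranks = list(range(1, 21))
z_w, p_wz = rank_sum_normal_approx(ranks, 10)
```

In this example $\sigma^2 = 21 \cdot 19 / 12 = 33.25$, in agreement with the no-ties formula above.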
van der Waerden Score Test
This test proceeds in the same way as the Wilcoxon Rank Sum test, except that the ranks
are replaced by the van der Waerden scores. In Higgins (2004) the van der Waerden scores are
defined by
$$V_{(i)} = \Phi^{-1}\left(\frac{i}{m + n + 1}\right),$$
where $\Phi^{-1}$ denotes the inverse of the cdf of the standard normal distribution. The test
statistic is the sum of the van der Waerden scores for treatment 1 (or treatment 2). Then the
p-value is computed using the methods described for the Wilcoxon Rank Sum test, with the van
der Waerden scores in place of the ranks. Let the permutation p-value for the van der Waerden
score test be denoted as PVS and the large sample approximate p-value for the van der Waerden
score test be denoted as PVZ.
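The score transformation can be sketched with the standard library's normal distribution. The function name `van_der_waerden_scores` is an assumption for illustration; it takes the ranks of the combined sample and returns one score per rank.

```python
from statistics import NormalDist

def van_der_waerden_scores(ranks):
    """Map each rank i of the combined sample to inv_cdf(i / (m + n + 1))."""
    total = len(ranks)  # m + n observations in the combined sample
    inv_cdf = NormalDist().inv_cdf
    return [inv_cdf(r / (total + 1)) for r in ranks]

# Hypothetical example with three observations ranked 1, 2, 3.
vdw = van_der_waerden_scores([1, 2, 3])
```

The scores are symmetric about zero, so the middle rank of an odd-sized combined sample maps to a score of zero.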
Exponential Score Test
This test proceeds in the same way as the Wilcoxon Rank Sum test, except that the ranks
are replaced by the Exponential scores. The Exponential scores are defined by
$$\frac{1}{m + n}, \quad \frac{1}{m + n} + \frac{1}{m + n - 1}, \quad
\frac{1}{m + n} + \frac{1}{m + n - 1} + \frac{1}{m + n - 2}, \quad \ldots$$
in Higgins (2004). The test statistic is the sum of the Exponential scores for treatment 1 (or
treatment 2). Then the p-value is computed using the methods described for the Wilcoxon
Rank Sum test, with the Exponential scores in place of the ranks. Let the permutation p-value
for the Exponential score test be denoted as PES and the large sample approximate p-value
for the Exponential score test be denoted as PEZ.
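The Exponential scores are running sums of reciprocals, which a short sketch makes concrete. The function name `exponential_scores` is an assumption, and the sketch covers the case without ties, where the observation with rank $i$ receives the $i$-th score.

```python
def exponential_scores(total):
    """Scores 1/N, 1/N + 1/(N-1), ... for a combined sample of size N = m + n."""
    scores, running = [], 0.0
    for i in range(1, total + 1):
        running += 1.0 / (total - i + 1)  # add the next reciprocal term
        scores.append(running)
    return scores

# Hypothetical example with N = 3: the scores are 1/3, 1/3 + 1/2, 1/3 + 1/2 + 1.
exp_scores = exponential_scores(3)
```

The scores always add up to $m + n$, so their mean is 1, which can serve as a quick check on an implementation.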
There are certain conditions under which parametric methods have been suggested to be
superior to nonparametric methods. Similarly, there are instances where nonparametric methods
are suggested over their parametric counterparts. According to Warner (2007), nonparametric
methods should be used when the sample size is small, whereas parametric methods should be
used when the sample size is large. Also, when there is an outlier in the data, nonparametric
methods are said to be preferable. According to Tanis and Hogg (2008), when the population
distribution is normal and the sample size n is as small as 4 or 5, the normal test should be
a very adequate approximation.
I also tested some conditions not addressed by these authors to see whether they favor
one method or the other. One of these is whether different distributions have any effect on
the performance of the two methods. The following three graphs illustrate the different
distributions used. The variances are also varied to see whether any effects become apparent.
The distance between the means is also changed, to see whether the methods pick up on a
larger difference equally well.
Figure A) Distribution Examples
Simulation Study
To investigate how the tests compare in their estimated Type I error, 1000 samples
of sizes 5, 8, 11, and 15 are drawn from two independent normal populations with equal means.
All nine p-values mentioned above (PTD, PZD, PTP, PWR, PWZ, PVS, PVZ, PES, and PEZ) are
computed and the numbers of p-values less than or equal to 0.05 are recorded. The choices are:
(i) Population 1: Normal with mean 1 and variance 1; Population 2: Normal with mean 1 and
variance 1, (ii) Population 1: Normal with mean 1 and variance 1; Population 2: Normal with
mean 1 and variance 1 with an outlier. The proportions of rejections are displayed in Table 1.
The values displayed in Table 1 represent the rate at which the tests said the means were
different when in fact they were the same. Each of the tests was performed on these two
distribution comparisons for the sample sizes 5, 8, 11, and 15.
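The simulation loop itself can be sketched as follows for a single test. As an assumption made to keep the example within the standard library, it uses only the normal test (ZD) and choice (i); the function names, the seed, and the `mean2` parameter are hypothetical, and the outlier scenario is not modelled because the text does not say how the outlier was generated.

```python
import math
import random
from statistics import NormalDist, mean, variance

def z_pvalue(x, y):
    """Two-sided p-value of the normal test for equal means."""
    z = (mean(x) - mean(y)) / math.sqrt(variance(x) / len(x) + variance(y) / len(y))
    return 2 * (1 - NormalDist().cdf(abs(z)))

def rejection_rate(n, mean2=1.0, reps=1000, alpha=0.05, seed=0):
    """Proportion of simulated sample pairs whose p-value is at most alpha."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    rejections = 0
    for _ in range(reps):
        x = [rng.gauss(1.0, 1.0) for _ in range(n)]    # population 1: N(1, 1)
        y = [rng.gauss(mean2, 1.0) for _ in range(n)]  # population 2: N(mean2, 1)
        if z_pvalue(x, y) <= alpha:
            rejections += 1
    return rejections / reps

level_n5 = rejection_rate(5)               # equal means: estimates the Type I error
power_n15 = rejection_rate(15, mean2=3.0)  # unequal means: estimates the power
```

With equal means the returned proportion estimates the level of significance; with `mean2` different from 1 the same loop estimates the power.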
Table 1: Estimates of the Level of Significance

 n    PTD    PZD    PTP    PWR    PWZ    PVS    PVZ    PES    PEZ

N(1,1) vs. N(1,1)
 5  0.053  0.089  0.056  0.037  0.066  0.040  0.066  0.060  0.037
 8  0.054  0.072  0.057  0.054  0.054  0.054  0.054  0.060  0.049
11  0.042  0.065  0.043  0.046  0.046  0.046  0.046  0.054  0.047
15  0.041  0.051  0.041  0.035  0.036  0.035  0.036  0.050  0.041

N(1,1) vs. N(1,1) w/outlier
 5  0.013  0.036  0.051  0.030  0.063  0.031  0.063  0.057  0.030
 8  0.004  0.018  0.032  0.030  0.030  0.030  0.030  0.029  0.020
11  0.014  0.019  0.036  0.039  0.039  0.039  0.039  0.039  0.034
15  0.019  0.029  0.043  0.041  0.041  0.041  0.041  0.048  0.041
To investigate the powers of the tests, samples are generated from populations having
different means. The choices are: (i) Population 1: Normal with mean 1 and variance 1;
Population 2: Normal with mean 3 and variance 1, (ii) Population 1: Normal with mean 1 and
variance 1; Population 2: Normal with mean 5 and variance 2, (iii) Population 1: Normal with
mean 1 and variance 1; Population 2: Normal with mean 2 and variance 1, (iv) Population 1:
Exponential with mean 1/3; Population 2: Normal with mean 1 and variance 1, (v) Population 1:
Exponential with mean 1/3; Population 2: Exponential with mean 1, (vi) Population 1: Skewed
bimodal with mean 3/8 and variance 7/9; Population 2: Normal with mean 0 and variance 1, (vii)
Population 1: Skewed bimodal with mean 3/8 and variance 7/9; Population 2: Normal with mean 3
and variance 1. Then for each of the choices the proportions of rejections are computed and
displayed in Table 2. The values displayed in Table 2 represent the rate at which the tests
said the means were different when in fact they were different. Each of the tests was
performed on these seven distribution comparisons for the sample sizes 5, 8, 11, and 15.
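Drawing samples from the non-normal populations can be sketched as below. The skewed bimodal population is taken to be the mixture $(3/4)N(0,1) + (1/4)N(3/2, (1/3)^2)$ given in the Table 2 headings; the function names and the seed are assumptions for illustration.

```python
import random
from statistics import fmean

def sample_skewed_bimodal(n, rng):
    """Draw n values from the mixture (3/4) N(0, 1) + (1/4) N(3/2, (1/3)^2)."""
    # gauss takes a standard deviation, so the second component uses 1/3.
    return [rng.gauss(0.0, 1.0) if rng.random() < 0.75 else rng.gauss(1.5, 1 / 3)
            for _ in range(n)]

def sample_exponential(n, mean_value, rng):
    """Draw n values from an exponential distribution with the given mean."""
    return [rng.expovariate(1.0 / mean_value) for _ in range(n)]  # rate = 1 / mean

rng = random.Random(1)  # fixed seed so the draws are repeatable
bimodal_mean = fmean(sample_skewed_bimodal(20000, rng))
exp_mean = fmean(sample_exponential(20000, 1 / 3, rng))
```

The sample mean of a large draw from the mixture should be close to $(1/4)(3/2) = 3/8$, the mean stated for the skewed bimodal population.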
Table 2: Powers of the Tests

 n    PTD    PZD    PTP    PWR    PWZ    PVS    PVZ    PES    PEZ

N(1,1) vs. N(3,1)
 5  0.762  0.867  0.762  0.681  0.767  0.681  0.767  0.735  0.681
 8  0.967  0.990  0.970  0.961  0.961  0.961  0.961  0.940  0.921
11  0.994  0.995  0.994  0.989  0.989  0.989  0.989  0.981  0.980
15  1.000  1.000  0.999  0.999  0.999  0.999  0.999  0.994  0.994

N(1,1) vs. N(5,2)
 5  0.904  0.972  0.928  0.849  0.901  0.849  0.901  0.897  0.849
 8  0.997  0.999  0.999  0.995  0.995  0.995  0.995  0.998  0.996
11  0.999  0.999  0.999  0.999  0.999  0.999  0.999  1.000  1.000
15  1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000

N(1,1) vs. N(2,1)
 5  0.270  0.393  0.270  0.209  0.291  0.209  0.291  0.264  0.209
 8  0.454  0.519  0.464  0.441  0.441  0.441  0.441  0.411  0.373
11  0.589  0.638  0.597  0.566  0.566  0.573  0.566  0.569  0.544
15  0.743  0.765  0.743  0.729  0.731  0.730  0.731  0.680  0.664

Exp(1/3) vs. N(1,1)
 5  0.208  0.345  0.279  0.201  0.245  0.207  0.245  0.228  0.201
 8  0.307  0.459  0.450  0.383  0.383  0.383  0.383  0.427  0.392
11  0.518  0.612  0.610  0.530  0.530  0.533  0.530  0.587  0.564
15  0.719  0.777  0.775  0.670  0.681  0.677  0.681  0.763  0.748

Exp(1/3) vs. Exp(1)
 5  0.133  0.287  0.264  0.186  0.263  0.188  0.263  0.249  0.186
 8  0.316  0.442  0.454  0.399  0.399  0.399  0.399  0.450  0.397
11  0.517  0.624  0.624  0.555  0.555  0.560  0.555  0.615  0.595
15  0.722  0.777  0.782  0.679  0.682  0.683  0.682  0.768  0.747

(3/4)N(0,1) + (1/4)N(3/2,(1/3)^2) vs. N(0,1)
 5  0.096  0.153  0.097  0.069  0.103  0.069  0.103  0.095  0.069
 8  0.100  0.145  0.104  0.108  0.108  0.108  0.108  0.127  0.109
11  0.127  0.166  0.125  0.128  0.128  0.130  0.128  0.156  0.137
15  0.163  0.179  0.162  0.153  0.155  0.154  0.155  0.196  0.182

(3/4)N(0,1) + (1/4)N(3/2,(1/3)^2) vs. N(3,1)
 5  0.930  0.978  0.932  0.878  0.933  0.884  0.933  0.929  0.878
 8  0.998  0.999  0.999  0.998  0.998  0.998  0.998  0.996  0.996
11  1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000  0.999
15  1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000
We now analyze the various scenarios and compare the effectiveness of the parametric
and non-parametric tests. We compare populations that share the same distribution,
populations with different distributions, populations with different variances, populations
with different means, and samples with extreme outliers. We first observe how well the tests
recognize that $H_0: \mu_1 = \mu_2$ holds when that is the case.
We begin with two populations each having a normal distribution. Both populations have a
mean of 1 and a variance of 1. Since the means are equal, we are computing the level of
significance of the tests. We can see from Table 1 that PZD, the normal test, had slightly
higher levels of significance for all four sample sizes. However, this difference was not
large. The decision to reject or not reject $H_0$ depends entirely on the desired level of
significance. No test stood out so drastically that, for a majority of the commonly used
levels of significance, different tests would yield different results. All of the tests gave
roughly 5% for the level of significance except PZD when n=5; even that was off by less than
4 percentage points. Additionally, the large sample approximation of the exponential scores
test, PEZ, gave a low level of significance when the sample size was n=5. The data discussed
are plotted in the following graph (Figure 1).
Figure 1: Type 1 Error; N(1,1) vs N(1,1)
Now we observe the results for similarly constructed populations with the addition of
outliers. Again, since the means are equal, we compute the levels of significance. It is
apparent from the data displayed in Figure 2 that the values were on average lower than in
Figure 1; this means that the tests were, on average, more effective in determining that
$H_0$ is true. When the sample size was n=5, PWZ, PVZ, and PES all gave values greater than
5%, while the rest gave lower values or were very close to 5%. When the sample size was
greater, however, all the tests performed similarly, giving values lower than 5%. While the
observed levels of significance are somewhat greater for the nonparametric methods, the tests
still generally led to the same conclusion about $H_0: \mu_1 = \mu_2$. The data discussed are
plotted in Figure 2.
Figure 2: Type 1 Error; N(1,1) vs N(1,1) with outlier
When finding the levels of significance, the methods did not differ too greatly in either
case. While in certain circumstances some tests had an estimated level of significance
greater than 5%, the tests with an estimated level below 5% were not far below it. When
considering the differences between the tests, we observed that on average the difference
between the parametric and nonparametric methods was rather small. Since there was not a great
deal of difference in the performance of the tests across the different distributions and
sample sizes, there was no single test, parametric or nonparametric, that clearly performed
better than the rest. We shall soon see that, when we turn to the power of the tests, the
similarities become even more apparent.
We now consider how effective the tests were in determining when $H_0: \mu_1 = \mu_2$ is
not true. This first simulation compares two normal populations each having a variance of 1,
and means of 1 and 3, respectively. When the sample size was n=5, PZD had a slightly greater
power than the rest, while the other tests performed very similarly. When the sample size
increased, there was very little difference between the tests' performance. Since there was no
substantial difference between any of the tests for all four of the sample sizes, the tests
performed equally well. The data discussed are plotted in the following graph.
Figure 3: Type 2 Error; simulation 1, N(1,1) vs N(3,1)
In the second simulation we analyze two normal populations, population 1 with a mean
of 1 and variance 1, and population 2 with mean 5 and variance 2. Each of the tests picked up
on this increased difference in means rather effectively. As the sample size increases this
becomes even more apparent. This is especially true when the sample size is n=15. In this case
all of the tests had identical values.
Figure 4: Type 2 Error; simulation 2, N(1,1) vs N(5,2)
For the third simulation we analyze two normal populations each having a variance of 1,
and means of 1 and 2, respectively. The normal test, PZD, gave a slightly higher value for the
two smaller of the four sample sizes. The other tests performed similarly to each other for
each of the sample sizes. When the sample size was greater, PZD performed close to the other
eight tests.
Figure 5: Type 2 Error; simulation 3, N(1,1) vs N(2,1)
In the fourth simulation we change the distribution of one of our samples to exponential
and give it a mean of 1/3; the second population has a normal distribution with a mean of 1
and variance 1. As in the previous scenarios, the tests gave approximately the same result
for all the sample sizes, with the differences between the tests decreasing as the sample size
increased.
Figure 6: Type 2 Error; simulation 4, Exp(1/3) vs N(1,1)
The fifth simulation compares two exponential distributions with means 1/3 and 1,
respectively. In a slight change of pace, none of the tests stood out either above or below
the others for any of the sample sizes in determining when $H_0: \mu_1 = \mu_2$ is false. When
the sample size was n=5, the tests all had values close to 20%-25%. The tests had almost
identical values for the three larger sample sizes.
Figure 7: Type 2 Error; simulation 5, Exp(1/3) vs Exp(1)
In the sixth and seventh simulations we compared skewed bimodal distributions with
normal distributions. In both of the trials the skewed bimodal distribution had a mean of 3/8
and variance of 7/9, while the normal distributions had means 0 and 3, respectively, and in
both cases a variance of 1. In the sixth simulation, for all four of the sample sizes, the
tests all performed similarly, giving values approximately 8 percentage points apart or less.
They also stayed below 20% in all of the cases. In the seventh trial, however, the tests all
had values of 85% or higher, while still maintaining a maximum difference of 10 percentage
points.
Figure 8: Type 2 Error; simulation 6, (3/4)N(0,1)+(1/4)N(3/2,1/9) vs N(0,1)
Figure 9: Type 2 Error; simulation 7, (3/4)N(0,1)+(1/4)N(3/2,1/9) vs N(3,1)
While there were instances where one of the tests had a slightly higher or lower value for
a certain set of parameters, when there was a difference it was not large enough to be
considered substantial. In finding both the power and the level of significance, none of the
tests truly "outperformed" the others for any particular set of parameters. When finding the
observed level of significance, the nonparametric tests did prove to be consistently more
effective than the parametric tests. However, this difference in performance was not enough
to influence the decision of whether or not to reject $H_0: \mu_1 = \mu_2$. Consequently, when
we consider the set of parametric tests against the set of nonparametric tests, we did not
observe that one set or the other had a noticeably higher power or more accurately matched
the level of significance.
Contrary to the accepted set of criteria for determining which to use, our research did
not find a specific set of parameters for which parametric tests are the proper choice over
nonparametric tests. A small sample size had a small effect on the performance of the tests;
however, when the size increased, the tests performed almost equivalently. This is the
opposite of the accepted notion of how parametric methods perform relative to nonparametric
methods. Changing the variance also seemed to have no effect. When the difference between the
means was greater, both sets of tests, parametric and nonparametric, picked up on this
difference similarly. Even when comparing different distribution types, the tests performed
relatively similarly to each other.
Since there was no clear scenario in which parametric methods outperformed
nonparametric methods or vice versa, the research was inconclusive. None of the tested
parameters had an effect large enough to cause a noticeable change in the outcome. Thus, the
choice of parametric or nonparametric seems to be left to the preference of the person
analyzing the data.
Bibliography
Higgins, James J. Introduction to Modern Nonparametric Statistics. Pacific Grove, CA:
Brooks/Cole-Thomson, 2004.
Hogg, Robert V., and Tanis, Elliot A. A Brief Course in Mathematical Statistics. Upper Saddle
River, NJ: Pearson Prentice Hall, 2008.
Reinard, John C. Communication Research Statistics. London, UK: Sage Publications, 2006.
Warner, Rebecca M. Applied Statistics: From Bivariate Through Multivariate Techniques.
London, UK: Sage Publications, 2007.