
Article
Measures of Difference and Significance in the Era of
Computer Simulations, Meta-Analysis, and Big Data
Reinout Heijungs 1,2, *, Patrik J.G. Henriksson 3,4 and Jeroen B. Guinée 1
1 Institute of Environmental Sciences, Leiden University, 2300 RA Leiden, The Netherlands;
[email protected]
2 Department of Econometrics and Operations Research, Vrije Universiteit Amsterdam,
1081 HV Amsterdam, The Netherlands
3 Stockholm Resilience Centre, 10691 Stockholm, Sweden; [email protected]
4 WorldFish, Jalan Batu Maung, 11960 Penang, Malaysia
* Correspondence: [email protected] or [email protected]; Tel.: +31-20-598-2384

Academic Editors: Julio Stern and Adriano Polpo


Received: 28 May 2016; Accepted: 30 September 2016; Published: 9 October 2016

Abstract: In traditional research, repeated measurements lead to a sample of results, and inferential
statistics can be used to not only estimate parameters, but also to test statistical hypotheses concerning
these parameters. In many cases, the standard error of the estimates decreases (asymptotically) with
the square root of the sample size, which provides a stimulus to probe large samples. In simulation
models, the situation is entirely different. When probability distribution functions for model features
are specified, the probability distribution function of the model output can be approximated using
numerical techniques, such as bootstrapping or Monte Carlo sampling. Given the computational
power of most PCs today, the sample size can be increased almost without bounds. The result is that
standard errors of parameters are vanishingly small, and that almost all significance tests will lead to
a rejected null hypothesis. Clearly, another approach to statistical significance is needed. This paper
analyzes the situation and connects the discussion to other domains in which the null hypothesis
significance test (NHST) paradigm is challenged. In particular, the notions of effect size and Cohen’s
d provide promising alternatives for the establishment of a new indicator of statistical significance.
This indicator attempts to cover significance (precision) and effect size (relevance) in one measure.
Although in the end more fundamental changes are called for, our approach has the attractiveness
of requiring only a minimal change to the practice of statistics. The analysis is not only relevant for
artificial samples, but also for present-day huge samples, associated with the availability of big data.

Keywords: significance test; null hypothesis significance testing (NHST); effect size; Cohen’s d;
Monte Carlo simulation; bootstrapping; meta-analysis; big data

1. Introduction
The problem of determining if the difference between two groups is large enough to be labeled
“significant” is an old and well-studied problem. Virtually every university program treats it, often as
a second example of the t-test, the first example being the one-sample case [1–4]. Generalizations
then motivate the study of the analysis of variance (ANOVA) and more robust non-parametric tests,
such as those by Mann–Whitney and Kruskal–Wallis. All these established tests are based on the
comparison of the means (or medians) of two (or more) groups, and as such, the standard error of
these means (or medians) plays a crucial role. Such standard errors typically decrease with the square
root of the sample size. As a result, the question of whether or not a difference between two (or more)
means (or medians) is significant not only depends on the intrinsic properties of the phenomenon
(mean of the difference and variance of the distributions), but also on the sample size, which is not an
intrinsic property of the phenomenon. In a traditional experimental set-up or field study, this may be
appropriate, because significance means that the limited evidence obtained by small samples suffices
to mark the populations as being different. In such cases, the standard error is a perfect companion.
However, in the context of unlimited or virtually unlimited data—for instance, for computer-generated
samples—this concept of significance breaks down. In such cases, the standard error will not do a good
job, at least not in the way it is used in the standard textbooks.
Computer-generated datasets and large datasets have become increasingly common in the
21st century. Specifically, the following developments should be mentioned:

• Simulation models [5], where artificial samples are generated according to the principles of Monte
Carlo, Latin hypercube, bootstrapping, or any other sampling or resampling method. Depending
on the size of the model and computing power, such techniques easily yield a sample size of 1000
or more.
• Meta-analysis [6], where the results of dozens or hundreds of studies are combined into
one meta-study with an effectively large sample size. Online repositories in particular (such as
those of the Cochrane Library [7]) enable the performance of such meta-analyses.
• Big data [8], where data on millions of customers, patients, vehicles, or other objects of interest
are automatically collected for statistical processing.

In this article, we focus on the case of comparing a numerical variable for two groups, indicated
by subscripts A and B. The reader may think of this in terms of either a control group and a treatment
group (as is often the case in medical research), or of two different situations (as is often the case in
empirical research; for instance, male customers versus female customers). The variable might be
anything like IQ, voltage, or price. Further, to keep the discussion focused, we will assume that the
true mean of group A is lower than that of group B.
Section 2 revisits the basic situation of the null hypothesis significance test on the equality of means
for two groups, also in a historical perspective, contrasting the approaches by Fisher, Neyman–Pearson,
and their synthesis; Section 3 critically analyzes the influence of sample size in the hypothesis test;
Section 4 analyzes alternatives to the usual expression and proposes a new test criterion; Section 5
provides a discussion and conclusion.
As for notation, we will use Greek symbols (µ, σ) for population parameters, capital Latin symbols
(Y, Ȳ, S) for random variables sampled from such populations, and lower case Latin symbols (y, ȳ, s)
for the values obtained in a particular sample. Y ∼ N(µ, σ²) indicates that random variable Y is
normally distributed with mean µ and variance σ². Their sample values are indicated by ȳ and s².
t(ν) is the t-distribution with ν degrees of freedom.

2. Comparing Two Groups


Our motivation comes from the study of comparing the sustainability of different scales of
aquaculture, using a computer simulation fed by a distribution of input data [9]. We will use an example
of the carbon footprint of Pangasius catfish cultivation in Vietnam.
Let us suppose that there are two groups: small-scale (subscript A) and large-scale (subscript B)
fisheries. The carbon footprint varies within one group between sites and per day, so there is
a distribution of carbon footprints for group A, which we indicate by the stochastic variable YA ,
and a distribution of carbon footprints for group B, which we indicate by YB . For simplicity, we will
assume that both populations are normally distributed with the same (but unknown) variance σ²:

YA ∼ N(µA, σ²) and YB ∼ N(µB, σ²)

Now, we collect from both populations a sample of equal size n A = n B = n. The purpose is to
compare the centrality parameter, in particular the means µ A and µB .

Now, there are a number of options for carrying out the statistical analysis. One choice is between
“classical statistics” (as discussed in most mainstream textbooks and handbooks, including [1–4]) and
Bayesian statistics (e.g., [10,11]). In this article, we will build entirely on the classical paradigm, mainly
because it is mainstream, and moreover because the Bayesians emphasize the changing of beliefs as
a result of new evidence, which is not the core issue in big data and computer-generated samples
(although it is a core issue in meta-analysis). Within this classical paradigm, we have a choice of taking
the Fisherian approach, the Neyman–Pearson approach, or their hybrid or synthesized forms, the null
hypothesis significance test [12].
Fisher’s approach calculates the probability of obtaining the observed value (or an even more
extreme value) of a test statistic when an a priori specified null hypothesis would be true. In the
present case, the null hypothesis would be

H0 : µ A = µB

and the test statistic would be derived from the observed difference in means (ȲB − ȲA).
The standardized form of this is

T = (ȲB − ȲA) / (SP √(1/nA + 1/nB)) = (ȲB − ȲA) / (SP √(2/n))

where

SP = √[ ((nA − 1) S²A + (nB − 1) S²B) / (nA + nB − 2) ] = √[ (S²A + S²B) / 2 ]

is the pooled estimate of the standard deviation of the two populations. Under H0, the random variable
T is distributed according to a t-distribution, with nA + nB − 2 = 2(n − 1) degrees of freedom:

T ∼ t(ν = 2(n − 1))

Denoting the obtained value of the random variable T = (ȲB − ȲA) / (SP √(2/n)) by
t = (ȳB − ȳA) / (sP √(2/n)), the p-value is then calculated as the probability that T has the obtained
value t or one even farther away from the expected value 0:

p-value = Pt(ν=2(n−1))(|T| > |t|)

In this approach, no black–white decision as to significance is made, but the p-value suffices to
communicate a level of evidence. In addition, no alternative hypothesis is stated, and we study only
the plausibility of the data with a stated null hypothesis.
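
For concreteness, the calculation just described can be written out in a few lines of code. The following is a minimal sketch (ours, not code from the paper) of the pooled two-sample t-test for equal group sizes; the function name and any data passed to it are our own illustrations.

```python
# Minimal sketch of the pooled two-sample t-test described above
# (equal group sizes and equal population variances assumed).
import numpy as np
from scipy import stats

def fisher_p_value(y_a, y_b):
    n = len(y_a)                       # assumes len(y_a) == len(y_b) == n
    s2_a = np.var(y_a, ddof=1)         # sample variance S_A^2
    s2_b = np.var(y_b, ddof=1)         # sample variance S_B^2
    s_p = np.sqrt((s2_a + s2_b) / 2)   # pooled standard deviation S_P
    t = (np.mean(y_b) - np.mean(y_a)) / (s_p * np.sqrt(2 / n))
    nu = 2 * (n - 1)                   # degrees of freedom
    p = 2 * stats.t.sf(abs(t), df=nu)  # P(|T| > |t|)
    return t, p
```
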
In contrast, the approach by Neyman–Pearson starts by formulating two competing simple
hypotheses (often called the null hypothesis and the alternative hypothesis), and calculates the
ratio of the likelihoods of the data for these hypotheses. The result then yields a probability of
the data corresponding to one hypothesis or the other (see [13] for a clear example on coin throwing).
This approach also sets an a priori threshold value for rejecting the null hypothesis against the
alternative one, symbolized as α, conventionally set to 0.05 or 0.01. The notion of significance then
arises in comparing the p-value to α. In addition, the method calculates a second parameter, β, for the
probability of incorrectly retaining the null hypothesis when the alternative is true. Its complement, 1 − β, then represents
the (a posteriori) power of the test.
Whereas Fisher, Neyman, and Pearson were having an acrimonious debate on the weak
and strong points of the two methods, textbooks from the 1950s on were effectively creating
a synthesis (an “anonymous amalgamation of the two incompatible procedures” [14]), using elements
from Fisher and from Neyman–Pearson, which “follow Neyman–Pearson procedurally but Fisher
philosophically” [12]. The result is known as the null hypothesis significance test (NHST), and it
is characterized by the use of p-values in combination with an a priori α, a composite alternative
hypothesis, and occasional power calculations. In the example elaborated according to NHST, the null
hypothesis is:
H0 : µ A = µB

with alternative hypothesis


H1 : µA ≠ µB

because we want to find out if the mean carbon footprint differs between the two groups. The math
then follows Fisher’s approach in calculating a p-value. This p-value is compared to the type I error
rate α that has been set in advance (e.g., to 0.05). A p-value smaller than α leads to rejection of the
null hypothesis and acceptance of the alternative hypothesis, while a larger p-value leaves the null
hypothesis in place (which does not mean acceptance). So, the null hypothesis H0 : µA = µB is rejected at the
pre-determined significance level α when the calculated value of the test statistic (t) is smaller than the
lower critical value (tcrit,lower,α ) or larger than the upper critical value (tcrit,upper,α ). Critical values are
thus defined by the conditions
Pt(ν=2(n−1))(T ≤ tcrit,lower,α) = α/2   and   Pt(ν=2(n−1))(T ≥ tcrit,upper,α) = α/2
Because the t-distribution is symmetric around the value 0, this can also be formulated as
a rejection when the absolute value of t exceeds tcrit,one−tailed,α . In that case, we use

Pt(ν=2(n−1)) (| T | ≥ tcrit,one−tailed,α ) = α

The elaboration above on one hand summarizes the NHST-procedure for the two sample case,
which is helpful in defining concepts and notation for the later sections of this paper. On the other
hand, it briefly recaps the history in terms of the contributions by Fisher and by the tandem Neyman
and Pearson, which will turn out to be useful in the later discussion. We do not pretend to give a full
history of statistical testing; please refer to [4,12,15,16].

3. Critique of NHST in Comparing Two Means


The NHST procedure has been criticized fiercely for quite a few decades; see for instance [16–21].

In the present study, we wish to single out one aspect: the test statistic T scales with √n. A sample size
n = 1000 gives a 10 times larger value of T than a sample size n = 10, and a sample size n = 100,000
a 100 times larger value, while keeping the effects and σ fixed. At a significance level α = 0.05,
the critical values of T are 2.101 (n = 10; ν = 18), 1.961 (n = 1000; ν = 1998), and 1.960 (n = 100,000;
ν = 199,998), so if we simplify to tcrit,upper,0.05 ≈ 2, the only term that really matters is the observed
value of the T statistic. We reject H0 when |T| ≥ tcrit,upper,α, so when

(ȲB − ȲA) / SP ≳ tcrit,upper,α √(2/n)

As an example, consider the case ȲA = 5, ȲB = 6, SP = 1, and let n = 2, . . . , 300. At n ≥ 9,
we have sufficient certainty to reject equality of means. When the difference is smaller, say ȲB = 5.2
instead, n = 9 will not suffice; however, with a greater effort (n ≥ 194), we will finally be able to reject
equality of means (see Figure 1).
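
These thresholds can be checked with a short computation. The sketch below is our own illustration (the function name is ours, assuming SciPy is available): it scans n until the observed T first exceeds the critical value.

```python
# Find the smallest n at which a fixed observed difference becomes
# "significant" at level alpha (illustrating the thresholds of Figure 1).
import numpy as np
from scipy import stats

def smallest_significant_n(diff, s_p=1.0, alpha=0.05, n_max=1000):
    for n in range(2, n_max + 1):
        t = diff / (s_p * np.sqrt(2 / n))                    # observed T
        t_crit = stats.t.ppf(1 - alpha / 2, df=2 * (n - 1))  # upper critical value
        if t >= t_crit:
            return n
    return None

print(smallest_significant_n(1.0))  # 9, as in the example above
print(smallest_significant_n(0.2))  # 194
```
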
This convincingly reminds us that the decision to reject H0 and to conclude that the two means are
“significantly different” depends not only on the inherent properties of the populations (µ A , µB , σ) or
the properties of the samples that have been generated from them (ȲA, ȲB, SP), but also on the sample
size n, which is not an inherent property of the population. The concept of statistical significance mixes
a number of aspects:

• the difference µB − µA or its estimate, ȲB − ȲA;
• the standard deviation of the two populations, σ = σA = σB, or its pooled estimate SP;
• the sample size, n.

Figure 1. The absolute value of the T-statistic when ȲA = 5.0, ȲB = 6.0, SP = 1 (upper solid line) and
when ȲA = 5.0, ȲB = 5.2, SP = 1 (lower solid line), for different values of the sample size n, together
with the upper critical value of the t-distribution (dashed line) at α = 0.05. The null hypothesis of
equality of population means is rejected at α = 0.05 for n ≥ 9 when ∆Ȳ = 1.0, but when ∆Ȳ = 0.2,
we need to push further and use n ≥ 194 to do the job.

The first of these is natural: the actual difference between ȲA and ȲB is of course important in
deciding if the difference is “large”, “substantial”, or—why not—“significant”.
The second one plays a more intricate role. The ratio (ȲB − ȲA)/SP provides a dimensionless
indicator of the “relative” difference between the two means µA and µB. It is sometimes described as
a signal-to-noise ratio [16].
The third element plays a curious role. Sample size is important for establishing the confidence
level of a result. However, sample size is not part of the nature of the phenomenon under investigation.
The two means (µA and µB) and the standard deviation (σ) are aspects of the research object.
Sample size (n) is an aspect of the instrument that we use to probe the object. Of course, the quality
of the instrument has an influence on the outcome. If we wish to know how many stars there are,
and use a cheap telescope, the number will be lower than when we use a multi-billion dollar telescope.
However, no serious astronomer will proclaim that the number of counted stars is equal to the number
of stars in the universe. Instead, a formula to estimate the number of stars from the number of counted
stars and the quality of the telescope will be developed. The application of this formula to the two
measurement set-ups will give different results, and probably the estimate with the expensive telescope
will be more accurate. In traditional NHST, this is different. What you see depends on the measurement
set-up, and this is not corrected for in the outcome.
A consequence is that money can buy significance. Of course, the mean blood pressure of two
groups of patients will never be equal when you consider the last digit. However, it may be that the
difference is only in the fourth decimal, that ȳA = 120.0269 and ȳB = 120.0268. With an exceptionally
large study, this negligible difference can be declared to be “significant”. The distinction between
a significant difference and a large difference is mentioned in most textbooks on statistics, but often in
a slightly cursory way, and it is well-known that many less-informed students and scientists mistake
a significant difference for a large or important difference [22].
Combining the estimated difference ȲB − ȲA and the standard error of this difference SP √(2/n)
into one formula has one big advantage: it yields one single number, which can moreover objectively
be tested against a conventional benchmark, such as α = 0.05. Therefore, we only need to communicate
this single number, either as a t-value, as a p-value, or as a significance statement, such as “p < 0.01”,
“**”, or “highly significant difference”. The fact that two things are combined in one is the root of the
problem, however: information has been lost due to the compression of two complementary aspects
into one.

4. Alternatives to NHST for Comparing Two Means

Moving away from significance tests in the direction of effect sizes has been propagated by
various authors [19,23], the latter of whom used the term “new statistics” to refer to this change of
paradigm. Cumming [19] makes a strong plea for the use of confidence intervals. Confidence intervals
for a difference of means, such as “95% CI [1.4, 8.6]” (p. 161), indeed display elements of size and
significance, and use two pieces of information, not one.
Ziliak and McCloskey [16] popularize the two elements as “Oomph” and “Precision”. These
two authors introduce more interesting expressions, such as “the sizeless scientist”, who only focuses
on the question if there is an effect, and ignores if the effect is large or otherwise important.
Such critiques on NHST are understandable, but it is questionable if the alternatives provide
a real improvement.
A confidence interval shares a problem with the old statistics of NHST: given a large enough
sample, the width of the confidence interval will shrink to zero, and as students are trained to see if
the no-effect value of 0 is inside or outside the confidence interval, at some point the confidence
interval approach will still be used more to assess the precision rather than the oomph. Of course,
we could train students to ignore the question if 0 is inside the confidence interval, and to focus more
on the confidence interval as such, but there are alternatives which in our view do a better job,
and which are moreover easier to communicate.
Cumming [19] also advocates for effect sizes, where an effect size is “the amount of anything of
research interest” (p. 162). In line with [23], we single out the standardized difference of means, often
referred to as Cohen’s d, defined by

d = (ȳB − ȳA) / sP

as a measure of effect size, because it basically expresses a signal-to-noise ratio. Cohen [23] arbitrarily
proposed a categorization of values: 0.2 means a small standardized effect size, 0.5 a medium one,
and 0.8 is large. Figure 2 illustrates that even a more-than-large value of d = 1.0 has a substantial
overlap (around 45%) of probability mass. For d = 0.2, the overlap is around 85%.

Figure 2. Probability density function for YA ∼ N(5.0, 1) (solid line) and YB ∼ N(5.2, 1) and
YB ∼ N(6.0, 1) (two dashed lines), corresponding to standardized effect sizes δ = 0.2 (small) and
1.0 (large).
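
As an aside, the overlap percentages quoted above can be reproduced under the normality assumption. The text does not state which overlap measure is used, but the figures of roughly 45% and 85% match the complement of Cohen's U1 ("percent non-overlap"); the sketch below is our own reconstruction on that assumption, with function names of our own choosing.

```python
# Cohen's d from two samples, and the U1-style overlap of two unit-variance
# normal densities separated by d (the choice of U1 as the overlap measure
# is an assumption; it reproduces the ~45% and ~85% figures quoted above).
import numpy as np
from scipy import stats

def cohens_d(y_a, y_b):
    s_p = np.sqrt((np.var(y_a, ddof=1) + np.var(y_b, ddof=1)) / 2)
    return (np.mean(y_b) - np.mean(y_a)) / s_p

def u1_overlap(d):
    phi = stats.norm.cdf(abs(d) / 2)
    return 1 - (2 * phi - 1) / phi  # complement of Cohen's U1

print(u1_overlap(1.0))  # ~0.446
print(u1_overlap(0.2))  # ~0.853
```
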
Finally, we mention the developments in non-inferiority trials, where a “margin”—denoted
as ∆—is defined such that a proposed new drug can be tested to be not unacceptably worse
than an existing one, while it may also be used for testing superiority [24,25]. A null hypothesis
significance test then takes this margin into account. While developed in an environment of clinical
research, where the existing drug is well studied, this approach has definite benefits. In a context of
two alternatives (i.e., the sustainability of large-scale fisheries and small-scale fisheries), the situation
is different, and there is no a priori magnitude for such a margin of non-inferiority or superiority.
As such, we are looking for a margin of non-inferiority or superiority that is magnitude-independent.
Such a measure is provided by the standardized effect size, here implemented as the standardized
difference of means, discussed above. Although we agree with many points of critique on the
standardized effect size in comparison to the “simple” effect size [26], we think it serves one
important role in setting a generic standard for “oomph”, as introduced by Cohen [23]. Combining the
idea of superiority with a margin [24,25] formulated in terms of the standardized effect size [23] is the
core of our idea; see the next section.

5. A Proposal to Base Significance on Non-Trivial Effect Sizes


Under the null hypothesis, the standardized effect size is known to follow a non-central
t-distribution [27]. However, it is seldom used, except for finding a confidence interval of the effect
size. In fact, it has become fashionable to oppose effect sizes and significance tests [19]. We feel that
embracing confidence intervals while abolishing significance tests is tantamount to throwing the baby
out with the bathwater. Our aim is to reconcile significance tests (precision) with effect sizes (oomph).
Below is a proposal to do so.
Our proposal is based on the following premise: an effect is “significant”

• when the effect size is large enough;
• and when it has been established with enough precision.

We now propose to operationalize this as follows:

• in advance, we set (as usual) a significance level α, say α = 0.05;
• in advance, we set an importance level δ0, say δ0 = 0.2 (a small effect size);
• we define a test statistic D = (ȲB − ȲA)/SP that estimates δ = (µB − µA)/σ;
• we test the null hypothesis H0 : δ ≤ δ0 at a significance level α.

In this way, we ensure that a rejected null hypothesis means both a “substantial” effect size and
sufficient precision. A p-value larger than α means that the observed signal-to-noise ratio d is too small
to disprove the null hypothesis, which occurs with a small effect no matter the size of the sample,
or with a small sample no matter the size of the effect. A sufficiently large effect size measured with
sufficient precision will reject the null hypothesis.
Under the least extreme version of the null hypothesis, (δ = δ0 ), the distribution of the test statistic
D is as follows:

T = (ȲB − ȲA − (µB − µA)) / (SP √(2/n)) = (D − δ0) / √(2/n) ∼ t(ν = 2(n − 1))

It is important to observe that the p-value obtained from this t-test (let us call it p2, for a two-sided
test of δ = δ0) is not the p-value of the question (p1, for a one-sided test of δ ≤ δ0), but must be further
processed according to the following scheme:

p1 = ½ p2 if d > δ0   and   p1 = 1 − ½ p2 if d < δ0

where d is the obtained value of the D-statistic; a small p1 thus requires the observed effect size d to
lie sufficiently far above δ0.
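
A compact implementation of the whole (α, δ0) procedure might look as follows. This is our own sketch, not the authors' code: it computes the one-sided p1 directly from the upper tail of the t-distribution, which is equivalent to the halving-and-complementing scheme above, and assumes equal group sizes and roughly equal variances.

```python
# Sketch of the proposed test of H0: delta <= delta0 at level alpha.
import numpy as np
from scipy import stats

def effect_size_test(y_a, y_b, delta0=0.2, alpha=0.05):
    n = len(y_a)                        # assumes len(y_a) == len(y_b) == n
    s_p = np.sqrt((np.var(y_a, ddof=1) + np.var(y_b, ddof=1)) / 2)
    d = (np.mean(y_b) - np.mean(y_a)) / s_p  # estimates delta
    t = (d - delta0) / np.sqrt(2 / n)        # ~ t(2(n-1)) under delta = delta0
    p1 = stats.t.sf(t, df=2 * (n - 1))       # one-sided p-value
    return d, t, p1, p1 <= alpha             # reject H0 when p1 <= alpha

# Regenerating the flavor of Table 1 (random draws, so values will vary):
rng = np.random.default_rng(seed=1)
y_a = rng.normal(5.0, 1.0, size=1000)
print(effect_size_test(y_a, rng.normal(5.2, 1.0, size=1000)))  # typically no rejection
print(effect_size_test(y_a, rng.normal(6.0, 1.0, size=1000)))  # typically rejection
```
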


As an illustration, we reconsider the earlier example, YA ∼ N (5, 1) and YB ∼ N (µB , 1) with
two choices for µB : 5.2 and 6.0. Samples are generated with n = 1000. The results are presented in
Table 1.

We conclude with a real case illustration for the fisheries [9]. Monte Carlo simulations of the
carbon footprint for small-scale and large-scale fisheries with n = 1000 yielded the results of Table 2.
Figure 3 shows the values in a histogram.

Table 1. Simulation results with sample size n = 1000 and population effect size δ = 0.2 (second
column) and δ = 1.0 (third column).

Parameter/Statistic                    Small Effect Size    Very Large Effect Size
µB                                     5.2                  6.0
δ                                      0.2                  1.0
d                                      0.243                0.969
t                                      0.927                17.310
p2                                     0.354                0.000
p1                                     0.177                0.000
reject H0 : δ ≤ 0.2 at α = 0.05?       no                   yes

Table 2. Monte Carlo simulation results of the carbon footprint of small fisheries and large fisheries,
using sample size n = 1000.

Statistic                            Value
d                                    0.489
t                                    9.153
p2                                   0.000
p1                                   0.000
reject H0 : δ ≤ 0.2 at α = 0.05?     yes

Figure 3. Probability density functions of the carbon footprint of a Vietnamese aquaculture system
of Pangasius catfish, obtained from two artificial samples: large-scale (solid line) and small-scale
(dashed line).

As an aside, the assumptions of the performed test are not fully justified in this illustration.
Variances are unequal, so the Welch form of the test would have been more appropriate. However,
our aim is to illustrate the idea, and in our experience, the Welch form in most cases yields
similar results.
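
For readers who want the Welch form mentioned here, SciPy exposes it directly; the sketch below uses made-up stand-ins for the Monte Carlo samples, so the numbers are purely illustrative.

```python
# Welch's t-test (unequal variances) via SciPy; illustrative data only.
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=2)
small_scale = rng.normal(5.5, 1.2, size=1000)  # hypothetical footprint samples
large_scale = rng.normal(5.0, 0.8, size=1000)
t_welch, p_welch = stats.ttest_ind(large_scale, small_scale, equal_var=False)
print(t_welch, p_welch)
```
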
6. Discussion
We believe that our proposal resolves a dilemma in the application of statistical techniques to large
datasets. Traditional significance tests focus on precision and ignore size, while the “new statistics” [19]
emphasize size and include precision only indirectly. The proposed solution of defining the tuple
(α, δ0) = (0.05, 0.2) at the outset and then testing H0 : δ ≤ δ0 combines precision and size. Like the
old statistics, it still provides an unambiguous statement in the form “there is a significant difference
between the two populations”, which now combines statistical evidence with empirical relevance.
One may object that our new procedure for the assessment of a significant difference of means
involves an arbitrary element—namely, choosing δ0. That is true, but the choice of α is subjective as
well, and yet it is part of the mainstream “objective” NHST procedure. Of course, depending on the
context, different choices of the tuple (α, δ0) may be made.
Another possible objection is the lack of novelty. In fact, we believe that it is precisely the
lack of revolutionary features that is a strong point of our proposal. Mainstream NHST is highly
institutionalized, through at least two generations of textbooks, through statistical software (Excel,
SPSS, SAS, etc.) and through guidelines for reporting in the social and behavioral sciences (primarily
APA). While many writers have published pleas to abolish NHST, progress has been limited so far
(APA now recommends the reporting of effect sizes). Our proposal falls within NHST, with a central
role for an a priori null hypothesis and α. The only change is that the usual and often implicit null
hypothesis of “no difference” (µA = µB) is replaced by a more interesting null hypothesis of “at
most a small difference” ((µB − µA)/σ ≤ 0.2, say). This is emphatically not a Neyman–Pearsonian direction,
because the null hypothesis is still composite (as is the alternative, (µB − µA)/σ > 0.2), and because the procedure allows
for p-values as well as significance statements. Our proposal to some extent resembles earlier ones
made in the context of Bayesian statistics [28]. Again, despite the methodological attractiveness of
the Bayesian framework, just the fact that the mainstream is not Bayesian is, from a strategic point of
view, a sufficient argument for proposing modifications to the classical framework. In the longer term,
however, Bayesian approaches may solve some of the issues in a more fundamental way, employing
the Bayesian information criterion [14,29], using the Bayes factor [30], or probability–possibility
transformations [31–33]. Schumi and Wittes [24] also briefly discuss the classical approach in a way
that is quite similar to ours, although formulated in terms of a one-sample hypothesis. It is primarily
from the comparative set-up that our proposal derives its appeal: a difference between two treatments
must be sufficiently significant and sufficiently large. Our proposal also shares elements with [34],
which connects it to power calculations. Again, although power is formally part of NHST, it is rarely practiced
by researchers in the applied sciences. Our testing scheme involving the tuple (α, δ0 ) has a strategic
value in staying close to existing practice, while attempting to remediate the most pressing problem.
Although the issue mentioned (namely: “what do we mean by a significant difference?”) is not
a problem that exclusively occurs in the world of computer simulations, meta-analysis, and big data,
we think that the developments since the start of the 21st century require a renewed confrontation
with the criticism on NHST. We even think that a solution must be provided: an easy solution, close to
the established practice. Our proposal is one step in a longer series of steps.
The described procedure was restricted to the case that µ A is smaller than µB . This can be easily
generalized to the opposite case. More importantly, it can also be generalized to the two-sided case,
in which the null hypothesis |µB − µA|/σ ≤ δ0 is tested. A rejection of this hypothesis implies that we
conclude that the absolute value of the standardized effect size |δ| is larger than δ0 .
Another generalization is that of comparing more than two populations. A typical approach
is the ANOVA form, in which the null hypothesis is µ A = µB = µC , etc. This is less trivial to
generalize for the (α, δ0) procedure. The alternative of making several pairwise comparisons, each with
a Bonferroni-corrected (α′, δ0) where α′ < α, seems a natural way to go.
A third generalization is the direction of heteroskedastic populations, where σ A 6= σB .
There is potential to further generalize the proposed procedure for statistics other than the
standardized effect size (such as correlation coefficients, regression coefficients, and odds ratios),
for cases with dependent distributions (using the paired t-test), and for cases in which the populations
are not normal (requiring the Mann–Whitney test or another non-parametric method).
The era of almost unlimited computer capacity has created studies with tremendous
pseudo-samples using Monte Carlo simulation, bootstrapping, and other methods. In addition,
the internet has created almost unlimited data repositories, which also result in huge samples. This has
undermined many of the fundamental assumptions of traditional inferential statistics, which were
developed for small samples. William Gosset (“Student”, [35]) developed his t-distribution to assess
small samples, even as small as n = 2 [10]. Bootstrapping has been at the center of this development,
with formulas even being suggested for choosing the number of bootstrap replicates needed to attain
significant differences [36,37].
However, even traditional statistical textbooks typically devote a few pages to choosing sample
size such that a significant result will be obtained (e.g., [1–3]). The fact that this significance refers to
a basically meaningless (“sizeless”) phenomenon is hardly mentioned. This is clearly a questionable
practice that easily leads to the justified rejection of meaningless null hypotheses, which is exactly the
problem raised by those who criticize NHST, such as Ziliak and McCloskey [16]. However, precision is
important, and that is what the alternative schemes [19] have been underemphasizing. Data analysis
in the era of large samples requires a new paradigm. Our proposed reconciliation of effect size and
precision (by setting the tuple (α, δ0 ) in advance) should be seen as one seminal step in this program.
Whereas we have not applied it to meta-analysis and big data, and have only demonstrated
its application to computer-generated samples of size 1000, we believe that the problem is serious
enough to deserve more attention in the era of increasing sample sizes.
As indicated, Bayesian concepts might further alleviate some of the problems mentioned, as might
a return to the Neyman–Pearson framework. However, our proposal is an attempt to improve the
situation with a minimum of changes, only replacing one conventional choice (α) by a tuple of
conventional choices (α, δ0 ). Piecemeal change may be a better solution than revolution in some cases.

Acknowledgments: This work is part of the Sustaining Ethical Aquaculture Trade (SEAT) project, which is
cofunded by the European Commission within the Seventh Framework Programme Sustainable Development
Global Change and Ecosystem (Project 222889). The reviewers made a number of excellent suggestions
for improvement.
Author Contributions: Reinout Heijungs conceived the proposed alternative to traditional NHST and wrote the
paper; Patrik Henriksson and Jeroen Guinée conducted the research on Pangasius catfish that inspired the theme
of the paper; Patrik Henriksson prepared the data used in the example. All authors have read and approved the
final manuscript.
Conflicts of Interest: The authors declare no conflict of interest.

References
1. Wonnacott, T.H.; Wonnacott, R.J. Introductory Statistics, 5th ed.; Wiley: New York, NY, USA, 1990.
2. Moore, D.S.; McCabe, G.P. Introduction to the Practice of Statistics, 5th ed.; Freeman: New York, NY, USA, 2006.
3. Doane, D.P.; Seward, L.E. Applied Statistics in Business & Economics, 5th ed.; McGraw-Hill: New York, NY,
USA, 2015.
4. Sheskin, D.J. Handbook of Parametric and Nonparametric Statistical Procedures, 5th ed.; CRC Press: Boca Raton,
FL, USA, 2011.
5. Efron, B.; Tibshirani, R. Statistical data analysis in the computer age. Science 1991, 253, 390–395. [CrossRef]
[PubMed]
6. Cooper, H.; Hedges, L.V.; Valentine, J.C. The Handbook of Research Synthesis and Meta-Analysis, 2nd ed.;
Russell Sage Foundation: New York, NY, USA, 1994.
7. Cochrane Library. Available online: http://www.cochranelibrary.com/ (accessed on 27 May 2016).
8. Varian, H. Big data: New tricks for econometrics. J. Econ. Perspect. 2014, 28, 3–28. [CrossRef]
9. Henriksson, P.J.G.; Rico, A.; Zhang, W.; Ahmad-Al-Nahid, S.; Newton, R.; Phan, L.T.; Zhang, Z.; Jaithiang, J.;
Dao, H.M.; Phu, T.M.; et al. A comparison of Asian aquaculture products using statistically supported LCA.
Environ. Sci. Technol. 2015, 49, 14176–14183. [CrossRef] [PubMed]
10. Lee, P.M. Bayesian Statistics: An Introduction, 2nd ed.; Arnold: London, UK, 1997.
11. Lynch, S.M. Introduction to Applied Bayesian Statistics and Estimation for Social Scientists; Springer: New York,
NY, USA, 2007.
12. Perezgonzalez, J.D. Fisher, Neyman–Pearson or NHST? A tutorial for teaching data testing. Front. Psychol.
2015, 6. [CrossRef] [PubMed]
13. Rice, J.A. Mathematical Statistics and Data Analysis, 3rd ed.; Cengage Learning: Boston, MA, USA, 2007.
14. Wagenmakers, E.J. A practical solution to the pervasive problem of p-values. Psychon. Bull. Rev. 2007,
14, 779–804. [CrossRef] [PubMed]

15. Lehmann, E.L. The Fisher, Neyman–Pearson theories of testing hypotheses: One theory or two? J. Am.
Stat. Assoc. 1993, 88, 1242–1249. [CrossRef]
16. Ziliak, S.T.; McCloskey, D.N. The Cult of Statistical Significance: How the Standard Error Costs Us Jobs, Justice,
and Lives; University of Michigan Press: Ann Arbor, MI, USA, 2007.
17. Cohen, J. The earth is round (p < 0.05). Am. Psychol. 1994, 49, 997–1003.
18. Fan, X.; Konold, T.R. Statistical significance versus effect size. In International Encyclopedia of Education,
3rd ed.; Elsevier: New York, NY, USA, 2010; pp. 444–450.
19. Cumming, G. Understanding the New Statistics: Effect Sizes, Confidence Intervals, and Meta-Analysis; Routledge:
London, UK, 2012.
20. Morris, P.E.; Fritz, C.O. Why are effect sizes still neglected? Psychologist 2013, 26, 580–583.
21. Perezgonzalez, J.D. The meaning of significance in data testing. Front. Psychol. 2015, 6. [CrossRef] [PubMed]
22. Goodman, S. A dirty dozen: Twelve p-value misconceptions. Semin. Hematol. 2008, 45, 135–140. [CrossRef]
[PubMed]
23. Cohen, J. Statistical Power Analysis for the Behavioral Sciences, 2nd ed.; Academic Press: New York, NY,
USA, 1988.
24. Schumi, J.; Wittes, J.T. Through the looking glass: Understanding non-inferiority. Trials 2011, 12. [CrossRef]
[PubMed]
25. Leon, A.C. Comparative effectiveness clinical trials in psychiatry: Superiority, non-inferiority and the role of
active comparators. J. Clin. Psychiatry 2011, 72, 331–340. [CrossRef] [PubMed]
26. Baguley, T. Standardized or simple effect size: What should be reported? Br. J. Psychol. 2009, 100, 603–671.
[CrossRef] [PubMed]
27. Cumming, G.; Finch, S. A primer on the understanding, use, and calculation of confidence intervals that are
based on central and noncentral distributions. Educ. Psychol. Meas. 2001, 61, 161–170. [CrossRef]
28. Berger, J.O.; Delampady, M. Testing precise hypotheses. Stat. Sci. 1987, 2, 317–352. [CrossRef]
29. Raftery, A.E. Bayesian model selection in social research. Sociol. Methodol. 1995, 25, 111–163. [CrossRef]
30. Mulder, J.; Hoijtink, H.; de Leeuw, C. BIEMS: A Fortran 90 program for calculating Bayes factors for inequality
and equality constrained models. J. Stat. Softw. 2012, 46. [CrossRef]
31. Lauretto, M.; Pereira, C.A.B.; Stern, J.M.; Zacks, S. Comparing parameters of two bivariate normal
distributions using the invariant FBST. Braz. J. Probab. Stat. 2003, 17, 147–168.
32. Lauretto, M.S.; Stern, J.M. FBST for mixture model selection. AIP Conf. Proc. 2005, 803, 121–128.
33. Stern, J.M.; Pereira, C.A.B. Bayesian epistemic values: Focus on surprise, measure probability! Log. J. IGPL
2014, 22, 236–254. [CrossRef]
34. Perezgonzalez, J.D. Statistical sensitiveness for science. 2016, arXiv:1604.01844.
35. Student. The probable error of a mean. Biometrika 1908, 6, 1–25.
36. Andrews, D.W.K.; Buchinsky, M. A three-step method for choosing the number of bootstrap repetitions.
Econometrica 2000, 68, 23–51. [CrossRef]
37. Pattengale, N.D.; Alipour, M.; Bininda-Emonds, O.R.P.; Moret, B.M.E.; Stamatakis, A. How many bootstrap
replicates are necessary? J. Comput. Biol. 2010, 17, 337–354. [CrossRef] [PubMed]

© 2016 by the authors; licensee MDPI, Basel, Switzerland. This article is an open access
article distributed under the terms and conditions of the Creative Commons Attribution
(CC-BY) license (http://creativecommons.org/licenses/by/4.0/).
