Article
Measures of Difference and Significance in the Era of
Computer Simulations, Meta-Analysis, and Big Data
Reinout Heijungs 1,2, *, Patrik J.G. Henriksson 3,4 and Jeroen B. Guinée 1
1 Institute of Environmental Sciences, Leiden University, 2300 RA Leiden, The Netherlands;
[email protected]
2 Department of Econometrics and Operations Research, Vrije Universiteit Amsterdam,
1081 HV Amsterdam, The Netherlands
3 Stockholm Resilience Centre, 10691 Stockholm, Sweden; [email protected]
4 WorldFish, Jalan Batu Maung, 11960 Penang, Malaysia
* Correspondence: [email protected] or [email protected]; Tel.: +31-20-598-2384
Abstract: In traditional research, repeated measurements lead to a sample of results, and inferential
statistics can be used not only to estimate parameters, but also to test statistical hypotheses concerning
these parameters. In many cases, the standard error of the estimates decreases (asymptotically) with
the square root of the sample size, which provides a stimulus to probe large samples. In simulation
models, the situation is entirely different. When probability distribution functions for model features
are specified, the probability distribution function of the model output can be approximated using
numerical techniques, such as bootstrapping or Monte Carlo sampling. Given the computational
power of most PCs today, the sample size can be increased almost without bounds. The result is that
standard errors of parameters are vanishingly small, and that almost all significance tests will lead to
a rejected null hypothesis. Clearly, another approach to statistical significance is needed. This paper
analyzes the situation and connects the discussion to other domains in which the null hypothesis
significance test (NHST) paradigm is challenged. In particular, the notions of effect size and Cohen’s
d provide promising alternatives for the establishment of a new indicator of statistical significance.
This indicator attempts to cover significance (precision) and effect size (relevance) in one measure.
Although in the end more fundamental changes are called for, our approach has the attractiveness
of requiring only a minimal change to the practice of statistics. The analysis is not only relevant for
artificial samples, but also for present-day huge samples, associated with the availability of big data.
Keywords: significance test; null hypothesis significance testing (NHST); effect size; Cohen’s d;
Monte Carlo simulation; bootstrapping; meta-analysis; big data
1. Introduction
The problem of determining if the difference between two groups is large enough to be labeled
“significant” is an old and well-studied problem. Virtually every university program treats it, often as
a second example of the t-test, the first example being the one-sample case [1–4]. Generalizations
then motivate the study of the analysis of variance (ANOVA) and more robust non-parametric tests,
such as those by Mann–Whitney and Kruskal–Wallis. All these established tests are based on the
comparison of the means (or medians) of two (or more) groups, and as such, the standard error of
these means (or medians) plays a crucial role. Such standard errors typically decrease with the square
root of the sample size. As a result, the question of whether or not a difference between two (or more)
means (or medians) is significant not only depends on the intrinsic properties of the phenomenon
(mean of the difference and variance of the distributions), but also on the sample size, which is not an
intrinsic property of the phenomenon. In a traditional experimental set-up or field study, this may be
appropriate, because significance means that the limited evidence obtained by small samples suffices
to mark the populations as being different. In such cases, the standard error is a perfect companion.
However, in the context of unlimited or virtually unlimited data—for instance, for computer-generated
samples—this concept of significance breaks down. In such cases, the standard error will not do a good
job, at least not in the way it is used in the standard textbooks.
The prevalence of computer-generated datasets and large datasets has become increasingly
common in the 21st century. Specifically, the following developments should be mentioned:
• Simulation models [5], where artificial samples are generated according to the principles of Monte
Carlo, Latin hypercube, bootstrapping, or any other sampling or resampling method. Depending
on the size of the model and the available computing power, such techniques easily yield a sample
size of 1000 or more (see the sketch after this list).
• Meta-analysis [6], where the results of dozens or hundreds of studies are combined into
one meta-study with an effectively large sample size. Online repositories in particular (such as
those of the Cochrane Library [7]) enable the performance of such meta-analyses.
• Big data [8], where automatically collected data on millions of customers, patients, vehicles,
or other objects of interest are gathered for statistical processing.
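To make the first of these developments concrete, the following minimal sketch (in Python; the two-input model and its input distributions are hypothetical stand-ins, not taken from this paper) propagates input uncertainty by Monte Carlo sampling. The sample size n is limited only by the computing budget, while the standard error of the estimated mean shrinks with the square root of n.

```python
import numpy as np

rng = np.random.default_rng(seed=1)

def model(x1, x2):
    # Stand-in for an arbitrary simulation model (e.g., an LCA calculation).
    return x1 * x2

for n in (100, 10_000, 1_000_000):
    x1 = rng.normal(5.0, 1.0, size=n)      # assumed input distribution
    x2 = rng.lognormal(0.0, 0.2, size=n)   # assumed input distribution
    y = model(x1, x2)
    se = y.std(ddof=1) / np.sqrt(n)        # standard error ~ 1/sqrt(n)
    print(f"n={n:>9,d}  mean={y.mean():.4f}  se={se:.5f}")
```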
In this article, we focus on the case of comparing a numerical variable for two groups, indicated
by subscripts A and B. The reader may think of this in terms of a control group and a treatment
group (as is often the case in medical research), or of two different situations (as is often the case in
empirical research; for instance, male customers versus female customers). The variable might be
anything like IQ, voltage, or price. Further, to keep the discussion focused, we will assume that the
true mean of group A is lower than that of group B.
Section 2 revisits the basic situation of the null hypothesis significance test on the equality of means
for two groups, also from a historical perspective, contrasting the approaches of Fisher, of Neyman–Pearson,
and of their synthesis; Section 3 critically analyzes the influence of sample size in the hypothesis test;
Section 4 analyzes alternatives to the usual expression and proposes a new test criterion; Section 5
provides a discussion and conclusion.
As for notation, we will use Greek symbols (µ, σ) for population parameters, capital Latin symbols
(Y, Ȳ, S) for random variables sampled from such populations, and lower-case Latin symbols (y, ȳ, s)
for the values obtained in a particular sample. Y ∼ N(µ, σ²) indicates that the random variable Y is
normally distributed with mean µ and variance σ²; their sample values are indicated by ȳ and s².
t(ν) is the t-distribution with ν degrees of freedom.
Now, we collect from both populations a sample of equal size n A = n B = n. The purpose is to
compare the centrality parameter, in particular the means µ A and µB .
Now, there are a number of options for carrying out the statistical analysis. One choice is between
“classical statistics” (as discussed in most mainstream textbooks and handbooks, including [1–4]) and
Bayesian statistics (e.g., [10,11]). In this article, we will build entirely on the classical paradigm, mainly
because it is mainstream, and moreover because the Bayesians emphasize the changing of beliefs as
a result of new evidence, which is not the core issue in big data and computer-generated samples
(although it is a core issue in meta-analysis). Within this classical paradigm, we have a choice of taking
the Fisherian approach, the Neyman–Pearson approach, or their hybrid or synthesized forms, the null
hypothesis significance test [12].
Fisher’s approach calculates the probability of obtaining the observed value (or an even more
extreme value) of a test statistic when an a priori specified null hypothesis would be true. In the
present case, the null hypothesis would be
H0: µA = µB

and the test statistic would be derived from the observed difference in means (ȲB − ȲA).
The standardized form of this is

$$ T = \frac{\bar{Y}_B - \bar{Y}_A}{S_P \sqrt{\frac{1}{n_A} + \frac{1}{n_B}}} = \frac{\bar{Y}_B - \bar{Y}_A}{S_P \sqrt{2/n}} $$

where

$$ S_P = \sqrt{\frac{(n_A - 1) S_A^2 + (n_B - 1) S_B^2}{n_A + n_B - 2}} = \sqrt{\frac{S_A^2 + S_B^2}{2}} $$
is the pooled estimate of the standard deviation of the two populations. Under H0 , the random variable
T is distributed according to a t-distribution, with n A + n B − 2 = 2 (n − 1) degrees of freedom:
$$ T \sim t\left(\nu = 2(n-1)\right) $$
Denoting the obtained value of the random variable $T = \frac{\bar{Y}_B - \bar{Y}_A}{S_P\sqrt{2/n}}$ by $t = \frac{\bar{y}_B - \bar{y}_A}{s_P\sqrt{2/n}}$, the p-value is then
calculated as the probability that T attains the obtained value t or a value even farther away from the
expected value 0:

$$ p\text{-value} = P_{t(\nu = 2(n-1))}\left(|T| > |t|\right) $$
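As a numerical illustration of these definitions, the following sketch (with simulated data under assumed parameters µA = 5.0, µB = 5.2, σ = 1) computes the pooled standard deviation, the t-value, and Fisher's p-value, and cross-checks the result against scipy.stats.ttest_ind, which implements the same equal-variance two-sample test.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=2)
n = 50
y_a = rng.normal(5.0, 1.0, size=n)  # assumed sample from population A
y_b = rng.normal(5.2, 1.0, size=n)  # assumed sample from population B

# Pooled SD for equal group sizes: sqrt((S_A^2 + S_B^2) / 2)
s_p = np.sqrt((y_a.var(ddof=1) + y_b.var(ddof=1)) / 2)
t = (y_b.mean() - y_a.mean()) / (s_p * np.sqrt(2 / n))
p = 2 * stats.t.sf(abs(t), df=2 * (n - 1))  # p-value = P(|T| > |t|)

print(t, p)
print(stats.ttest_ind(y_b, y_a))  # same t and p (equal-variance test)
```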
In this approach, no black-and-white decision as to significance is made, but the p-value suffices to
communicate a level of evidence. In addition, no alternative hypothesis is stated, and we study only
the plausibility of the data under a stated null hypothesis.
In contrast, the approach by Neyman–Pearson starts by formulating two competing simple
hypotheses (often called the null hypothesis and the alternative hypothesis), and calculates the
ratio of the likelihoods of the data for these hypotheses. The result then yields a probability of
the data corresponding to one hypothesis or the other (see [13] for a clear example on coin throwing).
This approach also sets an a priori threshold value for rejecting the null hypothesis against the
alternative one, symbolized as α, conventionally set to 0.05 or 0.01. The notion of significance then
arises in comparing the p-value to α. In addition, the method calculates a second parameter, β, for the
probability of incorrectly maintaining the null hypothesis when the alternative is true. Its complement,
1 − β, then represents the (a posteriori) power of the test.
While Fisher, Neyman, and Pearson were engaged in an acrimonious debate on the weak
and strong points of the two methods, textbooks from the 1950s on were effectively creating
a synthesis (an “anonymous amalgamation of the two incompatible procedures” [14]), using elements
from Fisher and from Neyman–Pearson, which “follow Neyman–Pearson procedurally but Fisher
philosophically” [12]. The result is known as the null hypothesis significance test (NHST), and it
is characterized by the use of p-values in combination with an a priori α, a composite alternative
hypothesis, and occasional power calculations. In the example elaborated according to NHST, the null
hypothesis is:
H0 : µ A = µB
because we want to find out if the mean carbon footprint differs between the two groups. The math
then follows Fisher’s approach in calculating a p-value. This p-value is compared to the type I error
rate α that has been set in advance (e.g., to 0.05). A p-value smaller than α leads to rejection of the null
hypothesis and acceptance of the alternative hypothesis, while a larger p-value does not reject the null
hypothesis, but instead maintains it (which does not mean acceptance). So, the null hypothesis H0: µA = µB is rejected at the
pre-determined significance level α when the calculated value of the test statistic (t) is smaller than the
lower critical value (tcrit,lower,α ) or larger than the upper critical value (tcrit,upper,α ). Critical values are
thus defined by the conditions
$$ P_{t(\nu=2(n-1))}\left(T \le t_{\text{crit,lower},\alpha}\right) = \frac{\alpha}{2} \quad \text{and} \quad P_{t(\nu=2(n-1))}\left(T \ge t_{\text{crit,upper},\alpha}\right) = \frac{\alpha}{2} $$

Because the t-distribution is symmetric around the value 0, this can also be formulated as
a rejection when the absolute value of t exceeds a single critical value $t_{\text{crit},\alpha}$. In that case, we use

$$ P_{t(\nu=2(n-1))}\left(|T| \ge t_{\text{crit},\alpha}\right) = \alpha $$
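The decision rule can be checked numerically; in the following sketch, α, n, and the observed t-value are assumed values for illustration.

```python
from scipy import stats

alpha, n = 0.05, 50
nu = 2 * (n - 1)

t_crit_lower = stats.t.ppf(alpha / 2, df=nu)      # P(T <= c) = alpha/2
t_crit_upper = stats.t.ppf(1 - alpha / 2, df=nu)  # P(T >= c) = alpha/2

t_obs = 2.3  # an assumed observed value, e.g. from the previous sketch
reject = t_obs < t_crit_lower or t_obs > t_crit_upper
reject_abs = abs(t_obs) >= t_crit_upper  # equivalent rule, by symmetry
print(t_crit_lower, t_crit_upper, reject, reject_abs)
```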
The elaboration above on one hand summarizes the NHST-procedure for the two sample case,
which is helpful in defining concepts and notation for the later sections of this paper. On the other
hand, it briefly recaps the history in terms of the contributions by Fisher and by the tandem Neyman
and Pearson, which will turn out to be useful in the later discussion. We do not pretend to give a full
history of statistical testing; please refer to [4,12,15,16].
Figure 1. The absolute value of the t-statistic for different values of the sample size n, when ȲA = 5.0,
ȲB = 6.0, SP = 1 (upper solid line) and when ȲA = 5.0, ȲB = 5.2, SP = 1 (lower solid line), and the
upper critical value of the t-distribution (dashed line) at α = 0.05. The null hypothesis of equality of
population means is rejected at α = 0.05 for n ≥ 9 when ∆Ȳ = 1.0, but when ∆Ȳ = 0.2, we need to push
further and use n ≥ 194 to do the job.
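The two sample sizes quoted in the caption of Figure 1 can be reproduced with a short search; this sketch uses the caption's settings (SP = 1, α = 0.05) and scans for the smallest n at which |t| exceeds the critical value.

```python
from scipy import stats

def smallest_significant_n(delta_y, s_p=1.0, alpha=0.05, n_max=10_000):
    # Smallest per-group n at which |t| exceeds the critical value.
    for n in range(2, n_max):
        t = delta_y / (s_p * (2 / n) ** 0.5)
        t_crit = stats.t.ppf(1 - alpha / 2, df=2 * (n - 1))
        if abs(t) > t_crit:
            return n
    return None

print(smallest_significant_n(1.0))  # 9, as in the caption
print(smallest_significant_n(0.2))  # 194, as in the caption
```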
The first of these is natural: the actual difference between ȲA and ȲB is of course important in
deciding if the difference is “large”, “substantial”, or—why not—“significant”.
The second one plays a more intricate role. The ratio (ȲB − ȲA)/SP provides a dimensionless indicator
of the “relative” difference between the two means µA and µB. It is sometimes described as
a signal-to-noise ratio [16].
The third element plays a curious role. Sample size is important for establishing the confidence
level of a result. However, sample size is not part of the nature of the phenomenon under investigation.
The two means (µA and µB) and the standard deviation (σ) are aspects of the research object.
Sample size (n) is an aspect of the instrument that we use to probe the object. Of course, the quality
of the instrument has an influence on the outcome. If we wish to know how many stars there are,
and use a cheap telescope, the number will be lower than when we use a multi-billion dollar telescope.
However, no serious astronomer will proclaim that the number of counted stars is equal to the number
of stars in the universe. Instead, a formula to estimate the number of stars from the number of
stars counted and the quality of the telescope will be developed. The application of this formula to the
two measurement set-ups will give different results, and probably the estimate with the expensive
telescope will be more accurate. In traditional NHST, this is different. What you see depends on the
measurement set-up, and this is not corrected for in the outcome.
A consequence is that money can buy significance. Of course, the mean blood pressure of two groups
of patients will never be equal when you consider the last digit. However, it may be that the
difference is only in the fourth decimal, e.g., ȳA = 120.0269 and ȳB = 120.0268. With an exceptionally
large study, this negligible difference can be declared to be “significant”. The distinction between
a significant difference and a large difference is mentioned in most textbooks on statistics,
but often in a slightly cursory way, and it is well known that many less-informed students and scientists mistake
a significant difference for a large or important difference [22].
Combining the estimated difference ȲB − ȲA and the standard error of this difference SP√(2/n)
into one formula has one big advantage: it yields one single number, which can moreover
objectively be tested against a conventional benchmark, such as α = 0.05. Therefore, we only need to
communicate this single number, either as a t-value, as a p-value, or as a significance statement, such as
“p < 0.01”, “**”, or “highly significant difference”. The fact that two things are combined in one is the
root of the problem, however: information has been lost due to the compression of two complementary
aspects into one.
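The blood-pressure example above can be made quantitative; in the following sketch, the means are taken from the text, while the population standard deviation σ = 1.0 is an assumption added for illustration.

```python
from scipy import stats

mu_a, mu_b = 120.0269, 120.0268  # means from the text
sigma = 1.0                      # assumed population SD
diff = mu_a - mu_b               # 0.0001

# Per-group n at which the expected |t| reaches the 5% critical value:
z = stats.norm.ppf(0.975)
n_needed = 2 * (z * sigma / diff) ** 2
print(f"{n_needed:.3g}")  # ~7.7e8 per group: enormous, but purchasable

# Expected t-value and p-value at that sample size:
n = int(n_needed) + 1
t = diff / (sigma * (2 / n) ** 0.5)
p = 2 * stats.t.sf(abs(t), df=2 * (n - 1))
print(t, p)  # t ~ 1.96, p ~ 0.05: "significance" by sample size alone
```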
research, where the existing drug is well studied, this approach has definite benefits. In a context of
two alternatives (i.e., the sustainability of large-scale fisheries and small-scale fisheries), the situation
is different, and there is no a priori magnitude for such a margin of non-inferiority or superiority.
As such, we are looking for a margin of non-inferiority or superiority that is magnitude-independent.
Such a measure is provided by the standardized effect size, here implemented as the standardized
difference of means, discussed above. Although we agree with many points of critique on the
standardized effect size in comparison to the “simple” effect size [26], we think it serves one
important role in setting a generic standard for “oomph”, as introduced by Cohen [23]. Combining the
idea of superiority with a margin [24,25] formulated in terms of the standardized effect size [23] is the
core of our idea; see the next section.
In this way, we ensure that a rejected null hypothesis means both a “substantial” effect size and
sufficient precision. A p-value larger than α means that the observed signal-to-noise ratio d is too small
to disprove the null hypothesis, which occurs with a small effect no matter the size of the sample,
or with a small sample no matter the size of the effect. A sufficiently large effect size measured with
sufficient precision will reject the null hypothesis.
Under the least extreme version of the null hypothesis (δ = δ0), the distribution of the test statistic
D is as follows:

$$ T = \frac{\bar{Y}_B - \bar{Y}_A - (\mu_B - \mu_A)}{S_P\sqrt{2/n}} = \frac{D - \delta_0}{\sqrt{2/n}} \sim t\left(\nu = 2(n-1)\right) $$
It is important to observe that the p-value obtained from this t-test (let us call it p2, for a two-sided
test of δ = δ0) is not the p-value of the question (p1, for a one-sided test of δ ≤ δ0), but must be further
processed according to the following scheme:

$$ p_1 = \begin{cases} \frac{1}{2}p_2 & \text{if } d > \delta_0 \\ 1 - \frac{1}{2}p_2 & \text{if } d < \delta_0 \end{cases} $$
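The complete procedure can be sketched as a single function; the data below are simulated under the settings of Table 1, and the function name effect_size_test is ours, introduced for illustration only.

```python
import numpy as np
from scipy import stats

def effect_size_test(y_a, y_b, delta0=0.2, alpha=0.05):
    # Test H0: delta <= delta0 via the standardized effect size.
    n = len(y_a)  # equal group sizes assumed
    s_p = np.sqrt((y_a.var(ddof=1) + y_b.var(ddof=1)) / 2)
    d = (y_b.mean() - y_a.mean()) / s_p            # Cohen's d
    t = (d - delta0) / np.sqrt(2 / n)              # t against delta0
    p2 = 2 * stats.t.sf(abs(t), df=2 * (n - 1))    # two-sided p-value
    p1 = 0.5 * p2 if d > delta0 else 1 - 0.5 * p2  # one-sided conversion
    return d, t, p1, p1 < alpha

rng = np.random.default_rng(seed=4)
n = 1000
for mu_b in (5.2, 6.0):  # delta = 0.2 and delta = 1.0, as in Table 1
    y_a = rng.normal(5.0, 1.0, size=n)
    y_b = rng.normal(mu_b, 1.0, size=n)
    print(effect_size_test(y_a, y_b))  # expect rejection only for delta = 1.0
```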
We conclude with a real case illustration for the fisheries [9]. Monte Carlo simulations of the
carbon footprint for small-scale and large-scale fisheries with n = 1000 yielded the results of Table 2.
Figure 3 shows the values in a histogram.
Table 1. Simulation results with sample size n = 1000 and population effect size δ = 0.2 (second
column) and δ = 1.0 (third column).

Parameter/Statistic                Small Effect Size    Very Large Effect Size
µB                                 5.2                  6.0
δ                                  0.2                  1.0
d                                  0.243                0.969
t                                  0.927                17.310
p2                                 0.354                0.000
p1                                 0.177                0.000
reject H0: δ ≤ 0.2 at α = 0.05?    no                   yes
Table 2. Monte Carlo simulation results of the carbon footprint of small fisheries and large fisheries,
using sample size n = 1000.

Statistic                          Value
d                                  0.489
t                                  9.153
p2                                 0.000
p1                                 0.000
reject H0: δ ≤ 0.2 at α = 0.05?    yes
Figure 3. Probability density functions of the carbon footprint of a Vietnamese aquaculture system
of Pangasius catfish, obtained from two artificial samples: large-scale (solid line) and small-scale
(dashed line).
One may object that our new procedure for the assessment of a significant difference of means
involves an arbitrary element—namely, choosing δ0 . That is true, but the choice of α is subjective as
well, and yet it is part of the mainstream “objective” NHST procedure. Of course, depending on the
context, different choices of the tuple (α, δ0 ) may be made.
Another possible objection is the lack of novelty. In fact, we believe that it is precisely the
lack of revolutionary features that is a strong point of our proposal. Mainstream NHST is highly
institutionalized, through at least two generations of textbooks, through statistical software (Excel,
SPSS, SAS, etc.) and through guidelines for reporting in the social and behavioral sciences (primarily
APA). While many writers have published pleas to abolish NHST, progress has been limited so far
(APA now recommends the reporting of effect sizes). Our proposal falls within NHST, with a central
role for an a priori null hypothesis and α. The only change is that the usual and often implicit null
hypothesis of “no difference” (µA = µB) is replaced by a more interesting null hypothesis of “at
most a small difference” (e.g., (µB − µA)/σ ≤ 0.2). This is emphatically not a Neyman–Pearsonian direction,
because the hypotheses are still composite (e.g., the alternative (µB − µA)/σ > 0.2), and because the procedure allows
for p-values as well as significance statements. Our proposal to some extent resembles earlier ones
made in the context of Bayesian statistics [28]. Again, despite the methodological attractiveness of
the Bayesian framework, just the fact that the mainstream is not Bayesian is, from a strategic point of
view, a sufficient argument for proposing modifications to the classical framework. In the longer term,
however, Bayesian approaches may solve some of the issues in a more fundamental way, employing
the Bayesian information criterion [14,29], using the Bayes factor [30], or probability–possibility
transformations [31–33]. Schumi and Wittes [24] also briefly discuss the classical approach in a way
that is quite similar to ours, although formulated in terms of a one-sample hypothesis. It is primarily
from the comparative set-up that our proposal derives its appeal: a difference between two treatments
must be sufficiently significant and sufficiently large. Our proposal also shares elements with [34],
which connects it to power calculations. Again, although power is formally part of NHST, power calculations
are rarely carried out by researchers in the applied sciences. Our testing scheme involving the tuple (α, δ0) has a strategic
value in staying close to existing practice, while attempting to remediate the most pressing problem.
Although the issue mentioned (namely: “what do we mean by a significant difference?”) is not
a problem that exclusively occurs in the world of computer simulations, meta-analysis, and big data,
we think that the developments since the start of the 21st century require a renewed confrontation
with the criticism of NHST. We even think that a solution must be provided: an easy solution, close to
the established practice. Our proposal is one step in a longer series of steps.
The described procedure was restricted to the case where µA is smaller than µB. This can easily be
generalized to the opposite case. More importantly, it can also be generalized to the two-sided case,
in which the null hypothesis |µB − µA|/σ ≤ δ0 is tested. A rejection of this hypothesis implies that we
conclude that the absolute value of the standardized effect size, |δ|, is larger than δ0.
Another generalization is that of comparing more than two populations. A typical approach
is the ANOVA form, in which the null hypothesis is µA = µB = µC, etc. This is less trivial to
generalize for the (α, δ0) procedure. The alternative of making several pairwise comparisons, each with
a Bonferroni-corrected (α′, δ0) where α′ < α, seems a natural way to go; a sketch follows below.
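As announced, a possible sketch of this pairwise variant (reusing the illustrative effect_size_test from the earlier sketch; the dictionary-of-groups interface is an assumption):

```python
from itertools import combinations

def pairwise_effect_size_tests(groups, delta0=0.2, alpha=0.05):
    # groups: dict mapping a group name to its sample (a numpy array).
    pairs = list(combinations(sorted(groups), 2))
    alpha_prime = alpha / len(pairs)  # Bonferroni correction: alpha' < alpha
    results = {}
    for name_a, name_b in pairs:
        d, t, p1, _ = effect_size_test(groups[name_a], groups[name_b],
                                       delta0=delta0, alpha=alpha_prime)
        results[(name_a, name_b)] = (d, p1, p1 < alpha_prime)
    return results
```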
A third generalization is in the direction of heteroskedastic populations, where σA ≠ σB.
There is potential to further generalize the proposed procedure for statistics other than the
standardized effect size (such as correlation coefficients, regression coefficients, and odds ratios),
for cases with dependent distributions (using the paired t-test), and for cases in which the populations
are not normal (requiring the Mann–Whitney test or another non-parametric method).
The era of almost unlimited computer capacity has created studies with tremendous
pseudo-samples using Monte Carlo simulation, bootstrapping, and other methods. In addition,
the internet has created almost unlimited data repositories, which also result in huge samples. This has
undermined many of the fundamental assumptions of traditional inferential statistics, which were
developed for small samples. William Gosset (“Student”, [35]) developed his t-distribution to assess
small samples, even as small as n = 2 [10]. Bootstrapping has been at the center of this development,
with formulas even being suggested for setting the sample size so as to obtain significant differences [36,37].
However, even traditional statistical textbooks typically devote a few pages to choosing sample
size such that a significant result will be obtained (e.g., [1–3]). The fact that this significance refers to
a basically meaningless (“sizeless”) phenomenon is hardly mentioned. This is clearly a questionable
practice that easily leads to the justified rejection of meaningless null hypotheses, which is exactly the
problem raised by those who criticize NHST, such as Ziliak and McCloskey [16]. However, precision is
important, and that is what the alternative schemes [19] have been underemphasizing. Data analysis
in the era of large samples requires a new paradigm. Our proposed reconciliation of effect size and
precision (by setting the tuple (α, δ0 ) in advance) should be seen as one seminal step in this program.
Although we have not applied the approach to meta-analysis and big data, and have only demonstrated
its application to computer-generated samples of size 1000, we believe that the problem is serious
enough to deserve more attention in the era of increasing sample sizes.
As indicated, Bayesian concepts might further alleviate some of the problems mentioned, as might
a return to the Neyman–Pearson framework. However, our proposal is an attempt to improve the
situation with a minimum of changes, only replacing one conventional choice (α) by a tuple of
conventional choices (α, δ0 ). Piecemeal change may be a better solution than revolution in some cases.
Acknowledgments: This work is part of the Sustaining Ethical Aquaculture Trade (SEAT) project, which is
cofunded by the European Commission within the Seventh Framework Programme Sustainable Development
Global Change and Ecosystem (Project 222889). The reviewers made a number of excellent suggestions
for improvement.
Author Contributions: Reinout Heijungs conceived the proposed alternative to traditional NHST and wrote the
paper; Patrik Henriksson and Jeroen Guinée conducted the research on Pangasius catfish that inspired the theme
of the paper; Patrik Henriksson prepared the data used in the example. All authors have read and approved the
final manuscript.
Conflicts of Interest: The authors declare no conflict of interest.
References
1. Wonnacott, T.H.; Wonnacott, R.J. Introductory Statistics, 5th ed.; Wiley: New York, NY, USA, 1990.
2. Moore, D.S.; McCabe, G.P. Introduction to the Practice of Statistics, 5th ed.; Freeman: New York, NY, USA, 2006.
3. Doane, D.P.; Seward, L.E. Applied Statistics in Business & Economics, 5th ed.; McGraw-Hill: New York, NY,
USA, 2015.
4. Sheskin, D.J. Handbook of Parametric and Nonparametric Statistical Procedures, 5th ed.; CRC Press: Boca Raton,
FL, USA, 2011.
5. Efron, B.; Tibshirani, R. Statistical data analysis in the computer age. Science 1991, 253, 390–395. [CrossRef]
[PubMed]
6. Cooper, H.; Hedges, L.V.; Valentine, J.C. The Handbook of Research Synthesis and Meta-Analysis, 2nd ed.;
Russell Sage Foundation: New York, NY, USA, 1994.
7. Cochrane Library. Available online: http://www.cochranelibrary.com/ (accessed on 27 May 2016).
8. Varian, H. Big data: New tricks for econometrics. J. Econ. Perspect. 2014, 28, 3–28. [CrossRef]
9. Henriksson, P.J.G.; Rico, A.; Zhang, W.; Ahmad-Al-Nahid, S.; Newton, R.; Phan, L.T.; Zhang, Z.; Jaithiang, J.;
Dao, H.M.; Phu, T.M.; et al. A comparison of Asian aquaculture products using statistically supported LCA.
Environ. Sci. Technol. 2015, 49, 14176–14183. [CrossRef] [PubMed]
10. Lee, P.M. Bayesian Statistics: An Introduction, 2nd ed.; Arnold: London, UK, 1997.
11. Lynch, S.M. Introduction to Applied Bayesian Statistics and Estimation for Social Scientists; Springer: New York,
NY, USA, 2007.
12. Perezgonzalez, J.D. Fisher, Neyman–Pearson or NHST? A tutorial for teaching data testing. Front. Psychol.
2015, 6. [CrossRef] [PubMed]
13. Rice, J.A. Mathematical Statistics and Data Analysis, 3rd ed.; Cengage Learning: Boston, MA, USA, 2007.
14. Wagenmakers, E.J. A practical solution to the pervasive problem of p-values. Psychon. Bull. Rev. 2007,
14, 779–804. [CrossRef] [PubMed]
Entropy 2016, 18, 361 11 of 11
15. Lehmann, E.L. The Fisher, Neyman–Pearson theories of testing hypotheses: One theory or two? J. Am.
Stat. Assoc. 1993, 88, 1242–1249. [CrossRef]
16. Ziliak, S.T.; McCloskey, D.N. The Cult of Statistical Significance: How the Standard Error Costs Us Jobs, Justice,
and Lives; University of Michigan Press: Ann Arbor, MI, USA, 2007.
17. Cohen, J. The earth is round (p < 0.05). Am. Psychol. 1994, 49, 997–1003.
18. Fan, X.; Konold, T.R. Statistical significance versus effect size. In International Encyclopedia of Education,
3rd ed.; Elsevier: New York, NY, USA, 2010; pp. 444–450.
19. Cumming, G. Understanding the New Statistics: Effect Sizes, Confidence Intervals, and Meta-Analysis; Routledge:
London, UK, 2012.
20. Morris, P.E.; Fritz, C.O. Why are effect sizes still neglected? Psychologist 2013, 26, 580–583.
21. Perezgonzalez, J.D. The meaning of significance in data testing. Front. Psychol. 2015, 6. [CrossRef] [PubMed]
22. Goodman, S. A dirty dozen: Twelve p-value misconceptions. Semin. Hematol. 2008, 45, 135–140. [CrossRef]
[PubMed]
23. Cohen, J. Statistical Power Analysis for the Behavioral Sciences, 2nd ed.; Academic Press: New York, NY,
USA, 1988.
24. Schumi, J.; Wittes, J.T. Through the looking glass: Understanding non-inferiority. Trials 2011, 12. [CrossRef]
[PubMed]
25. Leon, A.C. Comparative effectiveness clinical trials in psychiatry: Superiority, non-inferiority and the role of
active comparators. J. Clin. Psychiatry 2011, 72, 331–340. [CrossRef] [PubMed]
26. Baguley, T. Standardized or simple effect size: What should be reported? Br. J. Psychol. 2009, 100, 603–671.
[CrossRef] [PubMed]
27. Cumming, G.; Finch, S. A primer on the understanding, use, and calculation of confidence intervals that are
based on central and noncentral distributions. Educ. Psychol. Meas. 2001, 61, 161–170. [CrossRef]
28. Berger, J.O.; Delampady, M. Testing precise hypotheses. Stat. Sci. 1987, 2, 317–352. [CrossRef]
29. Raftery, A.E. Bayesian model selection in social research. Sociol. Methodol. 1995, 25, 111–163. [CrossRef]
30. Mulder, J.; Hoijtink, H.; de Leeuw, C. BIEMS: A Fortran 90 program for calculating Bayes factors for inequality
and equality constrained models. J. Stat. Softw. 2012, 46. [CrossRef]
31. Lauretto, M.; Pereira, C.A.B.; Stern, J.M.; Zacks, S. Comparing parameters of two bivariate normal
distributions using the invariant FBST. Braz. J. Probab. Stat. 2003, 17, 147–168.
32. Lauretto, M.S.; Stern, J.M. FBST for mixture model selection. AIP Conf. Proc. 2005, 803, 121–128.
33. Stern, J.M.; Pereira, C.A.B. Bayesian epistemic values: Focus on surprise, measure probability! Log. J. IGPL
2014, 22, 236–254. [CrossRef]
34. Perezgonzalez, J.D. Statistical sensitiveness for science. 2016, arXiv:1604.01844.
35. Student. The probable error of a mean. Biometrika 1908, 6, 1–25.
36. Andrews, D.W.K.; Buchinsky, M. A three-step method for choosing the number of bootstrap repetitions.
Econometrica 2000, 68, 23–51. [CrossRef]
37. Pattengale, N.D.; Alipour, M.; Bininda-Emonds, O.R.P.; Moret, B.M.E.; Stamatakis, A. How many bootstrap
replicates are necessary? J. Comput. Biol. 2010, 17, 337–354. [CrossRef] [PubMed]
© 2016 by the authors; licensee MDPI, Basel, Switzerland. This article is an open access
article distributed under the terms and conditions of the Creative Commons Attribution
(CC-BY) license (http://creativecommons.org/licenses/by/4.0/).