Statistical Methodology For Sensory Discrimination Tests and Its Implementation in Sensr
Statistical Methodology For Sensory Discrimination Tests and Its Implementation in Sensr
Statistical Methodology For Sensory Discrimination Tests and Its Implementation in Sensr
Contents
1 Introduction
3.0.1
3.1
3.2
Implementation in sensR . . . . . . . . . . . . . . . . . . . . . . . . . .
3.1.1
Standard errors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
3.1.2
3.1.3
Confidence intervals . . . . . . . . . . . . . . . . . . . . . . . . . . . .
3.1.4
10
3.1.5
11
3.1.6
11
3.1.7
Implementation in sensR . . . . . . . . . . . . . . . . . . . . . . . . . .
12
15
3.2.1
15
3.2.2
16
3.2.3
17
3.2.4
18
3.2.5
19
3.2.6
20
Introduction
The aim of this document is 1) to describe the statistical methodology for sensory discrimination testing and analysis, and 2) to describe how such analyses can be performed in R
using package sensR (Christensen and Brockhoff, 2010) co-developed by the author of this
document.
This document is divided into sections that cover topics with similar statistical methodology.
Implementation choices in the sensR package will be described in connection with the statistical methodology whenever appropriate. Small examples illustrating the use of functions
in the sensR package will appear throughout.
This is not a hands-on practical tutorial to analysis of sensory discrimination experiments
with the sensR package, neither is it a user friendly introduction to discrimination and
similarity testing in sensory discrimination protocols. The former document does not really
exist1 (yet), and for the latter document, we refer the reader to (Ns et al., 2010, chapter
7). We will assume throughout that the reader has basic statistical training and is familiar
with sensory discrimination testing to the level of (Ns et al., 2010, chapter 7).
The most common and simplest discrimination protocols comprise the 2-AFC, 3-AFC, triangle, duo-trio, A-not A and same-different protocols. The first four protocols are designed
such that the response follows a binomial distribution in the simplest experimental setting.
On the other hand responses from A-not A and same-different protocols are distributed
according to a compound or product binomial distribution in the simplest experimental
setting. An extension of the A-not A method known as the A-not A with sureness is a
classical SDT method which leads to multinomially distributed responses. Similarly the
same-different method extends to the degree-of-difference protocol also resulting in multinomially distributed responses. An experiment using one of the first four simple protocols
can be summarized with the proportion of correct responses or similarly the probability of
discrimination or d-prime. The Thurstonian models for the remaining protocols involve one
or more additional parameters each with their particular cognitive interpretation.
The 2-AFC and 3-AFC protocols are so-called directional protocols since they require that
the nature of the difference (e.g. sweetness) is provided as part of the assessor instructions.
On the other hand the triangle and duo-trio protocols are not directional since these protocols
are used to test un-specified differences. From a Thurstonian point of view, the sensory
dimension or the perceptual dimension is fixed in the 2-AFC and 3-AFC methods. The
cognitive decision strategy is consequently assumed different in these two classes of protocols.
When the perceptual dimension is fixed, the assessors may use the more effective skimming
strategy, while assessors are forced to use the inferior comparison of distances strategy when
using the un-directional protocols.
The A-not A and same-different protocols are methods with so-called response bias. Response bias refers to the concept that one type of response is preferred over another despite
the sensory distance remains unchanged. For instance some assessors may prefer the A
response over the not A response.
1 this
is on the to-do list of the author of this document, so there is hope it will appear in the future.
The four simple protocols are without response bias since no response can be consistently
preferred over another without affecting the discriminative effect. The decision criterion is
said to be fixed or stabilized.
The four common sensory discrimination protocols are often used in practical applications
in the food industry as well as in other areas. They are also of considerable interest in the
scientific literature about sensory discrimination.
The protocols have one important thing in common from a statistical perspective: their
statistical models can all be described as variants of the binomial distribution. That is,
the answer from any one of these protocols is either correct or incorrect and the sampling
distribution of answers is therefore a binomial distribution.
For the duo-trio and 2-AFC protocols the guessing probability, pg is 1/2. This means that if
there is no discriminative difference between the products, then the probability of a correct
answers, pc is one half. Similarly for the triangle and 3-AFC protocols the guessing probability is 1/3. The four common discrimination protocols are said to be free of response bias
in contrast to the A-not A and same-different protocols.
If we assume for a moment that the population of assessors (be that judges in an expert
panel or consumers) is comprised of ignorants who are always guessing and discriminators
who always discriminate correctly and provide the appropriate answer (though this will not
always be the correct answer). One way to express the sensory distance of the objects
(or discriminative ability of the assessors we will treat these viewpoints synonymously
throughout) is the proportion of discriminators, pd in the population of interest. It is almost
always an unreasonable assumption that some assessors are either always discriminating or
always guessing (Ennis, 1993), but we may still talk about the probability of discrimination.
This probability may refer to particular individuals or to a population; in this section we
will adopt a population perspective.
The relation between the probability of a correct answer and the probability of discrimination
is
pc = pg + pd (1 pg ),
(1)
where the guessing probability, pg is 1/2 for the duo-trio and 2-AFC protocols and 1/3 for
the triangle and 3-AFC protocols. The reverse relation is
pd = (pc pg )/(1 pg ).
(2)
Another way to summarize the sensory distance is through a measure known as d0 (pronounced d-prime) from signal detection theory (SDT, Green and Swets, 1966; Macmillan
and Creelman, 2005), or equivalently the Thurstonian delta, (Thurstone, 1927a,b,c). These
two concepts are identical and will be used synonymously throughout, and they are actually
based on the same underlying psychophysical model for the cognitive process. Whereas pc
is a measure and parameter completely free of reference to any particular discrimination
protocol, pd depends on the discrimination protocol through the guessing probability, but
d0 depends on the discrimination protocol through the so-called psychometric function, for
4
the discrimination protocol. The psychometric function maps from d0 to the probability of
a correct answer:
pc = fps (d0 ).
(3)
For the m-AFC method, where m denotes the number of forced choices, the psychometric
function is given by
Z
(z d0 )(z)m1 dz,
(4)
fm-AFC (d0 ) =
where is the standard normal probability density function and is the standard normal
cumulative distribution function. The psychometric functions for the four common discrimination protocols are given by
Z
f3-AFC (d0 ) =
(z d0 )(z)2 dz
(5)
(6)
(z d0 )(z) dz = (d0 / 2)
f2-AFC (d0 ) =
Z n h
i
h
io
p
p
(7)
ftri (d0 ) = 2
z 3 + d0 2/3 + z 3 d0 2/3 (z) dz
0
n x
p (1 pc )nx .
x c
(10)
There is a subtle but important distinction between the proportion of a correct answer and
the probability of a correct answer. The proportion of correct answers is x/n which can be
any number between 0 and 1. The probability of a correct answer, which we denote by pc ,
is a parameter and represents a true underlying value. As such pc cannot be lower than
the guessing probability for the discrimination protocol that was used and cannot exceed 1.
The usual estimator of a binomial probability is just the sample proportion, x/n, but this is
not the case here, and it is exactly this feature that makes discrimination testing interesting
statistically.
The maximum likelihood (ML) estimator2 of pc is given by:
x/n if x/n pg
pc =
pg
if x/n < pg
2 Following
(11)
1.0
1.0
2AFC
3AFC
duotrio
triangle
0.9
2AFC
3AFC
duotrio
triangle
0.8
P(discrimination)
P(correct answer)
0.8
0.7
0.6
0.6
0.4
0.5
0.2
0.4
0.3
0.0
0
dprime
1.0
dprime
2AFC
3AFC
duotrio
triangle
0.30
Grad(psychometic function)
0.8
P(correct answer)
0.6
0.4
0.2
0.25
0.20
0.15
0.10
0.05
0.0
0.00
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1.0
P(discrimination)
dprime
Figure 1: The connection between d0 , pc and pd for the four common sensory discrimination
protocols. The so-called psychometric functions; Pc as a function of d0 , are shown in the
upper left figure.
The allowed ranges (parameter space) for these three parameters are given by
d0 [0, [,
pd [0, 1],
pc [pg , 1].
(13)
Negative d0 values are sometimes mentioned in the literature, but negative d0 values are not
possible in the discrimination protocols that we consider here. They are possible in preference
tests and theoretically possible in Thurstonian models based on other assumptions, see
section XXX for more background information on this topic.
3.0.1
Implementation in sensR
In package sensR there is a function rescale that maps between the three scales; pc , pd and
d0 . A value on one of these scales is given as argument and values on all three scales are
given in the results. The results respect the allowed ranges of the parameters in eq. (13), so
if the supplied pc is less than pg , then pc = pg is returned with pd and d0 at the appropriate
levels:
> rescale(pc = 0.25, method = "triangle")
Estimates for the triangle protocol:
pc pd d.prime
1 0.3333333 0
0
Function rescale use a number of auxiliary functions for its computations; these are also
directly available to the package user:
pc2pd: maps from the pc -scale to the pd -scale.
pd2pc: maps from the pd -scale to the pc -scale.
psyfun: implements the psychmetric functions pc = fps (d0 ) for the four common
discrimination protocols, cf. eq. (3).
1
(pc ) for the four
psyinv: implements the inverse psychometric functions, d0 = fps
common discrimination protocols, cf. eq. (12).
0
psyderiv: implements the derivative of the psychometric functions, fps
(d0 ) for the
four common discrimination protocols.
3.1
3.1.1
Standard errors
p
pc (1 pc )/n.
(14)
The standard error of pd and d0 can be found by application of the Delta method (see for
example Pawitan, 2001):
f (x)
se(x)
(15)
se{f (x)} =
x
The standard error of pd is therefore
se(pd ) =
1
se(pc )
1 pg
(16)
since pd /pc = 1/(1 pg ), cf. eq. (2). The standard error of d0 can similarly be found as
se(d0 ) =
1
fps
(pc )
1
se(pc ) = 0 0 se(pc )
pc
fps (d )
(17)
0
(d0 ) is the derivative of the psychometric function with respect to d0 ; expressions
where fps
are given by Brockhoff and Christensen (2010).
Standard errors are only defined and only meaningful as measures of uncertainty when
the parameter estimate is at the interior of the parameter space, i.e. when the parameter
estimate is not at the boundary of its allowed range, cf. eq. (13).
Even when the parameter estimate is close, in some sense, to the boundary of its parameter
space, the standard error is not a meaningful measure of uncertainty, because the uncertainty
is in fact asymmetric. This means that symmetric confidence intervals based on the standard
error will also be misleading and other techniques should be applied.
3.1.2
The (log-)likelihood function can be used to obtain likelihood ratio or likelihood root statistics for hypothesis tests, and it can be used to construct confidence intervals with good
properties.
The log-likelihood function for a model based on the binomial distribution is given by
`(pc ; x, n) = C + x log pc + (n x) log(1 pc ),
n
x
(18)
where C = log
is a constant with respect to pc . The log-likelihood function for pd or d0
is given by combining eq. (18) with (2) or (12).
In general, standard errors can be found as the square root of the diagonal elements of the
variance-covariance matrix of the parameters. The variance-covariance matrix can be found
as the inverse of the negative Hessian matrix (the matrix of second order derivatives) of the
log-likelihood function evaluated at the ML estimates. Here there is only one parameter
(either one of pc , pd or d0 ), so the matrices are merely scalars.
It can be shown that the same standard errors as those derived in eq. (14), (16) and (17)
can be derived by differentiating (18) twice and using the chain rule to obtain the standard
errors of pd and d0 .
8
3.1.3
Confidence intervals
There are several general approaches to get CIs for parameters. One general way that applies
(with varying success) to almost all parameters with a standard error is the traditional Wald
interval:
CI :
z1/2 se(
),
(19)
where z1/2 = 1 (1 /2) is the upper /2 quantile of the standard normal distribution.
This CI is based on the Wald statistic3 :
w(0 ) = (
0 )/se(
).
(20)
The CI may also be expressed more generally for a statistic t(0 ) that follows a standard
normal distribution under the null hypothesis as:
CI : {; |t()| < z1/2 }.
(21)
U L : P (X x) = /2,
(23)
where X binom(pc , n). Rather than solving these equations numerically, the limits can
be found directly as quantiles of the beta distribution, Beta(a, b): the lower limit is the /2
quantile of Beta(x, n x + 1) and the upper limit is the 1 /2 quantile of Beta(x + 1, n x).
Another commonly applied statistic is based
p on the normal approximation of the binomial
distribution. Asymptotically (X npc )/ npc (1 pc ) behaves like a standard normal random variable, so we may use
x npc0
w (pc0 ) = p
,
(24)
npc0 (1 pc0 )
as test statistic. This statistic is in fact identical to the Wald statistic (20) if se(0 ) is used
in the denominator instead of se(
).
The statistic w is related to the Pearson 2 statistic
X 2 (pc0 ) =
(x npc0 )2
(n x n(1 pc0 ))2
+
npc0
n(1 pc0 )
(25)
since w is the signed square root of X 2 . Similarly the likelihood root statistic, r(pc0 ) is
related to the likelihood ratio statistic
x
nx
G2 (pc0 ) = x log
+ (n x) log
(26)
npc0
n(1 pc0 )
since r(pc0 ) is the signed square root of G2 (pc0 ).
3.1.4
pc > pc0
versus HA : pd > pd0 ,
d0 > d00
(27)
where the traditional tests of no-difference is given by choosing pc0 = pg , pd0 = 0 and d00 = 0
making the null hypothesis an equality rather than an inequality.
The p-value of a difference test is the probability of observing a number of successes that
is as large or larger than that observed given the null hypothesis that the probability of a
correct answer is pc0 . The p-value based on the exact binomial test is therefore:
x1
X n
p-value = P (X x) = 1
pic0 (1 pc0 )ni ,
(28)
i
i=0
where X binom(pc0 , n)
The p-value for a difference based on a statistic, t(0 ) that follows a standard normal distribution under the null hypothesis is given by:
p-value = P {Z t(0 )} = 1 {t(0 )},
(29)
where Z is a standard normal random variable and is the standard normal cumulative
distribution function.
10
3.1.5
pc < pc0
versus HA : pd < pd0 ,
d0 < d00
(30)
where subject matter considerations and possibly power computations will guide the choice
of pc0 , pd0 or d00 . Observe that d00 has to be positive for the test to make sense.
The p-value of a similarity test is the probability of observing a number of successes that is
as large or less than that observed given the null hypothesis that the probability of a correct
answer is pc0 . The p-value based on the exact binomial test is therefore:
p-value = P (X x) =
x
X
n
i=0
(31)
where X binom(pc0 , n)
The p-value for a difference based on a statistic, t(0 ) that follows a standard normal distribution under the null hypothesis is given by:
p-value = P {Z t(0 )} = {t(0 )},
3.1.6
(32)
Confidence intervals are often described by their relation to hypothesis tests such that a
two-sided hypothesis test should be accompanied by a two-sided confidence interval and
one-sided hypothesis tests should be accompanied by one-sided confidence intervals. This
will make the 1 level confidence interval the region in which an observation would not
lead to rejection of the null hypothesis. A confidence interval should, however, provide more
than a rejection region; it should provide an interval in which we can have confidence that
the true parameter lies. This corresponds to the interval which provides most support for
the parameter. As such confidence intervals should be two-sided even if the appropriate test
may be one-sided (Boyles, 2008). We will use two-sided confidence intervals throughout and
use these in conjunction with p-values from one-sided difference and similarity tests. This
is also implemented in sensR.
Confidence intervals may, however, be one-sided in a slightly different respect since it may
happen, for instance, that the lower confidence limit is at the guessing probability, pg . If the
observed proportion of correct answers is less than pg , the lower confidence limit will also
be higher than the observed proportion.
Confidence intervals may be degenerate in the sense that both limits can be zero; this is
obviously not very informative. This may happen if, for instance, the observed proportion
is below pg and is large enough. For small enough , the upper confidence limit for d0
will, however, exceed zero.
Confidence intervals can be used for difference and similarity testing as argued by MacRae
(1995) and Carr (1995) when it is enough to know if the alternative hypothesis is rejected
or not. Comparing the formulas for the exact Clopper-Pearson confidence limits (23) with
11
the formulas for p-values in difference and similarity tests also based on the exact test, it is
clear that there is a close connection.
If pc0 under H0 is below the lower confidence limit in a 1 level interval, then the p-value
of a difference test will be below /2, i.e. the test will be significant at the /2-level. Thus, if
pc0 is below the lower confidence limit in a 90% interval, then the difference test is significant
at the 5% level. Similarly, if pc0 is above the upper confidence limit in a 90% interval, then
the similarity test is significant at the 5% level.
In difference testing the binomial test is not too liberal even if there is variability in pd under
the alternative hypothesis, because there can be no variability under the null hypothesis that
pd = 0. In similarity testing, however, pd > 0 under H0 and the standard binomial test could
possibly be liberal. Also not that pd under HA will be less than pd under H0 , and if there is
variation in pd in the distribution, this variation could be larger under H0 than under HA .
Also, the power and sample size computations in the following assume that zero variability
in pd . Possibly the power will be lower and sample sizes higher if there really is variation in
pd in the population.
The similarity tests discussed so far are targeted toward equivalence in the population on
average. There is no consideration of equivalence on the level of individual discrimination.
A general problem with discrimination testing outlined so far is the assumption that all
assessors have the same probability of discrimination. This is hardly ever a priory plausible.
The so-called guessing model (refs) assumes that there are two kinds of assessors; nondiscriminators that always guess and true discriminators that always perceive the difference
and discriminate correctly. This assumption is also hardly ever a priory plausible. More
plausible is perhaps that the probability of discrimination has some distribution across the
population of assessors as is assumed in the chance-corrected beta-binomial distribution.
3.1.7
Implementation in sensR
The function rescale that was described in section 3.0.1 has an additional optional argument
std.err which allows one to get the standard error of, say, pd and d0 if the standard error of
pc is supplied. This is done through application of eq. (16) and (17) and by using the user
visible function psyderiv, which implements the derivative of the psychometric functions,
0
fps
(d0 ) for the four common discrimination protocols:
> rescale(pd = 0.2, std.err = 0.12, method = "triangle")
Estimates for the triangle protocol:
pc pd d.prime
1 0.4666667 0.2 1.287124
Standard errors:
pc
pd
d.prime
1 0.08 0.12 0.4424604
The discrim function is the primary function for inference in the duo-trio, triangle, 2-AFC
and 3-AFC protocols. Given the number of correct answers, x and the number of trials, n,
discrim will provide estimates, standard errors and confidence intervals on the scale of pc ,
pd and d0 . It will also report the p-value from a difference or similarity test of the users
choice. p-values will be one-sided while confidence limits will be two-sided, cf. section 3.1.6.
Confidence intervals are computed on the scale of pc and then transformed to the pd and d0
12
scales as discussed in section 3.1.3. The user can choose between several statistics including
the exact binomial, likelihood, Wald and score statistics. The score option leads to the
so-called Wilson or score interval, while the p-value is based on the w statistic, cf. eq. (24).
Estimates and confidence intervals reported by discrim respect the allowed range of the
parameters, cf. eq. (13) and standard errors are not reported if the parameter estimates are
on the boundary of their parameter space (allowed range).
Strictly speaking the Wald statistic (20) is not defined when
q x/n pg , since the standard
error of pc is not defined. However, it makes sense to use nx 1 nx n1 as standard error
in this case. This is adopted in discrim.
Similarity testing does not make sense if pc0 = 0 under the null hypothesis, cf. eq. (30), so
a positive pd0 has to be chosen for similarity testing.
Example: Suppose we have performed a 3-AFC discrimination test and observed 10 correct answers in 15 trials. We want estimates of the pc , pd and d0 , their standard error
and 95% confidence intervals. We are also interested in the difference test of no difference
and decide to use the likelihood root statistic for confidence intervals and tests. Using the
discrim function in R we obtain:
> discrim(10, 15, method = "threeAFC", statistic = "likelihood")
Estimates for the threeAFC discrimination protocol with 10 correct
answers in 15 trials. One-sided p-value and 95 % two-sided confidence
intervals are based on the likelihood root statistic.
pc
pd
d-prime
If instead we had observed 4 correct answers in 15 trials and were interested in the similarity
test with pd0 = 1/5 under the null hypothesis, we get using the exact binomial criterion
for confidence intervals and tests:
> discrim(4, 15, method = "threeAFC", test = "similarity", pd0 = 0.2,
statistic = "exact")
Estimates for the threeAFC discrimination protocol with 4 correct
answers in 15 trials. One-sided p-value and 95 % two-sided confidence
intervals are based on the 'exact' binomial test.
pc
pd
d-prime
13
1.0
Relative Likelihood
0.8
0.6
0.4
0.2
0.0
0
dprime
Figure 2: Relative likelihood function for a 3-AFC experiment with 10 correct answers in
15 trials. The maximum likelihood estimate is at d0 = 1.12 and the two horizontal lines
determine the 95% and 99% likelihood based confidence intervals.
Alternative hypothesis: pd is less than 0.2
A few auxiliary methods for discrim objects are available. confint returns the confidence
intervals computed in the discrim object, profile extracts the (profile) likelihood function
and plot.profile plots the likelihood function.
Example To illustrate the auxiliary methods consider the 3-AFC example above where
10 correct answer were observed in 15 trials.
> fm1 <- discrim(10, 15, method = "threeAFC", statistic = "exact")
> confint(fm1)
Lower
Upper
pc
0.3838037 0.8817589
pd
0.0757056 0.8226383
d-prime 0.1744201 2.1015496
attr(,"method")
[1] "threeAFC"
attr(,"conf.level")
[1] 0.95
attr(,"statistic")
[1] "exact"
> plot(profile(fm1))
The resulting graph is shown in Fig. 2. Observe that the likelihood (profile) function may
be extracted from a discrim object that is not fitted with statistic = "likelihood".
Further information about the use and interpretation of (profile) likelihood curves in sensory
experiments is given in (Brockhoff and Christensen, 2010; Christensen and Brockhoff, 2009).
14
3.2
The power of a test is the probability of getting a significant result for a particular test given
data, significance level and a particular difference. In other words, it is the probability of
observing a difference that is actually there. Power and sample size calculations require that
a model under the null hypothesis and a model under the alternative hypothesis are decided
upon. The null model is often implied by the null hypothesis and is used to calculate the
critical value. The alternative model has to lie under the alternative hypothesis and involves
a subject matter choice. Power is then calculated for that particular choice of alternative
model.
In the following we will consider calculation of power and sample size based directly on the
binomial distribution. Later we will consider calculations based on a normal approximation
and based on simulations.
3.2.1
Formally the critical value, xc of a one-sided binomial test where the alternative hypothesis
is difference, or equivalently greater, is the smallest integer number that satisfies
P (X xc )
where X binom(pc0 , n)
(33)
and pc0 is the probability of a correct answer under the null hypothesis. Similarly the
critical value, xc of a one-sided binomial test where the alternative hypothesis is similarity,
or equivalently less, is the largest integer number that satisfies
P (X xc )
where X binom(pc0 , n)
(34)
If the sample size is small for the desired , there may not be a possible critical value that
satisfies (33) or (34). In a difference test it may not be enough to observe x = n correct
answers, i.e. all correct answers for the test to be significant at the required . Similarly,
it may not be enough to observe no correct answers (x = 0) for the similarity test to be
significant at the required .
A simple way to compute xc is to use a small while loop (shown here for a difference
test):
i=0
while P (X i) > do
i=i+1
end while
return i + 1
However, if xc is a large number, many iterations of the loop would be required, so instead
in the findcr function in package sensR eq. (33) and (34) are solved numerically for xc . One
complication with this method is that P (X xc ) is discontinuous in xc and that requires
special attention.
Example: Consider the situation that X = 15 correct answers are observed out of n = 20
trials in a duo-trio test. The exact binomial p-value of a no-difference test is P (X 15) =
15
3.2.2
(35)
where pcA is the probability of a correct answer under the alternative hypothesis and xc is
the critical value of the test, which depends on the probability of a correct answer under the
null hypothesis and the significance level, .
Power increases with the difference between pc0 and pcA , the sample size and . Power can
be computed directly once the critical value, pcA and n are known, so the only computational
challenge is in the computation of the critical value.
Example: The power of the test considered in the previous example is the probability of
getting this p-value or one that is smaller. This depends on the actual sensory difference
of the objects/the proportion of discriminators. If half the population are discriminators
or equivalently if each assessor has a 50% of correctly discriminating a set of samples, then
pc = 1/2 + 1/2pd = 3/4. The power is the probability of observing 15 or more correct
answers:
power = P (X 15) = 1 P (X 15 1) = 0.617
16
(36)
Observe that discrimPwr requires that the effect size under the alternative hypothesis is
given in terms of pd rather than pc or d0 . If the effect size under the alternative hypothesis is
formulated in terms of d0 , then rescale can be used to convert from d0A to pdA , but it would
be easier to use d.primePwr, which accepts d0A directly and internally calls discrimPwr.
If the significance test of interest is not that of no-difference, but that of a small difference
versus a relevant difference, the computation of the critical value is slightly different. The
power calculation remain essentially the same.
If the limit between irrelevant and relevant differences is at pd = 0.1, so pc = 1/2 + 1/2
0.1 = 0.55, then P (X 16|pc0 = 0.55, n = 20) = 1 P (X 16 1) = 0.019 while
P (X 15|pc0 = 0.55, n = 20) = 1 P (X 15 1) = 0.055. The critical value is therefore
16 and the power of the test is
power = P (X 16) = 0.415
(37)
Note the pd0 argument which should match the value of pd under the null hypothesis.
3.2.3
(38)
and pcA is the probability of a correct answer under the alternative hypothesis and xc is the
critical value of the test, which depends on the probability of a correct answer under the
null hypothesis and the significance level, .
Example: Assume that we want to calculate the power of a similarity test using the duotrio protocol with n = 100, and that we want to show that the probability of discrimination
is less than 1/3, while we believe that there is actually no difference between the objects,
so the true probability of discrimination is zero. The null hypothesis is therefore H0 : pc
1/2 + 1/2 1/3 = 2/3 and the alternative hypothesis is HA : pc < 2/3. The critical value
17
(39)
If in fact there is a small difference between the objects, so that there is a positive probability
of discrimination, say pd = 1/5, then the power is (the critical value remains the same):
power = P (X 58|pc = 0.5(1 + 1/5), n = 100) = 0.377
(40)
Observe how the power of the similarity test is quite good if there is absolutely no observable
difference between the objects, while if there is in fact a small probability that a difference
can be observed, the power is horrible and the sample size far from sufficient.
3.2.4
In more complicated models it is not possible to determine an explicit expression for the
power of a test and calculation of power based on simulations can be an attractive approach.
Sometimes it may also just be easier to let the computer do the job by running simulations
rather than to get bugged down in derivations of explicit expressions for power even though
they may in fact be possible to derive.
Recall that power is the probability of getting a significant result when there is in fact a
difference, thus in the long run it is the proportion of significant results to the total number
of tests:
no. p-values <
power =
(41)
no. tests
We can let the computer generate random data from the model under the alternative hypothesis and then perform the significance test. We can even do that many many times and
record the p-values allowing us to calculate the power via eq. (41). In the following we will
do exactly that for a binomial test for which we know the right answer.
Consider the no-difference example above in section 3.2.2 where n = 20 and the power of
a no-difference test was 0.617 when pd = 1/2, so pc = 3/4. We will estimate the power
via simulation by generating 10,000 (pseudo) random draws, Xi , i = 1, . . . , 10, 000 from
Xi binom(pc = 3/4, n = 20). For each of these draws we calculate the p-value as pi =
P (X xi |pc = 1/2, n = 20). Among these p-values 6184 were below 0.05, so the power
estimated by simulation is 0.6184. Observe that this is close to, but not exactly the power
that we obtained analytically (0.617). If we did the power calculation over again, we would
18
most likely get a slightly different power estimate although probably also close to 0.617
because we would obtain a slightly different set of random draws. This illustrates that
although power calculation via simulation is simple, the result varies a little from one run
to another.
Fortunately we can estimate the uncertainty in the estimated power p
from standard binomial
power(1 power)/nsim =
principles.
The
standard
error
of
the
estimated
power
is
se(
power)
=
p
0.6814(1 0.6814)/10, 000 = 0.0049 and an approximate Wald 95% CI for the estimated
power is [0.609; 0.628], which covers the true value (0.617) as one would expect.
3.2.5
An often used approximation for power and sample size calculations is the normal approximation; the idea is to use a statistic that asymptotically follows a standard normal distribution. For a binomial parameter power and sample size calculation may be based on the
Wald statistic (20) as for example described by Lachin (1981) and advocated by Bi (2006)
in a sensometric context. We are not aware of any numerical assessments of the accuracy
of the normal approximation for power and sample size calculations, but we may expect
that for small n or p (under the null or alternative) close to one, the approximation may be
rather inaccurate. Since power and sample size determinations are readily available for the
exact binomial test, we see no reason to use approximate statistics with doubtful properties
for these purposes.
Consider the following hypotheses for a binomial parameter:
H0 : p = p0
HA : p > p0 ,
(42)
(43)
(44)
p
where
p
is
the
probability
under
the
alternative
hypothesis,
=
p0 (1 p0 )/n, A =
A
0
p
pA (1 pA )/n and p = X/n is the estimator of a binomial parameter. The critical point
above which the null hypothesis is rejected is then
p p0
> 1 (1 ) = z1
0
(45)
p > z1 0 + p0 .
(46)
i.e. when
Under HA the null hypothesis is rejected if
z1 0 + p0 pA
p pA
>
A
A
(47)
z1 0 + p0 pA
z1 0 + p0 pA
power = P Z >
=1
A
A
(48)
19
(49)
Isolating n in eq. (48) leads to the following expression for the sample size of difference tests:
sample size =
pA (1 pA ) z1
p0 pA
p0 (1 p0 )
!2
,
(50)
(51)
z1
pA (1 pA ) z
p0 pA
p0 (1 p0 )
!2
where z1 = 1 (power). The sample sizes given by (50) and (51) should be rounded up
to the nearest integer.
3.2.6
In principle sample size determination is simple; find the sample size such that the power
is sufficiently high for a particular test at some significance level given some true difference.
Computationally, however, it can be a challenge.
Formally, the required sample size, n for a sensory difference test is the smallest integer
number, n that satisfies
P (X xc ) target-power where X binom(pc , n ),
(52)
and P (X xc ) is the actual power of the test. Power for a difference test only increases
with increasing sample size if the true difference, pd is larger than the null difference, pd0 ,
so it is a requirement that the value of pd specified as the true difference is actually covered
by the alternative hypothesis.
Similarly, the required sample size, n for a similarity test is the smallest integer number,
n that satisfies
P (X xc ) target-power where X binom(pc , n ),
(53)
and P (X xc ) is the actual power of the test. Power only increases with increasing sample
size if the true difference, pd is less than the null difference, pd0 , so as for difference tests,
the value specified as the true difference has to be covered by the alternative hypothesis.
The sample size depends on the particulars of the null and alternative hypotheses as well as
the significance level of the test, i.e. and the desired minimum power; the target-power.
So much for the formal definitions: practical sample size determination is in fact not as
simple as the definitions may lead one to believe. Consider a situation in which we want to
know which sample size to choose in a difference test using the triangle protocol where the
null hypothesis is no difference, target power is 0.80, and we believe the actual difference
is d0 = 0.9 under the alternative hypothesis. Standard sample size calculations under the
20
0.84
exact binomial
normal approx.
target power
0.82
0.80
0.78
0.76
power
0.74
275
297
318
325
sample size
Figure 3: The relation between sample size and power for a difference test with the triangle
protocol. The null hypothesis is that of no difference and d0 = 0.9 is assumed under the
alternative model. A power of 0.80 is desired.
definition (52) tells us that 297 tests are enough; this leads to an actual power of 0.802.
However, had we decided to use, say, 300 testsfor convenience and just to be on the safe
side, the power of the test is only 0.774; much less than the power with 297 tests and below
our target power. This is truly worrying; how many samples do we need to be sure that all
larger sample sizes also lead to a power above 0.80? It is natural to expect power to increase
with every increase in sample size (a monotonic increase in power with sample size), but
this is not the case as is illustrated in Fig. 3.
Power generally increases with the sample size, but it does so in a zig-zag way due to the
discreteness of the binomial distribution. As is seen in Fig. 3, the smallest sample size for
which power is higher than 0.80 is 297 (actual power = 0.802). The next sample size that
gives a power above 0.80 is 302, but the actual power is now less than 0.801. We would need
305 samples (actual power = 0.806) to obtain a power that is higher than the power with
297, and no less than 318 samples (actual power = 0.802) if no larger sample size should
lead to a power less than 0.80.
Even though an increase in sample size may lead to a decrease in power, it will instead lead
to a decrease in the actual -level. This occurs because the critical value of the test is at
times piece-wise constant as a function of sample size, cf. Fig. 4.
The sample size for the exact binomial test may be computed with much the same while loop
that could also be used to find the critical value (cf. section 3.2.1):
i=1
while actual power(i) < target power do
i=i+1
21
0.050
120
0.040
Actual alphalevel
115
110
Critical value
0.030
280
290
300
310
320
280
290
Sample size
300
310
320
Sample size
Figure 4: Left: critical value of the test. Right: actual -level of the test.
end while
return i
where actual power depends on the hypothesis, cf. (52) and (53). The problem with this
approach is that if the required sample size is large, it may take some time to get there;
recall that at every evaluation of the actual power, the critical value has to be determined.
Due to the non-monotonicity of the relationship between power and sample size (cf. Fig. 3),
it is not possible to simply solve for the required sample size numerically.
An improvement over the simple while loop is suggested by the normal approximation to
the required sample size shown in Fig. 3 in blue. This approximation seems to estimate the
sample size too low, and to do so consistently. For the example considered here, the normal
approximation estimates that 291 samples is enough to obtain a power of 0.80 (actual power
= 0.8001). The while loop could simply be started at i = 291 rather than at i = 1. A problem
with this approach is that the normal approximation is not always strictly liberal. In the
function discrimSS in package sensR a compromise is used, where the while loop is started
at one if the sample size estimated by the normal approximation is less than 50. Otherwise
the while loop is started at 90% of the normal approximation estimate and sometimes even
lower if necessary. If the normal approximation estimate is larger than 10,000, the function
will inform of that and not attempt to estimate the sample size due to the expected large
computation time. In addition to the sample size for the exact binomial test, it is also
possible to ask for the sample size based on the normal approximation.
Example: Consider the example above illustrated in Fig. 3; we wanted to know the sample
size for a difference test where the null hypothesis is that of no difference using the triangle
protocol. We want a power of 0.80, take = 0.05 and we assume the actual difference is
d0 = 0.9 under the alternative hypothesis. Using package sensR we may get the sample size
for the exact binomial test with
> (pd <- coef(rescale(d.prime = .9, method = "triangle"))$pd)
[1] 0.1044969
> discrimSS(pdA = pd, pd0 = 0, target.power = 0.8, alpha = 0.05, pGuess
= 1/3, test = "difference", statistic = "exact")
[1] 297
22
References
Bi, J. (2006). Sensory Discrimination Tests and MeasurementsStatistical Principles, Procedures and Tables. Blackwell Publishing.
Boyles, R. A. (2008). The role of likelihood in interval estimation. The American Statistician 62 (1), 2226.
Brockhoff, P. B. and R. H. B. Christensen (2010). Thurstonian models for sensory discrimination tests as generalized linear models. Food Quality and Preference 21, 330338.
Carr, B. T. (1995). Confidence intervals in the analysis of sensory discrimination teststhe
integration of similarity and difference testing. In Proceedings of the 4th AgroStat, Dijon,
7.-8. December, pp. 2331.
Christensen, R. H. B. and P. B. Brockhoff (2009). Estimation and Inference in the Same
Different Test. Food Quality and Preference 20, 514524.
Christensen, R. H. B. and P. B. Brockhoff (2010). sensR: An R-package for Thurstonian modelling of discrete sensory data. R package version 1.2.0 http://www.cran.rproject.org/package=sensR/.
Clopper, C. J. and E. S. Pearson (1934). The use of confidence or fiducial limits illustrated
in the case of the binomial. Biometrika 26, 404413.
Ennis, D. M. (1993). The power of sensory discrimination methods. Journal of Sensory
Studies 8 (353-370).
Green, D. M. and J. A. Swets (1966). Signal Detection Theory and Psychophysics. John
Wiley & Sons.
Lachin, J. M. (1981). Introduction to sample size determination and power analysis for
clinical trials. Controlled Clinical Trials 2, 93113.
Macmillan, N. A. and C. D. Creelman (2005). Detection Theory, A Users Guide (Second
ed.). Lawrence Elbaum Associates, Publishers.
MacRae, A. W. (1995). Confidence intervals for the triangle test can give assurance that
products are similar. Food Quality and Preference 6 (61-67).
Ns, T., P. B. Brockhoff, and O. Tomic (2010). Statistics for sensory and consumer science.
John Wiley & sons Ltd.
Pawitan, Y. (2001). In All LikelihoodStatistical Modelling and Inference Using Likelihood.
Oxford University Press.
Thurstone, L. L. (1927a). A law of comparative judgment. Psychological Review 34, 273286.
23
24