Luca Lista

Statistical Methods for Data Analysis in Particle Physics
Lecture Notes in Physics
Volume 909
Founding Editors
W. Beiglböck
J. Ehlers
K. Hepp
H. Weidenmüller
Editorial Board
M. Bartelmann, Heidelberg, Germany
B.-G. Englert, Singapore, Singapore
P. Hänggi, Augsburg, Germany
M. Hjorth-Jensen, Oslo, Norway
R.A.L. Jones, Sheffield, UK
M. Lewenstein, Barcelona, Spain
H. von Löhneysen, Karlsruhe, Germany
J.-M. Raimond, Paris, France
A. Rubio, Donostia, San Sebastian, Spain
S. Theisen, Potsdam, Germany
D. Vollhardt, Augsburg, Germany
J.D. Wells, Ann Arbor, USA
G.P. Zank, Huntsville, USA
The Lecture Notes in Physics
The series Lecture Notes in Physics (LNP), founded in 1969, reports new developments in physics research and teaching quickly and informally, but with a high quality and the explicit aim to summarize and communicate current knowledge in an accessible way. Books published in this series are conceived as bridging material between advanced graduate textbooks and the forefront of research and to serve three purposes:
• to be a compact and modern up-to-date source of reference on a well-defined
topic
• to serve as an accessible introduction to the field to postgraduate students and
nonspecialist researchers from related areas
• to be a source of advanced teaching material for specialized seminars, courses
and schools
Both monographs and multi-author volumes will be considered for publication.
Edited volumes should, however, consist of a very limited number of contributions
only. Proceedings will not be considered for LNP.
Volumes published in LNP are disseminated both in print and in electronic formats, the electronic archive being available at springerlink.com. The series content
is indexed, abstracted and referenced by many abstracting and information services,
bibliographic networks, subscription agencies, library networks, and consortia.
Proposals should be sent to a member of the Editorial Board, or directly to the
managing editor at Springer:
Christian Caron
Springer Heidelberg
Physics Editorial Department I
Tiergartenstrasse 17
69121 Heidelberg/Germany
[email protected]
Luca Lista
INFN Sezione di Napoli
Napoli, Italy
The following notes collect material from a series of lectures presented during a
course on “Statistical methods for data analysis” I gave for Ph.D. students in physics
at the University of Naples “Federico II” from 2009 to 2014. The aim of the course
was to elaborate on the main concepts and tools that allow one to perform statistical analysis of experimental data.
An introduction to probability theory and basic statistics is provided, which
serves as a refresher course for students who did not take a formal course on
statistics before starting their Ph.D.
Many of the statistical tools that have been covered have applications in high
energy physics (HEP), but their scope could well range outside the boundaries of
HEP.
A shorter version of the course was presented at CERN in November 2009
as lectures on statistical methods in LHC data analysis for the ATLAS and CMS
experiments. The chapter discussing upper limits was improved after lectures on the
subject I gave in Autrans, France, in May 2012 at the IN2P3 School of Statistics.
I was also invited to give a seminar on statistical methods at Ghent University, Belgium, in October 2014, which gave me an opportunity to review some of my material and add new examples.
Contents

1 Probability Theory
  1.1 The Concept of Probability
  1.2 Classical Probability
  1.3 Issues with the Generalization to the Continuum
    1.3.1 Bertrand's Paradox
  1.4 Axiomatic Probability Definition
  1.5 Probability Distributions
  1.6 Conditional Probability and Independent Events
  1.7 Law of Total Probability
  1.8 Average, Variance and Covariance
  1.9 Variables Transformations
  1.10 The Bernoulli Process
  1.11 The Binomial Process
    1.11.1 Binomial Distribution and Efficiency Estimate
  1.12 The Law of Large Numbers
  References

2 Probability Distribution Functions
  2.1 Definition of Probability Distribution Function
  2.2 Average and Variance in the Continuous Case
  2.3 Cumulative Distribution
  2.4 Continuous Variables Transformation
  2.5 Uniform Distribution
  2.6 Gaussian Distribution
  2.7 Log-Normal Distribution
  2.8 Exponential Distribution
  2.9 Poisson Distribution
  2.10 Other Distributions Useful in Physics
    2.10.1 Argus Function
    2.10.2 Crystal Ball Function
    2.10.3 Landau Distribution
Chapter 1
Probability Theory
Many processes in nature have uncertain outcomes. This means that their result cannot be predicted before the process occurs. A random process is a process that can be reproduced, to some extent, within some given boundary and initial conditions, but whose outcome is uncertain. This situation may be due to insufficient information about the intrinsic dynamics of the process, which prevents predicting its outcome, or to insufficient accuracy in reproducing the initial conditions needed to ensure its exact reproducibility. Some processes, like quantum mechanical phenomena, are intrinsically random. This leads to possibly different outcomes if the experiment is repeated several times, even if each time the initial conditions are exactly reproduced, within the experimenter's possibility of control. Probability is a measure of how favored one of the possible outcomes of such a random process is compared with any of the other possible outcomes.
There are two main approaches to the concept of probability, which result in two different meanings of probability. These are referred to as frequentist and Bayesian probabilities.
• Frequentist probability is defined as the fraction of the number of occurrences of an event of interest over the total number of possible events in a repeatable experiment, in the limit of a very large number of experiments. This concept can only be applied to processes that can be repeated over a reasonably long period of time. It is meaningless to apply the frequentist concept of probability to an unknown event, like the possible values of an unknown parameter. For instance, the probability of a possible score in a sports game, or the probability that the mass of an unknown particle is larger than 200 GeV, are meaningless in the frequentist approach.
• Bayesian probability measures someone's degree of belief that a statement is true (this may also refer to future events). The quantitative definition of Bayesian probability is discussed in Chap. 3.
1.2 Classical Probability

This approach can be used in practice only for relatively simple problems, since it assumes that all possible cases under consideration are equally probable, which may not always be the case in complex examples. Examples of cases where classical probability can be applied are coin tossing, where the two faces of a coin are assumed to have equal probability, equal to 1/2 each, or dice rolls, where each of the six dice faces² has equal probability, equal to 1/6, and so on.

1 Note that in physics an event is often intended as an elementary event, so the use of the term event in statistics may sometimes lead to confusion.
Starting from simple cases, like coins or dice, more complex models can be built using combinatorial analysis, in which, in the simplest cases, one may proceed by enumeration of all the finite number of possible cases, and again the probability of an event is given by counting the number of favorable cases and dividing it by the total number of possible cases of the combinatorial problem. An easy case of combinatorial analysis is given by the roll of two dice, taking the sum of the two outcomes as the final result. The possible number of outcomes is given by the 6 × 6 = 36 different combinations that may lead to a sum ranging from 2 to 12. The possible combinations are enumerated in Table 1.1, and the corresponding probabilities, computed as the number of favorable cases divided by 36, are shown in Fig. 1.1.
Several examples can be treated in this way, decomposing the problem into all the possible elementary outcomes, and identifying an event as the occurrence of one of the elementary outcomes from a specific set of possible outcomes. For instance, the event "sum of dice = 4" corresponds to the set of possible outcomes {(1, 3), (2, 2), (3, 1)}. Other events (e.g.: "sum is an odd number", or "sum is greater than 5") may be associated to different sets of possible combinations. In general, formulating the problem in terms of sets allows one to replace the logical "and", "or" and "not" in the sentence defining the outcome condition by the intersection, union and complement of the corresponding sets of elementary outcomes.
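As an illustration of this set-based bookkeeping (an added sketch, not part of the original text; the event definitions are ad hoc examples), the short Python snippet below enumerates the 36 elementary outcomes of two dice and computes the probability of a few events by counting favorable cases.

```python
from fractions import Fraction
from itertools import product

# All 36 equally probable elementary outcomes of two dice.
outcomes = list(product(range(1, 7), repeat=2))

def prob(event):
    """Classical probability: favorable cases / total cases."""
    favorable = [o for o in outcomes if event(o)]
    return Fraction(len(favorable), len(outcomes))

print(prob(lambda o: sum(o) == 4))        # 1/12, i.e. 3 cases out of 36
print(prob(lambda o: sum(o) % 2 == 1))    # "sum is an odd number" -> 1/2
print(prob(lambda o: sum(o) > 5))         # "sum is greater than 5" -> 13/18
```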
It is possible to build more complex models, but as soon as the problem increases in complexity, the combinatorial analysis often grows rapidly in difficulty and may easily become hard to manage. In many realistic cases, like that of a particle detector, where effects like efficiency, resolution, etc. must be taken into account, it may be hard to determine a number of equally probable cases, and a different approach should be followed. It can also be shown that classical probability is easy to define only for discrete cases, and the generalization to the continuum case is not unambiguously defined, as will be seen in Sect. 1.3.1.

2 Or even different from 6, like in many role-playing games that use dice with non-cubic solid shapes.
Fig. 1.2 Illustration of Bertrand's paradox: three different choices of random extraction of a chord in the circle apparently lead to probabilities that the chord is longer than the inscribed triangle's side of 1/2 (left), 1/3 (center) and 1/4 (right), respectively

3 This apparent paradox is due to the French mathematician Joseph Louis François Bertrand (1822–1900).
π/3, and the chords span an angle of π. Repeating for any possible point on the circumference of the circle, one would derive that P = 1/3, which is different from P = 1/2 as derived in the first case.
3. Let's extract uniformly a point in the circle and construct the chord passing through that point and perpendicular to the radius that passes through the same point (Fig. 1.2, right plot). We can derive in this way that P = 1/4, since the chords starting from a point contained inside (outside) the circle inscribed in the triangle have a length longer (shorter) than the triangle's side, and the ratio of the area of the circle inscribed in the triangle to the area of the circle that inscribes the triangle is equal to 1/4.
The paradox is clearly only apparent, because the process of uniform random extraction of a chord in a circle is not uniquely defined.
P(E) = \sum_{j=1}^{K} P(x_{E_j}) .   (1.2)

From the second of Kolmogorov's axioms, it is also clear that the probability of the event corresponding to the set of all possible outcomes, {x_1, ..., x_n}, must be one. Equivalently, the sum of the probabilities of all possible outcomes is equal to one:

\sum_{i=1}^{n} P(x_i) = 1 .   (1.3)
P(A|B) = \frac{P(A \cap B)}{P(B)} .   (1.4)

The conditional probability can be visualized in Fig. 1.3. While the probability of A, P(A), corresponds to the area of the set A, relative to the area of the whole sample space Ω, which is equal to one, the conditional probability P(A|B) corresponds to the area of the intersection of A and B, relative to the area of the set B.
An event A is said to be independent of an event B if the conditional probability of A, given B, is equal to the probability of A, i.e. the occurrence of B does not change the probability of A:

P(A|B) = P(A) .   (1.5)

Two events are independent if, and only if, the probability of their simultaneous occurrence is equal to the product of their probabilities, i.e.:

P(A \cap B) = P(A)\, P(B) .   (1.6)

ε_A and ε_B are also called the efficiencies of the detectors A and B, respectively. This result clearly does not hold if there are causes of simultaneous inefficiency of both detectors, e.g. a fraction of time during which the electronics systems of both A and B are simultaneously switched off, or a geometrical overlap of inactive regions.
\bigcup_{i=1}^{n} E_i = E_0 .   (1.8)

This case is visualized in Fig. 1.4. It is easy to prove that the probability corresponding to E_0 is equal to the sum of the probabilities of the E_i (law of total probability):

P(E_0) = \sum_{i=1}^{n} P(E_i) .   (1.9)

E_i = E_0 \cap A_i ,   (1.10)
P(E_0) = \sum_{i=1}^{n} P(E_0|A_i)\, P(A_i) .   (1.12)

\langle x \rangle = \sum_{i=1}^{N} x_i\, P(x_i) .   (1.13)

Sometimes the notation E[x] or x̄ is also used in the literature to indicate the average value.
The variance of a random variable x is defined as:
V[x] = \sum_{i=1}^{N} (x_i - \langle x \rangle)^2\, P(x_i) ,   (1.14)

Sometimes the standard deviation is confused, in some physics literature, with the root mean square, abbreviated as r.m.s., which instead should be more properly defined as:

x_{\mathrm{rms}} = \sqrt{\sum_{i=1}^{N} x_i^2\, P(x_i)} = \sqrt{\langle x^2 \rangle} .   (1.16)

It is also easy to demonstrate that, if we have two random variables x and y, the averages can be added linearly, i.e.:

\langle x + y \rangle = \langle x \rangle + \langle y \rangle ,

and the correlation coefficient of x and y is defined as:

\rho_{xy} = \frac{\mathrm{cov}(x, y)}{\sigma_x\, \sigma_y} .   (1.23)

\gamma_2 = \beta_2 - 3 .   (1.28)
Given a random variable x, let's consider the variable y transformed via the function y = Y(x). If x can only assume the values {x_1, ..., x_N}, then y can only assume one of the values {y_1, ..., y_M} = {Y(x_1), ..., Y(x_N)}. Note that N is equal to M only if all the values Y(x_i) are different from each other. The probability of any value y_j is given by the sum of the probabilities of all values x_i that are transformed into y_j by Y:

P(y_j) = \sum_{i:\, Y(x_i) = y_j} P(x_i) .   (1.29)

Note that there could also be a single possible value of the index i for which Y(x_i) = y_j. In this case, we simply have P(y_j) = P(x_i).
The generalization to more variables is straightforward: assume that we have two random variables x and y, and a variable z defined as z = Z(x, y); if {z_1, ..., z_M} is the set of all possible values of z, given all the possible values x_i and y_j, the probability of each possible value of z, z_k, will be given by:

P(z_k) = \sum_{i,j:\, Z(x_i, y_j) = z_k} P(x_i, y_j) .   (1.30)

This expression is consistent with the example of the sum of two dice considered in Sect. 1.2.
Let’s consider a basket that contains a number n of balls that have two possible
colors, say red and white. We know the number r of red balls in the basket, hence
the number of white balls is n r. The probability to randomly extract a red ball in
that set of n balls is p D r=n, according to Eq. (1.1) (Fig. 1.6).
A variable x equal to the number of extracted red balls is called Bernoulli
variable, and can assume only the values of 0 or 1. The probability distribution
of x is just given by P.1/ D p, P.0/ D 1 p. The average of a Bernoulli variable is
easy to compute:
Fig. 1.6 A set of n D 10 balls, of which r D 3 are red determines a Bernoulli process. The
probability to randomly extract a red ball in the set shown in this figure is p D r=n D 3=10 D 30 %
The probability distribution of the binomial variable n for a given N and p can be obtained considering that the N extractions are independent, hence the corresponding probability terms (p for a red extraction, 1 − p for a white extraction) can be multiplied, according to Eq. (1.6). Multiplying finally this product by the binomial coefficient from Eq. (1.34), to take into account the possible extraction paths leading

4 The coefficients present in the binomial distribution are the same that appear in the expansion of the binomial power (a + b)^n. A simple iterative way to compute those coefficients is known in the literature as Pascal's triangle. Different countries name this triangle after different authors, e.g. the Tartaglia triangle in Italy, Yang Hui's triangle in China, etc. In particular, the following publications of the triangle are present in the literature:
• India: published in the tenth century, referring to the work of Pingala, dating back to the fifth–second century BC.
• Persia: Al-Karaju (953–1029) and Omar Jayyám (1048–1131), often referred to as the Kayyám triangle in modern Iran.
• China: Yang Hui (1238–1298), referred to as Yang Hui's triangle by the Chinese; see Fig. 1.8.
• Germany: Petrus Apianus (1495–1552).
• Italy: Nicolò Fontana Tartaglia (1545), referred to as Tartaglia's triangle in Italy.
• France: Blaise Pascal (1655), referred to as Pascal's triangle.
Fig. 1.7 Building a binomial process as subsequent random extractions of a single red or white ball (Bernoulli process). The "tree" in the figure displays all the possible combinations at each extraction step. Each "branching" has a corresponding probability equal to p or 1 − p for a red or white ball, respectively. The number of paths corresponding to each possible combination is shown in parentheses and is equal to a binomial coefficient
to the same outcome, the probability to obtain n red and N − n white extractions can be written as:

P(n; N) = \frac{N!}{n!\,(N-n)!}\, p^{n}\, (1 - p)^{N-n} .   (1.35)

The binomial distribution is shown in Fig. 1.9 for N = 15 and for p = 0.2, 0.5 and 0.8.
Since the binomial variable n is the sum of N independent Bernoulli variables with probability p, the average and variance of n can be computed as N times the average and variance of a Bernoulli variable (Eqs. (1.31) and (1.33)):

\langle n \rangle = N p ,   (1.36)

V[n] = N p\, (1 - p) .   (1.37)
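The following Python sketch (an added illustration, not part of the original text) evaluates the binomial probabilities of Eq. (1.35) and checks numerically that the average and variance match Eqs. (1.36) and (1.37) for N = 15 and p = 0.2.

```python
from math import comb

def binomial_pmf(n, N, p):
    """P(n; N) from Eq. (1.35): N!/(n!(N-n)!) * p^n * (1-p)^(N-n)."""
    return comb(N, n) * p**n * (1 - p)**(N - n)

N, p = 15, 0.2
probs = [binomial_pmf(n, N, p) for n in range(N + 1)]

mean = sum(n * P for n, P in enumerate(probs))
var = sum((n - mean) ** 2 * P for n, P in enumerate(probs))

print(sum(probs))              # 1.0 (normalization)
print(mean, N * p)             # both 3.0, Eq. (1.36)
print(var, N * p * (1 - p))    # both 2.4, Eq. (1.37)
```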
The efficiency ε of a device is defined as the probability that the device gives a positive signal upon the occurrence of a process of interest. Particle detectors are examples of such devices. The distribution of the number of positive signals n upon
Fig. 1.8 The Yang Hui triangle published in 1303 by Zhu Shijie (1260–1320)
Fig. 1.9 Binomial distribution for N = 15 and for p = 0.2, 0.5 and 0.8

N occurrences of the process of interest is therefore a binomial distribution (Eq. (1.35)) whose parameter p corresponds to the efficiency ε of the device. The problem of parameter estimates will be discussed more in general in Chaps. 3 and 5 for what concerns the Bayesian and frequentist approaches, respectively.
For the moment, a pragmatic way to estimate the efficiency can be considered: perform a large number N of samplings of our process of interest and count the number of times n̂ the device gives a positive signal (i.e. it has been efficient). This leads to the estimate of the true efficiency ε given by:

\hat{\varepsilon} = \frac{\hat{n}}{N} .   (1.38)

One may argue that if N is sufficiently large, ε̂ will be very close to the true efficiency ε. This will be more formally discussed as a consequence of the law of large numbers, which will be introduced in Sect. 1.12. The standard deviation of ε̂ = n/N, using the variance of n, V[n], from Eq. (1.37), is:

\sigma_{\hat{\varepsilon}} = \sqrt{V\!\left[\frac{n}{N}\right]} = \sqrt{\frac{\varepsilon\,(1-\varepsilon)}{N}} .   (1.39)

By simply replacing ε with ε̂, one obtains the following approximate expression for the standard deviation (interpreted as the error or uncertainty) of ε̂:

\delta_{\hat{\varepsilon}} = \sqrt{\frac{\hat{\varepsilon}\,(1-\hat{\varepsilon})}{N}} .   (1.40)
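As a simple numerical illustration (added here; the true efficiency and the seed are made-up values), the sketch below simulates N trials of a detector with true efficiency ε = 0.75, then forms the estimate of Eq. (1.38) and its approximate uncertainty from Eq. (1.40).

```python
import numpy as np

rng = np.random.default_rng(seed=1)

eps_true = 0.75   # assumed true efficiency (illustrative value)
N = 10_000        # number of sampled occurrences of the process

# Each trial gives a positive signal with probability eps_true.
n_hat = rng.binomial(N, eps_true)

eps_hat = n_hat / N                                # Eq. (1.38)
delta_eps = np.sqrt(eps_hat * (1 - eps_hat) / N)   # Eq. (1.40)

print(f"estimated efficiency: {eps_hat:.4f} +- {delta_eps:.4f}")
```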
Fig. 1.10 An illustration of the law of large numbers using a simulation of die rolls. The average of the first N out of 1000 random extractions is reported as a function of N. The 1000 extractions have been repeated twice (red and blue lines)

\langle x \rangle = \frac{1+2+3+4+5+6}{6} = 3.5 .   (1.41)
Increasing the number of trials, the probability distribution of the possible average values acquires a narrower shape, and the interval of values that corresponds to a large fraction of the total probability (we could choose, say, 90 or 95 %) gets smaller and smaller as the number of trials N increases. If we could ideally increase the total number of trials N to infinity, the average value x̄ would no longer be a random variable, but would take a single possible value, equal to ⟨x⟩ = 3.5.
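The running average shown in Fig. 1.10 can be reproduced with a few lines of Python; this is an added illustration, with the number of rolls and the random seed chosen arbitrarily.

```python
import numpy as np

rng = np.random.default_rng(seed=42)

rolls = rng.integers(1, 7, size=1000)            # 1000 simulated die rolls
running_avg = np.cumsum(rolls) / np.arange(1, 1001)

# The running average approaches <x> = 3.5 as the number of rolls N grows.
for N in (10, 100, 1000):
    print(N, running_avg[N - 1])
```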
The law of large numbers has many empirical verifications for the vast majority of random experiments. This broad generality of the law of large numbers, even in realistic cases more complex than the simple ones that can be described by simplified classical models, gives rise to the frequentist definition of the probability P(E) of an event E according to the following limit:

P(E) = p \quad \text{if} \quad \forall\,\varepsilon:\ \lim_{N\to\infty} P\!\left( \left| \frac{N(E)}{N} - p \right| < \varepsilon \right) = 1 .   (1.42)

The limit is intended, in this case, as convergence in probability, meaning that, by the law of large numbers, the limit only rigorously holds in the unrealizable case of an infinite number of experiments. Rigorously speaking, the definition of frequentist probability in Eq. (1.42) is itself defined in terms of another probability, which could introduce conceptual problems. F. James et al. report the following sentence [3]: "this definition is not very appealing to a mathematician, since it is based on experimentation, and, in fact, implies unrealizable experiments (N → ∞)".
In practice, anyway, we know that experiments are reproducible over a finite range of time (on this planet, for instance, as long as the Sun and the Solar System continue to exist), so, for the practical purposes of physics applications, we can consider the law of large numbers and the frequentist definition of probability, beyond their exact mathematical meaning, as pragmatic definitions that describe to a very good level of approximation the concrete situations of the vast majority of the cases we are interested in within experimental physics.
References
1. P. Laplace, Essai Philosophique Sur les Probabilités, 3rd edn. (Courcier Imprimeur, Paris, 1816)
2. A. Kolmogorov, Foundations of the Theory of Probability. (Chelsea Publishing, New York,
1956)
3. W. Eadie, D. Drijard, F. James, M. Roos, B. Sadoulet, Statistical Methods in Experimental Physics (North Holland, Amsterdam, 1971)
Chapter 2
Probability Distribution Functions
The problem introduced with Bertrand's paradox, seen in Sect. 1.3.1, occurs because the decomposition of the range of possible values of a random variable x into equally probable elementary cases is not possible without ambiguity. The ambiguity arises because of the continuous nature of the problem. We considered in Sect. 1.3 a continuous random variable x whose outcome may take any possible value in a continuous interval [x_1, x_2], and we saw that if x is uniformly distributed in [x_1, x_2], a transformed variable y = Y(x) is not in general uniformly distributed in [y_1, y_2] = [Y(x_1), Y(x_2)] (Y being taken as a monotonic function of x). This makes the choice of a continuous variable on which to define equally probable intervals of the same size an arbitrary choice.
The concept of probability distribution seen in Sect. 1.5 can be generalized to the continuous case by introducing the probability distribution function, defined as follows.
Let's consider a sample space {x⃗ = (x_1, ..., x_n) ∈ R^n}. Each random extraction (an experiment, in the cases of interest to a physicist) leads to an outcome (measurement) of one point x⃗ in the sample space Ω. We can associate to any point x⃗ a probability density f(x⃗) = f(x_1, ..., x_n), which is a real value greater than or equal to zero. The probability of an event A, where A ⊆ Ω, i.e. the probability that x⃗ ∈ A, is given by:

P(A) = \int_A f(x_1, \cdots, x_n)\, \mathrm{d}^n x .   (2.1)
The function f is called the probability distribution function (PDF). The function f(x⃗) can be interpreted as a differential probability, i.e. the probability corresponding to an infinitesimal volume element dx_1 ⋯ dx_n, divided by that volume element:

\frac{\mathrm{d}^n P}{\mathrm{d}x_1 \cdots \mathrm{d}x_n} = f(x_1, \cdots, x_n) .   (2.2)

The normalization condition for discrete probability distributions (Eq. (1.3)) can be generalized to continuous probability distribution functions as follows:

\int f(x_1, \cdots, x_n)\, \mathrm{d}^n x = 1 .   (2.3)
Note that the probability of each single point is rigorously zero if f is a real function, i.e. P({x_0}) = 0 for any x_0, since the set {x_0} has null measure. The treatment of discrete variables in one dimension can be done using the same formalism of PDFs, extending the definition of PDF to Dirac's delta functions (δ(x − x_0), with ∫_{−∞}^{+∞} δ(x − x_0) dx = 1). Dirac's delta functions can be linearly combined with proper weights equal to the probabilities of the discrete values: the PDF corresponding to the case of a discrete random variable x that can take only the values x_1, ..., x_N with probabilities P_1, ..., P_N, respectively, can be written as:

f(x) = \sum_{i=1}^{N} P_i\, \delta(x - x_i) .   (2.5)

which gives again Eq. (1.3). Discrete and continuous values can be combined by introducing linear combinations of continuous PDFs and Dirac's delta functions.
The definitions of average and variance introduced in Sect. 1.8 are generalized to continuous variables as follows:

\langle x \rangle = \int x\, f(x)\, \mathrm{d}x ,   (2.7)
V[x] = \int (x - \langle x \rangle)^2\, f(x)\, \mathrm{d}x = \langle x^2 \rangle - \langle x \rangle^2 .   (2.8)

Integrals should be extended over [−∞, +∞], or over the whole validity range of the variable x. The standard deviation is defined as:

\sigma_x = \sqrt{V[x]} .   (2.9)

Other interesting parameters of a continuous PDF are the median, which is the value m such that:

P(x \le m) = P(x > m) = \frac{1}{2} ,   (2.10)

or equivalently:

\int_{-\infty}^{m} f(x)\, \mathrm{d}x = \int_{m}^{+\infty} f(x)\, \mathrm{d}x = \frac{1}{2} ,   (2.11)
and the mode, which is the value M that corresponds to the maximum of the
probability density:
If the variable x follows the PDF f(x), the PDF of the transformed variable y = F(x) is uniform between 0 and 1, as can be demonstrated as follows:

\frac{\mathrm{d}P}{\mathrm{d}y} = \frac{\mathrm{d}P}{\mathrm{d}x}\,\frac{\mathrm{d}x}{\mathrm{d}y} = f(x)\,\frac{\mathrm{d}x}{\mathrm{d}F(x)} = \frac{f(x)}{f(x)} = 1 .   (2.16)
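A quick numerical check of this property (an added sketch, using an exponential distribution as an arbitrary example) transforms random samples through their own cumulative distribution and verifies that the result is uniform in [0, 1].

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Samples from an exponential PDF f(x) = exp(-x), whose
# cumulative distribution is F(x) = 1 - exp(-x).
x = rng.exponential(scale=1.0, size=100_000)
y = 1.0 - np.exp(-x)   # y = F(x)

# If y is uniform in [0, 1], its mean and variance are 1/2 and 1/12
# and its histogram is flat.
print(y.mean(), y.var())                           # ~0.5, ~0.0833
print(np.histogram(y, bins=10, range=(0, 1))[0])   # roughly equal counts per bin
```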
In the case of transformations into more than one variable, the generalization is straightforward. If we have, for instance, x' = X'(x, y) and y' = Y'(x, y), the transformed two-dimensional PDF can be written as:

f'(x', y') = \int \delta\!\left(x' - X'(x, y)\right)\, \delta\!\left(y' - Y'(x, y)\right)\, f(x, y)\, \mathrm{d}x\, \mathrm{d}y .   (2.19)

u(x) = \begin{cases} \dfrac{1}{b-a} & \text{if } a \le x < b \\ 0 & \text{if } x < a \text{ or } x \ge b \end{cases}   (2.21)
Fig. 2.1 Uniform distributions with (a, b) = (1, 4), (2, 8) and (3, 12)
Examples of uniform distributions are shown in Fig. 2.1 for different extreme values
a and b.
The average of a variable x which is uniformly distributed is:

\langle x \rangle = \frac{a+b}{2} ,   (2.22)

and the standard deviation of x is:

\sigma = \frac{b-a}{\sqrt{12}} .   (2.23)

g(x; \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma}\, \exp\!\left( -\frac{(x-\mu)^2}{2\sigma^2} \right) .   (2.24)

The average value and standard deviation of the variable x are μ and σ, respectively. Examples of Gaussian distributions are shown in Fig. 2.2 for different values of μ and σ. A random variable obeying a normal distribution is called a normal random variable.
If μ = 0 and σ = 1, a normal variable is called a standard normal variable, following the standard normal distribution φ(x):

\varphi(x) = \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2} .   (2.25)
Fig. 2.2 Gaussian distributions with different values of average and standard deviation parameters
The most frequently used values are the ones corresponding to 1σ, 2σ and 3σ, which correspond to probabilities of 68.27, 95.45 and 99.73 %, respectively.
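These probabilities are the integrals of a standard normal distribution within ±Zσ; the short added sketch below reproduces them from the error function.

```python
from math import erf, sqrt

# For a Gaussian, P(|x - mu| < Z*sigma) = erf(Z / sqrt(2)).
for Z in (1, 2, 3):
    print(Z, erf(Z / sqrt(2)))   # 0.6827, 0.9545, 0.9973
```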
The importance of the Gaussian distribution is due to the central limit theorem (see Sect. 2.12), which allows approximating with Gaussian distributions many realistic cases resulting from the superposition of several random effects, each having a finite and possibly unknown PDF.
The PDF in Eq. (2.28) can be derived from Eq. (2.17) of Sect. 2.4 in the case of a normal PDF. A log-normal variable has the following average and standard deviation:

\langle x \rangle = e^{\mu + \sigma^2/2} ,   (2.29)

\sigma_x = e^{\mu + \sigma^2/2}\, \sqrt{e^{\sigma^2} - 1} .   (2.30)
Note that Eq. (2.29) implies that ⟨e^y⟩ > e^{⟨y⟩} for a normal random variable y. In fact, since e^{σ²/2} > 1:

\langle e^y \rangle = \langle x \rangle = e^{\mu}\, e^{\sigma^2/2} > e^{\mu} = e^{\langle y \rangle} .
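A minimal numerical check (added here, with arbitrary μ and σ) generates a log-normal variable as x = e^y with y normal, and compares the sample mean and standard deviation with Eqs. (2.29) and (2.30).

```python
import numpy as np

rng = np.random.default_rng(seed=3)

mu, sigma = 1.0, 0.25            # parameters of the underlying normal PDF
y = rng.normal(mu, sigma, size=1_000_000)
x = np.exp(y)                    # log-normal variable

mean_expected = np.exp(mu + sigma**2 / 2)                       # Eq. (2.29)
std_expected = mean_expected * np.sqrt(np.exp(sigma**2) - 1)    # Eq. (2.30)

print(x.mean(), mean_expected)
print(x.std(), std_expected)
```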
Examples of log-normal distributions are shown in Fig. 2.3 for different values of μ and σ.
Examples of exponential distributions are shown in Fig. 2.4 for different values of the parameter λ.
Exponential distributions are widely used in physics.
Fig. 2.5 A horizontal axis representing the occurrence times (dots) of some events. The time origin (t_0 = 0) is marked as a cross
Let’s consider a time t and another time t C ıt, where ıt t. We will compute
first the probability P.t1 > t Cıt/ that t1 is greater than t Cıt, which is equal to
the probability P.0; Œ0; tCıt / that no event occurs in the time interval Œ0; tCıt .
The probability P.0; Œ0; t C ıt / is equal to the probability that no event occurs
in the interval Œ0; t and no event occurs in the interval Œt; t C ıt . Since the
occurrences of events in two disjoint time intervals are independent events,
the combined probability is the product of the two probabilities P.0; Œ0; t / and
P.0; Œt; t C ıt /:
Given the rate r of event occurrences, i.e. the expected number of events per
unit of time, the probability to have one occurrence in a time interval ıt is
given by:
and the probability to have more than one occurrence can be can be neglected
with respect to P.1Œt; t C ıt / D rıt at O.ıt2 /. So, the probability to have zero
events in ıt is equal to:
or, equivalently:
dP.t1 > t/
D rP.t1 > t/ : (2.38)
dt
Considering the initial condition P(t_1 > 0) = 1, Eq. (2.38) has the following solution:

P(t_1 > t) = e^{-r t} ,

hence the probability distribution function of the occurrence time of the first event can be written as:

f(t_1) = r\, e^{-r t_1} .

A discrete variable n is said to follow a Poisson distribution with expected value ν if its probability distribution is:

P(n; \nu) = \frac{\nu^n\, e^{-\nu}}{n!} .   (2.44)
Fig. 2.7 A uniform distribution of events in a variable x. Two intervals of sizes x and X are shown, where x ≪ X

Examples of Poisson distributions are shown in Fig. 2.6 for different values of the rate parameter ν. It is easy to show that the average and variance of a Poisson distribution are both equal to ν.

\nu = \frac{N x}{X} = r\, x .   (2.45)
The variable n is subject to the binomial distribution (see Sect. 1.11):

P(n; \nu, N) = \frac{N!}{n!\,(N-n)!} \left(\frac{\nu}{N}\right)^{n} \left(1 - \frac{\nu}{N}\right)^{N-n} ,   (2.46)

which can be rewritten as:

P(n; \nu, N) = \frac{\nu^n}{n!}\, \frac{N(N-1)\cdots(N-n+1)}{N^n} \left(1 - \frac{\nu}{N}\right)^{N} \left(1 - \frac{\nu}{N}\right)^{-n} .   (2.47)

In the limit N → ∞, the first term, ν^n/n!, remains unchanged, while the remaining three terms tend to 1, e^{−ν} and 1, respectively. We can write the distribution of n in this limit, which is equal to the Poisson distribution:

P(n; \nu) = \frac{\nu^n\, e^{-\nu}}{n!} .   (2.48)
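The sketch below (an added numerical illustration with an arbitrary ν) shows how the binomial probabilities of Eq. (2.46) approach the Poisson probabilities of Eq. (2.48) as N grows.

```python
from math import comb, exp, factorial

def binom_prob(n, nu, N):
    """Eq. (2.46): binomial distribution with p = nu/N."""
    p = nu / N
    return comb(N, n) * p**n * (1 - p)**(N - n)

def poisson_prob(n, nu):
    """Eq. (2.48)."""
    return nu**n * exp(-nu) / factorial(n)

nu, n = 4.0, 2
for N in (10, 100, 1000, 10000):
    print(N, binom_prob(n, nu, N))
print("Poisson limit:", poisson_prob(n, nu))
```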
Poisson distributions have several interesting properties, some of which are listed in the following.
• For large ν, the Poisson distribution can be approximated with a Gaussian having average ν and standard deviation √ν.
• A binomial distribution characterized by a number of extractions N and a probability p ≪ 1 can be approximated with a Poisson distribution with average ν = pN (this was seen in Example 2.5 above).
• If we consider two variables n_1 and n_2 that follow Poisson distributions with averages ν_1 and ν_2 respectively, it is easy to demonstrate, using Eq. (1.30) and a bit of mathematics, that the sum n = n_1 + n_2 again follows a Poisson distribution with average ν_1 + ν_2. In other words:

P(n; \nu_1, \nu_2) = \sum_{n_1=0}^{n} \mathrm{Poiss}(n_1; \nu_1)\, \mathrm{Poiss}(n - n_1; \nu_2) = \mathrm{Poiss}(n; \nu_1 + \nu_2) .   (2.49)
This property follows from the fact that adding two uniform processes, similar to the one considered in the example above, again leads to a uniform process whose total rate is the sum of the two rates.
• Similarly, if we take a random fraction ε of the initial uniform sample, we again have a uniform sample that obeys the Poisson distribution. It can be demonstrated, in fact, that if we have a Poisson variable n_0 with expected rate ν_0 and we consider the variable n distributed according to a binomial distribution with probability ε and sample size n_0, the variable n is again distributed according to a Poisson distribution with average ν = ε ν_0. In other words:

P(n; \nu_0, \varepsilon) = \sum_{n_0=0}^{\infty} \mathrm{Poiss}(n_0; \nu_0)\, \mathrm{Binom}(n; n_0, \varepsilon) = \mathrm{Poiss}(n; \varepsilon\, \nu_0) .   (2.50)

This is the case in which, for instance, we count the number of events from a Poissonian process recorded by a detector whose efficiency ε is not ideal (ε < 1).
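The detector-efficiency case can be checked numerically; the following added sketch (with arbitrary ν₀ and ε) generates a Poisson sample, applies a binomial selection with probability ε, and compares the mean and variance of the selected counts with εν₀.

```python
import numpy as np

rng = np.random.default_rng(seed=7)

nu0, eps = 8.0, 0.6                      # illustrative rate and efficiency
n0 = rng.poisson(nu0, size=1_000_000)    # produced events per experiment
n = rng.binomial(n0, eps)                # events recorded by the detector

# For a Poisson-distributed n, mean and variance are both equal to eps*nu0.
print(n.mean(), n.var(), eps * nu0)
```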
Other PDFs that are used in the literature are presented in the following subsections. Though the list is not exhaustive, it covers the most commonly used PDF models.
The ARGUS collaboration introduced [1] a function that models many cases of combinatorial backgrounds where kinematical bounds produce a sharp edge. The Argus PDF is given by:

A(x; \theta, \xi) = N\, x \sqrt{1 - \left(\frac{x}{\theta}\right)^{2}}\; e^{-\frac{1}{2}\xi^{2}\left(1 - \left(\frac{x}{\theta}\right)^{2}\right)} ,   (2.51)
Fig. 2.8 Examples of Argus distributions for different values of the parameters θ and ξ
place of one of the two Gaussian tails, ensuring the continuity of the function and of its first derivative. The Crystal Ball PDF is defined as:

CB(x; \alpha, n, \bar{x}, \sigma) = N \cdot \begin{cases} \exp\!\left( -\dfrac{(x-\bar{x})^2}{2\sigma^2} \right) & \text{for } \dfrac{x-\bar{x}}{\sigma} > -\alpha \\[2ex] A \cdot \left( B - \dfrac{x-\bar{x}}{\sigma} \right)^{-n} & \text{for } \dfrac{x-\bar{x}}{\sigma} \le -\alpha \end{cases}   (2.54)

The parameter α determines the starting point of the power-law tail, measured in units of σ (the standard deviation of the Gaussian "core").
Examples of Crystal Ball distributions are shown in Fig. 2.9, where the parameter α has been varied while the parameters of the Gaussian core have been fixed at x̄ = 0, σ = 1 and the power-law exponent at n = 2.
Fig. 2.9 Crystal Ball distributions with x̄ = 0, σ = 1, n = 2 and different values of the tail parameter α
A model that describes the fluctuations of the energy loss of particles traversing a thin layer of matter is due to Landau [3, 4]. The distribution of the energy loss x is given by the following integral expression:

L(x) = \frac{1}{\pi} \int_{0}^{\infty} e^{-t \log t - x t}\, \sin(\pi t)\, \mathrm{d}t .   (2.56)
Fig. 2.11 Approximate visual demonstration of the central limit theorem using a Monte Carlo technique. A random variable x_1 is generated uniformly in the interval [−√3, √3[, in order to have average value μ = 0 and standard deviation σ = 1. The top-left plot shows the distribution of 10^5 random extractions of x_1; the other plots show 10^5 random extractions of (x_1 + x_2)/√2, (x_1 + x_2 + x_3)/√3 and (x_1 + x_2 + x_3 + x_4)/√4 respectively, where all x_i obey the same uniform distribution as x_1. A Gaussian curve with μ = 0 and σ = 1, normalized to match the sample size, is superimposed on the extracted sample in the four cases. The Gaussian approximation becomes better and better as a larger number of variables is added
Fig. 2.12 Same as Fig. 2.11, but using a PDF that is uniformly distributed in two disjoint intervals, [−3/2, −1/2[ and [1/2, 3/2[, in order to have average value μ = 0 and standard deviation σ = 1. The sums of 1, 2, 3, 4, 6 and 10 independent random extractions of such a variable, divided by √n, with n = 1, 2, 3, 4, 6, 10 respectively, are shown with a Gaussian distribution having μ = 0 and σ = 1 superimposed
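The Monte Carlo demonstration of Figs. 2.11 and 2.12 can be reproduced with a short added Python sketch; here only a few summary numbers are compared instead of drawing histograms, and the sample size is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(seed=5)
size = 100_000

def scaled_sum(n_vars):
    # Each x_i is uniform in [-sqrt(3), sqrt(3)) so that mu = 0 and sigma = 1.
    x = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(n_vars, size))
    return x.sum(axis=0) / np.sqrt(n_vars)

for n_vars in (1, 2, 4, 10):
    s = scaled_sum(n_vars)
    # The fraction within 1 sigma approaches the Gaussian value 0.6827.
    print(n_vars, s.mean(), s.std(), np.mean(np.abs(s) < 1))
```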
A typical problem in data analysis is to combine the effect of the detector response, e.g. due to finite resolution, with the theoretical distribution of some observable quantity x. If the original theoretical distribution is given by f(x) and, for a given true value x' of the observable x, the detector response distribution is given by r(x; x'), the PDF that describes the distribution of the measured value of x, taking into account both the original theoretical distribution and the detector response, is given by the convolution of the two PDFs f and r, defined as:

g(x) = \int f(x')\, r(x; x')\, \mathrm{d}x' .   (2.58)

\widehat{f \otimes g} = \hat{f}\, \hat{g} .   (2.61)

Conversely, the Fourier transform of the product of two PDFs is equal to the convolution of the two Fourier transforms, i.e.:

\widehat{f g} = \hat{f} \otimes \hat{g} .   (2.62)

r(x; x') = g(x - x') = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x - x')^2}{2\sigma^2}} .   (2.63)
The Fourier transform of a Gaussian PDF, like the one in Eq. (2.63), can be computed analytically and is given by:

\hat{g}(k) = e^{i k \mu}\, e^{-\frac{\sigma^2 k^2}{2}} .   (2.64)
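A numerical smearing can be performed directly from Eq. (2.58); the added sketch below convolves an exponential "theoretical" distribution with a Gaussian resolution on a grid, using a plain discretized integral (the distributions and the parameter values are arbitrary illustrations).

```python
import numpy as np

# Grid for the observable x.
x = np.linspace(-5, 20, 2001)
dx = x[1] - x[0]

# "Theoretical" PDF f(x): exponential with lifetime tau (zero for x < 0).
tau = 3.0
f = np.where(x >= 0, np.exp(-x / tau) / tau, 0.0)

# Gaussian detector response r(x; x') = g(x - x') with resolution sigma.
sigma = 0.5

# Discretized convolution g(x) = sum_j f(x_j) r(x; x_j) dx, Eq. (2.58).
g = np.array([
    np.sum(f * np.exp(-(xi - x) ** 2 / (2 * sigma**2))
           / (np.sqrt(2 * np.pi) * sigma)) * dx
    for xi in x
])

print(np.sum(f) * dx, np.sum(g) * dx)   # both ~1: normalization is preserved
```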
Probability densities can be defined in spaces with more than one dimension. In the simplest case of two dimensions, the PDF f(x, y) measures the probability density per unit area, i.e. the ratio of the probability dP corresponding to an infinitesimal interval around a point (x, y) and the area dx dy of the interval:

\frac{\mathrm{d}P}{\mathrm{d}x\, \mathrm{d}y} = f(x, y) .   (2.65)

In three dimensions the PDF measures the probability density per unit volume:

\frac{\mathrm{d}P}{\mathrm{d}x\, \mathrm{d}y\, \mathrm{d}z} = f(x, y, z) ,   (2.66)

Given a two-dimensional PDF f(x, y), the marginal distributions are the probability distributions of the two variables x and y, and can be determined by integrating f(x, y) over the other coordinate, y and x respectively:

f_x(x) = \int f(x, y)\, \mathrm{d}y ,   (2.67)

f_y(y) = \int f(x, y)\, \mathrm{d}x .   (2.68)
The above expressions can also be seen as a special case of a continuous variable transformation, as described in Sect. 2.4, where the applied transformation maps the two variables into one of the two: (x, y) → x or (x, y) → y.
More in general, if we have a PDF in n = h + k variables (x⃗, y⃗) = (x_1, ..., x_h, y_1, ..., y_k), the marginal PDF of the subset of h variables (x_1, ..., x_h) can be determined by integrating the PDF f(x⃗, y⃗) over the remaining set of variables (y_1, ..., y_k):

f_{\vec{x}}(\vec{x}) = \int f(\vec{x}, \vec{y})\, \mathrm{d}^k y .   (2.69)
A pictorial view that illustrates the interplay between the joint distribution f(x, y) and the marginal distributions f_x(x) and f_y(y) is shown in Fig. 2.13. Let's remember that, according to Eq. (1.6), two events A and B are independent if P(A ∩ B) = P(A) P(B). In the case shown in Fig. 2.13 we can consider as events A and B the cases where the two variables x̂ and ŷ are extracted in the intervals [x, x + δx[ and [y, y + δy[, respectively:

A = \{ x \le \hat{x} < x + \delta x \} \quad \text{and} \quad B = \{ y \le \hat{y} < y + \delta y \} .

So, we have:

P(A \cap B) \simeq f(x, y)\, \delta x\, \delta y .

Since:

P(A) \simeq f_x(x)\, \delta x

and:

P(B) \simeq f_y(y)\, \delta y ,

we have:

P(A)\, P(B) \simeq f_x(x)\, f_y(y)\, \delta x\, \delta y .

The equality P(A ∩ B) = P(A) P(B) holds if and only if f(x, y) can be factorized into the product of the two marginal PDFs:

f(x, y) = f_x(x)\, f_y(y) .
From this result, we can say that x and y are independent variables if their joint
PDF can be written as the product of a PDF of the variable x times a PDF of the
variable y.
More in general, n variables x_1, ..., x_n are said to be independent if their n-dimensional PDF can be factorized into the product of n one-dimensional PDFs in each of the variables:

f(x_1, \cdots, x_n) = f_1(x_1) \cdots f_n(x_n) .

In a less strict sense, the variable sets x⃗ = (x_1, ..., x_n) and y⃗ = (y_1, ..., y_m) are independent if:

f(\vec{x}, \vec{y}) = f_{\vec{x}}(\vec{x})\, f_{\vec{y}}(\vec{y}) .

Note that if two variables x and y are independent, it can easily be demonstrated that they are also uncorrelated, in the sense that their covariance (Eq. (1.22)) is null. Conversely, if two variables are uncorrelated, they are not necessarily independent, as shown in Example 2.6 below.
Fig. 2.14 Example of a PDF of two variables x and y that are uncorrelated but are not independent
\langle z \rangle = \mu , \qquad \langle z^2 \rangle = \sigma^2 + \mu^2 ,

it is easy to demonstrate that, for x and y obeying the PDF in Eq. (2.79), the following relations also hold:

\langle x \rangle = \langle y \rangle = 0 , \qquad \langle x^2 \rangle = \langle y^2 \rangle = \frac{\sigma^2 + \mu^2}{2} , \qquad \langle x y \rangle = 0 .
Given a two-dimensional PDF f(x, y) and a fixed value x_0 of the variable x, the conditional PDF of y given x_0 is defined as:

f(y | x_0) = \frac{f(x_0, y)}{\int f(x_0, y')\, \mathrm{d}y'} .   (2.80)

In more than two dimensions, the conditional PDF of the variables y⃗, given x⃗_0, can be written as:

f(\vec{y}\, |\, \vec{x}_0) = \frac{f(\vec{x}_0, \vec{y})}{\int f(\vec{x}_0, \vec{y}{}')\, \mathrm{d}y'_1 \cdots \mathrm{d}y'_k} .   (2.81)
Fig. 2.15 A two-dimensional PDF f(x, y) and the conditional PDF f(y|x_0) for a fixed value x_0 of x
Let’s apply a rotation from .x0 ; y0 / to .x; y/ with an angle defined by:
x0 D x cos y sin
: (2.83)
y0 D x sin C y cos
Theˇ transformed
ˇ PDF g.x; y/ can be obtained using Eq. (2.20) considering that
det ˇ@x0i =@xj ˇ D 1, which leads to g0 .x0 ; y0 / D g.x; y/. g.x; y/ has the form:
11 1 x
g.x; y/ D 1
exp .x; y/C ; (2.84)
2 jCj 2 2 y
where the matrix C^{-1} is the inverse of the covariance matrix of the variables (x, y). C^{-1} can be obtained by comparing Eqs. (2.82) and (2.84), from which one can substitute the rotated coordinates defined in Eq. (2.83) in the following equation:

\frac{x'^2}{\sigma_{x'}^2} + \frac{y'^2}{\sigma_{y'}^2} = (x, y)\, C^{-1} \begin{pmatrix} x \\ y \end{pmatrix} ,   (2.85)
leading to:

C^{-1} = \begin{pmatrix} \dfrac{\cos^2\phi}{\sigma_{x'}^2} + \dfrac{\sin^2\phi}{\sigma_{y'}^2} & \sin\phi \cos\phi \left( \dfrac{1}{\sigma_{y'}^2} - \dfrac{1}{\sigma_{x'}^2} \right) \\[2ex] \sin\phi \cos\phi \left( \dfrac{1}{\sigma_{y'}^2} - \dfrac{1}{\sigma_{x'}^2} \right) & \dfrac{\sin^2\phi}{\sigma_{x'}^2} + \dfrac{\cos^2\phi}{\sigma_{y'}^2} \end{pmatrix} .   (2.86)
The determinant of C^{-1}, which appears in the denominator of Eq. (2.84) under a square root, is:

|C^{-1}| = \frac{1}{\sigma_{x'}^2\, \sigma_{y'}^2} = \frac{1}{\sigma_x^2\, \sigma_y^2\, (1 - \rho_{xy}^2)} ,   (2.87)

where ρ_xy is the correlation coefficient defined in Eq. (1.23). Inverting the matrix C^{-1}, we can get the covariance matrix of the rotated variables (x, y):

C = \begin{pmatrix} \cos^2\phi\, \sigma_{x'}^2 + \sin^2\phi\, \sigma_{y'}^2 & \sin\phi \cos\phi \left( \sigma_{y'}^2 - \sigma_{x'}^2 \right) \\[1ex] \sin\phi \cos\phi \left( \sigma_{y'}^2 - \sigma_{x'}^2 \right) & \sin^2\phi\, \sigma_{x'}^2 + \cos^2\phi\, \sigma_{y'}^2 \end{pmatrix} .   (2.88)
The last of these relations, Eq. (2.92), implies that the correlation coefficient is equal to zero if either σ_{y'} = σ_{x'} or if φ is a multiple of π/2. We also have the following relation that gives the rotation angle φ in terms of the elements of the covariance matrix:

\tan 2\phi = \frac{2\, \rho_{xy}\, \sigma_x \sigma_y}{\sigma_y^2 - \sigma_x^2} .   (2.93)
g(x, y) = \frac{1}{2\pi\, \sigma_x \sigma_y \sqrt{1-\rho_{xy}^2}} \exp\!\left\{ -\frac{1}{2(1-\rho_{xy}^2)} \left[ \frac{x^2}{\sigma_x^2} + \frac{y^2}{\sigma_y^2} - \frac{2\,\rho_{xy}\, x y}{\sigma_x \sigma_y} \right] \right\} .   (2.94)
The geometrical interpretation of σ_x and σ_y in the rotated coordinate system is shown in Fig. 2.16, where the one-sigma ellipse obtained from the following equation is drawn:

\frac{x^2}{\sigma_x^2} + \frac{y^2}{\sigma_y^2} - \frac{2\, x y\, \rho_{xy}}{\sigma_x \sigma_y} = 1 .   (2.95)

It is possible to demonstrate that the horizontal and vertical lines tangent to the ellipse have distances from their respective axes equal to σ_y and σ_x, respectively.
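The relations between (σ_{x'}, σ_{y'}, φ) and (σ_x, σ_y, ρ_xy) can be checked numerically; in the added sketch below the primed standard deviations and the rotation angle are arbitrary values, and the covariance matrix is built from Eq. (2.88).

```python
import numpy as np

sx_p, sy_p, phi = 1.0, 2.5, 0.4   # sigma_x', sigma_y', rotation angle (illustrative)

c, s = np.cos(phi), np.sin(phi)
# Covariance matrix of the rotated variables (x, y), Eq. (2.88).
C = np.array([
    [c**2 * sx_p**2 + s**2 * sy_p**2, s * c * (sy_p**2 - sx_p**2)],
    [s * c * (sy_p**2 - sx_p**2),     s**2 * sx_p**2 + c**2 * sy_p**2],
])

sx, sy = np.sqrt(C[0, 0]), np.sqrt(C[1, 1])
rho = C[0, 1] / (sx * sy)

# Eq. (2.93): tan(2*phi) = 2*rho*sx*sy / (sy^2 - sx^2); both numbers agree.
print(np.tan(2 * phi), 2 * rho * sx * sy / (sy**2 - sx**2))
```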
Fig. 2.16 Plot of the one-sigma contour for a two-dimensional Gaussian PDF. The two ellipse axes are equal to σ_{x'} and σ_{y'}; the x' axis is rotated by an angle φ with respect to the x axis, and the lines tangent to the ellipse parallel to the x and y axes have distances from the respective axes equal to σ_y and σ_x, respectively
Fig. 2.17 Plot of the two-dimensional one- and two-sigma Gaussian contours
where:

E_Z = \left\{ (x, y) :\ \frac{x^2}{\sigma_x^2} + \frac{y^2}{\sigma_y^2} - \frac{2\, x y\, \rho_{xy}}{\sigma_x \sigma_y} \le Z^2 \right\} .   (2.99)
The probabilities corresponding to 1σ, 2σ and 3σ are reported for the one- and two-dimensional cases in Table 2.1. The two-dimensional integrals are in all cases smaller than the corresponding one-dimensional ones for a given Z. In particular, to recover the same probability as the one-dimensional interval, one has to enlarge the two-dimensional ellipse from 1σ to 1.515σ, from 2σ to 2.486σ, and from 3σ to 3.439σ.
Figure 2.17 shows the 1σ and 2σ contours for a two-dimensional Gaussian, and three possible choices of 1σ and 2σ one-dimensional bands, two along the x and y axes and one along a third, generic oblique direction.
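For a two-dimensional Gaussian, the probability inside the Zσ ellipse is 1 − e^{−Z²/2}, while in one dimension it is erf(Z/√2); the added sketch below compares the two, reproducing the kind of numbers collected in Table 2.1.

```python
from math import erf, exp, sqrt

for Z in (1.0, 2.0, 3.0):
    p_1d = erf(Z / sqrt(2))          # probability within Z sigma in 1D
    p_2d = 1.0 - exp(-Z**2 / 2.0)    # probability inside the Z-sigma ellipse in 2D
    print(Z, round(p_1d, 4), round(p_2d, 4))
# 1 sigma: 0.6827 vs 0.3935;  2 sigma: 0.9545 vs 0.8647;  3 sigma: 0.9973 vs 0.9889
```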
The generalization to n dimensions of the two-dimensional Gaussian described in Eq. (2.94) is:

g(x_1, \cdots, x_n) = \frac{1}{(2\pi)^{\frac{n}{2}}\, |C|^{\frac{1}{2}}} \exp\!\left[ -\frac{1}{2} \sum_{i,j} (x_i - \mu_i)\, C^{-1}_{ij}\, (x_j - \mu_j) \right] ,   (2.102)

where μ_i is the average of the variable x_i and C_ij is the n × n covariance matrix of the variables x_1, ..., x_n.
References
1. ARGUS collaboration, H. Albrecht et al., Search for hadronic b → u decays. Phys. Lett. B241, 278–282 (1990)
2. J. Gaiser, Charmonium Spectroscopy from Radiative Decays of the J/ψ and ψ′. Ph.D. thesis, Stanford University, 1982. Appendix F
3. L. Landau, On the energy loss of fast particles by ionization. J. Phys. (USSR) 8, 201 (1944)
4. W. Allison, J. Cobb, Relativistic charged particle identification by energy loss. Annu. Rev. Nucl.
Part. Sci. 30, 253–298 (1980)
5. R. Bracewell, The Fourier Transform and Its Applications, 3rd edn. (McGraw-Hill, New York, 1999)
Chapter 3
Bayesian Approach to Probability
P(A|B) = \frac{P(A \cap B)}{P(B)} .   (3.1)

We can conversely write the probability of the event B given the event A as:

P(B|A) = \frac{P(A \cap B)}{P(A)} .   (3.2)

This situation is visualized in Fig. 3.1. Extracting from Eqs. (3.1) and (3.2) the common term P(A ∩ B), we obtain:

P(A|B)\, P(B) = P(B|A)\, P(A) ,   (3.3)
Fig. 3.1 Visualization of the conditional probabilities P(A|B) and P(B|A), due to Robert Cousins. The events A and B are represented as subsets of a sample space Ω

from which Bayes' theorem can be derived in the following form:

P(A|B) = \frac{P(B|A)\, P(A)}{P(B)} .   (3.4)
The probabilities P(A) and P(A|B) can be interpreted as the probability of the event A before knowing that the event B has occurred (prior probability) and as the probability of the same event A given the further information that the event B has occurred (posterior probability).
A visual derivation of Bayes' theorem, due to Robert Cousins, is presented in Fig. 3.2.

Fig. 3.2 Visualization of Bayes' theorem, due to Robert Cousins. The areas of the events A and B, P(A) and P(B) respectively, simplify as we multiply P(A|B)P(B) and P(B|A)P(A), giving in both cases P(A ∩ B)
What we want to know is the probability that a person is really ill after having received a positive diagnosis, i.e. P(ill|+). Using Bayes' theorem, we can "invert" the conditional probability as follows:

P(\mathrm{ill}\,|+) = \frac{P(+\,|\,\mathrm{ill})\, P(\mathrm{ill})}{P(+)} ,   (3.9)

and, since we assumed P(+ | ill) ≃ 1:

P(\mathrm{ill}\,|+) \simeq \frac{P(\mathrm{ill})}{P(+)} .   (3.10)

and:

P(\mathrm{ill}\,|+) = \frac{P(\mathrm{ill})}{P(+)} \simeq \frac{P(\mathrm{ill})}{P(\mathrm{ill}) + P(+\,|\,\mathrm{healthy})} .   (3.14)

If we assume that P(ill) is smaller than P(+ | healthy), then P(ill | +) will turn out to be smaller than 50 %. For instance, if P(ill) = 0.15 %, compared with the assumption that P(+ | healthy) = 0.2 %, then:

P(\mathrm{ill}\,|+) = \frac{0.15}{0.15 + 0.20} = 43\,\% .   (3.15)
3.1 Bayes’ Theorem 57
P(+|healthy)
P(+|ill)
P(−|healthy)
P(−|ill)
P(ill) P(healthy)
Fig. 3.3 Visualization of the ill/healthy case considered in the Example 3.7. The red areas
correspond to the cases of a positive diagnosis for a ill person (P.Cjill/, vertical red area) and
a positive diagnosis for a healthy person (P.Cjhealthy/, horizontal red area). The probability of
being really ill in the case of a positive diagnosis, P.illjC/, is equal to the ratio of the vertical red
area and the total red area. In the example it was assumed that P.jill/ ' 0
The probability of being really ill, given the positive diagnosis, is very different from the naïve conclusion that one is most likely really ill. The situation can be visualized, changing the proportions a bit for a better presentation, in Fig. 3.3.
Having a high probability of a positive diagnosis in case of illness does not imply that a positive diagnosis turns into a high probability of being really ill. The correct answer, in this case, also depends on the prior probability for a random person in the population to be ill, P(ill). Bayes' theorem allows one to compute the posterior probability P(ill | +) in terms of the prior probability and the probability of a positive diagnosis for an ill person, P(+ | ill).
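The numbers of this example are easy to reproduce; the added sketch below evaluates the posterior probability of Eqs. (3.14)–(3.15) for the quoted priors, assuming P(+ | ill) = 1 as in the text.

```python
def posterior_ill(p_ill, p_pos_given_healthy, p_pos_given_ill=1.0):
    """Bayes' theorem: P(ill | +) = P(+|ill) P(ill) / P(+)."""
    p_healthy = 1.0 - p_ill
    p_pos = p_pos_given_ill * p_ill + p_pos_given_healthy * p_healthy
    return p_pos_given_ill * p_ill / p_pos

# P(ill) = 0.15 %, P(+|healthy) = 0.2 %: the posterior is only about 43 %.
print(posterior_ill(p_ill=0.0015, p_pos_given_healthy=0.002))
```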
Consider, for instance, a muon detector that gives a positive signal for real muons with an efficiency ε = P(+|μ) and gives a false positive signal for pions with a probability δ = P(+|π). Given a collection of particles that can be either muons or pions, what is the probability that a selected particle is really a muon, i.e. P(μ|+)? As before, we can't give an answer unless we also provide the prior probability, i.e. the probability that a random particle from the sample is really a muon or a pion. In other words, we should know P(μ) and P(π) = 1 − P(μ). Using Bayes' theorem, together with Eq. (1.12), we can write, as in the example from the previous section:

P(\mu|+) = \frac{P(+|\mu)\, P(\mu)}{P(+)} = \frac{P(+|\mu)\, P(\mu)}{P(+|\mu)\, P(\mu) + P(+|\pi)\, P(\pi)} ,   (3.16)

or, in terms of the fraction of muons f_μ = P(μ) and the fraction of pions f_π = P(π) in the original sample, we can write the purity of the sample, defined as the fraction of muons in the sample of selected particles, as:

\Pi_\mu = f_{\mu,+} = \frac{\varepsilon\, f_\mu}{\varepsilon\, f_\mu + \delta\, f_\pi} .   (3.17)

The above expression also holds if more than two possible particle types are present in the sample, and does not require computing the denominator present in Eq. (3.16), which would be needed, instead, in order to compute probabilities related to all possible particle cases.
So far, Bayes’ theorem has been applied in cases that also fall under the frequentist
domain. Let’s now consider again the following formulation of Bayes’ theorem, as
from Eq. (3.4):
P.BjA/P.A/
P.AjB/ D : (3.19)
P.B/
3.2 Bayesian Probability Definition 59
We can interpret it as follows: before the occurrence of the event B (or before knowing that B is true), our degree of belief in the event A is equal to the prior probability P(A). After the occurrence of the event B (or after we know that B is true), our degree of belief in the event A changes and becomes equal to the posterior probability P(A|B).
Using this interpretation, we can extend the scope of the definition of probability, in this new Bayesian sense, to events that are not associated to random variables, but represent statements about unknown facts, like "my football team will win the next match", or "the mass of a dark-matter candidate particle is between 1000 and 1500 GeV". We can in fact consider a prior probability P(A) of such an unknown statement, representing a measurement of our prejudice about that statement, before the occurrence of any event that could modify our knowledge. After the event B has occurred (and we know that B has occurred), our knowledge of A must change in order to take into account the fact that B has occurred, and our degree of belief should be modified, becoming equal to the posterior probability P(A|B). In other words, Bayes' theorem tells us how we should change our subjective degree of belief from an initial prejudice, considering newly available information, according to a quantitative and rational method. However, starting from different priors (i.e. different prejudices), different posteriors will be determined.
The term P(B) that appears in the denominator of Eq. (3.4) can be considered a normalization factor. If we can decompose the sample space Ω into a partition A_1, ..., A_n, where ∪_{i=1}^{n} A_i = Ω and A_i ∩ A_j = ∅ for all i ≠ j, we can write, according to the law of total probability in Eq. (1.12):

P(B) = \sum_{i=1}^{n} P(B|A_i)\, P(A_i) .   (3.20)
This decomposition was already used in the examples discussed in the previous
sections.
The Bayesian definition of probability obeys Kolmogorov’s axioms of probabil-
ity, as defined in Sect. 1.4, hence all properties of classical probability discussed in
Chap. 1 also apply to Bayesian probability.
An intrinsic and unavoidable feature of Bayesian probability is that the probability associated with an event A can't be defined without a prior probability of that event, which makes Bayesian probability intrinsically subjective.

This corresponds to a person who believes that A_0 is absolutely true and that all other alternatives A_i are absolutely false for i ≠ 0. Whatever event B occurs, that person's posterior probability for any A_i will not differ from the prior probability:

P(A_i|B) = \frac{P(B|A_i)\, P(A_i)}{P(B)} .   (3.23)

But, if i ≠ 0, clearly:

P(A_i|B) = \frac{P(B|A_i) \cdot 0}{P(B)} = 0 = P(A_i) ,   (3.24)

while, for i = 0:

P(A_0|B) = \frac{P(B|A_0) \cdot 1}{\sum_{i=1}^{n} P(B|A_i)\, P(A_i)} = \frac{P(B|A_0)}{P(B|A_0) \cdot 1} = 1 = P(A_0) .   (3.25)
This situation reflects the case that we may call dogma, or religious belief, i.e. the case in which someone has such strong prejudices about the A_i that no event B, i.e. no new knowledge, can change his/her degree of belief.
The scientific method has allowed mankind's knowledge of Nature to evolve during history by progressively adding more knowledge based on the observation of reality. The history of science is full of examples in which theories known to be true have been falsified by more precise observations and replaced by new, better theories.
According to Eq. (3.22), instead, scientific progress is not possible in the presence of religious beliefs about observable facts.

parameters θ_1, ..., θ_m:

L(x_1, \cdots, x_n; \theta_1, \cdots, \theta_m) = \left. \frac{\mathrm{d}^n P(x_1, \cdots, x_n)}{\mathrm{d}x_1 \cdots \mathrm{d}x_n} \right|_{\theta_1, \cdots, \theta_m} .   (3.26)
Adding more measurements would allow applying Bayes' theorem repeatedly. This possibility allows interpreting the application of Bayes' theorem as a learning process, where one's knowledge about an unknown parameter is influenced and improved by the subsequent observations x_1, x_2, etc.
The more measurements we add, the more the final posterior probability P_n(θ) will be insensitive to the choice of the prior probability P_0(θ) = π(θ), because the range in which L(x_1, ..., x_n | θ) is significantly different from zero becomes smaller and smaller, hence the prior π(θ) can be approximated by a constant value within this small range, assuming we take a reasonably smooth function.
In this sense, a sufficiently large number of observations may remove, asymptotically, any arbitrariness in the posterior Bayesian probability, assuming that the prior is a sufficiently smooth and regular function (e.g. not the extreme case mentioned in Example 3.9).
Similarly, the average value of θ⃗ can also be determined from the same posterior PDF.
In particular, if we assume a uniform prior distribution (which is not necessarily the most natural choice), the most likely value, in the Bayesian approach, coincides with the maximum-likelihood estimate, since the posterior PDF is equal, up to a normalization constant, to the product of the likelihood function and the prior PDF, which is assumed to be constant:

P(\vec{\theta}\,|\,\vec{x})\Big|_{\pi(\vec{\theta}) = \mathrm{const.}} = \frac{L(\vec{x}; \vec{\theta})}{\int L(\vec{x}; \vec{\theta}{}')\, \mathrm{d}^h\theta'} .   (3.33)

This result, of course, does not necessarily hold in the case of a non-uniform prior PDF.
If we are interested only in a subset of the parameters, say we have a subset of parameters of interest, θ⃗ = (θ_1, ..., θ_h), and the remaining parameters, ν⃗ = (ν_1, ..., ν_l), which are needed to model our PDF but should not appear among our final results, the posterior PDF can be written as:

P(\vec{\theta}, \vec{\nu}\,|\,\vec{x}) = \frac{L(\vec{x}; \vec{\theta}, \vec{\nu})\, \pi(\vec{\theta}, \vec{\nu})}{\int L(\vec{x}; \vec{\theta}{}', \vec{\nu}{}')\, \pi(\vec{\theta}{}', \vec{\nu}{}')\, \mathrm{d}^h\theta'\, \mathrm{d}^l\nu'} ,   (3.34)

and the posterior PDF for the parameters θ⃗ can be obtained as the marginal PDF, integrating Eq. (3.34) over all the remaining parameters ν⃗:

P(\vec{\theta}\,|\,\vec{x}) = \int P(\vec{\theta}, \vec{\nu}\,|\,\vec{x})\, \mathrm{d}^l\nu = \frac{\int L(\vec{x}; \vec{\theta}, \vec{\nu})\, \pi(\vec{\theta}, \vec{\nu})\, \mathrm{d}^l\nu}{\int L(\vec{x}; \vec{\theta}{}', \vec{\nu}{}')\, \pi(\vec{\theta}{}', \vec{\nu}{}')\, \mathrm{d}^h\theta'\, \mathrm{d}^l\nu'} .   (3.35)
sn ns
.s/
P.sjn/ D Z 1 nŠ : (3.36)
sn ns
.s/ ds
0 nŠ
hence we have:
sn es
P.sjn/ D ; (3.38)
nŠ
which has the same expression of the original Poisson distribution, but this
time it is interpreted as the posterior PDF of the unknown parameter s, given
the observation n. Using Eq. (3.38), we can determine that the most
probable value of s (according to the posterior in Eq. (3.38)) as:
sO D n : (3.39)
We also have:
hsi D n C 1 ; (3.40)
VarŒs D n C 1 : (3.41)
Note that the most probable value sO is different from the average value hsi.
Those results depend of course on the choice of the prior .s/. We took
a uniform distribution for .s/, but this is of course one of many possible
choices.
1
Considering the prior choice due to Jeffreys, discussed in Sect. 3.7, a constant prior could
not be the most “natural”
p choice. Using Jeffreys’ prior in the case of a Poisson distribution
would give .s/ / s.
3.5 Bayes Factors 65
As seen at the end the Example 3.8, there is a convenient way to compare the
probability of two hypotheses using Bayes theorem which does not require the
knowledge of all possible hypotheses that can be considered. In the Bayesian
approach to probability, this can constitute an alternative to the hypothesis test (that
will be introduced in Chap. 7) adopted under the frequentist approach.
Using the Bayes theorem, one can write the ratio of posterior probabilities
evaluated under two hypotheses H0 and H1 , given our observation Ex, also called
Bayes factor [4], as:
where .H1 / and .H0 / are the priors for the two hypotheses.
The Bayes factor can be used as alternative to significance level (see Sect. 8.2)
adopted under the frequentist approach in order to determine the evidence of one
hypothesis H1 (e.g.: a signal due to a new particle is present in our data sample)
against a null hypothesis H0 (e.g.: no signal due to a new particle is present in our
data sample). The proposed scale for Bayes factor to assess the evidence of H1
against H0 , as reported in [4], are presented in Table 3.1.
Introducing the likelihood function, the Bayes factor can also be written as:
R
L.ExjH1 ; E1 / .E1 / dE1
B1=0 D R ; (3.43)
L.ExjH0 ; E0 / .E0 / dE0
where the parameters E1 and E0 required under the two hypotheses have been
introduced. As usual in Bayesian applications, the computation of the Bayes factors
requires integrations that in most of the realistic cases can be performed using
computer algorithms.
This subjectiveness in the choice of the prior PDF is intrinsic to the Bayesian
approach and raises criticism by supporters of the frequentist approach, which
object that results obtained under the Bayesian approach are to some extent
subjective. Supporters of the Bayesian approach reply that Bayesian result are
intersubjective [5], in the sense that common prior choices lead to common results.
The debate is in some cases still open, and literature still contains opposite opinions
about this issue.
One possible approach to the choice of prior PDFs has been proposed by Harold
Jeffreys [6]. He proposed to adopt a choice of prior PDF in a form which results
invariant under parameter transformation. Jeffreys’ choice is, up to a normalization
factor, given by:
q
E /
p./ E ;
I./ (3.44)
It is not difficult to demonstrate that Jeffreys’ prior does not change when changing
parametrization, i.e.: transforming E ! E0 D E0 ./.
E
Jeffreys’ prior corresponding to the parameters of some of the most frequently
used PDFs are given in Table 3.2. Note that only the case of the mean of a Gaussian
gives a uniform Jeffreys’ prior.
Table 3.2 Jeffreys’ priors corresponding to the parameters of some of the most frequently used
PDFs
PDF parameter Jeffreys’ prior
p
Poissonian mean p./ / 1
p
Poissonian signal mean with a background b p./ / 1 C b
Gaussian mean p./ / 1
Gaussian standard deviation p. / / 1=
p
Binomial success fraction " p."/ / 1= ".1 "/
68 3 Bayesian Approach to Probability
Given f 0 .x0 ; y0 /, one can determine again, for the transformed parameters, the most
likely values, the average, and probability intervals. The generalization to more
variables is straightforward.
Something to be noted is that the most probable values .Ox; yO /, i.e: the values that
maximize f .x; y/, do not necessarily map into values .X.Ox/; Y.Oy// that maximize
f 0 .x0 ; y0 /. This issue is also present in the frequentist approach when dealing with
the propagation of asymmetric uncertainties, and will be discussed in Sect. 5.9. Sec-
tion 5.8 will discuss how to propagate errors in the case of parameter transformation
using a linear approximation, and under this simplified assumption, which is not
always a sufficient approximation, one may assume that values that maximize f
map into values that maximize f 0 , i.e.: .Ox0 ; yO 0 / D .X.Ox/; Y.Oy//.
References
Many computer application, ranging from simulations to video games and 3D-
graphics application, take advantage of computer-generated numeric sequences
having properties very similar to truly random variables. Sequences generated by
computer algorithms through mathematical operations are not really random, having
no intrinsic unpredictability, and are necessarily deterministic and reproducible.
Indeed, it is often a good feature for many application the possibility to reproduce
exactly the same sequence of computer-generated numbers given by a computer
algorithm.
Good algorithms that generate “random” numbers, or better, pseudorandom
numbers, given their reproducibility, must obey, in the limit of large numbers, to
the desired statistical properties of real random variables.
Numerical methods involving the repeated use of computer-generated pseudo-
random numbers are often referred to as Monte Carlo methods, from the name of
the city hosting the famous casino, which exploits the properties of (truly) random
numbers to generate profit.
Depending on the value of , the sequence may have very different possible
behaviors. If the sequence converges to a single asymptotic value x for n ! 1,
we would have:
lim xn D x ; (4.4)
n!1
x D x.1 x/ : (4.5)
x1 D x2 .1 x2 / ; (4.6)
x2 D x1 .1 x1 / : (4.7)
p
For larger values, up to 1 C 6, the sequences oscillates between four values,
and further bifurcations occur for larger values of , until it achieves a very
complex and poorly predictable behavior. For D 4 the sequence finally
densely covers the interval 0; 1Œ. The PDF corresponding to the sequence
with D 4 can be demonstrated to be a beta distribution with parameters
˛ D ˇ D 0:5, where the beta distribution is defined as:
x˛1 .1 x/ˇ1
f .xI ˛; ˇ/ D R ˛1 : (4.8)
0 .1 u/ˇ1 du
The behavior of the logistic map for different values of is shown in Fig. 4.1.
4.3 Uniform Random Number Generators 71
1.0
0.8
0.6
x
0.4
0.2
0.0
2.4 2.6 2.8 3.0 3.2 3.4 3.6 3.8 4.0
λ
Fig. 4.1 Logistic map [2]
The most widely used computer-based pseudorandom number generators are conve-
niently written in such a way to produce sequences of uniformly distributed numbers
ranging from zero to one.1 Starting from uniform random number generators most
of the other distribution of interest can be derived using specific algorithms, some
of which are described in the following.
The period of a random sequence, i.e.: the number of extractions after which the
sequence will repeat itself, should be as high as possible, and anyway higher than
the number of random numbers that will be used in the specific application.
For instance, the function lrand48 [3], which is a standard of C programming
language, is defined according to the following algorithm:
m D 248 (4.10)
a D 25214903917 D 5DEECE66D hex (4.11)
c D 11 D B hex (4.12)
1
In realistic cases of finite numeric precision, one of the extreme values is excluded. It would have
a corresponding zero probability, in the case of infinite precision, but it is not the case with finite
machine precision.
72 4 Random Numbers and Monte Carlo Methods
The sequences obtained from Eq. (4.9) for given initial values x0 are distributed
uniformly between 0 and 248 1. This corresponds to sequences of random bits
that can be used to return 32-bits integer numbers or can be mapped into sequences
of floating-point numbers uniformly distributed in Œ0; 1Œ, as done by the C function
drand48. The value x0 is called seed of the random sequence. Choosing different
initial seeds produces different sequences. In this way one can repeat a computer-
simulated experiment using different random sequences each time changing the
initial seed and obtaining different results that simulate the statistical fluctuation
of the simulated experiment.
Similarly to lrand48, the gsl_rng_rand [4] generator of the BSD rand
function uses the same algorithm but with a D 41C64E6D hex, c D 3039 hex and
m D 231 . The period of lrand48 is about 248 , while gsl_rng_rand has a lower
period of about 231 .
A popular random generator that offers good statistical properties is due to
Lüscher [5], implemented by James in the RANLUX generator [6], whose period
is of the order of 10171 . RANLUX is now considered relatively slower than other
algorithms, like the L’Ecuyer generator [7], which has a period of 288 , or the
Mersenne-Twistor generator [8] which has a period of 219937 1 and is relatively
faster than L’Ecuyer’s generator.
x0 D a C x.b a/ : (4.13)
We already saw in Chap. 1 that, using the central limit theorem, the sum of N
random variables, each having a finite variance, is distributed, in the limit N ! 1,
according to a Gaussian distribution. For a finite but sufficiently large value of N we
can extract N values
p p x1 ; ; xN from a uniform random generator and remap them
from Œ0; 1Œ to Œ 3; 3Œ using Eq. (4.13), so that the average and variance of each
xi are zero and one respectively. Then, we can compute:
x1 C C xN
xD p ; (4.14)
N
which has again average equal to zero and variance equal to one. Figure 2.11 shows
the distribution of x for N up to four, giving approximately
p p a Gaussian distribution,
that is necessarily truncated in the range Œ 3N; 3N by construction. This is
anyway not the most effective way to generate approximately Gaussian-distributed
pseudorandom numbers. A better algorithm will be described at the end of this
section.
x D F 1 .r/ (4.16)
r D F.x/ ; (4.17)
we have:
dF
dr D dx D f .x/ dx : (4.18)
dx
74 4 Random Numbers and Monte Carlo Methods
dP dP
D f .x/ : (4.19)
dx dr
dP
D f .x/ ; (4.20)
dx
which means that x follows the desired PDF. This method only works conveniently
if the cumulative F.x/ can be easily computed and inverted using either analytical
or numerical methods. If not, usually this algorithm may be very slow.
1 ex D r ; (4.23)
If the extraction of r happens in the interval Œ0; 1Œ, like drand48, r D 1 will
never be extracted, hence the argument of the logarithm will never be equal to
zero, ensuring the numerical validity of Eq. (4.24).
4.4 Non Uniform Random Number Generators 75
dP dP
D D k; (4.25)
d sin d d
where k is a normalization constant such that the PDF integrates to unity over
the entire solid angle, 4 . From Eq. (4.25), the combined two-dimensional
PDF can be factorized into the product of two PDFs, as functions of and
(i.e.: and are independent):
dP
D f ./g./ D k sin ; (4.26)
d d
where:
dP
f ./ D D c1 ; (4.27)
d
dP
g./ D D c2 sin : (4.28)
d
The constants c1 and c2 ensure the normalization of f ./ and g./. can be
extracted inverting the cumulative distribution of f ./ (Eq. (4.16)), and since
g./ is uniformly distributed, can be extracted just remapping the interval
Œ0; 1Œ into Œ0; 2 Œ (Eq. (4.13)), leading to:
perform by moving from cartesian coordinates .x; y/ to polar coordinates .r; /.
In particular, the radial Gaussian cumulative PDF was already introduced in
Eq. (2.100), and leads, in the simplest case of a normal Gaussian (with average in
.0; 0/ and standard deviations in both coordinates equal to one), to:
Z 2
r2
P2D . / D e 2 r dr D 1 e 2 : (4.31)
0
This leads to the Box Muller transformation [9] from two variables r1 and r2
uniformly distributed in Œ0; 1Œ into two variables z1 and z2 distributed according to a
normal Gaussian:
p
rD 2 log r1 (4.32)
D 2 r2 (4.33)
z1 D r cos (4.34)
z2 D r sin (4.35)
x D C z : (4.36)
More efficient generators for Gaussian random numbers exist. For instance [10]
describes the so-called Ziggurat algorithm.
There are many cases in which the cumulative distribution of a PDF is not easily
computable and invertible. In those case, other methods allow to generate random
numbers according to a given PDF.
miss
hit
a b x
that we know the maximum value m of f , or at least a value m that is greater or equal
to the maximum of f . The situation is sketched in Fig. 4.2.
The method consist of first extracting a uniform random number x in the interval
Œa; bŒ, and then computing f D f .x/. Then, a random number r is extracted
uniformly in Œ0; mŒ. If r > f (“miss”) we repeat the extraction of x, until r < f
(“hit”). In this case, we accept x as the desired extracted value. In this way,
the probability distribution of the accepted x is our initial (normalized) PDF by
construction. A possible inconvenient of this method is that it rejects a fraction
of extractions equal to the ratio of area under the curve f .x/, and the area of the
rectangle that contains f . The method has an efficiency (i.e.: the fraction of accepted
values of x) equal to:
Rb
f .x/ dx
"D a
; (4.37)
.b a/ m
which may lead to a suboptimal use of the computing power, in particular if the
shape of f .x/ is very peaked.
Hit-or-miss Monte Carlo can also be applied to multi-dimensional cases. In those
cases, one first extracts a multi-dimensional point Ex D .x1 ; ; xn /, then accepts or
rejects Ex according to a random extraction r 2 Œ0; mŒ, compared with f .x1 ; ; xn /.
78 4 Random Numbers and Monte Carlo Methods
a b x
If the function f is very peaked, the efficiency " of this method (Eq. (4.37)) may
be very low. In those cases the algorithm may be adapted to become more efficient
by identifying in a preliminary stage a partition of the interval Œa; bŒ such that in
each subinterval the function f has a smaller variation than in the overall range.
In each subinterval the maximum of f is estimated. This situation is sketched
in Fig. 4.3. First of all, a subinterval of the partition is randomly chosen with a
probability proportional to its area. Then, the hit-or-miss approach is followed in the
corresponding rectangle. This approach is often referred to as importance sampling.
A possible variation of this method is to use, instead of the aforementioned
partition, an “envelope” for the function f , i.e.: a function g.x/ that is always greater
than f .x/: g.x/ f .x/; 8x 2 Œa; bŒ, and for which a convenient method to extract x
according to the normalized distribution g.x/ is known.
It’s evident that the efficiency of the importance sampling may be significantly
higher than the “plain” hit-or-miss Monte Carlo if the partition and the correspond-
ing maxima in each subinterval are properly chosen.
4.6 Numerical Integration with Monte Carlo Methods 79
Monte Carlo methods are often used as numerical methods to compute integrals
with the computer. If we use the hit-or-miss method described in Sect. 4.5.1, for
Rb
instance, we can estimate the integral a f .x/ dx from the fraction of the accepted
hits n over the total number of extractions N:
Z b
n
ID f .x/ dx ' IO D .b a/ : (4.39)
a N
With this approach, n follows a binomial distribution. If we can use the approximate
expression in Eq. (1.40) (see Sect. 1.11.1, n should not be too close to either 0 or N),
we can estimate the error on IO as:
r
I.1 I/
IO D .b a/ : (4.40)
N
p
Equation 4.40 shows that the error on IO decreases as N. We can apply the same
hit-or-miss method to a multi-dimensional integration problem. In that case, we
80 4 Random Numbers and Monte Carlo Methods
p
would have the same expression for the error, which decreases as N, regardless
of the number of dimensions d of the problem. Other numerical methods, not based
on random number extractions, may suffer from severe computing time penalties
as the number of dimensions d increases, and this makes Monte Carlo methods
advantageous in cases of high number of dimensions, compared to other non-
random-based techniques.
In the case of hit-or-miss Monte Carlo, anyway, numerical problems may arise
in the algorithm to find the maximum value of the input function in the multi-
dimensional range of interest. Also, partitioning the multi-dimensional integration
range in an optimal way, in case the of importance sampling, can be a non-trivial
problem.
References
1. R. May, Simple mathematical models with very complicated dynamics. Nature 621, 459 (1976)
2. Logistic map. public domain image, 2011. https://commons.wikimedia.org/wiki/File:
LogisticMap_BifurcationDiagram.png
3. T.O. Group, The single UNIX R
specification (1997). http://www.unix.org. Version 2
4. T.G. project, GNU operating system—GSL—GNU scientific library (1996–2011), http://www.
gnu.org/software/gsl/
5. M. Lüscher, A portable high-quality random number generator for lattice field theory simula-
tions. Comput. Phys. Commun. 79, 100–110 (1994)
6. F. James, Ranlux: A fortran implementation of the high-quality pseudorandom number
generator of Lüscher. Comput. Phys. Commun. 79, 111–114 (1994)
7. P. L’Ecuyer, Maximally equidistributed combined Tausworthe generators. Math. Comput. 65,
203–213 (1996)
8. M. Matsumoto, T. Nishimura, Mersenne twistor: a 623-dimensionally equidistributed unifor
pseudorandom number generator. ACM Trans. Model. Comput. Simul. 8, 3–30 (1998)
9. G.E.P. Box, M. Muller, A note on the generation of random normal deviates. Ann. Math. Stat.
29, 610–611 (1958)
10. G. Marsglia, W. Tsang, The Ziggurat method for generating random variables. J. Stat. Softw.
5, 8 (2000)
Chapter 5
Parameter Estimate
D O ˙ ı :
Cı
D OıC ;
Probability
Theory Model Data
Inference
Theory Model Data
In order to define the PDF describing a data model, in many cases it is necessary to
introduce parameters that are not of direct interest to our problem. For instance,
when determining (“fitting”) the yield of a signal peak, it is sometimes needed
to determine from data other parameters, like the experimental resolution that
gives the peak width, detector efficiencies that are needed to determine the signal
production yield from the measured signal yield, parameters to define the shapes
and amounts of possible backgrounds, and so on. Those parameters are often
referred to as nuisance parameters. In some cases, nuisance parameters can’t be
determined from the same data sample used to measure the parameters of interest
and their estimate should be taken form other measurements. The uncertainty
on their determination, external to the considered fit problem, will reflect into
uncertainties on the estimate of parameters of interest, as we will see in the following
(see Sect. 8.11).
Uncertainties due to the propagation of imperfect knowledge of nuisance param-
eters that can’t be constrained from the same data sample used for the main fit of
the parameters of interest gives raise to systematic uncertainties, while uncertainties
purely related to the fit are referred to as statistical uncertainties.
5.3 Estimators
In realistic cases we have a data sample that is more rich than a single mea-
surement, and PDF models more complex than a Gaussian, which we considered
in the Example 5.15. The definition of an estimator may require in general complex
mathematics or in many cases computer algorithms.
Different estimators may have different statistical properties that makes one or
another estimator more suitable for a specific problem. In the following we will
present some of the main properties of estimators. Section 5.5 will introduced the
maximum-likelihood estimators which have good performances in terms of most of
the properties indicators described in the following.
5.4.1 Consistency
5.4.2 Bias
The bias of an estimator is the average deviation of the estimate from its true value:
D E D E
b./ D O D O : (5.2)
The average is intended as the expected value over an infinitely large number of
repeated experiments.
The denominator of Eq. (5.3) is called Fisher information in statistical literature, and
was already introduced in Sect. 3.7 when defining Jeffreys’ priors in the Bayesian
approach.
The ratio of the Cramér–Rao bound to the estimator’s variance is called
estimator’s efficiency:
O
VCR ./
O D
"./ : (5.4)
VŒO
Clearly, the presence of outliers at the left or right tails of the distribution will not
change significantly the value of the median, if it is dominated by measurements in
the “core” of the distribution, while the usual average (Eq. (1.13)) could be shifted
from the true value more and more as much as the outlier distribution is broader
than the “core” part of the distribution. Trimmed averages, i.e.: averages computed
by removing from the sample a fraction of the rightmost and leftmost data in the
sample is also less sensitive to the presence of outliers.
It is convenient to define the breakdown point as the maximum fraction of
incorrect measurements (i.e. outliers) above which the estimate may grow arbitrarily
large in absolute value. In particular, trimmed averages that remove a fraction f of
the events can be demonstrated to have a breakdown point f , while the median has
a breakdown point of 0:5.
A more detailed discussion of robust estimator is beyond the purpose of this text.
Eadie et al. [3] contains a more extensive discussion of robust estimators.
The most frequently used estimate method is based on the construction of the
combined probability distribution of all measurements in our data sample, called
likelihood function, which was already introduced in Sect. 3.3. The central values of
the parameters we want to estimate are obtained by finding the parameter set that
correspond to the maximum value of the likelihood function. This approach gives
the name of maximum-likelihood method to this technique.
Maximum-likelihood fits are very frequently used because of very good statisti-
cal properties according to the indicators discussed in Sect. 5.4.
88 5 Parameter Estimate
The simple example of estimator discussed in the Example 5.15, assuming the
simple case of a Gaussian PDF with unknown average and known standard
deviation , where the sample consisted of a a single measurement x obeying this
given Gaussian PDF, is a very simple example of the application of the maximum-
likelihood method.
We define the likelihood function as the PDF that characterizes our set of experi-
mental observables, evaluated at the values of those observables that corresponds
to our data sample, for given values of the unknown parameters. If we measure the
values x1 ; xn of n random variables and our PDF model depends on m unknown
parameters 1 ; ; m , we define the likelihood function L as:
Y
N
E D
L.xI / f .xi1 ; ; xin I 1 ; ; m / : (5.7)
iD1
1
In this case, we will often use the term event with the meaning frequently used in physics, more
than in statistics, i.e.: an event is the collection of measurements of the observable quantities
x1 ; ; xn which is repeated in one or more experiments. Measurements performed in different
events are assumed to be uncorrelated. In this sense, each sequence of event variables x1i ; ; xNi ,
for all i D 1; ; n, can be considered a sampling of independent and identically distributed
random variables, or i.i.d., see Sect. 4.2.
5.5 Maximum-Likelihood Method 89
In order to more easily manage the maximization of the likelihood, often the
logarithm of the likelihood function is computed, so that when the product of many
terms appears in the likelihood definition, this is transformed into the sum of the
logarithms of such terms. The logarithm of the likelihood function becomes:
X
N
E D
ln L.xI / ln f .xi1 ; ; xin I 1 ; ; m / : (5.8)
iD1
If the number of events N is also a random variable that obeys the distribution
p.NI 1 ; ; m /, which may also depend on the m unknown parameters, we can
define the extended likelihood function as:
Y
N
L D p.NI 1 ; ; m / f .xi1 ; ; xin I 1 ; ; m / : (5.9)
iD1
In the case for instance where the PDF is a linear combination of two PDFs, one
for “signal” (Ps ), one for “background” (Pb ), whose yields, s and b are unknown,
we can write the extended likelihood function as:
.s C b/n e.sCb/ Y
N
L.xi I s; b; / D .fs Ps .xi I / C fb Pb .xi I // ; (5.11)
nŠ iD1
90 5 Parameter Estimate
fs D s
sCb ; (5.12)
fb D b
sCb : (5.13)
e.sCb/ Y
N
D .sPs .xi I / C bPb .xi I // : (5.15)
nŠ iD1
Writing the logarithm of the likelihood function we have the more convenient
expression:
X
n
ln L.xi I s; b; / D s C b C ln.sPs .xi I / C bPb .xi I // ln nŠ : (5.16)
iD1
The last term, ln nŠ, is constant with respect to the unknown parameters and can
be omitted when minimizing the function ln L.xi I s; b; /.
An example of extended maximum-likelihood fit of a two-component PDF like
the one illustrated above is shown in Fig. 5.2. The points with the error bars represent
100
80
Events / ( 0.01 )
60
40
20
0
2.5 2.6 2.7 2.8 2.9 3 3.1 3.2 3.3 3.4 3.5
m (GeV)
Fig. 5.2 Example of unbinned extended maximum-likelihood fit of a simulated dataset. The fit
curve is superimposed on the data points as solid blue line
5.6 Errors with the Maximum-Likelihood Method 91
the data sample, shown as binned histogram for convenience, which is randomly
extracted according to the assumed PDF. The unknown parameters present in are fit
according to the likelihood function in Eq. (5.16) with a PDF modeled as sum of a
Gaussian component (“signal”) and an exponential component (“background”). The
mean and standard deviation of the Gaussian component and the decay parameter
of the exponential component are fit simultaneously with the number of signal and
background events, s and b.
X
n
.xi /2
2 ln L.xi I ; / D C n.ln 2 C 2 ln / : (5.17)
iD1
2
1X
n
O D xi ; (5.18)
n iD1
1X
n
O 2 D O 2:
.xi / (5.19)
n iD1
Once we have determined the estimate O of our parameter of interest, using the
maximum-likelihood method, we have to determine a confidence interval that
corresponds to a coverage of 68.27 %, or “1”, in most of the cases.
In cases of non-Gaussian PDFs, in the presence of a large number of
measurements that we combined into a single estimate, we can still use a Gaussian
approximation. If we are far from that ideal limit, the result obtained under the
Gaussian assumption may be only an approximation and deviation from the
exact coverage may occur. If we use as PDF model an n-dimensional Gaussian
92 5 Parameter Estimate
(Eq. (2.102)), it is easy to demonstrate that we can obtain the n-dimensional PDF
covariance matrix as the inverse of the matrix of the second-order partial derivatives
of the negative logarithm of the likelihood function, i.e.2 :
@ ln L.x1 ; ; xn I 1 ; ; m /
Cij1 D : (5.20)
@i @j
This covariance matrix gives the n-dimensional confidence contour having the
correct coverage if the PDF model is exactly Gaussian.
Let’s consider the Gaussian likelihood case of Eq. (5.17), from Sect. 5.5.3. In that
case, if we compute the derivative of ln L with respect to the average parameter
, we obtain:
1 @2 . ln L/ n
2
D 2
D 2; (5.21)
@
This expression coincides with the evaluation of the standard deviation of the
average from Eq. (5.18) using the general formulae from Eqs. (1.14) and (1.15).
Another method frequently used to determine the uncertainty on maximum-
likelihood estimates is to look at the excursion of 2 ln L around the minimum
value (i.e.: the value that maximizes L), and find the interval where 2 ln L increases
by one with respect to its minimum value. This method leads to identical results
of Eq. (5.20) in the Gaussian case, but may lead to different results in the case of
non-Gaussian PDFs. In particular, this approach in general may lead to asymmetric
errors, but better follows the non-local behavior of the likelihood function.3 This
procedure is graphically shown in Fig. 5.3 for the case of a single parameter .
In more than one dimension the same method can be applied, obtaining an
approximate multidimensional ellipsoid. For instance, the two-dimensional contour
plot shown in Fig. 5.4 shows the values of the parameters of s and b that minimize
Eq. (5.16) (central point) for the sample shown in Fig. 5.2. The approximate ellipse
corresponds to the points for which ln L.xi I s; b; / in Eq. (5.16) increases by two
with respect to its minimum, corresponding to a 1 uncertainty.
2
For the users of the program MINUIT, this estimate correspond to call the method
M IGRAD /HESSE.
3
Again, for M INUIT users this procedure correspond to the call of the method MINOS.
5.6 Errors with the Maximum-Likelihood Method 93
−2lnL
−2lnLmax+1
−2lnLmax
^ −
θ−σ ^
θ ^
θ+σ+ θ
Fig. 5.3 Scan of 2 ln L as a function of a parameter . the error interval is determined looking at
the excursion of 2 ln L around the minimum, at O , finding the values for which 2 ln L increases
by one unit
1620
1600
1580
b
1560
1540
1520
1500
400 410 420 430 440 450 460 470 480
s
Fig. 5.4 Two-dimensional contour plot showing the 1 error ellipse for the fit parameters s and b
(number of signal and background events) corresponding to the fit in Fig. 5.2
The exact coverage, anyway, is not ensured even in this method where the
excursion of the negative log-likelihood function is evaluated, though usually the
coverage is improved with respect to the parabolic approximation in Eq. (5.20).
Chapter 6 will discuss a more rigorous treatment of uncertainty intervals that ensure
the proper coverage.
94 5 Parameter Estimate
given n measurements of t: t1 ; ; tn .
By minimizing analytically the likelihood function, written as the product
of f .t1 / f .tn /, one can easily show that the maximum-likelihood estimate of
, which is the inverse of the average “lifetime”, D 1=, is:
!1
1X
n
O D ti ; (5.24)
n iD1
1X
n
O 2 D O 2:
.xi / (5.26)
n iD1
The bias is defined in Eq. (5.2). It can be shown analytically that the expected
value of O 2 is:
˝ ˛ n1 2
O 2 D ; (5.27)
n
1 X
n
2
O unbiased D O 2:
.xi / (5.29)
n 1 iD1
E ;
y D f .xI / (5.30)
n Gaussian PDFs:
" #
Y
n
1 E 2
.yi f .xi I //
E D
L.EyI / q exp ; (5.31)
2
iD1 2 i2 2i
where Ey D .y1 ; ; yn /.
Maximizing L.EyI E/ is equivalent to minimize 2 ln L.EyI /:
E
X
n E 2
.yi f .xi I // X
n
E D
2 ln L.EyI / C ln 2 i2 : (5.32)
iD1
i2 iD1
The last term does not depend on the parameters , E if the uncertainties i are known
and fixed, hence it is a constant that we can drop when performing the minimization.
So, we can minimize the following quantity corresponding to the first term in
Eq. (5.32):
X
n E 2
.yi f .xi I //
E D
2 ./ : (5.33)
iD1
i2
An example of fit performed with the minimum-2 method is shown in Fig. 5.5.
6 y = A x e-B x
4
y
2 4 6 8 10
x
Fig. 5.5 Example of minimum-2 fit of a simulated dataset. The points with the error bars,
assumed resulting from Gaussian PDFs, are used to fit a function model of the type y D f .x/ D
AxeBx , where A and B are unknown parameters determined form the fit procedure. The fit curve
is superimposed as solid blue line
5.7 Minimum 2 and Least-Squares Methods 97
In case the uncertainties i are all equal, it is possible to minimize the expression:
X
n X
n
R2 D E 2D
.yi f .xi I // ri : (5.34)
iD1 iD1
In the easiest case of a linear function, the minimum-2 problem can be solved
analytically. The function f can be written as:
y D f .xI a; b/ D a C bx ; (5.35)
X
n
.yi a bxi /2
2
.a; b/ D : (5.36)
iD1
i2
@2 .a; b/
The analytical minimization can proceed as usual, imposing D 0 and
@a
@2 .a; b/
D 0, which gives:
@b
cov.x; y/
bO D ; (5.37)
VŒx
aO D hyi bO hxi ; (5.38)
1= 2
wi D Pn i 2 ; (5.39)
jD1 1=j
we have:
X
n
hxi D wi xi ; (5.40)
iD1
X
n
hyi D wi yi ; (5.41)
iD1
98 5 Parameter Estimate
˝ ˛ X
n X
n
VŒx D hxi2 x2 D . wi xi /2 wi x2i ; (5.42)
iD1 iD1
X
n X
n X
n
cov.x; y/ D hxyi hxi hyi D wi xi yi wi xi wi yi : (5.43)
iD1 iD1 iD1
1 @2 ln L 1 @2 2
2
D 2 D (5.44)
aO @a 2 @2 a
and similarly:
1 1 @2 2
D : (5.45)
O2 2 @2 b
b
Hence, we have:
v !
u n
u X 1 1
aO D t ; (5.46)
iD1 i
v !
u n
u X x2 1
bO D t i
: (5.47)
2
iD1 i
where n is the number of degrees of freedom of the problem, i.e.: the number of
measurements n minus the number of fit parameters m. We define the p-value as
P.2 O 2 /, the probability that 2 is greater or equal to the value O 2 obtained at
the fit minimum. If the data follow the assumed Gaussian distributions, the p-value
is expected to be a random variable uniformly distributed from 0 to 1 (see Sect. 2.3).
5.8 Error Propagation 99
σθ θ
H D AT ‚A ; (5.50)
where:
@ i
Aij D : (5.51)
@j
y D ax (5.52)
hence:
zDxCy (5.55)
hence:
q
xCy D x2 C y2 : (5.57)
More in general, also considering a possible correlation term, using Eq. (5.49), one
may obtain:
q
xCy D x2 C y2 C 2 x y : (5.58)
z D xy (5.59)
y D xn ; (5.61)
one has:
n
x x
D jnj : (5.62)
xn x
C C C C
measurements: x D xO x x
and y D yO yy
, the naïve extension of the sum in
quadrature of errors, derived in Eq. (5.57), would lead to the (incorrect!) sum in
quadratures of the positive and negative errors:
q
C .xC /2 C.yC /2
x C y D .Ox C yO / p ; (5.63)
.x /2 C.y /2
which has no statistical motivation. One reason for which Eq. (5.63) is incorrect,
is the central limit theorem. In the case of a very large sample, symmetric errors
can be approximated by the standard deviation of the distribution of that sample. In
case of an asymmetric (skew) distribution, asymmetric errors may be related to a
measurement of the skewness (Eq. (1.25)) of the distribution. Adding more random
variables, each characterized by an asymmetric PDF, should lead to a resulting
PDF that approaches a Gaussian more than the original PDFs, hence it should have
more symmetric errors. From Eq. (5.63), instead, the error asymmetry would never
decrease by adding more and more measurements with asymmetric errors.
One statistically correct way to propagate asymmetric errors on quantities (say )
that are expressed as functions of some original parameters (say ) is to reformulate
the fit problem in terms of the new parameters, and perform again the fit and error
evaluation for the derived quantities ( ). This approach is sometimes not feasible
in the case when measurements with asymmetric errors are taken as result of a
previous measurement (e.g.: from a publication) that do not specify the complete
underlying likelihood model. In those cases, the treatment of asymmetric errors
requires some assumptions on the underlying PDF model which is missing in the
available documentation of the model’s description. Discussions about how to treat
asymmetric errors can be found in [7–9] using a frequentist approach. D’Agostini
[10] also approaches this subject using the Bayesian approach (see Chap. 3). In the
following part of the section the derivation from [9] will be briefly presented as
example.
Imagine that the uncertainty on a parameter x0 may arise from a non-linear
dependence on another parameter x (i.e.: x0 D f .x/) which has a symmetric
uncertainly .
Figure 5.7 shows a simple case where a random variable x, distributed according
to a Gaussian PDF, is transformed into a variable x0 through a piece-wise linear
transformation, leading to an asymmetric PDF. The two straight-line sections, with
different slopes, join with continuity at the central value of the original PDF. This
leads to a resulting PDF of the transformed variable which is piece-wise Gaussian:
the two half-Gaussians, each corresponding to a 50 % probability, have different
0
standard deviation parameters, C and 0 . Such a PDF is also called bifurcated
Gaussian in some literature.
If we have as original measurement: x D xO ˙ , this will lead to a resulting
C 0
measurement of the transformed variable: x0 D xO0 0C , where C
0
and 0 depend on
0
through factors equal to the two different slopes: C D sC and 0 D s
respectively, as seen from the Fig. 5.7.
5.9 Issues with Treatment of Asymmetric Errors 103
One consequence of the transformed PDF shape is that the average value of the
transformed variable, hx0 i, is different from the most likely value, xO0 . While the
average value of the sum of two variables is equal to the sum of the average values
of the two variables (Eq. (1.19)), this is not the case for the most likely value of the
sum of the two variables.
In fact, using a naïve error treatment, like the one in Eq. (5.63), could even lead
to a bias in the estimate of combined values. In particular, the average value of x0 is:
˝ 0˛ 1
x D xO0 C p .C
0
0 / ; (5.64)
2
The unnormalized skewness that leads to Eq. (5.67) add linearly, like the average
C C
1
and variance. If we have two variables affected by asymmetric errors, say: x1
1
C2C
and x2 ,
and assume an underlying piece-wise linear model, as in the case treated
2
here, we can compute the average, variance and unnormalized skewness for the sum
of the two:
where the three above quantities for x1 and x2 separately can be computed from
Eqs. (5.64), (5.65) and (5.67). Inverting the relation between hx1 C x2 i, VarŒx1 C
2
x2 and Œx1 C x2 and x1 C x2 , 1C2 C
and 1C2 one can obtain, using numerical
C C
1C2
techniques, the correct estimate for x1C2 .
1C2
Barlow [9] also considers the case of a parabolic dependence and obtains a
C C
1C2
procedure to estimate x1C2 with this second model.
1C2
Any estimate of the sum of two measurements affected by asymmetric errors
requires an assumption of an underlying PDF model, and results may differ for
different assumptions, depending case by case on the numerical values.
In the case of a sufficiently large number of entries in each bin, the Poisson
distribution describing the number of entries in each bin can be approximated by a
Gaussian with variance equal to the average number of entries in that bin, according
to what was derived in Sect. 2.9. In this case, the expression for 2 ln L becomes:
R xC
X .ni i
x f .xI 1 ; ; m / dx/2 X
2 ln L D i
C n ln 2 C ln ni ; (5.71)
i
ni i
C
where Œx i ; xi is the interval corresponding to the ith bin. If the binning is
sufficiently fine, we can write:
where xi is center of the ith bin and ıxi is the bin’s width.
Dropping the terms that are constant with respect to the unknown parameters j ,
and hence are not relevant for the minimization, the maximum-likelihood problem
reduces to the minimization of the following 2 :
where the bin size ıxi has been absorbed in the definition of i , for simplicity of
notation:
In binned samples in general the Gaussian hypothesis assumed in Sect. 5.10.1 does
not necessarily hold when the number of entries per bin may be small. In the
Poissonian case, valid in general, also for small number of events, the negative
106 5 Parameter Estimate
Using the approach proposed in [6], we can divide the likelihood function by it’s
maximum value, which we obtain replacing i with ni . Hence, we obtain:
From Wilks’ theorem (see Sect. 7.6) the distribution of 2 can be approximated
with the 2 distribution (Eq. (5.48)), hence 2 can be used to determine a p-value
that provides a measure of the goodness of the fit.
In case the Wilks’ theorem hypotheses do not hold, e.g.: insufficiently large
number of measurements, the distribution of 2 for the specific problem can still
be determined by generating a large number of Monte Carlo pseudo-experiments
that reproduce the theoretical PDF, and the p-value can be computed accordingly.
This technique is often called toy Monte Carlo.
Imagine we have two measurements of the same quantity x, which are x1 ˙ 1 and
x2 ˙ 2 . Assuming a Gaussian distribution for x1 and x2 and no correlation between
the two measurements, we can build a 2 as:
.x x1 /2 .x x2 /2
2 D C : (5.79)
12 22
which gives:
x1
12
C x2
22
xO D 1 1
: (5.81)
12
C 22
1 @2 ln L 1 @2 2 1 1
2
D 2
D 2
D 2C 2: (5.82)
xO @x 2 @x 1 2
Equation (5.81) is called weighted average, and can be generalized for n measure-
ments as:
Pn
i wi xi
xO D Pn1 ; (5.83)
iD1 wi
5.11.2 2 in n Dimensions
The 2 definition from Eq. (5.79) can be generalized in the case of n measurements
Ex D .x1 ; ; xn / of the parameter
E D .1 ; ; n /, where i D i .1 ; ; m /
and the correlation matrix of the measurements .x1 ; ; xn / is ij , as follows:
X
n
2 D .xi i /ij1 .xj j / D .Ex /
E T 1 .Ex /
E (5.85)
i;jD1
0 11 0 1
11 1n x1 1
B :: : : :: C
D x1 1 ; ; xn n @ : @ A :
: : A (5.86)
n1 nn xn n
The 2 can be minimized in order to have the best-fit estimate of the unknown
parameters 1 ; ; m . Numerical algorithms are needed for the general case,
analytical minimizations are as usual possible only in the simplest cases.
Following the same procedure used to obtain Eq. (5.81), we can easily obtain:
x1 .22 1 2 / C x2 .12 1 2 /
xO D ; (5.88)
12 2 1 2 C 22
12 22 .1 2 /
xO2 D : (5.89)
12 2 1 2 C 22
In the general case of more than two measurement, the minimization of the
2 is equivalent to finding the best linear unbiased estimate (BLUE), i.e.: the
unbiased linear combination of the measurements Ex D .x1 ; ; xn / whose weights
5.11 Combining Measurements 109
xO2 D w
E T w
E; (5.92)
where ij is the covariance matrix of the n measurements. It can be shown [11] that
the weights can be determined according to the following expression:
1 uE
ED
w ; (5.93)
uE T 1 uE
where uE is the vector having all elements equal to the unity: uE D .1; ; 1/.
The interpretation of Eq. (5.88) becomes more intuitive [12] if we introduce the
common error, defined as:
C D 1 2 : (5.94)
Imagine, for instance, that the two measurements are affected by a common
uncertainty, like the knowledge of the integrated luminosity in the case of a cross
section measurement, while the other uncertainties are uncorrelated. In that case, we
could write the two measurements as:
x D x1 ˙ 10 ˙ C (5.95)
x D x2 ˙ 20 ˙ C (5.96)
where 102 D 12 C2 and 202 D 22 C2 . Equation (5.88) becomes:
x1 x2
C 2
12 C2
2 C2
xO D (5.97)
1 C 21 2
12 C2 2 C
110 5 Parameter Estimate
with a variance:
1
xO2 D 1 1
C C2 : (5.98)
C
12 C2 22 C2
Equation (5.97) is equivalent to the weighted average from Eq. (5.81), where the
errors 102 and 202 are used. Equation (5.98) shows that the additional term C has to
be added in quadrature to the usual error in the case of no correlation.
In general, as can be seen from Eq. (5.88), weights with the BLUE method
can become negative. This leads to the counter-intuitive result that the combined
measurement lies outside the range delimited by the two central values. Also, in
the case when D 1 =2 , the weight of the first measurement is zero, hence the
combined central value is not influenced by x2 . Conversely, if D 2 =1 , the central
value is not influenced bye x1 .
The BLUE method is unbiased by construction assuming that the true uncertainties
and their correlations are known. Anyway, it can be proven that BLUE combinations
may exhibit a bias if uncertainty estimates are used in place of the true ones, and
in particular if the uncertainty estimates depend on measured values. For instance,
when contributions to the total uncertainty are known as relative uncertainties, the
actual uncertainty estimates are obtained as the product of the relative uncertainties
times the measured central values. An iterative application of the BLUE method can
be implemented in order to mitigate such a bias.
L. Lyons et al. remarked in [13] the limitation of the application of the BLUE
method in the combination of lifetime measurements where uncertainty estimates
Oi of the true unknown uncertainties i were used, and those estimates had a
dependency on the measured lifetime.
References 111
They also demonstrated that the application of the BLUE method violates, in that
case, the “combination principle”: if the set of measurements is split into a number
of subsets, then the combination is first performed in each subset and finally all
subset combinations are combined into a single combination, this result differs from
the combination of all individual results of the entire set.
L. Lyons et al. recommended to apply iteratively the BLUE method, rescaling
at each iteration the uncertainty estimates according to the central value obtained
with the BLUE method in the previous iteration, until the sequence converges to
a stable result. Lyons et al. showed that the bias of the BLUE estimate is reduced
compared to the standard application of the BLUE method. A more extensive study
of the iterative application of the BLUE method is also available in [14].
References
As first step, the confidence belt is determined by scanning the parameter space,
varying within its allowed range. For each fixed value of the parameter D 0 ,
the corresponding PDF which describes the distribution of x, f .xj0 /, is known.
θ θ
θ2(x0)
θ0
θ1(x0)
x1(θ0) x2(θ0) x x0 x
Fig. 6.1 Graphical illustration of Neyman’s belt construction (left) and inversion (right)
According to the PDF f .xj0 /, an interval Œx1 .0 /; x2 .0 / is determined whose
corresponding probability is equal to the specified confidence level, defined as CLD
1 ˛, usually equal to 68.27 % (1), 90 or 95 %:
Z x2 .0 /
1˛ D f .xj0 / dx : (6.1)
x1 .0 /
Other possibilities consist of choosing the interval having the smallest size,
or fully asymmetric intervals on either side: Œx1 .0 /; C1 or Œ1; x2 .0 / . Other
possibilities are also considered in literature. Figure 6.2 shows three of the possible
cases described above.
A special ordering rule was introduced by Feldman and Cousins based on a
likelihood-ratio criterion and will be discussed in Sect. 6.4.
6.1 Neyman’s Confidence Intervals 115
f(x|θ0)
f(x|θ0) f(x|θ )
0
1−α 1−α
α α
x x
Fig. 6.2 Three possible choices of ordering rule: central interval (top) and fully asymmetric
intervals (bottom left, right)
Given a choice of the ordering rule, the intervals Œx1 ./; x2 ./ , for all possible
values of , define the Neyman belt in the space .x; / as shown in Fig. 6.1.
nŠ
P.nI N/ D pN .1 p/Nn : (6.4)
NŠ.N n/Š
nO
pO D : (6.5)
N
This may be justified by the law of large numbers because for n ! 1 pO and p will
coincide. This is clearly not the case in general for finite values of n, and Eq. (6.6)
clearly gives a null error in case either nO D 0 or nO D N, i.e. for pO D 0 or 1,
respectively.
A general solution to this problem is due to Clopper and Pearson [2], and consists
of finding the interval Œp1 ; p2 that gives (at least) the correct coverage.
If the confidence level is 1 ˛, we can find p1 such that:
X
N
nŠ ˛
P.n nO jN; p1 / D pn1 .1 p1 /Nn D ; (6.7)
NŠ.N n/Š 2
nDOn
nO
X nŠ ˛
P.n nO jN; p2 / D pn2 .1 p2 /Nn D : (6.8)
nD0
NŠ.N n/Š 2
This corresponds, in a discrete case, to the Neyman inversion described in Sect. 6.1.
The presence of a discrete observable quantity, n in this case, does not allow
a continuous variation of the observable and the probability associated to possible
discrete intervals Œn1 .p0 /; n2 .p0 / , corresponding to an assumed parameter value p D
p0 , can have again discrete possible values. In the Neyman’s construction, one has
to chose the smallest interval Œn1 ; n2 , consistently with the adopted ordering rule,
that has at least the desired coverage. In this way, the confidence interval Œp1 ; p2
determined by the inversion procedure for the parameter p could overcover, i.e.:
may have a corresponding probability larger than the desired 1 ˛. In this sense,
the interval is said to be conservative.
6.3 The “Flip-Flopping” Problem 117
NŠ N
P.n NjN; p1 / D p .1 p1 /0 D pN1 D 0:05 ;
NŠ 0Š 1
P.n NjN; p2 / D 1 :
So, for p2 we should consider the largest allowed value p2 D 1:0, since the
probability P.n 10j10; p2/ is always equal to one. The first equation can be
inverted and gives:
p1 D expŒln.0:05/=N :
This “3” significance criterion will be discussed in more details later, see
Sect. 8.2 for a more general definition of significance level.
This problem is sometimes referred to in literature as flip-flopping, and can be
illustrated in a simple example [3]. Imagine a model where a random variable x
obeys a Gaussian distribution with a fixed and known standard deviation , for
simplicity we can take D 1, and an unknown average which is bound to
be greater or equal to zero. This is the case of a signal yield, or cross section
measurement.
The quoted central value must always be greater than or equal to zero, given the
assumed constraint. Imagine we decide to quote, as measured value for , zero if
the significance, defined as x=, is less than 3, otherwise we quote the measured
value x:
n x if x= 3
D ; (6.9)
0 if x= < 3
Fig. 6.3 Illustration of the flip-flopping problem. The plot shows the quoted central value of as
a function of the measured x (dashed line), and the 90 % confidence interval corresponding to a
choice to quote a central interval for x= 3 and an upper limit for x= < 3
6.4 The Unified Feldman–Cousins Approach 119
The choice to switch from a central interval to a fully asymmetric interval (upper
limit) based on the observation of x produces an incorrect coverage: looking at
Fig. 6.3, depending on the value of , the interval Œx1 ; x2 obtained by crossing
the confidence belt by an horizontal line, one may have cases where the coverage
decreases from 90 to 85 %, which is lower than the desired CL (purple line in
Fig. 6.3). Next Sect. 6.4, presents the method due to Feldman and Cousins to
preserve consistently the coverage for this example without incurring the flip-
flopping problem.
In order to avoid the flip-flopping problem and to ensure the correct coverage, the
ordering rule proposed by Feldman and Cousins [3] provides a Neyman confidence
belt, as defined in Sect. 6.1, that smoothly changes from a central or quasi-central
interval to an upper limit in the case of low observed signal yield.
The ordering rule is based on the likelihood ratio that will be further discussed
in Sect. 7.3: given a value 0 of the unknown parameter , the chosen interval of
the variable x used for the Neyman-belt construction is defined by the ratio of two
PDFs of x, one under the hypothesis that is equal to the considered fixed value 0 ,
the other under the hypothesis that is equal to the maximum-likelihood estimate
value best .x/ corresponding to the given measurement x. The likelihood ratio must
be greater than a constant k˛ whose value depends on the chosen confidence level
1 ˛. In a formula:
f .xj0 /
.xj0 / D > k˛ : (6.11)
f .xjbest .x//
The constant k˛ should be taken such that the integral of the PDF evaluated in the
range corresponding to .xj0 / > k˛ is equal to 1˛. Hence, the confidence interval
R˛ for a given value D 0 is given by:
kα
1−α
n p1 if x 0
2
f .xjbest .x// D x2 =2
: (6.15)
p1 e if x < 0
2
The likelihood ratio in Eq. (6.11) can be written in this case as:
The interval Œx1 .0 /; x2 .0 / , for a given D 0 , can be found numerically using
the equation .xj/ > k˛ and imposing the desired confidence level 1 ˛ from
Eq. (6.13).
The results are shown in Fig. 6.5, and can be compared with Fig. 6.3. Using
the Feldman–Cousins approach, for large values of x one gets the usual symmetric
confidence interval. As x moves towards lower values, the interval becomes more
and more asymmetric, and at some point it becomes fully asymmetric (i.e.: Œ0; up ),
determining an upper limit up . For negative values of x the result is always an upper
limit avoiding unphysical values with negative values of . As seen, this approach
smoothly changes from a central interval to an upper limit, yet ensuring the correct
required (90 % in this case) coverage.
More cases treated using the Feldman–Cousins approach will be presented in
Chap. 8.
The application of the Feldman–Cousins method requires, in most of the cases,
numerical treatment, even for simple PDF models, like the Gaussian case discussed
above, because it requires the inversion of the integral in Eq. (6.13). The procedure
requires to scan the parameter space, and in case of complex model may be very
CPU intensive. For this reason, other methods are often preferred to the Feldman–
Cousins for such complex cases.
References 121
μ
symmetric errors
asymmetric errors
upper limits
x
Fig. 6.5 Neyman confidence belt constructed using the Feldman–Cousins ordering
References
signal
f(t)
background
t cut t
Fig. 7.1 Probability distribution functions for a discriminating variable t.x/ D x which has two
different PDFs for the signal (red) and background (yellow) hypotheses under test
One simple example is to use a single variable x which has discriminating power
between two hypotheses, say signal = “muon” versus background = “pion”, as
shown in Fig. 7.1. A good “separation” of the two cases can be achieved if the
PDFs of x under the hypotheses H1 = signal and H0 = background are appreciably
different.
On the basis of the observed value xO of the discriminating variable x, a simple
test statistics can be defined as the measured value itself:
Ot D t.Ox/ D xO : (7.1)
A selection requirement (in physics jargon sometimes called cut) can be defined
by identifying a particle as a muon if Ot tcut or as a pion if Ot > tcut , where the value
tcut is chosen a priori.
Not all real muons will be correctly identified as a muon according to this
criterion, as well as not all real pions will be correctly identified as pions. The
expected fraction of selected signal particles (muons) is usually called signal
selection efficiency and the expected fraction of selected background particles
(pions) is called misidentification probability.
Misidentified particles constitute a background to positively identified signal
particles. Applying the required selection (cut), in this case t tcut , on a data
sample made of different detected particles, each providing a measurements of
x, the selected data sample will be enriched of signal, reducing the fraction of
background in the selected data sample with respect to the original unselected
sample. The sample will be actually enriched if the selection efficiency is larger
than the misidentification probability, which is the case considering the shapes of
the PDFs in Fig. 7.1 and the chosen selection cut.
7.1 Introduction to Hypothesis Tests 125
signal efficiency
0
0 1
mis-id probability
Statistical literature defines the significance level ˛ as the probability to reject the
hypothesis H1 if it is true. The case of rejecting H1 if true is called error of the first
kind. In our example, this means selecting a particle as a pion in case it is a muon.
Hence the selection efficiency for the signal corresponds to 1 ˛.
The probability ˇ to reject the hypothesis H0 if it is true (error of the second
kind) is the misidentification probability, i.e.: the probability to incorrectly identify
a pion as a muon, in our case.
Varying the value of the selection cut tcut different values of selection efficiency
and misidentification probability (and the corresponding values of ˛ and ˇ) are
determined. A typical curve representing the signal efficiency versus the misidentifi-
cation probability obtained by varying the selection requirement is shown in Fig. 7.2.
A good selection should have a low misidentification probability corresponding to a
high selection efficiency. But clearly the background rejection can’t be perfect (i.e.:
the misidentification probability can’t drop to zero) if the distributions f .xjH0 / and
f .xjH1 / overlap, as in Fig. 7.2.
More complex examples of cut-based selections involve multiple variables,
where selection requirements in multiple dimensions can be defined as regions in the
discriminating variables space. Events are accepted as “signal” or as “background”
if they fall inside or outside the selection region. Finding an optimal selection in
multiple dimensions is usually not a trivial task. Two simple examples of selections
with very different performances in terms of efficiency and misidentification
probability are shown on Fig. 7.3.
126 7 Hypothesis Tests
Fig. 7.3 Examples of two-dimensional selections of a signal (blue dots) against a background (red
dots). A linear cut is chosen on the left plot, while a box cut is chosen on the right plot
.1 2 /2
E D
J.w/ : (7.2)
12 C 22
1 2 D w
E T .m
E 1 mE2 / ; (7.3)
where mE1 and mE2 are the averages in n dimensions of the two PDFs. Moreover, we
can write the square of Eq. (7.3) as:
.1 2 /2 D w
E T .m
E 1 mE2 /.m
E 1 mE2 /T w
E Dw
E T SB w
E; (7.4)
7.2 Fisher’s Linear Discriminant 127
SB D .m
E 1 mE2 /.m
E 1 mE2 /T : (7.5)
12 D w
E T S1 w
E; (7.6)
22 D w
E T S2 w
E; (7.7)
12 C 22 D w
E T .S1 C S2 /w
E Dw
E T SW w
E; (7.8)
SW D S1 C S2 : (7.9)
The problem of finding the vector w E that maximizes J.w/E can be solved by
E with respect to the components wi of w.
performing the derivatives of J.w/ E It
can also be demonstrated that the problem is equivalent to solve the eigenvalues
equation:
E D SW w
SB w E; (7.11)
i.e.:
SW 1 SB w
E D w
E; (7.12)
E D SW 1 SB .mE1 mE2 / :
w (7.13)
In practice, the means \vec{m}_1, \vec{m}_2 and the matrices S_1 and S_2 can be determined from the Monte Carlo samples, and Eq. (7.13) can be applied numerically to find the direction \vec{w} that maximizes Fisher's discriminant.
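A minimal numerical sketch of this procedure is given below, under the assumption that two toy Monte Carlo training samples are available as numpy arrays; it computes the Fisher direction \vec{w} \propto S_W^{-1}(\vec{m}_1 - \vec{m}_2) of Eq. (7.13) and projects the events onto it. The sample shapes and numbers are illustrative only.

# A minimal sketch of the Fisher discriminant computed from two toy samples.
import numpy as np

rng = np.random.default_rng(0)
sig = rng.multivariate_normal([1.0, 0.5], [[1.0, 0.3], [0.3, 1.0]], size=5000)
bkg = rng.multivariate_normal([0.0, 0.0], [[1.0, -0.2], [-0.2, 1.5]], size=5000)

m1, m2 = sig.mean(axis=0), bkg.mean(axis=0)
S1 = np.cov(sig, rowvar=False) * (len(sig) - 1)   # scatter matrices of the two classes
S2 = np.cov(bkg, rowvar=False) * (len(bkg) - 1)
SW = S1 + S2                                      # within-classes matrix, Eq. (7.9)

w = np.linalg.solve(SW, m1 - m2)                  # Fisher direction, up to a scale factor
t_sig, t_bkg = sig @ w, bkg @ w                   # projected discriminant values
sep = (t_sig.mean() - t_bkg.mean()) / np.sqrt(t_sig.var() + t_bkg.var())
print("w =", w, " separation =", sep)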
According to the Neyman–Pearson lemma, the optimal test statistic is given by the likelihood ratio:

\lambda(\vec{x}) = \frac{L(\vec{x}|H_1)}{L(\vec{x}|H_0)} \, . \qquad (7.14)
For a fixed signal efficiency \varepsilon = 1 - \alpha, the selection that corresponds to the lowest possible misidentification probability \beta is given by:

\lambda(\vec{x}) = \frac{L(\vec{x}|H_1)}{L(\vec{x}|H_0)} \ge k_\alpha \, , \qquad (7.15)

where the value of the “cut” k_\alpha should be set in order to achieve the required value of \alpha. This corresponds to choosing a point on the curve shown in Fig. 7.2 such that \varepsilon = 1 - \alpha takes the required value.
If the multidimensional PDFs that characterize our discrimination problem are known, the Neyman–Pearson lemma provides a procedure to implement a selection that achieves the optimal performance.
In many realistic cases it is not easy to determine the correct model for the multidimensional PDFs, and approximate solutions may be adopted. Numerical methods and algorithms exist to find selections in the variable space whose performance, in terms of efficiency and misidentification probability, is close to the optimal limit given by the Neyman–Pearson lemma. Some of those algorithms can be quite complex.
Among such methods, some of the most frequently used ones in High Energy Physics are Artificial Neural Networks and Boosted Decision Trees, which are not covered in detail in this text. An introduction to Artificial Neural Networks can be found in [3], while an introduction to Boosted Decision Trees is available in [4].
In many cases this allows one to simplify the computation of the likelihood ratio and to easily obtain the optimal selection according to the Neyman–Pearson lemma.
Even if it is not possible to factorize the PDFs into the product of one-dimensional marginal PDFs, i.e.: the variables are not independent, the test statistic inspired by Eq. (7.16) can be used as a discriminant, using the marginal PDFs f_j of the individual variables:

\lambda(x_1, \cdots, x_n) = \frac{\prod_{j=1}^{n} f_j(x_j|H_1)}{\prod_{j=1}^{n} f_j(x_j|H_0)} \, . \qquad (7.17)
If the PDFs cannot be exactly factorized, however, the test statistic defined in Eq. (7.17) will differ from the exact likelihood ratio in Eq. (7.14), and hence it will correspond to worse performance in terms of \alpha and \beta compared with the best possible values provided by the Neyman–Pearson lemma.
In some cases, nonetheless, the simplicity of this method can justify its application in spite of the suboptimal performance. Indeed, the marginal PDFs f_j can be simply obtained using Monte Carlo training samples with a large number of entries, which allow one to produce histograms of the distributions of the individual variables x_j that, to a good approximation, reproduce the marginal PDFs.
Some numerical applications implement this factorized likelihood-ratio technique after applying proper variable transformations in order to reduce or eliminate the correlations among the variables. This mitigates the loss of performance compared with the Neyman–Pearson limit, but does not necessarily allow one to reach the optimal performance, because uncorrelated variables are not necessarily independent (see for instance Example 2.6).
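The following sketch illustrates, under simplified assumptions (hypothetical training samples, marginal PDFs taken from normalized histograms), how a factorized discriminant in the spirit of Eq. (7.17) can be built; it is not the implementation of any specific package.

# A minimal sketch of a factorized likelihood-ratio discriminant from marginal histograms.
import numpy as np

rng = np.random.default_rng(1)
sig = rng.normal([1.0, 1.0], 1.0, size=(20000, 2))   # training samples, 2 variables
bkg = rng.normal([0.0, 0.0], 1.0, size=(20000, 2))

edges = np.linspace(-4, 5, 40)
def marginal_pdfs(sample):
    # one normalized histogram per variable, approximating f_j(x_j)
    return [np.histogram(sample[:, j], bins=edges, density=True)[0]
            for j in range(sample.shape[1])]

f_sig, f_bkg = marginal_pdfs(sig), marginal_pdfs(bkg)

def discriminant(x, eps=1e-12):
    # product over variables of f_j(x_j|H1) / f_j(x_j|H0), as in Eq. (7.17)
    idx = np.clip(np.digitize(x, edges) - 1, 0, len(edges) - 2)
    num = np.prod([f_sig[j][idx[j]] for j in range(len(x))])
    den = np.prod([f_bkg[j][idx[j]] for j in range(len(x))])
    return num / (den + eps)

print(discriminant(np.array([1.2, 0.8])), discriminant(np.array([-0.5, -1.0])))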
7.5 Kolmogorov–Smirnov Test

A test due to Kolmogorov [5], Smirnov [6] and Chakravarti [7] can be used to assess the hypothesis that a data sample is compatible with a given distribution.
Assume we have a sample x_1, \cdots, x_n, ordered by increasing values of x, that we want to compare with a distribution f(x), which is assumed to be a continuous function.
The cumulative distribution of the sample can be defined as:

F_n(x) = \frac{1}{n}\sum_{i=1}^{n}\theta(x - x_i) \, , \qquad (7.18)

where:

\theta(x) = \begin{cases} 1 & \text{if } x \ge 0 \, ,\\ 0 & \text{if } x < 0 \, . \end{cases} \qquad (7.19)
The maximum distance between the two cumulative distributions F_n(x) and F(x),

D_n = \sup_x \left| F_n(x) - F(x) \right| \, ,

is used to quantify the agreement of the data sample x_1, \cdots, x_n with f(x).

[Figure: the empirical cumulative distribution F_n(x) compared with F(x); D_n is the maximum distance between the two curves over the ordered values x_1, x_2, \cdots, x_n]
The distribution of the test statistic K = \sqrt{n}\,D_n, if the hypothesis H_0 (that the sample is distributed according to f(x)) is true, does not depend on the distribution f(x), and the probability that K is smaller than or equal to a given value k is given by Marsaglia et al. [8]:

P(K \le k) = 1 - 2\sum_{i=1}^{\infty}(-1)^{i-1}e^{-2i^2k^2} = \frac{\sqrt{2\pi}}{k}\sum_{i=1}^{\infty}e^{-(2i-1)^2\pi^2/(8k^2)} \, . \qquad (7.22)
The test can also be used to compare two data samples of sizes n and m: the maximum distance D_{n,m} between the two empirical cumulative distributions asymptotically converges to zero when n and m are sufficiently large (if the two samples come from the same distribution), and the following test statistic asymptotically follows the Kolmogorov distribution in Eq. (7.22):

\sqrt{\frac{nm}{n+m}}\,D_{n,m} \, . \qquad (7.24)
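In practice, Kolmogorov–Smirnov tests are usually performed with existing numerical libraries. The following sketch uses scipy (an assumption about the available tools, not a prescription of the book) for both the one-sample comparison with a continuous distribution and the two-sample comparison.

# A minimal sketch of one-sample and two-sample Kolmogorov-Smirnov tests with scipy.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
x = rng.exponential(scale=2.0, size=500)

# One-sample test of x against f(x) = exponential with scale 2:
# the statistic is D_n and the p-value uses the Kolmogorov distribution, Eq. (7.22).
res1 = stats.kstest(x, stats.expon(scale=2.0).cdf)

# Two-sample test: D_{n,m}, rescaled internally as in Eq. (7.24).
y = rng.exponential(scale=2.2, size=800)
res2 = stats.ks_2samp(x, y)

print(f"one-sample: D_n = {res1.statistic:.3f}, p = {res1.pvalue:.3f}")
print(f"two-sample: D_nm = {res2.statistic:.3f}, p = {res2.pvalue:.3f}")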
According to Wilks' theorem, the likelihood ratio of the two hypotheses, corresponding to the observed data sample (\vec{x}_1, \cdots, \vec{x}_N), has a distribution that can be approximated, for N \to \infty and if H_0 is true, by a \chi^2 distribution with a number of degrees of freedom equal to the difference between the dimensionality of the set \Theta_1 and the dimensionality of the set \Theta_0.
As a more specific example, let's assume that we separate the parameter set into one parameter of interest \mu and the remaining parameters, which are taken as nuisance parameters (see Sect. 5.2), \vec{\theta} = (\theta_1, \cdots, \theta_m). For instance, \mu may be the ratio of the signal cross section to its theoretical value (signal strength), and \mu = 1 corresponds to a signal cross section equal to the theory prediction.
If we take as H_0 the hypothesis \mu = \mu_0, while H_1 is the hypothesis that \mu may have any possible value greater than or equal to zero, Wilks' theorem ensures that:
-2\ln\lambda_r(\mu_0) = -2\ln\frac{\displaystyle\sup_{\vec{\theta}}\prod_{i=1}^{N}L(\vec{x}_i;\,\mu_0,\vec{\theta})}{\displaystyle\sup_{\mu,\vec{\theta}}\prod_{i=1}^{N}L(\vec{x}_i;\,\mu,\vec{\theta})} \qquad (7.26)

asymptotically follows a \chi^2 distribution with one degree of freedom. The denominator is the likelihood function evaluated at its absolute maximum, \prod_{i=1}^{N}L(\vec{x}_i;\,\hat{\mu},\hat{\vec{\theta}}).
In the numerator only the nuisance parameters \vec{\theta} are fit, while \mu is fixed to the constant value \mu = \mu_0. If the values of \vec{\theta} that maximize the likelihood function for a fixed \mu = \mu_0 are \vec{\theta} = \hat{\hat{\vec{\theta}}}(\mu_0), then the ratio in Eq. (7.26) can be written as:

\lambda_r(\mu_0) = \frac{L(\vec{x}\,|\,\mu_0, \hat{\hat{\vec{\theta}}}(\mu_0))}{L(\vec{x}\,|\,\hat{\mu}, \hat{\vec{\theta}})} \, . \qquad (7.27)
This test statistic is called in the literature the profile likelihood, and it will be used in Sect. 8.12 to determine upper limits.
7.7 Likelihood Ratio in the Search for a New Signal
In the previous section we considered the following likelihood function for a set of observations (\vec{x}_1, \cdots, \vec{x}_N) with a set of parameters \vec{\theta}:

L(\vec{x}_1, \cdots, \vec{x}_N; \vec{\theta}) = \prod_{i=1}^{N} f(\vec{x}_i; \vec{\theta}) \, . \qquad (7.28)
Two hypotheses H_1 and H_0 are represented as two possible sets of values \Theta_1 and \Theta_0 of the parameters \vec{\theta} = (\theta_1, \cdots, \theta_m) that characterize the PDFs.
Usually we want to use the number of events N as information in the likelihood definition, hence we use the extended likelihood function (see Sect. 5.5.2), multiplying Eq. (7.28) by a Poissonian factor corresponding to the probability to observe N events:

L(\vec{x}_1, \cdots, \vec{x}_N; \vec{\theta}) = \frac{e^{-\nu(\vec{\theta})}\,\nu(\vec{\theta})^{N}}{N!} \prod_{i=1}^{N} f(\vec{x}_i; \vec{\theta}) \, . \qquad (7.29)
In the Poissonian term the expected number of events \nu may also depend on the parameters \vec{\theta}: \nu = \nu(\vec{\theta}). Typically, we want to discriminate between two hypotheses: the presence of only background events in our sample, i.e.: \nu = b, against the presence of both signal and background, i.e.: \nu = \mu s + b. Here we have introduced the multiplier \mu, called signal strength, assuming that the expected signal yield from theory is s; all possible values of the expected signal yield, given by \mu s, are considered by varying \mu while keeping s constant at the value predicted from theory. The signal strength is widely used in data analyses performed at the Large Hadron Collider.
The hypothesis H_0, corresponding to the presence of background only, is equivalent to \mu = 0, while the hypothesis H_1, corresponding to the presence of signal plus background, allows any non-null positive value of \mu.
The PDF f(\vec{x}_i; \vec{\theta}) can be written as a superposition of two components, one PDF for the signal and another for the background, weighted by the expected signal and background fractions, respectively:

f(\vec{x}; \vec{\theta}) = \frac{\mu s}{\mu s + b}\,f_s(\vec{x}; \vec{\theta}) + \frac{b}{\mu s + b}\,f_b(\vec{x}; \vec{\theta}) \, . \qquad (7.30)
In this case the extended likelihood function from Eq. (7.29) becomes:

L_{s+b}(\vec{x}_1, \cdots, \vec{x}_N; \mu, \vec{\theta}) = \frac{e^{-(\mu s(\vec{\theta}) + b(\vec{\theta}))}}{N!} \prod_{i=1}^{N} \left[ \mu s\,f_s(\vec{x}_i; \vec{\theta}) + b\,f_b(\vec{x}_i; \vec{\theta}) \right] \, . \qquad (7.31)
Note that s and b may also depend on the unknown parameters \vec{\theta}:

s = s(\vec{\theta}) \, , \qquad (7.32)

b = b(\vec{\theta}) \, . \qquad (7.33)

For instance, in a search for the Higgs boson, the theoretical cross section may depend on the Higgs boson's mass, as well as the PDF for the signal, f_s, which represents a peak centered around the Higgs boson's mass.
Under the hypothesis H_0, i.e.: \mu = 0 (background only), the likelihood function can be written as:

L_b(\vec{x}_1, \cdots, \vec{x}_N; \vec{\theta}) = \frac{e^{-b(\vec{\theta})}}{N!} \prod_{i=1}^{N} b\,f_b(\vec{x}_i; \vec{\theta}) \, . \qquad (7.34)
The term 1/N! disappears when taking the likelihood ratio in Eq. (7.14), which becomes:

\lambda(\mu, \vec{\theta}) = \frac{L_{s+b}(\vec{x}_1, \cdots, \vec{x}_N; \mu, \vec{\theta})}{L_b(\vec{x}_1, \cdots, \vec{x}_N; \vec{\theta})} \, . \qquad (7.35)
In the case of simple event counting, where the likelihood function only accounts for the Poissonian probability term and the expected signal and background yields depend on the unknown parameters \vec{\theta}, the likelihood function only depends on the number of observed events N, and the likelihood ratio which defines the test statistic is:

\lambda(\mu, \vec{\theta}) = \frac{e^{-(\mu s(\vec{\theta}) + b(\vec{\theta}))}\,(\mu s(\vec{\theta}) + b(\vec{\theta}))^{N}}{e^{-b(\vec{\theta})}\,b(\vec{\theta})^{N}} \, ,

which is a simplified version of Eq. (7.36) where the terms f_s and f_b have been dropped.
These results will be used to determine upper limits on the signal yield \mu s in the search for a new signal, as will be seen in Chap. 8. In particular, this derivation will be very useful for the determination of upper limits using the modified frequentist approach, or CL_s, as discussed in Sect. 8.10.
References
1. R.A. Fisher, The use of multiple measurements in taxonomic problems. Ann. Eugen. 7, 179–
188 (1936)
2. J. Neyman, E. Pearson, On the problem of the most efficient tests of statistical hypotheses.
Philos. Trans. R. Soc. Lond. Ser. A 231, 289–337 (1933)
3. C. Peterson, T.S. Rögnvaldsson, An introduction to artificial neural networks. LU-TP-91-23, 1991. 14th CERN School of Computing, Ystad, Sweden, 23 Aug–2 Sep 1991
4. B.P. Roe, H.-J. Yang, J. Zhu, Y. Liu, I. Stancu, G. McGregor, Boosted decision trees as an
alternative to artificial neural networks for particle identification. Nucl. Instrum. Methods
A543, 577–584 (2005)
5. A. Kolmogorov, Sulla determinazione empirica di una legge di distribuzione. G. Ist. Ital.
Attuari 4, 83–91 (1933)
6. N. Smirnov, Table for estimating the goodness of fit of empirical distributions. Ann. Math. Stat.
19, 279–281 (1948)
7. I.M. Chakravarti, R.G. Laha, J. Roy, Handbook of Methods of Applied Statistics, vol. I (Wiley,
New York, 1967)
8. G. Marsaglia, W.W. Tsang, J. Wang, Evaluating Kolmogorov’s distribution. J. Stat. Softw. 8,
1–4 (2003)
9. R. Brun, F. Rademakers, ROOT—an object oriented data analysis framework, in Proceedings
AIHENP’96 Workshop, Lausanne, Sep. 1996, Nuclear Instruments and Methods, vol. A389
(1997), pp. 81–86. See also http://root.cern.ch/
10. M.A. Stephens, EDF statistics for goodness of fit and some comparisons. J. Am. Stat. Assoc.
69, 730–737 (1974)
11. T.W. Anderson, D.A. Darling, Asymptotic theory of certain “goodness-of-fit” criteria based on
stochastic processes. Ann. Math. Stat. 23, 193–212 (1952)
12. H. Cramér, On the composition of elementary errors. Scand. Actuar. J. 1928(1), 13–74 (1928).
doi:10.1080/03461238.1928.10416862
13. R.E. von Mises, Wahrscheinlichkeit, Statistik und Wahrheit (Julius Springer, Vienna, 1928)
14. S. Wilks, The large-sample distribution of the likelihood ratio for testing composite hypotheses.
Ann. Math. Stat. 9, 60–62 (1938)
Chapter 8
Upper Limits
When a new signal is not observed, upper limits are set on the parameters that influence the signal yield, such as the mass of a new particle, if it is related to the theoretical cross section, and so on.
The determination of upper limits is in many cases a complex task, and the computation frequently requires numerical algorithms. Several methods are adopted in High Energy Physics and are documented in the literature to determine upper limits. The interpretation of the obtained limits can be, even conceptually, very different, depending on the adopted method.
This chapter will introduce the concept of significance and will present the most popular methods to set upper limits in High Energy Physics, discussing their main benefits and limitations. The interpretation of upper limits under the frequentist and Bayesian approaches will be discussed, devoting special attention to the so-called modified frequentist approach, which is a popular method in High Energy Physics that is neither a purely frequentist nor a Bayesian method.
Given an observed data sample, claiming the discovery of a new signal requires determining that the sample is sufficiently inconsistent with the hypothesis that only background is present in the data. A test statistic can be used to measure how consistent or inconsistent the observation is with the hypothesis of the presence of background only.
A quantitative measure of the inconsistency with the background-only hypothesis is given by the significance, defined from the probability p (p-value) that the considered test statistic t assumes a value greater than or equal to the observed one in the case of a pure background fluctuation. We have implicitly assumed that large values of t correspond to a more signal-like sample. The p-value has by construction (see the end of Sect. 2.3) a uniform distribution between 0 and 1 for the background-only hypothesis and tends to have small values in the presence of a signal.
If the number of observed events is adopted as test statistic (event counting experiment), the p-value can be determined as the probability to count a number of events equal to or greater than the observed one, assuming the presence of no signal and the expected background level. In this case the test statistic and the p-value assume discrete values, and the distribution of the p-value is only approximately uniform.
p = \sum_{n=8}^{\infty} \mathrm{Pois}(n;\,4.5) = 1 - \sum_{n=0}^{7} e^{-4.5}\,\frac{4.5^{n}}{n!} \simeq 0.087 \, .
Fig. 8.1 Poisson distribution in the case of a null signal and an expected background of b = 4.5. The probability corresponding to n \ge 8 (red area) is 0.087, and gives the p-value when the event counting is adopted as test statistic
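The numbers in Fig. 8.1 can be reproduced with a few lines of Python (a sketch assuming the simple counting case with known background, using scipy for the Poisson and normal distributions):

# A minimal sketch: Poisson counting p-value and its conversion into a significance.
from scipy.stats import poisson, norm

b, n_obs = 4.5, 8
p_value = poisson.sf(n_obs - 1, b)     # P(n >= 8) = 1 - P(n <= 7)
z = norm.isf(p_value)                  # Z = Phi^{-1}(1 - p)
print(f"p-value = {p_value:.3f}, significance Z = {z:.2f}")   # ~0.087, ~1.36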
8.2.2 Significance
Instead of quoting the p-value, publications often prefer to quote the equivalent number of standard deviations that correspond to an area p under the rightmost tail of a standard normal distribution, i.e. the significance Z defined by p = 1 - \Phi(Z), or Z = \Phi^{-1}(1 - p), where \Phi is the cumulative distribution of a standard normal.
Determining the significance, however, is only part of the process that leads to a discovery in the scientific method. Quoting from [1]:
It should be emphasized that in an actual scientific context, rejecting the background-only
hypothesis in a statistical sense is only part of discovering a new phenomenon. One’s degree
of belief that a new process is present will depend in general on other factors as well, such as
the plausibility of the new signal hypothesis and the degree to which it can describe the data.
Here, however, we only consider the task of determining the p-value of the background-only
hypothesis; if it is found below a specified threshold, we regard this as “discovery”.
In order to evaluate the “plausibility of a new signal” and other factors that
give confidence in a discovery, the physicist’s judgement cannot, of course, be
replaced by the statistical evaluation only. In this sense, we can say that a Bayesian
interpretation of the final result is somehow implicitly assumed.
For the purpose of excluding a signal hypothesis, the requirement applied in terms of p-value is usually much milder than for a discovery. Instead of requiring a p-value of 2.87 \times 10^{-7} or less (the “5\sigma” criterion), upper limits for a signal exclusion are set requiring p < 0.05, corresponding to a 95 % confidence level (CL), or p < 0.10, corresponding to a 90 % CL.
In the case of signal exclusion, p indicates the probability of a signal underfluctuation, i.e.: the null hypothesis and the alternative hypothesis are inverted with respect to the case of a discovery.
In Sect. 7.7 a likelihood-ratio test statistic was introduced in order to achieve the benefits of Wilks' theorem (see Sect. 7.6).
More in general, in the case of a parameter of interest \theta, e.g.: the mass of the new particle that produces the signal yield s(\theta), the same procedure may be applied and the value of -\ln\lambda(\theta) can be plotted as a function of \theta. The same consideration holds for the approximate significance evaluation, Z = \sqrt{2\ln\lambda_{\max}}, as a function of the unknown parameter \theta.
This significance is called local significance, in the sense that it corresponds to a fixed value of the parameter \theta. In Sect. 8.14 the interpretation of the significance in the case of parameters estimated from data will be further discussed, and it will become clear that the local significance estimated at a fixed value of a measured parameter must be corrected for the look-elsewhere effect.
A better estimate of the significance may be achieved by adopting the test statistic -2\ln\lambda and generating a large number of pseudoexperiments (toy Monte Carlo), i.e. randomly extracted samples that assume the presence of no signal (\mu = 0), in order to obtain the expected distribution of -2\ln\lambda.
The distribution of the generated -2\ln\lambda can be used, together with the observed value \lambda = \hat{\lambda}, to determine the p-value, which is the probability that \lambda is less than or equal to the observed \lambda = \hat{\lambda}:

p = P_{s+b}(\lambda(\mu) \le \hat{\lambda}) \, , \qquad (8.4)

equal to the fraction of generated pseudoexperiments for which \lambda(\mu) \le \hat{\lambda}.
Note that, in order to assess large values of the significance, the number of generated toy Monte Carlo samples required to achieve a sufficient precision may be very large, because the required p-value is very small. Remember that the p-value for a 5\sigma evidence is (Table 8.1) 2.87 \times 10^{-7}; hence, in order to determine it with sufficient precision, the number of toy samples may need to be as large as 10^{9}.
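A minimal sketch of such a toy-based p-value estimate, with its binomial uncertainty, is shown below; the test-statistic values and the observed value are purely hypothetical placeholders for the output of a real toy generation.

# A minimal sketch: p-value from toy Monte Carlo samples and its binomial uncertainty.
import numpy as np

rng = np.random.default_rng(3)
n_toys = 100_000
t_toys = rng.normal(size=n_toys)               # placeholder background-only toys
t_obs = 3.2                                    # hypothetical observed value

k = np.sum(t_toys >= t_obs)                    # toys at least as signal-like as data
p = k / n_toys
p_err = np.sqrt(p * (1 - p) / n_toys)          # binomial uncertainty on the fraction
print(f"p = {p:.5f} +- {p_err:.5f}  (from {k} / {n_toys} toys)")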
In the frequentist approach the procedure to set an upper limit is a special case of the determination of a confidence interval (see Sect. 6.1) for the unknown signal yield s, or alternatively for the signal strength \mu.
In order to determine an upper limit instead of a central interval, the choice of the interval with the desired confidence level (90 % or 95 %, usually) should be fully asymmetric, becoming s \in [0, s^{\mathrm{up}}[. When the outcome of an experiment is an upper limit, one usually quotes s < s^{\mathrm{up}} at the given confidence level.
For a counting experiment with expected signal s and expected background b, the likelihood function is given by the Poisson probability:

L(n; s, b) = \frac{(s+b)^{n}}{n!}\,e^{-(s+b)} \, , \qquad (8.5)

where n is the observed number of events.
If we use as the only information the number of observed events n, to be compared with an expected number of background events b, and if b is sufficiently large and known with negligible uncertainty, the distribution of the number of events, assuming that only background is present in the data (s = 0), can be approximated by a Gaussian with average b and standard deviation \sqrt{b}. The observed number of events n will be on average equal to b. An excess, quantified as s = n - b, should be compared with the expected standard deviation \sqrt{b}, and the significance can be approximately evaluated as:

Z = \frac{s}{\sqrt{b}} \, . \qquad (8.6)

If the expected background yield b comes from an estimate which has a non-negligible uncertainty \sigma_b, Eq. (8.6) can be modified as follows:

Z = \frac{s}{\sqrt{b + \sigma_b^{2}}} \, . \qquad (8.7)

The above expressions can clearly be applied only in the particular case presented above, and have no general validity.
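As an illustration, the sketch below compares the approximate significances of Eqs. (8.6) and (8.7) with the significance obtained from the exact Poisson p-value, for assumed values of n, b and \sigma_b:

# A minimal numerical check of the approximate significance formulas.
import numpy as np
from scipy.stats import poisson, norm

n, b, sigma_b = 25, 16.0, 2.0
s = n - b
z_approx  = s / np.sqrt(b)                  # Eq. (8.6)
z_with_sb = s / np.sqrt(b + sigma_b**2)     # Eq. (8.7)
z_exact   = norm.isf(poisson.sf(n - 1, b))  # from the exact Poisson p-value (no sigma_b)
print(f"Z (8.6) = {z_approx:.2f},  Z (8.7) = {z_with_sb:.2f},  Z exact = {z_exact:.2f}")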
The easiest treatment of a counting experiment, at least from the technical point of view, can be done under the Bayesian approach. The Bayesian posterior PDF for s is given by:

P(s|n) = \frac{L(n; s)\,\pi(s)}{\int_{0}^{\infty} L(n; s')\,\pi(s')\,\mathrm{d}s'} \, . \qquad (8.8)
The upper limit s^{\mathrm{up}} can be computed by requiring that the posterior probability corresponding to the interval [0, s^{\mathrm{up}}[ is equal to CL, or equivalently that the probability corresponding to [s^{\mathrm{up}}, \infty[ is \alpha = 1 - \mathrm{CL}:

\alpha = 1 - \mathrm{CL} = \int_{s^{\mathrm{up}}}^{\infty} P(s|n)\,\mathrm{d}s = \frac{\int_{s^{\mathrm{up}}}^{\infty} L(n; s)\,\pi(s)\,\mathrm{d}s}{\int_{0}^{\infty} L(n; s)\,\pi(s)\,\mathrm{d}s} \, . \qquad (8.9)
Apart from the technical aspects related to the computation of the integrals and the already mentioned arbitrariness in the choice of the prior \pi(s) (see Sect. 3.6), the above expression poses no particular problem. Assuming a uniform prior and no expected background (b = 0), the posterior for s is:

P(s|n) = \frac{s^{n}\,e^{-s}}{n!} \, . \qquad (8.10)
In the case of no observed events, i.e.: n = 0, we have:

P(s|0) = e^{-s} \, , \qquad (8.11)

and:

\alpha = 1 - \mathrm{CL} = \int_{s^{\mathrm{up}}}^{\infty} e^{-s}\,\mathrm{d}s = e^{-s^{\mathrm{up}}} \, , \qquad (8.12)

which gives:

s^{\mathrm{up}} = -\ln\alpha \, . \qquad (8.13)

For \alpha = 0.05 (95 % CL) and \alpha = 0.10 (90 % CL), we can set the upper limits s^{\mathrm{up}} = 3.00 and s^{\mathrm{up}} = 2.30, respectively.
In the more general case of an expected background b and n observed events, still assuming a uniform prior, the upper limit s^{\mathrm{up}} is obtained from [2]:

\alpha = e^{-s^{\mathrm{up}}}\,\frac{\displaystyle\sum_{m=0}^{n}\frac{(s^{\mathrm{up}} + b)^{m}}{m!}}{\displaystyle\sum_{m=0}^{n}\frac{b^{m}}{m!}} \, . \qquad (8.16)
The above expression can be inverted numerically to determine s^{\mathrm{up}} for given \alpha, n and b. In the case of no background (b = 0), Eq. (8.16) becomes:

\alpha = e^{-s^{\mathrm{up}}}\sum_{m=0}^{n}\frac{(s^{\mathrm{up}})^{m}}{m!} \, . \qquad (8.17)
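The numerical inversion of Eq. (8.16) is straightforward; a possible sketch (assuming scipy is available, and using the fact that the right-hand side of Eq. (8.16) is algebraically equal to a ratio of Poisson cumulatives) is:

# A minimal sketch: Bayesian upper limit (uniform prior) by inverting Eq. (8.16).
from scipy.optimize import brentq
from scipy.stats import poisson

def cl_ratio(s_up, n, b):
    # Eq. (8.16): exp(-s_up)*sum_{m<=n}(s_up+b)^m/m! / sum_{m<=n} b^m/m!,
    # which is algebraically equal to the ratio of Poisson cumulatives below.
    return poisson.cdf(n, s_up + b) / poisson.cdf(n, b)

def bayesian_upper_limit(n, b, alpha=0.05):
    return brentq(lambda s: cl_ratio(s, n, b) - alpha, 0.0, 100.0)

print(bayesian_upper_limit(n=0, b=0.0))   # ~3.00 at 95% CL
print(bayesian_upper_limit(n=0, b=3.0))   # unchanged for n = 0
print(bayesian_upper_limit(n=5, b=3.0))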
The derivation of Bayesian upper limits presented above assumes a uniform prior on the expected signal yield. Assuming a different prior distribution would result in different upper limits. In general, there is no unique criterion to choose a specific prior PDF that models the complete lack of knowledge about a variable, like in this case the signal yield, as was already discussed in Sect. 3.6. In searches for new signals, the signal yield is sometimes related to other parameters of the theory (e.g.: the mass of unknown particles, or specific coupling constants). In that case, should one choose a uniform prior for the signal yield or a uniform prior for the theory parameters? The choice is not obvious, and no prescription can be provided from first principles.
A possible approach is to choose several prior PDFs that reasonably model one's ignorance about the unknown parameters and to verify that the obtained upper limit is not too sensitive to the choice of the prior.
[Fig. 8.2: Bayesian upper limits s^{up} as a function of the expected background b, for numbers of observed events ranging from n = 0 to n = 10 (two panels)]
Frequentist upper limits can be computed rigorously by inverting the Neyman belt (see Sect. 6.1). In cases where the confidence interval is fully asymmetric, i.e.: it has the form [0, s^{\mathrm{up}}[, the result may be quoted as:

s < s^{\mathrm{up}}
at the given confidence level. In order to better interpret the upper limits obtained from the Neyman belt inversion in the general case, a simple example of a counting experiment will be analyzed first, similarly to what was considered in Sect. 8.7.1 under the Bayesian approach.
Assuming no expected background (b = 0), the number of observed events n follows the Poisson distribution:

P(n; s) = \frac{s^{n}\,e^{-s}}{n!} \, . \qquad (8.18)
We can set an upper limit on the expected signal yield s by excluding the values of s for which the probability to observe n events or fewer (p-value) is below the value \alpha = 1 - \mathrm{CL}. For n = 0 the p-value is equal to e^{-s}, and setting p > \alpha = 1 - \mathrm{CL} allows values of the expected signal yield s that satisfy s < s^{\mathrm{up}} = -\ln\alpha, i.e. s^{\mathrm{up}} = 3.00 (2.30) at the 95 % (90 %) CL. Those results coincide accidentally with the results obtained under the Bayesian approach.
The coincidence of the values of the limits computed under the Bayesian and frequentist approaches in this simplified but commonly used case may lead to confusion. There is no intrinsic reason for which limits evaluated under the two approaches should coincide, and in general, with very few exceptions like this one, Bayesian and frequentist limits do not coincide numerically.
[Fig. 8.3: Neyman belt construction for fully asymmetric intervals of the form [0, θ^up[, as a function of the observable x]
This case is illustrated in Fig. 8.3, which is the equivalent of Fig. 6.1 when adopting
a fully asymmetric interval.
Fig. 8.4 Poisson distribution in the case of a signal s = 4 and b = 0. The white bins show the smallest possible fully asymmetric confidence interval ({2, 3, 4, \cdots} in this case) that gives at least a coverage of 1 - \alpha = 90 %
In this way, the probability that the true value s lies within the determined confidence interval [s_1, s_2] is greater than or equal to \mathrm{CL} = 1 - \alpha (overcoverage).
Figure 8.4 shows an example of a Poisson distribution corresponding to the case with s = 4 and b = 0. Using a fully asymmetric interval as ordering rule, the interval {2, 3, \cdots} of the discrete variable n corresponds to a probability P(n \ge 2) = 1 - P(0) - P(1) = 0.9084, and is the smallest interval which has a probability greater than or equal to the desired CL of 0.90: the interval {3, 4, \cdots} would have a probability P(n \ge 3) less than 90 %, while enlarging the interval to {1, 2, \cdots} would produce a probability P(n \ge 1) larger than P(n \ge 2).
Given an observation of n events, we could set the upper limit s^{\mathrm{up}} such that the probability to observe n events or fewer is equal to \alpha; hence, again, we find the same result that was derived in Sect. 8.8.1 with a simplified approach.
From the purely frequentist point of view, however, this result suffers from the flip-flopping problem discussed in Sect. 6.3: if we decide a priori to quote an upper limit as our final result, the procedure of choosing a fully asymmetric interval in the Neyman construction leads to the correct coverage. But if we choose to switch from a fully asymmetric to a central interval in case we observe a significant signal, this leads to an incorrect coverage.
The Feldman–Cousins (FC) approach (see Sect. 6.4) may be a solution for this case.
In the Poissonian counting case, the 90 % confidence belt obtained with the FC
approach is shown in Fig. 8.5 for the case in which b D 3. The results in the case
of no background (b D 0) are reported in Table 8.3 for different values of the
number of observed events n. Figure 8.6 shows the value of the 90 % CL upper limit
computed using the FC approach as a function of the expected background b for
different values of n.
[Fig. 8.5: 90 % CL Feldman–Cousins confidence belt (s versus n) for a Poissonian counting experiment with expected background b = 3]
[Fig. 8.6: 90 % CL Feldman–Cousins upper limits as a function of the expected background b, for numbers of observed events from n = 0 to n = 10]
Comparing Table 8.3 with Table 8.2, which reports the Bayesian results, FC upper limits are in general numerically larger than Bayesian limits. In particular, for the case n = 0, the upper limit increases from 2.30 to 2.44 at the 90 % CL and from 3.00 to 3.09 at the 95 % CL. In any case, as remarked before, the interpretations of frequentist and Bayesian limits are very different, and the numerical comparison of those upper limits should not lead to any specific interpretation.
A peculiar feature of FC upper limits is that, for n = 0, a larger expected background b corresponds to a more stringent, i.e.: lower, upper limit, as can be seen in Fig. 8.6 (lowest curve). This feature is absent in Bayesian limits, which do not depend on the expected background b for n = 0 (see Fig. 8.2).
8.9 Can Frequentist and Bayesian Upper Limits Be “Unified”?
The coincidence of Bayesian and frequentist upper limits in the simplest event counting case motivated an attempt by Zech [5] to reconcile the two approaches, namely the Bayesian limits obtained by Helene in [2] and the frequentist approach.
In order to determine the probability distribution of the number of events from the sum of two Poissonian processes with s and b expected numbers of events from signal and background, respectively, one can write, using Eq. (2.49), the probability distribution for the total observed number of events n as:

P(n; s, b) = \sum_{n_b=0}^{n} P(n_b;\,b)\,P(n_s = n - n_b;\,s) \, . \qquad (8.30)
Zech proposed to modify the first term of Eq. (8.30), P(n_b; b), to take into account that the observation of n events puts a constraint on the possible values of n_b, which can only range from 0 to n. In this way, Zech attempted to replace P(n_b; b) with:

P'(n_b; b) = P(n_b; b)\Big/\sum_{n_b'=0}^{n} P(n_b'; b) \, . \qquad (8.31)
This modification leads to the same result obtained by Helene in Eq. (8.16), which apparently indicates a possible convergence of the Bayesian and frequentist approaches.
This approach was later criticized by Highland and Cousins [6], who demonstrated that the modification introduced in Eq. (8.31) produces an incorrect coverage, and Zech himself admitted the non-rigorous application of the frequentist approach [7].
This attempt could not demonstrate a way to reconcile the Bayesian and frequentist approaches, which, as said, have completely different interpretations. Nevertheless, Zech's intuition anticipated the formulation of the modified frequentist approach that will be discussed in Sect. 8.10, which is nowadays widely used in High Energy Physics.
8.10 Modified Frequentist Approach: The CLs Method

The concerns about frequentist limits discussed at the end of Sect. 8.8.4 have been addressed with the definition of a procedure that was adopted for the first time for the combination of the results of the search for the Higgs boson [4] by the four LEP experiments, ALEPH, DELPHI, OPAL and L3.
The modification of the purely frequentist confidence level by the introduction of a conservative corrective factor can cure, as will be shown in the following, the aforementioned counterintuitive features of the frequentist limit procedure in the case of no observed signal events.
The so-called modified frequentist approach will be illustrated using the test statistic adopted in the original proposal, which is the ratio of the likelihood functions evaluated under two different hypotheses: the presence of signal plus background (H_1, corresponding to the likelihood function L_{s+b}), and the presence of background only (H_0, corresponding to the likelihood function L_b):

\lambda(\vec{\theta}) = \frac{L_{s+b}(\vec{x}; \vec{\theta})}{L_b(\vec{x}; \vec{\theta})} \, . \qquad (8.32)
Different test statistics have been applied after the original definition of the LEP
procedure, but the method described in the following remains unchanged for all
different kinds of test statistics.
where the functions f_s and f_b are the PDFs of the variables \vec{x} for signal and background. The negative logarithm of the test statistic is (Eq. 7.36):

-\ln\lambda(\mu, \vec{\theta}) = \mu\,s(\vec{\theta}) - \sum_{i=1}^{N}\ln\!\left(\frac{\mu\,s(\vec{\theta})\,f_s(\vec{x}_i; \vec{\theta})}{b(\vec{\theta})\,f_b(\vec{x}_i; \vec{\theta})} + 1\right) \, . \qquad (8.34)
In order to quote an upper limit using the frequentist approach, the distribution of the test statistic \lambda (or equivalently -2\ln\lambda) in the hypothesis of signal plus background (s + b) has to be known, and the p-value corresponding to the observed value \lambda = \hat{\lambda} (denoted below as CL_{s+b}) has to be determined.
The proposed modification to the purely frequentist approach consists of finding two p-values, corresponding to the s + b and b hypotheses respectively (below, for simplicity of notation, the set of parameters \vec{\theta} also includes \mu):

\mathrm{CL}_{s+b}(\vec{\theta}) = P_{s+b}(\lambda(\vec{\theta}) \le \hat{\lambda}) \, , \qquad (8.35)

\mathrm{CL}_{b}(\vec{\theta}) = P_{b}(\lambda(\vec{\theta}) \le \hat{\lambda}) \, . \qquad (8.36)
The ratio of the two p-values defines the quantity:

\mathrm{CL}_{s}(\vec{\theta}) = \frac{\mathrm{CL}_{s+b}(\vec{\theta})}{\mathrm{CL}_{b}(\vec{\theta})} \, . \qquad (8.37)
Upper limits are determined by excluding the range of the parameters of interest (e.g.: the signal strength \mu or a new particle's mass) for which \mathrm{CL}_{s}(\vec{\theta}) is lower than \alpha = 1 - \mathrm{CL}, where CL is the conventional exclusion confidence level, typically 95 % or 90 %. For this reason, the modified frequentist approach is often referred to as the CLs method.
In most cases, the probabilities P_{s+b} and P_{b} in Eqs. (8.35) and (8.36) are not trivial to obtain analytically and are determined numerically using randomly generated pseudoexperiments, or toy Monte Carlo. In this way, \mathrm{CL}_{s+b} and \mathrm{CL}_{b} can be estimated as the fraction of generated pseudoexperiments for which \lambda(\vec{\theta}) \le \hat{\lambda}, assuming the presence of both signal and background, or of background only, respectively. An example of the outcome of this numerical approach is shown in Fig. 8.7.
Fig. 8.7 Example of the determination of CLs from pseudoexperiments. The distribution of the test statistic -2\ln\lambda is shown in blue assuming the signal-plus-background hypothesis and in red assuming the background-only hypothesis. The black line shows the value of the test statistic measured in data, and the hatched areas represent CL_{s+b} (blue) and 1 - CL_{b} (red)
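A minimal sketch of the CLs computation from toy Monte Carlo outputs is given below; the two arrays of test-statistic values and the observed value are hypothetical placeholders, and the convention (more signal-like values of -2 ln lambda are smaller) follows Eqs. (8.35)–(8.37).

# A minimal sketch: CL_{s+b}, CL_b and CL_s from toy distributions of -2 ln(lambda).
import numpy as np

rng = np.random.default_rng(4)
t_sb = rng.normal(loc=-10.0, scale=8.0, size=100_000)   # toys generated under s+b
t_b  = rng.normal(loc=+10.0, scale=8.0, size=100_000)   # toys generated under b only
t_obs = 2.0                                             # value observed in data

# "As or less signal-like than observed" corresponds to t >= t_obs with this convention.
cl_sb = np.mean(t_sb >= t_obs)   # CL_{s+b}
cl_b  = np.mean(t_b  >= t_obs)   # CL_b
cl_s  = cl_sb / cl_b             # Eq. (8.37)
print(f"CLs+b = {cl_sb:.3f}, CLb = {cl_b:.3f}, CLs = {cl_s:.3f}"
      "  (excluded at 95% CL if CLs < 0.05)")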
This method does not produce the desired (90 % or 95 %, usually) coverage from the frequentist point of view, but it does not suffer from the problematic features of frequentist upper limits that were observed at the end of Sect. 6.4.
The CLs method has convenient statistical properties:
• It is conservative from the frequentist point of view. In fact, since \mathrm{CL}_{b} \le 1, we have that \mathrm{CL}_{s}(\vec{\theta}) \ge \mathrm{CL}_{s+b}(\vec{\theta}). So, it overcovers; this means that a CLs upper limit is less stringent than a purely frequentist limit.
• Unlike the upper limits obtained using the Feldman–Cousins approach, if no signal event is observed (n = 0), the CLs upper limit does not depend on the expected amount of background.
For a simple Poissonian counting experiment with expected signal s and a back-
ground b, using the likelihood ratio from Eq. (7.39), it is possible to demonstrate
analytically that the CLs approach leads to a result identical to the Bayesian one
(Eq. 8.16) and to what was derived by Zech (see Sect. 8.9). In general, it turns out
that in many realistic applications, the CLs upper limits are numerically very similar
to Bayesian upper limits computed assuming a uniform prior. But of course the
Bayesian interpretation of upper limits cannot be applied to limits obtained using
the CLs approach.
On the other hand, the interpretation of limits obtained using the CLs method is not obvious, and it matches neither the frequentist nor the Bayesian approach. CLs limits have been defined as [8]:
approximation to the confidence in the signal hypothesis one might have obtained if the
experiment had been performed in the complete absence of background.
8.11 Incorporating Nuisance Parameters and Systematic Uncertainties

Some of the parameters in the set considered so far, \vec{\theta} = (\theta_1, \cdots, \theta_m), are not of direct interest for our measurement, but are needed to model unknown characteristics of our data sample. Those parameters are called nuisance parameters (Sect. 5.2). Nuisance parameters may appear, for instance, when the yield and/or the distribution of the observed background is estimated with some uncertainty from simulation or from control samples in data, or when the modeling of the distributions for signal and/or background events is not (perfectly) known, e.g.: to take into account the effect of the detector response (resolution, calibration, etc.).
If we are interested in the measurement of the signal strength \mu only, all the other parameters \theta_i can be considered as nuisance parameters. If, for instance, we are also interested in the measurement of the mass of a new particle, like the Higgs boson's mass, the parameter corresponding to the particle mass, say m = \theta_1, is, like \mu, a parameter of interest, and all the remaining parameters \theta_2, \cdots, \theta_m can be considered nuisance parameters.
More in general, let's divide the parameter set into two subsets: the parameters of interest, \vec{\theta} = (\theta_1, \cdots, \theta_h), and the nuisance parameters, \vec{\nu} = (\nu_1, \cdots, \nu_l).
The treatment of nuisance parameters is a well-defined task under the Bayesian approach and was already discussed in Sect. 3.4. The posterior joint probability distribution of all the unknown parameters can be written as follows (Eq. 3.32):

P(\vec{\theta}, \vec{\nu}\,|\,\vec{x}) = \frac{L(\vec{x}; \vec{\theta}, \vec{\nu})\,\pi(\vec{\theta}, \vec{\nu})}{\int L(\vec{x}; \vec{\theta}\,', \vec{\nu}\,')\,\pi(\vec{\theta}\,', \vec{\nu}\,')\,\mathrm{d}^{h}\theta'\,\mathrm{d}^{l}\nu'} \, , \qquad (8.38)

where L(\vec{x}; \vec{\theta}, \vec{\nu}) is, as usual, the likelihood function and \pi(\vec{\theta}, \vec{\nu}) is the prior distribution of the unknown parameters.
The problem is well defined, and the only difficulty is the numerical integration in multiple dimensions. Several algorithms can be adopted for the evaluation of Bayesian integrals. A particularly effective technique in those cases is the Markov chain Monte Carlo [9].
In order to include detector resolution effects, for instance to model the width of a
signal peak, the hybrid approach requires the convolution of the likelihood function
with the experimental resolution function.
The above likelihood functions can be used to compute CLs limits, as was done in the combined Higgs boson limit at LEP [4].
This hybrid Bayesian–frequentist approach does not ensure an exact frequentist coverage, and it may tend to undercover in some cases [11]. It has been shown on simple models to provide results that are numerically close to the Bayesian evaluation performed assuming a uniform prior [12].
Consider again a counting experiment where the background expectation b is constrained by an estimate b', with uncertainty modeled by the PDF P(b'; b). The likelihood functions for the signal-plus-background and background-only hypotheses are:

L_{s+b}(n, b'; s, b) = \frac{(s+b)^{n}}{n!}\,e^{-(s+b)}\,P(b'; b) \, , \qquad (8.42)

L_{b}(n, b'; b) = \frac{b^{n}}{n!}\,e^{-b}\,P(b'; b) \, . \qquad (8.43)
In order to eliminate the dependence on the nuisance parameter b, the hybrid likelihoods, following the Cousins–Highland approach, can be written as:

L_{s+b}(n, b'; s) = \int_{0}^{\infty}\frac{(s+b)^{n}}{n!}\,e^{-(s+b)}\,P(b'; b)\,\mathrm{d}b \, , \qquad (8.44)

L_{b}(n, b') = \int_{0}^{\infty}\frac{b^{n}}{n!}\,e^{-b}\,P(b'; b)\,\mathrm{d}b \, . \qquad (8.45)
In simplified cases, for instance when P(b'; b) is a Gaussian function, the integration can be performed analytically [13]. In this case, when the standard deviation of the distribution is not much smaller than b', P(b'; b) extends to negative values of b, and the integration includes unphysical regions with negative yields. For this reason, the above equations can be safely used only when the probability to have negative values of b is negligible.
In order to avoid such cases, the use of distributions whose range is limited to positive values is preferred. For instance, a log-normal distribution (see Sect. 2.7) is usually preferred to a plain Gaussian.
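The hybrid integration of Eq. (8.44) can also be carried out numerically; the sketch below assumes a log-normal constraint P(b'; b) with a given relative uncertainty, and all input values are illustrative.

# A minimal sketch of the Cousins-Highland hybrid likelihood, Eq. (8.44),
# with a log-normal constraint on the background b.
import numpy as np
from scipy.stats import poisson, lognorm
from scipy.integrate import quad

def hybrid_likelihood_sb(n, s, b_prime, rel_unc):
    sigma = np.log(1.0 + rel_unc)
    constraint = lognorm(s=sigma, scale=b_prime)       # positive-definite in b
    integrand = lambda b: poisson.pmf(n, s + b) * constraint.pdf(b)
    value, _ = quad(integrand, 0.0, b_prime * 10.0)
    return value

print(hybrid_likelihood_sb(n=7, s=3.0, b_prime=4.5, rel_unc=0.2))
print(hybrid_likelihood_sb(n=7, s=0.0, b_prime=4.5, rel_unc=0.2))  # background only, cf. Eq. (8.45)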
8.12 Upper Limits Using the Profile Likelihood

A procedure that accounts for the treatment of nuisance parameters while avoiding the hybrid Bayesian approach adopts as test statistic the profile likelihood defined in Eq. (7.27):

\lambda(\mu) = \frac{L(\vec{x}\,|\,\mu, \hat{\hat{\vec{\theta}}}(\mu))}{L(\vec{x}\,|\,\hat{\mu}, \hat{\vec{\theta}})} \, . \qquad (8.46)
In Eq. (8.46), \hat{\mu} and \hat{\vec{\theta}} are the best-fit values of \mu and \vec{\theta} corresponding to the observed data sample, and \hat{\hat{\vec{\theta}}}(\mu) is the best-fit value of \vec{\theta} obtained for a fixed value of \mu. Above we assumed that all the parameters \vec{\theta} are treated as nuisance parameters and \mu is the only parameter of interest.
The profile likelihood is introduced in order to satisfy the conditions required to apply Wilks' theorem (see Sect. 7.6).
A scan of -2\ln\lambda(\mu) as a function of \mu reveals a minimum at the value \mu = \hat{\mu}, where -2\ln\lambda(\hat{\mu}) = 0 by construction. As for the usual likelihood function, the excursion of -2\ln\lambda(\mu) around the minimum and the intersection of the corresponding curve with the straight line -2\ln\lambda(\mu) = 1 allow one to determine an uncertainty interval for \mu, similarly to what was discussed in Sect. 5.6. According to Wilks' theorem (see Sect. 7.6), if \mu corresponds to the true value, then the distribution of -2\ln\lambda(\mu) follows a \chi^{2} distribution with one degree of freedom.
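The sketch below illustrates such a profile-likelihood scan on a deliberately simple toy counting model (one parameter of interest \mu and one constrained background normalization \beta as the only nuisance parameter); it is not the book's example, and all inputs are assumed.

# A minimal sketch of a profile-likelihood scan with one nuisance parameter.
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm, poisson

n_obs, s, b0, sigma_beta = 30, 10.0, 20.0, 0.10        # assumed inputs
def nll(mu, beta):                                     # -ln L(n, beta' | mu, beta)
    return (-poisson.logpmf(n_obs, mu * s + beta * b0)
            - norm.logpdf(beta, loc=1.0, scale=sigma_beta))

def profiled_nll(mu):                                  # minimize over the nuisance parameter
    return minimize_scalar(lambda be: nll(mu, be),
                           bounds=(0.5, 1.5), method="bounded").fun

mus = np.linspace(0.0, 3.0, 121)
scan = np.array([profiled_nll(m) for m in mus])
t_mu = 2.0 * (scan - scan.min())                       # -2 ln lambda(mu), zero at mu_hat
mu_hat = mus[np.argmin(scan)]
interval = mus[t_mu <= 1.0]                            # crossings of -2 ln lambda(mu) = 1
print(f"mu_hat = {mu_hat:.2f}, approximate 68% interval "
      f"[{interval.min():.2f}, {interval.max():.2f}]")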
Usually the addition of nuisance parameters \vec{\theta} has the effect of broadening the shape of the profile likelihood as a function of the parameter of interest \mu, compared with the case where the nuisance parameters are kept fixed. This effect increases the uncertainty on \mu when systematic uncertainties are included in the test statistic.
Compared with the Cousins–Highland hybrid treatment of nuisance parameters, the profile likelihood offers several advantages, including a faster numerical implementation: it requires no numerical integration, which may be more time-consuming than the minimizations required to compute the profile likelihood.
Given that the profile likelihood is based on a likelihood ratio, according to the Neyman–Pearson lemma (see Sect. 7.3) it has optimal performance for the test of the two hypotheses assumed in the numerator and in the denominator of Eq. (8.46).
The test statistic t_\mu = -2\ln\lambda(\mu) can be used to determine the p-values p_\mu corresponding to the various hypotheses on \mu. Those p-values can in general be computed by generating sufficiently large toy Monte Carlo samples.
Different variations of the definition of the profile likelihood have been proposed and adopted for various data analysis cases. A review of most of the adopted test statistics is presented in [1], where approximate formulae are provided to compute significances in the asymptotic limit. Some examples are reported below.
In order to enforce the condition \mu \ge 0, since we can't have negative yields from new signals, the test statistic t_\mu = -2\ln\lambda(\mu) can be modified as follows:

\tilde{t}_\mu = -2\ln\tilde{\lambda}(\mu) =
\begin{cases}
-2\ln\dfrac{L(\vec{x}\,|\,\mu, \hat{\hat{\vec{\theta}}}(\mu))}{L(\vec{x}\,|\,\hat{\mu}, \hat{\vec{\theta}})} & \hat{\mu} \ge 0 \, ,\\[2ex]
-2\ln\dfrac{L(\vec{x}\,|\,\mu, \hat{\hat{\vec{\theta}}}(\mu))}{L(\vec{x}\,|\,0, \hat{\hat{\vec{\theta}}}(0))} & \hat{\mu} < 0 \, .
\end{cases} \qquad (8.47)

In practice, one limits the best-fit value of \mu to values greater than or equal to zero.
In order to assess the presence of a new signal, one wants to test the case of a positive signal strength \mu against the case \mu = 0. This can be done using the test statistic t_0, which is equal to t_\mu = -2\ln\lambda(\mu) evaluated at \mu = 0. The test statistic t_0, however, may reject the hypothesis \mu = 0 also in the case of a downward fluctuation in the data that yields a negative best-fit value \hat{\mu}. A modification of t_0 has been proposed in order to be sensitive only to an excess in the data that yields a significantly high value of \hat{\mu} [1]:

q_0 =
\begin{cases}
-2\ln\lambda(0) & \hat{\mu} \ge 0 \, ,\\
0 & \hat{\mu} < 0 \, .
\end{cases} \qquad (8.48)
The p-value p_0 corresponding to the test statistic q_0 can be evaluated using toy Monte Carlo samples generated assuming background-only events. The distribution of q_0 will have a spike at q_0 = 0 (a Dirac delta \delta(q_0) component) corresponding to all the cases which give a negative \hat{\mu}.
For the purpose of setting upper limits, an analogous test statistic q_\mu can be defined, which is set to zero when \hat{\mu} exceeds the tested value \mu:

q_\mu =
\begin{cases}
-2\ln\lambda(\mu) & \hat{\mu} \le \mu \, ,\\
0 & \hat{\mu} > \mu \, .
\end{cases} \qquad (8.49)
A modified version of the q_\mu defined above has been adopted for the Higgs boson search at the LHC. The modified test statistic is:

\tilde{q}_\mu =
\begin{cases}
-2\ln\dfrac{L(\vec{x}\,|\,\mu, \hat{\hat{\vec{\theta}}}(\mu))}{L(\vec{x}\,|\,0, \hat{\hat{\vec{\theta}}}(0))} & \hat{\mu} < 0 \, ,\\[2ex]
-2\ln\dfrac{L(\vec{x}\,|\,\mu, \hat{\hat{\vec{\theta}}}(\mu))}{L(\vec{x}\,|\,\hat{\mu}, \hat{\vec{\theta}})} & 0 \le \hat{\mu} \le \mu \, ,\\[2ex]
0 & \hat{\mu} > \mu \, .
\end{cases} \qquad (8.50)
Above, the constraint introduced for \hat{\mu} < 0, as for q_0, protects against unphysical values of the signal strength, while cases with an upward fluctuation of the data, such that \hat{\mu} > \mu, are not considered as evidence against the signal hypothesis with signal strength equal to the considered value of \mu; the test statistic is set to zero in those cases, in order not to spoil the upper-limit performance of this test statistic.
For the test statistic \tilde{q}_\mu, as well as for the most commonly adopted variations of the profile likelihood, asymptotic approximations based on Wilks' theorem and on approximate formulae by Wald [14] have been computed and are treated extensively in [1].
Using the test statistic for discovery q_0, the asymptotic approximation gives the following simplified expression for the significance:

Z_0 \simeq \sqrt{q_0} \, . \qquad (8.51)
This avoids the generation of large numbers of pseudoexperiments. The square root of the test statistic evaluated on the Asimov data set corresponding to the assumed signal strength can be used to approximate the median significance for a data sample distributed according to the background-only hypothesis:

\mathrm{med}[Z_\mu\,|\,0] = \sqrt{\tilde{q}_{\mu,\mathrm{A}}} \, . \qquad (8.53)
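As an illustration of the asymptotic approach, for a simple counting experiment with a known background b the median discovery significance can be computed on the Asimov data set n = s + b; the resulting expression, Z_A = \sqrt{2[(s+b)\ln(1+s/b) - s]}, follows from the asymptotic formulae of [1] and reduces to s/\sqrt{b} for s \ll b. A minimal sketch:

# A minimal sketch of the Asimov (median) discovery significance for event counting.
import numpy as np

def asimov_significance(s, b):
    return np.sqrt(2.0 * ((s + b) * np.log(1.0 + s / b) - s))

for s, b in [(5.0, 100.0), (10.0, 10.0), (5.0, 1.0)]:
    print(f"s={s}, b={b}: Z_A = {asimov_significance(s, b):.2f},"
          f" s/sqrt(b) = {s / np.sqrt(b):.2f}")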
In Eq. (8.54) the distribution of the signal is modeled as \vec{s} = (s_1, \cdots, s_{N_{\mathrm{bins}}}) and that of the background as \vec{b} = (b_1, \cdots, b_{N_{\mathrm{bins}}}). The normalizations of \vec{s} and \vec{b} are given by theory expectations, and variations of the normalization scales can be modeled by the extra parameters \mu and \beta. \mu is the signal strength, introduced several times before; \beta has the same role as \mu, but for the background yield instead of the signal.
In order to measure the signal yield, the signal strength \mu should be determined, which is the parameter of interest of the problem. \beta, instead, can be considered as a nuisance parameter. Let's assume for the moment \beta = 1, i.e.: the uncertainty on the expected background yield from theory can be considered negligible. We can use as test statistic:

q = -2\ln\frac{L(\vec{n}\,|\,\mu\vec{s}, \vec{b}, \beta = 1)}{L(\vec{n}\,|\,\mu\vec{s} = 0, \vec{b}, \beta = 1)} \, . \qquad (8.55)
Fig. 8.8 A toy Monte Carlo data sample superimposed on an exponential background model (left) and on an exponential background model plus a Gaussian signal (right)
Note that this is somewhat different from the profile likelihood in Eq. (8.46); here we used the ratio of the likelihood functions L_{s+b}/L_b. This test statistic was used at the Tevatron for several searches for new physics. Note that, compared with the ordinary profile likelihood (Eq. 8.46), this test statistic can be written as the difference -2\ln\lambda(\mu) - (-2\ln\lambda(0)) of the profile-likelihood test statistics evaluated at \mu and at \mu = 0.
The distributions of the test statistic q for the background-only and for the signal-plus-background hypotheses are shown in Fig. 8.9. The figure has been produced by generating 100,000 randomly extracted toy samples in the two hypotheses.
The p-value corresponding to the background-only hypothesis can be evaluated as the fraction of randomly extracted toy samples having a value of the test statistic lower than the one observed in data.
From the distributions shown in Fig. 8.9, 375 out of 100,000 toy samples have a value of q below the one in our data sample, hence the p-value is 0.375 %. Assuming a binomial uncertainty, the p-value is determined with an uncertainty of about 0.02 %.
Using Eq. (8.1) to convert the p-value into a significance level, this corresponds to a significance Z = 2.7.
An alternative way to determine the significance is to look at the scan of the test statistic as a function of the parameter of interest \mu. This can be seen in Fig. 8.10, which shows that the minimum value of q is reached at \mu = 1.24.
Considering the range of \mu where q exceeds the minimum value by not more than one unit, the asymmetric uncertainty interval for \mu can be determined as \hat{\mu} = 1.24^{+0.49}_{-0.48}. Indeed, the interval exhibits a very small asymmetry.
The minimum value of q can be used to determine the significance in the asymptotic approximation. If the null hypothesis (\mu = 0, assumed in the denominator of Eq. (8.55)) is true, then Wilks' theorem holds, and we have the approximate expression Z \simeq \sqrt{-q_{\min}} = 2.7, in agreement with the estimate obtained with the toy generation.
Note that in Fig. 8.10 the test statistic is zero at \mu = 0 and reaches a negative minimum at \mu = \hat{\mu}, while the profile likelihood defined in Eq. (8.46) has a minimum value of zero.
Fig. 8.9 Distribution of the test statistic q for the background-only hypothesis (blue) and for the
signal-plus-background hypothesis (red). Superimposed is the value determined with the presented
data sample (black arrow). p-values can be determined from the shaded areas of the two PDFs
[Fig. 8.10: scan of the test statistic q as a function of the signal strength \mu]
If a \pm 10 % uncertainty on the background normalization is included, \beta can be left free to float within the range [0.9, 1.1], and the test statistic becomes:

q = -2\ln\frac{\displaystyle\sup_{0.9 \le \beta \le 1.1} L(\vec{n}\,|\,\mu\vec{s}, \vec{b}, \beta)}{\displaystyle\sup_{0.9 \le \beta \le 1.1} L(\vec{n}\,|\,\mu\vec{s} = 0, \vec{b}, \beta)} \, . \qquad (8.57)
The scan of the new test statistic is shown in Fig. 8.12. Compared with the case where no uncertainty was included, the shape of the test-statistic scan is now broader and the minimum is less deep. This results in a larger uncertainty, \hat{\mu} = 1.40^{+0.61}_{-0.60}, and a smaller significance, Z = 2.3.
Note that this approach implicitly assumes a uniform distribution of the estimate \beta' of the parameter \beta, while a different PDF could be adopted. In that case, the test statistic can be modified as follows:

q = -2\ln\frac{\displaystyle\sup_{-\infty < \beta < +\infty} L(\vec{n}\,|\,\mu\vec{s}, \vec{b}, \beta, \beta')}{\displaystyle\sup_{-\infty < \beta < +\infty} L(\vec{n}\,|\,\mu\vec{s} = 0, \vec{b}, \beta, \beta')} \, , \qquad (8.58)

where the likelihood function now includes a constraint term P(\beta'|\beta), the PDF of the estimated background yield \beta' given the true value \beta. Typical options for P(\beta'|\beta) are a Gaussian distribution, with the problem that it could also lead to negative values of \beta', or a log-normal distribution (see Sect. 2.7), which constrains \beta' to be positive.
Fig. 8.11 Toy data sample superimposed on an exponential background model (top) and on an exponential background model plus a Gaussian signal (bottom), with a 10 % uncertainty added to the background normalization
[Fig. 8.12: scan of the test statistic q as a function of \mu, including the uncertainty on the background normalization]
8.14 The Look-Elsewhere Effect

The local p-value is computed from f(q|\mu), the PDF of the adopted test statistic q for a given value of the signal strength \mu, at a fixed value of the parameter of interest, e.g. the mass m_0 of a hypothetical new particle. The local significance therefore gives the probability of a background fluctuation at the fixed mass value m_0. The probability to have a background overfluctuation at any mass value, called the global p-value, is larger than the local p-value.
The global p-value can be evaluated with toy Monte Carlo samples using, as test statistic, the largest value of q over the allowed range of the parameters:

q(\hat{\vec{\theta}}) = \max_{\theta_i^{\min} < \theta_i < \theta_i^{\max},\; i = 1, \cdots, m} q(\vec{\theta}) \, . \qquad (8.61)
An approximate way to determine the global significance taking into account the look-elsewhere effect is reported in [16], relying on the asymptotic behavior of likelihood-ratio estimators. It is possible to demonstrate [17] that the probability that the test statistic q(\hat{m}) is larger than a given value u, i.e.: the global p-value, is bounded by:

P(q(\hat{m}) > u) \le P(\chi^2 > u) + \langle N_u \rangle \, , \qquad (8.62)
Fig. 8.13 Visual illustration of the upcrossings of q(m) above a level u_0, counted to determine \langle N_{u_0} \rangle. In this example, we have N_{u_0} = 3
where P(\chi^2 > u) comes from an asymptotic approximation of the distribution of the local test statistic q(m) as a \chi^2 distribution with one degree of freedom. The term \langle N_u \rangle in Eq. (8.62), instead, is the average number of upcrossings, i.e. the average number of times the curve q = q(m) crosses a horizontal line at a given level q = u with a positive derivative, which can be evaluated using toy Monte Carlo. This is visualized with an example in Fig. 8.13.
The value of \langle N_u \rangle can be very small, depending on the level u. Fortunately, a scaling law exists, so, starting from a lower level u_0, one can extrapolate \langle N_u \rangle from \langle N_{u_0} \rangle as:

\langle N_u \rangle = \langle N_{u_0} \rangle\,e^{-(u - u_0)/2} \, .
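A minimal sketch of the upcrossing counting and of the scaling-law extrapolation is shown below; the toy scans of q(m) are purely illustrative placeholders (independent \chi^2 values rather than a realistic correlated scan), and the reference level u_0 is an assumption.

# A minimal sketch: counting upcrossings at a low level u0 and extrapolating <N_u>.
import numpy as np

def count_upcrossings(q_of_m, u):
    # number of times q(m) crosses the level u with positive derivative
    below, above = q_of_m[:-1] < u, q_of_m[1:] >= u
    return int(np.sum(below & above))

rng = np.random.default_rng(5)
u0 = 0.5
n_up = []
for _ in range(200):                       # placeholder background-only toy scans q(m)
    q_scan = rng.chisquare(1, size=300)
    n_up.append(count_upcrossings(q_scan, u0))
N_u0 = np.mean(n_up)                       # <N_{u0}> at the low reference level u0

u = 9.0                                    # level corresponding to the observed q(m_hat)
N_u = N_u0 * np.exp(-(u - u0) / 2.0)       # scaling law (one degree of freedom)
print(f"<N_u0> = {N_u0:.2f}  ->  <N_u> = {N_u:.4f}")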
References