A Model For Spectra-Based Software Diagnosis
LEE NAISH,
HUA JIE LEE
and
KOTAGIRI RAMAMOHANARAO
University of Melbourne
This paper presents an improved approach to assist diagnosis of failures in software (fault
localisation) by ranking program statements or blocks according to how likely they are to be
buggy. We present a very simple single-bug program to model the problem. By examining different
possible execution paths through this model program over a number of test cases, the effectiveness
of different proposed spectral ranking methods can be evaluated in idealised conditions. The
results are remarkably consistent with those arrived at empirically using the Siemens test suite and
Space benchmarks. The model also helps identify groups of metrics which are equivalent for
ranking. Due to the simplicity of the model, an optimal ranking method can be devised. This
new method out-performs previously proposed methods for the model program, the Siemens test
suite and Space. It also helps provide insight into other ranking methods.
Categories and Subject Descriptors: D.2.5 [Software Engineering]: Testing and Debugging—
Debugging aids
General Terms: Performance, Theory
Additional Key Words and Phrases: fault localization, program spectra, statistical debugging
1. INTRODUCTION
Despite the achievements made in software development, bugs are still pervasive
and diagnosis of software failures remains an active research area. One of many
useful sources of data to help diagnosis is the dynamic behaviour of software as
it is executed over a set of test cases where it can be determined if each result
is correct or not (each test case passes or fails). Software can be instrumented
automatically to gather data such as the statements that are executed in each test
case. A summary of this data, often called program spectra, can be used to rank
the parts of the program according to how likely it is they contain a bug. Ranking
is done by sorting based on the value of a numeric function (we use the term
ranking metric or simply metric) applied to the data for each part of the program.
There is extensive literature on spectra-based methods in other domains, notably
classification in botany, and this is the source for many ranking metrics that can
be used for software diagnosis. We make the following contributions to this area:
Thanks to Tim Miller and the anonymous referees for their comments on drafts of this paper.
(1) We present a very simple program that captures important aspects of the diagnosis problem. In this paper we focus on programs with a single bug, though the methodology can be adapted to multiple bugs.
(2) Because the model allows a more formal approach, we are able to define
metrics for spectra-based ranking that are optimal for the case we examine. This
helps us compare different metrics.
(3) We give the first comprehensive evaluation of spectra-based ranking methods.
Over thirty ranking metrics are evaluated using our model, the Siemens test suite
(several small C programs) and Space (a more realistically sized program). The
results from the model closely match the empirical results, including the superior
performance of the optimal metrics for diagnosis of single-bug programs.
(4) We prove that although all the metrics examined are distinct functions, sev-
eral are equivalent for ranking purposes (they produce identical rankings).
Although diagnosing code with multiple bugs is the norm, there are several rea-
sons for focussing on the single bug case. First, it provides a simpler instance of the
problem to trial our model-based approach. A simpler model can be used, which
allows much simpler analysis and theoretical results. Second, empirical comparison
with other work requires the use of the same benchmarks, and the most widely used
have a strong single-bug bias. Third, the program having a single bug is a sufficient
condition for our results but it is not a necessary condition. A key condition, which
depends on the set of test cases and is sufficient for some technical results, is a bug
exists that is executed in every failed test case. This condition may be satisfied
even though some of the failures are caused by other bugs and/or there are still
other bugs that cause no failures. Furthermore, we typically have some control over
the set of test cases. There is some previous work on partitioning test cases that
attempts to satisfy this weaker condition — we discuss this in Section 10.
The rest of the paper is organized as follows. Section 2 provides the necessary
background on spectra-based diagnosis, including the formulas used in the many
ranking metrics from the literature. Section 3 discusses the related idea of using
data on predicates rather than statement executions, and how the two approaches
can be equivalent in some cases. Section 4 presents our model program. Section 5
describes our methodology for considering multisets of execution paths of the model
program and how we evaluate performance of metrics in the model. In Section 6,
we discuss optimal metrics. In Section 7 all metrics are compared, analysed and
evaluated using our model. Various metrics are shown to be equivalent. Section 8
gives empirical performance results for the metrics using the Siemens test suite and
Space benchmarks. Sections 9 and 10 discuss related and further work, respectively.
Section 11 concludes.
2. BACKGROUND
A program spectrum is a collection of data that provides a specific view of the
dynamic behavior of software. It contains information about the parts of a program
that were executed during the execution of several test cases. In general the parts
could be individual statements, basic blocks, branches or larger regions such as
functions. Here we use individual statements. This is equivalent to considering basic
dist(a, b)+dist(b, c)). Given a norm, statements (rows in the matrix) can be ranked
according to how close they are to the result vector (whether each test passes or
fails). Some of the ranking metrics we use are referred to as norms in the literature;
others are called measures or distances. Some are defined in more general cases,
such as where we have a matrix and vector of arbitrary numbers. We restrict at-
tention to the special case of binary numbers where the data is summarised by the
four aij values. This induces equivalence of some ranking metrics that are distinct
in more general cases. Additional equivalences are induced because we are only
interested in the ranking of statements produced — not the metric values per se
(see Section 7.2).
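To make the summary concrete, the following sketch (ours, in Python; the formal definitions of the aij counts fall in text not reproduced in this excerpt) computes the four counts for each statement: aef and aep are the numbers of failed and passed tests that execute the statement, and anf and anp the numbers of failed and passed tests that do not. Tuples are ordered (anp, anf, aep, aef), following the argument order used later in Definition 6.2.

    def spectra_counts(cover, failed):
        # cover[i][j] is 1 if statement i is executed in test j;
        # failed[j] is True if test j fails.
        counts = []
        for row in cover:
            aef = sum(1 for x, f in zip(row, failed) if x and f)
            aep = sum(1 for x, f in zip(row, failed) if x and not f)
            anf = sum(1 for x, f in zip(row, failed) if not x and f)
            anp = sum(1 for x, f in zip(row, failed) if not x and not f)
            counts.append((anp, anf, aep, aef))
        return counts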
All metrics project points in many dimensions (proportional to the number of test
cases) onto a single dimension. This is a simple instance of the clustering problem
where there are just two clusters we want to distinguish — the correct statements
(with low metric values) and buggy statements (with high metric values). The
methodology can be used in many domains. For example, instead of program
statements and test cases that fail or pass we might have genes and individuals
who do or do not have some form of disease; the goal being to determine the genes
most likely to be associated with the disease. Usually there can be both “false
negatives” (for example, the disease is not present or the test case succeeds even
though the bad gene is present or the buggy statement is executed) and “false
positives” (for example, the disease is present even though the bad gene is not).
In general the data may not be binary: genes may be expressed to varying degrees
(even in software diagnosis it is possible to consider the number of times a statement
is executed in each test case) and the severity of the disease may vary. There are
infinitely many possible metrics and numerous metrics have been proposed in the
literature for predicting the statements most likely to be buggy or for the equivalent
classification in other domains — see Table II. We briefly review their sources next.
The oldest of the metrics is Jaccard [1901], originally used for classification in
the botany domain but since used in many areas. Botany has also spawned several
other metrics: Ochiai [1957], Russell and Rao [1940], Sørensen and Dice [Dice 1945;
Duarte et al. 1999], Rogers and Tanimoto [1960], Anderberg [1973] and Simple-
Matching [Meyer et al. 2004]. Jaccard, Dice, Overlap and a more general version
of the Ochiai metric called Cosine have been used in the field of information re-
trieval [Dunham 2002]. Hamming [1950] was originally introduced for error detect-
ing/correcting codes; for binary numbers it is equivalent to Lee [1958] and Manhat-
tan [Krause 1973] (we refer to these as Hamming etc. in Table II). Manhattan and
Euclid [Krause 1973] have been used in clustering. Jaccard and Simple-Matching
have also been used in clustering, along with two unnamed metrics that we
refer to as M1 and M2, respectively [Everitt 1978]. In the area of biometrics various
metrics have been introduced: Goodman and Kruskal [1954], Scott [1955], Fleiss
[1965], Cohen [1960], Geometric Mean [Maxwell and Pilliner 1968] as well as Arith-
metic Mean, Harmonic Mean, Rogot1 and Rogot2 [Rogot and Goldberg 1966]. In
the field of clustering using the Self-Organizing Maps (SOM) algorithm various
similarity measures or metrics have been introduced and evaluated: Kulczynski1,
Kulczynski2, Hamann and Sokal [Lourenço et al. 2004].
Spectra-based ranking has also been applied to software diagnosis by various
researchers. In Section 9 we give more details of this work. The Tarantula system
[Jones and Harrold 2005] is credited as the first system to apply spectra-based
ranking to software diagnosis (for an imperative language). The AMPLE system
[Dallmeier et al. 2005] was designed for diagnosis of object-oriented software. The
Ample metric we use is actually taken from subsequent work [Abreu et al. 2007]; it
is just one possible generalisation of the actual formula used in the AMPLE system,
which only ever has a single failed test case. The Pinpoint web diagnosis system
[Chen et al. 2002] uses the Jaccard metric. Ochiai, Jaccard, Ample and Tarantula
metrics have been evaluated for diagnosis using the Siemens test suite [Abreu et al.
2007]. More recently, the Zoltar metric [Gonzalez 2007] has been proposed and
used for software diagnosis and in [Wong et al. 2007] several metrics (Wong1-3)
were proposed and evaluated.
The intuition behind the metrics varies greatly. Here we give three brief examples.
Jaccard devised a way to measure the similarity of two sets. It is the size of their
intersection divided by the size of their union. We can apply this idea to spectral
diagnosis by using the set of test cases that fail and the set of test cases for which a
particular statement is executed. A second approach is to view a row of the matrix
as a vector in n-dimensional space. The cosine of the angle between this vector and
the vector of test results is another measure of similarity — this is what the Ochiai
metric computes. A third way is to think of the rows and results as bit strings.
The number of bits that differ in two strings is a measure of dis-similarity (the
Hamming distance). By first taking the complement of one of the bit strings we
obtain a measure of similarity — this is what our formula for the Hamming metric
computes.
Several of the metrics contain quotients where the denominator can be zero. If the numerator is zero we use zero; otherwise we use a suitably large value. For example, for the Overlap formula we can use the number of tests plus 1, which is larger than any value that can be returned with a non-zero denominator. An alternative is to add a suitably small ε to the denominator.
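Concretely, the three examples above can be written in terms of the four aij counts as follows (a sketch in Python, ours; Table II gives the official formulas, which are not reproduced in this excerpt), together with the zero-denominator guard just described:

    import math

    def safe_div(num, den, T):
        # Guard described above: a zero numerator gives 0; otherwise return a
        # value (T + 1) larger than anything obtainable with a non-zero denominator.
        if den == 0:
            return 0.0 if num == 0 else float(T + 1)
        return num / den

    def jaccard(anp, anf, aep, aef):
        # size of (failed tests intersect tests executing the statement) / size of union
        return safe_div(aef, aef + anf + aep, anp + anf + aep + aef)

    def ochiai(anp, anf, aep, aef):
        # cosine of the angle between the statement's row and the result vector
        return safe_div(aef, math.sqrt((aef + anf) * (aef + aep)),
                        anp + anf + aep + aef)

    def hamming_similarity(anp, anf, aep, aef):
        # agreements between the complemented row and the result vector
        return aef + anp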
3. INSTRUMENTING PREDICATES
An alternative approach to bug localization is to instrument certain predicates in
the code rather than statement execution [Liblit et al. 2005; Liu et al. 2005]. Each
predicate is associated with a single program point. Examples of useful predicates
are conditions of if statements, and whether the “return” expressions of functions
are positive, negative or zero. During execution of a test case, data can be gathered
on the predicates that are “observed” (execution has reached that program point)
and of those, the ones that were true (at least once). This data can then be used
to rank the different predicates according to how well they predict failure of a test
case. For example, it may be that a failure occurs if and only if a certain variable
is zero at a certain program point. If we have a predicate that tests this, it is likely
to be highly ranked and useful for diagnosis.
One implementation of these ideas is the Cooperative Bug Isolation (CBI) system
[Liblit 2004; Liblit et al. 2005], in which predicates are ranked as follows. Each
predicate B has values Failure(B), defined as the number of failed tests where B
was true divided by the total number of tests where B was true, and Context(B),
defined as the number of failed tests where B was observed (executed) divided by
the total number of tests where B was observed. One ranking metric proposed is
Increase(B), defined as Failure(B) − Context(B). There are also two variations,
which attempt to combine Increase with another simple metric (the proportion of
failed tests in which B is true), using a harmonic mean. One uses logarithms [Liblit
et al. 2005]; the other (from the source code distribution) uses square roots.
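As a sketch (our own rendering; the counter names below are assumptions, and the exact CBI Log and CBI Sqrt combinations are those given in Table III, not reproduced here):

    def cbi_increase(true_fail, true_pass, obs_fail, obs_pass):
        # Failure(B): failed runs in which B was true / all runs in which B was true
        failure = true_fail / (true_fail + true_pass)
        # Context(B): failed runs in which B was observed / all runs observing B
        context = obs_fail / (obs_fail + obs_pass)
        # Increase(B) = Failure(B) - Context(B)
        return failure - context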
If the only predicates instrumented are related to control flow, instrumenting
statement execution provides the same information. For example, in the program
segment S1; if B then S2, we know B is observed if and only if S1 is executed and
B is true if and only if S2 is executed. Note that although the information about
B is available, it is spread across two different statements, so the program spectra
methods we consider (which never combine aij values from different statements)
may not always be able to reproduce the predicate-based ranking. An apparent
advantage of predicate-based systems is that predicates unrelated to control flow can
be introduced. This allows our domain knowledge to be used to obtain more relevant
information. For example, whether the return value of a function is negative is
often important. However, any predicate P deemed worth instrumenting can be
used as a control flow predicate by introducing a dummy if statement that tests
the predicate then does nothing. Some predicate-based systems, such as CBI, use
sampling of data points rather than collecting all the data, but this can equally
be applied to statement-based methods. Hence there is no fundamental difference
between instrumenting predicates and statements.
The model program we use is very simple, and for a given set of tests, Context is
the same for all control flow predicates (they are always observed). This allows the
predicate-based metrics used in CBI to be translated into the statement-based pro-
gram spectra formalism and compared fairly with other metrics (there is a bijection
between statements and predicates and our translation of the formulas preserves
the ranking in all cases). We refer to these metrics as CBI Increase (abbreviated
Inc), CBI Log and CBI Sqrt. The definitions are given in Table III. We use the
same formulas in our empirical experiments. However, for the test suite programs,
Context can vary between different predicates. This means our translation of the
CBI predicate-based metrics to statement-based metrics does not preserve ranking,
so the empirical performance we report for these metrics may not reflect that of
the CBI system.
4. A MODEL PROGRAM
Figure 1 gives the code for the If-Then-Else-2 (ITE2) program segment which we use as a model for single-bug programs. It is assumed that functions s1() and s2() exist and may have side effects such as assignments to variables, whereas the functions t1(), t2() and t3() return Booleans but have no side effects. The
intention is that ITE2 should assign True to variable x. There are also intentions
for the individual statements and these are met by the program except for S4, which
may sometimes assign False instead of True to x.
We use t3() to model the fact that buggy code may sometimes behave correctly
and sometimes trigger a failure. We use t2() (and the second if-then-else) to
model the fact that buggy code may not be executed in every run. We use t1() to
if (t1())
s1(); /* S1 */
else
s2(); /* S2 */
if (t2())
x = True; /* S3 */
else
x = t3(); /* S4 - BUG */
model the fact that ranking methods (and debugging in general) must cope with
“noise”. It may be that the execution of S1, for example, is strongly correlated
with failed test runs. This may be due to logical dependencies within the program
or the particular selection of test data. The “signal” we want to detect is associated
with t2() — the buggy statement is executed if and only if t2() returns False.
The signal is essentially attenuated by t3() — if t3() almost always returns True
there is little signal we can detect (and it is more likely that the noise will be greater,
leading to S1 or S2 being ranked top).
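These roles can be made concrete with a small simulation (a sketch under our own assumptions: each ti() is treated as an independent coin flip with the probabilities shown, purely for illustration — the evaluation in Section 5 instead enumerates or samples multisets of paths uniformly):

    import random

    def run_ite2(p1=0.5, p2=0.5, p3=0.5):
        # p1, p2, p3: assumed probabilities that t1(), t2(), t3() return True
        executed = set()
        executed.add("S1" if random.random() < p1 else "S2")   # noise source
        if random.random() < p2:
            executed.add("S3")
            x = True
        else:
            executed.add("S4")                  # the buggy statement
            x = random.random() < p3            # t3(): may leave x False
        return executed, x                      # the test passes iff x is True

    def collect_spectra(num_tests):
        counts = {s: {"anp": 0, "anf": 0, "aep": 0, "aef": 0}
                  for s in ("S1", "S2", "S3", "S4")}
        for _ in range(num_tests):
            executed, passed = run_ite2()
            for s, c in counts.items():
                key = "a" + ("e" if s in executed else "n") + ("p" if passed else "f")
                c[key] += 1
        return counts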
Our intuition suggested that having noise and an attenuated signal were the two
most important features we needed in a model, and ITE2 is the simplest model
program we could think of that has these features. Despite its simplicity, this model
has been very useful in evaluating, understanding and improving the performance
of spectral diagnosis methods, as our later results show. There are many ways the
model could be extended, for example, by:
—having more bugs,
—having more sources of noise,
—simulating loops, so both branches of an if-then-else can be executed in a single
test, and
—having statements that are executed more or less often over typical sets of tests
(in ITE2 all statements are executed in half the tests, on average).
The way we evaluate performance of metrics (described in the next section) is
independent of the model. We have experimented with all these extensions and
report some general observations here.
often far from this ideal. Our methodology for evaluating overall performance of
a metric uses the number of tests as a parameter. Given a number of tests t, we
determine the average performance over all possible multisets of t execution paths
(Table IV gives results from our model for each metric for various values of t we
choose between 2 and 1000). When the number of possible multisets is not too
large we generate each distinct multiset once to evaluate the overall performance
of each metric. For larger numbers of multisets we compute estimates by randomly
sampling multisets from a uniform distribution.
The problem is equivalent to the more classical "balls and bins" or "books and bookends" problems (the number of ways t books can be partitioned on a shelf using p − 1 bookends) and there are several equivalent recurrences. The solution is given by the binomial coefficients (Pascal's triangle) — f(t, p) = C(t + p − 1, t), where C(n, k) = n!/(k!(n − k)!).
We use a bijection between integers in the range 1 to f (t, p) and sequences of p
integers that sum to t. We repeatedly generate a random number in this range (or
generate each number once if the range is small enough) and map it to a sequence
of path counts which represents a multiset.
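One concrete choice of bijection is sketched below (ours; the paper does not specify which bijection it uses, only that one exists), using the count f(t, p) defined above with p = 8, the number of distinct paths through ITE2:

    from math import comb
    import random

    def f(t, p):
        # number of multisets of t paths drawn from p distinct paths
        return comb(t + p - 1, t)

    def unrank(index, t, p):
        # Map an index in 0 .. f(t, p) - 1 to a sequence of p path counts summing to t.
        parts = []
        for remaining in range(p - 1, 0, -1):
            c = 0
            while index >= f(t - c, remaining):
                index -= f(t - c, remaining)
                c += 1
            parts.append(c)
            t -= c
        parts.append(t)
        return parts

    def random_multiset(t, p=8):
        # uniform sampling when f(t, p) is too large to enumerate exhaustively
        return unrank(random.randrange(f(t, p)), t, p)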
the score in some cases but never decreasing it. We further discuss scoring functions
in Section 8.2.
Definition 5.2 Total score for ranking metric M with t tests. The total score for
ranking metric M with test set size t is the sum of the scores for M with paths X,
over all possible multisets X of t execution paths that contain at least one failed
path.
In our experiments we report percentage scores: the total score for all multisets
examined divided by the number of multisets, times 100. We give percentage scores
for multisets of paths with certain characteristics (such as particular numbers of
failed cases). Multisets of paths without these characteristics are simply ignored
when computing percentage scores. We always ignore multisets with no failed tests.
For example, with a single test there are just eight multisets of paths. Six of these
have no failed tests and are ignored. The remaining two multisets have a single
execution path containing S1 or S2 followed by (the failing path through) S4. For
both these multisets the Jaccard metric (and other reasonable metrics) give a score
of 0.5. The percentage score is thus (0.5 + 0.5)/2 × 100 = 50%.
6. OPTIMAL RANKING
For any given multiset of paths, metrics exist that rank S4 top (or equal top) but
no single metric exists that ranks S4 top for all multisets. Thus no metric is best
in all cases. However, it is possible to have metrics that are optimal in the sense
that they maximise the total score over all possible multisets.
Definition 6.1 Optimal ranking metric for t tests. A ranking metric is optimal
for test set size t if no other ranking metric has a higher total score with t tests.
This definition is most appropriate if we assume a uniform distribution, since
each multiset of paths is considered equally. It would be possible to generalise the
definition so each multiset has a weight, which is multiplied by the score for that
multiset. Ignoring some multisets, as we do in our experiments, is equivalent to
having a weight of zero for those multisets and a weight of one for other multisets.
Note that a metric that is optimal for a uniform distribution (equal weights) may
not be optimal for other distributions, but may be close to optimal.
6.1 Optimal ranking metric O
We now define a ranking metric and prove it is optimal (for ITE2) for all numbers of tests.
Definition 6.2 Ranking metric O.

    O(anp, anf, aep, aef) = −1 if anf > 0, and anp otherwise
Superficially this is a rather odd metric as it only uses anf and anp , which appear
to be the least important variables in most other metrics. However, for single-
bug programs such as ITE2, anf is always zero for the buggy statement, so any
statement with a non-zero anf can be given the lowest rank. Furthermore, any
two statements that have anf = 0 must have the same aef value since anf + aef is
the total number of failed tests (which is the same for all statements). Similarly,
anp + aep is the total number of passed tests, so there is only one non-trivial degree
of freedom. There is no a priori reason to suppose a buggy statement is executed
more or less often than a correct statement. However, since all failed tests use the
buggy statement, aep will tend to be smaller for buggy statements, so anp will tend
to be higher.
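For reference, a direct rendering of O and of the Op variant (a sketch; the closed form used for Op below, aef − aep/(P + 1), is the one consistent with Proposition 7.7 — Op's official definition appears in a table not reproduced in this excerpt), followed by the ranking step:

    def metric_O(anp, anf, aep, aef):
        # Definition 6.2: any statement missed by some failed test is ranked last
        return -1 if anf > 0 else anp

    def metric_Op(anp, anf, aep, aef):
        # assumed closed form, consistent with Proposition 7.7: aef - aep / (P + 1)
        return aef - aep / (anp + aep + 1.0)

    def rank_statements(counts, metric):
        # counts: one (anp, anf, aep, aef) tuple per statement; most suspicious first
        return sorted(range(len(counts)),
                      key=lambda i: metric(*counts[i]),
                      reverse=True)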
Proposition 6.3. For single-bug programs, increasing the value returned by a
ranking metric when anf > 0 never increases the total score for the metric.
Proof. Since there is a single bug, anf > 0 only for non-buggy statements. In-
creasing the rank of a non-buggy statement never increases the score for a multiset,
so the total score is never increased.
The key lemma below shows that O cannot be improved by making a minimal
change to the ranking it produces. O ranks a statement with anf = 0 and anp = x
lower than a statement with anf = 0 and anp = x + 1, and no statements are
ranked between these. We consider the effect of swapping the ranks of two such
statements, or making the ranks equal.
Lemma 6.4. Suppose O′ and O′′ are ranking metrics such that, for a single x,

    O′(anp, anf, aep, aef) = x + 1 if anp = x and anf = 0, and O(anp, anf, aep, aef) otherwise;
    O′′(anp, anf, aep, aef) = x + 1.5 if anp = x and anf = 0, and O(anp, anf, aep, aef) otherwise.

For ITE2, O has a total score at least as high as O′ and O′′ for all test set sizes t.
Proof. For O′′ the ranking of statements where anp = x and anp = x + 1 is
swapped compared to the ranking using O and for O′ they have equal rank; the
relative rank of all other pairs of statements is the same as that using O. Since S3
is not executed in the failed test(s) it has a metric value of -1, which is strictly less
than the value for S4 and will not affect the score for any multiset. Similarly, at least
one of S1 and S2 have a score of -1 for each multiset: whenever S1 is used S2 is
not used and vice versa, so they can’t both be executed in (all) the failed test(s).
We can never have a tie between S1 and S2 unless both have value -1.
Thus O has a total score greater than or equal to that of O′ and O′′ if and only if the number of multisets such that ⟨anp, anf⟩ = ⟨x + 1, 0⟩ for S4 and ⟨anp, anf⟩ = ⟨x, 0⟩ for S1 or S2 (whichever has the larger score) is greater than or equal to the number of multisets such that ⟨anp, anf⟩ = ⟨x + 1, 0⟩ for S1 or S2 (whichever has the larger score) and ⟨anp, anf⟩ = ⟨x, 0⟩ for S4. Due to the symmetry between S1 and S2, we can just take the number of multisets where S1 has the larger score, and double it. All these multisets have anf = 0 for S1 and S4 so we leave this constraint implicit below.
The six passed paths through ITE2 can be classified as follows: two pass through S1 but not S4, one passes through S4 but not S1, two pass through neither S4 nor S1, and one passes through both S4 and S1 (due to symmetry the same holds if we replace S1 by S2 here). Let b be the number of tests using neither S4 nor S1 (it
contributes to the anp total for both since anp is the number of passed tests that do
not execute the statement). Also, let y = x − b and z = t − 2y − b − 1.
The number of multisets such that anp = x + 1 for S4 and anp = x for S1 is the
number of multisets where there are y + 1 paths that contribute only to anp for
S4, y paths that contribute only to anp for S1, b paths that contribute to both and
the remaining z paths that contribute to neither. The number of distinct paths in
these categories in ITE2 are two, one, two and one, respectively (as noted above), and b can range between zero and ⌊t/2⌋, so the number of multisets is

    Σ_{b=0}^{⌊t/2⌋} f(y + 1, 2) f(y, 1) f(b, 2) f(z, 1)
Similarly for S2. By the same reasoning, the number of multisets such that
anp = x + 1 for S1 and anp = x for S4 is
    Σ_{b=0}^{⌊t/2⌋} f(y, 2) f(y + 1, 1) f(b, 2) f(z, 1)
Similarly for S2. It is sufficient to show that f(y + 1, 2) f(y, 1) ≥ f(y, 2) f(y + 1, 1) in all cases. Using definitions of f and C and simplifying we obtain

    C(y + 2, y + 1) C(y, y) ≥ C(y + 1, y) C(y + 1, y + 1)
    C(y + 2, y + 1) ≥ C(y + 1, y)
    y + 2 ≥ y + 1
which is true.
Theorem 6.5 Optimality of O. Ranking metric O is optimal with respect to
ITE2 for all test set sizes.
Proof. Suppose some ranking metric O′ differs from O. Due to Proposition 6.3
any difference in ranking for tuples such that anf > 0 cannot make O′ better than
O. For tuples in which anf = 0, aef is the number of failed tests, which is the
same for all tuples, and anp + aep is the number of passed tests. Thus the number
of distinct values for the tuples is the number of distinct values for anp and O′ is
equivalent to a function that maps these tuples to the range of anp values. O can
be modified to obtain an equivalent ranking in a finite number of steps by swapping
the ranking of adjacent values or making them equal. But by Lemma 6.4 none of
these steps would increase the total score.
6.2 Other optimal ranking metrics
The proofs above can be generalised to show that a ranking metric M is optimal
for ITE2 if, given a fixed number of passed and failed tests,
(1) when anf = 0, M is increasing in anp or (equivalently) decreasing in aep (the
key requirement for Lemma 6.4), and
(2) when anf > 0, the value returned is always less than any value returned when
anf = 0 (allowing use of Proposition 6.3).
7. MODEL RESULTS
There are several reasons why our approach of modelling the diagnosis problem
using a simple program such as ITE2 is beneficial. First, it is simple to run
many experiments where various parameters are controlled precisely. The vagaries,
biases, restrictions and cost of constructing or gathering “real” code and test suites
are avoided. Such experiments can inspire hypotheses about behaviour of various
metrics et cetera, which can then be investigated further. For example, we have
proved the equivalence of various metrics having noticed they produced the same
results for all test suite sizes using our model — see Section 7.2. Second, if some
result holds for the model and the Siemens test suite, say, we can be more confident
it will also hold for other situations (more than if the result holds for only one of
these at least). Third, a model allows a more analytical approach, rather than
empirical. A clear example is our development of optimal metrics.
Table IV. Influence of test suite size and metric on total score (%) for ITE2
Num. of tests 2 5 10 20 50 100 500 1000
O, Op 60.00 72.59 81.10 87.98 94.13 96.81 99.31 99.65
Wong3′ 60.00 67.04 76.51 86.87 94.06 96.81 99.31 99.65
Wong3 56.67 63.15 75.59 86.79 94.06 96.81 99.31 99.65
Zoltar 60.00 71.48 79.97 87.53 94.06 96.80 99.31 99.65
M2 60.00 72.59 80.98 87.34 92.41 94.33 95.87 96.05
Kulczynski2 60.00 71.48 79.72 86.30 91.78 93.96 95.78 95.99
Russell etc 53.33 61.11 69.57 78.79 88.89 93.80 98.64 99.31
Overlap 48.89 54.88 66.57 77.91 88.79 93.79 98.64 99.31
Ochiai 60.00 71.48 79.12 84.90 89.51 91.28 92.75 92.93
Rogot2 56.67 67.78 77.10 83.50 88.42 90.28 91.81 92.00
HMean 48.33 67.13 77.06 83.50 88.42 90.28 91.81 92.00
GMean 48.33 67.13 76.92 83.33 88.23 90.07 91.59 91.78
AMean 48.33 67.13 76.75 83.18 88.01 89.86 91.39 91.58
Ample2 56.67 67.78 76.45 82.79 87.84 89.73 91.28 91.48
Jaccard etc 60.00 71.48 78.22 83.12 87.02 88.55 89.85 90.02
Ochiai2 48.33 67.13 75.13 80.93 85.24 86.89 88.30 88.47
Cohen 48.33 67.13 75.16 80.63 84.75 86.35 87.70 87.87
Tarantula etc 55.56 62.10 69.68 75.68 80.40 82.21 83.73 83.92
Fleiss 56.67 65.93 72.54 76.70 80.10 81.48 82.70 82.86
Scott etc 56.67 66.67 72.43 76.46 79.66 80.95 82.12 82.27
CBI Log 33.33 50.86 63.69 73.13 78.70 80.30 82.35 82.89
CBI Sqrt 28.89 46.73 60.71 70.99 77.16 78.69 79.63 79.73
Rogers etc 56.67 63.15 67.60 71.02 73.93 75.15 76.26 76.41
Ample 36.67 34.54 38.28 41.40 43.92 44.87 45.64 45.74
1/((1/x) − 1) + 1/2.
In the following we use F to denote the number of failed tests, P to denote the
number of passed tests and T to denote total number of tests; T = F + P .
Proposition 7.3. The Rogers and Tanimoto, Simple Matching, Hamann, Sokal,
M1, Wong2, Euclid, Hamming, Manhattan and Lee metrics are all equivalent for
ranking.
Proof. Simple Matching is equivalent to (aef + anp)/T. Applying x · T − P we obtain aef − aep (Wong2), since P = aep + anp. Hamann is equivalent to (2aef + 2anp − T)/T and applying (x · T + T)/2 we get the same result. For Rogers and Tanimoto we can apply 2T/((1/x) + 1) − P to get the Simple Matching formula. For Sokal we can apply T/((2/x) − 1) to get the same result. For M1 we can apply 1/((1/x) + 1) to obtain the Simple Matching formula. Euclid squared is simply the same as Hamming, Manhattan and Lee, and applying x/T to Hamming we get Simple Matching.
Proposition 7.4. The Scott and Rogot1 metrics are equivalent for ranking.
Proof. Taking the Scott formula, using anf = F − aef and anp = P − aep, multiplying out and simplifying we obtain

    (−F² + (4P + 2F)aef − 2F aep − 2aef aep − aep² − aef²) / (F² + 2PF + 2P aef + 2P aep − 2aef aep − aep² − aef²)

Adding one we obtain

    (2PF + (6P + 2F)aef + (2P − 2F)aep − 4aef aep − 2aep² − 2aef²) / (F² + 2PF + 2P aef + 2P aep − 2aef aep − aep² − aef²)

Taking the Rogot1 formula, using anf = F − aef and anp = P − aep, multiplying out and simplifying we obtain this result divided by two.
Proposition 7.5. The CBI Increase and Tarantula metrics are both equivalent to aef/aep for ranking.
Proof. The Tarantula metric is equivalent to (aef/F)/(aef/F + aep/P). Applying F/(P((1/x) − 1)) we obtain aef/aep. CBI Increase is equal to aef/(aef + aep) − F/T. Applying 1/(1/(x + F/T) − 1) we obtain the same result.
This proposition is of particular interest for two reasons. First, it illustrates an-
other advantage of our more formal approach. Our initial implementation of the
CBI Increase metric had slightly different performance to Tarantula. Having proved
they should be identical, the code was examined and a bug in our implementation
of CBI Increase was found.¹ Second, the Tarantula metric does not perform par-
ticularly well — there are significantly better metrics. This suggests that there is a
strong possibility of improving the performance of the CBI system by using a bet-
ter metric. We are currently investigating this. Alternatively, the role of Context
¹ A special case to avoid division by zero returned a value that was not minimal in all cases.
may be particularly important and this could influence the way data on statement
executions are best used for diagnosis.
Proposition 7.6. The Russell and Rao and Wong1 metrics are equivalent for
ranking and if there is a single bug they result in the same ranking for the bug as
the Binary metric.
Proof. The Russell and Rao metric equals aef /T ; multiplying by T gives aef
(which is Wong1). This is maximized for exactly those statements that are executed
in all failed tests. There is at least one such statement, the buggy one, if there is a
single bug and at least one failed test; if there are no failed tests all statements are
ranked equally for all these metrics. The statements that are executed in all failed
tests have anf = 0 and thus also maximise the Binary metric. So the same set of
statements is maximally ranked in these metrics and includes the bug if there is a
single bug.
Note the non-maximal rankings using Binary can differ from those using
Russell and Rao and Wong1, so for multiple bug programs the performance may
differ. Our empirical results in Section 8, which include a small proportion of
multiple bug programs, are almost identical for these metrics. O and Op have
a similar relationship, but can give multiple distinct ranks for statements with
anf = 0:
Proposition 7.7. O and Op give the same rankings to all statements with
anf = 0, which includes the bug if there is a single bug.
Proof. If anf = 0, O is anp and Op is F −aep /(P +1). Also, both metrics return
higher values than for any statement with anf > 0. Applying F − (P − x)/(P + 1)
to O gives Op.
7.3 Error detection accuracy
Error detection accuracy, qe [Abreu et al. 2006], is defined as the proportion of
failed tests in test cases where the buggy statement was executed. That is,
    qe = aef / (aef + aep)
It is an indication of how consistent a bug is, though it depends on the test set used.
Some bugs always result in failure when they are executed (called deterministic bugs
in [Liblit et al. 2005]) whereas some return the correct result for nearly all tests
in which they are executed. We would expect less consistent bugs to be harder to
diagnose. This has been studied empirically [Abreu et al. 2006; 2007; Lee et al.
2008]. Here we study it using our model.
Table V gives the performance of the metrics for 100 test cases and several ranges
of qe values. For this and subsequent tables, we use strict inequality for the lower
bound and non-strict for the upper bound. For example, 0.9–1 means 0.9 < qe ≤ 1.
We generate multisets of paths as before and for each one we determine the qe
value. The second row of the tables gives the percentage of multisets for which
the qe value is in the range, which peaks around 0.5. Multisets where there are no
failed tests are not in any of the ranges, so the total percentage is slightly less than
100.
Table V. Influence of error detection accuracy qe with 100 tests for ITE2
qe range 0.00-0.05 0.05-0.10 0.10-0.20 0.20-0.50 0.50-0.90 0.90-1.00
% of multisets 1.06 2.48 8.03 38.90 45.39 3.76
O, Op 65.48 83.13 91.10 96.68 99.16 99.94
Wong3′ 65.46 83.12 91.09 96.67 99.15 99.93
Wong3 65.46 83.12 91.09 96.67 99.15 99.93
Zoltar 65.48 83.13 91.09 96.67 99.15 99.94
M2 65.08 80.34 86.37 92.78 98.04 99.93
Kulczynski2 65.46 82.60 88.59 92.29 97.14 99.89
Russell etc 59.57 76.62 86.71 93.67 96.62 97.33
Overlap 59.57 76.61 86.70 93.66 96.60 97.24
Ochiai 63.11 75.64 80.77 88.32 96.48 99.89
Rogot2 62.39 74.44 79.23 86.79 95.94 99.89
HMean 62.39 74.44 79.23 86.79 95.94 99.89
GMean 62.01 73.82 78.75 86.57 95.79 99.88
AMean 61.65 73.22 78.28 86.35 95.65 99.87
Ample2 62.50 74.76 79.64 86.49 94.92 99.74
Jaccard etc 58.51 67.71 73.19 84.34 95.77 99.89
Ochiai2 58.37 67.28 71.76 81.78 94.61 99.87
Cohen 58.17 66.58 70.97 81.04 94.23 99.85
Tarantula etc 57.98 65.82 69.16 76.79 89.24 98.81
Fleiss 30.41 39.92 52.31 75.82 93.44 99.75
Scott etc 33.01 43.28 53.79 74.01 93.32 99.85
CBI Log 41.08 63.48 69.29 77.03 85.90 92.22
CBI Sqrt 37.04 62.45 69.12 76.71 83.17 88.09
Rogers etc 26.69 31.79 37.64 63.45 93.26 99.87
Ample 31.25 37.38 39.82 43.24 47.46 49.87
Table VI. Influence of proportion of failed tests with 100 tests for ITE2
Failure range 0.00-0.05 0.05-0.10 0.10-0.20 0.20-0.50 0.50-0.90 0.90-1.00
% of multisets 6.24 11.02 26.54 49.27 6.56 0.002
O, Op 86.86 94.23 96.79 98.36 99.10 99.37
Wong3′ 86.85 94.22 96.79 98.35 99.09 99.17
Wong3 86.85 94.22 96.79 98.35 99.09 99.17
Zoltar 86.86 94.22 96.78 98.34 99.08 99.26
M2 86.34 92.11 93.91 95.64 97.41 99.16
Kulczynski2 86.76 93.17 94.14 94.67 96.05 98.77
Russell etc 74.36 88.76 93.76 96.82 98.30 98.95
Overlap 74.36 88.75 93.75 96.81 98.23 94.68
Ochiai 84.28 88.73 90.44 92.64 95.44 98.77
Rogot2 84.25 88.27 89.48 91.37 94.47 97.35
HMean 84.25 88.27 89.48 91.37 94.47 97.35
GMean 83.65 87.81 89.34 91.27 93.85 96.54
AMean 83.05 87.35 89.20 91.18 93.26 95.90
Ample2 84.60 88.96 89.96 90.38 90.11 85.53
Jaccard etc 80.73 84.34 86.82 90.58 94.83 98.77
Ochiai2 81.26 84.30 85.71 88.13 92.07 96.42
Cohen 80.34 83.35 85.04 87.79 91.47 95.98
Tarantula etc 79.87 82.01 82.46 82.47 81.74 73.29
Fleiss 66.19 73.92 79.77 85.10 88.45 90.48
Scott etc 67.72 73.92 78.51 84.26 90.36 96.50
CBI Log 75.57 82.44 82.95 79.71 74.89 69.95
CBI Sqrt 74.42 82.33 83.04 77.18 70.41 65.88
Rogers etc 58.41 62.60 68.95 81.07 92.77 98.50
Ample 42.30 44.48 44.98 45.19 45.05 42.77
Overlap and Ample2 also decreases slightly for high proportions of failures.
In previous work [Abreu et al. 2007] there is empirical evidence that around 5
failed test cases is sufficient for good performance of Ochiai, and increasing the
number of passed tests beyond 20 results in only minor improvement in perfor-
mance. We reproduce this experiment, using 4–6 failed tests, and all the metrics.
The results are given in Table VII. For each number of tests we give the percent-
age of multisets with 4–6 failed tests (as expected, this decreases as the number
of tests increases). The results are consistent with those of [Abreu et al. 2007] —
most metrics show only a modest performance increase. The performance of a few
of the poorer metrics actually decreases with more passed tests. As with Table
VI, when the number of tests is large (and thus the proportion of failed tests is
small) Ample2 performs well, Russell etc. and Overlap perform relatively poorly
and Kulczynski2 performs better than M2.
Table VII. Influence of passed tests with 4–6 failed tests for ITE2
Num. of tests 10 20 50 100 500
% of multisets 35.94 33.79 13.18 4.53 0.280
O, Op 86.29 88.28 89.35 89.68 91.58
Wong3′ 81.09 87.18 89.26 89.67 91.58
Wong3 81.09 87.18 89.26 89.67 91.58
Zoltar 84.71 87.85 89.30 89.67 91.58
M2 86.07 87.80 88.68 88.95 90.33
Kulczynski2 84.23 86.93 88.89 89.53 91.57
Russell etc 78.45 79.28 79.72 79.87 83.29
Overlap 75.02 78.83 79.70 79.86 83.29
Ochiai 83.13 84.88 85.94 86.29 87.39
Rogot2 80.58 83.51 85.47 86.25 87.88
HMean 80.61 83.51 85.47 86.25 87.88
GMean 80.54 83.52 85.07 85.54 86.68
AMean 80.21 83.58 84.62 84.85 85.64
Ample2 79.84 83.99 86.08 86.70 88.06
Jaccard etc 81.90 82.31 82.18 82.05 82.37
Ochiai2 78.05 80.58 82.09 82.61 83.53
Cohen 77.98 80.19 81.24 81.56 82.24
Tarantula etc 70.31 76.84 80.01 80.96 82.09
Fleiss 75.89 76.52 72.52 67.90 60.70
Scott etc 75.52 75.28 72.66 69.28 62.43
CBI Log 75.32 79.35 81.23 81.81 82.37
CBI Sqrt 72.68 77.70 80.33 81.11 82.13
Rogers etc 73.00 67.40 61.52 59.02 57.03
Ample 39.94 41.99 43.04 43.35 44.03
Table VIII. Influence of buggy code execution frequency with 100 tests for ITE2
S4 exec range 0.00-0.05 0.05-0.10 0.10-0.20 0.20-0.50 0.50-0.90 0.90-1.00
% of multisets 0.063 0.432 3.74 46.51 48.49 0.389
O, Op 99.89 99.75 99.42 97.92 95.54 94.39
Wong3′ 99.86 99.73 99.41 97.91 95.54 94.35
Wong3 99.86 99.73 99.41 97.91 95.54 94.35
Zoltar 99.89 99.75 99.42 97.91 95.53 94.24
M2 99.89 99.75 99.42 97.55 90.92 77.20
Kulczynski2 99.89 99.74 99.37 97.20 90.52 75.18
Russell etc 66.53 77.61 85.87 92.74 95.59 96.94
Overlap 66.52 77.60 85.86 92.73 95.57 96.70
Ochiai 99.89 99.73 99.31 96.31 85.97 65.01
Rogot2 99.89 99.74 99.35 96.40 83.92 52.69
HMean 99.89 99.74 99.35 96.40 83.92 52.69
GMean 99.89 99.73 99.30 96.15 83.77 46.52
AMean 99.89 99.72 99.24 95.90 83.62 41.35
Ample2 99.89 99.75 99.42 96.95 82.43 30.21
Jaccard etc 99.89 99.72 99.21 95.27 81.43 55.96
Ochiai2 99.89 99.73 99.28 95.56 77.98 25.44
Cohen 99.89 99.72 99.19 94.91 77.44 31.91
Tarantula etc 99.86 99.65 98.95 93.26 70.65 18.67
Fleiss 99.89 99.75 99.41 94.31 68.12 16.85
Scott etc 99.89 99.72 99.16 93.38 67.93 19.62
CBI Log 78.13 92.50 96.70 91.94 68.26 18.47
CBI Sqrt 78.13 92.47 96.66 90.77 66.08 17.66
Rogers etc 99.84 99.52 98.37 88.60 60.56 30.54
Ample 49.94 49.88 49.71 48.48 41.22 15.11
numbers of tests is reasonably large. Although Russell etc. and Overlap are next
in overall performance for larger numbers of tests and can perform better than O
when the bug is executed in a large proportion of the tests, they are less robust. For
small numbers of tests, qe values and proportion of failed tests, their performance is
significantly worse. In contrast, although M2 and Kulczynski2 perform somewhat
worse than these metrics overall for large numbers of tests, they are more robust
and our experiments suggest they may be better metrics to use in practice (and for
small qe and proportion of failed tests Kulczynski2 is the better of the two). Several
of the other metrics also perform very well in some circumstances. However, even
the ones with higher profiles such as Tarantula, Jaccard and Ochiai (the best of
these) do not have particularly good overall performance, and when they perform
very well they are still not as good as the best metrics. A few metrics perform
particularly poorly.
The reason why Russell and Rao, Binary, Wong1 and Overlap behave somewhat
differently from the other metrics can be understood by considering the two condi-
tions for optimality. Recall metrics are optimal if they 1) increase as anp increases
when anf = 0 and 2) always have smaller values when anf > 0. When anf = 0
Russell etc. and Overlap are minimal but constant, rather than increasing in anp
(satisfying 2 but not 1). All other non-optimal metrics we consider are increasing
in anp when anf = 0 (this is straightforward to prove from the formulas) but do
not always have smaller values when anf > 0 (satisfying 1 but not 2). Further in-
sight into the relative performance of the metrics can be obtained by constructing
formulas that are equivalent to the metrics for ranking purposes, but are easier to
compare. Specifically, we can compare with formulas that are known to be optimal.
In Table IX we provide formulas that (for ranking) are equivalent to a selection of
the formulas given earlier.
Proposition 7.8. Table IX gives metrics that, for ranking, are equivalent to
the definitions in Tables II and III, assuming a single bug and our score function
in the case of Binary.
Proof. The first optimal metric equivalent formula is Op , given earlier. The
second optimal formula is equivalent to aef /(aep + P · F ). If anf = 0 this equals
F/(aep + P · F ) which increases as aep decreases and is greater than the value of
the metric for anf > 0 since the denominator is between P · F and P · (F + 1). Note
that the term P · F can be replaced by any larger term. The rest are either given
by the previous propositions or are straightforward.
Op is similar to Wong3, but does not have three separate cases with different aep
coefficients: one very small coefficient (its maximum size dependent on the number
of passed tests) is optimal and the other cases just decrease performance in the
model. Note that the smallest aep coefficient in Wong3 is also sub-optimal for more
than 1000 tests. With larger numbers of test cases we see the performance of Wong3
again drops noticeably below that of O and Zoltar. For 5000 tests O and Zoltar
have a percentage score of 99.93 compared with 99.91 for Wong3.
Comparing the Op formula with those for Rogers etc. and Russell etc. we
see that Russell etc. neglects the aep term whereas Rogers etc. gives it too much
influence. For a large number of passed tests, this means the performance of Russell
etc. approaches that of Op (since the optimal coefficient of aep approaches zero).
Conversely, Rogers etc. is closer to Op for very small numbers of passed tests.
For large numbers of tests Rogers etc. is one of the worst metrics because of its
relatively large coefficient for aep . It is particularly bad when the failure rate, qe or
S4 are small (since aep is relatively large compared to aef in these cases).
the definition we have given. If the buggy statement is (almost) always executed
then metrics such as Russell etc. can perform better than O. In Table VIII we
have this result for just a subset of the multisets possible with ITE2, but by
changing the model it can occur with the total score. For such models we believe it
is appropriate to generalise our definition of optimality. Rather than fix a specific
statement as being buggy, we could consider the possibility of each statement being
buggy, and average over all these cases. For ITE2 the two definitions are equivalent
due to the symmetry of the code. With such a revised definition O still appears
optimal (for Russell etc. the performance is excellent when the frequently executed
statement is buggy but poor when the statement is correct).
For models with multiple bugs we see a divergence between the performance of O
and Op . Op and the other good metrics other than O generally perform quite well,
especially when most failures are caused by a single bug. In some cases Ochiai and
even Jaccard perform better. None of the metrics examined is optimal for more
than a very small number of tests. We have constructed optimal metrics for some
cases. It may be that by constructing optimal metrics in other cases some patterns
will emerge, allowing us to find optimal or near optimal metrics in a broad range
of cases.
8. EMPIRICAL RESULTS
We use the Siemens test suite and Space benchmarks to perform empirical evalu-
ation of how well the metrics perform. The Siemens test suite contains multiple
versions of seven programs; Space has multiple versions of a single larger program.
Table X lists the programs, the number of faulty versions of each program, number
of lines of code, number of test cases and brief descriptions. The versions of the
test suites we used were obtained from the Software Information Repository [Do
et al. 2005]. The Siemens test suite is a particularly widely used benchmark for bug
localization [Renieres and Reiss 2003; Wong et al. 2007; Abreu et al. 2006; 2007;
Liu et al. 2005; Pytlik et al. 2003].
Program spectra were extracted using gcov (part of the gcc compiler suite) and
Bash scripts. All experiments were carried out on a Pentium 4 PC running Ubuntu
6.06 and gcc version 4.2.1. There are six Siemens test suite programs that have
runtime errors for some tests and gcov fails to produce any execution counts; we
have obtained data from other researchers for these cases. There were also six (of
an original 38) Space programs for which we failed to obtain data; we have excluded
these programs from the benchmark set. gcov reports execution statistics for each
line of source code, including lines that are empty or contain only preprocessor
directives such as #define, comments, braces, et cetera. For one set of results
(Tables XI and XII) we use the gcc statistics directly, rather than attempting to
filter out lines such as these. For the others we ignore lines that are not executed
in any tests.
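A much-simplified sketch of this extraction step (ours, not the authors' Bash scripts; file names and details are assumptions): after compiling with gcc -fprofile-arcs -ftest-coverage and running one test case, gcov writes a .gcov file whose lines have the form count:lineno:source, with '-' marking non-executable lines and '#####' marking executable lines that were never executed.

    def executed_lines(gcov_file):
        # Return the set of source line numbers executed in this test run.
        executed = set()
        with open(gcov_file) as f:
            for row in f:
                count, _, rest = row.partition(":")
                lineno = rest.partition(":")[0].strip()
                try:
                    hits = int(count.strip().rstrip("*"))
                except ValueError:
                    continue          # '-' (non-executable) or '#####' (not executed)
                if hits > 0 and lineno.isdigit():
                    executed.add(int(lineno))
        return executed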
We conducted experiments using the same metrics as previously reported in our
ITE2 model. As well as figures for the entire Siemens test suite and Space, we
report on various subsets. As in other papers, we report results for the (versions
of the) individual different programs. We also consider the subset of programs
that fail for at least one test case. Without failed test cases all of the metrics
perform very poorly and this just adds noise to the results. Though potentially
useful for debugging tools that use static analysis, these programs and test sets are
not appropriate for dynamic diagnosis (there are no symptoms to diagnose). There
are 130 Siemens test suite programs and 28 Space programs that fail at least one
test case.
We also report on a subset of the suite that allows a better comparison with
our model, and suits the main focus of this paper. We use the set of programs
that fail a test case and have a single bug according to the following definition: a
bug is a statement that, when executed, has unintended behaviour. This definition
is motivated by programming language semantics and seems particularly apt for
spectra-based diagnosis. Not all the Siemens test suite programs have a single bug
according to this definition (though they are single bug programs according to other
definitions). For example, some programs have a (single) #define that is wrong.
The #define is never executed; it is only when a statement that uses the macro
is executed that unintended behaviour occurs. There are typically several such
statements. There are 122 “single bug” Siemens test suite programs and 15 such
Space programs.
We provide experimental results using two established measures of diagnostic
performance for the different ranking methods. The first is the percentage of the
program that needs to be examined for a bug to be found. That is, the percentage
of the code ranked higher than any bug. We call this the rank percentage as it
corresponds to the rank of the buggy statement. We average the percentages over
all programs in the test suite considered. Unlike the other performance measures we
use, larger numbers mean worse performance. The second is whether diagnosis is
successful (a bug can be found) by examining some fixed percentage of the program.
We report the percentage of successful diagnoses over the test suite. For example,
suppose we choose to examine at most 10% of the statements of each of the 122
single-bug Siemens test suite programs and for some metric the top ranked 10% of
statements include the bug for 70 of the programs. We would report 70/122×100 =
57.38.
In the case where the buggy statement is tied in the ranking (the value of the
metric for the buggy statement is the same as that for some other statements), we
assume the bug can be found anywhere in that range, with uniform probability,
and reflect this in our performance measures. For example, suppose the top 10% of
statements are all ranked equally and include the bug. The rank percentage would
be computed as 5%, as in [Abreu et al. 2007]. Successful diagnosis by examining
Table XI. Average Rank Percentages for programs of Siemens Test Suite
Tcas Sch Sch2 PrT PrT2 TInf Repl
Op 10.27 3.94 23.23 3.42 0.82 3.13 4.73
O 10.27 7.72 23.23 3.42 0.82 3.13 4.73
Zoltar 10.27 3.94 20.80 3.42 0.82 3.15 3.97
Kulczynski2 10.31 3.94 20.83 3.42 1.12 3.29 4.32
Wong3′ 10.48 3.94 17.35 3.42 0.86 3.13 3.99
M2 10.64 1.64 23.19 4.33 2.04 4.61 4.54
Ochiai 11.01 1.69 23.58 6.23 3.73 5.53 4.95
Jaccard etc 11.13 1.75 26.05 8.38 5.57 6.55 6.29
Amean 11.18 5.22 23.91 7.32 5.12 9.38 5.29
Tarantula etc 11.15 1.83 26.05 8.94 5.73 7.10 6.49
Hmean 11.27 5.03 25.34 7.75 4.95 9.24 5.06
Gmean 11.28 5.03 25.14 7.85 4.95 9.32 5.09
Ample2 11.27 5.06 28.42 7.73 4.97 9.39 5.76
Cohen 11.15 5.30 26.05 8.79 5.73 10.10 6.44
CBI Log 11.83 7.29 26.05 8.89 5.73 10.28 6.29
CBI Sqrt 11.84 7.35 26.05 8.89 5.73 10.29 6.49
Rogot2 11.27 10.92 27.77 7.75 4.95 14.39 5.84
Russell etc 14.60 11.34 18.53 7.17 10.66 6.67 9.13
Binary 14.60 15.12 18.53 7.17 10.66 6.67 9.13
Overlap 14.60 11.34 18.53 7.17 10.66 6.70 9.14
Ochiai2 11.69 7.29 28.45 8.23 5.50 14.13 6.02
Ample 13.19 12.70 27.48 8.33 5.50 15.08 6.95
Wong3 18.17 17.82 29.14 3.42 0.86 9.30 11.34
Scott etc 73.81 71.25 86.14 44.89 38.62 74.97 52.57
Fleiss 73.93 71.25 86.14 44.84 38.19 75.01 54.88
Rogers etc 78.93 71.44 86.14 65.02 52.61 75.05 71.74
10% of the program is assured (we give a value of 1). However, if only 8% of the
program is examined successful diagnosis is likely but not assured. We give a value
of 0.8 to indicate the probability. We report the sum of these probabilities times 100
divided by the number of programs in the test suite. Alternative ways of measuring
performance with ties are to report the worst case [Abreu et al. 2006] (performance
of a metric can be increased according to this measure by adding a small random
number) or both the best and worst cases [Wong et al. 2007] (which requires twice
as many figures).
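One way to realise both measures, with ties handled as described above (a sketch, not the authors' scripts; scores holds the metric value of every statement and bug_index identifies the buggy one):

    def rank_percentage(scores, bug_index):
        # expected percentage of statements ranked above the bug, ties uniform
        bug = scores[bug_index]
        above = sum(s > bug for s in scores)
        tied = sum(s == bug for s in scores)       # includes the bug itself
        return 100.0 * (above + (tied - 1) / 2.0) / len(scores)

    def diagnosis_probability(scores, bug_index, percent_examined):
        # probability of finding the bug when the top percent_examined% is inspected
        budget = percent_examined / 100.0 * len(scores)
        bug = scores[bug_index]
        above = sum(s > bug for s in scores)
        tied = sum(s == bug for s in scores)
        return max(0.0, min(1.0, (budget - above) / tied))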
Table XII. Average Rank Percentages for Siemens Test Suite and Space
Benchmark Siemens Space Combined
Subset All Fail 1Bug All Fail 1Bug All Fail 1Bug
Op 7.15 6.11 5.90 9.77 0.70 0.50 7.66 5.15 5.31
O 7.41 6.37 5.90 9.77 0.70 0.50 7.87 5.37 5.31
Zoltar 6.79 6.12 5.91 6.90 0.69 0.55 6.81 5.16 5.32
Kulczynski2 6.94 6.27 6.06 6.94 0.79 0.63 6.94 5.30 5.47
Wong3′ 6.60 6.45 6.14 9.77 0.70 0.50 7.22 5.43 5.52
M2 7.46 6.80 6.81 6.91 0.76 0.58 7.35 5.73 6.13
Ochiai 8.10 7.45 7.47 6.97 0.82 0.67 7.88 6.28 6.73
Jaccard etc 9.08 8.45 8.68 7.24 1.13 1.06 8.72 7.15 7.85
Amean 9.33 8.70 8.88 7.21 1.09 0.93 8.92 7.35 8.01
Hmean 9.38 8.76 8.96 7.13 1.01 0.84 8.94 7.39 8.07
Gmean 9.40 8.77 8.97 7.13 1.00 0.83 8.96 7.39 8.08
Tarantula etc 9.28 8.65 8.89 7.81 1.78 1.88 8.99 7.43 8.12
Ample2 9.82 8.81 9.02 10.06 1.02 0.82 9.87 7.43 8.12
Cohen 10.02 9.40 9.69 7.40 1.31 1.26 9.51 7.97 8.77
CBI Log 10.36 9.75 10.11 3.83 1.73 1.82 9.09 8.33 9.20
CBI Sqrt 10.42 9.81 10.12 7.75 1.71 1.83 9.90 8.37 9.21
Rogot2 11.06 10.07 10.37 10.04 1.01 0.84 10.86 8.46 9.33
Ochiai2 11.06 10.46 10.74 7.70 1.66 1.64 10.40 8.90 9.74
Russell etc 11.27 10.68 10.51 11.39 5.88 5.35 11.29 9.83 9.95
Binary 11.53 10.94 10.51 11.39 5.88 5.35 11.50 10.04 9.95
Overlap 11.28 10.68 10.52 11.46 5.96 5.46 11.32 9.84 9.97
Wong3 13.68 12.74 12.36 9.77 0.70 0.50 12.92 10.61 11.06
Ample 12.22 12.01 12.22 3.55 2.62 2.28 10.53 10.35 11.13
Scott etc 65.42 65.28 65.67 21.04 13.58 18.37 56.76 56.12 60.49
Fleiss 65.99 65.85 66.27 21.03 13.57 18.47 57.22 56.59 61.04
Rogers etc 73.82 73.80 74.15 45.45 41.47 43.07 68.28 68.07 70.75
gcov reports, leading to different results. Our model program could be modified
to include statements that are never executed, or debugging systems could a priori
exclude such statements from consideration, as we have done in Table XIII (making
Wong3 and Wong3′ equivalent).
The relative performance of the different metrics on the single bug subset of the
combined test suite in Table XIII fits reasonably well with the overall results of
our model, with the optimal metrics performing best (they perform best for all
single bug subsets and all performance measures we present). The most significant
difference is that Russell etc, Binary and Overlap perform more poorly than the
overall performance in our model. However, the model predicts their relatively poor
performance for low qe and proportion of failed tests. In the Siemens test suite, the
average qe value is approximately 0.12 and the average proportion of failed tests
is 0.035. The top six ranked metrics (treating O and Op as the same metric and
similarly Wong3 and Wong3′) in Table XIII are identical (modulo equal ranking)
to those ranked according to the first column of Tables V and VI.
Performance of Russell etc, Binary and Overlap is significantly affected by state-
ments that are executed in all test cases, such as initialisation code. These are
always ranked equal-highest. Adding such a statement in our model program ap-
proximately halves their model performance. The simple score function we use is
also kind to these metrics, especially Binary. If Binary doesn’t rank the bug equal-
top it ranks it equal-bottom; other good metrics that fail to rank it equal-top will
often rank it strictly above most correct statements but the score will be the same.
Several other metrics (Fleiss, Scott etc, Rogot2 and Wong3) are also slightly lower
in the ranking than in our model. For Fleiss and Scott etc. this is to be expected
since they perform poorly for low qe and proportion of failed tests. Kulczynski2 is
slightly higher (it performs relatively well for low qe and proportion of failed tests)
and Ample performs rather better than predicted by the model. If the bug occurs
in an if-then-else statement the Ample metric typically gives both the then and else
statements the same reasonably high rank. Since this is still only a small fraction
of the code (as opposed to half the code in our model) the performance overall
is reasonable. Experiments using different models indicate that, as the number of
statements increases, the difference in performance between Ample and the better
metrics decreases.
Another interesting observation is that there are 30 Siemens test suite programs where
Russell etc. perform better than O, despite relatively poor performance overall. In
all these cases, the buggy statement is executed in at least 82% of total test cases.
This is consistent with our model (see Table VIII) — when S4 is executed in 0.50–
0.90 of tests, Russell outperforms O and is the best of all metrics considered.
When multiple-bug programs are included, we see Op still performs the best, but
the performance of O declines more, as expected. Results for all programs in the
test suites are influenced by the programs with no failed test cases. This can be
seen particularly for Schedule (column two of Table XI), for which four out of nine
versions pass all tests, and Space. The relative performance of metrics in these
cases is quite different from the overall pattern.
8.2 Successful diagnosis percentages
Table XIV shows the percentage of single bug programs (in the combined bench-
mark set) that are successfully diagnosed as the percentage of the code examined
increases; lines of code that are not executed are ignored. Again, the results fit
well with our model for the smaller percentiles. The performance measure used
with our model only considers the top-ranked statements — higher percentiles are
ignored. Using average rank percentages arguably gives too much weight to high
percentiles (where Wong3 performs worse than Kulczynski2, for example). There
is little practical difference between a percentage ranking of 33% and 99% for the
bug, since in both cases the programmer is almost sure to abandon this diagnosis
method before the bug is found. A weighted average of rank percentages (giving
more weight to lower percentages) would make a better practical measure of per-
formance. We have experimented with such scoring functions with our model. The
optimal metrics perform best in all cases examined. However, proving optimality
for these more complex scoring functions is much more difficult than for the simple
scoring function.
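One way such a weighted average could be realised, purely as a sketch (the exponential weight and its decay rate are illustrative choices, not ones used in our experiments), is to weight each program's rank percentage so that low percentages dominate and the gap between 33% and 99% contributes almost nothing:

```python
import math

def weighted_rank_percentage(rank_percentages, decay=10.0):
    """Weighted average of per-program rank percentages, weighting each program
    by exp(-decay * p / 100) so that bugs ranked near the top dominate the
    measure. Lower is better, as with the plain average."""
    weights = [math.exp(-decay * p / 100.0) for p in rank_percentages]
    return sum(w * p for w, p in zip(weights, rank_percentages)) / sum(weights)
```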
Table XV gives the corresponding results for the complete Siemens test suite,
which allows a fairer comparison with diagnosis based on other techniques (albeit
with some noise from programs with no failed test cases). The last line of the table
gives figures for Cause Transition (CT) from [Cleve and Zeller 2005]. The previous
three lines of the table are figures published in [Jiang and Su 2005] for their own
system, SOBER [Liu et al. 2005], and CBI [Liblit et al. 2005]. Precise comparison
is not possible because the way percentiles are computed is different: [Jiang and Su
2005] uses nodes in the program dependence graph (basic blocks), rather than lines
of code that are executed. However, we can conclude that the system of [Jiang and
Su 2005] performs better than the best spectral-based diagnosis for the very small
percentiles: it ranks significantly more bugs in the top 1%. This is unsurprising
as the analysis is much more sophisticated. However, from the eighth percentile
upwards, the best spectral methods appear to perform best.
Surprisingly, the predicate-based CBI system seems to perform worse than spectral-
based diagnosis using the Tarantula metric. As we have shown, the Tarantula metric
is equivalent to our simplified version of the CBI metric. It may be that the actual
CBI metric performs even more poorly than our simplified version, or the CBI sys-
tem makes a poor choice of predicates to test, or the way the figures are calculated
makes this comparison misleading (for example, basic blocks versus statements or
the treatment of ties in the ranking).
9. RELATED WORK
In this section, we briefly review related work on software diagnosis. Program spec-
tra were first introduced in the context of fixing year 2000 problems in software
[Reps et al. 1997]. Various classes of program spectra were introduced: branch hit
spectra, complete path spectra and execution trace spectra. In our study we con-
centrate on statement-level execution traces. Statement and basic block execution
spectra have received most attention to date.
Tarantula [Jones et al. 2002; Jones and Harrold 2005] is normally credited as the
first spectra-based debugger. It ranks statements in C programs according to what
we refer to as the Tarantula metric. One of the novel features of Tarantula is the
way the ranking is displayed. It is a graphical debugger which displays the program
and uses colour to encode the ranking. A spectrum of colours from red through
yellow to green is used, with red corresponding to the most highly ranked statements. Brightness is also
used to indicate confidence in the ranking, using another metric (this has received
much less attention as it is not designed to identify the statements that are most
likely to be buggy). The Tarantula metric does not appear to work particularly
well, but the Tarantula system could easily be modified to use another metric, as
long as the values are scaled appropriately. For example, the rank percentage of
each statement could be computed then mapped to a colour in a straightforward
way.
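As an illustration of that last point, a minimal sketch (not Tarantula's actual colour scheme) that maps a rank percentage onto a red-yellow-green scale, with red for the most suspicious statements:

```python
def rank_percentage_to_rgb(rank_percent):
    """Map a rank percentage (0 = most suspicious) to an (R, G, B) triple:
    0% -> red, 50% -> yellow, 100% -> green."""
    t = max(0.0, min(100.0, rank_percent)) / 100.0
    if t <= 0.5:
        return (255, int(510 * t), 0)        # red to yellow
    return (int(510 * (1.0 - t)), 255, 0)    # yellow to green
```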
Pinpoint [Chen et al. 2002] is a framework for root cause analysis in the J2EE
platform, implemented for internet services such as web-mail. It uses data from
web client request traces to diagnose exceptions generated due to failures in sys-
tem components. The Jaccard metric is used for ranking, but others could be
substituted.
Ample (Analyzing Method Patterns to Locate Errors) [Dallmeier et al. 2005] is
an Eclipse plug-in for identifying buggy classes in Java software. It uses informa-
tion on method call sequences from multiple passing runs and a single failing run.
Ultimately a “weight” is computed for each class, which indicates the likelihood of
it being buggy. In Abreu et al. [2006] the ranking method used in Ample is gener-
alised to multiple failing runs (what we refer to as the Ample metric) and applied
to blocks of statements in an imperative language; we use it for single statements.
The Ochiai metric was first used for software diagnosis in a comparison with
Tarantula, Jaccard and (the generalised) Ample metrics [Abreu et al. 2006; 2007].
For the Siemens test suite, Ochiai was found to perform best (using rank percent-
ages for basic blocks) and was also quite consistent. Jaccard, Tarantula then Ample
followed in performance overall, though Ample performed well for some of the pro-
grams. Performance for different qe values was also investigated, by adjusting the
sets of test cases used for the different programs. In this paper we have reported
similar results using the ITE28 model, Siemens test suite and Space (ranking state-
ments and using two different performance measures). We have also shown there
are several metrics that perform significantly better than Ochiai on single bug pro-
grams (and many that perform worse). We have explicitly focused on single bug
programs but the results of Abreu et al. [2006] and Abreu et al. [2007] (for example)
also have a strong bias towards single bug programs due to the choice of test suite.
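For reference, the sketch below scores a single statement under several of the metrics discussed, using the formulas commonly cited for them (assumed to match the definitions given earlier in this paper); the zero-division guards are added only for the sketch.

```python
import math

def metric_scores(aef, aep, anf, anp):
    """Suspiciousness of one statement under several ranking metrics.
    aef/aep: failed/passed tests that execute the statement;
    anf/anp: failed/passed tests that do not."""
    fail_rate = aef / (aef + anf) if aef + anf else 0.0
    pass_rate = aep / (aep + anp) if aep + anp else 0.0
    return {
        "Jaccard":   aef / (aef + anf + aep) if aef + anf + aep else 0.0,
        "Ochiai":    aef / math.sqrt((aef + anf) * (aef + aep))
                     if (aef + anf) * (aef + aep) else 0.0,
        "Tarantula": fail_rate / (fail_rate + pass_rate)
                     if fail_rate + pass_rate else 0.0,
        "Ample":     abs(fail_rate - pass_rate),
        "Op":        aef - aep / (aep + anp + 1),
    }
```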
We are not aware of any work that gives a good comparison of spectral ranking
very similar inputs, only one of which exhibits the bug symptom (this can take some
time). A method is needed to determine if the bug symptom appears in a run. If
the symptom is a program crash this is simple and in the Siemens test suite there
is a correct version of each program so the correct output also can be ascertained.
In general, human intervention is required. Failure cause-effect chains can then be
used to provide a relatively concise explanation of the bug symptom. These are
related to the program state (the values of the variables) at various program points.
Methods that can simplify the test data that causes a bug symptom are likely to
be useful for any diagnosis system.
Jiang and Su [2005] further explored cause-effect chains from isolated bug-related
predicates using CBI [Liblit et al. 2005]. Their work is quite similar to delta
debugging [Zeller 2001]; they try to locate not only failures but also cause-transition
moments. The key difference from delta debugging is that they focus on instrumenting
program predicates using CBI. CBI [Liblit et al. 2005] is used to
instrument a program and a list of suspicious predicates is generated. Next, machine
learning approaches such as feature clustering and classification algorithms (support
vector machine and random forest) are used to assign scores to these predicates.
The control flow graph of the program is also used. Paths that connect different
suspicious predicates are considered to adjust the ranking.
Although execution paths are an important consideration for the way we use our
ITE28 model, they are not used in the spectral ranking methods we have examined.
The aij values used in ranking metrics contain no path information. The matrix
has some additional information. For example, for any set of test executions the
matrix contains the information that statement S1 in ITE28 is executed in a test
if and only if S2 is not executed (and for a large number of tests we might conclude
this is not simply a coincidence). The results of Jiang and Su [2005] (reproduced in
Table XV) suggest that using the control flow graph is an effective way of improving
the higher part of the ranking. The best performing spectral ranking methods seem
to find more bugs when a larger percentage of the code is examined. This may be
because the machine learning algorithms used in Jiang and Su [2005] do not take
advantage of the fact that nearly all Siemens test suite programs have a single bug.
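The if-and-only-if relationship mentioned above can be checked directly from the binary execution matrix; a small sketch, assuming matrix[t][s] is 1 when statement s is executed in test t (names are illustrative):

```python
def mutually_exclusive(matrix, s1, s2):
    """True if every test executes exactly one of s1 and s2, as holds for the
    then and else branches of an if-then-else such as S1 and S2 in ITE28."""
    return all(row[s1] + row[s2] == 1 for row in matrix)
```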
10. FUTURE WORK
The most obvious direction is to consider (classes of) models with more than one
bug and find metrics that are optimal for these classes (possibly using different
scoring function or using the whole matrix and result vector). Another possibility
is to discover some way of finding a subset (or subsets) of the tests for which the
failures are likely to be caused by a single bug. The single bug case gives us extra
leverage, but this must be traded off against having a smaller number of tests and
the possibility that there is not just a single bug. Though motivated differently, the
work of Jones et al. [2007] on “parallel debugging” is promising. They
use the Tarantula metric for ranking and present two techniques to cluster failed test
cases. One uses similarity of the rankings produced, and an almost 30% reduction
in total debugging effort (ignoring parallelism) was observed in experiments with
versions of Space with eight bugs. Such experiments could be repeated using Op
rather than Tarantula. We would expect to see improved results. Using Op, more
aggressive clustering (producing smaller sets of tests on average) is likely to pay off,
since Op is more specialised to the single bug case, and this can help minimise the
time taken to find all sources of failure when debugging in parallel.
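A deliberately naive sketch of this idea (not Jones et al.'s algorithm, one variant of which clusters on similarity of the full rankings): rank statements with Op once per failed test, pairing that test with all passing tests, then greedily group failed tests whose top-k suspect sets overlap. All names and the choice of k are illustrative.

```python
def op_ranking(failed_row, passing_rows):
    """Rank statement indices by Op using one failed test plus all passing tests."""
    total_passed = len(passing_rows)
    scores = []
    for s in range(len(failed_row)):
        aef = failed_row[s]                        # 0 or 1: a single failed test
        aep = sum(row[s] for row in passing_rows)
        anp = total_passed - aep
        scores.append(aef - aep / (aep + anp + 1))
    return sorted(range(len(failed_row)), key=lambda s: -scores[s])

def cluster_failed_tests(failed_rows, passing_rows, k=10):
    """Greedily group failed tests whose top-k suspect sets share a statement."""
    top_sets = [set(op_ranking(row, passing_rows)[:k]) for row in failed_rows]
    clusters = []
    for i, suspects in enumerate(top_sets):
        for cluster in clusters:
            if suspects & cluster["suspects"]:
                cluster["tests"].append(i)
                cluster["suspects"] |= suspects
                break
        else:
            clusters.append({"tests": [i], "suspects": set(suspects)})
    return clusters
```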
Finally, it may be possible to apply our methodology to other domains, such
as medicine. Each test run of the model program could correspond to a different
person, the execution of each statement corresponds to information about the
person, such as having a particular gene, and the result vector indicates the people
that have some disease. The choice of model program encodes domain knowledge.
For example, more commonly occurring genes would be in more execution paths.
Having constructed an appropriate model, different ranking metrics could be evalu-
ated as we have done, and perhaps optimal metrics could be devised for that model.
The best metrics could then be used on actual medical data.
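Under that mapping, ranking genes is the same computation as ranking statements; a hypothetical sketch (rank_genes, gene_matrix and has_disease are illustrative names) that scores genes with the Op metric, treating the disease as a test failure:

```python
def rank_genes(gene_matrix, has_disease):
    """gene_matrix[p][g] is 1 if person p carries gene g; has_disease[p] is True
    if person p has the disease. Returns gene indices, most suspicious first,
    scored with Op (disease plays the role of test failure)."""
    healthy = sum(1 for sick in has_disease if not sick)
    scores = []
    for g in range(len(gene_matrix[0])):
        aef = sum(row[g] for row, sick in zip(gene_matrix, has_disease) if sick)
        aep = sum(row[g] for row, sick in zip(gene_matrix, has_disease) if not sick)
        anp = healthy - aep
        scores.append(aef - aep / (aep + anp + 1))
    return sorted(range(len(gene_matrix[0])), key=lambda g: -scores[g])
```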
11. CONCLUSION
Software diagnosis using program spectra is an instance of the classification or clus-
tering problem. The literature contains many metrics that can be used for ranking
statements according to how likely they are to be buggy, and there are infinitely
many more that could be used. Faced with such a choice it is unclear how metrics
are best compared, which proposed metrics are best for a particular situation and
whether there are other metrics, not yet proposed, that are substantially better.
Our novel solution is to use a model which captures important features of the prob-
lem, and a method of evaluating the performance of ranking metrics for such a
model. This allows analytical approaches to be used. Optimality of a metric can
be defined, and classes of metrics can be discovered and proved optimal. Metrics
can be compared by seeing how they deviate from optimality.
We have restricted attention to single bug programs in this paper, allowing us
to experiment with the methodology in a relatively simple domain. The model for
software diagnosis and evaluation measure we have used in this paper are extremely
simple. There are just four statements and only the top-ranked statements are
considered for performance. We have defined optimal ranking metrics that are
much simpler than many of the metrics proposed. Furthermore they seem quite
robust, performing better than all other metrics considered in a wide range of
conditions, and allow a better understanding of the relative performance of other
metrics with the model. There are two other metrics that come close to optimality.
Both were proposed for the software diagnosis domain. We have also evaluated the
performance of all metrics using the Siemens test suite and Space benchmarks. A
small number of metrics are particularly sensitive to such things as statements that
are never executed or always executed, which our model does not cover. However,
overall the results from the model fit very well with results for single bug programs
in the Siemens test suite and Space, with the optimal metrics performing best. We
consider this to be a firm validation of our approach.
REFERENCES
Abreu, R., Zoeteweij, P., and van Gemund, A. 2006. An Evaluation of Similarity Coefficients
for Software Fault Localization. Proceedings of the 12th Pacific Rim International Symposium
on Dependable Computing, 39–46.
Abreu, R., Zoeteweij, P., and van Gemund, A. 2007. On the Accuracy of Spectrum-based Fault
Localization. Testing: Academic and Industrial Conference Practice and Research Techniques-
Mutation, 2007. TAICPART-Mutation 2007, 89–98.
Anderberg, M. 1973. Cluster analysis for applications. New York.
Chen, M., Kiciman, E., Fratkin, E., Fox, A., and Brewer, E. 2002. Pinpoint: problem de-
termination in large, dynamic Internet services. Dependable Systems and Networks, 2002.
Proceedings. International Conference on, 595–604.
Cleve, H. and Zeller, A. 2005. Locating causes of program failures. Proceedings of the 27th
international conference on Software engineering, 342–351.
Cohen, J. 1960. A Coefficient of Agreement for Nominal Scales. Educational and Psychological
Measurement 20, 1, 37.
Dallmeier, V., Lindig, C., and Zeller, A. 2005. Lightweight bug localization with AMPLE.
Proceedings of the sixth international symposium on Automated analysis-driven debug-
ging, 99–104.
Dice, L. 1945. Measures of the Amount of Ecologic Association Between Species. Ecology 26, 3,
297–302.
Do, H., Elbaum, S., and Rothermel, G. 2005. Supporting Controlled Experimentation with
Testing Techniques: An Infrastructure and its Potential Impact. Empirical Software Engineer-
ing 10, 4, 405–435.
Duarte, J., Santos, J., and Melo, L. 1999. Comparison of similarity coefficients based on
RAPD markers in the common bean. Genetics and Molecular Biology 22, 427–432.
Dunham, M. 2002. Data Mining: Introductory and Advanced Topics. Prentice Hall, PTR Upper
Saddle River, NJ, USA.
Everitt, B. 1978. Graphical techniques for multivariate data. North-Holland, New York.
Fleiss, J. 1965. Estimating the accuracy of dichotomous judgments. Psychometrika 30, 4, 469–
479.
Gonzalez, A. 2007. Automatic Error Detection Techniques based on Dynamic Invariants. M.S.
thesis, Delft University of Technology, The Netherlands.
Goodman, L. and Kruskal, W. 1954. Measures of association for cross classifications. Journal
of the American Statistical Association 49, 268, 732–764.
Hamming, R. 1950. Error detecting and error correcting codes. Bell System Technical Jour-
nal 29, 2, 147–160.
Jaccard, P. 1901. Étude comparative de la distribution florale dans une portion des Alpes et
des Jura. Bull. Soc. Vaudoise Sci. Nat 37, 547–579.
Jiang, L. and Su, Z. 2005. Automatic isolation of cause-effect chains with machine learning.
Technical Report CSE-2005-32, University of California, Davis.
Jones, J. and Harrold, M. 2005. Empirical evaluation of the tarantula automatic fault-
localization technique. Proceedings of the 20th IEEE/ACM international Conference on Au-
tomated software engineering, 273–282.
Jones, J., Harrold, M., and Bowring, J. 2007. Debugging in Parallel. In Proceedings of the
2007 international symposium on Software testing and analysis. ACM Press, New York, NY,
USA, 16–26.
Jones, J., Harrold, M., and Stasko, J. 2002. Visualization of test information to assist fault
localization. Proceedings of the 24th international conference on Software engineering, 467–
477.
Krause, E. 1973. Taxicab Geometry. Mathematics Teacher 66, 8, 695–706.
Lee, C. 1958. Some properties of nonbinary error-correcting codes. IEEE Transactions on
Information Theory 4, 2, 77–82.
Lee, H., Naish, L., and Ramamohanarao, K. 2008. Evaluation of metrics based on program
spectra for software fault localization. Submitted for publication.
Liblit, B. 2004. Cooperative Bug Isolation. Ph.D. thesis, University of California.
Liblit, B., Naik, M., Zheng, A., Aiken, A., and Jordan, M. 2005. Scalable statistical bug
isolation. Proceedings of the 2005 ACM SIGPLAN conference on Programming language design
and implementation, 15–26.
Liu, C., Yan, X., Fei, L., Han, J., and Midkiff, S. P. 2005. Sober: statistical model-based bug
localization. SIGSOFT Softw. Eng. Notes 30, 5, 286–295.
Lourenço, F., Lobo, V., and Bação, F. 2004. Binary-based similarity measures for categorical
data and their application in Self-Organizing Maps.
Maxwell, A. and Pilliner, A. 1968. Deriving coefficients of reliability and agreement for ratings.
Br J Math Stat Psychol 21, 1, 105–16.
Meyer, A., Garcia, A., Souza, A., and Souza Jr, C. 2004. Comparison of similarity coefficients
used for cluster analysis with dominant markers in maize (Zea mays L). Genetics and Molecular
Biology 27, 83–91.
Naish, L. 2008. Probabilistic declarative debugging. Journal of Functional and Logic Program-
ming 2008, 1 (July).
Ochiai, A. 1957. Zoogeographic studies on the soleoid fishes found in Japan and its neighbouring
regions. Bull. Jpn. Soc. Sci. Fish 22, 526–530.
Pytlik, B., Renieris, M., Krishnamurthi, S., and Reiss, S. 2003. Automated Fault Localization
Using Potential Invariants. arXiv preprint cs.SE/0310040.
Renieris, M. and Reiss, S. 2003. Fault localization with nearest neighbor queries. Automated
Software Engineering, 2003. Proceedings. 18th IEEE International Conference on, 30–39.
Reps, T., Ball, T., Das, M., and Larus, J. 1997. The use of program profiling for software
maintenance with applications to the year 2000 problem. Proceedings of the 6th European
conference held jointly with the 5th ACM SIGSOFT international symposium on Foundations
of software engineering, 432–449.
Rogers, D. and Tanimoto, T. 1960. A Computer Program for Classifying Plants. Sci-
ence 132, 3434, 1115–1118.
Rogot, E. and Goldberg, I. 1966. A proposed index for measuring agreement in test-retest
studies. J Chronic Dis 19, 9, 991–1006.
Russel, P. and Rao, T. 1940. On habitat and association of species of anopheline larvae in
south-eastern Madras. J. Malar. Inst. India 3, 153–178.
Scott, W. 1955. Reliability of Content Analysis: The Case of Nominal Scale Coding. Public
Opinion Quarterly 19, 3, 321–325.
Telcordia Technologies, Inc. 1998. Telcordia software visualization and analysis toolsuite (χSuds).
Users manual, Chapter 12.
Wong, W., Qi, Y., Zhao, L., and Cai, K. 2007. Effective Fault Localization using Code Coverage.
Proceedings of the 31st Annual International Computer Software and Applications Conference
(COMPSAC 2007), Vol. 1, 449–456.
Zeller, A. 2001. Automated debugging: are we close? Computer 34, 11, 26–31.
Zeller, A. 2006. Why Programs Fail: A Guide to Systematic Debugging. Morgan Kaufmann.