Quality Measures and Assurance For AI Software
John Rushby
This work was performed for the National Aeronautics and Space Administration under contract NAS1 17067 (Task 5).
Abstract
This report is concerned with the application of software quality and assur-
ance techniques to AI software. It describes work performed for the National
Aeronautics and Space Administration under contract NAS1 17067 (Task
5) and is also available as a NASA Contractor Report.
The report is divided into three parts. In Part I we review existing software
quality assurance measures and techniques–those that have been developed
for, and applied to, conventional software. This part, which provides a fairly
comprehensive overview of software reliability and metrics, static and dy-
namic testing, and formal specification and verification, may be of interest
to those unconcerned with AI software. In Part II, we consider the char-
acteristics of AI-based software, the applicability and potential utility of
measures and techniques identified in the first part, and we review those
few methods that have been developed specifically for AI-based software. In
Part III of this report, we present our assessment and recommendations for
the further exploration of this important area. An extensive bibliography
with 194 entries is provided.
Contents
1 Introduction  1
  1.1 Motivation  1
  1.2 Acknowledgments  2
3 Software Reliability  8
  3.1 The Basic Execution Time Reliability Model  10
  3.2 Discussion of Software Reliability  17
5 Testing  29
  5.1 Dynamic Testing  29
    5.1.1 Random Test Selection  30
    5.1.2 Regression Testing  32
    5.1.3 Thorough Testing  32
      5.1.3.1 Structural Testing  34
      5.1.3.2 Functional Testing  36
6 Characteristics of AI Software  65
8 Testing of AI Systems  85
  8.1 Dynamic Testing  85
    8.1.1 Influence of Conflict-Resolution Strategies  87
    8.1.2 Sensitivity Analysis  87
    8.1.3 Statistical Analysis and Measures  88
    8.1.4 Regression Testing and Automated Testing Support  89
  8.2 Static Testing  89
    8.2.1 Anomaly Detection  90
    8.2.2 Mathematical Verification  97
    8.2.3 Structured Walk-Throughs  102
    8.2.4 Comprehension Aids  103
10 Conclusions  109
  10.1 Recommendations for Research  115
Bibliography  118
Chapter 1
Introduction
This report is concerned with the application of software quality and eval-
uation measures to AI software and, more broadly, with the question of
quality assurance for AI software. By AI software we mean software that
uses techniques from the field of Artificial Intelligence. (Genesereth and
Nilsson [73] give an excellent modern introduction to such techniques; Har-
mon and King [84] provide a more elementary overview.) We consider not
only metrics that attempt to measure some aspect of software quality, but
also methodologies and techniques (such as systematic testing) that attempt
to improve some dimension of quality, without necessarily quantifying the
extent of the improvement. The bulk of the report is divided into three
parts. In Part I we review existing software quality measures—those that
have been developed for, and applied to, conventional software. In Part II,
we consider the characteristics of AI software, the applicability and potential
utility of measures and techniques identified in the first part, and we review
those few methods that have been developed specifically for AI software. In
Part III of this report, we present our assessment and recommendations for
the further exploration of this important area.
1.1 Motivation
It is now widely recognized that the cost of software vastly exceeds that of
the hardware it runs on—software accounts for 80% of the total computer
systems budget of the Department of Defense, for example. Furthermore,
as much as 60% of the software budget may be spent on maintenance. Not
only does software cost a huge amount to develop and maintain, but vast
and increasing responsibilities now depend upon its correct operation.
Unless compelling evidence can be adduced that such software can be “trusted”
to perform its function, then it will not—and should not—be used in many
circumstances where it would otherwise bring great benefit. In the follow-
ing sections of this report, we consider measures and techniques that may
provide the compelling evidence required.
1.2 Acknowledgments
Alan Whitehurst of the Computer Science Laboratory (CSL) and Leonard
Wesley of the Artificial Intelligence Center (AIC) at SRI contributed ma-
terial to this report. It is also a pleasure to acknowledge several useful
discussions with Tom Garvey of AIC, and the careful reading and criticism
of drafts provided by Oscar Firschein of AIC and Teresa Lunt of CSL. The
guidance provided by our technical monitors, Kathy Abbott and Wendell
Ricks of NASA Langley Research Center, was extremely valuable.
Part I
Chapter 2
Software Engineering and Software Quality Assurance
Development comprising detailed design, coding and unit testing, and the
establishment of operating procedures.
There have been many refinements to this basic model: Royce’s Water-
fall model [161], for example, recognized the existence of feedback between
The more phases of the life-cycle that separate the commission and de-
tection of an error, the more expensive it is to correct. It is usually cheap
and simple to correct a coding bug caught during unit test, and it is usu-
ally equally simple and cheap to insert a missed requirement that is caught
during system requirements review. But it will be ruinously expensive to
correct such a missed requirement if it is not detected until the system has
been coded and is undergoing integration test. Software Quality Assurance
comprises a collection of techniques and guidelines that endeavor to ensure
that all errors are caught, and caught early.
Chapter 3
Software Reliability
3. The probability that a failure will occur during the half-open interval
(t, t + δt] is λ(t)·δt + o(δt), where λ(t) is the failure intensity of the
process.
4. The probability that more than one failure will occur during the half-
open interval (t, t + δt] is o(δt).
If we let Pm(t) denote the probability that M (t) is equal to m, that is:
3. The hazard rate is the same for all faults, namely za(t).
(Note that this implies that the per-fault hazard rate is constant, that is,
za(t) = φ.) Substituting (3.3) into (3.2), we obtain
λ(t) = ω0 φ e^(−φt).
The assumption that the fault responsible for each failure is identified
and removed immediately the failure occurs is clearly unrealistic. The model
can be modified to accommodate imperfect debugging by introducing a fault
reduction factor B. This attempts to account for faults that cannot be
located, faults found by code-inspection prior to causing failure, and faults
introduced during debugging, by assuming that, on average, each failure
leads to the removal of B faults (0 ≤ B ≤ 1). (Conversely, each fault will
lead to 1/B failures.) It can then be shown that
µ(t) = ν0 Ga (t)
In addition we may ask what the failure intensity will be after 10 CPU-hours
and how many failures will have occurred by that time. For failure intensity
we have:
λ(t) = λ0 e^(−(λ0/ν0)t),
λ(10) = 10 e^(−(10/100)×10)
      = 10 e^(−1)
      = 3.68 failures per CPU-hour,
and
µ(10) = 100 (1 − e^(−(10/100)×10))
      = 100 (1 − e^(−1))
      = 100 (1 − 0.368)
      = 63 failures.
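To make the model concrete, the following short program (a sketch in Python, not part of the original analysis) evaluates the two quantities just computed, the failure intensity λ(t) = λ0 e^(−(λ0/ν0)t) and the expected number of failures µ(t) = ν0 (1 − e^(−(λ0/ν0)t)), for the example values λ0 = 10 failures per CPU-hour and ν0 = 100.

import math

# A minimal sketch (not from the report) of the basic execution time model's two
# functions as used in the example above, with lambda0 = 10 failures per CPU-hour
# and nu0 = 100 total expected failures.

def failure_intensity(t, lam0, nu0):
    """lambda(t) = lambda0 * exp(-(lambda0/nu0) * t)."""
    return lam0 * math.exp(-(lam0 / nu0) * t)

def expected_failures(t, lam0, nu0):
    """mu(t) = nu0 * (1 - exp(-(lambda0/nu0) * t))."""
    return nu0 * (1.0 - math.exp(-(lam0 / nu0) * t))

lam0, nu0 = 10.0, 100.0
print(round(failure_intensity(10, lam0, nu0), 2))   # 3.68 failures per CPU-hour
print(round(expected_failures(10, lam0, nu0), 1))   # 63.2 failures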
of the program itself, and the circumstances of its development. The total
number of failures can be predicted as the number of faults inherent in the
program, divided by the fault-reduction factor: ν0 = ω0 /B. Dozens of tech-
niques have been proposed for estimating the number of faults in a program
based on static characteristics of the program itself. These are generally
related to some notion of “complexity” of programs and are described in
detail in the following chapter. For the purposes of exposition, we will use
the simplest (yet one of the best) such measures: the length of the program.
There is quite good evidence that faults are linearly related to the length of
the source program. If we let Is denote the length of the program (measured
by lines of source code), and D the fault density (in faults per source line),
then we have
ω0 = Is × D    (3.12)
and
ν0 = (Is × D)/B.    (3.13)
Results from a large number of experiments to determine values of D
are summarized in Figure 3.1 (taken from [134, page 118]). Experimental
determinations of the fault-reduction factor range from 0.925 to 0.993, with
an average of 0.955 (again, from [134, page 121]). Thus, an a priori estimate
for the number of failures to be expected from a 20,000 line program entering
the system test phase (taking D = 6.01 faults per thousand source lines from
Figure 3.1, so that Is is measured in thousands of lines) is given by
ν0 = (Is × D)/B
   = (20 × 6.01)/0.955
   = 126 failures.
The initial failure intensity λ0 depends upon the number of faults in the
program, the rate at which faults are encountered during execution, and the
ratio between fault encounters and failures (not all fault encounters—i.e.,
execution of faulty pieces of code—will lead to failure: particular circum-
stances may be needed to “trigger” the bug). For the purposes of estimation,
we may assume that fault encounters are linearly related to the number of
instructions executed (i.e., to the duration of execution and to processor
speed), and that failures are linearly related to fault encounters. Since pro-
cessor speed is generally given in object instructions per second, we will also
assume that the number of object instructions Io in a compiled program
is linearly related to the number of source lines Is . Thus, if Q is the code
expansion ratio (i.e., Io /Is), and Ro is the number of object instructions
executed in unit time, then the number of source lines executed in unit time
will be given by Rs = Ro /Q. Now each source line executed exposes ω0 /Is
faults. Thus ω0 × Rs /Is faults will be exposed in unit time. If each fault
exposure leads to K failures, we see that the initial failure intensity is given
by λ0 = K × ω0 × Rs /Is . Substituting the previously derived expressions
for ω0 and Rs , we obtain:
λ0 = D × K × Ro/Q.    (3.14)
In order to complete the estimation of λ0 , we need values for three new
parameters: Ro , Q, and K. The first of these is provided by the computer
manufacturer, while the second is a function of the programming language
and compiler used. A table of approximate expansion ratios is given by
Jones [102, page 49]. The most difficult value to estimate is the fault expo-
sure ratio K. Experimental determinations of K for several large systems
yield values ranging from 1.41 × 10^−7 to 10.6 × 10^−7—a range of 7.5 to 1,
with an average of 4.2 × 10^−7 [134, page 122].
As an example of the application of these estimates, consider a 20,000
line program entering the system test phase. Using D = 6.01 and B = 0.955
as before, and assuming a code expansion ratio Q of 4, a fault exposure ratio
K of 4.2 × 10^−7 failures per fault, and a 3 MIPS processor, we obtain:
λ0 = D × K × Ro/Q
   = (6.01/10^3) × (4.2/10^7) × (3 × 10^6)/4
   = 1.893 × 10^−3 failures per CPU-second,
or about 6.8 failures per CPU-hour.
The additional execution time Δt required to reduce the failure intensity
from its present value λP to a target value λF is then
Δt = (ν0/λ0) ln(λP/λF)
   = (126/6.8) ln(6.8/0.1)
   = 18.53 ln 68
   = 18.53 × 4.22
   = 78 CPU-hours.
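The estimates worked through above can be collected into a short sketch (Python; not part of the original report, and the parameter values are simply those quoted from [134] and Jones [102]): ν0 = Is × D/B, λ0 = D × K × Ro/Q, and the additional execution time (ν0/λ0) ln(λP/λF) needed to bring the failure intensity down to a target value.

import math

# A sketch of the a priori estimates worked through above; the numeric values are the
# ones used in the text (D = 6.01 faults/KLOC, B = 0.955, K = 4.2e-7, Q = 4, 3 MIPS).

def estimate_nu0(kloc, faults_per_kloc, B):
    """Total expected failures: nu0 = Is * D / B."""
    return kloc * faults_per_kloc / B

def estimate_lambda0(D_per_line, K, Ro, Q):
    """Initial failure intensity: lambda0 = D * K * Ro / Q (per CPU-second)."""
    return D_per_line * K * Ro / Q

def extra_execution_time(nu0, lam0, lam_present, lam_target):
    """Additional execution time to move from lam_present to lam_target."""
    return (nu0 / lam0) * math.log(lam_present / lam_target)

nu0 = estimate_nu0(20, 6.01, 0.955)                       # ~126 failures
lam0 = estimate_lambda0(6.01e-3, 4.2e-7, 3e6, 4) * 3600   # ~6.8 failures per CPU-hour
print(round(nu0), round(lam0, 1))
print(round(extra_execution_time(nu0, lam0, lam0, 0.1)))  # ~78 CPU-hours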
Chapter 4
Size, Complexity, and Effort Metrics
In the previous chapter, we have seen that a priori estimates for the relia-
bility of a program depend on estimates for the number of faults it contains
and for its initial failure intensity. It is plausible to suppose that these
parameters may themselves be determined by the “size” and “complexity”
of the program. Accordingly, considerable effort has been expended in the
attempt to define and quantify these notions.
Whereas measurements of the static properties of completed programs
(e.g., size and complexity) may help in predicting some aspects of their
behavior in execution (e.g., reliability), similar measurements of “require-
ments” or “designs” may help to predict the “effort” or cost required to
develop the finished program.
In this chapter we will examine metrics that purport to measure the size
and complexity of programs, and those that attempt to predict the cost of
developing a given piece of software.
slightly more elaborate measure of the same type. Such metrics only measure
the total number of different variables that are used in the program; they do
not indicate how many are “live” (i.e., must be actively considered by the
programmer) in any one place. A simple definition states that a variable is
“live” in all statements lexically contained between its first appearance and
its last. It is then easy to compute the number of variables that are live at
each statement, and hence LV —the average number of variables that are
live at each statement.
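As an illustration, the following sketch (an assumed representation, not taken from the metrics literature) computes LV from a list of statements, each represented simply by the set of variable names it mentions.

# A rough sketch of the "live variables" measure described above. A variable is taken
# to be live in every statement between its first and last appearance, and LV is the
# average number of live variables per statement.

def live_variable_metric(statements):
    first, last = {}, {}
    for i, refs in enumerate(statements):
        for v in refs:
            first.setdefault(v, i)
            last[v] = i
    live_per_stmt = [sum(1 for v in first if first[v] <= i <= last[v])
                     for i in range(len(statements))]
    return sum(live_per_stmt) / len(statements) if statements else 0.0

# Hypothetical three-statement fragment referencing x, then x and y, then y:
print(round(live_variable_metric([{"x"}, {"x", "y"}, {"y"}]), 2))   # 1.33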
Another approach to quantifying the complexity of data usage is to
measure the extent of inter-module references. The work of Henry and Ka-
fura [89, 88] is representative of this type. The fan-in of a module may be de-
fined as the number of modules that pass data to the module, either directly
or indirectly; similarly the fan-out may be defined as the number of mod-
ules to which the present module passes data, either directly or indirectly.
The complexity of the interconnections to the module is then defined as
(fan-in × fan-out)^2. Henry and Kafura then relate the overall complexity
of a module within a program to both its length and the complexity of its
interconnections by the definition: complexity = SLOC × (fan-in × fan-out)^2.
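A sketch of this computation is given below (the input format is assumed; Henry and Kafura derive fan-in and fan-out from a full inter-module data-flow analysis, whereas here they are supplied directly).

# A sketch of the information-flow complexity just defined: SLOC x (fan-in x fan-out)^2
# per module, with fan-in and fan-out given directly rather than computed.

def information_flow_complexity(modules):
    """modules: dict of name -> (sloc, fan_in, fan_out)."""
    return {name: sloc * (fan_in * fan_out) ** 2
            for name, (sloc, fan_in, fan_out) in modules.items()}

example = {
    "parser":  (400, 3, 2),   # hypothetical figures
    "planner": (900, 5, 4),
}
print(information_flow_complexity(example))   # parser: 14400, planner: 360000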
program should be related to both its length and its difficulty, and to define
the “effort” E required to implement a program by E = D × V (= V 2 /V ∗ ).
Halstead claimed that E represented “elementary mental discriminations”
and, based on the suggestion of a psychologist, J. Stroud, that the human
mind is capable of making only between 5 and 20 elementary mental dis-
criminations per second, he then proposed the equation T = E/18, where
T is the time required to construct a program in seconds.
The practical application of these last few formulae depends on a method
for computing or estimating the potential volume, V*, for a program. Hal-
stead proposed that potential volume could be defined by
V* = (2 + η2*) log2(2 + η2*),
where η2* is the number of input/output operands, and estimated by
V* = (2η2/(η1 N2)) × V.
Using this estimation, we obtain
D = (η1 N2)/(2η2)   and   E = V × (η1 N2)/(2η2).
In contrast to the scientific pretensions of Software Science, a widely
practised method for predicting the time and effort required to construct a
software project is simply to ask the opinions of those experienced in similar
projects. A refinement of this approach is the “Delphi” technique in which
several people prepare independent estimates and are then told how their
estimates compare with those of the others. (In some variants, they discuss
their estimates with each other). Next, they are allowed to modify their
estimates and the process is repeated until the estimates stabilize. Often
the estimates converge to a very narrow range, from which a consensus value
may be extracted.
A composite cost estimation technique (which combines the use of expert
judgment with statistical data fitting) is the COCOMO (COnstructive COst
MOdel) of Boehm [27, 26]. There are three “levels” of the COCOMO model;
the most successful seems to be the (modified) Intermediate Level. There
are also three “modes” to the COCOMO model—for simplicity we will pick
just one (the “semidetached” mode). This variant of the COCOMO predicts
development effort E in man-months by the equation
E = 3.0 × S^1.12 × ∏(i=1 to 16) mi,
where S is the size of the software (in thousands of delivered source instructions)
and the mi are multipliers determined by rating sixteen "cost drivers."
Rating Multiplier
Very Low 0.75
Low 0.88
Nominal 1.00
High 1.15
Very High 1.40
Figure 4.1: Multipliers for the Reliability Cost Driver in the COCOMO
Model
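A sketch of the semidetached-mode calculation is given below (illustrative only: all cost drivers are left at their Nominal rating except required reliability, which is rated High per Figure 4.1).

from math import prod

# A rough sketch of the semidetached-mode effort estimate described above,
# E = 3.0 * S**1.12 * product(m_i), with S in thousands of delivered source instructions.

def cocomo_semidetached(kdsi, multipliers):
    return 3.0 * kdsi ** 1.12 * prod(multipliers)

multipliers = [1.00] * 15 + [1.15]                  # 16 cost drivers, reliability rated High
print(round(cocomo_semidetached(20, multipliers)))  # ~99 man-months for a 20 KDSI project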
Se = Sn + kSo ,
lected complexity measures, and thereby cast serious doubt on the wisdom
of using complexity metrics in a prescriptive setting.
Criticism such as that levied by Kearney and his colleagues against much
of the work in complexity metrics is acknowledged by the more thoughtful
practitioners. Shen, for example, introducing a magazine column dedicated
to metrics [172], writes:
The issue addressed in this report is software quality; size and complexity
metrics are of interest only in so far as they contribute to the estimation or
improvement of software quality. As far as estimation is concerned, metrics
research does not yet seem able to provide accurate predictions for the prop-
erties of economic interest (number of faults, cost to maintain or modify)
from the measurable properties of the program itself. The most accurate and
best validated predictive techniques seem to be the crudest—for example 20
faults per KLOC for a program entering unit test [134, page 121]. Even these
techniques are likely to need considerable calibration and re-validation when
used in new environments.
The application of software metrics to improving (as opposed to pre-
dicting) the quality of software is even more questionable. Given their lack
of substantiation, the use of such metrics prescriptively seems highly ill-
advised [106]. Their least controversial application might be as components
of “program comprehension aids.” By these, we mean systems that as-
sist a programmer to better comprehend his program and to master its
complexities—especially those due to scale. Examples of such aids range
from simple prettyprinters and cross-reference generators, to more seman-
tically oriented browsers. Descriptive complexity metrics might assist the
programmer in identifying parts of the program worthy of restructuring or
simplification in much the same way that profiling tools help to identify the
places where optimization can be applied most fruitfully.
Effort metrics are rather less controversial and easier to validate than
complexity metrics—their purpose, if not their effectiveness, is clear: to
enable the cost and duration of a programming task to be estimated in ad-
vance. However, the efficacy of existing effort metrics seems dubious at best.
The COCOMO model, for example, seems to perform very poorly when ap-
plied to projects other than those used in its own validation. Kemerer [108],
for example, reported an average error of over 600%, and also found that
the Basic COCOMO level outperformed the more complex Intermediate
and Detailed levels. Conte et al. [48] suggest that the COCOMO model
is too complicated, and involves too many parameters that require estima-
tion. Kemerer’s data [108] suggests that a much simpler “function count”
method, similar to the Software Science metric, actually performs better
than COCOMO.
Chapter 5
Testing
C = 1/(|I| · pmin).

thorough(T) =def successful(T) ⊃ successful(D).
Given these definitions, Goodenough and Gerhart were able to state the
Fundamental Theorem of Testing:
In words, this says that if a successful test set satisfies a reliable and valid
criterion, then the program is correct. Another way of stating this result is
that a test is thorough if it satisfies a reliable and valid test criterion.
This result is called “Fundamental,” not because it is profound (indeed,
its proof is a trivial consequence of the definitions, and is omitted for that
reason), but because it captures the basic idea underlying all attempts to
create thorough test strategies, namely the search for test criteria that are
both reliable and valid, yet economical in the sense that they can be satisfied
by relatively small test sets.
Unfortunately, there are several problems with the practical application
of these definitions [192]. First of all, the concepts of reliability and validity
are not independent. A test selection criterion is valid if, given that the
program is faulty, at least one test set satisfying the criterion is unsuccessful.
Therefore, if a test selection criterion is invalid, all test sets that satisfy it
must be successful. Hence they will all give the same result, and so the
criterion is reliable. Thus, all test selection criteria are either valid or reliable
(or both) [192]. (Note also that if F is correct, then all criteria are both
reliable and valid for F .)
Next, the concepts of validity and reliability are relative to a single, given
program. A criterion that is valid and reliable for program F may not be so
for the slightly different program F′. Furthermore, since F′ may result from
if x = 1 then y := z endif,
The all-uses criterion is a comprehensive one but, like the all-paths crite-
rion, may require an excessive, or infinite, number of test cases. The all-defs
criterion seems an attractive compromise from this point of view. System-
atic test selection criteria should surely draw on both data and control flow
perspectives—so a very reasonable criterion would seem to be one that en-
compassed both all-defs and all-edges. Rapps and Weyuker [154] showed
that these are independent criteria—neither implies the other. This tends
to confirm the belief that both are important, but complicates test selection
since two very different criteria are involved. It would be nice if a single cri-
terion could be found that would include both all-defs and all-edges. Rapps
and Weyuker [154] showed that all-p-uses/some-c-uses has this property
and recommended its use. Ntafos [142] provides a comprehensive compari-
son of the relative coverage of these and other structural testing strategies.
Clarke [45] provides a more formal examination.
I=3
TERM=TERM*X**2/(I*(I-1))
SUM=SUM+(-1)**(I/2)*TERM
We can compute the final values of variables I, TERM and SUM, given the
initial assignments TERM=SUM=X to be I=3, TERM=X**3/6, and SUM=X-X**3/6.
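The following fragment (a rough illustration using the sympy package, not a tool discussed in this report) performs the same symbolic execution mechanically, leaving X symbolic and reproducing the final values quoted above.

import sympy as sp

# Symbolic execution of the loop-body statements above, with X left symbolic.
X = sp.symbols('X')
TERM = X                                  # initial assignments TERM = SUM = X
SUM = X
I = 3
TERM = TERM * X**2 / (I * (I - 1))        # -> X**3/6
SUM = SUM + (-1)**(I // 2) * TERM         # integer division gives (-1)**1, so -> X - X**3/6
print(sp.simplify(TERM), sp.simplify(SUM))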
the input that will exercise a particular path: symbolic execution is generally
employed in order to derive the “path predicates” associated with the path,
then solutions must be found to the systems of numerical inequalities that
they induce. Consequently, most test case generators are severely limited
in the class of programs and/or in the class of test criteria that they sup-
port. Ince [97] provides a modern survey of the field and also suggests that
systematic use of randomly generated test data could provide very good
coverage at low cost. The idea, which is elaborated in a short note [98],
starts from the observation that relatively small sets of random test data
seem to provide quite good coverage [61]. For example, with ten programs
ranging in size from 12 to 102 branches, random test data achieved an aver-
age branch coverage of 92.80% [98]. The average size of the test sets needed
to achieve this coverage was 12.8, with the largest being 112. Since the test
sets were randomly generated, they were far from optimal, and contained
subsets that provided the same coverage as the whole set. Using zero-one
integer linear programming, the minimum such subsets were found for each
test set. These were found to average only 3.1 in size, with the test set of size
112 being reduced to only 10. Thus, the suggestion is to generate random
test data until adequate coverage is achieved relative to the chosen structural
testing criterion (or until further tests result in little increased coverage).
Coverage is measured by instrumenting the program under test. The ran-
domly generated test set is then reduced to minimum size using zero-one
integer linear programming, and the test results obtained using this subset
are examined.
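A sketch of the minimization step is given below. The note cited above uses zero-one integer linear programming to find a true minimum subset; the sketch substitutes a simple greedy set-cover heuristic, which finds a small (though not necessarily minimum) subset achieving the same coverage.

from itertools import chain

# Greedy reduction of a randomly generated test set to a small subset with the same
# branch coverage; coverage data is assumed to come from instrumenting the program.

def minimize_test_set(coverage):
    """coverage: dict of test_id -> set of branches covered by that test."""
    needed = set(chain.from_iterable(coverage.values()))
    chosen, remaining = [], dict(coverage)
    while needed:
        best = max(remaining, key=lambda t: len(remaining[t] & needed))
        chosen.append(best)
        needed -= remaining.pop(best)
    return chosen

tests = {"t1": {"b1", "b2"}, "t2": {"b2", "b3"},   # hypothetical coverage data
         "t3": {"b1", "b3", "b4"}, "t4": {"b4"}}
print(minimize_test_set(tests))                    # e.g. ['t3', 't1'] covers all branches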
if x = 0 then y := 1 endif
there is no data flow from x to y, but the value of x certainly influences the
subsequent value of y, and so there is considered to be a flow of information
from x to y. Information flow analysis is used routinely in computer security
verification [57, 176]; its application to more general analysis and anomaly
detection is promising [20].
Automated tools have been developed to perform detection of anoma-
lies of the types described above for programs written in a variety of lan-
guages [144, 194].
major testing effort performed in the Clean Room is random testing for the
purposes of statistical quality control. Software specifications in the Clean
Room methodology include not only a description of the functions to be
supported by the software, but a probability distribution on scenarios for
its use. The Clean Room methodology then prescribes a testing procedure
and a method for computing a certified statistical quality measure for the
delivered software.
The mathematical verification technique used in the Clean Room method-
ology is called “functional verification” and is different from that employed in
“classical” mathematical verification. The “rigorous” techniques of Jones [101],
and of the Oxford school [86] are based on more conventional forms of math-
ematical verification than the Clean Room methodology, and are also more
formal, but share much of their motivation with the Clean Room approach.
An empirical evaluation of the Clean Room methodology has recently been
reported by Selby et al [170].
Beyond the techniques described above come the totally formal meth-
ods. These generally employ the assistance of an automated specification
and verification environment: the sheer quantity of detail entailed by com-
plete formality is likely to render verification less, rather than more, reliable
unless mechanical assistance is employed. Formal specification and veri-
fication environments generally provide a formal specification language, a
programming language, a means of reducing the problem of establishing
consistency between a program and its specification to a putative theorem
in some formal system, and a mechanical theorem prover for seeking, or
checking, proofs for such theorems. Three or four systems of this type have
been developed [110].
All verification methods are vulnerable to two classes of error. First
is the possibility of a flaw in the verification itself—the demonstration of
consistency between the program and its specification may be flawed, and
the two descriptions may, in fact, be inconsistent. The second class of error
arises when the verification is sound, but irrelevant, because the specification
does not reflect the actual user requirements (i.e., the system may satisfy
the verification process, but fail validation)
It is in order to avoid the first class of error that truly formal, and me-
chanically assisted, formal verification is generally recommended. There are,
however, drawbacks to this approach. Firstly, in order to be amenable to
mechanization, rather restrictive specification and programming languages,
built on a very elementary formal system (usually a variation on first-order
predicate calculus), must usually be employed. Because of their lack of con-
venient expressiveness, such languages and logics may make it difficult for
the user to say what he really intends—so that instead of concentrating on
what he wants to say, he has to spend all his effort on how to say it—with
the consequent danger that he may fail to say it correctly, and so fall into
the second kind of error. Similarly, the user may have to spend consider-
able ingenuity, not in proving his theorems directly, but in persuading a
mechanical theorem prover to prove them for him. This problem is com-
pounded by the fact that most verification systems do not allow the user to
reason about programs directly (as he would do if performing the proof by
hand), but reduce the question of consistency to a (generally large) num-
ber of (generally lengthy) formulas called “verification conditions” that rob
the user of much of his intuition concerning the program’s behavior. This
latter difficulty should not affect the reliability of the process (provided the
theorem prover is sound), but will adversely affect its economics. SRI’s
EHDM system attempts to overcome these difficulties by providing a very
expressive specification language and powerful underlying logic (based on
multi-sorted higher order logic) together with the ability to reason about
programs directly (using the Hoare, or relational, calculus) [165].
More radical techniques that have much in common with executable as-
sertions include “provably safe” programming [10], and the “recovery block”
approach to software fault tolerance—in which an “acceptance test” (effec-
tively an executable assertion) governs the invocation of alternative program
components to replace those that have failed [9]. An experiment by Ander-
son [8] showed promise for recovery blocks (70% of software failures were
eliminated, and MTBF was increased by 135%), but found that acceptance
tests are hard to write. In another study [39], 24 students were hired to
add self-checks to programs containing a total of 60 known faults (there were
8 different programs, and each was given to three students). Only 9 of the 24
self-checking programs detected any faults at all; those that did find faults
found only 6 of the 60 known faults, but they also discovered 6 previously
unknown faults (in programs which had already been subjected to one mil-
lion test-cases). Sadly, 22 new faults were introduced into the programs in
the process of adding the self-checks.
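The recovery block structure itself is simple; the following sketch (illustrative only, and omitting the state restoration that a full recovery block performs) shows an acceptance test governing the choice between a primary routine and an independently written alternate.

# A minimal sketch of the recovery-block pattern mentioned above: an acceptance test,
# effectively an executable assertion, decides whether the primary routine's result is
# accepted or an alternate is tried. (State restoration before retrying is omitted.)

def recovery_block(inputs, routines, acceptance_test):
    for routine in routines:
        try:
            result = routine(inputs)
        except Exception:
            continue                      # a crash counts as a failed alternative
        if acceptance_test(inputs, result):
            return result
    raise RuntimeError("all alternatives failed the acceptance test")

# Hypothetical use: sorting, with a cheap acceptance test.
primary = lambda xs: sorted(xs)                          # stand-in for the primary routine
alternate = lambda xs: sorted(xs, reverse=True)[::-1]    # stand-in for an alternate
accept = lambda xs, ys: len(ys) == len(xs) and all(a <= b for a, b in zip(ys, ys[1:]))
print(recovery_block([3, 1, 2], [primary, alternate], accept))   # [1, 2, 3]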
Among the methodologies that aim to satisfy these criteria TRW’s SREM [5,
6, 17, 52] is representative, and is described in the following section.
5.3.1.1 SREM
SREM (Software Requirements Engineering Methodology) was the prod-
uct of a program undertaken by TRW Defense and Space Systems Group
as part of a larger program sponsored by the Ballistic Missile Defense Ad-
vanced Technology Center (BMDATC) in order to improve the techniques
for developing correct, reliable BMD software. Early descriptions of SREM
include [5, 17]; descriptions of subsequent extensions for distributed systems
can be found in [6], and accounts of experience using SREM are given in [28,
38, 167].
Owing to its genesis in the problems of software for ballistic missile
defense, SREM adopts a system paradigm derived from real-time control
systems. Such systems are considered as “stimulus-response” networks:
an “input message” is placed on an “input interface” and the results of
processing—the “output message” and the contents of memory—are ex-
tracted from an “output interface” [5]. Furthermore, the requirements for
the system are understood in terms of the processing steps necessary to un-
dertake the required task. The essence of a SREM requirements definition
is therefore a dataflow-like description (called an R-net) of the processing
steps to be performed and the flow of data (messages) between them.
SREM recognizes that requirements engineering is concerned with more
than just writing a description of what is required—it is first necessary to
3. Stickel's Prolog Technology Theorem Prover (PTTP) [179], which provides correct
first-order semantics—but with Prolog-like efficiency when restricted to Horn clauses—
overcomes some of these disadvantages.
and a mere 2% of the bugs accounted for 1000 times more failures than the
60% of bugs that were encountered least often.4
Interpretation of data such as these requires a context that establishes
the purpose of testing—is it to find bugs or to improve reliability? Gelperin
and Hetzel [72] discuss this issue and identify a historical evolution of con-
cerns, starting with debugging in the 1950s. They propose that the concern
for 1988 and beyond should be the use of testing as a fault prevention mech-
anism, based on the detection of faults at the earliest possible stage in the
life-cycle.
Of the systematic testing strategies, Howden’s data [94] ([92] is similar,
but based on a smaller sample of programs) provides evidence that functional
testing finds about twice as many bugs as structural testing; furthermore,
several of the bugs found by structural testing would be found more easily
or more reliably by other methods, so that the ratio three to one probably
more accurately reflects the superiority of functional over structural testing.
Howden’s data also shows that static testing (particularly automated
anomaly detection) found more bugs than dynamic testing—however, the
two methods were complementary, dynamic testing tending to find the bugs
that static testing missed, and vice versa. Myers [136] presented similar
data, showing that code walk-throughs were about as effective as dynamic
testing at locating errors in PL/1 programs.
The efficacy of anomaly detection is likely to increase with the degree
of redundancy present in the programming language—indeed, the presence
of redundancy may often allow errors to be detected at compile-time that
would otherwise only be located at run-time. (Compare Lisp with Ada,
for example: almost any string with balanced parentheses is a syntactically
valid Lisp program and any deficiencies will only be discovered in dynamic
testing, whereas writing a syntactically correct Ada program is quite de-
manding and many errors will be caught at the compilation stage.) There
have been proposals to increase the degree of redundancy in programs in or-
der to improve early error-detection. One of the most common suggestions
is to allow physical units to be attached to variables and constants [105,
12]. Expressions may then be subject to dimensional analysis in order to
prevent a variable representing length being added to one representing time,
or assigned to one representing velocity.5 Other proposals would render it
4. The thirty-fold improvement of random over structural testing is simply estimated by
the calculation 2 × 1000/60.
5. Strong typing, present in any modern programming language, provides some protec-
tion of this sort—preventing booleans being added to integers, for example—but the use
of physical units and dimensional analysis represents a capability beyond the normal typ-
ing rules. The data abstraction facilities of a modern language such as C++ or Ada can
provide this capability, however [47, 90].
6. N-version programming is an adaptation for software of the modular redundancy tech-
niques used to provide reliable and fault tolerant hardware. N (typically N = 3) separate
modules perform the computation and their results are submitted to a majority voter.
Since all software faults are design faults, the redundant software modules must be sepa-
rately designed and programmed. N-Version programming is advocated by Avizienis [44]
(see also [65, 66] for a slightly different perspective), but has been criticized by Leveson
and others (see the discussion in [9]).
on the value of this approach [55, 137], arguing that use of truly formal
specification and verification tends to overload the user with intricate detail,
and that the soundness of the whole process rests on the correctness of the
underlying verification environment. Since this will typically be a very large
and complex program in its own right, located at the limits of the state
of the art, its own correctness should be regarded with more than usual
skepticism. The thrust of this argument is that formal verification moves
responsibility away from the “social process” that involves human scrutiny,
towards a mechanical process with little human participation. We believe
this concern unfounded and based on a mistaken view of how a mechanical
verification environment is used. De Millo, Lipton and Perlis [55] claim that:
“The scenario envisaged by the proponents of verification goes
something like this: the programmer inserts his 300-line in-
put/output package into the verifier. Several hours later, he
returns. There is his 20,000-line verification and the message
‘VERIFIED’.”
This is a parody of the scenario actually envisaged by the proponents of
verification. In a paper published several years earlier [188], von Henke and
Luckham indicated the true nature of this scenario when they wrote:
“The goal of practical usefulness does not imply that the verifi-
cation of a program must be made independent of creative effort
on the part of the programmer . . . such a requirement is utterly
unrealistic.”
In reality, a verification system assists the human user to develop a convinc-
ing argument for his program by acting as an implacably skeptical colleague
who demands that all assumptions be stated and all claims justified. The re-
quirement to explicate and formalize what would otherwise be unexamined
assumptions is especially valuable. Speaking from substantial experience,
Shankar [171] observes:
“The utility of proof-checkers is in clarifying proofs rather than in
validating assertions. The commonly held view of proof-checkers
is that they do more of the latter than the former. In fact, very
little of the time spent with a proof-checker is actually spent
proving theorems. Much of it goes into finding counterexam-
ples, correcting mistakes, and refining arguments, definitions, or
statements of theorems. A useful automatic proof-checker plays
the role of a devil’s advocate for this purpose.”
Our own experience supports this view. We have recently undertaken the
formal verification of an important and well-known algorithm for clock-
synchronization and have discovered that the published journal proof [116]
contains major flaws [164, 163]. It was the relentless skepticism of our for-
mal verification environment that led us to this discovery, but our belief in
the correctness of our current proof owes as much to the increased under-
standing of the problem that we obtained through arguing with the theorem
prover as it does to the fact that the theorem prover now accepts our proofs.
Another element in De Millo, Lipton and Perlis’ parody that is far from
the truth is their implicit assumption that formal verification is something
that is done after the program has been developed. In reality, formal ver-
ification is practised as a component of a development methodology: the
verification and the program (or specification) are developed together, each
generating new problems, solutions, and insights that contribute to the fur-
ther development of the other. Formal verification can be made easier if
the property being verified is achieved by direct and simple means. Thus, in
addition to helping the user build a convincing case for belief in his program,
formal verification encourages the programmer to build a more believable,
and often better, program in the first place.
Overall, the conclusion to be drawn from experimental and other data
seems to be that all testing methods have their merits, and these tend to
be complementary to each other. For the purpose of enhancing reliability,
random testing is a clear winner. For the purpose of finding bugs, anomaly
detection and walk-throughs should be used in combination with functional
and structural testing (Howden [96] describes such an integrated approach).
Techniques for evaluating requirements and other specifications early in the
life-cycle deserve special attention: rapid prototyping and some of the tech-
niques of formal verification may be especially useful here. For really critical
requirements, formal verification of those properties should be considered;
conventional testing can be used to ensure that the less-critical requirements
are satisfied.
Part II
Application of Software
Quality Measures to AI
Software
Chapter 6
Characteristics of AI
Software
In this part of the report we will consider the application of the quality
assurance techniques and metrics identified in Part I to AI software, and we
will examine those few techniques that have been developed specifically for
such software.
We face an initial difficulty in that the notion of AI software is fuzzy—
indeed practioners of AI do not even agree among themselves on what con-
stitutes AI. There are two notions of AI current today; Parnas [149] dubs
them AI-1 and AI-2:
AI-1 is a problem-oriented notion that takes AI to be the use of computers
to solve problems that could previously be solved only by applying
human intelligence.
AI-2 is a technique-oriented notion that identifies AI with the use of charac-
teristic programming strategies, in particular those based on heuristics,
and explicit representation of “knowledge.”
These notions are not mutually exclusive—indeed, most AI software has
elements of both AI-1 and AI-2 and each facet contributes to the difficulty
of SQA for AI software. The problems addressed by AI software (AI-1) are
generally somewhat ill-defined, and a clear statement of requirements for
the task the software is to perform is often lacking. This means that the
notions of success and failure are vague, and evaluation is correspondingly
difficult. In addition, the heuristic techniques employed in AI software (AI-
2) tend to render it fragile, or unstable: very similar inputs may produce
C = (F − 32) × 5/9.
All the techniques for software reliability estimation and for dynamic testing
(and, for that matter, mathematical verification) that were described in Part
I depend on the availability of requirements and specification documents—
at least to the extent that it is possible to determine whether a program has
experienced a failure.
The problem with requirements and specifications for AI software is that
generally there aren’t any—so failures in fielded AI systems may go unno-
ticed because those using them have no idea what the “correct” behavior
should be. Almost any output may appear reasonable at the time it is pro-
duced and only be discovered as erroneous much later on (when, for example,
an autopsy is performed, or an engine is stripped down). The problems of
dynamic testing for AI software are similar: it may not be clear whether or
not the outcome of a particular test is satisfactory.
Thus, before we can consider the application of software reliability and
dynamic testing to AI software, we must consider the problems of obtain-
ing requirements and specifications for such software, and of evaluating the
system against these requirements and specifications.
may have much in common with system safety specifications. It is the estab-
lishment of precise and comprehensive minimum competency requirements,
and the demonstration that they have been achieved, that may be the de-
termining factor in the widespread acceptance of AI software in anything
other than non-critical applications.
Whereas desired competency requirements may be hard to state explic-
itly, there is some hope that minimum competency requirements may some-
times be capable of precise definition. This is particularly likely to be so
in the case of AI systems whose purpose is to optimize some function—for
although an optimal value may be hard to find, and its optimality just as
hard to check, it may be quite feasible to check that it is at least a valid
solution.
For example, one of the expert systems developed at SRI, by the Infor-
mation Sciences and Technology Center, is the Automated Air Load Plan-
ning System—AALPS [7]. It is used to produce schedules and layouts for
loading army divisions and their equipment (tanks, helicopters, etc.) onto
air transports. One of the things to be considered is the unloading of the
aircraft, especially if this is to be performed in flight using parachute drops.
As the heavy equipment is first moved around inside the aircraft prior to
being dropped, and then actually dropped, the aerodynamic stability of the
aircraft must remain undisturbed. Specifying the desired competency of
AALPS is obviously difficult—we want a near optimum loading plan, but
an optimum loading plan is hard to define, and it is even harder to determine
whether a given plan is optimal. But the minimum competency requirement
can be given quite a sharp statement: the aerodynamic trim of the aircraft
must remain within certain limits at all times during all stages of unloading
in flight. Satisfaction of this requirement can easily be tested (e.g., by an-
other computer program that examines the output of AALPS and computes
the location of the center of gravity during all stages of unloading).
ment, however, may not admit of a precise requirement statement, and the
only feasible evaluation may be against the performance of human experts.
When we are dealing with a problem for which no deep knowledge exists—
for example, medical diagnosis, there is almost certainly no alternative to
evaluation against human experts (though see Section 7.2.3 on page 78);
some experiences and recommended practices for this case are described be-
low in Section 7.2.2. When deep knowledge does exist, however, it may be
possible to evaluate competence against an automated adversary.
(page 87), 7.3.1 (page 81), and 5.1.2 (page 32), respectively. An additional
topic—statistical analysis of results—is covered in Section 8.1.3 (page 88).
The use of human experts in evaluations introduces factors that may be un-
familiar to computer scientists. Unclear instructions may reduce the quality
of the evaluations produced by the experts; demands for excessive detail may
lessen their degree of cooperation and increase the time taken. For example,
it took a year for all the evaluation booklets to be returned by the experts
who had agreed to participate in an early evaluation of MYCIN. Advice
from those (such as psychologists and sociologists) experienced in the use
of human subjects may be very useful in helping to design evaluations that
generate prompt, accurate, and useful responses from human experts.
linear models with unit, or even random weights can work quite well. For
example, Dawes and Corrigan report an experiment that involved prediction
of students’ grade point averages at the University of Oregon [54, Table 1].
The average validity of expert judges’ predictions was 0.37, that of their
regression models was 0.43, that of random linear models was 0.51, that of
an equal-weighting model was 0.60, while that of an optimal linear model
(i.e., one fitted to the actual outcomes, rather than the judges’ predictions)
was 0.69.
Dawes and Corrigan identify circumstances when linear models can be
expected to perform particularly well. First, there should be a “condition-
ally monotone” relationship between each cue and the judgment (outcome).
That is, it should be possible for the values of cue variables to be scaled so
that higher values of each predict a higher value of the judgment, indepen-
dently of the values of the remaining cue variables. Secondly, there should
be noise or uncertainty in the measurement of the judgment variable, and
thirdly, there should be noise or uncertainty in the measurement of the cue
variables. When these three conditions are satisfied, a linear model is often
accurate and robust.
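The following sketch (synthetic data, not Dawes and Corrigan's) illustrates the point numerically: when these conditions hold, an equal-weighting model achieves a validity close to that of an optimally fitted linear model.

import numpy as np

# Synthetic illustration: cues relate monotonically to a noisy criterion; compare the
# predictive validity (correlation with outcome) of unit weights and fitted weights.
rng = np.random.default_rng(0)
n, k = 200, 4
cues = rng.normal(size=(n, k))
outcome = cues @ np.array([0.9, 0.6, 0.4, 0.2]) + rng.normal(scale=1.0, size=n)

equal_pred = cues.sum(axis=1)                              # unit-weight model
fitted_w, *_ = np.linalg.lstsq(cues, outcome, rcond=None)  # optimal linear model
fitted_pred = cues @ fitted_w

corr = lambda a, b: float(np.corrcoef(a, b)[0, 1])
print("equal-weight validity:", round(corr(equal_pred, outcome), 2))
print("fitted-weight validity:", round(corr(fitted_pred, outcome), 2))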
Psychologists have proposed using linear models in two ways. The first is
as a “paramorphic representation” or model of the expert’s behavior that can
be used for analysis or prediction; the second, known as “bootstrapping,”
is as a replacement for the expert. The second of these applications of
linear models sets them in direct competition with expert systems and is
considered further in Chapter 10 (page 113). It is the use of linear models
as paramorphic representations that is of interest in this section, since it
suggests the use of such models in the evaluation of knowledge-based expert
systems.
Knowledge-based systems that perform judgment or discrimination tasks
for which it is possible to construct adequately performing linear models may
usefully be evaluated against such models. Clearly, there is little benefit in
employing a complex, knowledge-based system unless it can significantly
out-perform a simple linear model. Thus a linear model may serve to es-
tablish a testable “floor” for the desired competency level of the knowledge-
based system. In addition, since a linear model may be able to accurately
reproduce an expert’s behavior for at least some part of the input space, it
could serve as a “screening” filter during dynamic testing of the knowledge-
based system: those test cases that produce similar responses from both the
knowledge-based system and the linear model could be accepted as correct
(or subject to further investigation on only a random sampling basis), while
1. The test selection criterion was naive: simply the last 10 orders re-
ceived. Many of these test cases were trivial and there was no attempt
to look for difficult configuration tasks on which the system might fail.
Consequently, McDermott admits [128]:
The first of these points is considered more fully in Section 8.1 (page
85); the second was discussed above in Sections 7.2.2.1 and 7.2.2.2 (page
76). The third and fourth points are discussed below.
The testing and experimentation that are undertaken as part of the re-
finement phase have a quite different character and purpose—and, most
importantly, a different audience—than those undertaken during the later
evaluation phases. Gaschnig et al. [70] point out that it is important to com-
plete each phase of the development and evaluation before proceeding to the
next. For example, the fourth phase is intended to establish that the system
is performing at an expert level of competence, whereas the fifth phase is
concerned with its acceptability to the user. If the fourth phase has not been
completed satisfactorily before the fifth is started, then it will not be clear
whether the user’s failure to accept the system results from inadequacies in
its competence or in its human factors. The purpose of the different phases
of evaluation is to systematically eliminate the variables that can lead to
the user’s failure to accept the system. Thus the developers of MYCIN did
not attempt to assess its clinical utility until they had established that its
decision-making was at the level of an expert in the field. In this way, they
could be sure that any subsequent failure of acceptance by physicians was
due to human-engineering problems, rather than to decision-making errors.
Chapter 8
Testing of AI Systems
2. Though recall the possibility, suggested in Section 7.2.3 (page 78), that linear models
could mitigate this in certain situations.
when debugging heuristic rule sets and show that the techniques used in
TEIRESIAS do not always lead to the best set of rules.
Suwa, Scott and Shortliffe [181] describe another early program for check-
ing the rule-base of an expert system. Their program, which was used in
the development of the ONCOCIN expert system, first partitions rules into
disjoint sets based on the attribute that is assigned a value in the conclusion.
It then makes a table, displaying all possible combinations of attributes used
in the antecedents and the corresponding values that will be concluded in
the conclusion of the rule. The table is checked for conflicts, redundancy,
subsumption, and missing rules (see below). The rule checker assumes there
should be a rule for each possible combination of values of attributes that
appear in the antecedent; the checker hypothesizes missing rules based on
this assumption. The CHECK program described by Nguyen [140] is an ad-
junct to the “Lockheed Expert System” shell (LES) [115] that extends the
rule-base checks used in the ONCOCIN project. The following paragraphs
describe some of the checks performed by the CHECK program.
Situations that are considered likely to indicate problems of consistency
in goal-driven rules are:
Redundancy: two rules have the same antecedent, and the conclusions of
one subsume those of the other (e.g., x → y and x → y ∧ z).
Conflict: two rules have the same antecedent, but their conclusions are
contradictory (e.g., x → y and x → ¬y).
Subsumption: two rules have similar conclusions, but the antecedent of
one subsumes that of the other (e.g., x → y and x ∧ z → y).
Unnecessary IF rules: two rules have the same conclusions, and their
antecedents contain contradictory clauses, but are otherwise the same
(e.g., x ∧ z → y and x ∧ ¬z → y).
Circularity: a set of rules forms a cycle.
P ∨ Q → A
Q ∨ R → B
A ∧ B → T
A → D
B → ¬D
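The following sketch (an assumed rule representation, not the implementation of CHECK) performs the pairwise checks listed above. Note that such purely local checks would not detect the chained conflict in the five-rule example just shown, where D and ¬D become derivable only after inference.

from itertools import combinations

# Each rule is (antecedent, conclusions), both frozensets of literals such as "x" or "~x".

def negate(lit):
    return lit[1:] if lit.startswith("~") else "~" + lit

def check_rules(rules):
    findings = []
    for r1, r2 in combinations(rules, 2):
        (a1, c1), (a2, c2) = r1, r2
        if a1 == a2 and (c1 < c2 or c2 < c1):
            findings.append(("redundancy", r1, r2))
        if a1 == a2 and any(negate(l) in c2 for l in c1):
            findings.append(("conflict", r1, r2))
        if c1 == c2 and (a1 < a2 or a2 < a1):
            findings.append(("subsumption", r1, r2))
        diff = a1 ^ a2
        if c1 == c2 and len(diff) == 2 and any(negate(l) in a2 for l in diff & a1):
            findings.append(("unnecessary IF clause", r1, r2))
    return findings

rules = [(frozenset({"x"}), frozenset({"y"})),
         (frozenset({"x"}), frozenset({"y", "z"})),    # redundant with the first rule
         (frozenset({"x", "z"}), frozenset({"y"})),    # subsumed by the first rule
         (frozenset({"x", "~z"}), frozenset({"y"}))]   # unnecessary IF clause vs. the third
for finding in check_rules(rules):
    print(finding)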
1. For each distinct attribute value, form the disjunction of the an-
tecedents of the rules that assert that value (call this disjunction the
“combined antecedent” for that attribute value).
2. For each incompatible pair of attribute values, show that the conjunc-
tion of their combined antecedents is unsatisfiable.
(A ∧ v124 ≥ 35) ∨ (A ∧ v124 < 35 ∧ v124 > 10) ∨ (¬A ∧ v124 > 20)
v418test: MODULE
THEORY
A: boolean
v124: integer
lemma1: LEMMA
(A AND v124>=35) OR
(A AND v124<35 AND v124>10) OR
((NOT A) AND v124>20) OR
((NOT A) AND v124>6 AND v124<=20) OR
(A AND v124<=10) OR
((NOT A) AND v124<6)
PROOF
END v418test
5. Strictly, the method is incomplete since it uses Gomory's method for integer feasibility.
However, Gomory’s method is incomplete only in that it is not guaranteed to terminate—
and this proof does terminate.
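A brute-force version of this check is easy to sketch (illustrative only; it enumerates a bounded range of values for v124 rather than using the decision procedure employed by EHDM):

# Enumerate the boolean A and a bounded range of integer v124 values, reporting any
# combination covered by none of the six combined antecedents (a completeness gap) or
# by more than one of them (a potential conflict, if the groups conclude incompatible
# values). This is only a finite-range plausibility check, not a proof.

antecedents = [
    lambda A, v: A and v >= 35,
    lambda A, v: A and v < 35 and v > 10,
    lambda A, v: (not A) and v > 20,
    lambda A, v: (not A) and v > 6 and v <= 20,
    lambda A, v: A and v <= 10,
    lambda A, v: (not A) and v < 6,
]

cases = [(A, v) for A in (True, False) for v in range(-50, 101)]
uncovered = [c for c in cases if not any(p(*c) for p in antecedents)]
overlapping = [c for c in cases if sum(p(*c) for p in antecedents) > 1]
print("uncovered:", uncovered)
print("overlapping:", overlapping)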
p → q
p ∧ r → ¬q
since both formulas evaluate to true in any interpretation that assigns false
to p. Interpreted as rules, however, these two formulas are surely inconsis-
tent, since both q and ¬q will be asserted if p and r are simultaneously
true.
While this example (from Ginsberg [77]) clearly demonstrates that the
notion of consistency employed in propositional logic is different from that
used for rule-bases, none of the completeness and consistency tests described
in this section give formal definitions for their notions of “consistency” and
“completeness” (though Ginsberg [77] does identify his notion of “redun-
dancy” with that of independence in logic). This is a serious omission; the
weaknesses and uncertainties attending existing completeness and consis-
tency tests for production rule systems may largely be attributed to the
lack of a clear semantics for the notations employed and a corresponding
lack of rigorous definitions for the properties being tested. The develop-
ment of an appropriate semantic foundation in which to pose the problems
of completeness and consistency for production rule systems with accept-
able rigor is therefore a prerequisite for the creation of good algorithms for
completeness and consistency checking. Such a foundation might also make
it possible to develop techniques for establishing that a rule-based system
terminates.
Despite their rather shaky foundations, completeness and consistency
checking tools do seem to identify a significant number of problems in “typ-
ical” rule-bases. The proponents of such tools argue that this demonstrates
that they are appropriate and effective; an alternative interpretation, how-
ever, is that it merely demonstrates that rule-bases are a dangerously un-
structured and unreliable form of knowledge representation. If tools as crude
as these detect problems, then how many deeper problems must remain un-
detected? Some experimental evidence for this latter position is provided
by Pearce [152] who describes two knowledge-based systems for diagnosis
of the electrical power system in a satellite. The first system was a con-
ventional hand-crafted rule-based expert system. When this system was
analyzed by the KIC rule-base checking program [151] (this is comparable
to Nguyen’s CHECK program), seven problems were found among its 110
rules. And when it was tested against a simulator for the satellite system
8.2. Static Testing 97
concerned, it correctly diagnosed only 72% (10 out of 14) simulated failures.
In contrast, the second system, which was produced mechanically from a
qualitative model of the satellite electrical subsystem, contained only 75
rules, triggered no warnings from KIC, correctly diagnosed all 14 test cases,
and required only half the development time needed for the hand-crafted
rule-base. Furthermore, validation of the second system could also consider
the qualitative model underlying its rule-base—a higher level and more per-
spicuous representation than the rule-base itself. In essence, the rule-base in
the second of these systems may be regarded as compiled code; verification
and validation can be directed to the qualitative model that constitutes its
“source code.” This seems a very promising approach and could well be
the precursor of a new and more sophisticated and reliable approach to the
construction of rule-based expert systems. In contrast, completeness and
consistency checkers seem to address only the symptoms, rather than the
causes of unreliability in rule-based systems.
7
It is claimed that during the war with Argentina over the Falkland Islands, British
ships’ radar did not classify incoming Exocet missiles as hostile because the radar signature
of the Exocet identified it as belonging to an allied NATO country (France).
First, let us define “full,” which Winston uses without definition in the
preceding rules, to be a predicate which is true if a bag contains 6 large
items, 12 medium items or 24 small items (i.e., a large item is twice as big
as a medium item, and a medium item is twice as big as a small item, and
each bag may hold a maximum of the equivalent of 6 large items), and false
otherwise.
Now, let us suppose for a moment that we are interested in safety in
this context, which we define as not overloading the bags to the point where
they might break. The particular bags in use may break if loaded with 30
lbs. or more. Let us further suppose that we do not know how much each
individual item weighs, but we do know something about each category. We
know that the heaviest large item is 4 lbs, and that the heaviest small and
medium items are both 2 lbs. What we are interested in showing, of course,
is that these rules will never produce a situation in which a bag is in danger
of breaking.
Exhaustive testing, even of this three-rule subset, is quite impossible. For
each grocery order of r items, there are 6^r possible combinations. Therefore,
to exhaustively test the three rules given above, it would be necessary to
run:

    Σ_{r=0}^{∞} 6^r
test cases. Even with an operational constraint placed on the number of
items one could possibly purchase (let’s say 100), one would run:
    Σ_{r=0}^{100} 6^r ≈ 7.839823 × 10^77
test cases. In order to assure ourselves that this system is safe, we must
appeal to formal methods.
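The figure quoted above is easily confirmed; the following short
computation (a hypothetical check, not part of the original argument)
evaluates the bounded sum exactly.

# Hypothetical check of the number of test cases for grocery orders of up
# to 100 items, with 6 possibilities per item as assumed in the text.
total = sum(6 ** r for r in range(101))
print(f"{total:.6e}")    # prints roughly 7.839823e+77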
We begin by satisfying ourselves that the system begins in a safe state,
that is, we show that the initial state conforms to the criterion of our state
invariant (our safety property), which is that all bags should contain strictly
less than 30 lbs. It is easy to see that when all bags are empty, the weight
in any given bag is less than 30 lbs. Therefore, our initial state is safe.
Next, we assume the system is in some arbitrary state which conforms
to our safety property (where the weight in each bag is less than 30), and
attempt to show that we cannot reach an unsafe state through the applica-
tion of any of the rules. We assume the given rules exhaustively define all
state transitions which may occur relative to the state-space of our problem
domain. Therefore, if we can show that for each of the above three rules,
application of the rule in an arbitrary safe state leads only to another safe
state, then we can appeal to induction, and conclude that the above system
is safe for all reachable states.
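To make the shape of this argument concrete, the following sketch (a
hypothetical re-encoding using the worst-case weights given above, not
part of the original analysis) checks the base case and the step case for
rule B4, whose applicability condition is described in the next paragraph;
the step cases for the remaining rules would be checked in the same way.

# Hypothetical worst-case check of the safety invariant for rule B4.
# Assumptions taken from the text: the heaviest large item weighs 4 lbs,
# a bag is full at the equivalent of 6 large items, and a bag is safe
# while it weighs strictly less than 30 lbs.

MAX_LARGE_WEIGHT = 4     # lbs, heaviest possible large item
FULL_LARGE_COUNT = 6     # a bag holding 6 large items is full
BREAKING_WEIGHT = 30     # lbs, bags may break at this weight or more

def safe(weight):
    return weight < BREAKING_WEIGHT

# Base case: every bag starts empty, so its weight is 0 lbs.
assert safe(0)

# Step case for B4: the rule adds one large item only when the bag is
# empty, or holds only large items and is not yet full (fewer than 6).
# Even if every item in the bag, and the new item, weigh the maximum
# 4 lbs, the bag remains safe after the rule fires.
for large_items_in_bag in range(FULL_LARGE_COUNT):
    worst_case_before = large_items_in_bag * MAX_LARGE_WEIGHT
    worst_case_after = worst_case_before + MAX_LARGE_WEIGHT
    assert safe(worst_case_before) and safe(worst_case_after)

print("rule B4 preserves the safety invariant under worst-case weights")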
Rule B4 can only cause a large item to be put into a bag if the bag
is empty or if the bag holds only large items and is not yet full. By the
definition of full, this means fewer than 6 large items. Therefore, since the
heaviest large item weighs 4 lbs, the most a bag could weigh prior to the
be present [56]. There has been little work on developing formal seman-
tics for production-rule systems, and little consideration given to developing
languages and corresponding inference mechanisms that permit a clean se-
mantic characterization. In fact, the programming notations used for AI
systems (including those provided by expert system shells) seem to be at
a stage of development comparable to that of conventional programming
languages in the 1960s, where “power,” “features” and “convenience” were
considered more important than a formal semantic foundation. Thus de-
velopers of AI system architectures (commonly referred to as “shells”) have
paid little or no attention to mathematical rigor in the construction of the
underlying reasoning methods upon which these systems are based.
Progress in systematic anomaly detection and in mathematical verification
for AI systems will require that this attitude toward the programming
notations used to describe AI systems be replaced by one that shows more
concern for constructs with a clean and tractable semantic characterization.
the internal operations of that group, but not the rest of the
knowledge base. If he or she preserves the correct functioning of
the rules within the group and does not change the validity of
the assertions about its intergroup facts, the knowledge engineer
can be confident that the change that has been made will not
adversely affect the rest of the system. Conversely, if the knowl-
edge engineer wants to use additional intergroup facts from other
groups, he or she should rely only on the assertions provided for
them.”
Conclusions and Recommendations for Research

Chapter 10

Conclusions
els. Suitable notations might include higher order logic and set theory. It is
unlikely that a model represented in such a notation would permit usefully
efficient deduction, but it would possess an internal coherence and permit
causal explanations of behavior that are not possible for haphazard collec-
tions of rules. Knowledge engineering of a different form than that generally
practiced would then seek to elaborate the representation of such a model
into a form that would permit efficient deductions—in just the same way that
conventional software engineering elaborates specifications written in very
abstract form into executable programs. Indications of this sort of approach
applied to AI may be seen in qualitative reasoning techniques [24], in con-
straint satisfaction systems [118], and in the work of Chandrasekaran [42],
Neches, Swartout and Moore [138], and Abbott [3].
In our opinion, the explicit construction and scrutiny of models is an
essential component of the knowledge engineering process for trustworthy
AI systems. It is unlikely that users will trust an AI system, no matter how
impressive some of its demonstrations of competence may be, without some
causal explanation of its behavior. It is the underlying model, rather than
its accidental representation in terms of rules, frames or other notations,
that ultimately determines the competence and believability of the system,
as well as its ability to explain its actions. The fact that human experts
often do not have an explicit model of their domain of excellence is not an
argument against our position: humans bring other attributes (e.g., common
sense, self-awareness) to bear on their problem solving in addition to their
“rule-base” and are, in any case, permitted a degree of fallibility that is not
countenanced for machines that perform serious functions. Systems which
address domains that are so ill-understood that the only source of guidance
is an unstructured knowledge base extracted from experts' "war stories" may
represent interesting AI research but they are not ready for deployment in
serious situations.
Knowledge Representation
To be useful, knowledge needs to be organized and represented in such a way
that it can be used to solve problems. The detailed design and representation
of a knowledge base has much in common with conventional programming
and requires support similar to that provided for software engineering [153].
Bobrow et al. [25] quote experienced knowledge engineers as follows:
“Knowledge engineering is more than software engineering . . . but
not much more.”
3
The skill measure (serial correlation coefficient) for the expert system was 0.38, that
for the optimal linear model was 0.36, and those for the other linear models ranged from
0.32 to 0.34.
10.1 Recommendations for Research
Near Term
• For critical applications at least, abandon the use of rule-based expert
systems whose rule sets are unsupported by causal models.
Longer Term
• Develop methodologies and notations in support of knowledge engi-
neering as an explicit model-building activity. Investigate the feasibil-
ity of “compiling” certain classes of models into production rules (or
other efficient representations), or of verifying a set of production rules
against an explicit model.
• Identify useful metrics for AI-based software and validate their value
as predictors of important cost, performance and reliability measures.
Closing Remarks
Until very recently, knowledge-based systems were an exploratory research
topic and little attention was paid to their quality assurance. Now that
some knowledge-based systems, especially those known as “rule-based expert
systems,” are being placed into operational environments, it is important to
develop a software quality assurance methodology for such systems.
In this report, we hope to have shown how some of the quality assur-
ance and measurement techniques developed for conventional software can
be adapted and applied to knowledge-based systems. However, while such
adaptations of existing techniques would be beneficial and important, they
tend to focus on representational and implementation issues—whereas it is
the knowledge encoded in the system that ultimately determines its quality
of performance. It is entirely feasible to implement a knowledge-based sys-
tem as a program in Basic—but to evaluate that system as simply a Basic
program surely misses the point. It is the knowledge that it embodies, as
much as its representation in Basic, that needs to be examined. Thus, the
way to focus on the distinctive problems of quality assurance for AI software
may be through the development of techniques for describing, evaluating,
and manipulating knowledge in ways that are abstracted from the concrete
details of a particular representation, in much the same way as algorithms
are abstracted from programs. This represents an exciting challenge for the
future.
Bibliography
[15] Farokh B. Bastani and S. Sitharama Iyengar. The effect of data struc-
tures on the logical complexity of programs. Communications of the
ACM, 30(3):250–259, March 1987.
[16] Michael Z. Bell. Why expert systems fail. Journal of the Operational
Research Society, 36(7):613–619, 1985.
[23] J. Bliss, P. Feld, and R. E. Hayes. Decision aid measurement and evalu-
ation (DAME). In Proceedings, International Conference on Systems,
Man, and Cybernetics, pages 126–130, Atlanta, GA, October 1986.
IEEE.
[25] Daniel G. Bobrow, Sanjay Mittal, and Mark J. Stefik. Expert sys-
tems: Perils and promise. Communications of the ACM, 29(9):880–
894, September 1986.
[32] Susan S. Brilliant, John C. Knight, and Nancy G. Leveson. The con-
sistent comparison problem in N-Version software. IEEE Transactions
on Software Engineering, 15(11):1481–1485, November 1989.
[36] Barbara Carroll. Expert systems for clinical diagnosis: Are they worth
the effort? Behavioral Science, 32:274–292, 1987.
[40] Fun Ting Chan and Tsong Hueh Chen. AIDA—a dynamic data flow
anomaly detection system for Pascal programs. Software—Practice
and Experience, 17(3):227–239, March 1987.
[45] Lori Clarke, Andy Podgurski, Debra Richardson, and Steven Zeil. A
formal evaluation of data flow path selection criteria. IEEE Transac-
tions on Software Engineering, 15(11):1318–1332, November 1989.
[50] Chris Culbert, Gary Riley, and Robert T. Savely. Approaches to ver-
ification of rule-based expert systems. In First Annual Workshop on
Space Operations, Automation, and Robotics (SOAR 87), pages 191–
196a, Houston, TX, August 1987. NASA Conference Publication 2491.
[51] P. Allen Currit, Michael Dyer, and Harlan D. Mills. Certifying the
reliability of software. IEEE Transactions on Software Engineering,
SE-12(1):3–11, January 1986.
[70] John Gaschnig, Philip Klahr, Harry Pople, Edward Shortliffe, and
Allan Terry. Evaluation of expert systems: Issues and case studies. In
F. Hayes-Roth, D. A. Waterman, and D. B. Lenat, editors, Building
Expert Systems, chapter 8. Addison-Wesley, Reading, MA, 1983.
[72] David Gelperin and Bill Hetzel. The growth of software testing. Com-
munications of the ACM, 31(6):687–695, June 1988.
[79] Mary Ann Goodwin and Charles C. Robertson. Expert system verifica-
tion concerns in an operations environment. In First Annual Workshop
on Space Operations, Automation, and Robotics (SOAR 87), pages
203–207, Houston, TX, August 1987. NASA Conference Publication
2491.
[84] Paul Harmon and David King. Expert Systems: Artificial Intelligence
in Business. John Wiley and Sons, New York, NY, 1985.
[121] Nancy G. Leveson. Software safety: Why, what and how. ACM
Computing Surveys, 18(2):125–163, June 1986.
[123] Jay Liebowitz. Useful approach for evaluating expert systems. Expert
Systems, 3(2):86–96, April 1986.
[131] Harlan D. Mills, Michael Dyer, and Richard Linger. Cleanroom soft-
ware engineering. IEEE Software, 4(5):19–25, September 1987.
[133] M. Moriconi and D.F. Hare. The PegaSys system: Pictures as formal
documentation of large programs. ACM Transactions on Programming
Languages and Systems, 8(4):524–546, October 1986.
[143] Robert M. O’Keefe, Osman Balci, and Eric P. Smith. Validating expert
system performance. IEEE Expert, 2(4):81–90, Winter 1987.
[154] Sandra Rapps and Elaine J. Weyuker. Selecting software test data
using data flow information. IEEE Transactions on Software Engi-
neering, SE-11(4):367–375, April 1985.
[163] John Rushby and Friedrich von Henke. Formal verification of the
Interactive Convergence clock synchronization algorithm using Ehdm.
[189] Leslie J. Waguespack, Jr. and Sunil Badlani. Software complexity as-
sessment: An introduction and annotated bibliography. ACM Software
Engineering Notes, 12(4):52–71, October 1987.
[194] C. Wilson and L. J. Osterweil. Omega—a data flow analysis tool for
the C programming language. In Proceedings COMPSAC, pages 9–18,
1982.