Successive Sampling and Software Reliability
by
Gordon M. Kaufman*
MIT Sloan School Working Paper 3316
July 1991
*Supported by AFOSR Contract #AFOSR-890371. I wish to thank Nancy Choi and Tom
Wright for valuable programming assistance.
SUCCESSIVE SAMPLING AND SOFTWARE RELIABILITY
by
Gordon M. Kaufman*
1. Introduction
A software system is tested and times between failures are observed. How many faults
remain in the system? What is the waiting time to the next failure? To the occurrence
of the next n failures? Conditional on the observed history of the test process, knowledge
of properties of the time on test necessary to discover the next n faults is very useful for
making test design decisions.
Times between failures models of software reliability are designed to answer such
questions. Many versions of such models appear in the literature on software reliability
and most such models rely on the assumption that the failure rate is proportional to the
number of remaining faults or to some single valued function of the number of remaining
faults. Goel (1985) observes that this is a reasonable assumption if the experimental de-
sign of the test assures equal probability of executing all portions of the code, a design
seldom achieved in practice. The character of testing usually varies with the test phase:
requirements, unit, system or operational. The impact of such considerations has been
recognized by some authors: Littlewood's criticism of the Jelinski-Moranda assumption
that software failure rate at any point in time is directly proportional to the residual num-
ber of faults in the software is cited by Langberg and Singpurwalla (1985) in an excellent
overview paper. Only recently have some researchers come to grips with the implications of
replacing this assumption. In terms of counts of failures, it may be labelled an "equal bug
size" postulate (Scholz (1985). Littlewood (1981) and Singpurwalla and Langberg (1985)
do incorporate the assumption that different bugs may have different failure rates, but
the empirical Bayes (superpopulation) approach adopted by Littlewood and the Bayesian
approach adopted by Singpurwalla and Langberg "averages out" the effects of this as-
sumption. According to Scholz "...it was not recognized by some proponents of reliability
growth models that relaxing the equal bug size assumption also entails some complications
concerning the independence and exponentiality [of waiting times between failures]". He
and Miller (1986) are the first to investigate systematically (in the absence of a super-
population process or of a Bayesian prior for failure rates) the implications of assuming
that given an observational history, each of the remaining bugs in a software system may
possess different probabilities of detection at a given point in time. In contrast to most
times between failures models, for successive sampling (EOS) models, times between fail-
ures are not independent. As Goel [1985] points out, independence would be acceptable
if "...successive test cases were chosen randomly. However, testing especially functional
testing, is not based on independent test cases, so that the test process is not likely to be
random".
Scholz presents a multinomial model for software reliability that is identical to Rosén's
characterization of successive sampling stopping times (Rosén, 1972). The connection
seems to have gone unnoticed. The "continuous" model based on independent, non-
identically distributed exponential random variables suggested by Scholz as an approxima-
tion to multinomial waiting times is in fact in exact correspondence with a representation
of successive sampling in terms of non-identically distributed but independent exponential
order statistics. Scholz's approximation is in fact Ross's (1985) exponential order statistics
model which Ross treats Bayesianly. Gordon (1983) was among the first to observe that
successive sampling is representable in this fashion. Miller's study of such order statis-
tics is focused on similarities and differences between types of models derivable from this
particular paradigm.
Joe (1989) provides an asymptotic (large sample) maximum likelihood theory for
parametric order statistics models and non-homogeneous Poisson models of fault occurrence
that, when the parameter is of fixed dimension, yields asymptotic confidence intervals.
He states that for the general exponential order statistics model, "one cannot expect any
estimate [of the conditional failure rate] to be good because the ratio of parameters to
random variables is too big".
Successive sampling as described in the next section has been successfully used as a
model for the evolution of magnitudes of oil and gas field discoveries and has its roots in
the sample survey literature (Hájek (1981), for example). In this application magnitudes
of fields in order of discovery are observed and used to make predictions of the empirical
frequencies of magnitudes of undiscovered fields. Logically tight theories of maximum
likelihood, moment type and unbiased estimation for this class of problems have been
developed by Bickel, Nair and Wang (1989), Gordon (1989), and Andreatta and Kaufman
(1986). The problem of estimation of software reliability based on observation of times
between failures of a software system may be viewed as the dual to the problem of inference
when only magnitudes of population elements are observed. The principal purpose of this
paper is to establish connections between these two disparate lines of research and to lay out
possibilities for applying methods of estimation developed for successive sampling schemes
to successive sampling as a model for software reliability. Our attention is restricted to
successive sampling of elements of a finite population of software faults in a software system;
that is,
(1) Individual faults may possess distinct failure rates that depend on covariates par-
ticular to the stage of testing and on other features of the software environment.
For a given fault, that fault's failure rate as a function of such covariates is called
the fault magnitude.
(2) Faults are sampled (a) without replacement and (b) proportional to magnitude.
Some recent studies of successive sampling schemes have assumed the existence of a su-
perpopulation process generating finite population magnitudes. We shall not.
The accuracy of this model structure as a depiction of the physics of software fault occur-
rence depends in part on the validity of the definition chosen for the magnitude of a fault.
Different definitions of magnitude may be required for different environments. Here we
shall assume that the appropriate definition of the magnitude of a fault for the particular
application considered has been resolved. This is NOT easy to do, and considerable effort
must be devoted to constructing operationally meaningful definitions of fault magnitude. For
example, the effort in programmer time required to fix a fault could be adopted. (Program-
mer effort expended to correct faults is measured and recorded for six software projects
studied by the Software Engineering Laboratory at NASA-Goddard). Empirical work on
fault prediction by Basili and Perricone (1982), and Basili and Patniak (1986) provides an
excellent starting point. It is a subject for a different paper.
Following a formal description of the properties of successive sampling
schemes needed in the sequel, two distinct sampling (observational) schemes are examined
in section three. The first is a scheme in which both the time from start of testing to
time of occurrence and the magnitude of each fault in a sample of n faults are jointly
observed. With this scheme we can order faults observed from first to last and assign a
"waiting time" to each fault. In the second scheme magnitudes of faults in a sample of
n faults are observed along with the waiting time to occurrence of the last fault in the
sample; waiting times to occurrences of individual faults are not observed. The order in
which faults occurred is then lost.
Section 4 is devoted to properties of unbiased estimators of unobserved finite popu-
lation parameters for each of the two aforementioned sampling schemes. The connection
between maximum likelihood estimation (MLE) and unbiased estimation established by
Bickel, Nair and Wang (1989) for a successive sampling scheme in which magnitudes alone
are observed is developed for a scheme in which both waiting times to failures and mag-
nitudes are observed. The results of a Monte Carlo study of properties of both types of
estimators are presented in Section 5.
Section 6 returns to a principal interest of the software manager: conditional on observing
the history of the process up to and including the mth failure, what is the waiting time to
the occurrence of the next n - m failures? Successive sampling theory suggests a simple
point estimator of this waiting time, dependent on the waiting time z_(m) to occurrence of
the first m faults and on the unordered set {y_1, ..., y_m} of magnitudes of faults observed
in (0, z_(m)). A Monte Carlo study of its behavior suggests that this class of estimators of
returns to test effort measured in faults/unit time on test is worth further study.
2. Successive Sampling
Let U = {1, 2, ..., N} denote the set of labels of N finite population elements and
associate with each k ∈ U a magnitude a_k > 0. A successive sampling scheme is charac-
terized by the permutation distribution of labels induced by sampling A_N = {a_1, ..., a_N}
proportional to magnitude and without replacement. The parameter A_N = (a_1, ..., a_N) of
a finite population with label set U takes on values in the parameter space (0, ∞)^N.

Define for 1 ≤ n ≤ N, s̃_n = (i_1, ..., i_n), i_1, ..., i_n ∈ U, to be an ordered sample of
size n with jth component i_j, and s_n = {i_1, ..., i_n} ⊂ U to be an unordered sample of
size n. We distinguish a random variable from a value assumed by it with a capital letter;
e.g., the random variable S̃_n assumes values s̃_n. With r_N = Σ_{k=1}^N a_k, a successive sampling
scheme is implemented via

    P{S̃_n = s̃_n | A_N} ≡ P(s̃_n | A_N) = ∏_{j=1}^n a_{i_j} / [r_N − a_{i_1} − ⋯ − a_{i_{j−1}}]   (2.1)

with a_{i_0} ≡ 0; equivalently, (2.1) may be written in terms of p_j = a_j/r_N with Σ_{j=1}^N p_j = 1.
With X_1, ..., X_N independent unit exponential random variables, define Z_j = X_j/a_j,
Z_(j) as the jth smallest of Z_1, ..., Z_N, and U/s_n = {k | k ∈ U and k ∉ s_n}. The
probability that an unordered sample s_n is observed is

    P{s_n | A_N} = P{max{Z_j | j ∈ s_n} < min{Z_k | k ∈ U/s_n}}
                 = Σ_{k∈U/s_n} P{max{Z_j | j ∈ s_n} < Z_k < min{Z_ℓ | ℓ ∈ U/s_n and ℓ ≠ k}}.   (2.5)
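This exponential representation also gives a convenient way to simulate a successive sample: draw independent unit exponential X_1, ..., X_N, form Z_k = X_k/a_k, and take the labels of the n smallest Z's in increasing order. The sketch below (Python; the population magnitudes are illustrative assumptions, not values from the paper) draws ordered samples both by sequential probability-proportional-to-magnitude selection as in (2.1) and by sorting Z_k = X_k/a_k, so the two mechanisms can be compared empirically.

    import numpy as np
    from collections import Counter

    def sample_pps(a, n, rng):
        # Ordered successive sample via (2.1): draw labels one at a time,
        # without replacement, with probability proportional to magnitude.
        a = np.asarray(a, dtype=float)
        remaining = list(range(len(a)))
        ordered = []
        for _ in range(n):
            w = a[remaining] / a[remaining].sum()
            pick = rng.choice(len(remaining), p=w)
            ordered.append(remaining.pop(pick))
        return tuple(ordered)

    def sample_eos(a, n, rng):
        # Same ordered sample via the exponential order statistics
        # representation: Z_k = X_k / a_k, X_k i.i.d. unit exponential;
        # the n smallest Z's, in increasing order, give the ordered sample.
        z = rng.exponential(size=len(a)) / np.asarray(a, dtype=float)
        return tuple(np.argsort(z)[:n])

    if __name__ == "__main__":
        rng = np.random.default_rng(1)
        a = [1.0, 2.0, 3.0, 4.0]          # illustrative magnitudes
        reps = 20000
        freq_pps = Counter(sample_pps(a, 2, rng) for _ in range(reps))
        freq_eos = Counter(sample_eos(a, 2, rng) for _ in range(reps))
        for s in sorted(freq_pps):
            print(s, freq_pps[s] / reps, freq_eos.get(s, 0) / reps)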
    P{s_n | A_N} ≡ P(s_n | a)
        = ∫_0^∞ e^{−a(s_n)x} [ Σ_{j∈s_n} a_j e^{−a_j x} ∏_{ℓ∈s_n, ℓ≠j} (1 − e^{−a_ℓ x}) ] dx
        = ∫_0^∞ a(s_n) e^{−a(s_n)x} ∏_{j∈s_n} (1 − e^{−a_j x}) dx,   (2.6)

where a(s_n) ≡ r_N − Σ_{j∈s_n} a_j = Σ_{k∈U/s_n} a_k. We shall need derivatives of P(s_n | a) with
respect to a and, in order to maintain consistency of notation, will call the rightmost integral
in (2.6) P_{n+1}(s_n | a), as the density of Z_(n+1) conditional on S_n = s_n is just the normalized
integrand of this integral:

    D_{n+1}(z | s_n; a) = a(s_n) e^{−a(s_n)z} ∏_{j∈s_n} (1 − e^{−a_j z}) / P_{n+1}(s_n | a)   (2.7)

for z ∈ (0, ∞) and zero otherwise. In a similar vein, the middle integral in (2.6) will be called
P_n(s_n | a), as the density of Z_(n) conditional on S_n = s_n is

    D_n(z | s_n; a) = e^{−a(s_n)z} [ Σ_{j∈s_n} a_j e^{−z a_j} ∏_{ℓ∈s_n, ℓ≠j} (1 − e^{−z a_ℓ}) ] / P_n(s_n | a).   (2.8)
which is equivalent to

    E(Z_(n+1) | s_n; a) = 1/a(s_n) + E(Z_(n) | s_n; a).   (2.9)
The next two propositions record properties of the marginal expectations of Y_1, ..., Y_N,
where Y_k denotes the magnitude of the kth population element observed in order of selection.
The first documents the intuitively obvious notion that when elements of A_N are distinct
from one another, the marginal expectation E(Y_k) of Y_k strictly decreases with increasing k.

Proposition 2.1: When elements of A_N are distinct, E(Y_1) > E(Y_2) > ⋯ > E(Y_N).
Expectations E(Y_k), k = 1, 2, ..., N, may be usefully expressed in terms of inclusion
probabilities π_j(n) as follows: let π_j(n) = P{j ∈ S_n | A_N}.

Proposition 2.2: For sample size n = 1, 2, ..., N, with π_j(0) ≡ 0,

    E(Y_n) = Σ_{j=1}^N [π_j(n) − π_j(n − 1)] a_j.   (2.10)
We shall adopt Hájek bounds (A_1) on the a_j's and examine large N behavior in the following
uniform asymptotic regime:

(A_2): As N → ∞, n/N = f_n → f ∈ (0, 1) with f bounded away from zero and one.

Define z(f_n) as the root of

    (1/N) Σ_{j=1}^N e^{−a_j z(f_n)} = 1 − f_n,   (2.11)

and let lim_{N→∞} z(f_n) = z(f). An immediate consequence of (A_1), (A_2) and (2.11) is that
for large N, z(f_n) = O(1), because −log(1 − f_n) ≤ z(f_n) ≤ −log(1 − f_n)/ε and
−log(1 − f_n) = O(1).
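Under the reconstruction of (2.11) above, z(f_n) is the root of a function that decreases monotonically in z, so it is easy to compute numerically. A minimal sketch (Python); the population used below is the A_10 example of Section 5, and the choice f_n = 4/10 is illustrative:

    import numpy as np

    def z_of_f(a, f_n, tol=1e-12):
        # Solve (1/N) * sum_j exp(-a_j * z) = 1 - f_n for z by bisection.
        # The left-hand side decreases from 1 at z = 0 toward 0, so for
        # 0 < f_n < 1 a unique root exists.
        a = np.asarray(a, dtype=float)
        target = 1.0 - f_n
        lhs = lambda z: np.mean(np.exp(-a * z))
        lo, hi = 0.0, 1.0
        while lhs(hi) > target:        # expand until the root is bracketed
            hi *= 2.0
        while hi - lo > tol * (1.0 + hi):
            mid = 0.5 * (lo + hi)
            if lhs(mid) > target:
                lo = mid
            else:
                hi = mid
        return 0.5 * (lo + hi)

    if __name__ == "__main__":
        a = np.arange(1, 11, dtype=float)   # A_10 = {1, ..., 10}
        print(z_of_f(a, f_n=0.4))           # n/N = 4/10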
Likelihood functions formed from (3.1) or (3.2) for purposes of inference about unobserved
elements U/s_n of U (or about a(s_n) ≡ r_N − Σ_{j∈s_n} a_j) are not regular (see Kaufman (1989))
and so are not useful for maximum likelihood estimation. For example, (3.1) as a function
of a(s_n) ≡ α is proportional to exp{−α z_(n)}, and a maximizer of this function over α ∈ (0, ∞)
occurs at α = 0, irrespective of the observed sample.
If we condition on either the ordered sample or on the unordered sample, it is possible
to deduce a regular likelihood function with a corresponding efficient score function whose
expectation is zero.
and corresponding efficient score function

    (d/dα) log L(α | t_n) = −z_(n) + Σ_{j=1}^n [b_j + α]^{−1},   (3.5)

where b_j ≡ y_j + ⋯ + y_n. Setting (3.5) equal to zero yields the likelihood equation

    z_(n) = Σ_{j=1}^n [b_j + α]^{−1}.   (3.6)

Conditional on S̃_n = s̃_n, E(Z_(n) | s̃_n; α) = Σ_{j=1}^n [b_j + α]^{−1}, so (3.5) is representable as

    (d/dα) E_{Z_(n)|s̃_n} log L(α | Z_(n); s̃_n) = E_{Z_(n)|s̃_n} (d/dα) log L(α | Z_(n); s̃_n) = 0.   (3.8)

In addition,

    (d²/dα²) log L(α | z_(n); s̃_n) = −Σ_{j=1}^n [b_j + α]^{−2} = −Var(Z_(n) | s̃_n; α) < 0.   (3.9)

Proposition 3.1: A unique zero in (0, ∞) of (3.6) exists iff Σ_{j=1}^n b_j^{−1} > z_(n).

While use of (3.6) to define an estimator of a(s̃_n) may produce desultory results for
small examples (unacceptably large probability of no solution), in the uniform asymptotic
regime dictated by (A_2) a unique zero of (3.7) exists with probability one.
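In practice the zero of (3.6) is easy to locate: the right-hand side Σ_j [b_j + α]^{−1} decreases monotonically in α from Σ_j b_j^{−1} at α = 0 toward zero, so when Σ_j b_j^{−1} > z_(n) a single bracketing-and-bisection pass suffices. A minimal sketch (Python), assuming the magnitudes y_1, ..., y_n are observed in order of occurrence together with the waiting time z_(n); the numerical values below are illustrative only:

    import numpy as np

    def alpha_hat_ordered(y, z_n, tol=1e-10):
        # y: magnitudes in order of occurrence; b_j = y_j + ... + y_n.
        # Returns the zero of (3.6), i.e. the alpha solving
        #     sum_j 1 / (b_j + alpha) = z_(n),
        # or 0.0 when sum_j 1 / b_j <= z_(n) (Proposition 3.1).
        b = np.cumsum(np.asarray(y, dtype=float)[::-1])[::-1]
        score = lambda alpha: np.sum(1.0 / (b + alpha)) - z_n
        if score(0.0) <= 0.0:
            return 0.0
        lo, hi = 0.0, 1.0
        while score(hi) > 0.0:         # expand until the root is bracketed
            hi *= 2.0
        while hi - lo > tol * (1.0 + hi):
            mid = 0.5 * (lo + hi)
            if score(mid) > 0.0:
                lo = mid
            else:
                hi = mid
        return 0.5 * (lo + hi)

    if __name__ == "__main__":
        y = [4.0, 2.0, 5.0, 1.0]       # illustrative observed magnitudes
        print(alpha_hat_ordered(y, z_n=0.15))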
Under (A_1) and (A_2),

    −ε log(1 − f_n) ≤ Σ_{j=1}^{Nf_n} [a(s̃_n) + b_j]^{−1} ≤ −log(1 − f_n),   (3.10)

    Var(Z_(n) | s̃_n) = Σ_{j=1}^{Nf_n} [a(s̃_n) + b_j]^{−2} = O(N^{−1}).   (3.11)
Proposition 3.2: As N → ∞, the equation z_(n) = E(Z_(n) | s̃_n; Nᾱ) possesses a unique solution
ᾱ interior to (0, ∞) with probability one.

Proof: In terms of ᾱ = α/N,

    E(Z_(n) | s̃_n; Nᾱ) = Σ_{j=1}^n [Nᾱ + b_j]^{−1} = O(1).   (3.12)

Since

    Σ_{j=1}^n b_j^{−1} = Σ_{j=1}^n [y_j + ⋯ + y_n]^{−1}   (3.13)

is of the order of γ + log n + O(n^{−1}) (γ = Euler's constant), Σ_{j=1}^n b_j^{−1} diverges
logarithmically as N → ∞. As Z_(n) | s̃_n converges in probability to an atom of order one for
each realizable sequence s̃_n, n = 1, 2, ..., the event Σ_{j=1}^n b_j^{−1} > Z_(n) conditional on
S̃_n = s̃_n obtains with probability one as N → ∞. □
Proposition 3.3: The estimator α̂(Z_(n), S̃_n), defined as zero if Σ_{j=1}^n (Y_j + ⋯ + Y_n)^{−1} < Z_(n)
and as a solution to (3.5) if Σ_{j=1}^n (Y_j + ⋯ + Y_n)^{−1} > Z_(n), is positively biased.
Proof: Temporarily let α̂(z, s̃_n) = α̂(z) for fixed s̃_n. As α̂(z) = 0 for z ≥ Σ_{j=1}^n b_j^{−1} ≡ β_n
and (3.7) obtains for 0 < z < β_n, defining H_n(z) = P{Z_(n) ≤ z | s̃_n} for 0 < z < ∞,

    E_{Z_(n)|s̃_n} Σ_{j=1}^n [α̂(Z_(n)) + b_j]^{−1} = β_n P{Z_(n) > β_n | s̃_n} + ∫_0^{β_n} z dH_n(z).   (3.14)

Since by definition

    E(Z_(n) | s̃_n) = ∫_0^∞ z dH_n(z),   (3.15)

we have

    E_{Z_(n)|s̃_n} Σ_{j=1}^n [α̂(Z_(n)) + b_j]^{−1} − E(Z_(n) | s̃_n) = ∫_{β_n}^∞ [β_n − z] dH_n(z) ≤ 0.   (3.16)

In addition, as [a + b_j]^{−1} is strictly convex in a on [0, ∞), Jensen's inequality gives

    Σ_{j=1}^n [E_{Z_(n)|s̃_n} α̂(Z_(n)) + b_j]^{−1} ≤ E_{Z_(n)|s̃_n} Σ_{j=1}^n [α̂(Z_(n)) + b_j]^{−1}.   (3.17)

Combining (3.16) and (3.17) with E(Z_(n) | s̃_n; α) = Σ_{j=1}^n [α + b_j]^{−1} gives
Σ_{j=1}^n [E α̂(Z_(n)) + b_j]^{−1} ≤ Σ_{j=1}^n [α + b_j]^{−1}, so that E(α̂(Z_(n)) | s̃_n) ≥ α(s̃_n). □
3.2 Unordered Samples
As in the case of an ordered sample of magnitudes, the likelihood function for α suggested
by (3.19) is not regular. However, the likelihood function for α corresponding to

    Prob{Z_(n) ∈ dz | s_n; A_N} = D_n(z | s_n; a(s_n)) dz,

with D_n given by (2.8), is regular (Kaufman (1989)), and the efficient score function for α,
using (2.9), is

    (d/dα) log L(α | z_(n), s_n) = −z_(n) + E(Z_(n) | s_n; α).

In turn,

    (d²/dα²) log L(α | z_(n), s_n) = −Var(Z_(n) | s_n; α) < 0,

so that if the score equation possesses a solution it must be unique. This efficient score
function is of the same form as that for the ordered sample when all of Z_(1), Z_(2), ..., Z_(n)
(or equivalently T_n) as well as magnitudes Y_1, Y_2, ..., Y_n in order of occurrence are observed.
The only difference is that the expectation of Z_(n) conditional on S_n = s_n, the unordered
sample, replaces the expectation of Z_(n) conditional on the ordered sample S̃_n = s̃_n.
We may interpret Var(Z_(n) | s_n; α) as expected Fisher information with respect to a
measure (the density of Z_(n)) conditional on S_n = s_n. Since

    Var(Z_(n) | s_n; α) = E_{S̃_n|s_n} Var(Z_(n) | S̃_n; α) + Var_{S̃_n|s_n}(E(Z_(n) | S̃_n; α)),

we have that

    E_{S̃_n|s_n} Var(Z_(n) | S̃_n; α) ≤ Var(Z_(n) | s_n; α),   (3.24)

and inference based on observation of (Z_(n), S_n) is more efficient in the sense of (3.24)
than inference based on observation of (T_n, Y_n), or equivalently of (T_n, S̃_n). One possible
explanation is that unordering of the sample as suggested by (3.14) is a form of Rao-
Blackwellization.
Paralleling our treatment of the case when Z_(n) = z_(n) and S̃_n = s̃_n are observed, for
large N we define an estimator ᾱ(Z_(n), S_n) = α̂(Z_(n), S_n)/N.
4. Unbiased Estimation
Theorem 4.1: For each k ∈ U,

    E_{S_n} E{ 1_{{k ∈ S_n}} [1 − exp{−a_k Z_(n+1)}]^{−1} | S_n } = 1;

equivalently, conditional on k ∈ S_n, [1 − exp{−a_k Z_(n+1)}]^{−1} is an unbiased estimator of
1/π_k(n).

Proof: For k ∈ s_n,

    E{[1 − e^{−a_k Z_(n+1)}]^{−1} | s_n}
        = ∫_0^∞ [1 − e^{−a_k y}]^{−1} a(s_n) e^{−a(s_n)y} ∏_{j∈s_n} (1 − e^{−a_j y}) dy / P(s_n | a)
        = P(s_n | k#; a) / P(s_n | a),   (4.1)

where P(s_n | k#; a) denotes the probability of observing the unordered sample s_n \ {k} when
the population element labelled k is removed from U. The right-hand side of (4.1) is a version
of Murthy's estimator which has been shown by Andreatta and Kaufman (1986) to have
expectation 1/π_k(n). □
Corollary 4.1: Let h be a single valued function with domain A_N and range (−∞, ∞)
or some subset of (−∞, ∞). Define H = Σ_{j∈A_N} h(a_j). Then when S_n = s_n and Z_(n+1) =
z_(n+1) are observed,

    Ĥ(z_(n+1), s_n) = Σ_{k∈s_n} h(a_k) [1 − e^{−a_k z_(n+1)}]^{−1}   (4.2)

is an unbiased estimator of H. In particular, with h ≡ 1,

    N̂(z_(n+1), s_n) = Σ_{k∈s_n} [1 − e^{−a_k z_(n+1)}]^{−1}   (4.3)

is an unbiased estimator of the population size N. When Z_(n) rather than Z_(n+1) is observed,
the analogous estimator is no longer unbiased; the bias may be expressed with the aid of

    p_k(s_n; z_(n)) = [a_k e^{−a_k z_(n)} / (1 − e^{−a_k z_(n)})] / Σ_{j∈s_n} [a_j e^{−a_j z_(n)} / (1 − e^{−a_j z_(n)})].   (4.4)
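A sketch of the estimator in Corollary 4.1, under the reconstruction of (4.2)-(4.3) given above (Python; the observed magnitudes and waiting time below are illustrative). With h ≡ 1 it estimates N, and with h(a) = a it estimates r_N:

    import numpy as np

    def h_hat(a_sample, z_next, h=lambda a: 1.0):
        # Unbiased estimator (4.2) of H = sum over the population of h(a_j),
        # computed from the magnitudes a_k of the n sampled faults and the
        # waiting time z_(n+1) to the (n+1)st failure.
        a_sample = np.asarray(a_sample, dtype=float)
        h_vals = np.array([h(a) for a in a_sample])
        return np.sum(h_vals / (1.0 - np.exp(-a_sample * z_next)))

    if __name__ == "__main__":
        a_obs = [7.0, 10.0, 3.0, 9.0]     # illustrative sampled magnitudes
        z5 = 0.35                          # illustrative z_(n+1)
        print("N estimate:  ", h_hat(a_obs, z5))                 # h = 1
        print("r_N estimate:", h_hat(a_obs, z5, h=lambda a: a))  # h(a) = a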
In a similar vein, for k ∈ s_n,

    E_{Z_(n)}{[1 − e^{−a_k Z_(n)}]^{−1} | s_n}
        = (1/P(s_n | a)) ∫_0^∞ [1 − e^{−a_k x}]^{−1} e^{−a(s_n)x} Σ_{j∈s_n} a_j e^{−a_j x} ∏_{ℓ∈s_n, ℓ≠j} (1 − e^{−a_ℓ x}) dx
        = P(s_n | k#; a)/P(s_n | a)
          + (1/P(s_n | a)) ∫_0^∞ e^{−a(s_n)x} [a_k e^{−a_k x} / (1 − e^{−a_k x})^2] ∏_{j∈s_n} (1 − e^{−a_j x}) dx,   (4.5)

so that

    E{ Σ_{k∈s_n} [1 − e^{−a_k Z_(n)}]^{−1} | s_n }
        = Σ_{k∈s_n} P(s_n | k#; a)/P(s_n | a)
          + ∫_0^∞ Σ_{k∈s_n} [p_k(s_n; z) / (1 − e^{−a_k z})] D_n(z | s_n; a) dz.   (4.6)

The first term on the RHS of (4.6) is Murthy's unbiased estimator of N; the second term is
the conditional expectation of

    Σ_{k∈s_n} p_k(s_n; Z_(n)) / (1 − e^{−a_k Z_(n)}).   (4.7)
To set the stage for establishing the correspondence between unbiased estimation and
CMLE as N → ∞, examine the score function for the log likelihood when Z_(n+1) = z_(n+1)
and S_n = s_n are observed in place of Z_(n) = z_(n) and S_n = s_n. From (2.7) and (2.9),

    (d/dα) log L(α | z_(n+1); s_n) = −z_(n+1) + E(Z_(n+1) | s_n; α)   (4.8)
                                   = −z_(n+1) + 1/α + E(Z_(n) | s_n; α).
As stated above, Bickel, Nair and Wang (1990) show that when a fixed finite number
of magnitude classes are successively sampled, as N → ∞, unbiased estimators similar in
form to (4.2) constitute an asymptotically efficient approximation to MLE. In the present
setting we have

    Σ_{j∈s_n} a_j e^{−a_j w_n(α)} / (1 − e^{−a_j w_n(α)}) = α,   (4.9)

which implicitly defines w_n(α) (Andreatta and Kaufman (1986)). Thus when Z_(n+1) = z_(n+1)
and S_n = s_n are observed, w_n(α) = z_(n+1) is a large N solution to the efficient score function
for a CMLE of a(s_n). Using (4.1) and (4.2) with h(a_j) = a_j, Σ_{j∈s_n} a_j [1 − e^{−a_j z_(n+1)}]^{−1} is
Murthy's unbiased estimator of r_N = Σ_{k=1}^N a_k. Since a_j e^{−a_j z} / (1 − e^{−a_j z}) = a_j / (1 − e^{−a_j z}) − a_j,
setting w_n(α) = z_(n+1) in (4.9) and solving for α gives
α̂(z_(n+1), s_n) = Σ_{j∈s_n} a_j e^{−a_j z_(n+1)} / (1 − e^{−a_j z_(n+1)}), or equivalently

    Σ_{j∈s_n} a_j / (1 − e^{−a_j z_(n+1)}) = α̂(z_(n+1), s_n) + Σ_{j∈s_n} a_j ≡ r̂_N(z_(n+1), s_n).   (4.14)
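Under the reconstruction above, (4.14) is an exact algebraic identity and is easy to check numerically. A minimal sketch (Python, illustrative values): the large N CMLE α̂(z_(n+1), s_n) obtained from (4.9), plus the observed sample total, reproduces Murthy's unbiased estimator of r_N.

    import numpy as np

    a_obs = np.array([7.0, 10.0, 3.0, 9.0])   # illustrative sampled magnitudes
    z_next = 0.35                              # illustrative z_(n+1)

    # alpha_hat from (4.9) with w_n(alpha) set equal to z_(n+1)
    alpha_hat = np.sum(a_obs * np.exp(-a_obs * z_next) / (1.0 - np.exp(-a_obs * z_next)))

    # Murthy's unbiased estimator of r_N, i.e. (4.2) with h(a) = a
    r_hat = np.sum(a_obs / (1.0 - np.exp(-a_obs * z_next)))

    # (4.14): the two differ only by the observed sample total
    print(r_hat, alpha_hat + a_obs.sum())      # identical up to rounding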
5. Numerical Study
To illustrate small sample behavior of the three types of estimators presented here, a
Monte Carlo study of a successive sampling example from Hájek (1981) was done. In this
example, a successive sample of size n = 4 is taken from A_10 = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}.
The range 1 to 10 of a_k values is small by comparison with applications of successive
sampling to oil and gas discovery, in which the largest element of A_N can be as large as 10^3
times the smallest. [The North Sea oil and gas data taken from O'Carroll and Smith (1980)
by Bickel, Nair and Wang (1989) varies from 12 to 3017 million barrels of recoverable oil
equivalent.] The small range of values of A_10 in Hájek's example provides a stringent test
of the performance of successive sampling estimators, as the expectation of Y_1 is only about
3.2 times larger than that of Y_10.

A computational scheme for computing inclusion probabilities based on an integral
representation of π_k(n) [Andreatta and Kaufman (1990)] was used to compute Table 5.1,
which shows inclusion probabilities π_k(n) for sample sizes n = 1, 2, ..., 6 from A_10.
Table 5.1 here
Figure 5.1 is a graphical display of estimators examined in the study.
Figure 5.1 here
Results of a Monte Carlo study of properties of three of the four types of estimators
studied here appear in Table 5.2. Means and standard deviations of unbiased, corrected
unbiased, and maximum likelihood estimators for each of r_N, a(s_n) and N are based on
4,000 successive samples drawn from A_10.
Table 5.2 here
Monte Carlo properties of these estimators for n = 4 in Hájek's example are:

(2) An ordered estimator of r_N that is a solution of (3.7) based on (3.6) and obser-
vation of s̃_n and z_(n) is positively biased, as proven in Proposition 3.3. The bias
is about 26%. An estimator of r_N based on replacing z_(n) with z_(n+1) in (4.2)
is negatively biased by about 5.5-7.5%. For this example, ordered estimation of
a(s̃_n) led to α̂(s̃_n) = 0 in 32 out of 4000 cases using s̃_n and z_(n) and in 89 out of
4000 cases using s_n and z_(n+1).
The behavior of these estimators for moderate sized samples is explored next in a
Monte Carlo study of samples of size n = 10 from a population of size N = 30 with a_k =
the k/(N + 1)st fractile of an exponential distribution having mean one, k = 1, 2, ..., N.
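A sketch of this second design (Python), assuming the fractile construction just described; only the population, the exponential order statistics sampling step, and the unbiased estimate of r_N from (4.2) are illustrated, and the resulting numbers will differ from Table 5.3 since the seed and draws here are arbitrary:

    import numpy as np

    rng = np.random.default_rng(2)
    N, n = 30, 10
    # a_k = k/(N+1)st fractile of a unit-mean exponential distribution
    a = -np.log(1.0 - np.arange(1, N + 1) / (N + 1.0))
    r_N = a.sum()

    estimates = []
    for _ in range(1000):
        z = rng.exponential(size=N) / a            # EOS representation
        order = np.argsort(z)
        s_n = order[:n]                            # labels of the sampled faults
        z_next = z[order[n]]                       # z_(n+1)
        estimates.append(np.sum(a[s_n] / (1.0 - np.exp(-a[s_n] * z_next))))

    est = np.array(estimates)
    print("true r_N:", r_N)
    print("mean and sd of unbiased estimate:", est.mean(), est.std())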
Table 5.3 here
Monte Carlo properties of estimators for this second example are:

(1) Unbiased estimation of r_N outperforms all other estimators studied here, producing
smaller standard deviation and less sampling bias (about .4%). Unbiased estimation
of the successive sampling remainder yielded Monte Carlo averages α̂/N + b_1/N = .4767 +
.5273 = 1.004.

(2) An ordered estimator of α was produced for each of 1000 trials, suggesting that the
probability of no solution to (3.7) [Σ_{j=1}^n b_j^{−1} < Z_(n)] is less than .001 for this example.

(3) While unbiased estimation of N over 1000 trials exhibits small bias (about .7%), the
standard deviation of an estimator of N of the form (4.3) is quite large, about one-half
of the true value of N = 30.
6. Prediction of Test Effort as a Function of Number of Failures
As stated at the outset, given a history of the test process up to the time of the mth
failure, an estimate of the incremental time on test to the occurrence of the next n − m
failures is of considerable practical value to the software manager who wishes to predict
testing effort as a function of number of failures.
Our objective is to provide an estimator for Z_(n) given Z_(m) = z_(m) and S_m = s_m that
is both easy to compute and behaves reasonably for moderate samples. Recall that with
w_n(α) = z_(n+1), (4.8) yields an estimate of a(s_n) based on observation of Z_(n+1) and S_n
that is asymptotically a CMLE, is unbiased, and in addition is coincident with unbiased
estimation of the sum H = Σ_{j∈A_N} h(a_j) of any single valued function h(·) of the a_j's (see
Corollary 4.1).
These facts together with (2.11) suggest a point estimator for Z_(n), n > m + 1, given
observation of Z_(m+1) = z_(m+1) and S_m = s_m, namely the solution ẑ_(n) of

    Σ_{j∈s_m} (1 − e^{−z y_j}) / (1 − e^{−z_(m+1) y_j}) = n   (6.1)

for z ∈ (0, ∞). Specific motivation for the form of this estimator is the identity Σ_{k=1}^N π_k(n) =
n. If we are given only inclusion probabilities π_k(n) and π_k(m), k ∈ s_m, then

    Σ_{k∈S_m} π_k(n) / π_k(m)   (6.2)

has expectation Σ_{k=1}^N π_k(n) = n with respect to S_m. Replacing 1/π_k(m) with its unbiased
point estimator [1 − e^{−z_(m+1) y_k}]^{−1} and π_k(n) with Rosén's approximation 1 − e^{−z(f_n) y_k},
z(f_n) satisfying (2.11), yields (6.1).
A unique solution to (6.1) in (z_(m+1), ∞) exists provided that n is chosen to be less
than N̂ = Σ_{j∈s_m} [1 − exp{−z_(m+1) y_j}]^{−1}, an unbiased estimate of the total number N of
faults in the system at the outset of testing. This point estimator is a function of both
Z_(m+1) and S_m, so prior to observing Z_(m+1) and S_m it is a random variable Ẑ_(n)(Z_(m+1), S_m),
which we abbreviate as Ẑ_(n). Under mild conditions on the large N behavior of the set A_N
of fault magnitudes, if m/N = f_m → g ∈ (0, 1) and n/N = f_n → f ∈ (0, 1) as in the asymptotic
regime (A_2), then |Ẑ_(n) − Z_(n)| = o_p(1); i.e., as N → ∞, the absolute value of the difference
between Ẑ_(n) and Z_(n) converges to zero in probability. In this sense, Ẑ_(n) is a "consistent"
estimator of the rv Z_(n). [A proof will be provided elsewhere.]
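A minimal sketch of the predictor defined by (6.1) (Python). The inputs are the magnitudes y_j of the m faults found so far and the waiting time z_(m+1); both are illustrative values here, not data from the paper. The left-hand side of (6.1) increases in z from m at z = z_(m+1) toward N̂, so a root exists in (z_(m+1), ∞) whenever m < n < N̂:

    import numpy as np

    def predict_z_n(y_m, z_m1, n, tol=1e-10):
        # Solve (6.1): sum_j (1 - exp(-z*y_j)) / (1 - exp(-z_(m+1)*y_j)) = n
        # for z > z_(m+1). Returns None if n is not below the unbiased
        # estimate N_hat of the total number of faults.
        y_m = np.asarray(y_m, dtype=float)
        denom = 1.0 - np.exp(-z_m1 * y_m)
        n_hat = np.sum(1.0 / denom)              # unbiased estimate of N
        if not (len(y_m) < n < n_hat):
            return None
        lhs = lambda z: np.sum((1.0 - np.exp(-z * y_m)) / denom)
        lo, hi = z_m1, 2.0 * z_m1 + 1.0
        while lhs(hi) < n:                       # expand until bracketed
            hi *= 2.0
        while hi - lo > tol * (1.0 + hi):
            mid = 0.5 * (lo + hi)
            if lhs(mid) < n:
                lo = mid
            else:
                hi = mid
        return 0.5 * (lo + hi)

    if __name__ == "__main__":
        y_obs = [9.0, 6.0, 10.0, 4.0]   # illustrative magnitudes of m = 4 faults
        print(predict_z_n(y_obs, z_m1=0.1, n=7))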
The results of a small Monte Carlo study of Ẑ_(n) given a sample of size m < n and
Z_(m) = z_(m) (NOT Z_(m+1) = z_(m+1)) suggest that the prediction scheme represented
by (6.1) produces point estimates of Z_(n) with small to moderate bias and increasing
variability as z_(m) increases.
Statistics describing Ẑ_(n) for the Hájek and exponential examples of Section 5 are
presented in Tables 6.1 and 6.2 respectively. Table 6.1 is based on 4,000 replications
of a successive sample of size m = 4 from A_10 = {1, 2, ..., 10} used to generate point estimates
of Z_(7) via (6.1). Table 6.2 is based on 1,000 replications of a successive sample
of size m = 10 from A_30, composed of 30 fractiles of a mean 1.0 exponential distribution,
used to generate point estimates of Z_(20) via (6.1).

Even though both sample and population size are substantially larger for the exponential
example than for the Hájek example (m = 10, N = 30 and m = 4, N = 10 respectively),
Ẑ_(n) behaves similarly in both cases:
(1) Ẑ_(n) is positively skewed with extreme right tail outliers [Figures 6.1b and 6.2b]. By
comparison, Z_(n) is substantially less positively skewed and, for the case m = 10, N = 30,
the histogram of simulated values of Z_(n) is reasonably approximated by a Normal
distribution [Figures 6.1a and 6.2a]. (As N → ∞ in the regime (A_2), an appropriately
scaled version of Z_(n) is N(0, 1); see Andreatta and Kaufman (1990).)

(2) Variability of Ẑ_(n) increases rapidly as Z_(n) increases [scatterplots in Figures 6.3a
and 6.3b].

(3) The expectation of the predictor Ẑ_(n) given Z_(m) = z_(m) increases faster than linearly
as a function of z_(m) and slower than exponentially [Figures 6.4a and 6.4b].

(4) The mean of Ẑ_(7) in the Hájek example is 1.6% greater than the mean of Z_(7) [Table
6.1]. In the exponential example, right tail outliers boost the positive bias of Ẑ_(20) to
10.4% [Table 6.2a]. If the Monte Carlo sample is trimmed to exclude values of Ẑ_(20)
greater than four standard deviations above the maximum value of Z_(20) achieved in
the sample, the bias of the trimmed sample is negligible: when the 16 of 868 such
values are deleted the bias is −.03%.
7. Concluding Remarks

Successive sampling schemes and EOS models of software failures are intimately re-
lated to one another. This relation allows methods of inference and prediction developed
for successive sampling to be applied to software failure data, as demonstrated in this
paper.
Further linkages between successive sampling schemes and Bayes/empirical Bayes
treatment of software reliability models can be established by invoking a super-population
process that describes how fault magnitudes are generated.
TABLE 5.1
INCLUSION PROBABILITIES π_k(n)

TABLE 5.4
MEAN SQUARE PREDICTION ERROR OF UNBIASED ESTIMATOR OF a(s_n)
(HAJEK'S EXAMPLE, N = 10, n = 4, 4000 TRIALS)

TABLE 6.2b
(1000 TRIALS; OUTLIERS > 200 OMITTED)
FIGURE 6.1a: ACTUAL Z, HAJEK EXAMPLE
FIGURE 6.1b: ESTIMATED Z, HAJEK EXAMPLE
FIGURE 6.2a: ACTUAL Z, EXPONENTIAL CASE
FIGURE 6.2b: ESTIMATED Z, EXPONENTIAL CASE
FIGURE 6.3a: ESTIMATED VS ACTUAL Z, HAJEK CASE
FIGURE 6.3b: ESTIMATED VS ACTUAL Z, EXPONENTIAL CASE
FIGURE 6.4a: Z(7) VS Z(4), HAJEK EXAMPLE
FIGURE 6.4b: Z(20) VS Z(10), EXPONENTIAL EXAMPLE
Bibliography
Andreatta, G., and Kaufman, G.M. (1986), "Estimation of Finite Population Properties When Sampling is Without Replacement and Proportional to Magnitude," Journal of the American Statistical Association, Vol. 81, pp. 657-666.

Basili, V.R., and Perricone (1982), "Software Errors and Complexity: An Empirical Investigation," Computer Science, University of Maryland, UOM-1195.

Basili, V.R., and Patniak, D. (1986), "A Study on Fault Prediction and Reliability Assessment in the SEL Environment," Computer Science, University of Maryland Technical Report, TR-1699.

Bickel, P.J., Nair, V.N., and Wang, P.C. (1989), "Nonparametric Inference Under Biased Sampling from a Finite Population," Technical Report, University of California at Berkeley, August 1989.

Chapman, D.G. (1951), "Some Properties of the Hypergeometric Distribution With Application to Zoological Sample Censuses," University of California Publications in Statistics, 1, 1313-1316.

Goel, A. (1985), "Software Reliability Models: Assumptions, Limitations, and Applicability," IEEE Transactions on Software Engineering, Vol. SE-11, No. 12, pp. 375-386.

Gordon, L. (1983), "Successive Sampling in Large Finite Populations," Annals of Statistics, Vol. 11, No. 2, pp. 702-706.

Gordon, L. (1989), "Estimation for Large Successive Samples of Unknown Inclusion Probabilities," Advances in Applied Mathematics (to appear).

Hájek, J. (1981), Sampling from a Finite Population (New York: Marcel Dekker, Inc.), 247 pp.

Horvitz, D.G., and Thompson, D.J. (1952), "A Generalization of Sampling Without Replacement From a Finite Universe," Journal of the American Statistical Association, 47, 663-685.

Kaufman, G.M. (1988), "Conditional Maximum Likelihood Estimation for Successively Sampled Finite Populations," M.I.T. Sloan School of Management Working Paper.

Langberg, N., and Singpurwalla, N. (1985), "A Unification of Some Software Reliability Models," SIAM Journal on Scientific and Statistical Computing, 6(3), pp. 781-790.

Littlewood, B. (1981), "Stochastic Reliability Growth: A Model for Fault-Removal in Computer Programs and Hardware Designs," IEEE Transactions on Reliability, Vol. R-30, pp. 313-320.

Miller, D.R. (1986), "Exponential Order Statistic Models of Software Reliability Growth," IEEE Transactions on Software Engineering, Vol. SE-12, No. 1, pp. 12-24.

Murthy, M.N. (1957), "Ordered and Unordered Estimators in Sampling Without Replacement," Sankhyā, 18, 378-390.

Rosén, B. (1972), "Asymptotic Theory for Successive Sampling with Varying Probabilities Without Replacement, I and II," Annals of Mathematical Statistics, Vol. 43, pp. 373-397 and 748-776.

Ross, S.M. (1985), "Statistical Estimation of Software Reliability," IEEE Transactions on Software Engineering, Vol. SE-11, pp. 479-483.

Scholz, F.W. (1986), "Software Reliability Modeling and Analysis," IEEE Transactions on Software Engineering, Vol. SE-12, No. 1, January 1986.