Chapter I
Software Quality Modeling with
Limited Apriori Defect Data
Naeem Seliya
University of Michigan, USA
Taghi M. Khoshgoftaar
Florida Atlantic University, USA
Abstract
In machine learning, limited data for supervised learning is a challenging problem with many practical applications. We address a similar problem in the context of software quality modeling. Knowledge-based software engineering includes the use of quantitative software quality estimation models. Such
models are trained using apriori software quality knowledge in the form of software metrics and defect
data of previously developed software projects. However, various practical issues limit the availability
of defect data for all modules in the training data. We present two solutions to the problem of software
quality modeling when a limited number of training modules have known defect data. The proposed
solutions are a semisupervised clustering with expert input scheme and a semisupervised classification
approach with the expectation-maximization algorithm. Software measurement datasets obtained from
multiple NASA software projects are used in our empirical investigation. The software quality knowledge
learnt during the semisupervised learning processes provided good generalization performances for
multiple test datasets. In addition, both solutions provided better predictions compared to a supervised
learner trained on the initial labeled dataset.
Introduction
Data mining and machine learning have numerous
practical applications across several domains, especially for classification and prediction problems.
This chapter involves a data mining and machine
learning problem in the context of software quality modeling and estimation. Software measurements and software fault (defect) data have been
used in the development of models that predict
software quality, for example, a software quality
classification model (Imam, Benlarbi, Goel, &
Rai, 2001; Khoshgoftaar & Seliya, 2004; Ohlsson
& Runeson, 2002) predicts the fault-proneness
membership of program modules. A software
quality model allows the software development
team to track and detect potential software defects
relatively early on during development.
Software quality estimation models exploit the
software engineering hypothesis that software
measurements encapsulate the underlying quality
of the software system. This assumption has been
verified in numerous studies (Fenton & Pfleeger,
1997). A software quality model is typically built
or trained using software measurement and defect
data from a similar project or system release previously developed. The model is then applied to the
currently under-development system to estimate
the quality or presence of defects in its program
modules. Subsequently, the limited resources
allocated for software quality inspection and
improvement can be targeted toward low-quality modules, achieving cost-effective resource
utilization (Khoshgoftaar & Seliya, 2003).
An important assumption made during typical software quality classification modeling is
that fault-proneness labels are available for all
program modules (instances) of training data, that
is, supervised learning is facilitated because all
instances in the training data have been assigned a
quality-based label such as fault-prone (fp) or not
fault-prone (nfp). In software engineering practice,
however, there are various practical scenarios
that can limit availability of quality-based labels
or defect data for all the modules in the training
data, for example:
•	The cost of running data collection tools may limit the subsystems for which software quality data is collected.
•	Only some project components in a distributed software system may collect software quality data, while others may not be equipped for collecting similar data.
•	The software defect data collected for some program modules may be error-prone due to data collection and recording problems.
•	In a multiple-release software project, a given release may collect software quality data for only a portion of the modules, due to limited funds or other practical issues.
In the training software measurement dataset,
the fault-proneness labels may only be known for
some of the modules, that is, labeled instances,
while for the remaining modules, that is, unlabeled
instances, only software attributes are available.
Under such a situation, following the typical supervised learning approach to software quality
modeling may be inappropriate. This is because
a model trained using the small portion of labeled
modules may not yield good software quality
analysis, that is, the few labeled modules are not
sufficient to adequately represent quality trends
of the given system. Toward this problem, perhaps
the solution lies in extracting the knowledge (in
addition to the labeled instances) stored in the
software metrics of the unlabeled modules.
The above described problem represents the
labeled-unlabeled learning problem in data mining
and machine learning (Seeger, 2001). We present
two solutions to the problem of software quality
modeling with limited prior fault-proneness defect data. The first solution is a semisupervised
clustering with expert input scheme based on
the k-means algorithm (Seliya, Khoshgoftaar,
& Zhong, 2005), while the other solution is a
semisupervised classification approach based on
the expectation maximization (EM) algorithm
(Seliya, Khoshgoftaar, & Zhong, 2004).
The semisupervised clustering with expert
input approach is based on implementing constraint-based clustering, in which the constraint
maintains a strict membership of modules to
clusters that are already labeled as nfp or fp. At the
end of a constraint-based clustering run, a domain
expert is allowed to label the unlabeled clusters,
and the semisupervised clustering process is iterated. The EM-based semisupervised classification
approach iteratively augments unlabeled program
modules with their estimated class labels into the
labeled dataset. The class labels of the unlabeled
instances are treated as missing data, which are
estimated by the EM algorithm. The unlabeled
modules are added to the labeled dataset based
on the confidence in their predictions.
A case study of software measurement and
defect data obtained from multiple NASA software
projects is used to evaluate the two solutions. To
simulate the labeled-unlabeled problem, a sample
of program modules is randomly selected from
the JM1 software measurement dataset and is
used as the initial labeled dataset. The remaining
JM1 program modules are treated (without their
class labels) as the initial unlabeled dataset. At
the end of the respective semisupervised learning approaches, the software quality modeling
knowledge gained is evaluated by using three
independent software measurement datasets.
A comparison between the two approaches for
software quality modeling with limited apriori
defect data indicated that the semisupervised
clustering with expert input approach yielded
better performance than the EM-based semisupervised classification approach. However, the
former is associated with considerable expert
input compared to the latter. In addition, both
semisupervised learning schemes provided an
improvement in generalization accuracy for independent test datasets.
The rest of this chapter is organized as follows:
some relevant works are briefly discussed in the
next section; the third and fourth sections respectively present the semisupervised clustering with
expert input and the EM-based semisupervised
classification approaches; the empirical case study,
including software systems description, modeling
methodology, and results are presented in the fifth
section. The chapter ends with a conclusion which
includes some suggestions for future work.
Related Work
In the literature, various methods have been
investigated to model the knowledge stored in
software measurements for predicting quality of
program modules. For example, Schneidewind
(2001) utilizes logistic regression in combination with Boolean discriminant functions for
predicting fp program modules. Guo, Cukic,
and Singh (2003) predict fp program modules
using Dempster-Shafer networks. Khoshgoftaar,
Liu and Seliya (2003) have investigated genetic
programming and decision trees (Khoshgoftaar,
Yuan, & Allen, 2000), among other techniques.
Some other works that have focused on software
quality estimation include Imam et al. (2001),
Suarez and Lutsko (1999) and Pizzi, Summers,
and Pedrycz (2002).
While almost all existing works on software
quality estimation have focused on using a supervised learning approach for building software
quality models, very limited attention has been
given to the problem of software quality modeling and analysis when there is limited defect
data from previous software project development
experiences. In a machine learning classification problem, when both labeled and unlabeled
data are used during the learning process, the approach is
termed semisupervised learning (Goldman & Zhou,
2000; Seeger, 2001). In such a learning scheme
the labeled dataset is iteratively augmented with
instances (with predicted class labels) from
the unlabeled dataset based on some selection
measure. Semisupervised classification schemes
have been investigated across various domains,
including content-based image retrieval (Dong &
Bhanu, 2003), human motion and gesture pattern
recognition (Wu & Huang, 2000), document categorization (Ghahramani & Jordan, 1994; Nigam
& Ghani, 2000), and software engineering (Seliya
et al., 2004). Some of the recently investigated
techniques for semisupervised classification
include the EM algorithm (Nigam, McCallum,
Thrun, & Mitchell, 1998), cotraining (Goldman
& Zhou, 2000; Mitchell, 1999; Nigam & Ghani,
2000), and support vector machine (Demirez &
Bennett, 2000; Fung & Mangasarian, 2001).
While many works in semisupervised learning
are geared toward the classification problem, a
few studies investigate semisupervised clustering for grouping of a given set of text documents
(Zeng, Wang, Chen, Lu, & Ma, 2003; Zhong,
2006). A semisupervised clustering approach has
some benefits over semisupervised classification.
During the semisupervised clustering process
additional classes of data can be obtained (if
desired) while the semisupervised classification
approach requires the prior knowledge of all possible classes of the data. The unlabeled data may
form new classes other than the pre-defined classes
for the given data. Pedrycz and Waletzky (1997)
investigate semisupervised clustering using fuzzy
logic-based clustering for analyzing software
reusability. In contrast, this study investigates
semisupervised clustering for software quality
estimation.
The labeled instances in a semisupervised
clustering scheme have been used for initial seeding of the clusters (Basu, Banerjee, & Mooney,
2002), incorporating constraints in the clustering
process (Wagstaff & Cardie, 2000), or providing
feedback subsequent to regular clustering (Zhong,
2006). The seeded approach uses the labeled data
to initialize cluster centroids prior to clustering.
The constraint-based approach keeps a fixed grouping of the labeled data during the clustering
process. The feedback-based approach uses the
labeled data to adjust the clusters after executing
a regular clustering process.
Semisupervised Clustering with Expert Input
The basic purpose of a semisupervised approach
during clustering is to aid the clustering algorithm
in making better partitions of instances in the given
dataset. The semisupervised clustering approach
presented is a constraint-based scheme that uses
labeled instances for initial seeding (centroids)
of some clusters among the maximum allowable
clusters when using k-means as the clustering
algorithm. In addition, during the semisupervised
iterative process a domain (software engineering)
expert is allowed to label additional clusters as
either nfp or fp based on domain knowledge and
some descriptive statistics of the clusters.
The data in a semisupervised clustering
scheme consists of a small set of labeled instances
and a large set of unlabeled instances. Let D be
a dataset of labeled (nfp or fp) and unlabeled (ul)
program modules, containing the subsets L of
labeled modules and U of unlabeled modules.
In addition, let the dataset L consist of subsets
L_nfp of nfp modules and L_fp of fp modules.
The procedure used in our constraint-based
semisupervised clustering approach with k-means
is summarized next:
1.	Obtain initial numbers of nfp and fp clusters:
	•	An optimal number of clusters for the nfp and fp instances in the initial labeled dataset is obtained using the Cg criterion proposed by Krzanowski and Lai (1988).
	•	Given L_nfp, execute the Cg criterion algorithm to obtain the optimal number of nfp clusters among {1, 2, …, Cin_nfp} clusters, where Cin_nfp is the user-defined maximum number of clusters for L_nfp. Let p denote the obtained number of nfp clusters.
	•	Given L_fp, execute the Cg criterion algorithm to obtain the optimal number of fp clusters among {1, 2, …, Cin_fp} clusters, where Cin_fp is the user-defined maximum number of clusters for L_fp. Let q denote the obtained number of fp clusters.
2.	Initialize centroids of clusters: Given the maximum number of clusters, Cmax, allowed during the semisupervised clustering process with k-means,
	•	The centroids of p clusters out of Cmax are initialized to the centroids of the clusters labeled as nfp.
	•	The centroids of q clusters out of the remaining {Cmax - p} are initialized to the centroids of the clusters labeled as fp.
	•	The centroids of the remaining r (i.e., Cmax - p - q) clusters are initialized to randomly selected instances from U. We randomly select 5 unique sets of r instances each for initializing the centroids of the unlabeled clusters. Thus, the centroids of the {p + q + r} clusters can be initialized using 5 different combinations.
	•	The sets of nfp, fp, and unlabeled clusters are thus C_nfp = {c_nfp1, c_nfp2, …, c_nfpp}, C_fp = {c_fp1, c_fp2, …, c_fpq}, and C_ul = {c_ul1, c_ul2, …, c_ulr}, respectively.
3.	Execute constraint-based clustering:
	•	The k-means clustering algorithm with the Euclidean distance function is run on D using the initialized centroids for the Cmax clusters, under the constraint that the existing membership of a program module to a labeled cluster remains unchanged. Thus, at a given iteration during the semisupervised clustering process, if a module already belongs (by initial membership or by expert-based assignment from previous iterations) to an nfp (or fp) cluster, then it cannot move to another cluster during the clustering process of that iteration.
	•	The constraint-based clustering process with k-means is repeated for each of the 5 centroid initializations, and the respective SSE (sum-of-squares-error) values are computed.
	•	The clustering result associated with the median SSE value is selected for continuation to the next step. This is done to minimize the likelihood of working with a lucky/unlucky initialization of cluster centroids.
4.	Expert-based labeling of clusters:
	•	The software engineering expert is presented with descriptive statistics of the r unlabeled clusters and is asked to label them as either nfp or fp. The specific statistics presented for the attributes of instances in each cluster depend on the expert's request, and include data such as minimum, maximum, mean, standard deviation, and so forth.
	•	The expert labels only those clusters for which he/she is very confident in the label estimation.
	•	If the expert labels at least one of the r (unlabeled) clusters, then go to Step 2 and repeat; otherwise continue.
5.	Stop semisupervised clustering: The iterative process is stopped when the sets C_nfp, C_fp, and C_ul remain unchanged. The modules in the nfp (fp) clusters are labeled and recorded as nfp (fp), while those in the ul clusters are not assigned any label. In addition, the centroids of the {p + q} labeled clusters are also recorded.
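To make the constrained assignment step concrete, the following is a minimal sketch of one constraint-based k-means pass (Step 3 above), assuming NumPy arrays; the function and variable names (e.g., run_constrained_kmeans, fixed_assignment) are illustrative and not taken from the authors' implementation. The outer loop of the procedure (the five centroid initializations, the median-SSE selection, and the expert labeling of clusters) would wrap around such a routine.

```python
import numpy as np

def run_constrained_kmeans(X, centroids, fixed_assignment, n_iter=100):
    """X: (n, d) module metrics; centroids: (k, d) initial centroids;
    fixed_assignment: (n,) index of the labeled cluster a module is
    constrained to, or -1 if the module is free to move."""
    centroids = centroids.copy()
    for _ in range(n_iter):
        # Euclidean distance of every module to every centroid
        dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        assignment = dist.argmin(axis=1)
        # Constraint: modules already in a labeled (nfp/fp) cluster stay there
        locked = fixed_assignment >= 0
        assignment[locked] = fixed_assignment[locked]
        # Recompute centroids; keep the previous centroid if a cluster empties
        new_centroids = np.array([
            X[assignment == j].mean(axis=0) if np.any(assignment == j) else centroids[j]
            for j in range(len(centroids))
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    sse = ((X - centroids[assignment]) ** 2).sum()  # sum-of-squares error
    return assignment, centroids, sse
```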
Semisupervised Classification with EM Algorithm
The expectation maximization (EM) algorithm is a
general iterative method for maximum likelihood
estimation in data mining problems with incomplete data. The EM algorithm takes an iterative
approach consisting of replacing missing data with
estimated values, estimating model parameters,
and re-estimating the missing data values. An
iteration of EM consists of an E or Expectation
step and an M or Maximization step, with each
having a direct statistical interpretation.
We limit our EM algorithm discussion to a
brief overview, and refer the reader to Little and
Rubin (2002) and Seliya et al. (2004) for a more
extensive coverage. In our study, the class value
of the unlabeled software modules is treated as
missing data, and the EM algorithm is used to
estimate the missing values. Many multivariate
statistical analyses, including multiple linear
regression, principal component analysis, and
canonical correlation analysis, are based on the
initial study of the data with respect to the sample
mean and covariance matrix of the variables.
The EM algorithm implemented for our study
on semisupervised software quality estimation is
based on maximum likelihood estimation of missing data, means, and covariances for multivariate
normal samples (Little & Rubin, 2002).
The E and M steps continue iteratively until
a stopping criterion is reached. Commonly used
stopping criteria include specifying a maximum
number of iterations or monitoring when the
change in the values estimated for the missing
data reaches a plateau for a specified epsilon value
(Little & Rubin, 2002). We use the latter criterion and
allow the EM algorithm to converge without a
maximum number of iterations, that is, iteration
is stopped if the maximum change among the
means or covariances between two consecutive
iterations is less than 0.0001. The initial values
of the parameter set are obtained by estimating
means and variances from all available values
of each variable, and then estimating covariances from all available pairwise values using
the computed means.
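The following is a small sketch of the initialization and stopping rule just described, assuming the incomplete data matrix is a NumPy array that uses NaN for missing entries (here, the unknown class labels); the function names are illustrative.

```python
import numpy as np

def initial_estimates(X):
    """Means/variances from all available values of each variable, and
    covariances from all available pairwise values (pairwise deletion)."""
    mu = np.nanmean(X, axis=0)
    d = X.shape[1]
    cov = np.empty((d, d))
    for i in range(d):
        for j in range(d):
            both = ~np.isnan(X[:, i]) & ~np.isnan(X[:, j])
            cov[i, j] = np.mean((X[both, i] - mu[i]) * (X[both, j] - mu[j]))
    return mu, cov

def converged(mu_old, cov_old, mu_new, cov_new, eps=1e-4):
    """Stop when the largest change in any mean or covariance is below eps."""
    return max(np.abs(mu_new - mu_old).max(),
               np.abs(cov_new - cov_old).max()) < eps
```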
Given the L (labeled) and U (unlabeled) datasets, the missing class labels are estimated by creating a new dataset combining L and U and then applying the EM algorithm to estimate the missing data, that is,
the dependent variable of U. The following procedure is used in our EM-based semisupervised
classification approach:
1.	Estimate the dependent variable (class labels) for the labeled dataset. This is done by treating L also as U, that is, the unlabeled dataset consists of the labeled instances but without their fault-proneness labels. The EM algorithm is then used to estimate these missing class labels. In our study the fp and nfp classes are labeled 1 and 0, respectively. Consequently, the estimated missing values will approximately fall within the range 0 to 1.
2.	For a given significance level α, obtain confidence intervals for the predicted dependent variable in Step 1. The assumption is that the two confidence interval boundaries delineate the nfp and fp modules. Record the boundary closer to 0 as ci_nfp and the boundary closer to 1 as ci_fp.
3.	For the given L and U datasets, estimate the dependent variable for U using EM.
4.	An instance in U is identified as nfp if its predicted dependent variable falls below the boundary ci_nfp. Similarly, an instance in U is identified as fp if its predicted dependent variable falls above the boundary ci_fp.
5.	The newly labeled instances of U are used to augment L, and the semisupervised classification procedure is iterated from Step 1. The iteration stopping criterion used in our study is that if the number of instances selected from U is less than a specific threshold (that is, 1% of the initial L dataset), then the iteration stops.
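The following skeleton sketches the iterative procedure above, assuming NumPy arrays and an em_estimate(L_X, L_y, query_X) callable that runs the multivariate-normal EM routine and returns estimates (roughly in the range 0 to 1) of the missing class value for the query instances. The per-class confidence intervals are one plausible reading of Step 2; the chapter does not spell out the exact interval construction, so this is an assumption rather than the authors' implementation.

```python
import numpy as np
from statistics import NormalDist

def ci_boundary(values, alpha, upper):
    """Normal-theory confidence bound on the mean of `values`."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    half = z * values.std(ddof=1) / np.sqrt(len(values))
    return values.mean() + half if upper else values.mean() - half

def semisupervised_em(L_X, L_y, U_X, em_estimate, alpha=0.05, min_frac=0.01):
    min_added = max(1, int(min_frac * len(L_X)))  # stop when < 1% of initial L is added
    while len(U_X) > 0:
        # Step 1: re-estimate the (known) labels of L by treating L as unlabeled
        fitted = em_estimate(L_X, L_y, L_X)
        # Step 2: boundaries delineating nfp (near 0) and fp (near 1) predictions
        ci_nfp = ci_boundary(fitted[L_y == 0], alpha, upper=True)
        ci_fp = ci_boundary(fitted[L_y == 1], alpha, upper=False)
        # Step 3: estimate the dependent variable for U
        pred = em_estimate(L_X, L_y, U_X)
        # Step 4: keep only confidently predicted instances
        nfp_idx = np.where(pred < ci_nfp)[0]
        fp_idx = np.where(pred > ci_fp)[0]
        picked = np.concatenate([nfp_idx, fp_idx])
        if len(picked) < min_added:
            break
        # Step 5: augment L with the newly labeled instances and iterate
        new_y = np.concatenate([np.zeros(len(nfp_idx)), np.ones(len(fp_idx))])
        L_X = np.vstack([L_X, U_X[picked]])
        L_y = np.concatenate([L_y, new_y])
        U_X = np.delete(U_X, picked, axis=0)
    return L_X, L_y, U_X
```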
Empirical Case Study
Software System Descriptions
The software measurement and quality data used
in our study to investigate the proposed semisupervised learning approaches are from a large
NASA software project, JM1. Written in C, JM1
is a real-time ground system that uses simulations
to generate certain predictions for missions. The
data was made available through the Metrics Data
Program (MDP) at NASA, and included software
measurement data and associated error (fault or
defect) data collected at the function level.
A program module for the system consisted
of a function or method. The fault data collected
for the system represents, for a given module,
faults detected during software development.
The original JM1 dataset consisted of 10,883
software modules, of which 2,105 modules had
software defects (ranging from 1 to 26) while the
remaining 8,778 modules were defect-free, that is,
had no software faults. In our study, a program
module with no faults was considered nfp and
fp otherwise.
The JM1 dataset contained some inconsistent modules (those with identical software
measurements but with different class labels) and
those with missing values. Upon removing such
modules, the dataset was reduced from 10,883 to
8,850 modules. We denote this reduced dataset
as JM1-8850, which consisted of 1,687 modules
with one or more defects and 7,163 modules with
no defects.

Table 1. Software measurements

    Line Count Metrics: Total Lines of Code, Executable LOC, Comments LOC, Blank LOC, Code And Comments LOC
    Halstead Metrics: Total Operators, Total Operands, Unique Operators, Unique Operands
    McCabe Metrics: Cyclomatic Complexity, Essential Complexity, Design Complexity
    Branch Count Metrics: Branch Count
Each program module in the JM1 dataset
was characterized by 21 software measurements
(Fenton et al., 1997): the 13 metrics as shown in
Table 1 and 8 derived Halstead metrics (Halstead
length, Halstead volume, Halstead level, Halstead
difficulty, Halstead content, Halstead effort, Halstead error estimate, and Halstead program time).
We used only the 13 basic software metrics in
our analysis. The eight derived Halstead metrics
were not used. The metrics for the JM1 (and other) datasets were primarily governed by their availability, internal workings of the projects, and the
data collection tools used. The type and numbers
of metrics made available were determined by
the NASA Metrics Data Program. Other metrics,
including software process measurements, were
not available. The use of these specific software
metrics is not an endorsement of their effectiveness, and
a different project may consider a different set of
software measurements for analysis (Fenton et
al., 1997; Imam et al., 2001).
In order to gauge the performance of the
semisupervised clustering results, we use software
measurement data of three other NASA projects,
KC1, KC2, and KC3, as test datasets. These software measurement datasets were also obtained
through the NASA Metrics Data Program. The
definitions of what constituted a fp and nfp module
for these projects are the same as those of the JM1
system. A program module of these projects also
consisted of a function, subroutine, or method.
These three projects were characterized by the
same software product metrics used for the JM1
project, and were built in a similar software development organization. The software systems of
the test datasets are summarized next:
•	The KC1 project is a single CSCI within a large ground system and consists of 43 KLOC (thousand lines of code) of C++ code. A given CSCI comprises logical groups of computer software components (CSCs). The dataset contains 2,107 modules, of which 325 have one or more faults and 1,782 have zero faults. The maximum number of faults in a module is 7.
•	The KC2 project, written in C++, is the science data processing unit of a storage management system used for receiving and processing ground data for missions. The dataset includes only those modules that were developed by NASA software developers, and not commercial-off-the-shelf (COTS) software. The dataset contains 520 modules, of which 106 have one or more faults and 414 have zero faults. The maximum number of faults in a software module is 13.
•	The KC3 project, written in 18 KLOC of Java, is a software application that collects, processes, and delivers satellite meta-data. The dataset contains 458 modules, of which 43 have one or more faults and 415 have zero faults. The maximum number of faults in a module is 6.
Empirical Setting and Modeling
The initial L dataset was obtained by randomly selecting LP number of modules from JM1-8850, while
the remaining UP number of modules were treated
(without their fault-proneness labels) as the initial U
dataset. The sampling was performed to maintain
the approximate proportion of nfp:fp = 80:20 of the
instances in JM1-8850. We considered different
sampling sizes, that is, LP = {100, 250, 500, 1000,
1500, 2000, 3000}. For a given LP value, three
samples were obtained without replacement from
the JM1-8850 dataset. In the case of LP = {100, 250,
500}, five samples were obtained to account for their
relatively small sizes. Due to space considerations,
we generally only present results for LP = {500,
1000}; however, additional details are provided in
(Seliya et al., 2004; Seliya et al., 2005).
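As an illustration, the following is a minimal sketch of drawing one initial labeled sample while keeping the approximate nfp:fp = 80:20 proportion of JM1-8850; it assumes NumPy arrays and illustrative names, not the authors' actual sampling code.

```python
import numpy as np

def draw_labeled_sample(y, lp, nfp_frac=0.8, seed=0):
    """Return indices of the initial labeled set L (size lp, roughly 80:20 nfp:fp)
    and of the unlabeled set U (the rest, with class labels withheld)."""
    rng = np.random.default_rng(seed)
    nfp_idx = np.where(y == 0)[0]
    fp_idx = np.where(y == 1)[0]
    pick_nfp = rng.choice(nfp_idx, size=int(round(nfp_frac * lp)), replace=False)
    pick_fp = rng.choice(fp_idx, size=lp - len(pick_nfp), replace=False)
    labeled = np.concatenate([pick_nfp, pick_fp])
    unlabeled = np.setdiff1d(np.arange(len(y)), labeled)
    return labeled, unlabeled
```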
When classifying program modules as fp or
nfp, a Type I error occurs when a nfp module is
misclassified as fp, while a Type II error occurs
when a fp module is misclassified as nfp. It is
known that the two error rates are inversely proportional (Khoshgoftaar et al., 2003; Khoshgoftaar
et al., 2000).
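For reference, a small helper of the following form computes the Type I, Type II, and Overall misclassification error rates reported in the result tables, assuming nfp is coded as 0 and fp as 1; the function name is illustrative.

```python
import numpy as np

def error_rates(y_true, y_pred):
    """Type I: nfp (0) misclassified as fp (1); Type II: fp misclassified as nfp."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    type_i = np.mean(y_pred[y_true == 0] == 1)
    type_ii = np.mean(y_pred[y_true == 1] == 0)
    overall = np.mean(y_pred != y_true)
    return type_i, type_ii, overall
```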
Semisupervised Clustering Modeling
The initial numbers of the nfp and fp clusters,
that is, p and q, were obtained by setting both
Cin_nfp and Cin_fp to 20. The maximum number
of clusters allowed during our semisupervised
clustering with k-means was set to two values:
Cmax = {30, 40}. These values were selected based
on input from the domain expert and reflect a
similar empirical setting used in our previous work
(Zhong, Khoshgoftaar, & Seliya, 2004). Due to
similarity of results for the two Cmax values, only
results for Cmax = 40 are presented.
At a given iteration during the semisupervised
clustering process, the following descriptive statistics were computed at the request of the software
engineering expert: minimum, maximum, mean,
median, standard deviation, and the 75, 80, 85,
90, and 95 percentiles. These values were computed for all 13 software attributes of modules
in a given cluster. The expert was also presented
with the following statistics for JM1-8850 and the U
dataset at a given iteration: minimum, maximum,
mean, median, standard deviation, and the 5, 10,
15, 20, 25, 30, 35, 40, 45, 55, 60, 70, 75, 80, 85, 90
and 95 percentiles. The extent to which the above
descriptive statistics were used was at the discretion
of the expert during his labeling task.
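A cluster profile of this kind can be produced with a few lines of code; the sketch below assumes a pandas DataFrame holding the 13 metrics and a cluster assignment array from the clustering step, and the names are illustrative rather than the authors' tooling.

```python
import pandas as pd

def cluster_profile(metrics: pd.DataFrame, assignment, cluster_id):
    """Descriptive statistics of one cluster over all 13 software attributes."""
    members = metrics[assignment == cluster_id]
    stats = members.agg(['min', 'max', 'mean', 'median', 'std'])
    pct = members.quantile([0.75, 0.80, 0.85, 0.90, 0.95])
    pct.index = [f'{round(q * 100)}th percentile' for q in pct.index]
    return pd.concat([stats, pct])
```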
Semisupervised Classification Modeling
The significance level used to select instances
from the U dataset to augment the L dataset is
set to α = 0.05. Other significance levels of 0.01
and 0.10 were also considered; however, their
results are not presented as the software quality
estimation performances were relatively similar
for the different α values. The iterative semisupervised classification process is continued until
the number of instances selected from U is less than
1% of the initial unlabeled dataset.
Table 3. Data performances with unsupervised clustering

    Dataset    Type I    Type II    Overall
    KC1        0.0617    0.6985     0.1599
    KC2        0.0918    0.4151     0.1577
    KC3        0.1229    0.5116     0.1594
Semisupervised Clustering Results
The predicted class labels of the labeled program
modules obtained at the end of each semisupervised clustering run are compared with their
actual class labels. The average classification
performance across the different samples for each
LP and Cmax = 40 is presented in Table 2. The table
shows the average Type I, Type II, and Overall
misclassification error rates for the different LP
values. It was observed that for the given Cmax
value, the Type II error rate decreases with an
increase in the LP value, indicating that with a
larger initial labeled dataset, the semisupervised
clustering with expert input scheme detects more
fp modules.
In a recent study (Zhong et al., 2004), we investigated unsupervised clustering techniques on the
JM1-8850 dataset. In that study, the k-means and
Neural-Gas (Martinez, Berkovich, & Schulten,
1993) clustering algorithms were used at Cmax =
30 clusters. Similar to this study, the expert was
given descriptive statistics for each cluster and
was asked to label them as either nfp or fp. In
(Zhong et al., 2004), the Neural-Gas clustering
technique yielded better classification results than
the k-means algorithm.
For the program modules that are labeled
after the respective semisupervised clustering
runs, the corresponding module classification
performances by the Neural-Gas unsupervised
clustering technique are presented in Table 2. The
semisupervised clustering scheme yields better
false-negative error rates (Type II) than the unsupervised clustering method. The false-negative
error rates of both techniques tend to decrease
with an increase in LP. The false-positive error
rates (Type I) of both techniques tend to remain
relatively stable across the different LP values.
Table 2. Average classification performance of labeled modules with semisupervised clustering.

    Sample    Semisupervised                   Unsupervised
    Size      Type I    Type II    Overall     Type I    Type II    Overall
    100       0.1491    0.4599     0.2058      0.1748    0.5758     0.2479
    250       0.1450    0.4313     0.1989      0.1962    0.5677     0.2661
    500       0.1408    0.4123     0.1913      0.1931    0.5281     0.2554
    1000      0.1063    0.4264     0.1630      0.1778    0.5464     0.2431
    1500      0.1219    0.4073     0.1759      0.1994    0.5169     0.2595
    2000      0.1137    0.3809     0.1641      0.1883    0.5172     0.2503
    2500      0.1253    0.3777     0.1725      0.1896    0.4804     0.2440
    3000      0.1361    0.3099     0.1687      0.1994    0.4688     0.2499

A z-test (Seber, 1984) was performed to compare the classification performances (populations) of semisupervised clustering and unsupervised clustering. The Overall misclassifications obtained by both techniques are used as the response
variable in the statistical comparison at a 5%
significance level. The proposed semisupervised
clustering approach yielded significantly better
Overall misclassifications than the unsupervised
clustering approach for LP values of 500 and
greater.
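For illustration, a two-proportion z-test of the kind referenced above can be computed as follows, assuming the comparison is between the overall misclassification proportions of the two approaches; the exact test variant used by the authors may differ.

```python
import math

def two_proportion_z_test(err1, n1, err2, n2):
    """err1, err2: misclassified counts; n1, n2: totals. Returns (z, two-sided p)."""
    p1, p2 = err1 / n1, err2 / n2
    pooled = (err1 + err2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value
    return z, p_value
```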
The KC1, KC2, and KC3 datasets are used as
test data to evaluate the software quality knowledge learnt through the semisupervised clustering
process as compared to unsupervised clustering
with Neural-Gas. The test data modules are
classified based on their Euclidean distance from
centroids of the final nfp and fp clusters at the
end of a semisupervised clustering run. We report
the averages of the respective number of random
samples for LP = {500, 1000}. A similar classification is made using centroids of the nfp and fp
clusters labeled by the expert after unsupervised
clustering with the Neural-Gas algorithm.
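A minimal sketch of this nearest-centroid classification of test modules follows, assuming NumPy arrays of the recorded nfp and fp centroids and their labels; the names are illustrative.

```python
import numpy as np

def classify_by_centroids(X_test, centroids, centroid_labels):
    """centroids: (k, d) final nfp/fp centroids; centroid_labels: (k,) 0 = nfp, 1 = fp."""
    dist = np.linalg.norm(X_test[:, None, :] - centroids[None, :, :], axis=2)
    return np.asarray(centroid_labels)[dist.argmin(axis=1)]
```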
The classification performances obtained by
unsupervised clustering for the test datasets are
shown in Table 3. The misclassification error rates
of all test datasets are rather unbalanced with a
low Type I error rate and a relatively high Type
II error rate. Such a classification is obviously not
useful to the software practitioner since, among
the program modules correctly detected as nfp
or fp, most are nfp instances; many fp modules
are not detected.

Table 4. Average test data performances with semisupervised clustering

    LP = 500
    Dataset    Type I    Type II    Overall
    KC1        0.0846    0.4708     0.1442
    KC2        0.1039    0.3302     0.1500
    KC3        0.1181    0.4186     0.1463

    LP = 1000
    Dataset    Type I    Type II    Overall
    KC1        0.0947    0.3477     0.1337
    KC2        0.1304    0.2925     0.1635
    KC3        0.1325    0.3488     0.1528
The average misclassification error rates obtained by the respective semisupervised clustering
runs for the test datasets are shown in Table 4. In
comparison to the test data performances obtained
with unsupervised clustering, the semisupervised
clustering approach yielded noticeably better
classification performances. The Type II error
rates obtained by our semisupervised clustering approach were noticeably lower than those
obtained by unsupervised clustering. This was
accompanied, however, by higher or similar
Type I error rates compared to unsupervised
clustering. Though the Type I error rates were
generally higher for semisupervised clustering,
they were comparable to those of unsupervised
clustering.
Semisupervised Classification Results
We primarily discuss the empirical results obtained by the EM-based semisupervised software
quality classification approach in the context of
a comparison with those of the semisupervised
clustering with expert input scheme presented in
the previous section. The quality-of-fit performances
of the EM-based semisupervised classification
approach for the initial labeled datasets are
summarized in Table 5. The corresponding misclassification error rates for the labeled datasets
after the respective EM-based semisupervised
classification process is completed are shown in
Table 6.
As observed in Tables 5 and 6, the EM-based semisupervised classification approach
improves the overall classification performances
for the different LP values. It is also noted that
the final classification performance is (generally)
inversely proportional to the size of the initial
labeled dataset, that is, LP. This is perhaps indicative of the presence of excess noise in the JM1-8850 dataset. A further insight into the presence
of noise in JM1-8850 in the context of the two
semisupervised learning approaches is presented
in (Seliya et al., 2004; Seliya et al., 2005).
Table 5. Average (initial) performance with semisupervised classification

    LP      Type I    Type II    Overall
    100     0.1475    0.4500     0.2080
    250     0.1580    0.4720     0.2208
    500     0.1575    0.4820     0.2224
    1000    0.1442    0.5600     0.2273
    1500    0.1669    0.5233     0.2382
    2000    0.1590    0.5317     0.2335
    3000    0.2132    0.4839     0.2673

Table 6. Average (final) performance with semisupervised classification

    LP      Type I    Type II    Overall
    100     0.0039    0.0121     0.0055
    250     0.0075    0.0227     0.0108
    500     0.0136    0.0439     0.0206
    1000    0.0249    0.0968     0.0428
    1500    0.0390    0.1254     0.0593
    2000    0.0482    0.1543     0.0752
    3000    0.0830    0.1882     0.1094

The software quality estimation performance
of the semisupervised classification approach for
the three test datasets is shown in Table 7. The table
shows the average performance of the different
samples for the LP values of 500 and 1000.

Table 7. Average test data performances with semisupervised classification

    LP = 500
    Dataset    Type I    Type II    Overall
    KC1        0.0703    0.7329     0.1725
    KC2        0.1072    0.4245     0.1719
    KC3        0.1118    0.5209     0.1502

    LP = 1000
    Dataset    Type I    Type II    Overall
    KC1        0.0700    0.7528     0.1753
    KC2        0.1031    0.4465     0.1731
    KC3        0.0988    0.5426     0.1405

In the
case of LP = 1000, semisupervised clustering (see
previous section) provides better prediction for the
KC1, KC2, and KC3 test datasets. The noticeable
difference between the two techniques for these
three datasets is observed in the respective Type
II error rates. While providing relatively similar
or comparable Type I error rates, semisupervised
clustering with expert input yields much lower
Type II error rates than the EM-based semisupervised classification approach.
For LP = 500, the semisupervised clustering with expert input approach provides better
software quality prediction for the KC1 and KC2
datasets. In the case of KC3, with a comparable
Type I error rate, the semisupervised clustering
approach provided a better Type II error rate. In
summary, the semisupervised clustering with
expert input generally yielded better performance
than the EM-based semisupervised classification approach.
We note that the preference for selecting one of
the two approaches for software quality analysis
with limited apriori fault-proneness data may also
be based on criteria other than software quality
estimation accuracy. The EM-based semisupervised classification approach requires minimal
input from the expert other than incorporating
the desired software quality modeling strategy. In
contrast, the semisupervised clustering approach
requires considerable input from the software
engineering expert in labeling new program
modules (clusters) as nfp or fp. However, based
on our study it is likely that the effort put into
the semisupervised clustering approach would
yield a fruitful outcome in improving the quality of
the software product.
Conclusion
The increasing reliance on software-based
systems further stresses the need to deliver
high-quality software that is very reliable during system operations. This makes the task of
software quality assurance as vital as delivering
a software product within allocated budget and
scheduling constraints. The key to developing
high-quality software is the measurement and
modeling of software quality, and toward that
objective various activities are utilized in software
engineering practice including verification and
validation, automated test case generation for
additional testing, re-engineering of low-quality program modules, and reviews of software
design and code.
This research presented effective data mining
solutions for tackling important yet largely unaddressed software engineering issues. We addressed
software quality modeling and analysis when
there is limited apriori fault-proneness defect data
available. The proposed solutions are evaluated
using case studies of software measurement and
defect data obtained from multiple NASA software projects, made available through the NASA
Metrics Data Program.
In the case when the development organization
has experience in developing systems similar to
the target project but has limited availability of
defect data for those systems, the software quality
assurance team could employ either the EM-based
semisupervised classification approach or the semisupervised clustering approach with expert input.
In our comparative study of these two solutions
for software quality analysis with limited defect
data, it was shown that the semisupervised clustering
approach generally yielded better software quality
prediction than the semisupervised classification
approach. However, once again, the software
quality assurance team may also want to consider
the relatively higher complexity involved in the
semisupervised clustering approach when making their decision.
In our software quality analysis studies with
the EM-based semisupervised classification and
semisupervised clustering with expert input
approaches, an explorative analysis of program
modules that remain unlabeled after the different
semisupervised learning runs provided valuable
insight into the characteristics of those modules.
From a data mining point of view, the analysis indicated that many
of them were likely noisy instances in the JM1
software measurement dataset (Seliya et al., 2004;
Seliya et al., 2005). From a software engineering point of view, we are interested in learning why
those specific modules remain unlabeled after the
respective semisupervised learning runs. However, due to the unavailability of other detailed
information on the JM1 and other NASA software
projects, a further in-depth analysis could not be
performed.
An additional analysis of the two semisupervised learning approaches was performed by
comparing their prediction performances with
software quality classification models built by
using the C4.5 supervised learner trained on the
respective initial labeled datasets (Seliya et al.,
2004; Seliya et al., 2005). It was observed (results
not shown) that both semisupervised learning
approaches generally provided better software
quality estimations compared to the supervised
learners trained on the initial labeled datasets.
The software engineering research presented
in this chapter can lead to further related research
in software measurements and software quality
analysis. Some directions for future work may
include: using different clustering algorithms for
the semisupervised clustering with expert input
scheme, using different underlying algorithms
for the semisupervised classification approach,
and incorporating the costs of misclassification
into the respective semisupervised learning approaches.
References
Basu, S., Banerjee, A., & Mooney, R. (2002).
Semisupervised clustering by seeding. In Proceedings of the 19th International Conference
on Machine Learning, Sydney, Australia (pp.
19-26).
Demirez, A., & Bennett, K. (2000). Optimization
approaches to semisupervised learning. In M.
Ferris, O. Mangasarian, & J. Pang (Eds.), Applications and algorithms of complementarity.
Boston: Kluwer Academic Publishers.
Dong, A., & Bhanu, B. (2003). A new semisupervised EM algorithm for image retrieval. In
Proceedings of the IEEE International Conference
on Computer Vision and Pattern Recognition
(pp. 662-667). Madison, WI: IEEE Computer
Society.
Fenton, N. E., & Pfleeger, S. L. (1997). Software
metrics: A rigorous and practical approach (2nd
ed.). Boston: PWS Publishing Company.
Fung, G., & Mangasarian, O. (2001). Semisupervised support vector machines for unlabeled
data classification. Optimization Methods and
Software, 15, 29-44.
Ghahramani, Z., & Jordan, M. I. (1994). Supervised learning from incomplete data via an
EM approach. In J. D. Cowan, G. Tesauro, & J.
Alspector (Eds.), Advances in neural information processing systems (Vol. 6, pp. 120-127).
San Francisco: Morgan Kaufmann.
Goldman, S., & Zhou, Y. (2000). Enhancing
supervised learning with unlabeled data. In
Proceedings of the 17th International Conference
on Machine Learning, Stanford University, CA
(pp. 327-334).
Guo, L., Cukic, B., & Singh, H. (2003). Predicting
fault prone modules by the Dempster-Shafer belief
networks. In Proceedings of the 18th International
Conference on Automated Software Engineering,
Montreal, Canada (pp. 249-252).
Imam, K. E., Benlarbi, S., Goel, N., & Rai, S.
N. (2001). Comparing case-based reasoning
classifiers for predicting high-risk software
components. Journal of Systems and Software,
55(3), 301-320.
Khoshgoftaar, T. M., Liu, Y., & Seliya, N. (2003).
Genetic programming-based decision trees for
software quality classification. In Proceedings
of the 15th International Conference on Tools
with Artificial Intelligence, Sacramento, CA (pp.
374-383).
Khoshgoftaar, T. M., & Seliya, N. (2003). Analogy-based practical classification rules for software
quality estimation. Empirical Software Engineering Journal, 8(4), 325-350. Kluwer Academic
Publishers.
Khoshgoftaar, T. M., & Seliya, N. (2004). Comparative assessment of software quality classification techniques: An empirical case study.
Empirical Software Engineering Journal, 9(3),
229-257. Kluwer Academic Publishers.
Khoshgoftaar, T. M., Yuan, X., & Allen, E. B.
(2000). Balancing misclassification rates in classification tree models of software quality. Empirical
Software Engineering Journal, 5, 313-330.
Krzanowski, W. J., & Lai, Y. T. (1988). A criterion
for determining the number of groups in a data
set using sums-of-squares clustering. Biometrics,
44(1), 23-34.
Little, R. J. A., & Rubin, D. B. (2002). Statistical
analysis with missing data (2nd ed.). Hoboken, NJ:
John Wiley and Sons.
Martinez, T. M., Berkovich, S. G., & Schulten,
K. J. (1993). Neural-gas: Network for vector
quantization and its application to time-series
prediction. IEEE Transactions on Neural Networks, 4(4), 558-569.
Mitchell, T. (1999). The role of unlabeled data
in supervised learning. In Proceedings of the 6th
International Colloquium on Cognitive Science,
Donostia. San Sebastian, Spain: Institute for
Logic, Cognition, Language and Information.
Nigam, K., & Ghani, R. (2000). Analyzing the
effectiveness and applicability of co-training. In
Proceedings of the 9th International Conference
on Information and Knowledge Management,
McLean, VA (pp. 86-93).
Nigam, K., McCallum, A. K., Thrun, S., &
Mitchell, T. (1998). Learning to classify text
from labeled and unlabeled documents. In Proceedings of the 15th Conference of the American
Association for Artificial Intelligence, Madison,
WI (pp. 792-799).
Nigam, K., McCallum, A. K., Thrun, S., & Mitchell, T. (2000). Text classification from labeled
and unlabeled documents using EM. Machine
Learning, 39(2-3), 103-134.
Ohlsson, M. C., & Runeson, P. (2002). Experience
from replicating empirical studies on prediction
models. In Proceedings of the 8th International
Software Metrics Symposium, Ottawa, Canada
(pp. 217-226).
Pedrycz, W., & Waletzky, J. (1997a). Fuzzy clustering in software reusability. Software: Practice
and Experience, 27, 245-270.
Pedrycz, W., & Waletzky, J. (1997b). Fuzzy clustering with partial supervision. IEEE Transactions
on Systems, Man, and Cybernetics, 5, 787-795.
Pizzi, N. J., Summers, R., & Pedrycz, W. (2002).
Software quality prediction using median-adjusted class labels. In Proceedings of the International Joint Conference on Neural Networks,
Honolulu, HI (Vol. 3, pp. 2405-2409).
Schneidewind, N. F. (2001). Investigation of logistic regression as a discriminant of software quality.
In Proceedings of the 7th International Software
Metrics Symposium, London (pp. 328-337).
Seber, G. A. F. (1984). Multivariate observations.
New York: John Wiley & Sons.
Seeger, M. (2001). Learning with labeled and
unlabeled data (Tech. Rep.). Scotland, UK: University of Edinburgh, Institute for Adaptive and
Neural Computation.
Seliya, N., Khoshgoftaar, T. M., & Zhong, S.
(2004). Semisupervised learning for software
quality estimation. In Proceedings of the 16th IEEE
International Conference on Tools with Artificial
Intelligence, Boca Raton, FL (pp. 183-190).
Seliya, N., Khoshgoftaar, T. M., & Zhong, S.
(2005). Analyzing software quality with limited
fault-proneness defect data. In Proceedings of
the 9th IEEE International Symposium on High
Assurance Systems Engineering, Heidelberg,
Germany (pp. 89-98).
Suarez, A., & Lutsko, J. F. (1999). Globally optimal
fuzzy decision trees for classification and regression. Pattern Analysis and Machine Intelligence,
21(12), 1297-1311.
Wagstaff, K., & Cardie, C. (2000). Clustering with
instance-level constraints. In Proceedings of the
17th International Conference on Machine Learning, Stanford University, CA (pp. 1103-1110).
Wu, Y., & Huang, T. S. (2000). Self-supervised
learning for visual tracking and recognition of
human hand. In Proceedings of the 17th National
Conference on Artificial Intelligence, Austin, TX
(pp. 243-248).
Zeng, H., Wang, X., Chen, Z., Lu, H., & Ma, W.
(2003). CBC: Clustering based text classification
using minimal labeled data. In Proceedings of the
IEEE International Conference on Data Mining,
Melbourne, FL (pp. 443-450).
Zhong, S. (2006). Semisupervised model-based
document clustering: A comparative study. Machine Learning, 65(1), 2-29.
Zhong, S., Khoshgoftaar, T. M., & Seliya, N.
(2004). Analyzing software measurement data
with clustering techniques. IEEE Intelligent
Systems, 19(2), 22-27.