Software Quality Modeling with Limited Apriori Defect Data


The proposed solutions are a semisupervised clustering with expert input scheme and a semisupervised classification approach with the expectation-maximization algorithm. Software measurement datasets obtained from multiple NASA software projects are used in our empirical investigation. The software quality knowledge learnt during the semisupervised learning processes provided good generalization performance for multiple test datasets. In addition, both solutions provided better predictions than a supervised learner trained on the initial labeled dataset.

Copyright © 2007, IGI Global, distributing in print or electronic forms without written permission of IGI Global is prohibited.

Introduction

Data mining and machine learning have numerous practical applications across several domains, especially for classification and prediction problems. This chapter involves a data mining and machine learning problem in the context of software quality modeling and estimation. Software measurements and software fault (defect) data have been used in the development of models that predict software quality; for example, a software quality classification model (Imam, Benlarbi, Goel, & Rai, 2001; Khoshgoftaar & Seliya, 2004; Ohlsson & Runeson, 2002) predicts the fault-proneness membership of program modules. A software quality model allows the software development team to track and detect potential software defects relatively early during development.

Software quality estimation models exploit the software engineering hypothesis that software measurements encapsulate the underlying quality of the software system. This assumption has been verified in numerous studies (Fenton & Pfleeger, 1997). A software quality model is typically built or trained using software measurement and defect data from a similar project or a previously developed system release.
The model is then applied to the system currently under development to estimate the quality, or the presence of defects, in its program modules. Subsequently, the limited resources allocated for software quality inspection and improvement can be targeted toward low-quality modules, achieving cost-effective resource utilization (Khoshgoftaar & Seliya, 2003).

An important assumption made during typical software quality classification modeling is that fault-proneness labels are available for all program modules (instances) of the training data; that is, supervised learning is facilitated because all instances in the training data have been assigned a quality-based label such as fault-prone (fp) or not fault-prone (nfp). In software engineering practice, however, various scenarios can limit the availability of quality-based labels or defect data for all the modules in the training data, for example:

• The cost of running data collection tools may limit the subsystems for which software quality data is collected.
• Only some project components in a distributed software system may collect software quality data, while others may not be equipped for collecting similar data.
• The software defect data collected for some program modules may be error-prone due to data collection and recording problems.
• In a multiple release software project, a given release may collect software quality data for only a portion of the modules, either due to limited funds or other practical issues.

In the training software measurement dataset, the fault-proneness labels may be known for only some of the modules (the labeled instances), while for the remaining modules (the unlabeled instances) only software attributes are available. In such a situation, following the typical supervised learning approach to software quality modeling may be inappropriate.
This is because a model trained using the small portion of labeled modules may not yield good software quality analysis; that is, the few labeled modules are not sufficient to adequately represent the quality trends of the given system. A possible solution lies in extracting the knowledge stored in the software metrics of the unlabeled modules, in addition to that of the labeled instances. This problem is an instance of the labeled-unlabeled learning problem in data mining and machine learning (Seeger, 2001).

We present two solutions to the problem of software quality modeling with limited prior fault-proneness defect data. The first solution is a semisupervised clustering with expert input scheme based on the k-means algorithm (Seliya, Khoshgoftaar, & Zhong, 2005), while the second is a semisupervised classification approach based on the expectation maximization (EM) algorithm (Seliya, Khoshgoftaar, & Zhong, 2004).

The semisupervised clustering with expert input approach is based on constraint-based clustering, in which the constraint maintains a strict membership of modules to clusters that are already labeled as nfp or fp. At the end of a constraint-based clustering run, a domain expert is allowed to label the unlabeled clusters, and the semisupervised clustering process is iterated. The EM-based semisupervised classification approach iteratively augments the labeled dataset with unlabeled program modules and their estimated class labels. The class labels of the unlabeled instances are treated as missing data, which are estimated by the EM algorithm. The unlabeled modules are added to the labeled dataset based on the confidence in their prediction.

A case study of software measurement and defect data obtained from multiple NASA software projects is used to evaluate the two solutions.
To simulate the labeled-unlabeled problem, a sample of program modules is randomly selected from the JM1 software measurement dataset and is used as the initial labeled dataset. The remaining JM1 program modules are treated (without their class labels) as the initial unlabeled dataset. At the end of the respective semisupervised learning approaches, the software quality modeling knowledge gained is evaluated using three independent software measurement datasets. A comparison between the two approaches for software quality modeling with limited apriori defect data indicated that the semisupervised clustering with expert input approach yielded better performance than the EM-based semisupervised classification approach. However, the former requires considerable expert input compared to the latter. In addition, both semisupervised learning schemes provided an improvement in generalization accuracy for independent test datasets.

The rest of this chapter is organized as follows: some relevant works are briefly discussed in the next section; the third and fourth sections respectively present the semisupervised clustering with expert input and the EM-based semisupervised classification approaches; the empirical case study, including software systems description, modeling methodology, and results, is presented in the fifth section. The chapter ends with a conclusion, which includes some suggestions for future work.

Related Work

In the literature, various methods have been investigated to model the knowledge stored in software measurements for predicting the quality of program modules. For example, Schneidewind (2001) utilizes logistic regression in combination with Boolean discriminant functions for predicting fp program modules. Guo, Cukic, and Singh (2003) predict fp program modules using Dempster-Shafer networks. Khoshgoftaar, Liu, and Seliya (2003) have investigated genetic programming and decision trees (Khoshgoftaar, Yuan, & Allen, 2000), among other techniques.
Some other works that have focused on software quality estimation include Imam et al. (2001), Suarez and Lutsko (1999), and Pizzi, Summers, and Pedrycz (2002). While almost all existing works on software quality estimation have focused on using a supervised learning approach for building software quality models, very limited attention has been given to the problem of software quality modeling and analysis when there is limited defect data from previous software project development experiences.

A machine learning classification problem in which both labeled and unlabeled data are used during the learning process is termed semisupervised learning (Goldman, 2000; Seeger, 2001). In such a learning scheme, the labeled dataset is iteratively augmented with instances (with predicted class labels) from the unlabeled dataset based on some selection measure. Semisupervised classification schemes have been investigated across various domains, including content-based image retrieval (Dong & Bhanu, 2003), human motion and gesture pattern recognition (Wu & Huang, 2000), document categorization (Ghahramani & Jordan, 1994; Nigam & Ghani, 2000), and software engineering (Seliya et al., 2004). Some of the recently investigated techniques for semisupervised classification include the EM algorithm (Nigam, McCallum, Thrun, & Mitchell, 1998), cotraining (Goldman & Zhou, 2000; Mitchell, 1999; Nigam & Ghani, 2000), and support vector machines (Demirez & Bennett, 2000; Fung & Mangasarian, 2001).

While many works in semisupervised learning are geared toward the classification problem, a few studies investigate semisupervised clustering for grouping a given set of text documents (Zeng, Wang, Chen, Lu, & Ma, 2003; Zhong, 2006). A semisupervised clustering approach has some benefits over semisupervised classification.
During the semisupervised clustering process, additional classes of data can be obtained (if desired), while the semisupervised classification approach requires prior knowledge of all possible classes of the data. The unlabeled data may form new classes other than the predefined classes for the given data. Pedrycz and Waletzky (1997) investigate semisupervised clustering using fuzzy logic-based clustering for analyzing software reusability. In contrast, this study investigates semisupervised clustering for software quality estimation.

The labeled instances in a semisupervised clustering scheme have been used for initial seeding of the clusters (Basu, Banerjee, & Mooney, 2002), incorporating constraints in the clustering process (Wagstaff & Cardie, 2000), or providing feedback subsequent to regular clustering (Zhong, 2006). The seeded approach uses the labeled data to initialize cluster centroids prior to clustering. The constraint-based approach keeps a fixed grouping of the labeled data during the clustering process. The feedback-based approach uses the labeled data to adjust the clusters after executing a regular clustering process.

Semisupervised Clustering with Expert Input

The basic purpose of a semisupervised approach during clustering is to aid the clustering algorithm in making better partitions of instances in the given dataset. The semisupervised clustering approach presented is a constraint-based scheme that uses labeled instances for initial seeding (centroids) of some clusters among the maximum allowable clusters when using k-means as the clustering algorithm. In addition, during the semisupervised iterative process a domain (software engineering) expert is allowed to label additional clusters as either nfp or fp based on domain knowledge and some descriptive statistics of the clusters. The data in a semisupervised clustering scheme consists of a small set of labeled instances and a large set of unlabeled instances.
Let D be a dataset of labeled (nfp or fp) and unlabeled (ul) program modules, containing the subsets L of labeled modules and U of unlabeled modules. In addition, let the dataset L consist of subsets L_nfp of nfp modules and L_fp of fp modules. The procedure used in our constraint-based semisupervised clustering approach with k-means is summarized next:

1. Obtain initial numbers of nfp and fp clusters:
   • An optimal number of clusters for the nfp and fp instances in the initial labeled dataset is obtained using the Cg criterion proposed by Krzanowski and Lai (1988).
   • Given L_nfp, execute the Cg criterion algorithm to obtain the optimal number of nfp clusters among {1, 2, …, Cin_nfp} clusters, where Cin_nfp is the user-defined maximum number of clusters for L_nfp. Let p denote the obtained number of nfp clusters.
   • Given L_fp, execute the Cg criterion algorithm to obtain the optimal number of fp clusters among {1, 2, …, Cin_fp} clusters, where Cin_fp is the user-defined maximum number of clusters for L_fp. Let q denote the obtained number of fp clusters.

2. Initialize centroids of clusters: Given the maximum number of clusters, Cmax, allowed during the semisupervised clustering process with k-means,
   • The centroids of p clusters out of Cmax are initialized to centroids of the clusters labeled as nfp.
   • The centroids of q clusters out of {Cmax - p} are initialized to centroids of the clusters labeled as fp.
   • The centroids of the remaining r (i.e., Cmax - p - q) clusters are initialized to randomly selected instances from U. We randomly select 5 unique sets of r instances each for initializing centroids of the unlabeled clusters. Thus, centroids of the {p + q + r} clusters can be initialized using 5 different combinations.
   • The sets of nfp, fp, and unlabeled clusters are thus C_nfp = {c_nfp1, c_nfp2, …, c_nfpp}, C_fp = {c_fp1, c_fp2, …, c_fpq}, and C_ul = {c_ul1, c_ul2, …, c_ulr}, respectively.

3. Execute constraint-based clustering:
   • The k-means clustering algorithm with the Euclidean distance function is run on D using the initialized centroids for the Cmax clusters, under the constraint that the existing membership of a program module to a labeled cluster remains unchanged. Thus, at a given iteration during the semisupervised clustering process, if a module already belongs (by initial membership or by expert-based assignment from previous iterations) to an nfp (or fp) cluster, then it cannot move to another cluster during the clustering process of that iteration.
   • The constraint-based clustering process with k-means is repeated for each of the 5 centroid initializations, and the respective SSE (sum-of-squares-error) values are computed.
   • The clustering result associated with the median SSE value is selected for continuation to the next step. This is done to minimize the likelihood of working with a lucky or unlucky initialization of cluster centroids.

4. Expert-based labeling of clusters:
   • The software engineering expert is presented with descriptive statistics of the r unlabeled clusters, and is asked to label them as either nfp or fp. The specific statistics presented for attributes of instances in each cluster depend on the expert's request, and include data such as minimum, maximum, mean, standard deviation, and so forth.
   • The expert labels only those clusters for which he/she is very confident in the label estimation.
   • If the expert labels at least one of the r (unlabeled) clusters, then go to Step 2 and repeat; otherwise continue.

5. Stop semisupervised clustering: The iterative process is stopped when the sets C_nfp, C_fp, and C_ul remain unchanged.
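The constraint-based clustering of Step 3 can be illustrated with a minimal Python sketch. It is not the implementation used in the study: the function names, the `fixed` array encoding (a cluster index for pinned modules, -1 for free modules), the iteration cap, and the handling of empty clusters are all assumptions made for illustration, and the sketch does not enforce that the 5 random initialization sets are unique.

```python
import numpy as np

def constrained_kmeans(X, centroids, fixed, max_iter=100):
    """k-means (Euclidean) in which some cluster memberships are frozen.

    X         : (n, d) array of software metrics for all modules in D
    centroids : (k, d) array of initial centroids
    fixed     : length-n int array; fixed[i] >= 0 pins module i to that
                cluster, -1 means the module is free to move (assumed encoding)
    Returns (assignments, centroids, sse).
    """
    X = np.asarray(X, dtype=float)
    centroids = np.array(centroids, dtype=float)  # copy, caller's array untouched
    fixed = np.asarray(fixed)
    pinned = fixed >= 0
    assign = np.where(pinned, fixed, -1)
    for _ in range(max_iter):
        # Squared Euclidean distance from every module to every centroid.
        dist = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_assign = dist.argmin(axis=1)
        # Constraint: modules already in a labeled cluster never move.
        new_assign[pinned] = fixed[pinned]
        if np.array_equal(new_assign, assign):
            break
        assign = new_assign
        for j in range(len(centroids)):
            members = X[assign == j]
            if len(members) > 0:  # an empty cluster keeps its old centroid
                centroids[j] = members.mean(axis=0)
    sse = float(((X - centroids[assign]) ** 2).sum())
    return assign, centroids, sse

def median_sse_run(X, labeled_centroids, fixed, r, n_inits=5, seed=0):
    """Run several random initializations of the r unlabeled clusters and
    return the run whose SSE is the median."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    free_idx = np.flatnonzero(np.asarray(fixed) < 0)
    runs = []
    for _ in range(n_inits):
        picks = rng.choice(free_idx, size=r, replace=False)
        init = np.vstack([labeled_centroids, X[picks]])
        runs.append(constrained_kmeans(X, init, fixed))
    runs.sort(key=lambda run: run[2])
    return runs[len(runs) // 2]
```

In this sketch the first p + q rows of the centroid array would hold the nfp and fp centroids, so that the indices stored in `fixed` refer to labeled clusters; after the expert labels a cluster in Step 4, its members would be pinned by updating `fixed` before the next run.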
When the iterative process stops, the modules in the nfp (fp) clusters are labeled and recorded as nfp (fp), while those in the ul clusters are not assigned any label. In addition, the centroids of the {p + q} labeled clusters are also recorded.

Semisupervised Classification with EM Algorithm

The expectation maximization (EM) algorithm is a general iterative method for maximum likelihood estimation in data mining problems with incomplete data. The EM algorithm takes an iterative approach consisting of replacing missing data with estimated values, estimating model parameters, and re-estimating the missing data values. An iteration of EM consists of an E (Expectation) step and an M (Maximization) step, each having a direct statistical interpretation. We limit our EM algorithm discussion to a brief overview, and refer the reader to Little and Rubin (2002) and Seliya et al. (2004) for more extensive coverage. In our study, the class value of the unlabeled software modules is treated as missing data, and the EM algorithm is used to estimate the missing values.

Many multivariate statistical analyses, including multiple linear regression, principal component analysis, and canonical correlation analysis, are based on the initial study of the data with respect to the sample mean and covariance matrix of the variables. The EM algorithm implemented for our study on semisupervised software quality estimation is based on maximum likelihood estimation of missing data, means, and covariances for multivariate normal samples (Little & Rubin, 2002). The E and M steps continue iteratively until a stopping criterion is reached. Commonly used stopping criteria include specifying a maximum number of iterations or monitoring when the change in the values estimated for the missing data reaches a plateau for a specified epsilon value (Little & Rubin, 2002).
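As a rough illustration of the E and M steps for a multivariate normal sample with missing values, the following Python sketch may help. It is an assumption-laden sketch rather than the implementation used in the study: the function name `em_mvn_impute`, the NaN encoding of missing values, the mean-filled initial covariance (a simplification), the small ridge term, and the safety cap `max_iter` are all choices made here for illustration. It also assumes every row has at least one observed value, which holds when only the class label is missing.

```python
import numpy as np

def em_mvn_impute(X, tol=1e-4, max_iter=1000):
    """EM for a multivariate normal sample with missing entries (NaN).

    Assumes each row has at least one observed value.
    Returns (mu, sigma, X_filled); X_filled holds the conditional means
    of the missing entries computed in the last E-step.
    """
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    miss = np.isnan(X)
    # Initial parameters from the available values (simplified start).
    mu = np.nanmean(X, axis=0)
    filled = np.where(miss, mu, X)
    sigma = np.cov(filled, rowvar=False, bias=True) + 1e-6 * np.eye(d)
    for _ in range(max_iter):
        # E-step: conditional mean of each missing block given the observed
        # block, plus the conditional covariance the M-step needs.
        extra = np.zeros((d, d))
        for i in range(n):
            m = miss[i]
            if not m.any():
                continue
            o = ~m
            coef = sigma[np.ix_(m, o)] @ np.linalg.inv(sigma[np.ix_(o, o)])
            filled[i, m] = mu[m] + coef @ (X[i, o] - mu[o])
            extra[np.ix_(m, m)] += sigma[np.ix_(m, m)] - coef @ sigma[np.ix_(o, m)]
        # M-step: maximum likelihood estimates of means and covariances.
        new_mu = filled.mean(axis=0)
        centered = filled - new_mu
        new_sigma = (centered.T @ centered + extra) / n
        # Stop when the largest change in any mean or covariance is below tol.
        change = max(np.abs(new_mu - mu).max(), np.abs(new_sigma - sigma).max())
        mu, sigma = new_mu, new_sigma
        if change < tol:
            break
    return mu, sigma, filled
```

With the class label stored as one column (1 for fp, 0 for nfp, NaN for unlabeled modules), the filled values in that column play the role of the estimated dependent variable. Because the model is Gaussian, these estimates are not restricted to the interval from 0 to 1.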
In our study we use the plateau-based criterion and allow the EM algorithm to converge without a maximum number of iterations; that is, iteration is stopped when the maximum change among the means or covariances between two consecutive iterations is less than 0.0001. The initial values of the parameter set are obtained by estimating means and variances from all available values of each variable, and then estimating covariances from all available pairwise values using the computed means. Given the L (labeled) and U (unlabeled) datasets, the EM algorithm is used to estimate the missing class labels by creating a new dataset combining L and U and then applying the EM algorithm to estimate the missing data, that is, the dependent variable of U.

The following procedure is used in our EM-based semisupervised classification approach:

1. Estimate the dependent variable (class labels) for the labeled dataset. This is done by treating L also as U; that is, the unlabeled dataset consists of the labeled instances but without their fault-proneness labels. The EM algorithm is then used to estimate these missing class labels. In our study the fp and nfp classes are labeled 1 and 0, respectively. Consequently, the estimated missing values will fall approximately within the range of 0 to 1.
2. For a given significance level α, obtain confidence intervals for the predicted dependent variable in Step 1. The assumption is that the two confidence interval boundaries delineate the nfp and fp modules. Record the upper boundary as ci_nfp (i.e., closer to 0) and the lower boundary as ci_fp (i.e., closer to 1).
3. For the given L and U datasets, estimate the dependent variable for U using EM.
4. An instance in U is identified as nfp if its predicted dependent variable falls within (i.e., is lower than) the upper boundary, that is, ci_nfp. Similarly, an instance in U is identified as fp if its predicted dependent variable falls within (i.e., is greater than) the lower boundary, that is, ci_fp.
5. The newly labeled instances of U are used to augment L, and the semisupervised classification procedure is iterated from Step 1. The iteration stopping criterion used in our study is as follows: if the number of instances selected from U is less than a specific number (that is, 1% of the initial L dataset), then iteration stops.

Empirical Case Study

Software System Descriptions

The software measurements and quality data used in our study to investigate the proposed semisupervised learning approaches are those of a large NASA software project, JM1. Written in C, JM1 is a real-time ground system that uses simulations to generate certain predictions for missions. The data was made available through the Metrics Data Program (MDP) at NASA, and included software measurement data and associated error (fault or defect) data collected at the function level. A program module for the system consisted of a function or method. The fault data collected for the system represents, for a given module, faults detected during software development.

Table 1. Software measurements
Line Count Metrics:     Total Lines of Code, Executable LOC, Comments LOC, Blank LOC, Code And Comments LOC
Halstead Metrics:       Total Operators, Total Operands, Unique Operators, Unique Operands
McCabe Metrics:         Cyclomatic Complexity, Essential Complexity, Design Complexity
Branch Count Metrics:   Branch Count

The original JM1 dataset consisted of 10,883 software modules, of which 2,105 modules had software defects (ranging from 1 to 26) while the remaining 8,778 modules were defect-free, that is, had no software faults. In our study, a program module with no faults was considered nfp, and fp otherwise. The JM1 dataset contained some inconsistent modules (those with identical software measurements but different class labels) and some modules with missing values. Upon removing such modules, the dataset was reduced from 10,883 to 8,850 modules.
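The removal of incomplete and inconsistent modules described above can be sketched in a few lines of Python. The data layout (a list of `(metrics, label)` pairs with `None` marking a missing value) and the function names are hypothetical; the sketch assumes that every module involved in an inconsistency is dropped, which is one reading of the description.

```python
from collections import defaultdict

def label_from_faults(n_faults):
    """A module with no faults is nfp; otherwise it is fp."""
    return "nfp" if n_faults == 0 else "fp"

def clean_modules(modules):
    """Drop modules with missing values and inconsistent modules.

    modules: list of (metrics, label) pairs, where metrics is a tuple of
    numbers (None marks a missing value) and label is "fp" or "nfp".
    """
    complete = [(m, y) for m, y in modules
                if y is not None and all(v is not None for v in m)]
    labels_seen = defaultdict(set)
    for m, y in complete:
        labels_seen[tuple(m)].add(y)
    # Inconsistent: identical measurements that carry both class labels.
    return [(m, y) for m, y in complete if len(labels_seen[tuple(m)]) == 1]
```

Modules that share identical measurements and the same label are kept by this sketch, since they do not contradict each other.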
We denote the reduced dataset as JM1-8850, which consisted of 1,687 modules with one or more defects and 7,163 modules with no defects. Each program module in the JM1 dataset was characterized by 21 software measurements (Fenton & Pfleeger, 1997): the 13 metrics shown in Table 1 and 8 derived Halstead metrics (Halstead length, Halstead volume, Halstead level, Halstead difficulty, Halstead content, Halstead effort, Halstead error estimate, and Halstead program time). We used only the 13 basic software metrics in our analysis; the eight derived Halstead metrics were not used.

The choice of metrics for JM1 (and the other datasets) was primarily governed by their availability, the internal workings of the projects, and the data collection tools used. The type and number of metrics made available were determined by the NASA Metrics Data Program. Other metrics, including software process measurements, were not available. The use of these specific software metrics does not advocate their effectiveness, and a different project may consider a different set of software measurements for analysis (Fenton & Pfleeger, 1997; Imam et al., 2001).

In order to gauge the performance of the semisupervised clustering results, we use software measurement data of three other NASA projects, KC1, KC2, and KC3, as test datasets. These software measurement datasets were also obtained through the NASA Metrics Data Program. The definitions of what constituted an fp and an nfp module for these projects are the same as those of the JM1 system. A program module of these projects also consisted of a function, subroutine, or method. These three projects were characterized by the same software product metrics used for the JM1 project, and were built in a similar software development organization.
The software systems of the test datasets are summarized next: • The KC1 project is a single CSCI within a large ground system and consists of 43  Software Quality Modeling with Limited Apriori Defect Data • • KLOC (thousand lines of code) of C++ code. A given CSCI comprises of logical groups of computer software components (CSCs). The dataset contains 2107 modules, of which 325 have one or more faults and 1782 have zero faults. The maximum number of faults in a module is 7. The KC2 project, written in C++, is the science data processing unit of a storage management system used for receiving and processing ground data for missions. The dataset includes only those modules that were developed by NASA software developers and not commercial-of-the-shelf (COTS) software. The dataset contains 520 modules, of which 106 have one or more faults and 414 have zero faults. The maximum number of faults in a software module is 13. The KC3 project, written in 18 KLOC of Java, is a software application that collects, processes, and delivers satellite meta-data. The dataset contains 458 modules, of which 43 have one or more faults and 415 have zero faults. The maximum number of faults in a module is 6. empirical setting and modeling The initial L dataset is obtained by randomly selecting LP number of modules from JM1-8850, while the remaining UP number of modules were treated (without their fault-proneness labels) as the initial U dataset. The sampling was performed to maintain the approximate proportion of nfp:fp = 80:20 of the instances in JM1-8850. We considered different sampling sizes, that is, LP = {100, 250, 500, 1000, 1500, 2000, 3000}. For a given LP value, three samples were obtained without replacement from the JM1-8850 dataset. In the case of LP = {100, 250, 500}, five samples were obtained to account for their relatively small sizes. 
Due to space considerations, we generally present results only for LP = {500, 1000}; additional details are provided in Seliya et al. (2004) and Seliya et al. (2005). When classifying program modules as fp or nfp, a Type I error occurs when an nfp module is misclassified as fp, while a Type II error occurs when an fp module is misclassified as nfp. It is known that the two error rates are inversely related, so lowering one generally raises the other (Khoshgoftaar et al., 2003; Khoshgoftaar et al., 2000).

Semisupervised Clustering Modeling

The initial numbers of nfp and fp clusters, that is, p and q, were obtained by setting both Cin_nfp and Cin_fp to 20. The maximum number of clusters allowed during our semisupervised clustering with k-means was set to two values: Cmax = {30, 40}. These values were selected based on input from the domain expert and reflect a similar empirical setting used in our previous work (Zhong, Khoshgoftaar, & Seliya, 2004). Due to the similarity of results for the two Cmax values, only results for Cmax = 40 are presented.

At a given iteration during the semisupervised clustering process, the following descriptive statistics were computed at the request of the software engineering expert: minimum, maximum, mean, median, standard deviation, and the 75, 80, 85, 90, and 95 percentiles. These values were computed for all 13 software attributes of modules in a given cluster. The expert was also presented with the following statistics for JM1-8850 and the U dataset at a given iteration: minimum, maximum, mean, median, standard deviation, and the 5, 10, 15, 20, 25, 30, 35, 40, 45, 55, 60, 70, 75, 80, 85, 90, and 95 percentiles. The extent to which the above descriptive statistics were used was at the discretion of the expert during his labeling task.

Semisupervised Classification Modeling

The significance level used to select instances from the U dataset to augment the L dataset is set to α = 0.05.
Other significance levels of 0.01 and 0.10 were also considered; however, their results are not presented, as the software quality estimation performances were relatively similar for the different α values. The iterative semisupervised classification process is continued until the number of instances moved from U to L is less than 1% of the initial unlabeled dataset.

Table 3. Data performances with unsupervised clustering
Dataset   Type I   Type II   Overall
KC1       0.0617   0.6985    0.1599
KC2       0.0918   0.4151    0.1577
KC3       0.1229   0.5116    0.1594

Semisupervised Clustering Results

The predicted class labels of the labeled program modules obtained at the end of each semisupervised clustering run are compared with their actual class labels. The average classification performance across the different samples for each LP and Cmax = 40 is presented in Table 2. The table shows the average Type I, Type II, and Overall misclassification error rates for the different LP values. It was observed that, for the given Cmax value, the Type II error rate generally decreases with an increase in the LP value, indicating that with a larger initial labeled dataset, the semisupervised clustering with expert input scheme detects more fp modules.

In a recent study (Zhong et al., 2004), we investigated unsupervised clustering techniques on the JM1-8850 dataset. In that study, the k-means and Neural-Gas (Martinez, Berkovich, & Schulten, 1993) clustering algorithms were used at Cmax = 30 clusters. As in this study, the expert was given descriptive statistics for each cluster and was asked to label them as either nfp or fp. In Zhong et al. (2004), the Neural-Gas clustering technique yielded better classification results than the k-means algorithm.
For the program modules that are labeled after the respective semisupervised clustering runs, the corresponding module classification performances by the Neural-Gas unsupervised clustering technique are presented in Table 2. The semisupervised clustering scheme yields better false-negative (Type II) error rates than the unsupervised clustering method. The false-negative error rates of both techniques tend to decrease with an increase in LP. The false-positive (Type I) error rates of both techniques tend to remain relatively stable across the different LP values.

Table 2. Average classification performance of labeled modules with semisupervised clustering
              Semisupervised                 Unsupervised
Sample Size   Type I   Type II   Overall    Type I   Type II   Overall
100           0.1491   0.4599    0.2058     0.1748   0.5758    0.2479
250           0.1450   0.4313    0.1989     0.1962   0.5677    0.2661
500           0.1408   0.4123    0.1913     0.1931   0.5281    0.2554
1000          0.1063   0.4264    0.1630     0.1778   0.5464    0.2431
1500          0.1219   0.4073    0.1759     0.1994   0.5169    0.2595
2000          0.1137   0.3809    0.1641     0.1883   0.5172    0.2503
2500          0.1253   0.3777    0.1725     0.1896   0.4804    0.2440
3000          0.1361   0.3099    0.1687     0.1994   0.4688    0.2499

A z-test (Seber, 1984) was performed to compare the classification performances (populations) of semisupervised clustering and unsupervised clustering. The Overall misclassification rates obtained by both techniques are used as the response variable in the statistical comparison at a 5% significance level. The proposed semisupervised clustering approach yielded significantly better Overall misclassification rates than the unsupervised clustering approach for LP values of 500 and greater.

The KC1, KC2, and KC3 datasets are used as test data to evaluate the software quality knowledge learnt through the semisupervised clustering process as compared to unsupervised clustering with Neural-Gas.
The test data modules are classified based on their Euclidean distance from the centroids of the final nfp and fp clusters at the end of a semisupervised clustering run. We report the averages over the respective number of random samples for LP = {500, 1000}. A similar classification is made using centroids of the nfp and fp clusters labeled by the expert after unsupervised clustering with the Neural-Gas algorithm. The classification performances obtained by unsupervised clustering for the test datasets are shown in Table 3. The misclassification error rates of all test datasets are rather unbalanced, with a low Type I error rate and a relatively high Type II error rate. Such a classification is not useful to the software practitioner, since among the program modules correctly detected as nfp or fp, most are nfp instances; many fp modules are not detected.

Table 4. Average test data performances with semisupervised clustering
Dataset   Type I   Type II   Overall
LP = 500
KC1       0.0846   0.4708    0.1442
KC2       0.1039   0.3302    0.1500
KC3       0.1181   0.4186    0.1463
LP = 1000
KC1       0.0947   0.3477    0.1337
KC2       0.1304   0.2925    0.1635
KC3       0.1325   0.3488    0.1528

The average misclassification error rates obtained by the respective semisupervised clustering runs for the test datasets are shown in Table 4. In comparison to the test data performances obtained with unsupervised clustering, the semisupervised clustering approach yielded noticeably better classification performances. The Type II error rates obtained by our semisupervised clustering approach were noticeably lower than those obtained by unsupervised clustering. This was accompanied, however, by higher or similar Type I error rates compared to unsupervised clustering. Though the Type I error rates were generally higher for semisupervised clustering, they were comparable to those of unsupervised clustering.
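The centroid-based classification of test modules and the three error rates reported in the tables can be sketched as below. The function names and the boolean encoding (True for fp) are assumptions made for this illustration, and the error-rate function assumes that both classes are present in the test data.

```python
import numpy as np

def classify_by_centroids(X, nfp_centroids, fp_centroids):
    """Label each module by its nearest labeled-cluster centroid (Euclidean).

    Returns a boolean array in which True means "predicted fp".
    """
    X = np.asarray(X, dtype=float)
    cents = np.vstack([nfp_centroids, fp_centroids]).astype(float)
    is_fp = np.array([False] * len(nfp_centroids) + [True] * len(fp_centroids))
    dist = ((X[:, None, :] - cents[None, :, :]) ** 2).sum(axis=2)
    return is_fp[dist.argmin(axis=1)]

def error_rates(actual_fp, predicted_fp):
    """Return (Type I, Type II, Overall) misclassification rates."""
    actual = np.asarray(actual_fp, dtype=bool)
    pred = np.asarray(predicted_fp, dtype=bool)
    type1 = float((pred & ~actual).sum() / (~actual).sum())  # nfp predicted as fp
    type2 = float((~pred & actual).sum() / actual.sum())     # fp predicted as nfp
    overall = float((pred != actual).mean())
    return type1, type2, overall
```

Under these definitions the Overall rate is the class-size-weighted combination of the two error rates. As a consistency check against the reported figures, for KC1 in Table 3 (1782 nfp and 325 fp modules), (0.0617 × 1782 + 0.6985 × 325) / 2107 ≈ 0.1599, which agrees with the Overall value in that table.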
Semisupervised Classification Results

We primarily discuss the empirical results obtained by the EM-based semisupervised software quality classification approach in comparison with those of the semisupervised clustering with expert input scheme presented in the previous section. The quality-of-fit performances of the EM-based semisupervised classification approach for the initial labeled datasets are summarized in Table 5. The corresponding misclassification error rates for the labeled datasets after the respective EM-based semisupervised classification processes are completed are shown in Table 6. As observed in Tables 5 and 6, the EM-based semisupervised classification approach improves the overall classification performance for all LP values. It is also noted that the final classification performance generally degrades as the size of the initial labeled dataset, that is, LP, increases: the final Overall error rate rises from 0.0055 at LP = 100 to 0.1094 at LP = 3000. This is perhaps indicative of the presence of excess noise in the JM1-8850 dataset. Further insight into the presence of noise in JM1-8850 in the context of the two semisupervised learning approaches is presented in Seliya et al. (2004) and Seliya et al. (2005).

Table 5. Average (initial) performance with semisupervised classification

LP      Type I    Type II   Overall
100     0.1475    0.4500    0.2080
250     0.1580    0.4720    0.2208
500     0.1575    0.4820    0.2224
1000    0.1442    0.5600    0.2273
1500    0.1669    0.5233    0.2382
2000    0.1590    0.5317    0.2335
3000    0.2132    0.4839    0.2673

Table 6. Average (final) performance with semisupervised classification

LP      Type I    Type II   Overall
100     0.0039    0.0121    0.0055
250     0.0075    0.0227    0.0108
500     0.0136    0.0439    0.0206
1000    0.0249    0.0968    0.0428
1500    0.0390    0.1254    0.0593
2000    0.0482    0.1543    0.0752
3000    0.0830    0.1882    0.1094

The software quality estimation performance of the semisupervised classification approach for the three test datasets is shown in Table 7. The table shows the average performance over the different samples for the LP values of 500 and 1000. In the case of LP = 1000, semisupervised clustering (see previous section) provides better prediction for the KC1, KC2, and KC3 test datasets. The noticeable difference between the two techniques for these three datasets is observed in the respective Type II error rates. While providing relatively similar or comparable Type I error rates, semisupervised clustering with expert input yields much lower Type II error rates than the EM-based semisupervised classification approach. For LP = 500, the semisupervised clustering with expert input approach provides better software quality prediction for the KC1 and KC2 datasets. In the case of KC3, with a comparable Type I error rate, the semisupervised clustering approach provided a better Type II error rate. In summary, semisupervised clustering with expert input generally yielded better performance than EM-based semisupervised classification.

Table 7. Average test data performances with semisupervised classification

Dataset    Type I    Type II   Overall
LP = 500
KC1        0.0703    0.7329    0.1725
KC2        0.1072    0.4245    0.1719
KC3        0.1118    0.5209    0.1502
LP = 1000
KC1        0.0700    0.7528    0.1753
KC2        0.1031    0.4465    0.1731
KC3        0.0988    0.5426    0.1405

We note that the choice between the two approaches for software quality analysis with limited apriori fault-proneness data may also be based on criteria other than software quality estimation accuracy. The EM-based semisupervised classification approach requires minimal input from the expert other than incorporating the desired software quality modeling strategy. In contrast, the semisupervised clustering approach requires considerable input from the software engineering expert in labeling new program modules (clusters) as nfp or fp.
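The LP = 500 comparison between Tables 4 and 7 can be checked with a few lines of arithmetic. The sketch below copies the reported averages for LP = 500 from the two tables and computes, for each test dataset, how much higher the Type II error rate of EM-based semisupervised classification is than that of semisupervised clustering. The dictionary layout and the helper name `type_ii_gap` are assumptions made for illustration.

```python
# Average test data error rates for LP = 500, copied from Tables 4 and 7.
# Each entry is (Type I, Type II, Overall).
clustering_lp500 = {
    "KC1": (0.0846, 0.4708, 0.1442),
    "KC2": (0.1039, 0.3302, 0.1500),
    "KC3": (0.1181, 0.4186, 0.1463),
}
classification_lp500 = {
    "KC1": (0.0703, 0.7329, 0.1725),
    "KC2": (0.1072, 0.4245, 0.1719),
    "KC3": (0.1118, 0.5209, 0.1502),
}


def type_ii_gap(dataset: str) -> float:
    """Type II rate of EM-based classification minus that of clustering."""
    gap = classification_lp500[dataset][1] - clustering_lp500[dataset][1]
    return round(gap, 4)


for name in ("KC1", "KC2", "KC3"):
    print(name, type_ii_gap(name))
```

A positive gap means semisupervised clustering missed fewer fp modules. The gaps come to roughly 0.26 for KC1, 0.09 for KC2, and 0.10 for KC3, while the Type I rates of the two approaches differ by less than 0.02 on every dataset.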
However, based on our study, it is likely that the effort put into the semisupervised clustering approach would yield a fruitful outcome in improving the quality of the software product.

Conclusion

The increasing reliance on software-based systems further stresses the need to deliver high-quality software that is very reliable during system operations. This makes the task of software quality assurance as vital as delivering a software product within allocated budget and scheduling constraints. The key to developing high-quality software is the measurement and modeling of software quality, and toward that objective various activities are utilized in software engineering practice, including verification and validation, automated test case generation for additional testing, re-engineering of low-quality program modules, and reviews of software design and code.

This research presented effective data mining solutions for tackling very important yet unaddressed software engineering issues. We address software quality modeling and analysis when there is limited apriori fault-proneness defect data available. The proposed solutions are evaluated using case studies of software measurement and defect data obtained from multiple NASA software projects, made available through the NASA Metrics Data Program. In the case when the development organization has experience in developing systems similar to the target project but has limited availability of defect data for those systems, the software quality assurance team could employ either the EM-based semisupervised classification approach or the semisupervised clustering approach with expert input. In our comparative study of these two solutions for software quality analysis with limited defect data, it was shown that the semisupervised clustering approach generally yielded better software quality prediction than the semisupervised classification approach.
However, once again, the software quality assurance team may also want to consider the relatively higher complexity involved in the semisupervised clustering approach when making their decision.

In our software quality analysis studies with the EM-based semisupervised classification and semisupervised clustering with expert input approaches, an exploratory analysis of program modules that remain unlabeled after the different semisupervised learning runs provided valuable insight into the characteristics of those modules. A data mining point of view indicated that many of them were likely noisy instances in the JM1 software measurement dataset (Seliya et al., 2004; Seliya et al., 2005). From a software engineering point of view, we are interested in learning why those specific modules remain unlabeled after the respective semisupervised learning runs. However, due to the unavailability of other detailed information on the JM1 and other NASA software projects, a further in-depth analysis could not be performed.

An additional analysis of the two semisupervised learning approaches was performed by comparing their prediction performances with software quality classification models built by using the C4.5 supervised learner trained on the respective initial labeled datasets (Seliya et al., 2004; Seliya et al., 2005). It was observed (results not shown) that both semisupervised learning approaches generally provided better software quality estimations compared to the supervised learners trained on the initial labeled datasets. The software engineering research presented in this chapter can lead to further related research in software measurements and software quality analysis.