Multi-Objective Approach For Software Module Clustering
Multi-Objective Approach For Software Module Clustering
Multi-Objective Approach For Software Module Clustering
Asadi Srinivasulu
Department of Information Technology, Sree Vidyanikethan engineering College
I. INTRODUCTION Many interesting software systems are large and complex, which is difficult to understand their structure [7]. One of the reasons for the complexity is source code contains many entities (e.g., classes, modules etc.) that depend on each other in intricate ways (e.g., procedure calls, variable references etc). Once a software engineer understands the systems structure [7], it is difficult to preserve their understanding, because the structure tends to change during maintenance. Creating a good model of the structure [7] of a complex is one of the problems that confront modern software developers. In order to solve the above problem, the Reverse Engineering Research Community has developed techniques to decompose (partition) the structure of software system into meaningful subsystems (clusters). Subsystems provide developers with high-level structure information about the software components, their interfaces, and their inter connections. Subsystems generally consist of a collection of collaborating source code resources that implement a feature or provide a service to the rest of the system. Software clustering problem has defined techniques that partition the structure of a software system into systems. Systems are collections of source code resources that exhibit similar feature, properties or behavior. Because there are far fewer subsystems than modules studying the subsystem structure is easier than trying to understand the system by analyzing the source code manually. There are many ways to approach the module clustering problem. Following Mancoridis et al., who first suggested the search-based approach [13] to module
clustering, this paper follows the search-based approach. In the search based approach, the attributes of a good modular decomposition are formulated as objectives, the evaluation of which as a fitness function [12] guides a search-based optimization algorithm. In previous work, multi-objective approach was used, in that they were introduced two novel multi-objective formulations i.e., Maximizing Cluster Approach (MCA) and Equal-size Cluster Approach (ECA) [1]. But there is no evidence shows that these two approaches provide optimal results for unweighted MDG graph. So, to overcome this problem, we introduce MultiObjective approach by increasing the number of clusters having same number of modules in each cluster. By this approach we can increase the MQ value. Increasing MQ value will get optimal results for unweighted graph also. In this paper we introduce, pareto optimality for multiobjective approach. Pareto optimality gives solution for the problem having multiple objectives. The primary contributions of this paper are as follows: (1) Multi-objective approach was introduced for software module clustering. (2) Pareto optimal genetic algorithm was introduced for multi-objective approach. (3) Empirical study shows that multi-objective approach presents better results for unweighted graph than the two novel approaches i.e., MCA and ECA. The rest of this paper is organized as follows: Section 2 presents background and related work on software module clustering. Section 3 introduces the multi-objective approach for modularization. Section 4 results. Section 5 concludes.
www.ijarcsse.com Where n is the number of clusters. But there is a drawback in single-objective search problem, so multi-objective formulations, MCA and ECA are introduced. Here multiple objectives are considered for software module clustering. But these approaches are not produced optimal results for unweighted MDG graph. III. MULTI-OBJECTIVE APPROACH
There are many metaheuristic [14], [9] methods have been applied to software module clustering. The field was established by the seminal work of the Drexel group. In this work, hill climbing [2] was the primary search technique, leading to development of Bunch tool [3] for automated software module clustering. Several search techniques are also applied to software module clustering including simulated annealing and genetic algorithms. These techniques are outperformed to the hill-climbing search technique. The representation and fitness function are needed to formulate the software engineering problem. In order to module clustering problem, Module Dependency Graph (MDG) was used to represent the problem. In MDG, modules are the nodes and relationships are the edges. Edges can be weighted to indicate the strength of the relationship, or unweighted to indicate the presence and absence of relationship. In previous work (and in the present paper), a module is taken to be a file and a relationship is an inclusion of reference relationship between files (e.g., a method invocation). In search based approach, the attributes of a good modular decomposition are formulated as objectives, the evaluation of which as a fitness function [12] guides a search based optimization algorithm. Single-objective approach used the Modularization Quality measure, introduced by Mancordis et al. The intra-edges are those for which the source and target of the edge lie inside the same cluster. The interedges are those for which the source and target lie in distinct clusters. MQ is the sum of Modularization Factor (MFk) which is the ratio of intra-edges [4] and intra-edges [6] in each cluster. Modularization Factor (MFk) for cluster k can be defined as follows: 0, =
+1 2
Existing two novel multi-objective formulations are not produced optimal results for unweighted MDG graph. Now, we are introducing Pareto optimality which is an alternative approach to handling multiple objectives. A. Pareto Optimality The use of Pareto optimality is an alternative to aggregated fitness. It is superior in many ways. Using Pareto optimality [11], [15], it is not possible to measure how much better one solution is than another, merely to determine whether one solution is better than another. In this way, Pareto optimality combines a set of measurements into a single ordinal scale metrics, as follows:
1 2 . (1 ) (2 )
and, for strict inequality:
. 1
1 > 2 2 . (1 ) > (2 )
= 0 , > 0
where i is the weight of intra-edges and j is that of interedges, that is, j is the sum of edge weights for all edges that originate or terminate in cluster k. The reason for the 1 occurrence of the term in the above equation (rather than 2 merely j) is to split the penalty of the inter-edge across the two clusters that connected by that edge. If the MDG is unweighted, then the weights are set to 1.
=
=1
Thus, under Pareto optimality, one solution is better than another if it is better according to at least one of the individual fitness functions and no worse according to all of the others. Under the Pareto interpretation of combined fitness, no overall fitness improvement occurs no matter how much almost all of the fitness functions improve, should they do so at the slightest expense of any one of their number. When searching for solutions to a problem using Pareto optimality, the search yields a set of solutions that are non dominated. That is, each member of the nondominated set is no worse than any of the others in the set, but also cannot be said to be better. Any set of non dominated solutions forms a Pareto front. Consider Figure 1, which depicts the computation of Pareto optimality for two imaginary fitness functions. The longer the search algorithm is run the better the approximation becomes to the real Pareto front. The approximation of the Pareto front is also a useful analysis tool in itself. For example, it may contain knee points, where a small change in one fitness is accompanied by a large change in another. These knee points denote interesting parts of the solution space that warrant closer investigation.
Page | 267
www.ijarcsse.com To illustrate the Multi-Objective Approach, consider the MDG in fig. 2. The objective values are as follows: The sum of intra-edges of all clusters: 9 The sum of inter-edges of all clusters: -6 The number of clusters: 3 Modularization Quality: 2.014836 The number of isolated clusters: 0 The difference between the maximum and minimum number of modules in a cluster: 1 C. Genetic Algorithm Genetic algorithms [5] uses the concept of population and recombine. Of all optimization algorithms, genetic algorithm [8] is the most widely applied search based technique in Search Based Software Engineering. Figure 3 shows the generic genetic algorithm. An iterative process is executed, initialized by randomly chosen population. The iterations are called generations and the members of a population are called chromosomes. When a population satisfies some pre-determined condition or certain number of generations has been exceeded, the process will be terminated. On each generation, some members of the population are recombined, crossing over elements of their chromosomes. A fraction of the offspring of this union are mutated and, from the offspring and the original population a selection process is used to determine the new population. Crucially, recombination and selection are guided by the fitness function; fitter chromosomes having a greater chance to be selected and recombined.
B. Multi-Objective Approach The Maximizing Cluster Approach uses the following set of objectives: The sum of intra-edges of all clusters (maximizing), The sum of inter-edges of all clusters (minimizing), The number of clusters (maximizing), Modularization Quality (maximizing), The number of isolated clusters (minimizing). the difference between the maximum and minimum number of modules in a cluster (minimizing) The aim of this approach is to capture the attributes of a good clustering. It will have maximum cohesion and minimum coupling. Intra edges, inter edges and Modularization Quality are used to measure the quality of the system partitioned. So these are included as objectives. Here the isolated clusters are nothing but cluster contains only one module. However it should not put all modules in an each cluster and not produced series of isolated clusters. The MQ value will be increased when more number of clusters in a system, so we included number of clusters as an objective. We have to make the clusters of equal number of modules, so we included the difference between the maximum and minimum number of modules in a cluster as an objective. This approach has been implemented using genetic algorithm.
Set generation number, m=0 Choose the initial population of candidate solutions, P(0) Evaluate the fitness for each individuals of P(0), F(Pi(0)) Loop Recombine: P(m) := R(P(m)) Mutate: P(m) := M(P(m)) Evaluate: F(P(m)) Select: P(m+1) := S(P(m)) m := m+1 exit when goal or stopping condition is satisfied end loop;
Fig. 3 A generic genetic algorithm
Page | 268
www.ijarcsse.com
weighted
Nodes Edges Description 20 57 An operating system for educational purposes written in the turning language ispell 24 103 Software for spelling and typographical error correction in files rcs 29 163 Revision Control System used to manages multiple revisions of files bison 37 179 General-purpose parser generator for converting grammar description into c programs grappa 86 295 Genome rearrangement analyzer under parsimony and other phylogenetic algorithms bunch 116 365 Software Clustering tool (Essential java classes only) incl 174 360 Graph drawing tool icecast 60 650 Streaming media server based on the MP3 audio codec gnupg 88 601 Complete implementation of the OpenPGP Internet standard inn 90 624 Unix news group software bitchx 23 729 Open source IRC client xntp 111 729 Time synchronization tool exim 23 1255 Message transfer agent for use on Unix systems connected to the Internet mod_ssl 135 1095 Apache SSL/TLS Interface ncurses 138 682 Software for display and update of text on text-only terminals lynx 23 1745 Web browser for users on UNIX and VMS platforms nmh 198 3262 Mail client software
modules. Table 1 presents details of the subject MDGs studied. These systems are not necessarily degraded systems in terms of their modular structure, but they have been studied widely by other researches to evaluate their algorithms for module clustering and so they denote reasonable choices for comparison. This section presents the results of the experiment that compare the MQ value obtained for three approaches, MCA, ECA and Multi-Objective Approach (MO) described earlier in previous section. i.e., the results assess how well the Multi-Objective Approach perform when compared with MCA and ECA using MQ value. Table 2 and 3 presents the results comparing MultiObjective Approach with MCA and ECA. In Table 2, the results from two approaches, Multi-Objective approach and MCA were comparable. There is good evidence to suggest that our approach is outperformed the MCA approach for weighted and unweighted MDG problems. For unweighted MDG problem, Multi-Objective approach gives higher values for MQ in six from seven problems. However for weighted MDG problem, the results shows that multi-objective approach gives higher values for MQ in nine from ten problems.
There are many variations on this overall process, but the crucial ingredients are the way in which the fitness guides the search, the recombine and the population based nature of the process. There is also a variation of genetic algorithms, called genetic programming. Genetic programming has been used in SBSE to form formula that captures predictive models of software projects and in testing. IV. RESULTS In experimental study, the algorithm is applied to 17 different Module Dependency Graphs (MDGs). The modules in MDGs are varying from 20 to 100. There are two types of MDGs are taken for experiment. The first type is weighted MDGs and second type is unweighted MDGs shown in table 1. In unweighted MDG graphs, an edge denotes a unidirectional method or a variable passed between two modules, where as weighted MDG graph is assigned by considering the number of unidirectional method or variables passed between two modules; the greater the weight, the more dependency between two
Page | 269
Name unweighted mutunis ispell Rcs bison grappa bunch incl icecast gnupg Inn bitchx xntp exim mod_ssl ncurses lynx nmh
weighted
MultiObjective 2.198 2.412 2.279 2.654 12.732 13.356 13.648 2.762 7.210 7.926 3.487 8.786 6.482 10.021 11.384 3.488 7.012
MCA 2.294 2.269 2.145 2.416 11.586 12.145 11.811 2.401 6.259 7.421 3.572 6.482 5.316 8.832 10.211 3.447 6.671
This paper introduces the multi-objective approach to software module clustering and presents the results for the application of this technique, comparing the results obtained with those from the existing multi-objective formulations on 17 real world module clustering problems. The result indicated that multi-objective approach produces superior results for unweighted MDG graph than the existing Multi-Objective formulations, MCA and ECA. Future work could be considering other possible objectives for better modularization. REFERENCES Kata Praditwong, Mark Harman, and Xin Yao, Software Module Clustering as a Multi-Objective Search Problem, Vol. 37, No. 2, March/April 2011. [2] K. Mahdavi, M. Harman, and R.M. Hierons, A Multiple Hill Climbing Approach to Software Module Clustering, Proc. IEEE Intl Conf. Software Maintenance, pp. 315-324, Sept. 2003. [3] S. Mancoridis, B.S. Mitchell, Y.-F. Chen, and E.R. Gansner, Bunch: A Clustering Tool for the Recovery and Maintenance of Software System Structures, Proc. IEEE Intl Conf. Software Maintenance, pp. 5059, 1999. [4] J.M. Bieman and L.M. Ott, Measuring Functional Cohesion, IEEE Trans. Software Eng., vol. 20, no. 8, pp. 644-657, Aug. 1994. [5] M. Bowman, L. Briand, and Y. Labiche, MultiObjective Genetic Algorithms to Support Class Responsibility Assignment, Proc. 23rd IEEE Intl Conf. Software Maintenance, Oct. 2007. [6] L.C. Briand, J. Feng, and Y. Labiche, Using Genetic Algorithms and Coupling Measures to Devise Optimal Integration Test Orders, Proc. 14th Intl Conf. Software Eng. and Knowledge Eng., pp. 43-50, 2002. [7] L.L. Constantine and E. Yourdon, Structured Design. Prentice Hall, 1979. [8] D. Doval, S. Mancoridis, and B.S. Mitchell, Automatic Clustering of Software Systems Using a Genetic Algorithm, Proc. Intl Conf. Software Tools and Eng. Practice, Aug.-Sept. 1999. [9] B.S. Mitchell, A Heuristic Search Approach to Solving the Software Clustering Problem, PhD thesis, Drexel Univ., Jan. 2002. [10] B.S. Mitchell and S. Mancoridis, On the Automatic Modularization of Software Systems Using the Bunch Tool, IEEE Trans. Software Eng., vol. 32, no. 3, pp. 193-208, Mar. 2006. [11] M. Harman, The Current State and Future of Search Based Software Engineering, Future of Software Eng. 2007, L. Briand and A. Wolf, eds., IEEE CS Press, 2007. [1]
TABLE 3 COMPARISON OF MULTI-OBJECTIVE APPROACH AND ECA USING MQ VALUES AS AN ASSESSMENT CRITERIA
Name unweighted mutunis ispell rcs bison grappa bunch incl icecast gnupg inn bitchx xntp exim mod_ssl ncurses lynx nmh
weighted
MultiObjective 2.198 2.412 2.279 2.654 12.732 13.356 13.648 2.762 7.210 7.926 3.487 8.786 6.482 10.021 11.384 3.488 7.012
ECA 2.314 2.339 2.239 2.648 12.578 13.455 13.511 2.654 6.905 7.876 4.267 8.168 6.361 9.749 11.297 4.694 8.592
In Table 3, the results from two approaches, MultiObjective approach and ECA were comparable. There is strong evidence to suggest that our approach is outperformed the ECA approach for weighted and unweighted MDG problems. For unweighted MDG problem, Multi-Objective approach gives higher values for MQ in 5 from 7 problems. However for weighted MDG problem, Multi-Objective approach gives higher values for MQ in 8 from 10 problems.
Page | 270
Volume 2, Issue 3, March 2012 [12] Baresel, H. Sthamer, and M. Schmidt. Fitness function design to improve evolutionary structural testing. In GECCO 2002: Proceedings of the Genetic and Evolutionary Computation Conference, pages 1329 1336, San Francisco, CA 94104, USA, 9-13 July 2002. Morgan Kaufmann Publishers. [13] M. Harman and B. F. Jones. Search based software engineering. Information and Software Technology, 43(14):833839, Dec. 2001. [14] Z. Li,M. Harman, and R. Hierons. Meta-heuristic search algorithms for regression test case prioritization. IEEE Transactions on Software Engineering. To appear. [15] S. Yoo and M. Harman, Pareto Efficient MultiObjective Test Case Selection, Proc. Intl Symp. Software Testing and Analysis, pp. 140-150, July 2007. AUTHORS BIOGRAPHY Author 1: Kishore C received the B.Tech Information Technology from JNTUA, Anantapur, India in 2010 and pursuing his Masters degree in Software Engineering from the JNTUA, Anantapur, India. He has published one international conference. His research areas are Software Engineering, Data warehousing and Data Mining, Database Management Systems and Cloud Computing. Author 2: Asadi Srinivasulu received the B Tech Computer Science Engineering from Sri Venkateswara University, Tirupati, India in 2000 and M.Tech with Intelligent Systems in IT from Indian Institute of Information Technology, Allahabad (IIIT) in 2004 and he is pursuing Ph.D in CSE from J.N.T.U.A, Anantapur, India. He has got 10 years of teaching and industrial experience. He served as the Head, Dept of Information Technology, S V College of Engineering, Karakambadi, Tirupati, India during 2007-2009. His areas of interests include Data Mining and Data warehousing, Intelligent Systems, Image Processing, Pattern Recognition, Machine Vision Processing and Cloud Computing. He is a member of IAENG, IACSIT. He has published more than 25 papers in International Journals and Conferences. Some of his publications appear in IJCSET, IJCA and IJCSIT digital libraries. He visited Malaysia and Singapore.
www.ijarcsse.com
Page | 271