
Moving to Smaller Libraries via Clustering and Genetic Algorithms

2003

There may be several reasons to reduce a software system to its bare bones, removing the extra fat introduced during development or evolution. Porting the software system to embedded devices or palmtops are just two examples. This paper presents an approach to re-factoring libraries with the aim of reducing the memory requirements of executables. The approach is organized in two steps. The first step defines an initial solution based on clustering methods, while the subsequent phase refines the initial solution via genetic algorithms. In particular, a novel genetic algorithm approach is proposed, which considers the initial clusters as the starting population and adopts a knowledge-based mutation function and a multi-objective fitness function. The approach has been applied to several medium and large-size open source software systems such as GRASS, KDE-QT, Samba and MySQL, allowing one to effectively produce smaller, loosely coupled libraries, and to reduce the memory requirement for each application.

G. Antoniol, M. Di Penta, M. Neteler

Irst-ITC Istituto Trentino Cultura, Via Sommarive 18, 38050 Povo (Trento), Italy

RCOST - Research Centre on Software Technology, University of Sannio, Department of Engineering, Palazzo ex Poste, Via Traiano, I-82100 Benevento, Italy

Keywords: library re-factoring, clustering, genetic algorithms

1. Introduction

Software miniaturization deals with a particular form of re-factoring that aims to reduce some measure of the size of a software system. Consider, for example, embedded systems: the amount of resources available is often limited, and thus developers are interested in reducing the footprint of executables. Applications running on hand-held devices have similar, even if less stringent, resource requirements. All in all, it is not infrequent that, as described in [7], the software extra fat needs to be eliminated or reduced.

Clearly, several actions may be taken. First and foremost, dead code and software clones should be removed. Furthermore, some form of restructuring, at library and at object file level, may be required. The latter intervention must deal with dependencies among software artifacts.

For any given software system, dependencies among executables and object files may be represented via a dependency graph, i.e., a graph where nodes represent resources and edges represent the resource dependencies. Each library, in turn, may be thought of as a sub-graph of the overall object file dependency graph. Therefore, software miniaturization can be modeled as a graph partitioning problem. Any partitioning of the graph into sub-graphs represents a solution to the problem, characterized by the resources used by each executable present in the system.

Unfortunately, it is well known that graph partitioning is an NP-complete problem [9], and thus heuristics are often adopted to find a sub-optimal solution. For example, one may be interested in first examining graph partitions that minimize the cross edges between the sub-graphs corresponding to libraries. More formally, a cost function describing the restructuring problem has to be defined, and heuristics driving the solution search process must be identified and applied.

In [7], a process to miniaturize software systems was proposed. The central idea is to apply clustering techniques to identify software libraries minimizing the average executable size. Doval et al. [8] applied Genetic Algorithms (GA) to find what they called meaningful partitions in a graph representing dependencies among software components. Other communities, such as the optimization community, addressed graph partitioning related problems in several ways. For example, constraints were incorporated by modifying the problem definition. To speed up the search process, heuristics based on GA and modified GA [27] were proposed.

This paper stems from the observation that previously proposed approaches to software miniaturization were not completely satisfactory. For example, it is not obvious whether pruning clones is beneficial in reducing the memory requirements of executables. Moreover, an approach based solely on clustering may be unable to find solutions easily identified by GA. Conversely, GA requires a starting population; choosing a random solution may not be very efficient, or it may lead to a local sub-optimal solution.

To overcome the aforementioned limitations, we propose a novel approach where an initial sub-optimal solution to library identification (i.e., a set of graph partitions) is determined via clustering approaches, and then followed by a GA search aimed at reducing the inter-library dependencies. GA is applied to a newly defined problem encoding, where genetic mutation may sometimes generate clones; such clones do indeed reduce the overall amount of resources required by the executables, in that they remove inter-library dependencies. Moreover, a multi-objective fitness function was defined, trying to keep low, at the same time, both the number of inter-library dependencies and the average number of objects linked by each application.

The approach was applied to improve the GRASS re-factoring presented in [7] and, to gain more empirical evidence, to other open source software systems such as KDE-QT, Samba and MySQL.

The paper is organized as follows. First, the essential background notions are summarized in Section 2; the re-factoring approach and support tools are described in Section 3. Information on the case study systems is reported in Section 4. Section 5 presents the case study results. Finally, an analysis of related work is reported in Section 6, before conclusions and work-in-progress.

2. Background Notions

To re-factor the libraries of a software system, clustering and GA were integrated in a semi-automatic, human-driven re-factoring process.

Clustering deals with the grouping of large amounts of things (entities) in groups (clusters) of closely related entities. Clustering is used in different areas, such as business analysis, economics, astronomy, information retrieval, image processing, pattern recognition, biology, and others.

GA come from an idea, born over 30 years ago, of applying the biological principle of evolution to artificial systems. GA are applied to different domains such as machine and robot learning, economics, operations research, ecology, studies of evolution, learning and social systems [10].

In the following sub-sections, for the sake of completeness, some essential notions are summarized. Describing the different types of clustering algorithms or the details of GA is outside the scope of this paper. More details can be found in [2, 14, 16] for clustering and in [10] for GA.

2.1. Clustering

In this paper, the agglomerative-nesting (agnes) algorithm [15] was applied to build the initial set of candidate libraries. Agnes is an agglomerative, hierarchical clustering algorithm: it builds a hierarchy of clusters in such a way that each level contains the same clusters as the level immediately below, except for two clusters, which are joined to form a single cluster. In particular, agglomerative algorithms build the dendrogram starting from the bottom of the hierarchy (where each entity represents a cluster) until, at the top level, all entities are grouped in a single cluster.

The key point of hierarchical clustering is determining the cut-point, i.e., the level to be considered in order to determine the actual clusters (e.g., in Figure 1 the cut-point determines a total of two clusters). As will be shown in Section 2.2, in this work such an operation was supported by the Silhouette statistic.

Figure 1. Hierarchical clustering dendrogram and cut-point.

2.2. Determining the Optimal Number of Clusters

To determine the optimal number of clusters, people traditionally rely on the plot of an error measure representing the within-cluster dispersion. The error measure decreases as the number of clusters $k$ increases, but for some $k$ the curve flattens. Traditionally, it is assumed that the elbow of the error curve indicates the appropriate number of clusters [11]. To overcome the limitations of such a heuristic approach, several methods have been proposed; see [11] for a comprehensive summary.

Kaufman and Rousseeuw [15] proposed the Silhouette statistic for estimating and assessing the optimal number of clusters. For the observation $i$, let $a(i)$ be the average distance to the other points in its cluster, and $b(i)$ the average distance to the points in the nearest cluster (other than its own); then the Silhouette statistic is defined as:

\[ s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}} \qquad (1) \]

Kaufman and Rousseeuw suggested choosing the optimal number of clusters as the value maximizing the average $s(i)$ over the dataset. Notice that the Silhouette statistic, like most of the methods described in [11], has the disadvantage that it is undefined for one cluster, and thus it offers no indication of whether the current dataset already represents a good cluster. However, since our purpose is to split the original libraries into smaller ones, in our case this does not constitute a problem.

Figure 2. The re-factoring process. Stages and their outputs: analysis of system objects (dependency graphs), Silhouette analysis (optimal number of clusters), agglomerative-nesting clustering (candidate libraries), cluster refining by GA (refined libraries).
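As an illustration of Sections 2.1 and 2.2, the following minimal Python sketch builds an agglomerative hierarchy, cuts it at every level, and picks the number of clusters that maximizes the average Silhouette statistic (1). It is not the agnes implementation used in the paper: the average-linkage criterion, the Hamming distance, the convention that singletons score zero, the function names and the toy data are all assumptions made for the example.

```python
from itertools import combinations

def hamming(u, v):
    """Number of positions in which two 0/1 vectors differ (assumed distance)."""
    return sum(a != b for a, b in zip(u, v))

def average_linkage(a, b, points, dist):
    """Average pairwise distance between the members of clusters a and b."""
    return sum(dist(points[i], points[j]) for i in a for j in b) / (len(a) * len(b))

def agglomerate(points, dist):
    """Return {k: partition into k clusters}, one entry per dendrogram level."""
    clusters = [[i] for i in range(len(points))]      # bottom level: singletons
    levels = {len(clusters): [c[:] for c in clusters]}
    while len(clusters) > 1:
        x, y = min(combinations(range(len(clusters)), 2),
                   key=lambda p: average_linkage(clusters[p[0]], clusters[p[1]],
                                                 points, dist))
        clusters[x] = clusters[x] + clusters[y]       # join the two closest clusters
        del clusters[y]
        levels[len(clusters)] = [c[:] for c in clusters]
    return levels

def average_silhouette(points, clusters, dist):
    """Mean of s(i) = (b(i) - a(i)) / max(a(i), b(i)); needs at least two clusters."""
    total = 0.0
    for own in clusters:
        for i in own:
            if len(own) == 1:
                continue                              # assumed convention: s(i) = 0 for singletons
            a = sum(dist(points[i], points[j]) for j in own if j != i) / (len(own) - 1)
            b = min(sum(dist(points[i], points[j]) for j in other) / len(other)
                    for other in clusters if other is not own)
            if max(a, b) > 0:
                total += (b - a) / max(a, b)
    return total / len(points)

def choose_k(points, dist, k_max):
    """Cut the hierarchy at k = 2..k_max and keep the cut with the best average s(i)."""
    levels = agglomerate(points, dist)
    scores = {k: average_silhouette(points, levels[k], dist)
              for k in range(2, k_max + 1)}
    best = max(scores, key=scores.get)
    return best, levels[best], scores

# Hypothetical data: one row per object, one column per application (1 = used by)
objects = [
    [1, 1, 0, 0], [1, 1, 0, 0], [1, 1, 1, 0],
    [0, 0, 1, 1], [0, 0, 1, 1], [0, 0, 0, 1],
]
best_k, best_clusters, scores = choose_k(objects, hamming, k_max=5)
print(best_k, best_clusters)
```

The search starts at k = 2 because, as noted above, the statistic is undefined for a single cluster.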
2.3. Genetic Algorithms

GA revealed their effectiveness in finding approximate solutions for problems where:

• The search space is large or complex;
• No mathematical analysis is available;
• Traditional search methods did not work; and, above all,
• The problem is NP-complete [9, 27].

Roughly speaking, a GA may be defined as an iterative procedure that searches for the best solution of a given problem among a constant-size population of individuals, each represented by a finite string of symbols, the genome. The search starts from an initial population of individuals, often randomly generated. At each evolutionary step, individuals are evaluated using a fitness function. High-fitness individuals have the highest probability of reproducing. The evolution (i.e., the generation of a new population) is carried out by means of two kinds of operators: the crossover operator and the mutation operator. The crossover operator takes two individuals (the parents) of the old generation and exchanges parts of their genomes, producing one or more new individuals (the offspring). The mutation operator was introduced to prevent convergence to local optima, in that it randomly modifies an individual's genome (e.g., by flipping some of its bits if the genome is represented by a bit string). Crossover and mutation are performed on each individual of the population with probability pcross and pmut respectively, where pmut << pcross. A GA is not guaranteed to converge: the termination condition is often specified as a maximal number of generations, or as a given value of the fitness function. The GA behavior can be represented in pseudo-code as shown below:

    Initialize population P[0];
    g = 0;  // Generation counter
    while (g < max_number_of_generations)
        // Apply the fitness function to the current population
        Evaluate P[g];
        // Advance to the next generation
        g = g + 1;
        // Make a list of pairs of individuals likely to mate (best fitness)
        Select P[g] from P[g-1];
        // Crossover with probability pcross on each pair
        Crossover P[g];
        // Mutation with probability pmut on each individual
        Mutate P[g];
    end while

3. The Re-factoring Method

This section describes the proposed re-factoring process, derived from the one already described in [4, 7], then refined with the Silhouette statistic for determining the optimal number of clusters, and with GA for minimizing inter-library dependencies. A flow diagram of the re-factoring process is depicted in Figure 2.

3.1. Basic Factoring Criteria and Representation

As described in [4, 7], given a system composed of $n_a$ applications and a set of libraries, the idea is to re-factor the biggest libraries, splitting them into two or more smaller clusters, such that each cluster contains symbols used by a common subset of applications (i.e., we made the assumption that symbols often used together should be contained in the same library). To this end, for each library $l$ to be re-factored, a Boolean matrix $MD_l$, composed of $n_a$ rows and $n_l$ columns, was built, such that:

\[ MD_l[i, j] = \begin{cases} 1 & \text{if application } a_i \text{ uses object } o_j \\ 0 & \text{otherwise} \end{cases} \]

where $O_l \equiv \{o_1, o_2, \ldots, o_{n_l}\}$ is the set of objects of the library $l$ (archiving $n_l$ objects).

3.2. Determining the Optimal Number of Clusters

As explained in Section 2.2, the optimal number of clusters was computed on each MD matrix by applying the Silhouette statistic. Given the curve of the average Silhouette values for different numbers $k$ of clusters, instead of considering the maximum (often too high for our re-factoring purpose), for some libraries we chose as the optimal number the knee of that curve [15]. We also incorporated experts' knowledge in the choice, and we considered a tradeoff between excessive fragmentation and library size. Examples of the Silhouette statistic are shown in Figure 4.

3.3. Determining the Sub-Optimal Libraries by Clustering

Once the number of clusters for each "old library" is known, agglomerative-nesting clustering is performed on each MD matrix. This builds a dendrogram and a vector of heights, which allow identifying $k$ clusters. These clusters are the new candidate libraries.

A measure of the performance of the re-factoring process was introduced in [7]. Let $k$ be the number of clusters $l_1, \ldots, l_k$ obtained from a library $l$. Then, the Partitioning Ratio $PR(l)$ can be defined as:

\[ PR(l) = 100 \cdot \frac{\sum_{i=1}^{n_a} \sum_{j=1}^{k} u_{i,j} \cdot |l_j|}{n_a \cdot |l|} \qquad (2) \]

where:

• $u_{i,j}$ is equal to one if the application $a_i$ uses the library $l_j$ (and zero otherwise);
• $|l_j|$ is the number of objects archived in library $l_j$.

The smaller the PR, the more effective the partitioning, in that the average number of objects linked (or loaded) by each application is smaller than when using the old whole library.

3.4. Reducing Dependencies using Genetic Algorithms

The solution reached at the previous step presents a drawback: the number of dependencies between the new libraries could be high, forcing another library to be loaded each time a symbol from that library is needed, and therefore wasting the advantage of having new, smaller libraries. Of course, as shown in [7], an important step to perform is moving to dynamically loadable libraries, so that each (small) library is loaded at run-time only when needed, and then unloaded when it is no longer useful. In this case, even if there are dependencies among libraries, the average number of libraries in memory is considerably smaller than in the original system.

Given $O \equiv \{o_1, o_2, \ldots, o_n\}$, the set of all objects contained in the candidate libraries produced in the previous step, we built a dependency graph, defined as follows:

\[ DG \equiv \{O, D\} \qquad (3) \]

where $O$ is the set of nodes representing the objects, and $D \subseteq O \times O$ is the set of oriented edges $\langle o_i, o_j \rangle$ representing dependencies between objects. Given two libraries, $l_a \subseteq O$ and $l_b \subseteq O$, we say that there is a dependency between the two libraries iff $\exists\, o_i \in l_a,\ o_j \in l_b : \langle o_i, o_j \rangle \in D$.

The removal of inter-library dependencies can therefore be brought back to a graph partitioning problem which, as shown in [27], is NP-complete; a GA was used to reach an approximate solution of the problem (i.e., to minimize the number of dependencies). A GA requires the specification of:

1. The genome encoding;
2. The initial population;
3. The fitness function;
4. The crossover operator;
5. The mutation operator; and
6. All GA parameters, such as the crossover and mutation probability, the population size and the number of generations.

An approach to clustering functions using GA was discussed in [8]. However, as shown below, in our case the genome encoding, the initial population, the mutation operator and the fitness function are different.

The encoding schema widely adopted in the literature [8, 27] indicates each partition with an integer $p$ such that $0 \le p \le k - 1$ (where $k$ is the number of candidate libraries), and represents the genome as an $n$-size array $G$, where the integer $p$ in position $q$ means that the function $q$ is contained in partition $p$.

However, as explained in [7], our purpose is the reduction of memory requirements for each application; therefore, sometimes "cloning" an object in different libraries may help reduce the number of libraries to be linked by the application itself. The encoding schema mentioned above does not allow an object to be contained in more than one library. We therefore adopted a bit-matrix encoding, where the genome for each library to re-factor is a bit matrix with one row per candidate library and one column per object (true values are indicated by "1" and false values by "0"). Clearly, the presence of the same object in more than one library is indicated by more than one "1" in the same column.
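To make the bit-matrix encoding and the Partitioning Ratio more concrete, here is a minimal Python sketch. It is an illustration under stated assumptions rather than the authors' tool: the helper names and the toy data are hypothetical, and the PR computation assumes that an application links every candidate library containing at least one object it uses (which is only meaningful when no object has been cloned).

```python
def to_bit_matrix(assignment, k):
    """Convert an integer-array genome (object q -> library p) into a
    k x n bit matrix: one row per library, one column per object."""
    return [[1 if p == lib else 0 for p in assignment] for lib in range(k)]

def clone(genome, obj, lib):
    """Return a copy of the genome in which object `obj` is also archived in `lib`."""
    copy = [row[:] for row in genome]
    copy[lib][obj] = 1
    return copy

def libraries_of(genome, obj):
    """Indices of the libraries archiving object `obj` (several if it was cloned)."""
    return [lib for lib, row in enumerate(genome) if row[obj]]

def partitioning_ratio(usage, genome):
    """PR as a percentage, assuming no clones.

    usage[i][j] is 1 if application i uses object j (the MD matrix);
    an application is assumed to link each library holding an object it uses.
    """
    n_objects = len(genome[0])                 # |l|: size of the original library
    sizes = [sum(row) for row in genome]       # |l_j| for each candidate library
    linked = 0
    for app in usage:
        for row, size in zip(genome, sizes):
            if any(u and g for u, g in zip(app, row)):
                linked += size
    return 100.0 * linked / (len(usage) * n_objects)

# Hypothetical example: 5 objects split into k = 2 candidate libraries
genome = to_bit_matrix([0, 0, 1, 1, 1], k=2)
usage = [
    [1, 1, 0, 0, 0],   # application 0 uses o0, o1
    [0, 0, 1, 1, 0],   # application 1 uses o2, o3
    [0, 1, 0, 0, 1],   # application 2 uses o1, o4
]
pr = partitioning_ratio(usage, genome)
cloned = clone(genome, obj=1, lib=1)           # o1 now lives in both libraries
print(round(pr, 2), libraries_of(cloned, 1))
```

The integer-array encoding cannot express the cloned genome, since position 1 of the array would have to hold two library indices at once; in the bit matrix it is simply a column with two ones.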
Instead of randomly generating the initial population (i.e., the initial libraries), the GA was initialized with the encoding of the set of libraries obtained in the previous step.

The fitness function was constructed to balance three factors:

1. The number of inter-library dependencies at a given generation;
2. The total number of objects linked to each application that, as said, should be as small as possible; and
3. The size of the new libraries.

Without taking into account the last item, it could happen that the GA, in the attempt to reduce dependencies, groups a large fraction of the objects in the same library, negatively affecting the PR.

The dependency graph was encoded as a matrix of adjacencies MDG:

\[ MDG[i, j] = \begin{cases} 1 & \text{if } \langle o_i, o_j \rangle \in D \\ 0 & \text{otherwise} \end{cases} \]

The first factor, the Dependency Factor (DF), was defined as:

\[ DF(g) = \sum_{i=1}^{n} \sum_{j=1}^{n} MDG[i, j] \cdot d_g(i, j) \]

where $d_g(i, j)$ is zero if, at generation $g$, the objects $o_i$ and $o_j$ are archived in the same library, and one otherwise.

The second factor, the Linking Factor (LF), takes into account the number of objects linked by all the applications, if the candidate libraries were those produced at generation $g$. LF is obviously proportional to the PR, and therefore from this point on we will not distinguish between the two terms. $LF(g)$ is computed as follows:

\[ LF(g) = \sum_{i=1}^{n_a} \sum_{j=1}^{k} u_{i,j}(g) \cdot |l_j| \]

where $u_{i,j}(g)$ is one if application $a_i$ links library $l_j$ at generation $g$ (and zero otherwise), and $|l_j|$ is the number of objects contained in $l_j$. The libraries linked by each application are determined on a working copy of the matrix MD (see Section 3.1): as a library is linked, all its objects are removed from the copy. It is worth noting that, if an object is contained in more than one library, the "most useful library" (i.e., the library containing the largest number of objects needed by that application) is considered.

The third factor, the Standard Deviation Factor (SDF), can be thought of as the difference between the standard deviation of the initial library sizes and the actual standard deviation (at the current generation). A similar factor was also applied in [27]. Given $S_0$, the array of library sizes for the initial population, and $S_g$, the same for the $g$-th generation:

\[ SDF(g) = |\sigma(S_0) - \sigma(S_g)| \]

Finally, the fitness function $F$ was defined as:

\[ F(g) = \frac{1}{DF(g) + w_{LF} \cdot LF(g) + w_{SDF} \cdot SDF(g)} \qquad (6) \]

where $w_{LF}$ and $w_{SDF}$ are real, positive weighting factors for the LF and SDF contributions to the overall fitness function. The higher $w_{LF}$ is, the smaller the overall number of objects linked by applications will be; on the other hand, raising $w_{LF}$ too much decreases dependency reduction. Similarly, the higher $w_{SDF}$ is, the more similar the result will be to the starting set of libraries, while an excessively high $w_{SDF}$ may prevent a satisfactory dependency reduction. Finally, it is worth noting that the GA maximizes the fitness function; therefore $F$ is the inverse of the weighted sum of the three factors.

As stated in (6), our fitness function is multi-objective [6, 12, 31]. Notice that we set a unitary weight for the DF, in that we aimed to maximize dependency reduction. Then we selected $w_{LF}$ and $w_{SDF}$ using a trial-and-error, iterative procedure, adjusting them each time until the DF, LF and SDF obtained at the final step were satisfactory. The process was guided by computing each time the average values of DF, LF and SDF, and also by plotting their evolution, in order to determine the 3D space region in which the population should evolve.
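The three factors and their combination (6) can be sketched in Python as follows. This is a rough illustration under several assumptions, not the GAlib-based implementation used in the paper: a dependency is treated as satisfied when some library archives both objects, the "most useful library" rule is implemented greedily, the population standard deviation is used for SDF, and all function names, weights and data are hypothetical.

```python
from statistics import pstdev

def dependency_factor(mdg, genome):
    """DF: dependencies o_i -> o_j whose endpoints share no library (assumption)."""
    n = len(mdg)
    return sum(1 for i in range(n) for j in range(n)
               if mdg[i][j] and not any(row[i] and row[j] for row in genome))

def linking_factor(md, genome):
    """LF: objects linked by all applications, using a greedy
    'most useful library' rule on a working copy of each MD row."""
    total = 0
    for app in md:
        needed = {j for j, used in enumerate(app) if used}
        while needed:
            best = max(genome, key=lambda row: sum(1 for j in needed if row[j]))
            covered = {j for j in needed if best[j]}
            if not covered:
                break                      # remaining objects are archived nowhere
            total += sum(best)             # the whole library is linked
            needed -= covered
    return total

def sdf(initial_genome, genome):
    """SDF: |sigma(S_0) - sigma(S_g)| on library sizes (population std assumed)."""
    s0 = [sum(row) for row in initial_genome]
    sg = [sum(row) for row in genome]
    return abs(pstdev(s0) - pstdev(sg))

def fitness(mdg, md, initial_genome, genome, w_lf, w_sdf):
    """F = 1 / (DF + w_lf * LF + w_sdf * SDF); DF has unitary weight."""
    cost = (dependency_factor(mdg, genome)
            + w_lf * linking_factor(md, genome)
            + w_sdf * sdf(initial_genome, genome))
    return 1.0 / cost if cost else float("inf")

# Hypothetical system: 4 objects, 2 libraries, 2 applications
initial = [[1, 1, 0, 0], [0, 0, 1, 1]]
mdg = [[0, 1, 0, 0],     # o0 -> o1 (same library)
       [0, 0, 1, 0],     # o1 -> o2 (crosses libraries)
       [0, 0, 0, 0],
       [0, 0, 1, 0]]     # o3 -> o2 (same library)
md = [[1, 1, 0, 0], [0, 0, 1, 1]]
with_clone = [[1, 1, 1, 0], [0, 0, 1, 1]]   # o2 cloned into library 0

f_initial = fitness(mdg, md, initial, initial, w_lf=0.1, w_sdf=0.1)
f_clone = fitness(mdg, md, initial, with_clone, w_lf=0.1, w_sdf=0.1)
print(f_initial, f_clone)
```

In this toy case the clone removes the only inter-library dependency at the price of a larger LF and SDF, which mirrors the trade-off the weights are meant to control; with larger weights the cloned genome would no longer score better.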
The crossover operator used in this paper is the one-point crossover: given two matrices, both are cut at the same random column, and the two portions are exchanged (Figure 3a). The mutation operator works in two modes:

1. Normally, it takes a random column and randomly swaps two rows: this means that, if the two swapped bits are different, then an object is moved from one library to another (Figure 3b);
2. With probability $p_{clone}$, it takes a random position in the matrix: if it is zero and the library is dependent on it, then the mutation operator clones the object into the current library (Figure 3c).

Figure 3. Genetic operators: (a) crossover at a random crossover point (parents and offspring); (b) mutation (move an object); (c) mutation (clone an object).

Of course, the cloning of an object increases both LF and SDF; therefore it should be minimized. Our GA activates the cloning only for the final part of the evolution (after 66% of generations in our case studies). Our strategy favors dependency minimization by moving objects between libraries; then, at the end, we attempt to remove the remaining dependencies by cloning objects.

The population size and the number of generations were chosen by an iterative procedure, doubling both each time until the obtained DF and LF (and thus the PR) were equal to those at the previous step.

3.5. Tool Support

To support the re-factoring process, different tools were needed, some of which were already described in [7]. In particular:

• The application identifier that, using the nm Unix tool, identifies the list of object modules containing the main symbol;
• The dependency graph extractor, also based on the nm tool, that produces the MD and the MDG matrices. The tool presented in [7] was modified in order to produce information in the format required by our GA tool;
• The number of clusters identifier: as said in Section 2.2, the number of clusters was determined using the Silhouette statistic. In particular, implementations available in the cluster package of the R Statistical Environment [1, 13] were used;
• The library re-factoring tool: it supports the process of splitting libraries into smaller clusters. As said in Section 3.3, this is performed by clustering algorithms. Again, the cluster analysis is performed by the agnes function available in the cluster package of the R Statistical Environment; and
• The GA library refiner: implemented in C++ using GAlib [30].

4. Case Studies

Although our primary objective was to improve the re-factoring of the biggest GRASS libraries, we applied our process to other open source systems, different in purpose and size. This gave us more empirical evidence for the validity of the proposed approach. As explained in Section 5, for each system we chose to re-factor the biggest libraries. Characteristics of the four systems analyzed are shown in Table 1.

System    Ver                 KLOC    Apps   Libs   Libs to re-factor
GRASS     Snap. 02-22-2002    1,014   517    43     3
MySQL     3.23.51             478     38     24     2
Samba     2.2.5               293     16     2      2
KDE-QT    3.0.3               4,196   589    456    4

Table 1. Case studies characteristics.

4.1. GRASS

GRASS (Geographic Resources Analysis Support System, http://grass.itc.it) is an Open Source, raster/vector Geographical Information System (GIS), having integrated image processing and data visualization subsystems [23]. Supported platforms at the date of writing comprise Linux/PC, SUN, HP/UX, MacOSX, MS-Windows/Cygwin, iPAQ/Linux and others. GRASS modules (commands) are invoked within a shell environment (the current graphical user interface also runs commands within a shell). The GRASS parser is a collection of subroutines allowing the programmer to define options (parameters) and flags that make up the valid command line input of a GRASS command. GRASS provides an ANSI C language API with several hundred GIS functions which are used by the GRASS modules, from reading and writing maps to area and distance calculations for georeferenced data, as well as attribute handling and map visualization. Details of GRASS programming are covered in the "GRASS 5.0 Programmer's Manual" [22].

4.2. Samba

Samba (http://www.samba.org) is a freely available file server that runs on Unix and other operating systems (usually to share resources between Unix-based systems and Microsoft-based systems). The code has been written to be as portable as possible. It has been ported to many Unix variants (Linux, SunOS, Solaris, SVR4, Ultrix, etc.). Samba consists of two key programs, plus a number of other utilities. The two key programs are smbd and nmbd. They implement four basic services: file sharing and print services, authentication and authorization, name resolution, and service announcement (browsing). Moreover, Samba comes with a variety of utilities. The most commonly used are: smbclient, a simple SMB client (SMB, Server Message Block, is a protocol for sharing general communications abstractions such as files, printers, etc.); nmblookup, a NetBIOS name service client; and swat, the Samba Web Administration Tool, which allows the configuration of Samba remotely, using a web browser.

4.3. MySQL

MySQL (http://www.mysql.com/) is an open source, fast, multi-threaded, multi-user SQL database server, intended for mission-critical, heavily loaded production systems. MySQL is written using both C and C++ (the latter is not a problem for our approach, since dependency graphs were extracted from object files), and can be compiled with several different C/C++ compilers. The strength of MySQL is its speed: in order to pursue this objective, some advanced features (e.g., nested queries) are not available, while others (e.g., transactions) were introduced only in the latest version of the database server.

4.4. KDE-QT

KDE is an open source desktop environment for Unix workstations. It was developed to facilitate Unix desktop interaction and programming, in a way more similar to MacOS or MS-Windows. In particular, KDE provides some features not available under X11, such as a common Drag and Drop protocol, desktop configuration, a unified help system, an application development framework, a consistent look-and-feel and menu system, and internationalization. The KDE distribution consists of 19 packages: the base package (KDE-Base), the library package (KDE-Libs), the development package (KDevelop), the network package (KDE-Network), the graphics package (KDE-Graphics), the multimedia package (KDE-Multimedia), and others (see http://www.kde.org for further details). KDE was developed using a multiplatform C++ GUI application framework, QT (http://www.trolltech.com/products/qt/). As shown in Section 5, the QT library is quite big; therefore its re-factoring could be useful for porting graphical applications to hand-held devices.

5. Case Studies Results

This section reports the results obtained by applying the proposed re-factoring process to the different systems described in Section 4. Table 2 reports the results of the re-factoring process applied to selected libraries of the systems analyzed. The table reports, for each library:

• The number of objects composing the library;
• The number of candidate libraries the original library is re-factored into, and the corresponding Silhouette statistic value;
• The number of inter-library dependencies (DF) and the PR before applying the GA; and
• The number of inter-library dependencies (DF) and the PR after applying the GA.

                                          Candidate       Silhouette   Before GA     After GA
System    Library          # of objects   libraries (k)   statistic    DF     PR     DF    PR
GRASS     libgis           184            4               0.70         356    10%    13    8%
GRASS     libdbmi          97             3               0.78         237    31%    5     23%
GRASS     libvect          54             3               0.57         66     46%    4     25%
Samba     lib + libsmb     77             2               0.72         106    72%    2     64%
MySQL     libmysqlclient   80             3               0.53         187    76%    7     70%
MySQL     libmysys         92             2               0.53         158    89%    1     66%
KDE-QT    libqt            403            5               0.79         147    11%    9     5%
KDE-QT    libkhtml         100            3               0.58         0      83%    0     26%
KDE-QT    libkio           129            4               0.56         2      16%    0     7%
KDE-QT    libmpeg          138            3               0.74         68     48%    1     17%

Table 2. Results of the re-factoring process.

GA parameters were fixed after a proper calibration. The number of generations required varied from 1500 to 3000.

As shown in Figure 4, different heuristics were followed to choose the optimal number of clusters, such as the curve knee (libmysqlclient and KDE-libkio) or the maximum (Samba and GRASS-libgis). A similar approach was followed for all other libraries. All PR values shown in Table 2 are due to the presence of objects unused by the current set of applications, which were therefore clustered in a separate library.

Figure 4. Examples of Silhouette statistic behaviors (Silhouette statistic versus number of clusters for GRASS-libgis, Samba, MySQL-libmysqlclient and KDE-libkio).

The biggest GRASS library to re-factor was libgis, composed of 184 objects. The library was split into only four clusters (according to the Silhouette statistic) instead of the six proposed in [7]. The GA reduced dependencies from 356 to 13, keeping the PR almost constant (from 10% to 8%). libdbmi and libvect were both re-factored into three clusters: in the first case the three-cluster structure (also suggested by developers) reflected, as explained in [7], the separation of high-level functionalities from low-level functionalities and unused objects. The GA allowed, for both libraries, a considerable reduction of inter-library dependencies (from 237 to 5 for libdbmi and from 66 to 4 for libvect), also slightly reducing the PR. All the new GRASS libraries received positive feedback from the original developers, who considered the re-factoring effective and useful.

The re-factoring process performed for Samba was quite different from all the others: instead of re-factoring big libraries, we tried to re-organize the two existing libraries, in order to minimize the dependencies (initially 106) between them. The resulting new libraries exhibited only two dependencies (then easily removed manually), and a PR (64%) smaller than the original (72%).

The two biggest MySQL libraries (libmysqlclient and libmysys) were re-factored into three and two clusters, and the GA allowed us to minimize dependencies (down to 7 and 1, respectively).

libqt represented an interesting challenge for us, in that it was a really big library (403 objects) and it was used by a large number (630) of KDE applications and other libraries. According to the Silhouette statistic, it was split into five clusters; then, the GA allowed decreasing the number of inter-library dependencies from 147 to 9 and the PR from 11% to 5%.

When re-factoring libmpeg, instead of considering the use made by all KDE applications, only the applications contained in the kde-multimedia package (the only ones using that library) were taken into account. Also in this case, the reduction of dependencies and PR was successfully performed.

The purpose of applying GA was slightly different for libkhtml and libkio: the number of inter-library dependencies was zero for the former, and very small (2) for the latter. Instead, GA allowed a considerable reduction of PR (from 83% to 26% and from 16% to 7%). The high PR reduction for libkhtml was due to the fact that, in performing agglomerative clustering, some objects used by applications were clustered together with a large number of unused objects, which were therefore linked by all the applications needing the former objects. In this case, GA allowed grouping only the unused objects in a separate cluster.

6. Related Work

The literature reports several works applying clustering or concept analysis (CA) to software system module clustering and/or restructuring, identifying objects, and recovering or building libraries.

An overview of CA applied to software reengineering problems was given by G. Snelting in his seminal work [26], where he used CA in several re-modularization problems such as exploring configuration spaces (see also [17]), transforming class hierarchies, and re-modularizing COBOL systems. A comparison between clustering and CA was presented in [18]. We share with them the idea of applying agglomerative-nesting clustering to a Boolean usage matrix, although in [18] the matrix indicated the uses of variables by programs.

A survey of clustering techniques applied to software engineering was presented by Tzerpos and Holt in [29]. The same authors presented in [28] a metric to evaluate the similarity of different decompositions of software systems.

Applications of clustering to re-engineering can be found in [3] and [21]. In [3], a method for decomposing complex software systems into independent subsystems was proposed by Anquetil and Lethbridge. Merlo et al. [21] exploited comments, as well as variable and function names, to cluster files. Our work shares with [20] the idea of analyzing intra-module and inter-module dependency graphs, finding a tradeoff between having highly cohesive libraries and a low inter-connectivity.

GA have recently been applied in different fields of computer science and software engineering. An approach for partitioning a graph using GA was discussed in [27]. Similar approaches were also shown in [25, 5, 24]. Maini et al. [19] discussed a method to introduce the problem knowledge in a non-uniform crossover operator, and presented some examples of its application (including a graph partitioning problem).
We share with this work the idea of using operators that incorporate problem knowledge: in our case the mutation operator, as discussed in Section 3, clones objects if, after a given percentage of generations, inter-library dependencies are still present. GAs were used by Doval et al. [8] to identify clusters in software systems. We share with that paper the idea of a software clustering approach using GAs that tries to minimize inter-cluster dependencies.

In [4] we proposed the idea of recovering libraries and creating a source file directory structure using CA. This paper shares with [4] the idea of finding libraries by searching for sets of objects used by common groups of applications. The re-factoring of GRASS was proposed in [7], where several activities were carried out in order to re-factor GRASS libraries. In particular, unused symbols were identified and pruned, clones were re-factored, and preliminary work aimed at splitting the biggest libraries into clusters was performed. As stated in the introduction, this paper aims to refine the library re-factoring approach, determining the optimal number of clusters with the Silhouette statistic and minimizing the number of inter-cluster dependencies using a GA.

7. Conclusions

The proposed re-factoring process allowed us to obtain smaller, loosely coupled libraries from the original bigger ones. In particular, the Silhouette statistic gave us information that, together with the experts' knowledge, allowed us to define the optimal number of new candidate libraries. Then the GA, initialized from the clusters produced by agglomerative-hierarchical clustering, significantly reduced the number of inter-library dependencies while at the same time lowering the ratio between the average number of objects linked by each application and the number of objects linked before re-factoring (the Partitioning Ratio).
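The two quantities named above can be illustrated with a minimal Python sketch. It is not the implementation used in the study: the function names, object names and application names are hypothetical, and the Partitioning Ratio is computed under the assumption that, before re-factoring, every application linked the whole library, so the denominator is simply the total number of objects.

```python
from typing import Dict, Iterable, Set, Tuple


def inter_library_dependencies(assignment: Dict[str, int],
                               deps: Iterable[Tuple[str, str]]) -> int:
    """Count dependency edges whose two objects sit in different candidate libraries."""
    return sum(1 for src, dst in deps if assignment[src] != assignment[dst])


def partitioning_ratio(assignment: Dict[str, int],
                       uses: Dict[str, Set[str]]) -> float:
    """Average fraction of objects linked per application after the split.

    Assumption (not taken from the paper's formula): before re-factoring each
    application linked all objects of the original library.
    """
    if not uses or not assignment:
        return 0.0
    # Size of each candidate library.
    sizes: Dict[int, int] = {}
    for cluster in assignment.values():
        sizes[cluster] = sizes.get(cluster, 0) + 1
    total_objects = len(assignment)
    linked = 0
    for used_objects in uses.values():
        # An application links every candidate library it uses at least one object from.
        needed = {assignment[obj] for obj in used_objects}
        linked += sum(sizes[c] for c in needed)
    return linked / (len(uses) * total_objects)


# Hypothetical toy data: four objects split into two candidate libraries.
assignment = {"a.o": 0, "b.o": 0, "c.o": 1, "d.o": 1}
deps = [("a.o", "b.o"), ("b.o", "c.o"), ("c.o", "d.o")]
uses = {"app1": {"a.o"}, "app2": {"a.o", "c.o"}}

print(inter_library_dependencies(assignment, deps))  # 1
print(partitioning_ratio(assignment, uses))          # 0.75
```

In this toy case only the edge from b.o to c.o crosses a library boundary, and the two applications link 2 and 4 of the 4 objects respectively, giving a ratio of 75%. A multiobjective fitness function would trade these two values off against each other.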
The method was successfully applied to our main case study (GRASS), as well as to other systems such as KDE, where the size of some libraries (especially libqt, composed of 319 objects) was really considerable. Our approach is language-independent, since information is gathered from object modules; thus it could be applied to object code produced from any programming language.

Work in progress is devoted to incorporating experts' knowledge into genetic algorithms, in order to cluster objects taking into account not only the use made by applications, but also the similarity of purpose among objects.

8. Acknowledgments

We are grateful to the GRASS development team for the support, the information provided, and the feedback on the re-factored artifacts. Giuliano Antoniol and Massimiliano Di Penta were partially supported by the project "Software Architectures for Heterogeneous Access Networks Infrastructures - SAHARA" funded by "Ministero dell'Istruzione, dell'Università e della Ricerca Scientifica - MIUR".

References

[1] The R project for statistical computing. http://www.r-project.org.
[2] M. R. Anderberg. Cluster Analysis for Applications. Academic Press Inc., 1973.
[3] N. Anquetil and T. Lethbridge. Extracting concepts from file names; a new file clustering criterion. In Proceedings of the International Conference on Software Engineering, pages 84–93, April 1998.
[4] G. Antoniol, G. Casazza, M. Di Penta, and E. Merlo. A method to re-organize legacy systems via concept analysis. In Proceedings of the IEEE International Workshop on Program Comprehension, pages 281–290, Toronto, ON, Canada, May 2001. IEEE Press.
[5] T. N. Bui and B. R. Moon. Genetic algorithm and graph partitioning. IEEE Transactions on Computers, 45(7):841–855, Jul 1996.
[6] K. Deb. Multi-objective genetic algorithms: Problem difficulties and construction of test problems. Evolutionary Computation, 7(3):205–230, 1999.
[7] M. Di Penta, M. Neteler, G. Antoniol, and E. Merlo. Knowledge-based library re-factoring for an open source project. In Proceedings of IEEE Working Conference on Reverse Engineering, Richmond, VA, Oct 2002 (to appear).
[8] D. Doval, S. Mancoridis, and B. Mitchell. Automatic clustering of software systems using a genetic algorithm. In Software Technology and Engineering Practice (STEP), pages 73–91, Pittsburgh, PA, 1999.
[9] M. Garey and D. Johnson. Computers and Intractability: a Guide to the Theory of NP-Completeness. W.H. Freeman, 1979.
[10] D. E. Goldberg. Genetic Algorithms in Search, Optimization and Machine Learning. Addison-Wesley Pub Co, Jan 1989.
[11] A. Gordon. Classification (2nd edition). Chapman and Hall, London, 1988.
[12] J. Horn, N. Nafpliotis, and D. E. Goldberg. A Niched Pareto Genetic Algorithm for Multiobjective Optimization. In Proceedings of the First IEEE Conference on Evolutionary Computation, IEEE World Congress on Computational Intelligence, volume 1, pages 82–87, Piscataway, New Jersey, 1994. IEEE Service Center.
[13] R. Ihaka and R. Gentleman. R: A language for data analysis and graphics. Journal of Computational and Graphical Statistics, 5(3):299–314, 1996.
[14] A. Jain and R. Dubes. Algorithms for Clustering Data. Prentice-Hall, 1988.
[15] L. Kaufman and P. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis. Wiley-Interscience, Wiley, NY, 1990.
[16] L. Kaufman and P. J. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley, 1990.
[17] M. Krone and G. Snelting. On the inference of configuration structures from source code. In Proc. of the 16th International Conference on Software Engineering, pages 49–57, Sorrento, Italy, May 1994.
[18] T. Kuipers and A. van Deursen. Identifying objects using cluster and concept analysis. In Proceedings of the International Conference on Software Engineering, pages 246–255, June 1999.
[19] H. Maini, K. Mehrotra, C. Mohan, and S. Ranka. Knowledge-based nonuniform crossover. In IEEE World Congress on Computational Intelligence, pages 22–27, 1994.
[20] S. Mancoridis, B. S. Mitchell, C. Rorres, Y. Chen, and E. R. Gansner. Using automatic clustering to produce high-level system organizations of source code. In IEEE Proceedings of the 1998 Int. Workshop on Program Comprehension (IWPC'98), 1998.
[21] E. Merlo, I. McAdam, and R. De Mori. Source code informal information analysis using connectionist model. In Proceedings of the International Joint Conference on Artificial Intelligence, pages 1339–44, Los Altos, Calif., 1993.
[22] M. Neteler, editor. GRASS 5.0 Programmer's Manual. Geographic Resources Analysis Support System. ITC-irst, Italy, http://grass.itc.it/grassdevel.html, 2001.
[23] M. Neteler and H. Mitasova. Open Source GIS: A GRASS GIS Approach. Kluwer Academic Publishers, Boston/U.S.A; Dordrecht/Holland; London/U.K., 2002.
[24] B. Oommen and E. de St. Croix. Graph partitioning using learning automata. IEEE Transactions on Computers, 45(2):195–208, Feb 1996.
[25] S. Shazely, H. Baraka, and A. Abdel-Wahab. Solving graph partitioning problem using genetic algorithms. In Midwest Symposium on Circuits and Systems, pages 302–305, 1998.
[26] G. Snelting. Software reengineering based on concept lattices. In Proceedings of IEEE International Conference on Software Maintenance, pages 3–10, March 2000.
[27] E. Talbi and P. Bessire. A parallel genetic algorithm for the graph partitioning problem. In ACM International Conference on Supercomputing, Cologne, Germany, 1991.
[28] V. Tzerpos and R. C. Holt. MoJo: A distance metric for software clusterings. pages 187–195.
[29] V. Tzerpos and R. C. Holt. Software botryology: Automatic clustering of software systems. In DEXA Workshop, pages 811–818, 1998.
[30] M. Wall. GAlib - a C++ library of genetic algorithm components. http://lancet.mit.edu/ga/.
[31] E. Zitzler, K. Deb, and L. Thiele. Comparison of Multiobjective Evolutionary Algorithms on Test Functions of Different Difficulty. In A. S. Wu, editor, Proceedings of the 1999 Genetic and Evolutionary Computation Conference. Workshop Program, pages 121–122, Orlando, Florida, 1999.