
A language-independent software renovation framework

2005, Journal of Systems and Software

One of the undesired effects of software evolution is the proliferation of unused components, which are not used by any application. As a consequence, the size of binaries and libraries tends to grow and system maintainability tends to decrease. At the same time, a major trend of today’s software market is the porting of applications onto hand-held devices or, in general, onto devices with a limited amount of available resources. Refactoring and, in particular, the miniaturization of libraries and applications are therefore necessary. We propose a Software Renovation Framework (SRF) and a toolkit covering several aspects of software renovation, such as removing unused objects and code clones, and refactoring existing libraries into smaller, more cohesive ones. Refactoring has been implemented in the SRF using a hybrid approach based on hierarchical clustering, genetic algorithms and hill climbing, which also takes into account the developers’ feedback. The SRF aims to monitor software system quality in terms of the identified affecting factors, and to perform renovation activities when necessary. Most of the framework activities are language-independent, do not require any kind of source code parsing, and rely on object module analysis. The SRF has been applied to GRASS, which is a large open source Geographical Information System of about one million LOCs in size. It has significantly improved the software organization, has reduced by about 50% the average number of objects linked by each application, and has consequently also reduced the applications’ memory requirements.

M. Di Penta (a), M. Neteler (b), G. Antoniol (a), E. Merlo (c)

(a) RCOST—Research Centre on Software Technology, Department of Engineering, University of Sannio, Via Traiano 1, 82100 Benevento, Italy
(b) ITC-irst, Istituto Trentino di Cultura, Via Sommarive 18, 38050 Povo (Trento), Italy
(c) École Polytechnique de Montréal, Montréal, Québec, Canada

Received 1 April 2003; received in revised form 16 July 2003; accepted 2 March 2004. DOI: 10.1016/j.jss.2004.03.033

Keywords: Refactoring; Software renovation; Clustering; Genetic algorithms; Hill climbing

1. Introduction

Software system evolution often brings several factors that contribute to deteriorating the quality of the system itself (Lehman and Belady, 1985). First, unused components, which have been introduced for testing purposes or which belong to obsolete functionalities, may proliferate. Second, maintenance and evolution activities are likely to introduce clones, for example when adding support and drivers for an architecture similar to an already supported one (Antoniol et al., 2002). Third, library sizes tend to increase, because new functionalities are added and refactoring is rarely performed; for the same reasons, the number of inter-library dependencies, some of which are circular, also tends to increase. Finally, new functionalities logically related to already existing ones are sometimes added in a non-systematic way, and they result in sets of modules which are neither organized nor linked into libraries.

As a consequence, systems become difficult to maintain. Moreover, unused objects, big libraries, and circular dependencies significantly increase application sizes and memory requirements. This is clearly in contrast with today’s industry trend towards porting existing software applications onto hand-held devices, such as Personal Digital Assistants (PDAs), onto wireless devices (e.g., multimedia cell phones), or, in general, onto devices with limited resources.
This paper proposes the SRF to monitor and control some of the quality factors described above. When the number of unused objects and clones increases, or when library sizes become unmanageable, several actions may be taken. First and foremost, unused code may be removed and clones may be monitored or factored out. Furthermore, some form of restructuring, at the library and at the object file level, may be required. Together with monitoring and improving maintainability, the SRF eases the miniaturization challenge of porting applications onto limited-resource devices.

Most of the SRF activities deal with analyzing dependencies among software artifacts. For any given software system, dependencies among executables and object files may be represented via a dependency graph, i.e., a graph where nodes represent resources and edges represent dependencies. Each library, in turn, may be thought of as a subgraph of the overall object file dependency graph. Therefore, software miniaturization can be modeled as a graph partitioning problem. Unfortunately, it is well known that graph partitioning is an NP-hard problem (Garey and Johnson, 1979), and thus heuristics have to be adopted to find a "good-enough" solution. For example, one may first be interested in graph partitions that minimize the cross edges between the subgraphs corresponding to libraries. More formally, a cost function describing the restructuring problem has to be defined, and heuristics to drive the solution search process must be identified and applied.

We propose a novel approach in which hierarchical clustering and the Silhouette statistic (Kaufman and Rousseeuw, 1990) are initially used to determine the optimal number of clusters and the starting population of a Software Renovation Genetic Algorithm (SRGA). This initial step is followed by an SRGA search aimed at minimizing a multi-objective function which takes into account, at the same time, both the number of inter-library dependencies and the average number of objects linked by each application. Finally, by letting the SRGA fitness function also consider the experts’ suggestions, the SRF becomes a semi-automatic approach composed of multiple refactoring iterations, interleaved with developers’ feedback.

To speed up the search process, heuristics based on a Genetic Algorithm (GA) and on a modified GA approach (Talbi and Bessière, 1991) were adopted. A further performance improvement was achieved by means of a hybrid approach, which combines GA strategies with hill climbing techniques.
The SRF has the advantage of being language independent. All activities, except clone detection, rely on information extracted from object files; furthermore, the clone detection algorithm adopted in the SRF is not tied to any specific programming language, provided that a set of metrics can be extracted from the source code.

The SRF has been applied to a large Open Source software system: a Geographical Information System (GIS) named GRASS (Geographic Resources Analysis Support System, http://grass.itc.it). GRASS is a raster/vector GIS combined with integrated image processing and data visualization subsystems (Neteler and Mitasova, 2002), composed of 517 applications and 43 libraries, for a total of over one million LOCs. The development team is small, about 7–15 active developers, and decisions are usually taken by the members most capable of solving specific problems. Developers are also GRASS users, and they often focus on their own needs within the general project.

This paper is organized as follows. First, a short review of related work (Section 2) and of the main notions of clustering and GAs (Section 3) is presented. Then, the SRF is described in Section 4. The case study software system (i.e., GRASS) is described in Section 5, while results are presented and discussed in Section 6, followed by conclusions and work-in-progress in Section 7.
2. Related work

Many research contributions have been published about clustering and restructuring software system modules, identifying objects, and recovering or building libraries. Most of these works applied clustering or Concept Analysis (CA). An overview of CA applications to software reengineering problems was published by Snelting in his seminal work (Snelting, 2000). Snelting applied CA to several remodularization problems, such as exploring configuration spaces (see also Krone and Snelting, 1994), transforming class hierarchies, and remodularizing COBOL systems. Kuipers and Moonen (2000) combined CA and type inference in a semi-automatic approach to find objects in COBOL legacy code. Antoniol et al. (2001a) applied CA to the problem of identifying libraries and of defining new directory and file organizations in software systems with degraded architectures. In agreement with Krone and Snelting (1994), Kuipers and Moonen (2000), and Antoniol et al. (2001a), we believe that, with the present level of technology, a programmer-centric approach is required, since programmers are in charge of choosing the proper remodularization strategy based on their knowledge and judgment. A comparison between clustering and CA was presented by Kuipers and van Deursen (1999). Our work also applies agglomerative-nesting clustering to a Boolean usage matrix, although in Kuipers and van Deursen (1999) the matrix indicated the uses of variables by programs.

Surveys and overviews of cluster analysis applied to software systems have been published in the past, for example, by Wiggerts (1997) and by Tzerpos and Holt (1998). The latter authors (Tzerpos and Holt, 1999) defined a metric to evaluate the similarity of different decompositions of software systems. Tzerpos and Holt (2000a) proposed a novel clustering algorithm specifically conceived to address the peculiarities of program comprehension; they also addressed the issue of stability of software clustering algorithms (Tzerpos and Holt, 2000b). Applications of clustering to reengineering were suggested by Anquetil and Lethbridge (1998), who devised a method for decomposing complex software systems into independent subsystems, clustering source files according to file names and their name decomposition. An approach relying on inter-module and intra-module dependency graphs to refactor software systems was presented by Mancoridis et al. (1998). We share with Mancoridis et al. (1998) the idea of analyzing dependency graphs and of finding a tradeoff between highly cohesive and loosely inter-connected libraries.

GAs have recently been applied in different fields of computer science and software engineering. An approach for partitioning a graph using GAs was discussed by Talbi and Bessière (1991); similar approaches were also published by Shazely et al. (1998), Bui and Moon (1996), and Oommen and de St Croix (1996). Maini et al. (1994) discussed a method to introduce knowledge about the problem into a non-uniform crossover operator and presented some examples of its application. A GA was used by Doval et al. (1999) to identify clusters in software systems. We share with Doval et al. (1999) the idea of a software clustering approach which uses a GA and which tries to minimize inter-cluster dependencies. Finally, Harman et al. (2002) reported experiments of modularization and remodularization comparing GAs with hill climbing techniques and introducing a representation and a crossover operator tied to the remodularization problem. Their case studies revealed that hill climbing outperformed GAs. Mahdavi et al. (2003) proposed an approach aimed at combining multiple hill climbs for subsequent searches, thus reducing the search space.

Software miniaturization for Java applications was recently addressed by Jax, an application extractor for Java software systems (Tip et al., 1999), whose goal is the size reduction of Java programs, with particular interest in applets to be transmitted over the network. Jax is based on transformations including the removal of redundant methods and fields, devirtualization and inlining of method calls, renaming of methods, fields, classes and packages, and the transformation of class hierarchies. Another approach, devoted to reducing the size of Java libraries for embedded systems, was proposed by Rayside and Kontogiannis (2002). While the approach proposed by Rayside and Kontogiannis (2002) and Jax are tied to a programming language, ours is not. Our approach also differs from Jax in philosophy, since we do not limit ourselves to reducing the size of the application instance to be executed, but we also support the reorganization of a software system whose structure has deteriorated because of its evolution. The reduction of memory requirements is thus just one of the effects of the reorganization.

This paper extends preliminary contributions (Di Penta et al., 2002; Antoniol et al., 2003). With Di Penta et al. (2002) we share the choice of GRASS as the target application and several of the activities carried out to refactor its libraries.
3. Background notions

The fundamental activity of the SRF is library refactoring. This requires the integration of clustering and GA techniques in a semi-automatic, human-driven process. Clustering deals with grouping large numbers of entities into groups (clusters) of closely related entities (Kaufman and Rousseeuw, 1990; Anderberg, 1973). Clustering is used in different areas, such as business analysis, economics, astronomy, information retrieval, image processing, pattern recognition, biology, and others. GAs stem from the idea, born over 30 years ago, of applying the biological principle of evolution to artificial systems. GAs are applied to different domains such as machine and robot learning, economics, operations research, ecology, studies of evolution, and learning and social systems (Goldberg, 1989; Mitchell, 1996). In the following subsections, for the sake of completeness, only some essential notions are summarized, because describing the different types of clustering algorithms or the details of GAs is outside the scope of this paper. More details can be found in Anderberg (1973) for clustering and in Goldberg (1989) and Mitchell (1996) for GAs.

3.1. Agglomerative hierarchical clustering

In this paper, the agglomerative-nesting (Agnes) algorithm (Kaufman and Rousseeuw, 1990) was applied to build the initial set of candidate libraries. Agnes is an agglomerative, hierarchical clustering algorithm: it builds a hierarchy of clusters in such a way that each level contains the same clusters as the first lower level, except for two clusters, which are joined to form a single cluster.
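As a concrete illustration of this step, the sketch below clusters a small Boolean usage matrix hierarchically. It is only a Python approximation of what the SRF does with the Agnes function of the R cluster package (see Section 4.8); the toy matrix, the Jaccard distance and the average linkage are assumptions made for the example, not the SRF's actual configuration.

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster
    from scipy.spatial.distance import pdist

    # Toy Boolean usage matrix: mu[i, j] = 1 if object i is used by application j.
    mu = np.array([[1, 1, 0, 0],
                   [1, 1, 0, 0],
                   [0, 0, 1, 1],
                   [0, 1, 1, 1]], dtype=bool)

    # Objects used by similar sets of applications end up close to each other.
    dist = pdist(mu, metric="jaccard")        # pairwise distances between objects
    tree = linkage(dist, method="average")    # agglomerative hierarchy of clusters

    # Cutting the hierarchy at k clusters yields k candidate libraries.
    k = 2
    print(fcluster(tree, t=k, criterion="maxclust"))   # e.g. [1 1 2 2]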
3.2. Determining the optimal number of clusters

To determine the actual, or optimal, number of clusters, one traditionally relies on the plot of an error measure representing the dispersion within a cluster. The error measure decreases as the number of clusters, k, increases, but for some values of k the curve flattens; traditionally, it is assumed that the knee of the error curve indicates the appropriate number of clusters (Gordon, 1988). Kaufman and Rousseeuw (1990) proposed the Silhouette statistic for estimating and assessing the optimal number of clusters. For an observation i, let a(i) be the average distance to the other points in its cluster, and b(i) the average distance to the points in the nearest cluster. Then the Silhouette statistic is defined as

    s(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))}    (1)

Kaufman and Rousseeuw suggested choosing the optimal number of clusters as the value maximizing the average s(i) over the dataset. Often, a compromise has to be accepted between maximizing the Silhouette (and thus having highly cohesive clusters) and obtaining an excessive number of clusters (which, in our application, causes library fragmentation).
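The following sketch shows how Eq. (1) can be used in practice to compare different values of k. It is an illustrative Python re-implementation under the assumption of a precomputed distance matrix; the SRF relies on the Silhouette implementation of the R cluster package instead (Section 4.8).

    import numpy as np
    from scipy.spatial.distance import pdist, squareform
    from scipy.cluster.hierarchy import linkage, fcluster

    def average_silhouette(dist, labels):
        # Mean of s(i) = (b(i) - a(i)) / max(a(i), b(i)) over all observations, as in Eq. (1).
        s = []
        for i in range(len(labels)):
            same = (labels == labels[i])
            same[i] = False
            a = dist[i, same].mean() if same.any() else 0.0
            b = min(dist[i, labels == c].mean() for c in set(labels) if c != labels[i])
            denom = max(a, b)
            s.append(0.0 if denom == 0 else (b - a) / denom)
        return float(np.mean(s))

    rng = np.random.default_rng(0)
    mu = rng.random((30, 6)) > 0.5                     # toy Boolean usage matrix
    dist = squareform(pdist(mu, metric="jaccard"))
    tree = linkage(pdist(mu, metric="jaccard"), method="average")

    for k in range(2, 6):
        labels = fcluster(tree, t=k, criterion="maxclust")
        print(k, round(average_silhouette(dist, labels), 3))
    # The maximum (or the knee) of this curve suggests the number of candidate libraries.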
3.3. Genetic algorithms

Applications based on GAs have revealed their effectiveness in finding approximate solutions when the search space is large or complex, when mathematical analysis or traditional methods are not available, and, in general, when the problem to be solved is NP-complete or NP-hard (Garey and Johnson, 1979). Roughly speaking, a GA may be defined as an iterative procedure that searches for the best solution of a given problem among a constant-size population of individuals, each represented by a finite string of symbols, the genome. The search starts from an initial population of individuals, often randomly generated. At each evolutionary step, individuals are evaluated using a fitness function, and high-fitness individuals have the highest probability of reproducing. The evolution (i.e., the generation of a new population) is performed by means of two kinds of operators: the crossover operator and the mutation operator. The crossover operator takes two individuals (the parents) of the old generation and exchanges parts of their genomes, producing one or more new individuals (the offspring). The mutation operator has been introduced to prevent convergence to local optima; it randomly modifies an individual's genome, for example by flipping some of its bits if the genome is represented by a bit string. Crossover and mutation are performed on each individual of the population with probability p_cross and p_mut respectively, where p_mut << p_cross. GAs are not guaranteed to converge; the termination condition is often based on a maximum number of generations or on a given value of the fitness function.

3.3.1. Hill climbing and GA hybrid approaches

As suggested by Goldberg (1989), hybrid GAs may be advantageous when there is the need for optimization techniques tied to a specific problem structure: the "in the large" perspective of GAs may be combined with the precision of local search. GAs are able to explore large search spaces, but often they reach a solution that is not accurate, or they converge very slowly to an accurate solution. On the other hand, local optimization techniques, such as hill climbing, quickly converge to a local optimum, but they are not very effective for searching large solution spaces because of the possible presence of local maxima or plateaus. There are at least two different ways to hybridize a GA with hill climbing techniques. The first approach attempts to optimize the best individuals of the last generation using hill climbing; the second approach uses hill climbing to optimize the best individuals of each generation. Applying hill climbing to each generation can be expensive. However, this technique "inserts" into each generation high-quality individuals, determined by the optimization phase, and therefore reduces the number of generations required to achieve convergence.
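A generic sketch of such a hybrid, under the second scheme (hill climbing applied to the best individual of every generation), is given below. The bit-string genome, the toy fitness and all parameter values are illustrative assumptions; the SRGA of Section 4.6.2 uses a different genome and fitness.

    import random
    random.seed(1)

    GENOME_LEN, POP_SIZE, GENERATIONS = 40, 30, 50
    P_CROSS, P_MUT = 0.8, 0.05

    def fitness(g):                      # toy objective: maximize the number of ones
        return sum(g)

    def crossover(a, b):                 # one-point crossover
        p = random.randrange(1, GENOME_LEN)
        return a[:p] + b[p:], b[:p] + a[p:]

    def mutate(g):                       # flip each bit with probability P_MUT
        return [bit ^ 1 if random.random() < P_MUT else bit for bit in g]

    def hill_climb(g):                   # one greedy pass of single-bit-flip local search
        for i in range(GENOME_LEN):
            g2 = g[:]
            g2[i] ^= 1
            if fitness(g2) > fitness(g):
                g = g2
        return g

    population = [[random.randint(0, 1) for _ in range(GENOME_LEN)] for _ in range(POP_SIZE)]
    for _ in range(GENERATIONS):
        population.sort(key=fitness, reverse=True)
        population[0] = hill_climb(population[0])     # local refinement of the best individual
        parents = population[:POP_SIZE // 2]
        children = []
        while len(children) < POP_SIZE - len(parents):
            a, b = random.sample(parents, 2)
            c1, c2 = crossover(a, b) if random.random() < P_CROSS else (a[:], b[:])
            children += [mutate(c1), mutate(c2)]
        population = parents + children[:POP_SIZE - len(parents)]

    print(max(fitness(g) for g in population))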
4. The refactoring framework

As highlighted in the introduction, the proposed framework consists of several steps:

• First and foremost, software system applications, libraries, and the dependencies among them are identified;
• Unused functions and objects are identified and removed or factored out;
• Duplicated or cloned objects are identified and possibly factored out;
• Circular dependencies among libraries, which cause a library to be linked each time another circularly linked library is needed, are removed or, at least, reduced;
• Large libraries are refactored into smaller ones and, if possible, transformed into dynamic libraries; and
• Objects which are used by multiple applications, but which are not yet organized into libraries, are grouped into new libraries.

The SRF activities and the adopted representations are detailed in the following subsections.

4.1. Software system graph representation

A graph representation of the dependencies between object modules is central to our framework, and most of the SRF computations rely on it. A software system can be represented by an instance of the System Graph (SG), an example of which is depicted in Fig. 1. The SG is defined as

    SG \triangleq \{O, L, A, D\}    (2)

where O \triangleq \{o_1, o_2, \ldots, o_p\} is the set of all object modules; L \triangleq \{l_1, l_2, \ldots, l_n\}, with l_i \subseteq O for i = 1, \ldots, n, is the set of all software system libraries (libraries, i.e. subsets of objects, are depicted in Fig. 1 as rounded boxes); A \triangleq \{a_1, a_2, \ldots, a_m\}, with A \subseteq O and A \cap (\bigcup_i l_i) = \emptyset, is the set of all software system applications (applications, i.e. the object modules containing the main symbol, are represented in Fig. 1 as square source nodes; applications are not the only source nodes since, as will be detailed later, unused objects also have no incoming edges, although they can be distinguished from applications because the latter define a main symbol); and D \subseteq O \times O is the set of oriented edges d_{i,j} representing dependencies between objects.

Fig. 1. Example of system graph.

From the SG we can extract two other graphs useful for our refactoring purposes. The first graph is called the Use Graph, and it highlights the use of objects by applications or by libraries. The use relationship is defined as

    a_x \text{ uses } o_y \iff \exists \text{ path } \{a_x, \ldots, o_y\} \in SG    (3)

In other words, the Use Graph highlights the reachability between applications and library objects in the SG. Such reachability can be obtained by computing a k-fold product of the adjacency matrix representing the graph. Similarly, the second graph is called the Dependency Graph, and it is used to represent existing dependencies between two or more libraries, or between to-be-refactored objects contained in a library. The clustering algorithm should avoid inter-cluster dependencies. The dependency relationship is defined as

    o_x \text{ depends on } o_y \iff o_x \text{ uses } o_y \wedge o_x \in \bigcup_i l_i \wedge o_y \in \bigcup_i l_i    (4)

In particular, a dependency (o_x, o_y) is considered an inter-library dependency, i.e., a dependency that increases the coupling, if o_x \in l_i, o_y \in l_j, and i \neq j. Given the above definition of the SG, the SRF activities can be graphically summarized as in Fig. 2.

Fig. 2. The framework activities.
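The reachability that defines the Use Graph can be computed directly on the Boolean adjacency matrix, as in the short sketch below. The matrix values are a toy example and the code only illustrates the repeated (k-fold) product; in the SRF the matrices MU and MD are produced by the graph extractor described in Sections 4.2 and 4.8.

    import numpy as np

    # Toy adjacency matrix: adj[i, j] = 1 if object i directly requires a symbol defined in object j.
    adj = np.array([[0, 1, 0, 0],
                    [0, 0, 1, 0],
                    [0, 0, 0, 1],
                    [0, 0, 0, 0]], dtype=bool)

    reach = adj.copy()
    for _ in range(len(adj)):                         # repeated Boolean matrix product
        newer = reach | ((reach.astype(int) @ adj.astype(int)) > 0)
        if (newer == reach).all():
            break
        reach = newer

    # reach[i, j] now means: a path from i to j exists in the SG, i.e. "i uses j" (Eq. (3)).
    print(reach.astype(int))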
4.2. Graph construction

Before dependencies among applications and libraries, and among the libraries themselves, can be recovered, the executable applications composing the software system must be identified. In this paper we rely on an approach similar to the one proposed by Antoniol et al. (2001a); however, Antoniol et al. (2001a) identified applications by detecting all source files containing the definition of a main function. Once applications and existing libraries are identified, the SG can be built. Given the use relationship between an object module requiring a symbol and a module defining it, the corresponding SG is built via the transitive closure of the use relationship, starting from the main object of each application and from each library. In other words, for each application, undefined symbols are identified and recursively resolved (possibly pushing new undefined symbols onto the stack), first inside the objects contained in the same path (i.e., the other modules of the application), then inside libraries. A similar process is performed to detect dependencies among libraries. Finally, the use graph and the dependency graph, represented as adjacency matrices MU and MD, are extracted from the SG.
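The sketch below illustrates the symbol-resolution idea with the output of the Unix nm tool. It is a simplified Python approximation (the actual graph extractor is part of the shell/Perl toolkit of Section 4.8): the object file names are hypothetical, only a few symbol kinds are handled, and each symbol is assumed to be defined in a single object.

    import subprocess
    from collections import defaultdict

    def read_symbols(object_files):
        defined, undefined = {}, defaultdict(set)
        for obj in object_files:
            out = subprocess.run(["nm", obj], capture_output=True, text=True).stdout
            for line in out.splitlines():
                parts = line.split()
                if len(parts) < 2:
                    continue
                kind, name = parts[-2], parts[-1]
                if kind == "U":                       # symbol required but not defined here
                    undefined[obj].add(name)
                elif kind in ("T", "D", "B"):         # text/data/bss definition
                    defined.setdefault(name, obj)
        return defined, undefined

    def dependency_edges(object_files):
        defined, undefined = read_symbols(object_files)
        edges = set()
        for obj, needed in undefined.items():
            for sym in needed:
                target = defined.get(sym)
                if target and target != obj:
                    edges.add((obj, target))          # obj depends on the object defining sym
        return edges

    # Hypothetical usage:
    # print(dependency_edges(["main.o", "parse.o", "util.o"]))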
4.3. Handling unused objects

Symbols defined in libraries which are used neither by applications nor by other libraries are likely to represent useless resources. Their presence is often due to utility functions which are inserted in libraries but which are not used by the current set of applications, or to features which are not yet fully implemented. The objects defining these unused symbols should be removed from the libraries, provided that they do not also export used symbols. In the opposite case, such an object should be left in the library and its corresponding source file should be restructured. One possible refactoring strategy is to create two new libraries from each library, one containing all the unused symbols and the other containing all the used symbols.

4.4. Removal of circular dependencies among libraries

The Dependency Graph introduced in Section 4.1 captures dependencies among the different libraries and allows the identification of strongly connected components. In particular, circular dependencies between libraries cause a library to be linked each time the other one is needed. Once these dependencies are identified, four strategies can be used to remove them:

(1) Move the object which causes the circular dependency to another library. This is only feasible if the object does not need resources located in its original library and is not needed by that library;
(2) Duplicate the object: as in the previous case, this is appropriate if the object does not need resources located in the original library but, differently from the previous case, the object is required in that library, so that moving the object outside the library would make the situation worse;
(3) Merge the two libraries: this strategy should be avoided whenever possible because it increases library sizes; however, it may be the only available solution when the number of objects causing circular and, in general, inter-library dependencies is very high; and
(4) Create dynamic libraries: instead of merging circularly dependent libraries, one may decide to make them dynamic. The circular dependency problem is not solved, but the average amount of resources needed is reduced, as described in Section 4.6.2.

When the Dependency Graph does not allow the removal of circular dependencies and when, for performance reasons, options three and four cannot be adopted, a deeper analysis should be performed to identify dependencies at the granularity level of functions rather than objects. Finally, the existence of a complex dependency relationship between two libraries, if confirmed by the developers' feedback, indicates that the libraries were probably not designed with miniaturization in mind. In this case, library objects should be merged and then refactored again into new clusters, adopting the process detailed in Section 4.6.
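In practice, the circularly dependent groups are the non-trivial strongly connected components of the library Dependency Graph, as sketched below. The edge list reuses the circular pairs later reported in Section 6.2 as a toy example; networkx is used only for illustration and is not part of the SRF toolkit.

    import networkx as nx

    # (lib_a, lib_b) means: lib_a requires at least one symbol defined in lib_b.
    edges = [("libstubs", "libdbmi"), ("libdbmi", "libstubs"),
             ("libgis", "libcoorcnv"), ("libcoorcnv", "libgis"),
             ("libvect", "libdig2"), ("libdig2", "libvect"),
             ("libproj", "libgis")]                    # last edge is non-circular

    dg = nx.DiGraph(edges)
    for scc in nx.strongly_connected_components(dg):
        if len(scc) > 1:
            print("circular dependency among:", sorted(scc))
    # Each non-trivial component is a candidate for one of strategies (1)-(4) above.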
4.5. Identification of duplicate symbols and clones

Examining the list of symbols defined in each library allows the comparison of exported symbol names. It is worth noting that homonym symbols in different libraries may refer to completely different functions, external variables or data structures. On the other hand, two or more symbols may have different names but correspond to duplicated functions. Therefore, clone detection analysis is helpful for library renovation. In this paper, a metric-based clone detection process (Antoniol et al., 2001b), aimed at detecting duplicated functions, is adopted. The obtained results suggest different possible actions:

(1) If a whole duplicated object module has been detected inside two or more libraries, it should be left in only one of them, unless this conflicts with the removal of circular dependencies (see Section 4.4);
(2) If duplicated functions are identified inside different objects, refactoring can be performed by moving them outside their respective objects and by applying considerations similar to the previous case; and
(3) Clone detection may reveal clones outside libraries, since applications may contain duplicated portions of code in their objects. In some cases, it may be useful to remove such duplicated portions of code and place them into new libraries.

Preliminary to clone refactoring is an impact analysis in terms of introduced dependencies, especially circular dependencies, since clone removal may increase dependencies. As explained in Section 4.4, and as will be shown in Section 4.6, sometimes an object is duplicated precisely to reduce dependencies. In general, it may be preferable to duplicate a few objects rather than to introduce a dependency that causes, for a subset of the applications, the linking or loading of one or more additional libraries. Clearly, if the process duplicates a conspicuous number of objects into two or more libraries, these objects can be refactored, as explained in Section 4.6.2, into a new library on which the old libraries will depend. Overall, clone removal aims to improve software system maintainability, although attention should be paid to avoid deteriorating software system reliability and to reflect the developers' objectives (Cordy, 2003). Clone removal can also contribute to decreasing the overall software system size; again, a tradeoff should be sought, since clone refactoring (especially for very small clones) sometimes produces a system bigger than the original one.
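As a rough illustration of the metric-based idea, the sketch below groups functions whose metric fingerprints coincide; this is a strong simplification of the detector of Antoniol et al. (2001b). The chosen metrics, the threshold and the metric values of the example functions are all invented for illustration (the function names echo those discussed in Section 6.1).

    from collections import defaultdict

    def clone_clusters(functions, min_loc=5):
        # functions: iterable of (name, metrics). Functions sharing the same metric
        # fingerprint become one candidate clone cluster; the shortest ones are filtered out.
        buckets = defaultdict(list)
        for name, m in functions:
            if m["loc"] < min_loc:
                continue
            fingerprint = (m["loc"], m["statements"], m["calls"], m["params"], m["cyclomatic"])
            buckets[fingerprint].append(name)
        return [names for names in buckets.values() if len(names) > 1]

    funcs = [("scan_int", {"loc": 12, "statements": 9, "calls": 2, "params": 2, "cyclomatic": 3}),
             ("scan_dbl", {"loc": 12, "statements": 9, "calls": 2, "params": 2, "cyclomatic": 3}),
             ("whoami",   {"loc": 6,  "statements": 4, "calls": 1, "params": 0, "cyclomatic": 1})]
    print(clone_clusters(funcs))                       # [['scan_int', 'scan_dbl']]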
Each time a symbol from a 601 library is needed, another library may also need 602 to be loaded, therefore reducing the advantage of 603 having new smaller libraries; and 604 (2) New libraries may not be meaningful with respect 605 to developersÕ intentions whose feedback has to be 606 incorporated in the refactoring process. where d(x, y) is the well-known Kronecker delta function:  1 x ¼ y; dðx; yÞ ¼ 0 x 6¼ y; gmi,j is the genome encoding i.e., the GM[i, j] bit matrix entry. As shown in Eq. (6), the DF(g) is incremented each time an object (i.e., a high bit in the genome) depends from another object not contained in the same cluster. SDF can be thought of as the difference between the initial library sizes standard deviation and the one at the current generation. Without taking SDF into account, the SRGA may attempt to reduce dependencies by grouping a large fraction of the objects in the same library and it may negatively affect the PR. A similar factor was also applied by Talbi and Bessière (1991). Given the arrays of library sizes S0 and Sg, respectively for the initial population and for the gth generation, SDF is SDF ðgÞ ¼ jrS 0  rS g j: ð7Þ The fourth factor takes into account the developersÕ feedback. After a first execution of the SRGA without considering FF, developers are asked to provide a feedback on the proposed new libraries. DevelopersÕ feedback is stored in a bit-matrix FM, which has the same structure of the genome matrix and which incorporates those changes to the libraries that developers suggested. After this feedback, the SRGA is run again taking into account, this time, the feedback factor FF, based on the difference between the genome and the FM matrix: 649 650 651 653 654 655 656 657 658 659 660 661 662 663 664 665 666 667 669 670 671 672 673 674 675 676 677 678 679 JSS 7667 15 November 2004 Disk Used No. of Pages 16, DTD = 5.0.1 ARTICLE IN PRESS 9 M. Di Penta et al. / The Journal of Systems and Software xxx (2004) xxx–xxx i¼1 jgmi;j  fmi;j j: ð8Þ Offspring Parents j¼1 682 683 684 685 686 688 In other words, the FF counts the number of differences between the genome and the refactoring proposed by developers. The fitness function F is formally defined as 689 690 691 692 693 694 695 696 697 698 699 700 701 702 703 704 705 706 707 708 709 710 711 712 713 where w1, w2 and w3 are real, positive weighting factors for the PR, SDF, and FF contribution to the overall fitness function. The higher is w1, the smaller will be the overall number of objects linked by applications at the expense of dependency reduction. Similarly, the higher is w2, the more similar will be the result to the starting set of library, again, at the expense of a satisfactory dependency reduction. After the first preliminary run of the SRGA which must be performed with w3 = 0, w3 should be properly sized to weight the influence of developersÕ feedback. As stated in (9), our fitness function is multi-objective (Deb, 1999). Notice that, since we aim to give maximum priority to dependency reduction, the DF weight is set to 1. Successively, w1, w2 and w3 are selected using a trial-and-error, iterative procedure, adjusting them each time until the DF, PR, SDF, and FF obtained at the final step were satisfactory. The process is guided by computing each time the average values for DF, PR, SDF, and FF, and by plotting their evolution, to determine the 3D space region in which the population should evolve. 
4.6.2. Refining the solution using genetic algorithms

The solution determined in the previous step presents two main drawbacks:

(1) The number of dependencies between the new libraries may be high: each time a symbol from a library is needed, another library may also need to be loaded, thereby reducing the advantage of having new, smaller libraries; and
(2) The new libraries may not be meaningful with respect to the developers' intentions, whose feedback therefore has to be incorporated in the refactoring process.

Of course, as shown by Di Penta et al. (2002), an important step is the conversion of static libraries into dynamically loadable libraries (DLLs), so that each, possibly small, library is loaded at runtime only when needed and unloaded when it is no longer useful. However, the DLL approach has a main drawback: loading and unloading libraries may cause a significant decrease in performance. Its use should therefore be limited when performance constitutes an essential requirement and, whenever possible, it should be accompanied by dependency minimization.

The genome has been encoded using a bit-matrix encoding. The genome matrix GM for each library to refactor is a matrix of k rows and |l_x| columns, where gm_{i,j} = 1 if object j is contained in cluster i, and 0 otherwise. The presence of the same object in more than one library is indicated by more than one '1' in the same column (this is not possible using the array genome widely used for graph partitioning problems). As already stated, instead of randomly generating the initial population (i.e., the initial libraries), the GA is initialized with the encoding of the set of libraries obtained in the previous step. The fitness function has been conceived to balance four factors:

(1) The number of inter-library dependencies at a given generation;
(2) The total number of objects linked by each application, which should be as small as possible;
(3) The size of the new libraries; and
(4) The feedback given by the developers.

Overall, the fitness function F is defined in terms of four factors: the Dependency Factor (DF), the Partitioning Ratio (PR) defined by Eq. (5), the Standard Deviation Factor (SDF), and the Feedback Factor (FF). DF is defined as

    DF(g) = \sum_{i=0}^{k-1} \sum_{j=0}^{|l_x|-1} \sum_{h=0}^{|l_x|-1} gm_{i,j} \cdot md_{j,h} \cdot (1 - gm_{i,h}) \cdot [1 - \delta(h,j)]    (6)

where md_{j,h} is the entry of the dependency matrix MD restricted to the objects of l_x, gm_{i,j} is the genome encoding (i.e., the GM[i,j] bit-matrix entry), and \delta(x,y) is the well-known Kronecker delta function:

    \delta(x,y) = 1 if x = y, 0 otherwise.

As shown in Eq. (6), DF(g) is incremented each time an object (i.e., a high bit in the genome) depends on another object not contained in the same cluster.

SDF can be thought of as the difference between the standard deviation of the initial library sizes and that of the library sizes at the current generation. Without taking SDF into account, the SRGA might attempt to reduce dependencies by grouping a large fraction of the objects into the same library, which would negatively affect the PR. A similar factor was also applied by Talbi and Bessière (1991). Given the arrays of library sizes S_0 and S_g, for the initial population and for the g-th generation respectively, SDF is

    SDF(g) = |\sigma_{S_0} - \sigma_{S_g}|    (7)

The fourth factor takes into account the developers' feedback. After a first execution of the SRGA without considering FF, developers are asked to provide feedback on the proposed new libraries. The developers' feedback is stored in a bit matrix FM, which has the same structure as the genome matrix and which incorporates the changes to the libraries that the developers suggested. After this feedback, the SRGA is run again taking into account, this time, the feedback factor FF, based on the difference between the genome and the FM matrix:

    FF = \sum_{i=1}^{k} \sum_{j=1}^{|l_x|} |gm_{i,j} - fm_{i,j}|    (8)

In other words, FF counts the number of differences between the genome and the refactoring proposed by the developers. The fitness function F is formally defined as

    F(g) = DF(g) + w_1 \cdot PR(g) + w_2 \cdot SDF(g) + w_3 \cdot FF(g)    (9)

where w_1, w_2 and w_3 are real, positive weighting factors for the PR, SDF, and FF contributions to the overall fitness function. The higher w_1, the smaller the overall number of objects linked by applications, at the expense of dependency reduction. Similarly, the higher w_2, the more similar the result will be to the starting set of libraries, again at the expense of a satisfactory dependency reduction. After the first preliminary run of the SRGA, which must be performed with w_3 = 0, w_3 should be properly sized to weight the influence of the developers' feedback. As stated in Eq. (9), our fitness function is multi-objective (Deb, 1999). Notice that, since we aim to give maximum priority to dependency reduction, the DF weight is set to 1. Subsequently, w_1, w_2 and w_3 are selected using a trial-and-error, iterative procedure, adjusting them each time until the DF, PR, SDF, and FF obtained at the final step are satisfactory. The process is guided by computing, each time, the average values of DF, PR, SDF, and FF and by plotting their evolution, to determine the 3D space region in which the population should evolve.
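The following sketch puts the reconstructed Eqs. (6)-(9) together on a tiny bit-matrix genome. The matrices, the PR value and the weights are toy values, and the code is only an illustration of how the factors combine; the real SRGA is implemented in C++ on top of GAlib (Section 4.8), and its fitness is to be minimized.

    import numpy as np

    def dependency_factor(gm, md):
        # Eq. (6): count objects that depend on an object outside their own cluster.
        df = 0
        n_clusters, n_objects = gm.shape
        for i in range(n_clusters):
            for j in range(n_objects):
                if gm[i, j]:
                    for h in range(n_objects):
                        if h != j and md[j, h] and not gm[i, h]:
                            df += 1
        return df

    def fitness(gm, md, pr, sizes0, fm, w1, w2, w3):
        sdf = abs(np.std(sizes0) - np.std(gm.sum(axis=1)))          # Eq. (7)
        ff = int(np.abs(gm.astype(int) - fm.astype(int)).sum())     # Eq. (8)
        return dependency_factor(gm, md) + w1 * pr + w2 * sdf + w3 * ff   # Eq. (9)

    gm = np.array([[1, 1, 0, 0],               # cluster 0 holds objects 0 and 1
                   [0, 0, 1, 1]], dtype=bool)  # cluster 1 holds objects 2 and 3
    md = np.array([[0, 1, 0, 0],               # object 0 depends on object 1, and so on
                   [0, 0, 1, 0],
                   [0, 0, 0, 1],
                   [0, 0, 0, 0]], dtype=bool)
    fm = gm.copy()                             # developers' feedback: keep this split
    print(fitness(gm, md, pr=60.0, sizes0=[2, 2], fm=fm, w1=0.5, w2=1.0, w3=1.0))   # 31.0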
The crossover operator used in this paper is the one-point crossover, which exchanges the content of two genome matrices around the same random column (see Fig. 4a). The mutation operator works in two modes:

(1) with probability p_mut, it takes a random column and randomly swaps two of its bits: if the two swapped bits are different, an object is moved from one library to another (see Fig. 4b); or
(2) with probability p_clone < p_mut, it takes a random position in the matrix: if it is zero and the corresponding library depends on the corresponding object, the mutation operator clones the object into that library (Fig. 4c).

Fig. 4. Genetic operators: (a) crossover, (b) mutation (move an object), and (c) mutation (clone an object).

Noticeably, cloning an object increases both PR and SDF, and therefore it must be minimized. The SRGA heuristically activates cloning only in the final part of the evolution (after 66% of the generations in our case study). Our strategy favors dependency minimization by moving objects between libraries; at the end, we attempt to remove the remaining dependencies by cloning objects. Obviously, at the end of the refactoring process cloned objects should be factored out again: for example, if objects o_a and o_b are contained in both l_i and l_j, then o_a and o_b should be moved into a third library on which l_i and l_j depend.
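A sketch of these operators on the bit-matrix genome is given below. The probabilities, matrices and random choices are illustrative; the concrete operators are those of the C++ SRGA.

    import numpy as np
    rng = np.random.default_rng(0)

    def one_point_crossover(gm_a, gm_b):
        # Exchange the content of two genome matrices around the same random column (Fig. 4a).
        col = rng.integers(1, gm_a.shape[1])
        child_a = np.hstack([gm_a[:, :col], gm_b[:, col:]])
        child_b = np.hstack([gm_b[:, :col], gm_a[:, col:]])
        return child_a, child_b

    def mutate_move(gm):
        # Swap two bits of a random column: if they differ, an object changes cluster (Fig. 4b).
        gm = gm.copy()
        col = rng.integers(gm.shape[1])
        r1, r2 = rng.choice(gm.shape[0], size=2, replace=False)
        gm[[r1, r2], col] = gm[[r2, r1], col]
        return gm

    def mutate_clone(gm, md):
        # Pick a random zero entry; if that cluster depends on the object, clone the object
        # into the cluster, i.e. set a second '1' in the object's column (Fig. 4c).
        gm = gm.copy()
        i, j = rng.integers(gm.shape[0]), rng.integers(gm.shape[1])
        cluster_objects = np.flatnonzero(gm[i])
        if not gm[i, j] and md[cluster_objects, j].any():
            gm[i, j] = True
        return gm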
Finally, we have introduced the Lock Matrix (LM) as a further, stronger level of developers' feedback. When developers strongly believe that an object should belong to a given cluster, the LM matrix gives them the possibility of enforcing such a constraint: the mutation operator does not perform any action that would bring a genome into a state inconsistent with the Lock Matrix. The population size and the number of generations are determined using an iterative procedure which doubles both of them each time, until the obtained DF, PR and FF are equal to those obtained at the previous iteration. The SRGA suffers from slow convergence; to improve its performance, it has been hybridized with hill climbing techniques. In our experience, applying hill climbing only to the last generation significantly improves neither the performance nor the results. On the contrary, applying hill climbing to the best individuals of each generation makes the SRGA converge significantly faster.

4.7. Identification of new libraries

Due to its evolution, a software system tends to contain objects that, even if used by a common set of applications, are not contained in any library. Their identification and organization into libraries is therefore desirable. The factoring process is quite similar to that described in the previous section. In particular, an MU matrix is built on the subgraph of the use graph obtained by removing all the already existing libraries. Then, a first set of new candidate libraries is built by analyzing the dendrogram and the Silhouette statistic. These libraries are then refined with the aid of the SRGA and of the developers' feedback.

4.8. Tool support

To support the refactoring process, different tools have been conceived:

(1) The application identifier produces the list of object modules containing the main symbol, by using the nm Unix tool;
(2) The graph extractor, also based on the nm tool, produces the System Graph, the Use Graph, and the Dependency Graph. The graph extractor also exports data in .DOT format, to allow visualization and analysis with the Dotty graph visualization tool (http://www.research.att.com/sw/tools/graphviz/);
(3) The unused symbol identifier produces, for each library, the list of the symbols which are not used by any application or library, together with the names of the objects in which those symbols are contained;
(4) The circular dependency identifier produces the list of all circular paths among libraries;
(5) The duplicated symbol identifier produces the list of duplicated, defined external symbols. It is used in conjunction with the metric-based clone detector (see Antoniol et al., 2001b, for details) and with the dependency graph extractor to minimize the presence of clones inside libraries;
(6) The number-of-clusters identifier implements the Silhouette statistic. In particular, the implementation available in the cluster package of the R statistical environment (http://www.r-project.org) has been used;
(7) The library refactoring tool supports the process of splitting libraries into smaller clusters. Cluster analysis is performed by the Agnes function available in the cluster package of the R statistical environment;
(8) The GA library refiner is implemented in C++ using GAlib (http://lancet.mit.edu/ga/); and
(9) The developers' feedback collector is a web application that allows developers to post their feedback about the produced libraries on an appropriate web site.

The SRF works under any standard Unix operating system, or under any operating system which supports the GNU tool set. In particular, the SRF uses the standard Bourne shell (or Bash), the Perl interpreter, the R statistical environment, and a C++ compiler for the GA library refiner. To collect the programmers' feedback, the SRF relies on a PHP web application (the developers' feedback collector). Since the required infrastructure is available under several operating systems (both Unixes and Windows), the SRF is widely portable.

5. Case study

As mentioned in the introduction, the SRF has been applied to GRASS, a large open source GIS. In particular, the GRASS CVS development snapshot of April 5, 2002 (downloadable from http://grass.itc.it) was used as the case study; its characteristics are summarized in Table 1.

Table 1. GRASS key characteristics
    Pre-existing libraries      43
    Library objects            921
    Applications               517
    C source files            7107
    C KLOC                    1014

GRASS modules, which correspond to applications and which represent commands, are organized by name, based on their function class, such as display, general, imagery, raster, vector or site. The first letter of a module name refers to a function class and is followed by one dot and one or two further dot-separated words, which describe specific tasks. All GRASS modules are linked with an internal "front.end". If no command-line arguments are entered by the user, the "front.end" module calls the interactive version of a command; otherwise, it starts the command-line version. If only one version of the specific command exists, i.e., if there is only one command-line version available, that command is executed. Parameters and flags are defined within each module; they are used to ask the user to define map names and other options. GRASS provides an ANSI C language API with several hundred GIS functions, which are used by GRASS modules to read and write maps, to compute areas and distances for georeferenced data, and to visualize attributes and maps. Details of GRASS programming are covered in the "GRASS 5.0 Programmer's Manual" (Neteler, 2001).
6. Case study results

This section presents the results obtained by applying the SRF, described in Section 4, to GRASS.

6.1. Handling unused objects

Out of the 921 objects composing the GRASS libraries, 89 were not used by any application nor by other libraries. When refactoring the libraries with the SRF, those objects will be moved and organized into a separate cluster, thought of as a sort of repository to be "frozen" for future use. A deeper analysis revealed that some of the functions contained in unused objects wrap lower-level GRASS functions (such as db_create_index), wrap standard library and system call functions (such as scan_dbl, scan_int, and whoami), or, in general, provide simple functionalities built on lower-level functions (such as datetime_is_same, which compares two DateTime structures). An interesting example is the library libdbmi (see also Section 6.4): out of 97 objects, 19 were not used at all. In all cases, the unused functions corresponded to one or more wrapped, lower-level functions that have been directly used by the applications.

6.2. Removal of circular dependencies among libraries

Three cases of circular dependencies among libraries were found. The first dependency was between libstubs.a and libdbmi.a. In particular, we discovered that libstubs.a required one symbol, located inside the error.o module belonging to libdbmi.a; in the other direction, libdbmi.a required 27 symbols from libstubs.a. The obvious solution was to move error.o into libstubs.a; this also required moving the module alloc.o into that library, since it depends on error.o. The second circular dependency was found between libgis.a and libcoorcnv.a. In particular, libgis.a required three symbols, located in the module datum.o of libcoorcnv.a; in the other direction, libcoorcnv.a depended on 13 symbols from libgis.a. Moving datum.o into libgis.a resolved the problem. Finally, circular dependencies were found between libvect.a and libdig2.a. They involved 13 symbols in one direction and 31 symbols in the other, located in several different objects. The links present in the dependency graph excluded the possibility of resolving the circular dependencies between libvect.a and libdig2.a by simply moving or duplicating objects. The decision taken together with the GRASS developers was to initially merge the two libraries, which, in effect, have been designed to work together, and then to try to refactor the new library (see Section 6.4).
6.3. Identification of clones

Clone detection was performed at two different levels of the software system architecture: within libraries and on the whole system. In the first case, clone detection aimed at library renovation; in the second case, the objective was to identify portions of duplicated code that could potentially be re-organized into new libraries. Table 2 reports the results obtained from clone analysis in terms of the total number of analyzed functions, the number of clone clusters (Antoniol et al., 2002) detected, and the number and percentage of cloned functions. Clones were computed while filtering out the shortest functions; for example, two functions that simply return a value are clones by definition, but they are not significant and should not be taken into consideration. Results are reported considering two thresholds on function size: functions longer than five LOCs and longer than 10 LOCs. As shown in Table 2, the overall percentage of clones is not negligible (26.04% for functions longer than five LOCs and 16.38% for functions longer than 10 LOCs), which suggests a potential for reducing the number of cloned functions.

Table 2
Results of clone detection

                          Total number   Threshold   Number of        Number of         Percent of
                          of functions   (LOCs)      clone clusters   cloned functions  cloned functions (%)
  Overall                 22,229         5           2019             5789              26.04
                                         10          1404             3641              16.38
  Within libraries        5271           5           72               180               3.41
                                         10          41               101               1.92
  Outside libraries       16,958         5           1817             4974              29.33
                                         10          1290             3268              19.27
  Libraries vs. outside   22,229         5           130              635               2.86
                                         10          73               272               1.22

Clearly, the actual reduction rate depends on the number of false positives, which typically include functions that simply contain a list of calls to other functions (where the number of calls and of parameters match), functions that print different error messages and, in general, any other function that shares the same metrics while being different. The number of clones contained inside libraries is low, indicating that the developers accurately factored functions and objects to avoid duplicates. Finally, we investigated the set of clones between libraries and objects outside libraries, in the perspective of possible refactoring.

The analysis of clones inside libraries revealed an interesting situation: 16 functions from the library libortho were cloned across libimage_sup, libgmath and libtrans. Nine of the cloned functions were devoted to performing matrix algebra. By analyzing the dependency graph of libortho (see Fig. 5), a subgraph composed of such functions was identified, depicted in the box on the right; seven of the functions in the box on the left were cloned in libimage_sup, and in particular the entire structure enclosed in the rounded dashed box was replicated in that library. libortho was therefore split in two libraries, shown in the two boxes of Fig. 5:

(1) A library (libmatrix) to handle matrices; and
(2) A library (libcamera) to handle photogrammetric computations for aerial cameras.

Fig. 5. Splitting library libortho.

Cloned functions contained in these two libraries were removed from libimage_sup, libgmath and libtrans. Several "interesting" clones were also found outside libraries. In particular, the r.mapcalc3 application contains four clusters of cloned functions, spanning from 27 to 59 LOCs in size. These clusters contain mathematical functions, cloned to handle different data types; in this case, refactoring is clearly possible by generalizing the operations and abstracting the types. Finally, we analyzed clones between applications and libraries. In most cases, clones were revealed to be part of legacy applications developed before the corresponding functions were added to a library; unfortunately, the applications were never changed afterwards. A relevant fraction, about 20%, of these clones was discovered in the contrib subsystems, which had often been developed by third parties and therefore were not always properly aligned with the rest of the system.
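As the discussion of false positives suggests, clone clusters are groups of functions whose measured metrics match, rather than textually identical bodies. The sketch below illustrates this metric-based grouping; the particular metric tuple (LOC, number of calls, number of parameters, number of conditionals) is an assumption for illustration and not necessarily the metric set used by the toolkit.

```python
# Sketch: metric-based clone clustering. Functions sharing the same metric
# tuple form one candidate clone cluster; trivially short functions are
# filtered out, as discussed for the 5- and 10-LOC thresholds.
from collections import defaultdict

def clone_clusters(functions, min_loc=5):
    """functions: iterable of dicts with illustrative fields
       {"name": ..., "loc": ..., "calls": ..., "params": ..., "conds": ...}"""
    buckets = defaultdict(list)
    for f in functions:
        if f["loc"] <= min_loc:
            continue                                  # skip short functions
        key = (f["loc"], f["calls"], f["params"], f["conds"])
        buckets[key].append(f["name"])
    return [sorted(names) for names in buckets.values() if len(names) > 1]
```

Counting the resulting clusters and their member functions per scope (within libraries, outside libraries, libraries vs. outside) yields figures of the kind reported in Table 2.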
6.4. Library refactoring

Refactoring was performed on the libraries composed of a large number of objects (see Table 3), following the process described in Section 4.6 and depicted in Fig. 3. As suggested by the developers, libproj was not refactored, because it was under development by a different team. As explained in Section 6.2, the libvect-new library was obtained by merging libvect.a and libdig2.a.

Table 3
GRASS largest libraries

  Library       Objects
  libgis        184
  libdbmi       97
  libproj       119
  libvect-new   54

Silhouette statistics were used to determine the optimal number of clusters for each library; their values are plotted in Fig. 6 for different numbers of clusters.

Fig. 6. Silhouette statistics for different numbers of clusters (libgis, libdbmi, libvect).

We decided to split libgis into four clusters (instead of the six proposed in Di Penta et al., 2002), and to divide libvect-new and libdbmi into three clusters each. It is worth noting that, for libgis, the number of clusters was chosen at the Silhouette maximum; for the other two libraries, a compromise was accepted between maximizing the Silhouette and avoiding excessive fragmentation.
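The selection of the number of clusters can be reproduced with standard tooling, by scoring an agglomerative clustering of the object dissimilarities with the average silhouette for each candidate k. The sketch below illustrates this step with SciPy and scikit-learn; the construction of the dissimilarity matrix and the 2-to-5 range of k are assumptions, not the SRF tool's code.

```python
# Sketch: choose the number of clusters k by the average silhouette statistic,
# given a precomputed object-by-object dissimilarity matrix `dist` (how the
# dissimilarities are derived from the object relationships is omitted here).
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform
from sklearn.metrics import silhouette_score

def best_k(dist, k_range=range(2, 6)):
    """Return (best k, cluster labels) according to the silhouette statistic."""
    tree = linkage(squareform(dist, checks=False), method="average")
    scored = []
    for k in k_range:
        labels = fcluster(tree, t=k, criterion="maxclust")
        scored.append((silhouette_score(dist, labels, metric="precomputed"),
                       k, labels))
    score, k, labels = max(scored, key=lambda t: t[0])
    return k, labels
```

Applied to curves such as those in Fig. 6, this procedure selects the silhouette maximum (four clusters for libgis), while for libvect-new and libdbmi the compromise described above overrides the raw maximum.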
Subsequently, a preliminary clustering was performed; it was then refined by an initial execution of the SRGA, performed without considering any developers' feedback, i.e., by setting w3 = 0. Table 4 reports, for each library:

• The number of objects composing the library;
• The number of candidate libraries the original library is refactored into, and the corresponding Silhouette statistics value;
• The number of inter-library dependencies (DF) and PR before applying the SRGA; and
• The number of inter-library dependencies (DF) and PR after applying the SRGA.

Table 4
Results of the library refactoring process before considering feedback (w3 = 0)

  Library   Number of   Candidate       Silhouette   Before GA        After GA
            objects     libraries (k)   statistics   DF     PR (%)    DF     PR (%)
  libgis    184         4               0.70         579    51        26     48
  libdbmi   97          3               0.78         237    35        4      46
  libvect   54          3               0.57         66     46        3      40

As shown, the SRGA reduced libgis dependencies from 579 to 26, while keeping PR almost constant (from 51% to 48%). A significant reduction of inter-library dependencies was also obtained for the other libraries (from 237 to 4 for libdbmi and from 66 to 3 for libvect), while slightly reducing PR, except for libdbmi, where PR increased to 46% and was thus worse than in the preliminary solution.

The first refactored architecture of the candidate libraries was submitted to the GRASS developers to seek their feedback. For libgis, manual analysis indicated that the first cluster should contain "utility" and "allocation" functions, the second "area" and "geodesic" functions, the third "color-related" functions, and the fourth "raster" functions. For libvect-new, developers indicated that the first cluster should contain basic file-system operations and that the other two clusters should include all other functions, without any further distinction. The feedback for libdbmi was quite different from that for the other two libraries: in this case, developers confirmed that the solution suggested by the hierarchical clustering performed before applying the SRGA reflected their own conception of the libraries. A manual graph analysis via the Dotty graph visualization agreed, too. In fact, as also reported by Di Penta et al. (2002), the library was split into the three following clusters:

• libdbmi-1 contains the 19 unused objects;
• libdbmi-2 contains the 30 objects which are directly used by applications; and
• libdbmi-3 contains 48 objects, which are only used internally by libdbmi, and represents a sort of "low-level" library.

Fig. 7 reports the layering structure of the clusters extracted from libdbmi; to avoid circular dependencies, one object was moved from libdbmi-3 to libdbmi-1.

Fig. 7. New libdbmi layering structure.

Clearly, when refactoring a large software system such as GRASS, a compromise should be accepted between having small and decoupled clusters, like those generated by applying the SRGA, and having clusters that are not totally decoupled but are conceptually cohesive, since they contain functions implementing closely related tasks. In the latter case, memory optimization is still possible by adopting, as noted, dynamically loadable libraries (at the expense, however, of performance, as explained in Section 4.6.2). We decided to leave the libdbmi clusters as they were after hierarchical clustering and to perform a "second iteration" of the SRGA refactoring on libgis and libvect-new, this time also taking into consideration the Feedback Factor FF; for the sake of completeness, results are also reported for libdbmi.

By varying the weights w1, w2 and w3 we obtained different results. As shown in Table 5, it was never possible to achieve complete cluster decoupling and, at the same time, obtain libraries very close to the structure proposed by the developers. In Table 5, the comparison of the first three result columns with the last three highlights that, after the first SRGA iteration, the coupling between clusters remained low; on the other hand, as highlighted by the high FF values before the second iteration, the identified libraries tend to have a structure which somewhat differs from the developers' intention. The second iteration of the SRGA tried to decrease FF while, unfortunately, coupling increased.

Table 5
Results of the second round of the library refactoring process (w3 ≠ 0)

  Library   Number of   Candidate       Before second round       After second round
            objects     libraries (k)   FF     DF     PR (%)      FF     DF     PR (%)
  libgis    184         4               203    26     48          128    60     52
  libdbmi   97          3               97     4      46          23     43     39
  libvect   54          3               72     3      40          30     6      52

At such a stage, in the authors' opinion, developers may decide either to produce meaningful libraries and reduce the memory requirements using dynamically-loadable libraries, or to obtain independent clusters, which may not always group conceptually related objects as expected. Although it is counterintuitive, the latter result is not surprising, since experts classified functions according to their intended purpose or semantics; this seldom ensures high cohesion and low coupling, because improving the latter attributes produces a final partitioning which somewhat differs from what was expected.

The addition of hill climbing to the SRGA did not improve the fitness function value, since the SRGA alone converged to similar results when executed with a larger number of generations and a larger population. Noticeably, however, performing hill climbing on the best individuals of each generation produced a drastic reduction of convergence times: comparing the two strategies at the point where the difference between their fitness function values fell below 10% showed that the hybrid strategy reduced the execution time by 43% on average. Convergence times on a Compaq Proliant with dual Xeon 900 MHz processors, 2 MB cache and 4 GB of RAM are reported in Table 6.

Table 6
Performance comparison between pure GA and hybrid GA with hill climbing

  Library   Pure GA                      Hybrid GA                    Fitness          Time
            Fitness function   Time (s)  Fitness function   Time (s)  difference (%)   difference (%)
  libgis    3119               9113      3239               4524      1                49
  libdbmi   77                 509       83                 190       7                37
  libvect   195                96        198                41        3                43
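The SRGA combines the factors above into a weighted fitness (the exact definition is given in Section 4.6 and is not repeated here) and, in the hybrid configuration, refines the best individuals of each generation by hill climbing. The sketch below is only an illustrative stand-in under assumed interfaces for the three factors; it is not the SRGA implementation.

```python
# Sketch: a weighted three-factor fitness and the hill-climbing refinement
# applied to promising partitions. df, pr and ff are passed in as callables,
# so no particular definition of the three factors is assumed here.
import random

def fitness(part, df, pr, ff, w1=1.0, w2=1.0, w3=0.0):
    """Lower is better: weighted combination of the dependency factor,
    partial reuse and feedback factor of a partition (object -> cluster)."""
    return w1 * df(part) + w2 * pr(part) + w3 * ff(part)

def hill_climb(part, k, df, pr, ff, weights=(1.0, 1.0, 0.0), steps=1000):
    """Keep single-object cluster moves that do not worsen the fitness."""
    part = list(part)
    best = fitness(part, df, pr, ff, *weights)
    for _ in range(steps):
        i = random.randrange(len(part))
        old = part[i]
        part[i] = random.randrange(k)        # tentative move to another cluster
        new = fitness(part, df, pr, ff, *weights)
        if new <= best:
            best = new                       # accept an improving (or equal) move
        else:
            part[i] = old                    # revert a worsening move
    return part, best
```

In the hybrid configuration, a step of this kind applied to the best individuals of each generation is what produced the roughly 43% average reduction of convergence time reported in Table 6.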
6.5. Extraction of new libraries

To identify new candidate libraries, the final step of the SRF is devoted to the analysis of the Use Graph obtained by subtracting the already existing libraries: sometimes there are groups of objects used by a common set of applications which have not yet been organized into libraries. Clustering was performed on objects used by at least two applications. The results revealed the presence of four clusters, all located in the orthophoto subsystem. The number of dependencies between clusters was low, and it was possible to resolve them by simply moving a couple of objects between clusters. However, all clusters had a considerable number of dependencies on external objects belonging to the same set of applications; eliminating these dependencies would have required increasing the size of each cluster by 100%, clearly in contradiction with the intended objective of reducing the applications' memory requirements and size. Consequently, it was decided not to cluster these objects into libraries. In the authors' opinion, this is not a negative result: it constitutes a quality indicator of the system, showing that developers had carefully created and maintained the libraries.

7. Conclusions

This paper has presented a framework for software system renovation (SRF) and the results of its application to the GRASS Geographical Information System, which is over one million LOCs in size. The SRF has allowed us to remove several structural problems from GRASS. In particular, unused objects were identified and factored out; clones were identified and, especially for those inside libraries, refactoring was performed. The SRF incorporates a novel library refactoring process, in which a suboptimal solution is first identified by hierarchical clustering and then refined by the SRGA. The proposed SRGA fitness function takes into account different factors: minimizing the number of dependencies, the average number of objects linked by each application, and the feedback of developers. Although the approach has been applied to C and C++ systems (GRASS and others reported by Antoniol et al., 2003), it is not tied to any specific programming language, provided that object modules, which contain the list of defined and required symbols, are available. However, for applications executed on a virtual machine, such as Java or Smalltalk programs, other approaches, such as those of Tip et al. (1999) and Rayside and Kontogiannis (2002), may be preferable.

Overall, the SRF helps to monitor and improve the quality of a software system, which inevitably tends to deteriorate during evolution. Unused objects, clones, library coupling, library sizes, and poor object organization are in fact significant quality indicators; for instance, the absence of new libraries identified by the SRF in GRASS indicates a careful design and a controlled evolution. Moreover, the SRF also addresses the miniaturization problem, which is relevant when porting applications to limited-resource devices. The SRF has allowed us to reduce GRASS memory requirements and to improve its performance: the average number of library objects linked by each application was indeed reduced by about 50%. At the time of writing, GRASS has successfully been ported to a PDA (a Compaq iPAQ). Given the size of the application and the available resources, a brute-force automatic approach would not have been feasible, since developers' suggestions were an essential component of the miniaturization process.

Clone detection performed on GRASS revealed that the cloning level outside libraries was not negligible and suggested further clone refactoring. Conversely, the cloning level inside libraries was in general low, except for the mentioned cases, and the cloning between libraries and the rest of the system was in most cases due to third-party applications.

Most of the system reorganization work described in the paper was incorporated in the subsequent releases of GRASS, by removing unused objects and some clones and by reorganizing some libraries; the latter reorganization, as pointed out in the paper, was carried out with minor modifications with respect to the result of the SRF. Our in-progress work is devoted to investigating the feasibility of integrating other sources of knowledge into the SRF, with special regard to dynamic information and in-field user profiles (Antoniol and Di Penta, 2003), obtained by instrumenting the source code.

Acknowledgments

We are grateful to the GRASS development team for the support, the information provided, and the feedback on the refactored artifacts. Giuliano Antoniol and Massimiliano Di Penta were partially supported by the ASI grant I/R/091/00. Markus Neteler was partially supported by the FUR-PAT Project WEBFAQ. Ettore Merlo was partially supported by the Natural Sciences and Engineering Research Council of Canada (NSERC).
References

Anderberg, M.R., 1973. Cluster Analysis for Applications. Academic Press Inc.
Anquetil, N., 2000. A comparison of graphs of concept for reverse engineering. In: Proceedings of the IEEE International Workshop on Program Comprehension. IEEE Computer Society Press, Los Alamitos, CA, USA, pp. 231–240.
Anquetil, N., Lethbridge, T., 1998. Extracting concepts from file names; a new file clustering criterion. In: Proceedings of the International Conference on Software Engineering. IEEE Computer Society Press, Los Alamitos, CA, USA, pp. 84–93.
Antoniol, G., Di Penta, M., 2003. Library miniaturization using static and dynamic information. In: Proceedings of the IEEE International Conference on Software Maintenance, Amsterdam, The Netherlands, pp. 235–244.
Antoniol, G., Casazza, G., Di Penta, M., Merlo, E., 2001a. A method to re-organize legacy systems via concept analysis. In: Proceedings of the IEEE International Workshop on Program Comprehension, Toronto, ON, Canada. IEEE Computer Society Press, Los Alamitos, CA, USA, pp. 281–290.
Antoniol, G., Casazza, G., Di Penta, M., Merlo, E., 2001b. Modeling clones evolution through time series. In: Proceedings of the IEEE International Conference on Software Maintenance, pp. 273–280.
Antoniol, G., Villano, U., Merlo, E., Di Penta, M., 2002. Analyzing cloning evolution in the Linux kernel. Information and Software Technology 44, 755–765 (SCAM 2002 special issue).
Antoniol, G., Di Penta, M., Neteler, M., 2003. Moving to smaller libraries via clustering and genetic algorithms. In: Proceedings of the European Conference on Software Maintenance and Reengineering, Benevento, Italy. IEEE Computer Society Press, Los Alamitos, CA, USA, pp. 307–316.
Bui, T.N., Moon, B.R., 1996. Genetic algorithm and graph partitioning. IEEE Transactions on Computers 45 (7), 841–855.
Cordy, J., 2003. Comprehending reality—practical barriers to industrial adoption of software maintenance automation. In: Proceedings of the IEEE International Workshop on Program Comprehension, Portland, OR, USA, pp. 196–205.
Deb, K., 1999. Multi-objective genetic algorithms: problem difficulties and construction of test problems. Evolutionary Computation 7 (3), 205–230.
Di Penta, M., Neteler, M., Antoniol, G., Merlo, E., 2002. Knowledge-based library re-factoring for an open source project. In: Proceedings of the IEEE Working Conference on Reverse Engineering, Richmond, VA. IEEE Computer Society Press, Los Alamitos, CA, USA, pp. 128–137.
Doval, D., Mancoridis, S., Mitchell, B., 1999. Automatic clustering of software systems using a genetic algorithm. In: Software Technology and Engineering Practice (STEP), Pittsburgh, PA, pp. 73–91.
Garey, M., Johnson, D., 1979. Computers and Intractability: A Guide to the Theory of NP-Completeness. W.H. Freeman.
Goldberg, D.E., 1989. Genetic Algorithms in Search, Optimization and Machine Learning. Addison-Wesley.
Gordon, A., 1988. Classification, 2nd ed. Chapman and Hall, London.
Harman, M., Hierons, R., Proctor, M., 2002. A new representation and crossover operator for search-based optimization of software modularization. In: AAAI Genetic and Evolutionary Computation Conference (GECCO). Springer-Verlag, New York, USA, pp. 82–87.
Kaufman, L., Rousseeuw, P., 1990. Finding Groups in Data: An Introduction to Cluster Analysis. Wiley-Interscience, New York.
Krone, M., Snelting, G., 1994. On the inference of configuration structures from source code. In: Proceedings of the 16th International Conference on Software Engineering, Sorrento, Italy. IEEE Computer Society Press, Los Alamitos, CA, USA, pp. 49–57.
Kuipers, T., Moonen, L., 2000. Types and concept analysis for legacy systems. In: Proceedings of the IEEE International Workshop on Program Comprehension. IEEE Computer Society Press, Los Alamitos, CA, USA, pp. 221–230.
Kuipers, T., van Deursen, A., 1999. Identifying objects using cluster and concept analysis. In: Proceedings of the International Conference on Software Engineering. IEEE Computer Society Press, Los Alamitos, CA, USA, pp. 246–255.
Lehman, M.M., Belady, L.A., 1985. Software Evolution—Processes of Software Change. Academic Press, London.
Mahdavi, K., Harman, M., Hierons, R.M., 2003. A multiple hill climbing approach to software module clustering. In: Proceedings of the IEEE International Conference on Software Maintenance, Amsterdam, The Netherlands, pp. 315–324.
Maini, H., Mehrotra, K., Mohan, C., Ranka, S., 1994. Knowledge-based nonuniform crossover. In: IEEE World Congress on Computational Intelligence. IEEE Computer Society Press, Los Alamitos, CA, USA, pp. 22–27.
Mancoridis, S., Mitchell, B.S., Rorres, C., Chen, Y., Gansner, E.R., 1998. Using automatic clustering to produce high-level system organizations of source code. In: Proceedings of the IEEE International Workshop on Program Comprehension. IEEE Computer Society Press, Los Alamitos, CA, USA.
Merlo, E., McAdam, I., De Mori, R., 1993. Source code informal information analysis using connectionist model. In: Proceedings of the International Joint Conference on Artificial Intelligence, pp. 1339–1344.
Mitchell, M., 1996. An Introduction to Genetic Algorithms. MIT Press, Cambridge, MA, USA.
Neteler, M. (Ed.), 2001. GRASS 5.0 Programmer's Manual. Geographic Resources Analysis Support System. ITC-irst, Italy. Available from: <http://grass.itc.it/grassdevel.html>.
Neteler, M., Mitasova, H., 2002. Open Source GIS: A GRASS GIS Approach. Kluwer Academic Publishers, Boston, USA / Dordrecht, Holland / London, UK.
Oommen, B., de St. Croix, E., 1996. Graph partitioning using learning automata. IEEE Transactions on Computers 45 (2), 195–208.
Rayside, D., Kontogiannis, K., 2002. Extracting Java library subsets for deployment on embedded systems. Science of Computer Programming 45 (2–3), 245–270.
Shazely, S., Baraka, H., Abdel-Wahab, A., 1998. Solving graph partitioning problem using genetic algorithms. In: Midwest Symposium on Circuits and Systems. IEEE Computer Society Press, Los Alamitos, CA, USA, pp. 302–305.
Siff, M., Reps, T., 1999. Identifying modules via concept analysis. IEEE Transactions on Software Engineering 25, 749–768.
Snelting, G., 2000. Software reengineering based on concept lattices. In: Proceedings of the IEEE International Conference on Software Maintenance. IEEE Computer Society Press, Los Alamitos, CA, USA, pp. 3–10.
Talbi, E., Bessière, P., 1991. A parallel genetic algorithm for the graph partitioning problem. In: ACM International Conference on Supercomputing, Cologne, Germany. ACM Press, New York, USA.
Tip, F., Laffra, C., Sweeney, P.F., Streeter, D., 1999. Practical experience with an application extractor for Java. ACM SIGPLAN Notices 34 (10), 292–305.
Tonella, P., 2001. Concept analysis for module restructuring. IEEE Transactions on Software Engineering 27 (4), 351–363.
Tzerpos, V., Holt, R.C., 1998. Software botryology: automatic clustering of software systems. In: DEXA Workshop. IEEE Computer Society Press, Los Alamitos, CA, USA, pp. 811–818.
Tzerpos, V., Holt, R.C., 1999. MoJo: a distance metric for software clusterings. In: Proceedings of the IEEE Working Conference on Reverse Engineering. IEEE Computer Society Press, Los Alamitos, CA, USA, pp. 187–195.
Tzerpos, V., Holt, R.C., 2000a. ACDC: an algorithm for comprehension-driven clustering. In: Proceedings of the IEEE Working Conference on Reverse Engineering. IEEE Computer Society Press, Los Alamitos, CA, USA, pp. 258–267.
Tzerpos, V., Holt, R.C., 2000b. The stability of software clustering algorithms. In: Proceedings of the IEEE International Workshop on Program Comprehension. IEEE Computer Society Press, Los Alamitos, CA, USA.
Wiggerts, T.A., 1997. Using clustering algorithms in legacy systems remodularization. In: Proceedings of the IEEE Working Conference on Reverse Engineering. IEEE Computer Society Press, Los Alamitos, CA, USA.
Massimiliano Di Penta received his laurea degree in Computer Engineering in 1999 and his PhD in Computer Science Engineering in 2003 at the University of Sannio in Benevento, Italy. Currently he is with RCOST—Research Centre on Software Technology at the same university. His main research interests include software maintenance, software quality, reverse engineering, program comprehension and search-based software engineering. He is the author of about 30 papers published in international journals, conferences and workshops, and serves on the program and organizing committees of workshops and conferences in the software maintenance field, such as the International Conference on Software Maintenance, the International Workshop on Program Comprehension and the Workshop on Source Code Analysis and Manipulation.

Markus Neteler received his M.Sc. degree in Physical Geography and Landscape Ecology from the University of Hanover, Germany, in 1999. He worked at the Institute of Geography as a research scientist and teaching associate for two years. Since 2001 he has been a researcher at ITC-irst (Centre for Scientific and Technological Research), Trento, Italy. His main research interests are remote sensing for environmental risk assessment and Free Software GIS development. He is the author of two books on the open source Geographical Information System GRASS and of various papers on GIS applications.

Giuliano Antoniol received his doctoral degree in Electronic Engineering from the University of Padua in 1982. He worked at IRST for 10 years, where he led the IRST Program Understanding and Reverse Engineering (PURE) project team. Giuliano Antoniol has published more than 60 papers in journals and international conferences. He has served as a member of the program committees of international conferences and workshops such as the International Conference on Software Maintenance, the International Workshop on Program Comprehension and the International Symposium on Software Metrics. He is presently a member of the editorial boards of the journals Software Testing, Verification & Reliability, Information and Software Technology, Empirical Software Engineering and the Journal of Software Quality. He is currently Associate Professor at the University of Sannio, Faculty of Engineering, where he works in the area of software metrics, process modeling, software evolution and maintenance.

Ettore Merlo received his Ph.D. in computer science from McGill University (Montreal) in 1989 and his Laurea degree—summa cum laude—from the University of Turin (Italy) in 1983. He was the lead researcher of the software engineering group at the Computer Research Institute of Montreal (CRIM) until 1993, when he joined École Polytechnique de Montréal, where he is currently an associate professor. His research interests are in software analysis, software reengineering, user interfaces, software maintenance, artificial intelligence and bio-informatics. He has collaborated with several industries and research centers, in particular on software reengineering, clone detection, software quality assessment, software evolution analysis, testing, architectural reverse engineering and bio-informatics.