A language-independent software renovation framework
Article in The Journal of Systems and Software · September 2005
DOI: 10.1016/j.jss.2004.03.033 · Source: DBLP
JSS 7667 · ARTICLE IN PRESS · 15 November 2004 · No. of Pages 16, DTD = 5.0.1
The Journal of Systems and Software xxx (2004) xxx–xxx
www.elsevier.com/locate/jss
M. Di Penta a,*, M. Neteler b, G. Antoniol a, E. Merlo c

a Department of Engineering, RCOST—Research Centre on Software Technology, University of Sannio, Via Traiano 1, I-82100 Benevento, Italy
b ITC-irst Istituto Trentino Cultura, Via Sommarive 18, I-38050 Povo (Trento), Italy
c École Polytechnique de Montréal, Montréal, Québec, Canada

Received 1 April 2003; received in revised form 16 July 2003; accepted 2 March 2004
Abstract

One of the undesired effects of software evolution is the proliferation of components that are no longer used by any application. As a consequence, the size of binaries and libraries tends to grow, and system maintainability tends to decrease. At the same time, a major trend of today's software market is the porting of applications to hand-held devices or, in general, to devices with a limited amount of available resources. Refactoring and, in particular, the miniaturization of libraries and applications are therefore necessary.

We propose a Software Renovation Framework (SRF) and a toolkit covering several aspects of software renovation, such as removing unused objects and code clones, and refactoring existing libraries into smaller, more cohesive ones. Refactoring has been implemented in the SRF using a hybrid approach based on hierarchical clustering, genetic algorithms, and hill climbing, also taking into account the developers' feedback. The SRF aims to monitor software system quality in terms of the identified affecting factors, and to perform renovation activities when necessary. Most of the framework activities are language-independent, do not require any kind of source code parsing, and rely on object module analysis.

The SRF has been applied to GRASS, a large open source Geographical Information System of about one million LOCs. It has significantly improved the software organization, has reduced by about 50% the average number of objects linked by each application, and has consequently also reduced the applications' memory requirements.

© 2004 Elsevier Inc. All rights reserved.
Keywords: Refactoring; Software renovation; Clustering; Genetic algorithms; Hill climbing
1. Introduction

Software systems evolution often presents several factors that contribute to deteriorating the quality of the system itself (Lehman and Belady, 1985). First, unused components, which have been introduced for testing purposes or which belong to obsolete functionalities, may proliferate. Second, maintenance and evolution activities are likely to introduce clones, for example when adding support and drivers for an architecture similar to an already supported one (Antoniol et al., 2002). Third, library sizes tend to increase, because new functionalities are added and refactoring is rarely performed; for the same reasons, the number of inter-library dependencies, some of which are circular, also tends to increase. Finally, new functionalities logically related to already existing ones are sometimes added in a non-systematic way, resulting in sets of modules which are neither organized nor linked into libraries. As a consequence, systems become difficult to maintain. Moreover, unused objects, big libraries, and circular dependencies significantly increase application sizes and memory requirements. This is clearly in contrast with today's industry hype towards porting existing software applications onto hand-held devices, such as Personal Digital Assistants (PDAs), onto wireless devices (e.g., multimedia cell phones), or, in general, onto devices with limited resources.

This paper proposes the SRF to monitor and control some of the quality factors described above. When the number of unused objects and clones increases, or when library sizes become unmanageable, several actions may be taken. First and foremost, unused code may be removed and clones may be monitored or factored out. Furthermore, some form of restructuring, at library and at object file level, may be required. Together with monitoring and improving maintainability, the SRF eases the miniaturization challenge of porting applications onto limited-resource devices.

Most of the SRF activities deal with analyzing dependencies among software artifacts. For any given software system, dependencies among executables and object files may be represented via a dependency graph, where nodes represent resources and edges represent dependencies. Each library, in turn, may be thought of as a subgraph of the overall object file dependency graph. Therefore, software miniaturization can be modeled as a graph partitioning problem. Unfortunately, it is well known that graph partitioning is an NP-hard problem (Garey and Johnson, 1979), and thus heuristics have been adopted to find a "good-enough" solution. For example, one may be interested in first examining graph partitions by minimizing cross edges between the subgraphs which correspond to libraries. More formally, a cost function describing the restructuring problem has to be defined, and heuristics to drive the solution search process must be identified and applied.

We propose a novel approach in which hierarchical clustering and Silhouette statistics (Kaufman and Rousseeuw, 1990) are initially used to determine the optimal number of clusters and the starting population of a Software Renovation Genetic Algorithm (SRGA). This initial step is followed by an SRGA search aimed at minimizing a multi-objective function which takes into account, at the same time, both the number of inter-library dependencies and the average number of objects linked by each application. Finally, by letting the SRGA fitness function also consider the experts' suggestions, the SRF becomes a semi-automatic approach composed of multiple refactoring iterations, interleaved with developers' feedback. To speed up the search process, heuristics based on a Genetic Algorithm (GA) and a modified GA (Talbi and Bessière, 1991) approach were proposed. Performance improvement was also achieved by means of a hybrid approach, which combines GA strategies with hill climbing techniques.

The SRF has the advantage of being language-independent. All activities, except clone detection, rely on information extracted from object files; furthermore, the clone detection algorithm adopted in the SRF is not tied to any specific programming language, provided that a set of metrics can be extracted from the source code.

The SRF has been applied to a large Open Source software system: a Geographical Information System (GIS) named GRASS 1 (Geographic Resources Analysis Support System). GRASS is a raster/vector GIS combined with integrated image processing and data visualization subsystems (Neteler and Mitasova, 2002), composed of 517 applications and 43 libraries, for a total of over one million LOCs. The development team is small, with about 7-15 active developers. Decisions are usually taken by the members most capable of solving specific problems. Developers are also GRASS users, and they often focus on their own needs within the general project.

This paper is organized as follows. First, a short review of related work (Section 2) and of the main notions of clustering and GAs (Section 3) is presented. Then, the SRF is presented in Section 4. The case study software system (i.e., GRASS) is described in Section 5, while results are presented and discussed in Section 6, followed by conclusions and work-in-progress in Section 7.

* Corresponding author.
E-mail addresses: [email protected] (M. Di Penta), [email protected] (M. Neteler), [email protected] (G. Antoniol), [email protected] (E. Merlo).
1 http://grass.itc.it
0164-1212/$ - see front matter © 2004 Elsevier Inc. All rights reserved. doi:10.1016/j.jss.2004.03.033

2. Related work

Many research contributions have been published about software system module clustering and restructuring, identifying objects, and recovering or building libraries. Most of these works applied clustering or Concept Analysis (CA).

An overview of CA applications to software reengineering problems was published by Snelting in his seminal work (Snelting, 2000). Snelting applied CA to several remodularization problems such as exploring configuration spaces (see also Krone and Snelting, 1994), transforming class hierarchies, and remodularizing COBOL systems. Kuipers and Moonen (2000) combined CA and type inference in a semi-automatic approach to find objects in COBOL legacy code. Antoniol et al. (2001a) applied CA to the problem of identifying libraries and of defining new directory and file organizations in software systems with degraded architectures. As in Krone and Snelting (1994), Kuipers and Moonen (2000), and Antoniol et al. (2001a), we believe that with the present level of technology a programmer-centric approach is required, since programmers are in charge of choosing the proper remodularization strategy based on their knowledge
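The dependency graph described in the introduction can be recovered from object modules alone, without parsing source code. The following is a rough illustration, not the SRF toolkit: the function names are ours, and a GNU binutils `nm`-style symbol listing is assumed. Each object declares the symbols it defines and those it leaves undefined, and an edge is drawn whenever one object's undefined symbol is defined by another:

```python
import re
from collections import defaultdict

def parse_nm(listing):
    """Split one object's nm-style listing into defined and undefined symbols."""
    defined, undefined = set(), set()
    for line in listing.splitlines():
        # optional hex address, one symbol-type letter, symbol name
        m = re.match(r'^\s*(?:[0-9a-fA-F]+\s+)?([A-Za-z])\s+(\S+)$', line)
        if not m:
            continue
        kind, symbol = m.groups()
        if kind == 'U':                  # referenced but not defined here
            undefined.add(symbol)
        elif kind in 'TDBRW':           # text, data, bss, read-only, weak defs
            defined.add(symbol)
    return defined, undefined

def dependency_graph(objects):
    """objects: {object_name: nm_listing}; returns {obj: set of objs it uses}."""
    provider = {}
    for name, listing in objects.items():
        defined, _ = parse_nm(listing)
        for sym in defined:
            provider[sym] = name
    graph = defaultdict(set)
    for name, listing in objects.items():
        _, undefined = parse_nm(listing)
        for sym in undefined:
            # edge only when the symbol is resolved by another object
            if sym in provider and provider[sym] != name:
                graph[name].add(provider[sym])
    return graph
```

For instance, an object whose listing contains `U helper` gains an edge to whichever object defines `helper`; symbols resolved outside the analyzed set (e.g., libc's `printf`) simply produce no edge.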
and judgment. A comparison between clustering and CA was presented by Kuipers and van Deursen (1999). Our work also applies agglomerative-nesting clustering to a Boolean usage matrix, although in Kuipers and van Deursen (1999) the matrix indicated the uses of variables by programs.

Surveys and overviews of cluster analysis applied to software systems have been published in the past, for example, by Wiggerts (1997) and by Tzerpos and Holt (1998). The latter authors (Tzerpos and Holt, 1999) defined a metric to evaluate the similarity of different decompositions of software systems. Tzerpos and Holt (2000a) proposed a novel clustering algorithm specifically conceived to address the peculiarities of program comprehension; they also addressed the issue of stability of software clustering algorithms (Tzerpos and Holt, 2000b). Applications of clustering to reengineering were suggested by Anquetil and Lethbridge (1998), who devised a method for decomposing complex software systems into independent subsystems; source files were clustered according to file names and their name decomposition. An approach relying on inter-module and intra-module dependency graphs to refactor software systems was presented by Mancoridis et al. (1998). We share with Mancoridis et al. (1998) the idea of analyzing dependency graphs and of finding a tradeoff between highly cohesive and loosely inter-connected libraries.

GAs have recently been applied in different fields of computer science and software engineering. An approach for partitioning a graph using GAs was discussed by Talbi and Bessière (1991). Similar approaches were also published by Shazely et al. (1998), Bui and Moon (1996), and Oommen and de St Croix (1996). Maini et al. (1994) discussed a method to introduce knowledge about the problem into a non-uniform crossover operator and presented some examples of its application. A GA was used by Doval et al. (1999) to identify clusters in software systems. We share with Doval et al. (1999) the idea of a software clustering approach which uses a GA and which tries to minimize inter-cluster dependencies. Finally, Harman et al. (2002) reported experiments of modularization and remodularization, comparing GAs with hill climbing techniques and introducing a representation and a crossover operator tied to the remodularization problem. Their case studies revealed that hill climbing outperformed GAs. Mahdavi et al. (2003) proposed an approach that combines multiple hill climbs for subsequent searches, thus reducing the search spaces.

Software miniaturization for Java applications was recently addressed by Jax, an application extractor for Java software systems (Tip et al., 1999) whose goal is the size reduction of Java programs, with particular interest in applets to be transmitted over the network. Jax is based on transformations including removal of redundant methods and fields, devirtualization and inlining of method calls, renaming of methods, fields, classes and packages, and transformation of class hierarchies. Another approach, devoted to reducing the size of Java libraries for embedded systems, was proposed by Rayside and Kontogiannis (2002). While the approaches proposed by Rayside and Kontogiannis (2002) and Jax are tied to a programming language, ours is not. Our approach also differs from Jax in philosophy, since we do not limit ourselves to reducing the size of the instance application to be executed, but also support the reorganization of a software system whose structure has deteriorated because of its evolution. The reduction of memory requirements is thus just one of the effects of the reorganization.

This paper extends preliminary contributions (Di Penta et al., 2002; Antoniol et al., 2003). With Di Penta et al. (2002), we share the choice of GRASS as target application and several of the activities carried out to refactor libraries.

3. Background notions

The fundamental activity of the SRF is library refactoring. This requires the integration of clustering and GA techniques in a semi-automatic, human-driven process. Clustering deals with the grouping of large amounts of things (entities) into groups (clusters) of closely related entities (Kaufman and Rousseeuw, 1990; Anderberg, 1973). Clustering is used in different areas, such as business analysis, economics, astronomy, information retrieval, image processing, pattern recognition, biology, and others. GAs come from an idea, born over 30 years ago, of applying the biological principle of evolution to artificial systems. GAs are applied to different domains such as machine and robot learning, economics, operations research, ecology, studies of evolution, learning, and social systems (Goldberg, 1989; Mitchell, 1996).

In the following subsections, for the sake of completeness, only some essential notions are summarized, because describing the different types of clustering algorithms or the details of GAs is out of the scope of this paper. More details can be found in Anderberg (1973) for clustering and in Goldberg (1989) and Mitchell (1996) for GAs.

3.1. Agglomerative hierarchical clustering

In this paper, the agglomerative-nesting (Agnes) algorithm (Kaufman and Rousseeuw, 1990) was applied to build the initial set of candidate libraries. Agnes is an agglomerative, hierarchical clustering algorithm: it builds a hierarchy of clusters in such a way that each level contains the same clusters as the first lower level, except for two clusters, which are joined to form a single cluster.

3.2. Determining the optimal number of clusters

To determine the actual or optimal number of clusters, people traditionally rely on the plot of an error measure representing the dispersion within a cluster. The error measure decreases as the number of clusters, k, increases, but for some values of k the curve flattens. Kaufman and Rousseeuw (1990) proposed the Silhouette statistics for estimating and assessing the optimal number of clusters. For the observation i, let a(i) be the average distance to the other points in its cluster, and b(i) the average distance to points in the nearest cluster. Then the Silhouette statistic is defined as

s(i) = (b(i) − a(i)) / max(a(i), b(i)).   (1)

Kaufman and Rousseeuw suggested choosing the optimal number of clusters as the value maximizing the average s(i) over the dataset. Traditionally, it is assumed that the error curve knee indicates the appropriate number of clusters (Gordon, 1988).

Often, a compromise has to be accepted between maximizing the Silhouette (and thus having highly cohesive clusters) and obtaining an excessive number of clusters (which, in our application, causes library fragmentation).

3.3. Genetic algorithms

Applications based on GAs have revealed their effectiveness in finding approximate solutions when the search space is large or complex, when mathematical analysis or traditional methods are not available, and, in general, when the problem to be solved is NP-complete or NP-hard (Garey and Johnson, 1979). Roughly speaking, a GA may be defined as an iterative procedure that searches for the best solution of a given problem among a constant-size population, represented by a finite string of symbols, the genome. The search starts from an initial population of individuals, often randomly generated. At each evolutionary step, individuals are evaluated using a fitness function; high-fitness individuals have the highest probability of reproducing.

The evolution (i.e., the generation of a new population) is performed by means of two kinds of operators: the crossover operator and the mutation operator. The crossover operator takes two individuals (the parents) of the old generation and exchanges parts of their genomes, producing one or more new individuals (the offspring). The mutation operator has been introduced to prevent convergence to local optima; it randomly modifies an individual's genome, for example, by flipping some of its bits if the genome is represented by a bit string. Crossover and mutation are performed on each individual of the population with probabilities pcross and pmut respectively, where pmut ≪ pcross.

GAs are not guaranteed to converge. The termination condition is often based on a maximum number of generations or on a given value of the fitness function.

3.3.1. Hill climbing and GA hybrid approaches

As suggested by Goldberg (1989), hybrid GAs may be advantageous when there is a need for optimization techniques tied to a specific problem structure. The in-the-large perspective of GAs may be combined with the precision of local search. GAs are able to explore large search spaces, but often they reach a solution that is not accurate, or they converge very slowly to an accurate one. On the other hand, local optimization techniques, such as hill climbing, quickly converge to a local optimum, but they are not very effective for searching large solution spaces because of the possible presence of local maxima or plateaus.

There are at least two different ways to hybridize a GA with hill climbing techniques. The first approach attempts to optimize the best individuals of the last generation using hill climbing. The second approach uses hill climbing to optimize the best individuals of each generation. Applying hill climbing on each generation could be expensive. However, this technique "inserts" into each generation high-quality individuals, determined by the optimization phase, and therefore reduces the number of generations required to achieve convergence.

4. The refactoring framework

As highlighted in the introduction, the proposed framework consists of several steps:

• First and foremost, software system applications, libraries, and dependencies among them are identified;
• Unused functions and objects are identified, removed, or factored out;
• Duplicated or cloned objects are identified and possibly factored out;
• Circular dependencies among libraries, which cause a library to be linked each time another circularly linked library is needed, are removed or, at least, reduced;
• Large libraries are refactored into smaller ones and, if possible, transformed into dynamic libraries; and
• Objects which are used by multiple applications, but which are not yet organized into libraries, are grouped into new libraries.

The SRF activities and the adopted representations are detailed in the following subsections.
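The hybrid of a GA with per-generation hill climbing described in Section 3.3.1 can be sketched on a toy graph-partitioning instance. This is a deliberate simplification of the SRGA, not the actual implementation: the fitness here is single-objective (inter-cluster dependencies, plus a penalty so that the trivial everything-in-one-cluster assignment loses), and the single-point crossover and one-gene mutation are illustrative choices:

```python
import random

def fitness(assign, edges, k):
    """Lower is better: inter-cluster edges plus a penalty for unused clusters."""
    cut = sum(1 for a, b in edges if assign[a] != assign[b])
    return cut + len(assign) * (k - len(set(assign)))

def hill_climb(assign, edges, k):
    """First-improvement local search: move one object at a time."""
    assign = list(assign)
    improved = True
    while improved:
        improved = False
        for node in range(len(assign)):
            current = fitness(assign, edges, k)
            original = assign[node]
            for c in range(k):
                if c == original:
                    continue
                assign[node] = c
                if fitness(assign, edges, k) < current:
                    improved = True
                    break
                assign[node] = original   # undo a non-improving move
    return assign

def hybrid_ga(n, edges, k, pop_size=20, generations=40, pmut=0.2, seed=1):
    """GA over cluster assignments, refining the best individual each generation."""
    rng = random.Random(seed)
    pop = [[rng.randrange(k) for _ in range(n)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda ind: fitness(ind, edges, k))
        pop[0] = hill_climb(pop[0], edges, k)   # hill-climb the elite
        nxt = pop[:2]                           # elitism: keep the two best
        while len(nxt) < pop_size:
            p1, p2 = rng.sample(pop[:pop_size // 2], 2)
            cut_point = rng.randrange(1, n)     # single-point crossover
            child = p1[:cut_point] + p2[cut_point:]
            if rng.random() < pmut:             # mutation: reassign one object
                child[rng.randrange(n)] = rng.randrange(k)
            nxt.append(child)
        pop = nxt
    pop.sort(key=lambda ind: fitness(ind, edges, k))
    return hill_climb(pop[0], edges, k)
```

On a graph made of two three-object cliques joined by a single edge, the search typically settles on the two cliques as clusters, leaving one inter-cluster dependency; the per-generation local search is what keeps the number of generations low, as noted in Section 3.3.1.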
364 4.1. Software system graph representation
A graph representation of dependencies between object modules is central to our framework and most of the
SRF computations rely on it. Software systems can be
represented by an instance of the System Graph (SG),
an example of which is depicted in Fig. 1.
SG is defined as
where O {o1, o2, . . ., op} is the set of all object modules;
L {l1, l2, . . ., ln}, where li O i = 1, . . ., n, is the set of
all software system libraries. Libraries, subsets of objects, are depicted in Fig. 1 as rounded boxes;
A {a1, a2, . . ., am}, where A O and A \ {¨i li} = ;,
is the set of all software system applications. Applications, i.e. the object modules containing the main symbol, are represented in Fig. 1 as squares source nodes; 2
and D O · O is the set of oriented edges di,j representing dependencies between objects.
We can extract from the SG graph two other graphs
useful for our refactoring purposes. The first graph is
called Use Graph and it highlights the uses of objects
by applications or by libraries. The use relationship is
defined as
389 ax uses oy () 9 pathfax ; . . . ; oy g 2 SG:
ð3Þ
In other words the Use Graph highlights the reachability between applications and library objects in SGs.
Such reachability can be obtained computing a k-fold
product on the graph represented by an adjacency
matrix.
Similarly, the second graph is called Dependency
Graph and it is used to represent existing dependencies
between two or more libraries, or between to-be-refactored objects contained in a library. The clustering algorithm should avoid inter-cluster dependencies. The
dependency relationship is defined as
402
403
404
405
406
407
ox depends on oy () ox uses oy ^ ox 2 L ^ oy 2 L:
REC
390
391
392
393
394
395
396
397
398
399
400
ð4Þ
OR
In particular, a dependency (ox, oy) is considered an inter-library dependency, i.e., a dependency that increases
the coupling, if ox 2 li, oy 2 lj, and i 5 j.
Given the above definition of SG, the SRF activities
can be graphically shown in Fig. 2.
408 4.2. Graph construction
UNC
OF
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
ð2Þ
PRO
372 SG fO; L; A; Dg;
Fig. 1. Example of system graph.
be identified. In this paper we rely on an approach similar to the one proposed by Antoniol et al. (2001a).
However, Antoniol et al. (2001a) identified applications
by detecting all source files containing the definition of a
main function.
Once applications and existing libraries are identified,
the SG graph can be built. Given the use relationship between an object module requiring a symbol and a module defining it, the corresponding SG is built via the
transitive closure of the use relationship, starting from
the main object of each application and from each library. In other words, for each application, undefined
symbols are identified and recursively resolved (possibly
adding new undefined symbols to the stack) first inside
the objects contained in the same path (i.e., other modules of the application), then inside libraries. A similar
process is performed to detect dependencies among
libraries. Finally, the use graph and the dependency
graph, represented as adjacency matrices MU and MD,
are extracted from the SG graph.
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
4.3. Handling unused objects
432
Symbols defined in libraries which are neither used by
applications nor by other libraries are likely to represent
useless resources. Their presence is often due to utility
functions which are inserted in libraries but which are
not used by the current set of applications, or it is due
to not yet fully implemented features. The objects defining these unused symbols should be removed from the
libraries, provided that they do not also export used
symbols. In the opposite case such an object should be
left into library and its corresponding source file should
be restructured. One possible refactoring strategy is to
433
434
435
436
437
438
439
440
441
442
443
TED
365
366
367
368
369
370
409
Prior to recover dependencies among applications
410 and libraries, and among libraries themselves, executa411 ble applications composing the software system must
2
Applications are not the only source nodes. In fact, as it will
detailed later, also unused objects have no incoming edges, even if they
can be distinguished from the applications since the latter also define a
main symbol.
JSS 7667
15 November 2004 Disk Used
M. Di Penta et al. / The Journal of Systems and Software xxx (2004) xxx–xxx
TED
PRO
OF
6
No. of Pages 16, DTD = 5.0.1
ARTICLE IN PRESS
Fig. 2. The framework activities.
444 create two new libraries from each library, one of which
445 containing all the unused symbols and the other one
446 containing all the used symbols.
The DG introduced in Section 4.1 captures dependencies among the different libraries and allows the identification of strongly connected components. In particular,
circular dependencies between libraries cause a library
to be linked each time the other one is needed. Once
these dependencies are identified, four strategies could
be used to remove them:
OR
448
449
450
451
452
453
454
REC
447 4.4. Removal of circular dependencies among libraries
UNC
455 (1) Move the object which causes the circular depend456
ence to another library. This is only feasible if the
457
object does not need resources located in its original
458
library and it is not needed by that library;
459 (2) Duplicate the object: like the previous case, this is
460
appropriate, if the object does not need resources
461
located in the original library but, differently from
462
the previous case, the object is required in that
463
library. Moving the object the library outside will
464
make the situation worse;
465 (3) Merge the two libraries: this strategy should be
466
avoided whenever possible because it increases
467
library sizes; however, it could be the only available
solution when the number of objects causing circular and, in general, inter-library dependencies is
very high;
(4) Create dynamic libraries: instead of merging circularly dependent libraries, one may decide to make
them dynamic. Circular dependency problem is
not solved, but the average amount of resources
needed is reduced, as described in Section 4.6.2.
When the DG does not allow the removal of circular
dependencies and, when, for performance reasons, options three and four cannot be adopted, a deeper analysis should be performed to identify dependencies at the
granularity level of functions rather than objects.
Finally, the existence of a complex dependency relationship between two libraries, if confirmed by developerÕs feedback, indicates the possibility of a library
design which has not been done with miniaturization
in mind. In this case, library objects should be merged
and then refactored again in new clusters, adopting the
process detailed in Section 4.6.
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
4.5. Identification of duplicate symbols and clones
489
Examining the list of symbols defined in each library
allows the comparison of exported symbol names. It is
worth noting that homonym symbols in different librar-
490
491
492
JSS 7667
15 November 2004 Disk Used
No. of Pages 16, DTD = 5.0.1
ARTICLE IN PRESS
M. Di Penta et al. / The Journal of Systems and Software xxx (2004) xxx–xxx
7
lattice gives useful information, it becomes unmanageable when a large number of applications and libraries
must be handled (Anquetil, 2000), as in our case study.
Instead of pruning information on a concept lattice like
Siff and Reps (1999) and Tonella (2001), clustering analysis was performed, similar to Anquetil and Lethbridge
(1998), Mancoridis et al. (1998), and Merlo et al. (1993).
The library refactoring process, as shown in Fig. 3,
consists of the following steps:
502 (1) If a whole, duplicated, object module has been
503
detected inside two or more libraries, then it should
504
be left in only one of these, unless it conflicts with
505
circular dependencies removal (see Section 4.4);
506 (2) If duplicated functions are identified inside different
507
objects, refactoring could be performed by moving
508
them outside their respective objects and by apply509
ing considerations similar to the previous case; and
510 (3) Clone detection may reveal clones outside libraries,
511
since applications may contain duplicated portions
512
of code, in their objects. In some cases, it could be
513
useful to remove such duplicated portions of code
514
and place them into new libraries.
(1) Determine the optimal number of clusters and an
initial solution;
(2) Determine the new candidate libraries using a GA;
and
(3) Ask developers for feedback and, possibly, iterate
through step 2.
PRO
REC
TED
Preliminary to clone refactoring is an impact analysis in terms of introduced dependencies, especially circular dependencies, since clone removal may increase dependencies. As explained in Section 4.4 and as will be shown in Section 4.6, sometimes an object is duplicated to reduce dependencies. In general, it may be preferable to duplicate a few objects rather than introduce a dependence that causes, for a subset of the applications, the linking or loading of one or more additional libraries. Clearly, if the process duplicates a conspicuous number of objects into two or more libraries, these objects can be refactored, as explained in Section 4.6.2, into a new library on which the old libraries will depend.
Overall, clone removal aims to improve software system maintainability, although attention should be paid to avoid deteriorating software system reliability and to reflect the developers' objectives (Cordy, 2003). Clone removal can also contribute to decreasing the overall software system size; again, a tradeoff should be made: sometimes clone refactoring (especially of very small clones) produces a system bigger than the original one.
4.6.1. Determining the optimal number of clusters and a suboptimal solution

As explained in Section 3.2, the optimal number of clusters is determined by inspecting the Silhouette statistics computed on the suboptimal clusters, which are determined using agglomerative-nesting clustering. Given the curve of the average Silhouette values obtained from Eq. (1) for different numbers k of clusters, for some libraries we choose the knee of that curve (Kaufman and Rousseeuw, 1990) as the optimal number of clusters, instead of its maximum, because the maximum is often too high for our refactoring purposes.
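The knee-based choice of k can be sketched as follows. This is a minimal illustration, not part of the SRF tooling: the maximum-distance-to-chord heuristic for locating the knee and the sample Silhouette values are assumptions.

```python
import math

def knee_index(avg_silhouette):
    """Index of the 'knee' of an average-Silhouette curve: the point with
    maximum distance from the chord joining the curve's two endpoints."""
    n = len(avg_silhouette)
    if n < 3:
        return 0
    x1, y0, y1 = float(n - 1), avg_silhouette[0], avg_silhouette[-1]
    norm = math.hypot(x1, y1 - y0)
    best, best_d = 0, -1.0
    for i, y in enumerate(avg_silhouette):
        # Unsigned distance of point (i, y) from the chord through the endpoints.
        d = abs((y1 - y0) * i - x1 * (y - y0)) / norm
        if d > best_d:
            best, best_d = i, d
    return best

# Hypothetical average Silhouette values for k = 2..6: the curve flattens at
# k = 4, well before its maximum, which is what makes the knee preferable.
sil = [0.30, 0.55, 0.68, 0.70, 0.71]
k_opt = 2 + knee_index(sil)   # k values start at 2
```

Here the knee lands at k = 4 even though the curve keeps rising slightly, mirroring the tradeoff against excessive fragmentation discussed above.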
ies may refer to completely different functions, external variables, or data structures. On the other hand, two or more symbols may have different names but correspond to duplicated functions. Therefore, clone detection analysis is helpful for library renovation. In this paper, a metric-based clone detection process (Antoniol et al., 2001b), aimed at detecting duplicated functions, is adopted. The obtained results suggest different possible actions:
4.6. Library refactoring

The last phase of the SRF is devoted to splitting existing, large libraries into smaller clusters of objects. Basically, the idea is similar to that proposed by Antoniol et al. (2001a) to identify libraries. To minimize the average number of libraries required by each program, objects used by a common set of programs should be grouped together. Antoniol et al. (2001a) used a concept lattice to group objects into libraries. Although the
Fig. 3. Activity diagram of the library refactoring process.
We have also incorporated experts' knowledge in the
choice of the optimal number of clusters and we have
considered a tradeoff between excessive fragmentation
produced by too many clusters and excessive library size
produced by fewer clusters. The suboptimal solution for
the chosen value of k is then used as the starting point of
the application of a GA, which is the subsequent framework step.
The effectiveness of the refactoring process is evaluated by a quality measure of the new library organization. Let k be the number of clusters $l_{x_1}, \ldots, l_{x_k}$ obtained from a library $l_x$. The Partitioning Ratio (PR) is defined as

$PR(x) = \frac{100}{m} \sum_{i=1}^{m} \frac{\sum_{j=1}^{k} |l_{x_j}|\, mu_{i,x_j}}{|l_x|\, mu_{i,x}}$    (5)
where $|l_x|$ is the number of objects archived into library $l_x$. The smaller the PR, the more effective the partitioning, since the average number of objects linked or loaded by each application is smaller than when using the whole old library.
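A minimal sketch of the Partitioning Ratio follows, under the assumption (not stated explicitly above) that $mu_{i,x}$ is a 0/1 indicator of whether application i uses library x; all function and variable names are invented for illustration.

```python
def partitioning_ratio(cluster_sizes, uses_cluster, uses_library, old_size):
    """Partitioning Ratio for one refactored library, following Eq. (5) as
    read here: for each application that used the old library, compare the
    objects it now links (sizes of the clusters it uses) with the objects it
    linked before (the whole old library); average over applications, as a
    percentage. uses_cluster[i][j] and uses_library[i] are 0/1 indicators."""
    total, users = 0.0, 0
    for i, used in enumerate(uses_library):
        if not used:
            continue  # application i never linked this library
        users += 1
        linked_now = sum(sz * uses_cluster[i][j]
                         for j, sz in enumerate(cluster_sizes))
        total += linked_now / old_size
    return 100.0 * total / users if users else 0.0

# A 100-object library split into clusters of 60 and 40 objects; two
# applications, each needing only one cluster, now link about half as much.
pr = partitioning_ratio([60, 40], uses_cluster=[[1, 0], [0, 1]],
                        uses_library=[1, 1], old_size=100)
```

With this split, PR is 50%: on average each application loads half of the objects it loaded with the monolithic library.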
(1) The number of inter-library dependencies at a given generation;
(2) The total number of objects linked to each application, which should be as small as possible;
(3) The size of the new libraries; and
(4) The feedback given by the developers.
Overall, the fitness function F is defined in terms of four factors: the Dependency Factor (DF), the Partitioning Ratio (PR) defined by Eq. (5), the Standard Deviation Factor (SDF), and the Feedback Factor (FF). DF is defined as:

$DF(g) = \sum_{i=0}^{m-1} \sum_{j=0}^{|l_x|-1} gm_{i,j} \sum_{k=0}^{m-1} md_{j,k}\,(1 - gm_{i,k})\,[1 - \delta(k,j)]$    (6)
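The counting performed by Eq. (6) can be illustrated with a small sketch; `md` is a hypothetical 0/1 object-dependency matrix and the cluster/object layout below is invented for the example.

```python
def dependency_factor(gm, md):
    """Dependency Factor: for every object j placed in cluster i
    (gm[i][j] == 1), count each object k != j that j depends on
    (md[j][k] == 1) but that cluster i does not contain (gm[i][k] == 0)."""
    n_clusters, n_objects = len(gm), len(gm[0])
    df = 0
    for i in range(n_clusters):
        for j in range(n_objects):
            if not gm[i][j]:
                continue
            for k in range(n_objects):
                if k != j and md[j][k] and not gm[i][k]:
                    df += 1
    return df

# Three objects: 0 depends on 1, and 1 depends on 2. Placing {0, 1} and {2}
# in separate clusters leaves one unresolved inter-cluster dependency (1 -> 2).
gm = [[1, 1, 0],
      [0, 0, 1]]
md = [[0, 1, 0],
      [0, 0, 1],
      [0, 0, 0]]
df = dependency_factor(gm, md)
```

Cloning object 2 into the first cluster (a second "1" in its column) would drive DF to zero, which is exactly the cloning mutation discussed later.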
4.6.2. Refining the solution using genetic algorithms

The solution determined by the previous step presents two main drawbacks:
already stated, instead of randomly generating the initial
population (i.e., the initial libraries), the GA is initialized with the encoding of the set of libraries obtained
in the previous step.
The fitness function has been conceived to balance
four factors:
Of course, as shown by Di Penta et al. (2002), an important step is the conversion of static libraries into dynamically-loadable libraries (DLLs), so that each (possibly small) library is loaded at run-time only when needed and unloaded when no longer useful. However, the DLL approach has a main drawback: loading and unloading libraries may cause a significant performance decrease. Its use should therefore be limited when performance is an essential requirement and, whenever possible, it should be accompanied by dependency minimization.
The genome has been encoded using a bit-matrix encoding. The genome matrix GM for each library to refactor is a matrix of k rows and $|l_x|$ columns, where $gm_{i,j} = 1$ if object j is contained in cluster i, and 0 otherwise. Clearly, the presence of the same object in more than one library is indicated by more than one "1" in the same column (this is not possible using the array genome widely used for graph partitioning problems). As
(1) The number of dependencies between the new libraries may be high. Each time a symbol from a library is needed, another library may also need to be loaded, thereby reducing the advantage of having new, smaller libraries; and
(2) The new libraries may not be meaningful with respect to the developers' intentions, whose feedback has to be incorporated in the refactoring process.
where $\delta(x, y)$ is the well-known Kronecker delta function:

$\delta(x, y) = \begin{cases} 1 & x = y, \\ 0 & x \neq y, \end{cases}$

and $gm_{i,j}$ is the genome encoding, i.e., the GM[i, j] bit-matrix entry. As shown in Eq. (6), DF(g) is incremented each time an object (i.e., a high bit in the genome) depends on another object not contained in the same cluster. SDF can be thought of as the difference between the standard deviation of the initial library sizes and that at the current generation. Without taking SDF into account, the SRGA may attempt to reduce dependencies by grouping a large fraction of the objects in the same library, which may negatively affect the PR. A similar factor was also applied by Talbi and Bessière (1991). Given the arrays of library sizes $S_0$ and $S_g$, respectively for the initial population and for the g-th generation, SDF is

$SDF(g) = |\sigma_{S_0} - \sigma_{S_g}|$    (7)
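Eq. (7) amounts to comparing two (population) standard deviations; a minimal sketch, with invented library sizes:

```python
import math

def std_dev(sizes):
    """Population standard deviation of a list of library sizes."""
    mean = sum(sizes) / len(sizes)
    return math.sqrt(sum((s - mean) ** 2 for s in sizes) / len(sizes))

def sdf(initial_sizes, current_sizes):
    """Standard Deviation Factor of Eq. (7): how far the spread of the
    current library sizes has drifted from the spread of the initial ones."""
    return abs(std_dev(initial_sizes) - std_dev(current_sizes))

# An even 50/50 initial split has spread 0; piling most objects into one
# library is penalized, which is exactly the degenerate case described above.
penalty = sdf([50, 50], [90, 10])
```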
The fourth factor takes into account the developers' feedback. After a first execution of the SRGA without considering FF, developers are asked to provide feedback on the proposed new libraries. Developers' feedback is stored in a bit matrix FM, which has the same structure as the genome matrix and incorporates the changes to the libraries that developers suggested. After this feedback, the SRGA is run again taking into account, this time, the feedback factor FF, based on the difference between the genome and the FM matrix:

$FF = \sum_{i=1}^{k} \sum_{j=1}^{|l_x|} |gm_{i,j} - fm_{i,j}|$    (8)
In other words, the FF counts the number of differences
between the genome and the refactoring proposed by
developers.
The fitness function F is formally defined as
where w1, w2 and w3 are real, positive weighting factors for the PR, SDF, and FF contributions to the overall fitness function. The higher w1, the smaller the overall number of objects linked by applications, at the expense of dependency reduction. Similarly, the higher w2, the more similar the result will be to the starting set of libraries, again at the expense of a satisfactory dependency reduction. After the first preliminary run of the SRGA, which must be performed with w3 = 0, w3 should be properly sized to weight the influence of developers' feedback. As stated in (9), our fitness function is multi-objective (Deb, 1999). Notice that, since we aim to give maximum priority to dependency reduction, the DF weight is set to 1. Subsequently, w1, w2 and w3 are selected using an iterative, trial-and-error procedure, adjusting them each time until the DF, PR, SDF, and FF obtained at the final step are satisfactory. The process is guided by computing each time the average values of DF, PR, SDF, and FF, and by plotting their evolution, to determine the 3D-space region in which the population should evolve.
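The weighted combination of Eq. (9) can be sketched as follows; the factor values and weights below are hypothetical, chosen only to illustrate a first run with w3 = 0.

```python
def fitness(df, pr, sdf_val, ff, w1, w2, w3):
    """Multi-objective fitness of Eq. (9), to be minimized. DF has an
    implicit weight of 1, giving dependency reduction top priority."""
    return df + w1 * pr + w2 * sdf_val + w3 * ff

# Preliminary run: w3 = 0, so developers' feedback does not yet contribute.
# All numbers here are illustrative, not taken from the GRASS experiments.
f0 = fitness(df=26, pr=48.0, sdf_val=4.0, ff=203, w1=0.5, w2=1.0, w3=0.0)
```

In the trial-and-error procedure described above, w1, w2 and w3 would be adjusted between runs until the individual factors reached at the final generation are satisfactory.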
The crossover operator used in this paper is the one-point crossover, which exchanges the content of two genome matrices around the same random column (see Fig. 4a). The mutation operator works in two modes:
(1) with probability pmut, it takes a random column and randomly swaps two bits: this means that, if the two swapped bits differ, an object is moved from one library to another (see Fig. 4b); or
(2) with probability pclone < pmut, it takes a random position in the matrix: if it is zero and the cluster depends on the corresponding object, then the mutation operator clones the object into the current cluster (Fig. 4c).
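The two operators can be sketched on small genome matrices. This is a simplified illustration: the probability handling (pclone < pmut) is omitted, the dependency test is passed in as a callable, and all data values are invented.

```python
def one_point_crossover(a, b, col):
    """One-point crossover: swap all columns from `col` onward between the
    two parent genome matrices, producing two offspring (Fig. 4a)."""
    child1 = [ra[:col] + rb[col:] for ra, rb in zip(a, b)]
    child2 = [rb[:col] + ra[col:] for ra, rb in zip(a, b)]
    return child1, child2

def mutate_move(gm, col, r1, r2):
    """Move mutation (Fig. 4b): swap two bits of one column; if they differ,
    the object in that column migrates between clusters r1 and r2."""
    gm[r1][col], gm[r2][col] = gm[r2][col], gm[r1][col]

def mutate_clone(gm, row, col, depends):
    """Clone mutation (Fig. 4c): if cluster `row` lacks object `col` but
    depends on it, duplicate the object (a second '1' in the column)."""
    if gm[row][col] == 0 and depends(row, col):
        gm[row][col] = 1

a = [[1, 1, 0], [0, 0, 1]]
b = [[0, 1, 1], [1, 0, 0]]
c1, c2 = one_point_crossover(a, b, col=1)
```

Note that, unlike the array genome, the matrix encoding lets `mutate_clone` mark the same object as present in two clusters at once.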
Fig. 4. Genetic operators: (a) crossover, (b) mutation (move an
object), and (c) mutation (clone an object).
$F(g) = DF(g) + w_1\,PR(g) + w_2\,SDF(g) + w_3\,FF(g)$    (9)
Noticeably, cloning an object increases both PR and SDF, and therefore it must be minimized. The SRGA heuristically activates cloning only in the final part of the evolution (after 66% of the generations in our case study). Our strategy favors dependency minimization by moving objects between libraries; at the end, we attempt to remove the remaining dependencies by cloning objects. Obviously, at the end of the refactoring process, cloned objects should be factored out again. For example, if objects $o_a$ and $o_b$ are contained in both $l_i$ and $l_j$, then $o_a$ and $o_b$ should be moved into a third library on which $l_i$ and $l_j$ depend.
Finally, we have introduced the Lock Matrix (LM) as a further, stronger level of developers' feedback. When developers strongly believe that an object should belong to a given cluster, the LM gives them the possibility to enforce such a constraint. The mutation operator does not perform any action that would bring a genome into an inconsistent state with respect to the Lock Matrix.
The population size and the number of generations
are determined by using an iterative procedure, which
doubles both of them each time until the obtained DF,
PR and FF are equal to those obtained at the previous
iterative step.
The SRGA suffers from slow convergence. To improve its performance, it has been hybridized with hill-climbing techniques. In our experience, applying hill climbing only to the last generation significantly improves neither the performance nor the results. On the contrary, applying hill climbing to the best individuals of each generation makes the SRGA converge significantly faster.
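The hybridization can be sketched as a first-improvement hill climber applied to the best individuals of a generation. A toy minimization stands in for the SRGA fitness below; all names are illustrative, not the authors' implementation.

```python
def hill_climb(genome, score, moves, max_steps=100):
    """First-improvement hill climbing: repeatedly apply the first move that
    lowers the score, as applied here to a generation's best individuals."""
    best = score(genome)
    for _ in range(max_steps):
        improved = False
        for move in moves(genome):
            candidate = move(genome)
            s = score(candidate)
            if s < best:
                genome, best, improved = candidate, s, True
                break
        if not improved:
            break
    return genome, best

# Toy search space: minimize the number of 1-bits by flipping single bits.
def bit_flips(g):
    for i in range(len(g)):
        def flip(g, i=i):
            return g[:i] + (1 - g[i],) + g[i + 1:]
        yield flip

g, s = hill_climb((1, 0, 1, 1), score=sum, moves=bit_flips)
```

In the hybrid SRGA the moves would be single-object migrations between clusters and the score would be F(g) of Eq. (9).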
4.7. Identification of new libraries
Due to its evolution, a software system tends to contain objects that, even if used by a common set of applications, are not contained in any library. Their identification and organization into libraries is therefore desirable. The factoring process is quite similar to that described in the previous section. In particular, a MU matrix is built on a subgraph of the use graph obtained by removing all the already existing libraries. Then, a first set of new candidate libraries is built by analyzing the dendrogram and the Silhouette statistics. These libraries are then refined with the aid of the SRGA and of the developers' feedback.
the R statistical environment, and a C++ compiler for the GA library refiner. To collect the programmers' feedback, the SRF relies on a PHP web application (the developers' feedback collector). Since the required infrastructure is available under several operating systems (both Unixes and Windows), the SRF is widely portable.
5. Case study
The SRF works under any standard Unix operating system, or under any operating system which supports the GNU tool set. In particular, the SRF uses the standard Bourne shell (or the new Bash), the Perl interpreter,
6. Case study results
This section presents the results obtained by applying
the SRF, which has been described in Section 4, to
GRASS.
6.1. Handling unused objects
Out of the 921 objects composing GRASS libraries, 89 were not used by any application, nor by other libraries. When refactoring libraries with the SRF, those objects will be moved and organized into a separate cluster, thought of as a sort of repository to be "frozen" for fu-
(1) The application identifier identifies the list of object modules containing the main symbol by using the nm Unix tool;
(2) The graph extractor, also based on the nm tool, produces the System Graph, the Use Graph, and the Dependency Graph. The graph extractor also exports data in .DOT format, to allow visualization and analysis using the Dotty graph visualization tool;3
(3) The unused symbol identifier produces, for each library, the list of the symbols which are not used by any application or library, together with the names of the objects in which those symbols are contained;
(4) The circular dependency identifier produces the list of all circular paths among libraries;
(5) The duplicated symbol identifier identifies the list of duplicated, defined external symbols. It is used in conjunction with the metric-based clone detector (see Antoniol et al., 2001b, for details) and with the dependency graph extractor to minimize the presence of clones inside libraries;
(6) The number of clusters identifier implements the Silhouette statistics. In particular, the implementations available in the cluster package of the R Statistical Environment4 have been used;
(7) The library refactoring tool supports the process of splitting libraries into smaller clusters. Cluster analysis is performed by the Agnes function available in the cluster package of the R Statistical Environment;
(8) The GA library refiner is implemented in C++ using the GAlib;5 and
(9) The developers' feedback collector is a web application that allows developers to post their feedback about the produced libraries on an appropriate web site.
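The nm-based extraction behind tools (1)-(5) can be sketched as follows. Only the object names error.o and alloc.o come from the case study; the symbol names in the sample output are invented.

```python
def parse_nm(output):
    """Split `nm` output for one object module into defined and required
    symbols. Defined: types T/D/B (text/data/bss); required: type U."""
    defined, required = set(), set()
    for line in output.splitlines():
        parts = line.split()
        if len(parts) < 2:
            continue
        sym_type, name = parts[-2], parts[-1]
        if sym_type in ("T", "D", "B"):
            defined.add(name)
        elif sym_type == "U":
            required.add(name)
    return defined, required

def dependency_edges(modules):
    """Dependency-Graph edges: module a depends on module b if a requires a
    symbol that b defines. `modules` maps name -> (defined, required)."""
    edges = set()
    for a, (_, req_a) in modules.items():
        for b, (def_b, _) in modules.items():
            if a != b and req_a & def_b:
                edges.add((a, b))
    return edges

nm_error = "0000000000000010 T error_handler\n                 U malloc"
nm_alloc = "0000000000000000 T safe_alloc\n                 U error_handler"
modules = {"error.o": parse_nm(nm_error), "alloc.o": parse_nm(nm_alloc)}
edges = dependency_edges(modules)
```

Symbols required but defined by no module (here malloc, resolved by the C library) simply produce no edge, which is how system calls drop out of the graphs.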
As mentioned in the introduction, the SRF has been
applied to GRASS, which is a large open source GIS. In
particular, the GRASS CVS development snapshot of
April 5, 2002 6 was used as a case study. Its characteristics are summarized in Table 1.
GRASS modules, which correspond to applications and represent commands, are organized by name, based on their function class, such as display, general, imagery, raster, vector, site, etc. The first letter of a module name refers to a function class and is followed by one dot and one or two other dot-separated words, which describe specific tasks. All GRASS modules are linked with an internal "front.end". If no command-line arguments are entered by a user, the "front.end" module calls the interactive version of a command; otherwise, it starts the command-line version. If only one version of the specific command exists, i.e., if there is only one command-line version available, the command is executed. Code parameters and flags are defined within each module. They are used to ask the user to define map names and other options.
GRASS provides an ANSI C language API with several hundred GIS functions, which are used by GRASS modules to read and write maps, to compute areas and distances for georeferenced data, and to visualize attributes and maps. Details of GRASS programming are covered in the "GRASS 5.0 Programmer's Manual" (Neteler, 2001).
4.8. Tool support

To support the refactoring process, different tools have been conceived:
3 http://www.research.att.com/sw/tools/graphviz/
4 http://www.r-project.org
5 http://lancet.mit.edu/ga/
6 Downloadable from http://grass.itc.it
ture uses. A deeper analysis revealed that some functions contained in unused objects wrap lower-level GRASS functions (such as db_create_index), wrap standard library and system-call functions (such as scan_dbl, scan_int, and whoami), and, in general, provide simple functionalities on top of lower-level functions (such as datetime_is_same, which compares two DateTime structures). An interesting example is the library libdbmi (see also Section 6.4): out of 97 objects, 19 were not used at all. In all cases, the unused functions corresponded to one or more wrapped, lower-level functions that have been directly used by applications.
6.2. Removal of circular dependencies among libraries
6.3. Identification of clones
Clone detection was performed at two different levels of the software system architecture: within libraries and on the whole system. In the first case, clone detection aimed at library renovation; in the second case, the objective was to identify portions of duplicated code that could potentially be re-organized into new libraries. Table 2 reports the results obtained from clone analysis in terms of the total number of analyzed functions, the number of clone clusters (Antoniol et al., 2002) detected, and the number and percentage of cloned functions. Clones were computed while filtering out the shortest functions; for example, two functions that simply return a value are clones by definition, but they are not significant and should not be taken into consideration. Results are reported considering two thresholds of function size: functions longer than five and longer than 10 LOCs.
As shown in Table 2, the overall percentage of clones is not negligible: 26.04% for functions longer than five LOCs, and still 16.38% when only functions longer than 10 LOCs are considered. This suggests a potential for reduction in the number of cloned functions. Clearly, the actual reduction rate depends on the number of false positives, which typically include functions that simply contain a list of calls to other functions (where the number of calls and of parameters match), functions that print different error messages and, in general, any other function that shares the same metrics while being different.
The number of clones contained inside libraries is low, indicating that the developers accurately factored functions and objects to avoid duplicates. Finally, we investigated the set of clones between libraries and objects outside libraries, with a view to possible refactoring. The analysis of clones inside libraries revealed an
Table 2
Results of clone detection

                        Total number    Threshold   Number of        Number of          Percent of
                        of functions    (LOCs)      clone clusters   cloned functions   cloned functions (%)
Overall                 22,229          5           2019             5789               26.04
                                        10          1404             3641               16.38
Within libraries        5271            5           72               180                3.41
                                        10          41               101                1.92
Outside libraries       16,958          5           1817             4974               29.33
                                        10          1290             3268               19.27
Libraries vs. outside   22,229          5           130              635                2.86
                                        10          73               272                1.22
Three cases of circular dependencies among libraries were found. The first dependency was between libstubs.a and libdbmi.a. In particular, we discovered that libstubs.a required one symbol, located inside the error.o module, which belonged to libdbmi.a. On the other hand, libdbmi.a required 27 symbols from libstubs.a. The obvious solution was to move error.o into libstubs.a: this also required moving the module alloc.o into that library, since it depends on error.o.
The second circular dependency was found between libgis.a and libcoorcnv.a. In particular, libgis.a required three symbols. Such symbols were located in the module datum.o from libcoorcnv.a. In the other direction, libcoorcnv.a dependencies involved 13 symbols from libgis.a. Moving datum.o into libgis.a resolved the problem.
Finally, circular dependencies were found between libvect.a and libdig2.a. They involved 13 symbols in one direction and 31 symbols in the other direction. Symbols involved in the dependencies were located
in several different objects. The links present in the
dependency graph excluded the possibility of resolving
circular dependencies between libvect.a and libdig2.a by simply moving or duplicating objects. The
decision taken together with GRASS developers was initially to merge the two libraries which, in effect, have
been designed to work together, and then try to refactor
the new library (see Section 6.4).
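Finding such circular paths on the library dependency graph is a simple depth-first search; a sketch follows, where the graph below encodes only the first reported case (libstubs.a and libdbmi.a requiring symbols from each other).

```python
def find_cycles(deps):
    """Find circular dependency paths in a library dependency graph given
    as {library: set of libraries it requires symbols from}."""
    cycles = []

    def dfs(node, path):
        if node in path:
            # Close the loop from the first occurrence of `node`.
            cycles.append(path[path.index(node):] + [node])
            return
        for nxt in sorted(deps.get(node, ())):
            dfs(nxt, path + [node])

    for lib in sorted(deps):
        dfs(lib, [])
    return cycles

deps = {"libstubs.a": {"libdbmi.a"}, "libdbmi.a": {"libstubs.a"}}
cycles = find_cycles(deps)
```

Each two-library cycle is reported once per starting point; deduplicating rotations is left out of this sketch for brevity.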
Table 1
GRASS key characteristics

Pre-existing libraries    43
Library objects           921
Applications              517
C source files            7107
C KLOC                    1014
by following the process described in Section 4.6 and depicted in Fig. 3. As suggested by developers, libproj was not refactored, because it was under development by a different team. As explained in Section 6.2, the libvect-new library was obtained by merging libvect.a and libdig2.a.
Silhouette statistics were used to determine the optimal number of clusters for each library. The values of such statistics are plotted in Fig. 6 for different numbers of clusters. We decided to split libgis into four clusters (instead of the six proposed in Di Penta et al., 2002), and to divide libvect-new and libdbmi into three clusters each. It is worth noting that, for libgis, the number of clusters was chosen in correspondence with the Silhouette maximum; for the other two libraries, a
Cloned functions contained in these two libraries were removed from libimage_sup, libgmath and libtrans.
Several "interesting" clones were also found outside libraries. In particular, the r.mapcalc3 application contains four clusters of cloned functions, spanning from 27 to 59 LOCs in size. These clusters contain mathematical functions, cloned to handle different data types. In this case, refactoring is clearly possible by generalizing the operations and by abstracting types.
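The kind of refactoring meant here, one generic routine replacing per-type clones, can be illustrated in miniature. Python stands in for the C code of r.mapcalc3 (where this would correspond to parameterized functions or macros over an abstract element type), and all function names are invented.

```python
# Before refactoring: one clone per data type, differing only in the cast.
def add_rows_int(a, b):
    return [int(x) + int(y) for x, y in zip(a, b)]

def add_rows_float(a, b):
    return [float(x) + float(y) for x, y in zip(a, b)]

# After refactoring: a single generic function with the type abstracted out.
def add_rows(a, b, cast):
    return [cast(x) + cast(y) for x, y in zip(a, b)]

# The generic version reproduces both clones.
same_int = add_rows([1, 2], [3, 4], int) == add_rows_int([1, 2], [3, 4])
same_float = add_rows([1, 2], [3, 4], float) == add_rows_float([1, 2], [3, 4])
```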
Finally, we analyzed clones between applications and libraries. In most cases, clones were revealed to be part of legacy applications developed before the corresponding functions were added to a library; unfortunately, the applications were never changed afterwards. A relevant fraction (about 20%) of these clones was discovered in the contrib subsystems, which had often been developed by third parties and therefore were not always properly aligned with the rest of the system.
6.4. Library refactoring
(1) A library (libmatrix) to handle matrices; and
(2) A library (libcamera) to handle photogrammetric computations for aerial cameras.
Table 3
GRASS largest libraries

Library        Objects
libgis         184
libdbmi        97
libproj        119
libvect-new    54
interesting situation: 16 functions from the library libortho were cloned across libimage_sup, libgmath and libtrans. Nine of the cloned functions were devoted to performing matrix algebra. By analyzing the Dependency Graph of libortho (see Fig. 5), a subgraph composed of such functions was identified; it is depicted in the box on the right. On the other hand, seven of the functions in the box on the left were cloned in libimage_sup. In particular, the entire structure enclosed in the rounded dashed box was replicated in that library. libortho was therefore split in two libraries, shown in the two boxes in Fig. 5:
Fig. 6. Silhouette statistics for different numbers of clusters.
Refactoring was performed on libraries which were composed of a large number of objects (see Table 3),
Fig. 5. Splitting library libortho.
compromise was accepted between maximizing the Silhouette and avoiding excessive fragmentation.
Subsequently, a preliminary clustering was performed and refined by an initial execution of the SRGA, performed without considering any developers' feedback and by setting w3 = 0. Table 4 reports, for each library:
As shown, the SRGA reduced libgis dependencies from 579 to 26, while keeping PR almost constant (from 51% to 48%). A significant reduction of inter-library dependencies was also obtained for the other libraries (from 237 to 4 for libdbmi and from 66 to 3 for libvect), while slightly reducing PR, except for libdbmi, where it increased to 46% and was thus worse than in the preliminary solution.
The first refactored architecture of the candidate libraries was submitted to GRASS developers to seek their feedback. For libgis, manual analysis indicated that the first cluster should contain "utility" and "allocation" functions, the second "area" and "geodesic" functions, the third "color-related" functions, and the fourth "raster" functions. For libvect-new, developers indicated that the first cluster should contain basic file-system operations and the other two clusters should include all other functions without any further distinction. The feedback for libdbmi was quite different with respect to the other two libraries. In this case, developers confirmed that the solution suggested by the hierarchical clustering performed before applying the SRGA reflected their own conception of the libraries. A manual graph analysis via the Dotty graph visualization tool agreed, too. In fact, as also reported by Di Penta et al. (2002), the library was split into the three following clusters:
Fig. 7. New libdbmi layering structure.
• libdbmi-3 contains 48 objects, which are only internally used by libdbmi, and represents some sort of "low-level" library.
Fig. 7 reports the layering structure of the clusters extracted from libdbmi. To avoid circular dependencies, one object was moved from libdbmi-3 to libdbmi-1. Clearly, when refactoring a large software system such as GRASS, a compromise should be accepted between having small and decoupled clusters, like those generated by applying the SRGA, and having clusters that are not totally decoupled but are conceptually cohesive, since they contain functions which implement closely-related tasks. In the latter case, memory optimization is possible by adopting, as noted, dynamically loadable libraries (at the expense, however, of performance, as explained in Section 4.6.2). We decided to leave the libdbmi clusters as they were after hierarchical clustering and to perform a "second iteration" of the SRGA refactoring on libgis and libvect-new, this time also taking into consideration the Feedback Factor FF. For the sake of completeness, we also report results for libdbmi. By varying the w1, w2 and w3 weights, we obtained different results. As shown in Table 5, it was never possible to achieve a complete cluster decoupling and, at the same time, to obtain libraries which were very close to the structure proposed by developers.
In Table 5, the comparison of the first three columns with the last three highlights that, after the first SRGA iteration, the coupling between clusters remained low. On the other hand, as highlighted by the high FF value before the second iteration, the identified libraries tend to have a structure which differs somewhat from the developers' intention. The second iteration of the SRGA tried to decrease FF, while, unfortunately, coupling increased. At such a stage, in the authors' opinion, developers may decide either to produce meaningful libraries
• The number of objects composing the library;
• The number of candidate libraries the original library is refactored into, and the corresponding Silhouette statistics value;
• The number of inter-library dependencies and PR before applying the SRGA; and
• The number of inter-library dependencies and PR after applying the SRGA.
• libdbmi-1 contains the 19 unused objects;
• libdbmi-2 contains the 30 objects which are directly used by applications; and
Table 4
Results of the library refactoring process before considering feedback (w3 = 0)

                                                      Before GA          After GA
Library    Number of   Candidate       Silhouette     DF      PR (%)     DF     PR (%)
           objects     libraries (k)   statistics
libgis     184         4               0.70           579     51         26     48
libdbmi    97          3               0.78           237     35         4      46
libvect    54          3               0.57           66      46         3      40
Table 5
Results of the second round of the library refactoring process (w3 ≠ 0)

                                       Before second round         After second round
Library    Number of   Candidate       FF     DF     PR (%)        FF     DF     PR (%)
           objects     libraries (k)
libgis     184         4               203    26     48            128    60     52
libdbmi    97          3               97     4      46            23     43     39
libvect    54          3               72     3      40            30     6      52
Table 6
Performance comparison between pure GA and hybrid GA with hill climbing

           Pure GA                    Hybrid GA with hill climbing   Fitness           Time
           Fitness     Time (s)       Fitness     Time (s)           difference (%)    difference (%)
libgis     3119        9113           3239        4524               1                 49
libdbmi    77          509            83          190                7                 37
libvect    195         96             198         41                 3                 43
6.5. Extraction of new libraries
To identify new candidate libraries, the final step of the SRF is devoted to the analysis of the Use Graph obtained by subtracting the already existing libraries. Sometimes there are groups of objects used by a common set of applications that have not yet been organized into libraries. Clustering was performed on objects used by at least two applications.
The results revealed the presence of four clusters, all located in the orthophoto subsystem. The number of dependencies between clusters was low, and it was possible to resolve them by simply moving a couple of objects between clusters. Besides, all clusters had a considerable
UNC
1090
1091
1092
1093
1094
1095
1096
1097
1098
1099
1100
1101
Fitness difference (%)
Time difference (%)
1
7
3
49
37
43
number of dependencies to external objects which belonged to the same set of applications. To eliminate these
dependencies, it would have been necessary to increase the
size of each cluster by 100%, clearly in contradiction with
respect to the intended objective of reducing applicationsÕ
memory requirements and size. Consequently, it was
decided not to cluster these objects into libraries. In the
authorsÕ opinion, this is not a negative result, but it constitutes a quality indicator of the system showing that developers had carefully created and maintained libraries.
1102
1103
1104
1105
1106
1107
1108
1109
1110
1111
7. Conclusions
1112
This paper has presented a framework for software
system renovation (SRF) and the results of its application to GRASS Geographical Information System,
which is over one million LOCs in size.
The SRF has allowed us to remove several structural
problems from GRASS. In particular, unused objects
were identified and factored out; clones were identified
and, especially for those inside libraries, refactoring
was performed. The SRF incorporates a novel library
refactoring process, in which a suboptimal solution is
first identified by hierarchical clustering and then refined
by the SRGA. The proposed SRGA fitness function
takes into account different factors: minimizing the
number of dependencies, the average number of objects
linked by each application, and the feedback of developers. Although the approach has been applied on C and
C++ systems (GRASS and others reported by Antoniol
et al., 2003), it is not tied to any specific programming
language, provided that object modules, which contain
a list of defined and required symbols, be available.
However, for applications to be executed on a virtual
machine, such as Java, Smalltalk programs, other approaches such as those of Tip et al. (1999) and Rayside
and Kontogiannis (2002) may be preferable.
1113
1114
1115
1116
1117
1118
1119
1120
1121
1122
1123
1124
1125
1126
1127
1128
1129
1130
1131
1132
1133
1134
1135
1136
TED
and to reduce the memory requirements using dynamically-loadable libraries, or to obtain independent clusters, which may not always conceptually group objects
as related as expected. Although it is counterintuitive,
the latter result is not surprising, since experts classified
functions according to the intended purpose or semantic. This seldom ensure high cohesion and low coupling,
because the improvement of the latter attributes produces a final partitioning which somehow differs from
what it was expected.
The addition of hill climbing into the SRGA did not
improve the fitness function, since the SRGA also converged to similar results, when it was executed on an increased number of generations and increased population
size. Noticeably, performing hill climbing on the best
individuals of each generation produced a drastic reduction of convergence times. Comparing both strategies
when the difference between values of the fitness function was below 10% highlighted that a hybrid strategy
allowed on average to reduce the execution time of
43%. Convergence times for a Compaq ProliantTM with
Dual XeonTM 900 MHz processor, 2 MB Cache and
4 GB of RAM are reported in Table 6.
OR
1066
1067
1068
1069
1070
1071
1072
1073
1074
1075
1076
1077
1078
1079
1080
1081
1082
1083
1084
1085
1086
1087
1088
Hybrid GA
Fitness function
PRO
Library
OF
Library
JSS 7667
15 November 2004 Disk Used
No. of Pages 16, DTD = 5.0.1
ARTICLE IN PRESS
M. Di Penta et al. / The Journal of Systems and Software xxx (2004) xxx–xxx
Overall, the SRF helps to monitor and improve the quality of a software system, which inevitably tends to deteriorate during evolution. Unused objects, clones, library coupling, library sizes, and poor object organization are in fact significant quality indicators. For instance, the absence of new libraries identified by the SRF in GRASS indicates a careful design and a controlled evolution. Moreover, the SRF also addresses the miniaturization problem, which is relevant for porting applications to limited-resource devices. The SRF has allowed us to reduce GRASS memory requirements and to improve its performance: the average number of library objects linked by each application was indeed reduced by about 50%. At the time of writing, GRASS has successfully been ported to a PDA (i.e., a Compaq iPAQ). Given the size of the application and the available resources, a brute-force automatic approach would not be feasible, since developers' suggestions were an essential component of the miniaturization process.

Clone detection performed on GRASS revealed that the cloning level outside libraries was not negligible and suggested further clone refactoring. Conversely, the cloning level inside libraries was in general low, except for the mentioned cases. The cloning between libraries and the rest of the system was in most cases due to third-party applications. Most of the system reorganization work described in the paper was incorporated in the subsequent releases of GRASS by removing unused objects and some clones, and by reorganizing some libraries. The latter reorganization, as pointed out in the paper, was carried out with minor modifications with respect to the result of the SRF.

Our in-progress work is devoted to investigating the feasibility of integrating other sources of knowledge into the SRF, with special regard to dynamic information and in-field user profiles (Antoniol and Di Penta, 2003), obtained by instrumenting the source code.

Acknowledgments

We are grateful to the GRASS development team for the support, the information provided, and the feedback on the refactored artifacts. Giuliano Antoniol and Massimiliano Di Penta were partially supported by the ASI grant I/R/091/00. Markus Neteler was partially supported by the FUR-PAT Project WEBFAQ. Ettore Merlo was partially supported by the Natural Sciences and Engineering Research Council of Canada (NSERC).

References

Anderberg, M.R., 1973. Cluster Analysis for Applications. Academic Press Inc.
Anquetil, N., 2000. A comparison of graphs of concept for reverse engineering. In: Proceedings of the IEEE International Workshop on Program Comprehension. IEEE Computer Society Press, Los Alamitos, CA, USA, pp. 231–240.
Anquetil, N., Lethbridge, T., 1998. Extracting concepts from file names; a new file clustering criterion. In: Proceedings of the International Conference on Software Engineering. IEEE Computer Society Press, Los Alamitos, CA, USA, pp. 84–93.
Antoniol, G., Di Penta, M., 2003. Library miniaturization using static and dynamic information. In: Proceedings of IEEE International Conference on Software Maintenance, Amsterdam, The Netherlands. pp. 235–244.
Antoniol, G., Casazza, G., Di Penta, M., Merlo, E., 2001a. A method to re-organize legacy systems via concept analysis. In: Proceedings of the IEEE International Workshop on Program Comprehension, Toronto, ON, Canada. IEEE Computer Society Press, Los Alamitos, CA, USA, pp. 281–290.
Antoniol, G., Casazza, G., Di Penta, M., Merlo, E., 2001b. Modeling clones evolution through time series. In: Proceedings of IEEE International Conference on Software Maintenance. pp. 273–280.
Antoniol, G., Villano, U., Merlo, E., Di Penta, M., 2002. Analyzing cloning evolution in the Linux Kernel. In: SCAM 2002 Special Issue, Information and Software Technology 44, 755–765.
Antoniol, G., Di Penta, M., Neteler, M., 2003. Moving to smaller libraries via clustering and genetic algorithms. In: European Conference on Software Maintenance and Reengineering, Benevento, Italy. IEEE Computer Society Press, Los Alamitos, CA, USA, pp. 307–316.
Bui, T.N., Moon, B.R., 1996. Genetic algorithm and graph partitioning. IEEE Transactions on Computers 45 (7), 841–855.
Cordy, J., 2003. Comprehending reality—practical barriers to industrial adoption of software maintenance automation. In: Proceedings of the IEEE International Workshop on Program Comprehension, Portland, OR, USA. pp. 196–205.
Deb, K., 1999. Multi-objective genetic algorithms: problem difficulties and construction of test problems. Evolutionary Computation 7 (3), 205–230.
Di Penta, M., Neteler, M., Antoniol, G., Merlo, E., 2002. Knowledge-based library re-factoring for an open source project. In: Proceedings of IEEE Working Conference on Reverse Engineering, Richmond, VA. IEEE Computer Society Press, Los Alamitos, CA, USA, pp. 128–137.
Doval, D., Mancoridis, S., Mitchell, B., 1999. Automatic clustering of software systems using a genetic algorithm. In: Software Technology and Engineering Practice (STEP), Pittsburgh, PA. pp. 73–91.
Garey, M., Johnson, D., 1979. Computers and Intractability: A Guide to the Theory of NP-Completeness. W.H. Freeman.
Goldberg, D.E., 1989. Genetic Algorithms in Search, Optimization and Machine Learning. Addison-Wesley Pub. Co.
Gordon, A., 1988. Classification, 2nd ed. Chapman and Hall, London.
Harman, M., Hierons, R., Proctor, M., 2002. A new representation and crossover operator for search-based optimization of software modularization. In: AAAI Genetic and Evolutionary Computation Conference (GECCO). Springer-Verlag, New York, USA, pp. 82–87.
Kaufman, L., Rousseeuw, P., 1990. Finding Groups in Data: An Introduction to Cluster Analysis. Wiley Interscience, New York.
Krone, M., Snelting, G., 1994. On the inference of configuration structures from source code. In: Proceedings of the 16th International Conference on Software Engineering, Sorrento, Italy. IEEE Computer Society Press, Los Alamitos, CA, USA, pp. 49–57.
Kuipers, T., Moonen, L., 2000. Types and concept analysis for legacy systems. In: Proceedings of the IEEE International Workshop on Program Comprehension. IEEE Computer Society Press, Los Alamitos, CA, USA, pp. 221–230.
Massimiliano Di Penta received his laurea degree in Computer Engineering in 1999 and his PhD in Computer Science Engineering in 2003 at the University of Sannio in Benevento, Italy. Currently he is with RCOST—Research Centre On Software Technology at the same University. His main research interests include software maintenance, software quality, reverse engineering, program comprehension and search-based software engineering. He is the author of about 30 papers which appeared in international journals, conferences and workshops. He serves on the program and organizing committees of workshops and conferences in the software maintenance field, such as the International Conference on Software Maintenance, the International Workshop on Program Comprehension, and the Workshop on Source Code Analysis and Manipulation.

Markus Neteler received his M.Sc. degree in Physical Geography and Landscape Ecology from the University of Hanover, Germany, in 1999. He worked at the Institute of Geography as a Research Scientist and teaching associate for two years. Since 2001 he is a researcher at ITC-irst (Centre for Scientific and Technological Research), Trento, Italy. His main research interests are remote sensing for environmental risk assessment and Free Software GIS development. He is the author of two books on the Open Source Geographical Information System GRASS and of various papers on applications of GIS.

Giuliano Antoniol received his doctoral degree in Electronic Engineering from the University of Padua in 1982. He worked at Irst for 10 years, where he led the Irst Program Understanding and Reverse Engineering (PURE) project team. Giuliano Antoniol has published more than 60 papers in journals and international conferences. He served as a member of the Program Committee of international conferences and workshops such as the International Conference on Software Maintenance, the International Workshop on Program Comprehension, and the International Symposium on Software Metrics. He is presently a member of the Editorial Boards of the journals Software Testing, Verification & Reliability, Information and Software Technology, Empirical Software Engineering, and the Journal of Software Quality. He is currently Associate Professor at the University of Sannio, Faculty of Engineering, where he works in the area of software metrics, process modeling, software evolution and maintenance.

Ettore Merlo received his Ph.D. in computer science from McGill University (Montreal) in 1989 and his Laurea degree—summa cum laude—from the University of Turin (Italy) in 1983. He was the lead researcher of the software engineering group at the Computer Research Institute of Montreal (CRIM) until 1993, when he joined École Polytechnique de Montréal, where he is currently an associate professor. His research interests are in software analysis, software reengineering, user interfaces, software maintenance, artificial intelligence and bio-informatics. He has collaborated with several industries and research centers, in particular on software reengineering, clone detection, software quality assessment, software evolution analysis, testing, architectural reverse engineering and bio-informatics.
Kuipers, T., van Deursen, A., 1999. Identifying objects using cluster and concept analysis. In: Proceedings of the International Conference on Software Engineering. IEEE Computer Society Press, Los Alamitos, CA, USA, pp. 246–255.
Lehman, M.M., Belady, L.A., 1985. Software Evolution—Processes of Software Change. Academic Press, London.
Mahdavi, K., Harman, M., Hierons, R.M., 2003. A multiple hill climbing approach to software module clustering. In: Proceedings of IEEE International Conference on Software Maintenance, Amsterdam, The Netherlands. pp. 315–324.
Maini, H., Mehrotra, K., Mohan, C., Ranka, S., 1994. Knowledge-based nonuniform crossover. In: IEEE World Congress on Computational Intelligence. IEEE Computer Society Press, Los Alamitos, CA, USA, pp. 22–27.
Mancoridis, S., Mitchell, B.S., Rorres, C., Chen, Y., Gansner, E.R., 1998. Using automatic clustering to produce high-level system organizations of source code. In: Proceedings of the IEEE International Workshop on Program Comprehension. IEEE Computer Society Press, Los Alamitos, CA, USA.
Merlo, E., McAdam, I., De Mori, R., 1993. Source code informal information analysis using connectionist models. In: Proceedings of the International Joint Conference on Artificial Intelligence. IEEE Computer Society Press, Los Alamitos, CA, USA, pp. 1339–1344.
Mitchell, M., 1996. An Introduction to Genetic Algorithms. MIT Press, Cambridge, MA, USA.
Neteler, M. (Ed.), 2001. GRASS 5.0 Programmer's Manual. Geographic Resources Analysis Support System. ITC-irst, Italy. Available from: <http://grass.itc.it/grassdevel.html>.
Neteler, M., Mitasova, H., 2002. Open Source GIS: A GRASS GIS Approach. Kluwer Academic Publishers, Boston, USA; Dordrecht, Holland; London, UK.
Oommen, B., de St. Croix, E., 1996. Graph partitioning using learning automata. IEEE Transactions on Computers 45 (2), 195–208.
Rayside, D., Kontogiannis, K., 2002. Extracting Java library subsets for deployment on embedded systems. Science of Computer Programming 45 (2–3), 245–270.
Shazely, S., Baraka, H., Abdel-Wahab, A., 1998. Solving graph partitioning problem using genetic algorithms. In: Midwest Symposium on Circuits and Systems. IEEE Computer Society Press, Los Alamitos, CA, USA, pp. 302–305.
Siff, M., Reps, T., 1999. Identifying modules via concept analysis. IEEE Transactions on Software Engineering 25, 749–768.
Snelting, G., 2000. Software reengineering based on concept lattices. In: Proceedings of IEEE International Conference on Software Maintenance. IEEE Computer Society Press, Los Alamitos, CA, USA, pp. 3–10.
Talbi, E., Bessière, P., 1991. A parallel genetic algorithm for the graph partitioning problem. In: ACM International Conference on Supercomputing, Cologne, Germany. ACM Press, New York, USA.
Tip, F., Laffra, C., Sweeney, P.F., Streeter, D., 1999. Practical experience with an application extractor for Java. ACM SIGPLAN Notices 34 (10), 292–305.
Tonella, P., 2001. Concept analysis for module restructuring. IEEE Transactions on Software Engineering 27 (4), 351–363.
Tzerpos, V., Holt, R.C., 1998. Software botryology: automatic clustering of software systems. In: DEXA Workshop. IEEE Computer Society Press, Los Alamitos, CA, USA, pp. 811–818.
Tzerpos, V., Holt, R.C., 1999. MoJo: a distance metric for software clusterings. In: Proceedings of IEEE Working Conference on Reverse Engineering. IEEE Computer Society Press, Los Alamitos, CA, USA, pp. 187–195.
Tzerpos, V., Holt, R.C., 2000a. ACDC: an algorithm for comprehension-driven clustering. In: Proceedings of IEEE Working Conference on Reverse Engineering. IEEE Computer Society Press, Los Alamitos, CA, USA, pp. 258–267.
Tzerpos, V., Holt, R.C., 2000b. The stability of software clustering algorithms. In: Proceedings of the IEEE International Workshop on Program Comprehension. IEEE Computer Society Press, Los Alamitos, CA, USA.
Wiggerts, T.A., 1997. Using clustering algorithms in legacy systems remodularization. In: Proceedings of IEEE Working Conference on Reverse Engineering. IEEE Computer Society Press, Los Alamitos, CA, USA.