
A Method Based on Naming Similarity to Identify Reuse Opportunities

Authors omitted due to blind review¹

¹ Affiliation omitted due to blind review

Abstract. Software reuse is a development strategy in which existing software components are used to implement new software systems. There are many advantages of applying software reuse, such as minimization of development efforts and improvement of software quality. Few methods have been proposed
in the literature for recommendation of reuse opportunities. In this paper, we
propose a method for identification and recommendation of reuse opportunities
based on the similarity of the names of classes. Our method, called JReuse,
computes a similarity function to identify similarly named classes from a set
of software systems from a specific domain. The identified classes compose a
repository with reuse opportunities. We also present a prototype tool to support
the proposed method. We applied our method, through the tool, to 72 software
systems mined from GitHub, in 4 different domains: accounting, restaurant,
hospital, and e-commerce. In total, these systems have 1,567,337 lines of code,
57,017 methods, and 12,598 classes. As a result, we observe that JReuse is able
to identify the main classes that are frequent in each selected domain.

1. Introduction
Software reuse is a development strategy in which existing software components, called
reusable assets, are used to implement new software systems [Krueger 1992]. It has
been studied and pointed out as an alternative to traditional development, aiming to increase software quality and decrease development effort by using previously developed,
and sometimes already tested, software components [Mohagheghi and Conradi 2007,
Mohagheghi et al. 2004, Morisio et al. 2002, Ravichandran and Rothenberger 2003].
The extraction of reusable assets is essential to support software reuse by building repositories of reuse opportunities [Guo and Luqi 2000]. Extraction methods may be used in different contexts related to software reuse, including, for instance, the support of feature extraction for a software product line [Lee et al. 2004].
Many methods have been proposed in the literature to support the extraction of reuse
opportunities from software systems [Caldiera and Basili 1991, Kawaguchi et al. 2004,
Kuhn et al. 2007, Maarek et al. 1991, Ye and Fischer 2005].
Existing methods use different approaches to identify reuse opportunities, such as natural-language processing [Maarek et al. 1991], formal specifications [Caldiera and Basili 1991], machine learning [Kawaguchi et al. 2004], and other Information Retrieval (IR) approaches [Kuhn et al. 2007, Ye and Fischer 2005]. However,
to the best of our knowledge, no existing method extracts reuse opportunities and recommends reuse based on the most frequent source code elements, such
as classes, in systems of the same domain.

This paper is an extension of previous work (citation omitted due to blind review)
that proposes a method for extraction of reuse opportunities called JReuse. Considering
a set of software systems, JReuse aims to identify classes with similar names through a
similarity analysis across different systems. We are then able to identify classes that may be recommended as reuse opportunities. That work also presents a prototype tool
that applies the proposed method and evaluates the method with 38 e-commerce
software systems.
In addition to our earlier contributions, we conduct an evaluation of our method
through an experiment with 72 Java systems from four different domains: accounting,
restaurant, hospital, and e-commerce. All systems were mined from GitHub. As a result,
we observe that our method is able to identify reuse opportunities using naming similarity
analysis. That is, JReuse can provide meaningful classes for the analyzed domains, and
these classes may be indicated as reuse opportunities to developers of new systems from
the respective domain.
The remainder of this paper is organized as follows. Section 2 presents background to support the study comprehension, in addition to related work. Section 3 proposes a method for reuse opportunities extraction and a prototype tool that supports the
proposed method. Section 4 presents an evaluation of the method. Section 5 describes the
results obtained through the evaluation and discusses lessons learned. Section 6 presents
threats to the study validity. Finally, Section 7 concludes the paper with a discussion and
suggestions for future work.

2. Background and Related Work


This section presents background information to support the comprehension of this study,
and also a discussion of related work. Section 2.1 overviews software reuse and its supporting techniques. Section 2.2 discusses related work that proposes methods for identification of reuse opportunities from software systems.
2.1. Software Reuse
In software reuse, previously implemented software components are used to support the
development of new software systems [Krueger 1992]. The main goal of reuse is the
improvement of software quality aspects followed by an increase of the development
efficiency [Ravichandran and Rothenberger 2003]. There are many approaches to support
reuse in software development. As an example, Krueger (1992) presents an extensive
study regarding definitions, approaches, and application of software reuse.
Software reuse approaches may be categorized as ad hoc or systematic [Mohagheghi and Conradi 2007]. In the ad hoc approach, software reuse is applied in
an opportunistic way, without planning. An example of ad hoc reuse is the use of random
software code snippets from the Web [Sojer and Henkel 2011]. In turn, systematic reuse
follows specific protocols and processes to provide the use of existing software components when developing new systems [Mohagheghi and Conradi 2007]. Moreover, reuse
opportunities may be identified in two ways: forward identification, in which software
reuse is planned before the development of software systems; and reverse identification, in
which reuse opportunities are identified from a set of existing systems [Wang et al. 2005].

Some studies investigate advantages and drawbacks of systematic software
reuse [Mohagheghi and Conradi 2007, Mohagheghi et al. 2004]. Mohagheghi et al.
(2004) study the impacts of reuse on software quality through an empirical study on
large-scale system components. They conclude that reuse contributes positively to software quality, since it provides software components with lower defect density and higher
stability when compared with non-reused components. Mohagheghi and Conradi (2007)
provide a literature review on the impact of software reuse in the industrial development
context. They identified a decrease in flaws, a reduction of development effort, and increased productivity as the main advantages provided by software reuse.
Many strategies are proposed in the literature to identify reuse opportunities, such as: natural-language processing based on lexical inspection of source
code elements [Maarek et al. 1991]; formal specifications by the analysis of software models and metrics, for instance [Caldiera and Basili 1991]; architectural
style [Monroe and Garlan 1996], supported by the analysis of high-level component interaction, generally applied to software design and modeling; and machine learning that
gathers different types of analysis, such as semantic categorization of software components [Kawaguchi et al. 2004].
2.2. Identification of Reuse Opportunities
Previous work investigates the identification of reuse opportunities from software systems [Inoue et al. 2005, Koziolek et al. 2013, Li et al. 2005, Mende et al. 2009,
Michail and Notkin 1999, Oliveira et al. 2007, Ye and Fischer 2005]. Inoue et al. (2005)
propose a graph-based technique to support the extraction of frequently used components
in a given software component repository. The proposed technique relies on ranking components based on their usage by other components from the repository. The authors also
present a supporting tool called SPARS-J, for analysis of Java classes.
Koziolek et al. (2013) present a technique for identification of reuse opportunities based on domain analysis. The proposed technique aims to support the assessment
of potential SPL implementation by organizations. This technique encompasses feature
modeling of the domain, comparison of systems at the architectural level, and the extraction
of reusable components. Li et al. (2005) present an approach for identification of reusable
components from legacy systems. The proposed approach aims to support reengineering
tasks; that is, the implementation of new systems based on existing source code. For
this purpose, the authors propose the generation of the Abstract Syntax Tree (AST) for
analysis and extraction of modules and components as candidates for reuse.
Mende et al. (2009) propose a tool to support software evolution and maintenance. For this purpose, the tool identifies similar methods in the source code and
recommends merging these methods to the developer. The proposed tool computes
code clones at the method level and uses the Levenshtein distance for textual comparison of
methods. Michail and Notkin (1999) propose CodeWeb, a tool to support the comparison
of software libraries in terms of components classes and methods provided by these
libraries. For this purpose, the tool performs naming similarity computation to identify
similar classes and methods from a set of libraries.
Oliveira et al. (2007) propose a method and a supporting tool for recommendation
of reusable software components. The tool applies a technique called Automatic Identification of Software Components (AISC) to identify candidate components for reuse.
The tool, called Digital Assets Discoverer, performs static code analysis for identification of reuse opportunities. The tool also provides an interactive graphic interface and
exports feature using a metadata representation model. Ye and Fischer (2005) present
CodeBroker, a tool to support runtime identification of reusable software components.
The proposed tool relies on information retrieval techniques. CodeBroker is based on
search engines and Javadoc artifacts for code analysis.
In this paper, we propose a method, and supporting tool, to identify candidates for
reuse in software systems from a specific domain. For this purpose, we apply lexical
code analysis. Unlike related work, our method can be used in two scenarios. First, to
support the identification of reuse opportunities in software systems. Second, to guide
users regarding the partial design of software systems to be developed, by recommending
the most frequent entities that may compose the new system. Our method also ranks
software entities identified as reuse opportunities by the frequency with which they appear in
different systems from the same domain. We expect to support reuse by suggesting classes
that are the most used in systems from a specific domain.

3. Proposed Method
This section explains in detail the proposed method for identification of reuse opportunities. Section 3.1 describes the similarity-based process applied by our method to identify
reuse opportunities. Section 3.2 proposes our method and its steps. Finally, Section 3.3
presents a tool that implements our method.
3.1. Identifying Similarity
Previous work investigates the use of textual similarity in the context of source code analysis [Tian et al. 2014, Zhen et al. 2008]. There are many applications for similarity analysis in software systems, such as comparison of dialects, spell check, and plagiarism
detection [Liu and Lu 2008]. In this context, we propose JReuse, a method that relies on
similarity analysis of source code as a static technique to identify reuse opportunities.
We conducted an ad hoc literature review in order to select algorithms that
compute similarity between strings to be used by our method. For this purpose, we
searched for the most popular similarity computation algorithms to find the one that
fits our study purpose. After the literature review, we selected the Levenshtein algorithm [Yujian and Bo 2007]. This algorithm is the similarity function used by our method
to compute lexical similarity between names of classes from different systems. In short,
given two strings A and B, the algorithm computes the minimum number of single-character changes (insertions, deletions, or substitutions) required
to turn A into B.
To identify similarly named classes, we adopted 75% as the threshold for the minimum
similarity between two names of entities. This threshold, derived empirically by the authors, was chosen because some well-known naming conventions for classes may lead to
similarly named entities that clearly serve different purposes. As an example, a similarity of 72% is obtained for the names Costumer and CostumerDAO, observed by the
authors as frequent names of classes in e-commerce systems. However, we intuitively expect that two classes with these names implement different functions, since DAO classes
implement database persistence.
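As an illustration only, the similarity function and the threshold check can be sketched in Python as follows. The text does not state how the edit distance is converted into a percentage; the sketch assumes the rate is 1 minus the distance divided by the length of the longer name, which yields the 72% quoted above for Costumer and CostumerDAO when truncated to an integer. The function names and the case-sensitive comparison are assumptions of this sketch, not part of JReuse.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or
    substitutions needed to turn a into b (row-by-row dynamic programming)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # delete char_a
                               current[j - 1] + 1,       # insert char_b
                               previous[j - 1] + cost))  # substitute or keep
        previous = current
    return previous[-1]


def similarity_rate(a: str, b: str) -> float:
    """Assumed normalization: 1 - distance / length of the longer name."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def is_match(a: str, b: str, threshold: float = 0.75) -> bool:
    """True when two class names reach the minimum similarity threshold."""
    return similarity_rate(a, b) >= threshold
```

Under this assumption, `is_match("Clients", "Client")` holds, while `is_match("Costumer", "CostumerDAO")` does not, which mirrors the intent of the 75% threshold.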

Table 1 presents some examples of class names and the respective similarity rate
using the chosen algorithm. In this table, we present eight matches between names of
classes from two software systems: System A and System B. Each match has a similarity rate of at least 75%
between names of classes, in accordance with our empirical threshold.
Note that our threshold covers, for instance, names of classes that vary from singular to
plural (e.g., Client and Clients).
Table 1. Examples of similarity computation

System A             System B            Similarity Rate
ShoppingCart         ShoppCart           75%
OrderProductId       OrderProduc         78%
Orderservice         Orderservi          83%
Reviwes              Reviwe              85%
Clients              Client              85%
CartController       CartControll        85%
Products             Product             87%
ProductsController   ProductController   94%

3.2. Proposed Method and Its Steps


A software domain is a set of systems that shares a common set of functionalities, requirements, or terminology [Neighbors 1992, Pressman 2005]. Therefore, we expect
that software systems within the same domain present lexical similarity with respect
to names of elements, such as classes. In this context, similarly named elements may
contribute to the comprehension of the characteristics of systems from a given business
domain [Cybulski and Reed 2000].
Considering this scenario, our study proposes JReuse, a method for identification
of reuse opportunities from software systems. Our method is based on naming lexical
similarity of classes. Given a set of software systems from the same domain, JReuse
compares names of classes, in pairs, to identify common names among different systems.
We believe that recurring names of classes may indicate reuse opportunities in a given
domain. Furthermore, frequent names of classes may indicate common behaviors and
requirements of these entities [Cybulski and Reed 2000].
In general, similarity rate is not enough for electing a class as a possible reuse
opportunity. We then consider the classes that are more frequent among the systems for
recommendation. Note that, for instance, a name of class with matches in 10 different
systems is more frequent than a name of class that matches in only 2 systems. Figure 1
illustrates the comparison between classes performed by JReuse. We provide a description
of this process as follows.
Consider array[1..n] an array of names of classes and two indices i ∈ {1, ..., n - 1} and j ∈ {2, ..., n}. For each i, we compare array[i] with array[j]
for j ∈ {i + 1, ..., n}. If array[i] is similar to array[j] with a minimum similarity rate of 75%, then the method registers a reuse opportunity. JReuse compares all
classes from the set of systems to identify the similarly named classes. Since our method
is based on lexical analysis, we do not perform synonym analysis. Therefore, entities
such as Client and Customer, that may be similar semantically, are considered as
different entities.

Figure 1. Steps to identify common entities
Figure 2 presents the five steps performed by JReuse to identify reuse opportunities
in a set of systems. All steps are described as follows.

Figure 2. Steps of the JReuse method

1. First, the JReuse method receives, as input, software systems from a data set provided by the user. These systems are supposed to belong to the same domain.
Then, the method filters out non-Java source files, discards every project that
targets the Android platform, and extracts the names of classes from the Java
source files.
2. Next, the method selects a class and computes the similarity between its name and
the names of the classes of the other systems. We highlight that JReuse does not compare classes of
the same software system.
3. Then, JReuse compares the names of classes in pairs to identify names with at
least 75% of similarity. Classes with similar names, called matches, are gathered
and each class name receives a score that is the number of systems in which the
class occurs. The higher the score, the more relevant the class may be regarding
the analyzed domain.
4. After comparing names of classes and computing similarity, the results obtained
by JReuse are sorted, in decreasing order, by the frequency of the identified reuse
opportunities.
5. Finally, JReuse composes a repository of candidate reuse opportunities with
the identified classes. This repository may be used to support the reuse of these classes.
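Steps 2 to 4 above can be sketched in Python as shown below. This is a minimal sketch under stated assumptions, not the JReuse implementation: the input format (a dictionary from system name to class names), the function names, and the way scores are accumulated (each distinct class name keeps the set of systems in which it or a similarly named class was found) are choices made for illustration, since the text does not specify how matching names are grouped. The similarity normalization (1 minus distance over the longer length) is also an assumption.

```python
from itertools import combinations


def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1, current[j - 1] + 1,
                               previous[j - 1] + cost))
        previous = current
    return previous[-1]


def similarity_rate(a, b):
    """Assumed normalization: 1 - distance / length of the longer name."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def rank_reuse_opportunities(systems, threshold=0.75):
    """Return (class name, score) pairs sorted by decreasing score.

    systems maps a system name to the class names extracted from it.
    The score of a name is the number of distinct systems involved in
    its matches; names without any cross-system match are left out.
    """
    entries = [(system, name)
               for system, names in systems.items()
               for name in set(names)]
    matched_systems = {}
    for (sys_a, name_a), (sys_b, name_b) in combinations(entries, 2):
        if sys_a == sys_b:
            continue  # classes of the same system are never compared
        if similarity_rate(name_a, name_b) >= threshold:
            matched_systems.setdefault(name_a, set()).update((sys_a, sys_b))
            matched_systems.setdefault(name_b, set()).update((sys_a, sys_b))
    ranking = [(name, len(found)) for name, found in matched_systems.items()]
    ranking.sort(key=lambda item: (-item[1], item[0]))  # ties broken by name
    return ranking
```

For example, with three hypothetical systems `{"A": ["Client", "Cart", "Main"], "B": ["Clients", "Cart"], "C": ["Client", "Invoice"]}`, the names Client and Clients are linked to all three systems and Cart to two, while Main and Invoice have no match and are not reported.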

3.3. Tool Support


To automate the proposed method, we developed a prototype tool that implements JReuse
for Java software systems. We selected Java because (i) it is one of the most popular
programming languages¹, (ii) there is an available Java parser to support source code
analysis by the generation of an Abstract Syntax Tree (AST), and (iii) many studies have
been investigating software reuse in Java systems. Through the Java parser, we may
access the source code structure, Javadoc, and comments, for instance. It is also possible
to change the AST nodes or create new ones to modify the source code. We also used the
Eclipse Java Development Tools (JDT) parser to support the identification of similarly
named classes.
The supporting tool performs three steps to identify reuse opportunities. Each step
is described as follows.
1. First, the tool retrieves the name of all classes from a software system data set.
This step is important to support the similarity computation among classes from
different systems.
2. Next, the tool compares the names of classes, in pairs, to identify class names
with at least 75% of similarity. Classes with similar names, that is, matches, are
gathered and each class name receives a score that is the number of systems in
which the class occurs. The higher the score, the more relevant the class may be
with respect to the analyzed domain.
3. Finally, the tool persists the classes identified and extracted as reuse opportunities
in a database.
JReuse provides an abstraction for the design organization of a system given a
domain. In other words, the reuse opportunities identified by JReuse may be used to
compose a partial design for any system that belongs to the analyzed domain in terms
of frequent classes. For this purpose, the tool provides output as a CSV file. Each line
of the file contains (i) the name of a class identified as reuse opportunity and (ii) the
absolute path of the class. The output file is sorted, in decreasing order, by frequency of
the identified reuse opportunities.
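The output format described above can be illustrated with a short Python sketch. It assumes each reuse opportunity is available as a (class name, absolute path, frequency) tuple; this tuple layout, the function name, and the example paths are hypothetical and only serve the illustration. The frequency is used for sorting and, following the description above, is not written to the file.

```python
import csv
import io


def write_reuse_csv(opportunities, stream):
    """Write one line per reuse opportunity: class name, absolute path.

    opportunities is an iterable of (class_name, absolute_path, frequency)
    tuples; lines are ordered by decreasing frequency.
    """
    writer = csv.writer(stream)
    ordered = sorted(opportunities, key=lambda row: row[2], reverse=True)
    for class_name, absolute_path, _frequency in ordered:
        writer.writerow([class_name, absolute_path])


# Usage with made-up data; the paths are placeholders.
buffer = io.StringIO()
write_reuse_csv(
    [("Cart", "/data/shop-a/src/Cart.java", 2),
     ("Client", "/data/shop-b/src/Client.java", 3)],
    buffer,
)
```

Writing to a stream rather than to a fixed file name keeps the sketch independent of where the tool stores its results, which the text does not specify.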

4. Method Evaluation
This section describes an empirical evaluation of the method proposed in Section 3. For
this purpose, we designed an exploratory study conducted in a controlled environment,
based on the guidelines of Wohlin et al. (2012). Since JReuse aims to identify the main
reuse opportunities from software systems, our evaluation consists of analyzing the reuse
opportunities identified by the proposed method. Section 4.1 presents study goal and research questions that we designed to guide our study. Section 4.2 describes the data set
used to evaluate our method through the prototype tool. Finally, Section 4.3 presents the
evaluation steps.
4.1. Goal and Research Questions
In this study, our goal is to assess whether JReuse is able to identify frequent classes in a
specific software domain. We are also interested in assessing the relevance of the results
provided by our method. For this purpose, we chose four domains to be evaluated: accounting, restaurant, hospital, and e-commerce. We also designed the following research
questions (RQs) to guide our study.

¹ http://spectrum.ieee.org/static/interactive-the-top-programming-languages-2015

RQ1 What are the most frequent classes in software systems for each selected domain?
Through RQ1, we are interested in investigating whether the most frequent identified classes are appropriate for software systems of the respective domain. We expect that
JReuse is able to provide a list of classes and methods whose recommendations for reuse
are relevant for the respective domains.
RQ2 How are the most frequent classes distributed through systems per domain?
With RQ2, we aim to understand to what extent the same class, identified as one
of the most frequent classes, occurs in different software systems from a given domain.
For instance, we want to know if the same class can occur in all systems, or most of them.
4.2. Data Set
To evaluate our method, we chose only systems from the domains of accounting, restaurant, hospital, and e-commerce, for several reasons. First, software systems from these
domains encompass several business features, such as user personnel, financial, product, and service management. Second, there is a significant number of systems from these domains
available for download on GitHub². Third, from the viewpoint of the authors, the four
domains we chose are well-defined in terms of requirements and we believe that it would
be possible to find reuse opportunities among systems of these domains. The systems that
compose our data set were extracted from GitHub repositories. We performed the selection of systems for the e-commerce domain in January 2015 and in May 2016 for the other
domains. We selected software systems based on the ranking of starred systems and
system size in terms of storage space. In GitHub, stars are a meaningful measure of
repository popularity among the platform users, and may be used to support the selection
of systems.
Diverse terminology may be used to refer to the same software domain. For instance,
we may refer to the e-commerce domain as ecommerce, without hyphenation. In order
to support the collection of software systems to compose our data set, we developed an
algorithm to clone GitHub repositories individually, with the respective systems, based
on a well-defined search string for each domain under analysis in this study. Since the
goal of our study is to identify reuse opportunities from different software systems, given
large system sets per domain, we defined the following search strings.
For accountancy: accountancy OR accounting
For restaurant: restaurant OR eatery OR restaurants
For hospital: hospital OR infirmary OR lazaretto
For e-commerce: e-commerce OR ecommerce OR electronic commerce
Table 2 presents the exclusion criteria applied to the selected systems. First,
we collected 400 Java systems from GitHub, 100 for each domain, sorted in descending order
by stars. Then, we discarded systems according to the following exclusion criteria: (i) non-Java software systems, since GitHub does not automatically verify the main
programming languages of the systems, (ii) Java projects developed for the Android platform, because Android systems tend to have a different architectural design and code
implementation when compared with traditional Java systems, (iii) systems with fewer than
1,000 lines of code (LOC), and (iv) systems written in natural languages other than English, since our method relies on a lexical similarity technique and, therefore, the natural language
may significantly impact the results provided by our method.

² https://github.com

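The four exclusion criteria can be expressed as a simple filter, sketched below in Python. The record type and its field names are hypothetical; the text does not describe how each property (language, platform, size, natural language) was determined for a repository, so the sketch assumes these facts are already known for each candidate system.

```python
from dataclasses import dataclass


@dataclass
class CandidateSystem:
    """Hypothetical record describing one cloned repository."""
    name: str
    is_java: bool
    targets_android: bool
    lines_of_code: int
    written_in_english: bool


def passes_exclusion_criteria(system, min_loc=1000):
    """True when none of the exclusion criteria (i) to (iv) applies."""
    return (system.is_java                        # (i) Java systems only
            and not system.targets_android        # (ii) no Android projects
            and system.lines_of_code >= min_loc   # (iii) at least 1,000 LOC
            and system.written_in_english)        # (iv) English only


def select_systems(candidates, min_loc=1000):
    """Keep the candidates that survive all exclusion criteria."""
    return [s for s in candidates if passes_exclusion_criteria(s, min_loc)]
```

Keeping the minimum size as a parameter makes it easy to try other cut-offs than the 1,000 LOC used in this study.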
Table 2. Filters applied to the data set

              Excluded Systems by
Domains       Not English   Less than 1,000 LOC   Android   Selected Systems
Accounting    9             49                    31        11
Restaurant    3             56                    28        13
Hospital      16            37                    34        13
E-commerce    21            40                    4         35

For each selected system, we considered only the last release. This process was
necessary to discard different versions of the same system, which probably contain many
similarly named classes and methods. Finally, we obtained 72 Java systems for
evaluation of the JReuse method.
To better characterize systems in the four domains, Figures 3 and 4 present software metrics for systems per domain: lines of code (LOC) and number of classes (NOC),
respectively. We plotted twelve boxplots, one for each metric. However, because of the
heterogeneity of the sample of our data set, we decided to eliminate outliers for each
metric. Therefore, all boxplots present a brief overview of each analyzed domain.
Let us consider Figure 3 in the following analysis of LOC. With respect to the
accounting domain, we observe that the mean of LOC for the systems is 8,690. Moreover,
the median is 5,112, i.e., half of the accounting systems have at least 4 KLOC. This is a
significant number for analysis and identification of reuse opportunities. Regarding the
restaurant domain, the mean of LOC is 3,447. In addition, the median is 3,256. Again,
we conclude that these systems have a significant LOC for analysis. For the hospital
domain, the mean LOC is 4,964 and the median is 2,534. Although these values are
smaller than the values obtained for the other domains, they remain significant for the study.
Finally, with respect to the e-commerce domain, we observe a mean LOC of 46,100 and a
median of 3,730. In general, systems from this domain have the highest numbers of LOC
and, therefore, they may have several reuse opportunities.
With respect to the analysis of NOC, consider Figure 4. Regarding the
accounting domain, note that the mean of NOC for the systems is 35.73. Furthermore, the
median is 18, i.e., half of the accounting systems have at least 18 classes. This number is
significant for analysis because we are interested in finding similarly named classes within
a pairwise comparison. Therefore, we expect a comparison of 18 × 18 = 324 pairs that
may be reuse opportunities. Regarding the restaurant domain, the mean of NOC is 37.23.
In addition, the median is 40. Again, we conclude that these systems have a significant
NOC for analysis. For the hospital domain, the mean NOC is 33.85 and the median is 25.
Finally, with respect to the e-commerce domain, we observe a mean NOC of 368.9
and a median of 45.5. In general, systems from this domain have the highest numbers of
NOC and, therefore, there is a significant possibility of identifying reuse opportunities.

Figure 3. LOC of the systems per domain

Figure 4. NOC of the systems per domain

4.3. Evaluation Steps


Figure 5 presents the three study steps we followed to investigate the research questions
described in Section 4.1. Each step is described below.

Figure 5. Steps of the exploratory study

Step 1: Automated Search By using the search strings described in Section 4.2,
we cloned from GitHub several software systems, belonging to different domains. We
intended to identify appropriate domains to be analyzed in our exploratory study. For
this purpose, we considered a domain as appropriate when, from our viewpoint, systems
from the given domain contain a significant number of classes and methods for analysis.
After performing the search for systems, with support of our algorithm, we obtained
400 software systems from four distinct domains: accounting, restaurant, hospital, and
e-commerce.
Step 2: Exclusion Criteria By applying a set of exclusion criteria defined by the
authors, we selected the systems according to the following requirements: (i) software
systems written only in English, (ii) software systems with more than 1,000 lines of
source code, and (iii) traditional Java software systems, i.e., software that is not exclusive
to the Android platform. After applying the exclusion criteria, 72 different software
systems remained for analysis.
Step 3: Detection of Similarly Named Classes We executed the JReuse prototype
tool for the 72 collected systems. Per domain, the respective systems were submitted
to JReuse for extraction of reuse opportunities. After the automated analysis for each
domain, JReuse provided a list with the most frequent classes that occur in the given
domain.

5. Results and Discussion


In this section, we present and discuss the main results of our empirical evaluation with
JReuse. Section 5.1 presents the most frequent classes identified by the method per domain. Section 5.2 focuses on the distribution of the most frequent classes through the systems of each domain. Section 5.3 provides an overview and discusses lessons learned.
5.1. Frequent Classes per Domain
First, we present the results regarding the most frequent classes per analyzed
domain, answering RQ1 as follows.
RQ1 What are the most frequent classes in software systems for each selected domain?
In this study, we analyzed the frequency of similarly named classes for the systems
of each domain. Table 3 presents software metrics for systems per domain: lines of
code (LOC) and number of classes (NOC). This table categorizes NOC in two types: (i)
analyzed, i.e., the number of entities analyzed by the tool and (ii) recommended, that is,
entities identified by the tool as reuse opportunities. In general, from Table 3 we observe
that JReuse identified candidates for reuse opportunities in all four domains. For instance, for
the e-commerce domain, JReuse identified 75 classes as reuse opportunities.
Table 3. Software metrics computed for the systems per domain

Domains       Systems   LOC         NOC Analyzed   NOC Recommended
Accounting    11        95,588      493            25
Restaurant    13        44,813      484            17
Hospital      13        65,297      446            21
E-commerce    35        1,567,337   12,598         75

In order to present and discuss the most frequent classes extracted as reuse opportunities, we applied the following exclusion criterion to the classes. For each domain, we
discarded classes that occur in a maximum of two different systems. This decision was
taken because our method compares classes in pairs and, thus, 3 occurrences may not be
significant to a reuse recommendation. We selected the top-ten most frequent classes of
each domain, as presented in Figures 4.3, 4.4, 4.5, and 4.6. We submitted the list of most
frequent entities to a group of 4 researchers at a Software Engineering laboratory (name
omitted due to blind review), for validation of the entities with respect to relevance.
Tables 4, 5, 6, and 7 present classes identified as reuse opportunities for e-commerce, accounting, restaurant, and hospital, respectively. We selected only the classes
with at least 15%³ of occurrences in the systems of the respective domain. Each table has a
Domain-Specific field. This field indicates the viewpoint of the focal group on whether a
given entity is specific to the analyzed domain. The focal group's viewpoint is represented by three symbols in the table: (i) the (X) symbol indicates that the focal group agreed
that the class is specific to the domain under analysis, (ii) the (7) symbol indicates that
the focal group disagreed that the class is indicated for the domain, and (iii) a blank field
(Unconfirmed) indicates that the focal group did not converge to a specific opinion on the
class. Moreover, each table has a Labels field to inform the level of relevance of the
entity identified by JReuse as a reuse opportunity.
Scale to Indicate the Level of Relevance of the Entities Identified. To support the identification of the most recommended classes for each domain, Figure 6 shows a scale from 0% to 100% that represents the level of relevance to recommend an entity, based on the frequency of the classes identified as reuse opportunities. A threshold of 50% separates two labels for the level of relevance, namely weak and strong. The weak label (from 0% to < 50%) indicates that the class is weakly or moderately recommended for reuse in a given domain. The strong label (from 50% to 100%) indicates that the class is highly recommended for reuse.
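As a minimal sketch of this scale, the mapping from frequency to label can be written as a single function. The function name relevance_label and the choice of expressing the frequency as a share between 0 and 1 are assumptions made for illustration.

```python
def relevance_label(share):
    """Map the share of systems containing a class (0.0 to 1.0) to a label.

    Shares below 0.5 are labeled "weak"; shares from 0.5 to 1.0 are
    labeled "strong", following the two ranges of the scale.
    """
    if not 0.0 <= share <= 1.0:
        raise ValueError("share must be between 0 and 1")
    return "strong" if share >= 0.5 else "weak"
```

Under this reading of the scale, a class present in 28 of 35 systems (80%) is labeled strong, while a class present in 17 of 35 systems (about 49%) is labeled weak.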

Figure 6. Scale of relevance for entities identified as reuse opportunities

Table 4 presents results with respect to the accounting domain. For this domain, the classes from Users to TransactionManager belong to the strong label and, therefore, they are the highly recommended classes for accounting systems. On the other hand, the focal group did not consider the classes Users, DatabaseConnection, and Util as specific to the accounting domain. In addition, the classes from AddFinancialsAction to RawMaterial belong to the weak label. The remaining classes have exactly 2 or 3 occurrences in different systems from the accounting domain. Therefore, they are weakly recommended and were omitted from this table.

Footnote 3: The percentage is arbitrary, i.e., it can be adapted for domains with more or fewer systems for analysis.

Table 4. Classes with at least 15% of occurrences in the accounting domain

Label    Class                      Frequency  % of systems  Domain-Specific
Strong   Users                      13         100%          ✗
Strong   DatabaseConnection         13         100%          ✗
Strong   CashFlow                   11         85%           ✓
Strong   Util                       10         77%           ✗
Strong   BalancesAssets             9          69%           ✓
Strong   CashBanks                  9          69%           ✓
Strong   ShareholderEquity          9          69%           ✓
Strong   BalancesLiabilities        8          62%           ✓
Strong   ChartAccounts              8          62%           ✓
Strong   AccountingMovement         8          62%           ✓
Strong   AccountsReceivable         8          62%           ✓
Strong   AccountsPayable            6          46%           ✓
Strong   Transactions               7          54%           ✓
Strong   Log                        7          54%           ✗
Strong   FinancialReportsPoeHelper  7          54%           ✗
Strong   InventoryManager           7          54%           ✓
Strong   TransactionManager         7          54%           ✓
Weak     AddFinancialsAction        6          46%           ✓
Weak     Accounts                   6          46%           ✓
Weak     FeaturesAnalysis           6          46%           ✓
Weak     RawMaterial                6          46%           ✓

Key: Agree (✓), Disagree (✗), and Unconfirmed (blank field)

Table 5 presents results for the restaurant domain. The classes Login and User belong to the strong label. From the focal group's viewpoint, they are not specific to the domain, although they are relevant in restaurant systems. The classes from Client to Order also belong to the strong label and are relevant for the restaurant domain from the focal group's viewpoint. Note that many of the classes identified by JReuse were pointed out as relevant reuse opportunities for restaurant systems, even in the weak label, such as RestaurantMenu, Delivery, and Customer.
Table 5. Classes with at least 15% of occurrences in the restaurant domain

Label    Class                      Frequency  % of systems  Domain-Specific
Strong   Login                      10         77%           ✗
Strong   User                       10         77%           ✗
Strong   ConnectionManager          9          70%           ✗
Strong   Client                     9          70%           ✓
Strong   Table                      8          62%           ✓
Strong   PaymentType                8          62%           ✓
Strong   Dish                       8          62%           ✓
Strong   Employee                   7          54%           ✓
Strong   Order                      7          54%           ✓
Weak     RestaurantMenu             6          47%           ✓
Weak     Delivery                   6          47%           ✓
Weak     ItemOrdered                6          47%           ✓
Weak     Customer                   4          31%           ✓

Key: Agree (✓), Disagree (✗), and Unconfirmed (blank field)

Consider Table 6 for the analysis of the hospital domain. Observe that the classes from Patient to Microbiology belong to the strong label and, therefore, they are highly recommended classes as reuse opportunities. Note that, from the viewpoint of the focal group, the three most frequent classes are considered specific to hospital systems. In fact, classes such as Patient and Doctor are meaningful in the given domain. In addition, the classes from PatientCondition to OperationsWithCards belong to the weak label. Finally, the remaining classes have less than 10% of the occurrences.

Table 6. Classes with at least 15% of occurrences in the hospital domain

Label    Class                      Frequency  % of systems  Domain-Specific
Strong   Patient                    13         100%          ✓
Strong   Doctor                     13         100%          ✓
Strong   Disease                    11         85%           ✓
Strong   User                       10         77%           ✗
Strong   Login                      9          69%           ✗
Strong   Diagnose                   9          69%           ✓
Strong   Symptoms                   9          69%           ✓
Strong   PatientDisease             8          62%           ✓
Strong   HealthPlan                 8          62%           ✓
Strong   Immunology                 8          62%           ✓
Strong   Haematology                8          62%           ✓
Strong   Medication                 7          54%           ✓
Strong   Surgery                    7          54%           ✓
Strong   MedicalRecords             7          54%           ✓
Strong   TypePayment                7          54%
Strong   Microbiology               7          54%           ✓
Weak     PatientCondition           6          46%           ✓
Weak     LaboratoryExams            6          46%           ✓
Weak     Log                        6          46%           ✗
Weak     HistoPathology             6          46%           ✓
Weak     Connection                 6          46%           ✗
Weak     Paycash                    5          38%
Weak     Util                       5          38%           ✗
Weak     OperationsWithCards        3          23%

Key: Agree (✓), Disagree (✗), and Unconfirmed (blank field)

Finally, consider Table 7 for the analysis and discussion regarding the e-commerce domain. Note that the classes from Product to ClientDao belong to the strong label, according to Figure 6. That is, they are highly recommended classes for e-commerce systems, because they are present in more than 50% of the analyzed systems. In addition, the classes from Item to ShoppingCartService are the weakly recommended classes. As mentioned above, classes with less than 15% of the occurrences were omitted.
In general, we observed that the classes identified by JReuse are relevant to their respective system domains, from the viewpoint of the focal group. Although some classes in the weak label are considered relevant, most of the group's agreement was related to classes in the strong label. Therefore, our data suggest that our method is able to identify interesting candidates for reuse. Given these candidates, we are able to assess how such classes are distributed among different systems from the same domain.

Table 7. Classes with at least 15% of occurrences in the e-commerce domain

Label    Class                      Frequency  % of systems  Domain-Specific
Strong   Product                    28         80%           ✓
Strong   PaymentType                24         69%           ✓
Strong   Client                     20         58%           ✓
Strong   ProductDao                 18         52%           ✓
Strong   ClientDao                  18         52%           ✓
Weak     Item                       17         49%           ✓
Weak     ShoppingCart               17         49%           ✓
Weak     User                       17         49%           ✗
Weak     Customer                   14         40%           ✓
Weak     Category                   12         35%           ✓
Weak     ProductService             10         29%           ✓
Weak     Order                      9          26%           ✓
Weak     LoginController            7          20%           ✗
Weak     UserDao                    6          18%           ✓
Weak     ProductServiceImpl         6          18%           ✓
Weak     ShoppingCartController     6          18%           ✓
Weak     OrderedProduct             5          15%           ✓
Weak     ShoppingCartService        5          15%           ✓

Key: Agree (✓), Disagree (✗), and Unconfirmed (blank field)

5.2. Distribution of Frequent Classes


After presenting the most frequent classes from systems of each system domain under
analysis, we present the results with respect to the distribution of classes through systems
from the same domain. Therefore, we answer RQ2 as follows.
RQ2 How are the most frequent classes distributed through systems per domain?
Figure 7 presents the top-ten most frequent classes for the accounting domain, based on the number of occurrences of each class. The classes are, in decreasing order of frequency, Users, DatabaseConnection, CashFlow, Util, BalancesAssets, CashBanks, ShareholderEquity, BalancesLiabilities, ChartAccounts, and AccountingMovement. We observe that, although only CashFlow is considered specific to the given domain, from the viewpoint of the focal group, all classes from this label are meaningful in accounting systems. In turn, the remaining classes are from the medium label. Among these classes, CashBanks, Transaction, and Accounts are considered specific, for instance.
Regarding the restaurant domain analysis, Figure 8 presents the top-ten classes with the highest occurrences, namely Login, User, ConnectionManager, Client, Table, PaymentType, Dish, Employee, Order, and RestaurantMenu. These classes have a high to medium level of recommendation according to the scale from Figure 6. The classes with the highest occurrences in this domain are Login and User. Both are present in 77% of the analyzed systems, although they are not specific to restaurant systems. JReuse also identified frequent classes that are specific to the domain, such as Client, Table, PaymentType, and Dish.
Figure 9 presents the most frequent classes identified for the hospital domain, in decreasing order of frequency. For the 13 systems we collected from this domain, JReuse extracted some relevant entities from the focal group's point of view, such as Patient, Doctor, and Disease. The classes presented in this figure belong to the strong label, as illustrated in Figure 6. Note that the classes Patient and Doctor are present in 100% of the evaluated systems. Similarly to the other domains, JReuse identified some generic classes, such as User (77%), that are also expected in systems from other domains.

Figure 7. Distribution of frequent classes through accounting systems

Figure 8. Distribution of frequent classes through restaurant systems

Finally, Figure 10 presents the top-ten most frequent classes for e-commerce systems. We sorted the classes in decreasing order of frequency. The most frequent entities are, respectively, Product, PaymentType, Client, ProductDao, ClientDao, Item, ShoppingCart, User, Customer, and Category. Note that, according to the focal group, the classes Product, Payment, ShoppingCart, Customer, and Client are elementary entities to be expected in an e-commerce system. In turn, although User is one of the most frequent classes identified by JReuse (49% of the systems contain this class), it is not specific to the e-commerce domain. However, this entity is meaningful for information systems in general.

Figure 9. Distribution of frequent classes through hospital systems

Figure 10. Distribution of frequent classes through e-commerce systems

5.3. Lessons Learned


In this study, we learned lessons on research topics such as software reuse, identification of reuse opportunities, and recommendation systems. To discuss them, we take the e-commerce domain as an example, given the popularity and size of these systems on GitHub. We discuss some of the main lessons learned with the support of the following questions.

How much may lexical analysis support the identification of reuse opportunities? As discussed in Section 2, there are many approaches to support software reuse in the literature. Lexical analysis is a simple one. However, as indicated by the results of Section 5, it may be effective to identify reuse opportunities in systems from a single domain. Moreover, we initially conceived our method to gather elements with names that are semantically similar. However, through our study we identified some occurrences of entities that are intuitively similar but do not represent the same real-world concept.
In our exploratory study, which was conducted in a controlled environment (see Section 4.3), we found, for instance, that frequent classes such as Client and Customer have distinct behaviors, although intuitively they represent the same real-world abstraction. Some classes named Client implement simplistic system clients that basically register data. In turn, Customer classes generally implement system clients with more robust features, such as data management. Therefore, we conclude that lexical analysis performs satisfactorily to identify reuse opportunities, at least in this domain.
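To make the limits of a purely lexical comparison concrete, the sketch below computes a simple length-normalized edit distance between two class names. It is an assumption made for illustration and not necessarily the similarity function used by JReuse; the names levenshtein and name_similarity are hypothetical.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def name_similarity(a, b):
    """Similarity in [0, 1]: 1 minus the edit distance over the longer length."""
    a, b = a.lower(), b.lower()
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty names are treated as identical
    return 1.0 - levenshtein(a, b) / longest
```

With such a measure, name_similarity("Client", "Clients") is high, whereas name_similarity("Client", "Customer") is low, even though both names intuitively denote the same real-world abstraction. A lexical comparison alone therefore does not group them.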
Are names of classes suitable for the entities they represent in a business domain? We discuss in Section 3 that names of classes may be useful for the identification of reuse opportunities. In fact, we observed that identifying naming similarity may support this identification. However, retrieving similarly named classes may be uninteresting if they are not representative of a specific domain. Section 4 highlights identified classes that fit the e-commerce domain. These entities are the most frequent ones that our tool detected.
Therefore, we believe that names of entities are, in general, sufficiently representative. Moreover, we observed in this study that our method is able to identify reuse opportunities in randomly mined systems from GitHub, provided by different development teams. Therefore, we expect to obtain even more relevant results in the context of a specific organization.
How to apply our reuse opportunities identification tool in a reuse recommendation system? Methods and classes are elementary entities of object-oriented software systems. Knowing these entities, we are able to describe the architecture of a system. Therefore, with the results provided by our tool, we see an opportunity for reuse recommendation through software modeling using class diagrams, for instance.
To the best of our knowledge, there are few recent studies on the identification of reuse opportunities, on tools that support this activity, and on methods that support the building of reuse repositories with a similar approach. Therefore, as an interesting research topic, we lack more quantitative data to measure and compare different techniques that support software reuse.

6. Threats to Validity
We based our study on related work to support the method definition, the tool development, and the proposal of a recommendation system. Regarding the evaluation of our
method and tool, we conducted a careful empirical study to assess effectiveness of the
method in identifying reuse opportunities. However, some threats to validity may affect
our research findings. The main threats and respective treatments are discussed below
based on the categories presented by Wohlin et al. (2012).

Construct Validity. Before running our reuse opportunities identification method, we conducted a careful filtering of information systems from GitHub repositories. However, some threats may affect the correct filtering of systems, such as human factors that wrongly lead to discarding a valid system. Considering the exclusion criteria for selection of systems (see Section 4.2), we implemented an algorithm to automate this process and, then, discard inappropriate systems for analysis. However, we may have discarded relevant software systems by using our algorithm, such as systems misidentified as non-Java systems.
Internal Validity. We conducted a lexical classification of entities that may be affected
by some threats. To treat this possible problem, we selected a sample of 10 e-commerce
systems from our data set, with diversified number of entities. Then, we manually
identified the names of entities from source code to find synonyms. We compared our
manual results with the results provided by the tool and observed a loss of 10% in
synonym terms identified through the automated process.
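The comparison between manual and automated results can be expressed as a recall computation. The sketch below is an assumption about how such a loss could be measured, since the exact procedure is not detailed here; the function name synonym_recall and the representation of synonyms as pairs of names are hypothetical.

```python
def synonym_recall(manual_pairs, tool_pairs):
    """Share of manually identified synonym pairs that the tool also reported.

    Each pair is an iterable of two names; the order inside a pair is ignored.
    """
    manual = {frozenset(pair) for pair in manual_pairs}
    tool = {frozenset(pair) for pair in tool_pairs}
    if not manual:
        raise ValueError("manual_pairs must not be empty")
    return len(manual & tool) / len(manual)
```

Under this reading, the loss is 1 minus the recall, so a recall of 0.9 corresponds to a loss of 10%.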
Conclusion Validity. After running our identification tool, we manually gathered classes that seemed to represent the same real-world object. For instance, classes named Client and Customer were considered the same type of entity. The same occurred with methods identified by the tool as reuse candidates. However, this process is subjective and may be affected by human factors. In this first exploratory study, we decided not to unify terms (e.g., Customer and Client) in the quantitative analysis.
External Validity. We evaluated our method with a set of 72 systems extracted from GitHub. Considering that they may not represent the 4 domains analyzed, our findings may not be generalized. Furthermore, we evaluated only four system domains: accounting, restaurant, hospital, and e-commerce. However, the collected systems are the most popular on GitHub, which is a widely used platform. Finally, we evaluated only systems implemented in the Java programming language. Although it is one of the most popular languages worldwide, our results may not generalize to other programming languages.

7. Conclusion and Future Work


In previous work, we proposed JReuse, a method to identify reuse opportunities from software systems of a specific domain. This method relies on lexical analysis to compare names of classes and identify the most frequent ones. Such classes are used to compose a repository of reuse opportunities. We also presented a prototype tool that implements the method for Java systems. Finally, we conducted a preliminary evaluation of our method with 38 software systems from the e-commerce domain. This paper extends our earlier contributions by evaluating the proposed method with 72 Java systems of four different domains: accounting, restaurant, hospital, and e-commerce.
We evaluate JReuse through an exploratory study conducted in a controlled environment, considering two aspects. First, we assess whether the most frequent classes provided by our method are relevant for the respective domains. Second, we assess the distribution of frequent names of classes among different systems of the same domain. Our findings suggest that our method was able to suggest relevant classes for systems from the four analyzed domains. Furthermore, we observe that the most frequent classes suggested as candidates for reuse are present in a significant number of different systems of the respective domain.
As future work, we intend to enhance JReuse to suggest source code to developers based on the most frequent classes identified by the proposed method. In addition,
we aim to combine lexical with semantic analysis to improve the identification of reuse
opportunities. For instance, the semantic analysis may support analysis of synonyms
and improve our results of identification. In addition, we may explore alternative techniques for similarity computation. We also intend to implement our method targeting
other object-oriented programming languages.
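As a hedged illustration of how synonym handling could affect the frequency counts, the sketch below maps synonym names to a canonical name before counting. The synonym table is supplied by hand here; how it would be obtained from a semantic analysis is left open, and the function name count_with_synonyms and the input format are assumptions.

```python
from collections import Counter


def count_with_synonyms(systems, synonyms):
    """Count systems per class after mapping synonym names to a canonical name.

    systems: dict mapping a system name to the set of its class names
             (hypothetical input format).
    synonyms: dict mapping a class name to its canonical name, for example
              {"Customer": "Client"} (hand-written for this sketch).
    """
    counts = Counter()
    for class_names in systems.values():
        # Build a set so that a system with both synonyms is counted once.
        canonical = {synonyms.get(name, name) for name in class_names}
        counts.update(canonical)
    return counts
```

With an empty synonym table the function reduces to plain per-system counting, so the effect of unifying terms such as Customer and Client can be compared directly against the counts without unification.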

References
Caldiera, G. and Basili, V. R. (1991). Identifying and qualifying reusable software components. In Journal Computer - Special Issue on Cryptography, pages 61-70.
Cybulski, J. and Reed, K. (2000). Requirements classification and reuse: Crossing domain boundaries. In Proceedings of the International Conference on Software Reuse (ICSR), pages 190-210.
Guo, J. and Luqi (2000). A survey of software reuse repositories. In Proceedings of the International Conference and Workshops on the Engineering of Computer Based Systems (ECBS), pages 92-100.
Inoue, K., Yokomori, R., Yamamoto, T., Matsushita, M., and Kusumoto, S. (2005). Ranking significance of software components based on use relations. In IEEE Transactions on Software Engineering (TSE), pages 213-225.
Kawaguchi, S., Garg, P., Matsushita, M., and Inoue, K. (2004). Mudablue: an automatic categorization system for open source repositories. In Proceedings of the Asia-Pacific Software Engineering Conference (APSEC), pages 184-193.
Koziolek, H., Goldschmidt, T., de Gooijer, T., Domis, D., and Sehestedt, S. (2013). Experiences from identifying software reuse opportunities by domain analysis. In Proceedings of the International Software Product Line Conference (SPLC), pages 208-217.
Krueger, C. (1992). Software reuse. In Journal of Computing Surveys (CSUR), pages 131-183.
Kuhn, A., Ducasse, S., and Gîrba, T. (2007). Semantic clustering: Identifying topics in source code. In Journal of Information and Software Technology, pages 230-243.
Lee, J., Kang, K. C., and Kim, S. (2004). A feature-based approach to product line production planning. In Proceedings of the International Conference on Software Product Lines (SPLC), pages 183-196.
Li, J., Zhang, Z., and Yang, H. (2005). A grid oriented approach to reusing legacy code in iceni framework. In Proceedings of the International Conference on Information Reuse and Integration (IRI), pages 464-469.
Liu, H. and Lu, R. (2008). Word similarity based on an ensemble model using ranking svms. In Proceedings of the International Conference on Web Intelligence and Intelligent Agent Technology (WI-IAT), pages 283-286.
Maarek, Y., Berry, D., and Kaiser, G. (1991). An information retrieval approach for automatically constructing software libraries. In IEEE Transactions on Software Engineering (TSE), pages 800-813.
Mende, T., Koschke, R., and Beckwermert, F. (2009). An evaluation of code similarity identification for the grow-and-prune model. In Journal of Software Maintenance and Evolution: Research and Practice, pages 143-169.
Michail, A. and Notkin, D. (1999). Assessing software libraries by browsing similar classes, functions and relationships. In Proceedings of the International Conference on Software Engineering (ICSE), pages 463-472.
Mohagheghi, P. and Conradi, R. (2007). Quality, productivity and economic benefits of software reuse: a review of industrial studies. In Journal Empirical Software Engineering (ESE), pages 471-516.
Mohagheghi, P., Conradi, R., Killi, O., and Schwarz, H. (2004). An empirical study of software reuse vs. defect-density and stability. In Proceedings of the International Conference on Software Engineering (ICSE), pages 282-291.
Monroe, R. and Garlan, D. (1996). Style-based reuse for software architectures. In Proceedings of the International Conference on Software Reuse (ICSR), pages 84-93.
Morisio, M., Ezran, M., and Tully, C. (2002). Success and failure factors in software reuse. Transactions on Software Engineering (TSE), 28(4):340-357.
Neighbors, J. (1992). The evolution from software components to domain analysis. In International Journal of Software Engineering and Knowledge Engineering (IJSEKE), pages 325-354.
Oliveira, M., Goncalves, E., and Bacili, K. (2007). Automatic identification of reusable software development assets: Methodology and tool. In Proceedings of the International Conference on Information Reuse and Integration (IRI), pages 461-466.
Pressman, R. S. (2005). Software Engineering: a Practitioner's Approach. Palgrave Macmillan.
Ravichandran, T. and Rothenberger, M. (2003). Software reuse strategies and component markets. In Magazine Communications of the ACM, pages 109-114.
Sojer, M. and Henkel, J. (2011). License risks from ad hoc reuse of code from the internet. In Journal Communications of the ACM, pages 74-81.
Tian, Y., Lo, D., and Lawall, J. (2014). Sewordsim: Software-specific word similarity database. In Proceedings of the International Conference on Software Engineering (ICSE), pages 568-571.
Wang, Z., Xu, X., and Zhan, D. (2005). A survey of business component identification methods and related techniques. In International Journal of Information Technology, pages 229-238.
Wohlin, C., Runeson, P., Höst, M., Ohlsson, M. C., Regnell, B., and Wesslén, A. (2012). Experimentation in software engineering. Springer Science & Business Media.
Ye, Y. and Fischer, G. (2005). Reuse-conducive development environments. In Journal Automated Software Engineering (ASE), pages 199-235.
Yujian, L. and Bo, L. (2007). A normalized levenshtein distance metric. In IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), pages 1091-1095.
Zhen, Z., Shen, J., and Lu, S. (2008). Wcons: An ontology mapping approach based on word and context similarity. In Proceedings of the International Conference on Web Intelligence and Intelligent Agent Technology (WI-IAT), pages 334-338.
