A Method Based On Naming Similarity To Identify Reuse Opportunities
Authors omitted due to blind review
1. Introduction
Software reuse is a development strategy in which existing software components, called
reusable assets, are used to implement new software systems [Krueger 1992]. It has
been studied and pointed out as an alternative to traditional development, aiming to increase software quality and decrease development effort by using previously developed,
and sometimes already tested, software components [Mohagheghi and Conradi 2007,
Mohagheghi et al. 2004, Morisio et al. 2002, Ravichandran and Rothenberger 2003].
The extraction of reusable assets is essential to support the software reuse activity by building repositories of reuse opportunities [Guo and Luqi 2000]. These methods may be used in different contexts related to software reuse, including the support of feature extraction for a software product line [Lee et al. 2004], for instance.
Many methods have been proposed in the literature to support the extraction of reuse
opportunities from software systems [Caldiera and Basili 1991, Kawaguchi et al. 2004,
Kuhn et al. 2007, Maarek et al. 1991, Ye and Fischer 2005].
There are different approaches used by proposed methods to identify reuse opportunities, such as natural-language processing [Maarek et al. 1991], formal specifications [Caldiera and Basili 1991], machine learning [Kawaguchi et al. 2004], and other Information Retrieval (IR) approaches [Kuhn et al. 2007, Ye and Fischer 2005]. However,
to the best of our knowledge, no existing method extracts reuse opportunities and recommends reuse based on the most frequent source code elements, such
as classes, in systems of the same domain.
This paper is an extension of previous work (citation omitted due to blind review)
that proposes a method for extraction of reuse opportunities called JReuse. Considering
a set of software systems, JReuse aims to identify classes with similar names through a
similarity analysis across different systems. We are then able to identify classes that may be recommended as reuse opportunities. We also present a prototype tool
that applies the proposed method. Finally, we evaluate our method with 38 e-commerce
software systems.
In addition to our earlier contributions, we conduct an evaluation of our method
through an experiment with 72 Java systems from four different domains: accounting,
restaurant, hospital, and e-commerce. All systems were mined from GitHub. As a result,
we observe that our method is able to identify reuse opportunities using naming similarity
analysis. That is, JReuse can provide meaningful classes for the analyzed domains, and
these classes may be indicated as reuse opportunities to developers of new systems from
the respective domain.
The remainder of this paper is organized as follows. Section 2 presents background to support the study comprehension, in addition to related work. Section 3 proposes a method for reuse opportunities extraction and a prototype tool that supports the
proposed method. Section 4 presents an evaluation of the method. Section 5 describes the
results obtained through the evaluation and discusses lessons learned. Section 6 presents
threats to the study validity. Finally, Section 7 concludes the paper with a discussion and
suggestions for future work.
3. Proposed Method
This section explains in detail the proposed method for identification of reuse opportunities. Section 3.1 describes the similarity-based process applied by our method to identify
reuse opportunities. Section 3.2 proposes our method and its steps. Finally, Section 3.3
presents a tool that implements our method.
3.1. Identifying Similarity
Previous work investigates the use of textual similarity in the context of source code analysis [Tian et al. 2014, Zhen et al. 2008]. There are many applications for similarity analysis in software systems, such as comparison of dialects, spell checking, and plagiarism
detection [Liu and Lu 2008]. In this context, we propose JReuse, a method that relies on
similarity analysis of source code as a static technique to identify reuse opportunities.
We conducted an ad hoc literature review to select an algorithm that
computes similarity between strings for use in our method. For this purpose, we
searched for the most popular similarity computation algorithms to find the one that
best fits the purpose of our study. After the literature review, we selected the Levenshtein algorithm [Yujian and Bo 2007]. Our method uses this algorithm as a similarity function
to compute lexical similarity between names of classes from different systems. In
short, given two strings A and B, the algorithm computes the number of changes required
to turn A into B.
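As a minimal sketch of this computation, the Python code below implements the classic dynamic-programming Levenshtein distance and converts it into a similarity rate. The normalization by the length of the longer name, as well as the function names levenshtein_distance and name_similarity, are assumptions made for illustration; the JReuse tool may compute the rate differently.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions required to turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def name_similarity(a: str, b: str) -> float:
    """Similarity rate in [0, 1]. Assumption: the distance is
    normalized by the length of the longer name."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein_distance(a, b) / longest
```

Under this assumed normalization, name_similarity("Clients", "Client") is 1 - 1/7, roughly 0.857, which corresponds to 85% when truncated to a whole percentage.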
To identify similarly named classes, we adopted 75% as the threshold for the minimum
similarity between two names of entities. This threshold, derived empirically by the authors, was chosen because some well-known naming conventions for classes may lead to
similarly named entities that clearly serve different purposes. As an example, a similarity of 72% is obtained for the names Costumer and CostumerDAO, observed by the
authors as frequent names of classes in e-commerce systems. However, we intuitively expect that two classes with these names implement different functions, since DAO classes
implement database persistence.
Table 1 presents some examples of class names and the respective similarity rate
using the chosen algorithm. In this table, we present eight matches between names of
classes from two software systems: System A and System B. Each match has a similarity rate of at least 75%
between names of classes, in accordance with our empirical threshold.
Note that our threshold covers, for instance, names of classes that vary from singular to
plural (e.g., Client and Clients).
Table 1. Examples of similarity computation
System A             System B            Similarity Rate
ShoppingCart         ShoppCart           75%
OrderProductId       OrderProduc         78%
Orderservice         Orderservi          83%
Reviwes              Reviwe              85%
Clients              Client              85%
CartController       CartControll        85%
Products             Product             87%
ProductsController   ProductController   94%
such as Client and Customer, that may be similar semantically, are considered as
different entities.
Figure 2 presents the five steps performed by JReuse to identify reuse opportunities
in a set of systems. All steps are described as follows.
1. First, the JReuse method receives, as input, software systems from a data set provided by the user. These systems are supposed to belong to the same domain.
Then, the method filters out non-Java source files, discards every project that
targets the Android platform, and extracts the names of classes from the Java
source files.
2. Next, the method selects each class and computes the similarity between its name and the names of
classes from the other systems. We highlight that JReuse does not compare classes of
the same software system.
3. Then, JReuse compares the names of classes in pairs to identify names with at
least 75% similarity. Classes with similar names, called matches, are gathered,
and each class name receives a score that is the number of systems in which the
class occurs. The higher the score, the more relevant the class may be to
the analyzed domain.
4. After comparing names of classes and computing similarity, JReuse sorts the results,
in decreasing order, by the frequency of the identified reuse
opportunities, if necessary.
5. Finally, JReuse composes a repository of candidate reuse opportunities from
the identified classes. This repository may be used to support the reuse of these classes.
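The steps above can be sketched in Python under simplifying assumptions: class names are already extracted and grouped per system, the similarity rate is the Levenshtein distance normalized by the length of the longer name, and every distinct name keeps its own entry (merging variants such as Client and Clients into a single match is left out). The function rank_reuse_candidates and the sample data are hypothetical and do not reproduce the JReuse implementation.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1, current[j - 1] + 1,
                               previous[j - 1] + cost))
        previous = current
    return previous[-1]


def name_similarity(a: str, b: str) -> float:
    """Assumed normalization: 1 - distance / length of the longer name."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein_distance(a, b) / longest


def rank_reuse_candidates(systems, threshold=0.75):
    """systems maps a system name to the set of its class names.
    Returns (class name, score) pairs sorted by decreasing score, where
    the score is the number of systems with a similarly named class."""
    scores = {}
    for system, names in systems.items():
        for name in names:
            score = 1  # the system that declares the class
            for other, other_names in systems.items():
                if other == system:
                    continue  # classes of the same system are not compared
                if any(name_similarity(name, candidate) >= threshold
                       for candidate in other_names):
                    score += 1
            scores[name] = max(scores.get(name, 0), score)
    # Keep only names matched in at least two systems, most frequent first.
    ranked = [(name, score) for name, score in scores.items() if score >= 2]
    ranked.sort(key=lambda item: (-item[1], item[0]))
    return ranked


if __name__ == "__main__":
    sample = {
        "SystemA": {"Client", "ShoppingCart", "Util"},
        "SystemB": {"Clients", "ShoppCart"},
        "SystemC": {"Client", "Invoice"},
    }
    for name, score in rank_reuse_candidates(sample):
        print(name, score)
```

The returned list plays the role of the repository in step 5; a real tool would also need the extraction and filtering of step 1, which this sketch takes as given.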
4. Method Evaluation
This section describes an empirical evaluation of the method proposed in Section 3. For
this purpose, we designed an exploratory study conducted in a controlled environment,
based on the guidelines of Wohlin et al. (2012). Since JReuse aims to identify the main
reuse opportunities from software systems, our evaluation consists of analyzing the reuse
opportunities identified by the proposed method. Section 4.1 presents the study goal and the research questions that we designed to guide our study. Section 4.2 describes the data set
used to evaluate our method through the prototype tool. Finally, Section 4.3 presents the
evaluation steps.
4.1. Goal and Research Questions
In this study, our goal is to assess whether JReuse is able to identify frequent classes in a
specific software domain. We are also interested in assessing the relevance of the results
provided by our method. For this purpose, we chose four domains to be evaluated: accounting, restaurant, hospital, and e-commerce. We also designed the following research
questions (RQs) to guide our study.
¹ http://spectrum.ieee.org/static/interactive-the-top-programming-languages-2015
RQ1 What are the most frequent classes in software systems for each selected domain?
Through RQ1, we investigate whether the most frequent classes identified are suitable for software systems of the respective domain. We expect that
JReuse is able to provide a list of classes and methods whose recommendations for reuse
are relevant for the respective domains.
RQ2 How are the most frequent classes distributed through systems per domain?
With RQ2, we aim to understand to what extent the same class, identified as one
of the most frequent classes, occurs in different software systems from a given domain.
For instance, we want to know whether the same class occurs in all systems, or in most of them.
4.2. Data Set
To evaluate our method, we chose only systems from the accounting, restaurant, hospital, and e-commerce domains, for several reasons. First, software systems from these
domains encompass several business features, such as user personnel, financial, product, and service management. Second, there is a significant number of systems from these domains
available for download on GitHub². Third, from the viewpoint of the authors, the four
domains we chose are well-defined in terms of requirements, and we believe that it would
be possible to find reuse opportunities among systems of these domains. The systems that
compose our data set were extracted from GitHub repositories. We performed the selection of systems for the e-commerce domain in January 2015 and for the other
domains in May 2016. We selected software systems based on the ranking of starred systems and
system size in terms of storage space. On GitHub, stars are a meaningful measure of
repository popularity among the platform users, and may be used to support the selection
of systems.
There is a diverse terminology to represent the same software domain. For instance,
we may refer to the e-commerce domain as ecommerce, without hyphenation. In order
to support the collection of software systems to compose our data set, we developed an
algorithm to clone GitHub repositories individually, with the respective systems, based
on a well-defined search string for each domain under analysis in this study. Since the
goal of our study is to identify reuse opportunities from different software systems, given
large system sets per domain, we defined the following search strings.
For accountancy: accountancy OR accounting
For restaurant: restaurant OR eatery OR restaurants
For hospital: hospital OR infirmary OR lazaretto
For e-commerce: e-commerce OR ecommerce OR electronic commerce
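To illustrate how such search strings could drive an automated collection, the sketch below builds one query URL per term. It assumes the GitHub REST repository search endpoint (/search/repositories with the q, sort, order, and per_page parameters and the language: qualifier); the study does not state which interface its algorithm used. Issuing one query per term, instead of a single OR expression, is a design choice of this sketch, and the function build_search_urls is hypothetical.

```python
from urllib.parse import urlencode

# Search terms per domain, taken from the search strings listed above.
SEARCH_TERMS = {
    "accounting": ["accountancy", "accounting"],
    "restaurant": ["restaurant", "eatery", "restaurants"],
    "hospital": ["hospital", "infirmary", "lazaretto"],
    "e-commerce": ["e-commerce", "ecommerce", "electronic commerce"],
}

# Assumption: the GitHub REST search endpoint is used for the collection.
SEARCH_ENDPOINT = "https://api.github.com/search/repositories"


def build_search_urls(domain: str, per_page: int = 100) -> list:
    """Return one search URL per term of the given domain,
    restricted to Java and sorted by stars in descending order."""
    urls = []
    for term in SEARCH_TERMS[domain]:
        # Quote multi-word terms so they are searched as a phrase.
        phrase = f'"{term}"' if " " in term else term
        params = {
            "q": f"{phrase} language:java",
            "sort": "stars",
            "order": "desc",
            "per_page": per_page,
        }
        urls.append(SEARCH_ENDPOINT + "?" + urlencode(params))
    return urls


if __name__ == "__main__":
    for url in build_search_urls("hospital"):
        print(url)
```

The results of the per-term queries would still need to be merged and deduplicated by repository before cloning.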
Table 2 presents the exclusion criteria applied to the collected systems. First,
we collected 400 Java systems from GitHub, 100 for each domain, sorted in descending
order of stars. Then, we discarded systems according to the following exclusion criteria: (i) non-Java software systems, since GitHub does not automatically verify the main
programming language of the systems, (ii) Java projects developed for the Android platform, because Android systems tend to have a different architectural design and code
implementation when compared with traditional Java systems, (iii) systems with fewer than
1,000 lines of code (LOC), and (iv) systems written in natural languages other than English, since our method relies on a lexical similarity technique and, therefore, the natural language
may significantly affect the results provided by our method.
² https://github.com
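The four exclusion criteria can be expressed as a simple filter. The sketch below is illustrative only: the CandidateSystem record and its fields are hypothetical, and it assumes that the main programming language, the Android check, the LOC count, and the natural language of each system were determined beforehand, since the text does not describe how these properties were measured.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CandidateSystem:
    # Hypothetical record; all fields are assumed to be precomputed.
    name: str
    main_language: str      # e.g. "Java"
    targets_android: bool
    lines_of_code: int
    natural_language: str   # e.g. "English"


def exclusion_reason(system: CandidateSystem, min_loc: int = 1000) -> Optional[str]:
    """Return the first exclusion criterion that applies, or None
    when the system is kept for analysis."""
    if system.main_language.lower() != "java":
        return "non-Java"
    if system.targets_android:
        return "Android"
    if system.lines_of_code < min_loc:
        return "less than 1,000 LOC"
    if system.natural_language.lower() != "english":
        return "not English"
    return None


def select_systems(candidates):
    """Keep only the systems that no exclusion criterion discards."""
    return [s for s in candidates if exclusion_reason(s) is None]
```

Checking the criteria in a fixed order assigns each discarded system to exactly one reason, which is what a per-criterion count of excluded systems requires.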
Table 2. Filters applied to the data set
              Excluded Systems by                           Selected
Domains       Not English   Less than 1,000 LOC   Android   Systems
Accounting    9             49                    31        11
Restaurant    3             56                    28        13
Hospital      16            37                    34        13
E-commerce    21            40                    4         35
For each selected system, we considered only the latest release. This step was
necessary to discard different versions of the same system, which probably contain many
similarly named classes and methods. Finally, we obtained 72 Java systems for the
evaluation of the JReuse method.
To better characterize systems in the four domains, Figures 3 and 4 present software metrics for systems per domain: lines of code (LOC) and number of classes (NOC),
respectively. We plotted twelve boxplots, one for each metric. However, because of the
heterogeneity of our data set, we decided to eliminate outliers for each
metric. Therefore, the boxplots present a brief overview of each analyzed domain.
Let us consider Figure 3 in the following analysis of LOC. With respect to the
accounting domain, we observe that the mean LOC of the systems is 8,690. Moreover,
the median is 5,112, i.e., half of the accounting systems have at least 5 KLOC. This is a
significant number for analysis and identification of reuse opportunities. Regarding the
restaurant domain, the mean LOC is 3,447 and the median is 3,256. Again,
we conclude that these systems have a significant LOC for analysis. For the hospital
domain, the mean LOC is 4,964 and the median is 2,534. Although these values are
smaller than the values obtained for the other domains, they remain significant for the study.
Finally, with respect to the e-commerce domain, we observe a mean LOC of 46,100 and a
median of 3,730. In general, systems from this domain have the highest LOC
and, therefore, they may have several reuse opportunities.
With respect to the following analysis of NOC, consider Figure 4. Regarding the
accounting domain, note that the mean NOC of the systems is 35.73. Furthermore, the
median is 18, i.e., half of the accounting systems have at least 18 classes. This number is
significant for analysis because we are interested in finding similarly named classes through
pairwise comparison. Therefore, for two systems of median size, we expect a comparison of 18 × 18 = 324 pairs that
may be reuse opportunities. Regarding the restaurant domain, the mean NOC is 37.23
and the median is 40. Again, we conclude that these systems have a significant
NOC for analysis. For the hospital domain, the mean NOC is 33.85 and the median is 25.
Finally, with respect to the e-commerce domain, we observe a mean NOC of 368.9
and a median of 45.5. In general, systems from this domain have the highest
NOC and, therefore, there is a significant possibility of identifying reuse opportunities.
Step 1: Automated Search By using the search strings described in Section 4.2,
we cloned from GitHub several software systems, belonging to different domains. We
intended to identify appropriate domains to be analyzed in our exploratory study. For
this purpose, we considered a domain as appropriate when, from our viewpoint, systems
from the given domain contain a significant number of classes and methods for analysis.
After performing the search for systems, with support of our algorithm, we obtained
400 software systems from four distinct domains: accounting, restaurant, hospital, and
e-commerce.
Step 2: Exclusion Criteria By applying a set of exclusion criteria defined by the
authors, we selected the systems that meet the following requirements: (i) software
systems written only in English, (ii) software systems with more than 1,000 lines of
source code, and (iii) traditional Java software systems, i.e., systems that do not exclusively target
the Android platform. After applying the exclusion criteria, 72 different software
systems remained for analysis.
Step 3: Detection of Similarly Named Classes We executed the JReuse prototype
tool for the 72 collected systems. Per domain, the respective systems were submitted
to JReuse for extraction of reuse opportunities. After the automated analysis for each
domain, JReuse provided a list with the most frequent classes that occur in the given
domain.
                                     NOC
Domains       Systems   LOC          Analyzed   Recommended
Accounting    11        95,588       493        25
Restaurant    13        44,813       484        17
Hospital      13        65,297       446        21
E-commerce    35        1,567,337    12,598     75
each domain, as presented in Figures 4.3, 4.4, 4.5, and 4.6. We submitted the list of most
frequent entities to a group of 4 researchers at a Software Engineering laboratory (name
omitted due to blind review), for validation of the entities with respect to relevance.
Tables 4, 5, 6, and 7 present classes identified as reuse opportunities for accounting, restaurant, hospital, and e-commerce, respectively. We selected only the classes
that occur in at least 15%³ of the systems of the respective domain. Each table has a
Domain-Specific field. This field indicates the viewpoint of the focal group on whether a
given entity is specific to the analyzed domain. The focal group's viewpoint is represented by three symbols in each table: (i) the check mark (✓) indicates that the focal group agreed
that the class is specific to the domain under analysis, (ii) the cross (✗) indicates that
the focal group disagreed that the class is specific to the domain, and (iii) a blank field
(Unconfirmed) indicates that the focal group did not converge to a specific opinion on the
class. Moreover, each table has a Labels field that reports the level of relevance of the
entity identified by JReuse as a reuse opportunity.
Scale to Indicate the Level of Relevance of the Entities Identified. To support the
identification of the most recommended classes for each domain, Figure 6 shows a scale
from 0% to 100% that represents the level of relevance for recommending an entity, based
on the frequency of the classes identified as reuse opportunities. The scale is split at 50% into
two labels for the level of relevance, namely weak and strong. The weak label
(from 0% to < 50%) indicates that the class is weakly or moderately recommended for
reuse in a given domain. The strong label (from 50% to 100%) indicates that the
class is highly recommended for reuse.
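The labeling rule can be written as a small function. In this sketch, the name relevance_label and the choice of returning None for classes below the 15% selection cut are assumptions made for illustration; the 50% boundary and the 15% minimum follow the text.

```python
from typing import Optional


def relevance_label(occurrences: int, total_systems: int,
                    min_share: float = 0.15) -> Optional[str]:
    """Map the share of systems containing a class to a relevance label.

    Returns None when the class falls below the selection cut
    (15% by default), "weak" below 50%, and "strong" from 50% up.
    """
    if total_systems <= 0:
        raise ValueError("total_systems must be positive")
    share = occurrences / total_systems
    if share < min_share:
        return None
    return "strong" if share >= 0.5 else "weak"
```

For example, a class found in 28 of 35 systems (80%) is labeled strong, whereas a class found in 17 of 35 systems (about 49%) is labeled weak.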
Table 4 presents results with respect to the accounting domain. For this domain,
the classes from Users to TransactionManager belong to the strong label and,
therefore, they are the highly recommended classes for accounting systems. On the other
hand, the focal group did not consider the classes Users, DatabaseConnection,
and Util as specific classes for the accounting domain. In addition, the classes from
AddFinancialsAction to RawMaterial belong to the weak label. The remaining
classes have exactly 2 or 3 occurrences in different systems from the accounting domain.
Therefore, they are weakly recommended and were omitted from this table.
³ The percentage is arbitrary, i.e., it can be adapted for domains with more or fewer systems for analysis.
Table 4. Classes with at least 15% of occurrences in the accounting domain
Labels   Classes                      Frequency   % of systems   Domain-Specific
Strong   Users                        13          100%           ✗
         DatabaseConnection           13          100%           ✗
         CashFlow                     11          85%            ✓
         Util                         10          77%            ✗
         BalancesAssets               9           69%            ✓
         CashBanks                    9           69%            ✓
         ShareholderEquity            9           69%            ✓
         BalancesLiabilities          8           62%            ✓
         ChartAccounts                8           62%            ✓
         AccountingMovement           8           62%            ✓
         AccountsReceivable           8           62%            ✓
         AccountsPayable              6           46%            ✓
         Transactions                 7           54%            ✓
         Log                          7           54%            ✗
         FinancialReportsPoeHelper    7           54%            ✗
         InventoryManager             7           54%            ✓
         TransactionManager           7           54%            ✓
Weak     AddFinancialsAction          6           46%            ✓
         Accounts                     6           46%            ✓
         FeaturesAnalysis             6           46%            ✓
         RawMaterial                  6           46%            ✓
(✓: the focal group agreed that the class is domain-specific; ✗: the focal group disagreed)
Table 5 presents results for the restaurant domain. The classes Login and User
belong to the strong label. They are not considered specific to the given domain
from the focal group's viewpoint. However, they are relevant in restaurant systems. The
classes from Client to Order belong to the strong label and are relevant for the restaurant domain from the focal group's viewpoint. Note that many of the classes identified
by JReuse were pointed out as relevant reuse opportunities for restaurant systems, even in the
weak label, such as RestaurantMenu, Delivery, and Customer.
Table 5. Classes with at least 15% of occurrences in the restaurant domain
Labels   Classes             Frequency   % of systems   Domain-Specific
Strong   Login               10          77%            ✗
         User                10          77%            ✗
         ConnectionManager   9           70%            ✗
         Client              9           70%            ✓
         Table               8           62%            ✓
         PaymentType         8           62%            ✓
         Dish                8           62%            ✓
         Employee            7           54%            ✓
         Order               7           54%            ✓
Weak     RestaurantMenu      6           47%            ✓
         Delivery            6           47%            ✓
         ItemOrdered         6           47%            ✓
         Customer            4           31%            ✓
(✓: the focal group agreed that the class is domain-specific; ✗: the focal group disagreed)
Consider Table 6 for the analysis of the hospital domain. Observe that the classes
from Patient to Microbiology belong to the strong label and, therefore, they are
highly recommended classes as reuse opportunities. Note that, from the viewpoint of the
focal group, the three most frequent classes are considered specific to hospital systems.
In fact, classes such as Patient and Doctor are meaningful in the given domain. In
addition, the classes from PatientCondition to OperationsWithCards belong to
the weak label. Finally, the remaining classes have less than 10% of the occurrences
Finally, consider Table 7 for the analysis and discussion regarding the e-commerce
domain. Note that the classes from Product to ClientDao belong to the strong label,
according to Figure 6. That is, they are highly recommended classes for e-commerce
systems, because they are present in more than 50% of the analyzed systems. In addition,
the classes from Item to ShoppingCartService are the weakly recommended classes.
As mentioned above, classes with less than 15% of the occurrences were omitted.
In general, we observed that the classes identified by JReuse are relevant to their
respective system domains, from the viewpoint of the focal group. Although some classes
in the weak label are considered relevant, most of the group's agreement was related to
classes in the strong label. Therefore, our data suggest that our method is able to identify
interesting candidates for reuse. That said, we are able to assess how such classes are
distributed among different systems from the same domain.
Table 7. Classes with at least 15% of occurrences in the e-commerce domain
Labels   Classes                  Frequency   % of systems   Domain-Specific
Strong   Product                  28          80%            ✓
         PaymentType              24          69%            ✓
         Client                   20          58%            ✓
         ProductDao               18          52%            ✓
         ClientDao                18          52%            ✓
Weak     Item                     17          49%            ✓
         ShoppingCart             17          49%            ✓
         User                     17          49%            ✗
         Customer                 14          40%            ✓
         Category                 12          35%            ✓
         ProductService           10          29%            ✓
         Order                    9           26%            ✓
         LoginController          7           20%            ✗
         UserDao                  6           18%            ✓
         ProductServiceImpl       6           18%            ✓
         ShoppingCartController   6           18%            ✓
         OrderedProduct           5           15%            ✓
         ShoppingCartService      5           15%            ✓
(✓: the focal group agreed that the class is domain-specific; ✗: the focal group disagreed)
as illustrated in Figure 6. Note that the classes Patient and Doctor are present in
100% of the evaluated systems. Similarly to the other domains, JReuse identified some
classes that are generic, such as the User class (77%), which are expected in systems from
other domains.
Finally, Figure 10 presents the top-ten most frequent classes for e-commerce systems. We sorted the classes in decreasing order of frequency. The most frequent entities
are, respectively, Product, PaymentType, Client, ProductDao, ClientDao,
Item, ShoppingCart, User, Customer, and Category. Note that, according to
the focal group, the classes Product, Payment, ShoppingCart, Customer, and
Client are elementary entities to be expected in an e-commerce system. In turn, although
User is one of the most frequent classes identified by JReuse (49% of the systems
contain this class), User is not specific to the e-commerce domain. However, this entity
is meaningful for information systems in general.
To what extent can lexical analysis support the identification of reuse opportunities?
As discussed in Section 2, there are many approaches to support software reuse
in the literature. Lexical analysis is a simple one. However, as indicated by the results of
Section 5, it may be effective for identifying reuse opportunities in systems from a single
domain. Moreover, we initially conceived our method to gather elements with names that
are semantically similar. However, through our study we identified some occurrences of
entities that are intuitively similar but do not represent the same real-world concept.
In our exploratory study, which was conducted in a controlled environment (see
Section 4.3), we found, for instance, that frequent classes such as Client and Costumer
have distinct behaviors, although intuitively they represent the same real-world abstraction. Some classes named Client implement simplistic system clients that basically register data. In turn, Costumer classes generally implement system clients with
more robust features, such as data management. Therefore, we conclude that lexical analysis performs satisfactorily in identifying reuse opportunities, at least in this domain.
Are names of classes suitable for the entities they represent in a business domain?
We discuss in Section 3 that names of classes may be useful for identifying reuse opportunities. In fact, we observed that identifying naming similarity may support the identification of reuse
opportunities. However, retrieving similarly named classes may be uninteresting if they are not representative of a specific domain. Section 4 highlights identified
classes that fit the e-commerce domain. These entities are the most frequent ones that our tool
detected.
Therefore, we believe that names of entities are, in general, sufficiently representative. Moreover, we observed in this study that our method is able to identify reuse
opportunities in randomly mined systems from GitHub, developed by different development teams. Therefore, we expect to obtain even more relevant results in the context of
a specific organization.
How can our reuse opportunity identification tool be applied in a reuse recommendation system? Methods and classes are elementary entities of object-oriented software
systems. Knowing these entities, we are able to describe the architecture of a system.
Therefore, with results provided by our tool, we see an opportunity for reuse recommendation through software modeling using class diagrams, for instance.
To the best of our knowledge, there are few recent studies on
reuse opportunity identification, on tools that support this activity, or on methods that
support the building of reuse repositories with a similar approach. Therefore, as an interesting research topic, more quantitative data are needed to measure and compare different
techniques that support software reuse.
6. Threats to Validity
We based our study on related work to support the method definition, the tool development, and the proposal of a recommendation system. Regarding the evaluation of our
method and tool, we conducted a careful empirical study to assess effectiveness of the
method in identifying reuse opportunities. However, some threats to validity may affect
our research findings. The main threats and respective treatments are discussed below
based on the categories presented by Wohlin et al. (2012).
from the four analyzed domains. Furthermore, we observe that the most frequent classes
suggested as candidates for reuse are present in a significant number of different systems
of the respective domain.
As future work, we intend to enhance JReuse to suggest source code to developers based on the most frequent classes identified by the proposed method. In addition,
we aim to combine lexical with semantic analysis to improve the identification of reuse
opportunities. For instance, the semantic analysis may support analysis of synonyms
and improve our results of identification. In addition, we may explore alternative techniques for similarity computation. We also intend to implement our method targeting
other object-oriented programming languages.
References
Caldiera, G. and Basili, V. R. (1991). Identifying and qualifying reusable software components. In Journal Computer - Special Issue on Cryptography, pages 61-70.
Cybulski, J. and Reed, K. (2000). Requirements classification and reuse: Crossing domain boundaries. In Proceedings of the International Conference on Software Reuse (ICSR), pages 190-210.
Guo, J. and Luqi (2000). A survey of software reuse repositories. In Proceedings of the International Conference and Workshops on the Engineering of Computer Based Systems (ECBS), pages 92-100.
Inoue, K., Yokomori, R., Yamamoto, T., Matsushita, M., and Kusumoto, S. (2005). Ranking significance of software components based on use relations. In IEEE Transactions on Software Engineering (TSE), pages 213-225.
Kawaguchi, S., Garg, P., Matsushita, M., and Inoue, K. (2004). Mudablue: an automatic categorization system for open source repositories. In Proceedings of the Asia-Pacific Software Engineering Conference (APSEC), pages 184-193.
Koziolek, H., Goldschmidt, T., de Gooijer, T., Domis, D., and Sehestedt, S. (2013). Experiences from identifying software reuse opportunities by domain analysis. In Proceedings of the International Software Product Line Conference (SPLC), pages 208-217.
Krueger, C. (1992). Software reuse. In Journal of Computing Surveys (CSUR), pages 131-183.
Kuhn, A., Ducasse, S., and Gîrba, T. (2007). Semantic clustering: Identifying topics in source code. In Journal of Information and Software Technology, pages 230-243.
Lee, J., Kang, K. C., and Kim, S. (2004). A feature-based approach to product line production planning. In Proceedings of the International Conference on Software Product Lines (SPLC), pages 183-196.
Li, J., Zhang, Z., and Yang, H. (2005). A grid oriented approach to reusing legacy code in iceni framework. In Proceedings of the International Conference on Information Reuse and Integration (IRI), pages 464-469.
Liu, H. and Lu, R. (2008). Word similarity based on an ensemble model using ranking svms. In Proceedings of the International Conference on Web Intelligence and Intelligent Agent Technology (WI-IAT), pages 283-286.
Maarek, Y., Berry, D., and Kaiser, G. (1991). An information retrieval approach for automatically constructing software libraries. In IEEE Transactions on Software Engineering (TSE), pages 800-813.
Mende, T., Koschke, R., and Beckwermert, F. (2009). An evaluation of code similarity identification for the grow-and-prune model. In Journal of Software Maintenance and Evolution: Research and Practice, pages 143-169.
Michail, A. and Notkin, D. (1999). Assessing software libraries by browsing similar classes, functions and relationships. In Proceedings of the International Conference on Software Engineering (ICSE), pages 463-472.
Mohagheghi, P. and Conradi, R. (2007). Quality, productivity and economic benefits of software reuse: a review of industrial studies. In Journal Empirical Software Engineering (ESE), pages 471-516.
Mohagheghi, P., Conradi, R., Killi, O., and Schwarz, H. (2004). An empirical study of software reuse vs. defect-density and stability. In Proceedings of the International Conference on Software Engineering (ICSE), pages 282-291.
Monroe, R. and Garlan, D. (1996). Style-based reuse for software architectures. In Proceedings of the International Conference on Software Reuse (ICSR), pages 84-93.
Morisio, M., Ezran, M., and Tully, C. (2002). Success and failure factors in software reuse. Transactions on Software Engineering (TSE), 28(4):340-357.
Neighbors, J. (1992). The evolution from software components to domain analysis. In International Journal of Software Engineering and Knowledge Engineering (IJSEKE), pages 325-354.
Oliveira, M., Goncalves, E., and Bacili, K. (2007). Automatic identification of reusable software development assets: Methodology and tool. In Proceedings of the International Conference on Information Reuse and Integration (IRI), pages 461-466.
Pressman, R. S. (2005). Software Engineering: a Practitioner's Approach. Palgrave Macmillan.
Ravichandran, T. and Rothenberger, M. (2003). Software reuse strategies and component markets. In Magazine Communications of the ACM, pages 109-114.
Sojer, M. and Henkel, J. (2011). License risks from ad hoc reuse of code from the internet. In Journal Communications of the ACM, pages 74-81.
Tian, Y., Lo, D., and Lawall, J. (2014). Sewordsim: Software-specific word similarity database. In Proceedings of the International Conference on Software Engineering (ICSE), pages 568-571.
Wang, Z., Xu, X., and Zhan, D. (2005). A survey of business component identification methods and related techniques. In International Journal of Information Technology, pages 229-238.
Wohlin, C., Runeson, P., Höst, M., Ohlsson, M. C., Regnell, B., and Wesslén, A. (2012). Experimentation in software engineering. Springer Science & Business Media.
Ye, Y. and Fischer, G. (2005). Reuse-conducive development environments. In Journal Automated Software Engineering (ASE), pages 199-235.
Yujian, L. and Bo, L. (2007). A normalized levenshtein distance metric. In IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), pages 1091-1095.
Zhen, Z., Shen, J., and Lu, S. (2008). Wcons: An ontology mapping approach based on word and context similarity. In Proceedings of the International Conference on Web Intelligence and Intelligent Agent Technology (WI-IAT), pages 334-338.