
Clone Detection: How accurate is your data set?

2006

Abstract Duplication of code in software systems is considered to be a serious problem that can affect a system's maintainability and extensibility. It is reported that 10-15% of the code in a software system is involved in cloning. However, because of the difficulty of objectively measuring the number of false positives in a clone result set, the accuracy of these reports is hard to evaluate. Although this is an important topic, little work has been done on evaluating the accuracy of clone detection methods.

Introduction

Analysis and detection of clones in software has recently become a popular area of research. Code cloning, generally understood as the practice of duplicating code within a software system, is considered a serious problem by many sources [4,6,8,10,15,16,18]. Among the problems associated with code clones are an unnecessary increase in code size, duplicated bugs and the duplicated maintenance effort needed to fix them, the introduction of unused code, and increased code complexity. If code cloning is not managed, the costs associated with maintaining and extending the system will increase needlessly.

Typically, clone detection tools report that 10-15% of the lines of code in a software system contribute to clones. In extreme cases, the duplication can be as high as 50% of the software system [8]. However, not all of these reported clones are related to the duplication of source code [12,13]; in many cases the result set also contains false positives, segments of code that are reported as clones but in fact are not. These matches are often caused by segments of code with very simple and repetitive structure [13]. Many of these falsely reported clones can be removed from the result set using filters, but additional manual inspection of the clones is required to refine the results further [13].
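To make the filtering idea concrete, the sketch below shows one simple heuristic of the kind this filtering step might use: rejecting candidate clones whose abstracted lines show very little variety, a common signature of repetitive, table-like code. The function names, the abstraction rules, and the diversity threshold are our own illustrative choices, not the filters of any published tool.

```python
import re

def abstract_line(line: str) -> str:
    """Drop C-style comments and abstract literals/identifiers so that
    structurally identical lines compare equal."""
    line = re.sub(r"/\*.*?\*/|//.*", "", line)        # remove comments
    line = re.sub(r"\b\d+\b", "N", line)              # abstract numeric literals
    line = re.sub(r"\b[A-Za-z_]\w*\b", "id", line)    # abstract identifiers
    return re.sub(r"\s+", " ", line).strip()

def is_likely_false_positive(clone_lines, min_diversity=0.5):
    """Flag a candidate clone whose lines are highly repetitive.

    Long runs of structurally identical lines (e.g. initializer tables or
    case labels) often match unrelated code by accident. The 0.5 threshold
    is an illustrative assumption, not an empirical value.
    """
    abstracted = [abstract_line(l) for l in clone_lines if abstract_line(l)]
    if not abstracted:
        return True
    diversity = len(set(abstracted)) / len(abstracted)
    return diversity < min_diversity

# A repetitive initializer table is flagged; ordinary code is not.
table = ["x[0] = 0;", "x[1] = 0;", "x[2] = 0;", "x[3] = 0;"]
code = ["parse(input);", "if (n > 0) {", "sum += values[i];", "}"]
print(is_likely_false_positive(table))  # True: every line abstracts identically
print(is_likely_false_positive(code))   # False: lines are structurally diverse
```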

In addition to false positives, there is another type of clone, not directly related to the duplication of source code, that may be reported by clone detection tools. These clones, called "incidental clones", are segments of code that are similar in structure and function not because of explicit copy-and-paste activity but because of other factors such as programming idioms, API interactions, and the inherent structure of programs written in a given programming language. These clones can be very difficult to filter, both manually and automatically, because their form and function genuinely are related. For example, building a GUI is a highly repetitive task, and interactions with the API may result in many repeated calls to the same set of functions. In such cases, it is difficult to classify the cause of the clone as copy-and-paste or incidental.

It is important to measure the proportion of clones in a result set that are false positives or incidental clones if we wish to properly evaluate the effectiveness and accuracy of clone detection tools, yet little work has been done on this topic. This is largely because such an evaluation has required human subjects to classify the clones, a task that has been shown to be highly subjective [19]. With the recent availability of large source code repositories such as csourcesearch.net [1], we can now take an objective approach to measuring the amount of incidental cloning and the number of false positives, thereby gaining insight into the degree of true cloning within a software system. This paper proposes an experiment that will measure the commonality amongst a very large set of unrelated open source projects taken from the csourcesearch.net repository. Because these systems are generally unrelated, we expect their code to be equally unrelated, giving us a baseline for the number of false positives detected in unrelated code. This baseline indicates how many of the clones reported when inspecting a single software system could equally well arise between unrelated code, allowing us to estimate the amount of true cloning in a software system more accurately. In addition, we will measure the effect of API protocols on clone detection results by measuring the amount of cloning that occurs between software systems that use the same API or library. We expect the commonality amongst unrelated software systems to be low, providing further validation of the significance of cloning found within a software system.

Methodology

The goal of this study is to estimate the number of false positives and incidental clones that exist in the results of a clone detection tool. We will do so by measuring the number of clones that are detected amongst unrelated code, under the assumption that most of the clones detected there will be false positives. This assumption is derived from the results of our previous work comparing the source code of similar open source projects [2], where we found that the projects in our study did not share code even though they were related in functionality.

The experiment will consist of two phases. In the first phase we will detect clones amongst a random sample of projects, giving us an estimate of false positives and incidental clones detected amongst unrelated code. In the second phase, our study subjects will consist of source code that is related to GUI construction. This phase will provide us with an estimate of the amount of incidental cloning that is detected by clone detection tools. For each phase, we will carry out the following steps:

1. Randomly select study subjects.

2. Detect clones between each study subject pair.

3. Detect clones within each study subject.

4. Measure the overlap between the clones found within each software system and the clones found between systems.

Each of these steps will be discussed in more detail below; the sketch that follows outlines how they fit together.
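As a rough outline of steps 2-4, the following Python sketch drives a hypothetical clone detector over all project pairs and then measures overlap. The `detect_clones` stub and its return format (sets of (file, line) positions), as well as the `name`/`code` attributes on project objects, are our own illustrative assumptions, standing in for whichever detection technique and data model are actually used.

```python
from itertools import combinations

def detect_clones(code_a, code_b):
    """Stand-in for a real clone detector (exact or parameterized matching).
    Assumed to return the set of (file, line) positions involved in clones
    between the two bodies of code; this interface is an illustrative choice."""
    raise NotImplementedError

def run_phase(projects):
    """Run steps 2-4 of one phase over a list of study subjects."""
    # Step 2: detect clones between each pair of study subjects.
    between = {(a.name, b.name): detect_clones(a.code, b.code)
               for a, b in combinations(projects, 2)}

    # Step 3: detect clones within each study subject.
    within = {p.name: detect_clones(p.code, p.code) for p in projects}

    # Step 4: for each project, measure how much of its internal cloning
    # also shows up in clones against the (unrelated) other projects.
    overlap = {}
    for p in projects:
        cross = set()
        for pair, positions in between.items():
            if p.name in pair:
                cross |= positions
        internal = within[p.name]
        overlap[p.name] = (len(internal & cross) / len(internal)
                           if internal else 0.0)
    return between, within, overlap
```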

Unrelated Code

For the purpose of this experiment, we will use 200 projects randomly selected from the list of downloaded projects published by the author of csourcesearch.net. csourcesearch.net is an on-line searchable repository of a very large number of C projects. It allows users to query the source code using a variety of mechanisms; in the second phase of the study, we will use its "includes" search functionality to find files that include GUI libraries.

There will be no restriction on project size. However, the source language will be restricted to C, the only language currently in the repository. After selecting the projects, we will download the source and run the clone detection tools on it. In this study we will use two clone detection techniques to gather our results: parameterized string matching as described by Kamiya et al. [10], and exact string matching as described by Ducasse et al. [8]. This will allow us to measure the impact of the detection technique on the results, as well as to compare the amount and type of false positives detected by the two approaches.
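To illustrate the simpler of the two techniques, the sketch below implements line-based exact matching in the spirit of Ducasse et al. [8]: lines are normalized, identical lines are indexed, and runs of consecutive matching lines longer than a threshold are reported as clones. This is a minimal reconstruction for illustration, not the published tool; the normalization rules and the minimum run length of 6 lines are our own assumptions.

```python
import re
from collections import defaultdict

def normalize(line):
    """Remove comments and all whitespace so only the code text is compared."""
    line = re.sub(r"/\*.*?\*/|//.*", "", line)
    return re.sub(r"\s+", "", line)

def exact_match_clones(lines_a, lines_b, min_len=6):
    """Report runs of at least min_len consecutive identical normalized lines.

    Returns (start_a, start_b, length) triples, corresponding to the
    diagonals of the dot-plot representation used by Ducasse et al. [8].
    Blank and comment-only lines break runs in this simplification.
    """
    a = [normalize(l) for l in lines_a]
    b = [normalize(l) for l in lines_b]

    # Index every non-empty normalized line of b by its text.
    index = defaultdict(list)
    for j, text in enumerate(b):
        if text:
            index[text].append(j)

    # runs[(i, j)] = length of the run of matching lines ending at (i, j).
    runs = {}
    for i, text in enumerate(a):
        for j in index.get(text, ()):
            runs[(i, j)] = runs.get((i - 1, j - 1), 0) + 1

    # Keep only maximal runs that meet the length threshold.
    return [(i - n + 1, j - n + 1, n) for (i, j), n in runs.items()
            if n >= min_len and (i + 1, j + 1) not in runs]
```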

In our first step of clone detection, we will detect only the clones that occur across each possible pair of software systems. Because we expect most of the source code to be unrelated, most of these clones should in fact be false positives or incidental clones. From this set of results, we will record the average percentage of commonality between each pair of systems and the average size of the clones. This will provide us with a baseline for the amount of false positives that occur in a result set from each of the clone detection techniques.
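Both measurements can be computed directly from the clone triples produced by the sketch above. The precise definition of commonality used below (the percentage of a pair's lines that participate in any cross-pair clone) is our own assumption, since the text leaves the metric informal.

```python
def commonality_and_size(clones, total_lines_a, total_lines_b):
    """Percent of lines involved in cross-system clones, plus mean clone size.

    `clones` holds (start_a, start_b, length) triples as returned by
    exact_match_clones; lines covered by several clones are counted once.
    """
    covered_a, covered_b = set(), set()
    for start_a, start_b, length in clones:
        covered_a.update(range(start_a, start_a + length))
        covered_b.update(range(start_b, start_b + length))
    total = total_lines_a + total_lines_b
    percent = 100.0 * (len(covered_a) + len(covered_b)) / total if total else 0.0
    avg_size = (sum(length for _, _, length in clones) / len(clones)
                if clones else 0.0)
    return percent, avg_size
```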

Through manual inspection of the results in this step, we will analyze the types of clones that tend to appear in the results of both techniques, in an effort to profile the kinds of code that cause false positives in the clone detection techniques we are using.

In our next steps we will detect the clones that occur within each project and measure the amount of code that appears both in the set of clones across projects and in the set of clones within projects. This will give us a further indication of how much of the cloning detected in a software system is contributed by code that is likely to be part of a false positive.

GUI Code

In the second phase, using csourcesearch.net we will search for any files in the repository that include header files from widget libraries such as GTK, GNOME Widgets, and Xlib. Partitioning the files by project, we will detect the clones occurring across projects that use the same libraries. By detecting the clones that occur between code using specialized libraries such as GUI libraries, we can gain insight into the degree of "incidental cloning" that is reported by clone detection tools. In many cases these clones will represent strategies or protocols required for the use of the libraries, something that cannot be avoided. Studying these clones may even point to further abstractions within the libraries themselves.
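As a sketch of the partitioning step, the script below groups local copies of projects by the widget library whose headers their files include. The header patterns are illustrative examples (gtk/gtk.h, X11/Xlib.h, and so on); in the actual study the selection would be made with csourcesearch.net's "includes" search.

```python
import re
from pathlib import Path
from collections import defaultdict

# Illustrative include patterns for the widget libraries mentioned above.
GUI_HEADERS = {
    "gtk":   re.compile(r'#\s*include\s*[<"]gtk/'),
    "gnome": re.compile(r'#\s*include\s*[<"]gnome'),
    "xlib":  re.compile(r'#\s*include\s*[<"]X11/Xlib\.h'),
}

def partition_by_library(project_dirs):
    """Map each library to the set of projects containing a file that
    includes one of its headers."""
    groups = defaultdict(set)
    for project in project_dirs:
        for c_file in Path(project).rglob("*.[ch]"):
            text = c_file.read_text(errors="ignore")
            for lib, pattern in GUI_HEADERS.items():
                if pattern.search(text):
                    groups[lib].add(project)
    return groups

# Clone detection would then run pairwise within each groups[lib] partition.
```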

As in the first phase, we will detect clones within each of the projects as well. By measuring the overlap, we will gain insight into the contribution of "incidental clones" to a result set of detected clones.

Related Work

A wide variety of clone detection techniques have been developed, ranging from string comparison to metrics comparison and program graph comparison [4,5,6,8,9,10,14,16,17,18]. For now we propose to use only two of these methods, as a pilot study; the proposed study could later be expanded to other clone detection techniques. Several case studies have examined cloning within a software system [3,7,10,11,12,13], but none of these studies has considered measuring cloning across software systems.

Very few studies perform clone detection across software systems. Kamiya et al. [10] investigated cloning across the source code of three operating systems: Linux, FreeBSD, and NetBSD. Their analysis showed about 20% cloning between FreeBSD and NetBSD, whereas less than 1% of the code was cloned between Linux and either BSD system. Because FreeBSD and NetBSD share a common origin, the cloning between them was not surprising; because Linux was developed independently of the BSD systems, very little cloning was detected. In [2] we found similar results: very few clones are detected across software systems that are not related. However, in both of these cases the study size was very small, limiting the generalizability of the results. In addition, neither study considers the effect of using libraries, such as GUI libraries, on clone detection results.

Conclusion

It has previously been very difficult to objectively measure the number of false positives returned by a clone detection tool, yet this measurement is important if we wish to confidently analyze the results of clone analysis and clone detection research. In this paper, we propose a study that will effectively establish a lower bound on this value. In addition, we aim to measure the impact of the protocols required to use APIs on clone detection results, helping us measure the effect of incidental cloning in a software system.

The results of this work will provide not only more insight into the accuracy of clone detection tools, but also a platform from which we can investigate the weaknesses of those tools and improve data filtering techniques. For example, from the clones detected between software systems and within software systems, one may be able to train learning algorithms to classify true clones and false positives, something we would like to research further.
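As a hedged sketch of that idea, the snippet below trains a simple scikit-learn classifier on hand-picked clone features. The feature set (clone length, line diversity, and whether the clone also matches unrelated code) and the toy training examples are purely illustrative assumptions; a real study would need manually validated training data.

```python
from sklearn.linear_model import LogisticRegression

def features(clone_lines, seen_in_unrelated_code):
    """Illustrative feature vector for one detected clone (an assumption,
    not a published feature set)."""
    stripped = [l.strip() for l in clone_lines if l.strip()]
    diversity = len(set(stripped)) / max(len(stripped), 1)
    return [len(stripped), diversity, 1.0 if seen_in_unrelated_code else 0.0]

# X: feature vectors; y: 1 for manually confirmed true clones, 0 for false
# positives. These two examples exist only to show the shapes involved.
X = [features(["a = b;"] * 8, True),    # repetitive, also matches unrelated code
     features(["parse();", "check();", "emit();", "free(p);"], False)]
y = [0, 1]

model = LogisticRegression().fit(X, y)
```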