
The Plastic Surgery Hypothesis

2014, Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering

Recent work on genetic-programming-based approaches to automatic program patching has relied on the insight that the content of new code can often be assembled out of fragments of code that already exist in the code base. This insight has been dubbed the plastic surgery hypothesis; successful, well-known automatic repair tools such as GenProg rest on this hypothesis, but it has never been validated. We formalize and validate the plastic surgery hypothesis and empirically measure the extent to which raw material for changes actually already exists in projects. In this paper, we mount a large-scale study of several large Java projects, and examine a history of 15,723 commits to determine the extent to which these commits are graftable, i.e., can be reconstituted from existing code, and find an encouraging degree of graftability, surprisingly independent of commit size and type of commit. For example, we find that changes are 43% graftable from the exact version of the software being changed. With a view to investigating the difficulty of finding these grafts, we study the abundance of such grafts in three possible sources: the immediately previous version, prior history, and other projects. We also examine the contiguity, or chunking, of these grafts, and the degree to which grafts can be found in the same file. Our results are quite promising and suggest an optimistic future for automatic program patching methods that search for raw material in already extant code in the project being patched.

Earl T. Barr (University College London), Yuriy Brun (University of Massachusetts, Amherst), Premkumar Devanbu (University of California, Davis), Mark Harman (University College London), Federica Sarro (University College London)
{e.barr,mark.harman,f.sarro}@ucl.ac.uk, [email protected], [email protected]
Author order is alphabetical.

Categories and Subject Descriptors: D.2.7 [Software Engineering]: Distribution, Maintenance, and Enhancement; D.2.13 [Software Engineering]: Reusable Software
General Terms: Experimentation, Languages, Measurement
Keywords: Software graftability, code reuse, empirical software engineering, mining software repositories, automated program repair

FSE '14, November 16–22, 2014, Hong Kong, China. ACM 978-1-4503-3056-5/14/11. http://dx.doi.org/10.1145/2635868.2635898

1. INTRODUCTION

Software has successfully relieved humans of many tedious tasks, yet many software engineering tasks remain manual and require significant developer effort. Developers have long sought to automate development tasks. In 2009, the advent of GenProg [41] and Clearview [31] demonstrated automated bug repair. Automatically fixing bugs requires searching a vast space of possible programs, and a key insight that limits that search space is the assumption that fixes often already exist elsewhere in the codebase [2, 40]. This insight arises from the idea that code is locally repetitive, and that the same bug appears in multiple locations but, when fixed, is not likely to be fixed everywhere. In fact, program source code changes that occur during development can often be constructed from grafts, snippets of code located elsewhere in the same program [41]. The act of grafting existing code to construct changes is known as plastic surgery [13]. Reformulated as a hypothesis, the insight is:

The Plastic Surgery Hypothesis: Changes to a codebase contain snippets that already exist in the codebase at the time of the change, and these snippets can be efficiently found and exploited.

The early success in automating program repair has triggered a dramatic recent upsurge in research on automated repair [2, 9, 19, 24, 29], refactoring [10, 12, 36], and genetic improvement [14, 22, 23, 30, 42]. These approaches have implicitly assumed the correctness of the plastic surgery hypothesis, since they rely, in part, on plastic surgery. Despite the fact that a growing body of work depends on it, the plastic surgery hypothesis has not been validated experimentally.

The goal of this paper is to validate this hypothesis empirically, at large scale, on real-world software. Le Goues et al. [24] and Nguyen et al. [29] considered the repetitiveness of changes abstracted to ASTs, and Martínez et al. [25] considered changes that could be entirely constructed from existing snippets. Both restricted their search to changes, neglecting primordial, untouched code that was inherited (unchanged) from the first version to the last. Both report the proportion of repetitiveness in their datasets, but do not consider the cost of finding it. In this work, we consider both the changes and the primordial code, and we also explore aspects of the cost of searching these spaces. In short, our result provides a solid footing for new and ongoing work on automating software development that depends on the plastic surgery hypothesis.

The plastic surgery hypothesis has two parts: 1) the claim that changes are repetitive relative to their parent, the program to which they are applied, and 2) the claim that this repetitiveness is usefully exploitable. To address each claim, we focus our inquiry on two questions: "How much of each change to a codebase can be constructed from existing code snippets?" and "What is the cost of finding these snippets?" To answer the first question, we measure the graftability of each change. The graftability of a change is the proportion of snippets in it that match a snippet in the search space (we clarify the intuitive term "snippets" below). We study over 15,000 human-implemented changes to a program. If the graftability of these changes is high, then this explains part of the success of automated repair, refactoring, and genetic improvement, and it is encouraging news for further research in these areas.

We consider only line-granular snippets and search for exact matches, ignoring whitespace. We make this choice because 1) developers tend to think in terms of lines, and 2) practically, this choice reduces the size of the search space with which any tool seeking to help construct changes must contend. Our choice is informed by our practical experience with GenProg, which allows within-line, expression-granular changes. When we experimented with this setting on a large dataset (hundreds of fairly small buggy programs), the genetic programming search algorithm almost always bogged down within a few generations because of the search space explosion [7].

To answer the second question, we consider three spaces in which we search for grafts: 1) the parent of a change, the search space of the plastic surgery hypothesis; 2) a change's non-parental ancestors; and 3) the most recent version of entirely different projects. During search, we consider all the lines in each version, and not merely its changes, as this allows us to search those lines that survive unchanged from the start of a version history to its end.
This matters when the start of a version history is not empty, as is often the case, since many projects are bootstrapped from prototypes, adapted from existing projects, migrated from one version control system to another, or undergo extensive development outside of version control. In particular, our dataset covers an interval of each project's history that starts from a nonempty initial version and, on average, these core lines account for 30% of all lines searched. To quantify our answer to the second question, we simply count the number of grafts found in each search space, divided by its size.

We take a pragmatic, actionability-oriented view of the plastic surgery hypothesis. We argue that it is not merely about the repetitiveness of changes relative to their parent: this fact alone would not be actionable if the cost of finding grafts were prohibitive. The practical success of work on automated repair has demonstrated both the existence of grafts and the tractability of finding them. Thus, the hypothesis is about the richness of the first of these search spaces, the parent of a change. We therefore validate it by comparing the cost of finding grafts in this search space against the cost of finding them in the other two. Over the first search space, we find that, on average, changes are 43% graftable, and that 11% of them are 100% graftable. This suggests that a fair amount of the lines in code changes could be derived from code that already exists in the project. When we compare this result to the other two search spaces, we see that, on average, the non-parental ancestors contribute only 5% more grafts than the parents, while other projects provide only 9% more on average. Moreover, we found that graftability from the parent is significantly higher than graftability from both non-parental ancestors and other projects, with a high effect size (0.84 and 0.80, respectively).
Thus, we can answer the first question: the plastic surgery hypothesis's claim that many donor sites exist at the time of the change does hold (Section 4.1). An initial answer to the second question is to count the lines searched in each search space and report the work done to find each donor as the ratio of the number of donor sites found to the total number of lines searched (i.e., density). We found that the density of the parent is significantly higher than those of both non-parental ancestors and other projects, with a high effect size. Here, again, the plastic surgery hypothesis holds: the cost to search the parent is significantly lower than the cost to search the other two search spaces (Section 4.1).

Having validated the plastic surgery hypothesis, we turn our attention to how to exploit it. The success of automated bug fixing, refactoring, and genetic improvement demonstrates the utility of incorporating the search of existing code into techniques seeking to automate software engineering tasks; that is, the consequences of the plastic surgery hypothesis are indeed exploitable. The grafts we found are mostly single lines (57%), with the distribution following a power law. Thus, grafts would not be considered clones, because the traditional threshold for minimal clone size is 5–7 lines [38]. These smaller repetitive snippets are micro-clones. Syntactic repetitiveness below this threshold has simply been considered uninteresting because it lacks sufficient content; 53% of our grafts fall below this threshold and are, therefore, micro-clones. The success of automated repair, refactoring, and genetic improvement is evidence that these micro-clones, of which our grafts form a subset, are, on the contrary, useful. We contend that micro-clones are the atoms of code construction and are therefore fundamental to code synthesis.
Indeed, Gabel and Su demonstrated that line-length micro-clones from a large corpus can reconstruct entire programs [11]. Regardless of the intrinsic interest (or lack thereof) of a particular graft to a human being, such as a trivial assignment statement over the correct variable names, grafts can usefully bootstrap the automation of software development.

To reduce the cost of searching for grafts, we tried to correlate features of changes with graftability (Section 4.3). If we found such a correlation, we could exploit it to search more or less intensively. To this end, we studied whether different categories of human-written changes, e.g., bug fixes, refactorings, or new feature additions, are more graftable than others. We also asked whether graftability depends on size (Section 4.2). However, we found no such correlations. Indeed, concerning the category of change, the success of automatic bug fixing, refactoring, and genetic improvement suggests that different kinds of changes exhibit the same graftability, as we found.

As a community, we have learned that several lines of code are required for a change to be unique [11] and that a surprising number of changes are redundant, in the sense that they repeat changes already made [25, 29]. We also know that automated repair can be improved by including elements of human-constructed bug fixes [19], and that source code is locally repetitive to a greater degree than natural language [16]. To this growing body of knowledge about the repetitiveness of code and its exploitation, we add the validation of the plastic surgery hypothesis, augmented with insights into the proximity of grafts to each other.
The primary contributions of this paper are:
• A formal statement and validation of the plastic surgery hypothesis;
• A large-scale, empirical study of the extent to which development changes can be constructed from code already available during development, i.e., their graftability;
• An analysis of the relationship between commit features (i.e., size and type) and commit graftability; and
• An analysis of the locality of the distribution of grafts in the codebase to which a commit applies.

These findings relating to the plastic surgery hypothesis bode well for the likelihood of continuing success of patching-by-grafting approaches (including GenProg); they generally suggest that donor grafts to compose a patch can feasibly be found in temporal and spatial proximity to the patch site:
• Donor grafts can often be found in the current version of the program to be patched, and it is rarely necessary to search earlier versions (Section 4.1).
• The graftable portions of a patch can usually be composed out of lines from just one contiguous donor graft site, and very often from no more than two (Section 4.4).
• A significant portion (30%) of donor graft code can be found in the same file as the patch site (Section 4.5).

The rest of this paper is structured as follows. Section 2 formally defines the problem we tackle and Section 3 describes our experimental design and data. Section 4 discusses our findings. Section 5 places our work in the context of related research. Finally, Section 6 summarizes our contributions.

2. PROBLEM FORMULATION

We are interested in the graftability of changes to a codebase with respect to three search spaces: their parents, their non-parental ancestors, and other projects. In this section, we define graftability, its granularity, these three search spaces, and the cost of finding grafts in each of them. Figure 1 depicts how we measure the graftability of a change.
We are interested in the limits of graftability, so we formulate the problem as a thought experiment in which we take the commit as given and ask whether we can find its pieces in various search spaces, rather than trying to put pieces from a search space together and then asking whether they form a commit. We also assume that we can find where a potential graft applies in a target host commit in constant time; this assumption is reasonable in practice, since commits are small, with median sizes of 11–43 lines (Figure 4). The change, shown on the right of Figure 1, is the target host for the grafts. It is cut up into the snippets S1–Sn. We search the donor codebase for grafts that exactly match these snippets. The shaded snippets in the change are graftable; the unshaded snippets are not. We match snippets that are contiguous in both the host and the donor, when possible, as with S1–S2. Contiguity holds the promise of reducing the search space (Section 4.4). Our interest is redundancy in general, not merely the existence of a snippet shared across donor and host; we want to track the abundance of grafts in a donor search space, as this bears directly on the richness of the search space, which we measure using density, as defined in Definition 2.2 below.

Recall that a multiset generalizes a set to allow elements to repeat; the number of repetitions of an element is its multiplicity. For example, in the multiset {a, a, a, b, x, y}, the multiplicity of a is 3. We use multisets, therefore, to track the abundance of a particular snippet. We can view a file f as a concatenation of strings, f = αβγ, over some alphabet Σ. Snippets are the strings into which any file can be decomposed. Snippets define our unit of granularity; they are the smallest units of interest to us. The smallest a snippet can be is a symbol in Σ; the largest is the file itself. Snippets allow us to treat a file as an ordered multiset of its snippets.
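To make the line-granular, whitespace-normalized matching concrete, here is a minimal sketch of our own (an illustration, not the authors' tooling): it treats each nonblank, whitespace-normalized line as a snippet, uses Python's Counter as the multiset, and computes the share of a change's snippets covered by a donor. The example donor and change fragments, and all function names, are hypothetical.

```python
from collections import Counter

def snip(text: str) -> Counter:
    """Snip source text into a multiset of line-granular snippets,
    collapsing each line's whitespace runs to a single blank."""
    normalized = (" ".join(line.split()) for line in text.splitlines())
    return Counter(line for line in normalized if line)  # drop blank lines

def graftability(change: str, donor: str) -> float:
    """Proportion of the change's snippets that match donor snippets,
    where & is multiset intersection (minimum multiplicity)."""
    s_change, s_donor = snip(change), snip(donor)
    if not s_change:
        return 0.0
    matched = s_change & s_donor
    return sum(matched.values()) / sum(s_change.values())

donor = """
int total = 0;
for (Item i : items) {
    total += i.cost();
}
return total;
"""
change = """
int total = 0;
total += discount;
return total;
"""
# Two of the change's three snippets already occur in the donor: 2/3.
print(graftability(change, donor))
```

The authors' actual lexer also strips comments and syntactic delimiters during normalization, which this sketch omits; `Counter`'s `&` operator takes the minimum multiplicity of each line, matching the multiset intersection used in the paper's graftability definition.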
We require ordering to impose coordinates that we use to measure distances. We define a snipper function s that fragments a file into its snippets and rewrites the whitespace in each snippet to a well-defined normal form (e.g., a single blank). For f defined above, s(f) = {α, β, γ}: in other words, s cuts up its input into substrings from which its input can be reconstructed. The details of how s accomplishes this task are unimportant, so we treat s as a black box.

Figure 1: Graftability: We break up a commit into the snippets S1, ..., Sn, and search the donor (the codebase) for these snippets. Matches for a snippet in the codebase are grafts. A single snippet may have alternate grafts, as with S1; we try to match snippets that are contiguous in both the donor and the host, as with S2–S3. The graftability of the change is the proportion of its snippets we can cover from the donor codebase.

We are now ready to model version-control systems, including ones that facilitate branching and multi-way merges. A version V of a project is an ordered multiset of files. ∆ models a change, or commit. For us, each ∆ : V^k → V is a function that rewrites a k-tuple of versions to produce a new version. When k > 1, ∆ merges the k parents to form the new version, as when a branch is merged into the master branch in a distributed version control system like git or hg. Our data is drawn from Subversion repositories in which branching is rare and k = 1, so we drop k when writing ∆(V). In addition to files, our snip function s applies to versions, so s(V) is the ordered multiset of the snippets in all the files in V; s also applies to changes, so s(∆) is the multiset of snippets in ∆. Each ∆ is a sequence of snippets added,
deleted, and modified, where a modification is a paired addition and deletion, as in Myers [27] and in Unix diff. When our snipping function s cuts up a commit, it retains only the snippets, producing a multiset, and does not track whether a snippet was added, deleted, or modified (which produces two snippets). A version history is the function composition

Vn = ∆n−1(∆n−2(· · · (∆0(V0)) · · · )).   (1)

For clarity, we write this function composition as an alternating sequence of versions Vi and changes ∆i:

V0 ∆0 V1 ∆1 V2 ∆2 · · · ∆n−1 Vn.   (2)

The first version V0 is special: it can be empty. When V0 = ε, we have a project's very first commit. Typically, V0 = ε ⇒ |∆0| ≫ |∆i|, i > 0, because projects often start from a skeleton drawn from another project or inherit code from a prototype. Otherwise, we do not have a project's first commit, but are starting from somewhere within a project's history, as is true in our data set. V0 ∩ Vn is the core of a project: those lines that remain unchanged between the first and last commits of a version history, including lines that may have been deleted and then re-added in some intermediate versions.

Definition 2.1 (Graftability). The graftability of the change ∆ against the search space S is

g(∆, S) = |s(∆) ∩ s(S)| / |s(∆)|,

where S is an ordered multiset of snippets and ∩ is multiset intersection, in which the multiplicity of each element is the minimum of its multiplicities in the two operands. The graftability of ∆i against its parent is g(∆i, Vi).

Our notion of graftability measures the partial, not just total, constructibility of changes. Thus, it generalizes previous measures, which focus on 100% graftability. This previous focus was natural, since such changes are entirely reconstructible. Nonetheless, the existence of changes that are highly, but not completely, graftable, falling into the interval [0.7, 1), suggests that the search for grafts is more generally worthwhile than focusing solely on 100% graftable changes, since any nontrivial grafts that are found may considerably constrain the search space, even for novel lines. While it remains to be shown, even lower proportions of graftability may be useful, since a single graft may, in some cases, be highly informative.

Nonparental Ancestral Snippets. The ancestors of the change ∆j are all the changes ∆i where i < j. Our search spaces consist of snippets, so when searching a version history, our interest is in identifying all the snippets from which a change could, in principle, be constructed. One's parent is, of course, an ancestor, but we already consider this search space; indeed, the plastic surgery hypothesis is couched in terms of a change's parent. Thus, here we are interested only in the snippets that did not survive to a change's parent. This consists of all the snippets in all the ancestors of ∆j that did not survive to the parent. Thus, we define

as(∆j) = (⊎i<j s(∆i)) \ s(Vj).   (3)

Note that a snippet repeatedly added and deleted in a version history has high multiplicity in Equation 3. In practice, |as(∆j)| ≪ |s(Vj)|, because snippets rarely die, although there are notable exceptions, such as Apache's transition in its 2.4 release to handling concurrency via its MultiProcessing Modules, which abandoned many lines.

Search Spaces. Let C be the set of all changes and P be the set of projects. The three search spaces we consider in this paper are:

Parent: S = s(Vi)
Ancestral lines not in parent: S = as(∆i−1), i > 0
Other projects: S = ⊎p∈P s(V^p_head)

In terms of a version history, the existence component of the plastic surgery hypothesis states that, ∀∆i ∈ C, s(∆i) ∩ s(Vi) ≠ ∅.

Search Cost. To compare the relative richness of these search spaces, we compute their graft density: the number of grafts found in them over their size, averaged over all changes. For the search space S and the change ∆, let

grafts(S, ∆) = {l ∈ S | ∃k ∈ s(∆) s.t. l = k}   (4)

be the grafts found in S for the snippets in ∆. This definition of grafts captures the multiplicity in S of a shared element, with the consequence that grafts(S, ∆) ≠ s(S) ∩ s(∆), since the intersection on the right-hand side computes a multiset in which the multiplicity of each element is the minimum of its multiplicities in the two operands.

Definition 2.2 (Search Space Graft Density). The graft density of a search space is

gd(S) = (1/|C|) Σ∆∈C |grafts(S, ∆)| / |S|.

Graft density is the search space analog of commit graftability. It models the likelihood that a searcher guessing uniformly at random will find a graft for a line in a commit, averaged over all commits. In Section 4.1, we compute and compare the graft density of each of these three search spaces. Graftability and graft density are the measures we apply to commits and our three search spaces to study the degree to which the plastic surgery hypothesis applies in a large corpus of versioned repositories of project code.

3. EXPERIMENTAL DESIGN

We describe our corpus and present aggregate statistics for its commits, then discuss how we concretely realized our problem formulation for the experiments that follow.

3.1 Corpus

Our corpus contains the 12 software projects listed in Figure 2. All are Java-based and maintained by the Apache Software Foundation. They range in size from 2,712 to 371,186 LOC and from 25 to 3,826 commits, and they come from a very diverse range of domains, e.g., service frameworks, relational databases, distributed data storage, messaging systems, and web applications. We mined Apache's git repository (http://git.apache.org) to retrieve the change history of the projects from 2004 to 2012. Since Apache uses Subversion and provides only git mirrors, all the changes belong to a single branch. Using git allowed us to access relevant attributes for each change, such as date, committer identity, and the source files to which the change applies.

Figure 2: Experimental corpus: 12 Apache projects; Hadoop-Common is abbreviated as HadoopC.
Project   Description                        Commits
Camel     Enterprise Integration Framework     1,600
CXF       Services Framework                     175
Derby     Relational Database                    820
Felix     OSGi R4 Implementation               1,003
HadoopC   Common libraries for Hadoop            639
Hbase     Distributed Scalable Data Store      3,826
Hive      Data Warehouse System for Hadoop        25
Lucene    Text Search Engine Library             344
OpenEJB   Enterprise Java Beans                  534
OpenJPA   Java Persistence Framework              84
Qpid      Enterprise Messaging System          3,672
Wicket    Web Application Framework            3,001

Moreover, since all the projects use the JIRA issue tracking system (https://issues.apache.org/jira/), for each change we were also able to retrieve the kind of issue (e.g., bug fix or enhancement), its status (e.g., open, closed), and its resolution (e.g., Fixed, Incomplete). Depending on how an organization uses JIRA, a change could represent a software bug, a project task, a help desk ticket, a leave request form, etc. By default, JIRA specifies the following five change types:

1. Bug: A problem which impairs or prevents the functions of the product.
2. Improvement: An enhancement to an existing feature.
3. New Feature: A new feature of the product.
4. Task: A task that needs to be done.
5. Custom Issue: A custom issue type, as defined by the organization if required.

The first four types are self-explanatory. The last category groups issues not covered by the other four but needed by an organization using JIRA; in our dataset, the commits belonging to this set generally concern user wishes, testing code, and sub-tasks. Each issue has a status label that indicates where the issue currently is in its lifecycle, or workflow:

1. Open: this issue is ready for the assignee to start work on it.
2.
In Progress: this issue is being actively worked on at the moment by the assignee. 1 http://git.apache.org. 2 https://issues.apache.org/jira/. 309 Type Bug Improvement New Feature Task Custom Issue Camel 553 777 146 68 56 CXF 110 57 0 3 5 Derby 626 170 0 9 15 Felix 538 298 110 43 14 HadoopC 376 160 31 4 68 Hbase 2319 1053 163 115 176 Hive 15 4 2 3 1 Lucene 97 165 54 11 17 OpenEJB 298 82 34 17 103 OpenJPA 28 41 4 0 11 Qpid 2102 992 252 115 211 Wicket 1855 839 173 25 109 Figure 3: Count of commit types in our corpus. Commit Type Median Mean St. Dev. Bug Improvement New Feature Task Custom Issue 11 43 16 20 37 44.40 146.50 116.50 76.10 126.50 156.46 289.62 359.55 293.69 197.87 focus solely on changes. As Figure 5 shows, the core varies from negligible to dominant at 97% in the case of Hive. 3.2 Methodology We used git to clone and query the histories of the projects in our corpus and extracted the related JIRA information (Section 3.1) into a database. For each project in our corpus, we issued git reset -hard <commit> to retrieve a specific version. This command sets the current branch head to <commit> modifying index and working tree to match those of <commit>. To retrieve a specific change, we issued git diff on a commit and its parent and extracted the commit lines, i.e., the lines to be grafted. We used the JGit API3 to automate both tasks. To realize the snipping function, we wrote a lexer that snips a file into a multiset of code lines, then, from each line, removes the comments and semantically meaningless content, such as whitespace and syntactic delimiters, to normalize the lines. We ran this lexer over each search space, then loaded the resulting multiset into a hash table, whose key is normalized source line (to speed up the search for grafts) and the value is a pair that stores the source line c and its multiplicity. 
To compute graftability from Definition 2.1 of a commit, we looked up each normalized commit line in the hash table of the commit’s parent and divided the number of hits by the number of lines (snippets) in the commit. Figure 4: Commit size aggregate statistics. 3. Resolved: a resolution has been identified or implemented, and this issue is awaiting verification by the reporter. From here, issues are either Reopened or are Closed. 4. Reopened: This issue was once Resolved or Closed, but is now being re-examined. 5. Closed: this issue is complete. This means it has been identified, implemented and verified by the reporter. An issue can be resolved in many ways. The JIRA default resolutions are listed below: 1. Fixed: A fix for this issue has been implemented. 2. Won’t Fix: This issue will not be fixed, e.g., it may no longer be relevant. 3. Duplicate: This issue is a duplicate of an existing issue. 4. Incomplete: There is not enough information to work on this issue. 5. Cannot Reproduce: This issue could not be reproduced at this time, or not enough information was available to reproduce the issue. If more information becomes available, the issue can be reopened. An issue is initially Open, and generally progresses to Resolved, then Closed. A more complicated life cycle includes an issue whose initial resolution was Cannot Reproduce then changed to Reopened when the issue becomes reproducible. Such issues can subsequently transition to In Progress and, eventually to Won’t Fix, Resolved or Closed. Figure 3 shows the number of changes distinguished per type related to the projects contained in our corpus. We considered only those changes that have been successfully closed, i.e., status=closed and resolution=fixed. Moreover, we did not consider changes made to non-source code files or containing only comments. As result, we analyzed a total of 15,723 commits. Figure 4 shows the size of the different kinds of commits considered in this study. 
Note that we did not take deleted lines into account, since they are obviously graftable from the parent. We can observe that, on average, Bug commits are smaller than all the other types of commits, while Improvement commits are the largest.

Figure 5 shows the size of each project's core: the lines that survive untouched from the first version to the last in our version histories. The existence of nonempty first commits is one of the reasons for the effectiveness of the plastic surgery hypothesis, which searches these lines, in contrast to approaches that search only the commit history.

Figure 5: The size of each project's core. The core consists of those lines that are unchanged from the first version to the last in the studied portion of a project's version history.

  Project   Core
  Camel     26%
  CXF       85%
  Derby     45%
  Felix     <0.5%
  HadoopC   <0.5%
  Hbase     <0.5%
  Hive      97%
  Lucene    16%
  OpenEJB   <0.5%
  OpenJPA   83%
  Qpid      <0.5%
  Wicket    <0.5%
  Average   30%

4. RESULTS AND DISCUSSION

To open, we validate the plastic surgery hypothesis, the core result of this paper: both its well-known first claim, the existence of grafts, and its heretofore neglected second claim, that a change's parent is a rich search space for grafts. We then consider features of grafts with the aim of discovering correlations that might help us better decide which software engineering tasks would benefit most from being augmented with searching an existing codebase. We turn to the question of graft contiguity; that is, for a swath of adjacent lines in the host, can we find an equal-sized contiguous donor? If we can, it means we can more easily reconstruct the human-generated patches we are studying, with promising implications for automated construction. We close by considering the distribution of grafts in the donor search space.

4.1 Plastic Surgery Hypothesis

For convenience, we reproduce our central hypothesis:

Research Question 1 [The Plastic Surgery Hypothesis]: Changes to a codebase contain snippets that already exist in the codebase at the time of the change, and these snippets can be efficiently found and exploited.

Above, by "changes", we mean all commits made, and, by "codebase", we mean the search spaces we defined in Section 2: a commit's parent Vi, its ancestral lines not in its parent, as(∆i) (Equation 3 in Section 2), and the latest versions in our corpus of the other projects. (Footnote 3: http://www.eclipse.org/jgit/.)

This question explores the limits, or, conversely, the potential, of automatic programming: "How many changes are constituted of snippets that already exist in the code base, or its history, at the time when the commit is applied?" To answer this question, we analyzed the graftability of 15,723 commits coming from a corpus of 12 software projects (Section 3.1). For the commit ∆i, we model its graftability as shown in Definition 2.1 in Section 2. Nongraftability, or novelty, is 1 − graftability.

The results immediately prompt us to wonder "How many commits are fully graftable and how many are entirely novel?" Figure 6 shows the distribution of graftability over the 15,723 commits. We can observe that a large fraction of the commits (42%) are more than 50% graftable. More notably, 10% of the commits can be completely reconstructed from grafts. This result aligns with that of Martínez et al., who found that 3–17% of the changes in their dataset could be entirely reconstructed [25]. Only 16% of our commits are utterly novel. This data thus clearly suggests confirmation of the first, "snippets that already exist" component of the Plastic Surgery Hypothesis.

Figure 6: The number of commits that are x% graftable.

This finding relates to Gabel and Su, who found very few unique (non-recurring) snippets, even of considerable length, in a large (400,000,000-line) corpus of code; however, the mere existence of recurring snippets within this formidably large corpus offers scant hope of feasible graftability [11]. We, however, compute the graftability of commits, not arbitrary snippets of the codebase. Gabel and Su's was a scientific finding, unconcerned with the feasibility of searching for grafts.

The "efficiently found" part of the Plastic Surgery Hypothesis is about where to search efficiently; it states that the entire codebase of the commit's parent (and perhaps just the file where the commit applies, Section 4.5) is the best place to search, both in terms of richness and of cost, measured as the likelihood of finding a graft in a set of lines in the donor search space. Should we just search the parents and ancestors of the commit to be grafted? Or should we search other projects in the same language? To this end, we address the following research questions:

• RQ1a: How do parents fare as possible sources of grafts, when compared to nonparental ancestors and other projects?
• RQ1b: How do parents fare as efficient sources of grafts, when compared to nonparental ancestors and other projects?

Figure 7a and Figure 7b show the boxplots of the graftability and density values obtained over the 15,723 commits, when varying the search space over the three considered sources: a) the commit's parent, b) its ancestral lines excluding those found in its parent, and c) the latest versions of other projects, as defined in Section 2.

Figure 7a bears upon the existence of grafts in the three locations. Graftability from the parent is much higher than graftability from the nonparental ancestors and from other projects. This is not that surprising, and at least partially reflects differences in vocabulary (variable names, type names, method names, etc.) between projects. Similar inter-project differences were reported in statistical models of code [16]. Code bases tend to grow monotonically in size, so most lines survive from their first introduction to the parent of a given commit. Thus, the nonparental-ancestors search space consists of deleted lines. A consequence of the fact that we do not find many changes in the nonparental ancestral lines is that there are not many "regretted deletions": deletions that are later re-added.

Figure 7b bears upon the efficiency of finding grafts in different search spaces, in terms of the density measure defined in Definition 2.2. We ignore the density figure for nonparental ancestors because (as Figure 7a indicates) they tend to be of low value in graftability. We can observe that the density of the parent is higher than the density of other projects.

Since the boxplots showed no evidence that our samples come from normally distributed populations, we used the Wilcoxon signed-rank test to check for statistical significance. In particular, we tested the following null hypotheses:

• H0a: There is no significant difference between the graftability from parent and the graftability from nonparental ancestors.
• H0b: There is no significant difference between the graftability from parent and the graftability from other projects.
• H0c: There is no significant difference between the density in parent and the density in other projects.

We set the confidence limit, α, at 0.05 and applied the Benjamini-Hochberg [6] correction, since multiple hypotheses were tested. To assess whether the effect size is worthy of interest, we employed a non-parametric effect size measure, namely Vargha and Delaney's A12 statistic [37]. According to Vargha and Delaney [37], a small, medium, and large difference between two populations is indicated by an A12 over 0.56, 0.64, and 0.71, respectively.
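The A12 statistic can be computed directly from two samples; the sketch below is illustrative (not the scripts used in the study) and counts, over all pairs, how often a value from the first sample exceeds one from the second, with ties counted half:

```python
# Sketch of the Vargha–Delaney A12 effect size: the probability that a
# value drawn from the first sample exceeds one drawn from the second.

def a12(xs, ys):
    gt = sum(1 for x in xs for y in ys if x > y)
    eq = sum(1 for x in xs for y in ys if x == y)
    return (gt + 0.5 * eq) / (len(xs) * len(ys))

# A12 = 1 means every value in the first sample beats every value in
# the second; A12 = 0.5 means the samples are stochastically equal.
print(a12([0.9, 0.8, 0.7], [0.3, 0.2, 0.1]))  # 1.0
print(a12([1, 2], [1, 2]))                    # 0.5
```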
The results revealed a statistically significant difference (p < 0.001) between the graftability from parent and from nonparental ancestors, in favor of the parent codebase, with a high effect size (A12 = 0.84). The Wilcoxon test also revealed a statistically significant difference between the graftability achieved from the parent and from other projects' codebases, in favor of the parent, with a high effect size (p < 0.001, A12 = 0.80). The Wilcoxon test between the density of the commit's parent and that of other projects revealed a statistically significant difference (p < 0.001) in favor of the commit's parent, with a high effect size (A12 = 1). We therefore reject the hypotheses that the search spaces are indistinguishable and affirm the Plastic Surgery Hypothesis.

Figure 7: Graftability of a commit (a) and cost to search for its grafts (b) as the search space changes from the commit's parent, its ancestors (excluding its parent), and other projects. (a) Graftability of a commit (over 15,723 commits). (b) Density (log scale) of the search spaces (over 15,723 commits).

4.2 Graftability by Commit Size

Next, we consider the fact that commits vary considerably in size. Some are quite small; in fact, about half of all bug fixes are under 10 lines of code. Some commits contain as many as 10,000 lines of code. The question naturally arises, "Is automatic patching only likely to succeed on small patches?" One part of this is the existence question: "Do grafts exist only for small patches?" This motivates our second research question:

Research Question 2: How does graftability vary with commit size?

Figure 8 shows the relationship between commit graftability and commit size. The plot is a binhex plot, which is essentially a two-dimensional histogram.
The x-axis is the size of the commit, and the y-axis is the graftability value for commits of that size. Each hexagon is a "bin" that counts the number of (size, graftability) value pairs falling within the Euclidean range bounded by that hexagon. The color of the hexagon reflects the count of the pairs within a given bin, lighter colors reflecting larger counts.

The figure has some interesting patterns for low values of commit size, which arise from discrete fractions with small denominators and their complements (e.g., 3/10, 7/10). But these are just a distraction. The main trend visible in this plot is the absence of one; surprisingly, there appears to be no relationship between graftability and commit size. One might rather expect that, as commit size increases, there are more snippets to search for, and thus we might have more difficulty finding them, leading to lower graftability. No such trend is visible.

To confirm this rather strange phenomenon, we estimated a linear regression model to judge the effect of commit size on graftability; this is Model 1 in Figure 10. The response variable was graftability, and the predictor variable used in Model 1 was the commit size, log-scaled. The coefficient is very significant (p < 0.001), indicating a very low probability of observing the data if the true regression coefficient were zero; in other words, we would be very unlikely to observe the data if the graftability of a commit had no linear dependence on log(CommitSize). This might seem rather surprising, given that no such dependence is visible in Figure 8.

Figure 8: Does commit graftability vary with commit size?

The resolution to this puzzle is clear from the values of R2 and sample size on the bottom rows of Model 1: just 4% of the variance in graftability is explained by commit size!
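The apparent paradox of a highly significant coefficient with negligible explanatory power can be reproduced on synthetic data. The sketch below (illustrative only; it does not use our 15,723-commit dataset) fits an ordinary-least-squares line to a response that depends only weakly on the log-scaled predictor amid large noise:

```python
# Illustrative OLS fit of a Model-1-style regression (response ~ log size)
# on synthetic data: a weak but real slope drowned in noise yields a tiny R^2.
import math
import random

random.seed(0)
sizes = [random.randint(1, 1000) for _ in range(5000)]
# Weak dependence on log(size), large noise, clipped to [0, 1] like a fraction.
ys = [min(1.0, max(0.0, 0.29 + 0.01 * math.log(s) + random.gauss(0, 0.3)))
      for s in sizes]

xs = [math.log(s) for s in sizes]
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
beta = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        / sum((x - mx) ** 2 for x in xs))
alpha = my - beta * mx
ss_res = sum((y - (alpha + beta * x)) ** 2 for x, y in zip(xs, ys))
ss_tot = sum((y - my) ** 2 for y in ys)
print("R^2 =", 1 - ss_res / ss_tot)  # small: size explains little variance
```

With thousands of samples, even a slope this weak is statistically distinguishable from zero, yet it accounts for almost none of the variance, which is exactly the pattern Model 1 exhibits.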
In other words, varying the commit size has an extremely weak effect on the variance in graftability; however, even this weak explanatory power is divined as statistically significant by the linear regression, thanks to the large number of samples (15,723). We conclude that commit size has a negligible effect on the variance in graftability.

4.3 Graftability by Commit Type

Commits are made for different reasons. As noted in Section 3, the commits in our JIRA data are tagged as one of Bug, Improvement, New Feature, Task, or Custom Issue. It seems reasonable to wonder whether different categories of commit have different graftability. If there were a strong difference, this could tell us which software engineering tasks are most likely to benefit from techniques that rely on the plastic surgery hypothesis. For instance, it seems likely that a New Feature commit would be less graftable than one tagged Bug, since bugs often appear in multiple locations and may have already been fixed in some, but not all, of those locations.

Figure 9: Graftability obtained for five kinds of commits.

Figure 10: Size has little effect on graftability, as demonstrated by two regression models with graftability as response: although both models find commit size to be strongly statistically significant with p < 0.001, and the standard errors (shown within parentheses) are all small, R2 shows that these models account for only 4% of the variance in graftability.

                            Model 1          Model 2
  Intercept                 0.29⋆⋆⋆ (0.00)   0.30⋆⋆⋆ (0.00)
  Commit Size (log scale)   0.07⋆⋆⋆ (0.00)   0.08⋆⋆⋆ (0.00)
  Improvement vs. Bug                        -0.04⋆⋆⋆ (0.01)
  New Feature vs. Bug                        0.02 (0.01)
  Task vs. Bug                               -0.03⋆⋆⋆ (0.01)
  Custom Issue vs. Bug                       0.00 (0.01)
  R2                        0.04             0.04
  Adj. R2                   0.04             0.04
  Number of observations    15,723           15,723
  ⋆⋆⋆ p < 0.001
The prior fixes would then be grafts available for fixing the bug in the missed locations. Compare this to a New Feature commit, which, especially if the feature is complex, seems more likely to contain novel lines that do not already exist in the system. This leads to the next question:

Research Question 3: Do different kinds of commits exhibit the same graftability?

To answer this question, we compared the graftability of the different types of changes in our dataset (Bug, Improvement, New Feature, Task, Custom Issue). Figure 9 shows the boxplot (upper plot) of the graftability obtained for the five commit types. The lower plot shows the commit sizes for the different kinds; it is log-scaled on the y-axis. It is noteworthy that the lower plot shows some differences in the sizes of the different types, despite the log-scaling; differences are less visible in the upper plot.

To confirm the visual impression that commit types do not affect graftability, we added the type of commit as a factor variable to the regression model discussed earlier, yielding Model 2 in Figure 10. In this model, the effect of each kind of commit (as a factor) is compared to the graftability of the Bug commit type, to check whether such comparisons have any added explanatory power beyond that provided by commit size, and also to see what that effect might be. In the combined model, variance inflation was well within acceptable limits. This finding echoes that for commit size. While the Improvement and Task commit types are statistically significantly less graftable than the Bug commit type, the actual explained effect is very small. The R2 value remains numerically essentially unchanged. If we consider the commit type by itself as an explanatory factor variable, we can explain only about 0.001 of the variance in graftability (model omitted). The high significance of this very small effect reported by the regression modeling exercise is simply a result of the large number of samples. Thus, we come to the rather unexpected conclusion that a commit's type has no significant, practical impact on finding a graft for that commit.

4.4 Graft Contiguity

Once a technique has found grafts, it must arrange and order them to transplant them into a host change. Composing grafts at line granularity to (re)construct a change, even when that change is 100% graftable, faces a combinatorial explosion of all the permutations of the grafts. Novel, nongraftable lines exacerbate the problem. This graft-composition search space would be more manageable if grafts were bigger. Intuitively, code decomposes into semantic units that are sometimes bigger than the granularity at which one is searching for grafts. If we could find these units, we could use them to constrain the change (re)construction search space. Thus, we ask how often we can find contiguous grafts of size greater than a single line, in both the host and the donor.

When trying to constitute a commit using snippets that already exist in the code, a natural intuition is that larger chunks will make constituting the commit easier. At the extremes, searching for the individual lines that make up a commit would certainly be harder than serendipitously finding all the lines of the commit together in one area of code. We now attempt to formalize this intuition. For this, we refer the reader back to Figure 6. Consider the snippet sequence S1 . . . S8, which constitutes the commit. There, we show how the snippet sequence S2–S3 is contiguous, both in the donor and in the change host we seek to reconstitute. If this were a common occurrence, the search heuristic of attempting to constitute commits in groups of lines would be quite effective. When contiguous host snippets are constituted from single or very few donor snippets, the search for donor snippets is simplified and accelerated. This leads to the following question:

Research Question 4: To what extent are grafts contiguous?

A commit, in its role as the target host, determines the maximum size of a contiguous region. Contiguous regions of grafts in the donor larger than the largest contiguous region of snippets in the host must necessarily be broken up when transplanted. It is very convenient when a contiguous graft in the donor matches a contiguous site in the host: the more often this occurs, the more likely we are to be able to "bite off" larger chunks of code from the donor and shoehorn them in to reconstitute substantial pieces of the commit. So we ask:

RQ4a: How often do contiguous regions in the donor match contiguous regions in the host? How big are they?

We found 21,726 host snippets and 24,346 donor snippets consisting of two or more consecutive graftable lines. So we can positively answer the question: contiguous graftable regions often appear in both hosts and donors. Figure 11 shows the sizes of both host and donor snippets. We can observe that the average size of a host snippet (4.5 lines) is about twice that of a donor snippet (2.5 lines); this indicates that not all snippets can be entirely grafted from a single donor, and explains why there are more donor snippets than host snippets. When these contiguous regions are exactly matched in size, however, we can simply pluck them out of the donor and paste them into the host: essentially, these are little "micro-clones" that are reproduced in commits.

Figure 11: How big are contiguous grafts? The figure reports the size (log scale) of both host and donor snippets.

RQ4b: What is the distribution of host and donor snippets of the same size?

Examining the number of contiguous snippets in host and donor that have the same size, we found that a host snippet can be grafted from a single donor (i.e., fully matched snippets) in 12,827 cases (53%); in the remaining cases, more than one donor is needed. Figure 12 shows the number of fully matched snippets, grouped by size. We can observe that the majority (72%) of these snippets (9,259) have size 2, while 16% have size 3 and only 0.6% have size 4; beyond that, as the figure shows, the counts drop off dramatically.

Figure 12: How many host and donor snippets have the same size? The figure shows the number (square root scale) of host and donor snippets having the same size.

RQ4c: Counting contiguous grafts in the donor as a single site, how many distinct donor sites do we need in order to cover a transplant point in the host?

Since 47% of the donor snippets do not fully match a host snippet, we are interested in how many donors are needed on average to graft a given host snippet, and in how difficult it is to look for these donors (see the next section). We found that two donors are needed on average to graft a given host snippet. The above results reveal that contiguous graftable regions often appear in both hosts and donors. More than half of host snippets can be grafted from a single donor; in the remaining cases, two donors are needed on average.

4.5 Graft Clustering

An important factor in the computational feasibility of automatic commit synthesis is the search space size. If one had to search for suitable donors over the entire possible space of donors (e.g., the entire previous version of the project) every time, it would be much less efficient than if one could just search near the locus of the commit, such as in the same file or the same directory. This motivates the next question:

Research Question 5: Are the snippets needed to graft a host snippet in the same file?

Fortunately, we find that 30% of the donor snippets can be found in the same donor file and 9% in the same package. This is an encouraging result, suggesting that donor snippets are often found in the same file, not requiring more extensive search.

4.6 Threats to Validity

This section discusses the validity of our study in terms of three types of threats: construct, internal, and external validity. Construct validity asks whether the measures taken and used in an experiment actually measure the phenomenon under study. Internal validity concerns the soundness of the methodology employed to construct the experiment, while external validity concerns the bias in the choice of experimental subjects.

Section 3.1 describes how we automatically computed our measures. To mitigate the threat of an implementation error, we applied unit testing, and one of the authors manually verified the accuracy of the measurements of 30 commits selected uniformly.

To address internal validity, we carefully verified that our data met all required assumptions before applying statistical tests and the regression model. Moreover, our 100% graftability results are low relative to the standard finding of 5–30% redundancy in the clone literature. This is probably due to our choice of considering exact matches over whitespace-normalized, but otherwise untouched and notably unabstracted, source lines. We adopted this choice because abstraction reduces the semantic content of lines, such as that contained in identifiers, which must then be restored. Thus, we chose exact matching because of our belief that these lines would be strictly more useful for techniques relying on the plastic surgery hypothesis. Relaxed notions of matching are a distinct and interesting avenue of research that has witnessed positive results [33].

As for external validity, the projects in our corpus are all open-source Apache projects.
Although they differ in domain and size (Section 3.1), we cannot claim that our findings generalize to other software systems. However, we have formally stated the problem (Section 2) and described our methodology (Section 3.2) to facilitate the replication of our study.

5. RELATED WORK

It has been known for some time that production code contains software clones [3, 4, 8, 18]. These can be verbatim cut-and-paste copies of code (so-called Type 1 clones) or might arise from a more elaborate cut-modify-and-paste process (so-called Type 2 and Type 3 clones [8]). The presence of code clones has led to much work on techniques for investigating and evaluating this form of apparent redundancy [5, 38].

More recently, authors have sought to measure the degree of redundancy in code and in the commits applied to it. Gabel and Su [11] sought to answer the question "How unique (and how repetitive) is source code?" in an analysis of approximately 420M SLoC from 6,000 open-source software projects. They observed significant syntactic redundancy at levels of granularity from 6–40 tokens. For example, at the granularity of 6 tokens, the projects were between 50% and 100% redundant. This suggests that code contains a great deal of "redundancy" that could potentially be exploited by code improvement techniques. However, Gabel and Su considered neither code commits nor the cost of finding redundancy.

Nguyen et al. sought to answer the question "How often do developers make the same changes (commits) they have made before?", studying 1.8M revisions to 2,841 open-source projects [28]. They defined a "repeated change" to be an AST extracted from a change that matches an AST from a previous change in some project's history, including the same project. Over ASTs, they found that changes are frequently repeated, with the median 1-line change having a 72% chance of having been performed elsewhere. The repetitiveness dropped exponentially as the granularity (number of lines) increased: for granularities of 6 lines and greater, it was typically below 10%. This "commit redundancy" meant that future changes could be recommended with over 30% accuracy.

Martínez et al. [25] also recently studied commit redundancy, focusing their attention on commits they term "temporally redundant", or, in our terminology, 100% graftable changes. 100% graftable lines are interesting because they are, in principle, entirely reconstructible. Our measure of graftability in Definition 2.1 additionally measures the degree to which a commit is graftable. Like Nguyen et al. [28] and Gabel and Su [11], they find a perhaps surprisingly high degree of redundancy in the code they studied.

Our work has two primary methodological differences from this previous work on commit redundancy [25, 28]: we consider the cost of finding a graft, which the previous work does not, and we are concerned with code-commit graftability rather than just 'commit-commit' redundancy, a special case of graftability. That is, we consider the full code space in assessing graftability, whereas previous work focused on commits alone. In terms of our formalism, both Nguyen et al. and Martínez et al. search the set of changes, or deltas; in contrast, we focus on versions, and therefore search the lines in V0 ∩ Vn, the project's 'core' lines, which can dominate a version history, accounting for up to 97% of the lines in the final version, as we show in Section 3.1. If we had a full commit history starting, ab initio, with the empty system, then we could assume that 'commit-commit' and 'code-commit' approaches would study largely the same information. However, since version histories typically do not go back this far, that assumption is invalid; the current version of a system is not merely the product of the sequence of commits for which information is available.

Assessing the degree of code-commit graftability has implications for work on program improvement, an area that is witnessing a significant upsurge in interest. Program improvement seeks to automate improving an existing code base with relatively small modifications. Examples include repairing the code base [17, 24, 39, 41], enhancing its properties [22, 23, 30, 34, 42, 43], or even migrating it to other systems [21, 33]. All these program improvement approaches share the foundational assumption that many software systems contain the seeds of their own improvement. They are united by the way they search for, extract, and recombine fragments of code to create desired new functionality and behavior. In order to assess the potential of this search space, we need to study not just commit-commit redundancy, but also code-commit graftability. By assessing code-commit graftability, we seek to shed light on the degree to which a commit made by a human could have been found by a machine, and the cost of so doing.

Our results on the degree and cost of graftability of within-system code commits are relevant to program improvement work that searches for modifications in the existing system [1, 22, 23, 30, 42], or which finds patches from elsewhere in the system [24, 41]. Our results on graftability between systems are relevant to program improvement work that searches for transplants from one system to another [15, 32]. Overall, our results provide further evidence to support claims that it is promising to mine human commits for patterns, templates, and code fragments that can be reused to improve systems. This is a technique to which other authors have recently turned, for automated program improvement [20, 39] and for semi-automated improvement as decision support to software engineers [26, 35], with promising results.
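The 'core' comparison in this formalism (the lines in V0 ∩ Vn) can be sketched as a simple set intersection over normalized lines. This is an illustrative approximation with hypothetical names, not our mining pipeline:

```python
# Sketch of a project's "core" per the formalism above: lines present in
# both the first (V0) and last (Vn) studied versions, via set intersection.

def normalize(line):
    """Collapse whitespace so that only token content matters."""
    return " ".join(line.split())

def core_fraction(v0_lines, vn_lines):
    """Fraction of the final version's distinct lines already in V0."""
    v0 = {normalize(l) for l in v0_lines}
    vn = {normalize(l) for l in vn_lines}
    if not vn:
        return 0.0
    return len(v0 & vn) / len(vn)

# Example: two of the final version's four distinct lines date back to V0.
v0 = ["a = 1", "b = 2", "c = 3"]
vn = ["a = 1", "b = 2", "d = 4", "e = 5"]
print(core_fraction(v0, vn))  # 0.5
```

A set-based intersection ignores duplicate lines and line order; a tool that must track which occurrence of a line survived would instead follow diff chains across versions.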
6. CONTRIBUTIONS

In this paper, we validated the plastic surgery hypothesis with a large-scale experiment over 15,723 commits from 12 Java projects. Our core finding is that the parent of a commit (rather than non-parental ancestors, or other projects) is by far the most fecund source of grafts. We also find encouraging evidence that many commits are graftable: they can be reconstituted from existing code. We also find that grafts are often contiguous, which suggests heuristics that attempt to graft commits out of multiple contiguous lines. Finally, we find that fully 30% of the elements of commits can be found within the same file. These are encouraging results for automatic program repair techniques that exploit the plastic surgery hypothesis.

It is also true that there are fragments of commits that are not graftable. The complement of graftability measures the novelty of changes. As future work, we intend to explore whether the feature set of novel changes is more predictable than we have found grafts to be, again with the aim of identifying which sorts of changes are most likely to profit from the plastic surgery hypothesis.

7. ACKNOWLEDGEMENTS

This research is part-funded by the Engineering and Physical Sciences Research Council CREST Platform Grant (EP/G060525), the Dynamic Adaptive Automated Software Engineering (DAASE) programme grant (EP/J017515), and the National Science Foundation under Grants No. CCF-1247280 and CCF-1446683.

8. REFERENCES

[1] Andrea Arcuri, David Robert White, John A. Clark, and Xin Yao. Multi-objective improvement of software using co-evolution and smart seeding. In 7th International Conference on Simulated Evolution and Learning (SEAL 2008), pages 61–70, Melbourne, Australia, December 2008. Springer.
[2] Andrea Arcuri and Xin Yao. A novel co-evolutionary approach to automatic software bug fixing. In Proceedings of the IEEE Congress on Evolutionary Computation (CEC '08), pages 162–168, Hong Kong, China, June 2008.
[3] Brenda S. Baker.
A program for identifying duplicated code. In Computer Science and Statistics 24: Proceedings of the 24th Symposium on the Interface, pages 49–49, 1993.
[4] Ira D. Baxter, Andrew Yahin, Leonardo Mendonça de Moura, Marcelo Sant'Anna, and Lorraine Bier. Clone detection using abstract syntax trees. In International Conference on Software Maintenance (ICSM '98), pages 368–377, 1998.
[5] Stefan Bellon, Rainer Koschke, Giuliano Antoniol, Jens Krinke, and Ettore Merlo. Comparison and evaluation of clone detection tools. IEEE Transactions on Software Engineering, 33(9):577–591, 2007.
[6] Yoav Benjamini and Yosef Hochberg. Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society, Series B (Methodological), 57(1):289–300, 1995.
[7] Yuriy Brun, Earl Barr, Ming Xiao, Claire Le Goues, and Prem Devanbu. Evolution vs. intelligent design in program patching. Technical Report https://escholarship.org/uc/item/3z8926ks, UC Davis: College of Engineering, 2013.
[8] S. Carter, R. Frank, and D. S. W. Tansley. Clone detection in telecommunications software systems: A neural net approach. In Proceedings of the International Workshop on Application of Neural Networks to Telecommunications, pages 273–287, 1993.
[9] Satish Chandra, Emina Torlak, Shaon Barman, and Rastislav Bodik. Angelic debugging. In Proceedings of the 33rd International Conference on Software Engineering (ICSE '11), pages 121–130, Honolulu, HI, USA, 2011. ACM.
[10] Marios Fokaefs, Nikolaos Tsantalis, Eleni Stroulia, and Alexander Chatzigeorgiou. Identification and application of extract class refactorings in object-oriented systems. Journal of Systems and Software, 85(10):2241–2260, 2012.
[11] Mark Gabel and Zhendong Su. A study of the uniqueness of source code. In Proceedings of the Eighteenth ACM SIGSOFT International Symposium on Foundations of Software Engineering (FSE '10), pages 147–156. ACM, 2010.
[12] Ah-Rim Han and Doo-Hwan Bae.
Dynamic profiling-based approach to identifying cost-effective refactorings. Information and Software Technology, 55(6):966–985, 2013.
[13] Mark Harman. Automated patching techniques: The fix is in: Technical perspective. Communications of the ACM, 53(5):108, 2010.
[14] Mark Harman, William B. Langdon, Yue Jia, David Robert White, Andrea Arcuri, and John A. Clark. The GISMOE challenge: Constructing the Pareto program surface using genetic programming to find better programs (keynote paper). In 27th IEEE/ACM International Conference on Automated Software Engineering (ASE 2012), pages 1–14, Essen, Germany, September 2012.
[15] Mark Harman, William B. Langdon, and Westley Weimer. Genetic programming for reverse engineering (keynote paper). In Rocco Oliveto and Romain Robbes, editors, 20th Working Conference on Reverse Engineering (WCRE 2013), Koblenz, Germany, 14–17 October 2013. IEEE.
[16] Abram Hindle, Earl T. Barr, Zhendong Su, Mark Gabel, and Premkumar Devanbu. On the naturalness of software. In 34th International Conference on Software Engineering (ICSE 2012), pages 837–847. IEEE, 2012.
[17] Guoliang Jin, Wei Zhang, Dongdong Deng, Ben Liblit, and Shan Lu. Automated concurrency-bug fixing. In Proceedings of the 10th USENIX Conference on Operating Systems Design and Implementation (OSDI '12), pages 221–236, 2012.
[18] Toshihiro Kamiya, Shinji Kusumoto, and Katsuro Inoue. CCFinder: A multi-linguistic token-based code clone detection system for large scale source code. IEEE Transactions on Software Engineering, 28(6):654–670, 2002.
[19] Dongsun Kim, Jaechang Nam, Jaewoo Song, and Sunghun Kim. Automatic patch generation learned from human-written patches. In 35th International Conference on Software Engineering (ICSE '13), pages 802–811. IEEE / ACM, 2013.
[20] Dongsun Kim, Jaechang Nam, Jaewoo Song, and Sunghun Kim. Automatic patch generation learned from human-written patches.
In Proceedings of the 2013 International Conference on Software Engineering, ICSE ’13, pages 802–811, 2013. William B. Langdon and Mark Harman. Evolving a CUDA kernel from an nVidia template. In IEEE Congress on Evolutionary Computation, pages 1–8. IEEE, 2010. William B. Langdon and Mark Harman. Genetically improved CUDA C++ software. In 17th European Conference on Genetic Programming (EuroGP), Granada, Spain, April 2014. To Appear. William B. Langdon and Mark Harman. Optimising existing software with genetic programming. IEEE Transactions on Evolutionary Computation, 2014. To appear. Claire Le Goues, Stephanie Forrest, and Westley Weimer. Current challenges in automatic software repair. Software Quality Journal, 21(3):421–443, 2013. Matias Martinez, Westley Weimer, and Martin Monperrus. Do the fix ingredients already exist? An empirical inquiry into the redundancy assumptions of program repair approaches. In Companion Proceedings of the 36th International Conference on Software Engineering, ICSE Companion 2014, pages 492– 495, New York, NY, USA, 2014. ACM. Na Meng, Miryung Kim, and Kathryn S. McKinley. LASE: locating and applying systematic edits by learning from examples. In Proceedings of the 2013 International Conference on Software Engineering, ICSE ’13, pages 502–511, 2013. Eugene W. Myers. An O(ND) difference algorithm and its variations. Algorithmica, 1:251–266, 1986. Hoan Anh Nguyen, Anh Tuan Nguyen, Tung Thanh Nguyen, T.N. Nguyen, and H. Rajan. A study of repetitiveness of code changes in software evolution. In Automated Software Engineering (ASE), 2013 IEEE/ACM 28th International Conference on, pages 180–190, Nov 2013. Hoang Duong Thien Nguyen, Dawei Qi, Abhik Roychoudhury, and Satish Chandra. SemFix: Program repair via semantic analysis. In Proceedings of the 2013 International Conference on Software Engineering, ICSE ’13, pages 772–781, San Francisco, CA, USA, 2013. IEEE Press. Michael Orlov and Moshe Sipper. 
Flight of the FINCH through the java wilderness. IEEE Transactions Evolutionary Computation, 15(2):166–182, 2011. Jeff H. Perkins, Sunghun Kim, Sam Larsen, Saman Amaras- [32] [33] [34] [35] [36] [37] inghe, Jonathan Bachrach, Michael Carbin, Carlos Pacheco, Frank Sherwood, Stelios Sidiroglou, Greg Sullivan, Weng-Fai Wong, Yoav Zibin, Michael D. Ernst, and Martin Rinard. Automatically patching errors in deployed software. In Proceedings of the 22nd ACM Symposium on Operating Systems Principles, pages 87–102, Big Sky, MT, USA, October 12–14, 2009. Justyna Petke, Mark Harman, William B. Langdon, and Westley Weimer. Using genetic improvement & code transplants to specialise a C++ program to a problem class. In 17th European Conference on Genetic Programming (EuroGP), Granada, Spain, April 2014. To Appear. Baishakhi Ray and Miryung Kim. A case study of cross-system porting in forked projects. In Proceedings of the ACM SIGSOFT 20th International Symposium on the Foundations of Software Engineering, FSE ’12, pages 53:1–53:11, New York, NY, USA, 2012. ACM. Pitchaya Sitthi-amorn, Nicholas Modly, Westley Weimer, and Jason Lawrence. Genetic programming for shader simplification. ACM Trans. Graph, 30(6):152:1–152:11, 2011. Sooel Son, Kathryn S. Mckinley, and Vitaly Shmatikov. Fix me up: Repairing access-control bugs in web applications. In In Network and Distributed System Security Symposium, 2013. Nikolaos Tsantalis and Alexander Chatzigeorgiou. Identification of move method refactoring opportunities. IEEE Trans. Softw. Eng., 35(3):347–367, May 2009. András Vargha and Harold D. Delaney. A critique and improvement of the CL common language effect size statistics of McGraw and Wong. Journal of Educational and Behavioral Statistics, 25(2):101–132, 2000. [38] Tiantian Wang, Mark Harman, Yue Jia, and Jens Krinke. Searching for better configurations: a rigorous approach to clone evaluation. 
In European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering, ESEC/FSE’13, pages 455–465, Saint Petersburg, Russian Federation, August 2013. ACM. [39] Yi Wei, Yu Pei, Carlo A. Furia, Lucas Serpa Silva, Stefan Buchholz, Bertrand Meyer, and Andreas Zeller. Automated fixing of programs with contracts. In Proceedings of the 19th International Symposium on Software Testing and Analysis, pages 61–72, 2010. [40] Westley Weimer. Patches as better bug reports. In Generative Programming and Component Engineering, pages 181–190, 2006. [41] Westley Weimer, Thanh Vu Nguyen, Claire Le Goues, and Stephanie Forrest. Automatically finding patches using genetic programming. In International Conference on Software Engineering (ICSE), pages 364–374, Vancouver, Canada, 2009. [42] David Robert White, Andrea Arcuri, and John A. Clark. Evolutionary improvement of programs. IEEE Transactions on Evolutionary Computation, 15(4):515–538, 2011. [43] David Robert White, John Clark, Jeremy Jacob, and Simon Poulding. Searching for resource-efficient programs: Lowpower pseudorandom number generators. In 2008 Genetic and Evolutionary Computation Conference (GECCO 2008), pages 1775–1782, Atlanta, USA, July 2008. ACM Press. 317