Academia.eduAcademia.edu

Clone-Aware Configuration Management

2009

Recent research results show several benefits of the management of code clones. In this paper, we introduce Clever, a novel clone-aware software configuration management (SCM) system. In addition to traditional SCM functionality, Clever provides clone management support, including clone detection and update, clone change management, clone consistency validating, clone synchronizing, and clone merging. Clever represents source code and clones as (sub)trees in Abstract Syntax Trees (ASTs), measures code similarity based on structural characteristic vectors, and describes code changes as tree editing scripts. The key techniques of Clever include the algorithms to compute tree editing scripts; to detect and update code clones and their groups; and to analyze the changes of cloned code to validate their consistency and recommend the relevant synchronization. Our empirical study on many real-world programs shows that Clever is highly efficient and accurate in clone detection and updating, and provides useful analysis of clone changes. : First of all, a developer could use Eclipse to work on a software project. At any time, (s)he can start code clone detection on any version of the project. When (s)he checks in the code, if clone detection has never been initiated, Clever will perform the detection. It reads source files and extracts important features. The initial detection is launched using our algorithm on those fragments (Section 5). In addition to the normal check-in data, clonerelated information is also stored for future updating.

2009 IEEE/ACM International Conference on Automated Software Engineering Clone-aware Configuration Management Tung Thanh Nguyen, Hoan Anh Nguyen, Nam H. Pham, Jafar M. Al-Kofahi, Tien N. Nguyen Electrical and Computer Engineering Department Iowa State University {tung,hoan,nampham,jafar,tien}@iastate.edu ✦ Abstract—Recent research results show several benefits of the management of code clones. In this paper, we introduce Clever, a novel clone-aware software configuration management (SCM) system. In addition to traditional SCM functionality, Clever provides clone management support, including clone detection and update, clone change management, clone consistency validating, clone synchronizing, and clone merging. Clever represents source code and clones as (sub)trees in Abstract Syntax Trees (ASTs), measures code similarity based on structural characteristic vectors, and describes code changes as tree editing scripts. The key techniques of Clever include the algorithms to compute tree editing scripts; to detect and update code clones and their groups; and to analyze the changes of cloned code to validate their consistency and recommend the relevant synchronization. Our empirical study on many real-world programs shows that Clever is highly efficient and accurate in clone detection and updating, and provides useful analysis of clone changes. Finally, collaborative and team development supports in SCM will facilitate the maintenance of consistent editing to cloned code from multiple developers. Recent study by Krinke [25] on several open-source systems showed that half of the changes to code clone groups are inconsistent changes. Unfortunately, current SCM tools are not well-equipped with clone management supports. Text line-based change management approaches in existing SCM systems are not suitable for clone management because code clones are not necessarily identical. They often have slight modifications, thus, requiring an approach that can better capture the code semantics than the text-based approach in existing SCM tools. Describing changes to clones and clone groups in term of changed lines is clearly insufficient in supporting code understanding. 1 1.1 Clone-aware SCM functionality I NTRODUCTION Code clones are exactly matched or similar portions of code that are often created by the copy-and-paste programming practice. Classical approaches considered code clones to be harmful, thus, emphasized on the detection and removal of clones [6], [37]. However, recent research has shown more benefits of code clone management during software evolution than removing them [11], [22], [23], [32]. During software development, source code is modified. As regular code, cloned code evolves as well. However, existing code clone management approaches are still ad-hoc, limited, and unsatisfactory, especially with changes to clones. Software configuration management (SCM) [13] area provides many well-established tools for managing the changes to source code in software systems with useful version control and collaboration supports. Thus, code clone management in evolving software should be incorporated into an SCM system. In other words, an SCM tool should be clone-aware. The integration of clone management support into an SCM system creates several benefits. Firstly, the management and tracking of changes to clones and clone groups can take advantage of change management supports without requiring a full retrieval of individual versions. Secondly, clone group management would be more time efficient and complete because reported changes from an SCM tool will help in updating the clone results for a new version without complete re-detection. Re-detection is time-consuming for large-scale systems while changes might affect a small set of code clones. 1527-1366/09 $29.00 © 2009 IEEE DOI 10.1109/ASE.2009.90 111 125 123 In this paper, we introduce Clever, a novel clone-aware software configuration management system, which makes use of SCM functionality and provides additional supports for clone management. Clever was developed as an add-on to Subclipse/SVN, an Eclipse plug-in SCM tool. Let us first describe Clever from the users’ perspective. In addition to traditional SCM functionality [13], Clever has the following clone-aware supports: 1. Detecting code clones and grouping them, 2. Updating clones and groups as source code changes, 3. Managing the changes to individual clones and groups, 4. Reporting clones/groups and their changes at any version, 5. Notifying developers on potential inconsistent modifications to the cloned code, and 6. Supporting consistent changes to clones and merging. Clone Detection and Updating (tasks 1 and 2): First of all, a developer could use Eclipse to work on a software project. At any time, (s)he can start code clone detection on any version of the project. When (s)he checks in the code, if clone detection has never been initiated, Clever will perform the detection. It reads source files and extracts important features. The initial detection is launched using our algorithm on those fragments (Section 5). In addition to the normal check-in data, clonerelated information is also stored for future updating. When the developer checks out a version, along with the code, the clone information is also retrieved. At any time, if requested, Clever produces a clone report that describes the clones and clone groups. Each clone group is reported as a set of cloned fragments and each cloned fragment is reported with its location. The developer is able to use Eclipse’s editor to make changes to source files. As a new version is checked into the repository, Clever will perform the clone updating process. It uses the reported changes from SVN and builds the sets of deleted, modified, and newly created fragments accordingly. Then, Clever executes its clone updating algorithm (Section 5). It also runs clone updating as requested and clone information of the previous detection is used for the update. Clone Change Management (tasks 3 and 4): The developer can use the textual differencing tool of SVN to compare two versions of a program. (S)he can also use Clever’s clone differencing functionality to display the changes in term of editing operations between the versions. This function can be used to show how a clone has been modified from the previous version, which could help him/her understand the changes and consistently modify other cloned code. At any time, the developer can invoke the change report for clone groups. The report shows the evolution of a clone group in consecutive versions. The newly created, modified, disappearing, and un-changed clone groups will also be shown. As selecting a group, (s)he could see its clone members and which clone fragments have been modified, created, or deleted. Clone Consistency Validating (task 5): This clone-aware feature is similar in spirit to the conflicting change detection in traditional SCM systems when two developers made changes to the same file. For code clones, the changes could be made to two cloned fragments. For example, A is a clone of a fragment B. Assume that A has been slightly modified into A′ by developer 1 to fix a bug. A′ is then checked into the repository. In this case, Clever checks if those changes could potentially cause inconsistency. If true, it will bring up the cloned fragments of A including B to the developer’s attention along with editing operations from A to A′ for his reference. Clone Synchronizing (task 6): Those recommended operations could be applied into B one-by-one and the developer can verify the correctness of the result. If (s)he decides that the changes from A to A′ do not apply to B, then B can be kept the same. Assume that later, developer 2 checks out and modifies B into B ′ . When B ′ is checked in, Clever will perform consistency validation on the changes from B to B ′ against those from A to A′ . If the changes are not consistent, they will be brought to the developer for verification as well. Moreover, another type of consistency validation is provided in Clever when a user simply copies a code fragment and renames identifier(s). Clever is able to recommend consistent renaming operations for the applicable identifiers. Clone Merging (task 6): This clone-aware feature corresponds to the merge function in traditional SCM. For example, if the changes to B are consistent with those to A, Clever will perform a recommendation for automatic incorporation of the changes to A into B ′ . If inconsistency occurs, the developer would have to manually update B ′ for consistency based on the reported changes from A to A′ . The merging result B ′′ after human verification will be checked into the repository. 1.2 Approach Overview Let us give an overview of our approach to build such cloneaware SCM system. In addition to reusing SVN’s text linebased storage representation for code, Clever also views a program as an abstract syntax tree (AST). Any sufficiently large subtree of the program’s AST is called a fragment and considered as a potential clone. Two fragments, i.e. two AST’s subtrees, are considered as clones, and called a clone pair, if they are sufficiently similar in structure. Clever measures their structural similarity by the distance of their structural characteristic vectors, extracted by using Exas method [33]. All detected clone pairs form a clone graph. The connected components in that graph are considered as clone groups. Since Clever uses tree-based representation for source code and clones, a new tree edit scripting algorithm, called Treed, is developed to capture and represent their changes. Treed takes two text-based versions of an arbitrary fragment from SVN, parses them into two ASTs, and then computes a sequence of tree editing operations, i.e. a tree editing script, transforming one version to the other. Treed also identifies the matched and different nodes between two trees. From such tree-based changes of the program, Clever derives the change sets of the fragments and updates the clones and groups accordingly. It also analyzes the changes to the cloned fragments to find potential inconsistent changes and recommends the relevant synchronization and merged results. The main contributions of this paper include: 1. A novel clone-aware SCM system named Clever, 2. A new tree edit scripting algorithm, 3. An efficient algorithm for clone detection and updating, 4. A novel clone change analysis method that can detect inconsistent changes to the cloned code and recommend the relevant synchronization and merging, and 5. An empirical study that shows the benefits of Clever. The next section describes Treed. Section 3 is about clone detection and updating. Section 4 discusses our clone change analysis method. Evaluation is given in Section 5. Related work is described in Section 6. Conclusions appear last. 2 T REE E DITING 2.1 Tree-based Code Representation In Clever, an AST’s subtree T is an ordered, attributed tree and modeled as a set of nodes associated with five functions R, P , C, type, and val. R(T ) is the root node of the tree. For each node u ∈ T , P (u) is its parent, C(u, k) is its kth child, type(u) is its AST node type, and val(u) is its attribute value. val is generally used for identifiers, literals, and operators. For the AST nodes representing the control structures (e.g. if, while), their vals are empty. Figure 1 shows an example. Leaf nodes are in rectangles. Inner nodes are in rounded rectangles. 2.2 Editing Operation and Editing Script When the code changes, its corresponding AST is considered to be edited, i.e. transformed, into the tree representing the new code by an editing script, i.e. a sequence of tree editing operations. As the common tree edit approaches, Treed uses the following tree editing operations for an AST: 124 126 112 Legend IF: if statement EXPR: expression ASGN : assignment ID: identifier LIT: literal T E EXPR > ID a ID b if (a>b) a = a - b; else ok = true; if (a>b) a = a + 1; IF A ASGN ID a E' EXPR > ID a EXPR + ID a T' update LIT 1 delete ID b IF insert A' ASGN ID a ASGN EXPR - ID a ID ok insert LIT true insert ID b insert Fig. 1. Tree Editing Example: In the label of each node, its type is in capital font and its val (if exists) is in normal font. U pdate(u, x) changes the value val(u) of node u into x. Insert(u, v, k) inserts node u as the kth child of node v. Delete(u) deletes node u, and inserts its children as the new children of its parent node. M ove(u, v, k) moves the subtree rooted at node u to node v in which u becomes the kth child of node v. In the example in Figure 1, an if statement was edited by modifying the if branch and adding an else branch. The two trees represent the two versions. An editing script consists of one Delete (dotted line box), one U pdate (double-line box), and four Insert operations (bold boxes). Other nodes (singleline boxes) are either unchanged or moved. As we can see, an editing script could be used to describe the change to the code. For example, the nodes that are unaffected by the editing operations can be considered as unchanged, and the affected ones are the changed ones. For any two trees, there will always exist at least one editing script (such as the script that deletes all nodes of the first tree and inserts all the nodes of the second). However, there might exist more than ones. Since such scripts give the same result, we could use the optimal script with the minimum number of operations to model developers’ rational editing. Definition 1 (Tree Editing Script Problem): Given two AST’s (sub)trees T and T ′ , find the optimal editing script that transforms T into T ′ . The number of operations of the optimal editing script for T and T ′ is generally referred to as their editing distance. This could be used to measure their similarity. That is, the shorter the editing distance is, the more similar the trees are [18]. Thus, finding the optimal editing script is helpful in both detecting cloned code and analyzing their changes. However, finding the optimal editing script could be inefficient in large-scale projects since the numbers and the sizes of the trees needed to be processed are huge1 . Therefore, instead of developing such an exact algorithm, we alternatively develop a heuristic algorithm, called Treed (Tree Edit) to efficiently find one editing script. 2.3 Treed Algorithm The task of Treed is to take two text-based versions of an arbitrary portion of code from an SCM tool, parse them into AST’s subtrees, and to compute a tree editing script, as short as possible, that transforms the old version to the new one. 1. The subject systems in our experiments usually have several thousands of code fragments with the minimum size of 50 nodes. 2.3.1 Map Relation versus Editing Script The key insight of Treed is that the nodes in T that are unchanged, updated (i.e. vals are changed), moved, or have no/few changes in its subtrees should be kept as many as possible, rather than deleting them and inserting new nodes, because the latter case creates a longer script. If a node u of T is kept, it must correspond to a node u′ of T ′ . We consider them to be mapped to each other. In Figure 1, nodes E, A, and [EXPR +] are mapped to E ′ , A′ , and [EXPR -], respectively. Given an editing script ∆ for T and T ′ , one could always determine all mapped nodes between them based on the (un)changed, moved, and updated nodes caused by ∆. That is, one could derive a map relation between the nodes of T and T ′ . The map relation is unique. In the other direction, given a map relation ℜ, one could always derive an editing script ∆ because one could imply from the map ℜ the editing operations that were applied on each node. For example, if a node u ∈ T is mapped to a node u′ ∈ T ′ and their val attributes are different, then u is updated. If u or u′ are unmapped, i.e. they are not mapped to any node in the other tree, they are deleted or inserted. The derived script ∆ might not be unique or optimal. The more nodes can be mapped, the more concise the derived script is. Based on that knowledge, Treed works in two steps. The first step is to determine the map relation ℜ between the nodes of T and T ′ , with as many nodes being mapped as possible. Then, the next step is to derive the editing script ∆ from ℜ. 2.3.2 Find the Map Relation Treed finds the map relation with the following observations: 1. If a leaf node belongs to (i.e. its val text sits in) an unchanged line of code (LOC), it is considered as unchanged. In general, the values, i.e. identifiers, literals, and operators, usually lie completely in a line. For example, Java syntax does not allow an identifier or a number to span across multiple lines. Extremely long strings lying in multiple lines are unusual. In Figure 1, the first line (containing the expression if (a > b)) is unchanged, thus, all leaf nodes of the expression E are unchanged. 2. Two inner nodes t ∈ T and t′ ∈ T ′ should not be mapped if they do not have any two mapped descendant nodes in corresponding subtrees. Since there might exist more than one mapping candidates for each node, it should be mapped only to another node such that two respective subtrees are sufficiently similar. 125 127 113 changed changed unchanged unchanged changed changed unchanged unchanged t unchanged changed 1 unchanged 2 1 Fig. 2. Alignment of Unchanged Lines of Code ′ In Figure 1, we see that E should be mapped to E since they contains mapped (unchanged) leaf nodes. E should not be mapped to A′ since they have no mapped descendant. Of course, it should not be mapped to T ′ (i.e. the entire if statement). Although E and T ′ have the mapped descendants, E ′ is more similar to E than T ′ (will be explained later). Those observations suggest that 1) the map relation of unchanged leaf nodes could be found based on unchanged LOCs; 2) the map relation should be calculated from bottom up, i.e. from leaf nodes up to the root, and 3) the map relation of inner nodes should be based on the similarity of their corresponding subtrees. Mapping. Treed maps the nodes in the following steps: 1. Map Leaf Nodes of Unchanged LOCs. First, Treed uses the text line comparison feature in SVN to compare the textbased representation of two versions to detect the unchanged LOCs. After this, the alignment between unchanged lines will partition the text lines in two versions into (un)changed segments as in Figure 2. Treed parses the versions into two trees T and T ′ , and marks the leaf nodes belonging to unchanged LOCs as “unchanged”. Then, it traverses T and T ′ in pre-order and returns two sequences of leaf nodes. Unchanged leaf nodes in such two sequences are mapped one-by-one in that order. In Figure 1, two sequences of leaf nodes are [a, b, a, a, 1] and [a, b, a, a, b, ok, true]. Because the first two nodes of those sequences belong to an unchanged LOC (if (a > b)), they are mapped one to one: a → a, b → b. 2. Map Leaf Nodes of Changed LOCs. The next step is to map leaf nodes belonging to changed LOCs. Segments of changed lines contain all changed leaf nodes and might contain also unchanged leaf nodes (e.g. a line might be just partially changed). For example, two sequences of nodes [a, a, 1] and [a, a, b, ok, true] correspond to changed text lines. However, the first node a is unchanged in a = a + 1. To find mapped nodes, for each pair of aligned segments of changed lines in two versions, Treed finds the largest common subsequences between the corresponding sequences of leaf nodes. Two nodes are considered matched if they have the same type and value. The matched nodes of the resulting subsequences are mapped together as unchanged leaf nodes. In the above segments, Treed will find the common subsequences [a, a] and mapped the corresponding nodes [a, a]. 3. Map Inner Nodes Bottom-Up. After mapping the leaf nodes, Treed maps the inner nodes bottom-up. If an inner node t ∈ T has a descendant t1 (inner or leaf node) and t1 is mapped to t′1 , t will be compared to any ancestor of t′1 . Then, t will be mapped to a candidate node t′ if the subtrees rooted at t and t′ are sufficiently similar in structure and type. If no such t′ exists, t is unmapped. For example, both E ′ and T ′ contains mapped nodes to the 3 maps t' 4 5 1' 2 4' 3 2' LCS 4 3' 5 no map 1' 4' 2' 3' Result: 1 <----> 1' 2 <----> 2' 3 <----> 3' 4 <----> 4': moved 5 was deleted Fig. 3. Alignment of Nodes in Top-Down Pass nodes in E. However, because E is identical to E ′ , they are mapped to each other. Similarly, A is mapped to A′ , although they are not identical in structure. Structural Similarity Measure. To measure the structural similarity of the trees, Treed uses Exas [33] characteristic vectors, which are shown in our previous work to be efficient and accurate in capturing the structure of trees and graphs. Using Exas, each tree is assigned an occurrence-counting vector of its structural features, such as the label sequences of its paths. For any two trees with two vectors x and y, their 2x−y . That is, the structural similarity is defined as 1 − x+y smaller their vector distance, the more they are similar. Larger trees (i.e. having large vectors) are allowed to have a larger distance within the same level of similarity. More details on Exas could be found in [33]. 4. Map Nodes Top-Down and Derive Editing Operations. The bottom-up process might be unable to map some nodes, such as the relocated nodes and renamed identifier nodes. Thus, after bottom-up mapping, Treed examines the two trees top-down and maps those not yet mapped nodes based on the already mapped nodes. Given a pair of mapped inner nodes t and t′ from the bottom-up pass, Treed determines the mapping between their children nodes. Firstly, Treed performs a greedy algorithm to find additional mapped nodes between the children nodes of t and t′ . The already mapped nodes are kept. If an unmapped child node is an inner one, their descendants are compared based on Exas structural similarity as in the previous step. If it is an unmapped leaf node, Treed computes their similarity based on their val attributes. Since those attributes are generally identifiers and literals, they are first separated as sequences of words, using well-known naming conventions such as Hungarian or Camel. For example, “getFileName” is separated into “get”, “file”, and “name”. The similarity of two words is computed via the Levenshtein distance, a string-based similarity measure. After the children nodes are mapped, a largest common subsequence (LCS) algorithm is run on those two sequences. From the mappings, the corresponding operations are derived for nodes. For example, if a node is mapped to another node with a different val, it is considered updated. If they are at different locations, it is considered to be moved. The unmapped nodes are considered as deleted or inserted. Figure 3 shows an example. t and t′ are mapped from bottom-up pass. After running the greedy algorithm, Treed finds all mapped nodes except node 5. From the alignment after the LCS algorithm, Treed derives the mappings, and the move and delete operations on nodes 4 and 5, respectively. 126 128 114 2.3.3 Derive the Editing Script After the map relation and the editing operations are determined, Treed traverses T and then T ′ to generate the editing script. T is traversed in the post-order to assure that in the editing script, the deletion of children nodes occur before that of parent nodes. In contrast, T ′ is traversed in the pre-order for the parent nodes to be inserted before the children nodes. 3 3.1 C LONE D ETECTION AND U PDATE Important Concepts and Formulation Clone detection is the process to detect the clone relation between portions of code of interest. Clone updating is the process to update the clone relation between code portions when changes occur to the codebase. This section presents our formulation of clone detection and updating. Firstly, in Clever, any sufficiently large portion of code which represents a complete syntactical unit of a program, such as a class, a method, a statement, etc is considered as a fragment. Since Clever represents a program as an AST, a fragment is modeled as a subtree of an AST whose size (i.e. the number of nodes in the subtree) is larger than a predefined threshold. Each fragment has an Exas characteristic vector [33] used for similarity measurement as described in Section 2. If two fragments are sufficiently similar, measured by a relevant code similarity measure, they are considered to be clones of each other, and are called a clone pair. In Clever, two fragments with the distance between their vectors smaller than a threshold are considered as a clone pair. All clone pairs forms the clone relation among all fragments. The clone relation could be conveniently represented as a clone graph, in which each node is a cloned fragment and each edge represents a clone pair. In Clever, a clone group is modeled via a connected component in the clone graph. Definition 2 (Clone Detection): Given a program as a set of fragments, clone detection is to build the corresponding clone graph and clone groups. Definition 3 (Change Sets): The change to a program is represented by three change sets: the sets of newly created, deleted, and modified fragments. Definition 4 (Clone Update): Given the change sets to a program, clone updating is to update the clone graph corresponding to the changes. After updating, the newly created, deleted, modified, and unmodified clones are reported. Clever also reports unchanged and changed groups including newly created, disappearing, expanded, shrunk, and modified groups (see Section 3.2.2). 3.2 Algorithmic Solution 3.2.1 Clone Detection Clone Detection aims to build the clone graph for the first time. It could be started at any version and includes four steps: 1. Generate Vectors. The first step of Clone Detection is to build the fragment set. Clever reads all source files of the working version. Then, each source file is parsed into an AST. Clever traverses the AST and computes the Exas characteristic vectors for fragments (i.e. subtrees in the AST). 2. Hash Vectors into Buckets. To build the clone graph, one could find the clone pairs by a pairwise comparison on all fragments. Pairwise comparison is not efficient because a program usually has a large number of fragments. For example, JDK 1.6 with about 3,200KLOCs has over 10K fragments. In Clever, we use locality-sensitive hashing (LSH) functions to find clone pairs, which have similar characteristic vectors. A LSH function is a hash function for vectors such that the probability that two vectors having a same hash code is a strictly decreasing function of their corresponding distance [1]. In other words, two vectors having a smaller distance will have a higher probability of having the same hash code, and vice versa. We use LSH functions (described in [1]) to hash the fragments into smaller sets that we call buckets, based on the hash codes of their vectors. The cloned fragments, i.e. the fragments having similar vectors, tend to be hashed into the same buckets. The other ones are less likely to be so. To increase the chance for any two cloned fragments to be hashed into the same bucket, Clever uses multiple hash functions. Thus, if such two fragments are missed by a hash function, they still have chances to be mapped into the same bucket from the other LSH functions. Each fragment will be hashed into N buckets indexed by its hash codes produced from N independent hash functions. Hash codes produced from different hash functions are made to be different, i.e. no bucket is shared for two functions. 3. Detect Clone Pairs from a Bucket. After hashing, Clever does pairwise comparison for all fragments in each bucket to detect clone pairs. All detected pairs form the clone graph. 4. Build Clone Groups. A fragment could be cloned multiple times, i.e. having more than one clones. Clever reports clones in groups to reduce the redundancy of reporting a clone multiple times in different pairs. Clever traverses the clone graph and detects its connected components as clone groups. 3.2.2 Clone Update The steps for updating clones after changes occur include: 1. Derive the Change Sets. The first step of clone updating is to derive the change sets, i.e. the sets of newly added, deleted, and modified fragments from the changes to the program. To do this, Clever first finds the changed files (i.e. deleted, added, and modified source files). • Fragments of deleted files are put into the set of deleted fragments. • The added files are parsed and traversed as in clone detection to produce newly added fragments. • For each modified file, its two versions are mapped and compared by our tree mapping algorithm in Section 2. If a fragment (i.e. a subtree) of the old version could not be mapped to any fragment of the new version, it is considered as a deleted fragment. Similarly, a fragment of the new version is considered as a new fragment, if it is not mapped to any fragment of the old version. For the fragments that can be mapped between two versions, only the fragments that are affected by the resulting edit script are considered modified fragments. The other fragments are considered as unchanged fragments. 127 129 115 2. Update the Clone Graph. After the change sets are built, the second step is to update the buckets and the clone graph. • Deleted fragments are removed from the clone graph and from buckets. • Each modified fragment is compared with each one of its clones to check whether they are still clones of each other. If not, that clone pair is removed. • All modified and new fragments are (re)hashed into the buckets, and are compared to other fragments (existing or newly added) in those buckets to find the new pairs. The clone graph is then updated with those new pairs. 3. Re-detect Clone Groups. Clone groups are re-detected from the new clone graph and are compared with the groups in the previous version to derive the changes to the groups. Clone A for (i = 0; i < nRegs; i++) { ppTotal[i].start = prMem[i].addr; ppTotal[i].nBytes = prMem[i].size; ppTotal[i].more = ppTotal[i+1]; } Clone A’ for (i = 0; i < nRegs; i++) { ppTotal[i].start = prMem[i].addr; ppTotal[i].nBytes = prMem[i].size; if (i+1 < nRegs) ppTotal[i].more = ppTotal[i+1]; } Fig. 4. Clone Change and Inconsistency a) cloning A cloned B A' B' A cloned B change For source code, a SCM tool provides the function to detect the conflicting changes to the code and merge them consistently. For cloned code, similar functions are also needed, although the scenarios and the analyses are different. An Illustrated Example. The illustrated example in Figure 4 is similar in spirit to a real clone-related bug described in [26], with slight modifications. In Figure 4, A is a portion of code from developer 1 for transferring data from nRegs memory blocks into the ppTotal array. Then, it was cloned by developer 2 into the portion of code B for storing information of tRegs processed memory blocks in another array named ppTaken. However, developer 2 forgot to rename an instance of ppTotal into ppTaken. This caused a bug due to inconsistent editing. Both A and B were checked into the SCM repository. Later, developer 2 checks out and modifies B to fix the error. (S)he also adds a statement to calculate the number of unprocessed blocks (see Clone B ′ ). At the same time, developer 1 detects a potential “out of bound” error in A, which could happen with an access to ppTotal[i+1] when i = nRegs-1. (S)he modifies A into A′ by adding the if statement. Unfortunately, since traditional SCM tool does not maintain the clone relation, (s)he might A c) two-side change change C LONE C HANGE A NALYSIS B synchronize 4 cloning b) one-side change change 3.2.3 Clone Report A clone might be a part of another clone. For example, if a class is cloned from another class, its methods will also be clones of the methods of that class. Therefore, any clone group whose all members belong to the members of another clone group will be considered redundant, and will not be reported. Since the changes to the clone graph is stored after each revision and the cloned fragments could be mapped between revisions, when reporting the clone groups at each revision, Clever is able to map the current groups with the ones in the previous version. Thus, Clever is able to report the changes to each clone group. For example, a group with new members is called “expanded”; a group with removed members is called “shrunk”; a group with all un-changed members is considered as “un-changed”; a group appearing only in the new version is called “new”; a group appearing only in the old version is considered as “disappearing”; and other types of groups are considered as “changed”. Clone B for (i = 0; i < tRegs; i++) { ppTaken[i].start = prMem[i].addr; ppTaken[i].nBytes = prMem[i].size; ppTaken[i].more = ppTotal[i+1]; } Clone B’ for (i = 0; i < tRegs; i++) { ppTaken[i].start = prMem[i].addr; ppTaken[i].nBytes = prMem[i].size; ppTaken[i].more = ppTaken[i+1]; } lRegs = nRegs - tRegs; A' B' merge B'' Fig. 5. Clone Change Scenarios not know the existence of B and could not apply that fixing change into B. Similarly, developer 2 might not know about the change to A and could not incorporate it into B ′ . Scenarios. The example shows that the changes to cloned code could cause the inconsistency and potential bugs. Clonerelated bugs have been reported in previous research [19], [26]. More importantly, the example illustrates three possible scenarios in which clone change analysis are needed for consistent editing. Three scenarios are shown in Figure 5: 1) The first scenario is the cloning task itself. That is, when A is cloned into B, B is usually not textually identical to A, but slightly modified, such as in renaming of the identifiers. Thus, the changes in the cloning process need to be analyzed for consistency. 2) The second scenario, called one-side change, happens when A is modified, and B is unchanged (or vise versa). When A is checked into the repository, B might need to be synchronized, i.e. consistently updated with respect to the changes to A (e.g. the fixing changes to A). 3) The third scenario, called two-side change, occurs when both A and B are under modification. Changes to A were committed into the repository. Then, when B are modified and committed, the changes to A need to be incorporated (merged) into the changes to B. Sources of Inconsistency. The example also illustrates two major sources of inconsistency: 1) inconsistency caused by renaming identifiers during cloning, and 2) inconsistency by modifying program control structures. These types of clone inconsistency were also reported in [19], [26]. There are changes that reduce the inconsistency (such as the modification to B). Other changes could increase the inconsistency between the cloned code (such as when B is cloned, or when A is modified in the above example). Therefore, the 128 130 116 clone changes need to be analyzed for consistency and then relevant clones need to be consistently updated. Clone Change Analysis Operations. Clever provides several operations to deal with the analysis and consistent updating of clones and their changes. The rest of this section will present those operations. 4.1 Clone Matching and Differencing Clone Matching and Differencing aims to find the matched and different elements between two cloned fragments. For example, for A and B in the example, it could find all the matches between the identifiers and program structures in the corresponding ASTs. For example, between A and B, the mapped identifiers are i → i, nRegs → tRegs, ppTotal → ppTaken, prMem → prMem, and ppTotal → ppTotal. Therefore, the bug of missing a rename operation on ppTotal in B can easily be found. Clone Matching and Differencing could also be used to show the changes between two versions of a cloned fragment. For example, between A and A′ , it finds the addition of the if statement; between B and B ′ , it finds the renaming of ppTotal to ppTaken in the last expression. In general, these two operations can be used in all 3 scenarios in Figure 5. Clone Matching and Differencing is based on Treed. From the editing script returned by Treed for two clones or two versions of a clone, their matches and differences are computed from the nodes and subtrees mapped by the editing script. For example, the differences are unmapped, updated, or moved elements of such clones (or versions). After matched and different elements of two clones are identified, Clever finds the inconsistencies between them using Clone Consistency Validating operation. 4.2 Clone Consistency Validating In general, it is not easy to detect all different types of inconsistencies between cloned code because the nature of inconsistency in clones depends very much on the semantics of the code and on the intention of the developers who create and change the clones. In many cases of inconsistency, there is no explicit semantics dependency between the code fragment and its cloned one. This is the major difference between clonerelated inconsistency and the notion of conflicts in traditional SCM (In SCM, if there is no semantics dependency between two changes, there is no conflict). Clever aims to detect only the inconsistencies involving 1) the changes of identifiers, 2) control structures, and 3) literals, which have been shown to be the major sources of clonerelated bugs [19], [26]. Clever applies the following criteria on the mapped elements between clones and their changes to find the inconsistencies. Definition 5 (Identifier Consistency): Given two cloned fragments (or two versions), each identifier in one fragment is mapped to one and only one identifier in the other fragment. For example, ppTotal should be mapped only to ppTaken. However, it is mapped to both ppTaken and ppTotal, thus, the cloning change is potentially inconsistent. Definition 6 (Structure Change Consistency): The changes to control structures (e.g. statements, expressions, etc.) of the clones should be the same. In the illustrated example between A′ and B, an additional if statement was inserted in A, but it does not appear in B, thus, it is potential inconsistency. Definition 7 (Value Change Consistency): The changes to special values of literals or the names of invoked methods should be the same. Clever is interested in only special values for literals, such as null, 0, 1, empty string, true, and false. This is based on the assumption that developers associate the special values with some meanings. When they are changed, developers might want to check them. Similarly, if a different method is called, it is likely that they intend to change the semantics of the code. Thus, such changes should be checked. Generally, Consistency Validating is needed in all 3 scenarios in Figure 5. 4.3 Clone Synchronizing Clone Synchronizing is the operation designed for two clone change scenarios, cloning and one-side change, that is, when there is only one clone that was changed. For the two-side change, Clever uses Clone Merging, which will be discussed later. In the illustrated example, the synchronization will be applied to B when A changes into A′ . Between A′ and B, there are two inconsistencies: 1) of identifiers (ppTotal is mapped to two identifiers), and 2) of control structures (compared with B, A′ has an additional if statement). Clone Synchronizing works as follows. For the identifier inconsistencies, Clever recommends the mapping with the most frequencies. For example, the map ppTotal-ppTaken appears three times, while ppTotal-ppTotal appears once. Thus, ppTotal-ppTaken is recommended. This recommendation is based on the assumption that the developer wanted to change, and he has changed almost all instances except a few. In the cases that the mappings have the same frequency, the one with different node values is recommended (such as ppTotalppTaken). It is assumed that when cloning, the developers are likely to rename identifiers (In [26], it was reported that 6567% of clones in Linux involves identifiers’ renaming). For changes in control structures, literals, and method calls, Clever recommends those changes to the mapped elements in the unchanged clone. For example, in Figure 6, it recommends the addition of the if statement into B at the corresponding position. Of course, when applying the changes into B, Clever also recommends the corresponding identifiers to produce consistent code (nRegs in the condition is renamed to tRegs). Figure 6 shows the synchronized code that Clever suggests. It will apply the changes if the recommendation is accepted. 4.4 Clone Merging Clone Merging is used for the two-side clone change scenario, in which two clones are modified at the same time. We assume that both A and B are modified, but only changes of A need to be synchronized into B ′ (since A′ was committed before B ′ ). In this section, let us show that Clone Merging operation can be solved by two operations: Clone Synchronization (in 129 131 117 Clone A’ for (i = 0; i < nRegs; i++) { ppTotal[i].start = prMem[i].addr; ppTotal[i].nBytes=prMem[i].size; if (i+1 < nRegs) ppTotal[i].more=ppTotal[i+1]; } Clone B* for (i = 0; i < tRegs; i++) { ppTaken[i].start = prMem[i].addr; ppTaken[i].nBytes=prMem[i].size; if (i+1 < tRegs) ppTaken[i].more=ppTaken[i+1]; } // ApplicationSpecificPreferencePage.java ... ApplicationSpecificRegistry.getInstance(). removeApplicationSpecificData(getSelectedAppSpecificObject()); ... for (ApplicationSpecificObject obj: getSelectedAppSpecificObjects()) { ApplicationSpecificRegistry.getInstance(). removeApplicationSpecificData(obj); } Fig. 6. Clone Synchronizing Example Fig. 9. Treed Running Result Example ize B e ng c hr on a ch change A' cloned syn A B' B* tree-based code merge B'' Fig. 7. Clone Merging Section 4.3) and a classic three-way code merging operation in traditional SCM tools [29]. When B ′ is committed into the repository, Clever checks and sees that A, a clone of B, was modified. Thus, clone merging is required. Clever will perform a Clone Merging operation in two steps (see Figure 7): 1) Clone Synchronization is applied on B by taking into account the changes from A to A′ . That is, the change to A is incorporated into B to produce a temporary version B ∗ . 2) A three-way Code Merge operation is applied to B ∗ and B ′ to produce the final merging result B ′′ . This is a three-way code merge since both B ∗ and B ′ were modified from B. In Clever, we use Lippe’s algorithm [27] for three-way code merging. It is an operation-based merging algorithm, which models a change between two versions as explicit operations. The merge result is produced by the application into the base version B the merged sequence of operations from two parallel sequences of operations (one sequence was applied from B to B ∗ and the other from B to B ′ ). Two important inputs of Lippe’s algorithm are the set of operations and the definition of conflicts among each pair of operations. The set of operations that we used was defined in Section 2 on the AST of the program. Two operations from two parallel sequences are considered as conflict if 1) They apply on the same AST node and give different Clone A’ for (i = 0; i < nRegs; i++) { ppTotal[i].start = prMem[i].addr; ppTotal[i].nBytes=prMem[i].size; if (i+1 < nRegs) ppTotal[i].more = ppTotal[i+1]; } Clone B* for (i = 0; i < nRegs; i++) { ppTaken[i].start = prMem[i].addr; ppTaken[i].nBytes=prMem[i].size; if (i+1 < tRegs) ppTaken[i].more=ppTaken[i+1]; } Clone B’ for (i = 0; i < tRegs; i++) { ppTaken[i].start = prMem[i].addr; ppTaken[i].nBytes=prMem[i].size; ppTaken[i].more = ppTaken[i+1]; } lRegs = nRegs - tRegs; Clone B” for (i = 0; i < tRegs; i++) { ppTaken[i].start = prMem[i].addr; ppTaken[i].nBytes=prMem[i].size; if (i+1 < tRegs) ppTaken[i].more=ppTaken[i+1]; } lRegs = nRegs - tRegs; Fig. 8. Clone Merging Example results. For example, if one renames ppTotal to ppTaken, and one renames to ppProcessed, then they are conflicting. 2) An operation removes the nodes which the other operation needs. For example, an operation adds a new expression into a statement that was removed by the second operation. Thus, they are considered as conflicting changes. When Clever detects the conflict between the changes from B to B ∗ and to B ′ , it requires the manual merge from users. If no conflicting change, Clever applies Lippe’s algorithm for merging. Figure 8 shows the operation applied on the illustrated example. B ∗ is synchronized as described in Figure 6. Then, B ′ and B ∗ are merged. First, the if statement is added into the last statement. Then, renaming is applied to ppTotal. In this case, there are two renaming operations (one of B ∗ and one of B ′ ) applied on the same node. However, two renaming operations are identical. Thus, they could be merged. At last, the new statement for the calculation of lRegs is added. 5 E MPIRICAL E VALUATION 5.1 Treed Algorithm We conducted an experiment to evaluate the accuracy and performance of our Treed algorithm. We used GEclipse from revision number 6,000 to 15,000. We randomly selected 100 revisions. For each revision, we picked one Java file and ran Treed on that file and its previous revision. We manually checked the output editing scripts for those files and found that in 92 out of 100 cases, Treed gave correct results. To illustrate interesting results, we discuss the following examples. In the example shown in Figure 9, in the previous version, a developer invoked removeApplicationSpecificData function on a single ApplicationSpecificObject. In the new version, (s)he added a for loop to repeat the invocation of the same function on multiple objects returned by getSelectedAppSpecificObjects. Treed correctly returns the editing script that inserts the nodes representing the for loop and its parameters (shown in bold face), deletes the old parameter getSelectedAppSpecificObject() of that function call, and then inserts the new parameter obj. The second example illustrates an interesting case that Treed was not quite informative. A developer modified the statement return this.vo.getName(); into return this.vo != null ? this.vo.getName():"Vo-Wrapper"; Since Treed found their structure not sufficiently similar, it could not map them and returned a script that deletes the old statement and inserts the new one. The other incorrect cases are similar to this case in nature. That is, when Treed could not match two inner nodes because they are too structurally 130 132 118 Project Axis2 Columba GEclipse jEdit Struts2 TomCat Xerces LOC 477K 193K 326K 175K 121K 321K 213K Frgm 19436 6700 10669 6053 4101 9649 6678 Cov 28% 15% 34% 6% 20% 10% 14% Clever Time Prcs 47s 98% 18s 95% 28s 96% 11s 98% 27s 94% 28s 98% 17s 100% Cov 19% 6% 12% 4% 8% 7% 11% CCFinderX Time Prcs 284s 100% 67s 100% 113s 98% 50s 100% 42s 100% 100s 96% 72s 99% TABLE 1 Clone Detection Result different, it returns the operation sequence [Delete, Insert] on those two nodes. We counted those cases as incorrect ones. 5.2 Clone Detection and Update We conducted an experiment to evaluate Clever’s performance in both clone detection and update on a Windows XP computer using Intel(R) Core(TM) 2 Duo T7300 2GHz, 3GB RAM, and 80GB HDD. Clever was configured to process fragments with the minimum size of 50 nodes, 32 independent hash functions, and the similarity threshold σ of 0.8. 5.2.1 Clone Detection We chose a version for each of 7 subject systems and committed it into Clever/SVN to evaluate detection performance. Table 1 shows the result on clone detection from Clever, in comparison with the clone detection tool CCFinderX [21]. Columns LOC and Frgm show the number of lines of code and that of fragments in the subject systems. Column Time shows the detection time, measured in seconds. Each tool was run 3 times. Processing time of Clever is indicated by the longest one to avoid the effects of file (disk) I/O caching. For CCFinderX, we took the shortest one. Columns Prcs and Cov represent precision and completeness. In general, precision is defined as the percentage of the correctly detected clones in the total detected ones, and completeness is usually expressed in recall, i.e. the percentage of the correctly detected clones in the total existing ones. However, it is impractical to determine all existing and check all reported clones in large projects. Thus, we manually checked 100 reported clone pairs to estimate the precision. Completeness is represented by coverage, i.e. the percentage of detected cloned LOCs in total LOCs as in [18]. The result shows that Clever is faster and more complete than CCFinderX, while maintaining the equivalent level of high precision. For instance, on the largest subject system, Axis, of about 500KLOC, Clever takes less than 1 minute to detect more than 135KLOC of clones, while CCFinderX takes nearly 5 minutes for only about two thirds of that amount. 5.2.2 Clone Update To examine Clever’s clone updating, we selected several consecutive revisions of three subject systems and then committed them into Clever/SVN. 100 consecutive revisions, from rev 101 to rev 200, were processed from the Columba project. In GEclipse and jEdit projects, 1,000 consecutive revisions were Revision Columba 100 110 150 200 GEclipse 1000 1010 1050 1100 1500 2000 jEdit 3000 3010 3050 3100 3500 4000 Update Cov Time 15% 33.0 15% 5.0 15% 1.8 15% 1.4 Re-detection Cov Time 15% 33 15% 33 15% 33 15% 34 Difference Pair LOC 0 0 0 0 0 0 0 0 37% 37% 37% 37% 37% 35% 32.0 3.4 0.9 0.6 0.3 0.4 37% 37% 37% 37% 37% 35% 32 32 32 32 33 39 0 0 0 0 0 0 0 0 0 0 0 0 7% 7% 7% 7% 6% 7% 15.0 5.8 3.0 2.5 2.0 1.8 7% 7% 7% 7% 6% 7% 15 16 17 17 18 18 0 0 0 0 0 0 0 0 0 0 0 0 TABLE 2 Clone Update Result processed, ranging from rev 1,001 to rev 2,000 and rev 3,001 to rev 4,000, respectively. Clone Detection is applied for the first revision, and Clone Update is applied for the following ones. At each revision, the result of Clone Update is compared to that of the re-run of Clone Detection at that revision. The comparison of both results including the differences in cloned lines and in clone pairs is shown in Table 2. The result shows that Clone Update gives exactly the same result as that of re-detection (e.g. all resulting clone pairs and cloned lines of code are the same). However, the average time of updating for each revision in Table 2 is much less than that of re-detection. This is reasonable because the change at each revision is often small (Table 3). Thus, the updating process needs to process less fragments than re-detection. Table 2 also shows that the coverage (i.e. percentage of cloned LOC) is almost unchanged, through the processed revisions. However, it does not imply that cloned code is unchanged. Details on clone changes will be discussed later. For updating, Clever needs to store additional clone information (e.g. fragments’ information, buckets, clone groups). Those storage costs are acceptable. For example, for the GEclipse project, its SVN data is about 207MB, and the overhead for clone detection is about 7MB. However, the benefits of clone management and the gain in efficiency far outweigh the storage costs. Storage costs could also be reduced by reproducing characteristic vectors, instead of storing them all, even for non-cloned fragments. For example, the fragments, buckets, and groups are stored. For updating, if the tool needs to access the vector of an existing fragment (without previously stored vectors), the corresponding file could be parsed to extract vectors for that fragment. This reduces the storage cost with a slightly increasing processing time. 5.3 Clone Change Analysis 5.3.1 Changes to Clones and Groups Table 3 shows the details on the changes to clones and groups. Column ∆F is the total number of changed, i.e. 131 133 119 Project Columba GEclipse jEdit Revision 100-200 1700-1900 3000-3100 ∆F 1316 1645 173 C+ 129 118 39 C25 53 197 C* 163 315 1008 Co 1012 1176 307 Opr 2.6 3.9 1.2 LOC 1.5 1.6 0.7 P+ 185 233 121 P1 398 82 52 P2 221 523 976 G+ 28 32 26 G25 26 9 G* 50 69 23 G> 21 72 164 G< 17 72 117 Go 366 153 403 TABLE 3 Changes of Clones and Groups Project Columba GEclipse jEdit Revision 100-200 1700-1900 3000-3100 SI 9 82 53 II 66 13 78 VI 9 15 12 TABLE 4 Clone Change Inconsistencies newly added, deleted, and modified fragments. Column C+, C-, and C*, are the total numbers of added, deleted, and modified clones, respectively, while Co is the average number of unchanged clones. Column P+, P1, and P2 are the total numbers of newly created (i.e. cloning), one-side changed, and two-side changed clone pairs, respectively. The next five columns represent the total numbers of changed clone groups, which are the numbers of newly created (G+), expanded (G>), shrunk (G<), disappearing (G-) groups, and the groups with changed members (G*), respectively. Go is the average number of unchanged clone groups. Table 3 shows interesting results. Firstly, the numbers of changed fragments and clones at each revision are small. On average, there are about 20 changed fragments, one newly created pair, and 3 modified pairs. The modifications to each clone are also small: about 4 operations and 2 LOCs on average. Many of the clones and groups are unchanged. In addition, most of clone changes are two-side (i.e. both clones are modified), or cloning (i.e. new clones are created). The number of one-side changes is the least among three types. 5.3.2 Clone Consistency Validation Table 4 shows the results of our Clone Consistency Validating experiment. The columns SI, II, VI are the numbers of clone pairs having structural, identifier renaming, and value changing inconsistencies, respectively (if a clone pair has two or more kinds of inconsistencies, they are counted accordingly). The table shows that most of inconsistencies are structural, that is, the cloned code tends to be modified in program structures. As shown in Tables 3 and 4, the changes to cloned code at each revision are often small. However, those changes could potentially create many inconsistencies. Therefore, the analysis and updating of clone changes are still needed at every SCM commit, to assure that there are as few clone-related inconsistencies as possible. Among those inconsistencies in subject systems, we found many interesting cases. Let us discuss the case of two cloned methods from two classes in Columba (see Figure 10). At one revision, only CopyMessageCommand was modified with the addition of the statement in boldface. In this one-side change scenario, Clever recognized the public class CopyMessageCommand extends FolderCommand { public void updateGUI() throws Exception { TableChangedEvent ev=new TableChangedEvent(UPDATE,destFolder); MailFrameController.tableChanged(ev); MainInterface.treeModel.nodeChanged(destFolder); } ... public class CheckForNewMessagesCommand extends FolderCommand { public void updateGUI() throws Exception { TableChangedEvent ev=new TableChangedEvent(UPDATE,inboxFolder); MailFrameController.tableChanged(ev); } ... Fig. 10. Structural Inconsistency Case from Columba Clone A → A’ catch(IOException io) { setAbortable(false); String[] pp = {path1, io.toString()}; VFSManager.error(browser, ”directory-error”,pp); } Clone B catch(IOException io) { String[] args = {io.toString()}; VFSManager.error(browser, ”ioerror”,args); } Fig. 11. Structural Inconsistency Case from jEdit two clones as structurally inconsistent. Clever’s result is correctly confirmed because at a later revision, a change was made to CheckForNewMessagesCommand to fix it by adding MainInterface.treeModel.nodeChanged(inboxFolder);. 5.3.3 Clone Synchronization and Merging To verify the quality of Clever’s clone synchronization, we performed a controlled experiment. In 3 subject systems (Columba, GEclipse, and jEdit), we randomly chose 10 cases in which in one revision, a clone pair A and B was created and was consistently modified in a later revision into A′ and B ′ . We made up a version consisting of A′ and B. In other words, we created one-side change scenarios. Then, we ran Clever’s clone synchronization on the make-up version of each case to produce the recommendation version B ∗ . At last, we compared B ∗ with B ′ . B∗ is considered as a correct recommendation if it has the same program structure and exactly matched identifiers as B ′ . Otherwise, we consider it incorrect. For all 10 cases, Clever produced the correct recommendation changes. Let us explain in details one interesting case among them. Figure 11 shows a clone pair A and B taken from jEdit revision 3791 (file BrowserIORequest.java). Later, A was changed to A′ by adding a new statement (setAbortable(false)) at revision 3925. When we ran Clever in this example, Clever detected them as structural inconsistency. Because it could detect the insertion of that new statement to A, it provided the recommended synchronization B using the Insert operations and returned the code shown in Figure 12. We compared B ∗ with B ′ , a real modified version of B at a later time, and found that they are the same. 132 134 120 Clone A → A’ catch(IOException io) { setAbortable(false); String[] pp={path1, io.toString()}; VFSManager.error(browser, ”directory-error”,pp); } Synchronize Clone B → B* (≡B’) catch(IOException io) { setAbortable(false); String[] args={io.toString()}; VFSManager.error(browser, ”ioerror”,args); } Fig. 12. Recommended Clone Synchronizing for jEdit Make clone A′ → A′∗ catch(IOException io) { setAbortable(false); String[] pp={path1,io.toString()}; VFSManager.error(browser,path1, ”ioerror.directory-error”,pp); } Merge Clone A′∗ → A′′ catch(IOException io) { setAbortable(false); Log.log(Log.ERROR,this,io); String[] pp={path1,io.toString()}; VFSManager.error(browser,path1, ”ioerror.directory-error”,pp); } Make clone B → B ′∗ catch(IOException io) { setAbortable(false); Log.log(Log.ERROR,this,io); String[] args={io.toString()}; VFSManager.error(browser, ”ioerror”,args); } Merge Clone B ′∗ → B ′′ catch(IOException io) { setAbortable(false); Log.log(Log.ERROR,this,io); String[] args={path1,io.toString()}; VFSManager.error(browser,path1, ”ioerror”,args); } Fig. 13. Recommended Clone Merging for jEdit Clone Merging. In that jEdit case study, we found that at revision 4411, both A′ and B ′ were modified into A′′ and B ′′ shown in Figure 13. To verify Clone Merging functionality of Clever, we created two intermediate versions A′∗ and B ′∗ from A′ and B ′ , respectively. Then Clone Merging was applied on both of them. The results matched with A′′ and B ′′ . 6 R ELATED W ORK Recent research has shown the benefits of management tools for code clones [12], [22], [23], [32]. A related research to Clever is CloneTracker clone management tool [12]. It is based on a clone tracking tool with the same name [11]. CloneTracker [12] uses CRD, a light-weight clone region description scheme, to map clone groups from the previous version to the current groups. However, some detected clones could be missed due to the approximate nature of CRD mapping [11]. In contrast, Clever uses its tree mapping algorithm, a more precise approach, that avoids losing detected clones in clone tracking. Moreover, from one version to another, CloneTracker re-runs the detection only on changed and currently tracked files. Thus, it could miss cross-revision clone pairs. Such a pair is the result of a copy of a fragment from an un-changed file into a changed one. Our previous experiment showed that there are nearly 46,000 cross-revision clone pairs in GEclipse [35]. Therefore, in both tracking old and detecting/updating new groups, CloneTracker cannot fully achieve completeness. Via Treed and SVN’s change report, Clever derives the changed fragments, and then updates clone groups, tracks old and detects new ones with reasonable storage overhead. Other clone tracking tools include [4], [17], [31]. Clone Detection Toolbox [32] uses Unix diff to get the changes to clones, tracks them in different versions, and updates its clone database. However, it requires the re-run of clone detection on the entire new version. Furthermore, its line-based tracking of clones does not adapt well with modifications [11]. Bakota et al. [4] proposed the mapping of clones from one version to another based on a light-weight AST-based similarity measure. Mende et al. [31] proposed a token-based similarity approach for grow-and-prune model in evolving software. In brief, the aforementioned clone tracking approaches might result in incompleteness in tracking/managing clones. None of them supports clone-aware synchronizing and merging. To reduce the time complexity for re-detection when software changes, recent research has focused on incremental clone detection [16]. iClones [16] represents each fragment as a sequence of tokens, stores all of such sequences on suffixtrees, and traverses such trees to find groups of identical and similar sequences. When code changes, the suffix-trees are updated with new/deleted sequences of the changed files, and then clone groups are re-produced. Clever’s clone updating was based on ClemanX [34], [35], our previous work on incremental clone detection. However, aiming for efficient redetection, ClemanX fell short of the goal of change management for clones and groups. In addition, it handles changes at the file level, rather than at the fragment level as in Clever. Many approaches for code clone detection have been proposed [6], [37]. Generally, they can be classified based on their code representations. The typical categories are textbased [10], [30], token-based [3], [21], [26], tree-based [5], [18], and graph-based [15], [24]. The text-based and tokenbased are usually efficient but could not detect the clones with many modifications. In contrast, graph-based approaches, though providing clones of higher level of abstraction, are time-consuming in detecting similar subgraphs. Deckard [18] introduced the use of vectors in clone detection. In [33], we showed that our vector representation for tree-based fragments is a more generalized and accurate approach than Deckard’s vectors. Deckard tool counts only the distinct AST node types in a subtree for a fragment, while Clever captures structural features via paths and sibling sets. Moreover, Clever differs from Deckard in the usage of LSH. For every fragment, even non-cloned ones, Deckard uses LSH to find similar fragments. In Clever, non-cloned fragments tend to be hashed to singleton buckets, and are not compared to other fragments. Chilowicz et al. [9] propose a signature for a subtree using a tree fingerprint method. Treed is similar in nature to ChangeDistiller [14]. However, our Exas similarity measure captures better the structural similarity than the counting approach of matched nodes in ChangeDistiller. Also, Treed uses the alignment of lines from SCM for the better detection of mapped nodes. Support for consistent editing for clones in CloneTracker [11] is interactive and only for a single user. With collaborative services from SVN, Clever supports clone change management in team development. Similar to CloneTracker, in Codelink editor [38], a user can modify a fragment and changes can be interactively applied to its cloned fragment. Both of them use token-based mapping between two clones. Clever uses tree-based alignment. Libra [28] searches fragments for simultaneous changes. Within Eclipse, CReN [17] tracks clones and helps consistently renaming of identifiers. In [25], consistency is defined in a text line-based approach. SCM systems have a long history [13]. While early SCM 133 135 121 tools (e.g. CVS) provide versioning for entire files, more advanced SCM systems [8] have also fine-grained version control support. However, none of existing SCM tools support clone-aware change management. Several approaches have been introduced for conflicting change detection/resolution in SCM tools. Direct conflict detection mechanisms, often parts of software merging techniques [29], handle only the changes to the same artifact in two different ways. Textbased merge tools (e.g. in CVS) consider software artifacts merely as text files. Syntactical merging is more powerful than textual merging because it takes the syntax of artifacts into account [2]. However, they cannot detect conflicts when the merged program is syntactically correct but semantically invalid. To deal with this, semantic-based merge algorithms were developed. Those algorithms [39] can detect behavioral conflicts. Recent advanced techniques for indirect conflict resolution [20], [36] raise the awareness among developers on the changes that semantically affect other artifacts. However, all of aforementioned approaches cannot deal with the changes to cloned code fragments. 7 C ONCLUSIONS This paper introduces Clever, a novel clone-aware SCM system which provides clone management support, including clone detection and update, clone change management, clone consistency validating, clone synchronizing, and clone merging. Clever represents source code and clones as the subtrees of ASTs, measures code similarity based on the structural characteristic vectors, and describes codes changes as tree editing scripts. We have introduced new algorithms and techniques to compute the tree editing script; to detect and update code clones; to analyze the changes of cloned code, to validate their consistency, and to recommend the relevant synchronization. Our empirical study on many large-scale, open-source systems shows that Clever is highly efficient and accurate in clone detection and updating, and provides useful consistent validation and synchronization of clone changes. Acknowledgment. This project was funded in part by a grant from Vietnam Education Foundation (VEF) for the first author. R EFERENCES [1] A. Andoni and P. Indyk. E2LSH 0.1 User manual. http://web.mit.edu/andoni/www/LSH/manual.pdf. [2] U. Asklund. Identifying Conflicts During Structural Merge. In Proceedings of Nordic Workshop on Programming Environment, pages 231–242, 1994. [3] B. S. Baker. Parameterized Duplication in Strings: Algorithms and an Application to Software Maintenance. SIAM Journal on Computing, 26(5):1343-1362, October, 1997. [4] T. Bakota, R. Ferenc, and T. Gyimothy. Clone Smells in Software Evolution. ICSM’07, pp. 24–33, 2007. [5] I. D. Baxter, A. Yahin, L. Moura, M. Sant’Anna, and L. Bier. Clone Detection Using Abstract Syntax Trees. ICSM’98, 1998. [6] S. Bellon, R. Koschke, G. Antoniol, J. Krinke, E. Merlo. Comparison and Evaluation of Clone Detection Tools. IEEE Transactions on Software Engineering, 33(9):577–591, 2007. [7] CCFinderX. http://www.ccfinder.net. [8] M. C. Chu-Carroll, J. Wright, D. Shields. Supporting Aggregation in Fine-grained SCM. In FSE’06, pp. 99-108. ACM, 2002. [9] M. Chilowicz, E. Duris, G. Roussel. Syntax Tree Fingerprinting for Source Code Similarity Detection. ICPC’09, IEEE CS, 2009. [10] S. Ducasse, M. Rieger, and S. Demeyer. A Language Independent Approach for Detecting Duplicated Code. ICSM’99. [11] E. Duala-Ekoko and M. P. Robillard. Tracking Code Clones in Evolving Software. In ICSE ’07, pp. 158-167, IEEE CS, 2007. [12] E. Duala-Ekoko and M. P. Robillard. CloneTracker: Tool Support for Code Clone Management. In ICSE08-Demo, 2008. [13] J. Estublier, D. Leblang, A. van der Hoek, R. Conradi, G. Clemm, W. Tichy, D. Weber. Impact of SE Research on the practice of SCM. ACM TOSEM, 14(4):383–430, 2005. [14] B. Fluri, M. Wuersch, M. Pinzger, and H. Gall. Change Distilling: Tree Differencing for Fine-Grained Source Code Change Extraction. TSE, 33(11):725-743, 2007. [15] M. Gabel, L. Jiang, and Z. Su. Scalable Detection of Semantic Clones. ICSE’08, pages 321–330, IEEE CS, 2008. [16] N. Gode and R. Koschke. Incremental Clone Detection. In CSMR’09, pages 219–228, IEEE CS, 2009. [17] P. Jablonski and D. Hou. CReN: a Tool for Tracking Copy-andPaste Code Clones and Renaming Identifiers Consistently in the IDE. In ETX’07, pages 16–20, ACM Press, 2007. [18] L. Jiang, G. Misherghi, Z. Su, and S. Glondu. Deckard: Scalable and Accurate Tree-based Detection of Code Clones. In ICSE’07. [19] L. Jiang, Z. Su, and E. Chiu. Context-Based Detection of CloneRelated Bugs. In FSE’07, pages 55–64. IEEE CS, 2007. [20] R. Hegde and P. Dewan. Connecting Programming Environments to Support Ad-Hoc Collaboration. In ASE’08, 2008. [21] T. Kamiya, S. Kusumoto, and K. Inoue. CCFinder: a Multilinguistic Token-based Code Clone Detection System for Large Scale Source Code. IEEE TSE, 28(7):654-670, 2002. [22] C.Kapser and M.Godfrey. “Cloning considered harmful” Considered Harmful: Patterns of Cloning in Software. In Emperical Software Engineering, 13(6):645-692, 2008. [23] M. Kim, V. Sazawal, D. Notkin, and G. Murphy. An Empirical Study of Code Clone Genealogies. In FSE’05.ACM Press, 2005. [24] R. Komondoor and S. Horwitz. Using Slicing to Identify Duplication in Source Code. In SAS’01, pages 40–56, 2001. [25] J. Krinke. A Study of Consistent and Inconsistent Changes to Code Clones. In WCRE’07, pages 170–178. IEEE CS, 2007. [26] Z. Li, S. Lu, S. Myagmar. CP-Miner: Finding Copy-Paste and Related Bugs in Large-Scale Software. IEEE TSE, 32(3), 2006. [27] E. Lippe and N. van Oosterom. Operation-Based Merging. In ACM SIGSOFT Softw. Eng. Notes, 17(5):78–87. ACM, 1992. [28] Y. Higo, Y. Ueda, S .Kusumoto, K. Inoue. Simultaneous Modification Support based on Code Clone Analysis. APSEC’07. [29] T. Mens. A State-of-the-Art Survey on Software Merging. IEEE Trans. on Software Engineering, 28(5):449–462, 2002. [30] A. Marcus and J. Maletic. Identification of High-level Concept Clones in Source Code. In ASE’01, pp. 107-114.IEEE CS, 2001. [31] T. Mende, R. Koschke, and F. Beckwermert. An Evaluation of Code Similarity Identification for the Grow-and-Prune Model. JSME, 21(2):143–169, John Wiley & Sons, 2009. [32] F. Mitter. Tracking Source Code Propagation in Software Systems via Release History Data and Code Clone Detection. Diploma Thesis, University of Zurich, 2006. [33] H.A. Nguyen, T.T. Nguyen, N.H. Pham, J.M. Al-Kofahi, and T.N. Nguyen. Accurate and Efficient Structural Characteristic Feature Extraction for Clone Detection. In FASE’09, LNCS 5503, pages 440–455. Springer-Verlag, 2009. [34] T.T. Nguyen, H.A. Nguyen, N.H. Pham, J.M. Al-Kofahi, and T.N. Nguyen. ClemanX: Incremental Clone Detection Tool for Evolving Software. In ICSE’09 Demo. IEEE CS, 2009. [35] T.T. Nguyen, H.A. Nguyen, N.H. Pham, J.M. Al-Kofahi, and T.N. Nguyen. Scalable and Incremental Clone Detection for Evolving Software. In ICSM’09, IEEE CS, 2009. [36] A. Sarma, G. Bortis, A. van der Hoek. Towards supporting awareness of indirect conflicts across SCM workspaces. ASE’07. [37] R. Tairas - Bibliography of Code Detection Literature. http://students.cis.uab.edu/tairasr/clones/literature/. [38] M. Toomim, A. Begel, and S.L. Graham. Managing Duplicated Code with Linked Editing. In VLHCC’04, IEEE CS, 2004. [39] W. Yang, S. Horwitz, and T. Reps. A Program Integration Algorithm that accommodates semantics-preserving Transformations. ACM Trans. Softw. Eng. Methodol., 1(3):310–354, 1992. 134 136 122