Academia.eduAcademia.edu

SCQL

2005, Proceedings of the 2005 international workshop on Mining software repositories - MSR '05

Source Control Repositories are used in most software projects to store revisions to source code files. These repositories operate at the file level and support multiple users. A generalized formal model of source control repositories is described herein. The model is a graph in which the different entities stored in the repository become vertices and their relationships become edges. We then define SCQL, a first order, and temporal logic based query language for source control repositories. We demonstrate how SCQL can be used to specify some questions and then evaluate them using the source control repositories of five different large software projects.

SCQL: A formal model and a query language for source control repositories Abram Hindle Daniel M. German Software Engineering Group Department of Computer Science University of Victoria Software Engineering Group Department of Computer Science University of Victoria [email protected] [email protected] ABSTRACT Source Control Repositories are used in most software projects to store revisions to source code files. These repositories operate at the file level and support multiple users. A generalized formal model of source control repositories is described herein. The model is a graph in which the different entities stored in the repository become vertices and their relationships become edges. We then define SCQL, a first order, and temporal logic based query language for source control repositories. We demonstrate how SCQL can be used to specify some questions and then evaluate them using the source control repositories of five different large software projects. Categories and Subject Descriptors D.2.7 [Software Engineering]: Distribution, Maintenance, and Enhancement—Version Control ; D.2.8 [Software Engineering]: Metrics—Process metrics 1. INTRODUCTION A configuration management system, and more specifically, a source control system (SCS) keeps track of the modification history of a software project. A SCS keeps a record of who modifies what part of the system, when and what the change was. Typically a tool that wants to use this historical information starts by doing some type of fact extraction. These facts are processed in order to create new information such as metrics [9, 3] or predictors of future events [6, 7]. In some cases, this information is queried or visualized [5, 10]. Some projects store the extracted facts into a relational database ([8, 5, 2]), and then use SQL queries to analyze the data. Others prefer to use plain text files, and create small programs to answer specific questions [9], or query the SCS repository every time [10]. One of the main disadvantages of these approaches is that querying this history becomes difficult. A query has to be translated from the domain of the SCS history to the data model or schema used to stored this information. Also, questions regarding the temporal aspects of the data are difficult to express. Furthermore, there is no standard for the storage or the querying of the data, making it difficult for a project to share its data or its analysis methods with another one. When a developer completes a task it usually means that she has modified one or more files. The developer then submits these changes to the SCS, in what we call a modification request, or MR (this process has also been called a transaction). A MR is, therefore, atomic (conceptually the MR is atomic, even though it might not be implemented as such by the SCS system). Once the change is accepted by the SCS, it creates a new revision for each file present in the MR. Thus an MR is a set of one or more file revisions, committed by one developer. The SCS allows its users to retrieve any given revision of a file, or for a given date, determine what is the latest revision for every file under its control. There are many SCSs available on the market. They can be divided into two types: centralized repositories (like CVS) and Peer-to-Peer repositories (such as BitKeeper, Darcs, Arch). Even though they differ strongly in the way they operate and store the tracked changes, they all track files and their revisions. We will focus on CVS because there is a large number of CVS repositories available to researchers. We will, therefore, use the CVS nomenclature in this paper. It is important to mention that our model and SCQL can be applied to any SCS. This paper is divided as follow: first we present an abstract model to describe version control systems; second, we define a query language, called SCQL, we end demonstrating how it can be used to pose questions related to the source control history in several mature, large projects. 2. MODEL Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage, and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. MSR’05, May 17, 2005, Saint Louis, Missouri, USA Copyright 2005 ACM 1-59593-123-6/05/0005...$5.00 In order to create a language for the querying of a SCS we first need to be able to describe its data model. This data model will be used to formally describe the data available in the SCS and to provide a uniform representation of the information available across multiple SCSs. One of the requirements of this model that is “time aware” and it is able to represent the temporal relationships (“before”, “after”) of the different entities stored in the SCS. 1 * Author Revision * 1 1 * MR * 1 File Figure 1: Cardinality and Directions of Edges in the Model 2.1 Characteristic Graph of a Source Code Repository We represent an instance of a SCS as a directed graph. Entities such as MRs, Revisions, Files and Authors are vertices, while their relationships are represented by edges. It is important that for any given instance of a SCS, there exists a corresponding characteristic graph, and that given a query, this query can be translated into an equivalent graph query on its characteristic graph. As a consequence, the original query will be answered by solving the graph query. 2.2 Table 1: Model Primitives isaM R(φ) is φ an MR? isaRevision(φ) is φ a Revision? isaF ile(φ) is φ a File? isaAuthor(φ) is φ an Author? numberT oStr(i) Represent i as a string. length(φ) Length of the string φ substrφ, k, l) Return a substring of φ of length l at k. eq(φ, θ) are φ and θ equivalent strings? match(φ, θ) is θ is a substring of φ? isEdge(φ, θ) is there an edge from φ to θ? count(S) counts the elements in a subset. isAuthorOf (ψ, φ) is ψ an author of φ? isF ileOf (τ, φ) is τ an File of φ? if M ROf (φ, φ) is φ is an mr of φ? isRevisionOf (θ, φ) is θ is a revision of φ? revBef ore(θ, θ2 ) is there is a revision path from θ to θ2 ? revAf ter(θ, θ2 ) is there is a revision path from θ2 to θ? Authors are represented by the subset Author in the graph. Authors have attributes such as user ID, name and email. Time wise authors are associated to their first revision implying their entry into the project. There is only one author per MR and per Revision. Entities The model for SCQL contains four different types of entities: MRs, Revisions, Files and Authors. See figure 1. MRs model modification requests and correspond to the set MR in the graph instance. MRs have attributes such as log comments, timestamp, and a unique ID. We assume that the timestamp of an MR is unique (derived from its earliest revision), and that an MR is an atomic operation. There exists an edge from each MR to the next MR in time (if one exists). One edge extends from the MR to the author of its revisions, and one edge is also created from the MR to each of its revisions (an MR is not connected to more than one revision of the same file). Revisions correspond to the set of file revisions and are denoted by Revision. Revisions are atomic in time with respect to other revisions, thus they have unique timestamps and they are assigned unique identifiers. They have attributes such as the diff of the change, and the lines added and removed. An edge extends from the revision to its author, and another one to the corresponding file. Revisions are also connected to each other. An edge is created from any given revision to each of its successor (the revision which modified it), thus one revision can have multiple children (or branches). Revision subgraphs are characterized as acyclic stream-like graph which springs up from a single node. If a revision merges a branch from another branch (or the main development trunk), an edge will be created from the “predecessor” revisions on both branch to the merged revision. Files are represented as the subset of vertices File in the graph. Files are the springs from which streams of revisions flow. Files have attributes such as path, filename, directory, and a unique full path name. Time-wise, files have unique timestamps associated with the first revision made of a file (this records the moment the file first appears in the graph). Files are connected to by revisions as described above. 2.3 Formalizing the characteristic graph Formally we define the characteristic graph G of a SCS as a directed graph of G = (V, E) where V = MR ∪ File ∪ Author ∪ Revision E = (v1 ∈ MR, v2 ∈ MR) ∪ (v1 ∈ MR, v2 ∈ Revision) ∪ (v1 ∈ MR, v2 ∈ Author) ∪ (v1 ∈ Revision, v2 ∈ Revision) ∪ (v1 ∈ Revision, v2 ∈ Author) ∪ (v1 ∈ Revision, v2 ∈ File) There are 6 data types in our model: Vertices representing entities; edges representing relationships; sets of entities which abstract edges; numbers used for numerical questions; strings are needed since much of the data in the repository is string data; and Booleans which are necessary to prove invariants exist. Table 1 provides a description of some of the primitives that operate on these types. We implement attributes using maps. Attributes can map from entities to subsets, strings, numerics or Booleans. Another assumption is that the output of a mapping is only valid if a node or edge of a correct type is used as an index to the map. More attributes can be added at any time but the attributed mentioned in section 2.2 are the expected attributes. Attributes which are expected to return one entity still return a subset. The motivation is to maintain uniform access to entities while providing a method of abstracting edge traversal. Since sets are returned we use plural function names. Attributes that are subsets of entities (edge traversals) are described in table 2. 2.4 Extraction and Creation The general algorithm for extracting and creating a graph from a SCS is: Table 2: Sub-domain Attributes authors(φ ∈ MR) the author of the MR revisions(φ ∈ MR) the revisions of the MR f iles(φ ∈ MR) the files of the revisions of the MR nextM Rs(φ ∈ MR) next MR in time prevM Rs(φ ∈ MR) previous MR in time mrs(θ ∈ Revision) MR related of the Revision authors(θ ∈ Revision) the author of the revision. f iles(θ ∈ Revision) the files of a the revision nextRevs(θ ∈ Revision) Next revisions version-wise. prevRevs(θ ∈ Revision) Previous revisions version-wise. mrs(τ ∈ File) MRs of the Revisions of the file revisions(τ ∈ File) Revisions of the file authors(τ ∈ File) Authors of the revisions of the file. mrs(ψ ∈ Author) MRs of the author. revisions(ψ ∈ Author) Revisions of the author f iles(ψ ∈ Author) Files of the revisions of the author author1 MR 1 Revision1 Revision3 File2 ... MR 3 Revision5 Revision4 MR n Revision x File3 Revision x+1 File j-1 File j Figure 2: Example Model Subgraph Revision1.1 • Each file becomes a vertex in File. MR 2 Revision2 File1 author2 Revision1.2 Revision1.3 Revision1.4 Revision1.5 branch merge • Each author becomes a vertex in s Author. Revision1.3.1 • Each revision becomes a vertex in Revision. Assign revisions unique timestamps and connect each revision its corresponding author and file. • Create vertices for each MR. The MR inherits the timestamp from its first file revision. Associate MR to its author MR. • Each MR is then connected to the next MR (according to their timestamp), if it exists. • For each file, connect each revision to the next revision of the file, version-wise. If branching is taken into account, only revisions in the same branch are connected in this manner, and then branching and merging points are connected. When this algorithm terminates, the result is a characteristic graph of the instance of SCS. CVS does not record branch merges or modification requests, but some heuristics have been developed to recover both [2, 4, 11]. Branch-merge and MR recovery in CVS are not accurate, and therefore the extracted SCS graph is an interpretation rather than an exact representation of the SCS. An example of the SCS graph is depicted in figures 2 and 3. The vertices corresponding to the revisions in 2 and 3 are the same and they are shown in two figures to avoid clutter. 3. QUERY LANGUAGE The rationale for our model is to provide a basis for a query language for a SCS. We are interested in a language that has the following properties: • It is based on primitives that correspond to the actual data and relationships stored in a SCS. We want a language that directly models files, authors, revisions, etc. Revision1.3.2 Figure 3: Example Revision Subgraph • It has the ability to take advantage of the time dimension. We want to able to pose questions that include predicates such as “previous”, “after, “last time”, “always”, “never”. For example, “has this file always been modified by this author?”, “find all MRs do not include the following file”, “find the file revision after this other one”, “find the last revisions by a given author”, etc. • It is computable. We need confidence that if a query is posed, it can be evaluated. • It is expressive. We are interested in a language that is able to express a wide range of queries. The characteristic graph of a source code repository is the basis for this language. Thus our language is built such that any query expressed in it can be translated to a query of the characteristic graph. First order predicate logic will serve as a basis for our query language, as it can handle both graph semantics and “before and after” aspects of temporal logic [1]. The language is designed to query the model, not to provide a general purpose programming language. We have focused in evaluating decision queries with this language (those which answer is either yes or not), but we also support other types of queries that return other types of data (such as the id of an author, the number of files modified, or a set of files). The language has a rich syntax, but due to a lack of space we only summarize its main features in table 3. Identifiers are unbound variables that reference entities. Using a variable, one can access the attributes of the referenced entities (x.attribute). Identifiers are only created by a scoping operator such as an Anchor, Universal Quantifiers, Existential Quantifiers or Selection Scope. These scopes iterate over elements in a subset by applying a predicate to each element. hypothesize that an old, stable project will have a small proportion, while a project that is still growing, and continues to have structural changes will have a larger proportion. This query can be easily expressed directly in SCQL as: Existential and Universal scopes iterate through an entire subset until a preposition returns either true or false. For empty subsets universal scopes return true and existential scopes return false. 1 - (Count(mr,MR) { Ebefore(a,MR,mr)) { A(f,mr.files) { isFileOf(f,a) } } } / count(MR) Subset/Select based scopes effectively iterate through all the elements in set of entities such as MR, Revision, File , Author selecting entities to form a subset. A subset can only be the same size or smaller than the set it is testing. These subsets may only have 1 type of entity. Anchor scopes are like select- based scopes, but are meant to access a single element in constant time. Scope operators that are “before” or “after” scopes iterate through their respective subsets in sequential order from first to last. 3.1 Example Queries We now present three different queries and show how they are expressed in SCQL. Example 1: Is there an author a who only modifies files that author b has already modified? This query can be formally expressed as: ∃a, b ∈ Author s.t. a 6= b∧ ∀r ∈ Revision s.t. isAuthor(a, r) =⇒ ∃rb ∈ Revision s.t. bef ore(rb , r)∧ It iterates over the set of all MRs, counting only those that have a previous MR that modifies all its files too. Then it counts all MRs, and computes the desired proportion. Example 3: Is there an Author whose changes stay within one directory? ∃a ∈ Author s.t. ∀f ∈ File s.t. isAuthorOf (a, f ) =⇒ ∀f2 ∈ Files.t.isAuthorOf (a, f2 ) =⇒ directory(f ) = directory(f2 ) In this case we want to know if there exists an author such that for all pairs of files modified by this author, they are both in the same directory. This query can be written in SCQL as: isAuthor(b, rb ) ∧ r.f ile = rb .f ile We are trying to find two different authors such that for all revisions of one author, there exists a previous revision (by the second author) to the same file. The SCQL query first finds two authors and makes sure they are different. Then it iterates through all the revisions of author a. Per each revision, it checks if the file of that revision has another previous revision that belongs to author b. a.revisions gets all the revisions related to the author a while isAuthorOf(b ,r2) tests if b is the author of the revision of the file f . E(a, Author) { E(b, Author) { a!=b && A(r, a.revisions) { A(f, r.file) { Ebefore( r2, f.revisions, r) { isAuthorOf( b, r2) } } } } } Example 2: Compute the proportion of MRs that have a unique set of files which have never appeared as part of another MR before. With this query we are want to find out how variable are the sets of files modified in MRs. We E(a, Author) { A(f, author.files) { A(f2, author.files) { eq(f.directory, f2.directory) } } } 4. EVALUATION We have built an implementation for SCQL. In order to demonstrate the effectiveness of SCQL we ran the 3 example queries against five different projects: Evolution (an Email Application), Gnumeric (a spreadsheet), OpenSSL (A Secure Socket Layer library), Samba (Linux support for Win32 network file systems), and modperl (a module for Apache that acts like a Perl Application server). The table 4 provides the output of the 3 example queries for each of these projects. We include the size of the MR set (number of MRs) and the File set too. Table 4: Evaluation of the 3 example queries evolution gnumeric openssl samba modperl Ex 1 true true false false true Ex 2 0.002 0.004 0.003 0.002 0.015 Ex 3 false false false false true |File| 4748 3685 3698 4246 300 |MR| 18573 11337 10847 27413 1398 Name MR Revision Author File Universal Existential Attribute Function Universal Before Universal After Existential Before Existential After Subset Universal From Subset Anchor Select count Sum Average Count Table 3: Language Description Language Explanation MR Set of Modification Requests Revision Set of Revisions Author Set of Authors File Set of Files A(φ, δ){P (φ)} For all φ in the set δ is the predicate P (φ) true? E(φ, δ){P (φ)} Does φ exist in set δ where predicate P (φ) is true? φ.ζ Given an entity φ return its attribute ζ γ(P ) Evaluate the function γ with P as the parameter Abef ore(φ, δ, θ){P (φ, θ)} For all φ in δ before θ is the binary predicate P (φ, θ) true? Aaf ter(φ, δ, θ){P (φ, θ)} For all φ in δ after θ is P (φ, θ) true? Ebef ore(φ, δ, θ){P (φ, θ)} Does φ exist in δ before θ where P (φ, θ) is true? Eaf ter(φ, δ, θ){P (φ, θ)} Does φ exist in δ after θ where P (φ, θ) is true? S(φ, δ){P (φ)} Create a subset of δ, such that for each element φ in that subset, P (φ) is true. A(θ, S(φ, δ){P (φ)}){Q(θ)} For each elements θ in the set δ for which P (φ) is true, Q(θ) is also true Anchor(φ, M R,′′ mrid′′ )P (φ) Evaluate P (φ) on the entity of type MR with id “mrid” count(δ) Count the number of elements of the subsets δ Sum(φ, δ){P (φ)} Summate the predicate P (φ) for all φ in δ Avg(φ, δ){P (φ)} Get the average of the predicate P (φ) for all φ in δ Count(φ, δ){P (φ)} Count the number of elements φ in δ where P (φ) is true. 5. SUMMARY This paper presents a formal model to describe SCSs. This model is then used to define a query language, SCQL, that can be used to pose queries on the SCSs. The objective of SCQL is to be domain specific and to support temporal logic operators in those queries. We have demonstrated the use of SCQL with example queries, and demonstrated their effectiveness by running those queries against the SCS of 5 different large, mature software projects. While it is possible to use other query languages to investigate SCSs (such as SQL and XQuery) we believe that SCQL has 2 important properties that these languages are do not. First, it is domain specific: the queries refer to entities in the repository, and second, it supports temporal logic operators. While it is possible to implement temporal logic operations in SQL or XQuery, it might result in overly complex expressions. We expect to use SCQL in the exploration of the evolution of software and to help us compute metrics on SCS repositories. 6. REFERENCES [1] S. Abiteboul, L. Herr, and J. Van den Bussche. Temporal versus first-order logic to query temporal databases. pages 49–57, 1996. [2] M. Fischer, M. Pinzger, and H. Gall. Populating a release history database from version control and bug tracking systems. In Proceedings of the International Conference on Software Maintenance (ICSM 2003), pages 23–32, Sept. 2003. [3] D. German. An empirical study of fine-grained software modifications. In 20th IEEE International Conference on Software Maintenance (ICSM’04), Sept 2004. [4] D. M. German. Mining CVS repositories, the softChange experience. In 1st International Workshop on Mining Software Repositories, pages 17–21, May 2004. [5] D. M. German, A. Hindle, and N. Jordan. Visualizing the evolution of software using softchange. In Proceedings SEKE 2004 The 16th Internation Conference on Software Engineering and Knowledge Engineering, pages 336–341, 3420 Main St. Skokie IL 60076, USA, June 2004. Knowledge Systems Institute. [6] T. Girba, S. Ducasse, and M. Lanza. Yesterday’s weather: Guiding early reverse engineering efforts by summarizing the evolution of changes. In 20th IEEE International Conference on Software Maintenance (ICSM’04), Sept 2004. [7] A. E. Hassan and R. C. Holt. Predicting change propagation in software systems. pages 284–293, September 2004. [8] Y. Liu and E. Stroulia. Reverse Engineering the Process of Small Novice Software Teams. In Proc. 10th Working Conference on Reverse Engineering, pages 102–112. IEEE Press, November 2003. [9] A. Mockus, R. T. Fielding, and J. Herbsleb. Two Case Studies of Open Source Software Development: Apache and Mozilla. ACM Transactions on Software Engineering and Methodology, 11(3):1–38, July 2002. [10] X. Wu. Visualization of version control information. Master’s thesis, University of Victoria, 2003. [11] T. Zimmermann and P. Weisgerber. Preprocessing cvs data for fine-grained analysis. In 1st International Workshop on Mining Software Repositories, May 2004.