Generating Program Inputs for Database Application Testing

Kai Pan, Xintao Wu
University of North Carolina at Charlotte
Tao Xie
North Carolina State University [email protected]
Abstract: Testing is essential for quality assurance of database applications. Achieving high code coverage of the database application is important in testing. In practice, there may exist a copy of live databases that can be used for database application testing. Using an existing database state is desirable since it tends to be representative of real-world objects' characteristics, helping detect faults that could cause failures in real-world settings. However, to cover a specific program code portion (e.g., block), appropriate program inputs also need to be generated for the given existing database state. To address this issue, in this paper, we propose a novel approach that generates program inputs for achieving high code coverage of a database application, given an existing database state. Our approach uses symbolic execution to track how program inputs are transformed before appearing in the executed SQL queries and how the constraints on query results affect the application's execution. One significant challenge in our problem context is the gap between program-input constraints derived from the program and from the given existing database state; satisfying both types of constraints is needed to cover a specific program code portion. Our approach includes novel query formulation to bridge this gap. Our approach is loosely integrated into Pex, a state-of-the-art white-box testing tool for .NET from Microsoft Research. Empirical evaluations on two real database applications show that our approach assists Pex to generate program inputs that achieve higher code coverage than the program inputs generated by Pex without our approach's assistance.
01: public int calcStat(int type, int zip) {
02:   int years = 0, count = 0, totalBalance = 0;
03:   int fzip = zip + 1;
04:   if (type == 0)
05:     years = 15;
06:   else
07:     years = 30;
08:   SqlConnection sc = new SqlConnection();
09:   sc.ConnectionString = "..";
10:   sc.Open();
11:   string query = "SELECT C.SSN, C.income,"
        + " M.balance FROM customer C, mortgage M"
        + " WHERE M.year=" + years + " AND"
        + " C.zipcode=" + fzip + " AND C.SSN = M.SSN";
12:   SqlCommand cmd = new SqlCommand(query, sc);
13:   SqlDataReader results = cmd.ExecuteReader();
14:   while (results.Read()) {
15:     int income = int.Parse(results["income"].ToString());
16:     int balance = int.Parse(results["balance"].ToString());
17:     double diff = income - 1.5 * balance;
18:     if (diff > 100000) {
19:       count++;
20:       totalBalance = totalBalance + balance; } }
21:   return totalBalance; }
Fig. 1. An example code snippet that calculates a statistic related to mortgages.
I. INTRODUCTION

Database applications are ubiquitous, and it is critical to assure their high quality. To do so, testing is commonly used in practice. Testing of database applications can be classified into functional testing, performance testing (load and stress, scalability), security testing, environment and compatibility testing, and usability testing. Among them, functional testing aims to verify the functionality of the code under test. An important task of functional testing is to generate test inputs to achieve full or at least high code coverage, such as block or branch coverage, of the database application under test. For database applications, test inputs include both program inputs and database states.

A. Illustrative Example

The example code snippet shown in Figure 1 includes a portion of C# source code from a database application that calculates some statistic related to mortgages. The corresponding database contains two tables: customer and mortgage. Their
schema-level descriptions and constraints are given in Table I.
TABLE I. Schema-level descriptions and constraints of the customer and mortgage tables

customer table
| Attribute | Type   | Constraint  |
| SSN       | Int    | Primary Key |
| zipcode   | Int    | [1, 99999]  |
| name      | String |             |
| gender    | String |             |
| age       | Int    |             |
| income    | Int    |             |

mortgage table
| Attribute | Type | Constraint               |
| SSN       | Int  | Primary Key, Foreign Key |
| year      | Int  | (0, 100)                 |
| balance   | Int  | (1000, Max)              |
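For readers who want to set up the running example, the two tables could be created along the following lines. This sketch is ours and not part of the paper: the DDL statements and the VARCHAR lengths are assumptions, while the column names, types, and key constraints follow Table I.

using System.Data.SqlClient;

static class ExampleSchemaSetup
{
    static void Main()
    {
        // Column types follow Table I; the exact DDL and VARCHAR lengths are our assumptions,
        // since the paper does not show the schema's SQL definition.
        string[] ddl =
        {
            "CREATE TABLE customer ("
          + " SSN INT PRIMARY KEY,"
          + " zipcode INT,"
          + " name VARCHAR(64), gender VARCHAR(16), age INT, income INT)",

            "CREATE TABLE mortgage ("
          + " SSN INT PRIMARY KEY REFERENCES customer(SSN),"
          + " year INT,"
          + " balance INT)"
        };

        using (var sc = new SqlConnection(".."))  // placeholder connection string, as in Fig. 1
        {
            sc.Open();
            foreach (var stmt in ddl)
                using (var cmd = new SqlCommand(stmt, sc))
                    cmd.ExecuteNonQuery();
        }
    }
}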
The calcStat method described in the example code snippet receives two program inputs: type, which determines the years of mortgages, and zip, which indicates the zip codes of customers. A variable fzip is calculated from zip; in our example fzip is given as zip+1. Then the database connection is set up (Lines 08-10). The database query is constructed (Line 11) and executed (Lines 12 and 13). The tuples from the returned result set are iterated (Lines 14-20). For each tuple, a variable diff is calculated from the values of the income field and the balance field. If diff is greater than 100000, a counter variable count is increased (Line 19) and totalBalance is updated (Line 20). The method finally returns the calculation result.

Both program inputs (i.e., input parameters) and database states are crucial in testing this database application because (1) the program inputs determine the embedded SQL statement in Line 11; (2) the database states determine whether the true branch in Line 14 and/or the true branch in Line 18 can be covered, being crucial to functional testing, because covering a branch is necessary to expose a potential fault within that branch; and (3) the database states also determine how many times the loop body in Lines 14-20 is executed, being crucial to performance testing.

B. Problem Formalization

In practice, there may exist a copy of live databases that can be used for database application testing. Using an existing database state is desirable since it tends to be representative of real-world objects' characteristics, helping detect faults that could cause failures in real-world settings. However, it often happens that a given database with an existing database state (even with millions of records) returns no records (or returned records do not satisfy branch conditions in the subsequently executed program code) when the database receives and executes a query with arbitrarily chosen program input values. For example, method calcStat takes both type and zip as inputs. To cover a path where the conditions at Lines 14 and 18 are both true, we need to assign appropriate values to variables years and fzip so that the execution of the SQL statement in Line 12 with the query string in Line 11 returns non-empty records, while at the same time attributes income and balance of the returned records also satisfy the condition in Line 18. Since the domain for program input zip is large, it is very likely that, if a tester enters an arbitrary zip value, execution of the query on the existing database will return no records, or those returned records will not satisfy the condition in Line 18. Hence, it is crucial to generate program input values such that test inputs with these values can help cover various code portions when executed on the existing database.

C. Proposed Solution

To address this issue, in this paper, we propose a novel approach that generates program inputs for achieving high code coverage of a database application, given an existing database state. In our approach, we first examine close relationships among program inputs, program variables, branch conditions, embedded SQL queries, and database states. For example, program variables used in the executed queries may be derived from program inputs via complex chains of computations (we use fzip=zip+1 in our illustrative example), and path conditions may involve comparisons with record values in the query's result set (we use if (diff>100000) in our illustrative example). We then automatically generate appropriate program
inputs via executing a formulated auxiliary query on the given database state. In particular, our approach uses dynamic symbolic execution (DSE) [7] to track how program inputs to the database application under test are transformed before appearing in the executed queries and how the constraints on query results affect the later program execution. We use DSE to collect various intermediate information. Our approach addresses one significant challenge in our problem context: there exists a gap between program-input constraints derived from the program and those derived from the given existing database state; satisfying both types of constraints is needed to cover a specific program code portion. During DSE, these two types of constraints cannot be naturally collected, integrated, or solved for test generation. To address this challenge, our approach includes novel query formulation to bridge this gap. In particular, based on the intermediate information collected during DSE, our approach automatically constructs new auxiliary queries from the SQL queries embedded in the code under test. The constructed auxiliary queries use the database attributes related to program inputs as the selection targets and incorporate the path constraints related to query result sets into the selection conditions. After the new auxiliary queries are executed against the given database, we attain effective program input values for achieving code coverage. This paper makes the following main contributions:
- The first problem formalization for program-input generation given an existing database state to achieve high code coverage.
- A novel program-input-generation approach based on symbolic execution and query formulation for bridging the gap between program-input constraints from the program and from the given existing database state.
- Evaluations on two real database applications to assess the effectiveness of our approach upon Pex [12], a state-of-the-art white-box testing tool for .NET from Microsoft Research. Empirical results show that our approach assists Pex to generate program inputs that achieve higher code coverage than the program inputs generated by Pex without our approach's assistance.

II. DYNAMIC SYMBOLIC EXECUTION IN DATABASE APPLICATION TESTING
Recently, dynamic symbolic execution (DSE) [7] was proposed for test generation. DSE first starts with default or arbitrary inputs and executes the program concretely. Along the execution, DSE simultaneously performs symbolic execution to collect symbolic constraints on the inputs obtained from predicates in branch conditions. DSE flips a branch condition and conjoins the negated branch condition with the constraints from the prefix of the path before that branch condition. DSE then feeds the conjoined conditions to a constraint solver to generate new inputs to explore not-yet-covered paths. The whole process terminates when all the feasible program paths have been explored or the number of explored paths has reached the predefined upper bound.

DSE has also been used in testing database applications [6], [11]. Emmi et al. [6] developed an approach for automatic test generation based on DSE. Their approach uses a constraint solver to solve collected symbolic constraints to generate both program input values and corresponding database records. The approach involves running the program simultaneously on concrete program inputs as well as on symbolic inputs and a symbolic database. In the first run, the approach uses random concrete program input values, collects path constraints over the symbolic program inputs along the execution path, and generates database records such that the program execution with the concrete SQL queries can cover the current path. To explore a new path, it flips a branch condition and generates new program input values and corresponding database records. However, their approach cannot generate effective program inputs based on the content of an existing database state. The reason is that some program inputs (e.g., zip in our illustrative example) appear only in the embedded SQL queries and there is no path constraint over them. Our approach differs from Emmi et al.'s approach [6] in that we leverage DSE as a supporting technique to generate effective program input values by executing constructed auxiliary queries against the existing database state. As a result, high code coverage of the application can be achieved without generating new database states.

When DSE is applied on a database application, DSE often fails to cover specific branches due to an insufficient returned result set, because returned record values from the database are often involved in deciding which later branches to take. We use Pex [12], a DSE tool for .NET, to illustrate how our approach assists DSE to determine program input values such that the executed query can return sufficient records to cover various code portions. During the program execution, DSE maintains the symbolic expressions for all variables. When the execution along one path terminates, DSE tools such as Pex have collected all the preceding path constraints to form the path condition. Pex also provides a set of APIs that help access intermediate information of its DSE process.

TABLE II. An existing database state for the example in Figure 1

customer table
| SSN | zipcode | name  | gender | age | income |
| 001 | 27695   | Alice | female | 35  | 50000  |
| 002 | 28223   | Bob   | male   | 40  | 150000 |

mortgage table
| SSN | year | balance |
| 001 | 15   | 20000   |
| 002 | 15   | 30000   |

For illustration purposes, we assume that we have an existing database state shown in Table II for our preceding example shown in Figure 1. To run the program for the first time against the existing database state, Pex uses default values for program inputs type and zip. In this example, because type and zip are both integers, Pex simply chooses type=0, zip=0 as default values. The condition in Line 04 is then satisfied and the query statement with the content in Line 11 is dynamically constructed. In Line 12 where the query is executed, we can dynamically get the concrete query string as

Q1: SELECT C.SSN, C.income, M.balance
    FROM customer C, mortgage M
    WHERE M.year=15 AND C.zipcode=1 AND C.SSN=M.SSN

Through static analysis, we can also get Q1's corresponding abstract form as

Q1abs: SELECT C.SSN, C.income, M.balance
       FROM customer C, mortgage M
       WHERE M.year=:years AND C.zipcode=:fzip AND C.SSN=M.SSN
The execution of Q1 on Table II yields zero records. Thus, the while loop body in Lines 14-20 is not entered and the exploration of the current path is finished. We use the Pex API method PexSymbolicValue.GetPathConditionString() after Line 14 to get the path condition along this path:
P1:(type == 0) && (results.Read() != true)
To explore a new path, Pex flips a part of the current path condition from type == 0 to type != 0 and generates new program inputs as type=1, zip=0. The condition in Line 04 is then not satisfied and the SQL statement in Line 11 is dynamically determined as
Q2: SELECT C.SSN, C.income, M.balance
    FROM customer C, mortgage M
    WHERE M.year=30 AND C.zipcode=1 AND C.SSN=M.SSN
Note that Q2 has the same abstract form as Q1. However, the execution of Q2 still returns zero records, and hence the execution cannot enter the while loop body either. The path condition for this path is
P2:(type == 1) && (results.Read() != true)
We can see that at this point, no matter how Pex flips the currently collected path conditions, it fails to explore any new paths. Since Pex has no knowledge about the zipcode distribution in the database state, using arbitrarily chosen program input values often yields zero returned records when the query is executed against the existing database state. As a result, none of the paths involving the while loop body could be explored. In testing database applications, previous test-generation approaches (e.g., Emmi et al. [6]) would then invoke constraint solvers to generate new records and instantiate a new test database state, rather than using the given existing database state, as required in our focused problem. In contrast, by looking into the existing database state as shown in Table II, we can see that if we use an input like type=0, zip=27694, the execution of the query in Line 11 will yield one record {C.SSN = 001, C.income = 50000, M.balance = 20000}, which further makes the condition in Line 14 true and the condition in Line 18 false. Therefore, using the existing database state, we are still able to explore this new path:

P3:(type == 0) && (results.Read() == true) && (diff <= 100000)
Furthermore, if we use type=0, zip=28222, the execution of the query in Line 11 will yield another record {C.SSN = 002, C.income = 150000, M.balance = 30000}, which makes both the condition in Line 14 and the condition in Line 18 true. Therefore, we can explore this new path:
P4:(type == 0) && (results.Read() == true) &&(diff > 100000)
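For concreteness, Pex explorations like the one above are typically driven by a parameterized unit test. The following sketch is ours and not from the paper; MortgageStats is a hypothetical class assumed to host the calcStat method of Fig. 1, while the PexClass/PexMethod attributes and PexSymbolicValue come from Pex itself.

using Microsoft.Pex.Framework;

[PexClass(typeof(MortgageStats))]   // MortgageStats: assumed class containing calcStat
public partial class MortgageStatsTest
{
    [PexMethod]
    public void CalcStatTest(int type, int zip)
    {
        // Pex treats type and zip as symbolic inputs, executes calcStat concretely,
        // collects the path condition (e.g., P1-P4 above), flips a branch condition,
        // and asks its constraint solver for new input values for the next run.
        var app = new MortgageStats();
        int totalBalance = app.calcStat(type, zip);

        // balance is positive per Table I, so the sum should never be negative;
        // this serves as a trivial oracle during exploration.
        if (totalBalance < 0)
            throw new System.InvalidOperationException("unexpected negative total");

        // The path condition at a program point can be inspected with
        // PexSymbolicValue.GetPathConditionString(), as used earlier in this section.
    }
}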
In Section III, we present our approach that can assist Pex to determine appropriate program input values such that high code coverage can be achieved using the existing database state.

III. APPROACH

Our approach assists Pex to determine appropriate program inputs so that high code coverage can be achieved in database application testing. As illustrated in our example, not-covered branches or paths are usually caused by an empty returned result set (e.g., for path P1) or by insufficient returned records that cannot satisfy later executed conditions (e.g., for path P3). The major idea of our approach is to construct an auxiliary query based on the intermediate information (i.e., the executed query's concrete string and its abstract form, symbolic expressions of program variables, and path conditions) collected by DSE.

There are two major challenges here. First, program input values are often combined into the executed query after a chain of computations. In our illustrative example, we simply set fzip = zip+1 in Line 03 to represent this scenario. We can see that fzip is contained in the WHERE clause of the executed query and zip is a program input. Second, record values in the returned query result set are often directly or indirectly (via a chain of computations) involved in the path condition. In our illustrative example, the program variable diff in the branch condition diff>100000 (Line 18) is calculated from the retrieved values of attributes income and balance. To satisfy the condition (e.g., diff>100000 in Line 18), we need to make sure that the program input values determined by our auxiliary query are appropriate so that the query's returned records are sufficient for satisfying later executed branch conditions.

A. Auxiliary Query Construction

Algorithm 1 illustrates how to construct an auxiliary query. The algorithm accepts as inputs a simple SQL query in both its concrete and abstract forms, program input values, and the current path condition. Formally, suppose that a program takes a set of parameters I = {I1, I2, ..., Ik} as program inputs. During path exploration, DSE flips a branch condition pcs (e.g., one executed after the query execution) from the false branch to the true branch to cover a target path. Such flipping derives a new constraint or path condition for the target path as PC = pc1 ∧ pc2 ∧ ... ∧ pcs. DSE feeds this constraint to the constraint solver to generate a new test input, whose later execution, however, does not cover the true branch of pcs as planned, likely due to database interactions along the path. In the path exploration, DSE also keeps records of all program variables and their concrete values and symbolic expressions along this path when DSE reaches pcs. From these records, we determine the program variables V = {V1, V2, ..., Vt} that are data-dependent on the program inputs I. DSE also collects the concrete string of an executed query along the current path. In our approach, we assume that the SQL query takes the form:

SELECT C1, C2, ..., Ch
FROM from-list
WHERE A1 AND A2 ... AND An
In the SELECT clause, there is a list of h strings, where each may correspond to a column name or an arithmetic or string expression over column names and constants following the SQL syntax. In the FROM clause, there is a from-list that consists of a list of tables. We assume that the WHERE clause contains n predicates, A = {A1, A2, ..., An}, connected by n - 1 ANDs. Each predicate Ai is of the form expression op expression, where op is a comparison operator (=, <>, >, >=, <, <=) or a membership operator (IN, NOT IN) and expression is a column name, a constant, or an (arithmetic or string) expression. Note that here we assume that the WHERE clause contains only conjunctions using the logical connective AND. We discuss how to process complex SQL queries in Section III-C.

Some predicate expressions in the WHERE clause of Q may involve comparisons with program variables. From the corresponding abstract query Qabs, we check whether each predicate Ai contains any program variables from V. We take the path P3 (Line 04 true, Line 14 true, and Line 18 false) in our preceding example shown in Figure 1 to illustrate the idea. The program input set is I = {type, zip} and the path condition PC is
P3:(type == 0) && (results.Read() == true) &&(diff <= 100000)
The program variable set V is {type, zip, fzip}. When flipping the condition diff<=100000, Pex fails to generate satisfiable test inputs for the flipped condition diff>100000. The abstract form of the executed query is
Qabs: SELECT C.SSN, C.income, M.balance
      FROM customer C, mortgage M
      WHERE M.year=:years AND C.zipcode=:fzip AND C.SSN=M.SSN
We can see that the predicate set A in the WHERE clause is formed as {M.year=:years, C.zipcode=:fzip, C.SSN=M.SSN}. Predicates M.year=:years and C.zipcode=:fzip contain program variables years and fzip, respectively. Furthermore, the program variable fzip is contained in V. In other words, the predicate C.zipcode=:fzip involves a comparison with a variable derived from program inputs. Algorithm 1 shows our procedure to construct the auxiliary query Q̃ based on the executed query (Q's concrete string and its abstract form Qabs) and the intermediate information collected by DSE.

Algorithm 1 Auxiliary query construction
1:  Find variables V = {V1, V2, ..., Vt} data-dependent on I;
2:  Decompose Qabs with a SQL parser for each clause;
3:  Construct a predicate set A = {A1, A2, ..., An} from Q's WHERE clause;
4:  Construct an empty predicate set Ã, an empty attribute set CV, and an empty query Q̃;
5:  for each predicate Ai ∈ A do
6:    if Ai does not contain program variables then
7:      Leave Ai unmodified and check the next predicate;
8:    else
9:      if Ai does not contain program variables from V then
10:       Substitute Ai's program variables with their corresponding concrete values in Q;
11:     else
12:       Substitute the variables from V with the expression expressed by I;
13:       Substitute the variables not from V with their corresponding concrete values in Q;
14:       Copy Ai to Ã;
15:       Add Ai's associated database attributes to CV;
16:     end if
17:   end if
18: end for
19: Append CV to Q̃'s SELECT clause;
20: Copy Q's FROM clause to Q̃'s FROM clause;
21: Append A - Ã to Q̃'s WHERE clause;
22: Find variables U = {U1, U2, ..., Uw} coming directly from Q's result set;
23: Find U's corresponding database attributes CU = {CU1, CU2, ..., CUw};
24: for each branch condition pci ∈ PC after Q's execution do
25:   if pci contains variables data-dependent on U then
26:     Substitute the variables in pci with the expression expressed by the variables from U;
27:     Substitute the variables from U in pci with U's corresponding database attributes in CU;
28:     Add the branch condition in pci to P̃C;
29:   end if
30: end for
31: Flip the last branch condition in P̃C;
32: Append all the branch conditions in P̃C to Q̃'s WHERE clause;
33: return Q̃;

Lines 5-21 present how to construct the clauses (SELECT, FROM, and WHERE) of the auxiliary query Q̃. We decompose Qabs using a SQL parser (see footnote 1) and get its n predicates A = {A1, A2, ..., An} from the WHERE clause. We construct an empty predicate set Ã. For each predicate Ai ∈ A, we check whether Ai contains program variables. If not, we leave Ai unchanged and check the next predicate. If yes, we then check whether any contained program variable comes from the set V. If no program variables in the predicate are from V, we substitute them with their corresponding concrete values in Q. In our example, the predicate M.year=:years belongs to this category: we retrieve the concrete value of years from Q and the predicate expression is changed to M.year=15. If some program variables contained in the predicate come from V, we substitute them with their symbolic expressions (expressed by the program inputs in I), substitute all the other program variables that are not from V with their corresponding concrete values in Q, and copy the predicate Ai to Ã. The predicate C.zipcode=:fzip in our example belongs to this category: we replace fzip with zip+1 and the new predicate becomes C.zipcode=:zip+1. We also add Ai's associated database attributes to a temporary attribute set CV; those attributes will be included in the SELECT clause of the auxiliary query Q̃. For the predicate C.zipcode=:fzip, the attribute C.zipcode is added to CV and thus also to the SELECT clause of the auxiliary query Q̃.

After processing all the predicates in A, we obtain an attribute set CV = {CV1, CV2, ..., CVj} and a predicate set à = {Ã1, Ã2, ..., Ãl}. Note that here all the predicates in à are still connected by the logical connective AND. The attributes from CV form the attribute list of Q̃'s SELECT clause. All the predicates in A - Ã, connected by AND, form the predicates of Q̃'s WHERE clause. Note that the from-list of Q̃'s FROM clause is the same as that of Q. In our example, à is {C.zipcode=:zip+1}, A - à is M.year=15 AND C.SSN=M.SSN, and the attribute set CV is {C.zipcode}. The constructed auxiliary query Q̃ has the form:

SELECT C.zipcode
FROM customer C, mortgage M
WHERE M.year=15 AND C.SSN=M.SSN

1 http://zql.sourceforge.net/
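To make the formulation step concrete, the following small sketch (ours, not the authors' implementation) hard-codes the classification of the three predicates of Qabs for the running example and assembles the auxiliary query string; a real implementation would derive this classification from the SQL parser output and the symbolic expressions collected by DSE.

using System;
using System.Collections.Generic;

static class AuxiliaryQueryExample
{
    // Builds the auxiliary query of Section III-A for the example query Qabs:
    //   WHERE M.year=:years AND C.zipcode=:fzip AND C.SSN=M.SSN
    // years has the concrete value 15; fzip is data-dependent on the program input zip,
    // so its attribute C.zipcode becomes a SELECT target instead of a WHERE predicate.
    static string BuildAuxiliaryQuery(IEnumerable<string> resultSetConditions)
    {
        var select = new List<string> { "C.zipcode" };
        var where = new List<string> { "M.year=15", "C.SSN=M.SSN" };

        // Flipped branch conditions over the result set (Lines 22-32 of Algorithm 1),
        // already rewritten over database attributes, are appended to WHERE.
        where.AddRange(resultSetConditions);

        return "SELECT " + string.Join(", ", select)
             + " FROM customer C, mortgage M"
             + " WHERE " + string.Join(" AND ", where);
    }

    static void Main()
    {
        // Without result-set conditions: the auxiliary query shown above.
        Console.WriteLine(BuildAuxiliaryQuery(new string[0]));
        // With the flipped condition of Line 18 (used later in this section).
        Console.WriteLine(BuildAuxiliaryQuery(
            new[] { "C.income - 1.5 * M.balance > 100000" }));
    }
}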
When executing the preceding auxiliary query against the existing database state, we get two zipcode values, 27695 and 28223. The corresponding program input zip can take either 27694 or 28222 because of the constraint C.zipcode=:zip+1 in our example. A test input with the program input either type=0, zip=27694 or type=0, zip=28222 can guarantee that the program execution enters the while loop body in Lines 14-20. However, there is no guarantee that the returned record values satisfy later executed branch conditions. For example, if we choose type=0, zip=27694 as the program input, the execution can enter the while loop body but still fails to satisfy the branch condition (i.e., diff>100000) in Line 18. Hence it is imperative to incorporate constraints from later branch conditions into the constructed auxiliary query. Program variables in a branch condition pci ∈ PC after executing the query may be data-dependent on returned record values. In our example, the value of program variable diff in branch condition diff>100000 is derived from the values of the two variables income and balance, which correspond to the values of attributes C.income and M.balance of returned records. Lines 22-32 in Algorithm 1 show how to incorporate later branch conditions in constructing the WHERE clause of the auxiliary query. Formally, we get the set of program variables U = {U1, U2, ..., Uw} that directly retrieve the values from the query's
returned result set, and treat them as symbolic inputs. For each program variable Ui, we also keep its corresponding database attribute CUi. Note that here CUi must come from the columns in the SELECT clause. We save these attributes in the set CU = {CU1, CU2, ..., CUw}. For each branch condition pci ∈ PC, we check whether any program variables in pci are data-dependent on variables in U. If yes, we substitute such variables in pci with their symbolic expressions with respect to the symbolic input variables from U and replace each Ui in pci with its corresponding database attribute CUi. The modified pci is then appended to Q̃'s WHERE clause. In our example, the modified branch condition C.income - 1.5 * M.balance > 100000 is appended to the WHERE clause, and the new auxiliary query is

SELECT C.zipcode
FROM customer C, mortgage M
WHERE M.year=15 AND C.SSN=M.SSN
  AND C.income - 1.5 * M.balance > 100000

When executing the preceding auxiliary query against the existing database state, we get the zipcode value 28223. With the constraint C.zipcode=:zip+1, the input type=0, zip=28222 can guarantee that the program execution enters the true branch in Line 18.

Program Input Generation. Note that executing the auxiliary query Q̃ against the database returns a set of values RV for the attributes in CV. Each attribute in CV can be traced back to some program variable in V = {V1, V2, ..., Vt}. Recall that V contains the program variables that are data-dependent on the program inputs I. Our final goal is to derive the values for the program inputs I. Recall that in Algorithm 1 we already collected in the predicate set à the symbolic expressions of CV with respect to the program inputs I. After substituting the attributes CV with their corresponding concrete values in RV resulting from executing Q̃ against the given database, we have new predicates in à over the program inputs I. We then feed these new predicates in à to a constraint solver to derive the values for the program inputs I. We give our pseudo procedure in Algorithm 2. In our illustrative example, after executing our auxiliary query on Table II, we get a returned value 28223 for the attribute C.zipcode. In Ã, we have C.zipcode=:zip+1. After substituting C.zipcode in C.zipcode=:zip+1 with the value 28223, we have 28223=:zip+1. The value 28222 for the program input zip can then be derived by invoking a constraint solver. In our prototype, we use the constraint solver Z3 (see footnote 2) integrated in Pex. Z3 is a high-performance theorem prover being developed at Microsoft Research. Z3 supports linear real and integer arithmetic, fixed-size bit-vectors, extensional arrays, uninterpreted functions, and quantifiers. In practice, the result R could be a set of values. For example, the execution of the auxiliary query returns a set of satisfying zipcode values. If multiple program input values are needed, we can repeat the same constraint solving process to produce each returned value in R.

2 http://research.microsoft.com/en-us/um/redmond/projects/z3/
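As an illustration of this constraint-solving step, the sketch below (ours; the paper does not show solver code) feeds the single predicate 28223 = zip + 1 to Z3 through its .NET bindings and reads the model value for zip. Pex drives Z3 internally in its own way, so this only shows the idea.

using System;
using Microsoft.Z3;

static class DeriveProgramInputExample
{
    static void Main()
    {
        using (var ctx = new Context())
        {
            // Predicate from à after substitution: C.zipcode = zip + 1 with C.zipcode = 28223.
            IntExpr zip = ctx.MkIntConst("zip");
            Solver solver = ctx.MkSolver();
            solver.Assert(ctx.MkEq(ctx.MkAdd(zip, ctx.MkInt(1)), ctx.MkInt(28223)));

            if (solver.Check() == Status.SATISFIABLE)
            {
                // Prints 28222, the program input value used with type=0 in the example.
                Console.WriteLine("zip = " + solver.Model.Evaluate(zip));
            }
        }
    }
}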
B. Dealing with Aggregate Calculation

Up to now, we have investigated how to generate program inputs through auxiliary query construction. Our algorithm exploits the relationships among program inputs, program variables, executed queries, and path conditions in source code. Database applications often deal with more than one returned record. In many database applications, multiple records are iterated over from the query's returned result set. Program variables that retrieve values from the returned result set further take part in aggregate calculations, and the aggregate values are then used in the path condition. In this section, we discuss how to capture the desirable aggregate constraints on the result set returned for one or more specific queries issued from a database application. These constraints play a key role in testing database applications, but previous work [3], [5] on generating database states has often not taken them into account. Consider the following code after the query's returned result set has been iterated in our preceding example shown in Figure 1:

...
14:   while (results.Read()) {
15:     int income = int.Parse(results["income"].ToString());
16:     int balance = int.Parse(results["balance"].ToString());
17:     double diff = income - 1.5 * balance;
18:     if (diff > 100000) {
19:       count++;
20:       totalBalance = totalBalance + balance; } }
20a:  if (totalBalance > 500000)
20b:    do other calculation ...
21:   return ...; }

Here, the program variable totalBalance is data-dependent on the variable balance and thus is associated with the database attribute M.balance. The variable totalBalance is involved in a branch condition totalBalance > 500000 in Line 20a. Note that the variable totalBalance is aggregated from all returned record values. For simple aggregate calculations (e.g., sum, count, average, minimum, and maximum), we are able to incorporate the constraints from the branch condition in our auxiliary query formulation. Our idea is to extend the auxiliary query with the GROUP BY and HAVING clauses.
In our example, we learn that the variable totalBalance is a summation of all the values of the attribute M.balance. The variable totalBalance can thus be transformed into the aggregate function sum(M.balance). We include C.zipcode in the GROUP BY clause and the condition over sum(M.balance) in the HAVING clause of the extended auxiliary query:
SELECT C.zipcode, sum(M.balance)
FROM customer C, mortgage M
WHERE M.year=15 AND C.SSN=M.SSN
  AND C.income - 1.5 * M.balance > 100000
GROUP BY C.zipcode
HAVING sum(M.balance) > 500000
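As a usage illustration (ours, not from the paper), the extended auxiliary query can be executed with the same ADO.NET classes used in Fig. 1, and each returned zipcode maps back to a candidate program input via zip = zipcode - 1; the connection string is a placeholder.

using System;
using System.Data.SqlClient;

static class AggregateAuxiliaryQueryExample
{
    static void Main()
    {
        const string aux =
            "SELECT C.zipcode, sum(M.balance) FROM customer C, mortgage M"
          + " WHERE M.year=15 AND C.SSN=M.SSN AND C.income - 1.5 * M.balance > 100000"
          + " GROUP BY C.zipcode HAVING sum(M.balance) > 500000";

        using (var sc = new SqlConnection(".."))  // placeholder, as in Fig. 1
        {
            sc.Open();
            using (var cmd = new SqlCommand(aux, sc))
            using (var results = cmd.ExecuteReader())
            {
                while (results.Read())
                {
                    int zipcode = Convert.ToInt32(results[0]);
                    // C.zipcode = zip + 1, so the candidate program input is zipcode - 1.
                    Console.WriteLine("candidate input: type=0, zip=" + (zipcode - 1));
                }
            }
        }
    }
}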
Algorithm 3 Decomposing a DPNF query (see Section III-C)
1: for each disjunction Di in Qdpnf's WHERE clause do
2:   Build an empty query Qi;
3:   Append Qdpnf's SELECT clause to Qi's SELECT clause;
4:   Append Qdpnf's FROM clause to Qi's FROM clause;
5:   Append Di to Qi's WHERE clause;
6:   Apply Algorithm 1 on Qi and get its auxiliary query Q̃i;
7:   Apply Algorithm 2 on Q̃i and get output Ri;
8:   Rdpnf = Rdpnf ∪ Ri;
9: end for
10: Output the final program input values Rdpnf;
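The loop in Algorithm 3 (discussed in Section III-C) is essentially a per-disjunction rewrite. Below is a minimal sketch of that step, assuming the DPNF WHERE clause has already been split into its disjunctions by a SQL parser; the example disjunctions are made up for illustration.

using System;
using System.Collections.Generic;

static class DpnfDecompositionExample
{
    // For each disjunction Di of a DPNF query, emit a simple conjunctive query Qi
    // that Algorithm 1 can process; the input values derived from each Qi are unioned.
    static IEnumerable<string> Decompose(string selectList, string fromList,
                                         IEnumerable<string> disjunctions)
    {
        foreach (var d in disjunctions)
            yield return "SELECT " + selectList + " FROM " + fromList + " WHERE " + d;
    }

    static void Main()
    {
        var simpleQueries = Decompose(
            "C.SSN, C.income, M.balance",
            "customer C, mortgage M",
            new[] { "M.year=15 AND C.SSN=M.SSN", "M.year=30 AND C.SSN=M.SSN" });
        foreach (var q in simpleQueries) Console.WriteLine(q);
    }
}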
Cardinality Constraints. In many database applications, we often require the number of returned records to meet certain conditions (e.g., for performance testing). For example, after the execution reaches Line 20, another piece of code may be appended after Line 20:
20c:  if (count >= 3)
20d:    computeSomething();
Here we can use a special DSE technique [8] for dealing with input-dependent loops. With this technique, we can learn that the subpath with the conditions in Lines 14 and 18 being true has to be executed at least three times in order to cover the branch condition count >= 3 in Line 20c. Hence we need at least three records that reach Line 18 so that the true branches of Lines 14, 18, and 20c can be covered. In our auxiliary query, we can simply add COUNT(*) >= 3 to the HAVING clause to capture this cardinality constraint:
SELECT C.zipcode
FROM customer C, mortgage M
WHERE M.year=15 AND C.SSN=M.SSN
  AND C.income - 1.5 * M.balance > 100000
GROUP BY C.zipcode
HAVING COUNT(*) >= 3
Program logic could be far more complex than the appended code in Lines 20a-d of our example. We emphasize here that our approach so far works only for aggregate calculations that are supported by the SQL built-in aggregate functions. When the logic iterating over the result set becomes more complex than what SQL supports, we cannot directly determine the appropriate values for program inputs. For example, some zipcode values returned by our auxiliary query might not be usable to cover the true branch of Lines 20a-b, because the returned records for those zipcode values may fail to satisfy the complex aggregate condition in Line 20a. However, our approach can still provide a superset of valid program input values. Naively, we could iterate over all the candidate program input values to see whether some of them can cover a specific branch or path.

C. Dealing with Complex Queries

SQL queries embedded in application program code can be very complex. For example, they may involve nested subqueries with aggregation functions, union, distinct, group-by views, etc. The fundamental structure of a SQL query is a query block, which consists of SELECT, FROM, WHERE, GROUP BY, and HAVING clauses. If one or more predicates in the WHERE or HAVING clause are of the form [Ck op Q], where Q is itself a query block, the query is a nested query. A large body of work exists on query transformation in databases. Various decorrelation techniques (e.g., [4], [9]) have been explored to unnest complex queries into equivalent single-level canonical queries, and recent work [1] showed that almost all types of subqueries can be unnested. Generally, there are two types of canonical queries: DPNF, with a WHERE clause consisting of a disjunction of conjunctions as shown below,

SELECT C1, C2, ..., Ch
FROM from-list
WHERE (A11 AND ... AND A1n) OR ... OR (Am1 AND ... AND Amn)
and CPNF, with a WHERE clause consisting of a conjunction of disjunctions (such as (A11 OR ... OR A1n) AND ... AND (Am1 OR ... OR Amn)). Note that DPNF and CPNF can be transformed into each other using De Morgan's rules (see footnote 3). Next we present our algorithm for formulating auxiliary queries and determining program input values given a general DPNF query. Our previous Algorithm 1 deals with only a special case of DPNF where the query's WHERE clause contains a single conjunction A11 AND ... AND A1n. We show the algorithm details in Algorithm 3. Our idea is to decompose the DPNF query Qdpnf into m simple queries Qi (i = 1, ..., m). The WHERE clause of each Qi contains only one disjunction in the canonical form, Ai1 AND ... AND Ain. We apply Algorithm 1 to generate its corresponding auxiliary query Q̃i and apply Algorithm 2 to generate program input values Ri. The union of the Ri's then contains all appropriate program input values.

IV. EVALUATION

Our approach can provide assistance to DSE-based test-generation tools (e.g., Pex [12] for .NET) to improve code coverage in database application testing. In our evaluation, we seek to evaluate the benefit and cost of our approach from the following two perspectives:
3 http://en.wikipedia.org/wiki/De_Morgan's_laws
TABLE III. Evaluation results for RiskIt

No. | method                    | covered (Pex) | covered (Pex+ours) | increase | time Pex (s) | time ours (s)
1   | getAllZipcode             | 17  | 37  | 51.28% | time out | 21.4
2   | filterOccupation          | 27  | 37  | 24.39% | time out | 34.3
3   | filterZipcode             | 28  | 38  | 23.81% | 42.3     | 25.7
4   | filterEducation           | 27  | 37  | 24.39% | time out | 27.6
5   | filterMaritalStatus       | 27  | 37  | 24.39% | 48.5     | 29.4
6   | findTopIndustryCode       | 13  | 14  | 5.26%  | time out | 29.2
7   | findTopOccupationCode     | 13  | 14  | 5.26%  | time out | 23.3
8   | updatestability           | 61  | 75  | 17.72% | time out | 23.4
9   | userinformation           | 40  | 57  | 27.87% | 62.4     | 20.8
10  | updatetable               | 42  | 56  | 23.33% | 67.8     | 20.9
11  | updatewagetable           | 42  | 48  | 11.54% | time out | 27.8
12  | filterEstimatedIncome     | 44  | 54  | 17.24% | time out | 23.6
13  | calculateUnemploymentRate | 45  | 45  | 0.00%  | time out | 23.7
14  | calculateScore            | 16  | 87  | 76.35% | time out | 23.3
15  | getValues                 | 38  | 99  | 57.01% | time out | 42.7
16  | getOneZipcode             | 23  | 32  | 26.47% | time out | 39.1
17  | browseUserProperties      | 85  | 104 | 17.60% | time out | 81.1
    | all methods (total)       | 588 | 871 | 25.52% | 1781.0   | 517.3
- RQ1: What is the percentage increase in code coverage by the program inputs generated by Pex with our approach's assistance compared to the program inputs generated by Pex without our approach's assistance in testing database applications?
- RQ2: What is the cost of our approach's assistance?

In our evaluation, we first run Pex without our approach's assistance to generate test inputs. We record the statistics of code coverage, including total program blocks, covered blocks, and coverage percentages. We also record the number of runs and the execution time. A run represents one time that one path is explored by Pex using a set of program input values. Because of the large or infinite number of paths in the code under test, Pex uses exploration bounds to make sure that it terminates after a reasonable amount of time. For example, the bound TimeOut denotes the number of seconds after which the exploration stops. In our evaluation, we use the default value TimeOut=120s and use "time out" to indicate timeout cases. Pex often fails to generate test inputs to satisfy or cover branch conditions that are data-dependent on the query's execution or its returned result set. We then perform our algorithms to construct auxiliary queries based on the intermediate information collected from Pex's previous exploration. We then execute the auxiliary queries against the existing database and generate new test inputs. We then run the test inputs previously generated by Pex and the new test inputs generated by our approach, and record the new statistics.

We conduct an empirical evaluation on two open source database applications: RiskIt (see footnote 4) and UnixUsage (see footnote 5). RiskIt is an insurance quote application that makes estimations based on users' personal information, such as zipcode and income. It has an existing database containing 13 tables, 57 attributes, and more than 1.2 million records. UnixUsage is an application for obtaining statistics about how users interact with Unix systems using different commands. It has a database containing 8 tables, 31 attributes, and more than 0.25 million records. Both applications were written in Java. To test them in the Pex environment, we converted the Java source code into C# code using a tool called Java2CSharpTranslator (see footnote 6). The detailed evaluation subjects and results can be found on our project website (see footnote 7).

4 https://riskitinsurance.svn.sourceforge.net
5 http://sourceforge.net/projects/se549unixusage

A. Code Coverage

We show the evaluation results in Table III and Table IV. For each table, the first two columns show the index and the method name. The covered(blocks) columns show the number of blocks covered by Pex without our approach's assistance, the number of blocks covered by Pex together with our approach's assistance, and the percentage increase (relative to the total number of blocks in the method), respectively. Within the RiskIt application, 17 methods are found to contain program inputs related to database attributes. These 17 methods contain 943 code blocks in total. Test inputs generated by Pex without our approach's assistance cover 588 blocks, while Pex with our approach's assistance covers 871 blocks. In fact, Pex with our approach's assistance can cover all branches except those related to exception handling. For example, method No. 1 contains 39 blocks in total. Pex without our approach's assistance covers 17 blocks, while Pex with our approach's assistance covers 37 blocks. The two not-covered blocks belong to the catch statements, which mainly deal with exceptions at runtime.
6 http://sourceforge.net/projects/j2cstranslator/
7 http://www.sis.uncc.edu/xwu/DBGen
TABLE IV. Evaluation results for UnixUsage

No. | method                                         | covered (Pex) | covered (Pex+ours) | increase | time Pex (s) | time ours (s)
1   | courseNameExists                               | 6   | 7   | 14.29% | 32.2     | 20.0
2   | getCourseIDByName                              | 6   | 10  | 40.00% | 29.9     | 20.0
3   | computeFileToNetworkRatioForCourseAndSessions  | 8   | 25  | 68.00% | time out | 24.9
4   | outputUserName                                 | 9   | 14  | 35.71% | 38.3     | 20.5
5   | deptNameExists                                 | 9   | 13  | 30.77% | 43.3     | 20.0
6   | computeBeforeAfterRatioByDept                  | 8   | 24  | 66.67% | 77.5     | 22.0
7   | getDepartmentIDByName                          | 7   | 11  | 36.36% | time out | 20.0
8   | computeFileToNetworkRatioForDept               | 20  | 21  | 4.76%  | time out | 21.5
9   | officeNameExists                               | 7   | 11  | 36.36% | 41.4     | 20.0
10  | getOfficeIdByName                              | 5   | 9   | 44.44% | 51.3     | 20.0
11  | raceExists                                     | 7   | 11  | 36.36% | 32.1     | 20.0
12  | userIdExists (version 1)                       | 7   | 11  | 36.36% | 40.3     | 20.0
13  | transcriptExist                                | 7   | 11  | 36.36% | 39.9     | 20.0
14  | getTranscript                                  | 5   | 6   | 16.67% | 33.7     | 20.0
15  | commandExists (version 1)                      | 6   | 10  | 40.00% | 36.0     | 20.0
16  | categoryExists                                 | 7   | 11  | 36.36% | 33.3     | 20.0
17  | getCategoryByCommand                           | 5   | 8   | 37.50% | 34.1     | 20.0
18  | getCommandsByCategory                          | 6   | 10  | 40.00% | 38.4     | 20.0
19  | getUnixCommand                                 | 5   | 6   | 16.67% | 40.3     | 20.0
20  | retrieveUsageHistoriesById                     | 7   | 21  | 66.67% | 58.4     | 27.2
21  | userIdExists (version 2)                       | 7   | 11  | 36.36% | 32.7     | 20.0
22  | commandExists (version 2)                      | 7   | 11  | 36.36% | 36.5     | 20.0
23  | retrieveMaxLineNo                              | 7   | 10  | 30.00% | time out | 22.3
24  | retrieveMaxSequenceNo                          | 7   | 10  | 30.00% | time out | 20.1
25  | getSharedCommandCategory                       | 7   | 11  | 36.36% | time out | 20.4
26  | getUserInfoBy                                  | 15  | 47  | 68.09% | time out | 20.0
27  | doesUserIdExist                                | 9   | 10  | 10.00% | 41.3     | 20.0
28  | getPrinterUsage                                | 27  | 34  | 20.59% | 67.2     | 20.6
    | all methods (total)                            | 258 | 394 | 34.52% | 1718.1   | 579.5
The UnixUsage application contains 28 methods whose program inputs are related to database attributes, with 394 code blocks in total. Pex without our approach's assistance covers 258 blocks, while Pex with our approach's assistance covers all 394 blocks. The UnixUsage application constructs a connection with the database in a separate class that none of these 28 methods belong to. Thus, failing to generate inputs that can cause runtime database-connection exceptions is not reflected when testing these 28 methods.

B. Cost

In Tables III and IV, the last columns show the cost. The Pex time column shows the execution time used by Pex without our approach's assistance. We notice that, for both applications, Pex often terminates with a time out. The reason is that Pex often fails to enter the loops that iterate over the returned result records. The ours time column shows the extra execution time incurred by our approach (i.e., the time of constructing auxiliary queries, deriving program input values by executing the auxiliary queries against the existing database, and running the new test inputs); we also record the additional number of runs performed with our approach's assistance. We observe that, for both applications, Pex with the assistance
of our approach achieves much higher code coverage with relatively low additional cost: a few extra runs and a small amount of extra execution time. In our evaluation, we set TimeOut to 120 seconds. For those timed-out methods, Pex could not achieve new code coverage even given larger TimeOut values. Our approach could effectively help cover new branches not covered by Pex, with relatively low cost. Note that in our current evaluation, we loosely integrate Pex and our approach: we perform our algorithms only after Pex finishes its previous exploration (i.e., after applying Pex without our approach's assistance), since our algorithms rely on the intermediate information collected during Pex's exploration. We expect that after our approach is tightly integrated into Pex, it can effectively reduce the overall cost (which is currently the sum of the Pex time and our extra time in Tables III and IV). In such a tight integration, our algorithms can be triggered automatically when Pex fails to generate test inputs to satisfy branch conditions that are data-dependent on a query's execution or its returned result set.

V. RELATED WORK

Database application testing has attracted much attention recently. The AGENDA project [5] addressed how to generate
test inputs to satisfy basic database integrity constraints; it does not consider parametric queries or constraints on query results during input generation. One problem with AGENDA is that it cannot guarantee that executing the test query on the generated database states produces the desired query results. Willmor and Embury [14] developed an approach that builds a database state for each test case intensionally, in which the user provides a query that specifies the pre- and post-conditions for the test case. Binnig et al. [2] also extended symbolic execution and used symbolic query processing to generate query-aware databases. However, the test database states generated by their QAGen prototype system [2] mainly aim to be used in database management system (DBMS) testing. Emmi et al. [6] developed an approach for automatic test generation for a database application. Their approach is based on DSE and uses symbolic constraints in conjunction with a constraint solver to generate both program inputs and database states. We developed an approach [13] that leverages DSE to generate database states to achieve advanced structural coverage criteria. In this paper, we focus on program-input generation given an existing database state, avoiding the high overhead of generating new database states during test generation. Li and Csallner [10] considered a similar scenario, i.e., how to exploit existing databases to maximize coverage under DSE. However, their approach constructs a new query by analyzing the current query, the result tuples, the covered and not-covered paths, and the satisfied and unsatisfied branch conditions. It can neither capture the close relationship between program inputs and results of SQL queries, nor generate program inputs to maximize code coverage.

VI. CONCLUSIONS

In this paper, we have presented an approach that takes a database application and a given database as input, and generates appropriate program input values to achieve high code coverage. In our approach, we employ dynamic symbolic execution to analyze the code under test and formulate auxiliary queries based on extracted constraints to generate
program input values. Empirical evaluations on two open source database applications showed that our approach can assist Pex, a state-of-the-art DSE tool, to generate program inputs that achieve higher code coverage than the program inputs generated by Pex without our approach's assistance. In our future work, we plan to extend our technique to construct auxiliary queries directly from embedded complex queries (e.g., nested queries), rather than from their transformed normal forms.
Acknowledgment. This work was supported in part by the U.S. National Science Foundation under CCF-0915059 for Kai Pan and Xintao Wu, and under CCF-0915400 for Tao Xie.
REFERENCES
[1] R. Ahmed, A. Lee, A. Witkowski, D. Das, H. Su, M. Zait, and T. Cruanes. Cost-based query transformation in Oracle. In VLDB, pages 1026-1036, 2006.
[2] C. Binnig, D. Kossmann, E. Lo, and M. T. Özsu. QAGen: generating query-aware test databases. In SIGMOD, pages 341-352, 2007.
[3] D. Chays and J. Shahid. Query-based test generation for database applications. In DBTest, pages 1-6, 2008.
[4] U. Dayal. Of nests and trees: A unified approach to processing queries that contain nested subqueries, aggregates, and quantifiers. In VLDB, pages 197-208, 1987.
[5] Y. Deng, P. Frankl, and D. Chays. Testing database transactions with AGENDA. In ICSE, pages 78-87, 2005.
[6] M. Emmi, R. Majumdar, and K. Sen. Dynamic test input generation for database applications. In ISSTA, pages 151-162, 2007.
[7] P. Godefroid, N. Klarlund, and K. Sen. DART: directed automated random testing. In PLDI, pages 213-223, 2005.
[8] P. Godefroid and D. Luchaup. Automatic partial loop summarization in dynamic test generation. In ISSTA, pages 23-33, 2011.
[9] W. Kim. On optimizing an SQL-like nested query. ACM Trans. Database Syst., 7(3):443-469, 1982.
[10] C. Li and C. Csallner. Dynamic symbolic database application testing. In DBTest, pages 1-6, 2010.
[11] K. Taneja, Y. Zhang, and T. Xie. MODA: Automated test generation for database applications via mock objects. In ASE, pages 289-292, 2010.
[12] N. Tillmann and J. de Halleux. Pex: white box test generation for .NET. In TAP, pages 134-153, 2008.
[13] K. Pan, X. Wu, and T. Xie. Database state generation via dynamic symbolic execution for coverage criteria. In DBTest, pages 1-6, 2011.
[14] D. Willmor and S. M. Embury. An intensional approach to the specification of test cases for database applications. In ICSE, pages 102-111, 2006.