Software Verification: Testing vs. Model Checking: A Comparative Evaluation of The State of The Art
1 Introduction
Software testing has been the standard technique for identifying software bugs
for decades. The exhaustive and sound alternative, software model checking,
is believed to be immature for practice. Some often-named disadvantages are
the need for experts in formal verification, extreme resource consumption, and
maturity issues when it comes to handling large software systems.
But are these concerns still true today? We claim that the answer is No,
and show with experiments on a large benchmark of C programs that software
model checkers even find more bugs than testers. We find that it is time for a
comparative evaluation of testing tools against model-checking tools, motivated
by the success of software model checkers as demonstrated in the annual
International Competition on Software Verification (SV-COMP) [4], and by the
move of development groups of large software systems, such as Facebook
(http://fbinfer.com/), Microsoft [2, 44], and Linux [38], towards formal verification.
Our contribution is a thorough experimental comparison of software testers
against software model checkers. We performed our experimental study on
5 693 programs from a widely-used and state-of-the-art benchmarking set
(https://github.com/sosy-lab/sv-benchmarks). To represent the state of the art
in terms of tools, we use AFL-fuzz, CPATiger, Crest-ppc, FShell, Klee, and
PRtest as software testers, and CBMC, CPA-seq, ESBMC-incr, and ESBMC-kInd
as software model checkers. The goal of our study
is to evaluate the ability to reliably find specification violations in software. While
the technique of model checking was originally developed as a proof technique
for showing formal correctness, rather than for efficiently finding bugs, this study
evaluates all tools exclusively against the goal of finding bugs.
To make the test generators comparable, we developed a unifying framework
for test-based falsification (TBF) that interfaces between input programs, test
generators, and test cases. For each tester, the infrastructure needs to (a) prepare
the input program source code to match the input format that the tester expects
and can consume, (b) run the tester to generate test cases, (c) extract test vectors
from the tester’s proprietary format for the produced test cases, and (d) execute
the tests using a test harness to validate whether the generated test cases cover
the bug in the program under test (i.e., whether at least one test case exposes
the bug). If a bug is found, the framework outputs a witnessing test case in two
different, human- and machine-readable formats: (1) a compilable test harness
that can be used to directly provoke the bug in the program through execution
and (2) a violation witness in a common exchange format for witnesses [7], which
can be given to a witness validator to check the specification violation formally
or by execution. This allows us to use input programs, produce executable
tests, and check program behavior independently of a specific tester’s quirks
and requirements. We make the following contributions: a unifying framework
(TBF) for test-based falsification that makes different test generators usable in
a common workflow, and a thorough experimental comparison of software testers
against software model checkers on a large benchmark set.
AFL-fuzz is a coverage-guided greybox fuzzer: it executes the program under
test on concrete inputs, observes which program parts are covered, and mutates
inputs that lead to new coverage in order to generate further test inputs.
Output: AFL-fuzz outputs each generated test case in its own file. The file’s
binary representation is read ‘as is’ as input, so generated test cases do not
have a specific format.
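To illustrate what reading a test case ‘as is’ means, the following minimal sketch
(not TBF’s actual code; the function name and the zero-defaulting behavior are
assumptions for illustration) interprets the first bytes of such a raw test-case
file as one integer input:

    #include <stdio.h>
    #include <string.h>

    /* Sketch: interpret the first sizeof(int) bytes of a raw test-case file
       as one nondeterministic int input; missing bytes stay zero. */
    int nondet_int_from_file(const char *path) {
        unsigned char buf[sizeof(int)] = {0};
        FILE *f = fopen(path, "rb");
        if (f != NULL) {
            if (fread(buf, 1, sizeof(buf), f) < sizeof(buf)) {
                /* short file: remaining bytes keep their zero default */
            }
            fclose(f);
        }
        int value;
        memcpy(&value, buf, sizeof(value));
        return value;
    }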
CPATiger [10] uses model checking, more specifically, predicate abstraction [12],
for test-case generation. It is based on the software-verification tool
CPAchecker [11] and uses the FShell query language (FQL) [35] for specification
of coverage criteria. If CPATiger finds a feasible program path to a coverage
criterion with predicate abstraction, it computes test inputs from the
corresponding predicates used along that path. It is designed to create test
vectors for complicated coverage criteria. Output: CPATiger outputs generated
test cases in a single text file, providing the test input as test vectors
in decimal notation together with additional information.
Crest [18] uses concolic execution for test-case generation. It is search-based,
i.e., it chooses test inputs that reach yet uncovered parts of the program
furthest from the already explored paths. Crest-ppc [39] improves on the
concolic execution used in Crest by modifying the input generation method to
query the constraint solver more often, but using only a small set of
constraints for each query. We performed experiments to ensure that Crest-ppc
outperforms Crest. The results are available on our supplementary web page.
Output: Crest-ppc outputs each generated test case in a text file, listing the
sequence of used input values in decimal notation.
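For text-based formats like this, the extraction step (c) of TBF essentially
amounts to parsing one input value per line. A minimal sketch under that
assumption (the real extractor also has to handle tool-specific annotations,
which are omitted here) could look as follows:

    #include <stdio.h>

    /* Sketch: read a test case given as one decimal input value per line
       into a test vector; returns the number of values read. */
    int read_test_vector(const char *path, long long *vec, int max_values) {
        FILE *f = fopen(path, "r");
        if (f == NULL)
            return 0;
        int n = 0;
        while (n < max_values && fscanf(f, "%lld", &vec[n]) == 1)
            n++;
        fclose(f);
        return n;
    }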
FShell [35] is another model-checking-based test-case generator. It uses
CBMC (described in Sect. 2.2) for state-space exploration and also uses FQL
for specification of coverage criteria. Output: FShell outputs generated test
cases in a single text file, listing input values of tests together with additional
information. Input values of tests are represented in decimal notation.
Klee [19] uses symbolic execution for test-case generation. After each step in a
program, Klee chooses which of the existing program paths to continue on next,
based on different heuristics, including a search-based one and one preventing
inefficient unrolling of loops. Since Klee uses symbolic execution, it can explore
the full state space of a program and can be used for software verification, not
just test-case generation. As we are interested in exploring the capabilities of
testing, we only consider the test cases produced by Klee. Output: Klee outputs
each generated test case in a binary format that can be read with Klee. The
input values of tests are represented by their bit width and bit representation.
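The preparation step (a) for Klee is analogous to the Crest-ppc preparation shown
in Fig. 3: input functions are redefined so that their return value is marked as
symbolic. A sketch using Klee’s klee_make_symbolic API (illustrative; not
necessarily the exact code TBF generates) is:

    #include <klee/klee.h>

    /* Sketch: nondet_int() prepared for Klee; the return value is marked
       as symbolic so that Klee explores all possible input values. */
    int nondet_int(void) {
        int __sym;
        klee_make_symbolic(&__sym, sizeof(__sym), "__sym");
        return __sym;
    }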
PRtest is a simple tool for plain random testing. The tool is delivered together
with TBF and serves as the baseline in our experiments. Output: PRtest outputs
each generated test case in a text file, listing the sequence of used input
values in hexadecimal notation.
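A minimal sketch of plain random test-case generation in this spirit (the file
name, the number of input values per test case, and the value range are
illustrative assumptions, not PRtest’s actual implementation):

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    /* Sketch: emit one random test case with ten input values,
       one hexadecimal value per line. */
    int main(void) {
        srand((unsigned) time(NULL));
        FILE *f = fopen("testcase.txt", "w");
        if (f == NULL)
            return 1;
        for (int i = 0; i < 10; i++)
            fprintf(f, "0x%x\n", (unsigned) rand());
        fclose(f);
        return 0;
    }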
[Figure: TBF’s Harness Generator and Witness Generator produce a test harness and a violation witness if the verdict is false.]

    int main() {
        int x = nondet_int();
        int y = x;
        if (nondet_short()) {
            x++;
        } else {
            y++;
        }
        if (x > y) {
            __VERIFIER_error();
        }
    }

Fig. 2: An example C program

    int nondet_int() {
        int __sym;
        CREST_int(__sym);
        return __sym;
    }

Fig. 3: A function definition prepared for Crest-ppc

    void __VERIFIER_error() {
        fprintf(stderr, "__TBF_error_found.\n");
        exit(1);
    }

    int nondet_int() {
        unsigned int inp_size = 3000;
        char *inp_var = malloc(inp_size);
        fgets(inp_var, inp_size, stdin);
        return *((int *) parse_inp(inp_var));
    }

    short nondet_short() {
        unsigned int inp_size = 3000;
        char *inp_var = malloc(inp_size);
        fgets(inp_var, inp_size, stdin);
        return *((short *) parse_inp(inp_var));
    }

Fig. 4: Excerpt of a test harness; test vectors are passed by standard input (fgets, parse_inp)
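Given a program as in Fig. 2 and a harness as in Fig. 4, step (d) of TBF
conceptually amounts to compiling both together, feeding the extracted test
vector through standard input, and checking whether the error marker is reached.
The following minimal sketch illustrates this check (the file names, the
compiler invocation, and the use of the exit code are illustrative assumptions):

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/wait.h>

    /* Sketch: compile the program under test together with the generated
       harness, run it with the test vector on stdin, and treat exit code 1
       (set by the harness in __VERIFIER_error(), cf. Fig. 4) as 'bug exposed'. */
    int test_exposes_bug(const char *vector_file) {
        if (system("gcc -o harnessed program.c harness.c") != 0)
            return 0;                      /* compilation failed */
        char cmd[512];
        snprintf(cmd, sizeof(cmd), "./harnessed < %s", vector_file);
        int status = system(cmd);
        return status != -1 && WIFEXITED(status) && WEXITSTATUS(status) == 1;
    }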
Fig. 5: Violation witness for test vector ⟨43, 1⟩ and two non-deterministic methods (automaton whose transitions assume \return == 43 for nondet_int() and \return == 1 for nondet_short() and lead from the initial state to the error state at the call to __VERIFIER_error())
Fig. 6: Quantile plots for the different tools for finding bugs in programs (x-axis: n-th fastest correct result; one graph per tester and per model checker)
Category              Programs  AFL-fuzz^t  CPATiger^t  Crest-ppc^t  FShell^t  Klee^t  PRtest^t  CBMC^m  CPA-seq^m  ESBMC-incr^m  ESBMC-kInd^m  Union Testers  Union MC  Union All
Arrays                      81          26           0           20         4      22        25       6          3             6             4             31        13         33
BitVectors                  24          11           5            7         5      11        10      12         12            12            12             14        17         19
ControlFlow                 42          15           0           11         3      20         3      41         23            36            35             21        42         42
ECA                        413         234           0           51         0     260         0     143        257           221           169            286        42        338
Floats                      31          11           2            2         4       2        11      31         29            17            13             13        31         31
Heap                        66          46          22           16        13      48        32      64         31            62            58             48        66         66
Loops                       46          45          27           29         5      40        33      42         36            42            38             41        38         43
ProductLines               265         169           1          204       156     255       144     263        265           265           263            265       265        265
Recursive                   45          44           0           35        22      45        31      42         41            40            40             45        43         45
Sequentialized             170           4           0            1        24     123         3     135        122           135           134            123       141        147
LDV                        307           0           0            0         0       0         0      51         70           113            78              0       147        147
Total Found              1 490         605          57          376       236     826       292     830        889           949           844            887     1 092      1 176
Compilable               1 115         605          57          376       236     826       292     779        819           830           761            887       930      1 014
Wit. Confirmed           1 490           -           -            -         -       -         -     761        857           705           634            887       979      1 068
Median CPU Time (s)          -          11         4.5          3.4       6.2     3.6       3.6     1.4         15           1.9           2.3              -         -          -
Average CPU Time (s)         -          82          38          4.1        27      33       6.7      46         51            61            69              -         -          -
(^t: tester, ^m: model checker)
The table above shows, first, the results for each of the 11 categories of the
programs under test, second, the results for all categories together, and third,
the required CPU times.
The row ‘Total Found’ shows that the best tester (Klee^t) is able to find
826 bugs, while all model checkers find more, with the best model checker
(ESBMC-incr^m) finding 15 % more bugs (949) than the best tester. An interesting
observation is that the different tools have different strengths and weaknesses:
column ‘Union Testers’ shows that applying all testers together increases the
number of solved tasks considerably. This is made possible by our unifying
framework TBF, which abstracts from the differences in input and output of
the various tools and lets us use all testers in a common workflow. The same
holds for the model checkers: the combination of all approaches significantly
increases the number of solved problems (column ‘Union MC’). The combination
of testers and model checkers (column ‘Union All’) in a bug-finding workflow
can further improve the results significantly, i.e., there are program bugs that
one technique can find but not the other, and vice versa.
While it is usually considered an advantage that model checkers (as
static-analysis tools) can be applied to incomplete programs that are not yet
fully defined, testers (as dynamic-analysis tools) obviously cannot be applied
to such programs. This issue applies in particular to the
category ‘LDV’ of device drivers, which contain non-deterministic models of the
operating-system environment. This kind of program is important because it
is successfully used to find bugs in systems code [47] (see
http://linuxtesting.org/results/ldv), but in order to provide
a comparison without the influence of this issue, we also report the results
restricted to those programs that are compilable (row ‘Compilable’).
For the testers, TBF validates whether a test case is generated that identifies
the bug as found. This test case can later be used to reproduce the error path
using execution, and a debugger helps to comprehend the bug. For the model
checkers, the reported violation witness identifies the bug as found. This witness
can later be used to reproduce the error path using witness validation, and an
error-path visualizer helps to comprehend the bug. Since the model checkers
usually do not generate a test case, we cannot perform the same validation
as for the testers, i.e., execute the program with the test case and check if it
crashes. However, all four model checkers that we use support exchangeable
violation witnesses [7], and we can use existing witness validators to confirm
the witnesses. We report the results in row ‘Wit. Confirmed’, which counts
only those error reports that were confirmed by at least one witness validator.
While this technique is not always able to confirm correct witnesses (cf. [4],
Table 8), the big picture does not change. The test generators do not need this
additional confirmation step, because TBF takes care of this already. There are
two interesting insights: (1) Software model checkers should in addition produce
test data, either contained in the violation witness or as a separate test vector.
This makes it easier to reproduce a found bug using program execution and
explore the bug with debugging. (2) Test generators should in addition produce
a violation witness. This makes it easier to reproduce a found bug using witness
validation and explore the bug with error-path visualization [6].
Consideration of False Alarms. So far we have discussed only the programs
that contain bugs. In order to evaluate how many false alarms the tools produce,
we have also considered the 4 203 programs without a known bug. All testers
report only 3 bugs on those programs. We manually investigated the cause and
found that the benchmark set, not the testers, is to blame: the three programs,
all from the ReachSafety-Floats category of SV-COMP, are only safe if compiled
with 64-bit rounding mode for floats or for a 64-bit machine model, whereas the
category states that the programs should be executed in a 32-bit machine model,
which seems incorrect. Each of the four model checkers solves at least one of
these three tasks with verdict true, implying an imprecise handling of
floating-point arithmetic. The model checkers also produce a very low number of
false alarms, the largest number being 6 false alarms reported by ESBMC-incr^m.
4.3 Validity
Validity of Framework for Test-Based Falsification. The results of the
testers depend on a correctly working test-execution framework. In order to
increase the confidence in our own framework TBF, we compare the results
obtained with TBF against the results obtained with a proprietary test-execution
mechanism that Klee provides: Klee-replay
(http://klee.github.io/tutorials/testing-function/#replaying-a-test-case).
Figure 7 shows the CPU time in seconds required by Klee^t using TBF (x-axis)
and Klee-replay (y-axis) for each verification task that could be solved by
either one of them. It shows that Klee^t (and thus, TBF) is very similar to
Klee’s native solution.

[Fig. 7: CPU time in seconds for Klee^t with TBF (x-axis) versus Klee-replay (y-axis); both axes are logarithmic from 0.1 to 1 000.]

Over all verification tasks, Klee^t is able to find bugs in 826 tasks, while
Klee-replay is able to find bugs in 821 tasks. There are 15 tasks that
Klee-replay cannot solve while Klee^t can, and 10 tasks that Klee-replay can
solve while Klee^t cannot.
For Klee^t, one unsolved task is due to missing support of a corner case for
the conversion of Klee’s internal representation of numbers to a test vector. The
remaining difference is due to an improper machine model: for Klee-replay, we
only had 64-bit libraries available, while most tasks of SV-COMP are intended to
be analyzed using a 32-bit architecture. This only results in a single false result,
but interprets some of the inputs generated for 32-bit programs differently, thus
reaching different parts of the program in a few cases. This also explains the
few outliers in Fig. 7. The two implementations both need a median of 0.43 s
of CPU time to find a bug in a task. This shows that our implementation is
similar in effectiveness and efficiency to Klee’s own, tailored test-execution mechanism.
Other Threats to Internal Validity. We used the state-of-the-art benchmarking
tool BenchExec [14] to run every execution in an isolated container with
dedicated resources, making our results as reliable as possible. Our
experimental results for the considered model checkers are very close to the
results of SV-COMP’17 (https://sv-comp.sosy-lab.org/2017/results/), indicating
their accuracy. Our framework TBF is a prototype and may contain bugs that
degrade the real performance of test-based falsification. More tasks could
probably be solved if more time were invested in improving this approach, but
we tried to keep our approach as simple as possible in order to influence the
results as little as possible.
Threats to External Validity. There are several threats to external validity.
All tools that we evaluated are aimed at analyzing C programs. It might be the
case that testing research is focused on other languages, such as C++ or Java.
Other languages may contain other quirks than C that make certain approaches
to test-case generation and model checking more or less successful. In addition,
there may be tools using more effective testing or model-checking techniques
that were developed for other languages and thus are not included here.
The selection of testers could be biased by the authors’ background, but
we reflected the state of the art (see the discussion of the selection) and related work
in our choice. While we tried to represent the current landscape of test-case
generators by using tools that use fundamentally different approaches, there
might be other approaches that may perform better or that may be able to
solve different tasks. We used most of the recent, publicly available test-case
generators aimed at sequential C programs. We did not include model-based
or combinatorial test-case generators in our evaluation.
For representing the current state of the art in model checking, we only used
four tools to limit the scope of this work. The selection of model checkers is
based on competition results: we simply used the four best tools in SV-COMP’17.
There are many other model-checking tools available. Since we performed our
experiments on a subset of the SV-COMP benchmark set and used a similar
execution environment, our results can be compared online with all verifiers that
participated in the competition. The software model checkers might be tuned
towards the benchmark set, because all of the software model checkers participated
in SV-COMP, while of the testers, only FShell participated in SV-COMP before.
While we tried to achieve high external validity by using the largest and most
diverse open benchmark set, there is a high chance that the benchmark set does not
appropriately represent the real landscape of existing programs with and without
bugs. Since the benchmark set is used by the SV-COMP community, it might be bi-
ased towards software model checkers, and thus, must stay a mere approximation.
5 Conclusion
Our comparison of software testers with software model checkers has shown
that the considered model checkers are competitive for finding bugs on the
used benchmark set. We developed a testing framework that supports the easy
comparison of different test-case generators with each other, and with model
checkers. Through this, we were able to perform experiments that clearly showed
that model checking is mature enough to be used in practice, and even outperforms
the bug-finding capabilities of state-of-the-art testing tools. It is able to find
more bugs in programs than the testers and also finds those bugs faster. With this
study, we do not argue for abandoning testing, whose importance and usefulness
cannot be stressed enough. But we lay the groundwork for showing that model
checking should be considered for practical applications. Perhaps the most
important insight of our evaluation is that it does not make much sense to
distinguish between testing and model checking if the purpose is finding bugs;
instead, one should leverage the strengths of the different techniques to
construct even better tools by combination.
References
1. S. Anand, E. K. Burke, T. Y. Chen, J. A. Clark, M. B. Cohen, W. Grieskamp,
M. Harman, M. J. Harrold, and P. McMinn. An orchestrated survey of methodologies
for automated software test-case generation. Journal of Systems and Software,
86(8):1978–2001, 2013.
2. T. Ball and S. K. Rajamani. The Slam project: Debugging system software via
static analysis. In Proc. POPL, pages 1–3. ACM, 2002.
3. D. Beyer. Competition on software verification (SV-COMP). In Proc. TACAS,
LNCS 7214, pages 504–524. Springer, 2012.
4. D. Beyer. Software verification with validation of results (Report on SV-COMP
2017). In Proc. TACAS, LNCS 10206, pages 331–349. Springer, 2017.
5. D. Beyer, A. J. Chlipala, T. A. Henzinger, R. Jhala, and R. Majumdar. Generating
tests from counterexamples. In Proc. ICSE, pages 326–335. IEEE, 2004.
6. D. Beyer and M. Dangl. Verification-aided debugging: An interactive web-service
for exploring error witnesses. In Proc. CAV, LNCS 9780. Springer, 2016.
7. D. Beyer, M. Dangl, D. Dietsch, M. Heizmann, and A. Stahlbauer. Witness
validation and stepwise testification across software verifiers. In Proc. FSE, pages
721–733. ACM, 2015.
8. D. Beyer, M. Dangl, and P. Wendler. Boosting k-induction with continuously-refined
invariants. In Proc. CAV, LNCS 9206, pages 622–640. Springer, 2015.
9. D. Beyer, T. A. Henzinger, R. Jhala, and R. Majumdar. The software model checker
Blast. Int. J. Softw. Tools Technol. Transfer, 9(5-6):505–525, 2007.
10. D. Beyer, A. Holzer, M. Tautschnig, and H. Veith. Information reuse for multi-goal
reachability analyses. In Proc. ESOP, LNCS 7792, pages 472–491. Springer, 2013.
11. D. Beyer and M. E. Keremoglu. CPAchecker: A tool for configurable software
verification. In Proc. CAV, LNCS 6806, pages 184–190. Springer, 2011.
12. D. Beyer, M. E. Keremoglu, and P. Wendler. Predicate abstraction with adjustable-
block encoding. In Proc. FMCAD, pages 189–197. FMCAD, 2010.
13. D. Beyer and S. Löwe. Explicit-state software model checking based on CEGAR
and interpolation. In Proc. FASE, LNCS 7793, pages 146–162. Springer, 2013.
14. D. Beyer, S. Löwe, and P. Wendler. Reliable benchmarking: Requirements and
solutions. Int. J. Softw. Tools Technol. Transfer, 2017.
15. D. Beyer and P. Wendler. Reuse of verification results: Conditional model checking,
precision reuse, and verification witnesses. In Proc. SPIN, LNCS. Springer, 2013.
16. A. Biere, A. Cimatti, E. M. Clarke, and Y. Zhu. Symbolic model checking without
BDDs. In Proc. TACAS, LNCS 1579, pages 193–207. Springer, 1999.
17. M. Böhme, V. Pham, and A. Roychoudhury. Coverage-based greybox fuzzing as
Markov chain. In Proc. SIGSAC, pages 1032–1043. ACM, 2016.
18. J. Burnim and K. Sen. Heuristics for scalable dynamic test generation. In Proc.
ASE, pages 443–446. IEEE, 2008.
19. C. Cadar, D. Dunbar, and D. R. Engler. KLEE: Unassisted and automatic generation
of high-coverage tests for complex systems programs. In Proc. OSDI, pages 209–224.
USENIX Association, 2008.
20. A. Cimatti, A. Griggio, B. J. Schaafsma, and R. Sebastiani. The MathSAT5 SMT
solver. In Proc. TACAS, LNCS 7795, pages 93–107. Springer, 2013.
21. E. M. Clarke, O. Grumberg, S. Jha, Y. Lu, and H. Veith. Counterexample-guided
abstraction refinement for symbolic model checking. J. ACM, 50(5):752–794, 2003.
22. E. M. Clarke, D. Kröning, and F. Lerda. A tool for checking ANSI-C programs. In
Proc. TACAS, LNCS 2988, pages 168–176. Springer, 2004.
23. W. Craig. Linear reasoning. A new form of the Herbrand-Gentzen theorem. J. Symb.
Log., 22(3):250–268, 1957.
24. C. Csallner and Y. Smaragdakis. Check ’n’ crash: Combining static checking and
testing. In Proc. ICSE, pages 422–431. ACM, 2005.
25. M. Dangl, S. Löwe, and P. Wendler. CPAchecker with support for recursive
programs and floating-point arithmetic. In Proc. TACAS, LNCS. Springer, 2015.
26. N. Eén and N. Sörensson. An extensible SAT solver. In Proc. SAT, LNCS 2919,
pages 502–518. Springer, 2003.
27. M. Y. R. Gadelha, H. I. Ismail, and L. C. Cordeiro. Handling loops in bounded
model checking of C programs via k-induction. STTT, 19(1):97–114, 2017.
28. S. J. Galler and B. K. Aichernig. Survey on test data generation tools. STTT,
16(6):727–751, 2014.
29. P. Godefroid, N. Klarlund, and K. Sen. Dart: Directed automated random testing.
In Proc. PLDI, pages 213–223. ACM, 2005.
30. P. Godefroid, M. Y. Levin, and D. A. Molnar. Automated whitebox fuzz testing.
In Proc. NDSS. The Internet Society, 2008.
31. S. Graf and H. Saïdi. Construction of abstract state graphs with Pvs. In Proc.
CAV, LNCS 1254, pages 72–83. Springer, 1997.
32. M. Heizmann, D. Dietsch, J. Leike, B. Musa, and A. Podelski. Ultimate
Automizer with array interpolation. In Proc. TACAS, LNCS 9035, pages 455–457.
Springer, 2015.
33. M. Heizmann, J. Hoenicke, and A. Podelski. Software model checking for people
who love automata. In Proc. CAV, LNCS 8044, pages 36–52. Springer, 2013.
34. T. A. Henzinger, R. Jhala, R. Majumdar, and G. Sutre. Lazy abstraction. In Proc.
POPL, pages 58–70. ACM, 2002.
35. A. Holzer, C. Schallhart, M. Tautschnig, and H. Veith. How did you specify your
test suite? In Proc. ASE, pages 407–416. ACM, 2010.
36. K. Jayaraman, D. Harvison, V. Ganesh, and A. Kiezun. jFuzz: A concolic whitebox
fuzzer for Java. In Proc. NFM, pages 121–125, 2009.
37. R. Jhala and R. Majumdar. Software model checking. ACM Computing Surveys,
41(4), 2009.
38. A. V. Khoroshilov, V. Mutilin, A. K. Petrenko, and V. Zakharov. Establishing
Linux driver verification process. In Proc. Ershov Memorial Conference, LNCS 5947,
pages 165–176. Springer, 2009.
39. Y. Köroglu and A. Sen. Design of a modified concolic testing algorithm with smaller
constraints. In Proc. ISSTA, pages 3–14. ACM, 2016.
40. D. Kröning and M. Tautschnig. Cbmc: C bounded model checker (competition
contribution). In Proc. TACAS, LNCS 8413, pages 389–391. Springer, 2014.
41. K. Li, C. Reichenbach, C. Csallner, and Y. Smaragdakis. Residual investigation:
Predictive and precise bug detection. In Proc. ISSTA, pages 298–308. ACM, 2012.
42. K. L. McMillan. Interpolation and SAT-based model checking. In Proc. CAV,
LNCS 2725, pages 1–13. Springer, 2003.
43. J. Morse, M. Ramalho, L. Cordeiro, D. Nicole, and B. Fischer. Esbmc 1.22
(competition contribution). In Proc. TACAS, LNCS 8413. Springer, 2014.
44. Z. Pavlinovic, A. Lal, and R. Sharma. Inferring annotations for device drivers from
verification histories. In Proc. ASE, pages 450–460. ACM, 2016.
45. K. Sen, D. Marinov, and G. Agha. Cute: A concolic unit testing engine for C. In
Proc. ESEC/FSE, pages 263–272. ACM, 2005.
46. H. Seo and S. Kim. How we get there: A context-guided search strategy in concolic
testing. In Proc. FSE, pages 413–424. ACM, 2014.
47. I. S. Zakharov, M. U. Mandrykin, V. S. Mutilin, E. Novikov, A. K. Petrenko, and
A. V. Khoroshilov. Configurable toolset for static verification of operating systems
kernel modules. Programming and Computer Software, 41(1):49–64, 2015.