Computer Aided
Verification
35th International Conference, CAV 2023
Paris, France, July 17–22, 2023
Proceedings, Part II
Lecture Notes in Computer Science 13965
Founding Editors
Gerhard Goos
Juris Hartmanis
Editors
Constantin Enea
LIX, Ecole Polytechnique, CNRS and Institut Polytechnique de Paris
Palaiseau, France

Akash Lal
Microsoft Research
Bangalore, India
© The Editor(s) (if applicable) and The Author(s) 2023, corrected publication 2023. This book is an open
access publication.
Open Access This book is licensed under the terms of the Creative Commons Attribution 4.0 International
License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution
and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and
the source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this book are included in the book’s Creative Commons license,
unless indicated otherwise in a credit line to the material. If material is not included in the book’s Creative
Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use,
you will need to obtain permission directly from the copyright holder.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
The publisher, the authors, and the editors are safe to assume that the advice and information in this book
are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the
editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors
or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in
published maps and institutional affiliations.
This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Preface
It was our privilege to serve as the program chairs for CAV 2023, the 35th International
Conference on Computer-Aided Verification. CAV 2023 was held during July 19–22,
2023 and the pre-conference workshops were held during July 17–18, 2023. CAV 2023
was an in-person event, in Paris, France.
CAV is an annual conference dedicated to the advancement of the theory and practice
of computer-aided formal analysis methods for hardware and software systems. The
primary focus of CAV is to extend the frontiers of verification techniques by expanding
to new domains such as security, quantum computing, and machine learning. This puts
CAV at the cutting edge of formal methods research, and this year’s program is a reflection
of this commitment.
CAV 2023 received a large number of submissions (261). We accepted 15 tool
papers, 3 case-study papers, and 49 regular papers, which amounts to an acceptance
rate of roughly 26%. The accepted papers cover a wide spectrum of topics, from theo-
retical results to applications of formal methods. These papers apply or extend formal
methods to a wide range of domains such as concurrency, machine learning and neu-
ral networks, quantum systems, as well as hybrid and stochastic systems. The program
featured keynote talks by Ruzica Piskac (Yale University), Sumit Gulwani (Microsoft),
and Caroline Trippel (Stanford University). In addition to the contributed talks, CAV
also hosted the CAV Award ceremony, and a report from the Synthesis Competition
(SYNTCOMP) chairs.
In addition to the main conference, CAV 2023 hosted the following workshops: Meet-
ing on String Constraints and Applications (MOSCA), Verification Witnesses and Their
Validation (VeWit), Verification of Probabilistic Programs (VeriProP), Open Problems
in Learning and Verification of Neural Networks (WOLVERINE), Deep Learning-aided
Verification (DAV), Hyperproperties: Advances in Theory and Practice (HYPER), Syn-
thesis (SYNT), Formal Methods for ML-Enabled Autonomous Systems (FoMLAS), and
Verification Mentoring Workshop (VMW). CAV 2023 also hosted a workshop dedicated
to Thomas A. Henzinger for his 60th birthday.
Organizing a flagship conference like CAV requires a great deal of effort from the
community. The Program Committee for CAV 2023 consisted of 76 members—a com-
mittee of this size ensures that each member has to review only a reasonable number of
papers in the allotted time. In all, the committee members wrote over 730 reviews while
investing significant effort to maintain and ensure the high quality of the conference pro-
gram. We are grateful to the CAV 2023 Program Committee for their outstanding efforts
in evaluating the submissions and making sure that each paper got a fair chance. As
in recent years, we made artifact evaluation mandatory for tool paper submissions,
but optional for the rest of the accepted papers. This year we received 48 artifact submis-
sions, out of which 47 submissions received at least one badge. The Artifact Evaluation
Committee consisted of 119 members who put in significant effort to evaluate each arti-
fact. The goal of this process was to provide constructive feedback to tool developers and
help make the research published in CAV more reproducible. We are also very grateful
to the Artifact Evaluation Committee for their hard work and dedication in evaluating
the submitted artifacts.
CAV 2023 would not have been possible without the tremendous help we received
from several individuals, and we would like to thank everyone who helped make CAV
2023 a success. We would like to thank Alessandro Cimatti, Isil Dillig, Javier Esparza,
Azadeh Farzan, Joost-Pieter Katoen and Corina Pasareanu for serving as area chairs.
We also thank Bernhard Kragl and Daniel Dietsch for chairing the Artifact Evaluation
Committee. We also thank Mohamed Faouzi Atig for chairing the workshop organization
as well as leading publicity efforts, Eric Koskinen as the fellowship chair, Sebastian
Bardin and Ruzica Piskac as sponsorship chairs, and Srinidhi Nagendra as the website
chair. Srinidhi, along with Enrique Román Calvo, helped prepare the proceedings. We
also thank Ankush Desai, Eric Koskinen, Burcu Kulahcioglu Ozkan, Marijana Lazic, and
Matteo Sammartino for chairing the mentoring workshop. Last but not least, we would
like to thank the members of the CAV Steering Committee (Kenneth McMillan, Aarti
Gupta, Orna Grumberg, and Daniel Kroening) for helping us with several important
aspects of organizing CAV 2023.
We hope that you will find the proceedings of CAV 2023 scientifically interesting
and thought-provoking!
Decision Procedures

Bitwuzla
Aina Niemetz and Mathias Preiner
1 Introduction
Satisfiability Modulo Theories (SMT) solvers serve as back-end reasoning engines
for a wide range of applications in formal methods (e.g., [13,14,21,23,35]). In
particular, the theory of fixed-size bit-vectors, in combination with arrays, uninterpreted
functions, and floating-point arithmetic, has received increasing interest in recent
years, as witnessed by the growing number of benchmarks submitted to the
SMT-LIB benchmark library [5] and the number of
participants in corresponding divisions in the annual SMT competition (SMT-
COMP) [42]. State-of-the-art SMT solvers supporting (a subset of) these the-
ories include Boolector [31], cvc5 [3], MathSAT [15], STP [19], Yices2 [17] and
Z3 [25]. Among these, Boolector had been largely dominating the quantifier-free
divisions with bit-vectors and arrays in SMT-COMP over the years [2].
Boolector was originally published in 2009 by Brummayer and Biere [11] as
an SMT solver for the quantifier-free theories of fixed-size bit-vectors and arrays.
Since 2012, Boolector has been mainly developed and maintained by the authors
of this paper, who have extended it with support for uninterpreted functions and
lazy handling of non-recursive lambda terms [32,38,39], local search strategies
for quantifier-free bit-vectors [33,34], and quantified bit-vector formulas [40].
While Boolector is still competitive in terms of performance, it has several
limitations. Its code base consists of largely monolithic C code, with a rigid
architecture focused on a very specialized, tight integration of bit-vectors and
arrays. Consequently, it is cumbersome to maintain, and adding new features
is difficult and time intensive. Further, Boolector requires manual management
of memory and reference counts from API users; terms and sorts are tied to a
specific solver instance and cannot be shared across instances; all preprocessing
This work was supported in part by the Stanford Center for Automated Reasoning,
the Stanford Agile Hardware Center, the Stanford Center for Blockchain Research and
a gift from Amazon Web Services.
© The Author(s) 2023
C. Enea and A. Lal (Eds.): CAV 2023, LNCS 13965, pp. 3–17, 2023.
https://doi.org/10.1007/978-3-031-37703-7_1
2 Architecture
used in multiple Bitwuzla instances. The parser interacts with the solver instance
via the C++ API. A textual command line interface (CLI) builds on top of the
parser, supporting SMT-LIBv2 [4] and BTOR2 [35] as input languages.
while reusing work from earlier checks. On the API level, Bitwuzla also sup-
ports satisfiability queries under a given set of assumptions (SMT-LIB command
check-sat-assuming), which are internally handled via push and pop.
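A minimal sketch of this internal handling (a hypothetical solver API for illustration, not Bitwuzla's actual C++ interface):

def check_sat_assuming(solver, assumptions):
    # Assert the assumptions on a fresh scope, solve, then discard the
    # scope so that the solver state is unchanged afterwards.
    solver.push()
    for a in assumptions:
        solver.assert_formula(a)
    result = solver.check_sat()
    solver.pop()
    return result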
Nodes and types constructed via the Node Manager may be shared between
multiple Solving Contexts. If the set of assertions is satisfiable, the Solving Con-
text provides a model for the input formula. It further allows querying model
values for any term, based on this model (SMT-LIB command get-value). In
case of unsatisfiable queries, the Solving Context can be configured to extract
an unsatisfiable core and unsat assumptions.
A Solving Context consists of three main components: a Rewriter, a Prepro-
cessor and a Solver Engine. The Rewriter and Preprocessor perform local (node
level) and global (over all assertions) simplifications, whereas the Solver Engine
is the central solving engine, managing theory solvers and their interaction.
the assertions on levels i < j when level j is popped to the state before level j
was pushed, and is left to future work.
Boolector, on the other hand, only performs preprocessing based on top-
level assertions (assertion level 0) and does not incorporate any information
from assumptions or higher assertion levels.
Rewriter. The rewriter transforms terms via a predefined set of rewrite rules
into semantically equivalent normal forms. This transformation is local in the
sense that it is independent of the current set of assertions. We distinguish
between required and optional rewrite rules, and further group rules into so-
called rewrite levels from 0–2. The set of required rules consists of operator
elimination rewrites, which are considered level 0 rewrites and ensure that nodes
only contain operators from a reduced base set. For example, the two’s com-
plement −x of a bit-vector term x is rewritten to (∼ x + 1) by means of one’s
complement and bit-vector addition. Optional rewrite rules are grouped into
level 1 and level 2. Level 1 rules perform rewrites that only consider the imme-
diate children of a node, whereas level 2 rules may consider multiple levels of
children. If not implemented carefully, level 2 rewrites can potentially destroy
sharing of subterms and consequently increase the overall size of the formula.
For example, rewriting (t + 0) to t is considered a level 1 rewrite rule, whereas
rewriting (a − b = c) to (b + c = a) is considered a level 2 rule since it may
introduce an additional bit-vector addition (b + c) if (a − b) occurs somewhere
else in the formula. The maximum rewrite level of the rewriter can be configured
by the user.
Rewriting is applied to the current set of assertions as a preprocessing pass
and, like all other passes, is applied until a fixed point is reached. That is, on
any given term, the rewriter applies rewrite rules until no further rules can be applied.
For this, the rewriter must guarantee that no set of applied rewrite rules may
lead to cyclic rewriting of terms. Additionally, all components of the solving
context apply rewriting on freshly created nodes to ensure that all nodes are
always fully normalized. In order to avoid processing nodes more than once, the
rewriter maintains a cache that maps nodes to their fully rewritten form.
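The following is a minimal sketch of such cached fixed-point rewriting, assuming a toy tuple-based term representation and a single illustrative level 1 rule; the actual rule set and data structures in Bitwuzla are far richer:

def apply_rules_once(node):
    # One illustrative level 1 rule: (bvadd t 0) -> t.
    if isinstance(node, tuple) and node[0] == "bvadd" and node[2] == ("bv", 0):
        return node[1]
    return node

def rewrite(node, cache):
    # Rewrite bottom-up until no rule applies; memoize results so that
    # shared subterms are normalized only once.
    if node in cache:
        return cache[node]
    result = node
    while True:
        if isinstance(result, tuple) and result[0] not in ("bv", "var"):
            result = (result[0],) + tuple(rewrite(c, cache) for c in result[1:])
        stepped = apply_rules_once(result)
        if stepped == result:
            break
        result = stepped
    cache[node] = result
    return result

# (bvadd (bvadd x 0) 0) normalizes to x
assert rewrite(("bvadd", ("bvadd", ("var", "x"), ("bv", 0)), ("bv", 0)), {}) == ("var", "x")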
Solver Engine. After preprocessing, the solving context sends the current set
of assertions to the Solver Engine, which implements a lazy SMT paradigm
called lemmas on demand [6,24]. However, rather than using a propositional
abstraction of the input formula as in [6,24], it implements a bit-vector abstrac-
tion similar to Boolector [12,38]. At its core, the Solver Engine maintains a
bit-vector theory solver and a solver for each supported theory. Quantifier rea-
soning is handled by a dedicated quantifiers module, implemented as a theory
solver. The Solver Engine manages all theory solvers, the distribution of relevant
terms, and the processing of lemmas generated by the theory solvers.
The bit-vector solver is responsible for reasoning about the bit-vector abstrac-
tion of the input assertions and lemmas generated during solving, which includes
all propositional and bit-vector terms. Theory atoms that do not belong to
the bit-vector theory are abstracted as Boolean constants, and bit-vector terms
whose operator does not belong to the bit-vector theory are abstracted as bit-
vector constants. For example, an array select operation of type bit-vector is
abstracted as a bit-vector constant, while an equality between two arrays is
abstracted as a Boolean constant.
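The following sketch illustrates this abstraction on a hypothetical (kind, sort, children) node representation; the actual Bitwuzla data structures differ:

BV_KINDS = {"const", "bvadd", "bvmul", "bvnot", "eq", "and", "or", "not"}

def abstract(node, fresh_const):
    # Terms whose operator lies outside the bit-vector theory, or whose
    # children are not bit-vector/Boolean terms (e.g. select(a, i), or an
    # equality a = b over arrays), become fresh constants of the node's sort.
    kind, sort, children = node
    if kind not in BV_KINDS or any(c[1] not in ("bv", "bool") for c in children):
        return fresh_const(sort)
    return (kind, sort, [abstract(c, fresh_const) for c in children])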
If the bit-vector abstraction is satisfiable, the bit-vector solver produces a sat-
isfying assignment, and the floating-point, array, function and quantifier solvers
check this assignment for theory consistency. If a solver finds a theory inconsis-
tency, i.e., a conflict between the current satisfying assignment and the solver’s
theory axioms, it produces a lemma to refine the bit-vector abstraction and rule
out the detected inconsistency. Theory solvers are allowed to send any number
of lemmas, with the only requirement that if a theory solver does not send a
lemma, the current satisfying assignment is consistent with the theory.
Finding a satisfying assignment for the bit-vector abstraction and the subse-
quent theory consistency checks are implemented as an abstraction/refinement
loop as given in Algorithm 1. Whenever a theory solver sends lemmas, the loop
is restarted to get a new satisfying assignment for the refined bit-vector abstrac-
tion. The loop terminates if the bit-vector abstraction is unsatisfiable, or if the
bit-vector abstraction is satisfiable and none of the theory solvers report any the-
ory inconsistencies. Note that the abstraction/refinement algorithm may return
unknown if the input assertions include quantified formulas.
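A compact sketch of this loop in the spirit of Algorithm 1 (hypothetical solver interfaces; lemma preprocessing and termination details omitted):

def solve(bv_solver, theory_solvers):
    # Lemmas on demand: solve the bit-vector abstraction, then let each
    # theory solver check the candidate model for theory consistency.
    while True:
        if bv_solver.solve() == "unsat":            # T_BV::solve()
            return "unsat"
        model = bv_solver.model()
        lemmas = []
        for ts in theory_solvers:                   # T_*::check()
            lemmas.extend(ts.check(model))
        if not lemmas:
            return "sat"    # candidate model is consistent with all theories
        for lemma in lemmas:                        # refine the abstraction
            bv_solver.assert_formula(lemma)
        # the loop restarts to obtain a new candidate assignment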
3 Theory Solvers
The Solver Engine maintains a theory solver for each supported theory and
implements a module for handling quantified formulas as a dedicated theory
solver. The central engine of the Solver Engine is the bit-vector theory solver,
which reasons about a bit-vector abstraction of the current set of input asser-
tions, refined with lemmas generated by other theory solvers. The theories of
fixed-size bit-vectors, arrays, floating-point arithmetic, and uninterpreted func-
tions are combined via a model-based theory combination approach similar
to [12,38].
Theory combination is based on candidate models produced by the bit-vector
theory solver for the bit-vector abstraction (function TBV ::solve() in Algorithm
1). For each candidate model, each theory solver checks consistency with the
axioms of the corresponding theory (functions T∗ ::check() in Algorithm 1). If a
theory solver requests a model value for a term that is not part of the current
bit-vector abstraction, the theory solver that “owns” that term is queried for a
value. If this value or the candidate model is inconsistent with the axioms of the
theory querying the value, it sends a lemma to refine the bit-vector abstraction.
3.1 Arrays
The array theory solver implements and extends the array procedure from [12]
with support for reasoning over (equalities of) nested arrays and non-extensional
constant arrays. This is in contrast to Boolector, which generalizes the lemmas
on demand procedure for extensional arrays as described in [12] to non-recursive
first-order lambda terms [37,38], without support for nested arrays. Generalizing
arrays to lambda terms allows using the same procedure for arrays and uninter-
preted functions and enables a natural, compact representation and extraction
of extended array operations such as memset, memcpy and array initialization
patterns as described in [39]. As an example, memset(a, i, n, e), which updates
n elements of array a within range [i, i + n[ to a value e starting from index i,
can be represented as λj . ite(i ≤ j < i + n, e, a[j]). Reasoning over equalities
involving arbitrary lambda terms (including these operations), however, requires
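As an illustration of the memset encoding above, the lambda term λj . ite(i ≤ j < i + n, e, a[j]) behaves like the following closure (purely illustrative; Bitwuzla of course represents this symbolically):

def memset_lambda(a, i, n, e):
    # Returns the function  λj. ite(i <= j < i+n, e, a[j]).
    return lambda j: e if i <= j < i + n else a(j)

base = lambda j: 0                         # constant array of zeros
updated = memset_lambda(base, i=2, n=3, e=7)
print([updated(j) for j in range(6)])      # [0, 0, 7, 7, 7, 0]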
3.2 Bit-Vectors
The bit-vector theory solver implements two orthogonal approaches: the classic
bit-blasting technique employed by most state-of-the-art bit-vector solvers, which
eagerly translates the current bit-vector abstraction to SAT; and the ternary
propagation-based local search approach presented in [27]. Since local search pro-
cedures can only determine satisfiability, they are particularly effective as
a complementary strategy, in combination with (rather than instead of) bit-
blasting [27,33]. Bitwuzla’s bit-vector solver allows to combine local search with
bit-blasting in a sequential portfolio setting: the local search procedure is run
until a predefined resource limit is reached before falling back on the bit-blasting
procedure. Currently, Bitwuzla allows combining these two approaches only in
this particular setting. We plan to explore more interleaved configurations,
possibly sharing information between the procedures, as future work.
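A sketch of this sequential portfolio (hypothetical interfaces; the resource limit would correspond to, e.g., a propagation bound):

def solve_bv(abstraction, limit):
    # Local search can only conclude satisfiability, so on a timeout or
    # an inconclusive result we fall back to bit-blasting plus SAT solving.
    result = local_search(abstraction, step_limit=limit)
    if result == "sat":
        return result
    return bit_blast_and_solve(abstraction)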
3.5 Quantifiers
Quantified formulas are handled by the quantifiers module, which is treated as
a theory solver and implements model-based quantifier instantiation [20] for all
supported theories and their combinations. In the bit-vector abstraction, quan-
tified formulas are abstracted as Boolean constants. Based on the assignment of
these constants, the quantifiers solver produces instantiation or Skolemization
lemmas. If the constant is assigned true, the quantifier is treated as a universal
quantifier and the solver produces instantiation lemmas. If the constant is
assigned false, the solver generates a Skolemization lemma. Bitwuzla allows
combining quantifiers with all supported theories as well as with incremental solving
and unsat core extraction. This is in contrast to Boolector, which only supports
sequential reasoning about quantified bit-vector formulas and, generally, does
not provide unsat cores for unsatisfiable instances.
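A sketch of the lemma generation step (hypothetical interfaces for the model-based instantiation described above):

def quantifier_lemmas(model, abstractions):
    # Each quantified formula q is abstracted by a Boolean constant b in
    # the bit-vector abstraction.
    lemmas = []
    for q, b in abstractions:
        if model.value(b):                         # q holds: universal,
            lemmas.append(instantiate(q, model))   # add instantiation lemma
        else:                                      # q negated: existential,
            lemmas.append(skolemize(q))            # add Skolemization lemma
    return lemmas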
4 Evaluation
We evaluate the overall performance of Bitwuzla on all non-incremental and
incremental benchmarks of all supported logics in SMT-LIB [5]. We further
include logics with floating-point arithmetic that are classified as containing
linear real arithmetic (LRA). Bitwuzla does not support LRA reasoning, but
5 Conclusion
Our experimental evaluation shows that Bitwuzla is a state-of-the-art SMT
solver for the quantified and quantifier-free theories of fixed-size bit-vectors,
arrays, floating-point arithmetic, and uninterpreted functions. Bitwuzla has been
extensively tested for robustness and correctness with Murxla [30], an API fuzzer
for SMT solvers, which is an integral part of its development workflow. We have
outlined several avenues for future work throughout the paper. We further plan
to add support for the upcoming SMT-LIB version 3 standard, when finalized.
References
1. Boolector. (2023). https://github.com/boolector/boolector
2. The International Satisfiability Modulo Theories Competition (SMT-COMP)
(2023). https://smt-comp.github.io
3. Barbosa, H., et al.: cvc5: a versatile and industrial-strength SMT solver. In: TACAS 2022. LNCS, vol. 13243, pp. 415–442. Springer, Cham (2022). https://doi.org/10.1007/978-3-030-99524-9_24
4. Barrett, C., Fontaine, P., Tinelli, C.: The SMT-LIB Standard: Version 2.6. Tech. rep., Department of Computer Science, The University of Iowa (2017). http://smt-lib.org
5. Barrett, C., Fontaine, P., Tinelli, C.: The Satisfiability Modulo Theories Library
(SMT-LIB) (2023). http://smt-lib.org
6. Barrett, C.W., Dill, D.L., Stump, A.: Checking satisfiability of first-order formulas by incremental translation to SAT. In: Brinksma, E., Larsen, K.G. (eds.) CAV 2002. LNCS, vol. 2404, pp. 236–249. Springer, Heidelberg (2002). https://doi.org/10.1007/3-540-45657-0_18
7. Biere, A., Fazekas, K., Fleury, M., Heisinger, M.: CaDiCaL, Kissat, Paracooba,
Plingeling and Treengeling entering the SAT Competition 2020. In: Balyo, T.,
Froleyks, N., Heule, M., Iser, M., Järvisalo, M., Suda, M. (eds.) Proc. of SAT
Competition 2020 - Solver and Benchmark Descriptions. Department of Computer
Science Report Series B, vol. B-2020-1, pp. 51–53. University of Helsinki (2020)
8. Brain, M., Schanda, F., Sun, Y.: Building better bit-blasting for floating-point problems. In: Vojnar, T., Zhang, L. (eds.) TACAS 2019. LNCS, vol. 11427, pp. 79–98. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-17462-0_5
10. Brummayer, R., Biere, A.: Local two-level and-inverter graph minimization with-
out blowup. In: Proceedings of the 2nd Doctoral Workshop on Mathematical and
Engineering Methods in Computer Science (MEMICS’06), Mikulov, Czechia, Octo-
ber 2006 (2006)
11. Brummayer, R., Biere, A.: Boolector: an efficient SMT solver for bit-vectors and arrays. In: Kowalewski, S., Philippou, A. (eds.) TACAS 2009. LNCS, vol. 5505, pp. 174–177. Springer, Heidelberg (2009). https://doi.org/10.1007/978-3-642-00768-2_16
12. Brummayer, R., Biere, A.: Lemmas on demand for the extensional theory of arrays. J. Satisf. Boolean Model. Comput. 6(1–3), 165–201 (2009). https://doi.org/10.3233/sat190067
13. Cadar, C., Dunbar, D., Engler, D.R.: KLEE: unassisted and automatic generation of high-coverage tests for complex systems programs. In: Draves, R., van Renesse, R. (eds.) 8th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2008, December 8–10, 2008, San Diego, California, USA, Proceedings, pp. 209–224. USENIX Association (2008). http://www.usenix.org/events/osdi08/tech/full_papers/cadar/cadar.pdf
14. Champion, A., Mebsout, A., Sticksel, C., Tinelli, C.: The Kind 2 model checker. In: Chaudhuri, S., Farzan, A. (eds.) CAV 2016. LNCS, vol. 9780, pp. 510–517. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-41540-6_29
15. Cimatti, A., Griggio, A., Schaafsma, B.J., Sebastiani, R.: The MathSAT5 SMT solver. In: Piterman, N., Smolka, S.A. (eds.) TACAS 2013. LNCS, vol. 7795, pp. 93–107. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-36742-7_7
16. Dutertre, B., de Moura, L.: The Yices SMT Solver (2006). https://yices.csl.sri.com/papers/tool-paper.pdf
17. Dutertre, B.: Yices 2.2. In: Biere, A., Bloem, R. (eds.) CAV 2014. LNCS, vol. 8559, pp. 737–744. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-08867-9_49
18. Fröhlich, A., Biere, A., Wintersteiger, C.M., Hamadi, Y.: Stochastic local search
for satisfiability modulo theories. In: Bonet, B., Koenig, S. (eds.) Proceedings of
the Twenty-Ninth AAAI Conference on Artificial Intelligence, 25–30 January 2015,
Austin, Texas, USA, pp. 1136–1143. AAAI Press (2015). http://www.aaai.org/ocs/
index.php/AAAI/AAAI15/paper/view/9896
19. Ganesh, V., Dill, D.L.: A decision procedure for bit-vectors and arrays. In: Damm, W., Hermanns, H. (eds.) CAV 2007. LNCS, vol. 4590, pp. 519–531. Springer, Heidelberg (2007). https://doi.org/10.1007/978-3-540-73368-3_52
20. Ge, Y., de Moura, L.: Complete instantiation for quantified formulas in satisfiability modulo theories. In: Bouajjani, A., Maler, O. (eds.) CAV 2009. LNCS, vol. 5643, pp. 306–320. Springer, Heidelberg (2009). https://doi.org/10.1007/978-3-642-02658-4_25
21. Godefroid, P., Levin, M.Y., Molnar, D.A.: SAGE: whitebox fuzzing for security testing. Commun. ACM 55(3), 40–44 (2012). https://doi.org/10.1145/2093548.2093564
22. Kunz, W., Stoffel, D.: Reasoning in Boolean Networks - Logic Synthesis and Verifi-
cation Using Testing Techniques. Frontiers in Electronic Testing. Springer (1997).
https://doi.org/10.1007/978-1-4757-2572-8
23. Mann, M., et al.: Pono: a flexible and extensible SMT-based model checker. In: Silva, A., Leino, K.R.M. (eds.) CAV 2021. LNCS, vol. 12760, pp. 461–474. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-81688-9_22
24. de Moura, L., Rueß, H.: Lemmas on demand for satisfiability solvers. In: The 5th International Symposium on the Theory and Applications of Satisfiability Testing, SAT 2002, Cincinnati, 15 May 2002 (2002)
25. de Moura, L., Bjørner, N.: Z3: an efficient SMT solver. In: Ramakrishnan, C.R., Rehof, J. (eds.) TACAS 2008. LNCS, vol. 4963, pp. 337–340. Springer, Heidelberg (2008). https://doi.org/10.1007/978-3-540-78800-3_24
26. Niemetz, A., Preiner, M.: Bitwuzla at the SMT-COMP 2020. arXiv preprint (2020).
https://arxiv.org/abs/2006.01621
27. Niemetz, A., Preiner, M.: Ternary propagation-based local search for more bit-precise reasoning. In: 2020 Formal Methods in Computer Aided Design, FMCAD 2020, Haifa, Israel, 21–24 September 2020, pp. 214–224. IEEE (2020). https://doi.org/10.34727/2020/isbn.978-3-85448-042-6_29
Open Access This chapter is licensed under the terms of the Creative Commons
Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/),
which permits use, sharing, adaptation, distribution and reproduction in any medium
or format, as long as you give appropriate credit to the original author(s) and the
source, provide a link to the Creative Commons license and indicate if changes were
made.
The images or other third party material in this chapter are included in the
chapter’s Creative Commons license, unless indicated otherwise in a credit line to the
material. If material is not included in the chapter’s Creative Commons license and
your intended use is not permitted by statutory regulation or exceeds the permitted
use, you will need to obtain permission directly from the copyright holder.
Decision Procedures for Sequence Theories
1 Introduction
Sequences are an extension of strings, wherein elements might range over an infi-
nite domain (e.g., integers, strings, and even sequences themselves). Sequences
A. Jeż was supported under National Science Centre, Poland project number
2017/26/E/ST6/00191. A. Lin and O. Markgraf were supported by the ERC Consol-
idator Grant 101089343 (LASD). P. Rümmer was supported by the Swedish Research
Council (VR) under grant 2018-04727, the Swedish Foundation for Strategic Research
(SSF) under the project WebSec (Ref. RIT17-0011), and the Wallenberg project
UPDATE.
© The Author(s) 2023
C. Enea and A. Lal (Eds.): CAV 2023, LNCS 13965, pp. 18–40, 2023.
https://doi.org/10.1007/978-3-031-37703-7_2
are ubiquitous and commonly used data types in modern programming lan-
guages. They come under different names, e.g., Python/Haskell/Prolog lists,
Java ArrayList (and to some extent Streams) and JavaScript arrays. Crucially,
sequences are extendable, and a plethora of operations (including append, map,
split, filter, concatenation, etc.) can naturally be defined and are supported by
built-in library functions in most modern programming languages.
Various techniques in software model checking [30] — including symbolic
execution, invariant generation — require an appropriate SMT theory, to which
verification conditions could be discharged. In the case of programs operating on
sequences, we would consequently require an SMT theory of sequences, for which
leading SMT solvers like Z3 [6,38] and cvc5 [4] have already provided some basic
support for over a decade. The basic design of sequence theories, as done in Z3 and
cvc5, as well as in other formalisms like symbolic automata [15], is in fact quite
natural. That is, sequence theories can be thought of as extensions of theories of
strings with an infinite alphabet of letters, together with a corresponding alpha-
bet theory, e.g. Linear Integer Arithmetic (LIA) for reasoning about sequences of
integers. Despite this, very little is known about what is decidable over theories
of sequences.
In the case of finite alphabets, sequence theories become theories over strings,
in which a lot of progress has been made in the last few decades, barring the
long-standing open problem of string equations with length constraints (e.g. see
[26]). For example, it is known that the existential theory of concatenation over
strings with regular constraints is decidable (in fact, PSpace-complete), e.g.,
see [17,29,36,40,43]. Here, a regular constraint takes the form x ∈ L(E), where
E is a regular expression, mandating that the expression E matches the string
represented by x. In addition, several natural syntactic restrictions — including
straight-line, acyclicity, and chain-free (e.g. [1,2,5,11,12,26,35]) — have been
identified, with which string constraints remain decidable in the presence of more
complex string functions (e.g. transducers, replace-all, reverse, etc.). In the case
of infinite alphabets, only a handful of results are available. Furia [25] showed
that the existential theory of sequence equations over the alphabet theory of
LIA is decidable by a reduction to the existential theory of concatenation over
strings (over a finite alphabet) without regular constraints. Loosely speaking, a
number (e.g. 4) can be represented as a string in unary (e.g. 1111), and addition
is then simulated by concatenation. Therefore, his decidability result does not
extend to other data domains and alphabet theories. Wang et al. [45] define an
extension of the array property fragment [9] with concatenation. This fragment
imposes strong restrictions, however, on the equations between sequences (here
called finite arrays) that can be considered.
Contributions. The main contribution of this paper is to provide the first decid-
able fragments of a theory of sequences parameterized in the element theory.
In particular, we show how to leverage string solvers to solve theories over
sequences. We believe this is especially interesting, in view of the plethora of
existing string solvers developed in the last 10 years (e.g. see the survey [3]).
This opens up new possibilities for verification tasks to be automated; in partic-
ular, we show how verification conditions for Quicksort, as well as Bakery and
Dijkstra protocols, can be captured in our sequence theory. This formalization
was done in the style of regular model checking [8,34], whose extension to infinite
alphabets has been a longstanding challenge in the field. We also provide a new
(dedicated) sequence solver, SeCo. We detail our results below.
We first show that the quantifier-free theory of sequences with concatenation
and PA as regular constraints is decidable. Assuming that the theory is solvable
in PSpace (which is reasonable for most SMT theories), we show that our algo-
rithm runs in ExpSpace (i.e., double-exponential time and exponential space).
We also identify conditions on the SMT theory T under which PSpace can be
achieved and as an example show that Linear Real Arithmetic (LRA) satisfies
those conditions. This matches the PSpace-completeness of the theory of strings
with concatenation and regular constraints [18].
We consider three different variants/extensions:
¹ This can be generalized to any arity, which has to be set uniformly for the automaton.
(i) Add length constraints. Length constraints (e.g., |x| = |y| for two sequence
variables x, y) are often considered in the context of string theories, but
the decidability of the resulting theory (i.e., strings with concatenation and
length constraints) is still a long-standing open problem [26]. We show that
the case for sequences is Turing-equivalent to the string case.
(ii) Use SRA instead of PA. We show that the resulting theory of sequences is
undecidable, even over the alphabet theory T of equality.
(iii) Add symbolic transducers. Symbolic transducers [15,16] extend finite-state
input/output transducers in the same way that symbolic automata extend
finite-state automata. To obtain decidability, we consider formulas satisfying
the straight-line restriction that was defined over strings theories [35]. We
show that the resulting theory is decidable in 2-ExpTime and is ExpSpace-
hard, if T is solvable in PSpace.
We have implemented the solver SeCo based on our algorithms, and demon-
strated its efficacy on two classes of benchmarks: (i) invariant checking on
array-manipulating programs and parameterized systems, and (ii) benchmarks
on Symbolic Register Automata (SRA) from [14]. For the first benchmarks,
we model invariants for QuickSort, Dijkstra’s Self-Stabilizing Protocol [20], and
Lamport’s Bakery Algorithm [33] as sequence constraints. For (ii), we solve
decision problems for SRA on benchmarks of [14] such as emptiness, equivalence
and inclusion on regular expressions with back-references. We report promising
experimental results: our solver SeCo is up to three orders of magnitude faster
than the SRA solver in [14].
2 Motivating Example
We illustrate the use of sequence theories in verification using an implementation
of QuickSort [28], shown in Listing 1. The example uses the Java Streams API
and resembles typical implementations of QuickSort in functional languages; the
program uses high-level operations on streams and lists like filter and concatena-
tion. As we show, the data types and operations can naturally be modelled using
a theory of sequences over integer arithmetic, and our results imply decidability
of checks that would be done by a verification system.
The function quickSort processes a given list l by picking the first element
as the pivot p, then creating two sub-lists left, right in which all numbers
≥ p (resp., < p) have been eliminated. The function quickSort is then recursively
invoked on the two sub-lists, and the results are finally concatenated and
returned.

Listing 1. An implementation of QuickSort using the Java Streams API.

/*@
 * ensures \forall int i; \result.contains(i) == l.contains(i);
 */
public static List<Integer> quickSort(List<Integer> l) {
  if (l.size() < 1) return l;
  Integer p = l.get(0);
  List<Integer> left = l.stream().filter(i -> i < p)
      .collect(Collectors.toList());
  List<Integer> right = l.stream().skip(1).filter(i -> i >= p)
      .collect(Collectors.toList());
  List<Integer> result = quickSort(left);
  result.add(p);
  result.addAll(quickSort(right));
  return result;
}
We focus on the verification of the post-condition shown in the beginning of
Listing 1: sorting does not change the set of elements contained in the input list.
This is a weaker form of the permutation property of sorting algorithms, and as
such known to be challenging for verification methods (e.g., [42]). Sortedness of
the result list can be stated and verified in a similar way, but is not considered
here. Following the classical design-by-contract approach [37], to verify the par-
tial correctness of the function it is enough to show that the post-condition is
established in any top-level call of the function, assuming that the post-condition
holds for all recursive calls. For the case of non-empty lists, the verification con-
dition, expressed in our logic, is:
(left = T<l0(l) ∧ right = T≥l0(skip1(l)) ∧
 ∀i. (i ∈ left ↔ i ∈ left′) ∧ ∀i. (i ∈ right ↔ i ∈ right′) ∧
 res = left′ . [l0] . right′)
→ ∀i. (i ∈ l ↔ i ∈ res)
The variables l, res, left, right, left′, right′ range over sequences of integers,
while i is a bound integer variable. The formula uses several operators that a
useful sequence theory has to provide: (i) l0 : the first element of input list l;
(ii) ∈ and ∉: membership and non-membership of an integer in a list, which
can be expressed using symbolic parametric automata; (iii) skip 1 , T<l0 , T≥l0 :
sequence-to-sequence functions, which can be represented using symbolic para-
metric transducers; (iv) · . ·: concatenation of several sequences. The formula oth-
erwise is a direct model of the method in Listing 1; the variables left , right are
the results of the recursive calls, and concatenated to obtain the result sequence.
As one of the results of this paper, we prove that this final formula is in a
decidable logic. The formula can be rewritten to a disjunction of straight-line
formulas, and shown to be valid using the decision procedure presented in Sect. 5.
3 Models
In this section, we will define our sequence constraint language, and prove some
basic results regarding various constraints in the language. The definition is a
natural generalization of string constraints (e.g. see [12,17,26,29,35]) by employ-
ing an alphabet theory (a.k.a. element theory), as is done in symbolic automata
and automata modulo theories [15,16,44].
For simplicity, our definitions will follow a model-theoretic approach. Let σ
be a vocabulary. We fix a σ-structure S = (D; I), where D can be a finite or
an infinite set (i.e., the universe) and I maps each function/relation symbol in
σ to a function/relation over D. The elements of our sequences will range over
D. We assume that the quantifier-free theory TS over S (including equality)
is decidable. Examples of such TS abound in SMT, e.g., LRA and LIA.
We write T instead of TS , when S is clear. Our quantifier-free formula will use
uninterpreted T -constants a, b, c, . . ., and may also use variables x, y, z, . . .. (The
distinction between uninterpreted constants and variables is made only for the
purpose of presentation of sequence constraints, as will be clear shortly.) We use
C to denote the set of all uninterpreted T -constants. A formula ϕ is satisfiable if
there is an assignment that maps the uninterpreted constants and variables to
concrete values in D such that the formula becomes true in S.
Next, we define how we lift T to sequence constraints, using T as the alphabet
theory (a.k.a. element theory). As in the case of strings (over a finite alphabet),
we use standard notation like D∗ to refer to the set of all sequences over D. By
default, elements of D∗ are written as standard in mathematics, e.g., 7, 8, 100,
when D = Z. Sometimes we will disambiguate them by using brackets, e.g.,
(7, 8, 100) or [7, 8, 100]. We will use the symbol s (with/without subscript) to
refer to concrete sequences (i.e., a member of D∗ ). We will use x, y, z to refer
to T -sequence variables. Let V denote the set of all T -sequence variables, and
Γ := C ∪ D. We will define constraint languages syntactically at the beginning,
and will instantiate them to specific sequence operations. The theory T ∗ of T -
sequences consists of the following constraints:
ϕ ::= R(x1 , . . . , xr ) | ϕ ∧ ϕ
For instance, for the word equation L = R given by 0.1.x = x.0.1,
the set of all solutions is of the form x ↦ (01)∗. To make this more formal, we
extend each assignment μ to a homomorphism on Θ∗. We write μ |= L = R if
μ(L) = μ(R). Notice that this definition is a direct extension of that of word
equations (e.g. see [17]), i.e., when the domain D is finite.
In most cases the inequality constraints L ≠ R can be reduced to equality;
in our case this also requires element constraints, described below.
Regular Constraints. Over strings, regular constraints are simply unary con-
straints U (x), where U is an automaton. The interpretation is x is in the language
of U . We define an analogue of regular constraints over sequences using paramet-
ric automata [21,23,24], which generalize both symbolic automata [15,16] and
variable automata [27].
A parametric automaton (PA) over T is of the form A = (X , Q, Δ, q0 , F ),
where X is a finite set of parameters, Q is a finite set of control states, q0 ∈ Q is
the initial state, F ⊆ Q is the set of final states, and Δ⊆fin Q × T (curr, X ) × Q.
Here, parameters are simply uninterpreted T -constants, i.e., X ⊆ C. Formulas
such that qn ∈ F and T |= ϕi(di, μ(X)). Finally, a regular constraint A(x) is
satisfied by μ when μ(x) ∈ Lμ(A).
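To make the definition concrete, the following is a small executable sketch of a PA in a toy encoding (guards are predicates over the current element and the parameters); it accepts exactly the sequences containing an element equal to a parameter k:

# A PA accepting the sequences that contain an element equal to parameter k.
A1 = {
    "init": "q0",
    "final": {"q1"},
    "trans": [
        ("q0", lambda c, p: c != p["k"], "q0"),
        ("q0", lambda c, p: c == p["k"], "q1"),
        ("q1", lambda c, p: True,        "q1"),
    ],
}

def accepts(pa, word, params):
    # Run the PA under a fixed parameter assignment `params`.
    states = {pa["init"]}
    for c in word:
        states = {t for (s, g, t) in pa["trans"]
                  if s in states and g(c, params)}
    return bool(states & pa["final"])

print(accepts(A1, [3, 5, 8], {"k": 5}))    # True
print(accepts(A1, [3, 8], {"k": 5}))       # False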
Note that while it is possible to complement a PA A, one has to be careful with the
semantics: we treat A as a symbolic automaton, and symbolic automata are closed
under Boolean operations [15]. So we are looking for μ such that μ(x) ∉ Lμ(A).
What we cannot do using complementation is universal quantification over the
parameters; note that already the theory of strings with universal and existential
quantifiers is undecidable.
We state next a lemma showing that PAs using only “local” parameters,
together with equational constraints, can encode the constraint language that
we have defined so far.
The proof is standard (e.g. see [21,23,24]), and only sketched here. The algorithm
first nondeterministically guesses a simple path in the automaton A from an
initial state q0 to some final state qF . Let us say that the guards appearing
in this path are ψ1 (curr, X ), . . . , ψk (curr, X ). We need to check if this path is
realizable by checking T -satisfiability of
∃X . ∧_{i=1}^{k} ∃curr . ψ_i(curr, X ).
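For a theory with an off-the-shelf solver, the realizability check for one guessed path is a single satisfiability query. A sketch using Z3 over the reals, with made-up guards ψi for illustration:

from z3 import Real, Solver, And, sat

x = Real("x")                                    # a parameter in X
currs = [Real(f"curr_{i}") for i in range(3)]    # one curr per guard
guards = [currs[0] > x, currs[1] > x + 1, currs[2] < x]

s = Solver()
s.add(And(guards))     # exists X, curr_1, ..., curr_k : conjunction of psi_i
print(s.check() == sat)                          # True: path is realizable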
Prelude: The Case of Strings. We start with some known results about the
case of strings. The satisfiability of word equations with regular constraints is
PSpace-complete [18,19]. This upper bound can be extended to full quantifier-
free theory [10]. When no regular constraints are given, the problem is only
known to be NP-hard, and it is widely believed to be in NP. In the absence of
regular constraints, without loss of generality Γ can be assumed to contain only
letters from the equations; this is not the case in the presence of regular constraints.
The algorithm solving word equations [19] does not need an explicit access to
Γ : it is enough to know whether there is a letter which labels a given set of
transitions in the NFAs used in the regular constraints. In principle, there could
be exponentially many different (i.e., inducing different transitions in the NFAs)
letters. When oracle access to such alphabet is provided, the satisfiability can still
be decided in PSpace: while not explicitly claimed, this is exactly the scenario
in [19, Sect. 5.2].
Other constraints are also considered for word equations; perhaps the most
widely known are the length constraints, which are of the form ∑_{x∈V} a_x · |x| ≤ c,
where {a_x}_{x∈V} and c are integer constants and |x| denotes the length |μ(x)|,
with the obvious semantics. It is an open problem whether word equations with
length constraints are decidable, see [26].
The proof can be found in the full version; its intuition is clear: we map each
letter a ∈ D to the unique letter in Dπ of the same type.
Once the assignment is fixed (to π) and the domain restricted to a finite set (Dπ),
the equational and regular constraints reduce to word equations with regular
constraints: treat Dπ as a finite alphabet, and for a parametric automaton A =
(X, Q, Δ, q0, F) create an NFA A′ = (Dπ, Q, Δ′, q0, F), i.e. over the alphabet Dπ,
with the same set of states Q, the same starting state q0 and accepting states F,
and with the relation defined by (q, a, q′) ∈ Δ′ if and only if there is
(q, ϕ(curr, X), q′) ∈ Δ such that ϕ(a, π(X)) holds, i.e. we can move from q to q′
by a in A′ if and only if we can make this move in A under the assignment π. The
correctness of this construction can be verified as in [19]. It turns out that we do
not need the actual π; it is enough to know which types are realisable for it, which
translates to an exponential-size formula. We will use the letter τ to denote a
subset of Φ; the idea is that τ = {typeπ(a) : a ∈ D} ∈ 2^Φ, and if two different π, π′
give the same sets of realizable types, then either both yield a satisfying assignment
or neither does. Hence it is enough to focus on τ and not on the actual π.
On the other hand, when there is a solution of the input constraints, there is
one for some assignment of parameters π. Hence, by Lemma 2, there is a solution
over Dπ. The algorithm guesses τ = {typeπ(a) : a ∈ D} such that (1) is true for it.
Then by Lemma 2 there is a solution over Dπ as constructed in the reduction,
and by Lemma 3 the regular constraints define the same subsets of Dπ∗ both
when interpreted as parametric automata and as NFAs.
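A sketch of this collapse of a PA to an NFA under a fixed parameter assignment π (reusing the toy PA encoding from the earlier sketch):

def pa_to_nfa(pa, pi, D_pi):
    # Over the finite alphabet D_pi, keep a transition (q, a, q') exactly
    # when some guard (q, phi, q') of the PA holds for a under pi.
    delta = {(q, a, q2)
             for (q, g, q2) in pa["trans"]
             for a in D_pi
             if g(a, pi)}
    return {"init": pa["init"], "final": pa["final"], "trans": delta}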
Theorem 1. If the theory T is in PSpace, then sequence constraints are in
ExpSpace.
If τ is of polynomial size and the formula (1) can be verified in PSpace, then
sequence constraints can be decided in PSpace.
One of the difficulties in deciding sequence constraints using the word equa-
tions approach is the size of set of realizable types τ , which could be exponential.
For some concrete theories it is known to be smaller, and thus a better upper
bound on complexity follows. For instance, it is easy to show that for LRA there
are linearly many realizable types, which implies a PSpace upper bound.
Corollary 1. Sequence constraints for Linear Real Arithmetic are in PSpace.
In general, the ExpSpace upper bound from Theorem 1 cannot be improved,
as even non-emptiness of intersection of parametric automata is ExpSpace-
complete for some theories decidable in PSpace. This is in contrast to the case
of symbolic automata, for which the non-emptiness of intersection (for a theory
T decidable in PSpace) is in PSpace. This shows the importance of parameters
in our lower bound proof.
Theorem 2. There are theories with existential fragment decidable in PSpace
and whose non-emptiness of intersection of parametric automata is ExpSpace-
complete.
When no regular constraints are allowed, we can solve the equational and
element constraints in PSpace (note that we do not use Lemma 1).
Theorem 3. For a theory T decidable in PSpace, the element and equational
constraints (so no regular constraints) can be decided in PSpace.
Note that ’=’ on the l.h.s. is syntactic, while the ’=’ on the r.h.s. is in the
metalanguage. The definition of the semantics of the language is now inherited
from Sect. 3.
In addition to the syntactic restrictions, we also need a semantic condition:
in our language, we only permit functions f such that the pre-image of each
regular constraint under f is effectively a recognizable formula:
(RegInvRel) A function f is permitted if for each regular constraint A(y), it is
possible to compute a recognizable formula that is equivalent to the formula
∃y : A(y) ∧ y = f (x1 , . . . , xr , X ).
The proof of this proposition is exactly the same as in the case of strings, e.g.,
see [12,35].
Proposition 3. Given a regular constraint A(y) and a parametric transducer
constraint y = T (x), we can compute a regular constraint A (x) that is equivalent
to ∃y : A(y) ∧ y = T (x). This can be achieved in exponential time.
The construction in Proposition 3 is essentially the same as the pre-image com-
putation of a symbolic automaton under a symbolic transducer [44]. The com-
plexity is exponential in the maximum number of output symbols of a single
transition (i.e. the maximum length of w in the transducer), which is in practice
a small natural number.
The following is our main theorem on the SL fragment with equational con-
straints, regular constraints, and transducers.
Theorem 4. If T is solvable in PSpace, then the SL fragment with concatena-
tion and parametric transducers over T is in 2-ExpTime and is ExpSpace-hard.
There are two ways to see this. The first way is that regular constraints are closed
under intersection. This is in general computationally quite expensive because
of a product automata construction before applying the pre-image computation.
A better way to do this is to observe that ψ is equivalent to the conjunction of
ψi ’s over i = 1, . . . , m, where
ψi := ∃y : Ai (y) ∧ y = f (x1 , . . . , xr ).
Fig. 1. A0 accepts all words not containing k and A1 accepts all words containing k.
S; assert(ψ1); · · · ; assert(ψm),
Example 1. We consider the example from Sect. 2 where a weaker form of the
permutation property is shown for QuickSort. The formula that has to be proven
is a disjunction of straight-line formulas and in the following we execute our
procedure only on one disjunct without redundant formulas:
assert(A0(left′)); assert(A0(right′)); res = left′ . [l0] . right′; assert(A1(res))
We model L(A1) as the language which accepts all words that contain a
letter equal to k, and L(A0) as the language which accepts only words not
containing k, where k is an uninterpreted constant, i.e. a single element. See
Fig. 1. We begin by removing the operation res = left′ . [l0] . right′. The product
automaton for all assertions that contain res is just A1. Hence, we can remove the
assertion assert(A1(res)). The concatenation function . satisfies RegInvRel
and the pre-image g can be represented by
⋁_{0≤i,j≤1} A1^{q0,{qi}}(left′) ∧ A1^{qi,{qj}}([l0]) ∧ A1^{qj,{q1}}(right′),
where Ai^{p,F} denotes Ai with start state set to p and final states set to F.
In the next step, the assertion g is added to the program and all assertions
containing res and the concatenation function are removed.
Finally, the product automata A0 × A1^{q0,{qi}} and A0 × A1^{qj,{q1}} are com-
puted for the variables left′, right′, and a non-emptiness check over the prod-
uct automata and the automaton for [l0] is done. The procedure will find no
combination of paths for each automaton which can be satisfied, since left′
is forced to accept no words containing k by A0 and only accepts by read-
ing a k from A1^{q0,{q1}}. Next, the procedure needs to exhaust all tuples from
(A1^{q0,{qi}}, A1^{qi,{qj}}, A1^{qj,{q1}})_{0≤i,j≤1} before it is proven that this disjunct is unsat-
isfiable.
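A sketch of the pre-image computation for concatenation used in this example: res = x . m . y lies in L(A) iff an accepting run can be split at two intermediate states; here with_states(A, p, F) is a hypothetical helper returning A with start state p and final states F:

def preimage_concat(A):
    # Each choice of intermediate states (qi, qj) yields one disjunct:
    # x in L(A[q0 -> qi]), m in L(A[qi -> qj]), y in L(A[qj -> final]).
    disjuncts = []
    for qi in A["states"]:
        for qj in A["states"]:
            disjuncts.append((with_states(A, A["init"], {qi}),
                              with_states(A, qi, {qj}),
                              with_states(A, qj, A["final"])))
    return disjuncts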
Automata Oblivious of Lengths. We first consider the setting in which the length
variables L can only be used in length constraints. It is routine to verify that
the reduction from Sect. 4 generalizes to the case of length constraints: it is pos-
sible to first fix μ for the parameters, calling it again π. Then Lemma 2 shows
that each solution μ can be mapped by a letter-to-letter homomorphism to a
finite alphabet Dπ, and this mapping preserves the satisfiability/unsatisfiability
of length constraints, so Lemma 2 still holds when length constraints are also
allowed. Similarly, Lemma 3 is not affected by the length constraints. Finally,
Lemma 4 deals with regular and equational constraints while ignoring the other
possible constraints, and the lengths of the substitutions for variables stay the
same; hence it also holds when length constraints are allowed, and the resulting
word equations use regular and length constraints.
Unfortunately, the decidability of word equations with linear length con-
straints (even without regular constraints) is a notorious open problem. Thus
instead of decidability, we get Turing-equivalent problems.
Automata Aware of the Sequence Lengths. We now consider the case when
the underlying theory TS is the Presburger arithmetic, i.e. S is the natural
numbers and we can use addition, constants 0, 1 and comparisons (and vari-
ables). The additional functionality of the parametric automaton A is that
Δ⊆fin Q × T (curr, X , L) × Q, i.e. the guards can also use the length variables;
the semantics is extended in the natural way.
Then the type typeπ(a) of a ∈ N now depends on the values of μ on X and L;
hence we denote by π the restriction of μ to X ∪ L. Then Lemmas 2 and 3 still hold when
we fix π. Similarly, Lemma 4 holds, but the analogue of (1) now uses also the
length variables, which are also used in the length constraints. Such a formula
can be seen as a collection of length constraints for original length variables L
as well as length variables X ∪ {at : t ∈ τ }. Hence we validate this formula as
part of the word equations with length constraints. Note that at has two roles:
as a letter in Dπ and as a length variable. However, the connection is encoded
in the formula from the reduction (analogue of (1)) and we can use two different
sets of symbols.
Theorem 6. Deciding conjunction of regular, equational and length constraints
for sequences of natural numbers with Presburger arithmetic, where the regular
constraints can use length variables, is Turing-equivalent to word equations with
regular and (up to exponentially many) length constraints.
We have implemented our solver SeCo (Sequence Constraint Solver) on top of the SMT solver Princess [41]. We extend a
publicly available library for symbolic automata and transducers [13] to paramet-
ric automata and transducers by connecting them to the uninterpreted constants
in our theory of sequences. Our tool supports symbolic transducers, concatena-
tion of sequences and reversing of sequences. Any additional function which
satisfies RegInvRel such as a replace function which replaces only the first and
leftmost longest match can be added in the future.
Our algorithm is an adaptation of the tool OSTRICH [12] and closely follows
the proof of Theorem 4. To summarize the procedure: a depth-first search is
employed to remove all functions in the given input, splitting on the pre-
images of those functions. When removing a function, new assertions encoding
the pre-image constraints are added. After all functions have been removed and only
assertions are left, a non-emptiness check is called over all parametric automata
which encode the assertions. If the check is successful, a corresponding model
can be constructed; otherwise the procedure computes a conflict set and back-
jumps to the last split in the depth-first search.²
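A high-level sketch of this depth-first procedure (hypothetical helpers; conflict-set computation and backjumping are omitted for brevity):

def solve_sl(functions, assertions):
    # Remove function applications back-to-front, splitting on the
    # disjuncts of their pre-images; at the leaves, check non-emptiness
    # of the remaining automata constraints.
    if not functions:
        return check_nonemptiness(assertions)
    y, f, args = functions[-1]
    for pre in preimage_disjuncts(f, args, assertions_on(y, assertions)):
        remaining = [a for a in assertions if a.var != y] + list(pre)
        if solve_sl(functions[:-1], remaining):
            return True
    return False        # all splits exhausted: this disjunct is unsatisfiable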
Table 1. Benchmark suite 2. SRA denotes the algorithm for symbolic register
automata and SeCo our tool. The symbol ∅ marks the columns where emptiness
was checked, ≡ self-equivalence, and ⊆ language inclusion.
L1 L2 SRA∅ (L1 ) SeCo∅ (L1 ) SRA≡ (L1 ) SeCo≡ (L1 ) SRA⊆ (L2 , L1 ) SeCo⊆ (L2 , L1 )
Pr-C2 Pr-CL2 0.03 s 0.65 s 0.43 s 0.10 s 4.7 s 0.10 s
Pr-C3 Pr-CL3 0.58 s 0.70 s 10.73 s 0.12 s 36.90 s 0.10 s
Pr-C4 Pr-CL4 18.40 s 0.77 s 98.38 s 0.14 s – 0.10 s
Pr-C6 Pr-CL6 – 1.00 s – 0.12 s – 0.10 s
Pr-CL2 Pr-C2 0.33 s 0.30 s 1.03 s 0.13 s 0.52 s 0.76 s
Pr-CL3 Pr-C3 14.04 s 0.38 s 20.44 s 0.13 s 10.52 s 0.76 s
Pr-CL4 Pr-C4 – 0.41 s 0.43 s 0.12 s – 0.82 s
Pr-CL6 Pr-C6 – 0.62 s 0.43 s 0.12 s – 1.27 s
IP-2 IP-3 0.11 s 1.53 s 0.63 s 0.14 s 2.43 s 0.15 s
IP-3 IP-4 1.83 s 1.45 s 4.66 s 0.14 s 28.60 s 0.17 s
IP-4 IP-6 30.33 s 1.75 s 80.03 s 0.14 s – 0.17 s
IP-6 IP-9 – 1.60 s 0.43 s 0.13 s – 0.17 s
universal quantification, but similarly to the motivating example from Sect. 2,
one can eliminate quantifiers by Skolemization and instantiation, which was done
by hand.
The second benchmark suite consists of three different types of benchmarks,
summarized in Table 1. The benchmark PR-Cn describes a regular expression
for matching products which have the same code number of length n, and PR-
CLn matches not only the code number but also the lot number. The last type
of benchmark is IP-n, which matches n positions of 2 IP addresses. The bench-
marks are taken from the regular-expression crowd-sourcing website RegExLib
[39] and are also used in experiments for symbolic register automata [14] which
we also compare our results against. To apply our decision procedure to the
benchmarks, we encode each of the benchmarks as a parametric automaton,
using parameters for the (bounded-size) back-references. The task in the exper-
iments is to check emptiness, language equivalence, and language inclusion for
the same combinations of the benchmarks as considered in [14].
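To give a flavour of how a bounded back-reference becomes a parameter, here is a toy illustration. It is our own simplification (parameters are enumerated over a finite alphabet rather than handled symbolically as in the SVPAlib-based implementation), and the function name accepts is hypothetical.

    # Toy parametric matcher for the language u '#' u with |u| = n, i.e., a
    # repeated "code number" of length n, as in the Pr-Cn benchmarks.  A
    # parametric automaton would carry the n parameters along its run;
    # here we simply guess their values over a finite alphabet.
    from itertools import product

    def accepts(word, n, alphabet):
        for params in product(alphabet, repeat=n):  # guess parameter values
            u = ''.join(params)
            if word == u + '#' + u:
                return True
        return False

    assert accepts("ab#ab", 2, "ab")
    assert not accepts("ab#ba", 2, "ab")

Emptiness, equivalence, and inclusion checks then reduce to the corresponding questions on the (symbolic) product constructions over such automata.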
References
1. Abdulla, P.A., et al.: String constraints for verification. In: Biere, A., Bloem, R.
(eds.) CAV 2014. LNCS, vol. 8559, pp. 150–166. Springer, Cham (2014). https://
doi.org/10.1007/978-3-319-08867-9_10
2. Abdulla, P.A., Atig, M.F., Diep, B.P., Holík, L., Janků, P.: Chain-free string con-
straints. In: Chen, Y.-F., Cheng, C.-H., Esparza, J. (eds.) ATVA 2019. LNCS, vol.
11781, pp. 277–293. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-
31784-3_16
3. Amadini, R.: A survey on string constraint solving. ACM Comput. Surv. 55(2),
16:1–16:38 (2023). https://doi.org/10.1145/3484198
4. Barbosa, H., et al.: cvc5: a versatile and industrial-strength SMT solver. In: TACAS
2022. LNCS, vol. 13243, pp. 415–442. Springer, Cham (2022). https://doi.org/10.
1007/978-3-030-99524-9_24
5. Barceló, P., Figueira, D., Libkin, L.: Graph logics with rational relations. Log.
Methods Comput. Sci. 9(3) (2013). https://doi.org/10.2168/LMCS-9(3:1)2013
6. Bjørner, N., de Moura, L., Nachmanson, L., Wintersteiger, C.M.: Programming
Z3. In: Bowen, J.P., Liu, Z., Zhang, Z. (eds.) SETSS 2018. LNCS, vol. 11430, pp.
148–201. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-17601-3_4
7. Bojanczyk, M., Stefanski, R.: Single-use automata and transducers for infinite
alphabets. In: Czumaj, A., Dawar, A., Merelli, E. (eds.) 47th International Col-
loquium on Automata, Languages, and Programming, ICALP 2020, July 8–11,
2020, Saarbrücken, Germany (Virtual Conference). LIPIcs, vol. 168, pp. 113:1–
113:14. Schloss Dagstuhl - Leibniz-Zentrum für Informatik (2020). https://doi.
org/10.4230/LIPIcs.ICALP.2020.113
8. Bouajjani, A., Jonsson, B., Nilsson, M., Touili, T.: Regular Model Checking. In:
Emerson, E.A., Sistla, A.P. (eds.) CAV 2000. LNCS, vol. 1855, pp. 403–418.
Springer, Heidelberg (2000). https://doi.org/10.1007/10722167_31
9. Bradley, A.R., Manna, Z., Sipma, H.B.: What’s decidable about arrays? In: Emer-
son, E.A., Namjoshi, K.S. (eds.) VMCAI 2006. LNCS, vol. 3855, pp. 427–442.
Springer, Heidelberg (2005). https://doi.org/10.1007/11609773_28
10. Büchi, J.R., Senger, S.: Definability in the existential theory of concatenation
and undecidable extensions of this theory. In: The Collected Works of J. Richard
Büchi, pp. 671–683. Springer, New York (1990). https://doi.org/10.1007/978-1-
4613-8928-6_37
11. Chen, T., et al.: Solving string constraints with regex-dependent functions through
transducers with priorities and variables. Proc. ACM Program. Lang. 6(POPL),
1–31 (2022). https://doi.org/10.1145/3498707
12. Chen, T., Hague, M., Lin, A.W., Rümmer, P., Wu, Z.: Decision procedures for path
feasibility of string-manipulating programs with complex operations. Proc. ACM
Program. Lang. 3(POPL), 49:1–49:30 (2019). https://doi.org/10.1145/3290362
13. D’Antoni, L.: SVPAlib. Symbolic Automata Library (2018). https://github.com/
lorisdanto/symbolicautomata. Accessed 2 Feb 2023
14. D’Antoni, L., Ferreira, T., Sammartino, M., Silva, A.: Symbolic register automata.
In: Dillig, I., Tasiran, S. (eds.) CAV. vol. 11561, pp. 3–21. Springer, Cham (2019).
https://doi.org/10.1007/978-3-030-25540-4_1
15. D’Antoni, L., Veanes, M.: The power of symbolic automata and transducers.
In: Majumdar, R., Kunčak, V. (eds.) CAV 2017. LNCS, vol. 10426, pp. 47–67.
Springer, Cham (2017). https://doi.org/10.1007/978-3-319-63387-9_3
16. D’Antoni, L., Veanes, M.: Automata modulo theories. Commun. ACM 64(5), 86–
95 (2021). https://doi.org/10.1145/3419404
17. Diekert, V.: Makanin’s algorithm. In: Lothaire, M. (ed.) Algebraic Combinatorics
on Words, Encyclopedia of Mathematics and its Applications, vol. 90, chap. 12,
pp. 387–442. Cambridge University Press (2002)
18. Diekert, V., Gutiérrez, C., Hagenah, C.: The existential theory of equations with
rational constraints in free groups is PSPACE-complete. Inf. Comput. 202(2), 105–
140 (2005). https://doi.org/10.1016/j.ic.2005.04.002
19. Diekert, V., Jeż, A., Plandowski, W.: Finding all solutions of equations in free
groups and monoids with involution. Inf. Comput. 251, 263–286 (2016). https://
doi.org/10.1016/j.ic.2016.09.009
20. Dijkstra, E.W.: Self-stabilizing systems in spite of distributed control. Commun.
ACM 17(11), 643–644 (1974). https://doi.org/10.1145/361179.361202
21. Faran, R., Kupferman, O.: On synthesis of specifications with arithmetic. In:
Chatzigeorgiou, A., et al. (eds.) SOFSEM 2020. LNCS, vol. 12011, pp. 161–173.
Springer, Cham (2020). https://doi.org/10.1007/978-3-030-38919-2_14
23. Figueira, D., Jeż, A., Lin, A.W.: Data path queries over embedded graph databases.
In: PODS ’22: International Conference on Management of Data, Philadelphia, 12–
17 June, 2022. pp. 189–201 (2022). https://doi.org/10.1145/3517804.3524159
24. Figueira, D., Lin, A.W.: Reasoning on data words over numeric domains. In: LICS
’22: 37th Annual ACM/IEEE Symposium on Logic in Computer Science, Haifa,
Israel, 2–5 August 2022, pp. 37:1–37:13 (2022). https://doi.org/10.1145/3531130.
3533354
25. Furia, C.A.: What’s decidable about sequences? In: Bouajjani, A., Chin, W.-N.
(eds.) ATVA 2010. LNCS, vol. 6252, pp. 128–142. Springer, Heidelberg (2010).
https://doi.org/10.1007/978-3-642-15643-4_11
26. Ganesh, V., Minnes, M., Solar-Lezama, A., Rinard, M.: Word equations with length
constraints: what’s decidable? In: Biere, A., Nahir, A., Vos, T. (eds.) HVC 2012.
LNCS, vol. 7857, pp. 209–226. Springer, Heidelberg (2013). https://doi.org/10.
1007/978-3-642-39611-3_21
27. Grumberg, O., Kupferman, O., Sheinvald, S.: Variable automata over infinite
alphabets. In: Dediu, A.-H., Fernau, H., Martín-Vide, C. (eds.) LATA 2010. LNCS,
vol. 6031, pp. 561–572. Springer, Heidelberg (2010). https://doi.org/10.1007/978-
3-642-13089-2_47
28. Hoare, C.A.R.: Quicksort. Comput. J. 5(1), 10–15 (1962). https://doi.org/10.1093/
comjnl/5.1.10
29. Jeż, A.: Recompression: a simple and powerful technique for word equations. J.
ACM 63(1), 4:1–4:51 (2016). https://doi.org/10.1145/2743014
30. Jhala, R., Majumdar, R.: Software model checking. ACM Comput. Surv. 41(4),
21:1–21:54 (2009). https://doi.org/10.1145/1592434.1592438
31. Kaminski, M., Francez, N.: Finite-memory automata. Theor. Comput. Sci. 134(2),
329–363 (1994). https://doi.org/10.1016/0304-3975(94)90242-9
32. Kroening, D., Strichman, O.: Decision Procedures. Springer (2008)
33. Lamport, L.: A new solution of Dijkstra’s concurrent programming problem. Com-
mun. ACM 17(8), 453–455 (1974). https://doi.org/10.1145/361082.361093
34. Lin, A.W., Rümmer, P.: Regular model checking revisited. In: Olderog, E.-R., Stef-
fen, B., Yi, W. (eds.) Model Checking, Synthesis, and Learning. LNCS, vol. 13030,
pp. 97–114. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-91384-7_6
35. Lin, A.W., Barceló, P.: String solving with word equations and transducers:
towards a logic for analysing mutation XSS. In: Proceedings of the 43rd Annual
ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages,
POPL 2016, St. Petersburg, 20–22 January 2016, pp. 123–136 (2016). https://doi.
org/10.1145/2837614.2837641
36. Makanin, G.S.: The problem of solvability of equations in a free semigroup. Sbornik:
Mathematics 32(2), 129–198 (1977)
37. Meyer, B.: Applying “Design by contract.” IEEE Comput. 25(10), 40–51 (1992).
https://doi.org/10.1109/2.161279
38. de Moura, L.M., Bjørner, N.: Z3: An efficient SMT solver. In: TACAS (2008)
39. RegExLib (2017). https://regexlib.com/. Accessed 2 Feb 2023
40. Plandowski, W.: On PSPACE generation of a solution set of a word equation and
its applications. Theor. Comput. Sci. 792, 20–61 (2019). https://doi.org/10.1016/
j.tcs.2018.10.023
41. Rümmer, P.: A constraint sequent calculus for first-order logic with linear integer
arithmetic. In: Cervesato, I., Veith, H., Voronkov, A. (eds.) LPAR 2008. LNCS
(LNAI), vol. 5330, pp. 274–289. Springer, Heidelberg (2008). https://doi.org/10.
1007/978-3-540-89439-1_20
42. Safari, M., Huisman, M.: A generic approach to the verification of the permutation
property of sequential and parallel swap-based sorting algorithms. In: Dongol, B.,
Troubitsyna, E. (eds.) IFM 2020. LNCS, vol. 12546, pp. 257–275. Springer, Cham
(2020). https://doi.org/10.1007/978-3-030-63461-2_14
43. Schulz, K.U.: Makanin’s algorithm for word equations–two improvements and a
generalization. In: Schulz, K.U. (ed.) IWWERT. Lecture Notes in Computer Sci-
ence, vol. 572, pp. 85–150. Springer, Cham (1990). https://doi.org/10.1007/3-540-
55124-7_4
44. Veanes, M., Hooimeijer, P., Livshits, B., Molnar, D., Bjorner, N.: Symbolic finite
state transducers: algorithms and applications. SIGPLAN Not. 47(1), 137–150
(2012). https://doi.org/10.1145/2103621.2103674
45. Wang, Q., Appel, A.W.: A solver for arrays with concatenation. J. Autom. Reason.
67(1), 4 (2023). https://doi.org/10.1007/s10817-022-09654-y
Exploiting Adjoints in Property Directed
Reachability Analysis
1 Introduction
Property directed reachability analysis (PDR) refers to a class of verification
algorithms for solving safety problems of transition systems [5,12]. Its essence
consists of 1) interleaving the construction of an inductive invariant (a positive
chain) with that of a counterexample (a negative sequence), and 2) making the
two sequences interact, with one narrowing down the search space for the other.
PDR algorithms have shown impressive performance both in hardware and
software verification, leading to active research [15,18,28,29] going far beyond
its original scope. For instance, an abstract domain [8] capturing the over-
approximation exploited by PDR has been recently introduced in [13], while
PrIC3 [3] extended PDR for quantitative verification of probabilistic systems.
Research supported by MIUR PRIN Project 201784YSZ5 ASPRA, by JST ERATO
HASUO Metamathematics for Systems Design Project JPMJER1603, by JST CREST
Grant JPMJCR2012, by JSPS DC KAKENHI Grant 22J21742 and by EU Next-
GenerationEU (NRRP) SPOKE 10, Mission 4, Component 2, Investment N. 1.4, CUP
N. I53C22000690001.
To uncover the abstract principles behind PDR and its extensions, Kori et
al. proposed LT-PDR [19], a generalisation of PDR in terms of lattice/category
theory. LT-PDR can be instantiated using domain-specific heuristics to create
effective algorithms for different kinds of systems such as Kripke structures,
Markov Decision Processes (MDPs), and Markov reward models. However, the
theory in [19] does not offer guidance on devising concrete heuristics.
Adjoints in PDR. Our approach shares the same vision as LT-PDR, but we
identify different principles: adjunctions are the core of our toolset.
An adjunction f ⊣ g is one of the central concepts in category theory, appearing
across computer science, e.g., in functional programming [22]. Our use of adjoints
in this work comes in the following two flavours. The first assumes an adjoint pair
f ⊣ g on the lattice L, together with an element i ∈ L, such that b(x) = f(x) ∨ i
for all x ∈ L. Under this assumption, we have the following equivalences (they
follow from the Knaster-Tarski theorem, see §2):

    μb ≤ p ⇔ μ(f ∨ i) ≤ p ⇔ i ≤ ν(g ∧ p),

where μ(f ∨ i) and ν(g ∧ p) are, by the Kleene theorem, the limits of the initial
and final chains illustrated below.

    ⊥ ≤ i ≤ f(i) ∨ i ≤ · · ·        · · · ≤ g(p) ∧ p ≤ p
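These equivalences are easy to check on a small finite instance. The sketch below is our own illustration (not from the paper): on the powerset lattice of a four-state transition system, f is the successor map and its right adjoint g sends X to the states whose successors all lie in X.

    # Finite sanity check of  μ(f ∨ i) ≤ p  ⇔  i ≤ ν(g ∧ p)
    succ = {0: {1}, 1: {2}, 2: {2}, 3: {0}}
    S = set(succ)
    i, p = {0}, {0, 1, 2}                    # initial states, safe states

    f = lambda X: {t for s in X for t in succ[s]}          # left adjoint
    g = lambda X: {s for s in S if succ[s] <= X}           # right adjoint

    def kleene(h, x):                        # iterate h from x to a fixpoint
        while h(x) != x:
            x = h(x)
        return x

    mu = kleene(lambda X: f(X) | i, set())   # μ(f ∨ i): reachable states
    nu = kleene(lambda X: g(X) & p, S)       # ν(g ∧ p): always-safe states
    assert (mu <= p) == (i <= nu)

Here mu is the set of reachable states and nu the largest set of states from which only safe states are ever reached, so the assertion is exactly the equivalence displayed above.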
The theory prescribes the choices to obtain the boundary executions, using
initial and final chains (Proposition 10); it thus identifies a class of heuristics
guaranteeing termination when answers are negative (Theorem 12).
AdjointPDR’s assumption of a forward-backward adjoint f ⊣ g, however, often
fails to hold, especially in probabilistic settings. Our second algorithm
AdjointPDR↓ circumvents this problem by extending the lattice for the negative
sequence, from L to the lattice L↓ of lower sets in L.
Specifically, by using the second form of adjoints, namely an
abstraction-concretization pair, the problem μb ≤? p in L can be translated
to an equivalent problem on b↓ in L↓, for which an adjoint b↓ ⊣ b↓r is
guaranteed. This allows one to run AdjointPDR in the lattice L↓. We then
notice that the search for a positive chain can be conveniently restricted to
principals in L↓, which have representatives in L. The resulting algorithm,
using L for positive chains and L↓ for negative sequences, is AdjointPDR↓.
The use of lower sets for the negative sequence is a key advantage. It not
only avoids the restrictive assumption of a forward-backward adjoint f ⊣ g, but
also enables a more thorough search for counterexamples. AdjointPDR↓ can
simulate LT-PDR step by step (Theorem 17), while the reverse is not possible due
to a single negative sequence in AdjointPDR↓ potentially representing multiple
(Proposition 18) or even all (Proposition 19) negative sequences in LT-PDR.
Concrete Instances. Our lattice-theoretic algorithms yield many concrete
instances: the original IC3/PDR [5,12] as well as Reverse PDR [27] are instances
of AdjointPDR with L being the powerset of the state space; since LT-PDR can
be simulated by AdjointPDR↓ , the latter generalizes all instances in [19].
As a notable instance, we apply AdjointPDR↓ to MDPs, specifically to decide
if the maximum reachability probability [1] is below a given threshold. Here
the lattice L = [0, 1]S is that of fuzzy predicates over the state space S. Our
theory provides guidance to devise two heuristics, for which we prove negative
termination (Corollary 20). We present its implementation in Haskell, and its
experimental evaluation, where comparison is made against existing probabilistic
PDR algorithms (PrIC3 [3], LT-PDR [19]) and a non-PDR one (Storm [11]). The
performance of AdjointPDR↓ is encouraging—it supports the potential of PDR
algorithms in probabilistic model checking. The experiments also indicate the
importance of having a variety of heuristics, and thus the value of our adjoint
framework that helps coming up with those.
Additionally, we found that the abstraction features of Haskell allow us to code
the lattice-theoretic algorithms almost literally (∼100 lines); implementing a few
heuristics takes another ∼240 lines. In this way, we found that mathematical
abstraction can directly ease the implementation effort.
Related Work. Reverse PDR [27] applies PDR from unsafe states using a back-
ward transition relation T and tries to prove that initial states are unreachable.
Our right adjoint g is also backward, but it differs from T in the presence of
nondeterminism: roughly, T(X) is the set of states which can reach X in one
step, while g(X) are the states which only reach X in one step. fbPDR [28,29] runs
PDR and Reverse PDR in parallel with shared information. Our work uses both
forward and backward directions (the pair f ⊣ g), too, but approximates differently:
Reverse PDR over-approximates the set of states that can reach an unsafe
state, while we over-approximate the set of states that only reach safe states.
The comparison with LT-PDR [19] is extensively discussed in Sect. 4.2.
PrIC3 [3] extended PDR to MDPs, which are our main experimental ground:
Sect. 6 compares the performances of PrIC3, LT-PDR and AdjointPDR↓ .
We remark that PDR has been applied to other settings, such as soft-
ware model checking using theories and SMT-solvers [6,21] or automated plan-
ning [30]. Most of them (e.g., software model checking) fall already in the gen-
erality of LT-PDR and thus they can be embedded in our framework.
It is also worth mentioning that, in the context of abstract interpretation, the
use of adjoints to construct initial and final chains and exploit the interaction
between their approximations has been investigated in several works, e.g., [7].
Structure of the Paper. After recalling some preliminaries in Sect. 2, we
present AdjointPDR in Sect. 3 and AdjointPDR↓ in Sect. 4. In Sect. 5 we introduce
the heuristics for the max reachability problem of MDPs, which are experimentally
tested in Sect. 6.
(Figure omitted: a transition system over states s0, . . . , s6, used as a running example.)
By means of (KT), one can prove μb ≤ p by finding some pre-fixed point x, often
called an invariant, such that x ≤ p. However, automatically finding invariants
might be rather complicated, so most of the algorithms rely on another fixed-point
theorem, usually attributed to Kleene. It characterises μb and νb as the
least upper bound and the greatest lower bound of the initial and final chains:

    μb = ⋁_{n∈N} bⁿ(⊥)        νb = ⋀_{n∈N} bⁿ(⊤)        (Kl)

The assumptions are stronger than for Knaster-Tarski: the leftmost statement
requires the map b to be ω-continuous (i.e., to preserve suprema ⋁ of ω-chains)
and the rightmost requires ω-co-continuity (similar, but for infima ⋀). Observe that every
left adjoint is continuous and every right adjoint is co-continuous (see e.g. [23]).
As explained in [19], property directed reachability (PDR) algorithms [5]
exploit (KT) to try to prove the inequation and (Kl) to refute it. In the algorithm
we introduce in the next section, we further assume that b is of the form
f(−) ∨ i for some element i ∈ L and map f : L → L, namely b(x) = f(x) ∨ i for all
x ∈ L. Moreover we require f to have a right adjoint g : L → L. In this case

    μ(f ∨ i) = ⋁_{n∈N} fⁿ(i)        ν(g ∧ p) = ⋀_{n∈N} gⁿ(p)

which, by (Kl), provide useful characterisations of the least and greatest fixed points.
We conclude this section with an example that we will often revisit. It also
provides a justification for the intuitive terminology that we sporadically use.
    ∅ ⊆ I ⊆ S2 ⊆ S3 ⊆ S4 ⊆ S4 ⊆ · · ·        · · · ⊆ S4 ⊆ S4 ⊆ P ⊆ S

The (j + 1)-th element of the initial chain contains all the states that can be
reached from I in at most j transitions, while the (j + 1)-th element of the final
chain contains all the states that, in at most j transitions, reach safe states only.
3 Adjoint PDR
AdjointPDR(i, f, g, p)
<INITIALISATION>
    (x | y)_{n,k} := (⊥ ⊤ | ε)_{2,2}
<ITERATION>                                           % x, y not conclusive
    case (x | y)_{n,k} of
      y = ε and x_{n−1} ≤ p :                         %(Unfold)
          (x | y)_{n,k} := (x ⊤ | ε)_{n+1,n+1}
      y = ε and x_{n−1} ≰ p :                         %(Candidate)
          choose z ∈ L such that x_{n−1} ≰ z and p ≤ z;
          (x | y)_{n,k} := (x | z)_{n,n−1}
      y ≠ ε and f(x_{k−1}) ≰ y_k :                    %(Decide)
          choose z ∈ L such that x_{k−1} ≰ z and g(y_k) ≤ z;
          (x | y)_{n,k} := (x | z y)_{n,k−1}
      y ≠ ε and f(x_{k−1}) ≤ y_k :                    %(Conflict)
          choose z ∈ L such that z ≤ y_k and (f ∨ i)(x_{k−1} ∧ z) ≤ z;
          (x | y)_{n,k} := (x ∧_k z | tail(y))_{n,k+1}
    endcase
<TERMINATION>
    if ∃j ∈ [0, n−2] . x_{j+1} ≤ x_j then return true     % x conclusive
    if i ≰ y_1 then return false                           % y conclusive
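To help in reading the pseudocode, the following is a hypothetical Python transcription for the powerset instance, hard-wired to the canonical choices of Proposition 6 below (z = p, z = g(y_k), z = y_k). It is a sketch under our own simplifications, not the published algorithm; in particular the refutation check is performed on the head of the negative sequence (which is y_1 when k = 1).

    def adjoint_pdr(i, f, g, p, S):
        """Sketch of AdjointPDR on the powerset lattice (Python sets)."""
        x, k = [set(), set(S)], 2          # (x | y)_{n,k} := (⊥ ⊤ | ε)_{2,2}
        y = []                             # negative sequence, head = y_k
        while True:
            n = len(x)
            if any(x[j + 1] <= x[j] for j in range(n - 1)):
                return True                # positive chain conclusive
            if y and not (i <= y[0]):
                return False               # negative sequence conclusive
            if not y and x[-1] <= p:       # (Unfold)
                x.append(set(S)); k = n + 1
            elif not y:                    # (Candidate): z = p
                y, k = [set(p)], n - 1
            elif not (f(x[k - 1]) <= y[0]):  # (Decide): z = g(y_k)
                y.insert(0, g(y[0])); k -= 1
            else:                          # (Conflict): z = y_k
                z = y.pop(0)
                x[k] = x[k] & z            # strengthen the k-th element
                k += 1

For instance, with S, f, g, i, p as in the small powerset sketch of Sect. 2, the call adjoint_pdr(i, f, g, p, S) returns True, and it returns False once an unsafe state is made reachable.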
The last state returns true since x4 = x5 = S4 . Observe that the elements of
x, with the exception of the last element xn−1 , are those of the initial chain of
(F ∪ I), namely, xj is the set of states reachable in at most j − 1 steps. In the
second computation, the elements of x are roughly those of the final chain of
(G ∩ P ). More precisely, after (Unfold) or (Candidate), xn−j for j < n − 1 is the
set of states which only reach safe states within j steps.
    (∅, S | ε)_{2,2} →Ca(P) (∅, S | P)_{2,1} →Co(P) (∅, P | ε)_{2,2}
    →U →Ca(P) (∅, P, S | P)_{3,2} →D(S4) (∅, P, S | S4, P)_{3,1} →Co(S4) (∅, S4, S | P)_{3,2} →Co(P) (∅, S4, P | ε)_{3,3}
    →U →Ca(P) (∅, S4, P, S | P)_{4,3} →D(S4) (∅, S4, P, S | S4, P)_{4,2} →Co(S4) (∅, S4, S4, S | P)_{4,3}

Here →Ca(z), →D(z), →Co(z) and →U denote applications of (Candidate), (Decide)
and (Conflict) with choice z, and of (Unfold), respectively; the positive chain x
and the negative sequence y are separated by a vertical bar.
Observe that, by invariant (A1), the values of x in the two runs are, respectively,
the least and the greatest values for all possible computations of AdjointPDR.
Theorem 5.1 follows by invariants (I2), (P1), (P3) and (KT); Theorem 5.2
by (N1), (N2) and (Kl). Note that both results hold for any choice of z.
Theorem 5 (Soundness). AdjointPDR is sound. Namely,
1. If AdjointPDR returns true then μ(f ∨ i) ≤ p.
2. If AdjointPDR returns false then μ(f ∨ i) ≰ p.
3.1 Progression
It is necessary to prove that in any step of the execution, if the algorithm does
not return true or false, then it can progress to a new state, not yet visited.
To this aim we must deal with the subtleties of the non-deterministic choice of
the element z in (Candidate), (Decide) and (Conflict). The following proposition
ensures that, for any of these three rules, there is always a possible choice.
Proposition 6 (Canonical choices). The following are always possible:
1. in (Candidate) z = p;
2. in (Decide) z = g(y_k);
3. in (Conflict) z = y_k;
4. in (Conflict) z = (f ∨ i)(x_{k−1}).
Thus, for all non-conclusive s ∈ S, if s0 →∗ s then s →.
Then, Proposition 7 ensures that AdjointPDR always traverses new states.
Proposition 7 (Impossibility of loops). If s0 →∗ s →+ s′, then s ≠ s′.
Observe that the above propositions entail that AdjointPDR terminates
whenever the lattice L is finite, since the set of reachable states is finite in
this case.
Example 8. For (I, F, G, P ) as in Example 1, AdjointPDR behaves essentially
as IC3/PDR [5], solving reachability problems for transition systems with finite
state space S. Since the lattice PS is also finite, AdjointPDR always terminates.
3.2 Heuristics
The nondeterministic choices of the algorithm can be resolved by using heuristics.
Intuitively, a heuristic chooses, for any state s ∈ S, an element z ∈ L to be
possibly used in (Candidate), (Decide) or (Conflict); it is thus just a function
h : S → L. When defining a heuristic, we will avoid specifying its values on
conclusive states and on those performing (Unfold), as they are clearly irrelevant.
With a heuristic, one can instantiate AdjointPDR by making the choice
of z as prescribed by h. Syntactically, this means erasing from the code of
Fig. 3 the three choose lines and replacing them by z := h((x | y)_{n,k}). We call
AdjointPDR^h the resulting deterministic algorithm and write s →_h s′ to mean
that AdjointPDR^h moves from state s to s′. We let S^h = {s ∈ S | s0 →∗_h s} be
the set of states reachable by AdjointPDR^h, CaD^h_{n,k} the set of all (n, k)-indexed
states in S^h that trigger (Candidate) or (Decide), and h(CaD^h_{n,k}) = {h(s) | s ∈ CaD^h_{n,k}}
the set of the corresponding choices.
The map b on (L, ≤) and the maps b↓ ⊣ b↓r on (L↓, ⊆) are related by the pair of maps

    (−)↓ : (L, ≤) → (L↓, ⊆)        ⋁ : (L↓, ⊆) → (L, ≤)        (4)

where (−)↓ : x ↦ {x′ | x′ ≤ x} and ⋁ maps a lower set X into ⋁{x | x ∈ X}.
The maps ⋁ and (−)↓ form a Galois insertion, namely ⋁ ⊣ (−)↓ and
⋁ ∘ (−)↓ = id, and thus one can think of (4) in terms of abstract
interpretation [8,9]: L↓ represents the concrete domain, L the abstract domain
and b is a sound abstraction of b↓. Most importantly, it turns out that b is
forward-complete [4,14] w.r.t. b↓, namely the following equation holds:

    b↓ ∘ (−)↓ = (−)↓ ∘ b        (5)
AdjointPDR↓(b, p)
<INITIALISATION>
    (x | Y)_{n,k} := (∅ ⊥ ⊤ | ε)_{3,3}
<ITERATION>                                           % x, Y not conclusive
    case (x | Y)_{n,k} of
      Y = ε and x_{n−1} ≤ p :                         %(Unfold)
          (x | Y)_{n,k} := (x ⊤ | ε)_{n+1,n+1}
      Y = ε and x_{n−1} ≰ p :                         %(Candidate)
          choose Z ∈ L↓ such that x_{n−1} ∉ Z and p ∈ Z;
          (x | Y)_{n,k} := (x | Z)_{n,n−1}
      Y ≠ ε and b(x_{k−1}) ∉ Y_k :                    %(Decide)
          choose Z ∈ L↓ such that x_{k−1} ∉ Z and b↓r(Y_k) ⊆ Z;
          (x | Y)_{n,k} := (x | Z Y)_{n,k−1}
      Y ≠ ε and b(x_{k−1}) ∈ Y_k :                    %(Conflict)
          choose z ∈ L such that z ∈ Y_k and b(x_{k−1} ∧ z) ≤ z;
          (x | Y)_{n,k} := (x ∧_k z | tail(Y))_{n,k+1}
    endcase
<TERMINATION>
    if ∃j ∈ [0, n−2] . x_{j+1} ≤ x_j then return true     % x conclusive
    if Y_1 = ∅ then return false                           % Y conclusive
The elements of x are all obtained by (Unfold), which adds the principal ⊤↓, and
by (Conflict), which takes their meets with the chosen principal.
Since principals are in bijective correspondence with the elements of L, by
imposing to AdjointPDR(⊥↓ , b↓ , b↓r , p↓ ) to choose a principal in (Conflict), we
obtain an algorithm, named AdjointPDR↓ , where the elements of the positive
chain are drawn from L, while the negative sequence is taken in L↓ . The algo-
rithm is reported in Fig. 4 where we use the notation (xY )n,k to emphasize
that the elements of the negative sequence are lower sets of elements in L.
All definitions and results illustrated in Sect. 3 for AdjointPDR are inherited1
by AdjointPDR↓ , with the only exception of Proposition 6.3. The latter does not
hold, as it prescribes a choice for (Conflict) that may not be a principal. In
contrast, the choice in Proposition 6.4 is, thanks to (5), a principal. This means
in particular that the simple initial heuristic is always applicable.
Theorem 15. All results in Sect. 3, but Proposition 6.3, hold for AdjointPDR↓ .
1
Up to a suitable renaming: the domain is (L↓ , ⊆) instead of (L, ), the parameters
are ⊥↓ , b↓ , b↓r , p↓ instead of i, f, g, p and the negative sequence is Y instead of y.
Since, for all D ∈ ([0, 1]^S)↓, b↓r(D) = {d | b(d) ∈ D} = ⋂_α {d | b_α(d) ∈ D}, and
since AdjointPDR↓ executes (Decide) only when b(x_{k−1}) ∉ Y_k, there must exist
some α such that b_α(x_{k−1}) ∉ Y_k. One can thus fix

    h((x | Y)_{n,k}) = p↓                       if (x | Y)_{n,k} →Ca
    h((x | Y)_{n,k}) = {d | b_α(d) ∈ Y_k}       if (x | Y)_{n,k} →D        (6)
Intuitively, such choices are smart refinements of those in (3): for (Candidate)
they are exactly the same; for (Decide), rather than taking b↓r(Y_k), we consider a
larger lower set determined by the labels chosen by α. This allows each Y_j to be
represented as the set of d ∈ [0, 1]^S satisfying a single linear inequality, while using
b↓r(Y_k) would yield a system of possibly exponentially many inequalities (see
Example 21 below). Moreover, from Theorem 12, it follows that such choices
ensure negative termination.
Corollary 20. Let h be a legit heuristic defined for (Candidate) and (Decide)
as in (6). If μb ≰ p, then AdjointPDR↓h terminates.
Example 21. Consider the maximum reachability problem with threshold λ = 1/4
and β = {s3} for the following MDP on alphabet A = {a, b} and sι = s0.
(Figure omitted: an MDP over states s0, . . . , s3 with actions a and b; its transition
probabilities are captured by the map b below.)
Hereafter we write d ∈ [0, 1]^S as column vectors with four entries v0 . . . v3 and
we will use · for the usual matrix multiplication. With this notation, the lower
set p↓ ∈ ([0, 1]^S)↓ and b : [0, 1]^S → [0, 1]^S can be written as

    p↓ = {(v0, v1, v2, v3)ᵀ | [1 0 0 0] · (v0, v1, v2, v3)ᵀ ≤ 1/4}
    b((v0, v1, v2, v3)ᵀ) = (max((v1 + v2)/2, (v0 + 2·v2)/3), (v0 + v3)/2, v0, 1)ᵀ
Amongst the several memoryless schedulers, only two are relevant for us:
ζ = (s0 → a, s1 → a, s2 → b, s3 → a) and ξ = (s0 → b, s1 → a, s2 → b, s3 → a).
By using the definition of b_α : [0, 1]^S → [0, 1]^S, we have that

    b_ζ((v0, v1, v2, v3)ᵀ) = ((v1 + v2)/2, (v0 + v3)/2, v0, 1)ᵀ
    b_ξ((v0, v1, v2, v3)ᵀ) = ((v0 + 2·v2)/3, (v0 + v3)/2, v0, 1)ᵀ
It is immediate to see that the problem has a negative answer: using ζ, in
4 steps or fewer, s0 reaches s3 already with probability 1/4 + 1/8.
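As a quick numeric cross-check (our own helper, not part of the paper's Haskell implementation), Kleene iteration on b confirms the negative answer:

    # Value iteration on b for Example 21.
    def b(v):
        v0, v1, v2, v3 = v
        return (max((v1 + v2) / 2, (v0 + 2 * v2) / 3),
                (v0 + v3) / 2, v0, 1.0)

    v = (0.0, 0.0, 0.0, 0.0)
    for _ in range(200):
        v = b(v)
    print(v[0])   # approaches 1 > 1/4, so μb ≰ p: the answer is negative
                  # (cf. the 4-step lower bound 1/4 + 1/8 given above)

Under ζ every run keeps returning to s0 until s3 is hit, so the maximum reachability probability is in fact 1, far above the threshold.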
To illustrate the advantages of (6), we run AdjointPDR↓ with the simple
initial heuristic and with the heuristic that differs from it only in the choice for
(Decide), taken as in (6). For both heuristics, the first iterations are the same.
(Fig. 5, matrix displays omitted.) Fig. 5 lists the elements F^i of the negative
sequences computed by AdjointPDR↓ for the MDP in Example 21. In the central
column, these elements are computed by means of the simple initial heuristic,
that is F^i = (b↓r)^i(p↓); there each F^i is defined by a growing system of linear
inequalities, with F^4 = {0} and F^5 = ∅. In the rightmost column, these elements
are computed using the heuristic in (6), each defined by a single linear inequality:
F^i = {d | b_ζ(d) ∈ F^{i−1}} for i ≤ 3, while for i ≥ 4 they are computed as
F^i = {d | b_ξ(d) ∈ F^{i−1}}.
In the state so reached, the algorithm has to perform (Decide), since b(x5) ∉ p↓.
Now the choice of z in (Decide) differs for the two heuristics: the former uses
b↓r(p↓) = {d | b(d) ∈ p↓}, the latter uses {d | b_ζ(d) ∈ p↓}. Despite the different
choices, both heuristics proceed with 6 steps of (Decide):
(Run display omitted: after the six (Decide) steps, both runs reach a state with
indices (7, 1) whose negative sequence is F^5, F^4, F^3, F^2, F^1, F^0.)
The elements F^i of the negative sequence are illustrated in Fig. 5 for both
heuristics. In both cases, F^5 = ∅ and thus AdjointPDR↓ returns false.
To appreciate the advantages provided by (6), it is enough to compare the
two columns for the F^i in Fig. 5: in the central column, the number of inequalities
defining F^i grows significantly, while in the rightmost column it is always 1.
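The single-inequality representation of the rightmost column is easy to reproduce numerically. Assuming, as is the case for a fixed memoryless scheduler, that b_ζ is the affine map d ↦ A·d + c, the pre-image of one linear inequality under b_ζ is again one linear inequality; the sketch below (our own illustration) recomputes F^1, F^2, F^3 of Fig. 5.

    # Pre-image of {d | r·d <= lam} under d -> A·d + c is
    # {d | (r A)·d <= lam - r·c}: one inequality, never more.
    import numpy as np

    def preimage(r, lam, A, c):
        return r @ A, lam - r @ c

    # b_zeta from Example 21: rows (v1+v2)/2, (v0+v3)/2, v0, constant 1.
    A = np.array([[0.0, 0.5, 0.5, 0.0],
                  [0.5, 0.0, 0.0, 0.5],
                  [1.0, 0.0, 0.0, 0.0],
                  [0.0, 0.0, 0.0, 0.0]])
    c = np.array([0.0, 0.0, 0.0, 1.0])

    r, lam = np.array([1.0, 0.0, 0.0, 0.0]), 0.25   # p↓ = {d | d(s0) ≤ 1/4}
    for i in range(1, 4):                            # F¹, F², F³
        r, lam = preimage(r, lam, A, c)
        print(f"F{i}: {r} · d <= {lam}")

The three printed inequalities, [0 1/2 1/2 0]·d ≤ 1/4, [3/4 0 0 1/4]·d ≤ 1/4 and [0 3/8 3/8 0]·d ≤ 0, are exactly the rightmost column of Fig. 5, whereas iterating b↓r itself multiplies the number of inequalities at each step.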
We exploit this property to resolve the choice for (Conflict). We consider its
subset Z_k = {d ∈ G_k | b(x_{k−1}) ≤ d} and define z_B, z_01 ∈ [0, 1]^S, for all s ∈ S, as

    z_B(s) = (⋀ Z_k)(s)   if r_s = 0 and Z_k ≠ ∅;   z_B(s) = b(x_{k−1})(s)   otherwise
    z_01(s) = ⌈z_B(s)⌉    if r_s = 0 and Z_k ≠ ∅;   z_01(s) = z_B(s)         otherwise        (7)

where, for u ∈ [0, 1], ⌈u⌉ denotes 0 if u = 0 and 1 otherwise. We call hCoB and
hCo01 the heuristics defined as in (6) for (Candidate) and (Decide) and as z_B,
respectively z_01, for (Conflict). The heuristic hCo01 can be seen as a Boolean
modification of hCoB, rounding up positive values to 1 to accelerate convergence.
Consider now the following MDP (figure omitted: states s0, . . . , s3 with actions a
and b) and the max reachability problem with threshold λ = 2/5 and β = {s3}. The
lower set p↓ ∈ ([0, 1]^S)↓ and b : [0, 1]^S → [0, 1]^S can be written as

    p↓ = {(v0, v1, v2, v3)ᵀ | [1 0 0 0] · (v0, v1, v2, v3)ᵀ ≤ 2/5}
    b((v0, v1, v2, v3)ᵀ) = (max(v0, (v1 + v2)/2), v2, (v0 + 2·v3)/3, 1)ᵀ
With the simple initial heuristic, AdjointPDR↓ does not terminate. With the
heuristic hCo01, it returns true in 14 steps, while with hCoB in 8. The first 4
steps, common to both hCoB and hCo01, are illustrated below.
(Run display omitted: it alternates (Candidate), (Conflict) and (Unfold) steps,
and shows the values b(x_{k−1}) together with the sets Z2 and Z3 used by the
heuristics in the two (Conflict) applications.)
Observe that in the first (Conflict) z_B = z_01, while in the second z_01(s1) = 1
and z_B(s1) = 4/5, leading to two different successor states.
We compare AdjointPDR↓ (with the heuristics hCoB, hCo01 and hCoS) against
LT-PDR [19], PrIC3 (with its four heuristics none, lin., pol., hyb., see [3]), and
Storm 1.5 [11]. Storm is a recent comprehensive toolsuite
that implements different algorithms and solvers. Among them, our comparison
is against sparse-numeric, sparse-rational, and sparse-sound. The sparse engine
uses explicit state space representation by sparse matrices; this is unlike another
representative dd engine that uses symbolic BDDs. (We did not use dd since it
often reported errors, and was overall slower than sparse.) Sparse-numeric is a
value-iteration (VI) algorithm; sparse-rational solves linear (in)equations using
rational arithmetic; sparse-sound is a sound VI algorithm [26].2
2 There are two other sound algorithms in Storm: one utilizes interval iteration
[2], the other optimistic VI [16]. We have excluded them from the results since
we observed that they returned incorrect answers.
Table 2. Experimental results on MDP benchmarks. The legend is the same as Table 1,
except that P is now the maximum reachability probability.
(Rows omitted: results for the benchmarks CDrive2 (38 states, P = 0.865) and
TireWorld (8670 states, P = 0.233) at thresholds ranging over 0.9, 0.75, 0.5 and
0.2; MO and TO denote memory-out and time-out.)
returned correct answers but was much slower than sparse-numeric. For these
two instances, AdjointPDR↓ outperformed sparse-sound.
It seems that a big part of Storm’s good performance is attributed to the
sparsity of state representation. This is notable in the comparison of the two
instances of Haddad-Monmege (41 vs. 103): while Storm handles both of them
easily, AdjointPDR↓ struggles a bit in the bigger instance. Our implementation
can be extended to use sparse representation, too; this is future work.
RQ3. We derived the three heuristics (hCoB, hCo01, hCoS) exploiting the theory
of AdjointPDR↓ . The experiments show that each heuristic has its own strength.
For example, hCo01 is slower than hCoB for MCs, but it is much better for MDPs.
In general, there is no silver bullet heuristic, so coming up with a variety of them
is important. The experiments suggest that our theory of AdjointPDR↓ provides
great help in doing so.
RQ4. Table 2 shows that AdjointPDR↓ can handle nondeterminism well: once a
suitable heuristic is chosen, its performance on MDPs and on MCs of similar
size is comparable. It is also interesting that the better-performing heuristics vary,
as we discussed above.
Summary. AdjointPDR↓ clearly outperforms existing probabilistic PDR
algorithms in many benchmarks. It also compares well with Storm, a highly
sophisticated toolsuite, in a couple of benchmarks. This is notable especially given
that AdjointPDR↓ currently lacks enhancing features such as richer symbolic
templates and sparse representation (adding these is future work). Overall, we
believe that AdjointPDR↓ confirms the potential of PDR algorithms in proba-
bilistic model checking. Through the three heuristics, we also observed the value
of an abstract general theory in devising heuristics in PDR, which is probably
true of verification algorithms in general besides PDR.
References
1. Baier, C., Katoen, J.: Principles of Model Checking. MIT Press (2008)
2. Baier, C., Klein, J., Leuschner, L., Parker, D., Wunderlich, S.: Ensuring the reli-
ability of your model checker: interval iteration for Markov decision processes.
In: Majumdar, R., Kunčak, V. (eds.) CAV 2017. LNCS, vol. 10426, pp. 160–180.
Springer, Cham (2017). https://doi.org/10.1007/978-3-319-63387-9_8
3. Batz, K., et al.: PrIC3: property directed reachability for MDPs. In: Lahiri, S.K.,
Wang, C. (eds.) CAV 2020. LNCS, vol. 12225, pp. 512–538. Springer, Cham (2020).
https://doi.org/10.1007/978-3-030-53291-8_27
4. Bonchi, F., Ganty, P., Giacobazzi, R., Pavlovic, D.: Sound up-to techniques and
complete abstract domains. In: Dawar, A., Grädel, E. (eds.) Proceedings of LICS
2018, pp. 175–184. ACM (2018). https://doi.org/10.1145/3209108.3209169
5. Bradley, A.R.: SAT-based model checking without unrolling. In: Jhala, R.,
Schmidt, D. (eds.) VMCAI 2011. LNCS, vol. 6538, pp. 70–87. Springer, Heidel-
berg (2011). https://doi.org/10.1007/978-3-642-18275-4_7
6. Cimatti, A., Griggio, A.: Software model checking via IC3. In: Madhusudan, P.,
Seshia, S.A. (eds.) CAV 2012. LNCS, vol. 7358, pp. 277–293. Springer, Heidelberg
(2012). https://doi.org/10.1007/978-3-642-31424-7_23
7. Cousot, P.: Partial completeness of abstract fixpoint checking. In: Choueiry, B.Y.,
Walsh, T. (eds.) SARA 2000. LNCS (LNAI), vol. 1864, pp. 1–25. Springer, Heidel-
berg (2000). https://doi.org/10.1007/3-540-44914-0_1
8. Cousot, P.: Principles of Abstract Interpretation. MIT Press (2021)
9. Cousot, P., Cousot, R.: Abstract interpretation: a unified lattice model for static
analysis of programs by construction or approximation of fixpoints. In: Proceedings
of POPL 1977, pp. 238–252. ACM (1977). https://doi.org/10.1145/512950.512973
10. Davey, B.A., Priestley, H.A.: Introduction to Lattices and Order, 2nd Edn. Cam-
bridge University Press (2002)
11. Dehnert, C., Junges, S., Katoen, J.-P., Volk, M.: A storm is coming: a modern prob-
abilistic model checker. In: Majumdar, R., Kunčak, V. (eds.) CAV 2017. LNCS,
vol. 10427, pp. 592–600. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-63390-9_31
12. Eén, N., Mishchenko, A., Brayton, R.K.: Efficient implementation of property
directed reachability. In: Bjesse, P., Slobodová, A. (eds.) Proceedings of FMCAD 2011,
pp. 125–134. FMCAD Inc. (2011). http://dl.acm.org/citation.cfm?id=2157675
13. Feldman, Y.M.Y., Sagiv, M., Shoham, S., Wilcox, J.R.: Property-directed reach-
ability as abstract interpretation in the monotone theory. Proc. ACM Program.
Lang. 6(POPL), 1–31 (2022). https://doi.org/10.1145/3498676
14. Giacobazzi, R., Ranzato, F., Scozzari, F.: Making abstract interpretations com-
plete. J. ACM 47(2), 361–416 (2000). https://doi.org/10.1145/333979.333989
15. Gurfinkel, A.: IC3, PDR, and friends (2015). https://arieg.bitbucket.io/pdf/gurfinkel_ssft15.pdf
16. Hartmanns, A., Kaminski, B.L.: Optimistic value iteration. In: Lahiri, S.K., Wang,
C. (eds.) CAV 2020. LNCS, vol. 12225, pp. 488–511. Springer, Cham (2020).
https://doi.org/10.1007/978-3-030-53291-8_26
17. Hartmanns, A., Klauck, M., Parker, D., Quatmann, T., Ruijters, E.: The quanti-
tative verification benchmark set. In: Vojnar, T., Zhang, L. (eds.) TACAS 2019.
LNCS, vol. 11427, pp. 344–350. Springer, Cham (2019). https://doi.org/10.1007/
978-3-030-17462-0_20
18. Hoder, K., Bjørner, N.: Generalized property directed reachability. In: Cimatti, A.,
Sebastiani, R. (eds.) SAT 2012. LNCS, vol. 7317, pp. 157–171. Springer, Heidelberg
(2012). https://doi.org/10.1007/978-3-642-31612-8_13
19. Kori, M., Urabe, N., Katsumata, S., Suenaga, K., Hasuo, I.: The lattice-theoretic
essence of property directed reachability analysis. In: Shoham, S., Vizel, Y. (eds.)
Proceedings of CAV 2022, Part I. Lecture Notes in Computer Science, vol.
13371, pp. 235–256. Springer, Heidelberg (2022). https://doi.org/10.1007/978-3-031-13185-1_12
20. Kwiatkowska, M., Norman, G., Parker, D.: PRISM 4.0: verification of probabilistic
real-time systems. In: Gopalakrishnan, G., Qadeer, S. (eds.) CAV 2011. LNCS,
vol. 6806, pp. 585–591. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-22110-1_47
21. Lange, T., Neuhäußer, M.R., Noll, T., Katoen, J.-P.: IC3 software model checking.
Int. J. Softw. Tools Technol. Trans. 22(2), 135–161 (2019). https://doi.org/10.
1007/s10009-019-00547-x
22. Levy, P.B.: Call-By-Push-Value: A Functional/Imperative Synthesis, Semantics
Structures in Computation, vol. 2. Springer, Dordrecht (2004). https://doi.org/
10.1007/978-94-007-0954-6
23. MacLane, S.: Categories for the Working Mathematician. Graduate Texts in Math-
ematics, vol. 5. Springer-Verlag, New York (1971)
24. Milner, R.: Communication and Concurrency. Prentice-Hall Inc, USA (1989)
25. de Moura, L., Bjørner, N.: Z3: An efficient SMT solver. In: Ramakrishnan, C.R.,
Rehof, J. (eds.) TACAS 2008. LNCS, vol. 4963, pp. 337–340. Springer, Heidelberg
(2008). https://doi.org/10.1007/978-3-540-78800-3_24
26. Quatmann, T., Katoen, J.-P.: Sound value iteration. In: Chockler, H., Weis-
senbacher, G. (eds.) CAV 2018. LNCS, vol. 10981, pp. 643–661. Springer, Cham
(2018). https://doi.org/10.1007/978-3-319-96145-3_37
27. Seufert, T., Scholl, C.: Sequential verification using reverse PDR. In: Große, D.,
Drechsler, R. (eds.) Proceedings of MBMV 2017, pp. 79–90. Shaker Verlag (2017)
28. Seufert, T., Scholl, C.: Combining PDR and reverse PDR for hardware model
checking. In: Madsen, J., Coskun, A.K. (eds.) Proceedings of DATE 2018, pp.
49–54. IEEE (2018). https://doi.org/10.23919/DATE.2018.8341978
29. Seufert, T., Scholl, C.: fbPDR: In-depth combination of forward and backward
analysis in property directed reachability. In: Teich, J., Fummi, F. (eds.) Proceed-
ings of DATE 2019, pp. 456–461. IEEE (2019). https://doi.org/10.23919/DATE.
2019.8714819
30. Suda, M.: Property directed reachability for automated planning. In: Chien, S.A.,
Do, M.B., Fern, A., Ruml, W. (eds.) Proceedings of ICAPS 2014. AAAI (2014).
https://doi.org/10.1613/jair.4231
Fast Approximations of Quantifier
Elimination
1 Introduction
Quantifier Elimination (qelim) is used in many automated reasoning tasks
including program synthesis [18], exist-forall solving [8,9], quantified SMT [5],
and Model Checking [17]. Complete qelim, even when possible, is computation-
ally expensive, and solvers often approximate it. We call these approximations
quantifier reductions, to separate them from qelim. The difference is that quan-
tifier reduction might leave some free variables in the formula.
For example, Z3 [19] performs quantifier reduction, called QeLite, by greed-
ily substituting variables by definitions syntactically appearing in the formulas.
While it is very useful, it is necessarily sensitive to the order in which variables
are substituted and depends on definitions appearing explicitly in the formula.
Even though it may seem that these shortcomings need to be tolerated to keep
QeLite fast, in this paper we show that it is not actually the case; we propose
an egraph-based algorithm, QEL, to perform fast quantifier reduction that is
complete relative to some semantic properties of the formula.
2 Background
We assume the reader is familiar with multi-sorted first-order logic (FOL) with
equality and the theory of equality with uninterpreted functions (EUF) (for an
introduction see, e.g. [4]). We use ≈ to denote the designated logical equality
symbol. For simplicity of presentation, we assume that the FOL signature Σ
contains only functions (i.e., no predicates) and constants (i.e., 0-ary functions).
To represent predicates, we assume the FOL signature has a designated sort
Bool, and two Bool constants ⊤ and ⊥, representing true and false, respectively.
We then use Bool-valued functions to represent predicates, using P(a) ≈ ⊤ and
P(a) ≈ ⊥ to mean that P(a) is true or false, respectively. Informally, we continue
to write P(a) and ¬P(a) as syntactic sugar for P(a) ≈ ⊤ and P(a) ≈ ⊥, respectively.
We use lowercase letters like a, b for constants, and f, g for functions,
denote by ψ ∃ the existential closure of ψ.
Model Based Projection (MBP). Let ϕ be a formula with free variables v, and
M a model of ϕ. A model-based projection of ϕ relative to M is a QF formula
ψ such that ψ ⇒ ϕ∃ and M |= ψ. That is, ψ has no free variables, is an under-
approximation of ϕ, and satisfies the designated model M , just like ϕ. MBP is
used by many algorithms to under-approximate qelim, when the computation of
qelim is too expensive or, for some reason, undesirable.
Given an egraph G, the class of a node n ∈ G, written class(n), is the set
of all nodes that are equivalent to n. The term of n, term(n), with L(n) = f is
f if deg(n) = 0 and f (term(n[1]), . . . , term(n[deg(n)])), otherwise. We assume
that the terms of different nodes are different, and refer to a node n by its term.
An example of an egraph G = N , E , L, root is shown in Fig. 1. A symbol f
inside a circle depicts a node n with label L(n) = f , solid black and dashed red
arrows depict E and root, respectively. The order of the black arrows from left
to right defines the order of the children. In our examples, we refer to a specific
node i by its number using N(i) or its term, e.g., N(k + 1). A node n without an
outgoing red arrow is its own root. A set of nodes connected to the same node
with red edges forms an equivalence class. In this example, root defines the
equivalence classes {N(3), N(4), N(5), N(6)}, {N(8), N(9)}, and a class for each
of the remaining nodes. Examples of some terms in G are term(N(9)) = y and
term(N(5)) = read (a, y).
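To make the components N, E, L, root concrete, here is a minimal egraph sketch. It is our own illustration (hash-consed nodes, union-find roots, congruence restored by rebuilding to a fixpoint); the implementation inside Z3 is, of course, far more sophisticated.

    class EGraph:
        def __init__(self):
            self.nodes = {}          # canonical (label, child roots) -> node id
            self.parent = []         # union-find forest (plays the role of root)
            self.sig = []            # (label, children) as created, per node

        def find(self, n):           # root of n, with path compression
            while self.parent[n] != n:
                self.parent[n] = self.parent[self.parent[n]]
                n = self.parent[n]
            return n

        def add(self, label, children=()):
            key = (label, tuple(self.find(c) for c in children))
            if key in self.nodes:    # hash-consing: reuse a congruent node
                return self.find(self.nodes[key])
            n = len(self.parent)
            self.nodes[key] = n
            self.parent.append(n)
            self.sig.append((label, tuple(children)))
            return n

        def merge(self, a, b):       # assert a ≈ b, then restore congruence
            a, b = self.find(a), self.find(b)
            if a != b:
                self.parent[a] = b
                self._rebuild()

        def _rebuild(self):          # merge nodes that became congruent
            changed = True
            while changed:
                changed = False
                self.nodes = {}
                for n in range(len(self.parent)):
                    label, kids = self.sig[n]
                    key = (label, tuple(self.find(k) for k in kids))
                    if key in self.nodes:
                        a, b = self.find(self.nodes[key]), self.find(n)
                        if a != b:
                            self.parent[a] = b
                            changed = True
                    else:
                        self.nodes[key] = n

    g = EGraph()
    x, y = g.add('x'), g.add('y')
    fx, fy = g.add('f', (x,)), g.add('f', (y,))
    g.merge(x, y)                    # x ≈ y forces f(x) ≈ f(y) by congruence
    assert g.find(fx) == g.find(fy)

The final assertion shows congruence at work: merging the classes of x and y automatically places f(x) and f(y) in the same class.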
Explicit and Implicit Equality. Note that egraphs represent equality implicitly
by placing nodes with equal terms in the same equivalence class. Sometimes, it
is necessary to represent equality explicitly, for example, when using egraphs for
(Fig. 2, three egraphs over the terms c, d, f(x), f(y) and the predicate eq:
(a) Ga, interpreting eq as ≈; (b) Gb, not interpreting eq; (c) Gc, combining (a) and (b).)
2 The set S affects the result, but for this section, we restrict to the case of S = ∅.
(Fig. 4: (a) repr4a = {N(4), N(5)}; (b) repr4b = {N(4), N(1)}; (c) repr4c = {N(3), N(1)}.)
letting the representative be a node with minimal cost. However, observe that
not all costs guarantee that the chosen repr can be used (the computation does
not terminate). For example, the ill-defined repr4c from above is a representative
function that satisfies the cost function that assigns function applications cost 0
and variables and constants cost 1. A commonly used cost function is term AST
size, which is sufficient to ensure termination of ntt(n, repr).
We are thus interested in characterizing representative functions motivated
by two observations: not every cost function guarantees that ntt(n) terminates;
and the kind of representative choices that are most suitable for qelim (repr4b )
cannot be expressed over term AST size.
Dotted blue edges in the graphs of Fig. 4 show the corresponding Grepr .
Intuitively, for each node n, all reachable nodes in Grepr are the nodes whose
ntt term is necessary to produce the ntt(n). Observe that Grepr4c has a cycle,
thus, repr4c is not admissible.
4 Quantifier Reduction
Quantifier reduction is a relaxation of quantifier elimination: given two formulas
ϕ and ψ with free variables v and u, respectively, ψ is a quantifier reduction of
ϕ if u ⊆ v and ϕ∃ ≡ ψ ∃ . If u is empty, then ψ is a quantifier elimination of ϕ∃ .
Note that quantifier reduction is possible even when quantifier elimination is not
(e.g., for EUF). We are interested in an efficient quantifier reduction algorithm
(that can be used as pre-processing for qelim), even if a complete qelim is possible
(e.g., for LIA). In this section, we present such an algorithm called QEL.
Intuitively, QEL is based on the well-known substitution rule: (∃x·x ≈ t∧ϕ) ≡
ϕ[x → t]. A naive implementation of this rule, called QeLite in Z3, looks for syn-
tactic definitions of the form x ≈ t for a variable x and an x-free term t and sub-
stitutes x with t. While efficient, QeLite is limited because of: (a) dependence
on syntactic equality in the formula (specifically, it misses implicit equalities due
to transitivity and congruence); (b) sensitivity to the order in which variables are
eliminated (eliminating one variable may affect available syntactic equalities for
another); and (c) difficulty in dealing with circular equalities such as x ≈ f (x).
For example, consider the formula ϕ4 (x, y) in Fig. 4. Assume that y is elimi-
nated first using y ≈ f (x), resulting in x ≈ g(f (x)) ∧ f (x) ≈ 6. Now, x cannot be
eliminated since the only equality for x is circular. Alternatively, assume that
QeLite somehow noticed that, by transitivity, ϕ4 implies y ≈ 6, and obtains
(∃y · ϕ4) ≡ x ≈ g(6) ∧ f(x) ≈ 6. This time, x ≈ g(6) can be used to obtain
f(g(6)) ≈ 6, which is a qelim of ϕ4∃. Thus, both the elimination order and implicit
equalities are crucial.
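The equivalence claimed here can be confirmed with Z3's Python API (this checks the claim itself, it does not run QEL):

    # Verify  ∃y.ϕ4  ≡  x ≈ g(6) ∧ f(x) ≈ 6  over uninterpreted f, g.
    from z3 import Ints, Function, IntSort, And, Exists, prove

    x, y = Ints('x y')
    f = Function('f', IntSort(), IntSort())
    g = Function('g', IntSort(), IntSort())
    phi4 = And(x == g(y), y == f(x), f(x) == 6)

    # Z3 typically proves this by instantiating y with f(x).
    prove(Exists([y], phi4) == And(x == g(6), f(x) == 6))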
In QEL, we address the above issues by using an egraph data structure to
concisely capture all implicit equalities and terms. Furthermore, egraphs allow
eliminating multiple variables together, ensuring that a variable is eliminated if
it is equivalent (explicitly or implicitly) to a ground term in the egraph.
Pseudocode for QEL is shown in Algorithm 1. Given an input formula ϕ, QEL
first builds its egraph G (line 1). Then, it finds a representative function repr
that maps variables to equivalent ground terms, as much as possible (line 2).
Next, it further reduces the remaining free variables by refining repr to map
each variable x to an equivalent x-free (but not variable-free) term (line 3).
At this point, QEL is committed to the variables to eliminate. To produce the
output, find_core identifies the subset of the nodes of G, which we call core,
that must be considered in the output (line 4). Finally, to_formula converts
the core of G to the resulting formula (line 5). We show that the combination of
these steps is even stronger than variable substitution.
(Fig. 5: (a) repr5a = {N(1), N(4), N(5)}; (b) repr5b = {N(3), N(6), N(5)};
(c) repr5c = {N(1), N(6), N(5)}.)
To illustrate QEL, we apply it on ϕ1 and its egraph G from Fig. 1. The func-
tion find_defs returns repr = {N(6), N(8)}3 . Node N(6) is the only node with
a ground term in the equivalence class class(N(3)). This corresponds to the defi-
nition z ≈ k + 1. Node N(8) is chosen arbitrarily since class(N(8)) has no ground
terms. There is no refinement possible, so refine_defs returns repr. The core
is N \ {N(3), N(5), N(9)}. Nodes N(3) and N(9) are omitted because they corre-
spond to variables with definitions (under repr), and N(5) is omitted because
it is congruent to N(4) so only one of them is needed. Finally, to_formula
produces k + 1 ≈ read (a, x) ∧ 3 > k + 1. Variables z and y are eliminated.
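Replaying the ϕ4 example on the minimal EGraph sketch from Sect. 2 shows how the implicit ground definitions that QEL exploits arise from congruence closure. Again this is our own illustration, not the QEL code:

    # ϕ4 = (x ≈ g(y) ∧ y ≈ f(x) ∧ f(x) ≈ 6): after congruence closure,
    # y lands in the class of the ground term 6 and x in the class of g(6),
    # so find_defs can map both variables to ground definitions and QEL
    # eliminates them, yielding f(g(6)) ≈ 6 as in the discussion above.
    e = EGraph()
    x, y, six = e.add('x'), e.add('y'), e.add('6')
    fx, gy = e.add('f', (x,)), e.add('g', (y,))
    e.merge(x, gy); e.merge(y, fx); e.merge(fx, six)
    g6 = e.add('g', (six,))          # congruent to g(y) since y ≈ 6
    assert e.find(y) == e.find(six) and e.find(x) == e.find(g6)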
In the rest of this section we present QEL in detail and QEL’s key properties.
3 Recall that we only show representatives of non-singleton classes.
syntactic terms in the output. For example, for ϕ1 and repr1 , find_core returns
core1 = N1 \{N(3), N(5), N(9)}. Nodes N(3) and N(9) are excluded because they
are labeled with variables; and node N(5) because it is congruent with N(4).
Finally, QEL produces a quantifier reduction by applying to_formula with
the computed repr and core. Variables that are not in the core (they are not
representatives) are eliminated – this includes variables that have a ground defi-
nition. However, QEL may eliminate a variable even if it is a representative (and
thus it is in the core). As an example, consider ψ(x, y) ≜ f(x) ≈ f(y) ∧ x ≈ y,
whose egraph G contains 2 classes with 2 nodes each. The core Nc relative to
any admissible repr contains only one representative per class: one in class(N(x))
because both nodes are labeled with variables, and one in class(N(f(x))) because
the nodes are congruent. In this case, to_formula(repr, Nc) results in ⊤ (since
singleton classes in the core produce no literals in the output formula), a quantifier
elimination of ψ. More generally, the variables are eliminated because none of
them is reachable in Grepr from a non-singleton class in the core (only such
classes contribute literals to the output).
We conclude the presentation of QEL by showing its output for our examples.
For ϕ1, QEL obtains (k + 1 ≈ read(a, x) ∧ 3 > k + 1), a quantifier reduction,
using repr1 = {N(6), N(8)} and core1 = N1 \ {N(3), N(5), N(9)}. For ϕ4, QEL
obtains (6 ≈ f(g(6))), a quantifier elimination, using repr4b = {N(4), N(1)}
and core4b = N4 \ {N(3), N(2)}. Finally, for ϕ5, QEL obtains (y ≈ h(f(y)) ∧
f(g(f(y))) ≈ f(y)), a quantifier reduction, using repr5c = {N(1), N(6), N(5)}
and core5c = N5 \ {N(3)}.
Fig. 6. Two MBP rules from [16]. The notation ϕ[t] means that ϕ contains the
term t. The rules rewrite all occurrences of read(write(t, i, v), j) with v and
read(t, j), respectively.

    ElimWrRd1:   ϕ[read(write(t, i, v), j)]   M |= i ≈ j   ⟹   ϕ[v] ∧ i ≈ j
    ElimWrRd2:   ϕ[read(write(t, i, v), j)]   M |= i ≉ j   ⟹   ϕ[read(t, j)] ∧ i ≉ j

Fig. 7. Adaptation of the rules in Fig. 6 using the QEL API.

    ElimWrRd
    1: function match(t)
    2:    ret t = read(write(s, i, v), j)
    3: function apply(t, M, G)
    4:    if M |= i ≈ j then
    5:       G.assert(i ≈ j)
    6:       G.assert(t ≈ v)
    7:    else
    8:       G.assert(i ≉ j)
    9:       G.assert(t ≈ read(s, j))
To implement MBP using egraphs, we implement all rewrite rules for MBP in
Arrays [16] and ADTs [5] on top of egraphs. In the interest of space, we explain
the implementation of just a couple of the MBP rules for Arrays4 .
Figure 6 shows two Array MBP rules from [16]: ElimWrRd1 and
ElimWrRd2. Here, ϕ is a formula with arrays and M is a model for ϕ. Both
rules rewrite terms which match the pattern read(write(t, i, v), j), where t, i, v, j
are terms and t contains a variable to be projected. ElimWrRd1 is applicable
when M |= i ≈ j. It rewrites the term read(write(t, i, v), j) to v. ElimWrRd2
is applicable when M |= i ≉ j and rewrites read(write(t, i, v), j) to read(t, j).
Figure 7 shows the egraph implementation of ElimWrRd1 and ElimWrRd2.
The match(t) method checks if t syntactically matches read(write(s, i, v), j), where
s contains a variable to be projected. The apply(t) method assumes that t is
read(write(s, i, v), j). It first checks whether M |= i ≈ j and, if so, adds i ≈ j and
t ≈ v to the egraph G. Otherwise, if M |= i ≉ j, apply(t) adds the disequality i ≉ j
and the equality t ≈ read(s, j) to G. That is, the egraph implementation of the rules
only adds (and does not remove) literals that capture the side condition and the
conclusion of the rule.
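In Python, such a rule could be sketched as follows. The interface (G.assert_eq, G.assert_neq, G.mk_app, M.eval_eq, and the term accessors is_app/arg) is entirely hypothetical, standing in for the egraph API of Fig. 7:

    # Egraph-style MBP rule: apply() only asserts the side condition and
    # the conclusion as new (dis)equalities; the matched term is kept.
    class ElimWrRd:
        def match(self, t):
            # t is read(write(s, i, v), j), s contains a projected variable
            return t.is_app('read') and t.arg(0).is_app('write')

        def apply(self, t, M, G):
            w, j = t.arg(0), t.arg(1)
            s, i, v = w.arg(0), w.arg(1), w.arg(2)
            if M.eval_eq(i, j):           # ElimWrRd1: indices equal in M
                G.assert_eq(i, j)
                G.assert_eq(t, v)
            else:                         # ElimWrRd2: indices differ in M
                G.assert_neq(i, j)
                G.assert_eq(t, G.mk_app('read', s, j))

Because nothing is removed, repeated application is idempotent on a given term, which is what the seen-term bookkeeping in the algorithm below relies on.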
Our algorithm for MBP based on egraphs, MBP-QEL, is shown in Alg. 4.
It initializes an egraph with the input formula (line 1), applies MBP rules until
saturation (line 4), and then uses the steps of QEL (lines 7–12) to generate the
projected formula.
Applying rules is as straightforward as iterating over all terms t in the egraph,
and for each rule r such that r.match(t) is true, calling r.apply(t, M, G) (lines 14–
22). As opposed to the standard approach based on formula rewriting, here the
terms are not rewritten – both remain. Therefore, it is possible to get into an
infinite loop by re-applying the same rules on the same terms over and over again.
To avoid this, MBP-QEL marks terms as seen (line 23) and avoids them in the
next iteration (line 15). Some rules in MBP are applied to pairs of terms. For
example, Ackermann rewrites pairs of read terms over the same variable. This
is different from usual applications where rewrite rules are applied to individual
expressions. Yet, it is easy to adapt such pairwise rewrite rules to egraphs by
iterating over pairs of terms (lines 25–30).
MBP-QEL does not apply MBP rules to terms that contain variables but
are already c-ground (line 16), which is sound because such terms are replaced by
ground terms in the output (Theorem 3). This prevents unnecessary application
of MBP rules thus allowing MBP-QEL to compute MBPs that are closer to a
quantifier elimination (less model-specific).
Just like each application of a rewrite rule introduces a new term to a formula,
each call to the apply method of a rule adds new terms to the egraph. Therefore,
each call to ApplyRules (line 4) makes the egraph bigger. However, provided
that the original MBP combination is terminating, the iterative application of
ApplyRules terminates as well (due to marking).
Some MBP rules introduce new variables to the formula. MBP-QEL com-
putes repr based on both original and newly introduced variables (line 7). This
4 Implementation of all other rules is similar.
Input: A QF formula ϕ with free variables v all of sort Array(I, V) or ADT, a model
M |= ϕ∃, and sets of rules ArrayRules and ADTRules
Output: A cube ψ s.t. ψ∃ ⇒ ϕ∃, M |= ψ∃, and vars(ψ) are not Arrays or ADTs

MBP-QEL(ϕ, v, M)
 1: G := egraph(ϕ)
 2: p1, p2 := ⊤, ⊤; S, Sp := ∅, ∅
 3: while p1 ∨ p2 do
 4:    p1 := ApplyRules(G, M, ArrayRules, S, Sp)
 5:    p2 := ApplyRules(G, M, ADTRules, S, Sp)
 6: v := G.Vars()
 7: repr := G.find_defs(v)
 8: repr := G.refine_defs(repr, v)
 9: core := G.find_core(repr, v)
10: v_e := {v ∈ v | is_arr(v) ∨ is_adt(v)}
11: core_e := {n ∈ core | gr(term(n), v_e)}
12: ret G.to_formula(repr, G.Nodes() \ core_e)

ApplyRules(G, M, R, S, Sp)
13: progress := ⊥
14: N := G.Nodes()
15: U := {n | n ∈ N \ S}
16: T := {term(n) | n ∈ U ∧ (is_eq(term(n)) ∨ ¬c-ground(n))}
17: Rp := {r ∈ R | r.is_for_pairs()}
18: Ru := R \ Rp
19: for each t ∈ T, r ∈ Ru do
20:    if r.match(t) then
21:       r.apply(t, M, G)
22:       progress := ⊤
23: S := S ∪ N
24: Np := {⟨n1, n2⟩ | n1, n2 ∈ N}
25: Tp := {term(np) | np ∈ Np \ Sp}
26: for each tp ∈ Tp, r ∈ Rp do
27:    if r.match(tp) then
28:       r.apply(tp, M, G)
29:       progress := ⊤
30: Sp := Sp ∪ Np
31: ret progress
where p and a are free variables that we want to project and all of i, j, l, p1, p2, p̂
are constants that we want to keep. MBP is guided by a model Mmbp |= ϕmbp .
To eliminate p and a, MBP-QEL constructs the egraph of ϕmbp and applies the
MBP rules. In particular, it uses Array MBP rules to rewrite the write(p1 , j, p)
term by adding the equality read (p2 , j) ≈ p and merging it with the equivalence
class of p2 ≈ write(p1 , j, p). It then applies ADT MBP rules to deconstruct the
equality p ≈ pair (a, l) by creating two equalities fst(p) ≈ a and snd (p) ≈ l. Finally,
the call to to_formula produces
read(fst(read(p1, j)), i) ≈ i ∧ snd(read(p1, j)) ≈ l ∧
read(p2, j) ≈ pair(fst(read(p1, j)), l) ∧
p2 ≈ write(p1, j, read(p2, j)) ∧ read(p2, j) ≉ p̂
The output is easy to understand by tracing it back to the input. For example,
the first literal is a rewrite of the literal read (a, i) ≈ i where a is represented
with fst(p) and p is represented with read (p1 , j). While the interaction of these
rules might seem straightforward in this example, the MBP implementation in
Z3 fails to project a in this example because of the multilevel nesting.
Notably, in this example, the c-ground computation during projection allows
MBP-QEL to avoid splitting on the disequality p ≉ p̂ based on the model. While
ADT MBP rules eliminate disequalities by using the model to split them, MBP-
QEL benefits from the fact that, after the application of Array MBP rules, the
class of p becomes ground, making p ≉ p̂ c-ground. Thus, the c-ground compu-
tation allows MBP-QEL to produce a formula that is less approximate than
those produced by syntactic application of MBP rules. In fact, in this example,
a quantifier elimination is obtained (the model Mmbp was not used).
In the next section, we show that our improvements to MBP translate to
significant improvements in a CHC-solving procedure that relies on MBP.
6 Evaluation
We implement QEL (Alg. 1) and MBP-QEL (Alg. 4) inside Z3 [19] (version
4.12.0), a state-of-the-art SMT solver. Our implementation (referred to as Z3eg)
is publicly available on GitHub⁵. Z3eg replaces QeLite with QEL, and the
existing MBP with MBP-QEL.
We evaluate Z3eg using two solving tasks. Our first evaluation is on the
QSAT algorithm [5] for checking satisfiability of formulas with alternating quan-
tifiers. In QSAT, Z3 uses both QeLite and MBP to under-approximate quan-
tified formulas. We compare three QSAT implementations: the existing version
in Z3 with the default QeLite and MBP; the existing version in Z3 in which
QeLite and MBP are replaced by our egraph-based algorithms, Z3eg; and the
QSAT implementation in YicesQS⁶, based on the Yices [8] SMT solver. During
the evaluation, we found a bug in the QSAT implementation of Z3 and fixed it⁷.
⁵ Available at https://github.com/igcontreras/z3/tree/qel-cav23.
⁶ Available at https://github.com/disteph/yicesQS.
⁷ Available at https://github.com/igcontreras/z3/commit/133c9e438ce.
Table 1. Instances solved within 20 min by different implementations. Benchmarks are quantified LIA and LRA formulas from SMT-LIB [2].

Cat.   Count   Z3eg (sat / unsat)   Z3 (sat / unsat)   YicesQS (sat / unsat)
LIA    416     150 / 266            150 / 266          107 / 102
LRA    2419    795 / 1589           793 / 1595         808 / 1610

Table 2. Instances solved within 60 s for our handcrafted benchmarks.

Cat.      Count   Z3eg (sat / unsat)   Z3 (sat / unsat)
LIA-ADT   416     150 / 266            150 / 56
LRA-ADT   2419    757 / 1415           196 / 964
The fix resulted in Z3 solving over 40 sat instances and over 120 unsat instances
more than before. In the following, we use the fixed version of Z3.
We use benchmarks in the theory of (quantified) LIA and LRA from SMT-
LIB [2,3], with alternating quantifiers. LIA and LRA are the only tracks in which
Z3 uses the QSAT tactic by default. To make our experiments more comprehen-
sive, we also consider two modified variants of the LIA and LRA benchmarks,
where we add some non-recursive ADT variables to the benchmarks. Specif-
ically, we wrap all existentially quantified arithmetic variables using a record
type ADT and unwrap them whenever they get used8 . Since these benchmarks
are similar to the original, we force Z3 to use the QSAT tactic on them via the
tactic.default_tactic=qsat command-line option.
Table 1 summarizes the results for the SMT-LIB benchmarks. In LIA, both
Z3eg and Z3 solve all benchmarks in under a minute, while YicesQS is unable
to solve many instances. In LRA, YicesQS solves all instances with very good
performance. Z3 is able to solve only some benchmarks, and our Z3eg performs
similarly to Z3. We found that in the LRA benchmarks, the new algorithms in
Z3eg are not being used since there are not many equalities in the formula, and
no equalities are inferred during the run of QSAT. Thus, any differences between
Z3 and Z3eg are due to inherent randomness of the solving process.
Table 2 summarizes the results for the categories of mixed ADT and arith-
metic. YicesQS is not able to compete because it does not support ADTs. As
expected, Z3eg solves many more instances than Z3.
The second part of our evaluation shows the efficacy of MBP-QEL for Arrays
and ADTs (Alg. 4) in the context of CHC-solving. Z3 uses both QeLite and
MBP inside the CHC-solver Spacer [17]. Therefore, we compare Z3 and Z3eg
on CHC problems containing Arrays and ADTs. We use two sets of benchmarks
to test the efficacy of our MBP. The benchmarks in the first set were gener-
ated for verification of Solidity smart contracts [1] (we exclude benchmarks with
non-linear arithmetic, as they are not supported by Spacer). These benchmarks
have a very complex structure that nests ADTs and Arrays. Specifically, they
contain both ADTs of Arrays, as well as Arrays of ADTs. This makes them suit-
able to test our MBP-QEL. Row 1 of Table 3 shows the number of instances
⁸ The modified benchmarks are available at https://github.com/igcontreras/LIA-ADT and https://github.com/igcontreras/LRA-ADT.
Table 3. Instances solved within 20 min by Z3eg, Z3, and Eldarica. Benchmarks are CHCs from Solidity [1] and the CHC competition [13]. The abi benchmarks are a subset of the Solidity benchmarks.

Cat.   Count   Z3eg (sat / unsat)   Z3 (sat / unsat)   Eldarica (sat / unsat)
7 Conclusion
Acknowledgment. The research leading to these results has received funding from
the European Research Council under the European Union’s Horizon 2020 research
and innovation programme (grant agreement No [759102-SVIS]). This research was
partially supported by the Israeli Science Foundation (ISF) grant No. 1810/18. We
acknowledge the support of the Natural Sciences and Engineering Research Council of
Canada (NSERC), MathWorks Inc., and the Microsoft Research PhD Fellowship.
References
1. Alt, L., Blicha, M., Hyvärinen, A.E.J., Sharygina, N.: SolCMC: Solidity Compiler’s
Model Checker. In: Shoham, S., Vizel, Y. (eds.) Computer Aided Verification - 34th
International Conference, CAV 2022, Haifa, Israel, August 7–10, 2022, Proceed-
ings, Part I. Lecture Notes in Computer Science, vol. 13371, pp. 325–338. Springer
(2022). https://doi.org/10.1007/978-3-031-13185-1_16
2. Barrett, C., Fontaine, P., Tinelli, C.: The Satisfiability Modulo Theories Library
(SMT-LIB). www.SMT-LIB.org (2016)
3. Barrett, C., Stump, A., Tinelli, C.: The SMT-LIB Standard: Version 2.0. In: Gupta,
A., Kroening, D. (eds.) Proceedings of the 8th International Workshop on Satisfi-
ability Modulo Theories (Edinburgh, UK) (2010)
4. Barrett, C., Tinelli, C.: Satisfiability modulo theories. In: Handbook of
Model Checking, pp. 305–343. Springer, Cham (2018). https://doi.org/10.1007/
978-3-319-10575-8_11
5. Bjørner, N.S., Janota, M.: Playing with quantified satisfaction. In: Fehnker, A.,
McIver, A., Sutcliffe, G., Voronkov, A. (eds.) 20th International Conference on
Logic for Programming, Artificial Intelligence and Reasoning - Short Presentations,
LPAR 2015, Suva, Fiji, November 24–28, 2015. EPiC Series in Computing, vol. 35,
pp. 15–27. EasyChair (2015). https://doi.org/10.29007/vv21
6. Chang, B.E., Leino, K.R.M.: Abstract interpretation with alien expressions and
heap structures. In: Cousot, R. (ed.) Verification, Model Checking, and Abstract
Interpretation, 6th International Conference, VMCAI 2005, Paris, France, January
17–19, 2005, Proceedings. Lecture Notes in Computer Science, vol. 3385, pp. 147–
163. Springer (2005). https://doi.org/10.1007/978-3-540-30579-8_11
7. Detlefs, D., Nelson, G., Saxe, J.B.: Simplify: A theorem prover for program check-
ing. J. ACM 52(3), 365–473 (May 2005). https://doi.org/10.1145/1066100.1066102
8. Dutertre, B.: Yices 2.2. In: Biere, A., Bloem, R. (eds.) Computer Aided Verification
- 26th International Conference, CAV 2014, Held as Part of the Vienna Summer of
Logic, VSL 2014, Vienna, Austria, July 18–22, 2014. Proceedings. Lecture Notes
in Computer Science, vol. 8559, pp. 737–744. Springer (2014). https://doi.org/10.
1007/978-3-319-08867-9_49
9. Dutertre, B.: Solving Exists/Forall Problems with Yices. In: Workshop on Satisfi-
ability Modulo Theories (2015). https://yices.csl.sri.com/papers/smt2015.pdf
10. Gange, G., Navas, J.A., Schachte, P., Søndergaard, H., Stuckey, P.J.: An abstract
domain of uninterpreted functions. In: Jobstmann, B., Leino, K.R.M. (eds.) Verifi-
cation, Model Checking, and Abstract Interpretation - 17th International Confer-
ence, VMCAI 2016, St. Petersburg, FL, USA, January 17–19, 2016. Proceedings.
Lecture Notes in Computer Science, vol. 9583, pp. 85–103. Springer (2016). https://
doi.org/10.1007/978-3-662-49122-5_4
11. Gascón, A., Subramanyan, P., Dutertre, B., Tiwari, A., Jovanovic, D., Malik, S.:
Template-based circuit understanding. In: Formal Methods in Computer-Aided
Design, FMCAD 2014, Lausanne, Switzerland, October 21–24, 2014, pp. 83–90.
IEEE (2014). https://doi.org/10.1109/FMCAD.2014.6987599
12. Gulwani, S., Tiwari, A., Necula, G.C.: Join algorithms for the theory of unin-
terpreted functions. In: Lodaya, K., Mahajan, M. (eds.) FSTTCS 2004: Founda-
tions of Software Technology and Theoretical Computer Science, 24th International
Conference, Chennai, India, December 16–18, 2004, Proceedings. Lecture Notes in
Computer Science, vol. 3328, pp. 311–323. Springer (2004). https://doi.org/10.
1007/978-3-540-30538-5_26
13. Gurfinkel, A., Ruemmer, P., Fedyukovich, G., Champion, A.: CHC-COMP.
https://chc-comp.github.io/ (2018)
14. Hojjat, H., Rümmer, P.: The ELDARICA horn solver. In: Bjørner, N.S., Gurfinkel,
A. (eds.) 2018 Formal Methods in Computer Aided Design, FMCAD 2018, Austin,
TX, USA, October 30 - November 2, 2018, pp. 1–7. IEEE (2018). https://doi.org/
10.23919/FMCAD.2018.8603013
15. Joshi, R., Nelson, G., Randall, K.H.: Denali: A goal-directed superoptimizer. In:
Knoop, J., Hendren, L.J. (eds.) Proceedings of the 2002 ACM SIGPLAN Con-
ference on Programming Language Design and Implementation (PLDI), Berlin,
Germany, June 17–19, 2002, pp. 304–314. ACM (2002). https://doi.org/10.1145/
512529.512566
16. Komuravelli, A., Bjørner, N.S., Gurfinkel, A., McMillan, K.L.: Compositional ver-
ification of procedural programs using horn clauses over integers and arrays. In:
Kaivola, R., Wahl, T. (eds.) Formal Methods in Computer-Aided Design, FMCAD
2015, Austin, Texas, USA, September 27–30, 2015, pp. 89–96. IEEE (2015).
https://doi.org/10.5555/2893529.2893548
17. Komuravelli, A., Gurfinkel, A., Chaki, S.: SMT-based model checking for recursive
programs. In: Biere, A., Bloem, R. (eds.) Computer Aided Verification - 26th Inter-
national Conference, CAV 2014, Held as Part of the Vienna Summer of Logic, VSL
2014, Vienna, Austria, July 18–22, 2014. Proceedings. Lecture Notes in Computer
Science, vol. 8559, pp. 17–34. Springer (2014). https://doi.org/10.1007/978-3-319-
08867-9_2
18. Kuncak, V., Mayer, M., Piskac, R., Suter, P.: Complete functional synthesis. In:
Zorn, B.G., Aiken, A. (eds.) Proceedings of the 2010 ACM SIGPLAN Confer-
ence on Programming Language Design and Implementation, PLDI 2010, Toronto,
Ontario, Canada, June 5–10, 2010, pp. 316–329. ACM (2010). https://doi.org/10.
1145/1806596.1806632
19. de Moura, L.M., Bjørner, N.S.: Z3: an efficient SMT solver. In: Ramakrishnan,
C.R., Rehof, J. (eds.) Tools and Algorithms for the Construction and Analysis
of Systems, 14th International Conference, TACAS 2008, Held as Part of the
Joint European Conferences on Theory and Practice of Software, ETAPS 2008,
Budapest, Hungary, March 29-April 6, 2008. Proceedings. Lecture Notes in Com-
puter Science, vol. 4963, pp. 337–340. Springer (2008). https://doi.org/10.1007/
978-3-540-78800-3_24
20. Nelson, G., Oppen, D.C.: Fast decision algorithms based on union and find. In:
18th Annual Symposium on Foundations of Computer Science, Providence, Rhode
Island, USA, 31 October - 1 November 1977, pp. 114–119. IEEE Computer Society
(1977). https://doi.org/10.1109/SFCS.1977.12
21. Nelson, G., Oppen, D.C.: Simplification by cooperating decision procedures.
ACM Trans. Program. Lang. Syst. 1(2), 245–257 (1979). https://doi.org/10.1145/
357073.357079
22. Tate, R., Stepp, M., Tatlock, Z., Lerner, S.: Equality saturation: A new approach to
optimization. In: Proceedings of the 36th Annual ACM SIGPLAN-SIGACT Sym-
posium on Principles of Programming Languages. p. 264–276. POPL ’09, Associ-
ation for Computing Machinery, New York, NY, USA (2009). https://doi.org/10.
1145/1480881.1480915
23. Tate, R., Stepp, M., Tatlock, Z., Lerner, S.: Equality saturation: A new approach
to optimization. Log. Methods Comput. Sci. 7(1) (2011). https://doi.org/10.2168/
LMCS-7(1:10)2011
24. Willsey, M., Nandi, C., Wang, Y.R., Flatt, O., Tatlock, Z., Panchekha, P.: egg:
Fast and extensible equality saturation. Proc. ACM Program. Lang. 5(POPL),
1–29 (2021). https://doi.org/10.1145/3434304
Local Search for Solving Satisfiability of Polynomial Formulas
1 Introduction
Satisfiability modulo theories (SMT) refers to the problem of determining
whether a first-order formula is satisfiable with respect to (w.r.t.) certain theo-
ries, such as the theories of linear integer/real arithmetic, nonlinear integer/real
arithmetic and strings. In this paper, we consider the theory of nonlinear real
arithmetic (NRA) and restrict our attention to the problem of solving satisfia-
bility of quantifier-free polynomial formulas.
Solving polynomial constraints has been a central problem in the develop-
ment of mathematics. In 1951, Tarski’s decision procedure [33] made it pos-
sible to solve polynomial constraints in an algorithmic way. However, Tarski’s
The authors are listed in alphabetical order and make equal contributions.
Cai et al. [8] developed a local search procedure for SMT on the theory of linear
integer arithmetic (LIA) through the critical move operation, which works at the
literal level and changes the value of one variable in a false LIA literal to make it
true. We also notice that there exists a local search SMT solver for the theory of
NRA, called NRA-LS, which performed well at the SMT Competition 2022¹. A simple
description of the solver, without details about local search, can be found in [25].
In this paper, we propose a local search algorithm for a special subclass of
SMT(NRA), where all constraints are strict inequalities. The idea of applying the
local search method to SMT(NRA) comes from cylindrical algebraic decomposition
(CAD), which decomposes the search space Rⁿ into finitely many cells such that every polynomial in the
formula is sign-invariant on each cell. CAD guarantees that the search space only
has finitely many states. Similar to the local search method for SAT which moves
between finitely many Boolean assignments, local search for SMT(NRA) should
jump between finitely many cells. So, we may use a local search framework for
SAT to solve SMT(NRA).
Local search algorithms require an operation to perform local changes. For
SAT, a standard operation is flip, which modifies the current assignment by
flipping the value of one Boolean variable from false to true or vice-versa. For
SMT(NRA), we propose a novel operation, called cell-jump, updating the current
assignment x1 → a1 , . . . , xn → an (ai ∈ Q) to a solution of a false polynomial
constraint ‘p < 0’ or ‘p > 0’, where xi is a variable appearing in the given
polynomial formula. Different from the critical move operation for linear integer
constraints [8], it is difficult to determine the threshold value of some variable xi
such that the false polynomial constraint becomes true. We deal with the issue by
the method of real root isolation, which isolates every real root of the univariate
polynomial p(a1 , . . . , ai−1 , xi , ai+1 , . . . , an ) in an open interval sufficiently small
with rational endpoints. If there exists at least one endpoint making the false
constraint true, a cell-jump operation assigns xi to one closest to ai . The proce-
dure can be viewed as searching for a solution along a line parallel to the xi -axis.
In fact, a cell-jump operation can search along any fixed straight line, and thus
one cell-jump may change the values of more than one variables. Each step, the
local search algorithm picks a cell-jump operation to execute according to a two-
level operation selection and updates the current assignment, until a solution to
the polynomial formula is found or the terminal condition is satisfied. Moreover,
our algorithm can be generalized to deal with a wider subclass of SMT(NRA)
where polynomial equations linear w.r.t. some variable are allowed.
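As a concrete illustration of this procedure, the following is a minimal Python sketch of an axis-parallel cell-jump built on sympy's real root isolation. The function name and interfaces are our own assumptions for illustration, not the tool described below.

import sympy as sp

def cell_jump_axis(p, xs, assign, i, op):
    """Try to make the false atom 'p < 0' or 'p > 0' true by changing only
    xs[i]; returns a new rational value for xs[i], or None if impossible."""
    # Substitute the current rational values for all variables except xs[i].
    subs = {x: sp.Rational(a) for j, (x, a) in enumerate(zip(xs, assign)) if j != i}
    q = sp.Poly(sp.expand(p.subs(subs)), xs[i])
    if q.degree() < 1:                       # constant along this line
        return None
    holds = (lambda v: v < 0) if op == '<' else (lambda v: v > 0)
    candidates = []
    for (lo, hi), _mult in q.intervals():    # isolating intervals, rational endpoints
        for end in (sp.Rational(lo), sp.Rational(hi)):
            if holds(q.eval(end)):
                candidates.append(end)
    if not candidates:
        return None                          # no cjump(x_i, atom) along this axis
    return min(candidates, key=lambda e: abs(e - sp.Rational(assign[i])))

For instance, for p = 2x1² + 2x2² − 1 and the assignment (1, 1) of Example 9 below, cell_jump_axis returns None for both variables, matching Fig. 5(a): neither axis-parallel line through (1, 1) meets the solution set.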
The local search algorithm is implemented as a tool on top of Maple 2022. Experi-
ments are conducted to evaluate the tool on two classes of benchmarks, including
selected instances from SMT-LIB², and some hard instances generated randomly
with only nonlinear constraints. Experimental results show that our tool is com-
petitive with state-of-the-art SMT solvers on the SMT-LIB benchmarks, and
performs particularly well on the hard instances. We also combine our tool with
¹ https://smt-comp.github.io/2022.
² https://smtlib.cs.uiowa.edu/benchmarks.shtml.
Z3, CVC5, Yices2 and MathSAT5 respectively to obtain four sequential portfolio
solvers, which show better performance.
The rest of the paper is organized as follows. The next section introduces some
basic definitions and notation and a general local search framework for solving
a satisfiability problem. Section 3 shows from the CAD perspective, the search
space for SMT(NRA) only has finite states. In Sect. 4, we describe cell-jump
operations, while in Sect. 5 we provide the scoring function which gives every
operation a score. The main algorithm is presented in Sect. 6. And in Sect. 7,
experimental results are provided to indicate the efficiency of the algorithm.
Finally, the paper is concluded in Sect. 8.
2 Preliminaries
2.1 Notation
Let x̄ := (x1, . . . , xn) be a vector of variables. Denote by Q, R and Z the sets of
rational numbers, real numbers and integers, respectively. Let Q[x̄] and
R[x̄] be the rings of polynomials in the variables x1, . . . , xn with coefficients in Q
and in R, respectively.
where f1 = 17x² + 2xy + 17y² + 48x − 48y and f2 = 17x² − 2xy + 17y² − 48x − 48y.
The solution set of F is shown as the shaded area in Fig. 1. Notice
that poly(F) consists of two polynomials and decomposes R² into 10 areas:
C1, . . . , C10 (see Fig. 2). We refer to these areas as cells.
Fig. 1. The solution set of F in Example 2.
Fig. 2. The zero level set of poly(F) decomposes R² into 10 cells.
The satisfiability of F is constant on every cell of the decomposition induced by
poly(F), that is, either all the points in a cell are solutions to F or none of them are.
Example 3. Consider the polynomial formula F in Example 2. As shown in
Fig. 3, assume that we start from point a to search for a solution to F . Jumping
from a to b makes no difference, as both points are in the same cell and thus
neither are solutions to F . However, jumping from a to c or from a to d crosses
different cells and we may discover a cell satisfying F . Herein, the cell containing
d satisfies F .
Fig. 3. Jumping from point a to points b, c and d.
For the remainder of this section, we will demonstrate how to traverse all
cells through point jumps between cells. The method of traversing cell by cell in
a variable-by-variable direction will be explained step by step from Definition 6
to Definition 8.
Definition 6 (Expansion). Let Q ⊆ R[x̄] be finite and ā = (a1 , . . . , an ) ∈
Rⁿ. Given a variable xi (1 ≤ i ≤ n), let r1 < · · · < rs be all real roots of
{q(a1, . . . , ai−1, xi, ai+1, . . . , an) | q(a1, . . . , ai−1, xi, ai+1, . . . , an) ≢ 0, q ∈ Q},
where s ∈ Z≥0. An expansion of ā to xi on Q is a point set Λ ⊆ Rⁿ satisfying
(a) ā ∈ Λ and (a1 , . . . , ai−1 , rj , ai+1 , . . . , an ) ∈ Λ for 1 ≤ j ≤ s,
(b) for any b̄ = (b1 , ..., bn ) ∈ Λ, bj = aj for j ∈ {1, . . . , n} \ {i}, and
(c) for any interval I ∈ {(−∞, r1 ), (r1 , r2 ), . . . , (rs−1 , rs ), (rs , +∞)}, there
exists a unique b̄ = (b1 , ..., bn ) ∈ Λ such that bi ∈ I.
Real root isolation is a symbolic way to compute the real roots of a polynomial,
which is of fundamental importance in computational real algebraic geometry
(e.g., it is a routine sub-algorithm for CAD). There are many efficient algorithms
and popular tools in computer algebra systems such as Maple and Mathematica
to isolate the real roots of polynomials.
We first introduce the definition of sequences of isolating intervals for nonzero
univariate polynomials, which can be obtained by any real root isolation tool,
e.g., CLPoly³.
Definition 9 (Sequence of Isolating Intervals). For any nonzero univariate
polynomial p(x) ∈ Q[x], a sequence of isolating intervals of p(x) is a sequence
of open intervals (a1 , b1 ), . . . , (as , bs ) where s ∈ Z≥0 , such that
Remark 1. For any nonzero univariate polynomial p(x) that has real roots, let
r1, . . . , rs (s ∈ Z≥1) be all distinct real roots of p(x). It is obvious that the
sign of p(x) is constantly positive or constantly negative on each interval I of
the set {(−∞, r1), (r1, r2), . . . , (rs−1, rs), (rs, +∞)}. So, we only need to take a
point x∗ from the interval I, and then the sign of p(x∗ ) is the constant sign of
³ https://github.com/lihaokun/CLPoly.
p(x) on I. Specifically, we take a1 as the sample point for the interval (−∞, r1);
bi, (bi + ai+1)/2 or ai+1 as a sample point for (ri, ri+1), where 1 ≤ i ≤ s − 1; and bs
as the sample point for (rs, +∞). By Definition 10, there exists no sample point
for the zero polynomial or for a univariate polynomial with no real roots.
Example 7. Consider the polynomial p(x) = x⁸ − 4x⁶ + 6x⁴ − 4x² + 1. It has two
real roots −1 and 1, and a sequence of isolating intervals of it is (−215/128, −19/32),
(19/32, 215/128). Every point in the set {−215/128, −19/32, 0, 19/32, 215/128} is a sample point of p(x).
Note that p(x) > 0 holds on the intervals (−∞, −1) and (1, +∞), and p(x) < 0
holds on the interval (−1, 1). Thus, −215/128 and 215/128 are positive sample points of
p(x); −19/32, 0 and 19/32 are negative sample points of p(x).
Similarly, we have:
Theorem 3. Suppose the current assignment is α : x1 → a1, . . . , xn → an
where ai ∈ Q. Let ℓ be a false atomic polynomial formula under α with a
relational operator ‘<’ or ‘>’, dir := (d1, . . . , dn) a vector in Qⁿ and L :=
{(a1 + d1t, . . . , an + dnt) | t ∈ R}. There exists a solution of ℓ in L if and only
if there exists a cjump(dir, ℓ) operation.
Theorem 3 implies that through a one-step cell-jump from the point α(x̄) along
any line that intersects the solution set of ℓ, a solution to ℓ will be found.
Example 9. Assume the current assignment is α : x1 → 1, x2 → 1. Consider
the false atomic polynomial formula ℓ1 : 2x1² + 2x2² − 1 < 0 in Example 8. Let
p := poly(ℓ1). By Fig. 5(b), the line (line L3) specified by the point α(x̄) and
the direction vector dir = (1, 1) intersects the solution set of ℓ1. So, there exists
a cjump(dir, ℓ1) operation by Theorem 3. Notice that the line can be described
in a parametric form, that is, {(x1, x2) | x1 = 1 + t, x2 = 1 + t where t ∈ R}. Then,
analyzing the values of p(x̄) on the line is equivalent to analyzing those of p∗(t)
on the real axis, where p∗(t) = p(1 + t, 1 + t) = 4t² + 8t + 3. A sequence of isolating
intervals of p∗ is (−215/128, −75/64), (−19/32, −61/128), and there are two negative sample
points of p∗.
Again by Fig. 5, there are other lines (the dashed lines) that go through
α(x̄) and intersect the solution set. So, we can also find a solution to ℓ1 along
these lines. Actually, for any false atomic polynomial formula ℓ with ‘<’ or ‘>’
that really has solutions, there always exists some direction dir in Qⁿ such that
cjump(dir, ℓ) finds one of them. Therefore, the more directions we try, the greater
the probability of finding a solution of ℓ.
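The same idea extends to an arbitrary rational direction, exactly as in Theorem 3: substitute the parametric line into p and isolate the real roots of the resulting univariate polynomial in t. A sketch in the same style as before (ours, using sympy; interfaces are assumptions, not the tool's):

import sympy as sp

def cell_jump_dir(p, xs, assign, dir_vec, op):
    """cjump(dir, atom) sketch: search for a solution of the false atom
    'p < 0' or 'p > 0' on the line assign + t * dir_vec (t rational)."""
    t = sp.Symbol('t')
    line = {x: sp.Rational(a) + sp.Rational(d) * t
            for x, a, d in zip(xs, assign, dir_vec)}
    q = sp.Poly(sp.expand(p.subs(line)), t)   # p restricted to the line, i.e. p*(t)
    if q.degree() < 1:
        return None
    holds = (lambda v: v < 0) if op == '<' else (lambda v: v > 0)
    best = None
    for (lo, hi), _mult in q.intervals():     # rational endpoints of isolating intervals
        for end in (sp.Rational(lo), sp.Rational(hi)):
            if holds(q.eval(end)) and (best is None or abs(end) < abs(best)):
                best = end                    # prefer the jump closest to α(x̄)
    if best is None:
        return None
    return [sp.Rational(a) + sp.Rational(d) * best
            for a, d in zip(assign, dir_vec)]

On Example 9, the restriction is p∗(t) = 4t² + 8t + 3; among the isolating-interval endpoints, −75/64 and −19/32 satisfy p∗ < 0, and the sketch jumps with t = −19/32, i.e., to the point (13/32, 13/32).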
(a) Neither L1 nor L2 intersects the solution set. (b) Line L3 and the dashed lines intersect the solution set.
Fig. 5. The cell-jump operations along the lines L1, L2 and L3 for the false atomic polynomial formula ℓ1 : 2x1² + 2x2² − 1 < 0 under the assignment α : x1 → 1, x2 → 1. The dashed circle denotes the circle 2x1² + 2x2² − 1 = 0 and the shaded part in it represents the solution set of the atom. The coordinate of point A is (1, 1). Lines L1, L2 and L3 pass through A and are parallel to the x1-axis, the x2-axis and the vector (1, 1), respectively.
Remark 2. For a false atomic polynomial formula ℓ with ‘<’ or ‘>’, cjump(xi, ℓ)
and cjump(dir, ℓ) move the current assignment to a new assignment, and both
assignments map to an element in Qⁿ. In fact, we can view cjump(xi, ℓ) as a
special case of cjump(dir, ℓ) where the i-th component of dir is 1 and all the other
components are 0. The main difference between cjump(xi, ℓ) and cjump(dir, ℓ) is
that cjump(xi, ℓ) only changes the value of one variable while cjump(dir, ℓ) may
change the values of many variables. The advantage of cjump(xi, ℓ) is that it avoids
the situation where some atoms can never become true because the values of many
variables are adjusted together. However, performing cjump(dir, ℓ) is more efficient
in some cases, since it may happen that a solution to ℓ can be found through a
one-step cjump(dir, ℓ), but only through many steps of cjump(xi, ℓ).
5 Scoring Functions
Scoring functions guide local search algorithms to pick an operation at each step.
In this section, we introduce a scoring function that measures the difference
between the distances to satisfaction under the assignments before and after
performing an operation.
First, we define the distance to truth of an atomic polynomial formula.
Definition 12 (Distance to Truth). Given the current assignment α such
that α(x̄) = (a1, . . . , an) ∈ Qⁿ and a positive parameter pp ∈ Q>0, for an atomic
polynomial formula ℓ with p := poly(ℓ), its distance to truth is
dtt(ℓ, α, pp) := 0, if α is a solution to ℓ, and
dtt(ℓ, α, pp) := |p(a1, . . . , an)| + pp, otherwise.
where w(c) denotes the weight of clause c, and α′ is the assignment after per-
forming op.
Note that the definition of the score is associated with the weights of clauses.
In our algorithm, we employ the probabilistic version of the PAWS scheme [9,
32] to update clause weights. The initial weight of every clause is 1. Given a
probability sp, the clause weights are updated as follows: with probability 1−sp,
the weight of every falsified clause is increased by one, and with probability sp,
for every satisfied clause with weight greater than 1, the weight is decreased by
one.
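A small Python sketch of this update, exactly as described above (is_satisfied is a placeholder predicate, assumed here, for evaluating a clause under the current assignment):

import random

def update_weights_paws(clauses, weights, is_satisfied, sp_prob):
    # With probability sp: smoothing; otherwise: additive weighting (PAWS).
    if random.random() < sp_prob:
        for c in clauses:
            if is_satisfied(c) and weights[c] > 1:
                weights[c] -= 1     # decrease satisfied clauses with weight > 1
    else:
        for c in clauses:
            if not is_satisfied(c):
                weights[c] += 1     # increase every falsified clause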
We adopt a classic strategy, called the tabu strategy [18], to deal with this issue. The
tabu strategy forbids reversing recent changes and can be directly applied in LS Algorithm. Notice
that every cell-jump operation increases or decreases the values of some variables.
After executing an operation that increases/decreases the value of a variable,
the tabu strategy forbids decreasing/increasing the value of the variable in the
subsequent tt iterations, where tt ∈ Z≥0 is a given parameter.
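In code, the bookkeeping can be as simple as the following sketch (ours; step counts iterations of the main loop):

def make_tabu():
    forbidden_until = {}                      # (var index, 'inc' or 'dec') -> step

    def allowed(i, direction, step):
        return forbidden_until.get((i, direction), -1) < step

    def record_move(i, direction, step, tt):
        # After increasing (resp. decreasing) x_i, forbid the opposite
        # change for the next tt iterations.
        opposite = 'dec' if direction == 'inc' else 'inc'
        forbidden_until[(i, opposite)] = step + tt

    return allowed, record_move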
Algorithm 3. LS Algorithm
Input: F, a polynomial formula such that the relational operator of every atom is ‘<’ or ‘>’; init_α, an initial assignment that maps to an element in Qⁿ
Output: a solution (in Qⁿ) to F or unknown
1  α ← init_α
2  while the terminal condition is not reached do
3    if α satisfies F then return α
4    fal_cl ← the set of atoms in falsified clauses
5    sat_cl ← the set of false atoms in satisfied clauses
6    if ∃ a decreasing cjump(xi, ℓ) operation where ℓ ∈ fal_cl then
7      op ← such an operation with the highest score
8      α ← α with op performed
9    else if ∃ a decreasing cjump(xi, ℓ) operation where ℓ ∈ sat_cl then
10     op ← such an operation with the highest score
11     α ← α with op performed
12   else
13     update clause weights according to the PAWS scheme
14     generate a direction vector set dset
15     if ∃ a decreasing cjump(dir, ℓ) operation where dir ∈ dset and ℓ ∈ fal_cl then
16       op ← such an operation with the highest score
17       α ← α with op performed
18     else if ∃ a decreasing cjump(dir, ℓ) operation where dir ∈ dset and ℓ ∈ sat_cl then
19       op ← such an operation with the highest score
20       α ← α with op performed
21     else
22       return unknown
23 return unknown
Remark 3. If the input formula has equality constraints, then we need to define
a cell-jump operation for a false atom of the form p(x̄) = 0. Given the current
assignment α : x1 → a1, . . . , xn → an (ai ∈ Q), the operation should assign some
variable xi to a real root of p(a1, . . . , ai−1, xi, ai+1, . . . , an), which may not be a
rational number. Since it is time-consuming to isolate real roots of a polynomial
with algebraic coefficients, we must guarantee that all assignments are rational
during the search. Thus, we require that for every equation p(x̄) = 0
in the formula, there exists at least one variable such that the degree of p w.r.t.
that variable is 1. Then, LS Algorithm also works for such a polynomial formula
after some minor modifications: In Line 6 (or Line 9), for every atom ℓ ∈ fal_cl
(or ℓ ∈ sat_cl) and for every variable xi, if ℓ has the form p(x̄) = 0, p is linear
w.r.t. xi and p(a1, . . . , ai−1, xi, ai+1, . . . , an) is not a constant polynomial, there
is a candidate operation that changes the value of xi to the (rational) solution
of p(a1, . . . , ai−1, xi, ai+1, . . . , an) = 0; if ℓ has the form p(x̄) > 0 or p(x̄) < 0, a
candidate operation is cjump(xi, ℓ). We perform a decreasing candidate operation
with the highest score if one exists, and update α in Line 8 (or Line 11).
In Line 15 (or Line 18), we only deal with inequality constraints from fal_cl (or
sat_cl), and skip equality constraints.
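A sketch of that extra candidate operation for equality atoms, under the linearity restriction above (ours, using sympy):

import sympy as sp

def equality_move(p, xs, assign, i):
    """For a false atom p(x̄) = 0: if p(a1,...,x_i,...,an) is linear in x_i,
    return the unique rational solution for x_i; otherwise return None."""
    subs = {x: sp.Rational(a) for j, (x, a) in enumerate(zip(xs, assign)) if j != i}
    q = sp.Poly(sp.expand(p.subs(subs)), xs[i])
    if q.degree() != 1:
        return None                           # constant or nonlinear here
    b, c = q.all_coeffs()                     # q = b * x_i + c, with b != 0
    return -c / b                             # rational, so the search stays in Q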
7 Experiments
We carried out experiments to evaluate LS Algorithm on two classes of instances,
where one class consists of selected instances from SMT-LIB while the other is
generated randomly, and compared our tool with state-of-the-art SMT(NRA)
solvers. Furthermore, we combine our tool with Z3, CVC5, Yices2 and Math-
SAT5 respectively to obtain four sequential portfolio solvers, which show better
performance.
7.2 Instances
We prepare two classes of instances. One class consists of 2736 unknown
and satisfiable instances from SMT-LIB(NRA)⁴, where in every equality poly-
nomial constraint, the degree of the polynomial w.r.t. each variable is less than
or equal to 1.
The rest are random instances. Before introducing the generation approach
of random instances, we first define some notation. Let rn(down, up) denote a
⁴ https://clc-gitlab.cs.uiowa.edu:2443/SMT-LIB-benchmarks/QF_NRA.
random integer between two integers down and up, and rp({x1, . . . , xn}, d, m)
denote a random polynomial c1M1 + · · · + cmMm + c0, where ci = rn(−1000, 1000) for
0 ≤ i ≤ m, M1 is a random monomial in {x1^a1 · · · xn^an | ai ∈ Z≥0, a1 + · · · + an = d}
and Mi (2 ≤ i ≤ m) is a random monomial in {x1^a1 · · · xn^an | ai ∈ Z≥0, a1 + · · · + an ≤ d}.
A randomly generated polynomial formula rf({v_n1, v_n2}, {p_n1, p_n2}, {d−, d+}, {n−, n+}, {m−, m+}, {cl_n1, cl_n2}, {cl_l1, cl_l2}), where all parameters are
in Z≥0, is constructed as follows: First, let n := rn(v_n1, v_n2) and gener-
ate n variables x1, . . . , xn. Second, let num := rn(p_n1, p_n2) and generate
num polynomials p1, . . . , p_num. Every pi is a random polynomial rp({x_i1, . . . , x_in_i}, d, m), where n_i = rn(n−, n+), d = rn(d−, d+), m = rn(m−, m+), and
{x_i1, . . . , x_in_i} are n_i variables randomly selected from {x1, . . . , xn}. Finally, let
cl_n := rn(cl_n1, cl_n2) and generate cl_n clauses such that the number of atoms
in a generated clause is rn(cl_l1, cl_l2). The rn(cl_l1, cl_l2) atoms are randomly
picked from {pi < 0, pi > 0, pi = 0 | 1 ≤ i ≤ num}. If some picked atom has
the form pi = 0 and there exists a variable such that the degree of pi w.r.t. the
variable is greater than 1, we replace the atom with pi < 0 or pi > 0 with equal
probability. We generate in total 500 random polynomial formulas according to
rf({30, 40}, {60, 80}, {20, 30}, {10, 20}, {20, 30}, {40, 60}, {3, 5}).
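For concreteness, a Python sketch of rn and rp as specified above (ours; sympy is used only to assemble the polynomial):

import random
import sympy as sp

def rn(down, up):
    return random.randint(down, up)

def rp(xs, d, m):
    # c1*M1 + ... + cm*Mm + c0: M1 has total degree exactly d,
    # M2..Mm have total degree at most d, coefficients in [-1000, 1000].
    n = len(xs)
    def monomial(deg):
        exps = [0] * n
        for _ in range(deg):                  # distribute 'deg' over the variables
            exps[rn(0, n - 1)] += 1
        return sp.prod(x ** e for x, e in zip(xs, exps))
    poly = rn(-1000, 1000) * monomial(d)
    for _ in range(m - 1):
        poly += rn(-1000, 1000) * monomial(rn(0, d))
    return sp.expand(poly + rn(-1000, 1000))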
The two classes of instances have different characteristics. The instances
selected from SMT-LIB(NRA) usually contain lots of linear constraints, and their
complexity is reflected in the propositional abstraction. For a random instance,
all the polynomials in it are nonlinear and of high degrees, while its propositional
abstraction is relatively simple.
Fig. 10. Comparing LS with MathSAT5.
Fig. 11. Comparing LS with Yices2.
8 Conclusion
For a given SMT(NRA) formula, although the domain of variables in the for-
mula is infinite, the satisfiability of the formula can be decided through tests on
a finite number of samples in the domain. A complete search on such samples
is inefficient. In this paper, we propose a local search algorithm for a special
class of SMT(NRA) formulas, where every equality polynomial constraint is
linear with respect to at least one variable. The novelty of our algorithm lies in
the cell-jump operation and a two-level operation selection, which guide the
algorithm to jump from one sample to another heuristically. The algorithm has
been applied to two classes of benchmarks and the experimental results show
that it is competitive with state-of-the-art SMT solvers and is good at solving
those formulas with high-degree polynomial constraints. Tests on the solvers
developed by combining this local search algorithm with Z3, CVC5, Yices2 or
MathSAT5 indicate that the algorithm is complementary to these state-of-the-
art SMT(NRA) solvers. In future work, we will improve our algorithm so
that it can handle all polynomial formulas.
References
1. Ábrahám, E., Davenport, J.H., England, M., Kremer, G.: Deciding the consistency
of non-linear real arithmetic constraints with a conflict driven search using cylin-
drical algebraic coverings. J. Logical Algebraic Methods Programm. 119, 100633
(2021)
2. Balint, A., Schöning, U.: Choosing probability distributions for stochastic local
search and the role of make versus break. In: Cimatti, A., Sebastiani, R. (eds.)
SAT 2012. LNCS, vol. 7317, pp. 16–29. Springer, Heidelberg (2012). https://doi.
org/10.1007/978-3-642-31612-8 3
3. Barbosa, H., et al.: cvc5: a versatile and industrial-strength SMT solver. In: TACAS
2022. LNCS, pp. 415–442. Springer, Cham (2022). https://doi.org/10.1007/978-3-
030-99524-9 24
4. Biere, A.: Splatz, lingeling, plingeling, treengeling, yalsat entering the sat compe-
tition 2016. In: Proceedings of SAT Competition, pp. 44–45 (2016)
5. Biere, A., Heule, M., van Maaren, H.: Handbook of Satisfiability, vol. 185. IOS
press (2009)
6. Brown, C.W.: Improved projection for cylindrical algebraic decomposition. J.
Symb. Comput. 32(5), 447–465 (2001)
7. Brown, C.W., Košta, M.: Constructing a single cell in cylindrical algebraic decom-
position. J. Symb. Comput. 70, 14–48 (2015)
8. Cai, S., Li, B., Zhang, X.: Local search for SMT on linear integer arithmetic. In:
International Conference on Computer Aided Verification. pp. 227–248. Springer
(2022). https://doi.org/10.1007/978-3-031-13188-2 12
9. Cai, S., Su, K.: Local search for Boolean satisfiability with configuration checking
and subscore. Artif. Intell. 204, 75–98 (2013)
10. Cimatti, A., Griggio, A., Irfan, A., Roveri, M., Sebastiani, R.: Incremental lin-
earization for satisfiability and verification modulo nonlinear arithmetic and tran-
scendental functions. ACM Trans. Comput. Log. 19(3), 1–52 (2018)
11. Cimatti, A., Griggio, A., Schaafsma, B.J., Sebastiani, R.: The MathSAT5 SMT
solver. In: Piterman, N., Smolka, S.A. (eds.) TACAS 2013. LNCS, vol. 7795, pp.
93–107. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-36742-7 7
12. Clarke, E.M., Henzinger, T.A., Veith, H., Bloem, R., et al.: Handbook of Model
Checking, vol. 10. Springer (2018). https://doi.org/10.1007/978-3-319-10575-8
13. Collins, G.E.: Quantifier elimination for real closed fields by cylindrical algebraic
decompostion. In: Brakhage, H. (ed.) GI-Fachtagung 1975. LNCS, vol. 33, pp.
134–183. Springer, Heidelberg (1975). https://doi.org/10.1007/3-540-07407-4 17
14. Collins, G.E., Hong, H.: Partial cylindrical algebraic decomposition for quantifier
elimination. J. Symb. Comput. 12(3), 299–328 (1991)
15. De Moura, L., Jovanović, D.: A model-constructing satisfiability calculus. In: Inter-
national Workshop on Verification, Model Checking, and Abstract Interpretation,
pp. 1–12. Springer (2013). https://doi.org/10.1007/978-3-642-35873-9 1
16. Dutertre, B.: Yices 2.2. In: Biere, A., Bloem, R. (eds.) CAV 2014. LNCS, vol. 8559,
pp. 737–744. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-08867-
9 49
17. Fröhlich, A., Biere, A., Wintersteiger, C., Hamadi, Y.: Stochastic local search for
satisfiability modulo theories. In: Proceedings of the AAAI Conference on Artificial
Intelligence, vol. 29 (2015)
18. Glover, F., Laguna, M.: Tabu search. In: Handbook of Combinatorial Optimization,
pp. 2093–2229. Springer (1998)
19. Griggio, A., Phan, Q.-S., Sebastiani, R., Tomasi, S.: Stochastic local search
for SMT: combining theory solvers with WalkSAT. In: Tinelli, C., Sofronie-
Stokkermans, V. (eds.) FroCoS 2011. LNCS (LNAI), vol. 6989, pp. 163–178.
Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-24364-6 12
20. Hong, H.: An improvement of the projection operator in cylindrical algebraic
decomposition. In: Proceedings of the International Symposium on Symbolic and
Algebraic Computation, pp. 261–264 (1990)
21. Jovanović, D., de Moura, L.: Solving non-linear arithmetic. In: Gramlich, B., Miller,
D., Sattler, U. (eds.) IJCAR 2012. LNCS (LNAI), vol. 7364, pp. 339–354. Springer,
Heidelberg (2012). https://doi.org/10.1007/978-3-642-31365-3 27
22. Lazard, D.: An improved projection for cylindrical algebraic decomposition. In:
Algebraic Geometry and its Applications, pp. 467–476. Springer (1994). https://
doi.org/10.1007/978-1-4612-2628-4 29
23. Li, C.M., Li, Yu.: Satisfying versus falsifying in local search for satisfiability.
In: Cimatti, A., Sebastiani, R. (eds.) SAT 2012. LNCS, vol. 7317, pp. 477–478.
Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-31612-8 43
24. Li, H., Xia, B.: Solving satisfiability of polynomial formulas by sample-cell projec-
tion. arXiv preprint arXiv:2003.00409 (2020)
25. Liu, M., et al.: NRA-LS at the SMT competition 2022. Tool description document,
see https://github.com/minghao-liu/NRA-LS (2022)
26. McCallum, S.: An improved projection operation for cylindrical algebraic decom-
position. In: Quantifier Elimination and Cylindrical Algebraic Decomposition, pp.
242–268. Springer (1998). https://doi.org/10.1007/978-3-7091-9459-1 12
27. Mitchell, D., Selman, B., Leveque, H.: A new method for solving hard satisfiability
problems. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp.
440–446 (1992)
28. de Moura, L., Bjørner, N.: Z3: An efficient SMT solver. In: International Conference
on Tools and Algorithms for the Construction and Analysis of Systems, pp. 337–
340. Springer (2008). https://doi.org/10.1007/978-3-540-78800-3_24
29. Nalbach, J., Ábrahám, E., Specht, P., Brown, C.W., Davenport, J.H., England,
M.: Levelwise construction of a single cylindrical algebraic cell. arXiv preprint
arXiv:2212.09309 (2022)
30. Niemetz, A., Preiner, M.: Ternary propagation-based local search for more bit-
precise reasoning. In: 2020 Formal Methods in Computer Aided Design (FMCAD),
pp. 214–224. IEEE (2020)
31. Niemetz, A., Preiner, M., Biere, A.: Precise and complete propagation based local
search for satisfiability modulo theories. In: Chaudhuri, S., Farzan, A. (eds.) CAV
2016. LNCS, vol. 9779, pp. 199–217. Springer, Cham (2016). https://doi.org/10.
1007/978-3-319-41528-4 11
32. Talupur, M., Sinha, N., Strichman, O., Pnueli, A.: Range allocation for separation
logic. In: Alur, R., Peled, D.A. (eds.) CAV 2004. LNCS, vol. 3114, pp. 148–161.
Springer, Heidelberg (2004). https://doi.org/10.1007/978-3-540-27813-9 12
33. Tarski, A.: A decision method for elementary algebra and geometry. University of
California Press (1951)
34. Tung, V.X., Van Khanh, T., Ogawa, M.: raSAT: an SMT solver for polynomial
constraints. Formal Methods Syst Design 51(3), 462–499 (2017). https://doi.org/
10.1007/s10703-017-0284-9
35. Weispfenning, V.: Quantifier elimination for real algebra-the quadratic case and
beyond. Appl. Algebra Eng. Commun. Comput. 8(2), 85–101 (1997)
Partial Quantifier Elimination and Property Generation
Eugene Goldberg
1 Introduction
In this paper, we consider the following problem. Let F (X, Y ) be a propositional
formula in conjunctive normal form (CNF)1 where X, Y are sets of variables.
Let G be a subset of clauses of F . Given a formula ∃X[F ], find a quantifier-free
formula H(Y ) such that ∃X[F ] ≡ H ∧ ∃X[F \ G]. In contrast to full quantifier
elimination (QE), only the clauses of G are taken out of the scope of quantifiers
here. So, we call this problem partial QE (PQE) [1]. (In this paper, we consider
PQE only for formulas with existential quantifiers.) We will refer to H as a
solution to PQE. Like SAT, PQE is a way to cope with the complexity of QE.
But in contrast to SAT that is a special case of QE (where all variables are
quantified), PQE generalizes QE. The latter is just a special case of PQE where
G = F and the entire formula is unquantified. Interpolation [2,3] can be viewed
as a special case of PQE as well [4,5].
¹ Every formula is a propositional CNF formula unless otherwise stated. Given a CNF formula F represented as the conjunction of clauses C1 ∧ · · · ∧ Ck, we will also consider F as the set of clauses {C1, . . . , Ck}.
The appeal of PQE is threefold. First, it can be much more efficient than
QE if G is a small subset of F . Second, many verification problems like SAT,
equivalence checking, model checking can be solved in terms of PQE [1,6–8]. So,
PQE can be used to design new efficient methods for solving known problems.
Third, one can apply PQE to solving new problems like property generation
considered in this paper. In practice, to perform PQE, it suffices to have an
algorithm that takes a single clause out of the scope of quantifiers. Namely, given
a formula ∃X[F (X, Y )] and a clause C ∈ F , this algorithm finds a formula H(Y )
such that ∃X[F ] ≡ H ∧ ∃X[F \ {C}]. To take out k clauses, one can apply this
algorithm k times. Since H ∧ ∃X[F ] ≡ H ∧ ∃X[F \ {C}], solving the PQE above
reduces to finding H(Y ) that makes C redundant in H ∧ ∃X[F ]. So, the PQE
algorithms we present here employ redundancy based reasoning. We describe
two PQE algorithms called EG-PQE and EG-PQE+ where “EG” stands for
“Enumerate and Generalize”. EG-PQE is a very simple SAT-based algorithm
that can sometimes solve very large problems. EG-PQE+ is a modification of
EG-PQE that makes the algorithm more powerful and robust.
In [7], we showed the viability of an equivalence checker based on PQE. In par-
ticular, we presented instances for which this equivalence checker outperformed
ABC [9], a high quality tool. In this paper, we describe and check experimen-
tally one more important application of PQE called property generation. Our
motivation here is as follows. Suppose a design implementation Imp meets the
set of specification properties P1, . . . , Pm. Typically, this set is incomplete. So,
Imp can still be buggy even if every Pi, i = 1, . . . , m holds. Let P∗m+1, . . . , P∗n
be desired properties whose addition makes the specification complete. If Imp
meets the properties P1, . . . , Pm but is still buggy, a missed property P∗i above
fails. That is, Imp has the unwanted property P∗i. So, one can detect bugs by
generating unspecified properties of Imp and checking if there is an unwanted
one.
Currently, identification of unwanted properties is mostly done by massive
testing. (As we show later, the input/output behavior specified by a single test
can be cast as a simple property of Imp.) Another technique employed in prac-
tice is guessing unwanted properties that may hold and formally checking if
this is the case. The problem with these techniques is that they can miss an
unwanted property. In this paper, we describe property generation by PQE. The
benefit of PQE is that it can produce much more complex properties than those
corresponding to single tests. So, using PQE one can detect bugs that testing
overlooks or cannot find in principle. Importantly, PQE generates properties
covering different parts of Imp. This makes the search for unwanted properties
more systematic and facilitates discovering bugs that can be missed if one simply
guesses unwanted properties that may hold.
In this paper, we experimentally study generation of invariants of a sequen-
tial circuit N . An invariant of N is unwanted if a state that is supposed to be
reachable in N falsifies this invariant and hence is unreachable. Note that find-
ing a formal proof that N has no unwanted invariants is impractical. (It is hard
to efficiently prove a large set of states reachable because different states are,
in general, reached by different execution traces.)
2 Basic Definitions
In this section, when we say “formula” without mentioning quantifiers, we mean
“a quantifier-free formula”.
Definition 1. We assume that formulas have only Boolean variables. A literal
of a variable v is either v or its negation. A clause is a disjunction of literals.
A formula F is in conjunctive normal form (CNF) if F = C1 ∧ · · · ∧ Ck where
C1 , . . . , Ck are clauses. We will also view F as the set of clauses {C1 , . . . , Ck }.
We assume that every formula is in CNF.
Definition 2. Let F be a formula. Then Vars(F ) denotes the set of variables
of F and Vars(∃X[F ]) denotes Vars(F )\X.
Definition 3. Let V be a set of variables. An assignment q⃗ to V is a mapping
V′ → {0, 1} where V′ ⊆ V. We will denote the set of variables assigned in q⃗ as
Vars(q⃗). We will refer to q⃗ as a full assignment to V if Vars(q⃗) = V. We
will denote as q⃗ ⊆ r⃗ the fact that a) Vars(q⃗) ⊆ Vars(r⃗) and b) every variable
of Vars(q⃗) has the same value in q⃗ and r⃗.
Definition 4. A literal, a clause and a formula are said to be satisfied (respectively falsified) by an assignment q⃗ if they evaluate to 1 (respectively 0) under q⃗.
Definition 5. Let C be a clause. Let H be a formula that may have quantifiers,
and q⃗ be an assignment to Vars(H). If C is satisfied by q⃗, then Cq⃗ ≡ 1. Otherwise,
Cq⃗ is the clause obtained from C by removing all literals falsified by q⃗. Denote by
Hq⃗ the formula obtained from H by removing the clauses satisfied by q⃗ and
replacing every clause C unsatisfied by q⃗ with Cq⃗.
Definition 6. Given a formula ∃X[F(X, Y)], a clause C of F is called a quantified clause if Vars(C) ∩ X ≠ ∅. If Vars(C) ∩ X = ∅, the clause C depends only on the free, i.e., unquantified variables of F and is called a free clause.
Ntriv that simply stays in the initial state s⃗ini and Pagg(s⃗ini) = 1. Then Pagg
holds for Ntriv but the latter has op-state reachability bugs (assuming that the
correct circuit must reach states other than s⃗ini).
Let Rs⃗(S) be the predicate satisfied only by a state s⃗. In terms of CTL,
identifying an op-state reachability bug means finding s⃗ for which the property
EF.Rs⃗ must hold but it does not. The reason for assuming s⃗ to be unknown
is that the set of op-states is typically too large to explicitly specify every prop-
erty EF.Rs⃗ to hold. This makes finding op-state reachability bugs very hard.
The problem is exacerbated by the fact that reachability of different states is
established by different traces. So, in general, one cannot efficiently prove many
properties EF.Rs⃗ (for different states) at once.
In practice, there are two methods to check reachability of op-states for large
circuits. The first method is testing. Of course, testing cannot prove a state
unreachable, however, the examination of execution traces may point to a poten-
tial problem. (For instance, after examining execution traces of the circuit Ntriv
above one realizes that many op-states look unreachable.) The other method
is to check unwanted invariants, i.e., those that are supposed to fail. If an
unwanted invariant holds for a circuit, the latter has an op-state reachability
bug. For instance, one can check if a state variable si ∈ S of a circuit never
changes its initial value. To break this unwanted invariant, one needs to find
an op-state where the initial value of si is flipped. (For the circuit Ntriv above
this unwanted invariant holds for every state variable.) The potential unwanted
invariants are formed manually, i.e., simply guessed.
The two methods above can easily overlook an op-state reachability bug.
Testing cannot prove that an op-state is unreachable. To correctly guess an
unwanted invariant that holds, one essentially has to know the underlying bug.
Below, we describe a method for invariant generation by PQE that is based on
property generation for combinational circuits. The appeal of this method is
twofold. First, PQE generates invariants “inherent” to the implementation at
hand, which drastically reduces the set of invariants to explore. Second, PQE is
able to generate invariants related to different parts of the circuit (including the
buggy one). This increases the probability of generating an unwanted invariant.
We substantiate this intuition in Sect. 7.
Let formula Fk specify the combinational circuit obtained by unfolding a
sequential circuit N for k time frames and adding the initial state constraint
I(S0 ). That is, Fk = I(S0 ) ∧ T (S0 , V0 , S1 ) ∧ · · · ∧ T (Sk−1 , Vk−1 , Sk ) where Sj , Vj
denote the state and input variables of j-th time frame respectively. Let H(Sk )
be a solution to the PQE problem of taking a clause C out of ∃Xk [Fk ] where
Xk = S0 ∪V0 ∪· · ·∪Sk−1 ∪Vk−1 . That is, ∃Xk [Fk ] ≡ H∧ ∃Xk [Fk \{C}]. Note that
in contrast to Sect. 3, here some external variables of the combinational circuit
(namely, the input variables V0 , . . . , Vk−1 ) are quantified too. So, H depends only
on state variables of the last time frame. H can be viewed as a local invariant
asserting that no state falsifying H can be reached in k transitions.
One can use H to find global invariants (holding for every time frame) as
follows. Even if H is only a local invariant, a clause Q of H can be a global
invariant. The experiments of Sect. 8 show that, in general, this is true for many
clauses of H. (To find out if Q is a global invariant, one can simply run a model
checker to see if the property Q holds.) Note that by taking out different clauses
of Fk one can produce global single-clause invariants Q relating to different parts
of N . From now on, when we say “an invariant” without a qualifier we mean a
global invariant.
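The unfolding itself is mechanical; the following sketch (ours) uses the z3 Python API only to form the conjunction, with I and T standing for user-supplied builders of the initial-state and transition constraints — PQE itself is not part of the z3 API:

from z3 import And, Bool

def unfold(I, T, state_names, input_names, k):
    # Build F_k = I(S0) & T(S0,V0,S1) & ... & T(S_{k-1},V_{k-1},S_k)
    # over fresh per-frame Boolean variables.
    S = [[Bool(f"{s}__{j}") for s in state_names] for j in range(k + 1)]
    V = [[Bool(f"{v}__{j}") for v in input_names] for j in range(k)]
    frames = [I(S[0])] + [T(S[j], V[j], S[j + 1]) for j in range(k)]
    # X_k = S0 ∪ V0 ∪ ... ∪ S_{k-1} ∪ V_{k-1} is then quantified away, and a
    # PQE solver takes a clause C out of ∃X_k[F_k] to obtain H(S_k).
    return And(frames), S, V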
5 Introducing EG-PQE
In this section, we describe a simple SAT-based algorithm for performing PQE
called EG-PQE . Here ‘EG’ stands for ‘Enumerate and Generalize’. EG-PQE
accepts a formula ∃X[F (X, Y )] and a clause C ∈ F . It outputs a formula H(Y )
such that ∃X[Fini ] ≡ H ∧ ∃X[Fini \ {C}] where Fini is the initial formula F .
(This point needs clarification because EG-PQE changes F by adding clauses.)
5.1 An Example
Before describing the pseudocode of EG-PQE , we explain how it solves the PQE
problem of Example 1. That is, we consider taking clause C1 out of ∃X[F(X, Y)]
where F = C1 ∧ · · · ∧ C4, C1 = x̄3 ∨ x4, C2 = y1 ∨ x3, C3 = y1 ∨ x̄4, C4 = y2 ∨ x̄4,
Y = {y1, y2} and X = {x3, x4}.
EG-PQE iteratively generates a full assignment y⃗ to Y and checks if (C1)y⃗
is redundant in ∃X[Fy⃗] (i.e., if C1 is redundant in ∃X[F] in subspace y⃗). Note
that if (F \ {C1})y⃗ implies (C1)y⃗, then (C1)y⃗ is trivially redundant in ∃X[Fy⃗].
To avoid such subspaces, EG-PQE generates y⃗ by searching for an assignment
(y⃗, x⃗) satisfying the formula (F \ {C1}) ∧ C̄1. (Here y⃗ and x⃗ are full assignments
to Y and X respectively.) If such (y⃗, x⃗) exists, it satisfies F \ {C1} and falsifies
C1, thus proving that (F \ {C1})y⃗ does not imply (C1)y⃗.
Assume that EG-PQE found an assignment (y1 = 0, y2 = 1, x3 = 1, x4 = 0)
satisfying (F \ {C1}) ∧ C̄1. So y⃗ = (y1 = 0, y2 = 1). Then EG-PQE checks if Fy⃗ is
satisfiable. Fy⃗ = (x̄3 ∨ x4) ∧ x3 ∧ x̄4 and so it is unsatisfiable. This means that (C1)y⃗
is not redundant in ∃X[Fy⃗]. (Indeed, (F \ {C1})y⃗ is satisfiable. So, removing
C1 makes F satisfiable in subspace y⃗.) EG-PQE makes (C1)y⃗ redundant in
∃X[Fy⃗] by adding to F a clause B falsified by y⃗. The clause B equals y1
and is obtained by identifying the assignments to individual variables of Y that
made Fy⃗ unsatisfiable. (In our case, this is the assignment y1 = 0.) Note that
the derivation of clause y1 generalizes the proof of unsatisfiability of F in subspace
(y1 = 0, y2 = 1) so that this proof holds for subspace (y1 = 0, y2 = 0) too.
Now EG-PQE looks for a new assignment satisfying (F \ {C1}) ∧ C̄1. Let the
assignment (y1 = 1, y2 = 1, x3 = 1, x4 = 0) be found. So, y⃗ = (y1 = 1, y2 = 1).
Since (y1 = 1, y2 = 1, x3 = 0) satisfies F, the formula Fy⃗ is satisfiable. So, (C1)y⃗
is redundant in ∃X[Fy⃗].
5.3 Discussion
EG-PQE is similar to the QE algorithm presented at CAV-2002 [12]. We will refer to it as CAV02-QE. Given a formula ∃X[F(X, Y)], CAV02-QE enumerates full assignments to Y. In subspace y⃗, if Fy⃗ is unsatisfiable, CAV02-QE adds to F a clause falsified by y⃗. Otherwise, CAV02-QE generates a plugging clause D. (In [12], D is called a “blocking clause”. This term can be confused with the term “blocked clause”, which denotes a completely different kind of clause, so we use the term “plugging clause” instead.) To apply the idea of CAV02-QE to PQE, we reformulated it in terms of redundancy-based reasoning.
The main flaw of EG-PQE inherited from CAV02-QE is the necessity to use plugging clauses produced from a satisfying assignment. Consider the PQE problem of taking a clause C out of ∃X[F(X, Y)]. If F is proved unsatisfiable in subspace y⃗, typically only a small subset of the clauses of Fy⃗ is involved in the proof. Then the clause generated by EG-PQE is short and thus proves C redundant in many subspaces different from y⃗. On the contrary, to prove F satisfiable in subspace y⃗, every clause of F must be satisfied. So, the plugging clause built off a satisfying assignment includes almost every variable of Y. Despite this flaw of EG-PQE, we present it for two reasons. First, it is a very simple SAT-based algorithm that can be easily implemented. Second, EG-PQE has a powerful advantage over CAV02-QE since it solves PQE rather than QE. Namely, EG-PQE does not need to examine the subspaces y⃗ where C is implied by F \ {C}. Surprisingly, for many formulas this allows EG-PQE to completely avoid examining subspaces where F is satisfiable. In this case, EG-PQE is very efficient and can solve very large problems. Note that when CAV02-QE performs complete QE on ∃X[F], it cannot avoid subspaces y⃗ where Fy⃗ is satisfiable unless F itself is unsatisfiable (which is very rare in practical applications).
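To make the loop concrete, below is a minimal Python sketch of the enumerate-and-generalize procedure, written against the PySAT package (using PySAT here is our assumption for illustration; the actual EG-PQE implementation [14] is a standalone tool). Variables are DIMACS integers, clauses are lists of literals, and y_vars is the set of unquantified variables Y.

```python
# Sketch of EG-PQE: take clause c out of exists X [F(X, Y)], return H(Y).
# Plugging clauses are used only to block enumerated subspaces here; the
# real algorithm treats them more carefully.
from pysat.solvers import Solver

def eg_pqe(clauses, c, y_vars):
    h = []                                    # clauses of the answer H(Y)
    blocked = []                              # plugging/learned clauses
    rest = [cl for cl in clauses if cl != c]  # F \ {c}
    neg_c = [[-lit] for lit in c]             # unit clauses: "c is falsified"
    while True:
        # Search for (y, x) satisfying (F \ {c}) and not-c, skipping
        # subspaces where F \ {c} implies c or that are already blocked.
        with Solver(bootstrap_with=rest + neg_c + blocked) as s:
            if not s.solve():
                return h
            y = [lit for lit in s.get_model() if abs(lit) in y_vars]
        # Is F satisfiable in subspace y? (y passed as assumptions)
        with Solver(bootstrap_with=clauses + h) as s:
            if s.solve(assumptions=y):
                blocked.append([-lit for lit in y])   # plugging clause
            else:
                core = s.get_core()           # generalize via the unsat core
                b = [-lit for lit in core]    # clause B falsified by y
                h.append(b)
                blocked.append(b)
```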
6 Introducing EG-PQE +
In this section, we describe EG-PQE + , an improved version of EG-PQE .
Example 3. Consider the example solved in Subsect. 5.1. That is, we consider taking clause C1 out of ∃X[F(X, Y)] where F = C1 ∧ · · · ∧ C4, C1 = ¬x3 ∨ x4, C2 = y1 ∨ x3, C3 = y1 ∨ ¬x4, C4 = y2 ∨ ¬x4, Y = {y1, y2} and X = {x3, x4}. Consider the step where EG-PQE proves redundancy of C1 in subspace y⃗ = (y1 = 1, y2 = 1). EG-PQE shows that (y1 = 1, y2 = 1, x3 = 0) satisfies F, thus proving every clause of F (including C1) redundant in ∃X[F] in subspace y⃗. Then EG-PQE generates the plugging clause D = ¬y1 ∨ ¬y2 falsified by y⃗.
In contrast to EG-PQE, EG-PQE+ calls PrvClsRed to produce a proof of redundancy for the clause C1 alone. Note that F has no clauses resolvable with C1 on x3 in subspace y⃗∗ = (y1 = 1). (The clause C2 containing x3 is satisfied by y⃗∗.) This means that C1 is blocked in subspace y⃗∗ and hence redundant there (see Proposition 2). Since y⃗∗ ⊂ y⃗, EG-PQE+ produces a more general proof of redundancy of C1.
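The blocked-clause test behind this proof is easy to state in code. The sketch below is illustrative and simplified (it ignores tautological resolvents); it is not code from EG-PQE+.

```python
# A clause c is blocked in F at variable v, in the subspace given by the
# partial assignment y, if every clause of F resolvable with c on v is
# satisfied by y (simplified test).

def is_blocked(f, c, v, y):
    lit = next(l for l in c if abs(l) == v)   # the literal of v in c
    for cl in f:
        if cl == c:
            continue
        if -lit in cl and not any(l in y for l in cl):
            return False                      # resolvable, unsatisfied clause
    return True

# Example 3: C1 = (-x3 v x4) is blocked at x3 in subspace y* = (y1 = 1),
# because C2 = (y1 v x3), the only clause with literal x3, contains y1.
F = [[-3, 4], [1, 3], [1, -4], [2, -4]]       # y1=1, y2=2, x3=3, x4=4
assert is_blocked(F, [-3, 4], 3, {1})
```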
6.2 Discussion
Consider the PQE problem of taking a clause C out of ∃X[F(X, Y)]. There are two features of PQE that make it easier than QE. The first feature, mentioned earlier, is that one can ignore the subspaces y⃗ where F \ {C} implies C. The second feature is that when Fy⃗ is satisfiable, one only needs to prove redundancy of the clause C alone. Among the three algorithms we ran in the experiments, namely DS-PQE, EG-PQE and EG-PQE+, only the last exploits both features. (In addition to using DS-PQE inside EG-PQE+, we also ran it as a stand-alone PQE solver.) DS-PQE does not use the first feature [1] and EG-PQE does not exploit the second one. As we show in Sects. 7 and 8, this affects the performance of DS-PQE and EG-PQE.
² Let P(Ŝ) be an invariant for a circuit N depending only on a subset Ŝ of the state
variables S. Identifying P as an unwanted invariant is much easier if Ŝ is meaningful
from the high-level view of the design. Suppose, for instance, that assignments to Ŝ
specify values of a high-level variable v. Then P is unwanted if it claims unreachabil-
ity of a value of v that is supposed to be reachable. Another simple example is that
assignments to Ŝ specify values of high-level variables v and w that are supposed to
be independent. Then P is unwanted if it claims that some combinations of values of
v and w are unreachable. (This may mean, for instance, that an assignment operator
setting the value of v erroneously involves the variable w.)
Table 1. FIFO buffer with n elements of 32 bits. Time limit is 10 s per PQE problem. (ds = DS-PQE, eg = EG-PQE, eg+ = EG-PQE+)

buff. size n | latches | time frames | total PQE probs (ds/eg/eg+) | finished PQE probs (ds/eg/eg+) | unwant. invar. (ds/eg/eg+) | runtime, s (ds/eg/eg+)
8  | 300 | 5  | 1,236 / 311 / 8    | 2% / 36% / 35% | no / yes / yes  | 12,141 / 2,138 / 52
8  | 300 | 10 | 560 / 737 / 39     | 2% / 1% / 3%   | yes / yes / yes | 5,551 / 7,681 / 380
16 | 560 | 5  | 2,288 / 2,288 / 16 | 1% / 65% / 71% | no / no / yes   | 22,612 / 9,506 / 50
16 | 560 | 10 | 653 / 2,288 / 24   | 1% / 36% / 38% | yes / no / yes  | 6,541 / 16,554 / 153

8.1 Experiment 1
EG-PQE + found such an invariant after solving 8 problems. On the other hand,
DS -PQE failed to find an unwanted invariant and had to solve all 1,236 PQE
problems of taking out a clause of Fk with an unquantified variable. The following
three columns show the share of PQE problems finished in the time limit of 10 s.
For instance, EG-PQE finished 36% of 311 problems. The next three columns
show if an unwanted invariant was generated by a PQE solver. (EG-PQE and
EG-PQE + found one whereas DS -PQE did not.) The last three columns give
the total run time. Table 1 shows that only EG-PQE + managed to generate an
unwanted invariant for all four instances of Fifo. This invariant asserted that
Fifo cannot reach a state where an element of Data equals Val .
The bug above (or its modified version) can be overlooked by conventional meth-
ods. Consider, for instance, testing. It is hard to detect this bug by random tests
because it is exposed only if one tries to add Val to Fifo. The same applies to
testing using the line coverage metric [19]. On the other hand, a test set with
100% branch coverage [19] will find this bug. (To invoke the else branch of the
if statement marked with ‘*’ in Fig. 3, one must set dataIn to Val .) However, a
slightly modified bug can be missed even by tests with 100% branch coverage [5].
Now consider manual generation of unwanted properties. It is virtually
impossible to guess an unwanted invariant of Fifo exposing this bug unless one
knows exactly what this bug is. However, one can detect this bug by checking
a property asserting that the element dataIn must appear in the buffer if Fifo
is ready to accept it. Note that this is a non-invariant property involving states
of different time frames. The more time frames are used in such a property the
more guesswork is required to pick it. Let us consider a modified bug. Suppose Fifo does not reject the element Val, so the non-invariant property above holds. However, if dataIn == Val, then Fifo changes the previously accepted element if that element was Val too. So, Fifo cannot hold two consecutive elements equal to Val. Our method will detect this bug by generating an unwanted invariant falsified by states with consecutive elements Val. One can also identify this bug by checking a property involving two consecutive elements of Fifo, but picking such a property requires a lot of guesswork, and so the modified bug can easily be overlooked.
8.2 Experiment 2
The second experiment was an extension of the first one. Its goal was to show
that PQE can generate invariants for realistic designs. For each clause Q of a
local invariant H generated by PQE we used IC3 to verify if Q was a global
invariant. If so, we checked if Pagg ⇒ Q held. To make the experiment less time
consuming, in addition to the time limit of 10 s per PQE problem we imposed
a few more constraints. The PQE problem of taking a clause out of ∃Xk[Fk] was terminated as soon as H accumulated 5 clauses or more. Besides, processing a benchmark was aborted when the total number of clauses over all formulas H generated for this benchmark reached 100 or the total run time of all PQE problems generated off ∃Xk[Fk] exceeded 2,000 s.
Table 2 shows the results of the experiment. The third column gives the number of local single-clause invariants (i.e., the total number of clauses in all H over all benchmarks). The fourth column shows how many local single-clause invariants turned out to be global. (Since global invariants were extracted from H and the total size of all H could not exceed 100, the number of global invariants per benchmark could not exceed 100.) The last column gives the number of global invariants not implied by Pagg. So, these invariants are candidates for checking if they are unwanted. Table 2 shows that EG-PQE and EG-PQE+ performed much better than DS-PQE.

Table 2. Invariant generation

pqe solver | #benchmarks | local invar. | glob. invar. | not impl. by Pagg
ds-pqe     | 98          | 5,556        | 2,678        | 2,309
eg-pqe     | 98          | 9,498        | 4,839        | 4,009
eg-pqe+    | 98          | 9,303        | 4,773        | 3,940
8.3 Experiment 3
To prove an invariant P true, IC3 conjoins it with clauses Q1, . . . ,Qn to make
P ∧ Q1 ∧ · · · ∧ Qn inductive. If IC3 succeeds, every Qi is an invariant. More-
over, Qi may be an unwanted invariant. The goal of the third experiment was to
demonstrate that PQE and IC3 , in general, produce different invariant clauses.
The intuition here is twofold. First, IC3 generates clauses Qi to prove a prede-
fined invariant rather than find an unwanted one. Second, the closer P is to being inductive, the fewer new invariant clauses are generated by IC3. Consider the circuit Ntriv that simply stays in the initial state s⃗ini (Sect. 4). Any invariant satisfied by s⃗ini is already inductive for Ntriv. So, IC3 will not generate a single new invariant clause.
leave the initial state, Ntriv has unwanted invariants that our method will find.
In this experiment, we used IC3 to generate P∗agg, an inductive version of Pagg. The experiment showed that in 88% of cases, an invariant clause generated by EG-PQE+ and not implied by Pagg was not implied by P∗agg either. (More details about this experiment can be found in [5].)
The next column shows that 6s326 has 3,342 latches. The third column gives the number of time frames used to produce a combinational circuit Mk (here k = 20). The next column shows that the clause B introduced above consisted of 15 literals of variables from Sk. (Here and below we still use the index k assuming that k = 20.) The literals of B were generated randomly. When picking the length of B we just tried to simulate the situation where one wants to set a particular subset of output variables of Mk to specified values. The next two columns give the size of the subcircuit M′k of Mk that feeds the output variables present in B. When computing a property H we took a clause out of formula ∃S1,k[F′k ∧ B] where F′k specifies M′k, instead of formula ∃S1,k[Fk ∧ B] where Fk specifies Mk. (The logic of Mk not feeding a variable of B is irrelevant for computing H.) The first column of the pair gives the number of gates in M′k (i.e., 348,479). The second column provides the number of input variables feeding M′k (i.e., 1,774). Here we count only variables of V0 ∪ · · · ∪ Vk−1 and ignore those of S0 since the latter are already assigned values specifying the initial state s⃗ini of N.
The next four columns show the results of taking a clause out of ∃S1,k[F′k ∧ B]. For each PQE problem the time limit was set to 10 s. Besides, EG-PQE+ terminated as soon as 5 clauses of the property H(S0, V0, ..., Vk−1) were generated. The first three columns out of the four describe the minimum and maximum sizes of clauses in H and the run time of EG-PQE+. So, it took EG-PQE+ 2.9 s to produce a formula H containing clauses of sizes from 27 to 28 variables. A clause Q of H with 27 variables, for instance, specifies 2^1747 tests falsifying Q that produce the same output of M′k (falsifying the clause B). Here 1747 = 1774 − 27 is the number of input variables of M′k not present in Q. The last column shows that at least one clause Q of H specifies a property that cannot be produced by 3-valued simulation (a version of symbolic simulation [23]). To prove this, one just needs to set the input variables of M′k present in Q to the values falsifying Q and run 3-valued simulation. (The remaining input variables of M′k are assigned a don't-care value.) If after 3-valued simulation some output variable of M′k is assigned a don't-care value, the property specified by Q cannot be produced by 3-valued simulation.
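For illustration, here is a minimal sketch of the 3-valued simulation check described above, for a circuit given as a list of AND/NOT gates in topological order (the gate-list format is ours, chosen for brevity; it is not the format used in the experiments).

```python
# Minimal 3-valued (0/1/X) simulation; X is the don't-care value.
X = 'X'

def and3(a, b):
    if a == 0 or b == 0:
        return 0
    return 1 if (a == 1 and b == 1) else X

def not3(a):
    return X if a == X else 1 - a

def simulate3(gates, values):
    # gates: ('AND', out, in1, in2) or ('NOT', out, in1), in topological order
    for g in gates:
        if g[0] == 'AND':
            _, out, a, b = g
            values[out] = and3(values[a], values[b])
        else:
            _, out, a = g
            values[out] = not3(values[a])
    return values

# Check from the text: set the inputs present in Q to the values falsifying
# Q, set all other inputs to X, simulate, and test whether some output
# still evaluates to X.
```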
Running DS-PQE, EG-PQE and EG-PQE+ on the 1,586 PQE problems mentioned above showed that (a) EG-PQE performed poorly, producing properties only for 28% of the problems; (b) DS-PQE and EG-PQE+ showed much better results, generating properties for 62% and 66% of the problems respectively. When DS-PQE and EG-PQE+ succeeded in producing properties, the latter could not be obtained by 3-valued simulation in 74% and 78% of cases respectively.
10 Some Background
In this section, we discuss some research relevant to PQE and property genera-
tion. Information on BDD based QE can be found in [24,25]. SAT based QE is
described in [12,21,26–32]. Our first PQE solver called DS -PQE was introduced
in [1]. It was based on redundancy based reasoning presented in [33] in terms of
variables and in [34] in terms of clauses. The main flaw of DS -PQE is as follows.
References

1. Goldberg, E., Manolios, P.: Partial quantifier elimination. In: Yahav, E. (ed.) HVC 2014. LNCS, vol. 8855, pp. 148–164. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-13338-6_12
2. Craig, W.: Three uses of the Herbrand-Gentzen theorem in relating model theory and proof theory. J. Symbolic Logic 22(3), 269–285 (1957)
3. McMillan, K.L.: Interpolation and SAT-based model checking. In: Hunt, W.A., Somenzi, F. (eds.) CAV 2003. LNCS, vol. 2725, pp. 1–13. Springer, Heidelberg (2003). https://doi.org/10.1007/978-3-540-45069-6_1
4. Goldberg, E.: Property checking by logic relaxation. Technical report arXiv:1601.02742 [cs.LO] (2016)
5. Goldberg, E.: Partial quantifier elimination and property generation. Technical report arXiv:2303.13811 [cs.LO] (2023)
6. Goldberg, E., Manolios, P.: Software for quantifier elimination in propositional logic. In: ICMS 2014, Seoul, South Korea, 5–9 August 2014, pp. 291–294 (2014)
7. Goldberg, E.: Equivalence checking by logic relaxation. In: FMCAD 2016, pp. 49–56 (2016)
8. Goldberg, E.: Property checking without inductive invariant generation. Technical report arXiv:1602.05829 [cs.LO] (2016)
9. Berkeley Logic Synthesis and Verification Group: ABC: a system for sequential synthesis and verification (2017). www.eecs.berkeley.edu/~alanmi/abc
10. Kullmann, O.: New methods for 3-SAT decision and worst-case analysis. Theor. Comput. Sci. 223(1–2), 1–72 (1999)
11. Tseitin, G.: On the complexity of derivation in the propositional calculus. Zapiski nauchnykh seminarov LOMI, vol. 8, pp. 234–259 (1968). English translation: Consultants Bureau, N.Y., pp. 115–125 (1970)
12. McMillan, K.L.: Applying SAT methods in unbounded symbolic model checking. In: Brinksma, E., Larsen, K.G. (eds.) CAV 2002. LNCS, vol. 2404, pp. 250–264. Springer, Heidelberg (2002). https://doi.org/10.1007/3-540-45657-0_19
13. The source of DS-PQE. http://eigold.tripod.com/software/ds-pqe.tar.gz
14. The source of EG-PQE. http://eigold.tripod.com/software/eg-pqe.1.0.tar.gz
15. The source of EG-PQE+. http://eigold.tripod.com/software/eg-pqe-pl.1.0.tar.gz
16. Eén, N., Sörensson, N.: An extensible SAT-solver. In: SAT 2003, Santa Margherita Ligure, Italy, pp. 502–518 (2003)
17. Bradley, A.R.: SAT-based model checking without unrolling. In: Jhala, R., Schmidt, D. (eds.) VMCAI 2011. LNCS, vol. 6538, pp. 70–87. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-18275-4_7
18. An implementation of IC3 by A. Bradley. https://github.com/arbrad/IC3ref
19. Aniche, M.: Effective Software Testing: A Developer's Guide. Manning Publications (2022)
20. Hardware Model Checking Competition (HWMCC 2013) (2013). http://fmv.jku.at/hwmcc13/
21. Rabe, M.N.: Incremental determinization for quantifier elimination and functional synthesis. In: Dillig, I., Tasiran, S. (eds.) CAV 2019. LNCS, vol. 11562, pp. 84–94. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-25543-5_6
22. CADET. http://github.com/MarkusRabe/cadet
23. Bryant, R.: Symbolic simulation: techniques and applications. In: DAC 1990, pp. 517–521 (1990)
24. Bryant, R.: Graph-based algorithms for Boolean function manipulation. IEEE Trans. Comput. C-35(8), 677–691 (1986)
25. Chauhan, P., Clarke, E., Jha, S., Kukula, J., Veith, H., Wang, D.: Using combinatorial optimization methods for quantification scheduling. In: Margaria, T., Melham, T. (eds.) CHARME 2001. LNCS, vol. 2144, pp. 293–309. Springer, Heidelberg (2001). https://doi.org/10.1007/3-540-44798-9_24
26. Jin, H., Somenzi, F.: Prime clauses for fast enumeration of satisfying assignments to Boolean circuits. In: DAC 2005, pp. 750–753 (2005)
27. Ganai, M., Gupta, A., Ashar, P.: Efficient SAT-based unbounded symbolic model checking using circuit cofactoring. In: ICCAD 2004, pp. 510–517 (2004)
28. Jiang, J.-H.R.: Quantifier elimination via functional composition. In: Bouajjani, A., Maler, O. (eds.) CAV 2009. LNCS, vol. 5643, pp. 383–397. Springer, Heidelberg (2009). https://doi.org/10.1007/978-3-642-02658-4_30
29. Brauer, J., King, A., Kriener, J.: Existential quantification as incremental SAT. In: Gopalakrishnan, G., Qadeer, S. (eds.) CAV 2011. LNCS, vol. 6806, pp. 191–207. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-22110-1_17
30. Klieber, W., Janota, M., Marques-Silva, J., Clarke, E.: Solving QBF with free variables. In: Schulte, C. (ed.) CP 2013. LNCS, vol. 8124, pp. 415–431. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-40627-0_33
31. Bjørner, N., Janota, M., Klieber, W.: On conflicts and strategies in QBF. In: LPAR (2015)
32. Bjørner, N., Janota, M.: Playing with quantified satisfaction. In: LPAR (2015)
33. Goldberg, E., Manolios, P.: Quantifier elimination by dependency sequents. In: FMCAD 2012, pp. 34–44 (2012)
34. Goldberg, E., Manolios, P.: Quantifier elimination via clause redundancy. In: FMCAD 2013, pp. 85–92 (2013)
35. Goldberg, E.: Quantifier elimination with structural learning. Technical report arXiv:1810.00160 [cs.LO] (2018)
36. Goldberg, E.: Partial quantifier elimination by certificate clauses. Technical report arXiv:2003.09667 [cs.LO] (2020)
37. Dillig, I., Dillig, T., Li, B., McMillan, K.: Inductive invariant generation via abductive inference. ACM SIGPLAN Not. 48(10), 443–456 (2013)
38. Baumgartner, J., Mony, H., Case, M., Sawada, J., Yorav, K.: Scalable conditional equivalence checking: an automated invariant-generation based approach. In: Formal Methods in Computer-Aided Design, pp. 120–127 (2009)
Rounding Meets Approximate Model Counting

J. Yang and K. S. Meel
1 Introduction
Given a Boolean formula F , the problem of model counting is to compute the
number of models of F . Model counting is a fundamental problem in computer
science with a wide range of applications, such as control improvisation [13],
network reliability [9,28], neural network verification [2], probabilistic reason-
ing [5,11,20,21], and the like. In addition to myriad applications, the problem of
model counting is a fundamental problem in theoretical computer science. In his
seminal paper, Valiant showed that #SAT is #P-complete, where #P is the set
of counting problems whose decision versions lie in NP [28]. Subsequently, Toda
demonstrated the theoretical hardness of the problem by showing that every
problem in the entire polynomial hierarchy can be solved by just one call to a
#P oracle; more formally, PH ⊆ P^#P [27].
Given the computational intractability of #SAT, there has been sustained
interest in the development of approximate techniques from theoreticians and
median estimate to be wrong, either the event L happens in half of the invo-
cations of ApproxMCCore or the event U happens in half of the invocations
of ApproxMCCore. The number of repetitions depends on max(Pr[L], Pr[U ]).
The current algorithmic design (and ensuing analysis) of ApproxMCCore pro-
vides a weak upper bound on max{Pr[L], Pr[U ]}: in particular, the bounds on
max{Pr[L], Pr[U ]} and Pr[L∪U ] are almost identical. Our key technical contribu-
tion is to design a new procedure, ApproxMC6Core, based on the rounding tech-
nique that allows us to obtain significantly better bounds on max{Pr[L], Pr[U ]}.
The resulting algorithm, called ApproxMC6, follows a similar structure
to that of ApproxMC: it repeatedly invokes the underlying core procedure
ApproxMC6Core and returns the median of the estimates. Since a single invo-
cation of ApproxMC6Core takes as much time as ApproxMCCore, the reduction in
the number of repetitions is primarily responsible for the ensuing speedup. As
an example, for ε = 0.8, the number of repetitions of ApproxMC6Core to attain
δ = 0.1 and δ = 0.001 is just 5 and 19, respectively; the corresponding num-
bers for ApproxMC were 21 and 117. An extensive experimental evaluation on
1890 benchmarks shows that the rounding technique provided 4× speedup than
the state-of-the-art approximate model counter, ApproxMC. Furthermore, for a
given timeout of 5000 s, ApproxMC6 solves 204 more instances than ApproxMC
and achieves a reduction of 1063 s in the PAR-2 score.
The rest of the paper is organized as follows. We introduce notation and
preliminaries in Sect. 2. To place our contribution in context, we review related
works in Sect. 3. We identify the weakness of the current technique in Sect. 4 and
present the rounding technique in Sect. 5 to address this issue. Then, we present
our experimental evaluation in Sect. 6. Finally, we conclude in Sect. 7.
2 Preliminaries

Let F be a Boolean formula in conjunctive normal form (CNF), and let Vars(F)
be the set of variables appearing in F . The set Vars(F ) is also called the support
of F . An assignment σ of truth values to the variables in Vars(F ) is called a
satisfying assignment or witness of F if it makes F evaluate to true. We denote
the set of all witnesses of F by sol(F). Throughout the paper, we will use n to
denote |Vars(F )|.
The propositional model counting problem is to compute |sol(F)| for a given CNF formula F. A probably approximately correct (PAC) counter is a probabilistic algorithm ApproxCount(·, ·, ·) that takes as inputs a formula F, a tolerance parameter ε > 0, and a confidence parameter δ ∈ (0, 1], and returns an (ε, δ)-estimate c, i.e., Pr[ |sol(F)|/(1+ε) ≤ c ≤ (1+ε)·|sol(F)| ] ≥ 1 − δ. PAC guarantees are also sometimes referred to as (ε, δ)-guarantees.
A closely related notion is projected model counting, where we are interested
in computing the cardinality of sol(F) projected on a subset of variables P ⊆
Vars(F ). While for clarity of exposition, we describe our algorithm in the context
of model counting, the techniques developed in this paper are applicable to projected model counting as well.
Lemma 1. For 0 < p < 1/2,

η(t, ⌈t/2⌉, p) ∈ Θ( t^{−1/2} (2√(p(1−p)))^t )

Proof. Recall that η(t, m, p) = Σ_{k=m}^{t} C(t,k) p^k (1−p)^{t−k}. We derive an upper and a matching lower bound for η(t, ⌈t/2⌉, p). For the upper bound,

η(t, ⌈t/2⌉, p) = Σ_{k=⌈t/2⌉}^{t} C(t,k) p^k (1−p)^{t−k} ≤ C(t, ⌈t/2⌉) · Σ_{k=⌈t/2⌉}^{t} p^k (1−p)^{t−k} ≤ C(t, ⌈t/2⌉) · (p(1−p))^{t/2} · (1−p)/(1−2p),

where the first step uses C(t,k) ≤ C(t,⌈t/2⌉), and the second step sums the geometric series p^k(1−p)^{t−k} = (p(1−p))^{t/2}·(p/(1−p))^{k−t/2} with ratio p/(1−p) < 1. Since Stirling's approximation gives C(t, ⌈t/2⌉) ∈ Θ(t^{−1/2} 2^t), we obtain η(t, ⌈t/2⌉, p) ∈ O( t^{−1/2} (2√(p(1−p)))^t ). For the matching lower bound, we keep only the term k = ⌈t/2⌉:

η(t, ⌈t/2⌉, p) ≥ C(t, ⌈t/2⌉) · p^{⌈t/2⌉} (1−p)^{⌊t/2⌋} = C(t, ⌈t/2⌉) · (p(1−p))^{t/2} · (p/(1−p))^{⌈t/2⌉−t/2} ∈ Ω( t^{−1/2} (2√(p(1−p)))^t ),

again by Stirling's approximation, since the factor (p/(1−p))^{⌈t/2⌉−t/2} is bounded below by the constant √(p/(1−p)). Combining these two bounds, we conclude that η(t, ⌈t/2⌉, p) ∈ Θ( t^{−1/2} (2√(p(1−p)))^t ).
3 Related Work
The seminal work of Valiant established that #SAT is #P-complete [28]. Toda
later showed that every problem in the polynomial hierarchy could be solved
by just a polynomial number of calls to a #P oracle [27]. Based on Carter and
Wegman’s seminal work on universal hash functions [4], Stockmeyer proposed a
probabilistic polynomial time procedure, with access to an NP oracle, to obtain
an (ε, δ)-approximation of F [25].
Built on top of Stockmeyer’s work, the core theoretical idea behind the
hashing-based approximate solution counting framework, as presented in Algo-
rithm 1 (ApproxMC [7]), is to use 2-universal hash functions to partition the
solution space (denoted by sol(F) for a given formula F ) into small cells of
roughly equal size. A cell is considered small if the number of solutions it con-
tains is less than or equal to a pre-determined threshold, thresh. An NP oracle is
used to determine if a cell is small by iteratively enumerating its solutions until
either there are no more solutions or thresh + 1 solutions have been found. In
practice, a SAT solver is used to implement the NP oracle. To ensure a polyno-
mial number of calls to the oracle, the threshold, thresh, is set to be polynomial
in the input parameter ε at Line 1. The subroutine ApproxMCCore takes the
formula F and thresh as inputs and estimates the number of solutions at Line 7.
To determine the appropriate number of cells, i.e., the value of m for H(n, m),
ApproxMCCore uses a search procedure at Line 3 of Algorithm 2. The estimate
is calculated as the number of solutions in a randomly chosen cell, scaled by
the number of cells, i.e., 2m at Line 5. To improve confidence in the estimate,
ApproxMC performs multiple runs of the ApproxMCCore subroutine at Lines 5–
9 of Algorithm 1. The final count is computed as the median of the estimates
obtained at Line 10.
Algorithm 1. ApproxMC(F, ε, δ)
1: thresh ← 9.84 · (1 + ε/(1+ε)) · (1 + 1/ε)²;
2: Y ← BoundedSAT(F, thresh);
3: if (|Y| < thresh) then return |Y|;
4: t ← ⌈17 log₂(3/δ)⌉; C ← emptyList; iter ← 0;
5: repeat
6:   iter ← iter + 1;
7:   nSols ← ApproxMCCore(F, thresh);
8:   AddToList(C, nSols);
9: until (iter ≥ t);
10: finalEstimate ← FindMedian(C);
11: return finalEstimate;
4 Weakness of ApproxMC
As noted above, the core algorithm of ApproxMC has not changed since 2016,
and in this work, we aim to address the core limitation of ApproxMC. To put our
contribution in context, we first review ApproxMC and its core algorithm, ApproxMCCore. Let L denote the event that an individual estimate of |sol(F)| is less than |sol(F)|/(1+ε), i.e., that ApproxMCCore returns an estimate less than |sol(F)|/(1+ε). Similarly, let U denote the event that an individual estimate of |sol(F)| is greater than (1+ε)|sol(F)|. For simplicity of exposition, we assume t is odd; the current implementation indeed ensures that t is odd by choosing the smallest odd t for which Pr[Errort] ≤ δ.
In the remainder of the section, we will demonstrate that reducing
max {Pr [L] , Pr [U ]} can effectively reduce the number of repetitions t, mak-
ing the small-δ scenarios practical. To this end, we will first demonstrate that the existing analysis technique of ApproxMC leads to loose bounds on Pr[Errort]. We then present a new analysis that leads to tighter bounds on Pr[Errort].
The existing combinatorial analysis in [7] derives the following proposition:
Proposition 3. Assuming t is odd, we have

Pr[Errort] ≤ η(t, ⌈t/2⌉, Pr[L ∪ U])

Proposition 3 follows from the observation that if the median falls outside the PAC range, at least ⌈t/2⌉ of the results must also be outside the range. Requiring η(t, ⌈t/2⌉, Pr[L ∪ U]) ≤ δ, we can compute a valid t at Line 4 of ApproxMC.
Proposition 3 raises a question: can we derive a tight upper bound for
Pr [Errort ]? The following lemma provides an affirmative answer to this ques-
tion.
Lemma 2. Assuming t is odd, we have:

Pr[Errort] = η(t, ⌈t/2⌉, Pr[L]) + η(t, ⌈t/2⌉, Pr[U])

Proof. Let I_i^L (resp. I_i^U) be an indicator variable that is 1 when ApproxMCCore returns an estimate less than |sol(F)|/(1+ε) (resp. greater than (1+ε)|sol(F)|) in the i-th repetition. We first show that Errort ⇔ (Σ_{i=1}^{t} I_i^L ≥ ⌈t/2⌉) ∨ (Σ_{i=1}^{t} I_i^U ≥ ⌈t/2⌉).

For the right (⇒) implication: if the median is less than |sol(F)|/(1+ε), then, since at least half of the estimates are less than or equal to the median, at least ⌈t/2⌉ estimates are less than |sol(F)|/(1+ε); this implies Σ_{i=1}^{t} I_i^L ≥ ⌈t/2⌉. Similarly, in the case that the median is greater than (1+ε)|sol(F)|, since at least half of the estimates are greater than or equal to the median, at least ⌈t/2⌉ estimates are greater than (1+ε)|sol(F)|, thus formally implying Σ_{i=1}^{t} I_i^U ≥ ⌈t/2⌉.

On the other hand, we prove the left (⇐) implication. Given Σ_{i=1}^{t} I_i^L ≥ ⌈t/2⌉, more than half of the estimates are less than |sol(F)|/(1+ε), and therefore the median is less than |sol(F)|/(1+ε), violating the PAC guarantee. Similarly, given Σ_{i=1}^{t} I_i^U ≥ ⌈t/2⌉, more than half of the estimates are greater than (1+ε)|sol(F)|, and therefore the median is greater than (1+ε)|sol(F)|, violating the PAC guarantee. This concludes the proof of Errort ⇔ (Σ_{i=1}^{t} I_i^L ≥ ⌈t/2⌉) ∨ (Σ_{i=1}^{t} I_i^U ≥ ⌈t/2⌉). Then we obtain:

Pr[Errort] = Pr[ (Σ_{i=1}^{t} I_i^L ≥ ⌈t/2⌉) ∨ (Σ_{i=1}^{t} I_i^U ≥ ⌈t/2⌉) ]
= Pr[ Σ_{i=1}^{t} I_i^L ≥ ⌈t/2⌉ ] + Pr[ Σ_{i=1}^{t} I_i^U ≥ ⌈t/2⌉ ] − Pr[ (Σ_{i=1}^{t} I_i^L ≥ ⌈t/2⌉) ∧ (Σ_{i=1}^{t} I_i^U ≥ ⌈t/2⌉) ]

Given I_i^L + I_i^U ≤ 1 for i = 1, 2, ..., t, we have Σ_{i=1}^{t} (I_i^L + I_i^U) ≤ t; but if (Σ_{i=1}^{t} I_i^L ≥ ⌈t/2⌉) ∧ (Σ_{i=1}^{t} I_i^U ≥ ⌈t/2⌉) also held, we would obtain Σ_{i=1}^{t} (I_i^L + I_i^U) ≥ t + 1 (recall that t is odd), contradicting Σ_{i=1}^{t} (I_i^L + I_i^U) ≤ t. Hence, Pr[ (Σ_{i=1}^{t} I_i^L ≥ ⌈t/2⌉) ∧ (Σ_{i=1}^{t} I_i^U ≥ ⌈t/2⌉) ] = 0. From this, we can deduce:

Pr[Errort] = Pr[ Σ_{i=1}^{t} I_i^L ≥ ⌈t/2⌉ ] + Pr[ Σ_{i=1}^{t} I_i^U ≥ ⌈t/2⌉ ] = η(t, ⌈t/2⌉, Pr[L]) + η(t, ⌈t/2⌉, Pr[U])
Though Lemma 2 shows that reducing Pr [L] and Pr [U ] can decrease the error
probability, it is still uncertain to what extent Pr [L] and Pr [U ] affect the error
probability. To further understand this impact, the following lemma is presented
to establish a correlation between the error probability and t depending on Pr [L]
and Pr [U ].
Lemma 3. Let pmax = max{Pr[L], Pr[U]} with pmax < 0.5. Then we have

Pr[Errort] ∈ Θ( t^{−1/2} (2√(pmax(1−pmax)))^t )

Proof. Combining Lemmas 1 and 2, we obtain

Pr[Errort] ∈ Θ( t^{−1/2} [ (2√(Pr[L](1−Pr[L])))^t + (2√(Pr[U](1−Pr[U])))^t ] ) = Θ( t^{−1/2} (2√(pmax(1−pmax)))^t )
5.1 Algorithm
Algorithm 3. ApproxMC6(F, ε, δ)
1: thresh ← 9.84 · (1 + ε/(1+ε)) · (1 + 1/ε)²;
2: Y ← BoundedSAT(F, thresh);
3: if (|Y| < thresh) then return |Y|;
4: C ← emptyList; iter ← 0;
5: (roundUp, roundValue) ← configRound(ε);
6: t ← computeIter(ε, δ);
7: repeat
8:   iter ← iter + 1;
9:   nSols ← ApproxMC6Core(F, thresh, roundUp, roundValue);
10:  AddToList(C, nSols);
11: until (iter ≥ t);
12: finalEstimate ← FindMedian(C);
13: return finalEstimate;
The number of repetitions t is computed in Algorithm 6, following Lemma 2: the iteration count keeps increasing until the tight error bound is no more than δ. As we will show in Sect. 5.2, Pr[L] and Pr[U] depend on ε. In the loop of Lines 7–11, ApproxMC6Core repeatedly estimates |sol(F)|. Each estimate nSols is stored in list C, and the median of C serves as the final estimate satisfying the (ε, δ)-guarantee.
Algorithm 4 shows the pseudo-code of ApproxMC6Core. A random hash func-
tion is chosen at Line 1 to partition sol(F) into roughly equal cells. A random
hash value is chosen at Line 2 to randomly pick a cell for estimation. In Line 3,
we search for a value m such that the cell picked from 2m available cells is small
enough to enumerate solutions one by one while providing a good estimate of
|sol(F)|. In Line 4, bounded model counting is invoked to compute the size of the picked cell, i.e., Cnt_{F,m}. Finally, if roundUp equals 1, Cnt_{F,m} is rounded up to roundValue at Line 6. Otherwise, roundUp equals 0, and Cnt_{F,m} is rounded to roundValue at Line 8. Note that rounding up changes Cnt_{F,m} only if Cnt_{F,m} is less than roundValue, whereas rounding always returns roundValue no matter what value Cnt_{F,m} takes.
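The two rounding modes at Lines 6 and 8 amount to the following one-liner (our sketch of the rounding step only, not of ApproxMC6Core as a whole):

```python
# Rounding step of ApproxMC6Core (sketch): rounding up only raises a
# count below roundValue; plain rounding replaces the count outright.
def round_count(cnt, round_up, round_value):
    return max(cnt, round_value) if round_up else round_value
```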
For large ε (ε ≥ 3), ApproxMC6Core returns a value that is independent of the value returned by BoundedSAT in line 4 of Algorithm 4. However, observe that the returned value depends on the m found by LogSATSearch [8], which in turn uses BoundedSAT; therefore, the algorithm's run is not independent of all the calls to BoundedSAT. The technical reason for correctness stems from the observation that for large values of ε, we can always find a value of m such that 2^m × c (where c is a constant) is a (1 + ε)-approximation of |sol(F)|. As an example, consider n = 7 and let c = 1; then every number between 1 and 128 has a (1 + 3)-approximation in {1, 2, 4, 8, 16, 32, 64, 128}. Therefore, returning an answer of the form c × 2^m suffices as long as we are able to search for the right value of m, which is accomplished by LogSATSearch. We could skip the final call to BoundedSAT in line 4 of ApproxMC6Core for large values of ε, but that computation comes as part of LogSATSearch anyway.
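The claim about the example is easy to verify exhaustively (illustrative snippet):

```python
# For n = 7: every v in [1, 128] has a (1+3)-approximation of the form 2^m,
# i.e., v/4 <= 2^m <= 4*v for some m in {0, ..., 7}.
assert all(any(v / 4 <= 2**m <= 4 * v for m in range(8)) for v in range(1, 129))
```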
Algorithm 5. configRound(ε)
1: if (ε < √2 − 1) then return (1, (√(1+2ε)/2)·pivot);
2: else if (ε < 1) then return (1, pivot/√2);
3: else if (ε < 3) then return (1, pivot);
4: else if (ε < 4√2 − 1) then return (0, pivot);
5: else
6:   return (0, √2·pivot);
Lemma 4. The following bounds hold for ApproxMC6:

Pr[L] ≤ 0.262 if ε < √2 − 1; 0.157 if √2 − 1 ≤ ε < 1; 0.085 if 1 ≤ ε < 3; 0.055 if 3 ≤ ε < 4√2 − 1; 0.023 if ε ≥ 4√2 − 1

Pr[U] ≤ 0.169 if ε < 3; 0.044 if ε ≥ 3
The proof of Lemma 4 is deferred to Sect. 5.3. Observe that Lemma 4 influ-
ences the choices in the design of configRound (Algorithm 5). Recall that
max{Pr[L], Pr[U]} ≤ 0.36 for ApproxMC (Appendix C), but Lemma 4 ensures max{Pr[L], Pr[U]} ≤ 0.262 for ApproxMC6. For ε ≥ 4√2 − 1, Lemma 4 even delivers max{Pr[L], Pr[U]} ≤ 0.044.
Algorithm 6. computeIter(ε, δ)
1: iter ← 1;
2: while (η(iter, ⌈iter/2⌉, Prε[L]) + η(iter, ⌈iter/2⌉, Prε[U]) > δ) do
3:   iter ← iter + 2;
4: return iter;
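Algorithm 6 is easy to reproduce. The snippet below is our illustrative reimplementation (not the ApproxMC6 source); with the Pr[L] and Pr[U] bounds of Lemma 4 for √2 − 1 ≤ ε < 1 it yields the 19 repetitions quoted for δ = 0.001.

```python
from math import comb

def eta(t, m, p):
    # Probability that at least m of t independent events of probability p occur.
    return sum(comb(t, k) * p**k * (1 - p)**(t - k) for k in range(m, t + 1))

def compute_iter(pr_l, pr_u, delta):
    t = 1
    while eta(t, (t + 1) // 2, pr_l) + eta(t, (t + 1) // 2, pr_u) > delta:
        t += 2                                # keep t odd
    return t

print(compute_iter(0.157, 0.169, 0.001))      # -> 19
```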
The following theorem analytically presents the gap between the error probability of ApproxMC6 and that of ApproxMC.¹

Theorem 1. For √2 − 1 ≤ ε < 1,

Pr[Errort] ∈ O( t^{−1/2} 0.75^t ) for ApproxMC6
Pr[Errort] ∈ O( t^{−1/2} 0.96^t ) for ApproxMC

Proof. Combining Lemmas 3 and 4, for ApproxMC6 we obtain

Pr[Errort] ∈ O( t^{−1/2} (2√(0.169(1 − 0.169)))^t ) ⊆ O( t^{−1/2} 0.75^t ),

and for ApproxMC, using max{Pr[L], Pr[U]} ≤ 0.36,

Pr[Errort] ∈ O( t^{−1/2} (2√(0.36(1 − 0.36)))^t ) = O( t^{−1/2} 0.96^t ).
Figure 1 visualizes the large gap between the error probability of ApproxMC6 and that of ApproxMC. The x-axis represents the number of repetitions (t) in ApproxMC6 or ApproxMC. The y-axis represents the upper bound on the error probability, in log scale. For example, at t = 117, ApproxMC guarantees that with probability at most 10⁻³ the median over 117 estimates violates the PAC guarantee. However, ApproxMC6 allows a much smaller error probability that is at most 10⁻¹⁵ for √2 − 1 ≤ ε < 1. The smaller error probability enables ApproxMC6 to perform fewer repetitions while providing the same level of theoretical guarantee. For example, given δ = 0.001 to ApproxMC, i.e., y = 0.001 in Fig. 1, ApproxMC requires 117 repetitions to obtain the given error probability. However, ApproxMC6 requires only 37 repetitions for ε < √2 − 1, 19 repetitions for √2 − 1 ≤ ε < 1, 17 repetitions for 1 ≤ ε < 3, 7 repetitions for 3 ≤ ε < 4√2 − 1, and 5 repetitions for ε ≥ 4√2 − 1 to obtain the same level of error probability. Consequently, ApproxMC6 obtains 3×, 6×, 7×, 17×, and 23× speedups, respectively, over ApproxMC.
¹ We state the result for the case √2 − 1 ≤ ε < 1. A similar analysis can be applied to other cases, which leads to an even bigger gap between ApproxMC6 and ApproxMC.
5.3 Proof of Lemma 4 for Case √2 − 1 ≤ ε < 1

We provide the full proof of Lemma 4 for the case √2 − 1 ≤ ε < 1. We defer the proof of the other cases to Appendix D.

Let Tm denote the event (Cnt_{F,m} < thresh), and let Lm and Um denote the events (Cnt_{F,m} < E[Cnt_{F,m}]/(1+ε)) and (Cnt_{F,m} > E[Cnt_{F,m}](1+ε)), respectively. To ease the proof, let Ûm denote (Cnt_{F,m} > E[Cnt_{F,m}](1 + ε/(1+ε))), and thereby Um ⊆ Ûm. Let m∗ = log₂|sol(F)| − log₂(pivot) + 1, such that m∗ is the smallest m satisfying (|sol(F)|/2^m)(1 + ε/(1+ε)) ≤ thresh − 1.
Let us first prove the lemmas used in the proof of Lemma 4.

Lemma 5. For every 0 < β < 1, γ > 1, and 1 ≤ m ≤ n, the following holds:
1. Pr[Cnt_{F,m} ≤ β·E[Cnt_{F,m}]] ≤ 1 / (1 + (1−β)²·E[Cnt_{F,m}])
2. Pr[Cnt_{F,m} ≥ γ·E[Cnt_{F,m}]] ≤ 1 / (1 + (γ−1)²·E[Cnt_{F,m}])

Proof. Statement 1 can be proved following the proof of Lemma 1 in [8]. For statement 2, we rewrite the left-hand side and apply Cantelli's inequality:

Pr[Cnt_{F,m} − E[Cnt_{F,m}] ≥ (γ−1)·E[Cnt_{F,m}]] ≤ σ²[Cnt_{F,m}] / (σ²[Cnt_{F,m}] + ((γ−1)·E[Cnt_{F,m}])²).

Finally, applying Eq. 2 completes the proof.
Lemma 6. Given √2 − 1 ≤ ε < 1, the following bounds hold:
1. Pr[T_{m∗−3}] ≤ 1/62.5
2. Pr[L_{m∗−2}] ≤ 1/20.68
3. Pr[L_{m∗−1}] ≤ 1/10.84
4. Pr[Û_{m∗}] ≤ 1/5.92
Proof. Following the proof of Lemma 2 in [8], we can prove statements 1, 2, and 3. To prove statement 4, replacing γ with (1 + ε/(1+ε)) in Lemma 5 and employing E[Cnt_{F,m∗}] ≥ pivot/2, we obtain

Pr[Û_{m∗}] ≤ 1 / (1 + (ε/(1+ε))²·pivot/2) ≤ 1/5.92.
Now we prove the upper bounds on Pr[L] and Pr[U] in Lemma 4 for √2 − 1 ≤ ε < 1. The proof for other ε is deferred to Appendix D due to the page limit.
Lemma 4. The following bounds hold for ApproxMC6:

Pr[L] ≤
  0.262 if ε < √2 − 1
  0.157 if √2 − 1 ≤ ε < 1
  0.085 if 1 ≤ ε < 3
  0.055 if 3 ≤ ε < 4√2 − 1
  0.023 if ε ≥ 4√2 − 1

Pr[U] ≤
  0.169 if ε < 3
  0.044 if ε ≥ 3
Proof. We prove the case of √2 − 1 ≤ ε < 1. The proof for other ε is deferred to Appendix D. Let us first bound Pr[L]. Following LogSATSearch in [8], we have

Pr[L] = Pr[ ⋃_{i∈{1,...,n}} (¬T_{i−1} ∩ T_i ∩ L_i) ]   (3)

O1: ∀i ≤ m∗ − 3, T_i ⊆ T_{i+1}. Therefore,

⋃_{i∈{1,...,m∗−3}} (¬T_{i−1} ∩ T_i ∩ L_i) ⊆ ⋃_{i∈{1,...,m∗−3}} T_i ⊆ T_{m∗−3}

O2: For i ∈ {m∗ − 2, m∗ − 1}, we have ¬T_{i−1} ∩ T_i ∩ L_i ⊆ L_i.

O3: ∀i ≥ m∗, since Cnt_{F,i} is rounded up to pivot/√2 and m∗ ≥ log₂|sol(F)| − log₂(pivot), we have 2^i × Cnt_{F,i} ≥ 2^{m∗} × pivot/√2 ≥ |sol(F)|/√2 ≥ |sol(F)|/(1+ε). The last inequality follows from ε ≥ √2 − 1. Then we have Cnt_{F,i} ≥ E[Cnt_{F,i}]/(1+ε). Therefore, L_i = ∅ for i ≥ m∗ and we have

⋃_{i∈{m∗,...,n}} (¬T_{i−1} ∩ T_i ∩ L_i) = ∅
Following the observations O1, O2, and O3, we simplify Eq. 3 and obtain

Pr[L] ≤ Pr[T_{m∗−3}] + Pr[L_{m∗−2}] + Pr[L_{m∗−1}] ≤ 1/62.5 + 1/20.68 + 1/10.84 ≤ 0.157.

To bound Pr[U], we have

Pr[U] ≤ Pr[Û_{m∗}] ≤ 1/5.92 ≤ 0.169,

where the last steps follow from statement 4 of Lemma 6.
Rounding works because we can round the count to a value such that Lm∗ becomes an empty event with zero probability while Um∗ remains unchanged. To make Lm∗ empty, we need

2^{m∗} × roundValue ≥ 2^{m∗} × pivot/(1+ε) ≥ |sol(F)|/(1+ε)   (6)

where the last inequality follows from m∗ ≥ log₂|sol(F)| − log₂(pivot). To keep Um∗ unchanged, we need

2^{m∗} × roundValue ≤ 2^{m∗} × pivot·(1+ε)/2 ≤ (1+ε)|sol(F)|   (7)

where the last inequality follows from m∗ ≤ log₂|sol(F)| − log₂(pivot) + 1. Combining Eqs. 6 and 7 together, we obtain

2^{m∗} × pivot/(1+ε) ≤ 2^{m∗} × pivot·(1+ε)/2

which gives us ε ≥ √2 − 1. Similarly, we can derive the other breakpoints.
6 Experimental Evaluation
Table 1. The number of solved instances and PAR-2 score for ApproxMC6 versus ApproxMC4 on 1890 instances. The geometric mean of the speedup of ApproxMC6 over ApproxMC4 is also reported.

            | ApproxMC4 | ApproxMC6
# Solved    | 998       | 1202
PAR-2 score | 4934      | 3871
Speedup     | (baseline)| 4.68
[Figure (cactus plot): runtime in seconds (y-axis, 0–5000) vs. instance index (x-axis, 0–1200) for ApproxMC6 and ApproxMC4.]
In Fig. 3, the x-axis represents the instances sorted in ascending order by the number of solutions, and the y-axis represents the number of solutions on a log scale.
from ApproxMC6 should be within the range of |sol(F)| · 1.8 and |sol(F)|/1.8 with
probability 0.999, where |sol(F)| denotes the exact count returned by Ganak.
The range is indicated by the upper and lower bounds, represented by the
curves y = |sol(F)| · 1.8 and y = |sol(F)|/1.8, respectively. Figure 3 shows
that the approximate counts from ApproxMC6 fall within the expected range
[|sol(F)|/1.8, |sol(F)| · 1.8] for all instances except for four points slightly above
the upper bound. These four outliers are due to a bug in the preprocessor Arjun
that probably depends on the version of the C++ compiler and will be fixed
in the future. We also calculated the observed error, which is the mean relative
difference between the approximate and exact counts in our experiments, i.e.,
max{finalEstimate/|sol(F)| − 1, |sol(F)|/finalEstimate − 1}. The overall observed
error was 0.1, which is significantly smaller than the theoretical error tolerance
of 0.8.
7 Conclusion
In this paper, we addressed the scalability challenges faced by ApproxMC in
the smaller δ range. To this end, we proposed a rounding-based algorithm,
ApproxMC6, which reduces the number of estimations required by 84% while
providing the same (ε, δ)-guarantees. Our empirical evaluation on 1890 instances
shows that ApproxMC6 solved 204 more instances and achieved a reduction in
PAR-2 score of 1063 s. Furthermore, ApproxMC6 achieved a 4× speedup over
ApproxMC on the instances both ApproxMC6 and ApproxMC could solve.
A Proof of Proposition 1

Proof. For y ∈ {0,1}^n and α^(m) ∈ {0,1}^m, let γ_{y,α^(m)} be an indicator variable that is 1 when h^(m)(y) = α^(m). According to the definition of a strongly 2-universal function, we obtain, for all distinct x, y ∈ {0,1}^n, E[γ_{y,α^(m)}] = 1/2^m and E[γ_{x,α^(m)} · γ_{y,α^(m)}] = 1/2^{2m}. To prove Eq. 1, we obtain

E[Cnt_{F,m}] = E[ Σ_{y∈sol(F)} γ_{y,α^(m)} ] = Σ_{y∈sol(F)} E[γ_{y,α^(m)}] = |sol(F)| / 2^m
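The equality is easy to observe empirically. The snippet below (purely illustrative, not solver code) samples random affine XOR hash functions, which form a strongly 2-universal family, and averages the size of a fixed cell.

```python
# Empirical check of E[Cnt_{F,m}] = |sol(F)| / 2^m with h(y) = A*y + b over GF(2).
import random

def random_hash(n, m):
    a = [[random.getrandbits(1) for _ in range(n)] for _ in range(m)]
    b = [random.getrandbits(1) for _ in range(m)]
    def h(y):
        return tuple((sum(r * v for r, v in zip(row, y)) + bi) % 2
                     for row, bi in zip(a, b))
    return h

n, m = 6, 2
solutions = [tuple((v >> i) & 1 for i in range(n)) for v in range(17)]  # |sol(F)| = 17
cell = (0,) * m
total = 0
trials = 20000
for _ in range(trials):
    h = random_hash(n, m)
    total += sum(1 for y in solutions if h(y) == cell)
print(total / trials, len(solutions) / 2**m)  # both close to 17/4 = 4.25
```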
B Weakness of Proposition 3

The following proposition states that Proposition 3 provides a loose upper bound for Pr[Errort].

Proposition 4. Assuming t is odd, there are outcomes counted by η(t, ⌈t/2⌉, Pr[L ∪ U]) that do not belong to the event Errort.

Proof. We will now construct a case counted by η(t, ⌈t/2⌉, Pr[L ∪ U]) but not contained within the event Errort. Let I_i^L be an indicator variable that is 1 when ApproxMCCore returns a nSols less than |sol(F)|/(1+ε), indicating the occurrence of event L in the i-th repetition. Let I_i^U be an indicator variable that is 1 when ApproxMCCore returns a nSols greater than (1+ε)|sol(F)|, indicating the occurrence of event U in the i-th repetition. Consider a scenario where I_i^L = 1 for i = 1, 2, ..., ⌈t/4⌉, I_j^U = 1 for j = ⌈t/4⌉ + 1, ..., ⌈t/2⌉, and I_k^L = I_k^U = 0 for k > ⌈t/2⌉. η(t, ⌈t/2⌉, Pr[L ∪ U]) represents Pr[ Σ_{i=1}^{t} (I_i^L ∨ I_i^U) ≥ ⌈t/2⌉ ]. We can see that this case is included in (Σ_{i=1}^{t} (I_i^L ∨ I_i^U) ≥ ⌈t/2⌉) and therefore counted by η(t, ⌈t/2⌉, Pr[L ∪ U]), since ⌈t/2⌉ estimates fall outside the PAC range. However, this case means that ⌈t/4⌉ estimates fall within the range less than |sol(F)|/(1+ε) and ⌈t/2⌉ − ⌈t/4⌉ estimates fall within the range greater than (1+ε)|sol(F)|, while the remaining ⌊t/2⌋ estimates correctly fall within the range [|sol(F)|/(1+ε), (1+ε)|sol(F)|]. The median of the estimates then lies within the PAC range, so the event Errort does not occur.
D Proof of Lemma 4

We restate the lemma below and prove the statements section by section. The proof for √2 − 1 ≤ ε < 1 has been shown in Sect. 5.3.

Lemma 4. The following bounds hold for ApproxMC6:

Pr[L] ≤
  0.262 if ε < √2 − 1
  0.157 if √2 − 1 ≤ ε < 1
  0.085 if 1 ≤ ε < 3
  0.055 if 3 ≤ ε < 4√2 − 1
  0.023 if ε ≥ 4√2 − 1

Pr[U] ≤
  0.169 if ε < 3
  0.044 if ε ≥ 3
D.1 Proof of Pr[L] ≤ 0.262 for ε < √2 − 1

We first consider two cases: E[Cnt_{F,m∗}] < ((1+ε)/2)·thresh and E[Cnt_{F,m∗}] ≥ ((1+ε)/2)·thresh, and then merge the results to complete the proof.

Case 1: E[Cnt_{F,m∗}] < ((1+ε)/2)·thresh
Lemma 7. Given ε < √2 − 1, the following bounds hold:
1. Pr[T_{m∗−2}] ≤ 1/29.67
2. Pr[L_{m∗−1}] ≤ 1/10.84

Proof. Let us first prove statement 1. For ε < √2 − 1, we have thresh < (2 − √2/2)·pivot and E[Cnt_{F,m∗−2}] ≥ 2·pivot. Therefore, Pr[T_{m∗−2}] ≤ Pr[Cnt_{F,m∗−2} ≤ (1 − √2/4)·E[Cnt_{F,m∗−2}]]. Finally, employing Lemma 5 with β = 1 − √2/4, we obtain

Pr[T_{m∗−2}] ≤ 1/(1 + (√2/4)²·2·pivot) ≤ 1/(1 + (√2/4)²·2·9.84·(1 + 1/(√2−1))²) ≤ 1/29.67.

To prove statement 2, we employ Lemma 5 with β = 1/(1+ε) and E[Cnt_{F,m∗−1}] ≥ pivot to obtain

Pr[L_{m∗−1}] ≤ 1/(1 + (1 − 1/(1+ε))²·E[Cnt_{F,m∗−1}]) ≤ 1/(1 + (1 − 1/(1+ε))²·9.84·(1 + 1/ε)²) = 1/10.84.

Then, we prove that Pr[L] ≤ 0.126 for E[Cnt_{F,m∗}] < ((1+ε)/2)·thresh.
Then, we prove that Pr [L] ≤ 0.126 for E CntF,m∗ < 2 thresh.
which can be simplified by the three observations labeled O1, O2 and O3 below.
154 J. Yang and K. S. Meel
O1 : ∀i ≤ m∗ − 2, Ti ⊆ Ti+1 . Therefore,
(Ti−1 ∩ Ti ∩ Li ) ⊆ Ti ⊆ Tm∗ −2
i∈{1,...,m∗ −2} i∈{1,...,m∗ −2}
O2 : For i = m∗ − 1, we have
Following the observations O1, O2 and O3, we simplify Eq. 3 and obtain
1 1 1 1
1
)2 ·E[CntF,m∗ ]
≤ 1
1+(1− 1+ε )2 · 1+ε
= 1+4.92(1+2ε) ≤ 5.92 .
1+(1− 1+ε 2 thresh
Then, we prove that Pr [L] ≤ 0.262 for E CntF,m∗ ≥ 1+ε 2 thresh.
which can be simplified by the three observations labeled O1, O2 and O3 below.
Rounding Meets Approximate Model Counting 155
O1 : ∀i ≤ m∗ − 1, Ti ⊆ Ti+1 . Therefore,
(Ti−1 ∩ Ti ∩ Li ) ⊆ Ti ⊆ Tm∗ −1
i∈{1,...,m∗ −1} i∈{1,...,m∗ −1}
O2 : For i = m∗ , we have
Following the observations O1, O2 and O3, we simplify Eq. 3 and obtain
D.2 Proof of Pr[L] ≤ 0.085 for 1 ≤ ε < 3

Now let us prove the statement for ApproxMC6: Pr[L] ≤ 0.085 for 1 ≤ ε < 3. We bound Pr[L] by Eq. 3, which can be simplified by the three observations labeled O1, O2 and O3 below.

O1: ∀i ≤ m∗ − 4, T_i ⊆ T_{i+1}. Therefore,

⋃_{i∈{1,...,m∗−4}} (¬T_{i−1} ∩ T_i ∩ L_i) ⊆ ⋃_{i∈{1,...,m∗−4}} T_i ⊆ T_{m∗−4}

O2: For i ∈ {m∗ − 3, m∗ − 2}, we have ¬T_{i−1} ∩ T_i ∩ L_i ⊆ L_i.

O3: ∀i ≥ m∗ − 1, since Cnt_{F,i} is rounded up to pivot and m∗ ≥ log₂|sol(F)| − log₂(pivot), we have 2^i × Cnt_{F,i} ≥ 2^{m∗−1} × pivot ≥ |sol(F)|/2 ≥ |sol(F)|/(1+ε). The last inequality follows from ε ≥ 1. Then we have Cnt_{F,i} ≥ E[Cnt_{F,i}]/(1+ε). Therefore, L_i = ∅ for i ≥ m∗ − 1 and we have

⋃_{i∈{m∗−1,...,n}} (¬T_{i−1} ∩ T_i ∩ L_i) = ∅

Following the observations O1, O2 and O3, we simplify Eq. 3 and obtain

Pr[L] ≤ Pr[T_{m∗−4}] + Pr[L_{m∗−3}] + Pr[L_{m∗−2}] ≤ 0.085.
D.3 Proof of Pr[L] ≤ 0.055 for 3 ≤ ε < 4√2 − 1

Lemma 10. Given 3 ≤ ε < 4√2 − 1, the following bound holds:

Pr[T_{m∗−3}] ≤ 1/18.19

Proof. For ε < 4√2 − 1, we have thresh < (2 − √2/8)·pivot and E[Cnt_{F,m∗−3}] ≥ 4·pivot. Therefore, Pr[T_{m∗−3}] ≤ Pr[Cnt_{F,m∗−3} ≤ (1/2 − √2/32)·E[Cnt_{F,m∗−3}]]. Finally, employing Lemma 5 with β = 1/2 − √2/32, we obtain

Pr[T_{m∗−3}] ≤ 1/(1 + (1 − (1/2 − √2/32))²·4·pivot) ≤ 1/(1 + (1 − (1/2 − √2/32))²·4·9.84·(1 + 1/(4√2−1))²) ≤ 1/18.19.

Now let us prove the statement for ApproxMC6: Pr[L] ≤ 0.055 for 3 ≤ ε < 4√2 − 1. We bound Pr[L] by Eq. 3, simplified by the observations below.

O1: ∀i ≤ m∗ − 3, T_i ⊆ T_{i+1}. Therefore,

⋃_{i∈{1,...,m∗−3}} (¬T_{i−1} ∩ T_i ∩ L_i) ⊆ ⋃_{i∈{1,...,m∗−3}} T_i ⊆ T_{m∗−3}

O2: ∀i ≥ m∗ − 2, since Cnt_{F,i} is rounded to pivot and m∗ ≥ log₂|sol(F)| − log₂(pivot), we have 2^i × Cnt_{F,i} ≥ 2^{m∗−2} × pivot ≥ |sol(F)|/4 ≥ |sol(F)|/(1+ε). The last inequality follows from ε ≥ 3. Then we have Cnt_{F,i} ≥ E[Cnt_{F,i}]/(1+ε). Therefore, L_i = ∅ for i ≥ m∗ − 2 and we have

⋃_{i∈{m∗−2,...,n}} (¬T_{i−1} ∩ T_i ∩ L_i) = ∅

Following the observations, we simplify Eq. 3 and obtain

Pr[L] ≤ Pr[T_{m∗−3}] ≤ 1/18.19 ≤ 0.055.
D.4 Proof of Pr[L] ≤ 0.023 for ε ≥ 4√2 − 1

Lemma 11. Given ε ≥ 4√2 − 1, the following bound holds:

Pr[T_{m∗−4}] ≤ 1/45.28

Proof. We have thresh < 2·pivot and E[Cnt_{F,m∗−4}] ≥ 8·pivot. Therefore, Pr[T_{m∗−4}] ≤ Pr[Cnt_{F,m∗−4} ≤ (1/4)·E[Cnt_{F,m∗−4}]]. Finally, employing Lemma 5 with β = 1/4, we obtain

Pr[T_{m∗−4}] ≤ 1/(1 + (1 − 1/4)²·8·pivot) ≤ 1/(1 + (1 − 1/4)²·8·9.84) ≤ 1/45.28.

Now let us prove the statement for ApproxMC6: Pr[L] ≤ 0.023 for ε ≥ 4√2 − 1.

Proof. We aim to bound Pr[L] by the following equation:

Pr[L] = Pr[ ⋃_{i∈{1,...,n}} (¬T_{i−1} ∩ T_i ∩ L_i) ]   (3 revisited)

O1: ∀i ≤ m∗ − 4, T_i ⊆ T_{i+1}. Therefore,

⋃_{i∈{1,...,m∗−4}} (¬T_{i−1} ∩ T_i ∩ L_i) ⊆ ⋃_{i∈{1,...,m∗−4}} T_i ⊆ T_{m∗−4}

O2: ∀i ≥ m∗ − 3, since Cnt_{F,i} is rounded to √2·pivot and m∗ ≥ log₂|sol(F)| − log₂(pivot), we have 2^i × Cnt_{F,i} ≥ 2^{m∗−3} × √2·pivot ≥ √2·|sol(F)|/8 ≥ |sol(F)|/(1+ε). The last inequality follows from ε ≥ 4√2 − 1. Then we have Cnt_{F,i} ≥ E[Cnt_{F,i}]/(1+ε). Therefore, L_i = ∅ for i ≥ m∗ − 3 and we have

⋃_{i∈{m∗−3,...,n}} (¬T_{i−1} ∩ T_i ∩ L_i) = ∅

Following these observations, we simplify Eq. 3 and obtain

Pr[L] ≤ Pr[T_{m∗−4}] ≤ 1/45.28 ≤ 0.023.

For the bounds on Pr[U], we have Pr[U] ≤ Pr[Û_{m∗}]. Moreover,

Pr[¬T_{m∗+1}] ≤ Pr[Cnt_{F,m∗+1} > 2(1 + ε/(1+ε))·E[Cnt_{F,m∗+1}]].

Employing Lemma 5 with γ = 2(1 + ε/(1+ε)) and E[Cnt_{F,m∗+1}] ≥ pivot/4, we obtain

Pr[¬T_{m∗+1}] ≤ 1/(1 + (1 + 2ε/(1+ε))²·pivot/4) = 1/(1 + 2.46·(3 + 1/ε)²) ≤ 1/(1 + 2.46·3²) ≤ 1/23.14.
References

1. Alur, R., et al.: Syntax-guided synthesis. In: Proceedings of FMCAD (2013)
2. Baluta, T., Shen, S., Shine, S., Meel, K.S., Saxena, P.: Quantitative verification of neural networks and its security applications. In: Proceedings of CCS (2019)
3. Beck, G., Zinkus, M., Green, M.: Automating the development of chosen ciphertext attacks. In: Proceedings of USENIX Security (2020)
4. Carter, J.L., Wegman, M.N.: Universal classes of hash functions. J. Comput. Syst. Sci. (1977)
5. Chakraborty, S., Fremont, D.J., Meel, K.S., Seshia, S.A., Vardi, M.Y.: Distribution-aware sampling and weighted model counting for SAT. In: Proceedings of AAAI (2014)
6. Chakraborty, S., Meel, K.S., Mistry, R., Vardi, M.Y.: Approximate probabilistic inference via word-level counting. In: Proceedings of AAAI (2016)
7. Chakraborty, S., Meel, K.S., Vardi, M.Y.: A scalable approximate model counter. In: Proceedings of CP (2013)
8. Chakraborty, S., Meel, K.S., Vardi, M.Y.: Algorithmic improvements in approximate counting for probabilistic inference: from linear to logarithmic SAT calls. In: Proceedings of IJCAI (2016)
9. Duenas-Osorio, L., Meel, K.S., Paredes, R., Vardi, M.Y.: Counting-based reliability estimation for power-transmission grids. In: Proceedings of AAAI (2017)
10. Ermon, S., Gomes, C.P., Sabharwal, A., Selman, B.: Embed and project: discrete sampling with universal hashing. In: Proceedings of NeurIPS (2013)
11. Ermon, S., Gomes, C.P., Sabharwal, A., Selman, B.: Taming the curse of dimensionality: discrete integration by hashing and optimization. In: Proceedings of ICML (2013)
12. Fichte, J.K., Hecher, M., Hamiti, F.: The model counting competition 2020. ACM J. Exp. Algorithmics (2021)
13. Gittis, A., Vin, E., Fremont, D.J.: Randomized synthesis for diversity and cost constraints with control improvisation. In: Proceedings of CAV (2022)
14. Gomes, C.P., Sabharwal, A., Selman, B.: Model counting: a new strategy for obtaining good bounds. In: Proceedings of AAAI (2006)
15. Hecher, M., Fichte, J.K.: Model counting competition 2021 (2021). https://www.mccompetition.org/2021/mc_description
16. Hecher, M., Fichte, J.K.: Model counting competition 2022 (2022). https://mccompetition.org/2022/mc_description
17. Ivrii, A., Malik, S., Meel, K.S., Vardi, M.Y.: On computing minimal independent support and its applications to sampling and counting. Constraints (2016)
18. Meel, K.S., Akshay, S.: Sparse hashing for scalable approximate model counting: theory and practice. In: Proceedings of LICS (2020)
19. Meel, K.S., et al.: Constrained sampling and counting: universal hashing meets SAT solving. In: Proceedings of Workshop on Beyond NP (BNP) (2016)
20. Roth, D.: On the hardness of approximate reasoning. Artif. Intell. (1996)
21. Sang, T., Bearne, P., Kautz, H.: Performing Bayesian inference by weighted model counting. In: Proceedings of AAAI (2005)
22. Soos, M., Gocht, S., Meel, K.S.: Tinted, detached, and lazy CNF-XOR solving and its applications to counting and sampling. In: Proceedings of CAV (2020)
23. Soos, M., Meel, K.S.: BIRD: engineering an efficient CNF-XOR SAT solver and its applications to approximate model counting. In: Proceedings of AAAI (2019)
24. Soos, M., Meel, K.S.: Arjun: an efficient independent support computation technique and its applications to counting and sampling. In: Proceedings of ICCAD (2022)
25. Stockmeyer, L.: The complexity of approximate counting. In: Proceedings of STOC (1983)
26. Teuber, S., Weigl, A.: Quantifying software reliability via model-counting. In: Proceedings of QEST (2021)
27. Toda, S.: On the computational power of PP and ⊕P. In: Proceedings of FOCS (1989)
28. Valiant, L.G.: The complexity of enumeration and reliability problems. SIAM J. Comput. (1979)
29. Yang, J., Chakraborty, S., Meel, K.S.: Projected model counting: beyond independent support. In: Proceedings of ATVA (2022)
30. Yang, J., Meel, K.S.: Engineering an efficient PB-XOR solver. In: Proceedings of CP (2021)
Satisfiability Modulo Finite Fields
1 Introduction
Finite fields are critical to the design of recent cryptosystems. For instance,
elliptic curve operations are defined in terms of operations in a finite field. Also,
Zero-Knowledge Proofs (ZKPs) and Multi-Party Computations (MPCs), pow-
erful tools for building secure and private systems, often require key properties
of the system to be expressed as operations in a finite field.
Field-based cryptosystems already safeguard everything from our money
to our privacy. Over 80% of our TLS connections, for example, use elliptic
curves [4,66]. Private cryptocurrencies [32,59,89] built on ZKPs have billion-
dollar market capitalizations [44,45]. And MPC protocols have been used to
operate auctions [17], facilitate sensitive cross-agency collaboration in the US
federal government [5], and compute cross-company pay gaps [8]. These systems
safeguard our privacy, assets, and government data. Their importance justifies
spending considerable effort to ensure that the systems are free of bugs that
could compromise the resources they are trying to protect; thus, they are prime
targets for formal verification.
However, verifying field-based cryptosystems is challenging, in part because
current automated verification tools do not reason directly about finite fields.
Many tools use Satisfiability Modulo Theories (SMT) solvers as a back-end [9,
27,33,93,95]. SMT solvers [7,10,12,20,26,35,73,76,77] are automated reasoners
that determine the satisfiability of formulas in first-order logic with respect to one
or more background theories. They combine propositional search with specialized
reasoning procedures for these theories, which model common data types such
as Booleans, integers, reals, bit-vectors, arrays, algebraic datatypes, and more.
Since SMT solvers do not currently support a theory of finite fields, SMT-based
tools must encode field operations using another theory.
There are two natural ways to represent finite fields using commonly sup-
ported theories in SMT, but both are ultimately inefficient. Recall that a finite
field of prime order can be represented as the integers with addition and multi-
plication performed modulo a prime p. Thus, field operations can be represented
using integers or bit-vectors: both support addition, multiplication, and mod-
ular reduction. However, both approaches fall short. Non-linear integer reason-
ing is notoriously challenging for SMT solvers, and bit-vector solvers perform
abysmally on fields of cryptographic size (hundreds of bits).
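For instance, a single field constraint encoded with integers looks as follows. This sketch uses the Z3 Python API purely for concreteness (our choice of solver and variable names; the paper's implementation targets cvc5): the modular reduction is made explicit through a quotient variable.

```python
# Integer encoding of the field constraint x * y = 1 (mod p), i.e., y is
# the multiplicative inverse of x. With x left symbolic, the constraint
# is non-linear, which is exactly where integer solvers struggle.
from z3 import Ints, Solver, And, sat

p = 2**61 - 1                     # a Mersenne prime; crypto fields are larger
x, y, k = Ints('x y k')

s = Solver()
s.add(And(0 <= x, x < p, 0 <= y, y < p))
s.add(x * y == 1 + k * p)         # x * y = 1 modulo p, with explicit quotient k
s.add(x == 12345)                 # fixing x makes this instance linear and easy

if s.check() == sat:
    print(s.model()[y])           # the inverse of 12345 modulo p
```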
In this paper, we develop for the first time a direct solver for finite fields
within an SMT solver. We use well-known ideas from computer algebra (specifi-
cally, Gröbner bases [21] and triangular decomposition [6,99]) to form the basis
of our decision procedure. However, we improve on this baseline in two impor-
tant ways. First, our decision procedure does not manipulate field polynomials
(i.e., those of the form X^p − X). As expected, this results in a loss of completeness
at the Gröbner basis stage. However, surprisingly, this often does not matter.
Furthermore, completeness is recovered during the model construction algorithm
(albeit in a rather rudimentary way). This modification turns out to be crucial for
obtaining reasonable performance. Second, we implement a proof-tracing mech-
anism in the Gröbner basis engine, thereby enabling it to compute unsatisfiable
cores, which is also very beneficial in the context of SMT solving. Finally, we
implement all of this as a theory solver for prime-order fields inside the cvc5
SMT solver.
To guide research in this area, we also give a first set of QF_FF (quantifier-free,
finite field) benchmarks, obtained from the domain of ZKP compiler correctness.
ZKP compilers translate from high-level computations (e.g., over Booleans, bit-
vectors, arrays, etc.) to systems of finite field constraints that are usable by ZKPs.
We instrument existing ZKP compilers to produce translation validation [86] ver-
ification conditions, i.e. conditions that represent desirable correctness properties
of a specific compilation. We give these compilers concrete Boolean computa-
tions (which we sample at random), and construct SMT formulas capturing the
correctness of the ZKP compilers’ translations of those computations into field
constraints. We represent the formulas using both our new theory of finite fields
and also the alternative theory encodings mentioned above.
We evaluate our tool on these benchmarks and compare it to the approaches
based on bit-vectors, integers, and pure computer algebra (without SMT). We
find that our tool significantly outperforms the other solutions. Compared to the
best previous solution (we list prior alternatives in Sect. 7), it is 6× faster and
it solves 2× more benchmarks.
In sum, our contributions are:
1. a definition of the theory of finite fields of prime order;
2. a decision procedure for this theory, based on Gröbner bases and triangular decomposition, that omits field polynomials;
3. a proof-tracing mechanism for the Gröbner basis engine that yields unsatisfiable cores, implemented together with the decision procedure in the cvc5 SMT solver; and
4. the first set of QF_FF benchmarks, which encode translation validation queries for ZKP compilers on Boolean computations.
In the rest of the paper, we discuss related work (§1.1), cover background
and notation (§2), define the theory of finite fields (§3), give a decision procedure
(§4), describe our implementation (§5), explain the benchmarks (§6), and report
on experiments (§7).
1.1 Related Work
There is a large body of work on computer algebra, with many algorithms implemented in various tools [1,18,31,37,49,52,58,72,100,101]. However, the focus of that line of work is on quickly constructing useful algebraic objects (e.g., a Gröbner basis), rather than on searching for a solution to a set of field constraints.
One line of recent work [54,55] by Hader and Kovács considers SMT-oriented
field reasoning. One difference with our work is that it scales poorly with field
size because it uses field polynomials to achieve completeness. Furthermore, their
solver is not public.
Others consider verifying field constraints used in ZKPs. One paper surveys
possible approaches [97], and another considers proof-producing ZKP compila-
tion [24]. However, neither develops automated, general-purpose tools.
Still other works study automated reasoning for non-linear arithmetic over reals and integers [3,23,25,29,47,60–62,70,74,96,98]. A key challenge there is reasoning about comparisons. We work over finite fields and do not consider comparisons, since they are not used in elliptic-curve arithmetic or in most ZKPs.
Further afield, researchers have developed techniques for verified algebraic
reasoning in proof assistants [15,64,75,79], with applications to mathemat-
ics [19, 28,51,65] and cryptography [39,40,85,91]. In contrast, our focus is on
fully automated reasoning about finite fields.
2 Background
2.1 Algebra
Here, we summarize algebraic definitions and facts that we will use; see [71,
Chapters 1 through 8] or [34, Part IV] for a full presentation.
Finite Fields. A finite field is a finite set equipped with binary operations + and × that have identities (0 and 1, respectively) and inverses (save that there is no multiplicative inverse for 0), and that satisfy associativity, commutativity, and distributivity. The order of a finite field is the size of the set. All finite fields have order q = p^e for some prime p (called the characteristic) and positive integer e. Such an integer q is called a prime power.
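Concretely, a field of prime order is just modular arithmetic (a minimal sketch of ours, in Python; inverses use Fermat's little theorem, a^(p−2) ≡ a^(−1) mod p for a ≠ 0):

    p = 13

    def add(a, b): return (a + b) % p
    def mul(a, b): return (a * b) % p
    def inv(a):    return pow(a, p - 2, p)  # valid for a != 0 in a prime field

    assert add(9, 7) == 3           # 16 mod 13
    assert mul(5, inv(5)) == 1      # every nonzero element has an inverse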
Up to isomorphism, the field of order q is unique and is denoted Fq, or F when the order is clear from context. The fields F_{q^d} for d > 1 are called extension fields of Fq. In contrast, Fq may be called the base field. We write F ⊂ G to indicate that F is a subfield of G.
The lexicographic ordering for monomials X1^{e1} · · · Xk^{ek} orders them lexicographically by the tuple (e1, . . . , ek). The graded-reverse lexicographic (grevlex) ordering compares total degrees e1 + · · · + ek first, breaking ties by a reverse lexicographic comparison. With respect to an ordering, lm(f) denotes the greatest monomial of a polynomial f.
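For example (our sketch, using sympy; Poly.LM returns the leading monomial under a given ordering), the two orderings can disagree:

    from sympy import symbols, Poly

    x1, x2, x3 = symbols('x1 x2 x3')
    f = x1*x2 + x3**3

    print(Poly(f, x1, x2, x3).LM(order='lex'))      # x1*x2: lex compares e1 first
    print(Poly(f, x1, x2, x3).LM(order='grevlex'))  # x3**3: total degree 3 beats 2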
Reduction. For polynomials p and d, if lm(d) divides a term t of p, then we say that p reduces to r modulo d (written p →d r) for r = p − (t / lm(d)) · d. For a set of polynomials D, we write p →D r if p →d r for some d ∈ D. Let →*D be the transitive closure of →D. We define p ⇒D r to hold when p →*D r and there is no r′ such that r →D r′.
Reduction is a sound—but incomplete—algorithm for ideal membership. That is, one can show that p ⇒D 0 implies p ∈ ⟨D⟩, but the converse does not hold in general.
Gröbner Bases. Define the s-polynomial of polynomials p and q by spoly(p, q) = p · lm(q) − q · lm(p). A Gröbner basis (GB) [21] is a set of polynomials P characterized by the following equivalent conditions: (i) for all p, q ∈ P, spoly(p, q) ⇒P 0; (ii) for every polynomial p, p ∈ ⟨P⟩ if and only if p ⇒P 0.
Gröbner bases are useful for deciding ideal membership. From the first charac-
terization, one can build algorithms for constructing a Gröbner basis for any
ideal [21]. Then, the second characterization gives an ideal membership test.
When P is a GB, the relation ⇒P is a function (i.e., →P is confluent), and it
can be efficiently computed [1,21]; thus, this test is efficient.
A Gröbner basis engine takes a set of generators G for some ideal I and
computes a Gröbner basis for I. We describe the high-level design of such engines
here. An engine constructs a sequence of bases G0 , G1 , G2 , . . . (with G0 = G)
until some Gi is a Gröbner basis. Each Gi is constructed from Gi−1 according to
one of three types of steps. First, for some p, q ∈ Gi−1 such that spoly(p, q) ⇒Gi−1
r = 0, the engine can set Gi = Gi−1 ∪ {r}. Second, for some p ∈ Gi−1 such that
p ⇒Gi−1 \{p} r = p, the engine can set Gi = (Gi−1 \ {p}) ∪ {r}. Third, for some
p ∈ Gi−1 such that p ⇒Gi−1 \{p} 0, the engine can set Gi = Gi−1 \ {p}. Notice
that all rules depend on the current basis; some add polynomials, and some
remove them. In general, it is unclear which sequence of steps will construct a
Gröbner basis most quickly: this is an active area of research [1,18,41,43].
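As a small illustration of the membership test (our sketch, using sympy over the rationals; sympy also accepts a modulus=p option for prime fields):

    from sympy import symbols, groebner, reduced

    x, y = symbols('x y')
    G = groebner([x**2 + y, x*y - 1], x, y, order='grevlex')

    f = x*(x**2 + y) + y*(x*y - 1)        # in the ideal by construction
    _, r = reduced(f, list(G), x, y, order='grevlex')
    print(r)                              # 0: f reduces to zero modulo the basis
    print(G.contains(f))                  # True: sympy's built-in membership test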
2.3 Zero-Knowledge Proofs
Zero-knowledge proofs allow one to prove that some secret data satisfies a public
property, without revealing the data itself. See [94] for a full presentation; we
give a brief overview here. There are two parties: a verifier V and a prover P. V
knows a public instance x and asks P to show that it has knowledge of a secret
witness w satisfying a public predicate φ(x, w). To do so, P runs an efficient
(i.e., polytime in a security parameter λ) proving algorithm Prove(φ, x, w) → π
and sends the resulting proof π to V. Then, V runs an efficient verification
algorithm Verify(φ, x, π) → {0, 1} that accepts or rejects the proof. A system for Zero-Knowledge Proofs of knowledge (ZKPs) is a (Prove, Verify) pair satisfying completeness, knowledge soundness, and zero-knowledge.
ZKP applications are manifold. ZKPs are the basis of private cryptocurren-
cies such as Zcash and Monero, which have a combined market capitalization
of $2.80B as of 30 June 2022 [44,45]. They’ve also been proposed for auditing
sealed court orders [46], operating private gun registries [63], designing privacy-
preserving middleboxes [53] and more [22,56].
This breadth of applications is possible because implemented ZKPs are very
general: they support any φ checkable in polytime. However, φ must be first
compiled to a cryptosystem-compatible computation language. The most com-
mon language is a rank-1 constraint system (R1CS). In an R1CS C, x and w are
together encoded as a vector z ∈ F^m. The system C is defined by three matrices A, B, C ∈ F^{n×m}; it is satisfied when Az ◦ Bz = Cz, where ◦ is the element-wise product. Thus, the predicate can be viewed as n distinct constraints, where constraint i has the form (Σ_j A_{ij} z_j)(Σ_j B_{ij} z_j) − (Σ_j C_{ij} z_j) = 0. Note that each constraint is a polynomial of degree at most 2 in m variables that z must be a zero of.
For security reasons, F must be large: its prime must have ≈255 bits.
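Checking that a candidate z satisfies an R1CS is then just modular linear algebra (our sketch; the single constraint, which enforces z1 · z2 = z3, and the toy modulus are arbitrary choices):

    import numpy as np

    p = 13
    # z = (1, z1, z2, z3), following the common convention that z[0] = 1
    A = np.array([[0, 1, 0, 0]])
    B = np.array([[0, 0, 1, 0]])
    C = np.array([[0, 0, 0, 1]])

    def satisfied(z):
        z = np.array(z) % p
        return np.array_equal((A @ z) * (B @ z) % p, (C @ z) % p)

    print(satisfied([1, 3, 4, 12]))  # True: 3 * 4 = 12 (mod 13)
    print(satisfied([1, 3, 4, 11]))  # False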
Encoding. The efficiency of the ZKP scales quasi-linearly with n. Thus, it is useful to encode φ as an R1CS with a minimal number of constraints. Since equisatisfiability—not logical equivalence—is needed, encodings may introduce new variables.
As an example, consider the Boolean computation a ← c1 ∨ · · · ∨ ck. Assume that c′1, . . . , c′k ∈ F are elements of z that are 0 or 1 and that satisfy ci ⇔ (c′i = 1). How can one ensure that a′ ∈ F (also in z) is 0 or 1 and a ⇔ (a′ = 1)? Given that there are k − 1 ORs, natural approaches use Θ(k) constraints. One clever approach is to introduce a variable x′ and enforce the constraints x′ · (Σ_i c′i) = a′ and (1 − a′) · (Σ_i c′i) = 0. If any ci is true, a′ must be 1 to satisfy the second constraint; setting x′ to the sum's inverse satisfies the first. If all ci are false, the first constraint ensures a′ is 0. This encoding is correct when the sum does not overflow; thus, k must be smaller than F's characteristic.
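The gadget's correctness is easy to confirm by brute force (our sketch; p = 97 and k = 5 are arbitrary choices with k smaller than the characteristic):

    from itertools import product

    p, k = 97, 5

    def gadget_ok(cs, a):
        # is there an x' with  x' * s = a'  and  (1 - a') * s = 0 ?
        s = sum(cs) % p
        return any((x * s) % p == a and ((1 - a) * s) % p == 0 for x in range(p))

    for cs in product([0, 1], repeat=k):
        want = 1 if any(cs) else 0
        assert gadget_ok(cs, want)           # the correct output always works
        assert not gadget_ok(cs, 1 - want)   # the wrong output never does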
Optimizations like this can be quite complex. Thus, ZKP programmers use
constraint synthesis libraries [14,69] or compilers [13,24,80,81,84,92,102] to gen-
erate an R1CS from a high-level description. Such tools support objects like
Booleans, fixed-width integers, arrays, and user-defined data-types. The correct-
ness of these tools is critical to the correctness of any system built with them.
² f(λ) ≤ negl(λ) if for all c ∈ ℕ, f(λ) = o(λ^{−c}).
2.4 SMT
We assume usual terminology for many-sorted first order logic with equality
( [38] gives a complete presentation). Let Σ be a many-sorted signature including
a sort Bool and symbol family ≈σ (abbreviated ≈) with sort σ × σ → Bool for
all σ in Σ. A theory is a pair T = (Σ, I), where Σ is a signature and I is a class
of Σ-interpretations. A Σ-formula φ is satisfiable (resp., unsatisfiable) in T if it
is satisfied by some (resp., no) interpretation in I. Given a (set of) formula(s) S,
we write S |=T φ if every interpretation M ∈ I that satisfies S also satisfies φ.
When using the CDCL(T ) framework for SMT, the reasoning engine for each
theory is encapsulated inside a theory solver. Here, we mention the fragment of CDCL(T) that is relevant for our purposes ([78] gives a complete presentation).
The goal of CDCL(T ) is to check a formula φ for satisfiability. A core mod-
ule manages a propositional search over the propositional abstraction of φ and
communicates with the theory solver. As the core constructs partial proposi-
tional assignments for the abstract formula, the theory solver is given the literals
that correspond to the current propositional assignment. When the propositional
assignment is completed (or, optionally, before), the theory solver must deter-
mine whether its literals are jointly satisfiable. If so, it must be able to provide
an interpretation in I (which includes an assignment to theory variables) that
satisfies them. If not, it may indicate a strict subset of the literals which are
unsatisfiable: an unsatisfiable core. Smaller unsatisfiable cores usually accelerate
the propositional search.
3 The Theory of Finite Fields
We define the theory TFq of the finite field Fq, for any order q. Its sort and
symbols are indexed by the parameter q; we omit q when clear from context.
The signature of the theory is given in Fig. 1. It includes sort F, which intu-
itively denotes the sort of elements of Fq and is represented in our proposed
SMT-LIB format as (_ FiniteField q). There is a constant symbol for each
element of Fq , and function symbols for addition and multiplication. Other finite
field operations (e.g., negation, subtraction, and inverses) naturally reduce to this
signature.
An interpretation M of TFq must interpret: F as Fq , n ∈ {0, . . . , q − 1}
as the nth element of Fq in lexicographical order,3 + as addition in Fq , × as
multiplication in Fq , and ≈ as equality in Fq .
Note that in order to avoid ambiguity, we require that the sort of any constant
ffn must be ascribed. For instance, the nth element of Fq would be (as ffn
(_ FiniteField q)). The sorts of non-nullary function symbols need not be
ascribed: they can be inferred from their arguments.
³ For non-prime Fp^e, we use the lexicographic ordering of elements represented as polynomials in Fp[X] modulo the Conway polynomial [83,90] Cp,e(X). This representation is standard [57].
1  Function DecisionProcedure:
       Input: A set of F-literals L in variables X
       Output: UNSAT and a core C ⊆ L, or
       Output: SAT and a model M : X → F
2      P ← ∅; Wi ← fresh, for all i
3      for (si ▷◁i ti) ∈ L do
4          if ▷◁i = ≈ then P ← P ∪ {⟦si⟧ − ⟦ti⟧}
5          else if ▷◁i = ≉ then P ← P ∪ {Wi · (⟦si⟧ − ⟦ti⟧) − 1}
6      B ← GB(P)
7      if 1 ⇒B 0 then
8          return UNSAT, CoreFromTree()
9      m ← FindZero(B)
10     if m = ⊥ then return UNSAT, L
11     else return SAT, {X ↦ z : (X ↦ z) ∈ m, X ∈ X}

Fig. 2. The decision procedure.
4 Decision Procedure
Recall (§2.4) that a CDCL(T ) theory solver for F must decide the satisfiability of
a set of F-literals. At a high level, our decision procedure comprises three steps.
First, we reduce to a problem concerning a single algebraic variety. Second, we
use a GB-based test for unsatisfiability that is fast and sound, but incomplete.
Third, we attempt model construction. Figure 2 shows pseudocode for the deci-
sion procedure; we will explain it incrementally.
Note that each p ∈ P_D has zeros for exactly the values of X where the polynomial ⟦si⟧ − ⟦ti⟧ it is derived from is not zero. Also note that P_D ⊂ Fq[X′], with X′ = X ∪ {Wi}_{i∈D}.
We define P to be P_E ∪ P_D (constructed in lines 2 to 6, Fig. 2) and note three useful properties of P. First, L is satisfiable if and only if V(P) is non-empty. Second, for any P′ ⊆ P, if V(P′) = ∅, then {π(p) : p ∈ P′} is an unsatisfiable core, where π maps a polynomial to the literal it is derived from. Third, from any x ∈ V(P) one can immediately construct a model. Thus, our theory solver reduces to understanding properties of the variety V(P).
Recall (§2.2) that if 1 ∈ ⟨P⟩, then V(P) is empty. We can answer this ideal membership query using a Gröbner basis engine (line 7, Fig. 2). Let GB be a
subroutine that takes a list of polynomials and computes a Gröbner basis for the
ideal that they generate, according to some monomial ordering. We use grevlex:
the ordering for which GB engines are typically most efficient [42]. We compute
GB (P ) and check whether 1 ⇒GB(P ) 0. If so, we report that V(P ) is empty. If
not, recall (§2.2) that V(P ) may still be empty; we proceed to attempt model
construction (lines 9 to 11, Fig. 2, described in the next subsection).
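In miniature, lines 2 to 7 of the procedure look as follows (our sketch, using sympy over F_7; model construction and core extraction are omitted). The literals x ≈ y and x ≉ y are jointly unsatisfiable, and the Gröbner basis duly collapses to {1}:

    from sympy import symbols, groebner

    x, y, W = symbols('x y W')

    # x = y contributes x - y; x != y contributes W*(x - y) - 1, which has a
    # zero (for a suitable W) exactly where x - y is nonzero
    P = [x - y, W * (x - y) - 1]
    G = groebner(P, x, y, W, modulus=7, order='grevlex')
    print(list(G))   # [1]: the ideal is trivial, so the literals are UNSAT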
If 1 does reduce by the Gröbner basis, then identifying a subset of P which
is sufficient to reduce 1 yields an unsatisfiable core. To construct such a subset,
we formalize the inferences performed by the Gröbner basis engine as a calculus
for proving ideal membership.
Figure 4 presents IdealCalc, our ideal membership calculus. IdealCalc proves facts of the form p ∈ ⟨P⟩, where p is a polynomial and P is the set of generators for an ideal. The G rule states that the generators are in the ideal. The Z rule states that 0 is in the ideal. The S rule states that for any two polynomials in the ideal, their s-polynomial is in the ideal too. The R↑ and R↓ rules state that if p →q r with q in the ideal, then p is in the ideal if and only if r is.
The soundness of IdealCalc follows immediately from the definition of an ideal.
Completeness relies on the existence of algorithms for computing Gröbner bases
using only s-polynomials and reduction [21,41,43]. We prove both properties in
Appendix A.
Fig. 4. The rules of IdealCalc:
  Z:   0 ∈ ⟨P⟩ (no premises)
  G:   from p ∈ P, conclude p ∈ ⟨P⟩
  S:   from p ∈ ⟨P⟩ and q ∈ ⟨P⟩, conclude spoly(p, q) ∈ ⟨P⟩
  R↓:  from p ∈ ⟨P⟩ and q ∈ ⟨P⟩ with p →q r, conclude r ∈ ⟨P⟩
  R↑:  from r ∈ ⟨P⟩ and q ∈ ⟨P⟩ with p →q r, conclude p ∈ ⟨P⟩
1  Function FindZero:
       Input: A Gröbner basis B ⊂ F[X′]
       Input: A partial map M : X′ → F (empty by default)
       Output: A total map M : X′ → F, or ⊥
2      if 1 ∈ B then return ⊥
3      if |M| = |X′| then return M
4      for (Xi ↦ z) ∈ ApplyRule(B, M) do
5          r ← FindZero(GB(B ∪ {Xi − z}), M ∪ {Xi ↦ z})
6          if r ≠ ⊥ then return r
7      return ⊥
Fig. 5. Finding common zeros for a Gröbner basis. After handling trivial cases,
FindZero uses ApplyRule to apply the first applicable rule from Fig. 6.
By instrumenting a Gröbner basis engine and reduction engine, one can construct IdealCalc proof trees. Then, for a conclusion 1 ∈ ⟨P⟩, traversing the proof tree to its leaves gives a subset P′ ⊆ P such that 1 ∈ ⟨P′⟩. The procedure CoreFromTree (called in line 8, Fig. 2) performs this traversal by accessing a proof tree recorded by the GB procedure and the reductions. The proof of Theorem 2 explains our instrumentation in more detail (Appendix A).
Fig. 6. The branching rules used by ApplyRule (the first applicable rule is used):
  Univariate:  if some p ∈ B is univariate in Xi with Xi ∉ M, let Z = UnivariateZeros(p); branch on (Xi ↦ z) for each z ∈ Z.
  Triangular:  if Dim(⟨B⟩) = 0 and Xi ∉ M, let p = MinPoly(B, Xi) and Z = UnivariateZeros(p); branch on (Xi ↦ z) for each z ∈ Z.
  Exhaust:     for Xi ∉ M, branch on (Xi ↦ z) for each z ∈ F.
X1 + X2 + X3 + X4 + X5 = 0
X1X2 + X2X3 + X3X4 + X4X5 + X5X1 = 0
X1X2X3 + X2X3X4 + X3X4X5 + X4X5X1 + X5X1X2 = 0
X1X2X3X4 + X2X3X4X5 + X3X4X5X1 + X4X5X1X2 + X5X1X2X3 = 0
X1X2X3X4X5 = 1
⁴ The dimension of an ideal is a natural number that can be efficiently computed from a Gröbner basis. If the dimension is zero, then one can efficiently compute a minimal polynomial in any variable X, given a Gröbner basis [2,68].
This system is interpreted in F_{394357} [68]. It is unsatisfiable: it has dimension 0, and its ideal does not contain 1. Moreover, our solver computes a (reduced) Gröbner basis for it that does not contain any univariate polynomials. Thus, Univariate does not apply. However, Triangular does, and with it, FindZero quickly terminates. Without Triangular, Exhaust would create at least |F| branches.
In the above examples, Exhaust performs very poorly. However, that is not
always the case. For example, in the system X1 + X2 = 0, using Exhaust to guess
X1 , and then using the univariate rule to determine X2 is quite reasonable. In
general, Exhaust is a powerful tool for solving underconstrained systems. Our
experiments will show that despite including Exhaust, our procedure performs
quite well on our benchmarks. We reflect on its performance in Sect. 8.
5 Implementation
We have implemented our decision procedure for prime fields in the cvc5 SMT
solver [7] as a theory solver. It is exposed through cvc5’s SMT-LIB, C++, Java,
and Python interfaces. Our implementation comprises ≈2k lines of C++. For the
algebraic sub-routines of our decision procedure (§4), it uses CoCoALib [1]. To
compute unsatisfiable cores (§4.2), we inserted hooks into CoCoALib’s Gröbner
basis engine (17 lines of C++).
Our theory solver makes sparse use of the interface between it and the rest
of the SMT solver. It acts only once a full propositional assignment has been
constructed. It then runs the decision procedure, reporting either satisfiability
(with a model) or unsatisfiability (with an unsatisfiable core).
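For instance, a query over F_17 might look as follows through the Python interface (a hypothetical sketch: the names mkFiniteFieldSort, mkFiniteFieldElem, and the FINITE_FIELD_* kinds reflect our understanding of the cvc5 1.x API and may differ across versions):

    import cvc5
    from cvc5 import Kind

    slv = cvc5.Solver()
    slv.setLogic("QF_FF")

    F = slv.mkFiniteFieldSort("17")        # the sort (_ FiniteField 17)
    x = slv.mkConst(F, "x")
    three = slv.mkFiniteFieldElem("3", F)  # an ascribed field constant

    # assert x * x = 3; since 3 is not a square mod 17, we expect unsat
    xx = slv.mkTerm(Kind.FINITE_FIELD_MULT, x, x)
    slv.assertFormula(slv.mkTerm(Kind.EQUAL, xx, three))
    print(slv.checkSat())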
6 Benchmark Generation
Recall that one motivation for this work is to enable translation validation for
compilers to field constraint systems (R1CSs) used in zero-knowledge proofs
(ZKPs). Our benchmarks are SMT formulas that encode translation validation
queries for compilers from Boolean computations to R1CS. At a high level, each
benchmark is generated as follows.
⁵ We add field polynomials to our procedure on line 2, Fig. 2. This renders our ideal triviality test (lines 7 and 8) complete, so we can eliminate the fallback to FindZero.
6.1 Examples
We describe our benchmark generator in full and give the definitions of soundness and determinism in Appendix C. Here, we give three example benchmarks. Our examples are based on the Boolean formula Ψ(x1, x2, x3, x4) = x1 ∨ x2 ∨ x3 ∨ x4. Our convention is to mark field variables with a prime, but not Boolean variables. Using the technique from Sect. 2.3, CirC compiles this formula to the two-constraint system x′ · s = r′ ∧ (1 − r′) · s = 0, where s = Σ_i x′_i. Each Boolean input xi corresponds to field element x′_i, and r′ corresponds to the result of Ψ.

Soundness. The soundness benchmark asserts that the constraints force the output to be correct:

    (constraints hold)  =⇒  (r′ = 0 ∨ r′ = 1) ∧ (r′ = 1 ⇐⇒ Ψ)    (output is correct)

where Ψ and s are defined as above. This is an UNSAT benchmark, because the formula is valid.
Determinism. The determinism benchmark states that the constraints determine the output: two copies of the constraint system that share the same inputs must agree, i.e., the conjoined systems imply r′ = r′′ (the outputs agree). This, too, is an UNSAT benchmark.
Unsoundness. Removing constraints from the system can give a formula that is not valid (a SAT benchmark). For example, if we remove (1 − r′) · s = 0, then the soundness formula is falsified by the assignment {xi ↦ ⊤, x′_i ↦ 1, r′ ↦ 0, x′ ↦ 0}.
7 Experiments
Our experiments show that our approach substantially outperforms solvers based on bit-vectors, integers, and pure computer algebra, and that it scales well with field size.
Our test bed is a cluster with Intel Xeon E5-2637 v4 CPUs. Each run is
limited to one physical core, 8GB memory, and 300s.
Throughout, we generate benchmarks for two correctness properties (sound-
ness and determinism), three different ZKP compilers, and three different sta-
tuses (sat, unsat, and unknown). We vary the field size, encoding, number of
inputs, and number of terms, depending on the experiment. We evaluate our
cvc5 extension, Bitwuzla (commit 27f6291), and z3 (version 4.11.2).
Fig. 7. (a) Instances solved, as a cactus plot over the bit-vector solvers (bv-bitwuzla, bv-cvc5, bv-z3) and our field solver (ff-cvc5). (b) Total solve time for (field-based) cvc5 and (BV-based) Bitwuzla on commonly solved instances at all bit-widths.
The bit-vector solvers cannot handle fields this large. Thus, for this set of experiments we use b ∈ {5, 10, . . . , 60}, and we sample formulas with 4 inputs and 8 intermediate terms.
Figure 7a shows performance of three bit-vector solvers (cvc5 [7], Bitwu-
zla [76], and z3 [73]) and our F solver as a cactus plot; Table 1 splits the solved
instances by property and status. We see that even for these small bit-widths,
the field-based approach is already superior. The bit-vector solvers are more
competitive on the soundness benchmarks, since these benchmarks include only
half as many field operations as the determinism benchmarks.
For our benchmarks, Bitwuzla is the most efficient BV solver. We further
examine the time that it and our solver take to solve the 9 benchmarks they can
both solve at all bit-widths. Figure 7b plots the total solve time against b. While
the field-based solver’s runtime is nearly independent of field size, the bit-vector
solvers slow down substantially as the field grows.
In sum, the BV approach scales poorly with field size and is already inferior on fields of size at least 2^40.
Fig. 8. Solve times with and without field polynomials, on log-scale axes with 10× and 100× guide lines. (a) All benchmarks, both configurations. (b) Each series is one property (sound or deterministic) at different numbers of bits. The field size varies from 4 to 12 bits. The benchmarks are all SAT or unknown.
Fig. 9. Solve-time scatter plots on log-scale axes with 10× and 100× guide lines: (a) our SMT solver with and without UNSAT cores; (b) our SMT solver compared with a pure computer algebra system.

Fig. 10. Instances solved in the main experiment, as a cactus plot over bv-bitwuzla, ff-cvc5, nia-cvc5, nia-z3, and pureff-cvc5.
of theory literals, so the SMT core makes only one theory query. For them,
returning a good UNSAT core has no benefit—but also induces little overhead.
In our main experiment, we compare our approach against all reasonable alter-
natives: a pure computer-algebra approach (§7.4), a BV approach with Bitwuzla
(the best BV solver on our benchmarks, §7.1), an NIA approach with cvc5 and
z3, and our own tool without UNSAT cores (§7.3). We use the same benchmark
set as the last experiment; this uses a 255-bit field.
Figure 10 shows the results as a cactus plot. Table 2 shows the number of
solved instances for each system, split by property and status. Bitwuzla quickly
runs out of memory on most of the benchmarks. A pure computer-algebra app-
roach outperforms Bitwuzla and cvc5’s NIA solver. The NIA solver of z3 does a
bit better, but our field-aware SMT solver is the best by far. Moreover, its best
configuration uses UNSAT cores. Comparing the total solve time of ff-cvc5 and
nia-z3 on commonly solved benchmarks, we find that ff-cvc5 reduces total solve
time by 6×. In sum, the techniques we describe in this paper yield a tool that
substantially outperforms all alternatives on our benchmarks.
8 Discussion
We have presented a first study of the potential of an SMT theory solver for finite fields based on computer algebra. Our experiments have focused on translation validation for ZKP compilers, as applied to Boolean input computations. The solver shows promise, but much work remains.
As discussed (Sect. 5), our implementation makes limited use of the interface
exposed to a theory solver for CDCL(T ). It does no work until a full propositional
assignment is available. It also submits no lemmas to the core solver. Exploring
which lightweight reasoning should be performed during propositional search
and what kinds of lemmas are useful is a promising direction for future work.
Our model construction (Sect. 4.3) is another weakness. Without univariate polynomials or a zero-dimensional ideal, it falls back to exhaustive search. If a solution over an extension field were acceptable, then there would be Θ(|F|^d) solutions, so an exhaustive search would likely succeed quickly. Of course, we need a solution in the base field. If the base field were algebraically closed, then every solution would be in the base field. Our fields are finite (and thus not algebraically closed), but for our benchmarks they seem to bear some empirical resemblance to closed fields (e.g., the GB-based test for an empty variety never fails, even though it is theoretically incomplete).
For this reason, exhaustive search may not be completely unreasonable for our
benchmarks. Indeed, our experiments show that our procedure is effective on our
benchmarks, including for SAT instances. However, the worst-case performance
of this kind of model construction is clearly abysmal. We think that a more
intelligent search procedure and better use of ideas from computer algebra [6,67]
would both yield improvement.
Theory combination is also a promising direction for future work. The bench-
marks we present here are in the QF_FF logic: they involve only Booleans and finite
fields. Reasoning about different fields in combination with one another would
have natural applications to the representation of elliptic curve operations inside
ZKPs. Reasoning about datatypes, arrays, and bit-vectors in combination with
fields would also have natural applications to the verification of ZKP compilers.
Proof. It suffices to show that for each branching rule that results in ⋀_j (X_{i_j} − r_j), …
C Benchmark Generation
References
1. Abbott, J., Bigatti, A.M.: CoCoALib: A C++ library for computations in commu-
tative algebra... and beyond. In: International Congress on Mathematical Software
(2010)
2. Abbott, J., Bigatti, A.M., Palezzato, E., Robbiano, L.: Computing and using
minimal polynomials. J. Symbolic Comput. 100 (2020)
3. Ábrahám, E., Davenport, J.H., England, M., Kremer, G.: Deciding the consis-
tency of non-linear real arithmetic constraints with a conflict driven search using
cylindrical algebraic coverings. J. Logical Algebraic Methods in Programm. 119
(2021)
4. Anderson, B., McGrew, D.: TLS beyond the browser: combining end host and network data to understand application behavior. In: IMC (2019)
5. Archer, D., O’Hara, A., Issa, R., Strauss, S.: Sharing sensitive department of edu-
cation data across organizational boundaries using secure multiparty computation
(2021)
6. Aubry, P., Lazard, D., Maza, M.M.: On the theories of triangular sets. J. Symbolic
Comput. 28(1) (1999)
7. Barbosa, H., et al.: cvc5: A versatile and industrial-strength SMT solver. In:
TACAS (2022)
8. Barlow, R.: Computational thinking breaks a logjam. https://www.bu.edu/cise/computational-thinking-breaks-a-logjam/ (2015)
9. Barnett, M., Chang, B.Y.E., DeLine, R., Jacobs, B., Leino, K.R.M.: Boogie: A
modular reusable verifier for object-oriented programs. In: FMCO (2005)
10. Barrett, C., et al.: CVC4. In: CAV (2011)
11. Bayer, D., Stillman, M.: Computation of Hilbert functions. J. Symb. Comput. 14(1), 31–50 (1992)
12. Bayless, S., Bayless, N., Hoos, H., Hu, A.: SAT modulo monotonic theories. In:
AAAI (2015)
13. Baylina, J.: Circom. https://github.com/iden3/circom
14. bellman. https://github.com/zkcrypto/bellman
15. Bertot, Y., Castéran, P.: Interactive theorem proving and program development:
Coq’Art: the calculus of inductive constructions. Springer Science & Business
Media (2013)
16. Blum, M., Feldman, P., Micali, S.: Non-interactive zero-knowledge and its appli-
cations. In: Proceedings of the Twentieth Annual ACM Symposium on Theory of
Computing, pp. 103–112 (1988)
17. Bogetoft, P., et al.: Secure multiparty computation goes live. In: FC (2009)
18. Bosma, W., Cannon, J., Playoust, C.: The magma algebra system i: the user
language. J. Symb. Comput. 24(3–4), 235–265 (1997)
19. Braun, D., Magaud, N., Schreck, P.: Formalizing some "small" finite models of projective geometry in Coq. In: International Conference on Artificial Intelligence and Symbolic Computation (2018)
20. Bruttomesso, R., Pek, E., Sharygina, N., Tsitovich, A.: The OpenSMT solver. In:
TACAS (2010)
21. Buchberger, B.: A theoretical basis for the reduction of polynomials to canonical
forms. SIGSAM Bulletin (1976)
22. Campanelli, M., Gennaro, R., Goldfeder, S., Nizzardo, L.: Zero-knowledge con-
tingent payments revisited: Attacks and payments for services. In: CCS (2017)
23. Caviness, B.F., Johnson, J.R.: Quantifier elimination and cylindrical algebraic
decomposition. Springer Science & Business Media (2012)
24. Chin, C., Wu, H., Chu, R., Coglio, A., McCarthy, E., Smith, E.: Leo: A pro-
gramming language for formally verified, zero-knowledge applications. Cryptology
ePrint Archive (2021)
25. Cimatti, A., Griggio, A., Irfan, A., Roveri, M., Sebastiani, R.: Incremental lin-
earization for satisfiability and verification modulo nonlinear arithmetic and tran-
scendental functions. In: ACM TOCL 19(3) (2018)
26. Cimatti, A., Griggio, A., Schaafsma, B.J., Sebastiani, R.: The MathSAT5 SMT
solver. In: TACAS (2013)
27. Cimatti, A., Mover, S., Tonetta, S.: SMT-based verification of hybrid systems. In: AAAI (2012)
28. Cohen, C.: Pragmatic quotient types in coq. In: ITP (2013)
29. Corzilius, F., Kremer, G., Junges, S., Schupp, S., Ábrahám, E.: SMT-RAT: an
open source C++ toolbox for strategic and parallel SMT solving. In: Heule, M.,
Weaver, S. (eds.) SAT 2015. LNCS, vol. 9340, pp. 360–368. Springer, Cham (2015).
https://doi.org/10.1007/978-3-319-24318-4_26
30. Cox, D., Little, J., OShea, D.: Ideals, varieties, and algorithms: an introduction
to computational algebraic geometry and commutative algebra. Springer Science
& Business Media (2013)
31. Davenport, J.: The Axiom system (1992)
32. Monero developers: Monero technical specs. https://monerodocs.org/technical-specs/ (2022)
33. D'Silva, V., Kroening, D., Weissenbacher, G.: A survey of automated techniques for formal software verification. IEEE Trans. Comput.-Aided Design Integr. Circ. Syst. 27(7) (2008)
34. Dummit, D.S., Foote, R.M.: Abstract algebra, vol. 3. Wiley Hoboken (2004)
35. Dutertre, B.: Yices 2.2. In: CAV (2014)
36. Eberhardt, J., Tai, S.: ZoKrates–scalable privacy-preserving off-chain computa-
tions. In: IEEE Blockchain (2018)
37. Eisenbud, D., Grayson, D.R., Stillman, M., Sturmfels, B.: Computations in alge-
braic geometry with Macaulay 2, vol. 8. Springer Science & Business Media (2001)
38. Enderton, H.B.: A mathematical introduction to logic. Elsevier (2001)
39. Erbsen, A., Philipoom, J., Gross, J., Sloan, R., Chlipala, A.: Systematic Gen-
eration Of Fast Elliptic Curve Cryptography Implementations. Tech. rep, MIT
(2018)
40. Erbsen, A., Philipoom, J., Gross, J., Sloan, R., Chlipala, A.: Simple high-level
code for cryptographic arithmetic: With proofs, without compromises. ACM
SIGOPS Oper. Syst. Rev. 54(1) (2020)
41. Faugère, J.C.: A new efficient algorithm for computing Gröbner bases without
reduction to zero (F5). In: ISSAC. ACM (2002)
42. Faugère, J.C., Gianni, P., Lazard, D., Mora, T.: Efficient computation of zero-dimensional Gröbner bases by change of ordering. J. Symb. Comput. 16(4) (1993)
43. Faugère, J.C.: A new efficient algorithm for computing Gröbner bases (F4). J. Pure Appl. Algebra 139(1), 61–88 (1999)
44. Finance, Y.: Monero quote. https://finance.yahoo.com/quote/XMR-USD/ (2022)
Accessed 30 June 2022
45. Finance, Y.: Zcash quote. https://finance.yahoo.com/quote/ZEC-USD/ (2022).
Accessed 30 June 2022
46. Frankle, J., Park, S., Shaar, D., Goldwasser, S., Weitzner, D.: Practical account-
ability of secret processes. In: USENIX Security (2018)
47. Fränzle, M., Herde, C., Teige, T., Ratschan, S., Schubert, T.: Efficient solving of
large non-linear arithmetic constraint systems with complex boolean structure. J.
Satisfiability, Boolean Modeling and Comput. 1(3–4) (2006)
48. Gao, S.: Counting zeros over finite fields with Gröbner bases. Master's thesis, Carnegie Mellon University (2009)
73. Moura, L.d., Bjørner, N.: Z3: an efficient SMT solver. In: TACAS (2008)
74. Moura, L.d., Jovanović, D.: A model-constructing satisfiability calculus. In:
VMCAI (2013)
75. Moura, L.d., Kong, S., Avigad, J., Doorn, F.v., Raumer, J.v.: The lean theorem
prover (system description). In: CADE (2015)
76. Niemetz, A., Preiner, M.: Bitwuzla at the SMT-COMP 2020. arXiv:2006.01621
(2020)
77. Niemetz, A., Preiner, M., Wolf, C., Biere, A.: Btor2, BtorMC and Boolector 3.0.
In: CAV (2018)
78. Nieuwenhuis, R., Oliveras, A., Tinelli, C.: Solving SAT and SAT Modulo Theories:
From an abstract davis-putnam-logemann-loveland procedure to DPLL(T). J.
ACM (2006)
79. Nipkow, T., Wenzel, M., Paulson, L.C. (eds.): Isabelle/HOL. LNCS, vol. 2283.
Springer, Heidelberg (2002). https://doi.org/10.1007/3-540-45949-9
80. Noir. https://noir-lang.github.io/book/index.html
81. Ozdemir, A., Brown, F., Wahby, R.S.: Circ: Compiler infrastructure for proof
systems, software verification, and more. In: IEEE S&P (2022)
82. Ozdemir, A., Kremer, G., Tinelli, C., Barrett, C.: Satisfiability modulo finite fields
(2023), https://eprint.iacr.org/2023/091, (Full version)
83. Parker, R.: Finite fields and Conway polynomials (1990). Talk at the IBM Heidelberg Scientific Center; cited by Scheerhorn [90]
84. Parno, B., Howell, J., Gentry, C., Raykova, M.: Pinocchio: nearly practical veri-
fiable computation. Commun. ACM 59(2), 103–112 (2016)
85. Philipoom, J.: Correct-by-construction finite field arithmetic in Coq. Ph.D. thesis,
Massachusetts Institute of Technology (2018)
86. Pnueli, A., Siegel, M., Singerman, E.: Translation validation. In: TACAS (1998)
87. Rabin, M.O.: Probabilistic algorithms in finite fields. SIAM J. Comput. 9(2) (1980)
88. Rabinowitsch, J.L.: Zum Hilbertschen Nullstellensatz. Mathematische Annalen 102 (1930). https://doi.org/10.1007/BF01782361
89. Sasson, E.B., et al.: Zerocash: Decentralized anonymous payments from bitcoin.
In: IEEE S&P (2014)
90. Scheerhorn, A.: Trace- and norm-compatible extensions of finite fields. Applicable Algebra in Engineering, Communication and Computing (1992)
91. Schwabe, P., Viguier, B., Weerwag, T., Wiedijk, F.: A Coq proof of the correctness of X25519 in TweetNaCl. In: CSF (2021)
92. Setty, S., Braun, B., Vu, V., Blumberg, A.J., Parno, B., Walfish, M.: Resolving
the conflict between generality and plausibility in verified computation. In: Pro-
ceedings of the 8th ACM European Conference on Computer Systems, pp. 71–84
(2013)
93. Shankar, N.: Automated deduction for verification. CSUR 41(4) (2009)
94. Thaler, J.: Proofs, Arguments, and Zero-Knowledge (2022)
95. Torlak, E., Bodik, R.: A lightweight symbolic virtual machine for solver-aided
host languages. In: PLDI (2014)
96. Tung, V.X., Khanh, T.V., Ogawa, M.: raSAT: an SMT solver for polynomial constraints. In: IJCAR (2016)
97. Vella, L., Alt, L.: On satisfiability of polynomial equations over large prime fields. In: SMT (2022). http://ceur-ws.org/Vol-3185/extended9913.pdf (extended abstract)
98. Weispfenning, V.: Quantifier elimination for real algebra-the quadratic case and
beyond. Appl. Algebra Eng., Commun. Comput. 8(2) (1997)
Solving String Constraints Using SAT
1 Introduction
Many problems in software verification require reasoning about strings. To tackle
these problems, numerous string solvers—automated decision procedures for
quantifier-free first-order theories of strings and string operations—have been
developed over the last years. These solvers form the workhorse of automated-
reasoning tools in several domains, including web-application security [19,31,33],
software model checking [15], and conformance checking for cloud-access-control
policies [2,30].
The general theory of strings relies on deep results in combinatorics
on words [5,16,23,29]; unfortunately, the related decision procedures remain
intractable in practice. Practical string solvers achieve scalability through a judi-
cious mix of heuristics and restrictions on the language of constraints.
We present a new approach to string solving that relies on an eager reduc-
tion to the Boolean satisfiability problem (SAT), using incremental solving and
unsatisfiable-core analysis for completeness and scalability. Our approach sup-
ports a theory that contains Boolean combinations of regular membership con-
straints and equality constraints on string variables, and captures a large set of
practical queries [6].
2 Preliminaries
F ::= F ∧ F | F ∨ F | ¬F | Atom
Atom ::= x ∈ RE | x ≐ y
RE ::= RE · RE | RE ∪ RE | RE ∩ RE | RE* | ? | w

Fig. 1. Syntax: x and y denote string variables and w denotes a word of Σ*. The symbol ? is the wildcard character.
Fig. 2. Architecture of the solving method. Preprocessing produces the Boolean abstraction ψA with definitions D, a reduced alphabet Σ, and initial bounds b. The main loop then alternates propositional encoding and incremental SAT solving: a satisfiable call returns SAT, while an unsatisfiable call triggers bound refinement and re-encoding, until every bound reaches its theoretical maximum and the solver returns UNSAT.
3 Overview
Our solving method is illustrated in Fig. 2. It first performs three preprocessing
steps that generate a Boolean abstraction of the input formula, reduce the size of
the input alphabet, and initialize bounds on the lengths of all string variables.
After preprocessing, we enter an encode-solve-and-refine loop that iteratively
queries a SAT solver with a problem encoding based on the current bounds
and refines the bounds after each unsatisfiable solver call. We repeat this loop
until either the propositional encoding is satisfiable, in which case we conclude
satisfiability of the input formula, or each bound has reached a theoretical upper
bound, in which case we conclude unsatisfiability.
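Schematically, the loop can be organized around an incremental SAT solver as follows (our sketch, not the authors' code, using the python-sat package; encode_increment, selectors, and refine_bounds are hypothetical stand-ins for the encodings and refinement described later):

    from pysat.solvers import Solver

    def solve_loop(encode_increment, selectors, refine_bounds, bounds, limits):
        with Solver(name='cadical153') as sat:  # name depends on the pysat version
            while True:
                for clause in encode_increment(bounds):
                    sat.add_clause(clause)      # add only the incremental clauses
                if sat.solve(assumptions=selectors(bounds)):
                    return 'sat'
                if all(bounds[x] >= limits[x] for x in bounds):
                    return 'unsat'              # all bounds at their theoretical max
                bounds = refine_bounds(bounds, sat.get_core())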
Reducing the Alphabet. In the SMT-LIB theory of strings [4], the alphabet Σ
comprises 3 · 216 letters, but we can typically use a much smaller alphabet with-
out affecting satisfiability. In Sect. 4, we show that using Σ(ψ) and one extra
Fig. 3. Example of Boolean abstraction. The formula ψ, whose expression tree is shown on the left, results in the Boolean abstraction illustrated on the right, where p, q, and r are fresh Boolean variables abstracting the atoms x ∈ R1, y ∈ R2, and z ≐ w. We additionally get the definitions p → x ∈ R1, q ↔ y ∈ R2, and r ↔ z ≐ w. We use an implication (instead of an equivalence) for atom x ∈ R1 since it occurs only with positive polarity within ψ.
character per string variable is sufficient. Reducing the alphabet is critical for
our SAT encoding to be practical.
Proof. We set B = Σ(ψ) and use the previous construction. So the alphabet A = B ∪ {α1, . . . , αn} has cardinality |Σ(ψ)| + n, where α1, . . . , αn are distinct symbols of Σ \ B. We can assume that ψ is in disjunctive normal form, meaning that it is a disjunction of the form ψ = ψ1 ∨ · · · ∨ ψm, where each ψt is a conjunctive formula. If ψ is satisfiable, then one of the disjuncts ψk is satisfiable and we have Σ(ψk) ⊆ B. We can turn ψk into normal form by eliminating all variable equalities of the form xi ≐ xj from ψk, resulting in a conjunction ϕk of literals of the form xi ∈ R, xi ∉ R, or xi ≠̇ xj. Clearly, for any A ⊆ Σ, ϕk is satisfiable in A if and only if ψk is satisfiable in A.
Let h : V(ϕk) → Σ* be a model of ϕk and define the mapping h′ : V(ϕk) → A* as h′(xi) = fi(h(xi)). We show that h′ is a model of ϕk. Consider a literal l …
The reduction presented here can be improved and generalized. For example, it
can be worthwhile to use different alphabets for different variables or to reduce
large character intervals to smaller sets.
5 Propositional Encodings
Our algorithm performs a series of calls to a SAT solver. Each call determines the satisfiability of the propositional encoding ⟦ψ⟧^b of ψ for some upper bounds b. Recall that ⟦ψ⟧^b = ψA ∧ ⟦h⟧^b ∧ ⟦D⟧^b, where ψA is the Boolean abstraction of ψ, ⟦h⟧^b is an encoding of the set of possible substitutions, and ⟦D⟧^b is an encoding of the theory-literal definitions, both bounded by b. Intuitively, ⟦h⟧^b tells the SAT solver to “guess” a substitution, ⟦D⟧^b makes sure that all theory literals are assigned proper truth values according to the substitution, and ψA forces the evaluation of the whole formula under these truth values.
Suppose the algorithm performs n calls and let bk : Γ → ℕ for k ∈ 1, . . . , n denote the upper bounds used in the k-th call to the SAT solver. For convenience, we additionally define b0(x) = 0 for all x ∈ Γ. In the k-th call, the SAT solver decides whether ⟦ψ⟧^{bk} is satisfiable. The Boolean abstraction ψA, which we already discussed in Sect. 3, stays the same for each call. In the following, we thus discuss the encodings of the substitutions ⟦h⟧^{bk} and of the various theory literals ⟦a⟧^{bk} and ⟦¬a⟧^{bk} that are part of ⟦D⟧^{bk}. Even though SAT solvers expect their input in CNF, we do not present the encodings in CNF to simplify
the presentation, but they can be converted to CNF using simple equivalence
transformations.
Most of our encodings are incremental in the sense that the formula for call k is constructed by only adding clauses to the formula for call k − 1. In other words, for substitution encodings we have ⟦h⟧^{bk} = ⟦h⟧^{b_{k−1}} ∧ ⟦h⟧^{bk}_{b_{k−1}}, and for literals we have ⟦l⟧^{bk} = ⟦l⟧^{b_{k−1}} ∧ ⟦l⟧^{bk}_{b_{k−1}}, with the base case ⟦h⟧^{b0} = ⟦l⟧^{b0} = ⊤. In these cases, it is thus enough to encode the incremental additions ⟦l⟧^{bk}_{b_{k−1}} and ⟦h⟧^{bk}_{b_{k−1}} for each call to the SAT solver. Some of our encodings, however, introduce clauses that are valid only for a specific bound bk and thus become invalid for larger bounds. We handle the deactivation of these encodings with selector variables, as is common in incremental SAT solving.
Our encodings are correct in the following sense.1
5.1 Substitutions

⟦h⟧^{bk}_{b_{k−1}} = ⋀_{x∈Γ} ⋀_{i=b_{k−1}(x)+1}^{b_k(x)} EO({h^a_{x[i]} | a ∈ Σλ})      (1)
                 ∧ ⋀_{x∈Γ} ⋀_{i=b_{k−1}(x)}^{b_k(x)−1} (h^λ_{x[i]} → h^λ_{x[i+1]})      (2)
¹ Proof is omitted due to space constraints but made available for review purposes.
Constraint (2) prevents the SAT solver from considering filled substitutions that are equivalent modulo λ-substitutions—it enforces that if a position i is mapped to λ, all following positions are mapped to λ too. For instance, abλλ, aλbλ, and λλab all correspond to the same word ab, but our encoding allows only abλλ. Thus, every Boolean assignment ω that satisfies ⟦h⟧^b encodes exactly one substitution hω, and for every substitution h (bounded by b) there exists a corresponding assignment ωh that satisfies ⟦h⟧^b.
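The clauses behind this encoding are easy to generate (our sketch; variables are DIMACS-style integers, EO is realized with the naive pairwise at-most-one encoding, and the final loop emits constraint (2)):

    from itertools import combinations

    def substitution_clauses(var, x, bound, alphabet, lam='λ'):
        sigma_l = list(alphabet) + [lam]
        clauses = []
        for i in range(1, bound + 1):
            lits = [var(x, i, a) for a in sigma_l]
            clauses.append(lits)                                      # at least one
            clauses += [[-u, -w] for u, w in combinations(lits, 2)]   # at most one
        for i in range(1, bound):
            clauses.append([-var(x, i, lam), var(x, i + 1, lam)])     # λ-suffixes
        return clauses

    ids = {}
    def var(x, i, a):                     # toy variable numbering
        return ids.setdefault((x, i, a), len(ids) + 1)

    for c in substitution_clauses(var, 'x', 3, 'ab'):
        print(c)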
Regular Constraints. We encode a regular constraint x ∈ R by constructing a propositional formula that is true if and only if the word h(x) is accepted by a specific nondeterministic finite automaton that accepts the language L(R). Let x ∈ R be a regular constraint and let M = (Q, Σ, δ, q0, F) be a nondeterministic finite automaton (with states Q, alphabet Σ, transition relation δ, initial state q0, and accepting states F) that accepts L(R) and that additionally allows λ-self-transitions on every state. Given that λ is a placeholder for the empty symbol, λ-transitions do not change the language accepted by M. We allow them so that M performs exactly b(x) transitions, even for substitutions of length less than b(x). This reduces checking whether the automaton accepts a word to only evaluating the states reached after exactly b(x) transitions.
Given a model ω ⊨ ⟦h⟧^b, we express the semantics of M in propositional logic by encoding which states are reachable after reading hω(x). To this end, we assign b(x) + 1 Boolean variables {S^0_q, S^1_q, . . . , S^{b(x)}_q} to each state q ∈ Q and assert that ωh(S^i_q) = 1 if and only if q can be reached by reading the prefix hω(x)[1..i]. We encode this as a conjunction ⟦(M; x)⟧ = ⟦I_(M;x)⟧ ∧ ⟦T_(M;x)⟧ ∧ ⟦P_(M;x)⟧ of three formulas, modelling the semantics of the initial state, the transition relation, and the predecessor relation of M. We assert that the initial state q0 is the only state reachable after reading the prefix of length 0, i.e., ⟦I_(M;x)⟧^{b1} = S^0_{q0} ∧ ⋀_{q∈Q∖{q0}} ¬S^0_q. This condition is independent of the bound on x, so it is encoded only in the first call. The transition relation is encoded by

⟦T_(M;x)⟧^{bk}_{b_{k−1}} = ⋀_{i=b_{k−1}(x)}^{bk(x)−1} ⋀_{q∈Q} ⋀_{a∈Σλ} ⋀_{q′∈δ(q,a)} ((S^i_q ∧ h^a_{x[i+1]}) → S^{i+1}_{q′})

The formula captures all possible forward moves from each state. We must also ensure that a state is reachable only if it has a reachable predecessor, which we encode with the following formula, where pred(q′) = {(q, a) | q′ ∈ δ(q, a)}:

⟦P_(M;x)⟧^{bk}_{b_{k−1}} = ⋀_{i=b_{k−1}(x)+1}^{bk(x)} ⋀_{q′∈Q} (S^i_{q′} → ⋁_{(q,a)∈pred(q′)} (S^{i−1}_q ∧ h^a_{x[i]}))
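In clausal form, the three parts combine as in the following sketch of ours (the variable factories svar, hvar, and new_var are hypothetical, and the P disjunctions are clausified with fresh auxiliary variables, a choice the paper does not prescribe):

    def nfa_clauses(new_var, svar, hvar, delta, states, q0, finals, bound):
        # delta maps (q, a) to a set of successors and must already include
        # the λ self-loops (q, 'λ') -> {q}
        clauses = [[svar(q0, 0)]] + [[-svar(q, 0)] for q in states if q != q0]  # I
        for i in range(1, bound + 1):
            for (q, a), succs in delta.items():                                 # T
                for q2 in succs:
                    clauses.append([-svar(q, i - 1), -hvar(i, a), svar(q2, i)])
            for q2 in states:                                                   # P
                aux = []
                for (q, a), succs in delta.items():
                    if q2 in succs:
                        t = new_var()   # t -> (S_q^{i-1} and h^a_{x[i]})
                        clauses += [[-t, svar(q, i - 1)], [-t, hvar(i, a)]]
                        aux.append(t)
                clauses.append([-svar(q2, i)] + aux)
        clauses.append([svar(q, bound) for q in finals])   # acceptance
        return clauses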
Variable Equations. For an equality x ≐ y, with u = min(bk(x), bk(y)) and l = min(b_{k−1}(x), b_{k−1}(y)), we encode that the substitutions for x and y agree on every position:

⟦x ≐ y⟧^{bk}_{b_{k−1}} = ⋀_{i=l+1}^{u} ⋀_{a∈Σλ} (h^a_{x[i]} → h^a_{y[i]})
For the negation x ≠̇ y, we encode that h(x) and h(y) must disagree on at least one position, which can happen either because they map to different symbols or because the variable with the higher bound is mapped to a longer word. As for the regular constraints, we again use a selector variable sk to deactivate the encoding for all later bounds, for which it will be re-encoded:

⟦x ≠̇ y⟧^{bk}_{b_{k−1}} =
    sk → ⋁_{i=1}^{u} ⋁_{a∈Σλ} (¬h^a_{x[i]} ∧ h^a_{y[i]})                      if bk(x) = bk(y)
    sk → (⋁_{i=1}^{u} ⋁_{a∈Σλ} (¬h^a_{x[i]} ∧ h^a_{y[i]})) ∨ ¬h^λ_{y[u+1]}    if bk(x) < bk(y)
    sk → (⋁_{i=1}^{u} ⋁_{a∈Σλ} (¬h^a_{x[i]} ∧ h^a_{y[i]})) ∨ ¬h^λ_{x[u+1]}    if bk(x) > bk(y)
Constant Equations. Given a constant equation x ≐ w, if the upper bound of x is less than |w|, the atom is trivially unsatisfiable. Thus, for all k such that bk(x) < |w|, we encode x ≐ w with a simple literal ¬s_{x,w} and add s_{x,w} to the assumptions. For bk(x) ≥ |w|, the encoding is based on the value of b_{k−1}(x):

⟦x ≐ w⟧^{bk}_{b_{k−1}} =
    ⋀_{i=1}^{|w|} h^{w[i]}_{x[i]}                         if b_{k−1}(x) < |w| = bk(x)
    (⋀_{i=1}^{|w|} h^{w[i]}_{x[i]}) ∧ h^λ_{x[|w|+1]}      if b_{k−1}(x) < |w| < bk(x)
    h^λ_{x[|w|+1]}                                        if b_{k−1}(x) = |w| < bk(x)
    ⊤                                                     if |w| < b_{k−1}(x)

If b_{k−1}(x) < |w|, then equality is encoded for all positions 1, . . . , |w|. Additionally, if bk(x) > |w|, we ensure that the suffix of x is empty starting from position |w| + 1. If b_{k−1}(x) = |w| < bk(x), then only the empty suffix has to be ensured. Lastly, if |w| < b_{k−1}(x), then ⟦x ≐ w⟧^{b_{k−1}} ⇔ ⟦x ≐ w⟧^{bk}.
Conversely, for an inequality x ≠̇ w, if bk(x) < |w|, then any substitution trivially is a solution, which we simply encode with ⊤. Otherwise, we introduce a selector variable s_{x,w} and define

⟦x ≠̇ w⟧^{bk}_{b_{k−1}} =
    s_{x,w} → ⋁_{i=1}^{|w|} ¬h^{w[i]}_{x[i]}              if b_{k−1}(x) < |w| = bk(x)
    (⋁_{i=1}^{|w|} ¬h^{w[i]}_{x[i]}) ∨ ¬h^λ_{x[|w|+1]}    if b_{k−1}(x) < |w| < bk(x)
    ⊤                                                     if |w| < b_{k−1}(x) ≤ bk(x)
Prefix and Suffix Constraints. A prefix constraint w ⊑ x expresses that the first |w| positions of x must be mapped exactly onto w. As with equations between a variable x and a constant word w, we could express this as a regular constraint of the form x ∈ w · ?*. However, we achieve a more efficient encoding simply by dropping from the encoding of x ≐ w the assertion that the suffix of x starting at position |w| + 1 be empty. Accordingly, a negated prefix constraint w ⋢ x expresses that there is an index i ∈ 1, . . . , |w| such that the i-th position of x is mapped onto a symbol different from w[i], which we encode by repurposing the encoding of x ≠̇ w in a similar manner. Suffix constraints w ⊒ x and w ⋣ x can be encoded by analogous modifications to the encodings of x ≐ w and x ≠̇ w.
where L̄(R) denotes the complement of L(R). We say that Mi accepts the regular constraints on xi in ϕ. If there are no such constraints on xi, then Mi is the one-state NFA that accepts the full language Σ*. Let Qi denote the set of states of Mi. If we do not take inequalities into account and if the regular constraints on xi are satisfiable, then a shortest solution h has length |h(xi)| ≤ |Qi|.
Theorem 6.1 gives a bound for the general case with variable inequalities.
Intuitively, we prove the theorem by constructing a single automaton P that
takes as input a vector of words W = (w1 , ..., wn )T and accepts W iff the sub-
stitution hW with hW (xi ) = wi satisfies ϕ. To construct P, we introduce one
two-state NFA for each inequality and we then form the product of these NFAs
with (slightly modified versions of) the NFAs M1 , . . . , Mn . We can then derive
the bound of a shortest solution from the number of states of P.
Theorem 6.1. Let ϕ be a conjunctive formula in normal form over variables x1, . . . , xn. Let Mi = (Qi, Σ, δi, q0,i, Fi) be an NFA that accepts the regular constraints on xi in ϕ and let k be the number of inequalities occurring in ϕ. If ϕ is satisfiable, then it has a model h such that |h(xi)| ≤ 2^k × |Q1| × . . . × |Qn| for all i.
Proof. Let λ be a symbol that does not belong to Σ and define Σλ = Σ ∪{λ}. As
previously, we use λ to extend words of Σ ∗ by padding. Given a word w ∈ Σλ∗ , we
denote by ŵ the word of Σ ∗ obtained by removing all occurrences of λ from w.
We say that w is well-formed if it can be written as w = v · λt with v ∈ Σ ∗ and
t ≥ 0. In this case, we have ŵ = v. Thus a well-formed word w consists of a
prefix in Σ ∗ followed by a sequence of λs.
Let Δ be the alphabet Σλn , i.e., the letters of Δ are the n-letter words over
Σλ . We can then represent a letter u of Δ as an n-element vector (u1 , . . . , un ),
and a word W of Δ^t can be written as an n × t matrix

    W = ( u^1_1 · · · u^t_1 )
        (  ⋮          ⋮   )
        ( u^1_n · · · u^t_n )

where u^j_i ∈ Σλ. Each column of this matrix is a letter in Δ and each row is a word in Σλ^t. We denote by p_i(W) the i-th row of this matrix and by p̂_i(W) the word p_i(W) with all occurrences of λ removed. We say that W is well-formed if the words p_1(W), . . . , p_n(W) are all well-formed. Given a well-formed word W, we can construct a mapping h_W : {x1, . . . , xn} → Σ* by setting h_W(xi) = p̂_i(W), and we have |h_W(xi)| ≤ |W| = t.
To prove the theorem, we build an NFA P with alphabet Δ such that a well-
formed word W is accepted by P iff hW satisfies ϕ. The shortest well-formed
W accepted by P has length no more than the number of states of P and the
bound will follow.
We first extend the NFA Mi = (Qi, Σ, δi, q0,i, Fi) to an automaton M′i with alphabet Δ. M′i has the same set of states, initial state, and final states as Mi. Its transition relation δ′i is defined by

    δ′i(q, u) = δi(q, ui)  if ui ∈ Σ
    δ′i(q, u) = {q}        if ui = λ

One can easily check that M′i accepts a word W iff Mi accepts p̂_i(W).
For an inequality xi ≠̇ xj, we construct an NFA D_{i,j} = ({e, d}, Δ, δ, e, {d}) with transition function defined as follows:

    δ(e, u) = {e}  if ui = uj
    δ(e, u) = {d}  if ui ≠ uj
    δ(d, u) = {d}
This NFA has two states. It starts in state e (for “equal”) and stays in e as long as the characters ui and uj are equal. It transitions to state d (for “different”) on the first u where ui ≠ uj and stays in state d from that point on. Since d is the final state, a word W is accepted by D_{i,j} iff p_i(W) ≠ p_j(W). If W is well-formed, we also have that W is accepted by D_{i,j} iff p̂_i(W) ≠ p̂_j(W).
Let xi1 ≠̇ xj1, . . . , xik ≠̇ xjk denote the k inequalities of ϕ. We define P to be the product of the NFAs M′1, . . . , M′n and D_{i1,j1}, . . . , D_{ik,jk}. A well-formed word
W is accepted by P if it is accepted by all M′i and all D_{it,jt}, which means that P accepts a well-formed word W iff h_W satisfies ϕ.
Let P be the set of states of P. We then have |P| ≤ 2^k × |Q1| × . . . × |Qn|. Assume ϕ is satisfiable, so P accepts a well-formed word W. The shortest well-formed word accepted by P has an accepting run that does not visit the same state twice, so the length of this well-formed word W is no more than |P|. The mapping h_W satisfies ϕ and, for every xi, it satisfies |h_W(xi)| = |p̂_i(W)| ≤ |W| ≤ |P| ≤ 2^k × |Q1| × . . . × |Qn|.
The bound given by Theorem 6.1 holds if ϕ is in normal form, but it also holds for a general conjunctive formula ψ. This follows from the observation that converting conjunctive formulas to normal form preserves the length of solutions. In particular, we convert ψ ∧ x ≐ y to the formula ψ′ = ψ[x := y] so that x does not occur in ψ′; but clearly, a bound for y in ψ′ gives us the same bound for x in ψ.
In practice, before we apply the theorem we decompose the conjunctive for-
mula ϕ into subformulas that have disjoint sets of variables. We write ϕ as
ϕ1 ∧ . . . ∧ ϕm where the conjuncts have no common variables. Then, ϕ is satisfi-
able if each conjunct ϕt is satisfiable and we derive upper bounds on the shortest
solution for the variables of ϕt , which gives more precise bounds than deriving
bounds from ϕ directly. In particular, if a variable xi of ψ does not occur in any
inequality, then the bound on |h(xi )| is |Qi |.
Theorem 6.1 only holds for conjunctive formulas. For an arbitrary (non-
conjunctive) formula ψ, a generalization is to convert ψ into disjunctive normal
form. Alternatively, it is sufficient to enumerate the subsets of lits(ψ). Given a
subset A of lits(ψ), let us denote by dA a mapping that bounds the length of
solutions to A, i.e., any solution h to A satisfies |h(x)| ≤ dA (x). This mapping
dA can be computed from Theorem 6.1. The following property gives a bound
for ψ.
Proposition 6.2. If ψ is satisfiable, then it has a model h such that for all
x ∈ Γ , it holds that |h(x)| ≤ max{dA (x) | A ⊆ lits(ψ)}.
Proof. We can assume that ψ is in negation normal form. We can then convert ψ
to disjunctive normal form ψ ⇔ ψ1 ∨· · ·∨ψn and we have lits(ψi ) ⊆ lits(ψ). Also,
ψ is satisfiable if and only if at least one ψi is satisfiable and the proposition
follows.
Since there are 2^{|lits(ψ)|} subsets of lits(ψ), a direct application of Proposition 6.2 is rarely feasible in practice. Fortunately, we can use unsatisfiable cores to reduce the number of subsets to consider.
Instead of calculating the bounds upfront, we use the unsatisfiable core produced
by the SAT solver after each incremental call to evaluate whether the upper
bounds on the variables exceed the upper bounds of the shortest solution. If ⟦ψ⟧^b is unsatisfiable for bounds b, then it has an unsatisfiable core

    C = C_A ∧ C_h ∧ ⋀_{a∈atoms+(ψ)} C_a ∧ ⋀_{a∈atoms−(ψ)} C_ā

with (possibly empty) subsets of clauses C_A ⊆ ψA, C_h ⊆ ⟦h⟧^b, C_a ⊆ ⟦d(a) → a⟧^b, and C_ā ⊆ ⟦¬d(a) → ¬a⟧^b. Here we implicitly assume ψA, ⟦d(a) → a⟧^b, and ⟦¬d(a) → ¬a⟧^b to be in CNF. Let C+ = {a | C_a ≠ ∅} and C− = {¬a | C_ā ≠ ∅} be the sets of literals whose encodings contain at least one clause of the core C. Using these sets, we construct the formula

    ψ_C = ψA ∧ ⋀_{a∈C+} (d(a) → a) ∧ ⋀_{¬a∈C−} (¬d(a) → ¬a),

which consists of the conjunction of the abstraction and the definitions of the literals that are contained in C+, respectively C−. Recall that ψ is equisatisfiable to the conjunction ψA ∧ ⋀_{d∈D} d of the abstraction and all definitions in D. Let ψ′ denote this formula, i.e.,

    ψ′ = ψA ∧ ⋀_{a∈atoms+(ψ)} (d(a) → a) ∧ ⋀_{¬a∈atoms−(ψ)} (¬d(a) → ¬a).
The following proposition shows that it suffices to refine the bounds according to ψ_C.

Proposition 6.3. Let ψ′ be unsatisfiable with respect to b and let C be an unsatisfiable core of ⟦ψ′⟧^b. Then, ψ_C is unsatisfiable with respect to b and ψ′ ⊨ ψ_C.
Proof. By definition, we have ⟦ψ_C⟧^b = ψA ∧ ⟦h⟧^b ∧ ⋀_{a∈C+} ⟦d(a) → a⟧^b ∧ ⋀_{¬a∈C−} ⟦¬d(a) → ¬a⟧^b. This implies C ⊆ ⟦ψ_C⟧^b and, since C is an unsatisfiable core, ⟦ψ_C⟧^b is unsatisfiable. That is, ψ_C is unsatisfiable with respect to b. We also have ψ′ ⊨ ψ_C since C+ ⊆ atoms+(ψ) and C− ⊆ atoms−(ψ).
7 Implementation
We have implemented our approach in a solver called nfa2sat. nfa2sat is
written in Rust and uses CaDiCaL [9] as the backend SAT solver. We use the
incremental API provided by CaDiCaL to solve problems under assumptions.
Soundness of nfa2sat follows from Theorem 5.1. For completeness, we rely
on CaDiCaL’s failed function to efficiently determine failed assumptions, i.e.,
assumption literals that were used to conclude unsatisfiability.
The procedure works as follows. Given a formula ψ, we first introduce a
fresh Boolean selector variable s_l for each theory literal l ∈ lits(ψ). Then, instead
of adding the encoded definitions of the theory literals directly to the SAT
solver, we guard them with their corresponding selector variables: for a pos-
itive literal a, we add s_a → (d(a) → a), and for a negative literal ¬a, we
add s_{¬a} → (¬d(a) → ¬a) (considering assumptions introduced by a as unit
clauses). In the resulting CNF formula, the new selector variables are present
in all clauses that encode their corresponding definition, and we use them as
assumptions for every incremental call to the SAT solver, which does not affect
satisfiability. If such an assumption fails, then we know that at least one of the
corresponding clauses in the propositional formula was part of an unsatisfiable
core, which enables us to efficiently construct the sets C⁺ and C⁻ of positive and
negative atoms present in the unsatisfiable core. As noted previously, we have
lits(ψ^C) = C⁺ ∪ C⁻, and hence these sets suffice to find bounds on a shortest
model for ψ^C.
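This selector-variable scheme can be reproduced with any incremental SAT solver that exposes assumptions and failed assumptions. The sketch below is ours, not nfa2sat's Rust code; it uses the CaDiCaL backend of the python-sat package, and the two guarded clauses stand in for the CNF encodings of two theory-literal definitions.

# Sketch: selector variables guard the clauses encoding theory literals;
# failed assumptions then identify the literals occurring in the unsat core.
from pysat.solvers import Cadical153   # pip install python-sat

solver = Cadical153()

s_a, s_b = 1, 2                        # selector variables for two literals
solver.add_clause([-s_a, 3])           # s_a -> (encoding of literal a forces 3)
solver.add_clause([-s_b, -3])          # s_b -> (encoding of literal b forces -3)

# Assuming all selectors activates every definition without affecting satisfiability.
if not solver.solve(assumptions=[s_a, s_b]):
    failed = solver.get_core()         # the assumptions used to derive unsatisfiability
    # Each failed selector marks a theory literal whose encoding contributed a
    # clause to the unsat core; together these yield the sets C+ and C-.
    print('failed assumptions:', failed)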
This approach is efficient for obtaining lits(ψ^C), but since CaDiCaL does not
guarantee that the set of failed assumptions is minimal, lits(ψ^C) is not minimal
in general. Moreover, even a minimal lits(ψ^C) can contain too many elements
for processing all subsets. To address this issue, we enumerate the subsets only
if lits(ψ^C) is small (by default, we use a limit of ten literals). In this case, we
construct the automata M_i used in Theorem 6.1 for each subset, applying the
techniques described in [7] to quickly rule out unsatisfiable ones. Otherwise,
8 Experimental Evaluation
We have evaluated our solver on a large set of benchmarks from the ZaligVin-
der [22] repository². The repository contains 120,287 benchmarks stemming
from both academic and industrial applications. In particular, all the string prob-
lems from the SMT-LIB repository³ are included in the ZaligVinder reposi-
tory. We converted the ZaligVinder problems to the SMT-LIB 2.6 syntax and
removed duplicates. This resulted in 82,632 unique problems, out of which 29,599
are in the logical fragment we support.
We compare nfa2sat with the state-of-the-art solvers cvc5 (version 1.0.3)
and Z3 (version 4.12.0). The comparison is limited to these two solvers because
they are widely adopted and because they had the best performance in our evalu-
ation. Other string solvers either do not support our logical fragment (CertiStr,
Woorpje) or gave incorrect answers on the benchmark problems considered
here. Older, no-longer-maintained solvers have known soundness problems, as
reported in [7] and [27].
We ran our experiment on a Linux server, with a timeout of 1200 s of CPU
time and a memory limit of 16 GB. Table 1 shows the results. As a single
tool, nfa2sat solves more problems than cvc5 but not as many as Z3. All three
tools solve more than 98% of the problems.
The table also shows results of portfolios that combine two solvers. In a port-
folio configuration, the best setting is to use both Z3 and nfa2sat. This com-
bination solves all but 20 problems within the timeout. It also reduces the total
run-time from 283,942 s for Z3 (about 79 h) to 28,914 s for the portfolio (about
8 h), that is, a 90% reduction in total solve time. The other two portfolios—
namely, Z3 with cvc5 and nfa2sat with cvc5—also have better performance
than a single solver, but the improvement in runtime and number of timeouts is
not as large.
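A two-solver portfolio of this kind is straightforward to approximate: run both solvers on the same input and keep the first answer. The sketch below is illustrative only; the binary names z3 and nfa2sat and their command-line conventions are assumptions.

# Sketch: a first-answer-wins portfolio over two solver processes.
import subprocess
from concurrent.futures import ThreadPoolExecutor, FIRST_COMPLETED, wait

def run(cmd, problem, timeout=1200):
    out = subprocess.run(cmd + [problem], capture_output=True,
                         text=True, timeout=timeout)
    return cmd[0], out.stdout.strip()

def portfolio(problem):
    pool = ThreadPoolExecutor(max_workers=2)
    futures = {pool.submit(run, ['z3'], problem),
               pool.submit(run, ['nfa2sat'], problem)}
    done, _ = wait(futures, return_when=FIRST_COMPLETED)
    # Do not wait for the loser; its process keeps running until its own timeout.
    pool.shutdown(wait=False, cancel_futures=True)
    return next(iter(done)).result()

# Example: print(portfolio('benchmark.smt2'))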
Figure 4a illustrates why nfa2sat and Z3 complement each other well. The
figure shows three scatter plots that compare the runtime of nfa2sat and Z3 on
our problems. The plot on the left compares the two solvers on all problems, the
one in the middle compares them on satisfiable problems, and the one on the right
compares them on unsatisfiable problems. Points in the left plot are concentrated
close to the axes, with a smaller number of points near the diagonal, meaning
that Z3 and nfa2sat have different runtimes on most problems. The other two
² https://github.com/zaligvinder/zaligvinder.
³ https://clc-gitlab.cs.uiowa.edu:2443/SMT-LIB-benchmarks/QF_S.
Table 1. Evaluation on ZaligVinder benchmarks. The three left columns show results
of individual solvers. The other three columns show results of portfolios combining two
solvers.
Fig. 4. Comparison of runtime (in seconds) with Z3 and cvc5. The left plots include
all problems, the middle plots include only satisfiable problems, and the right plots
include only unsatisfiable problems. The lines marked “failed” correspond to problems
that are not solved because a solver ran out of memory. The lines marked “timeout”
correspond to problems not solved because of a timeout (1200 s).
plots show this even more clearly: nfa2sat is faster on satisfiable problems while
Z3 is faster on unsatisfiable problems. Figure 4b shows analogous scatter plots
comparing nfa2sat and cvc5. The two solvers show similar performance on
a large set of easy benchmarks although cvc5 is faster on problems that both
solvers can solve in less than 1 s. However, cvc5 times out on 38 problems that
nfa2sat solves in less than 2 s. On unsatisfiable problems, cvc5 tends to be
faster than nfa2sat, but there is a class of problems for which nfa2sat takes
between 10 and 100 s whereas cvc5 is slower.
Overall, the comparison shows that nfa2sat is competitive with cvc5 and
Z3 on these benchmarks. We also observe that nfa2sat tends to work better on
satisfiable problems. For best overall performance, our experiments show that a
portfolio of Z3 and nfa2sat would solve all but 20 problems within the timeout,
and reduce the total solve time by 90%.
9 Conclusion
We have presented the first eager SAT-based approach to string solving that is
both sound and complete for a reasonably expressive fragment of string theory.
Our experimental evaluation shows that our approach is competitive with the
state-of-the-art lazy SMT solvers Z3 and cvc5, outperforming them on satisfi-
able problems but falling behind on unsatisfiable ones. A portfolio that combines
our approach with these solvers—particularly with Z3—would thus yield strong
performance across both types of problems.
In future work, we plan to extend our approach to a more expressive logi-
cal fragment, including more general word equations. Other avenues of research
include the adaptation of model-checking techniques such as IC3 [10] to string
problems, which we hope would lead to better performance on unsatisfiable
instances. A particular benefit of the eager approach is that it enables the use
of mature techniques from the SAT world, especially for proof generation and
parallel solving. Producing proofs of unsatisfiability is complex for traditional
CDCL(T) solvers because of the complex rewriting and deduction rules they
employ. In contrast, efficiently generating and checking proofs produced by SAT
solvers (using the DRAT format [32]) is well-established and practicable. A challenge
in this respect would be to combine unsatisfiability proofs from a SAT
solver with a proof that our reduction to SAT is sound. For parallel solving, we
plan to explore the use of a parallel incremental solver (such as iLingeling [9])
as well as other possible ways to solve multiple bounds in parallel.
References
1. Abdulla, P.A., et al.: Trau: SMT solver for string constraints. In: 2018 Formal
Methods in Computer Aided Design (FMCAD), pp. 1–5 (2018). https://doi.org/
10.23919/FMCAD.2018.8602997
2. Backes, J., et al.: Semantic-based automated reasoning for AWS access policies
using SMT. In: 2018 Formal Methods in Computer Aided Design (FMCAD), pp.
1–9 (2018). https://doi.org/10.23919/FMCAD.2018.8602994
3. Barbosa, H., et al.: cvc5: A versatile and industrial-strength SMT solver. In: Fis-
man, D., Rosu, G. (eds.) Tools and Algorithms for the Construction and Analysis
of Systems - 28th International Conference, TACAS 2022, Held as Part of the
European Joint Conferences on Theory and Practice of Software, ETAPS 2022,
Munich, Germany, April 2-7, 2022, Proceedings, Part I. Lecture Notes in Com-
puter Science, vol. 13243, pp. 415–442. Springer (2022). https://doi.org/10.1007/978-3-030-99524-9_24
4. Barrett, C., Fontaine, P., Tinelli, C.: The SMT-LIB Standard: Version 2.6. Tech.
rep., Department of Computer Science, The University of Iowa (2017). www.smt-lib.org
5. Berzish, M., et al.: String theories involving regular membership predicates: From
practice to theory and back. In: Lecroq, T., Puzynina, S. (eds.) Combinatorics on
Words, pp. 50–64. Springer International Publishing, Cham (2021)
6. Berzish, M., et al.: Towards more efficient methods for solving regular-expression
heavy string constraints. Theoretical Computer Science 943, 50–72 (2023). https://
doi.org/10.1016/j.tcs.2022.12.009, https://www.sciencedirect.com/science/article/
pii/S030439752200723X
7. Berzish, M., et al.: An SMT solver for regular expressions and linear arithmetic
over string length. In: Silva, A., Leino, K.R.M. (eds.) Computer Aided Verification,
pp. 289–312. Springer International Publishing, Cham (2021)
8. Biere, A.: Bounded model checking. In: Biere, A., Heule, M., van Maaren, H.,
Walsh, T. (eds.) Handbook of Satisfiability, Frontiers in Artificial Intelligence and
Applications, vol. 185, pp. 457–481. IOS Press (2009). https://doi.org/10.3233/
978-1-58603-929-5-457
9. Biere, A., Fazekas, K., Fleury, M., Heisinger, M.: CaDiCaL, Kissat, Paracooba,
Plingeling and Treengeling entering the SAT Competition 2020. In: Balyo, T.,
Froleyks, N., Heule, M., Iser, M., Järvisalo, M., Suda, M. (eds.) Proc. of SAT
Competition 2020 - Solver and Benchmark Descriptions. Department of Computer
Science Report Series B, vol. B-2020-1, pp. 51–53. University of Helsinki (2020)
10. Bradley, A.R.: SAT-based model checking without unrolling. In: Jhala, R.,
Schmidt, D. (eds.) VMCAI 2011. LNCS, vol. 6538, pp. 70–87. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-18275-4_7
11. Chen, T., Hague, M., Lin, A.W., Rümmer, P., Wu, Z.: Decision procedures for
path feasibility of string-manipulating programs with complex operations. Proc.
ACM Program. Lang. 3(POPL) (jan 2019). https://doi.org/10.1145/3290362
12. Day, J.D., Ehlers, T., Kulczynski, M., Manea, F., Nowotka, D., Poulsen, D.B.:
On solving word equations using SAT. In: Filiot, E., Jungers, R., Potapov, I.
(eds.) Reachability Problems, pp. 93–106. Springer International Publishing, Cham
(2019)
13. Eén, N., Sörensson, N.: Temporal induction by incremental SAT solving. Electronic
Notes in Theoretical Computer Science 89(4), 543–560 (2003). https://doi.org/10.1016/S1571-0661(05)82542-3,
https://www.sciencedirect.com/science/article/pii/S1571066105825423. BMC'2003, First International Workshop on Bounded Model
Checking
14. Gao, Y., Moreira, N., Reis, R., Yu, S.: A survey on operational state complexity.
CoRR abs/1509.03254 (2015), http://arxiv.org/abs/1509.03254
15. Hojjat, H., Rümmer, P., Shamakhi, A.: On strings in software model checking.
In: Lin, A.W. (ed.) Programming Languages and Systems, pp. 19–30. Springer
International Publishing, Cham (2019)
16. Jez, A.: Word Equations in Nondeterministic Linear Space. In: Chatzigiannakis,
I., Indyk, P., Kuhn, F., Muscholl, A. (eds.) 44th International Colloquium on
Automata, Languages, and Programming (ICALP 2017). Leibniz International
Proceedings in Informatics (LIPIcs), vol. 80, pp. 95:1–95:13. Schloss Dagstuhl–
Leibniz-Zentrum fuer Informatik, Dagstuhl, Germany (2017). https://doi.org/10.
4230/LIPIcs.ICALP.2017.95, http://drops.dagstuhl.de/opus/volltexte/2017/7408
17. Kan, S., Lin, A.W., Rümmer, P., Schrader, M.: Certistr: A certified string solver.
In: Proceedings of the 11th ACM SIGPLAN International Conference on Certi-
fied Programs and Proofs, pp. 210–224. CPP 2022, Association for Computing
Machinery, New York, NY, USA (2022). https://doi.org/10.1145/3497775.3503691
18. Karhumäki, J., Mignosi, F., Plandowski, W.: The expressibility of languages and
relations by word equations. J. ACM 47(3), 483–505 (may 2000). https://doi.org/
10.1145/337244.337255
19. Kiezun, A., Ganesh, V., Guo, P.J., Hooimeijer, P., Ernst, M.D.: Hampi: A solver for
string constraints. In: Proceedings of the Eighteenth International Symposium on
Software Testing and Analysis, pp. 105–116. ISSTA ’09, Association for Computing
Machinery, New York, NY, USA (2009). https://doi.org/10.1145/1572272.1572286
20. Klieber, W., Kwon, G.: Efficient CNF encoding for selecting 1 from N objects. In:
Fourth Workshop on Constraints in Formal Verification (CFV) (2007)
21. Kulczynski, M., Lotz, K., Nowotka, D., Poulsen, D.B.: Solving string theories
involving regular membership predicates using SAT. In: Legunsen, O., Rosu, G.
(eds.) Model Checking Software, pp. 134–151. Springer International Publishing,
Cham (2022)
22. Kulczynski, M., Manea, F., Nowotka, D., Poulsen, D.B.: Zaligvinder: A
generic test framework for string solvers. J. Softw.: Evolution and Process
n/a(n/a), e2400. https://doi.org/10.1002/smr.2400, https://onlinelibrary.wiley.
com/doi/abs/10.1002/smr.2400
23. Makanin, G.S.: The problem of solvability of equations in a free semi-
group. Math. USSR, Sb. 32, 129–198 (1977). https://doi.org/10.1070/
SM1977v032n02ABEH002376
24. Mora, F., Berzish, M., Kulczynski, M., Nowotka, D., Ganesh, V.: Z3str4: A multi-
armed string solver. In: Huisman, M., Păsăreanu, C., Zhan, N. (eds.) Formal Meth-
ods, pp. 389–406. Springer International Publishing, Cham (2021)
25. de Moura, L., Bjørner, N.: Z3: An efficient SMT solver. In: Proceedings of
the Theory and Practice of Software, 14th International Conference on Tools
and Algorithms for the Construction and Analysis of Systems, pp. 337–340.
TACAS’08/ETAPS’08, Springer-Verlag, Berlin, Heidelberg (2008)
26. Murray, N.V.: Completely non-clausal theorem proving. Artificial Intelligence
18(1), 67–85 (1982). https://doi.org/10.1016/0004-3702(82)90011-X, https://
www.sciencedirect.com/science/article/pii/000437028290011X
27. Nötzli, A., Reynolds, A., Barbosa, H., Barrett, C.W., Tinelli, C.: Even faster
conflicts and lazier reductions for string solvers. In: Shoham, S., Vizel, Y. (eds.)
Computer Aided Verification - 34th International Conference, CAV 2022, Haifa,
Israel, August 7-10, 2022, Proceedings, Part II. Lecture Notes in Computer Sci-
ence, vol. 13372, pp. 205–226. Springer (2022). https://doi.org/10.1007/978-3-031-13188-2_11
28. Plaisted, D.A., Greenbaum, S.: A structure-preserving clause form translation.
Journal of Symbolic Computation 2(3), 293–304 (1986). https://doi.org/10.
1016/S0747-7171(86)80028-1, https://www.sciencedirect.com/science/article/pii/
S0747717186800281
The GOLEM Horn Solver
1 Introduction
The framework of Constrained Horn Clauses (CHC) has been proposed as a uni-
fied, purely logic-based, intermediate format for software verification tasks [33].
CHC provides a powerful way to model various verification problems, such as
safety, termination, and loop invariant computation, across different domains like
transition systems, functional programs, procedural programs, concurrent sys-
tems, and more [33–35,41]. The key advantage of CHC is the separation of mod-
elling from solving, which aligns with the important software design principle—
separation of concerns. This makes CHCs highly reusable, allowing a specialized
CHC solver to be used for different verification tasks across domains and pro-
gramming languages. The main focus of the front end is then to translate the
source code into the language of constraints, while the back end can focus solely
on the well-defined formal problem of deciding satisfiability of a CHC system.
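For concreteness, a one-loop safety problem can be modelled as a CHC system in a few lines. The sketch below uses Z3's Fixedpoint (Spacer) Python API purely as an illustration of the modelling idea; it is not tied to any of the frameworks cited here.

# Sketch: safety of "x := 0; while * do x := x + 1" against the error x < 0,
# expressed as three constrained Horn clauses over an unknown invariant Inv.
from z3 import Fixedpoint, Function, IntSort, BoolSort, Ints, And

fp = Fixedpoint()
fp.set(engine='spacer')

Inv = Function('Inv', IntSort(), BoolSort())
x, x1 = Ints('x x1')
fp.register_relation(Inv)
fp.declare_var(x, x1)

fp.rule(Inv(x), x == 0)                     # init:  x = 0             -> Inv(x)
fp.rule(Inv(x1), And(Inv(x), x1 == x + 1))  # step:  Inv(x) & x' = x+1 -> Inv(x')
res = fp.query(And(Inv(x), x < 0))          # query: Inv(x) & x < 0    -> false?

print(res)                  # 'unsat': false is not derivable, so the system is safe
if str(res) == 'unsat':
    print(fp.get_answer())  # a model of the CHCs: an interpretation of Inv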
CHC-based verification is becoming increasingly popular, with several frame-
works developed in recent years, including SeaHorn, Korn and TriCera for
C [27,28,36], JayHorn for Java [44], RustHorn for Rust [48], HornDroid for
Android [18], SolCMC and SmartACE for Solidity [2,57]. A novel CHC-based
approach for testing also shows promising results [58]. The growing demand from
verifiers drives the development of specialized Horn solvers. Different solvers
implement different techniques based on, e.g., model-checking approaches (such
as predicate abstraction [32], CEGAR [22] and IC3/PDR [16,26]), machine learn-
ing, automata, or CHC transformations. Eldarica [40] uses predicate abstrac-
tion and CEGAR as the core solving algorithm. It leverages Craig interpo-
lation [23] not only to guide the predicate abstraction but also for accelera-
tion [39]. Additionally, it controls the form of the interpolants with interpolation
abstraction [46,53]. Spacer [45] is the default algorithm for solving CHCs in
Z3 [51]. It extends the PDR-style algorithm for nonlinear CHC [38] with under-
approximations and leverages model-based projection for predecessor computa-
tion. Recently it was enriched with global guidance [37]. Ultimate TreeAu-
tomizer [25] implements automata-based approaches to CHC solving [43,56].
HoIce [20] implements a machine-learning-based technique adapted from the
ICE framework developed for discovering inductive invariants of transition sys-
tems [19]. FreqHorn [29,30] combines syntax-guided synthesis [4] with data
derived from unrollings of the CHC system.
According to the results of the international competition on CHC solving
CHC-COMP [24,31,54], solvers applying model-checking techniques, namely
Spacer and Eldarica, are regularly outperforming the competitors. These
are the solvers most often used as the back ends in CHC-based verification
projects. However, only specific algorithms have been explored in these tools
for CHC solving, limiting their application for diverse verification tasks. Experi-
ence from software verification and model checking of transition systems shows
that in contrast to the state of affairs in CHC solving, it is possible to build a
flexible infrastructure with a unified environment for multiple back-end solving
algorithms. CPAchecker [6–11] and Pono [47] are examples of such tools.
This work aims to bring this flexibility to the general domain-independent
framework of constrained Horn clauses. We present Golem, a new solver
for CHC satisfiability that provides a unique combination of flexibility and
efficiency.¹ Golem implements several SMT-based model-checking algorithms:
our recent model-checking algorithm based on Transition Power Abstraction
(TPA) [13,14], and state-of-the-art model-checking algorithms Bounded Model
Checking (BMC) [12], k-induction [55], Interpolation-based Model Checking
(IMC) [49], Lazy Abstractions with Interpolants (LAWI) [50] and Spacer [45].
Golem achieves efficiency through tight integration with the underlying interpo-
lating SMT solver OpenSMT [17,42] and preprocessing transformations based
on predicate elimination, clause merging and redundant clause elimination. The
flexible and modular framework of OpenSMT enables customization for differ-
ent algorithms; its powerful interpolation modules, particularly, offer fine con-
trol (in size and strength) with multiple interpolant generation procedures. We
report experimentation that confirms the advantage of multiple diverse solving
techniques and shows that Golem is competitive with state-of-the-art Horn
solvers on large sets of problems.² Overall, Golem can serve as an efficient back
¹ Golem is available at https://github.com/usi-verification-and-security/golem.
² This is in line with results from CHC-COMP 2021 and 2022 [24,31]. In 2022, Golem
beat all other solvers except Z3-Spacer in the LRA-TS, LIA-Lin and LIA-Nonlin
tracks.
end for domain-specific verification tools and as a research tool for prototyping
and evaluating SMT- and interpolation-based verification techniques in a unified
setting.
2 Tool Overview
In this section, we describe the main components and features of the tool together
with the details of its usage. For completeness, we recall the terminology related
to CHCs first.
[Fig. 1. The architecture of Golem: the back-end engines (TPA, Spacer, LAWI, IMC, ...) interact with OpenSMT (core solver and Interpolator, with SAT and interpolation customization) and return SAT together with a model, or UNSAT together with a proof.]
Architecture. The flow of data inside Golem is depicted in Fig. 1. The system
of CHCs is read from an .smt2 file, a script in an extension of the language of SMT-
LIB.³ The Interpreter interprets the SMT-LIB script and builds the internal rep-
resentation of the system of CHCs. In Golem, CHCs are first normalized, and then
the system is translated into an internal graph representation. Normalization
rewrites clauses to ensure that each predicate has only variables as arguments.
The graph representation of the system is then passed to the Preprocessor,
which applies various transformations to simplify the input graph. The Preprocessor
then hands the transformed graph to the chosen back-end engine.
³ https://chc-comp.github.io/format.html.
Models and Proofs. Besides solving the CHC satisfiability problem, a witness
for the answer is often required by the domain-specific application. A satisfiabil-
ity witness is a model: an interpretation of the CHC predicates that makes all
clauses valid. An unsatisfiability witness is a proof: a derivation of the empty clause
from the input clauses. In software verification, these witnesses correspond to pro-
gram invariants and counterexample paths, respectively. All engines in Golem
produce witnesses for their answers. Witnesses from engines are translated back
through the applied preprocessing transformations; only after this backtransla-
tion does the witness match the original input system, and it is then reported to the user.
Witnesses must be explicitly requested with the option --print-witness.
Models are internally stored as formulas in the background theory, using only
the variables of the (normalized) uninterpreted predicates. They are presented
to the user in the format defined by SMT-LIB [5]: a sequence of SMT-LIB’s
define-fun commands, one for each uninterpreted predicate.
For the proofs, Golem follows the trace format proposed by Eldarica.
Internally, proofs are stored as a sequence of derivation steps. Every derivation
step represents a ground instance of some clause from the system. The ground
instances of predicates from the body form the premises of the step, and the
ground instance of the head’s predicate forms the conclusion of the step. For
the derivation to be valid, the premises of each step must have been derived
earlier, i.e., each premise must be a conclusion of some derivation step earlier in
the sequence. To the user, the proof is presented as a sequence of derivations of
ground instances of the predicates, where each step is annotated with the indices
of its premises. See Example 1 below for the illustration of the proof trace.
Golem also implements an internal validator that checks the correctness
of the witnesses. It validates a model by substituting the interpretations for the
predicates and checking the validity of all the clauses with OpenSMT. Proofs are
validated by checking all conditions listed above for each derivation step. Valida-
tion is enabled with an option --validate and serves primarily as a debugging
tool for the developers of witness production.
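The per-step conditions above make the trace directly machine-checkable. The following sketch validates the structural part of a trace (each premise must be the conclusion of an earlier step, and the last step must derive the empty clause); the step representation is ours, not Golem's internal one, and the check that each step is a ground instance of a clause is omitted.

# Sketch: structural validation of a proof trace.

def validate_trace(steps):
    # steps: list of dicts {'conclusion': fact, 'premises': [indices of earlier steps]}
    for i, step in enumerate(steps):
        for j in step['premises']:
            if not 0 <= j < i:
                return False, f'step {i}: premise {j} not derived earlier'
    if not steps or steps[-1]['conclusion'] != 'false':
        return False, 'trace does not end with the empty clause'
    return True, 'ok'

trace = [
    {'conclusion': 'Inv(0)', 'premises': []},    # ground instance of a fact clause
    {'conclusion': 'Inv(1)', 'premises': [0]},   # ground instance of a rule
    {'conclusion': 'false', 'premises': [1]},    # ground instance of the query clause
]
print(validate_trace(trace))                     # (True, 'ok')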
Example 1. Consider the following CHC system and the proof of its
unsatisfiability.
The core components of Golem that solve the problem of satisfiability of a CHC
system are referred to as back-end engines, or just engines. Golem implements
several popular state-of-the-art algorithms from model checking and software
verification: BMC, k-induction, IMC, LAWI and Spacer. These algorithms treat
the problem of solving a CHC system as a reachability problem in the graph
representation.
The unique feature of Golem is the implementation of the new model-
checking algorithm based on the concept of Transition Power Abstraction (TPA).
It is capable of much deeper analysis than other algorithms when searching for
counterexamples [14], and it discovers transition invariants [13], as opposed to
the usual (state) invariants.
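To make the contrast precise, the two notions can be stated in their usual form (our paraphrase of the standard definitions, with R the transition relation, Init the initial states, Reach the reachable states, and R⁺ the transitive closure of R): a state invariant is a set I with

  Init ⊆ I and post_R(I) ⊆ I,

whereas a transition invariant is a relation T with

  R⁺ ∩ (Reach × Reach) ⊆ T.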
4 Experiments
In this section, we evaluate the performance of individual Golem’s engines on
the benchmarks from the latest edition of CHC-COMP. The goal of these experi-
ments is to 1) demonstrate the usefulness of multiple back-end engines and their
potential combined use for solving various problems, and 2) compare Golem
against state-of-the-art Horn solvers.
The benchmark collections of CHC-COMP represent a rich source of prob-
lems from various domains.⁶ Version 0.3.2 of Golem was used for these exper-
iments. Z3-Spacer (Z3 4.11.2) and Eldarica 2.0.8 were run (with default
options) for comparison as the best Horn solvers available. All experiments
were conducted on a machine with an AMD EPYC 7452 32-core processor and
8 × 32 GiB of memory; the timeout was set to 300 s. No conflicting answers were
observed in any of the experiments. The results are in line with the results of
the last editions of CHC-COMP where Golem participated [24,31]. Our artifact
for reproducing the experiments is available at https://doi.org/10.5281/zenodo.7973428.
On unsatisfiable problems, the differences in the engines' performance are not
substantial, but the BMC engine firmly dominates the others. On satisfiable
problems, we see significant differences. Figure 2 plots, for each engine, the
number of solved satisfiable benchmarks (x-axis) within the given time limit
(y-axis, log scale).
[Fig. 2. For each engine (IMC, KIND, LAWI, ..., and the virtual best VB): the number of solved satisfiable problems (x-axis) against runtime in seconds (y-axis, log scale).]
The large lead of VB suggests that the solving abilities of the engines are
widely complementary. No single engine dominates the others on satisfiable
instances. The portfolio of techniques available in Golem is much stronger than
any single one of them.
Moreover, the unified setting enables direct comparison of the algorithms.
For example, we can conclude from these experiments that the extra check for
k-inductive invariants on top of the BMC-style search for counterexamples, as
implemented in the KIND engine, incurs only a small overhead on unsatisfi-
able problems, but makes the KIND engine very successful in solving satisfiable
problems.
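The interplay of the two checks can be sketched in a few lines. The following is a generic, textbook-style BMC-plus-k-induction loop written with z3py over a toy transition system of our own; it is not Golem's KIND implementation.

# Sketch: BMC search for counterexamples combined with a k-induction check.
from z3 import Solver, Int, Not, sat, unsat

def init(s):     return s == 0
def trans(s, t): return t == (s + 1) % 4      # a counter modulo 4
def bad(s):      return s == 5                # never reachable: the system is safe

def bmc_kind(max_k=10):
    xs = [Int(f'x_{i}') for i in range(max_k + 2)]
    for k in range(max_k):
        base = Solver()                        # BMC: is a bad state reachable in k steps?
        base.add(init(xs[0]), bad(xs[k]),
                 *[trans(xs[i], xs[i + 1]) for i in range(k)])
        if base.check() == sat:
            return f'UNSAFE at depth {k}'
        step = Solver()                        # k-induction: k+1 consecutive safe states
        step.add(*[trans(xs[i], xs[i + 1]) for i in range(k + 1)],
                 *[Not(bad(xs[i])) for i in range(k + 1)],
                 bad(xs[k + 1]))               # ...cannot be followed by a bad state
        if step.check() == unsat:
            return f'SAFE ({k + 1}-inductive)'
    return 'UNKNOWN'

print(bmc_kind())   # expect: SAFE (1-inductive)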
Next, we considered the LIA-Lin category of CHC-COMP. These are linear sys-
tems of CHCs with linear integer arithmetic as the background theory. There
are many benchmarks in this category, and for the evaluation at the competition,
a subset of benchmarks is selected (see [24,31]). We evaluated the LAWI and
Spacer engines of Golem (the engines capable of solving general linear CHC
systems) on the benchmarks selected at CHC-COMP 2022 and compared their
performance to Z3-Spacer and Eldarica. Notably, we also examined a spe-
cific subcategory of LIA-Lin, namely extra-small-lia⁸, with benchmarks that
fall into the fragment accepted by Golem's TPA engine.
There are 55 benchmarks in the extra-small-lia subcategory, all satisfiable
but known to be highly challenging for all tools. The results, given in Table 2,
show that split-TPA outperforms not only the LAWI and Spacer engines of
Golem but also Z3-Spacer; only Eldarica solves more benchmarks. We
ascribe this to split-TPA's capability to perform deep analysis and discover
transition invariants.
Table 2. Number of solved benchmarks from the extra-small-lia subcategory.

              Golem
  split-TPA    LAWI    Spacer  |  Z3-Spacer  |  Eldarica
      22        12       18    |     18      |     36
For the whole LIA-Lin category, 499 benchmarks were selected in the 2022
edition of CHC-COMP [24]. The performance of the LAWI and Spacer engines
of Golem, Z3-Spacer and Eldarica on this selection is summarized in Table 3.
Here, the Spacer engine of Golem significantly outperforms the LAWI engine.
Moreover, even though Golem loses to Z3-Spacer, it beats Eldarica. Given
that Golem is a prototype, and Z3-Spacer and Eldarica have been developed
and optimized for several years, this demonstrates the great potential of Golem.
Table 3. Number of solved benchmarks from the LIA-Lin selection of CHC-COMP 2022.

              Golem
           LAWI    Spacer  |  Z3-Spacer  |  Eldarica
  SAT       131      184   |     211     |    183
  UNSAT      77       82   |      96     |     60
[Fig. 3. Runtime (in seconds, log scale) of Golem's Spacer engine (x-axis) against Z3-Spacer (left) and Eldarica (right) (y-axis); t/o and m/o mark timeouts and memory-outs.]
Overall, Golem solved fewer problems than Z3-Spacer but more than
Eldarica; however, all tools solved some instances uniquely. A detailed compar-
ison is depicted in Fig. 3. For each benchmark, its data point in the plot reflects
the runtime of Golem (x-axis) and the runtime of the competitor (y-axis). The
plots suggest that the performance of Golem is often orthogonal to Eldarica,
but highly correlated with the performance of Z3-Spacer. This is not surpris-
ing as the Spacer engine in Golem is built on the same core algorithm. Even
though Golem is often slower than Z3-Spacer, there is a non-trivial number
of benchmarks on which Z3-Spacer times out, but which Golem solves fairly
quickly. Thus, Golem, while being a newcomer, already complements existing
state-of-the-art tools, and more improvements are expected in the near future.
To summarise, the overall experimentation with the different engines of Golem
demonstrates the advantages of the multi-engine general framework and illus-
trates the competitiveness of its analysis. Golem provides considerable flexibility
in addressing various verification problems while being easily customizable to
the demands of the analysis.
⁸ https://github.com/chc-comp/extra-small-lia.
5 Conclusion
In this work, we presented Golem, a flexible and effective Horn solver with mul-
tiple back-end engines, including recently-introduced TPA-based model-checking
algorithms. Golem is a suitable research tool for prototyping new SMT-based
model-checking algorithms and comparing algorithms in a unified framework.
Additionally, the efficient implementation of the algorithms, achieved through tight
coupling with the underlying SMT solver, makes it an effective back end for
domain-specific verification tools. Future directions for Golem include support
for VMT input format [21] and analysis of liveness properties, extension of TPA
to nonlinear CHC systems, and support for SMT theories of arrays, bit-vectors
and algebraic datatypes.
References
1. Alt, L.: Controlled and Effective Interpolation. Ph.D. thesis, Università della
Svizzera italiana (2016). https://susi.usi.ch/usi/documents/318933
2. Alt, L., Blicha, M., Hyvärinen, A.E.J., Sharygina, N.: SolCMC: Solidity compiler’s
model checker. In: Shoham, S., Vizel, Y. (eds.) Computer Aided Verification, pp.
325–338. Springer International Publishing, Cham (2022)
3. Alt, L., Hyvärinen, A.E.J., Sharygina, N.: LRA interpolants from no man’s land.
In: Strichman, O., Tzoref-Brill, R. (eds.) Hardware and Software: Verification and
Testing, pp. 195–210. Springer International Publishing, Cham (2017)
4. Alur, R., et al.: Syntax-guided synthesis. In: 2013 Formal Methods in Computer-
Aided Design, pp. 1–8 (2013)
5. Barrett, C., Fontaine, P., Tinelli, C.: The SMT-LIB Standard: Version 2.6. Tech.
rep., Department of Computer Science, The University of Iowa (2017). https://
www.SMT-LIB.org
6. Beyer, D., Wendler, P.: Algorithms for software model checking: Predicate abstrac-
tion vs. Impact. In: 2012 Formal Methods in Computer-Aided Design (FMCAD),
pp. 106–113 (Oct 2012)
7. Beyer, D., Dangl, M.: Software verification with PDR: an implementation of the
state of the art. In: Biere, A., Parker, D. (eds.) Tools and Algorithms for the
Construction and Analysis of Systems, pp. 3–21. Springer International Publishing,
Cham (2020)
8. Beyer, D., Dangl, M., Wendler, P.: Boosting k-induction with continuously-refined
invariants. In: Kroening, D., Păsăreanu, C.S. (eds.) Computer Aided Verification,
pp. 622–640. Springer International Publishing, Cham (2015)
9. Beyer, D., Dangl, M., Wendler, P.: A unifying view on SMT-based software verifi-
cation. J. Autom. Reason. 60(3), 299–335 (2018)
10. Beyer, D., Keremoglu, M.E.: CPAchecker: a tool for configurable software verifica-
tion. In: Gopalakrishnan, G., Qadeer, S. (eds.) Computer Aided Verification, pp.
184–190. Springer, Berlin Heidelberg, Berlin, Heidelberg (2011)
11. Beyer, D., Lee, N.Z., Wendler, P.: Interpolation and SAT-based model check-
ing revisited: Adoption to software verification. Tech. Rep. arXiv/CoRR
arXiv:2208.05046 (August 2022)
12. Biere, A., Cimatti, A., Clarke, E., Zhu, Y.: Symbolic model checking without BDDs.
In: Cleaveland, W.R. (ed.) Tools and Algorithms for the Construction and Analysis
of Systems, pp. 193–207. Springer, Berlin Heidelberg, Berlin, Heidelberg (1999)
13. Blicha, M., Fedyukovich, G., Hyvärinen, A.E.J., Sharygina, N.: Split transition
power abstractions for unbounded safety. In: Griggio, A., Rungta, N. (eds.) Pro-
ceedings of the 22nd Conference on Formal Methods in Computer-Aided Design -
FMCAD 2022. pp. 349–358. TU Wien Academic Press (2022). https://doi.org/10.
34727/2022/isbn.978-3-85448-053-2_42
14. Blicha, M., Fedyukovich, G., Hyvärinen, A.E.J., Sharygina, N.: Transition power
abstractions for deep counterexample detection. In: Fisman, D., Rosu, G. (eds.)
Tools and Algorithms for the Construction and Analysis of Systems, pp. 524–542.
Springer International Publishing, Cham (2022)
15. Blicha, M., Hyvärinen, A.E.J., Kofroň, J., Sharygina, N.: Using linear algebra in
decomposition of Farkas interpolants. Int. J. Softw. Tools Technol. Transfer 24(1),
111–125 (2022)
16. Bradley, A.R.: SAT-based model checking without unrolling. In: Jhala, R., Schmidt,
D. (eds.) Verification, Model Checking, and Abstract Interpretation, pp. 70–87.
Springer, Berlin Heidelberg, Berlin, Heidelberg (2011)
17. Bruttomesso, R., Pek, E., Sharygina, N., Tsitovich, A.: The OpenSMT solver. In:
Esparza, J., Majumdar, R. (eds.) Tools and Algorithms for the Construction and
Analysis of Systems, pp. 150–153. Springer, Berlin Heidelberg, Berlin, Heidelberg
(2010)
18. Calzavara, S., Grishchenko, I., Maffei, M.: HornDroid: Practical and sound static
analysis of android applications by SMT solving. In: 2016 IEEE European Sympo-
sium on Security and Privacy, pp. 47–62 (2016)
19. Champion, A., Chiba, T., Kobayashi, N., Sato, R.: ICE-based refinement type
discovery for higher-order functional programs. In: Beyer, D., Huisman, M. (eds.)
Tools and Algorithms for the Construction and Analysis of Systems, pp. 365–384.
Springer International Publishing, Cham (2018)
20. Champion, A., Kobayashi, N., Sato, R.: HoIce: an ICE-based non-linear Horn
clause solver. In: Ryu, S. (ed.) Programming Languages and Systems, pp. 146–
156. Springer International Publishing, Cham (2018)
21. Cimatti, A., Griggio, A., Tonetta, S.: The VMT-LIB language and tools (2021)
22. Clarke, E., Grumberg, O., Jha, S., Lu, Y., Veith, H.: Counterexample-guided
abstraction refinement. In: Emerson, E.A., Sistla, A.P. (eds.) Computer Aided
Verification, pp. 154–169. Springer, Berlin Heidelberg, Berlin, Heidelberg (2000)
23. Craig, W.: Three uses of the Herbrand-Gentzen theorem in relating model theory
and proof theory. J. Symbolic Logic 22(3), 269–285 (1957)
24. De Angelis, E., Vediramana Krishnan, H.G.: CHC-COMP 2022: Competition
report. Electron. Proc. Theor. Comput. Sci. 373, 44–62 (nov 2022)
25. Dietsch, D., Heizmann, M., Hoenicke, J., Nutz, A., Podelski, A.: Ultimate
TreeAutomizer (CHC-COMP tool description). In: Angelis, E.D., Fedyukovich, G.,
Tzevelekos, N., Ulbrich, M. (eds.) Proceedings of the Sixth Workshop on Horn
Clauses for Verification and Synthesis and Third Workshop on Program Equiv-
alence and Relational Reasoning, HCVS/PERR@ETAPS 2019, Prague, Czech
Republic, 6–7th April 2019. EPTCS, vol. 296, pp. 42–47 (2019)
26. Een, N., Mishchenko, A., Brayton, R.: Efficient implementation of property
directed reachability. In: Proceedings of the International Conference on Formal
Methods in Computer-Aided Design, pp. 125–134. FMCAD ’11, FMCAD Inc,
Austin, TX (2011)
27. Ernst, G.: Korn–software verification with Horn clauses (competition contribu-
tion). In: Sankaranarayanan, S., Sharygina, N. (eds.) Tools and Algorithms for the
Construction and Analysis of Systems, pp. 559–564. Springer Nature Switzerland,
Cham (2023)
28. Esen, Z., Rümmer, P.: TriCera: Verifying C programs using the theory of heaps. In:
Griggio, A., Rungta, N. (eds.) Proceedings of the 22nd Conference on Formal Meth-
ods in Computer-Aided Design - FMCAD 2022, pp. 360–391. TU Wien Academic
Press (2022)
29. Fedyukovich, G., Kaufman, S.J., Bodík, R.: Sampling invariants from frequency
distributions. In: 2017 Formal Methods in Computer Aided Design (FMCAD), pp.
100–107 (2017)
30. Fedyukovich, G., Prabhu, S., Madhukar, K., Gupta, A.: Solving constrained Horn
clauses using syntax and data. In: 2018 Formal Methods in Computer Aided Design
(FMCAD), pp. 1–9 (2018)
31. Fedyukovich, G., Rümmer, P.: Competition report: CHC-COMP-21. In: Hojjat, H.,
Kafle, B. (eds.) Proceedings 8th Workshop on Horn Clauses for Verification and
Synthesis, HCVS@ETAPS 2021, Virtual, 28th March 2021. EPTCS, vol. 344, pp.
91–108 (2021)
32. Graf, S., Saidi, H.: Construction of abstract state graphs with PVS. In: Grum-
berg, O. (ed.) CAV 1997. LNCS, vol. 1254, pp. 72–83. Springer, Heidelberg (1997).
https://doi.org/10.1007/3-540-63166-6_10
33. Grebenshchikov, S., Lopes, N.P., Popeea, C., Rybalchenko, A.: Synthesizing soft-
ware verifiers from proof rules. In: Proceedings of the 33rd ACM SIGPLAN Confer-
ence on Programming Language Design and Implementation, pp. 405–416. PLDI
’12, Association for Computing Machinery, New York, NY, USA (2012)
34. Gurfinkel, A., Bjørner, N.: The science, art, and magic of constrained Horn clauses.
In: 2019 21st International Symposium on Symbolic and Numeric Algorithms for
Scientific Computing (SYNASC), pp. 6–10 (2019)
35. Gurfinkel, A.: Program verification with constrained Horn clauses (invited paper).
In: Shoham, S., Vizel, Y. (eds.) Computer Aided Verification, pp. 19–29. Springer
International Publishing, Cham (2022)
36. Gurfinkel, A., Kahsai, T., Komuravelli, A., Navas, J.A.: The SeaHorn verification
framework. In: Kroening, D., Păsăreanu, C.S. (eds.) Computer Aided Verification,
pp. 343–361. Springer International Publishing, Cham (2015)
37. Hari Govind, V.K., Chen, Y., Shoham, S., Gurfinkel, A.: Global guidance for local
generalization in model checking. In: Lahiri, S.K., Wang, C. (eds.) Computer Aided
Verification, pp. 101–125. Springer International Publishing, Cham (2020)
38. Hoder, K., Bjørner, N.: Generalized property directed reachability. In: Cimatti, A.,
Sebastiani, R. (eds.) SAT 2012. LNCS, vol. 7317, pp. 157–171. Springer, Heidelberg
(2012). https://doi.org/10.1007/978-3-642-31612-8_13
39. Hojjat, H., Iosif, R., Konečný, F., Kuncak, V., Rümmer, P.: Accelerating inter-
polants. In: Chakraborty, S., Mukund, M. (eds.) Automated Technology for Verifi-
cation and Analysis, pp. 187–202. Springer, Berlin Heidelberg, Berlin, Heidelberg
(2012)
40. Hojjat, H., Rümmer, P.: The Eldarica Horn solver. In: FMCAD, pp. 158–164. IEEE
(Oct 2018)
41. Hojjat, H., Rümmer, P., Subotic, P., Yi, W.: Horn clauses for communicating timed
systems. Electronic Proceedings in Theoretical Computer Science 169, 39–52 (dec
2014)
42. Hyvärinen, A.E.J., Marescotti, M., Alt, L., Sharygina, N.: OpenSMT2: An SMT
Solver for Multi-core and Cloud Computing. In: Creignou, N., Le Berre, D. (eds.)
SAT 2016. LNCS, vol. 9710, pp. 547–553. Springer, Cham (2016). https://doi.org/
10.1007/978-3-319-40970-2_35
43. Kafle, B., Gallagher, J.P.: Tree automata-based refinement with application to
Horn clause verification. In: D’Souza, D., Lal, A., Larsen, K.G. (eds.) Verifica-
tion, Model Checking, and Abstract Interpretation, pp. 209–226. Springer, Berlin
Heidelberg, Berlin, Heidelberg (2015)
44. Kahsai, T., Rümmer, P., Sanchez, H., Schäf, M.: Jayhorn: a framework for verifying
Java programs. In: Chaudhuri, S., Farzan, A. (eds.) Computer Aided Verification,
pp. 352–358. Springer International Publishing, Cham (2016)
45. Komuravelli, A., Gurfinkel, A., Chaki, S.: SMT-based model checking for recursive
programs. Formal Methods in System Design 48(3), 175–205 (2016)
46. Leroux, J., Rümmer, P., Subotić, P.: Guiding Craig interpolation with domain-
specific abstractions. Acta Informatica 53(4), 387–424 (2016)
47. Mann, M., et al.: Pono: a flexible and extensible SMT-based model checker. In:
Silva, A., Leino, K.R.M. (eds.) Computer Aided Verification, pp. 461–474. Springer
International Publishing, Cham (2021)
48. Matsushita, Y., Tsukada, T., Kobayashi, N.: RustHorn: CHC-based verification for
Rust programs. ACM Trans. Program. Lang. Syst. 43(4) (oct 2021)
49. McMillan, K.L.: Interpolation and SAT-Based model checking. In: Hunt, W.A.,
Somenzi, F. (eds.) CAV 2003. LNCS, vol. 2725, pp. 1–13. Springer, Heidelberg
(2003). https://doi.org/10.1007/978-3-540-45069-6_1
50. McMillan, K.L.: Lazy abstraction with interpolants. In: Ball, T., Jones, R.B. (eds.)
Computer Aided Verification, pp. 123–136. Springer, Berlin Heidelberg, Berlin,
Heidelberg (2006)
51. de Moura, L., Bjørner, N.: Z3: an efficient SMT solver. In: Ramakrishnan, C.R.,
Rehof, J. (eds.) Tools and Algorithms for the Construction and Analysis of Systems,
pp. 337–340. Springer, Berlin Heidelberg, Berlin, Heidelberg (2008)
52. Rollini, S.F., Alt, L., Fedyukovich, G., Hyvärinen, A.E.J., Sharygina, N.: PeRIPLO:
a framework for producing effective interpolants in SAT-based software verification.
In: McMillan, K., Middeldorp, A., Voronkov, A. (eds.) Logic for Programming,
Artificial Intelligence, and Reasoning, pp. 683–693. Springer, Berlin Heidelberg,
Berlin, Heidelberg (2013)
53. Rümmer, P., Subotić, P.: Exploring interpolants. In: 2013 Formal Methods in
Computer-Aided Design, pp. 69–76 (Oct 2013)
54. Rümmer, P.: Competition report: CHC-COMP-20. Electron. Proc. Theor. Comput.
Sci. 320, 197–219 (2020)
55. Sheeran, M., Singh, S., Stålmarck, G.: Checking safety properties using induc-
tion and a SAT-solver. In: Hunt, W.A., Johnson, S.D. (eds.) Formal Methods in
Computer-Aided Design, pp. 127–144. Springer, Berlin Heidelberg, Berlin, Heidel-
berg (2000)
56. Wang, W., Jiao, L.: Trace Abstraction Refinement for Solving Horn Clauses. Com-
put. J. 59(8), 1236–1251 (2016)
57. Wesley, S., Christakis, M., Navas, J.A., Trefler, R., Wüstholz, V., Gurfinkel, A.:
Verifying solidity smart contracts via communication abstraction in smartace. In:
Finkbeiner, B., Wies, T. (eds.) Verification, Model Checking, and Abstract Inter-
pretation, pp. 425–449. Springer International Publishing, Cham (2022)
58. Zlatkin, I., Fedyukovich, G.: Maximizing branch coverage with constrained Horn
clauses. In: Fisman, D., Rosu, G. (eds.) Tools and Algorithms for the Construction
and Analysis of Systems, pp. 254–272. Springer International Publishing, Cham
(2022)
Model Checking
CoqCryptoLine: A Verified Model
Checker with Certified Results
1 Introduction
Related Work. There are numerous model checkers in the community, e.g., [8,
13,21–23]. Nevertheless, few of them are formally verified. To our knowl-
edge, the first verification of a model checker was performed in Coq for
the modal μ-calculus [34]. The LTL model checker CAVA [15,27] and the
model checker Munta [38,39] for timed automata were developed and verified
using Isabelle/HOL [29]; they can be considered verified counterparts
of SPIN [21] and Uppaal [23], respectively. CoqCryptoLine instead checks
CryptoLine models [16,31], which are used to establish the correctness of crypto-
graphic programs. It can be seen as a verified version of CryptoLine. A large body
of work studies the correctness of cryptographic programs, e.g., [2–4,9,12,14,24,26,40];
cf. [5] for a survey. These approaches either require human intervention or are unverified, while
our work is fully automatic and verified. The most relevant work is bvCryp-
toLine [37], which is the first automated and partly verified model checker
for a very limited subset of CryptoLine. We compare our work with it
comprehensively in Sect. 2.3.
2 CoqCryptoLine
CoqCryptoLine is an automatic verification tool that takes a CryptoLine
specification as input and returns certified results indicating the validity of the
specification. We briefly describe the CryptoLine language [16] followed by the
modules, features, and optimizations of CoqCryptoLine in this section.
[Fig. 1. Architecture of CoqCryptoLine: the parser (trusted) translates the input into the DSL; the verified modules SSA, SSA2ZSSA, SSA2QFBV and the Validator issue solve and validate queries to the verified external SMT QF_BV solver and to an unverified external computer algebra system.]
programs from the certified SMT QF_BV solver CoqQFBV. Our trusted
computing base consists of (1) CoqCryptoLine parser, (2) text interface with
external SAT solvers (from CoqQFBV), (3) the proof assistant Isabelle [29]
(from the SAT solver certificate validator Grat used by CoqQFBV) and (4) the
Coq proof assistant. Particularly, sophisticated decision procedures in external
CASs and SAT solvers used in CoqQFBV need not be trusted.
Type System. CoqCryptoLine fully supports the type system of the Cryp-
toLine language. The type system is used to model bit-vectors of arbitrary
bit-widths with unsigned or signed interpretation. Such a type system allows
CoqCryptoLine to model more industrial examples translated from C pro-
grams via GCC [16] or LLVM [24] compared to bvCryptoLine [37], which only
allows unsigned bit-vectors, all of the same bit-width.
Multi-threading. All OCaml code extracted from the verified algorithms in Coq
runs sequentially. To speed up verification, SMT QF_BV problems, as well as root
entailment problems, are solved in parallel.
3 Walkthrough
We illustrate how CoqCryptoLine is used in this section. The x86_64 assembly
subroutine ecp_nistz256_mul_montx from OpenSSL [30] shown in Fig. 2 is
verified as an example.
An input for CoqCryptoLine contains a CryptoLine specification for the
assembly subroutine. The original subroutine is marked between the comments
The output 256-bit integer, represented by the four variables ci (for 0 ≤ i < 4),
has to satisfy two requirements. First, the output integer times 2^256 equals the product
of the input integers modulo p256. Second, the output integer is less than p256.
Formally, we have this post-condition:
eqmod limbs 64 [0, 0, 0, 0, c0, c1, c2, c3]
limbs 64 [a0, a1, a2, a3] * limbs 64 [b0, b1, b2, b3]
limbs 64 [m0, m1, m2, m3]
&&
limbs 64 [c0, c1, c2, c3] <u limbs 64 [m0, m1, m2, m3]
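Here limbs 64 [l0, ..., ln] denotes the radix-2^64 composition l0 + l1·2^64 + ... + ln·2^(64·n) (least-significant limb first), and eqmod a b m asserts a ≡ b (mod m). The following Python sketch of this reading is ours, not CoqCryptoLine code.

# Sketch: intended semantics of "limbs" and "eqmod" on concrete integers.

def limbs(width, ls):
    # radix-2^width composition, least-significant limb first
    return sum(l << (width * i) for i, l in enumerate(ls))

def eqmod(a, b, m):
    return (a - b) % m == 0

assert limbs(64, [0, 1]) == 2**64
# The post-condition then reads: the output value c0..c3, shifted by four zero
# limbs (i.e., multiplied by 2^256), is congruent to a * b modulo m.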
The assert statement verifies, through the bit-vector reduction, that both the
carry and overflow flags are zero. The assume statement then passes this
information to the algebraic reduction. Effectively, CoqCryptoLine checks that
both flags are zero for all inputs satisfying the pre-condition, and then uses these facts
as lemmas to verify the post-condition with the algebraic reduction.
The full specification for ecp_nistz256_mul_montx has 230 lines, including
50 lines of manual annotations. Twenty of these are straightforward annotations for
variable declaration and initialization. The remaining 30 lines of annotations are hints to
CoqCryptoLine, which then verifies the post-condition in 30 s using 24 threads.
This walkthrough of the typical verification flow shows how a user constructs
a CryptoLine specification. The pre-condition for program inputs, the post-
condition for outputs, and variable initialization must be specified manually.
Additional annotations may be added as hints. Notice that hints only tell
CoqCryptoLine what properties should hold, not why they hold. Proofs of the annotated
hints and of the post-condition are found by CoqCryptoLine automatically. Con-
sequently, manual annotations are minimized and verification effort is reduced
significantly.
4 Evaluation
We evaluate CoqCryptoLine on 52 benchmarks from four industrial security
libraries Bitcoin [35], boringSSL [14,18], nss [25], and OpenSSL [30]. The C
reference and optimized avx2 implementations of the Number-Theoretic Trans-
form (NTT) from the post-quantum key encapsulation mechanism Kyber [10]
are also evaluated. Among the total 54 benchmarks, 43 benchmarks contain fea-
tures not supported by bvCryptoLine, such as signed variables. All experiments
are performed on an Ubuntu 22.04.1 machine with a 3.20 GHz Intel Xeon Gold
6134M CPU and 1 TB of RAM.
Benchmarks from security libraries are various field and group operations
from elliptic curve cryptography (ECC). In ECC, rational points on curves are
represented by elements in large finite fields. In Bitcoin, the finite field is the
residue system modulo the prime p256k1 = 2^256 − 2^32 − 2^9 − 2^8 − 2^7 − 2^6 − 2^4 − 1.
For other security libraries (boringSSL, nss, and OpenSSL), we verify the
operations in Curve25519 using the residue system modulo the prime p25519 =
2^255 − 19 as the underlying field. Rational points on elliptic curves form a group.
The group operation in turn is implemented by a number of field operations.
In lattice-based post-quantum cryptosystems, polynomial rings are used.
Specifically, the polynomial ring Z_3329[X]/⟨X^256 + 1⟩ is used in Kyber. To
speed up multiplication in the polynomial ring, Kyber requires the multiplica-
tion to be implemented by the NTT. The NTT is a discrete Fast Fourier Transform over
finite fields: instead of complex roots of unity, it uses principal roots of
unity in the field. Mathematically, the Kyber NTT computes the following ring
isomorphism,
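given here in the standard form from the Kyber specification (ζ = 17 is a primitive 256th root of unity modulo 3329 and br₇ is the 7-bit bit-reversal permutation):

  Z_3329[X]/⟨X^256 + 1⟩ ≅ ∏_{i=0}^{127} Z_3329[X]/⟨X² − ζ^(2·br₇(i)+1)⟩.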
[Fig. 3. (a) Running times of CoqCryptoLine versus CryptoLine. (b) Percentages of average running time spent in CoqCryptoLine's internal OCaml code (INT), the external SMT QF_BV solver (SMT), and the external computer algebra system (CAS).]
5 Conclusion
References
1. CoqCryptoLine GitHub repository (2023). https://github.com/fmlab-iis/coq-cryptoline
2. Affeldt, R.: On construction of a library of formally verified low-level arithmetic
functions. Innov. Syst. Softw. Eng. 9(2), 59–77 (2013)
3. Almeida, J.B., et al.: Jasmin: High-assurance and high-speed cryptography. In:
ACM SIGSAC Conference on Computer and Communications Security, pp. 1807–
1823. ACM (2017)
4. Appel, A.W.: Verification of a cryptographic primitive: SHA-256. ACM Trans.
Programm. Lang. Syst. 37(2), 7:1–7:31 (2015)
5. Barbosa, M., et al.: Sok: Computer-aided cryptography. In: 42nd IEEE Symposium
on Security and Privacy, SP 2021, San Francisco, CA, USA, 24-27 May 2021, pp.
777–795. IEEE (2021). https://doi.org/10.1109/SP40001.2021.00008
6. Barrett, C., Fontaine, P., Tinelli, C.: The Satisfiability Modulo Theories Library
(SMT-LIB). www.SMT-LIB.org (2016)
7. Bertot, Y., Castéran, P.: Interactive Theorem Proving and Program Development -
Coq’Art: The Calculus of Inductive Constructions. Texts in Theoretical Computer
Science, Springer (2004). https://doi.org/10.1007/978-3-662-07964-5
8. Beyer, D., Keremoglu, M.E.: Cpachecker: A tool for configurable software verifi-
cation. In: Gopalakrishnan, G., Qadeer, S. (eds.) Computer Aided Verification -
23rd International Conference, CAV 2011, Snowbird, UT, USA, July 14-20, 2011.
Proceedings. Lecture Notes in Computer Science, vol. 6806, pp. 184–190. Springer
(2011). https://doi.org/10.1007/978-3-642-22110-1_16
9. Bond, B., et al.: Vale: Verifying high-performance cryptographic assembly code.
In: USENIX Security Symposium, pp. 917–934. USENIX Association (2017)
10. Bos, J., et al.: CRYSTALS - Kyber: a CCA-secure module-lattice-based KEM. In:
Smith, M., Piessens, F. (eds.) IEEE European Symposium on Security and Privacy,
pp. 353–367. IEEE (2018)
11. Chalupa, M., Strejcek, J.: Evaluation of program slicing in software verification. In:
Ahrendt, W., Tarifa, S.L.T. (eds.) Integrated Formal Methods - 15th International
Conference, IFM 2019, Bergen, Norway, December 2-6, 2019, Proceedings. Lecture
Notes in Computer Science, vol. 11918, pp. 101–119. Springer (2019). https://doi.
org/10.1007/978-3-030-34968-4_6
12. Chen, Y.F., et al.: Verifying Curve25519 software. In: Ahn, G.J., Yung, M., Li, N.
(eds.) ACM SIGSAC Conference on Computer and Communications Security, pp.
299–309. ACM (2014)
13. Cimatti, A., et al.: NuSMV 2: An opensource tool for symbolic model checking. In:
Brinksma, E., Larsen, K.G. (eds.) Computer Aided Verification, 14th International
Conference, CAV 2002,Copenhagen, Denmark, July 27-31, 2002, Proceedings. Lec-
ture Notes in Computer Science, vol. 2404, pp. 359–364. Springer (2002). https://
doi.org/10.1007/3-540-45657-0_29
14. Erbsen, A., Philipoom, J., Gross, J., Sloan, R., Chlipala, A.: Simple high-level
code for cryptographic arithmetic - with proofs, without compromises. In: IEEE
Symposium on Security and Privacy, pp. 1202–1219. IEEE (2019)
15. Esparza, J., Lammich, P., Neumann, R., Nipkow, T., Schimpf, A., Smaus, J.: A
fully verified executable LTL model checker. In: Sharygina, N., Veith, H. (eds.)
Computer Aided Verification - 25th International Conference, CAV 2013, Saint
Petersburg, Russia, July 13-19, 2013. Proceedings. Lecture Notes in Computer
Science, vol. 8044, pp. 463–478. Springer (2013). https://doi.org/10.1007/978-3-
642-39799-8_31
16. Fu, Y.F., Liu, J., Shi, X., Tsai, M.H., Wang, B.Y., Yang, B.Y.: Signed cryptographic
program verification with typed cryptoline. In: Cavallaro, L., Kinder, J., Wang,
X., Katz, J. (eds.) ACM SIGSAC Conference on Computer and Communications
Security, pp. 1591–1606. ACM (2019)
17. Gonthier, G., Mahboubi, A.: An introduction to small scale reflection in Coq. J.
Formalized Reason. 3(2), 95–152 (2010)
18. Google: Boringssl (2021). https://boringssl.googlesource.com/boringssl/
19. Greuel, G.M., Pfister, G.: A Singular Introduction to Commutative Algebra.
Springer-Verlag (2002)
20. Harrison, J.: Automating elementary number-theoretic proofs using Gröbner bases.
In: Pfenning, F. (ed.) CADE 2007. LNCS (LNAI), vol. 4603, pp. 51–66. Springer,
Heidelberg (2007). https://doi.org/10.1007/978-3-540-73595-3_5
21. Holzmann, G.J.: The SPIN Model Checker - primer and reference manual. Addison-
Wesley (2004)
22. Lamport, L.: Specifying Systems, The TLA+ Language and Tools for Hardware
and Software Engineers. Addison-Wesley (2002). http://research.microsoft.com/
users/lamport/tla/book.html
23. Larsen, K.G., Pettersson, P., Yi, W.: UPPAAL in a nutshell. Int. J. Softw. Tools
Technol. Transf. 1(1-2), 134–152 (1997). https://doi.org/10.1007/s100090050010
24. Liu, J., Shi, X., Tsai, M.H., Wang, B.Y., Yang, B.Y.: Verifying arithmetic in cryp-
tographic C programs. In: Lawall, J., Marinov, D. (eds.) IEEE/ACM International
Conference on Automated Software Engineering, pp. 552–564. IEEE (2019)
25. Mozilla: Network security services (2021). https://developer.mozilla.org/en-US/
docs/Mozilla/Projects/NSS
26. Myreen, M.O., Curello, G.: Proof Pearl: a verified bignum implementation in x86-
64 machine code. In: Gonthier, G., Norrish, M. (eds.) CPP 2013. LNCS, vol. 8307,
pp. 66–81. Springer, Cham (2013). https://doi.org/10.1007/978-3-319-03545-1_5
27. Neumann, R.: Using promela in a fully verified executable LTL model checker.
In: Giannakopoulou, D., Kroening, D. (eds.) Verified Software: Theories, Tools
and Experiments - 6th International Conference, VSTTE 2014, Vienna, Austria,
July 17-18, 2014, Revised Selected Papers. Lecture Notes in Computer Science,
vol. 8471, pp. 105–114. Springer (2014). https://doi.org/10.1007/978-3-319-12154-
3_7
28. Niemetz, A., Preiner, M., Biere, A.: Boolector 2.0. J. Satisfiability, Boolean Mod-
eling Comput. 9(1), 53–58 (2014)
29. Nipkow, T., Wenzel, M., Paulson, L.C. (eds.): Isabelle/HOL. LNCS, vol. 2283.
Springer, Heidelberg (2002). https://doi.org/10.1007/3-540-45949-9
30. OpenSSL: OpenSSL library. https://github.com/openssl/openssl (2021)
31. Polyakov, A., Tsai, M.H., Wang, B.Y., Yang, B.Y.: Verifying arithmetic assembly
programs in cryptographic primitives. In: Schewe, S., Zhang, L. (eds.) Interna-
tional Conference on Concurrency Theory, pp. 4:1–4:16. LIPIcs, Schloss Dagstuhl
- Leibniz-Zentrum fuer Informatik (2018)
32. Pottier, L.: Connecting Gröbner bases programs with Coq to do proofs in algebra,
geometry and arithmetics. In: Rudnicki, P., Sutcliffe, G., Konev, B., Schmidt, R.A.,
Schulz, S. (eds.) Proceedings of the LPAR 2008 Workshops, Knowledge Exchange:
Automated Provers and Proof Assistants, and the 7th International Workshop on
the Implementation of Logics, Doha, Qatar, November 22, 2008. CEUR Workshop
Proceedings, vol. 418. CEUR-WS.org (2008). http://ceur-ws.org/Vol-418/paper5.
pdf
33. Shi, X., Fu, Y.F., Liu, J., Tsai, M.H., Wang, B.Y., Yang, B.Y.: CoqQFBV: a
scalable certified SMT quantifier-free bit-vector solver. In: Leino, R., Silva, A.
(eds.) International Conference on Computer Aided Verification. Springer, Lecture
Notes in Computer Science (2021)
34. Sprenger, C.: A verified model checker for the modal µ-calculus in Coq. In: Steffen,
B. (ed.) Tools and Algorithms for Construction and Analysis of Systems, 4th Inter-
national Conference, TACAS ’98, Held as Part of the European Joint Conferences
on the Theory and Practice of Software, ETAPS’98, Lisbon, Portugal, March 28
- April 4, 1998, Proceedings. Lecture Notes in Computer Science, vol. 1384, pp.
167–183. Springer (1998). https://doi.org/10.1007/BFb0054171
35. The Bitcoin Developers: Bitcoin source code (2021). https://github.com/bitcoin/
bitcoin
36. Tsai, M.H., Fu, Y.F., Shi, X., Liu, J., Wang, B.Y., Yang, B.Y.: Automatic certified
verification of cryptographic programs with COQCRYPTOLINE. IACR Cryptol.
ePrint Arch. p. 1116 (2022)
240 M.-H. Tsai et al.
37. Tsai, M.H., Wang, B.Y., Yang, B.Y.: Certified verification of algebraic properties
on low-level mathematical constructs in cryptographic programs. In: Evans, D.,
Malkin, T., Xu, D. (eds.) ACM SIGSAC Conference on Computer and Communi-
cations Security, pp. 1973–1987. ACM (2017)
38. Wimmer, S.: Munta: A verified model checker for timed automata. In: André,
É., Stoelinga, M. (eds.) Formal Modeling and Analysis of Timed Systems - 17th
International Conference, FORMATS 2019, Amsterdam, The Netherlands, August
27-29, 2019, Proceedings. Lecture Notes in Computer Science, vol. 11750, pp. 236–
243. Springer (2019). https://doi.org/10.1007/978-3-030-29662-9_14
39. Wimmer, S., Lammich, P.: Verified model checking of timed automata. In: Beyer,
D., Huisman, M. (eds.) Tools and Algorithms for the Construction and Analysis
of Systems - 24th International Conference, TACAS 2018, Held as Part of the
European Joint Conferences on Theory and Practice of Software, ETAPS 2018,
Thessaloniki, Greece, April 14-20, 2018, Proceedings, Part I. Lecture Notes in
Computer Science, vol. 10805, pp. 61–78. Springer (2018). https://doi.org/10.1007/
978-3-319-89960-2_4
40. Zinzindohoué, J.K., Bhargavan, K., Protzenko, J., Beurdouche, B.: HACL*: A
verified modern cryptographic library. In: ACM SIGSAC Conference on Computer
and Communications Security, pp. 1789–1806. ACM (2017)
Open Access This chapter is licensed under the terms of the Creative Commons
Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/),
which permits use, sharing, adaptation, distribution and reproduction in any medium
or format, as long as you give appropriate credit to the original author(s) and the
source, provide a link to the Creative Commons license and indicate if changes were
made.
The images or other third party material in this chapter are included in the
chapter’s Creative Commons license, unless indicated otherwise in a credit line to the
material. If material is not included in the chapter’s Creative Commons license and
your intended use is not permitted by statutory regulation or exceeds the permitted
use, you will need to obtain permission directly from the copyright holder.
Incremental Dead State Detection
in Logarithmic Time
1 Introduction
Classifying states in a transition system as live or dead is a recurring problem in
formal verification. For example, given an expression, can it be simplified to the
identity? Given an input to a nondeterministic program, can it reach a terminal
state, or can it reach an infinitely looping state? Given a state in an automaton,
can it reach an accepting state? State classification is relevant to satisfiability
modulo theories (SMT) solvers [8,9,24,51], where theory-specific partial decision
procedures often work by exploring the state space to find a reachable path that
¹ The specific setting is regexes with intersection and complement (extended [31, 44] or generalized [26] regexes), which are found natively in security applications [6, 61]. Other solvers have also leveraged derivatives [45] and laziness in general [36].
Fig. 1. GID consisting of the sequence of updates E(1, 2), E(1, 3), T(2). Terminal states are drawn as double circles. After the update T(2), states 1 and 2 are known to be live. State 3 is not dead in this GID, as a future update may cause it to be live.
Fig. 2. GID extending Fig. 1 with additional updates E(4, 3), E(4, 5), C(4), C(5). Closed states are drawn as solid circles. After the update C(5) (but not earlier), state 5 is dead. State 4 is not dead because it can still reach state 3.
terminal, and we say that a state is live if it can reach a terminal state and
dead if it will never reach a terminal state in any extension – i.e. if all reachable
states from it are closed (see Figs. 1 and 2). To our knowledge, the problem of
detecting dead states in such a system has not been studied by existing work in
graph algorithms. Our problem can be solved by reduction to SCC maintenance, but not necessarily the other way around (Sect. 2, Proposition 1). We provide
two new algorithms for dead-state detection in GIDs.
First, we show that the dead-state detection problem for GIDs can be solved
in time O(m · log m) for m edge additions, within a logarithmic factor of the
O(m) cost for offline search. The worst-case performance of our algorithm thus
strictly improves on the O(m3/2 ) upper bound for SCC maintenance in gen-
eral incremental graphs. Our algorithm is technically sophisticated, and utilizes
several data structures and existing results in online algorithms: in particular,
Union-Find [63] and Henzinger and King’s Euler Tour Trees [35]. The main idea
is that, rather than explicitly computing the set of SCCs, for closed states we
maintain a single path to a non-closed (open) state. This turns out to reduce the
problem to quickly determining whether two states are currently assigned a path
to the same open state. On the other hand, Euler Tour Trees can solve undirected
reachability for graphs that are forests in logarithmic time.2 The challenge then
lies in figuring out how to reduce directed connectivity in the graph of paths to
an undirected forest connectivity problem. At the same time, we must maintain
this reduction under Union-Find state merges, in order to deal with cycles that
are found in the graph along the way.
While as theorists we would like to believe that asymptotic complexity is
enough, the truth is that the use of complex data structures (1) can be pro-
hibitively expensive in practice due to constant-factor overheads, and (2) can
make algorithms substantially more difficult to implement, leading practition-
ers to prefer simpler approaches. To address these needs, in addition to the logarithmic-time algorithm, we provide a second, lazy algorithm which avoids the use of Euler Tour Trees and relies only on union-find. This algorithm is based on adding shortcut jump edges for long paths in the graph to quickly determine reachability. It is designed to perform well in practice on typical graphs, and we evaluate it alongside the logarithmic-time algorithm, though we do not prove its asymptotic complexity.
Finally, we implement and empirically evaluate both of our algorithms for
GIDs against several baselines in 5.5k lines of code in Rust [47]. Our evaluation
focuses on the performance of the GID data structure itself, rather than its end-
to-end performance in applications. To ensure an apples-to-apples comparison
with existing approaches, we put particular focus on providing a directed graph
data structure backend shared by all algorithms, so that the cost of graph search
as well as state and edge merges is identical across algorithms. We implement
two naïve baselines, as well as an implementation of the state-of-the-art solution
² Reachability in dynamic forests can also be solved by Sleator-Tarjan trees [59], Frederickson's Topology Trees [30], or Top Trees [3]. Of these, we found Euler Tour Trees the easiest to work with in our implementation. See also [64].
The GID is valid if the closed labels are correct: there are no instances of E(u, v) or T(u) after an update C(u). The denotation of G is the directed graph (V, E) where V is the set of all states u which have occurred in any update in the sequence, and E is the set of all (u, v) such that E(u, v) occurs in G. An extension of a valid GID G is a valid GID G′ such that G is a prefix of G′.
³ https://github.com/cdstanford/gid.
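To make the update language concrete, the following is a minimal sketch in Rust (the implementation language reported in Sect. 1); the type and function names are ours and do not mirror the gid repository.

```rust
use std::collections::HashSet;

/// A minimal sketch of GID updates, assuming states are u32 identifiers.
#[derive(Clone, Copy, Debug)]
enum Update {
    Edge(u32, u32), // E(u, v): add a directed edge from u to v
    Terminal(u32),  // T(u): mark u as terminal
    Closed(u32),    // C(u): close u; no later E(u, _) or T(u) is allowed
}

/// Validity check: no E(u, v) or T(u) may occur after C(u).
fn is_valid(updates: &[Update]) -> bool {
    let mut closed: HashSet<u32> = HashSet::new();
    for &u in updates {
        match u {
            Update::Closed(s) => {
                closed.insert(s);
            }
            Update::Edge(s, _) | Update::Terminal(s) => {
                if closed.contains(&s) {
                    return false;
                }
            }
        }
    }
    true
}
```

On the sequence of Fig. 1, E(1, 2), E(1, 3), T(2), is_valid returns true; appending C(1) and then E(1, 4) would make it false.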
Fig. 3. Top: Basic classification of GID states into four disjoint categories. Bottom:
Additional terminology used in this paper.
Despite this reduction one way, there is no obvious reduction the other way –
from cycle detection or SCCs to Definition 2. This is because, while the existence
of a cycle of non-live states implies bi-reachability between all states in the cycle,
it does not necessarily imply that all of the bi-reachable states are dead.
3 Algorithms
This section presents Algorithm 2, which solves the state classification problem
in logarithmic time (Theorem 3); and Algorithm 3, an alternative lazy solution.
Both algorithms are optimized versions of Algorithm 1, a first-cut algorithm
which establishes the structure of our approach. We begin by establishing some
basic terminology shared by all of the algorithms (see Fig. 3).
States in a GID can be usefully classified as exactly one of four statuses:
live, dead, unknown, or open, where unknown means “closed but not yet live or
dead”, and open means “not closed and not live”. Note that a state may be live
and neither open nor closed; this terminology keeps the classification disjoint.
Pragmatically, for live states it does not matter if they are classified as open or
closed, since edges from those states no longer have any effect. However, all dead
and unknown states are closed, and no states are both open and closed.
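Recorded as a type, the classification might look as follows (a sketch; names ours):

```rust
/// The four disjoint statuses of Fig. 3. A live state is classified as
/// Live even if it is also closed, which keeps the categories disjoint.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum Status {
    Live,    // can reach a terminal state
    Dead,    // closed, and every state reachable from it is closed
    Unknown, // closed, but not yet determined live or dead
    Open,    // not closed and not live
}
```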
Given this classification, the intuition is that for each unknown state u, we
only need one path from u to an open state to prove that it is not dead; we want
to maintain one such path for all unknown states. To maintain all of these paths
⁴ To be precise, "maintains" means that (i) we can check whether two states are in the same SCC in O(1) time; and (ii) we can iterate over all the states, edges from, or edges into an SCC in O(1) time per state or edge.
Unfortunately, this idea does not work straightforwardly – once again because
of the presence of cycles in the original graph. We cannot simply store the forest
as a condensed graph with edges on condensed states. As we saw in Algorithm
1, it was important to store successor edges as edges into V, rather than edges
into X – this is the only way that we can merge states in O(1), without actually
inspecting the edge lists. If we needed to update the forest edges to be in X, this
could require O(m) work to merge two O(m)-sized edge lists as each edge might
need to be relabeled in the EF graph.
To solve this challenge, we instead store the EF data structure on the original
states, rather than the condensed graph; but we ensure that each canonical state
is represented by a tree of original states. When adding edges between canonical
states, we need to make sure to remember the original label (u, v), so that we can
later remove it using the original labels (this happens when its target becomes
dead). When an edge would create a cycle, we instead simply ignore it in the EF
graph, because a line of connected trees forms a tree.
Summary and Invariants. In summary, the algorithm reuses the data, proce-
dures, and invariants from Algorithm 1, with the following important changes:
(1) We maintain the EF data structure EF, a forest on V. (2) The successor edges
are stored as their original edge labels (u, v), rather than just as a target state.
(3) The procedure OnClosed is rewritten to maintain the graph EF. (4) The
successor edges and no cycles invariants use the new succ representation: that
is, they are constraints on the edges (x, UF.find(v)), where succ(x) = (u, v).
(5) We add the following two constraints on edges in EF, depending on whether
those states are equivalent in the union-find structure.
– EF inter-edges: For all inequivalent u, v, (u, v) is in the EF if and only if
(u, v) = succ(UF.find(u)) or (v, u) = succ(UF.find(v)).
– EF intra-edges: For all unknown canonical states x, the set of edges (u, v) in
the EF between states belonging to x forms a tree.
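The following sketch illustrates how changes (1), (2), and the cycle case interact. UnionFind and EulerForest are hypothetical stand-ins (stubbed here) for union-find and Euler Tour Trees, and all names are ours rather than the implementation's.

```rust
use std::collections::HashMap;

// Hypothetical stand-ins; real structures would implement these operations.
struct UnionFind;
impl UnionFind {
    fn find(&self, u: u32) -> u32 { u } // stub: canonical representative of u
}
struct EulerForest;
impl EulerForest {
    fn link(&mut self, _u: u32, _v: u32) {}                 // stub
    fn connected(&self, _u: u32, _v: u32) -> bool { false } // stub
}

struct Solver {
    uf: UnionFind,
    // Change (2): successor edges keep their ORIGINAL labels (u, v), so
    // merging canonical states never requires rewriting edge lists.
    succ: HashMap<u32, (u32, u32)>, // canonical state x -> succ(x)
    ef: EulerForest,                // change (1): a forest on original states
}

impl Solver {
    /// Install succ(x) = (u, v) and mirror it in EF (inter-edge invariant).
    fn set_succ(&mut self, x: u32, u: u32, v: u32) {
        debug_assert_eq!(self.uf.find(x), x);
        // If the edge would close a cycle of trees, ignore it in EF:
        // a line of connected trees is itself a tree.
        if !self.ef.connected(u, v) {
            self.ef.link(u, v);
        }
        self.succ.insert(x, (u, v));
    }
}
```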
Proof. Observe that the EF inter-edges constraint implies that EF only contains
edges between unknown and open states, together with isolated trees. In the
modified OnTerminal procedure, when marking states as live we remove inter-
edges, so we preserve this invariant.
Next we argue that given the invariants about EF, for an open state y the
CheckCycle procedure returns true if and only if (y, z) would create a directed
cycle. If there is a cycle of canonical states, then because canonical states are
connected trees in EF, the cycle can be lifted to a cycle on original states, so y and
z must already be connected in this cycle without the edge (y, z). Conversely, if
y and z are connected in EF, then there is a path from y to z, and this can be
projected to a path on canonical states. However, because y is open, it is a root
in the successor forest, so any path from y along successor edges travels only
on backward-edges; hence z is an ancestor of y in the directed graph, and thus
(y, z) creates a directed cycle.
This leaves the OnClosed procedure. Other than the EF lines, the structure
is the same as in Algorithm 1, so the previous invariants are still preserved,
and it remains to check the EF invariants. When we delete the successor edge
and temporarily mark status(x) = Open for recursive calls, we also remove it
from EF, preserving the inter-edge invariant. Similarly, when we add a successor
edge to x, we add it to EF, preserving the inter-edge invariant. So it remains to
consider when the set of canonical states changes, which is when merging states
in a cycle. Here, a line of canonical states is merged into a single state, and a
line of connected trees is still a tree, so the intra-edge invariant still holds for
the new canonical state, and we are done.
Each operation either takes constant time, α(m) = o(log m) time for the UF calls, or O(log m) time for the EF calls; in total the algorithm runs in O(m log m) time, i.e., amortized O(log m) time per edge.
A natural first idea is to store, for each state, a pointer to its root, found by repeatedly calling succ. But there are two issues with this. First, maintaining this pointer may be difficult (when the root changes, a linear number of root pointers may need updating). Second, the root may be marked dead, in which case we have to re-compute all pointers to that root.
Instead, we introduce a jump list from each state: intuitively, it will contain
states after calling successor once, twice, four times, eight times, and so on at
powers of two; and it will be updated lazily, at most once for every visit to
the state. When a jump becomes obsolete (the target dead), we just pop off
the largest jump, so we do not lose all of our work in building the list. We
maintain the following additional information: for each unknown canonical state
x, a nonempty list of jumps [v0 , v1 , v2 , . . . , vk ], such that v0 is reachable from x,
v1 is reachable from v0 , v2 is reachable from v1 , and so on, and v0 = succ(x).
The resulting algorithm is shown in Algorithm 3. The key procedure is GetRoot(z), which is called when adding a reserve edge (y, z) to the graph. In
addition to all invariants from Algorithm 1, we maintain the following invari-
ants for every unknown canonical state x, where jumps(x) is a list of states
v0 , v1 , v2 , . . . , vk . First jump: if the jump list is nonempty, then v0 = succ(x).
Reachability: vi+1 is reachable from vi for all i. The jump list also satisfies the
following powers of two invariant: on the path of canonical states from v0 to vi ,
the total number of states (including all states in each equivalence class) is at
least 2i . While this invariant is not necessary for correctness, it is the key to the
algorithm’s practical efficiency: it follows from this that if the jump list is fully
saturated for every state, querying GetRoot(z) will take only logarithmic time.
However, since jump lists are updated lazily, the jump list may not be saturated,
so this does not establish a true asymptotic complexity for the algorithm.
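A sketch of the jump-list bookkeeping for a single unknown canonical state follows; is_dead and the decision of when to extend are supplied by the surrounding algorithm, and all names are ours.

```rust
/// Jump list [v0, v1, ..., vk] for one unknown canonical state x:
/// v0 = succ(x), each v(i+1) is reachable from v(i), and the path
/// summarized by v0..vi covers at least 2^i states (powers-of-two invariant).
struct JumpList {
    jumps: Vec<u32>,
}

impl JumpList {
    /// When the largest jump's target is dead, pop only that entry,
    /// so the earlier work invested in the list is not lost.
    fn pop_dead(&mut self, is_dead: impl Fn(u32) -> bool) {
        while matches!(self.jumps.last(), Some(&v) if is_dead(v)) {
            self.jumps.pop();
        }
    }

    /// Lazily append w, but only when the path summarized so far is long
    /// enough to preserve the powers-of-two invariant.
    fn maybe_push(&mut self, w: u32, states_on_path: usize) {
        if states_on_path >= (1 << self.jumps.len()) {
            self.jumps.push(w);
        }
    }
}
```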
Proof. The first jump and reachability invariants imply that v1 , v2 , . . . is some
sublist of the states along the path from an unknown state to its root, potentially
followed by some dead states. We need to argue that the subprocedure GetRoot
(i) reaches the same verdict as repeatedly calling succ to find a cycle in the first-cut algorithm and (ii) preserves both invariants. For first jump, if the jump list is
empty, then GetRoot ensures that the first jump is set to the successor state.
For reachability, popping dead states from the jump list clearly preserves the invariant, as does adding on a state along the path to the root, which is done only when enough states have been passed to preserve the powers-of-two invariant. Merging states preserves both invariants trivially because we throw
the jump list away, and marking states live preserves both invariants trivially
since the jump list is only maintained and used for unknown states.
4 Experimental Evaluation
The primary goal of our evaluation has been to experimentally validate the
performance of GIDs as a data structure in isolation, rather than their use in a
particular application. Our evaluation seeks to answer the following questions:
To answer Q2, first, we compiled a range of basic graph classes which are
designed to expose edge case behavior in the algorithms, as well as randomly
generated graphs. We focus on graphs with no live states, as live states are
treated similarly by all algorithms. Most of the generated graphs come in 2×2 =
4 variants: (i) the states are either read in a forwards- or backwards- order; and
(ii) they are either dead graphs, where there are no open states at the end and so
everything gets marked dead; or unknown graphs, where there is a single open
state at the end, so most states are unknown. In the unknown case, it is sufficient
to have one open state at the end, as many open states can be reduced to the
case of a single open state where all edges point to that one. We include GIDs
from line graphs and cycle graphs (up to 100K states in multiples of 3); complete
and complete acyclic graphs (up to 1K states); and bipartite graphs (up to 1K
states). These are important cases, for example, because the reverse-order line
and cycle graphs are a potential worst case for Simple and BFGT.
Second, to exhibit more dynamic behavior, we generated random graphs:
sparse graphs with a fixed out-degree from each state, chosen from 1, 2, 3, or
10 (up to 100K states); and dense graphs with a fixed probability of each edge,
chosen from .01, .02, or .03 (up to 10K states). Each case uses 10 different random
seeds. As with the basic graphs, states are read in some order and marked closed.
⁶ That is, BFGT for SCC maintenance. BFGT for cycle detection has been implemented before, for instance, in [28] and formally verified in [32].
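As an illustration, a sparse instance of this kind can be generated roughly as follows, reusing the Update enum sketched in Sect. 2; this is our sketch, not the artifact's actual generator, and it assumes the rand crate.

```rust
use rand::Rng; // rand = "0.8" (assumed dependency)

/// Sketch: a random sparse GID with fixed out-degree `d` and no terminal
/// states; closing every state at the end gives the "dead graph" variant.
fn random_sparse_gid(n: u32, d: usize, rng: &mut impl Rng) -> Vec<Update> {
    let mut updates = Vec::new();
    for u in 0..n {
        for _ in 0..d {
            updates.push(Update::Edge(u, rng.gen_range(0..n)));
        }
    }
    for u in 0..n {
        updates.push(Update::Closed(u)); // read states in some order, close each
    }
    updates
}
```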
Fig. 4. Left: Lines of code for each algorithm and other implementation components.
Right: Benchmark GIDs used in our evaluation. Where present, the source column
indicates the quantity prior to filtering out trivially small graphs.
⁷ https://github.com/cdstanford/regex-smt-benchmarks.
Fig. 5. Evaluation results. Left: Cumulative plots showing the number of benchmarks solved in time t or less for basic GID classes (top), randomly generated GIDs (middle), and regex-derived GIDs (bottom). Top right: Scatter plot showing the size of each benchmark vs. time to solve. Bottom right: Average time to solve benchmarks of size closest to s, where values of s are chosen in increments of 1/3 on a log scale.
In this section, we explain precisely how the GID state classification problem
arises in the context of derivative-based solvers [45,61]. We first define extended
regexes [31] (regexes extended with intersection & and complement ~) modulo a
symbolic alphabet A of predicates that represent sets of characters. We explain
the main idea behind symbolic derivatives, as found in [61]; these generalize Brzo-
zowski [18] and Antimirov derivatives [5] (see also [19,42] for other proposals).
Symbolic derivatives provide the foundation for incrementally creating a GID.
Then we show, through an example, how a solver can incrementally expand
derivatives to reduce the satisfiability problem to the GID state classification
problem (Definition 2).
Define a regex by the following grammar, where ϕ ∈ A denotes a predicate:
RE ::= ϕ | ε | RE1 · RE2 | RE* | RE1 | RE2 | RE1 & RE2 | ~RE
"is digit" predicate that is true of characters that are digits (often denoted \d). The solver manipulates regex membership constraints on strings by unfolding them [61]. The constraint s ∈ R, that essentially tests nonemptiness of R with s as a witness, becomes a constraint over the symbolic derivative of R, where s ≠ ε since R is not nullable, si.. is the suffix of s from index i, and

  δ(R) = δ(L) & δ(α)
       = (α ? L & ~(100) : L) & α
       = (α ? L & ~(100) & α : L & α)
Let R1 = L & ~(100) & α and R2 = L & α. So R has two outgoing transitions R −α→ R1 and R −¬α→ R2 that contribute the edges (R, R1) and (R, R2) into the GID. Note that these edges depend only on R and not on s0.
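For intuition, the classical per-character (Brzozowski-style) derivative of extended regexes can be sketched as follows; the paper's solver instead computes symbolic derivatives, transition terms over predicates [61], but the recursion has the same shape. Names are ours, and ∅ (Empty) is added as the everywhere-false case.

```rust
/// Extended regexes with intersection and complement (sketch).
#[derive(Clone)]
enum Re {
    Pred(fn(char) -> bool), // a predicate ϕ, e.g. |c: char| c.is_ascii_digit()
    Eps,                    // ε
    Cat(Box<Re>, Box<Re>),  // RE1 · RE2
    Star(Box<Re>),          // RE*
    Or(Box<Re>, Box<Re>),   // RE1 | RE2
    And(Box<Re>, Box<Re>),  // RE1 & RE2
    Not(Box<Re>),           // ~RE
    Empty,                  // ∅, the everywhere-false case
}

/// Does r accept the empty string?
fn nullable(r: &Re) -> bool {
    match r {
        Re::Pred(_) | Re::Empty => false,
        Re::Eps | Re::Star(_) => true,
        Re::Cat(a, b) | Re::And(a, b) => nullable(a) && nullable(b),
        Re::Or(a, b) => nullable(a) || nullable(b),
        Re::Not(a) => !nullable(a),
    }
}

/// Derivative of r with respect to character c: for nonempty s with
/// s[0] = c, we have s ∈ r iff s[1..] ∈ deriv(c, r).
fn deriv(c: char, r: &Re) -> Re {
    match r {
        Re::Pred(p) => if p(c) { Re::Eps } else { Re::Empty },
        Re::Eps | Re::Empty => Re::Empty,
        Re::Cat(a, b) => {
            let da_b = Re::Cat(Box::new(deriv(c, a)), b.clone());
            if nullable(a) {
                Re::Or(Box::new(da_b), Box::new(deriv(c, b)))
            } else {
                da_b
            }
        }
        Re::Star(a) => Re::Cat(Box::new(deriv(c, a)), Box::new(Re::Star(a.clone()))),
        Re::Or(a, b) => Re::Or(Box::new(deriv(c, a)), Box::new(deriv(c, b))),
        Re::And(a, b) => Re::And(Box::new(deriv(c, a)), Box::new(deriv(c, b))),
        Re::Not(a) => Re::Not(Box::new(deriv(c, a))),
    }
}
```

Unfolding s ∈ R then amounts to checking nullable(R) once s is exhausted and otherwise recursing on deriv with the suffix of s, which is exactly the step that contributes edges such as (R, R1) and (R, R2) to the GID.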
We continue the search incrementally by checking the two branches of the if-then-else constraint, where R1 and R2 are again not nullable (so s1.. ≠ ε):
6 Related Work
Online Graph Algorithms. Online graph algorithms are typically divided into
problems over incremental graphs (where edges are added), decremental graphs
(where edges are deleted), and dynamic graphs (where edges are both added
and deleted), with core data structures discussed in [27,49]. Important prob-
lems include transitive closure, cycle detection, topological ordering, and strongly
connected component (SCC) maintenance.
For incremental topological ordering, [46] is an early work, and [33] presents
two different algorithms, one for sparse graphs and one for dense graphs – the
algorithms are also extended to work with SCCs. The sparse algorithm was sub-
sequently simplified in [10] and is the basis of our implementation named BFGT
in Sect. 4. A unified approach of several algorithms based on [10] is presented
in [21] that uses a notion of weak topological order and a labeling technique that
estimates transitive closure size. Further extensions of [10] are studied in [11,14]
based on randomization.
For dynamic directed graphs, a topological sorting algorithm that is experi-
mentally preferable for sparse graphs is discussed in [56], and a related article [55]
discusses strongly connected components maintenance. Transitive closure for
dynamic graphs is studied in [57], improving upon some algorithms presented
earlier in [34]. One major application for these algorithms is in pointer analy-
sis [54].
For undirected forests, fully dynamic reachability is solvable in amortized
logarithmic time per edge via multiple possible approaches [3,30,35,59,64]; our
implementation uses Euler Tour Trees [35].
Data Structures for SMT. UnionFind [63] is a foundational data structure
used in SMT. E-graphs [23,67] are used to ensure functional extensionality, where
two expressions are equivalent if their subexpressions are equivalent [25,52]. In
both UnionFind and E-graphs, the maintained relation is an equivalence rela-
tion. In contrast, maintaining live and dead states involves tracking reachability
rather than equivalence. To the best of our knowledge, the specific formulation
of incremental reachability we consider here is new.
Dead State Elimination in Automata. A DFA or NFA may be viewed as a
GID, so state classification in GIDs solves dead state elimination in DFAs and
NFAs, while additionally working in an incremental fashion. Dead state elimi-
nation is also known as trimming [37] and plays an important role in automata
minimization [12,38,48]. The literature on minimization is vast, and goes back
to the 1950s [16,17,39–41,50,53]; see [65] for a taxonomy, [2] for an experimen-
tal comparison, and [22] for the symbolic case. Watson et. al. [66] propose an
incremental minimization algorithm, in the sense that it can be halted at any
point to produce a partially minimized, equivalent DFA; unlike in our setting,
the DFA’s states and transitions are fixed and read in a predetermined order.
References
1. Abboud, A., Williams, V.V.: Popular conjectures imply strong lower bounds for
dynamic problems. In: 2014 IEEE 55th Annual Symposium on Foundations of
Computer Science, pp. 434–443. IEEE (2014)
2. Almeida, M., Moreira, N., Reis, R.: On the performance of automata minimization
algorithms. Tech. Rep. DCC-2007-03, University of Porto (2007)
3. Alstrup, S., Holm, J., Lichtenberg, K.D., Thorup, M.: Maintaining information in fully dynamic trees with top trees. ACM Trans. Algorithms (TALG) 1(2), 243–264 (2005)
4. Amadini, R.: A survey on string constraint solving. ACM Comput. Surv. (CSUR)
55(1), 1–38 (2021)
5. Antimirov, V.: Partial derivatives of regular expressions and finite automata con-
structions. Theoret. Comput. Sci. 155, 291–319 (1995)
6. Backes, J., et al.: Semantic-based automated reasoning for AWS access policies
using SMT. In: 2018 Formal Methods in Computer Aided Design, FMCAD 2018,
Austin, TX, USA, 30 October - 2 November 2018, pp. 1–9. IEEE (2018). https://
doi.org/10.23919/FMCAD.2018.8602994
7. Bakaric, R.: Euler tour tree representation (GitHub repository) (2019). https://
github.com/RobertBakaric/EulerTour
8. Barbosa, H., et al.: cvc5: A Versatile and Industrial-Strength SMT Solver. In:
Fisman, D., Rosu, G. (eds.) TACAS 2022. LNCS, vol. 13243, pp. 415–442. Springer,
Cham (2022). https://doi.org/10.1007/978-3-030-99524-9 24
9. Barrett, C., et al.: CVC4. In: Gopalakrishnan, G., Qadeer, S. (eds.) CAV 2011.
LNCS, vol. 6806, pp. 171–177. Springer, Heidelberg (2011). https://doi.org/10.
1007/978-3-642-22110-1 14
10. Bender, M.A., Fineman, J.T., Gilbert, S., Tarjan, R.E.: A new approach to incre-
mental cycle detection and related problems. ACM Trans. Algorithms 12(2), 14:1–
14:22 (2015). https://doi.org/10.1145/2756553, https://arxiv.org/abs/1112.0784
11. Bernstein, A., Chechik, S.: Incremental topological sort and cycle detection in
O(m√n) expected total time. In: Proceedings of the 29th Annual ACM-SIAM
Symposium on Discrete Algorithms, SODA 2018, pp. 21–34. Society for Industrial
and Applied Mathematics (2018)
12. Berstel, J., Boasson, L., Carton, O., Fagnot, I.: Minimization of automata. Hand-
book of Automata (2011)
13. Berzish, M., et al.: An SMT solver for regular expressions and linear arithmetic over
string length. In: Silva, A., Leino, K.R.M. (eds.) CAV 2021. LNCS, vol. 12760, pp.
289–312. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-81688-9 14
14. Bhattacharya, S., Kulkarni, J.: An improved algorithm for incremental cycle detec-
tion and topological ordering in sparse graphs. In: Proceedings of the Fourteenth
Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 2509–2521. SIAM
(2020)
15. Bjørner, N., Ganesh, V., Michel, R., Veanes, M.: An SMT-LIB format for sequences
and regular expressions. In: SMT workshop, pp. 76–86 (2012), RegExLib bench-
marks can be found at https://github.com/cdstanford/regex-smt-benchmarks/,
originally downloaded from https://www.microsoft.com/en-us/research/wp-
content/uploads/2016/02/nbjorner-microsoft.automata.smtbenchmarks.zip
16. Blum, N.: An O(n log n) implementation of the standard method for minimizing
n-state finite automata. Inf. Process. Lett. 57, 65–69 (1996)
17. Brzozowski, J.A.: Canonical regular expressions and minimal state graphs for def-
inite events. In: Proceedings of the Symposium on Mathematical Theory of Automata, New York, pp. 529–561 (1963)
18. Brzozowski, J.A.: Derivatives of regular expressions. J. ACM (JACM) 11(4), 481–
494 (1964)
19. Caron, P., Champarnaud, J.-M., Mignot, L.: Partial derivatives of an extended
regular expression. In: Dediu, A.-H., Inenaga, S., Martı́n-Vide, C. (eds.) LATA
2011. LNCS, vol. 6638, pp. 179–191. Springer, Heidelberg (2011). https://doi.org/
10.1007/978-3-642-21254-3 13
20. Clarke, E., Grumberg, O., Hamaguchi, K.: Another look at LTL model checking.
In: Dill, D.L. (ed.) CAV 1994. LNCS, vol. 818, pp. 415–427. Springer, Heidelberg
(1994). https://doi.org/10.1007/3-540-58179-0 72
21. Cohen, E., Fiat, A., Kaplan, H., Roditty, L.: A Labeling Approach to Incremental
Cycle Detection. arXiv preprint arXiv:1310.8381 (Oct 2013). https://arxiv.org/
abs/1310.8381
22. D’Antoni, L., Veanes, M.: Minimization of symbolic automata. In: ACM SIGPLAN
Notices - POPL 2014, vol. 49(1), pp. 541–553 (2014). https://doi.org/10.1145/
2535838.2535849
23. de Moura, L., Bjørner, N.: Efficient E-matching for SMT solvers. In: Pfenning,
F. (ed.) CADE 2007. LNCS (LNAI), vol. 4603, pp. 183–198. Springer, Heidelberg
(2007). https://doi.org/10.1007/978-3-540-73595-3 13
24. De Moura, L., Bjørner, N.: Satisfiability modulo theories: introduction and appli-
cations. Commun. ACM 54(9), 69–77 (2011)
25. Downey, P.J., Sethi, R., Tarjan, R.E.: Variations on the common subexpression
problem. J. ACM (JACM) 27(4), 758–771 (1980)
26. Ellul, K., Krawetz, B., Shallit, J., Wang, M.W.: Regular expressions: New results
and open problems. J. Autom. Lang. Comb. 10(4), 407–437 (2005)
27. Eppstein, D., Galil, Z., Italiano, G.F.: Dynamic graph algorithms. Algorithms The-
ory Comput. Handbook 1, 1–9 (1999)
28. Fairbanks, J., Besançon, M., Simon, S., Hoffiman, J., Eubank, N., Karpinski, S.:
An optimized graphs package for the Julia programming language (2021). https://
github.com/JuliaGraphs/Graphs.jl/, commit 075a01eb6a
29. Fan, W., Hu, C., Tian, C.: Incremental graph computations: Doable and undoable.
In: Proceedings of the 2017 ACM International Conference on Management of
Data, pp. 155–169 (2017)
30. Frederickson, G.N.: A data structure for dynamically maintaining rooted trees. J.
Algorithms 24(1), 37–65 (1997). https://arxiv.org/pdf/cs/0310065.pdf
31. Gelade, W., Neven, F.: Succinctness of the complement and intersection of regular
expressions. arXiv preprint arXiv:0802.2869 (2008)
32. Guéneau, A., Jourdan, J.H., Charguéraud, A., Pottier, F.: Formal proof and anal-
ysis of an incremental cycle detection algorithm. In: Interactive Theorem Proving.
Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik (2019)
33. Haeupler, B., Kavitha, T., Mathew, R., Sen, S., Tarjan, R.E.: Incremental cycle
detection, topological ordering, and strong component maintenance. ACM Trans.
Algorithms 8(1.3), 1–33 (2012). https://doi.org/10.1145/2071379.2071382
34. Henzinger, M., King, V.: Fully dynamic biconnectivity and transitive closure. In:
Proceedings of the 36th Annual Symposium on Foundations of Computer Science,
pp. 664–672, Milwaukee, WI (1995)
35. Henzinger, M.R., King, V.: Randomized fully dynamic graph algorithms with poly-
logarithmic time per operation. J. ACM (JACM) 46(4), 502–516 (1999)
36. Hooimeijer, P., Weimer, W.: Solving string constraints lazily. In: Proceedings of
the IEEE/ACM International Conference on Automated Software Engineering, pp.
377–386 (2010)
37. Hopcroft, J.E., Ullman, J.D.: Introduction to Automata Theory, Languages, and
Computation. Addison Wesley (1979)
38. Hopcroft, J.: An n log n algorithm for minimizing states in a finite automaton. In:
Theory of Machines and Computations: Proceedings of an International Sympo-
sium on the Theory of Machines and Computations Held at Technion in Haifa, pp.
189–196. Academic Press, New York (1971)
39. Hopcroft, J.E., Ullman, J.D.: Formal languages and their relation to automata.
Addison-Wesley Longman Publishing Co., Inc., Boston (1969)
40. Huffman, D.: The synthesis of sequential switching circuits. J. Franklin Inst. 257(3–
4), 161–190, 275–303 (1954)
41. Kameda, T., Weiner, P.: On the state minimization of nondeterministic finite
automata. IEEE Trans. Comput. C-19(7), 617–627 (1970)
42. Keil, M., Thiemann, P.: Symbolic solving of extended regular expression inequali-
ties. In: FSTTCS 2014, pp. 175–186. LIPIcs (2014)
43. Kupferman, O., Vardi, M.Y.: Model checking of safety properties. Formal Methods
Syst. Design 19(3), 291–314 (2001)
44. Kupferman, O., Zuhovitzky, S.: An improved algorithm for the membership prob-
lem for extended regular expressions. In: Diks, K., Rytter, W. (eds.) MFCS 2002.
LNCS, vol. 2420, pp. 446–458. Springer, Heidelberg (2002). https://doi.org/10.
1007/3-540-45687-2 37
45. Liang, T., Tsiskaridze, N., Reynolds, A., Tinelli, C., Barrett, C.: A decision pro-
cedure for regular membership and length constraints over unbounded strings. In:
Lutz, C., Ranise, S. (eds.) FroCoS 2015. LNCS (LNAI), vol. 9322, pp. 135–150.
Springer, Cham (2015). https://doi.org/10.1007/978-3-319-24246-0 9
46. Marchetti-Spaccamela, A., Nanni, U., Rohnert, H.: Maintaining a topological order
under edge insertions. Inf. Process. Lett. 59(1), 53–58 (1996). https://doi.org/10.
1016/0020-0190(96)00075-0
47. Matsakis, N.D., Klock, F.S.: The Rust language. ACM SIGAda Ada Letters 34(3),
103–104 (2014). https://www.rust-lang.org/
48. Mayr, R., Clemente, L.: Advanced automata minimization. In: POPL 2013, pp.
63–74 (2013)
49. Mehlhorn, K.: Data Structures and Algorithms, Graph Algorithms and NP-Completeness, vol. 2. Springer (1984). https://doi.org/10.1007/978-3-642-69897-2
50. Moore, E.F.: Gedanken-experiments on sequential machines, pp. 129–153.
Automata studies, Annals of mathematics studies (1956)
51. de Moura, L., Bjørner, N.: Z3: An efficient SMT solver. In: Ramakrishnan, C.R.,
Rehof, J. (eds.) TACAS 2008. LNCS, vol. 4963, pp. 337–340. Springer, Heidelberg
(2008). https://doi.org/10.1007/978-3-540-78800-3 24
52. Nelson, G., Oppen, D.C.: Fast decision procedures based on congruence closure. J.
ACM (JACM) 27(2), 356–364 (1980)
53. Paige, R., Tarjan, R.E.: Three partition refinement algorithms. SIAM J. Comput.
16(6), 973–989 (1987)
54. Pearce, D.J.: Some directed graph algorithms and their application to pointer
analysis. Ph.D. thesis, Imperial College, London (2005)
55. Pearce, D.J., Kelly, P.H.J.: A dynamic algorithm for topologically sorting directed
acyclic graphs. In: Ribeiro, C.C., Martins, S.L. (eds.) WEA 2004. LNCS, vol.
3059, pp. 383–398. Springer, Heidelberg (2004). https://doi.org/10.1007/978-3-
540-24838-5 29
56. Pearce, D.J., Kelly, P.H.J.: A dynamic topological sort algorithm for directed
acyclic graphs. ACM J. Experimental Algorithmics 11(1.7), 1–24 (2006)
57. Roditty, L., Zwick, U.: Improved dynamic reachability algorithms for directed
graphs. SIAM J. Comput. 37(5), 1455–1471 (2008). https://doi.org/10.1137/
060650271
58. Rozier, K.Y., Vardi, M.Y.: LTL satisfiability checking. In: Bošnački, D., Edelkamp,
S. (eds.) SPIN 2007. LNCS, vol. 4595, pp. 149–167. Springer, Heidelberg (2007).
https://doi.org/10.1007/978-3-540-73370-6 11
59. Sleator, D.D., Tarjan, R.E.: A data structure for dynamic trees. J. Comput. Syst.
Sci. 26(3), 362–391 (1983)
60. Stanford, C., Veanes, M.: Incremental dead state detection in logarithmic time
(extended version for arxiv). arXiv preprint arXiv:2301.05308 (2023)
61. Stanford, C., Veanes, M., Bjørner, N.: Symbolic Boolean derivatives for efficiently
solving extended regular expression constraints. In: Proceedings of the 42nd ACM
SIGPLAN International Conference on Programming Language Design and Imple-
mentation (PLDI), pp. 620–635 (2021)
62. Stockmeyer, L.J., Meyer, A.R.: Word problems requiring exponential time (pre-
liminary report). In: Proceedings of the Fifth Annual ACM Symposium on Theory
of Computing, pp. 1–9 (1973)
63. Tarjan, R.E.: Efficiency of a good but not linear set union algorithm. JACM 22,
215–225 (1975)
64. Tarjan, R.E., Werneck, R.F.: Dynamic trees in practice. J. Exp. Algorithmics
(JEA) 14, 4–5 (2010)
65. Watson, B.W.: A taxonomy of finite automata minimization algorithms. Comput-
ing Science Report 93/44, Eindhoven University of Technology (January 1995)
66. Watson, B.W., Daciuk, J.: An efficient incremental DFA minimization algorithm.
Nat. Lang. Eng. 9(1), 49–64 (2003). https://doi.org/10.1017/S1351324903003127
67. Willsey, M., Nandi, C., Wang, Y.R., Flatt, O., Tatlock, Z., Panchekha, P.: egg: fast
and extensible equality saturation. In: Proceedings of the ACM on Programming
Languages 5(POPL), pp. 1–29 (2021)
Model Checking Race-Freedom When "Sequential Consistency for Data-Race-Free Programs" is Guaranteed
1 Introduction
prototypes have been developed (e.g., [10]). The most significant limitation is
imprecision: a tool may report that race-free code has a possible race, a "false alarm". Some static approaches are also not sound, i.e., they may fail to detect
a race in a racy program; like dynamic tools, these approaches are used more as
bug hunters than verifiers.
Finite-state model checking [15] offers an interesting compromise. This app-
roach requires a finite-state model of the program, which is usually achieved
by placing small bounds on the number of threads, the size of inputs, or other
program parameters. The reachable states of the model can be explored through
explicit enumeration or other means. This can be used to implement a sound and
precise race analysis of the model. If a race is found, detailed information can
be produced, such as a program trace highlighting the two conflicting memory
accesses. Of course, if the analysis concludes the model is race-free, it is still pos-
sible that a race exists for larger parameter values. In this case, one can increase
those values and re-run the analysis until time or computational resources are
exhausted. If one accepts the “small scope hypothesis”—the claim that most
defects manifest in small configurations of a system—then model checking can
at least provide strong evidence for the absence of data races. In any case, the
results provide specific information on the scope that is guaranteed to be race-
free, which can be used to guide testing or further analysis.
The main limitation of model checking is state explosion, and one of the
most effective techniques for limiting state explosion is partial order reduction
(POR) [17]. A typical POR technique is based on the following observation:
from a state s at which a thread t is at a “local” statement—i.e., one which
commutes with all statements from other threads—then it is often not necessary
to explore all enabled transitions from s; instead, the search can explore only
the enabled transitions from t. Usually local statements are those that access
only thread-local variables. But if the program is known to be race-free, shared
variable accesses can also be considered “local” for POR. This is the essential
observation at the heart of recent work on POR in the verification of Pthreads
programs [29].
In this paper, we explore a new model checking technique that can be used
to verify race-freedom, as well as other correctness properties, for programs in
which threads synchronize through locks and barriers. The approach requires
two simple modifications to the standard state reachability algorithm. First,
each thread maintains a history of the memory locations accessed since its last
synchronization operation. These sets are examined for races and emptied at
specific synchronization points. Second, a novel POR is used in which only lock
(release and acquire) operations are considered non-local. In Sect. 2, we present
a precise mathematical formulation of the technique and a theorem that it has
the claimed properties, including that it is sound and precise for verification of
race-freedom of finite-state models.
Using the CIVL symbolic execution and model checking platform [31], we
have implemented a prototype tool, based on the new technique, for verify-
ing race-freedom in C/OpenMP programs. OpenMP is an increasingly popular
2 Theory
All of the sets Locali and Stmti (i ∈ TID) are pairwise disjoint.
Each thread has a unique thread ID number, an element of TID. A local state
for thread i encodes the values of all thread-local variables, including the program
counter. A shared state encodes the values of all shared variables. (Locks are not
considered shared variables.) A thread at an acquire state σ is attempting to
acquire the lock lock(σ). At a release state, the thread is about to release a lock.
At a barrier state, a thread is waiting inside a barrier. After executing one of
the three operations, each thread moves to a unique next local state. A thread
that reaches a terminal state has terminated. From an nsync state, any positive
number of statements are enabled, and each of these statements may read and
update the local state of the thread and/or the shared state.
¹ Any OpenMP program that does not use non-sequentially-consistent atomic directives, omp_test_lock, or omp_test_nest_lock [26, §1.4.6].
For i ∈ TID, the local graph of thread i is the directed graph with nodes Locali and an edge σ → σ′ if either (i) σ ∈ Acquirei ∪ Releasei ∪ Barrieri and σ′ = next(σ), or (ii) σ ∈ Nsynci and there is some ζ′ ∈ Shared such that (σ′, ζ′) is in the image of update(σ).
Fix a multithreaded program P and let
A lock state specifies the owner of each lock. The owner is a thread ID, or 0 if the
lock is free. The elements of State are the (global) states of P . A state specifies
a local state for each thread, a shared state, a lock state, and the set of threads
that are currently blocked at a barrier.
Let i ∈ TID and Li = Locali × Shared × LockState × 2^TID. Define enabledi : Li → 2^Stmti by, for λ = (σ, ζ, θ, w) ∈ Li,

  enabledi(λ) = {acquirei(l)}  if σ ∈ Acquirei ∧ l = lock(σ) ∧ θ(l) = 0
                {releasei(l)}  if σ ∈ Releasei ∧ l = lock(σ) ∧ θ(l) = i
                {exiti}        if σ ∈ Barrieri ∧ i ∉ w
                stmts(σ)       if σ ∈ Nsynci
                ∅              otherwise.

This function returns the set of statements that are
enabled in thread i at a given state. This function does not depend on the local
states of threads other than i, which is why those are excluded from Li . An
acquire statement is enabled if the lock is free; a release is enabled if the calling
thread owns the lock. A barrier exit is enabled if the thread is not currently in
the barrier blocked set.
Execution of an enabled statement in thread i updates the state as follows:
Note that a thread arriving at a barrier will have its ID added to the barrier blocked
set, unless it is the last thread to arrive, in which case all threads are released
from the barrier.
At a given state, the set of enabled statements is the union over all threads
of the enabled statements in that thread. Execution of a statement updates the
state as above, leaving the local states of other threads untouched:
Note that an execution is completely determined by its initial state s0 and its
statement sequence t1 t2 · · · .
Having specified the semantics of the computational model, we now turn to
the concept of the data race. The traditional definition requires the notion of
“conflicting” accesses: two accesses to the same memory location conflict when
at least one is a write. The following abstracts this notion:
Definition 3. A symmetric binary relation conflict on Stmt is a conflict relation
for P if the following hold for all t1 , t2 ∈ Stmt:
1. if (t1 , t2 ) ∈ conflict then t1 and t2 are nsync statements from different threads
2. if t1 and t2 are nsync statements from different threads and (t1 , t2 ) ∉ conflict,
then for all s ∈ State, if t1 , t2 ∈ enabled(s) then
execute(execute(s, t1 ), t2 ) = execute(execute(s, t2 ), t1 ).
Two events “race” when they conflict but are not ordered by happens-before:
Definition 5. Let α be an execution and e, e′ ∈ [α]. Say e = (t, n) and e′ = (t′, n′). We say e and e′ race in α if (t, t′) ∈ conflict and neither (e, e′) nor (e′, e) is in HB(α). The data race relation of α is the symmetric binary relation on [α]

  DR(α) = {(e, e′) ∈ [α] × [α] | e and e′ race in α}.
Now we turn to the problem of detecting data races. Our approach is to
explore a modified state space. The usual state space is a directed graph with
node set State and transitions for edges. We make two modifications. First,
we add some “history” to the state. Specifically, each thread records the nsync
statements it has executed since its last lock event or barrier exit. This set is
checked against those of other threads for conflicts, just before it is emptied after
its next lock event or barrier exit. The second change is a reduction: any state
that has an enabled statement that is not a lock statement will have outgoing
edges from only one thread in the modified graph.
A well-known technical challenge with partial order reduction concerns cycles
in the reduced state space. We deal with this challenge by assuming that P comes
with some additional information. Specifically, for each i, we are given a set Ri ,
with Releasei ∪ Acquirei ⊆ Ri ⊆ Locali , satisfying: any cycle in the local graph
of thread i has at least one node in Ri . In general, the smaller Ri , the more
effective the reduction. In many application domains, there are no cycles in the
local graphs, so one can take Ri = Releasei ∪ Acquirei . For example, standard for
where l1 and l2 are distinct locks. Let Ri = Releasei ∪ Acquirei (i = 1, 2). One
path in the race-detecting state graph G executes as follows:
A data race occurs on this path since the two assignments conflict but are not
ordered by happens-before. The race is not detected, since at each lock operation,
the statement set in the other thread is empty. However, there is another path
between two threads. Any type of data that can answer this question would
work equally well. In our implementation, each thread instead records the set of
memory locations read, and the set of memory locations modified, since the last
synchronization. A conflict occurs if the read or write set of one thread intersects
the write set of another thread. As CIVL-C provides robust support for tracking
memory accesses, this approach is relatively straightforward to implement by a
program transformation.
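A sketch of that conflict check (ours, not CIVL-C's actual representation): a race is flagged when one thread's read or write set meets another's write set.

```rust
use std::collections::HashSet;

/// Memory locations accessed by a thread since its last synchronization.
struct AccessSets {
    reads: HashSet<u64>,
    writes: HashSet<u64>,
}

/// Conflict test: the read or write set of one thread intersects
/// the write set of the other.
fn conflicts(a: &AccessSets, b: &AccessSets) -> bool {
    !a.writes.is_disjoint(&b.writes)
        || !a.reads.is_disjoint(&b.writes)
        || !a.writes.is_disjoint(&b.reads)
}
```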
In Sect. 3.1, we summarize the basics of OpenMP. In Sect. 3.2, we provide the
necessary background on CIVL-C and the primitives used in the transformation.
In Sect. 3.3, we describe the transformation itself. In Sect. 3.4, we report the
results of experiments using this tool.
All software and other artifacts necessary to reproduce the experiments, as
well as the full results, are included in a VirtualBox virtual machine available at
https://doi.org/10.5281/zenodo.7978348.
The CIVL framework includes a front-end for preprocessing, parsing, and build-
ing an AST for a C program. It also provides an API for transforming the AST.
We used this API to build a tool which consumes a C/OpenMP program and pro-
duces a CIVL-C “model” of the program. The CIVL-C language includes most
of sequential C, including functions, recursion, pointers, structs, and dynami-
cally allocated memory. It adds nested function definitions and primitives for
concurrency and verification.
In CIVL-C, a thread is created by spawning a function: $spawn f(...);.
There is no special syntax for shared or thread-local variables; any variable that
added to the top entry on the write stack. Function $write_set_pop pops the
write stack, returning the top mem-set. The corresponding functions for the
read stack are $read_set_push and $read_set_pop. The library also provides
various operations on mem-sets, such as $mem_disjoint, which consumes two
mem-sets and returns true if the intersection of the two mem-sets is empty.
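Modeled outside CIVL-C, the stack discipline described above looks roughly like this; the struct and method names are ours, standing in for $write_set_push, $write_set_pop, and their read-set counterparts:

```rust
use std::collections::HashSet;

/// Sketch of a stack of mem-sets (one such stack for writes, one for reads).
struct MemSetStack {
    stack: Vec<HashSet<u64>>, // each entry collects memory locations
}

impl MemSetStack {
    fn push(&mut self) {
        self.stack.push(HashSet::new()); // like $write_set_push
    }
    fn record(&mut self, loc: u64) {
        if let Some(top) = self.stack.last_mut() {
            top.insert(loc); // accesses are added to the top entry
        }
    }
    fn pop(&mut self) -> HashSet<u64> {
        self.stack.pop().unwrap_or_default() // like $write_set_pop
    }
}
```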
Lock operations. Several OpenMP operations are modeled using locks. The
omp_set_lock and omp_unset_lock functions are the obvious examples, but we
also use locks to model the behavior of atomic and critical section constructs. In
any case, a lock acquire operation is translated to
It is similar to the acquire case, except that the check occurs upon leaving the
release location, i.e., after the yield. A similar sequence is inserted in any loop
(e.g., a while loop or a for loop not in standard form) that may create a cycle
in the local space, only without the release statement.
3.4 Evaluation
² While there are a number of effective dynamic race detectors, the goal of those tools is to detect races on a particular execution. Our goal is more aligned with that of static analyzers: to cover as many executions as possible, including for different inputs, number of threads, and thread interleavings.
Fig. 3. Excerpts from three benchmarks with data races: two from DataRaceBench
(left and middle) and erroneous 1d-diffusion (right).
Fig. 4. Code for synchronization using an atomic variable (left) and a 2-thread barrier
using locks (right).
For each program, we created an erroneous version with a data race, for a total
of 20 tests. These codes are included in the experimental archive, and two are
excerpted in Fig. 4.
CIVL obtains the expected result in all 20. While we wrote these additional
examples to verify that CIVL can reason correctly about programs with complex
interleaving semantics or alias issues, for completeness we also evaluated them
with LLOV. It should be noted, however, that the authors of LLOV warn that it
“. . . does not provide support for the OpenMP constructs for synchronization. . . ”
and “. . . can produce False Positives for programs with explicit synchronizations
with barriers and locks.” [9] It is therefore unsurprising that the results were
somewhat mixed: LLOV produced no output for 6 of our examples (the racy
and race-free versions of diffusion2 and the two producer-consumer codes) and
produced the correct answer on 7 of the remaining 14. On these problems, LLOV
reported a race for both the racy and race-free version, with the exception of
diffusion1 (Fig. 3, right), where a failure to detect the alias between u and v leads
it to report both versions as race-free.
CIVL’s verification time is significantly longer than LLOV’s. On the DRB
benchmarks, total CIVL time for the 88 tests was 27 min. Individual times ranged
from 1 to 150 seconds: 66 took less than 5s, 80 took less than 30s, and 82 took
less than 1 min. (All CIVL runs used an M1 MacBook Pro with 16GB memory.)
Total CIVL runtime on the 20 extra tests was 210s. LLOV analyzes all 88 DRB
problems in less than 15 s (on a standard Linux machine).
4 Related Work
By Theorem 1, if barriers are the only form of synchronization used in a program,
only a single interleaving will be explored, and this suffices to verify race-freedom
or to find all states at the end of each barrier epoch. This is well known in other
contexts, such as GPU kernel verification (cf. [5]).
Prior work involving model checking and data races for unstructured con-
currency includes Schemmel et al. [29]. This work describes a technique, using
symbolic execution and POR, to detect defects in Pthreads programs. The app-
roach involves intricate algorithms for enumerating configurations of prime event
structures, each representing a set of executions. The completeness results deal
with the detection of defects under the assumption that the program is race-
free. While the implementation does check for data races, it is not clear that the
theoretical results guarantee a race will be found if one exists.
Earlier work of Elmas et al. describes a sound and precise technique for
verifying race-freedom in finite-state lock-based programs [16]. It uses a bespoke
POR-based model checking algorithm that associates significant and complex
information with the state, including, for each shared memory location, a set of
locks a thread should hold when accessing that location, and a reference to the
node in the depth first search stack from which the last access to that location
was performed.
Both of these model checking approaches are considerably more complex than
the approach of this paper. We have defined a simple state-transition system and
shown that a program has a data race if and only if a state or edge satisfying
a certain condition is reachable in that system. Our approach is agnostic to the
choice of algorithm used to check reachability. The earlier approaches are also
path-precise for race detection, i.e., for each execution path, a race is detected if
and only if one exists on that path. As we saw in the example following Theorem
1, our approach is not path-precise, nor does it have to be: to verify race-freedom,
it suffices to detect some race in some execution whenever one exists. This partly
explains the relative simplicity of our approach.
A common approach for verifying race-freedom is to establish consistent
correlation: for each shared memory location, there is some lock that is held
whenever that location is accessed. Locksmith [27] is a static analysis tool for
multithreaded C programs that takes this approach. The approach should never
report that a racy program is race-free, but can generate false alarms, since there
are race-free programs that are not consistently correlated. False alarms can also
arise from imprecise approximations of the set of shared variables, alias analysis,
and so on. Nevertheless, the technique appears very effective in practice.
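As a rough illustration of consistent correlation (a toy sketch of the lockset idea; the names access_loc and candidate are our own, and this is not Locksmith's actual analysis), one can intersect, per shared location, the set of locks held at each access, and flag locations whose intersection ends up empty:

#include <stdint.h>
#include <stdio.h>

#define NLOC 4
/* candidate[x]: bitmask of locks that have been held on *every* access
   to location x so far; initially all locks are candidates. */
static uint32_t candidate[NLOC] = { ~0u, ~0u, ~0u, ~0u };

static void access_loc(int x, uint32_t held) { candidate[x] &= held; }

int main(void) {
  access_loc(0, 0x1);  /* location 0 accessed holding lock A */
  access_loc(0, 0x3);  /* location 0 accessed holding locks A and B */
  access_loc(1, 0x2);  /* location 1 accessed holding lock B */
  access_loc(1, 0x1);  /* location 1 accessed holding lock A only */
  for (int x = 0; x < 2; x++)
    printf("loc %d: %s\n", x,
           candidate[x] ? "consistently correlated" : "possible race");
  return 0;
}

A race-free program whose accesses are not guarded by one common lock would produce an empty intersection here, which is exactly the kind of false alarm described above.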
Static analysis-based race-detection tools for OpenMP include OMPRacer
[33]. OMPRacer constructs a static graph representation of the happens-before
relation of a program and analyzes this graph, together with a novel whole-
program pointer analysis and a lockset analysis, to detect races. It may miss
5 Conclusion
partial order reduction scheme is used that treats all memory accesses as local,
and (3) checks for conflicting accesses are performed around synchronizations.
We proved our technique is sound and precise for finite-state models, using a
simple mathematical model for multithreaded programs with locks and barriers.
We implemented our technique in a prototype tool based on the CIVL symbolic
execution and model checking platform and applied it to a suite of C/OpenMP
programs from DataRaceBench. Although based on completely different tech-
niques, our tool achieved performance comparable to that of the state-of-the-art
static analysis tool, LLOV v.0.3.
Limitations of our tool include incomplete coverage of the OpenMP speci-
fication (e.g., target, simd, and task directives are not supported); the need
for some manual instrumentation; the potential for state explosion necessitat-
ing small scopes; and a combinatorial explosion in the mappings of threads to
loop iterations, OpenMP sections, or single constructs. In the last case, we have
compromised soundness by selecting one mapping, but in future work we will
explore ways to efficiently cover this space. On the other hand, in contrast to
LLOV and because of the reliance on model checking and symbolic execution,
we were able to verify the presence or absence of data races even for programs
using unstructured synchronization with locks, critical sections, and atomics,
including barrier algorithms and producer-consumer code.
References
1. Andrews, G.R.: Foundations of Multithreaded, Parallel, and Distributed Pro-
gramming. Addison-Wesley (2000). https://www.pearson.ch/HigherEducation/
Pearson/EAN/9780201357523/Foundations-of-Multithreaded-Parallel-and-
Distributed-Programming
2. Atzeni, S., et al.: ARCHER: Effectively spotting data races in large OpenMP appli-
cations. In: 2016 IEEE International Parallel and Distributed Processing Sympo-
sium (IPDPS), pp. 53–62 (2016). https://doi.org/10.1109/IPDPS.2016.68
3. Atzeni, S., Gopalakrishnan, G., Rakamaric, Z., Laguna, I., Lee, G.L., Ahn, D.H.:
SWORD: A bounded memory-overhead detector of OpenMP data races in pro-
duction runs. In: 2018 IEEE International Parallel and Distributed Processing
Symposium (IPDPS), pp. 845–854 (2018). https://doi.org/10.1109/IPDPS.2018.
00094
4. Bernstein, A.J.: Analysis of programs for parallel processing. IEEE Trans. Elec-
tronic Comput. EC 15(5), 757–763 (1966). https://doi.org/10.1109/PGEC.1966.
264565
5. Betts, A., et al.: The design and implementation of a verification technique for
GPU kernels. ACM Trans. Program. Lang. Syst. 37(3) (2015). https://doi.org/10.
1145/2743017
6. Blom, S., Darabi, S., Huisman, M., Safari, M.: Correct program parallelisations.
Int. J. Softw. Tools Technol. Trans. 23(5), 741–763 (2021). https://doi.org/10.
1007/s10009-020-00601-z
7. Boehm, H.J.: How to miscompile programs with "benign" data races. In: Proceed-
ings of the 3rd USENIX Conference on Hot Topic in Parallelism, HotPar 2011, pp.
1–6. USENIX Association, Berkeley, CA, USA (2011). http://dl.acm.org/citation.
cfm?id=2001252.2001255
8. Boehm, H.J., Adve, S.V.: Foundations of the C++ concurrency memory model.
In: Proceedings of the 29th ACM SIGPLAN Conference on Programming Lan-
guage Design and Implementation, pp. 68–78. PLDI ’08, Association for Comput-
ing Machinery, New York (2008). https://doi.org/10.1145/1375581.1375591
9. Bora, U., Das, S., Kukreja, P., Joshi, S., Upadrasta, R., Rajopadhye, S.: LLOV:
A fast static data-race checker for OpenMP programs. ACM Trans. Archit. Code
Optimiz. (TACO) 17(4), 1–26 (2020). https://doi.org/10.1145/3418597
10. Bora, U., Vaishay, S., Joshi, S., Upadrasta, R.: OpenMP aware MHP analysis for
improved static data-race detection. In: 2021 IEEE/ACM 7th Workshop on the
LLVM Compiler Infrastructure in HPC (LLVM-HPC). pp. 1–11 (2021). https://
doi.org/10.1109/LLVMHPC54804.2021.00006
11. Boushehrinejadmoradi, N., Yoga, A., Nagarakatte, S.: On-the-fly data race detec-
tion with the enhanced openmp series-parallel graph. In: Milfeld, K., de Supinski,
B.R., Koesterke, L., Klinkenberg, J. (eds.) IWOMP 2020. LNCS, vol. 12295, pp.
149–164. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58144-2_10
12. Chatarasi, P., Shirako, J., Kong, M., Sarkar, V.: An extended polyhedral model for
spmd programs and its use in static data race detection. In: Ding, C., Criswell, J.,
Wu, P. (eds.) LCPC 2016. LNCS, vol. 10136, pp. 106–120. Springer, Cham (2017).
https://doi.org/10.1007/978-3-319-52709-3_10
13. Dagum, L., Menon, R.: OpenMP: an industry standard API for shared-memory
programming. IEEE Comput. Sci. Eng. 5(1), 46–55 (1998). https://doi.org/10.
1109/99.660313
14. Davis, M.J.: Dynamatic: An OpenMP Race Detection Tool Combining Static and
Dynamic Analysis. Undergraduate research scholars thesis, Texas A&M University
(2021). https://oaktrust.library.tamu.edu/handle/1969.1/194411
15. Clarke, E.M., Grumberg, O., Kroening, D., Peled, D., Veith, H.: Model
Checking, 2nd edn. MIT Press, Cambridge, MA, USA (2018). https://mitpress.mit.
edu/books/model-checking-second-edition
16. Elmas, T., Qadeer, S., Tasiran, S.: Precise race detection and efficient
model checking using locksets. Tech. Rep. MSR-TR-2005-118, Microsoft
Research (2006). https://www.microsoft.com/en-us/research/publication/precise-
race-detection-and-efficient-model-checking-using-locksets/
17. Godefroid, P. (ed.): Partial-Order Methods for the Verification of Concurrent Sys-
tems. LNCS, vol. 1032. Springer, Heidelberg (1996). https://doi.org/10.1007/3-
540-60761-7
18. Gu, Y., Mellor-Crummey, J.: Dynamic data race detection for OpenMP programs.
In: SC18: International Conference for High Performance Computing, Networking,
Storage and Analysis (2018). https://doi.org/10.1109/SC.2018.00064
19. Ha, O.-K., Jun, Y.-K.: Efficient thread labeling for on-the-fly race detection of
programs with nested parallelism. In: Kim, T., Adeli, H., Kim, H., Kang, H., Kim,
K.J., Kiumi, A., Kang, B.-H. (eds.) ASEA 2011. CCIS, vol. 257, pp. 424–436.
Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-27207-3_47
20. Ha, O.K., Kuh, I.B., Tchamgoue, G.M., Jun, Y.K.: On-the-fly detection of data
races in OpenMP programs. In: Proceedings of the 2012 Workshop on Parallel and
Distributed Systems: Testing, Analysis, and Debugging, pp. 1–10. PADTAD 2012,
Association for Computing Machinery, New York (2012). https://doi.org/10.1145/
2338967.2336808
21. International Organization for Standardization: ISO/IEC 9899:2018. Information
technology – Programming languages – C (2018). https://www.iso.org/standard/
74528.html
22. Lamport, L.: How to make a multiprocessor computer that correctly executes mul-
tiprocess programs. IEEE Trans. Comput. C-28(9), 690–691 (1979). https://doi.
org/10.1109/TC.1979.1675439
23. Manson, J., Pugh, W., Adve, S.V.: The Java memory model. In: Proceedings of
the 32nd ACM SIGPLAN-SIGACT Symposium on Principles of Programming
Languages, pp. 378–391. POPL ’05, Association for Computing Machinery, New
York (2005). https://doi.org/10.1145/1040305.1040336
24. Mellor-Crummey, J.: On-the-fly detection of data races for programs with
nested fork-join parallelism. In: Supercomputing 1991: Proceedings of the 1991
ACM/IEEE Conference On Supercomputing, pp. 24–33. IEEE (1991). https://
doi.org/10.1145/125826.125861
25. Open Group: IEEE Std 1003.1: Standard for information technology–Portable
Operating System Interface (POSIX(R)) base specifications, issue 7: General con-
cepts: Memory synchronization (2018). https://pubs.opengroup.org/onlinepubs/
9699919799/basedefs/V1_chap04.html#tag_04_12
26. OpenMP Architecture Review Board: OpenMP Application Programming Inter-
face (Nov 2021). https://www.openmp.org/wp-content/uploads/OpenMP-API-
Specification-5-2.pdf, version 5.2
27. Pratikakis, P., Foster, J.S., Hicks, M.: LOCKSMITH: Practical static race detection
for C. ACM Trans. Program. Lang. Syst. 33, 3:1–3:55 (2011). https://doi.org/10.
1145/1889997.1890000
28. Protze, J., Hahnfeld, J., Ahn, D.H., Schulz, M., Müller, M.S.: OpenMP tools inter-
face: synchronization information for data race detection. In: de Supinski, B.R.,
Olivier, S.L., Terboven, C., Chapman, B.M., Müller, M.S. (eds.) IWOMP 2017.
LNCS, vol. 10468, pp. 249–265. Springer, Cham (2017). https://doi.org/10.1007/
978-3-319-65578-9_17
29. Schemmel, D., Büning, J., Rodríguez, C., Laprell, D., Wehrle, K.: Symbolic partial-
order execution for testing multi-threaded programs. In: Lahiri, S.K., Wang, C.
(eds.) CAV 2020. LNCS, vol. 12224, pp. 376–400. Springer, Cham (2020). https://
doi.org/10.1007/978-3-030-53288-8_18
30. Serebryany, K., Iskhodzhanov, T.: ThreadSanitizer: Data race detection in practice.
In: Proceedings of the Workshop on Binary Instrumentation and Applications,
pp. 62–71. WBIA 2009. Association for Computing Machinery, New York (2009).
https://doi.org/10.1145/1791194.1791203
31. Siegel, S.F., et al.: CIVL: The concurrency intermediate verification language. In:
SC15: Proceedings of the International Conference for High Performance Comput-
ing, Networking, Storage and Analysis. ACM, New York (Nov 2015). https://doi.
org/10.1145/2807591.2807635, article no. 61, pages 1-12
32. Swain, B., Huang, J.: Towards incremental static race detection in OpenMP pro-
grams. In: 2018 IEEE/ACM 2nd International Workshop on Software Correctness
for HPC Applications (Correctness), pp. 33–41. IEEE (2018). https://doi.org/10.
1109/Correctness.2018.00009
33. Swain, B., Li, Y., Liu, P., Laguna, I., Georgakoudis, G., Huang, J.: OMPRacer: A
scalable and precise static race detector for OpenMP programs. In: SC20: Inter-
national Conference for High Performance Computing, Networking, Storage and
Analysis, pp. 1–14. IEEE (2020). https://doi.org/10.1109/SC41405.2020.00058
34. Swain, B., Liu, B., Liu, P., Li, Y., Crump, A., Khera, R., Huang, J.: OpenRace: An
open source framework for statically detecting data races. In: 2021 IEEE/ACM 5th
International Workshop on Software Correctness for HPC Applications (Correct-
ness), pp. 25–32. IEEE (2021). https://doi.org/10.1109/Correctness54621.2021.
00009
35. Verma, G., Shi, Y., Liao, C., Chapman, B., Yan, Y.: Enhancing DataRaceBench
for evaluating data race detection tools. In: 2020 IEEE/ACM 4th International
Workshop on Software Correctness for HPC Applications (Correctness), pp. 20–30
(2020). https://doi.org/10.1109/Correctness51934.2020.00008
36. Ye, F., Schordan, M., Liao, C., Lin, P.H., Karlin, I., Sarkar, V.: Using polyhedral
analysis to verify OpenMP applications are data race free. In: 2018 IEEE/ACM 2nd
International Workshop on Software Correctness for HPC Applications (Correct-
ness), pp. 42–50. IEEE (2018). https://doi.org/10.1109/Correctness.2018.00010
Searching for i-Good Lemmas
to Accelerate Safety Model Checking
1 Introduction
IC3 (also known as PDR [17]) has spawned several variants, including those
that attempt to combine forward and backward search [29]. Particularly relevant
to this paper is CAR (Complementary Approximate Reachability), which combines
a forward overapproximation with a backward underapproximation [23].
It has been noted that different ways to refine the over-approximating sequence
can impact the performance of the algorithm. For example, [21] attempts to discover
good lemmas that can be “pushed to the top” since they are inductive. In
this paper, we propose an alternative way to drive the refinement of the over-
approximating sequence. We identify i-good lemmas, i.e., lemmas that are inductive
with respect to the i-th overapproximating level. The intuition is that such
i-good lemmas are useful in the search since they are fundamental to reaching a
fixpoint in the safe case. In order to guide the search towards the discovery of i-good
lemmas, we propose a heuristic approach based on two key insights, i.e., branching
and refer-skipping. First, with branching we try to control the way the SAT solver
extracts unsatisfiable cores by privileging variables occurring in i-good lemmas.
Second, we control lemma generalization by avoiding dropping literals occurring
in a subsuming lemma in the previous layer (refer-skipping).
The proposed approach is applicable both to IC3/PDR and CAR, and it is
very simple to implement. Yet, it appears to be quite effective in practice. We
implemented the i-good lemma heuristics in two open-source implementations
of IC3 and CAR, and also in the mature, state-of-the-art IC3 implementation
available inside the nuXmv model checker [12], and we carried out an extensive
experimental evaluation on Hardware Model Checking Competition (HWMCC)
benchmarks. Analysis of the results suggests that increasing the ratio of i-good
lemmas leads to an increase in performance, and the heuristics appear to be quite
effective in driving the search towards i-good lemmas. In terms of performance,
this results in significant improvements for all the tools when equipped with the
proposed approach.
This paper is structured as follows. In Sect. 2 we present the problem and the
IC3/PDR and CAR algorithms. In Sect. 3 we present the intuition underlying i-
good lemmas and the algorithms to find them. In Sect. 4 we overview the related
work. In Sect. 5 we present the experimental evaluation. In Sect. 6 we draw some
conclusions and present directions for future work.
2 Preliminaries
2.1 Boolean Transition System
A Boolean transition system Sys is a tuple (X, Y, I, T), where X and X′ denote
the sets of state variables in the present state and the next state, respectively,
and Y denotes the set of input variables. The state space of Sys is the set of
possible assignments to X. I(X) is a Boolean formula corresponding to the set
of initial states, and T(X, Y, X′) is a Boolean formula representing the transition
relation. State s2 is a successor of state s1 with input y iff s1 ∧ y ∧ s2′ ⊨ T, which
is also denoted by (s1, y, s2) ∈ T. In the following, we will also write (s1, s2) ∈ T,
meaning that (s1, y, s2) ∈ T for some assignment y to the input variables. A path
is a sequence of states s1, s2, . . . such that (si, si+1) ∈ T for each consecutive pair.
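For concreteness, here is a toy instance of this definition (our own illustration, not an example from the paper): a single state bit x that toggles exactly when the input y is set.

\[
X = \{x\}, \qquad Y = \{y\}, \qquad I(X) = \neg x, \qquad
T(X, Y, X') = \bigl(x' \leftrightarrow (x \oplus y)\bigr)
\]

Here the state assigning x true is a successor of ¬x under input y, since ¬x ∧ y ∧ x′ ⊨ T.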
               Forward                    Backward
Base           F0 = I                     B0 = ¬P
Induction      Fi+1 = T(Fi)               Bi+1 = T⁻¹(Bi)
Safe Check     Fi+1 ⊆ ⋃_{0≤j≤i} Fj        Bi+1 ⊆ ⋃_{0≤j≤i} Bj
Unsafe Check   Fi ∩ ¬P ≠ ∅                Bi ∩ I ≠ ∅
For forward search, Fi denotes the set of states that are reachable from I
within i steps, which is computed by iteratively applying T. At each iteration,
we first compute a new Fi, and then perform safe checking and unsafe checking. If
the safe/unsafe checking hits, the search terminates. Intuitively, unsafe checking
Fi ∩ ¬P ≠ ∅ indicates that some bad states are within Fi, and safe checking
Fi+1 ⊆ ⋃_{0≤j≤i} Fj indicates that all states reachable from I have been checked
and none of them violate P. For backward search, Bi is the set of states that can
reach ¬P in i steps, and the search procedure is analogous to the forward one.
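The following explicit-state sketch (our own toy rendering; the names post and forward_search and the bitmask encoding are illustrative assumptions, since real engines perform these checks symbolically with a SAT solver) mirrors the forward column of the table above on a four-state system:

#include <stdio.h>

#define BAD 0x8u  /* bitmask of states violating P, i.e., the set ¬P */

/* toy image T(S): rotate the 4-bit state set left by one position */
static unsigned post(unsigned s) { return ((s << 1) | (s >> 3)) & 0xFu; }

static int forward_search(unsigned init) {
  unsigned reached = init, frontier = init;
  for (;;) {
    if (frontier & BAD) return 0;               /* unsafe check: Fi ∩ ¬P ≠ ∅ */
    unsigned fresh = post(frontier) & ~reached; /* states not seen before */
    if (fresh == 0) return 1;                   /* safe check: fixpoint reached */
    reached |= fresh;
    frontier = fresh;
  }
}

int main(void) {
  /* from state s0, the bad state s3 is reached after three steps */
  printf("%s\n", forward_search(0x1u) ? "safe" : "unsafe");
  return 0;
}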
Notations. A literal is an atomic variable or its negation. If l is a literal, we
denote its corresponding variable with var(l). A cube (resp. clause) is a conjunction
(resp. disjunction) of literals. The negation of a clause is a cube and vice versa.
Theorem 1. IC3 terminates with safe at frame i (i > 0), if and only if every
lemma at frame i is i-good.
Theorem 2. CAR terminates with safe at frame i (i > 0), if every lemma at
frame i is i-good.
Such theorems provide the theoretical foundation on which we base our main
conjecture: the computation of i-good lemmas can be helpful for both IC3 and
CAR to accelerate the convergence in proving properties. Intuitively, an i-good
lemma shows the promise of being independent of the reachability layer, and
hence holds in general.
² The algorithms differ in the way they check reaching the fixpoint, but this difference
will be ignored unless otherwise stated.
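As a reminder of the propagation condition behind these theorems (our own formulation, consistent with the description of i-good lemmas below, not a quote from the paper), a lemma c at frame i is i-good when it can be propagated one frame ahead, i.e., when it is inductive relative to the frame O_i:

\[
O_i \wedge c \wedge T \models c'
\]

where c′ is the next-state version of c (and the conjunct c is redundant when c ∈ O_i, but makes the relative-induction reading explicit).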
– Before each SAT query in which a (negated) lemma c (or its next-state version
c′) is part of the assumptions, c is sorted in descending order of S[var(l)], where
l ∈ c, to give higher priority to assumption literals with higher scores. This
corresponds to the calls to the function sort(c) in the pseudo-code description
of the main components of IC3 and CAR: at the beginning of Unsafecheck
(Algorithms 1 and 2), in Get predecessor (line 6 of Algorithm 4, line 6 of
Algorithm 5), and in Generalization (line 25 of Algorithm 4, line 23 of
Algorithm 5).
– Whenever IC3 or CAR discovers an i-good lemma c, all the variables in c are
rewarded by increasing their score. A lemma c is determined to be i-good
either when it is propagated forward from frame i to frame i + 1 (function
propagation of Algorithms 4 and 5) or when c is the result of a generalization
from d ⊇ c at frame i + 1 such that c is already in frame i (function
generalize, Algorithm 3). In the pseudo-code, the reward steps correspond
to the calls to the function reward(c) at line 12 of Algorithm 3, line 42 of
Algorithm 4, and line 37 of Algorithm 5. The reward function first decays
the scores S[v] of all the variables by a small amount (we multiply by 0.99
in our implementation), and then increments the score of all the variables in
c (by 1 in our implementation).
In order to determine whether generalize produced an i-good lemma, we
also use the function get parentnode(c) (line 3 of Algorithm 3), which
returns a cube p in frame i − 1 such that p ⊆ c when c belongs to frame i. (If
multiple such p exist, the one with the highest score is returned).
– When performing inductive generalization of a lemma c at frame i (Algorithm
3), in which c is strengthened by trying to drop literals from it as long
as the result is still a valid lemma for frame i, the literals of c are sorted in
increasing order of S[var(l)], with l ∈ c. This corresponds to the call to the
function reverse sort(c) at line 2 of Algorithm 3 in the pseudo-code. (A
compact sketch of the scoring and sorting bookkeeping follows this list.)
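The promised sketch (our own simplification, using DIMACS-style integer literals; the names score, reward, and sort_lemma are illustrative, and the actual data structures live inside the respective model checkers) shows the score array, the decay-then-bump reward, and the descending sort used for assumptions:

#include <stdio.h>
#include <stdlib.h>

#define NVARS 8
static double score[NVARS + 1];  /* S[v] for variables 1..NVARS */

static int var_of(int lit) { return lit < 0 ? -lit : lit; }

/* reward(c): decay every score slightly, then bump the variables of the
   i-good lemma c, as described above */
static void reward(const int *c, int len) {
  for (int v = 1; v <= NVARS; v++) score[v] *= 0.99;
  for (int i = 0; i < len; i++) score[var_of(c[i])] += 1.0;
}

/* comparator for descending score: high-score assumption literals first */
static int by_desc_score(const void *a, const void *b) {
  double sa = score[var_of(*(const int *)a)];
  double sb = score[var_of(*(const int *)b)];
  return (sa < sb) - (sa > sb);
}

static void sort_lemma(int *c, int len) {
  qsort(c, len, sizeof *c, by_desc_score);
}

int main(void) {
  int c[] = { -1, 2, -3 };
  int good[] = { 2 };
  reward(good, 1);        /* pretend a lemma over variable 2 was i-good */
  sort_lemma(c, 3);       /* literal 2 now sorts first */
  for (int i = 0; i < 3; i++) printf("%d ", c[i]);
  printf("\n");
  return 0;
}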
When a lemma c is to be added into frame i (i > 0), the generalize procedure tries
to compute a new lemma g such that g ⊆ c and g is also valid to be added to frame
i (Oi). The main idea of generalization is to try to drop the literals of the original
lemma one by one, to see whether what is left can still be a valid lemma.
There are several generalization algorithms with different trade-offs between
efficiency (in terms of the number of SAT queries) and effectiveness (in terms
of the potential reduction in the size of the generalized lemma), e.g. [11,17,20].
More generally, there might be multiple different ways in which a lemma c can
be generalized, with results of incomparable strength (i.e., there might be both
g1 ⊆ c and g2 ⊆ c such that g1 ⊄ g2 and g2 ⊄ g1).
The main idea of the refer-skipping heuristic is to bias the generalization to
increase the likelihood that the result g is an (i − 1)-good lemma. Consider the
generalization of lemma c = ¬1 ∨ 2 ∨ ¬3 at frame i (i > 1). If there is already a
subsuming lemma at frame i − 1, refer-skipping avoids dropping the literals of c
that occur in that lemma, so that the generalized result remains subsumed by it.
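A sketch of the drop loop with refer-skipping (our own rendering; generalize_rs is an illustrative name, and is_valid_lemma stands in for the relative-induction SAT query, stubbed out here):

#include <stdbool.h>
#include <stdio.h>

/* stub for the SAT query "is the shortened c still a valid lemma for
   frame i?"; always succeeds in this toy version */
static bool is_valid_lemma(const int *c, int len, int frame) {
  (void)c; (void)len; (void)frame;
  return true;
}

/* Drop literals of c one by one, but never those occurring in the
   subsuming frame-(i-1) lemma d, so the result stays subsumed by d. */
static int generalize_rs(int *c, int len, int frame,
                         const int *d, int dlen) {
  for (int i = 0; i < len; ) {
    bool protected_lit = false;
    for (int j = 0; j < dlen; j++)
      if (d[j] == c[i]) protected_lit = true;   /* refer-skipping */
    if (protected_lit) { i++; continue; }
    int saved = c[i], last = c[len - 1];
    c[i] = last; len--;                         /* tentative drop */
    if (!is_valid_lemma(c, len, frame)) {
      c[i] = saved; c[len] = last; len++;       /* undo and keep literal */
      i++;
    }  /* on success, re-examine the literal swapped into slot i */
  }
  return len;
}

int main(void) {
  int c[] = { -1, 2, -3 }, d[] = { -1, -3 };
  int n = generalize_rs(c, 3, 2, d, 2);
  for (int i = 0; i < n; i++) printf("%d ", c[i]);  /* -1 and -3 survive */
  printf("\n");
  return 0;
}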
4 Related Work
In the field of safety model checking, after the introduction of IC3 [11], several
variants have been presented: [20] presents the counterexample-guided general-
ization (CTG) of a lemma by blocking states that interfere with it, which sig-
nificantly improves the performance of IC3; AVY [33] introduces the ideas of IC3
into IMC (Interpolant Model Checking) [25] to obtain a better model checking
algorithm; its upgraded version kAVY [32] uses k-induction to guide the interpola-
tion and the IC3/PDR generalization inside; [28] proposes to combine IC3/PDR with
reverse IC3/PDR; the subsequent work [29] interleaves a forward and a back-
ward execution of IC3 and strengthens one frame sequence by leveraging the
proof-obligations from the other; IC3-INN [15] enables IC3 to leverage the inter-
nal signal information of the system to induce a variant of IC3 that can perform
better on certain industrial benchmarks; [30] introduces under-approximation in
PDR to improve the performance of bug-finding.
The importance of discovering inductive lemmas for improving convergence is
first noted in [17]. In PDR terminology, inductive lemmas are the ones belonging
to frame O∞ , as they represent an over-approximation of all the reachable states.
The most relevant related work is [21], where a variant of IC3 named QUIP
is proposed for implementing the pushing of the discovered lemmas to O∞ . At
its essence, QUIP adds the negation of a discovered lemma c as a may-proof-
obligation, hence trying to push c to the next frame. Counterexamples of may-
proof-obligations represent an under-approximation of the reachable states and
are stored to disprove the inductiveness of other lemmas. In QUIP terminology,
such lemmas are classified as bad lemmas, as they have no chance of being part
of the inductive invariant. Since the pushing is not limited to the current number
of frames, inductive lemmas are discovered when all the clauses of a frame can
be pushed (Ok \ Ok+1 = ∅ for a level k), and then added in O∞ . In QUIP
terminology, lemmas belonging to O∞ are classified as good lemmas, and are
always kept during the algorithm. Observe that the concept of good lemma in
[21] is a stronger version of Definition 1, which instead is local to a frame i and
characterizes lemmas that can be propagated one frame ahead.
Both QUIP and our heuristic try to accomplish a similar task, which is prior-
itizing the use of already discovered lemmas during the generalization. There
are however several differences: QUIP proceeds by adding additional proof-
obligations to the queue and by progressively proving the inductiveness of a
lemma relative to any frame. Our approach, on the other hand, is based on a
cheap heuristic strategy that locally guides the generalization prioritizing the
locally good lemmas. Some i-good lemmas computed may not be part of the
final invariant and can not be pushed later; in QUIP, such lemmas would not be
considered good. In our view, pushing them is not necessarily a waste of effort,
because they still strengthen the frames and their presence might be necessary
to deduce the final invariant. Finally, it is worth mentioning that our heuristic
is much simpler to implement and integrate into different PDR-based engines.
The idea of ordering literals when performing inductive generalization was
already proposed in [11] and adopted, as a default strategy, in several implementations.
5 Evaluation
5.1 Experimental Setup
We integrated the branching and refer-skipping heuristics into three systems: the
IC3Ref [3] and SimpleCAR [6] (open-source) model checkers, which implement
the IC3 and (Forward and Backward³) CAR algorithms respectively, and the
mature, state-of-the-art implementation of IC3 available inside the nuXmv model
checker [12]. We make our implementations and data for reproducing the exper-
iments available at https://github.com/youyusama/i-Good_Lemmas_MC.
Since our approach is related to QUIP [21], we include the evaluation of
QUIP, and of IC3 (mainly as the baseline for QUIP), as implemented⁴ in IIMC [4].
We also consider the PDR implementation in the ABC model checker [1], which
is state-of-the-art in hardware model checking.
Table 3 summarizes the tested tools, algorithms, and their flags. We use the
flag “-br” to enable the branching heuristic and “-rs” to enable refer-skipping.
Furthermore, we also evaluate another configuration (denoted “-sh”), in which
the calls to the sort() functions in Algorithms 4 and 5 are replaced by random
shuffles, thus simulating a strategy that orders variables randomly. When no flag
is active, IC3Ref runs the instances with its own variable-ordering strategy,
present in the original implementation.
³ Although there is an implementation of Backward CAR in SimpleCAR, this methodology
corresponds to reverse IC3. As a result, we did not include Backward CAR in
this paper and leave its evaluation to future work.
⁴ As far as we know, this is the only publicly available QUIP implementation.
We evaluate all the tools on 749 benchmarks, in Aiger format, from the SINGLE
safety property track of the 2015 and 2017 editions of HWMCC [8]⁵. We ran the
experiments on a cluster, which consists of 2304 2.5 GHz CPUs in 240 nodes
running RedHat 4.8.5 with a total of 96 GB RAM. For each test, we set the
memory limit to 8 GB and the time limit to 5 h. During the experiments, each
model-checking run has exclusive access to a dedicated node.
To increase our confidence in the correctness of the results, we compare the
results of the solvers to make sure they are all consistent (modulo timeouts).
For the cases with unsafe results, we also check the provided counterexample
with the aigsim tool from the Aiger package [2]. We found no discrepancies in the
results, and all unsafe cases successfully pass the aigsim check.
⁵ Starting with HWMCC 2019, the official format used in the competition switched
from Aiger to Btor2 [27], a format for word-level model checking. As a result, we did
not include those instances in our experiments.
Similar insights can be obtained from Fig. 1, which clearly shows the positive
effect of the heuristics on performance.
A comparison of the performance of the tools with and without the heuristics
is shown in Fig. 2. All three solvers reduce their time cost when equipped
with branching and refer-skipping (see the last row of the figure). Specifically,
67.8% of the instances take less (or equal) time to check with ‘nuXmv -br -rs’, and the
corresponding portions for ‘ic3 -br -rs’ and ‘fcar -br -rs’ are 77.9% and 87.0%.
Results are more variable when only a single heuristic is enabled, which needs to
be explored in the future. For example, ‘fcar -br’ and ‘nuXmv -rs’ generally cost
slightly more time than ‘fcar’ and ‘nuXmv’, respectively.
Fig. 1. Comparisons among the implementations of IC3, PDR and CAR under different
configurations. (To make the figure more readable, we skip the results with a single
heuristic, which are still shown in Table 4.)
Fig. 2. Time comparison between IC3/CAR with and without two heuristics on safe-
unsafe cases. The baseline is always on the y-axis. Points above the diagonal indicate
better performance with the heuristics active. Points on the borders indicate timeouts
(18000 s).
Fig. 3. Comparison on the success rate (sr) to compute i-good lemmas between
IC3/CAR with and without branching and refer-skipping.
– Consider the results presented in Fig. 3. The figure shows the comparison
of the success rates in computing i-good lemmas between IC3/CAR with and
without the heuristics. ‘ic3 -br -rs’ computes more i-good lemmas than ‘ic3’
on 54% of the tested instances, while ‘fcar -br -rs’ computes more i-good lemmas
than ‘fcar’ on 67% of the tested instances, an even higher portion.
This supports the conjecture that enabling branching and refer-skipping makes
IC3/CAR compute more i-good lemmas.
– Now consider Fig. 4. The figure shows the comparison between the deviation
of the success rate in computing i-good lemmas (Y axis) and the deviation of
checking (CPU) time (X axis) for IC3/CAR with and without the heuristics. The
meaning of each point in the plot is explained in the caption of the figure. In
general, the more points lie in the first quadrant, the stronger the support for
our claim.
Clearly, the plots for both IC3 and CAR in Fig. 4 support the conjecture
that finding more i-good lemmas can help achieve better model-checking
performance (time cost).
Fig. 4. Comparison between the deviation of the success rate (sr) to compute i-good
lemmas (Y axis) and the deviation of checking (CPU) time (X axis) for IC3/CAR with
and without the heuristics. For each instance, let the checking time of ‘ic3’/‘fcar’ be
t and that of ‘ic3 -br -rs’/‘fcar -br -rs’ be t′. Each point has t − t′ as the x value and
sr′ − sr as the y value.
References
1. ABC. https://github.com/berkeley-abc/abc
2. AIGER Tools. http://fmv.jku.at/aiger/aiger-1.9.9.tar.gz
3. IC3Ref. https://github.com/arbrad/IC3ref
4. IIMC-QUIP. https://github.com/ryanberryhill/iimc
5. Minisat 2.2.0. https://github.com/niklasso/minisat
6. SimpleCAR. https://github.com/lijwen2748/simplecar/releases/tag/v0.1
7. Balyo, T., Heule, M., Iser, M., Järvisalo, M., Suda, M.: Proceedings of SAT Compe-
tition 2022: Solver and Benchmark Descriptions. Department of Computer Science
Series of Publications B, vol. B-2022-1. http://hdl.handle.net/10138/347211
8. Biere, A.: AIGER Format. http://fmv.jku.at/aiger/FORMAT
9. Biere, A., Cimatti, A., Clarke, E., Zhu, Y.: Symbolic model checking without
BDDs. In: Cleaveland, W.R. (ed.) TACAS 1999. LNCS, vol. 1579, pp. 193–207.
Springer, Heidelberg (1999). https://doi.org/10.1007/3-540-49059-0_14
10. Biere, A., Fröhlich, A.: Evaluating CDCL variable scoring schemes. In: Heule, M.,
Weaver, S. (eds.) SAT 2015. LNCS, vol. 9340, pp. 405–422. Springer, Cham (2015).
https://doi.org/10.1007/978-3-319-24318-4_29
11. Bradley, A.R.: SAT-based model checking without unrolling. In: Jhala, R.,
Schmidt, D. (eds.) VMCAI 2011. LNCS, vol. 6538, pp. 70–87. Springer, Heidel-
berg (2011). https://doi.org/10.1007/978-3-642-18275-4_7
12. Cavada, R., et al.: The nuXmv symbolic model checker. In: Biere, A., Bloem, R.
(eds.) CAV 2014. LNCS, vol. 8559, pp. 334–342. Springer, Cham (2014). https://
doi.org/10.1007/978-3-319-08867-9_22
13. Cimatti, A., Griggio, A., Mover, S., Tonetta, S.: IC3 modulo theories via implicit
predicate abstraction. In: Ábrahám, E., Havelund, K. (eds.) TACAS 2014. LNCS,
vol. 8413, pp. 46–61. Springer, Heidelberg (2014). https://doi.org/10.1007/978-3-
642-54862-8_4
14. Cimatti, A., Griggio, A., Tonetta, S.: The VMT-LIB language and tools. CoRR
abs/2109.12821 (2021). https://arxiv.org/abs/2109.12821
15. Dureja, R., Gurfinkel, A., Ivrii, A., Vizel, Y.: IC3 with internal signals. In: 2021
Formal Methods in Computer Aided Design (FMCAD), pp. 63–71 (2021)
16. Dureja, R., Li, J., Pu, G., Vardi, M.Y., Rozier, K.Y.: Intersection and rotation
of assumption literals boosts bug-finding. In: Chakraborty, S., Navas, J.A. (eds.)
VSTTE 2019. LNCS, vol. 12031, pp. 180–192. Springer, Cham (2020). https://doi.
org/10.1007/978-3-030-41600-3_12
17. Een, N., Mishchenko, A., Brayton, R.: Efficient implementation of property
directed reachability. In: Proceedings of the International Conference on Formal
Methods in Computer-Aided Design, FMCAD 2011, pp. 125–134. FMCAD Inc.,
Austin, Texas (2011)
18. Eén, N., Sörensson, N.: An extensible SAT-solver. In: Giunchiglia, E., Tacchella,
A. (eds.) SAT 2003. LNCS, vol. 2919, pp. 502–518. Springer, Heidelberg (2004).
https://doi.org/10.1007/978-3-540-24605-3_37
19. Griggio, A., Roveri, M.: Comparing different variants of the IC3 algorithm for
hardware model checking. IEEE Trans. Comput. Aided Des. Integr. Circuits Syst.
35(6), 1026–1039 (2015)
20. Hassan, Z., Bradley, A.R., Somenzi, F.: Better generalization in IC3. In: 2013 Formal
Methods in Computer-Aided Design, pp. 157–164. IEEE (2013)
21. Ivrii, A., Gurfinkel, A.: Pushing to the top. In: Proceedings of the 15th Conference
on Formal Methods in Computer-Aided Design, FMCAD 2015, pp. 65–72. FMCAD
Inc., Austin, Texas (2015)
22. Li, J., Dureja, R., Pu, G., Rozier, K.Y., Vardi, M.Y.: SimpleCAR: an efficient bug-
finding tool based on approximate reachability. In: Chockler, H., Weissenbacher,
G. (eds.) CAV 2018. LNCS, vol. 10982, pp. 37–44. Springer, Cham (2018). https://
doi.org/10.1007/978-3-319-96142-2_5
23. Li, J., Zhu, S., Zhang, Y., Pu, G., Vardi, M.Y.: Safety model checking with com-
plementary approximations. In: Proceedings of the 36th International Conference
on Computer-Aided Design, ICCAD 2017, pp. 95–100. IEEE Press (2017)
24. Marques-Silva, J., Lynce, I., Malik, S.: Conflict-driven clause learning SAT solvers.
In: Handbook of Satisfiability, vol. 185 (2009)
25. McMillan, K.L.: Interpolation and SAT-based model checking. In: Hunt, W.A.,
Somenzi, F. (eds.) CAV 2003. LNCS, vol. 2725, pp. 1–13. Springer, Heidelberg
(2003). https://doi.org/10.1007/978-3-540-45069-6_1
26. Moskewicz, M.W., Madigan, C.F., Zhao, Y., Zhang, L., Malik, S.: Chaff: engineer-
ing an efficient SAT solver. In: Proceedings of the 38th Annual Design Automation
Conference, pp. 530–535 (2001)
27. Niemetz, A., Preiner, M., Wolf, C., Biere, A.: Btor2, BtorMC and Boolector 3.0.
In: Chockler, H., Weissenbacher, G. (eds.) CAV 2018. LNCS, vol. 10981, pp. 587–
595. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-96145-3_32
28. Seufert, T., Scholl, C.: Combining PDR and reverse PDR for hardware model check-
ing. In: 2018 Design, Automation and Test in Europe Conference and Exhibition
(DATE), pp. 49–54 (2018)
29. Seufert, T., Scholl, C.: fbPDR: in-depth combination of forward and backward
analysis in property directed reachability. In: Teich, J., Fummi, F. (eds.) Design,
Automation & Test in Europe Conference & Exhibition, DATE 2019, Florence,
Italy, 25–29 March 2019, pp. 456–461. IEEE (2019)
30. Seufert, T., Scholl, C., Chandrasekharan, A., Reimer, S., Welp, T.: Making progress
in property directed reachability. In: Finkbeiner, B., Wies, T. (eds.) VMCAI 2022.
LNCS, vol. 13182, pp. 355–377. Springer, Cham (2022). https://doi.org/10.1007/
978-3-030-94583-1_18
31. Sheeran, M., Singh, S., Stålmarck, G.: Checking safety properties using induction
and a SAT-solver. In: Hunt, W.A., Johnson, S.D. (eds.) FMCAD 2000. LNCS,
vol. 1954, pp. 127–144. Springer, Heidelberg (2000). https://doi.org/10.1007/3-
540-40922-X_8
32. Vediramana Krishnan, H.G., Vizel, Y., Ganesh, V., Gurfinkel, A.: Interpolating
strong induction. In: Dillig, I., Tasiran, S. (eds.) CAV 2019. LNCS, vol. 11562, pp.
367–385. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-25543-5_21
33. Vizel, Y., Gurfinkel, A.: Interpolating property directed reachability. In: Biere, A.,
Bloem, R. (eds.) CAV 2014. LNCS, vol. 8559, pp. 260–276. Springer, Cham (2014).
https://doi.org/10.1007/978-3-319-08867-9_17
Second-Order Hyperproperties
1 Introduction
About a decade ago, Clarkson and Schneider coined the term hyperproperties [21]
for the rich class of system requirements that relate multiple computations. In
their definition, hyperproperties generalize trace properties, which are sets of
traces, to sets of sets of traces. This covers a wide range of requirements, from
information-flow security policies to epistemic properties describing the knowl-
edge of agents in a distributed system. Missing from Clarkson and Schneider’s
original theory was, however, a concrete specification language that could express
customized hyperproperties for specific applications and serve as the common
semantic foundation for different verification methods.
A first milestone towards such a language was the introduction of the tem-
poral logic HyperLTL [20]. HyperLTL extends linear-time temporal logic (LTL)
with quantification over traces. Suppose, for example, that an agent i in a dis-
tributed system observes only a subset of the system variables. The agent knows
that some LTL formula ϕ is true on some trace π iff ϕ holds on all traces π′
that agent i cannot distinguish from π. If we denote the indistinguishability of
π and π′ by π ∼i π′, then the property that there exists a trace π where agent i
knows ϕ can be expressed as the HyperLTL formula

∃π. ∀π′. π ∼i π′ → ϕ(π′),
where we write ϕ(π′) to denote that the trace property ϕ holds on trace π′.
While HyperLTL and its variations have found many applications [28,32,44],
the expressiveness of these logics is limited, leaving many widely used hyperprop-
erties out of reach. A prominent example is common knowledge, which is used in
distributed applications to ensure simultaneous action [30,40]. Common knowl-
edge in a group of agents means that the agents not only know individually that
some condition ϕ is true, but that this knowledge is “common” to the group in
the sense that each agent knows that every agent knows that ϕ is true; on top
of that, each agent in the group knows that every agent knows that every agent
knows that ϕ is true; and so on, forming an infinite chain of knowledge.
The fundamental limitation of HyperLTL that makes it impossible to express
properties like common knowledge is that the logic is restricted to first-order
quantification. HyperLTL, then, cannot reason about sets of traces directly, but
must always do so by referring to individual traces that are chosen existentially
or universally from the full set of traces. For the specification of an agent’s indi-
vidual knowledge, where we are only interested in the (non-)existence of a single
trace that is indistinguishable and that violates ϕ, this is sufficient; however,
expressing an infinite chain, as needed for common knowledge, is impossible.
In this paper, we introduce Hyper2 LTL, a temporal logic for hyperproperties
with second-order quantification over traces. In Hyper2 LTL, the existence of a
trace π where the condition ϕ is common knowledge can be expressed as the
following formula (using slightly simplified syntax):
∃π. ∃X. π ∈ X ∧ (∀π′ ∈ X. ⋀_{i=1}^{n} ∀π″. π′ ∼i π″ → π″ ∈ X) ∧ ∀π′ ∈ X. ϕ(π′).
of all agents. This smallest set X is defined by the (monotone) fixpoint opera-
tion that adds, in each step, all traces that are indistinguishable to some trace
already in X.
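On a finite universe of traces, this fixpoint computation is just a closure loop; the sketch below (our own toy rendering with the illustrative names closure and indist, over bitmasks, whereas the actual procedure in Sect. 5 works on automata) makes the monotone iteration explicit:

#include <stdbool.h>
#include <stdio.h>

#define NTRACES 8
/* indist[s][t]: some agent cannot distinguish traces s and t */
static bool indist[NTRACES][NTRACES];

/* least set X (as a bitmask) that contains `seed` and is closed under
   adding all traces indistinguishable from some trace already in X */
static unsigned closure(int seed) {
  unsigned X = 1u << seed, prev;
  do {
    prev = X;
    for (int s = 0; s < NTRACES; s++)
      if (X & (1u << s))
        for (int t = 0; t < NTRACES; t++)
          if (indist[s][t]) X |= 1u << t;
  } while (X != prev);  /* the operation is monotone, so this terminates */
  return X;
}

int main(void) {
  indist[0][1] = indist[1][2] = true;  /* chain: trace 0 ~ 1 ~ 2 */
  printf("closure(0) = 0x%x\n", closure(0));  /* prints 0x7 = {0,1,2} */
  return 0;
}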
We develop an approximate model checking algorithm for Hyper2 LTLfp that
uses bidirectional inference to deduce lower and upper bounds on second-order
variables, interposed with first-order model checking in the style of HyperLTL.
Our procedure is parametric in an oracle that provides (increasingly precise)
lower and upper bounds. In the paper, we realize the oracles with fixpoint itera-
tion for underapproximations of the sets of traces assigned to the second-order
variables, and automata learning for overapproximations. We report on encour-
aging experimental results with our model-checking algorithm, which has been
implemented in a tool called HySO.
2 Preliminaries
For n ∈ N we define [n] := {1, . . . , n}. We assume that AP is a finite set of
atomic propositions and define Σ := 2AP . For t ∈ Σ ω and i ∈ N define t(i) ∈ Σ
as the ith element in t (starting with the 0th); and t[i, ∞] for the infinite suffix
starting at position i. For traces t1 , . . . , tn ∈ Σ ω we write zip(t1 , . . . , tn ) ∈ (Σ n )ω
for the pointwise zipping of the traces, i.e., zip(t1 , . . . , tn )(i) := (t1 (i), . . . , tn (i)).
HyperLTL. HyperLTL [20] is one of the most studied temporal logics for the
specification of hyperproperties. We assume that V is a fixed set of trace variables.
For the most part, we use variations of π (e.g., π, π′, π1, . . .) to denote
trace variables. HyperLTL formulas are then generated by the grammar
ϕ := Qπ. ϕ | ψ
ψ := aπ | ¬ψ | ψ ∧ ψ | ○ψ | ψ U ψ

Π ⊨ aπ iff a ∈ Π(π)(0)
Π ⊨ ¬ψ iff Π ⊭ ψ
Π ⊨ ψ1 ∧ ψ2 iff Π ⊨ ψ1 and Π ⊨ ψ2
Π ⊨ ○ψ iff Π[1, ∞] ⊨ ψ
Π ⊨ ψ1 U ψ2 iff ∃i ∈ N. Π[i, ∞] ⊨ ψ2 and ∀j < i. Π[j, ∞] ⊨ ψ1.
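As a concrete instance of this syntax (a standard textbook example, not taken verbatim from this chapter), one common formulation of observational determinism relates the low-observable inputs i and outputs o of every pair of traces:

\[
\forall \pi.\ \forall \pi'.\ \Box\,(i_\pi \leftrightarrow i_{\pi'}) \rightarrow \Box\,(o_\pi \leftrightarrow o_{\pi'})
\]

Sect. 6 evaluates synchronous and asynchronous variants of exactly this kind of property.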
3 Second-Order HyperLTL
The (first-order) trace quantification in HyperLTL ranges over the set of all sys-
tem traces; we thus cannot reason about arbitrary sets of traces as required for,
e.g., common knowledge. We introduce a second-order extension of HyperLTL
by introducing second-order variables (ranging over sets of traces) and allowing
quantification over traces from any such set. We present two variants of our logic
that differ in the way quantification is resolved. In Hyper2 LTL, we quantify over
arbitrary sets of traces. While this yields a powerful and intuitive logic, second-
order quantification is inherently non-constructive. During model checking, there
thus does not exist an efficient way to even approximate possible witnesses for the
sets of traces. To solve this quandary, we restrict Hyper2 LTL to Hyper2 LTLfp ,
where we instead quantify over sets of traces that satisfy some minimality or
maximality constraint. This allows for large fragments of Hyper2 LTLfp that
admit algorithmic approximations to its model checking (by, e.g., using known
techniques from fixpoint computations [47,48]).
ϕ := Qπ ∈ X. ϕ | QX. ϕ | ψ
ψ := aπ | ¬ψ | ψ ∧ ψ | ○ψ | ψ U ψ

Π, Δ ⊨ ψ iff Π ⊨ ψ
Π, Δ ⊨ Qπ ∈ X. ϕ iff Qt ∈ Δ(X). Π[π → t], Δ ⊨ ϕ
Π, Δ ⊨ QX. ϕ iff QA ⊆ Σω. Π, Δ[X → A] ⊨ ϕ
Syntactic Sugar. In Hyper2 LTL, we can quantify over traces within a second-
order variable, but we cannot state, within the body of the formula, that some
path is a member of some second-order variable. For that, we define π ∈ X (as
an atom within the body) as syntactic sugar for ∃π′ ∈ X. □(π′ =AP π), i.e., π
is in X if there exists some trace in X that agrees with π on all propositions.
Note that we can only use π ∈ X outside of the scope of any temporal operators;
this ensures that we can bring the resulting formula into a form that conforms
to the Hyper2 LTL syntax.
The semantics of Hyper2 LTL quantifies over arbitrary sets of traces, making
even approximations to its semantics challenging. We propose Hyper2 LTLfp as
a restriction that only quantifies over sets that are subject to an additional
minimality or maximality constraint. For large classes of formulas, we show that
this admits effective model-checking approximations. We define Hyper2 LTLfp by
the following grammar:
ϕ := Qπ ∈ X. ϕ | Q(X, ⪯, ϕ). ϕ | ψ
ψ := aπ | ¬ψ | ψ ∧ ψ | ○ψ | ψ U ψ
Semantics. For path formulas, the semantics of Hyper2 LTLfp is defined analo-
gously to that of Hyper2 LTL and HyperLTL. For the quantifier prefix we define:
Π, Δ ⊨ ψ iff Π ⊨ ψ
Π, Δ ⊨ Qπ ∈ X. ϕ iff Qt ∈ Δ(X). Π[π → t], Δ ⊨ ϕ
Π, Δ ⊨ Q(X, ⪯, ϕ1). ϕ2 iff QA ∈ sol(Π, Δ, (X, ⪯, ϕ1)). Π, Δ[X → A] ⊨ ϕ2
[Fig. 1: left, the two-agent example system; right, the iterative construction
π = aⁿdω, K2(π) = aⁿ⁻¹bdω, K1K2(π) = aⁿ⁻¹cdω, K2K1K2(π) = aⁿ⁻²bcdω, . . . ,
K1K2 · · · K2(π) = acⁿ⁻¹dω.]
Fig. 1. Left: An example for a multi-agent system with two agents, where agent 1
observes a and d, and agent 2 observes c and d. Right: The iterative construction of
the traces to be considered for common knowledge starting with an dω .
In the last step we add acⁿ⁻¹dω to the set of indistinguishable traces, concluding
that a is not common knowledge.
The following Hyper2 LTLfp formula specifies the property stated above. The
abbreviation obs(π1, π2) := □(π1 ={a,d} π2) ∨ □(π1 ={c,d} π2) denotes that π1
and π2 are observationally equivalent for either agent 1 or agent 2.

∀π ∈ S. (⋀_{i=0}^{n−1} ○^i aπ ∧ ○^n □ dπ) →
    (X, ⪯, π ∈ X ∧ ∀π1 ∈ X. ∀π2 ∈ S. obs(π1, π2) → π2 ∈ X). ∀π′ ∈ X. aπ′
Conversely, the existential fragment of Hyper2 LTL can be encoded back into
HyperQPTL satisfiability:
Lastly, we present some easy fragments of Hyper2 LTL for which the model-
checking problem is decidable. Here we write ∃∗ X (resp. ∀∗ X) for some sequence
of existentially (resp. universally) quantified second-order variables and ∃∗ π
(resp. ∀∗ π) for some sequence of existentially (resp. universally) quantified
first-order variables. For example, ∃∗X∀∗π captures all formulas of the form
∃X1, . . . , Xn. ∀π1, . . . , πm. ψ where ψ is quantifier-free.
We refer the reader to the full version [11] for detailed proofs.
In this section, we point to existing logics that can naturally be encoded within
our second-order hyperlogics Hyper2 LTL and Hyper2 LTLfp .
LTLK extends LTL with the knowledge operator K. For some subset of agents
A, the formula KA ψ holds in timestep i, if ψ holds on all traces equivalent to
some agent in A up to timestep i. See full version [11] for detailed semantics.
LTLK and HyperCTL∗ have incomparable expressiveness [16] but the knowledge
operator K can be encoded by either adding a linear past operator [16] or by
adding propositional quantification (as in HyperQPTL) [45].
Using Hyper2 LTLfp we can encode LTLK,C , featuring the knowledge operator
K and the common knowledge operator C (which requires that ψ holds on the
closure set of equivalent traces, up to the current timepoint) [41]. Note that
LTLK,C is not encodable by only adding propositional quantification or the linear
past operator.
Proposition 6. For every LTLK,C formula ϕ there exists a Hyper2 LTLfp for-
mula ϕ′ such that for any system T we have T ⊨LTLK,C ϕ iff T ⊨ ϕ′.
Proof (Sketch). We follow the intuition discussed in Sect. 3.3. For each occur-
rence of a knowledge operator in {K, C}, we use a fresh trace variable to keep
track of the points in time with respect to which we need to compare traces.
We then use this trace variable to introduce a second-order set that collects all
equivalent traces (by the observations of one agent, or the closure of all agents'
observations). We then inductively construct a Hyper2 LTLfp formula that cap-
tures all the knowledge and common-knowledge sets, over which we check the
properties at hand. See the full version for more details [11].
Proposition 7. For any AHLTL formula ϕ there exists a Hyper2 LTLfp formula
ϕ′ such that for any system T we have T ⊨AHLTL ϕ iff T ⊨ ϕ′.
∀π1 ∈ Xi. ∀π2 ∈ A.
    ((π1 =AP π2) U ((π1 =AP π2) ∧ ⋀_{a∈AP} □(aπ1 ↔ ○aπ2))) → π2 ∈ Xi
The formula asserts that the set of traces bound to Xi is closed under stuttering,
i.e., if we start from any trace in Xi and stutter it once (at some arbitrary
position), we again end up in Xi. Using the formulas ϕi, we then construct a
Hyper2 LTLfp formula ϕ′ that is equivalent to ϕ as follows:

ϕ′ := Q1 π1 ∈ S, . . . , Qn πn ∈ S. (X1, ⪯, π1 ∈ X1 ∧ ϕ1) · · · (Xn, ⪯, πn ∈ Xn ∧ ϕn).
        ∃π′1 ∈ X1, . . . , ∃π′n ∈ Xn. ψ[π′1/π1, . . . , π′n/πn]

We first mimic the quantification in ϕ and, for each trace πi, construct a least
set Xi that contains πi and is closed under stuttering (thus describing exactly
the set of all stutterings of πi). Finally, we assert that there are traces π′1, . . . , π′n
with π′i ∈ Xi (so π′i is a stuttering of πi) such that π′1, . . . , π′n satisfy ψ. It is easy
to see that T ⊨AHLTL ϕ iff T ⊨ ϕ′ holds for all systems.
Hyper2 LTLfp captures all properties expressible in AHLTL. In particular, our
approximate model-checking algorithm for Hyper2 LTLfp (cf. Sect. 5) is applicable
to AHLTL, even for instances where no approximate solutions were previously
known. In Sect. 6, we show that our prototype model checker for Hyper2 LTLfp
can verify asynchronous properties in practice.
is the block of first-order quantifiers that sits between the quantification of Yj−1
and Yj. Here X_{l_j+1}, . . . , X_{l_{j+1}} ∈ {S, A, Y1, . . . , Yj−1} are second-order variables
that are quantified before γj. In particular, π1, . . . , π_{l_j} are the first-order variables
quantified before Yj.
We consider a fragment of Hyper2 LTLfp which we call the least fixpoint frag-
ment. Within this fragment, we restrict the formulas ϕ_1^con, . . . , ϕ_k^con such that
Y1, . . . , Yk can be approximated as (least) fixpoints. Concretely, we say that ϕ
is in the least fixpoint fragment of Hyper2 LTLfp if, for all j ∈ [k], ϕ_j^con is a
conjunction of formulas of the form
First, we focus on first-order quantification, and assume that we are given a con-
crete assignment for each second-order variable as fixed automata B_{Y_1}, . . . , B_{Y_k}.
¹ Note that in this case lj < i: if trace πi is resolved on Yj (i.e., Xi = Yj), then Yj
must be quantified before πi, so there are at most i − 1 traces quantified before Yj.
Algorithm 1
verify(ϕ, T):
  let ϕ = γ1 (Y1, ⪯, ϕ_1^con) · · · γk (Yk, ⪯, ϕ_k^con) γk+1. ψ,
      where γi = Q_{l_i+1} π_{l_i+1} ∈ X_{l_i+1} · · · Q_{l_{i+1}} π_{l_{i+1}} ∈ X_{l_{i+1}}
  let N = 0
  let A_T = systemToNBA(T)
  repeat
    // start outside-in traversal on second-order variables
    let B = [ S → (A_T, A_T), A → (A_{Σ^ω}, A_{Σ^ω}) ]
    for j from 1 to k do
      B_j^l := underApprox((Yj, ⪯, ϕ_j^con), B, N)
      B_j^u := overApprox((Yj, ⪯, ϕ_j^con), B, N)
      B(Yj) := (B_j^l, B_j^u)
    // start inside-out traversal on first-order variables
    let A_{l_{k+1}+1} = LTLtoNBA(ψ)
    for i from l_{k+1} down to 1 do
      let (C^l, C^u) = B(Xi)
      if Qi = ∃ then
        A_i := eProduct(A_{i+1}, C^l)
      else
        A_i := uProduct(A_{i+1}, C^u)
    if L(A_1) ≠ ∅ then
      return SAT
    else
      N := N + 1
² This effectively poses the assumption that the step formula specifies a safety prop-
erty, which seems to be the case for almost all examples. For instance, common
knowledge induces a safety property: in each step, we add all traces for which there
exists some trace that agrees on all propositions observed by that agent.
Should this approximation not be precise enough, the first-order model checking
(Sect. 5.3) returns some concrete counterexample, i.e., some trace contained in
the invariant but violating the property, which we use to provide more coun-
terexamples to the learner.
Muddy Children. The muddy children puzzle [30] is one of the classic exam-
ples in the common knowledge literature. The puzzle consists of n children standing
such that each child can see all other children's faces. Of the n children, an
unknown number k ≥ 1 have a muddy forehead, and in incremental rounds, the
children should step forward if they know whether their face is muddy or not. Consider
the scenario of n = 2 and k = 1, so child a sees that child b has a muddy forehead
and child b sees that a is clean. In this case, b immediately steps forward, as it
knows that its forehead is muddy since k ≥ 1. In the next step, a knows that its
face is clean since b stepped forward in round 1. In general, one can prove that
all children step forward in round k, deriving common knowledge.
For each n, we construct a transition system Tn that encodes the muddy chil-
dren scenario with n children. For every m, we design a Hyper2 LTLfp formula
ϕm that adds to the common knowledge set X all traces that appear indistin-
guishable in the first m steps for some child. We then specify that all traces in
X should agree on all inputs, asserting that all inputs are common knowledge.⁴
We used HySO to fully automatically check Tn against ϕm for varying values of n
and m, i.e., we checked whether, after the first m steps, the inputs of all children are
common knowledge. As expected, the above property holds only if m ≥ n (in the
worst case, where all children are dirty (k = n), the inputs of all children only
become common knowledge after n steps). We depict the results in Table 1a.
Table 1. In Table 1a, we check common knowledge in the muddy children puzzle for
n children and m rounds. We give the result (✓ if common knowledge holds and ✗ if it
does not) and the running time. In Table 1b, we check synchronous and asynchronous
versions of observational determinism. We depict the number of iterations needed and
the running time. Times are given in seconds.
HyperQPTL formulas. HySO can check such properties precisely, i.e., it consti-
tutes a sound-and-complete model checker for HyperQPTL properties with an
arbitrary quantifier prefix. The synchronous version of observational determin-
ism is a HyperLTL property and thus needs no second-order approximation (we
set the method column to “-” in these cases).
Table 2. In Table 2a, we check common knowledge in the example from Fig. 1 when
starting with aⁿdω for varying values of n. We depict the number of refinement iter-
ations, the result, and the running time. In Table 2b, we verify various properties on
Mazurkiewicz traces. We depict whether the property could be verified or refuted by
iteration or automata learning, the result, and the time. Times are given in seconds.
Using HySO, we verify a selection of such trace properties, which often require
non-trivial reasoning by coming up with a suitable invariant. We depict the
results in Table 2b. In our preliminary experiments, we model a situation where
we start with {a}¹{}ω and can swap the letters {a} and {}. We then, e.g., ask whether
on any trace in the resulting Mazurkiewicz trace, a holds at most once, which
requires inductive invariants and cannot be established by iteration.
7 Related Work
In recent years, many logics for the formal specification of hyperproperties
have been developed, extending temporal logics with explicit path quantification
(examples include HyperLTL, HyperCTL∗ [20], HyperQPTL [10,45], HyperPDL
[38], and HyperATL∗ [5,9]); or extending first and second-order logics with an
equal level predicate [25,33]. Others study (ω)-regular [14,37] and context-free
hyperproperties [35]; or discuss hyperproperties over data and modulo theo-
ries [24,31]. Hyper2 LTL is the first temporal logic that reasons about second-
order hyperproperties, which allows us to capture many existing (epistemic, asyn-
chronous, etc.) hyperlogics while at the same time taking advantage of model-
checking solutions that have proven successful in first-order settings.
8 Conclusion
Hyperproperties play an increasingly important role in many areas of computer
science. There is a strong need for specification languages and verification meth-
ods that reason about hyperproperties in a uniform and general manner, similar
to what is standard for more traditional notions of safety and reliability. In
this paper, we have ventured forward from the first-order reasoning of logics
like HyperLTL into the realm of second-order hyperproperties, i.e., properties
that not only compare individual traces but reason comprehensively about sets
of such traces. With Hyper2 LTL, we have introduced a natural specification
language and a general model-checking approach for second-order hyperprop-
erties. Hyper2 LTL provides a general framework for a wide range of relevant
hyperproperties, including common knowledge and asynchronous hyperproper-
ties, which could previously only be studied with specialized logics and algo-
rithms. Hyper2 LTL also provides a starting point for future work on second-
order hyperproperties in areas such as cyber-physical [44] and probabilistic sys-
tems [28].
References
1. Alur, R., Henzinger, T.A.: A really temporal logic. J. ACM 41(1) (1994). https://
doi.org/10.1145/174644.174651
2. Angluin, D.: Learning regular sets from queries and counterexamples. Inf. Comput.
75(2) (1987). https://doi.org/10.1016/0890-5401(87)90052-6
3. Baumeister, J., Coenen, N., Bonakdarpour, B., Finkbeiner, B., Sánchez, C.: A
temporal logic for asynchronous hyperproperties. In: Silva, A., Leino, K.R.M. (eds.)
CAV 2021. LNCS, vol. 12759, pp. 694–717. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-81685-8_33
4. Beutner, R., Carral, D., Finkbeiner, B., Hofmann, J., Krötzsch, M.: Deciding
hyperproperties combined with functional specifications. In: Annual ACM/IEEE
Symposium on Logic in Computer Science, LICS 2022. ACM (2022). https://doi.org/10.1145/3531130.3533369
5. Beutner, R., Finkbeiner, B.: A temporal logic for strategic hyperproperties. In:
International Conference on Concurrency Theory, CONCUR 2021. LIPIcs, vol.
203. Schloss Dagstuhl (2021). https://doi.org/10.4230/LIPIcs.CONCUR.2021.24
6. Beutner, R., Finkbeiner, B.: Prophecy variables for hyperproperty verification.
In: IEEE Computer Security Foundations Symposium, CSF 2022. IEEE (2022).
https://doi.org/10.1109/CSF54842.2022.9919658
7. Beutner, R., Finkbeiner, B.: Software verification of hyperproperties beyond k-
safety. In: International Conference on Computer Aided Verification, CAV 2022.
LNCS, vol. 13371. Springer (2022). https://doi.org/10.1007/978-3-031-13185-1_17
8. Beutner, R., Finkbeiner, B.: AutoHyper: Explicit-state model checking for Hyper-
LTL. In: International Conference on Tools and Algorithms for the Construction
and Analysis of Systems, TACAS 2023, vol. 13993. Springer (2023). https://doi.org/10.1007/978-3-031-30823-9_8
9. Beutner, R., Finkbeiner, B.: HyperATL∗: A logic for hyperproperties in multi-agent
systems. Log. Methods Comput. Sci. (2023)
10. Beutner, R., Finkbeiner, B.: Model checking omega-regular hyperproperties with
AutoHyperQ. In: International Conference on Logic for Programming, Artificial
Intelligence and Reasoning, LPAR 2023. EPiC Series in Computing, EasyChair
(2023)
11. Beutner, R., Finkbeiner, B., Frenkel, H., Metzger, N.: Second-order hyperprop-
erties. CoRR abs/2305.17935 (2023). https://doi.org/10.48550/arXiv.2305.17935
12. Boigelot, B., Legay, A., Wolper, P.: Iterating transducers in the large. In: Hunt,
W.A., Somenzi, F. (eds.) CAV 2003. LNCS, vol. 2725, pp. 223–235. Springer, Hei-
delberg (2003). https://doi.org/10.1007/978-3-540-45069-6_24
13. Boigelot, B., Legay, A., Wolper, P.: Omega-regular model checking. In: Jensen,
K., Podelski, A. (eds.) TACAS 2004. LNCS, vol. 2988, pp. 561–575. Springer,
Heidelberg (2004). https://doi.org/10.1007/978-3-540-24730-2_41
14. Bonakdarpour, B., Sheinvald, S.: Finite-word hyperlanguages. In: Leporati, A.,
Martín-Vide, C., Shapira, D., Zandron, C. (eds.) LATA 2021. LNCS, vol. 12638, pp.
173–186. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-68195-1_17
15. Bouajjani, A., Jonsson, B., Nilsson, M., Touili, T.: Regular model checking. In:
Emerson, E.A., Sistla, A.P. (eds.) CAV 2000. LNCS, vol. 1855, pp. 403–418.
Springer, Heidelberg (2000). https://doi.org/10.1007/10722167_31
16. Bozzelli, L., Maubert, B., Pinchinat, S.: Unifying hyper and epistemic temporal
logics. In: Pitts, A. (ed.) FoSSaCS 2015. LNCS, vol. 9034, pp. 167–182. Springer,
Heidelberg (2015). https://doi.org/10.1007/978-3-662-46678-0_11
17. Bozzelli, L., Peron, A., Sánchez, C.: Asynchronous extensions of HyperLTL. In:
Annual ACM/IEEE Symposium on Logic in Computer Science, LICS 2021. IEEE
(2021). https://doi.org/10.1109/LICS52264.2021.9470583
18. Büchi, J.R.: On a decision method in restricted second-order arithmetic. In: Studies
in Logic and the Foundations of Mathematics, vol. 44. Elsevier (1966)
19. Chen, Y., Hong, C., Lin, A.W., Rümmer, P.: Learning to prove safety over param-
eterised concurrent systems. In: Formal Methods in Computer Aided Design,
FMCAD 2017. IEEE (2017). https://doi.org/10.23919/FMCAD.2017.8102244
20. Clarkson, M.R., Finkbeiner, B., Koleini, M., Micinski, K.K., Rabe, M.N., Sánchez,
C.: Temporal logics for hyperproperties. In: Abadi, M., Kremer, S. (eds.) POST
2014. LNCS, vol. 8414, pp. 265–284. Springer, Heidelberg (2014). https://doi.org/10.1007/978-3-642-54792-8_15
21. Clarkson, M.R., Schneider, F.B.: Hyperproperties. J. Comput. Secur. 18(6) (2010).
https://doi.org/10.3233/JCS-2009-0393
22. Coenen, N., et al.: Explaining hyperproperty violations. In: International Con-
ference on Computer Aided Verification, CAV 2022. LNCS, vol. 13371. Springer
(2022). https://doi.org/10.1007/978-3-031-13185-1_20
23. Coenen, N., Finkbeiner, B., Frenkel, H., Hahn, C., Metzger, N., Siber, J.: Tem-
poral causality in reactive systems. In: International Symposium on Automated
Technology for Verification and Analysis, ATVA 2022. LNCS, vol. 13505. Springer
(2022). https://doi.org/10.1007/978-3-031-19992-9_13
24. Coenen, N., Finkbeiner, B., Hofmann, J., Tillman, J.: Smart contract synthesis
modulo hyperproperties. To appear at the 36th IEEE Computer Security Founda-
tions Symposium (CSF 2023) (2023)
25. Coenen, N., Finkbeiner, B., Sánchez, C., Tentrup, L.: Verifying hyperliveness. In:
Dillig, I., Tasiran, S. (eds.) CAV 2019. LNCS, vol. 11561, pp. 121–139. Springer,
Cham (2019). https://doi.org/10.1007/978-3-030-25540-4_7
26. Dams, D., Lakhnech, Y., Steffen, M.: Iterating transducers. In: Berry, G., Comon,
H., Finkel, A. (eds.) CAV 2001. LNCS, vol. 2102, pp. 286–297. Springer, Heidelberg
(2001). https://doi.org/10.1007/3-540-44585-4_27
27. Diekert, V., Rozenberg, G. (eds.): The Book of Traces. World Scientific (1995).
https://doi.org/10.1142/2563
28. Dimitrova, R., Finkbeiner, B., Torfah, H.: Probabilistic hyperproperties of Markov
decision processes. In: Hung, D.V., Sokolsky, O. (eds.) ATVA 2020. LNCS, vol.
12302, pp. 484–500. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-59152-6_27
29. Duret-Lutz, A., et al.: From Spot 2.0 to Spot 2.10: What’s new? In: International
Conference on Computer Aided Verification, CAV 2022. LNCS, vol. 13372. Springer
(2022). https://doi.org/10.1007/978-3-031-13188-2_9
30. Fagin, R., Halpern, J.Y., Moses, Y., Vardi, M.Y.: Reasoning About Knowledge.
MIT Press (1995). https://doi.org/10.7551/mitpress/5803.001.0001
31. Finkbeiner, B., Frenkel, H., Hofmann, J., Lohse, J.: Automata-based software
model checking of hyperproperties. In: Rozier, K.Y., Chaudhuri, S. (eds.) NASA
Formal Methods, 15th International Symposium, NFM 2023, Houston, TX, USA,
16–18 May 2023, Proceedings. LNCS, vol. 13903. Springer (2023). https://doi.org/10.1007/978-3-031-33170-1_22
32. Finkbeiner, B., Rabe, M.N., Sánchez, C.: Algorithms for model checking Hyper-
LTL and HyperCTL∗. In: Kroening, D., Păsăreanu, C.S. (eds.) CAV 2015. LNCS,
vol. 9206, pp. 30–48. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-21690-4_3
33. Finkbeiner, B., Zimmermann, M.: The first-order logic of hyperproperties. In: Sym-
posium on Theoretical Aspects of Computer Science, STACS 2017. LIPIcs, vol. 66.
Schloss Dagstuhl (2017). https://doi.org/10.4230/LIPIcs.STACS.2017.30
34. Fortin, M., Kuijer, L.B., Totzke, P., Zimmermann, M.: HyperLTL satisfiability is
Σ₁¹-complete, HyperCTL∗ satisfiability is Σ₂¹-complete. In: International Sympo-
sium on Mathematical Foundations of Computer Science, MFCS 2021. LIPIcs, vol.
202. Schloss Dagstuhl (2021). https://doi.org/10.4230/LIPIcs.MFCS.2021.47
35. Frenkel, H., Sheinvald, S.: Realizable and context-free hyperlanguages. In: Ganty,
P., Monica, D.D. (eds.) Proceedings of the 13th International Symposium on
Games, Automata, Logics and Formal Verification, GandALF 2022, Madrid, Spain,
21–23 September 2022. EPTCS, vol. 370, pp. 114–130 (2022). https://doi.org/10.4204/EPTCS.370.8
36. Gammie, P., van der Meyden, R.: MCK: model checking the logic of knowledge.
In: Alur, R., Peled, D.A. (eds.) CAV 2004. LNCS, vol. 3114, pp. 479–483. Springer,
Heidelberg (2004). https://doi.org/10.1007/978-3-540-27813-9_41
37. Goudsmid, O., Grumberg, O., Sheinvald, S.: Compositional model checking for
multi-properties. In: Henglein, F., Shoham, S., Vizel, Y. (eds.) VMCAI 2021.
LNCS, vol. 12597, pp. 55–80. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-67067-2_4
38. Gutsfeld, J.O., Müller-Olm, M., Ohrem, C.: Propositional dynamic logic for
hyperproperties. In: International Conference on Concurrency Theory, CONCUR
2020. LIPIcs, vol. 171. Schloss Dagstuhl (2020). https://doi.org/10.4230/LIPIcs.
CONCUR.2020.50
39. Gutsfeld, J.O., Müller-Olm, M., Ohrem, C.: Automata and fixpoints for asyn-
chronous hyperproperties. Proc. ACM Program. Lang. 5(POPL) (2021). https://
doi.org/10.1145/3434319
40. Halpern, J.Y., Moses, Y.: Knowledge and common knowledge in a distributed
environment. J. ACM 37(3), 549–587 (1990)
41. van der Hoek, W., Wooldridge, M.: Model checking knowledge and time. In:
Bošnački, D., Leue, S. (eds.) SPIN 2002. LNCS, vol. 2318, pp. 95–111. Springer,
Heidelberg (2002). https://doi.org/10.1007/3-540-46017-9_9
42. Lomuscio, A., Qu, H., Raimondi, F.: MCMAS: an open-source model checker for
the verification of multi-agent systems. Int. J. Softw. Tools Technol. Transfer 19(1),
9–30 (2015). https://doi.org/10.1007/s10009-015-0378-x
43. van der Meyden, R.: Common knowledge and update in finite environments. Inf.
Comput. 140(2) (1998). https://doi.org/10.1006/inco.1997.2679
44. Nguyen, L.V., Kapinski, J., Jin, X., Deshmukh, J.V., Johnson, T.T.: Hyperprop-
erties of real-valued signals. In: ACM-IEEE International Conference on For-
mal Methods and Models for System Design, MEMOCODE 2017. ACM (2017).
https://doi.org/10.1145/3127041.3127058
45. Rabe, M.N.: A temporal logic approach to information-flow control. Ph.D. thesis,
Saarland University (2016)
46. Sistla, A.P.: Theoretical issues in the design and verification of distributed systems.
Ph.D. thesis, Harvard University (1983)
47. Tarski, A.: A lattice-theoretical fixpoint theorem and its applications. Pac. J. Math. 5(2), 285–309 (1955)
48. Winskel, G.: The Formal Semantics of Programming Languages: An Introduction. Foundations of Computing. MIT Press (1993)
Neural Networks and Machine Learning
Certifying the Fairness of KNN
in the Presence of Dataset Bias
1 Introduction
This work was partially funded by the U.S. National Science Foundation grants CNS-
1702824, CNS-1813117 and CCF-2220345.
Fig. 1. FairKNN: our method for certifying fairness of KNNs with label bias.
form of individual fairness that has been studied in the fairness literature [14];
it requires that the classification output remain the same for input x even if the
historical bias were not present in the training dataset T. However, this is a
challenging problem and, to the best of our knowledge, techniques for solving it
efficiently are still severely lacking. Our work aims to fill this gap.
Specifically, we are concerned with three variants of the fairness definition.
Let the input x = x1 , . . . , xD be a D-dimensional input vector, and P be the
subset of vector indices corresponding to the protected attributes (e.g., race,
gender, etc.). The first variant of the fairness definition is individual fairness,
which requires that similar individuals are treated similarly by the machine
learning model. For example, if two individual inputs x and x′ differ only in some
protected attribute x_i, where i ∈ P, but agree on all the other attributes, the
classification output must be the same. The second variant is ε-fairness, which
extends the notion of individual fairness to include inputs whose unprotected
attributes differ, as long as the difference is bounded by a small constant (ε). In
other words, if two individual inputs are almost the same in all unprotected
attributes, they should also have the same classification output. The third variant
is label-flipping fairness, which requires the aforementioned fairness requirements
to be satisfied even if a biased dataset T has been used to train the model in
the first place. That is, as long as the number of mislabeled elements in T is
bounded by n, the classification output must be the same.
We want to certify the fairness of the classification output for a popular
supervised learning technique called the k-nearest neighbors (KNN) algorithm.
Our interest in KNN comes from the fact that, unlike many other machine
learning techniques, KNN is a model-less technique and thus does not have
the high cost associated with training the model. Because of this reason, KNN
has been widely adopted in real-world applications [1,4,16,18,23,29,36,45,46].
However, obtaining a fairness certification for KNN is still challenging: in
practice, the most straightforward approach of enumerating all possible scenarios
and then checking whether the classification outputs obtained in these scenarios
agree would be prohibitively expensive.
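To make the cost of this baseline concrete, the following sketch (in Python, with hypothetical helpers knn_predict and protected_variants, and binary labels assumed) enumerates every combination of at most n label flips and every protected-attribute variant of x, and checks whether all resulting KNN outputs agree; the number of flipped datasets grows combinatorially in n and |T|, which is why this approach does not scale.

from itertools import combinations

def naive_certify(train, x, K, n, knn_predict, protected_variants):
    # All inputs that differ from x only in protected attributes.
    inputs = [x] + protected_variants(x)
    # All de-biased datasets obtained by flipping at most n labels.
    datasets = [train]
    for m in range(1, n + 1):
        for idx in combinations(range(len(train)), m):
            flipped = list(train)
            for i in idx:
                xi, yi = flipped[i]
                flipped[i] = (xi, 1 - yi)  # binary labels assumed
            datasets.append(flipped)
    outputs = {knn_predict(d, K, xp) for d in datasets for xp in inputs}
    return len(outputs) == 1  # fair only if every scenario agrees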
To overcome the challenge, we propose an efficient method based on the idea
of abstract interpretation [10]. Our method relies on sound approximations to
analyze the arithmetic computations used by the state-of-the-art KNN algorithm
both accurately and efficiently. Figure 1 shows an overview of our method in the
lower half of this figure, which conducts the analysis in an abstract domain, and
the default KNN algorithm in the upper half, which operates in the concrete
domain. The main difference is that, by staying in the abstract domain, our
method is able to analyze a large set of possible training datasets (derived from
T due to n label-flips) and a potentially-infinite set of inputs (derived from x due
to perturbation) symbolically, as opposed to analyzing a single training dataset
and a single input concretely.
To the best of our knowledge, this is the first method for KNN fairness
certification in the presence of dataset bias. While Meyer et al. [26,27] and
Drews et al. [12] have investigated robustness certification techniques, their
methods target decision trees and linear regression, which are different types
of machine learning models from KNN. Our method also differs from the KNN
data-poisoning robustness verification techniques developed by Jia et al. [20] and
Li et al. [24], which do not focus on fairness at all; for example, they do not
distinguish protected attributes from unprotected attributes. Furthermore, Jia et
al. [20] consider the prediction step only while ignoring the learning step, and
Li et al. [24] do not consider label flipping. Our method, in contrast, considers
all of these cases.
We have implemented our method and demonstrated its effectiveness
through experimental evaluation. We used six popular datasets from the
fairness research literature as benchmarks. Our evaluation results show that
the proposed method is efficient in analyzing complex arithmetic computations
used in the state-of-the-art KNN algorithm, and is accurate enough to obtain
fairness certifications for a large number of test inputs. To better understand
the impact of historical bias, we also compared the fairness certification success
rates across different demographic groups.
To summarize, this paper makes the following contributions:
The remainder of this paper is organized as follows. We first present the tech-
nical background in Sect. 2 and then give an overview of our method in Sect. 3.
Next, we present our detailed algorithms for certifying the KNN prediction step
in Sect. 4 and certifying the KNN learning step in Sect. 5. This is followed by
our experimental results in Sect. 6. We review the related work in Sect. 7 and,
finally, give our conclusion in Sect. 8.
2 Background
Let L be a supervised learning algorithm that takes the training dataset T
as input and returns a learned model M = L(T) as output. The training set
T = {(x, y)} is a set of labeled samples, where each x ∈ X ⊆ R^D has D
real-valued attributes, and y ∈ Y ⊆ N is a class label. The learned model
M : X → Y is a function that returns the classification output y ∈ Y for any
input x ∈ X.
We are concerned with fairness of the classification output M (x) for an individ-
ual input x. Let P be the set of vector indices corresponding to the protected
attributes in x ∈ X . We say that xi is a protected attribute (e.g., race, gender,
etc.) if and only if i ∈ P.
In this case, such inputs x′ form a set. Let Δ_ε(x) be the set of all inputs x′ con-
sidered in the ε-fairness definition. That is, Δ_ε(x) := {x′ | x′_j ≠ x_j for some j ∈
P, |x′_i − x_i| ≤ ε for all i ∉ P}. By requiring M(x) = M(x′) for all x′ ∈ Δ_ε(x),
ε-fairness guarantees that a larger set of individuals similar to x are treated
equally.
Individual fairness can be viewed as a special case of ε-fairness, where ε = 0.
In contrast, when ε > 0, the number of elements in Δ_ε(x) is often large and
sometimes infinite. Therefore, the most straightforward approach of certifying
fairness by enumerating all possible elements in Δ_ε(x) would not work. Instead,
any practical solution would have to rely on abstraction.
Due to historical bias, the training dataset T may contain samples whose
outputs are unfairly labeled. Let the number of such samples be bounded by n.
We assume that there are no additional clues available to help identify the mis-
labeled samples. Without knowing which samples these are, fairness certification
must consider all of the possible scenarios. Each scenario corresponds to a de-
biased dataset, T′, constructed by flipping back the incorrect labels in T. Let
dBias_n(T) = {T′} be the set of these possible de-biased (clean) datasets. Ideally,
we want all of them to lead to the same classification output.
Given the tuple ⟨T, P, n, ε, x⟩, where T is the training set, P represents the
protected attributes, n bounds the number of biased elements in T, and ε bounds
the perturbation of x, our method checks if the KNN classification output for x
is fair.
1  func KNN_predict(T, K, x) {
2    Let T_x^K = the K nearest neighbors of x in T;
3    Let Freq(T_x^K) = the most frequent label in T_x^K;
4    return Freq(T_x^K);
5  }
6
7  func KNN_learn(T) {
8    for (each candidate k value) {  // conducting p-fold cross validation
9      Let {G_i} = a partition of T into p groups of roughly equal size;
10     Let err_i^k = {(x, y) ∈ G_i | y ≠ KNN_predict(T \ G_i, k, x)} for each G_i;
11   }
12   Let K = argmin_k (1/p) Σ_{i=1}^{p} |err_i^k| / |G_i|;
13   return K;
14 }

Fig. 2. The KNN algorithm, consisting of the prediction and learning steps.
Inside KNN_predict, the set T_x^K represents the K nearest neighbors of x in
the dataset T, where distance is measured by the Euclidean (or Manhattan) distance
in the input vector space. Freq(T_x^K) is the most frequent label in T_x^K.
Inside KNN_learn, a technique called p-fold cross validation is used to select
the optimal value for K, e.g., from a set of candidate k values in the range
[1, |T| × (p − 1)/p], by minimizing the classification error, as shown in Line 12. This is
accomplished by first partitioning T into p groups of roughly equal size (Line 9),
and then computing err_i^k (the set of misclassified samples from G_i) by treating G_i
as the evaluation set, and T \ G_i as the training set. Here, an input (x, y) ∈ G_i
is “misclassified” if the expected output label, y, differs from the output of
KNN_predict using the candidate k value.
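For concreteness, here is a direct Python transcription of Fig. 2; it is a minimal sketch rather than the authors' implementation, assuming samples are (x, y) pairs with x a tuple of numbers, and partitioning T into p groups by simple striding.

import math
from collections import Counter

def knn_predict(train, K, x):
    # The K nearest neighbors of x in train, by Euclidean distance.
    nn = sorted(train, key=lambda s: math.dist(s[0], x))[:K]
    # Freq: the most frequent label among the K neighbors.
    return Counter(y for _, y in nn).most_common(1)[0][0]

def knn_learn(train, candidates, p=10):
    # p-fold cross validation: pick the K minimizing the mean error rate.
    groups = [train[i::p] for i in range(p)]
    def avg_error(k):
        total = 0.0
        for i, G in enumerate(groups):
            rest = [s for j, g in enumerate(groups) if j != i for s in g]
            miss = sum(1 for x, y in G if knn_predict(rest, k, x) != y)
            total += miss / len(G)
        return total / p
    return min(candidates, key=avg_error)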
In the abstract learning step (Line 3), instead of considering T, our method
considers the set of all clean datasets in dBias_n(T) symbolically, to compute the
set of possible optimal K values, denoted KSet.
In the abstract prediction step (Lines 4–8), for each K, instead of consider-
ing the input x, our method considers all perturbed inputs in Δ_ε(x) and all clean
datasets in dBias_n(T) symbolically, to check if the classification output always
stays the same. Our method returns “certified” only when the classification out-
put always stays the same (Line 9); otherwise, it returns “unknown” (Line 6).
We only perturb numerical attributes in the input x since perturbing cate-
gorical or binary attributes often does not make sense in practice.
In the next two sections, we present our detailed algorithms for abstracting
the prediction step and the learning step, respectively.
Fig. 3. Four cases for computing the upper and lower bounds of the distance function
d_i(δ_i) = (δ_i + A)² for δ_i ∈ [−ε_i, ε_i]. In these figures, δ_i is the x-axis and d_i is the y-axis;
LB denotes LB(d_i), and UB denotes UB(d_i).
Figure 3 shows the plots, which remind us of where the minimum and maximum
values of a quadratic function are attained. There are two versions of the quadratic
function, depending on whether A > 0 (corresponding to the two subfigures at
the top) or A < 0 (corresponding to the two subfigures at the bottom). Each ver-
sion also has two cases, depending on whether the perturbation interval [−ε_i, ε_i]
falls inside the constant interval [−|A|, |A|] (corresponding to the two subfigures
on the left) or falls outside (corresponding to the two subfigures on the right).
Thus, there are four cases in total.
In each case, the maximal and minimal values of the quadratic function are
different, as shown by the LB and UB marks in Fig. 3.
Case (a). This is when (x_i − t_i) > 0 and −ε_i > −(x_i − t_i), which is the same
as saying A > 0 and −ε_i > −A. In this case, the function d_i(δ_i) = (δ_i + A)² is
monotonically increasing w.r.t. variable δ_i ∈ [−ε_i, +ε_i].
Thus, LB(d_i) = (−ε_i + (x_i − t_i))² and UB(d_i) = (+ε_i + (x_i − t_i))².
Case (b). This is when (x_i − t_i) > 0 and −ε_i < −(x_i − t_i), which is the same
as saying A > 0 and −ε_i < −A. In this case, the function is not monotonic.
The minimal value is 0, obtained when δ_i = −A. The maximal value is obtained
when δ_i = +ε_i.
Thus, LB(d_i) = 0 and UB(d_i) = (+ε_i + (x_i − t_i))².
Case (c). This is when (x_i − t_i) < 0 and ε_i < −(x_i − t_i), which is the same as
saying A < 0 and ε_i < −A. In this case, the function is monotonically decreasing
w.r.t. variable δ_i ∈ [−ε_i, ε_i].
Thus, LB(d_i) = (ε_i + (x_i − t_i))² and UB(d_i) = (−ε_i + (x_i − t_i))².
Case (d). This is when (x_i − t_i) < 0 and ε_i > −(x_i − t_i), which is the same
as saying A < 0 and ε_i > −A. In this case, the function is not monotonic. The
minimal value is 0, obtained when δ_i = −A. The maximal value is obtained
when δ_i = −ε_i.
Thus, LB(d_i) = 0 and UB(d_i) = (−ε_i + (x_i − t_i))².
Summary. By combining the above four cases, we compute the bounds of the
entire distance function d as follows:

[ Σ_{i=1}^{D} max(|x_i − t_i| − ε_i, 0)² ,  Σ_{i=1}^{D} (|x_i − t_i| + ε_i)² ]

Here, the take-away message is that, since x_i, t_i and ε_i are all fixed values, the
upper and lower bounds can be computed in constant time, even though there
is a potentially infinite number of inputs in Δ_ε(x).
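The closed form translates directly into code; the sketch below (our own illustration) computes the interval of the squared distance between Δ_ε(x) and a sample t, where eps is the list of per-attribute perturbation bounds ε_i (zero for attributes that are not perturbed):

def distance_bounds(x, t, eps):
    # Interval [LB, UB] of the squared distance between any input
    # in Delta_eps(x) and the sample t, per the four-case analysis.
    lb = sum(max(abs(xi - ti) - ei, 0.0) ** 2
             for xi, ti, ei in zip(x, t, eps))
    ub = sum((abs(xi - ti) + ei) ** 2
             for xi, ti, ei in zip(x, t, eps))
    return lb, ub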
Computing overNN Using Bounds. With the upper and lower bounds of the
distance between Δ_ε(x) and each sample t in the dataset T, denoted [LB(d(x, t)),
UB(d(x, t))], we are ready to compute overNN such that every t ∈ overNN
may be among the K nearest neighbors of Δ_ε(x).
Let UB_Kmin denote the K-th minimum value of UB(d(x, t)) over all t ∈ T.
Then, we define overNN as the set of samples in T whose LB(d(x, t)) is not
greater than UB_Kmin. In other words, overNN := {t ∈ T | LB(d(x, t)) ≤ UB_Kmin}.
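Computing overNN is then a single scan over T, reusing distance_bounds from the sketch above (again an illustration, not the tool's code):

def over_nn(train, x, eps, K):
    # Over-approximate the K nearest neighbors of Delta_eps(x).
    bounds = [(s, *distance_bounds(x, s[0], eps)) for s in train]
    # UB_Kmin: the K-th smallest upper bound over all samples.
    ub_kmin = sorted(ub for _, _, ub in bounds)[K - 1]
    # Keep every sample whose lower bound does not exceed UB_Kmin.
    return [s for s, lb, _ in bounds if lb <= ub_kmin]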
Therefore, to check if our method should return True, meaning the prediction
result is guaranteed to be the same as label y, we only need to compare K − |S|
with #y + 2n. This is checked using the condition in Line 3 of Algorithm 2.
In this section, we present our method for abstracting the learning step, which
computes the optimal K value based on T and the impact of flipping at most n
labels. The output is a superset of the possible optimal K values, denoted KSet.
Algorithm 3 shows our method, which takes the training set T and the parameter
n as input, and returns KSet as output. To be sound, we require KSet to
include any candidate k value that may become the optimal K for some clean
set T′ ∈ dBias_n(T).
In Algorithm 3, our method first computes the lower and upper bounds of
the classification error for each k value, denoted LB_k and UB_k, as shown in
Lines 5–6. Next, it computes minUB, which is the minimal upper bound over all
candidate k values (Line 8). Finally, by comparing minUB with LB_k for each
candidate k value, our method decides whether this candidate k value should be
put into KSet (Line 9).
We will explain the steps needed to compute LB_k and UB_k in the remainder
of this section. For now, assuming that they are available, we explain how they
are used to compute KSet.
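Assuming LB_k and UB_k are available, e.g., as dictionaries keyed by the candidate k values, the pruning step described above (Lines 8–9 of Algorithm 3) amounts to a few lines; this is a sketch of that comparison only:

def compute_kset(lb, ub):
    # minUB: the smallest upper bound among all candidate k values.
    min_ub = min(ub.values())
    # Keep k unless its lower bound provably exceeds minUB.
    return {k for k in lb if lb[k] <= min_ub}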
Soundness Proof. Here we prove that any k ∉ KSet cannot result in the smallest
classification error. Assume that k_s is the candidate k value that has the minimal
upper bound (minUB), and err_{k_s} is its actual classification error. By definition,
we have err_{k_s} ≤ minUB. Meanwhile, for any k ∉ KSet, we have LB_k >
minUB, and hence err_k ≥ LB_k > minUB ≥ err_{k_s}. Here,
err_k > err_{k_s} means that k cannot result in the smallest classification error.
The LP Problem. The question is how to decide whether the set S (defined
in Line 1 of Algorithm 5) exists. We can formulate it as a linear programming
(LP) problem with two constraints. The first one is defined as follows: Let y be
the expected label, and let l_i ≠ y be another label, where i = 1, ..., q and q is the
total number of class labels (e.g., in the above two examples, q = 3). Let #y be
the number of elements in T_x^K that have the label y. Similarly, let #l_i be the
number of elements with the label l_i. Assume that a set S as defined in
Algorithm 5 exists; then every label l_i ≠ y must satisfy

#l_i − #flip_i < #y + Σ_{j=1}^{q} #flip_j ,    (1)

where #flip_i is a variable representing the number of l_i–to–y flips. Thus, in the
above formula, the left-hand side is the count of l_i after flipping, and the right-hand
side is the count of y after flipping. Since y is the most frequent label in S, y
must have a higher count than any other label.
The second constraint is

Σ_{i=1}^{q} #flip_i ≤ n ,    (2)

which says that the total number of label flips is bounded by the parameter n.
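As an illustration, constraints (1) and (2) can be handed to an off-the-shelf LP solver. The sketch below uses SciPy's linprog on the LP relaxation: since all counts are integers, the strict inequality in (1) is encoded as "≤ −1", and dropping integrality can only over-approximate the existence of a flip vector. The function and its argument convention are our own, not the paper's code.

import numpy as np
from scipy.optimize import linprog

def flips_exist(counts, y, n):
    # counts: label -> its count in T_x^K; y: the expected label.
    labels = [l for l in counts if l != y]  # l_1, ..., l_q
    q = len(labels)
    if q == 0:
        return True
    # Constraint (1): #l_i - flip_i < #y + sum_j flip_j, for each i;
    # with integer counts, '<' becomes '<= -1'.
    A = np.full((q, q), -1.0)
    A[np.diag_indices(q)] = -2.0
    b = np.array([counts[y] - counts[l] - 1 for l in labels], float)
    # Constraint (2): sum_i flip_i <= n.
    A = np.vstack([A, np.ones(q)])
    b = np.append(b, n)
    res = linprog(np.zeros(q), A_ub=A, b_ub=b,
                  bounds=[(0, counts[l]) for l in labels])
    return res.status == 0  # status 0: a feasible point was found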
Since the number of class labels (q) is often small (from 2 to 10), this LP
problem can be solved quickly. However, the LP problem must be solved |T |
times, where |T | may be as large as 50,000. To avoid invoking the LP solver
The second condition requires that, in S, the label y has a higher count (after flip-
ping) than any other label, including the label l_p ≠ y with the highest count in
the current T_x^K. The resulting condition is
6 Experiments
Datasets. Table 1 shows the statistics of each dataset, including the name,
a short description, the size (|T|), the number of attributes, the protected
attributes, and the parameters ε and n. The value of ε is set to 1% of the attribute
range. The bias parameter n is set to 1 for small datasets, 10 for medium datasets,
and 50 for large datasets. The protected attributes include Gender for all six
datasets, and Race for two datasets, Compas and Adult, which is consistent
with the known biases in these datasets.
In preparation for the experimental evaluation, we have employed state-of-
the-art techniques in the machine learning literature to preprocess and balance
the datasets for KNN, including encoding, standard scaling, k-bins-discretizer,
downsampling and upweighting.
Table 1. Statistics of all of the datasets used during our experimental evaluation.
Table 2. Results for certifying label-flipping and individual fairness (gender) on small
datasets, for which ground truth can still be obtained by naive enumeration, and com-
pared with our method.
Table 3. Results for certifying label-flipping, individual, and ε-fairness by our method.
Based on the results in Table 2, we conclude that the accuracy of our method
is high (81.9% on average) despite its aggressive use of abstraction to reduce
the computational cost. Our method is also 7.5X to 126X faster than the naive
approach. Furthermore, the larger the dataset, the higher the speedup.
For medium and large datasets, it is infeasible for the naive enumeration
approach to compute and show the ground truth in Table 2. However, the fairness
scores of our method shown in Table 3 provide “lower bounds” for the ground
truth since our method is sound for certification. For example, when our method
reports 95% for Compas (race) in Table 3, it means the ground truth must be
≥95% (and thus the gap must be ≤5%). However, there does not seem to be an
obvious relationship between the gap and the dataset size – the gap may be due
to some unique characteristics of each dataset.
76.2% accurate. Furthermore, the efficiency of our method is high: for Adult,
which has 50,000 samples in the training set, the average certification time of
our method remains within a few seconds.
Table 4. Results for certifying label-flipping + ε-fairness with both Race and Gender
as protected attributes.
7 Related Work
For fairness certification, as explained earlier in this paper, our method is the
first method for certifying KNN in the presence of historical (dataset) bias.
While there are other KNN certification and falsification techniques, including
Jia et al. [20] and Li et al. [24,25], they focus solely on robustness against data
poisoning attacks, as opposed to individual and ε-fairness against historical bias.
Meyer et al. [26,27] and Drews et al. [12] propose certification techniques that
handle dataset bias, but target different machine learning techniques (decision
tree or linear regression); furthermore, they do not handle ε-fairness.
Throughout this paper, we have assumed that the KNN learning (parameter-
tuning) step is not tampered with or subjected to fairness violation. However,
since the only impact of tampering with the KNN learning step will be changing
the optimal value of the parameter K, the biased KNN learning step can be
modeled using a properly over-approximated KSet. With this new KSet, our
method for certifying fairness of the prediction result (as presented in Sect. 4)
will work as is.
Our method aims to certify fairness with certainty. In contrast, there are
statistical techniques that can be used to prove that a system is fair or robust
with a high probability. Such techniques have been applied to various machine
learning models, for example, in VeriFair [6] and FairSquare [2]. However, they
are typically applied to the prediction step while ignoring the learning step,
although the learning step may be affected by dataset bias.
There are also techniques for mitigating bias in machine learning systems.
Some focus on improving the learning algorithms using random smoothing [33],
better embedding [7] or fair representation [34], while others rely on formal
methods such as iterative constraint solving [38]. There are also techniques for
repairing models to improve fairness [3]. Except for Ruoss et al. [34], most of
them focus on group fairness such as demographic parity and equal opportunity;
they are significantly different from our focus on certifying individual and -
fairness of the classification results in the presence of dataset bias.
At a high level, our method that leverages a sound over-approximate analysis
to certify fairness can be viewed as an instance of the abstract interpretation
paradigm [10]. Abstract interpretation based techniques have been successfully
used in many other settings, including verification of deep neural networks [17,
30], concurrent software [21,22,37], and cryptographic software [43,44].
Since fairness is a type of non-functional property, the verifica-
tion/certification techniques are often significantly different from techniques used
to verify/certify functional correctness. Instead, they are more closely related to
techniques for verifying/certifying robustness [8], noninterference [5], and side-
channel security [19,39,40,48], where a program is executed multiple times, each
time for a different input drawn from a large (and sometimes infinite) set, to see
if they all agree on the output. At a high level, this is closely related to differen-
tial verification [28,31,32], synthesis of relational invariants [41] and verification
of hyper-properties [15,35].
8 Conclusions
We have presented a method for certifying the individual and ε-fairness of the
classification output of the KNN algorithm, under the assumption that the train-
ing dataset may have historical bias. Our method relies on abstract interpreta-
tion to soundly approximate the arithmetic computations in the learning and
prediction steps. Our experimental evaluation shows that the method is efficient
in handling popular datasets from the fairness research literature and accurate
enough in obtaining certifications for a large amount of test data. While this
paper focuses on KNN only, as a future work, we plan to extend our method to
other machine learning models.
References
1. Adeniyi, D.A., Wei, Z., Yongquan, Y.: Automated web usage data mining and
recommendation system using k-nearest neighbor (KNN) classification method.
Appl. Comput. Inf. 12(1), 90–108 (2016)
2. Albarghouthi, A., D’Antoni, L., Drews, S., Nori, A.V.: FairSquare: probabilistic
verification of program fairness. Proc. ACM Program. Lang. 1(OOPSLA), 1–30
(2017)
3. Albarghouthi, A., D’Antoni, L., Drews, S.: Repairing decision-making programs
under uncertainty. In: Majumdar, R., Kunčak, V. (eds.) CAV 2017. LNCS, vol.
10426, pp. 181–200. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-63387-9_9
4. Andersson, M., Tran, L.: Predicting movie ratings using KNN (2020)
5. Barthe, G., D’Argenio, P.R., Rezk, T.: Secure information flow by self-composition.
In: IEEE Computer Security Foundations Workshop, pp. 100–114 (2004)
6. Bastani, O., Zhang, X., Solar-Lezama, A.: Probabilistic verification of fairness
properties via concentration. Proc. ACM Program. Lang. 1(OOPSLA), 1–27
(2019)
7. Bolukbasi, T., Chang, K.W., Zou, J.Y., Saligrama, V., Kalai, A.T.: Man is to
computer programmer as woman is to homemaker? Debiasing word embeddings.
In: Annual Conference on Neural Information Processing Systems, vol. 29 (2016)
8. Chaudhuri, S., Gulwani, S., Lublinerman, R.: Continuity and robustness of pro-
grams. Commun. ACM 55(8), 107–115 (2012)
9. Cortez, P., Silva, A.M.G.: Using data mining to predict secondary school student
performance. EUROSIS-ETI (2008)
10. Cousot, P., Cousot, R.: Abstract interpretation: a unified lattice model for static
analysis of programs by construction or approximation of fixpoints. In: ACM Sym-
posium on Principles of Programming Languages, pp. 238–252 (1977)
11. Dieterich, W., Mendoza, C., Brennan, T.: COMPAS risk scales: demonstrating
accuracy equity and predictive parity. Northpointe Inc (2016)
12. Drews, S., Albarghouthi, A., D’Antoni, L.: Proving data-poisoning robustness in
decision trees. In: ACM SIGPLAN International Conference on Programming Lan-
guage Design and Implementation, pp. 1083–1097 (2020)
13. Dua, D., Graff, C.: UCI machine learning repository (2017). http://archive.ics.uci.
edu/ml
14. Dwork, C., Hardt, M., Pitassi, T., Reingold, O., Zemel, R.S.: Fairness through
awareness. In: Innovations in Theoretical Computer Science, pp. 214–226 (2012)
15. Finkbeiner, B., Haas, L., Torfah, H.: Canonical representations of k-safety hyper-
properties. In: IEEE Computer Security Foundations Symposium, pp. 17–31 (2019)
16. Firdausi, I., Erwin, A., Nugroho, A.S., et al.: Analysis of machine learning tech-
niques used in behavior-based malware detection. In: 2010 Second International
Conference on Advances in Computing, Control, and Telecommunication Tech-
nologies, pp. 201–203. IEEE (2010)
17. Gehr, T., Mirman, M., Drachsler-Cohen, D., Tsankov, P., Chaudhuri, S., Vechev,
M.T.: AI2: safety and robustness certification of neural networks with abstract
interpretation. In: IEEE Symposium on Security and Privacy, pp. 3–18 (2018)
18. Guo, G., Wang, H., Bell, D., Bi, Y., Greer, K.: KNN model-based approach in
classification. In: Meersman, R., Tari, Z., Schmidt, D.C. (eds.) OTM 2003. LNCS,
vol. 2888, pp. 986–996. Springer, Heidelberg (2003). https://doi.org/10.1007/978-3-540-39964-3_62
19. Guo, S., Wu, M., Wang, C.: Adversarial symbolic execution for detecting
concurrency-related cache timing leaks. In: ACM Joint Meeting on European Soft-
ware Engineering Conference and Symposium on the Foundations of Software Engi-
neering, pp. 377–388 (2018)
20. Jia, J., Liu, Y., Cao, X., Gong, N.Z.: Certified robustness of nearest neighbors
against data poisoning and backdoor attacks. In: The AAAI Conference on Artifi-
cial Intelligence (2022)
21. Kusano, M., Wang, C.: Flow-sensitive composition of thread-modular abstract
interpretation. In: ACM SIGSOFT International Symposium on Foundations of
Software Engineering, pp. 799–809 (2016)
22. Kusano, M., Wang, C.: Thread-modular static analysis for relaxed memory mod-
els. In: ACM Joint Meeting on European Software Engineering Conference and
Symposium on Foundations of Software Engineering, pp. 337–348 (2017)
23. Li, Y., Fang, B., Guo, L., Chen, Y.: Network anomaly detection based on TCM-
KNN algorithm. In: ACM Symposium on Information, Computer and Communi-
cations Security, pp. 13–19 (2007)
24. Li, Y., Wang, J., Wang, C.: Proving robustness of KNN against adversarial data
poisoning. In: International Conference on Formal Methods in Computer-Aided
Design, pp. 7–16 (2022)
25. Li, Y., Wang, J., Wang, C.: Systematic testing of the data-poisoning robustness
of KNN. In: ACM SIGSOFT International Symposium on Software Testing and
Analysis (2023)
26. Meyer, A.P., Albarghouthi, A., D’Antoni, L.: Certifying robustness to pro-
grammable data bias in decision trees. In: Annual Conference on Neural Infor-
mation Processing Systems, pp. 26276–26288 (2021)
27. Meyer, A.P., Albarghouthi, A., D’Antoni, L.: Certifying data-bias robustness in
linear regression. CoRR abs/2206.03575 (2022)
28. Mohammadinejad, S., Paulsen, B., Deshmukh, J.V., Wang, C.: DiffRNN: differ-
ential verification of recurrent neural networks. In: International Conference on
Formal Modeling and Analysis of Timed Systems, pp. 117–134 (2021)
29. Narudin, F.A., Feizollah, A., Anuar, N.B., Gani, A.: Evaluation of machine learning
classifiers for mobile malware detection. Soft. Comput. 20(1), 343–357 (2016)
30. Paulsen, B., Wang, C.: Example guided synthesis of linear approximations for
neural network verification. In: International Conference on Computer Aided Ver-
ification, pp. 149–170 (2022)
31. Paulsen, B., Wang, J., Wang, C.: ReluDiff: differential verification of deep neu-
ral networks. In: International Conference on Software Engineering, pp. 714–726
(2020)
32. Paulsen, B., Wang, J., Wang, J., Wang, C.: NEURODIFF: scalable differential
verification of neural networks using fine-grained approximation. In: International
Conference on Automated Software Engineering, pp. 784–796 (2020)
33. Rosenfeld, E., Winston, E., Ravikumar, P., Kolter, J.Z.: Certified robustness to
label-flipping attacks via randomized smoothing. In: International Conference on
Machine Learning, vol. 119, pp. 8230–8241 (2020)
34. Ruoss, A., Balunovic, M., Fischer, M., Vechev, M.T.: Learning certified individu-
ally fair representations. In: Annual Conference on Neural Information Processing
Systems (2020)
35. Sousa, M., Dillig, I.: Cartesian hoare logic for verifying k-safety properties. In: ACM
SIGPLAN Conference on Programming Language Design and Implementation, pp.
57–69 (2016)
36. Su, M.Y.: Real-time anomaly detection systems for denial-of-service attacks by
weighted k-nearest-neighbor classifiers. Expert Syst. Appl. 38(4), 3492–3498 (2011)
37. Sung, C., Kusano, M., Wang, C.: Modular verification of interrupt-driven software.
In: International Conference on Automated Software Engineering, pp. 206–216
(2017)
38. Wang, J., Li, Y., Wang, C.: Synthesizing fair decision trees via iterative constraint
solving. In: Shoham, S., Vizel, Y. (eds.) International Conference on Computer
Aided Verification, pp. 364–385. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-13188-2_18
39. Wang, J., Sung, C., Raghothaman, M., Wang, C.: Data-driven synthesis of provably
sound side channel analyses. In: International Conference on Software Engineering,
pp. 810–822 (2021)
40. Wang, J., Sung, C., Wang, C.: Mitigating power side channels during compila-
tion. In: ACM Joint Meeting on European Software Engineering Conference and
Symposium on the Foundations of Software Engineering, pp. 590–601 (2019)
41. Wang, J., Wang, C.: Learning to synthesize relational invariants. In: International
Conference on Automated Software Engineering, pp. 65:1–65:12 (2022)
42. Weisberg, S.: Applied Linear Regression, p. 194. Wiley (1985)
43. Wu, M., Guo, S., Schaumont, P., Wang, C.: Eliminating timing side-channel leaks
using program repair. In: ACM SIGSOFT International Symposium on Software
Testing and Analysis, pp. 15–26 (2018)
44. Wu, M., Wang, C.: Abstract interpretation under speculative execution. In: ACM
SIGPLAN Conference on Programming Language Design and Implementation, pp.
802–815 (2019)
45. Wu, W., Zhang, W., Yang, Y., Wang, Q.: DREX: developer recommendation with
k-nearest-neighbor search and expertise ranking. In: Asia-Pacific Software Engi-
neering Conference, pp. 389–396 (2011)
46. Xie, M., Hu, J., Han, S., Chen, H.H.: Scalable hypergrid K-NN-based online
anomaly detection in wireless sensor networks. IEEE Trans. Parallel Distrib. Syst.
24(8), 1661–1670 (2012)
47. Yeh, I.C., Lien, C.H.: The comparisons of data mining techniques for the predictive
accuracy of probability of default of credit card clients. Expert Syst. Appl. 36(2),
2473–2480 (2009)
48. Zhang, J., Gao, P., Song, F., Wang, C.: SCInfer: refinement-based verification of
software countermeasures against side-channel attacks. In: International Confer-
ence on Computer Aided Verification, pp. 157–177 (2018)
Monitoring Algorithmic Fairness
1 Introduction
This work is supported by the European Research Council under Grant No.:
ERC-2020-AdG101020093.
of humans, like gender, ethnicity, etc. However, they have often shown biases in
their decisions in the past [20,47,55,57,58]. While there are many approaches
for mitigating biases before deployment [20,47,55,57,58], recent runtime verifi-
cation approaches [3,34] offer a new complementary tool to oversee algorithmic
fairness in AI and machine-learned decision makers during deployment.
To verify algorithmic fairness at runtime, the given decision-maker is treated
as a generator of events with an unknown model. The goal is to algorithmically
design lightweight but rigorous runtime monitors against quantitative formal
specifications. The monitors observe a long stream of events and, after each
observation, output a quantitative, statistically sound estimate of how fair or
biased the generator was until that point in time. While the existing approaches
[3,34] considered only sequential decision making models and built monitors
from the frequentist viewpoint in statistics, we allow the richer class of Markov
chain models and present monitors from both the frequentist and the Bayesian
statistical viewpoints.
Monitoring algorithmic fairness involves on-the-fly statistical estimations, a
feature that has not been well-explored in the traditional runtime verification
literature. As far as the algorithmic fairness literature is concerned, the existing
works are mostly model-based, and either minimize decision biases of machine-
learned systems at design-time (i.e., pre-processing) [11,41,65,66], or verify their
absence at inspection-time (i.e., post-processing) [32]. In contrast, we verify algo-
rithmic fairness at runtime, and do not require an explicit model of the gener-
ator. On one hand, the model-independence makes the monitors trustworthy,
and on the other hand, it complements the existing model-based static analyses
and design techniques, which are often insufficient due to partially unknown or
imprecise models of systems in real-world environments.
We assume that the sequences of events generated by the generator can
be modeled as sequences of states visited by a finite unknown Markov chain.
This implies that the generator is well-behaved and the events follow each other
according to some fixed probability distributions. Not only is this assumption
satisfied by many machine-learned systems (see Sect. 1.1 for examples), it also
provides just enough structure to lay the bare-bones foundations for runtime
verification of algorithmic fairness properties. We emphasize that we do not
require knowledge of the transition probabilities of the underlying Markov chain.
We propose a new specification language, called the Probabilistic Specifica-
tion Expressions (PSEs), which can formalize a majority of the existing algo-
rithmic fairness properties in the literature, including demographic parity [21],
equal opportunity [32], disparate impact [25], etc. Let Q be the set of events.
Syntactically, a PSE is a restricted arithmetic expression over the (unknown)
transition probabilities of a Markov chain with the state space Q. Semantically,
a PSE ϕ over Q is a function that maps every Markov chain M with the state
space Q to a real number, and the value ϕ(M ) represents the degree of fairness
or bias (with respect to ϕ) in the generator M . Our monitors observe a long
sequence of events from Q, and after each observation, compute a statistically
rigorous estimate of ϕ(M ) with a PAC-style error bound for a given confidence
level. As the observed sequence gets longer, the error bound gets tighter.
Algorithmic fairness properties that are expressible using PSEs are quan-
titative refinements of the traditional qualitative fairness properties studied in
formal methods. For example, a qualitative fairness property may require that
if a certain event A occurs infinitely often, then another event B should follow
infinitely often. In particular, a coin is qualitatively fair if infinitely many coin
tosses contain both infinitely many heads and infinitely many tails. In contrast,
the coin will be algorithmically fair (i.e., unbiased) if approximately half of the
tosses come up heads. Technically, while qualitative weak and strong fairness
properties are ω-regular, the algorithmic fairness properties are statistical and
require counting. Moreover, for a qualitative fairness property, the satisfaction or
violation cannot be established based on a finite prefix of the observed sequence.
In contrast, for any given finite prefix of observations, the value of an algorith-
mic fairness property can be estimated using statistical techniques, assuming the
future behaves statistically like the past (the Markov assumption).
As our main contribution, we present two different monitoring algorithms,
using tools from frequentist and Bayesian statistics, respectively. The central
idea of the frequentist monitor is that the probability of every transition of the
monitored Markov chain M can be estimated using the fraction of times the
transition is taken per visit to its source vertex. Building on this, we present a
practical implementation of the frequentist monitor that can estimate the value
of a given PSE from an observed finite sequence of states. For the coin example,
after every new toss, the frequentist monitor will update its estimate of proba-
bility of seeing heads by computing the fraction of times the coin came up heads
so far, and then by using concentration bounds to find a tight error bound for
a given confidence level. On the other hand, the central idea of the Bayesian
monitor is that we begin with a prior belief about the transition probabilities of
M , and having seen a finite sequence of observations, we can obtain an updated
posterior belief about M . For a given confidence level, the output of the monitor
is computed by applying concentration inequalities to find a tight error bound
around the mean of the posterior belief. For the coin example, the Bayesian
monitor will begin with a prior belief about the degree of fairness, and, after
observing the outcome of each new toss, will compute a new posterior belief.
If the prior belief agrees with the true model with a high probability, then the
Bayesian monitor’s output converges to the true value of the PSE more quickly
than the frequentist monitor. In general, both monitors can efficiently estimate
more complicated PSEs, such as the ratio and the squared difference of the
probabilities of heads of two different coins. The choice of the monitor for a par-
ticular application depends on whether an objective or a subjective evaluation,
with respect to a given prior, is desired.
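As a toy illustration of the frequentist idea on the coin example (a minimal sketch under our own simplifications, not the paper's monitor for general PSEs), the following monitor reports, after each toss, the empirical frequency of heads together with a Hoeffding-style error bound that holds with probability at least 1 − δ:

import math

class CoinMonitor:
    def __init__(self, delta=0.05):
        self.delta, self.heads, self.n = delta, 0, 0

    def observe(self, is_head):
        self.n += 1
        self.heads += int(is_head)
        est = self.heads / self.n
        # Hoeffding bound: |est - p| <= err with prob. >= 1 - delta.
        err = math.sqrt(math.log(2 / self.delta) / (2 * self.n))
        return est, err  # p lies in [est - err, est + err] w.h.p.

The returned interval tightens at a rate of O(1/sqrt(n)), matching the earlier statement that the error bound gets tighter as the observed sequence gets longer.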
Both frequentist and Bayesian monitors use registers (and counters as a
restricted class of registers) to keep counts of the relevant events and store the
intermediate results. If the size of the given PSE is n, then, in theory, the fre-
quentist monitor uses O(n⁴2ⁿ) registers and computes its output in O(n⁴2ⁿ)
time after each new observation, whereas the Bayesian monitor uses O(n²2ⁿ)
registers and computes its output in O(n²2ⁿ) time after each new observation.
The computation time and the required number of registers get drastically
reduced to O(n²) for the frequentist monitor with PSEs that contain up to one
division operator, and for the Bayesian monitor with polynomial PSEs (possibly
having negative exponents in the monomials). This shows that under given cir-
cumstances, one or the other type of the monitor can be favorable computation-
wise. These special, efficient cases cover many algorithmic fairness properties of
interest, such as demographic parity and equal opportunity.
Our experiments confirm that our monitors are fast in practice. Using a
prototype implementation in Rust, we monitored a couple of decision-making
systems adapted from the literature. In particular, we monitor if a bank is fair
in lending money to applicants from different demographic groups [48], and if
a college is fair in admitting students without creating an unreasonable finan-
cial burden on the society [54]. In our experiments, both monitors took, on an
average, less than a millisecond to update their verdicts after each observation,
and only used tens of internal registers to operate, thereby demonstrating their
practical usability at runtime.
In short, we advocate that runtime verification introduces a new set of tools in
the area of algorithmic fairness, using which we can monitor biases of deployed AI
and machine-learned systems in real-time. While existing monitoring approaches
only support sequential decision making problems and use only the frequentist
statistical viewpoint, we present monitors for the more general class of Markov
chain system models using both frequentist and Bayesian statistical viewpoints.
All proofs can be found in the longer version of the paper [33].
We first present two real-world examples from the algorithmic fairness literature
to motivate the problem; these examples will later be used to illustrate the
technical developments.
The Lending Problem [48]: Suppose a bank lends money to individuals based
on certain attributes, like credit score, age group, etc. The bank wants to max-
imize profit by lending money to only those who will repay the loan in time—
called the “true individuals.” There is a sensitive attribute (e.g., ethnicity) clas-
sifying the population into two groups g and ḡ. The bank will be considered fair
(in lending money) if its lending policy is independent of an individual’s mem-
bership in g or ḡ. Several group fairness metrics from the literature are relevant
in this context. Disparate impact [25] quantifies the ratio of the probability of
an individual from g getting the loan to the probability of an individual from ḡ
getting the loan, which should be close to 1 for the bank to be considered fair.
Demographic parity [21] quantifies the difference between the probability of an
individual from g getting the loan and the probability of an individual from ḡ
getting the loan, which should be close to 0 for the bank to be considered fair.
Equal opportunity [32] quantifies the difference between the probability of a true
individual from g getting the loan and the probability of a true individual from
ḡ getting the loan, which should be close to 0 for the bank to be considered fair.
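In code, these three metrics are simple functions of four probabilities, which are exactly the kind of unknown quantities the paper's monitors estimate at runtime. The sketch below (our own illustration) evaluates the definitions on given estimates, where p_g and p_gbar are the loan probabilities for groups g and ḡ, and p_g_true, p_gbar_true are the same probabilities restricted to true individuals:

def group_fairness_metrics(p_g, p_gbar, p_g_true, p_gbar_true):
    return {
        "disparate_impact":   p_g / p_gbar,            # fair if close to 1
        "demographic_parity": p_g - p_gbar,            # fair if close to 0
        "equal_opportunity":  p_g_true - p_gbar_true,  # fair if close to 0
    }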
There has been a plethora of work on algorithmic fairness from the machine
learning standpoint [10,12,21,32,38,42,45,46,52,59,63,66]. In general, these
works improve algorithmic fairness through de-biasing the training dataset (pre-
processing), or through incentivizing the learning algorithm to make fair deci-
sions (in-processing), or through eliminating biases from the output of the
machine-learned model (post-processing). All of these are interventions in the
design of the system, whereas our monitors treat the system as already deployed.
Recently, formal methods-inspired techniques have been used to guarantee
algorithmic fairness through the verification of a learned model [2,9,29,53,61],
and enforcement of robustness [6,30,39]. All of these works verify or enforce
algorithmic fairness statically on all runs of the system with high probability.
This requires certain knowledge about the system model, which may not be
always available. Our runtime monitor dynamically verifies whether the current
run of an opaque system is fair.
Our frequentist monitor is closely related to the novel work of Albarghouthi
et al. [3], where the authors build a programming framework that allows run-
time monitoring of algorithmic fairness properties on programs. Their monitor
evaluates the algorithmic fairness of repeated “single-shot” decisions made by
machine-learned functions on a sequence of samples drawn from an underly-
ing unknown but fixed distribution, which is a special case of our more general
Markov chain model of the generator. They do not consider the Bayesian point
of view. Moreover, we argue and empirically show in Sect. 4 that our frequentist
approach produces significantly tighter statistical estimates than their approach
on most PSEs. On the flip side, their specification language is more expressive,
in that they allow atomic variables for expected values of events, which is useful
for specifying individual fairness criteria [21]. We only consider group fairness,
and leave individual fairness as part of future research. Also, they allow logical
operators (like boolean connectives) in their specification language. However, we
obtain tighter statistical estimates for the core arithmetic part of algorithmic
fairness properties (through PSEs), and point out that we can deal with logical
operators just like they do in a straightforward manner.
Shortly after the first manuscript of this paper was written, we published
a separate work for monitoring long-run fairness in sequential decision making
problems, where the feature distribution of the population may dynamically
change due to the actions of the individuals [34]. Although this other work
generalizes our current paper in some aspects (support for dynamic changes in
the model), it only allows sequential decision making models (instead of Markov
chains) and does not consider the Bayesian monitoring perspective.
There is a large body of research on monitoring, though the considered prop-
erties are mainly temporal [5,7,19,24,40,50,60]. Unfortunately, these techniques
do not directly extend to monitoring algorithmic fairness, since checking algo-
rithmic fairness requires statistical methods, which is beyond the limit of finite
automata-based monitors used by the classical techniques. Although there are
works on quantitative monitoring that use richer types of monitors (with counters/registers, as we do) [28,35,36,56], the considered specifications do not easily
extend to statistical properties like algorithmic fairness. One exception is the
work by Ferrère et al. [26], which monitors certain statistical properties, like
mode and median of a given sequence of events. Firstly, they do not consider
algorithmic fairness properties. Secondly, their monitors’ outputs are correct only
as the length of the observed sequence approaches infinity (asymptotic guaran-
tee), whereas our monitors’ outputs are always correct with high confidence
(finite-sample guarantee), and the precision gets better for longer sequences.
Although our work uses tools similar to those used in statistical verification [1,
4,14,17,64], the goals are different. In traditional statistical verification, the
system's runs are chosen probabilistically, and it is verified whether a run of the
system satisfies a boolean property with a certain probability. For us, the run
is given as input to the monitor, and it is this run that is verified against a
quantitative algorithmic fairness property with statistical error bounds. To the
best of our knowledge, existing works on statistical verification do not consider
algorithmic fairness properties.
2 Preliminaries
For any alphabet Σ, the notation Σ ∗ represents the set of all finite words over
Σ. We write R, N, and N+ to denote the sets of real numbers, natural numbers
(including zero), and positive integers, respectively. For a pair of real (natural)
numbers a, b with a < b, we write [a, b] ([a . . b]) to denote the set of all real
(natural) numbers between and including a and b. For a given c, r ∈ R, we write
[c ± r] to denote the set [c − r, c + r]. For simpler notation, we will use | · | to
denote both the cardinality of a set and the absolute value of a real number,
whenever the intended use is clear.
¹ An alternate commonly used definition of probability distribution is directly in terms of the probability measure induced over S, instead of through the random variable.
Fig. 1. Markov chains for the lending and the college-admission examples. (left) The
lending example: the state init denotes the initiation of the sampling, and the rest
represent the selected individual; namely, g and ḡ denote the two groups, (gy) and (ḡy)
denote that the individual is respectively from group g and group ḡ and the loan was
granted, ȳ denotes that the loan was refused, and z and z̄ denote whether the loan
was repaid or not. (right) The college-admission example: the state init denotes the
initiation of the sampling, the states g, ḡ represent the group identity of the selected
candidate, and the states {0, . . . , N} represent the amount of money invested by a truly
eligible candidate.
Randomized register monitors, or simply monitors, are adapted from the (deter-
ministic) polynomial monitors of Ferrère et al. [27]. Let R be a finite set of integer
variables called registers. A function v : R → N assigning a concrete value to every
register in R is called a valuation of R. Let $N^R$ denote the set of all valuations
of R. Registers can be read and written according to relations in the signature
S = ⟨0, 1, +, −, ×, ÷, ≤⟩. We consider two basic operations on registers: tests,
which are boolean conditions over the registers expressed in the signature S, and
updates, which assign to registers new values computed over S.
We use Φ(R) and Γ(R) to respectively denote the set of tests and updates over
R. Counters are special registers with a restricted signature S = ⟨0, 1, +, −, ≤⟩.
A probabilistic transition of the monitor on an input symbol σ updates the register
valuation from v to v′, written as $v \xrightarrow{\sigma} v'$. A run of A on a word $w_0 \ldots w_n \in \Sigma^*$ is a sequence of concrete
transitions $v_0 \xrightarrow{w_0} v_1 \xrightarrow{w_1} \cdots \xrightarrow{w_n} v_{n+1}$. The probabilistic transitions of A induce
a probability distribution over the sample space of finite runs of the monitor,
denoted P(·). For a given finite word $w \in \Sigma^*$, the semantics of the monitor A
is given by a random variable $[\![A]\!](w) := \lambda(Y)$ inducing the probability measure
$P_A$, where Y is the random variable representing the distribution over the final
state in a run of A on the word w, i.e., $P_A(Y = v) := P(\{r = r_0 \ldots r_m \mid r$ is a run of $A$ on $w$ and $r_m = v\})$.
Example: A Monitor for Detecting the (Unknown) Bias of a Coin. We
present a simple deterministic monitor that computes a PAC estimate of the bias
of an unknown coin from a sequence of toss outcomes, where the outcomes are
denoted as “h” for heads and “t” for tails. The input alphabet is the set of toss
outcomes, i.e., Σ = {h, t}; the output alphabet is the set of all bias intervals,
i.e., Γ = {[a, b] | 0 ≤ a < b ≤ 1}; the set of registers is R = {rn, rh}, where
rn and rh are counters counting the total number of tosses and the number of
heads, respectively; and the output function λ maps every valuation of rn, rh
to an interval estimate of the bias of the form λ ≡ v(rh)/v(rn) ± ε(rn, δ),
where δ ∈ [0, 1] is a given upper bound on the probability of an incorrect estimate
and ε(rn, δ) is the estimation error computed using PAC analysis. For instance,
after observing a sequence of 67 tosses with 36 heads, the values of the registers
will be v(rn) = 67 and v(rh) = 36, and the output of the monitor will be
λ(67, 36) = 36/67 ± ε(67, δ) for an appropriate ε(·). Now, suppose the next
input to the monitor is h, in which case the monitor's transition is given as
T(h, ·) = (rn + 1, rh + 1), updating the registers to the new values v′(rn) =
67 + 1 = 68 and v′(rh) = 36 + 1 = 37. For this example, the tests Φ(R) over the
registers are redundant, but they can be used to construct monitors for more
complex properties.
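To make the PAC estimation concrete, the following minimal Python sketch (ours, not the paper's implementation) realizes this monitor, assuming the standard two-sided Hoeffding error term ε(n, δ) = sqrt(ln(2/δ)/(2n)) for a [0, 1]-valued quantity:

import math

class CoinBiasMonitor:
    """PAC monitor for the bias of an unknown coin (a sketch, not the
    paper's implementation). Registers: r_n counts tosses, r_h heads."""

    def __init__(self, delta: float):
        self.delta = delta  # bound on the probability of a wrong estimate
        self.r_n = 0        # counter register: total number of tosses
        self.r_h = 0        # counter register: number of heads

    def next(self, outcome: str):
        """Consume one input symbol from {'h', 't'} and return the
        (1 - delta)-confidence interval [mean - eps, mean + eps]."""
        self.r_n += 1
        if outcome == 'h':
            self.r_h += 1
        mean = self.r_h / self.r_n
        # Two-sided Hoeffding bound for i.i.d. [0, 1]-valued samples.
        eps = math.sqrt(math.log(2 / self.delta) / (2 * self.r_n))
        return mean - eps, mean + eps

monitor = CoinBiasMonitor(delta=0.05)
for toss in "htthhhtht":
    low, high = monitor.next(toss)
print(f"95% confidence interval for the bias: [{low:.3f}, {high:.3f}]")

As the number of tosses grows, the radius eps shrinks at rate O(1/sqrt(n)), so the interval tightens around the empirical mean.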
To formalize algorithmic fairness properties, like the ones in Sect. 1.1, we intro-
duce probabilistic specification expressions (PSE). A PSE ϕ over a given finite
set Q is an algebraic expression with some restricted set of operations that uses
variables labeled vij with i, j ∈ Q and whose domains are the real interval [0, 1].
The syntax of ϕ is given by the grammar in (1), where $\{v_{ij}\}_{i,j \in Q}$ are the
variables with domain [0, 1] and κ is a constant. The expression ξ in (1a) is
called a monomial² and is simply a product of powers of variables with integer
exponents. A polynomial is a weighted sum of monomials.
Example: Social Burden. Using PSEs, we can express the social burden of
the college admission example described in Sect. 1.1, with the help of the Markov
chain depicted in the right subfigure of Fig. 1:
Social burden [54]: $1 \cdot v_{g1} + \cdots + N \cdot v_{gN}$.
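For intuition, if the transition matrix of the Markov chain in Fig. 1 (right) were known, this PSE could be evaluated directly; the toy sketch below (ours, with made-up probabilities) does exactly that, where v_gk denotes the probability of a truly eligible candidate from group g investing amount k:

# Toy sketch: evaluating the social-burden PSE 1*v_g1 + ... + N*v_gN
# on a known transition matrix. The probabilities below are made up.
N = 3
v_g = {1: 0.5, 2: 0.3, 3: 0.2}  # hypothetical v_gk values
social_burden = sum(k * v_g[k] for k in range(1, N + 1))
print(social_burden)  # expected investment of an eligible g-candidate: 1.7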
Informally, our goal is to build monitors that observe a single long path of a
Markov chain and, after each observation, output a new estimate for the value
of the PSE. Since the monitor’s estimate is based on statistics collected from
² Although monomials and polynomials usually only have positive exponents, we take the liberty of using the terminology even when negative exponents are present.
a finite path, the output may be incorrect with some probability, where the
source of this probability is different between the frequentist and the Bayesian
approaches. In the frequentist approach, the underlying Markov chain is fixed
(but unknown), and the randomness stems from the sampling of the observed
path. In the Bayesian approach, the observed path is fixed, and the randomness
stems from the uncertainty about a prior specifying the Markov chain’s param-
eters. The commonality is that, in both cases, we want our monitors to estimate
the value of the PSE up to an error with a fixed probabilistic confidence.
We formalize the monitoring problem separately for the two approaches. A
problem instance is a triple (Q, ϕ, δ), where Q = [1 . . N ] is a set of states, ϕ is
a PSE over Q, and δ ∈ [0, 1] is a constant. In the frequentist approach, we use
$P_s$ to denote the probability measure induced by the sampling of paths, and in
the Bayesian approach we use $P_\theta$ to denote the probability measure induced by
the prior probability density function $p_\theta : \Delta(N-1)^N \to \mathbb{R} \cup \{\infty\}$ over the
transition matrix of the Markov chain. In both cases, the output alphabets of
the monitors contain every real interval.
Problem 1 (Frequentist monitor). Suppose (Q, ϕ, δ) is a problem instance
given as input. Design a monitor A such that for every Markov chain $\mathcal{M}$ with
transition probability matrix M and for every finite path $\vec{x} \in \mathit{Paths}(\mathcal{M})$:

$$P_{s,A}\big(\varphi(M) \in [\![A]\!](\vec{x})\big) \;\geq\; 1 - \delta, \qquad (2)$$

where $P_{s,A}$ is the joint probability measure of $P_s$ and $P_A$.
Problem 2 (Bayesian monitor). Suppose (Q, ϕ, δ) is a problem instance and
$p_\theta$ is a prior density function, both given as inputs. Design a monitor A such
that for every Markov chain $\mathcal{M}$ with transition probability matrix M and for
every finite path $\vec{x} \in \mathit{Paths}(\mathcal{M})$:

$$P_{\theta,A}\big(\varphi(M) \in [\![A]\!](\vec{x}) \mid \vec{x}\big) \;\geq\; 1 - \delta, \qquad (3)$$

where $P_{\theta,A}$ is the joint probability measure of $P_\theta$ and $P_A$.
Notice that the state space of the Markov chain and the input alphabet of the
monitor are the same, and so we often refer to observed states as (input)
symbols, and vice versa. The estimate $[l, u] = [\![A]\!](\vec{x})$ is called the (1 − δ) · 100%
confidence interval for ϕ(M).³ The radius, given by ε = 0.5 · (u − l), is called the
estimation error, and the quantity 1 − δ is called the confidence. The estimate
gets more precise as the error gets smaller and the confidence gets higher.
In many situations, we are interested in a qualitative question of the form
“is ϕ(M ) ≤ c?” for some constant c. We point out that, once the quantitative
problem is solved, the qualitative questions can be answered using standard
procedures by setting up a hypothesis test [44, p. 380].
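Concretely, a simple decision rule derived from the monitor's interval output looks as follows (a sketch of the standard procedure, not a construction from the paper):

def qualitative_verdict(l: float, u: float, c: float) -> str:
    """Decide 'is phi(M) <= c?' from a (1 - delta)-confidence
    interval [l, u]: with probability at least 1 - delta, phi(M)
    lies in [l, u], so a verdict issued here errs w.p. <= delta."""
    if u <= c:
        return "yes"        # the whole interval is below the threshold
    if l > c:
        return "no"         # the whole interval is above the threshold
    return "inconclusive"   # keep observing; the interval shrinks over time

print(qualitative_verdict(0.02, 0.07, c=0.1))  # -> "yes"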
³ While in the Bayesian setting credible intervals would be more appropriate, we use confidence intervals due to uniformity and the relative ease of computation. To relate the two, our confidence intervals are over-approximations of credible intervals (non-unique) that are centered around the posterior mean.
4 Frequentist Monitoring
Suppose the given PSE is only a single variable ϕ = vij , i.e., we are monitoring
the probability of going from state i to another state j. The frequentist monitor
A for ϕ can be constructed in two steps: (1) empirically compute the average
number of times the edge (i, j) was taken per visit to the state i on the observed
path of the Markov chain, and (2) compute the (1 − δ) · 100% confidence interval
using statistical concentration inequalities.
Now consider a slightly more complex PSE ϕ′ = vij + vik. One approach to
monitor ϕ′, proposed by Albarghouthi et al. [3], would be to first compute the
(1 − δ) · 100% confidence intervals [l1, u1] and [l2, u2] separately for the two
constituent variables vij and vik, respectively. Then, the (1 − 2δ) · 100% confidence
interval for ϕ′ would be given by the sum of the two intervals [l1, u1] and [l2, u2],
i.e., [l1 + l2, u1 + u2]; notice the drop in overall confidence due to the union bound.
The drop in the confidence level and the additional error introduced by the
interval arithmetic accumulate quickly for larger PSEs, making the estimate
unusable. Furthermore, we lose all the advantages of having any dependence
between the terms in the PSE. For instance, by observing that vij and vik
correspond to the mutually exclusive transitions i to j and i to k, we know that
ϕ′(M) is always less than 1, a feature that is lost if we plainly merge the individual
confidence intervals for vij and vik. We overcome these issues by estimating
the value of the PSE as a whole as much as possible. In Fig. 2, we demonstrate
how the ratio between the estimation errors of the two approaches varies as
the number of summands (i.e., n) in the PSE $\varphi = \sum_{i=1}^{n} v_{1i}$ changes; in both
cases we fixed the overall δ to 0.05 (95% confidence). The ratio remains the same
for different observation lengths. Our approach is always at least as accurate as
their approach [3], and is significantly better for larger PSEs.

Fig. 2. Variation of the ratio of the estimation error using the existing approach [3] to the estimation error using our approach, w.r.t. the size of the chosen PSE.
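The following toy simulation (ours, purely for illustration) contrasts the two estimates for ϕ′ = vij + vik: the compositional approach pays both the union bound (δ/2 per variable) and the widening from interval addition, whereas the mutually exclusive transitions out of i can be treated as a single Bernoulli quantity ("next state is j or k") with one Hoeffding interval at the full δ:

import math, random

random.seed(0)
delta, n = 0.05, 10_000
# Simulate n visits to state i of a chain with M[i][j] = 0.3, M[i][k] = 0.2.
nexts = random.choices(["j", "k", "other"], weights=[0.3, 0.2, 0.5], k=n)

def hoeffding_eps(n: int, delta: float) -> float:
    return math.sqrt(math.log(2 / delta) / (2 * n))

# (a) Compositional: two intervals at delta/2 each (union bound), then added.
eps_compositional = 2 * hoeffding_eps(n, delta / 2)  # radii add up

# (b) Whole-expression: one Bernoulli variable "next state is in {j, k}".
eps_whole = hoeffding_eps(n, delta)

mean = sum(x in ("j", "k") for x in nexts) / n
print(f"estimate {mean:.3f}, eps (a) = {eps_compositional:.4f}, "
      f"eps (b) = {eps_whole:.4f}")  # (b) is less than half of (a)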
We first explain the idea for division-free PSEs, i.e., PSEs that do not involve
any division operator; later we extend our approach to the general case.
Division-Free PSEs: In our algorithm, for every variable vij ∈ Vϕ, we introduce
a Bernoulli(Mij) random variable $Y^{ij}$ with the mean Mij unknown to us. We
make an observation $y_p^{ij}$ for every p-th visit to the state i on a run: if j follows
immediately afterwards, then we record $y_p^{ij} = 1$, else we record $y_p^{ij} = 0$. This gives us
a sequence of observations $\vec{y}^{ij} = y_1^{ij}, y_2^{ij}, \ldots$ corresponding to the sequence of
i.i.d. random variables $\vec{Y}^{ij} = Y_1^{ij}, Y_2^{ij}, \ldots$. For instance, for the run 121123 we
obtain $\vec{y}^{12} = 1, 0, 1$ for the variable v12.
Importantly, the monitor does not need to store the exact order in which the
outcomes appeared. Instead, for every vij ∈ Vϕ, we only store the number of
times we have seen the state i and the edge (i, j) in counters ci and cij,
respectively. Observe that $c_i \geq \sum_{v_{ik} \in V_\varphi} c_{ik}$, where the possible
difference accounts for the visits to irrelevant states, denoted by a dummy symbol ⋆.
Given {cik}k, whenever needed, we generate in xi a random reshuffling of the
sequence of states, together with ⋆, seen after the past visits to i. From the
sequence stored in xi, for every vik ∈ Vϕ, we can consistently determine the
value of $y_p^{ik}$ (consistency dictates $y_p^{ik} = 1 \Rightarrow y_p^{ij} = 0$ for $j \neq k$). Moreover,
we reuse space by resetting xi whenever the sequence stored in xi is no longer
needed. It can be shown that the size of every xi can be at most the size of
the expression [33, Proof of Thm. 2]. This random reshuffling of the observation
sequences is the cause of the probabilistic transitions of the frequentist monitor.
Fix a problem instance (Q, ϕ, δ), with the size of ϕ being n. Let ϕ be transformed
into $\varphi^l$ by relabeling duplicate occurrences of vij using distinct labels
$v_{ij}^1, v_{ij}^2, \ldots$. The set of labeled variables in $\varphi^l$ is $V_{\varphi^l}$, and $|V_{\varphi^l}| = O(n)$. Let
$\mathit{SubExpr}(\varphi^l)$ denote the set of all subexpressions of $\varphi^l$, and use $[l_\varphi, u_\varphi]$ to denote the
range of values the expression ϕ can take over all valuations of its variables
in the domain [0, 1]. Let $\mathit{Dep}(\varphi) = \{i \mid \exists j\,.\, v_{ij} \in V_\varphi\}$, and every subexpression
ϕ1 · ϕ2 with $\mathit{Dep}(\varphi_1) \cap \mathit{Dep}(\varphi_2) \neq \emptyset$ is called a dependent multiplication;
for instance, v12 · v13 is a dependent multiplication, since both factors depend
on the visits to state 1, whereas v12 · v34 is not.
The implementation of FreqMonitorDivFree in Algorithm 1 has two main functions.
Init initializes the registers. Next implements the transition function of
the monitor, which attempts to compute a new observation w for $\vec{W}$ (Line 4)
after observing a new input σ′, and if successful it updates the output of the
monitor by invoking the UpdateEst function. In addition to the registers in Init
and Next labeled in the pseudocode, additional registers are used internally.
Algorithm 1. FreqMonitorDivFree
Parameters: Q, ϕ, δ
Output: Λ

1: function Init(σ)
2:    ϕl ← unique labeling of ϕ
3:    for all vij ∈ Vϕ do
4:        cij ← 0                          ▷ # of (i, j)
5:        ci ← 0                           ▷ # of i
6:    n ← 0                                ▷ length of w⃗
7:    σ ← σ                                ▷ prev. symbol
8:    μΛ ← ⊥                               ▷ est. mean
9:    εΛ ← ⊥                               ▷ est. error
10:   ResetX()                             ▷ reset the xi-s
11:   Compute lϕ, uϕ                       ▷ interval arith.

1: function Next(σ′)
2:    cσ ← cσ + 1                          ▷ update counters
3:    cσσ′ ← cσσ′ + 1
4:    w ← Eval(ϕl)
5:    if w ≠ ⊥ then
6:        n ← n + 1
7:        Λ ← UpdateEst(w, n)
8:        ResetX()
9:    σ ← σ′
10:   return Λ

1: function UpdateEst(w, n)
2:    μΛ ← (μΛ · (n − 1) + w)/n
3:    εΛ ← $\sqrt{-\frac{(u_\varphi - l_\varphi)^2}{2n} \cdot \ln\frac{\delta}{2}}$
4:    return [μΛ ± εΛ]

1: function Eval(ϕl)
2:    if rϕl = ⊥ then
3:        if ϕl ≡ ϕl1 + ϕl2 then
4:            rϕl ← Eval(ϕl1) + Eval(ϕl2)
5:        else if ϕl ≡ ϕl1 − ϕl2 then
6:            rϕl ← Eval(ϕl1) − Eval(ϕl2)
7:        else if ϕl ≡ ϕl1 · ϕl2 then
8:            if Dep(Vϕl1) ∩ Dep(Vϕl2) = ∅ then
9:                rϕl ← Eval(ϕl1) · Eval(ϕl2)
10:           else                         ▷ dependent multiplication
11:               for vlij ∈ Vϕl2 with i ∈ Dep(Vϕl1) do
12:                   tlij ← max({tmik | vmik ∈ Vϕl1})
13:                   tlij ← tlij + 1      ▷ make independent
14:               rϕl ← Eval(ϕl1) · Eval(ϕl2)
15:       else if ϕl ≡ vlij then
16:           if xi[tlij + 1] = ⊥ then
17:               ExtractOutcome(xi, tlij + 1)
18:           if xi[tlij + 1] = j ≠ ⊥ then
19:               rϕl ← 1
20:           else
21:               rϕl ← 0
22:       else if ϕl ≡ κ then
23:           rϕl ← κ
24:   return rϕl

1: function ExtractOutcome(xi, t)          ▷ shuffle the symbols seen after i so that |xi| = t
2:    U ← {j ∈ Q | vij ∈ Vϕ}
3:    for p = |xi| + 1, . . . , t do
4:        q ← pick u ∈ U w/ prob. ciu/ci, or ⋆ w/ prob. (ci − Σj cij)/ci
5:        ci ← ci − 1
6:        if q ≠ ⋆ then
7:            ciq ← ciq − 1
8:        xi[|xi| + 1] ← q

1: function ResetX()
2:    for all i ∈ Dep(Vϕ) do
3:        xi ← ∅
4:    for all vlij ∈ Vϕl do
5:        tlij ← 0
Algorithm 2. FrequentistMonitor
Parameters: Q, ϕ, δ
Output: Λ

1: function Init(σ)
2:    ϕa + ϕb/ϕc ← change of form and labeling of ϕ
3:    Aa ← FreqMonitorDivFree(Q, ϕa, δ/3)
4:    Ab ← FreqMonitorDivFree(Q, ϕb, δ/3)
5:    Ac ← FreqMonitorDivFree(Q, ϕc, δ/3)
6:    Aa.Init(σ)
7:    Ab.Init(σ)
8:    Ac.Init(σ)

1: function Next(σ′)
2:    [μa ± εa] ← Aa.Next(σ′)
3:    [μb ± εb] ← Ab.Next(σ′)
4:    [μc ± εc] ← Ac.Next(σ′)
5:    if μa ≠ ⊥ ∧ μb ≠ ⊥ ∧ μc ≠ ⊥ then
6:        [μΛ ± εΛ] ← [μa ± εa] + [μb ± εb]/[μc ± εc]
7:    return [μΛ ± εΛ]
For the special case of ϕ containing at most one division operator (division
by a constant does not count), A requires only O(n²) registers, and takes only
O(n²) time to update its output after receiving a new input symbol.
There is a tradeoff between the estimation error, the confidence, and the
length of the observed sequence of input symbols. For instance, for a fixed con-
fidence, the longer the observed sequence is, the smaller is the estimation error.
The following theorem establishes a lower bound on the length of the sequence
for a given upper bound on the estimation error and a fixed confidence.
The bound follows from the Hoeffding’s inequality, together with the fact
that every dependent multiplication increments the required number of samples
by 1. A similar bound for the general case with division is left open.
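To see the shape of such a bound in the division-free case, inverting the Hoeffding error term used in UpdateEst (a sketch under our reading of Algorithm 1) gives

$$\varepsilon = \sqrt{\frac{(u_\varphi - l_\varphi)^2}{2n} \ln\frac{2}{\delta}} \quad\Longrightarrow\quad n \;\geq\; \frac{(u_\varphi - l_\varphi)^2}{2\varepsilon^2} \ln\frac{2}{\delta},$$

so, for a fixed confidence 1 − δ, the required number of observations grows quadratically as the tolerated error ε shrinks.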
5 Bayesian Monitoring
Fix a problem instance (Q = [1 . . N], ϕ, δ). Let $\mathcal{M} = \Delta(N-1)^N$ be the shorthand
notation for the set of transition probability matrices of the Markov chains with
state space Q. Let $p_\theta : \mathcal{M} \to \mathbb{R} \cup \{\infty\}$ be the prior probability density function
over $\mathcal{M}$, which is assumed to be specified using the matrix beta distribution
(the definition can be found in standard textbooks on Bayesian statistics [37,
pp. 280]). Let $\mathbf{1}$ be a matrix, with its size dependent on the context, whose every
element is 1. We make the following common assumption [31][37, p. 50]:
overall expectation as the weighted sum of expectations of the individual monomials:
$E_\theta(\varphi(M) \mid \vec{x}) = E_\theta(\varphi'(M) \mid \vec{x}) = \sum_l \kappa_l\, E_\theta(\xi_l(M) \mid \vec{x})$. In the following,
we summarize the procedure for estimating $E_\theta(\xi(M) \mid \vec{x})$ for every monomial ξ.
Let ξ be a monomial, and let $\vec{x}ab \in Q^*$ be a sequence of states. We use
dij to store the exponent of the variable vij in the monomial ξ, and define
$d_i := \sum_{j \in [1..N]} d_{ij}$. Also, we record the sets of (i, j)-s and i-s with positive
and negative dij and di entries: $D_i^+ := \{j \mid d_{ij} > 0\}$, $D_i^- := \{j \mid d_{ij} < 0\}$,
$D^+ := \{i \mid d_i > 0\}$, and $D^- := \{i \mid d_i < 0\}$.
For any given word $\vec{w} \in Q^*$, let $c_{ij}(\vec{w})$ denote the number of occurrences of $ij$ in $\vec{w}$,
and let $c_i(\vec{w}) := \sum_{j \in Q} c_{ij}(\vec{w})$. Define $\bar{c}_i(\vec{w}) := c_i(\vec{w}) + \sum_{j \in [1..N]} \theta_{ij}$ and
$\bar{c}_{ij}(\vec{w}) := c_{ij}(\vec{w}) + \theta_{ij}$. Let $H : Q^* \to \mathbb{R}$ be defined as:

$$H(\vec{w}) := \frac{\prod_{i=1}^{N} \prod_{j \in D_i^+} P^{(\bar{c}_{ij}(\vec{w})-1)+|d_{ij}|}_{|d_{ij}|}}{\prod_{i \in D^+} P^{(\bar{c}_i(\vec{w})-1)+|d_i|}_{|d_i|}} \cdot \frac{\prod_{i \in D^-} P^{(\bar{c}_i(\vec{w})-1)}_{|d_i|}}{\prod_{i=1}^{N} \prod_{j \in D_i^-} P^{(\bar{c}_{ij}(\vec{w})-1)}_{|d_{ij}|}}, \qquad (5)$$

where $P^n_k := \frac{n!}{(n-k)!}$ is the number of permutations of $k > 0$ items from $n > 0$
objects, for $k \leq n$, and we use the convention that $\prod_{s \in S} (\cdot) = 1$ for $S = \emptyset$.
Below, in Lemma 1, we establish that $E_\theta(\xi(M) \mid \vec{w}) = H(\vec{w})$, and present an
efficient incremental scheme to compute $E_\theta(\xi(M) \mid \vec{x}ab)$ from $E_\theta(\xi(M) \mid \vec{x}a)$.
Algorithm 3. BayesExpMonitor
Parameters: Q, ϕ = κ1ξ1 + . . . + κpξp, θ
Output: E

1: function Init(σ = 1)
2:    for vij ∈ Vϕ do
3:        c̄ij ← θij
4:        c̄i ← Σ_{j∈[1..N]} θij
5:        mij ← min_{l∈[1..p]} dlij        ▷ cache
6:    active ← false                       ▷ Eq. (6) not yet true
7:    σ ← σ                                ▷ prev. state
8:    E ← ⊥                                ▷ expect. val.

1: function Next(σ′)
2:    c̄σ ← c̄σ + 1                          ▷ update counters
3:    c̄σσ′ ← c̄σσ′ + 1
4:    if active = false then
5:        if ∀vij ∈ Vϕ . c̄ij + mij > 0 then
6:            active ← true                ▷ Eq. (6) is true
7:            for l ∈ [1 . . p] do         ▷ Eq. (5)
8:                hl ← Hl({c̄ij}i,j, {c̄i}i)
9:    else
10:       for l ∈ [1 . . p] do             ▷ Eq. (7)
11:           hl ← hl · ((c̄σσ′ − 1 + dlσσ′)/(c̄σσ′ − 1)) · ((c̄σ − 1)/(c̄σ − 1 + dlσ))
12:   if active = true then
13:       E ← Σ_{l=1}^{p} κl · hl          ▷ overall expect.
14:   σ ← σ′
15:   return E
Condition (6) guarantees that the permutations in (5) are well-defined. The
first equality in (7) follows from Marchal et al. [51], and the rest uses the
conjugacy of the prior. Lemma 1 forms the basis of the efficient update of our
Bayesian monitor. Observe that, on any given path, once (6) holds, it continues
to hold forever. Thus, initially the monitor keeps updating H internally without
outputting anything; once (6) holds, it outputs H from then on.
Algorithm 4. BayesConfIntMonitor
Parameters: Q, ϕ, θ
Output: Λ

1: function Init(σ = 1)
2:    ϕ ← polynomial form of ϕ; ϕ² ← polynomial form of ϕ²
3:    EXP ← BayesExpMonitor(Q, ϕ, θ)
4:    EXP2 ← BayesExpMonitor(Q, ϕ², θ)
5:    EXP.Init(σ)
6:    EXP2.Init(σ)
7:    Λ ← ⊥

1: function Next(σ′)
2:    E ← EXP.Next(σ′)
3:    E2 ← EXP2.Next(σ′)
4:    if E ≠ ⊥ and E2 ≠ ⊥ then
5:        S ← E2 − E²                      ▷ variance
6:        Λ ← [E ± √(S/δ)]                 ▷ Chebyshev
7:    return Λ
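The variance-to-interval step in Algorithm 4 is plain Chebyshev; below is a minimal sketch (ours) of the Next logic, assuming the two expectation monitors already return E = Eθ(ϕ(M) | x⃗) and E2 = Eθ(ϕ²(M) | x⃗):

import math

def bayes_conf_interval(E: float, E2: float, delta: float):
    """Chebyshev interval: for a random variable X with mean E and
    variance S = E2 - E^2, P(|X - E| >= sqrt(S/delta)) <= delta, so
    [E - sqrt(S/delta), E + sqrt(S/delta)] has probability >= 1 - delta."""
    S = E2 - E * E               # posterior variance of phi(M)
    r = math.sqrt(S / delta)     # interval radius
    return E - r, E + r

print(bayes_conf_interval(E=0.30, E2=0.095, delta=0.05))
# variance 0.005 -> radius sqrt(0.1), i.e., roughly (-0.016, 0.616)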
A bound on the convergence speed of the Bayesian monitor is left open. This
would require a bound on the change in variance with respect to the length
of the observed path, which is not known for the general case of PSEs. Note
that the efficient (quadratic) cases are different for the frequentist and Bayesian
monitors, suggesting the use of different monitors for different specifications.
6 Experiments
The frequentist monitors' outputs always contain the ground truth values of the
properties, empirically showing that they are always objectively correct. On the
other hand, the Bayesian monitors' outputs can vary drastically for different
choices of the prior, empirically showing that the correctness of their outputs
is subjective. It may appear misleading that the outputs of the Bayesian
monitors are wrong, as they often do not contain the ground truth values. We
reiterate that, from the Bayesian perspective, the ground truth does not exist;
instead, we only have a probability distribution over the true values that gets
updated after observing the generated sequence of events. The choice of the type
of monitor ultimately depends on the application requirements.
Fig. 3. The plots show the 95% confidence intervals estimated by the monitors over
time, averaged over 10 different sample paths, for the lending with demographic parity
(left), lending with equal opportunity (middle), and the college admission with
social burden (right) problems. The horizontal dotted lines are the ground truth values
of the properties, obtained by analyzing the Markov chains used to model the systems
(unknown to the monitors). The table summarizes various performance metrics.
7 Conclusion
Firstly, we plan to extend the expressiveness of our specification language,
which will allow us to encode individual fairness properties [21]. Secondly, more
liberal assumptions on the system model will be crucial for certain practical
applications. In particular, hidden Markov models, time-inhomogeneous Markov
models, Markov decision processes, etc., are examples of system models with
widespread use in real-world applications. Finally, better error bounds tailored
for specific algorithmic fairness properties can be developed through a deeper
mathematical analysis of the underlying statistics, which will sharpen the con-
servative bounds obtained through off-the-shelf concentration inequalities.
References
1. Agha, G., Palmskog, K.: A survey of statistical model checking. ACM Trans. Model.
Comput. Simul. (TOMACS) 28(1), 1–39 (2018)
2. Albarghouthi, A., D’Antoni, L., Drews, S., Nori, A.V.: Fairsquare: probabilistic
verification of program fairness. Proc. ACM Program. Lang. 1(OOPSLA), 1–30
(2017)
3. Albarghouthi, A., Vinitsky, S.: Fairness-aware programming. In: Proceedings of
the Conference on Fairness, Accountability, and Transparency, pp. 211–219 (2019)
4. Ashok, P., Křetínský, J., Weininger, M.: PAC statistical model checking for Markov
decision processes and stochastic games. In: Dillig, I., Tasiran, S. (eds.) CAV 2019.
LNCS, vol. 11561, pp. 497–519. Springer, Cham (2019). https://doi.org/10.1007/
978-3-030-25540-4_29
5. Baier, C., Haverkort, B., Hermanns, H., Katoen, J.P.: Model-checking algorithms
for continuous-time Markov chains. IEEE Trans. Softw. Eng. 29(6), 524–541
(2003). https://doi.org/10.1109/TSE.2003.1205180
6. Balunovic, M., Ruoss, A., Vechev, M.: Fair normalizing flows. In: International
Conference on Learning Representations (2021)
7. Bartocci, E., et al.: Specification-based monitoring of cyber-physical systems: a
survey on theory, tools and applications. In: Bartocci, E., Falcone, Y. (eds.) Lec-
tures on Runtime Verification. LNCS, vol. 10457, pp. 135–175. Springer, Cham
(2018). https://doi.org/10.1007/978-3-319-75632-5_5
8. Bartocci, E., Falcone, Y.: Lectures on Runtime Verification. Springer, Heidelberg
(2018). https://doi.org/10.1007/978-3-319-75632-5
9. Bastani, O., Zhang, X., Solar-Lezama, A.: Probabilistic verification of fairness prop-
erties via concentration. Proc. ACM Program. Lang. 3(OOPSLA), 1–27 (2019)
10. Bellamy, R.K., et al.: AI Fairness 360: an extensible toolkit for detecting and miti-
gating algorithmic bias. IBM J. Res. Dev. 63(4/5), 4–1 (2019)
11. Berk, R., et al.: A convex framework for fair regression. arXiv preprint
arXiv:1706.02409 (2017)
12. Bird, S., et al.: Fairlearn: a toolkit for assessing and improving fairness in AI.
Microsoft Technical Report MSR-TR-2020-32 (2020)
13. Chouldechova, A.: Fair prediction with disparate impact: a study of bias in recidi-
vism prediction instruments. Big Data 5(2), 153–163 (2017)
14. Clarke, E.M., Zuliani, P.: Statistical model checking for cyber-physical systems. In:
Bultan, T., Hsiung, P.-A. (eds.) ATVA 2011. LNCS, vol. 6996, pp. 1–12. Springer,
Heidelberg (2011). https://doi.org/10.1007/978-3-642-24372-1_1
15. Corbett-Davies, S., Pierson, E., Feller, A., Goel, S., Huq, A.: Algorithmic deci-
sion making and the cost of fairness. In: Proceedings of the 23rd ACM Sigkdd
International Conference on Knowledge Discovery and Data Mining, pp. 797–806
(2017)
16. D’Amour, A., Srinivasan, H., Atwood, J., Baljekar, P., Sculley, D., Halpern, Y.:
Fairness is not static: deeper understanding of long term fairness via simulation
studies. In: Proceedings of the 2020 Conference on Fairness, Accountability, and
Transparency, FAT* 2020, pp. 525–534 (2020)
17. David, A., Du, D., Guldstrand Larsen, K., Legay, A., Mikučionis, M.: Optimizing
control strategy using statistical model checking. In: Brat, G., Rungta, N., Venet,
A. (eds.) NFM 2013. LNCS, vol. 7871, pp. 352–367. Springer, Heidelberg (2013).
https://doi.org/10.1007/978-3-642-38088-4_24
18. Dimitrova, R., Finkbeiner, B., Torfah, H.: Probabilistic hyperproperties of Markov
decision processes (2020). https://doi.org/10.48550/ARXIV.2005.03362, https://
arxiv.org/abs/2005.03362
19. Donzé, A., Maler, O.: Robust satisfaction of temporal logic over real-valued sig-
nals. In: Chatterjee, K., Henzinger, T.A. (eds.) FORMATS 2010. LNCS, vol.
6246, pp. 92–106. Springer, Heidelberg (2010). https://doi.org/10.1007/978-3-642-
15297-9_9
20. Dressel, J., Farid, H.: The accuracy, fairness, and limits of predicting recidivism.
Sci. Adv. 4(1), eaao5580 (2018)
21. Dwork, C., Hardt, M., Pitassi, T., Reingold, O., Zemel, R.: Fairness through aware-
ness. In: Proceedings of the 3rd Innovations in Theoretical Computer Science Con-
ference, pp. 214–226 (2012)
22. Dwork, C., Ilvento, C.: Individual fairness under composition. In: Proceedings of
Fairness, Accountability, Transparency in Machine Learning (2018)
23. Ensign, D., Friedler, S.A., Neville, S., Scheidegger, C., Venkatasubramanian,
S.: Runaway feedback loops in predictive policing. In: Conference on Fairness,
Accountability and Transparency, pp. 160–171. PMLR (2018)
24. Faymonville, P., Finkbeiner, B., Schwenger, M., Torfah, H.: Real-time stream-based
monitoring. arXiv preprint arXiv:1711.03829 (2017)
25. Feldman, M., Friedler, S.A., Moeller, J., Scheidegger, C., Venkatasubramanian,
S.: Certifying and removing disparate impact. In: proceedings of the 21th ACM
SIGKDD International Conference on Knowledge Discovery and Data Mining, pp.
259–268 (2015)
26. Ferrère, T., Henzinger, T.A., Kragl, B.: Monitoring event frequencies. In: Fernán-
dez, M., Muscholl, A. (eds.) 28th EACSL Annual Conference on Computer Science
Logic (CSL 2020). Leibniz International Proceedings in Informatics (LIPIcs), vol.
152, pp. 20:1–20:16. Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, Dagstuhl
(2020). https://doi.org/10.4230/LIPIcs.CSL.2020.20, https://drops.dagstuhl.de/
opus/volltexte/2020/11663
27. Ferrère, T., Henzinger, T.A., Saraç, N.E.: A theory of register monitors. In: Pro-
ceedings of the 33rd Annual ACM/IEEE Symposium on Logic in Computer Sci-
ence, pp. 394–403 (2018)
28. Finkbeiner, B., Sankaranarayanan, S., Sipma, H.: Collecting statistics over runtime
executions. Electron. Notes Theor. Comput. Sci. 70(4), 36–54 (2002)
29. Ghosh, B., Basu, D., Meel, K.S.: Justicia: a stochastic SAT approach to formally
verify fairness. arXiv preprint arXiv:2009.06516 (2020)
30. Ghosh, B., Basu, D., Meel, K.S.: Algorithmic fairness verification with graphical
models. arXiv preprint arXiv:2109.09447 (2021)
31. Gómez-Corral, A., Insua, D.R., Ruggeri, F., Wiper, M.: Bayesian inference of
Markov processes. In: Wiley StatsRef: Statistics Reference Online, pp. 1–15 (2014)
32. Hardt, M., Price, E., Srebro, N.: Equality of opportunity in supervised learning.
Adv. Neural Inf. Process. Syst. 29 (2016)
33. Henzinger, T.A., Karimi, M., Kueffner, K., Mallik, K.: Monitoring algorithmic
fairness. arXiv preprint arXiv:2305.15979 (2023)
34. Henzinger, T.A., Karimi, M., Kueffner, K., Mallik, K.: Runtime monitoring of
dynamic fairness properties. arXiv preprint arXiv:2305.04699 (2023). to appear in
FAccT ’23
35. Henzinger, T.A., Saraç, N.E.: Monitorability under assumptions. In: Deshmukh, J.,
Ničković, D. (eds.) RV 2020. LNCS, vol. 12399, pp. 3–18. Springer, Cham (2020).
https://doi.org/10.1007/978-3-030-60508-7_1
36. Henzinger, T.A., Saraç, N.E.: Quantitative and approximate monitoring. In: 2021
36th Annual ACM/IEEE Symposium on Logic in Computer Science (LICS), pp.
1–14. IEEE (2021)
37. Insua, D., Ruggeri, F., Wiper, M.: Bayesian Analysis of Stochastic Process Models.
John Wiley & Sons, Hoboken (2012)
38. Jagielski, M., et al.: Differentially private fair learning. In: International Conference
on Machine Learning, pp. 3000–3008. PMLR (2019)
39. John, P.G., Vijaykeerthy, D., Saha, D.: Verifying individual fairness in machine
learning models. In: Conference on Uncertainty in Artificial Intelligence, pp. 749–
758. PMLR (2020)
40. Junges, S., Torfah, H., Seshia, S.A.: Runtime monitors for markov decision pro-
cesses. In: Silva, A., Leino, K.R.M. (eds.) CAV 2021. LNCS, vol. 12760, pp. 553–
576. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-81688-9_26
41. Kamiran, F., Calders, T.: Data preprocessing techniques for classification without
discrimination. Knowl. Inf. Syst. 33(1), 1–33 (2012)
42. Kearns, M., Neel, S., Roth, A., Wu, Z.S.: Preventing fairness gerrymandering:
auditing and learning for subgroup fairness. In: International Conference on
Machine Learning, pp. 2564–2572. PMLR (2018)
43. Kleinberg, J., Mullainathan, S., Raghavan, M.: Inherent trade-offs in the fair deter-
mination of risk scores. In: Papadimitriou, C.H. (ed.) 8th Innovations in Theoretical
Computer Science Conference (ITCS 2017). Leibniz International Proceedings in
Informatics (LIPIcs), vol. 67, pp. 43:1–43:23. Schloss Dagstuhl-Leibniz-Zentrum
fuer Informatik, Dagstuhl (2017). https://doi.org/10.4230/LIPIcs.ITCS.2017.43,
http://drops.dagstuhl.de/opus/volltexte/2017/8156
44. Knight, K.: Mathematical Statistics. CRC Press, Boca Raton (1999)
45. Konstantinov, N.H., Lampert, C.: Fairness-aware PAC learning from corrupted data.
J. Mach. Learn. Res. 23 (2022)
46. Kusner, M.J., Loftus, J., Russell, C., Silva, R.: Counterfactual fairness. Adv. Neural
Inf. Process. Syst. 30 (2017)
47. Lahoti, P., Gummadi, K.P., Weikum, G.: iFair: learning individually fair data rep-
resentations for algorithmic decision making. In: 2019 IEEE 35th International
Conference on Data Engineering (ICDE), pp. 1334–1345. IEEE (2019)
48. Liu, L.T., Dean, S., Rolf, E., Simchowitz, M., Hardt, M.: Delayed impact of fair
machine learning. In: International Conference on Machine Learning, pp. 3150–
3158. PMLR (2018)
49. Lum, K., Isaac, W.: To predict and serve? Significance 13(5), 14–19 (2016)
50. Maler, O., Nickovic, D.: Monitoring temporal properties of continuous signals. In:
Lakhnech, Y., Yovine, S. (eds.) FORMATS/FTRTFT -2004. LNCS, vol. 3253, pp.
152–166. Springer, Heidelberg (2004). https://doi.org/10.1007/978-3-540-30206-
3_12
51. Marchal, O., Arbel, J.: On the sub-Gaussianity of the beta and Dirichlet distribu-
tions. Electron. Commun. Probab. 22, 1–14 (2017)
52. Mehrabi, N., Morstatter, F., Saxena, N., Lerman, K., Galstyan, A.: A survey on
bias and fairness in machine learning. ACM Comput. Surv. (CSUR) 54(6), 1–35
(2021)
53. Meyer, A., Albarghouthi, A., D’Antoni, L.: Certifying robustness to programmable
data bias in decision trees. Adv. Neural Inf. Process. Syst. 34, 26276–26288 (2021)
54. Milli, S., Miller, J., Dragan, A.D., Hardt, M.: The social cost of strategic classifi-
cation. In: Proceedings of the Conference on Fairness, Accountability, and Trans-
parency, pp. 230–239 (2019)
55. Obermeyer, Z., Powers, B., Vogeli, C., Mullainathan, S.: Dissecting racial bias in an
algorithm used to manage the health of populations. Science 366(6464), 447–453
(2019)
56. Otop, J., Henzinger, T.A., Chatterjee, K.: Quantitative automata under proba-
bilistic semantics. Logical Methods Comput. Sci. 15 (2019)
57. Scheuerman, M.K., Paul, J.M., Brubaker, J.R.: How computers see gender: an
evaluation of gender classification in commercial facial analysis services. In: Pro-
ceedings of the ACM on Human-Computer Interaction, vol. 3, no. CSCW, pp. 1–33
(2019)
58. Seyyed-Kalantari, L., Liu, G., McDermott, M., Chen, I.Y., Ghassemi, M.: Chex-
clusion: fairness gaps in deep chest x-ray classifiers. In: BIOCOMPUTING 2021:
Proceedings of the Pacific Symposium, pp. 232–243. World Scientific (2020)
59. Sharifi-Malvajerdi, S., Kearns, M., Roth, A.: Average individual fairness: algo-
rithms, generalization and experiments. Adv. Neural Inf. Process. Syst. 32 (2019)
60. Stoller, S.D., Bartocci, E., Seyster, J., Grosu, R., Havelund, K., Smolka, S.A.,
Zadok, E.: Runtime verification with state estimation. In: Khurshid, S., Sen, K.
(eds.) RV 2011. LNCS, vol. 7186, pp. 193–207. Springer, Heidelberg (2012). https://
doi.org/10.1007/978-3-642-29860-8_15
61. Sun, B., Sun, J., Dai, T., Zhang, L.: Probabilistic verification of neural networks
against group fairness. In: Huisman, M., Păsăreanu, C., Zhan, N. (eds.) FM 2021.
LNCS, vol. 13047, pp. 83–102. Springer, Cham (2021). https://doi.org/10.1007/
978-3-030-90870-6_5
62. Wachter, S., Mittelstadt, B., Russell, C.: Bias preservation in machine learning:
the legality of fairness metrics under eu non-discrimination law. W. Va. L. Rev.
123, 735 (2020)
63. Wexler, J., Pushkarna, M., Bolukbasi, T., Wattenberg, M., Viégas, F., Wilson, J.:
The what-if tool: Interactive probing of machine learning models. IEEE Trans. Vis.
Comput. Graph. 26(1), 56–65 (2019)
64. Younes, H.L.S., Simmons, R.G.: Probabilistic verification of discrete event systems
using acceptance sampling. In: Brinksma, E., Larsen, K.G. (eds.) CAV 2002. LNCS,
vol. 2404, pp. 223–235. Springer, Heidelberg (2002). https://doi.org/10.1007/3-
540-45657-0_17
65. Zafar, M.B., Valera, I., Gomez-Rodriguez, M., Gummadi, K.P.: Fairness con-
straints: a flexible approach for fair classification. J. Mach. Learn. Res. 20(1),
2737–2778 (2019)
66. Zemel, R., Wu, Y., Swersky, K., Pitassi, T., Dwork, C.: Learning fair represen-
tations. In: International Conference on Machine Learning, pp. 325–333. PMLR
(2013)
nl2spec: Interactively Translating Unstructured Natural Language to Temporal Logics with Large Language Models

M. Cosler, C. Hahn, D. Mendoza, F. Schmitt, and C. Trippel
1 Introduction
A rigorous formalization of desired system requirements is indispensable when
performing any verification-related task, such as model checking [7], synthesis [6],
or runtime verification [20]. Writing formal specifications, however, is an error-
prone and time-consuming manual task typically reserved for experts in the field.
This paper presents nl2spec, a framework, accompanied by a web-based tool,
to facilitate and automate writing formal specifications (in LTL [34] and similar
temporal logics). The core contribution is a new methodology to decompose
the natural language input into sub-translations by utilizing Large Language
Models (LLMs). The nl2spec framework provides an interface to interactively
add, edit, and delete these sub-translations instead of attempting to grapple with
the entire formalization at once (a feature that is sorely missing in similar work,
e.g., [13,30]).
Figure 1 shows the web-based frontend of nl2spec. As an example, we con-
sider the following system requirement given in natural language: “Globally,
grant 0 and grant 1 do not hold at the same time until it is allowed”. The tool
automatically translates the natural language specification correctly into the
LTL formula G((!((g0 & g1)) U a)). Additionally, the tool generates sub-
translations, such as the pair (“do not hold at the same time”, !(g0 & g1)),
which help in verifying the correctness of the translation.
Consider, however, the following ambiguous example: “a holds until b holds
or always a holds”. Human supervision is needed to resolve the ambiguity on
the operator precedence. This can be easily achieved with nl2spec by adding or
editing a sub-translation using explicit parenthesis (see Sect. 4 for more details
and examples). To capture such (and other types of) ambiguity in a benchmark
data set, we conducted an expert user study specifically asking for challenging
translations of natural language sentences to LTL formulas.
The key insight in the design of nl2spec is that the process of translation
can be decomposed into many sub-translations automatically via LLMs, and
the decomposition into sub-translations allows users to easily resolve ambigu-
ous natural language and erroneous translations through interactively modifying
sub-translations. The central goal of nl2spec is to keep the human supervision
minimal and efficient. To this end, all translations are accompanied by a con-
fidence score. Alternative suggestions for sub-translations can be chosen via a
drop-down menu and misleading sub-translations can be deleted before the next
loop of the translation. We evaluate the end-to-end translation accuracy of our
proposed methodology on the benchmark data set obtained from our expert
user study. Note that nl2spec can be applied to the user’s respective appli-
cation domain to increase the quality of translation. As proof of concept, we
provide additional examples, including an example for STL [31] in the GitHub
repository¹.
¹ The tool is available at GitHub: https://github.com/realChrisHahn2/nl2spec.
nl2spec is agnostic to machine learning models and specific application
domains. We will discuss possible parameterizations and inputs of the tool in
Sect. 3. We discuss our sub-translation methodology in more detail in Sect. 3.2
and introduce an interactive few-shot prompting scheme for LLMs to generate
them. We evaluate the effectiveness of the tool to resolve erroneous formaliza-
tions in Sect. 4 on a data set obtained from conducting an expert user study.
We discuss limitations of the framework and conclude in Sect. 5. For additional
details, please refer to the complete version [8].
Linear-time Temporal Logic (LTL) [34] is a temporal logic that forms the basis
of many practical specification languages, such as the IEEE property specifica-
tion language (PSL) [22], Signal Temporal Logic (STL) [31], or System Verilog
Assertions (SVA) [43]. By focusing on the prototype temporal logic LTL, we
keep the nl2spec framework extendable to specification languages in specific
application domains. LTL extends propositional logic with temporal modalities
U (until) and X (next). There are several derived operators, such as Fϕ ≡ trueUϕ
and Gϕ ≡ ¬F¬ϕ. Fϕ states that ϕ will eventually hold in the future and
Gϕ states that ϕ holds globally. Operators can be nested: GFϕ, for example,
states that ϕ has to occur infinitely often. LTL specifications describe a sys-
tems behavior and its interaction with an environment over time. For exam-
ple given a process 0 and a process 1 and a shared resource, the formula
G(r0 → Fg0 ) ∧ G(r1 → Fg1 ) ∧ G¬(g0 ∧ g1 ) describes that whenever a process
requests (ri ) access to a shared resource it will eventually be granted (gi ). The
subformula G¬(g0 ∧ g1 ) ensures that grants given are mutually exclusive.
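To illustrate the semantics of these operators on a concrete trace, here is a toy recursive evaluator (ours, not part of nl2spec) over finite traces, i.e., with LTLf-style semantics [10]; on infinite traces, G, F, and U behave differently:

def holds(f, trace, t=0):
    """Evaluate a formula at position t of a finite trace (a list of
    sets of atomic propositions). Formulas are nested tuples."""
    op = f[0]
    if op == "ap":   return f[1] in trace[t]
    if op == "not":  return not holds(f[1], trace, t)
    if op == "and":  return holds(f[1], trace, t) and holds(f[2], trace, t)
    if op == "X":    # strong next: fails at the last position
        return t + 1 < len(trace) and holds(f[1], trace, t + 1)
    if op == "F":    return any(holds(f[1], trace, k) for k in range(t, len(trace)))
    if op == "G":    return all(holds(f[1], trace, k) for k in range(t, len(trace)))
    if op == "U":    # f[1] holds until f[2] holds; f[2] must eventually hold
        return any(holds(f[2], trace, k)
                   and all(holds(f[1], trace, m) for m in range(t, k))
                   for k in range(t, len(trace)))
    raise ValueError(op)

# G !(g0 & g1): the grants are never given simultaneously on this trace.
mutex = ("G", ("not", ("and", ("ap", "g0"), ("ap", "g1"))))
print(holds(mutex, [{"g0"}, {"g1"}, set()]))  # True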
Early work in translating natural language to temporal logics focused on
grammar-based approaches that could handle structured natural language [17,
24]. A survey of earlier research before the advent of deep learning is provided
in [4]. Other approaches include an interactive method using SMT solving and
semantic parsing [15], or structured temporal aspects in grounded robotics [45]
and planning [32]. Neural networks have only recently been used to translate
into temporal logics, e.g., by training a model for STL from scratch [21], by
fine-tuning language models [19], or by applying GPT-3 [13,30] in a one-shot
fashion, where [13] outputs a restricted set of Declare templates [33] that
can be translated to a fragment of LTLf [10]. Translating natural language to
LTL has especially been of interest to the robotics community (see [16] for an
overview), where datasets and application domains are, in contrast to our setting,
based on structured natural language. Independently of relying on structured data,
all previous tools lack a detection and interactive resolution of the inherent
ambiguity of natural language, which is the main contribution of our framework.
Related to our approach is recent work [26], where generated code is iteratively
refined to match desired outcomes based on human feedback.
LLMs are large neural networks typically consisting of up to 176 billion parame-
ters. They are pre-trained on massive amounts of data, such as “The Pile” [14].
Examples of LLMs include the GPT [36] and BERT [11] model families, open-
source models, such as T5 [38] and Bloom [39], or commercial models, such as
Codex [5]. LLMs are Transformers [42], the state-of-the-art neural archi-
tecture for natural language processing. Additionally, Transformers have shown
remarkable performance when being applied to classical problems in verification
(e.g., [9,18,25,40]), reasoning (e.g., [28,50]), as well as the auto-formalization [35]
of mathematics and formal specifications (e.g., [19,21,49]).
In language modelling, we model the probability of a sequence of tokens in a
text [41]. The joint probability of tokens in a text is generally expressed as [39]:
$$p(x) = p(x_1, \ldots, x_T) = \prod_{t=1}^{T} p(x_t \mid x_{<t}),$$
where x is the sequence of tokens, xt represents the t-th token, and x<t is the
sequence of tokens preceding xt . We refer to this as an autoregressive language
model that iteratively predicts the probability of the next token. Neural network
approaches to language modelling have superseded classical approaches, such as
n-grams [41]. Especially Transformers [42] were shown to be the most effective
architecture at the time of writing [1,23,36].
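As a toy illustration of this factorization (ours, with a made-up bigram model in which p(x_t | x_{<t}) only depends on the previous token), the joint probability is the product of the next-token conditionals:

# Toy bigram "language model": the conditional depends only on x_{t-1}.
cond = {
    ("<s>", "the"): 0.5, ("the", "cat"): 0.4, ("cat", "sat"): 0.3,
}

def joint_probability(tokens):
    p, prev = 1.0, "<s>"
    for tok in tokens:               # p(x) = prod_t p(x_t | x_{<t})
        p *= cond.get((prev, tok), 0.0)
        prev = tok
    return p

print(joint_probability(["the", "cat", "sat"]))  # 0.5 * 0.4 * 0.3 = 0.06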
While fine-tuning neural models on a specific translation task remains a valid
approach showing also initial success in generalizing to unstructured natural lan-
guage when translating to LTL [19], a common technique to obtain high performance
with a limited amount of labeled data is so-called "few-shot prompting" [3].
The language model is presented with a natural language description of the task,
usually accompanied by a few examples that demonstrate the input-output behavior.
The framework presented in this paper relies on this technique. We describe
the proposed few-shot prompting scheme in detail in Sect. 3.2.
Currently implemented in the framework and used in the expert-user study
are Codex and Bloom, which showed the best performance during testing.
Codex and GPT-3.5-turbo. Codex [5] is a GPT-3 variant that was initially of
up to 12B parameters in size and fine-tuned on code. The initial version of
GPT-3 itself was trained on variations of Common Crawl², WebText-2 [37], two
internet-based book corpora, and Wikipedia [3]. The fine-tuning dataset for the
vanilla version of Codex was collected in May 2020 from 54 million public software
repositories hosted on GitHub, using 159 GB of training data for fine-tuning. For
our experiments, we used the commercial 2022 version of code-davinci-002,
which is likely larger (in the 176B range³) than the vanilla Codex models.
GPT-3.5-turbo is the currently available follow-up model of GPT-3.
[Figure: Overview of the nl2spec framework. The frontend exchanges the natural language input, the temperature, the number of runs, and the sub-translations with confidence scores with the backend; the backend performs ambiguity detection, constructs prompts, and queries the neural models to produce the formal LTL specification and sub-translations.]
The core of the methodology is the decomposition of the natural language input
into sub-translations. We introduce an interactive prompting scheme that gener-
ates sub-translations using the underlying neural model and leverages the sub-
translations to produce the final translation. Algorithm 1 depicts a high-level
overview of the interactive loop. The main idea is to give a human-in-the-loop
the options to add, edit, or delete sub-translations and feed them back into
the language models as “Given translations” in the prompt (see Fig. 3). After
querying a language model M with this prompt F, model-specific parameters P,
and the interactive prompt that is computed in the loop, the model generates
a natural language explanation, a dictionary of sub-translations, and the final
translation. Notably, the model M can be queried multiple times as specified
by the number of runs r, thereby generating multiple possible sub-translations.
The confidence score of each sub-translation is computed as votes over multiple
queries and by default the sub-translation with the highest confidence score is
selected to be used as a given sub-translation in the next iteration. In the fron-
tend, the user may view and select alternative generated sub-translations for
each sub-translation via a drop-down menu (see Fig. 1).
Figure 3 shows a generic prompt that illustrates our methodology. The
prompting scheme consists of three parts: the specification-language-specific
part (lines 1–4), the few-shot examples (lines 5–19), and the interactive prompt
minimal.txt
1 Translate the following natural language sentences into an LTL formula and explain your
2 translation step by step. Remember that X means "next", U means "until", G means
3 "globally", F means "finally", which means GF means "infinitely often". The formula
4 should only contain atomic propositions or operators &, |, ->, <->, X, U, G, F.
5 Natural Language: Globally if a holds then c is true until b. Given translations: {}
6 Explanation: "a holds" from the input translates to the atomic proposition a.
7 "c is true until b" from the input translates to the subformula c U b. "if x then y"
8 translates to an implication x -> y, so "if a holds then c is true until b" translates
9 to an implication a -> c U b. "Globally" from the input translates to the temporal
10 operator G. Explanation dictionary: {"a holds" : "a", "c is true until b" : "c U b",
11 "if a holds then c is true until b" : "a -> c U b", "Globally" : "G"} So the final
12 LTL translation is G a -> c U b.FINISH Natural Language: Every request r is
13 eventually followed by a grant g. Given translations: {} Explanation: "Request r"
14 from the input translates to the atomic proposition r and "grant g" translates to the
15 atomic proposition g. "every" means at every point in time, i.e., globally, "never"
16 means at no point in time, and "eventually" translates to the temporal operator F.
17 "followed by" is the natural language representation of an implication. Explanation
18 dictionary: {"Request r" : "r", "grant g" : "g", "every" : "G", "eventually": "F",
19 "followed by" : "->"} So the final LTL translation is G r -> F g.FINISH
including the natural language and sub-translation inputs (not displayed; given
as input). The specification-language-specific part leverages "chain-of-thought"
prompt engineering to elicit reasoning from large language models [46]. The key
of nl2spec, however, is the setup of the few-shot examples. This minimal prompt
consists of two few-shot examples (lines 5–12 and 12–19). The end of an example
is indicated by the "FINISH" token, which is the stop token for the machine
learning models. A few-shot example in nl2spec consists of the natural language
input (line 5), a dictionary of given translations, i.e., the sub-translations (line
5), an explanation of the translation in natural language (lines 6–10), an explanation
dictionary summarizing the sub-translations, and finally, the final LTL
formula.
This prompting scheme elicits sub-translations from the model, which serve
as a fine-grained explanation of the formalization. Note that sub-translations
provided in the prompt are neither unique nor exhaustive, but provide the con-
text for the language model to generate the correct formalization.
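The following minimal sketch (ours; query_model and its return format are placeholders rather than nl2spec's actual interface) shows how such a prompt can be assembled from the "Given translations" and how confidence scores arise as votes over multiple runs:

import json
from collections import Counter

def build_prompt(fewshot_prompt: str, nl_input: str, given: dict) -> str:
    """Append the interactive part: the user's sentence plus the
    currently confirmed sub-translations as 'Given translations'."""
    return (f"{fewshot_prompt}\nNatural Language: {nl_input} "
            f"Given translations: {json.dumps(given)} Explanation:")

def translate(nl_input, given, fewshot_prompt, query_model, runs=3):
    """Query the model `runs` times; the vote share of each candidate
    sub-translation serves as its confidence score."""
    votes = Counter()
    finals = Counter()
    for _ in range(runs):
        # query_model is a placeholder for an LLM call returning the
        # generated sub-translation dictionary and the final formula.
        subs, final = query_model(build_prompt(fewshot_prompt, nl_input, given))
        votes.update(subs.items())
        finals[final] += 1
    confidences = {pair: n / runs for pair, n in votes.items()}
    return finals.most_common(1)[0][0], confidences

In the tool, the user inspects the scored sub-translations, edits or deletes misleading ones, and re-runs the loop with the updated "Given translations".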
4 Evaluation
In this section, we evaluate our framework and prompting methodology on a data
set obtained by conducting an expert user study. To show the general applica-
bility of this framework, we use the minimal prompt that includes only minimal
domain knowledge of the specification language (see Fig. 3). This prompt has
intentionally been written before conducting the expert user study. We lim-
ited the few-shot examples to two and even provided no few-shot example that
includes “given translations”. We use the minimal prompt to focus the evaluation
on the effectiveness of our interactive sub-translation refinement methodology in
resolving ambiguity and fixing erroneous translations. In practice, one would like
to replace this minimal prompt with domain-specific examples that capture the
underlying distribution as closely as possible. As a proof of concept, we elaborate
on this in the full version [8].
The poor performance of existing methods (cf. Table 1) exemplifies the difficulty
of this data set.
4.2 Results
We evaluated our approach using the minimal prompt (if not otherwise stated),
with the number of runs set to three and a temperature of 0.2.
Table 1. Translation accuracy on the benchmark data set, where B stands for Bloom,
C for Codex, and G for GPT-3.5-turbo.
Among the natural language inputs whose formalizations were fixed with the help
of Codex were: "It is never the case that a and b hold at the same
time.", "Whenever a is enabled, b is enabled three steps later.", "If it is the case
that every a is eventually followed by a b, then c needs to holds infinitely often.",
and "One of the following aps will hold at all instances: a,b,c". This demonstrates
that our sub-translation methodology is a valid approach: improving the quality
of the sub-translations indeed has a positive effect on the quality of the final
formalization. This holds true even when using underperforming neural network
models. Note that no human supervision was needed in this experiment to
improve the formalization quality.
5 Conclusion
We presented nl2spec, a framework for translating unstructured natural lan-
guage to temporal logics. A limitation of this approach is its reliance on compu-
tational resources at inference time. This is a general limitation when applying
deep learning techniques. Both commercial and open-source models, however,
provide easily accessible APIs to their models. Additionally, the quality of initial
translations might be influenced by the amount of training data on logics, code,
or math that the underlying neural models have seen during pre-training.
At the core of nl2spec lies a methodology to decompose the natural language
input into sub-translations, which are mappings of formula fragments to relevant
parts of the natural language input. We introduced an interactive prompting
scheme that queries LLMs for sub-translations, and implemented an interface
for users to interactively add, edit, and delete the sub-translations, which saves
users from manually redrafting the entire formalization to fix erroneous translations.
We conducted a user study, showing that nl2spec can be efficiently used
to interactively formalize unstructured and ambiguous natural language.
References
1. Al-Rfou, R., Choe, D., Constant, N., Guo, M., Jones, L.: Character-level language
modeling with deeper self-attention. In: Proceedings of the AAAI Conference on
Artificial Intelligence, vol. 33, pp. 3159–3166 (2019)
2. Bocklisch, T., Faulkner, J., Pawlowski, N., Nichol, A.: Rasa: open source language
understanding and dialogue management. arXiv preprint arXiv:1712.05181 (2017)
3. Brown, T., et al.: Language models are few-shot learners. Adv. Neural Inf. Process.
Syst. 33, 1877–1901 (2020)
4. Brunello, A., Montanari, A., Reynolds, M.: Synthesis of LTL formulas from natural
language texts: state of the art and research directions. In: 26th International
Symposium on Temporal Representation and Reasoning (TIME 2019). Schloss
Dagstuhl-Leibniz-Zentrum fuer Informatik (2019)
5. Chen, M., et al.: Evaluating large language models trained on code. arXiv preprint
arXiv:2107.03374 (2021)
6. Church, A.: Application of recursive arithmetic to the problem of circuit synthesis.
J. Symb. Logic 28(4) (1963)
7. Clarke, E.M.: Model checking. In: Ramesh, S., Sivakumar, G. (eds.) FSTTCS 1997.
LNCS, vol. 1346, pp. 54–56. Springer, Heidelberg (1997). https://doi.org/10.1007/
BFb0058022
8. Cosler, M., Hahn, C., Mendoza, D., Schmitt, F., Trippel, C.: nl2spec: interactively
translating unstructured natural language to temporal logics with large language
models. arXiv preprint arXiv:2303.04864 (2023)
9. Cosler, M., Schmitt, F., Hahn, C., Finkbeiner, B.: Iterative circuit repair against
formal specifications. In: International Conference on Learning Representations (to
appear) (2023)
10. De Giacomo, G., Vardi, M.Y.: Linear temporal logic and linear dynamic logic
on finite traces. In: IJCAI 2013 Proceedings of the Twenty-Third international
joint conference on Artificial Intelligence, pp. 854–860. Association for Computing
Machinery (2013)
11. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: Bert: pre-training of deep bidirec-
tional transformers for language understanding. arXiv preprint arXiv:1810.04805
(2018)
12. Fuggitti, F.: LTLf2DFA. Zenodo (2019). https://doi.org/10.5281/ZENODO.
3888410, https://zenodo.org/record/3888410
13. Fuggitti, F., Chakraborti, T.: Nl2ltl-a python package for converting natural lan-
guage (nl) instructions to linear temporal logic (ltl) formulas (2023)
14. Gao, L., et al.: The pile: an 800 gb dataset of diverse text for language modeling.
arXiv preprint arXiv:2101.00027 (2020)
15. Gavran, I., Darulova, E., Majumdar, R.: Interactive synthesis of temporal spec-
ifications from examples and natural language. Proc. ACM Program. Lang.
4(OOPSLA), 1–26 (2020)
16. Gopalan, N., Arumugam, D., Wong, L.L., Tellex, S.: Sequence-to-sequence lan-
guage grounding of non-markovian task specifications. In: Robotics: Science and
Systems, vol. 2018 (2018)
17. Grunske, L.: Specification patterns for probabilistic quality properties. In: 2008
ACM/IEEE 30th International Conference on Software Engineering, pp. 31–40.
IEEE (2008)
18. Hahn, C., Schmitt, F., Kreber, J.U., Rabe, M.N., Finkbeiner, B.: Teaching tempo-
ral logics to neural networks. In: International Conference on Learning Represen-
tations (2021)
19. Hahn, C., Schmitt, F., Tillman, J.J., Metzger, N., Siber, J., Finkbeiner, B.: Formal
specifications from natural language. arXiv preprint arXiv:2206.01962 (2022)
20. Havelund, K., Roşu, G.: Monitoring java programs with java pathexplorer. Elec-
tron. Notes Theor. Comput. Sci. 55(2), 200–217 (2001)
21. He, J., Bartocci, E., Ničković, D., Isakovic, H., Grosu, R.: Deepstl: from english
requirements to signal temporal logic. In: Proceedings of the 44th International
Conference on Software Engineering, pp. 610–622 (2022)
22. IEEE-Commission, et al.: IEEE standard for property specification language
(PSL). IEEE Std 1850–2005 (2005)
23. Kaplan, J., et al.: Scaling laws for neural language models. arXiv preprint
arXiv:2001.08361 (2020)
24. Konrad, S., Cheng, B.H.: Real-time specification patterns. In: Proceedings of the
27th International Conference on Software Engineering, pp. 372–381 (2005)
25. Kreber, J.U., Hahn, C.: Generating symbolic reasoning problems with transformer
gans. arXiv preprint arXiv:2110.10054 (2021)
26. Lahiri, S.K., et al.: Interactive code generation via test-driven user-intent formal-
ization. arXiv preprint arXiv:2208.05950 (2022)
27. Laurençon, H., et al.: The bigscience roots corpus: a 1.6 tb composite multilingual
dataset. In: Thirty-sixth Conference on Neural Information Processing Systems
Datasets and Benchmarks Track (2022)
28. Lewkowycz, A., et al.: Solving quantitative reasoning problems with language mod-
els. arXiv preprint arXiv:2206.14858 (2022)
29. Lhoest, Q., et al.: Datasets: a community library for natural language processing.
arXiv preprint arXiv:2109.02846 (2021)
nl2spec 395
30. Liu, J.X., et al.: Lang2ltl: translating natural language commands to temporal
specification with large language models. In: Workshop on Language and Robotics
at CoRL 2022
31. Maler, O., Nickovic, D.: Monitoring temporal properties of continuous signals. In:
Lakhnech, Y., Yovine, S. (eds.) FORMATS/FTRTFT -2004. LNCS, vol. 3253, pp.
152–166. Springer, Heidelberg (2004). https://doi.org/10.1007/978-3-540-30206-
3 12
32. Patel, R., Pavlick, R., Tellex, S.: Learning to ground language to temporal logical
form. In: NAACL (2019)
33. Pesic, M., van der Aalst, W.M.P.: A declarative approach for flexible business
processes management. In: Eder, J., Dustdar, S. (eds.) BPM 2006. LNCS, vol. 4103,
pp. 169–180. Springer, Heidelberg (2006). https://doi.org/10.1007/11837862 18
34. Pnueli, A.: The temporal logic of programs. In: 18th Annual Symposium on Foun-
dations of Computer Science (sfcs 1977), pp. 46–57. IEEE (1977)
35. Rabe, M.N., Szegedy, C.: Towards the automatic mathematician. In: Platzer, A.,
Sutcliffe, G. (eds.) CADE 2021. LNCS (LNAI), vol. 12699, pp. 25–37. Springer,
Cham (2021). https://doi.org/10.1007/978-3-030-79876-5 2
36. Radford, A., Narasimhan, K., Salimans, T., Sutskever, I., et al.: Improving lan-
guage understanding by generative pre-training (2018)
37. Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I., et al.: Language
models are unsupervised multitask learners. OpenAI blog 1(8), 9 (2019)
38. Raffel, C., et al.: Exploring the limits of transfer learning with a unified text-to-text
transformer. J. Mach. Learn. Res. 21(140), 1–67 (2020). http://jmlr.org/papers/
v21/20-074.html
39. Scao, T.L., et al.: Bloom: a 176b-parameter open-access multilingual language
model. arXiv preprint arXiv:2211.05100 (2022)
40. Schmitt, F., Hahn, C., Rabe, M.N., Finkbeiner, B.: Neural circuit synthesis from
specification patterns. Adv. Neural Inf. Process. Syst. 34, 15408–15420 (2021)
41. Shannon, C.E.: A mathematical theory of communication. Bell Syst. Tech. J. 27(3),
379–423 (1948)
42. Vaswani, A., et al.: Attention is all you need. Adv. Neural Inf. Process. Syst. 30
(2017)
43. Vijayaraghavan, S., Ramanathan, M.: A Practical Guide for SystemVerilog Asser-
tions. Springer, Heidelberg (2005). https://doi.org/10.1007/b137011
44. Vyshnavi, V.R., Malik, A.: Efficient way of web development using python and
flask. Int. J. Recent Res. Asp 6(2), 16–19 (2019)
45. Wang, C., Ross, C., Kuo, Y.L., Katz, B., Barbu, A.: Learning a natural-
language to ltl executable semantic parser for grounded robotics. arXiv preprint
arXiv:2008.03277 (2020)
46. Wei, J., et al.: Chain of thought prompting elicits reasoning in large language
models. arXiv preprint arXiv:2201.11903 (2022)
47. Wolf, T., et al.: Huggingface’s transformers: state-of-the-art natural language pro-
cessing. arXiv preprint arXiv:1910.03771 (2019)
48. Wolf, T., et al.: Transformers: State-of-the-art natural language processing. In:
Proceedings of the 2020 Conference on Empirical Methods in Natural Language
Processing: System Demonstrations, pp. 38–45 (2020)
49. Wu, Y., et al.: Autoformalization with large language models. arXiv preprint
arXiv:2205.12615 (2022)
50. Zelikman, E., Wu, Y., Goodman, N.D.: Star: bootstrapping reasoning with reason-
ing. arXiv preprint arXiv:2203.14465 (2022)
396 M. Cosler et al.
Open Access This chapter is licensed under the terms of the Creative Commons
Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/),
which permits use, sharing, adaptation, distribution and reproduction in any medium
or format, as long as you give appropriate credit to the original author(s) and the
source, provide a link to the Creative Commons license and indicate if changes were
made.
The images or other third party material in this chapter are included in the
chapter’s Creative Commons license, unless indicated otherwise in a credit line to the
material. If material is not included in the chapter’s Creative Commons license and
your intended use is not permitted by statutory regulation or exceeds the permitted
use, you will need to obtain permission directly from the copyright holder.
NNV 2.0: The Neural Network
Verification Tool
1 Introduction
2 Related Work
The area of DNN verification has grown rapidly in recent years, leading to the
development of standard input formats [29] as well as friendly competitions
[50,55] that help compare and evaluate the recent methods and tools proposed
in the community [4,6,19,22,31,39–41,48,55,59,64,65,77,83,84,86,87]. (Code
available at: https://github.com/verivital/nnv/releases/tag/cav2023. Archival
version: https://doi.org/10.24433/CO.0803700.v1.) However, the majority of
these methods focus on regression and classification tasks performed by FFNNs
and CNNs. In addition to FFNN and CNN verification, Tran et al. [79] introduced
a collection of star-based reachability analyses that also verify SSNNs. Fischer
et al. [21] proposed a probabilistic method for the robustness verification of
SSNNs based on randomized smoothing [14]. Since then,
some of the other recent tools, including Verinet [31], α,β-Crown [84,87], and
MN-BaB [20], are also able to verify image segmentation properties, as demon-
strated in [55]. A less explored area is the verification of RNNs. These models
have unique “memory units” that enable them to store information over time
and learn complex patterns in time-series or sequential data. However, due to
these memory units, verifying the robustness of RNNs is challenging. Notable
state-of-the-art methodologies for verifying RNNs include unrolling the network
into an FFNN and then verifying it [2], invariant inference [36,62,90], and
star-based reachability [74]. Similar to RNNs, neural ODEs are deep learning
models with “memory”, which makes them suitable for learning time-series data,
but they are also applicable to other tasks such as continuous normalizing flows
(CNF) and image classification [11,61]. However, existing verification work for
neural ODEs is limited to a stochastic reachability approach [27,28], reachability
approaches using star and zonotope methods for a general class of neural ODEs
(GNODE) with continuous- and discrete-time layers [52], and GAINS [89], which
leverages ODE-solver information to discretize the models using a computation
graph that represents all possible trajectories from a given input, in order to
accelerate its bound-propagation method. One of the main remaining challenges
is to find a framework that can verify several of these model classes successfully.
For example, α,β-Crown was the top performer in last year's neural network
verification competition [55] and is able to verify FFNNs, CNNs, and SSNNs,
but it lacks support for neural ODEs and NNCS. There exist other tools that
focus more on the verification of NNCS, such as Verisig [34,35], Juliareach [63],
ReachNN [17,33], Sherlock [16], RINO [26], VenMas [1], POLAR [32], and
CORA [3,42]. However, their support is limited to NNCS with a linear or
nonlinear ODE or a hybrid automaton as the plant model, and an FFNN as the
controller.
Finally, for a more detailed comparison to state-of-the-art methods for the
novel features of NNV 2.0, we refer to the comparison and discussion of neural
ODEs in [52]. For SSNNs, there is a discussion on the scalability and conser-
vativeness of the presented methods (approx- and relax-star) for the different
layers that may be part of an SSNN [79]. For RNNs, the approach details and
a state-of-the-art comparison can be found in [74]. We also refer the reader to
two verification competitions, namely VNN-COMP [6,55] and AINNCS ARCH-
COMP [38,50], for a comparison of state-of-the-art methods for neural network
verification and neural network control system verification, respectively.
NNV uses CORA [3] for the reachability analysis of nonlinear ordinary
differential equations (ODEs) [73] and hybrid automata, the MPT toolbox [45]
for polytope-based operations [76], YALMIP [49] for some optimization problems
in addition to MATLAB's Optimization Toolbox [53] and GLPK [56], and
MatConvNet [82] for some convolution and pooling operations. NNV also makes
use of MATLAB's deep learning toolbox to load the Open Neural Network
Exchange (ONNX) format [57,68], and the Hybrid Systems Model Transformation
and Translation tool (HyST) [5] for NNCS plant configuration.
NNV consists of two main modules: a computation engine and an analyzer,
as illustrated in Fig. 1. The computation engine module consists of four com-
ponents: 1) NN constructor, 2) NNCS constructor, 3) reachability solvers, and
4) evaluator. The NN constructor takes as input a neural network, either as
a DAGNetwork, dlnetwork, or SeriesNetwork (MATLAB built-in formats) [69],
or as an ONNX file [57], and generates an NN object suitable for verification.
The NNCS constructor takes as inputs the NN object and an ODE or hybrid
automaton (HA) file describing the dynamics of a system, and then creates an
NNCS object. Depending on the task to solve, the NN (or NNCS) object is
passed into the reachability solver to compute the reachable set of the system
from a given set of initial conditions. Then, the computed set is sent to the
analyzer module to verify or falsify a given property, and/or to visualize the
reachable sets. Given a specification, the verifier can formally reason about
whether the specification is met by computing the intersection of the defined
property and the reachable sets. If an exact (sound and complete) method is
used (e.g., exact-star), the analyzer can determine whether the property is
satisfied or unsatisfied. If an over-approximate (sound but incomplete) method
is used, the verifier may also
return “uncertain” (unknown), in addition to satisfied or unsatisfied.
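For illustration, the analyzer's verdict logic can be sketched as follows (Python pseudocode of ours; NNV itself is implemented in MATLAB and its actual interfaces differ):

    from dataclasses import dataclass
    from enum import Enum

    class Verdict(Enum):
        SAT = "satisfied"
        UNSAT = "unsatisfied"
        UNKNOWN = "uncertain"

    @dataclass
    class Box:
        lo: list[float]
        hi: list[float]
        def intersects(self, other: "Box") -> bool:
            # Boxes overlap iff they overlap in every dimension.
            return all(a <= d and c <= b
                       for a, b, c, d in zip(self.lo, self.hi, other.lo, other.hi))

    def analyze(reach_sets: list[Box], unsafe: Box, exact: bool) -> Verdict:
        if not any(r.intersects(unsafe) for r in reach_sets):
            return Verdict.SAT  # no computed state enters the unsafe region
        # Exact sets: any intersection is a real violation; over-approximate
        # sets: the intersection may be spurious, so the result is uncertain.
        return Verdict.UNSAT if exact else Verdict.UNKNOWN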
Fig. 1. Overview of NNV: the computation engine (NN constructor, NNCS con-
structor, reachability solvers, evaluator) produces reachable sets and evaluation
traces, and the analyzer (visualizer, verifier, falsifier) returns plots of reachable
sets/traces, safe/unsafe/robust/uncertain verdicts, and sets of counterexample
inputs or unsafe traces.
Since the introduction of NNV [80], we have added support for the verification
of a larger set of deep learning models: reachability methods to verify
SSNNs [79], a collection of relax-star reachability methods [79], and reachability
techniques for neural ODEs [52] and RNNs [74]. In addition, we have created
a common NN class that encapsulates the previously supported neural network
classes (FFNN and CNN) as well as neural ODEs, SSNNs, and RNNs, which
significantly reduces the software complexity and simplifies the user experience.
We have also added direct support for ONNX [57], as well as a parser for
VNN-LIB [29], which describes properties to verify for any class of neural
networks, and the flexibility to use any of the many solvers supported by
YALMIP [49], GLPK [56], or linprog [70]. Table 1 shows a summary of the
major features of NNV, highlighting the novel ones.
4 Evaluation
The evaluation is divided into four parts: 1) comparison of FFNN and CNN
verification to MATLAB's commercial toolbox [53,69], 2) reachability analysis
of neural ODEs [52], 3) robustness verification of RNNs [74], and 4) robustness
verification of SSNNs [79]. All experiments were performed on a desktop with
the following configuration: AMD Ryzen 9 5900X 12-core processor @ 3.7 GHz,
64 GB memory, 64-bit Microsoft Windows 10 Pro.
                matlab   approx   relax 25%  relax 50%  relax 75%  relax 100%  exact (8)
    prop 3 (45)
      SAT            3        3          3          2          0           0          3
      UNSAT         10       29          8          2          1           0         42
      time (s)  0.1383   0.6368     0.6192     0.5714     0.3843      0.0276      521.9
    prop 4 (45)
      SAT            1        3          3          2          0           0          3
      UNSAT          2       32          6          1          1           0         42
      time (s)  0.1387   0.6492     0.6420     0.5682     0.3568      0.0261      89.85
Table 3. Verification results of the RL, tllverify and oval21 benchmarks. We selected
50 random specifications from the RL benchmarks, 10 from tllverify and all 30 from
oval21. - means that the benchmark is not supported.
Dynamical Systems. For the FPA, we compute the reachable set for a time
horizon of 10 s, given a perturbation of ±0.01 on all 5 input dimensions. The
results of this example are illustrated in Fig. 2c, with a computation time of
3.01 s. The FPA model consists of one nonlinear neural ODE; no discrete-time
layers are part of this model [52].
Fig. 2. Verification of RNN and neural ODE results. Figure 2a shows the verification
time of the 3 RNNs evaluated. Figure 2b depicts the safety verification of the ACC,
and Fig. 2c shows the reachability results of the FPA benchmark.
value of ±0.5 on all the pixels. We are able to prove the robustness of both
models on 100% of the images, with an average computation time of 16.3 s for
the CNODE_S and 119.9 s for the CNODE_M.
random image of M2NIST [18] by attacking each image using a UBAA bright-
ening attack [79]. One of the main differences of this evaluation with respect
to the robustness analysis of other classification tasks is the evaluation metrics
used. For these networks, we evaluate the average robustness value (percentage
of pixels correctly classified), sensitivity (number of non-robust pixels over the
number of attacked pixels), and IoU (intersection over union) of the SSNNs. The
computation time for the dilated example, shown in Fig. 3, is 54.52 s, with a
robustness value of 97.2%, a sensitivity of 3.04, and an IoU of 57.8%. For the
equivalent example with the transposed network, the robustness value is 98.14%,
the sensitivity 2, the IoU 72.8%, and the computation time 7.15 s.
Fig. 3. Robustness verification of the dilated and transposed SSNNs under a UBAA
brightening attack on 150 random pixels in the input image.
5 Conclusions
We presented version 2.0 of NNV, the updated version of the Neural Network
Verification (NNV) tool [80], a software tool for the verification of deep learning
models and learning-enabled CPS. To the best of our knowledge, NNV is the
most comprehensive verification tool in terms of the number of tasks and neural
network architectures supported, including the verification of feedforward, con-
volutional, semantic segmentation, and recurrent neural networks, neural ODEs,
and NNCS. With the recent additions to NNV, we have demonstrated that NNV
can be a one-stop verification tool for users with a diverse problem set, where
verification of multiple neural network types is needed. In addition, NNV supports
zonotope- and polyhedron-based methods and up to six different star-based
reachability methods to handle the trade-offs of the neural network verification
problem, ranging from the exact-star method, which is sound and complete but
computationally expensive, to the relax-star methods, which are significantly
faster but more conservative. We have also shown that NNV outperforms a
commercially available product from MATLAB, which computes the reachable
sets of feedforward neural networks using the zonotope reachability method
presented in [66]. In the future, we plan to add support for other deep learning
models such as ResNets [30] and UNets [60].
Acknowledgments. The material presented in this paper is based upon work sup-
ported by the National Science Foundation (NSF) through grant numbers 1910017,
2028001, 2220418, 2220426 and 2220401, and the NSF Nebraska EPSCoR under grant
OIA-2044049, the Defense Advanced Research Projects Agency (DARPA) under con-
tract numbers FA8750-18-C-0089 and FA8750-23-C-0518, and the Air Force Office of
Scientific Research (AFOSR) under contract numbers FA9550-22-1-0019 and FA9550-
23-1-0135. Any opinions, findings, and conclusions or recommendations expressed in
this paper are those of the authors and do not necessarily reflect the views of AFOSR,
DARPA, or NSF.
References
1. Akintunde, M.E., Botoeva, E., Kouvaros, P., Lomuscio, A.: Formal verification of
neural agents in non-deterministic environments. In: Proceedings of the 19th Inter-
national Conference on Autonomous Agents and Multiagent Systems (AAMAS
2020), IFAAMAS 2020. ACM, Auckland (2020)
2. Akintunde, M.E., Kevorchian, A., Lomuscio, A., Pirovano, E.: Verification of rnn-
based neural agent-environment systems. In: Proceedings of the AAAI Conference
on Artificial Intelligence, vol. 33, pp. 6006–6013 (2019)
3. Althoff, M.: An introduction to cora 2015. In: Proceedings of the Workshop on
Applied Verification for Continuous and Hybrid Systems (2015)
4. Bak, S.: nnenum: verification of ReLU neural networks with optimized abstraction
refinement. In: Dutle, A., Moscato, M.M., Titolo, L., Muñoz, C.A., Perez, I. (eds.)
NFM 2021. LNCS, vol. 12673, pp. 19–36. Springer, Cham (2021). https://doi.org/
10.1007/978-3-030-76384-8 2
5. Bak, S., Bogomolov, S., Johnson, T.T.: Hyst: a source transformation and transla-
tion tool for hybrid automaton models. In: Proceedings of the 18th International
Conference on Hybrid Systems: Computation and Control, pp. 128–133. ACM
(2015)
6. Bak, S., Liu, C., Johnson, T.T.: The second international verification of neu-
ral networks competition (VNN-COMP 2021): Summary and results. CoRR
abs/2109.00498 (2021)
7. Bak, S., Tran, H.-D., Hobbs, K., Johnson, T.T.: Improved geometric path enumer-
ation for verifying ReLU neural networks. In: Lahiri, S.K., Wang, C. (eds.) CAV
2020. LNCS, vol. 12224, pp. 66–96. Springer, Cham (2020). https://doi.org/10.
1007/978-3-030-53288-8 4
8. Bogomolov, S., Forets, M., Frehse, G., Potomkin, K., Schilling, C.: Juliareach: a
toolbox for set-based reachability. In: Proceedings of the 22nd ACM International
Conference on Hybrid Systems: Computation and Control, pp. 39–44 (2019)
9. Bojarski, M., et al.: End to end learning for self-driving cars. arXiv preprint
arXiv:1604.07316 (2016)
10. Chen, C., Seff, A., Kornhauser, A., Xiao, J.: Deepdriving: learning affordance for
direct perception in autonomous driving. In: Proceedings of the IEEE International
Conference on Computer Vision, pp. 2722–2730 (2015)
11. Chen, R.T.Q., Rubanova, Y., Bettencourt, J., Duvenaud, D.: Neural ordinary dif-
ferential equations. Adv. Neural Inf. Process. Syst. (2018)
12. Chen, X., Ábrahám, E., Sankaranarayanan, S.: Flow*: an analyzer for non-linear
hybrid systems. In: Sharygina, N., Veith, H. (eds.) CAV 2013. LNCS, vol. 8044, pp.
258–263. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-39799-
8 18
13. Cireşan, D., Meier, U., Schmidhuber, J.: Multi-column deep neural networks for
image classification. arXiv preprint arXiv:1202.2745 (2012)
14. Cohen, J., Rosenfeld, E., Kolter, Z.: Certified adversarial robustness via ran-
domized smoothing. In: Chaudhuri, K., Salakhutdinov, R. (eds.) Proceedings of
the 36th International Conference on Machine Learning. Proceedings of Machine
Learning Research, vol. 97, pp. 1310–1320. PMLR (2019)
15. Collobert, R., Weston, J.: A unified architecture for natural language processing:
Deep neural networks with multitask learning. In: Proceedings of the 25th Inter-
national Conference on Machine Learning, pp. 160–167. ACM (2008)
16. Dutta, S., Jha, S., Sankaranarayanan, S., Tiwari, A.: Output range analysis for
deep feedforward neural networks, pp. 121–138 (2018)
17. Fan, J., Huang, C., Chen, X., Li, W., Zhu, Q.: ReachNN*: a tool for reachability
analysis of neural-network controlled systems. In: Hung, D.V., Sokolsky, O. (eds.)
ATVA 2020. LNCS, vol. 12302, pp. 537–542. Springer, Cham (2020). https://doi.
org/10.1007/978-3-030-59152-6 30
18. farhanhubble: M2NIST, MNIST of semantic segmentation. https://www.
kaggle.com/datasets/farhanhubble/multimnistm2nist
19. Ferlez, J., Khedr, H., Shoukry, Y.: Fast BATLLNN: fast box analysis of two-level
lattice neural networks. In: Bartocci, E., Putot, S. (eds.) HSCC ’22: 25th ACM
International Conference on Hybrid Systems: Computation and Control, Milan,
Italy, 4–6 May 2022. pp. 23:1–23:11. ACM (2022)
20. Ferrari, C., Müller, M.N., Jovanovic, N., Vechev, M.T.: Complete verification via
multi-neuron relaxation guided branch-and-bound. In: The Tenth International
Conference on Learning Representations, ICLR 2022, Virtual Event, 25–29 April
2022. OpenReview.net (2022)
21. Fischer, M., Baader, M., Vechev, M.: Scalable certified segmentation via ran-
domized smoothing. In: Meila, M., Zhang, T. (eds.) Proceedings of the 38th
International Conference on Machine Learning. Proceedings of Machine Learning
Research, vol. 139, pp. 3340–3351. PMLR (2021)
22. Fischer, M., Sprecher, C., Dimitrov, D.I., Singh, G., Vechev, M.: Shared certificates
for neural network verification. In: Shoham, S., Vizel, Y. (eds.) Computer Aided
Verification, pp. 127–148. Springer, Cham (2022). https://doi.org/10.1007/978-3-
031-13185-1 7
23. Frehse, G., et al.: SpaceEx: scalable verification of hybrid systems. In: Gopalakr-
ishnan, G., Qadeer, S. (eds.) CAV 2011. LNCS, vol. 6806, pp. 379–395. Springer,
Heidelberg (2011). https://doi.org/10.1007/978-3-642-22110-1 30
24. Gatys, L.A., Ecker, A.S., Bethge, M.: Image style transfer using convolutional
neural networks. In: Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, pp. 2414–2423 (2016)
25. Goldberg, Y.: A primer on neural network models for natural language processing.
J. Artif. Intell. Res. 57, 345–420 (2016)
26. Goubault, E., Putot, S.: Rino: Robust inner and outer approximated reachability
of neural networks controlled systems. In: Shoham, S., Vizel, Y. (eds.) Computer
Aided Verification, pp. 511–523. Springer, Cham (2022). https://doi.org/10.1007/
978-3-031-13185-1 25
27. Gruenbacher, S., Hasani, R.M., Lechner, M., Cyranka, J., Smolka, S.A., Grosu, R.:
On the verification of neural odes with stochastic guarantees. In: AAAI (2021)
28. Gruenbacher, S., et al.: Gotube: scalable stochastic verification of continuous-depth
models (2021)
29. Guidotti, D., Demarchi, S., Tacchella, A., Pulina, L.: The Verification of Neural
Networks Library (VNN-LIB) (2022). https://www.vnnlib.org
30. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition.
In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR),
pp. 770–778 (2016)
31. Henriksen, P., Hammernik, K., Rueckert, D., Lomuscio, A.: Bias field robustness
verification of large neural image classifiers. In: Proceedings of the 32nd British
Machine Vision Conference (BMVC21). BMVA Press (2021)
32. Huang, C., Fan, J., Chen, X., Li, W., Zhu, Q.: Polar: a polynomial arithmetic
framework for verifying neural-network controlled systems (2021). https://doi.org/
10.48550/ARXIV.2106.13867
33. Huang, C., Fan, J., Li, W., Chen, X., Zhu, Q.: Reachnn: reachability analysis of
neural-network controlled systems. arXiv preprint arXiv:1906.10654 (2019)
34. Ivanov, R., Carpenter, T., Weimer, J., Alur, R., Pappas, G., Lee, I.: Verisig 2.0:
verification of neural network controllers using taylor model preconditioning. In:
Silva, A., Leino, K.R.M. (eds.) CAV 2021. LNCS, vol. 12759, pp. 249–262. Springer,
Cham (2021). https://doi.org/10.1007/978-3-030-81685-8 11
35. Ivanov, R., Weimer, J., Alur, R., Pappas, G.J., Lee, I.: Verisig: verifying safety
properties of hybrid systems with neural network controllers. In: Hybrid Systems:
Computation and Control (HSCC) (2019)
36. Jacoby, Y., Barrett, C., Katz, G.: Verifying recurrent neural networks using invari-
ant inference. In: Hung, D.V., Sokolsky, O. (eds.) ATVA 2020. LNCS, vol. 12302,
pp. 57–74. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-59152-6 3
37. Jain, L.C., Medsker, L.R.: Recurrent neural networks: design and applications
(1999)
38. Johnson, T.T., et al.: Arch-comp21 category report: artificial intelligence and neu-
ral network control systems (ainncs) for continuous and hybrid systems plants. In:
Frehse, G., Althoff, M. (eds.) 8th International Workshop on Applied Verification
of Continuous and Hybrid Systems (ARCH21). EPiC Series in Computing, vol. 80,
pp. 90–119. EasyChair (2021). https://doi.org/10.29007/kfk9
39. Katz, G., Barrett, C., Dill, D.L., Julian, K., Kochenderfer, M.J.: Reluplex: an
efficient SMT solver for verifying deep neural networks. In: Majumdar, R., Kunčak,
V. (eds.) CAV 2017. LNCS, vol. 10426, pp. 97–117. Springer, Cham (2017). https://
doi.org/10.1007/978-3-319-63387-9 5
40. Katz, G., et al.: The marabou framework for verification and analysis of deep
neural networks. In: Dillig, I., Tasiran, S. (eds.) CAV 2019. LNCS, vol. 11561, pp.
443–452. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-25540-4 26
41. Khedr, H., Ferlez, J., Shoukry, Y.: PEREGRiNN: penalized-relaxation greedy neu-
ral network verifier. In: Silva, A., Leino, K.R.M. (eds.) CAV 2021. LNCS, vol.
12759, pp. 287–300. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-
81685-8 13
42. Kochdumper, N., Schilling, C., Althoff, M., Bak, S.: Open- and closed-loop neural
network verification using polynomial zonotopes (2022)
43. Krizhevsky, A.: Learning multiple layers of features from tiny images, pp. 32–33
(2009)
44. Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep con-
volutional neural networks. In: Advances in Neural Information Processing Sys-
tems, pp. 1097–1105 (2012)
45. Kvasnica, M., Grieder, P., Baotić, M., Morari, M.: Multi-parametric toolbox
(MPT). In: Alur, R., Pappas, G.J. (eds.) HSCC 2004. LNCS, vol. 2993, pp. 448–
462. Springer, Heidelberg (2004). https://doi.org/10.1007/978-3-540-24743-2 30
46. LeCun, Y.: The mnist database of handwritten digits (1998). http://yann.lecun.
com/exdb/mnist/
47. Lenz, I.: Deep learning for robotics. Ph.D. thesis, Cornell University (2016)
48. Liu, C., Arnon, T., Lazarus, C., Strong, C., Barrett, C., Kochenderfer, M.J.: Algo-
rithms for verifying deep neural networks. Found. Trends Optim. 4(3–4), 244–404
(2021). https://doi.org/10.1561/2400000035
49. Löfberg, J.: Yalmip : A toolbox for modeling and optimization in MATLAB. In:
Proceedings of the CACSD Conference, Taipei, Taiwan (2004). http://users.isy.liu.
se/johanl/yalmip
50. Lopez, D.M., et al.: Arch-comp22 category report: artificial intelligence and neu-
ral network control systems (ainncs) for continuous and hybrid systems plants.
In: Frehse, G., Althoff, M., Schoitsch, E., Guiochet, J. (eds.) Proceedings of 9th
International Workshop on Applied Verification of Continuous and Hybrid Systems
(ARCH22). EPiC Series in Computing, vol. 90, pp. 142–184. EasyChair (2022)
51. Lopez, D.M., Johnson, T.T., Bak, S., Tran, H.D., Hobbs, K.: Evaluation of neural
network verification methods for air to air collision avoidance. AIAA J. Air Transp.
(JAT) (2022)
52. Manzanas Lopez, D., Musau, P., Hamilton, N., Johnson, T.: Reachability analysis
of a general class of neural ordinary differential equation. In: Proceedings of the
20th International Conference on Formal Modeling and Analysis of Timed Systems
(FORMATS 2022), Co-Located with CONCUR, FMICS, and QEST as part of
CONFEST 2022, Warsaw, Poland (2022)
53. MATLAB: R2022b, Update 3. The MathWorks Inc., Natick, Massachusetts
(2022)
54. Musau, P., Johnson, T.T.: Continuous-time recurrent neural networks (ctrnns)
(benchmark proposal). In: 5th Applied Verification for Continuous and Hybrid
Systems Workshop (ARCH), Oxford, UK (2018). https://doi.org/10.29007/6czp
55. Müller, M.N., Brix, C., Bak, S., Liu, C., Johnson, T.T.: The third international
verification of neural networks competition (VNN-COMP 2022): summary and results
(2022)
56. Oki, E.: Glpk (gnu linear programming kit) (2012)
57. Open Neural Network Exchange (ONNX): https://github.com/onnx/
58. O’Shea, K., Nash, R.: An introduction to convolutional neural net-
works. CoRR abs/1511.08458 (2015). http://dblp.uni-trier.de/db/journals/corr/
corr1511.html#OSheaN15
59. Prabhakar, P., Rahimi Afzal, Z.: Abstraction based output range analysis for neural
networks. Adv. Neural Inf. Process. Syst. 32 (2019)
60. Ronneberger, O., Fischer, P., Brox, T.: U-Net: convolutional networks for biomed-
ical image segmentation. In: Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F.
(eds.) MICCAI 2015. LNCS, vol. 9351, pp. 234–241. Springer, Cham (2015).
https://doi.org/10.1007/978-3-319-24574-4 28
61. Rubanova, Y., Chen, R.T.Q., Duvenaud, D.K.: Latent ordinary differential equa-
tions for irregularly-sampled time series. In: Wallach, H., Larochelle, H., Beygelz-
imer, A., d’Alché-Buc, F., Fox, E., Garnett, R. (eds.) Advances in Neural Infor-
mation Processing Systems, vol. 32. Curran Associates, Inc. (2019)
62. Ryou, W., Chen, J., Balunovic, M., Singh, G., Dan, A., Vechev, M.: Scalable poly-
hedral verification of recurrent neural networks. In: Silva, A., Leino, K.R.M. (eds.)
CAV 2021. LNCS, vol. 12759, pp. 225–248. Springer, Cham (2021). https://doi.
org/10.1007/978-3-030-81685-8 10
63. Schilling, C., Forets, M., Guadalupe, S.: Verification of neural-network control
systems by integrating Taylor models and zonotopes. In: AAAI, pp. 8169–8177.
AAAI Press (2022). https://doi.org/10.1609/aaai.v36i7.20790
64. Shriver, D., Elbaum, S., Dwyer, M.B.: DNNV: a framework for deep neural network
verification. In: Silva, A., Leino, K.R.M. (eds.) CAV 2021. LNCS, vol. 12759, pp.
137–150. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-81685-8 6
65. Singh, G., Gehr, T., Mirman, M., Püschel, M., Vechev, M.: Fast and effective
robustness certification. In: Advances in Neural Information Processing Systems,
pp. 10825–10836 (2018)
66. Singh, G., Gehr, T., Püschel, M., Vechev, M.: An abstract domain for certifying
neural networks. Proc. ACM Program. Lang. 3(POPL), 41 (2019)
67. Szegedy, C., et al.: Intriguing properties of neural networks. arXiv preprint
arXiv:1312.6199 (2013)
68. The MathWorks, Inc.: Deep Learning Toolbox Converter for ONNX Model Format.
Natick, Massachusetts, United States (2022). https://www.mathworks.com/
matlabcentral/fileexchange/67296-deep-learning-toolbox-converter-for-onnx-
model-format
69. The MathWorks, Inc.: Deep Learning Toolbox Verification Library. Natick, Mas-
sachusetts, United States (2022). https://www.mathworks.com/matlabcentral/
fileexchange/118735-deep-learning-toolbox-verification-library
70. The MathWorks, Inc.: Optimization Toolbox. Natick, Massachusetts, United States
(2022). https://www.mathworks.com/products/optimization.html
71. Thoma, M.: A survey of semantic segmentation (2016)
72. Tran, H.-D., Bak, S., Xiang, W., Johnson, T.T.: Verification of deep convolutional
neural networks using imagestars. In: Lahiri, S.K., Wang, C. (eds.) CAV 2020.
LNCS, vol. 12224, pp. 18–42. Springer, Cham (2020). https://doi.org/10.1007/
978-3-030-53288-8 2
73. Tran, H.D., Cei, F., Lopez, D.M., Johnson, T.T., Koutsoukos, X.: Safety veri-
fication of cyber-physical systems with reinforcement learning control. In: ACM
SIGBED International Conference on Embedded Software (EMSOFT 2019). ACM
(2019)
74. Tran, H.D., Choi, S., Yamaguchi, T., Hoxha, B., Prokhorov, D.: Verification of
recurrent neural networks using star reachability. In: The 26th ACM International
Conference on Hybrid Systems: Computation and Control (HSCC) (2023)
75. Tran, H.D., et al.: Parallelizable reachability analysis algorithms for feed-forward
neural networks. In: Proceedings of the 7th International Workshop on Formal
Methods in Software Engineering (FormaliSE 2019), pp. 31–40. IEEE Press, Pis-
cataway (2019). https://doi.org/10.1109/FormaliSE.2019.00012
76. Tran, H.D., et al.: Parallelizable reachability analysis algorithms for feed-forward
neural networks. In: 7th International Conference on Formal Methods in Software
Engineering (FormaliSE2019), Montreal, Canada (2019)
77. Tran, H.-D., et al.: Star-based reachability analysis of deep neural networks. In:
ter Beek, M.H., McIver, A., Oliveira, J.N. (eds.) FM 2019. LNCS, vol. 11800, pp.
670–686. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-30942-8 39
78. Tran, H.D., et al.: Verification of piecewise deep neural networks: a star set app-
roach with zonotope pre-filter. Formal Asp. Comput. 33(4), 519–545 (2021)
79. Tran, H.-D., et al.: Robustness verification of semantic segmentation neural net-
works using relaxed reachability. In: Silva, A., Leino, K.R.M. (eds.) CAV 2021.
LNCS, vol. 12759, pp. 263–286. Springer, Cham (2021). https://doi.org/10.1007/
978-3-030-81685-8 12
80. Tran, H.D., et al.: NNV: the neural network verification tool for deep neural net-
works and learning-enabled cyber-physical systems. In: 32nd International Confer-
ence on Computer-Aided Verification (CAV) (2020)
QEBVerif: Quantization Error Bound
Verification of Neural Networks
1 Introduction
In the past few years, the development of deep neural networks (DNNs) has
grown at an impressive pace owing to their outstanding performance in solving
various complicated tasks [23,28]. However, modern DNNs are often large in
size and contain a great number of 32-bit floating-point parameters to achieve
competitive performance. Thus, they often result in high computational costs
and excessive storage requirements, hindering their deployment on resource-
constrained embedded devices, e.g., edge devices. A promising solution is to
quantize the weights and/or activation tensors as fixed-point numbers of lower bit-width.
and QNNs using existing state-of-the-art (symbolic) interval analysis [22,55],
and then conducts an interval subtraction. The experimental results show that
both our interval- and symbolic-based approaches are much more accurate and
can successfully verify many more tasks without the MILP-based verification.
We also find that the quantization error interval returned by DRA becomes
tighter as the quantization bit size increases. The experimental results also
confirm the effectiveness of our MILP-based verification method, which can help
verify many tasks that cannot be solved by DRA alone. Finally, our results also
allow us to study the potential correlation between quantization errors and
robustness for QNNs using QEBVerif.
We summarize our contributions as follows:
– We introduce the first sound, complete, and reasonably efficient quantiza-
tion error bound verification method QEBVerif for fully QNNs by cleverly
combining novel DRA and MILP-based verification methods;
– We propose a novel DRA to compute sound and tight quantization error
intervals accompanied by an abstract domain tailored to QNNs, which can
significantly and soundly tighten the quantization error intervals;
– We implement QEBVerif as an end-to-end open-source tool [64] and conduct
an extensive evaluation on various verification tasks, demonstrating its effec-
tiveness and efficiency.
The source code of our tool and benchmarks are available at https://github.
com/S3L-official/QEBVerif. Missing proofs, more examples, and experimental
results can be found in [65].
2 Preliminaries
We denote by R, Z, N and B the sets of real-valued numbers, integers, natu-
ral numbers, and Boolean values, respectively. Let [n] denote the integer set
$\{1, \ldots, n\}$ for given $n \in \mathbb{N}$. We use bold uppercase (e.g., W) and bold
lowercase (e.g., x) to denote matrices and vectors, respectively. We denote by
$W_{i,j}$ the $j$-th entry in the $i$-th row of the matrix W, and by $x_i$ the $i$-th entry of
the vector x. Given a matrix W and a vector x, we use $\hat{W}$ and $\hat{x}$ (resp. $\widetilde{W}$ and
$\widetilde{x}$) to denote their quantized/integer (resp. fixed-point) counterparts.
Fig. 1. A 3-layer DNN $N_e$ and its quantized version $\widehat{N}_e$.
We remark that $2^{F_i}$ and $2^{F_h - F_b}$ in Definition 2 are used to align the precision
between the inputs and outputs of hidden layers, and that $F_i$ distinguishes the
cases $i = 2$ and $i > 2$ because the quantization bit sizes for the outputs of the
input layer and of the hidden layers can differ.
We now give the formal definition of the quantization error bound verification
problem considered in this work as follows.
Example 1. Consider the DNN $N_e$ with 3 layers (one input layer, one hidden
layer, and one output layer) given in Fig. 1, where the weights are associated with
the edges and all the biases are 0. The quantization configurations for the weights,
the output of the input layer, and the hidden layer are $C_w = \langle\pm, 4, 2\rangle$,
$C_{in} = \langle +, 4, 4\rangle$, and $C_h = \langle +, 4, 2\rangle$. Its QNN $\widehat{N}_e$ is shown in Fig. 1.
Given a quantized input $\hat{x} = (9, 6)$ and a radius $r = 1$, the input region for
the QNN $\widehat{N}_e$ is $R((9,6),1) = \{(x, y) \in \mathbb{Z}^2 \mid 8 \le x \le 10,\ 5 \le y \le 7\}$. Since
$C^{ub}_{in} = 15$ and $C^{lb}_{in} = 0$, by Definitions 1, 2, and 3, we have the maximum
quantization error $\max |2^{-2}\widehat{N}_e(\hat{x}) - N_e(\hat{x}/15)| = 0.067$ for $\hat{x} \in R((9,6),1)$.
Then, $\widehat{N}_e$ has a quantization error bound of $\epsilon$ w.r.t. the input region $R((9,6),1)$
for any $\epsilon > 0.067$.
We remark that if only the weights are quantized and the activation tensors
remain floating-point, the maximal quantization error of $\widehat{N}_e$ for the input
region $R((9,6),1)$ is 0.04422, which implies that existing methods [48,49] cannot
be used to analyze the error bound for a fully quantized QNN.
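For regions as small as this one, the claimed bound can be checked by brute-force enumeration, as in the following sketch of ours (qnn and dnn stand for $\widehat{N}_e$ and $N_e$ of Fig. 1, whose weights are not reproduced here):

    import itertools

    def max_quant_error(qnn, dnn, center, r, F_out, C_in_ub):
        """Enumerate the integer input region R(center, r) and return the
        maximum |2^-F_out * qnn(x_hat) - dnn(x_hat / C_in_ub)|."""
        best = 0.0
        ranges = [range(c - r, c + r + 1) for c in center]
        for x_hat in itertools.product(*ranges):
            x = [v / C_in_ub for v in x_hat]
            err = abs(2 ** (-F_out) * qnn(x_hat) - dnn(x))
            best = max(best, err)
        return best

    # Example 1: max_quant_error(qnn, dnn, (9, 6), 1, F_out=2, C_in_ub=15) -> 0.067

For realistic input regions such enumeration is infeasible, which motivates the analysis developed below.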
Note that $N(\cdot)_g$ denotes the $g$-th entry of the vector $N(\cdot)$.
2.3 DeepPoly
We briefly recap DeepPoly [55], which will be leveraged in this work for com-
puting the output of each neuron in a DNN.
The core idea of DeepPoly is to give each neuron an abstract domain in the
form of a linear combination of the variables preceding the neuron. To achieve
this, each hidden neuron $x^i_j$ (the $j$-th neuron in the $i$-th layer) of a DNN is
seen as two nodes $x^i_{j,0}$ and $x^i_{j,1}$, such that $x^i_{j,0} = \sum_{k=1}^{n_{i-1}} W^i_{j,k} x^{i-1}_{k,1} + b^i_j$ (affine
function) and $x^i_{j,1} = \mathrm{ReLU}(x^i_{j,0})$ (ReLU function). Then, the affine function is
characterized as an abstract transformer using an upper polyhedral computation
and a lower polyhedral computation in terms of the variables $x^{i-1}_{k,1}$. Finally, it
recursively substitutes the variables in the upper and lower polyhedral compu-
tations with the corresponding polyhedral computations of those variables until
only the input variables remain, from which the concrete intervals are computed.
Formally, the abstract element $A^i_{j,s}$ for the node $x^i_{j,s}$ ($s \in \{0,1\}$) is a tuple
$A^i_{j,s} = \langle a^{i,\le}_{j,s}, a^{i,\ge}_{j,s}, l^i_{j,s}, u^i_{j,s}\rangle$, where $a^{i,\le}_{j,s}$ and $a^{i,\ge}_{j,s}$ are respectively the lower
and upper polyhedral computations in the form of a linear combination of the
variables $x^{i-1}_{k,1}$ if $s = 0$, or $x^i_{k,0}$ if $s = 1$, and $l^i_{j,s} \in \mathbb{R}$ and $u^i_{j,s} \in \mathbb{R}$ are the
concrete lower and upper bounds of the neuron. The concretization of the
abstract element $A^i_{j,s}$ is $\Gamma(A^i_{j,s}) = \{x \in \mathbb{R} \mid a^{i,\le}_{j,s} \le x \wedge x \le a^{i,\ge}_{j,s}\}$.
Concretely, $a^{i,\le}_{j,0}$ and $a^{i,\ge}_{j,0}$ are defined as $a^{i,\le}_{j,0} = a^{i,\ge}_{j,0} = \sum_{k=1}^{n_{i-1}} W^i_{j,k} x^{i-1}_{k,1} + b^i_j$.
Furthermore, we can repeatedly substitute every variable in $a^{i,\le}_{j,0}$ (resp. $a^{i,\ge}_{j,0}$)
with its lower (resp. upper) polyhedral computation according to the coefficients
until no further substitution is possible. Then, we get a sound lower (resp. upper)
bound in the form of a linear combination of the input variables, from which
$l^i_{j,0}$ (resp. $u^i_{j,0}$) can be computed immediately from the given input region.
For the ReLU function $x^i_{j,1} = \mathrm{ReLU}(x^i_{j,0})$, there are three cases to consider
for the abstract element $A^i_{j,1}$:
Note that DeepPoly also introduces abstract transformers for other functions,
such as sigmoid, tanh, and maxpool. In this work, we only consider DNNs
with ReLU as the only non-linear operator.
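As a minimal illustration of the concrete-bound step (a sketch of ours; the full polyhedral back-substitution is more involved), interval bounds for one affine layer over an input box, followed by the concrete ReLU bounds:

    def affine_bounds(W, b, l_in, u_in):
        """Concrete bounds of W x + b over the box [l_in, u_in] (what
        back-substitution reduces to once only input variables remain)."""
        l, u = [], []
        for row, bj in zip(W, b):
            lo = bj + sum(w * (l_in[k] if w >= 0 else u_in[k]) for k, w in enumerate(row))
            hi = bj + sum(w * (u_in[k] if w >= 0 else l_in[k]) for k, w in enumerate(row))
            l.append(lo)
            u.append(hi)
        return l, u

    def relu_bounds(l, u):
        """Concrete output bounds after applying the ReLU."""
        return [max(x, 0.0) for x in l], [max(x, 0.0) for x in u]

    l, u = affine_bounds([[1.0, -1.0]], [0.0], [0.0, 0.0], [1.0, 1.0])
    print(l, u, relu_bounds(l, u))  # [-1.0] [1.0] ([0.0], [1.0])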
3 Methodology of QEBVerif
In this section, we first give an overview of our quantization error bound verifi-
cation method, QEBVerif, and then give the detailed design of each component.
Naively, one could use an existing verification tool in the literature to indepen-
dently compute the output intervals for both the QNN and the DNN, and then
compute their output difference directly by interval subtraction. However, such
an approach would be ineffective due to the significant precision loss.
Recently, Paulsen et al. [48] proposed ReluDiff and showed that the accu-
racy of output difference for two DNNs can be greatly improved by propagating
the difference intervals layer-by-layer. For each hidden layer, they first compute
the output difference of affine functions (before applying the ReLU), and then
they use a ReLU transformer to compute the output difference after applying
the ReLU functions. The reason why ReluDiff outperforms the naive method
is that ReluDiff first computes part of the difference before it accumulates.
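A toy computation (ours) makes the point concrete: subtracting independently computed output intervals discards the correlation between the two networks, while propagating the difference preserves it.

    w = 2.0                        # a weight shared by the two networks
    x = (0.0, 1.0)                 # input interval of the first network
    dx = (-0.01, 0.01)             # input difference interval

    # Naive: compute both output intervals independently, then subtract.
    out1 = (w * x[0], w * x[1])                        # (0.0, 2.0)
    out2 = (w * (x[0] + dx[0]), w * (x[1] + dx[1]))    # (-0.02, 2.02)
    naive = (out2[0] - out1[1], out2[1] - out1[0])     # (-2.02, 2.02)

    # Propagated: w*x' - w*x = w*(x' - x), so only the difference is scaled.
    prop = (w * dx[0], w * dx[1])                      # (-0.02, 0.02)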
ReluDiff was later improved to tighten the approximated difference intervals [49].
However, as mentioned previously, these methods do not support fully quantized
neural networks. Inspired by their work, we design a difference propagation algorithm
for our setting. We use $S^{in}(x^i_j)$ (resp. $S^{in}(\hat{x}^i_j)$) to denote the interval of the
$j$-th neuron in the $i$-th layer of the DNN (resp. QNN) before applying the ReLU
function (resp. clamp function), and use $S(x^i_j)$ (resp. $S(\hat{x}^i_j)$) to denote the
output interval after applying the ReLU function (resp. clamp function). We use
$\delta^{in}_i$ (resp. $\delta_i$) to denote the difference interval for the $i$-th layer before (resp.
after) applying the activation functions, and use $\delta^{in}_{i,j}$ (resp. $\delta_{i,j}$) to denote the
interval for the $j$-th neuron of the $i$-th layer. We denote by LB(·) and UB(·)
the concrete lower and upper bounds accordingly.
Based on the above notation, we give our difference propagation in Algorithm 1.
It works as follows. Given a DNN $N$, a QNN $\widehat{N}$, and a quantized input region
$R(\hat{x}, r)$, we first compute the intervals $S^{in}(x^i_j)$ and $S(x^i_j)$ for neurons in $N$
using the symbolic interval analysis of DeepPoly, and compute the intervals
$S^{in}(\hat{x}^i_j)$ and $S(\hat{x}^i_j)$ for neurons in $\widehat{N}$ using the concrete interval analysis
method of [22]. Remark that no symbolic interval analysis for QNNs exists. By
Definition 3, for each quantized input $\hat{x}$ of the QNN, we obtain the input for
the DNN as $x = \hat{x}/(C^{ub}_{in} - C^{lb}_{in})$. After precision alignment, we get the input
difference as $2^{-F_{in}}\hat{x} - x = (2^{-F_{in}} - 1/(C^{ub}_{in} - C^{lb}_{in}))\hat{x}$. Hence, given an input
region, we get the output difference of the input layer:
$\delta_1 = (2^{-F_{in}} - 1/(C^{ub}_{in} - C^{lb}_{in}))S(\hat{x}^1)$. Then, we compute the output difference
$\delta_i$ of each hidden layer iteratively by applying the affine
transformer and activation transformer given in Algorithm 2 and Algorithm 3.
Finally, we get the output difference for the output layer using only the affine
transformer.
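The overall loop can be sketched as follows (illustrative Python of ours; the per-layer transformers stand in for Algorithms 2 and 3):

    def diff_propagate(affine_diffs, act_diffs, delta_1):
        """Push the input-layer difference delta_1 through per-layer affine and
        activation difference transformers; the output layer is affine-only."""
        delta = delta_1
        for aff, act in zip(affine_diffs[:-1], act_diffs):
            delta = act(aff(delta))          # Algorithm 2, then Algorithm 3
        return affine_diffs[-1](delta)       # output layer: affine only

    # Toy usage: transformers that widen a difference interval by a fixed slack.
    widen = lambda s: (lambda d: (d[0] - s, d[1] + s))
    print(diff_propagate([widen(0.1)] * 3, [widen(0.0)] * 2, (-0.02, 0.02)))
    # (-0.32, 0.32)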
Affine Transformer. The difference before applying the activation function for
the $j$-th neuron in the $i$-th layer is
$$\delta^{in}_{i,j} = 2^{-F_h}\lfloor 2^{F_i}\hat{W}^i_{j,:}S(\hat{x}^{i-1}) + 2^{F_h - F_b}\hat{b}^i_j\rceil - W^i_{j,:}S(x^{i-1}) - b^i_j,$$
where $2^{-F_h}$ is used to align the precision between the outputs of the two
networks (cf. Sect. 2). Then, we soundly remove the rounding operators and
obtain constraints for the upper/lower bounds of $\delta^{in}_{i,j}$ as follows:
$$\mathrm{UB}(\delta^{in}_{i,j}) \le \mathrm{UB}\big(2^{-F_h}(2^{F_i}\hat{W}^i_{j,:}S(\hat{x}^{i-1}) + 2^{F_h-F_b}\hat{b}^i_j + 0.5) - W^i_{j,:}S(x^{i-1}) - b^i_j\big)$$
$$\mathrm{LB}(\delta^{in}_{i,j}) \ge \mathrm{LB}\big(2^{-F_h}(2^{F_i}\hat{W}^i_{j,:}S(\hat{x}^{i-1}) + 2^{F_h-F_b}\hat{b}^i_j - 0.5) - W^i_{j,:}S(x^{i-1}) - b^i_j\big)$$
Finally, we have $\mathrm{UB}(\delta^{in}_{i,j}) \le \mathrm{UB}\big(\widetilde{W}^i_{j,:}S(\widetilde{x}^{i-1}) - W^i_{j,:}S(x^{i-1})\big) + \Delta b^i_j + \xi$ and
$\mathrm{LB}(\delta^{in}_{i,j}) \ge \mathrm{LB}\big(\widetilde{W}^i_{j,:}S(\widetilde{x}^{i-1}) - W^i_{j,:}S(x^{i-1})\big) + \Delta b^i_j - \xi$, which can be further
reformulated as follows:
$$\mathrm{UB}(\delta^{in}_{i,j}) \le \mathrm{UB}\big(\widetilde{W}^i_{j,:}\delta_{i-1} + \Delta W^i_{j,:}S(x^{i-1})\big) + \Delta b^i_j + \xi$$
$$\mathrm{LB}(\delta^{in}_{i,j}) \ge \mathrm{LB}\big(\widetilde{W}^i_{j,:}\delta_{i-1} + \Delta W^i_{j,:}S(x^{i-1})\big) + \Delta b^i_j - \xi$$
where $S(\widetilde{x}^{i-1}) = 2^{-F_{in}}S(\hat{x}^{i-1})$ if $i = 2$ and $2^{-F_h}S(\hat{x}^{i-1})$ otherwise,
$\widetilde{W}^i_{j,:} = 2^{-F_w}\hat{W}^i_{j,:}$, $\Delta W^i_{j,:} = \widetilde{W}^i_{j,:} - W^i_{j,:}$, $\Delta b^i_j = 2^{-F_b}\hat{b}^i_j - b^i_j$, and
$\xi = 2^{-F_h - 1}$.
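Under interval arithmetic, these reformulated bounds can be evaluated per neuron as in this sketch of ours (weight rows as Python lists, intervals as (lo, hi) pairs):

    def ivl_dot(ws, ivls):
        """Interval of sum_k w_k * [lo_k, hi_k] for scalar weights w_k."""
        lo = sum(w * (a if w >= 0 else b) for w, (a, b) in zip(ws, ivls))
        hi = sum(w * (b if w >= 0 else a) for w, (a, b) in zip(ws, ivls))
        return lo, hi

    def affine_diff_bounds(Wt_row, dW_row, db, Fh, delta_prev, S_prev):
        """[LB, UB] of delta_in for one neuron: Wt*delta_{i-1} + dW*S(x^{i-1})
        + db, with the rounding slack xi = 2^(-Fh-1) added on both sides."""
        xi = 2.0 ** (-Fh - 1)
        l1, u1 = ivl_dot(Wt_row, delta_prev)
        l2, u2 = ivl_dot(dW_row, S_prev)
        return l1 + l2 + db - xi, u1 + u2 + db + xi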
Let $\hat{x}^d_g$ (resp. $x^d_g$) be the $g$-th output of $\widehat{N}$ (resp. $N$). We introduce a
real-valued variable $\eta$ and a Boolean variable $v$ such that $\eta = \max(2^{-F_h}\hat{x}^d_g - x^d_g, 0)$
can be encoded by the set $\Theta_g$ of constraints with an extremely large number $M$:
$$\Theta_g = \big\{\eta \ge 0,\ \eta \ge 2^{-F_h}\hat{x}^d_g - x^d_g,\ \eta \le M \cdot v,\ \eta \le 2^{-F_h}\hat{x}^d_g - x^d_g + M \cdot (1 - v)\big\}.$$
As a result, $|2^{-F_h}\hat{x}^d_g - x^d_g| \ge \epsilon$ iff the set of linear constraints
$\Theta_\epsilon = \Theta_g \cup \{2\eta - (2^{-F_h}\hat{x}^d_g - x^d_g) \ge \epsilon\}$ holds.
Finally, the quantization error bound verification problem is equivalent to
solving the constraints $\Theta_P = \Theta_N \cup \Theta_{\widehat{N}} \cup \Theta_R \cup \Theta_\epsilon$. Remark that the output
difference intervals of hidden neurons obtained from Algorithm 1 can be encoded
as linear constraints, which are added into the set $\Theta_P$ to boost the solving.
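With Gurobi as the back-end solver (cf. Sect. 5), this encoding can be written roughly as follows (a gurobipy sketch of ours; variable handles and the choice of M are illustrative):

    import gurobipy as gp
    from gurobipy import GRB

    def encode_theta(m: gp.Model, xq, x, Fh: int, eps: float, M: float = 1e6):
        """Encode eta = max(2^-Fh * xq - x, 0) via big-M, plus the condition
        |2^-Fh * xq - x| >= eps, following Theta_g and Theta_eps above."""
        d = 2.0 ** (-Fh) * xq - x            # linear expression of the difference
        eta = m.addVar(lb=0.0, name="eta")   # eta >= 0 via its lower bound
        v = m.addVar(vtype=GRB.BINARY, name="v")
        m.addConstr(eta >= d)
        m.addConstr(eta <= M * v)
        m.addConstr(eta <= d + M * (1 - v))
        m.addConstr(2 * eta - d >= eps)      # equivalent to |d| >= eps
        return eta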
as follows. Following DeepPoly, $\hat{a}^{i,\le}_{j,0}$ and $\hat{a}^{i,\ge}_{j,0}$ for the affine function of
$\hat{x}^i_{j,0}$ are defined with the rounding soundly removed, analogously to the affine
transformer above. For the min function $\hat{x}^i_{j,2} = \min(\hat{x}^i_{j,1}, C^{ub}_h)$ (cf. Fig. 3),
there are three cases:
– If $\hat{l}^i_{j,1} \ge C^{ub}_h$, then $\hat{a}^{i,\le}_{j,2} = \hat{a}^{i,\ge}_{j,2} = C^{ub}_h$ and $\hat{l}^i_{j,2} = \hat{u}^i_{j,2} = C^{ub}_h$;
– If $\hat{u}^i_{j,1} \le C^{ub}_h$, then $\hat{a}^{i,\le}_{j,2} = \hat{a}^{i,\le}_{j,1}$, $\hat{a}^{i,\ge}_{j,2} = \hat{a}^{i,\ge}_{j,1}$, $\hat{l}^i_{j,2} = \hat{l}^i_{j,1}$, and $\hat{u}^i_{j,2} = \hat{u}^i_{j,1}$;
Fig. 3. Convex approximation for the min function in QNNs. Figures 3(a) and 3(b)
show the two ways, where $\alpha = \frac{C^{ub}_h - \hat{l}^i_{j,1}}{\hat{u}^i_{j,1} - \hat{l}^i_{j,1}}$ and $\beta = \frac{\hat{u}^i_{j,1} - C^{ub}_h}{\hat{u}^i_{j,1} - \hat{l}^i_{j,1}}$.
– If $\hat{l}^i_{j,1} < C^{ub}_h \wedge \hat{u}^i_{j,1} > C^{ub}_h$, then $\hat{a}^{i,\ge}_{j,2} = \lambda\hat{x}^i_{j,1} + \mu$ and
$\hat{a}^{i,\le}_{j,2} = \frac{C^{ub}_h - \hat{l}^i_{j,1}}{\hat{u}^i_{j,1} - \hat{l}^i_{j,1}}\hat{x}^i_{j,1} + \frac{\hat{u}^i_{j,1} - C^{ub}_h}{\hat{u}^i_{j,1} - \hat{l}^i_{j,1}}\hat{l}^i_{j,1}$, where $(\lambda, \mu) \in \{(0, C^{ub}_h), (1, 0)\}$ is
chosen such that the area of the resulting approximation is minimal.
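A simplified sketch of ours for this min transformer (ignoring the rounding slack; linear bounds are returned as (slope, intercept) pairs over $\hat{x}^i_{j,1}$, and in the crossing case we fix the flat candidate upper line $y = C^{ub}_h$):

    def min_transformer(l, u, C_ub):
        """Lower/upper linear relaxations of y = min(x, C_ub) on [l, u],
        following the three cases above."""
        if l >= C_ub:                      # always saturated
            return (0.0, C_ub), (0.0, C_ub)
        if u <= C_ub:                      # never saturated: identity
            return (1.0, 0.0), (1.0, 0.0)
        alpha = (C_ub - l) / (u - l)       # crossing case: chord as lower bound
        beta = (u - C_ub) / (u - l)
        lower = (alpha, beta * l)          # alpha * x + beta * l
        upper = (0.0, C_ub)                # one of the two candidate upper lines
        return lower, upper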
From our abstract domain for QNNs, we obtain a symbolic interval analysis,
similar to the one for DNNs using DeepPoly, to replace Line 2 in Algorithm 1.
Then, the sound lower bound $\Delta l^{i,*}_{j,s}$ and upper bound $\Delta u^{i,*}_{j,s}$ of the difference
can be derived as follows, where $p = 2s$:
– $\Delta l^{i,*}_{j,s} = \mathrm{LB}(2^{-F_h}\hat{x}^i_{j,p} - x^i_{j,s}) = 2^{-F_h}\hat{a}^{i,\le,*}_{j,p} - a^{i,\ge,*}_{j,s}$;
– $\Delta u^{i,*}_{j,s} = \mathrm{UB}(2^{-F_h}\hat{x}^i_{j,p} - x^i_{j,s}) = 2^{-F_h}\hat{a}^{i,\ge,*}_{j,p} - a^{i,\le,*}_{j,s}$.
Given a quantized input $\hat{x}$ of the QNN $\widehat{N}$, the input difference of the two
networks is $2^{-F_{in}}\hat{x} - x = (2^{-F_{in}}C^{ub}_{in} - 1)x$. Therefore, we have
$\Delta^1_k = \widetilde{x}^1_k - x^1_k = 2^{-F_{in}}\hat{x}^1_k - x^1_k = (2^{-F_{in}}C^{ub}_{in} - 1)x^1_k$. Then, the lower
bound of the difference can be reformulated as follows, containing only the input
variables of the DNN $N$:
$$\Delta l^{i,*}_{j,s} = \Delta b^{l,*}_j + \sum_{k=1}^{m}(-w^{u,*}_k + 2^{-F_{in}}C^{ub}_{in}\widetilde{w}^{l,*}_k)x^1_k,$$
where $\Delta b^{l,*}_j = 2^{-F_h}\hat{b}^{l,*}_j - b^{u,*}_j$, $F^* = F_{in} - F_h$, $\Delta^1_k = \widetilde{x}^1_k - x^1_k$, and
$\widetilde{w}^{l,*}_k = 2^{F^*}\hat{w}^{l,*}_k$.
Similarly, we can reformulate the upper bound $\Delta u^{i,*}_{j,s}$ using the input
variables of the DNN:
$$\Delta u^{i,*}_{j,s} = \Delta b^{u,*}_j + \sum_{k=1}^{m}(-w^{l,*}_k + 2^{-F_{in}}C^{ub}_{in}\widetilde{w}^{u,*}_k)x^1_k,$$
where $\Delta b^{u,*}_j = 2^{-F_h}\hat{b}^{u,*}_j - b^{l,*}_j$, $F^* = F_{in} - F_h$, and $\widetilde{w}^{u,*}_k = 2^{F^*}\hat{w}^{u,*}_k$.
Finally, we compute the concrete input difference interval $\delta^{in}_{i,j}$ based on the
given input region as $\delta^{in}_{i,j} = [\mathrm{LB}(\Delta l^{i,*}_{j,0}), \mathrm{UB}(\Delta u^{i,*}_{j,0})]$, with which we can
replace the AffTrs functions in Algorithm 1 directly. An illustrating example is
given in [65].
5 Evaluation
We have implemented our method QEBVerif as an end-to-end tool written in
Python, where we use Gurobi [20] as our back-end MILP solver. All floating-
point numbers used in our tool are 32-bit. Experiments are conducted on a
96-core machine with an Intel(R) Xeon(R) Gold 6342 CPU @ 2.80 GHz and 1 TB
of main memory. We allow Gurobi to use up to 24 threads. The time limit for each
verification task is 1 h.
Benchmarks. We first build 45 × 4 QNNs from the 45 DNNs of ACAS Xu [26],
following a post-training quantization scheme [44] and using the quantization
configurations $C_{in} = \langle\pm, 8, 8\rangle$, $C_w = C_b = \langle\pm, Q, Q-2\rangle$, and
$C_h = \langle +, Q, Q-2\rangle$, where $Q \in \{4, 6, 8, 10\}$. We then train 5 DNNs with
different architectures on the MNIST dataset [31] and build 5 × 4 QNNs following
the same quantization scheme and configurations, except that we set
$C_{in} = \langle +, 8, 8\rangle$ and $C_w = \langle\pm, Q, Q-1\rangle$ for each DNN trained on MNIST.
Details on the networks
trained on the MNIST dataset are presented in Table 1. Column 1 gives the
name and architecture of each DNN, where Ablk_B means that the network
has A hidden layers, each with B neurons; Column 2 gives the number of
parameters in each DNN; and Columns 3–7 list the accuracies of these networks.
Hereafter, we denote by Px-y (resp. Ax-y) the QNN using the architecture Px
(i.e., the x-th DNN) and quantization bit size Q = y for MNIST (resp. ACAS Xu),
and by Px-Full (resp. Ax-Full) the DNN of architecture Px
for MNIST (resp. the x-th DNN in ACAS Xu).
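For intuition, post-training quantization under a configuration $\langle\pm/+, Q, F\rangle$ can be sketched as follows (our simplified reading; the exact scheme of [44] may differ in details):

    def quantize(w, signed, Q, F):
        """Round w * 2^F to an integer and clamp it to the Q-bit range."""
        lo, hi = (-(2 ** (Q - 1)), 2 ** (Q - 1) - 1) if signed else (0, 2 ** Q - 1)
        return max(lo, min(hi, round(w * (2 ** F))))

    # E.g., with C_w = <±, 4, 2>: weight 0.8 -> round(0.8 * 4) = 3, in [-8, 7].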
For each radius r ∈ {3, 6, 13, 19, 26}, the columns give H_Diff, O_Diff, and #S/T:

    Q=4:  Naive          | 270.5 0.70 15/0.47 | 423.7 0.99 9/0.52 | 1,182 4.49 0/0.67 | 6,110 50.91 0/0.79 | 18,255 186.6 0/0.81
          QEBVerif (Int) | 270.5 0.70 15/0.49 | 423.4 0.99 9/0.53 | 1,181 4.46 0/0.70 | 6,044 50.91 0/0.81 | 17,696 186.6 0/0.85
          QEBVerif (Sym) | 749.4 145.7 0/2.02 | 780.9 150.2 0/2.11 | 1,347 210.4 0/2.24 | 6,176 254.7 0/2.35 | 18,283 343.7 0/2.39
    Q=6:  Naive          | 268.3 1.43 5/0.47 | 557.2 4.00 0/0.51 | 1,258 6.91 0/0.67 | 6,145 53.29 0/0.77 | 18,299 189.0 0/0.82
          QEBVerif (Int) | 268.0 1.41 5/0.50 | 555.0 3.98 0/0.54 | 1,245 6.90 0/0.69 | 6,125 53.28 0/0.80 | 18,218 189.0 0/0.83
          QEBVerif (Sym) | 299.7 2.58 10/1.48 | 365.1 3.53 9/1.59 | 1,032 7.65 5/1.91 | 5,946 85.46 4/2.15 | 18,144 260.5 0/2.27
    Q=8:  Naive          | 397.2 3.57 0/0.47 | 587.7 5.00 0/0.51 | 1,266 7.90 0/0.67 | 6,160 54.27 0/0.78 | 18,308 190.0 0/0.81
          QEBVerif (Int) | 388.4 3.56 0/0.49 | 560.1 5.00 0/0.53 | 1,222 7.89 0/0.69 | 6,103 54.27 0/0.79 | 18,212 190.0 0/0.83
          QEBVerif (Sym) | 35.75 0.01 24/1.10 | 93.78 0.16 18/1.19 | 845.2 5.84 8/1.65 | 5,832 58.73 5/1.97 | 18,033 209.6 5/2.12
    Q=10: Naive          | 394.5 3.67 0/0.49 | 591.4 5.17 0/0.51 | 1,268 8.04 0/0.68 | 6,164 54.42 0/0.78 | 18,312 190.1 0/0.80
          QEBVerif (Int) | 361.9 3.67 0/0.50 | 546.2 5.17 0/0.54 | 1,209 8.04 0/0.68 | 6,083 54.42 0/0.79 | 18,182 190.1 0/0.83
          QEBVerif (Sym) | 15.55 0.01 25/1.04 | 54.29 0.06 22/1.15 | 764.6 4.53 9/1.52 | 5,780 57.21 5/1.91 | 18,011 228.7 5/2.08
For each architecture P1–P5, the columns give H_Diff, O_Diff, and #S/T:

    Q=4:  Naive          | 64.45 7.02 61/0.77 | 220.9 20.27 0/1.53 | 551.6 47.75 0/2.38 | 470.1 22.69 2/11.16 | 5,336 140.4 0/123.0
          QEBVerif (Int) | 32.86 6.65 63/0.78 | 194.8 20.27 0/1.54 | 530.9 47.75 0/2.40 | 443.3 22.69 2/11.23 | 5,275 140.4 0/123.4
          QEBVerif (Sym) | 32.69 3.14 88/1.31 | 134.9 7.11 49/2.91 | 313.8 14.90 1/5.08 | 365.2 11.11 35/22.28 | 1,864 50.30 1/310.2
    Q=6:  Naive          | 68.94 7.89 66/0.77 | 249.5 24.25 0/1.52 | 616.2 54.66 0/2.38 | 612.2 31.67 1/11.18 | 7,399 221.0 0/125.4
          QEBVerif (Int) | 10.33 2.19 115/0.78 | 89.66 12.81 14/1.54 | 466.0 52.84 0/2.39 | 307.6 20.22 5/11.28 | 7,092 221.0 0/125.1
          QEBVerif (Sym) | 10.18 1.46 130/1.34 | 55.73 3.11 88/2.85 | 131.3 5.33 70/4.72 | 158.5 3.99 102/21.85 | 861.9 12.67 22/279.9
    Q=8:  Naive          | 69.15 7.95 64/0.77 | 251.6 24.58 0/1.52 | 623.1 55.42 0/2.38 | 620.6 32.43 1/11.29 | 7,542 226.1 0/125.3
          QEBVerif (Int) | 4.27 0.89 135/0.78 | 38.87 5.99 66/1.54 | 320.1 40.84 0/2.39 | 134.0 8.99 50/11.24 | 7,109 226.1 0/125.7
          QEBVerif (Sym) | 4.13 1.02 136/1.35 | 34.01 2.14 108/2.82 | 82.90 3.48 86/4.61 | 96.26 2.39 128/21.45 | 675.7 6.20 27/273.6
    Q=10: Naive          | 69.18 7.96 65/0.77 | 252.0 24.63 0/1.52 | 624.0 55.55 0/2.36 | 620.4 32.40 1/11.19 | 7,559 226.9 0/124.2
          QEBVerif (Int) | 2.72 0.56 139/0.78 | 25.39 4.15 79/1.53 | 260.9 34.35 0/2.40 | 84.12 5.75 73/11.26 | 7,090 226.9 0/125.9
          QEBVerif (Sym) | 2.61 0.92 139/1.35 | 28.59 1.91 112/2.82 | 71.33 3.06 92/4.56 | 81.08 2.01 131/21.48 | 646.5 5.68 31/271.5
the abstract domain of the affine function in each hidden layer of QNNs is large
due to the small bit size, and (2) such errors can accumulate and magnify layer
by layer, in contrast to the naive approach, where we directly apply the interval
subtraction. We remark that symbolic-based reachability analysis methods for
DNNs become less accurate as the network gets deeper and the input region
gets larger. It means that for a large input region, the output intervals of
hidden/output neurons computed by symbolic interval analysis for DNNs can
be very large. However, the output intervals of their quantized counterparts are
always limited by the quantization grid limit, i.e., $[0, \frac{2^Q - 1}{2^{Q-2}}]$. Hence, the
difference intervals computed in Table 2 can be very conservative for large input
regions and deeper networks.
(a) ACAS Xu: Q = 4. (b) ACAS Xu: Q = 6. (c) ACAS Xu: Q = 8. (d) ACAS Xu: Q = 10.
differential reachability analysis, i.e., QEBVerif (Sym). The yellow bars give the
results by a full verification process in QEBVerif as shown in Fig. 2, i.e., we first
use DRA and then use MILP solving if DRA fails. The red bars are similar to
the yellow ones except that the linear constraints of the difference intervals of
hidden neurons obtained from DRA are added into the MILP encoding.
Overall, although DRA successfully proved most of the tasks (60.19% with
DRA alone), our MILP-based verification method can help further verify many
tasks on which DRA fails, namely, 85.67% with DRA+MILP and 88.59% with
DRA+MILP+Diff. Interestingly, we find that the effect of the added linear
constraints of the difference intervals on the MILP solving efficiency varies
across tasks. Our conjecture is that there are some heuristics in the Gurobi
solving algorithm for which the additional constraints may not always be helpful.
However, those difference linear constraints allow the MILP-based verification
method to verify more tasks, i.e., 79 more tasks in total.
(a) Robustness results for P1-4. (b) Errors for P1-4 under r = 3. (c) Errors for P1-4 under r = 5. (d) Errors for P1-4 under r = 7.
(e) Robustness results for P1-8. (f) Errors for P1-8 under r = 3. (g) Errors for P1-8 under r = 5. (h) Errors for P1-8 under r = 7.
The figure reports the robustness results and the quantization error interval
for each input region. By comparing the results of
P1-8 and P1-4, we observe that P1-8 is more robust than P1-4 w.r.t. the 90
input regions, and its quantization errors are also generally much smaller than
those of P1-4. Furthermore, we find that P1-8 remains consistently robust as the
radius increases, and its quantization error interval changes very little. However,
P1-4 becomes increasingly less robust as the radius increases and its quantiza-
tion error also increases significantly. Thus, we speculate that there may be some
correlation between network robustness and quantization error in QNNs. Specif-
ically, as the quantization bit size decreases, the quantization error increases
and the QNN becomes less robust. The reason we suspect “the fewer bits, the
less robust” is that with fewer bits, a perturbation may easily cause a significant
change in hidden neurons (i.e., the change is magnified by the loss of precision)
and consequently the output. Furthermore, the correlation between the quanti-
zation error bound and the empirical robustness of the QNN suggests that it is
indeed possible to apply our method to compute the quantization error bound
and use it as a guide for identifying the best quantization scheme which balances
the size of the model and its robustness.
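As a purely hypothetical illustration of this use case (not an experiment from the paper; quantize and verified_error_bound are assumed callables), one could sweep candidate bit sizes and keep the smallest one whose verified error bound stays within an application-specific tolerance:

    # Hypothetical sketch: pick the smallest bit size whose verified
    # quantization error bound meets a robustness-motivated tolerance.
    def pick_bit_size(dnn, regions, tolerance, quantize,
                      verified_error_bound, candidates=(4, 6, 8, 10)):
        for Q in sorted(candidates):            # smallest model first
            qnn = quantize(dnn, Q)
            worst = max(verified_error_bound(dnn, qnn, r) for r in regions)
            if worst <= tolerance:
                return Q, worst                 # balances size and robustness
        return None                             # no candidate is acceptable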
6 Related Work
There is a large and growing body of work on quality assurance tech-
niques for neural networks, including testing (e.g., [4–7,47,50,56,57,63,69]) and
formal verification (e.g., [2,8,12,13,15,19,24,29,30,32,34,37,38,51,54,55,58–60,
62,70]). Testing techniques are often effective in finding violations, but they
cannot prove their absence. While formal verification can prove their absence,
existing methods typically target real-valued neural networks, i.e., DNNs, and
are not effective in verifying quantization error bounds [48]. In this section, we
mainly discuss the existing verification techniques for QNNs.
Early work on formal verification of QNNs typically focuses on 1-bit quan-
tized neural networks (i.e., BNNs) [3,9,46,52,53,66,67]. Narodytska et al. [46]
first proposed to reduce the verification problem of BNNs to a satisfiability prob-
lem of a Boolean formula or an integer linear programming problem. Baluta
et al. [3] proposed a PAC-style quantitative analysis framework for BNNs via
approximate SAT model-counting solvers. Shih et al. proposed a quantitative
verification framework for BNNs [52,53] via a BDD learning-based method [45].
Zhang et al. [66,67] proposed a BDD-based verification framework for BNNs,
which exploits the internal structure of the BNNs to construct BDD models
instead of BDD learning. Giacobbe et al. [16] pushed this direction further by
introducing the first formal verification method for multiple-bit quantized DNNs
(i.e., QNNs), encoding the robustness verification problem as an SMT formula
over the quantifier-free first-order theory of bit-vectors. Later, Henzinger et
al. [22] explored several heuristics to improve the efficiency and scalability of [16].
Very recently, [40,68] proposed an ILP-based method and an MILP-based verifi-
cation method for QNNs, respectively, and both outperform the SMT-based ver-
ification approach [22]. Though these works can directly verify QNNs or BNNs,
they cannot verify quantization error bounds.
Most closely related to our work are approaches that explore the relationship
between two neural networks. Paulsen et al. [48,49] proposed differential
verification methods to verify two DNNs with the same network topology. This
idea has been extended to handle recurrent neural networks [41]. The difference
between [41,48,49] and our work has been discussed throughout this paper: they
handle only quantized weights and cannot handle quantized activation tensors.
Moreover, their methods are not complete, and thus may fail to prove tighter
error bounds. Semi-definite programming was used to analyze the differing
behaviors of DNNs and fully quantized QNNs [33]. Different from our work,
which focuses on verification, they aim at generating an upper bound for the
worst-case error induced by quantization. Furthermore, [33] only scales to tiny
QNNs, e.g., 1 input neuron, 1 output neuron, and 10 neurons per hidden layer
(up to 4 hidden layers). In comparison, our differential reachability analysis
scales to much larger QNNs, e.g., QNNs with 4,890 neurons.
7 Conclusion
In this work, we proposed a novel quantization error bound verification method
QEBVerif which is sound, complete, and arguably efficient. We implemented it as
an end-to-end tool and conducted thorough experiments on various QNNs with
different quantization bit sizes. Experimental results showed the effectiveness and
the efficiency of QEBVerif. We also investigated the potential correlation between
robustness and quantization errors for QNNs and found that as the quantization
error increases, the QNN might become less robust. For future work, it would be
interesting to extend the verification method to other activation functions
and network architectures, towards which this work makes a significant step.
References
1. Amir, G., Wu, H., Barrett, C.W., Katz, G.: An SMT-based approach for verifying
binarized neural networks. In: Proceedings of the 27th International Conference on
Tools and Algorithms for the Construction and Analysis of Systems, pp. 203–222
(2021)
2. Anderson, G., Pailoor, S., Dillig, I., Chaudhuri, S.: Optimization and abstraction:
a synergistic approach for analyzing neural network robustness. In: Proceedings
of the 40th ACM SIGPLAN Conference on Programming Language Design and
Implementation, pp. 731–744 (2019)
3. Baluta, T., Shen, S., Shinde, S., Meel, K.S., Saxena, P.: Quantitative verification
of neural networks and its security applications. In: Proceedings of the 2019 ACM
SIGSAC Conference on Computer and Communications Security, pp. 1249–1264
(2019)
4. Bu, L., Zhao, Z., Duan, Y., Song, F.: Taking care of the discretization problem:
a comprehensive study of the discretization problem and a black-box adversarial
attack in discrete integer domain. IEEE Trans. Dependable Secur. Comput. 19(5),
3200–3217 (2022)
5. Carlini, N., Wagner, D.A.: Towards evaluating the robustness of neural networks.
In: Proceedings of the 2017 IEEE Symposium on Security and Privacy, pp. 39–57
(2017)
6. Chen, G., et al.: Who is real Bob? Adversarial attacks on speaker recognition
systems. In: Proceedings of the 42nd IEEE Symposium on Security and Privacy,
pp. 694–711 (2021)
7. Chen, G., Zhao, Z., Song, F., Chen, S., Fan, L., Liu, Y.: AS2T: arbitrary source-to-
target adversarial attack on speaker recognition systems. IEEE Trans. Dependable
Secur. Comput., 1–17 (2022)
8. Chen, G., et al.: Towards understanding and mitigating audio adversarial examples
for speaker recognition. IEEE Trans. Dependable Secur. Comput., 1–17 (2022)
9. Choi, A., Shi, W., Shih, A., Darwiche, A.: Compiling neural networks into tractable
Boolean circuits. In: Proceedings of the AAAI Spring Symposium on Verification
of Neural Networks (2019)
10. Cousot, P., Cousot, R.: Abstract interpretation: a unified lattice model for static
analysis of programs by construction or approximation of fixpoints. In: Conference
Record of the Fourth ACM Symposium on Principles of Programming Languages,
pp. 238–252 (1977)
11. Duncan, K., Komendantskaya, E., Stewart, R., Lones, M.: Relative robustness
of quantized neural networks against adversarial attacks. In: Proceedings of the
International Joint Conference on Neural Networks, pp. 1–8 (2020)
12. Ehlers, R.: Formal verification of piece-wise linear feed-forward neural networks.
In: Proceedings of the 15th International Symposium on Automated Technology
for Verification and Analysis, pp. 269–286 (2017)
13. Elboher, Y.Y., Gottschlich, J., Katz, G.: An abstraction-based framework for neu-
ral network verification. In: Proceedings of the 32nd International Conference on
Computer Aided Verification, pp. 43–65 (2020)
14. Eykholt, K., et al.: Robust physical-world attacks on deep learning visual classifi-
cation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, pp. 1625–1634 (2018)
15. Gehr, T., Mirman, M., Drachsler-Cohen, D., Tsankov, P., Chaudhuri, S., Vechev,
M.T.: AI2 : safety and robustness certification of neural networks with abstract
interpretation. In: Proceedings of the IEEE Symposium on Security and Privacy,
pp. 3–18 (2018)
16. Giacobbe, M., Henzinger, T.A., Lechner, M.: How many bits does it take to quan-
tize your neural network? In: TACAS 2020. LNCS, vol. 12079, pp. 79–97. Springer,
Cham (2020). https://doi.org/10.1007/978-3-030-45237-7_5
17. Gong, R., et al.: Differentiable soft quantization: bridging full-precision and low-bit
neural networks. In: Proceedings of the IEEE/CVF International Conference on
Computer Vision, pp. 4851–4860 (2019)
18. Google: Tensorflow lite (2022). https://www.tensorflow.org/lite
19. Guo, X., Wan, W., Zhang, Z., Zhang, M., Song, F., Wen, X.: Eager falsification
for accelerating robustness verification of deep neural networks. In: Proceedings of
the 32nd IEEE International Symposium on Software Reliability Engineering, pp.
345–356 (2021)
20. Gurobi: a most powerful mathematical optimization solver (2018). https://www.
gurobi.com/
21. Han, S., Mao, H., Dally, W.J.: Deep compression: compressing deep neural network
with pruning, trained quantization and Huffman coding. In: Proceedings of the 4th
International Conference on Learning Representations (2016)
22. Henzinger, T.A., Lechner, M., Zikelic, D.: Scalable verification of quantized neural
networks. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol.
35, pp. 3787–3795 (2021)
23. Hinton, G., et al.: Deep neural networks for acoustic modeling in speech recognition:
the shared views of four research groups. IEEE Sig. Process. Mag. 29(6), 82–97
(2012)
24. Huang, X., Kwiatkowska, M., Wang, S., Wu, M.: Safety verification of deep neural
networks. In: Proceedings of the 29th International Conference on Computer Aided
Verification, pp. 3–29 (2017)
25. Jacob, B., et al.: Quantization and training of neural networks for efficient integer-
arithmetic-only inference. In: Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, pp. 2704–2713 (2018)
26. Julian, K.D., Kochenderfer, M.J., Owen, M.P.: Deep neural network compression
for aircraft collision avoidance systems. J. Guid. Control. Dyn. 42(3), 598–608
(2019)
27. Jung, S., et al.: Learning to quantize deep networks by optimizing quantization
intervals with task loss. In: Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition, pp. 4350–4359 (2019)
28. Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., Li, F.: Large-
scale video classification with convolutional neural networks. In: Proceedings of
2014 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1725–
1732 (2014)
29. Katz, G., Barrett, C.W., Dill, D.L., Julian, K., Kochenderfer, M.J.: Reluplex: an
efficient SMT solver for verifying deep neural networks. In: Proceedings of the 29th
International Conference on Computer Aided Verification, pp. 97–117 (2017)
30. Katz, G., et al.: The marabou framework for verification and analysis of deep neural
networks. In: Proceedings of the 31st International Conference on Computer Aided
Verification, pp. 443–452 (2019)
31. LeCun, Y., Cortes, C.: MNIST handwritten digit database (2010)
32. Li, J., Liu, J., Yang, P., Chen, L., Huang, X., Zhang, L.: Analyzing deep neural
networks with symbolic propagation: towards higher precision and faster verifica-
tion. In: Chang, B.-Y.E. (ed.) SAS 2019. LNCS, vol. 11822, pp. 296–319. Springer,
Cham (2019). https://doi.org/10.1007/978-3-030-32304-2_15
33. Li, J., Drummond, R., Duncan, S.R.: Robust error bounds for quantised and pruned
neural networks. In: Proceedings of the 3rd Annual Conference on Learning for
Dynamics and Control, pp. 361–372 (2021)
34. Li, R., et al.: Prodeep: a platform for robustness verification of deep neural net-
works. In: Proceedings of the 28th ACM Joint European Software Engineering
Conference and Symposium on the Foundations of Software Engineering, pp. 1630–
1634 (2020)
35. Lin, D.D., Talathi, S.S., Annapureddy, V.S.: Fixed point quantization of deep
convolutional networks. In: Proceedings of the 33rd International Conference on
Machine Learning, pp. 2849–2858 (2016)
36. Lin, J., Gan, C., Han, S.: Defensive quantization: when efficiency meets robustness.
In: Proceedings of the International Conference on Learning Representations (2019)
37. Liu, J., Xing, Y., Shi, X., Song, F., Xu, Z., Ming, Z.: Abstraction and refinement:
towards scalable and exact verification of neural networks. CoRR abs/2207.00759
(2022)
38. Liu, W., Song, F., Zhang, T., Wang, J.: Verifying ReLU neural networks from a
model checking perspective. J. Comput. Sci. Technol. 35(6), 1365–1381 (2020)
39. Lomuscio, A., Maganti, L.: An approach to reachability analysis for feed-forward
ReLU neural networks. CoRR abs/1706.07351 (2017)
40. Mistry, S., Saha, I., Biswas, S.: An MILP encoding for efficient verification of quan-
tized deep neural networks. IEEE Trans. Comput.-Aided Des. Integrated Circuits
Syst. (Early Access) (2022)
41. Mohammadinejad, S., Paulsen, B., Deshmukh, J.V., Wang, C.: DiffRNN: differen-
tial verification of recurrent neural networks. In: Proceedings of the 19th Interna-
tional Conference on Formal Modeling and Analysis of Timed Systems, pp. 117–134
(2021)
42. Moore, R.E., Kearfott, R.B., Cloud, M.J.: Introduction to Interval Analysis, vol.
110. SIAM (2009)
43. Nagel, M., Amjad, R.A., Van Baalen, M., Louizos, C., Blankevoort, T.: Up or
down? Adaptive rounding for post-training quantization. In: Proceedings of the
International Conference on Machine Learning, pp. 7197–7206 (2020)
44. Nagel, M., Fournarakis, M., Amjad, R.A., Bondarenko, Y., van Baalen, M.,
Blankevoort, T.: A white paper on neural network quantization. arXiv preprint
arXiv:2106.08295 (2021)
45. Nakamura, A.: An efficient query learning algorithm for ordered binary decision
diagrams. Inf. Comput. 201(2), 178–198 (2005)
46. Narodytska, N., Kasiviswanathan, S.P., Ryzhyk, L., Sagiv, M., Walsh, T.: Veri-
fying properties of binarized deep neural networks. In: Proceedings of the AAAI
Conference on Artificial Intelligence, pp. 6615–6624 (2018)
47. Odena, A., Olsson, C., Andersen, D.G., Goodfellow, I.J.: TensorFuzz: debugging
neural networks with coverage-guided fuzzing. In: Proceedings of the 36th Inter-
national Conference on Machine Learning, pp. 4901–4911 (2019)
48. Paulsen, B., Wang, J., Wang, C.: ReluDiff: differential verification of deep neural
networks. In: 2020 IEEE/ACM 42nd International Conference on Software Engi-
neering (ICSE), pp. 714–726. IEEE (2020)
49. Paulsen, B., Wang, J., Wang, J., Wang, C.: NeuroDiff: scalable differential verifi-
cation of neural networks using fine-grained approximation. In: Proceedings of the
35th IEEE/ACM International Conference on Automated Software Engineering,
pp. 784–796 (2020)
50. Pei, K., Cao, Y., Yang, J., Jana, S.: DeepXplore: automated whitebox testing
of deep learning systems. In: Proceedings of the 26th Symposium on Operating
Systems Principles, pp. 1–18 (2017)
51. Pulina, L., Tacchella, A.: An abstraction-refinement approach to verification of
artificial neural networks. In: Proceedings of the 22nd International Conference on
Computer Aided Verification, pp. 243–257 (2010)
52. Shih, A., Darwiche, A., Choi, A.: Verifying binarized neural networks by Angluin-
style learning. In: Janota, M., Lynce, I. (eds.) SAT 2019. LNCS, vol. 11628, pp.
354–370. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-24258-9_25
53. Shih, A., Darwiche, A., Choi, A.: Verifying binarized neural networks by local
automaton learning. In: Proceedings of the AAAI Spring Symposium on Verifica-
tion of Neural Networks (2019)
54. Singh, G., Ganvir, R., Püschel, M., Vechev, M.T.: Beyond the single neuron convex
barrier for neural network certification. In: Proceedings of the Annual Conference
on Neural Information Processing Systems, pp. 15072–15083 (2019)
55. Singh, G., Gehr, T., Püschel, M., Vechev, M.T.: An abstract domain for certifying
neural networks. Proc. ACM Program. Lang. (POPL) 3, 41:1–41:30 (2019)
56. Song, F., Lei, Y., Chen, S., Fan, L., Liu, Y.: Advanced evasion attacks and mitiga-
tions on practical ml-based phishing website classifiers. Int. J. Intell. Syst. 36(9),
5210–5240 (2021)
57. Tian, Y., Pei, K., Jana, S., Ray, B.: DeepTest: automated testing of deep-neural-
network-driven autonomous cars. In: Proceedings of the 40th International Con-
ference on Software Engineering, pp. 303–314 (2018)
58. Tran, H.-D., Bak, S., Xiang, W., Johnson, T.T.: Verification of deep convolutional
neural networks using ImageStars. In: Lahiri, S.K., Wang, C. (eds.) CAV 2020.
LNCS, vol. 12224, pp. 18–42. Springer, Cham (2020). https://doi.org/10.1007/
978-3-030-53288-8_2
59. Tran, H., et al.: Star-based reachability analysis of deep neural networks. In: Pro-
ceedings of the 3rd World Congress on Formal Methods, pp. 670–686 (2019)
60. Wang, S., Pei, K., Whitehouse, J., Yang, J., Jana, S.: Formal security analysis
of neural networks using symbolic intervals. In: Proceedings of the 27th USENIX
Security Symposium, pp. 1599–1614 (2018)
61. WikiChip: FSD chip - tesla. https://en.wikichip.org/wiki/tesla_(car_company)/
fsd_chip. Accessed 30 Apr 2022
62. Yang, P., et al.: Improving neural network verification through spurious region
guided refinement. In: Groote, J.F., Larsen, K.G. (eds.) Proceedings of 27th Inter-
national Conference on Tools and Algorithms for the Construction and Analysis
of Systems, pp. 389–408 (2021)
63. Zhang, J.M., Harman, M., Ma, L., Liu, Y.: Machine learning testing: survey, land-
scapes and horizons. IEEE Trans. Software Eng. 48(2), 1–36 (2022)
64. Zhang, Y., Song, F., Sun, J.: QEBVerif (2023). https://github.com/S3L-official/
QEBVerif
65. Zhang, Y., Song, F., Sun, J.: QEBVerif: quantization error bound verification of
neural networks. CoRR abs/2212.02781 (2023)
66. Zhang, Y., Zhao, Z., Chen, G., Song, F., Chen, T.: BDD4BNN: a BDD-based
quantitative analysis framework for binarized neural networks. In: Silva, A., Leino,
K.R.M. (eds.) CAV 2021. LNCS, vol. 12759, pp. 175–200. Springer, Cham (2021).
https://doi.org/10.1007/978-3-030-81685-8_8
67. Zhang, Y., Zhao, Z., Chen, G., Song, F., Chen, T.: Precise quantitative analysis
of binarized neural networks: a BDD-based approach. ACM Trans. Software Eng.
Methodol. 32(3) (2023)
68. Zhang, Y., et al.: QVIP: an ILP-based formal verification approach for quantized
neural networks. In: Proceedings of the 37th IEEE/ACM International Conference
on Automated Software Engineering, pp. 82:1–82:13 (2023)
69. Zhao, Z., Chen, G., Wang, J., Yang, Y., Song, F., Sun, J.: Attack as defense:
characterizing adversarial examples using robustness. In: Proceedings of the 30th
ACM SIGSOFT International Symposium on Software Testing and Analysis, pp.
42–55 (2021)
70. Zhao, Z., Zhang, Y., Chen, G., Song, F., Chen, T., Liu, J.: CLEVEREST: acceler-
ating CEGAR-based neural network verification via adversarial attacks. In: Singh,
G., Urban, C. (eds.) Proceedings of the 29th International Symposium on Static
Analysis, pp. 449–473 (2022). https://doi.org/10.1007/978-3-031-22308-2_20
Verifying Generalization in Deep Learning
Guy Amir(B) , Osher Maayan, Tom Zelazny, Guy Katz, and Michael Schapira
1 Introduction
Over the past decade, deep learning [35] has achieved state-of-the-art results
in natural language processing, image recognition, game playing, computational
biology, and many additional fields [4,18,21,45,50,84,85]. However, despite its
impressive success, deep learning still suffers from severe drawbacks that limit
its applicability in domains that involve mission-critical tasks or highly variable
inputs.
One such crucial limitation is the notorious difficulty of deep neural networks
(DNNs) to generalize to new input domains, i.e., their tendency to perform
poorly on inputs that significantly differ from those encountered while training.
During training, a DNN is presented with input data sampled from a specific dis-
tribution over some input domain (“in-distribution” inputs). The induced DNN-
based rules may fail in generalizing to inputs not encountered during training
due to (1) the DNN being invoked “out-of-distribution” (OOD), i.e., when there
is a mismatch between the distribution over inputs in the training data and in
the DNN’s operational data; (2) some inputs not being sufficiently represented
in the finite training data (e.g., various low-probability corner cases); and (3)
“overfitting” the decision rule to the training data.
A notable example of the importance of establishing the generalizability of
DNN-based decisions lies in recently proposed applications of deep reinforce-
ment learning (DRL) [56] to real-world systems. Under DRL, an agent, realized
as a DNN, is trained by repeatedly interacting with its environment to learn a
decision-making policy that attains high performance with respect to a certain
objective (“reward ”). DRL has recently been applied to many real-world chal-
lenges [20,44,54,55,64–67,96,108]. In many application domains, the learned
policy is expected to perform well across a daunting breadth of operational
environments, whose diversity cannot possibly be captured in the training data.
Further, the cost of erroneous decisions can be dire. Our discussion of DRL-based
Internet congestion control (see Sect. 4.3) illustrates this point.
Here, we present a methodology for identifying DNN-based decision rules
that generalize well to all possible distributions over an input domain of interest.
Our approach hinges on the following key observation. DNN training in general,
and DRL policy training in particular, incorporate multiple stochastic aspects,
such as the initialization of the DNN’s weights and the order in which inputs
are observed during training. Consequently, even when DNNs with the same
architecture are trained to perform an identical task on the same data, somewhat
different decision rules will typically be learned. Paraphrasing Tolstoy’s Anna
Karenina [93], we argue that “successful decision rules are all alike; but every
unsuccessful decision rule is unsuccessful in its own way”. Differently put, when
examining the decisions by several independently trained DNNs on a certain
input, these are likely to agree only when their (similar) decisions yield high
performance.
In light of the above, we propose the following heuristic for generating DNN-
based decision rules that generalize well to an entire given domain of inputs:
independently train multiple DNNs, and then seek a subset of these DNNs that
are in strong agreement across all possible inputs in the considered input domain
(implying, by our hypothesis, that these DNNs’ learned decision rules generalize
well to all probability distributions over this domain). Our evaluation demon-
strates (see Sect. 4) that this methodology is extremely powerful and enables
distilling from a collection of decision rules the few that indeed generalize better
to inputs within this domain. Since our heuristic seeks DNNs whose decisions
are in agreement for each and every input in a specific domain, the decision rules
reached this way achieve robustly high generalization across different possible
distributions over inputs in this domain.
Since our methodology involves contrasting the outputs of different DNNs
over possibly infinite input domains, using formal verification is natural. To
this end, we build on recent advances in formal verification of DNNs [2,12,14,
16,27,60,78,86,102]. DNN verification literature has focused on establishing the
local adversarial robustness of DNNs, i.e., seeking small input perturbations that
result in misclassification by the DNN [31,36,61]. Our approach broadens the
applicability of DNN verification by demonstrating, for the first time (to the best
of our knowledge), how it can also be used to identify DNN-based decision rules
that generalize well. More specifically, we show how, for a given input domain,
a DNN verifier can be utilized to assign a score to a DNN reflecting its level
of agreement with other DNNs across the entire input domain. This enables
iteratively pruning the set of candidate DNNs, eventually keeping only those in
strong agreement, which tend to generalize well.
To evaluate our methodology, we focus on three popular DRL benchmarks:
(i) Cartpole, which involves controlling a cart while balancing a pendulum; (ii)
Mountain Car, which involves controlling a car that needs to escape a valley;
and (iii) Aurora, an Internet congestion controller.
Aurora is a particularly compelling example for our approach. While Aurora
is intended to tame network congestion across a vast diversity of real-world
Internet environments, Aurora is trained only on synthetically generated data.
Thus, to deploy Aurora in the real world, it is critical to ensure that its policy
is sound for numerous scenarios not captured by its training inputs.
Our evaluation results show that, in all three settings, our verification-driven
approach is successful at ranking DNN-based DRL policies according to their
ability to generalize well to out-of-distribution inputs. Our experiments also
demonstrate that formal verification is superior to gradient-based methods and
predictive uncertainty methods. These results showcase the potential of our app-
roach. Our code and benchmarks are publicly available as an artifact accompa-
nying this work [8].
The rest of the paper is organized as follows. Section 2 contains background
on DNNs, DRLs, and DNN verification. In Sect. 3 we present our verification-
based methodology for identifying DNNs that successfully generalize to OOD
inputs. We present our evaluation in Sect. 4. Related work is covered in Sect. 5,
and we conclude in Sect. 6.
2 Background
For additional background on DNNs and their training, see [35]. Figure 1
depicts a toy DNN. For input V1 = [1, 2]^T, the second layer computes the
(weighted sum) V2 = [10, −1]^T. The ReLU functions are subsequently applied
in the third layer, and the result is V3 = [10, 0]^T. Finally, the network's single
output is V4 = [20].
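These few lines reproduce the toy computation (our illustration; since Fig. 1 is not visible in this excerpt, the weight matrices below are chosen by us to match the stated values):

    import numpy as np

    # Toy forward pass matching the values above; the weights are our own
    # choice to reproduce the stated V2/V3/V4.
    W1 = np.array([[2.0, 4.0],
                   [1.0, -1.0]])        # second (weighted-sum) layer
    W2 = np.array([[2.0, 5.0]])         # output layer

    v1 = np.array([1.0, 2.0])           # V1 = [1, 2]^T
    v2 = W1 @ v1                        # V2 = [10, -1]^T
    v3 = np.maximum(v2, 0.0)            # ReLU: V3 = [10, 0]^T
    v4 = W2 @ v3                        # V4 = [20]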
Deep Reinforcement Learning (DRL) [56] is a machine learning paradigm,
in which a DRL agent, implemented as a DNN, interacts with an environment
across discrete time-steps t = 0, 1, 2, . . .. At each time-step, the agent is presented
with the environment's state st ∈ S, and selects an action N(st) = at ∈ A.
The environment then transitions to its next state st+1, and presents the agent
with the reward rt for its previous action. The agent is trained through repeated
interactions with its environment to maximize the expected cumulative discounted
reward Rt = E[Σt γ^t · rt], where γ ∈ (0, 1] is termed the discount factor [38,
82,90,91,97,107].
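For instance, the discounted return of a finite reward trace can be computed as follows (a minimal illustration of the formula above, not code from the paper):

    # Minimal illustration: discounted return R = sum_t gamma^t * r_t.
    def discounted_return(rewards, gamma=0.99):
        return sum(gamma ** t * r for t, r in enumerate(rewards))

    discounted_return([1.0, 1.0, 1.0])   # 1 + 0.99 + 0.99**2 = 2.9701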
DNN and DRL Verification. A sound DNN verifier [46] receives as input
(i) a trained DNN N ; (ii) a precondition P on the DNN’s inputs, limiting the
possible assignments to a domain of interest; and (iii) a postcondition Q on
the DNN’s outputs, limiting the possible outputs of the DNN. The verifier can
reply in one of two ways: (i) SAT, with a concrete input x′ for which P(x′) ∧
Q(N(x′)) is satisfied; or (ii) UNSAT, indicating that no such x′ exists.
Typically, Q encodes the negation of N's desirable behavior for inputs that
satisfy P. Thus, a SAT result indicates that the DNN errs, and that x′ triggers
a bug; whereas an UNSAT result indicates that the DNN performs as intended.
An example of this process appears in Appendix B of our extended paper [7].
To date, a plethora of verification approaches have been proposed for general,
feed-forward DNNs [3,31,41,46,61,99], as well as DRL-based agents that operate
within reactive environments [5,9,15,22,28].
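Schematically, such a query can be wrapped as follows (a sketch in our own notation; verifier.solve and the result fields are assumptions, not any specific tool's API):

    # Sketch of a sound DNN verifier query. Q typically negates the desired
    # behavior, so SAT exposes a counterexample and UNSAT proves the property.
    def check_property(verifier, network, P, Q):
        result = verifier.solve(network, precondition=P, postcondition=Q)
        if result.status == "SAT":
            # result.witness is a concrete x' satisfying P(x') and Q(N(x'))
            return "bug", result.witness
        return "verified", None             # UNSAT: no such x' exists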
The definition captures the notion that for any input in Ψ , N1 and N2 pro-
duce outputs that are at most α-distance apart. A small α value indicates that
the outputs of N1 and N2 are close for all inputs in Ψ , whereas a high value
indicates that there exists an input in Ψ for which the decision models diverge
significantly.
To compute PDT values, our approach employs verification to conduct a
binary search for the maximum distance between the outputs of two DNNs; see
Algorithm 1.
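Since Algorithm 1 itself is not reproduced in this excerpt, the following is a minimal sketch of such a binary search, assuming a verifier callback can_diverge(alpha) that answers True iff some input in Ψ drives the two networks more than alpha apart:

    # Sketch of the PDT binary search (in the spirit of Algorithm 1; the
    # callback and parameter names are our assumptions).
    def pairwise_disagreement_threshold(can_diverge, M, accuracy=1e-3):
        lo, hi = 0.0, M                 # M: user-supplied upper bound
        while hi - lo > accuracy:
            alpha = (lo + hi) / 2
            if can_diverge(alpha):      # some input exceeds distance alpha
                lo = alpha              # the true maximum lies above alpha
            else:
                hi = alpha              # alpha already bounds the difference
        return hi                       # sound upper estimate of the PDT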
Intuitively, the disagreement score measures how much a single decision model
tends to disagree with the remaining models, on average.
Using disagreement scores, our heuristic employs an iterative scheme for
selecting a subset of models that generalize to OOD scenarios—as encoded by
inputs in Ψ (see Algorithm 2). First, a set of k DNNs {N1, N2, . . . , Nk} is inde-
pendently trained on the training data. Next, a backend verifier is invoked to
calculate, for each of the (k choose 2) DNN-based model pairs, their respective
pairwise-disagreement threshold (up to some accuracy). Then, our algorithm iteratively:
(i) calculates the disagreement score for each model in the remaining subset of
models; (ii) identifies the models with the (relative) highest DS scores; and (iii)
removes them (Line 9 in Algorithm 2). The algorithm terminates after exceed-
ing a user-defined number of iterations (Line 3 in Algorithm 2), or when the
remaining models “agree” across the input domain, as indicated by nearly iden-
tical disagreement scores (Line 7 in Algorithm 2). We note that the algorithm
is also given an upper bound (M) on the maximum difference, informed by the
user’s domain-specific knowledge.
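A condensed sketch of this selection loop, under the assumption that the pairwise-disagreement thresholds have already been computed (the paper's Algorithm 2 additionally handles details omitted here):

    import statistics

    # Condensed sketch of the iterative selection scheme; pdt maps each
    # unordered model pair to its precomputed pairwise-disagreement
    # threshold over the input domain Psi.
    def select_models(models, pdt, max_iters, tie_tolerance=1e-6):
        alive = set(models)
        for _ in range(max_iters):
            if len(alive) <= 1:
                break
            # (i) disagreement score: average PDT w.r.t. the other models
            ds = {m: statistics.mean(pdt[frozenset({m, n})]
                                     for n in alive if n != m)
                  for m in alive}
            # stop when the survivors essentially agree across the domain
            if max(ds.values()) - min(ds.values()) <= tie_tolerance:
                break
            # (ii)+(iii) remove the model(s) with the highest score
            worst = max(ds.values())
            alive -= {m for m, s in ds.items() if s == worst}
        return alive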
4 Evaluation
We extensively evaluated our method using three DRL benchmarks. As discussed
in the introduction, verifying the generalizability of DRL-based systems is impor-
tant since such systems are often expected to provide robustly high performance
4.1 Cartpole
Cartpole [33] is a well-known RL benchmark in which an agent controls the
movement of a cart with an upside-down pendulum (“pole”) attached to its top.
The cart moves on a platform and the agent's goal is to keep the pole balanced
for as long as possible (see Fig. 2).
Fig. 2. Cartpole: in-distribution setting (blue) and OOD setting (red). (Color figure online)
Agent and Environment. The agent’s inputs are s = (x, vx , θ, vθ ), where x
represents the cart’s location on the platform, θ represents the pole’s angle (i.e.,
|θ| ≈ 0 for a balanced pole, |θ| ≈ 90◦ for an unbalanced pole), vx represents the
cart’s horizontal velocity and vθ represents the pole’s angular velocity.
In-Distribution Inputs. During training, the agent is incentivized to balance
the pole, while staying within the platform’s boundaries. In each iteration, the
agent’s single output indicates the cart’s acceleration (sign and magnitude) for
the next step. During training, we defined the platform’s bounds to be [−2.4, 2.4],
and the cart’s initial position as near-static, and close to the center of the plat-
form (left-hand side of Fig. 2). This was achieved by drawing the cart’s initial
state vector values uniformly from the range [−0.05, 0.05].
(OOD) Input Domain. We consider an input domain with larger platforms
than the ones used in training. To wit, we now allow the x coordinate of the
input vectors to cover a wider range of [−10, 10]. For the other inputs, we used
the same bounds as during the training. See [7] for additional details.
Evaluation. We trained k = 16 models, all of which achieved high rewards
during training on the short platform. Next, we ran Algorithm 2 until
convergence (7 iterations, in our experiments) on the aforementioned input
domain, resulting in a set of 3 models. We then tested all 16 original models
using (OOD) inputs drawn from the new domain, such that the generated
distribution encodes a novel setting: the cart is now placed at the center of a
much longer, shifted platform (see the red cart in Fig. 2).
Fig. 3. Cartpole: Algorithm 2's results, per iteration: the bars reflect the ratio between the good/bad models (left y-axis) in the surviving set of models, and the curve indicates the number of surviving models (right y-axis).
All other parameters in the OOD environment were identical to those used
for the original training. Figure 9 (in [7]) depicts the results of evaluating the
models using 20,000 OOD instances. Of the original 16 models, 11 scored a low-
to-mediocre average reward, indicating their poor ability to generalize to this
new distribution. Only 5 models obtained high reward values, including the 3
models identified by Algorithm 2; thus implying that our method was able to
effectively remove all 11 models that would have otherwise performed poorly in
this OOD setting (see Fig. 3). For additional information, see [7].
Evaluation. We ran our algorithm and scored the models based on their dis-
agreement over this large domain, which includes inputs they had not encoun-
tered during training, representing the aforementioned novel link conditions.
Experiment (1): High Packet Loss. In this experiment, we trained over 100
Aurora agents in the original (in-distribution) environment. Out of these, we
selected k = 16 agents that achieved a high average reward in-distribution (see
Fig. 20a in [7]). Next, we evaluated these agents on OOD inputs that are included
in the previously described domain. The main difference between the training
distribution and the new (OOD) ones is the possibility of extreme packet loss
rates upon initialization.
Our evaluation over the OOD inputs, within the domain, indicates that
although all 16 models performed well in-distribution, only 7 agents could suc-
cessfully handle such OOD inputs (see Fig. 20b in [7]). When we ran Algorithm 2
on the 16 models, it was able to filter out all 9 models that generalized poorly
on the OOD inputs (see Fig. 4). In particular, our method returned model {16},
which is the best-performing model according to our simulations. We note that
in the first iterations, the four models to be filtered out were models {1, 2, 6, 13},
which are indeed the four worst-performing models on the OOD inputs (see
Appendix F of [7]).
Fig. 4. (a) Reward statistics of remaining models. (b) Ratio between good/bad models.
5 Related Work
Recently, a plethora of approaches and tools have been put forth for ensur-
ing DNN correctness [2,6,10,15,19,24–27,29,31,32,34,36,37,41–43,46–49,52,
57,61,70,76,81,83,86,87,89,92,94,95,98,100,102,104,106], including techniques
for DNN shielding [60], optimization [14,88], quantitative verification [16],
abstraction [12,13,73,78,86,105], size reduction [77], and more. Non-verification
techniques, including runtime monitoring [39], ensembles [71,72,80,103], and
additional methods [75], have been utilized for OOD input detection.
In contrast to the above approaches, we aim to establish generalization guar-
antees with respect to an entire input domain (spanning all distributions across
this domain). In addition, to the best of our knowledge, ours is the first attempt
to exploit variability across models for distilling a subset thereof, with improved
generalization capabilities. In particular, it is also the first approach to apply
formal verification for this purpose.
6 Conclusion
This work describes a novel, verification-driven approach for identifying DNN
models that generalize well to an input domain of interest. We presented an
iterative scheme that employs a backend DNN verifier, allowing us to score
models based on their ability to produce similar outputs on the given domain.
We demonstrated extensively that this approach indeed distills models capable
of good generalization. As DNN verification technology matures, our approach
will become increasingly scalable, and also applicable to a wider variety of DNNs.
Acknowledgements. The work of Amir, Zelazny, and Katz was partially supported
by the Israel Science Foundation (grant number 683/18). The work of Amir was sup-
ported by a scholarship from the Clore Israel Foundation. The work of Maayan and
Schapira was partially supported by funding from Huawei.
References
1. Abdar, M., et al.: A review of uncertainty quantification in deep learning: tech-
niques, applications and challenges. Inf. Fusion 76, 243–297 (2021)
2. Alamdari, P., Avni, G., Henzinger, T., Lukina, A.: Formal methods with a touch
of magic. In: Proceedings 20th International Conference on Formal Methods in
Computer-Aided Design (FMCAD), pp. 138–147 (2020)
3. Albarghouthi, A.: Introduction to Neural Network Verification (2021). verifieddeeplearning.com
4. AlQuraishi, M.: AlphaFold at CASP13. Bioinformatics 35(22), 4862–4865 (2019)
5. Amir, G., et al.: Verifying learning-based robotic navigation systems. In: Sankara-
narayanan, S., Sharygina, N. (eds.) Proceedings 29th International Conference on
Tools and Algorithms for the Construction and Analysis of Systems (TACAS),
pp. 607–627. Springer, Cham (2023). https://doi.org/10.1007/978-3-031-30823-9_31
6. Amir, G., Freund, Z., Katz, G., Mandelbaum, E., Refaeli, I.: veriFIRE: verify-
ing an industrial, learning-based wildfire detection system. In: Proceedings 25th
International Symposium on Formal Methods (FM), pp. 648–656. Springer, Cham
(2023). https://doi.org/10.1007/978-3-031-27481-7_38
7. Amir, G., Maayan, O., Zelazny, O., Katz, G., Schapira, M.: Verifying generaliza-
tion in deep learning. Technical report (2023). https://arxiv.org/abs/2302.05745
8. Amir, G., Maayan, O., Zelazny, T., Katz, G., Schapira, M.: Verifying general-
ization in deep learning: artifact (2023). https://zenodo.org/record/7884514#.ZFAz_3ZBy3B
9. Amir, G., Schapira, M., Katz, G.: Towards scalable verification of deep reinforce-
ment learning. In: Proceedings 21st Internationl Conference on Formal Methods
in Computer-Aided Design (FMCAD), pp. 193–203 (2021)
10. Amir, G., Wu, H., Barrett, C., Katz, G.: An SMT-based approach for verify-
ing binarized neural networks. In: TACAS 2021. LNCS, vol. 12652, pp. 203–222.
Springer, Cham (2021). https://doi.org/10.1007/978-3-030-72013-1_11
11. Amir, G., Zelazny, T., Katz, G., Schapira, M.: Verification-aided deep ensemble
selection. In: Proceedings 22nd International Conference on Formal Methods in
Computer-Aided Design (FMCAD), pp. 27–37 (2022)
12. Anderson, G., Pailoor, S., Dillig, I., Chaudhuri, S.: Optimization and abstraction:
a synergistic approach for analyzing neural network robustness. In: Proceedings
40th ACM SIGPLAN Conference on Programming Languages Design and Imple-
mentations (PLDI), pp. 731–744 (2019)
13. Ashok, P., Hashemi, V., Kretinsky, J., Mohr, S.: DeepAbstract: neural network
abstraction for accelerating verification. In: Proceedings 18th International Sym-
posium on Automated Technology for Verification and Analysis (ATVA), pp.
92–107 (2020)
14. Avni, G., Bloem, R., Chatterjee, K., Henzinger, T.A., Könighofer, B., Pranger,
S.: Run-time optimization for learned controllers through quantitative games. In:
Dillig, I., Tasiran, S. (eds.) CAV 2019. LNCS, vol. 11561, pp. 630–649. Springer,
Cham (2019). https://doi.org/10.1007/978-3-030-25540-4_36
15. Bacci, E., Giacobbe, M., Parker, D.: Verifying reinforcement learning up to infin-
ity. In: Proceedings 30th International Joint Conference on Artificial Intelligence
(IJCAI) (2021)
16. Baluta, T., Shen, S., Shinde, S., Meel, K., Saxena, P.: Quantitative verification
of neural networks and its security applications. In: Proceedings ACM SIGSAC
Conference on Computer and Communications Security (CCS), pp. 1249–1264
(2019)
17. Barto, A., Sutton, R., Anderson, C.: Neuronlike adaptive elements that can solve
difficult learning control problems. In: Proceedings of IEEE Systems Man and
Cybernetics Conference (SMC), pp. 834–846 (1983)
18. Bojarski, M., et al.: End to end learning for self-driving cars. Technical report
(2016). http://arxiv.org/abs/1604.07316
19. Bunel, R., Turkaslan, I., Torr, P., Kohli, P., Mudigonda, P.: A unified view of
piecewise linear neural network verification. In: Proceedings 32nd Conference on
Neural Information Processing Systems (NeurIPS), pp. 4795–4804 (2018)
20. Chen, W., Xu, Y., Wu, X.: Deep reinforcement learning for multi-resource
multi-machine job scheduling. Technical report (2017). http://arxiv.org/abs/
1711.07440
21. Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., Kuksa, P.:
Natural language processing (almost) from scratch. J. Mach. Learn. Res. (JMLR)
12, 2493–2537 (2011)
22. Corsi, D., Marchesini, E., Farinelli, A.: Formal verification of neural networks for
safety-critical tasks in deep reinforcement learning. In: Proceedings 37th Confer-
ence on Uncertainty in Artificial Intelligence (UAI), pp. 333–343 (2021)
23. Dietterich, T.: Ensemble methods in machine learning. In: Proceedings 1st Inter-
national Workshop on Multiple Classifier Systems (MCS), pp. 1–15 (2000)
24. Dong, G., Sun, J., Wang, J., Wang, X., Dai, T.: Towards repairing neural networks
correctly. Technical report (2020). http://arxiv.org/abs/2012.01872
25. Dutta, S., Chen, X., Sankaranarayanan, S.: Reachability analysis for neural feed-
back systems using regressive polynomial rule inference. In: Proceedings 22nd
ACM International Conference on Hybrid Systems: Computation and Control
(HSCC), pp. 157–168 (2019)
26. Dutta, S., Jha, S., Sankaranarayanan, S., Tiwari, A.: Learning and verification of
feedback control systems using feedforward neural networks. IFAC-PapersOnLine
51(16), 151–156 (2018)
27. Ehlers, R.: Formal verification of piece-wise linear feed-forward neural networks.
In: Proceedings 15th International Symposium on Automated Technology for Ver-
ification and Analysis (ATVA), pp. 269–286 (2017)
28. Eliyahu, T., Kazak, Y., Katz, G., Schapira, M.: Verifying learning-augmented
systems. In: Proceedings Conference of the ACM Special Interest Group on Data
Communication on the Applications, Technologies, Architectures, and Protocols
for Computer Communication (SIGCOMM), pp. 305–318 (2021)
29. Fulton, N., Platzer, A.: Safe reinforcement learning via formal methods: toward
safe control through proof and learning. In: Proceedings 32nd AAAI Conference
on Artificial Intelligence (AAAI) (2018)
30. Ganaie, M., Hu, M., Malik, A., Tanveer, M., Suganthan, P.: Ensemble deep learn-
ing: a review. Eng. Appl. Artif. Intell. 115, 105151 (2022)
31. Gehr, T., Mirman, M., Drachsler-Cohen, D., Tsankov, P., Chaudhuri, S., Vechev,
M.: AI2: safety and robustness certification of neural networks with abstract inter-
pretation. In: Proceedings 39th IEEE Symposium on Security and Privacy (S&P)
(2018)
32. Geng, C., Le, N., Xu, X., Wang, Z., Gurfinkel, A., Si, X.: Toward reliable neural
specifications. Technical report (2022). https://arxiv.org/abs/2210.16114
33. Geva, S., Sitte, J.: A cartpole experiment benchmark for trainable controllers.
IEEE Control Syst. Mag. 13(5), 40–51 (1993)
34. Goldberger, B., Adi, Y., Keshet, J., Katz, G.: Minimal modifications of deep
neural networks using verification. In: Proceedings 23rd Proceedings Conference
on Logic for Programming, Artificial Intelligence and Reasoning (LPAR), pp.
260–278 (2020)
35. Goodfellow, I., Bengio, Y., Courville, A.: Deep Learning. MIT Press (2016)
36. Gopinath, D., Katz, G., Pǎsǎreanu, C., Barrett, C.: DeepSafe: a data-driven app-
roach for assessing robustness of neural networks. In: Proceedings 16th Inter-
national Symposium on Automated Technology for Verification and Analysis
(ATVA), pp. 3–19 (2018)
37. Goubault, E., Palumby, S., Putot, S., Rustenholz, L., Sankaranarayanan, S.: Static
analysis of ReLU neural networks with tropical Polyhedra. In: Proceedings 28th
International Symposium on Static Analysis (SAS), pp. 166–190 (2021)
38. Haarnoja, T., Zhou, A., Abbeel, P., Levine, S.: Soft actor-critic: off-policy maxi-
mum entropy deep reinforcement learning with a stochastic actor. In: Proceedings
Conference on Machine Learning, pp. 1861–1870. PMLR (2018)
39. Hashemi, V., Křetı́nsky, J., Rieder, S., Schmidt, J.: Runtime monitoring for out-
of-distribution detection in object detection neural networks. Technical report
(2022). http://arxiv.org/abs/2212.07773
40. Huang, S., Papernot, N., Goodfellow, I., Duan, Y., Abbeel, P.: Adversarial attacks
on neural network policies. Technical report (2017). https://arxiv.org/abs/1702.
02284
41. Huang, X., Kwiatkowska, M., Wang, S., Wu, M.: Safety verification of deep neu-
ral networks. In: Proceedings 29th International Conference on Computer Aided
Verification (CAV), pp. 3–29 (2017)
42. Isac, O., Barrett, C., Zhang, M., Katz, G.: Neural network verification with proof
production. In: Proceedings 22nd International Conference on Formal Methods
in Computer-Aided Design (FMCAD), pp. 38–48 (2022)
43. Jacoby, Y., Barrett, C., Katz, G.: Verifying recurrent neural networks using invari-
ant inference. In: Proceedings 18th International Symposium on Automated Tech-
nology for Verification and Analysis (ATVA), pp. 57–74 (2020)
44. Jay, N., Rotman, N., Godfrey, B., Schapira, M., Tamar, A.: A deep reinforce-
ment learning perspective on internet congestion control. In: Proceedings 36th
International Conference on Machine Learning (ICML), pp. 3050–3059 (2019)
45. Julian, K., Lopez, J., Brush, J., Owen, M., Kochenderfer, M.: Policy compression
for aircraft collision avoidance systems. In: Proceedings 35th Digital Avionics
Systems Conference (DASC), pp. 1–10 (2016)
46. Katz, G., Barrett, C., Dill, D., Julian, K., Kochenderfer, M.: Reluplex: an efficient
SMT solver for verifying deep neural networks. In: Proceedings 29th International
Conference on Computer Aided Verification (CAV), pp. 97–117 (2017)
47. Katz, G., Barrett, C., Dill, D., Julian, K., Kochenderfer, M.: Reluplex: a calculus
for reasoning about deep neural networks. Formal Methods Syst. Des. (FMSD)
(2021)
48. Katz, G., et al.: The marabou framework for verification and analysis of deep neu-
ral networks. In: Proceedings 31st International Conference on Computer Aided
Verification (CAV), pp. 443–452 (2019)
49. Könighofer, B., Lorber, F., Jansen, N., Bloem, R.: Shield synthesis for reinforce-
ment learning. In: Proceedings International Symposium on Leveraging Applica-
tions of Formal Methods, Verification and Validation (ISoLA), pp. 290–306 (2020)
50. Krizhevsky, A., Sutskever, I., Hinton, G.: ImageNet classification with deep convo-
lutional neural networks. In: Proceedings 26th Conference on Neural Information
Processing Systems (NeurIPS), pp. 1097–1105 (2012)
51. Krogh, A., Vedelsby, J.: Neural network ensembles, cross validation, and active
learning. In: Proceedings 7th Conference on Neural Information Processing Sys-
tems (NeurIPS), pp. 231–238 (1994)
52. Kuper, L. Katz, G., Gottschlich, J., Julian, K., Barrett, C., Kochenderfer, M.:
Toward scalable verification for safety-critical deep networks. Technical report
(2018). https://arxiv.org/abs/1801.05950
53. Kurakin, A., Goodfellow, I., Bengio, S.: Adversarial examples in the physical
world. Technical report (2016). http://arxiv.org/abs/1607.02533
54. Lekharu, A., Moulii, K., Sur, A., Sarkar, A.: Deep learning based prediction model
for adaptive video streaming. In: Proceedings 12th International Conference on
Communication Systems & Networks (COMSNETS), pp. 152–159. IEEE (2020)
55. Li, W., Zhou, F., Chowdhury, K.R., Meleis, W.: QTCP: adaptive congestion
control with reinforcement learning. IEEE Trans. Netw. Sci. Eng. 6(3), 445–458
(2018)
56. Li, Y.: Deep reinforcement learning: an overview. Technical report (2017). http://
arxiv.org/abs/1701.07274
57. Lomuscio, A., Maganti, L.: An approach to reachability analysis for feed-forward
ReLU neural networks. Technical report (2017). http://arxiv.org/abs/1706.07351
58. Loquercio, A., Segu, M., Scaramuzza, D.: A general framework for uncertainty
estimation in deep learning. In: Proceedings International Conference on Robotics
and Automation (ICRA), pp. 3153–3160 (2020)
59. Low, S., Paganini, F., Doyle, J.: Internet congestion control. IEEE Control Syst.
Mag. 22(1), 28–43 (2002)
60. Lukina, A., Schilling, C., Henzinger, T.A.: Into the unknown: active monitoring
of neural networks. In: Feng, L., Fisman, D. (eds.) RV 2021. LNCS, vol. 12974,
pp. 42–61. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-88494-9_3
61. Lyu, Z., Ko, C.Y., Kong, Z., Wong, N., Lin, D., Daniel, L.: Fastened crown: tight-
ened neural network robustness certificates. In: Proceedings 34th AAAI Confer-
ence on Artificial Intelligence (AAAI), pp. 5037–5044 (2020)
62. Ma, J., Ding, S., Mei, Q.: Towards more practical adversarial attacks on graph
neural networks. In: Proceedings 34th Conference on Neural Information Process-
ing Systems (NeurIPS) (2020)
63. Madry, A., Makelov, A., Schmidt, L., Tsipras, D., Vladu, A.: Towards deep learn-
ing models resistant to adversarial attacks. Technical report (2017). http://arxiv.
org/abs/1706.06083
64. Mammadli, R., Jannesari, A., Wolf, F.: Static neural compiler optimization via
deep reinforcement learning. In: Proceedings 6th IEEE/ACM Workshop on the
LLVM Compiler Infrastructure in HPC (LLVM-HPC) and Workshop on Hierar-
chical Parallelism for Exascale Computing (HiPar), pp. 1–11 (2020)
65. Mao, H., Alizadeh, M., Menache, I., Kandula, S.: Resource management with
deep reinforcement learning. In: Proceedings 15th ACM Workshop on Hot Topics
in Networks (HotNets), pp. 50–56 (2016)
66. Mao, H., Netravali, R., Alizadeh, M.: Neural adaptive video streaming with Pen-
sieve. In: Proceedings Conference of the ACM Special Interest Group on Data
Communication on the Applications, Technologies, Architectures, and Protocols
for Computer Communication (SIGCOMM), pp. 197–210 (2017)
67. Mnih, V., et al.: Playing Atari with deep reinforcement learning. Technical report
(2013). https://arxiv.org/abs/1312.5602
68. Moore, A.: Efficient Memory-based Learning for Robot Control. University of
Cambridge (1990)
69. Nagle, J.: Congestion control in IP/TCP internetworks. ACM SIGCOMM Com-
put. Commun. Rev. 14(4), 11–17 (1984)
70. Okudono, T., Waga, M., Sekiyama, T., Hasuo, I.: Weighted automata extraction
from recurrent neural networks via regression on state spaces. In: Proceedings
34th AAAI Conference on Artificial Intelligence (AAAI), pp. 5037–5044 (2020)
71. Ortega, L., Cabañas, R., Masegosa, A.: Diversity and generalization in neural
network ensembles. In: Proceedings 25th International Conference on Artificial
Intelligence and Statistics (AISTATS), pp. 11720–11743 (2022)
72. Osband, I., Aslanides, J., Cassirer, A.: Randomized prior functions for deep rein-
forcement learning. In: Proceedings 31st International Conference on Neural Infor-
mation Processing Systems (NeurIPS), pp. 8617–8629 (2018)
73. Ostrovsky, M., Barrett, C., Katz, G.: An abstraction-refinement approach to ver-
ifying convolutional neural networks. In Proceedings 20th International Sympo-
sium on Automated Technology for Verification and Analysis (ATVA), pp. 391–
396 (2022)
74. Ovadia, Y., et al.: Can you trust your model’s uncertainty? Evaluating predic-
tive uncertainty under dataset shift. In: Proceedings 33rd Conference on Neural
Information Processing Systems (NeurIPS), pp. 14003–14014 (2019)
75. Packer, C., Gao, K., Kos, J., Krähenbühl, P., Koltun, V., Song, D.: Assessing
generalization in deep reinforcement learning. Technical report (2018). https://
arxiv.org/abs/1810.12282
76. Polgreen, E., Abboud, R., Kroening, D.: Counterexample guided neural synthesis.
Technical report (2020). https://arxiv.org/abs/2001.09245
77. Prabhakar, P.: Bisimulations for neural network reduction. In: Finkbeiner, B.,
Wies, T. (eds.) VMCAI 2022. LNCS, vol. 13182, pp. 285–300. Springer, Cham
(2022). https://doi.org/10.1007/978-3-030-94583-1_14
78. Prabhakar, P., Afzal, Z.: Abstraction based output range analysis for neural net-
works. Technical report (2020). https://arxiv.org/abs/2007.09527
79. Riedmiller, M.: Neural fitted Q iteration – first experiences with a data efficient
neural reinforcement learning method. In: Gama, J., Camacho, R., Brazdil, P.B.,
Jorge, A.M., Torgo, L. (eds.) ECML 2005. LNCS (LNAI), vol. 3720, pp. 317–328.
Springer, Heidelberg (2005). https://doi.org/10.1007/11564096_32
80. Rotman, N., Schapira, M., Tamar, A.: Online safety assurance for deep reinforce-
ment learning. In: Proceedings 19th ACM Workshop on Hot Topics in Networks
(HotNets), pp. 88–95 (2020)
81. Ruan, W., Huang, X., Kwiatkowska, M.: Reachability analysis of deep neural net-
works with provable guarantees. In: Proceedings 27th International Joint Confer-
ence on Artificial Intelligence (IJCAI) (2018)
82. Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal pol-
icy optimization algorithms. Technical report (2017). http://arxiv.org/abs/1707.
06347
83. Seshia, S., et al.: Formal specification for deep neural networks. In: Proceedings
16th International Symposium on Automated Technology for Verification and
Analysis (ATVA), pp. 20–34 (2018)
84. Silver, D., et al.: Mastering the game of go with deep neural networks and tree
search. Nature 529(7587), 484–489 (2016)
85. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale
image recognition. Technical report (2014). http://arxiv.org/abs/1409.1556
86. Singh, G., Gehr, T., Puschel, M., Vechev, M.: An abstract domain for certifying
neural networks. In: Proceedings 46th ACM SIGPLAN Symposium on Principles
of Programming Languages (POPL) (2019)
87. Sotoudeh, M., Thakur, A.: Correcting deep neural networks with small, general-
izing patches. In: Workshop on Safety and Robustness in Decision Making (2019)
88. Strong, C., et al.: Global optimization of objective functions represented by ReLU
networks. J. Mach. Learn., 1–28 (2021)
89. Sun, X., Khedr, H., Shoukry, Y.: Formal verification of neural network controlled
autonomous systems. In: Proceedings 22nd ACM International Conference on
Hybrid Systems: Computation and Control (HSCC) (2019)
90. Sutton, R., Barto, A.: Reinforcement Learning: An Introduction. MIT Press
(2018)
91. Sutton, R., McAllester, D., Singh, S., Mansour, Y.: Policy gradient methods for
reinforcement learning with function approximation. In: Proceedings 12th Con-
ference on Neural Information Processing Systems (NeurIPS) (1999)
92. Tjeng, V., Xiao, K., Tedrake, R.: Evaluating robustness of neural networks with
mixed integer programming. Technical report (2017). http://arxiv.org/abs/1711.
07356
93. Tolstoy, L.: Anna Karenina. The Russian Messenger (1877)
94. Urban, C., Christakis, M., Wüstholz, V., Zhang, F.: Perfectly parallel fairness
certification of neural networks. In: Proceedings ACM International Conference on
Object Oriented Programming Systems Languages and Applications (OOPSLA),
pp. 1–30 (2020)
95. Usman, M., Gopinath, D., Sun, Y., Noller, Y., Pǎsǎreanu, C.: NNrepair:
constraint-based repair of neural network classifiers. Technical report (2021).
http://arxiv.org/abs/2103.12535
96. Valadarsky, A., Schapira, M., Shahaf, D., Tamar, A.: Learning to route with deep
RL. In: NeurIPS Deep Reinforcement Learning Symposium (2017)
97. van Hasselt, H., Guez, A., Silver, D.: Deep reinforcement learning with double Q-
learning. In: Proceedings 30th AAAI Conference on Artificial Intelligence (AAAI)
(2016)
98. Vasić, M., Petrović, A., Wang, K., Nikolić, M., Singh, R., Khurshid, S.: MoËT:
mixture of expert trees and its application to verifiable reinforcement learning.
Neural Netw. 151, 34–47 (2022)
99. Wang, S., Pei, K., Whitehouse, J., Yang, J., Jana, S.: Formal security analysis of
neural networks using symbolic intervals. In: Proceedings 27th USENIX Security
Symposium, pp. 1599–1614 (2018)
100. Wu, H., et al.: Parallelization techniques for verifying neural networks. In: Pro-
ceedings 20th International Conference on Formal Methods in Computer-Aided
Design (FMCAD), pp. 128–137 (2020)
101. Wu, H., Zeljić, A., Katz, G., Barrett, C.: Efficient neural network analysis with
sum-of-infeasibilities. In: Proceedings 28th International Conference on Tools and
Algorithms for the Construction and Analysis of Systems (TACAS), pp. 143–163
(2022)
102. Xiang, W., Tran, H., Johnson, T.: Output reachable set estimation and verifi-
cation for multi-layer neural networks. IEEE Trans. Neural Netw. Learn. Syst.
(TNNLS) (2018)
103. Yang, J., Zeng, X., Zhong, S., Wu, S.: Effective neural network ensemble approach
for improving generalization performance. IEEE Trans. Neural Netw. Learn. Syst.
(TNNLS) 24(6), 878–887 (2013)
104. Yang, X., Yamaguchi, T., Tran, H., Hoxha, B., Johnson, T., Prokhorov, D.: Neu-
ral network repair with reachability analysis. In: Proceedings 20th International
Conference on Formal Modeling and Analysis of Timed Systems (FORMATS),
pp. 221–236 (2022)
105. Zelazny, T., Wu, H., Barrett, C., Katz, G.: On reducing over-approximation errors
for neural network verification. In: Proceedings 22nd International Conference on
Formal Methods in Computer-Aided Design (FMCAD), pp. 17–26 (2022)
106. Zhang, H., Shinn, M., Gupta, A., Gurfinkel, A., Le, N., Narodytska, N.: Verifi-
cation of recurrent neural networks for cognitive tasks via reachability analysis.
In: Proceedings 24th European Conference on Artificial Intelligence (ECAI), pp.
1690–1697 (2020)
107. Zhang, J., Kim, J., O’Donoghue, B., Boyd, S.: Sample efficient reinforcement
learning with REINFORCE. Technical report (2020). https://arxiv.org/abs/2010.
11364
108. Zhang, J., et al.: An end-to-end automatic cloud database tuning system using
deep reinforcement learning. In: Proceedings of the 2019 International Conference
on Management of Data (SIGMOD), pp. 415–432 (2019)
Correction to: COQCRYPTOLINE: A Verified
Model Checker with Certified Results
Correction to:
Chapter “COQCRYPTOLINE: A Verified Model Checker
with Certified Results” in: C. Enea and A. Lal (Eds.):
Computer Aided Verification, LNCS 13965,
https://doi.org/10.1007/978-3-031-37703-7_11