
Tackling Large Verification Problems with the Swarm Tool

Lecture Notes in Computer Science


Gerard J. Holzmann, Rajeev Joshi, Alex Groce
Laboratory for Reliable Software (LaRS), Jet Propulsion Laboratory, California Institute of Technology

The research described in this paper was carried out at the Jet Propulsion Laboratory, California Institute of Technology, under a contract with the National Aeronautics and Space Administration. The work was supported in part by NASA's Exploration Technology Development Program (ETDP) on Reliable Software Engineering.

Abstract. The range of verification problems that can be solved with logic model checking tools has increased significantly in the last few decades. This increase in capability is based on algorithmic advances, but in no small measure it is also made possible by increases in processing speed and main memory sizes on standard desktop systems. For the time being, though, the increase in CPU speeds has mostly ended as chip-makers are redirecting their efforts to the development of multi-core systems. In the coming years we can expect systems with very large memory sizes, and increasing numbers of CPU cores, but with each core running at a relatively low speed. We will discuss the implications of this important trend, and describe how we can leverage these developments with new tools.

Introduction

The primary resources in most software applications are time and space. It is often possible to make an algorithm faster by using more memory, or to reduce its memory use by allowing the run time to grow. In the design of SPIN, a reduction of the run time requirements has almost always taken precedence.

For an exhaustive verification, the run time requirements of SPIN are bounded by both the size of the reachable state space and by the size of available memory. If M bytes of memory are available, each state requires V bytes of storage, and the verifier on average records S new states per second, then a run can last no longer than M/(S*V) seconds. If, for example, M is 64 MB, V is 64 bytes, and S is 10^4 states per second, the maximum runtime would be 10^2 seconds. If there are more than 10^6 reachable states, the search will remain incomplete, being limited by the size of memory.

The verification speed depends primarily on the average size of the state descriptor, which is typically in the range of 10^2 to 10^3 bytes. On a system running at 2 or 3 GHz the processing speed is normally in the range of 10^5 to 5*10^5 states per second. This means that in about one hour, the model checker can explore 10^8 to 10^9 states, provided sufficient memory is available to store them. This means that some 10^11 bytes, or 100 GB, can be used up per hour of runtime. (Smaller state descriptors normally correlate with higher processing speeds.) On an 8 GB system, that means that the model checker can (in exhaustive storage mode) normally run for no more than about 5 minutes. If we switch to a 64 GB system running at the same clock-speed, the worst-case runtime increases to 40 minutes.

An interesting effect occurs if we switch from exhaustive verification mode to bitstate mode, where we can achieve a much higher coverage of large state spaces by using less than a byte of memory per state stored [H87]. The exact number of bytes stored per state is difficult to determine accurately in this case. The current version of SPIN by default uses three different hash-functions, setting between one and three new bit-positions for each state explored. We will assume conservatively here that each state in this mode consumes 0.5 bytes of memory, and that the speed of the model checker is approximately 10^8 states per hour. Under these assumptions, the model checker will consume maximally 10^8 * 0.5 bytes of memory per hour of run time, or roughly 50 MB. To use up 8 GB will now take about a week (6.8 days) of non-stop computation.
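As a quick sanity check on the arithmetic above, the following throwaway command recomputes the two memory-consumption rates. This is our own sketch, not part of SPIN or of the experiments in this paper; the exploration rates and bytes-per-state figures are the estimates assumed in the text.

$ awk 'BEGIN {
    gb_per_hour = 1e9 * 100 / 1e9;   # exhaustive mode: ~10^9 states/hour at ~100 bytes per state
    mb_per_hour = 1e8 * 0.5 / 1e6;   # bitstate mode:   ~10^8 states/hour at ~0.5 bytes per state
    printf "exhaustive: about %.0f GB of memory per hour\n", gb_per_hour;
    printf "bitstate:   about %.0f MB per hour; 8 GB lasts about %.1f days\n", mb_per_hour, 8*1024/(mb_per_hour*24);
}'
exhaustive: about 100 GB of memory per hour
bitstate:   about 50 MB per hour; 8 GB lasts about 6.8 days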
In return, we cover significantly more states, but both time and space should be considered limited resources, so the greater number of states is not always practically achievable. To make the point more clearly, if we increase the available memory size to 64 GB, a bitstate search could consume close to two months of computation, which is clearly no longer a feasible strategy, no matter how many states are explored in the process.

We are therefore faced with a dilemma. The applications that we are trying to verify with model checkers are increasing in size, especially as we are developing methods to apply logic model checkers directly to implementation level code [H00, HJ04]. As state descriptors grow in size from tens of bytes to tens of kilobytes, processing speed drops and is no longer offset by continued CPU clock-speed increases. For very large applications, a bitstate search is typically the only feasible option, as it can increase the problem coverage (i.e., the number of reachable states explored) by orders of magnitude. Exhaustive coverage for these applications is generally prohibitively expensive, given the enormous size of both the state descriptors and the state spaces, no matter which algorithm is used. In these cases we have to find ways to perform the best achievable verification. Technically, the right solution in these cases is always to apply strong abstraction techniques to reduce the problem size as much as possible. We will assume in the remainder of this paper that the best abstractions have already been applied and that the resulting state space sizes still far exceed the resource limits in time and/or memory.

Leveraging Search Diversity

To focus the discussion, we will assume that there is always an upper bound on the time that is available for a verification run, especially for large problem sizes. We will assume that this bound is one hour. With a fixed exploration rate this means that we cannot use more than a few Gigabytes of memory in exhaustive verification, and no more than about 50 to 500 MB in bitstate exploration.

Given that for very large verification problems we have to accept that the search for errors within a specific time constraint will generally be incomplete, it is important that we do not expend all our resources on a single strategy. Within the limited time available, we should approach the search problem from a number of different angles, each with a different chance of revealing errors. A good strategy is to leverage both parallelism and search diversity. The types of applications that then become most promising fall in the category of "embarrassingly parallel" algorithms.

To illustrate our approach, we will use a simple model that can generate a very large state space, where we can easily identify every reachable state and predict when in a standard depth-first search each specific state will be generated. The example is shown in Figure 1.
byte pos = 0;
int val = 0;
int flag = 1;

active proctype word()
{
end:    do
        :: d_step { pos < 32 ->         /* leave bit 0 */
                flag = flag << 1; pos++ }
        :: d_step { pos < 32 ->
                val = val | flag; flag = flag << 1; pos++ }
        od
}

never {         /* check if number N is reached */
        do
        :: assert(val != N)
        od
}

Fig. 1. Model to generate all 32-bit values, to illustrate the benefits of search diversification.

The model executes a loop with two options, from which the search engine will non-deterministically select one at each step. Each option will advance an index into a 32-bit integer word named val from the least significant bit (at position one) to the most significant bit (at position 32). The first option leaves the bit pointed to set to its initial value of zero, and merely advances the index. The second option sets the bit pointed at to one. Clearly, there will be 2^32 (over 4 billion) possible assignments to val. Each state descriptor is quite small, at 24 bytes, but storing all states exhaustively would still require over 100 GB.

If we perform the state space exploration on a machine with no more than 2 GB, an exhaustive search cannot reach more than 2% of the state space. A bitstate search on the other hand could in principle store all states, but only under ideal conditions. For this model, with a very small state descriptor, we reach a processing speed of close to 2 million states per second, on a standard 2.3 GHz system. We will, however, artificially limit the amount of memory that we make available for the bitstate hash array to 64 MB and study what we can achieve in terms of state coverage by exploiting parallelism and search diversification techniques. In terms of SPIN options, this means the selection of a pan runtime flag of maximally -w29. In practice this means that for this example only about 148 million states are reached in bitstate mode, or no more than 3.5% of the total state space.

For this example it is also easy to check if a specific 32-bit value is reached in SPIN's exploration, by defining the corresponding value for N when the model is generated. For instance, checking if the value negative one is reached in the maximal bitstate search can be done as follows:

$ spin -DN=-1 -a model.pml
$ cc -DMEMLIM=2000 -DSAFETY -o pan pan.c
$ ./pan -w29

This particular search fails to produce a match. It is easy to understand why that is. Note that the value negative one is represented in two's complement as a series of all one bits, which in the standard depth-first search is the last number that would be generated by the verifier. Performing the same search for the value positive one will produce an almost immediate match, for the same reason. If we reverse the order of the two options in the model of Figure 1, the opposite effect would occur: the search for negative one would complete quickly and the search for positive one would fail. If the number to be matched is randomly chosen, we could not devise a search strategy that optimizes our chances of matching it, which is more representative of a real search problem. After all, if we knew in advance where the error might be, we would not need a model checker to find it. In the experiments that we will describe we will use a list of 100 randomly generated numbers, and compare different methods for matching as many of them as possible. If the random number generator behaves properly, the random numbers will be distributed evenly over the entire state space of over 4 billion reachable states.
Statistically, the best we could expect to do in any one run would be to uncover no more than 3 or 4 of those states (given that with runtime flag -w29 we can explore at most 3.5% of the reachable state space). We will see that with a diversified search strategy, we can do significantly better and identify 49% of the randomly generated states. We will also show that even when using only 4 MB (a tiny fraction of the 100 GB that would be required to store the full state space) we can already identify 10% of the target numbers.

Algorithms

To make our method work we have to be able to use as many different search methods as there are processing cores available to us. If each search is set up to use only a small fraction of the total memory that is available on our system, we can run all searches in parallel. In the description that follows we describe a number of different search algorithms. Several of these algorithms can be modified to form any number of additional searches, each of which is able to search a different part of the state space. The base algorithms we use can be described as follows.

1. (dfs) The first method is the standard depth-first search that is the default for all SPIN verifications.
2. (dfs_r) The next method reverses the order in which a list of non-deterministic choices within a process is explored, using the compiler directive -D_TREVERSE (new in SPIN version 5.1.5).
3. (r_dfs) The next method uses a search randomization strategy on the order in which a list of transitions is explored, using the existing compiler directive -DRANDOMIZE (first introduced in SPIN version 4.2.2). With this method the verifier will randomly select a starting point in the transition list, and start checking transitions for their executability in round-robin order from the point that was randomly selected.
4. (pick) The next method uses a user-defined selection method to permute the transitions in a list.
5. The last method reverses the order in which process interleavings are explored, using the compiler directive -DREVERSE (introduced in SPIN version 5.1.4).

Because our example uses just a single process, we will not use the last variation of the search. Alternative methods for modifying process scheduling decisions during a search can be found in [MQ08]. Algorithms 3 and 4 can be used to define a range of search options, using different seeds for the random number generator. To illustrate this, we will use two versions of algorithm 3, called r_dfs1 and r_dfs2.

Each algorithm from the set can be used in a series of runs. In our tests we repeated each run 100 times, once for each number from the list of 100 target numbers to match. We then repeat each of these 100 runs 24 times, while varying the size of the hash array used from our limit value of -w29 (64 MB) down to a minimum of -w6 (8 bytes). This is an application of iterative search refinement, as first discussed in [HS00].

About 2,000 runs from this series of experiments, for hash array sizes from -w6 up to and including -w23, take less than a second of runtime each, so despite the large number of runs, they can be completed very quickly. For the larger hash array sizes (2 MB and up) the runtime and the number of states covered become more notable, with the longest runs taking 84 seconds each on a 2.4 GHz system.
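The series of runs just described can be set up with a small script along the following lines. This is a minimal sketch of our own, not the authors' test harness: the file targets.txt holding the 100 random target numbers is a hypothetical name, only three of the five search variants are shown, and the seed variation that distinguishes r_dfs1 from r_dfs2 is omitted; the compile-time directives are the ones named in the list above.

#!/bin/sh
# Sketch: for each target number, generate the model with that N, build three
# of the search variants, and run each with hash-array sizes -w6 through -w29.
for N in `cat targets.txt`
do
        spin -DN=$N -a model.pml
        cc -DMEMLIM=2000 -DSAFETY             -o pan_dfs   pan.c   # 1. standard dfs
        cc -DMEMLIM=2000 -DSAFETY -D_TREVERSE -o pan_dfs_r pan.c   # 2. reversed choice order
        cc -DMEMLIM=2000 -DSAFETY -DRANDOMIZE -o pan_r_dfs pan.c   # 3. randomized transition order
        w=6
        while [ $w -le 29 ]
        do
                ./pan_dfs   -w$w
                ./pan_dfs_r -w$w
                ./pan_r_dfs -w$w
                w=`expr $w + 1`
        done
done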
All 24 runs combined take no more than about 3 minutes of real time when run sequentially, which means that all 2,400 runs can be completed in about 5 hours on a single CPU core, or in about 37 minutes total on the 8-core machine that was used for these experiments. The results are shown in Figure 2.

Fig. 2. Results of 2,400 Verification Runs for the Model in Figure 1 (logscale). [Plot: number of matched targets (1 to 100, log scale) versus hash-array size (-w6 to -w29) for dfs, dfs_r, r_dfs1, r_dfs2, pick, and the total matched by all methods combined.]

When used separately, in a single verification run, none of the search methods identifies more than about 9% of the target values set for this experiment. If we look at the cumulative effectiveness of the iterative search refinement method, using 24 runs of each algorithm and increasing the size of the hash array step by step, this coverage increases, with the best performing search method (r_dfs2) identifying 15% of all targets. The performance of all five search strategies combined in our proposed diversified multi-core search strategy increases the coverage to the identification of 49% of all targets. The top curve in Figure 2 shows the cumulative number of matches (out of 100) as the memory arena is increased from 8 bytes (-w6) to 64 MB (-w29). The other curves show the performance of the individual search algorithms. Each different search method identifies a different set of targets, thus boosting the overall effectiveness of the use of all methods combined. Adding more variations of the searches could increase the problem coverage still further.

It is evident from these data that the performance of the diversified multi-core search is significantly better than that of any one search method in isolation. The diversified multi-core search promises to be a very valuable addition to the range of techniques that we can use to tackle very large software verification problems. It builds directly on the availability of systems with increasing numbers of CPU cores and rapidly growing memory sizes that we can expect in the coming years and perhaps decades. There may also be a direct application of this strategy of swarm verification in grid computing, using large numbers of standard networked computers. In this paper, though, we focus on the application to multi-core systems only.

The Swarm Tool

Based on the observations above, we developed a tool that allows us to leverage the effect of search diversification on multi-core machines. The Swarm tool, written in about 500 lines of C, can generate a large series of parallel verification runs under precise user-defined constraints of time and memory. The tool takes parameters that define a time-constraint, the number of available CPUs, and the maximum amount of memory that is available. (For a manual page see: http://spinroot.com/swarm/)

Swarm first calculates how many states could maximally be searched within the time allowed, and then sets up a series of bitstate runs. By limiting the hash arena in a bitstate run, Swarm controls what the maximal time for each run will be. Parallelism is used to explore searches with different search strategies to provide diversity. The commands that are generated include standard, randomized, and reverse depth-first search orders, varying depth-limits, and a varying number of hash-functions per run. In a relatively small amount of time, hundreds of different searches can be performed, each different, probing different parts of an oversized state space.
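As an illustration of the kind of calculation involved, the following one-liner is our own back-of-the-envelope sketch, not Swarm's actual code. It assumes a time budget T of one hour, a hypothetical exploration rate S of 10^5 states per second, and the 0.5 bytes per state estimate used earlier, and derives the smallest -wN hash arena that such a run could fill within the budget.

$ awk -v T=3600 -v S=100000 'BEGIN {
    bits = T * S * 0.5 * 8;            # bits needed to store the ~T*S states explored in time T
    for (w = 1; 2^w < bits; w++) ;     # smallest w such that 2^w bits suffice
    printf "suggested single run: pan -w%d (about %.0f MB)\n", w, 2^w / 8 / 1e6;
}'
suggested single run: pan -w31 (about 268 MB)

Swarm applies bounds of this kind per run, and then divides the overall time and memory budget over many smaller, diversified runs rather than one large search.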
A typical command line invocation of the Swarm tool is as follows:

$ swarm -c4 -m16G -t1 -f model.pml > script
swarm: 33 runs, avg time per cpu 3593.8 sec

In this case we specify that we want to use 4 CPU cores for the verifications, and have up to 16 GB of memory available for these runs. The -t parameter sets the time limit for all runs combined to one hour. Swarm writes the verification script onto the standard output, which can then be written into a script file. Executing the script performs the verification.

Application: An example of a large problem that cannot be handled with standard search methods is a SPIN model of an experimental Fleet processor architecture. The details of the design itself are not of interest to us here, but the verifiability of the model is. (The Spin model of the Fleet Architecture Design was built by Rhishikesh Limaye and Narayan Sundaram under the guidance of Sanjit Seshia from UC Berkeley.) One version of this model has a known assertion violation that can be triggered through a manually guided simulation in about 350 steps. The model is over one thousand lines of PROMELA, the specification language of the SPIN model checker. Each system state is 1,440 bytes.

An attempt to perform a full verification on a machine with 32 GB of memory runs at roughly 10^5 states per second, and exhausts memory in 195 seconds without reporting the error. At this point the search has explored 23.4 million states, corresponding to an unknowable fraction of the reachable state space. A search using -DCOLLAPSE compression (a lossless state compression mode) reaches 327.6 million states before running out of memory after 3,320 seconds. A run with hash-compact (a stronger, but not lossless, form of compression) runs out of memory after 1,910 seconds and increases the coverage to 537 million states. A bitstate run, using all 32 GB of memory, runs for 34 days, and explores over 10^11 system states. None of these search attempts succeeds in locating the assertion failure. The full reachable state space for this problem is likely to be orders of magnitude larger than what can be searched or stored by any verification method. The bitstate run can be performed in parallel on 8 CPUs, shrinking the run time from 34 days to about 5 days [HB07], but without change in result. An alternative would be to run the verification with -DMA compression, which is lossless and often extremely frugal in memory use. Such a run could in principle complete the verification and reveal the error, but it would likely take at least a year of computation to do so.

A Swarm run for this application is quickly set up. Swarm generates 74 small jobs in 8 scripts that can be executed in parallel on the 32 GB machine, when given a time limit of one hour (the default). Executing the scripts finds the assertion violation within a few seconds, in this case by virtue of the inclusion of the reversed depth-first search. The assertion violation, as it turns out, normally happens towards the end of the standard depth-first order, which means that it is encountered near the very beginning of the search if the depth-first search order is reversed.

For a different test of the performance of Swarm we also studied a series of large verification models from our benchmark set, most of which were also used in [HB07]. EO1 is a verification model of the autonomous planning software used on NASA's Earth Observing-1 mission [C05]. The Fleet architecture model was discussed above.
DEOS is a model of an operating system kernel developed at Honeywell Laboratories, which was also discussed in [P05]. Gurdag is a model of an ad hoc network structure with five nodes, created by a SPIN user at a commercial company. CP is a large model of a telephone switch, extracted from C source code with the Modex tool. DS1 is a large verification model with over 10,000 lines of embedded C code taken from NASA's Deep Space 1 mission, as described in [G02]. NVDS is a verification model of a data storage module developed at JPL in 2006, with about 6,000 lines of embedded C code, and NVFS is a non-volatile flash file system design for use on an upcoming space mission, with about 10,000 lines of embedded C.

For each of these models we first counted the number of local states in the automata that SPIN generates for the model checking process. This is done by inspecting the output of the command "pan -d." Next, we measured the number of these local states that are reported as unreached at the end of a standard depth-first search with a bitstate hash-array of 64 MB (using runtime flag -w29, as before). Next, we used the Swarm tool to generate a verification script for up to 6 CPUs and 1 hour of runtime. We then measured how many of the local states remained unreached in all runs. (A brief command-line sketch of this procedure is given below, after Table 1.)

Table 1. Swarm Coverage Improvement for Eight Large Verification Models

Verification  Number of        Unreached Control States        Percent of Control States Reached
Model         Control States   standard dfs    dfs + swarm     standard dfs    dfs + swarm
EO1           3915             3597            656              8              83
Fleet          171               34             16             80              91
DEOS          2917             1989             84             32              97
Gurdag        1461              853              0             41             100
CP            1848             1332              0             28             100
DS1            133               54              0             59             100
NVDS           296               95              0             68             100
NVFS          3623             1529              0             58             100

The last two columns of Table 1 show the percentage of control states reached, respectively in the original depth-first search using a 64 MB hash-array, and in that search plus all Swarm verification runs. In all cases the coverage increases notably. For the EO1 model coverage increases from 8% to 83%. In the next two cases, coverage increases to over 90%. In the last five large applications we see coverage by this metric reach 100% of the control states. It should be noted that this last result does not mean that the full reachable state space was explored. For the models considered here, achieving the latter would be well beyond our resource limitations, which is precisely why we selected them as candidates for the evaluation of Swarm verification.

All measurements were performed on a 2.3 GHz eight-core desktop system with 32 GB of main memory, of which no single run consumed more than 64 MB in these tests. The state vector size for the models ranges from 180 (NVDS) to 3,426 (DS1) bytes of memory. The number of Swarm jobs that can be executed within our 1 hour limit ranged from 86 (EO1) to 516 (NVDS).

Swarm unexpectedly succeeded in uncovering previously unknown errors in both the CP model and the NVFS application. The NVFS application is relatively new, but the CP verification model was first subjected to thorough verification eight years ago, and has since been used in numerous tests without revealing any errors. In our own applications, Swarm has become the default method we use for large verification runs.
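The measurement procedure described above can be reproduced along the following lines. This is a sketch of our own rather than the authors' scripts: model.pml stands for any of the models in Table 1, and the compile and swarm flag values simply mirror the earlier examples in this paper.

$ spin -a model.pml
$ cc -DMEMLIM=2000 -DSAFETY -o pan pan.c
$ ./pan -d                                    # dump the state tables; the local (control) states are counted here
$ ./pan -w29                                  # standard dfs; unreached states are listed after the run statistics
$ swarm -c6 -m32G -t1 -f model.pml > script   # up to 6 CPUs, 1 hour, as in the text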
Conclusion

It is often assumed that the best way to tackle large verification problems is to use all available memory in a maximal search, possibly using multi-core algorithms, e.g., [HB07], to reduce the runtime. As memory sizes grow, though, most search modes that would allow us to explore very large numbers of states take far too much time (e.g., months) to remain of practical value. In this paper we have introduced a new approach that allows us to perform verifications within strict time bounds (e.g., 1 hour), while fully leveraging multi-core capabilities. The Swarm tool uses parallelism and search diversity to optimize coverage.

We have measured the effectiveness of Swarm in several different ways. In each case we could determine that the new approach defeats the standard method of a single depth-first or breadth-first search by a notable margin, both by dramatically reducing runtime and by increasing coverage. A similar approach to the verification problem was explored in [D07], where it was applied to the verification of Java code with the Pathfinder tool, though without considering run-time constraints. The use of embarrassingly parallel approaches, like Swarm, becomes increasingly attractive as the number of processing cores and the amount of memory on desktop systems continue to increase rapidly.

An often underestimated aspect of new techniques is the amount of training that will be required to fully leverage them. This is perhaps one of the stronger points in favor of the Swarm tool. It would be hard to argue that the use of Swarm requires more training than a cursory reading of the manual page.

Acknowledgements

The authors are grateful to Sanjit Seshia, Rhishikesh Limaye, and Narayan Sundaram for providing access to, and insight into, the Fleet Architecture models, and to Madan Musuvathi and Klaus Havelund for inspiring discussions about search strategies. Doron Peled proposed the introduction of the -DRANDOMIZE option in SPIN version 4.2.2.

References

[C05] S. Chien, R. Sherwood, D. Tran, et al., Using Autonomy Flight Software to Improve Science Return on Earth Observing One (EO1), Journal of Aerospace Computing, Information, and Communication, April 2005.
[D07] M.B. Dwyer, S.G. Elbaum, et al., Parallel Randomized State-Space Search, Proc. ICSE 2007, pp. 3-12.
[M69] G.E. Moore, Cramming more components onto integrated circuits, Electronics, 38, (8), April 19, 1965.
[G02] P.R. Gluck and G.J. Holzmann, Using Spin Model Checking for Flight Software Verification, Proc. 2002 Aerospace Conf., IEEE, Big Sky, MT, USA, March 2002.
[H87] G.J. Holzmann, On limits and possibilities of automated protocol analysis, Proc. 6th Int. Conf. on Protocol Specification, Testing, and Verification, INWG IFIP, Eds. H. Rudin and C. West, Zurich, Switzerland, June 1987.
[H00] G.J. Holzmann, Logic verification of ANSI-C Code with Spin, Proc. 7th Spin Workshop, Stanford University, CA, August 2000, Springer Verlag, LNCS 1885, pp. 131-147.
[HS00] G.J. Holzmann and M.H. Smith, Automating software feature verification, Bell Labs Technical Journal, Vol. 5, No. 2, pp. 72-87, April-June 2000.
[HJ04] G.J. Holzmann and R. Joshi, Model-driven software verification, Proc. 11th Spin Workshop, Barcelona, Spain, April 2004, Springer Verlag, LNCS 2989, pp. 77-92.
[HB07] G.J. Holzmann and D. Bosnacki, The design of a multi-core extension to the Spin model checker, IEEE Trans. on Software Engineering, 33, (10), pp. 659-674, Oct. 2007.
[MQ08] M. Musuvathi and S. Qadeer, Fair stateless model checking, Proc. ACM SIGPLAN Conf. on Programming Language Design and Implementation (PLDI), Tucson, AZ, June 7-13, 2008.
[P05] J. Penix, W. Visser, C. Pasareanu, E. Engstrom, A. Larson, and N. Weininger, Verifying Time Partitioning in the DEOS Scheduling Kernel, Formal Methods in System Design, Vol. 26, Issue 2, March 2005.