Last Update: 12 Oct 2016, 14:30 CET: Added a minor tip for low configuration machines
Virtual disk image containing the artifact: popl-paper-184-artifact.tar.gz md5
The above vdi image is pre-installed with the following artifacts:
Below we provide instructions for running our resource verification system (Leon/Orb) on the benchmarks, and for reproducing the results of the paper. Our tool can also be tried online on some benchmarks at leondev.epfl.ch (Memresource section). To know more about our tool and for getting started on writing and verifying new programs with Leon/Orb please refer to the documentation http://leondev.epfl.ch/doc/resourcebounds.html.
The following command can be used to run individual source files. However, to reproduce the results presented in the paper we recommend using the scripts described in the subsequent sections.
$ cd ~/leon $ ./leon --mem --benchmark --timeout=<secs> path-to-file
./leon --mem --benchmark --timeout=60 ./testcases/benchmarks/steps/LazySelectionSort.scala. The tool prints some log messages and the inferred bounds to the console. It dumps the final output and some statistics of the evaluation to a file <filename>-stats.txt in the directory from where the tool was run. For a short description of the command line options use
As shown in the Figure 9 of the paper, a total of 17 benchmarks are used in the evaluation. Each benchmark has two versions: one with a
steps bound, which denotes the number evaluation steps, and another with an
alloc resource bound, which denotes the number of heap-allocated objects. The versions can be found in
~/leon/testcases/benchmark/alloc respectively. Each benchmark has a brief description or pointers to the algorithm that is implemented. The results of running the tool on these benchmarks are available in the folders inside the directory:
~/leon/results/. The folder
~/leon/results/server-results has the results of running the benchmarks on a server with the configurations presented in the paper, and provides more accurate data for the time taken by the tool to analyze a benchmark.
To infer the
steps bounds of the benchmarks (shown in Figure 9), use the following commands:
$ cd ~/leon/results $ mkdir steps; cd steps $ ../../scripts/steps.sh
To infer the
alloc bounds of the benchmarks, replace
alloc in the above commands. The script may take few tens of minutes (depending on the VM configuration) to verify all benchmarks. For slower machines it is possible to comment out some larger benchmarks from the scripts. The benchmarks
Conqueue.scala are among the largest and most time-consuming of all.
Running the script produces two files for each benchmark:
*-stats file provides several statistics in the form of key : value pairs and has the bounds inferred for every function (that has a template) in the benchmark. Note that the Figure 9 of the paper shows only the bounds inferred for a couple of selected functions in each benchmark (for each resource), whereas
*-stats displays every bound inferred. Below we list the functions in each benchmark whose bounds were presented in Figure 9. Reviewers may restrict their attention to these functions in all of the results that follow.
At the end of each stats file there are two tables: Resource Verification Summary and State Verification Summary. The former table shows the inferred bounds for every function (including those presented in Figure 9), and the latter table shows the result of verifying the correctness invariants needed for proving the resource bounds, which may possibly depend on the state of memoization. All the invariants in all the benchmarks will be verified by the tool and would be marked as valid. The table also shows the SMT solver among CVC4 and Z3 that succeeded first in deciding the generated verification conditions.
Most of the key-value pairs in the stats file present details on the internals of the algorithm. The most relevant entries among these are Total-Time (The column AT of Figure 9), State-Verification-Time and Resource-Verification-Time.
Even though the tool makes a best effort to enforce determinism, minor variances across different runs (although rare) is possible. This is because of the incompleteness of the minimization problem in the presence of nonlinearity and recursive functions, and the non-determinism in SMT solvers. This means that occasionally the constants inferred by the tool may slightly vary compared to the ones presented in Figure 9. We observed a difference greater than +/- 1 in the inferred constants only in the two benchmarks: PackratParsing and Deque (for the
steps resource). In both cases the tool computed a more precise bound than the one presented in Figure 9, as the algorithm was able to perform more minimization.
To measure the accuracy of the inferred bounds, we instrument the benchmarks to track the resources (using
./leon --instrument option), and add drivers to run the benchmarks with inputs that exercise the worst-case behavior. All such instrumented benchmarks (having an executable
main function) can be found in the folder:
~/leon/RuntimeEvaluation/src/main/scala/steps (or alloc).
main function of an instrumented benchmarks invokes the key functions on several inputs, computes the dynamic resource usage, and compares it against the statically inferred bounds as described in the section 5 of the paper. Use the following commands to run the instrumented benchmarks.
$ cd ~/leon/RuntimeEvaluation/ $ sbt
Once inside the
sbt (Scala build tool) console, use the following command
run command will list all the available benchmarks that can be executed. Choose the benchmark to run by entering its index. Running a benchmark will produce a set of
.report files in the directory:
steps can be replaced by
alloc). The content of these files are described shortly.
To compute the summary statistics shown in Figure 10, run the
StatsCollector program (listed first) from the
sbt console, which outputs the averages shown in Figure 10 to a file
Running an instrumented benchmark outputs the following files in the directory:
.data file contains a list of triples: (a) the size (or another measure) of the input, (b) the statically inferred bound for the input, (c) the dynamic resource usage or the pareto optimal resource usage. For example, the file
dynamicmsort.data has the following entries
1000 57318 54546 2000 113426 108612 3000 169534 162656 4000 225534 216678 5000 281642 270722 6000 337642 324722 7000 393642 378744 8000 449642 432744 9000 505750 486788 10000 561750 540788
The first line of the file indicates that for an input of size 1000, the statically inferred steps count was 57318 and the runtime steps count was 54546. Using these data for each benchmark, the program
StatsCollector computes a summary statistics using the formula shown in page 10 of the paper. This quantity, referred to as dynamic/static * 100, is outputted by
StatsCollector (for each benchmark and resource) to the file
As explained in the section 5 of the paper, we also compare the statically inferred bounds to a set of pareto optimal bounds with respect to the dynamic execution. Each pareto optimal bound is obtained by reducing a coefficient in the inferred bound, while preserving the other coefficients, until the bound is violated by a concrete input. The file
paretoanalysis<benchmark-tag>.report reports the pareto optimal bound for each benchmark. For instance, the file
paretoanalysishams.report (in folder
~/leon/RuntimeEvaluation/results/steps/CyclicHammingStream/) has the following content:
Statically inferred formula: 129*n + 4 Reducing coeff 0 starting from 4 Lowest possible value for the coefficient is 4 Pareto optimal template: 129*n + 4 First counter-example for 129*n + 3 is at the point 0 Reducing coeff 1 starting from 129 Lowest possible value for the coefficient is 124 Pareto optimal template: 124*n + 4 First counter-example for 123*n + 4 is at the point 8000
As explained in the report, for this benchmark, the first coefficient cannot be reduced any further, and the second coefficient could at best be reduced to 124 (without increasing the other coefficients) for the given dynamic runs. It shows that the data point 8000 violates the bound if the coefficient is reduced to 123.
pareto1hams.data displays the statically inferred bound against the first and the second pareto optimal bound (respectively), for each input.
Using these data for each benchmark, the program
StatsCollector computes the metric optimal/static * 100 (for each benchmark and resource) and outputs it to the file
Even though most of the percentages in the
Figure-10-data file are identical to the data shown in Figure 10 of the paper, in a few benchmarks there is a slight difference (approximately 1 to 2 percentage in most benchmarks) from the data shown in Figure 10. This is because, the results in the paper are more exhaustive and includes more top-level functions, whose evaluations require more complex drivers and manual effort. E.g. For the ease of artifact evaluation, the instrumented BottomUpMergeSort benchmark provided fixes the value of
k to a constant and only varies the size of the
input (see Figure 9 for the bound of
kthMin). However, in the results presented in the paper,
k is also varied. The exact data used in the paper are also provided in the directory
We have also provided some graphical plots of selected functions of the benchmarks to visualize the relationship between the dynamic resource usage and the statically inferred bounds in the folder:
~/leon/RuntimeEvaluation/results/graphs/. The graphs
deque-rotateDrop helps visualize the behavior of inner functions, which was summarized in Figure 11 of the paper. The lines in the graphs denote the statically inferred bounds and the dots denote the runtime resource usage.