LARA

CrystalBall Project Notes

Benefits:

  • Via CrystalBall we find the bugs which are likely to happen in real runs
  • In the case of a found bug, it is very easy to understand the path which leads to bug. e.g. I gave up using Random Walk in live run because even after finding a bug, it takes me 1 day to figure out the long and boring scenario
  • Rule of thumb: When a bug happens so often in live run, it is likely to be discovered by Random Walk.
  • Assumes that we applied Random Walk instead of Serial Search in CrystalBall. RW find (unlikely) cases which are 50 steps deep and miss a lot of simple cases. SS find 99% of all cases which are about to happen in depth 6-7. So, SS would be much more effective for steering than RW.

Partial-Order Reduction

POR

Chord bugs

Paxos

RandTree

  • I ran MaceMC but this time I added some liveness properties. In this way, I artificially enabled random walk. Consequently, it found most of the RandTree bugs reported in the paper. It is not a surprise, because when the happening rate of bugs is so much, it is likely to be detected by Random Walk.
  • I ran a fresh copy of MaceMC with random walks and fixed every reported bug. I even let the MaceMC to run for 1 hour but nothing was reported.
  • I ran CrystalBall with the fixed version of RandTree and I found one new bug. randtree-badRoot-bug
  • It does not happen so often, so I suggest the following for evaluation section:
  • Describe some scenarios and how Execution Steering avoid them (individually)
  • Show the performance impact of ES, but show graphs of RandTree avoidance rate for the bugs that happen so often.

MaceMC performance

  • In this section we show the low performance of MaceMC when it is used for exhaustive search. This experiment is done on one of the normal machines: nslrack08 for example macemc-performance

Consequence Prediction

Applying Minimax

  • In this section we discuss how we can apply minimax algorithm on modelchecking of distributed systems minimax

Global snapshot notes

Look at this page for further discussions. global-snapshots

  1. Does Lamport paper (i.e. Distributed Snapshots: Determining Global States of Distributed Systems) work for all applications?
  2. If “asynchronous recovery without using vector timestamps” really works, so was lamport a total idiot by not thinking about it?
  3. Does it work for multiple channels?
  4. Is it necessary to take checkpoints regularly in this paper?

Crosscheck two implementations

Future of CrystalBall

In this section we discuss possible practical applications of crystalBall