CrystalBall Project Notes
Benefits:
- Via CrystalBall we find the bugs which are likely to happen in real runs
- In the case of a found bug, it is very easy to understand the path which leads to bug. e.g. I gave up using Random Walk in live run because even after finding a bug, it takes me 1 day to figure out the long and boring scenario
- Rule of thumb: When a bug happens so often in live run, it is likely to be discovered by Random Walk.
- Assumes that we applied Random Walk instead of Serial Search in CrystalBall. RW find (unlikely) cases which are 50 steps deep and miss a lot of simple cases. SS find 99% of all cases which are about to happen in depth 6-7. So, SS would be much more effective for steering than RW.
Partial-Order Reduction
Chord bugs
Paxos
RandTree
- I ran MaceMC but this time I added some liveness properties. In this way, I artificially enabled random walk. Consequently, it found most of the RandTree bugs reported in the paper. It is not a surprise, because when the happening rate of bugs is so much, it is likely to be detected by Random Walk.
- I ran a fresh copy of MaceMC with random walks and fixed every reported bug. I even let the MaceMC to run for 1 hour but nothing was reported.
- I ran CrystalBall with the fixed version of RandTree and I found one new bug. randtree-badRoot-bug
- It does not happen so often, so I suggest the following for evaluation section:
- Describe some scenarios and how Execution Steering avoid them (individually)
- Show the performance impact of ES, but show graphs of RandTree avoidance rate for the bugs that happen so often.
MaceMC performance
- In this section we show the low performance of MaceMC when it is used for exhaustive search. This experiment is done on one of the normal machines: nslrack08 for example macemc-performance
Consequence Prediction
- In this section we describe the observations which leads us towards consequence-prediction
Applying Minimax
- In this section we discuss how we can apply minimax algorithm on modelchecking of distributed systems minimax
Global snapshot notes
Look at this page for further discussions. global-snapshots
- Does Lamport paper (i.e. Distributed Snapshots: Determining Global States of Distributed Systems) work for all applications?
- If “asynchronous recovery without using vector timestamps” really works, so was lamport a total idiot by not thinking about it?
- Does it work for multiple channels?
- Is it necessary to take checkpoints regularly in this paper?
Crosscheck two implementations
Future of CrystalBall
In this section we discuss possible practical applications of crystalBall