LARA

  • The first effort in finding bugs for Paxos was unsuccessful. Here are the reasons:
    • The consensus happens in less than a couple of seconds, after that running modelchecker is useless, even if we let the system to run for hours
    • Nodes after reset propose the same key which is useless after consensus
    • If I make nodes to propose for different keys after reset, what should be the proposed value by the modelchecked nodes:
      • After a while it turns into a brand new key and trying a new key of a stable system is quite useless.
  • Solutions:
    • Add a huge drop rate and jitter for packets
    • Let the nodes propose for new keys after a while and try to force the later started modelchecker to to use the new key for propositions
  • Simulator time is less than real run times, so all the requests of simulation are rejected
    • I gonna set the simulator start time to currTime + 10

regId Problem

  • The problem is that in Paxos implementation they keep regId in memory and later use it. In ES that starts from taken snapshots, this value is restored. However, regId values would be different because of creating new objects
  • Solution: for the moment, because there is only 1 service registered for paxos handlers, it is sufficient to simply ignore this value. I hacked ServiceImpl.mac to implement that.

Time Problem

  • Problem: Paxos uses time and compare them. In MaceMC the simulated time started from 0 and then the comparison between MC times and the times recovered from snapshots were wrong
    • I fixed that by asking the modelchecker to initialize its time to 5s after MaceMC start time not 0 or any old number.
  • Problem: They wanted to combine time and machine id to make a unique id. They time is 64bit variable which is in precision of microseconds. They use only the lower 32 bit which keeps data in order of seconds and lower precisions. Hence, the MaceMC which starts hours after that, can have a lower seq than the previously assigned seq by live run
    • The solution is to keep start time in params.default.mc file

WiDS bug number 2

I tried to reproduce bug number 2 in WiDS. In simple words, they say:

  • They managed to violate a local safety property
  • But, to violate the main global safety property of Paxos, they had to follow the trace on paper

I can not define the same local property like they said because the structure of two paxos implementations are different. I came up with a violation scnario for global property but it happens in one single round and we have the old problems related to one round scenarios.

WiDS bug number 1

It is an implementation bug. I managed to come up with a two round violation scenario. The gap between two rounds is 5s. The scenario can be find in Paxos-wids-bug-1-avoidance-data. I reproduce it at live run and avoided it using 1step. Still, I have problems in avoiding it by ES:

  • The scenario has about 10 events but the bug happens at depth 20 because there are lots of branch point in modelchecker tree related to PKT DROP and PKT REORDER
  • I removed this options, but still it takes about 5 minutes to find the bug.
  • So, I managed to avoid it by ES but by putting a 5 min gap between 2 consecutive rounds.

This is the efforts for reducing the run time of modelchecker (After removing PKT DROP and REORDER options): Paxos-speedup-efforts

These number are nslrack52 which is the fastest machines. For real run, I applied the approaches of simplifying state and start from depth 11 and hence I managed to avoid the bug by detecting it after 6s. The gap between rounds were 10s. In future, we can use new fast machines are reduce it to 5s. Here are the numbers related to this experiment: Paxos-wids-bug-1-avoidance-data

Node Reset

  • I enabled NodeReset and so many bugs were detected by Random Walk in pure MaceMC, I disabled it, and no bug was reported. Here are the logs: Paxos-No-reset-log