LARA – Lab for Automated Reasoning and Analysis -

Observation 0

For every network event m there is at least one not network event e where e ⇒ m. (happens before)

Observation 1

The events in distributed systems can be placed in groups, where in each group there is one event e0 where it “happens before” all other events in the same group. The sequence of events in each group happens very fast. In other words, the common schema in distributed systems is like this: a scheduler or application event runs and it generates a series of network events one after each other; till there is no other network events and system get stable again. Another round will be started by a new scheduler or application event.

Observation 2

Assume two set of nodes A and B. Consider Global property G which its violation needs a particular state in all of nodes belong to A and B.
Consider two chains of happens before relations E_A and E_B where E_N only includes events related to node N. If G happens as result of E_A and E_B, in a short period of time, the probability of happening E_A and E_B at the same time is very low. Even though, in such a rare bugs, because of so many possibilities of interleaving of E_A and E_B events, any avoiding mechanism can lead to lots of false positives. Then, we prefer not to take these kind of scenarios into consideration.
Example: two different nodes join to different locations of a very large rings.

Observation 3

If two independent chain of events result in network events to the same node, then the probability of happening them at the same time is low. Putting the fact into background that consequence prediction is supposed to find the violations that can happen in the next short period of time, it is not necessary to consider the happening of such a rare combination of parallel sequence of events.

Example: two different nodes tries to join to same node A at the same time.

Example 1

Assume two independent scheduler events s1 on node a and s2 on node b.
According to observation 1, violation of a global property which includes both a and b, can not happen without any communication between a and b. Then, it would be enough to check s1 and s2. ([s1,s2] and [s2,s1] are redundant)

Observation 4

Usually state changes which result in a global safety violation are caused by receive handlers not scheduler or application handlers. In other words, the role of scheduler and application handlers in a bug scenario are mostly issuing other events (either network or scheduler) which finally cause violation.

Example 2

Now, assume that three additional network messages m3, m4, m5 on nodes c, d, and e respectively, where these happens before relations are held: a[s1] ⇒ c[m3] c[m3] ⇒ d[m4] d[m4] ⇒ b[m5] And, running event b[s2] after receipt of b[m5] can issue message e[m6] which results in a safety violation. The sequence we are looking for is {b[s2], e[m6]}. We try b[s2] once at the start of modelchecking. Trying b[s2] at the other depths is totally useless unless something change state of node b. Then trying b[s2] is reasonable because this time scheduler event s2 may be does something else.