Why we do not need POR

  1. Spin uses static analysis for partial order reduction
  2. Spin uses model instead of executable code
  3. For modelchecking executable codes in distributed systems, I do not remember any D-POR proposed.
  4. POR target soundness which considering missing snapshots in CrystalBall is non-sense.

DPOR-DS based on happens-before relation

In this section, inspired by DPOR work, we develop an algorithm for model checking of distributed systems which considers different orders of happens-before relation.

DPOR-DS by Happens-before

Dynamic POR Facts

Notes on POR

Maysam:
According to D-POR system requirement, the shared object is the whole network and hence the above definition is not valid. If we consider the whole network as a shared object, then the bellow problem will not arise.
Is not $B \neq D$ enough?
Maysam:
But, we need to change D-POR algorithm accordingly. e.g.
Dependence Relation for Receiving Messages
Here, we know e1 and e4 are dependent but e4 does not exist at initial state to enable it.
Maysam:
I think the above problem arises because of fundamental differences between our environment and D-POR paper's environment. There, they have only and only one event active for each thread to run, but in our system a process can have $n : 0 <= n <= max$ events activated.
Hence, it is ESSENTIAL to change D-POR paper accordingly and PROVE that it works for our system as well.
Otherwise, it violates soundness of POR

DD-POR

Solution 1

Terms Definitions
$e_{ij}$ Running event number $i$ in node $j$
$e_{ij}$$e_{i'j'}$ Happens before relationship & it means that $i < i'$
$path$ an ordered set of events from initial state of modelchecking to the current state
$D$ The set of dependency relationships
$i \neq i'$ and $j \neq j'$ $(e_{ij}, e_{i'j'}) \not \in D$
  1. Add event $e_{i'j'}$ to $path$
  2. Foreach (Event $e_{ij} \in path - e_{i'j'}$)
    1. if $(e_{ij}, e_{i'j'}) \in D$
      1. $stt = path.indexOf(e_{ij})$
      2. $end = path.indexOf(e_{i'j'})$
      3. $e_{i''j''} = max(path[index])$ where $stt < index < end$ and $path[index]$$e_{i'j'}$
      4. add a branch point for $e_{i''j''}$ besides $e_{ij}$

DD-POR

My concern is that may be after all the performance does not improve so much. As depicted in the figure above, we practically have visited all the states. Of course, in this case duplicate state detection will solve the case.

DPOR in CrystalBall

Any idea for applying the idea of DPOR paper directly to CrystalBall has an inherent problem.

To describe it more clearly, recall that the general idea is like this: 1- traverse one path in search space 2- find race points 3- Add backtrack branches at appropriate positions to address race conditions.

Back to process and thread environment, step 1 is one possible interleaving of threads from start of the program till the end of the program. To realize what happens if the selected path in step 1 was not complete, think about a path which only includes running N instructions from one of the threads. Because all of the instructions belong to a single thread, there is no race condition in path and therefore the is no need to explore other branches as well.

Now, consider again our situation in crystalBall: we have 5-7 nodes and we can search up to depth 7-8. As you can see, there is no space for heuristics to mitigate this problem. Lets say if we want to evenly select events from different node at step 1, there would be 1 event for each node. I can say with closed eyes how many race conditions we would miss in this case.

Putting it all together, I think the final solution needs to be something like consequence prediction approach (i.e. storing hash of prev visited states) rather than D-POR approach (i.e. visit one path and try to figure out proper alternatives).

The above argument questions about possibility of applying POR using the scheme like the one used by DPOR paper. The door for applying it using other schema (e.g. like Consequence Prediction's hash table) is still open.

DPOR in MPI

Here are the facts related to used DPOR in MPI:

  1. Message Passing Interface (MPI) is a specification for an API that allows many computers to communicate with one another. It is used in computer clusters and supercomputers.
  2. I am wondering why it is never used and even never tried for general distributed systems. Speculatively, I can list the following reasons:
    • MPI implementation uses TCP and hence is not efficient for general distributed systems.
    • Event driven architecture is more matched for most of distributed system applications. It is sometimes explicitly stated like in Mace or implicitly used by the programmers (using a thread for listening to messages and create a proper thread corresponding the the task that should be done by receipt of the message).

DPOR applies only for threaded systems. For distributed systems, we need more:

  1. A clear definition of in-dependency relationship
  2. An efficient DPOR like algorithm for soundly apply the in-dependency relationship.

MPI systems are somewhere between multi-threaded processes and event driven distributed programs. However, the efforts for applying DPOR in MPI modelcheckers are not complete yet:

In summary, I would consider using DPOR in event-driven distributed systems as an open research problem. Although, the works in MPI community can be inspiring but can not be directly ported to our field.