In this section, inspired by DPOR work, we develop an algorithm for model checking of distributed systems which considers different orders of happens-before relation.
Maysam: |
---|
According to D-POR system requirement, the shared object is the whole network and hence the above definition is not valid. If we consider the whole network as a shared object, then the bellow problem will not arise. |
Is not enough? |
Maysam: |
---|
But, we need to change D-POR algorithm accordingly. e.g. |
Here, we know e1 and e4 are dependent but e4 does not exist at initial state to enable it. |
Maysam: |
---|
I think the above problem arises because of fundamental differences between our environment and D-POR paper's environment. There, they have only and only one event active for each thread to run, but in our system a process can have events activated. |
Hence, it is ESSENTIAL to change D-POR paper accordingly and PROVE that it works for our system as well. |
Otherwise, it violates soundness of POR |
Terms | Definitions |
---|---|
Running event number in node | |
→ | Happens before relationship & it means that |
an ordered set of events from initial state of modelchecking to the current state | |
The set of dependency relationships | |
and |
My concern is that may be after all the performance does not improve so much. As depicted in the figure above, we practically have visited all the states. Of course, in this case duplicate state detection will solve the case.
Any idea for applying the idea of DPOR paper directly to CrystalBall has an inherent problem.
To describe it more clearly, recall that the general idea is like this: 1- traverse one path in search space 2- find race points 3- Add backtrack branches at appropriate positions to address race conditions.
Back to process and thread environment, step 1 is one possible interleaving of threads from start of the program till the end of the program. To realize what happens if the selected path in step 1 was not complete, think about a path which only includes running N instructions from one of the threads. Because all of the instructions belong to a single thread, there is no race condition in path and therefore the is no need to explore other branches as well.
Now, consider again our situation in crystalBall: we have 5-7 nodes and we can search up to depth 7-8. As you can see, there is no space for heuristics to mitigate this problem. Lets say if we want to evenly select events from different node at step 1, there would be 1 event for each node. I can say with closed eyes how many race conditions we would miss in this case.
Putting it all together, I think the final solution needs to be something like consequence prediction approach (i.e. storing hash of prev visited states) rather than D-POR approach (i.e. visit one path and try to figure out proper alternatives).
The above argument questions about possibility of applying POR using the scheme like the one used by DPOR paper. The door for applying it using other schema (e.g. like Consequence Prediction's hash table) is still open.
Here are the facts related to used DPOR in MPI:
DPOR applies only for threaded systems. For distributed systems, we need more:
MPI systems are somewhere between multi-threaded processes and event driven distributed programs. However, the efforts for applying DPOR in MPI modelcheckers are not complete yet:
In summary, I would consider using DPOR in event-driven distributed systems as an open research problem. Although, the works in MPI community can be inspiring but can not be directly ported to our field.