LARA

Abstract

Assume a distributed implementation of a protocol. It is sometimes difficult, or even impossible, to come up with the proper safety properties for it. Instead, we can crosscheck the correctness of the distributed implementation against a simpler central implementation of the same protocol. This way, we do not need a separate specification language to express the properties we need for the system. Moreover, we can implement the central version in the same language as the distributed one, so there is no new language to learn.

I could not find anything like this in the literature for distributed systems. For central systems we have:

  • KLEE, which crosschecks two different implementations of libc
  • http://sdg.csail.mit.edu/forge/, which checks an implementation against its abstraction. There is a bunch of related work in the literature on this.
  • “Combining Symbolic Execution with Model Checking to Verify Parallel Numerical Programs,” 2008: it verifies the correctness of parallel MPI programs against their sequential equivalents. Both the parallel and the sequential version must first be written in Promela; it then uses SPIN, symbolic execution, and POR to prove that they are functionally equivalent, i.e., that both programs produce the same output for any given input. It focuses on floating-point numbers. What makes this much easier for MPI programs is that they have no unstable states.

Expectations

The inevitable fact is that exhaustive search is impossible, so any random-like heuristic is welcome. Because the approach will be randomized, the evaluation should demonstrate the usefulness of the system by finding previously reported bugs or, in the best case, new ones.

Problem Statement

Assume two implementations of a system: one distributed (D) and one central (C). There are two sets of operations on these systems:

  • (I) Insert, Update: change the internal data structure
  • (Q) Query: queries some information from the internal data structure. A query operation can possibly change the state as well.

No matter what the representation of the internal data structure is, we call the two systems equivalent if the following property holds:

Foreach (opList ops in [I,Q]*Q)
{
  dValue = D.ops();
  cValue = C.ops();
  ASSERT(dValue == cValue);
}
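The equivalence property above can be sketched as an executable differential-testing loop. This is a minimal illustration only: `ShardedKV` is a hypothetical toy stand-in for D, and `CentralKV` for C.

```python
import random

class CentralKV:
    """Central reference model C: a plain dict."""
    def __init__(self):
        self.data = {}
    def insert(self, k, v):          # I-operation
        self.data[k] = v
    def query(self, k):              # Q-operation
        return self.data.get(k)

class ShardedKV:
    """Hypothetical toy 'distributed' implementation D: two shards keyed by hash."""
    def __init__(self):
        self.shards = [{}, {}]
    def insert(self, k, v):
        self.shards[hash(k) % 2][k] = v
    def query(self, k):
        return self.shards[hash(k) % 2].get(k)

def crosscheck(trials=1000, seed=0):
    """Run random op sequences from [I,Q]*Q on both systems and
    assert that every Q returns the same value in D and C."""
    rng = random.Random(seed)
    d, c = ShardedKV(), CentralKV()
    for _ in range(trials):
        k = rng.randrange(10)
        if rng.random() < 0.5:                 # pick an I-operation
            v = rng.randrange(100)
            d.insert(k, v)
            c.insert(k, v)
        # each round ends with a Q, as in [I,Q]*Q
        assert d.query(k) == c.query(k), f"mismatch on key {k}"
    return True
```

Any divergence between the two implementations surfaces as a failed assertion on the first query that observes it.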

For example, Chord corresponds to a circular linked list in central systems. Then we have:

D= Chord
C= circular linked list
I= insert (join overlay), remove (leave overlay)
Q= getPredecessor, getSuccessor
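The central model for Chord can be sketched as a sorted circular list of node identifiers. This is a minimal Python sketch; the class and method names are hypothetical, chosen to mirror the operations above.

```python
import bisect

class Ring:
    """Central model of a Chord overlay: a sorted, circular list of node ids."""
    def __init__(self):
        self.ids = []                     # kept sorted; circularity via index arithmetic

    def join(self, node_id):              # I: node joins the overlay
        if node_id not in self.ids:
            bisect.insort(self.ids, node_id)

    def leave(self, node_id):             # I: node leaves the overlay
        self.ids.remove(node_id)

    def successor(self, node_id):         # Q: next id clockwise on the ring
        i = self.ids.index(node_id)
        return self.ids[(i + 1) % len(self.ids)]

    def predecessor(self, node_id):       # Q: previous id; ids[-1] wraps around
        i = self.ids.index(node_id)
        return self.ids[i - 1]
```

For a ring containing 10, 20, 30, the successor of 30 is 10 and the predecessor of 10 is 30, matching what a correct Chord deployment should eventually answer.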

Possible solutions

  • We can randomly select ops in [I,Q]*Q and, after each Q, check the output of the distributed system against the central version.
  • For the distributed system, after each I, make the corresponding APPLICATION CALL into the model checker's search tree to reach state SI.
  • After each Q, start a search from SI for a state SQ in which the output of Q matches the output of the central system.
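The search step in the last bullet can be sketched as a bounded breadth-first search over the model checker's transition relation. The `enabled_events` and `apply_event` hooks below are hypothetical placeholders for the checker's interface; states must be hashable.

```python
from collections import deque

def search_for_matching_state(initial_state, enabled_events, apply_event,
                              query, expected, max_depth=8):
    """BFS from state SI over event interleavings, looking for a state SQ
    where the query result matches the central system's answer.
    Returns the matching state, or None if none exists within max_depth."""
    frontier = deque([(initial_state, 0)])
    seen = {initial_state}
    while frontier:
        state, depth = frontier.popleft()
        if query(state) == expected:
            return state                      # found SQ
        if depth == max_depth:
            continue                          # depth bound reached on this path
        for e in enabled_events(state):
            nxt = apply_event(state, e)
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, depth + 1))
    return None
```

Note that a `None` result is ambiguous for the reasons discussed under Questions below: it may indicate a bug in D, or merely that the bound was too small to let the system stabilize.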

Questions

When to do a query?

  • A distributed system does not exactly match its central version. More precisely, Distributed System = Central System + dynamic thread. The role of the dynamic thread in the presented model is to capture the dynamics of the distributed system even when no application call is performed. The internal representation can therefore change dynamically, without any external intervention.
  • Due to these dynamics, there are moments when a Q to D returns the correct answer, and there are times when the answer is not what the corresponding central system was expecting.
  • The question is whether a mismatch should be interpreted as a bug in D, or whether it is an inevitable consequence of the dynamics of the system and we should just wait some more time. On the other hand, there is no stable point in the system; even after the system reaches a state where the query returns what we would like, a while later a timer can be triggered and send a message, the message can cause a TCP RST, the error handler removes the corresponding link, and again Q returns an unstable result.
  • To sum up, there is no mechanism to distinguish between an unstable answer due to the dynamics of the system and a wrong answer due to a bug in the system.

What about node resets?

  • In the search path, we definitely have to remove *node reset* from the possible events, because it corresponds to a new operation (probably delete) on the central system. Since we do not perform any new operation on the central system while searching in the model of the distributed system, considering node resets would make the outputs inconsistent.
  • In fact, this is a side effect of the less-than-100% match between the central and distributed versions; i.e., an ideal distributed system should handle node failures and still return correct answers to queries.

How about abstraction functions?

  • We can always leverage the techniques used to show that one state machine really implements another state machine (i.e., the spec) to combat our problem. But in distributed systems we have a bigger problem that we should solve first: the dynamics behind the system.
  • To clarify this, let's take two state machines, one impl and the other spec. We have three operations: set, reset, and query.
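The traditional refinement check for two such machines can be sketched as follows. The `Impl` internals and the abstraction function are hypothetical illustrations: after every operation, the impl state is mapped through the abstraction function and compared to the spec state.

```python
class Spec:
    """Abstract machine: the state is just the current value (or None)."""
    def __init__(self):
        self.value = None
    def set(self, v):
        self.value = v
    def reset(self):
        self.value = None
    def query(self):
        return self.value

class Impl:
    """Concrete machine (hypothetical): keeps a full history; the abstract
    value is the most recent entry."""
    def __init__(self):
        self.history = []
    def set(self, v):
        self.history.append(v)
    def reset(self):
        self.history.append(None)
    def query(self):
        return self.history[-1] if self.history else None

def abstraction(impl):
    """Abstraction function: map the concrete state to an abstract value."""
    return impl.history[-1] if impl.history else None

def check_refinement(ops):
    """Run the same operation sequence on both machines; after every step,
    the abstracted impl state must equal the spec state."""
    impl, spec = Impl(), Spec()
    for op, args in ops:
        getattr(impl, op)(*args)
        getattr(spec, op)(*args)
        assert abstraction(impl) == spec.value
    return True
```

This works cleanly for sequential machines precisely because every state change is triggered by an operation; the dynamic thread of a distributed system breaks that assumption, which is the point of this section.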

[Figure: impl vs spec in distributed systems]

  • As illustrated in the picture above, there are states in which we can query the system, but the result would be wrong because of instability.
  • Perhaps, in simple words, we can name the following as the fundamental difference between our problem and traditional impl-vs-spec approaches:
    • Here, the distributed system does not exactly implement the sequential program.