LARA – Lab for Automated Reasoning and Analysis -

Lamport

Lamport has an algorithm for getting a global snapshot of the system. Generally, the process starts when a node sends a marker out to all of the channels. Each node by receiving this marker take a snapshot and transmit the marker to all away channels. Besides, each process keep track of the received messages during the first and second receipt of the marker.

Pros: It is a light-weight process and can happen at any arbitrary point in time

Cons: The taken global snapshot is consistent only for checking stable properties. Besides, channels are assumed to be error-free, and to deliver messages in the order sent. So, it does not work for UDP.

Singhal

In Singhal approach (i.e. asynchronous recovery without using vector timestamps) nodes take snapshots regularly. It optimistically assumes that the local timers are sync. To address exceptions, each message piggy-backs the seq of its last taken snapshot. If the received node found the sender's clock faster than itself, it immediately take a snapshot and make its clock fast to be sync. In this way, the snapshots with the same seq are sync. To collect the global snapshot, the collector send a request for checkpoint with seq=Sn from all of the nodes. Every responder, responds with the smallest seq where seq >= Sn. That is because of missed checkpoints because of making the clock fast operation.

Pros: Reordering is not important, hence it works with multiple channel as well. Unlike Lamport, there is no limitation on properties we want to check.

Cons:
- Assumes that message are not lost [But, I do not why this is necessary]
- There is trade-off between the granularity of the time that the global snapshots are taken and imposed overhead for taking snapshots that we do not need.
- In other words, we can not have snapshots at each arbitrary point of time. We just can pick an already taken one.

Was Lamport a total idiot?

No, he was not. Lamport's paper is for 1985, when the computation power could not afford taking frequent unnecessary snapshots. Singhal's paper is for 2002, when these expenses were totally affordable.

Does Singhal work with multiple TCP connections?

Yes, it works. No matter how many channels are between processes, at every point of time each node only process one message selected from one of the channels. Moreover, the reordering of messages is acceptable in Singhal approach.

Can we remove regular checpoints?

In other words, you want your local clock to proceed only by sending messages. You are not violating Singhal assumptions, so at the end your global snapshot would be consistent but not necessary new. Assume that nodes A and B exchanging messages so frequently that their clock show seq 1000. On the other hand, node C's clock is too slow (seq = 5) because he has not have any communication since 1h ago. Then, when C request for checkpoint number 5 nodes A and B return the checkpoints number 5 which belongs to 1 hour ago. It is still consistent but too old to be useful. In other words, we are interested in consistent and fresh checkpoints not only consistent.