Friday, January 06, 2017
Following on from the last entry, we shall now consider one implementation of a communication subsystem which provides some of the delivery properties described previously. It is important to understand how these ordering requirements can be met, and the overhead which is involved in guaranteeing them, before we discuss how such communication primitives can be used to provide replicated object groups. Other reliable communication subsystems exist, of which [Chang 84][Cristian 85][Cristian 90][Verissimo 89] are a sample, but we shall consider Psync because it illustrates many points clearly.
Psync [Peterson 87][Mishra 89] ("pseudosynchronous") is a communication subsystem designed to provide reliable multicast communication between objects, and is based on the message history approach described above. The system assumes that operations on objects which change the state occur atomically and are idempotent. Associated with each object is a manager process. A client process locates a particular manager (perhaps by consulting a naming service) and then invokes operations on the object by sending requests to that manager. When a manager receives a request to invoke a particular operation on an object, it encapsulates the operation in a message and uses the Psync many-to-many communications protocol to forward the message to all of the managers involved (including itself) if the object is a member of a group. Based on the set of received messages, each manager can then decide on an order in which to apply the operations to its
copy of the object. This protocol can be extended to be used for the interactions of replicated object groups, and the exact details of the replication protocol used in Psync will be described in a later posting.
Psync explicitly preserves the partial ordering of messages exchanged among a collection of processes in the presence of communication and processor failures (Psync cannot function in the presence of network partitions). A collection of processes exchange messages through a conversation abstraction. This conversation is defined by a directed acyclic graph (a context graph) that preserves the partial order of the exchanged messages. This ordering is made available to all managers involved in a conversation and by using this they can determine when to execute operations on their local objects.
When processes communicate they do so by sending messages in the context of those messages they have already received. Participants are able to receive all messages sent by the other participants in the conversation but they do not receive the messages that they themselves send. Each participant in a conversation has a view of the context graph that corresponds to those messages it has sent or received. The semantics of the communications primitives provided by Psync are defined in terms of the context graph and a participant's view.
The figure below shows an example of a context graph. This conversation is started with the initial message m1. Messages m2 and m3 were sent by processes that had received m1, but are independent of each other (hence no link between them), and m4 was sent by a process that had received m1 and m3, but not m2. Messages that are not in the context of some other message are said to be concurrent (occur at the same logical time). The relation < is defined to be "in the context of".
The context graph contains information about which processes have received what messages. The receipt of a message implies that the sender has seen all of the messages which came before it in the context graph. A message mp sent by process p is said to be stable if for each participant in the conversation q ≠ p, there exists vertex mq in the context graph sent by q, such that m < mq. For a message to be stable means that all participants except the sender have received it, it must follow that all future messages sent to the conversation must be in the context of the stable message.
In the figure below, we have a context graph which depicts a conversation between three participants, a, b, and c. Messages al, a2, ... denotes the sequence of messages sent by process a, and so on. Message al, b1, and c1 are the only stable messages; messages a2 and a3 are two unstable messages sent by process a.
Psync maintains a copy of a conversation's context graph at each host on which a participant in the conversation resides. Each process in the conversation receives messages from this local copy of the context graph, which is termed the image. Whenever a process at one host sends a message, Psync propagates a copy of the message to each of the hosts in the conversation. This message contains information about all messages upon which the new message depends, so that the receiving hosts can append it to their context graphs in the correct place.
Suppose message m is not delivered to some host because of a network failure. If at some future time a message m' arrives that depends on m then the host will detect that it is missing m and will send a retransmission request form to the host that sent m', (this host is guaranteed to have m as a local participant has just sent a message which is in the context of it).
The operations provided to aid applications in recovering from host failures include the ability for a participant to remove a failed participant from its definition of the participant set for a conversation. This is necessary so that messages will eventually stabilize relative to the functioning participants. Once a given participant has been masked out, Psync ignores all further messages from that process.
There is also an inverse operation that allows a participant to rejoin a participant set. It would be invoked when a participant becomes aware that another participant which was formally failed has now recovered.
When a host recovers, a participant can initiate a recovery action which will inform other participants that the invoking participant has restarted, and to initiate reconstruction of the local image of the context graph. Each member of the conversation will transmit its local copy of the context graph to the recovering participant who can then use this to reconstruct its own local copy.
As described, the Psync protocol only gives a partial ordering of messages i.e., only the causal ordering of messages is preserved. To convert a partial order into a total order, whereby messages which are not causally related are ordered identically at all overlapping destinations, requires additional information to be shared amongst the destinations which indicates the order in which to place such messages. In Psync, the context graph which accompanies each message provides this information. The partial order that Psync provides can be used to give a total order if all participants perform the same topological sort of the context graph. This sort must be incremental i.e., each process waits for a portion of its view to be stabilized before allowing the sort to proceed. This is done to ensure that no future messages sent to the conversation will invalidate the total ordering. The replication protocol used in Psync uses just such a scheme and will be described in a later article.