Existing distributed systems which make use of multicast communication can be
categorised as using either one—to—many communication (one client interacting with a
replica group) or many—to—many communication (replica groups interacting with replica
groups). We shall now look at a number of the popular protocols and show how they have approached the problems of maintaining replica consistency to both external and internal
events.
One-to-Many Communication
The V System
The V System [Cheriton 84][Cheriton 85] makes use of what they term one-to-many
communication transactions, where the start of a transaction marks the end of the previous
transaction (any outstanding messages associated with the old transaction are destroyed
without being processed). Each group member has a finite message buffer into which
replies are queued provided they arrive before the start of the next send transaction. Although a reliable communication protocol is not used, the assumption is made that by
making use of retransmissions is it possible to ensure (with a high probability) that
messages are delivered if the sender and receiver remain operational. A further
assumption is that at least one operational group member receives and replies to a
message.
The designers of the V System recognized the possibility of message loss due to message buffer overflow, and their solution was to make use of larger buffers and to modify the kernel to introduce a random delay when replying to a group send, thereby reducing the number of messages in a queue at any time. This imposes a consistent overhead on all group replies but still does not ensure that buffer overflow will never occur (if the number of different groups interacting with each other increases then even with this delay it is still probable that the finite sized message queues will overflow).
When servers reply to a client they do so on a many-to--one basis i.e., all servers which received a request from a particular client will attempt to reply to that client (and not to any replica group the client may be a member of). This means that there is the possibility of members of the same group taking different actions as a result of local events such as timeouts (resulting in state divergence), because the servers replied to some, but not all, clients on time.
It was recognized that messages might be lost by processes because of message buffer
overflow, but in Andrew it is assumed that such loses will be detected by retries. This
implicitly relies on the servers keeping the last transmitted replies to every client, but
could still result in state divergence if, for example, a server group is seen to have failed by
one replicated client when it retries a request, when in fact all other members of the client
group received the reply before the failure.
Despite the assumptions made by the designers, both the problem of message buffer
overflow, and local timeouts, can occur in the Circus system, leading to replica state
divergence.
The designers of the V System recognized the possibility of message loss due to message buffer overflow, and their solution was to make use of larger buffers and to modify the kernel to introduce a random delay when replying to a group send, thereby reducing the number of messages in a queue at any time. This imposes a consistent overhead on all group replies but still does not ensure that buffer overflow will never occur (if the number of different groups interacting with each other increases then even with this delay it is still probable that the finite sized message queues will overflow).
When servers reply to a client they do so on a many-to--one basis i.e., all servers which received a request from a particular client will attempt to reply to that client (and not to any replica group the client may be a member of). This means that there is the possibility of members of the same group taking different actions as a result of local events such as timeouts (resulting in state divergence), because the servers replied to some, but not all, clients on time.
The Andrew System
The Andrew System [Satyanarayanan 90] assumes that the communication system is
reliable in much the same manner as the V System i.e., given a sufficient number of retries
then any operational server can always be contacted. Communication is one-to-many and
the termination conditions for a particular client-server interaction is set on a per call
basis (depending on the application). The designers recognized that calls between remote
processes can take an indeterminate amount of time and so if a client times out it sends an
"are you alive" message to any server replica which has not responded and if a reply is
returned then it continues to wait. However, because servers reply on a many-to-one
basis, state divergence between replicated clients can still occur: if only one client receives a reply sent before all servers fail, then the replicated clients may take different actions.
Many-to-Many Communication
The Circus System
The Circus System [Cooper 84b] (and Multi-Functions [Banatre 86b] in Gothic [Banatre 86a]) makes use of many-to-many communication, with both client and server groups. As with the V System they assume that with a sufficient number of transmission retries it is possible to deliver messages to all operational members of a group. Thus, if a call eventually times out (after retrying a sufficient number of times) a sender can assume that the receiver(s) has failed. Each member of a group has a message queue which is assumed to be infinite in size (if this assumption cannot be made then the designers of Circus assume that retransmissions of requests and replies can account for losses due to buffer overflow).
Throughout the description of Circus the assumption is made that the members of a
group are detenninistic (given the same set of messages in the same order and the same
starting state then they will all arrive at the same next state). Client groups' members
operate as though they were not replicated and do not know that they are members of a
replica group. No mention is made of how this determinism of group members extends to
asynchronous local events such as timeouts.
No comments:
Post a Comment