The Voting (or Quorum Consensus) replication protocol [Gifford 79] is a replication
scheme which can operate correctly in the presence of network partitions. In this method a
non—negative weight is assigned to each replica and this weight information is available to
every client in the network. When a client wishes to read (write) the group it must first gain
access to what is known as a read (write) quorum. A quorum is any set of replicas with
(typically) more than half the total weight of all copies. A read quorum (Nr) and a write
quorum (Nw) are defined such that Nr+Nw > N (the total weight).
A read operation requires the access of any Nr copies (only data from up—to—date
replicas should be used), and a write operation requires N up—to—date copies (so updates
are not applied to obsolete replicas). The number of inaccessible copies tolerated by a
read is N-Nr, and for a write operation it is N—Nw. The purpose of having quora is to
ensure that read and write operations have at least one copy of an object in common. If the
network partitions then voting allows access only from the majority partition if one exists.
Associated with each replica is a timestamp or update number which clients can use to determine which replicas are up—to—date. If Nw ≠ N then a read quorum is required to obtain the most up-to-date version of this number. If Nw = N then every functioning copy must contain the same value because every write operation has been performed on every replica in the group. This update number is used by clients to ensure that they only read data from up-to-date replicas, even though they may acquire access to out-of-date replicas in their read quorum.
Associated with each replica is a timestamp or update number which clients can use to determine which replicas are up—to—date. If Nw ≠ N then a read quorum is required to obtain the most up-to-date version of this number. If Nw = N then every functioning copy must contain the same value because every write operation has been performed on every replica in the group. This update number is used by clients to ensure that they only read data from up-to-date replicas, even though they may acquire access to out-of-date replicas in their read quorum.
The write operation is a two-phase, atomic operation, because either the states of Nw copies are modified or none of them are changed (to ensure that subsequent read and
write quorum overlap and that a majority of the replicas are consistent). If a write quorum
cannot be obtained the transaction must be aborted. However, a separate transaction can
be run to copy the state of a current replica to an out-of-date replica. It is always legal to
copy the contents of replicas in this way.
The weights assigned to replicas should be based on their relative importance to the system e.g., a printer spooler which resides on a very fast node would be considered better for throughput than one which resides on a slower node and would therefore be assigned a higher weight than a replica on a slower node. A replica with higher weight is more likely to be in the quorum component.
A major problem with this protocol is that read operations require a quorum, even if there is a local copy of the object. This can prove inconvenient (and slow) if an object wishes to carry out many read operations on that object in a short space of time. There is also the problem of fault tolerance: many copies of an object must be created to be able to tolerate only a few failures e.g., we require five copies just to be able to tolerate two crashes.
The weights assigned to replicas should be based on their relative importance to the system e.g., a printer spooler which resides on a very fast node would be considered better for throughput than one which resides on a slower node and would therefore be assigned a higher weight than a replica on a slower node. A replica with higher weight is more likely to be in the quorum component.
A major problem with this protocol is that read operations require a quorum, even if there is a local copy of the object. This can prove inconvenient (and slow) if an object wishes to carry out many read operations on that object in a short space of time. There is also the problem of fault tolerance: many copies of an object must be created to be able to tolerate only a few failures e.g., we require five copies just to be able to tolerate two crashes.
Note that to cut down on the storage requirements Witnesses [Paris 86] can be used in
place of actual replicas. A witness only maintains the current version number of the data.
They can take part in quora, but there must be at least one "real" replica in a quorum.
Other replication protocols based on the Voting protocol also exist [Abbadi 90][Jajodia89][Davcev 89]. They address issues such as the need to acquire a quorum for a read the responses of "true" copies. A write can only succeed if a write quorum contains at least
one non-ghost copy.
Because of the way in which ghosts are created and used, a ghost is used to ensure that a particular non-ghost copy has failed due to a node crash and not to a segment partition. If it is possible to create a ghost then the segment has not been partitioned and only the node has failed. In Available Copies when a replica does not respond it is simply assumed that it is because of node failure, which can result in inconsistencies if the copy was partitioned. When the node on which the non-ghost copy originally resided is re-booted it is possible to convert the ghost copy back into its live version.
Because of the way in which ghosts are created and used, a ghost is used to ensure that a particular non-ghost copy has failed due to a node crash and not to a segment partition. If it is possible to create a ghost then the segment has not been partitioned and only the node has failed. In Available Copies when a replica does not respond it is simply assumed that it is because of node failure, which can result in inconsistencies if the copy was partitioned. When the node on which the non-ghost copy originally resided is re-booted it is possible to convert the ghost copy back into its live version.
No comments:
Post a Comment