Friday, January 06, 2017


Following on from the last entry, we shall now consider one implementation of a communication subsystem which provides some of the delivery properties described previously. It is important to understand how these ordering requirements can be met, and the overhead which is involved in guaranteeing them, before we discuss how such communication primitives can be used to provide replicated object groups. Other reliable communication subsystems exist, of which [Chang 84][Cristian 85][Cristian 90][Verissimo 89] are a sample, but we shall consider Psync because it illustrates many points clearly.


Psync [Peterson 87][Mishra 89] ("pseudosynchronous") is a communication subsystem designed to provide reliable multicast communication between objects, and is based on the message history approach described above. The system assumes that operations on objects which change the state occur atomically and are idempotent. Associated with each object is a manager process. A client process locates a particular manager (perhaps by consulting a naming service) and then invokes operations on the object by sending requests to that manager. When a manager receives a request to invoke a particular operation on an object, it encapsulates the operation in a message and uses the Psync many-to-many communications protocol to forward the message to all of the managers involved (including itself) if the object is a member of a group. Based on the set of received messages, each manager can then decide on an order in which to apply the operations to its
copy of the object. This protocol can be extended to be used for the interactions of replicated object groups, and the exact details of the replication protocol used in Psync will be described in a later posting.

Conversations and Context Graphs

Psync explicitly preserves the partial ordering of messages exchanged among a collection of processes in the presence of communication and processor failures (Psync cannot function in the presence of network partitions). A collection of processes exchange messages through a conversation abstraction. This conversation is defined by a directed acyclic graph (a context graph) that preserves the partial order of the exchanged messages. This ordering is made available to all managers involved in a conversation and by using this they can determine when to execute operations on their local objects.

When processes communicate they do so by sending messages in the context of those messages they have already received. Participants are able to receive all messages sent by the other participants in the conversation but they do not receive the messages that they themselves send. Each participant in a conversation has a view of the context graph that corresponds to those messages it has sent or received. The semantics of the communications primitives provided by Psync are defined in terms of the context graph and a participant's view.

The figure below shows an example of a context graph. This conversation is started with the initial message m1. Messages m2 and m3 were sent by processes that had received m1, but are independent of each other (hence no link between them), and m4 was sent by a process that had received m1 and m3, but not m2. Messages that are not in the context of some other message are said to be concurrent (occur at the same logical time). The relation < is defined to be "in the context of".
The context graph contains information about which processes have received what messages. The receipt of a message implies that the sender has seen all of the messages which came before it in the context graph. A message mp sent by process p is said to be stable if for each participant in the conversation q ≠ p, there exists vertex mq in the context graph sent by q, such that m < mq. For a message to be stable means that all participants except the sender have received it, it must follow that all future messages sent to the conversation must be in the context of the stable message.

In the figure below, we have a context graph which depicts a conversation between three participants, a, b, and c. Messages al, a2, ... denotes the sequence of messages sent by process a, and so on. Message al, b1, and c1 are the only stable messages; messages a2 and a3 are two unstable messages sent by process a.
Psync maintains a copy of a conversation's context graph at each host on which a participant in the conversation resides. Each process in the conversation receives messages from this local copy of the context graph, which is termed the image. Whenever a process at one host sends a message, Psync propagates a copy of the message to each of the hosts in the conversation. This message contains information about all messages upon which the new message depends, so that the receiving hosts can append it to their context graphs in the correct place.

Dealing with Network and Host Failures

Suppose message m is not delivered to some host because of a network failure. If at some future time a message m' arrives that depends on m then the host will detect that it is missing m and will send a retransmission request form to the host that sent m', (this host is guaranteed to have m as a local participant has just sent a message which is in the context of it).

The operations provided to aid applications in recovering from host failures include the ability for a participant to remove a failed participant from its definition of the participant set for a conversation. This is necessary so that messages will eventually stabilize relative to the functioning participants. Once a given participant has been masked out, Psync ignores all further messages from that process.

There is also an inverse operation that allows a participant to rejoin a participant set. It would be invoked when a participant becomes aware that another participant which was formally failed has now recovered.

When a host recovers, a participant can initiate a recovery action which will inform other participants that the invoking participant has restarted, and to initiate reconstruction of the local image of the context graph. Each member of the conversation will transmit its local copy of the context graph to the recovering participant who can then use this to reconstruct its own local copy.

Total Ordering

As described, the Psync protocol only gives a partial ordering of messages i.e., only the causal ordering of messages is preserved. To convert a partial order into a total order, whereby messages which are not causally related are ordered identically at all overlapping destinations, requires additional information to be shared amongst the destinations which indicates the order in which to place such messages. In Psync, the context graph which accompanies each message provides this information. The partial order that Psync provides can be used to give a total order if all participants perform the same topological sort of the context graph. This sort must be incremental i.e., each process waits for a portion of its view to be stabilized before allowing the sort to proceed. This is done to ensure that no future messages sent to the conversation will invalidate the total ordering. The replication protocol used in Psync uses just such a scheme and will be described in a later article.

Tuesday, January 03, 2017

Multicasts and Latency

The next few entries in the series will consider some aspects of multicast protocols.

As described in [Shrivastava 90a], the latency of a multicast service is defined to be the time taken for a message, once sent, to reach the destination processes. This latency is particularly important for protocols providing reliability and ordering guarantees. As we shall see, whereas the latency for an unreliable multicast service is bounded (typically of the order of a few milliseconds), the latency for a multicast service which operates in the presence of failures (message losses and node crashes) can be bounded or unbounded depending upon the implementation.

Existing order preserving protocols can be broadly classified in the following way:
  • message history based: the main idea behind such protocols is that when a process sends a message it appends some historical information about the messages it has received in its recent past. This historical information enables the receivers to retrieve any missing messages and to order them properly. This type of protocol ensures that an incomplete multicast is eventually completed, and hence possesses an unbounded latency property.
  • centralised distributors: here the sender delivers the message to a specific member of the group (the primary) who is then responsible for distributing the message to the fellow members of the group. The primary can assign a total order to the messages it receives. As we have already seen, failure detection mechanisms are necessary to detect failed primaries and to elect new primaries which can take over and complete the multicasts. Such protocols can possess bounded latency, but the necessity to detect asynchronously occurring failures can impose an overhead on performance.
  • multi-phase commit: these protocols, providing total order, use multiphase algorithms (typically 2 or 3 message rounds) which are similar to the two-phase algorithm described earlier for atomic action commits. The sender delivers the message to the destinations which return sufficient information to the sender about the messages that they have received so that in the subsequent rounds of the protocol the sender can impose an identical order of the message onto the destinations. The message is only considered to have been delivered if all of the phases of the protocol complete. Such protocols provide bounded latency.
  • clock-based: these protocols are an important class of the multi-phase algorithms, and assume the existence of a global time base. Timestamps derived from such a time base can then be used for imposing a total order on messages. Such protocols can provide constant latency communication, having the attractive property that if a sender multicasts a message at clock time T, then it can be sure that all functioning receivers will have received the message by clock time T + ∆, where ∆ is the constant indicating the protocol latency (∆ must be determined by applying worst case timing and failure assumptions).
We shall next examine a system which provides a reliable multicast protocol using a message history based approach.

Monday, January 02, 2017

A Java EE Interlude

I want to take a quick break from the other series for a moment to do something I've been remiss about for a while: addressing a recent report by Gartner on the state of Java EE. Before doing so I decided the other day to read the article my friend and colleague John Clingan made a few weeks ago and I realised that John had done a great job already. In fact so good I couldn't see myself adding much to it, but I'll try below. Note, there are other responses to this report and I haven't had a chance to read them but they might be just as good.

To start with, it's worth noting that I've known Anne for a number of years and we've had one or two disagreements during that time but also some agreements. Her announcements about the death of SOA seven years or so ago falls somewhere in between, for example. It's also important to realise, if you don't already, that I do believe everything has a natural cycle ("the circle of life") and nothing lasts forever; whether it's dinosaurs giving way to mammals (with some help from a meteor and/or the Deccan Traps), or CORBA shifting aside for J2EE, evolution is a fact of life. Therefore, whilst I disagree with Anne about the short-to-medium term future of Java EE, longer term it will pass into history. However, before doing so it will evolve and continue to influence the next generation of technologies, just as the dinosaurs became the birds and aspects of CORBA evolved into J2EE. For more details on my thinking on this topic over the years I leave it as an exercise to the reader to check this blog or my JBoss blog.

John covers my next point very well but it's worth stressing: I've mentioned on many occasions that my background is heavily scientific and I continue to try to employ objective scientific method to everything I do. So when I see blanket statements like "fade in relevance" or "lighter-weight approaches", whether it's in a Gartner report, a paper I'm reviewing as a PC member, or even something one of my team is writing, I immediately push back if there's no supporting evidence. And I read this report several times searching for it but kept coming up empty handed. It was one subjective statement after another, with no real attempt to justify them. That's disappointing because with my scientific hat on I love to see facts backing up theories, especially if those theories contradict my own. That's how we learn.

One other thing I did get from the report, and I'm prepared to admit this may be my own subjective reading, was that it seemed like a vague attempt to look into the future for middleware frameworks and stacks without really understanding what has brought us to where we are today. I've written about this many times before but weak assertions by the report that "modern applications" are somehow so different from those we've developed for the last 50 years that existing approaches such as Java EE are not useful really doesn't make any sense if you bother to think about it objectively. Distributed applications, services and components need to communicate with each other, which means some form of messaging (hopefully reliable, scalable and something which performs); there's really no such thing as a stateless application, so you'll need to save data somewhere (again, reliable, scalable and performant); hey, maybe you also want some consistency between applications or copies of data, so transactions of one form or another might be of help. And the list goes on.

Of course application requirements change over time (e.g., I recall doing research back in the 1980's where scale was measured in the 10's of machines), but new applications and architectures don't suddenly spring into being and throw away all previous requirements; it's evolution too though. I've presented my thoughts on this over the past couple of years at various conferences. In some ways you can consider Java EE a convenient packaging mechanism for these core services, which are typically deployed in a co-located (single process) manner. Yet if you look beyond the Java EE "veneer" you can still see the influence of enterprise distributed systems that predate it and also the similarities with where some next generation efforts are heading.

I suppose another key point which John also made well was that the report fails miserably to differentiate between Java EE as a standard and the various implementations. I'm not going to cover the entire spectrum of problems I have with this particular failure, but of course lightweight is one of them. Well over 5 years ago I wrote about the things we'd done years previously to make AS7 run on a plug computer or Android phone and we've kept up that level of innovation up throughout the intervening time. There's nothing in the Java EE standard which says you have to build a bloated, heavyweight application server and we, as well as other implementations, have proven that time and time again. It's disappointing that none of this is reflected in the report. I suppose it's one of those terms the authors hope that if they say it enough, people will attribute their own subjective opinions and assumptions to it.

Another one of those throw-away statements the report is good at is that "[...] at this point, Java developers eschew Java EE". I admit not every Java developer wants to use Java EE; not every Java developer wanted to use J2EE either. But you know what? Many Java developers do like Java EE and do want to use it. You only have to go to conferences like JavaOne, Devoxx or other popular Java events to find them in abundance. Or come and talk to some of our customers or those of IBM, Payara, Tomitribe or other members of the MicroProfile effort.

I could go on and on with this but the entry has already grown larger than I expected. John did a great job with his response and you should go read that for an objective analysis. Probably the only positive thing I could attribute to the original Gartner report is that its very existence probably proves that time travel is possible! It's a theory which fits the facts: a report which criticises something based on data which is largely a decade old means it was probably written a decade ago.

Sunday, January 01, 2017

Remote Object Invocation

The next in our series ...

Invocations on objects which are not replicated are traditionally based on the RPC as this retains the correct semantics of a procedure call i.e., a single flow (thread) of control from caller to callee and back again (as with a traditional procedure call). The previous entry described the concept of the Remote Procedure Call, and the simplified model of client-server interaction shown in the figure below will be assumed for the discussion to follow: a client uses the primitives sendjequest() for sending a call request and receive_result() for receiving the corresponding results. Clients and servers maintain enough state information to recognize and discard duplicate messages (filter requests). The server maintains a queue of messages from possibly multiple clients, and uses the primitive receive_re quest() to pick a message from the queue in a fifo order. After invoking the right method, the result is sent to the client with the send_result() primitive.
When making replicated invocations (such as when calling a replica group) the semantics of such communication differ considerably from that of the traditional RPC: there is no longer a single thread of control, but rather multiple threads which may all eventually return to the caller. Such invocations are typically referred to as Replicated Procedure Calls [Cooper 84a][Cooper 85], and can be implemented using one—to—many (or multicast) communication facilities. We discuss various aspects of multicast communication below.

One-to-Many Communication

The main services a multicast protocol provides can be categorised into three classes: ordering, reliability and latency. By imposing (increasing) ordering and reliability constraints on the delivery of multicast messages it is possible to define increasingly sophisticated protocols (typically at the expense of the latency). To understand these protocols first assume that a sender S is attempting to multicast to a group G = {P1,...,Pn}. Following the definitions outlined in [Shrivastava 90b][ANSA 90]:

Unordered and Unreliable

A multicast from S will be received by a subset of functioning nodes Pi ∈ G. Successive multicasts from S will be received in an arbitrary order at the destinations. The next figure shows sender S multicasting messages m1 and m2 to the group G. The first message is received by P2 and Pn in different orders, and message m2 is not received by P1

FIFO Multicast

Provided the sender does not crash while transmitting the message, all correctly functioning receivers are guaranteed to get the message. Furthermore, the multicasts will be received in the order they were made.

The next figure shows two senders (S 1 and S2) multicasting to the group G. All members of G received m1 before m2, but some members may receive m3 before m2. This last ordering is correct given the definition of the protocol: no information about the relative ordering of multicasts between senders is available to the receivers.

Atomic multicast

If the sender does not crash before completing a multicast, the message is guaranteed to be received by all functioning members. If, however, the sender crashes during a multicast, then it is guaranteed that the message is received by either all or none of the functioning processes (atomic deliveiy). All multicasts from the same sender are received in the order they were made.

Causal multicast

This multicast extends the ordering property of the Atomic multicast to causally related sends from different senders while still meeting the reliability guarantee. [Lamport 78] was the first to introduce the concept of potential causal relationships into computer interactions and showed what effects these relationships can have on the operations of interacting processes. Two events are potentially causally related if information from the first event could have reached the second event before it occurred. The notation used to denote such relationships is typically X → Y, where → means precedes (happened before). Note that if X and Y are events from the same process and Y follows X then Y is necessarily causally related to X. A causal communication system will only preserve an ordering of events if the order is causally related. If two events are not related in this way then there is no guarantee on the delivery order.
In the figure above S1 is multicasting the groups G 1and G2, P1is multicasting to group G1. G1={P2, P3} and G2={P1, P4}.Thereisapotentialflowofinformationfromsend(m1,Gi) to send(m2,G2), and from send(m 2,0 2) to send(m3,G1). This means that the sending of m3 by Pi is potentially causally related to the sending of m1 by S1. Hence the causal multicast protocol must ensure that all functioning members of G 1receive m1 before m3. Events such as m3 and m4 which are not causally related can be received in any order (they are termed concurrent).

Totally ordered multicast

The (partial) causal order can be extended to a total order of messages such that all messages (whether causally related or not) are received by all destinations in the same order (which must also preserve causality).