Thursday, August 30, 2007

Heuristics, one-phase commit and compensations

It's a little-known fact that, as well as being the world's first Web Services transactions product, HP-WST also had some pretty neat non-Web Services capabilities that we're only now starting to revisit. I've been in the process of writing a paper on one of them for what seems like an age, so I decided to give a brief outline here. But first, a little background to put the rest into context.

One of the nice things we did with HP-WST from the start was keep the Web Services aspects separate from the core transaction engine. This is something we continued with XTS (now the Web Services transactions component of JBossTS). At the time the reason was that Jim and I needed to make parallel progress, with him concentrating on the SOAP stack (and doing some great work with the HP Web Services team at that time) and me on the protocol engine. Another reason for the separation was to try to make debugging of problems a little easier. One of the things you'll know if you've either developed a distributed system or used one is that distributed debugging can be a PITA. It was bad enough with CORBA, but Web Services take it to another level. So we had this nice clean separation that meant you could actually configure the system (dynamically) to appear to be running the whole Web Services stack when in fact it wasn't going anywhere near the network. If you knew what you were doing (read: undocumented feature) you could configure this "loop-back" to happen either before or after the SOAP messages were created.
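To make that separation a bit more concrete, here is a minimal sketch in Java of a protocol engine that only ever talks to a transport interface, so an in-process "loop-back" can be swapped in at runtime. Every name in it (`Transport`, `LoopbackTransport`, `ProtocolEngine`) is made up for illustration; it is not the HP-WST or XTS code, and the real loop-back was certainly more involved.

```java
import java.util.function.Function;

// Illustrative sketch only: these types are assumptions, not HP-WST or XTS classes.
final class TransportSketch {
    /** Carries a request to a participant and returns its reply. */
    interface Transport {
        String send(String request);
    }

    /** Hands the request straight to an in-process handler; no network is involved. */
    static final class LoopbackTransport implements Transport {
        private final Function<String, String> handler;

        LoopbackTransport(Function<String, String> handler) {
            this.handler = handler;
        }

        @Override
        public String send(String request) {
            return handler.apply(request);
        }
    }

    /** The protocol engine depends only on the Transport interface. */
    static final class ProtocolEngine {
        private volatile Transport transport;

        ProtocolEngine(Transport transport) {
            this.transport = transport;
        }

        /** Swaps the transport while the engine is running. */
        void useTransport(Transport transport) {
            this.transport = transport;
        }

        /** Sends a prepare request for the named participant and returns the reply. */
        String prepare(String participant) {
            return transport.send("prepare:" + participant);
        }
    }
}
```

A SOAP-based implementation of the same interface would be the other pluggable option; the engine itself would not need to change.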

Now an important part of HP-WST was the compensation transaction model it supported. This was based on BTP at the time, but the idea still translates to WS-TX: instead of doing the work in the scope of a single transaction that holds on to locks and other resources for a potentially long time, you do the work in a series of smaller transactions that can each be compensated by some other transaction later. The coordinator (Atom or Cohesion in the case of BTP) remembers the list of participants and drives recovery in the event of failures, so even if your application crashes everything should be resolved.
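As a rough illustration of the model, the following Java sketch runs a series of short steps, remembers each one that completes, and compensates them in reverse order if a later step fails. The names (`CompensatingCoordinator`, `Step`, `bookTrip`) and the travel-booking scenario are invented for the example and are not part of BTP, WS-TX or HP-WST. The sketch also keeps its list in memory, whereas a real coordinator would log it durably so that it can drive recovery after a crash.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Illustrative sketch only: hypothetical names, in-memory state, no durable log.
final class CompensatingCoordinator {
    /** One short unit of work plus the action that undoes it later. */
    interface Step {
        void run() throws Exception; // does its work and commits on its own
        void compensate();           // semantically undoes that work
    }

    // Steps that have completed, most recent first.
    private final Deque<Step> completed = new ArrayDeque<>();

    /** Runs a step and remembers it so that it can be compensated later. */
    void run(Step step) throws Exception {
        step.run();
        completed.push(step);
    }

    /** Compensates every completed step, most recent first. */
    void cancel() {
        while (!completed.isEmpty()) {
            completed.pop().compensate();
        }
    }

    /** Accepts the work: nothing is left to compensate. */
    void confirm() {
        completed.clear();
    }
}

final class CompensationDemo {
    /** Books a flight and then a hotel; returns a trace of what happened. */
    static List<String> bookTrip(boolean hotelAvailable) {
        List<String> trace = new ArrayList<>();
        CompensatingCoordinator coordinator = new CompensatingCoordinator();
        try {
            coordinator.run(step("flight", trace, true));
            coordinator.run(step("hotel", trace, hotelAvailable));
            coordinator.confirm();
        } catch (Exception e) {
            coordinator.cancel(); // undo whatever had already completed
        }
        return trace;
    }

    private static CompensatingCoordinator.Step step(String name, List<String> trace, boolean succeeds) {
        return new CompensatingCoordinator.Step() {
            @Override
            public void run() throws Exception {
                if (!succeeds) {
                    throw new Exception(name + " failed");
                }
                trace.add("book " + name);
            }

            @Override
            public void compensate() {
                trace.add("cancel " + name);
            }
        };
    }
}
```

Calling `CompensationDemo.bookTrip(false)` books the flight, fails on the hotel, and then cancels the flight, without any lock having been held across the two steps.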

Because of the "local transport" aspect of HP-WST, people were able to write compensations for local applications, completely ignoring the Web Services stack. Some lighthouse customers found that an interesting prospect. In particular, when I was giving a presentation to one group in Madrid, we got on to something I'd been prototyping that offered a nice solution to the old problems of heuristics (how do I resolve a non-atomic transaction?) and having multiple one-phase commit participants in the same transaction (how do I resolve a non-atomic transaction?).

In both of these problem scenarios what typically happens is that someone (e.g., a system administrator) has to get to grips with the inconsistent data and figure out what was going on in the rest of the application in order to try to impose consistency. One of the important reasons this can't really happen automatically (at the TM level) is that it requires semantic information about the application that simply isn't available to the transaction system. So people end up compensating manually.

Until then. What we were proposing was allowing developers to register compensation transactions with the coordinator that would be triggered upon certain events, such as heuristic outcomes or one-phase errors. And to do it opaquely as far as the application developer was concerned. Because these compensations are part of the transaction, they'd get logged so that they would be available during recovery. Plus, a developer could also define whether presumed abort, presumed commit or presumed nothing was the best approach for the individual transaction to use (it has an effect on recovery and failure scenarios).
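To give a flavour of the idea, here is a small Java sketch of registering compensations against particular events and firing them when the event occurs. The `Event` values, the `register` and `fire` methods and the in-memory list standing in for the transaction log are all assumptions made for the example; this is not the interface we prototyped, and a real implementation would write the registrations to the durable log used by recovery.

```java
import java.util.ArrayList;
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only: hypothetical names, not the HP-WST or JBossTS API.
final class EventCompensations {
    /** Events that can trigger a compensation (assumed set, for illustration). */
    enum Event { HEURISTIC_MIXED, HEURISTIC_HAZARD, ONE_PHASE_ERROR }

    /** Application-supplied action that tries to repair the outcome. */
    interface Compensation {
        void compensate(Event event);
    }

    private final Map<Event, List<Compensation>> handlers = new EnumMap<>(Event.class);

    // Stand-in for the durable transaction log that recovery would read.
    private final List<String> log = new ArrayList<>();

    /** Registers a compensation for an event and records the registration. */
    void register(Event event, String name, Compensation compensation) {
        handlers.computeIfAbsent(event, e -> new ArrayList<>()).add(compensation);
        log.add("registered " + name + " for " + event);
    }

    /** Runs the compensations registered for the event; returns how many ran. */
    int fire(Event event) {
        List<Compensation> registered = handlers.getOrDefault(event, List.of());
        for (Compensation compensation : registered) {
            compensation.compensate(event);
        }
        return registered.size();
    }

    /** Read-only view of what has been logged so far. */
    List<String> log() {
        return List.copyOf(log);
    }
}
```

The coordinator would call something like `fire(Event.ONE_PHASE_ERROR)` at the point where it detects the problem, so the application code that did the original work never sees the compensation machinery.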

Nothing really earth-shattering. We'd been offering this kind of thing for a long time through nested top-level transactions, for example. But HP-WST pushed it into a wider arena. With this approach you could write your compensations to try to undo the commit of the one-phase resource, for example, or if it can't be undone then write sufficient information to help the administrator resolve it. Likewise if triggered by a specific heuristic: try to compensate directly at the time the error occurs. Obviously nothing is ever guaranteed, but sometimes being able to try to compensate at the moment the problem happens can save you time and money later.

Now where this becomes more interesting is when you consider annotations. Back in 2000 they didn't exist and we were playing with raw XML or explicit declarative approaches (the latter was a problem because we wanted to be able to apply this to existing deployments without requiring them to be re-coded). But annotations, and the work that Maciej has been doing, mean that revisiting this could result in something more powerful and certainly more opaque.

And on that note, back to work (and maybe the paper). Hopefully this has been enough to whet your appetite.

2 comments:

Anonymous said...

I'm wondering if all this stuff you're talking about with compensations and heuristic recovery is really starting to bleed into the process management side of BPM. Maybe all these complex scenarios should be modeled instead of coordinated? If you get my drift? Blog on this coming...

Mark Little said...

In some (maybe even many) cases, definitely. "Do this, if it fails do the other, if that's successful do something else ..." is workflow/BPM without a doubt. However, in certain situations like the ones I described, it can be easier, more efficient and more reliable for the TM to try to do the compensation there and then. I've never believed in "one-size fits all solutions" so I'm not considering something to fit all possible use cases. But then the same goes for workflow/BPM. Plus, as we've discussed before on the subject, at some point you need a reliable coordinator to do this work anyway: whether that's embedded in a TM or embedded in a client or workflow or BPM engine doesn't get round that fact.