Sunday, June 15, 2008

What's the point?

I've had the pleasure of working with some very smart people over the years in the area of fault-tolerant distributed systems. As a result I've done research and development on a number of different techniques, including replication (for high availability) and transactions (for consistency). In all that time I've been conscious that a lot of time and effort has gone into proving that whatever was done worked in the case of failures (whatever the specific definition may be for the particular environment): after all, that's the point of the whole exercise. Yes, I know that failures don't happen that often (try selling a transaction manager to people who haven't used one for years and explaining why they really, really need to buy one!). But they do happen, and that's why fault tolerance techniques (and testing that they work in the presence of failures) are so important.

Now why do I bother mentioning this? Well, it's come to my attention over the past few years that some purveyors of fault tolerance solutions either don't bother to test the "edge cases" (which are not really edge cases, but the reason for their product's existence) or don't care (and hence publicize) that their solutions won't work in the case of some (all?) possible failure modes. I'm not going to name-and-shame them (primarily because I haven't been able to confirm those reports myself), but if you are a user of something that purports to offer high availability or data consistency in the presence of failures, you really need to check what that vendor means and how they go about confirming that their product works as they say it should.

2 comments:

Michael Sick said...

Mark,

Is there a set of decent pre-canned questions that could be used to evaluate the quality of fault-tolerance and availability? It would be quite helpful when doing a vendor evaluation.

Mike

Mark Little said...

Good question, Mike. We added some to our transaction book, but I wouldn't want to force you to buy that ;-) I'll see about a future posting.