Sunday, August 21, 2011

Fault tolerance

There was a time when people in our industry were very careful about using terms such as fault tolerance, transactions and high availability, to name just three. Back before the Internet really kicked off (really when the web came along), if you were emailing someone then they tended to either be in academia and in which case they'd be summarily shot for misusing a term, or they'd be in the DoD and in which case they'd probably be shot too! If you were publishing papers or your thoughts for wider review, you tended to have to wait for a year to see publication and that was if reviewers didn't shoot you down for misusing terms, and in which case you had to start all over again. So it paid to think long and hard before you did the equivalent of hitting submit.

Today we live in a world of instant publishing and less and less peer review. It's also unfortunate that despite the fact more and more papers, article and journals are online, it seems that less and less people are spending the time to research things and read up on state of the art, even if that art was produced decades earlier. I'm not sure if this is because people simply don't have time, simply don't care, don't understand what others have written, or something else entirely.

You might ask what it is that has prompted me to write this entry? Well on this particular occasion it's people using the term 'fault tolerance' in places where it may be accurate when considering the meaning of the words in the English language, but not when looking at the scientific meaning, which is often very different. For instance, let's look at one scientific definition of the term (software) 'fault tolerance'.

"Fault tolerance is intended to preserve the delivery of correct service in the presence of active faults. It is generally implemented by error detection and subsequent system recovery.
Error detection originates an error signal or message within the system. An error that is present but not detected is a latent error. There exist two classes of error detection techniques: (a) concurrent error detection, which takes place during service delivery; and (b) preemptive error detection, which takes place while service delivery is suspended; it checks the system for latent errors and dormant faults.
Recovery transforms a system state that contains one or more errors and (possibly) faults into a state without detected errors and faults that can be activated again. Recovery consists of error handling and fault handling. Error handling eliminates errors from the system state. It may take two forms: (a) rollback, where the state transformation consists of returning the system back to a saved state that existed prior to error detection; that saved state is a checkpoint, (b) rollforward, where the state without detected errors is a new state."

There's a lot in this relatively simple definition. For a start, it's clear that recovery is an inherent part, and that includes error handling as well as fault handling, neither of which are trivial to accomplish, especially when you are dealing with state. Even error detection can seem easy to solve if you don't understand the concepts. Over the past 4+ decades all of this and more has driven the development of protocols behind transaction processing, failure suspectors, strong and weak replication protocols, etc.

So it's both annoying and frustrating to see people talking about fault tolerance as if it's as easy to accomplish as, say, throwing a few extra servers at the problem or restarting a process if it fails. Annoying in that there are sufficient freely available texts out there to cover all of the details. Frustrating in that the users of implementations based on these assumptions are not aware of the problems that will occur when failures happen. As with those situations I've come across over the years where people don't believe they need transactions, the fact that failures are not frequent tends to lull you into a false sense of security!

Now before anyone suggests that this is me being a luddite, I should point out that I'm a scientist and I recognise fully that theories and practices in many areas of science, e.g., physics, are developed based on observations and can change when they prove to not be sufficient to describe the things you see. So for instance, unlike those who in Galileo's time continued to believe the Earth was the centre of the Universe despite a lot of data to the contrary, I accept that theories, rules and laws laid down decades ago may have to be changed today. The problem I have in this case though, is that nothing I have seen or heard in the area of 'fault tolerance' gives me an indication that this is the situation currently!

No comments: