Saturday, May 21, 2016

Fault tolerance and microservices

A while ago I wrote about microservices and the unit of failure. At the heart of that was a premise that failures happen (yes, I know, it's a surprise!) and in some ways distributed systems are defined by their ability to tolerate such failures. From the moment our industry decided to venture into the area of distributed computing there has been a need to tackle the issue of what to do when failures happen. At some point I'll complete the presentation I've been working on for a while on the topic, but suffice it to say that various approaches including transactions and replication have been utilised over the years to enable systems to continue to operate in the presence of (a finite number of) failures. One aspect of the move towards more centralised (monolithic?) systems that is often overlooked, if it is even acknowledged in the first place, is the much more simplified failure model: with correct architectural consideration, related services or components fail as a unit, removing some of the "what if?" scenarios we'd have to consider otherwise.

But what more has this got to do with microservices? Hopefully that's obvious: with any service-oriented approach to software development we are inherently moving further into a distributed system. We often hear about the added complexity that comes with microservices that is offset by the flexibility and agility they bring. When people discuss complexity they tend to focus on the obvious: the more component services that you have within your application the more difficult it can be to manage and evolve, without appropriate changes to the development culture. However, the distributed nature of microservices is fundamental and therefore so too is the fact that the failure models will be inherently more complex and must be considered from the start and not as some afterthought.

No comments: