Sunday, June 26, 2011

When is a distributed system not a distributed system?

I've been involved with distributed systems since joining the Arjuna Project back in the mid 1980s. But distributed systems date back way before then, to at least the 1970s with the advent of the first RPC. There are a few definitions of what constitutes a distributed system, including Tanenbaum's "A collection of independent computers that appears to its users as a single coherent system" and this one: "A distributed system consists of a collection of autonomous computers, connected through a network and distribution middleware, which enables computers to coordinate their activities and to share the resources of the system, so that users perceive the system as a single, integrated computing facility", though my favourite is still Lamport's "You know you have a distributed system when the crash of a computer you’ve never heard of stops you from getting any work done". Back when I started in the area, someone at the University once said that a precise definition was difficult, but that you'd recognise a distributed system when you saw it.

What all of the definitions have in common is the notion that a distributed system consists of nodes (machines) connected by some communication fabric, e.g., an Ethernet. Distributed systems therefore pose problems that are not present in centralised, single-machine systems, such as independent failures, and hence the need for failure detection. Various techniques, such as distributed transactions and replication, have grown up to help deal with some of these issues. Other approaches, such as Voltan, help to build fail-silent processes. And of course techniques such as message passing, RPC and shared tuple spaces were developed to make building distributed applications easier. Though of course we learned that complete distribution transparency is not always a good idea. (We did a lot of the early work in this area.)
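To make that last point concrete, here's a minimal sketch (in Go, my choice rather than anything from the original post) of why a remote call can't safely masquerade as a local one. The callRemote function, the simulated delay and the timeout value are all illustrative assumptions; the point is simply that the caller has to bound the wait and handle independent failure explicitly, which an ordinary local call never forces on you.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// callRemote stands in for an RPC to another node. Unlike a local
// call, the remote side can fail independently or never reply at
// all, so the caller must bound the wait explicitly.
func callRemote(req string) (string, error) {
	reply := make(chan string, 1)
	go func() {
		// Simulate a slow or partitioned server.
		time.Sleep(2 * time.Second)
		reply <- "reply to " + req
	}()
	select {
	case r := <-reply:
		return r, nil
	case <-time.After(500 * time.Millisecond):
		// The failure leaks through the abstraction: the caller
		// cannot pretend this was an ordinary local invocation.
		return "", errors.New("remote call timed out")
	}
}

func main() {
	if _, err := callRemote("get balance"); err != nil {
		fmt.Println("must handle:", err)
	}
}
```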

However, time has moved on and whilst distributed systems continue to be important in our everyday lives, the fact is that many of the problems they present, and many of the solutions we have developed, now exist within a local environment too. Think of this as inner space versus outer space, if you like: the way in which multi-core machines have become the norm means that we have failure-independent nodes (cores this time) within a single machine. Alright, they're not connected by an Ethernet, but there's a bus involved and they may, or may not, have shared memory too.
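As a rough illustration of that duality, here is a hedged sketch, again in Go and entirely my own construction rather than anything from the post: the same message-passing style we use between machines, applied between workers spread across the cores of a single machine. The worker function and channel names are assumptions for illustration only.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// Each worker behaves like a node: it owns its own state and talks
// to the others only via messages, even though here the "network"
// is just memory inside one machine.
func worker(id int, in <-chan int, out chan<- string, wg *sync.WaitGroup) {
	defer wg.Done()
	for n := range in {
		out <- fmt.Sprintf("worker %d handled %d", id, n)
	}
}

func main() {
	in := make(chan int)
	out := make(chan string)
	var wg sync.WaitGroup

	// One worker per core: miniature "nodes" within the machine.
	for i := 0; i < runtime.NumCPU(); i++ {
		wg.Add(1)
		go worker(i, in, out, &wg)
	}
	go func() {
		for i := 0; i < 4; i++ {
			in <- i
		}
		close(in)
		wg.Wait()
		close(out)
	}()
	for msg := range out {
		fmt.Println(msg)
	}
}
```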

Of course none of this is new by any means. If you've ever experienced any of the parallel computing technologies that were all the rage back in the 1980s and 1990s, such as the Transputer (an excellent architecture, and Occam was brilliant!), then you'll understand. There's always been a duality between parallel and distributed systems. But back then parallel computing was even rarer than distributed computing, simply because the costs of the former were prohibitive (which is why a lot of parallel computing research was done by running COTS hardware on a fast network, because it was cheaper!).

Times have certainly changed and we can no longer consign parallel computing to the realm of high performance computing or niche areas. It's mainstream and its importance is only going to increase. And I believe this means distributed systems research and parallel computing efforts must converge. Many of the problems posed by the two overlap, and solutions for one may well be relevant to the other. For instance, we've put a lot of effort into scalability in distributed systems, while failure detection and independence at the hardware (core) level is very advanced.

So when is a distributed system not a distributed system? When it's a centralised multi-core system!
