Sunday, September 11, 2011
It's that time of year again, but of course this time it's 10 years on. Time to reminisce and remember those who weren't so lucky. I've been thinking about this day for a while and wondering about all of the things I managed to do in the last decade that I wouldn't have been able to if I'd made a slightly different choice back then. They include many of the things I mentioned previously, such as HP, Arjuna Technologies, JBoss, Red Hat, standards involvement and so on.
But they all pale into insignificance when I look at my 9 year old son! And then there's nothing more I can really say except thanks.
Sunday, September 04, 2011
The impact of Arjuna
I've mentioned before that I have the privilege of speaking at Santosh's retirement ceremony. I've also said on several occasions how much Santosh and the Arjuna project have influenced my life over the years. So I decided to speak about the transition of Arjuna from a research project that was originally just the vehicle for several of us to get our PhDs, through to today when it's at the heart of the most downloaded application server in the world.
Fairly obviously I have lived through the transition over the past 25 years. And despite having parted ways with my company in 2005, I've been able to continue to work with them, as well as shepherding the transaction system through JBoss and Red Hat. However, it wasn't until I started to write my presentation that everything we've done over the years came back to me. (I suppose that being so close to things sometimes makes you forget.) I found it really hard to cram 25 years into a 60 minute session, so many things had to be left out or confined to a single bullet. For a start, when Arjuna was still a research effort it managed to help at least a dozen of us get PhDs, was the basis for over 50 papers and technical reports, and influenced distributed systems research and companies from IBM to Sun Microsystems.
But it's when you look beyond the research that the real impact becomes apparent. For a start, in 1994 we used it to implement a distributed student registration system that is still not matched by the one now provided by a certain large business management software purveyor. In 1995 the OTS was being developed and it was already influenced by Arjuna, since Graeme was now at Transarc. It wasn't too long before we began to implement an OTS-compliant transaction system using Arjuna, and this was my first dealings with standards. We also got involved with IBM, Alcatel and others in defining standards for extended transactions through the CORBA Activity Service (which would later be the basis for the various Web Services transactions efforts). At about the same time Stuart was driving the workflow submission with Nortel and working on OpenFlow.
Then in 1996 Sun released Java, which had started life as Oak. We all started to use it in a number of areas, including games, a browser (a great way to learn HTML) and a web server. I looked at end-to-end transactions and then decided that an even better way to learn the language would be to implement Arjuna in Java. Over two weeks at Christmas 1996 JavaArjuna was born (later to become JTSArjuna when I ported the OTS). This was before J2EE, before JTA and before JTS. So not only was this the world's first 100% pure Java JTS, it was the world's first 100% pure Java transaction service.
It was around then that we created a company to market the Java and C++ implementations. We were acquired by Bluestone, which was subsequently acquired by HP, and Arjuna went into their product suites to compete against BEA and IBM (there was no sign of Oracle middleware in those days!) While our time at HP was limited, we still managed to work on two Web Services transactions standards efforts as well as produce the world's first such product. We also branched out into high-performance messaging and building an ORB.
When HP decided it couldn't make a go of software, we created another startup to concentrate on transactions and messaging. We had several successful years, making sales to the likes of TIBCO and WebMethods, creating two new Web Services standards committees in OASIS and finalising two specifications (BTP and WS-TX). We also found a market by replacing the transactions and messaging components in JBoss 3 with our own. And within all this, there was still time to write many papers, give many presentations and notch up more world firsts, such as XTS.
As I said earlier, in 2005 we sold transactions to JBoss and I bid farewell to Arjuna the company, though obviously Arjuna the technology stayed pretty close! Over the intervening 6 years and an acquisition by Red Hat, we've seen Arjuna (aka JBossTS) incorporated into every version of AS as well as all of our platforms and many projects, even if they're not written in Java. The teams have branched out into REST as well as Blacktie, to offer XATMI support. There's also work on software transactional memory using JBossTS and now, with the move of Red Hat into the cloud, it's available in OpenShift and beyond.
Even this blog is way too short to cover everything that has happened on this 25 year long journey. I haven't been able to cover other aspects such as OpenFlow and messaging, or the impact of the people who have passed through the Arjuna project and Arjuna companies. I've also only hinted at how all of the research we did at the University or in industry has influenced others over the years. I think in order to really do the Arjuna story justice I need to write a book!
Monday, August 29, 2011
Enterprise middleware and PaaS
I wanted to say more about why existing enterprise middleware stacks can be (should be) the basis for realistic PaaS implementations. If I get time, I may write a paper and submit it to a journal or conference but until then, this will have to do. I'm talking about this at JavaOne this year too, so a presentation may well come out soon.
Sunday, August 21, 2011
Fault tolerance
There was a time when people in our industry were very careful about using terms such as fault tolerance, transactions and high availability, to name just three. Back before the Internet really kicked off (really when the web came along), if you were emailing someone then they tended either to be in academia, in which case they'd be summarily shot for misusing a term, or in the DoD, in which case they'd probably be shot too! If you were publishing papers or your thoughts for wider review, you tended to have to wait a year to see publication, and that was if the reviewers didn't shoot you down for misusing terms, in which case you had to start all over again. So it paid to think long and hard before you did the equivalent of hitting submit.
Today we live in a world of instant publishing and less and less peer review. It's also unfortunate that despite the fact that more and more papers, articles and journals are online, it seems that fewer and fewer people are spending the time to research things and read up on the state of the art, even if that art was produced decades earlier. I'm not sure if this is because people simply don't have time, simply don't care, don't understand what others have written, or something else entirely.
You might ask what has prompted me to write this entry. Well, on this particular occasion it's people using the term 'fault tolerance' in places where it may be accurate when considering the meaning of the words in the English language, but not when looking at the scientific meaning, which is often very different. For instance, let's look at one scientific definition of the term (software) 'fault tolerance'.
"Fault tolerance is intended to preserve the delivery of correct service in the presence of active faults. It is generally implemented by error detection and subsequent system recovery.
Error detection originates an error signal or message within the system. An error that is present but not detected is a latent error. There exist two classes of error detection techniques: (a) concurrent error detection, which takes place during service delivery; and (b) preemptive error detection, which takes place while service delivery is suspended; it checks the system for latent errors and dormant faults.
Recovery transforms a system state that contains one or more errors and (possibly) faults into a state without detected errors and faults that can be activated again. Recovery consists of error handling and fault handling. Error handling eliminates errors from the system state. It may take two forms: (a) rollback, where the state transformation consists of returning the system back to a saved state that existed prior to error detection; that saved state is a checkpoint, (b) rollforward, where the state without detected errors is a new state."
There's a lot in this relatively simple definition. For a start, it's clear that recovery is an inherent part, and that includes error handling as well as fault handling, neither of which are trivial to accomplish, especially when you are dealing with state. Even error detection can seem easy to solve if you don't understand the concepts. Over the past 4+ decades all of this and more has driven the development of protocols behind transaction processing, failure suspectors, strong and weak replication protocols, etc.
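To make the rollback part of that definition slightly more concrete, here's a deliberately simplistic sketch in Java (the class and its methods are mine, purely for illustration, and not taken from any real transaction system): an invariant check plays the role of concurrent error detection, and error handling means restoring a previously saved checkpoint rather than just restarting the process. Everything that makes this hard in practice, such as durable logs, fault handling and concurrent activity, is exactly what's missing from the sketch.

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Deliberately simplistic sketch of rollback-style error handling: state
// changes are guarded by a checkpoint, and a detected error returns the
// component to that checkpoint rather than simply restarting it.
public class CheckpointedState {
    private final Map<String, Long> balances = new HashMap<>();
    private final Deque<Map<String, Long>> checkpoints = new ArrayDeque<>();

    // Take a checkpoint: a saved copy of the state we can roll back to.
    public void checkpoint() {
        checkpoints.push(new HashMap<>(balances));
    }

    // A state change that may later have to be undone.
    public void credit(String account, long amount) {
        balances.merge(account, amount, Long::sum);
    }

    // Concurrent error detection: a simple invariant check during service.
    public boolean errorDetected() {
        return balances.values().stream().anyMatch(b -> b < 0);
    }

    // Error handling by rollback: restore the saved state that existed
    // prior to error detection.
    public void rollback() {
        if (!checkpoints.isEmpty()) {
            balances.clear();
            balances.putAll(checkpoints.pop());
        }
    }

    public long balance(String account) {
        return balances.getOrDefault(account, 0L);
    }

    public static void main(String[] args) {
        CheckpointedState state = new CheckpointedState();
        state.checkpoint();
        state.credit("alice", 100);
        state.credit("alice", -250); // introduces an error: a negative balance
        if (state.errorDetected()) {
            state.rollback();        // error handling, not a blind restart
        }
        System.out.println(state.balance("alice")); // 0: back at the checkpoint
    }
}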
So it's both annoying and frustrating to see people talking about fault tolerance as if it's as easy to accomplish as, say, throwing a few extra servers at the problem or restarting a process if it fails. Annoying in that there are sufficient freely available texts out there to cover all of the details. Frustrating in that the users of implementations based on these assumptions are not aware of the problems that will occur when failures happen. As with those situations I've come across over the years where people don't believe they need transactions, the fact that failures are not frequent tends to lull you into a false sense of security!
Now before anyone suggests that this is me being a luddite, I should point out that I'm a scientist and I recognise fully that theories and practices in many areas of science, e.g., physics, are developed based on observations and can change when they prove not to be sufficient to describe the things you see. So, for instance, unlike those who in Galileo's time continued to believe the Earth was the centre of the Universe despite a lot of data to the contrary, I accept that theories, rules and laws laid down decades ago may have to be changed today. The problem I have in this case, though, is that nothing I have seen or heard in the area of 'fault tolerance' gives me an indication that this is the situation currently!
Tuesday, August 09, 2011
A thinking engineer
I've worked with some great engineers in my time (and continue to work with many today), and as an aside, I like to think some people might count me in their list. But back to the topic: over the years I've also met some people who would be considered great engineers by others, but whom I wouldn't rate that highly. The reason for this is also one of the factors I always cite when asked what constitutes a great engineer. Of course I rate the usual things, such as the ability to code, understand algorithms, and know a spin-lock from a semaphore. Now maybe it's my background (I really think not, but I threw that out there just in case I'm wrong), but I also add the ability to say no, or to ask why or what if. To me, it doesn't matter whether you're an engineer or an engineering manager: you've got to be confident enough to question the things you're asked to do, unless of course you know them to be right from the start.
As a researcher, you're expected to question the work of others who may have been in the field for decades, published dozens of papers and be recognised experts in their fields. You don't take anything at face value. And I believe that that is also a quality really good engineers need to have too. You can be a kick-ass developer, producing the most efficient bubble-sort implementation available, but if it's a solution to the wrong problem it's really no good to me! I call this The Emperor's New Clothes syndrome: if he's naked then say so; don't just go with the flow because your peers do.
Now as I said, I've had the pleasure to work with many great engineers (and engineering managers) over the years, and this quality, let's call it "thinking and challenging" is common to them all. It's also something I try to foster in the teams that work for me directly or indirectly. And although I've implicitly been talking about software engineering, I suspect the same is true in other disciplines.
True Grit update
A while ago I mentioned that I was reading the novel True Grit and was a fan of the original film, which I watched when I was a child. I also mentioned that I probably wouldn't be watching the remake of the film as I couldn't see how the original could be improved. Well, on the flight back from holiday I had the opportunity to watch it and decided to give it a go.
I've heard a few things about the new film and they can all be summarised as saying that it was a more faithful telling of the story than the John Wayne version. After watching both, and reading the book, I have to wonder if those reviewers knew WTF they were on about! Yes, the new film is good, but it's nowhere near as good as the original. And as for faithfulness to the book? Well, with the exception of the ending, the original film is far closer to the book (often word for word). While watching the remake I kept asking myself time and again why they had changed this or that, or why they had completely rewritten the story in parts!
If you have a great novel that you know works well on screen, why do script writers seem incapable of leaving it untouched? Maybe they decided that they had to make the new film different enough from the original so people wouldn't think it was a scene-for-scene copy. But in that case, why remake it in the first place? FFS people: if an original film is good enough, leave it alone and learn to produce some original content for once! And for those of you interested in seeing the definitive film adaptation of the book, check out the John Wayne version.
Friday, July 29, 2011
Gone fishing!
I'm on holiday in Canada, visiting my in-laws. Usually it takes me a few days to wind down from work, but it happens and I relax for the rest of the holiday. (Well, until a few days before I come back, when I start to think about work again!) Access to email is usually limited, as I'd need to borrow time on my father-in-law's machine. That extra effort is usually enough for me to only check email every few days.
Unfortunately this time I brought my iPad and iPhone, both of which I connected to the wifi. Checking email was too easy and as a result I was working every day! Fortunately it only took me about 4 days to realise this (with some not-so-subtle hints from family) and I disabled wifi. This means I can now get on with the holiday. Since we are out in the middle of nowhere this means sitting by the pool reading a book on my (wifi disabled) iPad, or fishing!
Facebook as Web 3.0?
I'm not on Facebook and think social networking sites are inherently anti-social (you can't beat a good pub!) However, I know many people who are into them and I've even decided to check out Google+. So they probably have a place in the web firmament.
But recently I've started to see more and more adverts replacing the good old vendor URL with a Facebook version, e.g., moving from www.mycompany.com to www.Facebook.com/mycompany. Now at first this might seem fairly innocuous, but when you dig deeper it's anything but! As I think Tim Berners-Lee has stated elsewhere, and I'm sure Google has too, the data that Facebook is maintaining isn't open for a start, making it harder to search outside of their sub-web. And of course this is like a data cloud in some ways: you're offshoring bits of your data to someone else, so you'd better trust them!
I don't want to pick on any single vendor, so let's stop naming at this point. Even if you can look beyond the lack of openness and the fact that you're basically putting a single vendor in charge of this intra-web, what about all of the nice things that we take for granted from HTTP and REST? Things such as caching, intelligent redirects and HATEOAS. Can we be sure that these are implemented and managed correctly on behalf of everyone?
And what's to say that at some point this vendor may decide that Internet protocols are just not good enough or that browsers aren't the right view on to the data? Before you know it we would have a multiverse of Webs, each with their own protocols and UIs. Interactions between them would be difficult if not impossible.
Now of course this is a worst case scenario and I have no idea if any vendors today have plans like this. I'd be surprised if they hadn't been discussed though! So what does this mean for this apparent new attitude to hosting "off the web" and on the "social web"? Well for a start I think that people need to remember that despite how big any one social network may be, there are orders of magnitude more people being "anti-social" and running on the web.
I'm sure that each company that makes the move into social does so on the back of sound marketing research. Unfortunately the people making these decisions aren't necessarily the ones who understand what makes the web work, yet they are precisely the people who need it to work! I really hope that this isn't a slippery slope towards that scenario I outlined. Everyone on the web, both social and anti-social, would lose out in the end! Someone once said that "just because you can do something doesn't mean you should do something."
Thursday, July 21, 2011
The end of a space era
It's sad to see the end of the space shuttle era. I remember being excited to watch the very first launch whilst at school. I remember exactly where I was when Challenger was destroyed: standing in a dinner queue at university. I remember watching when they deployed (and then later fixed) Hubble. Again, I can remember where I was when Columbia was destroyed: at home watching! I've even been to see a launch and heard it come back a week or so later.
So it's fair to say that I grew up with the shuttle over these past 30 years and it's going to be strange not having it around any more. Despite the fact that it may never have been the perfect launch vehicle (I still recall early discussions around HOTOL, for instance), I think it did its job well. I know I'll miss it.
Tuesday, July 19, 2011
Santosh's retirement
I think I've spoken a number of times about how important Professor Shrivastava has been in my academic and professional career over the past 25 years (ouch!) Well he's retiring soon and the University will never quite be the same, at least as far as I'm concerned. But at least I get a chance to speak at his retirement event. Congratulations Santosh and many thanks!
Monday, July 18, 2011
InfoQ and unREST
I wrote this article for InfoQ because I thought what JJ had said was interesting enough that a wider audience should consider it. I'm still not sure if I'm pleased, surprised or disappointed with the level of comments and discussion that it received. Something for me to contemplate when I'm on vacation I suppose.
Thursday, July 07, 2011
When email and vacation don't mix
I'm off on vacation soon for a couple of weeks. Going to Canada to visit my wife's parents. They live in the back-of-beyond, which is great for getting away from it all. As usual I've promised my wife I won't be taking my work laptop with me, which means that I won't have access to our VPN and hence no access to work email. In the past this used to bother me, because I always want to know what's going on in case there are problems at work. But it's obviously not conducive to a relaxing time.
Now I know some people who, when they go on vacation and can read work email, do read it just to keep to a minimum the amount of catching up they have to do when they return. I expect that after a couple of weeks of vacation I'll have several thousand emails to go through, so I can understand what they're doing. However, it won't work for me: I tried it a few times and I just can't help responding to emails if I see them! So what started as a 60 minute attempt during a vacation to cut down on the junk in my inbox ended up several hours later with me only about 10% of the way through. So I don't do that any more and I just take the hit when I get back.
However, I did figure out a compromise (pretty obvious really): I have a backup email address that is accessible off our VPN and which only certain people know about. They know they can get me on this at pretty much any time of the day or night. So if something comes up while I'm away this year, I can find out about it. Of course this has a slight downside in that I know immediately that any emails on that address are probably emergencies! Well, you can't win them all I suppose!
Sunday, June 26, 2011
When is a distributed system not a distributed system?
I've been involved with distributed systems since joining the Arjuna Project back in the mid-1980s. But distributed systems date back way before then, to at least the 1970s and the advent of the first RPC. There are a few definitions of what constitutes a distributed system, including Tanenbaum's "A collection of independent computers that appears to its users as a single coherent system" and this one: "A distributed system consists of a collection of autonomous computers, connected through a network and distribution middleware, which enables computers to coordinate their activities and to share the resources of the system, so that users perceive the system as a single, integrated computing facility". My favourite, though, is still Lamport's: "You know you have a distributed system when the crash of a computer you've never heard of stops you from getting any work done". Back when I started in the area, someone at the University once said that a precise definition was difficult, but you should recognise a distributed system when you saw it.
What all of the definitions have in common is the notion that a distributed system consists of nodes (machines) connected by some distributed fabric, e.g., an Ethernet. Distributed systems therefore pose problems that are not present in centralised, single-machine systems, such as independent failures, and hence the need for failure detection. Various techniques, such as distributed transactions and replication, have grown up to help deal with some of these issues. Other approaches, such as Voltan, also help to build fail-silent processes. And of course techniques such as message passing, RPC and shared tuple spaces were developed to help make developing distributed applications easier. Though of course we learned that complete distribution opacity is not always a good idea. (We did a lot of the early work in this area.)
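As a toy illustration of why the literature talks about failure suspectors rather than perfect failure detectors (the class, timeout and node name below are mine, purely illustrative and not from Arjuna or Voltan): in an asynchronous network a missed heartbeat could mean a crashed peer or simply a slow link, so all a node can honestly do is suspect a failure.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Toy heartbeat-based failure suspector: a peer that hasn't been heard from
// within the timeout is *suspected* to have failed. In an asynchronous
// network we can never be certain, which is why these are suspectors rather
// than perfect failure detectors.
public class FailureSuspector {
    private final long timeoutMillis;
    private final Map<String, Long> lastHeartbeat = new ConcurrentHashMap<>();

    public FailureSuspector(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    // Called whenever a heartbeat message arrives from a peer.
    public void heartbeat(String nodeId) {
        lastHeartbeat.put(nodeId, System.currentTimeMillis());
    }

    // A peer is suspected if we haven't heard from it within the timeout.
    public boolean isSuspected(String nodeId) {
        Long last = lastHeartbeat.get(nodeId);
        return last == null || System.currentTimeMillis() - last > timeoutMillis;
    }

    public static void main(String[] args) throws InterruptedException {
        FailureSuspector suspector = new FailureSuspector(500);
        suspector.heartbeat("node-A");
        System.out.println(suspector.isSuspected("node-A")); // false
        Thread.sleep(600);                                   // no further heartbeat arrives
        System.out.println(suspector.isSuspected("node-A")); // true: suspected, not known, to have failed
    }
}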
However, time has moved on, and whilst distributed systems continue to be important in our everyday lives, the fact is that many of the problems they present, and many of the solutions we have developed, are present within a local environment these days. Think of this like inner space versus outer space, if you like, but the way in which multi-core machines have become the norm means that we have failure-independent nodes (cores this time) within a single machine. Alright, they're not connected by an Ethernet, but there's a bus involved and they may, or may not, have shared memory too.
Of course none of this is new by any means. If you've ever experienced any of the parallel computing technologies that were all the rage back in the 1980s and 1990s, such as the Transputer (an excellent architecture, and Occam was brilliant!), then you'll understand. There's always been a duality between them and distributed systems. But back then parallel computing was even rarer than distributed computing, simply because the costs of the former were prohibitive (which is why a lot of parallel computing research was often done by running COTS hardware on a fast network, because it was cheaper!)
Times have certainly changed and we can no longer consign parallel computing to the realm of high performance computing, or niche areas. It's mainstream and is only going to increase. And I believe that this means distributed systems research and parallel computing efforts must converge. Many of the problems posed by both overlap and solutions for one may be relevant to the other. For instance, we've put a lot of effort into scalability in distributed systems, and failure detection and independence at the hardware (core) level is very advanced.
So when is a distributed system not a distributed system? When it's a centralised multi-core system!
Saturday, June 18, 2011
Open Source in action.
I've been with JBoss since 2005 and in that time I like to think I've experienced quite a bit about how open source works. It's been a wonderful learning experience for me and definitely turned me from someone who thought open source code and developers were somehow not as good as closed source equivalents into a person who knows that the opposite is most definitely the case!
Case in point: over the last 18 months or so we've been on an aggressive schedule for JBossAS 7, which has some pretty fundamental architectural changes in it. This would've been a challenge for any team (I remember how long it took us to implement HP-AS, for instance, and that team was also extremely skilled). But our teams are small and are responsible not only for development, but also for their communities too: open source means a lot more than just having your code in a public repository!
So these teams are putting in a lot of effort! They're pulling long hours too. Now of course that's not unique to open source or JBoss, but the developers are not doing this for their wages; they're doing it because they have a passion for open source and also for the history behind JBoss and our communities. They (we) believe that this is a game changer and not just another battle in the ongoing war. It's worth noting that I saw this in my Bluestone days too, both when we were independent and then part of HP. I think that the reasons behind that are very similar, only the protagonists have changed.
This release is also so fundamental to everything we are doing that even teams who wouldn't normally have much to do with AS are willing to pitch in, both during work hours and personal time. And what's more interesting is that I rarely have to ask them to help: it's a natural thing for them to do because they're as much a part of the AS community as others. From a personal perspective I've found the AS7 effort very enlightening. I've learned a lot, and much of it not just technical in nature. It is most definitely a good time to be in this role!
Sunday, June 12, 2011
Addicted?
My wife is an addictions counsellor with a strong background in psychology. It's amazed me the sorts of things that people can become addicted to and I won't go into them here! However, she's always said that the first step to addressing any addiction is for the addict to admit it. Now according to her I am addicted to work (her definition of work includes anything that involves a computer, including books and papers, so it's quite broad). I have to admit that I do spend an inordinate amount of time doing things that fall into that definition; even when I'm watching TV I'll usually have a book or laptop on my knee (now it's often my iPad). But I'm not sure I'd say I was an addict. And even if I am, I'm not sure I'd want to be "cured". I think a lot of my friends and colleagues would also fall into that category. But then that last statement does apply to many other forms of addiction. Hmmmm.
Saturday, June 11, 2011
New age developers?
I'm getting really tired (aka fed up) of hearing about "new age" developers and applications when certain people talk about the cloud. Look, there are only developers and applications! There's nothing "new age" about this. Some things change, as with each new wave of technology, but many things remain the same. Sure, the problem space has changed and we are seeing new applications and approaches being developed, but let's not ascribe mythical attributes to those applications or developers! They're no better or worse than developers or apps of the past. Though if you listen to some, "new age" means thinking and working so far outside the box that you're in the next reality! This is starting to get ridiculous and, in the local vernacular, it's getting on my nerves! Evolution people, not revolution!
Sunday, June 05, 2011
Updating the history of Arjuna
Back in 2002 when I was still with HP and our transaction system was still called Arjuna, I wrote a paper with Santosh on the transition of what had started out purely as an academic vehicle for getting a few of us PhDs, into a rather successful product. Back then we conjectured what might happen in the next few years, but the reality has turned out to be even more interesting.
We've always talked about updating the paper, but it's never happened. So it was with interest that we were asked to write a new version as part of a book for Brian Randell's birthday. The paper is almost finished and it has been a lot of fun working on it and remembering all of the things that have happened in the intervening decade: HP leaving the middleware space and us starting again with Arjuna, Web Services transactions, joining JBoss and taking Arjuna there, Red Hat, REST and more. Not all of it will be in this paper, but then maybe that leaves room for yet another update?!
One thing that updating the paper clearly showed was that something that started life as an academic project has not only had an impact on many products over many years, but also an impact on the people who have worked on it. Individuals have come and gone from the team over the years and they've all left their mark on the system and vice versa. And like the system, they've been a great group of people! So whether it's called Arjuna, JBossTS or something else, it and this paper remain a tribute to them all.
Heisenberg and the CAP theorem
For many years I've been working on extended transactions protocols. The CORBA Activity Service, WS-TX and now REST-TX are efforts on that road. There are many similarities between the problems of long-running transactions and large-scale replication, so the fact that I did my PhD on both gave me some insight into resolving each.
One of the early pieces of research I did was on combining replication and transactions to create consistency domains, where a large number of replicas are split into domains and each domain (replica group) has a relationship with the others in terms of their state and level of consistency. Rather than try to maintain strong consistency between all of the replicas, which incurs overhead proportional to the number of replicas as well as their physical locality, we keep the number of replicas per domain small (and hopefully related) and grow the number of domains if necessary. Each domain then has a degree of inconsistency with the others in the environment.
The basic idea behind the model is that of eventual consistency: in a quiescent period all of the domains would have the same state, but during active periods there is no notion of global/strong consistency. The protocol ensures that state changes flow between domains at a predefined rate (using transactions). A client of the inconsistent replica group can ask a domain for its state at any time, but may not get the global state, since not all updates will have propagated. Alternatively, a client can request the global state, but may not know how long it will take to be returned.
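Here's a rough sketch of that idea in Java (the class names, the map-based state and the explicit propagation step are all mine, purely illustrative; the real protocol used transactions to move updates between domains and had to deal with conflicts, which this ignores): a local read against a domain is immediate but possibly stale, and only after a propagation round, i.e., a quiescent period, do the domains converge.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch of consistency domains: each domain applies updates
// locally and forwards them to the other domains lazily, so a local read is
// fast but possibly stale, while a global view only makes sense once all
// pending updates have been propagated.
public class ConsistencyDomains {
    static class Domain {
        final Map<String, String> state = new HashMap<>();
        final List<Map.Entry<String, String>> pendingOutbound = new ArrayList<>();

        void update(String key, String value) {
            state.put(key, value);
            pendingOutbound.add(Map.entry(key, value)); // to be pushed to the other domains later
        }

        String localRead(String key) {
            return state.get(key); // may not yet reflect updates made in other domains
        }
    }

    final List<Domain> domains = new ArrayList<>();

    ConsistencyDomains(int n) {
        for (int i = 0; i < n; i++) domains.add(new Domain());
    }

    // Propagation step: flush pending updates from every domain to the others.
    // In the protocol described above this would run at a predefined rate.
    void propagate() {
        for (Domain source : domains) {
            for (Map.Entry<String, String> update : source.pendingOutbound) {
                for (Domain target : domains) {
                    if (target != source) target.state.put(update.getKey(), update.getValue());
                }
            }
            source.pendingOutbound.clear();
        }
    }

    public static void main(String[] args) {
        ConsistencyDomains system = new ConsistencyDomains(2);
        system.domains.get(0).update("x", "1");
        System.out.println(system.domains.get(1).localRead("x")); // null: not yet propagated
        system.propagate();                                       // quiescent point reached
        System.out.println(system.domains.get(1).localRead("x")); // "1": the domains converge
    }
}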
If you know Heisenberg's Uncertainty Principle then you'll know that it means you cannot determine the momentum and position of a particle at the same time (or other related properties). Thus it was fairly natural for me to use this analogy when describing the above protocol: an observer cannot know both the global state of the system and when that will be the steady state at the same moment, i.e., it's one or the other. It's not a perfect analogy, but at a time when others seemed to like bringing physics into computing it seemed appropriate.
Now of course the original work was done before the CAP theorem was formalised. So today we see people referring to that whenever they need to talk about relaxing consistency. And of course that is the right thing to do; if I were reviewing a paper today that was about relaxing consistency and the authors didn't reference CAP then I'd either reject it or have a few stern words to say to them. But I still think Heisenberg is a way cooler analogy to make. However, I do admit to being slightly biased!
Dublin here I come!
I'm off to the Red Hat Partner Summit in Dublin in a few hours' time. I'm giving a couple of presentations, one on the future of JBoss and one on how JBoss and the Cloud come together. I'm looking forward to them because the audience will be slightly different from those I've presented these topics to over the past few months, so it'll give me a chance to get much broader feedback. Plus it's been a while since I was last in Dublin, so hopefully there'll be a chance to get out and enjoy the place too.
Friday, June 03, 2011
Future of Middleware
There's a special event being organised as part of Middleware 2011 called the Future of Middleware Event (FOME). I've been asked to contribute to the event and associated paper/book, along with my friend, co-creator of Arjuna and long time mentor Professor Shrivastava. I'm looking forward to it, because it's related to quite a few things that I'm doing elsewhere too!