Sunday, April 27, 2014

The future adaptive middleware platform


Back in the 1990s, way before the success of Java, let alone the advent of Cloud, there was a lot of research in the area of configurable and adaptable distributed systems. There was even a conference, the IEEE Conference on Configurable Distributed Systems. Some of the earliest work on distributed agents, adaptable mobile systems and weak consistency happened here and I'm glad to have been a part of that effort. However, two decades ago the kinds of environments that we envisioned were really just the stuff of blue-sky research. Times change, though, and I believe that a lot of what we did back then, and research that has happened in the intervening years, is now much more applicable and in fact necessary for the kind of middleware platform that we need today.

For a start, one of the aims of some of the research we did was to build environments that could self-manage, monitor themselves and adapt to change, whether due to failures in network topology and machines, or changes in application (user) requirements. These were systems that we believed would need to be deployed for long periods of time with little or no human intervention. These are precisely the kinds of environments that we are considering today for Cloud and IoT: reducing the number of system administrators needed to manage your applications is a key aim for the Cloud, and imagine if you had to keep logging in to your smart sensors every time the Wi-Fi went down or a new device was added to the network.

There's a lot that can change within a distributed system, and much of it is inadvertent or unknowable a priori. This includes failures of the network (partitions, overloading) and the nodes (crash failures or overloading making the machine so slow that it appears to have crashed). Machines being added to the environment may mean that it's more efficient to migrate entire applications or components (services) to these new arrivals to maintain desired SLAs, such as performance or reliability. Likewise, a machine may become so overloaded that it's simply impossible to maintain an SLA, and so migration off it may be necessary. Traditional monitoring and management approaches would work here but tend to be far too manual. This means that whilst problems (faults) can be tolerated, the negative impact on clients and applications, such as downtime, can be too much.

The middleware should be able to detect these kinds of problems or inabilities to meet SLAs, or even predict that these SLA conflicts are going to occur (Bayesian Inference Networks are good for this). It may seem like a relatively simple (or subtle) addition to monitoring and management (or governance) but it's crucial. Adding this capability doesn't just make the middleware infrastructure a little more useful and capable, it elevates it to an entirely different level. The infrastructure needs to have SLAs and QoS built in from the start for components as well as higher level services. It should be JON-like in monitoring and managing its surroundings as well as itself.
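To give a flavour of what predicting an SLA conflict could look like, here is a minimal Python sketch. It is not a full Bayesian inference network: it just applies Bayes' rule repeatedly and treats the observed signals as conditionally independent, which is a simplification. The signal names, the likelihood numbers and the 0.5 threshold are all made up for illustration.

```python
def posterior_violation(prior, p_signal_given_violation, p_signal_given_ok):
    """Bayes' rule: probability of an SLA violation given one observed signal."""
    evidence = (p_signal_given_violation * prior
                + p_signal_given_ok * (1.0 - prior))
    if evidence == 0.0:
        return 0.0  # the signal is impossible under both hypotheses
    return p_signal_given_violation * prior / evidence


def predict_violation(prior, signals, likelihoods):
    """Fold each observed signal into the running violation probability.

    Assumes the signals are conditionally independent (a naive simplification).
    """
    p = prior
    for name in signals:
        p_if_violation, p_if_ok = likelihoods[name]
        p = posterior_violation(p, p_if_violation, p_if_ok)
    return p


# Hypothetical likelihoods: (P(signal | violation coming), P(signal | all fine)).
LIKELIHOODS = {
    "high_latency": (0.8, 0.1),
    "queue_growing": (0.7, 0.2),
}

if __name__ == "__main__":
    p = predict_violation(0.05, ["high_latency", "queue_growing"], LIKELIHOODS)
    # The middleware could act (e.g. start a migration) before the SLA is broken.
    print(f"P(violation) = {p:.3f}, act early: {p > 0.5}")
```

The point is only that the infrastructure can combine weak signals into a belief and act on it before a client notices anything, rather than waiting for a hard failure.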

Each component potentially needs to be driven through a smart proxy so that things can be dynamically switched between local and remote implementations. There may also need to be environment-specific component implementations if existing ones cannot fit or be tuned to fit, e.g., a component written in C for embedded environments where the JVM cannot run due to space limitations. The platform also needs something like stub-scion pairs (Shapiro in the 80s or Caughey with Shadows in the 90s) to allow for object migration with dependency tracking. We should also add in the disconnected operation work from the 80s and 90s: yes, the network has improved a lot over the years, but disconnection is more likely now because we are used to being connected so much.
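As a rough illustration of the smart proxy idea, here is a minimal Python sketch. The class names (SmartProxy, LocalStore, RemoteStore) and the rebind method are hypothetical, and the "remote" implementation is just a stand-in that does no real networking.

```python
class LocalStore:
    """Hypothetical in-process implementation of a component."""

    def get(self, key):
        return f"local:{key}"


class RemoteStore:
    """Stand-in for a remote implementation (no real network call here)."""

    def get(self, key):
        return f"remote:{key}"


class SmartProxy:
    """Forwards calls to whichever implementation is currently bound."""

    def __init__(self, target):
        self._target = target

    def rebind(self, target):
        # The middleware would call this when it decides to switch implementation.
        self._target = target

    def __getattr__(self, name):
        # Only reached for attributes not found on the proxy itself.
        if name == "_target":
            raise AttributeError(name)  # avoid recursion if not yet initialised
        return getattr(self._target, name)


if __name__ == "__main__":
    store = SmartProxy(LocalStore())
    print(store.get("sensor-1"))   # served by the local implementation
    store.rebind(RemoteStore())    # e.g. the local node ran low on memory
    print(store.get("sensor-1"))   # same client reference, now served remotely
```

The client holds on to the same reference throughout, so the switch can happen without the application being aware of it.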

We need new frameworks and models for building applications, though current approaches should still work. Being transparent is best, but opaque allows for using existing applications. Each component needs something that implements a reconfigure/adapt interface and listens on a bus for reconfiguration events. Components adapt based on available memory, processor changes such as speed or number, network characteristics, disconnection, load on processor, dependencies on other components, etc. The platform should include the dispatcher architecture to help adaptation at various levels throughout the invocation stack.
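A reconfigure/adapt interface driven from a bus might look something like the following Python sketch. Everything here is an assumption made for illustration: the Adaptable and EventBus names, the event names such as "memory.low", and the cache that halves its size. Dispatch is also synchronous to keep the example short, whereas a real backbone would more likely be asynchronous.

```python
from abc import ABC, abstractmethod
from collections import defaultdict


class Adaptable(ABC):
    """Hypothetical interface every adaptable component would implement."""

    @abstractmethod
    def adapt(self, event, payload):
        """React to an environment change published on the bus."""


class EventBus:
    """Tiny in-process bus; synchronous delivery for simplicity."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, event, component):
        self._subscribers[event].append(component)

    def publish(self, event, payload):
        # Copy the list so components may subscribe during delivery.
        for component in list(self._subscribers.get(event, ())):
            component.adapt(event, payload)


class Cache(Adaptable):
    """Example component that shrinks under memory pressure."""

    def __init__(self):
        self.max_entries = 1000
        self.offline = False

    def adapt(self, event, payload):
        if event == "memory.low":
            self.max_entries = max(10, self.max_entries // 2)
        elif event == "network.disconnected":
            self.offline = True


if __name__ == "__main__":
    bus = EventBus()
    cache = Cache()
    bus.subscribe("memory.low", cache)
    bus.subscribe("network.disconnected", cache)
    bus.publish("memory.low", {"free_mb": 32})
    print(cache.max_entries, cache.offline)
```

An existing (opaque) application could be wrapped by an adapter that implements the same interface on its behalf.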

OK, so let's summarise the features/capabilities and add a few things that are implicit:

- Can adapt to changes in environment. Autonomous monitoring and management.
- All components can have contracts and SLAs.
- Event oriented backbone/backplane/bus.
- Asynchronous interactions. Synchronous on top if necessary.
- Flexible threading model.
- Core with low overhead and footprint. Assumed to be everywhere, with all applications or services plugging into it. Many of these other capabilities would be dynamically added when/if needed, such that the core needs to know how to do very little initially.
- Repository of components with social-like tagging for contract selection.
- Native language components ranging from thin client through to entire stack.
- DNA/RNA-like configuration to allow for recovery in the event of catastrophic failure or to support complete or partial migration to new (blank/empty) hardware.
- Self healing.
- Tracking of dependencies between objects or services so that if the migration of an object/service is necessary (e.g., due to a node failure or being powered down), all related services/objects can be migrated as well to ensure continued operation.
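The dependency tracking in the last item can be sketched in a few lines of Python. This assumes the platform keeps a simple map from each service to the services it directly depends on, and that "related" means everything reachable through that map; the service names are invented, and a real system might also need to follow the edges in the other direction (who depends on me).

```python
from collections import deque


def migration_set(root, depends_on):
    """Return root plus everything it transitively depends on (breadth-first)."""
    seen = {root}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for dep in depends_on.get(node, ()):
            if dep not in seen:  # also guards against dependency cycles
                seen.add(dep)
                queue.append(dep)
    return seen


if __name__ == "__main__":
    # Hypothetical dependency map: service -> services it calls directly.
    deps = {
        "orders": ["inventory", "billing"],
        "billing": ["ledger"],
        "reports": ["ledger"],
    }
    # If the node hosting "orders" is being powered down, move these together.
    print(sorted(migration_set("orders", deps)))
```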

So this is it in a (relative) nutshell. The links I've included throughout should help to give more details. If I get the time I may expand on some of these topics in future entries. Even better, maybe I'll also get a chance to present on one or more of these items at a conference or workshop. One of the problems I have with my job is that there's so much to do and so little time. Whilst all of this is wrapped up inside my head, the time to write it down doesn't come in a block large enough to do it justice in one sitting. Rather than spend months writing it up, I wanted to take a good open source approach and "release early, release often". Consider this a milestone or beta release.
