Sunday, December 19, 2010

Data on the outside (of the public Cloud) versus data on the inside (of the public Cloud)

A few years back Pat Helland wrote a nice paper about the issues of data outside and inside the service boundary, and how its location affects traditional transaction semantics. This was around the time that there was a lot of work and publicity around extended transaction models. Pat used an analogy to relativity to drive home his point, whereas I'd been using quantum mechanics analogies. But regardless of how it is explained, the idea is that data, and where it is located, is ultimately the bottleneck to scalability if total consistency is required (in a finite period of time).

So what's this got to do with Cloud? Well, I've been thinking a lot about Cloud for quite a while (depending on your definition of Cloud, it could stretch back decades!). There are many problem areas that need to be worked out for the Cloud to truly succeed, including security, fault tolerance, reliability and performance. Some of these are directly related to why I believe private Clouds will be the norm for most applications.

But probably the prime reason why I think public Clouds won't continue to hog the limelight for enterprise or mission-critical applications is the data issue. Processor speeds continue to increase (well, maybe not individual core speeds, but with multiple cores on the same chip the net result is the same), memory speeds increase, and even disk speeds are increasing when you factor in the growing use of solid state. Network speeds have always lagged behind, and that continues to be the case. If you don't have much data/information, or most of it is volatile and generated as part of the computation, then moving to a public cloud may make sense (assuming the problems I mentioned before are resolved). But for those users and applications that have copious amounts of data, moving it into the public cloud is going in the wrong direction: moving the computation to the data makes far more sense.
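
To make the bandwidth point concrete, here's a rough back-of-envelope sketch of how long it takes simply to push a data set over the wire before any computation can even start. The data sizes, link speeds and efficiency factor are illustrative assumptions, not measurements:

```python
# Rough back-of-envelope sketch (not a benchmark): time to move a data set
# into a public cloud over a WAN link. All figures are illustrative assumptions.

def transfer_hours(data_terabytes, link_megabits_per_sec, efficiency=0.7):
    """Hours needed to push the data over a link of the given nominal speed,
    assuming only `efficiency` of that speed is achieved in practice."""
    bits = data_terabytes * 8 * 10**12              # decimal terabytes -> bits
    seconds = bits / (link_megabits_per_sec * 10**6 * efficiency)
    return seconds / 3600

for tb in (1, 10, 100):
    for mbps in (100, 1000):
        print(f"{tb:>4} TB over a {mbps:>4} Mbit/s link: "
              f"{transfer_hours(tb, mbps):8.1f} hours")
```

Even with these fairly optimistic figures, moving tens of terabytes is measured in days or weeks, which is why sending the computation to the data is usually the cheaper direction.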

Now don't get me wrong. I'm not suggesting that public clouds aren't useful and won't remain useful. It's how they are used that will change for certain types of application. Data has always been critical to businesses and individuals, whether you're the largest online advertiser trying to amass as much information as possible or whether you're just looking to maintain your own financial data. Deciding whom to trust with that information is always going to be an issue because it is related directly to control: do you control your own information when you need it, or do you rely on someone else to help (or get in the way)? I suspect that for the vast majority of people and companies, the answer will be that they want to retain control over their data. (Over the years that may change, in the same way that banks became more trusted than keeping your money under the mattress; though perhaps the bank analogy and trust isn't such a good one these days!)

Therefore, I think where the data resides will define the future of the cloud, and that really means private cloud. Public clouds will be useful for cloud bursting and number crunching for certain types of application, and of course there will be users who can commit entirely to the public cloud (assuming all of the above issues are resolved) because they don't have critical data stores on which they rely, or because they don't have existing infrastructure that can be turned into a protected (private) cloud. So what I'm suggesting is that data on the outside of the public cloud, i.e., within the private cloud, will dominate over data on the inside of the public cloud. Of course only time will tell, but if I were a betting man ...

2 comments:

Anonymous said...

I completely agree that data and computation belong in the same physical location (more or less). But where that is remains an open question.

Some companies will surely want to place their data (and computation) on premises. Others will want to make use of the public cloud. I also think that a single company might want to go with both options, depending on which data (and computations) we are discussing.

Some data is less sensitive. Some computations need a lot of resources during peaks and little or none between peaks.

And when it comes to transferring large amounts of data to the public cloud, I have heard about a really good idea (I've lost the URL though): put the data on some kind of physical medium (e.g. a DVD) and send it to the cloud provider by snail mail. The provider then puts the data in place locally inside their data centres. It might save a lot of time (and money) if you need to transfer a huge amount of data.

Mark Little said...

Hi Herbjorn. Yes, I think we agree. As to the DVD reference, I remember Jim Gray many years ago talking about how, if you had a lot of data, you could put it on a hard drive and send it via a courier, and you'd get higher bandwidth than over the internet.

He doesn't mention it explicitly here (http://queue.acm.org/detail.cfm?id=864078) but hints at it. If I find the original reference I'll post it here.
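
For what it's worth, the arithmetic behind that observation is easy to sketch. The disk size, shipping times and link speed below are assumptions, purely for illustration:

```python
# Illustrative arithmetic only: effective bandwidth of couriering a disk versus
# streaming the same data over a network link. All figures are assumptions.

def courier_mbps(disk_terabytes, shipping_hours):
    """Effective Mbit/s of shipping a disk that arrives after `shipping_hours`."""
    bits = disk_terabytes * 8 * 10**12
    return bits / (shipping_hours * 3600) / 10**6

print(f"2 TB disk, 24h courier: {courier_mbps(2, 24):6.0f} Mbit/s effective")
print(f"2 TB disk, 48h courier: {courier_mbps(2, 48):6.0f} Mbit/s effective")
print("Typical WAN link      :    100 Mbit/s (nominal)")
```

The latency is dreadful, of course, and a box of several drives scales the figure linearly, which was Gray's point: for bulk data the courier wins.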