Sunday, March 11, 2012

Big Data

Data is important in everything we do in life. Whether it's a recipe for your favourite dinner or details of your bank account, we are all data driven. This is nothing new either: humanity, and in fact all life, is critically dependent on data (information) and has been for millions, or billions, of years. Therefore, maintaining that data is also extremely important; it needs to be available (in a reasonable period of time), often shareable, secure and consistent (more or less).

If we cannot get to our data then we could be in trouble, possibly catastrophically so. If it takes too long to get at then there may be no real difference between it being available or not. If someone else can get to our data without our permission and possibly modify it without our knowledge, then that could be even worse than not being able to access the information. And of course if the information is maintained in multiple places, perhaps so that we have backups if one copy is lost, then updates to one copy must eventually be made to the others.

Over the centuries individuals and companies have grown successful and influential by maintaining data securely for others, or even by managing it so that everyone has to go to them to obtain it. In our industry several large companies grew primarily because of this model. Other vendors became successful through other aspects of data management or integration. Data is an enabler for everything in the software industry, whether it's middleware or operating system related.

So data is King. That is the way it has been and always will be. Everything we do uses, manipulates or otherwise controls data in some way, shape or form. The arrival of Cloud, mobile and ubiquitous computing does not change that. In fact ubiquitous computing increases the amount of data we have by several orders of magnitude. The bulk of the world's computing power is in embedded systems, i.e., systems designed to do a few dedicated functions, in real-time, using sensors for data I/O. Technically all smart-phones and tablets are embedded systems, not PCs. Major drivers for data in the coming years are smart-phones, tablets, sensors, green technology, eHealth/medical, industrial applications and system "health" monitoring.

Much of the new data coming on stream today contains a location, timestamp or both. There has been a tenfold increase in electronically generated data in 5 years. It is predicted that very soon there will be over a zettabyte of data (1 billion terabytes). Stored on standard 4.7 GB DVDs, that would make a stack reaching roughly two-thirds of the way to the Moon! Maintaining that data is important. But being able to use it is more important.
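To put those numbers in perspective, here is a rough back-of-the-envelope sketch. It assumes decimal (SI) units and a single-layer DVD capacity of 4.7 GB; the function names are purely illustrative and not taken from any library.

```python
# Decimal (SI) units, expressed in bytes.
TERABYTE = 10**12
ZETTABYTE = 10**21


def terabytes_per_zettabyte() -> int:
    """How many terabytes make up one zettabyte."""
    return ZETTABYTE // TERABYTE


def annual_growth_factor(total_factor: float, years: float) -> float:
    """Constant yearly multiplier that compounds to total_factor over years."""
    return total_factor ** (1.0 / years)


def dvds_needed(total_bytes: int, dvd_bytes: int = 4_700_000_000) -> int:
    """Number of DVDs needed to hold total_bytes (assumed 4.7 GB per disc)."""
    return -(-total_bytes // dvd_bytes)  # ceiling division on integers


if __name__ == "__main__":
    print("TB per ZB:", terabytes_per_zettabyte())
    print("Yearly growth factor for 10x in 5 years:",
          round(annual_growth_factor(10, 5), 3))
    print("DVDs for one zettabyte:", dvds_needed(ZETTABYTE))
```

A tenfold increase over 5 years works out to data growing by a little under 60% every year, compounded, and a zettabyte needs on the order of two hundred billion discs.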

It is now well known that issues with traditional RDBMS implementations, their architectures and assumptions, mean that they are insufficient to manage and maintain this kind of data, at least not by themselves. This has led to the evolution of Big Data, with NoSQL and NewSQL implementations. There is a range of different approaches, including tuple space and graph-based databases, because one size does not fit all. Unlike the RDBMS, which is a good generic workhorse but cannot be optimised for every specific use case, these new implementations are each targeted at particular use cases and conversely would make poor generic solutions.
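As a toy illustration of why one size does not fit all, the sketch below answers the same "friends of friends" question twice: once with a relational self-join (using Python's built-in sqlite3 module) and once by walking an in-memory adjacency map, which is roughly how a graph-oriented store models the problem. The data set and function names are invented for the example, and neither function represents any particular NoSQL product.

```python
import sqlite3
from collections import defaultdict

# Hypothetical sample data: undirected friendships.
FRIENDSHIPS = [("ann", "bob"), ("bob", "cat"), ("cat", "dan"), ("ann", "eve")]


def friends_of_friends_sql(person: str) -> set:
    """Relational approach: one table, answered with a self-join."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE friend (src TEXT, dst TEXT)")
    # Store each friendship in both directions.
    rows = FRIENDSHIPS + [(b, a) for a, b in FRIENDSHIPS]
    conn.executemany("INSERT INTO friend VALUES (?, ?)", rows)
    cursor = conn.execute(
        "SELECT DISTINCT f2.dst FROM friend f1 "
        "JOIN friend f2 ON f1.dst = f2.src "
        "WHERE f1.src = ? AND f2.dst != ? "
        "AND f2.dst NOT IN (SELECT dst FROM friend WHERE src = ?)",
        (person, person, person),
    )
    result = {row[0] for row in cursor}
    conn.close()
    return result


def friends_of_friends_graph(person: str) -> set:
    """Graph approach: follow edges directly from node to node."""
    adjacency = defaultdict(set)
    for a, b in FRIENDSHIPS:
        adjacency[a].add(b)
        adjacency[b].add(a)
    direct = adjacency[person]
    reachable = set()
    for friend in direct:
        reachable |= adjacency[friend]
    # Drop the person themselves and anyone who is already a direct friend.
    return reachable - direct - {person}


if __name__ == "__main__":
    print(friends_of_friends_sql("bob"))
    print(friends_of_friends_graph("bob"))
```

Both give the same answer, but each extra hop costs the relational version another join, whereas the graph version just follows one more set of edges. On the other hand, the adjacency map is of little use for the ad hoc, tabular queries that SQL handles well.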

It is extremely unlikely that a single standard, such as SQL, will evolve that works across all NoSQL implementations. However, there may be a set of standards, one for each category of approach. But for many, SQL will still remain a necessity, even if it means they cannot benefit from all of the advantages NoSQL offers: more people understand SQL and more applications are based on it than anything else. So bridging these worlds is important. Finally, it is worth noting that business analytics and sensor analytics will play a crucial role here.

In our industry we are now seeing an explosion in new database vendors looking to become the standard for this next generation. The current generation of developers is more heavily influenced by mobile, social and ubiquitous computing than by mainframes, so the RDBMS is not their natural first thought when considering how to manage data. However, most of these new companies are small open source startups and recognise the problems inherent in being both small and open source: customers may not trust their important data to small companies that could go under or be acquired by a competitor, and other vendors may take their code and create competing products from it.

Furthermore, some large vendors who have failed to make an impact in the data space or inroads in enterprise middleware see this new area and these new companies as opportunities. As a result, symbiotic relationships are being created between these two categories of companies. Many of these NoSQL/NewSQL companies are going to merge, be acquired or fail. In the meantime new approaches and companies will be created. Over the next 5 years this new data space, which will integrate with the current RDBMS area, will coalesce and solidify. There are no obvious winners at this stage, but what is clear is that open source will play a critical role.