Back in 2001, Gartner (then META Group) first described what have since become known as the “Three Vs of Big Data”: volume, variety, and velocity.
A great deal of focus since then has been on the first two Vs. Many technologies (e.g. Hadoop, MapR, Ceph) enable the storage of massive quantities of data, commonly referred to as a ‘data lake,’ and the analysis of that data (e.g. MapReduce, Mahout). The focus on variety has led to a proliferation of NoSQL solutions offering ‘schemaless’ storage, allowing data in different formats to be co-located. A full discussion of which NoSQL option may be right for your product (and of the misnomer ‘schemaless’) will come in a future blog post.
The third V – velocity – has largely been relegated to custom solutions designed to get incoming data into said ‘data lake’ as quickly as possible. The irony is that velocity was the most heavily marketed aspect of Big Data: infographics were shared by the thousands showing how much data the world produces every 60 seconds. Domo and Qmee produced two of the most widely seen examples.
While we could debate the reality that many examples of Big Data (including these infographics) have no impact on most companies’ Big Data plans (does your product care about the proliferation of YouTube videos?), I’d like to spend a little time discussing the third V, velocity, in more detail. In the past few years, new tooling has arisen to address what’s called ‘stream event processing’: data is processed as it arrives. By viewing data as it occurs, immediate insight can be gained, trends can be visualized, threats and warnings can be acted upon right away, and long-term data storage can be reduced or even eliminated.
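To make the idea concrete, here is a minimal sketch of processing events as they arrive – the readings, window size, and function name are illustrative assumptions, not from any particular product:

```python
from collections import deque

def stream_average(events, window=3):
    """Consume numeric readings one at a time, yielding a rolling
    average over the last `window` values. Insight is available
    immediately, without landing the full stream in storage."""
    recent = deque(maxlen=window)  # only the window is ever kept in memory
    for value in events:
        recent.append(value)
        yield sum(recent) / len(recent)

# Feed events one at a time, as a live source would:
readings = [10, 12, 11, 50, 13]
averages = list(stream_average(readings))
```

Note that the generator never holds more than `window` values, so memory stays constant no matter how long the stream runs – the spike at the fourth reading shows up in the rolling average the moment it arrives.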
The challenge is processing this deluge of data in a scalable, elastic manner. Amazon Web Services (AWS) offers a stream processing service called Kinesis, and there are open source tools like Storm and Spark. We have more devices in our lives collecting more and more data. Most smartphones generate 30-40 events per second, and with the expansion of wearables, that number is only going to grow. Factor in the Internet of Things (IoT) and existing social media streams, and more data is available to product developers than ever before.
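The basic mechanism these tools use to scale is partitioning: each event is routed by a key so that related events land on the same worker. Here is a hypothetical sketch of that idea (the shard count and device keys are made up for illustration, not Kinesis or Storm APIs):

```python
import zlib
from collections import defaultdict

NUM_SHARDS = 4

def shard_for(key: str) -> int:
    """Deterministically map an event key to a shard. All events for
    one device land on one worker, so per-key state stays local -- the
    same basic idea behind Kinesis partition keys and Storm groupings."""
    return zlib.crc32(key.encode("utf-8")) % NUM_SHARDS

# Route a burst of device events to workers:
events = [("device-%d" % (i % 7), i) for i in range(20)]
shards = defaultdict(list)
for key, payload in events:
    shards[shard_for(key)].append((key, payload))
```

Adding capacity then means raising the shard count and rebalancing keys, which is what makes the approach elastic rather than tied to one big machine.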
The traditional approach to this data would be to store it all, then perform analytics over the full data set. That takes a long time to run (contrary to marketing-speak, processing billions of rows of data still takes a long time – just far less than before). If the incoming data has a measurable ‘time-of-value’ (e.g. a security violation in a system log), waiting for that answer may take too long. Even worse, while processing all the data collectively may give a more precise response, enough new data may arrive while that job is running that the answer has already changed.
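The security-log example above can be sketched in a few lines – a stream consumer that acts the instant a pattern crosses a threshold, rather than storing the log and querying it later. The pattern, threshold, and sample lines are hypothetical:

```python
def alert_on(lines, pattern="FAILED LOGIN", threshold=3):
    """Scan log lines as they arrive and return the line number at
    which `pattern` has been seen `threshold` times -- acting while
    the data still has value, instead of after a batch job finishes."""
    count = 0
    for lineno, line in enumerate(lines, start=1):
        if pattern in line:
            count += 1
            if count >= threshold:
                return lineno  # act immediately; time-of-value preserved
    return None  # stream ended without tripping the threshold

log_stream = [
    "ok",
    "FAILED LOGIN user=a",
    "ok",
    "FAILED LOGIN user=a",
    "FAILED LOGIN user=b",
]
hit = alert_on(log_stream)
```

Because the check runs per event, the alert fires on the very line that crosses the threshold, not minutes or hours later when a batch job over the stored logs completes.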
This brings us back to stream processing. Being able to effectively process incoming data in a scalable fashion can allow us to gain insights in real-time. Some use cases for stream processing:
The key factors in designing your stream processing system:
In closing, when dealing with rapid data, particularly data whose value has a short half-life, instead of immediately pursuing a store-and-analyze strategy, consider a model that processes the data as it arrives, giving you the immediate results you need and possibly saving a great deal on storage costs. For amusement, I’d like to leave you with one last infographic, a great one from IBM that shows all the ‘Big Data’ associated with searching for ‘Big Data’ (including the number of infographics about Big Data)…