Fast Data is the New Black

Back in 2001, Gartner (then META Group) first described what have since become known as the “Three Vs of Big Data”:

  • Volume – the total amount of data
  • Variety – the different forms of data coming in (tweets, video, sensor data, etc.)
  • Velocity – the speed at which data is coming into a system

A great deal of the focus since then has been on the first two Vs. Many technologies (e.g. Hadoop, MapR, Ceph) enable the storage of massive quantities of data, commonly referred to as a ‘data lake,’ and tools such as MapReduce and Mahout perform analysis on that data. The focus on variety has led to a proliferation of NoSQL solutions offering ‘schemaless’ storage, which allows different formats of data to be co-located. A full discussion of which NoSQL option may be right for your product (and of why ‘schemaless’ is something of a misnomer) will come in a future blog post.

The third V – velocity – has largely been relegated to custom solutions designed to get the incoming data into said ‘data lake’ as quickly as possible. The irony is that velocity was the most heavily marketed aspect of Big Data. Infographics were shared by the thousands showing how much data the world produces every 60 seconds; Domo and Qmee produced two of the most widely seen.

[Infographic: Domo, “Data Never Sleeps”]

[Infographic: Qmee, “Online in 60 Seconds”]

While we could debate the fact that many examples of Big Data (including these infographics) have no bearing on most companies’ Big Data plans (does your product care about the proliferation of YouTube videos?), I’d like to spend a little time discussing the third V, velocity, in more detail. In the past few years, new tooling has arisen to address what’s called ‘stream event processing’ – processing data as it arrives. By acting on data as it occurs, immediate insight can be gained, trends can be visualized, threats and warnings can be acted upon right away, and long-term data storage can be reduced or even eliminated.
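
To make that concrete, here is a minimal sketch in Python of acting on events as they arrive rather than after they land in a data lake. The `event_source()` generator and the brute-force threshold are hypothetical stand-ins for a real stream consumer and a real detection rule:

```python
import json
import time

# Hypothetical event source: in a real system this would be a queue
# consumer, a socket, or a stream shard iterator.
def event_source():
    while True:
        yield {"user": "u123", "action": "login_failed", "ts": time.time()}

# Process each event as it arrives instead of landing it in a data lake
# and analyzing it later. Here we keep a running count of failed logins
# per user and raise a warning the moment a threshold is crossed.
failed_logins = {}
THRESHOLD = 5

for event in event_source():
    if event["action"] == "login_failed":
        user = event["user"]
        failed_logins[user] = failed_logins.get(user, 0) + 1
        if failed_logins[user] >= THRESHOLD:
            print(json.dumps({"alert": "possible brute force", "user": user}))
            failed_logins[user] = 0  # reset after alerting
```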

The challenge is being able to process this deluge of data in a scalable, elastic manner. Amazon Web Services (AWS) offers a stream processing service called Kinesis, and there are open source tools like Apache Storm and Spark Streaming. We have more devices in our lives collecting more and more data: a typical smartphone generates 30-40 events per second, and with the expansion of wearables that number will only grow. Factor in the Internet of Things (IoT) and existing social media streams and you can see that more data is available to product developers than ever before.
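
As a sketch of the hosted route, pushing device events into a Kinesis stream from Python can look roughly like this; the stream name, region, and payload are illustrative, and it assumes boto3 is installed and AWS credentials are already configured:

```python
import json

import boto3  # assumes boto3 is installed and AWS credentials are configured

# "device-events" is an illustrative name for an existing Kinesis stream.
kinesis = boto3.client("kinesis", region_name="us-east-1")

def send_event(device_id, payload):
    # The partition key controls which shard receives the record; using the
    # device id keeps a single device's events in order within its shard.
    kinesis.put_record(
        StreamName="device-events",
        Data=json.dumps(payload).encode("utf-8"),
        PartitionKey=device_id,
    )

send_event("phone-42", {"sensor": "accelerometer", "x": 0.02, "y": 0.98, "z": 0.11})
```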

The traditional approach to this data would be to store it all, then run analytics over the entire set. That takes a long time (contrary to the marketing-speak, processing billions of rows of data still takes a while – just far less than it used to). If the incoming data has a measurable ‘time-of-value’ (e.g. a security violation in a system log), waiting for that answer may simply take too long. Even worse, while processing all the data collectively may give a more precise answer, enough new data may arrive while the job is running that the answer has already changed.

This brings us back to stream processing. Being able to effectively process incoming data in a scalable fashion can allow us to gain insights in real-time. Some use cases for stream processing:

  • Monitoring web application logs from large clustered applications for suspicious behavior
  • Calculating real-time aggregations of messages sent from a deployed mobile app
  • Pre-processing data to determine trends (top terms, financial trends, shopping).
  • Grouping and aggregating data and storing only the result for later analysis (a sketch of this follows the list). This may seem counterintuitive to ‘Big Data’ – but while storage is certainly cheap, at enough data volume (stored in triplicate in Hadoop, no less) the cost adds up. Estimate what the value of the raw detail will be in 3, 6, or 9 months. You may indeed want to save it all, but that should be a deliberate decision.
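
As a rough illustration of those last two use cases, a tumbling-window aggregation that persists only per-window counts might look like the following sketch; the `events` iterable and the `store_summary` persistence hook are hypothetical placeholders:

```python
import time
from collections import Counter

WINDOW_SECONDS = 60  # size of each tumbling window

def aggregate_stream(events, store_summary):
    """Group incoming events into tumbling windows and persist only the
    per-window counts; the raw events are discarded once counted."""
    window_start = time.time()
    counts = Counter()
    for event in events:  # `events` is any iterable of event dicts
        counts[event["type"]] += 1
        now = time.time()
        if now - window_start >= WINDOW_SECONDS:
            # `store_summary` is a hypothetical persistence hook, e.g. an
            # insert into a table holding one summary row per window.
            store_summary({"window_start": window_start, "counts": dict(counts)})
            window_start = now
            counts = Counter()
```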

The key factors in designing your stream processing system:

  • Ensure it can scale to your incoming data rate. If processing messages takes longer than all of your sources take to generate them, you’ll need to expand your processing pool.
  • Consider a scalable message queue such as Kafka, RabbitMQ, ZeroMQ, or one of many others to act as a ‘buffer’ if your event traffic is prone to ebbs and flows (see the buffering sketch after this list). Expanding clusters is neither instantaneous nor free. Depending on your stream processing framework and hosting, you may be able to adjust your processing cluster automatically based on queue depth.
  • If you decide to save the inbound data in a long-term store, consider making your stream processing support ‘sampling’ via configuration (see the sampling sketch after this list). Even if you start by storing 100% of the inbound data, being able to selectively tune that down gives you the freedom to do reasonable data analysis without the significant overhead of storing petabytes of data (4PB of data = $45k-$125k per month).
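
For the buffering point, here is a minimal sketch of producing events into Kafka using the kafka-python client; the broker address and topic name are illustrative. The processing cluster then consumes from the topic at its own pace, so bursts of traffic queue up rather than forcing an immediate cluster resize:

```python
import json

from kafka import KafkaProducer  # kafka-python package

# Broker address and topic name are illustrative placeholders.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def handle_incoming(event):
    # Append the event to the buffer topic; downstream processors consume
    # from "raw-events" whenever they have capacity.
    producer.send("raw-events", event)

handle_incoming({"source": "mobile-app", "action": "screen_view"})
producer.flush()  # block until buffered records are actually sent
```

For the sampling point, the idea is as simple as gating writes to the long-term store behind a tunable rate; `SAMPLE_RATE` and the `store` callable here are hypothetical configuration and persistence hooks:

```python
import random

# Hypothetical configuration value: the fraction of raw events to keep in
# long-term storage (1.0 = keep everything, 0.05 = keep roughly 5%).
SAMPLE_RATE = 1.0

def maybe_store_raw(event, store):
    """Write the raw event to the long-term store (e.g. S3 or HDFS) only if
    it falls within the configured sample. Aggregated results elsewhere in
    the pipeline are still computed over every event."""
    if random.random() < SAMPLE_RATE:
        store(event)
```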

In closing, when dealing with rapid data, particularly data whose value has a short half-life, instead of immediately pursuing a store-then-analyze strategy, consider a model that processes the data as it arrives, giving you the immediate results you need and potentially saving a great deal on storage costs. As an amusement, I’d like to leave you with one last infographic, a great one from IBM that shows all the ‘Big Data’ associated with searching for ‘Big Data’ (including the number of infographics about Big Data)…

[Infographic: IBM, “Big Data” buzz]

Dan Greene

Director of Architecture

Dan Greene is the Director of Architecture at 3Pillar Global. Dan has 18 years of software design and development experience, with software and product architecture experience in areas including eCommerce, B2B integration, Geospatial Analysis, SOA architecture, Big Data, and Cloud Computing. He is an AWS Certified Solution Architect who worked at Oracle, ChoicePoint, and Booz Allen Hamilton prior to 3Pillar. Dan is a graduate of George Washington University. He is also a father, amateur carpenter, and runs obstacle races including Tough Mudder.
