March 17, 2016
Real Time Analytics & Visualization with Apache Spark
I’m sure you’ve heard that fast data is the new black. If you haven’t, here’s the memo: big data processing is moving from a ‘store and process’ model to a ‘stream and compute’ model for data that has time-bound value.
For example, consider a logging subsystem that records informational, warning, and error events in a live application. You would want security-related events to raise alarms immediately rather than surface in a report the next morning. If you are wondering what an architecture that realizes the ‘stream and compute’ model looks like, read on!
The Tech Stack That Enables Stream & Compute
The stack comprises four layers: an event source, a message broker, a real-time computation framework, and a store for the computed results.
(This is not an exhaustive list)
A number of components can fill each layer, and the exact choices depend on your scalability needs and your familiarity with the associated ecosystem. One possible combination is RabbitMQ as the message broker and Apache Spark as the computation framework. We have created a demonstration of real-time streaming visualization using these components.
How Does It Work?
The event source is a log generated by events from video players simultaneously playing different content in different geographical regions of the world. Each entry in the log consists of multiple parameters, three of which are:
- session ID - a unique identifier that represents a particular streaming session. There can be multiple log entries with this identifier.
- buffering event - represents a period of time for which the player was buffering, waiting for data.
- IP address - the IP address recorded when the video player posts the event to the log collection server.
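To make these parameters concrete, here is a minimal sketch of how one such log entry might be modeled. The JSON line format and the field names (`session_id`, `buffering_ms`, `ip`) are assumptions for illustration, not the demo’s actual schema:

```python
import json
from dataclasses import dataclass
from typing import Optional

@dataclass
class LogEvent:
    """One entry from the player log (illustrative field names)."""
    session_id: str               # unique per streaming session; repeats across entries
    buffering_ms: Optional[int]   # duration of a buffering event, if this entry has one
    ip_address: str               # IP recorded by the log collection server

def parse_event(line: str) -> LogEvent:
    """Parse one JSON-encoded log line into a LogEvent."""
    raw = json.loads(line)
    return LogEvent(
        session_id=raw["session_id"],
        buffering_ms=raw.get("buffering_ms"),  # absent for non-buffering entries
        ip_address=raw["ip"],
    )
```

Note that `buffering_ms` is optional: most log entries for a session are ordinary playback events, and only some of them record a buffering period.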
The requirement was to visualize the buffering events in real time, as they occurred across different geographies, so that root cause analysis could be performed in near real time.
The sequence of operations is illustrated as follows:
The active sessions are computed by counting the distinct session IDs in the Spark streaming processing window.
(Active sessions over time)
(Distribution of active sessions over country)
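The windowed distinct count can be sketched in plain Python. This stands in for the Spark streaming window computation; the tumbling-window bucketing below is the same idea, just without the distributed runtime:

```python
from typing import Dict, Iterable, Tuple

def active_sessions(events: Iterable[Tuple[float, str]], window_s: float) -> Dict[int, int]:
    """Count distinct session IDs per tumbling time window.

    `events` is an iterable of (timestamp, session_id) pairs. Each event is
    assigned to a window bucket, and the distinct session IDs per bucket are
    collected in a set, so repeated entries for one session count once.
    """
    windows: Dict[int, set] = {}
    for ts, sid in events:
        bucket = int(ts // window_s)          # index of the tumbling window
        windows.setdefault(bucket, set()).add(sid)
    return {bucket: len(sids) for bucket, sids in sorted(windows.items())}
```

For example, with a 10-second window, two entries from session “a” and one from session “b” in the first window yield an active-session count of 2 for that window.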
A geographic IP database is used to look up the ISO country code for the IP address of each recorded event.
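A sketch of that lookup, with a toy in-memory table standing in for a real geographic IP database (such as MaxMind’s GeoLite2). The CIDR ranges and country codes below are purely illustrative:

```python
import ipaddress

# Toy stand-in for a geographic IP database; real databases map millions of
# CIDR blocks to locations. These example ranges are documentation prefixes.
GEO_DB = {
    ipaddress.ip_network("203.0.113.0/24"): "US",
    ipaddress.ip_network("198.51.100.0/24"): "IN",
}

def country_for(ip: str) -> str:
    """Return the ISO country code for an IP, or 'ZZ' if unmapped."""
    addr = ipaddress.ip_address(ip)
    for net, iso in GEO_DB.items():
        if addr in net:
            return iso
    return "ZZ"  # user-assigned ISO code conventionally used for 'unknown'
```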
The buffering events are filtered from the streaming RDD, and latitude-longitude information is added to each one. The visualization uses a heat map to denote areas with a higher concentration of buffering events so that such regions “light up.” We used the OpenLayers library for this visualization.
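The filter-and-enrich step might look like the following plain-Python sketch. The centroid coordinates here are placeholders, and a real pipeline would resolve latitude-longitude per IP address rather than per country:

```python
from typing import Dict, Iterable, List, Tuple

# Illustrative per-country centroid coordinates (lat, lon); a production
# system would geolocate each event's IP individually.
COUNTRY_CENTROIDS: Dict[str, Tuple[float, float]] = {
    "US": (39.8, -98.6),
    "IN": (22.6, 79.0),
}

def heat_map_points(events: Iterable[dict]) -> List[Tuple[float, float, int]]:
    """Keep only buffering events and attach (lat, lon, weight) for the heat map."""
    points = []
    for ev in events:
        if ev.get("buffering_ms"):                # filter: buffering events only
            lat, lon = COUNTRY_CENTROIDS.get(ev["country"], (0.0, 0.0))
            points.append((lat, lon, ev["buffering_ms"]))
    return points
```

The weight (buffering duration) is what makes regions with heavy, prolonged buffering “light up” more intensely than regions with brief, isolated stalls.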
Real-Time Analytics, Visualized
As noted earlier, the components can be mixed and matched to suit your needs. Notice, however, that the demonstration architecture has no poll-wait at any stage. The closing piece of advice is thus: maintain a push model throughout the architecture.
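As a plain-Python illustration of push versus poll-wait, here is a minimal sketch where an in-process `queue.Queue` stands in for the message broker:

```python
import queue
import threading

def consumer(q: "queue.Queue", out: list) -> None:
    """Block on the queue (push model) instead of sleeping and re-polling."""
    while True:
        item = q.get()      # wakes immediately when a producer pushes an event
        if item is None:    # sentinel: stream is finished
            break
        out.append(item)

q: "queue.Queue" = queue.Queue()
received: list = []
worker = threading.Thread(target=consumer, args=(q, received))
worker.start()
for event in ("play", "buffer", "stop", None):
    q.put(event)            # producer pushes; no poll-wait anywhere in the path
worker.join()
```

The consumer never sleeps and re-checks; it is woken the moment data arrives, which is exactly the property the broker and Spark’s streaming receiver preserve end to end.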
The screenshots are taken from working, demonstrable software. Please contact us at Sayantam.Dey@3PillarGlobal.com or Dan.Greene@3PillarGlobal.com if you need real-time data visualization, and we can walk you through the live demonstration!