Real Time Analytics & Visualization with Apache Spark

I’m sure you’ve heard that fast data is the new black. If you haven’t, here’s the memo: big data processing is moving from a ‘store and process’ model to a ‘stream and compute’ model for data whose value is time-bound.

For example, consider a logging subsystem that records all kinds of informational, warning, and error events in a live application. You would want security-related events to raise alarms immediately, rather than turning up in a report the next morning. If you are wondering what an architecture that realizes the ‘stream and compute’ model looks like, read on!

The Tech Stack That Enables Stream & Compute

The stack comprises four layers: an event source, a message broker, a real-time computation framework, and a computation store for the computed results.

(Figure: the stream-and-compute stack, with example components at each layer; not an exhaustive list)

A number of components can fill each layer, and the exact choices depend on your scalability needs and your familiarity with the associated ecosystem. One possible combination is RabbitMQ as the message broker and Apache Spark as the computation framework. We have created a demonstration of real-time streaming visualization using these components.
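For concreteness, here is a minimal sketch of how Spark Streaming might consume the log stream from RabbitMQ using a custom receiver built on the standard RabbitMQ Java client. The host, queue name, and batch interval are placeholder assumptions; treat this as one possible wiring, not the exact code behind the demonstration.

```scala
import com.rabbitmq.client.{AMQP, ConnectionFactory, DefaultConsumer, Envelope}
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.receiver.Receiver

// A simple receiver that hands each RabbitMQ message to Spark as a String.
class RabbitMQReceiver(host: String, queue: String)
    extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

  override def onStart(): Unit = {
    // Consume on a background thread so onStart() returns immediately.
    new Thread("rabbitmq-consumer") {
      override def run(): Unit = {
        val factory = new ConnectionFactory()
        factory.setHost(host)
        val channel = factory.newConnection().createChannel()
        channel.queueDeclare(queue, true, false, false, null)
        channel.basicConsume(queue, true, new DefaultConsumer(channel) {
          override def handleDelivery(tag: String, envelope: Envelope,
                                      props: AMQP.BasicProperties,
                                      body: Array[Byte]): Unit =
            store(new String(body, "UTF-8"))   // push the raw log line into Spark
        })
      }
    }.start()
  }

  override def onStop(): Unit = ()   // connection teardown elided for brevity
}

object StreamingApp {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("player-log-analytics")
    val ssc  = new StreamingContext(conf, Seconds(10))   // 10-second batches
    val logLines = ssc.receiverStream(new RabbitMQReceiver("localhost", "player-logs"))
    logLines.print()   // replaced by the parsing and analytics shown later
    ssc.start()
    ssc.awaitTermination()
  }
}
```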

How Does It Work?

The event source is a log of events generated by video players simultaneously playing different content in different geographic regions of the world. Each entry in the log consists of multiple parameters, three of which are of interest here (a parsing sketch follows the list):

  1. session ID – a unique identifier that represents a particular streaming session. There can be multiple log entries with this identifier.
  2. buffering event – represents a period of time for which the player was buffering, waiting for data.
  3. IP address – the IP address recorded when the video player posted the event to the log collection server.
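For illustration, parsing a log entry into a small record might look like the sketch below. The delimiter and field names are assumptions, and the real log carries more parameters than the three listed.

```scala
// Hypothetical shape of a parsed log entry (only the fields used in this post).
case class PlayerEvent(sessionId: String, isBuffering: Boolean, ip: String)

// A minimal parser, assuming a pipe-delimited line such as:
//   "f3a9c2|BUFFERING|203.0.113.42"
def parseEvent(line: String): Option[PlayerEvent] =
  line.split('|') match {
    case Array(sessionId, eventType, ip) =>
      Some(PlayerEvent(sessionId, eventType == "BUFFERING", ip))
    case _ => None   // skip malformed entries
  }

// Applied to the DStream of raw lines from the previous sketch:
// val events = logLines.flatMap(line => parseEvent(line).toSeq)
```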

The need was to visualize buffering events as they occurred in different geographies, so that root cause analysis could be performed in near real time.

The sequence of operations is illustrated as follows:

(Figure: the sequence of operations in the demonstration)

The active sessions are computed by counting the distinct session IDs in the Spark Streaming processing window.
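A sketch of that computation over the DStream of parsed events; the window and slide durations are assumptions.

```scala
import org.apache.spark.streaming.{Minutes, Seconds}
import org.apache.spark.streaming.dstream.DStream

// Count the distinct session IDs seen in the last minute, refreshed every 10 seconds.
// `events` is the DStream[PlayerEvent] built in the earlier sketches.
def activeSessionCounts(events: DStream[PlayerEvent]): DStream[Long] =
  events
    .map(_.sessionId)
    .window(Minutes(1), Seconds(10))   // the streaming processing window
    .transform(_.distinct())           // unique session IDs within the window
    .count()                           // one count per 10-second slide

// activeSessionCounts(events).print()   // or push each value to the computation store
```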

(Figure: active sessions over time)

(Figure: distribution of active sessions by country)

A geographic IP database is used to look up the ISO country code for the IP address recorded with each event.
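The post does not name the database; this sketch assumes MaxMind's free GeoLite2 Country database and its GeoIP2 Java client as one common way to do the lookup.

```scala
import java.io.File
import java.net.InetAddress
import com.maxmind.geoip2.DatabaseReader

// Build a reader over the on-disk GeoLite2 Country database (path is an assumption).
val countryReader = new DatabaseReader.Builder(new File("GeoLite2-Country.mmdb")).build()

def countryIsoCode(ip: String): Option[String] =
  try {
    Some(countryReader.country(InetAddress.getByName(ip)).getCountry.getIsoCode)
  } catch {
    case _: Exception => None   // private or unknown addresses will not resolve
  }

// e.g. events.map(e => (countryIsoCode(e.ip).getOrElse("??"), e.sessionId)) feeds the
// per-country distribution; in a real job, build the reader inside mapPartitions so
// it is not serialized with the closure.
```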

The buffering events are filtered from the streaming RDD, and latitude-longitude information is added to each one. The visualization uses a heat map so that areas with a larger concentration of buffering events “light up.” We used the OpenLayers library for this visualization.
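A sketch of the Spark side of that step, assuming a GeoLite2 City database for the coordinates; the OpenLayers heat map itself runs in the browser and is not shown.

```scala
import java.io.File
import java.net.InetAddress
import com.maxmind.geoip2.DatabaseReader
import org.apache.spark.streaming.dstream.DStream

// Keep only the buffering events and attach (latitude, longitude) for the heat map.
def bufferingPoints(events: DStream[PlayerEvent]): DStream[(Double, Double)] =
  events
    .filter(_.isBuffering)                  // buffering events only
    .mapPartitions { part =>
      // Build the reader per partition; DatabaseReader is not serializable.
      val reader = new DatabaseReader.Builder(new File("GeoLite2-City.mmdb")).build()
      part.flatMap { e =>
        try {
          val loc = reader.city(InetAddress.getByName(e.ip)).getLocation
          Some((loc.getLatitude.doubleValue, loc.getLongitude.doubleValue))
        } catch {
          case _: Exception => None         // drop addresses that cannot be located
        }
      }
    }
```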

Real-Time Analytics, Visualized

(Figure: heat map of buffering events)

As noted earlier, the components can be mixed and matched to suit your needs. Whatever you choose, note that the demonstration architecture has no poll-wait at any stage. The closing piece of advice is thus: maintain a push model throughout the architecture.
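One way to keep that push model on the output side is to have Spark publish each computed result back to a broker the dashboard already subscribes to (for example, over RabbitMQ's Web STOMP plugin or websockets), rather than writing to a store the dashboard has to poll. The exchange name, host, and JSON shape below are assumptions for illustration.

```scala
import com.rabbitmq.client.ConnectionFactory
import org.apache.spark.streaming.dstream.DStream

// Publish each active-session count to a fanout exchange as it is computed,
// so the browser receives updates without polling.
def publishActiveSessions(activeSessions: DStream[Long]): Unit =
  activeSessions.foreachRDD { rdd =>
    rdd.collect().headOption.foreach { count =>   // one small value per window slide
      val factory = new ConnectionFactory()
      factory.setHost("localhost")
      val connection = factory.newConnection()
      val channel = connection.createChannel()
      channel.exchangeDeclare("dashboard", "fanout", true)
      channel.basicPublish("dashboard", "", null,
        s"""{"activeSessions": $count}""".getBytes("UTF-8"))
      channel.close()
      connection.close()
    }
  }
```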

The screenshots are taken from working, demonstrable software. Please contact us at Sayantam.Dey@3PillarGlobal.com or Dan.Greene@3PillarGlobal.com if you have a need for real-time data visualization, and we can walk you through the actual demonstration!

Sayantam Dey

Senior Director of Engineering

Sayantam Dey is the Senior Director of Engineering at 3Pillar Global, working out of our office in Noida, India. He has been with 3Pillar for ten years, delivering enterprise products and building frameworks for accelerated software development and testing in various technologies. His current areas of interest are data analytics, messaging systems and cloud services. He has authored the ‘Spring Integration AWS’ open source project and contributes to other open source projects such as SocialAuth and SocialAuth Android.

Dan Greene

Director of Cloud Services

Dan Greene is the Director of Cloud Services at 3Pillar Global. Dan has more than 20 years of software design and development experience, with software and product architecture experience in areas including eCommerce, B2B integration, geospatial analysis, SOA architecture, and Big Data; he has focused the last few years on cloud computing. He is an AWS Certified Solutions Architect who worked at Oracle, ChoicePoint, and Booz Allen Hamilton prior to 3Pillar. He is also a father and amateur carpenter, and he runs obstacle races including Tough Mudder.

2 Responses to “Real Time Analytics & Visualization with Apache Spark”
  1. MAQBOOL:

    Hello Sir,
    The article is really interesting. I am a Ph.D. student and my project is very similar. I want to collect log data from a PLC (Programmable Logic Controller) and visualize it in real time via a web interface. I want to use the data for OEE (overall equipment efficiency) and machine fault detection using machine learning algorithms. Can you please suggest which tools from the above four layers would be most suitable for my project? Thanks in advance for taking your precious time.

    Kind Regards,

    Maqbool
    Nanjing University, China

    • Sayantam Dey:

      You will need all four layers. However, your event source will be the PLC, and the data store can be either Cassandra or a relational database with a flat schema, because I think your logs will be floating-point or binary values.

