With rapid innovations, frequent evolutions of technologies and a rapidly growing internet population, systems and enterprises are generating huge amounts of data to the tune of terabytes and even petabytes of information. Since data is being generated in very huge volumes with great velocity in all multi-structured formats like images, videos, weblogs, sensor data, etc. from all different sources, there is a huge demand to efficiently store, process and analyze this large amount of data to make it usable.
Hadoop is undoubtedly the preferred choice for such a requirement due to its key characteristics of being reliable, flexible, economical, and a scalable solution. While Hadoop provides the ability to store this large scale data on HDFS (Hadoop Distributed File System), there are multiple solutions available in the market for analyzing this huge data like MapReduce, Pig and Hive. With the advancements of these different data analysis technologies to analyze the big data, there are many different school of thoughts about which Hadoop data analysis technology should be used when and which could be efficient.
A well-executed big data analysis provides the possibility to uncover hidden markets, discover unfulfilled customer demands and cost reduction opportunities and drive game-changing, significant improvements in everything from telecommunication efficiencies and surgical or medical treatments, to social media campaigns and related digital marketing promotions.
What is Big Data Analysis?
Big data is mostly generated from social media websites, sensors, devices, video/audio, networks, log files and web, and much of it is generated in real time and on a very large scale. Big data analytics is the process of examining this large amount of different data types, or big data, in an effort to uncover hidden patterns, unknown correlations and other useful information.
Advantages of Big Data Analysis
Big data analysis allows market analysts, researchers and business users to develop deep insights from the available data, resulting in numerous business advantages. Business users are able to make a precise analysis of the data and the key early indicators from this analysis can mean fortunes for the business. Some of the exemplary use cases are as follows:
Case Study – Stock market data
Now let’s look at a case study for analyzing stock market data. We will evaluate various big data technologies to analyze this stock market data from a sample ‘New York Stock Exchange’ dataset and calculate the covariance for this stock data and aim to solve both storage and processing problems related to a huge volume of data.
Covariance is a financial term that represents the degree or amount that two stocks or financial instruments move together or apart from each other. With covariance, investors have the opportunity to seek out different investment options based upon their respective risk profile. It is a statistical measure of how one investment moves in relation to the other.
A positive covariance means that asset returns moved together. If investment instruments or stocks tend to be up or down during the same time periods, they have positive covariance.
A negative covariance means returns move inversely. If one investment instrument tends to be up while the other is down, they have negative covariance.
This will help a stock broker in recommending the stocks to his customers.
Dataset: The sample dataset provided is a comma separated file (CSV) named ‘NYSE_daily_prices_Q.csv’ that contains the stock information such as daily quotes, Stock opening price, Stock highest price, etc. on the New York Stock Exchange.
The dataset provided is just a sample small dataset having around 3500 records, but in the real production environment there could be huge stock data running into GBs or TBs. So our solution must be supported in a real production environment.
Hadoop Data Analysis Technologies
Let’s have a look at the existing open source Hadoop data analysis technologies to analyze the huge stock data being generated very frequently.
Features and Comparison of Big Data Analysis Technologies
|Language||Algorithm of Map and Reduce Functions (Can be implemented in C, Python, Java)||PigLatin (Scripting Language)||SQL-like|
|Lines of code||More lines of code||Fewer (Around|
10 lines of PIG = 200 lines of Java)
|Fewer than MapReduce and Pig due to SQL Like nature|
|Development Time||More development effort||Rapid development||Rapid development|
|Abstraction||Lower level of abstraction (Rigid Procedural Structure)||Higher level of abstraction (Scripts)||Higher level of abstraction (SQL like)|
|Joins||Hard to achieve join functionality||Joins can be easily written||Easy for joins|
|Structured vs Semi-Structured Vs Unstructured data||Can handle all these kind of data types||Works on all these kind of data types||Deal mostly with structured and semi-structured data|
|Complex business logic||More control for writing complex business logic||Less control for writing complex business logic||Less control for writing complex business logic|
|Performance||Fully tuned MapReduce program would be faster than Pig/Hive||Slower than fully tuned MapReduce program, but faster than badly written MapReduce code||Slower than fully tuned MapReduce program, but faster than bad written MapReduce code|
Which Data Analysis Technologies should be used?
Based on the available sample dataset, it is having following properties:
Based on these criteria and comparing with the above analysis of features of these technologies, we can conclude:
So based on the above discussion, Hive seems the perfect choice for the aforementioned case study.
Problem Solution with Hive
Apache Hive is a data warehousing package built on top of Hadoop for providing data summarization, query and analysis. The query language being used by Hive is called Hive-QL and is very similar to SQL.
Since we are now done zeroing in on the data analysis technology part, now it’s time to get your feet wet with deriving solutions for the mentioned case study.
Follow the steps mentioned in my previous blog How to Configure Hive On Cloudera:
Use ‘create table’ Hive command to create the Hive table for our provided csv dataset:
hive> create table NYSE (exchange String,stock_symbol String,stock_date String,stock_price_open double, stock_price_high double, stock_price_low double, stock_price_close double, stock_volume double, stock_price_adj_close double) row format delimited fields terminated by ‘,’;
This will create a Hive table named ‘NYSE’ in which rows would be delimited and row fields will be terminated by commas. This schema will be created into the embedded derby database as configured into the Hive setup. By default, Hive stores metadata in an embedded Apache Derby database, but can be configured for other databases like MySQL, SQL server, Oracle, etc.
Use the following Hive command to load the CSV data file into Hive table:
hive> load data local inpath ‘/home/cloudera/NYSE_daily_prices_Q.csv’ into table NYSE;
This will load the dataset from the mentioned location to the Hive table ‘NYSE’ as created above but all this dataset will be stored into the Hive-controlled file system namespace on HDFS, so that it could be batch processed further by MapReduce jobs or Hive queries.
We can calculate the Covariance for the provided stock dataset for the inputted year as below using the Hive select query:
select a.STOCK_SYMBOL, b.STOCK_SYMBOL, month(a.STOCK_DATE),
(AVG(a.STOCK_PRICE_HIGH*b.STOCK_PRICE_HIGH) – (AVG(a.STOCK_PRICE_HIGH)*AVG(b.STOCK_PRICE_HIGH)))
from NYSE a join NYSE b on
a.STOCK_DATE=b.STOCK_DATE where a.STOCK_SYMBOL<b.STOCK_SYMBOL and year(a.STOCK_DATE)=2008
Group by a.STOCK_SYMBOL, b. STOCK_SYMBOL, month(a.STOCK_DATE);
This Hive select query will trigger the MapReduce job as below:
The covariance results after the above stock data analysis, are as follows:
The covariance has been calculated between two different stocks for each month on a particular date for the available year.
From the covariance results, stock brokers or fund managers can provide below recommendations:
So similarly we can analyze more use cases of big data and can explore all possible solutions to solve that use case and then by the comparison chart, the final best solution can be narrowed down.
So this case study solves the following two important goals of big data technologies:
By storing the huge stock data into HDFS, the solution provided is much more robust, reliable, economical, and scalable. Whenever data size is increasing, you can just add some more nodes, configure into Hadoop and that’s all. If sometime any node is down, then even other nodes are ready to handle the responsibility due to data replication.
By managing the Hive schema into embedded database or any other standard SQL database, we are able to utilize the power of SQL as well.
Since Hive schema is created on a standard SQL database, you get the advantage of running SQL queries on the huge dataset also and are able to process GBs or TBs of data with simple SQL queries. Since actual data resides into HDFS, so these Hive SQL queries are being converted into MapReduce jobs and these parallelized map reduce jobs process these huge volume of data and achieve scalable and fault tolerant solutions.