Big Data Analysis Using Hadoop and Spark
7-19-2017
Recommended Citation
Murthy, Adithya K., "Big Data Analysis Using Hadoop and Spark" (2017). Electronic Theses and
Dissertations. 1700.
https://digitalcommons.memphis.edu/etd/1700
BIG DATA ANALYSIS USING HADOOP AND SPARK
by
Adithya K Murthy
A Thesis
Master of Science
August 2017
Copyright © 2017 Adithya K Murthy
All rights reserved
Acknowledgment
Foremost, I would like to cordially thank my advisor, Dr. Dipankar Dasgupta, of the University of Memphis, for guiding me through all efforts in this research project. I am
very grateful to him for giving me an opportunity to work with him and his team,
believing in my work and in me, and advising me like a parent. Besides my advisor, I
would like to equally thank my committee members: Dr. Deepak Venugopal and Dr.
Faith Sen for their valuable time, encouragement, and insightful comments.
My sincere thanks and regards to visiting researcher Dr. Mike Nolen for guiding me in
every part of my research work, providing timely advice and helping me arrive at the
discussed results. I would also like to thank all my colleagues from the Center for
Information Assurance and Fogelman College of Business and Economics for their
support and encouragement. I also appreciate the direct and indirect help of my professors and colleagues at the University of Memphis.
Lastly, I would like to thank my mother Mrs. Shanthi K Murthy and my father Mr.
R Krishna Murthy for all their love and support throughout my life.
Abstract
Adithya K Murthy. MS. The University of Memphis. August 2017. Big data
analysis using Hadoop and Spark. Major Professor: Dipankar Dasgupta, Ph. D.
Big data analytics is being used more widely every day to drive improvements in a variety of fields. Processing big data and obtaining fast, secure, and accurate results is a challenging task. Hadoop and Spark are two technologies which deal with large amounts of data and process them in a distributed environment using parallel computing. Both use the MapReduce technique to process large data sets. Hadoop's disk-based iterative processing limits the speed at which data can be processed, whereas Spark uses in-memory cluster computing and data storage to enhance performance on different datasets.
A series of experiments was conducted on both Hadoop and Spark with different data sets. To analyze the performance variation between the two frameworks, a comparative analysis was performed on the results obtained from both Hadoop and Spark. An experiment based on financial data (NASDAQ TotalView-ITCH) was also performed to extend the analysis to a real-world use case.
Table of Contents
Introduction
Motivation
Limitation
Roadmap
Spark
Environment setup
Dataset-1: Wiki Dump Data
Experiment
Conclusions
References
Introduction
Big data is a collection of large datasets that cannot be processed using traditional computing techniques. An example of big data might be petabytes (1,024 terabytes) or exabytes (1,024 petabytes) of data consisting of billions to trillions of records of millions of people from different sources (e.g., Web, sales, customer contact centers, social media, mobile data, and so on). The data is typically loosely structured, often incomplete, and not readily accessible. The importance of big data does not revolve around how much data we have, but what we do with it and how we analyze it. It can be useful for various purposes such as cost and time reductions, new product development, and smarter decision making.
Many organizations across various industries benefit from the analysis of big data. Big data techniques are used to detect and prevent cyberattacks. They are also used to improve machine or device performance. For example, big data analytics is used to operate
Google's innovative self-driving car (AUGUR, 2015). The data center of CERN (the European Organization for Nuclear Research) has 65,000 processors to analyze its 30 petabytes of data, distributed across more than 150 data centers worldwide, for science and research purposes (CERN, 2017). Big data analytics is also used in the public health industry; for example, it is helpful in decoding DNA and predicting various disease patterns. People also benefit from the large amounts of data generated by wearable devices such as Fitbit trackers or smart watches, which can be analyzed to track calories and fitness patterns of the human body. Retail companies, advertising companies, and financial institutions use big data to analyze customer behavior and improve decision making.
Big data analysis is a constructive process which involves different stages such as collection, storage, processing, analysis, and visualization. Data can be collected from logs, devices, and other sources and stored on various platforms such as servers and data centers. The stored raw data is then processed and converted into a usable format, and the resulting data sets are kept for data analysis, visualization, statistical prediction, and so on (AWS, 2017).
Problem Statement
Data processing is a crucial stage in big data analysis. As discussed above, big data cannot be processed using traditional single-node computing techniques. We need to scan terabytes (or even more) of data within seconds, which is only possible when the data is processed with a high degree of parallelism. Various tools are available for processing data in a distributed environment based on parallel computing methods. Hadoop and Spark are framework tools which handle such data through distributed processing of large data sets across clusters of computers in parallel.
Hadoop and Spark both use MapReduce techniques to process different data sets, but the iterative processing in the Hadoop framework slows down data processing. Spark takes MapReduce to the next level with less expensive shuffles and capabilities such as in-memory data storage, with which performance can be several times faster than Hadoop. We analyze different data sets on both Hadoop and Spark, comparing the efficiency of the results and the performance, to determine which of the two is the better framework for processing large data sets.
Motivation
Big data analytics is being used on a large scale in recent times, and new methods of processing data continue to emerge. It is a challenging task to process big data and get faster, secure, and accurate results. Hadoop is one of the solutions which processes big data in a parallel manner. However, Hadoop has certain disadvantages, such as large I/O overhead, since the data being processed depends on external disk storage. To overcome the disadvantages of Hadoop, Apache Spark was introduced, which promotes in-memory processing. Spark is a faster cluster computing technique designed to operate on top of the Hadoop Distributed File System (HDFS); it uses the MapReduce model to perform different computations, including stream processing and interactive querying. A thorough study of both technologies is essential to understand how data can be processed and analyzed. To arrive at this result, we perform experiments with different datasets on the Hadoop and Spark frameworks and establish a clear comparison between the two.
Limitation
In this study, we have chosen only five different types of data sets. We mainly focused on batch data and on performance in terms of speed and efficiency. We used Hadoop batch processing and Spark Core to process the data sets; Spark also provides Spark Streaming and MLlib for real-time stream processing and machine learning, which were outside the scope of this work.
Roadmap
In the next section, we present a brief introduction to Hadoop and Spark components. The section “Comparison of Hadoop and Spark” illustrates several framework variations between Hadoop and Spark. The next two sections, “Cluster Configuration” and “Experiments and Results,” list all of our observations and findings. Finally, in the last two sections, the thesis is summarized, highlighting future avenues in this line of research.
Hadoop
Apache Hadoop is an open source software project which enables distributed processing of large structured, semi-structured, and unstructured data sets across clusters of computers. It provides a framework in which an application is broken down into numerous small chunks, called fragments or blocks, that can be run on any node in the cluster (Bappalige, 2014). It is designed with a very high degree of fault tolerance. Hadoop makes it possible to run applications on systems with multiple nodes involving thousands of terabytes of data. Hadoop is designed to process large volumes of information by connecting different computers together so that they can work in parallel.
Hadoop Framework
Apache Hadoop is an open-source software framework for storage and large-scale data
processing on the clusters. Hadoop is designed based on the concept of moving the code to data
instead of data to code, following a write-once, read-many model with immutability of the written data (Mohammed, Far & Naugler, 2014). The Hadoop framework consists of the following modules (Hadoop, 2014):
• Hadoop Common – the common libraries and utilities needed by the other Hadoop modules.
• Hadoop Distributed File System (HDFS) – a distributed file system which stores data across the nodes of the cluster.
• Hadoop YARN – a resource-management platform responsible for managing compute resources in the cluster and scheduling applications.
• Hadoop MapReduce – a programming model for large-scale data processing which divides applications into smaller components and distributes them across numerous machines.
MapReduce is a method for distributing a task across multiple nodes. Each node processes the data stored on that node to the extent possible. A running MapReduce job consists of map and reduce tasks distributed across the nodes of the cluster. The advantages of using MapReduce, which runs over a distributed infrastructure of CPU and storage, are automatic parallelization and distribution of data in blocks across the infrastructure; built-in deployment, monitoring, and security capabilities; and a clear abstraction for programmers. Most MapReduce programs are written in Java, but they can also be written in any scripting language using the Streaming API of Hadoop (Mohammed, Far & Naugler, 2014).
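To make the programming model concrete, here is a minimal word-count example for the Hadoop Streaming API mentioned above; it is an illustrative sketch rather than the code used in this thesis, and the file names and paths are hypothetical.

#!/usr/bin/env python
# mapper.py -- emits one (word, 1) pair per input line for Hadoop Streaming.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(word + "\t1")

#!/usr/bin/env python
# reducer.py -- sums counts per word; Hadoop sorts the mapper output by key,
# so all lines for the same word arrive consecutively.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(current_word + "\t" + str(current_count))
        current_word, current_count = word, int(count)
if current_word is not None:
    print(current_word + "\t" + str(current_count))

Such a job could then be submitted with the streaming jar, for example: hadoop jar hadoop-streaming.jar -files mapper.py,reducer.py -mapper mapper.py -reducer reducer.py -input /user/demo/in -output /user/demo/out (the jar name and HDFS paths vary by installation).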
A Hadoop cluster combines the Hadoop Distributed File System (HDFS) for storage and the MapReduce programming framework for computational capabilities. The cluster runs jobs controlled by the master node, known as the Name Node, which is responsible for chunking/partitioning the data, cloning it, sending the data to the distributed computing nodes (Data Nodes), monitoring the cluster status, and collecting/aggregating the results (Mohammed, Far & Naugler, 2014). Figure 2 depicts the functioning of the Name Node and Data Nodes and the read and write operations on HDFS.
Figure 2. Read and Write Operations on HDFS (Singh & Ali, 2016)
HDFS is a scalable, fault-tolerant, highly available, and flexibly accessible file system. HDFS splits data into one or more blocks (chunks of 64 MB by default). These blocks are stored on a set of slaves, or data nodes, so that parallel writes or reads can be performed even on a single file. Multiple copies of each block are stored according to the replication factor, making the platform fault tolerant. HDFS stores each file as a sequence of blocks, with each block stored as a separate file in the local file system (such as NTFS). If one copy is not accessible or gets corrupted, the data can be read from another copy (Singh & Ali, 2016). Figure 3 describes the HDFS architecture in detail.
Figure 3. HDFS Architecture (Hadoop, 2008)
Spark
Apache Spark is an open-source, distributed processing system commonly used for big data workloads. Apache Spark utilizes in-memory caching and optimized execution for faster performance, and it supports general batch processing, streaming analytics, machine learning, graph databases, and ad hoc queries (AWS, 2017). Spark is a faster cluster computing technique designed on top of the Hadoop Distributed File System (HDFS); it uses the MapReduce model to perform different computations, including stream processing and query applications. The main advantage of using Spark in relation to Hadoop is that it supports different tools in a single application. Figure 4 depicts the difference between the Hadoop and Spark MapReduce operations.
Figure 4. Difference between Hadoop and Spark Map Reduce Phases
Components of Spark
Spark provides different libraries which help in analyzing data in an effective manner. Figure 5 depicts the different components provided by Spark for different analytic workloads:
• Spark SQL – an important component installed above Spark Core; it provides a special data abstraction called SchemaRDD which is helpful in working with structured and semi-structured data.
• Spark Streaming – a component which uses Spark Core's fast scheduling capability to perform streaming analysis. The data is divided into smaller fragments (mini-batches) on which RDD transformations can be applied.
• MLlib (Machine Learning Library) – a distributed machine learning framework built on top of Spark Core.
Spark itself is built around the concept of a resilient distributed dataset (RDD), a fault-tolerant collection of elements that can be operated on in parallel. An RDD can be created by parallelizing an existing collection in the driver program or by referencing a dataset in an external storage system, such as a shared filesystem or HDFS (Spark, 2017).
The RDD is the basic data structure in Spark. RDDs perform in-memory computations on large clusters. Every dataset present in an RDD is separated into logical partitions, which can be computed on multiple nodes in the cluster. Objects from Java, Scala, and Python, including user-defined objects, can be stored in an RDD (Zaharia, Chowdhury, Das, Dave, McCauley, Franklin, Shenker & Stoica, 2012). Figure 6 illustrates how Spark and Hadoop can run in the same cluster.
Figure 6. Spark and Hadoop in the Same Cluster (Ramel, 2015)
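As an illustration of the RDD abstraction, the following minimal PySpark sketch creates an RDD from a text file in HDFS, applies lazy transformations, caches the result in memory, and reuses it across two actions; the application name and input path are hypothetical and are not taken from the thesis experiments.

from pyspark import SparkConf, SparkContext

# The application name and file path below are illustrative only.
conf = SparkConf().setAppName("rdd-demo")
sc = SparkContext(conf=conf)

# Create an RDD from a (hypothetical) text file stored in HDFS.
lines = sc.textFile("hdfs:///user/demo/sample.txt")

# Transformations are lazy; nothing is computed yet.
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))

# Persist the RDD in memory so repeated actions do not recompute it.
counts.cache()

print(counts.count())   # first action: triggers the computation
print(counts.take(5))   # second action: served from the in-memory copy

sc.stop()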
Spark does not come with its own file management system and must always be integrated with another file system. Spark can be installed on top of Hadoop so that it uses the HDFS provided by Hadoop. This can be done in three ways (Stoica, 2014), as shown in Figure 7:
1. Standalone mode: Spark is installed on top of HDFS, with storage explicitly allocated for HDFS. In this type of architecture, Spark and MapReduce run side by side to execute all the tasks on the servers (the cluster).
2. Hadoop YARN (Yet Another Resource Negotiator): In this type of architecture, Spark is executed on Hadoop YARN without any pre-installation. Through this, Spark can be integrated into the Hadoop environment, where the different components run on top of the same stack.
3. SIMR (Spark in MapReduce): In this mode, Spark jobs are launched from within MapReduce, in addition to the standalone deployment. The user can program in the Spark shell and run multiple tasks.
Comparison of Hadoop and Spark
The following list describes the comparison between the Hadoop and Spark technologies (Noyes, 2015):
1. Hadoop is essentially a distributed data infrastructure which distributes huge amounts of data across multiple commodity servers. This implies that a large number of expensive, custom systems need not be purchased and maintained. Spark operates on the distributed data present in Hadoop HDFS, but it is not possible for Spark alone to store this distributed data. Spark does not come with a specific file management system and hence must be integrated with another file system for storage, whereas Hadoop has its own distributed file system and execution engine, known as HDFS and MapReduce respectively.
2. Spark is comparatively faster than the MapReduce technique because it processes the data with a different mechanism. MapReduce works in a strict sequence of steps, as the two workflows below show (see also the sketch following this list):
• The workflow of Hadoop MapReduce: read data from the cluster, perform an operation, write the results to the cluster, read the updated data from the cluster, perform the next operation, write the next results to the cluster, and so on.
• The workflow of Spark: read data from the cluster, perform all the required analytic operations, write the results to the cluster, and end the process. The intermediate results are kept in memory, which makes Spark considerably faster.
3. Fault recovery: In Hadoop, there is a replication factor of 3 for all the data stored on HDFS, so three copies of the entire dataset are kept on multiple nodes. Spark instead uses RDDs, which hold the data in memory and can be rebuilt from their lineage if a partition is lost.
4. Hadoop recovers from failures naturally, since the data is written onto the disk after each operation. Spark has built-in data objects stored in RDDs across the cluster; RDDs are kept in memory, and complete recovery from faults and failures is still available by recomputing lost partitions.
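To illustrate point 2 above, the following hedged PySpark sketch runs several passes over a cached RDD; each pass reuses the in-memory data, whereas an equivalent chain of MapReduce jobs would write and re-read intermediate results on HDFS between steps. The input path and the computation itself are only illustrative.

from pyspark import SparkContext

sc = SparkContext(appName="iterative-demo")

# Load numeric records once and keep them in memory for reuse.
values = sc.textFile("hdfs:///user/demo/numbers.txt").map(float).cache()

for i in range(5):
    # Every pass runs over the cached in-memory RDD instead of re-reading
    # HDFS, unlike a chain of MapReduce jobs that writes and reads
    # intermediate results on disk between steps.
    count_above = values.filter(lambda v: v > i).count()
    print("values greater than", i, ":", count_above)

sc.stop()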
Cluster Configuration
We performed a series of experiments by setting up a cluster environment to test different data sets, and examined how much faster Spark provides results in comparison with Hadoop on the cluster that we set up.
Environment setup
The following details explain the cluster configuration we considered for the experiments.
• Brand: Dell
• OS: CentOS release 6.7 (Final)
Dataset-1: Wiki Dump Data
This dataset consists of a complete copy of the Wikipedia database backup in the form of wiki text source and metadata embedded in XML. It contains the complete dump of Wikipedia pages for research purposes. The dump is available in multiple formats, and our dump file is in XML format, which is semi-structured data. The dump file contains the page title, revision dates, timestamps, page authors, and content. The size of this dump is about 119 GB (uncompressed XML and text) and covers database backups from 2002 to 2016. We tried to analyze different factors from this data, such as the number of pages, redirected titles, authors, and page revisions over time. We tested this data on both the Hadoop and Spark clusters and implemented a MapReduce application to extract these statistics.
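The exact MapReduce code is not reproduced here; the following hedged PySpark sketch shows the kind of counting performed on the dump. It relies on the fact that in a MediaWiki XML dump the <page>, <redirect .../>, and <timestamp> elements each appear on their own lines, and it uses a hypothetical input path.

import re
from pyspark import SparkContext

sc = SparkContext(appName="wiki-dump-stats")
lines = sc.textFile("hdfs:///user/demo/enwiki-dump.xml")

# Pages and redirected titles can be counted from their XML tags.
num_pages = lines.filter(lambda l: "<page>" in l).count()
num_redirects = lines.filter(lambda l: "<redirect" in l).count()

# Revisions per month, taken from <timestamp>2016-03-24T18:00:04Z</timestamp> lines.
ts_re = re.compile(r"<timestamp>(\d{4})-(\d{2})-")

def month_of(line):
    m = ts_re.search(line)
    return [(m.group(1) + "-" + m.group(2), 1)] if m else []

revisions_per_month = lines.flatMap(month_of).reduceByKey(lambda a, b: a + b)

print(num_pages, num_redirects)
print(sorted(revisions_per_month.collect())[:12])
sc.stop()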
Figure 9. Analysis of Wiki Dump Data Results
The output extracted from the Wiki dump data consists of 39,260,948 pages, 8,896,619 redirected titles, and 38,306,114 users. The graph in Figure 9 shows the number of pages revised by users, which increased on a monthly basis over the years 2002 to 2016. The points highlighted in red indicate the total number of authors increasingly involved in updating specific pages; this analysis was verified against the output file by referring to the total number of authors for each month (Figure 9). We also analyzed the time taken to process this data on both the Hadoop and Spark clusters. Table 1 describes the time taken to process the Wiki dump data on both Spark and Hadoop.
Dataset-2: Web Log Data
Analyzing web log files has become an important task for e-commerce companies to predict customer behavior and to improve their business. For every operation on the web, log entries are generated every second, and this continuous streaming of logs results in huge amounts of data. For the experiment, we considered sample web log data to process in both the Spark and Hadoop environments. The log format that we have chosen is described below:
• 127.0.0.1 – the IP address of the client which made the request to the server.
• GET/POST – the request method sent from the client, indicating how the information transfer was made.
• "Mozilla/4.08 [en] (Win98; I; Nav)" – the user-agent string, i.e., the information the client browser reports about itself.
We collected 115 GB of sample web log data and implemented a MapReduce application on both Spark and Hadoop which calculates the total number of hits and the IP addresses accessing the website. We extracted the desired data from Hadoop and Spark and analyzed it by plotting a graph. Figure 10 depicts the results of processing the web log data on the cluster.
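A hedged sketch of the hit-counting step is shown below; it assumes Common Log Format lines with the fields described above and the request time written as [day/month/year:hour:minute:second zone], and it uses a hypothetical input path rather than the actual 115 GB data set.

import re
from pyspark import SparkContext

sc = SparkContext(appName="weblog-hits")
logs = sc.textFile("hdfs:///user/demo/access_log")

# e.g. 127.0.0.1 - - [10/Oct/2016:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326
line_re = re.compile(r'^(\S+) \S+ \S+ \[[^:]+:(\d{2})')

def parse(line):
    # Emit (ip, hour) pairs; skip malformed lines.
    m = line_re.match(line)
    return [(m.group(1), int(m.group(2)))] if m else []

records = logs.flatMap(parse)

hits_per_hour = records.map(lambda r: (r[1], 1)).reduceByKey(lambda a, b: a + b)
hits_per_ip = records.map(lambda r: (r[0], 1)).reduceByKey(lambda a, b: a + b)

print(sorted(hits_per_hour.collect()))
print(hits_per_ip.take(10))
sc.stop()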
Figure 10. Number of Hits for Each Hour of the Day from the Web Log Data (x-axis: hours of the day, 0–23; y-axis: number of hits)
We also analyzed the time taken to process this data on the Hadoop and Spark clusters. Table 2 describes the time taken to process the web log data on both frameworks.
Dataset-3: Weather Data
The NCDC (National Climatic Data Center) provides access to daily data from weather stations across the U.S. The NCDC has weather data collected from weather sensors dating back to 1901. We considered sample data for one city (Crossville) from the NCDC for experimenting with weather analysis using MapReduce. We implemented an application which analyzes the weather data for a particular city and reports the number of days which are hot or cold for that city. Two sample records from the raw data are shown below:
0067011990999992015051507004+68750+023550FM12+038299999V0203301N00671220001CN9999999N9+00001+99999.
0043012650999992016032418004+62300+010750FM-12+048599999V0202701N00461220001CN0500001N9-00785+99999.
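As a hedged illustration of how such records can be classified, the sketch below assumes the classic NCDC fixed-width record layout used in common Hadoop weather examples, in which the observation date and the signed air temperature (in tenths of a degree Celsius) sit at fixed character offsets; the thresholds and input path are hypothetical and are not the ones used in the thesis.

from pyspark import SparkContext

sc = SparkContext(appName="weather-hot-cold")
records = sc.textFile("hdfs:///user/demo/ncdc/crossville.txt")

HOT_C, COLD_C = 30.0, 0.0     # illustrative thresholds in degrees Celsius

def classify(line):
    # Assumed classic NCDC layout: observation date at offsets 15-22,
    # signed air temperature (tenths of deg C) at offsets 87-91,
    # quality code at offset 92.
    if len(line) < 93:
        return []
    date = line[15:23]
    raw_temp = int(line[87:92])
    quality = line[92:93]
    if raw_temp == 9999 or quality not in "01459":
        return []                     # missing or suspect reading
    temp_c = raw_temp / 10.0
    if temp_c >= HOT_C:
        return [(date, "hot")]
    if temp_c <= COLD_C:
        return [(date, "cold")]
    return []

days = records.flatMap(classify).distinct()
print(days.map(lambda d: (d[1], 1)).reduceByKey(lambda a, b: a + b).collect())
sc.stop()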
We collected approximately one year of Crossville city data from the NCDC and processed it on Hadoop and Spark. We implemented an application which runs over the data and extracts the cold days and hot days in the year. Figure 11 depicts the analysis of the results from the weather data.
Figure 11. Hot and Cold Days Extracted from the Weather Data (x-axis: days; legend: Cold-Day, Hot-Day)
We also analyzed the time taken to process this data on the Hadoop and Spark clusters. Table 3 describes the time taken to process the weather data on both frameworks.
Table 3. Time Taken to Process Weather Data on Both Frameworks (80 GB)
Dataset-4: Matrix Multiplication
Matrix multiplication has many wide applications, but running and processing large matrices is difficult with a single processor. Using distributed systems and parallel computing can enhance the speed of executing these computations. For example, in research on cancer tissues, matrix multiplication is very important for comparing and analyzing abnormal cells, and the matrices involved are very large. In these scenarios Spark and Hadoop play a significant role.
We considered 100×100 matrices to test the multiplication operation on the data sets. Here we show a small example with 2×5 and 5×2 matrices, whose multiplication is feasible. The input file format lists each matrix entry as a (row, column, value) triple, for example:
0,0,0.0
0,1,1.0
1,2,7.0
1,3,8.0
1,4,9.0
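The exact MapReduce implementation is not included in the text; the following is a hedged PySpark sketch of one standard way to multiply two sparse matrices stored as such triples, assuming hypothetical input files a.txt and b.txt holding matrices A (m×n) and B (n×p).

from pyspark import SparkContext

sc = SparkContext(appName="matrix-multiply")

def load(path):
    # Each line is "row,col,value".
    return (sc.textFile(path)
              .map(lambda s: s.split(","))
              .map(lambda t: (int(t[0]), int(t[1]), float(t[2]))))

A = load("hdfs:///user/demo/a.txt")   # entries (i, k, a_ik)
B = load("hdfs:///user/demo/b.txt")   # entries (k, j, b_kj)

# Key both matrices by the shared inner dimension k, join, multiply,
# and sum the partial products for every output cell (i, j).
a_by_k = A.map(lambda t: (t[1], (t[0], t[2])))       # (k, (i, a_ik))
b_by_k = B.map(lambda t: (t[0], (t[1], t[2])))       # (k, (j, b_kj))

C = (a_by_k.join(b_by_k)
           .map(lambda kv: ((kv[1][0][0], kv[1][1][0]), kv[1][0][1] * kv[1][1][1]))
           .reduceByKey(lambda x, y: x + y))

for (i, j), v in sorted(C.collect()):
    print(str(i) + "," + str(j) + "," + str(v))
sc.stop()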
We also analyzed the time taken to process this data in both the Hadoop and Spark environments. Table 4 describes the time taken to process the matrix data on both Spark and Hadoop.
Matrix Multiplication
Hadoop 19 min
We compared the results for the different data sets mentioned above and concluded that Spark produced results in roughly 45% of the time taken by Hadoop. Figure 12 depicts the performance comparison.
Figure 12. Performance Comparison of Spark and Hadoop on Different Data Sets (processing time in minutes; x-axis: Matrix Multiplication, Web-Log Data, Weather Data, Wiki Dump)
From the above experiments on the considered datasets, we observed that Spark produces results approximately 2.5 times faster than Hadoop on the configured cluster environment.
Financial Data Analysis
In the cluster which was set up, we used Hadoop as a distributed file system running on a combination of virtual nodes, capable of handling large data sets (on the order of terabytes to petabytes) as the number of cluster nodes increases. Several data analytic tools can be used on top of Hadoop for analyzing large data sets in parallel. For example, one day's worth of NASDAQ trade data (Data Feed TotalView-ITCH) is on the order of 320,000,000 records requiring 10 GB (gigabytes) of space (NASDAQ, 2015). The analysis of these data is possible on a single server but becomes time consuming when multiple days' or weeks' worth of data needs to be processed. For data analysis, we installed Hadoop and the software tools provided by Spark to
analyze the ITCH data. The first challenge was to understand the data set, which is not in a human-readable format but rather in binary form. Second, there are many record formats: the ITCH documentation provides 18 different record types. Accordingly, we developed our own ITCH converter which reads the raw binary data and converts it to a delimited text file to facilitate processing with Hadoop/Spark. We have recently started working on extract tools, used to select specific records and data, which would be required to perform analysis in Spark.
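The converter itself is not listed in the thesis; below is a hedged sketch of what its core loop might look like, assuming the standard TotalView-ITCH binary file layout in which every message is preceded by a two-byte big-endian length field and begins with a one-character message type. Here it only tallies record types; field-level decoding would follow the ITCH specification for each type. The file path is hypothetical.

import struct
from collections import Counter

def count_message_types(path):
    # Assumed layout: [2-byte big-endian length][message bytes], repeated;
    # the first byte of each message is its type code ('A', 'E', 'X', 'D', ...).
    counts = Counter()
    with open(path, "rb") as f:
        while True:
            header = f.read(2)
            if len(header) < 2:
                break
            (length,) = struct.unpack(">H", header)
            message = f.read(length)
            if len(message) < length:
                break
            counts[chr(message[0])] += 1
    return counts

if __name__ == "__main__":
    # Hypothetical path to one trading day of raw ITCH data.
    print(count_message_types("itch_raw.bin"))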
• The TotalView-ITCH data consists of nanosecond-level data which contains market messages such as:
– ticker
– order information (placement, cancellation, execution)
– buy/sell information
– market participant ID
– other stock- and order-relevant information for every trading day on NASDAQ.
• The data is obtained as zipped .txt files coded in binary (which require special code to process), and one file can be around 2–10 gigabytes.
Normal systems take up to 5–8 hours to open a single file. The distributed cluster architecture processes the data and can open a file in about 20 minutes. Table 5 depicts the sample text format of the NASDAQ trade data after conversion from the raw data.
Table 5. Sample Text Format of the NASDAQ Trade Data After Conversion
Columns: Message Type | Timestamp (seconds past midnight) | Order Reference Number | Buy/Sell Side | Shares | Stocks | Price | Original Order Reference Number | Match Number | Market Participant ID
Add           | 32435.285091234 | 335531633 | S | 300 | AMZN | 450.5  |           |         |
Add w/ ID     | 33024.027480234 | 168914198 | B | 100 | C    | 48     |           |         | UBSS
Revision      | 34011.528407249 | 336529765 |   | 300 |      | 450.45 | 335531633 |         |
Exe           | 35011.180402345 | 336529765 |   | 76  |      |        |           | 3893405 |
Exe at diff P | 36001.712309875 | 168914198 |   | 100 |      | 48     |           | 5830295 |
Part. Cancl   | 37134.713037569 | 336529765 |   | 100 |      |        |           |         |
Full Cancl    | 38104.371047195 | 336529765 |   |     |      |        |           |         |
Experiment
To understand the contents of the data and its functionality, we performed an experiment involving the analysis of the ITCH data, and compared the outcome with results obtained by the finance department. The data, about 14 GB after conversion, was recorded on 23 January 2015 by NASDAQ and was provided by the finance department at the University of Memphis for research purposes. It covered a single day of transactions and was processed on the Spark framework, which used HDFS, in about 6 minutes. The sample converted data is stored on HDFS, and different operations are performed on this data set, such as a record-type counter and hourly addition/deletion of stocks to find cancellation statistics. Accordingly, the total number of placed orders (A), the total number of executions (E), the total number of cancelled orders (X), and the total number of deleted orders (D) are considered to arrive at a result for placement vs. execution vs. cancellation. The resulting file consisted of hourly counts for each type of record, which we used to analyze the hourly history of each record type. Figure 13 depicts the graph plotted with time on the x-axis and the total number of messages on the y-axis to show the hourly placement, execution, and cancellation activity.
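A hedged PySpark sketch of this hourly tally is shown below; it assumes the converted file is a delimited text file with the message type in the first field (written here as the single-letter ITCH codes A, E, X, D) and the timestamp in seconds past midnight in the second field, and it uses a hypothetical HDFS path.

from pyspark import SparkContext

sc = SparkContext(appName="itch-hourly-stats")
rows = sc.textFile("hdfs:///user/demo/itch/20150123_converted.txt")

# Keep the message types of interest: Add (A), Execute (E), Cancel (X), Delete (D).
INTERESTING = {"A", "E", "X", "D"}

def to_pair(line):
    fields = line.replace(",", " ").split()
    if len(fields) < 2 or fields[0] not in INTERESTING:
        return []
    hour = int(float(fields[1]) // 3600)     # seconds past midnight -> hour of day
    return [((fields[0], hour), 1)]

hourly = rows.flatMap(to_pair).reduceByKey(lambda a, b: a + b)

# e.g. (('A', 9), 1834567) would mean: Add messages between 09:00 and 10:00.
for (msg_type, hour), n in sorted(hourly.collect()):
    print(msg_type, hour, n)
sc.stop()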
Conclusions
The Spark and Hadoop frameworks are two popular distributed computing paradigms that provide an effective solution for handling large amounts of data. An in-depth analysis of the Hadoop and Spark frameworks was performed to understand the performance differences and the efficient processing of big data. The experimental evaluation confirms that Spark is more efficient and faster than Hadoop when handling different data at different scales. We expanded our research to the analysis of financial data (NASDAQ TotalView-ITCH) as a real-world use case.
References
AUGUR, H. (2015). How Data Science is Driving The Driverless Car. Retrieved June 15, 2017.
AWS. (2017). What is Big Data? Amazon Web Services. Retrieved June 1, 2017, from https://aws.amazon.com/big-data/what-is-big-data/
AWS, A. (2017). Apache Spark on Amazon EMR. Retrieved June 20, 2017, from https://aws.amazon.com/emr/details/spark/
Bappalige, S. P. (2014). An introduction to Apache Hadoop for big data. Retrieved June 2017.
Mohammed, E. A., Far, B. H., & Naugler, C. (2014). Applications of the MapReduce programming framework to clinical big data analysis: current landscape and future trends. BioMed Central.
Hadoop, A. (2014). What Is Apache Hadoop? Retrieved June 10, 2017, from Apache Hadoop: http://hadoop.apache.org
Singh, M., & Ali, A. (2016). Big Data Analytics with Microsoft HDInsight (1st ed.). Sams Teach Yourself.
Marr, B. (2015). The Awesome Ways Big Data Is Used Today To Change Our World. Retrieved from http://www.dataplumbing.datasciencecentral.com/blog/the-awesome-ways-big-data-is-used-today-to-change-our-world
Penchikala, S. (2015). Big Data Processing with Apache Spark – Part 1: Introduction.
http://searchcloudcomputing.techtarget.com/definition/Hadoop
SAS. (2017). Big Data Insights: What is Big Data? Retrieved May 13, 2017, from https://www.sas.com/en_us/insights/big-data/what-is-big-data.html
Spark. (2017). Spark Programming Guide. Retrieved June 2017, from https://spark.apache.org/docs/latest/programming-guide.html
Spark, A. (2017). Apache Spark™ is a fast and general engine for large-scale data processing.
Stoica, I. (2014). Apache Spark and Hadoop: Working Together. Retrieved June 15, 2017.