Big Data Analysis Using Apache Spark MLlib and Hadoop HDFS with Scala and Java
Abstract: Nowadays, with the technology revolution, the term big data is the phenomenon of the decade; moreover, it has a significant impact on our applied science trends. Exploring a good big data tool is a necessary demand at present. Hadoop is a good big data analysis technology, but it is slow because the job result of each phase must be stored before the following phase starts, and because of replication delays. Apache Spark is another tool, developed and established to be the real model for analyzing big data, with its innovative in-memory processing framework and high-level programming libraries for machine learning, efficient data processing, etc. In this paper, some comparisons are presented of the time performance of Scala and Java in Apache Spark MLlib. Many tests have been done with supervised and unsupervised machine learning methods utilizing big datasets, loading the datasets from Hadoop HDFS as well as from the local disk in order to identify the pros and cons of each manner and to discover the best dataset-loading situation for the best execution style. The results showed that the performance of Scala is about 10% to 20% better than Java, depending on the algorithm type. The aim of the study is to analyze big data with the more suitable programming language and, as a consequence, to gain better performance.

Keywords: Big data, Data analysis, Apache Spark, Hadoop HDFS, Machine learning, Spark MLlib, Resilient Distributed Datasets (RDD).

1. INTRODUCTION

At the present time, with the huge improvement in the information technology field and the facilitation over the internet for billions of people through huge databases and a diversity of digital devices, the term "big data" came out as a result. In a nutshell, big data is the announcement of the enormous dataset size [1]. By the year 2020, the number of connected devices will be roughly one hundred billion, leading to additional data aggregation. Consequently, clarifying and understanding big data analytics techniques is becoming essential [2], as is the need to change from the traditional database (relational DB), which has many limitations with big data, to the NoSQL database (non-relational DB), which overcomes these limitations and suits enterprise requirements [3]. Big data mainly has three features, recognized as the 3Vs (Volume, Variety and Velocity). Some establishments and big data experts have expanded this 3Vs framework to a 5Vs framework by adding the terms Value and Veracity to the big data explanation, as shown in Figure 1 and shortly reported as follows [4][5][6]:

1. Volume: denotes the big quantities of data from diverse places, for example, mobile data, computers, servers, etc. The advantage of treating and studying these great sizes of data is earning valuable information on society and enterprises.
2. Velocity: states the swiftness of transferring data. The contents of data are regularly varying through the data gathering process, resulting in different forms from several sources. This viewpoint needs new procedures and techniques for sufficiently exploring the streaming data.
3. Variety: mentions collecting different kinds of data through different devices, such as videos, images, etc. Furthermore, these kinds of data might be unstructured, semi-structured or structured.
4. Value: denotes the manner of pulling meaningful knowledge from enormous datasets. Value is the most significant feature of any big data tool, since it permits producing beneficial information.
5. Veracity: refers to the exactness or accuracy of the knowledge (informative and valuable).
Figure 1: Big data features [7].

The aims of this article are to acquire some knowledge of analyzing big data through an up-to-date tool, which is Apache Spark, and to employ a few programming languages that are fully compatible with it. Also, the target is transforming from traditional data stores (local disk) to big data requirements like HDFS.

2. APACHE HADOOP AND APACHE SPARK

A. Apache Hadoop:

Hadoop is an open-source framework written in the Java programming language that permits analyzing big data across clients. Hadoop can scale up from an individual host to thousands of hosts, providing storage and computation for each one. Basically, the Hadoop framework is separated into three parts, which are MapReduce, YARN and the Hadoop Distributed File System (HDFS) [8], as shown in figure 2.

• YARN

YARN stands for Yet Another Resource Negotiator, and it works as the Hadoop cluster resource manager, which means handling the Hadoop cluster resources such as memory, CPU, etc. Fortunately, versions 2 and 3 of Hadoop with YARN open a new door for the data processing environment [10].

• HDFS

The Hadoop Distributed File System (HDFS) generally divides the file system into data and metadata. HDFS has two important benefits in comparison with traditional distributed file systems. The first one is its great fault tolerance: it saves duplicates (copies) of the data on several data nodes, which permits recovering the data from the other data nodes after a distinguished error. The second benefit is that it allows the use of big data sizes, because Hadoop clusters can house data sets in petabytes [11].

B. Apache Spark:

It is a model that performs common data analysis on one node and on distributed nodes, which means it is similar to Hadoop. One of its advantages is the in-memory calculation technique it provides for increasing data processing speed. As well, it can access the Hadoop data storage (HDFS), because it runs on top of the existing Hadoop node. Besides that, it can process streaming data, like tweets on Twitter, in addition to structured data in Hive [12]. Basically, Spark is divided into some parts, and each part has its crucial task, as shown in figure 3.
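To make the Spark-on-Hadoop relationship concrete, the short Scala sketch below reads the same file once from the local disk and once from HDFS through a SparkContext. It is only a minimal illustration, not code from this paper's tests: the file paths and the NameNode address (localhost:9000) are placeholder assumptions.

import org.apache.spark.{SparkConf, SparkContext}

object HdfsVsLocal {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("HdfsVsLocal").setMaster("local[*]"))

    // Traditional data store: a file on the local disk.
    val localLines = sc.textFile("file:///tmp/sample.txt")

    // The same file uploaded to HDFS; host and port are placeholders
    // for a real NameNode address.
    val hdfsLines = sc.textFile("hdfs://localhost:9000/user/spark/sample.txt")

    println(s"local: ${localLines.count()} lines, HDFS: ${hdfsLines.count()} lines")
    sc.stop()
  }
}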
• Programming languages in Spark

Spark contains several language libraries that support performing various big data analyses. Spark is written in Scala, so it supports Scala perfectly, and even starting the spark-shell takes the user to a Scala prompt automatically, as shown in figure 4. In addition to Scala, three other programming languages exist in the Spark APIs, which are Java, Python and R.

Since the structure of Spark is constructed in Scala, writing a program for Spark in the Scala language offers access to the newest characteristics that may not exist in the other mentioned languages [14]. The size of Scala code is naturally smaller than the equivalent Java code. A lot of establishments that rely on Java in their work are changing to Scala to enhance scalability and reliability [15].
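As a hedged illustration of that difference in code size, the classic word count takes one chained expression at the spark-shell Scala prompt (where sc is the predefined SparkContext; the input path is a placeholder), while the pre-lambda Java RDD equivalent needs anonymous inner classes and explicit Tuple2 types and is typically several times longer.

// Word count in the Scala RDD API: one chained expression.
// `sc` is the SparkContext that the spark-shell creates automatically.
val counts = sc.textFile("hdfs://localhost:9000/user/spark/input.txt")
  .flatMap(_.split("\\s+"))   // split every line into words
  .map(word => (word, 1))     // pair each word with a count of one
  .reduceByKey(_ + _)         // sum the counts per word
counts.take(10).foreach(println)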
In general, machine learning algorithms are categorized, according to the style of the training data, into supervised and unsupervised learning. The machine learning library of Spark permits data analytics, and this library generally contains the famous algorithms, as shown in figure 6.

Figure 6: Machine learning categorization [17].

Basically, Spark machine learning is separated into two sets, as shown in figure 7. The first set is MLlib, and it was built on top of Resilient Distributed Datasets (RDD); it covers the common approaches that have been proposed so far. The next set is ML, and it originates with the newest structures of MLlib for building ML pipelines; this Application Program Interface (API) is constructed on the DataFrames features [18]. In this paper, the focus will be on the first set, which is MLlib.
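A minimal sketch of the RDD-based bundle is given below, assuming a spark-shell session with the predefined sc and a tiny hand-built dataset; it trains the MLlib K-means model on an RDD of vectors, whereas the newer spark.ml bundle would take a DataFrame with a features column instead.

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// RDD-based MLlib: the input is an RDD of feature vectors.
val points = sc.parallelize(Seq(
  Vectors.dense(0.0, 0.0), Vectors.dense(0.1, 0.1),
  Vectors.dense(9.0, 9.0), Vectors.dense(9.1, 9.1)))

val model = KMeans.train(points, 2, 10) // k = 2, 10 iterations
model.clusterCenters.foreach(println)

// The second bundle (spark.ml) would instead be driven by a DataFrame,
// e.g. new org.apache.spark.ml.clustering.KMeans().fit(df).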
3. RELATED WORK

S. Al-Saqqa et al., 2018 [20] discussed Spark's MLlib for employing it in the classification of sentiment at big data scale. They found that the support vector machine (SVM) is better than the other classifiers in the matter of performance.
As well, K. Al-Barznji et al., 2018 [21] talked about sentiment analysis utilizing machine learning algorithms such as Naïve Bayes and SVM for analyzing text with the benefit of the huge capabilities of Apache Spark. They found that SVM is more accurate in the condition of total average.

However, M. Assefi et al., 2017 [22] explored some views for growing the form of Apache Spark MLlib 2.0 as an open-source, accessible platform and performed many machine learning tests related to the real world to inspect the attribute characteristics. They also presented a comparison between Spark and Weka, proving the advantages of Spark over Weka in many aspects, like performance and its efficiency in dealing with huge amounts of data; on the other hand, Weka is good for simple users with its GUI and the diversity of algorithms that already exist in it.

Also, S. Salloum et al., 2016 [23] stated an assessment of the key structures of big data analytics using Apache Spark. Furthermore, they concentrated on the portions, concepts and features of Apache Spark and displayed its advantages in machine learning, graph analysis and stream processing in the enormous data fields. They also exposed the Spark APIs and their compatibility with various programming languages, in addition to the characteristics of Spark (RDD and DataFrame).

Likewise, A. Shoro et al., 2015 [24] discovered some big data analysis thoughts and distinguished some important evidence from various big data streaming sources, like tweets of Twitter, by applying Spark tools on them.

Moreover, A. Bansod, 2015 [25] provided a newer rating work by storing a huge dataset in the Hadoop Distributed File System (HDFS) and then analyzing it with Apache Spark, and also presented a comparison between Spark and Hadoop MapReduce showing the preference for the first one in performance and scalability.

Besides that, S. N. Omkar et al., 2015 [26] applied a variety of classification methods on various datasets from the Machine Learning Repository (UCI). As well, the execution time and the accuracy of every classifier were discovered, with some comparisons between them.

Similarly, S. Gopalani et al., 2015 [27] compared Hadoop MapReduce and Apache Spark and then provided a brief analysis of their performance by applying the K-Means algorithm.
In this paper, the focus will be on the imaginative and valuable ways of the Spark MLlib package applied in big data study at the present time, by mentioning the found weaknesses and strengths, presenting the basic advantages and disadvantages, and also showing the performance of the most famous machine learning algorithms with Java and Scala.

4. METHODOLOGY

In this paper, two programming languages have been utilized: the first one is the Java programming language and the second one is the Scala programming language. Both were evaluated in 32-bits and 64-bits Linux operating system environments, because the Windows operating system is not efficient for processing big datasets and also does not support big data tools like Hadoop. For both languages, two different machine learning algorithms have been used: one of them is supervised machine learning, which is the Decision Tree Regression algorithm, and the other one is unsupervised machine learning, which is the Clustering (K-means) algorithm.

Each algorithm reads the dataset two times, that is, from two different places: one time the algorithm reads the dataset stored previously on the local hard disk drive, and the second time it reads the dataset stored (uploaded) previously to the Hadoop HDFS storage. In summary, 16 tests have been done: 8 tests for Java and the same for Scala, with 4 Java tests on 32-bits Linux OS and the other 4 Java tests on 64-bits Linux OS; the same tests were applied for Scala, as shown in figure 8. A timing sketch of one such test is given after Table 1.

Figure 8: Tests structure in this paper.

• Tested Environments

Two VMware environments have been utilized to gain these experimental outcomes. The first one is installed on a 32-bits O.S. and the second one is installed on a 64-bits O.S. The rest of the information about the used environments is shown in Table 1.

Table 1: Tested environments.
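The sketch below shows how one of these tests could be structured in Scala: it loads the same dataset once from the local disk and once from HDFS, and times the MLlib K-means and Decision Tree Regression training on each copy. It is a reconstruction under stated assumptions, not the paper's actual test code: the paths, the NameNode address, the LIBSVM input format and the algorithm parameters (k, iterations, depth, bins) are all placeholders.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.tree.DecisionTree
import org.apache.spark.mllib.util.MLUtils

object TimingHarness {
  // Measure the wall-clock time of an arbitrary block.
  def time[A](label: String)(block: => A): A = {
    val start = System.nanoTime()
    val result = block
    println(s"$label took ${(System.nanoTime() - start) / 1e9} s")
    result
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("MLlibTiming"))

    // Hypothetical paths: one local copy and one HDFS copy of the same dataset.
    val localPath = "file:///data/dataset.libsvm"
    val hdfsPath  = "hdfs://localhost:9000/user/spark/dataset.libsvm"

    for (path <- Seq(localPath, hdfsPath)) {
      val data = time(s"load $path") {
        val d = MLUtils.loadLibSVMFile(sc, path).cache()
        d.count() // force materialization so load time is actually measured
        d
      }

      // Unsupervised test: K-means on the feature vectors.
      time(s"k-means on $path") {
        KMeans.train(data.map(_.features), 2 /*k*/, 20 /*iterations*/)
      }

      // Supervised test: Decision Tree Regression.
      time(s"decision tree regression on $path") {
        DecisionTree.trainRegressor(data, Map[Int, Int](),
          "variance", 5 /*maxDepth*/, 32 /*maxBins*/)
      }
      data.unpersist()
    }
    sc.stop()
  }
}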
5. RESULTS

Table 3: Duration of processing time for all tests.
… that if the dataset contains a huge number of attributes, or if a massive number of Depth and Bins is set in any supervised algorithm, the heap size space problem or garbage collection problem will similarly appear in the 32-bits OS, and it cannot be solved due to the mentioned reason. Figure 12 shows the time difference between every test in this scenario.

Figure 9: K-means algorithm applied on the 1.1 GB dataset under Linux 32-bits (disc time vs. Hadoop time, Scala and Java).

Figure 10: Decision Tree Regression algorithm applied on the 566 MB dataset under Linux 32-bits (disc time vs. Hadoop time, Scala and Java).

Figure 11: K-means algorithm applied on the 1.1 GB dataset under Linux 64-bits (disc time vs. Hadoop time, Scala and Java).

Figure 12: Decision Tree Regression algorithm applied on the 566 MB dataset under Linux 64-bits (disc time vs. Hadoop time, Scala and Java).

6. DISCUSSION

It is good to state some problems that appeared during the test processes, together with their solutions:

• Not enough space in the temp: the processing operation is located entirely in the temp folder of the Linux 32-bits O.S., and normally the size of the temp folder is in megabytes, so it cannot stand big data sizes. The solution is to increase the size of the temp folder to 1 GB from the terminal as root with the command below, to afford huge data calculations:

mount -t tmpfs -o size=1073741824,mode=1777 overflow /tmp
• Java heap size space is not enough: when beginning to use any IDE (Integrated Development Environment) like Eclipse, NetBeans or any other IDE, the default heap size space for the project is between 64 and 256 MB, and that space is not enough for computing a large dataset. The solution is to increase it to 1 GB, to be like the Spark default heap space, through the IDE path below (a configuration sketch follows this list):

Click on application name – Properties – Run – VM Options – then set 1 GB.

• Weka can't read huge data: in the beginning, our intent was to compare the time performance between what is presented in this paper and the normal Weka program, but unfortunately Weka cannot afford a huge dataset file; especially, it needs much more time than Spark just for reading the data, without processing it. The solution is to change the Weka environment by adding a new package like distributed Weka Hadoop or Spark, and that might be a good topic for further research.
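As a sketch of that heap adjustment outside the IDE dialog, the memory can also be requested when the Spark application is configured. The values below mirror the 1 GB figure discussed above; the property names are standard Spark settings, and note that in client mode spark.driver.memory must be set before the driver JVM starts (e.g. via spark-submit --driver-memory 1g, the equivalent of the IDE's -Xmx VM option).

import org.apache.spark.{SparkConf, SparkContext}

// Request 1 GB heaps for the driver and executors, mirroring the IDE fix above.
val conf = new SparkConf()
  .setAppName("HeapSizedApp")
  .set("spark.driver.memory", "1g")   // honored only if set before the driver JVM starts
  .set("spark.executor.memory", "1g")
val sc = new SparkContext(conf)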
7. FUTURE WORK

The future work of this paper could be applied with four additional modifications. The primary one is changing the environment from a single-node cluster into a multi-node cluster, and that definitely leads to gaining better performance with the capability of executing larger datasets. The next alteration is reading the dataset from a variety of storage systems that support the big data atmosphere, for instance MongoDB, HBase, Cassandra, Couchbase, etc., to compare which storage is more compatible with Spark and, as a consequence, decreases the request time, which means reducing the execution time in the end. The third modification is comparing the performance of the other programming languages for machine learning that are already supported by Spark, such as R and Python. The final modification is using all the previous changes with the second bundle of the Spark machine learning library, which is ML, instead of the Spark MLlib bundle, because the ML bundle is built on Datasets and DataFrames, unlike the MLlib bundle that is built on RDDs, and then demonstrating the accuracy and the performance of each bundle with a comparison between them.
[14] Firoj Parwej, Nikhat Akhtar, Yusuf Perwej, "A Close-Up View About Spark in Big Data Jurisdiction," International Journal of Engineering Research and Application (IJERA), vol. 8, no. 1, p. 31, January 2018.
[15] Tarun Kumawat, Pradeep Kumar Sharma, Deepak Verma, Komal
Joshi, Vijeta Kumawat, "Implementation of Spark Cluster
Technique with Scala," International Journal of Scientific and
Research Publications, vol. 2, no. 11, p. 501, November 2012.
[16] D. U. R. Pol, "Big Data Analysis: Comparison of Hadoop MapReduce and Apache Spark," IJESC, vol. 6, no. 6, p. 6390, 2016.
[17] B. Kaluža, Machine Learning in Java, UK: Packt Publishing Ltd,
2016.
[18] Salvador García, Sergio Ramírez-Gallego, Julián Luengo, José Manuel Benítez, Francisco Herrera, "Big data preprocessing: methods and prospects," Big Data Analytics, p. 9, 2016.
[19] Hend Sayed, Manal A. Abdel-Fattah, Sherif Kholief, "Predicting
Potential Banking Customer Churn using Apache Spark ML and
MLlib Packages: A Comparative Study," (IJACSA) International
Journal of Advanced Computer Science and Applications, vol. 9,
pp. 674-677, Nov 2018.
[20] Samar Al-Saqqa, Ghazi Al-Naymat, Arafat Awajan, "A Large-Scale Sentiment Data Classification for Online Reviews Under Apache Spark," in The 9th International Conference on Emerging Ubiquitous Systems and Pervasive Networks (EUSPN), Belgium, 2018.
[21] Kamal Al-Barznji, Atanas Atanassov, "Big Data Sentiment
Analysis Using Machine Learning Algorithms," in Proceedings of
26th International Symposium "Control of Energy, Industrial and
Ecological Systems, Bankia, Bulgaria, May 2018.
[22] Mehdi Assefi, Ehsun Behravesh, Guangchi Liu, and Ahmad P.
Tafti, "Big Data Machine Learning using Apache Spark," in 2017
IEEE International Conference on Big Data, Boston, MA, USA,
11-14 Dec. 2017.
[23] Salman Salloum, Ruslan Dautov, Xiaojun Chen, Patrick Xiaogang Peng, Joshua Zhexue Huang, "Big data analytics on Apache Spark," International Journal of Data Science and Analytics, Springer International Publishing Switzerland, September 2016.
[24] Abdul Ghaffar Shoro, Tariq Rahim Soomro, "Big Data Analysis: Apache Spark Perspective," Global Journal of Computer Science and Technology: C Software & Data Engineering, vol. 15, no. 1, pp. 7-14, 2015.
[25] A. Bansod, "Efficient Big Data Analysis with Apache Spark in
HDFS," International Journal of Engineering and Advanced
Technology (IJEAT), vol. 4, no. 6, pp. 313-315, August 2015.
[26] Mohit, Rohit Ranjan Verma, Sameeksha Katoch, Ashoka Vanjare, S. N. Omkar, "Classification of Complex UCI Datasets Using Machine Learning Algorithms Using Hadoop," International Journal of Computer Science and Software Engineering (IJCSSE), vol. 4, no. 7, pp. 190-198, July 2015.
[27] Satish Gopalani, Rohan Arora, "Comparing Apache Spark and
Map Reduce with Performance Analysis using K-Means,"
International Journal of Computer Applications (0975 – 8887),
vol. 113, pp. 8-11, March 2015.
[28] "UCI machine learning repository," [Online]. Available:
http://archive.ics.uci.edu/ml/index.html. [Accessed 26 2 2019].