
International Journal of Innovative Research in Engineering & Management (IJIREM)
ISSN: 2350-0557, Volume-3, Issue-5, September-2016

Implementing K-Means for Achievement Study between Apache Spark and Map Reduce

Dr. E. Laxmi Lydia, Associate Professor, Department of Computer Science and Engineering, Vignan's Institute of Information Technology, Visakhapatnam, Andhra Pradesh, India.
Dr. A. Krishna Mohan, Professor, Department of Computer Science and Engineering, JNTUK, Andhra Pradesh, India.
Dr. M. Ben Swarup, Professor, Department of Computer Science and Engineering, Vignan's Institute of Information Technology, Visakhapatnam, Andhra Pradesh, India.

ABSTRACT
Big data has long been a subject of interest for computer science enthusiasts around the globe, and it has gained even more prominence recently with the continuous explosion of data produced by the likes of social media and the quest of tech giants to gain access to deeper analysis of their data. MapReduce and its variants have been very successful in implementing large-scale data-intensive applications on commodity clusters. However, most of these frameworks are built around an acyclic data flow model that is not suitable for other popular applications. The original MapReduce executes jobs in a simple but rigid structure: a transformation step ("map"), a synchronization step ("shuffle"), and a step to combine results from all the nodes in a cluster ("reduce"). To overcome this inflexible map-and-reduce structure, the recently introduced Apache Spark has been proposed; both frameworks provide a processing model for analyzing big data. The leading candidate for "successor to MapReduce" today is Apache Spark. Like MapReduce, it is a general-purpose engine, but it is designed to run many more workloads, and to do so much faster than the older system. In this paper we compare the two frameworks and give a performance analysis using a standard machine learning algorithm for clustering (K-Means), considering parameters such as scheduling delay, speedup, and energy consumption relative to existing systems.

Keywords:
Spark, MapReduce, Hadoop, Big Data

1. INTRODUCTION
A new model of cluster computing has become widely popular, in which data-parallel computations are executed on clusters of unreliable machines by systems that automatically provide locality-aware scheduling, fault tolerance, and load balancing. MapReduce [11] pioneered this model, while systems like Dryad [17] and Map-Reduce-Merge [24] generalized the types of data flows supported. These systems achieve their scalability and fault tolerance by providing a programming model in which the user creates acyclic data flow graphs that pass input data through a set of operators. This allows the underlying system to manage scheduling and to react to faults without user intervention.

While this data flow programming model is useful for a large class of applications, there are applications that cannot be expressed efficiently as acyclic data flows. In this paper, we focus on one such class of applications: those that reuse a working set of data across multiple parallel operations. This includes two use cases where we have seen Hadoop users report that MapReduce is lacking:

Iterative jobs: Many common machine learning algorithms apply a function repeatedly to the same dataset to optimize a parameter (e.g., through gradient descent). While each iteration can be expressed as a MapReduce/Dryad job, each job must reload the data from disk, incurring a significant performance penalty.

Interactive analytics: Hadoop is often used to run ad-hoc exploratory queries on large datasets, through SQL interfaces such as Pig [21] and Hive [1]. Ideally, a user would be able to load a dataset of interest into memory across a number of machines and query it repeatedly. With Hadoop, however, each query incurs significant latency (several seconds) because it runs as a separate MapReduce job and reads data from disk.

This paper presents a new cluster computing framework called Spark, which supports applications with working sets while providing scalability and fault tolerance properties similar to MapReduce. The main abstraction in Spark is the resilient distributed dataset (RDD), which represents a read-only collection of objects partitioned across a set of machines that can be rebuilt if a partition is lost. Users can explicitly cache an RDD in memory across machines and reuse it in multiple MapReduce-like parallel operations.


RDDs achieve fault tolerance through a notion of lineage: if a partition of an RDD is lost, the RDD has enough information about how it was derived from other RDDs to rebuild just that partition. Although RDDs are not a general shared-memory abstraction, they represent a sweet spot between expressivity on the one hand and scalability and reliability on the other, and we have found them well suited to a variety of applications.
Spark is implemented in Scala [5], a statically typed high-level programming language for the Java VM, and exposes a functional programming interface similar to DryadLINQ [25]. In addition, Spark can be used interactively from a modified version of the Scala interpreter, which allows the user to define RDDs, functions, variables, and classes and use them in parallel operations on a cluster. We believe that Spark is the first framework to allow an efficient, general-purpose programming language to be used interactively to process large datasets on a cluster.

Although our implementation of Spark is still a prototype, early experience with the framework is encouraging. We show that Spark can outperform Hadoop by 10x in iterative machine learning workloads and can be used interactively to scan a 39 GB dataset with sub-second latency.

1.1 HADOOP ALONG WITH SPARK
Hadoop as a big data processing technology has been around for a long time and has proved to be the solution of choice for processing large datasets. MapReduce is a great solution for one-pass computations, but it is not very efficient for use cases that require multi-pass computations and algorithms. Each step in the data processing workflow has one Map phase and one Reduce phase, and any use case has to be converted into the MapReduce pattern to leverage this solution.

The job output data between steps has to be stored in the distributed file system before the next step can begin, so this approach tends to be slow due to replication and disk storage. Hadoop solutions also typically involve clusters that are hard to set up and manage, and they require the integration of several tools for different big data use cases (such as Mahout for machine learning and Storm for streaming data processing).

If you wanted to do something complicated, you would have to string together a series of MapReduce jobs and execute them in sequence. Each of those jobs was high-latency, and none could start until the previous job had finished completely.

Spark allows programmers to develop complex, multi-step data pipelines using a directed acyclic graph (DAG) pattern, as sketched below. It also supports in-memory data sharing across DAGs, so that different jobs can work with the same data.
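As an illustration of such a pipeline, here is a word-count-style PySpark sketch; the input and output paths are hypothetical. All the stages run as one Spark job, with no intermediate writes to HDFS between steps:

```python
from pyspark import SparkContext

sc = SparkContext(appName="DagPipelineSketch")

# A multi-step pipeline expressed as one DAG of transformations;
# Spark schedules it as a single job rather than a chain of
# separate MapReduce jobs with HDFS writes in between.
counts = (sc.textFile("hdfs:///data/input.txt")
            .flatMap(lambda line: line.split())
            .map(lambda word: (word, 1))
            .reduceByKey(lambda a, b: a + b)
            .sortBy(lambda kv: kv[1], ascending=False))

counts.saveAsTextFile("hdfs:///data/word_counts")
```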
Spark runs on top of the existing Hadoop Distributed File System (HDFS) to provide enhanced and additional functionality. It supports deploying Spark applications on an existing Hadoop v1 cluster (with SIMR – Spark In MapReduce), on a Hadoop v2 YARN cluster, or even on Apache Mesos. We should look at Spark as an alternative to Hadoop MapReduce rather than a replacement for Hadoop. It is not intended to supplant Hadoop but to provide a comprehensive, unified solution to manage different big data use cases and requirements. Figure 1 shows the difference between Hadoop and Spark.

Figure 1. Difference between Hadoop and Spark

1.2 SPARK ARCHITECTURE
The Spark architecture includes the following three main components:

Data Storage: Spark uses the HDFS file system for data storage purposes. It works with any Hadoop-compatible data source, including HDFS, HBase, Cassandra, and so on.

Programming Interface: The API enables application developers to create Spark-based applications using a standard API interface. Spark provides APIs for the Scala, Java, and Python programming languages.

Resource Management: Spark can be deployed as a standalone server, or it can run on a distributed computing framework such as Mesos or YARN (see the configuration sketch below). Figure 2 below shows these components of the Spark architecture model.

Figure 2. Spark Architecture
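As a sketch of how this resource-management choice surfaces in application code, the snippet below switches deployment targets by changing only the master URL; the host names are illustrative assumptions, and the master strings follow Spark 1.x conventions:

```python
from pyspark import SparkConf, SparkContext

# The same application can target different resource managers by
# changing only the master URL (hypothetical hosts shown):
conf = (SparkConf()
        .setAppName("DeploymentSketch")
        .setMaster("spark://master-host:7077"))  # standalone Spark cluster
# .setMaster("yarn-client")                      # Hadoop v2 YARN (Spark 1.x syntax)
# .setMaster("mesos://master-host:5050")         # Apache Mesos
# .setMaster("local[4]")                         # local testing with 4 threads

sc = SparkContext(conf=conf)
```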
1.3 REASONS TO CHOOSE SPARK
Spark uses the concept of the RDD, which allows us to store data in memory and persist it according to our requirements. This permits a massive increase in batch processing performance (up to ten to a hundred times that of conventional MapReduce).


Spark also allows us to cache data in memory, which is valuable for iterative algorithms such as those used in machine learning.

Conventional MapReduce and DAG engines are problematic for these applications because they rely on acyclic data flow: an application has to run as a series of distinct jobs, each of which reads data from stable storage (e.g., a distributed file system) and writes it back to stable storage. They incur significant cost loading the data at each step and writing it back to replicated storage.

Spark also allows us to perform stream processing on large input data and deal with just a chunk of data on the fly. This can likewise be used for online machine learning, and it suits use cases requiring real-time analysis, which happens to be a practically universal requirement in industry.

MapReduce is inefficient for multi-pass applications that require low-latency data sharing across multiple parallel operations. Such applications are very common in analytics, and include:

• Iterative algorithms, including many machine learning algorithms and graph algorithms like PageRank (a sketch of this pattern follows the list).
• Interactive data mining, where a user may want to load data into RAM across a cluster and query it repeatedly.
• Streaming applications that maintain aggregate state over time.
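A minimal sketch of the iterative pattern, assuming a toy dataset of (x, y) pairs and a one-parameter gradient-descent update; the data, learning rate, and iteration count are illustrative, not from the paper:

```python
from pyspark import SparkContext

sc = SparkContext(appName="IterativeSketch")

# Toy (x, y) points; cached so every iteration reads memory, not disk.
points = sc.parallelize([(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]).cache()

w = 0.0  # single model parameter, fit so that y ~ w * x
for _ in range(10):
    # Bind the current w into the closure shipped to the executors.
    grad = points.map(lambda p, w=w: (w * p[0] - p[1]) * p[0]).mean()
    w -= 0.1 * grad  # gradient-descent step

print(w)  # approaches ~2.0 for this toy data
```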
2. IMPLEMENTATION

2.1 K-MEANS CLUSTERING
K-Means is a simple learning algorithm for clustering analysis. The goal of the K-Means algorithm is to find the best division of n entities into k groups, so that the total distance between each group's members and its corresponding centroid, representative of the group, is minimized. The k-means algorithm is used for partitioning, where each cluster's centre is represented by the mean value of the objects in the cluster. The pseudo-code is as follows:

Step 1: Begin with n clusters, each containing one object, and number the clusters 1 through n.
Step 2: Compute the between-cluster distance D(r, s) as the between-object distance of the two objects in r and s respectively, for r, s = 1, 2, ..., n. Let the square matrix D = (D(r, s)). If the objects are represented by vectors, we can use the Euclidean distance.
Step 3: Next, find the most similar pair of clusters r and s, such that the distance D(r, s) is minimum among all the pairwise distances.
Step 4: Merge r and s into a new cluster t and compute the between-cluster distance D(t, k) for any existing cluster k ≠ r, s. Once the distances are obtained, delete the rows and columns corresponding to the old clusters r and s in the D matrix, since r and s no longer exist. Then add a new row and column in D corresponding to cluster t.
Step 5: Repeat Step 3 a total of n − 1 times until there is only one cluster left.
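For the Spark runs reported later (Table 2), MLlib's K-Means was used; MLlib implements the standard iterative centroid-update formulation of K-Means. A minimal PySpark sketch follows, assuming the records have already been reduced to numeric feature vectors in a CSV file; the path, k, and iteration count are illustrative:

```python
from numpy import array
from pyspark import SparkContext
from pyspark.mllib.clustering import KMeans

sc = SparkContext(appName="KMeansSketch")

# Hypothetical file of comma-separated numeric features derived from
# the records (non-numeric fields dropped or encoded beforehand).
data = sc.textFile("hdfs:///data/healthcare_features.csv")
points = data.map(lambda line: array([float(x) for x in line.split(",")]))
points.cache()  # K-Means is iterative: every pass reuses the cached data

model = KMeans.train(points, k=3, maxIterations=10)
print(model.clusterCenters)       # learned centroids
print(model.computeCost(points))  # within-cluster sum of squared errors
```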
3. COMPARISON
In order to arrive at a conclusion about the practical comparison of Apache Spark and MapReduce, we performed a comparative analysis using these frameworks on a dataset that allows us to perform clustering using the K-Means algorithm.

3.1 DATASET DESCRIPTION
The dataset comprises healthcare_Sample_datasets, 3.13 MB in size, collected over the years; it includes patientID, name, and other values for the respective records. The record schema is shown in Table 1, followed by sample records.

Table 1: Healthcare_sample_datasets
PatientID: int
Name: chararray
DOB: chararray
PhoneNumber: chararray
EmailAddress: chararray
SSN: chararray
Gender: chararray
Disease: chararray
weight: float

Sample Records:
111, aa1, 12/10/1950, 1234, [email protected], 11, M, Diabetes, 78
112, aa2, 12/10/1984, 1234, [email protected], 11, F, PCOS, 67

3.2 PERFORMANCE ANALYSIS AND DESCRIPTION
After running the K-Means algorithm on the described dataset, we obtained the following results for comparison (shown in the tables). To gain a varied analysis, we considered 64 MB, 3.13 MB with a single node, and 3.13 MB with two nodes, and monitored the performance in terms of the time taken for clustering using the K-Means algorithm. The machines used had the following configuration:
• 4 GB RAM
• Linux Ubuntu
• 500 GB hard drive

The results clearly showed that the performance of Spark turned out to be considerably better in terms of time: for each dataset size, the processing time decreased by up to three times compared to that of MapReduce. Although there is a minor fluctuation in this result, it is due to the random nature of the K-Means algorithm and does not affect the analysis to a large extent.


Table 2: Results for K-Means using Spark (MLlib)

Dataset Size   Nodes   Time (s)
64 MB          1       18
3.13 MB        1       149

Table 3: Results for K-Means using MapReduce (Mahout)

Dataset Size   Nodes   Time (s)
64 MB          1       44
3.13 MB        1       291
3.13 MB        2       163
The performance of Spark and MapReduce is compared using the following metrics: scheduling delay, speedup, and energy consumption, each with respect to the number of nodes in the cluster.

3.2.1 SCHEDULING DELAY: SPARK VS MAP REDUCE
Figure 3 shows the result for scheduling delay with respect to Spark and MapReduce in the Hadoop cluster. Spark shows a shorter scheduling length compared to MapReduce.

Figure 3. The result of scheduling delay with respect to Spark and Map Reduce in the Hadoop Cluster

3.2.2 SPEED UP: SPARK VS MAP REDUCE
The speedup is the ratio of the sequential execution time to the schedule length of the output schedule. Figure 4 shows the result for speedup with respect to Spark and MapReduce. The speedup of the Spark model is higher than that of the MapReduce approaches, and its value gradually increases with the number of nodes in the cluster.

Figure 4. Result of speed up with respect to Spark and Map Reduce
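As a worked example of this definition using the MapReduce numbers from Table 3 (taking the single-node time as the sequential execution time):

```python
# Speedup = sequential execution time / parallel schedule length.
# Values from Table 3: MapReduce on the 3.13 MB dataset.
seq_time = 291.0  # seconds, 1 node
par_time = 163.0  # seconds, 2 nodes

speedup = seq_time / par_time
print(round(speedup, 2))  # ~1.79
```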
3.2.3 ENERGY CONSUMPTION: SPARK VS MAP REDUCE
Figure 5 shows the result for energy consumption with respect to the Spark and MapReduce models. Spark consumes less energy than MapReduce; the consumption gradually increases with the number of cluster resources.

Figure 5. Results of energy consumption with respect to Spark and Map Reduce

4. CONCLUSION
This research paper gives a review of both frameworks and compares them on different parameters, followed by a performance analysis using the K-Means algorithm. Our results for this study show that Spark is a very strong contender and undoubtedly brings about an improvement through its use of in-memory processing. Observing Spark's ability to perform batch processing, streaming, and machine learning on the same cluster, and looking at the current rate of adoption of Spark throughout the industry, Spark will become the de facto framework for a large number of use cases involving big data processing.


REFERENCES
[1] Apache Hive. http://hadoop.apache.org/hive
Scala programming language. http://www.scala-lang.org
[2] C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins. Pig Latin: a not-so-foreign language for data processing. In SIGMOD '08, 2008.
[3] Y. Yu, M. Isard, D. Fetterly, M. Budiu, Ú. Erlingsson, P. K. Gunda, and J. Currey. DryadLINQ: A system for general-purpose distributed data-parallel computing using a high-level language. In OSDI '08, San Diego, CA, 2008.
[4] J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. Commun. ACM, 51(1):107–113, 2008.
[5] M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly. Dryad: Distributed data-parallel programs from sequential building blocks. In EuroSys 2007, pages 59–72, 2007.
[6] B. Nitzberg and V. Lo. Distributed shared memory: a survey of issues and algorithms. Computer, 24(8):52–60, Aug 1991.
[7] Spark Main Website.
[8] Spark Examples.
[9] Spark Summit 2014 Conference Presentations and Videos.
[10] Spark on Databricks website.
