Spark
Industries use Hadoop extensively to analyze their data sets, because the Hadoop framework is based on a simple programming model (MapReduce) and provides a computing solution that is scalable, flexible, fault-tolerant and cost-effective. The main concern, however, is the speed of processing large datasets: the waiting time between queries and the waiting time to run a program.
Spark was introduced by the Apache Software Foundation to speed up Hadoop's computational process.
Contrary to a common belief, Spark is not a modified version of Hadoop, and it does not really depend on Hadoop, because it has its own cluster management. Hadoop is just one of the ways to deploy Spark.
Spark can use Hadoop in two ways: for storage and for processing. Since Spark has its own cluster management, it typically uses Hadoop for storage only.
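For illustration, here is a minimal Scala sketch of a Spark job that uses HDFS purely as the storage layer while Spark itself manages the computation. The HDFS paths and class name are hypothetical placeholders, not part of the original text:

    import org.apache.spark.{SparkConf, SparkContext}

    object HdfsWordCount {
      def main(args: Array[String]): Unit = {
        // Spark's own cluster manager runs the computation; HDFS only stores the data.
        val sc = new SparkContext(new SparkConf().setAppName("HdfsWordCount"))

        // Hypothetical HDFS input path.
        val lines = sc.textFile("hdfs://namenode:8020/data/input.txt")

        val counts = lines.flatMap(_.split("\\s+"))
                          .map(word => (word, 1))
                          .reduceByKey(_ + _)

        // Hypothetical HDFS output path.
        counts.saveAsTextFile("hdfs://namenode:8020/data/wordcounts")
        sc.stop()
      }
    }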
Apache Spark
Apache Spark is a lightning-fast cluster computing technology designed for fast computation. It builds on the Hadoop MapReduce model and extends it to efficiently support more types of computations, including interactive queries and stream processing. The main feature of Spark is its in-memory cluster computing, which increases the processing speed of an application.
Spark is designed to cover a wide range of workloads, such as batch applications, iterative algorithms, interactive queries and streaming. Besides supporting all these workloads in a single system, it reduces the management burden of maintaining separate tools.
Spark started in 2009 as a project in UC Berkeley's AMPLab, led by Matei Zaharia. It was open-sourced in 2010 under a BSD license, donated to the Apache Software Foundation in 2013, and became a top-level Apache project in February 2014.
Spark can be deployed with Hadoop components in three ways: standalone, on Hadoop YARN, or as Spark in MapReduce (SIMR).
Data Sharing in MapReduce and Spark RDD
MapReduce is widely adopted for processing and generating large datasets with a parallel,
distributed algorithm on a cluster. It allows users to write parallel computations, using a set of
high-level operators, without having to worry about work distribution and fault tolerance.
Unfortunately, in most current frameworks, the only way to reuse data between computations (for example, between two MapReduce jobs) is to write it to an external stable storage system such as HDFS. Although this framework provides numerous abstractions for accessing a cluster's computational resources, users still want more.
Both iterative and interactive applications require faster data sharing across parallel jobs. Data sharing is slow in MapReduce due to replication, serialization, and disk I/O. In practice, most Hadoop applications spend more than 90% of their time doing HDFS read-write operations.
In interactive use, a user runs ad-hoc queries on the same subset of data. Each query performs disk I/O against stable storage, which can dominate application execution time. This is how the current framework behaves when running interactive queries on MapReduce.
Recognizing this problem, researchers developed a specialized framework called Apache Spark. The key idea of Spark is the Resilient Distributed Dataset (RDD), which supports in-memory processing: the working data is kept as objects in memory across jobs, and those objects can be shared between jobs. Sharing data in memory is 10 to 100 times faster than sharing it over the network or via disk.
Let us now try to find out how iterative and interactive operations take place in Spark RDD.
With iterative operations on Spark RDD, intermediate results are stored in distributed memory instead of stable storage (disk), which makes the system faster.
Note − If the distributed memory (RAM) is not sufficient to store intermediate results (the state of the job), Spark stores those results on disk.
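To make the idea concrete, here is a small illustrative Scala sketch (using the SparkContext sc that spark-shell provides) of an iterative computation that caches its data in memory once and then reuses it on every iteration; the input path and the loop body are invented for this example:

    // Parse the input once and keep it in distributed memory.
    val numbers = sc.textFile("hdfs://namenode:8020/data/numbers.txt")  // hypothetical path
                    .map(_.toDouble)
                    .cache()

    var estimate = 0.0
    for (i <- 1 to 10) {
      // Each iteration reads 'numbers' from memory rather than re-reading HDFS.
      val error = numbers.map(x => x - estimate).mean()
      estimate += error
    }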
With interactive operations on Spark RDD, if different queries are run repeatedly on the same set of data, that data can be kept in memory for better execution times.
By default, each transformed RDD may be recomputed each time you run an action on it. However, you may also persist an RDD in memory, in which case Spark keeps the elements around on the cluster for much faster access the next time you query it. There is also support for persisting RDDs on disk, or replicating them across multiple nodes.
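As a sketch of the interactive case (the log path and query predicates are hypothetical, and sc is the SparkContext provided by spark-shell), a dataset can be persisted once and then queried repeatedly from memory:

    import org.apache.spark.storage.StorageLevel

    val logs = sc.textFile("hdfs://namenode:8020/logs/app.log")   // hypothetical path
    logs.persist(StorageLevel.MEMORY_AND_DISK)   // spill to disk if memory is insufficient

    // The first action materializes and caches the RDD ...
    val errors   = logs.filter(_.contains("ERROR")).count()
    // ... later queries on the same data are answered from memory.
    val warnings = logs.filter(_.contains("WARN")).count()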
Running Spark Jobs on YARN
When running Spark on YARN, each Spark executor runs as a YARN container. Where MapReduce
schedules a container and fires up a JVM for each task, Spark hosts multiple tasks within the same
container. This approach enables several orders of magnitude faster task startup time.
Spark supports two modes for running on YARN, “yarn-cluster” mode and “yarn-client” mode.
Broadly, yarn-cluster mode makes sense for production jobs, while yarn-client mode makes sense
for interactive and debugging uses where you want to see your application’s output immediately.
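As a sketch of how the two modes are selected at submission time (the application jar and class name are placeholders; newer Spark releases express the same choice as --master yarn --deploy-mode cluster|client):

    # Production-style submission: the driver runs in the Application Master on the cluster.
    spark-submit --master yarn-cluster --class com.example.MyApp \
      --num-executors 4 --executor-cores 4 --executor-memory 4G my-app.jar

    # Interactive/debugging submission: the driver runs in the local client process.
    spark-submit --master yarn-client --class com.example.MyApp my-app.jar

Each executor here is a single YARN container that can run several tasks concurrently (--executor-cores), rather than one container per task as in MapReduce.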
In YARN, each application instance has an Application Master process, which is the first container started for that application. The Application Master is responsible for requesting resources from the ResourceManager and, once they are allocated, telling NodeManagers to start containers on its behalf. Application Masters obviate the need for an active client: the process that starts the application can go away, and coordination continues from a process managed by YARN running on the cluster.
In yarn-cluster mode, the driver runs in the Application Master. This means that the same process
is responsible for both driving the application and requesting resources from YARN, and this
process runs inside a YARN container. The client that starts the application does not need to stick around for the application's entire lifetime.
The yarn-cluster mode, however, is not well suited to using Spark interactively. Spark applications
that require user input, like spark-shell and PySpark, need the Spark driver to run inside the client
process that initiates the Spark application. In yarn-client mode, the Application Master is merely
present to request executor containers from YARN. The client communicates with those containers to schedule work after they start.
To summarize: in yarn-cluster mode, the Spark client submits the application to YARN, and both the Spark driver and the Spark executors run under the supervision of YARN. In yarn-client mode, only the Spark executors are supervised by YARN; the YARN Application Master requests resources just for the executors, while the driver program runs in the client process, outside of YARN's control.
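For example, the interactive shells bundled with Spark keep the driver in the local client process and therefore run on YARN in client mode (a sketch; flag spelling follows the older yarn-client naming used above):

    # Interactive shells must use yarn-client mode, since the driver lives in the local client.
    spark-shell --master yarn-client
    pyspark --master yarn-client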