
APACHE SPARK

Industries use Hadoop extensively to analyze their data sets, because the Hadoop framework is based on a simple programming model (MapReduce) and enables a computing solution that is scalable, flexible, fault-tolerant and cost-effective. The main concern, however, is speed: processing large datasets means long waits between queries and long waits to run each program.
Spark was introduced by the Apache Software Foundation to speed up the Hadoop computational process.
Contrary to a common belief, Spark is not a modified version of Hadoop and does not really depend on Hadoop, because it has its own cluster management. Hadoop is just one of the ways to deploy Spark.
Spark can use Hadoop in two ways, for storage and for processing. Since Spark has its own cluster management, it typically uses Hadoop for storage only.

Apache Spark

Apache Spark is a lightning-fast cluster computing technology, designed for fast computation. It is based on Hadoop MapReduce and extends the MapReduce model to use it efficiently for more types of computations, including interactive queries and stream processing. The main feature of Spark is its in-memory cluster computing, which increases the processing speed of an application.
Spark is designed to cover a wide range of workloads such as batch applications, iterative algorithms, interactive queries and streaming. Apart from supporting all these workloads in a single system, it reduces the management burden of maintaining separate tools.

Evolution of Apache Spark

Spark began in 2009 as a research project in UC Berkeley's AMPLab, started by Matei Zaharia. It was open-sourced in 2010 under a BSD license, donated to the Apache Software Foundation in 2013, and became a top-level Apache project in February 2014.

Features of Apache Spark

Apache Spark has the following features.


• Speed − Spark helps run an application on a Hadoop cluster up to 100 times faster in memory, and 10 times faster when running on disk. This is possible because it reduces the number of read/write operations to disk and stores intermediate processing data in memory.
• Supports multiple languages − Spark provides built-in APIs in Java, Scala, and Python, so you can write applications in different languages. Spark also comes with 80 high-level operators for interactive querying, as the short example after this list illustrates.
• Advanced Analytics − Spark supports not only 'map' and 'reduce' but also SQL queries, streaming data, machine learning (ML), and graph algorithms.
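
To make the "multiple languages" and "high-level operators" points concrete, here is a minimal word-count sketch in Scala. It assumes a spark-shell session, where the SparkContext is available as sc, and the HDFS path is only a placeholder.

// Runs in the Scala spark-shell, where an active SparkContext is provided as sc.
// The input path below is a placeholder; substitute any text file visible to the cluster.
val lines = sc.textFile("hdfs:///data/sample.txt")            // load a text file as an RDD of lines
val words = lines.flatMap(line => line.split("\\s+"))         // split each line into words
val counts = words.map(word => (word, 1)).reduceByKey(_ + _)  // sum the occurrences per word
counts.take(10).foreach(println)                              // trigger the computation and print a sample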

Spark Built on Hadoop

There are three ways in which Spark can be built with Hadoop components and deployed, as explained below.


• Standalone − In a standalone deployment, Spark occupies the place on top of HDFS (Hadoop Distributed File System), and space is allocated for HDFS explicitly. Here, Spark and MapReduce run side by side to cover all Spark jobs on the cluster.
• Hadoop Yarn − In a Hadoop YARN deployment, Spark simply runs on YARN without any pre-installation or root access required. This helps integrate Spark into the Hadoop ecosystem or Hadoop stack, and allows other components to run on top of the stack.
• Spark in MapReduce (SIMR) − Spark in MapReduce is used to launch Spark jobs in addition to a standalone deployment. With SIMR, a user can start Spark and use its shell without any administrative access.

Components of Spark

The Spark stack is made up of the following components.


Apache Spark Core
Spark Core is the underlying general execution engine for the Spark platform, upon which all other functionality is built. It provides in-memory computing and the ability to reference datasets in external storage systems, and it is responsible for task scheduling, memory management, fault recovery, and interaction with storage systems.
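
As a sketch of how an application talks to Spark Core, the following minimal Scala program creates a SparkContext and runs one parallel job. The application name and the local[*] master URL are illustrative choices; in practice the master is usually supplied by spark-submit rather than hard-coded.

import org.apache.spark.{SparkConf, SparkContext}

object SimpleApp {
  def main(args: Array[String]): Unit = {
    // Configuration for the application; the name and master are placeholders.
    val conf = new SparkConf()
      .setAppName("simple-spark-core-app")
      .setMaster("local[*]")             // run locally using all available cores

    val sc = new SparkContext(conf)      // entry point to Spark Core

    // A trivial job: distribute a local collection and sum it in parallel.
    val total = sc.parallelize(1 to 1000).reduce(_ + _)
    println(s"sum = $total")

    sc.stop()
  }
}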
Spark SQL
Spark SQL is a component on top of Spark Core that introduces a new data abstraction called
SchemaRDD, which provides support for structured and semi-structured data.
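A small Scala sketch of the idea, written against the Spark 1.x API (SchemaRDD was later renamed DataFrame, and registerTempTable has since been replaced by newer methods, so names vary by version). It assumes a spark-shell session with sc available; the Person data is made up.

import org.apache.spark.sql.SQLContext

// Assumes an existing SparkContext named sc (for example, inside the spark-shell).
case class Person(name: String, age: Int)

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

// Give an ordinary RDD a schema so it can be queried as structured data.
val people = sc.parallelize(Seq(Person("Ada", 36), Person("Linus", 29))).toDF()
people.registerTempTable("people")

// Run a SQL query over the in-memory dataset.
sqlContext.sql("SELECT name FROM people WHERE age > 30").show()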
Spark Streaming
Spark Streaming leverages Spark Core's fast scheduling capability to perform streaming analytics.
It ingests data in mini-batches and performs RDD (Resilient Distributed Datasets) transformations
on those mini-batches of data.
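The following Scala sketch shows the mini-batch model: the stream is cut into 5-second batches, and each batch is processed with ordinary RDD-style transformations. The host and port of the text source are placeholders.

import org.apache.spark.streaming.{Seconds, StreamingContext}

// Assumes an existing SparkContext named sc; host and port are placeholders.
val ssc = new StreamingContext(sc, Seconds(5))       // 5-second mini-batches

val lines = ssc.socketTextStream("localhost", 9999)  // ingest a stream of text lines
val words = lines.flatMap(_.split("\\s+"))
val counts = words.map(word => (word, 1)).reduceByKey(_ + _)
counts.print()                                       // print a sample of each batch's result

ssc.start()                                          // start receiving and processing
ssc.awaitTermination()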
MLlib (Machine Learning Library)
MLlib is a distributed machine learning framework built on top of Spark, taking advantage of Spark's distributed, memory-based architecture. According to benchmarks run by the MLlib developers against Alternating Least Squares (ALS) implementations, Spark MLlib is nine times as fast as the Hadoop disk-based version of Apache Mahout (before Mahout gained a Spark interface).
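As an illustration of the ALS algorithm mentioned above, here is a toy Scala sketch using MLlib's recommendation API. The ratings are invented data and the parameter values are arbitrary; it assumes a spark-shell session with sc available.

import org.apache.spark.mllib.recommendation.{ALS, Rating}

// Toy ratings: Rating(user, product, rating). The data and parameters are illustrative only.
val ratings = sc.parallelize(Seq(
  Rating(1, 10, 4.0),
  Rating(1, 20, 1.0),
  Rating(2, 10, 5.0)
))

// Train a small collaborative-filtering model with Alternating Least Squares:
// rank = 5 latent factors, 10 iterations, regularization lambda = 0.01.
val model = ALS.train(ratings, 5, 10, 0.01)

// Predict how user 2 would rate product 20.
println(model.predict(2, 20))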
GraphX
GraphX is a distributed graph-processing framework on top of Spark. It provides an API for expressing graph computation that can model user-defined graphs using the Pregel abstraction API, along with an optimized runtime for this abstraction.
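A brief GraphX sketch in Scala: it builds a tiny property graph from toy vertex and edge RDDs and runs connected components, one of the Pregel-based algorithms shipped with GraphX. It assumes a spark-shell session with sc available.

import org.apache.spark.graphx.{Edge, Graph}

// Toy data: vertex IDs must be Longs; the attributes here are just names and labels.
val vertices = sc.parallelize(Seq((1L, "Alice"), (2L, "Bob"), (3L, "Carol")))
val edges = sc.parallelize(Seq(Edge(1L, 2L, "follows"), Edge(2L, 3L, "follows")))

val graph = Graph(vertices, edges)

// Connected components is implemented on top of the Pregel abstraction.
val components = graph.connectedComponents().vertices
components.collect().foreach(println)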

Resilient Distributed Datasets

Resilient Distributed Datasets (RDD) is a fundamental data structure of Spark. It is an immutable distributed collection of objects. Each dataset in an RDD is divided into logical partitions, which may be computed on different nodes of the cluster. RDDs can contain any type of Python, Java, or Scala objects, including user-defined classes.
Formally, an RDD is a read-only, partitioned collection of records. RDDs can be created through
deterministic operations on either data on stable storage or other RDDs. RDD is a fault-tolerant
collection of elements that can be operated on in parallel.
There are two ways to create RDDs − parallelizing an existing collection in your driver program,
or referencing a dataset in an external storage system, such as a shared file system, HDFS, HBase,
or any data source offering a Hadoop Input Format.
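Both creation methods look like this in the Scala spark-shell, where the SparkContext is available as sc; the HDFS path is a placeholder.

// 1. Parallelize an existing collection in the driver program.
val numbers = sc.parallelize(Seq(1, 2, 3, 4, 5))

// 2. Reference a dataset in an external storage system (placeholder path).
val logLines = sc.textFile("hdfs:///logs/app.log")

println(numbers.sum())      // runs a job over the parallelized collection
println(logLines.count())   // runs a job over the external file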
Spark makes use of the concept of RDD to achieve faster and more efficient MapReduce operations. Let us first discuss how MapReduce operations take place and why they are not so efficient.

Data Sharing is Slow in MapReduce

MapReduce is widely adopted for processing and generating large datasets with a parallel,
distributed algorithm on a cluster. It allows users to write parallel computations, using a set of
high-level operators, without having to worry about work distribution and fault tolerance.
Unfortunately, in most current frameworks, the only way to reuse data between computations
(Ex − between two MapReduce jobs) is to write it to an external stable storage system (Ex − HDFS).
Although this framework provides numerous abstractions for accessing a cluster’s computational
resources, users still want more.
Both iterative and interactive applications require faster data sharing across parallel jobs. Data sharing is slow in MapReduce due to replication, serialization, and disk I/O. As for the storage system, most Hadoop applications spend more than 90% of their time doing HDFS read-write operations.

Iterative Operations on MapReduce

Multi-stage applications need to reuse intermediate results across multiple computations. When iterative operations run on MapReduce, each iteration writes its results to stable storage and the next iteration reads them back. This incurs substantial overheads due to data replication, disk I/O, and serialization, which makes the system slow.
Interactive Operations on MapReduce

When a user runs ad-hoc queries on the same subset of data, each query performs disk I/O against stable storage, which can dominate application execution time.

Data Sharing using Spark RDD

Data sharing is slow in MapReduce due to replication, serialization, and disk I/O. Most Hadoop applications spend more than 90% of their time doing HDFS read-write operations. Recognizing this problem, researchers developed a specialized framework called Apache Spark. The key idea of Spark is the Resilient Distributed Dataset (RDD), which supports in-memory computation. Spark stores the state of memory as an object across jobs, and the object is sharable between those jobs. Data sharing in memory is 10 to 100 times faster than over the network and disk.
Let us now try to find out how iterative and interactive operations take place in Spark RDD.

Iterative Operations on Spark RDD

In iterative operations on Spark RDD, intermediate results are stored in distributed memory instead of stable storage (disk), which makes the system faster, as the sketch below illustrates.
Note − If the distributed memory (RAM) is not sufficient to store intermediate results (the state of the job), those results are spilled to disk.
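
A minimal Scala sketch of the pattern, assuming a spark-shell session with sc available; the HDFS path and the computation itself are only illustrative. MEMORY_AND_DISK mirrors the note above: partitions stay in RAM and spill to disk only when memory runs short.

import org.apache.spark.storage.StorageLevel

val points = sc.textFile("hdfs:///data/points.txt").map(_.toDouble)
points.persist(StorageLevel.MEMORY_AND_DISK)   // keep in memory, spill to disk if needed

var estimate = 0.0
for (i <- 1 to 10) {
  // each pass reuses the cached RDD instead of re-reading the file from HDFS
  estimate = points.map(p => (p + estimate) / 2).mean()
}
println(estimate)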

Interactive Operations on Spark RDD

For interactive operations on Spark RDD, if different queries are run on the same set of data repeatedly, that particular data can be kept in memory for better execution times.

By default, each transformed RDD may be recomputed each time you run an action on it. However, you may also persist an RDD in memory, in which case Spark will keep the elements around on the cluster for much faster access the next time you query it. There is also support for persisting RDDs on disk, or replicating them across multiple nodes.
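In code, persistence is requested explicitly; the Scala sketch below assumes a spark-shell session with sc available and a placeholder path. An RDD's storage level can only be set once, so the alternatives are shown as comments.

import org.apache.spark.storage.StorageLevel

val queryData = sc.textFile("hdfs:///data/events.txt")

// Choose one storage level per RDD (it cannot be changed once set):
queryData.persist(StorageLevel.MEMORY_AND_DISK)   // keep in RAM, spill to disk if needed
// other options include MEMORY_ONLY (what cache() uses), DISK_ONLY,
// and replicated variants such as MEMORY_ONLY_2

// The first action materializes and caches the data; later queries reuse it.
queryData.count()
queryData.filter(_.contains("ERROR")).count()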
Running Spark Jobs on YARN

When running Spark on YARN, each Spark executor runs as a YARN container. Where MapReduce schedules a container and fires up a JVM for each task, Spark hosts multiple tasks within the same container. This approach enables several orders of magnitude faster task startup time.

Spark supports two modes for running on YARN, "yarn-cluster" mode and "yarn-client" mode. Broadly, yarn-cluster mode makes sense for production jobs, while yarn-client mode makes sense for interactive and debugging uses where you want to see your application's output immediately.

Understanding the difference requires an understanding of YARN's Application Master concept. In YARN, each application instance has an Application Master process, which is the first container started for that application. The Application Master is responsible for requesting resources from the ResourceManager and, when they are allocated, for telling NodeManagers to start containers on its behalf. Application Masters obviate the need for an active client: the process that starts the application can go away, and coordination continues from a process managed by YARN running on the cluster.

In yarn-cluster mode, the driver runs in the Application Master. This means that the same process is responsible for both driving the application and requesting resources from YARN, and this process runs inside a YARN container. The client that starts the app doesn't need to stick around for its entire lifetime.


Figure: yarn-cluster mode

The yarn-cluster mode, however, is not well suited to using Spark interactively. Spark applications that require user input, like spark-shell and PySpark, need the Spark driver to run inside the client process that initiates the Spark application. In yarn-client mode, the Application Master is merely present to request executor containers from YARN. The client communicates with those containers to schedule work after they start:


Figure: yarn-client mode

Different Deployment Modes across the cluster

In yarn-cluster mode, the Spark client submits the Spark application to YARN, and both the Spark driver and the Spark executors run under the supervision of YARN. In yarn-client mode, only the Spark executors are under the supervision of YARN: the YARN ApplicationMaster requests resources for the Spark executors only, while the driver program runs in the client process and has nothing to do with YARN.
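
In practice the two modes are selected when the application is submitted. The class name and jar below are placeholders; older Spark releases spelled the modes as --master yarn-cluster and --master yarn-client, while newer releases use --master yarn together with --deploy-mode.

# Production job: the driver runs inside the YARN Application Master (yarn-cluster mode).
spark-submit --master yarn --deploy-mode cluster --class com.example.MyApp my-app.jar

# Interactive or debugging use: the driver runs in the local client process (yarn-client mode).
spark-submit --master yarn --deploy-mode client --class com.example.MyApp my-app.jar

# Interactive shells such as spark-shell always use client mode.
spark-shell --master yarn --deploy-mode client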
