Apache Spark
An Introduction
What is Apache Spark?
Apache Spark is a fast, general-purpose engine for
large-scale data processing.
THERE ARE FOUR REASONS TO USE SPARK
1. Speed
2. Ease of use
3. Generality
4. Platform-agnostic
SPEED
Spark's speed comes largely from computing in memory; as described later, in-memory data sharing is 10 to 100 times faster than going through network and disk.
EASE OF USE
● Spark offers APIs in Java, Scala, Python, and R natively, as well as ANSI SQL, making it fast and easy to build applications, including parallel and streaming ones.
● We can use popular interfaces like Jupyter Notebook and Apache Zeppelin, as well as the command shell.
● Using just two lines of code, you can count all the words in a large file, as sketched below.
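As a minimal PySpark sketch of that claim (the input path "input.txt" is a placeholder, not from the original), the word count itself is the two chained lines after the session setup:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("WordCount").getOrCreate()

# The two lines: read the file, then count each word.
text = spark.sparkContext.textFile("input.txt")  # placeholder path
counts = text.flatMap(lambda line: line.split()).map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)

print(counts.take(10))  # trigger the computation and show a sample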
GENERALITY
Spark combines SQL, streaming, and complex analytics in a single engine.
PLATFORM-AGNOSTIC
Besides running in nearly any environment, you can access data in the Hadoop Distributed File System (HDFS), Cassandra, HBase, Hive, or any other Hadoop data source.
For example, you can reference a dataset in an external storage system, such as a shared file system, HDFS, HBase, or any data source offering a Hadoop InputFormat.
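As a hedged PySpark sketch of reading from such a source (the namenode host, port, and file path below are placeholder values):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ExternalSources").getOrCreate()

# Reference a text dataset stored in HDFS; host, port, and path
# are placeholders, not real values from this document.
lines = spark.sparkContext.textFile("hdfs://namenode:9000/data/events.txt")
print(lines.count())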
MapReduce
MapReduce is a programming model for processing and generating large datasets with a parallel, distributed algorithm on a cluster.
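To make the pattern concrete, here is a toy single-machine sketch of the map and reduce phases in plain Python; the data and function names are illustrative, and no Hadoop cluster is involved:

from collections import defaultdict

def map_phase(lines):
    # Map step: emit a (word, 1) pair for every word in the input.
    for line in lines:
        for word in line.split():
            yield (word, 1)

def reduce_phase(pairs):
    # Shuffle + reduce step: group pairs by key and sum the counts.
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

lines = ["spark is fast", "spark is general"]
print(reduce_phase(map_phase(lines)))
# {'spark': 2, 'is': 2, 'fast': 1, 'general': 1}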
Data sharing in MapReduce
Data sharing is slow in MapReduce due to replication, serialization, and disk I/O. In terms of storage, most Hadoop applications spend more than 90% of their time doing HDFS read-write operations.
Iterative Operations on MapReduce
● Multi-stage applications reuse intermediate results across multiple computations.
● In MapReduce, each iteration must write its intermediate results to stable storage (HDFS) before the next stage can read them back.
● This incurs substantial overhead from data replication, disk I/O, and serialization, which makes the system slow.
Interactive Operations on MapReduce
● Each query performs disk I/O against stable storage, which can dominate application execution time.
● Running different queries on the same data therefore repeats the same HDFS reads and writes every time.
Data Sharing using Spark RDD
● RDDs support in-memory computation: they store state in memory as an object across jobs, and that object is shareable between those jobs.
● Data sharing in memory is 10 to 100 times faster than going through network and disk.
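A minimal PySpark sketch of in-memory sharing (the file path and the "ERROR" filter are made up for illustration): the first job materializes and caches the RDD, and the second job reuses the cached partitions instead of re-reading the file.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("RDDSharing").getOrCreate()

# "events.txt" is a placeholder path.
logs = spark.sparkContext.textFile("events.txt").cache()

total = logs.count()                                   # job 1: reads the file, fills the cache
errors = logs.filter(lambda l: "ERROR" in l).count()   # job 2: served from memory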
Iterative Operations on Spark RDD
For iterative operations, Spark stores intermediate results in distributed memory instead of stable storage (disk), which makes the system faster. If the distributed memory (RAM) is not sufficient to hold the intermediate results (the state of the job), Spark stores those results on disk.
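A hedged sketch of an iterative loop in PySpark (the dataset and the per-iteration computation are invented for illustration); the MEMORY_AND_DISK storage level matches the spill-to-disk behavior described above:

from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Iterative").getOrCreate()

# Toy dataset; persist so each iteration reads cached partitions,
# spilling to disk only if RAM runs short.
data = spark.sparkContext.parallelize(range(1_000_000)).persist(StorageLevel.MEMORY_AND_DISK)

result = 0.0
for step in range(10):
    # Each pass reuses the in-memory data instead of recomputing it.
    result = data.map(lambda x: x * 0.5).sum()
print(result)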
Interactive Operations on Spark RDD
● If different queries are run repeatedly on the same set of data, that data can be kept in memory for better execution times.
● By default, each transformed RDD may be
recomputed each time you run an action on it.
● However, you may also persist an RDD in memory, in which case Spark will keep the elements around on the cluster for much faster access the next time you query it.
● There is also support for persisting RDDs on disk, or replicating them across multiple nodes, as sketched below.
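A short PySpark sketch of these persistence options (the path "data.txt" is a placeholder; the storage level names are real PySpark constants):

from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Persistence").getOrCreate()

words = spark.sparkContext.textFile("data.txt").flatMap(lambda l: l.split())

# Persist on disk only; MEMORY_ONLY_2 would instead replicate
# each in-memory partition on two nodes.
words.persist(StorageLevel.DISK_ONLY)

print(words.count())             # first action materializes and persists the RDD
print(words.distinct().count())  # later queries reuse the persisted data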