
Apache Spark

Introduction
Apache Spark is a lightning-fast cluster computing technology, designed for fast
computation. It builds on the Hadoop MapReduce model and extends it to efficiently
support more types of computation, including interactive queries and stream
processing. The main feature of Spark is its in-memory cluster computing, which
increases the processing speed of applications.
Spark is designed to cover a wide range of workloads such as batch applications,
iterative algorithms, interactive queries, and streaming. Apart from supporting all these
workloads in a single system, it reduces the management burden of maintaining
separate tools.

Features of Apache Spark:

1. Speed − Spark helps run an application on a Hadoop cluster up to 100 times
faster in memory, and up to 10 times faster when running on disk. This is achieved
by reducing the number of read/write operations to disk and keeping the
intermediate processing data in memory.
2. Supports multiple languages − Spark provides built-in APIs in Java, Scala, and
Python, so applications can be written in different languages. Spark also offers
more than 80 high-level operators for interactive querying (see the sketch after
this list).
3. Advanced Analytics − Spark supports not only 'map' and 'reduce' but also SQL
queries, streaming data, machine learning (ML), and graph algorithms.
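
As an illustration of these high-level operators, here is a minimal PySpark word count;
the input file path is a placeholder, and the session settings are chosen only for
illustration:

    from pyspark.sql import SparkSession

    # Start a local Spark session (master and appName chosen for illustration)
    spark = SparkSession.builder.master("local[*]").appName("WordCount").getOrCreate()
    sc = spark.sparkContext

    # High-level operators (flatMap, map, reduceByKey) replace hand-written MapReduce code
    counts = (sc.textFile("input.txt")                  # placeholder input file
                .flatMap(lambda line: line.split())
                .map(lambda word: (word, 1))
                .reduceByKey(lambda a, b: a + b))

    print(counts.take(10))
    spark.stop()

The same pipeline can be written almost identically in Scala or Java against the same API.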

Spark deployment:

1. Standalone − In a standalone deployment, Spark occupies the place on top of
HDFS (Hadoop Distributed File System), and space is allocated for HDFS explicitly.
Here, Spark and MapReduce run side by side to cover all Spark jobs on the cluster.
2. Hadoop YARN − In a YARN deployment, Spark simply runs on YARN without any
pre-installation or root access required. This helps integrate Spark into the Hadoop
ecosystem or Hadoop stack and allows other components to run on top of the stack.
3. Spark in MapReduce (SIMR) − Spark in MapReduce is used to launch Spark jobs
in addition to a standalone deployment. With SIMR, users can start Spark and use
its shell without any administrative access.
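
As a rough sketch of how an application selects one of these deployment modes, the
master URL passed when building the session decides the cluster manager; the host
name and port below are placeholders, not values from this document:

    from pyspark.sql import SparkSession

    # Standalone cluster: point at the standalone master (placeholder host/port)
    # spark = SparkSession.builder.master("spark://master-host:7077").appName("App").getOrCreate()

    # Hadoop YARN: select YARN as the cluster manager
    # (HADOOP_CONF_DIR must point at the cluster configuration)
    # spark = SparkSession.builder.master("yarn").appName("App").getOrCreate()

    # Local mode, useful for testing on a single machine
    spark = SparkSession.builder.master("local[*]").appName("App").getOrCreate()
    print(spark.sparkContext.master)
    spark.stop()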

Components of Spark:

Apache Spark Core:

Spark Core is the underlying general execution engine for the Spark platform, upon
which all other functionality is built. It provides in-memory computing and the ability to
reference datasets in external storage systems.

Spark SQL:

Spark SQL is a component on top of Spark Core that introduces a data abstraction
called SchemaRDD (which later evolved into the DataFrame), providing support for
structured and semi-structured data.
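
A small sketch of the structured-data API that grew out of SchemaRDD; the sample
rows and column names below are invented for illustration:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[*]").appName("SQLDemo").getOrCreate()

    # Build a DataFrame from an in-memory collection (hypothetical sample data)
    df = spark.createDataFrame(
        [("Alice", 34), ("Bob", 45), ("Cara", 29)],
        ["name", "age"])

    # Query it either with DataFrame operators or with plain SQL
    df.filter(df.age > 30).show()
    df.createOrReplaceTempView("people")
    spark.sql("SELECT name FROM people WHERE age > 30").show()

    spark.stop()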

Spark Streaming:

Spark Streaming leverages Spark Core's fast scheduling capability to perform streaming
analytics. It ingests data in mini-batches and performs RDD (Resilient Distributed
Datasets) transformations on those mini-batches of data.
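
A minimal sketch of this mini-batch model, assuming a text source is listening on
localhost port 9999 (for example, one started with netcat); the port and batch interval
are arbitrary choices:

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext("local[2]", "StreamingWordCount")
    ssc = StreamingContext(sc, batchDuration=5)        # 5-second mini-batches

    # Each mini-batch of lines becomes an RDD; ordinary RDD transformations apply
    lines = ssc.socketTextStream("localhost", 9999)    # assumed test source
    counts = (lines.flatMap(lambda line: line.split())
                   .map(lambda word: (word, 1))
                   .reduceByKey(lambda a, b: a + b))
    counts.pprint()

    ssc.start()
    ssc.awaitTermination()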

MLlib (Machine Learning Library):

MLlib is a distributed machine learning framework on top of Spark, made possible by
the distributed memory-based Spark architecture. According to benchmarks done by
the MLlib developers against the Alternating Least Squares (ALS) implementations,
Spark MLlib is nine times as fast as the Hadoop disk-based version of Apache Mahout
(before Mahout gained a Spark interface).
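
Since the benchmark above concerns ALS, here is a rough sketch of Spark's ALS
recommender on a toy ratings table; the column names, data, and parameter values
are invented for illustration:

    from pyspark.sql import SparkSession
    from pyspark.ml.recommendation import ALS

    spark = SparkSession.builder.master("local[*]").appName("ALSDemo").getOrCreate()

    # Tiny hypothetical ratings table: (userId, itemId, rating)
    ratings = spark.createDataFrame(
        [(0, 0, 4.0), (0, 1, 2.0), (1, 1, 3.0), (1, 2, 5.0), (2, 0, 1.0)],
        ["userId", "itemId", "rating"])

    als = ALS(userCol="userId", itemCol="itemId", ratingCol="rating",
              rank=5, maxIter=5, coldStartStrategy="drop")
    model = als.fit(ratings)
    model.recommendForAllUsers(2).show(truncate=False)

    spark.stop()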

GraphX:

GraphX is a distributed graph-processing framework on top of Spark. It provides an API
for expressing graph computations that can model user-defined graphs using the
Pregel abstraction, and it provides an optimized runtime for this abstraction.

RDD
A Resilient Distributed Dataset (RDD) is the fundamental data structure of Spark. It is an
immutable distributed collection of objects. Each dataset in an RDD is divided into logical
partitions, which may be computed on different nodes of the cluster. RDDs can contain
any type of Python, Java, or Scala objects, including user-defined classes.
Formally, an RDD is a read-only, partitioned collection of records. RDDs can be created
through deterministic operations either on data in stable storage or on other RDDs. An
RDD is a fault-tolerant collection of elements that can be operated on in parallel.
There are two ways to create RDDs − parallelizing an existing collection in your driver
program, or referencing a dataset in an external storage system, such as a shared file
system, HDFS, HBase, or any data source offering a Hadoop Input Format.
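
In PySpark, the two creation paths mentioned above look roughly like this; the file path
is a placeholder:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[*]").appName("RDDDemo").getOrCreate()
    sc = spark.sparkContext

    # 1. Parallelize an existing collection in the driver program
    rdd1 = sc.parallelize([1, 2, 3, 4, 5], numSlices=2)

    # 2. Reference a dataset in external storage (HDFS, local files, etc.)
    rdd2 = sc.textFile("hdfs:///data/sample.txt")      # placeholder path

    print(rdd1.map(lambda x: x * x).collect())
    spark.stop()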
Spark makes use of the concept of RDD to achieve faster and more efficient MapReduce
operations. Let us first discuss how MapReduce operations take place and why they
are not so efficient.

Why RDD?

Data sharing is slow in MapReduce due to replication, serialization, and disk I/O. Most
Hadoop applications spend more than 90% of their time doing HDFS read/write
operations.

Recognizing this problem, researchers developed a specialized framework called
Apache Spark. The key idea of Spark is the Resilient Distributed Dataset (RDD), which
supports in-memory processing: Spark stores the state of memory as an object across
jobs, and the object is shareable between those jobs. Data sharing in memory is 10 to
100 times faster than sharing over the network and disk.

Iterative Operations of Spark RDD:

In iterative operations on Spark RDDs, intermediate results are stored in distributed
memory instead of stable storage (disk), which makes the system faster; the sketch
below shows this in code.
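
In code, keeping intermediate results in distributed memory amounts to caching the RDD
that each iteration reuses; a rough sketch, in which the per-iteration computation is only
a stand-in for a real iterative algorithm:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[*]").appName("IterDemo").getOrCreate()
    sc = spark.sparkContext

    data = sc.parallelize(range(1000)).cache()         # kept in memory across iterations

    total = 0
    for i in range(10):
        # Each pass reuses the cached partitions instead of re-reading from disk
        total += data.map(lambda x: x * i).sum()

    print(total)
    spark.stop()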
Features of RDD:

1. In-Memory Computation: It improves performance by orders of magnitude.
2. Lazy Evaluation: All transformations on RDDs are lazy, i.e., they do not compute
their results right away (see the sketch after this list).
3. Fault Tolerance: RDDs track data lineage information to rebuild lost data
automatically.
4. Immutability: Data can be created or retrieved at any time, but once defined, an
RDD's contents cannot be changed.
5. Partitioning: The partition is the fundamental unit of parallelism in an RDD.
6. Persistence: Users can reuse RDDs and choose a storage strategy for them.
7. Coarse-Grained Operations: Operations such as map, filter, or groupBy are
applied to all elements of a dataset at once.
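
Lazy evaluation and persistence can be seen directly in a few lines: transformations only
record lineage, and nothing executes until an action is called. A minimal sketch:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[*]").appName("LazyDemo").getOrCreate()
    sc = spark.sparkContext

    rdd = sc.parallelize(range(10))
    squares = rdd.map(lambda x: x * x)            # transformation: nothing runs yet
    evens = squares.filter(lambda x: x % 2 == 0)  # still nothing runs
    evens.persist()                               # choose a storage strategy for reuse

    print(evens.collect())                        # action: the lineage executes now
    print(evens.count())                          # reuses the persisted result

    spark.stop()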

Installation of Pyspark:
Prerequisite:

1. Java version 1.8 should be installed on your system.
2. The Anaconda distribution should be installed on the system.

From https://spark.apache.org/downloads.html, download the required version of Spark.

Extract it and place the extracted folder directly in the C drive, e.g., C:\Spark


From https://github.com/cdarlint/winutils, download the winutils.exe file matching the
Hadoop version of the Spark build you installed.

Move the downloaded winutils.exe to the \bin folder of the Spark distribution, i.e.,

● C:\Spark\spark-3.1.1-bin-hadoop2.7\bin

Set the environment variables:

Go to Settings and select System.

Search for "env" in the search bar in the left corner.

Select "Edit the system environment variables"; in the System Properties window, open
the Advanced tab and click Environment Variables:

Variable                       Value
SPARK_HOME                     C:\Spark\spark-3.1.1-bin-hadoop2.7
PYSPARK_DRIVER_PYTHON          jupyter
PYSPARK_DRIVER_PYTHON_OPTS     notebook
HADOOP_HOME                    C:\hadoop\bin
JAVA_HOME                      C:\Java\jdk1.8.0_291

Enter the above variables and their values under the User variables for your user account.
Make sure the respective paths specified in the values are also present in the Path entry
under System variables.

To check whether Spark is installed, run the following command in the Windows command
prompt:

● spark-shell --version

To run Jupyter Notebook, open the Anaconda command prompt and run the command:

● jupyter notebook

In the notebook, run the following code to check whether PySpark is running successfully:
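
A minimal check, assuming the findspark helper package is installed (pip install findspark);
if PySpark is already on the Python path, creating a SparkSession directly also works:

    import findspark
    findspark.init()                      # locates Spark via SPARK_HOME

    from pyspark.sql import SparkSession

    # Start a local session and print the version to confirm the installation
    spark = SparkSession.builder.master("local[*]").appName("InstallCheck").getOrCreate()
    print(spark.version)
    spark.stop()

If this runs and prints the Spark version (e.g., 3.1.1), the installation is working.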

Reference:
1. https://changhsinlee.com/install-pyspark-windows-jupyter/
2. https://www.youtube.com/watch?v=AB2nUrKYRhw
3. https://www.tutorialspoint.com/apache_spark/apache_spark_rdd.htm
