0% found this document useful (0 votes)

57 views

Features of Apache Spark

Apache Spark is a fast, general-purpose cluster computing system. It uses in-memory computing to improve performance over traditional disk-based approaches like Hadoop MapReduce. Spark core provides functionality for distributed task dispatching, scheduling, and memory management. Additional Spark components support SQL queries, streaming data, machine learning, and graph processing. Resilient Distributed Datasets (RDDs) are Spark's fundamental data structure, allowing data to be partitioned across a cluster and cached in memory for faster iterative algorithms.

Uploaded by

Sailesh Chauhan

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

57 views

Features of Apache Spark

Uploaded by

Sailesh Chauhan

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 7

Apache Spark

Introduction
Apache Spark is a lightning-fast cluster computing technology, designed for fast
computation. It is based on Hadoop MapReduce and it extends the MapReduce model
to efficiently use it for more types of computations, which includes interactive queries
and stream processing. The main feature of Spark is its in-memory cluster computing
that increases the processing speed of an application.
Spark is designed to cover a wide range of workloads such as batch applications,
iterative algorithms, interactive queries and streaming. Apart from supporting all these
workloads in a respective system, it reduces the management burden of maintaining
separate tools.

Features of Apache Spark:

1. Speed − Spark helps to run an application in Hadoop cluster, up to 100 times

faster in memory, and 10 times faster when running on disk. This is possible by
reducing the number of read/write operations to disk. It stores the intermediate
processing data in memory.
2. Supports multiple languages − Spark provides built-in APIs in Java, Scala, or
Python. Therefore, you can write applications in different languages. Spark
comes up with 80 high-level operators for interactive querying.
3. Advanced Analytics − Spark not only supports ‘Map’ and ‘reduce’. It also
supports SQL queries, Streaming data, Machine learning (ML), and Graph
algorithms.

Spark deployment:

1. Standalone − Spark Standalone deployment means Spark occupies the place on

top of HDFS(Hadoop Distributed File System) and space is allocated for HDFS,
explicitly. Here, Spark and MapReduce will run side by side to cover all spark
jobs on the cluster.
2. Hadoop Yarn − Hadoop Yarn deployment means, simply, spark runs on Yarn
without any pre-installation or root access required. It helps to integrate Spark
into the Hadoop ecosystem or Hadoop stack. It allows other components to run
on top of the stack.
3. Spark in MapReduce (SIMR) − Spark in MapReduce is used to launch spark jobs
in addition to standalone deployment. With SIMR, users can start Spark and use
its shell without any administrative access.

Components of Spark:

Apache Spark Core:

Spark Core is the underlying general execution engine for spark platform that all other
functionality is built upon. It provides In-Memory computing and referencing datasets in
external storage systems.

Spark SQL:

Spark SQL is a component on top of Spark Core that introduces a new data abstraction
called SchemaRDD, which provides support for structured and semi-structured data.

Spark Streaming:

Spark Streaming leverages Spark Core's fast scheduling capability to perform streaming
analytics. It ingests data in mini-batches and performs RDD (Resilient Distributed
Datasets) transformations on those mini-batches of data.

MLlib (Machine Learning Library):

MLlib is a distributed machine learning framework above Spark because of the

distributed memory-based Spark architecture. It is, according to benchmarks, done by
the MLlib developers against the Alternating Least Squares (ALS) implementations.
Spark MLlib is nine times as fast as the Hadoop disk-based version of Apache Mahout
(before Mahout gained a Spark interface).

GraphX:

GraphX is a distributed graph-processing framework on top of Spark. It provides an API

for expressing graph computation that can model the user-defined graphs by using the
Pregel abstraction API. It also provides an optimized runtime for this abstraction.

RDD
Resilient Distributed Datasets (RDD) is a fundamental data structure of Spark. It is an
immutable distributed collection of objects. Each dataset in RDD is divided into logical
partitions, which may be computed on different nodes of the cluster. RDDs can contain
any type of Python, Java, or Scala objects, including user-defined classes.
Formally, an RDD is a read-only, partitioned collection of records. RDDs can be created
through deterministic operations on either data on stable storage or other RDDs. RDD
is a fault-tolerant collection of elements that can be operated on in parallel.
There are two ways to create RDDs − parallelizing an existing collection in your driver
program, or referencing a dataset in an external storage system, such as a shared file
system, HDFS, HBase, or any data source offering a Hadoop Input Format.
Spark makes use of the concept of RDD to achieve faster and efficient MapReduce
operations. Let us first discuss how MapReduce operations take place and why they
are not so efficient.

Why RDD?

Data sharing is slow in MapReduce due to replication, serialization, and disk IO. Most of
the Hadoop applications, they spend more than 90% of the time doing HDFS read-write
operations.

Recognizing this problem, researchers developed a specialized framework called

Apache Spark. The key idea of spark is Resilient Distributed Datasets (RDD); it
supports in-memory processing computation. This means, it stores the state of memory
as an object across the jobs and the object is shareable between those jobs. Data
sharing in memory is 10 to 100 times faster than network and Disk. use RDD?

Iterative Operations of Spark RDD:

The illustration given below shows the iterative operations on Spark RDD. It will store
intermediate results in a distributed memory instead of Stable storage (Disk) and make
the system faster.
Features of RDD:

1. In-Memory Computations: It improves the performance by an order of

magnitudes.
2. Lazy Evaluation: All transformations in RDDs are lazy, i.e, doesn’t compute their
results right away.
3. Fault Tolerant: RDDs track data lineage information to rebuild lost data
automatically.
4. Immutability: Data can be created or retrieved anytime and once defined, its
value can’t be changed.
5. Partitioning: It is the fundamental unit of parallelism in RDD.
6. Persistence: Users can reuse RDDs and choose a storage strategy for them.
7. Coarse-Grained Operations: These operations are applied to all elements in data
sets through maps or filters or group by operation.

Installation of Pyspark:
Prerequisite:

1. Should have Java version 1.8 in your system.

2. Anaconda Distribution to installed in the system

From https://spark.apache.org/downloads.html

Download the required version of spark.

Extract it to folder and paste it directly into the C drive. C:\Spark

From https://github.com/cdarlint/winutils, download the winutils.exe file according to the
version you have installed.

Move the winutils.exe downloaded to \bin folder of Spark distribution i.e.,

● C:\Spark\spark-3.1.1-bin-hadoop2.7\bin

Set the environment variables:

Go to settings, select System

Search for env in the search bar in the left corner:

Select Edit the system environment variables, from the System properties pop window
on Advanced section click on Environment variables:

Variable Value

SPARK_HOME C:\Spark\spark-3.1.1-bin-hadoop2.7

PYSPARK_DRIVER_PYTHON jupyter

PYSPARK_DRIVER_PYTHON_OPTS notebook

HADOOP_HOME C:\hadoop\bin

JAVA_HOME C:\Java\jdk1.8.0_291

Enter the above variables and value respectively under user variables for vssan.
Make sure the respective paths specified in value are present inside the System
variables Path as well.

To check whether the Spark is installed or not, run the command on windows command
prompt:

● spark-shell --version

To run Jupyter notebook, open Anaconda command prompt and run the command:

● ·jupyter notebook

In notebook run the following code to check the pyspark is running successfully or not:

Reference:
1. https://changhsinlee.com/install-pyspark-windows-jupyter/
2. https://www.youtube.com/watch?v=AB2nUrKYRhw
3. https://www.tutorialspoint.com/apache_spark/apache_spark_rdd.htm

Big Data Engineering - PySpark
100% (1)
Big Data Engineering - PySpark
120 pages
Exploring Hadoop Ecosystem (Volume 2): Stream Processing
From Everand
Exploring Hadoop Ecosystem (Volume 2): Stream Processing
Wei Liu
No ratings yet
CUS2.Creating Local Reference Fields
75% (4)
CUS2.Creating Local Reference Fields
16 pages
Learn Apache Spark
100% (1)
Learn Apache Spark
31 pages
Spark SQL
100% (1)
Spark SQL
25 pages
Introduction To Spark
No ratings yet
Introduction To Spark
84 pages
Tech Seminar Report
No ratings yet
Tech Seminar Report
5 pages
Spark
No ratings yet
Spark
96 pages
Lec no 10
No ratings yet
Lec no 10
17 pages
BDA-Unit-III
No ratings yet
BDA-Unit-III
19 pages
Cse3002 Big Data m3 Detailed
No ratings yet
Cse3002 Big Data m3 Detailed
39 pages
unit 6 spark (2)
No ratings yet
unit 6 spark (2)
43 pages
Unit 4
No ratings yet
Unit 4
8 pages
Module 3
No ratings yet
Module 3
51 pages
Parallel Processing
No ratings yet
Parallel Processing
38 pages
Unit-V Spark
No ratings yet
Unit-V Spark
69 pages
spark
No ratings yet
spark
9 pages
4 Spark SBP
No ratings yet
4 Spark SBP
74 pages
Spark: Prepared by Dulari Bhatt
No ratings yet
Spark: Prepared by Dulari Bhatt
19 pages
Spark Notes
No ratings yet
Spark Notes
6 pages
39.-Introduction-to-Spark-1
No ratings yet
39.-Introduction-to-Spark-1
21 pages
09 Programming Hadoop - Spark, R and Pig
No ratings yet
09 Programming Hadoop - Spark, R and Pig
80 pages
Sspark
No ratings yet
Sspark
7 pages
Spark_Class_1_PPT
No ratings yet
Spark_Class_1_PPT
33 pages
Introduction to Spark
No ratings yet
Introduction to Spark
30 pages
Shark
No ratings yet
Shark
24 pages
Bda 5
No ratings yet
Bda 5
21 pages
Spark
No ratings yet
Spark
9 pages
Apache Spark Self Learning 1
No ratings yet
Apache Spark Self Learning 1
7 pages
Apache Spark Quick Guide
100% (2)
Apache Spark Quick Guide
21 pages
Module 2.pptx
No ratings yet
Module 2.pptx
20 pages
Apache Spark
No ratings yet
Apache Spark
27 pages
Lecture 25
No ratings yet
Lecture 25
59 pages
Mod4 Bda
No ratings yet
Mod4 Bda
14 pages
Spark BD
No ratings yet
Spark BD
9 pages
Unit 4 Spark Cassendra
No ratings yet
Unit 4 Spark Cassendra
41 pages
BDA NOTES
No ratings yet
BDA NOTES
241 pages
Hadoop Vs Apache Spark
No ratings yet
Hadoop Vs Apache Spark
6 pages
BDA-Lec8
No ratings yet
BDA-Lec8
39 pages
Unit 5
100% (1)
Unit 5
109 pages
What Is Spark?: History of Apache Spark
No ratings yet
What Is Spark?: History of Apache Spark
65 pages
Introduction To Spark
No ratings yet
Introduction To Spark
54 pages
Lecture 3 PPT 22
No ratings yet
Lecture 3 PPT 22
25 pages
BDA U4 copy
No ratings yet
BDA U4 copy
49 pages
4a.introduction to Apache Spark
No ratings yet
4a.introduction to Apache Spark
28 pages
Spark 101
No ratings yet
Spark 101
25 pages
Pyspark_notes_new
No ratings yet
Pyspark_notes_new
18 pages
Key Features: General-Purpose Fast Cluster Computing Platform
No ratings yet
Key Features: General-Purpose Fast Cluster Computing Platform
16 pages
Intro To Apache Spark
No ratings yet
Intro To Apache Spark
66 pages
Apache Spark With Java
No ratings yet
Apache Spark With Java
209 pages
SPARK
No ratings yet
SPARK
66 pages
Introduction To Big Data Technologies
No ratings yet
Introduction To Big Data Technologies
10 pages
C5-SPARK Technology
No ratings yet
C5-SPARK Technology
39 pages
Bda Unit Iv
No ratings yet
Bda Unit Iv
102 pages
Spark Interview Questions PDF 2
No ratings yet
Spark Interview Questions PDF 2
19 pages
spark
No ratings yet
spark
160 pages
Apache Spark
No ratings yet
Apache Spark
31 pages
Pyspark Interview Code
100% (3)
Pyspark Interview Code
197 pages
Apache Spark Features
No ratings yet
Apache Spark Features
2 pages
Apache Spark Ecosystem - Complete Spark Components Guide: 1. Objective
No ratings yet
Apache Spark Ecosystem - Complete Spark Components Guide: 1. Objective
11 pages
Learning Apache Spark 2
From Everand
Learning Apache Spark 2
Muhammad Asif Abbasi
No ratings yet
Logcat Home Fota Update Log
No ratings yet
Logcat Home Fota Update Log
74 pages
Crud Repository
No ratings yet
Crud Repository
19 pages
IECV2 - CFSV2 (1) (1) .0 - MT2 - V2.0.mdb (3) Send by Ajay
No ratings yet
IECV2 - CFSV2 (1) (1) .0 - MT2 - V2.0.mdb (3) Send by Ajay
28 pages
Java 6 JDBC and Database Applications PDF
No ratings yet
Java 6 JDBC and Database Applications PDF
185 pages
Access 97: Quick Reference Card
No ratings yet
Access 97: Quick Reference Card
2 pages
Network Solution Proposal
No ratings yet
Network Solution Proposal
10 pages
AZ-900T0xModule 02core Azure Services
No ratings yet
AZ-900T0xModule 02core Azure Services
43 pages
Stack Trace
No ratings yet
Stack Trace
11 pages
Kota Krishna Chaitanya - Resume
No ratings yet
Kota Krishna Chaitanya - Resume
4 pages
DBMS Lesson Plan
No ratings yet
DBMS Lesson Plan
3 pages
Oracle DUL
No ratings yet
Oracle DUL
17 pages
Android Database
0% (1)
Android Database
22 pages
DBMS
No ratings yet
DBMS
32 pages
SQL Tutorial 1
No ratings yet
SQL Tutorial 1
107 pages
SQL AND, OR and NOT Operators
No ratings yet
SQL AND, OR and NOT Operators
5 pages
Data Analytics-Python
No ratings yet
Data Analytics-Python
41 pages
Database Assignment 2
No ratings yet
Database Assignment 2
4 pages
Sparkly R
No ratings yet
Sparkly R
2 pages
Management Information Systems Lecture Notes
No ratings yet
Management Information Systems Lecture Notes
67 pages
SQL 2022
No ratings yet
SQL 2022
1 page
Ximble Biometric Information Policy
No ratings yet
Ximble Biometric Information Policy
3 pages
Rdbms Notes
No ratings yet
Rdbms Notes
193 pages
Python - Dictionary Data Structure
No ratings yet
Python - Dictionary Data Structure
17 pages
Capella Architecture Overview
No ratings yet
Capella Architecture Overview
24 pages
Oracle TADM
No ratings yet
Oracle TADM
3 pages
Annexure 1 - Technical Specs Document10102023
No ratings yet
Annexure 1 - Technical Specs Document10102023
15 pages
Moremastering The Art of Indexing
No ratings yet
Moremastering The Art of Indexing
54 pages
Data Visualization and Hadoop
No ratings yet
Data Visualization and Hadoop
34 pages
Digital Storage Memory Technology PDF
No ratings yet
Digital Storage Memory Technology PDF
31 pages

Features of Apache Spark

Uploaded by

Features of Apache Spark

Uploaded by

Apache Spark

Features of Apache Spark:

1. Speed − Spark helps to run an application in Hadoop cluster, up to 100 times

1. Standalone − Spark Standalone deployment means Spark occupies the place on

Apache Spark Core:

MLlib (Machine Learning Library):

MLlib is a distributed machine learning framework above Spark because of the

GraphX is a distributed graph-processing framework on top of Spark. It provides an API

Recognizing this problem, researchers developed a specialized framework called

Iterative Operations of Spark RDD:

1. In-Memory Computations: It improves the performance by an order of

1. Should have Java version 1.8 in your system.

Download the required version of spark.

Extract it to folder and paste it directly into the C drive. C:\Spark

Move the winutils.exe downloaded to \bin folder of Spark distribution i.e.,

Set the environment variables:

Go to settings, select System

Search for env in the search bar in the left corner:

You might also like