BDA-Lec8
Spark
[Title slide diagram: Spark in the big data ecosystem – MapReduce, Hadoop File System, Streaming, Large-scale Data Mining/ML]
What will we learn in this lecture?
01. Why Do We Need Spark?
● Spark SQL
○ – To work with structured data
○ – Allows querying data via SQL-like queries
○ – Supports many data sources (Hive tables, Parquet, JSON, …)
○ – Extends Spark RDD API
● Spark Streaming
○ – To process live streams of data
○ – Extends Spark RDD API
Spark as Unified Analytics Engine
● MLlib
○ – Scalable ML library
○ – Many distributed algorithms: feature extraction, classification,
regression, clustering, recommendation, …
● GraphX
○ – API for manipulating graphs and performing graph-parallel
computations
○ – Also includes common graph algorithms (e.g., PageRank)
○ – Extends Spark RDD API
Spark Architecture
Master/Worker Architecture
Master (Driver Program):
1- Creates the SparkContext
2- Runs the main() function
3- Converts the user program into tasks and stages
Worker Nodes (DataNodes):
Contain task executors (processes) that:
1- Execute tasks (run computations)
2- Read data from HDFS
3- Perform the transformation operations
4- Store data for applications
SparkContext: The Gateway to Apache Spark Applications
Core Components:
- SparkContext: Connection to the Spark cluster; creates RDDs (Resilient Distributed Datasets).
- Cluster Manager: Manages resources and schedules tasks.
- Worker Nodes: Execute tasks via executors.
- Executors: Run tasks on worker nodes.
- RDDs: Immutable, fault-tolerant distributed data structures.
- HDFS: Distributed file system for data storage.
How It Works (a code sketch follows the steps):
1. Driver Program: User writes a Spark application using Python, Scala, or Java.
2. RDD Creation: SparkContext reads data (e.g., from HDFS) and creates RDDs.
3. Transformations: Operations like map, filter, and reduceByKey derive new RDDs (existing RDDs are not modified).
4. Actions: Commands like count, collect trigger execution.
5. Task Scheduling: Cluster Manager schedules tasks on worker nodes.
6. Task Execution: Executors process tasks, interacting with HDFS.
7. Results: Returned to the driver for display or storage.
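As a rough illustration of these steps, here is a minimal PySpark sketch; the application name and HDFS path are placeholder assumptions, not taken from the slides.

# Driver program (step 1): user code that creates the SparkContext
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("how-it-works-sketch")  # illustrative app name
sc = SparkContext(conf=conf)

# RDD creation (step 2): read data, e.g. from HDFS, into an RDD
lines = sc.textFile("hdfs:///path/to/input.txt")      # placeholder path

# Transformations (step 3): lazily describe new RDDs; nothing executes yet
long_lines = lines.filter(lambda line: len(line) > 80)

# Action (step 4): triggers scheduling and execution (steps 5-6) on the executors;
# the result comes back to the driver (step 7)
print(long_lines.count())

sc.stop()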
Spark Architecture
● Main program (called the driver program (master)) talks to the cluster manager, which allocates resources for:
● Worker nodes, in which executors run
● Executors are processes that run computations and store data for the application
https://downloads.apache.org/spark/docs/2.0.0-preview/cluster-overview.html
Spark Architecture
● Each application consists of a driver program and executors on the cluster
○ – Driver program: the process which runs the application's main() and creates the SparkContext object
● Each application gets its own executors, which are processes that stay up for the duration of the whole application and run tasks in multiple threads
○ – Isolation of concurrent applications
● To run on a cluster, the SparkContext connects to (communicates with) the cluster manager, which allocates cluster resources on the worker nodes
● Once connected, Spark acquires executors on the cluster nodes and sends the application code (e.g., a jar) to the executors (data locality)
● Finally, the SparkContext sends tasks to the executors to run (see the sketch below)
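As a sketch of this handshake from the driver's side, creating the SparkContext with a master URL is what triggers the connection to the cluster manager; the standalone URL below is a placeholder assumption.

from pyspark import SparkConf, SparkContext

# The master URL names the cluster manager to contact;
# "spark://host:7077" (a standalone master) is a placeholder value.
conf = (SparkConf()
        .setAppName("cluster-handshake-sketch")
        .setMaster("spark://host:7077"))

sc = SparkContext(conf=conf)  # connects to the cluster manager and acquires executors
# ... define RDDs and call actions; the SparkContext ships tasks to the executors ...
sc.stop()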
Spark Programming Model
Spark on Top of Cluster Managers
● Spark can exploit many cluster resource managers which allocate
cluster resources to run the applications
○ 1. Standalone – Simple cluster manager included with Spark that
makes it easy to set up a cluster (default cluster manager).
○ 2. Hadoop YARN – Resource manager in Hadoop 2
○ 3. Mesos – General cluster manager from AMPLab
○ 4. Kubernetes
Deploy Modes and Cluster Managers
● Spark supports different deploy modes (VM, Docker, Kubernetes) and cluster managers (Standalone, Hadoop YARN, Mesos, Kubernetes), so it can run in different configurations and environments (see the master URL sketch below)
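For illustration, switching cluster manager typically amounts to changing the master URL (the same value that the --master option of spark-submit sets); a sketch with placeholder host names and ports:

from pyspark import SparkConf

# Only the master URL changes when switching cluster managers (all values are placeholders)
conf_standalone = SparkConf().setMaster("spark://host:7077")        # Standalone
conf_yarn       = SparkConf().setMaster("yarn")                     # Hadoop YARN
conf_mesos      = SparkConf().setMaster("mesos://host:5050")        # Mesos
conf_k8s        = SparkConf().setMaster("k8s://https://host:6443")  # Kubernetes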
RDD
● RDDs are the key programming abstraction in Spark: a
distributed memory abstraction
● Immutable, partitioned, and fault-tolerant collection of elements that can be manipulated in parallel
○ – Like a LinkedList<MyObjects>
○ – Stored in main memory across the cluster nodes
■ Each worker node that is used to run an application contains at least one partition of the RDD(s) defined in the application
RDDs: distributed and partitioned
● Stored in the main memory of the executors running on the worker nodes (when possible) or on node-local disk (if there is not enough main memory)
● Allow executing the code invoked on them in parallel
○ – Each executor of a worker node runs the specified code on its partition of the RDD
○ – Partition: an atomic chunk of data (a logical division of the data) and the basic unit of parallelism
○ – Partitions of an RDD can be stored on different cluster nodes (see the sketch below)
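A small sketch of explicit partitioning; the partition count of 4 is an arbitrary choice for illustration.

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# Ask for 4 partitions explicitly; each partition is processed by one task
rdd = sc.parallelize(range(1000), 4)

print(rdd.getNumPartitions())                        # -> 4
# glom() turns each partition into a list, so we can see how elements are split
print([len(part) for part in rdd.glom().collect()])  # roughly 250 elements each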
RDDs: immutable and fault-tolerant
● Immutable once constructed
○ – i.e., RDD content cannot be modified
○ – Create new RDD based on existing RDD
● RDD suitability
○ – Best suited for applications that apply the same operation to all the elements of a dataset (coarse-grained manipulations only)
○ – Provides fine-grained control over the physical distribution of data
○ – Not a good fit for applications that make fine-grained updates to shared state
Spark and RDDs
● Spark (Spark Core) manages the split of RDDs into partitions (atomic units) and allocates the RDDs' partitions to cluster nodes (master (Spark driver) and worker nodes)
● Spark hides the complexity of fault tolerance
○ – RDDs are automatically rebuilt in case of failure using the RDD lineage DAG, which defines the logical execution plan
Directed Acyclic Graph (DAG)
● A Directed Acyclic Graph (DAG) in Spark is a set of vertices and
edges, where vertices represent the RDDs and edges represent
the operations to be applied on RDDs
○ – Generalization of MapReduce model, which has only two
operations (Map and Reduce)
Directed Acyclic Graph (DAG)
● The DAG can be visualized using the Spark Web UI
○ – Figure: WordCount DAG
● A stage is a set of operations that does not involve a shuffle of data
● As soon as a shuffle of data is needed (i.e., when a wide transformation is performed), the DAG will yield a new stage (see the toDebugString sketch below)
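The same lineage and stage structure can also be inspected from code with toDebugString; a minimal sketch (the exact output formatting varies across Spark versions):

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

pairs = (sc.parallelize(["a", "b", "a", "c"])
           .map(lambda w: (w, 1))              # narrow transformation: same stage
           .reduceByKey(lambda x, y: x + y))   # wide transformation: shuffle -> new stage

# The indentation in the printed lineage marks the shuffle (stage) boundary
print(pairs.toDebugString().decode())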
Operations in RDD API
● Spark programs are written in terms of operations on RDDs
● Programming model based on parallelizable operators
○ – Higher-order functions that execute user-defined
functions in parallel
● RDDs are created and manipulated through operators; see https://spark.apache.org/docs/latest/rdd-programming-guide.html
● RDDs are created from external data or other RDDs
How to create an RDD?
● An RDD can be created by (each route is sketched in code below):
○ – Parallelizing existing data collections of the hosting programming
language (e.g., collections and lists of Scala, Java, Python, or R)
■ Number of partitions specified by user
■ RDD API: parallelize
○ – From (large) files stored in HDFS or any other file system
■ One partition per HDFS block
■ RDD API: textFile
○ – Transforming an existing RDD
■ Number of partitions depends on transformation type
■ RDD API: transformation operations (map, filter, flatMap)
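A minimal sketch of the three creation routes; the HDFS path is a placeholder.

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# 1. Parallelize an existing Python collection, choosing the number of partitions
nums = sc.parallelize([1, 2, 3, 4, 5], 2)

# 2. Read a (large) file stored in HDFS: one partition per HDFS block
lines = sc.textFile("hdfs:///path/to/data.txt")   # placeholder path

# 3. Transform an existing RDD into a new one
squares = nums.map(lambda x: x * x)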
How to create an RDD?
● Turn an existing collection into an RDD
● filter: generates a new RDD by filtering the source dataset using the
specified function
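A sketch covering both bullets, i.e. parallelizing a collection and then filtering it; the sample values are made up.

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# Turn an existing Python list into an RDD
nums = sc.parallelize([1, 2, 3, 4, 5, 6])

# filter: keep only the elements for which the function returns True
evens = nums.filter(lambda x: x % 2 == 0)

print(evens.collect())   # -> [2, 4, 6]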
RDD transformations: flatMap
● flatMap: takes as input a function which is applied to each element of
the RDD; can map each input item to zero or more output items
○ – Note: Python's range(start, end, step) produces an ordered sequence of integer values in [start, end) with a nonzero step
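A sketch of flatMap that uses range to map each element to zero or more output items:

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

nums = sc.parallelize([1, 2, 3])

# Each input x is mapped to the sequence 0 .. x-1; the results are flattened into one RDD
expanded = nums.flatMap(lambda x: range(x))

print(expanded.collect())   # -> [0, 0, 1, 0, 1, 2]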
RDD transformations: reduceByKey
● reduceByKey: aggregates values with identical keys using the specified function
● Runs several parallel reduce operations, one for each key
in the dataset, where each operation combines values that
have the same key
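A minimal sketch, with made-up (word, 1) pairs as input:

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

pairs = sc.parallelize([("a", 1), ("b", 1), ("a", 1), ("a", 1)])

# One parallel reduce per key: values sharing a key are combined with the function
counts = pairs.reduceByKey(lambda x, y: x + y)

print(counts.collect())   # e.g. [('a', 3), ('b', 1)] (ordering is not guaranteed)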
RDD transformations: reduceByKey
● Let's visualize the DAG
RDD transformations: join
● join: performs an inner-join on the keys of two
RDDs
● Only keys that are present in both RDDs are output
● Join candidates are independently processed
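A minimal sketch of the inner join on keys; the sample records are made up.

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

ages   = sc.parallelize([("alice", 30), ("bob", 25), ("carol", 41)])
cities = sc.parallelize([("alice", "Rome"), ("bob", "Paris"), ("dave", "Oslo")])

# Inner join: only keys present in both RDDs ("alice", "bob") appear in the result
joined = ages.join(cities)

print(joined.collect())   # e.g. [('alice', (30, 'Rome')), ('bob', (25, 'Paris'))]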
RDD transformations: join
● Let's visualize the DAG
Some RDD actions
● collect: returns all the elements of the RDD as a list
● reduce: aggregates the elements in the RDD using the specified function
● saveAsTextFile: writes the elements of the RDD as a text file either to the local file
system or HDFS
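A sketch of the three actions; the output path is a placeholder.

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

nums = sc.parallelize([1, 2, 3, 4])

print(nums.collect())                       # -> [1, 2, 3, 4]  (all elements as a list)
print(nums.reduce(lambda x, y: x + y))      # -> 10            (aggregate with a function)
nums.saveAsTextFile("hdfs:///path/output")  # writes one text part-file per partition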
Example: WordCount in Python
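A minimal sketch of the usual PySpark WordCount; the input and output paths are placeholders, and the details may differ from the code shown on the original slide.

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

counts = (sc.textFile("hdfs:///path/to/input.txt")   # placeholder path
            .flatMap(lambda line: line.split())      # split each line into words
            .map(lambda word: (word, 1))             # one (word, 1) pair per word
            .reduceByKey(lambda x, y: x + y))        # sum the counts per word

counts.saveAsTextFile("hdfs:///path/to/output")      # placeholder path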
SparkSession
● A unified entry point for manipulating data with Spark
● From Spark 2.0, SparkSession unifies the different
contexts from different APIs and represents the entry
point into all Spark functionalities
● Already available in Spark shell as variable spark
● Within an application: use the builder to create a basic SparkSession (see the sketch below)
● Only one SparkContext may be active per JVM – stop() the
active SparkContext before creating a new one
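A minimal sketch of building a SparkSession within an application; the application name is illustrative.

from pyspark.sql import SparkSession

# Builder pattern: getOrCreate() reuses an already-active session if there is one
spark = (SparkSession.builder
         .appName("session-sketch")
         .getOrCreate())

sc = spark.sparkContext   # the underlying SparkContext is still accessible

# ... use spark for DataFrames/SQL and sc for the RDD API ...

spark.stop()              # stop() the active context before creating a new one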
SparkSession vs SparkContext