
Spark Parallel Processes and Aggregation

1. Spark applications run by submitting code to a Spark driver program, which communicates with executors to distribute tasks.
2. The YARN resource manager allocates containers for the Spark application master and executors; the application master manages the executor containers.
3. Spark drivers communicate directly with executors and send them tasks; the executors perform the processing and return results to the driver. This allows distributed, parallel processing across executor nodes.


Writing and Deploying Spark Applications
Dr. Gasan Elkhodari
BUAN6346 Big Data Analytics
1. The Spark driver runs on the client.

2. When the application starts, the Spark driver submits the application to the YARN Resource Manager (see the spark-submit sketch after this list).

3. The YARN RM allocates a container for the Spark Application Master (AM) and then starts the AM.

4. The AM requests containers for the number of executors the application requires.

5. The executors are JVMs.
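
As a sketch of this submission flow (the application file and resource sizes below are hypothetical; the flags are standard spark-submit options):

# YARN client mode (the default): the driver runs in this client process,
# and the AM only manages the executor containers.
spark-submit \
    --master yarn \
    --deploy-mode client \
    --num-executors 4 \
    --executor-memory 2G \
    wordcount.py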


1. Once the application (including its executors) is running, the Spark driver communicates with the executors directly.

2. The Spark driver sends tasks to the executors, and they send results back to the driver (a small sketch follows this list).

3. In this context, the AM performs the minimal function of managing the container lifecycle for the application: communicating with the Node Manager on each worker node to launch the executors, and so on.

4. The primary control of the processing (starting and monitoring the tasks running on the executors) is the job of the driver.
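
A minimal sketch of that driver/executor interaction (the data here is made up): each task runs in an executor JVM, and the result of the action comes back to the driver.

# 4 partitions -> 4 tasks, scheduled on the executors by the driver.
rdd = sc.parallelize(range(1, 1001), 4)
total = rdd.map(lambda x: x * x).sum()   # tasks run on the executors
print(total)                             # the combined result arrives at the driver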
1. A second Spark application starts from a second client, and the same process happens: a second AM, a second set of executors.

2. This slide reinforces that multiple applications can run at the same time, assuming that there are enough resources for both applications.
1. As with the first application, once started, the second Spark driver communicates directly with its executors.

2. There is no communication between the two Spark drivers, or between their executors!
The main difference between cluster mode and client mode is that the driver program doesn't run on the client machine; instead it runs as part of the Application Master.

Otherwise the process is the same: the AM requests containers for its executors and launches the executors (JVMs) in those containers.

Again, once the application starts, it is the driver program that manages the tasks on the executors and receives data back from the executors.

Why do we need cluster mode? In many (perhaps most) cases, security is tight around the cluster, and hosts that are not part of the cluster should not (or cannot) communicate directly with worker nodes within the cluster. In cluster mode, all of this communication happens between worker nodes within the cluster.

Cluster mode is more common than client mode in production settings.
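
A sketch of the corresponding submission (the application file is hypothetical; only the deploy mode changes from the client-mode example above):

# YARN cluster mode: the driver runs inside the Application Master container
# on the cluster, so the client machine only submits the job.
spark-submit \
    --master yarn \
    --deploy-mode cluster \
    --num-executors 4 \
    wordcount.py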
The Spark Application Web UI

Parallel Processing in Spark
1. RDD data lives in memory in the executor JVMs.

2. Spark automatically spreads the data across multiple nodes, so that datasets much too large to fit in a single machine's memory can be processed, and so that processing can run in parallel, local to the data, on those nodes (see the sketch below).
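
A small sketch of that partitioning (the data is made up): glom() gathers each partition into a list, so you can see how Spark has spread the elements across partitions, and therefore across executors.

rdd = sc.parallelize(range(12), 4)   # ask for 4 partitions
print(rdd.getNumPartitions())        # 4
print(rdd.glom().collect())          # [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9, 10, 11]]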
• So far we've been using textFile(file), but you can also specify a minimum number of partitions. The default is 2, so even a dataset that is smaller than one block will still be loaded as two partitions (example below).

• Note on the default minimum partitions: if running with a single core or a single thread, the default is actually 1, unless the user overrides it by explicitly setting a minimum number of partitions.
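
For example, the optional second argument of textFile sets the minimum number of partitions (the file here is the same one used later in this deck):

accounts = sc.textFile("/loudacre/accounts/part-m-00000", 3)  # at least 3 partitions
print(accounts.getNumPartitions())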
There are two approaches to loading multiple files:

1. sc.textFile("mydir/*"), using a wildcard (for large files)

2. sc.textFile("mydir/file1,mydir/file2"), a comma-separated list of paths, which is what we've been using in the labs

sc.wholeTextFiles("mydir") creates a pair RDD, where the key is the name of each file and the value is that file's contents (see the sketch below).
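
A small sketch of the wholeTextFiles result (using the same hypothetical "mydir" directory):

# Each element is (file name, entire file contents): one record per file.
files = sc.wholeTextFiles("mydir")
for name, contents in files.collect():
    print(name, len(contents))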
Onto that cluster we place an HDFS file that consists of three blocks.

(Reminder: HDFS splits files into blocks by size, 128 MB by default, so a single 300 MB file would be three blocks. Because HDFS (and Spark) are designed to work with very large files, you should usually assume that most files will be split into multiple blocks.)
Now a new Spark application starts. In this example the application requests 4 executors, so it is running on all nodes in the cluster.
1. The textFile operation creates a new RDD based on the HDFS file.

2. Spark partitions the RDD into the same number of partitions as the number of blocks that make up the file: in this example, three partitions.

3. (Remember, the textFile operation doesn't actually retrieve the file until the final step, when collect is called; see the sketch after this list.)

4. Because the file is in HDFS, it is distributed (three blocks). The Spark driver knows (because it asked the NameNode) which nodes the file data is located on and, if possible, will attempt to run this job on the nodes where the HDFS blocks are physically stored.
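
A sketch of this, assuming a hypothetical three-block HDFS file:

# textFile is lazy: no data is read yet, Spark only records the lineage.
mydata = sc.textFile("/loudacre/mydata")   # hypothetical file spanning 3 HDFS blocks
print(mydata.getNumPartitions())           # 3: one partition per block

# Only an action (such as collect) triggers the tasks that read the blocks.
result = mydata.collect()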
When we reach collect, that triggers the actual tasks.

Because there are three partitions, the textFile operation will result in three tasks. (Number of partitions = number of tasks, always!) The Spark driver will try to schedule those three tasks on the nodes where the file data physically resides, thereby minimizing network traffic.

This is important: the collect function copies all data from all the partitions to the driver, so the advantages of distribution and data locality don't apply.

In practice, don't do a collect on large data!
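
For large results, the usual alternatives are to pull back only a sample with take, or to write the output in parallel from the executors with saveAsTextFile (the paths below are hypothetical):

# Inspect a small sample on the driver instead of collecting everything.
mydata = sc.textFile("/loudacre/mydata")
print(mydata.take(10))                         # only 10 records return to the driver

# Or write results out in parallel: each executor writes its own partitions.
mydata.saveAsTextFile("/loudacre/mydata_out")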
Executing Parallel Operations
# For each word: key = first letter, value = word length; then average the
# word lengths per key. (kv = (key, values); tuple-unpacking lambdas are Python 2 only.)
accounts = sc.textFile("/loudacre/accounts/part-m-00000") \
    .flatMap(lambda line: line.split()) \
    .map(lambda word: (word[0], len(word))) \
    .groupByKey() \
    .map(lambda kv: (kv[0], sum(kv[1]) / len(kv[1])))

Another map operation is the final step. It works on the partitions that were output by the groupByKey (or another reduce-related operation); partitioning is again preserved (see the check below).
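
A quick way to verify that partition preservation (rebuilding the intermediate pair RDD so its partition count can be compared with the final result's):

pairs = sc.textFile("/loudacre/accounts/part-m-00000") \
    .flatMap(lambda line: line.split()) \
    .map(lambda word: (word[0], len(word)))
grouped = pairs.groupByKey()
averaged = grouped.map(lambda kv: (kv[0], sum(kv[1]) / len(kv[1])))
print(grouped.getNumPartitions(), averaged.getNumPartitions())   # the two counts match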
Stages and Tasks
Aggregating Data with Pair RDDs
