
Spark Parallel Processes and Aggregation

1. Spark applications run by submitting code to a Spark driver program, which communicates with executors to distribute tasks.
2. The YARN resource manager allocates containers for the Spark application master and executors; the application master manages the executor containers.
3. Spark drivers communicate directly with executors and send them tasks; the executors perform the processing and return results to the driver. This allows distributed, parallel processing across executor nodes.


Writing and Deploying Spark Applications
Dr. Gasan Elkhodari
BUAN6346 Big Data Analytics
1. The Spark driver runs on the client.

2. When the application starts, the Spark driver submits the application to the YARN Resource Manager (see the spark-submit sketch after this list).

3. The YARN RM allocates a container for the Spark Application Master (AM) and then starts the AM.

4. The AM requests containers for the number of executors the application requires.

5. The executors are JVMs.
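
As a sketch of this submission flow (the application file and resource sizes below are hypothetical; the flags are standard spark-submit options):

# YARN client mode (the default): the driver runs in this client process,
# and the AM only manages the executor containers.
spark-submit \
    --master yarn \
    --deploy-mode client \
    --num-executors 4 \
    --executor-memory 2G \
    wordcount.py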


1. Once the application (including its executors) is running, the Spark driver communicates with the executors directly.

2. The Spark driver sends tasks to the executors, and they send results back to the driver (a small sketch follows this list).

3. In this context, the AM performs the minimal function of managing the container lifecycle for the application: communicating with the Node Manager on each worker node to launch the executors, and so on.

4. The primary control of the processing (starting and monitoring the tasks running on the executors) is the job of the driver.
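
A minimal sketch of that driver/executor interaction (the data here is made up): each task runs in an executor JVM, and the result of the action comes back to the driver.

# 4 partitions -> 4 tasks, scheduled on the executors by the driver.
rdd = sc.parallelize(range(1, 1001), 4)
total = rdd.map(lambda x: x * x).sum()   # tasks run on the executors
print(total)                             # the combined result arrives at the driver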
1. A second Spark application starts from a second client, and the same process happens: a second AM, a second set of executors.

2. This slide reinforces that multiple applications can run at the same time, assuming that there are enough resources for both applications.
1. As with the first application, once started, the second Spark driver communicates directly with its executors.

2. There is no communication between the two Spark drivers, or between their executors!
The main difference between cluster mode and client mode is that the driver program doesn't run on the client machine; instead it runs as part of the Application Master.

Otherwise the process is the same: the AM requests containers for its executors and launches the executors (JVMs) in those containers.

Again, once the application starts, it is the driver program that manages the tasks on the executors and receives data back from the executors.

Why do we need cluster mode? In many (perhaps most) cases, security is tight around the cluster, and hosts that are not part of the cluster should not (or cannot) communicate directly with worker nodes within the cluster. In cluster mode, all of this communication happens between worker nodes within the cluster.

Cluster mode is more common than client mode in production settings.
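
A sketch of the corresponding submission (the application file is hypothetical; only the deploy mode changes from the client-mode example above):

# YARN cluster mode: the driver runs inside the Application Master container
# on the cluster, so the client machine only submits the job.
spark-submit \
    --master yarn \
    --deploy-mode cluster \
    --num-executors 4 \
    wordcount.py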
The Spark Application Web UI

Parallel Processing in Spark
1. RDD data lives in memory in the executor JVMs.

2. Spark automatically spreads the data across multiple nodes, so that datasets much too large to fit in a single machine's memory can be processed, and so that processing can run in parallel, local to the data, on those nodes (see the sketch below).
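
A small sketch of that partitioning (the data is made up): glom() gathers each partition into a list, so you can see how Spark has spread the elements across partitions, and therefore across executors.

rdd = sc.parallelize(range(12), 4)   # ask for 4 partitions
print(rdd.getNumPartitions())        # 4
print(rdd.glom().collect())          # [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9, 10, 11]]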
• So far we've been using textFile(file), but you can also specify a minimum number of partitions. The default is 2, so even a dataset that is smaller than one block will still be loaded as two partitions (example below).

• Note on the default minimum partitions: if running with a single core or a single thread, the default is actually 1, unless the user overrides it by explicitly setting a minimum number of partitions.
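
For example, the optional second argument of textFile sets the minimum number of partitions (the file here is the same one used later in this deck):

accounts = sc.textFile("/loudacre/accounts/part-m-00000", 3)  # at least 3 partitions
print(accounts.getNumPartitions())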
There are two approaches to loading multiple files:

1. sc.textFile("mydir/*"), using a wildcard (for large files)

2. sc.textFile("mydir/file1,mydir/file2"), a comma-separated list of paths, which is what we've been using in the labs

sc.wholeTextFiles("mydir") creates a pair RDD, where the key is the name of each file and the value is that file's contents (see the sketch below).
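
A small sketch of the wholeTextFiles result (using the same hypothetical "mydir" directory):

# Each element is (file name, entire file contents): one record per file.
files = sc.wholeTextFiles("mydir")
for name, contents in files.collect():
    print(name, len(contents))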
Onto that cluster we place an HDFS file that consists of three blocks.

(Reminder: HDFS splits files into blocks by size, 128 MB by default, so a single 300 MB file would be three blocks. Because HDFS (and Spark) are designed to work with very large files, you should usually assume that most files will be split into multiple blocks.)
Now a new Spark application starts. In this example the application requests 4 executors, so it is running on all nodes in the cluster.
1. The textFile operation creates a new RDD based on the HDFS file.

2. Spark partitions the RDD into the same number of partitions as the number of blocks that make up the file: in this example, three partitions.

3. (Remember, the textFile operation doesn't actually retrieve the file until the final step, when collect is called; see the sketch after this list.)

4. Because the file is in HDFS, it is distributed (three blocks). The Spark driver knows (because it asked the NameNode) which nodes the file data is located on and, if possible, will attempt to run this job on the nodes where the HDFS blocks are physically stored.
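
A sketch of this, assuming a hypothetical three-block HDFS file:

# textFile is lazy: no data is read yet, Spark only records the lineage.
mydata = sc.textFile("/loudacre/mydata")   # hypothetical file spanning 3 HDFS blocks
print(mydata.getNumPartitions())           # 3: one partition per block

# Only an action (such as collect) triggers the tasks that read the blocks.
result = mydata.collect()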
When we reach collect, that triggers the actual tasks.

Because there are three partitions, the textFile operation will result in three tasks. (Number of partitions = number of tasks, always!) The Spark driver will try to schedule those three tasks on the nodes where the file data physically resides, thereby minimizing network traffic.

This is important: the collect function copies all data from all the partitions to the driver, so the advantages of distribution and data locality don't apply.

In practice, don't do a collect on large data!
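
For large results, the usual alternatives are to pull back only a sample with take, or to write the output in parallel from the executors with saveAsTextFile (the paths below are hypothetical):

# Inspect a small sample on the driver instead of collecting everything.
mydata = sc.textFile("/loudacre/mydata")
print(mydata.take(10))                         # only 10 records return to the driver

# Or write results out in parallel: each executor writes its own partitions.
mydata.saveAsTextFile("/loudacre/mydata_out")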
Executing Parallel Operations
# For each word: key = first letter, value = word length; then average the
# word lengths per key. (kv = (key, values); tuple-unpacking lambdas are Python 2 only.)
accounts = sc.textFile("/loudacre/accounts/part-m-00000") \
    .flatMap(lambda line: line.split()) \
    .map(lambda word: (word[0], len(word))) \
    .groupByKey() \
    .map(lambda kv: (kv[0], sum(kv[1]) / len(kv[1])))

Another map operation is the final step. It works on the partitions that were output by the groupByKey (or another reduce-related operation); partitioning is again preserved (see the check below).
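
A quick way to verify that partition preservation (rebuilding the intermediate pair RDD so its partition count can be compared with the final result's):

pairs = sc.textFile("/loudacre/accounts/part-m-00000") \
    .flatMap(lambda line: line.split()) \
    .map(lambda word: (word[0], len(word)))
grouped = pairs.groupByKey()
averaged = grouped.map(lambda kv: (kv[0], sum(kv[1]) / len(kv[1])))
print(grouped.getNumPartitions(), averaged.getNumPartitions())   # the two counts match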
Stages and Tasks
Aggregating Data with Pair RDDs
