
Spark Basic

The document provides an overview of Spark architecture, detailing how Spark applications are executed on a Hadoop YARN cluster using the spark-submit command. It explains the roles of the application master container, driver, and executors, as well as the differences between client and cluster deployment modes. Additionally, it covers the concepts of Spark jobs, stages, tasks, and the execution flow of Spark applications, including the importance of transformations and actions.


Spark Architecture

What is a cluster?
 A pool of computers working together but viewed as a single system
 Suppose it has 10 worker nodes (as an example)
o Each node has 64 GB RAM and 16 CPU cores
 Total cluster capacity is
o 640 GB RAM
o 160 CPU cores

Hadoop YARN cluster

 We want to run a Spark application on this cluster


 We use the spark-submit command to submit the Spark application to the cluster
 The request will go to the YARN resource manager
 The YARN resource manager will create one application master container on a worker node and start our application's main() method in the container
 What is a container? A container is an isolated virtual runtime environment, and it comes with some CPU and memory allocation

Let's see what happens inside the application master container


 The container runs the main() method; it can be a PySpark application or a Scala application
 Let's assume we have a PySpark application. But Spark is written in Scala, and it runs in the Java Virtual Machine
 Scala is a JVM language, and it always runs in the JVM
 The Spark developers wanted to bring this to Python developers,
 so they created a Java wrapper on top of the Scala core, and then a Python wrapper on top of the Java wrapper
 And this Python wrapper is known as PySpark (ref. image 2)

How PySpark is executed in the application master container

 Now I have Python code in my main method


 This Python code is designed to start a Java main method internally.
 My PySpark application will start a JVM application.
 Once we have the JVM application, the PySpark wrapper will call the Java wrapper using a Py4J connection.
 Py4J allows a Python application to call a Java application, and that's how PySpark works
 The PySpark main is my PySpark driver
 The JVM application here is my application driver
 If you write your code in Scala, you don't have a PySpark driver
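A minimal sketch of what this looks like in user code (the application name and master value are illustrative; building the SparkSession is what starts the JVM application driver and the Py4J connection behind the scenes):

    from pyspark.sql import SparkSession

    if __name__ == "__main__":
        # Building the SparkSession starts the JVM (the application driver) and
        # connects this Python process (the PySpark driver) to it via Py4J.
        spark = (SparkSession.builder
                 .appName("HelloSpark")      # illustrative name
                 .master("local[3]")         # or "yarn" when submitted to a cluster
                 .getOrCreate())

        # Every Dataframe call below is forwarded to the JVM through the Py4J bridge.
        df = spark.range(10)
        print(df.count())

        spark.stop()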
A Spark application is a distributed application in itself (how?)


 Your application driver distributes the work to others, so the driver will not perform the data processing itself
 Instead, it will create some executors and get the work done from them. But how?
 After starting, the driver will go back to the YARN RM and ask for some more containers
 The resource manager will create some containers on worker nodes and give them to the driver.
 Now the driver will start a Spark executor in each of these containers
 Each container will run one Spark executor, and the Spark executor is a JVM application
 These executors are responsible for doing all the data processing work
 The driver will assign work to the executors, monitor them, and manage them
 Executors will do all the data processing
Spark architecture when you are executing a Scala program

Spark architecture when you are executing PySpark code

Spark architecture when you are executing a PySpark program with some Python modules

Spark Submit and Important Options

How to deploy a Spark application


 We have many ways to submit a Spark application to your cluster.
 The most commonly used method is the spark-submit command-line tool.

What is spark-submit?

The spark-submit is a command-line tool that allows you to submit the Spark
application to the cluster.
 Here is a general structure of the spark-submit command.
 This is a minimalist example with the most commonly used options.
 So you start with the spark-submit command, followed by its options.
 The second last argument is your application jar or a PySpark script.
 Finally, you have a command-line argument for your application.

 Class: The --class option is only valid for Java or Scala, and it is not required when you submit a PySpark application. This option tells Spark the driving class name where you defined your main() method for Java and Scala.

 Master: The --master option tells Spark the cluster manager. If you are using YARN, the value of your master is yarn. But if you want to run your application on the local machine, you should set the master to local. You can also use local[3] to run Spark locally with three threads.

 Deploy-mode: The --deploy-mode option takes one of the following two values.
Client: the driver runs on the local/client machine (dev environment)
Cluster: the driver runs in the cluster (production deployment / PROD environment)

 Conf: The --conf option allows you to set additional Spark configurations. For example, you can set spark.executor.memoryOverhead using --conf. Its default value is 10% of the executor memory or 384 MB, whichever is higher.

 Resource allocation options: A Spark application runs as one driver and one or more executors, so spark-submit also lets you request resources for them, such as driver memory, executor memory, executor cores, and the number of executors.

Example of spark-submit:
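A command with the general shape described above might look like the following (the script name and resource values are illustrative):

    spark-submit \
      --master yarn \
      --deploy-mode cluster \
      --driver-memory 1g \
      --executor-memory 8g \
      --num-executors 2 \
      --executor-cores 4 \
      --conf spark.executor.memoryOverhead=1g \
      hello_spark.py argument1 argument2

For a Java or Scala application, you would pass the application JAR instead of the PySpark script and add the --class option described above.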



Deploy Modes - Client and Cluster mode


 The spark-submit allows you to submit the spark application to the
cluster.
 And you can submit the application to run in one of the two modes.
o Cluster mode
o Client mode

Cluster mode

 The architecture below shows cluster mode.


 In cluster mode, the spark-submit will reach the YARN RM, requesting it to start the driver in an AM container.
 YARN will start your driver in the AM container on a worker node in the
cluster.
 Then the driver will again request YARN to start some executor
containers.
 So the YARN will start executor containers and hand them over to the driver.
 So in the cluster mode, your driver is running in the AM container on a
worker node in the cluster.
 Your executors are also running in the executor containers on some
worker nodes in the cluster.

Client mode

 In this kind of setup, the spark-submit doesn't go to the YARN resource


manager for starting an AM container.
 Instead, the spark-submit command will start the driver JVM directly on
the client machine.
 So, in this case, the spark driver is a JVM application running on your
client machine.
 This is the same machine where you executed the spark-submit
command.
 Now the driver will reach out to the YARN resource manager requesting
executor containers.
 The YARN RM will start executor containers and hand them over to the driver.

In short, where is the driver running?


 If the driver is running on the client machine, then it is client mode
 If the driver is running in the cluster, then it is cluster mode

How do we choose the deploy mode?


 You will almost always submit your application in cluster mode.
 It is unlikely that you submit your spark application in client mode.
 We have two clear advantages of running your application in cluster
mode.
o The cluster mode allows you to submit the application and log off from the client machine. Why? Because the driver and executors are running on the cluster. They have nothing active on your client machine. So even if you log off from your client machine, the driver and executors will continue running in the cluster.
o Your application runs faster in cluster mode because your driver is closer to the executors. The driver and executors communicate heavily, and if they are closer, you don't get impacted by network latency.

Why do we have client mode?


 The client mode is designed for interactive workloads.
 For example, spark-shell runs your code in client mode.
 Similarly, Spark notebooks are also using the client mode.
 We have two reasons.
 The first reason is the interactive mode. Spark shell and notebooks give you an interactive method to work with Spark. You will submit some code using these tools. They will run the code and show you the results.
 The second reason is to make an exit easy. What happens if you log off from the client machine or stop the shell and quit? The driver dies. As soon as the driver dies, the YARN RM knows that the driver is dead, and the executors assigned to the driver are now orphans. So the RM will terminate the executor containers to free up the resources.

Spark Jobs - Stage, Shuffle, Task, Slots


A short revision

Spark jobs: Each action creates a Spark job (number of actions == number of Spark jobs).
Stages: Number of wide transformations + 1.
Shuffle/sort: When we perform a wide transformation, we do a shuffle and sort.
Task: A task is the smallest unit of work in a Spark job. The number of tasks in a stage is equal to the number of input partitions. Each stage may be executed as one or more parallel tasks.

 Understanding Spark jobs and stages requires you to know about the Spark Dataframe API classification.
 Most of the Spark APIs can be classified into two categories.
o Transformations
o Actions
 We also have some functions and objects that are neither transformations nor actions (utility functions or helper methods), for example:
o printSchema()
o cache()
o persist()
o isEmpty()

 Transformations are used to process and convert data, and they are
further classified into two categories.
 Narrow Dependency transformations
 The narrow dependency transformations can run in parallel on each data partition without grouping data from multiple partitions; they do not require a shuffle.
 For example, select(), filter(), withColumn(), drop() etc.
 All these transformations can be independently performed
on each data partition in parallel.
 And you do not need to group data for performing these
transformations.

 Wide Dependency transformations


 The Wide dependency transformations require some kind of
grouping before they can be applied.
 For example, groupBy(), join(), cube(), rollup(), and agg().
 All these wide dependency transformations require grouping data on some key and then applying the transformation.

 Action :
 As the name suggests, an action triggers some work,such as
writing a Dataframe to disk, computing the transformations, and
collecting the results.
 Some commonly used actions are read(), write(), collect(), take(), etc.
 All spark actions trigger one or more Spark jobs.
 And that's a supercritical concept.

Let's take an example code (sketched below) and see how Spark runs our code.
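A sketch consistent with the description that follows (a read, then a repartition, a groupBy aggregation, and a collect; the file path, filter, and column names are illustrative) could be:

    # Block 1: the read() here is treated as an action -> Spark job 1
    survey_df = spark.read \
        .option("header", "true") \
        .option("inferSchema", "true") \
        .csv("data/survey.csv")

    # Block 2: a chain of transformations ending in the collect() action -> Spark job 2
    partitioned_survey_df = survey_df.repartition(2)
    count_df = (partitioned_survey_df
                .where("Age < 40")
                .select("Age", "Gender", "Country")
                .groupBy("Country")
                .count())
    print(count_df.collect())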

 This code snippet has got two spark code blocks.


 How do I know I have two blocks? Start from the first line and look for an action.
 Wherever you find an action, your first block ends there, and the following line starts a new block.
 In this example, my first line itself is an action. Right? The read() method
is an action.
 So the first block starts at the first line, and it also ends at the first line.

 So my second block starts from the second line and ends at the last line.
 Each action creates a spark job.

 I got two Spark actions visible here. So I can assume two Spark jobs here.

Let's see the second block


 Your application driver will take this block, compile it and create a
Spark job.
 But this job must be performed by the executors. Why? Because the driver doesn't perform any data processing job.
 The driver must break this job into smaller tasks and assign them to
the executors.
 So how that happens?

 Spark driver will create a logical query plan for each spark job.
 Once we have the logical plan, the driver will start breaking this plan into
stages.
 The driver will look at this plan to find out the wide dependency
transformations.

 I have two wide dependencies in this plan.


 The first one is the repartition() method, and the second one is the
groupBy() method.
 So the driver will break this plan after each wide dependency.
 Each stage can have one or more narrow transformations,
 and the last operation of the stage is a wide dependency
transformation.
 Spark cannot run these stages in parallel.
 We should finish the first stage, and then only we can start the next
stage.
 Why? Because the output of the first stage is an input for the next
stage.
 So whatever we see here is one spark job, and it is broken down into
three stages.
 All the stages can run one after the other because the output of one
stage is input for the next stage.
 But how the output of one stage goes to the next stage?

 I am reading surveyDF and repartitioning it to create


partitionedSurveyDF.
 And the final output of the stage must be stored in an exchange buffer.
 Now the output of one stage becomes the input of the next stage.
 So the next stage starts with the exchange buffer.
 We simply learned that the shuffle/sort moves data from the write exchange to the read exchange.

 The shuffle/sort is an expensive operation in the Spark cluster. It requires a write exchange buffer and a read exchange buffer.
 The data from the write exchange buffer is sent to the read exchange
buffer over the network.

 And now I can run these transformations in parallel on those two


partitions.
 Spark can execute the same plan in parallel on two partitions because I
have two partitions.

Let's quickly revise all these terms.


 Spark creates one job for each action.
 This job may contain a series of multiple transformations.
 The Spark engine will optimize those transformations and create a logical plan for
the job.
 Then spark will break the logical plan at the end of every wide dependency and
create two or more stages.
 If you do not have any wide dependency, your plan will be a single-stage plan.
 But if you have N wide-dependencies, your plan should have N+1 stages.
 Data from one stage to another stage is shared using the shuffle/sort operation.
 Now each stage may be executed as one or more parallel tasks. The number of tasks in the stage is equal to the number of input partitions.
 If I create 100 partitions for stage two, I can have 100 parallel tasks for stage two.
 The task is the most critical concept for a Spark job. A task is the smallest unit of work in a Spark job.
 The Spark driver assigns these tasks to the executors and asks them to do the
work.
 The executor needs the following things to perform the task.
 The task Code
 Data Partition
 So the driver is responsible for assigning a task to the executor.
 The executor will ask for the code or API to be executed for the task.
 It will also ask for the data frame partition on which to execute the given code.

Now let's see how this plan fits into the cluster.
 I have a driver and four executors. Each executor will have one JVM process.
 But I assigned 4 CPU cores to each executor. So, my executor JVM can create four parallel threads. And that's the slot capacity of my executor.
 So each executor can have four parallel threads, and we call them executor slots.
 The driver knows how many slots we have at each executor.
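A small, hedged illustration of the two numbers involved, reusing the session and dataframe from the earlier sketch: the slot count per executor comes from spark.executor.cores, and the task count for a stage comes from the number of partitions feeding that stage.

    # Slots per executor == cores assigned to each executor (parallel threads).
    print(spark.conf.get("spark.executor.cores", "not set"))

    # Tasks for a stage == number of partitions of that stage's input.
    print(partitioned_survey_df.rdd.getNumPartitions())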

 So for this configuration, let's assume the driver has a job stage to finish.
 And you have ten input partitions for the stage. So you can have ten parallel tasks for the same.
 Now the driver will assign those ten tasks to these slots. It might look like this.

 But we have some extra capacity that we are wasting because we do


not have enough tasks for this stage.
 Now let's assume this stage is complete. Now the driver should start the next stage. And we have 32 tasks for the next stage.
 But we have only 16 slots. That's not a problem. The driver will schedule 16 tasks in the available slots.
 The remaining 16 will wait for slots to become available again.

That's how these tasks are assigned and run by the executor.

However, we didn't talk about our actions.

 This example had a collect() action in the end. The collect() action requires each task to send data back to the driver. So the tasks of the last stage will send the result back to the driver over the network.
 The driver will collect data from all the tasks and present it to you.
 If you had an action to write the result in a data file, in that case, all the tasks will write a data file partition and send the partition details to the driver.
 The driver considers the job done when all the tasks are successful.
 If any task fails, the driver might want to retry it. So it can restart the task at a different executor. If all retries also fail, then the driver returns an exception and marks the job failed.

Spark SQL Engine and Query Planning

 Apache Spark gives you two prominent interfaces to work with data.
 Spark SQL
 Dataframe API
 Spark SQL is compliant with the ANSI SQL:2003 standard. So you can think of it as standard SQL.
 Dataframes are functional APIs, and they allow you to implement functional programming techniques to process your data.
 Other than these two, you also have Dataset API available only for Scala
and Java.
 The Dataframe API internally uses Dataset APIs, but these Dataset APIs
are not available in PySpark.
 If you are using Scala or Java, you can directly use Dataset APIs.
 Spark looks at your code in terms of jobs.
 Similarly, if you write a SQL expression, Spark considers one SQL
expression as one Job.
 Spark code is nothing but a sequence of Spark jobs. And each Spark job represents a logical query plan.
 This first logical plan is a user-created logical query plan.
 Now this plan goes to Spark SQL engine.
 You may have Dataframe APIs, or you may have SQL. Both will go to the Spark SQL engine.
 For Spark, they are nothing but a Spark Job represented as a logical plan.

Spark SQL Engine

 The Spark SQL Engine will process your logical plan in four stages.
 The Analysis phase will parse your code and create a fully resolved logical plan.
 If your code passes the Analysis phase, that means you have valid code.
 The logical optimization phase applies standard rule-based optimizations to the
logical plan.

 Then this optimized logical plan goes into the next phase of physical
planning.
 Spark SQL takes a logical plan and generates one or more physical plans
in the physical planning phase.
 Physical planning applies cost-based optimization.
 So the engine will create multiple plans, calculate each plan's cost, and
finally select the plan with the least cost.
 At this stage, they mostly use different join algorithms to create more
than one physical plan.
 For example, they might create one plan using broadcast join and
another using Sort merge,
 Then they apply a cost to each plan and choose the best one.
 The last stage is code generation.So your best physical plan goes into
code generation.
 and the engine will generate Java bytecode for the RDD operations in the physical plan. And that's why Spark is also said to act as a compiler.

Spark Memory Allocation


 Assume you submitted a spark application in a YARN cluster.
 The YARN RM will allocate an application master (AM) container and start the driver
JVM in the container.
 The driver will start with some memory allocation which you requested.
 Do you know how to ask for the driver's memory? You can ask for the driver memory using two configurations.
 spark.driver.memory
 and spark.driver.memoryOverhead

 So let's assume you asked for the spark.driver.memory = 1GB.


 And the default value of spark.driver.memoryOverhead is 10% of the driver memory or 384 MB, whichever is higher.
 But what is the purpose of the 384 MB overhead?
 The overhead memory is used by the container process or any other non-JVM process within the container (for example, code that runs outside the JVM for PySpark or R).

But how much memory do you get for each executor container?
 So the driver will again request for the executor containers from the YARN.
 The YARN RM will allocate a bunch of executor containers.
 But how much memory do you get for each executor container?
 The total memory allocated to the executor container is the sum of the following.
 Overhead Memory : Overhead memory is the spark.executor.memoryOverhead.
 Heap Memory : JVM Heap is the spark.executor.memory.
 Off Heap Memory : Off Heap memory comes from spark.memory.offHeap.size.
 Pyspark Memory : And the PySpark memory comes from the
spark.executor.pyspark.memory.
 So the driver will look at all these configurations to calculate your memory requirement
and sum it up.
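A rough back-of-the-envelope version of that sum, following the defaults described in these notes (the 10% / 384 MB overhead rule; off-heap and PySpark memory left unset). This is illustrative arithmetic only, not an API call:

    heap_mb     = 8 * 1024                       # spark.executor.memory = 8g
    overhead_mb = max(int(heap_mb * 0.10), 384)  # spark.executor.memoryOverhead default
    off_heap_mb = 0                              # spark.memory.offHeap.size (not enabled)
    pyspark_mb  = 0                              # spark.executor.pyspark.memory (not set)

    total_container_mb = heap_mb + overhead_mb + off_heap_mb + pyspark_mb
    print(total_container_mb)                    # roughly 8.8 GB, as in the example below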

Example: Now let's assume you asked for spark.executor.memory = 8 GB
 Overhead Memory: 800 MB (10% of heap memory or 384 MB, whichever is higher)
 Heap Memory: 8 GB
 Off Heap Memory: 0
 Pyspark Memory: 0

But do we get an 8.8 GB container?


 That depends on your cluster configuration.
 The container should run on a worker node in the YARN cluster. Right?
 What if the worker node is a 6 GB machine?
 YARN cannot allocate an 8 GB container on a 6 GB machine. Right?
 Because there is not enough physical memory. So before you ask for the driver or executor memory,
 you should check with your cluster admin for the maximum allowed value.
 If you are using YARN RM, you should look for the following configurations.
 yarn.scheduler.maximum-allocation-mb
 yarn.nodemanager.resource.memory-mb

Great! Now let's come back to our scenario.

Driver memory allocation: Let's assume that during spark-submit you asked for 4 GB of JVM memory

 I am assuming that we have enough physical memory on the worker nodes.


 The total physical memory for a driver container comes from the following two
configurations.

spark.driver.memory = 4 GB
spark.driver.memoryOverhead = 400 MB (10% or 384 MB, whichever is higher)
 Once allocated, it becomes your physical memory limit for your spark driver.
 Now you have three limits.
 Your Spark driver JVM cannot use more than 4 GB.
 Your non-JVM workload in the container cannot use more than 400 MB.
 And Your container cannot use more than 4.4 GB of memory in total.
 If any of these limits are violated, you will see an OOM( out of memory exception )
exception.

Executor memory allocation: Now let's assume you asked for spark.executor.memory = 8 GB.
 Overhead Memory: 800 MB (10% of heap memory or 384 MB, whichever is higher)
 Heap Memory: 8 GB
 Off Heap Memory: 0
 Pyspark Memory: 0
 The total physical memory of your container is 8.8 GB. Right?
 Now you have three limits.
 Your executor JVM cannot use more than 8 GB of memory.
 Your non JVM processes cannot use more than 800 MB.
 And your container has a maximum physical limit of 8.8 GB.
 If any of these limits are violated, you will see an OOM( out of memory exception )
exception.

Great! So far, so good. But you should ask two more questions.


1. What is the Physical memory limit at the worker node?
a. You should look out for yarn.scheduler.maximum-allocation-mb for the
maximum limit.
2. What is the PySpark executor memory?
a. You do not need to worry about PySpark memory if you write your Spark
application in Java or Scala.
b. But if you are using PySpark, this question becomes critical. Right?

 PySpark is not a JVM process, so it runs in the container's overhead memory.
 Some 300 to 400 MB of this overhead is constantly consumed by the container processes and other internal processes.
 So your PySpark will get approximately 400 MB.
 If your PySpark consumes more than what can be accommodated in the overhead, you will see an OOM error.

Great! So if you look from the YARN perspective.


 You have a container, and the container has got some memory.
 This total memory is broken into two parts.
o Heap memory (driver/executor memory):
 The heap memory goes to your JVM:
 We call it driver memory when you are running a driver in this container.
 Similarly, we call it executor memory when the container is running an
executor.
o Overhead memory (OS Memory):
 The overhead memory is used for a bunch of things.
 In fact, the overhead is also used for network buffers.
 So you will be using overhead memory as your shuffle exchange or reading
partition data from remote storage etc.
 Both the memory portions are critical for your Spark application.
 And more often than not, a lack of enough overhead memory will cost you an OOM exception.
 The overhead memory is often overlooked, but it is used as your shuffle exchange or network read buffer.

Spark Memory Management

Executor memory.
 You started your application using spark.executor.memory as 8 GB and
spark.executor.cores equal to 4
 You will also get default 10% overhead memory. Right?

 We want to learn how Spark utilizes JVM heap memory. Right?


Assuming we have 8 GB:

Reserved Memory: 300 MB (a fixed reserve for the Spark engine)
Spark Memory Pool: 60% of (8 GB - 300 MB). The Spark memory pool is your main executor memory pool, which you will use for dataframe operations and caching.
User Memory: 40% of (8 GB - 300 MB). Used for non-dataframe operations:
 user-defined data structures
 Spark internal metadata
 UDFs
 RDD conversion and lineage dependency

 Spark will reserve 300 MB, and you are left with 7700 MB.
 This remaining memory is again divided into a 60-40 ratio.
 The 60% goes to the Spark memory pool, and the remaining goes to the user
memory pool.
 If you want, you can change this 60-40 ratio using the spark.memory.fraction configuration.
 The Spark Memory pool is your main executor memory pool which you will use for
data frame operations and caching.
 The User Memory pool is used for non-dataframe operations.
 Here are some examples.
 If you created user-defined data structures such as hash maps, Spark would
use the user memory pool.

 Similarly, spark internal metadata and user-defined functions are stored in


the user memory.
 All the RDD information and the RDD operations are performed in user
memory.
 You will be using user memory only if you apply RDD operations directly in your
code.
 Great! So the point is straight. The Spark memory pool is where all your data frames and dataframe operations live.

 So for this example, we started with 8 GB, but we are left with a 4620 MB Spark
memory pool.
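The same arithmetic as the notes, written out explicitly (spark.memory.fraction defaults to 0.6, 300 MB is the fixed reserve, and the notes treat 8 GB as a round 8000 MB):

    executor_memory_mb = 8 * 1000                            # 8 GB, using the notes' round numbers
    reserved_mb        = 300                                 # fixed reserve for the Spark engine
    usable_mb          = executor_memory_mb - reserved_mb    # 7700 MB

    spark_pool_mb = int(usable_mb * 0.60)                    # spark.memory.fraction = 0.6 -> 4620 MB
    user_pool_mb  = usable_mb - spark_pool_mb                # 3080 MB
    print(spark_pool_mb, user_pool_mb)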

So how is this memory pool used? (Spark memory pool = 4620 MB)


 This memory pool is further broken down into two sub-pools.

Storage memory pool: cache memory (used for caching data frames)
Executor memory pool: buffer memory for dataframe operations
 The default break up is 50% each,
 We use the storage pool for caching data frames.
 So if you are using dataframe cache operation, you will be caching the data in
the storage pool.
 So the storage pool is long-term.
 You will cache the dataframe and keep it there as long as the executor is
running or you want to un-cache it.
 The executor pool is used to perform dataframe computations.
 So if you are joining two data frames, Spark will need to buffer some data for
performing the joins.

 That buffer is created in the executor pool.


 So the executor pool is short-lived. You will use it for execution and free it immediately as soon as your execution is complete. Right?
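A small example of the storage-pool behaviour described above (the dataframe name comes from the earlier sketch; cache() keeps the data in the storage pool until you uncache it or the executor goes away):

    # Caching puts the dataframe's partitions into the executors' storage pool.
    survey_df.cache()        # lazy -- the cache is filled by the next action
    survey_df.count()        # action that actually materializes the cache

    # ... run several queries against the cached dataframe ...

    survey_df.unpersist()    # frees the storage pool when you no longer need it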

Compute view of the executor


 For this example, I asked for the 4 CPU cores. Right?
 So my executor will have four slots, and we can run four parallel tasks in these slots.
Right?
 But what are these slots? These slots are threads.

Great! So I have one executor JVM, a 2310 MB storage pool, and another 2310 MB executor pool.

How much executor memory will each task get?


 Simple! 2310/4 right?

 That's static memory management,


 and Spark used to assign task memory using this static method before Spark 1.6.

But now, they changed it and implemented a unified memory manager.


 The unified memory manager tries to implement fair allocation amongst the active
tasks.
 Let's assume I have only two active tasks. I have four slots, but only two active tasks.
 So the unified memory manager can allocate all the available execution memory
amongst the two active tasks.
 There is nothing reserved for any task.
 The task will demand the execution memory, and the unified memory manager
will allocate it from the pool.

 If the executor memory pool is fully consumed,



 the memory manager can also allocate executor memory from the storage
memory pool as long as we have some free space.

 In short, it can break the 50-50 split if there is free memory; if there is no free memory left, an out-of-memory exception will occur.

 In general, Spark recommends two or more cores per executor, but you should not go
beyond five cores.
 More than five cores cause excessive memory management overhead and contention.

Spark Adaptive Query Execution


 Spark Adaptive Query Execution or AQE is a new feature released in Apache Spark
3.0.
 It offers you three capabilities.
o Dynamically coalescing shuffle partitions
o Dynamically switching join strategies
o Dynamically optimizing skew joins
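Depending on your Spark 3.x version, AQE may not be enabled by default. A minimal, hedged sketch of switching it on from PySpark (these are real Spark SQL configuration keys; whether you need to set them depends on your version and cluster defaults):

    spark.conf.set("spark.sql.adaptive.enabled", "true")
    # The first capability listed above, coalescing shuffle partitions, has its own switch:
    spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")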

What are these features, and why do we need them? What problems do they solve?

So let's try to understand the problems


 Let's assume you are running a super simple group-by query in your Spark code.
 Or you might have written a Dataframe expression similar to the following.
 So what are we doing? I have a call_records table.
 This table stores information about the cell phone calls made by different users.
 We record a lot of information in this table, but here is a simplified table structure.
 So we record call_id, then the duration of the call in minutes, and we also record
which cell tower served the call.
 And my Spark SQL is trying to get the sum of call duration by the tower_location.
 For these four sample records, you should get the result as shown below.
 But do you know how Spark will execute this query? We already learned that.
 Spark will take the code, create an execution plan for the query, and execute it on
the cluster.
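A query consistent with the description above (the column names call_id, duration, and tower_location are assumed from the text) might look like this:

    agg_df = spark.sql("""
        SELECT tower_location, SUM(duration) AS duration_sum
        FROM call_records
        GROUP BY tower_location
    """)

    # Or the equivalent Dataframe expression:
    agg_df = (spark.table("call_records")
              .groupBy("tower_location")
              .agg({"duration": "sum"}))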

 The spark job that triggers this query should have a two-stage plan.
 It looks similar to this.
 However, you will have two stages.
 Stage zero reads the data from the source table and fills it to the output exchange.
 The second stage will read the data from the output exchange and bring it to the input exchange.
 And this process is known as shuffle/sort.
 Why do we have this shuffle/sort in our execution plan?
 Because we are doing a groupBy operation and groupBy is a wide-dependency
transformation.
 So you are likely to have a shuffle/sort in your execution plan.
 Stage one is dependent on stage zero, so stage one cannot start unless stage zero is complete.
 We already learned all this in the earlier section.

Now let's dig deeper into these exchanges.


 Let's look at the input exchange. The input exchange might look like below.
 I am assuming that my input data source has got only two partitions.

 So the stage zero exchange should have two partitions,
 and Spark will make them available to the exchange for the second stage to read.
 I have only two partitions in my input source, so the exchange shows only two
partitions.
 Now the shuffle/sort operation will read these partitions,
 sort them by the tower_location and bring them to the input exchange of stage one.
 The final result should look like this.

Wait a minute? What is this? Let me explain.


 Let's assume I configured spark.sql.shuffle.partitions = 10.
 I know the default value for this configuration is 200, but I also know I do not have
200 unique towers.
 I am running a query to group by towers, so I reduced the number of shuffle
partitions to 10.
 So the shuffle/sort will create ten partitions in the exchange.
 Five of them will have data, and the remaining five will be blank partitions.
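For reference, this is the runtime-settable configuration the notes are assuming here (200 is the default value):

    spark.conf.set("spark.sql.shuffle.partitions", "10")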

Now let me ask a question.


How many tasks do you need to execute this stage?
We need ten tasks. Why? Because we have ten partitions in the exchange.
And that's a problem.
 Five partitions are empty, but Spark will still trigger ten tasks.
 The empty partition task will do nothing and finish in milliseconds.
 But Spark scheduler needs to spend time scheduling and monitoring these useless
tasks.
 The overhead is small, but we do have some unnecessary overhead here.
 I improved the situation by reducing the shuffle partitions to 10 from the default
value of 200.
 But that's not an easy thing to do.
 Why?
 Because I cannot keep changing the shuffle partitions for every query.
 Even if I want to do that, I do not know how many unique values my SQL will fetch.
 The number of unique keys is dynamic and depends on the dataset.
 How am I supposed to know how many shuffle partitions I need?
 That's almost impossible for the developers to know in advance.

We have another problem here.


 Some partitions are big, and others are small.
 They are not proportionate.
 So a task working on partition-1 will take a long time, while the task doing partition-2 will finish very quickly.
 Why is that a problem? Because the stage is not complete until all the tasks of the stage are complete.
 Three tasks processing partitions 2,3 and 4 will finish quickly,
 but we still need to wait for partition-1 and partition 5.
 And that's a wastage of CPU resources.
 The situation becomes worse when one partition becomes excessively long.
 For example, look at this: a data skew problem.

How do I decide the number of shuffle partitions for my Spark job?


 Spark 3.0 offers Adaptive Query Execution to solve this problem.
 You must enable it, and the AQE will take care of setting the number of your shuffle
partitions.
 But how that magic happens?
 Super simple.

 Your input data is already loaded in stage zero exchange.


 Stage zero is already done, and data has come to the exchange.
 Now Spark will start the shuffle/sort.
 So it can compute the statistics on this data and find out some details such as the following.
 And this is called dynamically computing the statistics on your data during the shuffle/sort.

 When Spark knows enough information about your data,


 it will dynamically adjust the number of shuffle partitions for the next stage.
 For example, in my case, Spark might dynamically set the shuffle partitions to four.
 And the result of that setting looks like this.

So what do we have?
 We have four shuffle partitions for stage one.
 Those five empty partitions are gone.
 Spark also merged two small partitions to create one larger partition.
 So instead of having five disproportionate partitions, we have four partitions.
 And these four are a little more proportionate.

Now let me ask you the same question: how many tasks do we need?


 We need four tasks.
 Task 3 working on partition-3 will finish quickly, but the other three tasks will take
almost equal time.
 So we saved one CPU slot, and we also eliminated the useless empty tasks.
