
Spark Basic

The document provides an overview of Spark architecture, detailing how Spark applications are executed on a Hadoop YARN cluster using the spark-submit command. It explains the roles of the application master container, driver, and executors, as well as the differences between client and cluster deployment modes. Additionally, it covers the concepts of Spark jobs, stages, tasks, and the execution flow of Spark applications, including the importance of transformations and actions.


Spark Architecture

What is a cluster?
 A pool of computers working together but viewed as a single system
 Suppose it has 10 worker nodes (as an example)
o Each node has 64 GB RAM and 16 CPU cores
 Total cluster capacity is
o 640 GB RAM
o 160 CPU cores

Hadoop YARN cluster

 We want to run a Spark application on this cluster


 We use the spark-submit command to submit the Spark application to the cluster
 The request will go to the YARN resource manager
 The YARN resource manager will create one application master container on a worker node and start our application's main() method in the container
 What is a container? A container is an isolated virtual runtime environment, and it comes with some CPU and memory allocation

Let's see what happens inside the application master container


 The container runs the main() method; it can be a PySpark application or a Scala application
 Let's assume we have a PySpark application. But Spark is written in Scala, and it runs in the Java Virtual Machine
 Scala is a JVM language, and it always runs in the JVM
 The Spark developers wanted to bring this to Python developers,
 so they created a Java wrapper on top of the Scala core, and then a Python wrapper on top of the Java wrapper
 And this Python wrapper is known as PySpark (ref. image 2)

How PySpark is executed in the application master container

 Now I have Python code in my main method


 This Python code is designed to start a Java main method internally.
 My PySpark application will start a JVM application.
 Once we have the JVM application, the PySpark wrapper will call the Java wrapper using a Py4J connection.
 Py4J allows a Python application to call a Java application, and that's how PySpark works
 The PySpark main is my PySpark driver
 The JVM application here is my application driver
 If you write your code in Scala, you don't have a PySpark driver
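A minimal sketch of what this looks like in user code (the application name and master value are illustrative; building the SparkSession is what starts the JVM application driver and the Py4J connection behind the scenes):

    from pyspark.sql import SparkSession

    if __name__ == "__main__":
        # Building the SparkSession starts the JVM (the application driver) and
        # connects this Python process (the PySpark driver) to it via Py4J.
        spark = (SparkSession.builder
                 .appName("HelloSpark")      # illustrative name
                 .master("local[3]")         # or "yarn" when submitted to a cluster
                 .getOrCreate())

        # Every Dataframe call below is forwarded to the JVM through the Py4J bridge.
        df = spark.range(10)
        print(df.count())

        spark.stop()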
A Spark application is a distributed application in itself (how?)


 Your application driver distributes the work to others, so the driver will not perform the data processing itself
 Instead, it will create some executors and get the work done from them. But how?
 After starting, the driver will go back to the YARN RM and ask for some more containers
 The resource manager will create some containers on worker nodes and give them to the driver.
 Now the driver will start a Spark executor in each of these containers
 Each container will run one Spark executor, and the Spark executor is a JVM application
 These executors are responsible for doing all the data processing work
 The driver will assign work to the executors, monitor them, and manage them
 Executors will do all the data processing
Spark architecture when you are executing a Scala program

Spark architecture when you are executing PySpark code

Spark architecture when you are executing a PySpark program with some Python modules

Spark Submit and Important Options

How to deploy a Spark application


 We have many ways to submit a Spark application to your cluster.
 The most commonly used method is the spark-submit command-line tool.

What is spark-submit?

The spark-submit is a command-line tool that allows you to submit the Spark
application to the cluster.
 Here is a general structure of the spark-submit command.
 This is a minimalist example with the most commonly used options.
 So you start with the spark-submit command, followed by its options.
 The second last argument is your application jar or a PySpark script.
 Finally, you have a command-line argument for your application.

 Class: The --class option is only valid for Java or Scala, and it is not required when you submit a PySpark application. This option tells Spark the driving class name where you defined your main() method for Java and Scala.

 Master: The --master option tells Spark the cluster manager. If you are using YARN, the value of your master is yarn. But if you want to run your application on the local machine, you should set the master to local. You can also use local[3] to run Spark locally with three threads.

 Deploy-mode: The --deploy-mode option takes one of the following two values.
Client: the driver runs on the local/client machine (dev environment)
Cluster: the driver runs in the cluster (production deployment / PROD environment)

 Conf: The --conf option allows you to set additional Spark configurations. For example, you can set spark.executor.memoryOverhead using --conf. Its default value is 10% of the executor memory or 384 MB, whichever is higher.

 Resource allocation options: A Spark application runs as one driver and one or more executors, so spark-submit also lets you request resources for them, such as driver memory, executor memory, executor cores, and the number of executors.

Example of spark-submit:
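A command with the general shape described above might look like the following (the script name and resource values are illustrative):

    spark-submit \
      --master yarn \
      --deploy-mode cluster \
      --driver-memory 1g \
      --executor-memory 8g \
      --num-executors 2 \
      --executor-cores 4 \
      --conf spark.executor.memoryOverhead=1g \
      hello_spark.py argument1 argument2

For a Java or Scala application, you would pass the application JAR instead of the PySpark script and add the --class option described above.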



Deploy Modes - Client and Cluster mode


 The spark-submit allows you to submit the spark application to the
cluster.
 And you can submit the application to run in one of the two modes.
o Cluster mode
o Client mode

Cluster mode

 The architecture below shows cluster mode.


 In cluster mode, the spark-submit will reach the YARN RM, requesting it to start the driver in an AM container.
 YARN will start your driver in the AM container on a worker node in the
cluster.
 Then the driver will again request YARN to start some executor
containers.
 So the YARN will start executor containers and hand them over to the driver.
 So in the cluster mode, your driver is running in the AM container on a
worker node in the cluster.
 Your executors are also running in the executor containers on some
worker nodes in the cluster.

Client mode

 In this kind of setup, the spark-submit doesn't go to the YARN resource


manager for starting an AM container.
 Instead, the spark-submit command will start the driver JVM directly on
the client machine.
 So, in this case, the spark driver is a JVM application running on your
client machine.
 This is the same machine where you executed the spark-submit
command.
 Now the driver will reach out to the YARN resource manager requesting
executor containers.
 The YARN RM will start executor containers and hand them over to the driver.

In short, where is the driver running?


 If the driver is running on the client machine, then it is client mode
 If the driver is running in the cluster, then it is cluster mode

How do we choose the deploy mode?


 You will almost always submit your application in cluster mode.
 It is unlikely that you submit your spark application in client mode.
 We have two clear advantages of running your application in cluster
mode.
o The cluster mode allows you to submit the application and log off from the client machine. Why? Because the driver and executors are running on the cluster. They have nothing active on your client machine. So even if you log off from your client machine, the driver and executors will continue running in the cluster.
o Your application runs faster in cluster mode because your driver is closer to the executors. The driver and executors communicate heavily, and if they are closer, you don't get impacted by network latency.

Why do we have client mode?


 The client mode is designed for interactive workloads.
 For example, spark-shell runs your code in client mode.
 Similarly, Spark notebooks are also using the client mode.
 We have two reasons.
 The first reason is the interactive mode. Spark shell and notebooks give you an interactive method to work with Spark. You will submit some code using these tools. They will run the code and show you the results.
 The second reason is to make an exit easy. What happens if you log off from the client machine or stop the shell and quit? The driver dies. As soon as the driver dies, the YARN RM knows that the driver is dead, and the executors assigned to the driver are now orphans. So the RM will terminate the executor containers to free up the resources.

Spark Jobs - Stage, Shuffle, Task, Slots


A short revision

Spark jobs: Each action creates a Spark job (number of actions == number of Spark jobs).
Stages: Number of wide transformations + 1.
Shuffle/sort: When we perform a wide transformation, we do a shuffle and sort.
Task: A task is the smallest unit of work in a Spark job. The number of tasks in a stage is equal to the number of input partitions. Each stage may be executed as one or more parallel tasks.

 Understanding Spark jobs and stages requires you to know about the Spark Dataframe API classification.
 Most of the Spark APIs can be classified into two categories.
o Transformations
o Actions
 We also have some functions and objects that are neither transformations nor actions (utility functions or helper methods), for example:
o printSchema()
o cache()
o persist()
o isEmpty()

 Transformations are used to process and convert data, and they are
further classified into two categories.
 Narrow Dependency transformations
 The narrow dependency transformations can run in parallel on each data partition without grouping data from multiple partitions; they do not require a shuffle.
 For example, select(), filter(), withColumn(), drop() etc.
 All these transformations can be independently performed
on each data partition in parallel.
 And you do not need to group data for performing these
transformations.

 Wide Dependency transformations


 The Wide dependency transformations require some kind of
grouping before they can be applied.
 For example, groupBy(), join(), cube(), rollup(), and agg().
 All these wide dependency transformations require grouping data on some key and then applying the transformation.

 Action :
 As the name suggests, an action triggers some work,such as
writing a Dataframe to disk, computing the transformations, and
collecting the results.
 Some commonly used actions are read(), write(), collect(), take(), etc.
 All spark actions trigger one or more Spark jobs.
 And that's a supercritical concept.

Let's take an example code (sketched below) and see how Spark runs our code.
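A sketch consistent with the description that follows (a read, then a repartition, a groupBy aggregation, and a collect; the file path, filter, and column names are illustrative) could be:

    # Block 1: the read() here is treated as an action -> Spark job 1
    survey_df = spark.read \
        .option("header", "true") \
        .option("inferSchema", "true") \
        .csv("data/survey.csv")

    # Block 2: a chain of transformations ending in the collect() action -> Spark job 2
    partitioned_survey_df = survey_df.repartition(2)
    count_df = (partitioned_survey_df
                .where("Age < 40")
                .select("Age", "Gender", "Country")
                .groupBy("Country")
                .count())
    print(count_df.collect())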

 This code snippet has got two spark code blocks.


 How do I know I have two blocks? Start from the first line and look for an action.
 Wherever you find an action, your first block ends there, and the following line starts a new block.
 In this example, my first line itself is an action. Right? The read() method
is an action.
 So the first block starts at the first line, and it also ends at the first line.

 So my second block starts from the second line and ends at the last line.
 Each action creates a spark job.

 I got two Spark actions visible here. So I can assume two Spark jobs here.

Let's see the second block


 Your application driver will take this block, compile it and create a
Spark job.
 But this job must be performed by the executors. Why? Because the driver doesn't perform any data processing job.
 The driver must break this job into smaller tasks and assign them to
the executors.
 So how that happens?

 Spark driver will create a logical query plan for each spark job.
 Once we have the logical plan, the driver will start breaking this plan into
stages.
 The driver will look at this plan to find out the wide dependency
transformations.

 I have two wide dependencies in this plan.


 The first one is the repartition() method, and the second one is the
groupBy() method.
 So the driver will break this plan after each wide dependency.
 Each stage can have one or more narrow transformations,
 and the last operation of the stage is a wide dependency
transformation.
 Spark cannot run these stages in parallel.
 We should finish the first stage, and then only we can start the next
stage.
 Why? Because the output of the first stage is an input for the next
stage.
 So whatever we see here is one spark job, and it is broken down into
three stages.
 All the stages can run one after the other because the output of one
stage is input for the next stage.
 But how the output of one stage goes to the next stage?

 I am reading surveyDF and repartitioning it to create


partitionedSurveyDF.
 And the final output of the stage must be stored in an exchange buffer.
 Now the output of one stage becomes the input of the next stage.
 So the next stage starts with the exchange buffer.
 We simply learned that the shuffle/sort moves data from the write exchange to the read exchange.

 The shuffle/sort is an expensive operation in the Spark cluster. It requires a write exchange buffer and a read exchange buffer.
 The data from the write exchange buffer is sent to the read exchange
buffer over the network.

 And now I can run these transformations in parallel on those two


partitions.
 Spark can execute the same plan in parallel on two partitions because I
have two partitions.

Let's quickly revise all these terms.


 Spark creates one job for each action.
 This job may contain a series of multiple transformations.
 The Spark engine will optimize those transformations and create a logical plan for
the job.
 Then spark will break the logical plan at the end of every wide dependency and
create two or more stages.
 If you do not have any wide dependency, your plan will be a single-stage plan.
 But if you have N wide-dependencies, your plan should have N+1 stages.
 Data from one stage to another stage is shared using the shuffle/sort operation.
 Now each stage may be executed as one or more parallel tasks. The number of tasks in the stage is equal to the number of input partitions.
 If I create 100 partitions for stage two, I can have 100 parallel tasks for stage two.
 The task is the most critical concept for a Spark job. A task is the smallest unit of work in a Spark job.
 The Spark driver assigns these tasks to the executors and asks them to do the
work.
 The executor needs the following things to perform the task.
 The task Code
 Data Partition
 So the driver is responsible for assigning a task to the executor.
 The executor will ask for the code or API to be executed for the task.
 It will also ask for the data frame partition on which to execute the given code.

Now let's see how this plan fits into the cluster.
 I have a driver and four executors. Each executor will have one JVM process.
 But I assigned 4 CPU cores to each executor. So, my executor JVM can create four parallel threads. And that's the slot capacity of my executor.
 So each executor can have four parallel threads, and we call them executor slots.
 The driver knows how many slots we have at each executor.
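A small, hedged illustration of the two numbers involved, reusing the session and dataframe from the earlier sketch: the slot count per executor comes from spark.executor.cores, and the task count for a stage comes from the number of partitions feeding that stage.

    # Slots per executor == cores assigned to each executor (parallel threads).
    print(spark.conf.get("spark.executor.cores", "not set"))

    # Tasks for a stage == number of partitions of that stage's input.
    print(partitioned_survey_df.rdd.getNumPartitions())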

 So for this configuration, let's assume the driver has a job stage to finish.
 And you have ten input partitions for the stage. So you can have ten parallel tasks for the same.
 Now the driver will assign those ten tasks to these slots. It might look like this.

 But we have some extra capacity that we are wasting because we do


not have enough tasks for this stage.
 Now let's assume this stage is complete. Now the driver should start the next stage. And we have 32 tasks for the next stage.
 But we have only 16 slots. That's not a problem. The driver will schedule 16 tasks in the available slots.
 The remaining 16 will wait for slots to become available again.

That's how these tasks are assigned and run by the executor.

However, we didn't talk about our actions.

 This example had a collect() action in the end. The collect() action requires each task to send data back to the driver. So the tasks of the last stage will send the result back to the driver over the network.
 The driver will collect data from all the tasks and present it to you.
 If you had an action to write the result in a data file, in that case, all the tasks will write a data file partition and send the partition details to the driver.
 The driver considers the job done when all the tasks are successful.
 If any task fails, the driver might want to retry it. So it can restart the task at a different executor. If all retries also fail, then the driver returns an exception and marks the job failed.

Spark SQL Engine and Query Planning

 Apache Spark gives you two prominent interfaces to work with data.
 Spark SQL
 Dataframe API
 Spark SQL is compliant with the ANSI SQL:2003 standard. So you can think of it as standard SQL.
 Dataframes are functional APIs, and they allow you to implement functional programming techniques to process your data.
 Other than these two, you also have Dataset API available only for Scala
and Java.
 The Dataframe API internally uses Dataset APIs, but these Dataset APIs
are not available in PySpark.
 If you are using Scala or Java, you can directly use Dataset APIs.
 Spark looks at your code in terms of jobs.
 Similarly, if you write a SQL expression, Spark considers one SQL
expression as one Job.
 Spark code is nothing but a sequence of Spark jobs. And each Spark job represents a logical query plan.
 This first logical plan is a user-created logical query plan.
 Now this plan goes to Spark SQL engine.
 You may have Dataframe APIs, or you may have SQL. Both will go to the Spark SQL engine.
 For Spark, they are nothing but a Spark Job represented as a logical plan.

Spark SQL Engine

 The Spark SQL Engine will process your logical plan in four stages.
 The Analysis phase will parse your code and create a fully resolved logical plan.
 If your code passes the Analysis phase, that means you have valid code.
 The logical optimization phase applies standard rule-based optimizations to the
logical plan.

 Then this optimized logical plan goes into the next phase of physical
planning.
 Spark SQL takes a logical plan and generates one or more physical plans
in the physical planning phase.
 Physical planning applies cost-based optimization.
 So the engine will create multiple plans, calculate each plan's cost, and
finally select the plan with the least cost.
 At this stage, they mostly use different join algorithms to create more
than one physical plan.
 For example, they might create one plan using broadcast join and
another using Sort merge,
 Then they apply a cost to each plan and choose the best one.
 The last stage is code generation.So your best physical plan goes into
code generation.
 and the engine will generate Java bytecode for the RDD operations in the physical plan. And that's why Spark is also said to act as a compiler.

Spark Memory Allocation


 Assume you submitted a spark application in a YARN cluster.
 The YARN RM will allocate an application master (AM) container and start the driver
JVM in the container.
 The driver will start with some memory allocation which you requested.
 Do you know how to ask for the driver's memory? You can ask for the driver memory using two configurations.
 spark.driver.memory
 and spark.driver.memoryOverhead

 So let's assume you asked for the spark.driver.memory = 1GB.


 And the default value of spark.driver.memoryOverhead is 10% of the driver memory or 384 MB, whichever is higher.
 But what is the purpose of the 384 MB overhead?
 The overhead memory is used by the container process or any other non-JVM process within the container (for example, code that runs outside the JVM for PySpark or R).

But how much memory do you get for each executor container?
 So the driver will again request for the executor containers from the YARN.
 The YARN RM will allocate a bunch of executor containers.
 But how much memory do you get for each executor container?
 The total memory allocated to the executor container is the sum of the following.
 Overhead Memory : Overhead memory is the spark.executor.memoryOverhead.
 Heap Memory : JVM Heap is the spark.executor.memory.
 Off Heap Memory : Off Heap memory comes from spark.memory.offHeap.size.
 Pyspark Memory : And the PySpark memory comes from the
spark.executor.pyspark.memory.
 So the driver will look at all these configurations to calculate your memory requirement
and sum it up.
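A rough back-of-the-envelope version of that sum, following the defaults described in these notes (the 10% / 384 MB overhead rule; off-heap and PySpark memory left unset). This is illustrative arithmetic only, not an API call:

    heap_mb     = 8 * 1024                       # spark.executor.memory = 8g
    overhead_mb = max(int(heap_mb * 0.10), 384)  # spark.executor.memoryOverhead default
    off_heap_mb = 0                              # spark.memory.offHeap.size (not enabled)
    pyspark_mb  = 0                              # spark.executor.pyspark.memory (not set)

    total_container_mb = heap_mb + overhead_mb + off_heap_mb + pyspark_mb
    print(total_container_mb)                    # roughly 8.8 GB, as in the example below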

Example: Now let's assume you asked for spark.executor.memory = 8 GB
 Overhead Memory: 800 MB (10% of heap memory or 384 MB, whichever is higher)
 Heap Memory: 8 GB
 Off Heap Memory: 0
 Pyspark Memory: 0

But do we get an 8.8 GB container?


 That depends on your cluster configuration.
 The container should run on a worker node in the YARN cluster. Right?
 What if the worker node is a 6 GB machine?
 YARN cannot allocate an 8 GB container on a 6 GB machine. Right?
 Because there is not enough physical memory. So before you ask for the driver or executor memory,
 you should check with your cluster admin for the maximum allowed value.
 If you are using YARN RM, you should look for the following configurations.
 yarn.scheduler.maximum-allocation-mb
 yarn.nodemanager.resource.memory-mb

Great! Now let's come back to our scenario.

Driver memory allocation: Let's assume that during spark-submit you asked for 4 GB of JVM memory

 I am assuming that we have enough physical memory on the worker nodes.


 The total physical memory for a driver container comes from the following two
configurations.

spark.driver.memory = 4 GB
spark.driver.memoryOverhead = 400 MB (10% or 384 MB, whichever is higher)
 Once allocated, it becomes your physical memory limit for your spark driver.
 Now you have three limits.
 Your Spark driver JVM cannot use more than 4 GB.
 Your non-JVM workload in the container cannot use more than 400 MB.
 And Your container cannot use more than 4.4 GB of memory in total.
 If any of these limits are violated, you will see an OOM( out of memory exception )
exception.

Executor memory allocation: Now let's assume you asked for spark.executor.memory = 8 GB.
 Overhead Memory: 800 MB (10% of heap memory or 384 MB, whichever is higher)
 Heap Memory: 8 GB
 Off Heap Memory: 0
 Pyspark Memory: 0
 The total physical memory of your container is 8.8 GB. Right?
 Now you have three limits.
 Your executor JVM cannot use more than 8 GB of memory.
 Your non JVM processes cannot use more than 800 MB.
 And your container has a maximum physical limit of 8.8 GB.
 If any of these limits are violated, you will see an OOM( out of memory exception )
exception.

Great! So far, so good. But you should ask two more questions.


1. What is the Physical memory limit at the worker node?
a. You should look out for yarn.scheduler.maximum-allocation-mb for the
maximum limit.
2. What is the PySpark executor memory?
a. You do not need to worry about PySpark memory if you write your Spark
application in Java or Scala.
b. But if you are using PySpark, this question becomes critical. Right?

 PySpark is not a JVM process, so it runs in the container's overhead memory.
 Some 300 to 400 MB of this overhead is constantly consumed by the container processes and other internal processes.
 So your PySpark will get approximately 400 MB.
 If your PySpark consumes more than what can be accommodated in the overhead, you will see an OOM error.

Great! So if you look from the YARN perspective.


 You have a container, and the container has got some memory.
 This total memory is broken into two parts.
o Heap memory (driver/executor memory):
 The heap memory goes to your JVM:
 We call it driver memory when you are running a driver in this container.
 Similarly, we call it executor memory when the container is running an
executor.
o Overhead memory (OS Memory):
 The overhead memory is used for a bunch of things.
 In fact, the overhead is also used for network buffers.
 So you will be using overhead memory as your shuffle exchange or reading
partition data from remote storage etc.
 Both the memory portions are critical for your Spark application.
 And more often than not, a lack of enough overhead memory will cost you an OOM exception.
 The overhead memory is often overlooked, but it is used as your shuffle exchange or network read buffer.

Spark Memory Management

Executor memory.
 You started your application using spark.executor.memory as 8 GB and
spark.executor.cores equal to 4
 You will also get default 10% overhead memory. Right?

 We want to learn how Spark utilizes JVM heap memory. Right?


Assuming we have 8 GB:

Reserved Memory: 300 MB (a fixed reserve for the Spark engine)
Spark Memory Pool: 60% of (8 GB - 300 MB). The Spark memory pool is your main executor memory pool, which you will use for dataframe operations and caching.
User Memory: 40% of (8 GB - 300 MB). Used for non-dataframe operations:
 user-defined data structures
 Spark internal metadata
 UDFs
 RDD conversion and lineage dependency

 Spark will reserve 300 MB, and you are left with 7700 MB.
 This remaining memory is again divided into a 60-40 ratio.
 The 60% goes to the Spark memory pool, and the remaining goes to the user
memory pool.
 If you want, you can change this 60-40 ratio using the spark.memory.fraction configuration.
 The Spark Memory pool is your main executor memory pool which you will use for
data frame operations and caching.
 The User Memory pool is used for non-dataframe operations.
 Here are some examples.
 If you created user-defined data structures such as hash maps, Spark would
use the user memory pool.

 Similarly, spark internal metadata and user-defined functions are stored in


the user memory.
 All the RDD information and the RDD operations are performed in user
memory.
 You will be using user memory only if you apply RDD operations directly in your
code.
 Great! So the point is straight. The Spark memory pool is where all your data frames and dataframe operations live.

 So for this example, we started with 8 GB, but we are left with a 4620 MB Spark
memory pool.
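The same arithmetic as the notes, written out explicitly (spark.memory.fraction defaults to 0.6, 300 MB is the fixed reserve, and the notes treat 8 GB as a round 8000 MB):

    executor_memory_mb = 8 * 1000                            # 8 GB, using the notes' round numbers
    reserved_mb        = 300                                 # fixed reserve for the Spark engine
    usable_mb          = executor_memory_mb - reserved_mb    # 7700 MB

    spark_pool_mb = int(usable_mb * 0.60)                    # spark.memory.fraction = 0.6 -> 4620 MB
    user_pool_mb  = usable_mb - spark_pool_mb                # 3080 MB
    print(spark_pool_mb, user_pool_mb)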

So how is this memory pool used? (Spark memory pool = 4620 MB)


 This memory pool is further broken down into two sub-pools.

Storage memory pool: cache memory (used for caching data frames)
Executor memory pool: buffer memory for dataframe operations
 The default break up is 50% each,
 We use the storage pool for caching data frames.
 So if you are using dataframe cache operation, you will be caching the data in
the storage pool.
 So the storage pool is long-term.
 You will cache the dataframe and keep it there as long as the executor is
running or you want to un-cache it.
 The executor pool is used to perform dataframe computations.
 So if you are joining two data frames, Spark will need to buffer some data for
performing the joins.

 That buffer is created in the executor pool.


 So the executor pool is short-lived. You will use it for execution and free it immediately as soon as your execution is complete. Right?
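A small example of the storage-pool behaviour described above (the dataframe name comes from the earlier sketch; cache() keeps the data in the storage pool until you uncache it or the executor goes away):

    # Caching puts the dataframe's partitions into the executors' storage pool.
    survey_df.cache()        # lazy -- the cache is filled by the next action
    survey_df.count()        # action that actually materializes the cache

    # ... run several queries against the cached dataframe ...

    survey_df.unpersist()    # frees the storage pool when you no longer need it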

Compute view of the executor


 For this example, I asked for the 4 CPU cores. Right?
 So my executor will have four slots, and we can run four parallel tasks in these slots.
Right?
 But what are these slots? These slots are threads.

Great! So I have one executor JVM, a 2310 MB storage pool, and another 2310 MB executor pool.

How much executor memory will each task get?


 Simple! 2310/4 right?

 That's static memory management,


 and Spark used to assign task memory using this static method before Spark 1.6.

But now, they changed it and implemented a unified memory manager.


 The unified memory manager tries to implement fair allocation amongst the active
tasks.
 Let's assume I have only two active tasks. I have four slots, but only two active tasks.
 So the unified memory manager can allocate all the available execution memory
amongst the two active tasks.
 There is nothing reserved for any task.
 The task will demand the execution memory, and the unified memory manager
will allocate it from the pool.

 If the executor memory pool is fully consumed,



 the memory manager can also allocate executor memory from the storage
memory pool as long as we have some free space.

 In short, it can break the 50-50 split if there is free memory; if there is no free memory left, an out-of-memory exception will occur.

 In general, Spark recommends two or more cores per executor, but you should not go
beyond five cores.
 More than five cores cause excessive memory management overhead and contention.

Spark Adaptive Query Execution


 Spark Adaptive Query Execution or AQE is a new feature released in Apache Spark
3.0.
 It offers you three capabilities.
o Dynamically coalescing shuffle partitions
o Dynamically switching join strategies
o Dynamically optimizing skew joins
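Depending on your Spark 3.x version, AQE may not be enabled by default. A minimal, hedged sketch of switching it on from PySpark (these are real Spark SQL configuration keys; whether you need to set them depends on your version and cluster defaults):

    spark.conf.set("spark.sql.adaptive.enabled", "true")
    # The first capability listed above, coalescing shuffle partitions, has its own switch:
    spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")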

What are these features, and why do we need them? What problems do they solve?

So let's try to understand the problems


 Let's assume you are running a super simple group-by query in your Spark code.
 Or you might have written a Dataframe expression similar to the following.
 So what are we doing? I have a call_records table.
 This table stores information about the cell phone calls made by different users.
 We record a lot of information in this table, but here is a simplified table structure.
 So we record call_id, then the duration of the call in minutes, and we also record
which cell tower served the call.
 And my Spark SQL is trying to get the sum of call duration by the tower_location.
 For these four sample records, you should get the result as shown below.
 But do you know how Spark will execute this query? We already learned that.
 Spark will take the code, create an execution plan for the query, and execute it on
the cluster.
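A query consistent with the description above (the column names call_id, duration, and tower_location are assumed from the text) might look like this:

    agg_df = spark.sql("""
        SELECT tower_location, SUM(duration) AS duration_sum
        FROM call_records
        GROUP BY tower_location
    """)

    # Or the equivalent Dataframe expression:
    agg_df = (spark.table("call_records")
              .groupBy("tower_location")
              .agg({"duration": "sum"}))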

 The spark job that triggers this query should have a two-stage plan.
 It looks similar to this.
 However, you will have two stages.
 Stage zero reads the data from the source table and fills it to the output exchange.
 The second stage will read the data from the output exchange and bring it to the input exchange.
 And this process is known as shuffle/sort.
 Why do we have this shuffle/sort in our execution plan?
 Because we are doing a groupBy operation and groupBy is a wide-dependency
transformation.
 So you are likely to have a shuffle/sort in your execution plan.
 Stage one is dependent on stage zero, so stage one cannot start unless stage zero is complete.
 We already learned all this in the earlier section.

Now let's dig deeper into these exchanges.


 Let's look at the input exchange. The input exchange might look like below.
 I am assuming that my input data source has got only two partitions.

 So the stage zero exchange should have two partitions,
 and Spark will make them available to the exchange for the second stage to read.
 I have only two partitions in my input source, so the exchange shows only two
partitions.
 Now the shuffle/sort operation will read these partitions,
 sort them by the tower_location and bring them to the input exchange of stage one.
 The final result should look like this.

Wait a minute? What is this? Let me explain.


 Let's assume I configured spark.sql.shuffle.partitions = 10.
 I know the default value for this configuration is 200, but I also know I do not have
200 unique towers.
 I am running a query to group by towers, so I reduced the number of shuffle
partitions to 10.
 So the shuffle/sort will create ten partitions in the exchange.
 Five of them will have data, and the remaining five will be blank partitions.
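For reference, this is the runtime-settable configuration the notes are assuming here (200 is the default value):

    spark.conf.set("spark.sql.shuffle.partitions", "10")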

Now let me ask a question.


How many tasks do you need to execute this stage?
We need ten tasks. Why? Because we have ten partitions in the exchange.
And that's a problem.
 Five partitions are empty, but Spark will still trigger ten tasks.
 The empty partition task will do nothing and finish in milliseconds.
 But Spark scheduler needs to spend time scheduling and monitoring these useless
tasks.
 The overhead is small, but we do have some unnecessary overhead here.
 I improved the situation by reducing the shuffle partitions to 10 from the default
value of 200.
 But that's not an easy thing to do.
 Why?
 Because I cannot keep changing the shuffle partitions for every query.
 Even if I want to do that, I do not know how many unique values my SQL will fetch.
 The number of unique keys is dynamic and depends on the dataset.
 How am I supposed to know how many shuffle partitions I need?
 That's almost impossible for the developers to know in advance.

We have another problem here.


 Some partitions are big, and others are small.
 They are not proportionate.
 So a task working on partition-1 will take a long time, while the task doing partition-2 will finish very quickly.
 Why is that a problem? Because the stage is not complete until all the tasks of the stage are complete.
 Three tasks processing partitions 2,3 and 4 will finish quickly,
 but we still need to wait for partition-1 and partition 5.
 And that's a wastage of CPU resources.
 The situation becomes worse when one partition becomes excessively long.
 For example, look at this: a data skew problem.

How do I decide the number of shuffle partitions for my Spark job?


 Spark 3.0 offers Adaptive Query Execution to solve this problem.
 You must enable it, and the AQE will take care of setting the number of your shuffle
partitions.
 But how that magic happens?
 Super simple.

 Your input data is already loaded in stage zero exchange.


 Stage zero is already done, and data has come to the exchange.
 Now Spark will start the shuffle/sort.
 So it can compute the statistics on this data and find out some details such as the following.
 And this is called dynamically computing the statistics on your data during the shuffle/sort.

 When Spark knows enough information about your data,


 it will dynamically adjust the number of shuffle partitions for the next stage.
 For example, in my case, Spark might dynamically set the shuffle partitions to four.
 And the result of that setting looks like this.

So what do we have?
 We have four shuffle partitions for stage one.
 Those five empty partitions are gone.
 Spark also merged two small partitions to create one larger partition.
 So instead of having five disproportionate partitions, we have four partitions.
 And these four are a little more proportionate.

Now let me ask you the same question: how many tasks do we need?


 We need four tasks.
 Task 3 working on partition-3 will finish quickly, but the other three tasks will take
almost equal time.
 So we saved one CPU slot, and we also eliminated the useless empty tasks.
