Spark Basics
Spark Architecture
What is a cluster?
A pool of computers working together but viewed as a single system.
Suppose it has 10 worker nodes (as an example).
o Each node has 64 GB RAM and 16 CPU cores.
The total cluster configuration is:
o 640 GB RAM
o 160 CPU cores
What is spark-submit?
spark-submit is a command-line tool that allows you to submit a Spark application to the cluster.
Here is the general structure of the spark-submit command.
This is a minimalist example with the most commonly used options.
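The exact command from the original notes is not reproduced here, so the structure below is an assumed sketch; the script name, resource values, and the trailing application argument are placeholders.

    # Assumed sketch: for a Java/Scala JAR you would also pass --class <main-class>
    spark-submit \
      --master yarn \
      --deploy-mode cluster \
      --conf spark.executor.memoryOverhead=1g \
      --driver-memory 1g \
      --executor-memory 8g \
      --executor-cores 4 \
      --num-executors 2 \
      my_pyspark_app.py first_app_argument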
So you can use the spark-submit command with these options.
The second-to-last argument is your application JAR or PySpark script.
Finally, you have the command-line arguments for your application.
Class : The class option is only valid for Java or Scala applications, and it is not
required when you submit a PySpark application. It tells Spark the driver class
name where you defined your main() method for Java and Scala.
Master : The master option tells Spark which cluster manager to use. If you are using
YARN, the value of master is yarn. But if you want to run your application on the
local machine, you should set the master to local. You can also use local[3] to run
Spark locally with three threads.
Conf : The conf option allows you to set additional Spark configurations.
For example, you can increase the executor memory overhead by setting
spark.executor.memoryOverhead through --conf.
By default, the executor memory overhead is 10% of the executor memory.
Cluster mode : The driver runs on the cluster, alongside the executors.
Client mode : The driver runs on the client machine where you launched spark-submit, while the executors run on the cluster.
Spark jobs : Each action creates a Spark job (number of actions == number of Spark jobs).
Stages : Number of wide transformations + 1.
Shuffle/sort : When we perform a wide transformation, Spark performs a shuffle and sort.
Task : A task is the smallest unit of work in a Spark job. The number of tasks in a
stage is equal to the number of input partitions, and each stage may be executed
as one or more parallel tasks.
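To make these counting rules concrete, here is a small hypothetical PySpark sketch (the file name and column names are assumptions), with the expected counts noted in the comments.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("counting-demo").getOrCreate()

    df = spark.read.parquet("data/orders.parquet")   # assume 10 input partitions
    summary = (df.where("amount > 100")              # narrow transformation
                 .groupBy("country")                 # wide transformation -> shuffle/sort
                 .count())
    summary.write.mode("overwrite").parquet("out/")  # action -> 1 Spark job

    # 1 action               -> 1 Spark job
    # 1 wide transformation  -> 1 + 1 = 2 stages
    # 10 input partitions    -> 10 parallel tasks in the first stage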
Understanding Spark jobs and stages requires you to know about the Spark Dataframe API classification.
Most of the Spark APIs can be classified into two categories.
o Transformations
o Actions
We also have some functions and objects that are neither transformations nor actions (utility functions or helper methods):
o printSchema()
o cache()
o persist()
o isEmpty()
Transformations are used to process and convert data, and they are further classified into two categories.
Narrow Dependency transformations
The narrow dependency transformations can run in parallel on each data partition
without grouping data from multiple partitions, so they do not require a shuffle.
For example, select(), filter(), withColumn(), drop(), etc.
All these transformations can be independently performed on each data partition
in parallel, and you do not need to group data to perform them.
Wide Dependency transformations
The wide dependency transformations, such as groupBy() and join(), need data
grouped from multiple partitions, so they require a shuffle/sort.
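As a small illustration, the chain below (continuing with the hypothetical df from the earlier sketch; column names and the conversion factor are assumptions) uses only narrow dependency transformations, so Spark can apply it to every partition independently without any shuffle.

    # All narrow: each partition is processed on its own, with no data movement.
    cleaned = (df.select("order_id", "country", "amount")
                 .filter("amount > 0")
                 .withColumn("amount_usd", df["amount"] * 1.1)
                 .drop("amount"))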
Actions
As the name suggests, an action triggers some work, such as writing a Dataframe to
disk, computing the transformations, and collecting the results.
Some commonly used actions are read(), write(), collect(), take(), etc.
All Spark actions trigger one or more Spark jobs.
And that's a critically important concept.
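For instance (again with the hypothetical df), each of the lines below is an action and will trigger at least one Spark job.

    rows   = df.collect()                              # brings all rows to the driver
    sample = df.take(5)                                # brings the first 5 rows to the driver
    df.write.mode("overwrite").parquet("out/orders")   # writes the Dataframe to disk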
Let's take an example code and see how Spark runs it.
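The code block from the original notes is not reproduced here; the sketch below is a hypothetical stand-in with the same shape: a read on the first line and a second block of transformations ending in an action (file name, options, and columns are assumptions).

    # Block 1: the read (with schema inference) on the first line.
    df = spark.read.option("inferSchema", "true").csv("data/orders.csv")
    # Block 2: everything after the read, ending in a collect().
    summary = (df.where("amount > 100")
                 .groupBy("country")
                 .count()
                 .collect())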
So my first block is the read on the first line, and my second block starts right after it and runs to the last line.
Each action creates a Spark job.
I have two Spark actions visible here, so I can assume two Spark jobs.
The Spark driver will create a logical query plan for each Spark job.
Once we have the logical plan, the driver will start breaking this plan into
stages.
The driver will look at this plan to find out the wide dependency
transformations.
Now let's see how this plan fits into the cluster.
I have a driver and four executors. Each executor will have one JVM process.
But I assigned 4 CPU cores to each executor, so my executor JVM can create four
parallel threads. And that's the slot capacity of my executor.
So each executor can have four parallel threads, and we call them
executor slots.
The driver knows how many slots we have at each executor.
So for this configuration, let's assume the driver has a job stage to finish.
And you have ten input partitions for the stage, so you can have ten parallel tasks for the same.
Now the driver will assign those ten tasks to these slots, filling ten of the sixteen available slots.
That's how these tasks are assigned and run by the executor.
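A minimal sketch of requesting that layout when building the session (the values mirror the example above; you could equally set them through spark-submit options):

    from pyspark.sql import SparkSession

    # 4 executors x 4 cores each = 16 executor slots; a 10-task stage fills 10 of them.
    spark = (SparkSession.builder
             .appName("slot-demo")
             .config("spark.executor.instances", "4")
             .config("spark.executor.cores", "4")
             .getOrCreate())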
Apache Spark gives you two prominent interfaces to work with data.
Spark SQL
Dataframe API
Spark SQL is compliant with the ANSI SQL:2003 standard. So you can think of
it as standard SQL.
Dataframes are functional APIs, and they allow you to apply functional
programming techniques to process your data.
Other than these two, you also have Dataset API available only for Scala
and Java.
The Dataframe API internally uses Dataset APIs, but these Dataset APIs
are not available in PySpark.
If you are using Scala or Java, you can directly use Dataset APIs.
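For example, the same aggregation can be written either way (the view and column names are assumptions); both forms go through the same Spark SQL engine.

    # Dataframe API
    df.groupBy("country").count().show()

    # Spark SQL over the same data
    df.createOrReplaceTempView("orders")
    spark.sql("SELECT country, count(*) AS cnt FROM orders GROUP BY country").show()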
Spark looks at your code in terms of jobs.
Similarly, if you write a SQL expression, Spark considers one SQL
expression as one Job.
Spark code is nothing but a sequence of Spark Jobs. And each Spark Job
represents a logical query plan.
This first logical plan is a user-created logical query plan.
Now this plan goes to Spark SQL engine.
You may have Dataframe APIs, or you may have SQL. Both will go to the
Spark SQL engine.
For Spark, they are nothing but a Spark Job represented as a logical plan.
The Spark SQL Engine will process your logical plan in four stages.
The Analysis phase will parse your code and create a fully resolved logical plan.
If your code passes the Analysis phase, that means you have valid code.
The logical optimization phase applies standard rule-based optimizations to the
logical plan.
Then this optimized logical plan goes into the next phase of physical
planning.
Spark SQL takes a logical plan and generates one or more physical plans
in the physical planning phase.
Physical planning applies cost-based optimization.
So the engine will create multiple plans, calculate each plan's cost, and
finally select the plan with the least cost.
At this stage, they mostly use different join algorithms to create more
than one physical plan.
For example, they might create one plan using a broadcast join and
another using a sort-merge join.
Then they apply a cost to each plan and choose the best one.
The last stage is code generation. So your best physical plan goes into
code generation, and the engine will generate Java bytecode for the RDD
operations in the physical plan. And that's why Spark is also said to act as a compiler.
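You can inspect these phases yourself with the Dataframe explain() method; for instance (continuing with the hypothetical df):

    # Prints the parsed and analyzed logical plans, the optimized logical plan,
    # and the selected physical plan.
    df.groupBy("country").count().explain(True)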
So the driver will again request executor containers from YARN.
The YARN RM will allocate a bunch of executor containers.
But how much memory do you get for each executor container?
The total memory allocated to the executor container is the sum of the following.
Overhead Memory : Overhead memory is the spark.executor.memoryOverhead.
Heap Memory : JVM Heap is the spark.executor.memory.
Off Heap Memory : Off Heap memory comes from spark.memory.offHeap.size.
PySpark Memory : And the PySpark memory comes from
spark.executor.pyspark.memory.
So the driver will look at all these configurations to calculate your memory requirement
and sum it up.
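A quick worked sketch of that sum (the values are assumptions, chosen to match the executor example that follows):

    heap_mb     = 8000   # spark.executor.memory = 8 GB (treated as 8000 MB here)
    overhead_mb = 800    # spark.executor.memoryOverhead ~ 10% of the heap
    off_heap_mb = 0      # spark.memory.offHeap.size (not enabled in this sketch)
    pyspark_mb  = 0      # spark.executor.pyspark.memory (not set in this sketch)

    container_total_mb = heap_mb + overhead_mb + off_heap_mb + pyspark_mb   # 8800 MB requested from YARN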
Driver memory allocation : Let's assume that at spark-submit you ask for 4 GB of JVM memory for the driver.
Executor memory allocation :
You started your application with spark.executor.memory as 8 GB and
spark.executor.cores equal to 4.
You will also get the default 10% overhead memory, right?
Spark will reserve 300 MB, and you are left with 7700 MB.
This remaining memory is again divided into a 60-40 ratio.
The 60% goes to the Spark memory pool, and the remaining goes to the user
memory pool.
If you want, you can change this 60-40 ratio using the spark.memory.fraction
configuration.
The Spark Memory pool is your main executor memory pool which you will use for
data frame operations and caching.
The User Memory pool is used for non-dataframe operations.
Here are some examples.
If you created user-defined data structures such as hash maps, Spark would
use the user memory pool.
So for this example, we started with 8 GB, but we are left with a 4620 MB Spark
memory pool.
Great! So I have one executor JVM with a 2310 MB storage pool and another 2310 MB
executor pool.
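The arithmetic behind those numbers, as a quick sketch (treating 8 GB as 8000 MB, as the example above does):

    heap_mb       = 8000                        # spark.executor.memory
    reserved_mb   = 300                         # fixed reserved memory
    usable_mb     = heap_mb - reserved_mb       # 7700 MB
    spark_pool_mb = int(usable_mb * 0.60)       # 4620 MB (spark.memory.fraction = 0.6)
    user_pool_mb  = usable_mb - spark_pool_mb   # 3080 MB user memory pool
    storage_mb    = spark_pool_mb // 2          # 2310 MB storage pool
    executor_mb   = spark_pool_mb - storage_mb  # 2310 MB executor pool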
The memory manager can also allocate executor memory from the storage
memory pool as long as there is some free space.
In short, it can break this boundary as long as free memory is available; if no free
memory is left, you will encounter an out-of-memory exception.
In general, Spark recommends two or more cores per executor, but you should not go
beyond five cores.
More than five cores cause excessive memory management overhead and contention.
What are these features, and why do we need them? What problems do they
solve?
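The query that this section walks through is not reproduced in these notes; as a stand-in, assume a simple groupBy aggregation of this shape (the table and column names are hypothetical).

    # A groupBy aggregation like this produces the two-stage plan discussed below.
    result = (spark.read.parquet("data/orders.parquet")
                   .groupBy("country")
                   .agg({"amount": "sum"}))
    result.collect()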
The Spark job triggered by this query should have a two-stage plan.
So you will have two stages: stage zero and stage one.
Stage zero reads the data from the source table and writes it to the output exchange.
Stage one will then read the data from the output exchange into its input exchange.
And this process is known as shuffle/sort.
Why do we have this shuffle/sort in our execution plan?
Because we are doing a groupBy operation and groupBy is a wide-dependency
transformation.
So you are likely to have a shuffle/sort in your execution plan.
Stage one is dependent on stage zero, so it cannot start unless stage zero is
complete.
We already learned all this in the earlier section.
So what do we have?
We have four shuffle partitions for stage one.
The five empty partitions are gone.
Spark also merged two small partitions to create one larger partition.
So instead of having five disproportionate partitions, we have four partitions.
And these four are a little more proportionate.
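This behavior, dropping empty shuffle partitions and merging small ones, comes from adaptive query execution; assuming that is the feature being discussed here, a minimal way to enable it looks like this:

    # Enable adaptive query execution and shuffle-partition coalescing (Spark 3.x).
    spark.conf.set("spark.sql.adaptive.enabled", "true")
    spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")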