By: Amit Raj IIT Kharagpur
Apache Spark Performance
Tuning and Best Practices
Our Agenda
01 Spark Introduction
02 Code Level Optimization
03 Beyond-Code Techniques
04 Demo
05 Summary
Introduction
● Apache Spark is an open-source, in-memory computation framework.
● It delivers high performance for both batch and streaming jobs.
● It is built for big data processing.
● It can be up to around 100 times faster than MapReduce for some workloads, because of in-memory computation.
Because Spark applications process big data, they also use a lot of resources such as
CPU, RAM and storage. Optimising one or more of these can reduce cost considerably.
In the upcoming 40 minutes we will learn about approaches that help to do so.
Ways to Optimise
Code Level:-
Here we will learn the best practices to follow in order to achieve high performance with minimal
resources, such as:- Caching, Broadcasting, Serialization, using Dataset/DataFrame over RDD, avoiding
UDFs, filtering data as early as possible, and reducing shuffle
Beyond Code:-
Here we will learn to tune configuration parameters and cluster-level resources, such as:-
File Format, Level of Parallelism, Executor Config, Memory Tuning, Batch Interval
Major Bottleneck
● CPU
● Network Bandwidth
● Memory
Our goal is to optimise each of them as much as possible, in order to reduce both the resources used
and the computation time.
Caching
Suppose our analytics project has a text file of flight data, and we have to read it and get the number of flights leaving
from a particular country. The same grouped result is used multiple times.
● Raw data is in a text file
● Read the text file as DF1
● Group by origin country to get DF2
JOB1:- number of flights leaving the US as DF3
JOB2:- number of flights leaving Singapore as DF4
JOB3:- number of flights leaving India as DF5
Execution plan for JOB1 :- DF1 > DF2 > DF3
Execution plan for JOB2 :- DF1 > DF2 > DF4; after caching, DF2 > DF4 (no need for the DF1 > DF2 step).
Execution plan for JOB3 :- DF1 > DF2 > DF5; after caching, DF2 > DF5 (no need for the DF1 > DF2 step).
Here, instead of computing DF1 and DF2 again for every job, we can cache the last reusable DataFrame (DF2) in memory and
reuse it in the other jobs, which reduces computation resources and saves time.
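The flight example can be sketched roughly as follows. This is only an illustration: the column name origin_country, the object name CachingSketch and the tiny in-memory dataset standing in for the text file are all assumptions, and Spark is assumed to be on the classpath.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object CachingSketch {
  // DF2: flights grouped by origin country, marked for caching so later jobs reuse it.
  def flightsByOrigin(flights: DataFrame): DataFrame =
    flights.groupBy("origin_country").count().cache()

  // DF3/DF4/DF5: look up one country in the cached DF2 (0 if the country is absent).
  def flightsLeaving(df2: DataFrame, country: String): Long =
    df2.filter(col("origin_country") === country)
      .select("count")
      .collect()
      .headOption
      .map(_.getLong(0))
      .getOrElse(0L)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("CachingSketch").master("local[*]").getOrCreate()
    import spark.implicits._

    // Hypothetical stand-in for DF1 (the raw text file read as a DataFrame).
    val df1 = Seq("US", "US", "Singapore", "India", "US").toDF("origin_country")
    val df2 = flightsByOrigin(df1)

    // The first job materialises the cache; the next two read DF2 from memory.
    println(flightsLeaving(df2, "US"))        // 3
    println(flightsLeaving(df2, "Singapore")) // 1
    println(flightsLeaving(df2, "India"))     // 1

    df2.unpersist() // release the cached data once it is no longer needed
    spark.stop()
  }
}
```

Note that cache() is lazy: nothing is stored until the first action runs on DF2.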
Broadcasting
A broadcast variable lets us keep a read-only variable cached on each executor, so we don't have to send it with
every task. This reduces network traffic and time.
When to Use a Broadcast Variable:-
Suppose we have lookup data that every executor needs while performing its tasks.
We have 100 partitions and a cluster of 10 executors (each executor takes care of 10 partitions).
We need to execute at least 100 tasks, so without broadcasting the lookup data is sent to the executors 100 times (once with every task).
With a broadcast variable, the lookup data is sent to each executor only once, so only 10 copies are
sent.
Benefit = sending 10 copies instead of 100
val states = Map(("NY","New York"),("CA","California"),("FL","Florida"))
val countries = Map(("USA","United States of America"),("IN","India"))
val broadcastStates = spark.sparkContext.broadcast(states)
val broadcastCountries = spark.sparkContext.broadcast(countries)
In the diagram, m is a broadcast variable: it sits in the memory of each executor and is read during task execution.
Hence the driver doesn't need to ship the variable (m) with each task, which reduces network I/O and time.
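To show how tasks read a broadcast variable, here is a minimal sketch that reuses the states map from the snippet above. The object name BroadcastSketch, the helper stateNames and the "unknown" fallback are assumptions made for illustration.

```scala
import org.apache.spark.sql.SparkSession

object BroadcastSketch {
  val states: Map[String, String] =
    Map("NY" -> "New York", "CA" -> "California", "FL" -> "Florida")

  // Expands state codes to full names, reading the lookup map via .value inside each task.
  def stateNames(spark: SparkSession, codes: Seq[String]): Seq[String] = {
    val broadcastStates = spark.sparkContext.broadcast(states)
    spark.sparkContext
      .parallelize(codes)
      .map(code => broadcastStates.value.getOrElse(code, "unknown"))
      .collect()
      .toSeq
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("BroadcastSketch").master("local[*]").getOrCreate()
    println(stateNames(spark, Seq("NY", "FL", "TX")).mkString(", ")) // New York, Florida, unknown
    spark.stop()
  }
}
```

The closure captures only the small broadcast handle, not the map itself, so the map is not serialized with every task.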
Serialization
Serialization is needed whenever we write data to storage or send it over the network.
Deserialization is needed when we read that data back from a source.
In the Spark ecosystem we deal with both of them all the time: during caching, broadcasting, shuffling, etc.
Hence it becomes very important to optimise the serialization process.
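Kryo serialization over Java serialization:-
Kryo is significantly faster and more compact than Java serialization (often as much as 10 times), but it doesn't support all
Serializable types, and for best performance it requires you to register the classes you use in advance.
// The serializer must be configured before the session is created; it cannot be changed afterwards
val spark = SparkSession.builder().appName("Broadcast").master("local")
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .getOrCreate()
A further optimisation is to register your classes with Kryo in advance, which matters most when there are many rows: if you
don't register a class, Kryo stores the full class name with each object of it (for every row).
val conf = new SparkConf() // pass this conf to the builder before getOrCreate()
conf.set("spark.kryo.registrationRequired", "true") // fail fast on any unregistered class
conf.registerKryoClasses(Array(classOf[Foo])) // Foo stands for your own class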
DataSet/DataFrame over RDD
An RDD serializes and deserializes its data whenever it distributes that data across the cluster, for example during repartition
and shuffle, and serialization and deserialization are very expensive operations in Spark.
A DataFrame, on the other hand, stores data in a compact binary format (Tungsten, optionally off-heap) and can work on it directly,
so it avoids most of that serialization and deserialization when data moves across the cluster. DataFrame queries also go through the
Catalyst optimizer. In practice we often see a big performance improvement with DataFrame over RDD.
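As a rough illustration, the same per-department total can be written against both APIs. The names RddVsDataFrameSketch, dept and amount are assumptions for this sketch; it shows the two styles only and does not measure performance.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum

object RddVsDataFrameSketch {
  // RDD version: rows are JVM objects that get serialized when shuffled.
  def totalsWithRdd(spark: SparkSession, sales: Seq[(String, Int)]): Map[String, Int] =
    spark.sparkContext.parallelize(sales).reduceByKey(_ + _).collect().toMap

  // DataFrame version: the same aggregation, planned by Catalyst over binary rows.
  def totalsWithDataFrame(spark: SparkSession, sales: Seq[(String, Int)]): Map[String, Long] = {
    import spark.implicits._
    sales.toDF("dept", "amount")
      .groupBy("dept")
      .agg(sum("amount").as("total")) // sum over an integer column yields a long
      .collect()
      .map(row => row.getString(0) -> row.getLong(1))
      .toMap
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("RddVsDataFrameSketch").master("local[*]").getOrCreate()
    val sales = Seq(("hr", 10), ("it", 20), ("hr", 5))
    println(totalsWithRdd(spark, sales))       // hr -> 15, it -> 20
    println(totalsWithDataFrame(spark, sales)) // hr -> 15, it -> 20
    spark.stop()
  }
}
```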
Avoid UDF
Spark treats a UDF as a black box, so when we use one we lose the optimisations Spark would otherwise apply to our DataFrame/Dataset code. Hence
whenever a built-in Spark function can do the job we should use it, and avoid UDFs as much as possible.
If we do have to use a UDF, we first define it like a normal Scala function, wrap it with udf, and
register it with spark.udf:
● import org.apache.spark.sql.functions.udf
● val plusOne = udf((x: Int) => x + 1) // define the function
● spark.udf.register("plusOne", plusOne) // register the UDF
● spark.sql("SELECT plusOne(5)").show() // call the UDF
// result:
// +------+
// |UDF(5)|
// +------+
// |     6|
// +------+
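For comparison, the plusOne logic can usually be written with built-in column expressions, which Spark can optimise. A minimal sketch, where the object name AvoidUdfSketch and the column names x and plus_one are assumptions:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, udf}

object AvoidUdfSketch {
  private val plusOneUdf = udf((x: Int) => x + 1)

  // UDF version: opaque to the optimizer.
  def withUdf(df: DataFrame): DataFrame =
    df.withColumn("plus_one", plusOneUdf(col("x")))

  // Built-in version: a plain column expression that Spark can analyse and optimise.
  def withBuiltin(df: DataFrame): DataFrame =
    df.withColumn("plus_one", col("x") + 1)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("AvoidUdfSketch").master("local[*]").getOrCreate()
    import spark.implicits._
    val df = Seq(1, 2, 3).toDF("x")
    withUdf(df).show()     // plus_one holds 2, 3, 4
    withBuiltin(df).show() // same values, without a UDF
    spark.stop()
  }
}
```

Calling explain() on both results is a simple way to see how differently the two expressions appear in the physical plan.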
Filter Data at Earliest
Example:- suppose we have a dataset of employees with columns like employee number, age, gender, salary, department, city, address,
past experience, marital status, etc.
But we only have to find the number of employees belonging to each city. In this case we perform a groupBy operation on the city column
and the other columns become irrelevant, so we keep only the column we need as early as possible.
df.select("city").groupBy("city").count().show() // select the needed column first, then aggregate
df.groupBy("city").count().select("city", "count").show() // aggregate first, select afterwards
[Diagram: two query plans, each with Scan, Aggregate and Filter stages]
Shuffling
Shuffling is the mechanism Spark uses to redistribute data across different executors and even across
machines. A shuffle is triggered when we perform certain transformations such as groupByKey(),
reduceByKey() and join() on RDDs and DataFrames. It involves
● Disk I/O
● Data serialization and deserialization
● Network I/O
Reduce Shuffle Operation
We cannot completely avoid shuffle operations, but where possible we should reduce the number of shuffle operations
and remove any operations whose results are not used.
Spark provides the spark.sql.shuffle.partitions configuration to control the number of partitions used in a shuffle. Tuning this property
can improve Spark performance.
spark.conf.set("spark.sql.shuffle.partitions", 100)
Here 100 is the shuffle partition count (the default is 200). We can tune this number by trial and error based on data size: if we have little data we
don't need 100 shuffle partitions; if we have much bigger data and can run a large number of tasks in parallel, we can increase
it to 200 or more.
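The effect of this setting can be checked on a toy dataset. The sketch below makes two assumptions worth stating: adaptive query execution is switched off so Spark does not coalesce the small shuffle partitions, and the names ShufflePartitionsSketch, key and value are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession

object ShufflePartitionsSketch {
  // Returns how many partitions a groupBy produces under the given setting.
  def partitionsAfterGroupBy(spark: SparkSession, shufflePartitions: Int): Int = {
    import spark.implicits._
    // Assumption: disable adaptive execution so the partition count is not coalesced.
    spark.conf.set("spark.sql.adaptive.enabled", "false")
    spark.conf.set("spark.sql.shuffle.partitions", shufflePartitions.toString)

    val df = (1 to 1000).map(i => (i % 10, i)).toDF("key", "value")
    // groupBy triggers a shuffle; the result has one partition per shuffle partition.
    df.groupBy("key").count().rdd.getNumPartitions
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("ShufflePartitionsSketch").master("local[*]").getOrCreate()
    println(partitionsAfterGroupBy(spark, 8))
    println(partitionsAfterGroupBy(spark, 100))
    spark.stop()
  }
}
```

On recent Spark versions adaptive execution is enabled by default and may merge small shuffle partitions at runtime, so the configured number acts as a starting point rather than a fixed count.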
File Format
Suppose we have a pipeline like this: DataSource > SparkJob1 > Database > SparkJob2 > Database
SparkJob1 reads the data from the source (Database1) and writes its output to Database2. SparkJob2 then reads
from Database2, performs its calculation and writes to Database3.
Database2 is therefore both written to and read from by Spark.
In this scenario we should prefer writing the intermediate data in a serialized, optimised file format such as Avro or Parquet.
Transformations on these formats generally perform better than on text, CSV or JSON.
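The hand-off between the two jobs could look roughly like this with Parquet, which Spark SQL supports out of the box (Avro needs the separate spark-avro module). The object name, helper names and the temporary directory are assumptions for this sketch.

```scala
import java.nio.file.Files
import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}

object IntermediateParquetSketch {
  // SparkJob1: write the intermediate result as Parquet.
  def writeIntermediate(df: DataFrame, path: String): Unit =
    df.write.mode(SaveMode.Overwrite).parquet(path)

  // SparkJob2: read the intermediate result back; the schema travels with the files.
  def readIntermediate(spark: SparkSession, path: String): DataFrame =
    spark.read.parquet(path)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("IntermediateParquetSketch").master("local[*]").getOrCreate()
    import spark.implicits._

    // Hypothetical location standing in for the intermediate store.
    val path = Files.createTempDirectory("intermediate").resolve("job1_output").toString

    val job1Output = Seq(("Pune", 3), ("Delhi", 5)).toDF("city", "employees")
    writeIntermediate(job1Output, path)

    val job2Input = readIntermediate(spark, path)
    job2Input.printSchema() // column names and types are preserved
    println(job2Input.count())
    spark.stop()
  }
}
```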
[Diagram: DataBase1 > Spark Job1 > DataBase2 > Spark Job2 > Database3]
Executor Config
● Job > Stage > Task
● One job can have multiple stages; one stage can have multiple tasks.
● Number of cores = number of tasks that can run in parallel.
● We have to give each executor an appropriate number of cores in order to make good use of the resources.
● Allocating too many cores to each executor means more parallel tasks sharing that executor's memory, which can
lead to out-of-memory (OOM) errors.
● Allocating too few cores per executor reduces parallelism and loses its benefit. The executor
memory will also not be fully utilised.
● After many iterations, a common recommendation is to allocate about 5 cores per executor in order to balance
parallelism and memory use.
./bin/spark-submit --driver-memory 8G --executor-memory 16G --num-executors 3 --executor-cores 5
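The numbers in this command can be turned into a quick back-of-the-envelope check. The sketch below is plain Scala arithmetic, not a Spark API: it assumes one core per task (the default for spark.task.cpus) and ignores memory overhead and Spark's internal memory fractions, so the per-task figure is only a rough upper bound.

```scala
object ExecutorSizingSketch {
  // Hypothetical holder for the three spark-submit settings used above.
  final case class ExecutorConfig(numExecutors: Int, coresPerExecutor: Int, executorMemoryGb: Double)

  // Each core runs one task at a time, so this is the cluster-wide parallelism.
  def maxParallelTasks(c: ExecutorConfig): Int =
    c.numExecutors * c.coresPerExecutor

  // Tasks on the same executor share its memory; this is a rough per-task share.
  def roughMemoryPerTaskGb(c: ExecutorConfig): Double =
    c.executorMemoryGb / c.coresPerExecutor

  def main(args: Array[String]): Unit = {
    val cfg = ExecutorConfig(numExecutors = 3, coresPerExecutor = 5, executorMemoryGb = 16.0)
    println(maxParallelTasks(cfg))     // 15
    println(roughMemoryPerTaskGb(cfg)) // 3.2
  }
}
```

Doubling the cores per executor to 10 would double the parallelism but halve the memory share of each task, which is the OOM risk described above.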
Memory Tuning
There are three considerations in tuning memory usage:
● the amount of memory used by your objects (you may want your entire dataset to fit in memory),
● the cost of accessing those objects, and
● the overhead of garbage collection.
● Simple types such as arrays and strings generally use less space than linked structures such as LinkedList and Map, because each entry
of a linked structure is a wrapper object that has not only a header but also pointers (typically 8 bytes each) to the next object in the list.
● We can also reduce memory usage by storing data in a serialized format.
● Java objects are fast to access but often consume 2-5 times more space than the "raw" data inside their fields.
● Using data structures with fewer objects and caching data in serialized form helps reduce garbage
collection cost. Broadcast variables can also help reduce GC pressure.
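Caching in serialized form can be requested through the storage level. A minimal sketch for an RDD, where the object name SerializedCacheSketch, the helper cachedPairs and the generated data are assumptions:

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object SerializedCacheSketch {
  // Builds a small RDD and marks it to be cached as serialized bytes:
  // fewer live objects for the garbage collector, at the cost of extra CPU on access.
  def cachedPairs(spark: SparkSession, n: Int): RDD[(Int, String)] =
    spark.sparkContext
      .parallelize(1 to n)
      .map(i => (i, s"row-$i"))
      .persist(StorageLevel.MEMORY_ONLY_SER)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("SerializedCacheSketch").master("local[*]").getOrCreate()
    val pairs = cachedPairs(spark, 100000)
    println(pairs.count())                      // first action fills the cache
    println(pairs.getStorageLevel.description)  // shows the storage level in use
    pairs.unpersist()
    spark.stop()
  }
}
```

Combining this with Kryo serialization usually makes the cached bytes smaller still.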
Thank You !
Get in touch with us:
Amit Raj
Senior Data Engineer
IIT Kharagpur
amitraj.iitkgp@gmail.com / 7548095242
Artificial Intelligence (AI) basics.pptxArtificial Intelligence (AI) basics.pptx
Artificial Intelligence (AI) basics.pptx
aditichinar
 
introduction to machine learining for beginers
introduction to machine learining for beginersintroduction to machine learining for beginers
introduction to machine learining for beginers
JoydebSheet
 
DATA-DRIVEN SHOULDER INVERSE KINEMATICS YoungBeom Kim1 , Byung-Ha Park1 , Kwa...
DATA-DRIVEN SHOULDER INVERSE KINEMATICS YoungBeom Kim1 , Byung-Ha Park1 , Kwa...DATA-DRIVEN SHOULDER INVERSE KINEMATICS YoungBeom Kim1 , Byung-Ha Park1 , Kwa...
DATA-DRIVEN SHOULDER INVERSE KINEMATICS YoungBeom Kim1 , Byung-Ha Park1 , Kwa...
charlesdick1345
 
railway wheels, descaling after reheating and before forging
railway wheels, descaling after reheating and before forgingrailway wheels, descaling after reheating and before forging
railway wheels, descaling after reheating and before forging
Javad Kadkhodapour
 
Smart Storage Solutions.pptx for production engineering
Smart Storage Solutions.pptx for production engineeringSmart Storage Solutions.pptx for production engineering
Smart Storage Solutions.pptx for production engineering
rushikeshnavghare94
 
Reagent dosing (Bredel) presentation.pptx
Reagent dosing (Bredel) presentation.pptxReagent dosing (Bredel) presentation.pptx
Reagent dosing (Bredel) presentation.pptx
AlejandroOdio
 
Lidar for Autonomous Driving, LiDAR Mapping for Driverless Cars.pptx
Lidar for Autonomous Driving, LiDAR Mapping for Driverless Cars.pptxLidar for Autonomous Driving, LiDAR Mapping for Driverless Cars.pptx
Lidar for Autonomous Driving, LiDAR Mapping for Driverless Cars.pptx
RishavKumar530754
 
Introduction to Zoomlion Earthmoving.pptx
Introduction to Zoomlion Earthmoving.pptxIntroduction to Zoomlion Earthmoving.pptx
Introduction to Zoomlion Earthmoving.pptx
AS1920
 
Smart_Storage_Systems_Production_Engineering.pptx
Smart_Storage_Systems_Production_Engineering.pptxSmart_Storage_Systems_Production_Engineering.pptx
Smart_Storage_Systems_Production_Engineering.pptx
rushikeshnavghare94
 
MAQUINARIA MINAS CEMA 6th Edition (1).pdf
MAQUINARIA MINAS CEMA 6th Edition (1).pdfMAQUINARIA MINAS CEMA 6th Edition (1).pdf
MAQUINARIA MINAS CEMA 6th Edition (1).pdf
ssuser562df4
 
Explainable-Artificial-Intelligence-XAI-A-Deep-Dive (1).pptx
Explainable-Artificial-Intelligence-XAI-A-Deep-Dive (1).pptxExplainable-Artificial-Intelligence-XAI-A-Deep-Dive (1).pptx
Explainable-Artificial-Intelligence-XAI-A-Deep-Dive (1).pptx
MahaveerVPandit
 

Spark Performance Tuning .pdf

  • 1. Apache Spark Performance Tuning and Best Practices. By: Amit Raj, IIT Kharagpur
  • 2. Our Agenda: 01 Spark Introduction, 02 Code-Level Optimization, 03 Outside-Code Techniques, 04 Demo, 05 Summary
  • 3. Introduction ● Apache Spark is an open-source, in-memory computation framework. ● It gives high performance for both batch and streaming jobs. ● It is built for big data processing. ● It can be up to about 100 times faster than MapReduce, because it computes in memory. Because it handles big data workloads, it also uses a lot of resources such as CPU, RAM and storage. Optimising one or more of these together can save a lot of cost. In the upcoming 40 minutes we will learn about the approaches that help to do so.
  • 4. Ways to Optimise. Code Level: here we will learn the best practices to follow in order to achieve high performance with minimal resources, such as: Caching, Broadcasting, Serialization, using Dataset/DataFrame over RDD, avoiding UDFs, filtering data as early as possible, and reducing shuffles. Beyond Code: here we will learn to tune configuration parameters and cluster resources, such as: File Format, Level of Parallelism, Executor Config, Memory Tuning, and Batch Interval.
  • 5. Major Bottlenecks ● CPU ● Network Bandwidth ● Memory. Our goal is to optimise each of them as much as possible, in order to reduce both the resources used and the computation time.
  • 6. Caching. Suppose our analytics project has a text file that we must read to get the number of flights leaving from a particular country, and the same result is used multiple times. ● Raw data is in a text file ● The text file is read as DF1 ● DF1 is grouped by origin country to give DF2
  • 7. Caching. JOB1: number of flights leaving the US, as DF3. JOB2: number of flights leaving Singapore, as DF4. JOB3: number of flights leaving India, as DF5. Execution plan for JOB1: DF1 > DF2 > DF3. Execution plan for JOB2: DF1 > DF2 > DF4; after caching, DF2 > DF4, with no need for the DF1 > DF2 step. Execution plan for JOB3: DF1 > DF2 > DF5; after caching, DF2 > DF5, with no need for the DF1 > DF2 step. Instead of calculating DF1 and DF2 again, we can cache the last reusable DataFrame (DF2) in memory and reuse it in the other jobs, which reduces computation and saves time.
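The three jobs above could be sketched as follows. This is a minimal illustration, not the demo from the talk: the file path, the CSV layout and the column name `origin_country` are assumptions made for the example.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object CachingSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("CachingSketch")
      .master("local[*]")
      .getOrCreate()

    // DF1: the raw data. Path, format and header are hypothetical.
    val df1 = spark.read.option("header", "true").csv("data/flights.csv")

    // DF2: flights grouped by origin country, marked for caching.
    // cache() is lazy; the data is materialised by the first action.
    val df2 = df1.groupBy("origin_country").count().cache()

    // JOB1 runs DF1 > DF2 > DF3 and fills the cache as a side effect.
    df2.filter(col("origin_country") === "US").show()

    // JOB2 and JOB3 start from the cached DF2 and skip the DF1 > DF2 step.
    df2.filter(col("origin_country") === "Singapore").show()
    df2.filter(col("origin_country") === "India").show()

    // Release the cached blocks once DF2 is no longer needed.
    df2.unpersist()
    spark.stop()
  }
}
```

Caching only pays off when the cached DataFrame is reused; a DataFrame that is read once is better left uncached.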
  • 8. Broadcasting. A broadcast variable lets us keep a read-only variable cached on each executor, so we don't have to send it with every task. This reduces network traffic and time. When to use a broadcast variable: suppose we have lookup data that every executor needs while performing its tasks. With 100 partitions on a cluster of 10 executors (each executor handles 10 partitions), we run at least 100 tasks, so without broadcasting the lookup data is sent to executors 100 times (once with every task). With a broadcast variable the lookup data is sent to each executor only once, so only 10 copies are sent. Benefit: sending 100 copies vs sending 10 copies. val states = Map(("NY","New York"),("CA","California"),("FL","Florida")) val countries = Map(("USA","United States of America"),("IN","India")) val broadcastStates = spark.sparkContext.broadcast(states) val broadcastCountries = spark.sparkContext.broadcast(countries)
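To show how a broadcast lookup is read inside a task, here is a small sketch that reuses the `states` map from the slide. The sample rows and the output column names are made up for the example.

```scala
import org.apache.spark.sql.SparkSession

object BroadcastSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("BroadcastSketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val states = Map("NY" -> "New York", "CA" -> "California", "FL" -> "Florida")
    // Shipped to each executor once, then read locally by every task.
    val broadcastStates = spark.sparkContext.broadcast(states)

    // Hypothetical sample rows: (name, state code).
    val people = Seq(("James", "NY"), ("Maria", "CA"), ("Robert", "FL"))
      .toDF("name", "state")

    // Tasks read the lookup through .value instead of capturing the map itself.
    val withStateName = people
      .map(row => (row.getString(0),
                   broadcastStates.value.getOrElse(row.getString(1), "unknown")))
      .toDF("name", "state_name")

    withStateName.show()
    spark.stop()
  }
}
```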
  • 9. Broadcasting (continued). In the diagram, m is a broadcast variable. It sits in the memory of each executor and is used during task execution, so the driver doesn't need to ship m with every task, which reduces network I/O and time.
  • 10. Serialization. As the diagram shows, serialization is needed when we write data to storage, and deserialization is needed when we read it back from a source. In the Spark ecosystem we deal with both all the time: during caching, broadcasting, shuffling and so on. It is therefore very important to optimise the serialization process.
  • 11. Serialization. Kryo serialization over Java serialization: Kryo can be up to 10 times faster and more compact than Java serialization, but it doesn't support all Serializable types and, for best performance, requires registering the classes you use in advance. val spark = SparkSession.builder().appName("Broadcast").master("local").config("spark.serializer", "org.apache.spark.serializer.KryoSerializer").getOrCreate() The serializer is set on the builder because it is read when the Spark context starts; setting it afterwards has no effect. A further optimisation, useful when rows are large, is to register classes with Kryo in advance: if a class is not registered, Kryo stores the full class name with every object of it (that is, with every row). conf.set("spark.kryo.registrationRequired", "true") conf.registerKryoClasses(Array(classOf[Foo])) Here conf is the SparkConf used to create the session.
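Put together, a Kryo setup might look like the sketch below. `Foo` is a hypothetical record type standing in for the class named on the slide. `spark.kryo.registrationRequired` is left out here on purpose: with it enabled, every class that gets serialized must be registered, not only `Foo`.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

// Hypothetical record type used only for this example.
case class Foo(id: Int, name: String)

object KryoSketch {
  def main(args: Array[String]): Unit = {
    // The serializer has to be configured before the session is created.
    val conf = new SparkConf()
      .setAppName("KryoSketch")
      .setMaster("local[*]")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      // Registered classes are written as a small id instead of a full class name.
      .registerKryoClasses(Array(classOf[Foo]))

    val spark = SparkSession.builder().config(conf).getOrCreate()

    val rdd = spark.sparkContext.parallelize(Seq(Foo(1, "a"), Foo(2, "b")))
    // Serialized persistence goes through the configured serializer.
    rdd.persist(StorageLevel.MEMORY_ONLY_SER)
    println(rdd.count())

    spark.stop()
  }
}
```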
  • 12. Dataset/DataFrame over RDD. An RDD serializes and deserializes its data whenever the data is redistributed across the cluster, for example during repartition and shuffle, and serialization and deserialization are very expensive operations in Spark. A DataFrame, on the other hand, keeps its data in a compact binary format (optionally off-heap), so it largely avoids this serialization and deserialization when data moves across the cluster. We see a big performance improvement with DataFrame over RDD.
  • 13. Avoid UDFs. When we use UDFs we lose the optimisations Spark applies to our DataFrame/Dataset code. Whenever a built-in Spark function is available we should use it and avoid UDFs as much as possible. If we do have to use one, we first define it like a normal Scala function and then register it with Spark (this needs import org.apache.spark.sql.functions.udf): ● val plusOne = udf((x: Int) => x + 1) // define the function ● spark.udf.register("plusOne", plusOne) // register the UDF ● spark.sql("SELECT plusOne(5)").show() // call the UDF; the result is a single column UDF(5) with the value 6
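As a rough comparison, the sketch below computes the same `x + 1` once through the slide's `plusOne` UDF and once with built-in column arithmetic. The sample data and the column names are assumptions; `explain()` prints each physical plan so the two can be compared.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, lit, udf}

object BuiltInOverUdf {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("BuiltInOverUdf")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample data.
    val df = Seq(1, 2, 3).toDF("x")

    // UDF version: the optimiser treats the function body as a black box.
    val plusOne = udf((x: Int) => x + 1)
    df.select(plusOne(col("x")).as("x_plus_one")).explain()

    // Built-in version: plain column arithmetic that the optimiser understands.
    df.select((col("x") + lit(1)).as("x_plus_one")).explain()

    spark.stop()
  }
}
```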
  • 14. Filter Data at the Earliest. Example: suppose we have a data set of employees with columns such as employee number, age, gender, salary, department, city, address, past experience, marital status, etc., but we only need the number of employees in each city. In this case we group by the city column, and the other columns are irrelevant, so it is better to drop them before aggregating. Preferred: df.select("name", "city").groupBy("city").count().show() Compared with aggregating over the full data set: df.groupBy("city").count().show() [Diagram: two query plans built from Scan, Aggregate and Filter steps]
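A small runnable version of the idea, with invented employee rows and a trimmed-down column list, might look like this. Spark's optimiser can often prune columns and push filters down by itself for DataFrame code, so treat the explicit ordering here as a way of stating intent rather than a guaranteed speed-up.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object FilterEarly {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("FilterEarly")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical employee rows; a real table would have many more columns.
    val employees = Seq(
      ("Asha", 31, "Pune", "Sales"),
      ("Ravi", 45, "Delhi", "Finance"),
      ("Meera", 28, "Pune", "Engineering")
    ).toDF("name", "age", "city", "department")

    // Keep only the needed column and rows before the aggregation,
    // so less data has to be shuffled for the groupBy.
    val perCity = employees
      .select("city")
      .filter(col("city") === "Pune")
      .groupBy("city")
      .count()

    perCity.show()
    spark.stop()
  }
}
```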
  • 15. Shuffling. Shuffling is the mechanism Spark uses to redistribute data across executors and even across machines. A shuffle is triggered by certain transformations such as groupByKey(), reduceByKey() and join() on RDDs and DataFrames. It involves ● Disk I/O ● Data serialization and deserialization ● Network I/O
  • 16. Reduce Shuffle Operations. We cannot avoid shuffles completely, but where possible we should reduce the number of shuffle operations and remove any unused operations. Spark provides the spark.sql.shuffle.partitions configuration to control the number of shuffle partitions; tuning this property can improve performance. spark.conf.set("spark.sql.shuffle.partitions",100) Here 100 is the shuffle partition count. We can tune this number by trial and error based on data size: with little data we don't need 100 shuffle partitions, and with much bigger data and enough capacity for many parallel tasks we can increase it to 200 (Spark's default) or more.
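One way to pick a starting value before the trial-and-error stage is a simple size-based estimate. The helper below is a hypothetical rule of thumb, not a Spark API: it assumes a target of roughly 128 MB per shuffle partition and at least one partition per available core.

```scala
object ShufflePartitions {
  // Assumption for illustration: aim for about 128 MB per shuffle partition.
  val TargetPartitionBytes: Long = 128L * 1024 * 1024

  /** Suggests a partition count from the shuffle size and the total core count. */
  def suggestedPartitions(shuffleBytes: Long, totalCores: Int): Int = {
    require(shuffleBytes >= 0, "shuffleBytes must not be negative")
    require(totalCores > 0, "totalCores must be positive")
    val bySize = math.ceil(shuffleBytes.toDouble / TargetPartitionBytes).toInt
    // Never go below the core count, so that no core sits idle.
    math.max(bySize, totalCores)
  }

  def main(args: Array[String]): Unit = {
    val tenGb = 10L * 1024 * 1024 * 1024
    // 3 executors x 5 cores, as in the executor example later in the deck.
    println(suggestedPartitions(tenGb, totalCores = 15))
  }
}
```

The result can then be passed to `spark.conf.set("spark.sql.shuffle.partitions", n)` and adjusted after looking at actual task sizes and run times.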
  • 17. File Format. Suppose we have a system like this: DataSource > SparkJob1 > Database > SparkJob2 > Database. SparkJob1 reads from the source (Database1) and writes to Database2; SparkJob2 then reads from Database2, performs its calculation and writes to Database3. Database2 is therefore both written to and read from. In this scenario we should prefer writing the intermediate data as files in serialized, optimised formats such as Avro or Parquet. Transformations on these formats perform better than on text, CSV or JSON. [Diagram: Database1 > Spark Job1 > Database2 > Spark Job2 > Database3]
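As an illustration of the intermediate hand-off, the sketch below has the first job write its result as Parquet and the second job read it back. The path, the column names and the sample rows are invented for the example.

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

object IntermediateParquet {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("IntermediateParquet")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical location of the intermediate data set.
    val path = "/tmp/intermediate/flights_by_country"

    // Job 1: write the intermediate result as Parquet. Sample rows are made up.
    val result = Seq(("US", 120L), ("India", 95L)).toDF("origin_country", "flights")
    result.write.mode(SaveMode.Overwrite).parquet(path)

    // Job 2: read it back. Parquet is columnar and carries its schema,
    // so selecting one column does not require parsing whole text rows.
    val flights = spark.read.parquet(path).select("flights")
    flights.show()

    spark.stop()
  }
}
```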
  • 18. Executor Config ● Job > Stage > Task ● One job can have multiple stages, and one stage can have multiple tasks. ● The number of cores equals the number of tasks that can run in parallel. ● We have to give each executor a suitable number of cores in order to use resources well. ● Allocating too many cores per executor means more parallel tasks on each executor, which can lead to out-of-memory (OOM) errors. ● Allocating too few cores per executor reduces parallelism and loses its benefit; the executor memory will also not be fully used. ● After many iterations, people commonly recommend about 5 cores per executor to get the most from parallelism with reasonable memory use. ./bin/spark-submit --driver-memory 8G --executor-memory 16G --num-executors 3 --executor-cores 5
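The sizing arithmetic behind such a command can be sketched as below. The node sizes are made-up inputs, and reserving one core and 1 GB per node for the operating system is an assumption of this example; the sketch also ignores per-executor memory overhead, which a real cluster manager adds on top.

```scala
object ExecutorSizing {
  final case class Plan(executorsPerNode: Int, totalExecutors: Int, memoryPerExecutorGb: Int)

  /** Rough plan. Assumes 1 core and 1 GB per node are left for the OS (illustrative only). */
  def plan(nodes: Int, coresPerNode: Int, memoryPerNodeGb: Int, coresPerExecutor: Int = 5): Plan = {
    require(nodes > 0 && coresPerNode > 1 && memoryPerNodeGb > 1 && coresPerExecutor > 0)
    val usableCores = coresPerNode - 1
    val executorsPerNode = usableCores / coresPerExecutor
    require(executorsPerNode > 0, "not enough cores per node for one executor")
    val usableMemoryGb = memoryPerNodeGb - 1
    // Split the remaining memory evenly between the executors on a node.
    Plan(executorsPerNode, executorsPerNode * nodes, usableMemoryGb / executorsPerNode)
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical cluster: 3 nodes with 16 cores and 64 GB each.
    println(plan(nodes = 3, coresPerNode = 16, memoryPerNodeGb = 64))
  }
}
```

The resulting numbers map onto `--num-executors`, `--executor-cores` and `--executor-memory`; the memory figure should be lowered further to leave room for overhead.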
  • 19. Memory Tuning. There are three considerations in tuning memory usage: ● the amount of memory used by your objects (you may want your entire dataset to fit in memory), ● the cost of accessing those objects, and ● the overhead of garbage collection. ● Linked structures such as LinkedList and Map take more space than flat data such as strings or arrays, because each entry is a wrapper object that has not only a header but also pointers (typically 8 bytes each) to the next object in the list. ● We can also reduce memory use by storing data in a serialized format. ● Java objects are fast to access but consume 2-5 times more space than the "raw" data inside their fields. ● Using data structures with fewer objects and caching data in serialized form help reduce garbage collection cost. Broadcast variables also help reduce GC.
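Caching in serialized form can be requested through the storage level. The sketch below uses generated numbers as stand-in data; `MEMORY_ONLY_SER` keeps each partition as a serialized byte array, which trades some CPU on access for fewer live objects on the heap.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object SerializedCache {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("SerializedCache")
      .master("local[*]")
      .getOrCreate()

    // Stand-in data: one (number, text) pair per element.
    val pairs = spark.sparkContext
      .parallelize(1 to 1000000)
      .map(n => (n, n.toString))

    // Store partitions as serialized bytes instead of individual Java objects.
    val cached = pairs.persist(StorageLevel.MEMORY_ONLY_SER)

    // The first action fills the cache; the second one reads from it.
    println(cached.count())
    println(cached.filter(_._1 % 2 == 0).count())

    cached.unpersist()
    spark.stop()
  }
}
```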
  • 20. Thank You ! Get in touch with us: Amit Raj Senior Data Engineer IIT Kharagpur [email protected] / 7548095242