
Apache Spark

++++++++++++
--It is a general-purpose, in-memory compute engine.

Compute engine
+++++++++++++++
Hadoop provides 3 things:
1)HDFS - Storage
2)MapReduce - Computation
3)YARN - Resource Manager

Spark is a replacement/alternative for MapReduce.

Spark is a plug-and-play compute engine which needs 2 things:

1)Storage - HDFS/Local/S3
2)Resource Manager - YARN/Mesos/Kubernetes

In-memory
++++++++

--Consider a series of five MapReduce jobs.

--Disk access is required to read data from HDFS before processing.
--After processing, the output is written back to HDFS.
--So every MapReduce job requires two disk accesses: one for reading and
one for writing.
--Disk I/O is the pain point, i.e., it takes a lot of time.

MapReduce: HDFS -> mr1 -> mr2 -> mr3 -> mr4 -> mr5 (each job reads from and writes back to HDFS)

To overcome this problem, Spark takes the data from HDFS once, performs all
the computations in memory, and only the final output is stored back in HDFS.

Spark: HDFS -> v1 -> v2 -> v3 -> v4 -> v5 -> v6 -> HDFS

--Only two disk I/Os are required for the entire computation.
--It is better than MapReduce and takes less time to process.
--Spark is 10 to 100 times faster than MapReduce.

General purpose
+++++++++++++++

--Pig for cleaning
--Hive for querying
--Mahout for machine learning
--Sqoop for data ingestion

With Spark you learn just one style of writing code, and all of these things -
cleaning, querying, machine learning, data ingestion - can happen with it.
Map Reduce
+++++++++
--High latency (it involves more disk read and write operations than Spark)
--Every MapReduce job takes two disk seeks
Spark
++++++
--Low latency (it involves fewer disk read and write operations than MapReduce)
--In case of Spark, the entire chain of operations takes 2 disk seeks

RDD
+++
--The basic unit which holds data in Spark is called an RDD.
--Resilient Distributed Dataset.
--An RDD is nothing but an in-memory distributed collection.
--RDDs are distributed across the memory of the cluster.
--RDDs are immutable.

Resilient
+++++++++
--If we lose an RDD we can recover it again.
--RDDs provide fault tolerance through the lineage graph.
--A lineage graph keeps track of the transformations to be executed after an action
has been called.
--The RDD lineage graph helps recompute any missing or damaged RDD caused by node
failures.
--So in RDDs we get resiliency by using the lineage graph.

Distributed
+++++++++++
rdd1 = load data from a text file
rdd1 will have, say, 4 partitions (data brought in from HDFS) which are held in memory across the cluster.

2 kinds of operations
+++++++++++++++++++++
1.Transformations
2.Actions

Transformations
+++++++++++++++
--These are lazy,
i.e., execution will not start until an action is triggered.
Data is not loaded until it is necessary.
Spark maintains the record of operations through a DAG.
This allows better optimization by the Spark engine.
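
A minimal sketch of this laziness (assuming a SparkContext sc and a hypothetical input path):

val lines = sc.textFile("/user/cloudera/sample.txt")   // transformation: nothing is read yet
val errors = lines.filter(x => x.contains("ERROR"))    // transformation: only recorded in the DAG
errors.count()                                         // action: only now is the file read and the filter executed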

word Count:
++++++++++

import org.apache.spark.SparkContext
import org.apache.log4j.Level
import org.apache.log4j.Logger

object WordCount extends App{

Logger.getLogger("org").setLevel(Level.ERROR)
val sc = new SparkContext("local[*]", "WordCount")   // needed here since this is a standalone app, not the shell
val rdd1 = sc.textFile("/user/cloudera/batch194/abc.txt")
val rdd2 = rdd1.flatMap(x => x.split(" "))
val lowercase = rdd2.map(x => x.toLowerCase())
val rdd3 = lowercase.map(x => (x, 1))
val rdd4 = rdd3.reduceByKey((x, y) => x + y)
val exch = rdd4.map(x => (x._2, x._1))      // swap to (count, word) so we can sort by count
val sort = exch.sortByKey(false)            // false => descending order of count
val exch1 = sort.map(x => (x._2, x._1))     // swap back to (word, count)
val results = exch1.collect
for (result <- results) {
  val word = result._1
  val count = result._2
  println(s"$word:$count")
}
scala.io.StdIn.readLine()
}

sortByKey - sorts a pair RDD based on the key.

false - gives descending order
true - gives ascending order (the default)

44,8602,37.0
12,8888,12.5
44,8603,44.5

val input=sc.textFile("customer-orders.csv")
val a =input.map(x=>(x.split(",")(0),x.split(",")(2).toFloat))
val b=a.reduceByKey((x,y)=>x+y)
val c=b.sortBy(x=>x._2)
val d=c.collect
d.foreach(println)

**An RDD whose elements are tuples of 2 elements (key, value) is called a pair RDD.


+++

USERID MOVIEID RATING_GIVEN TIMESTAMP

How many movies were rated 5, 4, 3, 2 and 1 stars?

val input=sc.textFile("Movie_data.csv")
val a =input.map(x=>x.split("\t")(2))
val c=a.map(x=>(x,1))
val d=c.reduceByKey((x,y)=>x+y)
val e=d.collect
e.foreach(println)

val input=sc.textFile("Movie_data.csv")
val a =input.map(x=>x.split("\t")(2))
val b=a.countByValue
b.foreach(println)
countByValue is an action.
**It counts how many times each value occurs.
--No further distributed operation can be done on its result, because it returns a local Map to the driver.
++++

Columns: rowid, name, age, number of LinkedIn connections

Records for age 33:
33,100
33,200
33,300

(33, 600/3)
(33, 200)   -- average number of connections per age

def parseLine(line:String) = {
  val fields = line.split(",")
  val age = fields(2).toInt
  val numfriends = fields(3).toInt
  (age, numfriends)
}
val input = sc.textFile("friends")
val b = input.map(parseLine)
val c = b.mapValues(x => (x, 1))                 // same as b.map(x => (x._1, (x._2, 1)))
val d = c.reduceByKey((x, y) => (x._1 + y._1, x._2 + y._2))
val e = d.mapValues(x => x._1 / x._2).sortBy(x => x._2)   // same as d.map(x => (x._1, x._2._1 / x._2._2)).sortBy(x => x._2)

e.collect.foreach(println)

mapValues
+++++++++
It is a transformation that operates only on the value part of a pair RDD; the keys are left untouched.
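
A tiny illustration (hypothetical pair RDD, assuming sc is available):

val pairs = sc.parallelize(List(("a", 1), ("b", 2)))
val scaled = pairs.mapValues(v => v * 10)              // keys untouched: (a,10), (b,20)
val same = pairs.map { case (k, v) => (k, v * 10) }    // equivalent with map, but we must handle the key ourselves

mapValues also preserves the partitioner of the parent RDD, which plain map does not.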

++++

big data contents 24.06


learning big data 34.98

val a =sc.textFile("bigdata-campaign-data.csv")
val b=a.map(x=>(x.split(",")(10).toFloat,x.split(",")(0)))
val c=b.flatMapValues(x=>x.split(" "))
val d=c.map(x=>(x._2,x._1))
val e=d.reduceByKey((x,y)=>x+y)
val f=e.sortBy(x=>x._2,false)
f.take(20).foreach(println)

flatMapValues
+++++++++++++
It applies a function that returns a sequence for each value and flattens the result, pairing every produced value with the original key (in the example above, each word of the title gets paired with the campaign amount).
Accumulator
+++++++++++
--There is a shared copy kept on your driver machine.
--Each executor can update it.
--However, no executor can read the value of the accumulator; they can only add to it.
--This is the same as counters in MapReduce.

val myrdd = sc.textFile("SAMPLEfile.txt")
val myaccum = sc.longAccumulator("blank line accumulator")
myrdd.foreach(x => if (x.trim.isEmpty) myaccum.add(1))   // count blank lines
myaccum.value

Broadcast:
++++++++
A read-only shared copy of a variable is kept on each executor machine, so it does not have to be shipped with every task.
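
A minimal sketch of a broadcast variable (assuming sc and a hypothetical lookup map):

val statusLookup = Map(1 -> "CLOSED", 2 -> "PENDING")                // small lookup data on the driver
val bcLookup = sc.broadcast(statusLookup)                            // shipped once to every executor
val ids = sc.parallelize(List(1, 2, 1, 2))
val named = ids.map(id => bcLookup.value.getOrElse(id, "UNKNOWN"))   // executors read bcLookup.value
named.collect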
YARN:
+++++
--Yet Another Resource Negotiator

1.Storage perspective - HDFS
--Name Node
--Data Node
2.Processing perspective
In Hadoop v1, job execution was controlled by 2 processes:

master - Job Tracker
slaves - Task Trackers

Job Tracker
++++++++++
--Used to do a lot of work in Hadoop v1.
--Keeps track of:
--thousands of task trackers
--hundreds of jobs
--tens of thousands of map and reduce tasks

Scheduling
++++++++++
--Deciding which job to execute first based on the scheduling algorithm and job priority,
finding out the available resources, and providing those resources to the job.
Monitoring
++++++++++
--Tracking the progress of a job: if a task fails, rerun the task; if a task is slow,
start a speculative copy of it on another machine.
Task Tracker
+++++++++++
--The task tracker tracks the tasks on each data node and informs the job
tracker.

Limitations:
++++++++++
1.Scalability
--It was observed that when the cluster size goes beyond 4k data nodes,
the job tracker becomes a bottleneck.
2.Resource utilization--In MR1 there used to be a fixed number of map and reduce
slots, e.g., 100 map slots and 50 reduce slots.
If you want to execute a MapReduce job which requires 150 mappers,
only 100 mappers run at a time
and the remaining 50 mappers run later, even if reduce slots are sitting idle.
3.Only MapReduce jobs were supported.

YARN:
+++++
--Resource Manager - Master
--Node Manager - Slave
--Application Master

The Resource Manager handles scheduling.

The Resource Manager creates a container on one of the Node Managers.
Container = Memory + CPU
          e.g., 1GB + 2 cores
Inside that container the RM creates an Application Master.
This Application Master takes care of end-to-end monitoring for this
application.
The Application Master negotiates resources with the Resource Manager and requests
resources in the form of containers,

i.e.,
priority: 1
location: host1
how many containers
size of each container (resources)
The Resource Manager allocates the resources in the form of containers and sends
the container ids and host names to the Application Master,
which finally launches tasks in the containers.

1.Client sends a request; the request goes to the Resource Manager.
2.The Resource Manager creates a container on any one of the Node Managers.
3.Inside that container the Application Master is created.
4.The Application Master negotiates resources with the Resource Manager. The Resource
Manager sends the container ids and the host names of the node managers to the
Application Master.
5.The Application Master launches tasks in those containers.
6.Each application has its own Application Master.
7.The Application Master does the monitoring kind of activity.

How the limitations are handled
+++++++++++++++++++++++
1.Scalability - the Application Master handles monitoring, while the Resource Manager
handles only scheduling.
2.Resource utilization is much improved, as resources are not wasted on fixed slots.
3.MapReduce and other kinds of jobs can run.
4.Uberization
--The client sends a request, the Resource Manager creates a container and the
Application Master is created in that same container; if the job is very small
and the resources in that container are sufficient, the tasks run right there and there
is no need to negotiate further resources with the Resource Manager.

SPARK ON YARN ARCHITECTURE:
+++++++++++++++++++++++++++
How does Spark execute our programs on the cluster?
Master/Slave architecture
--Each application has a driver, which is the master process.
--Each application has a bunch of executors, which are the slave processes.
Driver
++++++
--It is responsible for analysing the work, breaking it into many tasks, distributing the
tasks, scheduling the tasks and monitoring them.
Executor
+++++++
--Is responsible for executing the code locally on its JVM (the executor machine).

Who executes where?
++++++++++++++++++
--The executors are always launched on the cluster machines (worker nodes).
--However, for the driver we have the flexibility to launch it on the client
machine or on a cluster machine.
1.Client mode
+++++++++++++
1.When we launch spark-shell, a Spark session is created automatically (the driver runs on the client machine).
2.As soon as the Spark session is created, a request goes to the YARN Resource Manager.
3.The YARN Resource Manager creates a container on one of the node managers and
launches the Application Master for this Spark application.
4.This Application Master negotiates for resources from the YARN Resource
Manager in the form of containers.
5.The YARN RM creates containers on the node managers.
6.Now the App Master launches the executors in these containers.
7.Once the executors are launched, the driver and the executors communicate
directly, without further involvement of the Application Master.

2.Cluster mode
++++++++++++++
The only difference here is that the Spark driver runs inside the Application Master.

Whenever the driver runs on the client machine we call it client mode.
Whenever the driver runs on a cluster machine we call it cluster mode.

Client mode is not preferable, because if the client machine goes down or is shut down,
the driver stops.
Cluster mode is used for production environments.

Who controls the cluster and how Spark gets the driver and executors
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

--Cluster Manager
i.e., YARN/MESOS/Kubernetes/spark standalone

Spark Session
+++++++++++++
--It is like a data structure where the driver maintains all the information, including
executor locations and status.
--It is the entry point to a Spark application.

+++++

import org.apache.spark.SparkContext

object LogLevel extends App{

val sc = new SparkContext("local[*]", "LogLevel")   // needed outside the shell

val mylist = List("WARN: Tuesday 4 September 0405",
  "ERROR: Tuesday 4 September 0408","ERROR: Tuesday 4 September 0408","ERROR: Tuesday 4 September 0408",
  "ERROR: Tuesday 4 September 0408","ERROR: Tuesday 4 September 0408")

val originalLogsrdd = sc.parallelize(mylist)
val newPairRdd = originalLogsrdd.map(x => {
  val columns = x.split(":")
  val loglevel = columns(0)
  (loglevel, 1)
})
val resultant = newPairRdd.reduceByKey(_ + _)   // reduceByKey on the pair RDD, not on the original string RDD
resultant.collect
}

Wide and Narrow Transformations


+++++++++++++++++++++++++++++++

Narrow Transformation:
+++++++++++++++++++++
--No shuffling is involved
--map, flatMap, filter
Wide Transformation
+++++++++++++++++++
--Shuffling is involved
--groupByKey(), reduceByKey()

Stages
++++++
Stages are marked by shuffle boundaries.
Whenever we encounter a shuffle, a new stage gets created.
Whenever we call a wide transformation, a new stage gets created.
If we use 3 wide transformations, 4 stages get created.

Job
+++
--Whatever action you call is shown as a job.
--The number of jobs is equal to the number of actions.
--Whenever you use a wide transformation, a new stage is created within that job.
Task
++++
--A task corresponds to a single partition of a stage.
--100 partitions => 100 tasks

job(action calls)
|
Stage(wide transformation +1)
|
Task(number of partitions)
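
A small worked example of this counting (a sketch; the exact partition count depends on the input file):

val rdd = sc.textFile("abc.txt")              // say this gives 4 partitions
val counts = rdd.flatMap(_.split(" "))        // narrow
  .map((_, 1))                                // narrow
  .reduceByKey(_ + _)                         // wide transformation #1 => stage boundary
counts.collect()                              // 1 action => 1 job, 2 stages (1 wide + 1), each stage has one task per partition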

reduceByKey && reduce


++++++++++++++++++++++
--reduceByKey is a transformation.
--It works only on pair RDDs.
--After reduceByKey we can still have a huge amount of data and we may want to do further
operations on it in parallel; that is why it is a transformation.
--reduce is an action.
--Whenever you call an action on an RDD you get a local value back on the driver.
--reduce gives you a single output which is very small; that is why it is an action.
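
A quick contrast of the two (assuming sc):

val nums = sc.parallelize(List(1, 2, 3, 4))
val total = nums.reduce(_ + _)                 // action: returns a plain Int (10) to the driver

val pairs = sc.parallelize(List(("a", 1), ("a", 2), ("b", 5)))
val summed = pairs.reduceByKey(_ + _)          // transformation: returns another distributed pair RDD
summed.collect                                 // execution happens only when this action is called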

groupByKey && reduceByKey


++++++++++++++++++++++++++
--Both are Transformations
--Wide Transformations

reduceByKey
--We get the advantage of local aggregation:
1.more work in parallel
2.less shuffling

You can think of it the same as a combiner acting at the mapper end.

Node1: (x,1) -> (x,1)              (y,1) -> (y,1)
Node2: (x,1)(x,1) -> (x,2)         (y,1)(y,1) -> (y,2)
Node3: (x,1)(x,1)(x,1) -> (x,3)    (y,1)(y,1)(y,1) -> (y,3)

After the shuffle, only the locally aggregated results move:
Node4: (x,1)(x,2)(x,3) -> (x,6)
Node5: (y,1)(y,2)(y,3) -> (y,6)

groupByKey
++++++++++
--We do not get any local aggregation.
--All the key-value pairs are sent (shuffled) to another machine.
--So we have to shuffle more data and we get less parallelism.
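
Both of the following produce (x,3) and (y,2), but they shuffle very different amounts of data (a sketch, assuming sc):

val pairs = sc.parallelize(List("x", "y", "x", "x", "y")).map((_, 1))

val viaReduce = pairs.reduceByKey(_ + _)                // partial sums computed locally, only the small results are shuffled
val viaGroup  = pairs.groupByKey().mapValues(_.sum)     // every single (word,1) pair is shuffled, aggregation happens only afterwards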

+++++++++
Pair RDD -- an RDD whose elements are tuples of two elements (key, value)
+++

sc.defaultParallelism
rdd.getNumPartitions
sc.defaultMinPartitions

Difference between repartition and coalesce
++++++++++++++++++++++++++++++++++++++++++

Repartition
+++++++++++
--Repartition can increase or decrease the number of partitions in an RDD.
--Wide transformation.
Coalesce
++++++++
--It can only decrease the number of partitions; it cannot increase them.
--Transformation.

**If you want to decrease the number of partitions?

Coalesce is preferred, as it tries to minimise the shuffling.

Repartition has the intention of producing final partitions of exactly equal size, and for
this it has to go through a complete shuffle.
Coalesce has the intention of minimizing the shuffle by combining the existing
partitions on each machine to avoid a full shuffle.
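
A small sketch of the two calls (partition counts are illustrative):

val rdd = sc.textFile("abc.txt")       // say 8 partitions
rdd.getNumPartitions
val more  = rdd.repartition(16)        // full shuffle; can increase or decrease the partition count
val fewer = rdd.coalesce(4)            // merges existing partitions on the same machine; avoids a full shuffle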

Cache && Persist
++++++++++++++++
--Consider that you have an RDD which you generated by doing a bunch of transformations.
--Cache && persist both have the same purpose:
--if you want to reuse the results of an existing RDD, you can use them.
--They speed up applications that access the same RDD multiple times.
--An RDD that is not cached is re-evaluated again each time an action is invoked on it.
--The difference is that cache will only cache the RDD in memory.
--However, persist comes with various storage levels,
i.e., persist(StorageLevel.DISK_ONLY).
--persist() with no argument is equivalent to cache.

MEMORY_ONLY - non-serialized format, data is cached in memory.
DISK_ONLY - serialized format, data is cached on disk.
MEMORY_AND_DISK - data is cached in memory; if enough memory is not
available, evicted blocks are serialized to disk.
This mode is recommended when re-evaluation is expensive and memory
resources are scarce.
OFF_HEAP
++++++++
--Blocks are cached off-heap,
--i.e., outside the JVM.
--The problem with storing objects inside the JVM is that it uses garbage collection for
freeing up space,
and garbage collection is a time-consuming process.
--Off-heap means grabbing a piece of raw memory from the machine and storing the data there,
outside the executor's heap.
--OFF_HEAP is an unsafe option, as we have to deal with raw memory outside the JVM.

Serialization
+++++++++++++
Serialized means the data is in byte format - it takes less storage.
It increases the processing cost (bytes have to be deserialized back into objects) but reduces the memory footprint.

Non-serialized means the data is kept in object format:
--Processing is faster.
--It takes more storage.
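
A small sketch contrasting the two representations (assuming sc; the file path is hypothetical):

import org.apache.spark.storage.StorageLevel

val asObjects = sc.textFile("abc.txt").map(x => (x, 1))
asObjects.persist(StorageLevel.MEMORY_ONLY)        // object format: fastest to process, largest memory footprint

val asBytes = sc.textFile("abc.txt").map(x => (x, 1))
asBytes.persist(StorageLevel.MEMORY_ONLY_SER)      // serialized bytes: smaller footprint, extra CPU cost to deserialize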

Block eviction
+++++++++++++++++
Consider the situation where some of the block partitions are so large (skew) that
they quickly fill up the storage memory used for caching.
When the storage memory becomes full, an eviction policy is used to make
space for new blocks.

DISK_ONLY--serialized
MEMORY_ONLY-non-serialized
MEMORY_AND_DISK-non-serialized
OFF_HEAP--serialized
MEMORY_ONLY_SER-serialized
MEMORY_AND_DISK_SER-Serialized

MEMORY_ONLY_2 -- the number 2 indicates 2 replicas stored on 2 different worker nodes.
--Replication is useful for speeding up recovery in case one node fails.

val rdd1=sc.textFile("/user/cloudera/batch194/abc.txt")
val rdd2=rdd1.flatMap(x=>x.split(" "))
val Lowercase=rdd2.map(x=>x.toLowerCase())
val rdd3=Lowercase.map(x=>(x,1))
val rdd4=rdd3.reduceByKey((x,y)=>x+y).cache()
rdd4.toDebugString

rdd4.toDebugString
++++++++++++++++++
It is used to check the lineage graph, and we need to read it from bottom to top.

import org.apache.spark.storage.StorageLevel

val rdd1=sc.textFile("/user/cloudera/batch194/abc.txt")
val rdd2=rdd1.flatMap(x=>x.split(" "))
val Lowercase=rdd2.map(x=>x.toLowerCase())
val rdd3=Lowercase.map(x=>(x,1))
val rdd4=rdd3.reduceByKey((x,y)=>x+y).persist(StorageLevel.MEMORY_AND_DISK)
rdd4.toDebugString

**If we use cache() and we don't have enough memory, Spark will skip caching some partitions;
it won't throw any error.
**Don't cache or persist your base RDD.

Difference between lineage and DAG
+++++++++++++++++++++++++++++++++++

Lineage
--Lineage is nothing but the dependency graph:
--it shows the dependencies of the various RDDs; it is a logical plan.
--A lineage graph keeps track of the transformations to be executed after an action
has been called.
--The RDD lineage graph helps recompute any missing or damaged RDD caused by node
failures.

DAG
--When you call an action, the DAG is created:
--it covers jobs, stages and tasks.
How to create a jar from your Spark code and how to run it
+++++++++++++++++++++++++++++++++++++++++++++++++++
--Export the jar from the Scala worksheet/IDE.
--cd spark-2.4.4-bin-hadoop
cd bin
./spark-submit --class <classname> <path of the jar>

map && mapPartitions
++++++++++++++++++++

map:
Say the RDD has 10,000 rows and 10 partitions,
so each partition holds 1,000 records.

val a = sc.textFile("file")
val b = a.map(...)
Here the map function is called 10,000 times (once per record).
In the case of mapPartitions, the function processes one whole partition at a time,
so it is called only 10 times.
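
A hedged sketch of mapPartitions (the per-partition setup here is just a placeholder):

val a = sc.textFile("file")
// map: the lambda runs once per record (10,000 calls in the example above)
val b = a.map(x => x.trim)
// mapPartitions: the lambda runs once per partition (10 calls), so any expensive
// setup is paid only once per partition
val c = a.mapPartitions { iter =>
  val setup = "per-partition setup done here"   // hypothetical placeholder for e.g. a DB connection
  iter.map(x => x.trim)
}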

Structured APIs
++++++++++++++
Dataframes && Datasets

A dataframe is a distributed collection of data organized into named columns. It
is conceptually equivalent to a table in a relational database.

Consider an Employee table:
a dataframe is created for the table and, say, we have 4 partitions, each
partition holding 25K records.
Richer optimizations are possible.

The difference between an RDD and a dataframe is that a dataframe has a schema attached.

Were dataframes and datasets launched in Spark 2?

DF/DS are available in Spark 1 also.
In Spark 2 they got greater support, and both of them were merged into a single API - the Dataset API.

SC:
++++
Previously there was a separate context for each and every
thing: SparkContext, SQLContext, HiveContext.

SparkSession:
++++++++++++++
It is the unified entry point of a Spark application.
It provides a way to interact with the various Spark functionalities with a lesser number
of constructs.
Instead of sc, sqlContext and hiveContext, everything is now encapsulated in the Spark session.

import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().appName("first application").master("local[2]").getOrCreate()

--SparkSession is a singleton object.
--builder() returns a Builder object (this helps us configure the SparkSession).
--Treat your Spark session like the driver.

import org.apache.spark.sql.SparkSession
import org.apache.spark.SparkConf

val sparkConf = new SparkConf()
sparkConf.set("spark.app.name","my first application")
sparkConf.set("spark.master","local[2]")
val spark = SparkSession.builder().config(sparkConf).getOrCreate()

++++
+++
++
+
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().appName("first application").master("local[2]").getOrCreate()
val ordersDF = spark.read.option("header",true).option("inferSchema",true).csv("orders.csv")
ordersDF.show()
ordersDF.printSchema()

//reading the file, inferring the schema and aggregating

import org.apache.spark.sql.SparkSession
import org.apache.log4j.Level
import org.apache.log4j.Logger

Logger.getLogger("org").setLevel(Level.ERROR)
val spark = SparkSession.builder().appName("first application").master("local[2]").getOrCreate()
val ordersDF = spark.read.option("header",true).option("inferSchema",true).csv("orders.csv")
val groupOrderDf = ordersDF.repartition(4).where("order_customer_id > 10000").select("order_id","order_customer_id").groupBy("order_customer_id").count()
groupOrderDf.show()

Logger.getLogger(getClass.getName).info("my application completed successfully")

ordersDF.printSchema()

**Whenever we work with dataframes/datasets we are dealing with higher-level
programming constructs.
**When we work with raw RDDs, that is low-level code.
**The Spark compiler converts your higher-level (dataframe) code into low-level RDD code.
**The driver converts your high-level code into low-level code and then it
sends that low-level code to the executors.

Higher level code (converted by the driver before being sent to the executors)
+++++++++++++++++
val groupOrderDf = ordersDF.repartition(4).where("order_customer_id > 10000").select("order_id","order_customer_id").groupBy("order_customer_id").count()

Low level code (sent directly to the executors)
++++++++++++
groupOrderDf.foreach(x => println(x))

RDD
+++
--When we deal with raw RDDs, we deal with low-level code:
map, filter, flatMap, reduceByKey (we have to write how to do it).
--This low-level code is not developer friendly.
--RDDs lack some of the optimizations.

Dataframes
++++++++++
--Higher-level constructs, which make the developer's life easy.

--Challenges with DF:
1)DFs do not offer strongly typed code:
type errors won't be caught at compile time; they show up only at run time.
2)Developers felt that their flexibility became limited; we cannot call all the lower-level
constructs.
**A DF can be converted into an RDD:
df.rdd (more flexibility and type safety whenever we want to work at the RDD level).
**This conversion from DF to RDD is not seamless; we miss out on some of the
major optimizations.
--The Catalyst optimizer / Tungsten engine takes care of optimization for DFs.

Datasets
++++++++
--Compile time safety
--We get more flexibility in terms of using lower-level code.
--Conversion from DF to DS is seamless.

**A Dataframe is nothing but a Dataset[Row].
Row is a generic type which is bound at runtime,
so in the case of dataframes the data types are bound at runtime.
However, for a Dataset[Employee]
the type is bound at compile time.

How to convert a dataframe to a dataset?

If we replace the generic Row with a specific object (case class), it becomes a dataset.

**Dataframes are generally preferred over datasets:
--Converting Row objects to typed objects involves an overhead of casting to a
particular type.
--Serialization - converting data into binary form:
--for a DF, serialization is managed by the Tungsten binary format (encoders);
--for a DS, serialization is managed by Java serialization.

**Using datasets helps us cut down on developer mistakes, but it comes with
an extra cost in terms of type casting and expensive serialization.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.Dataset
import org.apache.spark.sql.Row
import java.sql.Timestamp

case class OrdersData(order_id:Int, order_date:Timestamp, order_customer_id:Int, order_status:String)

val spark = SparkSession.builder().appName("first application").master("local[2]").getOrCreate()

val ordersDF: Dataset[Row] = spark.read.option("header",true).option("inferSchema",true).csv("orders.csv")

import spark.implicits._
val ordersDs = ordersDF.as[OrdersData]

val groupOrderDf = ordersDF.repartition(4).where("order_customer_id > 10000").select("order_id","order_customer_id").groupBy("order_customer_id").count()
groupOrderDf.show()

++++

ordersDs.filter(x => x.order_id < 10)    // typed, checked at compile time
ordersDF.filter("order_id < 10")         // string expression, checked only at run time

++++
1.Read data from a data source and create a dataframe/dataset
--external data sources (MySQL, Redshift, MongoDB)
--internal data sources (HDFS, S3, Azure, Google storage)
We have the flexibility in Spark to create a dataframe directly from an external
database.
Spark is very good at processing, but it is not efficient at ingesting data.
Spark gives you a JDBC connector to ingest data from a MySQL DB directly.
2.Perform a bunch of transformations and actions
--transformations/actions using the higher-level constructs
3.Write data into a target (sink)
--internal/external

val orderDf = spark.read.format("csv").option("header",true).option("inferSchema",true).option("path","orders.csv").load()

val orderDf = spark.read.format("json").option("path","orders.json").option("mode","PERMISSIVE").load()

val orderDf = spark.read.format("parquet").option("path","orders.parquet").load()

--modes
1.PERMISSIVE (sets all the fields of a corrupted record to null and puts the raw record
into the _corrupt_record column) - default

2.DROPMALFORMED (ignores malformed records)

3.FAILFAST (whenever a malformed record is encountered, an exception is raised)

Three options to have the schema for a dataframe
++++++++++++++++++++++++++++++++++++++++++++++++
--Infer schema (inferSchema set to true)
--Implicit schema (reading Parquet, Avro etc.)
--Explicit schema

Explicit schema
++++++++++
1.Programmatically
2.DDL string

Programmatic approach
+++++++++++++++++++++
import org.apache.spark.sql.types.IntegerType
import org.apache.spark.sql.types.StringType
import org.apache.spark.sql.types.TimestampType
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.types.StructField

val orderSchema = StructType(List(
  StructField("order_id",IntegerType,false),
  StructField("order_date",TimestampType,true),
  StructField("customer_id",IntegerType,false),
  StructField("status",StringType,false)))
val ordersDf = spark.read.format("csv").option("header",true).schema(orderSchema).option("path","orders.csv").load()
spark.stop()
DDL String
++++++++++

val ordersSchemaDDL = "orderid Int, orderdate String, custid Int, orderstatus String"

val ordersDf = spark.read.format("csv").option("header",true).schema(ordersSchemaDDL).option("path","orders.csv").load()

**Converting a dataframe into a dataset by using a case class
+++++++++++++++++++++++++++++++++++++++++++++++++++++++

A dataframe is a Dataset[Row];
the dataset here is a Dataset[Orders].

case class Orders(order_id:Int, order_date:java.sql.Timestamp, customer_id:Int, status:String)

val ordersDf = spark.read.format("csv").option("header",true).schema(ordersSchemaDDL).option("path","orders.csv").load()

import spark.implicits._
val ordersDs = ordersDf.as[Orders]   // the dataframe's column names and types must match the case class fields

saveModes
+++++++++
1.append (puts new files into the existing folder)
2.overwrite (first deletes the existing folder, then creates a new one)
3.errorIfExists (gives an error if the output folder already exists)
4.ignore (if the folder exists, the write is ignored)

import org.apache.spark.sql.SaveMode

val ordersDf = spark.read.format("csv").option("header",true).schema(ordersSchemaDDL).option("path","orders.csv").load()

val orderRep = ordersDf.repartition(4)

orderRep.write.format("csv").mode(SaveMode.Overwrite).option("path","newfolder1").save()

**Normally we write a DF to our target; then we have a few options to control the file
layout.

Spark file layout
+++++++++++++++++
1)Simple repartition - controls the number and size of the files generated.
+++++++++++++++++++
**The number of output files is equal to the number of partitions in your dataframe.
**Repartition helps you increase the parallelism.
With normal repartition you won't be able to skip some of the partitions for
performance improvement:
partition pruning is not possible.

val ordersDf = spark.read.format("csv").option("header",true).schema(ordersSchemaDDL).option("path","orders.csv").load()

val orderRep = ordersDf.repartition(4)
orderRep.select("*").filter("order_id < 10")

Here it will create 4 files; to find order_id < 10 you have to search all the
files.

2)partitionBy
++++++++++++
--It is equivalent to partitioning in Hive.
--It provides partition pruning.

val ordersDf = spark.read.format("csv").option("header",true).schema(ordersSchemaDDL).option("path","orders.csv").load()

val orderRep = ordersDf.repartition(4)

--partitionBy("order_status")

orderRep.write.format("csv").partitionBy("order_status").mode(SaveMode.Overwrite).option("path","newfolder1").save()

Here one folder is created per value:
order_status=CLOSED
order_status=OPENED
order_status=COMPLETE
order_status=ON_HOLD
order_status=PENDING
order_status=PROCESSING

3)bucketBy()
++++++++++++

import org.apache.spark.sql.SparkSession
import org.apache.spark.SparkConf
import org.apache.spark.sql.SaveMode

val sparkConf = new SparkConf()
sparkConf.set("spark.app.name","my first application")
sparkConf.set("spark.master","local[2]")
val spark = SparkSession.builder().config(sparkConf).enableHiveSupport().getOrCreate()
val ordersDf = spark.read.format("csv").option("header",true).schema(ordersSchemaDDL).option("path","orders.csv").load()

spark.sql("create database if not exists retail")

ordersDf.write.format("csv").mode(SaveMode.Overwrite).bucketBy(4,"order_customer_id").sortBy("order_customer_id").saveAsTable("retail.orders")
spark.catalog.listTables("retail").show()

4)sortBy
++++++++++
--Used together with bucketBy (as above) to keep each bucket sorted.

5)maxRecordsPerFile
++++++++++++++++++

val ordersDf = spark.read.format("csv").option("header",true).schema(ordersSchemaDDL).option("path","orders.csv").load()

val orderRep = ordersDf.repartition(4)

orderRep.write.format("csv").partitionBy("order_status").option("maxRecordsPerFile",2000).mode(SaveMode.Overwrite).option("path","newfolder1").save()

sparkSQL
+++++++++
createOrReplaceTempView
+++++++++++++++++++++++

val ordersDf = spark.read.format("csv").option("header",true).schema(ordersSchemaDDL).option("path","orders.csv").load()
ordersDf.createOrReplaceTempView("Orders")
val resultDf = spark.sql("select order_customer_id, count(*) as total_orders from Orders where " +
  "order_status = 'CLOSED' group by order_customer_id order by total_orders desc")
resultDf.show()

++++
Storing data in the form of a table
++++++++++++++++++++++++++++++++++

--Sometimes we have a requirement to save the data in a persistent manner in the
form of a table.
--When data is stored in the form of a table, we can connect Tableau, Power BI
etc. to it for reporting purposes.

A table has 2 parts:
1)Data - stored in the Spark warehouse directory (spark.sql.warehouse.dir).
2)Metadata - stored in a catalog metastore. By default Spark keeps the catalog in memory,
so on terminating the application it is gone.
We can use the Hive metastore to handle Spark metadata persistently.
import org.apache.spark.sql.SparkSession
import org.apache.spark.SparkConf
import org.apache.spark.sql.SaveMode

val sparkConf = new SparkConf()
sparkConf.set("spark.app.name","my first application")
sparkConf.set("spark.master","local[2]")
val spark = SparkSession.builder().config(sparkConf).enableHiveSupport().getOrCreate()
val ordersDf = spark.read.format("csv").option("header",true).schema(ordersSchemaDDL).option("path","orders.csv").load()

spark.sql("create database if not exists retail")

ordersDf.write.format("csv").mode(SaveMode.Overwrite).saveAsTable("retail.orders")
spark.catalog.listTables("retail").show()
+++++++++++

1.DataFrame Reader - take data from the source
2.Transformations - process your data
3.DataFrame Writer - write your data to the target location

Transformations
++++++++++++++
1.Low-level transformations
map
filter
groupByKey

We can perform low-level transformations using RDDs.
Some of these are even possible with dataframes and datasets.

2.High-level transformations
select
where
groupBy
**These are supported only by dataframes and datasets.

Sample line: 1 2021-07-21 11599,COMPLETE

**Since this is an unstructured/semi-structured file, I will load it as a raw RDD;
each line of the RDD is of String type.
If we have a schema/structure associated, we can convert our RDD to a dataset.

import org.apache.spark.sql.SparkSession
import org.apache.spark.SparkConf

val myregex = """^(\S+) (\S+)\t(\S+),(\S+)""".r   // .r turns the string into a Regex usable in pattern matching

case class Order(order_id:Int, customer_id:Int, order_status:String)

def parser(line:String) = {
  line match {
    case myregex(order_id, date, customer_id, order_status) =>
      Order(order_id.toInt, customer_id.toInt, order_status)
  }
}
val sparkConf = new SparkConf()
sparkConf.set("spark.app.name","my first application")
sparkConf.set("spark.master","local[2]")
val spark = SparkSession.builder().config(sparkConf).enableHiveSupport().getOrCreate()
val lines = spark.sparkContext.textFile("orders_new.csv")

import spark.implicits._
val orderDS = lines.map(parser).toDS().cache()
orderDS.select("order_id").show()
orderDS.groupBy("order_status").count().show()
+++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++
How to refer to a column in a DF/DS
++++++++++++++++++++++++++++++++
1.Column string

orderDf.select("order_id","order_status")

2.Column object

orderDf.select(col("order_id"), col("order_status")).show

Scala-specific shortcuts ($ and ')
+++++++++++++
orderDf.select(col("order_id"), column("order_customer_id"), $"order_status", 'order_date).show
(col and column come from org.apache.spark.sql.functions._; $ and ' need import spark.implicits._)

Column expressions
+++++++++++++++++++
**We cannot mix column strings with column expressions, nor can we mix column objects
with column expressions.
df.select("order_id","concat(order_status,'_Status')")   // does NOT work: the second argument is treated as a column name

Column string - df.select("order_id")
Column object - df.select(col("order_id"))
Column expression - df.select(concat(x,y))

There is a way to convert a column expression to a column object:

expr
++++

df.select(column("order_id"), expr("concat(order_status,'_Status')")).show()
df.selectExpr("order_id", "concat(order_status,'_Status')").show()   // selectExpr takes string expressions only

++++++
Column object expression UDF
++++++++++++++++++++++++++++++
**Basically we register the function with the driver.
The driver serializes the function and sends it to each executor.

import org.apache.spark.sql.SparkSession
import org.apache.spark.SparkConf
import org.apache.spark.sql.functions._

object DataFrameExample extends App{

def ageCheck(age:Int) = {
  if (age > 18) "Y" else "N"
}

val sparkConf = new SparkConf()
sparkConf.set("spark.app.name","my first application")
sparkConf.set("spark.master","local[2]")
val spark = SparkSession.builder().config(sparkConf).enableHiveSupport().getOrCreate()
val ordersDf = spark.read.format("csv").option("inferSchema",true).option("path","dataset1.csv").load()
import spark.implicits._
val df1 = ordersDf.toDF("name","age","city")

val parseAgeFunction = udf(ageCheck(_:Int):String)
val df2 = df1.withColumn("adult", parseAgeFunction(column("age")))

SQL/String expression UDF
+++++++++++++++++++++++++

spark.udf.register("parseAgeFunction", ageCheck _)
spark.udf.register("parseAgeFunction", (x:Int) => {if (x > 18) "Y" else "N"})

val df2 = df1.withColumn("adult", expr("parseAgeFunction(age)"))

spark.catalog.listFunctions().filter(x => x.name == "parseAgeFunction")

+++

df1.createOrReplaceTempView("peopletable")
spark.sql("select name, age, city, parseAgeFunction(age) as adult from peopletable")

+++++

1.I want to create a Scala list.
2.From the Scala list I want to create a dataframe with columns
order_id, orderdate, customerid, status.
3.I want to convert the orderdate field to an epoch timestamp (unix timestamp).
4.Create a new column "newid" and make sure it holds a unique id.
5.Drop duplicates on (orderdate, customerid).
6.Drop order_id.

import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.DateType

val myList = List((1,"2013-07-25",11599,"CLOSED"),
  (2,"2014-07-25",115,"OPENED"),
  (3,"2015-07-25",9,"CLOSED"),
  (4,"2016-07-25",39,"OPENED"),
  (5,"2013-07-25",129,"CLOSED"))

val df = spark.createDataFrame(myList).toDF("order_id","order_date","customer_id","status")
val df1 = df.withColumn("order_date", unix_timestamp(col("order_date").cast(DateType)))
val df2 = df1.withColumn("new_id", monotonically_increasing_id)
val df3 = df2.dropDuplicates("order_date","customer_id").drop("order_id").sort("order_date")
+++++++++++++++++++++
Aggregate transformations
1.Simple aggregations

After doing the aggregation we get a single row,
e.g., total number of records, sum of all quantities.

2.Grouping aggregations

In this we do a groupBy;
the output can be more than one record.

3.Window aggregations

We deal with a fixed window of rows.

1.Simple aggregations
+++++++++++++++++++++
import org.apache.spark.sql.functions._

val invoiceDF = spark.read.format("csv").option("inferSchema",true).option("header",true).option("path","orders_data.csv").load()

invoiceDF.select(
  count("*").as("RowCount"),
  sum("Quantity").as("TotalQuantity"),
  avg("UnitPrice").as("AvgPrice"),
  countDistinct("InvoiceNo").as("CountDistinct")
).show()

invoiceDF.selectExpr(
  "count(StockCode) as RowCount",
  "sum(Quantity) as TotalQuantity",
  "avg(UnitPrice) as AvgPrice",
  "count(distinct InvoiceNo) as CountDistinct"
).show()

invoiceDF.createOrReplaceTempView("Sales")
spark.sql("""select count(StockCode) as RowCount,
  sum(Quantity) as TotalQuantity,
  avg(UnitPrice) as AvgPrice,
  count(distinct InvoiceNo) as CountDistinct
  from Sales""").show()

2.Grouping aggregations
+++++++++++++++++++++++++

--Group the data based on country and invoice number.
--I want the total quantity for each group and the sum of the invoice value.

val df1 = spark.read.format("csv").option("inferSchema",true).option("header",true).option("path","orders_data.csv").load()

df1.groupBy("country","Invoice_no").agg(sum("Quantity").as("total_quantity"), sum(expr("Quantity*UnitPrice")).as("Invoice_value"))

df1.groupBy("country","Invoice_no").agg(expr("sum(Quantity) as total_quantity"), expr("sum(Quantity*UnitPrice) as Invoice_value"))

df1.createOrReplaceTempView("Sales")
spark.sql("select country, InvoiceNo, sum(Quantity) as total_quantity, sum(Quantity*UnitPrice) as Invoice_value from Sales group by country, InvoiceNo")

3.Window aggregation
++++++++++++++++++++

Partition column - Country
Ordering column - WeekNum
Window size - from the first row up to the current row

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val df1 = spark.read.format("csv").option("inferSchema",true).option("header",true).option("path","window_data.csv").load()
val myWindow = Window.partitionBy("Country").orderBy("WeekNum").rowsBetween(Window.unboundedPreceding, Window.currentRow)
df1.withColumn("Running_total", sum("Invoicevalue").over(myWindow))

++++
+++
++
+

JOINS
++++++

Simple join (shuffle sort-merge join)
+++++++++++++++++++++++++++++++++++++
val orderdf1 = spark.read.format("json").option("path","orders").load()

val customersdf2 = spark.read.format("json").option("path","customers").load()

val joinedDf = orderdf1.join(customersdf2, orderdf1.col("order_customer_id") === customersdf2.col("customer_id"), "inner").sort("order_customer_id")

Kinds of joins possible
+++++++++++++++++++++++++++
1.Inner join
Internals
++++++++++

executor1 - node 1
++++++++++

--Each executor writes its map-side output into an exchange.
--An exchange is nothing but a buffer in the executor.
--From this exchange the Spark framework can read the data and shuffle it to any other
exchange.
--All the records with the same key go to the same reduce exchange.
--**If one of the dataframes is small (below the broadcast threshold), Spark will perform a broadcast join instead of this shuffle.

A shuffle happens, followed by a sort-merge join.

Orders

15192,2013-10-29 00:00:00.0,2,PENDING_PAYMENT
33865,2014-02-18 00:00:00.0,2,COMPLETE

(2,{15192,2013-10-29 00:00:00.0,PENDING_PAYMENT})
(2,{33865,2014-02-18 00:00:00.0,COMPLETE})

Customers

3,Ann,Smith,**********,********,3422 Blue Pioneer Bend,Cagus,PR,00725


executor2-node 2
+++++++++++++++++
Customers
2,Mary,Barett,******,******,9526 Noble Embers Ridge,Littleton,CO,80126

Orders
35158,2014-02-26 00:00:00.0,3,COMPLETE
15192,2013-10-29 00:00:00.0,2,PENDING_PAYMENT

executor3-node 3
+++++++++++++++++
2.Right outer join
3.Left outer join
4.Full outer join

customer_id is ambiguous
++++++++++++++++++++++++

This happens when we try to select a column name which comes from 2 different
dataframes.

val orderdf = spark.read.format("csv").option("header",true).option("inferSchema",true).option("path","orders").load()
val customersdf = spark.read.format("csv").option("header",true).option("inferSchema",true).option("path","customer").load()
val joinedDf = orderdf.join(customersdf, orderdf.col("customer_id") === customersdf.col("customer_id"), "outer").select("order_id","customer_id","customer_fname")   // fails: customer_id is ambiguous

How to solve this?

2 ways to solve the problem:
1)Before the join, rename the ambiguous column in one of the
dataframes (withColumnRenamed("old_column_name","new_column_name")).
2)Once the join is done, drop one of the two columns.

1)
val orderdf = spark.read.format("csv").option("header",true).option("inferSchema",true).option("path","orders").load()

val ordersDfNew = orderdf.withColumnRenamed("customer_id","cust_id")
val customersdf = spark.read.format("csv").option("header",true).option("inferSchema",true).option("path","customer").load()
val joinedDf = ordersDfNew.join(customersdf, ordersDfNew.col("cust_id") === customersdf.col("customer_id"), "outer")

2)
val orderdf = spark.read.format("csv").option("header",true).option("inferSchema",true).option("path","orders").load()
val customersdf = spark.read.format("csv").option("header",true).option("inferSchema",true).option("path","customer").load()
val joinedDf = orderdf.join(customersdf, orderdf.col("customer_id") === customersdf.col("customer_id"), "outer").drop(orderdf.col("customer_id")).sort("order_id")

How to deal with nulls
++++++++++++++++++

Whenever order_id is null, show -1 instead.

Use the SQL function coalesce:

val orderdf = spark.read.format("csv").option("header",true).option("inferSchema",true).option("path","orders").load()
val customersdf = spark.read.format("csv").option("header",true).option("inferSchema",true).option("path","customer").load()
val joinedDf = orderdf.join(customersdf, orderdf.col("customer_id") === customersdf.col("customer_id"), "outer").drop(orderdf.col("customer_id"))
  .sort("order_id").withColumn("order_id", expr("coalesce(order_id,-1)"))

Disabling auto-broadcast: this makes Spark do a normal (shuffle) join instead of a broadcast join
+++++++++++++++++++++++++++++++++++++++++++++++++++++

spark.sql("SET spark.sql.autoBroadcastJoinThreshold = -1")

Broadcast join
++++++++++++++
--This does not require a shuffle.
**Whenever we join 2 large dataframes, Spark invokes a simple (shuffle sort-merge) join and a
shuffle is required.
**When you have one large dataframe and the other dataframe is small, you can go for a
broadcast join.

val orderdf = spark.read.format("csv").option("header",true).option("inferSchema",true).option("path","orders").load()
val customersdf = spark.read.format("csv").option("header",true).option("inferSchema",true).option("path","customer").load()
val joinedDf = orderdf.join(broadcast(customersdf), orderdf.col("customer_id") === customersdf.col("customer_id"), "outer").select("order_id","customer_id","customer_fname")

+++++++++

spark.sql("select level, date_format(datetime,'MMMM') as month, count(1) as total from my_new_logging_level group by level, month")

date_format(datetime,'MMMM') --> January
WARN  JANUARY  123
ERROR FEB      12222
WARN  FEB      12

Pivot table
++++++++++++

        JANUARY  FEB  MAR
WARN    11       23   12
ERROR   33       33   22

spark.sql("select level, date_format(datetime,'MMMM') as month, date_format(datetime,'M') as month_num from my_new_logging_level").groupBy("level").pivot("month_num").count()

val colun = List("JAN","FEB","MAR","APR","MAY","JUN","JULY","AUG","SEP","OCT","NOV","DEC")

spark.sql("select level, date_format(datetime,'MMMM') as month from my_new_logging_level").groupBy("level").pivot("month", colun).count()

Spark Optimization
++++++++++++++++++

There are basically 2 main areas we should focus on:

1.Cluster configuration level - resource-level optimization.
2.Application code level - how we write the code:
partitioning, bucketing, cache && persist, avoiding/minimizing shuffling of data, join
optimization, using optimized file formats, using reduceByKey instead of groupByKey.

Resources -- hard disk, memory (RAM), CPU cores (compute)

**Our intention is to make sure we ask for the right amount of resources.

Example: a 10-node cluster (10 worker nodes),
each node has 16 CPU cores,
each node has 64 GB RAM.

Executor (a container of resources)
**One node can hold more than one executor.
In a single worker node, we can have multiple containers.

Container = CPU cores + memory (RAM)

16 cores, 64GB RAM -- (1 core is allocated for background processes)

executor = container = JVM

2 strategies:
1.Thin executors
++++++++++++++++

The intention is to create more executors, with each executor holding the minimum
possible resources,
e.g., 16 executors, each holding 1 core and 4 GB RAM.

Disadvantages:
1)Multithreading within an executor is not possible (only 1 core).
2)Broadcast variables have to be copied to every one of the many executors, wasting memory.
2.Fat executors
+++++++++++++++

--The intention is to give maximum resources to each executor:
with 16 cores and 64GB RAM per node, you can create one executor which holds all 16 CPU cores and 64GB RAM.

Drawbacks
+++++++++++
--It is observed that if an executor holds more than 5 CPU cores, the HDFS
throughput suffers (too much multithreading is not good).
--If an executor holds a very huge amount of memory, then garbage collection takes a lot
of time.
--Garbage collection means removing unused objects from memory.

16 cores, 64GB RAM per node:
1 core is given for other background activities,
1 GB RAM is given for the operating system.
Now each node is left with 15 cores, 63GB RAM.

=> We want multithreading within an executor (more than 1 CPU core per executor).
=> We do not want the HDFS throughput to suffer
(it suffers when we use more than 5 cores per executor).
So 5 is the right choice for the number of CPU cores in each executor.
15 cores, 63GB RAM on each machine
=> 3 executors running on each worker node,
each executor having 5 cores and 21GB of RAM.
Out of the 21 GB RAM, some of it goes as overhead (off-heap memory):
max(384MB, 7% of the 21GB executor memory)
= ~1.5GB (overhead/off-heap memory) -- this is not part of the container heap.
21GB - 1.5GB = roughly 19GB of executor memory.

Each executor will have 19GB RAM and 5 CPU cores.

We have a 10-node cluster,

so 10*3 = 30 executors across the cluster,

each executor holding 5 CPU cores and 19GB RAM.

1 executor out of these 30 will be given to the YARN Application Master,

leaving 29 executors for the application.

**Number of tasks that run in parallel in an executor = number of CPU cores it has.
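
The sizing above can be expressed as configuration (a sketch; exact values should match your own cluster):

import org.apache.spark.SparkConf

val sparkConf = new SparkConf()
sparkConf.set("spark.executor.instances", "29")   // 30 executors minus 1 for the Application Master
sparkConf.set("spark.executor.cores", "5")        // 5 cores per executor
sparkConf.set("spark.executor.memory", "19g")     // heap per executor; overhead is added on top by YARN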

++++++++++++++++++++
+++++++++++++++++++
+++++++++++++++++
++++++++++++++++
+++++++++++++++
++++++++++++++

Ambari: gives a complete view of the cluster resources.

Hosts: we can see the number of machines.


Edge Nodes=3
gw01.itversity.com
gw02.itversity.com
gw03.itversity.com
Name nodes:2
nn01.itversity.com
nn02.itversity.com
Worker Nodes:5
wn01.itversity.com
wn02.itversity.com
wn03.itversity.com
wn04.itversity.com
wn05.itversity.com

**Each worker node holds 8 CPU cores, 32 GB RAM.

5 worker nodes,
each worker node has 32 GB RAM, 8 CPU cores.

YARN:
+++++
32 GB RAM, 8 CPU cores (physical cores) per worker node.
24GB RAM is allocated for YARN containers, 8 GB RAM for other processes.
Container memory: min 1GB, max 4GB (we cannot create an executor of more than 4 GB or
less than 1 GB),
i.e., container memory: >=1GB and <=4GB
cores: min 1, max 2
8 physical cores
8*2 = 16 vCPU cores
Out of the vCPU cores, 12 can be used for YARN containers; the remaining 4 are used for
other purposes.

So each worker node has 24GB RAM and 12 vCPU cores available,
with container memory: >=1GB and <=4GB
and cores: min 1, max 2.

Thin executors
++++++++++++++
Each worker node can run
12 executors, each with 1 CPU core and 24/12 = 2GB RAM.
12*5 = 60 executors across the cluster.
Fat executors
++++++++++++++
24GB RAM, 12 vCPU cores on each worker node.
Container memory: >=1GB and <=4GB
cores: min 1, max 2

12/2 = 6 executors per node:
6 executors, each with 2 cores and 4 GB RAM.

6*5 = 30 executors across the cluster.

In our resource pool
++++++++++++++++++++
24GB RAM * 5 = 120 GB RAM, 12 * 5 vCPU cores = 60 CPU cores.

++++
+++
++
+
bigLogNew.txt -- 1.46GB

2 ways to allocate resources:
--allocating resources manually
--allocating resources dynamically

Dynamic resource allocation
+++++++++++++++++++++++++++
--Used for long-running jobs, allocating resources dynamically as the load changes.
spark.dynamicAllocation.enabled=true
spark.dynamicAllocation.executorIdleTimeout=120s
spark.dynamicAllocation.initialExecutors=2
spark.dynamicAllocation.maxExecutors=10
spark.dynamicAllocation.minExecutors=0

executor = compute + RAM => e.g., 1 core + 1GB

Storage memory
+++++++++++++++

Say we allocate 1GB (1024 MB) of memory to the container:
300 MB out of that is reserved memory,
so 1024 - 300 = 724 MB is usable.
This memory is again divided into two fractions:
1.storage && execution memory (60%):
724 * 0.6 = ~434 MB
2.user memory, an additional buffer for other purposes (40%):
724 * 0.4 = ~290 MB,
used for data structures, metadata and safeguarding against OOM.

spark2-shell --master yarn

(ERROR,3)

Memory usage in Spark is divided into two broad categories:

1.Execution memory
--Memory required for computation like shuffles, joins, sorts, aggregations.

2.Storage memory
--Used for cache, broadcast variables, accumulators.

In Spark, execution memory and storage memory share a common (unified) region.

When no execution is happening, storage can acquire all the available
memory, and vice versa.
Execution may evict storage if necessary.
Say there is a 2 GB common unified region,
and 2 GB is currently used for storage: if some execution happens, it can evict some of the storage.
But this eviction can only happen until total storage memory usage falls to a certain
threshold,
i.e., if some execution/computation comes in, it cannot evict the entire 2GB; there is a certain threshold beyond
which execution cannot evict storage.

i.e., execution can evict storage only up to a certain threshold,
but storage can never evict execution; that is just how it is implemented.

This design ensures several desirable properties:

1.Applications which do not use caching can use the entire space for execution.

2.Applications that do use caching can reserve a minimum storage space,

and this makes their cached data blocks immune from being evicted.
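
The two boundaries described above correspond to real configuration keys; a hedged sketch of setting them (the values shown are the usual defaults):

import org.apache.spark.SparkConf

val sparkConf = new SparkConf()
sparkConf.set("spark.memory.fraction", "0.6")          // share of usable heap given to the unified (execution + storage) region
sparkConf.set("spark.memory.storageFraction", "0.5")   // part of the unified region that execution cannot evict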

--executor-memory 4G

If you request a container/executor of 4GB size, you are actually requesting
4GB (heap memory) + max(384MB, 10% of 4 GB) of off-heap memory (overhead)
= 4096MB (Java heap) + ~410MB (off-heap).

Out of the 4GB (total heap memory):
300MB is again reserved memory,
so 4GB - 300MB = ~3.7GB;
60% of 3.7 GB is for the unified region (storage && execution memory)
= ~2.3GB;
40% of 3.7GB is for user memory (data structures, metadata and
safeguarding against OOM)
= ~1.4GB.

Reserved memory (the 300MB): kept aside for Spark's own internal objects.
Overhead (off-heap): memory used outside the JVM heap.

Broadcast JOINS
+++++++++++++++
--Used when we have 1 large table and 1 small table and we want to join them.

Window ranking functions (rank, dense_rank, row_number)
++++++++++++++++++++++++++++++++++++++++++++++++++++++++
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq(("a", 10), ("a", 10), ("a", 20)).toDF("col1", "col2")

val windowSpec = Window.partitionBy("col1").orderBy("col2")

df
  .withColumn("rank", rank().over(windowSpec))
  .withColumn("dense_rank", dense_rank().over(windowSpec))
  .withColumn("row_number", row_number().over(windowSpec)).show

+----+----+----+----------+----------+
|col1|col2|rank|dense_rank|row_number|
+----+----+----+----------+----------+
|   a|  10|   1|         1|         1|
|   a|  10|   1|         1|         2|
|   a|  20|   3|         2|         3|
+----+----+----+----------+----------+
