Spark by Sumit
++++++++++++
--It is a general-purpose, in-memory compute engine.
Compute engine
+++++++++++++++
Hadoop provides 3 things:
1)HDFS - Storage
2)MapReduce - Computation
3)YARN - Resource Management
Inmemory
++++++++
HDFS
MapReduce writes every intermediate result back to HDFS and reads it again for the next job.
To overcome the above problem, Spark reads the data from HDFS once, performs all the
computations in memory, and only the final output is stored back in HDFS.
Spark: HDFS read -> v1 v2 v3 v4 v5 v6 (intermediate results kept in memory) -> HDFS write
--Only two disk I/Os are required for the entire computation.
--It is better than MapReduce and takes less time to process.
--Spark is 10 to 100 times faster than MapReduce.
General purpose
+++++++++++++++
Learn just one style of writing code, and all the things like
cleaning, querying, machine learning and data ingestion can happen with it.
Map Reduce
+++++++++
--High latency (it involves more disk read and write operations than Spark).
--Every MapReduce job takes two disk seeks (read the input, write the output).
Spark
++++++
--Low latency (it involves fewer disk read and write operations than MapReduce).
--In case of Spark, the entire chain of operations takes 2 disk seeks (initial read, final write).
RDD
+++
--The basic unit which holds data in Spark is called an RDD.
--Resilient Distributed Dataset.
--An RDD is nothing but an in-memory distributed collection.
--RDDs are distributed across the memory of the cluster.
--Immutable
Resilient
+++++++++
--If we lose an RDD we can recover it back.
--RDDs provide fault tolerance through the lineage graph.
--A lineage graph keeps track of the transformations to be executed after an action
has been called.
--The RDD lineage graph helps recompute any missing or damaged RDD caused by node
failures.
--In RDDs we get resiliency by using the lineage graph.
Distributed
+++++++++++
rdd1 = load data from a text file
rdd1 will have 4 partitions (data brought in from HDFS) which are distributed across the memory of the executors.
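A minimal spark-shell sketch of the idea above (the file name and partition count are assumptions):
val rdd1 = sc.textFile("sample.txt", 4)   // hypothetical file, ask for at least 4 partitions
rdd1.getNumPartitions                     // the partitions are spread across the executors' memory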
2 kinds of operations
+++++++++++++++++++++
1.Transformations
2.Actions
Transformations
+++++++++++++++
--These are lazy,
i.e., execution will not start until an action is triggered.
Data is not loaded until it is necessary.
Spark maintains a record of the operations through a DAG.
This allows better optimization by the Spark engine.
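A small sketch of this laziness (the file name is an assumption); nothing is read or computed until the action on the last line:
val lines = sc.textFile("sample.txt")                  // transformation - only recorded in the DAG
val errors = lines.filter(x => x.contains("ERROR"))    // transformation - still nothing executed
errors.count()                                         // action - now the file is read and the filter runs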
RDD examples:
+++++++++++++
import org.apache.spark.SparkContext
import org.apache.log4j.Level
import org.apache.log4j.Logger
Sample customer-orders.csv (customer_id, order_id, amount_spent):
44,8602,37.0
12,8888,12.5
44,8603,44.5
val input=sc.textFile("customer-orders.csv")
val a =input.map(x=>(x.split(",")(0),x.split(",")(2).toFloat))
val b=a.reduceByKey((x,y)=>x+y)
val c=b.sortBy(x=>x._2)
val d=c.collect
d.foreach(println)
val input=sc.textFile("Movie_data.csv")
val a =input.map(x=>x.split("\t")(2))
val c=a.map(x=>(x,1))
val d=c.reduceByKey((x,y)=>x+y)
val e=d.collect
e.foreach(println)
val input=sc.textFile("Movie_data.csv")
val a =input.map(x=>x.split("\t")(2))
val b=a.countByValue
b.foreach(println)
countByValue is an action.
**It counts how many times each value occurs and returns the result as a Map to the driver.
--No further RDD operation is needed after it.
++++
Example: average number of friends by age, e.g. for age 33:
33,100
33,200
33,300
(33,600/3)
(33,200)
def parseLine(line:String)={
  val fields = line.split(",")
  val age = fields(2).toInt
  val numfriends = fields(3).toInt
  (age, numfriends)
}
val input=sc.textFile("friends")
val b=input.map(parseLine)
val c=b.map(x=>(x._1,(x._2,1)))          // or equivalently: val c=b.mapValues(x=>(x,1))
val d=c.reduceByKey((x,y)=>(x._1+y._1,x._2+y._2))
val e=d.map(x=>(x._1,x._2._1/x._2._2)).sortBy(x=>x._2)   // or: d.mapValues(x=>x._1/x._2).sortBy(x=>x._2)
e.collect.foreach(println)
mapValues
+++++++++
It is a transformation that operates only on the value part of each (key, value) pair; the keys remain unchanged.
++++
val a =sc.textFile("bigdata-campaign-data.csv")
val b=a.map(x=>(x.split(",")(10).toFloat,x.split(",")(0)))
val c=b.flatMapValues(x=>x.split(" "))
val d=c.map(x=>(x._2,x._1))
val e=d.reduceByKey((x,y)=>x+y)
val f=e.sortBy(x=>x._2,false)
f.take(20).foreach(println)
flatMapValues
+++++++++++++
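The notes leave this section blank; a small sketch of flatMapValues on made-up pairs: it applies a function that returns a sequence to each value and flattens the result, repeating the key (this is what the campaign example above uses to pair every word of the title with its amount).
val pairs = sc.parallelize(List((100.0f, "big data spark"), (50.0f, "hadoop")))
pairs.flatMapValues(x => x.split(" ")).collect()
// Array((100.0,big), (100.0,data), (100.0,spark), (50.0,hadoop))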
Accumulator
+++++++++++
--There is a shared copy kept on the driver machine.
--Each executor will update it.
--However, no executor can read the value of the accumulator; they can only add to
it.
--This is the same as counters in MapReduce.
val myrdd=sc.textFile("SAMPLEfie.txt")
val myaccum=sc.longAccumulator("blank line accumulator")
myrdd.foreach(x=>if(x.trim.isEmpty)myaccum.add(1))
myaccum.value
Broadcast:
++++++++
--A read-only shared copy of the data is kept on each executor machine.
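A minimal sketch of a broadcast variable (the lookup map is made up): the driver ships one read-only copy to each executor instead of sending the data with every task.
val statusLookup = Map("C" -> "CLOSED", "P" -> "PENDING")     // small lookup data on the driver
val bcLookup = sc.broadcast(statusLookup)                     // one copy shipped to each executor
val codes = sc.parallelize(List("C", "P", "C"))
codes.map(code => bcLookup.value.getOrElse(code, "UNKNOWN")).collect()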
YARN:
+++++
--Yet another resource negotiator
Job Tracker
++++++++++
--Used to do a lot of work in Hadoop v1.
--Keeps track of:
--thousands of task trackers
--hundreds of jobs
--tens of thousands of map reduce tasks
Scheduling
++++++++++
--Deciding which job to execute first based on the scheduling algorithm and the priority of
jobs, getting to know the available resources, and providing the resources to the job.
Monitoring
++++++++++
--Tracking the progress of the job; if a task fails, rerun the task; if a task is slow,
then based on speculative execution a copy is started on another machine.
Task Tracker
+++++++++++
--The task tracker tracks the tasks on each data node and informs the job
tracker.
Limitation:
++++++++++
1.Scalability
--It was observed that when the cluster size goes beyond ~4000 data nodes,
the job tracker used to become the bottleneck.
2.Resource utilization--In MR1 there used to be a fixed number of map and reduce
slots, e.g.
100 map slots and 50 reduce slots.
If you want to execute a MapReduce job which requires 150 mappers,
only 100 mappers can run at a time
and the remaining 50 mappers run later, even if reduce slots are sitting idle.
3.Only MapReduce jobs were supported.
YARN:
+++++
--Resource Manager - Master
--Node Manager - Slave
--Application Master
The Application Master asks the Resource Manager for containers, i.e.,
priority 1
location host1
how many containers
size of each container (resources)
The Resource Manager allocates the resources in the form of containers and sends
the container id and the host name to the Application Master,
which finally launches the tasks in the containers.
2. Cluster mode
+++++++++++++++
The only difference here is that the Spark driver runs inside the Application Master.
Client mode is not preferable, because if the client machine goes down or is shut down then
the driver stops.
Cluster mode is used for production environments.
Who controls the cluster and how Spark gets the driver and executors
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
--Cluster Manager
i.e., YARN/MESOS/Kubernetes/spark standalone
Spark Session
+++++++++++++
--It is like a data structure where the driver maintains all the information, including
executor locations and statuses.
--This is the entry point to a Spark application.
+++++
val originalLogsrdd=sc.parallelize(mylist)   // mylist: a list of log lines like "WARN: some message"
val newPairRdd=originalLogsrdd.map(x=>{
  val columns=x.split(":")
  val loglevel=columns(0)
  (loglevel,1)
})
val resultant=newPairRdd.reduceByKey(_+_)
resultant.collect
Narrow Transformation:
+++++++++++++++++++++
--No shuffling is involved
--Map,flatMap,filter
Wide Transformation
+++++++++++++++++++
--Shuffling is involved
--groupByKey(),reduceByKey()
Stages
++++++
Stages are marked by shuffle boundaries:
whenever we encounter a shuffle, a new stage gets created,
i.e., whenever we call a wide transformation a new stage gets created.
If we use 3 wide transformations, 4 stages will get created.
Job
+++
--Whatever action you call shows up as a job.
--The number of jobs is equal to the number of actions.
--Within a job, every wide transformation creates a new stage.
Task
++++
--One task corresponds to each partition of a stage.
--100 partitions => 100 tasks.
Job (one per action call)
|
Stage (number of wide transformations + 1)
|
Task (one per partition)
reduceByKey
--We get the advantage of local aggregation:
1.more work in parallel
2.less shuffling
Node1: (x 1) --> (x 1)
       (y 1) --> (y 1)
Node2: (x 1),(x 1) --> (x 2)
Node3: (x 1),(x 1),(x 1) --> (x 3)
       (y 1),(y 1),(y 1) --> (y 3)
After the shuffle, Node4 receives only the partial sums: (x 1),(x 2),(x 3) --> (x 6)
groupByKey
++++++++++
--We do not get any local aggregation.
--All the key-value pairs are sent (shuffled) to another machine.
--So we have to shuffle more data and we get less parallelism.
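A side-by-side sketch on made-up data: both give the same counts, but reduceByKey combines the values locally on each partition before shuffling, while groupByKey shuffles every single (key,1) pair.
val words = sc.parallelize(List("x", "x", "x", "y", "y"))
val pairs = words.map(w => (w, 1))
pairs.reduceByKey(_ + _).collect()               // local aggregation first, then shuffle of partial sums
pairs.groupByKey().mapValues(_.sum).collect()    // every pair is shuffled, then summed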
+++++++++
Pair RDD - an RDD where each element is a tuple of two elements (key, value).
+++
sc.defaultParallelism
rdd.getNumPartitions
sc.defaultMinPartitions
serialization
+++++++++++++
Serialized data is in byte format, so it takes less storage.
It increases the processing cost (the bytes must be deserialized back into objects) but reduces the memory footprint.
Block eviction
++++++++++++++
Consider the situation that some of the block partitions are so large (skew) that
they quickly fill up the storage memory used for caching.
When the storage memory becomes full, an eviction policy (LRU) is used to make
space for new blocks.
DISK_ONLY--serialized
MEMORY_ONLY-non-serialized
MEMORY_AND_DISK-non-serialized
OFF_HEAP--serialized
MEMORY_ONLY_SER-serialized
MEMORY_AND_DISK_SER-Serialized
val rdd1=sc.textFile("/user/cloudera/batch194/abc.txt")
val rdd2=rdd1.flatMap(x=>x.split(" "))
val Lowercase=rdd2.map(x=>x.toLowerCase())
val rdd3=Lowercase.map(x=>(x,1))
val rdd4=rdd3.reduceByKey((x,y)=>x+y).cache()
rdd4.toDebugString
rdd4.toDebugString
++++++++++++++++++
It is used to check the lineage graph; we need to read it from bottom to top.
val rdd1=sc.textFile("/user/cloudera/batch194/abc.txt")
val rdd2=rdd1.flatMap(x=>x.split(" "))
val Lowercase=rdd2.map(x=>x.toLowerCase())
val rdd3=Lowercase.map(x=>(x,1))
import org.apache.spark.storage.StorageLevel
val rdd4=rdd3.reduceByKey((x,y)=>x+y).persist(StorageLevel.MEMORY_AND_DISK)
rdd4.toDebugString
** If we use cache() and we don't have enough memory then it will just skip caching; it
won't throw any error.
** Don't cache or persist your base (raw) RDD.
Map:
The RDD has 10000 rows and 10 partitions,
so each partition holds 1000 records.
val a=sc.textFile("file")
val b=a.map(...)
Here the function passed to map will be called 10000 times (once per record).
In case of mapPartitions, it processes one whole partition at a time, so in total the
function is called only 10 times.
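A small sketch of the difference (the file name and partition count are assumptions): the function given to mapPartitions receives a whole partition as an iterator, so any expensive setup runs once per partition instead of once per record.
val a = sc.textFile("file", 10)
val b = a.map(x => x.length)                 // the lambda runs once per record (10000 times here)
val c = a.mapPartitions(iter => {            // the lambda runs once per partition (10 times here)
  // expensive setup (e.g. a parser or a connection) would go here, once per partition
  iter.map(x => x.length)
})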
Structured APIs
+++++++++++++++
Dataframes && Datasets
SC:
+++
Earlier there was a separate context for each and every
thing: SparkContext, SQLContext, HiveContext.
sparksession:
++++++++++++++
It is a unified entry point of a Spark application.
It provides a way to interact with the various Spark functionalities with a lesser number
of constructs.
Instead of sc, sqlContext and hiveContext, now everything is encapsulated in the Spark session.
import org.apache.spark.sql.SparkSession
val spark=SparkSession.builder().appName("first application").master("local[2]").getOrCreate()
++++
+++
++
+
import org.apache.spark.sql.SparkSession
val spark=SparkSession.builder().appName("first application").master("local[2]").getOrCreate()
val ordersDF=spark.read.option("header",true).option("inferSchema",true).csv("orders.csv")
ordersDF.show()
ordersDF.printSchema()
import org.apache.spark.sql.SparkSession
import org.apache.log4j.Level
import org.apache.log4j.Logger
Logger.getLogger("org").setLevel(Level.ERROR)
val spark=SparkSession.builder().appName("first application").master("local[2]").getOrCreate()
val ordersDF=spark.read.option("header",true).option("inferSchema",true).csv("orders.csv")
val groupOrderDf=ordersDF.repartition(4).where("order_customer_id>10000").select("order_id","order_customer_id").groupBy("order_customer_id").count()
groupOrderDf.show()
ordersDF.printSchema()
**Whenever we work with dataframes/datasets we deal with higher level
programming constructs.
**When we were working with raw RDDs, that was low level code.
**The Spark compiler will convert your higher level (dataframe) code into low level
RDD code.
**The driver converts the high level code into low level code and then sends
the low level code to the executors.
Higher Level Code
+++++++++++++++++
val groupOrderDf=ordersDF.repartition(4).where("order_customer_id>10000").select("order_id","order_customer_id").groupBy("order_customer_id").count()
low level code (the lower level code is sent directly to the executors)
++++++++++++
groupOrderDf.foreach(x=>{println(x)})
RDD
+++
--When we deal with raw RDDs, we deal with low level code:
map, filter, flatMap, reduceByKey (we have to write how to do it).
--This low level code is not developer friendly.
--RDDs lack some of the optimizations (there is no Catalyst optimizer for raw RDDs).
dataframes
++++++++++
Datasets
++++++++
--Compile time safety.
--We get more flexibility in terms of using lower level code (lambdas).
--Conversion from a dataframe to a dataset is seamless.
**Using datasets helps us cut down on developer mistakes, but it comes with
an extra cost in terms of type casting and expensive serialization.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.Dataset
import org.apache.spark.sql.Row
import java.sql.Timestamp
case class OrdersData(order_id:Int,order_date:Timestamp,order_customer_id:Int,order_status:String)
val spark=SparkSession.builder().appName("first application").master("local[2]").getOrCreate()
val ordersDF:Dataset[Row]=spark.read.option("header",true).option("inferSchema",true).csv("orders.csv")
import spark.implicits._
val ordersDs=ordersDF.as[OrdersData]
val groupOrderDf=ordersDF.repartition(4).where("order_customer_id>10000").select("order_id","order_customer_id").groupBy("order_customer_id").count()
groupOrderDf.show()
++++
ordersDs.filter(x=>x.order_id<10)
ordersDF.filter("order_id<10")
++++
1.Read data from a data source and create a dataframe/dataset
--external data sources (mysql, redshift, mongodb)
--internal data sources (HDFS, S3, Azure, Google Cloud Storage)
We have the flexibility in Spark to create a dataframe directly from an external
database.
Spark is very good at processing, but it is not as efficient at ingesting data.
Spark gives you a JDBC connector to ingest data from a MySQL DB directly (see the
sketch after this list).
2.Perform a bunch of transformations and actions
--transformations/actions using higher level constructs
3.Write the data into a target (sink)
--internal/external
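A minimal sketch of step 1 using the JDBC connector mentioned above; the host, database, table and credentials are placeholders, and the MySQL JDBC driver jar must be on the classpath:
val mysqlDf = spark.read.format("jdbc")
  .option("url", "jdbc:mysql://hostname:3306/mydb")   // placeholder host and database
  .option("dbtable", "orders")                        // placeholder table
  .option("user", "username")
  .option("password", "password")
  .load()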
val orderDf=spark.read.format("csv").option("header",true).option("inferSchema",true).option("path","orders.csv").load()
val orderDf=spark.read.format("json").option("path","orders.json").option("mode","PERMISSIVE").load()
val orderDf=spark.read.format("parquet").option("path","orders.parquet").load()
--read modes
1.PERMISSIVE (sets all the fields to null when it encounters a corrupted record, and keeps the
malformed row in a _corrupt_record column) - default
2.DROPMALFORMED (drops the corrupted records)
3.FAILFAST (fails with an exception as soon as a corrupted record is encountered)
Specifying the schema explicitly
++++++++++++++++++++++++++++++++
1.Programmatically
2.DDL String
programmatic approach
+++++++++++++++++++++
import org.apache.spark.sql.types.IntegerType
import org.apache.spark.sql.types.StringType
import org.apache.spark.sql.types.TimestampType
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.types.StructField
val orderSchema=StructType(List(
StructField("order_id",IntegerType,false),
StructField("order_date",TimestampType,true),
StructField("customer_id",IntegerType,false),
StructField("status",StringType,false)))
val ordersDf=spark.read.format("csv").option("header",true).schema(orderSchema).option("path","orders.csv").load()
spark.stop()
DDL String
++++++++++
val ordersDf=spark.read.format("csv").option("header",true).schema(ordersSchemaDDL).option("path","orders.csv").load()
A dataframe is a Dataset[Row].
A dataset is a Dataset[Orders], i.e. typed with a case class.
val ordersDf=spark.read.format("csv").option("header",true).schema(ordersSchemaDDL).option("path","orders.csv").load()
import spark.implicits._
ordersDf.as[Orders]
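The snippets above use ordersSchemaDDL and Orders without defining them; a minimal sketch, assuming the same four order columns as the programmatic schema:
val ordersSchemaDDL = "order_id INT, order_date TIMESTAMP, customer_id INT, status STRING"
case class Orders(order_id: Int, order_date: java.sql.Timestamp, customer_id: Int, status: String)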
saveModes
+++++++++
1.append (puts new files into the existing folder)
2.overwrite (first deletes the existing folder, then creates a new one)
3.errorIfExists (gives an error if the output folder already exists)
4.ignore (if the folder already exists, the write is silently skipped)
import org.apache.spark.sql.SaveMode
val ordersDf=spark.read.format("csv").option("header",true).schema(ordersSchemaDDL).option("path","orders.csv").load()
val orderRep=ordersDf.repartition(4)
orderRep.write.format("csv").mode(SaveMode.Overwrite).option("path","newfolder1").save()
**Normally we are writing a DF to our target. Then we have a few options to control the file
layout.
val ordersDf=spark.read.format("csv").option("header",true).schema(ordersSchemaDDL).option("path","orders.csv").load()
val orderRep=ordersDf.repartition(4)
orderRep.select("*").filter("order_id<10")
Here it will create 4 output files; to find order_id < 10 you have to search all the
files.
2)partitionBy
+++++++++++++
--It is equivalent to partitioning in Hive.
--It provides partition pruning.
val ordersDf=spark.read.format("csv").option("header",true).schema(ordersSchemaDDL).option("path","orders.csv").load()
val orderRep=ordersDf.repartition(4)
--partitionBy("order_status")
orderRep.write.format("csv").partitionBy("order_status").mode(SaveMode.Overwrite).option("path","newfolder1").save()
here order_status=Closed
order_status=OPENED
order_status=COMPLETE
order_status=ON_HOLD
order_status=PENDING
order_status=PROCESSING
3)bucketBy()
++++++++++++
import org.apache.spark.sql.SparkSession
import org.apache.spark.SparkConf
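The notes leave this section blank; a minimal sketch of bucketBy on the orders data above (the table name and bucket count are assumptions). Bucketing only works with saveAsTable, not with a plain path-based save:
import org.apache.spark.sql.SaveMode
ordersDf.write
  .format("csv")
  .mode(SaveMode.Overwrite)
  .bucketBy(4, "customer_id")       // hash the rows into 4 buckets on this column
  .sortBy("customer_id")            // keep each bucket sorted (see sortBy below)
  .saveAsTable("orders_bucketed")   // bucketing requires writing to a metastore table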
4)sortBy
++++++++
5)maxRecordsPerFile
+++++++++++++++++++
val ordersDf=spark.read.format("csv").option("header",true).schema(ordersSchemaDDL).option("path","orders.csv").load()
val orderRep=ordersDf.repartition(4)
orderRep.write.format("csv").partitionBy("order_status").option("maxRecordsPerFile",2000).mode(SaveMode.Overwrite).option("path","newfolder1").save()
sparkSQL
+++++++++
createOrReplaceTempView
+++++++++++++++++++++++
val ordersDf=spark.read.format("csv").option("header",true).schema(ordersSchemaDDL).option("path","orders.csv").load()
ordersDf.createOrReplaceTempView("Orders")
val resultDf=spark.sql("select order_customer_id,count(*) as total_orders from Orders where " +
"order_status='CLOSED' group by order_customer_id order by total_orders desc")
resultDf.show()
++++
Storing data in the form of Table
++++++++++++++++++++++++++++++++++
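The notes leave this section blank; a minimal sketch of saving a dataframe as a persistent table (the database and table names are made up), assuming the SparkSession was built with enableHiveSupport():
import org.apache.spark.sql.SaveMode
spark.sql("CREATE DATABASE IF NOT EXISTS retail")
ordersDf.write
  .mode(SaveMode.Overwrite)
  .saveAsTable("retail.orders_tbl")     // managed table registered in the metastore
spark.catalog.listTables("retail").show()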
Transformations
++++++++++++++
1.Low level transformations
map
filter
groupByKey
Sample line: 1 2021-07-21 11599,COMPLETE
**Since this is an unstructured file, I will load it as an RDD (raw RDD);
each line of the RDD is of String type.
import org.apache.spark.sql.SparkSession
import org.apache.spark.SparkConf
def parser(line:String)={
  line match{
    case myregex(order_id,date,customer_id,order_status)=>Order(order_id.toInt,customer_id.toInt,order_status)
  }
}
val sparkConf=new SparkConf()
sparkConf.set("spark.app.name","my first application")
sparkConf.set("spark.master","local[2]")
val spark=SparkSession.builder().config(sparkConf).enableHiveSupport().getOrCreate()
val lines= spark.sparkContext.textFile("orders_new.csv")
import spark.implicits._
val orderDS=lines.map(parser).toDS().cache()
orderDS.select("order_id").show()
orderDS.groupBy("order_status").count().show()
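The parser above refers to myregex and Order without defining them; a minimal sketch, assuming lines shaped like the sample "1 2021-07-21 11599,COMPLETE" shown earlier (the exact pattern is an assumption):
case class Order(order_id: Int, customer_id: Int, order_status: String)
// capture groups: order id, date, customer id, status
val myregex = """^(\S+) (\S+) (\S+),(\S+)""".r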
+++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++
How to refer to a column in a df/ds
++++++++++++++++++++++++++++++++
1.Column String
orderDf.select("order_id","order_status")
2.Column Object
import org.apache.spark.sql.functions.{col, column}
orderDf.select(col("order_id"),col("order_status")).show
scala specific
++++++++++++++
$"order_status"
'order_date
orderDf.select(col("order_id"),column("order_customer_id"),$"order_status",'order_date).show
column Expressions
++++++++++++++++++
**We cannot mix a column string with a column expression, nor can we mix a column object
with a column expression.
df.select("order_id","concat(order_status,'_Status')")  --this will NOT work (string mixed with expression)
column string - df.select("order_id")
column object - df.select(col("order_id"))
column expression - df.select(expr("concat(order_status,'_Status')"))
df.select(column("order_id"),expr("concat(order_status,'_Status')")).show()
df.selectExpr("order_id","concat(order_status,'_Status')").show()
++++++
column object expression UDF
++++++++++++++++++++++++++++
**Basically we register the function with the driver.
The driver will serialize the function and will send it to each executor.
import org.apache.spark.sql.SparkSession
import org.apache.spark.SparkConf
import org.apache.spark.sql.functions.{udf, column, expr}
def ageCheck(age:Int):String={ if(age>18) "Y" else "N" }
val parseAgeFunction=udf(ageCheck(_:Int))
val df2=df1.withColumn("adult",parseAgeFunction(column("age")))
spark.udf.register("parseAgeFunction",ageCheck _)
spark.udf.register("parseAgeFunction",(x:Int)=>{if(x>18)"Y" else "N"})
val df2=df1.withColumn("adult",expr("parseAgeFunction(age)"))
spark.catalog.listFunctions().filter(x=>x.name=="parseAgeFunction").show()
+++
df1.createOrReplaceTempView("peopletable")
spark.sql("select name,age,city,parseAgeFunction(age) as adult from peopletable")
+++++
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.DateType
val myList=List((1,"2013-07-25",11599,"CLOSED"),(2,"2014-07-25",115,"OPENED")
,(3,"2015-07-25",9,"CLOSED")
,(4,"2016-07-25",39,"OPENED")
,(5,"2013-07-25",129,"CLOSED")
)
val df=spark.createDataFrame(myList).toDF("order_id","order_date","customer_id","status")
val df1=df.withColumn("order_date",unix_timestamp(col("order_date").cast(DateType)))
val df2=df1.withColumn("new_id",monotonically_increasing_id())
val df3=df2.dropDuplicates("order_date","customer_id").drop("order_id").sort("order_date")
++++++++++++++++++++++
Aggregate transformations
1.Simple aggregations
2.Grouping aggregations
3.Window aggregations
1.Simple aggregations
+++++++++++++++++++++
object Df extends App{
import org.apache.spark.sql.functions._
val invoiceDF=spark.read.format("csv").option("inferSchema",true).option("header",true).option("path","orders_data.csv").load()
invoiceDF.select(
count("*").as("RowCount"),
sum("Quantity").as("TotalQuantity"),
avg("UnitPrice").as("AvgPrice"),
countDistinct("InvoiceNo").as("CountDistinct")
).show()
invoiceDF.selectExpr(
"count(StockCode) as RowCount",
"sum(Quantity) as TotalQuantity",
"avg(UnitPrice) as AvgPrice",
"count(distinct InvoiceNo) as CountDistinct").show()
invoiceDF.createOrReplaceTempView("Sales")
spark.sql("select count(StockCode) as RowCount, sum(Quantity) as TotalQuantity, avg(UnitPrice) as AvgPrice, count(distinct InvoiceNo) as CountDistinct from Sales").show()
2.Grouping aggregations
+++++++++++++++++++++++
val invoiceDF=spark.read.format("csv").option("inferSchema",true).option("header",true).option("path","orders_data.csv").load()
invoiceDF.groupBy("Country","InvoiceNo").agg(sum("Quantity").as("total_quantity"),sum(expr("Quantity*UnitPrice")).as("Invoice_value")).show()
invoiceDF.groupBy("Country","InvoiceNo").agg(expr("sum(Quantity) as total_quantity"),expr("sum(Quantity*UnitPrice) as Invoice_value")).show()
invoiceDF.createOrReplaceTempView("Sales")
spark.sql("select Country, InvoiceNo, sum(Quantity) as total_quantity, sum(Quantity*UnitPrice) as Invoice_value from Sales group by Country, InvoiceNo").show()
3.Window aggregations
+++++++++++++++++++++
Partition column - Country
Ordering column  - WeekNum
Window size      - from the first row of the partition to the current row
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.sum
val df1=spark.read.format("csv").option("inferSchema",true).option("header",true).option("path","window_data.csv").load()
val myWindow=Window.partitionBy("Country").orderBy("WeekNum").rowsBetween(Window.unboundedPreceding,Window.currentRow)
df1.withColumn("Running_total",sum("InvoiceValue").over(myWindow)).show()
++++
+++
++
+
JOINS
++++++
val customersdf2=spark.read.format("json").option("path","customers").load()
val joinedDf=orderdf1.join(customersdf2,orderdf1.col("order_customer_id")===customersdf2.col("customer_id"),"inner").sort("order_customer_id")
executor1-node 1
++++++++++
Orders
15192,2013-10-29 00:00:00.0,2,PENDING_PAYMENT
33865,2014-02-18 00:00:00.0,2,COMPLETE
(2,{15192,2013-10-29 00:00:00.0,PENDING_PAYMENT})
(2,{33865,2014-02-18 00:00:00.0,COMPLETE})
Customers
Orders
35158,2014-02-26 00:00:00.0,3,COMPLETE
15192,2013-10-29 00:00:00.0,2,PENDING_PAYMENT
executor3-node 3
+++++++++++++++++
Join types:
1.Inner join
2.Right outer join
3.Left outer join
4.Full outer join
customer_id is ambiguous
++++++++++++++++++++++++
This happens when we try to select a column name which comes from 2 different
dataframes.
val orderdf=spark.read.format("csv").option("header",true).option("inferSchema",true).option("path","orders").load()
val customersdf=spark.read.format("csv").option("header",true).option("inferSchema",true).option("path","customer").load()
val joinedDf=orderdf.join(customersdf,orderdf.col("customer_id")===customersdf.col("customer_id"),"outer").select("order_id","customer_id","customer_fname")
--the select of "customer_id" fails here because both dataframes contain that column.
1) Rename the column before joining
val orderdf=spark.read.format("csv").option("header",true).option("inferSchema",true).option("path","orders").load()
val ordersDfNew=orderdf.withColumnRenamed("customer_id","cust_id")
val customersdf=spark.read.format("csv").option("header",true).option("inferSchema",true).option("path","customer").load()
val joinedDf=ordersDfNew.join(customersdf,ordersDfNew.col("cust_id")===customersdf.col("customer_id"),"outer")
2) Drop one of the ambiguous columns after the join
val orderdf=spark.read.format("csv").option("header",true).option("inferSchema",true).option("path","orders").load()
val customersdf=spark.read.format("csv").option("header",true).option("inferSchema",true).option("path","customer").load()
val joinedDf=orderdf.join(customersdf,orderdf.col("customer_id")===customersdf.col("customer_id"),"outer").drop(orderdf.col("customer_id")).sort("order_id")
coalesce (replace the nulls introduced by the outer join with a default value)
val orderdf=spark.read.format("csv").option("header",true).option("inferSchema",true).option("path","orders").load()
val customersdf=spark.read.format("csv").option("header",true).option("inferSchema",true).option("path","customer").load()
val joinedDf=orderdf.join(customersdf,orderdf.col("customer_id")===customersdf.col("customer_id"),"outer").drop(orderdf.col("customer_id"))
.sort("order_id").withColumn("order_id",expr("coalesce(order_id,-1)"))
spark.sql("SET spark.sql.autoBroadcastJoinThreshold=-1")
--setting the threshold to -1 disables the automatic broadcast join.
Broadcast join
++++++++++++++
--This does not require shuffling the large dataframe.
**Whenever we join 2 large dataframes, a normal shuffle join is invoked and a
shuffle is required.
**When you have one large dataframe and the other dataframe is small, you
can go for a broadcast join.
import org.apache.spark.sql.functions.broadcast
val orderdf=spark.read.format("csv").option("header",true).option("inferSchema",true).option("path","orders").load()
val customersdf=spark.read.format("csv").option("header",true).option("inferSchema",true).option("path","customer").load()
val joinedDf=orderdf.join(broadcast(customersdf),orderdf.col("customer_id")===customersdf.col("customer_id"),"outer").select("order_id","customer_id","customer_fname")
+++++++++
date_format(datetime,"MMMM")-->January
WARN JANUARY 123
ERROR FEB 12222
WARN FEB 12
Pivot table
++++++++++++
spark.sql("select level,date_format(datetime,"MMMM") as
month,date_format(datetime,"M") as month_num from
my_new_logging_level").groupBy("level").pivot("month_num").count();
val colun=List("JAN","FEB","MAR","APR","MAY","JUN","JULY","AUG","SEP","OCT","NOV","DEC")
spark.sql("select level, date_format(datetime,'MMMM') as month, date_format(datetime,'M') as month_num from my_new_logging_level").groupBy("level").pivot("month",colun).count()
--passing the list of expected values to pivot() avoids an extra job to determine the distinct pivot values.
Spark Optimization
++++++++++++++++++
Executor (a container of resources)
**One worker node can hold more than one executor,
i.e., in a single worker node we can have multiple containers.
executor = container = JVM
2 strategies:
1.Thin executors
++++++++++++++++
The intention is to create more executors, with each executor holding the minimum possible
resources (e.g. 1 core each).
Disadvantage:
--We lose the benefit of multithreading within an executor, and shared data such as a
broadcast variable has to be copied to every one of the many executors.
2.Fat executors
+++++++++++++++
The intention is to give each executor the maximum possible resources.
Drawbacks
+++++++++
--It is observed that if an executor holds more than 5 CPU cores then the HDFS
throughput suffers (too much multithreading is not good).
--If an executor holds a very large amount of memory, then garbage collection takes a lot
of time.
--Garbage collection means removing unused objects from memory.
Example: 16 cores, 64 GB RAM per machine.
1 core is given for other background activities.
1 GB RAM is given for the operating system.
Now each node is left with 15 cores and 63 GB RAM.
=> We want multithreading within an executor (more than 1 CPU core per executor).
=> We do not want the HDFS throughput to suffer
(it suffers when we use more than 5 cores per executor).
So 5 is the right choice for the number of CPU cores per executor.
15 cores, 63 GB RAM per machine => 15/5 = 3 executors per machine, each with 63/3 = 21 GB.
The notes' figure of 29 executors corresponds to a 10-node cluster: 3 x 10 = 30 executors,
minus 1 kept aside for the Application Master.
Out of each executor's memory, max(384 MB, 10%) is set aside as off-heap overhead.
++++++++++++++++++++
+++++++++++++++++++
+++++++++++++++++
++++++++++++++++
+++++++++++++++
++++++++++++++
5 worker nodes
each worker node has 32 GB RAM and 8 CPU cores
YARN:
+++++
32 GB RAM, 8 CPU cores (physical cores) per node
24 GB RAM is allocated for YARN containers, 8 GB RAM is left for other processes.
Container memory min = 1 GB and max = 4 GB (we cannot create an executor with more than 4 GB
or less than 1 GB).
Container memory: >=1GB and <=4GB
Container cores: min 1, max 2
8 physical cores
8*2=16 vCPU cores
Out of the 16 vCPU cores, 12 can be used for YARN containers; the remaining 4 are used for
other purposes.
12 vcores / 2 cores per container = 6 containers
So per node: 6 executors, each with 2 cores and 4 GB RAM.
++++
+++
++
+
bigLogNew.txt--1.46GB
executor=compute+Ram=>1 core+1GB
Execution and Storage Memory
++++++++++++++++++++++++++++
1.Execution Memory
--Used for shuffles, joins, sorts and aggregations.
2.Storage Memory
--Used for cache, broadcast and accumulators.
--An application which does not use caching can use the entire unified space for execution.
--executor-memory 4G
If you request a container/executor of 4 GB, then you are actually requesting
4 GB (heap memory) + max(384 MB, 10% of 4 GB) (off-heap overhead),
i.e. 4096 MB (Java heap) + ~410 MB (off heap).
Broadcast JOINS
+++++++++++++++
--It is used when we have 1 large table and 1 small table, and we want to join
them.
import spark.implicits._
import org.apache.spark.sql.functions.{rank, dense_rank, row_number}
import org.apache.spark.sql.expressions.Window
val df = Seq(("a", 10), ("a", 10), ("a", 20)).toDF("col1", "col2")
val windowSpec = Window.partitionBy("col1").orderBy("col2")   // recovered from the output below
df
.withColumn("rank", rank().over(windowSpec))
.withColumn("dense_rank", dense_rank().over(windowSpec))
.withColumn("row_number", row_number().over(windowSpec)).show
+----+----+----+----------+----------+
|col1|col2|rank|dense_rank|row_number|
+----+----+----+----------+----------+
| a| 10| 1| 1| 1|
| a| 10| 1| 1| 2|
| a| 20| 3| 2| 3|
+----+----+----+----------+----------+