What Is Apache Spark?
local – not really a cluster manager, but worth mentioning: we pass "local" to master() in order to run Spark on a single laptop/computer.
Spark Modules
Spark Core
Spark SQL
Spark Streaming
Spark MLlib
Spark GraphX
Spark Core
In this section of the Apache Spark tutorial, you will learn different concepts of the Spark Core library with examples in Scala. Spark Core is the base library of Spark; it provides the abstractions for distributed task dispatching, scheduling, and basic I/O functionality.
Before getting your hands dirty with Spark programming, set up your development environment to run the Spark examples using IntelliJ IDEA.
SparkSession
SparkSession, introduced in version 2.0, is an entry point to the underlying Spark functionality for programmatically working with Spark RDDs, DataFrames, and Datasets. Its object spark is available by default in spark-shell.
Creating a SparkSession instance is the first statement you write in a program that works with RDDs, DataFrames, and Datasets. A SparkSession is created using the SparkSession.builder() builder pattern.
// Create SparkSession
import org.apache.spark.sql.SparkSession
val spark:SparkSession = SparkSession.builder()
.master("local[1]")
.appName("SparkByExamples.com")
.getOrCreate()
Spark Context
SparkContext has been available since Spark 1.x (JavaSparkContext for Java) and was the entry point to Spark and PySpark before SparkSession was introduced in 2.0. Creating a SparkContext was the first step in a program that works with RDDs and connects to a Spark cluster. Its object sc is available by default in spark-shell.
Since Spark 2.x, when you create a SparkSession, a SparkContext object is created automatically and can be accessed using spark.sparkContext.
Note that you can create only one SparkContext per JVM, but you can create many SparkSession objects.
RDD creation
RDDs are created primarily in two ways: by parallelizing an existing collection and by referencing a dataset in an external storage system (HDFS, S3, and many more).
sparkContext.parallelize()
sparkContext.parallelize is used to parallelize an existing collection in your driver program. This is
a basic method to create RDD.
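For example, a minimal sketch (collection values assumed, using the spark session created earlier):
// Parallelize a local Scala collection into an RDD
val rddPar = spark.sparkContext.parallelize(Seq(1, 2, 3, 4, 5))
println("Number of partitions: " + rddPar.getNumPartitions)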
sparkContext.textFile()
Using the textFile() method, we can read a text (.txt) file from many sources, such as HDFS, S3, Azure, or the local file system, into an RDD.
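A minimal sketch (the file path is assumed for illustration):
// Each line of the text file becomes one record in the RDD
val rddFromFile = spark.sparkContext.textFile("src/main/resources/test.txt")
println("Total lines: " + rddFromFile.count())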
RDD Operations
On Spark RDD, you can perform two kinds of operations.
RDD Transformations
Spark RDD transformations are lazy operations, meaning they don't execute until you call an action on the RDD. Since RDDs are immutable, running a transformation (for example, map()) returns a new RDD instead of updating the current one.
Some transformations on RDDs are flatMap(), map(), reduceByKey(), filter(), and sortByKey(); all of these return a new RDD instead of updating the current one.
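A minimal sketch of chaining lazy transformations (sample data assumed):
// Transformations build a lineage; no job runs until an action is called
val linesRdd  = spark.sparkContext.parallelize(Seq("spark by examples", "learn spark"))
val wordsRdd  = linesRdd.flatMap(_.split(" "))
val countsRdd = wordsRdd.map(w => (w, 1)).reduceByKey(_ + _)  // still not executed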
RDD Actions
An RDD action operation returns values from an RDD to the driver node. In other words, any RDD function that returns something other than RDD[T] is considered an action. Actions trigger the computation and return the result to the driver program.
Some actions on RDDs are count(), collect(), first(), max(), reduce(), and more.
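For example, a minimal sketch (sample data assumed):
// Actions trigger execution and return results to the driver
val nums = spark.sparkContext.parallelize(Seq(1, 2, 3, 4, 5))
println(nums.count())        // 5
println(nums.reduce(_ + _))  // 15
nums.collect().foreach(println)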
RDD Examples
Read CSV file into RDD
RDD Pair Functions
Generate DataFrame from RDD
(Sample DataFrame output with columns firstname, middlename, lastname, dob, gender, and salary.)
In this Apache Spark SQL DataFrame tutorial, I have explained several of the most commonly used operations and functions on DataFrame and Dataset with working Scala examples.
______________________________________________________________________________
In this article, I'll delve into the essence of SparkSession, how to create a SparkSession object, and its frequently used methods.
What is SparkSession
SparkSession was introduced in Spark 2.0. It is an entry point to the underlying Spark functionality for programmatically creating Spark RDDs, DataFrames, and Datasets. SparkSession's object spark is the default variable available in spark-shell, and it can also be created programmatically using the SparkSession builder pattern.
If you are looking for a PySpark explanation, please refer to how to create SparkSession in
PySpark.
1. SparkSession Introduction
As mentioned in the beginning, SparkSession is an entry point to Spark, and creating a
SparkSession instance would be the first statement you would write to program
with RDD, DataFrame, and Dataset. SparkSession will be created
using SparkSession.builder() builder pattern.
Before Spark 2.0, SparkContext was the entry point, and it has not been completely replaced by SparkSession; many features of SparkContext are still available and used in Spark 2.0 and later. You should also know that SparkSession internally creates SparkConf and SparkContext with the configuration provided to SparkSession.
With Spark 2.0, a new class org.apache.spark.sql.SparkSession has been introduced, which is a
combined class for all the different contexts we used to have before 2.0 (SQLContext,
HiveContext, etc); hence, Spark Session can be used in the place of SQLContext, HiveContext,
and other contexts.
Spark Session also includes all the APIs available in different contexts –
SparkContext
SQLContext
StreamingContext
HiveContext
How many SparkSessions can you create in an application?
You can create as many SparkSession objects as you want in a Spark application using either SparkSession.builder() or SparkSession.newSession(). Multiple SparkSession objects are needed when you want to keep Spark tables (relational entities) logically separated.
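A minimal sketch of creating a second session (variable names assumed):
// newSession() returns a new SparkSession with its own SQL configuration and
// temporary views, but it shares the same underlying SparkContext
val spark2 = spark.newSession()
println(spark.sparkContext == spark2.sparkContext)  // true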
2. SparkSession in spark-shell
By default, Spark shell provides spark object, which is an instance of the SparkSession class. We
can directly use this object when required in spark-shell.
// Usage of spark variable
scala> spark.version
Like the Spark shell, most tools, notebooks, and Azure Databricks create a default SparkSession object for you, so you don't have to worry about creating a Spark session.
3. How to Create SparkSession
Creating a SparkSession is fundamental, as it initializes the environment required to leverage the capabilities of Apache Spark.
To create a SparkSession in Scala or Python, use the builder pattern method builder() and call the getOrCreate() method. getOrCreate() returns an existing SparkSession if one exists; otherwise, it creates a new one. The example below creates a SparkSession in Scala.
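The referenced example is missing from this extract; a minimal sketch, mirroring the builder call shown earlier:
// Create a SparkSession object
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder()
  .master("local[1]")
  .appName("SparkByExamples.com")
  .getOrCreate()
println("Spark Version : " + spark.version)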
To use Hive with Spark, you need to enable it using the enableHiveSupport() method. Since Spark 2.0, SparkSession provides built-in support for Hive operations, such as writing queries on Hive tables using HQL, accessing Hive UDFs, and reading data from Hive tables.
// Enabling Hive to use in Spark
val spark = SparkSession.builder()
.master("local[1]")
.appName("SparkByExamples.com")
.config("spark.sql.warehouse.dir", "<path>/spark-warehouse")
.enableHiveSupport()
.getOrCreate();
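The DataFrame used in the output below and in the SQL examples that follow is not shown in this extract; a minimal sketch consistent with that output:
// Create a DataFrame from a small collection (default column names _1, _2)
val df = spark.createDataFrame(List(("Scala", 25000), ("Spark", 35000), ("PHP", 21000)))
df.show()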
// +-----+-----+
// | _1| _2|
// +-----+-----+
// |Scala|25000|
// |Spark|35000|
// | PHP|21000|
// +-----+-----+
Using SparkSession you can access Spark SQL capabilities in Apache Spark. In order to use SQL
features first, you need to create a temporary view in Spark. Once you have a temporary view you
can run any ANSI SQL queries using spark.sql() method.
// Spark SQL
df.createOrReplaceTempView("sample_table")
val df2 = spark.sql("SELECT _1,_2 FROM sample_table")
df2.show()
Spark SQL temporary views are session-scoped and will not be available if the session that
creates it terminates. If you want to have a temporary view that is shared among all sessions and
kept alive until the Spark application terminates, you can create a global temporary view
using createGlobalTempView().
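A minimal sketch (the view name is assumed):
// Global temporary views are registered under the global_temp database
df.createGlobalTempView("sample_global_table")
spark.sql("SELECT _1, _2 FROM global_temp.sample_global_table").show()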
// List databases using spark.catalog
val ds = spark.catalog.listDatabases
ds.show(false)
// Output (truncated):
// +-------+----------------+----------------------------+
// |default|default database|file:/<path>/spark-warehouse|
// +-------+----------------+----------------------------+
// List Tables
val ds2 = spark.catalog.listTables
ds2.show(false)
// Output (truncated):
// +-----------------+--------+-----------+---------+-----------+
// |name             |database|description|tableType|isTemporary|
// +-----------------+--------+-----------+---------+-----------+
Notice the two tables we have created so far: sample_table, which was created with createOrReplaceTempView, is a temporary table, while the Hive table is a managed table.
Method – Description
version – Returns the Spark version your application is running on, typically the Spark version your cluster is configured with.
range(n) – Returns a single-column Dataset of LongType named id, containing elements from 0 to n (exclusive) with a step of 1. There are several variations of this function; for details, refer to the Spark documentation.
createDataset() – Creates a Dataset from a collection, a DataFrame, or an RDD.
6. FAQ’s on SparkSession
7. Conclusion
In this Spark SparkSession article, you have learned what is Spark Session, its usage, how to
create SparkSession programmatically, and learned some of the commonly used SparkSession
methods. In summary
SparkSession was introduced in Spark 2.0 which is a unified API for working with
structured data.
It combines SparkContext, SQLContext, and HiveContext. It’s designed to work with
DataFrames and Datasets, which provide more structured and optimized operations than
RDDs.
SparkSession natively supports SQL queries, structured streaming, and DataFrame-based
machine learning APIs.
spark-shell, Databricks, and other tools provide spark variable as the default SparkSession
object.
What is SparkContext?
SparkContext has been available since Spark 1.x (JavaSparkContext for Java), and it was the entry point to Spark and PySpark before SparkSession was introduced in 2.0. Creating a SparkContext is the first step to using RDDs and connecting to a Spark cluster. In this article, you will learn how to create it with examples.
What is SparkContext
Since Spark 1.x, SparkContext has been an entry point to Spark; it is defined in the org.apache.spark package. It is used to programmatically create Spark RDDs, accumulators, and broadcast variables on the cluster. Its object sc is the default variable available in spark-shell, and it can be created programmatically using the SparkContext class.
Note that you can create only one active SparkContext per JVM. You should stop() the
active SparkContext before creating a new one.
The Spark driver program creates and uses SparkContext to connect to the cluster manager, to submit Spark jobs, and to know which resource manager (YARN, Mesos, or Standalone) to communicate with. It is the heart of the Spark application.
Related: How to get current SparkContext & its configurations in Spark
1. SparkContext in spark-shell
By default, Spark shell provides sc object which is an instance of the SparkContext class. We can
directly use this object where required.
// 'sc' is a SparkContext variable in spark-shell
scala> sc.appName
Yields below output.
Similar to the Spark shell, most tools, notebooks, and Azure Databricks create a default SparkContext object for you, so you don't have to worry about creating a Spark context.
2. Spark 2.X – Create SparkContext using Scala Program
Since Spark 2.0, we mostly use SparkSession as most of the methods available in SparkContext
are also present in SparkSession. Spark session internally creates the Spark Context and exposes
the sparkContext variable to use.
At any given time only one SparkContext instance should be active per JVM. In case you want to
create another you should stop existing SparkContext (using stop()) before creating a new one.
// Imports
import org.apache.spark.sql.SparkSession
object SparkSessionTest extends App {
  // SparkSession internally creates and exposes a SparkContext
  val spark = SparkSession.builder()
    .master("local[1]").appName("SparkByExamples.com").getOrCreate()
3. Create RDD
Once you create a Spark Context object, use the below to create Spark RDD.
// Create RDD
val rdd = spark.sparkContext.range(1, 5)
rdd.collect().foreach(print)
// Create RDD from Text file
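// The original snippet is truncated here; a sketch with an assumed file path:
val rddFromFile = spark.sparkContext.textFile("src/main/resources/test.txt")
println("Number of lines : " + rddFromFile.count())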
4. Stop SparkContext
You can stop the SparkContext by calling the stop() method. As explained above, you can have only one SparkContext per JVM. If you want to create another one, you need to shut the existing one down first using the stop() method and then create a new SparkContext.
// SparkContext stop() method
spark.sparkContext.stop()
When Spark executes this statement, it logs the message INFO SparkContext: Successfully
stopped SparkContext to the console or to a log file.
broadcast – read-only variable broadcast to the entire cluster. You can broadcast a variable to a
Spark cluster only once.
emptyRDD – Creates an empty RDD
getPersistentRDDs – Returns all persisted RDDs
getOrCreate() – Creates or returns a SparkContext
hadoopFile – Returns an RDD of a Hadoop file
master() – Returns the master URL that was set while creating the SparkContext
newAPIHadoopFile – Creates an RDD for a Hadoop file with a new API InputFormat.
sequenceFile – Get an RDD for a Hadoop SequenceFile with given key and value types.
setLogLevel – Change log level to debug, info, warn, fatal, and error
textFile – Reads a text file from HDFS, local or any Hadoop supported file systems, and returns
an RDD
union – Union two RDDs
wholeTextFiles – Reads text files in a folder from HDFS, local, or any Hadoop-supported file system and returns an RDD of Tuple2, where the first element of the tuple is the file name and the second element is the content of the file.
7. SparkContext Example
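The listing below assumes a SparkConf and SparkContext have already been created; a minimal sketch of that setup (mirroring the second context created later in the example):
// Assumed setup for the first SparkContext used below
import org.apache.spark.{SparkConf, SparkContext}
val conf = new SparkConf().setAppName("sparkbyexamples.com").setMaster("local[1]")
val sparkContext = new SparkContext(conf)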
sparkContext.setLogLevel("ERROR")
println("First SparkContext:")
println("APP Name :"+sparkContext.appName)
println("Deploy Mode :"+sparkContext.deployMode)
println("Master :"+sparkContext.master)
// sparkContext.stop()
val conf2 = new SparkConf().setAppName("sparkbyexamples.com-2").setMaster("local[1]")
val sparkContext2 = new SparkContext(conf2)
println("Second SparkContext:")
println("APP Name :"+sparkContext2.appName)
println("Deploy Mode :"+sparkContext2.deployMode)
println("Master :"+sparkContext2.master)
}
FAQ’s on SparkContext
What does SparkContext do?
SparkContext has been the entry point to a Spark application since Spark 1.x. The SparkContext is the central
entry point and controller for Spark applications. It manages resources, coordinates tasks, and
provides the necessary infrastructure for distributed data processing in Spark. It plays a vital role
in ensuring the efficient and fault-tolerant execution of Spark jobs.
How to create SparkContext?
SparkContext is created using the SparkContext class. By default, a Spark "driver" is an application that creates the SparkContext in order to execute the job or jobs of a cluster. You can access the SparkContext from the SparkSession object via spark.sparkContext. If you want to create a SparkContext yourself, use the snippet below.
// Create SparkContext
import org.apache.spark.{SparkConf, SparkContext}
val sparkConf = new SparkConf().setAppName("sparkbyexamples.com").setMaster("local[1]")
val sparkContext = new SparkContext(sparkConf)
How to stop SparkContext?
Once you have finished using Spark, you can stop the SparkContext using the stop() method. This
will release all resources associated with the SparkContext and shut down the Spark application
gracefully.
Can I have multiple SparkContext in Spark job?
There can only be one active SparkContext per JVM. Having multiple SparkContext instances in a
single application can cause issues like resource conflicts, configuration conflicts, and unexpected
behavior.
How to access the SparkContext variable?
By default, a Spark "driver" is an application that creates the SparkContext in order to execute the job or jobs of a cluster. You can access the SparkContext from the SparkSession object via spark.sparkContext.
8. Conclusion
In this Spark Context article, you have learned what SparkContext is, how to create it in Spark 1.x and Spark 2.0, and how to use it with a few basic examples. In summary,
SparkContext is the entry point to any Spark functionality. It represents the connection to a
Spark cluster and is responsible for coordinating and distributing the operations on that
cluster.
It was the primary entry point for Spark applications before Spark 2.0.
SparkContext is used for low-level RDD (Resilient Distributed Dataset) operations, which
were the core data abstraction in Spark before DataFrames and Datasets were introduced.
It is not thread-safe, so in a multi-threaded or multi-user environment, you need to be
careful when using a single SparkContext instance.
______________________________________________________________________________
text01.txt:  One,1
text02.txt:  Two,2
text03.txt:  Three,3
text04.txt:  Four,4
invalid.txt: Invalid,I
1. Spark Read all text files from a directory into a single RDD
In Spark, passing the path of a directory to the textFile() method reads all the text files in it and creates a single RDD. Make sure you do not have nested directories; if Spark finds one, the process fails with an error.
// Spark Read all text files from a directory into a single RDD
val rdd = spark.sparkContext.textFile("C:/tmp/files/*")
rdd.foreach(f=>{
println(f)
})
This example reads all files from a directory, creates a single RDD and prints the contents of the
RDD.
// Output:
Invalid,I
One,1
Two,2
Three,3
Four,4
If you are running on a cluster, you should first collect the data in order to print it on the console, as shown below.
// Collect the data
rdd.collect.foreach(f=>{
println(f)
})
Let's see a similar example with the wholeTextFiles() method. Note that it returns an RDD[Tuple2], where the first value (_1) in the tuple is the file name and the second value (_2) is the content of the file.
// Using wholeTextFiles() to load the data
val rddWhole = spark.sparkContext.wholeTextFiles("C:/tmp/files/*")
rddWhole.foreach(f=>{
println(f._1+"=>"+f._2)})
// Output:
23
Apache Spark - SparkByExamples
file:/C:/tmp/files/invalid.txt=>Invalid,I
file:/C:/tmp/files/text01.txt=>One,1
file:/C:/tmp/files/text02.txt=>Two,2
file:/C:/tmp/files/text03.txt=>Three,3
file:/C:/tmp/files/text04.txt=>Four,4
(Outputs of the remaining examples in this article, reading specific files, wildcard patterns, and CSV files into an RDD, follow the same pattern: each record is either a raw line such as One,1 or a parsed pair such as Col1:One,Col2:1.)
8. Complete code
package com.sparkbyexamples.spark.rdd
import org.apache.spark.sql.SparkSession
object ReadMultipleFiles extends App {
val spark:SparkSession = SparkSession.builder()
.master("local[1]")
.appName("SparkByExamples.com")
.getOrCreate()
spark.sparkContext.setLogLevel("ERROR")
rdd4.foreach(f=>{
println(f)
})
text01.csv:  Col1,Col2 | One,1 | Eleven,11
text02.csv:  Col1,Col2 | Two,2 | Twenty One,21
text03.csv:  Col1,Col2 | Three,3
text04.csv:  Col1,Col2 | Four,4
invalid.csv: Col1,Col2 | Invalid,I
Note that the spark-csv library or other external DataFrame-based CSV libraries provide more effective and efficient ways to work with CSV files.
// Output:
Col1:col1,Col2:col2
Col1:One,Col2:1
Col1:Eleven,Col2:11
Let's see how to collect the data from the RDD using collect(). In this case, the collect() method returns Array[Array[String]], where the outer array represents the RDD data and each inner array is a record.
rdd.collect().foreach(f => println("Col1:"+f(0)+",Col2:"+f(1)))
Complete example
package com.sparkbyexamples.spark.rdd
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

// The object name, SparkSession setup, and rdd definition below are assumed;
// they are missing from the extracted listing.
object ReadMultipleCSVFiles extends App {

  val spark: SparkSession = SparkSession.builder()
    .master("local[1]")
    .appName("SparkByExamples.com")
    .getOrCreate()

  spark.sparkContext.setLogLevel("ERROR")

  // Read all CSV files and split each line into columns
  val rdd: RDD[Array[String]] = spark.sparkContext.textFile("C:/tmp/files/*")
    .map(f => f.split(","))

  println("Iterate RDD")
  rdd.foreach(f => {
    println("Col1:"+f(0)+",Col2:"+f(1))
  })
  println(rdd)

  // Collect the records on the driver
  rdd.collect().foreach(f => println("Col1:"+f(0)+",Col2:"+f(1)))

  println("read all csv files from a directory to single RDD")
  val rdd2 = spark.sparkContext.textFile("C:/tmp/files/*")
  rdd2.foreach(f => {
    println(f)
  })
}
Note that once we create an RDD, we can easily create a DataFrame from RDD.
Let’s see how to create an RDD in Apache Spark with examples:
Spark create RDD from Seq or List (using Parallelize)
Creating an RDD from a text file
Creating from another RDD
Creating from existing DataFrames and DataSet
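The output below can be produced by parallelizing a small collection of pairs; a minimal sketch (the spark session variable is assumed):
// Create an RDD from a Seq of pairs using parallelize()
val rdd = spark.sparkContext.parallelize(Seq(("Python", 100000), ("Scala", 3000), ("Java", 20000)))
rdd.foreach(println)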
(Python,100000)
(Scala,3000)
(Java,20000)
When you read a file with textFile(), this creates an RDD in which each record represents a line of the file.
If you want to read the entire content of a file as a single record use wholeTextFiles() method on
sparkContext.
// RDD from wholeTextFile
val rdd2 = spark.sparkContext.wholeTextFiles("/path/textFile.txt")
rdd2.foreach(record=>println("FileName : "+record._1+", FileContents :"+record._2))
In this case, each text file is a single record. The name of the file is the first column, and the content of the file is the second column.
Conclusion:
In this article, you have learned how to create a Spark RDD from a List or Seq, a text file, another RDD, a DataFrame, and a Dataset.
______________________________________________________________________________
saveAsTextFile(path: String, codec: Class[_ <: CompressionCodec]): Unit – Saves the RDD as a compressed text file.
top(num: Int)(implicit ord: Ordering[T]): Array[T] – Returns the top num elements. Note: Use this method only when the resulting array is small, as all the data is loaded into the driver's memory.
spark.sparkContext.setLogLevel("ERROR")
val inputRDD = spark.sparkContext.parallelize(List(("Z", 1),("A", 20),("B", 30),("C", 40),("B", 30),
("B", 60)))
//aggregate
def param0= (accu:Int, v:Int) => accu + v
def param1= (accu1:Int,accu2:Int) => accu1 + accu2
println("aggregate : "+listRdd.aggregate(0)(param0,param1))
//Output: aggregate : 20
//aggregate on a pair RDD (reconstructed from the complete example below)
def param3= (accu:Int, v:(String,Int)) => accu + v._2
def param4= (accu1:Int,accu2:Int) => accu1 + accu2
println("aggregate : "+inputRDD.aggregate(0)(param3,param4))
//Output: aggregate : 181
treeAggregate – action
treeAggregate() – Aggregates the elements of this RDD in a multi-level tree pattern. The output
of this function will be similar to the aggregate function.
fold – action
fold() – Aggregate the elements of each partition, and then the results for all the partitions.
//fold
println("fold : "+listRdd.fold(0){ (acc,v) =>
val sum = acc+v
sum
})
//Output: fold : 20
println("fold : "+inputRDD.fold(("Total",0)){(acc:(String,Int),v:(String,Int))=>
val sum = acc._2 + v._2
("Total",sum)
})
//Output: fold : (Total,181)
reduce
reduce() – Reduces the elements of the dataset using the specified binary operator.
//reduce
println("reduce : "+listRdd.reduce(_ + _))
//Output: reduce : 20
println("reduce alternate : "+listRdd.reduce((x, y) => x + y))
//Output: reduce alternate : 20
println("reduce : "+inputRDD.reduce((x, y) => ("Total",x._2 + y._2)))
//Output: reduce : (Total,181)
treeReduce
treeReduce() – Reduces the elements of this RDD in a multi-level tree pattern.
collect
collect() -Return the complete dataset as an Array.
//Collect
val data:Array[Int] = listRdd.collect()
data.foreach(println)
//count
println("Count : "+listRdd.count())
//Output: Count : 7
println("countApprox : "+listRdd.countApprox(1200))
//Output: countApprox : (final: [7.000, 7.000])
println("countApproxDistinct : "+listRdd.countApproxDistinct())
//Output: countApproxDistinct : 5
println("countApproxDistinct : "+inputRDD.countApproxDistinct())
//Output: countApproxDistinct : 5
countByValue, countByValueApprox
countByValue() – Returns a Map[T,Long] where each key represents a unique value in the dataset and the value represents how many times it occurs.
countByValueApprox() – Same as countByValue() but returns an approximate result.
//countByValue, countByValueApprox
println("countByValue : "+listRdd.countByValue())
//Output: countByValue : Map(5 -> 1, 1 -> 1, 2 -> 2, 3 -> 2, 4 -> 1)
//println(listRdd.countByValueApprox())
first
first() – Return the first element in the dataset.
//first
println("first : "+listRdd.first())
//Output: first : 1
println("first : "+inputRDD.first())
//Output: first : (Z,1)
top
top() – Return top n elements from the dataset.
Note: Use this method only when the resulting array is small, as all the data is loaded into the
driver’s memory.
//top
println("top : "+listRdd.top(2).mkString(","))
//Output: top : 5,4
println("top : "+inputRDD.top(2).mkString(","))
//Output: top : (Z,1),(C,40)
max
max() – Return the maximum value from the dataset.
//max
println("max : "+listRdd.max())
//Output: max : 5
println("max : "+inputRDD.max())
//Output: max : (Z,1)
package com.sparkbyexamples.spark.rdd
import com.sparkbyexamples.spark.rdd.OperationOnPairRDDComplex.kv
import org.apache.spark.sql.SparkSession
import scala.collection.mutable
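// The object declaration, SparkSession, and listRdd definition are missing from
// this listing; a sketch consistent with the outputs in this section (names assumed):
object RDDActionsExample extends App {
val spark = SparkSession.builder()
  .master("local[1]")
  .appName("SparkByExamples.com")
  .getOrCreate()
val listRdd = spark.sparkContext.parallelize(List(1,2,3,4,5,3,2))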
spark.sparkContext.setLogLevel("ERROR")
val inputRDD = spark.sparkContext.parallelize(List(("Z", 1),("A", 20),("B", 30),("C", 40),("B", 30),
("B", 60)))
//Collect
val data:Array[Int] = listRdd.collect()
data.foreach(println)
//aggregate
def param0= (accu:Int, v:Int) => accu + v
def param1= (accu1:Int,accu2:Int) => accu1 + accu2
println("aggregate : "+listRdd.aggregate(0)(param0,param1))
//Output: aggregate : 20
//aggregate
def param3= (accu:Int, v:(String,Int)) => accu + v._2
def param4= (accu1:Int,accu2:Int) => accu1 + accu2
println("aggregate : "+inputRDD.aggregate(0)(param3,param4))
//Output: aggregate : 181
//fold
println("fold : "+listRdd.fold(0){ (acc,v) =>
val sum = acc+v
sum
})
//Output: fold : 20
println("fold : "+inputRDD.fold(("Total",0)){(acc:(String,Int),v:(String,Int))=>
val sum = acc._2 + v._2
("Total",sum)
})
//Output: fold : (Total,181)
//reduce
println("reduce : "+listRdd.reduce(_ + _))
//Output: reduce : 20
println("reduce alternate : "+listRdd.reduce((x, y) => x + y))
//Output: reduce alternate : 20
println("reduce : "+inputRDD.reduce((x, y) => ("Total",x._2 + y._2)))
//Output: reduce : (Total,181)
//countByValue, countByValueApprox
println("countByValue : "+listRdd.countByValue())
//Output: countByValue : Map(5 -> 1, 1 -> 1, 2 -> 2, 3 -> 2, 4 -> 1)
//println(listRdd.countByValueApprox())
//first
println("first : "+listRdd.first())
//Output: first : 1
println("first : "+inputRDD.first())
//Output: first : (Z,1)
//top
println("top : "+listRdd.top(2).mkString(","))
//Output: top : 5,4
println("top : "+inputRDD.top(2).mkString(","))
//Output: top : (Z,1),(C,40)
//min
println("min : "+listRdd.min())
//Output: min : 1
println("min : "+inputRDD.min())
//Output: min : (A,20)
//max
println("max : "+listRdd.max())
//Output: max : 5
println("max : "+inputRDD.max())
//Output: max : (Z,1)
//toLocalIterator
//listRdd.toLocalIterator.foreach(println)
//Output:
}
Conclusion:
RDD actions are operations that return non-RDD values. Since RDDs are lazy, the transformation functions do not execute until we call actions; all of these action functions trigger the transformations and finally return the result to the driver program. In this tutorial, you have also learned the usage of several RDD action functions with examples in Scala.
aggregateByKey – Aggregates the values of each key in the dataset. This function can return a different result type than the type of the values in the input RDD.
flatMapValues – Flattens the values of each key without changing the keys and keeps the original RDD partitioning.
groupByKey – Returns a grouped RDD by grouping the values of each key.
mapValues – Applies a map function to each value of a pair RDD without changing the keys.
reduceByKeyLocally – Returns a merged RDD by merging the values of each key; the final result is sent to the master as a Map.
subtractByKey – Returns an RDD with the pairs from this RDD whose keys are not in the other RDD.
join – Returns an RDD after applying a join on the current and parameter RDDs.
countByKey – Returns the count of elements for each key. This returns the final result as a local Map on the driver.
countByKeyApprox – Same as countByKey but returns a partial result. This takes a timeout parameter specifying how long the function may run before returning.
lookup – Returns a list of values from the RDD for a given input key.
saveAsHadoopDataset – Saves the RDD to any Hadoop-supported file system (HDFS, S3, ElasticSearch, etc.); it uses a Hadoop JobConf object to save.
saveAsHadoopFile – Saves the RDD to any Hadoop-supported file system (HDFS, S3, ElasticSearch, etc.); it uses a Hadoop OutputFormat class to save.
saveAsNewAPIHadoopDataset – Saves the RDD to any Hadoop-supported file system (HDFS, S3, ElasticSearch, etc.) with the new Hadoop API; it uses a Hadoop Configuration object to save.
saveAsNewAPIHadoopFile – Saves the RDD to any Hadoop-supported file system (HDFS, S3, ElasticSearch, etc.); it uses the new Hadoop API OutputFormat class to save.
(Germany,1)
(India,1)
(USA,1)
(USA,1)
(India,1)
(Russia,1)
(India,1)
(Brazil,1)
(Canada,1)
(China,1)
(Germany,1)
(India,1)
(Brazil,1)
(China,1)
(USA,1)
(Canada,1)
(Russia,1)
(Brazil,1)
(Canada,1)
(China,1)
(Germany,1)
(India,1)
(India,1)
(India,1)
(Russia,1)
(USA,1)
(USA,1)
reduceByKey – A transformation that returns an RDD after merging the values for each key. The resulting RDD contains unique keys.
// reduceByKey() on pairRDD
println("Reduce by Key ==>")
val wordCount = pairRDD.reduceByKey((a,b)=>a+b)
wordCount.foreach(println)
This reduces the key by summing the values. Yields below output.
// Output:
(Brazil,1)
(Canada,1)
(China,1)
(USA,2)
(Germany,1)
(Russia,1)
(India,3)
collectAsMap – This is an action function that returns all the key-value pairs of the dataset as a Map to the driver.
// collectAsMap() to retrieve all data from a dataset
println("collectAsMap ==>")
pairRDD.collectAsMap().foreach(println)
// Output:
(Brazil,1)
(Canada,1)
(Germany,1)
(China,1)
(Russia,1)
(India,1)
Complete Example
package com.sparkbyexamples.spark.rdd
import org.apache.spark.sql.SparkSession
import scala.collection.mutable
object OperationsOnPairRDD {
def main(args: Array[String]): Unit = {
val spark = SparkSession.builder()
.appName("SparkByExample")
.master("local")
.getOrCreate()
spark.sparkContext.setLogLevel("ERROR")
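// The sample data and pairRDD definition are missing from this listing; a sketch
// consistent with the country outputs shown earlier (data assumed):
val data = Seq("Germany India USA", "USA India Russia", "India Brazil Canada China")
val wordsRdd = spark.sparkContext.parallelize(data).flatMap(_.split(" "))
val pairRDD = wordsRdd.map(w => (w, 1))
pairRDD.foreach(println)
// wordCount2, used further below, is assumed to be the reduced pair RDD:
val wordCount2 = pairRDD.reduceByKey(_ + _)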
println("Distinct ==>")
pairRDD.distinct().foreach(println)
//SortByKey
println("Sort by Key ==>")
val sortRDD = pairRDD.sortByKey()
sortRDD.foreach(println)
//reduceByKey
println("Reduce by Key ==>")
val wordCount = pairRDD.reduceByKey((a,b)=>a+b)
wordCount.foreach(println)
//keys
println("Keys ==>")
wordCount2.keys.foreach(println)
//values
println("values ==>")
wordCount2.values.foreach(println)
println("Count :"+wordCount2.count())
println("collectAsMap ==>")
pairRDD.collectAsMap().foreach(println) } }
Key-Value Structure: In a Pair RDD, each element is represented as a tuple (key, value),
where key and value can be of any data type, including basic types, custom objects, or even other
RDDs.
Grouping: PairRDDs are often used for grouping data by keys. For example, you can group
data by a specific attribute in a dataset to perform operations on groups of data with the same key.
Aggregations: PairRDDs are useful for performing aggregation operations such
as reduceByKey, groupByKey, combineByKey, and foldByKey to summarize data based on keys.
Joins: PairRDDs can be joined with other PairRDDs based on their keys using operations
like join, leftOuterJoin, rightOuterJoin, and cogroup.
Transformations: Various transformations can be applied to PairRDDs, such
as mapValues, flatMapValues, and filter, which allow you to manipulate the values associated with
keys.
How do I create a pairRDD?
PairRDDs can be created by running a map() function that returns key/value pairs. PairRDDs are
commonly used in Spark for operations like groupByKey, reduceByKey, join, and other operations
that involve key-value pairs.
We can first create an RDD either by using parallelize() or from an existing RDD, and then apply a map() transformation to make key-value pairs, as shown below.
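A minimal sketch (sample data assumed, using the usual spark session variable):
// Create a pair RDD by mapping each element to a (key, value) tuple
val rdd = spark.sparkContext.parallelize(List("Spark", "Scala", "Spark"))
val pairRDD = rdd.map(w => (w, 1))
pairRDD.reduceByKey(_ + _).foreach(println)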
What is the difference between RDD and pairRDD?
RDD is a distributed collection of data that can be processed in parallel across a cluster of
machines.
RDD can hold any type of data, including simple values, objects, or more complex data structures.
RDD is used for general-purpose distributed data processing and can be transformed and
processed using various operations like map, filter, reduce, groupBy, join, etc. RDDs do not have
any inherent key-value structure; they are typically used for non-keyed data.
PairRDD is specialized key-value data and is designed for operations that involve keys. PairRDDs
are commonly used in Spark when you need to work with structured data that can be organized
and processed based on keys, making them suitable for many data processing tasks, especially in
the context of data analytics and transformations.
Conclusion:
In this tutorial, you have learned about the PairRDDFunctions class and Spark PairRDD transformation and action functions with Scala examples.
______________________________________________________________________________
val spark:SparkSession = SparkSession.builder()
  .master("local[5]")
  .appName("SparkByExamples.com")
  .getOrCreate()
val rdd = spark.sparkContext.parallelize(Range(0,20))
println("From local[5]"+rdd.partitions.size)
Partition 1 : 0 1 2
Partition 2 : 3 4 5
Partition 3 : 6 7 8 9
Partition 4 : 10 11 12
Partition 5 : 13 14 15
Partition 6 : 16 17 18 19
Partition 1 : 1 6 10 15 19
Partition 2 : 2 3 7 11 16
Partition 3 : 4 8 12 13 17
Partition 4 : 0 5 9 14 18
package com.sparkbyexamples.spark.rdd
import org.apache.spark.sql.SparkSession
object RDDRepartitionExample extends App {
val spark:SparkSession = SparkSession.builder()
.master("local[5]")
.appName("SparkByExamples.com")
.getOrCreate()
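// The definition of rdd1 is missing from this listing; a sketch consistent with
// the 6-partition listing shown above (partition count assumed):
val rdd1 = spark.sparkContext.parallelize(Range(0, 20), 6)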
println("parallelize : "+rdd1.partitions.size)
rdd1.partitions.foreach(f=> f.toString)
val rddFromFile = spark.sparkContext.textFile("src/main/resources/test.txt",9)
println("TextFile : "+rddFromFile.partitions.size)
rdd1.saveAsTextFile("c:/tmp/partition")
val rdd2 = rdd1.repartition(4)
println("Repartition size : "+rdd2.partitions.size)
rdd2.saveAsTextFile("c:/tmp/re-partition")
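// The definition of rdd3 is missing from this listing; coalesce() reduces the
// number of partitions without a full shuffle (target count assumed):
val rdd3 = rdd2.coalesce(3)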
rdd3.saveAsTextFile("c:/tmp/coalesce")
}
2. Spark DataFrame repartition() vs coalesce()
Unlike RDD, you can’t specify the partition/parallelism while creating DataFrame. DataFrame or
Dataset by default uses the methods specified in Section 1 to determine the default partition and
splits the data for parallelism.
val spark:SparkSession = SparkSession.builder()
.master("local[5]")
.appName("SparkByExamples.com")
.getOrCreate()
val df = spark.range(0,20)
println(df.rdd.partitions.length)
df.write.mode(SaveMode.Overwrite).csv("partition.csv")
The above example creates 5 partitions as specified in master("local[5]") and the data is distributed across
all these 5 partitions.
Partition 1 : 0 1 2 3
Partition 2 : 4 5 6 7
Partition 3 : 8 9 10 11
Partition 4 : 12 13 14 15
Partition 5 : 16 17 18 19
Partition 1 : 14 1 5
Partition 2 : 4 16 15
Partition 3 : 8 3 18
Partition 4 : 12 2 19
Partition 5 : 6 17 7 0
Partition 6 : 9 10 11 13
Even decreasing the number of partitions with repartition() results in moving data from all partitions. Hence, when you want to decrease the number of partitions, the recommendation is to use coalesce(), as sketched below.
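A minimal sketch of reducing the DataFrame to two partitions (variable name and output path assumed), which produces the distribution shown below:
// coalesce() reduces partitions without a full shuffle
val df3 = df.coalesce(2)
println(df3.rdd.partitions.length)
df3.write.mode("overwrite").csv("c:/tmp/coalesce.csv")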
Partition 1 : 0 1 2 3 8 9 10 11
Partition 2 : 4 5 6 7 12 13 14 15 16 17 18 19
Since we are reducing 5 partitions to 2, the data movement happens from only 3 of the partitions, and it moves into the 2 remaining partitions.
In this Spark repartition and coalesce article, you have learned how to create an RDD with
partition, repartition the RDD & DataFrame using repartition() and coalesce() methods, and
learned the difference between repartition and coalesce.
.master("local[5]")
.appName("SparkByExamples.com")
.getOrCreate()
val sc = spark.sparkContext
//ReduceBy transformation
val rdd5 = rdd2.reduceByKey(_ + _)
// Output
RDD Partition Count : 3
RDD Partition Count : 3
Both getNumPartitions calls in the above examples return the same number of partitions. Though reduceByKey() triggers a data shuffle, it doesn't change the partition count, as RDDs inherit the partition count from the parent RDD. You may get different partition counts depending on your setup and how Spark creates partitions.
import spark.implicits._
val simpleData = Seq(("James","Sales","NY",90000,34,10000),
("Michael","Sales","NY",86000,56,20000),
("Robert","Sales","CA",81000,30,23000),
("Maria","Finance","CA",90000,24,23000),
("Raman","Finance","CA",99000,40,24000),
("Scott","Finance","NY",83000,36,19000),
("Jen","Finance","NY",79000,53,15000),
("Jeff","Marketing","CA",80000,25,18000),
("Kumar","Marketing","NY",91000,50,21000)
)
val df = simpleData.toDF("employee_name","department","state","salary","age","bonus")
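The df2 referenced below is not defined in this extract; a plausible sketch is an aggregation that triggers a shuffle, so its partition count follows the spark.sql.shuffle.partitions setting:
// A groupBy aggregation triggers a shuffle (grouping column assumed)
val df2 = df.groupBy("state").count()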
println(df2.rdd.getNumPartitions)
Conclusion
In this article, you have learned what the Spark SQL shuffle is, how some Spark operations trigger re-partitioning of the data, how to change the default Spark shuffle partition setting, and how to choose the right partition size.
Example
// Create sparkSession and apply cache() on DataFrame
val spark:SparkSession = SparkSession.builder()
.master("local[1]")
.appName("SparkByExamples.com")
.getOrCreate()
import spark.implicits._
val columns = Seq("Seqno","Quote")
val data = Seq(("1", "Be the change that you wish to see in the world"),
("2", "Everyone thinks of changing the world, but no one thinks of changing himself."),
("3", "The purpose of our lives is to be happy."))
val df = data.toDF(columns:_*)
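The DataFrame above can then be cached; a minimal sketch of applying cache() and triggering it with an action:
// cache() is lazy; the data is cached when the first action runs
val dfCache = df.cache()
dfCache.show(false)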
1) persist() : Dataset.this.type
2) persist(newLevel : org.apache.spark.storage.StorageLevel) : Dataset.this.type
Example
// Persist Example
val dfPersist = df.persist()
dfPersist.show(false)
Using the second signature, you can save the DataFrame/Dataset to one of the storage levels MEMORY_ONLY, MEMORY_AND_DISK, MEMORY_ONLY_SER, MEMORY_AND_DISK_SER, DISK_ONLY, MEMORY_ONLY_2, and MEMORY_AND_DISK_2.
MEMORY_ONLY – This is the default behavior of the RDD cache() method; it stores the RDD or DataFrame as deserialized objects in JVM memory. When there is not enough memory available, it will not save some partitions of the DataFrame, and these will be re-computed as and when required. This takes more memory, but unlike an RDD, this can be slower than the MEMORY_AND_DISK level because it recomputes the unsaved partitions, and recomputing the in-memory columnar representation of the underlying table is expensive.
MEMORY_ONLY_SER – This is the same as MEMORY_ONLY but the difference being it stores
RDD as serialized objects to JVM memory. It takes lesser memory (space-efficient) then
MEMORY_ONLY as it saves objects as serialized and takes an additional few more CPU cycles in
order to deserialize.
MEMORY_ONLY_2 – Same as MEMORY_ONLY storage level but replicate each partition to two
cluster nodes.
MEMORY_ONLY_SER_2 – Same as MEMORY_ONLY_SER storage level but replicate each
partition to two cluster nodes.
MEMORY_AND_DISK – This is the default behavior of the DataFrame or Dataset. In this Storage
Level, The DataFrame will be stored in JVM memory as a deserialized object. When required
storage is greater than available memory, it stores some of the excess partitions into the disk and
reads the data from the disk when required. It is slower as there is I/O involved.
MEMORY_AND_DISK_SER – This is the same as MEMORY_AND_DISK storage level difference
being it serializes the DataFrame objects in memory and on disk when space is not available.
MEMORY_AND_DISK_2 – Same as MEMORY_AND_DISK storage level but replicate each
partition to two cluster nodes.
MEMORY_AND_DISK_SER_2 – Same as MEMORY_AND_DISK_SER storage level but replicate
each partition to two cluster nodes.
DISK_ONLY – In this storage level, DataFrame is stored only on disk and the CPU computation
time is high as I/O is involved.
DISK_ONLY_2 – Same as DISK_ONLY storage level but replicate each partition to two cluster
nodes.
Below is a table representation of the storage levels. Consider the impact on space, CPU, and performance, and choose the one that best fits your needs.
(Storage level comparison table: columns are Storage Level, Space used, CPU time, In memory, On-disk, Serialized, and Recompute some partitions.)
Spark caching and persistence is just one of the optimization techniques to improve the
performance of Spark jobs.
For RDD cache(), the default storage level is ‘MEMORY_ONLY’ but, for DataFrame and
Dataset, the default is ‘MEMORY_AND_DISK‘
On Spark UI, the Storage tab shows where partitions exist in memory or disk across the
cluster.
Dataset cache() is an alias for persist(StorageLevel.MEMORY_AND_DISK)
Caching of Spark DataFrame or Dataset is a lazy operation, meaning a DataFrame will not
be cached until you trigger an action.
Conclusion
In this article, you have learned that the Spark cache() and persist() methods are optimization techniques for saving interim computation results and reusing them subsequently, learned the difference between Spark cache and persist, and finally saw their syntax and usage with Scala examples.
______________________________________________________________________________
scala> broadcastVar.value
res0: Array[Int] = Array(0, 1, 2, 3)
3. Spark RDD Broadcast variable example
Below is a very simple example of how to use broadcast variables on RDD. This example defines
commonly used data (country and states) in a Map variable and distributes the variable
using SparkContext.broadcast() and then use these variables on RDD map() transformation.
import org.apache.spark.sql.SparkSession
object RDDBroadcast extends App {
val spark = SparkSession.builder()
.appName("SparkByExamples.com")
.master("local")
.getOrCreate()
val states = Map(("NY","New York"),("CA","California"),("FL","Florida"))
val countries = Map(("USA","United States of America"),("IN","India"))
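The rest of the example is missing from this extract; a minimal sketch that broadcasts these maps and uses them inside an RDD map() transformation (sample rows assumed):
// Broadcast the lookup maps to all executors
val broadcastStates = spark.sparkContext.broadcast(states)
val broadcastCountries = spark.sparkContext.broadcast(countries)

val data = Seq(("James", "Smith", "USA", "CA"), ("Michael", "Rose", "USA", "NY"))
val rdd = spark.sparkContext.parallelize(data)

// Look up the full country and state names from the broadcast variables
val result = rdd.map(f => (f._1, f._2,
  broadcastCountries.value(f._3), broadcastStates.value(f._4)))
result.foreach(println)
}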
Conclusion
In this Spark broadcast variable article, you have learned what a broadcast variable is, its advantages, and how to use it in RDD and DataFrame with Scala examples.
Note: Each of these accumulator classes has several methods; among them, the add() method is called from tasks running on the cluster. Tasks cannot read the value of an accumulator; only the driver program can read it, using the value method.
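A minimal sketch of a long accumulator (assuming the usual spark session variable):
// Create a named long accumulator and add to it from tasks
val accum = spark.sparkContext.longAccumulator("SumAccumulator")
spark.sparkContext.parallelize(Array(1, 2, 3)).foreach(x => accum.add(x))
// Only the driver can read the accumulated value
println(accum.value)  // 6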
Conclusion
In this Spark accumulators shared variable article, you have learned that accumulators are only "added" to through an associative and commutative operation, that they are used to implement counters (similar to MapReduce counters) or sums, and you have also learned about the different accumulator classes along with their methods.
______________________________________________________________________________
toDF() has another signature that takes arguments to define column names as shown below.
val dfFromRDD1 = rdd.toDF("language","users_count")
dfFromRDD1.printSchema()
Outputs below schema.
root
 |-- language: string (nullable = true)
 |-- users_count: string (nullable = true)
By default, the datatype of these columns is inferred from the data, and nullable is set to true. We can change this behavior by supplying a schema using StructType, where we can specify the column name, data type, and nullability for each field/column.
Convert RDD to DataFrame – Using createDataFrame()
SparkSession class provides createDataFrame() method to create DataFrame and it takes rdd
object as an argument. and chain it with toDF() to specify names to the columns.
val columns = Seq("language","users_count")
val dfFromRDD2 = spark.createDataFrame(rdd).toDF(columns:_*)
Here, we are using the Scala operator :_* to expand the columns array into comma-separated arguments.
Using RDD Row type RDD[Row] to DataFrame
Spark createDataFrame() has another signature which takes the RDD[Row] type and schema for
column names as arguments. To use this first, we need to convert our “rdd” object from RDD[T] to
RDD[Row]. To define a schema, we use StructType that takes an array of StructField. And
StructField takes column name, data type and nullable/not as arguments.
//From RDD (USING createDataFrame and Adding schema using StructType)
val schema = StructType(columns
.map(fieldName => StructField(fieldName, StringType, nullable = true)))
//convert RDD[T] to RDD[Row]
val rowRDD = rdd.map(attributes => Row(attributes._1, attributes._2))
//+------+------+
//|    _1|    _2|
//+------+------+
//|Python|100000|
//| Scala|  3000|
//|  Java| 20000|
//+------+------+
//root
// |-- _1: string (nullable = true)
// |-- _2: string (nullable = true)
Since RDD is schema-less without column names and data type, converting from RDD to
DataFrame gives you default column names as _1, _2 and so on and data type as String. Use
DataFrame printSchema() to print the schema to console.
Assign Column Names to DataFrame
toDF() has another signature to assign a column name, this takes a variable number of arguments
for column names as shown below.
// Create DataFrame with custom column names
val dfFromRDD1 = rdd.toDF("language","users_count")
dfFromRDD1.show()
dfFromRDD1.printSchema()
// Output:
//+--------+-----------+
//|language|users_count|
//+--------+-----------+
//|  Python|     100000|
//|   Scala|       3000|
//|    Java|      20000|
//+--------+-----------+
//root
// |-- language: string (nullable = true)
// |-- users_count: string (nullable = true)
Remember, here, we just assigned column names. The data types are still Strings.
By default, the datatype of these columns is assigned to String. We can change this behavior by
supplying schema – where we can specify a column name, data type and nullable for each
field/column.
1.2 Using Spark createDataFrame() from SparkSession
Using createDataFrame() from SparkSession is another way to create. This signature also takes
rdd object as an argument and chain with toDF() to specify column names.
// Using createDataFrame()
val dfFromRDD2 = spark.createDataFrame(rdd).toDF(columns:_*)
Here, toDF(columns: _*): assigns column names to the DataFrame using the provided columns list
or sequence. The _* is a syntax to pass a variable number of arguments. It facilitates converting
the elements in columns into separate arguments for the toDF method.
1.3 Using createDataFrame() with the Row type
createDataFrame() has another signature that takes the RDD[Row] type and schema for column
names as arguments. To use this first, we need to convert our “rdd” object
from RDD[T] to RDD[Row] and define a schema using StructType & StructField.
// Additional Imports
import org.apache.spark.sql.types.{StringType, StructField, StructType}
import org.apache.spark.sql.Row
// Create StructType Schema
val schema = StructType( Array(
StructField("language", StringType,true),
StructField("users", StringType,true)
))
// Use map() transformation to get Row type
val rowRDD = rdd.map(attributes => Row(attributes._1, attributes._2))
val dfFromRDD3 = spark.createDataFrame(rowRDD,schema)
Here, attributes._1 and attributes._2 represent the first and second components of each element
in the original RDD. The transformation maps each element of rdd to a Row object with two fields,
essentially converting a pair of attributes into a structured row.
val df = spark.read
.format("com.databricks.spark.xml")
.option("rowTag", "person")
.xml("src/main/resources/persons.xml")
root
+--------------------+------------------+-----+------+
| name| languages|state|gender|
+--------------------+------------------+-----+------+
| [Anna, Rose, ]|[Spark, Java, C++]| NY| F|
+--------------------+------------------+-----+------+
To select rows of a DataFrame based on multiple conditions, you can use either a Column with a condition or a SQL expression. Below is just a simple example; you can extend this with AND (&&), OR (||), and NOT (!) conditional expressions as needed.
// Multiple condition
df.where(df("state") === "OH" && df("gender") === "M")
.show(false)
package com.sparkbyexamples.spark.dataframe
spark.sparkContext.setLogLevel("ERROR")
.add("lastname",StringType))
.add("languages", ArrayType(StringType))
.add("state", StringType)
.add("gender", StringType)
val df = spark.createDataFrame(
spark.sparkContext.parallelize(arrayStructureData),arrayStructureSchema)
df.printSchema()
df.show()
// Condition
df.where(df("state") === "OH")
.show(false)
// SQL Expression
df.where("gender == 'M'")
.show(false)
// Multiple condition
df.where(df("state") === "OH" && df("gender") === "M")
.show(false)
// Array condition
df.where(array_contains(df("languages"),"Java"))
.show(false)
// Struct condition
df.where(df("name.lastname") === "Williams")
.show(false)
Conclusion
In this tutorial, I've explained how to select rows from a Spark DataFrame based on single or multiple conditions and SQL expressions using the where() function, and you have also learned how to filter rows by providing conditions on array and struct columns with Scala examples.
Alternatively, you can also use the filter() function to filter rows of a DataFrame.
______________________________________________________________________________
Spark withColumn() method introduces a projection internally. Therefore, calling it multiple times,
for instance, via loops in order to add multiple columns can generate big plans which can cause
performance issues and even StackOverflowException. To avoid this, use select with the multiple
columns at once.
Spark Documentation
First, let’s create a DataFrame to work with.
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{StringType, StructType}
val data = Seq(Row(Row("James;","","Smith"),"36636","M","3000"),
Row(Row("Michael","Rose",""),"40288","M","4000"),
Row(Row("Robert","","Williams"),"42114","M","4000"),
Row(Row("Maria","Anne","Jones"),"39192","F","4000"),
Row(Row("Jen","Mary","Brown"),"","F","-1")
)
val schema = new StructType()
.add("name",new StructType()
.add("firstname",StringType)
.add("middlename",StringType)
.add("lastname",StringType))
.add("dob",StringType)
.add("gender",StringType)
.add("salary",StringType)
val df = spark.createDataFrame(spark.sparkContext.parallelize(data),schema)
1. Add a New Column to DataFrame
To create a new column, pass your desired column name as the first argument of the withColumn() transformation function. Make sure this new column is not already present on the DataFrame; if it is, withColumn() updates the value of that column instead. In the snippet below, the lit() function is used to add a constant value to a DataFrame column. We can also chain withColumn() calls in order to add multiple columns.
// Add a New Column to DataFrame
import org.apache.spark.sql.functions.lit
df.withColumn("Country", lit("USA"))
The above approach is fine if you are manipulating a few columns, but when you want to add or update multiple columns, do not chain withColumn() calls, as this leads to performance issues; use select() to update multiple columns instead.
2. Change Value of an Existing Column
The Spark withColumn() function of DataFrame can also be used to update the value of an existing column. In order to change the value, pass an existing column name as the first argument and the value to be assigned as the second argument. Note that the second argument must be of Column type.
// Change Value of an Existing Column
import org.apache.spark.sql.functions.col
df.withColumn("salary",col("salary")*100)
This snippet multiplies the value of “salary” with 100 and updates the value back to “salary”
column.
3. Derive New Column From an Existing Column
To create a new column, specify the first argument with a name you want your new column to be
and use the second argument to assign a value by applying an operation on an existing column.
// Derive New Column From an Existing Column
df.withColumn("CopiedColumn",col("salary")* -1)
This snippet creates a new column “CopiedColumn” by multiplying “salary” column with value -1.
4. Change Column Data Type
By using Spark withColumn on a DataFrame and using cast function on a column, we can change
datatype of a DataFrame column. The below statement changes the datatype from String to
Integer for the “salary” column.
// Change Column Data Type
df.withColumn("salary",col("salary").cast("Integer"))
5. Add, Replace, or Update multiple Columns
When you want to add, replace, or update multiple columns in a Spark DataFrame, it is not advisable to chain withColumn() calls, as this leads to performance issues; it is recommended to use select(), or SQL after creating a temporary view on the DataFrame, as shown below.
// Add, Replace, or Update multiple Columns
df2.createOrReplaceTempView("PERSON")
spark.sql("SELECT salary*100 as salary, salary*-1 as CopiedColumn, 'USA' as country FROM
PERSON").show()
6. Rename Column Name
Though examples in 6,7, and 8 doesn’t use withColumn() function, I still feel like explaining how to
rename, drop, and split columns as these would be useful to you.
To rename an existing column use “withColumnRenamed” function on DataFrame.
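A minimal sketch (the new column name is assumed):
// Rename column "gender" to "sex"
df.withColumnRenamed("gender", "sex")
  .printSchema()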
root
|-- City: string (nullable = true)
Note that all of these functions return a new DataFrame after applying the changes instead of updating the existing DataFrame.
.add("salary",StringType)
val df2 = spark.createDataFrame(spark.sparkContext.parallelize(dataRows),schema)
// Droping a column
val df6=df4.drop("CopiedColumn")
println(df6.columns.contains("CopiedColumn"))
// Retrieving
df2.show(false)
df2.select("name").show(false)
df2.select("name.firstname").show(false)
df2.select("name.*").show(false)
import spark.implicits._
val data = Seq(("Robert, Smith", "1 Main st, Newark, NJ, 92537"), ("Maria, Garcia","3456 Walnut
st, Newark, NJ, 94732"))
var dfFromData = spark.createDataFrame(data).toDF(columns:_*)
dfFromData.printSchema()
df2.createOrReplaceTempView("PERSON")
spark.sql("SELECT salary*100 as salary, salary*-1 as CopiedColumn, 'USA' as country FROM
PERSON").show()
}
}
Syntax:
groupBy(col1 : scala.Predef.String, cols : scala.Predef.String*) :
org.apache.spark.sql.RelationalGroupedDataset
When we perform groupBy() on Spark Dataframe, it returns RelationalGroupedDataset object
which contains below aggregate functions.
count() - Returns the count of rows for each group.
+-------------+----------+-----+------+---+-----+
|employee_name|department|state|salary|age|bonus|
+-------------+----------+-----+------+---+-----+
|        Maria|   Finance|   CA| 90000| 24|23000|
+-------------+----------+-----+------+---+-----+
df.groupBy("department").sum("salary").show(false)
+----------+-----------+
|department|sum(salary)|
+----------+-----------+
|Sales |257000 |
|Finance |351000 |
|Marketing |171000 |
+----------+-----------+
Similarly, we can calculate the number of employee in each department using count()
df.groupBy("department").count()
Calculate the minimum salary of each department using min()
df.groupBy("department").min("salary")
Calculate the maximum salary of each department using max()
df.groupBy("department").max("salary")
Calculate the average salary of each department using avg()
df.groupBy("department").avg( "salary")
Calculate the mean salary of each department using mean()
df.groupBy("department").mean( "salary")
groupBy and aggregate on multiple DataFrame columns
Similarly, we can also run groupBy and aggregate on two or more DataFrame columns, below
example does group by on department,state and does sum() on salary and bonus columns.
//GroupBy on multiple columns
df.groupBy("department","state")
.sum("salary","bonus")
.show(false)
This yields the below output.
+----------+-----+-----------+----------+
|department|state|sum(salary)|sum(bonus)|
+----------+-----+-----------+----------+
|Sales     |NY   |176000     |30000     |
|Sales     |CA   |81000      |23000     |
|Finance   |CA   |189000     |47000     |
|Finance   |NY   |162000     |34000     |
|Marketing |CA   |80000      |18000     |
|Marketing |NY   |91000      |21000     |
+----------+-----+-----------+----------+
Similarly, we can run groupBy and aggregate on two or more columns for other aggregate functions; please refer to the source code below for an example.
Running more aggregates at a time
Using the agg() aggregate function, we can calculate many aggregations at a time in a single statement using Spark SQL aggregate functions such as sum(), avg(), min(), max(), mean(), etc. In order to use these, we should import org.apache.spark.sql.functions._
import org.apache.spark.sql.functions._
df.groupBy("department")
.agg(
sum("salary").as("sum_salary"),
avg("salary").as("avg_salary"),
sum("bonus").as("sum_bonus"),
max("bonus").as("max_bonus"))
.show(false)
This example groups on the department column and calculates the sum() and avg() of salary, and the sum() and max() of bonus, for each department.
+----------+----------+-----------------+---------+---------+
|department|sum_salary|avg_salary       |sum_bonus|max_bonus|
+----------+----------+-----------------+---------+---------+
|Sales     |257000    |85666.66666666667|53000    |23000    |
|Finance   |351000    |87750.0          |81000    |24000    |
|Marketing |171000    |85500.0          |39000    |21000    |
+----------+----------+-----------------+---------+---------+
Source code
package com.sparkbyexamples.spark.dataframe
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
object GroupbyExample extends App {
val spark: SparkSession = SparkSession.builder()
.master("local[1]")
.appName("SparkByExamples.com")
.getOrCreate()
spark.sparkContext.setLogLevel("ERROR")
import spark.implicits._
val simpleData = Seq(("James","Sales","NY",90000,34,10000),
("Michael","Sales","NY",86000,56,20000),
("Robert","Sales","CA",81000,30,23000),
("Maria","Finance","CA",90000,24,23000),
("Raman","Finance","CA",99000,40,24000),
("Scott","Finance","NY",83000,36,19000),
("Jen","Finance","NY",79000,53,15000),
("Jeff","Marketing","CA",80000,25,18000),
("Kumar","Marketing","NY",91000,50,21000)
)
val df = simpleData.toDF("employee_name","department","state","salary","age","bonus")
df.show()
df.groupBy("department","state")
.mean("salary","bonus")
.show(false)
//Running Filter
df.groupBy("department","state")
.sum("salary","bonus")
.show(false)
df.groupBy("department")
.agg(
sum("salary").as("sum_salary"),
avg("salary").as("avg_salary"),
sum("bonus").as("sum_bonus"),
stddev("bonus").as("stddev_bonus"))
.where(col("sum_bonus") > 50000)
.show(false)
}
This example is also available at GitHub project for reference.
Conclusion
In this tutorial, you have learned how to use groupBy() and aggregate functions on a Spark DataFrame, how to run these on multiple columns, and finally how to filter data on the aggregated columns.
The rest of the tutorial explains join types using syntax 6, which takes the right join DataFrame, a join expression, and the type of join as a String argument.
For syntaxes 4 & 5, you can use either the "JoinType" or the "Join String" defined in the above table for the joinType argument. When you use "JoinType", you should import org.apache.spark.sql.catalyst.plans._, as this package defines the JoinType objects.
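As a rough sketch (df1 and df2 are hypothetical DataFrames, not the emp/dept DataFrames created below):

import org.apache.spark.sql.catalyst.plans.Inner

// Inner is a JoinType object; its sql method yields the join string "INNER",
// which the three-argument join() accepts as the joinType parameter.
val joined = df1.join(df2, df1("id") === df2("id"), Inner.sql)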
All join objects are defined in org.apache.spark.sql.catalyst.plans; in order to use them, you need to import org.apache.spark.sql.catalyst.plans.{LeftOuter, Inner, ...}.
Before we jump into Spark SQL join examples, let's first create emp and dept DataFrames. Here, the emp_id column is unique on emp, dept_id is unique on dept, and emp_dept_id from emp references dept_id on the dept dataset.
val emp = Seq((1,"Smith",-1,"2018","10","M",3000),
(2,"Rose",1,"2010","20","M",4000),
(3,"Williams",1,"2010","10","M",1000),
(4,"Jones",2,"2005","10","F",2000),
(5,"Brown",2,"2010","40","",-1),
(6,"Brown",2,"2010","50","",-1)
)
val empColumns = Seq("emp_id","name","superior_emp_id","year_joined",
"emp_dept_id","gender","salary")
import spark.sqlContext.implicits._
val empDF = emp.toDF(empColumns:_*)
empDF.show(false)
val dept = Seq(("Finance",10),
("Marketing",20),
("Sales",30),
("IT",40)
)
val deptColumns = Seq("dept_name","dept_id")
val deptDF = dept.toDF(deptColumns:_*)
deptDF.show(false)
Emp Dataset
+------+--------+---------------+-----------+-----------+------+------+
|emp_id|name    |superior_emp_id|year_joined|emp_dept_id|gender|salary|
+------+--------+---------------+-----------+-----------+------+------+
|1     |Smith   |-1             |2018       |10         |M     |3000  |
|2     |Rose    |1              |2010       |20         |M     |4000  |
|3     |Williams|1              |2010       |10         |M     |1000  |
|4     |Jones   |2              |2005       |10         |F     |2000  |
|5     |Brown   |2              |2010       |40         |      |-1    |
|6     |Brown   |2              |2010       |50         |      |-1    |
+------+--------+---------------+-----------+-----------+------+------+
Dept Dataset
+---------+-------+
|dept_name|dept_id|
+---------+-------+
|Finance |10 |
|Marketing|20 |
|Sales |30 |
|IT |40 |
+---------+-------+
2. Inner Join
Inner join is the default join in Spark and it is the most commonly used. It joins two DataFrames/Datasets on key columns, and rows where the keys don't match are dropped from both datasets (emp & dept).
empDF.join(deptDF,empDF("emp_dept_id") === deptDF("dept_id"),"inner")
.show(false)
When we apply an inner join on our datasets, it drops emp_dept_id 50 from emp and dept_id 30 from dept. Below is the result of the above join expression.
+------+--------+---------------+-----------+-----------+------+------+---------+-------+
|emp_id|name    |superior_emp_id|year_joined|emp_dept_id|gender|salary|dept_name|dept_id|
+------+--------+---------------+-----------+-----------+------+------+---------+-------+
|1     |Smith   |-1             |2018       |10         |M     |3000  |Finance  |10     |
|2     |Rose    |1              |2010       |20         |M     |4000  |Marketing|20     |
|3     |Williams|1              |2010       |10         |M     |1000  |Finance  |10     |
|4     |Jones   |2              |2005       |10         |F     |2000  |Finance  |10     |
|5     |Brown   |2              |2010       |40         |      |-1    |IT       |40     |
+------+--------+---------------+-----------+-----------+------+------+---------+-------+
8. Self Join
Spark joins are not complete without a self join. Though there is no self-join type available, we can use any of the above-explained join types to join a DataFrame to itself. The below example uses an inner self join.
empDF.as("emp1").join(empDF.as("emp2"),
col("emp1.superior_emp_id") === col("emp2.emp_id"),"inner")
.select(col("emp1.emp_id"),col("emp1.name"),
col("emp2.emp_id").as("superior_emp_id"),
col("emp2.name").as("superior_emp_name"))
.show(false)
Here, we are joining the emp dataset with itself to find out the superior emp_id and name for all employees.
+------+--------+---------------+-----------------+
|emp_id|name |superior_emp_id|superior_emp_name|
+------+--------+---------------+-----------------+
|2 |Rose |1 |Smith |
|3 |Williams|1 |Smith |
|4 |Jones |2 |Rose |
|5 |Brown |2 |Rose |
|6 |Brown |2 |Rose |
+------+--------+---------------+-----------------+
Source code
spark.sparkContext.setLogLevel("ERROR")
val emp = Seq((1,"Smith",-1,"2018","10","M",3000),
(2,"Rose",1,"2010","20","M",4000),
(3,"Williams",1,"2010","10","M",1000),
(4,"Jones",2,"2005","10","F",2000),
(5,"Brown",2,"2010","40","",-1),
(6,"Brown",2,"2010","50","",-1)
)
val empColumns =
Seq("emp_id","name","superior_emp_id","year_joined","emp_dept_id","gender","salary")
import spark.sqlContext.implicits._
val empDF = emp.toDF(empColumns:_*)
empDF.show(false)
val dept = Seq(("Finance",10),
("Marketing",20),
("Sales",30),
("IT",40)
)
val deptColumns = Seq("dept_name","dept_id")
val deptDF = dept.toDF(deptColumns:_*)
deptDF.show(false)
println("Inner join")
empDF.join(deptDF,empDF("emp_dept_id") === deptDF("dept_id"),"inner")
.show(false)
println("Outer join")
empDF.join(deptDF,empDF("emp_dept_id") === deptDF("dept_id"),"outer")
.show(false)
println("full join")
empDF.join(deptDF,empDF("emp_dept_id") === deptDF("dept_id"),"full")
.show(false)
println("fullouter join")
empDF.join(deptDF,empDF("emp_dept_id") === deptDF("dept_id"),"fullouter")
.show(false)
println("right join")
empDF.join(deptDF,empDF("emp_dept_id") === deptDF("dept_id"),"right")
.show(false)
println("rightouter join")
empDF.join(deptDF,empDF("emp_dept_id") === deptDF("dept_id"),"rightouter")
.show(false)
println("left join")
empDF.join(deptDF,empDF("emp_dept_id") === deptDF("dept_id"),"left")
.show(false)
println("leftouter join")
empDF.join(deptDF,empDF("emp_dept_id") === deptDF("dept_id"),"leftouter")
.show(false)
println("leftanti join")
empDF.join(deptDF,empDF("emp_dept_id") === deptDF("dept_id"),"leftanti")
.show(false)
println("leftsemi join")
empDF.join(deptDF,empDF("emp_dept_id") === deptDF("dept_id"),"leftsemi")
.show(false)
println("cross join")
empDF.join(deptDF,empDF("emp_dept_id") === deptDF("dept_id"),"cross")
.show(false)
println("Using crossJoin()")
empDF.crossJoin(deptDF).show(false)
println("self join")
empDF.as("emp1").join(empDF.as("emp2"),
col("emp1.superior_emp_id") === col("emp2.emp_id"),"inner")
.select(col("emp1.emp_id"),col("emp1.name"),
col("emp2.emp_id").as("superior_emp_id"),
col("emp2.name").as("superior_emp_name"))
.show(false)
empDF.createOrReplaceTempView("EMP")
deptDF.createOrReplaceTempView("DEPT")
//SQL JOIN
val joinDF = spark.sql("select * from EMP e, DEPT d where e.emp_dept_id == d.dept_id")
joinDF.show(false)
______________________________________________________________________________
Key Points:
One key point to remember: both of these transformations return a Dataset[U] and not a DataFrame (in Spark 2.0, DataFrame = Dataset[Row]).
After applying the transformation function on each row of the input DataFrame/Dataset, these return the same number of rows as the input, but the schema or number of columns of the result could be different.
If you know flatMap() transformation, this is the key difference between map and
flatMap where map returns only one row/element for every input, while flatMap() can return
a list of rows/elements.
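A toy sketch of that difference (not part of the original example; assumes spark.implicits._ is imported):

val ds = Seq("a b", "c").toDS()

// map(): exactly one output element per input element -> 2 rows: "A B", "C"
val upper = ds.map(_.toUpperCase)

// flatMap(): zero or more output elements per input element -> 3 rows: "a", "b", "c"
val words = ds.flatMap(_.split(" "))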
In order to explain map() and mapPartitions() with an example, let's also create a Util class with a method combine(); this is a simple method that takes three string arguments and combines them with a comma delimiter. In real-time applications, this could be a third-party class that does a complex transformation.
class Util extends Serializable {
def combine(fname:String,mname:String,lname:String):String = {
fname+","+mname+","+lname
}
}
We will create an object of this class and call the combine() method for each row in a DataFrame.
Spark map() transformation
Spark map() transformation applies a function to each row in a DataFrame/Dataset and returns the new transformed Dataset. As mentioned earlier, map() returns one row for every row in the input DataFrame; in other words, the input and the result contain exactly the same number of rows.
For example, if you have 100 rows in a DataFrame, after applying the function, map() returns exactly 100 rows. However, the structure or schema of the result could be different.
Syntax:
1) map[U](func : scala.Function1[T, U])(implicit evidence$6 : org.apache.spark.sql.Encoder[U])
: org.apache.spark.sql.Dataset[U]
2) map[U](func : org.apache.spark.api.java.function.MapFunction[T, U], encoder :
org.apache.spark.sql.Encoder[U])
: org.apache.spark.sql.Dataset[U]
Spark provides two map() transformation signatures: one takes scala.Function1 as an argument and the other takes MapFunction. Notice that both of these return Dataset[U] and not DataFrame (which is Dataset[Row]). If you want a DataFrame as output, you need to convert the Dataset to a DataFrame using the toDF() function.
Usage:
import spark.implicits._
val df3 = df2.map(row=>{
// This initialization happens for every record.
// If it is a heavy initialization, like a database connection,
// it degrades the performance.
val util = new Util()
val fullName = util.combine(row.getString(0),row.getString(1),row.getString(2))
(fullName, row.getString(3),row.getInt(5))
})
val df3Map = df3.toDF("fullName","id","salary")
df3Map.printSchema()
df3Map.show(false)
Since map transformations execute on worker nodes, we have initialized and created an object of the Util class inside the map() function, and the initialization happens for every row in the DataFrame. This causes performance issues when you have heavyweight initializations.
Note: When you run this in local mode, initializing the class outside of map() still works, as both the executors and the driver run on the same JVM, but running it on a cluster fails with an exception.
Above example yields below output.
root
|-- fullName: string (nullable = true)
|-- id: string (nullable = true)
|-- salary: integer (nullable = false)
+----------------+-----+------+
|fullName |id |salary|
+----------------+-----+------+
|James,,Smith |36636|3100 |
|Michael,Rose, |40288|4300 |
|Robert,,Williams|42114|1400 |
|Maria,Anne,Jones|39192|5500 |
|Jen,Mary,Brown |34561|3000 |
+----------------+-----+------+
As you can notice in the above output, the input DataFrame has 5 rows, so the result of the map also has 5 rows, but the column counts are different.
Spark mapPartitions() transformation
mapPartitions() also has two signatures: one takes scala.Function1 and the other takes a Spark MapPartitionsFunction argument.
mapPartitions() keeps the result of the partition in memory until it finishes processing all rows in the partition.
Usage:
val df4 = df2.mapPartitions(iterator => {
// Do the heavy initialization here
// Like database connections e.t.c
val util = new Util()
val res = iterator.map(row=>{
val fullName = util.combine(row.getString(0),row.getString(1),row.getString(2))
(fullName, row.getString(3),row.getInt(5))
})
res
})
val df4part = df4.toDF("fullName","id","salary")
df4part.printSchema()
df4part.show(false)
This yields the same output as above.
package com.sparkbyexamples.spark.dataframe.examples
.add("id",StringType)
.add("location",StringType)
.add("salary",IntegerType)
import spark.implicits._
val util = new Util()
val df3 = df2.map(row=>{
df3Map.printSchema()
df3Map.show(false)
In this Spark DataFrame article, you have learned that the map() and mapPartitions() transformations execute a function on each and every row and return the same number of records as the input, but with the same or a different schema or columns. You also learned that when you have a complex initialization, you should use mapPartitions(), as it can do the initialization once per partition instead of once per DataFrame row.
______________________________________________________________________________
+-------+------+-------+
|Product|Amount|Country|
+-------+------+-------+
+-------+------+-------+
This will transpose the countries from DataFrame rows into columns and produces the below output. Wherever data is not present, it is represented as null by default.
+-------+------+-----+------+----+
|Product|Canada|China|Mexico| USA|
+-------+------+-----+------+----+
+-------+-------+-----+
|Product|Country|Total|
+-------+-------+-----+
+-------+-------+-----+
DataFrame union() – the union() method of the DataFrame is used to combine two DataFrames of the same structure/schema. If the schemas are not the same, it returns an error.
DataFrame unionAll() – unionAll() is deprecated since Spark 2.0.0 and replaced with union().
Note: In other SQL dialects, UNION eliminates duplicates but UNION ALL combines two datasets including duplicate records. In Spark, both behave the same; use the DataFrame distinct() function to remove duplicate rows.
import spark.implicits._
val simpleData = Seq(("James","Sales","NY",90000,34,10000),
("Michael","Sales","NY",86000,56,20000),
("Robert","Sales","CA",81000,30,23000),
("Maria","Finance","CA",90000,24,23000)
)
val df = simpleData.toDF("employee_name","department","state","salary","age","bonus")
df.printSchema()
df.show()
df.printSchema() prints the schema and df.show() displays the DataFrame on the console.
// Output:
root
 |-- employee_name: string (nullable = true)
 |-- department: string (nullable = true)
 |-- state: string (nullable = true)
 |-- salary: integer (nullable = false)
 |-- age: integer (nullable = false)
 |-- bonus: integer (nullable = false)

+-------------+----------+-----+------+---+-----+
|employee_name|department|state|salary|age|bonus|
+-------------+----------+-----+------+---+-----+
|        James|     Sales|   NY| 90000| 34|10000|
|      Michael|     Sales|   NY| 86000| 56|20000|
|       Robert|     Sales|   CA| 81000| 30|23000|
|        Maria|   Finance|   CA| 90000| 24|23000|
+-------------+----------+-----+------+---+-----+
Now, let’s create a second Dataframe with the new records and some records from the above
Dataframe but with the same schema.
val simpleData2 = Seq(("James","Sales","NY",90000,34,10000),
("Maria","Finance","CA",90000,24,23000),
("Jen","Finance","NY",79000,53,15000),
("Jeff","Marketing","CA",80000,25,18000),
("Kumar","Marketing","NY",91000,50,21000)
)
val df2 = simpleData2.toDF("employee_name","department","state","salary","age","bonus")
This yields below output
// Output:
+-------------+----------+-----+------+---+-----+
|employee_name|department|state|salary|age|bonus|
+-------------+----------+-----+------+---+-----+
|        James|     Sales|   NY| 90000| 34|10000|
|        Maria|   Finance|   CA| 90000| 24|23000|
|          Jen|   Finance|   NY| 79000| 53|15000|
|         Jeff| Marketing|   CA| 80000| 25|18000|
|        Kumar| Marketing|   NY| 91000| 50|21000|
+-------------+----------+-----+------+---+-----+
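A minimal sketch of combining the two DataFrames with union() and the deprecated unionAll() (the df3 and df4 names are assumed here):

// Combine rows of both DataFrames, keeping duplicates
val df3 = df.union(df2)
df3.show(false)

// unionAll() is deprecated since 2.0 and behaves exactly like union()
val df4 = df.unionAll(df2)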
df4.show(false)
Returns the same output as above.
3. Combine without Duplicates
Since the union() method returns all rows without distinct records, we will use the distinct() function
to return just one record when duplicate exists.
// Combine without Duplicates
val df5 = df.union(df2).distinct()
df5.show(false)
Yields below output. As you see, this returns only distinct rows.
// Output:
+-------------+----------+-----+------+---+-----+
|employee_name|department|state|salary|age|bonus|
+-------------+----------+-----+------+---+-----+
+-------------+----------+-----+------+---+-----+
package com.sparkbyexamples.spark.dataframe
import org.apache.spark.sql.SparkSession
spark.sparkContext.setLogLevel("ERROR")
import spark.implicits._
Conclusion
In this Spark article, you have learned how to combine two or more DataFrames of the same schema into a single DataFrame using the union() method, and learned the difference between the union() and unionAll() functions.
val df = spark.createDataFrame(spark.sparkContext.parallelize(data),schema)
df.printSchema()
df.show(false)
show() function on DataFrame prints the result of the dataset in a table format. By default, it shows
only 20 rows. The above snippet returns the data in a table.
collect() Example
val colList = df.collectAsList()
val colData = df.collect()
colData.foreach(row=>
{
val salary = row.getInt(3)//Index starts from zero
println(salary)
})
df.collect() retrieves all elements of the DataFrame as an array to the driver. From the array, I've retrieved the salary element and printed it on the console.
3000
4000
4000
4000
-1
Retrieving data from Struct column
To retrieve a struct column from Row, we should use getStruct() function.
//Retrieving data from Struct column
colData.foreach(row=>
{
val salary = row.getInt(3)
val fullName:Row = row.getStruct(0) //Index starts from zero
val firstName = fullName.getString(0)//In struct row, again index starts from zero
val middleName = fullName.get(1).toString
val lastName = fullName.getAs[String]("lastname")
println(firstName+","+middleName+","+lastName+","+salary)
})
The above example explains the use of different Row class functions: get(), getString(), getAs[String](), and getStruct().
James ,,Smith,3000
Michael ,Rose,,4000
Robert ,,Williams,4000
Maria ,Anne,Jones,4000
Jen,Mary,Brown,-1
Note that, unlike other DataFrame functions, collect() does not return a DataFrame; instead, it returns the data in an array to your driver. Once the data is collected in an array, you can use the Scala language for further processing.
In case you want to return just certain columns of a DataFrame, you should call select() first.
val dataCollect = df.select("name").collect()
When to avoid Collect()
Usually, collect() is used to retrieve the action output when you have a very small result set. Calling collect() on an RDD/DataFrame with a bigger result set causes out-of-memory errors, as it returns the entire dataset (from all workers) to the driver; hence we should avoid calling collect() on a larger dataset.
collect() vs select()
The select() method on a DataFrame returns a new DataFrame that holds only the selected columns, whereas collect() returns the entire dataset.
select() is a transformation function, whereas collect() is an action.
Complete Example of Spark collect()
Below is a complete Spark example of using collect() and collectAsList() on DataFrame, similarly,
you can also create a program with RDD.
.add("salary",IntegerType)
val df = spark.createDataFrame(spark.sparkContext.parallelize(data),schema)
df.printSchema()
df.show(false)
colData.foreach(row=>
{
val salary = row.getInt(3)//Index starts from zero
println(salary)
})
Conclusion
In this Spark article, you have learned the collect() and collectAsList() functions of the RDD/DataFrame, which return all elements of the DataFrame to the driver program; learned that it is not a good practice to use them on a bigger dataset; and finally retrieved data from a Struct field.
______________________________________________________________________________
In this article, you will learn the difference between caching and persistence and how to use these two with DataFrame and Dataset, using Scala examples.
Though Spark provides computation 100x faster than traditional MapReduce jobs, if you have not designed the jobs to reuse repeating computations, you will see a degradation in performance when dealing with billions or trillions of records. Hence, we may need to look at the stages and use optimization techniques as one of the ways to improve performance.
Using cache() and persist() methods, Spark provides an optimization mechanism to store the
intermediate computation of a Spark DataFrame so they can be reused in subsequent actions.
When you persist a dataset, each node stores its partitioned data in memory and reuses them in
other actions on that dataset. And Spark’s persisted data on nodes are fault-tolerant meaning if
any partition of a Dataset is lost, it will automatically be recomputed using the original
transformations that created it.
Advantages for Caching and Persistence of DataFrame
Below are the advantages of using Spark Cache and Persist methods.
Cost-efficient – Spark computations are very expensive; hence reusing the computations helps to save cost.
Time-efficient – Reusing repeated computations saves lots of time.
Execution time – Saves execution time of the job and we can perform more jobs on the
same cluster.
Spark Cache Syntax and Example
Spark DataFrame or Dataset cache() method by default saves it to storage level
`MEMORY_AND_DISK` because recomputing the in-memory columnar representation of the
underlying table is expensive. Note that this is different from the default cache level of
`RDD.cache()` which is ‘MEMORY_ONLY‘.
Syntax
cache() : Dataset.this.type
Spark cache() method in Dataset class internally calls persist() method which in turn
uses sparkSession.sharedState.cacheManager.cacheQuery to cache the result set of DataFrame
or Dataset. Let’s look at an example.
Example
val spark:SparkSession = SparkSession.builder()
.master("local[1]")
.appName("SparkByExamples.com")
.getOrCreate()
//read csv with options
val df = spark.read.options(Map("inferSchema"->"true","delimiter"->",","header"->"true"))
.csv("src/main/resources/zipcodes.csv")
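The statement that creates the cached DataFrame df2 is only a sketch here; the State column and the "PR" filter are assumptions, not taken from the original example:

import org.apache.spark.sql.functions.col

// Filter and cache; the first action materializes the cache,
// and later actions on df2 reuse the cached data.
val df2 = df.where(col("State") === "PR").cache()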
println(df2.count())
println(df2.count())
Syntax
1) persist() : Dataset.this.type
2) persist(newLevel : org.apache.spark.storage.StorageLevel) : Dataset.this.type
Spark persist() has two signatures: the first signature doesn't take any argument and by default saves the DataFrame to the MEMORY_AND_DISK storage level, and the second signature takes StorageLevel as an argument to store it at different storage levels.
Example
val dfPersist = df.persist()
dfPersist.show(false)
Using the second signature you can save DataFrame/Dataset to any storage levels.
import org.apache.spark.storage.StorageLevel
val dfPersist = df.persist(StorageLevel.MEMORY_ONLY)
dfPersist.show(false)
This stores DataFrame/Dataset into Memory.
Note that Dataset cache() is an alias for persist(StorageLevel.MEMORY_AND_DISK)
Unpersist syntax and Example
Spark automatically monitors every persist() and cache() call you make; it checks usage on each node and drops persisted data that is not used, based on the least-recently-used (LRU) algorithm. You can also remove it manually using the unpersist() method. unpersist() marks the Dataset as non-persistent and removes all blocks for it from memory and disk.
Syntax
unpersist() : Dataset.this.type
unpersist(blocking : scala.Boolean) : Dataset.this.type
Example
dfPersist.unpersist()
unpersist(blocking = true), with a Boolean argument, blocks until all blocks are deleted.
Conclusion
In this article, you have learned that the Spark cache() and persist() methods are used as optimization techniques to save interim computation results of a DataFrame or Dataset so they can be reused in subsequent actions, learned the difference between Spark cache and persist, and finally saw their syntax and usage with Scala examples.
______________________________________________________________________________
val data = Seq(("1", "Be the change that you wish to see in the world"),
("2", "Everyone thinks of changing the world, but no one thinks of changing himself."),
("3", "The purpose of our lives is to be happy.")
)
val df = data.toDF(columns:_*)
df.show(false)
Yields below output.
+-----+-----------------------------------------------------------------------------+
|Seqno|Quote                                                                        |
+-----+-----------------------------------------------------------------------------+
|1    |Be the change that you wish to see in the world                              |
|2    |Everyone thinks of changing the world, but no one thinks of changing himself.|
|3    |The purpose of our lives is to be happy.                                     |
+-----+-----------------------------------------------------------------------------+
Create a Function
The first step in creating a UDF is creating a Scala function. The below snippet creates a function convertCase() which takes a string parameter and converts the first letter of every word to a capital letter. UDFs take parameters of your choice and return a value.
val convertCase = (strQuote:String) => {
val arr = strQuote.split(" ")
arr.map(f=> f.substring(0,1).toUpperCase + f.substring(1,f.length)).mkString(" ")
}
Create Spark UDF to use it on DataFrame
Now convert this function convertCase() to a UDF by passing the function to the Spark SQL udf() function, which is available in org.apache.spark.sql.functions. Make sure you import it before using it.
val convertUDF = udf(convertCase)
Now you can use convertUDF() on a DataFrame column. The udf() function returns an org.apache.spark.sql.expressions.UserDefinedFunction.
//Using with DataFrame
df.select(col("Seqno"),
convertUDF(col("Quote")).as("Quote") ).show(false)
This results in the below output.
+-----+-----------------------------------------------------------------------------+
|Seqno|Quote |
+-----+-----------------------------------------------------------------------------+
+-----+-----------------------------------------------------------------------------+
null check
UDFs are error-prone when not designed carefully. For example, when you have a column that contains null on some records, not handling null inside the UDF function returns the below error.
Exception in thread "main" org.apache.spark.SparkException: Failed to execute user defined
function(anonfun$1: (string) => string)
at org.apache.spark.sql.catalyst.expressions.ScalaUDF.eval(ScalaUDF.scala:1066)
at org.apache.spark.sql.catalyst.expressions.Alias.eval(namedExpressions.scala:152)
at org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.apply(Projection.scala:92)
at org.apache.spark.sql.catalyst.optimizer.ConvertToLocalRelation$$anonfun$apply$24$
$anonfun$applyOrElse$23.apply(Optimizer.scala:1364)
at org.apache.spark.sql.catalyst.optimizer.ConvertToLocalRelation$$anonfun$apply$24$
$anonfun$applyOrElse$23.apply(Optimizer.scala:1364)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.immutable.List.foreach(List.scala:392)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.immutable.List.map(List.scala:296)
at org.apache.spark.sql.catalyst.optimizer.ConvertToLocalRelation$
$anonfun$apply$24.applyOrElse(Optimizer.scala:1364)
at org.apache.spark.sql.catalyst.optimizer.ConvertToLocalRelation$
$anonfun$apply$24.applyOrElse(Optimizer.scala:1359)
It is always best practice to check for null inside a UDF function rather than checking for it outside.
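A minimal sketch of a null-safe variant, assuming the same udf import as above (the convertCaseSafe name is illustrative):

// Returns null for null input instead of throwing a NullPointerException
val convertCaseSafe = udf((strQuote: String) => {
  if (strQuote == null) null
  else strQuote.split(" ")
    .map(f => f.substring(0, 1).toUpperCase + f.substring(1))
    .mkString(" ")
})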
UDF’s are a black box to Spark hence it can’t apply optimization and you will lose all the
optimization Spark does on Dataframe/Dataset. When possible you should use Spark SQL built-in
functions as these functions provide optimization.
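For this particular use case, a built-in alternative looks roughly like the sketch below; note that initcap() also lowercases the rest of each word, unlike convertCase():

import org.apache.spark.sql.functions.{col, initcap}

// Capitalize the first letter of every word using a built-in, optimizable function
df.select(col("Seqno"), initcap(col("Quote")).as("Quote")).show(false)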
import spark.implicits._
val columns = Seq("Seqno","Quote")
val data = Seq(("1", "Be the change that you wish to see in the world"),
("2", "Everyone thinks of changing the world, but no one thinks of changing himself."),
("3", "The purpose of our lives is to be happy.")
)
val df = data.toDF(columns:_*)
df.show(false)
// Using it on SQL
spark.udf.register("convertUDF", convertCase)
df.createOrReplaceTempView("QUOTE_TABLE")
spark.sql("select Seqno, convertUDF(Quote) from QUOTE_TABLE").show(false)
}
Conclusion
In this article, you have learned that a Spark UDF is a User Defined Function used to create a reusable function that can be applied to multiple DataFrames. Once UDFs are created, they can be used on DataFrame columns and in SQL (after registering).
______________________________________________________________________________
val df = spark.createDataFrame(
spark.sparkContext.parallelize(simpleData),simpleSchema)
df.printSchema()
df.show()
By running the above snippet, it displays the below outputs.
root
+---------+----------+--------+-----+------+------+
|firstname|middlename|lastname| id|gender|salary|
+---------+----------+--------+-----+------+------+
+---------+----------+--------+-----+------+------+
.add("lastname",StringType))
.add("id",StringType)
.add("gender",StringType)
.add("salary",IntegerType)
+--------------------+-----+------+------+
| name| id|gender|salary|
+--------------------+-----+------+------+
+--------------------+-----+------+------+
{
"type" : "struct",
"fields" : [ {
"name" : "name",
"type" : {
"type" : "struct",
"fields" : [ {
"name" : "firstname",
"type" : "string",
"nullable" : true,
"metadata" : { }
}, {
"name" : "middlename",
"type" : "string",
"nullable" : true,
"metadata" : { }
}, {
"name" : "lastname",
"type" : "string",
"nullable" : true,
"metadata" : { }
}]
},
"nullable" : true,
"metadata" : { }
}, {
"name" : "dob",
"type" : "string",
"nullable" : true,
"metadata" : { }
}, {
"name" : "gender",
"type" : "string",
"nullable" : true,
"metadata" : { }
}, {
"name" : "salary",
"type" : "integer",
"nullable" : true,
"metadata" : { }
}]
}
val url = ClassLoader.getSystemResource("schema.json")
val schemaSource = Source.fromFile(url.getFile).getLines.mkString
val schemaFromJson = DataType.fromJson(schemaSource).asInstanceOf[StructType]
val df3 = spark.createDataFrame(
spark.sparkContext.parallelize(structureData),schemaFromJson)
df3.printSchema()
This prints the same output as the previous section. You can also keep the name, type, and nullable flag in a comma-separated file and use these to create a StructType programmatically; I will leave this to you to explore.
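A rough sketch of that idea, assuming a simple name,type,nullable line format (the field list and the type mapping below are illustrative):

import org.apache.spark.sql.types._

val csvLines = Seq("firstname,string,true", "salary,integer,false")

val schemaFromCsv = StructType(csvLines.map { line =>
  val Array(name, typ, nullable) = line.split(",").map(_.trim)
  val dataType = typ.toLowerCase match {
    case "integer" => IntegerType
    case "double"  => DoubleType
    case _         => StringType // fallback for this sketch
  }
  StructField(name, dataType, nullable.toBoolean)
})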
+---------------------+-------------------+------------------------------+
+---------------------+-------------------+------------------------------+
|[James , , Smith] |[Cricket, Movies] |[hair -> black, eye -> brown] |
|[Jen, Mary, Brown] |[Blogging] |[white -> black, eye -> black]|
+---------------------+-------------------+------------------------------+
import org.apache.spark.sql.catalyst.ScalaReflection
val schema = ScalaReflection.schemaFor[Employee].dataType.asInstanceOf[StructType]
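The Employee case class is assumed here; a minimal sketch of such a class (the field names are illustrative):

// Hypothetical case class; schemaFor derives a StructType that mirrors its fields
case class Employee(name: String, salary: Int)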
This example returns “true” for both scenarios. And for the second one if you have IntegetType
instead of StringType it returns false as the datatype for first name column is String, as it checks
every property ins field. Similarly, you can also check if two schemas are equal and more.
The complete example explained here is available at GitHub project.
Conclusion:
In this article, you have learned the usage of SQL StructType and StructField, how to change the structure of a Spark DataFrame at runtime, how to convert a case class to a schema, and how to use ArrayType and MapType.
ascii(e: Column): Column – Calculates the ASCII value of the first character of the string. It returns an integer value.
decode(value: Column, charset: String): Column – Decodes the value from binary into a string using the provided character set (one of 'US-ASCII', 'ISO-8859-1', 'UTF-8', 'UTF-16BE', 'UTF-16LE', 'UTF-16').
initcap(e: Column): Column – Capitalizes the first letter of each word in a string.
instr(str: Column, substring: String): Column – Returns the position of the first occurrence of a substring within a string.
length(e: Column): Column – Computes the character length of a given string or the number of bytes of a binary string.
levenshtein(l: Column, r: Column): Column – Computes the Levenshtein distance of the two given string columns.
locate(substr: String, str: Column): Column – Returns the position of the first occurrence of a substring within a string.
lpad(str: Column, len: Int, pad: String): Column – Pads a string with specified characters on the left side until it reaches the desired length.
unbase64(e: Column): Column – Decodes Base64-encoded strings back into their original binary form. This is the reverse of base64.
rpad(str: Column, len: Int, pad: String): Column – Pads a string with specified characters on the right side until it reaches the desired length.
repeat(str: Column, n: Int): Column – Repeats a string or character a specified number of times.
split(str: Column, regex: String): Column – Splits a string into an array of substrings based on a delimiter.
substring(str: Column, pos: Int, len: Int): Column – Extracts a substring from a string column, starting at a specified position and optionally up to a specified length.
overlay(src: Column, replaceString: String, pos: Int, len: Int): Column – Replaces part of a string with another string, starting at a specified position and optionally for a specified length.
trim(e: Column): Column – Removes leading and trailing whitespace characters from a string.
trim(e: Column, trimString: String): Column – Trims the specified character from both ends of the string column.
upper(e: Column): Column – Converts all characters in a string column to upper case.
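A small usage sketch applying a few of these functions (the DataFrame and column are illustrative; assumes spark.implicits._ is imported):

import org.apache.spark.sql.functions.{col, initcap, length, split}

val strDF = Seq("james smith", "maria garcia").toDF("name")

strDF.select(
  initcap(col("name")).as("initcap"), // "James Smith", "Maria Garcia"
  length(col("name")).as("length"),   // 11, 12
  split(col("name"), " ").as("parts") // [james, smith], [maria, garcia]
).show(false)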
Conclusion:
Spark SQL string functions provide a powerful set of tools for manipulating and analyzing textual
data within Apache Spark. These functions allow users to perform a wide range of operations,
such as string manipulation, pattern matching, and data cleansing.
Spark SQL provides built-in standard Date and Timestamp (date and time) functions defined in the DataFrame API; these come in handy when we need to perform operations on dates and times. All of these accept input as Date type, Timestamp type, or String. If a String, it should be in a format that can be cast to a date, such as yyyy-MM-dd, or to a timestamp, such as yyyy-MM-dd HH:mm:ss.SSSS, and they return Date and Timestamp respectively; they also return null if the input was a string that could not be cast to a date or timestamp.
When possible, try to leverage the standard library functions, as they are a little more compile-time safe, handle null, and perform better when compared to Spark UDFs. If your application is performance-critical, try to avoid using custom UDFs at all costs, as these do not guarantee performance.
For readability, I've grouped the Date and Timestamp functions into the following.
Spark SQL Date Functions
Spark SQL Timestamp Functions
Date and Timestamp Window Functions
Before you use any of the examples below, make sure you create a SparkSession and import the SQL functions.
import org.apache.spark.sql.SparkSession
val spark:SparkSession = SparkSession.builder()
.master("local[3]")
.appName("SparkByExample")
.getOrCreate()
spark.sparkContext.setLogLevel("ERROR")
import spark.sqlContext.implicits._
import org.apache.spark.sql.functions._
Spark SQL Date Functions
Click on each link from below table for more explanation and working examples in Scala.
to_date(e: Column): Column – Converts the column into DateType by casting rules to DateType.
to_date(e: Column, fmt: String): Column – Converts the column into a DateType with a specified format.
date_add(start: Column, days: Int): Column – Returns the date that is days days after start.
date_sub(start: Column, days: Int): Column – Returns the date that is days days before start.
datediff(end: Column, start: Column): Column – Returns the number of days from start to end.
next_day(date: Column, dayOfWeek: String): Column – Returns the first date which is later than the value of the date column that is on the specified day of the week. For example, next_day('2015-07-27', "Sunday") returns 2015-08-02 because that is the first Sunday after 2015-07-27.
trunc(date: Column, format: String): Column – Returns the date truncated to the unit specified by the format. For example, trunc("2018-11-19 12:01:19", "year") returns 2018-01-01. format: 'year', 'yyyy', 'yy' to truncate by year; 'month', 'mon', 'mm' to truncate by month.
date_trunc(format: String, timestamp: Column): Column – Returns the timestamp truncated to the unit specified by the format. For example, date_trunc("year", "2018-11-19 12:01:19") returns 2018-01-01 00:00:00. format: 'year', 'yyyy', 'yy' to truncate by year; 'month', 'mon', 'mm' to truncate by month; 'day', 'dd' to truncate by day.
dayofweek(e: Column): Column – Extracts the day of the week as an integer from a given date/timestamp/string. Ranges from 1 for a Sunday through to 7 for a Saturday.
dayofmonth(e: Column): Column – Extracts the day of the month as an integer from a given date/timestamp/string.
dayofyear(e: Column): Column – Extracts the day of the year as an integer from a given date/timestamp/string.
last_day(e: Column): Column – Returns the last day of the month which the given date belongs to. For example, input "2015-07-27" returns "2015-07-31" since July 31 is the last day of the month in July 2015.
from_unixtime(ut: Column): Column – Converts the number of seconds from Unix epoch (1970-01-01 00:00:00 UTC) to a string representing the timestamp of that moment in the current system time zone in the yyyy-MM-dd HH:mm:ss format.
from_unixtime(ut: Column, f: String): Column – Converts the number of seconds from Unix epoch (1970-01-01 00:00:00 UTC) to a string representing the timestamp of that moment in the current system time zone in the given format.
Spark SQL Timestamp Functions
unix_timestamp(): Column – Returns the current Unix timestamp (in seconds) as a long.
unix_timestamp(s: Column, p: String): Column – Converts a time string with the given pattern to a Unix timestamp (in seconds).
to_timestamp(s: Column, fmt: String): Column – Converts a time string with the given pattern to a timestamp.
Date & Time Window Function Date & Time Window Function Description
Syntax
window(timeColumn: Column, Bucketize rows into one or more time windows given a
windowDuration: String, timestamp specifying column. Window starts are inclusive but
slideDuration: String, startTime: the window ends are exclusive, e.g. 12:05 will be in the
String): Column window [12:05,12:10) but not in [12:00,12:05). Windows can
support microsecond precision. Windows in the order of
months are not supported.
window(timeColumn: Column, Bucketize rows into one or more time windows given a
windowDuration: String, timestamp specifying column. Window starts are inclusive but
slideDuration: String): Column the window ends are exclusive, e.g. 12:05 will be in the
window [12:05,12:10) but not in [12:00,12:05). Windows can
support microsecond precision. Windows in the order of
months are not supported. The windows start beginning at
1970-01-01 00:00:00 UTC
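A minimal sketch of the sliding-window variant (eventsDF and its eventTime timestamp column are assumed):

import org.apache.spark.sql.functions.{col, window}

// Count events per 10-minute window sliding every 5 minutes
eventsDF
  .groupBy(window(col("eventTime"), "10 minutes", "5 minutes"))
  .count()
  .show(false)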
to_date()
Below example converts string in date format ‘MM/dd/yyyy’ to a DateType ‘yyyy-MM-dd’
using to_date() with Scala example.
import org.apache.spark.sql.functions._
Seq(("04/13/2019"))
.toDF("Input")
.select( col("Input"),
to_date(col("Input"), "MM/dd/yyyy").as("to_date")
).show()
+----------+----------+
|Input |to_date |
+----------+----------+
|04/13/2019|2019-04-13|
+----------+----------+
datediff()
Below example returns the difference between two dates using datediff() with Scala example.
import org.apache.spark.sql.functions._
Seq(("2019-01-23"),("2019-06-24"),("2019-09-20"))
.toDF("input")
.select( col("input"), current_date(),
datediff(current_date(),col("input")).as("diff")
).show()
+----------+--------------+--------+
|     input|current_date()|    diff|
+----------+--------------+--------+
|2019-01-23|    2019-07-23|     181|
|2019-06-24|    2019-07-23|      29|
|2019-09-20|    2019-07-23|     -59|
+----------+--------------+--------+
months_between()
Below example returns the months between two dates using months_between() with Scala
language.
import org.apache.spark.sql.functions._
Seq(("2019-01-23"),("2019-06-24"),("2019-09-20"))
.toDF("date")
.select( col("date"), current_date(),
datediff(current_date(),col("date")).as("datediff"),
months_between(current_date(),col("date")).as("months_between")
).show()
+----------+--------------+--------+--------------+
| date |current_date()|datediff|months_between|
+----------+--------------+--------+--------------+
+----------+--------------+--------+--------------+
trunc()
Below example truncates date at a specified unit using trunc() with Scala language.
import org.apache.spark.sql.functions._
Seq(("2019-01-23"),("2019-06-24"),("2019-09-20"))
.toDF("input")
.select( col("input"),
trunc(col("input"),"Month").as("Month_Trunc"),
trunc(col("input"),"Year").as("Month_Year"),
trunc(col("input"),"Month").as("Month_Trunc")
).show()
+----------+-----------+----------+-----------+
| input |Month_Trunc|Month_Year|Month_Trunc|
+----------+-----------+----------+-----------+
|2019-01-23| 2019-01-01|2019-01-01| 2019-01-01|
|2019-06-24| 2019-06-01|2019-01-01| 2019-06-01|
|2019-09-20| 2019-09-01|2019-01-01| 2019-09-01|
+----------+-----------+----------+-----------+
+----------+----------+----------+----------+----------+
|2019-01-23|2019-04-23|2018-10-23|2019-01-27|2019-01-19|
|2019-06-24|2019-09-24|2019-03-24|2019-06-28|2019-06-20|
|2019-09-20|2019-12-20|2019-06-20|2019-09-24|2019-09-16|
+----------+----------+----------+----------+----------+
import org.apache.spark.sql.functions._
Seq(("2019-01-23"),("2019-06-24"),("2019-09-20"))
.toDF("input")
.select( col("input"), year(col("input")).as("year"),
month(col("input")).as("month"),
dayofweek(col("input")).as("dayofweek"),
dayofmonth(col("input")).as("dayofmonth"),
dayofyear(col("input")).as("dayofyear"),
next_day(col("input"),"Sunday").as("next_day"),
weekofyear(col("input")).as("weekofyear")
).show()
+----------+----+-----+---------+----------+---------+----------+----------+
| input|year|month|dayofweek|dayofmonth|dayofyear| next_day|weekofyear|
+----------+----+-----+---------+----------+---------+----------+----------+
+----------+----+-----+---------+----------+---------+----------+----------+
+---+------------+-----------------------+
|seq|current_date|current_timestamp |
+---+------------+-----------------------+
+---+------------+-----------------------+
to_timestamp()
Converts string timestamp to Timestamp type format.
import org.apache.spark.sql.functions._
val dfDate = Seq(("07-01-2019 12 01 19 406"),
("06-24-2019 12 01 19 406"),
("11-16-2019 16 44 55 406"),
("11-16-2019 16 50 59 406")).toDF("input_timestamp")
dfDate.withColumn("datetype_timestamp",
to_timestamp(col("input_timestamp"),"MM-dd-yyyy HH mm ss SSS"))
.show(false)
+-----------------------+-------------------+
|input_timestamp        |datetype_timestamp |
+-----------------------+-------------------+
|07-01-2019 12 01 19 406|2019-07-01 12:01:19|
|06-24-2019 12 01 19 406|2019-06-24 12:01:19|
|11-16-2019 16 44 55 406|2019-11-16 16:44:55|
|11-16-2019 16 50 59 406|2019-11-16 16:50:59|
+-----------------------+-------------------+
df.withColumn("hour", hour(col("input_timestamp")))
.withColumn("minute", minute(col("input_timestamp")))
.withColumn("second", second(col("input_timestamp")))
.show(false)
+-----------------------+----+------+------+
|input_timestamp |hour|minute|second|
+-----------------------+----+------+------+
+-----------------------+----+------+------+
Conclusion:
In this post, I’ve consolidated the complete list of Spark Date and Timestamp Functions with a
description and example of some commonly used. You can find more information about these at
the following blog
explode(e: Column) – Creates a new row for every key-value pair in the map, ignoring null & empty maps. It creates two new columns, one for the key and one for the value.
explode_outer(e: Column) – Creates a new row for every key-value pair in the map, including null & empty maps. It creates two new columns, one for the key and one for the value.
posexplode(e: Column) – Creates a new row for each key-value pair in a map, ignoring null & empty maps. It also creates 3 columns: "pos" to hold the position of the map element, and "key" and "value" columns for every row.
posexplode_outer(e: Column) – Creates a new row for each key-value pair in a map, including null & empty maps. It also creates 3 columns: "pos" to hold the position of the map element, and "key" and "value" columns for every row.
Before we start, let’s create a DataFrame with some sample data to work with.
val structureData = Seq(
Row("36636","Finance",Row(3000,"USA")),
Row("40288","Finance",Row(5000,"IND")),
Row("42114","Sales",Row(3900,"USA")),
Row("39192","Marketing",Row(2500,"CAN")),
Row("34534","Sales",Row(6500,"USA"))
)
val structureSchema = new StructType()
.add("id",StringType)
.add("dept",StringType)
.add("properties",new StructType()
.add("salary",IntegerType)
.add("location",StringType)
)
var df = spark.createDataFrame(
spark.sparkContext.parallelize(structureData),structureSchema)
df.printSchema()
df.show(false)
root
 |-- id: string (nullable = true)
 |-- dept: string (nullable = true)
 |-- properties: struct (nullable = true)
 |    |-- salary: integer (nullable = true)
 |    |-- location: string (nullable = true)

+-----+---------+-----------+
|id   |dept     |properties |
+-----+---------+-----------+
|36636|Finance  |[3000, USA]|
|40288|Finance  |[5000, IND]|
|42114|Sales    |[3900, USA]|
|39192|Marketing|[2500, CAN]|
|34534|Sales    |[6500, USA]|
+-----+---------+-----------+
val index = df.schema.fieldIndex("properties")
val propSchema = df.schema(index).dataType.asInstanceOf[StructType]
var columns = mutable.LinkedHashSet[Column]()
propSchema.fields.foreach(field =>{
columns.add(lit(field.name))
columns.add(col("properties." + field.name)) })
df = df.withColumn("propertiesMap",map(columns.toSeq:_*))
df = df.drop("properties")
df.printSchema()
df.show(false)
First, we find “properties” column on Spark DataFrame using df.schema.fieldIndex(“properties”)
and retrieves all columns and it’s values to a LinkedHashSet. we need LinkedHashSet in order to
maintain the insertion order of key and value pair. and finally use map() function with a key, value
set pair.
root
+-----+---------+---------------------------------+
+-----+---------+---------------------------------+
+-----+---------+---------------------------------+
+-----+-----------------------+
|id   |map_keys(propertiesMap)|
+-----+-----------------------+
|36636|[salary, location]     |
|40288|[salary, location]     |
|42114|[salary, location]     |
|39192|[salary, location]     |
|34534|[salary, location]     |
+-----+-----------------------+
+-----+-------------------------+
|id |map_values(propertiesMap)|
+-----+-------------------------+
|36636|[3000, USA] |
|40288|[5000, IND] |
|42114|[3900, USA] |
|39192|[2500, CAN] |
|34534|[6500, USA] |
+-----+-------------------------+
+-------+---------------------------------------------+
|name   |mapConcat                                    |
+-------+---------------------------------------------+
|James  |[hair -> black, eye -> brown, height -> 5.9] |
|Michael|[hair -> brown, eye -> black, height -> 6]   |
|Robert |[hair -> red, eye -> gray, height -> 6.3]    |
|Maria  |[hair -> blond, eye -> red, height -> 5.6]   |
|Jen    |[white -> black, eye -> black, height -> 5.2]|
+-------+---------------------------------------------+
+-------+-------------------------------+
|name   |mapFromEntries                 |
+-------+-------------------------------+
|James  |[Newark -> NY, Brooklyn -> NY] |
|Michael|[SanJose -> CA, Sandiago -> CA]|
|Robert |[LasVegas -> NV]               |
|Maria  |null                           |
|Jen    |[LAX -> CA, Orange -> CA]      |
+-------+-------------------------------+
.add("location",StringType)
)
var df = spark.createDataFrame(
spark.sparkContext.parallelize(structureData),structureSchema)
df.printSchema()
df.show(false)
// Convert to Map
val index = df.schema.fieldIndex("properties")
val propSchema = df.schema(index).dataType.asInstanceOf[StructType]
var columns = mutable.LinkedHashSet[Column]()
propSchema.fields.foreach(field =>{
columns.add(lit(field.name))
columns.add(col("properties." + field.name))
})
df = df.withColumn("propertiesMap",map(columns.toSeq:_*))
df = df.drop("properties")
df.printSchema()
df.show(false)
//Retrieve all keys from a Map
val keys = df.select(explode(map_keys($"propertiesMap"))).as[String].distinct.collect
print(keys.mkString(","))
// map_keys
df.select(col("id"),map_keys(col("propertiesMap")))
.show(false)
//map_values
df.select(col("id"),map_values(col("propertiesMap")))
.show(false)
//Creating DF with MapType
val arrayStructureData = Seq(
Row("James",List(Row("Newark","NY"),Row("Brooklyn","NY")),Map("hair"->"black","eye"->"brown"), Map("height"->"5.9")),
Row("Michael",List(Row("SanJose","CA"),Row("Sandiago","CA")),Map("hair"->"brown","eye"->"black"),Map("height"->"6")),
Row("Robert",List(Row("LasVegas","NV")),Map("hair"->"red","eye"->"gray"),Map("height"->"6.3")),
Row("Maria",null,Map("hair"->"blond","eye"->"red"),Map("height"->"5.6")),
Row("Jen",List(Row("LAX","CA"),Row("Orange","CA")),Map("white"->"black","eye"->"black"),Map("height"->"5.2"))
)
val arrayStructureSchema = new StructType()
.add("name",StringType)
.add("addresses", ArrayType(new StructType()
.add("city",StringType)
.add("state",StringType)))
.add("properties", MapType(StringType,StringType))
.add("secondProp", MapType(StringType,StringType))
val concatDF = spark.createDataFrame(
spark.sparkContext.parallelize(arrayStructureData),arrayStructureSchema)
concatDF.printSchema()
concatDF.show()
concatDF.withColumn("mapConcat",map_concat(col("properties"),col("secondProp")))
.select("name","mapConcat")
.show(false)
concatDF.withColumn("mapFromEntries",map_from_entries(col("addresses")))
.select("name","mapFromEntries")
.show(false)
}
Conclusion
In this article, you have learned how to convert a StructType column to a map, how to create a map from an array of StructType, and how to concatenate several maps using SQL map functions on the Spark DataFrame column.
asc(columnName: String): Column – asc is used to specify the ascending order of the sorting column on a DataFrame or Dataset.
asc_nulls_first(columnName: String): Column – Similar to asc, but null values return first and then non-null values.
asc_nulls_last(columnName: String): Column – Similar to asc, but non-null values return first and then null values.
desc(columnName: String): Column – desc is used to specify the descending order of the DataFrame or Dataset sorting column.
desc_nulls_first(columnName: String): Column – Similar to desc, but null values return first and then non-null values.
desc_nulls_last(columnName: String): Column – Similar to desc, but non-null values return first and then null values.
asc_nulls_first() – ascending with nulls first
Similar to the asc function, but null values return first and then non-null values.
asc_nulls_first(columnName: String): Column
asc_nulls_last() – ascending with nulls last
Similar to the asc function, but non-null values return first and then null values.
asc_nulls_last(columnName: String): Column
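A minimal usage sketch (the df, salary, and age columns are illustrative):

import org.apache.spark.sql.functions.{asc_nulls_first, desc}

// Sort with null salaries first, then by age in descending order
df.orderBy(asc_nulls_first("salary"), desc("age")).show(false)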
Note that every function below has another signature which takes a String column name instead of a Column.
collect_list(e: Column) – Returns all values from an input column, with duplicates.
collect_set(e: Column) – Returns all values from an input column, with duplicate values eliminated.
corr(column1: Column, column2: Column) – Returns the Pearson correlation coefficient for two columns.
first(e: Column, ignoreNulls: Boolean) – Returns the first element in a column; when ignoreNulls is set to true, it returns the first non-null element.
last(e: Column, ignoreNulls: Boolean) – Returns the last element in a column; when ignoreNulls is set to true, it returns the last non-null element.
mean(e: Column) – Alias for avg. Returns the average of the values in a column.
+-------------+----------+------+
|employee_name|department|salary|
+-------------+----------+------+
|        James|     Sales|  3000|
|      Michael|     Sales|  4600|
|       Robert|     Sales|  4100|
|        Maria|   Finance|  3000|
|        James|     Sales|  3000|
|        Scott|   Finance|  3300|
|          Jen|   Finance|  3900|
|         Jeff| Marketing|  3000|
|        Kumar| Marketing|  2000|
|         Saif|     Sales|  4100|
+-------------+----------+------+
//approx_count_distinct()
println("approx_count_distinct: "+
df.select(approx_count_distinct("salary")).collect()(0)(0))
//Prints approx_count_distinct: 6
avg (average) Aggregate Function
avg() function returns the average of values in the input column.
//avg
println("avg: "+
df.select(avg("salary")).collect()(0)(0))
//collect_list
df.select(collect_list("salary")).show(false)
+------------------------------------------------------------+
|collect_list(salary) |
+------------------------------------------------------------+
|[3000, 4600, 4100, 3000, 3000, 3300, 3900, 3000, 2000, 4100]|
+------------------------------------------------------------+
//collect_set
df.select(collect_set("salary")).show(false)
+------------------------------------+
|collect_set(salary) |
+------------------------------------+
+------------------------------------+
count function()
count() function returns number of elements in a column.
println("count: "+
df.select(count("salary")).collect()(0))
Prints count: 10
grouping function()
grouping() indicates whether a given input column is aggregated or not; it returns 1 for aggregated and 0 for not aggregated in the result. If you try grouping() directly on the salary column, you will get the below error.
Exception in thread "main" org.apache.spark.sql.AnalysisException:
// grouping() can only be used with GroupingSets/Cube/Rollup
first function()
first() function returns the first element in a column when ignoreNulls is set to true, it returns the
first non-null element.
//first
df.select(first("salary")).show(false)
+--------------------+
|first(salary, false)|
+--------------------+
|3000 |
last()
last() function returns the last element in a column. when ignoreNulls is set to true, it returns the
last non-null element.
//last
df.select(last("salary")).show(false)
+-------------------+
|last(salary, false)|
+-------------------+
|4100 |
kurtosis()
kurtosis() function returns the kurtosis of the values in a group.
df.select(kurtosis("salary")).show(false)
+-------------------+
|kurtosis(salary) |
+-------------------+
|-0.6467803030303032|
max()
max() function returns the maximum value in a column.
df.select(max("salary")).show(false)
+-----------+
|max(salary)|
+-----------+
|4600 |
min()
min() function returns the minimum value in a column.
df.select(min("salary")).show(false)
+-----------+
|min(salary)|
+-----------+
|2000 |
mean()
mean() function returns the average of the values in a column. Alias for Avg
df.select(mean("salary")).show(false)
+-----------+
|avg(salary)|
+-----------+
|3400.0 |
skewness()
skewness() function returns the skewness of the values in a group.
df.select(skewness("salary")).show(false)
+--------------------+
|skewness(salary) |
+--------------------+
|-0.12041791181069571|
df.select(stddev("salary"), stddev_samp("salary"),
stddev_pop("salary")).show(false)
+-------------------+-------------------+------------------+
|stddev_samp(salary)|stddev_samp(salary)|stddev_pop(salary)|
178
Apache Spark - SparkByExamples
+-------------------+-------------------+------------------+
|765.9416862050705 |765.9416862050705 |726.636084983398 |
sum()
sum() function returns the sum of all values in a column.
df.select(sum("salary")).show(false)
+-----------+
|sum(salary)|
+-----------+
|34000 |
sumDistinct()
sumDistinct() function returns the sum of all distinct values in a column.
df.select(sumDistinct("salary")).show(false)
+--------------------+
|sum(DISTINCT salary)|
+--------------------+
|20900 |
df.select(variance("salary"),var_samp("salary"),var_pop("salary"))
.show(false)
+-----------------+-----------------+---------------+
+-----------------+-----------------+---------------+
|586666.6666666666|586666.6666666666|528000.0 |
package com.sparkbyexamples.spark.dataframe.functions.aggregate
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
object AggregateFunctionsExample { // wrapper object (name assumed) so the listing compiles
  def main(args: Array[String]): Unit = {
    val spark: SparkSession = SparkSession.builder()
      .master("local[1]")
      .appName("SparkByExamples.com")
      .getOrCreate()
    spark.sparkContext.setLogLevel("ERROR")
    import spark.implicits._
val simpleData = Seq(("James", "Sales", 3000),
("Michael", "Sales", 4600),
("Robert", "Sales", 4100),
("Maria", "Finance", 3000),
("James", "Sales", 3000),
("Scott", "Finance", 3300),
("Jen", "Finance", 3900),
("Jeff", "Marketing", 3000),
("Kumar", "Marketing", 2000),
("Saif", "Sales", 4100)
)
val df = simpleData.toDF("employee_name", "department", "salary")
df.show()
//approx_count_distinct()
println("approx_count_distinct: "+
df.select(approx_count_distinct("salary")).collect()(0)(0))
//avg
println("avg: "+
df.select(avg("salary")).collect()(0)(0))
//collect_list
df.select(collect_list("salary")).show(false)
//collect_set
df.select(collect_set("salary")).show(false)
//countDistinct
val df2 = df.select(countDistinct("department", "salary"))
df2.show(false)
println("Distinct Count of Department & Salary: "+df2.collect()(0)(0))
println("count: "+
df.select(count("salary")).collect()(0))
//first
df.select(first("salary")).show(false)
//last
df.select(last("salary")).show(false)
df.select(kurtosis("salary")).show(false)
df.select(max("salary")).show(false)
df.select(min("salary")).show(false)
df.select(mean("salary")).show(false)
df.select(skewness("salary")).show(false)
df.select(stddev("salary"), stddev_samp("salary"),
stddev_pop("salary")).show(false)
df.select(sum("salary")).show(false)
df.select(sumDistinct("salary")).show(false)
df.select(variance("salary"),var_samp("salary"),
var_pop("salary")).show(false)
  }
}
Conclusion
In this article, I’ve consolidated and listed all Spark SQL aggregate functions with Scala examples and discussed the benefits of using Spark SQL functions.
The below table defines ranking and analytic functions; for aggregate functions, we can use any existing aggregate function as a window function.
To perform an operation on a group, first partition the data using Window.partitionBy(); for the row number and rank functions we additionally need to order the partitioned data using the orderBy clause.
Click on each link to know more about these functions along with the Scala examples.
rank(): Column Returns the rank of rows within a window partition, with gaps.
dense_rank(): Column Returns the rank of rows within a window partition without any gaps, whereas rank() returns rank with gaps.
lag(e: Column, offset: Int): Column
lag(columnName: String, offset: Int): Column
lag(columnName: String, offset: Int, defaultValue: Any): Column Returns the value that is `offset` rows before the current row, and `null` if there are fewer than `offset` rows before the current row.
Before we start with an example, first let’s create a DataFrame to work with. The window examples use the same employee dataset as the aggregate examples above.
import spark.implicits._
val simpleData = Seq(("James", "Sales", 3000), ("Michael", "Sales", 4600),
  ("Robert", "Sales", 4100), ("Maria", "Finance", 3000), ("James", "Sales", 3000),
  ("Scott", "Finance", 3300), ("Jen", "Finance", 3900), ("Jeff", "Marketing", 3000),
  ("Kumar", "Marketing", 2000), ("Saif", "Sales", 4100))
val df = simpleData.toDF("employee_name", "department", "salary")
df.show()
+-------------+----------+------+
|employee_name|department|salary|
+-------------+----------+------+
|        James|     Sales|  3000|
|      Michael|     Sales|  4600|
|       Robert|     Sales|  4100|
|        Maria|   Finance|  3000|
|        James|     Sales|  3000|
|        Scott|   Finance|  3300|
|          Jen|   Finance|  3900|
|         Jeff| Marketing|  3000|
|        Kumar| Marketing|  2000|
|         Saif|     Sales|  4100|
+-------------+----------+------+
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
row_number() window function gives a sequential row number starting from 1 to each row within a window partition.
//row_number
val windowSpec = Window.partitionBy("department").orderBy("salary")
df.withColumn("row_number",row_number.over(windowSpec))
  .show()
+-------------+----------+------+----------+
|employee_name|department|salary|row_number|
+-------------+----------+------+----------+
|        James|     Sales|  3000|         1|
|        James|     Sales|  3000|         2|
|       Robert|     Sales|  4100|         3|
|         Saif|     Sales|  4100|         4|
|      Michael|     Sales|  4600|         5|
|        Maria|   Finance|  3000|         1|
|        Scott|   Finance|  3300|         2|
|          Jen|   Finance|  3900|         3|
|        Kumar| Marketing|  2000|         1|
|         Jeff| Marketing|  3000|         2|
+-------------+----------+------+----------+
//rank
df.withColumn("rank",rank().over(windowSpec))
.show()
+-------------+----------+------+----+
|employee_name|department|salary|rank|
+-------------+----------+------+----+
|        James|     Sales|  3000|   1|
|        James|     Sales|  3000|   1|
|       Robert|     Sales|  4100|   3|
|         Saif|     Sales|  4100|   3|
|      Michael|     Sales|  4600|   5|
|        Maria|   Finance|  3000|   1|
|        Scott|   Finance|  3300|   2|
|          Jen|   Finance|  3900|   3|
|        Kumar| Marketing|  2000|   1|
|         Jeff| Marketing|  3000|   2|
+-------------+----------+------+----+
dense_rank(), percent_rank(), ntile(), cume_dist(), lag() and lead() are applied with the same windowSpec in exactly the same way, each adding a column named after the function (dense_rank, percent_rank, ntile, cume_dist, lag, lead) next to the employee_name, department and salary columns; the complete example below shows each of these calls. Applying the aggregate functions avg(), sum(), min() and max() over a window partitioned by department (also shown in the complete example) produces:
+----------+------+-----+----+----+
|department|   avg|  sum| min| max|
+----------+------+-----+----+----+
|     Sales|3760.0|18800|3000|4600|
|   Finance|3400.0|10200|3000|3900|
| Marketing|2500.0| 5000|2000|3000|
+----------+------+-----+----+----+
spark.sparkContext.setLogLevel("ERROR")
import spark.implicits._
//row_number
val windowSpec = Window.partitionBy("department").orderBy("salary")
df.withColumn("row_number",row_number.over(windowSpec))
.show()
//rank
df.withColumn("rank",rank().over(windowSpec))
.show()
//dense_rank
df.withColumn("dense_rank",dense_rank().over(windowSpec))
.show()
//percent_rank
df.withColumn("percent_rank",percent_rank().over(windowSpec))
.show()
//ntile
df.withColumn("ntile",ntile(2).over(windowSpec))
.show()
//cume_dist
df.withColumn("cume_dist",cume_dist().over(windowSpec))
.show()
//lag
df.withColumn("lag",lag("salary",2).over(windowSpec))
.show()
//lead
df.withColumn("lead",lead("salary",2).over(windowSpec))
.show()
//Aggregate Functions
val windowSpecAgg = Window.partitionBy("department")
val aggDF = df.withColumn("row",row_number.over(windowSpec))
.withColumn("avg", avg(col("salary")).over(windowSpecAgg))
.withColumn("sum", sum(col("salary")).over(windowSpecAgg))
.withColumn("min", min(col("salary")).over(windowSpecAgg))
.withColumn("max", max(col("salary")).over(windowSpecAgg))
.where(col("row")===1).select("department","avg","sum","min","max")
.show()
  }
}
6. Conclusion
In this tutorial, you have learned what Spark SQL window functions are, their syntax, and how to use them with aggregate functions, along with several examples in Scala.
Use spark.read.csv("path") from the API to read a CSV file. Spark supports reading files with pipe,
comma, tab, or any other delimiter/separator files.
In this tutorial, you will learn how to read a single file, multiple files, and all files from a local
directory into Spark DataFrame, apply some transformations, and finally write DataFrame back to
a CSV file using Scala.
Note: Spark out of the box supports reading files in CSV, JSON, TEXT, Parquet, Avro, ORC and
many more file formats into Spark DataFrame.
Table of contents:
Spark Read CSV file into DataFrame
Read CSV files with a user-specified schema
Read multiple CSV files
Read all CSV files in a directory
Options while reading CSV file
o delimiter
o InferSchema
o header
o quotes
o nullValues
o dateFormat
Applying DataFrame transformations
Write DataFrame to CSV file
o Using options
o Saving Mode
// Import
import org.apache.spark.sql.SparkSession
// Create SparkSession
val spark: SparkSession = SparkSession.builder()
.master("local[1]")
.appName("SparkByExamples.com")
.getOrCreate()
// Read CSV file into DataFrame
val df = spark.read.csv("src/main/resources/zipcodes.csv")
df.printSchema()
Here, spark is a SparkSession object, read returns a DataFrameReader, and csv() is a method of DataFrameReader.
This example reads the data into a DataFrame with column names “_c0” for the first column, “_c1” for the second, and so on. By default, the data type of all these columns is String.
When you use the format("csv") method, you can also specify data sources by their fully qualified name (i.e., org.apache.spark.sql.csv), but for built-in sources you can also use their short names (csv, json, parquet, jdbc, text, etc.). For example:
// Using format()
val df2 = spark.read.format("CSV").load("src/main/resources/zipcodes.csv")
df2.printSchema()
You can use the options() method to specify multiple options at a time.
// User multiple options together
val options = Map("inferSchema"->"true","delimiter"->",","header"->"true")
val df5 = spark.read.options(options)
.csv("src/main/resources/zipcodes.csv")
df5.printSchema()
.add("RecordNumber",IntegerType,true)
.add("Zipcode",IntegerType,true)
.add("ZipCodeType",StringType,true)
.add("City",StringType,true)
.add("State",StringType,true)
.add("LocationType",StringType,true)
.add("Lat",DoubleType,true)
.add("Long",DoubleType,true)
.add("Xaxis",IntegerType,true)
.add("Yaxis",DoubleType,true)
.add("Zaxis",DoubleType,true)
.add("WorldRegion",StringType,true)
.add("Country",StringType,true)
.add("LocationText",StringType,true)
.add("Location",StringType,true)
.add("Decommisioned",BooleanType,true)
.add("TaxReturnsFiled",StringType,true)
.add("EstimatedPopulation",IntegerType,true)
.add("TotalWages",IntegerType,true)
.add("Notes",StringType,true)
// Read CSV file with custom schema
val df_with_schema = spark.read.format("csv")
.option("header", "true")
.schema(schema)
.load("src/main/resources/zipcodes.csv")
df_with_schema.printSchema()
df_with_schema.show(false)
// Cache DataFrame
val df7 = df6.cache()
// Create a temporary view to query with SQL
df7.createOrReplaceTempView("ZipCodes")
// Query table
spark.sql("select RecordNumber, Zipcode, ZipcodeType, City, State from ZipCodes")
  .show()
header
This option is used to read the first line of the CSV file as column names. By default the value of
this option is false , and all column types are assumed to be a string.
// Using header
val df2 = spark.read.options(Map("inferSchema"->"true","delimiter"->",","header"->"true"))
.csv("src/main/resources/zipcodes.csv")
quotes
When a column value contains the delimiter that is used to split the columns, use the quote option to specify the quote character; by default it is ” and delimiters inside quotes are ignored, but using this option you can set any character.
nullValues
Using the nullValues option, you can specify a string in the CSV that should be treated as null. For example, you can use it if you want a date column with the value “1900-01-01” to be set to null in the DataFrame.
dateFormat
The dateFormat option is used to set the format of the input DateType and TimestampType columns. It supports all java.text.SimpleDateFormat formats.
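A minimal sketch combining these options (the option keys are quote, nullValue, and dateFormat; the date pattern shown is an assumption):
val dfOpts = spark.read
  .option("header", "true")
  .option("quote", "\"")               // quote character
  .option("nullValue", "1900-01-01")   // treat this string as null
  .option("dateFormat", "MM/dd/yyyy")  // assumed input date pattern
  .csv("src/main/resources/zipcodes.csv")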
charset
Pay attention to the character encoding of the CSV file, especially when dealing with
internationalization. Spark’s CSV reader allows specifying encoding options to handle different
character sets. By default, it uses ‘UTF-8‘ but can be set to other valid charset names.
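For example, a file exported in a Western European encoding could be read with something like the below (the file name is hypothetical):
val dfLatin = spark.read
  .option("header", "true")
  .option("encoding", "ISO-8859-1") // also accepted as "charset"
  .csv("src/main/resources/zipcodes_latin1.csv")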
Note: Besides the above options, Spark CSV dataset also supports many other options, please
refer to this article for details.
Saving modes
Spark DataFrameWriter also has a mode() method to specify the SaveMode; the argument to this method takes either one of the below strings or a constant from the SaveMode class.
overwrite – used to overwrite the existing file; alternatively, you can use SaveMode.Overwrite.
append – to add the data to the existing file; alternatively, you can use SaveMode.Append.
ignore – ignores the write operation when the file already exists; alternatively, you can use SaveMode.Ignore.
errorifexists – the default option; when the file already exists, it returns an error; alternatively, you can use SaveMode.ErrorIfExists.
// Import
import org.apache.spark.sql.SaveMode
// Using Saving modes
df2.write.mode(SaveMode.Overwrite).csv("/tmp/spark_output/zipcodes")
Conclusion:
In this tutorial, you have learned how to read a CSV file, multiple CSV files, and all files from a local folder into a Spark DataFrame, use options to change the default behavior, and write the DataFrame back to CSV files using different save options.
Unlike CSV, the JSON data source infers the schema from the input file by default.
Refer to the dataset used in this article at zipcodes.json on GitHub.
When you use the format("json") method, you can also specify data sources by their fully qualified name (i.e., org.apache.spark.sql.json); for built-in sources, you can also use the short name “json”.
2. Read JSON file from multiline
Sometimes you may want to read records from a JSON file that are scattered across multiple lines. To read such files, set the multiline option to true; by default, the multiline option is set to false.
Below is the input file we are going to read; the same file is also available at multiline-zipcode.json on GitHub.
[{
"RecordNumber": 2,
"Zipcode": 704,
"ZipCodeType": "STANDARD",
"City": "PASEO COSTA DEL SUR",
"State": "PR"
},
{
"RecordNumber": 10,
"Zipcode": 709,
"ZipCodeType": "STANDARD",
"City": "BDA SAN LUIS",
"State": "PR"
}]
Using spark.read.option("multiline","true")
//read multiline json file
val multiline_df = spark.read.option("multiline","true")
.json("src/main/resources/multiline-zipcode.json")
multiline_df.show(false)
.add("LocationType",StringType,true)
.add("Lat",DoubleType,true)
.add("Long",DoubleType,true)
.add("Xaxis",IntegerType,true)
.add("Yaxis",DoubleType,true)
.add("Zaxis",DoubleType,true)
.add("WorldRegion",StringType,true)
.add("Country",StringType,true)
.add("LocationText",StringType,true)
.add("Location",StringType,true)
.add("Decommisioned",BooleanType,true)
.add("TaxReturnsFiled",StringType,true)
.add("EstimatedPopulation",IntegerType,true)
.add("TotalWages",IntegerType,true)
.add("Notes",StringType,true)
val df_with_schema = spark.read.schema(schema)
.json("src/main/resources/zipcodes.json")
df_with_schema.printSchema()
df_with_schema.show(false)
The dateFormat option is used to set the format of the input DateType and TimestampType columns. It supports all java.text.SimpleDateFormat formats.
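A sketch of using it while reading (dateFormat only takes effect for columns declared as DateType or TimestampType in the schema; the column name and pattern below are assumptions):
import org.apache.spark.sql.types._
val dateSchema = new StructType()
  .add("Zipcode", IntegerType, true)
  .add("DecommisionedDate", DateType, true) // hypothetical date column
val dfDates = spark.read
  .schema(dateSchema)
  .option("dateFormat", "yyyy-MM-dd")       // assumed input pattern
  .json("src/main/resources/zipcodes.json")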
Note: Besides the above options, Spark JSON dataset also supports many other options.
df2.write
.json("/tmp/spark_output/zipcodes.json")
9.1 Spark Options while writing JSON files
While writing a JSON file you can use several options; other available options include nullValue and dateFormat.
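A minimal sketch of passing write options (the values shown are assumptions):
df2.write
  .option("dateFormat", "yyyy-MM-dd")
  .option("compression", "gzip")
  .json("/tmp/spark_output/zipcodes_options.json")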
9.2 Saving modes
Spark DataFrameWriter also has a mode() method to specify the SaveMode; the argument to this method takes either one of the below strings or a constant from the SaveMode class.
overwrite – used to overwrite the existing file; alternatively, you can use SaveMode.Overwrite.
append – to add the data to the existing file; alternatively, you can use SaveMode.Append.
ignore – ignores the write operation when the file already exists; alternatively, you can use SaveMode.Ignore.
errorifexists or error – the default option; when the file already exists, it returns an error; alternatively, you can use SaveMode.ErrorIfExists.
df2.write.mode(SaveMode.Overwrite).json("/tmp/spark_output/zipcodes.json")
In this tutorial, we will learn what Apache Parquet is, its advantages, and how to read from and write a Spark DataFrame to the Parquet file format using a Scala example. The example provided here is also available at the GitHub repository for reference.
Apache Parquet Introduction
Apache Parquet Advantages
Spark Write DataFrame to Parquet file format
Spark Read Parquet file into DataFrame
Appending to existing Parquet file
Running SQL queries
Partitioning and Performance Improvement
Reading a specific Parquet Partition
Spark parquet schema
Apache Parquet Introduction
Apache Parquet is a columnar file format that provides optimizations to speed up queries and is a far more efficient file format than CSV or JSON. It is supported by many data processing systems and is compatible with most of the data processing frameworks in the Hadoop ecosystem. It provides efficient data compression and encoding schemes with enhanced performance to handle complex data in bulk.
Spark SQL provides support for both reading and writing Parquet files, automatically capturing the schema of the original data, and it can also substantially reduce data storage (a reduction of around 75% on average is often cited). Below are some advantages of storing data in Parquet format. Spark supports Parquet in its library by default, so we don’t need to add any dependency libraries.
Apache Parquet Advantages:
Below are some of the advantages of using Apache Parquet. Combining these benefits with Spark improves performance and gives the ability to work with structured files.
Reduces IO operations.
Fetches only the specific columns that you need to access.
Consumes less space.
Supports type-specific encoding.
("Michael ","Rose","","40288","M",4000),
("Robert ","","Williams","42114","M",4000),
("Maria ","Anne","Jones","39192","F",4000),
("Jen","Mary","Brown","","F",-1))
import spark.sqlContext.implicits._
val df = data.toDF(columns:_*)
The above example creates a data frame with columns “firstname”, “middlename”, “lastname”,
“dob”, “gender”, “salary”
Writing a Spark DataFrame to Parquet format preserves the column names and data types, and all columns are automatically converted to be nullable for compatibility reasons. Notice that all part files Spark creates have a .parquet extension.
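The write itself is a one-liner; a sketch using the same output path as the append example below:
df.write.parquet("/tmp/output/people.parquet")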
Spark provides the capability to append a DataFrame to existing Parquet files using the “append” save mode; if you want to overwrite, use the “overwrite” save mode.
df.write.mode("append").parquet("/tmp/output/people.parquet")
Parquet partitioning creates a folder hierarchy for each partition column; since we have mentioned gender as the first partition column followed by salary, it creates salary folders inside each gender folder. This is an example of how to write a Spark DataFrame while preserving the partitioning on the gender and salary columns.
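A sketch of such a partitioned write (the output path is an assumption):
df.write
  .partitionBy("gender", "salary")
  .mode("overwrite")
  .parquet("/tmp/output/people2.parquet")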
package com.sparkbyexamples.spark.dataframe
import org.apache.spark.sql.SparkSession
object ParquetExample {
def main(args:Array[String]):Unit= {
We can also supply our own struct schema and use it while reading a file as described below.
val schema = new StructType()
.add("_id",StringType)
.add("firstname",StringType)
.add("middlename",StringType)
.add("lastname",StringType)
.add("dob_year",StringType)
.add("dob_month",StringType)
.add("gender",StringType)
.add("salary",StringType)
import com.databricks.spark.xml._ // enables spark.read.xml()
val df = spark.read
  .option("rowTag", "person")
  .schema(schema)
  .xml("src/main/resources/persons.xml")
df.show()
Output:
show() on the DataFrame displays the _id, dob_month, dob_year, firstname, gender, lastname, middlename and salary columns.
.format("com.databricks.spark.xml")
.option("rootTag", "persons")
.option("rowTag", "person")
.save("src/main/resources/persons_new.xml")
This snippet writes a Spark DataFrame “df2” to XML file “pesons_new.xml” with “persons” as root
tag and “person” as row tag.
Limitations:
This API is most useful when reading and writing simple XML files. However, at the time of writing this article, this API has the following limitations.
Reading/writing attributes to/from the root element is not supported by this API.
It doesn’t support complex XML structures where you want to read header and footer along with row elements.
If you have one root element followed by data elements, then Spark XML is the go-to API. If you want to write a complex structure and this API is not suitable for you, please read the below article where I’ve explained using the XStream API.
Spark – Writing complex XML structures using XStream API
df2.write.format("avro")
.mode(SaveMode.Overwrite)
.save("\tmp\spark_out\avro\persons.avro")
Below snippet provides writing to Avro file by using partitions.
df2.write.partitionBy("_id")
.format("avro").save("persons_partition.avro")
<dependency>
     <groupId>org.apache.spark</groupId>
     <artifactId>spark-avro_2.12</artifactId>
     <version>2.4.4</version>
</dependency>
3.2 spark-submit
While using spark-submit, provide spark-avro_2.12 and its dependencies directly using --
packages, such as,
// Spark-submit
./bin/spark-submit --packages org.apache.spark:spark-avro_2.12:2.4.4
3.3 spark-shell
While working with spark-shell, you can also use --packages to add spark-avro_2.12 and its
dependencies directly,
// Spark-shell
./bin/spark-shell --packages org.apache.spark:spark-avro_2.12:2.4.4
4. Write Spark DataFrame to Avro Data File
Since Avro library is external to Spark, it doesn’t provide avro() function on DataFrameWriter ,
hence we should use DataSource “avro” or “org.apache.spark.sql.avro” to write Spark DataFrame
to Avro file.
// Write Spark DataFrame to Avro Data File
df.write.format("avro").save("person.avro")
("Jen","Mary","Brown",2010,7,"",-1)
)
val columns = Seq("firstname", "middlename", "lastname", "dob_year",
"dob_month", "gender", "salary")
import spark.sqlContext.implicits._
val df = data.toDF(columns:_*)
df.write.partitionBy("dob_year","dob_month")
.format("avro").save("person_partition.avro")
This example creates partition by “date of birth year and month” on person data. As shown in the
below screenshot, Avro creates a folder for each partition data.
"name": "Person",
"namespace": "com.sparkbyexamples",
"fields": [
{"name": "firstname","type": "string"},
{"name": "middlename","type": "string"},
{"name": "lastname","type": "string"},
{"name": "dob_year","type": "int"},
{"name": "dob_month","type": "int"},
{"name": "gender","type": "string"},
{"name": "salary","type": "int"}
] }
import java.io.File
import org.apache.avro.Schema
val schemaAvro = new Schema.Parser()
  .parse(new File("src/main/resources/person.avsc"))
val df = spark.read
.format("avro")
.option("avroSchema", schemaAvro.toString)
.load("person.avro")
Alternatively, we can also specify the StructType using the schema method.
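A sketch of that alternative (the field names and types mirror the person data used above):
import org.apache.spark.sql.types._
val structSchema = new StructType()
  .add("firstname", StringType)
  .add("middlename", StringType)
  .add("lastname", StringType)
  .add("dob_year", IntegerType)
  .add("dob_month", IntegerType)
  .add("gender", StringType)
  .add("salary", IntegerType)
val dfStruct = spark.read.format("avro")
  .schema(structSchema)
  .load("person.avro")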
Below is the complete example, for your reference and the same example is also available
at GitHub.
package com.sparkbyexamples.spark.dataframe.hbase.hortonworks
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.datasources.hbase.HBaseTableCatalog
object HBaseSparkRead {
  def main(args: Array[String]): Unit = {
    def catalog =
s"""{
|"table":{"namespace":"default", "name":"employee"},
|"rowkey":"key",
|"columns":{
|"key":{"cf":"rowkey", "col":"key", "type":"string"},
|"fName":{"cf":"person", "col":"firstName", "type":"string"},
|"lName":{"cf":"person", "col":"lastName", "type":"string"},
|"mName":{"cf":"person", "col":"middleName", "type":"string"},
|"addressLine":{"cf":"address", "col":"addressLine", "type":"string"},
|"city":{"cf":"address", "col":"city", "type":"string"},
|"state":{"cf":"address", "col":"state", "type":"string"},
|"zipCode":{"cf":"address", "col":"zipCode", "type":"string"}
|}
|}""".stripMargin
val sparkSession: SparkSession = SparkSession.builder()
  .master("local[1]")
  .appName("SparkByExamples.com")
  .getOrCreate()
import sparkSession.implicits._
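// Read the HBase table into a DataFrame (a sketch; the DataSource name and the
// HBaseTableCatalog.tableCatalog option are the ones described in the text below)
val hbaseDF = sparkSession.read
  .options(Map(HBaseTableCatalog.tableCatalog -> catalog))
  .format("org.apache.spark.sql.execution.datasources.hbase")
  .load()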
hbaseDF.printSchema()
hbaseDF.show(false)
// Create a temporary view to run SQL queries
hbaseDF.createOrReplaceTempView("employeeTable")
//Run SQL
sparkSession.sql("select * from employeeTable where fName = 'Amaya' ").show()
  }
}
Let me explain what’s happening in a few statements of this example.
First, we need to define a catalog to bridge the gap between the HBase key-value store and the Spark DataFrame table structure. Using this, we also map the column names between the two structures and identify the row key.
A couple of things are happening in the read snippet: format() takes "org.apache.spark.sql.execution.datasources.hbase", the DataSource defined in the “shc-core” API, which enables us to use DataFrames with HBase tables; read.options() takes the catalog we defined earlier; and finally, load() reads the HBase table.
hbaseDF.show(false) displays all of the mapped columns (key, fName, lName, mName, addressLine, city, state, zipCode); please note the differences between the DataFrame field names and the table column cell names. Selecting just a few of the mapped columns, for example with hbaseDF.select("key", "fName", "lName").show(), returns rows such as:
+---+-----+-----+
|key|fName|lName|
+---+-----+-----+
|  1| Abby|Smith|
+---+-----+-----+
Finally, we can create a temporary view and run SQL queries on it; the query shown in the example above returns only the row(s) where fName is 'Amaya'.
Conclusion:
In this tutorial, you have learned how to create a Spark DataFrame from an HBase table using the Hortonworks DataSource API and have also seen how to run DSL and SQL queries on the HBase DataFrame.
What is ORC?
ORC advantages
Write Spark DataFrame to ORC file
Read ORC file into Spark DataFrame
Creating a table on ORC file & using SQL
Using Partition
Which compression to choose
What is the ORC file?
ORC stands for Optimized Row Columnar, which provides a highly efficient way to store data in a self-describing, type-aware, column-oriented format for the Hadoop ecosystem. This is similar to other columnar storage formats Hadoop supports, such as RCFile and Parquet.
The ORC file format is heavily used as storage for Apache Hive due to its highly efficient way of storing data, which enables high-speed processing, and ORC is also used or natively supported by many frameworks like Hadoop MapReduce, Apache Spark, Pig, NiFi, and many more.
ORC Advantages
Compression: ORC stores data as columns and in compressed format, hence it takes far less disk storage than other formats.
Reduces I/O: ORC reads only the columns that are mentioned in a query for processing, hence it reduces I/O.
Fast reads: ORC is used for high-speed processing as it creates built-in indexes by default and stores basic statistics like min/max values for numeric data.
ORC Compression
Spark supports the following compression options for ORC data source. By default, it
uses SNAPPY when not specified.
SNAPPY
ZLIB
LZO
NONE
Create a DataFrame
Spark by default supports ORC file formats without importing third-party ORC dependencies. Since we don’t have an ORC file to read, first we will create an ORC file from a DataFrame. Below is the sample DataFrame we use to create an ORC file.
val data =Seq(("James ","","Smith","36636","M",3000),
("Michael ","Rose","","40288","M",4000),
("Robert ","","Williams","42114","M",4000),
("Maria ","Anne","Jones","39192","F",4000),
("Jen","Mary","Brown","","F",-1))
val columns=Seq("firstname","middlename","lastname","dob","gender","salary")
val df=spark.createDataFrame(data).toDF(columns:_*)
df.printSchema()
df.show(false)
Spark by default uses snappy compression while writing ORC files. You can notice this in the part file names, and you can change the compression from the default snappy to either none or zlib using the compression option.
df.write.mode("overwrite")
.option("compression","zlib")
.orc("/tmp/orc/data-zlib.orc")
This creates ORC files with zlib compression.
Using the append save mode, you can append a DataFrame to an existing ORC file; to overwrite, use the overwrite save mode.
df.write.mode("append").orc("/tmp/orc/people.orc")
df.write.mode("overwrite").orc("/tmp/orc/people.orc")
Use Spark DataFrameReader’s orc() method to read an ORC file into a DataFrame. This supports reading snappy, zlib, or uncompressed files; it is not necessary to specify the compression option while reading an ORC file.
val df = spark.read.orc("/tmp/orc/data.orc")
In order to read ORC files from Amazon S3, use the below prefixes in the path along with third-party dependencies and credentials.
s3:// => first generation
s3n:// => second generation
s3a:// => third generation
Here, we create a temporary view PERSON from the ORC “data” file. This gives the following results.
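The view and query that produce this output would look roughly like the below (a sketch; the view name PERSON is the one mentioned above):
val personDF = spark.read.orc("/tmp/orc/data.orc") // variable name assumed
personDF.createOrReplaceTempView("PERSON")
spark.sql("select * from PERSON").show(false)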
+---------+----------+--------+-----+------+------+
|firstname|middlename|lastname|  dob|gender|salary|
+---------+----------+--------+-----+------+------+
|   James |          |   Smith|36636|     M|  3000|
| Michael |      Rose|        |40288|     M|  4000|
|  Robert |          |Williams|42114|     M|  4000|
|   Maria |      Anne|   Jones|39192|     F|  4000|
|      Jen|      Mary|   Brown|     |     F|    -1|
+---------+----------+--------+-----+------+------+
Using Partition
When we execute a particular query on the PERSON table, it scans through all the rows and returns the selected columns in the results. In Spark, we can improve query execution in an optimized way by partitioning the data using the partitionBy() method. Following is an example of partitionBy().
df.write.partitionBy("gender","salary")
.mode("overwrite").orc("/tmp/orc/data.orc")
When you check the data.orc folder, it has two levels of partition folders, “gender” followed by “salary”, inside.
Reading a specific Partition
The example below explains reading a partitioned ORC file into a DataFrame, selecting only the partition with gender=M.
val parDF=spark.read.orc("/tmp/orc/data.orc/gender=M")
parDF.show(false)
import org.apache.spark.sql.SparkSession
val spark: SparkSession = SparkSession.builder()
  .master("local[1]")
  .appName("SparkByExamples.com")
  .getOrCreate()
// df is the DataFrame created from the sample data shown above
df.write.mode("overwrite")
.orc("/tmp/orc/data.orc")
df.write.mode("overwrite")
.option("compression","none12")
.orc("/tmp/orc/data-nocomp.orc")
df.write.mode("overwrite")
.option("compression","zlib")
.orc("/tmp/orc/data-zlib.orc")
val df2=spark.read.orc("/tmp/orc/data.orc")
df2.show(false)
df2.createOrReplaceTempView("ORCTable")
val orcSQL = spark.sql("select firstname,dob from ORCTable where salary >= 4000 ")
orcSQL.show(false)
Conclusion
In summary, ORC is a highly efficient, compressed, columnar format that is capable of storing petabytes of data without compromising fast reads. Spark natively supports the ORC data source to read and write ORC files using the orc() method on DataFrameReader and DataFrameWriter.
Spark 3.0’s binaryFile data source reads each binary file into a single DataFrame row with the below schema; the content column holds the raw bytes.
root
 |-- path: string (nullable = true)
 |-- modificationTime: timestamp (nullable = true)
 |-- length: long (nullable = true)
 |-- content: binary (nullable = true)
For example, the following code reads all PNG files from the path with any partitioned directories.
// Reading Binary File Options
val df = spark.read.format("binaryFile")
.option("pathGlobFilter", "*.png")
.load("/tmp/binary/")
recursiveFileLookup: Ignores partition discovery and recursively searches for files under the input directory path.
val df = spark.read.format("binaryFile")
.option("pathGlobFilter", "*.png")
.option("recursiveFileLookup", "true")
.load("/tmp/binary/")
5. Few things to note
While using the binaryFile data source, if you pass a text file to the load() method, it reads the contents of the text file as binary into the DataFrame.
A binary() method on DataFrameReader is still not available; hence, you can’t use spark.read.binary("path") yet. I will update this article when it becomes available.
Currently, the binary file data source does not support writing a DataFrame back to the binary file format.
Conclusion
In summary, Spark 3.0 provides a binaryFile data source to read binary files into a DataFrame, but it does not support writing the DataFrame back to a binary file. It also has the pathGlobFilter option to load only files matching a pattern while preserving partition discovery, and the recursiveFileLookup option to recursively load files from subdirectories while ignoring partition discovery.
PySparkhttps://www.google.com/imgres?q=pyspark&imgurl=https%3A%2F
%2Fwww.freecodecamp.org%2Fnews%2Fcontent%2Fimages
%2F2024%2F06%2Fpyspark.jpg&imgrefurl=https%3A%2F%2Fptop.only.wip.la%3A443%2Fhttps%2Fwww.freecodecamp.org
%2Fnews%2Fpyspark-for-beginners
%2F&docid=HkSERTuznZ09LM&tbnid=qKLrsNgtBTqfBM&vet=12ahUKEwjriIjs5ZmJAxX
5S2wGHXraBvcQM3oECBgQAA..i&w=800&h=451&hcb=2&ved=2ahUKEwjriIjs5ZmJAx
X5S2wGHXraBvcQM3oECBgQAA
232